

# Computer Organization and Assembly Language CS / EE 320 Spring 2025

Lecture 1
Shahid Masud



# Difference between Computer Organization and Computer Architecture

<u>Architecture:</u> View from outside, high level specs (Programs looking towards CPU Instructions)

<u>Organization:</u> View from inside (CPU looking outwards towards buses, memory and peripherals)



Performance Graphs from the report of National Academy, USA

'The Future of Computing Performance: Game Over or Next Level?'

Ed. Samuel H Fuller





FIGURE 3.3 Microprocessor-clock frequency (MHz) over time (1985-2010).





FIGURE 3.1 Microprocessor power dissipation (watts) over time (1985-2010).





FIGURE 2.1 Transistors, frequency, power, performance, and cores over time





FIGURE A.1 Integer application performance (SPECint2000) over time (1985-2010).





FIGURE A.2 Floating-point application performance (SPECfp2000) over time (1985-2010).


# **Dennard Scaling**

Robert Dennard and colleagues described in 1974 a scaling methodology for metal-oxide-semiconductor field-effect transistors (MOSFETs) that would deliver consistent improvements in transistor area, performance, and power reduction. The methodology called for the scaling of transistor gate length, gate width, gate oxide thickness, and supply voltage all by the same scale factor, and increasing channel doping by the inverse of the same scale factor (see Figure 1). The result would be transistors with smaller area, higher drive current (higher performance), and lower parasitic capacitance (lower active power). This method for scaling MOSFET transistors is generally referred to as "classic" or "traditional" scaling and was very successfully used by the industry up until the 130-nm generation in the early 2000s.



Figure 1. Traditional MOSFET scaling as described by Robert Dennard.
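The scaling rules above can be turned into a small first-order calculation. The sketch below is illustrative only: the function name `dennard_scaling`, the shrink factor `k` (greater than 1), and the textbook relations C ∝ area / oxide thickness, P ∝ C·V²·f and "gate delay shrinks by 1/k" are assumptions added here, not taken from the slide.

```python
def dennard_scaling(k: float) -> dict[str, float]:
    """Return first-order ratios (new / old) for a shrink factor k > 1.

    Assumed textbook relations, not taken from the slide:
      capacitance ~ area / oxide thickness
      power       ~ capacitance * voltage^2 * frequency
      frequency   ~ k (gate delay shrinks by 1/k)
    """
    dimension = 1.0 / k               # gate length, gate width, oxide thickness
    voltage = 1.0 / k                 # supply voltage scales with the dimensions
    doping = k                        # channel doping rises by the inverse factor
    area = dimension ** 2             # length x width
    capacitance = area / dimension    # area / oxide thickness
    frequency = k                     # assumed first-order speedup
    power = capacitance * voltage ** 2 * frequency
    return {
        "dimension": dimension,
        "voltage": voltage,
        "doping": doping,
        "area": area,
        "capacitance": capacitance,
        "frequency": frequency,
        "power": power,
        "power_density": power / area,
    }


if __name__ == "__main__":
    # A shrink factor of about 1.4 roughly halves the transistor area.
    for name, ratio in dennard_scaling(1.4).items():
        print(f"{name:>14}: {ratio:.3f}")
```

Under these assumptions the power per transistor falls as fast as the area, so power density stays constant. That is the property that was lost when supply voltage could no longer be scaled down.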

# Depiction of Dennard Scaling Effects



Figure 2. Sources of computing performance have been challenged by the end of Dennard scaling in 2004. All additional approaches to further performance improvements end in approximately 2025 due to the end of the roadmap for improvements to semiconductor lithography. Figure from Kunle Olukotun, Lance Hammond, Herb Sutter, Mark Horowitz and extended by John Shalf. (Online version in colour.)

# Future Improvements in Computer Performance



Figure panel label: The Bottom (for example, semiconductor technology)

**Performance gains after Moore's law ends.** In the post-Moore era, improvements in computing power will increasingly come from technologies at the "Top" of the computing stack, not from those at the "Bottom", reversing the historical trend.

Ref: Leiserson et al., Science 368, 1079 (2020) 5 June 2020

# Example of Performance Gains through Architecture

**Table 1. Speedups from performance engineering a program that multiplies two 4096-by-4096 matrices.** Each version represents a successive refinement of the original Python code. "Running time" is the running time of the version. "GFLOPS" is the billions of 64-bit floating-point operations per second that the version executes. "Absolute speedup" is time relative to Python, and "relative speedup," which we show with an additional digit of precision, is time relative to the preceding line. "Fraction of peak" is GFLOPS relative to the computer's peak 835 GFLOPS. See Methods for more details.

| Version | Implementation              | Running time (s) | GFLOPS  | Absolute speedup | Relative speedup | Fraction<br>of peak (%) |
|---------|-----------------------------|------------------|---------|------------------|------------------|-------------------------|
| 1       | Python                      | 25,552.48        | 0.005   | 1                | -                | 0.00                    |
| 2       | Java                        | 2,372.68         | 0.058   | 11               | 10.8             | 0.01                    |
| 3       | C                           | 542.67           | 0.253   | 47               | 4.4              | 0.03                    |
| 4       | Parallel loops              | 69.80            | 1.969   | 366              | 7.8              | 0.24                    |
| 5       | Parallel divide and conquer | 3.80             | 36.180  | 6,727            | 18.4             | 4.33                    |
| 6       | plus vectorization          | 1.10             | 124.914 | 23,224           | 3.5              | 14.96                   |
| 7       | plus AVX intrinsics         | 0.41             | 337.812 | 62,806           | 2.7              | 40.45                   |

AVX = Intel Advanced Vector Extension
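The derived columns of Table 1 can be approximately reproduced from the running times alone. The sketch below assumes that multiplying two n-by-n matrices costs 2n³ floating-point operations (n³ multiplications and n³ additions); this count is an assumption added here, although it agrees with the GFLOPS column. The names `summarize`, `RUNNING_TIMES` and `PEAK_GFLOPS` are illustrative.

```python
N = 4096
TOTAL_FLOPS = 2 * N ** 3   # assumed operation count for an n-by-n matrix multiply
PEAK_GFLOPS = 835.0        # peak of the machine, as stated in the table caption

# (implementation, running time in seconds), copied from Table 1
RUNNING_TIMES = [
    ("Python", 25552.48),
    ("Java", 2372.68),
    ("C", 542.67),
    ("Parallel loops", 69.80),
    ("Parallel divide and conquer", 3.80),
    ("plus vectorization", 1.10),
    ("plus AVX intrinsics", 0.41),
]


def summarize(times):
    """Compute GFLOPS, speedups and fraction of peak for each version."""
    rows = []
    baseline = times[0][1]   # the Python running time
    previous = None
    for name, seconds in times:
        gflops = TOTAL_FLOPS / seconds / 1e9
        rows.append({
            "name": name,
            "gflops": gflops,
            "absolute_speedup": baseline / seconds,
            "relative_speedup": None if previous is None else previous / seconds,
            "fraction_of_peak": 100.0 * gflops / PEAK_GFLOPS,
        })
        previous = seconds
    return rows


if __name__ == "__main__":
    for row in summarize(RUNNING_TIMES):
        rel = "-" if row["relative_speedup"] is None else f"{row['relative_speedup']:.1f}"
        print(f"{row['name']:<28} {row['gflops']:>9.3f} GFLOPS  "
              f"x{row['absolute_speedup']:>9.0f}  rel {rel:>5}  "
              f"{row['fraction_of_peak']:>6.2f}% of peak")
```

The last rows differ slightly from the published figures because the running times in the table are rounded to two decimals.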

# Look at some recent research papers for inspiration



### The 50 Year History of the Microprocessor as Five Technology Eras

John L. Hennessy, Stanford University, Stanford, CA, 94305, USA

organized into five eras, each distinguished by common trends in the evolution of microprocessors. Most of these eras are around ten years and represent a shift from the previous era. I have had the privilege of being involved in some way for roughly 48 of the 50 years, so this is also a somewhat personal

This decade, which began with the birth of the Intel 4004 in 1971, was dominated by three trends:

- cessors to go from 4 to 8 to 16 and eventually 32
- 2. a rapid increase in instruction sets, often motivated by assembly language examples and enabled by microcode implementations (see the early Intel and Zilog microprocessors);
- 3. a rapid increase in clock speeds enabled by faster transistors.

The emergence of the personal computer (Apple II in 1977 and IBM PC in 1981) enabled the shrinkwrap software industry and reinforced the importance of object code compatibility, which the first microprocessors did not exhibit. The Motorola 68000, which appeared in production in the late 1970s, was the first 32-bit microprocessor and offered many of the features associated with minicomputers.

As Moore's law progressed, it seemed likely that puters. The accompanying growth in DRAM capacity (from 1 Kib in 1971 to 16 Kib in 1981) reduced the need to hand-optimize code and program in assembly language. The subsequent movement to higher level languages inspired the groups at IBM, Berkeley, and Stanford to explore what became the RISC ideas. In addition to targeting compiler output, rather than handcrafted assembly language, they also emphasized elimination of microcode and compilation to an efficient hardware implementation.

Digital Object Identifier 10.1109/MM.2021.3112301. Date of current version 19 November 2021.

The RISC ideas led to an explosion in the use of pipelining in microprocessors, which generated a rapid increase in clock rate performance. This era was characterized by incredible annual performance growth of approximately 1.5 times, enabled by the inclusion of caches and much faster clocks.

The capstone events in this era were the introduction of 64-bit processors (the R4000, followed by the DEC Alpha, and others) and a growth in pipeline depth from 5 to 7–10 or more stages, which led to clock rates rivaling ECL mainframes.
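To get a feel for what annual performance growth of approximately 1.5 times means, the growth can be compounded over several years. This is a minimal sketch; the function names `compound_growth` and `doubling_time_years` are illustrative and not from the paper.

```python
import math


def compound_growth(annual_factor: float, years: int) -> float:
    """Total performance multiplier after compounding for the given years."""
    return annual_factor ** years


def doubling_time_years(annual_factor: float) -> float:
    """Years needed for performance to double at a constant annual factor."""
    return math.log(2) / math.log(annual_factor)


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(f"{years:>2} years at 1.5x per year: x{compound_growth(1.5, years):.1f}")
    print(f"doubling time: {doubling_time_years(1.5):.2f} years")
```

At this rate a decade of growth compounds to a factor of more than fifty.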

The third era is characterized by an intensive focus on exploiting instruction-level parallelism and trying to reduce the clock cycles per instruction to less than 1. The ILP-intensive processors fall into two broad categories: superscalar and very long instruction word (VLIW). The superscalar processors used a combination of hardware techniques and software scheduling to issue more than one instruction per clock, while the VLIW approach relied on little hardware support and intensive compiler scheduling to organize independent operations into issue packets. The VLIW approach did not succeed in the end, due to a variety of factors, most importantly the inability to achieve high performance on less structured integer programs.

The initial superscalar processors (such as the MIPS R8000, PowerPC 604, and later Intel Pentium) used static scheduling, meaning that instruction issue was blocked if the next instruction's operands were not yet available. This approach was rapidly followed by a shift to dynamic scheduling (allowing instructions to be executed out of order) and speculation (allowing instructions to be executed before a preceding branch

IEEE Micro

REVIEW SUMMARY

COMPUTER SCIENCE

There's plenty of room at the Top: What will drive computer performance after Moore's law?

Charles E. Leiserson, Neil C. Thompson, Joel S. Emer, Bradley C. Kuszmaul, Butler W. Lampson, Daniel Sanchez, Tao B. Schardl

[Screenshot of the first page of the review summary; body text not recoverable]

Innovations like domain-specific hardware, enhanced security, open instruction sets, and agile chip development will lead the way.

BY JOHN L. HENNESSY AND DAVID A. PATTERSON

### A New Golden Age for Computer Architecture

WE BEGAN OUR Turing Lecture June 4, 2018, with a review of computer architecture since the 1960s. In addition to that review, here, we highlight current challenges and identify future opportunities, projecting another golden age for the field of computer architecture in the next decade, much like the 1980s when we did the research that led to our award, delivering gains in cost, energy, and security, as well as performance.

"Those who cannot remember the past are condemned to repeat it." -George Santayana, 1905

Software talks to hardware through a vocabulary called an instruction set architecture (ISA). By the early 1960s, IBM had four incompatible lines of computers, each with its own ISA, software stack, I/O system, and market niche, targeting small business, large business, scientific, and real time, respectively. IBM engineers, including ACM A.M. Turing Award laureate Fred Brooks, Jr., thought they could create a single ISA that would efficiently unify all four of these ISA bases.

They needed a technical solution for how computers as inexpensive as

Elevating the hardware/softwa

48 COMMUNICATIONS OF THE ACM | FEBRUARY 2019 | VOL. 62 | NO.

# Processor Driven vs Memory Driven Computing



FIGURE 7: Memory-driven computing requires flexible clocking [59]. DSP: digital signal processor. ASIC: application-specific integrated circuit; RISC-V: reduced instruction set computing.



# **Topics**

- 1. Discussion on Moore's Law, through paper 'MORE THAN MOORE' by M. Mitchell Waldrop, published in Nature, February 2016
- 2. Computer Performance Graphs from National Academy, USA, report
- 3. Introduction to the specialization area of 'Computer Architecture' through paper by Hennessy and Patterson, 'A New Golden Age for Computer Architecture', published in Communications of the ACM, February 2019, pages 48 to 60.

  Important directions for computers of the future:

  - a. Moore's Law failing to keep up
  - b. Domain Specific Architectures
  - c. Enhanced security features inside microprocessors
  - d. Open Instruction Sets and Extension of Instruction Sets through Customized Accelerators
  - e. Agile Hardware Design for Microprocessors
  - f. Combination of Domain Specific Languages and Domain Specific Architectures
- 4. Intel Processor Timeline
- 5. Reviewed course outline, including detailed topics and grading breakdown of lectures and labs.



## Video of Turing Lecture by Patterson, 2019

<u>David Patterson - A New Golden Age for Computer Architecture: History, Challenges and Opportunities – YouTube</u>

