

#### Module 1

# Introduction to Computer Architecture Part 1

#### Course Goals – View of the Machine

- In this class we will focus on the Assembly/Machine Language Level down to the Digital Logic Level
- The decisions we take at each level both drive the complexity and scale the performance.
- We will study the history of past architectures, and look at the trends where research is taking us.

#### **Levels of the Virtual Machine**

**Application Level** 

High Level Language Level

Assembly/Machine Language Level

System Architecture Level

**Digital Logic Level** 

**Electronic Design Level** 

Semiconductor Physics Level

# Typical Computer Architecture Prerequisites

- Experience with a high level language
  - Java
  - o C, etc.
- Assembly language programming
- Digital logic circuits
- Many students have also taken VHDL or Verilog Programing.
  - Module 2 will provide an introduction to VHDL and Verilog
- The class will also use a FPGA prototype board for discussion topics and projects
  - We will examine the systems level concepts, and then dive into the hardware description.

#### Course Objectives

- Understand the systems engineering of computer architecture
- Provide the tools that allows the student to transition into computer design
- Develop a reference framework that supports the students investigation into new architectures
- Provide the reference framework that supports the student creating an architecture in hardware and software

## Course Organization

- On line classes are typically Lectures, Homework and discussion forum.
  - Twice a semester there are tests
- This class will have Lectures, Homework and Discussion Forum
  - No Tests (The homework problems will be the primary material review)
  - One Project Paper and presentation
- The discussion form and some of the homeworks will focus on emulating the MIPS processor on the Nexsys 4 DDR FPGA board
- The Yamin Li text provides example MIPS processors coded in Verilog
- For many of the discussion forum, Labs are provided. The class is expected to work as a team to solve the labs.
- Hardware and software will be provided to each student.

#### Course Outline

- 1: Introduction to Computer Architecture
- 2: Digital Logic Design, Introduction to VHDL, Verilog and System Verilog
- 3: Computer Arithmetic
- 4: Advanced Computer Arithmetic
- 5: The Processor (CPU and Control Unit) Single Cycle
- 6: MultiCycle Processors
- 7: Pipeline Processing
- 8: More Pipeline Processing, Hazard Detection
- 9: Memory Subsystems
- 10: More Memory Subsystems
- 11: Computer I/O
- 12: Parallel Processing
- 13: Computer Networks Clusters and Multiprocessing
- 14: Project Presentations



#### **Understanding Performance**

- Algorithm
  - Determines number of operations executed
- Programming language, compiler, architecture
  - Determine number of machine instructions executed per operation
- Processor and memory system
  - Determine how fast instructions are executed
- I/O system (including OS)
  - Determines how fast I/O operations are executed



#### Eight Great Ideas In Computer Architecture

Design for Moore's Law



Use abstraction to simplify design



Make the common case fast



• Performance via parallelism



• Performance via pipelining



• Perioriiance via prediction



• *Hierarchy* of memories



• **Dependability** via redundancy





- Application software
  - Written in high-level language
- System software
  - Compiler: translates HLL code to machine code
  - Operating System: service code
    - Handling input/output
    - Managing memory and storage
    - Scheduling tasks & sharing resources
- Hardware
  - Processor, memory, I/O controllers



#### Levels of Program Code

- High-level language
  - Level of abstraction closer to problem domain
  - Provides for productivity and portability
- Assembly language
  - Textual representation of instructions
- Hardware representation
  - Binary digits (bits)
  - Encoded instructions and data

High-level language program (in C)

Assembly language program (for MIPS)

swap(int v[], int k)
{int temp;
 temp = v[k];
 v[k] = v[k+1];
 v[k+1] = temp;
}



swap:

muli \$2, \$5,4
add \$2, \$4,\$2
lw \$15, 0(\$2)
lw \$16, 4(\$2)
sw \$16, 0(\$2)
sw \$15, 4(\$2)
ir \$31



Binary machine language program (for MIPS) 

# A common Thread – Computer Components

- Computer Architectures have common components, with unique data bus:
  - Regardless of scale, the computers have input, output, memory, datapath and control.
- Speed is accomplished using parallel operations, wide buses, and load balancing between resources
  - High speed was also accomplished by making the most frequent instruction very fast, and by limiting the number of instructions offered to keep the hardware optimal.



#### Touchscreen

- PostPC device
- Supersedes keyboard and mouse
- Resistive and Capacitive types
  - Most tablets, smart phones use capacitive
  - Capacitive allows multiple touches simultaneously





### Through the Looking Glass

- LCD screen: picture elements (pixels)
  - Mirrors content of frame buffer memory













#### Inside the Processor (CPU)

- Datapath: performs operations on data
- Control: sequences datapath, memory, ...
- Cache memory
  - Small fast SRAM memory for immediate access to data



#### Inside the Processor

Apple A5





#### Update – Ipad Pro 12.9 inch







 In addition to 5K retina display, the Ipad Pro also uses the Apple APL 1021 A9X 64bit Processor, and a NXP Semiconductor LPC11U37 ARM Coretex-MO Microcontroller. This unit also had 4Gbytes RAM, and 32GB NAND Flash.

#### A Safe Place for Data

- Volatile main memory
  - Loses instructions and data when power off
- Non-volatile secondary memory
  - Magnetic disk
  - Flash memory
  - Optical disk (CDROM, DVD)











#### **Networks**

- Communication, resource sharing, nonlocal access
- Local area network (LAN): Ethernet
- Wide area network (WAN): the Internet
- Wireless network: WiFi, Bluetooth







**Technology Trends** 

- Electronics technology continues to evolve
  - Increased capacity and performance
  - Reduced cost



**DRAM** capacity

| Year | Technology                 | Relative performance/cost |  |
|------|----------------------------|---------------------------|--|
| 1951 | Vacuum tube                | 1                         |  |
| 1965 | Transistor                 | 35                        |  |
| 1975 | Integrated circuit (IC)    | 900                       |  |
| 1995 | Very large scale IC (VLSI) | 2,400,000                 |  |
| 2013 | Ultra large scale IC       | 250,000,000,000           |  |



#### Semiconductor Technology

- Silicon: semiconductor
- Add materials to transform properties:
  - Conductors
  - Insulators
  - Switch

#### Manufacturing ICs



Yield: proportion of working dies per wafer



#### Intel Core i7 Wafer



- 300mm wafer, 280 chips, 32nm technology
- Each chip is 20.7 x 10.5 mm



## Integrated Circuit Cost

```
Cost per die = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}
Dies per wafer ≈ Wafer area/Die area
Yield = \frac{1}{(1+(Defects per area \times Die area/2))^2}
```

- Nonlinear relation to area and defect rate
  - Wafer cost and area are fixed
  - Defect rate determined by manufacturing process
  - Die area determined by architecture and circuit design



#### Intermission

## Computer Architecture Performance Measurement

### Defining Performance

Which airplane has the best performance?











#### Response Time and Throughput

- Response time
  - How long it takes to do a task
- Throughput
  - Total work done per unit time
    - e.g., tasks/transactions/... per hour
- How are response time and throughput affected by
  - Replacing the processor with a faster version?
  - o Adding more processors?
- We'll focus on response time for now...



#### Relative Performance

- Define Performance = 1/Execution Time
- "X is n time faster than Y"

Performance<sub>x</sub>/Performance<sub>y</sub>

- = Execution time $_{Y}$ /Execution time $_{X} = n$
- Example: time taken to run a program
  - 10s on A, 15s on B
  - Execution Time<sub>B</sub> / Execution Time<sub>A</sub>
     = 15s / 10s = 1.5
  - So A is 1.5 times faster than B



#### Measuring Execution Time

- Elapsed time
  - Total response time, including all aspects
    - o Processing, I/O, OS overhead, idle time
  - Determines system performance
- CPU time
  - Time spent processing a given job
    - o Discounts I/O time, other jobs' shares
  - Comprises user CPU time and system CPU time
  - Different programs are affected differently by CPU and system performance



 Operation of digital hardware governed by a constant-rate clock



- Clock period: duration of a clock cycle
  - e.g.,  $250ps = 0.25ns = 250 \times 10^{-12}s$
- Clock frequency (rate): cycles per second
  - e.g., 4.0GHz = 4000MHz =  $4.0 \times 10^9$ Hz



#### **CPU Time**

```
CPU Time = CPU Clock Cycles × Clock Cycle Time

= CPU Clock Cycles

Clock Rate
```

- Performance improved by
  - Reducing number of clock cycles
  - Increasing clock rate
  - Hardware designer must often trade off clock rate against cycle count

ep.jhu.edu 3′

#### **CPU Time Example**

- Computer A: 2GHz clock, 10s CPU time
- Designing Computer B
  - Aim for 6s CPU time
  - Can do faster clock, but causes 1.2 × clock cycles
- How fast must Computer B clock be?

Clock Rate<sub>B</sub> = 
$$\frac{\text{Clock Cycles}_{\text{B}}}{\text{CPU Time}_{\text{B}}} = \frac{1.2 \times \text{Clock Cycles}_{\text{A}}}{6\text{s}}$$

Clock Cycles<sub>A</sub> = CPU Time<sub>A</sub> × Clock Rate<sub>A</sub>

$$= 10\text{s} \times 2\text{GHz} = 20 \times 10^{9}$$

Clock Rate<sub>B</sub> =  $\frac{1.2 \times 20 \times 10^{9}}{6\text{s}} = \frac{24 \times 10^{9}}{6\text{s}} = 4\text{GHz}$ 



#### Instruction Count and CPI

Clock Cycles = Instruction Count × Cycles per Instruction

CPU Time = Instruction Count × CPI × Clock Cycle Time

Instruction Count × CPI

Clock Rate

- Instruction Count for a program
  - Determined by program, ISA and compiler
- Average cycles per instruction
  - Determined by CPU hardware
  - If different instructions have different CPI
    - Average CPI affected by instruction mix



- Computer A: Cycle Time = 250ps, CPI = 2.0
- Computer B: Cycle Time = 500ps, CPI = 1.2
- Same ISA
- Which is faster, and by how much?

$$\begin{aligned} \text{CPU Time}_{A} &= \text{Instruction Count} \times \text{CPI}_{A} \times \text{Cycle Time}_{A} \\ &= \text{I} \times 2.0 \times 250 \text{ps} = \text{I} \times 500 \text{ps} & \text{A is faster...} \end{aligned}$$

$$\begin{aligned} \text{CPU Time}_{B} &= \text{Instruction Count} \times \text{CPI}_{B} \times \text{Cycle Time}_{B} \\ &= \text{I} \times 1.2 \times 500 \text{ps} = \text{I} \times 600 \text{ps} \end{aligned}$$

$$\begin{aligned} &\frac{\text{CPU Time}_{B}}{\text{CPU Time}_{A}} &= \frac{\text{I} \times 600 \text{ps}}{\text{I} \times 500 \text{ps}} = 1.2 & \text{...by this much} \end{aligned}$$



#### **CPI in More Detail**

 If different instruction classes take different numbers of cycles

$$Clock \ Cycles = \sum_{i=1}^{n} (CPI_{i} \times Instruction \ Count_{i})$$

Weighted average CPI

$$CPI = \frac{Clock \ Cycles}{Instruction \ Count} = \sum_{i=1}^{n} \left( CPI_i \times \frac{Instruction \ Count_i}{Instruction \ Count} \right)$$

Relative frequency



#### CPI Example

 Alternative compiled code sequences using instructions in classes A, B, C

| Class            | А | В | С |
|------------------|---|---|---|
| CPI for class    | 1 | 2 | 3 |
| IC in sequence 1 | 2 | 1 | 2 |
| IC in sequence 2 | 4 | 1 | 1 |

- Sequence 1: IC = 5
  - Clock Cycles= 2×1 + 1×2 + 2×3= 10
  - Avg. CPI = 10/5 = 2.0

- Sequence 2: IC = 6
  - Clock Cycles= 4×1 + 1×2 + 1×3= 9
  - Avg. CPI = 9/6 = 1.5



# Performance Summary

$$CPU \, Time = \frac{Instructions}{Program} \times \frac{Clock \, cycles}{Instruction} \times \frac{Seconds}{Clock \, cycle}$$

- Performance depends on
  - Algorithm: affects IC, possibly CPI
  - Programming language: affects IC, CPI
  - Compiler: affects IC, CPI
  - Instruction set architecture: affects IC, CPI, T<sub>c</sub>

### **Power Trends**



In CMOS IC technology

Power = Capacitive load × Voltage<sup>2</sup> × Frequency

×30

×1000

# Reducing Power

- Suppose a new CPU has
  - 85% of capacitive load of old CPU
  - 15% voltage and 15% frequency reduction

$$\frac{P_{\text{new}}}{P_{\text{old}}} = \frac{C_{\text{old}} \times 0.85 \times (V_{\text{old}} \times 0.85)^2 \times F_{\text{old}} \times 0.85}{C_{\text{old}} \times V_{\text{old}}^2 \times F_{\text{old}}} = 0.85^4 = 0.52$$

- The power wall
  - We can't reduce voltage further
  - We can't remove more heat
- How else can we improve performance?







# Multiprocessors

- Multicore microprocessors
  - More than one processor per chip
- Requires explicitly parallel programming
  - Compare with instruction level parallelism
    - Hardware executes multiple instructions at once
    - Hidden from the programmer
  - Hard to do
    - Programming for performance
    - Load balancing
    - Optimizing communication and synchronization

## SPEC CPU Benchmark

- Programs used to measure performance
  - Supposedly typical of actual workload
- Standard Performance Evaluation Corp (SPEC)
  - Develops benchmarks for CPU, I/O, Web, ...
- SPEC CPU2006
  - Elapsed time to execute a selection of programs
    - Negligible I/O, so focuses on CPU performance
  - Normalize relative to reference machine
  - Summarize as geometric mean of performance ratios
    - CINT2006 (integer) and CFP2006 (floating-point)



### CINT2006 for Intel Core i7 920

| Description                       | Name       | Instruction<br>Count x 10 <sup>9</sup> | CPI  | Clock cycle time<br>(seconds x 10 <sup>-9</sup> ) | Execution<br>Time<br>(seconds) | Reference<br>Time<br>(seconds) | SPECratio |
|-----------------------------------|------------|----------------------------------------|------|---------------------------------------------------|--------------------------------|--------------------------------|-----------|
| Interpreted string processing     | perl       | 2252                                   | 0.60 | 0.376                                             | 508                            | 9770                           | 19.2      |
| Block-sorting compression         | bzip2      | 2390                                   | 0.70 | 0.376                                             | 629                            | 9650                           | 15.4      |
| GNU C compiler                    | gcc        | 794                                    | 1.20 | 0.376                                             | 358                            | 8050                           | 22.5      |
| Combinatorial optimization        | mcf        | 221                                    | 2.66 | 0.376                                             | 221                            | 9120                           | 41.2      |
| Go game (AI)                      | go         | 1274                                   | 1.10 | 0.376                                             | 527                            | 10490                          | 19.9      |
| Search gene sequence              | hmmer      | 2616                                   | 0.60 | 0.376                                             | 590                            | 9330                           | 15.8      |
| Chess game (AI)                   | sjeng      | 1948                                   | 0.80 | 0.376                                             | 586                            | 12100                          | 20.7      |
| Quantum computer simulation       | libquantum | 659                                    | 0.44 | 0.376                                             | 109                            | 20720                          | 190.0     |
| Video compression                 | h264avc    | 3793                                   | 0.50 | 0.376                                             | 713                            | 22130                          | 31.0      |
| Discrete event simulation library | omnetpp    | 367                                    | 2.10 | 0.376                                             | 290                            | 6250                           | 21.5      |
| Games/path finding                | astar      | 1250                                   | 1.00 | 0.376                                             | 470                            | 7020                           | 14.9      |
| XML parsing                       | xalancbmk  | 1045                                   | 0.70 | 0.376                                             | 275                            | 6900                           | 25.1      |
| Geometric mean                    | -          | _                                      | -    | _                                                 | -                              | -                              | 25.7      |



### SPEC Power Benchmark

- Power consumption of server at different workload levels
  - Performance: ssj\_ops/sec
  - Power: Watts (Joules/sec)

Overall ssj\_ops per Watt = 
$$\left(\sum_{i=0}^{10} ssj_ops_i\right) / \left(\sum_{i=0}^{10} power_i\right)$$

## SPECpower\_ssj2008 for Xeon X5650

| Target Load %                      | Performance<br>(ssj_ops) | Average Power<br>(Watts) |
|------------------------------------|--------------------------|--------------------------|
| 100%                               | 865,618                  | 258                      |
| 90%                                | 786,688                  | 242                      |
| 80%                                | 698,051                  | 224                      |
| 70%                                | 607,826                  | 204                      |
| 60%                                | 521,391                  | 185                      |
| 50%                                | 436,757                  | 170                      |
| 40%                                | 345,919                  | 157                      |
| 30%                                | 262,071                  | 146                      |
| 20%                                | 176,061                  | 135                      |
| 10%                                | 86,784                   | 121                      |
| 0%                                 | 0                        | 80                       |
| Overall Sum                        | 4,787,166                | 1,922                    |
| $\Sigma$ ssj_ops/ $\Sigma$ power = |                          | 2,490                    |



### Pitfall: Amdahl's Law

 Improving an aspect of a computer and expecting a proportional improvement in overall performance

$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$

- Example: multiply accounts for 80s/100s
  - How much improvement in multiply performance to get 5× overall?

$$20 = \frac{80}{n} + 20$$
 • Can't be done!

Corollary: make the common case fast



- Look back at i7 power benchmark
  - At 100% load: 258W
  - At 50% load: 170W (66%)
  - At 10% load: 121W (47%)
- Google data center
  - Mostly operates at 10% 50% load
  - At 100% load less than 1% of the time
- Consider designing processors to make power proportional to load

### Pitfall: MIPS as a Performance Metric

- MIPS: Millions of Instructions Per Second
  - Doesn't account for
    - Differences in ISAs between computers
    - Differences in complexity between instructions

$$\begin{aligned} \text{MIPS} &= \frac{Instruction \, count}{Execution \, time \times 10^6} \\ &= \frac{Instruction \, count}{Instruction \, count \times CPI} = \frac{Clock \, rate}{CPI \times 10^6} \end{aligned}$$

CPI varies between programs on a given CPU

# Concluding Remarks

- Cost/performance is improving
  - Due to underlying technology development
- Hierarchical layers of abstraction
  - In both hardware and software
- Instruction set architecture
  - The hardware/software interface
- Execution time: the best performance measure
- Power is a limiting factor
  - Use parallelism to improve performance

## Scales, Units, and Conventions

| Term      | Normal Usage     | As a power of 2                     |
|-----------|------------------|-------------------------------------|
| K (kilo-) | 10 <sup>3</sup>  | 2 <sup>10</sup> = 1024              |
| M (mega-) | 10 <sup>6</sup>  | $2^{20} = 1,048,576$                |
| G (giga-) | 10 <sup>9</sup>  | 2 <sup>30</sup> = 1,073,741,824     |
| T (tera-) | 10 <sup>12</sup> | 2 <sup>40</sup> = 1,099,511,627,776 |

Note the differences between usages. You should commit the powers of 2 and 10 to memory.

| Term       | Usage             |
|------------|-------------------|
| m (milli-) | 10 <sup>-3</sup>  |
| ∞(micro-)  | 10 <sup>-6</sup>  |
| n (nano-)  | 10 <sup>-9</sup>  |
| p (pico-)  | 10 <sup>-12</sup> |

Powers of 2 are used to describe memory sizes.

Units: Bit (b), Byte (B), Nibble, Word (w), Double Word, Long Word Second (s), Hertz (Hz)



# Coming Up Next

- Part 2 Instruction Set Architecture
- Part 3 Tools