# 332 Advanced Computer Architecture Chapter 1

Introduction and review of Pipelines, Performance, and Caches

January 2019
Paul H J Kelly

These lecture notes are partly based on the course text, Hennessy and Patterson's Computer Architecture, a quantitative approach (6<sup>th</sup> ed), and on the lecture slides of David Patterson's Berkeley course (CS252)

### What is this course about?

- How the latest microprocessors work
- Why they are built that way and what are the alternatives?
- How you can make software that uses the hardware in the best possible way
- How you can make a compiler that does it for you
- How you can design a computer for your problem
- What does a big computer look like?
- What are the fundamental big ideas and challenges in computer architecture?
- What is the scope for theory?



Intel
Skylake
(this
laptop)
https://en.
wikichip.org
/wiki/intel/
microarchit
ectures/skyl
ake (client)



#### Sunway TaihuLight

http://www.net lib.org/utk/peo ple/JackDongarr a/PAPERS/sunw ay-report-2016.pdf



## This is a textbook-based course



Computer Architecture: A Quantitative Approach (6<sup>th</sup> Edition)

John L. Hennessy, David A. Patterson

- 936 pages. Morgan Kaufmann (2017)
- ISBN: 9780128119051
- Price: around £62 (shop around!)
- Publisher's companion web site:
  - <a href="https://www.elsevier.com/books-and-journals/book-companion/9780128119051">https://www.elsevier.com/books-and-journals/book-companion/9780128119051</a>
  - Textbook includes some vital introductory material as appendices:
  - Appendix C: tutorial on pipelining (read it NOW)
  - Appendix B: tutorial on memory hierarchy (read it NOW)
- Further appendices (some in book, some online) cover more advanced material (some very relevant to parts of the course), eg
  - Networks
  - Parallel applications
  - Embedded systems
  - Storage systems
  - VLIW
  - Computer arithmetic (esp floating point)
  - Historical perspectives

# Who are these guys anyway and why should I read their book?

- John Hennessy:
  - Founder, MIPS Computer Systems
  - President (2000-2016),
     Stanford University
  - Board member, Cisco, chair of Alphabet Inc (parent company of Google)
  - The "godfather of Silicon Valley" (Wikipedia)

#### David Patterson

- Leader, Berkeley RISC project
- RAID (redundant arrays of inexpensive disks)
- Professor, University of California, Berkeley
- President of ACM 2004-6
- Served on Information Technology Advisory Committee to the US





By Peg Skorpinski - Subject of pictures emailed it upon request, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=3207893



RISC.

RAID-I (1989)
consisted of a Sun
4/280 workstation
with 128 MB of
DRAM, four dualstring SCSI
controllers, 28 5.25inch SCSI disks
and specialized
disk striping
software.



RISC-I (1982) Contains 44,420 transistors, fabbed in 5 micron NMOS, with a die area of 77 mm², ran at 1 MHz. This chip is probably the first VLSI

#### Joint winners of the 2017 ACM Turing Award

"For pioneering a systematic, quantitative approach to the design and evaluation of computer architectures with enduring impact on the microprocessor industry"

## Course organisation

- Lecturer:
  - Paul Kelly Leader, Software Performance Optimisation research group
    - With help from Jessica Vandebon and Philippos Papaphilippou
- Two two-hour slots per week
- We will use the time flexibly with some sessions in the labs
- In the last couple of lectures we will spend some time on exam preparation
- Assessment:
  - Exam
    - The exam will take place in last week of term
    - The goal of the course is to teach you how to think about computer architecture
    - The exam usually includes some architectural ideas not presented in the lectures

#### Coursework

- You will be assigned two substantial, laboratory-based exercises
- You will learn about performance tuning for computationally-intensive kernels
- You will learn about using simulators, and experimentally evaluating hypotheses to understand system performance
- You are welcome to bring laptops to class to get started and get help during tutorials (we will go to the DoC labs when necessary)
- Please do not use computers for anything else during classes

- Ch1
  - Review of pipelined, in-order processor architecture and simple cache structures
- Ch2
  - Caches in more depth
  - Software techniques to improve cache performance
  - Virtual memory
- Ch3
  - Dynamic scheduling, out-of-order
  - Register renaming
  - Speculative execution
- Ch4
  - Branch prediction
- Ch5
  - Compiler issues
  - Compiler techniques loop nest transformations
  - Loop parallelisation, interchange, tiling/blocking, skewing

#### Ch6

#### **Course overview (plan)**

- Static instruction scheduling
- Software pipelining
- instruction-set support for speculation and register renaming
- Multi-threading
- Ch7
  - Data-parallelism, SIMD and vector
  - Graphics processors and manycore
- Ch8
  - Shared-memory multiprocessors
  - Cache coherency
  - Large-scale cache-coherency; ccNUMA. COMA
- Lab-based coursework exercise:
  - Simulation study
  - "challenge"
  - Using performance analysis tools
- Exam:
  - Partially based on recent processor architecture article, which we will study in advance (see past papers)

### External Students – Registration for DoC Courses

- Apply at: https://dbc.doc.ic.ac.uk/externalreg/
- 2 Then,
  - Your department's endorser will approve/reject your application
- If approved,
  - DoC's External Student Liaison will approve/reject your application
- If approved (again!),
  - Students will get access to DoC resources (DoC account, CATE, . . . )
  - No access after a few days? Check status of approval and contact relevant person(s)

#### **Key Dates**

- Exams for DoC  $3^{rd}/4^{th}$  yr. Spring courses take place at the end of the Term in week 11
- Registration for credit opens end January and closes mid February for Spring courses

If in doubt, read the guidelines available at the link above ©



Co-inventor of, and chief engineer on, the ENIAC, arguably the first general-purpose computer (first operational Feb 14<sup>th</sup> 1946)

27 tonnes, 150KW, 5000 cycles/sec



J.G. Brainerd & T.K. Sharpless. "The ENIAC." pp 163-172 Electrical Engineering, Feb 1948.

## ENIAC: "setting up the machine"



ENIAC was designed to be set up manually by plugging arithmetic units together (reconfigurable logic)

- You could plug together quite complex configurations
- Parallel with multiple units working at the same time





Jean Jennings (left), Marlyn Wescoff (center), and Ruth Lichterman program ENIAC <a href="https://imgur.com/gallery/nh38c">https://imgur.com/gallery/nh38c</a> and <a href="https://imgur.com/gallery/nh38c">https://imgur.com/gallery/nh38c</a> a

#### A PARALLEL CHANNEL COMPUTING MACHINE

J. P. Eckert, Jr.
Electronic Control Company

... Again I wish to reiterate the point that all the arguments for parallel operation are only valid provided one applies them to the steps which the built in or wired in programming of the machine operates. Any steps which are programmed by the operator, who sets up the machine, should be set up only in a serial fashion. It has been shown over and over again that any departure from this procedure results in a system which is much too complicated to use.

## ENIAC: "setting up the machine"



- •The "big idea": stored-program mode -
  - Plug the units together to build a machine that fetches instructions from memory - and executes them
  - So any calculation could be set up completely automatically – just choose the right sequence of instructions

We now formulate a set of motorictions to effect this 4-way decision between (x)-(S). We state again the contents of the short tanks already assigned:

11) Nnin 2,) wmin 3,) wx Fil Wym. 6,) Nm(-30) 7,) W (x(-30) 8,) W /A(-30) 5,) Nn.,, 9,) NIGO 10,) W/8(-20) 11,) ... + 8 Now Let the instructions occupy the (long tanh) words 1, 2, ... 1) T - 5 0) Nm'-m (-30) John von Neumann wrote his (-30) (1) N 18 (-30) (-30) (-30) 2,) 9, s 7, for n' = n first "program" in 1945  $3,) \sigma \rightarrow \overline{1},$ for n' = n It's clear he had the stored 4) 7, - 5, program idea in mind 0) N 18 (-30) S,) TO, s 8, for m' = m It was a couple of years before a 6,) 0 + 13, for n' = m machine to do it was actually 7,) 2,- 6, 0) Wm'-ne (-30) built 8,) 13, = 12, (a) N - (1) -for m' ? m Knuth, D. E. 1970. Von O) N 10 1/4 (-30) for mem, n'= n Neumann's First me my me a m Computer Program. ACM  $m' \ll m, m' \ll m$ Comput. Surv. 2, 4 (Dec. i.e for (8)(4), respectively. 9,) O → II, 1970), 247-260.

11,) k,/p/0,/2 -> C For (4), (3), (1)(0), respectively

ひし ボー・タ

1917/49 Kilbrum Highest Factor Routine (anence)-

| l         |                   |                   |                 |                             |               |        |                          |
|-----------|-------------------|-------------------|-----------------|-----------------------------|---------------|--------|--------------------------|
| boution.  | <u>_</u>          | 2.5!              | 267             | 27                          | 1             | 012344 | 13 14 6                  |
| -24 € C,  | -6.               | - 1               | <u> </u>        |                             | ] / [         | 00011  | 010                      |
| € 26      | ١. ١              | ļļ                | ~ (r)           |                             | 12 ;          | 0(011  | 110                      |
| - 26 ts C | $ L_{i} $         |                   |                 | !                           | 3             | 01011  | 0,10                     |
| e = 27    |                   |                   | - 5             | 6.                          | 14            | 11011  | 1:0                      |
| -23 5C    | d                 | $	au_{n=1,\dots}$ | -6 <sub>0</sub> | $L_{A}$                     | 5             | 11101  | 0.0                      |
| duly 27   | $a \rightarrow b$ |                   |                 |                             | 6             | 11011  | 001                      |
| tet       |                   |                   |                 | i                           | <u>ا د ا</u>  |        | Ø                        |
| 2412056   |                   |                   | ] .             |                             | 3             | 00101  | 100                      |
| Seels. 26 | +4                |                   |                 |                             | 9             | 0 (0)1 | 001                      |
| c ts 25   |                   | T' x              | Į į             | 1 1                         | 10            | 10011  | 110                      |
| -72 C.C   |                   |                   | ( )<br>(        | 1 [                         | $\mathcal{H}$ | 100(1  | 010                      |
| _ 3et_    |                   | L                 |                 |                             | 12            |        | 0 (1                     |
| 1 top     | 0                 | 0                 | -60             | 6.                          | 13            |        | $\underline{\mathbf{u}}$ |
| ~=6 G C   | V.                | $\tau_{\rm e}$    | -62             | 6.                          | 14            | 01011  | 010                      |
| butt. 25  | (-7)              |                   |                 | [                           | 15            | 10/01  | 001                      |
| z t 27    |                   |                   | l               | (Free )                     | 16            | 11011  | 110                      |
| -27 C C   | $+G_{n+1}$        |                   | ا ما            |                             | [7]           | 11011  | 010                      |
| * E 26    |                   |                   | $-b_{n,a,c}$    | . !                         | 14,           |        | 110                      |
| 325 (18.  |                   | 4.                | -6-             | $\langle r_{i,+,+} \rangle$ | <u>: (q</u>   | 01101  | 000                      |
|           |                   |                   |                 |                             |               |        |                          |

 This is the first program to actually run!



ON 10100



Manchester Small-Scale Experimental Machine (SSEM), nicknamed Baby Ran its first program on 21 June 1948 – the first program ever!



Manchester Small-Scale Experimental Machine (SSEM), nicknamed Baby Rebuilt for the 60<sup>th</sup> anniversary, now in in the Museum of Science and Industry in Manchester



John von Neumann
http://en.wikipedia.org/wiki/John\_von\_Neumann

John Backus

"Can Programming be
Liberated from the von
Neumann Style?" (1979)

# The "von Neumann bottleneck"

#### The price to pay:

- Stored-program mode was serial – one instruction at a time
- How can we have our cake - and eat it?
  - Flexibility and ease of programming
  - Performance of parallelism

How to beat the "Turing Tax"



- Q: How many registers can we address?
- Q: What is the largest signed immediate operand for "subw r1,r2,X"?
- Q: What range of addresses can a conditional branch jump to?

#### A machine to execute these instructions

- To execute this instruction set we need a machine that fetches them and does what each instruction says
- A "universal" computing device a simple digital circuit that, with the right code, can compute *anything*
- Something like:

```
Instr = Mem[PC]; PC+=4;
rs1 = Reg[Instr.rs1];
rs2 = Reg[Instr.rs2];
imm = SignExtend(Instr.imm);
Operand1 = if(Instr.op==BRANCH) then PC else rs1;
Operand2 = if(immediateOperand(Instr.op)) then imm else rs2;
res = ALU(Instr.op, Operand1, Operand2);
switch(Instr.op) {
case BRANCH:
 if (rs1==0) then PC=PC+imm; continue;
case STORE:
 Mem[res] = rs1; continue;
case LOAD:
 Imd = Mem[res];
Reg[Instr.rd] = if (Instr.op==LOAD) then Imd else res;
```

## 5 Steps of MIPS Datapath





5-stage MIPS pipeline with pipeline buffers Instruction Instr. Decode **Execute Memory** Write Addr. Calc **Fetch** Access Reg. Fetch **Back Next PC** Next SEQ PC Next SEQ PC Zero1 RS1 Reg RS2 WB Data Sign Extend Imm RD RD RD Decode Opcode

- Data stationary control
  - Control signals are needed to configure the MUXes, ALU, read/write
  - Carried with the corresponding instruction along the pipeline



- Pipelining doesn't help latency of single instruction
  - it helps throughput of entire workload
- Pipeline rate limited by slowest pipeline stage
- Potential speedup = Number pipe stages
- Unbalanced lengths of pipe stages reduces speedup
- ◆ Time to "fill" pipeline and time to "drain" it reduces speedup
- Speedup comes from parallelism for free no new hardware
- Many pipelines are more complex Pentium 4 "Netburst" has 31 stages.



- Pipelining doesn't help latency of single instruction
  - it helps throughput of entire workload
- Pipeline rate limited by slowest pipeline stage
- Potential speedup = Number pipe stages
- Unbalanced lengths of pipe stages reduces speedup
- ◆ Time to "fill" pipeline and time to "drain" it reduces speedup
- Speedup comes from parallelism for free no new hardware
- ◆ Many pipelines are more complex Pentium 4 "Netburst" has 31 stages.

Figure 3.3 Page 133 CA-AQA 2e





ព្លម្ភ ៦ឆ្កេរខ្លួន ines are more complex eedup comes from parallelism - for free – no new hardware ne to "fill" pipeline and time to "drain" it reduces speedup balanced lengths of pipe stages reduces speedup tential speedup = Number pipe stages Pentium 4 "Netburst" has 31 s



ព្លម្ភ <u>២ ខ្ពៃ ខ្ព</u>ុំ ines are more complex -

eedup comes from parallelism - for free – no new hardware

Pentium 4 "Netburst" has 31 s

ne to "fill" pipeline and time to "drain" it reduces speedup

balanced lengths of pipe stages reduces speedup

tential speedup = Number pipe stages

## It's Not That Easy

- Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  - Structural hazards: the hardware cannot support this combination of instructions
  - <u>Data hazards</u>: Instruction depends on result of a prior instruction still in the pipeline
  - Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

## Structural Hazard: example – one RAM port

#### Time (clock cycles)



- Eg if there is only one memory for both instructions and data
- Two different stages may need access at same time
- Example: IBM/Sony/Toshiba Cell processor

Figure 3.6. Page 142 . CA:AQA 2e

## Structural Hazard: example – one RAM port

Time (clock cycles)



- Instr 3 cannot be loaded in cycle 4
- ID stage has nothing to do in cycle 5
- EX stage has nothing to do in cycle 6, etc. "Bubble" propagates

Figure 3.7. Page 143 . CA:AQA 2e

## Structural Hazard: example – one RAM port

Time (clock cycles)



- Instr 3 cannot be loaded in cycle 4
- ID stage has nothing to do in cycle 5
- EX stage has nothing to do in cycle 6, etc. "Bubble" propagates

Figure 3.7. Page 143 . CA:AQA 2e

## Data Hazard on R1

Time (clock cycles)

add r1,r2,r3 n S t sub r4,r1,r3 r. and r6,r1,r7 0 d r8,r1,r9 or e xor r10,r1,r11



## Forwarding to Avoid Data Hazard

Figure 3.10, Page 149, CA:AQA 2e

n

S

r.

0

d

e

r

#### Time (clock cycles)



## HW Change for Forwarding

Figure 3.20, Page 161, CA:AQA 2e

- Add forwarding ("bypass") paths
- Add multiplexors to select where ALU operand should come from
- Determine mux control in ID stage
- If source register is the target of an instrn that will not WB in time



# Forwarding builds the data flow graph

#### Time (clock cycles)



- Values are passed directly from the output of the ALU to the input
- Via forwarding wires
- That are dynamically configured by the instruction decoder/control
- (This gets much more exciting when we have multiple ALUs and multiple-issue)



## Data Hazard Even with Forwarding

Figure 3.12, Page 153, CA:AQA 2e

#### Time (clock cycles)



## Data Hazard Even with Forwarding

Figure 3.12, Page 153, CA:AQA 2e

Time (clock cycles)



EX stage waits in cycle 4 for operand Following instruction ("and") waits in ID stage Missed instruction issue opportunity...

### Software Scheduling to Avoid Load Hazards

#### Try producing fast code for

$$a = b + c;$$
  
 $d = e - f;$ 

assuming a, b, c, d ,e, and f in memory.



Show the stalls explicitly

### **Control Hazard on Branches**



If we're not smart we risk a three-cycle stall

#### Pipelined MIPS Datapath with early branch determination



- Add extra hardware to the decode stage, to determine branch direction and target earlier
- We still have a one-cycle delay we just have to fetch and start executing the next instruction
- If the branch is actually taken, block the MEM and WB stages and fetch the right instruction

#### Eliminating hazards with simultaneous multi-threading

- If we had no stalls we could finish one instruction every cycle
- If we had no hazards we could do without forwarding – and decode/control would be simpler too



Example:
PowerPC
processing
element (PPE)
in the Cell
Broadband
Engine (Sony
PlayStation 3)

- IF maintains two Program Counters
- Even cycle fetch from PC0
- Odd cycle fetch from PC1
- Thread 0 reads and writes thread-0 registers
- No register-to-register hazards between adjacent pipeline stages (cf "C-Slowing".....)

### So – how fast can this design go?

- A simple 5-stage pipeline can run at 5-9GHz
- Limited by critical path through slowest pipeline stage logic
- Tradeoff: do more per cycle? Or increase clock rate?
  - Or do more per cycle, in parallel...
- At 3GHz, clock period is 330 picoseconds.
  - The time light takes to go about four inches
  - About 10 gate delays
    - for example, the Cell BE is designed for 11 FO4 ("fan-out=4") gates per cycle: www.fe.infn.it/~belletti/articles/ISSCC2005-cell.pdf
    - Pipeline latches etc account for 3-5 FO4 delays leaving only 5-8 for actual work
- How can we build a RAM that can implement our MEM stage in 5-8 FO4 delays?

#### Life used to be so easy

#### **Processor-DRAM Memory Gap (latency)**



In 1980 a large RAM's access time was close to the CPU cycle time. 1980s machines had little or no need for cache. Life is no longer quite so simple.

Levels of the Memory Hierarchy



<sup>~</sup> Exponential increase in access latency, block size, capacity

### The Principle of Locality:

 Programs access a relatively small portion of the address space at any instant of time.

## Two Different Types of Locality:

- Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
- Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)
- Most modern architectures are heavily reliant (totally reliant?) on locality for speed

## 1 KB "Direct Mapped" Cache, 32B blocks

- For a 2<sup>N</sup> byte cache:
  - The uppermost (32 N) bits are always the Cache Tag
  - The lowest M bits are the Byte Select (Block Size =  $2^{M}$ )



## 1 KB "Direct Mapped" Cache, 32B blocks

- For a 2<sup>N</sup> byte cache:
  - The uppermost (32 N) bits are always the Cache Tag
  - The lowest M bits are the Byte Select (Block Size =  $2^{M}$ )



### 1 KB Direct Mapped Cache, 32B blocks



### Associativity conflicts in a direct-mapped cache

Consider a loop that repeatedly reads part of two different arrays:

```
int A[256];
int B[256];
int r = 0;
for (int i=0; i<10; ++i) {
  for (int j=0; j<64; ++j) {
    r += A[j] + B[j];
  }
}</pre>
```

For the accesses to A and B to be mostly cache hits, we need a cache big enough to hold 2x64 ints, ie 512B

Consider the 1KB direct-mapped cache on the previous slide - what might go wrong?

### Associativity conflicts in a direct-mapped cache

Consider a loop that repeatedly reads part of two different arrays:

```
int A[256];
int B[256];
int r = 0;
for (int i=0; i<10; ++i) {
  for (int j=0; j<64; ++j) {
    r += A[j] + B[j];
  }
}</pre>
```

For the accesses to A and B to be mostly cache hits, we need a cache big enough to hold 2x64 ints, ie 512B

Consider the 1KB direct-mapped cache on the previous slide - what might go wrong?



## Direct-mapped Cache - structure

- Capacity: C bytes (eg 1KB)
- Blocksize: B bytes (eg 32)
- Byte select bits: 0..log(B)-1 (eg 0..4)
- Number of blocks: C/B (eg 32)
- Address size: A (eg 32 bits)
- Cache index size: I=log(C/B) (eg log(32)=5)
- Tag size: A-I-log(B) (eg 32-5-5=22)



# Two-way Set Associative Cache

- N-way set associative: N entries for each Cache Index
  - N direct mapped caches operated in parallel (N typically 2 to 4)
- Example: Two-way set associative cache
  - Cache Index selects a "set" from the cache
  - The two tags in the set are compared in parallel
  - Data is selected based on the tag result



# Disadvantage of Set Associative Cache

- N-way Set Associative Cache v. Direct Mapped Cache:
  - N comparators vs. 1
  - Extra MUX delay for the data
  - Data comes AFTER Hit/Miss
- In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:
  - Possible to assume a hit and continue. Recover later if miss.



#### Example: Intel Pentium 4 Level-1 cache (pre-Prescott)

- Capacity: 8K bytes (total amount of data cache can store)
- Block: 64 bytes (so there are 8K/64=128 blocks in the cache)
- Ways: 4 (addresses with same index bits can be placed in one of 4 ways)
- Sets: 32 (=128/4, that is each RAM array holds 32 blocks)
- Index: 5 bits (since 25=32 and we need index to select one of the 32 ways)
- Tag: 21 bits (=32 minus 5 for index, minus 6 to address byte within block)
- Access time: 2 cycles, (.6ns at 3GHz; pipelined, dual-ported [load+store])



# 4 Questions for Memory Hierarchy

- Q1: Where can a block be placed in the upper level?
  - Block placement
- Q2: How is a block found if it is in the upper level?
  - Block identification
- Q3: Which block should be replaced on a miss?
  - Block replacement
- Q4: What happens on a write?
  - Write strategy

#### Q1: Where can a block be placed in the upper level?



In a direct-mapped cache, block 12 can only be placed in one cache location, determined by its low-order address bits –

(12 mod 8) = 4



In a two-way set-associative cache, the set is determined by its low-order address bits –

$$(12 \mod 4) = 0$$

Block 12 can be placed in either of the two cache locations in set 0



In a fully-associative cache, block 12 can be placed in any location in the cache

- More associativity:
  - More comparators larger, more energy
  - Better hit rate (diminishing returns)
  - Reduced storage layout sensitivity more predictable

#### Q2: How is a block found if it is in the upper level?



- Tag on each block
  - No need to check index or block offset



Increasing associativity shrinks index, expands tag

### Q3: Which block should be replaced on a miss?

- With Direct Mapped there is no choice
- With Set Associative or Fully Associative we want to choose
  - Ideal: least-soon re-used
  - LRU (Least Recently Used) is a popular approximation
  - Random is remarkably good in large caches

| Assoc: | 2-way |       | 4-way |       | 8-way |       |
|--------|-------|-------|-------|-------|-------|-------|
|        |       |       |       |       | LRU   |       |
| 16 KB  | 5.2%  | 5.7%  | 4.7%  | 5.3%  | 4.4%  | 5.0%  |
| 64 KB  | 1.9%  | 2.0%  | 1.5%  | 1.7%  | 1.4%  | 1.5%  |
| 256 KB | 1.15% | 1.17% | 1.13% | 1.13% | 1.12% | 1.12% |

Benchmark studies show that LRU beats random only with small caches

LRU can be pathologically bad......

## Q4: What happens on a write?

- Write through—The information is written to both the block in the cache and to the block in the lower-level memory
- Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  - is block clean or dirty?
- Pros and Cons of each?
  - WT: read misses cannot result in writes
  - WB: no repeated writes to same location
- WT always combined with write buffers so that don't wait for lower level memory

# Caches are a big topic

- Cache coherency
  - If your data can be in more than one cache, how do you keep the copies consistent?
- Victim caches
  - Stash recently-evicted blocks in a small fully-associative cache (a "competitive strategy")
- Prefetching
  - Use a predictor to guess which block to fetch next –
     before the processor requests it
- And much much more......

## What's at the bottom of the memory hierarchy?

- StorageTek STK 9310 ("Powderhorn")
  - 2,000, 3,000, 4,000, 5,000, or 6,000
     cartridge slots per library storage module (LSM)
  - Up to 24 LSMs per library (144,000 cartridges)
  - 120 TB (1 LSM) to 28,800 TB capacity (24 LSM)
  - Each cartridge holds 300GB, readable up to 40 MB/sec
- Up to 28.8 petabytes
- Ave 4s to load tape
- 2017 product: Oracle SL8500
- Up to 1.2 Exabyte per unit
- Combine up to 32 units into single robot tape drive system
- http://www.oracle.com/us/products/ser vers-storage/storage/tapestorage/034341.pdf



https://www.itnews.com.au/gallery/inside-suns-multi-storey-colorado-data-centre-135385/page1





#### Can we live without cache?

- Interesting exception: Cray/Tera MTA, first delivered June 1999:
  - www.cray.com/products/systems/mta/
- Each CPU switches every cycle between 128 threads
- Each thread can have up to 8 outstanding memory accesses
- 3D toroidal mesh interconnect
- Memory accesses hashed to spread load across banks
- MTA-1 fabricated using Gallium Arsenide, not silicon
- "nearly un-manufacturable" (wikipedia)
- Third-generation Cray XMT:
  - http://www.cray.com/Products/XMT.aspx
  - YarcData's uRiKA (<a href="http://www.yarcdata.com/products.html">http://www.yarcdata.com/products.html</a>)