# Computation Structures

## Memory Hierarchy & Caches Worksheet



Keep the most often-used data in a small, fast SRAM (often local to CPU chip). The reason this strategy works: LOCALITY.

Locality of reference: Access to address X at time t implies that access to address  $X+\Delta X$  at time  $t+\Delta t$  becomes more probable as  $\Delta X$  and  $\Delta t$  approach zero.



### **AMAT = HitTime + MissRatio \* MissPenalty**



Example: 2-way set-associative cache, 8 sets, 4-word block size, write-back



Replacement strategy choices: least-recently used (LRU); first in, first out (FIFO); random Write-policy choices: write-through, write-behind, write-back

#### Problem 1.

(A) The timing for a particular cache is as follows: checking the cache takes 1 cycle. If there's a hit the data is returned to the CPU at the end of the first cycle. If there's a miss, it takes 10 additional cycles to retrieve the word from main memory, store it in the cache, and return it to the CPU. If we want an average memory access time of 1.4 cycles, what is the minimum possible value for the cache's hit ratio?



(B) If the cache block size, i.e., words/cache line, is doubled but the total number of data words in the cache is unchanged, how will the following cache parameters change? Please circle the best answer.

```
# of offset bits: UNCHANGED ... (+1) ... -1 ... 2x ... 0.5x ... CAN'T TELL
  # of tag bits: UNCHANGED ... +1 ... -1 ... 2x ... 0.5x ... CAN'T TELL
# of cache lines: UNCHANGED ... +1 ... -1 ... 2x ... (0.5x)... CAN'T TELL
```

Consider a direct-mapped cache with 64 total data words with 1 word/cache line, which uses a LRU replacement strategy and a write-back write strategy. This cache architecture is used for parts (C) through (F).

(C) If cache line number 5 is valid and its tag field has the value 0x1234, what is the address in main memory of the data word currently residing in cache line 5? (5 - 0 | 0)  $\Rightarrow$  000 | 0 | 0 |

Main memory address of data word in cache line 5: 0x + 23 + 14

The program shown on the right repeatedly executes an inner loop that sums the 16 elements of an array that is stored starting in location 0x310.

The program is executed for many iterations, then a measurement of the cache statistics is made during one iteration through all the code, i.e., starting with the execution of the instruction labeled outer loop: until just before the next time that instruction is executed.

```
outer_loop:
  CMOVE(16, R0)
               // initialize loop index J
  CMOVE(0,R1)
loop:
                // add up elements in array
  SUBC(R0,1,R0) // decrement index
  MULC(R0,4,R2) // convert to byte offset
  LD(R2,0x310,R3)// load value from A[J]
  ADD(R3,R1,R1) // add to sum
  BNE(R0,loop) // loop until all words are summed
  BR(outer_loop) // perform test again!
```



| (D) In total, how many instruction fetches occur during one complete iteration of the out                                                                                                                 | iter loop?                      |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------|
| Number of instruction fetches:                                                                                                                                                                            |                                 |
| Number of data reads:                                                                                                                                                                                     | <u>[b.</u>                      |
| (E) How many instruction fetch misses occur during one complete iteration of the outer How many data read misses? Hint: remember that the array starts at address 0x310                                   |                                 |
| Number of instruction fetch misses:                                                                                                                                                                       | 4                               |
| Number of data read misses:                                                                                                                                                                               | 4                               |
| (F) What is the hit ratio measured after one complete iteration of the outer loop?  Hit ratio:                                                                                                            | <del>1</del> / <del>2</del> 9   |
| Problem 2.                                                                                                                                                                                                | //                              |
| The Beta Engineering Team is working on the design of a cache. They've decided that t will have a total of $2^{10} = 1024$ data words, but are still thinking about the other aspects cache architecture. |                                 |
| First assume the team chooses to build a direct-mapped write-back cache with a block si words.                                                                                                            | ze of 4                         |
| (A) Please answer the following questions:                                                                                                                                                                | - (                             |
| Number of lines in the cache:                                                                                                                                                                             | 5b,                             |
| Number of bits in the tag field for each cache entry:    words = 26)t offset.   32 (8tx) - 4     Words = 46 t                                                                                             | 20<br>d, if it's a<br>ek cycles |
| Average memory access time assuming 90% hit rate (clock cycles):                                                                                                                                          | +                               |
| 2+20×10%                                                                                                                                                                                                  |                                 |

Now assume the team chooses to build a 2-way set-associative write-back cache with a block size of 4 words. *The total number of data words in the entire cache is still 1024*. The cache uses a LRU replacement strategy.

| (C) Please answer the following questions: $\chi \chi \circ \rho$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Address bits used as offset (including byte offset): $A[  ]$ : $$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| $\nearrow$ $\leftarrow$ 128. Address bits used as cache line index: A[ $\frac{10}{10}$ : $\frac{4}{10}$ ]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| Address bits used for tag comparison: A[ 3 : 1 : 1 ]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| (D) To implement the LRU replacement strategy this cache requires some additional state for each set. How many state bits are required for each set?                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| Number of state bits needed for each set for LRU:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| When change, a line = 2 sets, there two cases, left or right.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| To test this set-associative cache, the team runs the benchmark code shown on the right. The code sums the elements of a 16-element array. The first instruction of the code is at location 0x0 and the first element of the array is at location 0x10000. Assume that the cache is empty when execution starts and remember the cache has a block size of 4 words.  (E) How many instruction misses will occur when running the benchmark?  Number of instruction misses when running the benchmark:  Number of instruction misses when running the benchmark:    ADDC (R0, 4, R0)   CMPLTC (R0, 64, R2) |
| (F) How many data misses (i.e., misses caused by the memory access from the LD instruction) will occur when running the benchmark?  ∴ = 0x10000 ⇒ 0.  A: LONG(1)                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| Number of data misses when running the benchmark:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| (g) What's the exact hit rate when the complete benchmark is executed?  Benchmark hit rate: $\frac{93}{100}$ $\frac{1}{100}$                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |

#### Problem 3.

The program from the Cache performance lab I = 0x240// location of program is shown at the right. Assume the program is A = 0x420// location of array A being run on a Beta with a cache with the N = 16// size of array (in words) following parameters:  $\cdot = I$ // start program here • 2-way set-associative test: • block size of 2, i.e., 2 data words are stored ox 240 CMOVE(N,R0) // initialize loop index J in each cache line CMOVE(0,R1) • total number of data words in the cache is 32 • LRU replacement strategy // add up elements in array loop: TH SUBC(R0,1,R0) // decrement index (A) The cache will divide the 32-bit address MULC(R0,4,R2) // convert to byte offset supplied by the Beta into three fields: B LD(R2,A,R3) // load value from A[J] bits of block offset (including byte offset ADD(R3,R1,R1) // add to sum bits), L bits of cache line index, and T bits BNE(R0,loop) // loop N times of tag field. Based on the cache parameters given above, what are the BR(test) // perform test again! appropriate values for B, L, and T? value for B: 3 // allocate space to hold array . = A STORAGE(N) // N words value for T: 26(B) If the MULC instruction is resident in a cache line, what will be its cache line index? the value of the tag field for the cache? Cache line index for MULC when resident in cache: Tag field for MULC when resident in cache: 0x\_\_\_\_\_\_\_ (C) With the values of I, A, and N as shown, list all the values j ( $0 \le j < N$ ) where the location holding the value A[i] will map to the same eache line index as the MULC instruction in the program. List all j where A[j] have the same cache line index as MULC: (0, 1]  $0 \times 420 \Rightarrow 0 \mid 0 \circ 0 \circ (0 \circ 00)$   $\mid ndex = 4$ . (D) If the outer loop is run many times, give the steady-state hit ratio for the cache, i.e., assume that the number of compulsory misses as the cache is first filled are insignificant compared  $(6,7) \Rightarrow 7$ to the number of hits and misses during execution. Steady-state hit ratio (%): 

#### Problem 4.

2×4 ×2 = 16 words.

Consider a 2-way set-associative cache where each way has 4 cache lines with a block size of 2 words. Each cache line includes a valid bit (V) and a dirty bit (D), which is used to implement a write-back strategy. The replacement policy is least-recently-used (LRU). The cache is used for both instruction fetch and data (LD,ST) accesses. Please use this cache when answering questions (A) through (D).

(A) Using this cache, a particular benchmark program experiences an average memory access time (AMAT) of 1.3 cycles. The access time on a cache hit is 1 cycle; the miss penalty (i.e., additional access time) is 10 cycles. What is the hit ratio when running the benchmark program? You can express your answer as a formula if you wish:

> 1.5= 1+ 10 × (1- a)

(B) The circuitry for this cache uses various address bits as the block offset, cache line index and tag field. Please indicate which address bits A[31:0] are used for each purpose by placing a "B" in each address bit used for the block offset, "L" in each address bit used for the cache

line index, and "T" in each address bit used for the tag field.

Fill in each box with "B", "L", or "T"

| 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|---|---|---|---|---|---|---|---|---|---|
| T  | 1  |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |   |   |   | 7 | T | 7 | ۷ | B | 0 | 0 |

(C) This cache needs room to store new data and based on the LRU replacement policy has chosen the cache line whose information is shown to the right for replacement. Since the current contents of that line are marked as dirty (D = 1), the cache must write some information back to main memory. What is the address of each memory location to be written? Please give each address in hex.

Way: 0

Cache line index: 3 Valid bit (V): 1

Dirty bit (D): 1 Tag field: 0x123

Addresses of each ocation to be written (in hex): 2478 and 247 C

000/00/00/11/1000

(D) This cache is used to run the following benchmark program. The code starts at memory address 0; the array referenced by the code has its first element at memory address 0x2000. First determine the number of memory accesses (both instruction and data) made during each iteration through the loop. Then estimate the steady-state average hit ratio for the program, i.e., the average hit ratio after many iterations through the loop.

```
// byte index into array
   CMOVE(0,R0)
                        // initialize checksum accumulator
   CMOVE(0,R1)
                        // load next element of array
  LD(R0,array,R2)
  SHLC(R1,1,R1)
                        // shift checksum
                        // increment checksum
   \(\alpha\)DC(1,R1,R1)
                        // include data value in checksum
   ADD(R2,R1,R1)
                        // byte index of next array element

⟨ADDC(R0,4,R0)
   CMPLTC(R0,1000,R2) // process 250 entries
   BT(R2,loop)
   HALT()
                      250
. = 0x2000
array:
   ... array contents here ...
```

Number of memory accesses made during each iteration of the loop:

Estimated steady-state average hit ratio: \_\_\_\_\_\_\_\_

#### Problem 5.

Consider the diagram to the right for a 2-way set associative cache to be used with our Beta design. Each cache line holds a single 32-bit word of data along with its associated tag and valid bit (0 when the cache line is invalid, 1 when the cache line is valid).

(A) The Beta produces 32-bit byte addresses, A[31:0]. To ensure the best cache performance, which address bits should be used for the cache index? For the tag field?

address bits used for cache index:  $A[ \frac{4}{5} : \frac{2}{5} ]$  address bits used for tag field:  $A[ \frac{3}{5} : \frac{5}{5} ]$ 



(B) Suppose the Beta does a read of location 0x5678. Identify which cache location(s) would be checked to see if that location is in the cache. For each location specify the cache section (A or B) and row number (0 through 7). E.g., 3A for row 3, section A. If there is a cache hit on this access what would be the contents of the tag data for the cache line that holds the data for this location?

Olologous Cache location(s) checked on access to 0x5678:

cache tag data on hit for location 0x5678 (hex): 0x 283

(C) Assume that checking the cache on each read takes 1 cycle and that refilling the cache on a miss takes an *additional* 8 cycles. If we wanted the *average* access time over many reads to be 1.1 cycles, what is the minimum hit ratio the cache must achieve during that period of time? You needn't simplify your answer.

minimum hit ratio for 1.1 cycle average access time: 98.75%

1.1= H8 (1-cn)

(D) Estimate the approximate cache hit ratio for the following program. Assume the cache is empty before execution begins (all the valid bits are 0) and that an LRU replacement strategy is used. Remember the cache is used for both instruction and data (LD) accesses.

CMOVE(source, R0)

CMOVE(0,R1)

CMOVE(0x1000,R2)

LD(R0,0,R3)

ADD(R0,4,R0)

ADD(R3,R1,R1)

SUBC(R2,1,R2)

BNE(R2,loop)

ST(R1,source)

HALT()

. = 0x100

source:

. = ( + 0x4000 // Set source to 0x100, reserve 1000 words

approximate hit ratio:

(E) After the program of part (D) has finished execution what information is stored in row 4 of the cache? Give the addresses for the two locations that are cached (one in each of the sections) or briefly explain why that information can't be determined.

Addresses whose data is cached in "Row 4":  $0 \times 10^{\circ}$ , and  $0 \times 10^{\circ}$ .

Heet 4, 5 6-8 of 11
Memory Hierarchy & Caches

#### Problem 6.

A standard unpipelined Beta is connected to a 2-way set-associative cache containing 8 sets, with a block size of 4 32-bit words. The cache uses a LRU replacement strategy. At a particular point during execution, a snapshot is taken of the cache contents, which are shown below. All values are in hex; assume that any hex digits not shown are 0.

|     |            |   |   |     | Wa       | ay #1     |           |          |     | Way #2 |   |          |          |          |          |                    |          |            |  |  |
|-----|------------|---|---|-----|----------|-----------|-----------|----------|-----|--------|---|----------|----------|----------|----------|--------------------|----------|------------|--|--|
|     | Line#      | V | D | TAG | A3       | A2        | A1        | AO       |     | Line#  | V | D        | TAG      | В3       | В2       | 81                 | ВО       |            |  |  |
|     | 7          | 1 | 1 | f29 | a6218800 | 77f70002  | c01f0034  | 8359c800 |     | 7      | 1 | 0        | 77e      | d01e001e | 61040008 | c01f0044           | 80a32800 |            |  |  |
|     | 6          | 1 | 0 | 027 | c01f0037 | 73fffefa7 | 77ee0002  | d33ac000 |     | 6      | 1 | 0        | 1f0      | c23f0394 | 77e00002 | 80400000           | c01f0046 |            |  |  |
|     | 5          | 0 | 0 | 000 | 00000000 | 00000000  | 00000000  | 00000000 |     | 5      | 1 | 0        | d1c      | f79c0001 | e809ffff | аааааааа           | 5555555  |            |  |  |
| L - | 4          | 1 | 0 | 33a | 8358c000 | 73fa0002  | c17f0f0f  | 73fffef0 | L-  | 4      | 1 | 0        | 00a      | d37affff | dc000000 | 73fffe93           | 80600000 |            |  |  |
|     | 3          | 1 | 0 | 801 | 641f0608 | 60ff0600  | c01f0043  | 80200000 | _   | 3      | 1 | 0        | 01b      | c01f0035 | d01e04b0 | 80a42800           | 73fffeab |            |  |  |
|     | 2          | 1 | 0 | 002 | 90072800 | 73fffeb0  | 80a42800  | d0052ca0 |     | 2      | 1 | 0        | 02b      | c6510001 | 77f90002 | 90082800           | c01f0041 |            |  |  |
|     | 1          | 1 | 0 | 5c8 | 80400000 | 8358c000  | 73ff0003  | a5ab6000 |     | 1      | 1 | 0        | f99      | 77ee0002 | f39b001f | 64040000           | fc000000 |            |  |  |
| ,   | - 0        | 1 | 0 | a11 | c01f0045 | d1cd7f0f  | 7f1f0091a | 77e00002 |     | Lo     | 1 | 0        | 7af      | a9ab6000 | c01f003b | 77e00002           | fb5c0011 |            |  |  |
|     | 8          |   |   |     | V        | V         | V         | V        |     | ,      |   |          | Τ.       | V        | *        | V                  | V        |            |  |  |
|     |            |   |   |     | 3        | 2         | 1         | 0 /      | ← в |        |   |          |          | 3        | 2        | 1                  | 0        | <b>←</b> B |  |  |
|     | T ->=      |   |   |     |          |           |           |          |     |        | T | <b>→</b> | <b>*</b> |          | BDA      | Name of the second |          |            |  |  |
|     | AHIT ADATA |   |   |     |          |           |           |          |     |        |   |          | BHIT     |          |          |                    |          |            |  |  |

(A) The cache uses bits from the 32-bit byte address produced by the Beta to select the appropriate set (L), as input to the tag comparisons (T) and to select the appropriate word from the data block (B). For correct and optimal performance what are the appropriate portions of the address to use for L, T and B? Express your answer in the form "A[N:M]" for N and M in the range 0 to 31, or write "CAN'T TELL".

> Address bits to use for L:  $\frac{A[b:4]}{A[b:7]}$ Address bits to use for B: A [ ]

(B) For the following addresses, if the contents of the specified location appear in the cache, give the location's 32-bit contents in hex (determined by using the appropriate value from the cache). If the contents of the specified location are NOT in the cache, write "MISS".

Contents of location 0xA1100 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Contents of location 0x548 (in hex) or "MISS": 0x

Cont

Locations 0x0 and 0x1000 present simultaneously (circle one): YES ... NO

000 000 00 00 000 0000 -9 of 11 - Memory Hierarchy & Caches

(D) (Give a one-sentence explanation of how the D bit got set to 1 for Line #7 of Way #1.

One sentence explanation

ST () changed value

on line f, way #1, and has not

written to Main Memory.

(E) The following code snippet sums the elements of the 32-element integer array X. Assume this code is executing on a Beta with a cache architecture as described above and that, initially, the cache is empty, i.e., all the V bits have been set to 0. Compute the hit ratio as this program runs until it executes the HALT() instruction, a total of 2 + (6\*32) + 1 = 195

Hit ratio: 227

. = 0

CMOVE(0, R0) // loop counter

CMOVE(0, R1) // accumulated sum

Oop:

SHLC(R0, 2, R2) // convert loop counter to byte offset

LD(R2, X, R3) // load next value from array

ADD(R3, R1, R1) // add value to sum

ADDC(R0, 1, R0) // increment loop counter

CMPLTC(R0, 32, R2) // finished with all 32 elements?

BT(R2,loop) // nope, keep going // all done, sum in R1 HALT() // the 32-element integer array X X: LONG(1)LONG(2)

LONG(32)

#### Problem 7.

After his geek hit single *I Hit the Line*, renegade singer Johnny Cache has decided he'd better actually learn how a cache works. He bought three Beta processors, identical except for their cache architectures:

- **Beta1** has a 64-line direct-mapped cache
- Beta2 has a 2-way set associative cache, LRU, with a total of 64 lines
- **Beta3** has a 4-way set associative cache, LRU, with a total of 64 lines

Note that each cache has the same total capacity: 64 lines, each holding a single 32-bit word of data or instruction. All three machines use the same cache for data and instructions fetched from main memory.

Johnny has written a simple test program:

```
// Try a little cache benchmark
              0 X 0 0
I = 0x1000
                        // where program lives
                                 // data region 1
A = 0x2000
               0× 00
                                // data region 2
// size of data regions (BYTES!)
B = 0x3000
               0× 00
N = 16
                           // start program here
// outer loop count
// Loop index I (array offset)
// I = I-1
// read A[I]
// read B[I]
       CMOVE (1000, R6)
P:
       CMOVE (N, R0)
      SUBC (R0, 4, R0)
      LD(R0, A, R1)
     LD(R0, B, R2)
       BNE (R0, R)
                                   // repeat many times
       SUBC (R6,1, R6)
       BNE (R6, Q)
       HALT()
                                        2+ ((6x4) +3) × /000
             16.
```

Johnny runs his program on each Beta, and finds that one Beta model outperforms the other two.

(A) (2 points) Which Beta gets the highest hit ratio on the above benchmark?

Circle one: Beta1 Beta2 Beta3

(B) (2 points) Johnny changes the value of **B** in his program to **0x2000** (same as **A**), and finds a substantial improvement in the hit rate attained by one of the Beta models (approaching 100%). Which model shows this marked improvement?

Circle one: Beta1 Beta2 Beta3

(C) (3 points) Finally, Johnny sets **I**, **A**, and **B** each to **OxO**, and sets **N** to **64**. What is the TOTAL number of cache misses that will occur executing this version of the program on each of the Beta models?

TOTAL cache misses running on Beta1: 16; Beta2: 16; Beta3: 67002