# M3: Computer Simulation.

#### John N. Carter

#### April 2014

#### Abstract

This exercise looks at the operation of real programs on real computers. Three topics are addressed: Variation in Instruction mix; Pipeline hazards and Cache Design.

### 1 Introduction

The aim of this experiment is to introduce you to the concepts of simulating and measuring computer performance, using real programs, before committing to a hardware design

Supervisors should have a look at the annotated note, available from me.

This experiment is in three parts, spaced over a single session. Specifically

- 1. In the first part you will investigate the differing instruction mixes used to solve different problems. Table 1 lists the different tasks and work loads..
- 2. You will investigate the prevalence of data and Branch hazards.
- 3. You will see how programs make use of the cache..

gcc-1K.trace.csv is available on line, the rest are too big and will be preloaded on to the workstations in the Electronics Lab.

As you already know simulation is an important tool for the electronic engineer. In CPU design this is doubly so, not only must one simulate at a hardware level one must simulate at a software level to test that the architecture is appropriate for the task(s) in mind.

Preparation: There is significant preparation for this lab. You should read this document carefully and answer all the questions in italics (like this), writing your answers in your logbook. In writing the answer in you logbook you must also write down where you found it. This is as vital as the information itself.

**Preparation:** Revise your knowledge of C/C++ programming and Computer Architecture.

| Program Name | File Name            | Comments                                       |
|--------------|----------------------|------------------------------------------------|
| art          | art.trace.csv        | Artificial Intelligence                        |
| gcc          | gcc.trace.csv        | gcc compiler, note the gcc-1K version          |
|              | gcc-1K.trace.csv     | is shorter and for testing.                    |
| hmmer        | hmmer.trace.csv      | Hidden Markov Model (speech signal processing) |
| libquantum   | libquantum.trace.csv | Quantum Mechanics Simulation.                  |
| mcf          | mcf.trace.csv        | ?                                              |

Table 1: Different Program Types

The following resources exist to support this experiment.

- This document.
- An M3 web site where you will find down loadable code, links to documentation (and where appropriate local copies). Notification of errors and corrections will be published here.
- The simulation trace files loaded on the ECS workstations.

If code is written in advance it should be printed out, with line numbers before you come into the laboratory (the SciTE text editor will do this for you). Software development should be incremental and evolutionary as you are learning about new concepts. Follow the methodology of this experiment and write simple programs.

### 1.1 Background Reading

The following are good sources of information on ideas related to this experiment.

**Preparation:** You should look at all the following as part of your preparation.

- C++ Programming.
- Module text for ELEC2204.

#### 1.2 The Simulator

In this exercise you will be using the output of a simulator which simulates an Intel 'X86 class processor. This was developed at the University of Pennsylvania. The simulator breaks native instructions down into RICS-like micro-operations. It is these we will be considering. See the experiment web site for the exact format and a description of the trace output. The trace files have been preprocessed into csv format for easier decoding. A generic C++ program (generic.cpp) showing how these can be processed is explained in the following text.

```
1
2
       Generic Program to load, parse, process and report trace files in csv format.
3
4 #include <iostream>
5 #include <fstream>
6 #include <sstream>
7 #include <string>
8 #include <vector>
9 #include <stdlib.h>
  int main() {
10
11
       std::ifstream traceFile;
12
       std::string row;
13
       int counter = 0;
       int limit = 10000000;
14
       traceFile.open("gcc.trace.csv");
15
       while(counter < limit) {</pre>
16
17
            std::stringstream ls;
18
            std::string val;
            std::vector<std::string> all;
19
20
                Initalise for processing here.
21
```

```
22
23
             if (!traceFile.good())
24
                 break;
             traceFile >> row;
25
26
             ls.str(row);
27
             while (std::getline(ls, val, ',')) {
28
                 all.push_back(val);
29
                 }
             if (all.size() != 14) {
30
31
                 std::cout << all.size()
32
                            << "_Bad_trace_size , _ignore_and_continue.\n";</pre>
33
                 continue:
34
35
                 Processing here
36
37
38
             counter++;
             if(counter \% 100000 == 0)  {
39
                 if(counter > 999999) {
40
                      std::cout << (float) counter / 1000000.0
41
                                 << "_Million_Lines_Processed\n";</pre>
42
43
44
                 else {
                      std::cout << (float) counter / 1000.0
45
                                 << "_Thousand_Lines_processed \n";</pre>
46
                      }
47
                 }
48
49
50
        /*
51
             Report results here.
52
53
        std::cout << "Finished\n";
54
        return 0;
55
```

A line by line description follows.

- **4-9** Include files for file streams and basic data structures.
- 10 Main program starts here.
- 11 Declare traceFile to manage the input trace file.
- 12 row defined to hold each line from the trace file as it is read.
- 13 counter to keep a count of the number of files read.
- 14 *limit* to restrict the number of lines to be processed. Make it very big to process the whole file, greater than 100 Million should be enough.
- 15 Open the trace file
- 16 Loop over the lines in the trace files until count exceed limit or the end of file is reached.
- 17 ls is declared as a stringstream. This is a string that can be read like a file.

- 18 val is declared as a string to hold lines from the trace file as they are read.
- 19 all is declared as a vector of strings to hold the component parts of each line of tracing.
- **23-24** checks that we have not reached the end of file.
- 25 A row is read
- **26** Converted to a *streamstring*.
- **27-29** Looping over *ls*, reading and returning chunks ending in ',' and adding these to the vector *val*. Building up a vector or array of the component sub strings.
- **30-34** Checking that there are exactly 14 fields in the trace line (for the version with physical memory addresses attached the number of elements is 15). The error is reported, ignored and processing continues.
- 38 The line counter.
- **39-49** Very pretty progress reporting.
- **53-55** The end.

During the lab you will customise this program to make measurements from the instruction traces.

- 20-22 Add your code to initialise counters and other things.
- **35-37** Your code to process trace lines goes here.
- **50-52** Replace these lines with your code to report your results.

### 1.3 Preparation

- 1. Study the file **generic.cpp** and ensure that you understand how and what it does.
- 2. Compile and run **generic.cpp**, use the file *gcc-1K.trace.csv* as input. Ensure that it behaves as you expect and just reads its input, count lines and exits.
- 3. Study the file **countcode.cpp** (Section 2.1.1) and understand how the occurrence of the same text strings are counted as used in Section 2.1.
- 4. Prepare code to measure the pipeline properties Section 2.2
- 5. Make plans for the simulation of a directly mapped instruction cache, in Section 2.3.
- 6. If you feel that you will achieve the mandatory part of the lab with time to spare make plans to simulate dynamic branch prediction, see section 3.1.

As part of your preparation you will write a number of small program to measure different things about the instruction traces. Remember to save these with different names.

| Instruction Types                                |  |  |
|--------------------------------------------------|--|--|
| Logical                                          |  |  |
| Integer Arithmetic                               |  |  |
| Floating point                                   |  |  |
| Load                                             |  |  |
| Store                                            |  |  |
| Control: jump, Branch, Call, Conditional branch. |  |  |

Table 2: Classification of Instructions.

## 2 Experimental Work.

### 2.1 Different Usage of Instructions for Different Tasks.

As you might expect different computing tasks require the use of different instruction. For example a word processor or editor might have lots of character and integer type instructions, while a MP3 media player might have lots of floating point instructions.

Question 1: Why is this?

Question 2: What mix of instructions would you expect for a game like Quake?

Run a program (section 2.1.1) to count micro-opcodes on the art, hmmer, mcf, gcc and libquantume trace files. Inspect the output and classify the results in the following broad classes, see Table 2.

**Question 3:** Now comparing the distributions obtained for the different traces, what deductions can you make about the different programs?

#### 2.1.1 Counting Instructions.

The following program (**countcode.cpp**) will count instructions.

```
1 #include <iostream>
2 #include <fstream>
3 #include <sstream>
4 #include <string>
5 #include <vector>
6 #include <map>
  #include <stdlib.h>
   int main() {
       std::ifstream traceFile;
9
10
       std::map<std::string, int> frequency;
11
       std::map<std::string, int>::iterator it;
12
       std::string row;
       int counter = 0;
13
       int limit = 10000000;
14
15
       traceFile.open("gcc.trace.csv");
16
       while(counter < limit) {</pre>
17
            std::stringstream ls;
            std::string val;
18
            std::vector<std::string> all;
19
20
            if (!traceFile.good())
```

```
21
                 break:
22
            traceFile >> row;
            ls.str(row);
23
            while (std::getline(ls, val, ',')) {
24
                 all.push_back(val);
25
26
27
            if (all.size() != 14) {
28
                 std::cout << all.size()
                               << "_Bad_trace_size , _ignore_and_continue.\n";</pre>
29
30
                 continue;
31
32
            val = all [13];
            it = frequency.find(val);
33
            if(it == frequency.end()) {
34
                 frequency[val] = 1;
35
36
            else {
37
                 frequency [val] ++;
38
39
40
            counter++;
41
42
        std::cout << frequency.size()
                      << "_different_instructions_used\n";</pre>
43
        for(it = frequency.begin(); it != frequency.end(); ++it) {
44
            std::cout << it->first
45
                         << " => "
46
                          << it->second
47
48
                          << '\n';
49
50
        return 0;
51
        }
```

This program uses a new data structure, the *map*. This is often known as an associative array or dictionary. The best way to think of it is as an array with arbitrary indexes. These could be integers, floats, strings or other data structures. The best way to understand them is to see them in action. In the following annotations the differences from the generic program are emphasised.

- **6** Include the map class.
- 10 Define a map frequency which has string indexes and integer values.
- 11 Define an iterator over the map, it. We use this like integers iterating over a regular array.
- **32** We reuse the variable *val* to take a copy of the 13th element of the trace. This is the text string giving the micro-opcode.
- **33** We search the map *frequency* for an index the same as *val*, and store the location in the variable *it*. If the index is not found in the map, then *it* points or refers to the end of the map.
- **34-36** If the micro-opcode is not found, then we start a new map entry *frequency[val]* and set it equal to 1.
- **37-39** It is found so we just add 1 to frequency[val]. This is counting the occurrences of the micro-opcodes.

**44-49** Finally we reuse the variable *it* to loop over the map from *frequency.begin()* to *frequency.end()* incrementing the iterator as we go. *it->first* refers to the index and *it->second* refers to the value.

**Preparation:** Run this program on the file gcc-1k.trace.csv before you come into the lab. Make sure you understand how it works and what the output means.

Finally to make life easier the file *helper.cpp* on the lab page has functions to count things into a map and a compact function to report progress in millions of lines processed.

### 2.2 The Pipeline.

In this section we will make some measurements and see how data (2.2.1) and branch (2.2.2) hazards will affect the operation of a pipeline. We will assume that this processor executes micro-opcodes in a 4 stage pipeline with FETCH, DECODE, EXECUTE and WRITE/READ. There will be slight differences from the MIPS processor as FETCH will be from the micro-opcode generator, not main memory memory, while the WRITE/READ phase will consist of register writes and memory writes and reads. In this and subsequent sections you need only measure the properties of the file gcc.trace.csv.

**Preparation:** If an instruction uses register 3 as destination. How may instructions must pass before it can safely be used as a source.

#### 2.2.1 Data hazards.

This processor has 16 registers i.e. 0 to 15 and two special registers 44 and 45. A data hazard exists if the source for a register instruction should have been updated by an instruction one or more cycles previously. If there was no pipeline this would not be a problem however there is a pipeline so hazards potentionally exist. Determine the severity of the problem by counting the number of such occurrences.

This should be simple, comparing both the source fields to the destination field of previous instructions. It is easy to determine the previous trace values by defining additional vectors to hold previous values

```
1
2
   int main() {
3
        std:vector<std::string> minus1, minus2;
4
5
6
         * previous two lines, read 1st two lines
7
         * of trace file into these
8
         */
9
        while ( . . . ) {
10
11
12
             minus2 = minus1; // This is two instructions previously.
13
14
             minus1 = all; // This is one instruction previously.
15
16
         }
17
         . . .
18
    }
```

Hint: Rather than trying to program up hazard detection. It will be easier to count the number of register pairs between source and destination in the same way you counted instructions, i.e. in a map. You can make suitable keys by simply concatenating the source and destination strings, i.e. val = all[2] + minus1[4];. Generate the map and print it out. Then count the number of pairs that would give rise to data hazards i.e. the number of times register 3 is both source and destination is an example.

#### 2.2.2 Branch hazards.

Branch hazards occur when the pipeline is filled with the wrong instructions. For example the pipeline contains the following instructions but the branch jumps to a different portion of code.

Question 4: In our 4 stage pipeline how many cycles could be lost for a mistaken branch.

There are various strategies we could adopt:

- Always assume the branch will be taken.
- Always assume the branch will not be taken
- Chose randomly every time. You can use the built in rand() function. The code rand() % 2 will give you 0 and 1 randomly.

The simulator tells us whether each branch is taken or not. Chose a branch strategy and count the number of times this gives the incorrect result and leaves the pipeline filled with useless instructions.

**Question 5:** Which of the three is the best?

If you have time at the end implement dynamic branch prediction (see 3.1).

#### 2.3 The cache.

In this section we are only considering a cache for program instruction no data. Thus it is only references to the program counter. Additionally we must now work with real instructions not micro instructions. Therefore we will only consider program counter accesses to memory on the first micro instruction, when the *Uoc* field is 1.

If you look at the trace for all the different programs the addresses used are very similar. This is because the addresses are virtual addresses. However the cache works with physical memory not virtual.

The simplest way to map virtual memory to physical memory is to do it on a one to one basis. In general this is not a good idea as in a real multi-tasking environment each program will need its own share of memory. Assuming a virtual and physical page size of 20000<sub>16</sub> The virtual memory addresses have been mapped to physical addresses. The physical address translation of the program counter has been appended as a hexadecimal number. Assume that there is  $400000_{16}$  bytes of physical memory, this is equivalent to an address bus that is 23 bits wide. Table 3 shows a suggested allocation of bits with in the address bus for a directly mapped cache. This cache holds 8 bytes, enough for a 64-bit integer or a double precision float. The tag and offset are each 10 bits wide, making a 1k Word cache for 4 Megabytes of memory.

Implement the cache described in Table 3.

| Tag     | Offset  | Block             |
|---------|---------|-------------------|
| 10 bits | 10 bits | 8 bytes or 3 bits |

Table 3: Suggested Cache structure assuming 23 bit Address bus

**Hints:** We are not implementing a fully functional cache, only that part concerning locating a value in cache. This can be done with two arrays, both the same size as the offset bit pattern, in this case 1024. A *bool* or logic array, for the valid flag and an integer array to hold the tag. The valid array should be initialised to false, or not a valid entry.

To implement, read the *abs.mixed.trace.csv* file and interpret the last field, the 14th, as the physical address. Compute the offset or line and the tag. Only do this on the first line of an instruction.

```
1 /*
2 *This code converts from a string containing a hex number to an integer.
3 */
4
5 std::string spad;
6 int ipad;
7
8 spad = all [14];
9 ipad = (int) strtol(spad.c_str(), 0, 16);
```

Question 6: How efficiently does the cache operate? Count the unused cache entries when you have read and processed the abs.mixed.trace.csv file.

**Question 7:** What is the cache hit rate for this configuration?

Vary the number of bits for tag and offset. You may find this snippet of code useful.

```
1
2
    * Controling the cache with two parameters, for a fixed bus width.
3
   const int bus_size = 23; // bits
4
5
   const int block_size = 3;
6
7
   const int line_size = 7;
   const int tag_size = bus_size - (block_size + line_size);
8
9
   const int line_buffer = 1 << line_size;</pre>
10
11
   const int block_mask = (1 \ll block_size) -1;
   const int line_mask = (1 \ll line_size) -1;
   const int tag_mask = (1 \ll tag_size) -1;
```

**Question 8:** Is there an optimum cache configuration? What is it? Vary the size of the sise of the cache line from 8 to 12 to see if there is an optimum.

Question 9: How would you use a map class to implement a fully associative cache.

Finally, remember that this is a very artificial situation. If you were evaluating a cache strategy for a real (Intel, ARM) processor you would have to do significantly more simulation with real work loads. Imagine the simulation that needs to be done for your laptop.

# 3 Additional work.

If you have some spare time you might like to try these

### 3.1 Dynamic branch Prediction

Implement dynamic branch prediction, using a map addressed by the branch location in the memory code to hold the state variable for each branch location. Initialise each branch, on first occurrence to taken.

**Question 10:** Is this better than fixed/random strategies? Is it worth it, in terms of the extra complexity incurred?

Clearly there must be a limit to the number of branch states the can be remembered. How would this be implemented and how would you replace branches when you run out of capacity.

### 3.2 Separate Data and Program cache.

First include the data address for load and store in the cache. Remembering that for a straight forward pipelined Von Neumann machine, FETCH and WRITE/READ can not occur simultaneously.

How many instructions will have pipeline stalls because of this?

Implement a cache for data accesses? What percentage of accesses are hits if one adopts a write through policy?