# Comet with Branch Predictor

Team 11

#### Comet

- Synthesizable RISC-V processor with HLS
- HLS comes with the benefit of simulating and testing processor designs
- The RISC-V ISA and Comet's simple implementation make it easy to implement and tweak different architectural designs

#### Comet

```
struct FtoDC ftodc;
struct DCtoEx dctoex;
struct ExtoMem extomem;
struct MemtoWB memtowb;
while true do
   ftodc_temp = fetch();
   dctoex_temp = decode(ftodc);
   extomem_temp = execute(dctoex);
   memtowb_temp = memory(extomem);
   writeback(memtowb);
   /* -- Handling stalls -- */
   bool stall[5] = stallLogic();
   if !stall[0] then
      ftodc = ftodc_temp;
   end
   if !stall[1] then
      dctoex = dctoex_temp;
   end
   /* -- Handling forwarding -- */
   bool forward = forwardLogic();
   if forward then
      dctoex.value1 = extomem.result;
   end
end
```

**Algorithm 2:** High-level specification of an explicitly pipelined simulator.



Textbook 5-stage pipelined machine with forwarding and stalling for data hazards, control hazards and multi-cycle arithmetic ops

## Comet, but in Catapult

- The camera-ready version of Comet is written in Catapult, which has different HLS rules and different definitions for arbitrary precision data types
- The design is straightforward and can be easily expanded upon
- However, refactoring it into Vivado-synthesizable code is nontrivial

# Combined Design

- Luckily, there is an <u>older implementation in Vivado HLS</u>
- The older version is synthesizable but the overall structure is not great
- So we combined the two to get the best of both worlds

## **Combined Design**



Synthesis Report for the Design

#### Verification

- It's hard to check every executed instruction and the value of every register on every cycle
- Instead, we verify the design by simulating it with artificial programs (matrix multiplication and Quick Sort) and analyzing the memory contents after execution
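
The memory-based check described above could look roughly like the sketch below. It is an illustration only, not the project's testbench: `checkSorted` and `checkMatmul` are hypothetical helpers, and the output region of the simulated data memory is assumed to have been copied into a plain vector of 32-bit words.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: after simulating Quick Sort, the output region of the
// dumped memory must equal the host-sorted copy of the original input.
bool checkSorted(const std::vector<int32_t>& input,
                 const std::vector<int32_t>& memoryAfter) {
  std::vector<int32_t> expected(input);
  std::sort(expected.begin(), expected.end());
  return expected == memoryAfter;
}

// Hypothetical helper: after simulating an n x n matrix multiplication, the
// output region must match a reference product computed on the host.
// Matrices are stored row-major.
bool checkMatmul(const std::vector<int32_t>& a, const std::vector<int32_t>& b,
                 const std::vector<int32_t>& memoryAfter, std::size_t n) {
  if (a.size() != n * n || b.size() != n * n || memoryAfter.size() != n * n) {
    return false;
  }
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      int64_t acc = 0;  // wide accumulator; assumes results fit in 32 bits
      for (std::size_t k = 0; k < n; ++k) {
        acc += static_cast<int64_t>(a[i * n + k]) * b[k * n + j];
      }
      if (acc != memoryAfter[i * n + j]) return false;
    }
  }
  return true;
}
```

Comparing only the final memory image keeps the check independent of pipeline timing, so the same helpers would work before and after a predictor is added.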



#### **Branch Prediction**

- Control hazards in a pipelined machine cause pipeline flushes, but many branches exhibit regular patterns
- Branch prediction enables many aggressive execution optimizations

#### Branch Prediction: N-Bit Predictor



```
template <int BITS, int ENTRIES>
class BitBranchPredictor : public BranchPredictorWrapper<BitBranchPredictor<BITS, ENTRIES> > {
  static const int LOG_ENTRIES = log2const<ENTRIES>::value;
  // Counter encoding: 0 is "strongly taken", (1 << BITS) - 1 is "strongly not taken"
  static const int NT_START = (1 << BITS) - 1;
  static const int NT_FINAL = (1 << BITS) >> 1;
  static const int T_START = 0;
  static const int T_FINAL = NT_FINAL - 1;
  CORE_UINT(BITS) table[ENTRIES];

public:
  BitBranchPredictor() {
    for (int i = 0; i < ENTRIES; i++) {
      table[i] = T_START;
    }
  }

  // Move the saturating counter toward the observed outcome
  void _update(CORE_UINT(32) pc, bool isBranch) {
    CORE_UINT(LOG_ENTRIES) index = pc.SLC(LOG_ENTRIES, 2);
    if (isBranch) {
      table[index] -= table[index] != T_START ? 1 : 0;
    } else {
      table[index] += table[index] != NT_START ? 1 : 0;
    }
  }

  // Predict taken while the counter is in the "taken" half of its range
  void _process(CORE_UINT(32) pc, bool& isBranch) {
    CORE_UINT(LOG_ENTRIES) index = pc.SLC(LOG_ENTRIES, 2);
    isBranch = table[index] <= T_FINAL;
  }
};
```

2-Bit Predictor

N-Bit Predictor in HLS
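
To try the saturating-counter behavior outside an HLS toolchain, a plain C++ sketch may help. It assumes the same counter encoding as the slide's code (0 means strongly taken) and the same indexing (PC bits above the two low bits). `SaturatingPredictor` and its method names are made up for this sketch and are not part of Comet.

```cpp
#include <array>
#include <cstdint>

// Plain C++ sketch of an N-bit saturating-counter predictor.
// Encoding follows the slide: 0 is "strongly taken",
// (1 << BITS) - 1 is "strongly not taken".
template <int BITS, int ENTRIES>
class SaturatingPredictor {
  static_assert((ENTRIES & (ENTRIES - 1)) == 0, "ENTRIES must be a power of two");
  static constexpr uint32_t kStrongNotTaken = (1u << BITS) - 1;
  static constexpr uint32_t kWeakestTaken = ((1u << BITS) >> 1) - 1;
  std::array<uint32_t, ENTRIES> table{};  // zero-initialized: strongly taken

  // Drop the two low PC bits, then keep log2(ENTRIES) bits as the index.
  static uint32_t indexOf(uint32_t pc) { return (pc >> 2) & (ENTRIES - 1); }

public:
  // Move the counter one step toward the observed outcome, saturating.
  void update(uint32_t pc, bool taken) {
    uint32_t& counter = table[indexOf(pc)];
    if (taken) {
      if (counter != 0) --counter;
    } else {
      if (counter != kStrongNotTaken) ++counter;
    }
  }

  // Predict taken while the counter sits in the "taken" half of its range.
  bool predict(uint32_t pc) const { return table[indexOf(pc)] <= kWeakestTaken; }
};
```

With `BITS = 2`, two consecutive not-taken outcomes are needed to flip a strongly-taken entry, which is what lets the predictor ride out a single loop exit.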

# Branch Prediction: Correlating Branch Predictor

- A Correlating Branch Predictor takes the outcomes of previous branches into account, based on the observation that branches may be correlated
- It's also easy to implement on top of the N-Bit Branch Predictor

```
template <int CORRELATION_BITS, int PREDICTOR_BITS, int ENTRIES>
class CorrelatingPredictor : public BranchPredictorWrapper<CorrelatingPredictor<CORRELATION_BITS, PREDICTOR_BITS, ENTRIES> > {
  static const int NUM_PREDICTORS = (1 << CORRELATION_BITS);
  BitBranchPredictor<PREDICTOR_BITS, ENTRIES> bp[NUM_PREDICTORS];
  CORE_UINT(CORRELATION_BITS) bhr;  // branch history register

public:
  CorrelatingPredictor() {
    for (int i = 0; i < CORRELATION_BITS; i++) {
      bhr[i] = 0;
    }
  }

  // Train the predictor selected by the current history, then shift in the outcome
  void _update(CORE_UINT(32) pc, bool isBranch) {
    bp[(int)bhr]._update(pc, isBranch);
    bhr = ((bhr << 1) | isBranch);
  }

  void _process(CORE_UINT(32) pc, bool& isBranch) {
    bp[(int)bhr]._process(pc, isBranch);
  }
};
```
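
A small host-side experiment can illustrate why history helps. The sketch below is an assumption-laden stand-in, not the Comet code: `CorrelatingSketch` uses fixed 2-bit counters, an explicit mask in place of the fixed-width history register, and a hypothetical `mispredictions` helper that replays a strictly alternating branch.

```cpp
#include <array>
#include <cstdint>

// Sketch: a global-history correlating predictor built from 2-bit counters.
// The last HISTORY_BITS outcomes select one of 2^HISTORY_BITS counter tables.
// HISTORY_BITS = 0 degenerates to a plain 2-bit predictor.
template <int HISTORY_BITS, int ENTRIES>
class CorrelatingSketch {
  static constexpr uint32_t kTables = 1u << HISTORY_BITS;
  // Counter values 0..1 predict taken, 2..3 predict not taken.
  std::array<std::array<uint8_t, ENTRIES>, kTables> counters{};
  uint32_t history = 0;

  static uint32_t indexOf(uint32_t pc) { return (pc >> 2) & (ENTRIES - 1); }

public:
  bool predict(uint32_t pc) const { return counters[history][indexOf(pc)] <= 1; }

  void update(uint32_t pc, bool taken) {
    uint8_t& counter = counters[history][indexOf(pc)];
    if (taken) {
      if (counter > 0) --counter;
    } else {
      if (counter < 3) ++counter;
    }
    // Shift in the outcome; the mask plays the role of the fixed-width register.
    history = ((history << 1) | (taken ? 1u : 0u)) & (kTables - 1);
  }
};

// Hypothetical helper: count mispredictions on a branch that strictly
// alternates taken, not taken, taken, ...
template <class Predictor>
int mispredictions(Predictor& predictor, int iterations) {
  int wrong = 0;
  for (int i = 0; i < iterations; ++i) {
    bool taken = (i % 2 == 0);
    if (predictor.predict(0x40) != taken) ++wrong;
    predictor.update(0x40, taken);
  }
  return wrong;
}
```

On this pattern the history-free configuration keeps mispredicting every not-taken outcome, while a single history bit is enough to separate the two cases after a short warm-up.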

#### **Branch Prediction: Random Predictor**

- Just a pure guess
- Easy to implement
- Aware of overhead of this type of branch predictor
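
The slides do not say how the guess is produced. One hardware-friendly possibility, shown here purely as an assumption, is a 16-bit Fibonacci LFSR, which costs only a shift register and a few XOR gates. `RandomPredictorSketch` is a hypothetical name.

```cpp
#include <cstdint>

// Sketch of a random predictor. The source of randomness is an assumption:
// a 16-bit Fibonacci LFSR with taps at bits 0, 2, 3 and 5.
class RandomPredictorSketch {
  uint16_t lfsr;

public:
  // A zero state would lock the LFSR, so fall back to 1 in that case.
  explicit RandomPredictorSketch(uint16_t seed = 0xACE1u)
      : lfsr(seed != 0 ? seed : 1) {}

  // The PC is ignored: the prediction is simply the next pseudo-random bit.
  bool predict(uint32_t /*pc*/) {
    uint16_t bit = ((lfsr >> 0) ^ (lfsr >> 2) ^ (lfsr >> 3) ^ (lfsr >> 5)) & 1u;
    lfsr = static_cast<uint16_t>((lfsr >> 1) | (bit << 15));
    return bit != 0;
  }

  // Nothing is learned from the actual outcome.
  void update(uint32_t /*pc*/, bool /*taken*/) {}
};
```

Because the predictor keeps no per-branch state, it can serve as a cheap reference point when comparing the other predictors.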



| w0     | w1     | w2     | w3     | w4     | w5     | w6     | w7     |
|--------|--------|--------|--------|--------|--------|--------|--------|
| -w0+w1 | -w0-w1 | -w2+w3 | -w2-w3 | -w4+w5 | -w4-w5 | -w6+w7 | -w6-w7 |





```
    dp = perceptron[index][SIZE];  // extra weight at index SIZE (presumably the bias)
    ap_int<BITS> weight[SIZE/2];
    ap_int<2> sign[SIZE/2];
#pragma HLS array_partition variable=perceptron dim=2 complete
#pragma HLS array_partition variable=bht dim=1 complete
#pragma HLS array_partition variable=weight dim=1 complete
#pragma HLS array_partition variable=sign dim=1 complete
    // Select the precomputed pair weight from two adjacent history bits
    for (int i = 0; i < SIZE/2; i++) {
#pragma HLS PIPELINE
        weight[i] = (bht[i*2] == bht[i*2+1]) ? perceptron[index][i*2+1] : perceptron[index][i*2];
    }
    // The first history bit of each pair decides the sign
    for (int i = 0; i < SIZE/2; i++) {
#pragma HLS PIPELINE
        sign[i] = bht[i*2] ? -1 : 1;
    }
    // Accumulate SIZE/2 terms instead of SIZE
    for (int i = 0; i < SIZE/2; i++) {
#pragma HLS UNROLL
        dp += weight[i] * sign[i];
    }
    pd = dp >= 0;
```
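
The pairing trick can be checked on the host by comparing it against a naive dot product. The sketch below assumes that a set history bit stands for +1 (taken) and a clear bit for -1, which is consistent with the sign rule in the excerpt. `naiveDot`, `packPairs` and `pairedDot` are hypothetical names.

```cpp
#include <array>
#include <cstdint>

constexpr int SIZE = 8;  // history length, matching w0..w7 in the table

// Naive perceptron output: bias plus the sum of x[i] * w[i], where x[i] is +1
// for a set history bit and -1 for a clear one (an assumption of this sketch).
int naiveDot(const std::array<int, SIZE>& w, int bias,
             const std::array<bool, SIZE>& history) {
  int dp = bias;
  for (int i = 0; i < SIZE; ++i) dp += history[i] ? w[i] : -w[i];
  return dp;
}

// Precompute per pair: slot 2i holds -w[2i] + w[2i+1],
// slot 2i+1 holds -w[2i] - w[2i+1], as in the table.
std::array<int, SIZE> packPairs(const std::array<int, SIZE>& w) {
  std::array<int, SIZE> packed{};
  for (int i = 0; i < SIZE / 2; ++i) {
    packed[2 * i] = -w[2 * i] + w[2 * i + 1];
    packed[2 * i + 1] = -w[2 * i] - w[2 * i + 1];
  }
  return packed;
}

// Same selection and sign rule as the HLS excerpt: SIZE/2 additions.
int pairedDot(const std::array<int, SIZE>& packed, int bias,
              const std::array<bool, SIZE>& history) {
  int dp = bias;
  for (int i = 0; i < SIZE / 2; ++i) {
    int weight = (history[2 * i] == history[2 * i + 1]) ? packed[2 * i + 1]
                                                        : packed[2 * i];
    int sign = history[2 * i] ? -1 : 1;
    dp += weight * sign;
  }
  return dp;
}
```

Each pair of history bits picks one precomputed combination and one sign, so the adder tree only needs half as many inputs as the naive form.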

## dct, perceptron



## dijkstra, perceptron



## matmul, perceptron



## qsort, perceptron



## **Evaluation**



# Branch Predictor on Top of Comet

- The branch outcome was originally computed in the EX stage, but we moved it to the second stage, ID
- We implemented branch prediction by modifying the relevant pipeline registers and updating the predictor once the outcome is computed



# **Export IP and Other Problems**

- We can synthesize and export the RISC-V core as a Xilinx IP
- However, we can't easily build a real RISC-V processor on an FPGA
- We would need to integrate a memory controller, memory, I/O, etc.
- We found a <u>repo</u> that can generate a bitstream for PYNQ
- But using it would require replacing its original picorv32 core with Comet
  - The input and output ports are different
  - The whole design becomes more complicated
  - The Vivado version does not match
- So we only use the simulator

Q&A

#### Reference

1. What You Simulate Is What You Synthesize: Designing a Processor Core from C++ Specifications
2. A new organization for a perceptron-based branch predictor and its FPGA implementation
3. RISC-V CPU in HLS
4. HL5: A 32-bit RISC-V Processor Designed with High-Level Synthesis
5. Two-Level Adaptive Training Branch Prediction
6. RISC-V-On-PYNQ