#### FAET630004: AI-Core and RISC Architecture

(Due: 4/23/21)

## Homework Assignment #2

Instructor: Chixiao Chen Name: Chunyu Wang, FudanID: 20210860017

- This HW counts for 15% of your final score; please treat it carefully.
- Please submit the electronic copy by email to faet\_english@126.com before the due date.
- You are encouraged to use LaTeX to edit it; the source code of the assignment is available at https://www.overleaf.com/read/qnqfpcmqvchp
- You can also open it in Microsoft Word and save it as a .doc file for easy editing. Alternatively, you can print it out, complete it, and scan it with your cellphone.
- The assignment needs Verilog/SV simulation. It is suggested to use Vivado from Xilinx to complete the simulation. If you do not want to install a local Verilog simulator, please use an online tool such as https://www.edaplayground.com/ (you need to register to save your work).
- You can answer the assignment in either Chinese or English.

#### Problem 1: Implement a matrix multiplier on a RISC Core

(8+7=15 points)

Use the following ISA and hardware architecture to compute $\mathbf{A} \cdot \mathbf{B} + \mathbf{C}$, where $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ are $8 \times 8$ matrices. Each element is a signed 8-bit integer.

[Figure: ISA summary and hardware architecture. Base scalar instructions: LOAD, Store, MOV (rd, imm<sub>8b</sub>, funct<sub>1b</sub>). Vector extension instructions: Vload (vrd<sub>3b</sub>, rs, imm<sub>5b</sub>), Vstore (imm<sub>5b</sub>, rs, vrs<sub>3b</sub>), VMAC (vrd<sub>3b</sub>, vrs1<sub>3b</sub>, vrs2<sub>3b</sub>, funct<sub>4b</sub>). Block diagram labels: ICM, DCM, Imm. Gen, MUX, Scalar RF, optional Scalar ALU, Config Reg, Vector Reg File, MAC, 16b.]

#### Implementation of a VMAC module:

```
`timescale 1ns / 1ps
// Author: Wang Chunyu
// Description: 4-lane signed 8-bit MAC with four 32-bit partial sums.
//   funct[3]   : 1 = set bias (load psum from vrs1), 0 = MAC
//   funct[2]   : 1 = drive the saturated partial sums onto vrd
//   funct[1:0] : index of the partial sum that accumulates the product
// Additional Comments:
module VMAC(
    input             clk, rst, en,
    input      [31:0] vrs1,
    input      [31:0] vrs2,
    input      [3:0]  funct,
    output reg [31:0] vrd
);

    // Declared signed so that the MAC arithmetic and the saturation
    // comparisons below are evaluated as signed expressions.
    reg signed [31:0] psum [3:0];
    reg signed [31:0] product;
    integer i, j;

    always @(*) begin
        if (!funct[3]) begin // mac, do not set bias
            product = 0;
            for (i = 0; i < 4; i = i + 1)
                product = product + $signed(vrs1[i*8+:8]) * $signed(vrs2[i*8+:8]);

            // output, shift and truncate
            if (funct[2] == 1'b1) begin
                for (i = 0; i < 4; i = i + 1) begin
                    // shift 0, truncate: the lower 8 bits are the output;
                    // on overflow, saturate and keep the sign bit
                    if (psum[i] > 32'sd127)
                        vrd[i*8+:8] = 8'b01111111;
                    // this condition is missing from the source listing; reconstructed
                    else if (psum[i] < -32'sd128)
                        vrd[i*8+:8] = 8'b10000000;
                    else
                        vrd[i*8+:8] = psum[i][7:0];
                end
            end
            else begin
                vrd = 0;
            end
        end
        else begin
            product = 0;
            vrd = 0;
        end
    end

    // write psum (scratch pad)
    always @(posedge clk or negedge rst) begin
        if (!rst) begin
            for (j = 0; j < 4; j = j + 1)
                psum[j] <= 0;
        end
        else if (en) begin
            if (funct[3] == 1'b1) // set bias
                for (j = 0; j < 4; j = j + 1)
                    psum[j] <= $signed(vrs1[j*8+:8]);
            else begin // 4 mac
                psum[funct[1:0]] <= psum[funct[1:0]] + product;
            end
        end
    end
endmodule
```
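As a cross-check on the module above, its behavior can be mirrored in a small software model. The following Python sketch is not part of the submitted design: the names `VmacModel`, `lanes`, `pack` and `saturate_int8` are hypothetical, lane 0 is assumed to occupy bits 7:0 of each 32-bit word (as the `i*8+:8` slices imply), and 32-bit wraparound of the partial sums is ignored.

```python
def lanes(word):
    """Split a 32-bit word into four signed 8-bit lanes (lane 0 = bits 7:0)."""
    out = []
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        out.append(byte - 256 if byte >= 128 else byte)
    return out


def pack(values):
    """Pack four signed 8-bit values into one 32-bit word (lane 0 = bits 7:0)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def saturate_int8(x):
    """Clamp an integer to the signed 8-bit range."""
    return max(-128, min(127, x))


class VmacModel:
    """Behavioral model of the VMAC module (assumed semantics, see lead-in)."""

    def __init__(self):
        self.psum = [0, 0, 0, 0]  # mirrors the four partial-sum registers

    def clock(self, vrs1, vrs2, funct):
        """Model one enabled rising clock edge."""
        if funct & 0b1000:  # funct[3]: set bias
            self.psum = lanes(vrs1)
        else:  # MAC into the partial sum selected by funct[1:0]
            product = sum(a * b for a, b in zip(lanes(vrs1), lanes(vrs2)))
            self.psum[funct & 0b11] += product

    def vrd(self, funct):
        """Combinational output for the current partial sums."""
        if funct & 0b1000 or not funct & 0b0100:
            return 0
        return pack([saturate_int8(p) for p in self.psum])
```

For example, loading the bias `[1, -2, 3, -4]` and accumulating a product of 200 into lane 1 leaves that lane at 198, which the output stage clamps to 127.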

#### VMAC simulation:



Figure 1: Simulation (zoom in for details)

The simulation data and waveform are shown in Fig. 1, and the result *vrd* is correct.

(a) Write the entire assembly code for the computation. (Hint: an 8-entry vector register file is not sufficient for an 8x8 matrix.)

The inputs A, B and C can be written in partitioned form as P, Q and R, following their storage layout, as shown in Fig. 2 (zoom in for details). An 8x8 matrix multiplication can therefore be computed with a sequence of 4x4 VMAC operations.
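The partitioning can be checked numerically. The Python sketch below assumes, as one reading of Fig. 2, that each 8-element dot product is split into two 4-element halves that are accumulated on top of the corresponding bias element. The function names are illustrative, and saturation is left out so that the comparison with the direct computation is exact.

```python
N = 8      # matrix dimension
LANES = 4  # elements handled by one VMAC


def reference_matmul_add(A, B, C):
    """Direct computation of A*B + C for N x N integer matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(N)) + C[i][j]
             for j in range(N)]
            for i in range(N)]


def blocked_matmul_add(A, B, C):
    """Same result, built from 4-element partial dot products."""
    Y = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            acc = C[i][j]  # start from the bias, like the set-bias step
            for h in range(N // LANES):
                a_chunk = A[i][h * LANES:(h + 1) * LANES]            # half of row i of A
                b_chunk = [B[h * LANES + t][j] for t in range(LANES)]  # half of column j of B
                acc += sum(x * y for x, y in zip(a_chunk, b_chunk))
            Y[i][j] = acc
    return Y
```

Both functions return the same matrix for any integer inputs, which is the property the partitioned form relies on.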

The assembly code is then written as follows. Each block of the assembly code (e.g. lines 6 to 26, 28 to 48, ...) computes four consecutive elements of one row of the output matrix Y.



Figure 2: 8x8 matrix multiplication using a 4x4 VMAC (zoom in for details)
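Because every block of the assembly follows the same 21-instruction pattern, the unrolled program can also be described by a short generator. The Python sketch below is only an illustration of that pattern: the function names are hypothetical, the address arithmetic is inferred from the immediates in the listing, and the total of 16 blocks is an assumption based on 64 outputs at four per block.

```python
def block(k):
    """Instructions for output word k (row k // 2, column half k % 2)."""
    row, half = divmod(k, 2)
    lines = [f"Vload vr0, r1, ${k}",       # bias word for this block
             "VMAC /, vr0, /, $1000b"]     # set bias
    for h in range(2):                     # two 4-element halves of the dot product
        for c in range(4):                 # four weight words, one per output lane
            lines.append(f"Vload vr{c}, r2, ${8 * half + 2 * c + h}")
        lines.append(f"Vload vr4, r3, ${2 * row + h}")  # input word
        for c in range(4):
            last = h == 1 and c == 3       # final MAC also writes the result to vr7
            vrd = "vr7" if last else "/"
            funct = (0b0100 if last else 0) | c
            lines.append(f"VMAC {vrd}, vr{c}, vr4, ${funct:04b}b")
    lines.append(f"Vstore ${k}, r4, vr7")
    return lines


def program(num_blocks=16):
    """Setup MOVs followed by the unrolled blocks (16 blocks assumed)."""
    setup = [f"MOV r{n}, $0, [{name}], $0"
             for n, name in enumerate(["AB", "AW", "AX", "AY"], start=1)]
    return setup + [line for k in range(num_blocks) for line in block(k)]
```

Calling `block(0)` reproduces the instruction sequence of the first block, and `block(1)` switches the weight addresses to `$8` through `$15` while reusing the same input words.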

### **Assembly Code:**

```
  1  MOV    r1, $0, [AB], $0       // address bias (C)R
  2  MOV    r2, $0, [AW], $0       // address weights (B)Q
  3  MOV    r3, $0, [AX], $0       // address input X (A)P
  4  MOV    r4, $0, [AY], $0       // address output Y = AB+C

  6  Vload  vr0, r1, $0            // load bias: R00
  7  VMAC   /, vr0, /, $1000b      // mac init, set bias
  8  Vload  vr0, r2, $0            // load weights: Q00
  9  Vload  vr1, r2, $2            // load weights: Q01
 10  Vload  vr2, r2, $4            // load weights: Q02
 11  Vload  vr3, r2, $6            // load weights: Q03
 12  Vload  vr4, r3, $0            // load input: P00
 13  VMAC   /, vr0, vr4, $0000b
 14  VMAC   /, vr1, vr4, $0001b
 15  VMAC   /, vr2, vr4, $0010b
 16  VMAC   /, vr3, vr4, $0011b
 17  Vload  vr0, r2, $1            // load weights: Q10
 18  Vload  vr1, r2, $3            // load weights: Q11
 19  Vload  vr2, r2, $5            // load weights: Q12
 20  Vload  vr3, r2, $7            // load weights: Q13
 21  Vload  vr4, r3, $1            // load input: P01
 22  VMAC   /, vr0, vr4, $0000b
 23  VMAC   /, vr1, vr4, $0001b
 24  VMAC   /, vr2, vr4, $0010b
 25  VMAC   vr7, vr3, vr4, $0111b
 26  Vstore $0, r4, vr7            // store output:Y00,Y01,Y02,Y03

 28  Vload  vr0, r1, $1            // load bias: R01
 29  VMAC   /, vr0, /, $1000b      // mac init, set bias
 30  Vload  vr0, r2, $8            // load weights: Q04
 31  Vload  vr1, r2, $10           // load weights: Q05
 32  Vload  vr2, r2, $12           // load weights: Q06
 33  Vload  vr3, r2, $14           // load weights: Q07
 34  Vload  vr4, r3, $0            // load input: P00
 35  VMAC   /, vr0, vr4, $0000b
 36  VMAC   /, vr1, vr4, $0001b
 37  VMAC   /, vr2, vr4, $0010b
 38  VMAC   /, vr3, vr4, $0011b
 39  Vload  vr0, r2, $9            // load weights: Q14
 40  Vload  vr1, r2, $11           // load weights: Q15
 41  Vload  vr2, r2, $13           // load weights: Q16
 42  Vload  vr3, r2, $15           // load weights: Q17
 43  Vload  vr4, r3, $1            // load input: P01
 44  VMAC   /, vr0, vr4, $0000b
 45  VMAC   /, vr1, vr4, $0001b
 46  VMAC   /, vr2, vr4, $0010b
 47  VMAC   vr7, vr3, vr4, $0111b
 48  Vstore $1, r4, vr7            // store output:Y04,Y05,Y06,Y07

 50  Vload  vr0, r1, $2            // load bias: R10
 51  VMAC   /, vr0, /, $1000b      // mac init, set bias
 52  Vload  vr0, r2, $0            // load weights: Q00
 53  Vload  vr1, r2, $2            // load weights: Q01
 54  Vload  vr2, r2, $4            // load weights: Q02
 55  Vload  vr3, r2, $6            // load weights: Q03
 56  Vload  vr4, r3, $2            // load input: P10
 57  VMAC   /, vr0, vr4, $0000b
 58  VMAC   /, vr1, vr4, $0001b
 59  VMAC   /, vr2, vr4, $0010b
 60  VMAC   /, vr3, vr4, $0011b
 61  Vload  vr0, r2, $1            // load weights: Q10
 62  Vload  vr1, r2, $3            // load weights: Q11
 63  Vload  vr2, r2, $5            // load weights: Q12
 64  Vload  vr3, r2, $7            // load weights: Q13
 65  Vload  vr4, r3, $3            // load input: P11
 66  VMAC   /, vr0, vr4, $0000b
 67  VMAC   /, vr1, vr4, $0001b
 68  VMAC   /, vr2, vr4, $0010b
 69  VMAC   vr7, vr3, vr4, $0111b
 70  Vstore $2, r4, vr7            // store output:Y10,Y11,Y12,Y13

 72  Vload  vr0, r1, $3            // load bias: R11
 73  VMAC   /, vr0, /, $1000b      // mac init, set bias
 74  Vload  vr0, r2, $8            // load weights: Q04
 75  Vload  vr1, r2, $10           // load weights: Q05
 76  Vload  vr2, r2, $12           // load weights: Q06
 77  Vload  vr3, r2, $14           // load weights: Q07
 78  Vload  vr4, r3, $2            // load input: P10
 79  VMAC   /, vr0, vr4, $0000b
 80  VMAC   /, vr1, vr4, $0001b
 81  VMAC   /, vr2, vr4, $0010b
 82  VMAC   /, vr3, vr4, $0011b
 83  Vload  vr0, r2, $9            // load weights: Q14
 84  Vload  vr1, r2, $11           // load weights: Q15
 85  Vload  vr2, r2, $13           // load weights: Q16
 86  Vload  vr3, r2, $15           // load weights: Q17
 87  Vload  vr4, r3, $3            // load input: P11
 88  VMAC   /, vr0, vr4, $0000b
 89  VMAC   /, vr1, vr4, $0001b
 90  VMAC   /, vr2, vr4, $0010b
 91  VMAC   vr7, vr3, vr4, $0111b
 92  Vstore $3, r4, vr7            // store output:Y14,Y15,Y16,Y17

 94  Vload  vr0, r1, $4            // load bias: R20
 95  VMAC   /, vr0, /, $1000b      // mac init, set bias
 96  Vload  vr0, r2, $0            // load weights: Q00
 97  Vload  vr1, r2, $2            // load weights: Q01
 98  Vload  vr2, r2, $4            // load weights: Q02
 99  Vload  vr3, r2, $6            // load weights: Q03
100  Vload  vr4, r3, $4            // load input: P20
101  VMAC   /, vr0, vr4, $0000b
102  VMAC   /, vr1, vr4, $0001b
103  VMAC   /, vr2, vr4, $0010b
104  VMAC   /, vr3, vr4, $0011b
105  Vload  vr0, r2, $1            // load weights: Q10
106  Vload  vr1, r2, $3            // load weights: Q11
107  Vload  vr2, r2, $5            // load weights: Q12
108  Vload  vr3, r2, $7            // load weights: Q13
109  Vload  vr4, r3, $5            // load input: P21
110  VMAC   /, vr0, vr4, $0000b
111  VMAC   /, vr1, vr4, $0001b
112  VMAC   /, vr2, vr4, $0010b
113  VMAC   vr7, vr3, vr4, $0111b
114  Vstore $4, r4, vr7            // store output:Y20,Y21,Y22,Y23

116  Vload  vr0, r1, $5            // load bias: R21
117  VMAC   /, vr0, /, $1000b      // mac init, set bias
118  Vload  vr0, r2, $8            // load weights: Q04
119  Vload  vr1, r2, $10           // load weights: Q05
120  Vload  vr2, r2, $12           // load weights: Q06
121  Vload  vr3, r2, $14           // load weights: Q07
122  Vload  vr4, r3, $4            // load input: P20
123  VMAC   /, vr0, vr4, $0000b
124  VMAC   /, vr1, vr4, $0001b
125  VMAC   /, vr2, vr4, $0010b
126  VMAC   /, vr3, vr4, $0011b
127  Vload  vr0, r2, $9            // load weights: Q14
128  Vload  vr1, r2, $11           // load weights: Q15
129  Vload  vr2, r2, $13           // load weights: Q16
130  Vload  vr3, r2, $15           // load weights: Q17
131  Vload  vr4, r3, $5            // load input: P21
132  VMAC   /, vr0, vr4, $0000b
133  VMAC   /, vr1, vr4, $0001b
134  VMAC   /, vr2, vr4, $0010b
135  VMAC   vr7, vr3, vr4, $0111b
136  Vstore $5, r4, vr7            // store output:Y24,Y25,Y26,Y27

138  Vload  vr0, r1, $6            // load bias: R30
139  VMAC   /, vr0, /, $1000b      // mac init, set bias
140  Vload  vr0, r2, $0            // load weights: Q00
141  Vload  vr1, r2, $2            // load weights: Q01
142  Vload  vr2, r2, $4            // load weights: Q02
143  Vload  vr3, r2, $6            // load weights: Q03
144  Vload  vr4, r3, $6            // load input: P30
145  VMAC   /, vr0, vr4, $0000b
146  VMAC   /, vr1, vr4, $0001b
147  VMAC   /, vr2, vr4, $0010b
148  VMAC   /, vr3, vr4, $0011b
149  Vload  vr0, r2, $1            // load weights: Q10
150  Vload  vr1, r2, $3            // load weights: Q11
151  Vload  vr2, r2, $5            // load weights: Q12
152  Vload  vr3, r2, $7            // load weights: Q13
153  Vload  vr4, r3, $7            // load input: P31
154  VMAC   /, vr0, vr4, $0000b
155  VMAC   /, vr1, vr4, $0001b
156  VMAC   /, vr2, vr4, $0010b
157  VMAC   vr7, vr3, vr4, $0111b
158  Vstore $6, r4, vr7            // store output:Y30,Y31,Y32,Y33

160  Vload  vr0, r1, $7            // load bias: R31
161  VMAC   /, vr0, /, $1000b      // mac init, set bias
162  Vload  vr0, r2, $8            // load weights: Q04
163  Vload  vr1, r2, $10           // load weights: Q05
164  Vload  vr2, r2, $12           // load weights: Q06
165  Vload  vr3, r2, $14           // load weights: Q07
166  Vload  vr4, r3, $6            // load input: P30
167  VMAC   /, vr0, vr4, $0000b
168  VMAC   /, vr1, vr4, $0001b
169  VMAC   /, vr2, vr4, $0010b
170  VMAC   /, vr3, vr4, $0011b
171  Vload  vr0, r2, $9            // load weights: Q14
172  Vload  vr1, r2, $11           // load weights: Q15
173  Vload  vr2, r2, $13           // load weights: Q16
174  Vload  vr3, r2, $15           // load weights: Q17
175  Vload  vr4, r3, $7            // load input: P31
176  VMAC   /, vr0, vr4, $0000b
177  VMAC   /, vr1, vr4, $0001b
178  VMAC   /, vr2, vr4, $0010b
179  VMAC   vr7, vr3, vr4, $0111b
180  Vstore $7, r4, vr7            // store output:Y34,Y35,Y36,Y37

182  Vload  vr0, r1, $8            // load bias: R40
183  VMAC   /, vr0, /, $1000b      // mac init, set bias
184  Vload  vr0, r2, $0            // load weights: Q00
185  Vload  vr1, r2, $2            // load weights: Q01
186  Vload  vr2, r2, $4            // load weights: Q02
187  Vload  vr3, r2, $6            // load weights: Q03
188  Vload  vr4, r3, $8            // load input: P40
189  VMAC   /, vr0, vr4, $0000b
190  VMAC   /, vr1, vr4, $0001b
191  VMAC   /, vr2, vr4, $0010b
192  VMAC   /, vr3, vr4, $0011b
193  Vload  vr0, r2, $1            // load weights: Q10
194  Vload  vr1, r2, $3            // load weights: Q11
195  Vload  vr2, r2, $5            // load weights: Q12
196  Vload  vr3, r2, $7            // load weights: Q13
197  Vload  vr4, r3, $9            // load input: P41
198  VMAC   /, vr0, vr4, $0000b
199  VMAC   /, vr1, vr4, $0001b
200  VMAC   /, vr2, vr4, $0010b
201  VMAC   vr7, vr3, vr4, $0111b
202  Vstore $8, r4, vr7            // store output:Y40,Y41,Y42,Y43

204  Vload  vr0, r1, $9            // load bias: R41
205  VMAC   /, vr0, /, $1000b      // mac init, set bias
206  Vload  vr0, r2, $8            // load weights: Q04
207  Vload  vr1, r2, $10           // load weights: Q05
208  Vload  vr2, r2, $12           // load weights: Q06
209  Vload  vr3, r2, $14           // load weights: Q07
210  Vload  vr4, r3, $8            // load input: P40
211  VMAC   /, vr0, vr4, $0000b
212  VMAC   /, vr1, vr4, $0001b
213  VMAC   /, vr2, vr4, $0010b
214  VMAC   /, vr3, vr4, $0011b
215  Vload  vr0, r2, $9            // load weights: Q14
216  Vload  vr1, r2, $11           // load weights: Q15
217  Vload  vr2, r2, $13           // load weights: Q16
218  Vload  vr3, r2, $15           // load weights: Q17
219  Vload  vr4, r3, $9            // load input: P41
220  VMAC   /, vr0, vr4, $0000b
221  VMAC   /, vr1, vr4, $0001b
222  VMAC   /, vr2, vr4, $0010b
223  VMAC   vr7, vr3, vr4, $0111b
224  Vstore $9, r4, vr7            // store output:Y44,Y45,Y46,Y47

226  Vload  vr0, r1, $10           // load bias: R50
227  VMAC   /, vr0, /, $1000b      // mac init, set bias
228  Vload  vr0, r2, $0            // load weights: Q00
229  Vload  vr1, r2, $2            // load weights: Q01
230  Vload  vr2, r2, $4            // load weights: Q02
231  Vload  vr3, r2, $6            // load weights: Q03
232  Vload  vr4, r3, $10           // load input: P50
233  VMAC   /, vr0, vr4, $0000b
234  VMAC   /, vr1, vr4, $0001b
235  VMAC   /, vr2, vr4, $0010b
236  VMAC   /, vr3, vr4, $0011b
237  Vload  vr0, r2, $1            // load weights: Q10
238  Vload  vr1, r2, $3            // load weights: Q11
239  Vload  vr2, r2, $5            // load weights: Q12
240  Vload  vr3, r2, $7            // load weights: Q13
241  Vload  vr4, r3, $11           // load input: P51
242  VMAC   /, vr0, vr4, $0000b
243  VMAC   /, vr1, vr4, $0001b
244  VMAC   /, vr2, vr4, $0010b
245  VMAC   vr7, vr3, vr4, $0111b
246  Vstore $10, r4, vr7           // store output:Y50,Y51,Y52,Y53

248  Vload  vr0, r1, $11           // load bias: R51
249  VMAC   /, vr0, /, $1000b      // mac init, set bias
             /,
249
    Vload
             vr0,
                    r2,
                           $8
                                          // load weights: Q04
250
    Vload
                          $10
                                          // load weights: Q05
251
             vr1,
                    r2,
```

```
252     Vload   vr2, r2, $12            // load weights: Q06
253     Vload   vr3, r2, $14            // load weights: Q07
254     Vload   vr4, r3, $10            // load input: P50
255     VMAC    /, vr0, vr4, $0000b
256     VMAC    /, vr1, vr4, $0001b
257     VMAC    /, vr2, vr4, $0010b
258     VMAC    /, vr3, vr4, $0011b
259     Vload   vr0, r2, $9             // load weights: Q14
260     Vload   vr1, r2, $11            // load weights: Q15
261     Vload   vr2, r2, $13            // load weights: Q16
262     Vload   vr3, r2, $15            // load weights: Q17
263     Vload   vr4, r3, $11            // load input: P51
264     VMAC    /, vr0, vr4, $0000b
265     VMAC    /, vr1, vr4, $0001b
266     VMAC    /, vr2, vr4, $0010b
267     VMAC    vr7, vr3, vr4, $0111b
268     Vstore  $11, r4, vr7            // store output: Y54, Y55, Y56, Y57

270     Vload   vr0, r1, $12            // load bias: R60
271     VMAC    /, vr0, /, $1000b       // mac init, set bias
272     Vload   vr0, r2, $0             // load weights: Q00
273     Vload   vr1, r2, $2             // load weights: Q01
274     Vload   vr2, r2, $4             // load weights: Q02
275     Vload   vr3, r2, $6             // load weights: Q03
276     Vload   vr4, r3, $12            // load input: P60
277     VMAC    /, vr0, vr4, $0000b
278     VMAC    /, vr1, vr4, $0001b
279     VMAC    /, vr2, vr4, $0010b
280     VMAC    /, vr3, vr4, $0011b
281     Vload   vr0, r2, $1             // load weights: Q10
282     Vload   vr1, r2, $3             // load weights: Q11
283     Vload   vr2, r2, $5             // load weights: Q12
284     Vload   vr3, r2, $7             // load weights: Q13
285     Vload   vr4, r3, $13            // load input: P61
286     VMAC    /, vr0, vr4, $0000b
287     VMAC    /, vr1, vr4, $0001b
288     VMAC    /, vr2, vr4, $0010b
289     VMAC    vr7, vr3, vr4, $0111b
290     Vstore  $12, r4, vr7            // store output: Y60, Y61, Y62, Y63

292     Vload   vr0, r1, $13            // load bias: R61
293     VMAC    /, vr0, /, $1000b       // mac init, set bias
294     Vload   vr0, r2, $8             // load weights: Q04
295     Vload   vr1, r2, $10            // load weights: Q05
296     Vload   vr2, r2, $12            // load weights: Q06
297     Vload   vr3, r2, $14            // load weights: Q07
298     Vload   vr4, r3, $12            // load input: P60
299     VMAC    /, vr0, vr4, $0000b
300     VMAC    /, vr1, vr4, $0001b
301     VMAC    /, vr2, vr4, $0010b
302     VMAC    /, vr3, vr4, $0011b
303     Vload   vr0, r2, $9             // load weights: Q14
304     Vload   vr1, r2, $11            // load weights: Q15
305     Vload   vr2, r2, $13            // load weights: Q16
306     Vload   vr3, r2, $15            // load weights: Q17
307     Vload   vr4, r3, $13            // load input: P61
308     VMAC    /, vr0, vr4, $0000b
309     VMAC    /, vr1, vr4, $0001b
310     VMAC    /, vr2, vr4, $0010b
311     VMAC    vr7, vr3, vr4, $0111b
312     Vstore  $13, r4, vr7            // store output: Y64, Y65, Y66, Y67

314     Vload   vr0, r1, $14            // load bias: R70
```

```
315     VMAC    /, vr0, /, $1000b       // mac init, set bias
316     Vload   vr0, r2, $0             // load weights: Q00
317     Vload   vr1, r2, $2             // load weights: Q01
318     Vload   vr2, r2, $4             // load weights: Q02
319     Vload   vr3, r2, $6             // load weights: Q03
320     Vload   vr4, r3, $14            // load input: P70
321     VMAC    /, vr0, vr4, $0000b
322     VMAC    /, vr1, vr4, $0001b
323     VMAC    /, vr2, vr4, $0010b
324     VMAC    /, vr3, vr4, $0011b
325     Vload   vr0, r2, $1             // load weights: Q10
326     Vload   vr1, r2, $3             // load weights: Q11
327     Vload   vr2, r2, $5             // load weights: Q12
328     Vload   vr3, r2, $7             // load weights: Q13
329     Vload   vr4, r3, $15            // load input: P71
330     VMAC    /, vr0, vr4, $0000b
331     VMAC    /, vr1, vr4, $0001b
332     VMAC    /, vr2, vr4, $0010b
333     VMAC    vr7, vr3, vr4, $0111b
334     Vstore  $14, r4, vr7            // store output: Y70, Y71, Y72, Y73

336     Vload   vr0, r1, $15            // load bias: R71
337     VMAC    /, vr0, /, $1000b       // mac init, set bias
338     Vload   vr0, r2, $8             // load weights: Q04
339     Vload   vr1, r2, $10            // load weights: Q05
340     Vload   vr2, r2, $12            // load weights: Q06
341     Vload   vr3, r2, $14            // load weights: Q07
342     Vload   vr4, r3, $14            // load input: P70
343     VMAC    /, vr0, vr4, $0000b
344     VMAC    /, vr1, vr4, $0001b
345     VMAC    /, vr2, vr4, $0010b
346     VMAC    /, vr3, vr4, $0011b
347     Vload   vr0, r2, $9             // load weights: Q14
348     Vload   vr1, r2, $11            // load weights: Q15
349     Vload   vr2, r2, $13            // load weights: Q16
350     Vload   vr3, r2, $15            // load weights: Q17
351     Vload   vr4, r3, $15            // load input: P71
352     VMAC    /, vr0, vr4, $0000b
353     VMAC    /, vr1, vr4, $0001b
354     VMAC    /, vr2, vr4, $0010b
355     VMAC    vr7, vr3, vr4, $0111b
356     Vstore  $15, r4, vr7            // store output: Y74, Y75, Y76, Y77
```
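The unrolled listing above repeats one 21-instruction procedure for each output row and each half of that row. The following Python sketch regenerates that pattern. It is only an illustration: the function name `procedure` and its parameters `row` and `half` are hypothetical, and the address arithmetic is read off the listing, not taken from a specification of the ISA.

```python
def procedure(row, half):
    """Return the 21 instructions that produce Y[row][4*half .. 4*half+3].

    Assumed from the listing: r1, r2, r3, r4 hold the base addresses of the
    bias, weight, input and output data, and every address selects one
    4-element vector word.
    """
    out = 2 * row + half  # vector address of the bias word and of the result
    code = [
        f"Vload vr0, r1, ${out}",   # load bias
        "VMAC /, vr0, /, $1000b",   # mac init, set bias
    ]
    for k in (0, 1):                # the two 4-element halves of the input row
        for j in range(4):          # four weight vectors per half
            code.append(f"Vload vr{j}, r2, ${8 * half + 2 * j + k}")
        code.append(f"Vload vr4, r3, ${2 * row + k}")
        for j in range(4):
            last = (k == 1 and j == 3)   # only the final VMAC writes a result
            dst = "vr7" if last else "/"
            funct = "0111" if last else format(j, "04b")
            code.append(f"VMAC {dst}, vr{j}, vr4, ${funct}b")
    code.append(f"Vstore ${out}, r4, vr7")
    return code


# All 16 procedures: 8 rows, 2 halves per row.
program = [procedure(r, h) for r in range(8) for h in (0, 1)]

if __name__ == "__main__":
    print("\n".join(procedure(7, 1)))
    print(sum(len(p) for p in program), "instructions, excluding the MOV setup")
```

Calling `procedure(7, 1)` reproduces the final block of the listing, ending in `Vstore $15, r4, vr7`.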

# (b) Propose a superscalar strategy (maximum 2 instructions per fetch), and calculate how many cycles are needed. Compare the utilization ratio with and without the superscalar strategy.

#### Superscalar strategy:

The hardware architecture is shown in Fig. 3, and the superscalar strategy is shown in Fig. 4. The IF stage is triggered on both clock edges when two instructions are fetched in one cycle, and on the positive edge only otherwise. A second fetch happens in exactly one case: when a VMAC instruction without output is fetched on the positive edge, the Vload instruction that follows it is fetched on the negative edge of the same cycle. In every other case the next instruction is fetched on the next positive edge.

Note that the WB stages of a VMAC instruction and of the Vload paired with it fall in the same clock cycle. This causes no conflict, because a VMAC without output writes nothing back in its WB stage. A VMAC that does produce output is always executed as a single instruction.
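The pairing rule can be made concrete with a small fetch-cycle counter. This is a minimal sketch under simplifying assumptions: instructions are reduced to opcode strings, `"VMAC"` stands for a VMAC without output and `"VMAC_out"` for one that writes a result (both labels are hypothetical), and data hazards and pipeline fill are not modelled.

```python
def fetch_cycles(instrs):
    """Count fetch cycles when a no-output VMAC shares its cycle with a following Vload."""
    cycles = 0
    i = 0
    while i < len(instrs):
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if instrs[i] == "VMAC" and nxt == "Vload":
            i += 2  # VMAC on the positive edge, Vload on the negative edge
        else:
            i += 1  # single fetch on the positive edge
        cycles += 1
    return cycles


if __name__ == "__main__":
    seq = ["Vload", "VMAC", "Vload", "VMAC_out", "Vstore"]
    print(fetch_cycles(seq), "fetch cycles for", len(seq), "instructions")
```

In this toy sequence only the middle `VMAC`/`Vload` pair shares a cycle, so five instructions need four fetch cycles. How many pairs form in a real procedure depends on the instruction order chosen in Fig. 4.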

Figure 3: Hardware architecture

Figure 4: Superscalar strategy (zoom in for details)

#### Cycles and utilization ratio:

Originally, each VMAC computing procedure in the assembly code (e.g. lines 6 to 26, 28 to 48, ...), which computes 4 consecutive elements of one row of the output matrix $\mathbf{Y}$, consists of 21 instructions, 9 of which are VMAC instructions, and takes 23 cycles when run on its own. We need 16 such procedures to produce the whole $8 \times 8$ matrix $\mathbf{Y}$. Together with the 4 MOV instructions that set the starting addresses of $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ and $\mathbf{Y}$, the program uses $4 + 21 \times 16 = 340$ instructions. Since all instructions execute sequentially, the cycle count $C_{org}$ and the utilization ratio $U_{org}$ are:

$$C_{org} = 4 + 21 \times 16 + 2 = 342 \ \text{cycles}$$

$$U_{org} = (9 \times 16)/C_{org} = 0.421$$

With the superscalar strategy, each procedure takes 8 fewer cycles, so the cycle count $C_{ss}$ and the utilization ratio $U_{ss}$ are:

$$C_{ss} = 4 + (21 - 8) \times 16 + 2 = 214 \ \text{cycles}$$

$$U_{ss} = (9 \times 16)/C_{ss} = 0.673$$

The utilization rises by $U_{ss}/U_{org} - 1 = 342/214 - 1 \approx 59.8\%$ with the superscalar strategy.
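The arithmetic above can be checked with a few lines of Python. This is a sketch only: the constant names are hypothetical, and `EXTRA_CYCLES` is simply the `+ 2` term of the formulas, assumed here to be pipeline fill.

```python
PROCEDURES = 16        # procedures needed for the 8x8 output
INSTRS_PER_PROC = 21   # instructions per procedure
VMACS_PER_PROC = 9     # VMAC instructions per procedure
SETUP_MOVS = 4         # MOV instructions that set the base addresses
EXTRA_CYCLES = 2       # the "+ 2" term in the text (assumed pipeline fill)
SAVED_PER_PROC = 8     # cycles saved per procedure by dual fetch


def cycles(saved_per_proc=0):
    """Total cycle count for the whole program."""
    return SETUP_MOVS + (INSTRS_PER_PROC - saved_per_proc) * PROCEDURES + EXTRA_CYCLES


def utilization(total_cycles):
    """Fraction of cycles in which the MAC unit executes a VMAC."""
    return VMACS_PER_PROC * PROCEDURES / total_cycles


c_org = cycles()
c_ss = cycles(SAVED_PER_PROC)
u_org = utilization(c_org)
u_ss = utilization(c_ss)
gain = u_ss / u_org - 1

if __name__ == "__main__":
    print(f"C_org = {c_org}, U_org = {u_org:.3f}")
    print(f"C_ss  = {c_ss}, U_ss  = {u_ss:.3f}")
    print(f"relative gain = {gain:.1%}")
```

Because the number of VMAC instructions is the same in both cases, the relative gain reduces to the ratio of the two cycle counts minus one.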