#### **MLBlocks:** Arming FPGA architectures with dense, low-precision units in the classic column-based manner



- 6 times more 8x8 multipliers than a DSP block (which provides two 8x8).

RS dataflow, high frequency, flexible data movement. Great for SConv, DWConv, PWConv, and matrix-matrix multiplication.

- DSP/BRAM ratio of 1/1 (same as the UltraScale+ architecture); low number of intermediate outputs in practice
- Parameterized (for any budget limitation); can integrate the multi-precision idea
- 2- Compare with the cascade paper (Prof. Nachiket)
- 3- New suggestion: use each 18K BRAM as a 36-bit streamer driven by an external controller circuit (delivering 662MHz); the cascade paper uses 18 bits

| UltraScale+ architecture distribution: | URAM | BRAM18 | DSP48 | BRAM18 | DSP48 | BRAM18 | DSP48 |  |
|----------------------------------------|------|--------|-------|--------|-------|--------|-------|--|
|                                        |      | BRAM18 | DSP48 | BRAM18 | DSP48 | BRAM18 | DSP48 |  |
|                                        | URAM | BRAM18 | DSP48 | BRAM18 | DSP48 | BRAM18 | DSP48 |  |
|                                        | URAM | BRAM18 | DSP48 | BRAM18 | DSP48 | BRAM18 | DSP48 |  |
|                                        |      | BRAM18 | DSP48 | BRAM18 | DSP48 | BRAM18 | DSP48 |  |
|                                        | URAM | BRAM18 | DSP48 | BRAM18 | DSP48 | BRAM18 | DSP48 |  |
|                                        | URAM | BRAM18 | DSP48 | BRAM18 | DSP48 | BRAM18 | DSP48 |  |
|                                        |      | BRAM18 | DSP48 | BRAM18 | DSP48 | BRAM18 | DSP48 |  |
|                                        | URAM | BRAM18 | DSP48 | BRAM18 | DSP48 | BRAM18 | DSP48 |  |
|                                        | URAM | BRAM18 | DSP48 | BRAM18 | DSP48 | BRAM18 | DSP48 |  |
|                                        |      | BRAM18 | DSP48 | BRAM18 | DSP48 | BRAM18 | DSP48 |  |
|                                        | URAM | BRAM18 | DSP48 | BRAM18 | DSP48 | BRAM18 | DSP48 |  |

- Virtex 7 $\rightarrow$ 28nm, DSP48E1 Fmax = 742MHz
- Virtex U $\rightarrow$ 20nm
- Virtex UP $\rightarrow$ 16nm









# My amazing MLBlocks world











Disadvantages of parallel:

1- More fan-in and fan-out (acceptable, since we are talking about small PEs)

Disadvantages of systolic:

1- Tougher, rhythmic scheduling

2- Prevents circuit fusion (less optimization)

$$B = B_{Seq} \times B_{par} \times B_{Sys}$$

- Number of physical MACs: $\times B_{par} \times B_{Sys}$
- Number of inputs: $\times B_{par}$
- Number of outputs: $\times B_{par}$

(without internal serial-to-parallel conversion)
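The scaling rules above can be sketched in a few lines of Python. This is only an illustration under the stated relations; the class name `Unrolling`, its field names, and the example factors are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unrolling:
    """Split of one loop bound B into sequential, parallel and systolic factors."""
    b_seq: int  # iterations executed one after another on the same hardware
    b_par: int  # iterations unrolled as parallel lanes
    b_sys: int  # iterations unrolled along a systolic chain

    @property
    def total(self) -> int:
        # B = B_Seq * B_par * B_Sys
        return self.b_seq * self.b_par * self.b_sys

    @property
    def mac_factor(self) -> int:
        # Physical MAC count scales with the parallel and systolic factors.
        return self.b_par * self.b_sys

    @property
    def input_factor(self) -> int:
        # Input count scales with the parallel factor only.
        return self.b_par

    @property
    def output_factor(self) -> int:
        # Output count scales with the parallel factor only.
        return self.b_par


u = Unrolling(b_seq=4, b_par=3, b_sys=2)  # example values, chosen arbitrarily
print(u.total, u.mac_factor, u.input_factor, u.output_factor)
```

The sequential factor does not appear in any hardware cost here, which matches the rules above: it only stretches the schedule.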

#### Params = {right side indexes}

```
for param_sch_0[i] = 0 → sch_0[i]
  for param_sch_1[i] = 0 → sch_1[i]
    for param_sch_2[i] = 0 → sch_2[i]
      for param_seq[i] = 0 → comp_seq[i]
        for param_un[i] = 0 → comp_un[i]
```
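A minimal Python sketch of how such a loop nest could tile one parameter. It assumes the five loops are nested in the listed order and that the original index is recovered as a mixed-radix number of the loop counters; the function name `tiled_indices` and the example bounds are hypothetical.

```python
from itertools import product


def tiled_indices(sch, comp_seq, comp_un):
    """Yield the original loop index for each point of the tiled loop nest.

    sch      -- bounds of the scheduling loops (outermost first)
    comp_seq -- bound of the sequential loop inside the PE
    comp_un  -- bound of the unrolled loop inside the PE (innermost)
    """
    bounds = [*sch, comp_seq, comp_un]
    # product() varies the last range fastest, like a nested for-loop.
    for point in product(*(range(b) for b in bounds)):
        index = 0
        for coord, bound in zip(point, bounds):
            index = index * bound + coord  # mixed-radix recombination
        yield index


indices = list(tiled_indices(sch=(2, 1, 3), comp_seq=2, comp_un=4))
print(len(indices), indices[:5])
```

Every original iteration is visited exactly once, so the product of the five bounds equals the size of the original loop.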










README.md: Primary Results

| Size  | Experiment | Computations (A,B)            | A_I |
|-------|------------|-------------------------------|-----|
| DSP48 |            | Dual8x8,27x18                 |     |
| 4x3   | HP_all+R   | 8x8,8x16,16x8,24x8,8x24,16x16 | 6   |
| 4x3   | HP_all     | 8x8,8x16,16x8,24x8,8x24,16x16 | 6   |
| 4x3   | HP_most+R  | 8x8,8x16,16x8,24x8,16x16      | 6   |
| 4x3   | HP_most    | 8x8,8x16,16x8,24x8,16x16      | 6   |
| 4x3   | HP_semi+R  | 8x8,8x16,16x8,16x16           | 4   |
| 4x3   | HP_semi    | 8x8,8x16,16x8,16x16           | 4   |
| 4x3   | HP_apx+R   | 8x8,8x16,16x8,16x16apx        | 4   |
| 4x3   | HP_apx     | 8x8,8x16,16x8,16x16apx        | 4   |
| 4x3   | BYPASS+R   | 8x8+reuse                     | 2   |
| 4x3   | BYPASS     | 8x8                           | 2   |

[Results table residue. Extracted column headers: "A D B D ACC D SHIFTER Area". The rows repeat the experiments HP_all, HP_most, HP_semi, HP_apx and BYPASS, each with and without +R, together with their Computations (A,B) entries; the per-row assignment of the small integer columns is not recoverable from the extraction.]

Numeric values following the tables, in extraction order: 7958, 18340, 13797, 16778, 13107, 16232, 12536, 14318, 10786, 10721, 6445, 13760, 10346, 12641, 9570, 12206, 9140, 10686, 8161, 8062, 4825, 9246, 6944, 8460, 6447, 8173, 6160, 7186, 5544, 5440. One further value was extracted attached to its row label: BYPASS 3267.

#### 412MHz without pipeline



MLBlock - PEFlex



# Why dot product compared to systolic?

- Circuit fusion and optimisation
- Both have the same unrolling factor
- Efficient pipelining rather than structured pipelines
- Register replacement and retiming
- Systolic arrays are designed for better scaling (we are focusing on PE design, which is small)
  - Inside a PE ==> dot product is better
  - For the PE-to-PE structure ==> systolic manner
- Vector unit or systolic? The number of IOs is much better in systolic arrays
- My MLBlock benefits both systolic-array and dot-product based accelerators.
  - Without the systolic interconnections: MLBlock = a dot-product unit. Great for Supertile
  - With the interconnections: MLBlock = a column-based systolic array structure. Great for TPU/SeanFPT-like designs

# IO requirements

- Types:
  - Stream (Windowing) vs RAM:
    - Reasonable windowing
  - # of IOs
    - Less out
    - Maximum input

# For every parameter in a given algorithm

$$P = P_{Comp\_Sch} \times P_{Comp\_PE}$$

$$P_{\text{Comp\_Sch}} = \prod P_{\text{Sch-i}}$$

$$P_{Comp\_PE} = P_{Comp\_Un} \times P_{Comp\_Seq}$$

IO aspect of a PE

$$P_{Comp\_Un} \times P_{Comp\_Seq} = P_{IO\_Un} \times P_{IO\_Seq}$$

P<sub>IO\_Seq</sub>: dictates the number of clock cycles needed for a full recharge

P<sub>IO Seq</sub> should be

## For every parameter in a given algorithm

Required Multipliers =  $\Pi P_{comp\_PE}$ 

$$P_{Comp\_Un} \times P_{Comp\_Seq} = P_{IO\_Un} \times P_{IO\_Seq}$$

- P<sub>IO\_Seq</sub>: number of clock cycles needed for a full recharge
- P<sub>Comp Seq</sub>:
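The balance between computation and IO inside a PE can be checked numerically. The sketch below only restates the two relations $P = P_{Comp\_Sch} \times P_{Comp\_PE}$ and $P_{Comp\_Un} \times P_{Comp\_Seq} = P_{IO\_Un} \times P_{IO\_Seq}$; the function names and the example numbers are hypothetical.

```python
from math import prod
from typing import Sequence


def pe_share(p_total: int, p_sch: Sequence[int]) -> int:
    """Return P_Comp_PE = P / prod(P_Sch_i), the part of a parameter handled inside one PE."""
    sch = prod(p_sch)
    if p_total % sch != 0:
        raise ValueError("scheduling factors must divide the parameter size")
    return p_total // sch


def io_seq_cycles(p_comp_un: int, p_comp_seq: int, p_io_un: int) -> int:
    """Solve P_Comp_Un * P_Comp_Seq = P_IO_Un * P_IO_Seq for P_IO_Seq."""
    work = p_comp_un * p_comp_seq
    if work % p_io_un != 0:
        raise ValueError("P_IO_Un must divide P_Comp_Un * P_Comp_Seq")
    return work // p_io_un


# Example (arbitrary numbers): a parameter of size 96 with scheduling factors 2 and 3.
p_comp_pe = pe_share(96, [2, 3])  # 16 iterations are left to the PE
cycles = io_seq_cycles(p_comp_un=4, p_comp_seq=4, p_io_un=8)
print(p_comp_pe, cycles)
```

With 8 unrolled IO lanes, the 16 iterations assigned to the PE take 2 cycles to reload in this example.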

**Algorithm 1** CONV layer: simple seven nested loops.

- for b = 0 until B do
  - for k = 0 until K do
    - for c = 0 until C do
      - for y = 0 until Y do
        - for x = 0 until X do
          - for $f_y = 0$ until $F_Y$ do
            - for $f_x = 0$ until $F_X$ do
              - $O[b][k][x][y] \mathrel{+}= I[b][c][x+f_x][y+f_y] \times W[k][c][f_x][f_y]$

Extracting from the algorithm: each index affects every tensor in which it is used. Example: "b" is used in "O" and "I". Thus unrolling "b" n times requires:

1) n times more output result signals
2) n times more input signals
3) n times more multipliers, as "b" is used for "I" in the multiplication
4) No effect on "W"

Streaming does not affect anything.

[Table residue: per-index unrolling effects (extracted labels: Index, W, B, K, Mult, X, Fy, Fx, with "n" entries); cell positions are not recoverable.]

Streaming + windowing reduces the IOs by saving them inside.

Who can be windowed? Any index of the input or the weights that covers more than one element (each element has the chance to be chosen). X, Fx, Y, and Fy are all candidates, and selecting one or more is acceptable. Since they are in "I", they only affect the "I" requirements. The effect of selecting a variable to be windowed is the added shift registers, in addition to the unrolled registers, which can be realised by the algorithm and the loop order.
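The seven-loop CONV nest of Algorithm 1 and the unrolling rule for "b" can be sketched in plain Python. The mapping `USES` and the helper `unroll_effect` are hypothetical names; extending the rule stated for "b" to the other indexes (and counting n times more multipliers for every unrolled index) is an assumption of this sketch, not something the slides state.

```python
# Tensors addressed by each loop index of the seven-loop CONV nest.
USES = {
    "b": {"O", "I"}, "k": {"O", "W"}, "c": {"I", "W"},
    "x": {"O", "I"}, "y": {"O", "I"},
    "fx": {"I", "W"}, "fy": {"I", "W"},
}


def unroll_effect(index, n):
    """Scaling factors when `index` is unrolled n times (assumed generalisation)."""
    used = USES[index]
    return {
        "outputs": n if "O" in used else 1,
        "inputs": n if "I" in used else 1,
        "weights": n if "W" in used else 1,
        "multipliers": n,  # every index appears in I or W, i.e. in the product
    }


def conv_reference(inp, wgt, B, K, C, X, Y, FX, FY):
    """Direct transcription of the seven nested loops (no stride, no padding)."""
    out = [[[[0] * Y for _ in range(X)] for _ in range(K)] for _ in range(B)]
    for b in range(B):
        for k in range(K):
            for c in range(C):
                for y in range(Y):
                    for x in range(X):
                        for fy in range(FY):
                            for fx in range(FX):
                                out[b][k][x][y] += inp[b][c][x + fx][y + fy] * wgt[k][c][fx][fy]
    return out


B, K, C, X, Y, FX, FY = 1, 2, 3, 2, 2, 2, 2  # small example sizes
inp = [[[[1] * (Y + FY - 1) for _ in range(X + FX - 1)] for _ in range(C)] for _ in range(B)]
wgt = [[[[1] * FY for _ in range(FX)] for _ in range(C)] for _ in range(K)]
out = conv_reference(inp, wgt, B, K, C, X, Y, FX, FY)
print(out[0][0][0][0])        # equals C * FX * FY for all-ones data
print(unroll_effect("b", 4))
```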

# For every parameter in a given algorithm

### Precision

Maybe 9x9 is better than 8x8:

- 1) It is the Xilinx style
- 2) It fits BRAM well and URAM nicely (URAM width = 72 = 8\*9, so both 8 and 9 work well)
- 3) Supporting 27x18 then becomes possible (keep in mind that the Acc width is painful: at least 45 bits)
  - 1) A size: 3
  - 2) B size: 2 x reuse
  - 3) Acc size: reuse x (at least 45)
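As a rough illustration of why A size 3 and B size 2 suffice for 27x18, the sketch below builds a 27x18 product from 9x9 partial products. It assumes unsigned operands for simplicity (sign handling is omitted), and the helper names `split_chunks` and `mul_27x18` are hypothetical.

```python
def split_chunks(value, n_chunks, width=9):
    """Split an unsigned integer into n_chunks fields of `width` bits, least significant first."""
    mask = (1 << width) - 1
    return [(value >> (width * i)) & mask for i in range(n_chunks)]


def mul_27x18(a, b):
    """Unsigned 27x18 multiply built from 3 x 2 = 6 partial products of 9x9 bits."""
    a_chunks = split_chunks(a, 3)  # A size: 3 chunks of 9 bits
    b_chunks = split_chunks(b, 2)  # B size: 2 chunks of 9 bits
    acc = 0
    for i, a_i in enumerate(a_chunks):
        for j, b_j in enumerate(b_chunks):
            # Each 9x9 partial product is shifted to its weight and accumulated.
            acc += (a_i * b_j) << (9 * (i + j))
    return acc


a_max = (1 << 27) - 1
b_max = (1 << 18) - 1
result = mul_27x18(a_max, b_max)
print(result == a_max * b_max, result.bit_length())
```

The largest product needs 27 + 18 = 45 bits, which is where the minimum accumulator width in the list above comes from.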

#### Reuse

Reuse factor for different layers (for input activations)

- 1) Standard: KKC
- 2) DW: C
- 3) PW: C
- 4) FC: C
- 5) Mat-Mat Mult: C

# Which PE arch to pick?

- Highest utilisation rate
- Power will be managed by scheduling (Stanford paper): power analysis of the PE structure is generally not significant
- Scalability
- Flexible precision
- Reuse
- Fewer partial outputs
- Don't rely on batch parallelism, since it is not available in embedded designs
- Limited number of multipliers

## Ideas

- 1) Loop-based model → for different algorithms → ASIC PE
- 2) Benchmarking current architectures against the ideal PE archs (from 1): how well can each arch implement the ideal PEs?
- 3) How can a synthetic arch be generated?

### Intel Architecture

Same as Xilinx, but the DSP is bigger and the BRAM is bigger as well.

DSP 28nm:

TABLE II: Area of enhanced DSP blocks and overhead of supporting different modes compared to the baseline.

| DSP Block                         | Post-Synth. Area (µm²) | Post-P&R Area (µm²) | Area Ratio |
|-----------------------------------|------------------------|---------------------|------------|
| Baseline DSP Block                | 8404                   | 9875                | 1.00       |
| Add 9×9 Mult.                     | 8368                   | 10320               | 1.04       |
| Add 9×9 MAC                       | 8810                   | 10384               | 1.05       |
| Add $4\times4$ Mult.(1)-max reuse | 9571                   | -                   | -          |
| Add $4\times4$ Mult.(2)-min reuse | 9104                   | 10752               | 1.09       |
| Add $4\times4$ Mult.(3)-mid reuse | 8909                   | 11651               | 1.18       |
| Add $4\times4$ MAC using $C2$     | 9543                   | 11887               | 1.20       |
| Add $4\times4$ MAC using $C5$     | 9389                   | 11108               | 1.12       |

|                   | Intel® Agilex™ F-Series |      |      |      |      |       |       | Intel® Agilex™ I-Series |       |
|-------------------|-------------------------|------|------|------|------|-------|-------|-------------------------|-------|
| M20K              | 1900                    | 2844 | 3792 | 5568 | 7110 | 11616 | 13272 | 11616                   | 13272 |
| DSP18x19          | 2300                    | 3280 | 4592 | 8000 | 9020 | 12500 | 17056 | 12500                   | 17056 |
| DSP27x27          | 1150                    | 1640 | 2296 | 4000 | 4510 | 6250  | 8528  | 6250                    | 8528  |
| ratio             | 1.21                    | 1.15 | 1.21 | 1.44 | 1.27 | 1.08  | 1.29  | 1.08                    | 1.29  |
| DSP/M20K          | 0.61                    | 0.58 | 0.61 | 0.72 | 0.63 | 0.54  | 0.64  | 0.54                    | 0.64  |
|                   |                         |      |      |      |      |       |       |                         |       |
| BRAM18            | 4033                    |      |      |      |      |       |       |                         |       |
| DSP27x18          | 9024                    |      |      |      |      |       |       |                         |       |
| ratio<br>DSP/M20K | 2.24                    |      |      |      |      |       |       |                         |       |
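The ratio rows of the table can be reproduced from the resource counts. In this sketch the "ratio" row is read as DSP18x19 / M20K and the "DSP/M20K" row as DSP27x27 / M20K; that reading is an assumption, though it matches every listed value. The variable names are hypothetical.

```python
# Resource counts copied from the table (seven F-Series and two I-Series columns).
m20k = [1900, 2844, 3792, 5568, 7110, 11616, 13272, 11616, 13272]
dsp18x19 = [2300, 3280, 4592, 8000, 9020, 12500, 17056, 12500, 17056]
dsp27x27 = [1150, 1640, 2296, 4000, 4510, 6250, 8528, 6250, 8528]

# DSP-to-memory ratios, rounded to two decimals as in the table.
ratio_18x19 = [round(d / m, 2) for d, m in zip(dsp18x19, m20k)]
ratio_27x27 = [round(d / m, 2) for d, m in zip(dsp27x27, m20k)]

# Single-column comparison from the last rows: DSP27x18 count over BRAM18 count.
ratio_27x18_bram18 = round(9024 / 4033, 2)

print(ratio_18x19)
print(ratio_27x27)
print(ratio_27x18_bram18)
```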

# Roofline model



## Generalised Architecture

- Architecture input
  - Array size: 3x4
  - Selected configurations (after filtering)




















