# Stable Diffusion XL Turbo UNet FP32 512x512

Shamith Achanta

06.03.2024

## 1 Assumptions

- Operators that have the same output memory size are likely to be fused and computed as a single operator, which reduces the number of times the output needs to be read from memory. Hence, the memory of the blocks marked in red is not counted in the analysis.
- The on-chip memory of the NPU is a parameter. In this analysis it is set to 4 MB; data (weights + output) larger than the on-chip memory must be stored in the last-level cache (if any) or in main memory.
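The two assumptions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool used for the analysis: the names `fuse_by_output_size` and `fits_on_chip` are hypothetical, fusion is assumed to apply only to consecutive operators with identical output size, a fused group is assumed to write its output once, and 1 MB is taken as 10^6 bytes (consistent with the MB columns of the table in this section). The operator values are copied from that table.

```python
from itertools import groupby

# Assumption: 1 MB = 10**6 bytes, matching the MB columns of the table.
ON_CHIP_BYTES = 4 * 10**6

# (operator, weight_bytes, output_bytes), copied from rows of the table.
ops = [
    ("Div", 0, 2_621_440),
    ("Add", 0, 2_621_440),
    ("ReduceMean", 0, 4_096),
    ("Sub", 0, 2_621_440),
    ("Pow", 0, 2_621_440),
    ("ReduceMean", 0, 4_096),
    ("Add", 0, 4_096),
    ("Sqrt", 0, 4_096),
]


def fuse_by_output_size(ops):
    """Group consecutive operators that share the same output size.

    Hypothetical fusion rule: each run of equal-output operators becomes
    one fused operator whose output is written once.
    """
    fused = []
    for out_bytes, run in groupby(ops, key=lambda op: op[2]):
        run = list(run)
        names = [name for name, _, _ in run]
        weight_bytes = sum(w for _, w, _ in run)
        fused.append((names, weight_bytes, out_bytes))
    return fused


def fits_on_chip(weight_bytes, output_bytes, capacity=ON_CHIP_BYTES):
    """True if weights + output fit within the on-chip memory."""
    return weight_bytes + output_bytes <= capacity


groups = fuse_by_output_size(ops)
for names, weight_bytes, out_bytes in groups:
    print(names, weight_bytes + out_bytes, fits_on_chip(weight_bytes, out_bytes))
```

Under these assumptions the eight operators collapse into four fused groups, each of which fits within the 4 MB on-chip memory.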

Figure 1: Optimization 1

| Node     | Operator   | Memory (weights + output, in Bytes) | Output Size | Inputs Memory (in Bytes) | Weights and Bias Memory (in Bytes) | Output Memory (in Bytes) | Weights and Bias Memory (in MB) | Output Memory (in MB) | Memory (in MB) |
|----------|------------|---------|--------|---------|---------|---------|---------|----------|----------|
|          | Reshape    | 2621440 | 655360 |         |         |         | 0       |          |          |
|          | Transpose  | 2621440 |        | 2621440 |         | 2621440 |         |          |          |
| Constant | Constant   | 8       | 1      | 0       | 0       |         |         | 8.00E-06 |          |
| /down_bl | Unsqueeze  | 8       | 1      | 8       | 0       | 8       | 0       | 8.00E-06 |          |
| Constant | Constant   | 8       | 1      | 0       | 0       | 8       | 0       | 8.00E-06 | 8.00E-06 |
| /down_bl | Unsqueeze  | 8       | 1      | 8       | 0       | 8       | 0       | 8.00E-06 | 8.00E-06 |
| Constant | Constant   | 8       | 1      | 0       | 0       | 8       | 0       | 8.00E-06 | 8.00E-06 |
| /down_bl | Unsqueeze  | 8       | 1      | 8       | 0       | 8       | 0       | 8.00E-06 | 8.00E-06 |
| /down_bl | Concat     | 24      | 3      | 24      | 0       | 24      | 0       | 2.40E-05 | 2.40E-05 |
| /down_bl | Reshape    | 2621440 | 655360 | 2621464 | 0       | 2621440 | 0       | 2.62144  | 2.62144  |
| /down_bl | MatMul     | 4259840 | 655360 | 2621440 | 1638400 | 2621440 | 1.6384  | 2.62144  | 4.25984  |
| /down_bl | Add        | 2624000 | 655360 | 2621440 | 2560    | 2621440 | 0.00256 | 2.62144  | 2.624    |
| /down_bl | Div        | 2621440 | 655360 | 2621440 | 0       | 2621440 | 0       | 2.62144  | 2.62144  |
| /down_bl | Add        | 2621440 | 655360 | 5242880 | 0       | 2621440 | 0       | 2.62144  | 2.62144  |
| /down_bl | ReduceMean | 4096    | 1024   | 2621440 | 0       | 4096    | 0       | 0.004096 | 0.004096 |
| /down_bl | Sub        | 2621440 | 655360 | 2625536 | 0       | 2621440 | 0       | 2.62144  | 2.62144  |
| /down_bl | Pow        | 2621440 | 655360 | 2621440 | 0       | 2621440 | 0       | 2.62144  | 2.62144  |
| /down_bl | ReduceMean | 4096    | 1024   | 2621440 | 0       | 4096    | 0       | 0.004096 | 0.004096 |
| /down_bl | Add        | 4096    | 1024   | 4096    | 0       | 4096    | 0       | 0.004096 | 0.004096 |
| /down_bl | Sqrt       | 4096    | 1024   | 4096    | 0       | 4096    | 0       | 0.004096 | 0.004096 |
| /down_bl | Div        | 2621440 | 655360 | 2625536 | 0       | 2621440 | 0       | 2.62144  | 2.62144  |
| /down_bl | Mul        | 2624000 | 655360 | 2621440 | 2560    | 2621440 | 0.00256 | 2.62144  | 2.624    |
| /down_bl | Add        | 2624000 | 655360 | 2621440 | 2560    | 2621440 | 0.00256 | 2.62144  | 2.624    |
| /down_bl | MatMul     | 4259840 | 655360 | 2621440 | 1638400 | 2621440 | 1.6384  | 2.62144  | 4.25984  |
| /down_bl | MatMul     | 5440000 | 49280  | 630784  | 5242880 | 197120  | 5.24288 | 0.19712  | 5.44     |
| /down_bl | MatMul     | 5440000 | 49280  | 630784  | 5242880 | 197120  | 5.24288 | 0.19712  | 5.44     |
| /down_bl | Shape      | 24      | 3      | 2621440 | 0       | 24      | 0       | 2.40E-05 | 2.40E-05 |
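The columns of the table are related by simple arithmetic, which the sketch below reproduces for two MatMul rows. It is inferred from the table values rather than taken from the author's tooling: the helper `operator_memory` is hypothetical, the output size is read as an element count, 4 bytes per element corresponds to FP32, and 1 MB is taken as 10^6 bytes. The shape-related rows (Constant, Unsqueeze, Concat) show 8 bytes per element, presumably 64-bit integers, so the element width is left as a parameter.

```python
BYTES_PER_FP32 = 4
BYTES_PER_MB = 10**6  # assumption: decimal megabytes, as the MB columns suggest


def operator_memory(output_elements, weight_bytes, bytes_per_element=BYTES_PER_FP32):
    """Return (output_bytes, total_bytes, total_mb) for one operator.

    total = weights and bias + output, the quantity compared against
    the on-chip memory size in this analysis.
    """
    output_bytes = output_elements * bytes_per_element
    total_bytes = weight_bytes + output_bytes
    return output_bytes, total_bytes, total_bytes / BYTES_PER_MB


# MatMul row: 655360 output elements, 1638400 bytes of weights.
matmul_large_output = operator_memory(655_360, 1_638_400)

# MatMul row: 49280 output elements, 5242880 bytes of weights.
matmul_large_weights = operator_memory(49_280, 5_242_880)

# Concat row: 3 elements at 8 bytes each, no weights.
concat = operator_memory(3, 0, bytes_per_element=8)

print(matmul_large_output, matmul_large_weights, concat)
```

Both MatMul rows exceed the 4 MB on-chip memory, the first because of its output and the second because of its weights.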

## 2 Operator Memory Distribution

- For any operator whose output + weight matrices exceed the on-chip memory size, that data needs to be stored in main memory or the last-level cache (if any).
- The total memory of all operators whose memory size exceeds the on-chip memory size is 10 GB.
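The total quoted above is a sum over every operator in the network. The sketch below shows the same aggregation on a handful of rows copied from the table in Section 1, so its result covers only that excerpt and is not expected to reproduce the 10 GB figure. The function name `off_chip_total` is hypothetical, and 1 MB is taken as 10^6 bytes.

```python
ON_CHIP_BYTES = 4 * 10**6  # assumption: 4 MB with 1 MB = 10**6 bytes

# (operator, weights + output in bytes), an excerpt of the Section 1 table.
rows = [
    ("MatMul", 4_259_840),
    ("Add", 2_624_000),
    ("Div", 2_621_440),
    ("ReduceMean", 4_096),
    ("MatMul", 4_259_840),
    ("MatMul", 5_440_000),
    ("MatMul", 5_440_000),
]


def off_chip_total(rows, capacity=ON_CHIP_BYTES):
    """Sum the memory of operators that do not fit in on-chip memory."""
    return sum(mem for _, mem in rows if mem > capacity)


excerpt_total = off_chip_total(rows)
print(excerpt_total / 10**6, "MB in this excerpt")
```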

Figure 2: Operator Memory Distribution

SDXL Turbo UNet FP32 512x512. Should the weights + output of an operator be stored in main memory or the last-level cache during a single inference? Condition: memory size of the operator > 4 MB (on-chip memory), with no NPU cache.



## 3 Memory Requirement of Individual Operators

The figures in this section show the operators whose weights + output memory size exceeds the on-chip memory size.

Figure 3: Memory Requirement of Individual Operators > 4 MB

SDXL Turbo UNet FP32 512x512. Should the weights + output of an operator be stored in main memory or the last-level cache during a single inference? Condition: memory size of the operator > 4 MB (on-chip memory), with no NPU cache.



Figure 4: Memory Requirement of Individual Operators > 9 MB

SDXL Turbo UNet FP32 512x512. Should the weights + output of an operator be stored in main memory or the last-level cache during a single inference? Condition: memory size of the operator > 9 MB (on-chip memory), with no NPU cache.
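Figures 3 and 4 differ only in the on-chip memory size used as the cut-off. Since that size is a parameter of the analysis, the comparison can be sketched as a sweep over candidate capacities. The helper `operators_above` is hypothetical, the rows are a small excerpt of the table in Section 1 rather than the full network, and 1 MB is taken as 10^6 bytes.

```python
BYTES_PER_MB = 10**6  # assumption: decimal megabytes

# (operator, weights + output in bytes), an excerpt of the Section 1 table.
rows = [
    ("MatMul", 4_259_840),
    ("Add", 2_624_000),
    ("MatMul", 5_440_000),
    ("Shape", 24),
]


def operators_above(rows, capacity_mb):
    """List the operators whose memory exceeds the given on-chip size."""
    limit = capacity_mb * BYTES_PER_MB
    return [(name, mem) for name, mem in rows if mem > limit]


# Compare the two on-chip sizes used in Figures 3 and 4.
sweep = {mb: operators_above(rows, mb) for mb in (4, 9)}
for mb, spilled in sweep.items():
    print(f"{mb} MB on-chip: {len(spilled)} operator(s) spill off-chip")
```

In this excerpt, two operators exceed a 4 MB on-chip memory and none exceed 9 MB; the full network contains larger operators, as Figure 4 indicates.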

