# Synthesis Analysis
In this notebook, we analyze the synthesis results of the [parameter sweeping](https://sdrangan.github.io/hwdesign/loopopt/paramsweep.html).  As discussed there, we synthesize multiple versions of the vector multiplier with different unroll factors.  Here, we will parse the synthesis results to look at the resource usage for different unrolling factors.

 

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import os

## Loading the synthesis output files

Follow the steps in the [parameter sweeping](https://sdrangan.github.io/hwdesign/loopopt/paramsweep.html) instructions.  This will create a number of directories:
~~~
   vmult_vitis
   ├── vmult_hls
   │   ├── sol_uf1
   │   ├── sol_uf2
   │   ├── sol_uf4
   │   └── sol_uf8
~~~
where each `sol_uf<n>` directory holds the synthesis results for one unroll factor: 1, 2, 4, or 8.
Each of these directories contains a **report file** at a location like:
~~~bash
   sol_uf<n>/syn/report/csynth.xml
~~~

Feel free to open the file.  You will see it has lots of data on the synthesis results.
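
If you would rather get a quick outline of the file than scroll through it, the sketch below walks an XML tree with the standard library and lists the element tags down to a chosen depth. The helper name `list_tags` and the tiny `demo` document are made up for illustration; the demo is not the real `csynth.xml` layout, and this is not how the `xilinxutils` parser works.

```python
import xml.etree.ElementTree as ET


def list_tags(root, max_depth=2):
    """Return (depth, tag) pairs for all elements down to max_depth."""
    found = []

    def walk(elem, depth):
        found.append((depth, elem.tag))
        if depth < max_depth:
            for child in elem:
                walk(child, depth + 1)

    walk(root, 0)
    return found


# Made-up document, used only to show the traversal (NOT the csynth.xml schema)
demo = ET.fromstring("<report><summary><item>1</item></summary><details/></report>")
for depth, tag in list_tags(demo):
    print("  " * depth + tag)

# For a real report, something like the following should work:
# root = ET.parse("sol_uf1/syn/report/csynth.xml").getroot()
# for depth, tag in list_tags(root):
#     print("  " * depth + tag)
```

Increasing `max_depth` shows more of the hierarchy at the cost of a longer listing.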

Since we will parse this data a lot, I have written (actually, I got ChatGPT to write it for me)
a parser to read these files.  The code is in the `xilinxutils` package which is part of the course.

Here, we first use the parser on the output for the synthesis with the unroll factor, `UF = 1`.

In [50]:

import importlib
import xilinxutils

# Reload to get latest changes during development
importlib.reload(xilinxutils)

from xilinxutils.csynthparse import CsynthParser
import os


# Parse the synthesis for UF=1 
sol_path = os.path.join(os.getcwd(), '..', 'vmult_hls', 'sol_uf1')
parser = CsynthParser(sol_path=sol_path)

# Get the latency and initiation interval
print('Latency and Initiation Interval:')
parser.get_loop_pipeline_info()
print(parser.loop_df)

# Get the resources
print('\nResource Usage:')
parser.get_resources()
print(parser.res_df)

Latency and Initiation Interval:
                                           PipelineII  PipelineDepth
vec_mult_Pipeline_input_loop:input_loop             2             14
vec_mult_Pipeline_mult_loop:mult_loop               2             13
vec_mult_Pipeline_output_loop:output_loop           2              5

Resource Usage:
                               BRAM_18K  DSP      FF    LUT  URAM
vec_mult_Pipeline_input_loop          0    0     979    313     0
vec_mult_Pipeline_mult_loop           0    3     472    545     0
vec_mult_Pipeline_output_loop         0    0      63    134     0
vec_mult                              8    3    2954   2620     0
Total                                 8    3    2954   2616     0
Available                           280  220  106400  53200     0


We see that we get data for each of the loops.  We will focus on `mult_loop`, the main multiplication loop, and address pipelining of the input and output loops later.  The trip count and total number of cycles for `n` loop iterations (before unrolling) are:
~~~
   trip = (n + unroll_factor - 1) // unroll_factor
   ncycles = PipelineDepth + PipelineII * (trip - 1)
~~~
So, `PipelineDepth` is the number of cycles for the first trip and `PipelineII` is the additional number of cycles for each subsequent trip.
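
As a quick sanity check, the formula can be wrapped in a small function. This is only a sketch: the function name `loop_cycles` is made up, and the vector length `n = 1024` is an assumed value for illustration. The `PipelineII = 2` and `PipelineDepth = 13` values are the `mult_loop` numbers from the `UF = 1` report above.

```python
def loop_cycles(n, unroll_factor, pipeline_ii, pipeline_depth):
    """Return (trip, ncycles) for a pipelined loop with n iterations."""
    # Ceiling division: a partially filled last trip still costs a full trip
    trip = (n + unroll_factor - 1) // unroll_factor
    ncycles = pipeline_depth + pipeline_ii * (trip - 1)
    return trip, ncycles


# UF = 1 with the mult_loop values reported above; n = 1024 is an assumption
trip, ncycles = loop_cycles(1024, unroll_factor=1, pipeline_ii=2, pipeline_depth=13)
print(f"trip = {trip}, ncycles = {ncycles}")  # trip = 1024, ncycles = 2059

# When n is not a multiple of the unroll factor, the trip count rounds up
print(loop_cycles(10, unroll_factor=4, pipeline_ii=2, pipeline_depth=13))  # (3, 17)
```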

## Parsing the data for different unroll factors

Now let's parse all the solutions to compare the results.  First, we find all the directories with solutions.

In [None]:
hls_dir = os.path.join(os.getcwd(), '..', 'vmult_hls')
sol_dirs = [d for d in os.listdir(hls_dir) if os.path.isdir(os.path.join(hls_dir, d)) and d.startswith('sol_uf')]
sol_dirs.sort()

print(f"Found {len(sol_dirs)} solution directories:")
for d in sol_dirs:
    print(f"  {os.path.join(hls_dir, d)}")

Found 4 solution directories:
sol_uf1
  c:\Users\sdran\Documents\repos\hwdesign\vector_mult\vmult_vitis\scripts\..\vmult_hls\sol_uf1
sol_uf2
  c:\Users\sdran\Documents\repos\hwdesign\vector_mult\vmult_vitis\scripts\..\vmult_hls\sol_uf2
sol_uf4
  c:\Users\sdran\Documents\repos\hwdesign\vector_mult\vmult_vitis\scripts\..\vmult_hls\sol_uf4
sol_uf8
  c:\Users\sdran\Documents\repos\hwdesign\vector_mult\vmult_vitis\scripts\..\vmult_hls\sol_uf8


Next, we parse each of these directories and get the loop info and resource usage from each.

In [21]:
res_dfs = {}
latency_dfs = {}
for sol in sol_dirs:
    
    sol_path = os.path.join(hls_dir, sol)
    base_name = os.path.basename(sol_path)
    print(f"Processing solution: {base_name}")

    # Parse the synthesis report
    parser = CsynthParser(sol_path=sol_path)
    
    # Get resources
    parser.get_resources()
    res_dfs[base_name] = parser.res_df
    
    # Get latency and initiation interval
    parser.get_loop_pipeline_info()
    latency_dfs[base_name] = parser.loop_df

Processing solution: sol_uf1
Processing solution: sol_uf2
Processing solution: sol_uf4
Processing solution: sol_uf8


Now let's look at how the resource usage increases with the unroll factor.

In [31]:
res_totals = {}

# Get the total resources for each solution from the row 'Total'
for sol, df in res_dfs.items():
    res_totals[sol] = df.loc['Total']
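# 'Available' is a property of the target device, so it is the same in every report;
# here it is taken from the last report parsed (df is left over from the loop above)
res_totals['Available'] = df.loc['Available']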

import pandas as pd
res_total_df = pd.DataFrame.from_dict(res_totals, orient='index')

# Label the index column
res_total_df.index.name = 'Solution'

print(res_total_df)

           BRAM_18K  DSP      FF    LUT  URAM
Solution                                     
sol_uf1           8    3    2954   2616     0
sol_uf2           8    3    3302   2941     0
sol_uf4          14    6    4125   3571     0
sol_uf8          26   12    5644   5033     0
Available       280  220  106400  53200     0
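
To put these totals in perspective, it can help to express them as a percentage of the available resources. The sketch below re-enters the numbers from the table above by hand so that it runs on its own; in the notebook you would presumably start from `res_total_df` instead. The variable names are made up, and `URAM` is left out because none is available on this device.

```python
import pandas as pd

# Totals copied from the table printed above
totals = pd.DataFrame(
    {
        "BRAM_18K": [8, 8, 14, 26],
        "DSP": [3, 3, 6, 12],
        "FF": [2954, 3302, 4125, 5644],
        "LUT": [2616, 2941, 3571, 5033],
    },
    index=["sol_uf1", "sol_uf2", "sol_uf4", "sol_uf8"],
)
available = pd.Series({"BRAM_18K": 280, "DSP": 220, "FF": 106400, "LUT": 53200})

# Dividing a DataFrame by a Series matches the Series index to the column names
util_pct = (100 * totals / available).round(2)
print(util_pct)
```

Even at an unroll factor of 8, every resource stays below 10% of what is available.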


Now, let's look at the pipeline info for the multiplication loop.

In [49]:
# Get the pipeline info for the multiplication loop from each solution
mult_loop_name = 'vec_mult_Pipeline_mult_loop:mult_loop'
loop_data = {}
for sol, df in latency_dfs.items():
    loop_data[sol] = df.loc[mult_loop_name]

loop_df = pd.DataFrame.from_dict(loop_data, orient='index')

# Label the index column
loop_df.index.name = 'Solution'

print(loop_df)

          PipelineII  PipelineDepth
Solution                           
sol_uf1            2             13
sol_uf2            2             14
sol_uf4            2             14
sol_uf8            2             14


We see that the initiation interval and pipeline depth per trip barely change with the unroll factor.  Of course, with unrolling, more work is done per trip, so fewer trips are needed.
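
To see what this means for the total loop time, we can plug the table values into the cycle-count formula from earlier. This is a rough sketch: the vector length `n = 1024` is an assumed value, and the estimate covers only `mult_loop`, not the input and output loops.

```python
n = 1024  # assumed vector length, for illustration only

# (PipelineII, PipelineDepth) for mult_loop, copied from the table above
pipeline = {1: (2, 13), 2: (2, 14), 4: (2, 14), 8: (2, 14)}

cycles = {}
for uf, (ii, depth) in pipeline.items():
    trip = (n + uf - 1) // uf           # number of trips after unrolling
    cycles[uf] = depth + ii * (trip - 1)

for uf, c in cycles.items():
    print(f"UF={uf}: {c} cycles, speedup {cycles[1] / c:.2f}x")
```

Under these assumptions, the estimated cycle count drops almost in proportion to the unroll factor, since the fixed pipeline depth is small compared to the number of trips.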

## BRAM usage

It is useful to compare the measured BRAM usage with what we expect. With unrolling, we partition each buffer into `unroll_factor` banks, each of size (in bits):
~~~
   bank_size = max_size * bit_width / unroll_factor
~~~
Each bank needs
~~~
   nbram_per_bank = ceil( bank_size / bram_size)
~~~
BRAM units.  The banks must go in different BRAM units so they can be addressed independently.
So, with `nbuf` buffers, the total number of BRAM units is:
~~~
    bram_exp  = nbram_per_bank * nbuf * unroll_factor
~~~
We can also compute the BRAM utilization as:
~~~
   bram_util = bank_size / (nbram_per_bank * bram_size)
~~~
which represents the fraction of the BRAM that we are using.

In [44]:
# Get the unroll factor from the solution names
unroll_factor = []
bram_used = []
bram_exp = []
bram_util = []
max_size = 1024
bit_width = 32
bram_size = 18 * 1024  # in bits
nbuf = 3 # One each for a, b, and c

for k in res_totals.keys():
    if not k.startswith('sol_uf'):
        continue 

    # Get the unroll factor
    uf = int(k.split('_uf')[-1])
    unroll_factor.append(uf)

    # Get the BRAM used
    bram_used.append(res_totals[k]['BRAM_18K'])

    # Get the expected BRAM usage
    bank_size = (max_size * bit_width / uf )
    nbram_per_bank = (bank_size + bram_size - 1) // bram_size  # Ceiling division
    total_bram = int(nbram_per_bank * uf * nbuf)
    bram_exp.append(total_bram)

    # Get the BRAM utilization
    bram_util.append( bank_size / nbram_per_bank / bram_size)

# Create a data frame
bram_df = pd.DataFrame({
    'Unroll_Factor': unroll_factor,
    'BRAM_Used': bram_used,
    'BRAM_Expected': bram_exp,
    'BRAM utilization': bram_util
})

print(bram_df)

   Unroll_Factor  BRAM_Used  BRAM_Expected  BRAM utilization
0              1          8              6          0.888889
1              2          8              6          0.888889
2              4         14             12          0.444444
3              8         26             24          0.222222


We see that the BRAM used is slightly higher than expected: 2 units more in every case.  The additional BRAM is due to some BRAM usage for the AXI interface.  Also, we see that the BRAM count starts to grow as we bank the data across more memories.  The total amount of data stored is constant.  It is just that we have to split the data into smaller banks, and the BRAM units, which have a fixed size, become under-utilized.  In an ASIC, this would not be such a problem, since we could create arbitrarily small memories.
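
One simple check on the overhead is to subtract the expected count from the measured count for each unroll factor. The sketch below copies the numbers from the table above by hand so that it runs on its own; the variable names are made up for illustration.

```python
# Values copied from the BRAM table printed above
unroll_factor = [1, 2, 4, 8]
bram_used = [8, 8, 14, 26]
bram_expected = [6, 6, 12, 24]

# Extra BRAM units beyond what the banking model predicts
overhead = [used - exp for used, exp in zip(bram_used, bram_expected)]

for uf, extra in zip(unroll_factor, overhead):
    print(f"UF={uf}: {extra} BRAM units beyond the banking estimate")
```

The difference is the same for every unroll factor, which is consistent with a fixed overhead that does not scale with the number of banks, although this check alone does not identify where that overhead comes from.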