# Theoretical Part
This part covers the theory behind what you need to do. Read it carefully alongside the [ZigZag - Paper](https://ieeexplore-ieee-org.tudelft.idm.oclc.org/document/9360462).


## Step 1 - Design Definition

We need to describe Asmae's accelerator design in the given YAML file, `inputs/hardware/asmaeDesign.yaml`. The following must be defined:
1. Cells. Be careful here: you cannot define a 1-bit cell. Use a 4-bit or 8-bit cell and reduce the number of columns accordingly (instead of 128, use 128/4 = 32 or 128/8 = 16, since bits are unrolled over columns).
    ```yaml
      cells:
        size: 8
        r_bw: 8
        w_bw: 8
        r_cost: 0
        w_cost: 0.095
        area: 0
        r_port: 1
        w_port: 1
        rw_port: 0
        latency: 0
        auto_cost_extraction: True
        operands: [I2]
        ports:
          - fh: w_port_1
            tl: r_port_1
        served_dimensions: [] # Fully unrolled over all multipliers
    ```
    
2. Size of the operational array: 128x128 (or 16x128x8, since we have 8 banks).
```yaml
    operational_array:
      is_imc: True
      imc_type: digital # it won't matter that much because you are going to overwrite the functions built in zigzag
      input_precision: [8, 8] # unit: bit, first value is for inputs and second value here is for weight-precision
      bit_serial_precision: 1 # unit: bit, Asmae's design has 1-bit serial precision
      adc_resolution: 3 # unit: bit
      dimensions: [D1, D2, D3] # Example
      sizes: [32, 32, 8] # [COLUMNS,ROWS,BANKS] -> change accordingly

```

3. You also need to define one buffer per row and one buffer per column for data transfer (you can keep them at the standard sizes of 1 byte and 16 bytes).
4. Add one more level to the memory hierarchy. DRAM alone is enough (there is no need for an SRAM level as well).
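
The column adjustment from item 1 is simple arithmetic, sketched below. The helper name `columns_for_cell_size` is hypothetical (it is not part of ZigZag); it only shows how the 128 one-bit columns shrink when a wider cell is used, and that 16x128x8 describes the same number of cells as 128x128.

```python
def columns_for_cell_size(bit_columns: int, cell_bits: int) -> int:
    """Number of array columns when each cell stores `cell_bits` bits."""
    if cell_bits <= 0 or bit_columns % cell_bits != 0:
        raise ValueError("cell_bits must be positive and divide bit_columns evenly")
    return bit_columns // cell_bits


# 128 one-bit columns, modelled with 4-bit and 8-bit cells.
for cell_bits in (4, 8):
    print(f"{cell_bits}-bit cells -> {columns_for_cell_size(128, cell_bits)} columns")

# 16 columns x 128 rows x 8 banks covers the same number of cells as 128 x 128.
print(16 * 128 * 8 == 128 * 128)
```

Whichever cell size you pick, the `sizes` entry of the operational array has to be changed to match.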

Important Notes:
We have integrated the broad structure of Asmae's design into ZigZag. However, a few key aspects differ from how ZigZag calculates energy, area, and latency. (Since you have already figured out the differences in the area estimation, I won't focus on that.) In particular, ZigZag uses adder trees for accumulation, but Asmae's design does not.
You need to find the places where ZigZag calls adder-tree-related functions in its **energy** and **latency** calculations. The relevant ones are:
* Peak energy consumption (*get_peak_energy_single_cycle(self)*) and peak macro-level performance (*get_macro_level_peak_performance(self)*), i.e. the values obtained when your design is fully utilized. **IMPORTANT: THESE DO NOT DEPEND ON THE WORKLOAD, just as area does not depend on the workload.**
* Layer energy consumption. Energy and latency depend on the workload you run: running multiple convolutional kernels and running a simple MLP will not yield the same energy. To find the energy per layer you will need to change *get_energy_for_a_layer(self, layer: LayerNode, mapping: Mapping)*.
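
The override pattern described in these notes can be sketched without touching ZigZag itself. Everything in the block below is a hypothetical stand-in: the classes are not ZigZag's, `accumulation_energy` is an invented helper, the return type is simplified to a single float, and the energy numbers are made up. Only the method name `get_peak_energy_single_cycle` is taken from the text. The point is just to show how a subclass can swap the adder-tree term for a custom periphery term while leaving the rest of the calculation alone.

```python
class ToyImcArray:
    """Hypothetical stand-in for an IMC array cost model (not ZigZag's class)."""

    def __init__(self, cell_energy: float, adder_tree_energy: float) -> None:
        self.cell_energy = cell_energy
        self.adder_tree_energy = adder_tree_energy

    def accumulation_energy(self) -> float:
        # Baseline behaviour: accumulation is done by an adder tree.
        return self.adder_tree_energy

    def get_peak_energy_single_cycle(self) -> float:
        # Workload-independent: assumes the array is fully utilized.
        return self.cell_energy + self.accumulation_energy()


class ToyArrayWithoutAdderTree(ToyImcArray):
    """Same model, but accumulation uses a custom periphery, not an adder tree."""

    def __init__(self, cell_energy: float, periphery_energy: float) -> None:
        super().__init__(cell_energy, adder_tree_energy=0.0)
        self.periphery_energy = periphery_energy

    def accumulation_energy(self) -> float:
        # Only this term changes; the peak-energy formula is inherited.
        return self.periphery_energy


# Made-up numbers, purely to show that only the accumulation term differs.
baseline = ToyImcArray(cell_energy=1.0, adder_tree_energy=0.5)
custom = ToyArrayWithoutAdderTree(cell_energy=1.0, periphery_energy=0.25)
print(baseline.get_peak_energy_single_cycle(), custom.get_peak_energy_single_cycle())
```

In the real codebase the adder-tree contribution may be spread over several functions, so each of the methods listed above has to be checked separately.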








Here is an image of Asmae's Design
![Asmae's Design](https://i.imgur.com/FE4d7bg.jpeg)

## Step 2 - Workloads & Mapping
The second thing to understand is that mapping plays an important role in the energy and latency (but not the area) of a macro. To create a cost model in ZigZag we need to define two types of mapping: spatial and temporal. The spatial mapping describes which loop dimensions are unrolled in parallel over the dimensions of the operational array, and therefore where each operand resides in the memory hierarchy during computation. The temporal mapping describes the order in which the remaining nested for-loops are executed over time.
All 2D convolutions can be written as 7 nested for-loops. How you execute them temporally and spatially matters. For instance, are you going to unroll some specific loops? Are you going to tile any of the loops? Are you going to reorder them? All of this can be defined via the spatial and temporal mapping in ZigZag.


To make this concrete, here is the example from the ZigZag paper:
![ZigZag-Workloads](https://i.imgur.com/VFsZEti.png)

[ZigZag - Reference](https://ieeexplore-ieee-org.tudelft.idm.oclc.org/document/9360462)


### Workload Example - Simple CNN

Let's go through an example to make this concrete. We define a convolution layer with 9 kernels (K = 9) of size 3x3 operating on a 3-channel input of size 64x64. For simplicity we keep stride and dilation at 1 and assume a batch size of 1. Written out with these values, the seven nested for-loops become:
```python
import numpy as np

Input = np.random.rand(1, 3, 64, 64)  # [B][C][IY][IX]
W = np.random.rand(9, 3, 3, 3)        # [K][C][FY][FX]
Output = np.zeros((1, 9, 62, 62))     # [B][K][OY][OX]

for b in range(1):  # B=1
    for k in range(9):  # K=9
        for c in range(3):  # C=3
            for oy in range(62):  # OY=62 (output rows): a 3x3 kernel on a 64x64 input gives a 62x62 output
                for ox in range(62):  # OX=62 (output columns)
                    for fy in range(3):  # FY=3, kernel row
                        for fx in range(3):  # FX=3, kernel column
                            Output[b][k][oy][ox] += Input[b][c][oy+fy][ox+fx] * W[k][c][fy][fx]
```
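
The loop bounds above follow from the standard convolution output-size formula. The short sketch below (the helper name `conv_output_size` is only illustrative) recomputes OY and OX for this example and counts the multiply-accumulate operations (MACs) of the full loop nest.

```python
def conv_output_size(input_size: int, kernel_size: int, stride: int = 1,
                     padding: int = 0, dilation: int = 1) -> int:
    """Output length of a convolution along one spatial axis."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (input_size + 2 * padding - effective_kernel) // stride + 1


B, K, C, FY, FX = 1, 9, 3, 3, 3
OY = conv_output_size(64, FY)
OX = conv_output_size(64, FX)

# One MAC per innermost iteration of the seven nested loops.
total_macs = B * K * C * OY * OX * FY * FX
print(OY, OX, total_macs)  # 62 62 934092
```

This MAC count is a useful reference point when checking that a spatial plus temporal mapping covers the whole layer.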



To create a simple CNN model like this, you can run the following code (either in Colab or locally, if you have torch and onnx installed):

```python
import torch
import torch.nn as nn
import onnx

class SimpleConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=9, kernel_size=3, stride=1, padding=0)  # padding=0, so a 64x64 input yields a 62x62 output

    def forward(self, x):
        return self.conv1(x)

# Instantiate the model and set it to evaluation mode
model = SimpleConvNet()
model.eval()

# Create a dummy input tensor of size (1, 3, 64, 64) - batch size 1, 3 channels, 64x64 image
dummy_input = torch.randn(1, 3, 64, 64)

# Export the model to an ONNX file
onnx_file_path = "simple_conv_net.onnx"
torch.onnx.export(model, dummy_input, onnx_file_path, verbose=True, input_names=['input'], output_names=['output'])

print(f"Model exported to {onnx_file_path}")
```

### ZigZag specific workload mapping

Assuming you have a simple (or a more complex) workload, you now need to map it onto your hardware. This requires both a spatial and a temporal mapping.
You can define both in mapping.yaml, or you can use ZigZag's own mapping generators. I suggest defining the spatial mapping yourself and using LOMA (the temporal mapping engine) for the temporal mapping.

Example of a YAML file with a defined spatial mapping:
```yaml
- name: default
  core_allocation: [1]
  memory_operand_links:
    O: O
    W: I2
    I: I1
  spatial_mapping:
    D1:
      - K, 3
    D3: 
      - C, 3
```


This mapping does the following:
1. Unrolls 3 kernels over the D1 dimension (columns). We have 9 kernels in total, so the remaining factor of 3 has to be handled temporally.
2. Unrolls 3 channels over the D3 dimension (memory banks). We have 3 channels in total.

Be careful with your spatial mapping: make sure that the unrolled data fits into the operational array.
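
A quick way to check the fit by hand is sketched below. This is not ZigZag code: `check_spatial_mapping` is a hypothetical helper, the array sizes are the example `sizes: [32, 32, 8]` from Step 1, and the sketch assumes every unroll factor divides its loop size evenly. It verifies that the unrolling on each array dimension does not exceed that dimension's size, and returns the loop sizes that are left over for the temporal mapping.

```python
from math import prod


def check_spatial_mapping(spatial_mapping, array_sizes, layer_dim_sizes):
    """Validate unroll factors and return the loop sizes left for temporal mapping."""
    remaining = dict(layer_dim_sizes)
    for array_dim, unrolls in spatial_mapping.items():
        used = prod(unrolls.values())
        if used > array_sizes[array_dim]:
            raise ValueError(f"{array_dim}: unrolling {used} exceeds size {array_sizes[array_dim]}")
        for loop_dim, factor in unrolls.items():
            if remaining[loop_dim] % factor != 0:
                raise ValueError(f"{loop_dim}: {factor} does not divide {remaining[loop_dim]}")
            remaining[loop_dim] //= factor
    return remaining


array_sizes = {"D1": 32, "D2": 32, "D3": 8}  # example sizes from the hardware YAML
layer_dim_sizes = {"B": 1, "K": 9, "C": 3, "OY": 62, "OX": 62, "FY": 3, "FX": 3}
spatial_mapping = {"D1": {"K": 3}, "D3": {"C": 3}}

remaining = check_spatial_mapping(spatial_mapping, array_sizes, layer_dim_sizes)
print(remaining)
```

For the mapping above, K is left with a factor of 3 and C is fully unrolled, so only K, OY, OX, FY and FX still need temporal loops.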


You can also define the temporal mapping in your mapping.yaml file:
```yaml
temporal_ordering:
    - [OX,62] # Innermost loop
    - [OY, 62]
    - [FX, 3]
    - [FY, 3]
    - [K, 3] # Outermost loop
```

Alternatively, you can use `LomaEngine` and let it generate a temporal mapping for you.
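
As a sanity check on a hand-written temporal ordering, the product of the temporal loop sizes multiplied by the spatial unrolling should equal the total MAC count of the layer. The sketch below does this for the example values from this section (B = 1 is omitted because it contributes a factor of 1); it is plain arithmetic and does not call ZigZag.

```python
from math import prod

# Temporal loops from the example ordering, innermost first.
temporal_ordering = [("OX", 62), ("OY", 62), ("FX", 3), ("FY", 3), ("K", 3)]
# Spatial unrolling from the example mapping (K over D1, C over D3).
spatial_unrolling = {"K": 3, "C": 3}

temporal_iterations = prod(size for _, size in temporal_ordering)
parallel_macs = prod(spatial_unrolling.values())
total_macs = temporal_iterations * parallel_macs

print(temporal_iterations, parallel_macs, total_macs)  # 103788 9 934092
```

If the total does not match K * C * OY * OX * FY * FX for the layer, a loop is missing from the mapping or has the wrong size.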

## Cost-Model 
Now assume that you have done all of the above:

1. Defined your accelerator and changed the source code so that running ZigZag gives results for Asmae's digital periphery.
2. Defined your workload along with a spatial and a temporal mapping.

You can now run a cost model, which gives you results for the execution of the whole workload.

You will not need to change much here.

# Hands-on Example 
Here is an example of using ZigZag from a notebook (decoupled from its main functions).

## Imports and Utility Functions

In [1]:
from typing import Any
from zigzag.hardware.architecture.Accelerator import Accelerator
from zigzag.parser.accelerator_factory import AcceleratorFactory
from zigzag.parser.AcceleratorValidator import AcceleratorValidator
from zigzag.utils import open_yaml

In [2]:
# Utility functions
def print_attributes(obj):
    for attribute in dir(obj):
        # Filter out methods and built-in properties
        if not attribute.startswith('__') and not callable(getattr(obj, attribute)):
            print(f"{attribute}: {getattr(obj, attribute)}")

## Parse Accelerator

In [3]:
%%capture
ACCELERATOR_PATH = 'zigzag/inputs/hardware/dimc_asmae.yaml'

def parse_accelerator(accelerator_yaml_path: str) -> Accelerator:
    accelerator_data = open_yaml(accelerator_yaml_path)

    validator = AcceleratorValidator(accelerator_data)
    accelerator_data = validator.normalized_data
    validate_success = validator.validate()
    if not validate_success:
        raise ValueError("Failed to validate user provided accelerator.")

    print(accelerator_data)
    factory = AcceleratorFactory(accelerator_data)
    return factory.create()


accelerator = parse_accelerator(ACCELERATOR_PATH)

In [4]:
print_attributes(accelerator.cores[0]) , print("\n Operational Array Attributes \n"), print_attributes(accelerator.cores[0].operational_array)

dataflows: None
id: 1
mem_hierarchy_dict: {I2: [MemoryLevel(instance=cells,operands=[I2], served_dimensions=set()), MemoryLevel(instance=dram,operands=[I1, I2, O], served_dimensions={D2, D1})], I1: [MemoryLevel(instance=rf_1B,operands=[I1], served_dimensions={D1}), MemoryLevel(instance=dram,operands=[I1, I2, O], served_dimensions={D2, D1})], O: [MemoryLevel(instance=rf_2B,operands=[O], served_dimensions={D2}), MemoryLevel(instance=dram,operands=[I1, I2, O], served_dimensions={D2, D1})]}
mem_r_bw_dict: {I2: [8, 524288], I1: [8, 524288], O: [16, 524288]}
mem_r_bw_min_dict: {I2: [8, 524288], I1: [8, 524288], O: [16, 524288]}
mem_sharing_list: [{I1: 1, I2: 1, O: 1}]
mem_size_dict: {I2: [8, 8589934592], I1: [8, 8589934592], O: [16, 8589934592]}
mem_w_bw_dict: {I2: [8, 524288], I1: [8, 524288], O: [16, 524288]}
mem_w_bw_min_dict: {I2: [8, 524288], I1: [8, 524288], O: [16, 524288]}
memory_hierarchy: MemoryHierarchy named 'Memory Hierarchy' with 4 nodes and 3 edges
operational_array: <zigzag.h

(None, None, None)

## Workload Spatial/Temporal Mapping


In [5]:
from zigzag.parser.onnx.ONNXModelParser import ONNXModelParser
WORKLOAD_PATH = 'zigzag/inputs/workload/simple_conv_net.onnx'
MAPPING_PATH = 'zigzag/inputs/mapping/dimc_asmae.yaml'

'''
This is the mapping we use 
- name: default
  core_allocation: [1]
  memory_operand_links:
    O: O
    W: I2
    I: I1
  
  spatial_mapping:
    D1:
      - K, 3

    D3:
      - C, 3
'''


onnx_model_parser=ONNXModelParser(WORKLOAD_PATH, MAPPING_PATH)

workload = onnx_model_parser.run()
onnx_model = onnx_model_parser.onnx_model

In [6]:
for layer in workload.topological_sort():
    core_id: int = layer.core_allocation[0]
    core = accelerator.get_core(core_id)
    operational_array = core.operational_array
    layer.spatial_mapping.oa_dim_sizes = operational_array.dimension_sizes

In [7]:
# Layer Instance Attributes
print_attributes(layer)

_abc_impl: <_abc._abc_data object at 0x7f777eef3600>
constant_operands: [W]
core_allocation: [1]
core_allocation_is_fixed: False
dimension_relations: [IX = 1*OX + 1*FX, IY = 1*OY + 1*FY]
equation: O[b][g][k][oy][ox] = W[g][k][c][fy][fx] * I[b][g][c][iy][ix]
id: 0
input_operand_source: {}
input_operands: [W, I]
layer_dim_sizes: {B: 1, K: 9, G: 1, OX: 62, OY: 62, C: 3, FX: 3, FY: 3}
layer_dims: [B, K, G, OX, OY, C, FX, FY]
layer_operands: [O, W, I]
loop_relevancy_info: <zigzag.workload.layer_node.LoopRelevancyInfo object at 0x7f777ecd2010>
memory_operand_links: {'O': 'O', 'W': 'I2', 'I': 'I1'}
name: /conv1/Conv
operand_data_reuse: {O: 27.0, W: 3844.0, I: 76.0166015625}
operand_precision: {O: 16, O_final: 8, W: 8, I: 8}
operand_size_bit: {O: 553536, W: 1944, I: 98304}
operand_size_elem: {O: 34596, W: 243, I: 12288}
output_operand: O
padding: {IX: (0, 0), IY: (0, 0)}
pr_decoupled_relevancy_info: <zigzag.workload.layer_node.LoopRelevancyInfo object at 0x7f777ed610d0>
pr_layer_dim_sizes: {IX

In [8]:
# See the spatial mapping - This mapping means that we unroll K over columns (D1) and Channels over memory banks (D3)
print("Layer's defined spatial mapping", layer.spatial_mapping)

Layer's defined spatial mapping {D1: {K: 3}, D3: {C: 3}}


In [9]:
# We need to convert the spatial mapping into another format for ZigZag
from zigzag.runutils.SpatialMappingConversion import SpatialMappingConversion
import copy

spatial_mapping_conversion = SpatialMappingConversion(accelerator=accelerator, layer=copy.copy(layer))
spatial_mapping, spatial_mapping_internal = spatial_mapping_conversion.run()

In [10]:
# Temporal mapping generations
from zigzag.opt.loma.LomaEngine import LomaEngine

engine = LomaEngine(
    accelerator=accelerator,
    layer=copy.copy(layer),
    spatial_mapping=spatial_mapping,
    loma_lpf_limit=6,
)
temporal_mappings = list(engine.run())

In [11]:
# LOMA generated 720 different temporal mappings. Some are better than others, so you will probably have to compare them yourself
len(temporal_mappings)

720

In [12]:
# We will use the first one of the generated mappings for the Cost Model
temporal_mapping = temporal_mappings[0]
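
Rather than taking the first mapping, the candidates can be ranked by a cost of your choice. The sketch below is a minimal, ZigZag-independent illustration: `pick_best` and the toy candidates are hypothetical. In the notebook, the `cost` callable would be something you write yourself, for example a function that builds a cost model evaluation for each temporal mapping and returns the energy or latency it reports.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def pick_best(candidates: Iterable[T], cost: Callable[[T], float]) -> T:
    """Return the candidate with the lowest cost (raises ValueError if empty)."""
    return min(candidates, key=cost)


# Toy stand-ins for temporal mappings: (name, made-up cost) pairs.
toy_mappings = [("order_a", 12.0), ("order_b", 7.5), ("order_c", 9.0)]
best = pick_best(toy_mappings, cost=lambda mapping: mapping[1])
print(best)
```

Evaluating all 720 mappings costs more time than evaluating one, so it can be worth sampling a subset first.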

## Cost Model Evaluation

In [13]:
from zigzag.hardware.architecture.ImcArray import ImcArray
from zigzag.hardware.architecture._ImcArray import _ImcArray
from zigzag.cost_model.cost_model import CostModelEvaluation
from zigzag.cost_model.cost_model_imc import CostModelEvaluationForIMC

In [15]:
#core_id = layer.core_allocation[0]
#core = accelerator.get_core(core_id)
operational_array = core.operational_array 



cme = CostModelEvaluationForIMC(
    accelerator=accelerator,
    layer=layer,
    spatial_mapping=spatial_mapping,
    spatial_mapping_int=spatial_mapping_internal,
    temporal_mapping=temporal_mapping,
    access_same_data_considered_as_no_access=True,
)

In [17]:
# The CME object now holds everything you need.
print_attributes(cme)

_abc_impl: <_abc._abc_data object at 0x7f777d92b900>
accelerator: Accelerator(aimc_toy)
access_same_data_considered_as_no_access: True
active_mem_level: {O: 2, W: 2, I: 2}
allowed_mem_update_cycle: {O: [4waydatamoving (rd ^: 1, wr v: 1, rd v: 1, wr ^: 1), 4waydatamoving (rd ^: 0, wr v: 0, rd v: 1, wr ^: 1)], W: [4waydatamoving (rd ^: 0, wr v: 1, rd v: 3844, wr ^: 0), 4waydatamoving (rd ^: 0, wr v: 0, rd v: 1, wr ^: 0)], I: [4waydatamoving (rd ^: 0, wr v: 1, rd v: 1, wr ^: 0), 4waydatamoving (rd ^: 0, wr v: 0, rd v: 1, wr ^: 0)]}
area_total: 0.4965462736
core: Core(1)
core_id: 1
cumulative_core_ids: []
cumulative_layer_ids: []
data_loading_cc_pair_combined_per_op: {W: [1], I: [1]}
data_loading_cc_per_op: {W: {'W0_rd_out_to_low': (1, False), 'W0_wr_in_by_high': (1, False), 'W1_rd_out_to_low': (1, True)}, I: {'I0_rd_out_to_low': (1, False), 'I0_wr_in_by_high': (1, False), 'I1_rd_out_to_low': (1, True)}}
data_loading_cycle: 1
data_loading_half_shared_part: {W: 1, I: 1}
data_loading_individ

# Conclusion

In the first part I covered everything you need to understand in order to re-implement the relevant parts of the codebase so that Asmae's design can be integrated seamlessly.
Once you have made these modifications, you can reuse parts of the hands-on example to run several benchmarks and extract their results.