# Develop a new VWR2A kernel using the VWR2A simulator
This notebook illustrates how to use the simulator both to decode existing VWR2A kernels (using instructions from the morphological filter erosion example) and to write your own kernels by translating human-readable parameters into hexadecimal ISA instructions for each specialized slot. At the end, we develop a working kernel that adds two vectors together.

Note: For now, only one-column kernels are supported.

In [2]:
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd 
import sys, os
from src import *
from helpers import *

# ISAs for specialized slots
First, we set up objects for each specialized slot of the VWR2A (i.e. kernel configuration, Load Store Unit, Reconfigurable Cells, etc.) and show how the instructions are decoded and encoded. For a detailed description of what each ISA field does in each specialized slot, please see ``vwr2a_docs/vwr2a_ISA.docx``.

### KERNEL CONFIGURATION
Set up an application to be accelerated by loading its kernel configurations into the kernel memory.

In [3]:
# Load an existing kernel into memory
kmem_pos = 1
kmem_word = 0x18026

kmem = KER_CONF()
kmem.set_word(kmem_word, kmem_pos)
kmem.get_kernel_info(kmem_pos)

This kernel uses 39 instruction words starting at IMEM address 0.
It uses column(s): both.
The SRF is located in SPM bank 0.


In [3]:
# Create a new kernel
kmem_pos = 2

# Kernel configuration parameters
num_instructions=1
imem_add_start=0
column_usage=1
srf_spm_address=0

kmem.set_params(num_instructions, imem_add_start, column_usage, srf_spm_address, kmem_pos)
print("Hex representation: " + kmem.get_word_in_hex(kmem_pos))
kmem.get_kernel_info(kmem_pos)

Hex representation: 0x8000
This kernel uses 1 instruction words starting at IMEM address 0.
It uses column(s): 0.
The SRF is located in SPM bank 0.
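
The hex words printed in this section suggest a simple bit-field layout for the kernel configuration word. The sketch below is inferred only from the example words in this notebook (`0x18026` and `0x8000`), not taken from the ISA document, so every field position and width in it is an assumption: the low 6 bits hold the instruction count minus one, the next 9 bits the IMEM start address, the next 2 bits the column usage, and the remaining bits the SRF SPM bank. The function names are hypothetical and are not part of the simulator.

```python
# Field widths inferred from the example words in this notebook.
# They are assumptions, not values taken from the ISA document.
N_INSTR_BITS = 6     # stores (number of instructions - 1)
IMEM_ADDR_BITS = 9   # IMEM start address
COL_BITS = 2         # 1 = column 0, 3 = both columns

ADDR_SHIFT = N_INSTR_BITS
COL_SHIFT = ADDR_SHIFT + IMEM_ADDR_BITS
SRF_SHIFT = COL_SHIFT + COL_BITS


def encode_kmem_word(num_instructions, imem_add_start, column_usage, srf_spm_address):
    """Pack the four kernel parameters into one integer word (hypothetical helper)."""
    word = (num_instructions - 1) & ((1 << N_INSTR_BITS) - 1)
    word |= (imem_add_start & ((1 << IMEM_ADDR_BITS) - 1)) << ADDR_SHIFT
    word |= (column_usage & ((1 << COL_BITS) - 1)) << COL_SHIFT
    word |= srf_spm_address << SRF_SHIFT
    return word


def decode_kmem_word(word):
    """Unpack a word into (num_instructions, imem_add_start, column_usage, srf_spm_address)."""
    num_instructions = (word & ((1 << N_INSTR_BITS) - 1)) + 1
    imem_add_start = (word >> ADDR_SHIFT) & ((1 << IMEM_ADDR_BITS) - 1)
    column_usage = (word >> COL_SHIFT) & ((1 << COL_BITS) - 1)
    srf_spm_address = word >> SRF_SHIFT
    return num_instructions, imem_add_start, column_usage, srf_spm_address


print(hex(encode_kmem_word(1, 0, 1, 0)))  # 0x8000
print(decode_kmem_word(0x18026))          # (39, 0, 3, 0)
```

Under these assumptions the sketch reproduces both words shown above. Treat `kmem.get_kernel_info` as the reference whenever the two disagree.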


### Loop Control Unit IMEM 

In [4]:
# Load an existing imem word and decode it
imem_pos = 1
imem_word = 0xd9c00

lcu_imem = LCU_IMEM()
lcu_imem.set_word(imem_word, imem_pos)
lcu_imem.get_instruction_info(imem_pos)

Immediate value: 0
LCU is in loop control mode
Exiting out of kernel
No LCU registers are being written


In [5]:
# Create a new instruction
imem_pos = 2

# Define instruction parameters
imm=3
rf_wsel=2
rf_we=1
alu_op=LCU_ALU_OPS.SADD
br_mode=0
muxb_sel=LCU_MUXB_SEL.ONE
muxa_sel=LCU_MUXA_SEL.ZERO

lcu_imem.set_params(imm, rf_wsel, rf_we, alu_op, br_mode, muxb_sel, muxa_sel, imem_pos)
print("Hex representation: " + lcu_imem.get_word_in_hex(imem_pos))
lcu_imem.get_instruction_info(imem_pos)

Hex representation: 0xdc383
Immediate value: 3
LCU is in loop control mode
Performing ALU operation SADD between operands ZERO and ONE
Writing ALU result to LCU register 2


### Load Store Unit IMEM 

In [6]:
# Load an existing imem word and decode it
imem_pos = 1
imem_word = 0x43d3f

lsu_imem = LSU_IMEM()
lsu_imem.set_word(imem_word, imem_pos)
lsu_imem.get_instruction_info(imem_pos)

Performing LOAD from SPM to VWR_A
Performing ALU operation SADD between operands R7 and ONE
Writing ALU result to LSU register 7


In [7]:
# Create a new instruction
imem_pos = 2

# Define instruction parameters
rf_wsel=2
rf_we=1
alu_op=LSU_ALU_OPS.SRL
muxb_sel=LSU_MUXA_SEL.R5
muxa_sel=LSU_MUXB_SEL.TWO
vwr_shuf_op=SHUFFLE_SEL.CONCAT_BITREV_UPPER
vwr_shuf_sel=LSU_OP_MODE.SHUFFLE

lsu_imem.set_params(rf_wsel, rf_we, alu_op, muxb_sel, muxa_sel, vwr_shuf_op, vwr_shuf_sel, imem_pos)
print("Hex representation: " + lsu_imem.get_word_in_hex(imem_pos))
lsu_imem.get_instruction_info(imem_pos)

Hex representation: 0xe5aea
Shuffling VWR A and B data into VWR C using operation CONCAT_BITREV_UPPER
Performing ALU operation SRL between operands TWO and R5
Writing ALU result to LSU register 2


### Multiplexer Control Unit IMEM

In [5]:
# Load an existing imem word and decode it
imem_pos = 1
imem_word = 0x1

mxcu_imem = MXCU_IMEM()
mxcu_imem.set_word(imem_word, imem_pos)
mxcu_imem.get_instruction_info(imem_pos)

Writing to VWR rows [0] of VWR_A
Reading from SRF index 0
Performing ALU operation NOP between operands R0 and R0
No MXCU registers are being written


In [7]:
# Create a new instruction
imem_pos = 2

# Define instruction parameters
vwr_row_we = [0, 0, 0, 1]
vwr_sel = MXCU_VWR_SEL.VWR_B.value
srf_sel = 3
alu_srf_write = ALU_SRF_WRITE.MXCU
srf_we = 1
rf_wsel = 0 
rf_we = 0 
alu_op =  MXCU_ALU_OPS.SADD
muxb_sel = MXCU_MUXB_SEL.R0
muxa_sel = MXCU_MUXA_SEL.TWO

mxcu_imem.set_params(vwr_row_we, vwr_sel, srf_sel, alu_srf_write, srf_we, rf_wsel, rf_we, alu_op, muxb_sel, muxa_sel, imem_pos)
mxcu_imem.get_instruction_info(imem_pos)

Writing to VWR rows [0] of VWR_B
Writing from MXCU ALU to SRF register 3
Performing ALU operation SADD between operands TWO and R0
No MXCU registers are being written


### Reconfigurable Cell IMEM

In [10]:
# Load an existing imem word and decode it
imem_pos = 1
imem_word = 0xe923

rc_imem = RC_IMEM()
rc_imem.set_word(imem_word, imem_pos)
rc_imem.get_instruction_info(imem_pos)

Performing ALU operation LOR between operands SRF and ZERO
ALU is performing operations with 32-bit precision
Writing ALU result to RC register 1


In [1]:
# Create a new instruction
imem_pos = 2

# Define instruction parameters
rf_wsel = 1 
rf_we = 1 
muxf_sel = RC_MUXF_SEL.RCT 
alu_op =  RC_ALU_OPS.INB_SF_INA
op_mode = 0 # Always keep this at zero; 16-bit mode is not supported yet
muxb_sel =  RC_MUXA_SEL.VWR_A
muxa_sel = RC_MUXA_SEL.VWR_B

rc_imem.set_params(rf_wsel, rf_we, muxf_sel, alu_op, op_mode, muxb_sel, muxa_sel, imem_pos)
rc_imem.get_instruction_info(imem_pos)

NameError: name 'RC_ALU_OPS' is not defined

## Putting it all together: Instruction memory

### Process existing kernel
Load an existing kernel (stored as a CSV table in which each row is one instruction memory entry and each column is a specialized slot) and use the simulator to understand what each element is doing at a given position.

In [4]:
kernel_path = "kernels/mf_q64_erosion/"
df = pd.read_csv(kernel_path + "instructions.csv")
print("The instruction memory has {0} entries.".format(len(df)))
df.head()

The instruction memory has 512 entries.


Unnamed: 0,LCU,LSU,MXCU,RC0,RC1,RC2,RC3,KMEM
0,0x0,0x5c49f,0x0,0x0,0x0,0x0,0x0,0x0
1,0x9c500,0x43d3f,0x180,0x0,0x0,0x0,0x0,0x802b
2,0x98fc0,0x4bd3f,0x40,0x0,0x0,0x0,0x0,0x0
3,0xf8f43,0x53c98,0x0,0x0,0x0,0x0,0x0,0x0
4,0xb8f80,0x539,0x4ce9000,0x0,0x0,0x0,0x0,0x0
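
The table above can also be inspected without the simulator. The following is a minimal sketch, using only the Python standard library, that parses the hex strings and lists which slots hold a non-zero word in each row. The rows are copied from the output above; the helper name `nonzero_slots` is hypothetical. Note that a non-zero word is not the same as an active instruction: the default LSU word printed for new kernels later in this notebook is `0x4c80`, not `0x0`.

```python
import csv
import io

# First rows of the erosion kernel's instructions.csv, copied from the output above.
CSV_TEXT = """,LCU,LSU,MXCU,RC0,RC1,RC2,RC3,KMEM
0,0x0,0x5c49f,0x0,0x0,0x0,0x0,0x0,0x0
1,0x9c500,0x43d3f,0x180,0x0,0x0,0x0,0x0,0x802b
2,0x98fc0,0x4bd3f,0x40,0x0,0x0,0x0,0x0,0x0
3,0xf8f43,0x53c98,0x0,0x0,0x0,0x0,0x0,0x0
4,0xb8f80,0x539,0x4ce9000,0x0,0x0,0x0,0x0,0x0
"""

SLOTS = ["LCU", "LSU", "MXCU", "RC0", "RC1", "RC2", "RC3"]


def nonzero_slots(csv_text):
    """Return, for each row, the list of slots whose word differs from 0x0."""
    result = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # int(..., 16) accepts the "0x" prefix used in the CSV.
        words = {slot: int(row[slot], 16) for slot in SLOTS}
        result.append([slot for slot in SLOTS if words[slot] != 0])
    return result


for pos, slots in enumerate(nonzero_slots(CSV_TEXT)):
    print(pos, slots)
```

For a per-slot explanation of what each word does, `imem.get_pos_summary` remains the tool to use.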


In [5]:
# Set an IMEM object and load kernel 1
imem = IMEM(df)
imem.load_kernel(1)

In [6]:
# Make sure that the last kernel value is the exit instruction
n_instr, imem_add, n_col, _ = imem.kmem.get_params(1)
if n_col == 3:
    lcu_instr = imem.lcu_imem_col0.get_instruction_info(2*n_instr + imem_add)
else:
    lcu_instr = imem.lcu_imem_col0.get_instruction_info(n_instr + imem_add)

Immediate value: 0
LCU is in loop control mode
Exiting out of kernel
No LCU registers are being written


In [7]:
# Print what's going on at a given imem position
imem.get_pos_summary(imem_pos=0, col_index=0)

****RC0****
No ALU operation
No RC registers are being written
****RC1****
No ALU operation
No RC registers are being written
****RC2****
No ALU operation
No RC registers are being written
****RC3****
No ALU operation
No RC registers are being written
****LSU****
Performing LOAD from SPM to SRF
Performing ALU operation LOR between operands SRF and ZERO
Writing ALU result to LSU register 7
****LCU****
Immediate value: 0
LCU is in loop control mode
No LCU ALU Operation is performed
No LCU registers are being written
****MXCU****
Not writing to VWRs
Reading from SRF index 0
Performing ALU operation NOP between operands R0 and R0
No MXCU registers are being written


### Load a 2-column kernel
Kernels can use either one column of the CGRA, or both in parallel. The FFT example uses both.

In [8]:
kernel_path = "kernels/fft/"
df = pd.read_csv(kernel_path + "instructions.csv")
print("The instruction memory has {0} entries.".format(len(df)))
df.head()

The instruction memory has 512 entries.


Unnamed: 0,LCU,LSU,MXCU,RC0,RC1,RC2,RC3,KMEM
0,0x0,0x5d49e,0x0,0x0,0x0,0x0,0x0,0x0
1,0x9c540,0x4c80,0x0,0x0,0x0,0x0,0x0,0x18026
2,0x98f00,0x453a,0x40,0x0,0x0,0x0,0x0,0x393b0
3,0xf8fc8,0x412c,0x353b180,0x0,0x0,0x0,0x0,0x0
4,0x18f80,0x4c9b,0x31db000,0x0,0x0,0x0,0x0,0x0


In [9]:
# Set an IMEM object and load kernel 1
imem = IMEM(df)
imem.load_kernel(1)
imem.kmem.get_kernel_info(1)

This kernel uses 39 instruction words starting at IMEM address 0.
It uses column(s): both.
The SRF is located in SPM bank 0.


In [10]:
# Note that now, the second column is populated with non-default instructions
imem.get_pos_summary(imem_pos=0, col_index=1)

****RC0****
No ALU operation
No RC registers are being written
****RC1****
No ALU operation
No RC registers are being written
****RC2****
No ALU operation
No RC registers are being written
****RC3****
No ALU operation
No RC registers are being written
****LSU****
No loading, storing, or shuffling taking place
Performing ALU operation LAND between operands ZERO and ZERO
No LSU registers are being written
****LCU****
Immediate value: 0
LCU is in loop control mode
No LCU ALU Operation is performed
No LCU registers are being written
****MXCU****
Not writing to VWRs
Reading from SRF index 0
Performing ALU operation NOP between operands R0 and R0
No MXCU registers are being written


### Create a new kernel
Write your own kernel by populating the instruction memory of each specialized slot one at a time. In this example, we develop a very simple kernel whose only instruction is an exit. The resulting instruction words of each element are collected into a pandas dataframe and saved to a CSV file and a C header file, which can then be copied into a VWR2A testbench.

In [33]:
# Set up kernel
imem = IMEM()

num_instructions = 1
imem_add_start = 0
column_usage = 1
srf_spm_address = 0
pos = 1

imem.kmem.set_params(num_instructions, imem_add_start, column_usage, srf_spm_address, pos)
imem.kmem.get_kernel_info(pos)

This kernel uses 1 instruction words starting at IMEM address 0.
It uses column(s): 0.
The SRF is located in SPM bank 0.


In [36]:
# Write LCU instruction
imem.lcu_imem_col0.set_params(alu_op=LCU_ALU_OPS.EXIT, pos=0)

In [39]:
# Get instruction dataframe
df_out = imem.get_df()
df_out.head()

Unnamed: 0,LCU,LSU,MXCU,RC0,RC1,RC2,RC3,KMEM
0,0x1c00,0x4c80,0x0,0x0,0x0,0x0,0x0,0x0
1,0x0,0x4c80,0x0,0x0,0x0,0x0,0x0,0x8000
2,0x0,0x4c80,0x0,0x0,0x0,0x0,0x0,0x0
3,0x0,0x4c80,0x0,0x0,0x0,0x0,0x0,0x0
4,0x0,0x4c80,0x0,0x0,0x0,0x0,0x0,0x0


In [40]:
# Save to file
kernel_path = 'kernels/exit/'
# Save CSV of instructions to be easily re-loaded and fixed
df_out.to_csv(kernel_path + 'instructions.csv')
# Save header file used to load the instructions into a VWR2A RTL testbench for a functional simulation
dataframe_to_header_file(df_out, kernel_path)
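
To give a feel for what a header export involves, here is a minimal sketch that formats one slot's hex words as a C array. It is not the output format of `dataframe_to_header_file`, which this notebook does not show. The array name, the `uint32_t` element type, and the helper name `words_to_c_array` are all assumptions made for illustration.

```python
def words_to_c_array(name, hex_words):
    """Format hex strings such as "0x1c00" as a C array definition.

    Hypothetical layout: the real helper in this repository may differ.
    """
    values = [int(word, 16) for word in hex_words]
    body = ",\n".join("  0x{:08x}".format(value) for value in values)
    return "uint32_t {}[{}] = {{\n{}\n}};\n".format(name, len(values), body)


# First two LCU words of the exit kernel shown above.
print(words_to_c_array("lcu_imem", ["0x1c00", "0x0"]))
```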

### A more complicated example
Now that we know how to populate the IMEMs of the specialized slots, we will write a more complicated kernel that adds two vectors together. On the host processor side, the SRF is loaded into SPM bank 0. The first vector is loaded into SPM bank 1, the second vector in bank 2, and the result vector will be read from bank 3.
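
Before writing the instructions, it helps to have a software reference to compare the testbench result against. The sketch below models the kernel as 32 steps in which 4 RCs each add one element, giving 128 elements in total. The mapping of RC and step to element index, the vector length, and the absence of 32-bit overflow handling are assumptions made for illustration; the function name `vwr_add_model` is hypothetical.

```python
N_RCS = 4     # RC0..RC3 each process one element per step
N_STEPS = 32  # the add instruction is issued 32 times in this kernel


def vwr_add_model(vwr_a, vwr_b):
    """Element-wise add of two vectors, iterated the way the kernel steps through them.

    Assumes each RC owns a contiguous block of N_STEPS elements and that the
    MXCU register R0 selects the element inside that block. Overflow is ignored.
    """
    n_elems = N_RCS * N_STEPS
    if len(vwr_a) != n_elems or len(vwr_b) != n_elems:
        raise ValueError("expected vectors of length {}".format(n_elems))
    vwr_c = [0] * n_elems
    for idx in range(N_STEPS):      # value of MXCU R0 at this step
        for rc in range(N_RCS):     # the four RCs work in parallel in hardware
            elem = rc * N_STEPS + idx
            vwr_c[elem] = vwr_a[elem] + vwr_b[elem]
    return vwr_c


a = list(range(128))
b = [2 * x for x in range(128)]
print(vwr_add_model(a, b)[:4])  # [0, 3, 6, 9]
```

Because the operation is a plain element-wise add, the expected result does not depend on the assumed element ordering; only the iteration structure does.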

In [41]:
imem = IMEM()

In [42]:
##### Instruction 0 #######
pos=0

# LSU: Load SPM address zero into SRF. Set R7 (the next SPM address) to one.
rf_wsel=7
rf_we=1
alu_op=LSU_ALU_OPS.LOR
muxb_sel=LSU_MUXA_SEL.ZERO
muxa_sel=LSU_MUXB_SEL.ONE
vwr_shuf_op=LSU_VWR_SEL.SRF
vwr_shuf_sel=LSU_OP_MODE.LOAD

print("***LSU***")
imem.lsu_imem_col0.set_params(rf_wsel, rf_we, alu_op, muxb_sel, muxa_sel, vwr_shuf_op, vwr_shuf_sel, pos)
imem.lsu_imem_col0.get_instruction_info(pos)

***LSU***
Performing LOAD from SPM to SRF
Performing ALU operation LOR between operands ONE and ZERO
Writing ALU result to LSU register 7


In [43]:
##### Instruction 1 #######
pos=1

# LSU: Load SPM address one (vector 1) into VWR_A. Set R7 (the next SPM address) to two.
rf_wsel=7
rf_we=1
alu_op=LSU_ALU_OPS.SADD
muxb_sel=LSU_MUXA_SEL.R7
muxa_sel=LSU_MUXB_SEL.ONE
vwr_shuf_op=LSU_VWR_SEL.VWR_A
vwr_shuf_sel=LSU_OP_MODE.LOAD

print("***LSU***")
imem.lsu_imem_col0.set_params(rf_wsel, rf_we, alu_op, muxb_sel, muxa_sel, vwr_shuf_op, vwr_shuf_sel, pos)
imem.lsu_imem_col0.get_instruction_info(pos)


***LSU***
Performing LOAD from SPM to VWR_A
Performing ALU operation SADD between operands ONE and R7
Writing ALU result to LSU register 7


In [45]:
##### Instruction 2 #######
pos=2

# LSU: Load SPM address two (vector 2) into VWR_B. Set R7 (the next SPM address) to three for reading result later.
rf_wsel=7
rf_we=1
alu_op=LSU_ALU_OPS.SADD
muxb_sel=LSU_MUXA_SEL.R7
muxa_sel=LSU_MUXB_SEL.ONE
vwr_shuf_op=LSU_VWR_SEL.VWR_B
vwr_shuf_sel=LSU_OP_MODE.LOAD

print("***LSU***")
imem.lsu_imem_col0.set_params(rf_wsel=rf_wsel, rf_we=rf_we, alu_op=alu_op, muxa_sel=muxa_sel, muxb_sel=muxb_sel, vwr_shuf_op=vwr_shuf_op, vwr_shuf_sel=vwr_shuf_sel, pos=pos)
imem.lsu_imem_col0.get_instruction_info(pos)

# MXCU: Set R0 to 0 (first index of VWR slice) to begin adding respective vector indices.
rf_wsel = 0 
rf_we = 1 
alu_op =  MXCU_ALU_OPS.LOR
muxb_sel = MXCU_MUXB_SEL.ZERO
muxa_sel = MXCU_MUXA_SEL.ZERO

print("***MXCU***")
imem.mxcu_imem_col0.set_params(rf_wsel=rf_wsel, rf_we=rf_we, alu_op=alu_op, muxb_sel=muxb_sel, muxa_sel=muxa_sel, pos=pos)
imem.mxcu_imem_col0.get_instruction_info(pos)

***LSU***
Performing LOAD from SPM to VWR_B
Performing ALU operation SADD between operands ONE and R7
Writing ALU result to LSU register 7
***MXCU***
Not writing to VWRs
Reading from SRF index 0
Performing ALU operation LOR between operands ZERO and ZERO
Writing ALU result to MXCU register 0


In [46]:
##### Instruction 3 #######
pos=3

# RC0: Add VWR A and VWR B
alu_op =  RC_ALU_OPS.SADD
muxa_sel =  RC_MUXA_SEL.VWR_A
muxb_sel = RC_MUXB_SEL.VWR_B

print("***RC0***")
imem.rc0_imem_col0.set_params(alu_op=alu_op, muxb_sel=muxb_sel, muxa_sel=muxa_sel, pos=pos)
imem.rc0_imem_col0.get_instruction_info(pos)

# RC1: Add VWR A and VWR B
print("***RC1***")
imem.rc1_imem_col0.set_params(alu_op=alu_op, muxb_sel=muxb_sel, muxa_sel=muxa_sel, pos=pos)
imem.rc1_imem_col0.get_instruction_info(pos)

# RC2: Add VWR A and VWR B
print("***RC2***")
imem.rc2_imem_col0.set_params(alu_op=alu_op, muxb_sel=muxb_sel, muxa_sel=muxa_sel, pos=pos)
imem.rc2_imem_col0.get_instruction_info(pos)

# RC3: Add VWR A and VWR B
print("***RC3***")
imem.rc3_imem_col0.set_params(alu_op=alu_op, muxb_sel=muxb_sel, muxa_sel=muxa_sel, pos=pos)
imem.rc3_imem_col0.get_instruction_info(pos)

# MXCU: Enable writing result to VWR C, increment R0 (index of the VWR slice)
vwr_row_we = [1, 1, 1, 1]
vwr_sel = MXCU_VWR_SEL.VWR_C
srf_sel = 0
alu_srf_write = ALU_SRF_WRITE.MXCU
srf_we = 0
rf_wsel = 0 
rf_we = 1 
alu_op =  MXCU_ALU_OPS.SADD
muxb_sel = MXCU_MUXB_SEL.R0
muxa_sel = MXCU_MUXA_SEL.ONE

print("***MXCU***")
imem.mxcu_imem_col0.set_params(vwr_row_we, vwr_sel, srf_sel, alu_srf_write, srf_we, rf_wsel, rf_we, alu_op, muxb_sel, muxa_sel, pos)
imem.mxcu_imem_col0.get_instruction_info(pos)

***RC0***
Performing ALU operation SADD between operands VWR_A and VWR_B
ALU is performing operations with 32-bit precision
No RC registers are being written
***RC1***
Performing ALU operation SADD between operands VWR_A and VWR_B
ALU is performing operations with 32-bit precision
No RC registers are being written
***RC2***
Performing ALU operation SADD between operands VWR_A and VWR_B
ALU is performing operations with 32-bit precision
No RC registers are being written
***RC3***
Performing ALU operation SADD between operands VWR_A and VWR_B
ALU is performing operations with 32-bit precision
No RC registers are being written
***MXCU***
Writing to VWR rows [0 1 2 3] of VWR_C
Reading from SRF index 0
Performing ALU operation SADD between operands ONE and R0
Writing ALU result to MXCU register 0


In [47]:
##### Instructions 4 to 34 ####### (Repeat the previous instruction 31 more times)
for pos in range(4,35):

    # RC0: Add VWR A and VWR B
    alu_op =  RC_ALU_OPS.SADD
    muxa_sel =  RC_MUXA_SEL.VWR_A
    muxb_sel = RC_MUXB_SEL.VWR_B

    imem.rc0_imem_col0.set_params(alu_op=alu_op, muxb_sel=muxb_sel, muxa_sel=muxa_sel, pos=pos)

    # RC1: Add VWR A and VWR B
    imem.rc1_imem_col0.set_params(alu_op=alu_op, muxb_sel=muxb_sel, muxa_sel=muxa_sel, pos=pos)

    # RC2: Add VWR A and VWR B
    imem.rc2_imem_col0.set_params(alu_op=alu_op, muxb_sel=muxb_sel, muxa_sel=muxa_sel, pos=pos)

    # RC3: Add VWR A and VWR B
    imem.rc3_imem_col0.set_params(alu_op=alu_op, muxb_sel=muxb_sel, muxa_sel=muxa_sel, pos=pos)
    # MXCU: Enable writing result to VWR C and increment R0 (index of the VWR slice); srf_we = 0, so nothing is written to the SRF
    vwr_row_we = [1, 1, 1, 1]
    vwr_sel = MXCU_VWR_SEL.VWR_C
    srf_sel = 0
    alu_srf_write = ALU_SRF_WRITE.MXCU
    srf_we = 0
    rf_wsel = 0 
    rf_we = 1 
    alu_op =  MXCU_ALU_OPS.SADD
    muxb_sel = MXCU_MUXB_SEL.R0
    muxa_sel = MXCU_MUXA_SEL.ONE

    imem.mxcu_imem_col0.set_params(vwr_row_we, vwr_sel, srf_sel, alu_srf_write, srf_we, rf_wsel, rf_we, alu_op, muxb_sel, muxa_sel, pos)

In [48]:
##### Instruction 35 #######
pos=35

# LSU: Store VWR_C into SPM address 3 (stored in R7).
rf_wsel=0
rf_we=0
alu_op=LSU_ALU_OPS.LAND
muxb_sel=LSU_MUXA_SEL.R0
muxa_sel=LSU_MUXB_SEL.R0
vwr_shuf_op=LSU_VWR_SEL.VWR_C
vwr_shuf_sel=LSU_OP_MODE.STORE

print("***LSU***")
imem.lsu_imem_col0.set_params(rf_wsel, rf_we, alu_op, muxb_sel, muxa_sel, vwr_shuf_op, vwr_shuf_sel, pos)
imem.lsu_imem_col0.get_instruction_info(pos)

***LSU***
Performing STORE to SPM from VWR_C
Performing ALU operation LAND between operands R0 and R0
No LSU registers are being written


In [49]:
##### Instruction 36 #######
pos = 36

# LCU: Exit
print("***LCU***")
imem.lcu_imem_col0.set_params(alu_op=LCU_ALU_OPS.EXIT, pos=pos)
imem.lcu_imem_col0.get_instruction_info(pos)

***LCU***
Immediate value: 0
LCU is in loop control mode
Exiting out of kernel
No LCU registers are being written


In [50]:
#### Kernel configuration ###
num_instructions = pos+1
imem_add_start = 0
column_usage = 1
srf_spm_address = 0
pos = 1

imem.kmem.set_params(num_instructions, imem_add_start, column_usage, srf_spm_address, pos)
imem.kmem.get_kernel_info(pos)

This kernel uses 37 instruction words starting at IMEM address 0.
It uses column(s): 0.
The SRF is located in SPM bank 0.


In [51]:
# Get instruction dataframe
df_out = imem.get_df()
df_out[:num_instructions]

Unnamed: 0,LCU,LSU,MXCU,RC0,RC1,RC2,RC3,KMEM
0,0x0,0x5d49f,0x0,0x0,0x0,0x0,0x0,0x0
1,0x0,0x453bf,0x0,0x0,0x0,0x0,0x0,0x8024
2,0x0,0x4d3bf,0x4ce8000,0x0,0x0,0x0,0x0,0x0
3,0x0,0x4c80,0x501842f,0x420,0x420,0x420,0x420,0x0
4,0x0,0x4c80,0x501842f,0x420,0x420,0x420,0x420,0x0
5,0x0,0x4c80,0x501842f,0x420,0x420,0x420,0x420,0x0
6,0x0,0x4c80,0x501842f,0x420,0x420,0x420,0x420,0x0
7,0x0,0x4c80,0x501842f,0x420,0x420,0x420,0x420,0x0
8,0x0,0x4c80,0x501842f,0x420,0x420,0x420,0x420,0x0
9,0x0,0x4c80,0x501842f,0x420,0x420,0x420,0x420,0x0


In [52]:
# Save to file
kernel_path = 'kernels/add_vectors/'
# Save CSV of instructions to be easily re-loaded and fixed
df_out.to_csv(kernel_path + 'instructions.csv')
# Save header file used to load the instructions into a VWR2A RTL testbench for a functional simulation
dataframe_to_header_file(df_out, kernel_path)

### Now, your turn!
Either use the example above as a starting point for a new kernel, or see how this one can be optimized. Can you think of a way to avoid repeating the add instruction 32 times? (HINT: use the LCU.)
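
As a starting point for the optimization, it can help to locate the repeated rows automatically. The sketch below is a plain-Python run-length scan over instruction rows; it only finds candidates for a loop and does not show how to program the LCU. The rows follow the `add_vectors` table above, with rows 3 to 34 holding the repeated add, and the helper name `repeated_runs` is hypothetical.

```python
from itertools import groupby


def repeated_runs(rows, min_len=2):
    """Return (start_index, run_length) for each run of identical consecutive rows."""
    runs = []
    pos = 0
    for _, group in groupby(rows):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, length))
        pos += length
    return runs


# Rows as (LCU, LSU, MXCU, RC0, RC1, RC2, RC3) tuples from the add_vectors kernel.
load_rows = [
    ("0x0", "0x5d49f", "0x0", "0x0", "0x0", "0x0", "0x0"),
    ("0x0", "0x453bf", "0x0", "0x0", "0x0", "0x0", "0x0"),
    ("0x0", "0x4d3bf", "0x4ce8000", "0x0", "0x0", "0x0", "0x0"),
]
add_row = ("0x0", "0x4c80", "0x501842f", "0x420", "0x420", "0x420", "0x420")
rows = load_rows + [add_row] * 32

print(repeated_runs(rows))  # [(3, 32)]
```

Each reported run is a block of instruction memory that a loop could, in principle, replace with a single copy of the instruction plus loop-control logic.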