## Input and output buffers

By setting `C_SG_LENGTH_WIDTH`/`c_sg_length_width` to 22,
the DMA IP is able to transfer a maximum of $2^{22}$ bytes (4 MiB) in a single transfer.
This means we can transfer at most about 1 million ($2^{20}$) 32-bit integers at once.
In this example, we test 10000 integers.
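
The size limit can be checked with a few lines of arithmetic. This is only a sketch: the variable names `length_width` and `bytes_per_int` are introduced here for illustration and are not part of the PYNQ API.

```python
# Length-register width configured for the DMA IP (from the text above).
length_width = 22
# A 32-bit integer occupies 4 bytes.
bytes_per_int = 4

max_bytes = 2 ** length_width          # largest transfer size in bytes
max_ints = max_bytes // bytes_per_int  # how many 32-bit integers fit

num_data = 10000
print(max_bytes)                 # 4194304
print(max_ints)                  # 1048576
print(num_data * bytes_per_int)  # 40000, well below the limit
```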

In [1]:
from time import time

num_data = 10000

We will reuse the same input buffer for both adders. The results are saved
into different output buffers.

In [2]:
import numpy as np
from pynq import Xlnk

xlnk = Xlnk()
in_buffer = xlnk.cma_array(shape=(num_data,), dtype=np.uint32, cacheable=1)
out_buffer0 = xlnk.cma_array(shape=(num_data,), dtype=np.uint32, cacheable=1)
out_buffer1 = xlnk.cma_array(shape=(num_data,), dtype=np.uint32, cacheable=1)

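Now we populate the input buffer with random integers. Each value packs two small operands into one word: one in bits 4 and above, the other in the low 4 bits.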

In [3]:
for i in range(num_data):
    in_buffer[i] = (np.random.randint(0,15)<<4)+np.random.randint(0,15)
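
To make the packing easier to follow, here is a software-only sketch that builds the same kind of values with vectorized NumPy calls and then splits them again. It uses a plain NumPy array instead of a CMA buffer, and the names `a`, `b`, `packed`, `upper`, and `lower` are illustrative. It assumes, as the packing suggests, that the two operands are the upper and lower fields of each word.

```python
import numpy as np

num_data = 10000
rng = np.random.default_rng(0)

# Two operands per input word, each in the range 0..14 (high bound is exclusive).
a = rng.integers(0, 15, size=num_data, dtype=np.uint32)
b = rng.integers(0, 15, size=num_data, dtype=np.uint32)

# Pack: first operand in bits 4 and above, second operand in the low 4 bits.
packed = (a << 4) + b

# Unpack again to recover both operands.
upper = packed >> 4
lower = packed & 0xF

print(np.array_equal(upper, a), np.array_equal(lower, b))
```

With a real CMA buffer, the packed values could be copied in with a slice assignment such as `in_buffer[:] = packed`, since the buffer behaves like a NumPy array.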

## 1. Adder using AXI Lite interface
![AXI Lite adder design](images/axi_adder.PNG)

In [4]:
from pynq import Overlay

ol_add = Overlay('add.bit')
adder_ip = ol_add.axi_adder_0

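We create a thin wrapper to call the AXI adder. For each input, it writes the two operands to the registers at offsets `0x0` and `0x4`, then reads the result from offset `0x8`.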

In [5]:
def add_data():
    for i in range(len(in_buffer)):
        adder_ip.write(0x0, int(in_buffer[i]>>4))
        adder_ip.write(0x4, int(in_buffer[i]))
        out_buffer0[i] = adder_ip.read(0x8)
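
The access pattern above costs three register transactions per addition. The sketch below counts them using `FakeAdder`, a hypothetical software stand-in written for this illustration. It is not a PYNQ class, and its assumption that offset `0x8` returns the plain sum of the two operand registers may not match the real IP exactly.

```python
class FakeAdder:
    """Hypothetical stand-in for the memory-mapped adder (not a PYNQ class)."""

    def __init__(self):
        self.regs = {0x0: 0, 0x4: 0}
        self.transactions = 0

    def write(self, offset, value):
        # Each register write is one bus transaction.
        self.regs[offset] = value
        self.transactions += 1

    def read(self, offset):
        # Each register read is one bus transaction.
        self.transactions += 1
        # Assumption: offset 0x8 holds the sum of the two operand registers.
        return self.regs[0x0] + self.regs[0x4]


adder = FakeAdder()
inputs = [(3, 4), (10, 5), (0, 14)]
results = []
for a, b in inputs:
    adder.write(0x0, a)
    adder.write(0x4, b)
    results.append(adder.read(0x8))

print(results)             # [7, 15, 14]
print(adder.transactions)  # 9, i.e. three per addition
```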

To verify the results, you can run something similar to:
```python
for i in range(10):
    adder_ip.write(0x0, i)
    adder_ip.write(0x4, i)
    print(adder_ip.read(0x8))
```

In the following cell, we record the time taken to perform 10000 additions.

In [6]:
t1 = time()
add_data()
t2 = time()
t_add = t2-t1
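
The same timing pattern can be wrapped in a small helper. This is a sketch only: `timed` is a hypothetical name, and it uses `time.perf_counter` from the standard library, a high-resolution clock intended for measuring short durations, in place of `time()`.

```python
from time import perf_counter


def timed(fn, *args, **kwargs):
    """Hypothetical helper: call fn and return (result, elapsed seconds)."""
    start = perf_counter()
    result = fn(*args, **kwargs)
    return result, perf_counter() - start


# Example with a pure-software workload instead of the hardware adder.
total, elapsed = timed(sum, range(10000))
print(total)  # 49995000
print(f'elapsed: {elapsed:.6f} s')
```

In the notebook, a call such as `_, t_add = timed(add_data)` would follow the same pattern.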

## 2. Adder using AXI Stream interface
![AXI Stream adder design](images/axis_adder.PNG)

In [7]:
from pynq import Overlay

ol_adds = Overlay('adds.bit')
dma_ip = ol_adds.axi_dma_0

As with the AXI adder, we create a thin wrapper to call the AXIS adder.

In [8]:
def adds_data():
    dma_ip.sendchannel.transfer(in_buffer)
    dma_ip.recvchannel.transfer(out_buffer1)
    dma_ip.sendchannel.wait()
    dma_ip.recvchannel.wait()

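You can use the following code to test the IP as well. Note that it overwrites the first 10 entries of `in_buffer`, so rerun `add_data()` afterwards if you still want the two output buffers to match in the final comparison.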
```python
for i in range(10):
    in_buffer[i] = (i<<4)+i+1

adds_data()

for i in range(10):
    print(out_buffer1[i])
```

Again, we record the time taken to perform 10000 additions.

In [9]:
t1 = time()
adds_data()
t2 = time()
t_adds = t2-t1

## Performance comparison
We first check that the two output buffers hold the same results,
then compare the run times.

As the output of the following cell shows, the adder based on AXIS interfaces
is much faster: about 0.0037 s versus 1.86 s for the AXI Lite adder,
roughly a 500x speedup in this run.

In [10]:
for i in range(num_data):
    assert out_buffer0[i] == out_buffer1[i], \
        f'unmatched data: {out_buffer0[i]}!={out_buffer1[i]} '\
        f'for {i}-th input {in_buffer[i]}.'

print('time used for axi lite adder: {}'.format(t_add))
print('time used for axi stream adder: {}'.format(t_adds))

time used for axi lite adder: 1.8603198528289795
time used for axi stream adder: 0.003721952438354492
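
The gap between the two timings can be put into numbers. The sketch below reuses the two values printed above and derives the speedup and the average time per addition. The variable names `speedup`, `us_per_add_lite`, and `us_per_add_stream` are illustrative, and the figures will vary from run to run and board to board.

```python
num_data = 10000
t_add = 1.8603198528289795      # AXI Lite time reported above, in seconds
t_adds = 0.003721952438354492   # AXI Stream time reported above, in seconds

# Ratio of the two run times.
speedup = t_add / t_adds

# Average time per addition, in microseconds.
us_per_add_lite = t_add / num_data * 1e6
us_per_add_stream = t_adds / num_data * 1e6

print(f'speedup: {speedup:.0f}x')
print(f'AXI Lite: {us_per_add_lite:.1f} us per addition')
print(f'AXI Stream: {us_per_add_stream:.2f} us per addition')
```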


Free the memory at the end.

In [11]:
xlnk.xlnk_reset()

Hope you enjoy this example! Have a good day!