# Matrix Multiplication

This notebook demonstrates the use of a hardware overlay to accelerate a floating-point matrix multiplication.
The overlay implements the matrix product $\mathbf{C} = \mathbf{A}\mathbf{B}$,
where $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are $128 \times 128$ matrices.
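
For reference, and assuming the standard definition of the matrix product (the notebook does not spell it out), each entry of $\mathbf{C}$ is the inner product of a row of $\mathbf{A}$ with a column of $\mathbf{B}$:

$$
C_{ij} = \sum_{k=1}^{128} A_{ik} B_{kj}, \qquad 1 \le i, j \le 128
$$

Computed directly, this takes $128^3 = 2{,}097{,}152$ multiplications and roughly as many additions.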


In [1]:
from pynq import Overlay, allocate
import numpy as np

## Load the overlay

Program the FPGA and obtain references to the required hardware blocks.

In [2]:
overlay = Overlay('./matmult.bit')

dma = overlay.dma
mmult_ip = overlay.accel

## Allocate memory for the DMA transfers

In [3]:
DIM = 128
in_buffer = allocate(shape=(2, DIM, DIM), dtype=np.float32, cacheable=False)
out_buffer = allocate(shape=(DIM, DIM), dtype=np.float32, cacheable=False)
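
The size of the two buffers follows directly from their shapes and from the 4-byte width of `np.float32`. The following sketch only does that arithmetic; `BYTES_PER_FLOAT32` is a helper name introduced here for illustration.

```python
DIM = 128
BYTES_PER_FLOAT32 = 4  # np.float32 is a 32-bit type

# in_buffer holds both input matrices, out_buffer holds the result
in_bytes = 2 * DIM * DIM * BYTES_PER_FLOAT32
out_bytes = DIM * DIM * BYTES_PER_FLOAT32

print(f"input:  {in_bytes} bytes ({in_bytes // 1024} KiB)")
print(f"output: {out_bytes} bytes ({out_bytes // 1024} KiB)")
print(f"total:  {in_bytes + out_bytes} bytes per kernel run")
```

Each kernel run therefore moves 192 KiB across the DMA: 128 KiB to the FPGA and 64 KiB back.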


## Matrix multiplication in hardware (PL side)

Running the algorithm on the hardware kernel includes the round-trip data transfer (processor to FPGA, then FPGA back to processor). This data transfer is usually the performance bottleneck.

In [4]:
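CTRL_REG = 0x00            # offset of the accelerator's control register
AP_START = (1<<0)          # bit 0: start the kernel
AUTO_RESTART = (1<<7)      # bit 7: restart automatically after each run

def run_kernel():
    """Send both input matrices to the accelerator and wait for the result."""
    dma.sendchannel.transfer(in_buffer)   # processor -> FPGA
    dma.recvchannel.transfer(out_buffer)  # FPGA -> processor
    mmult_ip.write(CTRL_REG, (AP_START | AUTO_RESTART))  # start the kernel
    dma.sendchannel.wait()
    dma.recvchannel.wait()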
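
To make the value written to the control register easier to read, the sketch below decodes it into named bits. It assumes the accelerator exposes the control register layout that Vivado HLS generates for AXI4-Lite kernels (`ap_start`, `ap_done`, `ap_idle`, `ap_ready`, `auto_restart`); the notebook itself only names bits 0 and 7, and `decode_ctrl` is a hypothetical helper.

```python
# Bit positions assumed from the Vivado HLS AXI4-Lite control register layout.
CTRL_BITS = {
    "ap_start": 0,
    "ap_done": 1,
    "ap_idle": 2,
    "ap_ready": 3,
    "auto_restart": 7,
}

def decode_ctrl(value):
    """Return the names of the control bits that are set in `value`."""
    return [name for name, bit in CTRL_BITS.items() if value & (1 << bit)]

AP_START = 1 << 0
AUTO_RESTART = 1 << 7

word = AP_START | AUTO_RESTART
print(hex(word))          # 0x81
print(decode_ctrl(word))  # ['ap_start', 'auto_restart']
```

On the board, the same helper could be applied to `mmult_ip.read(CTRL_REG)` to inspect the kernel state, provided the layout assumption holds for this IP.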

Create example matrices to evaluate the kernel.

In [5]:
A = np.random.rand(DIM, DIM).astype(dtype=np.float32)
B = np.random.rand(DIM, DIM).astype(dtype=np.float32)

in_buffer[:] = np.stack((A, B))
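
The layout produced by `np.stack((A, B))` can be checked with plain NumPy, without the board. The sketch assumes that the DMA streams the buffer in memory order, in which case the kernel receives all of `A` followed by all of `B`; `stacked` and `flat` are names introduced here.

```python
import numpy as np

DIM = 128
A = np.random.rand(DIM, DIM).astype(np.float32)
B = np.random.rand(DIM, DIM).astype(np.float32)

stacked = np.stack((A, B))  # new leading axis: shape (2, DIM, DIM)
flat = stacked.ravel()      # elements in row-major (memory) order

print(stacked.shape)
# The first DIM*DIM elements are A row by row, the remainder is B
print(np.array_equal(flat[:DIM * DIM], A.ravel()))
print(np.array_equal(flat[DIM * DIM:], B.ravel()))
```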

Measure the execution time.

In [6]:
%%timeit
run_kernel()

100 loops, best of 3: 2.01 ms per loop


## Matrix multiplication in software (PS side)

The hardware implementation is compared against NumPy.
Note that NumPy typically delegates this product to an optimized BLAS routine, which is presumably more efficient than the naive $O(n^3)$ algorithm implemented in the hardware kernel.
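
For readers unfamiliar with the naive $O(n^3)$ algorithm, here is a minimal pure-Python sketch of the textbook triple loop. It is an illustration only: the source of the hardware kernel is not shown in this notebook, and `matmul_naive` is a name introduced here.

```python
def matmul_naive(a, b):
    """Multiply two matrices given as lists of rows, using a triple loop."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):          # rows of a
        for j in range(p):      # columns of b
            acc = 0.0
            for k in range(m):  # inner product of row i and column j
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c

print(matmul_naive([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
# → [[19.0, 22.0], [43.0, 50.0]]
```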

In [7]:
%timeit A @ B

100 loops, best of 3: 5.73 ms per loop
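
The two timings reported above can be turned into a speedup and an approximate throughput. The sketch counts $2n^3$ floating-point operations per product, which is a common convention rather than something stated in this notebook. The hardware figure includes the DMA round trip.

```python
DIM = 128
t_hw = 2.01e-3  # seconds per run_kernel() call, from the %%timeit output
t_sw = 5.73e-3  # seconds per A @ B, from the %timeit output

speedup = t_sw / t_hw
flops = 2 * DIM ** 3  # one multiply and one add per inner-loop step (approximation)

print(f"speedup: {speedup:.2f}x")
print(f"hardware: {flops / t_hw / 1e9:.2f} GFLOP/s")
print(f"software: {flops / t_sw / 1e9:.2f} GFLOP/s")
```

With these measurements, the overlay is about 2.85 times faster than NumPy on the processor.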


## Verify correctness

In [8]:
np.array_equal(A @ B, out_buffer)

True
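
Exact equality holds in this run, but in general two floating-point implementations agree bit for bit only if they round and accumulate identically. A tolerance-based comparison is a more forgiving alternative. The sketch below runs without the board: `candidate` is a stand-in for `out_buffer`, computed in a different precision to mimic a second implementation, and the tolerances are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((128, 128), dtype=np.float32)
B = rng.random((128, 128), dtype=np.float32)

reference = A @ B
# Stand-in for the hardware result: accumulate in float64, then round to float32
candidate = (A.astype(np.float64) @ B.astype(np.float64)).astype(np.float32)

# Accept small relative and absolute deviations instead of demanding identical bits
print(np.allclose(reference, candidate, rtol=1e-4, atol=1e-6))
```

On the board, the same call would take `out_buffer` in place of `candidate`.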