# Matrix Multiplication

This notebook demonstrates the use of a hardware overlay to accelerate a floating-point matrix multiplication.
The overlay implements the matrix product $\mathbf{C} = \mathbf{A}\mathbf{B}$,
where $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are $64 \times 64$ single-precision floating-point matrices.
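
For reference, each entry of the product is $c_{ij} = \sum_{k} a_{ik} b_{kj}$. The following is a minimal software sketch of that definition, not the overlay's implementation; the helper name `matmul_reference` and the small 4×4 test matrices are assumptions made for illustration.

```python
import numpy as np

def matmul_reference(A, B):
    """Naive triple-loop product C = A B, written out entry by entry."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=np.float32)
    for i in range(n):
        for j in range(m):
            acc = np.float32(0.0)
            for p in range(k):
                acc += A[i, p] * B[p, j]  # one multiply and one add per term
            C[i, j] = acc
    return C

rng = np.random.default_rng(0)
A_small = rng.random((4, 4), dtype=np.float32)
B_small = rng.random((4, 4), dtype=np.float32)
C_small = matmul_reference(A_small, B_small)
```

The result can be compared against `A_small @ B_small` with a tolerance-based check such as `np.allclose`.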


In [48]:
from pynq import (allocate, Overlay)
import numpy as np

## Load the overlay

Program the FPGA and reference the required hardware blocks.

In [49]:
ol = Overlay('/home/xilinx/pynq/overlays/matmult/64/matmult.bit')

dma = ol.dma
mmult_ip = ol.accel

## Allocate memory for the DMA transfers

In [50]:
DIM = 64
in_buffer = allocate(shape=(2, DIM, DIM), dtype=np.float32, cacheable=False)
out_buffer = allocate(shape=(DIM, DIM), dtype=np.float32, cacheable=False)


## Matrix multiplication in hardware (PL side)

Running the algorithm on the hardware kernel includes the round-trip data transfer (processor to FPGA, then FPGA back to processor). This transfer is usually the performance bottleneck.
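
To get a feel for the transfer cost, the amount of data moved per call can be worked out from the buffer shapes alone. This is a back-of-the-envelope sketch; the names `DIM_HW`, `bytes_to_pl`, and `bytes_from_pl` are illustrative and not part of the notebook.

```python
import numpy as np

DIM_HW = 64  # matrix dimension implemented by the overlay
BYTES_PER_ELEMENT = np.dtype(np.float32).itemsize  # 4 bytes per float32

# Input buffer holds A and B, output buffer holds C.
bytes_to_pl = 2 * DIM_HW * DIM_HW * BYTES_PER_ELEMENT   # processor -> FPGA
bytes_from_pl = DIM_HW * DIM_HW * BYTES_PER_ELEMENT     # FPGA -> processor

# A full 64 x 64 product needs one multiply and one add per inner-product term.
flops = 2 * DIM_HW ** 3

print(bytes_to_pl, bytes_from_pl, flops)  # 32768 16384 524288
```

Because the whole input and output buffers are transferred on every call, these byte counts apply even when the matrices passed to the driver are smaller than 64×64.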

In [58]:
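CTRL_REG = 0x00            # offset of the accelerator's control register
AP_START = (1 << 0)        # bit 0: start the kernel
AUTO_RESTART = (1 << 7)    # bit 7: restart automatically after each run
HW_SIZE = 64               # matrix dimension implemented by the overlay

# Use HW_SIZE rather than DIM here, so that re-running this cell after DIM
# has been changed still configures the kernel for full-size matrices.
mmult_ip.register_map.k = HW_SIZE
mmult_ip.register_map.m = HW_SIZE
mmult_ip.register_map.n = HW_SIZE

def run_kernel():
    # Queue both DMA transfers, start the kernel, then block until both finish.
    dma.sendchannel.transfer(in_buffer)
    dma.recvchannel.transfer(out_buffer)
    mmult_ip.write(CTRL_REG, (AP_START | AUTO_RESTART))  # 0x81: start with auto-restart
    dma.sendchannel.wait()
    dma.recvchannel.wait()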
    
def matmult_driver(A, B):
    """Multiply A (n x k) by B (k x m) on the accelerator; n, k, m <= HW_SIZE."""
    a_rows, a_cols = np.shape(A)
    b_rows, b_cols = np.shape(B)
    assert a_cols == b_rows, "inner dimensions of A and B must match"
    assert max(a_rows, a_cols, b_cols) <= HW_SIZE, "operands exceed the hardware size"

    # Zero-pad both operands to HW_SIZE x HW_SIZE. The padding contributes
    # nothing to the top-left block of the product, which therefore equals A @ B.
    a = np.zeros((HW_SIZE, HW_SIZE), dtype=np.float32)
    b = np.zeros((HW_SIZE, HW_SIZE), dtype=np.float32)
    a[:a_rows, :a_cols] = A
    b[:b_rows, :b_cols] = B

    # Copy the padded operands into the DMA input buffer: A first, then B.
    in_buffer[0] = a
    in_buffer[1] = b

    run_kernel()

    # The result is a view into out_buffer and is overwritten by the next call.
    return out_buffer[:a_rows, :b_cols]
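
The driver embeds smaller operands in 64×64 matrices and slices the result back out. The sketch below checks that idea in pure NumPy, with `@` standing in for the accelerator so it runs without the board. It uses plain zero padding; the helper `pad_to_hw` and the example shapes are hypothetical.

```python
import numpy as np

HW = 64  # hardware matrix dimension

def pad_to_hw(M, size=HW):
    """Zero-pad a 2-D array to size x size (hypothetical helper)."""
    rows, cols = M.shape
    padded = np.zeros((size, size), dtype=np.float32)
    padded[:rows, :cols] = M
    return padded

rng = np.random.default_rng(1)
A_demo = rng.random((10, 7), dtype=np.float32)  # 10 x 7
B_demo = rng.random((7, 5), dtype=np.float32)   # 7 x 5

# NumPy multiplies the padded 64 x 64 matrices in place of the hardware kernel.
C_padded = pad_to_hw(A_demo) @ pad_to_hw(B_demo)

# Only the top-left 10 x 5 block carries the product; the rest is zero.
C_demo = C_padded[:10, :5]
```

The valid block has as many rows as the left operand and as many columns as the right operand, which is why the slice uses the row count of `A_demo` and the column count of `B_demo`.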

Create example matrices to evaluate the kernel.

In [59]:
DIM = 10

A = np.random.rand(DIM, DIM).astype(dtype=np.float32)
B = np.random.rand(DIM, DIM).astype(dtype=np.float32)


Measure the execution time.

In [60]:
%%timeit
AB = matmult_driver(A, B)

2.47 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Matrix multiplication in software (PS side)

NumPy is the gold standard against which the hardware implementation is compared.


In [54]:
%timeit A @ B

30.5 µs ± 49.6 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
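
The two timings reported above (2.47 ms for the hardware path, 30.5 µs for NumPy) can be put side by side with a quick calculation. This sketch uses only the mean values printed by `%timeit`; the variable names are illustrative.

```python
hw_time_s = 2.47e-3   # mean time of matmult_driver(A, B), from the output above
sw_time_s = 30.5e-6   # mean time of A @ B, from the output above

ratio = hw_time_s / sw_time_s
print(f"hardware path is about {ratio:.0f}x slower for this 10 x 10 case")  # about 81x
```

For such a small problem the fixed cost of padding and moving full 64×64 buffers plausibly dominates, which is consistent with the remark that data transfer is usually the bottleneck.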


## Verify correctness

In [55]:
AB = matmult_driver(A, B)  # %%timeit does not keep variables assigned in the timed cell
np.array_equal(A @ B, AB)

True
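
Bitwise equality is a strict test for floating-point results, because two correct implementations may accumulate the products in a different order and round differently. A tolerance-based comparison is a more robust alternative. The sketch below illustrates this in pure NumPy; the matrices `P` and `Q` and the chosen tolerances are assumptions, not values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(2)
P = rng.random((10, 10), dtype=np.float32)
Q = rng.random((10, 10), dtype=np.float32)

# Two valid ways of computing the same product with different rounding paths.
ref = P @ Q
alt = (P.astype(np.float64) @ Q.astype(np.float64)).astype(np.float32)

exact = np.array_equal(ref, alt)                      # bitwise comparison, may be False
close = np.allclose(ref, alt, rtol=1e-5, atol=1e-6)   # comparison within a tolerance

print(exact, close)
```

If the exact check ever fails on the board while the tolerant one passes, the difference is most likely rounding rather than a functional error.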