# Random Projection + Logistic Regression Pipeline  (v1)

### Introduction:
This example uses Scikit-learn's Pipeline class to describe a typical machine learning pipeline. Each stage is accelerated by a separate FPGA function call, so the hardware architecture contains separate Random Projection and Logistic Regression cores, each with its own AXI interface and API. The architecture is illustrated below:

<img src="imgs/pipe_multi.jpg">

**Note:** This notebook is only compatible with the "multi.bit" or "multi_sg.bit" bitstreams. In addition, the Random Projection stage only supports problems with **n_features=128** and **n_components=32**, and the Logistic Regression stage only supports problems with **n_features=32** and **n_classes=10**. Any other problem shape requires a new hybrid library (bitstream + C API + Python API).

The (v2) notebook deploys a true hardware pipeline, i.e. one that avoids PL-to-PS transfers between pipeline stages.

### Hybrid Library:
This example implements two independent accelerators on one FPGA. Although the bitstream is shared, the C drivers and Python APIs are not, so we must load a separate hybrid library for each of PynqBinaryRandomProjection and PynqLogisticRegression.

In [1]:
import os
import sys
from pynq_sklearn import HybridLibrary, Registry

rp_lib = Registry.load("rp_multi_sg")
lr_lib = Registry.load("lr_multi_sg")
print(rp_lib)
print(lr_lib)

HybridLibrary(): {'bitstream': 'multi_sg.bit', 'library': 'libmulti_sg.so', 'dma_sg': True, 'pipe': False, 'c_callable': ' void _p0_RandomProjection_1_noasync(int *x, int *output, int datalen); ', 'input_width': 128, 'output_width': 32}
HybridLibrary(): {'bitstream': 'multi_sg.bit', 'library': 'libmulti_sg.so', 'dma_sg': True, 'pipe': False, 'c_callable': ' void _p0_LinReg_1_noasync(int *x, int a[320], int b[10], int *output, int datalen); ', 'input_width': 32, 'output_width': 10}
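
The two descriptions printed above have to chain: the output width of the Random Projection stage (32) must equal the input width of the Logistic Regression stage (32), and both must refer to the same bitstream. The sketch below checks this on plain dictionaries copied from that output. `check_stage_compatibility` is a hypothetical helper written for illustration, not part of `pynq_sklearn`.

```python
# Plain dicts standing in for the two HybridLibrary descriptions printed above
# (only the fields needed for the check are copied).
rp_desc = {"bitstream": "multi_sg.bit", "input_width": 128, "output_width": 32}
lr_desc = {"bitstream": "multi_sg.bit", "input_width": 32, "output_width": 10}


def check_stage_compatibility(first, second):
    """Hypothetical helper: raise ValueError if two stages cannot be chained."""
    if first["bitstream"] != second["bitstream"]:
        raise ValueError(
            "stages use different bitstreams: %s vs %s"
            % (first["bitstream"], second["bitstream"])
        )
    if first["output_width"] != second["input_width"]:
        raise ValueError(
            "width mismatch: stage 1 outputs %d values, stage 2 expects %d"
            % (first["output_width"], second["input_width"])
        )
    return True


stages_ok = check_stage_compatibility(rp_desc, lr_desc)
print("Stages compatible =", stages_ok)
```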


### Generate Dataset:

In [2]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.make_blobs(n_samples=5000, n_features=128, centers=10, cluster_std=8, random_state=43)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000, random_state=42)

### Software Pipeline:

In [3]:
from sklearn import pipeline
from pynq_sklearn.linear_model import PynqLogisticRegression
from pynq_sklearn.random_projection import PynqBinaryRandomProjection

rp = PynqBinaryRandomProjection(hw=rp_lib, hw_accel=False)  
lr = PynqLogisticRegression(hw=lr_lib, fit_intercept=True, hw_accel=False) 

ml_pipe = pipeline.Pipeline([("dim_red", rp), ("clf", lr)])
ml_pipe.fit(X_train, y_train)
ypred_sw = ml_pipe.predict(X_test)

In [4]:
import timeit

number=200
def swresp():
    out = ml_pipe.predict(X_test)
    return
    
print("Running the benchmark")
sw_time = timeit.timeit(swresp,number=number)
print("Time taken by sw_pipe", number,"times",sw_time)

Running the benchmark
Time taken by sw_pipe 200 times 8.140533167000058


### Hardware Pipeline:

Here we offload the predict/transform methods of both pipeline stages to the FPGA. For this to work, we must first convert the input into 32-bit fixed-point numbers with 20 fractional bits, the format our bitstream expects. We also copy the input array into physically contiguous memory to avoid expensive virtual address mapping.

**Note:** Copying into contiguous memory is mandatory for most bitstreams. However, if the bitstream uses SDSoC scatter-gather DMA, the hybrid library will map the virtual addresses to physical memory.

In [5]:
FRAC_WIDTH = 20
X_test_hw = (X_test*(1<<FRAC_WIDTH)).astype(np.int32)
X_test_hw = rp.copy_array(X_test_hw, dtype=np.int32) # allocates X_test_hw to contiguous memory
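
To make the fixed-point format concrete, here is a minimal NumPy-only sketch of the conversion and its inverse. A signed 32-bit word with 20 fractional bits can represent values in roughly [-2048, 2048) with a resolution of 2^-20. The helpers `to_fixed` and `to_float` are illustrative names, and the range check is an addition that the cell above does not perform.

```python
import numpy as np

FRAC_WIDTH = 20  # number of fractional bits, as in the cell above


def to_fixed(x, frac_width=FRAC_WIDTH):
    """Scale floats by 2**frac_width and store them as int32 (illustrative helper)."""
    scaled = np.asarray(x, dtype=np.float64) * (1 << frac_width)
    info = np.iinfo(np.int32)
    # Added safety check: values outside the int32 range cannot be represented.
    if scaled.min() < info.min or scaled.max() > info.max:
        raise OverflowError("value does not fit in 32-bit fixed point")
    return scaled.astype(np.int32)  # astype truncates toward zero


def to_float(x_fixed, frac_width=FRAC_WIDTH):
    """Inverse of to_fixed, up to quantization error."""
    return np.asarray(x_fixed, dtype=np.float64) / (1 << frac_width)


x = np.array([-3.75, 0.1, 12.5])
x_fixed = to_fixed(x)
max_err = np.max(np.abs(to_float(x_fixed) - x))
print("Fixed-point words:", x_fixed)
print("Max round-trip error below 2**-20:", max_err < 2.0 ** -FRAC_WIDTH)
```

The round-trip error is bounded by one least-significant bit, which is why the hardware and software predictions are expected to agree closely but not necessarily exactly.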

###### i.) Explicitly set_params so that hw_accel=True

In [6]:
ml_pipe = ml_pipe.set_params(dim_red__hw_accel=True, clf__hw_accel=True)

###### ii.) Offload transform() and predict() to HW for both PynqBinaryRandomProjection and PynqLogisticRegression

In [7]:
ypred_hw = ml_pipe.predict(X_test_hw)

###### iii.) Verify equivalence
The hardware and software pipelines should give approximately the same classification performance. Any differences are attributable to fixed-point rounding in the FPGA.

In [8]:
print("Exactly equal =",np.array_equal(ypred_hw , ypred_sw))
print("Differences =", np.count_nonzero((ypred_hw - ypred_sw)))

Exactly equal = True
Differences = 0


###### iv.) Measure the pipeline performance

In [9]:
number=200
def hwresp():
    out = ml_pipe.predict(X_test_hw)
    return
    
print("Running the benchmark")
hw_time = timeit.timeit(hwresp,number=number)
print("Time taken by hw_pipe", number,"times",hw_time)
print("HW Speedup = %.2fx"%(sw_time/hw_time))

Running the benchmark
Time taken by hw_pipe 200 times 0.8567619710000827
HW Speedup = 9.50x
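
The totals above cover 200 calls to predict(), each on the 1000-row test set. As a worked example, the sketch below turns the two recorded totals into per-call latency and throughput. The timing values are copied from the outputs in this notebook, and the helper names are illustrative.

```python
number = 200       # timeit repetitions used in the benchmarks above
n_samples = 1000   # rows in X_test (test_size=1000)

# Totals copied from the recorded notebook outputs.
sw_time = 8.140533167000058
hw_time = 0.8567619710000827


def per_call_ms(total_s, calls):
    """Average latency of one predict() call, in milliseconds."""
    return 1e3 * total_s / calls


def samples_per_second(total_s, calls, samples):
    """Average number of samples classified per second."""
    return calls * samples / total_s


sw_ms = per_call_ms(sw_time, number)
hw_ms = per_call_ms(hw_time, number)
speedup = sw_time / hw_time

print("SW: %.2f ms per call, %.0f samples/s"
      % (sw_ms, samples_per_second(sw_time, number, n_samples)))
print("HW: %.2f ms per call, %.0f samples/s"
      % (hw_ms, samples_per_second(hw_time, number, n_samples)))
print("HW Speedup = %.2fx" % speedup)
```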


### Software/Hardware Pipeline
This variation deploys only the PynqLogisticRegression predict method to the FPGA; PynqBinaryRandomProjection runs entirely in software. This only works with **"multi_sg.bit"**, because that bitstream/library uses scatter-gather DMA to transfer the input data from PS to PL. This means we don't have to explicitly copy the numpy array into physically contiguous memory.

###### i.) Explicitly set_params so that hw_accel=True only for Logistic Regression accelerator

In [10]:
ml_pipe = ml_pipe.set_params(dim_red__hw_accel=False, clf__hw_accel=True)

###### ii.) Predict
Calling predict will only offload PynqLogisticRegression to HW. Because PynqBinaryRandomProjection is stage 1 and runs in software, the input must be floating point and need not be physically contiguous. The output of stage 1 is converted to fixed point before stage 2, and this conversion significantly reduces the pipeline's performance.

In [11]:
ypred_swhw = ml_pipe.predict(X_test)
print("Exactly equal =",np.array_equal(ypred_swhw , ypred_sw))
print("Differences =", np.count_nonzero((ypred_swhw - ypred_sw)))

Exactly equal = True
Differences = 0
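
Exact equality is not guaranteed once fixed-point rounding is involved, so it can be useful to report how closely two prediction vectors agree rather than only whether they are identical. The sketch below is one way to do that with NumPy. `compare_predictions` is a hypothetical helper, and the small arrays are made-up demo data, not results from this notebook.

```python
import numpy as np


def compare_predictions(y_ref, y_other):
    """Return (fraction of matching labels, indices where the labels differ)."""
    y_ref = np.asarray(y_ref)
    y_other = np.asarray(y_other)
    if y_ref.shape != y_other.shape:
        raise ValueError("prediction arrays must have the same shape")
    mismatches = np.flatnonzero(y_ref != y_other)
    agreement = 1.0 - mismatches.size / y_ref.size
    return agreement, mismatches


# Made-up demo labels: the last prediction differs.
agreement, mismatches = compare_predictions([0, 1, 2, 3], [0, 1, 2, 1])
print("Agreement =", agreement)
print("Mismatching indices =", mismatches)
```

In the notebook, the same call could be applied to `ypred_sw` and `ypred_hw` (or `ypred_swhw`) to locate any samples on which the pipelines disagree.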


###### iii.) Measure the sw/hw pipeline performance

In [12]:
number=200
def swhwresp():
    out = ml_pipe.predict(X_test)
    return
    
print("Running the benchmark")
swhw_time = timeit.timeit(swhwresp,number=number)
print("Time taken by swhw_pipe", number,"times",swhw_time)
print("SW/HW Slowdown = %.2fx"%(swhw_time/hw_time))

Running the benchmark
Time taken by swhw_pipe 200 times 10.605611722999924
SW/HW Slowdown = 12.38x


### Evaluate Classification Results
We have access to Scikit-learn's entire library for evaluating and scoring machine learning models. We can call score() directly on the hardware-accelerated pipeline, or we can write a custom scoring function to use separately.

In [13]:
ml_pipe.set_params(dim_red__hw_accel=True, clf__hw_accel=True)
acc = ml_pipe.score(X_test_hw, y_test) 
print("Mean Accuracy =", acc) 

Mean Accuracy = 0.964


In [14]:
from sklearn.metrics import classification_report

def custom_scorer(y, y_pred):
    # We can put anything in here.
    class_names = ["Class%d"%(i) for i in range(10)]
    return classification_report(y, y_pred, target_names=class_names)

print( custom_scorer(y_test, ypred_sw))

             precision    recall  f1-score   support

     Class0       0.98      0.95      0.96       100
     Class1       0.98      0.93      0.95        88
     Class2       0.97      0.99      0.98       109
     Class3       0.95      0.95      0.95        93
     Class4       0.91      0.99      0.95        96
     Class5       1.00      0.97      0.98        91
     Class6       0.98      0.99      0.99       118
     Class7       0.98      0.96      0.97       107
     Class8       0.96      0.94      0.95        94
     Class9       0.93      0.96      0.95       104

avg / total       0.96      0.96      0.96      1000



See [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html) for details of the metrics in the classification report.
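
A custom scoring function can compute any metric from the true and predicted labels. As a further illustration, the NumPy-only sketch below computes the unweighted mean of the per-class recall. The function names and the small demo arrays are illustrative and not part of the notebook.

```python
import numpy as np


def per_class_recall(y_true, y_pred, n_classes=10):
    """Recall for each class; NaN for classes absent from y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = np.full(n_classes, np.nan)
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            # Fraction of class-c samples that were predicted as class c.
            recalls[c] = np.mean(y_pred[mask] == c)
    return recalls


def macro_recall_scorer(y, y_pred, n_classes=10):
    """Unweighted mean of the per-class recalls, ignoring absent classes."""
    return float(np.nanmean(per_class_recall(y, y_pred, n_classes)))


# Made-up demo labels with three classes.
y_demo = np.array([0, 0, 1, 1, 2, 2])
y_demo_pred = np.array([0, 1, 1, 1, 2, 0])
demo_score = macro_recall_scorer(y_demo, y_demo_pred, n_classes=3)
print("Macro recall = %.3f" % demo_score)
```

In the notebook, the equivalent call would be `macro_recall_scorer(y_test, ypred_hw)`.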

When we are finished, we should free all CMA buffers:

In [15]:
rp.xlnk.xlnk_reset()