# Random Projection + Logistic Regression Pipeline (v2)

### Introduction:
This example uses scikit-learn's Pipeline class to describe a machine learning pipeline. In software, this class chains multiple estimators together and executes them sequentially. A naive hardware mapping sends data from the PS (processing system) to the PL (programmable logic) and back again for every stage in the pipeline. That approach is flexible (see the v1 notebook), but for best performance we should keep the pipeline's intermediate state in local FPGA memory and avoid expensive DRAM transfers between stages.

We achieve this by using set_params() to explicitly:
- Offload all computation to stage 1, while bypassing HW offload in all other stages
- Transfer all HW parameters to stage 1

<img src="imgs/pipe_hw.jpg">

**Note:** This notebook is only compatible with the "pipe.bit" or "pipe_sg.bit" bitstreams. In addition, the Random Projection stage only supports problems with **n_features=128** and **n_components=32**, and the Logistic Regression stage only supports problems with **n_features=32** and **n_classes=10**. For other problem shapes or sizes, a new hybrid library must be developed (bitstream + C API + Python API).
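
Because the bitstream fixes these sizes, it can help to validate a dataset before sending it to the accelerator. The following is a minimal sketch: `check_pipe_shapes` is a hypothetical helper written for illustration, not part of pynq_sklearn, and it only encodes the sizes quoted in the note above.

```python
# Hypothetical guard (not part of pynq_sklearn): encodes the fixed sizes
# quoted in the note above.
N_FEATURES = 128    # inputs expected by the Random Projection stage
N_CLASSES = 10      # outputs produced by the Logistic Regression stage


def check_pipe_shapes(x_shape, n_classes):
    """Return True if the problem fits the accelerator, else raise ValueError."""
    if len(x_shape) != 2 or x_shape[1] != N_FEATURES:
        raise ValueError(f"expected X of shape (n_samples, {N_FEATURES}), got {x_shape}")
    if n_classes != N_CLASSES:
        raise ValueError(f"expected {N_CLASSES} classes, got {n_classes}")
    return True


# A 1000-row test set with 128 features and 10 classes passes the check.
ok = check_pipe_shapes((1000, 128), 10)
```

With the arrays used in this notebook, the call would look like `check_pipe_shapes(X_test.shape, len(set(y_train)))`.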


### Hybrid Library:

In [1]:
import os
import sys
from pynq_sklearn import HybridLibrary, Registry

lib = Registry.load("pipe")
print(lib)

HybridLibrary(): {'bitstream': 'pipe.bit', 'library': 'libpipe.so', 'dma_sg': False, 'pipe': True, 'c_callable': ' void _p0_Pipe_1_noasync(int *x, int a[320], int b[10], int *output, int datalen); ', 'input_width': 128, 'output_width': 10}


### Generate Dataset:

In [2]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.make_blobs(n_samples=5000, n_features=128, centers=10, cluster_std=8, random_state=43)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000, random_state=42)

### Software Pipeline:

In [3]:
from sklearn import pipeline
from pynq_sklearn.linear_model import PynqLogisticRegression
from pynq_sklearn.random_projection import PynqBinaryRandomProjection

rp = PynqBinaryRandomProjection(hw=lib, hw_accel=False)  
lr = PynqLogisticRegression(hw=lib, fit_intercept=True, hw_accel=False) 

ml_pipe = pipeline.Pipeline([("dr", rp), ("clf", lr)])
ml_pipe = ml_pipe.fit(X_train, y_train)
ypred_sw = ml_pipe.predict(X_test)

In [4]:
import timeit

number = 200

def swresp():
    # One full software prediction pass over the test set
    ml_pipe.predict(X_test)

print("Running the benchmark")
sw_time = timeit.timeit(swresp, number=number)
print("Time taken by sw_pipe", number, "times", sw_time)

Running the benchmark
Time taken by sw_pipe 200 times 8.027412722999998


### Hardware Pipeline:
Both stages' predict/transform methods are executed on the FPGA. Both accelerators expect 32-bit fixed-point numbers with 20 fractional bits, so we convert the test data to that format and copy the array into physically contiguous memory.

**Note:** This last step is mandatory for most bitstreams. However, if the bitstream uses SDSoC scatter-gather DMA, the hybrid library will perform the necessary virtual address mapping.

In [5]:
FRAC_WIDTH = 20
X_test_hw = (X_test*(1<<FRAC_WIDTH)).astype(np.int32)
X_test_hw = rp.copy_array(X_test_hw, dtype=np.int32) # copies X_test_hw into physically contiguous memory
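
The conversion above can be illustrated on single values in plain Python. This is a sketch for intuition only: `to_fixed` and `to_float` are hypothetical helpers, not part of pynq_sklearn. With 20 fractional bits, a signed 32-bit word covers the range [-2048, 2048) in steps of 2^-20, so values outside that range cannot be represented and are worth checking for before casting.

```python
FRAC_WIDTH = 20                 # fractional bits, as in the cell above
SCALE = 1 << FRAC_WIDTH         # 2**20
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def to_fixed(x):
    """Convert a float to a signed 32-bit fixed-point integer."""
    q = int(x * SCALE)          # int() truncates toward zero, like a float-to-int cast
    if not INT32_MIN <= q <= INT32_MAX:
        raise OverflowError(f"{x} does not fit in 32 bits with {FRAC_WIDTH} fractional bits")
    return q


def to_float(q):
    """Convert a fixed-point integer back to a float."""
    return q / SCALE


x = 3.14159
q = to_fixed(x)
round_trip_error = abs(to_float(q) - x)   # stays below 1 / SCALE (about 9.5e-7)
```

The truncation error per value is below 2^-20, which is why the hardware and software predictions are expected to agree closely but not necessarily bit for bit.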

###### i.) Explicitly set_params() and configure the HW pipeline:

In [6]:
pipe_params = {"a":lr.coef_hw.pointer, "b":lr.intercept_hw.pointer, "n_out":lr.n_classes}

ml_pipe = ml_pipe.set_params(dr__hw_accel=True,
                             dr__pipe_params=pipe_params,
                             clf__hw_accel=True)

stage1: 	 PynqBinaryRandomProjection
sw bypass: 	 PynqLogisticRegression


Now, when we call predict() on ml_pipe, the whole pipeline is invoked from the PynqBinaryRandomProjection.transform() method, and PynqLogisticRegression.predict() is bypassed. The only caveat is that the fitted parameters from PynqLogisticRegression must be passed as parameters to PynqBinaryRandomProjection.

In [7]:
ypred_hw = ml_pipe.predict(X_test_hw)
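
The control flow described above (stage 1 does all the work, the final stage passes results through) can be sketched with a toy example. `FusedStage` and `BypassStage` are hypothetical stand-ins, not pynq_sklearn classes, and the projection and classifier are deliberately trivial. The sketch shows only the dataflow, not the library's implementation.

```python
# Toy illustration: FusedStage and BypassStage are hypothetical stand-ins,
# not classes from pynq_sklearn.

class FusedStage:
    """Stage 1: holds the parameters of every stage and runs them in one call."""

    def __init__(self, project, classify):
        self.project = project      # stage-1 computation (e.g. a projection)
        self.classify = classify    # downstream computation handed over as a parameter

    def transform(self, rows):
        # A single call covers both stages, so no intermediate result
        # travels back to the caller between them.
        return [self.classify(self.project(row)) for row in rows]


class BypassStage:
    """Final stage: labels were already computed upstream, so pass them through."""

    def predict(self, labels):
        return labels


def project(row):
    # Toy 4 -> 2 projection: sum each half of the row.
    return [row[0] + row[1], row[2] + row[3]]


def classify(features):
    # Toy classifier: index of the largest feature.
    return features.index(max(features))


stage1 = FusedStage(project, classify)
stage2 = BypassStage()
labels = stage2.predict(stage1.transform([[1, 2, 0, 0], [0, 0, 3, 4]]))
```

Here `labels` ends up as `[0, 1]`: the classification happened inside stage 1, and the final stage returned its input unchanged.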

###### iv.) Verify equivalence
We should get approximately the same classification results. Any differences are attributable to fixed-point rounding in the FPGA.

In [8]:
print("Exactly equal =",np.array_equal(ypred_hw , ypred_sw))
print("Differences =", np.count_nonzero((ypred_hw - ypred_sw)))

Exactly equal = True
Differences = 0


###### v.) Measure the pipeline performance 

In [9]:
number = 200

def hwresp():
    # One full hardware prediction pass over the fixed-point test set
    ml_pipe.predict(X_test_hw)

print("Running the benchmark")
hw_time = timeit.timeit(hwresp, number=number)
print("Time taken by hw_pipe", number, "times", hw_time)
print("HW Speedup = %.2fx" % (sw_time / hw_time))

Running the benchmark
Time taken by hw_pipe 200 times 0.6482782940001925
HW Speedup = 12.38x
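
The totals reported above can be turned into per-call latency and throughput, which are often easier to compare. The sketch below only reuses the figures printed by the two benchmark cells; the helper names are illustrative.

```python
# Figures taken from the benchmark outputs above.
number = 200                      # timeit repetitions
n_samples = 1000                  # rows in X_test (test_size=1000)
sw_time = 8.027412722999998       # seconds for 200 software predict() calls
hw_time = 0.6482782940001925      # seconds for 200 hardware predict() calls


def per_call_ms(total_seconds, calls):
    """Average wall-clock time of one predict() call, in milliseconds."""
    return 1e3 * total_seconds / calls


def samples_per_second(total_seconds, calls, samples_per_call):
    """Average number of samples classified per second."""
    return calls * samples_per_call / total_seconds


sw_ms = per_call_ms(sw_time, number)      # roughly 40 ms per call
hw_ms = per_call_ms(hw_time, number)      # roughly 3.2 ms per call
hw_rate = samples_per_second(hw_time, number, n_samples)
speedup = sw_time / hw_time
print("HW Speedup = %.2fx" % speedup)     # prints "HW Speedup = 12.38x"
```

Note that these are end-to-end timings of predict(), so they include any Python and driver overhead on top of the accelerator's own compute time.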


### Evaluate Classification Results
We can call score() directly on the HW pipeline, or we can create a custom scoring function and apply it separately.

In [10]:
acc = ml_pipe.score(X_test_hw, y_test)  # for a classifier, score() returns mean accuracy
print("Accuracy =", acc)

Accuracy = 0.964


In [11]:
from sklearn.metrics import classification_report

def custom_scorer(y, y_pred):
    # We can put anything in here.
    class_names = ["Class%d"%(i) for i in range(10)]
    return classification_report(y, y_pred, target_names=class_names)

print(custom_scorer(y_test, ypred_sw))

             precision    recall  f1-score   support

     Class0       0.98      0.95      0.96       100
     Class1       0.98      0.93      0.95        88
     Class2       0.97      0.99      0.98       109
     Class3       0.95      0.95      0.95        93
     Class4       0.91      0.99      0.95        96
     Class5       1.00      0.97      0.98        91
     Class6       0.98      0.99      0.99       118
     Class7       0.98      0.96      0.97       107
     Class8       0.96      0.94      0.95        94
     Class9       0.93      0.96      0.95       104

avg / total       0.96      0.96      0.96      1000



See the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html) for details on the metrics in the classification report.
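
For intuition about the columns in the report, the per-class metrics can be computed by hand. The sketch below uses a small made-up label set purely for illustration (it is not data from this notebook), and `per_class_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn   # number of true samples of this class
    return precision, recall, f1, support


# Made-up toy labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
precision, recall, f1, support = per_class_metrics(y_true, y_pred, label=1)
```

For class 1 in this toy example, two of the three predictions are correct (precision 2/3), both true samples are found (recall 1.0), and the support is 2.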

When we are finished, we should free all CMA buffers:

In [12]:
rp.xlnk.xlnk_reset()