# Linear Regression

This notebook compares a CPU implementation and a GPU implementation of linear regression. It includes code examples for running linear regression with RAPIDS cuDF and cuML.

## Notebook Credits
### Authorship
Original Author: Unknown  
Last Edit: Taurean Dyer, 2/20/2019    

### Test System Specs
Test System Hardware: DGX-2  
Test System Software: Ubuntu 16.04  
RAPIDS Version: 0.5.1 - Docker Install  
Driver: 410.79  
CUDA: 10.0  

### Known Working Systems
RAPIDS Versions: 0.4, 0.5, 0.5.1


## Let's Begin: Linear Regression
### Imports
Let's start with our imports.

In [3]:
import numpy as np
import pandas as pd
from sklearn import linear_model as sklGLM
from cuml import LinearRegression as cumlOLS
from cuml import Ridge as cumlRidge
import cudf
import os

### Helper Functions

In [4]:
from timeit import default_timer

class Timer(object):
    def __init__(self):
        self._timer = default_timer
    
    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *args):
        self.stop()

    def start(self):
        """Start the timer."""
        # Stored as _start so the attribute does not shadow the start() method.
        self._start = self._timer()

    def stop(self):
        """Stop the timer. Calculate the interval in seconds."""
        self.end = self._timer()
        self.interval = self.end - self._start
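
A minimal usage sketch for a context-manager timer like the one above. `BlockTimer` is a hypothetical, self-contained stand-in (not the notebook's class), and `time.sleep` is only a placeholder for the work being timed.

```python
import time
from timeit import default_timer


class BlockTimer:
    """Hypothetical stand-in for the Timer helper, used only in this sketch."""

    def __enter__(self):
        # Record the start time when the with-block is entered.
        self._start = default_timer()
        return self

    def __exit__(self, *args):
        # Elapsed wall time in seconds, available after the block exits.
        self.interval = default_timer() - self._start


with BlockTimer() as t:
    time.sleep(0.05)  # placeholder for a call such as model.fit(X, y)

print(f"elapsed: {t.interval:.3f} s")
```

The elapsed time is read from `t.interval` after the `with` block ends, which keeps the timing code out of the statement being measured.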

In [5]:
import gzip
def load_data(nrows, ncols, cached = 'data/mortgage.npy.gz'):
    if os.path.exists(cached):
        print('use mortgage data')
        with gzip.open(cached) as f:
            X = np.load(f)
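        # the 4th column is 'adj_remaining_months_to_maturity'
        # used as the label
        X = X[:,[i for i in range(X.shape[1]) if i!=4]]
        # NOTE: y is sliced from X after the column above has been dropped,
        # so the label is whichever column now sits at index 4, and it
        # remains among the features whenever ncols > 4.
        y = X[:,4:5]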
        rindices = np.random.randint(0,X.shape[0]-1,nrows)
        X = X[rindices,:ncols]
        y = y[rindices]
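    else:
        print('use random data')
        X = np.random.rand(nrows,ncols)
        # y must exist on this path too; a random single-column label is an assumption
        y = np.random.rand(nrows,1)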
        
    df_X = pd.DataFrame({'fea%d'%i:X[:,i] for i in range(X.shape[1])})
    df_y = pd.DataFrame({'fea%d'%i:y[:,i] for i in range(y.shape[1])})
    
    return df_X, df_y
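
Dropping a column shifts the index of every column after it, so the order of the "drop" and "slice" steps matters. Below is a small sketch on a toy array, assuming the intent is to keep the label out of the feature matrix. The array and the name `label_col` are illustrative and not part of the notebook.

```python
import numpy as np

# Toy stand-in for the mortgage array: 2 rows, 6 columns.
data = np.arange(12).reshape(2, 6)
label_col = 4  # illustrative label index

# Take the label first, while the column indices are still the original ones.
y = data[:, label_col:label_col + 1]

# Then remove that column from the features.
X = np.delete(data, label_col, axis=1)

print(X.shape, y.shape)  # (2, 5) (2, 1)
print(y.ravel())         # [ 4 10]
```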

In [6]:
from sklearn.metrics import mean_squared_error
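def array_equal(a,b,threshold=2e-3,with_sign=True):
    # Treat a and b as equal when their mean squared error is below threshold.
    a = to_nparray(a).ravel()
    b = to_nparray(b).ravel()
    # with_sign=False compares magnitudes only.
    if not with_sign:
        a,b = np.abs(a),np.abs(b)
    error = mean_squared_error(a,b)
    res = error<threshold
    return res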

def to_nparray(x):
    if isinstance(x,(np.ndarray,pd.DataFrame)):
        return np.array(x)
    elif isinstance(x,np.float64):
        return np.array([x])
    elif isinstance(x,(cudf.DataFrame,cudf.Series)):
        return x.to_pandas().values
    return x
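
The comparison idea behind `array_equal` can be sketched without any GPU libraries: two results count as equal when their mean squared error falls below a threshold. The function `close_by_mse` and the sample values are hypothetical, written in plain Python for illustration only.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def close_by_mse(a, b, threshold=2e-3, with_sign=True):
    """Return True when the MSE between a and b is below threshold."""
    if not with_sign:
        # Compare magnitudes only, ignoring sign differences.
        a, b = [abs(x) for x in a], [abs(x) for x in b]
    return mse(a, b) < threshold


coef_cpu = [0.50, -1.20, 3.00]
coef_gpu = [0.51, -1.19, 3.01]
flipped = [-0.50, 1.20, -3.00]

print(close_by_mse(coef_cpu, coef_gpu))                   # True
print(close_by_mse(coef_cpu, flipped))                    # False
print(close_by_mse(coef_cpu, flipped, with_sign=False))   # True
```

Because the check is an average over all elements, a single large deviation can be hidden by many small ones. An element-wise tolerance check is a stricter alternative when that matters.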

Now that we have our helper functions, let's compare the speed and results of Scikit-Learn's CPU implementation with those of the RAPIDS cuML GPU implementation.

In [7]:
%%time
nrows = 2**20
ncols = 399

X, y = load_data(nrows,ncols)
print('data',X.shape)
print('label',y.shape)

use mortgage data
data (1048576, 399)
label (1048576, 1)
CPU times: user 12.8 s, sys: 1.05 s, total: 13.9 s
Wall time: 11.8 s


Although cuML's OLS interface closely mirrors Scikit-Learn's, cuML does not accept some of its parameters, such as "copy_X" and "n_jobs". cuML also provides two OLS implementations, one based on SVD and one based on eigendecomposition. The eigendecomposition-based implementation is very fast but introduces very small errors in the coefficients, which are negligible for most applications. The SVD-based implementation is more stable but slower.
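
To make the two routes concrete, here is a NumPy sketch of solving the same least-squares problem through an eigendecomposition of `X.T @ X` and through an SVD of `X`. It illustrates the linear algebra only and is not cuML's actual code. The synthetic data, the omission of an intercept, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + 0.01 * rng.normal(size=200)

# Eigen route: solve the normal equations (X^T X) w = X^T y
# using the eigendecomposition X^T X = Q diag(evals) Q^T.
evals, evecs = np.linalg.eigh(X.T @ X)
w_eig = evecs @ ((evecs.T @ (X.T @ y)) / evals)

# SVD route: with X = U diag(s) V^T, the solution is w = V diag(1/s) U^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((U.T @ y) / s)

print("eig:", w_eig)
print("svd:", w_svd)
print("max abs difference:", np.max(np.abs(w_eig - w_svd)))
```

Forming `X.T @ X` squares the condition number of the problem, which is one reason an eigendecomposition route can lose a little precision compared with working on `X` directly through the SVD.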

### Get MSE for Scikit-Learn

In [8]:
fit_intercept = True
normalize = False
algorithm = "eig" # eig: eigen decomposition based method, svd: singular value decomposition based method.

In [9]:
%%time
reg_sk = sklGLM.LinearRegression(fit_intercept=fit_intercept, normalize=normalize)
result_sk = reg_sk.fit(X, y)

CPU times: user 35.1 s, sys: 10.5 s, total: 45.6 s
Wall time: 7.95 s


In [10]:
%%time
y_sk = reg_sk.predict(X)
error_sk = mean_squared_error(y,y_sk)

CPU times: user 652 ms, sys: 0 ns, total: 652 ms
Wall time: 305 ms


### Get MSE for cuML

In [12]:
%%time
X_cudf = cudf.DataFrame.from_pandas(X)
y_cudf = y.values
y_cudf = y_cudf[:,0]
y_cudf = cudf.Series(y_cudf)

CPU times: user 2.11 s, sys: 748 ms, total: 2.86 s
Wall time: 1.83 s


In [13]:
%%time
reg_cuml = cumlOLS(fit_intercept=fit_intercept, normalize=normalize, algorithm=algorithm)
result_cuml = reg_cuml.fit(X_cudf, y_cudf)

CPU times: user 740 ms, sys: 200 ms, total: 940 ms
Wall time: 933 ms


In [14]:
%%time
y_cuml = reg_cuml.predict(X_cudf)
y_cuml = to_nparray(y_cuml).ravel()
error_cuml = mean_squared_error(y,y_cuml)

CPU times: user 360 ms, sys: 24 ms, total: 384 ms
Wall time: 375 ms


## Final Comparison Between SKL and cuML
Both MSE results should be close to 0 (about 1.0e-7 to 1.0e-14). Despite the similar answers, you should see a **massive reduction in fit time** when using **RAPIDS cuML** versus **Scikit-Learn**: in the run recorded above, fitting took 7.95 s of wall time (10.5 s sys) with Scikit-Learn and 933 ms of wall time (200 ms sys) with cuML. Go RAPIDS!

In [16]:
print("SKL MSE(y):")
print(error_sk)
print("CUML MSE(y):")
print(error_cuml)

SKL MSE(y):
1.9705975e-13
CUML MSE(y):
2.9977223e-10
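
The wall times recorded in the cell outputs above can be turned into speedup ratios with a few lines of arithmetic. The numbers below are copied from this notebook's run; the dictionary names are illustrative, and results on other hardware will differ.

```python
# Wall times in seconds, copied from the cell outputs above.
skl = {"fit": 7.95, "predict": 0.305}
cuml = {"to_gpu": 1.83, "fit": 0.933, "predict": 0.375}

# Speedup of the fit step alone.
fit_speedup = skl["fit"] / cuml["fit"]

# End-to-end ratio, counting the host-to-GPU conversion on the cuML side.
end_to_end = sum(skl.values()) / sum(cuml.values())

print(f"fit speedup: {fit_speedup:.1f}x")         # fit speedup: 8.5x
print(f"end-to-end speedup: {end_to_end:.1f}x")   # end-to-end speedup: 2.6x
```

The cuML predict timing also includes converting the predictions back to a NumPy array, so the end-to-end ratio is a conservative view of the GPU path.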
