# Ridge Regression Demo
Ridge regression extends linear regression by adding an L2 penalty on the coefficients. The penalty can reduce the variance of the coefficient estimates and improves the conditioning of the problem.
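
As a rough illustration of the conditioning claim, the sketch below solves a small ridge problem through the regularized normal equations, using synthetic data with two nearly collinear columns. The helper `ridge_closed_form` and the `*_demo` names are hypothetical and are not part of cuML or scikit-learn. The formula assumes no intercept, and it is a textbook form rather than what either library runs internally.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimize ||Xw - y||^2 + alpha * ||w||^2 via the normal equations."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)  # the L2 penalty adds alpha to the diagonal
    return np.linalg.solve(gram, X.T @ y)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Make the last column an almost exact copy of the previous one (ill-conditioned)
X_demo[:, 4] = X_demo[:, 3] + 1e-6 * rng.normal(size=200)
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y_demo = X_demo @ w_true + 0.01 * rng.normal(size=200)

# Condition number of the Gram matrix without and with the penalty
cond_ols = np.linalg.cond(X_demo.T @ X_demo)
cond_ridge = np.linalg.cond(X_demo.T @ X_demo + 0.1 * np.eye(5))
w_ridge = ridge_closed_form(X_demo, y_demo, alpha=0.1)

print("condition number, no penalty:", cond_ols)
print("condition number, alpha=0.1: ", cond_ridge)
```

Adding `alpha` to the diagonal lifts the smallest eigenvalue of the Gram matrix away from zero, which is why the penalized system is much better conditioned.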

The model accepts array-like inputs, either on the host as NumPy arrays or on the device (as Numba arrays or other `__cuda_array_interface__`-compliant objects), as well as cuDF DataFrames.

For information about cuDF, refer to the [cuDF documentation](https://rapidsai.github.io/projects/cudf/en/latest/).

For information about cuML's ridge regression implementation, refer to the [cuML API documentation](https://rapidsai.github.io/projects/cuml/en/latest/api.html#ridge-regression).

In [1]:
import os

import numpy as np

import pandas as pd
import cudf as gd

from sklearn.datasets import make_regression

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

from cuml.linear_model import Ridge as cuRidge
from sklearn.linear_model import Ridge as skRidge

## Define Parameters

In [2]:
n_samples = 2**20
n_features = 399
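
To get a feel for the size of this problem, the back-of-the-envelope calculation below estimates the memory taken by the feature matrix alone. It assumes 64-bit floats (8 bytes per value), which is the dtype `make_regression` produces; the variable names are illustrative only.

```python
n_samples = 2**20
n_features = 399
bytes_per_value = 8  # assumption: float64

n_values = n_samples * n_features
size_gib = n_values * bytes_per_value / 2**30

print(f"{n_values:,} values, about {size_gib:.2f} GiB")
```

At roughly 3 GiB for `X`, the host-to-device copy and the train/test split each move a non-trivial amount of data, which is worth keeping in mind when reading the timings in this notebook.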

## Generate Data

### Host

In [3]:
%%time
# Generate a synthetic regression problem on the host
X, y = make_regression(n_samples=n_samples, n_features=n_features, random_state=0)

X = pd.DataFrame(X)
y = pd.DataFrame(y)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

CPU times: user 31.1 s, sys: 16.4 s, total: 47.5 s
Wall time: 32 s
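
The row counts produced by the split can be worked out by hand. The sketch below assumes that `train_test_split` rounds the test count up and gives the remaining rows to the training set when only `test_size` is specified; the variable names are illustrative.

```python
import math

n_samples = 2**20
test_size = 0.2

n_test = math.ceil(test_size * n_samples)  # assumption: test count is rounded up
n_train = n_samples - n_test               # remainder goes to the training set

print("train rows:", n_train)
print("test rows: ", n_test)
```

Under that assumption, `X_train` holds 838,860 rows and `X_test` holds 209,716 rows.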


### GPU

In [6]:
%%time
X_cudf = gd.DataFrame.from_pandas(X_train)
X_cudf_test = gd.DataFrame.from_pandas(X_test)

y_cudf = gd.Series(y_train.values[:,0])

CPU times: user 8.21 s, sys: 2.58 s, total: 10.8 s
Wall time: 10.8 s


## Scikit-learn Model

### Fit

In [7]:
%%time
# Note: scikit-learn ignores normalize when fit_intercept=False
skridge = skRidge(fit_intercept=False,
                  normalize=True,
                  alpha=0.1)

skridge.fit(X_train, y_train)

CPU times: user 23 s, sys: 15.4 s, total: 38.4 s
Wall time: 4.09 s


### Predict

In [8]:
%%time
sk_predict = skridge.predict(X_test)
error_sk = mean_squared_error(y_test, sk_predict)

CPU times: user 1.72 s, sys: 1.56 s, total: 3.28 s
Wall time: 160 ms


## cuML Model

### Fit

In [9]:
%%time
# Fit the cuML ridge regression model on the training set.
# 'eig' is the faster solver; 'svd' is slower but more accurate.
curidge = cuRidge(fit_intercept=False,
                  normalize=True,
                  solver='svd',
                  alpha=0.1)

curidge.fit(X_cudf, y_cudf)

CPU times: user 3.52 s, sys: 1.7 s, total: 5.22 s
Wall time: 5.21 s
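
The trade-off between the two solvers can be sketched in NumPy. The functions below are textbook formulations, not cuML's implementation: `ridge_eig` works on the eigendecomposition of the Gram matrix `X.T @ X`, while `ridge_svd` works on the singular value decomposition of `X` itself. Both function names and the small test problem are hypothetical.

```python
import numpy as np

def ridge_svd(X, y, alpha):
    """Ridge solution from the SVD of X: w = V diag(s / (s^2 + alpha)) U^T y."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = s / (s**2 + alpha)
    return Vt.T @ (shrink * (U.T @ y))

def ridge_eig(X, y, alpha):
    """Ridge solution from the eigendecomposition of the Gram matrix X^T X."""
    lam, Q = np.linalg.eigh(X.T @ X)
    return Q @ ((Q.T @ (X.T @ y)) / (lam + alpha))

rng = np.random.default_rng(0)
X_small = rng.normal(size=(500, 8))
y_small = X_small @ rng.normal(size=8)

w_svd = ridge_svd(X_small, y_small, alpha=0.1)
w_eig = ridge_eig(X_small, y_small, alpha=0.1)

print("max abs difference:", np.max(np.abs(w_svd - w_eig)))
```

On a well-conditioned problem like this one the two routes agree closely. Forming `X.T @ X` squares the condition number of `X`, which is a common reason an eigendecomposition-based approach loses accuracy on ill-conditioned data, while only needing to decompose a small `n_features` by `n_features` matrix.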


### Predict

In [10]:
%%time
# Predict on the GPU, then copy the result to a host NumPy array for scoring
cu_predict = curidge.predict(X_cudf_test).to_array()
error_cu = mean_squared_error(y_test, cu_predict)

CPU times: user 240 ms, sys: 24 ms, total: 264 ms
Wall time: 258 ms


## Evaluate Results

In [11]:
print("SKL MSE(y): %s" % error_sk)
print("CUML MSE(y): %s" % error_cu)

SKL MSE(y): 5.886555526774146e-10
CUML MSE(y): 5.886555619959372e-10
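
The two errors can also be compared programmatically rather than by eye. The sketch below hard-codes the values printed above and uses a relative tolerance of `1e-6`, which is an arbitrary choice made for illustration.

```python
import math

# Values copied from the output above
error_sk = 5.886555526774146e-10
error_cu = 5.886555619959372e-10

rel_diff = abs(error_sk - error_cu) / max(abs(error_sk), abs(error_cu))
agree = math.isclose(error_sk, error_cu, rel_tol=1e-6)

print("relative difference:", rel_diff)
print("agree within 1e-6:  ", agree)
```

The relative difference is on the order of `1e-8`, so the two models are in close agreement on this dataset.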
