# Linear Regression
Linear regression is a simple machine learning model in which the response `y` is modelled as a linear combination of the predictors in `X`, plus an optional intercept.
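
As a minimal sketch of what "a linear combination of the predictors" means, the NumPy snippet below builds a small noise-free problem and recovers its coefficients with ordinary least squares. The array sizes, the coefficient values, and the names `true_w`, `true_b`, `w_hat`, and `b_hat` are illustrative assumptions and are not part of this notebook's workflow.

```python
import numpy as np

# Small synthetic problem: y = X @ w + b with known (illustrative) coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
y = X @ true_w + true_b

# Append a column of ones so the intercept is estimated alongside the weights.
X_aug = np.column_stack([X, np.ones(len(X))])

# Ordinary least squares: minimise ||X_aug @ coef - y||^2.
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = coef[:-1], coef[-1]

print("estimated weights:", w_hat)
print("estimated intercept:", b_hat)
```

Because the data contain no noise, the estimates should match the true coefficients up to floating-point error. The scikit-learn and cuML estimators used below fit the same kind of model at a much larger scale.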

For information on converting your dataset to cuDF format, refer to the [cuDF documentation](https://rapidsai.github.io/projects/cudf/en/latest/).

For additional information on cuML's linear regression, see the [cuML API documentation](https://rapidsai.github.io/projects/cuml/en/latest/api.html#linear-regression).

In [9]:
import numpy as np
import pandas as pd
import cudf as gd  # GPU DataFrame library

from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# The same estimator from both libraries, aliased so they can be compared side by side
from cuml.linear_model import LinearRegression as cuLR
from sklearn.linear_model import LinearRegression as skLR

## Define Parameters

In [2]:
n_samples = 2**20
n_features = 399

## Generate Data

### Host

In [17]:
%%time
# Generate a synthetic regression problem on the host (CPU)
X, y = make_regression(n_samples=n_samples, n_features=n_features, random_state=0)

X = pd.DataFrame(X)
y = pd.Series(y)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

CPU times: user 32.2 s, sys: 19.3 s, total: 51.4 s
Wall time: 34.5 s


### GPU

In [18]:
%%time
X_cudf = gd.DataFrame.from_pandas(X_train)
X_cudf_test = gd.DataFrame.from_pandas(X_test)

y_cudf = gd.Series(y_train.values)

CPU times: user 8.87 s, sys: 1.91 s, total: 10.8 s
Wall time: 10.8 s


## Scikit-learn Model

### Fit

In [5]:
%%time
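# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, drop it or standardise the features beforehand.
skols = skLR(fit_intercept=True,
             normalize=True,
             n_jobs=-1)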

skols.fit(X_train, y_train)

CPU times: user 5min 20s, sys: 3min 57s, total: 9min 18s
Wall time: 27.2 s


### Evaluate

In [6]:
%%time
sk_predict = skols.predict(X_test)

error_sk = mean_squared_error(y_test, sk_predict)

CPU times: user 1.48 s, sys: 600 ms, total: 2.08 s
Wall time: 209 ms


## cuML Model

### Fit

In [10]:
%%time
cuols = cuLR(fit_intercept=True,
             normalize=True,
             algorithm='eig')

cuols.fit(X_cudf, y_cudf)

CPU times: user 940 ms, sys: 400 ms, total: 1.34 s
Wall time: 1.35 s
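
cuML's documentation describes `algorithm='eig'` as solving the regression through an eigendecomposition of the covariance matrix. The NumPy sketch below illustrates that general idea on the CPU. It is not cuML's implementation, and the function name `fit_ols_eig`, the tolerance rule, and the demo data are assumptions made for illustration.

```python
import numpy as np

def fit_ols_eig(X, y):
    """Least squares via an eigendecomposition of X^T X (illustrative only)."""
    # Centre the data so the intercept can be recovered separately.
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean

    # The Gram matrix is symmetric, so eigh applies and returns real eigenvalues.
    gram = Xc.T @ Xc
    eigvals, eigvecs = np.linalg.eigh(gram)

    # Invert only eigenvalues safely above zero; the rest are treated as zero.
    tol = eigvals.max() * X.shape[1] * np.finfo(eigvals.dtype).eps
    keep = eigvals > tol
    inv = np.zeros_like(eigvals)
    inv[keep] = 1.0 / eigvals[keep]

    # coef = V diag(1/lambda) V^T X^T y
    coef = eigvecs @ (inv * (eigvecs.T @ (Xc.T @ yc)))
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Demo on a small noise-free problem with known coefficients.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
w_demo = np.array([1.0, -2.0, 0.0, 3.0])
y_demo = X_demo @ w_demo + 0.5

coef, intercept = fit_ols_eig(X_demo, y_demo)
print(coef, intercept)
```

Working with the small `n_features x n_features` Gram matrix is cheap when there are many more rows than columns, as in this notebook. The trade-off is that forming `X^T X` squares the condition number of the problem, so an SVD-based solver is generally the more numerically robust choice for ill-conditioned data.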


### Evaluate

In [11]:
%%time
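# to_array() copies the predictions back to a host NumPy array;
# newer cuDF releases provide to_numpy() for this instead.
cu_predict = cuols.predict(X_cudf_test).to_array()

error_cu = mean_squared_error(y_test, cu_predict)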

CPU times: user 268 ms, sys: 12 ms, total: 280 ms
Wall time: 276 ms


## Compare Results

In [16]:
print("SKL MSE(y): %s" % error_sk)
print("CUML MSE(y): %s" % error_cu)

SKL MSE(y): 3.839160481932643e-25
CUML MSE(y): 3.520005424017312e-25
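
Both errors are on the order of 1e-25, which is effectively floating-point noise: `make_regression` adds no noise by default, so an exact linear fit exists. To check agreement between the two models more directly than by comparing two MSE values, one option is to compare the prediction arrays element by element. The helper below is a minimal sketch; the name `compare_predictions`, its tolerances, and the stand-in arrays are assumptions, and in this notebook the inputs would be `sk_predict` and `cu_predict`.

```python
import numpy as np

def compare_predictions(pred_a, pred_b, rtol=1e-5, atol=1e-8):
    """Return the largest absolute gap and whether the arrays agree within tolerance."""
    a = np.asarray(pred_a, dtype=np.float64)
    b = np.asarray(pred_b, dtype=np.float64)
    max_abs_diff = float(np.max(np.abs(a - b)))
    return max_abs_diff, bool(np.allclose(a, b, rtol=rtol, atol=atol))

# Stand-in arrays; replace with sk_predict and cu_predict in the notebook.
pred_cpu = np.array([1.0, 2.0, 3.0])
pred_gpu = pred_cpu + 1e-12

max_gap, agree = compare_predictions(pred_cpu, pred_gpu)
print("max abs difference:", max_gap)
print("agree within tolerance:", agree)
```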
