# Wrangling Big Data

<img src="assets/ml_map.png" width="400px">

[source](http://scikit-learn.org/stable/tutorial/machine_learning_map/)

![](assets/regression_ml_map.png)

## Why Stochastic Gradient Descent? 

Per the `sklearn` documentation:

> The advantages of Stochastic Gradient Descent are:
> - Efficiency.
> - Ease of implementation (lots of opportunities for code tuning).

> The disadvantages of Stochastic Gradient Descent include:
> - SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.
> - SGD is sensitive to feature scaling.

Stated even more plainly:

> [`SGDRegressor`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor) is well suited for regression problems with a large number of training samples (> 10.000), for other problems we recommend [`Ridge`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge), [`Lasso`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso), or [`ElasticNet`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet).

![](assets/100K.png)

## Out of the Box Comparison

In [None]:
from os import chdir; chdir('../../../lib/')
from mglearn.datasets import load_extended_boston

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split
%matplotlib inline

### Domain and Dataset

We are interested in comparing the performance of the standard `sklearn` linear regressors out of the box. As such, our "dataset" is the set of the following four linear regressors:

- `sklearn.linear_model.LinearRegression`
- `sklearn.linear_model.Ridge`
- `sklearn.linear_model.Lasso`
- `sklearn.linear_model.SGDRegressor`

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, SGDRegressor

In order to perform the regression we will be using the "extended boston" dataset provided by the `mglearn` library. This dataset has 506 instances and 104 features. It is accompanied by a numeric target vector of quantitative values. It is a "canonical" linear regression dataset and is often used for learning linear regression concepts.

In [None]:
X, y = load_extended_boston()
X.shape, y.shape

**Note**: It is considered a best practice when using Stochastic Gradient Descent (and linear models in general) to scale one's data. We will be preprocessing our data using the `StandardScaler` included with `sklearn`.

In [None]:
from sklearn.preprocessing import StandardScaler
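
The scaling step in the note above can also be sketched as a pipeline, which keeps the scaler and the regressor together so that the scaler is fit on the training split only. This is a minimal sketch, not the approach used in the rest of this notebook: the toy data, the `*_demo` names, and the choice of `make_pipeline` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the sizes and noise level are illustrative only.
X_demo, y_demo = make_regression(n_samples=500, n_features=20,
                                 noise=10.0, random_state=0)
# Stretch one column so the features sit on very different scales.
X_demo[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The pipeline fits the scaler on the training split, then passes the
# scaled features on to SGDRegressor.
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled_sgd.fit(X_tr, y_tr)

# After scaling, each training column has mean ~0 and standard deviation ~1.
X_tr_scaled = scaled_sgd.named_steps['standardscaler'].transform(X_tr)
col_means = X_tr_scaled.mean(axis=0)
col_stds = X_tr_scaled.std(axis=0)

# R^2 on the held-out split; the test data is scaled with the training statistics.
test_r2 = scaled_sgd.score(X_te, y_te)
```

Calling `scaled_sgd.score(X_te, y_te)` applies the already-fitted scaler before scoring, which mirrors the manual `fit` / `transform` sequence used in the code cells of this notebook.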

### Problem Statement

Given the set of four linear regressors in their "out-of-the-box" state (i.e., default settings with no tuning applied), we wish to see how they perform when applied to the same dataset.

### Solution Statement

As this is an exploratory problem, our solution will take the form of scores for each model presented in a plot. 

### Metric

N/A

### Benchmark

N/A

### Code Template

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

LR = LinearRegression()
LR.fit(X_train, y_train)
print(LR.score(X_train, y_train))
print(LR.score(X_test, y_test))

In [None]:
def fit_and_score_linear_model(model, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    scaler = StandardScaler()
    scaler.fit(X_train)

    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    model.fit(X_train, y_train)
    return {'model': model, 'train_score' : model.score(X_train, y_train), 'test_score': model.score(X_test, y_test)}

In [None]:
fit_and_score_linear_model(LinearRegression(), X, y)

In [None]:
fit_and_score_linear_model(Ridge(), X, y)

In [None]:
fit_and_score_linear_model(Lasso(), X, y)

In [None]:
fit_and_score_linear_model(SGDRegressor(), X, y)

In [None]:
fits = [fit_and_score_linear_model(LinearRegression(), X, y),
        fit_and_score_linear_model(Ridge(), X, y),
        fit_and_score_linear_model(Lasso(), X, y),
        fit_and_score_linear_model(SGDRegressor(), X, y)]

fits = pd.DataFrame(fits)
fits

### Pandas Munging to make Seaborn Happy

In [None]:
for_plot = pd.concat([fits[['model','test_score']],fits[['model','train_score']]])
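# Keep only the class name, e.g. "Ridge" from "Ridge()"
for_plot['model'] = for_plot['model'].astype(str).str.extract(r'(.+)\(', expand=False)
for_plot['type'] = ['test']*4+['train']*4
# Merge the two score columns into one (assignment avoids chained inplace edits)
for_plot['train_score'] = for_plot['train_score'].fillna(for_plot['test_score'])
for_plot = for_plot.drop('test_score', axis=1)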
for_plot.columns = ['model', 'score', 'type']
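
As an alternative sketch of the same wide-to-long reshaping, `DataFrame.melt` produces one row per (model, score type) pair in a single call. The `fits_demo` frame and its score values below are made-up placeholders for illustration, not results from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the `fits` frame; the scores are placeholders.
fits_demo = pd.DataFrame({
    'model': ['LinearRegression', 'Ridge'],
    'train_score': [0.9, 0.8],
    'test_score': [0.6, 0.7],
})

# One row per (model, score type) pair, which is the long format
# that seaborn's `hue` argument expects.
long_form = fits_demo.melt(id_vars='model',
                           value_vars=['test_score', 'train_score'],
                           var_name='type',
                           value_name='score')

# Turn "test_score" / "train_score" into "test" / "train".
long_form['type'] = long_form['type'].str.replace('_score', '', regex=False)
```

The resulting frame has the columns `model`, `type`, and `score`, so it could be passed to `sns.barplot(x='model', y='score', hue='type', data=long_form)` in the same way as `for_plot`.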

In [None]:
barplot = sns.barplot(x='model', y='score', hue='type', data=for_plot)

---

### Domain and Dataset

We are interested in comparing the timing performance of three standard `sklearn` linear regressors out of the box using datasets of different sizes. As such, our "dataset" is the set of the following three linear regressors:

- `sklearn.linear_model.LinearRegression`
- `sklearn.linear_model.Ridge`
- `sklearn.linear_model.SGDRegressor`


In [None]:
from sklearn.linear_model import LinearRegression, Ridge, SGDRegressor

In order to perform the regression we will be using the `make_regression` function provided by the `sklearn.datasets` module. This function constructs a regression problem with a given number of instances and a given number of features. It is accompanied by a numeric target vector of quantitative values.
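
To make the structure of the generated problem concrete, here is a small sketch with much smaller sizes than the ones used in this notebook. The sizes, the `*_small` names, and the use of `coef=True` and `noise=0.0` are assumptions chosen for illustration; with zero noise the target is an exact linear combination of the features.

```python
import numpy as np
from sklearn.datasets import make_regression

# Small illustrative problem: 200 instances, 20 features, no noise.
# coef=True additionally returns the ground-truth coefficient vector.
X_small, y_small, coef_small = make_regression(n_samples=200,
                                               n_features=20,
                                               noise=0.0,
                                               coef=True,
                                               random_state=0)

# Without noise (and with the default bias of 0), y is exactly X @ coef.
is_exactly_linear = np.allclose(X_small.dot(coef_small), y_small)
```

Raising `noise` adds Gaussian noise to the target, which is why the fitted models in this notebook cannot reach a perfect score.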

In [None]:
from time import time
from sklearn import datasets
X, y = datasets.make_regression(int(1e4), 
                                n_features=1000, 
                                noise=50.0)

In [None]:
X.shape

**Note**: It is considered a best practice when using Stochastic Gradient Descent (and linear models in general) to scale one's data. We will be preprocessing our data using the `StandardScaler` included with `sklearn`.

In [None]:
from sklearn.preprocessing import StandardScaler

### Problem Statement

Given the set of three linear regressors in their "out-of-the-box" state (i.e., default settings with no tuning applied), we wish to see how they perform when applied to regression datasets of varying sizes.

### Solution Statement

As this is an exploratory problem, our solution will take the form of fit times and scores for each model, with the fit times presented in a plot.

### Metric

N/A

### Benchmark

N/A

### Code Template

In [None]:
start = time()
fit = fit_and_score_linear_model(LinearRegression(), X, y)
end = time()
fit['test_score'], end - start

In [None]:
def time_train_score_for_LM_of_size_n(model, n):
    X, y = datasets.make_regression(int(n),
                                    n_features=1000, 
                                    noise=50.0)
    start = time()
    fit = fit_and_score_linear_model(model, X, y)
    end = time()
    return {'score': fit['test_score'], 'time': end - start}

In [None]:
results = []
for n in [1e2, 1e3, 1e4]:
    res_dict = time_train_score_for_LM_of_size_n(LinearRegression(), n)
    res_dict['type'] = 'LR'
    res_dict['n'] = n
    results.append(res_dict)
    
    res_dict = time_train_score_for_LM_of_size_n(SGDRegressor(), n)
    res_dict['type'] = 'SGD'
    res_dict['n'] = n
    results.append(res_dict)
    
    res_dict = time_train_score_for_LM_of_size_n(Ridge(), n)
    res_dict['type'] = 'Ridge'
    res_dict['n'] = n
    results.append(res_dict)
    

In [None]:
results_df = pd.DataFrame(results)
sns.barplot(x='n', y='time', hue='type', data=results_df)
plt.yscale('log')
results_df

## Tuning Our Linear Models

### Domain and Dataset

We are interested in comparing the performance of two standard `sklearn` linear regressors after they have been tuned. As such, our "dataset" is the set of the following two linear regressors:

- `sklearn.linear_model.Ridge`
- `sklearn.linear_model.SGDRegressor`


In [None]:
from sklearn.linear_model import Ridge, SGDRegressor

In order to perform the regression we will be using the `make_regression` function provided by the `sklearn.datasets` module. This function constructs a regression problem with a given number of instances and a given number of features. It is accompanied by a numeric target vector of quantitative values.

We will use it to construct a regression problem with 100,000 instances and 1,000 features.

In [None]:
from time import time
from sklearn import datasets
X, y = datasets.make_regression(int(1e5), 
                                n_features=1000, 
                                noise=75.0)

In [None]:
X.shape

**Note**: It is considered a best practice when using Stochastic Gradient Descent (and linear models in general) to scale one's data. We will be preprocessing our data using the `StandardScaler` included with `sklearn`.

In [None]:
from sklearn.preprocessing import StandardScaler

### Problem Statement

Given the set of two linear regressors, we wish to see how these perform when tuned for performance against a large dataset. 

### Solution Statement

As this is an exploratory problem, our solution will take the form of scores for each model presented in a plot. 

### Metric

We will use the default regression scorer, $R^2$, evaluated on the test set. We will also be looking at fit time.
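
For reference, $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand so it can be compared with `sklearn.metrics.r2_score`; the `y_true` and `y_pred` arrays are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: error left over after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean of y_true.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values are possible when the model does worse than that.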

### Benchmark

We will use the performance of the default `Ridge` regression as our benchmark performance and model.

In [None]:
start = time()
fit = fit_and_score_linear_model(Ridge(), X, y)
end = time()
print(end - start, fit['test_score'])

We have a benchmark time of 9.986 seconds and a benchmark $R^2$ score of $0.886262$.

### Tune the Ridge Regression


In [None]:
ridge_results = []
alphas = [0.001, 0.01, 0.1, 1.0, 2, 5, 10]
for alpha in alphas:
    print(alpha)
    start = time()
    fit = fit_and_score_linear_model(Ridge(alpha=alpha), X, y)
    end = time()
    # Collect into ridge_results, not the timing `results` list from the previous section
    ridge_results.append({'alpha': alpha,
                          'type' : 'ridge',
                          'score' : fit['test_score'],
                          'time' : end-start})

In [None]:
ridge_results_df = pd.DataFrame(ridge_results)
ridge_results_df

In [None]:
plt.plot(ridge_results_df['alpha'],
         ridge_results_df['score'])
plt.xscale('log')

In [None]:
sgd_results = []
alphas = [0.001, 0.01, 0.1, 1.0, 2, 5, 10]
for alpha in alphas:
    print(alpha)
    start = time()
    fit = fit_and_score_linear_model(
                SGDRegressor(alpha=alpha), X, y)
    end = time()
    sgd_results.append({'alpha': alpha,
                        'type' : 'SGD', 
                        'score' : fit['test_score'],
                        'time' : end-start})
    
sgd_results_df = pd.DataFrame(sgd_results)
sgd_results_df    

In [None]:
plt.plot(sgd_results_df['alpha'],
         sgd_results_df['score'])
plt.xscale('log')

In [None]:
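sgd_results_2 = []
alpha = 0.001

# Initial fit; with warm_start=True, later fit() calls start from the
# previously learned coefficients instead of reinitializing them
start = time()
fit = fit_and_score_linear_model(
                    SGDRegressor(alpha=alpha,
                                 warm_start=True), X, y)
end = time()
sgd_results_2.append({'alpha': alpha,
                    'type' : 'SGD', 
                    'score' : fit['test_score'],
                    'time' : end-start})


for _ in range(25):
    # Refit the same model object and time each pass separately
    start = time()
    fit = fit_and_score_linear_model(fit['model'], X, y)
    end = time()
    
    sgd_results_2.append({'alpha': alpha,
                        'type' : 'SGD', 
                        'score' : fit['test_score'],
                        'time' : end-start})
    
sgd_results_2_df = pd.DataFrame(sgd_results_2)
sgd_results_2_df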

In [None]:
plt.plot(range(len(sgd_results_2_df['score'])),
         sgd_results_2_df['score'])