# Regression with Scikit Learn

A regression is a predictive model that looks for a functional relationship between a set of variables (X) and a continuous outcome variable (y).

In other words, given an input array, we try to predict a numerical value.

- Pandas Documentation: http://pandas.pydata.org/
- Scikit Learn Documentation: http://scikit-learn.org/stable/documentation.html
- Seaborn Documentation: http://seaborn.pydata.org/

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Weight - Height dataset

In [None]:
df = pd.read_csv('../data/weight-height.csv')

In [None]:
df.head()

### Visualize the dataset

In [None]:
df.plot(kind='scatter', x='Height', y='Weight', alpha=0.2)
plt.title('Humans');

## Seaborn pairplot

In [None]:
import seaborn as sns

In [None]:
sns.pairplot(df, hue='Gender');

## Linear regression

### Features

In [None]:
# what's the purpose of the next line?
# try to print out df['Height'].values and X
# to figure it out

X = df[['Height']].values
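
The double brackets matter because scikit-learn estimators expect the feature matrix to be 2-D, with shape `(n_samples, n_features)`. Here is a minimal sketch of the difference, assuming a made-up three-row DataFrame (`toy` and its values are hypothetical, not taken from the dataset above):

```python
import pandas as pd

# Hypothetical toy data, only for illustrating array shapes
toy = pd.DataFrame({'Height': [60.0, 65.0, 70.0]})

values_1d = toy['Height'].values    # single brackets: Series -> 1-D array
values_2d = toy[['Height']].values  # double brackets: one-column DataFrame -> 2-D array

print(values_1d.shape)  # (3,)
print(values_2d.shape)  # (3, 1)
```

The same values are stored in both arrays; only the number of dimensions differs.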

### Target

In [None]:
y = df['Weight'].values

### Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size = 0.3, random_state=0)
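
To see what the split does, here is a small sketch on a hypothetical array of ten samples (`X_toy`, `y_toy` and the `*_tr` / `*_te` names are made up for illustration). With `test_size=0.3`, three samples go to the test set and seven to the training set, and fixing `random_state` makes the shuffle reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten samples, one feature
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0)

print(X_tr.shape, X_te.shape)  # (7, 1) (3, 1)

# Running the split again with the same random_state gives the same partition
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0)
print(np.array_equal(X_te, X_te2))  # True
```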

### Fit a Linear regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()

In [None]:
model.fit(X_train, y_train)

### Regression weights

In [None]:
# coef_ is an array with one entry per feature, so take the first element
print("Slope: %.2f" % model.coef_[0])
print("Intercept: %.2f" % model.intercept_)

### Mean square error

In [None]:
y_pred_test = model.predict(X_test)

In [None]:
mse = np.mean((y_pred_test - y_test) ** 2)

In [None]:
print("Mean square error: %.2f" % mse)
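
As a cross-check, the same quantity is available as `mean_squared_error` in `sklearn.metrics`. The sketch below uses made-up numbers (`y_true` and `y_hat` are hypothetical, not results from the model above) and also takes the square root, which puts the error back in the units of the target:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions
y_true = np.array([150.0, 160.0, 170.0])
y_hat = np.array([148.0, 163.0, 170.0])

mse_manual = np.mean((y_hat - y_true) ** 2)       # (4 + 9 + 0) / 3
mse_sklearn = mean_squared_error(y_true, y_hat)   # same value via scikit-learn
rmse = np.sqrt(mse_manual)                        # root mean square error

print("%.2f" % mse_manual)   # 4.33
print("%.2f" % mse_sklearn)  # 4.33
```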

### $R^2$ score

In [None]:
model.score(X_test, y_test)
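
For a regressor, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on made-up numbers (`y_true` and `y_hat` are hypothetical) and compares the result with `r2_score` from `sklearn.metrics`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
y_true = np.array([150.0, 160.0, 170.0, 180.0])
y_hat = np.array([152.0, 158.0, 171.0, 179.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares: 10
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 500
r2_manual = 1 - ss_res / ss_tot

print("%.2f" % r2_manual)                 # 0.98
print("%.2f" % r2_score(y_true, y_hat))   # 0.98
```

A score of 1 means perfect predictions, 0 matches always predicting the mean of `y_true`, and the value can be negative on a test set when the model does worse than that baseline.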

### Plot line of best fit with Test set

In [None]:
X_test

In [None]:
x_min  = X_test.min()
x_max  = X_test.max()

x_line = np.linspace(x_min, x_max, 100) # 1-D array of 100 evenly spaced heights
X_line = np.expand_dims(x_line, 1) # 2-D column array of shape (100, 1), same values

y_line = model.predict(X_line)

In [None]:
plt.scatter(X_test, y_test)
plt.plot(X_line, y_line, color = 'red')
plt.title('Humans')
plt.xlabel('Height (in)')
plt.ylabel('Weight (lbs)');

## Exercise 1

In this exercise we extend what we have learned about linear regression to a dataset with more than one feature. Here are the steps to complete it:
- Load the dataset ../data/housing-data.csv
- Plot the histograms for each feature using `sns.pairplot`. Choose the most appropriate column for `hue`.
- Create 2 variables called X and y: X shall be a matrix with 3 columns (sqft, bdrms, age) and y shall be a vector containing the target (price)
- Create a linear regression model
- Split the data into train and test sets with a 20% test size and random state zero
- Train the model on the training set and check its $R^2$ score on both the training and the test set
- How is your model doing?
- Predict the price of a house of 2000 sqft with 3 bedrooms and 20 years of age

This dataset contains multiple columns:
- sqft
- bdrms
- age
- price


In [None]:
df = pd.read_csv('../data/housing-data.csv')

In [None]:
df.head()

In [None]:
sns.pairplot(df, hue='bdrms');

In [None]:
model = LinearRegression()

In [None]:
X = df[['sqft', 'bdrms', 'age']].values
y = df['price'].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

In [None]:
model.fit(X_train, y_train)

In [None]:
model.coef_

In [None]:
model.intercept_

In [None]:
model.score(X_train, y_train)

In [None]:
model.score(X_test, y_test)

In [None]:
model.predict([[2000, 3, 20]])

## Exercise 2
Let's expand beyond plain linear regression.
- Train a regularized regression model like [`Ridge`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html), [`Lasso`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) or [any other regression model](http://scikit-learn.org/stable/modules/linear_model.html) on the training dataset and check its score on the test set
- Does regularization improve the score?
- Try changing the regularization strength alpha

(You could try several models programmatically and collect all the results in a table.)

In [None]:
from sklearn.linear_model import Ridge, Lasso, BayesianRidge

In [None]:
results = []

for model in [LinearRegression(), Ridge(), Lasso(), BayesianRidge()]:
    model.fit(X_train, y_train)

    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    
    model_name = model.__class__.__name__
    
    results.append([model_name, train_score, test_score])

pd.DataFrame(results, columns=['model', 'train_score', 'test_score'])

In [None]:
results = []

for alpha in [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]:
    model = Ridge(alpha=alpha)
    
    model.fit(X_train, y_train)

    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
        
    results.append([alpha, train_score, test_score])

pd.DataFrame(results, columns=['alpha', 'train_score', 'test_score'])
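
To build some intuition for what `alpha` does, the sketch below fits `Ridge` on synthetic data and records the size of the coefficient vector for increasing `alpha`. Everything here is an assumption made for illustration: the data is randomly generated, and `X_toy`, `y_toy` and `coef_norms` are hypothetical names unrelated to the housing dataset. A larger `alpha` penalizes large coefficients more heavily, so the norm shrinks toward zero:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: 50 samples, 3 features, known coefficients plus small noise
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=50)

coef_norms = []
for alpha in [0.01, 1, 100, 10000]:
    ridge = Ridge(alpha=alpha).fit(X_toy, y_toy)
    # Euclidean norm of the fitted coefficients for this alpha
    coef_norms.append(np.linalg.norm(ridge.coef_))

print(coef_norms)  # values decrease as alpha grows
```

Because the penalty acts on coefficient size, features measured on very different scales (such as sqft and bdrms) are not penalized evenly, so it may be worth standardizing the features before comparing values of `alpha`.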