# 8. Regression
Let's now look at the housing dataset, which has a continuous output value. There are 80 features in the dataset. Read more about them at the [online data dictionary][1].

[1]: https://storage.googleapis.com/kaggle-competitions-data/kaggle/5407/data_description.txt?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1537163162&Signature=DJyH9SDDyhtHiKfFYNG%2F3sBpCXzB1BXkTCh9aakN3neRlcl%2BCalteIiJH9Kni%2FeGpAhdN6xQHgXinxPAJwNbG1zCymFsdG0%2FKtDBGHsthihgADyhpIkhRLhUjiwDPqstLQFe3tiAoGXc4EtBI1REpy5Az1kK4ah1X7ccxAHsluhiVloH9mteIMiYNsUkE%2BB1Gt3OnJQOYFAtRAcAyUFwf3%2BoQIrmTmF8mety89WSsi8qwdBqnGgea8eCLFba0akRrDJg9cvwtI%2F%2FFCScjQZ16l%2FjYF24ZRDtMWXW0CnXz2Q4%2Fr2Zo3rJijgr7L0b%2BlmuyordaUWEr%2BA3UrjNI3WL4g%3D%3D

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
pd.options.display.max_columns = 100
%matplotlib inline

In [None]:
housing = pd.read_csv('../data/housing.csv')
housing.head()

# Identify target variable and type of problem
The target variable is the sale price, a continuous value, which makes this a regression problem.

### Choose one predictor variable

In [None]:
X = housing['GrLivArea'].values.reshape(-1, 1)
y = housing['SalePrice'].values

# What is the simplest regression model you can think of?
### Exercise - build a model in scikit-learn that implements this simple model
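
One possible answer, sketched under assumptions: the simplest regression model ignores the predictor and always predicts the mean of the target. Scikit-learn provides this as `DummyRegressor`. The data below is synthetic (the names `X_demo` and `y_demo` and the numbers are made up for illustration), so the block runs without the housing file.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing data (assumption: not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 4000, size=(200, 1))            # living area
y_demo = 100 * X_demo[:, 0] + rng.normal(0, 30000, 200)   # noisy sale price

# Always predict the mean of the training target, ignoring X entirely
dummy = DummyRegressor(strategy='mean')
dummy_scores = cross_val_score(dummy, X_demo, y_demo, cv=5)  # R-squared per fold

dummy.fit(X_demo, y_demo)
dummy_pred = dummy.predict(X_demo[:3])  # three copies of y_demo.mean()
```

The cross-validated R-squared of this baseline is at or slightly below zero, which gives any real model a floor to beat.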

# Linear Regression
Other than our dummy estimator, linear regression is among the simplest regression models we can choose:

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
lr = LinearRegression()
cross_val_score(lr, X, y, cv=5)

### Objective for regression - Minimize squared distance between actual and predicted values (scored with R-squared)

![][1]

[1]: images/r2.png
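
To make the score concrete, here is a small sketch that computes R-squared by hand and compares it with scikit-learn's `r2_score`. The four actual and predicted values are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values (illustration only)
y_true = np.array([200.0, 150.0, 320.0, 275.0])
y_pred = np.array([210.0, 140.0, 300.0, 280.0])

# Squared error of the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)
# Squared error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)  # should agree with r2_manual
```

A value of 1 means a perfect fit, and 0 means the model does no better than predicting the mean.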

# Practice Regression in Scikit-Learn
Find several more regression models in the API, import, instantiate, fit, and score.

# Overfitting and Underfitting

![][1]

Credit: Andrew Ng ML class

[1]: images/reg.png

# Flexible vs Inflexible models
Given an infinite amount of flexibility, a model can be built to perfectly fit every possible point in the training dataset.

### More on flexibility
Flexibility can be achieved in two main ways:
* Using more parameters in the model
* Using a more flexible model
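
As a rough illustration of the first point, the sketch below fits polynomials of increasing degree (more parameters) to the same noisy points. The data and the helper `training_sse` are hypothetical. The training error can only go down as the degree grows, which says nothing about how well the model generalizes.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Noisy samples from a sine curve (synthetic, for illustration)
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)

def training_sse(degree):
    """Sum of squared errors on the training points for a polynomial fit."""
    poly = Polynomial.fit(x, y, deg=degree)
    return float(np.sum((y - poly(x)) ** 2))

# Higher degree = more parameters = more flexibility
sse_by_degree = {d: training_sse(d) for d in (1, 3, 9)}
```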

# This is called the Bias/Variance Tradeoff

### Definitions
* **Bias** - You have a lot of confidence in your assumptions about the model and use a model that is not flexible enough to capture more complex relationships.
* **Variance** - You put too much reliance on the data. You build a model that is very sensitive to the inputs and overfits the data. Small changes in the input data can lead to drastically different fitted values of the parameters.

High variance is often referred to as **memorization**.
![](images/bv_trade.png?1)
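
One way to see variance in action is to refit the same model on many fresh samples and watch how much its prediction at a single point moves around. The sketch below does this for a rigid (degree 1) and a flexible (degree 9) polynomial. Everything here is synthetic, and the helper `predictions_at_x0` is hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)
x0 = 0.95  # the point at which predictions are compared

def predictions_at_x0(degree, n_repeats=200, n_points=30):
    """Refit on fresh noisy samples and collect the prediction at x0."""
    preds = []
    for _ in range(n_repeats):
        x = rng.uniform(0, 1, n_points)
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n_points)
        poly = Polynomial.fit(x, y, deg=degree)
        preds.append(poly(x0))
    return np.array(preds)

preds_rigid = predictions_at_x0(degree=1)
preds_flexible = predictions_at_x0(degree=9)

# Bias: how far the average prediction is from the true value
bias_rigid = preds_rigid.mean() - np.sin(2 * np.pi * x0)
# Variance: how much predictions change from sample to sample
spread_rigid = preds_rigid.std()
spread_flexible = preds_flexible.std()
```

The rigid model's predictions stay close together but are systematically off, while the flexible model's predictions are spread much more widely.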

## Linear Regression with all numeric variables
Let's build a model with the numeric variables. Notice that we drop the final sale price since it is the target variable.

In [None]:
# Id is an integer label for the rows
# MSSubClass is a numeric code for the type of dwelling (a category, not a quantity)
sale_price = housing['SalePrice']
housing_num = housing.select_dtypes('number').drop(columns=['Id', 'MSSubClass', 'SalePrice'])
housing_num.head()

In [None]:
housing_num_filled = housing_num.fillna(housing_num.median())
housing_num_filled.isna().sum()

In [None]:
X = housing_num_filled.values
y = sale_price.values

In [None]:
X.shape

In [None]:
y.shape

In [None]:
lr.fit(X, y)
lr.score(X, y)

## Not much overfitting
The cross-validated score, our estimate of generalization performance, was only about 3% lower than the score on the training data, which gives us confidence that the model is not overfitting much.
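
A minimal sketch of this comparison, assuming synthetic data in place of the housing features (the names `X_demo`, `y_demo`, and `lr_demo` are hypothetical): score the model on the data it was trained on, then compare with the mean cross-validated score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data with noise (not the housing dataset)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(300, 10))
y_demo = X_demo @ np.arange(1, 11) + rng.normal(0, 5, 300)

lr_demo = LinearRegression()
# R-squared on the same data used for fitting
train_score = lr_demo.fit(X_demo, y_demo).score(X_demo, y_demo)
# Mean R-squared on held-out folds
cv_score = cross_val_score(lr_demo, X_demo, y_demo, cv=5).mean()

# A large positive gap would suggest overfitting
gap = train_score - cv_score
```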

# Regularization
In machine learning, regularization is a process of constraining our model during training so that it doesn't overfit to the training data. We purposefully weaken the model so that it can't fit the noise. It is, of course, possible to apply too much of a dampening effect so that the signal is lost. The key is to find just the right amount of regularization.

## Regularization in linear regression
Without regularization, the goal of linear regression is to get our predictions as close to the actual points as possible. However, when using regularization, we have an additional goal of keeping the values of the fitted parameters within a certain threshold.


# Ridge/Lasso Regression - A new problem formulation
* Minimize the sum of squared errors.
* For ridge, the sum of the squared values of the parameters must be less than some number (Scikit-Learn controls this with the **`alpha`** parameter, where a larger `alpha` means a tighter constraint)
* For lasso, we use the sum of the absolute values of the parameters.
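
Written out, the formulation above looks roughly like this, with $\beta_j$ as the fitted parameters and $t$ as the size of the boundary (both symbols are notation chosen here, not taken from the notebook):

$$
\text{Ridge:}\quad \min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t
$$

$$
\text{Lasso:}\quad \min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t
$$

In practice the constraint is usually folded into the objective as a penalty with strength $\lambda$, for example for ridge:

$$
\min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

A larger $\lambda$ corresponds to a smaller $t$. Scikit-learn's `Ridge` and `Lasso` expose this penalty strength as `alpha`.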


## There is a boundary for your parameters - Find the minimum sum of squares within this boundary

![][1]


[1]: images/ridge.png?1

# Can no longer reach the minimum squared error
The center of the red contours represents the optimal values of the parameters without constraints. 

The point where the contour touches the blue-green region is the lasso/ridge solution.

## Use the Ridge and Lasso regressors in scikit-learn

In [None]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV

## Can you find the best alpha?
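
One way to approach this, sketched on synthetic data: hand `GridSearchCV` a list of candidate values for `alpha` and let it pick the one with the best cross-validated score. The candidate grid below is an arbitrary assumption, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the housing features
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(200, 10))
y_demo = X_demo @ np.arange(1, 11) + rng.normal(0, 5, 200)

# Candidate regularization strengths spanning several orders of magnitude
alphas = [0.01, 0.1, 1, 10, 100, 1000]
grid = GridSearchCV(Ridge(), param_grid={'alpha': alphas}, cv=5)
grid.fit(X_demo, y_demo)

best_alpha = grid.best_params_['alpha']  # alpha with the best mean CV score
best_cv_score = grid.best_score_         # that mean CV R-squared
```

The same pattern works for `Lasso` by swapping the estimator.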

## Regularization works best when the features are on the same scale

# Scale features so they are all in the same range
Standardization is the process of transforming a feature so that it has a mean of 0 and a standard deviation of 1. Scikit-Learn provides the **`StandardScaler`** estimator. This object is the first **transformer** we have seen. A transformer is similar to an estimator, but it literally transforms the data into something different.

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

In [None]:
X_scaled = ss.fit_transform(X)
X_scaled[:5]
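
To check what the scaler is doing, the sketch below standardizes a tiny made-up array by hand and compares the result with `StandardScaler`. It also uses `inverse_transform` to map the scaled values back to the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
X_demo = np.array([[1000.0, 1.0],
                   [1500.0, 2.0],
                   [2500.0, 2.0],
                   [3000.0, 4.0]])

scaler = StandardScaler()
X_demo_scaled = scaler.fit_transform(X_demo)

# Same computation by hand: subtract the column mean, divide by the column std
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The fitted scaler remembers the means and can undo the transformation
learned_means = scaler.mean_
X_demo_restored = scaler.inverse_transform(X_demo_scaled)
```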

### Try again with scaled data

In [None]:
ridge = Ridge()
ridge.fit(X_scaled, y)
ridge.score(X_scaled, y)

In [None]:
cross_val_score(ridge, X_scaled, y=y, cv=5).mean()
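
A design note, offered as a sketch rather than as part of the original workflow: when the scaler is fit on all of the data before cross-validation, each held-out fold has already influenced the scaling. Wrapping the scaler and the model in a pipeline lets scikit-learn refit the scaler on the training portion of each fold. The data below is synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(200, 5)) * np.array([1, 10, 100, 1000, 10000])
y_demo = X_demo @ np.array([3.0, 0.5, 0.02, 0.001, 0.0003]) + rng.normal(0, 2, 200)

# The scaler is fit only on the training part of each fold
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
```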

In [None]:
lr.fit(X_scaled, y).score(X_scaled, y)

## Repeat with Lasso

In [None]:
lasso = Lasso()

In [None]:
lasso.fit(X_scaled, y).score(X_scaled, y)

### Not enough regularization to see a difference. Let's apply more
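
A sketch of what applying more regularization does to lasso, using synthetic data where only three of eight features matter: as `alpha` grows, more coefficients are set exactly to zero. The `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first three features affect the target
rng = np.random.default_rng(6)
X_demo = rng.normal(size=(200, 8))
true_coefs = np.array([5.0, -3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coefs + rng.normal(0, 1, 200)

# Count how many coefficients survive at each regularization strength
nonzero_counts = {}
for alpha in (0.01, 1.0, 100.0):
    lasso_demo = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(lasso_demo.coef_ != 0))
```

With a very large `alpha`, every coefficient is zero and the model reduces to predicting the mean.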

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Look through the methods of the StandardScaler instance.</span>

### Problem 2
<span  style="color:green; font-size:16px">Can you verify that the coefficients shrink as regularization increases? </span>
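
One possible way to start on Problem 2, sketched on synthetic data rather than the housing features: fit `Ridge` at several values of `alpha` and track the overall size (the L2 norm) of the coefficient vector.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumption: stands in for the scaled housing features)
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(100, 6))
y_demo = X_demo @ np.array([4.0, -2.0, 1.0, 0.5, 0.0, 0.0]) + rng.normal(0, 1, 100)

alphas = [0.1, 1, 10, 100, 1000]
# Size of the coefficient vector at each regularization strength
coef_norms = [
    float(np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_))
    for a in alphas
]
```

The norms should get smaller as `alpha` increases. Plotting the individual coefficients against `alpha` is a natural next step.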