# Regularized Methods

- Feature Scaling
- Test/Train split
- Ridge, LASSO, Elastic Net Regression methods

---

In ordinary linear regression, we start with a linear function of a single feature:

$$ \hat y = ax + b$$



The residual sum of squares ($RSS$) of these predictions is given below; dividing it by $n$ gives the mean squared error ($MSE$):

$$RSS(a, b) = \sum_{i = 1}^n(y_i -  (ax_i + b))^2$$
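As a quick numerical check of the formula, here is a minimal sketch that computes the $RSS$ for one candidate line and divides by $n$ to get the $MSE$. The arrays and the values of `a` and `b` are made up for illustration; they are not taken from the Ames data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# toy data, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y_obs = np.array([3.1, 4.9, 7.2, 8.8])

a, b = 2.0, 1.0                      # candidate slope and intercept
y_hat = a * x + b                    # predictions from the line

rss = np.sum((y_obs - y_hat) ** 2)   # residual sum of squares
mse_manual = rss / len(y_obs)        # MSE is the RSS averaged over n points

# sklearn's helper should agree with the manual computation
mse_sklearn = mean_squared_error(y_obs, y_hat)
print(rss, mse_manual, mse_sklearn)
```

The two quantities differ only by the constant factor $1/n$, so the same $a$ and $b$ minimize both.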

From this basic $RSS$ formulation, we can introduce regularized methods that add a *regularization term* to the $RSS$.  We will look at three methods that offer slight variations on this term.

### Feature Scaling

To use these methods, we want to scale our data.  Many Machine Learning algorithms don't do well when features are on very different scales, and a regularization penalty in particular treats all coefficients alike only if the features are comparable.  The `MinMaxScaler` normalizes each feature so its values fall between 0 and 1.  The `StandardScaler` instead centers each feature at zero with unit variance; it does not bound the values, and it is typically less distorted by a few extreme values. We will use both on our Ames housing data.  To begin, we need to select the numeric columns from the DataFrame so that we transform only those.
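To see the difference between the two scalers, here is a small sketch on a single made-up feature that contains one extreme value. The array `toy` is an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# one made-up feature with a single extreme value
toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

toy_minmax = MinMaxScaler().fit_transform(toy)   # (x - min) / (max - min)
toy_std = StandardScaler().fit_transform(toy)    # (x - mean) / std

print(toy_minmax.ravel())
print(toy_std.ravel())
```

With min-max scaling the extreme value maps to 1 and the remaining values are squeezed close to 0; with standardization the result has mean 0 and standard deviation 1 but no fixed bounds.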

In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
# load the data and separate the target (SalePrice) from the features
ames = pd.read_csv('data/ames_housing.csv')
y = ames['SalePrice']
ames = ames.drop('SalePrice', axis = 1)

In [None]:
ames_numeric = ames.select_dtypes(include = 'int64')
ames_numeric.head()

### Using the Scaler on a DataFrame

Below, we can compare the results of the two scaling transformations by selecting the numeric columns from the DataFrame and passing them to each scaler.  Note the pattern of initializing the object, fitting it, and transforming; `fit_transform` performs the last two steps in one call.

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [None]:
std_scaled = StandardScaler()
minmax_scaled = MinMaxScaler()

In [None]:
cols = ames_numeric.columns

In [None]:
# indexing with the column Index selects the same columns as the list comprehension
std_df = std_scaled.fit_transform(ames[cols])
minmax_df = minmax_scaled.fit_transform(ames[cols])

In [None]:
pd.DataFrame(std_df).head()

In [None]:
pd.DataFrame(minmax_df).head()
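The scalers return plain NumPy arrays, which is why the frames above show integer column labels. A minimal sketch of one way to keep the original labels follows; the frame `toy_df` and its column names are hypothetical stand-ins for the numeric Ames columns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# small made-up frame standing in for the numeric Ames columns
toy_df = pd.DataFrame({'area': [850, 1200, 1500, 2100],
                       'rooms': [3, 4, 5, 7]})

toy_scaler = StandardScaler()
scaled_array = toy_scaler.fit_transform(toy_df)   # NumPy array, labels are lost

# rebuild a DataFrame so the column names and index survive
scaled_df = pd.DataFrame(scaled_array, columns=toy_df.columns, index=toy_df.index)
scaled_df.head()
```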

### Fit a Linear Model on Scaled Data

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression()

In [None]:
y = np.log(y)

In [None]:
ames_numeric_scaled = std_scaled.fit_transform(ames[cols])

In [None]:
lm.fit(ames_numeric_scaled, y)

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
predictions = lm.predict(ames_numeric_scaled)

In [None]:
mse = mean_squared_error(y, predictions)

In [None]:
rmse = np.sqrt(mse)
# score against the true targets; scoring against the predictions always gives 1.0
score = lm.score(ames_numeric_scaled, y)

In [None]:
print('R-squared score: {}'.format(score), '\nRMSE: {:.4f}'.format(rmse))

### Splitting the Data 

As we have seen, we tend to overfit if we use the entire dataset both to fit and to evaluate the model.  To account for this, we split the data into a **training set** to build the model on and a **test set** to evaluate its performance.  sklearn's `train_test_split` function does this; by default it places 75% of the rows in the training set and 25% in the test set.
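A short sketch of the split on made-up data, passing the proportion explicitly rather than relying on the default. The arrays `X_toy` and `y_toy` are illustrative assumptions; `test_size` and `random_state` are standard arguments of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data: 100 rows, 3 features
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 3))
y_toy = rng.normal(size=100)

# test_size sets the held-out fraction; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` is useful when comparing models, since every model then sees the same training and test rows.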

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(ames_numeric_scaled, y)

In [None]:
lm.fit(X_train, y_train)

In [None]:
pred = lm.predict(X_test)

In [None]:
mse = mean_squared_error(y_test, pred)

In [None]:
rmse = np.sqrt(mse)
rmse

### Regularized Methods Comparison



In [None]:
crime = pd.read_csv('data/crime_data.csv', index_col = 'Unnamed: 0')

In [None]:
crime.head()

In [None]:
y = crime['ViolentCrimesPerPop']

In [None]:
X = crime.drop('ViolentCrimesPerPop', axis = 1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# reuse the min/max learned from the training set; do not refit on the test set
X_test_scaled = scaler.transform(X_test)

In [None]:
lm = LinearRegression()
lm.fit(X_train_scaled, y_train)
predictions = lm.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
score = lm.score(X_test_scaled, y_test)
print('The r2 value is : {:.4f}'.format(score), '\nThe RMSE value is {:.4f}'.format(rmse))
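The scale-then-fit steps above can also be bundled into a single estimator, so that the scaler is only ever fitted on training rows. This is a sketch on synthetic data, not the crime dataset; `X_demo`, `y_demo`, and the coefficients used to generate them are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for the crime data
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 50, size=(200, 4))
y_demo = X_demo @ np.array([0.5, -0.2, 0.0, 0.1]) + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# fit() learns the min/max from the training rows only;
# predict() and score() reuse those values on the test rows
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
print('test r2: {:.4f}'.format(pipe.score(X_te, y_te)))
```

A pipeline like this can be dropped into cross-validation utilities without leaking test-set statistics into the scaler.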

### Ridge Regression

$$RSS(w, b) = \sum_{i = 1} ^ N (y_i - (wx_i + b))^2 + \alpha \sum_{j = 1}^p w_j^2 $$

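Ridge regression adds an L2 penalty, which shrinks the feature coefficients toward zero; many end up small, but they are rarely exactly zero.  A larger $\alpha$ means a larger penalty, $\alpha = 0$ recovers ordinary `LinearRegression`, and the default in sklearn's implementation is 1.0.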
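To illustrate the effect of $\alpha$, the sketch below fits Ridge models with increasing penalties on synthetic data and reports the size of the coefficient vector. The data and the list of `alpha` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(1)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([4.0, -3.0, 2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)

# unpenalized baseline for comparison
ols_norm = np.linalg.norm(LinearRegression().fit(X_demo, y_demo).coef_)

ridge_norms = []
for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    ridge_norms.append(np.linalg.norm(model.coef_))
    print('alpha = {:>7}: ||w|| = {:.3f}, nonzero = {}'.format(
        alpha, ridge_norms[-1], np.sum(model.coef_ != 0)))
```

The norm of the coefficients shrinks as `alpha` grows, while the count of nonzero coefficients typically stays at the full number of features.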

In [None]:
from sklearn.linear_model import Ridge

In [None]:
ridge_reg = Ridge(alpha = 1, solver = "cholesky")

In [None]:
ridge_reg.fit(X_train_scaled, y_train)

In [None]:
rpred = ridge_reg.predict(X_test_scaled)

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, rpred))
score = ridge_reg.score(X_test_scaled, y_test)
print('The r2 value is : {:.4f}'.format(score), '\nThe RMSE value is {:.4f}'.format(rmse))

In [None]:
np.sum(ridge_reg.coef_ != 0)

In [None]:
crime.shape

In [None]:
ridge_reg = Ridge(alpha = 20, solver = "cholesky")
ridge_reg.fit(X_train_scaled, y_train)
rpred = ridge_reg.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, rpred))
score = ridge_reg.score(X_test_scaled, y_test)
print('The r2 value is : {:.4f}'.format(score), '\nThe RMSE value is {:.4f}'.format(rmse))

### Lasso Regression

$$RSS(w, b) = \sum_{i = 1} ^ N (y_i - (wx_i + b))^2 + \alpha \sum_{j = 1}^p |w_j| $$

The L1 penalty has the effect of setting the coefficients of low-influence variables to exactly zero, so Lasso also acts as a form of feature selection.  Compared to Ridge, we would use Lasso when we expect only a few variables to have substantial effects.
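The sketch below shows this sparsity on synthetic data in which only two of ten features influence the target. The data and the `alpha` values are illustrative assumptions. Note that sklearn's `Lasso` divides the squared-error term by $2N$, so its `alpha` is on a different scale from the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(2)
X_demo = rng.normal(size=(200, 10))
# only the first two features actually influence the target
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

nonzero_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print('alpha = {:>5}: {} nonzero coefficients'.format(alpha, nonzero_counts[alpha]))
```

Once `alpha` is large enough, every coefficient is driven to zero and the model predicts only the mean of the training targets.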

In [None]:
from sklearn.linear_model import Lasso

In [None]:
lasso_reg = Lasso(alpha = 2.0)

In [None]:
lasso_reg.fit(X_train_scaled, y_train)

In [None]:
lpred = lasso_reg.predict(X_test_scaled)

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, lpred))
score = lasso_reg.score(X_test_scaled, y_test)
print('The r2 value is : {:.4f}'.format(score), '\nThe RMSE value is {:.4f}'.format(rmse))

In [None]:
np.sum(lasso_reg.coef_ != 0)

In [None]:
# list the surviving features, largest absolute coefficient first
for name, coef in sorted(zip(X.columns, lasso_reg.coef_),
                         key = lambda pair: -abs(pair[1])):
    if coef != 0:
        print('\t{}, {:.3f}'.format(name, coef))

### Elastic Net

$$RSS(w, b) = \sum_{i = 1} ^ N (y_i - (wx_i + b))^2 + r\alpha\sum_{j = 1}^p |w_j| + \frac{1-r}{2} \alpha \sum_{j = 1}^p w_j^2 $$
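In sklearn's `ElasticNet`, the mixing parameter $r$ corresponds to the `l1_ratio` argument. As a sanity check, the sketch below confirms on synthetic data that `l1_ratio=1.0` reproduces the Lasso fit, and then fits a mixed penalty. The data and the value of `alpha` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.RandomState(3)
X_demo = rng.normal(size=(150, 6))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.2, size=150)

alpha = 0.1
lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)
enet_l1 = ElasticNet(alpha=alpha, l1_ratio=1.0).fit(X_demo, y_demo)    # pure L1 penalty
enet_mix = ElasticNet(alpha=alpha, l1_ratio=0.4).fit(X_demo, y_demo)   # 40% L1, 60% L2

print(np.allclose(lasso.coef_, enet_l1.coef_))
print(np.round(enet_mix.coef_, 3))
```

Values of `l1_ratio` between 0 and 1 trade off Lasso-style sparsity against Ridge-style shrinkage.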



In [None]:
from sklearn.linear_model import ElasticNet

In [None]:
elastic_reg = ElasticNet(alpha = .05, l1_ratio=0.4)
elastic_reg.fit(X_train_scaled, y_train)
epred = elastic_reg.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, epred))
rmse

In [None]:
ridge_score = ridge_reg.score(X_test_scaled, y_test)
lasso_score = lasso_reg.score(X_test_scaled, y_test)
elastic_score = elastic_reg.score(X_test_scaled, y_test)

In [None]:
print("Ridge: {:.4f}".format(ridge_score), "\nLasso: {:.4f}".format(lasso_score),
      "\nElastic Net: {:.4f}".format(elastic_score))

### PROBLEM

Return to your Ames Data.  We have covered a lot of ground today, so let's summarize the things we could do to improve the performance of our original model that compared the Above Ground Living Area to the Logarithm of the Sale Price.
<div class="alert alert-info" role="alert">
1. Clean data, drop missing values
2. Transform data, code variables using either ordinal values or OneHotEncoder methods
3. Create more features from existing features
4. Split our data into testing and training sets
5. Normalize quantitative features
6. Use Regularized Regression methods and Polynomial regression to improve performance of model
</div>
Can you use some or all of these ideas to improve upon your initial model?
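As a starting point, here is one possible way to chain several of the steps above into a single sklearn pipeline. This is a sketch on synthetic data: the column names (`living_area`, `year_built`, `quality`) and the way the target is generated are hypothetical and will not match the real Ames columns. Step 1 is skipped because the synthetic data has no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# synthetic stand-in for Ames; column names are hypothetical
rng = np.random.RandomState(0)
n = 300
toy = pd.DataFrame({
    'living_area': rng.uniform(600, 3000, size=n),
    'year_built': rng.randint(1950, 2010, size=n),
    'quality': rng.choice(['low', 'mid', 'high'], size=n),
})
bonus = toy['quality'].map({'low': 0.0, 'mid': 0.2, 'high': 0.5})
log_price = (11 + 0.0004 * toy['living_area'] + 0.003 * (toy['year_built'] - 1950)
             + bonus + rng.normal(scale=0.05, size=n))

# step 4: hold out a test set
X_tr, X_te, y_tr, y_te = train_test_split(toy, log_price, random_state=0)

numeric = ['living_area', 'year_built']
preprocess = ColumnTransformer([
    # steps 3 and 5: scale, then add squared and interaction terms
    ('num', make_pipeline(StandardScaler(),
                          PolynomialFeatures(degree=2, include_bias=False)), numeric),
    # step 2: one-hot encode the categorical column
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['quality']),
])

# step 6: regularized regression on the transformed features
model = make_pipeline(preprocess, Ridge(alpha=1.0))
model.fit(X_tr, y_tr)
print('test r2: {:.4f}'.format(model.score(X_te, y_te)))
```

On the real data you would swap in your own column lists and try different values of `alpha` or a different regularized estimator.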

### Additional Resources

The last two lessons have pulled heavily from these resources, all of which I strongly recommend:

- scikit-learn documentation on supervised learning: http://scikit-learn.org/stable/supervised_learning.html#supervised-learning

- Aurélien Géron, *Hands-On Machine Learning with Scikit-Learn and TensorFlow*

- James et al., *An Introduction to Statistical Learning: With Applications in R*

- Philipp K. Janert, *Data Analysis with Open Source Tools*

- University of Michigan Coursera class on Machine Learning with scikit-learn: https://www.coursera.org/learn/python-machine-learning

- Stanford University course on Machine Learning: https://www.coursera.org/learn/machine-learning