# Regularization 
* Regularization addresses a few common model issues by: 
    * Minimizing model complexity 
    * Penalizing the loss function
    * Reducing model overfitting (adding some bias to reduce model variance)
* In simple words, regularization is a way to reduce model overfitting and variance.
    * Requires some additional bias. 
    * Requires a search for the optimal penalty hyperparameter. 
* There are three main types of regularization: 
    * L1 Regularization: LASSO Regression
    * L2 Regularization: Ridge Regression
    * Combining L1 and L2: Elastic Net

* L1 Regularization:
* L1 regularization adds a penalty equal to the sum of the absolute values of the coefficients.
    * Limits the size of the coefficients. 
    * Can yield sparse models where some coefficients become exactly zero.

* L2 regularization: 
* L2 regularization adds a penalty equal to the sum of the squared coefficients. 
* All coefficients are shrunk toward zero (by the same factor in the idealized case of orthonormal features). 
* Does not necessarily eliminate coefficients.
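
The contrast between the two penalties can be sketched for the special case of an orthonormal design matrix. This is an assumption made only to keep the formulas simple; with correlated features the solutions are not this clean. In that case ridge rescales every least-squares coefficient by the same factor, while lasso soft-thresholds them. The names `beta_ols`, `lam`, `ridge_shrink`, and `lasso_soft_threshold` are illustrative and not part of any library.

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    """Ridge solution when X^T X = I: proportional shrinkage.

    Minimizes 0.5 * ||y - X b||^2 + 0.5 * lam * ||b||^2.
    """
    return beta_ols / (1.0 + lam)

def lasso_soft_threshold(beta_ols, lam):
    """Lasso solution when X^T X = I: soft-thresholding.

    Minimizes 0.5 * ||y - X b||^2 + lam * ||b||_1.
    """
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

# Hypothetical least-squares coefficients and penalty strength.
beta_ols = np.array([3.0, -1.5, 0.4, -0.2])
lam = 0.5

beta_ridge = ridge_shrink(beta_ols, lam)
beta_lasso = lasso_soft_threshold(beta_ols, lam)

print("OLS:  ", beta_ols)
print("Ridge:", beta_ridge)   # every entry divided by 1.5, none exactly zero
print("Lasso:", beta_lasso)   # entries with |beta| <= 0.5 are set to zero
```

The two small coefficients survive under ridge (just smaller) but are removed entirely by lasso, which is the sparsity property described above.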

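* Elastic net combines L1 and L2 with the addition of a mixing parameter that decides the ratio between them. (In scikit-learn this ratio is called `l1_ratio`, while `alpha` is the overall penalty strength.) 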

* These regularization methods do have a cost: 
    * They introduce an additional hyperparameter that needs to be tuned. 
    * This hyperparameter is a multiplier on the penalty that decides the "strength" of the penalty. 
    

## Feature scaling
* Feature scaling improves the convergence of steepest descent algorithms, which do not possess the property of scale invariance. 
* If features are on different scales, certain weights may update faster than others since the feature values x play a role in the weight updates. 
* A critical benefit of feature scaling relates to gradient descent. (For some ML algorithms, such as CART-based methods, scaling has no effect.) 
* Scaling the features so that their respective ranges are uniform is important in comparing measurements that have different units. 
* Allows us to directly compare model coefficients to each other. 
* Feature scaling caveats: 
    * Must always scale new unseen data before feeding it to the model. 
    * Affects direct interpretability of feature coefficients. Coefficients become easier to compare to one another, but harder to relate back to the original unscaled features. 
* Feature scaling benefits: 
    * Can lead to great increases in performance. 
    * Absolutely necessary for some models.
    * Virtually no "real" downside to scaling features. 
* There are two main ways to scale features: 
    * Standardization: Rescales data to have a mean of 0 and standard deviation of 1.
    * Normalization: Rescales all data values to be between 0 and 1. 
* There are many more methods of scaling features, and scikit-learn provides easy-to-use classes that fit and transform feature data for scaling. 
    * A .fit() method call simply calculates the necessary statistics (X_min, X_max, mean, standard deviation).
    * A .transform() call actually scales data and returns the new scaled version of data.  
    * Very important consideration for fit and transform:
        * We only fit to training data.
        * Calculating statistical information should come from training data.
        * Don't want to assume prior knowledge of the test set!
        * Using the full data set would cause data leakage. (Calculating statistics from full data leads to some information of the test set leaking into the training process upon transform() conversion.)
* This is the workflow for the feature scaling process: 
    * Perform train test split
    * Fit to training feature data
    * Transform training feature data
    * Transform test feature data
* Keep in mind that we do not need to scale the label, nor is it advised. Normalizing the output distribution alters the definition of the target, so the model ends up predicting a distribution that doesn't mirror the real-world target. It can also negatively impact stochastic gradient descent. 
    * [Read a discussion on Stack Exchange](https://stats.stackexchange.com/questions/111467)
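
As a rough illustration of what fit and transform do, the two scaling methods can be written by hand with NumPy. This is a sketch on synthetic data (the feature values and the 70/30 split are made up), and it is not how scikit-learn implements its scalers internally. The key point is that the statistics come from the training rows only and are then reused for the test rows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (illustrative data only).
X = np.column_stack([
    rng.normal(50_000, 10_000, size=100),
    rng.uniform(0, 1, size=100),
])
X_train, X_test = X[:70], X[70:]

# "Fit": compute the statistics from the training rows only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)

# "Transform": apply the training statistics to both splits.
X_train_std = (X_train - mean) / std                 # standardization
X_test_std = (X_test - mean) / std
X_train_norm = (X_train - x_min) / (x_max - x_min)   # normalization
X_test_norm = (X_test - x_min) / (x_max - x_min)

print("train mean/std after standardization:",
      X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
print("train min/max after normalization:",
      X_train_norm.min(axis=0), X_train_norm.max(axis=0))
```

Because the test rows are scaled with training statistics, their scaled values may fall slightly outside the 0 to 1 range or have a mean that is not exactly 0. That is expected and is the price of avoiding data leakage.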

## Cross Validation: 
* Cross-validation is a more advanced set of methods for splitting data into training and testing sets. 
* We already know the reasoning behind performing a train-test split: to fairly evaluate our model's performance on unseen data. Unfortunately, this means we are not able to tune hyperparameters on the entire dataset. 
* What if we want to train on all the data and also evaluate on all the data? We can achieve this impossible-sounding task using cross-validation. 
* Cross-validation splits the data into k equal folds. In each of k rounds, one fold (1/k of the data) is used for testing and the remaining 1 - 1/k for training, with a different fold held out each time. At the end we take the average of the errors across the k rounds. 
* This allows us to train on all the data and evaluate on all the data. We get a better sense of true performance across multiple potential splits. 
* But all good things come with a price. 
    * This approach makes us train the model K times. This may not be an issue with a small dataset, but with a large dataset it can be a very resource-expensive process. 
* This method is known as K-fold cross-validation. A common choice for K is 10, so each test set is 10% of the total data. 
* One consideration with both K-fold cross-validation and a standard train-test split is tuning hyperparameters fairly. If we tune hyperparameters against test-set performance, are we ever getting fair performance metrics? 
* How can we understand how the model behaves on data that it has not seen and that has not influenced hyperparameter tuning? For this we can use a hold-out test set. 
    * This works like the usual train-test workflow, except that we first set aside a chunk of the data. On the remaining data we run our regular analysis using whichever method suits our needs, such as a 70/30 train/test split or 10-fold cross-validation. Only after tuning is finished do we evaluate the final model once on the chunk we set aside, and we report that result as the model's performance. 
        * This approach is also called a train-validation-test split. 
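
The K-fold procedure with a hold-out test set can be sketched as follows. This is a minimal example on synthetic data, not the notebook's advertising dataset; the helper `kfold_rmse`, the candidate alphas, and the 10% hold-out size are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Set aside a hold-out test set that is never used for tuning.
X_other, X_holdout, y_other, y_holdout = train_test_split(
    X, y, test_size=0.1, random_state=0)

def kfold_rmse(alpha, X, y, k=10):
    """Average RMSE of Ridge(alpha) over k folds (hypothetical helper)."""
    splitter = KFold(n_splits=k, shuffle=True, random_state=0)
    fold_errors = []
    for train_idx, val_idx in splitter.split(X):
        model = Ridge(alpha=alpha).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        fold_errors.append(np.sqrt(mean_squared_error(y[val_idx], pred)))
    return float(np.mean(fold_errors))

# Tune the hyperparameter with cross-validation on the remaining data.
candidate_alphas = [0.1, 1.0, 10.0, 100.0]
cv_scores = {a: kfold_rmse(a, X_other, y_other) for a in candidate_alphas}
best_alpha = min(cv_scores, key=cv_scores.get)

# Refit on all remaining data and report the hold-out score once.
final_model = Ridge(alpha=best_alpha).fit(X_other, y_other)
holdout_rmse = np.sqrt(
    mean_squared_error(y_holdout, final_model.predict(X_holdout)))

print("CV RMSE per alpha:", cv_scores)
print("best alpha:", best_alpha, "hold-out RMSE:", holdout_rmse)
```

The hold-out rows play no part in choosing `best_alpha`, so the final number is not influenced by the tuning step.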

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("data/Advertising.csv")
df

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9
...,...,...,...,...
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,9.7
197,177.0,9.3,6.4,12.8
198,283.6,42.0,66.2,25.5


In [3]:
X = df.drop('sales', axis=1)
X

Unnamed: 0,TV,radio,newspaper
0,230.1,37.8,69.2
1,44.5,39.3,45.1
2,17.2,45.9,69.3
3,151.5,41.3,58.5
4,180.8,10.8,58.4
...,...,...,...
195,38.2,3.7,13.8
196,94.2,4.9,8.1
197,177.0,9.3,6.4
198,283.6,42.0,66.2


In [4]:
y =  df.sales
y

0      22.1
1      10.4
2       9.3
3      18.5
4      12.9
       ... 
195     7.6
196     9.7
197    12.8
198    25.5
199    13.4
Name: sales, Length: 200, dtype: float64

In [5]:
from sklearn.preprocessing import PolynomialFeatures
polynomial_converter = PolynomialFeatures(degree=3, include_bias=False)

In [6]:
poly_features = polynomial_converter.fit_transform(X)
poly_features[0]

array([2.30100000e+02, 3.78000000e+01, 6.92000000e+01, 5.29460100e+04,
       8.69778000e+03, 1.59229200e+04, 1.42884000e+03, 2.61576000e+03,
       4.78864000e+03, 1.21828769e+07, 2.00135918e+06, 3.66386389e+06,
       3.28776084e+05, 6.01886376e+05, 1.10186606e+06, 5.40101520e+04,
       9.88757280e+04, 1.81010592e+05, 3.31373888e+05])

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

In [9]:
from sklearn.preprocessing import StandardScaler

In [10]:
scaler = StandardScaler()
scaler.fit(X_train)

In [11]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

## Ridge regression
* Ridge regression is a regularization technique that helps reduce the potential for overfitting to the training data.
* It does this by adding a penalty term to the error that is based on the squared value of the coefficients. 
* Ridge regression is a regularization method for Linear Regression. 
* In the linear regression model, we are basically trying to minimize the sum of the squared residuals. The goal of Ridge regression is to help prevent overfitting by adding an additional penalty term, also known as the shrinkage term.  
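
To see the penalty term at work, the ridge solution can be computed directly from its closed form and compared against scikit-learn. This sketch uses synthetic data and drops the intercept (`fit_intercept=False`) so that the simple formula applies; `ridge_objective` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)
alpha = 10.0

def ridge_objective(w, X, y, alpha):
    """Sum of squared residuals plus the L2 (shrinkage) penalty."""
    residual = y - X @ w
    return residual @ residual + alpha * (w @ w)

# Closed form: w = (X^T X + alpha * I)^(-1) X^T y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# The same model fitted by scikit-learn, without an intercept.
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

# Ordinary least squares for comparison (no penalty).
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("closed form:", w_closed)
print("sklearn:    ", w_sklearn)
print("OLS:        ", w_ols)
print("coefficient norms (ridge, OLS):",
      np.linalg.norm(w_closed), np.linalg.norm(w_ols))
```

The ridge coefficients have a smaller overall norm than the least-squares ones, which is the shrinkage the penalty is meant to produce.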

In [12]:
from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha=10)
ridge_model.fit(X_train, y_train)

In [13]:
ridge_test_prediction = ridge_model.predict(X_test)

In [14]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
MAE = mean_absolute_error(y_test, ridge_test_prediction)
MAE

0.5774404204714175

In [15]:
RMSE = np.sqrt(mean_squared_error(y_test, ridge_test_prediction))
RMSE

0.8946386461319681

In [18]:
from sklearn.linear_model import RidgeCV
ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0))

In [19]:
ridge_cv_model.fit(X_train, y_train)

In [20]:
ridge_cv_model.alpha_
# This is the best alpha value found by the ridge cross-validation.
# Next, we can explore the different scoring metrics that are available
# for measuring the performance of the model.

0.1

In [21]:
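# Note: SCORERS is only available in older scikit-learn releases; newer
# releases provide sklearn.metrics.get_scorer_names() instead.
from sklearn.metrics import SCORERS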

In [22]:
SCORERS.keys()

dict_keys(['explained_variance', 'r2', 'max_error', 'matthews_corrcoef', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'positive_likelihood_ratio', 'neg_negative_likelihood_ratio', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weig

In [23]:
ridge_cv_model_2 = RidgeCV(alphas=(0.1, 1.0, 10.0), scoring='neg_mean_absolute_error')
ridge_cv_model_2.fit(X_train, y_train)

In [24]:
ridge_cv_model_2.alpha_

0.1

In [25]:
ridgeCV_test_prediction = ridge_cv_model.predict(X_test)

In [26]:
MAE_ridge_CV = mean_absolute_error(y_test, ridgeCV_test_prediction)
RMSE_ridge_CV = np.sqrt(mean_squared_error(y_test, ridgeCV_test_prediction))

In [27]:
MAE_ridge_CV

0.42737748843373746

In [28]:
RMSE_ridge_CV

0.6180719926921404

In [29]:
ridge_cv_model.coef_

array([ 5.40769392,  0.5885865 ,  0.40390395, -6.18263924,  4.59607939,
       -1.18789654, -1.15200458,  0.57837796, -0.1261586 ,  2.5569777 ,
       -1.38900471,  0.86059434,  0.72219553, -0.26129256,  0.17870787,
        0.44353612, -0.21362436, -0.04622473, -0.06441449])

## Lasso regularization 
* LASSO - Least Absolute Shrinkage and Selection Operator 
* L1 regularization adds a penalty equal to the sum of the absolute values of the coefficients. 
    * It helps limit the size of the coefficients. 
    * Can yield sparse models where some coefficients become exactly zero. 
* LASSO can force some of the coefficient estimates to be exactly equal to zero when the tuning parameter lambda is sufficiently large. 
* Similar to subset selection, the LASSO performs variable selection. 
* Models generated from the LASSO are generally much easier to interpret. 
* LassoCV in sklearn checks a number of alphas within an automatically generated range (controlled by `eps` and `n_alphas`), instead of requiring the alphas to be provided directly. 
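
The variable-selection behaviour can be illustrated on synthetic data where only two of ten features actually drive the target. This is a sketch under made-up assumptions (the data, the coefficients 3.0 and -2.0, and `alpha=0.5` are all chosen for illustration), not a result from the advertising dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features influence y; the other eight are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("lasso coefficients:", lasso.coef_.round(3))
print("ridge coefficients:", ridge.coef_.round(3))
print("exact zeros (lasso, ridge):", n_zero_lasso, n_zero_ridge)
```

Ridge keeps small nonzero weights on the noise features, while lasso drops features whose contribution is too weak to justify the penalty.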

In [38]:
from sklearn.linear_model import LassoCV
lasso_cv_model = LassoCV(eps=0.1, n_alphas=100, cv=5)

In [39]:
lasso_cv_model.fit(X_train, y_train)

In [40]:
lasso_cv_model.alpha_

0.4943070909225828

In [41]:
lasso_cv_test_prediction = lasso_cv_model.predict(X_test)
MAE_lasso_CV = mean_absolute_error(y_test, lasso_cv_test_prediction)
RMSE_lasso_CV = np.sqrt(mean_squared_error(y_test, lasso_cv_test_prediction))

In [42]:
MAE_lasso_CV

0.6541723161252854

In [43]:
RMSE_lasso_CV

1.130800102276253

In [44]:
lasso_cv_model.coef_

array([1.002651  , 0.        , 0.        , 0.        , 3.79745279,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])

The benefit of the lasso model is that it uses far fewer features than the RidgeCV model (only two coefficients are nonzero here), which makes the model much easier to understand. In this run that sparsity costs accuracy: the errors are higher than with RidgeCV. We can improve the fit by searching a wider and finer grid of alphas (a smaller `eps` and more `n_alphas`) and allowing more iterations, but we need to consider the trade-off that this can be resource intensive on larger datasets. 
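
How `eps` shapes the search can be checked on a small synthetic problem. In scikit-learn, `eps` is the ratio between the smallest and largest alpha on the automatically generated grid, so a smaller `eps` lets the search reach weaker penalties. The data below is made up for illustration, and the number of alphas is left at its default.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=120)

# Same data, two different grid widths.
coarse = LassoCV(eps=0.1, cv=5).fit(X, y)
fine = LassoCV(eps=0.001, cv=5).fit(X, y)

for name, model in [("eps=0.1", coarse), ("eps=0.001", fine)]:
    alpha_min = model.alphas_.min()
    alpha_max = model.alphas_.max()
    print(name,
          "| grid:", alpha_min, "to", alpha_max,
          "| min/max ratio:", alpha_min / alpha_max,
          "| chosen alpha:", model.alpha_)
```

Both grids start from the same largest alpha; only the lower end moves. With `eps=0.1` the search cannot go below a tenth of that largest alpha, which can force a stronger penalty than the data warrants.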

In [45]:
lasso_cv_model_2 = LassoCV(eps=0.001, n_alphas=1000, cv=5, max_iter=100000)
lasso_cv_model_2.fit(X_train, y_train)
lasso_cv_test_prediction_2 = lasso_cv_model_2.predict(X_test)
MAE_lasso_CV_2 = mean_absolute_error(y_test, lasso_cv_test_prediction_2)
RMSE_lasso_CV_2 = np.sqrt(mean_squared_error(y_test, lasso_cv_test_prediction_2))

In [46]:
MAE_lasso_CV_2

0.43350346185900673

In [47]:
RMSE_lasso_CV_2

0.6063140748984039

In [48]:
"""As we can see, the model is now much more accurate, but we had to evaluate
many more alpha values than in our initial model. Here we can see how
many features it is using."""
lasso_cv_model_2.coef_

array([ 4.86023329,  0.12544598,  0.20746872, -4.99250395,  4.38026519,
       -0.22977201, -0.        ,  0.07267717, -0.        ,  1.77780246,
       -0.69614918, -0.        ,  0.12044132, -0.        , -0.        ,
       -0.        ,  0.        ,  0.        , -0.        ])

## Combining L1 and L2 regularization
* We've been able to perform ridge and lasso regression. We know lasso is able to shrink coefficients to zero, but we haven't seen how it does that or why.
* Let's consider elastic net, which combines lasso and ridge together. It will help us understand lasso as well. 
* Lasso was originally introduced in 1986 by Santosa and Symes. It was later independently rediscovered and popularized in 1996 by Robert Tibshirani, who coined the term "Lasso". 
* Let's start through a simple equation:
    y_hat = beta_1 * x_1 + beta_2 * x_2
* We know that regularization can be expressed as an additional constraint that the minimization of the residual sum of squares (RSS) is subject to. 
* Here, L1 constrains the sum of absolute values, and L2 constrains the sum of squared values. 
* So, lasso regression penalty would be: 
    abs(beta_1) + abs(beta_2) <= s
* Ridge regression penalty: 
    beta_1 ** 2 + beta_2 ** 2 <= s
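
The two penalties, and one way of blending them, can be written out in a few lines. The blend below follows the parametrization documented for scikit-learn's `ElasticNet` (`alpha` as overall strength, `l1_ratio` as the mix); the function names and the example coefficients are illustrative assumptions.

```python
import numpy as np

def l1_penalty(beta):
    """Lasso penalty: sum of absolute coefficient values."""
    return float(np.sum(np.abs(beta)))

def l2_penalty(beta):
    """Ridge penalty: sum of squared coefficient values."""
    return float(np.sum(beta ** 2))

def elastic_net_penalty(beta, alpha, l1_ratio):
    """Blend of L1 and L2 in scikit-learn's ElasticNet parametrization:
    alpha * l1_ratio * ||b||_1 + 0.5 * alpha * (1 - l1_ratio) * ||b||_2^2
    """
    return alpha * (l1_ratio * l1_penalty(beta)
                    + 0.5 * (1.0 - l1_ratio) * l2_penalty(beta))

beta = np.array([3.0, -4.0])  # hypothetical beta_1 and beta_2

print("L1 penalty:", l1_penalty(beta))                      # 7.0
print("L2 penalty:", l2_penalty(beta))                      # 25.0
print("pure lasso :", elastic_net_penalty(beta, 1.0, 1.0))  # 7.0
print("pure ridge :", elastic_net_penalty(beta, 1.0, 0.0))  # 12.5
print("even blend :", elastic_net_penalty(beta, 1.0, 0.5))  # 9.75
```

Setting `l1_ratio` to 1 recovers the lasso penalty exactly, and setting it to 0 leaves only the (halved) ridge penalty.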


In [49]:
from sklearn.linear_model import ElasticNetCV
elastic_model = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1],
                             eps=0.001, n_alphas=100, max_iter=1000000)

In [50]:
elastic_model.fit(X_train, y_train)

In [51]:
elastic_model.l1_ratio_

1.0

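This shows that the selected model completely disregards the ridge (L2) penalty and uses only the lasso (L1) penalty. We can confirm this by comparing the alpha values: the elastic net's alpha matches the alpha chosen by the second LassoCV model. 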

In [52]:
elastic_model.alpha_

0.004943070909225827

In [53]:
lasso_cv_model_2.alpha_

0.004943070909225827
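
The same equivalence can be checked directly on a small synthetic problem: with `l1_ratio=1.0`, an elastic net with a given alpha should produce the same coefficients as a lasso with that alpha. The data and the value `alpha = 0.1` below are made up for this sketch.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))
y = (X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0, 0.5])
     + rng.normal(scale=0.2, size=100))

alpha = 0.1
lasso = Lasso(alpha=alpha).fit(X, y)
enet = ElasticNet(alpha=alpha, l1_ratio=1.0).fit(X, y)

print("lasso:      ", lasso.coef_.round(4))
print("elastic net:", enet.coef_.round(4))
print("identical coefficients:", np.allclose(lasso.coef_, enet.coef_))
```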