# 03 Lasso Regression

![viz](https://static.vecteezy.com/system/resources/previews/000/184/369/original/flat-data-visualization-vector.jpg)

# y = intercept + slope(s) * Variable(s)

main_performance = 1 + 2.1 * Coding_Exercise + 0 * Zodiac_Sign

## 1 Regularization

What is regularization? It is a way of adding a penalty to (linear) models that have become too complex, shrinking their coefficients so that the models generalize better to new data. Another way to think about regularization is as a penalty applied to regressions whose coefficients have become too large and, thus, overly sensitive to small changes in the inputs. These penalties, most notably L1 and L2, shrink the size of the coefficients and/or remove them completely. Here are two of the most important regularization methods.

1. Ridge - Also called the L2 penalty, it is a regularization method (and an extension of linear regression) that forces the model coefficients to stay as small as possible without reaching exactly zero. The parameter we tune is called lambda ($\lambda$), and a common starting value is 1.
2. Least Absolute Shrinkage and Selection Operator (LASSO) - Lasso is another type of regularized linear regression, using the L1 penalty, where the coefficients of the variables that don't contribute much to the model are shrunk all the way to zero, so those variables are effectively removed/ignored by the model. The parameter we tune is again lambda ($\lambda$), with a common starting value of 1. In Python's scikit-learn package this parameter is called `alpha`, and its default value is 1.0.
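To make the difference between the two penalties concrete, here is a minimal sketch on synthetic data. The data, the `_demo` variable names, and the chosen `alpha` values are assumptions made for illustration only; the target is generated from the equation at the top of this notebook, with one useful feature and one irrelevant one.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_rows = 200

# Two hypothetical features: one drives the target, the other is pure noise
coding_exercise = rng.normal(size=n_rows)
zodiac_sign = rng.normal(size=n_rows)
X_demo = np.column_stack([coding_exercise, zodiac_sign])

# Target built as 1 + 2.1 * Coding_Exercise + 0 * Zodiac_Sign, plus a little noise
y_demo = 1 + 2.1 * coding_exercise + 0 * zodiac_sign + rng.normal(scale=0.1, size=n_rows)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)  # L2 penalty
lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)  # L1 penalty

print("Ridge coefficients:", ridge_demo.coef_)  # both are shrunk, neither is exactly zero
print("Lasso coefficients:", lasso_demo.coef_)  # the irrelevant feature is set to exactly zero
```

Ridge keeps a small non-zero weight on the irrelevant feature, while lasso drops it entirely, which is why lasso doubles as a feature selection tool.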


Some Important Terms:
- Overfitting: happens when your model fits the training data too well or, more specifically, when it memorizes the training data and thus fails to generalize to new data. Another term used for overfitting is high variance. It is a symptom of having too complex a model, e.g. a model with too many variables and not that many observations.
- Underfitting: happens when your model is too simple and fails to capture the relationship between the target variable and the features. Another term used for underfitting is high bias.
- Bias-variance trade-off: finding a good balance between the two above, i.e. a model complex enough to capture the signal but not so complex that it fits the noise.
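As a rough illustration of overfitting, the sketch below fits a plain linear regression to synthetic data that has more variables than training rows. Everything here (the data, the sizes, the `_demo` names, and `alpha=0.3`) is an assumption chosen for the example, not something taken from the dataset used later.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 60 rows, 40 columns, but only the first two columns matter
X_demo = rng.normal(size=(60, 40))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=1.0, size=60)

# 30 training rows for 40 columns: more variables than observations
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, train_size=30, random_state=0
)

ols_demo = LinearRegression().fit(X_tr_demo, y_tr_demo)  # no penalty
lasso_demo = Lasso(alpha=0.3).fit(X_tr_demo, y_tr_demo)  # L1 penalty

for name, model in [("Linear regression", ols_demo), ("Lasso", lasso_demo)]:
    train_r2 = model.score(X_tr_demo, y_tr_demo)
    test_r2 = model.score(X_te_demo, y_te_demo)
    print("{}: train R2 = {:.2f}, test R2 = {:.2f}".format(name, train_r2, test_r2))
```

The unpenalized model can reproduce the training targets almost perfectly because it has more coefficients than rows to fit, so the gap between its training and test scores is the symptom to look for.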

## 2 Analysis

In [None]:
import pandas as pd # data manipulation and analysis package
import numpy as np # numerical computing package
import matplotlib.pyplot as plt # data visualisation package
pd.options.display.max_columns = None # global setting for our session that allows us to see all the columns

# machine learning functions we will need
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

Import the data using pandas.

In [None]:
df = pd.read_csv('data/Hyp_attrition_bigdata.csv')
df.head() # show the first 5 rows, add a number to the parentheses to see more rows

Let's look at the shape of our dataset.

In [None]:
df.shape

Let's now select the columns that we will use for the models and take the id and target variables out.

In [None]:
train_vars = df.drop(['new_id', 'quitnow'], axis='columns').columns
print(train_vars.shape)
print(train_vars[0:5])

We now split our dataset into a training and a testing set. The training set will help us build and fine-tune the algorithm while the testing set will help us evaluate how the model performs with unseen data. We will use sklearn's `train_test_split` function and pass 4 arguments to it:
- `array1` - our predictors or independent variables, whose rows we want to split. This will return 2 datasets
- `array2` - our target or dependent variable, whose rows we want to split. This will return 2 arrays
- `train_size` - the size of the training dataset. A number between 0 and 1 is treated as the proportion of rows that go to the training set. An integer is treated as the exact number of rows that go to the training set
- `random_state` - a seed to make sure our results are reproducible

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df, df['quitnow'], train_size=102, random_state=7)
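The two ways of specifying `train_size` can be checked on a tiny example. This is a minimal sketch with a made-up 10-row dataframe (`demo_df` and its columns are hypothetical).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A made-up dataset with 10 rows
demo_df = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})

# A float is read as a proportion: 80% of the 10 rows go to the training set
X_tr_frac, X_te_frac, y_tr_frac, y_te_frac = train_test_split(
    demo_df[["feature"]], demo_df["target"], train_size=0.8, random_state=7
)
print(X_tr_frac.shape, X_te_frac.shape)  # (8, 1) (2, 1)

# An integer is read as an exact number of rows for the training set
X_tr_int, X_te_int, y_tr_int, y_te_int = train_test_split(
    demo_df[["feature"]], demo_df["target"], train_size=3, random_state=7
)
print(X_tr_int.shape, X_te_int.shape)  # (3, 1) (7, 1)
```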

Inspect the resulting datasets.

In [None]:
X_train[train_vars].head()

In [None]:
y_test.shape

In [None]:
y_train.head()

Modeling without scaling. This means that we will be evaluating the model with the data as is.

In [None]:
lasso = Lasso() # first we instantiate (i.e. create and initialize) a variable with our model
lasso.fit(X_train[train_vars], y_train) # we then fit the model to the training data

In [None]:
print("Training set score: {:.2f}".format(lasso.score(X_train[train_vars], y_train))) # let's now evaluate the R2 of our model with training data only
print("Test set score: {:.2f}".format(lasso.score(X_test[train_vars], y_test))) # let's now evaluate the R2 of our model with testing data only
print("Number of features used: {}".format(np.sum(lasso.coef_ != 0))) # let's count the columns whose coefficients are not 0, i.e. the features the model kept

Let's now try a different model with a smaller penalty and allow more iterations of the solver. We will use the following parameters
- `alpha=` scikit-learn's name for lambda, the main parameter of a lasso regression and, thus, what controls how strongly we penalize the coefficients
- `random_state=` a seed for reproducibility purposes (in scikit-learn's `Lasso` it only has an effect when `selection='random'`)
- `max_iter=` the maximum number of iterations the solver will run in search of convergence, i.e. the best fit for the line

In [None]:
lasso_2 = Lasso(alpha=0.1, random_state=7, max_iter=10000) # first we instantiate (i.e. create and initialize) a variable with our model
lasso_2.fit(X_train[train_vars], y_train) # we then fit the model to the training data

In [None]:
print("Training set score: {:.2f}".format(lasso_2.score(X_train[train_vars], y_train))) # let's now evaluate the R2 of our model with training data only
print("Test set score: {:.2f}".format(lasso_2.score(X_test[train_vars], y_test))) # let's now evaluate the R2 of our model with testing data only
print("Number of features used: {}".format(np.sum(lasso_2.coef_ != 0))) # let's count the columns whose coefficients are not 0, i.e. the features the model kept

In [None]:
lasso_3 = Lasso(alpha=0.01, random_state=7, max_iter=1000000).fit(X_train[train_vars], y_train)

In [None]:
print("Training set score: {:.2f}".format(lasso_3.score(X_train[train_vars], y_train))) # let's now evaluate the R2 of our model with training data only
print("Test set score: {:.2f}".format(lasso_3.score(X_test[train_vars], y_test))) # let's now evaluate the R2 of our model with testing data only
print("Number of features used: {}".format(np.sum(lasso_3.coef_ != 0))) # let's count the columns whose coefficients are not 0, i.e. the features the model kept
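The three fits above change `alpha` by hand. The same pattern can be sketched as a loop on synthetic data to see how the penalty strength drives the number of features kept. The data, the `_demo` names, and the list of `alpha` values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)

# 10 columns, of which only the first two influence the target
X_demo = rng.normal(size=(100, 10))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

nonzero_counts = {}
for alpha in [10.0, 1.0, 0.1, 0.001]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))  # features the model kept
    print("alpha={:<6} features used: {}".format(alpha, nonzero_counts[alpha]))
```

A very large penalty zeroes out every coefficient, while a very small one lets most of them back in, so the useful values usually sit somewhere in between.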

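Now let's scale our data and see what happens. Scaling means squeezing the values of our variables into a pre-specified range, such as 0 to 1, or standardizing them to a mean of 0 and a standard deviation of 1, etc. This matters for lasso because the penalty is applied to the size of the coefficients, which depends on the units of each variable. In this case, we only want to fit the scaler on our training dataset and never on our testing dataset; the fitted scaler is then used to transform both.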

In [None]:
scaler = StandardScaler() # instantiate the scaler
scaler.fit(X_train[train_vars]) # fit on the training data only

In [None]:
scaler.transform(X_train[train_vars]) # look at the results
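To see what `StandardScaler` does with the training and testing data, here is a minimal sketch on two tiny made-up arrays (`train_demo` and `test_demo` are hypothetical). The scaler learns the mean and standard deviation from the training rows only and reuses them for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales
train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_demo = np.array([[4.0, 400.0]])

scaler_demo = StandardScaler().fit(train_demo)  # statistics come from the training rows only
train_scaled_demo = scaler_demo.transform(train_demo)
test_scaled_demo = scaler_demo.transform(test_demo)  # reuses the training mean and std

print(scaler_demo.mean_)               # [  2. 200.]
print(train_scaled_demo.mean(axis=0))  # approximately 0 for each column
print(train_scaled_demo.std(axis=0))   # approximately 1 for each column
print(test_scaled_demo)                # test values expressed in training-set standard deviations
```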

We now create scaled versions of the training and testing datasets to continue the modeling stage.

In [None]:
X_train_scaled = pd.concat([X_train[['new_id', 'quitnow']].reset_index(drop=True), # take the first two columns out
                            pd.DataFrame(scaler.transform(X_train[train_vars]), columns=train_vars)], # scale the rest of the variables and create a dataframe
                            axis=1) # do it all by the columns and not the rows

X_test_scaled = pd.concat([X_test[['new_id', 'quitnow']].reset_index(drop=True), # take the first two columns out
                           pd.DataFrame(scaler.transform(X_test[train_vars]), columns=train_vars)], # scale the rest of the variables and create a dataframe
                           axis=1) # do it all by the columns and not the rows

In [None]:
X_train_scaled.head() # look at the scaled variables

Repeat the modeling from above but with the new scaled datasets.

In [None]:
lasso_4 = Lasso(random_state=7).fit(X_train_scaled[train_vars], y_train)

In [None]:
print("Training set score: {:.2f}".format(lasso_4.score(X_train_scaled[train_vars], y_train))) # let's now evaluate the R2 of our model with the scaled training data only
print("Test set score: {:.2f}".format(lasso_4.score(X_test_scaled[train_vars], y_test))) # let's now evaluate the R2 of our model with the scaled testing data only
print("Number of features used: {}".format(np.sum(lasso_4.coef_ != 0))) # let's count the columns whose coefficients are not 0, i.e. the features the model kept

In [None]:
lasso_5 = Lasso(alpha=0.1, random_state=7, max_iter=100000).fit(X_train_scaled[train_vars], y_train)

In [None]:
print("Training set score: {:.2f}".format(lasso_5.score(X_train_scaled[train_vars], y_train))) # let's now evaluate the R2 of our model with the scaled training data only
print("Test set score: {:.2f}".format(lasso_5.score(X_test_scaled[train_vars], y_test))) # let's now evaluate the R2 of our model with the scaled testing data only
print("Number of features used: {}".format(np.sum(lasso_5.coef_ != 0))) # let's count the columns whose coefficients are not 0, i.e. the features the model kept

In [None]:
lasso_6 = Lasso(alpha=0.01, random_state=7, max_iter=100000).fit(X_train_scaled[train_vars], y_train)

In [None]:
print("Training set score: {:.2f}".format(lasso_6.score(X_train_scaled[train_vars], y_train))) # let's now evaluate the R2 of our model with the scaled training data only
print("Test set score: {:.2f}".format(lasso_6.score(X_test_scaled[train_vars], y_test))) # let's now evaluate the R2 of our model with the scaled testing data only
print("Number of features used: {}".format(np.sum(lasso_6.coef_ != 0))) # let's count the columns whose coefficients are not 0, i.e. the features the model kept

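Cross-validation splits the training data into several folds (here 10), fits the model on all folds but one, and scores it on the fold that was left out, repeating the process until every fold has been used for validation once. This gives us a more reliable estimate of how the model performs on unseen data than a single split. Let's check it out with the training dataset.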

In [None]:
scores = cross_val_score(Lasso(alpha=0.1), X_train_scaled[train_vars], y_train, cv=10, n_jobs=-1)
scores

In [None]:
print('Mean R2 Score %.3f and Standard Deviation (%.3f)' % (np.mean(scores), np.std(scores))) # for a regressor, each score is the R2 on one held-out fold
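What `cross_val_score` does can be sketched by writing the loop by hand. This example uses synthetic data and 5 folds, all chosen for illustration, and assumes the default behaviour for regressors where an integer `cv` means unshuffled K-fold splits.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(50, 5))
y_demo = 2 * X_demo[:, 0] + rng.normal(scale=0.5, size=50)

manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    # fit on four folds, score (R2) on the fold that was left out
    fold_model = Lasso(alpha=0.1).fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

helper_scores = cross_val_score(Lasso(alpha=0.1), X_demo, y_demo, cv=5)

print(np.round(manual_scores, 3))  # one R2 per fold, computed by hand
print(np.round(helper_scores, 3))  # the same values from the helper function
```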

We will create a dictionary with the values of the parameter lambda (`alpha`) that we want to search over.

In [None]:
grid = dict()
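grid['alpha'] = np.arange(0.01, 1, 0.01) # start at 0.01 because alpha=0 is plain linear regression, which scikit-learn advises against fitting with Lasso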

In [None]:
search = GridSearchCV(Lasso(max_iter=100000), grid, cv=10, n_jobs=-1) # we will run 10-fold cross-validation for every value of alpha

In [None]:
results = search.fit(X_train_scaled[train_vars], y_train) # fit the search on the training datasets
print('Score: %.3f' % results.best_score_) # best mean cross-validated R2
print('Config: %s' % results.best_params_) # the alpha that produced that score

Let's now select the best features/columns/variables with our new lambda and then create one last model.

In [None]:
selection = SelectFromModel(Lasso(alpha=0.03, max_iter=100000, random_state=7))
selection.fit(X_train_scaled[train_vars], y_train)

In [None]:
selection.get_support() # shows the variables we need as Trues and Falses

In [None]:
selected_cols = X_train_scaled[train_vars].columns[selection.get_support()] # select only the variables we need
selected_cols
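The selection step can be sketched on a small synthetic dataframe where we know in advance which columns matter. The column names, the data, and `alpha=0.1` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)

# Hypothetical columns: two that drive the target and two that are pure noise
X_demo = pd.DataFrame(
    rng.normal(size=(200, 4)),
    columns=["signal_a", "signal_b", "noise_a", "noise_b"],
)
y_demo = 3 * X_demo["signal_a"] - 2 * X_demo["signal_b"] + rng.normal(scale=0.1, size=200)

selector_demo = SelectFromModel(Lasso(alpha=0.1)).fit(X_demo, y_demo)

mask_demo = selector_demo.get_support()  # boolean array, True for the columns that were kept
kept_demo = X_demo.columns[mask_demo]    # use the mask to pick the column names

print(mask_demo)
print(list(kept_demo))
```

`get_support()` returns one boolean per column in the order the columns were passed in, which is why it can be used directly to index `X_demo.columns`.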

Review the variables we chose.

In [None]:
print("Total Variables: {}".format(len(train_vars))) # count only the predictors, not the id and target columns
print("Variables Selected: {}".format(len(selected_cols)))
print("Variables whose coefficient was shrunk to zero: {}".format(np.sum(selection.estimator_.coef_ == 0)))

Repeat the modeling exercise with the selected variables only.

In [None]:
lasso_7 = Lasso(alpha=0.03, max_iter=100000, random_state=7).fit(X_train_scaled[selected_cols], y_train)

print("Training Set Score: {}".format(lasso_7.score(X_train_scaled[selected_cols], y_train)))
print("Test Set Score: {}".format(lasso_7.score(X_test_scaled[selected_cols], y_test)))

In [None]:
errors = y_test - lasso_7.predict(X_test_scaled[selected_cols])
errors.hist(bins=15)

## Most Important Variables

Let's now visually inspect the most important variables of our model.

In [None]:
importance = pd.Series(np.abs(lasso_7.coef_.ravel()))
importance.index = selected_cols
importance.sort_values(inplace=True, ascending=False)
importance.plot.bar(figsize=(18, 6))
plt.ylabel("Absolute Lasso Coefficients")
plt.title("Feature Importance")

## Finally - All of the Above in One Go

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df, df['quitnow'], train_size=102, random_state=7)

In [None]:
scaler = StandardScaler().fit(X_train[train_vars])
X_train_scaled = pd.concat([X_train[['new_id', 'quitnow']].reset_index(drop=True),
                            pd.DataFrame(scaler.transform(X_train[train_vars]), columns=train_vars)], axis=1)

X_test_scaled = pd.concat([X_test[['new_id', 'quitnow']].reset_index(drop=True),
                           pd.DataFrame(scaler.transform(X_test[train_vars]), columns=train_vars)], axis=1)

In [None]:
reg = LassoCV(cv=10, max_iter=100000, random_state=7, n_jobs=-1).fit(X_train_scaled[train_vars], y_train)

In [None]:
print("Training Set Score: {}".format(reg.score(X_train_scaled[train_vars], y_train)))
print("Test Set Score: {}".format(reg.score(X_test_scaled[train_vars], y_test)))
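`LassoCV` runs the search over `alpha` and the final fit in one step, and the fitted object exposes what it chose. Here is a minimal sketch on synthetic data (the data and the `_demo` names are assumptions) showing the attributes worth inspecting afterwards.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(100, 6))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

reg_demo = LassoCV(cv=5, random_state=7).fit(X_demo, y_demo)

print("Chosen alpha:", reg_demo.alpha_)                    # the penalty picked by cross-validation
print("Candidate alphas tried:", len(reg_demo.alphas_))    # the grid LassoCV built on its own
print("Features kept:", int(np.sum(reg_demo.coef_ != 0)))  # non-zero coefficients in the final model
```

The same attributes (`alpha_`, `alphas_`, `coef_`) are available on the `reg` object fitted above.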