
**Elastic net regularization: Regression on Gapminder Data**

Lasso used the L1 penalty to regularize, while ridge used the L2 penalty. There is another type of regularized regression known as the elastic net. In elastic net regularization, the penalty term is a linear combination of the L1 and L2 penalties:

$$a \cdot L1 + b \cdot L2$$

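In scikit-learn, the balance between the two penalties is set by the 'l1_ratio' parameter: an 'l1_ratio' of 1 corresponds to a pure L1 penalty, an 'l1_ratio' of 0 to a pure L2 penalty, and anything in between is a combination of L1 and L2. The overall strength of the penalty is set separately by 'alpha'.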
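To make the role of 'l1_ratio' concrete, the sketch below computes the penalty term as the scikit-learn ElasticNet documentation defines it: alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2. The helper name elastic_net_penalty and the example weight vector are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    """Penalty term of scikit-learn's ElasticNet objective (illustrative helper)."""
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))   # L1 norm of the coefficients
    l2 = np.sum(w ** 2)      # squared L2 norm of the coefficients
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2

# Hypothetical coefficient vector, chosen only for illustration
w = np.array([2.0, -1.0, 0.0])

print(elastic_net_penalty(w, l1_ratio=1.0))  # pure L1 penalty
print(elastic_net_penalty(w, l1_ratio=0.0))  # pure L2 penalty
print(elastic_net_penalty(w, l1_ratio=0.5))  # equal mix of both
```

With 'l1_ratio' at 1 only the absolute values of the coefficients are penalized, and with 'l1_ratio' at 0 only their squares are.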

The Gapminder dataset describes life expectancy as a function of factors such as fertility, GDP, region, and population.

The dataset is imported from Kaggle.

https://www.kaggle.com/

Install the Kaggle package to access the Gapminder dataset from Kaggle.

In [1]:
!pip install kaggle



Create a .kaggle directory in the home directory (/root on Colab) to hold the Kaggle authentication JSON.

In [0]:
!mkdir -p ~/.kaggle

Copy the uploaded kaggle.json into ~/.kaggle/kaggle.json, where the Kaggle client looks for it.

In [0]:
!cp /content/kaggle.json ~/.kaggle/kaggle.json

Protect the Kaggle JSON file, since it contains the API key.

chmod 600 (equivalent to chmod a+rwx,u-x,g-rwx,o-rwx) sets the permissions so that the (u)ser/owner can read and write but not execute, while the (g)roup and (o)thers cannot read, write, or execute.

In [0]:
!chmod 600 /root/.kaggle/kaggle.json
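The effect of mode 600 can also be checked from Python. The sketch below is an illustration on a throwaway file in a temporary directory, not on the real token; the path and variable names are assumptions made for the example, and it assumes a POSIX system such as Colab.

```python
import os
import stat
import tempfile

# Throwaway file standing in for kaggle.json (hypothetical path, not the real token)
with tempfile.TemporaryDirectory() as tmp_dir:
    token_path = os.path.join(tmp_dir, "kaggle.json")
    with open(token_path, "w") as f:
        f.write("{}")

    # Same permissions as `chmod 600`: owner read/write, nothing for group or others
    os.chmod(token_path, 0o600)

    # Extract only the permission bits from the file mode
    mode = stat.S_IMODE(os.stat(token_path).st_mode)
    owner_only = mode == (stat.S_IRUSR | stat.S_IWUSR)

print(oct(mode))
print(owner_only)
```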

Import the Gapminder dataset

In [5]:
!kaggle datasets download -d deepakdodi/gapminder

Downloading gapminder.zip to /content
  0% 0.00/5.43k [00:00<?, ?B/s]
100% 5.43k/5.43k [00:00<00:00, 10.0MB/s]


In [6]:
# Import numpy and pandas
import numpy as np
import pandas as pd

# Read the Gapminder file into a DataFrame: df
df = pd.read_csv('gapminder.zip', compression='zip', header=0, sep=',', quotechar='"')
print(df)

     population  fertility  ...  child_mortality                      Region
0    34811059.0       2.73  ...             29.5  Middle East & North Africa
1    19842251.0       6.43  ...            192.0          Sub-Saharan Africa
2    40381860.0       2.24  ...             15.4                     America
3     2975029.0       1.40  ...             20.0       Europe & Central Asia
4    21370348.0       1.96  ...              5.2         East Asia & Pacific
..          ...        ...  ...              ...                         ...
134   3350832.0       2.11  ...             13.0                     America
135  26952719.0       2.46  ...             49.2       Europe & Central Asia
136  86589342.0       1.86  ...             26.2         East Asia & Pacific
137  13114579.0       5.88  ...             94.9          Sub-Saharan Africa
138  13495462.0       3.85  ...             98.3          Sub-Saharan Africa

[139 rows x 10 columns]


Create array X_fertility for the 'fertility' feature and array y for the 'life' target variable.

In [0]:
# Create arrays for features and target variable
y = df['life'].values
X_fertility = df['fertility'].values

In [8]:
# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X_fertility.shape))

Dimensions of y before reshaping: (139,)
Dimensions of X before reshaping: (139,)
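The shapes printed above are one-dimensional. scikit-learn estimators expect the feature matrix to be two-dimensional, with shape (n_samples, n_features), so a single feature such as 'fertility' is typically reshaped before fitting, while the target may stay one-dimensional. A minimal sketch of that reshaping step, using a short stand-in array instead of the real DataFrame (the names X_demo and X_demo_2d are illustrative):

```python
import numpy as np

# Stand-in for df['fertility'].values: a 1-D array with one entry per sample
X_demo = np.array([2.73, 6.43, 2.24])

# -1 lets NumPy infer the number of rows; 1 is the number of feature columns
X_demo_2d = X_demo.reshape(-1, 1)

print("Dimensions of X before reshaping: {}".format(X_demo.shape))
print("Dimensions of X after reshaping: {}".format(X_demo_2d.shape))
```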


**Elastic Net Regression on Gapminder Dataset**

Lasso regression can select the features that matter most for predicting the target values by shrinking the coefficients of the other features to 0. This form of feature selection is very useful for data with thousands of features. Elastic net combines this behavior with the ridge penalty.

Here, we fit an elastic net regression to the Gapminder dataset, using every column except 'life' and 'Region' as features. The 'l1_ratio' hyperparameter is tuned with GridSearchCV and 5-fold cross-validation on the training set, and the tuned model is then evaluated on a hold-out test set.
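The shrink-to-zero behavior of the L1 penalty described above can be illustrated on synthetic data, where it is known in advance which features matter. This is a sketch under stated assumptions and not a step from the original notebook: the data, the 'alpha' value of 0.5, and the variable names are chosen only for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 6 features, but only the first two influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# A fairly strong L1 penalty (alpha chosen for illustration)
lasso = Lasso(alpha=0.5)
lasso.fit(X_demo, y_demo)

# Coefficients of the uninformative features are driven exactly to 0
for i, coef in enumerate(lasso.coef_):
    print("feature_{}: {:.3f}".format(i, coef))
```

The two informative features keep nonzero (though shrunken) coefficients, while the four noise features are dropped from the model.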

In [0]:
X = df.drop(['life', 'Region'], axis=1)

In [0]:
# Import necessary modules
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

In [0]:
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

In [0]:
# Create the hyperparameter grid
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}

In [0]:
# Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet()

In [0]:
# Setup the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)

In [15]:
# Fit it to the training data
gm_cv.fit(X_train, y_train)



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True,
                                  l1_ratio=0.5, max_iter=1000, normalize=False,
                                  positive=False, precompute=False,
                                  random_state=None, selection='cyclic',
                                  tol=0.0001, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'l1_ratio': array([0.        , 0.03448276, 0.06896552, 0.10344828, 0.13793103,
       0.17241379, 0.20689655, 0.24137931, 0.27586207, 0.31034483,
       0.34482759, 0.37931034, 0.4137931 , 0.44827586, 0.48275862,
       0.51724138, 0.55172414, 0.5862069 , 0.62068966, 0.65517241,
       0.68965517, 0.72413793, 0.75862069, 0.79310345, 0.82758621,
       0.86206897, 0.89655172, 0.93103448, 0.96551724, 1.        ])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, v
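The grid search above varies only 'l1_ratio' and leaves 'alpha' at its default of 1.0, as the estimator repr shows. One possible extension, which is not part of the original notebook, is to search over 'alpha' and 'l1_ratio' together and to standardize the features inside a Pipeline so that the penalty treats all columns on the same scale. The sketch below uses synthetic data from make_regression; the step names 'scaler' and 'enet', the grid values, and the raised 'max_iter' are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the Gapminder features
X_demo, y_demo = make_regression(n_samples=150, n_features=8, n_informative=4,
                                 noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4,
                                          random_state=42)

# Scale features, then fit the elastic net; the scaler is refit on each CV fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("enet", ElasticNet(max_iter=10000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
demo_grid = {
    "enet__alpha": [0.01, 0.1, 1.0],
    "enet__l1_ratio": [0.1, 0.5, 0.9],
}

demo_cv = GridSearchCV(pipeline, demo_grid, cv=5)
demo_cv.fit(X_tr, y_tr)

print(demo_cv.best_params_)
print(round(demo_cv.score(X_te, y_te), 2))
```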

In [18]:
# Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(round(r2,2)))
print("Tuned ElasticNet MSE: {}".format(round(mse,2)))

Tuned ElasticNet l1 ratio: {'l1_ratio': 0.20689655172413793}
Tuned ElasticNet R squared: 0.87
Tuned ElasticNet MSE: 10.06
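For reference, both reported metrics can be reproduced by hand. With scoring left at None, GridSearchCV's score method falls back to the regressor's own score, which is R squared. The sketch below computes MSE and R squared from their definitions on a small made-up pair of arrays (not Gapminder values) and compares them with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true_demo - y_pred_demo

# MSE: mean of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("Manual MSE: {}, sklearn MSE: {}".format(mse_manual, mean_squared_error(y_true_demo, y_pred_demo)))
print("Manual R squared: {}, sklearn R squared: {}".format(r2_manual, r2_score(y_true_demo, y_pred_demo)))
```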
