# Regularization

Regularization is a technique for reducing overfitting. It adds a penalty term to the loss function, and that penalty grows with the size of the model's weights, so the optimizer has to trade training error against weight magnitude. Penalizing large weights yields a simpler model, which tends to generalize better to unseen data.
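To make the penalty term concrete, here is a minimal sketch of a penalized loss in NumPy. The function name `penalized_loss` and its arguments are illustrative assumptions, not part of any library, and scikit-learn scales its objectives slightly differently.

```python
import numpy as np

def penalized_loss(X, y, w, alpha=1.0, penalty="l2"):
    """Mean squared error plus a weight penalty scaled by alpha (illustrative helper)."""
    residuals = X @ w - y
    mse = np.mean(residuals ** 2)
    if penalty == "l1":
        reg = np.sum(np.abs(w))   # sum of absolute weights (Lasso-style)
    elif penalty == "l2":
        reg = np.sum(w ** 2)      # sum of squared weights (Ridge-style)
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return mse + alpha * reg

# Toy data where w fits perfectly, so the loss is the penalty alone.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
y_toy = np.array([3.0, -4.0])
w_toy = np.array([3.0, -4.0])

l1_loss = penalized_loss(X_toy, y_toy, w_toy, alpha=0.1, penalty="l1")  # 0.1 * (3 + 4)
l2_loss = penalized_loss(X_toy, y_toy, w_toy, alpha=0.1, penalty="l2")  # 0.1 * (9 + 16)
print(l1_loss, l2_loss)
```

Even with zero training error the loss is positive, which is what pushes the optimizer toward smaller weights.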

## L1 Lasso
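Reduces complexity by **eliminating** features that contribute little to the model. It adds the sum of the absolute values of the weights, scaled by a strength parameter (`alpha` in scikit-learn), to the loss function. This penalty can drive some weights to exactly zero, which removes those features from the model.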
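The feature-eliminating effect can be seen on synthetic data. This is a small sketch, assuming made-up data in which only the first two of ten features influence the target; the names `X_demo`, `y_demo` and `lasso_demo` are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
# Only features 0 and 1 drive the target; the other eight are pure noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.1)
lasso_demo.fit(X_demo, y_demo)

# Indices of the features whose weight is not exactly zero.
kept = np.flatnonzero(lasso_demo.coef_)
print("features kept:", kept)
print("coefficients:", lasso_demo.coef_.round(2))
```

The weights of the noise features come out as exact zeros rather than small numbers, which is why Lasso is often used for feature selection.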

## L2 Ridge
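Reduces complexity by **making the weights smaller**. It adds the sum of the squared weights, scaled by a strength parameter (`alpha` in scikit-learn), to the loss function. The weights shrink toward zero but generally do not reach it, so every feature stays in the model.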
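A short sketch of the shrinking effect, again on made-up data (the variable names are illustrative): as `alpha` grows, the overall size of the Ridge weight vector falls, yet the individual weights stay nonzero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
true_w = np.array([4.0, -3.0, 2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=50)

# Euclidean norm of the weights without any penalty.
ols_norm = np.linalg.norm(LinearRegression().fit(X_demo, y_demo).coef_)

# The same norm under increasingly strong L2 penalties.
ridge_norms = {
    a: np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_)
    for a in (1.0, 10.0, 100.0)
}
ridge_strong = Ridge(alpha=100.0).fit(X_demo, y_demo)

print("OLS norm:", ols_norm)
print("Ridge norms by alpha:", ridge_norms)
print("weights at alpha=100:", ridge_strong.coef_)
```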

## L1 vs L2
- L1 regularization produces sparse models, so it doubles as feature selection. Use Lasso when only a few of the features are directly related to the label.
- L2 regularization is more stable, particularly when features are correlated, because it shrinks all weights smoothly instead of dropping some of them. Use Ridge when many features are related to the label.

## Elastic Net
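A combination of L1 and L2 regularization: both penalties are added to the loss function. In scikit-learn, `alpha` sets the overall penalty strength and `l1_ratio` sets the share given to the L1 term.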
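The sketch below illustrates how the mix works in scikit-learn, using synthetic data and illustrative variable names. With `l1_ratio=1.0` the L2 part vanishes and `ElasticNet` should reproduce `Lasso`; lower ratios blend in the Ridge-style penalty.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Objective documented for scikit-learn's ElasticNet:
#   1 / (2 * n_samples) * ||y - Xw||^2_2
#   + alpha * l1_ratio * ||w||_1
#   + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 6))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

lasso_coef = Lasso(alpha=0.05).fit(X_demo, y_demo).coef_
enet_as_lasso = ElasticNet(alpha=0.05, l1_ratio=1.0).fit(X_demo, y_demo).coef_
enet_mixed = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X_demo, y_demo).coef_

# l1_ratio=1.0 leaves only the L1 term, so the two fits should agree.
same_as_lasso = np.allclose(lasso_coef, enet_as_lasso)
print("matches Lasso at l1_ratio=1:", same_as_lasso)
print("mixed penalty coefficients:", enet_mixed.round(3))
```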

In [5]:
import pandas as pd

from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [9]:
df = pd.read_csv('../data/whr2017.csv')
df.head()

|   | country | rank | score | high | low | gdp | family | lifexp | freedom | generosity | corruption | dystopia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Norway | 1 | 7.537 | 7.594445 | 7.479556 | 1.616463 | 1.533524 | 0.796667 | 0.635423 | 0.362012 | 0.315964 | 2.277027 |
| 1 | Denmark | 2 | 7.522 | 7.581728 | 7.462272 | 1.482383 | 1.551122 | 0.792566 | 0.626007 | 0.35528 | 0.40077 | 2.313707 |
| 2 | Iceland | 3 | 7.504 | 7.62203 | 7.38597 | 1.480633 | 1.610574 | 0.833552 | 0.627163 | 0.47554 | 0.153527 | 2.322715 |
| 3 | Switzerland | 4 | 7.494 | 7.561772 | 7.426227 | 1.56498 | 1.516912 | 0.858131 | 0.620071 | 0.290549 | 0.367007 | 2.276716 |
| 4 | Finland | 5 | 7.469 | 7.527542 | 7.410458 | 1.443572 | 1.540247 | 0.809158 | 0.617951 | 0.245483 | 0.382612 | 2.430182 |


In [16]:
# Drop the country name: it identifies the row and is not a numeric feature.
df.drop(['country'], axis=1, inplace=True)

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   rank        155 non-null    int64  
 1   score       155 non-null    float64
 2   high        155 non-null    float64
 3   low         155 non-null    float64
 4   gdp         155 non-null    float64
 5   family      155 non-null    float64
 6   lifexp      155 non-null    float64
 7   freedom     155 non-null    float64
 8   generosity  155 non-null    float64
 9   corruption  155 non-null    float64
 10  dystopia    155 non-null    float64
dtypes: float64(10), int64(1)
memory usage: 13.4 KB


In [17]:
df.describe()

|   | rank | score | high | low | gdp | family | lifexp | freedom | generosity | corruption | dystopia |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 155.0 | 155.0 | 155.0 | 155.0 | 155.0 | 155.0 | 155.0 | 155.0 | 155.0 | 155.0 | 155.0 |
| mean | 78.0 | 5.354019 | 5.452326 | 5.255713 | 0.984718 | 1.188898 | 0.551341 | 0.408786 | 0.246883 | 0.12312 | 1.850238 |
| std | 44.888751 | 1.13123 | 1.118542 | 1.14503 | 0.420793 | 0.287263 | 0.237073 | 0.149997 | 0.13478 | 0.101661 | 0.500028 |
| min | 1.0 | 2.693 | 2.864884 | 2.521116 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.377914 |
| 25% | 39.5 | 4.5055 | 4.608172 | 4.374955 | 0.663371 | 1.042635 | 0.369866 | 0.303677 | 0.154106 | 0.057271 | 1.591291 |
| 50% | 78.0 | 5.279 | 5.370032 | 5.193152 | 1.064578 | 1.253918 | 0.606042 | 0.437454 | 0.231538 | 0.089848 | 1.83291 |
| 75% | 116.5 | 6.1015 | 6.1946 | 6.006527 | 1.318027 | 1.414316 | 0.723008 | 0.516561 | 0.323762 | 0.153296 | 2.144654 |
| max | 155.0 | 7.537 | 7.62203 | 7.479556 | 1.870766 | 1.610574 | 0.949492 | 0.658249 | 0.838075 | 0.464308 | 3.117485 |


In [18]:
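# Features: the seven component columns. rank, high and low are left out
# because they track the score directly.
X = df[['gdp', 'family', 'lifexp', 'freedom', 'corruption', 'generosity', 'dystopia']]
y = df[['score']]  # target, kept 2-D as a one-column DataFrame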

In [19]:
X.shape, y.shape

((155, 7), (155, 1))

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [36]:
# Baseline: ordinary least squares with no penalty term.
default_lr = LinearRegression()
default_lr.fit(X_train, y_train)

mean_squared_error(y_test, default_lr.predict(X_test))

1.0299953056986218e-07
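An error this close to zero suggests the target is almost an exact linear function of the features. The rows printed by `df.head()` support that reading: for Norway the seven feature columns add up to about 7.5371, against a score of 7.537. The sketch below reproduces the situation on made-up data, where the target is defined as the exact sum of seven random features; it is an assumption-laden illustration, not the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.uniform(0.0, 2.0, size=(155, 7))
y_demo = X_demo.sum(axis=1)  # target is exactly the sum of the features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Plain least squares recovers the exact relationship.
ols_mse = mean_squared_error(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
# The L2 penalty pulls the weights away from their true value of 1.
ridge_mse = mean_squared_error(y_te, Ridge(alpha=1).fit(X_tr, y_tr).predict(X_te))

print("OLS MSE:", ols_mse)
print("Ridge MSE:", ridge_mse)
```

When there is no overfitting to correct, a penalty can only move the weights away from the exact solution, which is consistent with the larger errors reported by the Lasso, Ridge and Elastic Net cells that follow.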

In [34]:
# L1 penalty; alpha sets its strength.
lasso_lr = Lasso(alpha=0.02)
lasso_lr.fit(X_train, y_train)

mean_squared_error(y_test, lasso_lr.predict(X_test))

0.04851928542575024

In [33]:
# L2 penalty; alpha sets its strength.
ridge_lr = Ridge(alpha=1)
ridge_lr.fit(X_train, y_train)

mean_squared_error(y_test, ridge_lr.predict(X_test))

0.005127079602999997

In [37]:
# Elastic Net: l1_ratio=0.5 weights the L1 and L2 penalties equally.
elastic_net = ElasticNet(alpha=0.02, l1_ratio=0.5)
elastic_net.fit(X_train, y_train)

mean_squared_error(y_test, elastic_net.predict(X_test))

0.028931996009613138
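The `alpha` values above are fixed by hand. One common alternative is to choose `alpha` by cross-validation and to standardize the features first, since the penalty treats all weights on the same scale. The following is a minimal sketch on synthetic data; the candidate grid, the data and the variable names are assumptions for illustration, not settings from this notebook.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Seven features on deliberately different scales.
X_demo = rng.normal(size=(150, 7)) * np.array([1, 1, 1, 1, 10, 10, 100])
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=150)

# Hypothetical grid of penalty strengths to try.
candidate_alphas = np.logspace(-3, 1, 20)

# Scale the features, then let 5-fold cross-validation pick alpha.
model = make_pipeline(StandardScaler(), LassoCV(alphas=candidate_alphas, cv=5))
model.fit(X_demo, y_demo)

best_alpha = model.named_steps["lassocv"].alpha_
print("alpha chosen by cross-validation:", best_alpha)
```

`RidgeCV` and `ElasticNetCV` follow the same pattern for the other two penalties.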