# Regularization

Let's improve our understanding of what impacted **Titanic** passenger's chance of survival
- We will use logistic classifiers which are easy to interpret
- Remember we already did it with statsmodels in lecture "Decision Science - Logistic Regression"
- We were using `p-values` & statistical assumptions to detect which features were irrelevant / don't generalize
- This time, we will use `regularization` to detect relevant/irrelevant features based on under/overfitting criteria
- **Our goal is to compare `L1` and `L2` penalties**

## 1. We load and preprocess the data for you

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_titanic_dataset_encoded.csv")

# the dataset is already one-hot-encoded
data.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,sex_female,class_First,class_Third,who_child,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,0,3,22.0,1,0,7.25,0,0,1,0,0,0,1
1,1,1,38.0,1,0,71.2833,1,1,0,0,1,0,0
2,1,3,26.0,0,0,7.925,1,0,1,0,0,0,1
3,1,1,35.0,1,0,53.1,1,1,0,0,0,0,1
4,0,3,35.0,0,0,8.05,0,0,1,0,0,0,1


In [3]:
# We build X and y

y = data["survived"]
X = data.drop(columns="survived")
X.head()

Unnamed: 0,pclass,age,sibsp,parch,fare,sex_female,class_First,class_Third,who_child,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,3,22.0,1,0,7.25,0,0,1,0,0,0,1
1,1,38.0,1,0,71.2833,1,1,0,0,1,0,0
2,3,26.0,0,0,7.925,1,0,1,0,0,0,1
3,1,35.0,1,0,53.1,1,1,0,0,0,0,1
4,3,35.0,0,0,8.05,0,0,1,0,0,0,1


In [4]:
# We MinMaxScale our features for you
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(X)
X_scaled = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
X.shape

(714, 12)

In [5]:
# from sklearn.model_selection import train_test_split

# X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0)

## 1.  Logistic Regression without regularization

❓ Rank the feature by decreasing order of importance according to a simple **non-regularized** Logistic Regression

- Careful, `LogisticRegression` is penalized by default
- Increase `max_iter` to a larger number until model converges

In [6]:
from sklearn.linear_model import LogisticRegression
lg_model=LogisticRegression(penalty="none",max_iter=1000)
lg_model.fit(X_scaled,y)

LogisticRegression(max_iter=1000, penalty='none')

In [7]:
pd.Series(lg_model.coef_.tolist()[0],index=X_scaled.columns).sort_values(ascending=False)
#if female logg-odss of cahnce of survive 

sex_female                  2.671883
pclass                      2.547187
class_First                 2.360417
fare                        1.358812
who_child                   1.336356
parch                      -0.893820
age                        -2.196151
class_Third                -2.456891
sibsp                      -2.477131
embark_town_Cherbourg     -11.221671
embark_town_Southampton   -11.523126
embark_town_Queenstown    -11.918725
dtype: float64

In [8]:
# df_lg=pd.DataFrame(lg_model.coef_)

In [9]:
# df_lg.columns =X_scaled.columns
# df_lg.T.sort_values

❓How do you interpret, in plain english language, the value for the coefficient `sex_female` ?

<details>
    <summary>Answer</summary>

> "All other things being equal (such as age, ticket class etc...),
being a women increases your log-odds of survival by 2.67 (your coef value)"
    
> "Controling for all other explaining factors available in this dataset,
being a women increases your odds-ratio of survival by exp(2.67) = 14"

</details>


In [10]:
from sklearn.inspection import permutation_importance
lg_model=LogisticRegression()
lg_model.fit(X_scaled,y)
permutation_score=permutation_importance(lg_model,X_scaled,y,n_repeats=100)
importance_df =X_scaled.columns,permutation_score.importances_mean
importance_df 

(Index(['pclass', 'age', 'sibsp', 'parch', 'fare', 'sex_female', 'class_First',
        'class_Third', 'who_child', 'embark_town_Cherbourg',
        'embark_town_Queenstown', 'embark_town_Southampton'],
       dtype='object'),
 array([ 1.05322129e-02,  2.69747899e-02,  7.24089636e-03,  2.52100840e-03,
        -1.76470588e-03,  2.13977591e-01,  3.50140056e-03,  1.68907563e-02,
         1.25070028e-02, -1.40056022e-04,  1.17647059e-03, -9.80392157e-04]))

In [11]:
importance_df =np.vstack((X_scaled.columns,permutation_score.importances_mean)).T
importance_df 

array([['pclass', 0.010532212885154033],
       ['age', 0.026974789915966378],
       ['sibsp', 0.0072408963585433965],
       ['parch', 0.0025210084033613356],
       ['fare', -0.001764705882352935],
       ['sex_female', 0.2139775910364146],
       ['class_First', 0.003501400560224077],
       ['class_Third', 0.016890756302520994],
       ['who_child', 0.012507002801120413],
       ['embark_town_Cherbourg', -0.0001400560224089631],
       ['embark_town_Queenstown', 0.00117647058823529],
       ['embark_town_Southampton', -0.0009803921568627416]], dtype=object)

In [12]:
importance_df =pd.DataFrame(np.vstack((X_scaled.columns,permutation_score.importances_mean)).T)
importance_df 

Unnamed: 0,0,1
0,pclass,0.010532
1,age,0.026975
2,sibsp,0.007241
3,parch,0.002521
4,fare,-0.001765
5,sex_female,0.213978
6,class_First,0.003501
7,class_Third,0.016891
8,who_child,0.012507
9,embark_town_Cherbourg,-0.00014


In [13]:
importance_df.columns=["feature","feature importance"]

In [14]:
importance_df.sort_values(by="feature importance",ascending=False)

Unnamed: 0,feature,feature importance
5,sex_female,0.213978
1,age,0.026975
7,class_Third,0.016891
8,who_child,0.012507
0,pclass,0.010532
2,sibsp,0.007241
6,class_First,0.003501
3,parch,0.002521
10,embark_town_Queenstown,0.001176
9,embark_town_Cherbourg,-0.00014


In [15]:
importance_df.feature[5]

'sex_female'

❓ What is the feature that most impacts the chances of survival according to your model ? 

In [30]:
top_1_feature =['embark_town_Queenstown']

In [88]:
from nbresult import ChallengeResult
result = ChallengeResult('unregularized', top_1_feature = top_1_feature)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /Users/shu/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shu/Desktop/Lewagon/code/shiro101010101/data-challenges/05-ML/05-Model-Tuning/02-Regularization
plugins: dash-2.0.0, anyio-3.3.2
[1mcollecting ... [0mcollected 1 item

tests/test_unregularized.py::TestUnregularized::test_top_1 [32mPASSED[0m[32m        [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/unregularized.pickle

[32mgit[39m commit -m [33m'Completed unregularized step'[39m

[32mgit[39m push origin master


## 2.  Logistic Regression with a L2 penalty

Let's use a **Logistic model** whose log-loss has been penalized with a **L2** term to figure out the **most important features** without overfitting.  
This is the "classification" equivalent to the "Ridge" regressor

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its feature importance
- By "strongly regularized" we mean "more than sklearn's default applied regularization factor". 
- Default sklearn's values are very useful orders of magnitudes to keep in mind for "scaled features"

In [62]:
log_model_l2=LogisticRegression(penalty="l2",max_iter=1000,C=0.1)
log_model_l2.fit(X_scaled,y)
pd.Series(log_model_l2.coef_.tolist()[0],index=X_scaled.columns).sort_values(ascending=False)

sex_female                 1.808600
who_child                  0.602809
class_First                0.441486
embark_town_Cherbourg      0.252956
fare                       0.136790
parch                     -0.053906
embark_town_Queenstown    -0.132443
embark_town_Southampton   -0.154403
sibsp                     -0.340872
age                       -0.477720
pclass                    -0.539228
class_Third               -0.636932
dtype: float64

In [35]:

permutation_score_l2=permutation_importance(log_model_l2,X_scaled,y,n_repeats=100)
X_scaled.columns,permutation_score_l2.importances_mean

(Index(['pclass', 'age', 'sibsp', 'parch', 'fare', 'sex_female', 'class_First',
        'class_Third', 'who_child', 'embark_town_Cherbourg',
        'embark_town_Queenstown', 'embark_town_Southampton'],
       dtype='object'),
 array([ 0.0109944 ,  0.02612045,  0.00931373,  0.00292717, -0.00147059,
         0.21438375,  0.00355742,  0.01726891,  0.01313725,  0.00065826,
         0.00145658, -0.00077031]))

❓ What are the top 2 features driving chances of survival according to your model ?

In [19]:
pd.DataFrame(np.vstack((X_scaled.columns,permutation_score_l2.importances_mean)))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,pclass,age,sibsp,parch,fare,sex_female,class_First,class_Third,who_child,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
1,0.0107,0.027031,0.00888,0.002129,-0.001303,0.216667,0.003431,0.017437,0.012661,0.00028,0.001345,-0.000826


In [20]:
log_model_l2_importance=pd.DataFrame(np.vstack((X_scaled.columns,permutation_score.importances_mean))).T
log_model_l2_importance.columns=["feature","feature importanse"]


In [21]:
log_model_l2_importance.sort_values(by="feature importanse",ascending=False)

Unnamed: 0,feature,feature importanse
5,sex_female,0.213978
1,age,0.026975
7,class_Third,0.016891
8,who_child,0.012507
0,pclass,0.010532
2,sibsp,0.007241
6,class_First,0.003501
3,parch,0.002521
10,embark_town_Queenstown,0.001176
9,embark_town_Cherbourg,-0.00014


In [22]:
log_model_l2_importance.feature[1]

'age'

In [82]:
# Fill your top 2 features below
top_2_features = ["sex_female","class_Third"]

#### 🧪 Test your code below

In [87]:
from nbresult import ChallengeResult
result = ChallengeResult('ridge', top_2 = top_2_features)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /Users/shu/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shu/Desktop/Lewagon/code/shiro101010101/data-challenges/05-ML/05-Model-Tuning/02-Regularization
plugins: dash-2.0.0, anyio-3.3.2
[1mcollecting ... [0mcollected 1 item

tests/test_ridge.py::TestRidge::test_top2 [32mPASSED[0m[32m                         [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/ridge.pickle

[32mgit[39m commit -m [33m'Completed ridge step'[39m

[32mgit[39m push origin master


## 2. Logistic Regression with a L1 penalty

This time, we'll use a logistic model whose log-loss has been penalized with a **L1** term to **filter-out the less important features**.  
This is the "classification" equivalent to the **Lasso** regressor

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its feature importance

In [25]:
# from sklearn.linear_model import Lasso
# lasso_l1=Lasso(alpha=0.2).fit(X_scaled,y)
# coef=pd.DataFrame({"coef_ridge":pd.Series(lasso_l1.coef_,index=X_scaled.columns)})
# #why?

In [68]:
log_model_l1=LogisticRegression(penalty='l1',max_iter=1000,C=0.1,solver='liblinear')
log_model_l1.fit(X_scaled,y)
pd.Series(log_model_l1.coef_.tolist()[0],index=X_scaled.columns).sort_values(ascending=False)

sex_female                 2.000510
who_child                  0.255894
sibsp                      0.000000
parch                      0.000000
fare                       0.000000
class_First                0.000000
embark_town_Cherbourg      0.000000
embark_town_Queenstown     0.000000
age                       -0.062792
class_Third               -0.144571
embark_town_Southampton   -0.239346
pclass                    -1.442673
dtype: float64

In [69]:
coefs_with_p_value

NameError: name 'coefs_with_p_value' is not defined

❓ What are the features that have absolutely no impact on chances of survival, according to your L1 model?
- Do you notice how some of them were "highly important" according to the non-regularized model ? 
- From now on, we will always regularize our linear models!

In [84]:
zero_impact_features = ["sibsp","parch","fare","class_First","embark_town_Cherbourg","embark_town_Queenstown"]

#### 🧪 Test your code below

In [86]:
from nbresult import ChallengeResult
result = ChallengeResult('lasso', zero_impact_features = zero_impact_features)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /Users/shu/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shu/Desktop/Lewagon/code/shiro101010101/data-challenges/05-ML/05-Model-Tuning/02-Regularization
plugins: dash-2.0.0, anyio-3.3.2
[1mcollecting ... [0mcollected 1 item

tests/test_lasso.py::TestLasso::test_zero_impact [32mPASSED[0m[32m                  [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/lasso.pickle

[32mgit[39m commit -m [33m'Completed lasso step'[39m

[32mgit[39m push origin master


**🏁 Congratulation! Don't forget to commit and push your notebook**