# Regularization

Let's improve our understanding of what impacted **Titanic** passengers' chances of survival:
- We will use logistic classifiers, which are easy to interpret
- Remember that we already did this with statsmodels in the lecture "Decision Science - Logistic Regression"
- Back then, we used `p-values` and statistical assumptions to detect which features were irrelevant or did not generalize
- This time, we will use `regularization` to detect relevant/irrelevant features based on under/overfitting criteria
- **Our goal is to compare `L1` and `L2` penalties**

## 1. We load and preprocess the data for you

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_titanic_dataset_encoded.csv")

# the dataset is already one-hot-encoded
data.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,sex_female,class_First,class_Third,who_child,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,0,3,22.0,1,0,7.25,0,0,1,0,0,0,1
1,1,1,38.0,1,0,71.2833,1,1,0,0,1,0,0
2,1,3,26.0,0,0,7.925,1,0,1,0,0,0,1
3,1,1,35.0,1,0,53.1,1,1,0,0,0,0,1
4,0,3,35.0,0,0,8.05,0,0,1,0,0,0,1


In [3]:
# We build X and y

y = data["survived"]
X = data.drop(columns=["survived"])
X.head()

Unnamed: 0,pclass,age,sibsp,parch,fare,sex_female,class_First,class_Third,who_child,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,3,22.0,1,0,7.25,0,0,1,0,0,0,1
1,1,38.0,1,0,71.2833,1,1,0,0,1,0,0
2,3,26.0,0,0,7.925,1,0,1,0,0,0,1
3,1,35.0,1,0,53.1,1,1,0,0,0,0,1
4,3,35.0,0,0,8.05,0,0,1,0,0,0,1


In [4]:
# We MinMaxScale our features for you
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(X)
X_scaled = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
X.shape

(714, 12)

## 2. Logistic Regression without regularization

❓ Rank the features by decreasing order of importance after training a simple **non-regularized** Logistic Regression (i.e. look at the coefficients after fitting)
- Careful: `LogisticRegression` is penalized by default
  - Take a look at the [penalty parameter](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to find out how to remove the penalty
- Increase `max_iter` to a larger number until the model converges
- Use `tol=1e-9` to set the solver's stopping criterion: the solver stops when the gradient's largest component becomes smaller than this value. If you set `tol` to a higher value, you will see that the coefficients fluctuate a lot with the value of `tol`.

<details>
    <summary>Hint</summary>
    <img src="https://wagon-public-datasets.s3.amazonaws.com/data-science-images/05-ML/05-Model-Tuning/model_selection.png" alt="penalizing a regression" width="500">
</details>

In [6]:
from sklearn.linear_model import LogisticRegression

# 1. Instantiate the model with no regularization
log_reg = LogisticRegression(
    penalty=None,     # remove regularization
    max_iter=5000,    # ensure convergence
    tol=1e-9          # precise stopping criterion
)

# 2. Fit the model
log_reg.fit(X_scaled, y)

# 3. Extract coefficients into a Series for readability
coefs = pd.Series(log_reg.coef_[0], index=X_scaled.columns)

# 4. Sort by decreasing absolute importance
coefs_sorted = coefs.abs().sort_values(ascending=False)
coefs_sorted


embark_town_Queenstown     22.829278
embark_town_Southampton    22.433873
embark_town_Cherbourg      22.132503
pclass                      5.664538
class_Third                 4.015465
class_First                 3.919071
sex_female                  2.671879
sibsp                       2.476880
age                         2.196129
fare                        1.360188
who_child                   1.336356
parch                       0.894275
dtype: float64
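
Ranking on `coefs.abs()` discards the sign of each coefficient, and the sign is what tells you whether a feature raises or lowers the log-odds of survival. Here is a minimal sketch of one way to rank by magnitude while keeping the sign. It uses made-up toy coefficients with hypothetical feature names, not the values fitted above:

```python
import pandas as pd

# Toy coefficients for illustration only (NOT the fitted values from this notebook)
toy_coefs = pd.Series({"feature_a": 0.5, "feature_b": -2.0, "feature_c": 1.2})

# Order by absolute value, but keep the signed coefficients
order = toy_coefs.abs().sort_values(ascending=False).index
ranked = toy_coefs.reindex(order)
print(ranked)
```

Applied to the notebook's `coefs` Series, the same two lines would show the direction of each effect next to its rank.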

❓ How do you interpret, in plain English, the value for the coefficient `sex_female`?

<details>
    <summary>Answer</summary>

> "All other things being equal (such as age, ticket class, etc.),
being a woman increases your log-odds of survival by 2.67 (your coef value)"

> "Controlling for all other explanatory factors available in this dataset,
being a woman multiplies your odds of survival by exp(2.67) ≈ 14.4"

</details>


In [8]:
np.exp(2.67)  # ≈ 14.4: the odds ratio associated with sex_female
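
An odds ratio is not a probability ratio. As a rough sketch of the arithmetic, the snippet below applies the odds ratio exp(2.67) to a hypothetical baseline survival probability of 20%. That baseline is an assumed number chosen for illustration, not one taken from the dataset:

```python
import math

coef_sex_female = 2.67          # log-odds coefficient reported above
odds_ratio = math.exp(coef_sex_female)

baseline_p = 0.20               # hypothetical baseline probability (assumption)
baseline_odds = baseline_p / (1 - baseline_p)

new_odds = baseline_odds * odds_ratio      # odds are multiplied, not probabilities
new_p = new_odds / (1 + new_odds)          # convert odds back to a probability

print(f"odds ratio: {odds_ratio:.2f}")
print(f"probability: {baseline_p:.2f} -> {new_p:.2f}")
```

The same odds ratio produces a different change in probability for a different baseline, which is why the coefficient is best read on the odds scale.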


❓ What is the feature that most impacts the chances of survival according to your model?  
Fill the `top_1_feature` list below with the name of this feature

In [9]:
top_1_feature = ["embark_town_Queenstown"]

In [10]:
from nbresult import ChallengeResult
result = ChallengeResult('unregularized', top_1_feature=top_1_feature)
result.write()
print(result.check())


platform darwin -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /Users/simonhingant/.pyenv/versions/3.12.9/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /Users/simonhingant/code/simsam56/05-ML/05-Model-Tuning/data-regularization/tests
plugins: anyio-4.8.0, typeguard-4.4.2
collecting ... collected 1 item

test_unregularized.py::TestUnregularized::test_top_1 PASSED              [100%]



💯 You can commit your code:

git add tests/unregularized.pickle

git commit -m 'Completed unregularized step'

git push origin master



## 3.  Logistic Regression with an L2 penalty

Let's use a **Logistic model** whose log-loss has been penalized with an **L2** term to figure out the **most important features** without overfitting.  
This is the "classification" equivalent to the "Ridge" regressor.

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its features by importance (look at the coefficients)
- By "strongly regularized" we mean "more strongly regularized than Sklearn's default". In `LogisticRegression`, `C` is the inverse of the regularization strength (default `C=1.0`), so a smaller `C` means a stronger penalty.
- Sklearn's default values are very useful orders of magnitude to keep in mind for "scaled features"
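
To build intuition for the `C` parameter, here is a small sketch on synthetic data (generated with `make_classification`, so not the Titanic data). It fits L2-penalized models at a few values of `C` and prints the norm of the coefficient vector. The values of `C` and the dataset settings are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the Titanic dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

norms = {}
for C in [100, 1, 0.01]:
    # L2 is scikit-learn's default penalty; C is the INVERSE of its strength
    model = LogisticRegression(C=C, max_iter=5000).fit(X_toy, y_toy)
    norms[C] = np.linalg.norm(model.coef_)
    print(f"C={C:<6} ||coef|| = {norms[C]:.3f}")
```

The coefficient norm should shrink as `C` decreases, which is the behavior the "strongly regularized" model relies on.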

In [16]:
from sklearn.linear_model import LogisticRegression

# Strongly regularized model
log_reg_l2 = LogisticRegression(
    penalty='l2',
    C=0.001,        # strong regularization
    max_iter=5000,
    tol=1e-9
)

log_reg_l2.fit(X_scaled, y)

# Extract coefficients
coefs_l2 = pd.Series(log_reg_l2.coef_[0], index=X_scaled.columns)

# Sort by absolute importance
coefs_l2_sorted = coefs_l2.abs().sort_values(ascending=False)
coefs_l2_sorted


sex_female                 0.086500
class_Third                0.053644
pclass                     0.047718
class_First                0.041793
embark_town_Cherbourg      0.023665
embark_town_Southampton    0.021576
who_child                  0.015245
fare                       0.008876
age                        0.005416
parch                      0.004379
embark_town_Queenstown     0.003203
sibsp                      0.001203
dtype: float64

❓ What are the top 2 features driving chances of survival according to your model?  
Fill the `top_2_features` list below with the name of these features

In [17]:
top_2_features = ["sex_female", "class_Third"]

#### 🧪 Test your code below

In [18]:
from nbresult import ChallengeResult
result = ChallengeResult('ridge', top_2=top_2_features)
result.write()
print(result.check())


platform darwin -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /Users/simonhingant/.pyenv/versions/3.12.9/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /Users/simonhingant/code/simsam56/05-ML/05-Model-Tuning/data-regularization/tests
plugins: anyio-4.8.0, typeguard-4.4.2
collecting ... collected 1 item

test_ridge.py::TestRidge::test_top2 PASSED                               [100%]



💯 You can commit your code:

git add tests/ridge.pickle

git commit -m 'Completed ridge step'

git push origin master



## 4. Logistic Regression with an L1 penalty

This time, we'll use a logistic model whose log-loss has been penalized with an **L1** term to **filter out the less important features**.  
This is the "classification" equivalent to the **Lasso** regressor.

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its features by importance

In [23]:
from sklearn.linear_model import LogisticRegression

log_reg_l1 = LogisticRegression(
    penalty='l1',
    C=0.3,         # strong regularization
    solver='liblinear',  # the default lbfgs solver does not support L1
    max_iter=5000,
    tol=1e-9
)

log_reg_l1.fit(X_scaled, y)

coefs_l1 = pd.Series(log_reg_l1.coef_[0], index=X_scaled.columns)

coefs_l1_sorted = coefs_l1.abs().sort_values(ascending=False)
coefs_l1_sorted


sex_female                 2.351193
pclass                     1.662697
who_child                  0.829716
age                        0.797074
sibsp                      0.749168
embark_town_Cherbourg      0.308125
class_Third                0.242666
class_First                0.027832
parch                      0.000000
fare                       0.000000
embark_town_Queenstown     0.000000
embark_town_Southampton    0.000000
dtype: float64
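
Exact zeros like these are typical of an L1 penalty, whereas an L2 penalty shrinks coefficients without setting them to exactly zero. The sketch below illustrates the contrast on synthetic data (not the Titanic data). The dataset settings and the value `C=0.05` are assumptions picked for the illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 3 informative features, 7 pure-noise features
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)

# Same (assumed) value of C for both penalties
l1 = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(X_toy, y_toy)
l2 = LogisticRegression(penalty="l2", C=0.05, solver="liblinear").fit(X_toy, y_toy)

# Count coefficients that are exactly zero under each penalty
n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))
print("exact zeros with L1:", n_zero_l1)
print("exact zeros with L2:", n_zero_l2)
```

This is why L1 can be used as a feature filter, while L2 is better read as a ranking of shrunken coefficients.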

❓ What are the features that have absolutely no impact on chances of survival, according to your L1 model?  
Fill the `zero_impact_features` list below with the name of these features; you may have to add elements to the list.

- Do you notice how some of them were "highly important" according to the non-regularized model? 
- From now on, we will always regularize our linear models!

In [26]:
zero_impact_features = ["embark_town_Southampton", "embark_town_Queenstown", "fare", "parch"]

#### 🧪 Test your code below

In [27]:
from nbresult import ChallengeResult
result = ChallengeResult('lasso', zero_impact_features = zero_impact_features)
result.write()
print(result.check())


platform darwin -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /Users/simonhingant/.pyenv/versions/3.12.9/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /Users/simonhingant/code/simsam56/05-ML/05-Model-Tuning/data-regularization/tests
plugins: anyio-4.8.0, typeguard-4.4.2
collecting ... collected 1 item

test_lasso.py::TestLasso::test_zero_impact PASSED                        [100%]



💯 You can commit your code:

git add tests/lasso.pickle

git commit -m 'Completed lasso step'

git push origin master



## 5. Taking a step back

🤯 **Why were some of those coefficients so high in the first place?**

Let's think about the three features whose coefficients were by far the largest in the unregularized model, and which regularization shrank drastically:
- `embark_town_Cherbourg`
- `embark_town_Southampton`
- `embark_town_Queenstown`

The three embark towns are of course related: if you didn't embark in two of them, you must have embarked in the third one. So we know: 

$$\text{embark\_town\_Cherbourg} + \text{embark\_town\_Southampton} + \text{embark\_town\_Queenstown} = 1$$

These three features are **perfectly multicollinear**!

**When using unregularized models, this typically leads to numerical instability**, which is exactly what we saw here. It also means **we can't really trust the coefficients** we get in such a case.
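
One way to see why such coefficients cannot be trusted: because the three dummy columns always sum to 1, you can add any constant to all three town coefficients and subtract it from the intercept without changing a single prediction. Here is a small NumPy sketch with a hypothetical 3-category feature and made-up coefficients (none of these numbers come from the notebook):

```python
import numpy as np

# Toy one-hot encoding of a 3-category feature (hypothetical sample of 6 passengers)
towns = np.array([0, 1, 2, 0, 2, 1])          # category index per passenger
one_hot = np.eye(3)[towns]                    # shape (6, 3), each row sums to 1

# Design matrix = intercept column + the three dummy columns
design = np.column_stack([np.ones(len(towns)), one_hot])
rank = np.linalg.matrix_rank(design)
print("columns:", design.shape[1], "rank:", rank)   # rank < number of columns

# Two different coefficient vectors: [intercept, town_0, town_1, town_2]
beta_small = np.array([0.3, 0.1, -0.2, 0.4])
shift = 22.0                                  # arbitrary large constant
beta_large = beta_small + np.array([-shift, shift, shift, shift])

# Both give the same linear predictor (log-odds) for every passenger
same_predictions = np.allclose(design @ beta_small, design @ beta_large)
print("same predictions:", same_predictions)
```

Without a penalty, the log-loss cannot distinguish between these solutions, so the solver may land on any of them. A penalty on the size of the coefficients favors the small one. The pattern in section 2, where the three town coefficients have nearly the same large magnitude, is what such a shared offset would look like.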

❗️ These three multicollinear features come from one-hot encoding a categorical feature `embark_town`.

Thanks to regularization, we overcame this problem: it prevented the coefficients for the three towns from becoming very large. **This is why we'll almost always use regularization.**

🔍 **Remember that `tol` parameter we set in the beginning?**

An extra bonus of regularization is that setting `tol` became less important: you could set it to any value between `1e-2` and `1e-9` and the coefficients would hardly change! 💪

**🏁 Congratulations! Don't forget to commit and push your notebook**