<img src="https://drive.google.com/uc?id=1-cL5eOpEsbuIEkvwW2KnpXC12-PAbamr" style="Width:1000px">

# Regularization

Remember the **Golden Plains Roadside Biodiversity** dataset you worked on the first day? We ended up dropping many features in our **linear regression** whilst maintaining a good $R^2$ score. We will use this dataset again here.

![kangaroo](https://s.yimg.com/uu/api/res/1.2/GANJCEs2SP0QamHePbZqUw--~B/aD0zNjE7dz03Njg7YXBwaWQ9eXRhY2h5b24-/http://media.zenfs.com/en_us/News/afp.com/b9a6c5065aab22b840d60a188e7767a7ce7c471c.jpg)

However, there are a few differences:
- We will use logistic classifiers here, which are easy to interpret
- We will model `RCACScore` as our target variable, converted to a binary class: `0` indicates a score <=12, `1` a score >12.
- The dataset is already cleaned, scaled, and one-hot-encoded for you 😌
- The goal is to use `regularization` to detect relevant/irrelevant features based on under/overfitting criteria
- **Our goal is to compare `L1` and `L2` penalties**

## Load the data

Load the data into a variable named `data`, and split it into an `X` feature matrix and a `y` target vector.

In [None]:
from nbta.utils import download_data
download_data(id='1cIO50NnXZg6F1Y9-aRKorKhXSS5Idnjc')

In [None]:
import pandas as pd
import numpy as np

In [None]:
data = pd.read_csv("raw_data/biodiversity-prepared.csv")

# the dataset is already one-hot-encoded
data.head()

In [None]:
# Let's build X and y

y = data["RCACScore"]
X = data.drop(columns="RCACScore")


## Logistic Regression without regularization

❓ Rank the features by decreasing order of importance according to a simple **non-regularized** Logistic Regression

- Careful, `LogisticRegression` is penalized by default
- Increase `max_iter` to a larger number until the model converges
- Remember that you can access the coefficients of the regression through the `.coef_` attribute of your trained model.
- *Hint*: it might help to put the coefficient of the model in a dataframe with column names from `X` to be able to interpret them. Also check the `transpose()` and `sort_values()` pandas functions


In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

❓ How do you interpret, in plain English, the value of the coefficient for `RCACRareSp`?

<details>
    <summary>Answer</summary>

> "All other things being equal (i.e. if the other variables are held constant),
a one-unit increase in the abundance of rare species (`RCACRareSp`) increases the log-odds of the site being classified as high scoring by 33.38 (your coefficient value may differ)"
    
> "Controlling for all other explanatory factors available in this dataset,
a one-unit increase in `RCACRareSp` multiplies the odds of a high score by exp(33.38) ≈ 3.14E14"


</details>
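To check the conversion from log-odds to odds yourself, here is a minimal sketch. It assumes the example coefficient of 33.38 quoted in the answer; your own fitted value may differ.

```python
import math

# Example coefficient from the answer above (an assumption: yours may differ)
coef = 33.38

# A one-unit increase in the feature adds `coef` to the log-odds,
# which multiplies the odds by exp(coef)
odds_multiplier = math.exp(coef)
print(f"{odds_multiplier:.2e}")
```

A multiplier this large is a sign that the unregularized model is extremely confident in this feature, which is one of the symptoms regularization is meant to tame.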


❓ What are the 5 features that most impact the chances of classifying a site as a high scoring site? Save your answer as an array under a variable named `base_most_important`.

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

❓ Now cross-validate a model with the same parameters as the model above, and save the mean score under a variable named `base_model_score`.

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

In [None]:
from nbresult import ChallengeResult
result = ChallengeResult('unregularized', 
                         top_features = base_most_important,
                            score=base_model_score)
result.write()
print(result.check())

## Logistic Regression with an L2 penalty

Let's use a **Logistic model** whose log-loss has been penalized with an **L2** term to figure out the **most important features** without overfitting.  
This is the "classification" equivalent of the "Ridge" regressor.

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its feature importance
- By "strongly regularized" we mean "more regularized than with sklearn's default setting".
- sklearn's default values are useful orders of magnitude to keep in mind for scaled features
- We suggest trying `C` at 10% of its default value in this case. In sklearn, `C` is the inverse of the regularization strength (default `C=1.0`), so a smaller `C` means a stronger penalty
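To see what a smaller `C` does, the sketch below compares the size of the coefficients at the default `C=1.0` and at `C=0.1`. It runs on hypothetical synthetic data, so the numbers say nothing about the biodiversity dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, flip_y=0.1, random_state=0,
)

# L2 is the default penalty of LogisticRegression, so only C changes here
default_l2 = LogisticRegression(C=1.0, max_iter=5000).fit(X_demo, y_demo)
strong_l2 = LogisticRegression(C=0.1, max_iter=5000).fit(X_demo, y_demo)

# A stronger L2 penalty shrinks the coefficient vector towards zero
norm_default = np.linalg.norm(default_l2.coef_)
norm_strong = np.linalg.norm(strong_l2.coef_)
print(norm_default, norm_strong)
```

Note that L2 shrinks coefficients but rarely sets any of them exactly to zero.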

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

❓ What are the top 5 features driving the chances of a high score according to your model? Save them as an array under the variable name `l2_most_important`. Are these the same features as for `base_most_important`?

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

❓ Now cross-validate a model with the same parameters as the model above, and save the mean score under a variable named `l2_model_score`. What can you say about the new score compared to the `base_model_score`?

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

#### 🧪 Test your code below

In [None]:
from nbresult import ChallengeResult
result = ChallengeResult('ridge', 
                         top_features = l2_most_important,
                        score=l2_model_score)
result.write()
print(result.check())

## Logistic Regression with an L1 penalty

This time, we'll use a logistic model whose log-loss has been penalized with an **L1** term to **filter out the less important features**.  
This is the "classification" equivalent of the **Lasso** regressor.

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its feature importance. We suggest that you use the same regularization value as for **L2** to be able to compare your results.

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

❓ What are the features that have absolutely no impact on the chances of a high score, according to your L1 model?
- Save them in an array variable named `zero_impact_features`
- Do you notice how some of them were "highly important" according to the non-regularized model?
- From now on, we will always regularize our linear models!
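Here is a minimal sketch of how an L1 penalty zeroes out coefficients, again on hypothetical synthetic data with made-up `feat_i` names. The `liblinear` solver is an assumption: the default solver does not support L1, while `liblinear` and `saga` do.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 3 informative features and 7 pure-noise features
X_arr, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, flip_y=0.1, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(10)])

# L1 penalty with the same C as suggested for L2
model_l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
model_l1.fit(X_demo, y_demo)

# Features whose coefficient was driven exactly to zero
coefs = pd.Series(model_l1.coef_[0], index=X_demo.columns)
zero_features = coefs[coefs == 0].index.to_numpy()
print(zero_features)
```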

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

❓ Now cross-validate a model with the same parameters as the model above, and save the mean score under a variable named `l1_model_score`. What can you say about the new score compared to the `base_model_score` and `l2_model_score`?

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

💡 Have you noticed how the `l1_model_score` is slightly higher than the `l2_model_score` while using far fewer features, and that the `l2_model_score` is itself higher than the `base_model_score`? This is why regularization is so important: by filtering out the unnecessary variables (i.e. setting their coefficients to zero), **L1** regularization has improved our classification score! Of course, this also comes down to the choice of the hyperparameter `C`, and it is possible to over-regularize.

#### 🧪 Test your code below

In [None]:
from nbresult import ChallengeResult
result = ChallengeResult('lasso', 
                         zero_impact_features = zero_impact_features,
                        score=l1_model_score)
result.write()
print(result.check())

# GridSearch the best hyperparameters

So ***how*** do we determine the best hyperparameters for our algorithm? We can use `GridSearchCV` for that! 

For instance, which of the L1 or L2 regularizations is best for our performance? Or maybe we are looking at a mix of L1 and L2, known as `elastic net`? We can find out! Do a `GridSearchCV` for a logistic regression model instantiated with the following arguments: `max_iter=5000`, `random_state=42`, `penalty='elasticnet'`, `solver='saga'`. `saga` is the only solver that will work with elastic net. Then, find the best `LogisticRegression` model by testing the following hyperparameters in the grid search:

1. `C = [1, 0.1, 0.01, 0.001]`
2. `class_weight = [None, 'balanced']`
3. `multi_class = ['multinomial', 'ovr']`
4. `l1_ratio = [0, 1, 0.9, 0.7, 0.5, 0.2]`

Try to understand these parameters by reading the documentation, and then fit your GridSearchCV on `X` and `y`. Save the best estimator in a variable called `best_estimator`, the best parameters (as a dictionary) in a variable called `best_params`, and the accuracy score in a variable called `best_score` (hint: all of these values can be obtained from your fitted grid search model). Then test your code!
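If you need a reminder of how `GridSearchCV` exposes its results, here is a minimal sketch on hypothetical synthetic data. The grid is deliberately smaller than the one in the exercise, and the `*_demo` names are made up for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, flip_y=0.1, random_state=0,
)

# Elastic net mixes L1 and L2; l1_ratio=0 is pure L2, l1_ratio=1 is pure L1
base_model = LogisticRegression(
    penalty="elasticnet", solver="saga", max_iter=5000, random_state=42
)

# Reduced grid, for illustration only
param_grid = {
    "C": [1, 0.1],
    "l1_ratio": [0, 0.5, 1],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(base_model, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_estimator_demo = search.best_estimator_  # refitted on all the data
best_params_demo = search.best_params_        # dict of the winning values
best_score_demo = search.best_score_          # mean cross-validated accuracy
print(best_params_demo, best_score_demo)
```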

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

#### 🧪 Test your code below

In [None]:
from nbresult import ChallengeResult
result = ChallengeResult('gridsearch', 
                        score = best_score,
                        params=best_params)
result.write()
print(result.check())

# 🏁 Finished!

Well done! <span style="color:teal">**Push your exercise to GitHub**</span>, and move on to the next one.