<img src="images/logo.png" align='right' width=250px>

# Custom Model in Scikit-Learn

Scikit-Learn can be extended with functionality that is not already natively included in the library. 

This notebook covers an example of how to customize an existing Model - the `RandomForestClassifier`.

By the end of this notebook you will be able to:

- [Explain the benefits of using a custom Model](#benefits)
- [Overwrite the score method on an existing Model](#custom)
- [Implement the custom Model in a scikit-learn Pipeline](#pipeline)

<a id=benefits></a>

## Benefits

A wide range of models is available in scikit-learn, with built-in methods such as `.predict()` and `.score()`, but the list of models is finite and the methods work in specific ways that can be limiting.

It is possible to extend models by creating custom `Model` classes that inherit from the `BaseEstimator`, or from a specific model. This can provide many benefits:

- Design a solution that specifically fits the requirements of your problem
- Integrate with scikit-learn Pipelines to streamline routine processes in a machine learning workflow
- Gain a deeper understanding of how machine learning models work at a fundamental level
- Experiment with novel or niche algorithms to implement and test your ideas
- Reuse your custom Model across different projects or share it with others
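
As a rough illustration of the first route (inheriting from `BaseEstimator`), the sketch below builds a toy classifier that always predicts the most frequent training class. The class name `MajorityClassClassifier` and the tiny dataset are made up for this example; `ClassifierMixin` is included on the assumption that the default accuracy-based `.score()` is wanted.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityClassClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy estimator: always predicts the most frequent training class."""

    def fit(self, X, y):
        # Learn which class label occurs most often in the training target
        classes, counts = np.unique(y, return_counts=True)
        self.classes_ = classes
        self.majority_ = classes[np.argmax(counts)]
        return self

    def predict(self, X):
        # Ignore the features and return the majority class for every row
        return np.full(len(X), self.majority_)


# Made-up data: three samples of class 0 and one of class 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 0, 1])

toy_model = MajorityClassClassifier().fit(X_demo, y_demo)
toy_predictions = toy_model.predict(X_demo)

# .score() is inherited from ClassifierMixin and returns accuracy
toy_accuracy = toy_model.score(X_demo, y_demo)
print(toy_predictions, toy_accuracy)
```

Only `fit` and `predict` are written by hand here; `get_params`, `set_params` and `score` come from the parent classes, which is what makes the inheritance route attractive.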

<a id=custom></a>

## Overwrite the score method on an existing Model

Consider an existing model that uses information about patients to predict whether they will have a stroke.

Doctors can use these predictions to identify at-risk patients and intervene where necessary.

In [None]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (StandardScaler, 
                                   PolynomialFeatures, 
                                   OneHotEncoder, 
                                   OrdinalEncoder, 
                                   KBinsDiscretizer)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

First the data is read in, cleaned, and split into features `X` and target `y`, and then into train and test sets.

In [None]:
# read in the stroke data
stroke = pd.read_csv('data/stroke.csv').rename(columns=str.lower)

# Columns to treat
drop_cols = ['id', 'address']
target = 'stroke'

def create_Xy(df, drop_cols, target_col):
    df = df.drop(columns=drop_cols)
    return (
        df.drop(columns=target_col),
        df[target_col]
    )
    
# New feature matrix
X, y = (
    stroke
    .pipe(create_Xy, 
          drop_cols=drop_cols, 
          target_col=target,
          )
)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.25,
                                                    random_state = 123,
                                                    stratify = y,
                                                    )

Now a `Pipeline` with a `RandomForestClassifier` is built and fitted.

In [None]:
categorical_cols = ['work_type', 'smoking_status', 'who', 'gender', 'residence_type']
numeric_cols = ['age', 'hypertension', 'heart_disease', 'ever_married', 'avg_glucose_level', 'bmi']
missing_cols = ['age','bmi']

onehot = Pipeline(steps = [
    ('onehot', OneHotEncoder(drop = "if_binary", sparse_output=False)),
])

impute = Pipeline(steps = [
    ('impute', SimpleImputer(strategy ='mean')),
])

preprocessor = ColumnTransformer(transformers = [
    ('onehot', onehot, categorical_cols),
    ('impute', impute, numeric_cols)
], remainder = 'passthrough')

base_model = RandomForestClassifier(class_weight='balanced',
                                    max_depth=5,
                                    random_state=123,
                                    )

base_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', base_model)
])

base_pipeline.fit(X_train, y_train)

Every classifier and regressor in scikit-learn has a `score` method, and calling it on a `Pipeline` uses the `score` method of the final step.

In [None]:
base_pipeline.score(X_test, y_test)

<mark>Question</mark>

What metric does the score method return (by default)?

<details>
  <summary><span style="color:blue">Show answer</span></summary>
    
This is the accuracy score (the mean accuracy on the given data and labels).
This can be seen in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.score) and in the [source code](https://github.com/scikit-learn/scikit-learn/blob/9e38cd00d/sklearn/base.py#L738).

In this example the following code would give the same result:
```python
from sklearn.metrics import accuracy_score

accuracy_score(y_test, base_pipeline.predict(X_test))
```

</details>

However, because the target variable is imbalanced, accuracy is a poor guide to performance here, and the area under the ROC curve (AUC) is more informative. Calculating it requires the following code:

In [None]:
from sklearn.metrics import roc_auc_score

# Find the probabilities of stroke for AUC evaluation
y_train_probs = base_pipeline.predict_proba(X_train)[:,1]
y_test_probs = base_pipeline.predict_proba(X_test)[:,1]

print(f'AUC train: {roc_auc_score(y_train, y_train_probs)}')
print(f'AUC test: {roc_auc_score(y_test, y_test_probs)}')
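
To see why accuracy can mislead on an imbalanced target, the sketch below uses a made-up dataset (95 negatives, 5 positives, not the stroke data) and scikit-learn's `DummyClassifier`, which here simply predicts the majority class for every row. The variable names are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic, made-up data: 95 negatives and 5 positives, one uninformative feature
X_toy = np.zeros((100, 1))
y_toy = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the most frequent class
dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

dummy_accuracy = accuracy_score(y_toy, dummy.predict(X_toy))
dummy_auc = roc_auc_score(y_toy, dummy.predict_proba(X_toy)[:, 1])

print(f"Accuracy: {dummy_accuracy:.2f}")  # Accuracy: 0.95
print(f"AUC: {dummy_auc:.2f}")            # AUC: 0.50
```

The accuracy looks strong even though the classifier never identifies a positive case, while the AUC of 0.5 shows that it ranks cases no better than chance.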

## Overwriting the score method

Building a model takes many iterations, each of which may change its performance, so it makes sense for the `score` method to return the metric that matters for the project.

Inheritance (parent/child classes in OOP) is the key concept here: the rest of the model's functionality should remain the same, and the only method that needs to change is `score()`. The custom class can therefore inherit from the desired model and define just that one method.

In [None]:
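class RandomForestClassifierAUC(RandomForestClassifier):
    """RandomForestClassifier whose score method returns ROC AUC instead of accuracy."""

    def score(self, X, y, sample_weight=None):
        # roc_auc_score was imported at the top of the notebook.
        # Use the predicted probability of the positive class (second column).
        positive_probs = self.predict_proba(X)[:, 1]
        return roc_auc_score(y, positive_probs, sample_weight=sample_weight)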

The `RandomForestClassifierAUC` now behaves exactly like the original `RandomForestClassifier`, except that its `score` method returns the AUC instead of the accuracy.
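
One way to check the relationship between the two classes is to fit both on the same data with the same settings and compare them. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` rather than the stroke data, repeats the class definition so that it runs on its own, and picks arbitrary hyperparameters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score


class RandomForestClassifierAUC(RandomForestClassifier):
    # Repeated here so that this block runs on its own
    def score(self, X, y, sample_weight=None):
        positive_probs = self.predict_proba(X)[:, 1]
        return roc_auc_score(y, positive_probs, sample_weight=sample_weight)


# Synthetic, imbalanced data (roughly 90% negatives); not the stroke dataset
X_syn, y_syn = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# Arbitrary settings, shared by both models so that they are comparable
params = dict(n_estimators=25, max_depth=5, random_state=123)
parent = RandomForestClassifier(**params).fit(X_syn, y_syn)
child = RandomForestClassifierAUC(**params).fit(X_syn, y_syn)

# Fitting and prediction are inherited unchanged, so the probabilities match
same_probabilities = np.array_equal(parent.predict_proba(X_syn), child.predict_proba(X_syn))

# Only the meaning of .score() differs: accuracy for the parent, AUC for the child
parent_score = parent.score(X_syn, y_syn)
child_score = child.score(X_syn, y_syn)

print(f"Identical probabilities: {same_probabilities}")
print(f"Parent score (accuracy): {parent_score:.3f}")
print(f"Child score (AUC): {child_score:.3f}")
```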

<a id=pipeline></a>

### <mark>Exercise: Use the custom Model</mark>

Replace `...` with code to instantiate the new `RandomForestClassifierAUC` and call its new `score` method.

In [None]:
# instantiate the custom Model here (the pipeline below expects it to be called base_model)
...

# leave this code the same
base_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', base_model)
])

base_pipeline.fit(X_train, y_train)

# calculate the score and compare to the roc_auc_score metric - has it worked?
...

**Answer**: Uncomment and run the code below to reveal a solution.

In [None]:
# %load answers/overwrite_score_method.py
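
A further consequence of changing `score` is that tools which fall back to the estimator's own `score` method pick up the new metric automatically. The sketch below assumes synthetic data from `make_classification` and arbitrary hyperparameters, and repeats the class definition so that it runs on its own. It compares `cross_val_score` with no `scoring` argument on the custom class against `scoring="roc_auc"` on the original class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score


class RandomForestClassifierAUC(RandomForestClassifier):
    # Repeated here so that this block runs on its own
    def score(self, X, y, sample_weight=None):
        positive_probs = self.predict_proba(X)[:, 1]
        return roc_auc_score(y, positive_probs, sample_weight=sample_weight)


# Synthetic, imbalanced data; not the stroke dataset
X_cv, y_cv = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# Arbitrary settings, shared by both models
params = dict(n_estimators=25, max_depth=5, random_state=123)

# No scoring argument: cross_val_score calls the estimator's own score method
custom_scores = cross_val_score(RandomForestClassifierAUC(**params), X_cv, y_cv, cv=3)

# Original class with the built-in AUC scorer requested explicitly
builtin_scores = cross_val_score(
    RandomForestClassifier(**params), X_cv, y_cv, cv=3, scoring="roc_auc"
)

print(custom_scores)
print(builtin_scores)
```

With identical settings and folds, the two arrays should match, which suggests the custom class is a convenience rather than a different evaluation: the same AUC is available through the `scoring` argument when only cross-validation is needed.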

---

## Conclusion

Subclassing a scikit-learn model and overwriting its `score` method with the area under the ROC curve (AUC) is a small change with a practical payoff for binary classification: every call to `.score()`, whether on the model itself or on a `Pipeline` that ends with it, now returns the metric the project actually cares about.

Because the custom class inherits everything else from `RandomForestClassifier`, fitting and prediction are unchanged and only the evaluation differs. AUC captures the trade-off between true positive and false positive rates, which gives a more complete picture of classification performance than accuracy when the target is imbalanced, as it is here.

The same pattern extends beyond AUC: any metric that can be computed from the model's predictions or predicted probabilities can be returned by `score`, so the evaluation can be aligned with the specific objectives of a project.