# Setup environment

**Introduction**

In the real world of machine learning, particularly in fraud detection, disease diagnosis, or anomaly detection, we often face a common challenge: imbalanced datasets. When one class significantly outnumbers the other, our models can become biased, leading to suboptimal performance where the minority class - often the one we're most interested in - gets overlooked.

NOTES:


- https://towardsdatascience.com/how-to-build-a-custom-estimator-for-scikit-learn-fddc0cb9e16e/ - paper sample in CV

## Import libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import os
from functools import partial
# import model libraries
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import StratifiedKFold
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import OrdinalEncoder
from tqdm import tqdm


## Import the dataset

In [2]:
data = pd.read_parquet('data/credit_card_transactions.parquet')

In [3]:
metadata_columns = ['trans_date_trans_time','gender','street','trans_num']

In [4]:
data.trans_date_trans_time.describe()

count                          1852394
mean     2020-01-20 21:31:46.801827328
min                2019-01-01 00:00:18
25%      2019-07-23 04:13:43.750000128
50%                2020-01-02 01:15:31
75%      2020-07-23 12:11:25.249999872
max                2020-12-31 23:59:34
Name: trans_date_trans_time, dtype: object

In [5]:
# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
ym_stats = data[['trans_date_trans_time','state','is_fraud','trans_num']].copy()
ym_stats['ym'] = ym_stats.trans_date_trans_time.dt.strftime('%Y%m')
ym_stats.groupby('ym').agg({'is_fraud': 'sum', 'trans_num': 'nunique'})


ym,is_fraud,trans_num
201901,506,52525
201902,517,49866
201903,494,70939
201904,376,68078
201905,408,72532
201906,354,86064
201907,331,86596
201908,382,87359
201909,418,70652
201910,454,68758


In [6]:
data.columns

Index(['trans_date_trans_time', 'merchant', 'category', 'amt', 'gender',
       'street', 'city', 'state', 'zip', 'city_pop', 'job', 'trans_num',
       'unix_time', 'is_fraud', 'age_at_purchase', 'age_group',
       'transaction_day_of_the_week', 'transaction_time_of_the_day',
       'transaction_month', 'distance_from_mercant_km'],
      dtype='object')

In [7]:
data.head()

Unnamed: 0,trans_date_trans_time,merchant,category,amt,gender,street,city,state,zip,city_pop,job,trans_num,unix_time,is_fraud,age_at_purchase,age_group,transaction_day_of_the_week,transaction_time_of_the_day,transaction_month,distance_from_mercant_km
0,2019-01-01 00:00:18,"fraud_Rippin, Kub and Mann",misc_net,4.97,F,561 Perry Cove,Moravian Falls,NC,28654,3495,"Psychologist, counselling",0b242abb623afc578575680df30655b9,1325376018,0,31,3.0,1,1,1,78.773821
1,2019-01-01 00:00:44,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,F,43039 Riley Greens Suite 393,Orient,WA,99160,149,Special educational needs teacher,1f76529f8574734946361c461b024d99,1325376044,0,41,3.0,1,1,1,30.216618
2,2019-01-01 00:00:51,fraud_Lind-Buckridge,entertainment,220.11,M,594 White Dale Suite 530,Malad City,ID,83252,4154,Nature conservation officer,a1a22d70485983eac12b5b88dad1cf95,1325376051,0,57,4.0,1,1,1,108.102912
3,2019-01-01 00:01:16,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,M,9443 Cynthia Court Apt. 038,Boulder,MT,59632,1939,Patent attorney,6b849c168bdad6f867558c3793159a81,1325376076,0,52,4.0,1,1,1,95.685115
4,2019-01-01 00:03:06,fraud_Keeling-Crist,misc_pos,41.96,M,408 Bradley Rest,Doe Hill,VA,24433,99,Dance movement psychotherapist,a41d7549acf90789359a9aa5346dcb46,1325376186,0,33,3.0,1,1,1,77.702395


# Prepare datasets for training

Dataset preparation is a critical step in modeling, and the right approach depends on the phenomenon being modeled.
A common misconception is that fraud is a time series and that the dataset should be treated accordingly. The past behavior of a customer or card does help predict fraud, but it does not follow that every event is caused by the one before it. The order of events matters for fraud, yet fraud is not a time series.

In other words, earlier data does not necessarily cause later behavior.
Fraud is, in fact, a recurring phenomenon: past patterns tend to reappear in the future with slight variations. For these reasons, when modeling fraud it is important to:
- Have a sufficiently long history in the training window
- Test model stability on both:
    * A validation set sampled from the training population, to check that the model generalizes within the period it was trained on
    * An out-of-time (OOT) set that the model has never seen, taken from a period after the training set
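
The two kinds of evaluation set listed above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the data is synthetic, the `amt` values and the 5% fraud rate are made up, and `stratify=` is my own addition to keep the fraud share equal in train and holdout.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy transactions: one row per day over two years, roughly 5% fraud (synthetic).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "trans_date_trans_time": pd.date_range("2019-01-01", "2020-12-31", freq="D"),
})
toy["amt"] = rng.gamma(shape=2.0, scale=50.0, size=len(toy))
toy["is_fraud"] = (rng.random(len(toy)) < 0.05).astype(int)

# Out-of-time set: everything on or after the cutoff, never used for training.
cutoff = pd.Timestamp("2020-07-01")
in_time = toy[toy["trans_date_trans_time"] < cutoff]
oot = toy[toy["trans_date_trans_time"] >= cutoff]

# Validation set: a random, stratified sample of the training population.
train, holdout = train_test_split(
    in_time, test_size=0.2, stratify=in_time["is_fraud"], random_state=42
)

print(len(train), len(holdout), len(oot))
```

The holdout answers "does the model generalize within the period it learned from?", while the OOT set answers "does it still work later?".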


Also: [TODO]
- Risk of data leakage, because fraud labels often arrive some time after the transaction
- How far back the training window should go depends on the use case and the retraining frequency (a balance between short-term precision and stability)
- INTRO: peaks -> policy; underlying behavior -> models
- Sampling inside CV, so that no information is lost to sampling
- Sample only the training set

## Extract OOT and validation

In [8]:
_expression = "trans_date_trans_time < '2020-07-01 00:00:00'"
print(f"Splitting dataframe based on expression {_expression!r}.")
data.index = data.trans_num
train = data.query(_expression)
oot = data.query(f"~({_expression})")
print(f"Split dataframe into two dataframes with shapes {train.shape} and {oot.shape}.")

oot_y = oot.is_fraud
oot_X = oot.drop(columns=['is_fraud'])


Splitting dataframe based on expression "trans_date_trans_time < '2020-07-01 00:00:00'".
Split dataframe into two dataframes with shapes (1326733, 20) and (525661, 20).


In [9]:
# train test split

X = train.drop(columns=['is_fraud'])
X.drop(metadata_columns, axis=1, inplace=True)
y = train['is_fraud']

train_X, holdout_X, train_y, holdout_y = train_test_split(X, y, test_size=0.2, random_state=42)
# Reassign instead of dropping in place: the split outputs are derived from X
train_X = train_X.drop(columns=['age_at_purchase'])
holdout_X = holdout_X.drop(columns=['age_at_purchase'])


### Define support classes and functions

In [10]:
class CustomStratifiedKFold:
    """Stratified K-fold splitter that can resample the training part of each fold.

    `undersample_func` is a sampler object exposing `fit_resample` (e.g. RandomUnderSampler).
    The test part of every fold is left untouched, so it keeps the original class balance.
    """
    def __init__(self, n_splits=5, undersample_func=None, shuffle=True, random_state=42):
        self.n_splits = n_splits
        self.undersample_func = undersample_func
        self.random_state = random_state
        self.skf = StratifiedKFold(n_splits=self.n_splits, shuffle=shuffle, random_state=self.random_state)

    def split(self, dataframe, y):
        folds = []
        for train_index, test_index in tqdm(self.skf.split(X=dataframe, y=y), desc="Generating K-Folds", total=self.n_splits):
            train_df, y_train = dataframe.iloc[train_index], y.iloc[train_index]
            # Positional indexing on both sides: y[test_index] would be a label lookup
            test_df, y_test = dataframe.iloc[test_index], y.iloc[test_index]

            # Undersample only the training data
            if self.undersample_func is not None:
                train_df, y_train = self.undersample_func.fit_resample(train_df, y_train)

            folds.append(((train_df, y_train), (test_df, y_test)))
        return folds
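
The idea behind the class above (resample the training part of each fold, leave the test part alone) can be checked with a dependency-light sketch. The `undersample` helper below is a hypothetical stand-in for `RandomUnderSampler`, written with NumPy only, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def undersample(X, y, ratio, rng):
    """Keep every positive and a random subset of negatives so that
    n_pos / n_neg is approximately `ratio` (hypothetical helper)."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = min(len(neg), int(round(len(pos) / ratio)))
    keep = np.concatenate([pos, rng.choice(neg, size=n_neg, replace=False)])
    keep.sort()
    return X[keep], y[keep]

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 3))
y = np.zeros(10_000, dtype=int)
y[:100] = 1  # 1% positives (synthetic)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rates = []
for train_idx, test_idx in skf.split(X, y):
    # Rebalance the training part only
    X_tr, y_tr = undersample(X[train_idx], y[train_idx], ratio=0.2, rng=rng)
    # The test part is left untouched and keeps the original class balance
    y_te = y[test_idx]
    fold_rates.append((y_tr.mean(), y_te.mean()))

for train_rate, test_rate in fold_rates:
    print(f"train fraud rate: {train_rate:.3f}  test fraud rate: {test_rate:.3f}")
```

Scoring on the untouched test part keeps the cross-validated metrics comparable with what the model will face on the holdout and OOT sets.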

In [11]:
from collections import defaultdict
from sklearn.base import BaseEstimator

class MultiColumnEncoder(BaseEstimator):
    """Ordinal-encode several DataFrame columns, one OrdinalEncoder per column.

    Categories not seen during fit are encoded as -1.
    See https://www.geeksforgeeks.org/label-encoding-across-multiple-columns-in-scikit-learn/

    Args:
        columns (list[str]): names of the columns to encode.
    """
    def __init__(self, columns=None):
        self.columns = columns
        oe = partial(OrdinalEncoder, handle_unknown='use_encoded_value', unknown_value=-1)
        self.encoders = defaultdict(oe)

    def fit(self, X, y=None):
        for col in self.columns:
            self.encoders[col].fit(X[[col]])
        return self

    def transform(self, X):
        X_copy = X.copy()  # To avoid modifying the original dataframe
        for col in self.columns:
            X_copy[col] = self.encoders[col].transform(X_copy[[col]])
        return X_copy
    
    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)
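
A small sketch of why the encoder is configured with `handle_unknown='use_encoded_value'`: the OOT set can contain categories that never appeared in the training period. The category values below are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"category": ["grocery_pos", "misc_net", "gas_transport"]})
# "travel" was never seen during fit
later = pd.DataFrame({"category": ["misc_net", "travel"]})

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[["category"]])

# Known categories get their fitted code, unseen ones get -1 instead of raising
encoded = encoder.transform(later[["category"]]).ravel()
print(encoded)
```

Without this setting, `transform` raises an error on the first unseen value, which would break scoring on later data.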

In [12]:
import plotly.express as px
from sklearn.metrics import roc_curve, precision_recall_curve, auc
import plotly.graph_objects as go
from plotly.subplots import make_subplots

class EvalPlots():

    def __init__(self):
        pass

    def plot_eval_basic(self, y_true, y_score):

        '''
        y_score = model.predict_proba(X)[:, 1]
        '''

        precision, recall, thresholds = precision_recall_curve(y_true, y_score)

        # The histogram of scores compared to true labels
        fig_hist = px.histogram(
            x=y_score, color=y_true, nbins=50,
            labels=dict(color='True Labels', x='Score')
            , histnorm='probability density'
        )

        fig_hist.show()


        # Evaluating model performance on PR curve

        fig_thresh = px.area(
            x=recall, y=precision,
            title=f'Precision-Recall Curve (AUC={auc(recall, precision):.4f})',
            labels=dict(x='Recall', y='Precision'),
            width=700, height=500
        )
        fig_thresh.add_shape(
            type='line', line=dict(dash='dash'),
            x0=0, x1=1, y0=1, y1=0
        )
        fig_thresh.update_yaxes(scaleanchor="x", scaleratio=1)
        fig_thresh.update_xaxes(constrain='domain')

        fig_thresh.show()

        return fig_hist, fig_thresh
    
    
    
    def plot_eval_pred_dist(self, y_train_true, y_train_pred, y_holdout_true, y_holdout_pred, y_oot_true, y_oot_pred):
        fig = make_subplots(rows=3, cols=1, subplot_titles=("Train", "Holdout", "OOT"))

        trace0 = px.histogram(
                    x=y_train_pred, color=y_train_true, nbins=50,
                    histnorm='probability density',
                    labels=dict(color='True Labels', x='Score')
                )
        trace1 = px.histogram(
                    x=y_holdout_pred, color=y_holdout_true, nbins=50,
                    labels=dict(color='True Labels', x='Score')
                    , histnorm='probability density'
                )
        trace2 = px.histogram(
                    x=y_oot_pred, color=y_oot_true, nbins=50,
                    labels=dict(color='True Labels', x='Score')
                    , histnorm='probability density'
                )

        # add each trace (or traces) to its specific subplot
        pl_nr = 0
        for plot_ in [trace0, trace1, trace2]:
            pl_nr += 1
            for trace in plot_.data:
                fig.add_trace(trace, row=pl_nr, col=1)

        fig.update_layout(title_text="Model Performance", showlegend=True)
        return fig

    def plot_eval_pr_auc(self, precision_train, recall_train, precision_holdout, recall_holdout, precision_oot, recall_oot):
        # Evaluating model performance on PR curve

        tr_title = f'Train (AUC={auc(recall_train, precision_train):.4f})' 
        ho_title = f'Holdout (AUC={auc(recall_holdout, precision_holdout):.4f})'
        oot_title = f'OOT (AUC={auc(recall_oot, precision_oot):.4f})'

        fig = make_subplots(rows=1, cols=3, subplot_titles=(tr_title, ho_title, oot_title)) 

        trace0 = px.area(
            x=recall_train, y=precision_train,
            title=f'Training (AUC={auc(recall_train, precision_train):.4f})',
            labels=dict(x='Recall', y='Precision'),
            width=700, height=500
        )
        trace0.add_shape(
            type='line', line=dict(dash='dash'),
            x0=0, x1=1, y0=1, y1=0
        )
        trace0.update_yaxes(scaleanchor="x", scaleratio=1)
        trace0.update_xaxes(constrain='domain')

        trace1 = px.area(
            x=recall_holdout, y=precision_holdout,
            title=f'Holdout (AUC={auc(recall_holdout, precision_holdout):.4f})',
            labels=dict(x='Recall', y='Precision'),
            width=700, height=500
        )
        trace1.add_shape(
            type='line', line=dict(dash='dash'),
            x0=0, x1=1, y0=1, y1=0
        )
        trace1.update_yaxes(scaleanchor="x", scaleratio=1)
        trace1.update_xaxes(constrain='domain')

        trace2 = px.area(
            x=recall_oot, y=precision_oot,
            title=f'OOT (AUC={auc(recall_oot, precision_oot):.4f})',
            labels=dict(x='Recall', y='Precision'),
            width=700, height=500
        )
        trace2.add_shape(
            type='line', line=dict(dash='dash'),
            x0=0, x1=1, y0=1, y1=0
        )
        trace2.update_yaxes(scaleanchor="x", scaleratio=1)
        trace2.update_xaxes(constrain='domain')

        pl_nr = 0
        for plot_ in [trace0, trace1, trace2]:
            pl_nr += 1
            for trace in plot_.data:
                fig.add_trace(trace, row=1, col=pl_nr)

        fig.update_layout(title_text="Model Precision-Recall Curve", showlegend=True)

        return fig
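
The plot titles above report the area under the precision-recall curve computed with `auc(recall, precision)`, i.e. trapezoidal interpolation, while the grid searches later in this notebook score with `average_precision`. The two summaries are related but not identical, as this toy sketch shows (the four labels and scores are made up).

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# Trapezoidal area under the PR curve, as used in the plot titles
pr_auc_trapezoid = auc(recall, precision)
# Step-wise summary, as used by scoring='average_precision'
avg_precision = average_precision_score(y_true, y_score)

print(f"trapezoidal PR AUC: {pr_auc_trapezoid:.4f}")
print(f"average precision:  {avg_precision:.4f}")
```

When comparing a cross-validation score with a number read off the plots, it helps to keep in mind which of the two was used.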

In [13]:
## Build model performance report
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix


class ModelPerformanceReport(EvalPlots):
    def __init__(self, train_X,train_y, holdout_X,holdout_y,oot_X, oot_y):
        self.train_X = train_X
        self.train_y = train_y
        self.holdout_X = holdout_X
        self.holdout_y = holdout_y
        self.oot_X = oot_X
        self.oot_y = oot_y
        super().__init__()

    def predictions(self, model):
        y_train_pred = model.predict(self.train_X)
        y_train_true = self.train_y
        y_holdout_pred = model.predict(self.holdout_X)
        y_holdout_true = self.holdout_y
        y_oot_true = self.oot_y
        y_oot_pred = model.predict(self.oot_X[self.train_X.columns])

        return y_train_pred, y_train_true, y_holdout_pred, y_holdout_true, y_oot_pred, y_oot_true

    def produce_report(self, model): 
        
        y_train_pred, y_train_true, y_holdout_pred, y_holdout_true, y_oot_pred, y_oot_true = self.predictions(model)

        # confusion_matrix(y_train_true, y_train_pred)


        results_df = pd.DataFrame()
        results_df['train'] = [accuracy_score(y_train_true, y_train_pred), precision_score(y_train_true, y_train_pred), recall_score(y_train_true, y_train_pred), f1_score(y_train_true, y_train_pred)]
        results_df['holdout'] = [accuracy_score(y_holdout_true, y_holdout_pred), precision_score(y_holdout_true, y_holdout_pred), recall_score(y_holdout_true, y_holdout_pred), f1_score(y_holdout_true, y_holdout_pred)]   
        results_df['oot'] = [accuracy_score(y_oot_true, y_oot_pred), precision_score(y_oot_true, y_oot_pred), recall_score(y_oot_true, y_oot_pred), f1_score(y_oot_true, y_oot_pred)]
        results_df.index = ['accuracy', 'precision', 'recall', 'f1']
        return results_df
    

    def proba_predictions(self, model):
        y_train_pred = model.predict_proba(self.train_X)[:, 1]
        y_train_true = self.train_y
        y_holdout_pred = model.predict_proba(self.holdout_X)[:, 1]
        y_holdout_true = self.holdout_y
        y_oot_true = self.oot_y
        y_oot_pred = model.predict_proba(self.oot_X[self.train_X.columns])[:, 1]

        return y_train_pred, y_train_true, y_holdout_pred, y_holdout_true, y_oot_true, y_oot_pred
    
    def produce_proba_report(self, model):
        y_train_pred, y_train_true, y_holdout_pred, y_holdout_true, y_oot_true, y_oot_pred = self.proba_predictions(model)
        return self.plot_eval_pred_dist(y_train_true, y_train_pred, y_holdout_true, y_holdout_pred, y_oot_true, y_oot_pred)

    def precision_recall_calc(self, y_train_true, y_train_pred, y_holdout_true, y_holdout_pred, y_oot_true, y_oot_pred):
        precision_train, recall_train, _ = precision_recall_curve(y_train_true, y_train_pred)
        precision_holdout, recall_holdout, _ = precision_recall_curve(y_holdout_true, y_holdout_pred)
        precision_oot, recall_oot, _ = precision_recall_curve(y_oot_true, y_oot_pred)
        return precision_train, recall_train, precision_holdout, recall_holdout, precision_oot, recall_oot

    def produce_pr_auc_report(self, model):
        y_train_pred, y_train_true, y_holdout_pred, y_holdout_true, y_oot_true, y_oot_pred = self.proba_predictions(model)
        precision_train, recall_train, precision_holdout, recall_holdout, precision_oot, recall_oot = self.precision_recall_calc(y_train_true, y_train_pred, y_holdout_true, y_holdout_pred, y_oot_true, y_oot_pred)
        return self.plot_eval_pr_auc(precision_train, recall_train, precision_holdout, recall_holdout, precision_oot, recall_oot) 


## Transform dataset

In [14]:
categorical_columns = X.select_dtypes(include=['object']).columns.tolist()
encoder = MultiColumnEncoder(categorical_columns)
train_X = encoder.fit_transform(train_X)
holdout_X = encoder.transform(holdout_X)
oot_X = encoder.transform(oot_X)

In [15]:
report_class = ModelPerformanceReport(train_X,train_y,holdout_X,holdout_y,oot_X,oot_y)
eval_plots = EvalPlots()

## Standard run
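Baseline: LightGBM with default hyperparameters, no class balancing and no tuning.

In the report below, precision drops from 0.83 on train to 0.74 on holdout and 0.45 on OOT, while accuracy stays above 0.99 on all three sets. The gap between train and holdout points to overfitting, and the further drop on OOT shows that performance does not hold over time. Accuracy says little here, because only about 0.6% of the training rows are fraud.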

In [16]:
# eval_metric='average_precision' could be passed to fit(); the default metric is logloss
score_standard_model = LGBMClassifier(objective='binary').fit(train_X, train_y)

[LightGBM] [Info] Number of positive: 6157, number of negative: 1055229
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003153 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2115
[LightGBM] [Info] Number of data points in the train set: 1061386, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.005801 -> initscore=-5.143923
[LightGBM] [Info] Start training from score -5.143923


In [17]:
score_standard_model.n_iter_

100

In [18]:
report_class.produce_report(score_standard_model)

Unnamed: 0,train,holdout,oot
accuracy,0.997707,0.997034,0.995655
precision,0.833782,0.74386,0.452146
recall,0.755238,0.71525,0.638668
f1,0.792569,0.729274,0.52946
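
Two quick sanity checks on the numbers above, using only figures already printed in this notebook: the class counts from the LightGBM log and the OOT precision and recall from the report. This is plain arithmetic, not part of the modeling pipeline.

```python
# Class counts reported by LightGBM for the training set in this notebook
n_pos, n_neg = 6157, 1055229

# Accuracy of a "model" that never flags fraud
baseline_accuracy = n_neg / (n_pos + n_neg)

# F1 is the harmonic mean of precision and recall; OOT values from the report
precision_oot, recall_oot = 0.452146, 0.638668
f1_oot = 2 * precision_oot * recall_oot / (precision_oot + recall_oot)

print(f"always-negative accuracy on train: {baseline_accuracy:.4f}")
print(f"OOT F1 recomputed from precision and recall: {f1_oot:.4f}")
```

A model that flags nothing already reaches about 99.4% accuracy, which is why precision, recall and F1 are the rows worth reading in the report.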


In [27]:
# predictions() returns (..., y_oot_pred, y_oot_true), in that order
y_train_pred, y_train_true, y_holdout_pred, y_holdout_true, y_oot_pred, y_oot_true = report_class.predictions(score_standard_model)

In [20]:
train_X.columns

Index(['merchant', 'category', 'amt', 'city', 'state', 'zip', 'city_pop',
       'job', 'unix_time', 'age_group', 'transaction_day_of_the_week',
       'transaction_time_of_the_day', 'transaction_month',
       'distance_from_mercant_km'],
      dtype='object')

In [21]:
train_X.sample(3)

trans_num,merchant,category,amt,city,state,zip,city_pop,job,unix_time,age_group,transaction_day_of_the_week,transaction_time_of_the_day,transaction_month,distance_from_mercant_km
42120e2054d1397b85b0e3a6334b42cf,362.0,6.0,22.8,494.0,38.0,17051,4653,313.0,1361916693,2.0,2,5,2,101.46288
9e1feddcb44cbd60e084a121d1ce6797,210.0,0.0,79.51,696.0,38.0,18246,143,217.0,1360688735,3.0,2,3,2,60.229398
9233b4c947e092f5070af77d777b7b39,598.0,9.0,33.72,101.0,21.0,4616,824,276.0,1337996045,3.0,6,1,5,60.185865


In [22]:
pd.Series(y_oot_pred).value_counts()

0    522819
1      2842
Name: count, dtype: int64

In [23]:
pd.Series(y_oot_true).value_counts( )

is_fraud
0    523649
1      2012
Name: count, dtype: int64

In [24]:
2744/(2744+522917)

0.005220094319342694

In [25]:
# Uses the variables unpacked from report_class.predictions(score_standard_model) above
report_class.plot_eval_pred_dist(y_train_true, y_train_pred, y_holdout_true, y_holdout_pred, y_oot_true, y_oot_pred)

In [None]:
report_class.produce_pr_auc_report(score_standard_model)

In [None]:
y_predict_train = score_standard_model.predict_proba(train_X)[:, 1]
fig_hist, fig_thresh = eval_plots.plot_eval_basic(y_true = train_y, y_score=y_predict_train)

## Add balancing
Why it is needed and what it solves: with about 0.6% fraud in the training set, the classifier is dominated by the non-fraud class. Undersampling the majority class in the training data raises the share of fraud cases the model learns from.

*Note:*
When I tried to add SMOTE to a pipeline in my project, I hit an error. scikit-learn's Pipeline does not accept resampling steps, and resampling must only ever touch the training data, never the validation or test data. To fix this, imblearn provides a Pipeline built on top of scikit-learn's that works almost exactly the same way, except that samplers run only during fit: when you call predict(), the imblearn pipeline skips the sampling step.

In [None]:
from imblearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate

undersample_pipe = Pipeline([('sampling', RandomUnderSampler(sampling_strategy=0.2, random_state=42)) 
                             , ('class', LGBMClassifier(objective='binary'))])
score_balanced_model = undersample_pipe.fit(train_X, train_y
                                            , class__eval_metric='average_precision'
                                            )


In [None]:
score_balanced_model['class'].n_iter_

In [None]:
report_class.produce_report(score_balanced_model)

In [75]:
# predictions() returns (..., y_oot_pred, y_oot_true), in that order
y_train_pred, y_train_true, y_holdout_pred, y_holdout_true, y_oot_pred, y_oot_true = report_class.predictions(score_balanced_model)

In [None]:
pd.Series(y_train_pred).value_counts()

In [None]:
report_class.produce_pr_auc_report(score_balanced_model)

## Add parameter tuning

In [78]:
hyperparameters = {
        #'encode__columns': [categorical_columns],
        'class__n_estimators': [20, 50, 200, 500],
        "class__objective": ["binary"],
        "class__early_stopping_round": [10],
        "class__num_leaves": [5, 10, 80, 100],
        "class__min_data_in_leaf": [10, 50, 100, 200],
    }

In [None]:
from imblearn.pipeline import Pipeline
from imblearn.ensemble import BalancedBaggingClassifier

#, ('encode',MultiColumnLabelEncoder())
undersample_pipe = Pipeline([('sampling', RandomUnderSampler(sampling_strategy=0.1, random_state=42)) 
                             , ('class', LGBMClassifier(objective='binary'))])

score_balanced_parameter_model = GridSearchCV(undersample_pipe, param_grid=hyperparameters, cv=3, scoring='average_precision')

score_balanced_parameter_model.fit(train_X, train_y, class__eval_set=(holdout_X, holdout_y))

In [None]:
report_class.produce_report(score_balanced_parameter_model)

In [None]:
report_class.produce_pr_auc_report(score_balanced_parameter_model)

## Test BalancedBaggingClassifier
https://medium.com/@nageshmashette32/balanced-bagging-classifier-bagging-for-imbalanced-classification-dfba66c44c14
https://imbalanced-learn.org/stable/references/generated/imblearn.ensemble.BalancedBaggingClassifier.html

In [82]:
categorical_columns = X.select_dtypes(include=['object']).columns.tolist()
hyperparameters_bbc = {
        'estimator__n_estimators': [20, 50, 100, 200, 500],
        "estimator__objective": ["binary"],
        "estimator__early_stopping_round": [10],
        "estimator__num_leaves": [5, 10, 30, 50, 80, 100],
        "estimator__min_data_in_leaf": [10, 20, 30, 40, 50, 80, 100, 200],
        "estimator__use_first_metric": [True],
        'n_estimators': [3, 4, 10],
        'sampling_strategy': [0.10, 0.15, 0.2, 0.25, 0.3, 0.4]
    }
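
Before launching the search, it is worth counting how many candidates the grid above expands to. The sketch copies the value lists from the grid and assumes `cv=3`, as in the `GridSearchCV` call that follows.

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as the grid above; only their sizes matter here
grid = {
    "estimator__n_estimators": [20, 50, 100, 200, 500],
    "estimator__objective": ["binary"],
    "estimator__early_stopping_round": [10],
    "estimator__num_leaves": [5, 10, 30, 50, 80, 100],
    "estimator__min_data_in_leaf": [10, 20, 30, 40, 50, 80, 100, 200],
    "estimator__use_first_metric": [True],
    "n_estimators": [3, 4, 10],
    "sampling_strategy": [0.10, 0.15, 0.2, 0.25, 0.3, 0.4],
}

n_candidates = len(ParameterGrid(grid))
n_cv_fits = n_candidates * 3  # cv=3 (assumed)

print(f"{n_candidates} candidates, {n_cv_fits} cross-validation fits")
```

Each of those fits also trains several LightGBM models (the bagging `n_estimators`), so trimming the grid or moving to a randomized search may be worth considering.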

In [None]:
'''bbc = BalancedBaggingClassifier(base_estimator=LGBMClassifier
                                , sampling_strategy=0.2, 
                                random_state=42
                                , n_estimators=10,
                                oob_score=True, warm_start=True, n_jobs=-1)
bbc.fit(train_X, train_y)'''
scores_bcc = GridSearchCV(BalancedBaggingClassifier(estimator=LGBMClassifier(),
                                random_state=42,
                                oob_score=True, warm_start=True, n_jobs=-1)
                                , param_grid=hyperparameters_bbc, cv=3, scoring='average_precision'
                                )

scores_bcc.fit(train_X, train_y, eval_set=(holdout_X, holdout_y))
# A fitted GridSearchCV is not subscriptable; the best mean cross-validated average precision is in best_score_
scores_bcc.best_score_

# Compare models

Show how model stability changes between the 3 model versions and explain why

In [None]:
# Compare models results

# Generate reports for standard model
standard_report = report_class.produce_report(score_standard_model)
standard_pr_auc_report = report_class.produce_pr_auc_report(score_standard_model)

# Generate reports for balanced model
balanced_report = report_class.produce_report(score_balanced_model)
balanced_pr_auc_report = report_class.produce_pr_auc_report(score_balanced_model)

# Generate reports for balanced parameter tuned model
balanced_param_report = report_class.produce_report(score_balanced_parameter_model)
balanced_param_pr_auc_report = report_class.produce_pr_auc_report(score_balanced_parameter_model)

# Combine reports into a single DataFrame for comparison
comparison_df = pd.concat([standard_report.add_suffix('_standard'), balanced_report.add_suffix('_balanced'), balanced_param_report.add_suffix('_tuned')], axis=1)
#comparison_df.columns = ['Standard Model', 'Balanced Model', 'Balanced Parameter Tuned Model']

# Display the comparison DataFrame
comparison_df

In [None]:
# Stack the three reports; assign() returns a copy, so the original reports are not modified
plot_df = pd.concat([
    standard_report.assign(model='standard'),
    balanced_report.assign(model='balanced'),
    balanced_param_report.assign(model='balanced_param'),
])
plot_df = plot_df.reset_index(names='metric')

In [76]:
plot_df = pd.melt(plot_df, id_vars=['metric', 'model'], value_vars=['train', 'holdout', 'oot'])

In [None]:
from dash import Dash, dcc, html, Input, Output
import plotly.express as px

app = Dash(__name__)

app.layout = html.Div([
    html.H4("Analysis of the ML model's results using scoring metrics"),
    html.P("Select metric:"),
    dcc.Dropdown(
        id='dropdown',
        options=['accuracy', 'precision', 'recall', 'f1'],
        value='precision',
        clearable=False
    ),
    dcc.Graph(id="graph"),
])


@app.callback(
    Output("graph", "figure"),
    Input('dropdown', "value"))
def train_and_display(metric):
    # One line per model across train, holdout and OOT for the selected metric
    fig = px.line(plot_df[plot_df.metric==metric], y='value', x='variable', color='model', markers=True)
    return fig


app.run_server(debug=True)

In [None]:
from dash import Dash, dcc, html, Input, Output

app = Dash(__name__)
MODELS = {'standard': score_standard_model, 'balanced': score_balanced_model, 'balanced_param': score_balanced_parameter_model}
app.layout = html.Div([
    html.H4("Analysis of the ML model's results using scoring metrics"),
    html.P("Select model:"),
    dcc.Dropdown(
        id='dropdown',
        options=['standard', 'balanced', 'balanced_param'],
        value='balanced_param',
        clearable=False
    ),
    dcc.Graph(id="graph"),
])


@app.callback(
    Output("graph", "figure"),
    Input('dropdown', "value"))
def train_and_display(model):
    # Precision-recall curves on train, holdout and OOT for the selected model
    return report_class.produce_pr_auc_report(MODELS[model])


app.run_server(debug=True)