### Imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import precision_score, f1_score, confusion_matrix, classification_report
import warnings

pd.set_option('display.max_columns', None)
warnings.simplefilter('ignore')
seed = 42

### Baseline model

I want a simple, fast model to serve as a reference when trying different techniques. I use a gradient boosting classifier because it generally performs well on a variety of tasks without further data processing or complex tuning. \
This is meant to be just a baseline.

In [2]:
data = pd.read_csv('data/base_dataset.csv')

In [3]:
data.shape

(3333, 68)

In [4]:
X = data.drop('Churn', axis=1)
y = data['Churn']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=seed, stratify=y)

In [6]:
clf = HistGradientBoostingClassifier(validation_fraction=.2, class_weight='balanced', early_stopping=True, scoring='loss', random_state=seed)

I've configured the model to hold out 20% of the training data as a validation set for early stopping, and to weight the classes automatically: with `class_weight='balanced'`, each class gets a weight inversely proportional to its frequency in the training data.
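To make the weighting concrete, here is a minimal sketch of the formula scikit-learn documents for the `'balanced'` mode, `n_samples / (n_classes * count)`. The helper name `balanced_class_weights` is hypothetical, and the class counts are an assumption derived from the 80/20 stratified split (2280 negatives and 386 positives in the training part), not values read from the fitted model.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for the given labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Assumed training-set composition (derived from the split sizes, not measured).
y_train_like = [0] * 2280 + [1] * 386
weights = balanced_class_weights(y_train_like)

for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
```

With these weights both classes contribute the same total weight to the loss, which is why the minority class ends up with the larger per-sample weight.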

In [7]:
%%time
clf.fit(X_train, y_train)

CPU times: user 2.46 s, sys: 98 ms, total: 2.56 s
Wall time: 178 ms


In [8]:
y_pred = clf.predict(X_test) 

In [9]:
print(classification_report(y_test, y_pred)) 

              precision    recall  f1-score   support

           0       0.97      0.96      0.97       570
           1       0.80      0.80      0.80        97

    accuracy                           0.94       667
   macro avg       0.88      0.88      0.88       667
weighted avg       0.94      0.94      0.94       667



In [10]:
print(f'Precision: {round(precision_score(y_test, y_pred), 2)}')
print(f'F1-score: {round(f1_score(y_test, y_pred), 2)}') 

Precision: 0.8
F1-score: 0.8


### Baseline model on the oversampled dataset

I want to assess the effectiveness of the resampling technique by training the same baseline model on the SMOTE-oversampled dataset.

In [11]:
data_sm = pd.read_csv('data/base_dataset_smote.csv')

In [12]:
data_sm.shape

(5700, 68)

In [13]:
X_sm = data_sm.drop('Churn', axis=1)
y_sm = data_sm['Churn']

In [14]:
X_train_sm, X_test_sm, y_train_sm, y_test_sm = train_test_split(X_sm, y_sm, test_size=.2, random_state=seed, stratify=y_sm)

In [15]:
clf_sm = HistGradientBoostingClassifier(validation_fraction=.2, class_weight='balanced', early_stopping=True, scoring='loss', random_state=seed)

In [16]:
%%time
clf_sm.fit(X_train_sm, y_train_sm)

CPU times: user 5.74 s, sys: 4.98 ms, total: 5.75 s
Wall time: 381 ms


In [17]:
y_pred_sm = clf_sm.predict(X_test_sm) 

In [18]:
print(classification_report(y_test_sm, y_pred_sm)) 

              precision    recall  f1-score   support

           0       0.93      0.98      0.96       570
           1       0.98      0.93      0.96       570

    accuracy                           0.96      1140
   macro avg       0.96      0.96      0.96      1140
weighted avg       0.96      0.96      0.96      1140



In [19]:
print(f'Precision: {round(precision_score(y_test_sm, y_pred_sm), 2)}')
print(f'F1-score: {round(f1_score(y_test_sm, y_pred_sm), 2)}') 

Precision: 0.98
F1-score: 0.96


### Comment

On the SMOTE-oversampled dataset the same model scores markedly higher on the positive class (precision 0.98 vs 0.80, F1-score 0.96 vs 0.80), without adding complexity. Note that this test set was itself resampled (570 samples per class, against 570 and 97 in the original split), so the two sets of scores are not measured on the same data. \
Additional tests could be made with other oversampling techniques, or by manually setting the class weights in the original case to give more importance to the positive class. \
Given the small amount of data, I'd avoid undersampling techniques because I don't want to discard information.
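
One way to try another oversampling technique while keeping the test set untouched is to split first and resample only the training part. The sketch below uses plain random oversampling (duplicating minority rows), which is a simpler stand-in and not SMOTE. The helper `random_oversample` and the toy data are hypothetical and only illustrate the idea.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class. Output is grouped by class,
    not shuffled."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy training split (hypothetical data): 8 negatives, 2 positives.
train_rows = [[i, i * 0.5] for i in range(10)]
train_labels = [0] * 8 + [1] * 2

bal_rows, bal_labels = random_oversample(train_rows, train_labels)
class_counts = Counter(bal_labels)
print(class_counts)
```

Applied to `X_train` and `y_train` only, this kind of resampling would leave `X_test` and `y_test` with their original class distribution, so the resulting scores could be compared directly with the baseline ones.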