# Predicting Customer Satisfaction with Imbalanced Data

In this notebook, I'll show different techniques suitable for imbalanced datasets to try and improve model performance.

## In this notebook

* Remove redundant features
* Train a classifier to predict customer satisfaction
* Improve performance with different techniques for imbalanced data

## Feature selection

To begin with, I will remove duplicated and constant features, which are in essence redundant or non-predictive. Next, I will find quasi-constant features and evaluate the distribution of their values across satisfied and un-satisfied customers, to determine if I can remove them.

* Remove constant features
* Remove duplicated features
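
The quasi-constant check described above could look roughly like the sketch below. It uses a small made-up DataFrame (the column names `flag` and `amount` and the 0.85 cut-off are assumptions for illustration, not values taken from this dataset). For each feature we compute the share of rows taken by its most frequent value, and for the quasi-constant ones we compare the rate of un-satisfied customers across its values.

```python
import pandas as pd

# toy data: made-up columns, for illustration only
toy = pd.DataFrame({
    "flag":   [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],   # quasi-constant: 90% zeros
    "amount": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],  # clearly not constant
    "TARGET": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

def dominant_share(series):
    """Fraction of rows taken by the most frequent value."""
    return series.value_counts(normalize=True).iloc[0]

features = [c for c in toy.columns if c != "TARGET"]
threshold = 0.85  # assumed cut-off for "quasi-constant"

quasi_constant = [c for c in features if dominant_share(toy[c]) >= threshold]

# for each quasi-constant feature, compare the target rate
# across its values before deciding whether to drop it
target_rate = {c: toy.groupby(c)["TARGET"].mean() for c in quasi_constant}
```

In this toy example the rare value of `flag` coincides with the un-satisfied customer, so dropping it would throw away signal. Feature-engine's `DropConstantFeatures` exposes a `tol` parameter that can be lowered below 1 to flag quasi-constant features in a similar way.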


## The Machine Learning Model

This is a classification problem. I want to predict if a customer is unsatisfied (1 in the target). Among the off-the-shelf algorithms, **Gradient Boosting Machines** usually out-perform other models. So in this notebook I will train a Gradient Boosting Classifier from Scikit-learn.


## Imbalanced data

There are a number of techniques that we can use to try and improve the performance of models trained on imbalanced datasets. We can under- or over-sample the dataset. Within the under-sampling techniques we have cleaning techniques that allow us to remove noisy observations instead of removing observations at random. Within the over-sampling techniques we have methods to create new, "synthetic", data using existing observations as templates. This way, we do not just "duplicate" the data as we would do with random over-sampling.

We can also implement cost-sensitive learning, where we modify the optimization function to account for the cost of misclassification. Misclassifying an observation from the minority class tends to be more costly in real situations. And finally we have special ensemble algorithms that were designed specifically to work with imbalanced datasets.

So in summary, we could try:

* undersampling at random or based on cleaning criteria
* oversampling at random or by creating new synthetic data
* introduce cost sensitive learning
* train a special algorithm for imbalanced datasets

For more details on feature engineering and feature selection, hyperparameter optimization or working with imbalanced datasets, visit my [online courses](https://www.trainindata.com/).

In [None]:
# Let's install Feature-engine
# this package will allow us to quickly remove 
# non-predictive variables

!pip install feature-engine

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# I use GBM because it usually out-performs other off-the-shelf 
# classifiers
from sklearn.ensemble import GradientBoostingClassifier

# metric to optimize for the competition
from sklearn.metrics import roc_auc_score

from sklearn.model_selection import train_test_split

# to assemble various procedures in sequence
from sklearn.pipeline import Pipeline

# some methods to work with imbalanced data are based on nearest neighbours
# and nearest neighbours are sensitive to the magnitude of the features
# so we need to scale the data
from sklearn.preprocessing import MinMaxScaler

# import selection classes from Feature-engine
# to reduce the number of features
from feature_engine.selection import (
    DropDuplicateFeatures,
    DropConstantFeatures,
)

# over-sampling techniques for imbalanced data
from imblearn.over_sampling import (
    RandomOverSampler,
    SMOTENC,
)

# under-sampling techniques for imbalanced data
from imblearn.under_sampling import (
    InstanceHardnessThreshold,
    RandomUnderSampler,
)

# special ensemble methods to work with imbalanced data
# we will use those based on boosting, which tend to work better
from imblearn.ensemble import (
    RUSBoostClassifier,
    EasyEnsembleClassifier,
)

## Load Data

In [None]:
# load the Santander Customer Satisfaction dataset

data = pd.read_csv('/kaggle/input/santander-customer-satisfaction/train.csv')

In [None]:
# separate dataset into train and test sets
# I split 20:80 mostly to reduce the size of the train set
# so that this notebook does not run out of memory :_(

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['ID','TARGET'], axis=1),
    data['TARGET'],
    test_size=0.8,
    random_state=0)

X_train.shape, X_test.shape

## Target

The target class is imbalanced. The value 1 refers to un-satisfied customers and 0 to satisfied. So most of Santander's customers are satisfied.

In [None]:
# check class imbalance

y_train.value_counts(normalize=True), y_train.value_counts()

We see that ~4% of the customers are not satisfied, that is, around 2700 customers.

In [None]:
# check also the test set
y_test.value_counts(normalize=True)

## Drop constant and duplicated features

This dataset contains constant and duplicated features. I know this from previous analysis so I will quickly remove these features to reduce the data size.

More insight about feature selection for this dataset here:
https://www.kaggle.com/solegalli/feature-selection-with-feature-engine

In [None]:
# to remove constant and duplicated features, we use the transformers from Feature-engine

pipe = Pipeline([
    ('constant', DropConstantFeatures(tol=1)), # drops constant features
    ('duplicated', DropDuplicateFeatures()), # drops duplicates
])

# find features to remove
pipe.fit(X_train, y_train)

In [None]:
# how many constant features are there in the dataset?

len(pipe.named_steps['constant'].features_to_drop_)

In [None]:
# how many duplicated features are there in the dataset?

len(pipe.named_steps['duplicated'].features_to_drop_)

Let's go ahead and remove them from the datasets.

In [None]:
print('Number of original variables: ', X_train.shape[1])

# see how with the pipeline we can apply all transformers in sequence
# with one line of code, for each data set
X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)

print('Number of variables after selection: ', X_train.shape[1])

Now, we reduced the dataset a bit. Let's hope that that helps speed things up!

## Variable exploration

From previous analysis we know that this data set does not contain missing values and that all variables are numerical.

We also know from previous analysis that most variables in this dataset are binary and discrete, with very few continuous variables.

In [None]:
# Let's find out how many variables we have with 2, or less than 10 or 20 distinct values

for max_unique in [2, 10, 20]:
    vars_ = [x for x in X_train.columns if X_train[x].nunique()<= max_unique]
    vars_ = len(vars_)
    print(f'{vars_} variables with less than or equal to {max_unique} values')

We see that we have 95 binary variables, and a few more that are also discrete.

Why is this important?

* Some under- and over-sampling methods for imbalanced datasets are based on nearest neighbours
* Nearest neighbours depend on distance metrics
* In theory, distance metrics for continuous variables are not appropriate for discrete variables and vice-versa.

So, if we are strict, we would exclude most of the under-sampling and SMOTE-based algorithms, or alternatively, we should create distance matrices manually to accommodate the different metrics. But that is a lot of work.

So, in this notebook, we will use only algorithms that we can use off-the-shelf.
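
To give an idea of what such a manual distance could look like, here is a minimal sketch of a Gower-style distance: a range-scaled absolute difference for continuous variables and a simple mismatch for discrete ones, averaged over all variables. The function name `mixed_distance` and the toy values are assumptions for illustration; nothing else in this notebook uses it.

```python
import numpy as np

def mixed_distance(a, b, is_discrete, ranges):
    """Gower-style distance between two observations.

    a, b        : 1-D sequences with the feature values
    is_discrete : boolean mask, True where the feature is discrete
    ranges      : max - min of each feature (used for continuous ones)
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # discrete features: 0 if equal, 1 if different
    mismatch = (a != b).astype(float)
    # continuous features: absolute difference scaled to [0, 1]
    scaled_diff = np.abs(a - b) / ranges
    per_feature = np.where(is_discrete, mismatch, scaled_diff)
    return per_feature.mean()

is_discrete = np.array([True, True, False])
ranges = np.array([1.0, 1.0, 100.0])  # only the last one is actually used

# two observations: they differ in one discrete feature and by 50 units
# (half of the range) in the continuous feature
d = mixed_distance([0, 1, 20.0], [0, 0, 70.0], is_discrete, ranges)
```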

For now, let's train a gradient boosting machine with these variables to determine the benchmark performance.

## Train Gradient Boosting Model

As mentioned, Gradient Boosting Machines usually out-perform other off-the-shelf classifiers, so we will implement that model directly.

In [None]:
# set up the gradient boosting classifier
gbm = GradientBoostingClassifier(
    loss = 'exponential',
    max_depth = 1,
    min_samples_split = 0.80,
    n_estimators = 100,
)

# fit
gbm.fit(X_train, y_train)

In [None]:
# Now let's get the benchmark performance on train and test

X_train_preds = gbm.predict_proba(X_train)[:,1]
X_test_preds = gbm.predict_proba(X_test)[:,1]

print('Train roc_auc: ', roc_auc_score(y_train, X_train_preds))
print('Test roc_auc: ', roc_auc_score(y_test, X_test_preds))

Let's see if we can improve the performance a bit by incorporating methods designed to work with imbalanced datasets.

# Methods for Imbalanced data

# Under-sampling

## Instance Hardness Threshold

Among the under-sampling methods, we can perform random under-sampling, where we extract samples from the majority class at random. Normally, we extract as many samples as we have in the minority class.

Then we have cleaning methods, but all of them depend on nearest neighbours, so I would argue that they are not suitable given that we have a mix of discrete and continuous variables.

We can use the Instance Hardness Threshold, which removes observations from the majority class that are hard to classify correctly.

Instance hardness is a measure of how difficult an observation is to classify correctly, and it is inversely related to the probability that the model assigns to its true class.
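
As a rough illustration of the idea (this is not the imbalanced-learn implementation), instance hardness can be computed as 1 minus the probability that a model assigns to the true class of each observation. The probabilities below are made up; in practice they would come from out-of-fold predictions of a classifier.

```python
import numpy as np

# made-up predicted probabilities for classes 0 and 1 (one row per observation)
proba = np.array([
    [0.95, 0.05],
    [0.60, 0.40],
    [0.30, 0.70],
    [0.80, 0.20],
    [0.10, 0.90],
])
y = np.array([0, 0, 0, 0, 1])  # 4 majority observations, 1 minority

# probability assigned to the true class of each observation
p_true = proba[np.arange(len(y)), y]
hardness = 1 - p_true

# keep every minority observation, plus the n easiest majority observations
n_keep = (y == 1).sum()
majority_idx = np.flatnonzero(y == 0)
easiest = majority_idx[np.argsort(hardness[majority_idx])[:n_keep]]
kept = np.sort(np.concatenate([easiest, np.flatnonzero(y == 1)]))
```

Here the majority observation in row 2 is the hardest one (the model gives only 0.30 to its true class), so it would be among the first to be removed.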

So to keep things simple, let's just implement the instance hardness threshold to under-sample our data.

In [None]:
# set up instance hardness threshold
# the instance hardness is determined from the cross-validated
# predictions of a gradient boosting machine

iht = InstanceHardnessThreshold(
    estimator=gbm, # we pass the model we set up earlier
    sampling_strategy='auto',  # undersamples only the majority class
    random_state=1,
    cv=2,  # cross validation fold, 2 to speed things up.
)

# resample
X_resampled, y_resampled = iht.fit_resample(X_train, y_train)

# shape of original data and data after resampling
X_train.shape, X_resampled.shape

In [None]:
# check the resampled target
# instance hardness threshold is a fixed undersampling method
# so it aims for 50:50 observations from majority and minority class

# let's see
y_resampled.value_counts(normalize=True)

In [None]:
# train model on resampled data

gbm.fit(X_resampled, y_resampled)

In [None]:
# Now let's get the performance on train and test

X_train_preds = gbm.predict_proba(X_resampled)[:,1]
X_test_preds = gbm.predict_proba(X_test)[:,1]

print('Train roc_auc: ', roc_auc_score(y_resampled, X_train_preds))
print('Test roc_auc: ', roc_auc_score(y_test, X_test_preds))

The model is over-fit to the train set. The instance hardness threshold is not improving the model performance; on the contrary, it makes it worse.

## Random Undersampling

In random undersampling, we would select at random as many observations from the majority as those we have in the minority. This method is often neglected because it tends to reduce the size of the train set quite dramatically.
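
What the sampler does can be sketched in a few lines of NumPy. This is a simplified illustration on a made-up target with one dummy feature, not the `RandomUnderSampler` code itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy target: 95 majority (0) and 5 minority (1) observations
y = np.array([0] * 95 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)  # one dummy feature

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# draw, without replacement, as many majority rows as there are minority rows
sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)

# keep the sampled majority rows plus all the minority rows
keep = np.concatenate([sampled_majority, minority_idx])
X_under, y_under = X[keep], y[keep]
```

The resampled data has only 10 rows out of the original 100, which shows how drastic the reduction can be when the minority class is small.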

Let's try it in any case.

In [None]:
rus = RandomUnderSampler(
    sampling_strategy='auto',  # undersamples only the majority class
    random_state=0,
)

# resample
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

# shape of original data and data after resampling
# we see that the data was reduced quite a bit

X_train.shape, X_resampled.shape

In [None]:
# check the resampled target

y_resampled.value_counts(normalize=True)

In [None]:
# train model

gbm.fit(X_resampled, y_resampled)

In [None]:
# Now let's get the performance on train and test

X_train_preds = gbm.predict_proba(X_resampled)[:,1]
X_test_preds = gbm.predict_proba(X_test)[:,1]

print('Train roc_auc: ', roc_auc_score(y_resampled, X_train_preds))
print('Test roc_auc: ', roc_auc_score(y_test, X_test_preds))

Interestingly, even after under-sampling, which reduces the dataset quite a bit, we obtain a performance quite similar to that obtained using the entire dataset.

If we want to train a model repeatedly in a live system, this could be a nice alternative, as smaller datasets allow faster training times, and we would not be sacrificing performance.

# Oversampling

## Random Over-sampling

Among the over-sampling methods we have random over-sampling, which bootstraps observations from the minority class to increase their number. This technique in essence duplicates data, so it sometimes leads to over-fitting.

Let's try it in any case.

In [None]:
ros = RandomOverSampler(
    sampling_strategy='auto',  # over-samples only the minority class
    random_state=0,
)

# resample
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

# we would have a lot more observations from the minority class now
X_train.shape, X_resampled.shape

In [None]:
# check the resampled target

y_resampled.value_counts(normalize=True)

In [None]:
# train model on the over-sampled data

gbm.fit(X_resampled, y_resampled)

In [None]:
# Now let's get the performance on train and test

X_train_preds = gbm.predict_proba(X_resampled)[:,1]
X_test_preds = gbm.predict_proba(X_test)[:,1]

print('Train roc_auc: ', roc_auc_score(y_resampled, X_train_preds))
print('Test roc_auc: ', roc_auc_score(y_test, X_test_preds))

We observe a tiny improvement, but probably within the error of the model. We would have to use cross-validation and get a measure of the error dispersion to be sure. But I can't do that on Kaggle kernels for every technique. It runs out of memory.
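
For reference, a cross-validated estimate could be sketched as below. It runs on synthetic data from `make_classification` (an assumption, so that it finishes quickly) with a deliberately small model. The over-sampling is applied only to the training part of each fold, so that duplicated observations never leak into the validation part.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# small synthetic imbalanced dataset (stand-in for the real data)
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)

rng = np.random.default_rng(0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = []

for train_idx, valid_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # random over-sampling of the minority class, on the training fold only
    minority = np.flatnonzero(y_tr == 1)
    n_extra = (y_tr == 0).sum() - len(minority)
    extra = rng.choice(minority, size=n_extra, replace=True)
    X_tr = np.vstack([X_tr, X_tr[extra]])
    y_tr = np.concatenate([y_tr, y_tr[extra]])

    model = GradientBoostingClassifier(n_estimators=20, max_depth=1, random_state=0)
    model.fit(X_tr, y_tr)

    # evaluate on the untouched validation fold
    preds = model.predict_proba(X[valid_idx])[:, 1]
    scores.append(roc_auc_score(y[valid_idx], preds))

print(f"roc_auc: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

The standard deviation across folds gives a rough idea of how large a difference between two techniques needs to be before it can be taken seriously. imbalanced-learn also provides its own `Pipeline` class that applies samplers only during fitting, which would be a more compact way to write the same loop.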

## SMOTENC

SMOTE methods create new, synthetic data, using original samples as templates. They interpolate the synthetic data between a sample used as template and any of its closest neighbours (5 by default).

There is one variant that is suitable for datasets with continuous and discrete variables, SMOTE-NC, so we will implement that method.
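
The interpolation step for continuous variables can be sketched as follows. This is a simplified illustration on made-up minority observations, and the helper `smote_sample` is hypothetical, not part of imbalanced-learn. SMOTE-NC additionally has to handle the discrete variables, which, as far as I know, take the most frequent category among the neighbours rather than an interpolated value.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy minority-class observations with 2 continuous features
X_min = np.array([
    [1.0, 1.0],
    [1.2, 0.9],
    [0.8, 1.1],
    [5.0, 5.0],
])

def smote_sample(X_min, i, k, rng):
    """Create one synthetic observation from template row i."""
    # Euclidean distances from the template to all minority observations
    dist = np.linalg.norm(X_min - X_min[i], axis=1)
    # indices of the k nearest neighbours, skipping the template itself
    neighbours = np.argsort(dist)[1:k + 1]
    j = rng.choice(neighbours)
    # pick a random point on the segment between template and neighbour
    u = rng.random()
    return X_min[i] + u * (X_min[j] - X_min[i])

x_new = smote_sample(X_min, i=0, k=2, rng=rng)
```

The new observation always lies on the line between the template and one of its neighbours, so the distant observation in the last row does not influence it when `k=2`.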

In [None]:
# we need to capture the index of the discrete variables

# make list of discrete variables
cat_vars = [var for var in X_train.columns if X_train[var].nunique() <= 10]

# capture the position of each discrete variable in the dataframe columns
cat_vars_index = [X_train.columns.get_loc(x) for x in cat_vars]

cat_vars_index[0:6]

In [None]:
smnc = SMOTENC(
    sampling_strategy='auto', # samples only the minority class
    random_state=0,  # for reproducibility
    k_neighbors=3,
    categorical_features=cat_vars_index # indices of the columns of discrete variables
)  

# because SMOTE uses KNN, and KNN is sensitive to variable magnitude, we re-scale the data
# we keep the fitted scaler so that we can apply the same scaling to the test set
scaler = MinMaxScaler().fit(X_train)

# this procedure will take a while
X_resampled, y_resampled = smnc.fit_resample(scaler.transform(X_train), y_train)

X_train.shape, X_resampled.shape

In [None]:
# check the distribution of the resampled target
# we should have 50:50 now

y_resampled.value_counts(normalize=True)

In [None]:
# train the model

gbm.fit(X_resampled, y_resampled)

In [None]:
# Now let's get the performance on train and test
# the model was trained on scaled data, so we scale the test set as well

X_train_preds = gbm.predict_proba(X_resampled)[:,1]
X_test_preds = gbm.predict_proba(scaler.transform(X_test))[:,1]

print('Train roc_auc: ', roc_auc_score(y_resampled, X_train_preds))
print('Test roc_auc: ', roc_auc_score(y_test, X_test_preds))

This SMOTE method was not useful for this dataset.

## Ensemble methods for imbalanced data

We will implement RUSBoost and Easy Ensemble. Both are based on boosting methods, and thus tend to return better model performance.

In [None]:
# set up the RUSBoost ensemble model

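rusboost = RUSBoostClassifier(
        # we keep the default base estimator; recent imbalanced-learn versions
        # name this parameter `estimator` instead of `base_estimator`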
        n_estimators=20,
        learning_rate=1.0,
        sampling_strategy='auto',
        random_state=2909,
    )


# train model
rusboost.fit(X_train, y_train)

In [None]:
# Now let's get the performance on train and test

X_train_preds = rusboost.predict_proba(X_train)[:,1]
X_test_preds = rusboost.predict_proba(X_test)[:,1]

print('Train roc_auc: ', roc_auc_score(y_train, X_train_preds))
print('Test roc_auc: ', roc_auc_score(y_test, X_test_preds))

In [None]:
easy = EasyEnsembleClassifier(
        n_estimators=10,
        sampling_strategy='auto',
        random_state=2909,
    )


# train model
easy.fit(X_train, y_train)

In [None]:
# Now let's get the performance on train and test

X_train_preds = easy.predict_proba(X_train)[:,1]
X_test_preds = easy.predict_proba(X_test)[:,1]

print('Train roc_auc: ', roc_auc_score(y_train, X_train_preds))
print('Test roc_auc: ', roc_auc_score(y_test, X_test_preds))

The ensemble methods did not improve the performance either. Another thing we can investigate is cost-sensitive learning.

## Cost-sensitive learning

To finish with, we will implement cost-sensitive learning. That is, we will modify the penalization cost of the minority class. We can do this directly from the sklearn Gradient Boosting Classifier as follows:

In [None]:
# we have an imbalance of 95 to 5, so we use those as weights
sample_weight = np.where(y_train==1, 95, 5)

# train model
gbm.fit(X_train, y_train, sample_weight)
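
The 95 and 5 above are hard-coded from the approximate class proportions. A minimal sketch of deriving the weights from the target instead, using inverse class frequency, is shown below. The toy target is made up; scikit-learn's `compute_sample_weight` with `class_weight='balanced'` implements a similar heuristic.

```python
import numpy as np

# toy target with 96 satisfied (0) and 4 un-satisfied (1) customers
y = np.array([0] * 96 + [1] * 4)

classes, counts = np.unique(y, return_counts=True)

# inverse-frequency weights: n_samples / (n_classes * class_count)
class_weight = len(y) / (len(classes) * counts)

# map each observation to the weight of its class
sample_weight = class_weight[np.searchsorted(classes, y)]
```

With these weights both classes contribute the same total weight to the loss, and the resulting array can be passed to `fit` through the `sample_weight` argument in the same way as above.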

In [None]:
# Now let's get the performance on train and test

X_train_preds = gbm.predict_proba(X_train)[:,1]
X_test_preds = gbm.predict_proba(X_test)[:,1]

print('Train roc_auc: ', roc_auc_score(y_train, X_train_preds))
print('Test roc_auc: ', roc_auc_score(y_test, X_test_preds))

Of all the techniques that we tested in this notebook, the benchmark model trained on the entire dataset and the one with cost-sensitive learning seem to be the ones that perform best. So to follow up, we could optimize the hyperparameters of these two to see if this improves model performance.

That will be all for this notebook.

For more details on feature engineering and feature selection, hyperparameter optimization or working with imbalanced datasets, feel free to check my [online courses](https://www.trainindata.com/).