# Predicting Customer Satisfaction with Imbalanced Data and Hyperparameter Optimization

In this notebook, I'll show how to approach developing a model to predict customer satisfaction using an imbalanced dataset.

I discuss why I made some decisions on which model to train, on how to optimize hyperparameters and on which techniques to use to improve the performance of the model trained with an imbalanced data set.

Now, to be able to run the code on Kaggle, I need to reduce the size of the dataset quite a bit. Otherwise, the kernel runs out of memory and breaks. So I will do that here, to be able to show methods for hyperparameter optimization and techniques used on imbalanced datasets. But if I were working on a computer with more computational resources, I would probably not reduce the number of features beforehand. Instead, I would evaluate how predictive the features are after applying the techniques for imbalanced data. This is because some methods alter the distribution of the data, so features that were not originally predictive could become so after resampling.

In any case, I will take the opportunity to discuss the feature selection methods that I use and why.

## In this notebook

* Remove non-predictive features
* Train a classifier to predict customer satisfaction
* Optimize hyperparameters
* Improve performance with different techniques for imbalanced data

## Feature selection

To begin with, I will remove duplicated and constant features. These are in essence redundant or non-predictive. Next, I will find quasi-constant features and evaluate the distribution of their values across satisfied and unsatisfied customers, to determine if I can remove them.

* Remove constant features
* Remove duplicated features
* Remove quasi-constant features

More details throughout the notebook.

## The Machine Learning Model

This is a classification problem. I want to predict if a customer is unsatisfied (1 in the target). Among the off-the-shelf algorithms, **Gradient Boosting Machines** tend to out-perform most other models on tabular data like this. So in this notebook I will train a Gradient Boosting Classifier from Scikit-learn.

Other suitable options would be a Gradient Boosting Classifier from the packages **XGBoost** or **LightGBM**. In fact, if you have the computing power, you could try both alongside the GBM from sklearn. Or simply pick the one you like the most, to keep things simple.

## Hyperparameter Optimization

There are a number of methods to select the best hyperparameters. The basic methods include grid search and random search. Those are usually suitable for models with few hyperparameters, like the GBM from sklearn. If we had a lot of hyperparameters and a model that is very costly to train, then we would be better off performing Bayesian hyperparameter optimization. But for this problem, that might be overkill and a random search should be more than enough.

The number of hyperparameters in Gradient Boosting Machines is not very big, so we should be able to find the best hyperparameters with a Randomized Search. So in this notebook, I will use this procedure. Randomized Search comes baked into sklearn, so there is no need to use alternative Python packages.

* Randomized hyperparameter search with sklearn

## Imbalanced data

There are a number of techniques that we can use to try and improve the performance of models trained on imbalanced datasets. We can under- or over-sample the dataset. Within the under-sampling techniques we have cleaning techniques that allow us to remove noisy observations instead of removing observations just at random. Within the over-sampling techniques we have methods to create new, "synthetic" data using existing observations as templates. This way, we do not just "duplicate" the data, as we would with random over-sampling.

We can also implement cost-sensitive learning, where we modify the optimization function to account for the cost of misclassification. Misclassifying an observation from the minority class tends to be more costly in real situations. And finally we have special ensemble algorithms that were designed specifically to work with imbalanced datasets.
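To make the idea of cost-sensitive learning more concrete, here is a minimal sketch of a class-weighted log loss in plain Python. The function name `weighted_log_loss` and the weight of 10 are hypothetical and chosen only for illustration. This is not how any particular library implements it.

```python
import math

def weighted_log_loss(y_true, y_prob, weight_pos=1.0, weight_neg=1.0):
    """Average log loss where each observation is weighted by its class."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, y_prob):
        w = weight_pos if y == 1 else weight_neg
        # standard log loss of a single observation
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += w * loss
        weight_sum += w
    return total / weight_sum

# toy data: 1 minority observation (class 1) and 3 majority observations
y_true = [1, 0, 0, 0]
# the model gets the majority right but misses the minority observation
y_prob = [0.1, 0.1, 0.1, 0.1]

plain = weighted_log_loss(y_true, y_prob)
costly = weighted_log_loss(y_true, y_prob, weight_pos=10.0)
print(plain, costly)
```

With the larger weight on the minority class, the same mistake produces a larger loss, so the optimizer has more incentive to fix it.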

So in summary, we could try:

* undersampling at random or based on cleaning criteria
* oversampling at random or creating synthetic new data
* introduce cost sensitive learning
* train a special algorithm for imbalanced datasets

For more details on feature selection, hyperparameter optimization or working with imbalanced datasets, visit my [online courses](https://www.trainindata.com/).

In [None]:
# Let's install Feature-engine
# this package will allow us to quickly remove 
# non-predictive variables

!pip install feature-engine

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# to sample the hyperparameter space based on distributions
from scipy import stats

# I use GBM because it usually out-performs other off-the-shelf 
# classifiers
from sklearn.ensemble import GradientBoostingClassifier

# to evaluate features
from sklearn.feature_selection import chi2

# metric to optimize for the competition
from sklearn.metrics import roc_auc_score

# to optimize the hyperparameters we import the randomized search class
from sklearn.model_selection import (
    RandomizedSearchCV,
    train_test_split,
)

# to assemble various procedures in sequence
from sklearn.pipeline import Pipeline

# some methods to work with imbalanced data are based on nearest neighbours
# and nearest neighbours are sensitive to the magnitude of the features
# so we need to scale the data
from sklearn.preprocessing import (
    MinMaxScaler,
    Binarizer,
)

# import selection classes from Feature-engine
# to reduce the number of features
from feature_engine.selection import (
    DropDuplicateFeatures,
    DropConstantFeatures,
)

# to apply sklearn transformers to a subset of features
from feature_engine.wrappers import SklearnTransformerWrapper

# over-sampling techniques for imbalanced data
from imblearn.over_sampling import SMOTENC

# under-sampling techniques for imbalanced data
from imblearn.under_sampling import (
    InstanceHardnessThreshold,
)

# special ensemble methods to work with imbalanced data
# we will use those based on boosting, which tend to work better
from imblearn.ensemble import (
    RUSBoostClassifier,
    EasyEnsembleClassifier,
)

# to put the final model together at the end of the notebook
from imblearn.pipeline import Pipeline as imb_Pipeline

## Load Data

In [None]:
# load the Santander Customer Satisfaction dataset

data = pd.read_csv('/kaggle/input/santander-customer-satisfaction/train.csv')

In [None]:
# separate dataset into train and test sets
# I split 20:80 mostly to reduce the size of the train set
# so that this notebook does not run out of memory :_(

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['ID','TARGET'], axis=1),
    data['TARGET'],
    test_size=0.8,
    random_state=0)

X_train.shape, X_test.shape

## Target

The target class is imbalanced. The value 1 refers to un-satisfied customers and 0 to satisfied. So most of Santander's customers are satisfied.

In [None]:
# check class imbalance

y_train.value_counts(normalize=True), y_train.value_counts()

We see that ~ 4% of the customers are not satisfied.

In [None]:
# check also the test set
y_test.value_counts(normalize=True)

## Drop constant and duplicated features

This dataset contains constant and duplicated features. I know this from previous analysis so I will quickly remove these features to reduce the data size.

More insight about feature selection for this dataset here:
https://www.kaggle.com/solegalli/feature-selection-with-feature-engine

In [None]:
# to remove constant and duplicated features, we use the transformers from Feature-engine

pipe = Pipeline([
    ('constant', DropConstantFeatures(tol=1)), # drops constant features
    ('duplicated', DropDuplicateFeatures()), # drops duplicates
])

# find features to remove
pipe.fit(X_train, y_train)

In [None]:
# how many constant features are there in the dataset?

len(pipe.named_steps['constant'].features_to_drop_)

In [None]:
# how many duplicated features are there in the dataset?

len(pipe.named_steps['duplicated'].features_to_drop_)

Let's go ahead and remove them from the datasets.

In [None]:
print('Number of original variables: ', X_train.shape[1])

# see how with the pipeline we can apply all transformers in sequence
# with one line of code, for each data set
X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)

print('Number of variables after selection: ', X_train.shape[1])

## Drop quasi-constant features

Let's first find out the number of quasi-constant features.

Quasi-constant features are those that show the same value for the great majority of the observations.
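As a rough illustration of this definition, here is a minimal sketch in plain Python. The helper names `dominant_share` and `is_quasi_constant` are hypothetical, and I assume that the threshold is inclusive. It is not Feature-engine's implementation.

```python
from collections import Counter

def dominant_share(values):
    """Fraction of observations taken by the most frequent value."""
    counts = Counter(values)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(values)

def is_quasi_constant(values, tol=0.98):
    """True if the most frequent value covers at least `tol` of the observations."""
    return dominant_share(values) >= tol

feature_a = [0] * 99 + [3]       # 99% of the observations are 0
feature_b = [0] * 60 + [1] * 40  # 60% of the observations are 0

print(is_quasi_constant(feature_a), is_quasi_constant(feature_b))
```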

In [None]:
# we find features with the same value in 98% of 
# the observations

sel_ = DropConstantFeatures(tol=0.98)

sel_.fit(X_train)

In [None]:
# how many quasi-constant features are there in the dataset?

len(sel_.features_to_drop_)

In [None]:
# Let's look at 1 feature

sel_.features_to_drop_[0]

In [None]:
# we see that this feature has the value 0 in more than 99% of the observations

X_train[sel_.features_to_drop_[0]].value_counts(normalize=True)

To determine if these features are useful, we will:

* replace their values by 0 for the majority value and 1 for the rare values
* compare the distribution of the transformed variables in satisfied and unsatisfied customers

To measure these distributions we use chi-squared.
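To give an intuition of what the test measures, here is a minimal sketch of the standard Pearson chi-squared test for a 2x2 table of a binary feature against the target. The function name `chi2_2x2` and the counts are made up for illustration. Note that the `chi2` function from sklearn computes its statistic in a related but different way, so its values may differ from this sketch.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value for the table [[a, b], [c, d]].

    Rows: feature value 0 / 1. Columns: satisfied (0) / unsatisfied (1).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    stat = n * (a * d - b * c) ** 2 / denom
    # survival function of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# same proportion of unsatisfied customers for both feature values
stat_same, p_same = chi2_2x2(90, 10, 90, 10)

# the rare feature value shows many more unsatisfied customers
stat_diff, p_diff = chi2_2x2(950, 30, 10, 10)

print(stat_same, p_same)
print(stat_diff, p_diff)
```

A small p-value suggests that the feature is distributed differently in satisfied and unsatisfied customers, whereas a large p-value suggests that it is not.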

In [None]:
# let's replace values greater than 0 by 1
# for this we need the Binarizer from sklearn with threshold 0

# capture quasi-constant features in a list
quasi_ = list(sel_.features_to_drop_)

# in order to modify just the quasi-constant features
# we use the sklearn wrapper from feature engine,
# which by the way also returns a dataframe

binarizer_ = SklearnTransformerWrapper(
    transformer = Binarizer(threshold=0),
    variables = quasi_,
)

binarizer_.fit(X_train)

X_train = binarizer_.transform(X_train)
X_test = binarizer_.transform(X_test)

In [None]:
# Now, if we re-evaluate the quasi-constant feature, we should see only 2 values

X_train[quasi_[0]].value_counts(normalize=True)

In [None]:
# Now let's evaluate the distribution of the features

# Compute chi-squared stats between each non-negative feature and class.

chi_ = chi2(X_train[quasi_], y_train)

In [None]:
# join feature names and p-values in a dataframe

feat = pd.concat([
    pd.Series(quasi_),
    pd.Series(chi_[1]),
], axis=1,
)

feat.columns = ['feature', 'p_value']

feat.head()

In [None]:
feat['p_value'].hist(bins=30)
plt.ylabel('Number of features')
plt.xlabel('p value')

We have a few features that seem to be differently distributed between satisfied and unsatisfied customers. Let's keep those and remove the rest.

I will keep only the features with a p-value smaller than 0.4 and remove the rest, to reduce the data size as much as possible. Remember that I do this to keep the data small, so that this notebook does not run out of memory.

In [None]:
print('Number of total quasi-constant features: ', len(feat))

# a large p-value suggests that the feature is distributed similarly
# in satisfied and unsatisfied customers, so we flag it for removal
feat = feat[feat['p_value'] >= 0.4]

print('Number of non predictive quasi-constant features: ', len(feat))

In [None]:
# let's drop the features

X_train.drop(labels=feat['feature'], axis=1, inplace=True)
X_test.drop(labels=feat['feature'], axis=1, inplace=True)

X_train.shape

Now, we reduced the dataset from 369 features to 231. Let's hope that that helps speed things up!

## Variable exploration

From previous analysis we know that this data set does not contain missing values and that all variables are numerical.

We also know from previous analysis that most variables in this dataset are binary and discrete, with very few continuous variables. In fact, a few variables that had more than 2 values are now binary after we applied the Binarizer.

In [None]:
# Let's find out how many variables we have with 2, or less than 10 or 20 distinct values

for max_unique in [2, 10, 20]:
    vars_ = [x for x in X_train.columns if X_train[x].nunique()<= max_unique]
    vars_ = len(vars_)
    print(f'{vars_} variables with less than or equal to {max_unique} values')

We see that we have 106 binary variables, and a few more that are also discrete.

Why is this important?

* Some under- and over-sampling methods for imbalanced datasets are based on nearest neighbours
* Nearest neighbours depend on distance metrics
* in theory, distance metrics for continuous variables are not appropriate for discrete variables and vice-versa.

So, this will guide how I select which under- and over-sampling methods I can apply on my data, as I will discuss later.
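The point about distance metrics can be seen with a small sketch in plain Python. The observations below are made up, and the codes 0, 1 and 2 are assumed to be categories without an order.

```python
import math

def euclidean(a, b):
    """Euclidean distance, which treats the values as magnitudes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Fraction of positions where the two observations differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

obs_1 = [0, 1, 2]
obs_2 = [2, 1, 0]
obs_3 = [1, 1, 1]

# obs_2 and obs_3 both differ from obs_1 in 2 out of 3 variables
print(hamming(obs_1, obs_2), hamming(obs_1, obs_3))

# but the Euclidean distance ranks obs_3 as closer to obs_1,
# only because of how the categories happen to be encoded
print(euclidean(obs_1, obs_2), euclidean(obs_1, obs_3))
```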

For now, let's train a gradient boosting machine with these variables to determine the benchmark performance.

## Train Gradient Boosting Model

For classification on tabular data, Gradient Boosting Machines tend to out-perform most other off-the-shelf models, so we will implement that model directly.

We will do so with cross validation and hyperparameter search.

Random search of hyperparameters is usually more efficient than Grid Search for the same number of model fits, so we will implement that straightaway.
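A quick back-of-the-envelope calculation shows why a modest number of random draws is often enough. The sketch below assumes that each draw is independent and has a 5% chance of landing among the best 5% of the configurations. The function name `prob_top_fraction` is hypothetical.

```python
def prob_top_fraction(n_iter, top_fraction=0.05):
    """Probability that at least one of n_iter independent random draws
    lands in the best `top_fraction` of the search space."""
    return 1 - (1 - top_fraction) ** n_iter

for n in (5, 20, 60):
    print(n, round(prob_top_fraction(n), 3))
```

Under these assumptions, 60 draws give roughly a 95% chance of sampling at least one configuration from the top 5%, whereas 5 draws give a much lower chance.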

### Hyperparameter optimization

For hyperparameter optimization we need to define:

- the machine learning model to train
- the hyperparameter space (the hyperparameter distributions to sample from)
- the search algorithm
- the metric to optimize

Let's do that...

In [None]:
# set up the gradient boosting classifier with default parameters
gbm = GradientBoostingClassifier(random_state=0)

# determine the hyperparameter space
# we use stats to sample from distributions

param_grid = dict(
    n_estimators=stats.randint(10, 200),
    min_samples_split=stats.uniform(0, 1),
    max_depth=stats.randint(1, 5),
    # note: scikit-learn 1.1 renamed 'deviance' to 'log_loss'
    loss=('deviance', 'exponential'),
    )

# set up the search
search = RandomizedSearchCV(
    gbm, # the model
    param_grid, # hyperparam space
    scoring='roc_auc', # metric to optimize
    cv=2, # I do 2 to speed things up, 5 would be better as the dataset is quite small
    n_iter = 5, # I do 5 to speed things up; about 60 iterations give a ~95% chance of sampling a configuration in the top 5% of the search space
    random_state=5, # reproducibility
    refit=True, # this fits the model with the best hyperparams to the entire training set after the hyperparam search
)

# find best hyperparameters
search.fit(X_train, y_train)

In [None]:
# the best hyperparameters are stored in an attribute:

search.best_params_

In [None]:
# Now let's get the benchmark performance on train and test

X_train_preds = search.predict_proba(X_train)[:,1]
X_test_preds = search.predict_proba(X_test)[:,1]

print('Train roc_auc: ', roc_auc_score(y_train, X_train_preds))
print('Test roc_auc: ', roc_auc_score(y_test, X_test_preds))

# Methods for Imbalanced data

## Under-sampling - Instance Hardness Threshold

Among the under-sampling methods, we can perform random under-sampling, where we extract samples from the majority class at random. Normally, we extract as many samples as there are in the minority class.
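Random under-sampling can be sketched in a few lines of plain Python. The function name `random_under_sample` and the toy data are hypothetical. In practice, a library class such as the random under-sampler from imbalanced-learn would do this for us.

```python
import random

def random_under_sample(X, y, majority_label=0, seed=0):
    """Keep all minority observations and a random subset of the
    majority class of the same size."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    minority_idx = [i for i, label in enumerate(y) if label != majority_label]
    # draw, without replacement, as many majority observations
    # as there are minority observations
    kept_majority = rng.sample(majority_idx, k=len(minority_idx))
    keep = sorted(kept_majority + minority_idx)
    return [X[i] for i in keep], [y[i] for i in keep]

# toy data with 4% of observations in the minority class
X = [[i] for i in range(100)]
y = [1 if i < 4 else 0 for i in range(100)]

X_res, y_res = random_under_sample(X, y)
print(len(y_res), sum(y_res))
```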

Then we have cleaning methods, but all of them depend on nearest neighbours, so I would argue that they are not suitable given that we have a mix of discrete and continuous variables.

We can use the InstanceHardnessThreshold, which will remove observations from the majority class that are hard to classify correctly.

Instance hardness is a measure of how difficult an observation is to classify correctly, and it is inversely related to the probability that the model assigns to the observation's true class.
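Here is a simplified sketch of the idea in plain Python. I assume that we already have the predicted probabilities of class 1 for each observation, and the function names are hypothetical. The implementation in imbalanced-learn differs in the details, for example in how it sets the threshold.

```python
def instance_hardness(y_true, prob_class_1):
    """Hardness = 1 - probability that the model assigns to the true class."""
    return [1 - (p if y == 1 else 1 - p) for y, p in zip(y_true, prob_class_1)]

def keep_easiest_majority(y_true, prob_class_1, n_keep, majority_label=0):
    """Keep all minority observations and the n_keep majority
    observations with the lowest hardness."""
    hardness = instance_hardness(y_true, prob_class_1)
    majority = [i for i, y in enumerate(y_true) if y == majority_label]
    minority = [i for i, y in enumerate(y_true) if y != majority_label]
    easiest = sorted(majority, key=lambda i: hardness[i])[:n_keep]
    return sorted(easiest + minority)

y_true = [0, 0, 0, 0, 1, 1]
probs = [0.05, 0.60, 0.20, 0.90, 0.80, 0.30]  # predicted probability of class 1

kept = keep_easiest_majority(y_true, probs, n_keep=2)
print(kept)
```

The majority observations with probabilities 0.60 and 0.90 of belonging to class 1 are the hardest ones, so they are removed.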

So to keep things simple, let's just implement the instance hardness threshold to under-sample our data.

In [None]:
# set up instance hardness threshold
# the instance hardness is determined from the cross-validated
# predictions of a gradient boosting machine

iht = InstanceHardnessThreshold(
    estimator=gbm, # we pass the model we set up earlier
    sampling_strategy='auto',  # undersamples only the majority class
    random_state=1,
    cv=2,  # cross validation fold, 2 to speed things up.
)

# resample
X_resampled, y_resampled = iht.fit_resample(X_train, y_train)

# shape of original data and data after resampling
X_train.shape, X_resampled.shape

In [None]:
# check the resampled target
y_resampled.value_counts(normalize=True)

In [None]:
# train model while finding best hyperparameters

search.fit(X_resampled, y_resampled)

In [None]:
# the best hyperparameters are stored in an attribute:

search.best_params_

In [None]:
# Now let's get the performance on train and test

X_train_preds = search.predict_proba(X_resampled)[:,1]
X_test_preds = search.predict_proba(X_test)[:,1]

print('Train roc_auc: ', roc_auc_score(y_resampled, X_train_preds))
print('Test roc_auc: ', roc_auc_score(y_test, X_test_preds))

The instance hardness seems to cause over-fitting. So it is not making things better in this dataset.

## Over-Sampling - SMOTENC

Among the over-sampling methods we have random over-sampling, which bootstraps observations from the minority class to increase their number. This technique in essence duplicates data, so it sometimes leads to over-fitting.

Instead, we can create new data based on SMOTE or its variants. There is 1 variant that is suitable for datasets with continuous and discrete variables, which is SMOTE-NC, so we will implement that method.
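The core of SMOTE for continuous variables is an interpolation between an observation and one of its nearest neighbours from the minority class. Here is a minimal sketch, where I assume that the neighbour has already been found. The function name `smote_like_sample` is hypothetical.

```python
import random

def smote_like_sample(x, neighbour, rng):
    """Create a synthetic observation on the segment between x and a neighbour."""
    u = rng.random()  # random position along the segment, in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbour)]

rng = random.Random(0)

x = [1.0, 10.0]          # a minority class observation
neighbour = [3.0, 20.0]  # one of its nearest minority neighbours

synthetic = smote_like_sample(x, neighbour, rng)
print(synthetic)
```

Interpolated values make little sense for discrete variables. As far as I understand it, this is why SMOTE-NC handles them differently, by assigning the most frequent category among the nearest neighbours.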

In [None]:
# we need to capture the index of the discrete variables

# make list of discrete variables
cat_vars = [var for var in X_train.columns if X_train[var].nunique() <= 10]

# capture the position of each discrete variable in the dataframe columns
cat_vars_index = [X_train.columns.get_loc(x) for x in cat_vars]

cat_vars_index[0:6]

In [None]:
smnc = SMOTENC(
    sampling_strategy='auto', # samples only the minority class
    random_state=0,  # for reproducibility
    k_neighbors=3,
    categorical_features=cat_vars_index # indices of the columns of categorical variables
)  

# because SMOTE uses KNN, and KNN is sensitive to variable magnitude, we re-scale the data
X_resampled, y_resampled = smnc.fit_resample(MinMaxScaler().fit_transform(X_train), y_train)

X_train.shape, X_resampled.shape

In [None]:
# check the distribution of the resampled target

y_resampled.value_counts(normalize=True)

In [None]:
# train the model while finding best hyperparameters

search.fit(X_resampled, y_resampled)

In [None]:
# the best hyperparameters are stored in an attribute:

search.best_params_

In [None]:
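# Now let's get the performance on train and test

# the model was trained on scaled data, so we scale the test set as well,
# with a scaler fitted on the train set
X_train_preds = search.predict_proba(X_resampled)[:,1]
X_test_preds = search.predict_proba(MinMaxScaler().fit(X_train).transform(X_test))[:,1]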

print('Train roc_auc: ', roc_auc_score(y_resampled, X_train_preds))
print('Test roc_auc: ', roc_auc_score(y_test, X_test_preds))

SMOTENC does not seem to improve model performance either. On the contrary.

## Ensemble methods for imbalanced data

We will implement RUSBoost and EasyEnsemble.

In [None]:
# set up the RUSBoost ensemble model
rusboost = RUSBoostClassifier(
        base_estimator=None,
        n_estimators=20,
        learning_rate=1.0,
        sampling_strategy='auto',
        random_state=2909,
    )

# set up the hyperparameter space
# the default implementation has 2 hyperparameters to optimize

param_grid = dict(
    n_estimators=stats.randint(10, 200),
    learning_rate=stats.uniform(0.0001, 1),
    )

# set up the search
search = RandomizedSearchCV(
    rusboost, # the model
    param_grid, # hyperparam space
    scoring='roc_auc', # metric to optimize
    cv=2, # I do 2 to speed things up, 5 would be better as the dataset is quite small
    n_iter = 5, # I do 5 to speed things up; about 60 iterations give a ~95% chance of sampling a configuration in the top 5% of the search space
    random_state=10, # reproducibility
    refit=True, # this fits the model with the best hyperparams to the entire training set after the hyperparam search
)

# find best hyperparameters
# using the original data (without resampling)
search.fit(X_train, y_train)

In [None]:
# the best hyperparameters are stored in an attribute:

search.best_params_

In [None]:
# Now let's get the performance on train and test

X_train_preds = search.predict_proba(X_train)[:,1]
X_test_preds = search.predict_proba(X_test)[:,1]

print('Train roc_auc: ', roc_auc_score(y_train, X_train_preds))
print('Test roc_auc: ', roc_auc_score(y_test, X_test_preds))

In [None]:
easy = EasyEnsembleClassifier(
        n_estimators=20,
        sampling_strategy='auto',
        random_state=2909,
    )

# set up the hyperparameter space
# the default implementation has 1 hyperparameter to optimize

param_grid = dict(
    n_estimators=stats.randint(10, 200),
    )

# set up the search
search = RandomizedSearchCV(
    easy, # the model
    param_grid, # hyperparam space
    scoring='roc_auc', # metric to optimize
    cv=2, # I do 2 to speed things up, 5 would be better as the dataset is quite small
    n_iter = 5, # I do 5 to speed things up; about 60 iterations give a ~95% chance of sampling a configuration in the top 5% of the search space
    random_state=10, # reproducibility
    refit=True, # this fits the model with the best hyperparams to the entire training set after the hyperparam search
)

# find best hyperparameters
# using the original data (without resampling)
search.fit(X_train, y_train)

In [None]:
# the best hyperparameters are stored in an attribute:

search.best_params_

In [None]:
# Now let's get the performance on train and test

X_train_preds = search.predict_proba(X_train)[:,1]
X_test_preds = search.predict_proba(X_test)[:,1]

print('Train roc_auc: ', roc_auc_score(y_train, X_train_preds))
print('Test roc_auc: ', roc_auc_score(y_test, X_test_preds))

## Cost-sensitive learning

To finish, we will implement cost-sensitive learning. That is, we will increase the misclassification cost of the minority class. With the sklearn Gradient Boosting Classifier, we can do this by passing observation weights when we fit the model, as follows:

In [None]:
# set up the gradient boosting classifier with default parameters
gbm = GradientBoostingClassifier(random_state=0)

# determine the hyperparameter space
# we use stats to sample from distributions

param_grid = dict(
    n_estimators=stats.randint(10, 200),
    min_samples_split=stats.uniform(0, 1),
    max_depth=stats.randint(1, 5),
    # note: scikit-learn 1.1 renamed 'deviance' to 'log_loss'
    loss=('deviance', 'exponential'),
    )

# set up the search
search = RandomizedSearchCV(
    gbm, # the model
    param_grid, # hyperparam space
    scoring='roc_auc', # metric to optimize
    cv=2, # I do 2 to speed things up, 5 would be better as the dataset is quite small
    n_iter = 5, # I do 5 to speed things up; about 60 iterations give a ~95% chance of sampling a configuration in the top 5% of the search space
    random_state=10, # reproducibility
    refit=True, # this fits the model with the best hyperparams to the entire training set after the hyperparam search
)

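# we have an imbalance of roughly 95 to 5, so we use those as weights
sample_weight = np.where(y_train==1, 95, 5)

# find best hyperparameters
# sample_weight must be passed as a keyword argument, so that it reaches
# the fit method of the model
search.fit(X_train, y_train, sample_weight=sample_weight)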

In [None]:
# the best hyperparameters are stored in an attribute:

search.best_params_

In [None]:
# Now let's get the performance on train and test

X_train_preds = search.predict_proba(X_train)[:,1]
X_test_preds = search.predict_proba(X_test)[:,1]

print('Train roc_auc: ', roc_auc_score(y_train, X_train_preds))
print('Test roc_auc: ', roc_auc_score(y_test, X_test_preds))

Of all the methods tested, the GBM trained on the entire dataset and the GBM trained with cost-sensitive learning seem to return the best performing models.