# Resampling
---
As we observed during the feature exploration, our target feature is highly imbalanced (less than 4% of customers are unsatisfied). This can be problematic because a model can achieve good accuracy just by predicting the majority class (i.e. a model that predicts that all customers are satisfied). We also need to take this into account during the train_test_split phase: ideally, both classes should have approximately the same representation in the train and test datasets.
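To make the two points above concrete, here is a minimal sketch on synthetic data (not the real customer dataset). It assumes that "the same representation in train and test" refers to a stratified split, which scikit-learn's `train_test_split` supports through its `stratify` argument. The array sizes, the 4% ratio and the placeholder features are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 customers, 4% unsatisfied (label 1).
rng = np.random.default_rng(0)
y = np.zeros(10_000, dtype=int)
y[:400] = 1
rng.shuffle(y)
X = rng.normal(size=(10_000, 5))  # placeholder features, unrelated to the labels

# A "model" that always predicts the majority class (0 = satisfied).
always_satisfied = np.zeros_like(y)
baseline_accuracy = (always_satisfied == y).mean()
print(f"Accuracy of always predicting 'satisfied': {baseline_accuracy:.2%}")

# stratify=y keeps the class proportions (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(f"Unsatisfied share in train: {y_train.mean():.2%}")
print(f"Unsatisfied share in test:  {y_test.mean():.2%}")
```

The baseline reaches high accuracy without detecting a single unsatisfied customer, which is why accuracy alone is a poor guide on data like ours.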

In this kernel, we're going to resample the data in order to reduce the class imbalance. Later on, we'll use this resampled dataset to build our models and compare how they perform when trained on the original dataset, on the principal components and on this resampled dataset.

# Imports

In [None]:
import numpy as np
import pandas as pd

# load the cleaned and standardized training data
train = pd.read_csv("../input/feature-exploration-and-dataset-preparation/train_clean_standarized.csv", index_col=0)

In [None]:
# let's see how imbalanced our TARGET feature is
print(train.TARGET.value_counts())
ax = train.TARGET.value_counts().plot(kind='bar', title='Customer satisfaction')
ax.set_xticklabels(['satisfied', 'unsatisfied']);

# 1. Resampling
One of the most common techniques for dealing with imbalanced datasets is resampling:

1. Undersampling: removing samples from the majority class
2. Oversampling: adding samples to the minority class

In the first case, we could just randomly remove data points associated with "satisfied" customers from our dataset to match the number of unsatisfied customers. However, we would then be dropping most of the data, which may contain important information for training our models. If we did just this, we would end up with around 6000 data points, which may not be enough, especially taking into account the number of features we have.

On the other hand, we could add samples to the minority class. The simplest technique is to duplicate records from the minority class, but this could cause overfitting.
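Both random strategies described above can be sketched in a few lines of pandas. The toy DataFrame below (its `feature` column and its 95/5 class split) is a made-up stand-in for our data, used only to show the mechanics.

```python
import pandas as pd

# Toy data: 95 satisfied (TARGET == 0) and 5 unsatisfied (TARGET == 1) customers.
toy = pd.DataFrame({
    "feature": range(100),
    "TARGET": [0] * 95 + [1] * 5,
})
majority = toy[toy.TARGET == 0]
minority = toy[toy.TARGET == 1]

# Random undersampling: keep only as many majority rows as there are minority rows.
under = pd.concat([majority.sample(len(minority), random_state=0), minority])

# Random oversampling: draw minority rows with replacement until the classes match.
over = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=0)])

print(under.TARGET.value_counts().to_dict())
print(over.TARGET.value_counts().to_dict())
print("distinct minority rows after oversampling:",
      over[over.TARGET == 1].feature.nunique())
```

The last line shows the downside of oversampling by duplication: the minority class grows in size, but it still contains only the original handful of distinct records.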

Other than random undersampling / oversampling, there are more advanced techniques that can be used:

1. Undersampling:
  - Tomek links (available in the imbalanced-learn module, imblearn): Tomek links are pairs of instances that are very close to each other but belong to different classes. Removing the majority-class instance of each pair makes the classes easier to separate.
  - Cluster centroids: This technique generates centroids based on clustering methods and replaces groups of majority-class data points with them, reducing the data while preserving most of the information.
2. Oversampling:
  - SMOTE: This technique uses the k-nearest neighbors of minority-class instances to create new synthetic data points.
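To give an idea of how SMOTE creates new data points, here is a simplified, hand-rolled sketch of its core interpolation step in NumPy. It is not the imbalanced-learn implementation: the function name `smote_like`, its parameters and the five toy points are all hypothetical, chosen only for illustration.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random minority sample and one of its k neighbours for each new point.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class with two features (requires k < number of samples).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like(X_min, n_new=10)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two existing minority samples, the result depends entirely on which points count as "nearest", i.e. on the distance measure.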

However, these techniques rely on distances between data points (nearest neighbors or clustering) and [don't work well with high-dimensional data](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-106), as in our case. We would first need to reduce the number of features, e.g. with PCA, and we've already seen during our PCA analysis that we need a fair number of components.
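One commonly cited reason why distance-based methods struggle in high dimensions is that distances tend to "concentrate": the nearest and the farthest neighbour of a point end up almost equally far away. The sketch below illustrates this on uniform random data. It is only an illustration under that assumption, not a reproduction of the analysis in the linked paper, and the helper `nearest_to_farthest_ratio` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_features, n_points=500):
    """Ratio between the nearest and the farthest distance from one query point."""
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

# The closer the ratio is to 1, the less informative "nearest neighbour" becomes.
for n_features in (2, 20, 200):
    print(n_features, round(nearest_to_farthest_ratio(n_features), 3))
```

As the number of features grows, the ratio moves towards 1, so the neighbours that Tomek links or SMOTE would select become harder to tell apart from any other point.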

That's why we're just going to reduce the number of data points by randomly removing instances from the majority class (for now we're going to keep 20,000 records):

In [None]:
# split the data between satisfied and unsatisfied customers
train_satisfied = train[train.TARGET == 0]
train_unsatisfied = train[train.TARGET == 1]

# undersample the majority class to 20000 instances
# (random_state is an arbitrary fixed seed, added so the sample is reproducible)
train_satisfied_under = train_satisfied.sample(20000, random_state=42)

# combine the two classes
train_resampled = pd.concat([train_satisfied_under, train_unsatisfied])

In [None]:
# let's see how imbalanced our TARGET feature is in the resampled dataset
print(train_resampled.TARGET.value_counts())
ax = train_resampled.TARGET.value_counts().plot(kind='bar', title='Customer satisfaction')
ax.set_xticklabels(['satisfied', 'unsatisfied']);

# Output

In [None]:
train_resampled.to_csv('train_resampled.csv')