# Demo of some oversampling issues
The purpose of this notebook is to document some lessons I learned during the Starbucks project in this repository.

1. I ran into the problem that oversampling transforms the data into a NumPy array, so all column labels are lost.
2. I fixed that with a function that restores the DataFrame properties.
3. Then I ran into the issue that oversampling with ADASYN corrupts the categorical values.
4. I fixed that by switching to SMOTENC.

__Key learnings:__
- For data with categorical features use SMOTENC. ADASYN treats every column as continuous and produces values that are not valid categories.
- Sampling classes cannot be used as steps in an sklearn `Pipeline`; use imblearn's own `Pipeline` object instead. (This problem is not demonstrated in this notebook, but it is discussed [here](https://stackoverflow.com/questions/50245684/using-smote-with-gridsearchcv-in-scikit-learn).)

In [2]:
# load in packages

import numpy as np
import pandas as pd
import cleaning_functions as cleaning

from imblearn.over_sampling import ADASYN, SMOTENC

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [3]:
# load the data
data = pd.read_csv('data/training.csv')

### EDA

In [4]:
data.sample(5)

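| | ID | Promotion | purchase | V1 | V2 | V3 | V4 | V5 | V6 | V7 |
|---|---|---|---|---|---|---|---|---|---|---|
| 25718 | 38218 | No | 0 | 2 | 34.128535 | -0.12615 | 2 | 2 | 1 | 1 |
| 43129 | 64128 | Yes | 0 | 0 | 34.67843 | 0.220161 | 1 | 3 | 4 | 2 |
| 76657 | 114418 | Yes | 0 | 2 | 25.226853 | 0.393317 | 2 | 2 | 1 | 2 |
| 83349 | 124369 | No | 1 | 1 | 25.218409 | -0.559039 | 2 | 3 | 1 | 1 |
| 27470 | 40808 | No | 0 | 0 | 28.664591 | 0.999361 | 2 | 1 | 1 | 2 |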


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84534 entries, 0 to 84533
Data columns (total 10 columns):
ID           84534 non-null int64
Promotion    84534 non-null object
purchase     84534 non-null int64
V1           84534 non-null int64
V2           84534 non-null float64
V3           84534 non-null float64
V4           84534 non-null int64
V5           84534 non-null int64
V6           84534 non-null int64
V7           84534 non-null int64
dtypes: float64(2), int64(7), object(1)
memory usage: 6.4+ MB


In [6]:
# change datatypes, drop ID column

def wrangle_1_columns(df):
    df['Promotion'] = df['Promotion'].map({'Yes':1, 'No':0})
    colsToCat = ["Promotion", "V1", "V4", "V5", "V6", "V7"]
    df = cleaning.change_dtypes(df, cols_to_category=colsToCat)
    df.drop('ID', axis=1, inplace=True)


In [8]:
# call function and check results

wrangle_1_columns(data)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84534 entries, 0 to 84533
Data columns (total 9 columns):
Promotion    84534 non-null category
purchase     84534 non-null int64
V1           84534 non-null category
V2           84534 non-null float64
V3           84534 non-null float64
V4           84534 non-null category
V5           84534 non-null category
V6           84534 non-null category
V7           84534 non-null category
dtypes: category(6), float64(2), int64(1)
memory usage: 2.4 MB


In [9]:
# check distribution of target variable 'purchase'

display(data['purchase'].value_counts())
print("proportion of purchases (%)", round(data['purchase'].value_counts()[1] / len(data) * 100, 2))

0    83494
1     1040
Name: purchase, dtype: int64

proportion of purchases (%) 1.23


### Oversample with ADASYN

In [10]:
# separate target variable from features

def create_Xy(df):
    """Separate target variable from features."""

    X = df.copy()
    y = X['purchase']
    X = X.drop(['purchase', 'Promotion'], axis=1)  # Promotion is not a valid input feature
    
    return X, y

In [11]:
# call function
X, y = create_Xy(data)

#### Difficulties with oversampling

Problem 1: Calling a sampling function like ADASYN transforms the X DataFrame into a NumPy array. The initial column labels and datatypes get lost in the process:

In [12]:
sm = ADASYN()
X, y = sm.fit_resample(X, y)  # fit_sample is the older name of this method

In [45]:
# check results for y - ok

print(y.shape)
unique, counts = np.unique(y, return_counts=True)
print(np.asarray((unique, counts)).T)

(166646,)
[[    0 83494]
 [    1 83152]]


In [16]:
# check results for X 

X = pd.DataFrame(X)
display(X.sample(5))
display(X.info())

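| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 58314 | 1.0 | 22.491507 | -0.732194 | 2.0 | 3.0 | 1.0 | 2.0 |
| 119157 | 0.993736 | 21.129226 | 0.307281 | 2.0 | 3.0 | 3.0 | 2.0 |
| 141532 | 1.0 | 30.311283 | -1.477767 | 2.0 | 2.0 | 4.0 | 2.0 |
| 80 | 1.0 | 29.264723 | 1.43225 | 2.0 | 1.0 | 4.0 | 2.0 |
| 718 | 1.0 | 36.519262 | -1.165083 | 2.0 | 2.0 | 3.0 | 2.0 |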


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 166646 entries, 0 to 166645
Data columns (total 7 columns):
0    166646 non-null float64
1    166646 non-null float64
2    166646 non-null float64
3    166646 non-null float64
4    166646 non-null float64
5    166646 non-null float64
6    166646 non-null float64
dtypes: float64(7)
memory usage: 8.9 MB


None

#### Solution to problem 1: Define function to get DataFrame of X with initial properties

In [23]:
X, y = create_Xy(data)  # reset data


# define oversampling function

def oversample_ADASYN(X, y):
    """Oversample the minority class with ADASYN from imbalanced-learn."""
    
    columns = X.columns  # remember the labels before they get lost
    
    sm = ADASYN()
    X, y = sm.fit_resample(X, y)  # fit_sample is the older name of this method
    
    # restore DataFrame format, column names and dtypes of X
    X = pd.DataFrame(X, columns=columns)
    colsToCat = ["V1", "V4", "V5", "V6", "V7"]
    X = cleaning.change_dtypes(X, cols_to_category=colsToCat)
    
    return X, y

In [24]:
X, y = oversample_ADASYN(X, y)
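
The restore step can also be written generically, so that the column names and dtypes come from the original frame instead of being listed by hand. The following is a minimal sketch using only pandas and NumPy; `restore_frame` is a hypothetical helper and not part of `cleaning_functions`, and the small demo frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def restore_frame(array, template):
    """Rebuild a DataFrame from a plain array, reusing names and dtypes of `template`."""
    restored = pd.DataFrame(array, columns=template.columns)
    for col, dtype in template.dtypes.items():
        # for categorical columns let pandas infer the categories from the values present
        target = "category" if isinstance(dtype, pd.CategoricalDtype) else dtype
        restored[col] = restored[col].astype(target)
    return restored


# made-up demo frame with one categorical and one numeric column
original = pd.DataFrame({
    "V1": pd.Categorical([0, 1, 2, 1]),
    "V2": [30.1, 25.2, 28.7, 34.6],
})

# simulate what the sampler hands back: a float array without labels
as_array = np.column_stack([original["V1"].astype(float), original["V2"]])

restored = restore_frame(as_array, original)
print(restored.dtypes)
```

Note that this only restores the labels and dtypes; it does not repair values that the sampler has changed.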

In [25]:
# check results for X 

X = pd.DataFrame(X)
display(X.sample(5))
display(X.info())

| | V1 | V2 | V3 | V4 | V5 | V6 | V7 |
|---|---|---|---|---|---|---|---|
| 125612 | 0.418366 | 40.514869 | -0.996347 | 1.418366 | 3.0 | 4.0 | 1.0 |
| 106479 | 1.0 | 24.865598 | -1.176181 | 2.0 | 3.0 | 1.709394 | 2.0 |
| 86282 | 2.676402 | 36.457274 | -0.765268 | 2.0 | 3.676402 | 4.0 | 2.0 |
| 103028 | 2.0 | 28.066158 | 1.418252 | 2.0 | 2.161689 | 4.0 | 2.0 |
| 30044 | 2.0 | 41.166364 | 1.259095 | 1.0 | 2.0 | 1.0 | 2.0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 166646 entries, 0 to 166645
Data columns (total 7 columns):
V1    166646 non-null category
V2    166646 non-null float64
V3    166646 non-null float64
V4    166646 non-null category
V5    166646 non-null category
V6    166646 non-null category
V7    166646 non-null category
dtypes: category(5), float64(2)
memory usage: 8.3 MB


None

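**Remaining problem:** the categorical columns are typed as `category` again, but their values are no longer real categories. ADASYN has produced non-integer values such as 0.418366 in V1 or 1.709394 in V6.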
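
The reason is that SMOTE-style oversamplers such as ADASYN create each synthetic sample by linear interpolation between a minority sample and one of its neighbours, `x_new = x_i + lam * (x_nn - x_i)` with a random `lam` between 0 and 1. Applied to category codes this yields fractional values. A small sketch with made-up sample values (the two points and `lam` are assumptions chosen for illustration):

```python
import numpy as np

x_i = np.array([0.0, 1.0, 3.0])   # hypothetical minority sample (V1, V4, V5)
x_nn = np.array([1.0, 2.0, 3.0])  # hypothetical neighbour of that sample
lam = 0.418366                    # random interpolation weight in [0, 1]

# the same weight is applied to every feature, categorical or not
x_new = x_i + lam * (x_nn - x_i)
print(x_new)  # → [0.418366 1.418366 3.      ]
```

Features on which both points agree stay valid, while every feature on which they differ ends up between two categories.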

### Oversample with SMOTENC
[Documentation here](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTENC.html)

In [27]:
X, y = create_Xy(data)  # reset data


# define oversampling function

def oversample_SMOTENC(X, y):
    """Oversample the minority class with SMOTENC from imbalanced-learn."""
    
    columns = X.columns  # remember the labels before they get lost
    
    sm = SMOTENC(categorical_features=[0,3,4,5,6])  # positional indices of the categorical columns
    X, y = sm.fit_resample(X, y)  # fit_sample is the older name of this method
    
    # restore DataFrame format, column names and dtypes of X
    X = pd.DataFrame(X, columns=columns)
    colsToCat = ["V1", "V4", "V5", "V6", "V7"]
    X = cleaning.change_dtypes(X, cols_to_category=colsToCat)
    
    return X, y
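
The index list passed to `categorical_features` has to match the column order of X. Instead of typing it by hand, it can be derived from the dtypes. A minimal sketch, assuming the categorical columns already have the pandas `category` dtype; `X_demo` is a made-up frame for illustration:

```python
import pandas as pd

X_demo = pd.DataFrame({
    "V1": pd.Categorical([0, 1, 2]),
    "V2": [30.1, 25.2, 28.7],
    "V3": [-0.1, 0.2, 0.4],
    "V4": pd.Categorical([1, 2, 2]),
})

# positional indices of all columns with category dtype
cat_idx = [i for i, dtype in enumerate(X_demo.dtypes)
           if isinstance(dtype, pd.CategoricalDtype)]
print(cat_idx)  # → [0, 3]
```

The resulting list could then be passed as `SMOTENC(categorical_features=cat_idx)`.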

In [28]:
# check results for X 

X = pd.DataFrame(X)
display(X.head(5))
display(X.info())

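| | V1 | V2 | V3 | V4 | V5 | V6 | V7 |
|---|---|---|---|---|---|---|---|
| 31343 | 2 | 31.957064 | 0.220161 | 2 | 1 | 2 | 2 |
| 4169 | 1 | 29.277509 | -0.039572 | 2 | 1 | 3 | 2 |
| 80577 | 1 | 36.455552 | -1.165083 | 1 | 3 | 2 | 2 |
| 64144 | 2 | 18.798151 | 0.306739 | 2 | 3 | 1 | 2 |
| 27433 | 1 | 35.359499 | -0.559039 | 2 | 3 | 1 | 2 |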


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84534 entries, 0 to 84533
Data columns (total 7 columns):
V1    84534 non-null category
V2    84534 non-null float64
V3    84534 non-null float64
V4    84534 non-null category
V5    84534 non-null category
V6    84534 non-null category
V7    84534 non-null category
dtypes: category(5), float64(2)
memory usage: 1.7 MB


None

---