# Data Exploration

In [None]:
import pandas as pd


train = pd.read_csv('../input/santander-customer-satisfaction/train.csv',index_col='ID')
test = pd.read_csv('../input/santander-customer-satisfaction/test.csv',index_col='ID')
train.head()

With the dataset loaded, let's look at its shape.

In [None]:
train.shape

In [None]:
train.describe()

In [None]:
train.columns

In [None]:
train.dtypes

In [None]:
train.dtypes.unique()

We have no categorical variables. Now let's check for null values.

In [None]:
train.isnull().sum()

In [None]:
train.isnull().sum().unique()

We can confirm that there are no null values or categorical variables, so the data needs no cleaning.
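
The two checks above can be folded into one small helper. This is a minimal sketch on a made-up toy frame; `summarize_cleanliness` and `toy` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

def summarize_cleanliness(df):
    """Return the total number of nulls and the names of non-numeric columns."""
    n_nulls = int(df.isnull().sum().sum())
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    return n_nulls, non_numeric

# Hypothetical toy frame with one missing value and one string column.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, None, 1.5], "c": ["x", "y", "z"]})
print(summarize_cleanliness(toy))
```

On a frame that needs no cleaning, the helper should return `(0, [])`.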

# Vanilla Model - No Data Engineering

Splitting the target variable and inputs

In [None]:
y = train.TARGET
y.head()

In [None]:
x = train.drop(['TARGET'],axis=1) 
x.head()

Let's first attempt to build a plain vanilla model with no feature engineering whatsoever.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, Y_train, Y_val = train_test_split(x,y,train_size=0.65,test_size=0.35,random_state=0)

### Vanilla model - Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier


model = DecisionTreeClassifier(random_state=1)
model.fit(X_train,Y_train)


In [None]:
preds = model.predict(X_val)

In [None]:
from sklearn import metrics

# classification_report expects (y_true, y_pred) in that order
print(metrics.classification_report(Y_val,preds))

This is a pretty horrible model. Although one might think that 93% accuracy is pretty good, we have to consider the precision and recall scores for both classes, which are truly poor.
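
To see why accuracy alone can mislead on imbalanced data, here is a small sketch with made-up labels (a hypothetical 96/4 split, not the real dataset). A "model" that always predicts the majority class scores high accuracy while finding none of the minority class.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: 96 negatives and 4 positives.
y_true = np.array([0] * 96 + [1] * 4)
# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = metrics.accuracy_score(y_true, y_pred)
recall_pos = metrics.recall_score(y_true, y_pred, pos_label=1)
print("accuracy:", accuracy)
print("class 1 recall:", recall_pos)
```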

To emphasize that the problem lies in the dataset and not in the model chosen, let's use logistic regression to model this data.

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=1)

lr.fit(X_train,Y_train)
print(metrics.classification_report(Y_val,lr.predict(X_val)))

This confirms our problem: the dataset is imbalanced.

In [None]:
print(y.value_counts())

# Tackling Imbalanced Dataset problem

There are various ways one can attack an imbalanced dataset problem. We discuss two:

1. Resampling the dataset: under-sampling the majority class or over-sampling the minority class
2. Using an ensemble model to achieve better generalization


## Resampling

### Resampling : Under-sampling of majority class

In [None]:
train.TARGET.value_counts(normalize=True)

Class 0 accounts for 96% of the dataset. Let's bring this down by training on a random subsample of class 0 together with all of class 1.

In [None]:
### General Function to randomly sample N*x , where N is the ratio of class 0 to class 1 points and 
### x is the number of class 1 points

def under_sampler(N,x = train.TARGET.value_counts()[1]):
    class_0 = train[train['TARGET']==0].sample( int(N*x) ,random_state=1)
    class_1 = train[train['TARGET']==1]
    return pd.concat([class_0,class_1],axis=0)

train_new = under_sampler(4)
train_new.shape

In [None]:
X = train_new.drop(columns='TARGET')
y = train_new.TARGET

X_train, X_val, y_train, y_val = train_test_split(X,y,random_state=1,train_size=0.8)

model.fit(X_train,y_train)
preds = model.predict(X_val)
print(metrics.classification_report(y_val,preds))


In [None]:
print(metrics.roc_auc_score(y_val,preds))

We already see tremendous improvement here in terms of precision, recall and ROC AUC score.

Let's now experiment with ratios from 1 to 4.5 in increments of 0.5.

In [None]:
import numpy as np

In [None]:
for N in np.arange(1.0,5.0,0.5):
    train_new = under_sampler(N)
    X = train_new.drop(columns='TARGET')
    y = train_new.TARGET

    X_train, X_val, y_train, y_val = train_test_split(X,y,random_state=1,train_size=0.8)

    model.fit(X_train,y_train)
    preds = model.predict(X_val)
    print("The ratio of majority to minor class is ",N)
    print(metrics.classification_report(y_val,preds))


The trade-off between class 0 and class 1 precision scores is clear. For now, a 3:1 ratio looks like a good choice for under-sampling. Let's move on to over-sampling and review its results.
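
Comparing ratios is easier when the class 1 scores sit in a single table rather than in several printed reports. The sketch below shows one way to do that. It runs on synthetic data from `make_classification` as a stand-in for the real dataset, and `under_sample`, `toy` and `summary` are hypothetical names introduced only for this illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption): roughly 5% positives.
X_toy, y_toy = make_classification(n_samples=4000, weights=[0.95], random_state=1)
toy = pd.DataFrame(X_toy, columns=[f"f{i}" for i in range(X_toy.shape[1])])
toy["TARGET"] = y_toy

def under_sample(df, ratio, seed=1):
    """Keep all class 1 rows plus ratio * n_class_1 randomly chosen class 0 rows."""
    pos = df[df["TARGET"] == 1]
    neg = df[df["TARGET"] == 0].sample(int(ratio * len(pos)), random_state=seed)
    return pd.concat([neg, pos], axis=0)

rows = []
for ratio in [1.0, 2.0, 3.0, 4.0]:
    sub = under_sample(toy, ratio)
    X_tr, X_va, y_tr, y_va = train_test_split(
        sub.drop(columns="TARGET"), sub["TARGET"], train_size=0.8, random_state=1
    )
    clf = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
    pred = clf.predict(X_va)
    rows.append({
        "ratio": ratio,
        "precision_1": precision_score(y_va, pred, zero_division=0),
        "recall_1": recall_score(y_va, pred, zero_division=0),
    })

# One row per ratio, so the trade-off can be read off directly.
summary = pd.DataFrame(rows)
print(summary)
```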

### Resampling : Over-Sampling of minority class

There are two main ways to achieve this:
1. Repetition of class 1 data points
2. SMOTE (Synthetic Minority Oversampling Technique)

#### Repetition

Let's multiply the amount of class 1 points by two. The problem with this approach is that it risks overfitting since the classifier sees the same data over and over again (to be more precise, it sees it twice).
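
A related risk is leakage: if rows are duplicated before the train/validation split, copies of the same row can land on both sides. One way to avoid this is to split first and duplicate only within the training portion. The sketch below uses a made-up toy frame; `toy`, `train_part`, `val_part` and `train_over` are hypothetical names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 20 rows, 4 of them in class 1.
toy = pd.DataFrame({"feature": range(20), "TARGET": [0] * 16 + [1] * 4})

# Split first, so validation rows are never copied into the training set.
train_part, val_part = train_test_split(
    toy, train_size=0.75, random_state=1, stratify=toy["TARGET"]
)

# Duplicate the class 1 rows of the training portion only.
train_over = pd.concat([train_part, train_part[train_part["TARGET"] == 1]], axis=0)

print(train_over["TARGET"].value_counts())
print("rows shared with validation:", len(set(train_over.index) & set(val_part.index)))
```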

In [None]:
train.TARGET.value_counts()

In [None]:
train_new = train.copy()

train_new = pd.concat([train_new,train_new[train_new.TARGET == 1]],axis=0)
train_new.TARGET.value_counts()

In [None]:
X = train_new.drop(columns='TARGET')
y = train_new.TARGET

X_train,X_val,y_train,y_val = train_test_split(X,y,random_state=1,train_size=0.8)

In [None]:
model.fit(X_train,y_train)
preds = model.predict(X_val)
print(metrics.classification_report(y_val,preds))

In [None]:
metrics.roc_auc_score(y_val,preds)

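Surprisingly, class 1 precision and recall improve with no effect on the class 0 scores. Keep in mind that the duplication happened before the train/validation split, so copies of the same class 1 row can appear on both sides of the split, which may inflate these validation scores. Can we do better with SMOTE?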

#### SMOTE (Synthetic Minority Oversampling Technique)
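
Instead of repeating rows, SMOTE creates new minority points by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates only that core idea in plain NumPy. It is a simplification, not imblearn's implementation, and `smote_like` is a hypothetical helper.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows, each on the segment between a minority
    point and one of its k nearest minority neighbours (requires k < len(X_min))."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority points.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)              # random starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per start
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority points at the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_min, n_new=5)
print(synthetic)
```

Every synthetic point lies between two real minority points, so the new rows stay inside the region the minority class already occupies.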

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
from collections import Counter

In [None]:
print("Ratio of minority to majority class points",train.TARGET.value_counts()[1]/train.TARGET.value_counts()[0])

In [None]:
### CAUTION: Change n_jobs to match the number of threads on your CPU. I have a CPU with 6 cores and 12 threads
### and hence assigned n_jobs = 12. If you aren't sure of the number of threads on your CPU, remove the n_jobs parameter.

sm = SMOTE(random_state=1,n_jobs=12,sampling_strategy = 0.25)# ratio of resampled minority class points to majority class points

X_resampled, y_resampled = sm.fit_resample(X,y)
print("Before sampling : ",Counter(y),"\nAfter Sampling : ",Counter(y_resampled))

In [None]:
X_train,X_val,y_train,y_val = train_test_split(X_resampled,y_resampled,random_state=1,train_size=0.8)

model.fit(X_train,y_train)
preds = model.predict(X_val)
print(metrics.classification_report(y_val,preds))

In [None]:
metrics.roc_auc_score(y_val,preds)

A much better solution overall: SMOTE greatly improves our validation scores. What if we both under-sample the majority class and over-sample the minority class, after adjusting the ratios? First, let's submit the predictions of this model.

In [None]:
predictions = model.predict(test)

In [None]:
predictions

In [None]:
out =pd.DataFrame({'ID':test.index,'TARGET':predictions})
out.set_index('ID',inplace=True)

In [None]:
out

In [None]:
out.to_csv("Predictions1.csv",index=True)

On Kaggle, this gives an AUC of 0.58.
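
Since the score is an AUC, it depends on how well the predictions rank the rows. Hard 0/1 labels from `predict` throw most of that ranking away, so scoring with class 1 probabilities from `predict_proba` may give a different and often higher AUC. The sketch below compares the two on synthetic data. It uses logistic regression rather than the decision tree, because a fully grown tree tends to output probabilities of exactly 0 or 1. All names here are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): roughly 10% positives.
X_toy, y_toy = make_classification(n_samples=2000, weights=[0.9], random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=1, stratify=y_toy
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard labels versus AUC from class 1 probabilities.
auc_labels = roc_auc_score(y_va, clf.predict(X_va))
auc_scores = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])
print("AUC from labels:", auc_labels)
print("AUC from probabilities:", auc_scores)
```

For a submission, the analogous change would be to write `model.predict_proba(test)[:, 1]` into the `TARGET` column instead of `model.predict(test)`.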

### Resampling : Both undersampling and oversampling

Let's under-sample the majority class and over-sample the minority class using SMOTE.

Under-sampling: take a sample of 6 times the number of minority class points.

In [None]:
train_new = under_sampler(6)

X = train_new.drop(columns='TARGET')
y = train_new.TARGET

y.value_counts()

The ratio of minority to majority class points is now about 0.17 (1/6). Let's use SMOTE to bring this up to 0.4.
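
A float `sampling_strategy` is read as the desired ratio of minority to majority points after resampling, so the number of synthetic rows follows from simple arithmetic. The sketch below works this out for made-up counts. `smote_target_counts` is a hypothetical helper, and the exact rounding inside imblearn is treated as an assumption.

```python
def smote_target_counts(n_minority, n_majority, sampling_strategy):
    """Return (minority count after resampling, synthetic rows added),
    reading sampling_strategy as the desired minority / majority ratio."""
    target_minority = int(sampling_strategy * n_majority)
    return target_minority, target_minority - n_minority

# Hypothetical counts: the majority class under-sampled to 6x the minority class.
n_min, n_maj = 1000, 6000
print("ratio before:", n_min / n_maj)
print("after SMOTE at 0.4:", smote_target_counts(n_min, n_maj, 0.4))
```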

In [None]:
sm = SMOTE(random_state=1,n_jobs=12,sampling_strategy = 0.4)

X_resampled, y_resampled = sm.fit_resample(X,y)
print("Before sampling : ",Counter(y),"\nAfter Sampling : ",Counter(y_resampled))

Now let's train the model and see

In [None]:
X_train,X_val,y_train,y_val = train_test_split(X_resampled,y_resampled,random_state=1,train_size=0.8)

model.fit(X_train,y_train)
preds = model.predict(X_val)
print(metrics.classification_report(y_val,preds))

In [None]:
metrics.roc_auc_score(y_val,preds)

In [None]:
predictions = model.predict(test)
out =pd.DataFrame({'ID':test.index,'TARGET':predictions})
out.set_index('ID',inplace=True)
out.to_csv('Predictions2.csv',index=True)

On Kaggle, this gives an AUC of 0.61.

Let's try a couple of different combinations for sampling strategy in SMOTE

In [None]:
train.TARGET.value_counts()

In [None]:
train_new = under_sampler(10)

X = train_new.drop(columns='TARGET')
y = train_new.TARGET

In [None]:
max_auc = 0
for i in np.arange(0.2,1,0.1):
    sm = SMOTE(random_state=1,n_jobs=12,sampling_strategy = i)
    X_resampled, y_resampled = sm.fit_resample(X,y)
    #print("Before sampling : ",Counter(y),"\nAfter Sampling : ",Counter(y_resampled))
    X_train,X_val,y_train,y_val = train_test_split(X_resampled,y_resampled,random_state=1,train_size=0.8)

    model.fit(X_train,y_train)
    preds = model.predict(X_val)
    #print(metrics.classification_report(preds,y_val))
    auc = metrics.roc_auc_score(y_val,preds)
    if auc>max_auc:
        max_auc = auc
        SMOTE_ratio = i
        
print("Optimal Solution -\nAUC Score : ",max_auc,"\nSMOTE Ratio : ",SMOTE_ratio)

In [None]:
sm = SMOTE(random_state=1,n_jobs=12,sampling_strategy = 0.9)
X_resampled, y_resampled = sm.fit_resample(X,y)
print("Before sampling : ",Counter(y),"\nAfter Sampling : ",Counter(y_resampled))
X_train,X_val,y_train,y_val = train_test_split(X_resampled,y_resampled,random_state=1,train_size=0.8)

model.fit(X_train,y_train)
preds = model.predict(X_val)
print(metrics.classification_report(y_val,preds))

In [None]:
metrics.roc_auc_score(y_val,preds)

In [None]:
predictions = model.predict(test)

out =pd.DataFrame({'ID':test.index,'TARGET':predictions})
out.set_index('ID',inplace=True)
out.to_csv('Predictions3.csv',index=True)

We get a Kaggle AUC of only 0.6.

## Ensemble models

### Random Forest

The simplest ensemble model? Random Forest. Let's try that out.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X,y,train_size=0.8,random_state=1)

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators = 100,n_jobs = 12,random_state=1)
rfc.fit(X_train,y_train)
preds = rfc.predict(X_val)
print(metrics.classification_report(y_val,preds))

In [None]:
metrics.roc_auc_score(y_val,preds)

These are still bad scores, especially the precision and recall rates.

Let's combine our previous best method of under-sampling and over-sampling with RFC.

In [None]:
train_new = under_sampler(6)

X = train_new.drop(columns='TARGET')
y = train_new.TARGET
sm = SMOTE(random_state=1,n_jobs=12,sampling_strategy = 0.4)

X_resampled, y_resampled = sm.fit_resample(X,y)
#print("Before sampling : ",Counter(y),"\nAfter Sampling : ",Counter(y_resampled))
X_train,X_val,y_train,y_val = train_test_split(X_resampled,y_resampled,random_state=1,train_size=0.8)

rfc.fit(X_train,y_train)
preds = rfc.predict(X_val)
print(metrics.classification_report(y_val,preds))

In [None]:
metrics.roc_auc_score(y_val,preds)

In [None]:
predictions = rfc.predict(test)

out =pd.DataFrame({'ID':test.index,'TARGET':predictions})
out.set_index('ID',inplace=True)
out.to_csv('Predictions4.csv',index=True)