# Overview

In this notebook, I explore some ways of dealing with imbalanced datasets when classifying with LogisticRegression. The high-level steps are:
1. Load the training and validation datasets and clean them
    - Convert commas to decimal points
    - Drop columns that are missing too many values
    - Drop missing value rows if there aren't too many of them
    - Apply each action to both datasets
2. Create a Pipeline for missing values imputation and Logistic Regression classification
    - Using the StandardScaler class, which standardizes each column to a mean of 0 and a standard deviation of 1
3. Try the Pipeline classifier on the initial training dataset and get a base performance figure using cross validation
4. Try 2 methods of balancing the classes
    - Undersampling the majority class so that it matches the minority class
    - Oversampling the minority class so that it matches the majority class
5. For each of the methods above, tune the hyperparameter C using GridSearchCV to reduce overfitting to the training set and improve accuracy on the validation set.
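The cleaning in step 1 is done by `helperFunctions.cleanData`, which is not shown in this notebook. Below is a minimal sketch of what such a step could look like. The function name `clean_data_sketch`, the two thresholds, and the toy data are all assumptions for illustration, and the decimal commas are handled at read time with the `decimal=','` option of `pd.read_csv` rather than inside the helper.

```python
import io
import pandas as pd

def clean_data_sketch(df, max_col_missing=0.5, max_row_missing=0.3):
    """Hypothetical stand-in for helperFunctions.cleanData (thresholds are assumptions)."""
    # Drop columns whose fraction of missing values exceeds the threshold
    df = df.loc[:, df.isna().mean() <= max_col_missing]
    # Drop rows with missing values only if they are a small share of the data
    has_missing = df.isna().any(axis=1)
    if has_missing.mean() <= max_row_missing:
        df = df[~has_missing]
    return df.reset_index(drop=True)

# Toy semicolon-delimited data with decimal commas (made up for this example)
raw = "v1;v2;v3;classLabel\na;17,92;;no.\nb;16,92;;yes.\nb;31,25;;no.\na;;;yes.\n"

# decimal=',' makes pandas parse "17,92" as the float 17.92
toy = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')
cleaned = clean_data_sketch(toy)
print(cleaned)
```

The same function would be applied to both the training and validation frames so that they go through identical cleaning.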

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

In [2]:
%matplotlib inline

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
# Sklearn imports
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler, LabelBinarizer, LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, KFold, StratifiedKFold
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
from sklearn.utils import resample


In [5]:
import helperFunctions

In [6]:
# Setting the random state for later use
random_state = 565

## Load dataset and clean

__Train Set__

In [7]:
trainSet = pd.read_csv('training.csv', delimiter=';')
trainSet = helperFunctions.cleanData(trainSet)

In [8]:
trainSet = helperFunctions.encodeCategoricals(trainSet, 
                                                dummyColList=['v1', 'v4', 'v8', 'v9', 'v11', 'v12'], 
                                                labelCol='classLabel',
                                                labelEncoding={'no.':'1', 'yes.':'0'})
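`helperFunctions.encodeCategoricals` is also not shown here. As a rough sketch under that caveat, a helper with the same arguments could map the label strings to integers and one-hot encode the listed columns with `pd.get_dummies`. The name `encode_categoricals_sketch` and the toy frame are hypothetical.

```python
import pandas as pd

def encode_categoricals_sketch(df, dummy_cols, label_col, label_encoding):
    """Hypothetical stand-in for helperFunctions.encodeCategoricals."""
    df = df.copy()
    # Map the label strings (e.g. 'no.' / 'yes.') to their codes and store them as integers
    df[label_col] = df[label_col].map(label_encoding).astype(int)
    # One-hot encode the listed categorical columns; other columns are left untouched
    return pd.get_dummies(df, columns=dummy_cols)

# Toy frame, made up for this example
toy = pd.DataFrame({
    'v1': ['a', 'b', 'b'],
    'v2': [1.0, 2.0, 3.0],
    'classLabel': ['no.', 'yes.', 'no.'],
})
encoded = encode_categoricals_sketch(
    toy, dummy_cols=['v1'], label_col='classLabel',
    label_encoding={'no.': '1', 'yes.': '0'})
print(encoded.columns.tolist())
```

Note that with this mapping, `no.` becomes class 1 and `yes.` becomes class 0, which matches the integer labels seen later in `y_train.value_counts()`.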

In [9]:
y_train = trainSet['classLabel']
X_train = trainSet.drop('classLabel', axis=1)

__Validation Set__

In [10]:
validSet = pd.read_csv('validation.csv', delimiter=';')
validSet = helperFunctions.cleanData(validSet)

In [11]:
validSet.head()

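| Unnamed: 0 | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | v9 | v10 | v11 | v12 | v13 | v14 | classLabel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 32.33 | 0.00075 | u | 0.840107 | 0.544982 | 1.585 | t | f | 0 | t | s | 420.0 | 0 | no. |
| 1 | b | 23.58 | 0.000179 | u | -4.174396 | 0.864362 | 0.54 | f | f | 0 | t | g | 136.0 | 1 | no. |
| 2 | b | 36.42 | 7.5e-05 | y | 2.232226 | 0.627476 | 0.585 | f | f | 0 | f | g | 240.0 | 3 | no. |
| 3 | b | 18.42 | 0.001041 | y | -2.46997 | 0.846741 | 0.125 | t | f | 0 | f | g | 120.0 | 375 | no. |
| 4 | b | 24.5 | 0.001334 | y | -3.149422 | 0.321087 | 0.04 | f | f | 0 | t | g | 120.0 | 475 | no. |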


In [12]:
validSet = helperFunctions.encodeCategoricals(validSet, 
                                                dummyColList=['v1', 'v4', 'v8', 'v9', 'v11', 'v12'], 
                                                labelCol='classLabel',
                                                labelEncoding={'no.':'1', 'yes.':'0'})

In [13]:
y_valid = validSet['classLabel']
X_valid = validSet.drop('classLabel', axis=1)

Make sure that the train and validation sets have the same columns

In [14]:
X_train, X_valid = helperFunctions.equalizeColumns(X_train, X_valid)
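One-hot encoding the two sets separately can produce different dummy columns when a category appears in only one of them. `helperFunctions.equalizeColumns` is not shown, so the following is only one plausible way to do it: reindex the validation frame to the training columns, filling dummy columns it lacks with 0 and dropping columns the training set never saw. The function name and toy frames are assumptions.

```python
import pandas as pd

def equalize_columns_sketch(train, valid):
    """Hypothetical stand-in for helperFunctions.equalizeColumns."""
    # Columns missing from valid are added and filled with 0;
    # columns that exist only in valid are dropped
    valid = valid.reindex(columns=train.columns, fill_value=0)
    return train, valid

# Toy frames with mismatched dummy columns (made up for this example)
train_toy = pd.DataFrame({'v2': [1.0, 2.0], 'v4_l': [1, 0], 'v4_u': [0, 1]})
valid_toy = pd.DataFrame({'v2': [3.0], 'v4_u': [1], 'v12_x': [1]})

train_eq, valid_eq = equalize_columns_sketch(train_toy, valid_toy)
print(valid_eq)
```

Using the training columns as the reference keeps the model's input layout fixed, which is what a fitted pipeline expects at prediction time.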

## 0 - Logistic - Base Performance

In [15]:
lrPipe0 = Pipeline(steps = [
    ('imputer', Imputer(strategy='mean', axis=0)),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(random_state=random_state)),
])
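A note for anyone re-running this on a recent scikit-learn: the `Imputer` class used above was removed in version 0.22. Its replacement is `SimpleImputer` from `sklearn.impute`, which always imputes column-wise and so takes no `axis` argument. Here is a small sketch of the equivalent pipeline, fitted on made-up toy data rather than on the notebook's dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),   # replaces Imputer(strategy='mean', axis=0)
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(random_state=565)),
])

# Toy data with missing values (made up for this example)
X_toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0],
                  [4.0, 40.0], [5.0, 50.0], [6.0, np.nan]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pipe.fit(X_toy, y_toy)
preds = pipe.predict(X_toy)

# Output of the imputer + scaler steps: each column has mean 0 and std 1
scaled = pipe[:-1].transform(X_toy)
print(scaled.mean(axis=0), scaled.std(axis=0))
```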

In [16]:
scores = cross_val_score(estimator=lrPipe0, X=X_train, y=y_train, n_jobs=-1, scoring='accuracy', verbose=10, cv=5)
print('CV Accuracy scores: %s' % scores)
print('CV score mean: %.2f' % np.mean(scores))

[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    2.1s remaining:    3.3s
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:    2.9s remaining:    1.9s


CV Accuracy scores: [ 0.9625      0.96388889  0.96666667  0.95549374  0.95961003]
CV score mean: 0.96


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.7s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.7s finished


In [17]:
lrPipe0 = lrPipe0.fit(X=X_train, y=y_train)
accuracy_score(y_pred=lrPipe0.predict(X=X_valid), y_true=y_valid)

0.64102564102564108

#### Tune the Regularization parameter strength
It looks like the classifier is overfitting the training set: the CV accuracy is about 0.96, but the accuracy on the validation set is only about 0.64. We can try to tune the regularization strength by running a grid search over a range of candidate values.

The class_weight parameter is None by default. Setting it to "balanced" weights each class inversely to its frequency, which accounts for the class imbalance, so it is included in the parameter grid as well.
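To make the "balanced" option concrete: scikit-learn documents the balanced weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the class counts that `y_train.value_counts()` reports later in this notebook (3328 and 269); the variable names are just for illustration.

```python
import numpy as np

# Class counts for labels 0 and 1, taken from y_train.value_counts() in this notebook
counts = np.array([3328, 269])

n_samples = counts.sum()
n_classes = len(counts)

# Formula documented for class_weight='balanced'
weights = n_samples / (n_classes * counts)
print(np.round(weights, 3))
```

Each minority sample ends up weighted roughly 12 times more than a majority sample, so both classes contribute the same total weight to the loss.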

In [18]:
param_grid = [{ 
               'clf__C': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0], 
               'clf__penalty': ['l1', 'l2'],
               'clf__class_weight': [None, 'balanced'],
              }]

# Using a predefined function for gridSearch in helperFunctions
helperFunctions.gridSearch(lrPipe0, param_grid, X_train, y_train, scoring='precision')

Best score: 0.885
Best parameters set:
	clf__C: 0.01
	clf__class_weight: None
	clf__penalty: 'l1'


Grid scores:
0.000 (+/-0.000) for {'clf__C': 1e-05, 'clf__class_weight': None, 'clf__penalty': 'l1'}
0.521 (+/-0.033) for {'clf__C': 1e-05, 'clf__class_weight': None, 'clf__penalty': 'l2'}
0.000 (+/-0.000) for {'clf__C': 1e-05, 'clf__class_weight': 'balanced', 'clf__penalty': 'l1'}
0.193 (+/-0.015) for {'clf__C': 1e-05, 'clf__class_weight': 'balanced', 'clf__penalty': 'l2'}
0.000 (+/-0.000) for {'clf__C': 0.0001, 'clf__class_weight': None, 'clf__penalty': 'l1'}
0.535 (+/-0.055) for {'clf__C': 0.0001, 'clf__class_weight': None, 'clf__penalty': 'l2'}
0.000 (+/-0.000) for {'clf__C': 0.0001, 'clf__class_weight': 'balanced', 'clf__penalty': 'l1'}
0.211 (+/-0.015) for {'clf__C': 0.0001, 'clf__class_weight': 'balanced', 'clf__penalty': 'l2'}
0.000 (+/-0.000) for {'clf__C': 0.001, 'clf__class_weight': None, 'clf__penalty': 'l1'}
0.625 (+/-0.086) for {'clf__C': 0.001, 'clf__class_weight': None, '

Since we want to reduce overfitting, let's choose a C value below 1 that scores well. Note that the cell below uses C=0.1 with the l1 penalty rather than the grid-search optimum of C=0.01.

__Final Validation__

In [19]:
lrPipe0.set_params(**{'clf__C':0.1, 'clf__penalty':'l1'})
lrPipe0 = lrPipe0.fit(X=X_train, y=y_train)
accuracy_score(y_pred=lrPipe0.predict(X=X_valid), y_true=y_valid)

0.72820512820512817

So even after regularization tuning, the best we can do is about 73% accuracy on the validation set.

## 1 - Logistic Regression with Balanced classes - Undersampling Majority Class

This is one method of class balancing: we reduce the number of majority-class samples to match the number of minority-class samples.

In [20]:
y_train.value_counts()

0    3328
1     269
Name: classLabel, dtype: int64
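These counts also put the earlier accuracy figures in perspective. As a quick back-of-the-envelope check (plain Python, with the counts copied from the output above), this is the accuracy that a classifier which always predicts the majority class would get on the training set:

```python
# Class counts copied from y_train.value_counts() above
counts = {0: 3328, 1: 269}

total = sum(counts.values())
majority_baseline = max(counts.values()) / total
imbalance_ratio = max(counts.values()) / min(counts.values())

print('Majority-class baseline accuracy: %.3f' % majority_baseline)  # 0.925
print('Majority/minority ratio: %.1f' % imbalance_ratio)             # 12.4
```

So a CV accuracy of 0.96 on the imbalanced training set is only a few points above what a trivial classifier would score, which is part of why balancing the classes is worth trying.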

In [21]:
# First resample the majority class to get the same number of samples as the minority class
Xdown, ydown = resample(X_train[y_train == 0], y_train[y_train == 0], replace=False, n_samples=X_train[y_train == 1].shape[0])

# Now concatenate the resampled majority set to the minority set
xBal_Und = pd.concat([X_train[y_train==1], Xdown], axis=0)
yBal_Und = pd.concat([y_train[y_train==1], ydown], axis=0)
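The `resample` call above does not set `random_state`, so the undersampled set changes on every run. A minimal sketch of the same idea wrapped in a reusable function with a fixed seed follows; the function name `undersample_majority` and the toy data are assumptions.

```python
import pandas as pd
from sklearn.utils import resample

def undersample_majority(X, y, majority=0, minority=1, random_state=565):
    """Downsample the majority class (without replacement) to the minority class size."""
    n_minority = int((y == minority).sum())
    X_down, y_down = resample(
        X[y == majority], y[y == majority],
        replace=False, n_samples=n_minority, random_state=random_state)
    # Stack the untouched minority rows on top of the downsampled majority rows
    X_bal = pd.concat([X[y == minority], X_down], axis=0)
    y_bal = pd.concat([y[y == minority], y_down], axis=0)
    return X_bal, y_bal

# Toy data: 8 majority rows and 2 minority rows (made up for this example)
X_toy = pd.DataFrame({'f': range(10)})
y_toy = pd.Series([0] * 8 + [1] * 2)

X_bal, y_bal = undersample_majority(X_toy, y_toy)
print(y_bal.value_counts())
```

Because `X` and `y` are resampled together, the row indices of the features and labels stay aligned.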

In [22]:
lrPipe1 = Pipeline(steps = [
    ('imputer', Imputer(missing_values='NaN', strategy='median', axis=0)),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(random_state=random_state)),
])

__Training CV__

In [23]:
scores = cross_val_score(estimator=lrPipe1, X=xBal_Und, y=yBal_Und, n_jobs=-1, scoring='accuracy', 
                         cv=StratifiedKFold(n_splits=5, random_state=0, shuffle=True,))
print('CV scores: %s' % scores)
print('CV score mean: %.2f' % np.mean(scores))

CV scores: [ 0.87962963  0.94444444  0.86111111  0.84259259  0.90566038]
CV score mean: 0.89


In [24]:
lrPipe1 = lrPipe1.fit(X=xBal_Und, y=yBal_Und)

__Validation Score__

In [25]:
accuracy_score(y_pred=lrPipe1.predict(X=X_valid), y_true=y_valid)

0.76923076923076927

#### Tune the Regularization strength
Tuning the regularization parameter using GridSearchCV
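The `helperFunctions.gridSearch` function used below is not included in this notebook. As a hedged sketch, a helper that prints a similar report could be built on `GridSearchCV` and its `cv_results_` attribute. The function name, the `cv=3` default, the use of twice the standard deviation for the "+/-" figure, and the synthetic data are all assumptions. Only `C` is tuned here, to keep the example independent of which solvers support which penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def grid_search_sketch(pipe, param_grid, X, y, scoring='precision', cv=3):
    """Hypothetical stand-in for helperFunctions.gridSearch."""
    gs = GridSearchCV(pipe, param_grid, scoring=scoring, cv=cv)
    gs.fit(X, y)
    print('Best score: %0.3f' % gs.best_score_)
    print('Best parameters set:')
    for name in sorted(gs.best_params_):
        print('\t%s: %r' % (name, gs.best_params_[name]))
    print('Grid scores:')
    res = gs.cv_results_
    for mean, std, params in zip(res['mean_test_score'], res['std_test_score'], res['params']):
        # Assumption: the "+/-" figure is twice the std across folds
        print('%0.3f (+/-%0.03f) for %r' % (mean, 2 * std, params))
    return gs

# Synthetic, balanced toy data (made up for this example)
rng = np.random.RandomState(0)
X_toy = np.column_stack([np.linspace(-3, 3, 60), rng.normal(size=60)])
y_toy = (X_toy[:, 0] > 0).astype(int)

pipe = Pipeline(steps=[('scaler', StandardScaler()), ('clf', LogisticRegression())])
gs = grid_search_sketch(pipe, [{'clf__C': [0.01, 0.1, 1.0]}], X_toy, y_toy)
```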

In [26]:
param_grid = [{ 
               'clf__C': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0], 
               'clf__penalty': ['l1', 'l2'],
              }]

# Using a predefined function for gridSearch in helperFunctions
helperFunctions.gridSearch(lrPipe1, param_grid, xBal_Und, yBal_Und, scoring='precision')

Best score: 0.907
Best parameters set:
	clf__C: 0.01
	clf__penalty: 'l1'


Grid scores:
0.000 (+/-0.000) for {'clf__C': 1e-05, 'clf__penalty': 'l1'}
0.848 (+/-0.097) for {'clf__C': 1e-05, 'clf__penalty': 'l2'}
0.000 (+/-0.000) for {'clf__C': 0.0001, 'clf__penalty': 'l1'}
0.850 (+/-0.093) for {'clf__C': 0.0001, 'clf__penalty': 'l2'}
0.000 (+/-0.000) for {'clf__C': 0.001, 'clf__penalty': 'l1'}
0.860 (+/-0.098) for {'clf__C': 0.001, 'clf__penalty': 'l2'}
0.907 (+/-0.095) for {'clf__C': 0.01, 'clf__penalty': 'l1'}
0.889 (+/-0.076) for {'clf__C': 0.01, 'clf__penalty': 'l2'}
0.889 (+/-0.081) for {'clf__C': 0.1, 'clf__penalty': 'l1'}
0.893 (+/-0.081) for {'clf__C': 0.1, 'clf__penalty': 'l2'}
0.887 (+/-0.072) for {'clf__C': 1.0, 'clf__penalty': 'l1'}
0.887 (+/-0.072) for {'clf__C': 1.0, 'clf__penalty': 'l2'}



Since we want to reduce overfitting, let's choose the C value below 1 that has the best score, which here is C=0.01 with the l1 penalty.

__Final Validation__

In [27]:
lrPipe1.set_params(**{'clf__C':0.01, 'clf__penalty':'l1'})
lrPipe1 = lrPipe1.fit(X=xBal_Und, y=yBal_Und)
accuracy_score(y_pred=lrPipe1.predict(X=X_valid), y_true=y_valid)

0.84615384615384615

## 2 - Logistic Regression with Balanced classes - Oversampling Minority Class

This is an alternative method of balancing the classes: we oversample the minority class to match the number of samples in the majority class.

In [28]:
y_train.value_counts()

0    3328
1     269
Name: classLabel, dtype: int64

In [29]:
# First resample the minority class to get the same number of samples as the majority class
X_upsample, y_upsample = resample(X_train[y_train == 1], y_train[y_train == 1], 
                                  replace=True, n_samples=X_train[y_train == 0].shape[0])

# Now concatenate the resampled minority set to the majority set
xBal_Ovr = pd.concat([X_train[y_train==0], X_upsample], axis=0)
yBal_Ovr = pd.concat([y_train[y_train==0], y_upsample], axis=0)

In [30]:
yBal_Ovr.value_counts()

1    3328
0    3328
Name: classLabel, dtype: int64

In [31]:
lrPipe2 = Pipeline(steps = [
    ('imputer', Imputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(random_state=random_state)),
])

__Training CV__

In [32]:
scores = cross_val_score(estimator=lrPipe2, X=xBal_Ovr, y=yBal_Ovr, n_jobs=-1, scoring='accuracy', 
                         cv=10)
print('CV scores: %s' % scores)
print('CV score mean: %.2f' % np.mean(scores))

CV scores: [ 0.88738739  0.91441441  0.89339339  0.90840841  0.91891892  0.90990991
  0.91891892  0.9009009   0.92018072  0.9246988 ]
CV score mean: 0.91


In [33]:
lrPipe2 = lrPipe2.fit(X=xBal_Ovr, y=yBal_Ovr)

__Validation Score__

In [34]:
accuracy_score(y_pred=lrPipe2.predict(X=X_valid), y_true=y_valid)

0.68205128205128207
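One possible contributor to the gap between the CV score (0.91) and the validation score (0.68) is that the oversampling was done before cross-validation, so copies of the same minority row can land in both the training and the test part of a fold. I have not verified this on the dataset here. The sketch below, on synthetic data with hypothetical names, shows the alternative of oversampling only the training part of each fold and scoring on the untouched test part.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

def oversample_minority(X, y, minority=1, random_state=0):
    """Resample the minority class with replacement up to the majority class size."""
    X_min, y_min = X[y == minority], y[y == minority]
    X_maj, y_maj = X[y != minority], y[y != minority]
    X_up, y_up = resample(X_min, y_min, replace=True,
                          n_samples=len(y_maj), random_state=random_state)
    return np.vstack([X_maj, X_up]), np.concatenate([y_maj, y_up])

# Synthetic imbalanced data: 180 majority rows, 20 minority rows
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = np.array([0] * 180 + [1] * 20)
X[y == 1] += 1.5  # shift the minority class so there is some signal

fold_scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # Oversample the training part only; the test part keeps its original distribution
    X_tr, y_tr = oversample_minority(X[train_idx], y[train_idx])
    clf = LogisticRegression().fit(X_tr, y_tr)
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))

print('Fold accuracies:', np.round(fold_scores, 3))
```

With this arrangement the CV estimate is computed on data that still has the original class ratio, which should make it a fairer preview of the validation score.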

__GridSearch Hyperparameter Tuning__

In [35]:
param_grid = [{
               'clf__C': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0], 
               'clf__penalty': ['l1', 'l2'],
              }]

helperFunctions.gridSearch(lrPipe2, param_grid, xBal_Ovr, yBal_Ovr, scoring='precision')

Best score: 0.925
Best parameters set:
	clf__C: 0.01
	clf__penalty: 'l1'


Grid scores:
0.000 (+/-0.000) for {'clf__C': 1e-05, 'clf__penalty': 'l1'}
0.873 (+/-0.021) for {'clf__C': 1e-05, 'clf__penalty': 'l2'}
0.000 (+/-0.000) for {'clf__C': 0.0001, 'clf__penalty': 'l1'}
0.882 (+/-0.022) for {'clf__C': 0.0001, 'clf__penalty': 'l2'}
0.924 (+/-0.006) for {'clf__C': 0.001, 'clf__penalty': 'l1'}
0.912 (+/-0.019) for {'clf__C': 0.001, 'clf__penalty': 'l2'}
0.925 (+/-0.019) for {'clf__C': 0.01, 'clf__penalty': 'l1'}
0.919 (+/-0.017) for {'clf__C': 0.01, 'clf__penalty': 'l2'}
0.914 (+/-0.016) for {'clf__C': 0.1, 'clf__penalty': 'l1'}
0.917 (+/-0.016) for {'clf__C': 0.1, 'clf__penalty': 'l2'}
0.917 (+/-0.018) for {'clf__C': 1.0, 'clf__penalty': 'l1'}
0.916 (+/-0.018) for {'clf__C': 1.0, 'clf__penalty': 'l2'}



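The best score is at C=0.01 with the l1 penalty (0.925), but C=0.001 with l1 scores almost the same (0.924) with a much smaller spread across folds and stronger regularization, so that is the setting used below.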

__Final Validation__

In [36]:
lrPipe2.set_params(**{'clf__C':0.001, 'clf__penalty':'l1'})
lrPipe2 = lrPipe2.fit(X=xBal_Ovr, y=yBal_Ovr)
accuracy_score(y_pred=lrPipe2.predict(X=X_valid), y_true=y_valid)

0.85128205128205126

# Summary

The best performance with the Logistic Regression classifier was about 85% accuracy on the validation dataset, reached with both undersampling (0.846) and oversampling (0.851). This was achieved after balancing the classes and adjusting the regularization parameter with GridSearch to reduce overfitting.


### Other things that could be tried
Obviously, Logistic Regression is just one of many different classification techniques. But I always start with it as it is easy to understand and, with some feature engineering, can give very good results.

The next thing to try would be some tree-based methods, such as Random Forests and Boosted Trees, to see what kind of accuracy they give.