# Commercial Bank Customer Retention Prediction

## APSTA-GE.2401: Statistical Consulting

## Scripts

Created on: 12/07/2020

Modified on: 12/08/2020

## Supervised Learning Models

----

### Description

This script contains the machine learning models.

### Research Design

The supervised learning strategy is to train each model on the `X_train` features and evaluate its fit against the `y_train` labels. After training, we apply the model to the `X_test` data, and it generates predictions of `y_test` from `X_test`.

To assess and improve model performance, we split the training set into two sets: 80% of the data goes to the `X_train` set and 20% goes to the `X_test` set. We then ran 5-fold cross-validation and selected the best-performing model. We also fine-tuned hyperparameters using randomized search.
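The design described here can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual model: `make_classification`, the random forest, the parameter ranges, and the `_demo` names are assumptions chosen for brevity. Only the 80/20 split, the 5-fold cross-validation, and the randomized search come from the text.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the bank data (assumption: binary labels).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# 80/20 split of the labeled data.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1061
)

# Randomized search over a small, hypothetical hyperparameter space,
# scored by AUC with 5-fold cross-validation.
param_distributions = {
    'n_estimators': randint(20, 60),
    'max_depth': randint(2, 8),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=5,
    scoring='roc_auc',
    random_state=0,
)
search.fit(X_tr, y_tr)

# The best configuration is refit on all of X_tr; score it on the held-out 20%.
val_auc = roc_auc_score(y_val, search.predict_proba(X_val)[:, 1])
print('Best CV AUC: {:.3f}'.format(search.best_score_))
print('Validation AUC: {:.3f}'.format(val_auc))
```

Scoring on the held-out 20% gives a performance estimate that was not used to choose the hyperparameters, which is the reason for splitting before the search.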

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

print('SUCCESS! All modules are imported.')

SUCCESS! All modules are imported.


----

In [12]:
X = pd.read_csv('../data/X_train.csv')
y = pd.read_csv('../data/y_train.csv')
X_hold = pd.read_csv('../data/X_test.csv')

In [13]:
print('The model-ready training set has {} rows and {} columns.'.format(X.shape[0], X.shape[1]))
print('The model-ready validation set has {} rows and {} columns.'.format(y.shape[0], y.shape[1]))
print('The model-ready testing set has {} rows and {} columns.'.format(X_hold.shape[0], X_hold.shape[1]))

The model-ready training set has 145296 rows and 87 columns.
The model-ready validation set has 145296 rows and 2 columns.
The model-ready testing set has 76722 rows and 87 columns.


----

### Train Test Split

In [14]:
ID_train = X['cust_no']
ID_test = X_hold['cust_no']

In [15]:
X = X.drop('cust_no', axis=1)
y = y.drop('cust_no', axis=1)
X_hold = X_hold.drop('cust_no', axis=1)

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1061)

In [17]:
print('After train test split, the training set has {} rows and {} columns.'.format(X_train.shape[0], X_train.shape[1]))
print('After train test split, the train has {} labels.'.format(y_train.shape[0]))
print('After train test split, the test set has {} rows and {} columns.'.format(X_test.shape[0], X_test.shape[1]))
print('After train test split, the test has {} labels.'.format(y_test.shape[0]))

After train test split, the training set has 116236 rows and 86 columns.
After train test split, the train has 116236 labels.
After train test split, the test set has 29060 rows and 86 columns.
After train test split, the test has 29060 labels.


In [26]:
X_train.iloc[1:10, 5:20]

| | X6 | X7 | X8 | B1 | B2 | B3 | B4 | B5 | B6 | B7 | E1 | E2 | E3 | E4 | E5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 86298 | 0.0 | 0.0 | 0 | 3 | 0 | 0.0 | 2 | 63000.0 | 2019-12-09 00:08:00.000000000 | 49.0 | 2019-07-13 00:00:00.000000000 | 2019-07-13 00:00:00.000000000 | 2019-07-13 00:00:00.000000000 | 2019-12-31 00:00:00.000000000 | 2019-07-13 00:00:00.000000000 |
| 109633 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 2019-12-13 16:34:00.000000000 | 6.0 | 2017-09-07 00:00:00.000000000 | 2019-12-31 00:00:00.000000000 | 2019-12-31 00:00:00.000000000 | 2019-12-31 00:00:00.000000000 | 2019-12-31 00:00:00.000000000 |
| 47462 | 0.0 | 0.0 | 0 | 0 | 6 | 5707.0 | 0 | 0.0 | 2019-09-27 22:58:00.000000000 | 11.0 | 2017-09-07 00:00:00.000000000 | 2017-09-10 00:00:00.000000000 | 2017-09-10 00:00:00.000000000 | 2019-12-31 00:00:00.000000000 | 2019-12-31 00:00:00.000000000 |
| 1856 | 0.0 | 0.0 | 500000 | 1 | 1 | 1229634.6 | 1 | 50000.0 | 2019-09-30 18:33:00.000000000 | 15.0 | 2015-11-27 00:00:00.000000000 | 2017-10-10 00:00:00.000000000 | 2017-10-10 00:00:00.000000000 | 2016-02-03 00:00:00.000000000 | 2017-02-21 00:00:00.000000000 |
| 24858 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 2019-05-08 10:57:00.000000000 | 0.0 | 2018-11-14 00:00:00.000000000 | 2018-11-14 00:00:00.000000000 | 2018-11-14 00:00:00.000000000 | 2018-11-14 00:00:00.000000000 | 2019-12-31 00:00:00.000000000 |
| 128389 | 0.0 | 0.0 | 0 | 1 | 1 | 53896.8 | 0 | 0.0 | 2019-12-12 16:58:00.000000000 | 4.0 | 2019-06-03 00:00:00.000000000 | 2019-06-03 00:00:00.000000000 | 2019-06-03 00:00:00.000000000 | 2019-12-31 00:00:00.000000000 | 2019-09-05 00:00:00.000000000 |
| 38839 | 0.0 | 0.0 | 0 | 24 | 3 | 258301.67 | 7 | 258303.0 | 2019-09-26 04:14:00.000000000 | 14.0 | 2018-05-11 00:00:00.000000000 | 2018-05-11 00:00:00.000000000 | 2018-05-11 00:00:00.000000000 | 2019-07-04 00:00:00.000000000 | 2018-05-11 00:00:00.000000000 |
| 49469 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 2019-02-23 02:31:00.000000000 | 0.0 | 2014-05-20 00:00:00.000000000 | 2015-04-03 00:00:00.000000000 | 2015-04-03 00:00:00.000000000 | 2017-05-03 00:00:00.000000000 | 2015-09-25 00:00:00.000000000 |
| 86484 | 0.0 | 0.0 | 0 | 1 | 1 | 35000.0 | 1 | 35000.0 | 2019-12-11 04:33:00.000000000 | 7.0 | 2019-08-28 00:00:00.000000000 | 2019-08-28 00:00:00.000000000 | 2019-08-28 00:00:00.000000000 | 2019-12-31 00:00:00.000000000 | 2019-12-31 00:00:00.000000000 |


----

### SVD

In [18]:
scaler = StandardScaler()
scaler.fit(X_train)

ValueError: could not convert string to float: '2019-07-12 04:55:00.000000000'
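
The error above arises because some columns (for example `B6` and `E1` to `E5` in the preview) hold timestamp strings, which `StandardScaler` cannot cast to float. One possible fix, sketched below on a toy frame, is to convert such columns to a numeric quantity before scaling. The toy values, the helper name `encode_dates`, and the choice of "days before a reference date" are assumptions for illustration, not the notebook's actual preprocessing.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame mimicking the mix of numeric and timestamp-string columns.
toy = pd.DataFrame({
    'B5': [63000.0, 0.0, 50000.0],
    'B6': ['2019-12-09 00:08:00', '2019-12-13 16:34:00', '2019-09-30 18:33:00'],
    'E1': ['2019-07-13 00:00:00', '2017-09-07 00:00:00', '2015-11-27 00:00:00'],
})

def encode_dates(df, date_cols, reference='2019-12-31'):
    """Replace timestamp-string columns with days elapsed before `reference`.

    Hypothetical helper; the reference date is an assumption.
    """
    out = df.copy()
    ref = pd.Timestamp(reference)
    for col in date_cols:
        out[col] = (ref - pd.to_datetime(out[col])).dt.total_seconds() / 86400.0
    return out

toy_numeric = encode_dates(toy, ['B6', 'E1'])

# Every column is numeric now, so scaling no longer raises ValueError.
toy_scaled = StandardScaler().fit_transform(toy_numeric)
print(toy_scaled.shape)
```

Whether elapsed days is the right encoding depends on what each date column means; dropping the columns or extracting year and month are alternatives.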

In [None]:
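# Assumes X_train is fully numeric (date columns already encoded); the thin SVD keeps U at n x p.
U, sig, Vt = np.linalg.svd(scaler.transform(X_train), full_matrices=False)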
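
As a sketch of what this step might look like once the features are numeric, the snippet below standardizes a small random matrix and decomposes it with `np.linalg.svd`. The random data, the names `X_num` and `X_reduced`, and the choice of `k = 2` are illustrative assumptions; the notebook does not show how the decomposition was used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_num = rng.normal(size=(100, 5))  # stand-in for a numeric feature matrix
X_scaled = StandardScaler().fit_transform(X_num)

# Thin SVD: U is (100, 5), sig holds 5 singular values, Vt is (5, 5).
U, sig, Vt = np.linalg.svd(X_scaled, full_matrices=False)

# Share of total variance captured by each singular direction.
explained = sig**2 / np.sum(sig**2)

# Project onto the first k right singular vectors to reduce dimensionality.
k = 2
X_reduced = X_scaled @ Vt[:k].T
print(explained.round(3), X_reduced.shape)
```

Passing `full_matrices=False` matters at the scale of this data: with roughly 116,000 training rows, the full `U` would be a square matrix of that size.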