# Commercial Bank Customer Retention Prediction

## APSTA-GE.2401: Statistical Consulting

## Scripts

Created on: 12/08/2020

Modified on: 12/08/2020

## Data Processing

----

### Description

This script processes the output of the preprocessing step.

### Data

The data are preprocessed feature sets:

  - `X_train.csv`: contains all features in Q3 and Q4 of 2019 for training. Imported as `X`.
  - `y_train.csv`: contains the label variable for validation. Imported as `y`.
  - `X_test.csv`: contains all features in Q1 of 2020 for testing. Imported as `X_true`.
   
After importing the data, we confirmed that the two training files (`X` and `y`) have the same number of records: **145296**. We also confirmed that the testing set has **76722** records.

### Procedures

We first inspected the feature set. 

1. There are 55 features in the feature set. 

2. We checked the feature set for missing values. Multiple columns contain missing values, with missing rates ranging from about 0.005% to 100%. We dropped columns with a large share of missing values to reduce the computational burden. For columns with a small share of missing values, we used [Datawig](https://github.com/awslabs/datawig), a library that trains deep neural networks to impute missing values.

    - After dropping the columns with a large share of missing values, the number of features fell to 45.

3. We then performed dummy coding to 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print('SUCCESS! All modules are imported.')

SUCCESS! All modules are imported.


In [2]:
X = pd.read_csv('../data/preprocess/X_train.csv')
y = pd.read_csv('../data/preprocess/y_train.csv')
X_true = pd.read_csv('../data/preprocess/X_test.csv')

In [3]:
print('The preprocessed training set has {} rows and {} columns.'.format(X.shape[0], X.shape[1]))
print('The preprocessed validation set has {} rows and {} columns.'.format(y.shape[0], y.shape[1]))
print('The preprocessed testing set has {} rows and {} columns.'.format(X_true.shape[0], X_true.shape[1]))

The preprocessed training set has 145296 rows and 56 columns.
The preprocessed validation set has 145296 rows and 2 columns.
The preprocessed testing set has 76722 rows and 56 columns.


### Missing Values

We first processed missing values in the data.

In [4]:
# Check missing values
missing_val = X.isnull().sum()
for index in missing_val.index:
    if missing_val[index] > 0:
        print('{} has {} missing values. ({:.4%})'.format(index, missing_val[index], missing_val[index]/len(X)))

B6 has 8878 missing values. (6.1103%)
E2 has 6370 missing values. (4.3842%)
E3 has 6370 missing values. (4.3842%)
E4 has 84483 missing values. (58.1454%)
E5 has 55129 missing values. (37.9425%)
E6 has 7538 missing values. (5.1880%)
E7 has 142402 missing values. (98.0082%)
E8 has 127381 missing values. (87.6700%)
E9 has 145227 missing values. (99.9525%)
E10 has 816 missing values. (0.5616%)
E11 has 145296 missing values. (100.0000%)
E12 has 121324 missing values. (83.5013%)
E13 has 127502 missing values. (87.7533%)
E14 has 90010 missing values. (61.9494%)
E16 has 68530 missing values. (47.1658%)
E18 has 62147 missing values. (42.7727%)
C1 has 7 missing values. (0.0048%)
C2 has 7 missing values. (0.0048%)
I1 has 64 missing values. (0.0440%)
I5 has 11604 missing values. (7.9865%)
I9 has 145296 missing values. (100.0000%)
I10 has 128487 missing values. (88.4312%)
I13 has 143108 missing values. (98.4941%)
I14 has 129650 missing values. (89.2316%)
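
The report above can be turned into a drop list and an impute list with a few lines of pandas. The following is a minimal sketch rather than the notebook's actual code: the helper name `split_by_missingness`, the toy column names, and the 0.8 cutoff are all assumptions, since the text does not state what counts as a "large portion" of missing values. A cutoff of 0.8 would flag ten columns in the report above, which is consistent with the reduction from 55 to 45 features, but other cutoffs between roughly 0.62 and 0.83 would flag the same ten.

```python
import numpy as np
import pandas as pd


def split_by_missingness(df, drop_threshold=0.8):
    """Split columns into (to_drop, to_impute) by their fraction of missing values.

    drop_threshold is an assumed cutoff; the original analysis does not state one.
    """
    missing_frac = df.isnull().mean()  # per-column share of missing values
    to_drop = missing_frac[missing_frac > drop_threshold].index.tolist()
    to_impute = missing_frac[
        (missing_frac > 0) & (missing_frac <= drop_threshold)
    ].index.tolist()
    return to_drop, to_impute


# Toy frame standing in for X; the column names are illustrative only.
toy = pd.DataFrame({
    'complete': np.arange(10, dtype=float),   # 0% missing
    'few_missing': [np.nan] + [1.0] * 9,      # 10% missing
    'mostly_missing': [np.nan] * 9 + [1.0],   # 90% missing
    'empty': [np.nan] * 10,                   # 100% missing
})

to_drop, to_impute = split_by_missingness(toy, drop_threshold=0.8)
toy_reduced = toy.drop(columns=to_drop)

print('Dropped:', to_drop)
print('Left for imputation:', to_impute)
```

Applied to the real data, the same call would be `split_by_missingness(X)`, with the returned drop list also applied to `X_true` so that the training and testing sets keep identical columns.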
