# Commercial Bank Customer Retention Prediction

## APSTA-GE.2401: Statistical Consulting

## Scripts

Created on: 12/08/2020

Modified on: 12/09/2020

## Data Processing

----

### Description

This script processes the output of the preprocessing step.

### Data

The data are preprocessed feature sets:

  - `X_train.csv`: contains all features in Q3 and Q4 of 2019 for training. Imported as `X`.
  - `y_train.csv`: contains the label variable for validation. Imported as `y`.
  - `X_test.csv`: contains all features in Q1 of 2020 for testing. Imported as `X_true`.
   
After importing the data, we confirmed that both train sets have the same number of records: **145296**. We also confirmed that the testing set has **76722** records.

### Procedures

We first inspected the feature set. 

1. There are 55 features in the feature set. 

2. We checked the set for missing values. Multiple columns contain them, with missing rates ranging from 0.005% to 100%. We dropped columns with a large portion of missing values to reduce the computational burden. For columns with a small portion of missing values, we applied [Datawig](https://github.com/awslabs/datawig), a library that trains deep neural networks to impute missing values.

    - After dropping the columns with a large portion of missing values, the number of features fell to 45.

3. We then performed dummy coding to convert date-time and categorical columns into numeric inputs.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, date

print('SUCCESS! All modules are imported.')

SUCCESS! All modules are imported.


In [2]:
X = pd.read_csv('../data/preprocess/X_train.csv')
y = pd.read_csv('../data/preprocess/y_train.csv')
X_true = pd.read_csv('../data/preprocess/X_test.csv')

In [3]:
print('The preprocessed training set has {} rows and {} columns.'.format(X.shape[0], X.shape[1]))
print('The preprocessed validation set has {} rows and {} columns.'.format(y.shape[0], y.shape[1]))
print('The preprocessed testing set has {} rows and {} columns.'.format(X_true.shape[0], X_true.shape[1]))

The preprocessed training set has 145296 rows and 56 columns.
The preprocessed validation set has 145296 rows and 2 columns.
The preprocessed testing set has 76722 rows and 56 columns.


### Functions

In [6]:
def check_missing(dat):
    '''Print the count and percentage of missing values for each column of dat
    @Param df dat: input data frame
    '''
    missing_val = dat.isnull().sum()
    for index in missing_val.index:
        if missing_val[index] > 0:
            # Divide by len(dat), not the global X, so the rate is correct for any frame
            print('{} has {} missing values. ({:.4%})'.format(index, missing_val[index], missing_val[index]/len(dat)))

In [7]:
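def code_cat_dummy(dat, col):
    '''Replace each level of a categorical column with an integer code, in place
    Levels are numbered by descending frequency (most frequent level = 0).
    @Param df dat: input data frame
    @Param str col: column name as a string
    '''
    levels = dat[col].value_counts().index
    # Build the full mapping first so a newly assigned code is never
    # mistaken for an original level (a risk with sequential replace calls)
    mapping = {level: code for code, level in enumerate(levels)}
    # Missing values are not in the mapping and therefore stay missing
    dat[col] = dat[col].map(mapping)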

In [8]:
def code_df_dummy(dat, col, day0, fmt):
    '''Convert col in dat to the number of days before day0, as a float
    @Param df dat: input data frame
    @Param str col: column name
    @Param datetime day0: reference date
    @Param str fmt: date time format
    '''
    # errors='coerce' turns unparseable entries into NaT so they stay missing
    parsed = pd.to_datetime(dat[col], format=fmt, errors='coerce')
    # Vectorized subtraction avoids the slow row-by-row loop
    dat[col] = (day0 - parsed).dt.total_seconds() / (24 * 60 * 60)

----

## y (Label for Validation)

We applied `LabelBinarizer` to make the label binary. The original label column contains three values:
- 1: churn
- 0: no preference
- -1: no churn

In [24]:
y['label'].value_counts()

 1    92818
 0    30237
-1    22241
Name: label, dtype: int64
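
With three distinct values, the label is not yet binary, and the cells shown here do not say how the `0` (no preference) class was assigned. One possible recoding, offered purely as an assumption, treats churn (`1`) as the positive class and groups the other two values together. The toy labels and the `churn` column name below are made up for illustration.

```python
import pandas as pd

# Made-up labels using the three codes described above
toy_y = pd.DataFrame({'label': [1, 0, -1, 1, -1]})

# Assumption: churn (1) is the positive class; 0 and -1 both count as non-churn
toy_y['churn'] = (toy_y['label'] == 1).astype(int)
print(toy_y['churn'].tolist())
```

Note that scikit-learn's `LabelBinarizer`, when fit on more than two classes, produces one indicator column per class rather than a single 0/1 column, so a recoding of this kind may be needed before or instead of it.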

## X (Feature for Training)

### Missing Values

We first processed missing values in the data. Multiple columns contain missing values, with the share missing per column ranging from 0.0048% to 100.00%. We removed the ten columns in which more than 80% of the values are missing.

In [9]:
# Check missing values
check_missing(X)

B6 has 8878 missing values. (6.1103%)
E2 has 6370 missing values. (4.3842%)
E3 has 6370 missing values. (4.3842%)
E4 has 84483 missing values. (58.1454%)
E5 has 55129 missing values. (37.9425%)
E6 has 7538 missing values. (5.1880%)
E7 has 142402 missing values. (98.0082%)
E8 has 127381 missing values. (87.6700%)
E9 has 145227 missing values. (99.9525%)
E10 has 816 missing values. (0.5616%)
E11 has 145296 missing values. (100.0000%)
E12 has 121324 missing values. (83.5013%)
E13 has 127502 missing values. (87.7533%)
E14 has 90010 missing values. (61.9494%)
E16 has 68530 missing values. (47.1658%)
E18 has 62147 missing values. (42.7727%)
C1 has 7 missing values. (0.0048%)
C2 has 7 missing values. (0.0048%)
I1 has 64 missing values. (0.0440%)
I5 has 11604 missing values. (7.9865%)
I9 has 145296 missing values. (100.0000%)
I10 has 128487 missing values. (88.4312%)
I13 has 143108 missing values. (98.4941%)
I14 has 129650 missing values. (89.2316%)


In [10]:
X_original = X.copy()

In [11]:
# Drop columns with large portion of missing values
col_to_drop = ['E7', 'E8', 'E9', 'E11', 'E12', 'E13', 'I9', 'I10', 'I13', 'I14']
X = X.drop(col_to_drop, axis=1)

In [12]:
print('After dropping columns containing large portion of missing values, now the set has {} columns.'.format(X.shape[1]))

After dropping columns containing large portion of missing values, now the set has 46 columns.
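
The list of columns to drop could also be derived from the missing-value rate instead of being written by hand. The sketch below is illustrative only: the 0.8 cutoff and the helper name `drop_sparse_columns` are assumptions, not values taken from the analysis. With the rates printed above, a 0.8 cutoff would select the same ten columns.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(dat, threshold=0.8):
    """Return a copy of dat without columns whose missing rate exceeds threshold.

    The 0.8 default is a hypothetical cutoff chosen for illustration.
    """
    missing_rate = dat.isnull().mean()  # fraction of missing values per column
    sparse_cols = missing_rate[missing_rate > threshold].index
    return dat.drop(columns=sparse_cols)


# Toy frame: 'b' is 100% missing, 'c' is 25% missing
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, np.nan, np.nan],
    'c': [1.0, np.nan, 3.0, 4.0],
})
reduced = drop_sparse_columns(toy)
print(list(reduced.columns))
```

Only `b` is removed from the toy frame, since `c` falls well under the cutoff.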


In [13]:
check_missing(X)

B6 has 8878 missing values. (6.1103%)
E2 has 6370 missing values. (4.3842%)
E3 has 6370 missing values. (4.3842%)
E4 has 84483 missing values. (58.1454%)
E5 has 55129 missing values. (37.9425%)
E6 has 7538 missing values. (5.1880%)
E10 has 816 missing values. (0.5616%)
E14 has 90010 missing values. (61.9494%)
E16 has 68530 missing values. (47.1658%)
E18 has 62147 missing values. (42.7727%)
C1 has 7 missing values. (0.0048%)
C2 has 7 missing values. (0.0048%)
I1 has 64 missing values. (0.0440%)
I5 has 11604 missing values. (7.9865%)


### Drop Meaningless Columns

Based on the codebook and an inspection of the data, we determined that the following columns carry no useful information and dropped them:

- `I8`: constellation. We do not believe constellation alters customer behavior.
- `I12`: field description. Contains only one distinct value.
- `I15`: QR code recipient.

In [14]:
col_to_drop = ['I8', 'I12', 'I15']
X = X.drop(col_to_drop, axis=1)

In [15]:
print('After dropping meaningless columns, the set now has {} columns.'.format(X.shape[1]))

After dropping meaningless columns, the set now has 43 columns.


### Dummy Coding

Before applying `Datawig`, we dummy coded categorical columns.


#### Date Time Columns

To dummy code columns containing date and time, we used `2019-12-31` as day0 and converted each date time into a numeric input: the number of days before day0.
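
As a worked example of this conversion, the sketch below applies it to a few made-up dates, using the reference date that appears in the code cells (`2019-12-31`). The variable names are illustrative and not part of the notebook.

```python
import pandas as pd

day0 = pd.Timestamp('2019-12-31')  # reference date used in the code cells
dates = pd.Series(['2019-12-01', '2019-07-01', None])  # made-up values

# Unparseable or missing entries become NaT and stay missing after conversion
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

# Days elapsed between each date and day0, as floats
days_before_day0 = (day0 - parsed).dt.total_seconds() / (24 * 60 * 60)
print(days_before_day0.tolist())
```

The first two dates fall 30 and 183 days before day0, and the missing entry remains missing so that it can still be imputed later.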

In [16]:
X.columns

Index(['cust_no', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'B1', 'B2',
       'B3', 'B4', 'B5', 'B6', 'B7', 'E1', 'E2', 'E3', 'E4', 'E5', 'E6', 'E10',
       'E14', 'E15', 'E16', 'E17', 'E18', 'C1', 'C2', 'I1', 'I2', 'I3', 'I4',
       'I5', 'I6', 'I7', 'I11', 'I16', 'I17', 'I18', 'I19', 'I20'],
      dtype='object')

In [18]:
def code_df_dummy(dat, col, day0, fmt):
    '''Convert col in dat to the number of days before day0, as a float
    @Param df dat: input data frame
    @Param str col: column name
    @Param datetime day0: reference date
    @Param str fmt: date time format
    '''
    # errors='coerce' turns unparseable entries into NaT so they stay missing
    parsed = pd.to_datetime(dat[col], format=fmt, errors='coerce')
    # Vectorized subtraction avoids the slow row-by-row loop
    dat[col] = (day0 - parsed).dt.total_seconds() / (24 * 60 * 60)

In [22]:
# B6: Latest transfer time
day0 = datetime(2019, 12, 31)
fmt = '%Y-%m-%d %H:%M:%S'
code_df_dummy(X, 'B6', day0, fmt)

In [None]:
# E category
day0 = datetime(2019, 12, 31)
fmt = '%Y-%m-%d'
col_names = ['E1', 'E2', 'E3', 'E4', 'E5', 'E6', 'E10', 'E16', 'E18']
for col_name in col_names:
    code_df_dummy(X, col_name, day0, fmt)

#### Categorical Columns

In [None]:
# I1: Gender
code_cat_dummy(X, 'I1')

In [None]:
# I3: Class
code_cat_dummy(X, 'I3')

In [None]:
# I5: Occupation
code_cat_dummy(X, 'I5')
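
To make the coding concrete, the sketch below contrasts the frequency-rank integer coding that `code_cat_dummy` performs with indicator (one-hot) columns from `pd.get_dummies`. The toy values and the `I1_code` column name are made up for illustration; which form suits the later imputation and modeling steps is not settled here.

```python
import pandas as pd

toy = pd.DataFrame({'I1': ['F', 'M', 'F', None, 'F']})  # made-up values

# Integer coding by frequency rank: the most frequent level becomes 0
levels = toy['I1'].value_counts().index
mapping = {level: code for code, level in enumerate(levels)}
toy['I1_code'] = toy['I1'].map(mapping)  # missing values stay missing

# Indicator columns, dropping the first level as the reference category
indicators = pd.get_dummies(toy['I1'], prefix='I1', drop_first=True)

print(toy['I1_code'].tolist())
print(list(indicators.columns))
```

Because the integer codes depend on frequency rank, coding `X` and `X_true` separately can assign different codes to the same level if the ranks differ between the two sets. Reusing one mapping for both is a safer pattern.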

----

## X_true (Features for Testing)

### Missing Values

As with `X`, we first processed missing values in `X_true`. Multiple columns contain missing values. The percentage of missing values in each column ranges from 0.0048% to 100.00%. We removed the same columns with a large portion of missing values that were dropped from `X`.

### Drop Meaningless Columns

Based on the codebook and an inspection of the data, we determined that the following columns carry no useful information and dropped them:

- `I8`: constellation. We do not believe constellation alters customer behavior.
- `I12`: field description. Contains only one distinct value.
- `I15`: QR code recipient.

In [None]:
# Check missing values
check_missing(X_true)

In [None]:
# Keep an untouched copy of the testing set (not of X)
X_true_original = X_true.copy()

In [None]:
# Drop columns with large portion of missing values
col_to_drop = ['E7', 'E8', 'E9', 'E11', 'E12', 'E13', 'I9', 'I10', 'I13', 'I14',
              'I8', 'I12', 'I15']
X_true = X_true.drop(col_to_drop, axis=1)

In [None]:
print('After dropping columns containing large portion of missing values and meaningless columns, now the set has {} columns.'.format(X_true.shape[1]))

In [None]:
check_missing(X_true)
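
Since the same columns are dropped from both sets, the training and testing frames should end up with identical columns. A small check along the following lines can confirm that before any model is fit. The helper name `column_mismatch` and the toy frames are hypothetical.

```python
import pandas as pd


def column_mismatch(train, test):
    """Return the columns found in only one of the two frames (hypothetical helper)."""
    only_train = [c for c in train.columns if c not in test.columns]
    only_test = [c for c in test.columns if c not in train.columns]
    return only_train, only_test


# Toy frames: 'I8' was dropped from the test frame but not from the train frame
toy_train = pd.DataFrame({'cust_no': [1], 'X1': [0.5], 'I8': ['a']})
toy_test = pd.DataFrame({'cust_no': [2], 'X1': [0.7]})

only_train, only_test = column_mismatch(toy_train, toy_test)
print(only_train, only_test)
```

Two empty lists would indicate that `X` and `X_true` are aligned.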

### Dummy Coding

Before applying `Datawig`, we dummy coded categorical columns.

In [None]:
# B6: Latest transfer time
day0 = datetime(2019, 12, 31)
fmt = '%Y-%m-%d %H:%M:%S'
code_df_dummy(X_true, 'B6', day0, fmt)

In [None]:
# E category
day0 = datetime(2019, 12, 31)
fmt = '%Y-%m-%d'
col_names = ['E1', 'E2', 'E3', 'E4', 'E5', 'E6', 'E10', 'E16', 'E18']
for col_name in col_names:
    code_df_dummy(X_true, col_name, day0, fmt)

In [None]:
# I1: Gender
code_cat_dummy(X_true, 'I1')

In [None]:
# I3: Class
code_cat_dummy(X_true, 'I3')

In [None]:
# I5: Occupation
code_cat_dummy(X_true, 'I5')