# Commercial Bank Customer Retention Prediction

## APSTA-GE.2401: Statistical Consulting

### Scripts

## Data Preprocessing

### Code Book
##### aum_m(Y)
This data set contains each customer's assets at the end of month Y.

| Variable Name | Description |
| ------------- | ------------- |
| cust_no  | customer's ID (primary key)  |
| X1  | structured deposit balance |
| X2  | time deposit balance  |
| X3  | demand deposit balance  |
| X4  | financial products balance  |
| X5  | fund balance  |
| X6  | asset management balance  |
| X7  | loan balance  |
| X8  | large deposit certificate balance  |

##### behavior_m(Y)
This data set records customers' behaviors in month Y.

B6 and B7 are available only for months 3, 6, 9, and 12, the last month of each quarter.

| Variable Name | Description |
| ------------- | ------------- |
| cust_no  | customer's ID (primary key)  |
| B1  | mobile banking login times |
| B2  | transfer-in times |
| B3  | transfer-in money amount  |
| B4  | transfer-out times |
| B5  | transfer-out money amount  |
| B6  | latest transfer time  |
| B7  | number of transfers in the quarter |
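Because B6 and B7 exist only for quarter-end months, combining the monthly behavior files needs a step that tolerates the missing columns. The following is a minimal sketch, not the project's actual code: `stack_months` is a hypothetical helper, the two frames are synthetic, and whether the non-quarter-end files omit the columns or leave them empty is not stated here. `pd.concat` takes the union of the columns and fills the gaps with NaN, so both cases are handled.

```python
import pandas as pd


def stack_months(frames_by_month):
    """Stack monthly frames into one long frame with a `month` column.

    Columns that are absent in some months (for example B6 and B7)
    are filled with NaN by pd.concat.
    """
    parts = [df.assign(month=month) for month, df in frames_by_month.items()]
    return pd.concat(parts, ignore_index=True, sort=False)


# Synthetic stand-ins for a regular month and a quarter-end month.
m2 = pd.DataFrame({"cust_no": ["0xa"], "B1": [3]})
m3 = pd.DataFrame(
    {"cust_no": ["0xa"], "B1": [5], "B6": ["2020-03-31 22:06:00"], "B7": [4]}
)

stacked = stack_months({2: m2, 3: m3})
print(stacked)
```

The long format keeps one row per customer and month, which makes it clear that a NaN in B6 or B7 for month 2 means "not reported this month" rather than "no transfers".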

##### big_event_Q(Z)
This data set records customers' major events in quarter Z.

| Variable Name | Description |
| ------------- | ------------- |
| cust_no  | customer's ID (primary key)  |
| E1  | account opening date |
| E2  | online banking opening date |
| E3  | mobile banking opening date  |
| E4  | first online banking login date |
| E5  | first mobile banking login date |
| E6  | first demand deposit date |
| E7  | first time deposit date |
| E8  | first loan date |
| E9  | first overdue date |
| E10  | first cash transaction date |
| E11  | first bank-securities transfer date |
| E12  | first transfer at counter date |
| E13  | first transfer via online banking date |
| E14  | first transfer via mobile banking date |
| E15  | maximum amount transferred out of another bank |
| E16  | date of the maximum amount transferred out of another bank |
| E17  | maximum transfer amount from other bank |
| E18  | date of the maximum transfer amount from other bank |

##### cunkuan_m(Y)
This data set contains customers' deposits in month Y.

| Variable Name | Description |
| ------------- | ------------- |
| cust_no  | customer's ID (primary key)  |
| C1  | value of deposit products |
| C2  | number of deposit products  |

##### cust_avli_Q(Z)
This data set contains the list of customers in quarter Z.

| Variable Name | Description |
| ------------- | ------------- |
| cust_no  | customer's ID (primary key)  |

##### cust_info_q(Z)
This data set contains customer information in quarter Z.

| Variable Name | Description |
| ------------- | ------------- |
| cust_no  | customer's ID (primary key)  |
| l1  | gender |
| l2  | age |
| l3  | class |
| l4  | tag |
| l5  | occupation |
| l6  | deposit customer tag |
| l7  | number of products held |
| l8  | zodiac sign |
| l9  | contribution |
| l10  | education level |
| l11  | family annual income |
| l12  | field description |
| l13  | marriage description |
| l14  | occupation description |
| l15  | QR code recipient |
| l16  | VIP |
| l17  | online banking client |
| l18  | mobile banking client |
| l19  | SMS client |
| l20  | WeChat Pay client |

In [2]:
import pandas as pd
import csv
import glob
import re
import os

In [12]:
X_test = '../data/raw/x_test/'
X_train = '../data/raw/x_train/'
y_train_3 = '../data/raw/y_train_3/'

### Load in Test Data

In [25]:
aum_m1 = pd.read_csv(X_test + "aum_test/aum_m1.csv")
aum_m2 = pd.read_csv(X_test + "aum_test/aum_m2.csv")
aum_m3 = pd.read_csv(X_test + "aum_test/aum_m3.csv")

behavior_m1 = pd.read_csv(X_test + "behavior_test/behavior_m1.csv")
behavior_m2 = pd.read_csv(X_test + "behavior_test/behavior_m2.csv")
behavior_m3 = pd.read_csv(X_test + "behavior_test/behavior_m3.csv")

cunkuan_m1 = pd.read_csv(X_test + "cunkuan_test/cunkuan_m1.csv")
cunkuan_m2 = pd.read_csv(X_test + "cunkuan_test/cunkuan_m2.csv")
cunkuan_m3 = pd.read_csv(X_test + "cunkuan_test/cunkuan_m3.csv")

big_event_Q1 = pd.read_csv(X_test + "big_event_Q1.csv")
cust_avli_Q1 = pd.read_csv(X_test + "cust_avli_Q1.csv")
cust_info_q1 = pd.read_csv(X_test + "cust_info_q1.csv")

### Load in Train Data

In [26]:
aum_m7 = pd.read_csv(X_train + "aum_train/aum_m7.csv")
aum_m8 = pd.read_csv(X_train + "aum_train/aum_m8.csv")
aum_m9 = pd.read_csv(X_train + "aum_train/aum_m9.csv")
aum_m10 = pd.read_csv(X_train + "aum_train/aum_m10.csv")
aum_m11 = pd.read_csv(X_train + "aum_train/aum_m11.csv")
aum_m12 = pd.read_csv(X_train + "aum_train/aum_m12.csv")

behavior_m7 = pd.read_csv(X_train + "behavior_train/behavior_m7.csv")
behavior_m8 = pd.read_csv(X_train + "behavior_train/behavior_m8.csv")
behavior_m9 = pd.read_csv(X_train + "behavior_train/behavior_m9.csv")
behavior_m10 = pd.read_csv(X_train + "behavior_train/behavior_m10.csv")
behavior_m11 = pd.read_csv(X_train + "behavior_train/behavior_m11.csv")
behavior_m12 = pd.read_csv(X_train + "behavior_train/behavior_m12.csv")

big_event_Q3 = pd.read_csv(X_train + "big_event_train/big_event_Q3.csv")
big_event_Q4 = pd.read_csv(X_train + "big_event_train/big_event_Q4.csv")

cunkuan_m7 = pd.read_csv(X_train + "cunkuan_train/cunkuan_m7.csv")
cunkuan_m8 = pd.read_csv(X_train + "cunkuan_train/cunkuan_m8.csv")
cunkuan_m9 = pd.read_csv(X_train + "cunkuan_train/cunkuan_m9.csv")
cunkuan_m10 = pd.read_csv(X_train + "cunkuan_train/cunkuan_m10.csv")
cunkuan_m11 = pd.read_csv(X_train + "cunkuan_train/cunkuan_m11.csv")
cunkuan_m12 = pd.read_csv(X_train + "cunkuan_train/cunkuan_m12.csv")

cust_avli_Q3 = pd.read_csv(X_train + "cust_avli_Q3.csv")
cust_avli_Q4 = pd.read_csv(X_train + "cust_avli_Q4.csv")

cust_info_q3 = pd.read_csv(X_train + "cust_info_q3.csv")
cust_info_q4 = pd.read_csv(X_train + "cust_info_q4.csv")
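The monthly files follow a regular naming pattern (`aum_m7.csv` through `aum_m12.csv`, and so on), so the repeated `read_csv` calls could be generated in a loop. This is a hedged sketch only: `load_monthly` is a hypothetical helper that is not part of the notebook, and it assumes every requested file exists under the given folder.

```python
from pathlib import Path

import pandas as pd


def load_monthly(base_dir, prefix, months):
    """Read <base_dir>/<prefix>_m<month>.csv for each month.

    Returns a dict keyed by month number, e.g. {7: DataFrame, 8: DataFrame}.
    """
    base = Path(base_dir)
    return {month: pd.read_csv(base / f"{prefix}_m{month}.csv") for month in months}


# Possible usage with the paths defined in the cells above (left commented
# out so the sketch does not depend on the raw data being present):
# aum_train = load_monthly(X_train + "aum_train", "aum", range(7, 13))
# aum_train[7] would then hold the same data as aum_m7
```

Keeping the frames in a dict keyed by month also removes the need to remember which tuple position corresponds to which month in later cells.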

In [27]:
y_Q3_3 = pd.read_csv(y_train_3 + "y_Q3_3.csv")
y_Q4_3 = pd.read_csv(y_train_3 + "y_Q4_3.csv")
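Since `cust_no` is described as the primary key of every table, the labels can presumably be attached to a feature table with a keyed merge. The sketch below is an assumption-laden illustration: `attach_labels` is a hypothetical helper, the frames are synthetic, and the column name `label` is invented because the actual columns of `y_Q3_3.csv` are not shown here.

```python
import pandas as pd


def attach_labels(features, labels, key="cust_no"):
    """Inner-join features with labels on `key`.

    validate="one_to_one" makes pandas raise an error if `key` is
    duplicated on either side, which would contradict the code book.
    """
    return features.merge(labels, on=key, how="inner", validate="one_to_one")


# Synthetic stand-ins; the real label file may use different column names.
features = pd.DataFrame({"cust_no": ["0xa", "0xb", "0xc"], "X1": [0.0, 10.0, 5.0]})
labels = pd.DataFrame({"cust_no": ["0xa", "0xb"], "label": [1, 0]})

train = attach_labels(features, labels)
print(train)
```

An inner join drops customers without a label, which is usually what a training table needs; a left join with `indicator=True` would instead show how many customers lack one.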

### Missing Values

In [139]:
# Print the position (within the given tuple) of every data set that has missing values
def nulltracker(frames):
    for position, frame in enumerate(frames):
        # isnull().any().any() is True if at least one cell in the frame is NaN
        if frame.isnull().any().any():
            print(position)
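The function above prints only tuple positions, so the reader has to map each position back to a data set by hand. A variant that reports data set names, affected columns, and missing counts might look like the sketch below. `null_summary` is a hypothetical helper and the `demo` frames are synthetic; it expects a dict of name to DataFrame rather than the tuples used in this notebook.

```python
import pandas as pd


def null_summary(frames):
    """Return one row per (dataset, column) that contains missing values."""
    rows = []
    for name, df in frames.items():
        counts = df.isnull().sum()
        for col, n in counts[counts > 0].items():
            rows.append(
                {
                    "dataset": name,
                    "column": col,
                    "n_missing": int(n),
                    "share_missing": n / len(df),
                }
            )
    return pd.DataFrame(
        rows, columns=["dataset", "column", "n_missing", "share_missing"]
    )


# Synthetic example: only column "x" of data set "a" has a missing value.
demo = {
    "a": pd.DataFrame({"x": [1.0, None, 3.0], "y": [1, 2, 3]}),
    "b": pd.DataFrame({"x": [1, 2]}),
}
summary = null_summary(demo)
print(summary)
```

With the notebook's data this could be called as, for example, `null_summary({"behavior_m3": behavior_m3, "behavior_m9": behavior_m9})`.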

In [127]:
aum_m = aum_m1, aum_m2, aum_m3, aum_m7, aum_m8, aum_m9, aum_m10, aum_m11, aum_m12
nulltracker(aum_m)

In [151]:
behavior_m = behavior_m1, behavior_m2, behavior_m3, behavior_m7, behavior_m8, behavior_m9, behavior_m10, behavior_m11, behavior_m12 
nulltracker(behavior_m)

2
5
8


In [129]:
cunkuan_m = cunkuan_m1, cunkuan_m2, cunkuan_m3, cunkuan_m7, cunkuan_m8, cunkuan_m9, cunkuan_m10, cunkuan_m11, cunkuan_m12
nulltracker(cunkuan_m)

In [141]:
big_event_Q = big_event_Q1, big_event_Q3, big_event_Q4
nulltracker(big_event_Q)

0
1
2


In [142]:
cust_avli_Q = cust_avli_Q1, cust_avli_Q3, cust_avli_Q4
nulltracker(cust_avli_Q)

In [143]:
cust_info_q = cust_info_q1, cust_info_q3, cust_info_q4
nulltracker(cust_info_q)

0
1
2


In [144]:
y_Q = y_Q3_3, y_Q4_3
nulltracker(y_Q)

In [146]:
# behavior_m[2] is behavior_m3, the first data set flagged by nulltracker above.
# Inspect its B6 column (latest transfer time), which contains the missing values.
behavior_m[2]["B6"]

0                         NaN
1                         NaN
2                         NaN
3         2020-03-31 22:06:00
4                         NaN
5         2020-03-30 19:07:00
6                         NaN
7                         NaN
8                         NaN
9                         NaN
10                        NaN
11                        NaN
12                        NaN
13                        NaN
14                        NaN
15        2020-01-16 11:07:00
16                        NaN
17                        NaN
18                        NaN
19                        NaN
20                        NaN
21                        NaN
22                        NaN
23                        NaN
24        2020-01-20 04:17:00
25                        NaN
26                        NaN
27                        NaN
28                        NaN
29                        NaN
                 ...         
659594                    NaN
659595                    NaN
659596    

In [156]:
behavior_m12

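|  | cust_no | B1 | B2 | B3 | B4 | B5 | B6 | B7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0xb2d14994 | 5 | 2 | 1346.15 | 2 | 5346.15 | 2019-12-13 18:03:00 | 22 |
| 1 | 0xb2d65824 | 0 | 0 | 0.00 | 0 | 0.00 | NaN | 0 |
| 2 | 0xb2d539b7 | 0 | 0 | 0.00 | 0 | 0.00 | NaN | 0 |
| 3 | 0xb2d807ae | 0 | 0 | 0.00 | 0 | 0.00 | NaN | 0 |
| 4 | 0xb2d176b2 | 14 | 3 | 292654.81 | 8 | 323939.53 | 2019-12-31 06:02:00 | 28 |
| 5 | 0xb2d1386f | 0 | 0 | 0.00 | 0 | 0.00 | NaN | 0 |
| 6 | 0xb2d5ae1e | 0 | 1 | 0.01 | 0 | 0.00 | 2019-12-13 19:35:00 | 1 |
| 7 | 0xb2d73522 | 0 | 0 | 0.00 | 0 | 0.00 | NaN | 0 |
| 8 | 0xb2d4bec7 | 0 | 0 | 0.00 | 0 | 0.00 | NaN | 0 |
| 9 | 0xb2d86da5 | 0 | 0 | 0.00 | 0 | 0.00 | NaN | 0 |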
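In the rows shown above, B6 is missing exactly where B7 is 0, which suggests (but does not prove) that a missing B6 simply means the customer made no transfer in the quarter. A quick way to test that reading on a full data set could look like this sketch. `check_b6_missing_pattern` is a hypothetical helper, and the sample rows are copied from the `behavior_m12` output above.

```python
import pandas as pd


def check_b6_missing_pattern(df):
    """Count how often a missing B6 coincides with B7 == 0."""
    b6 = pd.to_datetime(df["B6"], errors="coerce")  # unparseable values become NaT
    no_transfers = df["B7"] == 0
    return {
        "missing_b6": int(b6.isna().sum()),
        "missing_b6_with_transfers": int((b6.isna() & ~no_transfers).sum()),
        "b6_present_without_transfers": int((b6.notna() & no_transfers).sum()),
    }


# Rows 0, 1, 2 and 6 of the behavior_m12 output shown above.
sample = pd.DataFrame(
    {
        "B6": ["2019-12-13 18:03:00", None, None, "2019-12-13 19:35:00"],
        "B7": [22, 0, 0, 1],
    }
)
pattern = check_b6_missing_pattern(sample)
print(pattern)
```

If the two mismatch counts are zero on the full data, the missing timestamps carry no information beyond `B7 == 0`, and an indicator such as "had any transfer" may be a reasonable replacement for imputing a date.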


In [36]:
display(aum_m1.describe(), aum_m2.describe(), aum_m3.describe(), aum_m7.describe(), aum_m8.describe(), 
        aum_m9.describe(), aum_m10.describe(), aum_m11.describe(), aum_m12.describe())

|  | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 556195.0 | 556195.0 | 556195.0 | 556195.0 | 556195.0 | 556195.0 | 556195.0 | 556195.0 |
| mean | 47529.63 | 3296.358 | 4633.316 | 4351.619 | 193.9474 | 315.8859 | 64462.12 | 12647.14 |
| std | 1555244.0 | 547665.8 | 217063.8 | 100014.2 | 10393.17 | 27438.51 | 652513.7 | 186607.5 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 7.67 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 290000000.0 | 327849800.0 | 109041200.0 | 30000000.0 | 3740000.0 | 10000000.0 | 30000000.0 | 58000000.0 |


|  | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 603631.0 | 603631.0 | 603631.0 | 603631.0 | 603631.0 | 603631.0 | 603631.0 | 603631.0 |
| mean | 43858.46 | 3259.812 | 4922.958 | 3361.816 | 185.0724 | 297.5686 | 59819.33 | 12305.64 |
| std | 1492338.0 | 555233.9 | 210876.8 | 78903.69 | 8912.695 | 26737.54 | 629864.7 | 184637.7 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | -1.77 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 3.97 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 290000000.0 | 333514200.0 | 109041200.0 | 18772000.0 | 2100000.0 | 10000000.0 | 30000000.0 | 58000000.0 |


|  | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 659624.0 | 659624.0 | 659624.0 | 659624.0 | 659624.0 | 659624.0 | 659624.0 | 659624.0 |
| mean | 41027.85 | 2397.763 | 4548.653 | 3144.846 | 184.2109 | 259.9848 | 57239.52 | 12733.69 |
| std | 1397215.0 | 477838.9 | 223023.2 | 71846.28 | 15288.95 | 24475.11 | 625542.1 | 308846.7 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | -1.77 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 2.37 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 200000000.0 | 337250800.0 | 109145900.0 | 18220000.0 | 10000000.0 | 10000000.0 | 30000000.0 | 150000000.0 |


|  | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 465441.0 | 465441.0 | 465441.0 | 465441.0 | 465441.0 | 465441.0 | 465441.0 | 465441.0 |
| mean | 45778.86 | 3865.932 | 6319.957 | 3782.432 | 213.8875 | 505.8526 | 57950.86 | 7914.9 |
| std | 1511018.0 | 535391.2 | 279307.2 | 166731.6 | 12732.93 | 68757.66 | 619003.2 | 145328.3 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 4.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 347000000.0 | 210653500.0 | 100071800.0 | 68000000.0 | 4035640.0 | 40757140.0 | 34750000.0 | 50000000.0 |


|  | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 479063.0 | 479063.0 | 479063.0 | 479063.0 | 479063.0 | 479063.0 | 479063.0 | 479063.0 |
| mean | 50150.33 | 3243.406 | 4790.877 | 3527.987 | 207.1597 | 480.9048 | 59681.86 | 8685.593 |
| std | 1633422.0 | 508359.1 | 202644.2 | 121356.3 | 11906.52 | 68032.67 | 627669.7 | 154111.3 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 4.31 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 290000000.0 | 216889700.0 | 94675160.0 | 50860000.0 | 3024484.0 | 40862450.0 | 34750000.0 | 50000000.0 |


|  | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 493441.0 | 493441.0 | 493441.0 | 493441.0 | 493441.0 | 493441.0 | 493441.0 | 493441.0 |
| mean | 51071.97 | 3042.671 | 5425.967 | 3708.233 | 222.7427 | 427.9809 | 61600.78 | 10219.8 |
| std | 1656046.0 | 405245.5 | 302873.6 | 121347.4 | 14517.66 | 37401.05 | 639543.1 | 191507.7 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 5.73 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 290000000.0 | 216430700.0 | 142022500.0 | 50860000.0 | 5003991.0 | 11541710.0 | 32831460.0 | 50000000.0 |


|  | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 506513.0 | 506513.0 | 506513.0 | 506513.0 | 506513.0 | 506513.0 | 506513.0 | 506513.0 |
| mean | 49653.24 | 3594.652 | 6016.074 | 4000.847 | 203.9064 | 346.0443 | 62738.38 | 10664.76 |
| std | 1647982.0 | 570159.7 | 218568.2 | 81978.53 | 11563.78 | 30681.68 | 645591.8 | 187077.1 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 5.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 290000000.0 | 335737100.0 | 79130930.0 | 19860000.0 | 3100000.0 | 10182110.0 | 32783190.0 | 50000000.0 |


|  | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 521566.0 | 521566.0 | 521566.0 | 521566.0 | 521566.0 | 521566.0 | 521566.0 | 521566.0 |
| mean | 49963.68 | 3872.203 | 4449.279 | 4441.739 | 198.6803 | 337.6526 | 64102.21 | 11251.53 |
| std | 1623167.0 | 571757.6 | 192396.7 | 95223.27 | 10376.71 | 30055.5 | 651373.2 | 184065.2 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 5.75 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 290000000.0 | 334618500.0 | 79130930.0 | 30000000.0 | 2900067.0 | 10208150.0 | 32734560.0 | 50000000.0 |


|  | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 543823.0 | 543823.0 | 543823.0 | 543823.0 | 543823.0 | 543823.0 | 543823.0 | 543823.0 |
| mean | 47674.5 | 3674.215 | 5989.938 | 4462.179 | 257.2643 | 435.3697 | 63953.52 | 12304.28 |
| std | 1590177.0 | 575038.7 | 244861.2 | 116987.6 | 18356.57 | 39167.26 | 650883.5 | 206312.4 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 8.51 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 290000000.0 | 332067100.0 | 109041200.0 | 45000000.0 | 7000000.0 | 10010830.0 | 30000000.0 | 76000000.0 |
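Two things stand out in these summaries: most balance columns are zero up to the 75th percentile, and X5 (fund balance) has a minimum of -1.77 in two of the tables. The sketch below shows one way to quantify both per column. It is illustrative only: `balance_checks` is a hypothetical helper and `demo` is synthetic data that reuses the values -1.77 and 7.67 from the output above.

```python
import pandas as pd

# Balance columns from the aum_m(Y) code book.
BALANCE_COLS = [f"X{i}" for i in range(1, 9)]


def balance_checks(df, cols=BALANCE_COLS):
    """Per column: number of negative balances and share of exact zeros."""
    present = [c for c in cols if c in df.columns]
    values = df[present]
    return pd.DataFrame(
        {
            "n_negative": (values < 0).sum(),
            "share_zero": (values == 0).mean(),
        }
    )


# Synthetic frame with one negative fund balance.
demo = pd.DataFrame(
    {
        "cust_no": ["0xa", "0xb", "0xc", "0xd"],
        "X3": [0.0, 7.67, 0.0, 120.5],
        "X5": [0.0, -1.77, 0.0, 0.0],
    }
)
checks = balance_checks(demo)
print(checks)
```

How to treat the negative balances (clip to zero, keep, or investigate) is a modeling decision that the summaries alone do not settle.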
