# Commercial Bank Customer Retention Prediction

## APSTA-GE.2401: Statistical Consulting

## Scripts

Created on: 12/04/2020

Modified on: 12/06/2020

## Data Preprocess

### Description

This script preprocesses the raw data from the competition data warehouse. There are three main data packages:

- `x_train`: the training data package containing all features.
- `y_train`: the label data package used to validate model predictions.
- `x_test`: the test data package for prediction. It contains the same features as `x_train`.

We started by examining `y_train` because it contains the labels that validate our model predictions. `y_train` contains randomly sampled label data from two quarters: Q3_2020 and Q4_2020. Since the customer ID column, `cust_no`, contains only unique values within each quarter, we determined our data processing strategy as follows:

1. Use **quarter** to separate the data processing procedures. We created two training sets, `X_train_Q3` and `X_train_Q4`, and merged them before applying models. In this way, we avoided the customer IDs that random sampling duplicated across the two `y_train` sets, which maximized the number of labels that can be validated.

    - `y_Q3_3` contains 69126 rows; `y_Q4_3` contains 76170 rows.
    - `y_train` has 62397 duplicated customer IDs.
    - `y_train` has 40090 completely identical records (same customer ID, same label).
    - The two samples overlap heavily.
    - 22307 customers changed their churn preference from Q3 to Q4.
    
2. Based on the quarterly separated `y_train` sets, we merged the `X_train` raw data accordingly. For each quarter, we dropped duplicated customer IDs except for the last occurrence.

3. During data preprocessing, we examined the records in the `cust_avli` files of the `X_train` sets. These files contain the IDs of all valid customers. We confirmed that these IDs are the same as those in the `y_train` set. Therefore, we trimmed the dataset based on the `cust_no` column in `cust_avli`, separated by quarter.

    - Confirmed that `cust_no` in `cust_avli` is the key indexing column.
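The three steps above can be sketched on toy data. The frames and values below are hypothetical stand-ins for the monthly raw files and the quarterly labels; only the sequence of operations (concatenate, keep the last occurrence, trim by labeled IDs, left-merge) mirrors the strategy described here.

```python
import pandas as pd

# Toy monthly feature files for one quarter (hypothetical values).
month_7 = pd.DataFrame({'cust_no': ['a', 'b'], 'X1': [10.0, 20.0]})
month_8 = pd.DataFrame({'cust_no': ['a', 'c'], 'X1': [11.0, 30.0]})
month_9 = pd.DataFrame({'cust_no': ['a', 'd'], 'X1': [12.0, 40.0]})

# Toy labels for the same quarter; customer IDs are unique within a quarter.
labels = pd.DataFrame({'cust_no': ['a', 'c', 'e'], 'label': [1, 0, 1]})

# Step 1: stack the monthly files in chronological order.
features = pd.concat([month_7, month_8, month_9], ignore_index=True)

# Step 2: keep only the last occurrence of each customer ID.
features = features.drop_duplicates(subset=['cust_no'], keep='last')

# Step 3: trim to the customers that have a label in this quarter.
features = features[features['cust_no'].isin(labels['cust_no'])]

# Step 4: left-merge onto the label IDs so every labeled customer keeps a row.
X_quarter = labels[['cust_no']].merge(features, how='left', on='cust_no')
print(X_quarter)
```

Customer `e` has a label but no feature rows, so the left merge keeps the row and fills `X1` with `NaN` instead of dropping the customer.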

In [2]:
import pandas as pd
import numpy as np
import csv
import glob
import re
import os

print('SUCCESS! All modules are imported.')

SUCCESS! All modules are imported.


In [157]:
def merge_file(path):
    '''Concatenate all files that match a glob pattern
    Param: path: glob pattern of the files to import and concatenate
    '''
    # Sort the matches so the concatenation order is deterministic;
    # glob.glob does not guarantee any particular order
    file_names = sorted(glob.glob(path))
    df_temp = (pd.read_csv(file) for file in file_names)
    return pd.concat(df_temp, ignore_index=True, axis='index')
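The order in which files are concatenated matters later, because `drop_duplicates(..., keep='last')` retains whichever row comes last in the stacked frame. The sketch below uses two hypothetical one-row frames and a hypothetical helper, `last_value`, to illustrate the effect. It is not part of the original pipeline.

```python
import pandas as pd

# Two hypothetical monthly snapshots of the same customer.
july = pd.DataFrame({'cust_no': ['a'], 'X1': [1.0]})
september = pd.DataFrame({'cust_no': ['a'], 'X1': [3.0]})

def last_value(frames):
    """Stack the frames, keep the last row per customer, return its X1 value."""
    stacked = pd.concat(frames, ignore_index=True)
    deduped = stacked.drop_duplicates(subset=['cust_no'], keep='last')
    return deduped['X1'].iloc[0]

chronological = last_value([july, september])  # the September value survives
shuffled = last_value([september, july])       # the July value survives
print(chronological, shuffled)
```

Because `glob.glob` may return matches in any order, sorting the file names before concatenation is one way to make the "last occurrence" correspond to the latest month for patterns such as `aum_m[789].csv`.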

----

## y_train

In this step, we confirmed that customer IDs are duplicated across the two quarterly `y_train` sets, while each quarterly set contains only unique IDs. We did not trim the duplicates because the same customer may change their churn preference from one quarter to the next.
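As an illustration of how the overlap can be counted, the sketch below builds two hypothetical quarterly label sets and derives the same three quantities reported in this notebook: repeated customer IDs, fully identical records, and customers whose label changed. The subtraction is valid only under the assumption that each customer appears at most once per quarter.

```python
import pandas as pd

# Hypothetical quarterly label sets; IDs are unique within each quarter.
q3 = pd.DataFrame({'cust_no': ['a', 'b', 'c'], 'label': [0, 1, 0]})
q4 = pd.DataFrame({'cust_no': ['a', 'b', 'd'], 'label': [0, 0, 1]})
y_toy = pd.concat([q3, q4], ignore_index=True)

# Customers that appear in both quarters (second occurrences of an ID).
repeated_ids = int(y_toy['cust_no'].duplicated().sum())

# Records with the same ID and the same label in both quarters.
identical_rows = int(y_toy.duplicated().sum())

# Customers in both quarters whose label differs between quarters.
changed_label = repeated_ids - identical_rows

print(repeated_ids, identical_rows, changed_label)
```

Here customers `a` and `b` appear in both quarters, `a` keeps the same label, and `b` changes label, which mirrors the relation 62397 - 40090 = 22307 in the real data.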

In [261]:
path = '../data/raw/y_train_3/y_Q[34]_3.csv'
y_train = merge_file(path)

In [476]:
y_train.to_csv('../data/preprocess/y_train.csv', index=False)

In [280]:
path = '../data/raw/y_train_3/'
y_Q3_3 = pd.read_csv(path + 'y_Q3_3.csv')
y_Q4_3 = pd.read_csv(path + 'y_Q4_3.csv')

In [281]:
# Index for trimming
idxQ3 = y_Q3_3['cust_no']
idxQ4 = y_Q4_3['cust_no']

In [294]:
print(len(idxQ3))
print(len(idxQ4))

69126
76170


In [287]:
def trim_by_quarter(dat, isQ3):
    '''Trim the data by quarterly index
    Param: dat: imported data
    Param: isQ3: binary, 1 if Q3; 0 if else
    '''
    if isQ3:
        return(dat[dat['cust_no'].isin(idxQ3)])
    else:
        return(dat[dat['cust_no'].isin(idxQ4)])

In [265]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
# Confirm that all customer IDs are unique
print(len(y_Q3_3['cust_no'].value_counts()) == y_Q3_3.shape[0])
print(len(y_Q4_3['cust_no'].value_counts()) == y_Q4_3.shape[0])

True
True


In [266]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('y_Q3_3 has {} rows and {} columns.'.format(y_Q3_3.shape[0], y_Q3_3.shape[1]))
print('y_Q4_3 has {} rows and {} columns.'.format(y_Q4_3.shape[0], y_Q4_3.shape[1]))
print('y_train has {} rows and {} columns.'.format(y_train.shape[0], y_train.shape[1]))
print('y_train has {} duplicated customer IDs.'.format(y_train['cust_no'].duplicated().sum()))
print('y_train has {} completely identical records.'.format(y_train.duplicated().sum()))
print('{} customers changed their churn preference from Q3 to Q4.'.format(
    y_train['cust_no'].duplicated().sum()-y_train.duplicated().sum()))

y_Q3_3 has 69126 rows and 2 columns.
y_Q4_3 has 76170 rows and 2 columns.
y_train has 145296 rows and 2 columns.
y_train has 62397 duplicated customer IDs.
y_train has 40090 completely identical records.
22307 customers changed their churn preference from Q3 to Q4.


----

### Sample Submission File

In [244]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
sample = pd.read_csv('../instructions/sample_submission.csv')
print('Sample submission file has {} rows and {} columns.'.format(sample.shape[0], sample.shape[1]))
display(sample.head())

Sample submission file has 76722 rows and 2 columns.


Unnamed: 0,cust_no,label
0,0x3b9b4615,0
1,0x3b9ae61b,0
2,0x3b9add69,0
3,0x3b9b3601,0
4,0x3b9b2599,0


----

## X_train

In [461]:
X_train_Q3 = y_Q3_3.drop('label', axis=1).copy()
X_train_Q4 = y_Q4_3.drop('label', axis=1).copy()

### Customer Assets (aum)

In [377]:
# Q3
path = '../data/raw/x_train/aum_m[789].csv'
aum_Q3 = merge_file(path)

In [378]:
# Q4
path = '../data/raw/x_train/aum_m1[012].csv'
aum_Q4 = merge_file(path)

In [379]:
# Drop duplicated customer IDs except for the last occurrence
aum_Q3 = aum_Q3.drop_duplicates(subset=['cust_no'], keep='last')
aum_Q4 = aum_Q4.drop_duplicates(subset=['cust_no'], keep='last')

In [380]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After dropping duplicated customer IDs, aum_Q3 has {} rows and {} columns.'.format(aum_Q3.shape[0], aum_Q3.shape[1]))
print('After dropping duplicated customer IDs, aum_Q4 has {} rows and {} columns.'.format(aum_Q4.shape[0], aum_Q4.shape[1]))

After dropping duplicated customer IDs, aum_Q3 has 493441 rows and 9 columns.
After dropping duplicated customer IDs, aum_Q4 has 543823 rows and 9 columns.


In [381]:
# Trim by the customer IDs in `y_train` set, separated by quarter
aum_Q3 = trim_by_quarter(aum_Q3, True)
aum_Q4 = trim_by_quarter(aum_Q4, False)

In [382]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After trimming, aum_Q3 has {} rows and {} columns.'.format(aum_Q3.shape[0], aum_Q3.shape[1]))
print('After trimming, aum_Q4 has {} rows and {} columns.'.format(aum_Q4.shape[0], aum_Q4.shape[1]))

After trimming, aum_Q3 has 69126 rows and 9 columns.
After trimming, aum_Q4 has 76170 rows and 9 columns.


In [386]:
# Save to archive
aum_Q3.to_csv('../data/preprocess/archive/aum_Q3.csv', index=False)
aum_Q4.to_csv('../data/preprocess/archive/aum_Q4.csv', index=False)

In [462]:
# Merge to X_train
X_train_Q3 = X_train_Q3.merge(aum_Q3, how='left', on='cust_no')
X_train_Q4 = X_train_Q4.merge(aum_Q4, how='left', on='cust_no')

### Customer Behavior (behavior)

In [184]:
# Q3
path = '../data/raw/x_train/behavior_m[789].csv'
behavior_Q3 = merge_file(path)

In [185]:
# Q4
path = '../data/raw/x_train/behavior_m1[012].csv'
behavior_Q4 = merge_file(path)

In [186]:
# Drop duplicated customer IDs except for the last occurrence
behavior_Q3 = behavior_Q3.drop_duplicates(subset=['cust_no'], keep='last')
behavior_Q4 = behavior_Q4.drop_duplicates(subset=['cust_no'], keep='last')

In [187]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After dropping duplicated customer IDs, behavior_Q3 has {} rows and {} columns.'.format(behavior_Q3.shape[0], behavior_Q3.shape[1]))
print('After dropping duplicated customer IDs, behavior_Q4 has {} rows and {} columns.'.format(behavior_Q4.shape[0], behavior_Q4.shape[1]))

After dropping duplicated customer IDs, behavior_Q3 has 493441 rows and 8 columns.
After dropping duplicated customer IDs, behavior_Q4 has 543823 rows and 8 columns.


In [389]:
# Trim by the customer IDs in `y_train` set, separated by quarter
behavior_Q3 = trim_by_quarter(behavior_Q3, True)
behavior_Q4 = trim_by_quarter(behavior_Q4, False)

In [390]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After trimming, behavior_Q3 has {} rows and {} columns.'.format(behavior_Q3.shape[0], behavior_Q3.shape[1]))
print('After trimming, behavior_Q4 has {} rows and {} columns.'.format(behavior_Q4.shape[0], behavior_Q4.shape[1]))

After trimming, behavior_Q3 has 69126 rows and 8 columns.
After trimming, behavior_Q4 has 76170 rows and 8 columns.


In [391]:
# Save to archive
behavior_Q3.to_csv('../data/preprocess/archive/behavior_Q3.csv', index=False)
behavior_Q4.to_csv('../data/preprocess/archive/behavior_Q4.csv', index=False)

In [463]:
# Merge to X_train
X_train_Q3 = X_train_Q3.merge(behavior_Q3, how='left', on='cust_no')
X_train_Q4 = X_train_Q4.merge(behavior_Q4, how='left', on='cust_no')

### Important Customer Behavior (big_event)

In [189]:
# Q3
path = '../data/raw/x_train/big_event_Q3.csv'
big_event_Q3 = merge_file(path)


In [190]:
# Q4
path = '../data/raw/x_train/big_event_Q4.csv'
big_event_Q4 = merge_file(path)

In [193]:
# Drop duplicated customer IDs except for the last occurrence
big_event_Q3 = big_event_Q3.drop_duplicates(subset=['cust_no'], keep='last')
big_event_Q4 = big_event_Q4.drop_duplicates(subset=['cust_no'], keep='last')

In [194]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After dropping duplicated customer IDs, big_event_Q3 has {} rows and {} columns.'.format(big_event_Q3.shape[0], big_event_Q3.shape[1]))
print('After dropping duplicated customer IDs, big_event_Q4 has {} rows and {} columns.'.format(big_event_Q4.shape[0], big_event_Q4.shape[1]))

After dropping duplicated customer IDs, big_event_Q3 has 493441 rows and 19 columns.
After dropping duplicated customer IDs, big_event_Q4 has 543823 rows and 19 columns.


In [393]:
# Trim by the customer IDs in `y_train` set, separated by quarter
big_event_Q3 = trim_by_quarter(big_event_Q3, True)
big_event_Q4 = trim_by_quarter(big_event_Q4, False)

In [394]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After trimming, big_event_Q3 has {} rows and {} columns.'.format(big_event_Q3.shape[0], big_event_Q3.shape[1]))
print('After trimming, big_event_Q4 has {} rows and {} columns.'.format(big_event_Q4.shape[0], big_event_Q4.shape[1]))

After trimming, big_event_Q3 has 69126 rows and 19 columns.
After trimming, big_event_Q4 has 76170 rows and 19 columns.


In [395]:
# Save to archive
big_event_Q3.to_csv('../data/preprocess/archive/big_event_Q3.csv', index=False)
big_event_Q4.to_csv('../data/preprocess/archive/big_event_Q4.csv', index=False)

In [464]:
# Merge to X_train
X_train_Q3 = X_train_Q3.merge(big_event_Q3, how='left', on='cust_no')
X_train_Q4 = X_train_Q4.merge(big_event_Q4, how='left', on='cust_no')

### Customer Deposits (cunkuan)

In [398]:
# Q3
path = '../data/raw/x_train/cunkuan_m[789].csv'
savings_Q3 = merge_file(path)

In [399]:
# Q4
path = '../data/raw/x_train/cunkuan_m1[012].csv'
savings_Q4 = merge_file(path)

In [400]:
# Drop duplicated customer IDs except for the last occurrence
savings_Q3 = savings_Q3.drop_duplicates(subset=['cust_no'], keep='last')
savings_Q4 = savings_Q4.drop_duplicates(subset=['cust_no'], keep='last')

In [401]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After dropping duplicated customer IDs, savings_Q3 has {} rows and {} columns.'.format(savings_Q3.shape[0], savings_Q3.shape[1]))
print('After dropping duplicated customer IDs, savings_Q4 has {} rows and {} columns.'.format(savings_Q4.shape[0], savings_Q4.shape[1]))

After dropping duplicated customer IDs, savings_Q3 has 200721 rows and 3 columns.
After dropping duplicated customer IDs, savings_Q4 has 237049 rows and 3 columns.


In [402]:
# Trim by the customer IDs in `y_train` set, separated by quarter
savings_Q3 = trim_by_quarter(savings_Q3, True)
savings_Q4 = trim_by_quarter(savings_Q4, False)

In [403]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After trimming, savings_Q3 has {} rows and {} columns.'.format(savings_Q3.shape[0], savings_Q3.shape[1]))
print('After trimming, savings_Q4 has {} rows and {} columns.'.format(savings_Q4.shape[0], savings_Q4.shape[1]))

After trimming, savings_Q3 has 69122 rows and 3 columns.
After trimming, savings_Q4 has 76167 rows and 3 columns.


In [404]:
# Save to archive
savings_Q3.to_csv('../data/preprocess/archive/savings_Q3.csv', index=False)
savings_Q4.to_csv('../data/preprocess/archive/savings_Q4.csv', index=False)

In [465]:
# Merge to X_train
X_train_Q3 = X_train_Q3.merge(savings_Q3, how='left', on='cust_no')
X_train_Q4 = X_train_Q4.merge(savings_Q4, how='left', on='cust_no')

### Valid Customer (cust_avli)

The valid customer set contains the same customer IDs as the `y_train` set of the same quarter.

In [204]:
# Q3
path = '../data/raw/x_train/cust_avli_Q3.csv'
cust_avli_Q3 = merge_file(path)

In [205]:
# Q4
path = '../data/raw/x_train/cust_avli_Q4.csv'
cust_avli_Q4 = merge_file(path)

In [212]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('cust_avli_Q3 has {} rows and {} columns.'.format(cust_avli_Q3.shape[0], cust_avli_Q3.shape[1]))
print('cust_avli_Q4 has {} rows and {} columns.'.format(cust_avli_Q4.shape[0], cust_avli_Q4.shape[1]))

cust_avli_Q3 has 69126 rows and 1 columns.
cust_avli_Q4 has 76170 rows and 1 columns.


In [211]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
# Confirm that the number of unique valid customers matches the `y_train` set
print(len(cust_avli_Q3.value_counts()) == y_Q3_3.shape[0])
print(len(cust_avli_Q4.value_counts()) == y_Q4_3.shape[0])

True
True


### Customer Information (cust_info)

In [451]:
# Q3
path = '../data/raw/x_train/cust_info_q3.csv'
cust_info_Q3 = merge_file(path)

In [452]:
# Q4
path = '../data/raw/x_train/cust_info_q4.csv'
cust_info_Q4 = merge_file(path)

In [453]:
# Drop duplicated customer IDs except for the last occurrence
cust_info_Q3 = cust_info_Q3.drop_duplicates(subset=['cust_no'], keep='last')
cust_info_Q4 = cust_info_Q4.drop_duplicates(subset=['cust_no'], keep='last')

In [454]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After dropping duplicated customer IDs, cust_info_Q3 has {} rows and {} columns.'.format(cust_info_Q3.shape[0], cust_info_Q3.shape[1]))
print('After dropping duplicated customer IDs, cust_info_Q4 has {} rows and {} columns.'.format(cust_info_Q4.shape[0], cust_info_Q4.shape[1]))

After dropping duplicated customer IDs, cust_info_Q3 has 493441 rows and 21 columns.
After dropping duplicated customer IDs, cust_info_Q4 has 543823 rows and 21 columns.


In [455]:
# Trim by the customer IDs in `y_train` set, separated by quarter
cust_info_Q3 = trim_by_quarter(cust_info_Q3, True)
cust_info_Q4 = trim_by_quarter(cust_info_Q4, False)

In [456]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After trimming, cust_info_Q3 has {} rows and {} columns.'.format(cust_info_Q3.shape[0], cust_info_Q3.shape[1]))
print('After trimming, cust_info_Q4 has {} rows and {} columns.'.format(cust_info_Q4.shape[0], cust_info_Q4.shape[1]))

After trimming, cust_info_Q3 has 69126 rows and 21 columns.
After trimming, cust_info_Q4 has 76170 rows and 21 columns.


In [457]:
# Save to archive
cust_info_Q3.to_csv('../data/preprocess/archive/cust_info_Q3.csv', index=False)
cust_info_Q4.to_csv('../data/preprocess/archive/cust_info_Q4.csv', index=False)

In [470]:
# Merge to X_train
X_train_Q3 = X_train_Q3.merge(cust_info_Q3, how='left', on='cust_no')
X_train_Q4 = X_train_Q4.merge(cust_info_Q4, how='left', on='cust_no')

### X_Train Ready

In [471]:
print(X_train_Q3.shape)
print(X_train_Q4.shape)

(69126, 56)
(76170, 56)
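The row counts above match the label sets, which is what a left merge on unique keys should produce. One way to guard this property at merge time is the `validate` argument of `DataFrame.merge`. The frames below are hypothetical; the sketch only shows that a duplicated key raises an error instead of silently adding rows.

```python
import pandas as pd

# Hypothetical base IDs and feature tables.
base = pd.DataFrame({'cust_no': ['a', 'b', 'c']})
feature_ok = pd.DataFrame({'cust_no': ['a', 'b'], 'X1': [1.0, 2.0]})
feature_dup = pd.DataFrame({'cust_no': ['a', 'a'], 'X1': [1.0, 9.0]})

# Unique keys on both sides: the merge succeeds and keeps one row per customer.
merged = base.merge(feature_ok, how='left', on='cust_no', validate='one_to_one')

# Duplicated keys on the right side: pandas raises MergeError.
try:
    base.merge(feature_dup, how='left', on='cust_no', validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged.shape, duplicate_detected)
```

With this guard, a feature table that was not deduplicated would fail loudly rather than inflate the training set.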


In [472]:
X_train_Q3.to_csv('../data/preprocess/archive/X_train_Q3.csv', index=False)
X_train_Q4.to_csv('../data/preprocess/archive/X_train_Q4.csv', index=False)

In [473]:
# Stack the two quarters; DataFrame.append was removed in pandas 2.0
X_train = pd.concat([X_train_Q3, X_train_Q4])

In [474]:
X_train.shape

(145296, 56)

In [475]:
X_train.to_csv('../data/preprocess/X_train.csv', index=False)

----

## X_test

### Customer Assets (aum)

In [220]:
# Q1
path = '../data/raw/x_test/aum_m[123].csv'
aum_Q1 = merge_file(path)

In [221]:
# Drop duplicated customer IDs except for the last occurrence
aum_Q1 = aum_Q1.drop_duplicates(subset=['cust_no'], keep='last')

In [222]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After dropping duplicated customer IDs, aum_Q1 has {} rows and {} columns.'.format(aum_Q1.shape[0], aum_Q1.shape[1]))

After dropping duplicated customer IDs, aum_Q1 has 659624 rows and 9 columns.


### Customer Behavior (behavior)

In [246]:
# Q1
path = '../data/raw/x_test/behavior_m[123].csv'
behavior_Q1 = merge_file(path)

In [247]:
# Drop duplicated customer IDs except for the last occurrence
behavior_Q1 = behavior_Q1.drop_duplicates(subset=['cust_no'], keep='last')

In [248]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After dropping duplicated customer IDs, behavior_Q1 has {} rows and {} columns.'.format(behavior_Q1.shape[0], behavior_Q1.shape[1]))

After dropping duplicated customer IDs, behavior_Q1 has 659624 rows and 8 columns.


### Important Customer Behavior (big_event)

In [249]:
# Q1
path = '../data/raw/x_test/big_event_Q1.csv'
big_event_Q1 = merge_file(path)

In [250]:
# Drop duplicated customer IDs except for the last occurrence
big_event_Q1 = big_event_Q1.drop_duplicates(subset=['cust_no'], keep='last')

In [251]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After dropping duplicated customer IDs, big_event_Q1 has {} rows and {} columns.'.format(big_event_Q1.shape[0], big_event_Q1.shape[1]))

After dropping duplicated customer IDs, big_event_Q1 has 659624 rows and 19 columns.


### Customer Deposits (cunkuan)

In [252]:
# Q1
path = '../data/raw/x_test/cunkuan_m[123].csv'
savings_Q1 = merge_file(path)

In [253]:
# Drop duplicated customer IDs except for the last occurrence
savings_Q1 = savings_Q1.drop_duplicates(subset=['cust_no'], keep='last')

In [254]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After dropping duplicated customer IDs, savings_Q1 has {} rows and {} columns.'.format(savings_Q1.shape[0], savings_Q1.shape[1]))

After dropping duplicated customer IDs, savings_Q1 has 254816 rows and 3 columns.


### Valid Customer (cust_avli)

The `cust_avli` in the `x_test` sets is identical to the sample submission file.

In [255]:
# Q1
path = '../data/raw/x_test/cust_avli_Q1.csv'
cust_avli_Q1 = merge_file(path)

In [256]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('cust_avli_Q1 has {} rows and {} columns.'.format(cust_avli_Q1.shape[0], cust_avli_Q1.shape[1]))

cust_avli_Q1 has 76722 rows and 1 columns.


### Customer Information (cust_info)

In [257]:
# Q1
path = '../data/raw/x_test/cust_info_q1.csv'
cust_info_Q1 = merge_file(path)

In [None]:
# Drop duplicated customer IDs except for the last occurrence
cust_info_Q1 = cust_info_Q1.drop_duplicates(subset=['cust_no'], keep='last')

In [258]:
# FOR DISPLAY PURPOSE. SKIP THIS CHUNK.
print('After dropping duplicated customer IDs, cust_info_Q1 has {} rows and {} columns.'.format(cust_info_Q1.shape[0], cust_info_Q1.shape[1]))

After dropping duplicated customer IDs, cust_info_Q1 has 659624 rows and 21 columns.


### X_test Ready

In [418]:
X_test = cust_avli_Q1.copy()

In [420]:
X_test = X_test.merge(aum_Q1, how='left', on='cust_no')

In [421]:
X_test = X_test.merge(behavior_Q1, how='left', on='cust_no')

In [422]:
X_test = X_test.merge(big_event_Q1, how='left', on='cust_no')

In [423]:
X_test = X_test.merge(savings_Q1, how='left', on='cust_no')

In [424]:
X_test = X_test.merge(cust_info_Q1, how='left', on='cust_no')

In [425]:
X_test.shape

(76722, 56)

In [477]:
X_test.to_csv('../data/preprocess/X_test.csv', index=False)

----

### Missing Values

In [139]:
# Check which data sets have missing values
def nulltracker(datasets):
    '''Print the position of every data set that contains missing values
    Param: datasets: iterable of DataFrames
    '''
    for position, dataset in enumerate(datasets):
        if dataset.isnull().any().any():
            print(position)
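Printed positions are hard to map back to file names and columns. As a possible alternative, the sketch below reports missing values per named data set and column. The function `missing_summary` and the toy frames are hypothetical, and the column names only imitate the `B6` and `X1` columns seen in this notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(named_frames):
    """Return one row per (data set, column) that contains missing values."""
    rows = []
    for name, frame in named_frames.items():
        counts = frame.isnull().sum()
        # Keep only the columns that actually have missing values.
        for column, n_missing in counts[counts > 0].items():
            rows.append({'dataset': name,
                         'column': column,
                         'n_missing': int(n_missing),
                         'share': n_missing / len(frame)})
    return pd.DataFrame(rows, columns=['dataset', 'column', 'n_missing', 'share'])

# Hypothetical toy frames standing in for the raw monthly files.
toy = {
    'behavior_toy': pd.DataFrame({'cust_no': ['a', 'b'],
                                  'B6': [np.nan, '2020-03-31 22:06:00']}),
    'aum_toy': pd.DataFrame({'cust_no': ['a', 'b'], 'X1': [0.0, 5.0]}),
}
summary = missing_summary(toy)
print(summary)
```

Passing a dictionary keeps the data set name next to each result, so no position needs to be decoded afterwards.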

In [127]:
aum_m = aum_m1, aum_m2, aum_m3, aum_m7, aum_m8, aum_m9, aum_m10, aum_m11, aum_m12
nulltracker(aum_m)

In [151]:
behavior_m = behavior_m1, behavior_m2, behavior_m3, behavior_m7, behavior_m8, behavior_m9, behavior_m10, behavior_m11, behavior_m12 
nulltracker(behavior_m)

2
5
8


In [129]:
cunkuan_m = cunkuan_m1, cunkuan_m2, cunkuan_m3, cunkuan_m7, cunkuan_m8, cunkuan_m9, cunkuan_m10, cunkuan_m11, cunkuan_m12
nulltracker(cunkuan_m)

In [141]:
big_event_Q = big_event_Q1, big_event_Q3, big_event_Q4
nulltracker(big_event_Q)

0
1
2


In [142]:
cust_avli_Q = cust_avli_Q1, cust_avli_Q3, cust_avli_Q4
nulltracker(cust_avli_Q)

In [143]:
cust_info_q = cust_info_q1, cust_info_q3, cust_info_q4
nulltracker(cust_info_q)

0
1
2


In [144]:
y_Q = y_Q3_3, y_Q4_3
nulltracker(y_Q)

In [146]:
#print(behavior_m[3][behavior_m[3].isnull().T.any()])
behavior_m[2].isnull().any()
behavior_m[2]["B6"]

0                         NaN
1                         NaN
2                         NaN
3         2020-03-31 22:06:00
4                         NaN
5         2020-03-30 19:07:00
6                         NaN
7                         NaN
8                         NaN
9                         NaN
10                        NaN
11                        NaN
12                        NaN
13                        NaN
14                        NaN
15        2020-01-16 11:07:00
16                        NaN
17                        NaN
18                        NaN
19                        NaN
20                        NaN
21                        NaN
22                        NaN
23                        NaN
24        2020-01-20 04:17:00
25                        NaN
26                        NaN
27                        NaN
28                        NaN
29                        NaN
                 ...         
659594                    NaN
659595                    NaN
659596    

In [156]:
behavior_m12

Unnamed: 0,cust_no,B1,B2,B3,B4,B5,B6,B7
0,0xb2d14994,5,2,1346.15,2,5346.15,2019-12-13 18:03:00,22
1,0xb2d65824,0,0,0.00,0,0.00,,0
2,0xb2d539b7,0,0,0.00,0,0.00,,0
3,0xb2d807ae,0,0,0.00,0,0.00,,0
4,0xb2d176b2,14,3,292654.81,8,323939.53,2019-12-31 06:02:00,28
5,0xb2d1386f,0,0,0.00,0,0.00,,0
6,0xb2d5ae1e,0,1,0.01,0,0.00,2019-12-13 19:35:00,1
7,0xb2d73522,0,0,0.00,0,0.00,,0
8,0xb2d4bec7,0,0,0.00,0,0.00,,0
9,0xb2d86da5,0,0,0.00,0,0.00,,0


In [36]:
display(aum_m1.describe(), aum_m2.describe(), aum_m3.describe(), aum_m7.describe(), aum_m8.describe(), 
        aum_m9.describe(), aum_m10.describe(), aum_m11.describe(), aum_m12.describe())

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8
count,556195.0,556195.0,556195.0,556195.0,556195.0,556195.0,556195.0,556195.0
mean,47529.63,3296.358,4633.316,4351.619,193.9474,315.8859,64462.12,12647.14
std,1555244.0,547665.8,217063.8,100014.2,10393.17,27438.51,652513.7,186607.5
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,7.67,0.0,0.0,0.0,0.0,0.0
max,290000000.0,327849800.0,109041200.0,30000000.0,3740000.0,10000000.0,30000000.0,58000000.0


Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8
count,603631.0,603631.0,603631.0,603631.0,603631.0,603631.0,603631.0,603631.0
mean,43858.46,3259.812,4922.958,3361.816,185.0724,297.5686,59819.33,12305.64
std,1492338.0,555233.9,210876.8,78903.69,8912.695,26737.54,629864.7,184637.7
min,0.0,0.0,0.0,0.0,-1.77,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,3.97,0.0,0.0,0.0,0.0,0.0
max,290000000.0,333514200.0,109041200.0,18772000.0,2100000.0,10000000.0,30000000.0,58000000.0


Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8
count,659624.0,659624.0,659624.0,659624.0,659624.0,659624.0,659624.0,659624.0
mean,41027.85,2397.763,4548.653,3144.846,184.2109,259.9848,57239.52,12733.69
std,1397215.0,477838.9,223023.2,71846.28,15288.95,24475.11,625542.1,308846.7
min,0.0,0.0,0.0,0.0,-1.77,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,2.37,0.0,0.0,0.0,0.0,0.0
max,200000000.0,337250800.0,109145900.0,18220000.0,10000000.0,10000000.0,30000000.0,150000000.0


Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8
count,465441.0,465441.0,465441.0,465441.0,465441.0,465441.0,465441.0,465441.0
mean,45778.86,3865.932,6319.957,3782.432,213.8875,505.8526,57950.86,7914.9
std,1511018.0,535391.2,279307.2,166731.6,12732.93,68757.66,619003.2,145328.3
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,4.3,0.0,0.0,0.0,0.0,0.0
max,347000000.0,210653500.0,100071800.0,68000000.0,4035640.0,40757140.0,34750000.0,50000000.0


Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8
count,479063.0,479063.0,479063.0,479063.0,479063.0,479063.0,479063.0,479063.0
mean,50150.33,3243.406,4790.877,3527.987,207.1597,480.9048,59681.86,8685.593
std,1633422.0,508359.1,202644.2,121356.3,11906.52,68032.67,627669.7,154111.3
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,4.31,0.0,0.0,0.0,0.0,0.0
max,290000000.0,216889700.0,94675160.0,50860000.0,3024484.0,40862450.0,34750000.0,50000000.0


Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8
count,493441.0,493441.0,493441.0,493441.0,493441.0,493441.0,493441.0,493441.0
mean,51071.97,3042.671,5425.967,3708.233,222.7427,427.9809,61600.78,10219.8
std,1656046.0,405245.5,302873.6,121347.4,14517.66,37401.05,639543.1,191507.7
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,5.73,0.0,0.0,0.0,0.0,0.0
max,290000000.0,216430700.0,142022500.0,50860000.0,5003991.0,11541710.0,32831460.0,50000000.0


Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8
count,506513.0,506513.0,506513.0,506513.0,506513.0,506513.0,506513.0,506513.0
mean,49653.24,3594.652,6016.074,4000.847,203.9064,346.0443,62738.38,10664.76
std,1647982.0,570159.7,218568.2,81978.53,11563.78,30681.68,645591.8,187077.1
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,5.7,0.0,0.0,0.0,0.0,0.0
max,290000000.0,335737100.0,79130930.0,19860000.0,3100000.0,10182110.0,32783190.0,50000000.0


Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8
count,521566.0,521566.0,521566.0,521566.0,521566.0,521566.0,521566.0,521566.0
mean,49963.68,3872.203,4449.279,4441.739,198.6803,337.6526,64102.21,11251.53
std,1623167.0,571757.6,192396.7,95223.27,10376.71,30055.5,651373.2,184065.2
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,5.75,0.0,0.0,0.0,0.0,0.0
max,290000000.0,334618500.0,79130930.0,30000000.0,2900067.0,10208150.0,32734560.0,50000000.0


Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8
count,543823.0,543823.0,543823.0,543823.0,543823.0,543823.0,543823.0,543823.0
mean,47674.5,3674.215,5989.938,4462.179,257.2643,435.3697,63953.52,12304.28
std,1590177.0,575038.7,244861.2,116987.6,18356.57,39167.26,650883.5,206312.4
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,8.51,0.0,0.0,0.0,0.0,0.0
max,290000000.0,332067100.0,109041200.0,45000000.0,7000000.0,10010830.0,30000000.0,76000000.0
