# 3_PreProcessing
In this notebook I implement the pre-processing strategy suggested in ```2_EDA_WhichModel_FeatureEngineering```. Namely, I prepare the datasets for 4 models:
1. **Churn Prediction for 10 day history**: 

| MODEL | Churn_10 |
| --- | ----------- |
| *Model Type* | Binary classifier |
| *Datasets* | <ul><li>Geographic and Demographic ```df_geo_gen```</li><li>Behavioral ```df_10```</li><li>Player details ```df_players``` </li></ul>|        
| *Target* | ```df_players[churn_10]```: {0,1}   |

2. **Customer Value Prediction for 10 day history**:
        
| MODEL | CVP_10 |
| --- | ----------- |
| *Model Type* | Multi-class classifier |
| *Datasets* | <ul><li>Geographic and Demographic ```df_geo_gen```</li><li>Behavioral ```df_10```</li><li>NGR values ```df_ngr```</li></ul>|        
| *Target* | ```df_ngr[CLTV]```: {0, 1, 2, 3} |

3. **Churn Prediction for 30 day history**: 

| MODEL | Churn_30 |
| --- | ----------- |
| *Model Type* | Binary classifier |
| *Datasets* | <ul><li>Geographic and Demographic ```df_geo_gen```</li><li>Behavioral ```df_30```</li><li>Player details ```df_players``` </li></ul>|        
| *Target* | ```df_players[churn_30]```: {0,1}   |

4. **Customer Value Prediction for 30 day history**:
        
| MODEL | CVP_30 |
| --- | ----------- |
| *Model Type* | Multi-class classifier |
| *Datasets* | <ul><li>Geographic and Demographic ```df_geo_gen```</li><li>Behavioral ```df_30```</li><li>NGR values ```df_ngr```</li></ul>|        
| *Target* | ```df_ngr[CLTV]```: {0, 1, 2, 3} |

In [124]:
import pandas as pd
from functools import reduce
import numpy as np
import calendar
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import scikitplot as skplt

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

## Pre-Processing
Normalising, One-Hot encoding and creating the 4 datasets
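
As a minimal illustration of the two transformations named above, the sketch below standardises a numeric column and one-hot encodes a day-of-week column on a toy frame. The frame and its values are invented for illustration; only the column names echo those used in this notebook, and the weekday labels assume the default (English) locale.

```python
import calendar

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not taken from the real datasets)
toy = pd.DataFrame({
    'TOTAL_sum_deposit': [10.0, 20.0, 30.0],
    'BEH_DOW_Interaction': [0, 6, 0],
})

# Standardise: zero mean and unit (population) variance per column
scaled = StandardScaler().fit_transform(toy[['TOTAL_sum_deposit']])
toy['TOTAL_sum_deposit'] = scaled[:, 0]

# Map 0..6 to the abbreviated weekday names, then one-hot encode the labels
day_names = dict(enumerate(calendar.day_abbr))
toy['BEH_DOW_Interaction'] = toy['BEH_DOW_Interaction'].map(day_names)
toy_encoded = pd.get_dummies(toy, columns=['BEH_DOW_Interaction'])

print(toy_encoded.columns.tolist())
```

Only the weekdays that actually occur in the data produce a dummy column, which is why a column-count check after encoding is worthwhile.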

In [76]:
# import dataset
df_geo_gen = pd.read_pickle('./data/df_2_FE_df_geo_gen.pkl')
df_10 = pd.read_pickle('./data/df_2_FE_df_10.pkl')
df_30 = pd.read_pickle('./data/df_2_FE_df_30.pkl')
df_players = pd.read_pickle('./data/df_2_FE_df_players.pkl')
df_ngr = pd.read_pickle('./data/df_2_FE_df_ngr.pkl')

In [77]:
def getDataFrame(dayRange, MODEL,
                 df_geo_gen=df_geo_gen, df_10=df_10, df_30=df_30, df_players=df_players,
                 df_ngr=df_ngr.drop(['GROSS_sum_ngr'],axis=1)):
    
    print('Creating dataset for {} model for a {} day duration'.format(MODEL, dayRange))
    # Pick the behavioural frame for the requested window and drop the other
    # window's churn label plus the DIFF_* columns from df_players
    print('--> Select dfs')
    if dayRange == 10:
        dfs = [df_10, df_geo_gen, df_ngr, 
               df_players.drop(['churn_30','DIFF_deposit_reg','DIFF_trans_deposit','DIFF_trans_max_min'],
                               axis=1)]
    elif dayRange == 31:
        dfs = [df_30, df_geo_gen, df_ngr, 
               df_players.drop(['churn_10','DIFF_deposit_reg','DIFF_trans_deposit','DIFF_trans_max_min'], 
                               axis=1)] 
    else:
        raise ValueError('dayRange must be 10 or 31, got {}'.format(dayRange))
    
    #### Merge data frame ####
    print('--> Merge dfs')
    df_final = reduce(lambda left,right: pd.merge(left,right,on='customer_id'), dfs)
    # Check if dataframe size is correct
    if (len(set([D.shape[0] for D in dfs])) == 1 ):
        print('---> Dataframe Merge check : All OK')
    else:
        print([D.shape[0] for D in dfs])
        raise ValueError('Dataframes are not the same size')
        
    #### drop cid ####
    print('--> Drop customer id')
    df_final.drop('customer_id',axis=1, inplace=True)
    
    #### Rename columns to remove dayRange ####     
    print('--> Rename columns to remove DayRange')
    df_final.columns = [ col.replace('_10', '') for col in df_final.columns.tolist() ]
    df_final.columns = [ col.replace('_31', '') for col in df_final.columns.tolist() ]
    df_final.columns = [ col.replace('_30', '') for col in df_final.columns.tolist() ]
    
    #### Impute inf ####    
    print('--> Inpute inf')
    for col in colReplaceINF:
        # Use the median of the finite values so the inf entries cannot skew it
        finite_median = df_final.loc[np.isfinite(df_final[col]), col].median()
        df_final[col] = df_final[col].replace([np.inf, -np.inf], finite_median)

    #### Normalise Cols ####
    print('--> Normalise cols')
    # Scale only the listed columns that survived the drops above; colNormalise
    # also names columns (DIFF_*, GROSS_sum_ngr) that are no longer in df_final
    colsToScale = [col for col in colNormalise if col in df_final.columns]
    normaliseCheck = []
    normaliseCheck.append(df_final[colsToScale].shape)
    pipe_normalise = Pipeline([('ss', StandardScaler())])
    X = pd.DataFrame(pipe_normalise.fit_transform(df_final[colsToScale]),
                     columns=colsToScale, index=df_final.index)
    normaliseCheck.append(X.shape)
    df_final.drop(colsToScale, axis=1, inplace=True)
    df_final = pd.concat([df_final, X], axis=1)
    normaliseCheck.append(df_final[colsToScale].shape)
    if len(set([n[1] for n in normaliseCheck])) == 1:
        print('---> Normalisation check : All OK')
    else:
        raise ValueError('---> Normalisation check : Columns are not agreeing in count')
    
    
    #### Day Of Week & colOHE One Hot Encoding ####
    print('--> DOW & OHE ')
    df_final[colDOW] = df_final[colDOW].replace( [float(d) for d in range(7)], list(calendar.day_abbr) )
    
    colLE = colDOW + colOHE
    LEcheck = []
    # Expected dummy columns: 7 weekdays per colDOW column and 4 levels per colOHE column
    LEcheck.append( (len(colDOW) * 7) + (len(colOHE)*4) )
    df_final[colLE] = df_final[colLE].astype(str)
    X = pd.get_dummies(df_final[colLE])
    LEcheck.append(X.shape)
    df_final.drop(colLE, axis=1, inplace=True)
    LEcheck.append(df_final.shape[1] + X.shape[1])
    df_final = pd.concat([df_final, X], axis=1)
    LEcheck.append(df_final.shape)
    if (LEcheck[0] == LEcheck[1][1]) and (LEcheck[2] == LEcheck[3][1]):
        print('---> Day Of Week Encoding & OHE check : All OK')
    else:
        raise ValueError('---> Day Of Week Encoding & OHE check : Columns are not agreeing in count {}'.format(LEcheck))
    
    #### select model target ####
    print('--> Model target')
    if MODEL == 'CHURN':
        df_final.drop(['CLTV'], axis=1, inplace=True)
    if MODEL == 'CVP':
        df_final.drop(['churn'], axis=1, inplace=True)
    print('')
    return df_final
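
The merge check in `getDataFrame` compares the row counts of the input frames. A complementary option, sketched here under the assumption that each frame holds exactly one row per `customer_id`, is to let pandas enforce that through the `validate` argument of `pd.merge` and to compare the merged row count with the inputs. The helper name `merge_on_customer_id` and the toy frames are hypothetical and not part of the original notebook.

```python
from functools import reduce

import pandas as pd


def merge_on_customer_id(dfs):
    """Inner-join frames on customer_id, failing loudly on duplicate keys."""
    merged = reduce(
        lambda left, right: pd.merge(left, right, on='customer_id',
                                     validate='one_to_one'),
        dfs,
    )
    # An inner join silently drops unmatched ids, so compare row counts
    if any(len(df) != len(merged) for df in dfs):
        raise ValueError('Rows were lost in the merge: {} -> {}'.format(
            [len(df) for df in dfs], len(merged)))
    return merged


# Toy frames (hypothetical values)
a = pd.DataFrame({'customer_id': [1, 2, 3], 'x': [0.1, 0.2, 0.3]})
b = pd.DataFrame({'customer_id': [1, 2, 3], 'y': [10, 20, 30]})
merged = merge_on_customer_id([a, b])
print(merged.shape)
```

Checking the merged frame rather than only the inputs also catches the case where the frames have equal length but do not share the same set of ids.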

In [78]:
colDrop = ['customer_id','GROSS_sum_ngr']
colReplaceINF = ['RATIO_NGR_Deposit','RATIO_Bets_Deposit']
colNormalise = ['Country_Lat', 'Country_Long', 'TOTAL_count_deposit', 'TOTAL_sum_free_spin', 'TOTAL_sum_deposit',
                'TOTAL_sum_bonus_cost', 'TOTAL_sum_ngr', 'TOTAL_sum_bet_real', 'TOTAL_sum_bet_bonus', 
                'TOTAL_sum_win_real', 'TOTAL_sum_win_bonus', 'COUNT_numDeposits', 'COUNT_numFreeSpin',
                'COUNT_numRealBets', 'COUNT_numBonusBets', 'RATIO_NGR_Deposit', 'RATIO_Bets_Deposit',
                'RFM_R', 'RFM_I', 'AVG_deposit', 'AVG_count_deposit', 'AVG_sum_free_spin', 'AVG_sum_deposit',
                'AVG_sum_bonus_cost', 'AVG_sum_ngr', 'AVG_sum_bet_real', 'AVG_sum_bet_bonus', 'AVG_sum_win_real',
                'AVG_sum_win_bonus', 'DIFF_deposit_reg', 'DIFF_trans_deposit', 'DIFF_trans_max_min', 'GROSS_sum_ngr']
colDOW = ['BEH_DOW_Interaction', 'BEH_DOW_count_deposit', 'BEH_DOW_bet_real', 'BEH_DOW_bet_bonus',
          'BEH_DOW_free_spin', 'BEH_DOW_win_real', 'BEH_DOW_win_bonus']
colOHE = ['NGR_SPLIT_TOTAL_sum_ngr','NGR_SPLIT_AVG_sum_ngr']
colTarget = ['CLTV', 'churn']
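
Because these lists are looked up by name inside `getDataFrame`, a column that is listed here but dropped before the step that uses it can surface as a `KeyError`. A small helper can report such mismatches up front. The name `missing_columns` and the toy frame below are hypothetical, introduced only for illustration.

```python
import pandas as pd


def missing_columns(df, wanted):
    """Return the names in `wanted` that are not columns of `df`."""
    return [col for col in wanted if col not in df.columns]


# Toy frame and list (hypothetical values)
toy = pd.DataFrame({'RFM_R': [1.0, 2.0], 'RFM_I': [3.0, 4.0]})
wanted = ['RFM_R', 'RFM_I', 'GROSS_sum_ngr']
print(missing_columns(toy, wanted))  # ['GROSS_sum_ngr']
```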

In [79]:
def getDF_PreProcessingSettings(df_final):
    # Strip the day-range suffix once so the names match the settings lists
    cols = [col.replace('_10','').replace('_30','').replace('_31','') for col in df_final.columns.tolist()]
    return pd.DataFrame({'DataFrame Columns' : cols,
                         'Drop' : [int(col in colDrop) for col in cols],
                         'Normalise' : [int(col in colNormalise) for col in cols],
                         'DOW' : [int(col in colDOW) for col in cols],
                         'OHE' : [int(col in colOHE) for col in cols],
                         'ReplaceINF' : [int(col in colReplaceINF) for col in cols],
                         'Target' : [int(col in colTarget) for col in cols],
                         'Sample1' : df_final.loc[0].values.tolist(),
                         'Sample2' : df_final.loc[1].values.tolist(),
                         'Sample3' : df_final.loc[2].values.tolist(),
                         'Sample4' : df_final.loc[3].values.tolist(),
                         })

In [80]:
getDF_PreProcessingSettings(df_players)

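| | DataFrame Columns | Drop | Normalise | DOW | OHE | ReplaceINF | Target | Sample1 | Sample2 | Sample3 | Sample4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | customer_id | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | DIFF_deposit_reg | 0 | 1 | 0 | 0 | 0 | 0 | 1792 | 1426 | 1418 | 1394 |
| 2 | DIFF_trans_deposit | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | DIFF_trans_max_min | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 51 |
| 4 | churn | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 5 | churn | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |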


In [81]:
getDF_PreProcessingSettings(df_10)

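| | DataFrame Columns | Drop | Normalise | DOW | OHE | ReplaceINF | Target | Sample1 | Sample2 | Sample3 | Sample4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | customer_id | 1 | 0 | 0 | 0 | 0 | 0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 1 | TOTAL_count_deposit | 0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 1.0 | 1.0 | 5.0 |
| 2 | TOTAL_sum_free_spin | 0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | TOTAL_sum_deposit | 0 | 1 | 0 | 0 | 0 | 0 | 145.65 | 20.0 | 52.4 | 256.0 |
| 4 | TOTAL_sum_bonus_cost | 0 | 1 | 0 | 0 | 0 | 0 | 145.335 | 30.0 | 0.0 | 0.0 |
| 5 | TOTAL_sum_ngr | 0 | 1 | 0 | 0 | 0 | 0 | 142.026 | -140.0 | 37.468 | 255.662 |
| 6 | TOTAL_sum_bet_real | 0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 23.99 | 198.596 | 2415.718 |
| 7 | TOTAL_sum_bet_bonus | 0 | 1 | 0 | 0 | 0 | 0 | 234.756 | 1109.88 | 0.0 | 0.0 |
| 8 | TOTAL_sum_win_real | 0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 11.81 | 146.227 | 2160.056 |
| 9 | TOTAL_sum_win_bonus | 0 | 1 | 0 | 0 | 0 | 0 | 89.655 | 1252.06 | 0.0 | 0.0 |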


In [82]:
getDF_PreProcessingSettings(df_30)

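| | DataFrame Columns | Drop | Normalise | DOW | OHE | ReplaceINF | Target | Sample1 | Sample2 | Sample3 | Sample4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | customer_id | 1 | 0 | 0 | 0 | 0 | 0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 1 | TOTAL_count_deposit | 0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 1.0 | 1.0 | 5.0 |
| 2 | TOTAL_sum_free_spin | 0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | TOTAL_sum_deposit | 0 | 1 | 0 | 0 | 0 | 0 | 145.65 | 20.0 | 52.4 | 256.0 |
| 4 | TOTAL_sum_bonus_cost | 0 | 1 | 0 | 0 | 0 | 0 | 145.335 | 30.0 | 0.0 | 0.0 |
| 5 | TOTAL_sum_ngr | 0 | 1 | 0 | 0 | 0 | 0 | 142.026 | -140.0 | 37.468 | 255.662 |
| 6 | TOTAL_sum_bet_real | 0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 23.99 | 198.596 | 2415.718 |
| 7 | TOTAL_sum_bet_bonus | 0 | 1 | 0 | 0 | 0 | 0 | 234.756 | 1109.88 | 0.0 | 0.0 |
| 8 | TOTAL_sum_win_real | 0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 11.81 | 146.227 | 2160.056 |
| 9 | TOTAL_sum_win_bonus | 0 | 1 | 0 | 0 | 0 | 0 | 89.655 | 1252.06 | 0.0 | 0.0 |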


In [83]:
getDF_PreProcessingSettings(df_geo_gen)

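| | DataFrame Columns | Drop | Normalise | DOW | OHE | ReplaceINF | Target | Sample1 | Sample2 | Sample3 | Sample4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | customer_id | 1 | 0 | 0 | 0 | 0 | 0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 1 | Country_Lat | 0 | 1 | 0 | 0 | 0 | 0 | 60.128161 | 61.92411 | 60.472024 | 60.128161 |
| 2 | Country_Long | 0 | 1 | 0 | 0 | 0 | 0 | 18.643501 | 25.748151 | 8.468946 | 18.643501 |
| 3 | gender | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 1.0 | 1.0 | 1.0 |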


In [84]:
getDF_PreProcessingSettings(df_ngr)

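| | DataFrame Columns | Drop | Normalise | DOW | OHE | ReplaceINF | Target | Sample1 | Sample2 | Sample3 | Sample4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | customer_id | 1 | 0 | 0 | 0 | 0 | 0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 1 | GROSS_sum_ngr | 1 | 1 | 0 | 0 | 0 | 0 | 142.026 | -140.0 | 37.468 | 308.098 |
| 2 | CLTV | 0 | 0 | 0 | 0 | 0 | 1 | 3.0 | 0.0 | 2.0 | 3.0 |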


In [315]:
# Create the four model datasets (they are saved in the next cell)
df_churn10 = getDataFrame(10, 'CHURN')
df_cvp10 = getDataFrame(10, 'CVP')
df_churn30 = getDataFrame(31, 'CHURN')
df_cvp30 = getDataFrame(31, 'CVP')

Creating dataset for CHURN model for a 10 day duration
--> Select dfs
--> Merge dfs
---> Dataframe Merge check : All OK
--> Drop customer id
--> Rename columns to remove DayRange
--> Inpute inf
--> Normalise cols
---> Normalisation check : All OK
--> DOW & OHE 
---> Day Of Week Encoding & OHE check : All OK
--> Model target

Creating dataset for CVP model for a 10 day duration
--> Select dfs
--> Merge dfs
---> Dataframe Merge check : All OK
--> Drop customer id
--> Rename columns to remove DayRange
--> Inpute inf
--> Normalise cols
---> Normalisation check : All OK
--> DOW & OHE 
---> Day Of Week Encoding & OHE check : All OK
--> Model target

Creating dataset for CHURN model for a 31 day duration
--> Select dfs
--> Merge dfs
---> Dataframe Merge check : All OK
--> Drop customer id
--> Rename columns to remove DayRange
--> Inpute inf
--> Normalise cols
---> Normalisation check : All OK
--> DOW & OHE 
---> Day Of Week Encoding & OHE check : All OK
--> Model target

Creating dataset for 

In [316]:
df_churn10.to_pickle('./data/df_3_df_churn10.pkl')
df_cvp10.to_pickle('./data/df_3_df_cvp10.pkl')
df_churn30.to_pickle('./data/df_3_df_churn30.pkl')
df_cvp30.to_pickle('./data/df_3_df_cvp30.pkl')
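
A quick way to confirm that a pickle was written intact is to read it back and compare it with the frame in memory. The sketch below does this on a toy frame, writing to a temporary directory rather than `./data`; the frame and the file name `toy.pkl` are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({'churn': [0, 1], 'RFM_R': [0.5, -0.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pkl')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises an AssertionError if values, dtypes or labels differ
pd.testing.assert_frame_equal(toy, restored)
```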