## Problem Statement
When a model's performance is not good enough, revisit the data. That is what we will do in this notebook.
We will take another look at our data and engineer features that the model can use to improve its performance and stability.


In [1]:
#make sure your path is set to source folder
%cd /home/

/home


In [2]:
!pwd

/home


### 1.1 Importing packages

In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
# Imported Libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from scripts import utils
from pycaret.classification import *
# Other Libraries
import mlflow

In [5]:
# Setting up all directory
root_folder = "/home/"

data_directory = root_folder+"data/raw/"
data_profile_path = root_folder+"data/profile_report/"
intermediate_data_path = root_folder+"data/interim/"
final_processed_data_path = root_folder+"data/processed/"
database_path = root_folder+"database/"
print("directory loaded")

directory loaded


 ### 1.2 Reading Data

We will use the raw data for our analysis instead of the sampled data so that we can better judge the features we create. Before moving on, it is advisable to revisit the EDA that we performed previously.
 
* Recall that we had 4 categories of data: user profile data, user logs, transactions, and historic data.
* Here we will create features that better represent the user’s engagement and the transactions that the user made.
* But before that, let’s load and clean the raw data.
* Recall that during our preliminary analysis we found that the data had been merged using a single common aggregation. Improving this is the first priority.

In [6]:
%%time

#Reading the data
#data pipeline 

members, user_logs, transactions, train  = utils.load_data( [
                                                            f"{data_directory}members_profile.csv",
                                                            f"{data_directory}userlogs.csv",
                                                            f"{data_directory}transactions_logs.csv",
                                                            f"{data_directory}churn_logs.csv"
                                                            ]
                                                          )

CPU times: user 11.1 s, sys: 1.82 s, total: 12.9 s
Wall time: 12.9 s


In [7]:
print(members.shape)
print(transactions.shape)
print(user_logs.shape)
print(train.shape)

(4348970, 6)
(4380726, 9)
(4828886, 9)
(385591, 2)


 ### 1.3 Data cleaning
    
Compressing the dataframes to reduce their memory footprint. The date columns are converted to date-time in the pre-processing steps below.

In [8]:
%%time
members_c, transactions_c, user_logs_c = utils.compress_dataframes([members, transactions, user_logs])
members = members_c[0]

transactions = transactions_c[0]
user_logs = user_logs_c[0]

CPU times: user 708 ms, sys: 553 ms, total: 1.26 s
Wall time: 1.26 s


In [9]:
print("members DF before compress was in MB ,",members_c[1], "and after compress , ", members_c[2])
print("transactions DF before compress was in MB ,",transactions_c[1], "and after compress , ", transactions_c[2])
print("user_logs DF before compress was in MB ,",user_logs_c[1], "and after compress , ", user_logs_c[2])

members DF before compress was in MB , 199.08016967773438 and after compress ,  99.54014587402344
transactions DF before compress was in MB , 300.8007049560547 and after compress ,  104.44476890563965
user_logs DF before compress was in MB , 331.5734100341797 and after compress ,  119.73492050170898
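
The size reduction reported above is typical of numeric downcasting. The following is a minimal sketch of how such a compression step could work; `compress_dataframe` is a hypothetical stand-in and not the actual `utils.compress_dataframes` implementation, which may use different target dtypes.

```python
import numpy as np
import pandas as pd


def compress_dataframe(df: pd.DataFrame):
    """Hypothetical helper: downcast numeric columns to smaller dtypes.

    Returns (compressed_df, size_before_mb, size_after_mb).
    """
    before_mb = df.memory_usage(deep=False).sum() / 1024 ** 2
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # int64 -> smallest integer type that still holds every value
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 when the values allow it
            out[col] = pd.to_numeric(out[col], downcast="float")
    after_mb = out.memory_usage(deep=False).sum() / 1024 ** 2
    return out, before_mb, after_mb


# Toy frame (assumed data) with the default 64-bit dtypes
toy = pd.DataFrame({
    "city": np.array([1, 5, 13], dtype="int64"),
    "total_secs": np.array([13576.0, 6232.0, 278.25], dtype="float64"),
})
toy_small, before_mb, after_mb = compress_dataframe(toy)
print(toy_small.dtypes)
```

Downcasting floats trades precision for memory, so it is worth checking that columns such as `total_secs` still carry enough precision for the features built on them.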


### 1.4 Data pre-processing

##### 1.4.1 Members data

In [10]:
members.head()

Unnamed: 0,msno,city,bd,gender,registered_via,registration_init_time
0,xYPYczYWex38H7EahgDJL/IzmJUGxLqibbtKaL2hGR8=,1,0,,4,20161223
1,p80xUzvq8nBE/ExNjM9Q6/9TZ1meicF6cFZK0YBpgek=,1,0,,4,20161223
2,qr8IVotaLLFgJ7b9bOAtSdFQq2BaefkkiZdUoggHoy8=,1,0,,4,20161223
3,dKb/fL1RaKObYERPM1R2jO8Tjaj2076cWSTCoP5H3B4=,1,0,,4,20161223
4,eOvxwd67VVh+j+0rnAwlh8+cVPwFzudbH03XHv9ZZAc=,1,0,,4,20161223


In [11]:
# #this function is also available in utils.py
# def get_label_encoding_dataframe(dataframe, column_name, mapping_dict):
#     return dataframe[column_name].map(mapping_dict) 
# # #average_age if (x <=0 or x >100) else x
# def get_apply_condiiton_on_column(dataframe, column_name, condition):
#     return dataframe[column_name].apply(lambda x :eval(condition))

In [12]:
%%time

#Replacing missing values in gender
members['gender'] = utils.get_fill_na_dataframe(members, 'gender', value="others")

gender_mapping = {'male':0,'female':1,'others':2}
members['gender'] = utils.get_label_encoding_dataframe(members, 'gender',gender_mapping)


members['registered_via'] = utils.get_convert_column_dtype(members, 'registered_via', data_type='str')
members['city'] = utils.get_convert_column_dtype(members, 'city', data_type='str')
members['registration_init_time'] = utils.fix_time_in_df(members, 'registration_init_time', expand=False)

average_age = round(members['bd'].mean(),2)
condition = f"{average_age} if (x <=0 or x >100) else x"
members['bd'] = utils.get_apply_condiiton_on_column(members, 'bd', condition)

members.head()

CPU times: user 52.1 s, sys: 1.27 s, total: 53.4 s
Wall time: 53.3 s


Unnamed: 0,msno,city,bd,gender,registered_via,registration_init_time
0,xYPYczYWex38H7EahgDJL/IzmJUGxLqibbtKaL2hGR8=,1,6.3,2,4,2016-12-23
1,p80xUzvq8nBE/ExNjM9Q6/9TZ1meicF6cFZK0YBpgek=,1,6.3,2,4,2016-12-23
2,qr8IVotaLLFgJ7b9bOAtSdFQq2BaefkkiZdUoggHoy8=,1,6.3,2,4,2016-12-23
3,dKb/fL1RaKObYERPM1R2jO8Tjaj2076cWSTCoP5H3B4=,1,6.3,2,4,2016-12-23
4,eOvxwd67VVh+j+0rnAwlh8+cVPwFzudbH03XHv9ZZAc=,1,6.3,2,4,2016-12-23
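
The cell above applies its age rule row by row through a condition string. As a hedged alternative, the same gender encoding and age clean-up can be sketched with vectorized pandas operations; the toy data below is assumed, and this is not how the `utils` helpers are implemented.

```python
import pandas as pd

# Toy members frame (assumed data): missing genders and out-of-range ages
toy_members = pd.DataFrame({
    "gender": ["male", None, "female", None],
    "bd": [0, 23, 150, -5],
})

# Fill missing genders, then label-encode with the same mapping as above
gender_mapping = {"male": 0, "female": 1, "others": 2}
toy_members["gender"] = toy_members["gender"].fillna("others").map(gender_mapping)

# Replace ages outside (0, 100] with the column mean, as the condition string does
average_age = round(toy_members["bd"].mean(), 2)
valid_age = (toy_members["bd"] > 0) & (toy_members["bd"] <= 100)
toy_members["bd"] = toy_members["bd"].where(valid_age, average_age)

print(toy_members)
```

As in the original cell, the mean here is taken over all rows, including the out-of-range ones. Computing it over the valid ages only is an alternative worth considering, since invalid zeros pull the mean down.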


In [13]:
# observing the distribution of columns
utils.get_data_describe(members)

Unnamed: 0,bd,gender
count,4348970.0,4348970.0
mean,11.13,1.65
std,10.53,0.68
min,1.0,0.0
25%,6.3,2.0
50%,6.3,2.0
75%,6.3,2.0
max,100.0,2.0


##### 1.4.2 Transactions data

In [14]:
%%time
#date conversion

transactions['transaction_date'] = utils.fix_time_in_df(transactions, 'transaction_date', expand=False)
transactions['membership_expire_date'] = utils.fix_time_in_df(transactions, 'membership_expire_date', expand=False)
transactions.head()

CPU times: user 4.06 s, sys: 486 ms, total: 4.54 s
Wall time: 4.49 s


Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,Qw6UVFUknPVOLxSSsejinxU/8a5/AgmiWMvPoEt0rik=,39,30,149,149,1,2015-09-30,2015-11-08,0
1,o/7kLhrPCLGM9bhiD72KnzyESFAA8e0j0hQ5E0ZV6Uk=,23,0,0,149,1,2015-09-30,2015-10-31,0
2,fDxZokyT74FiBLf96N1JTmW0szBM+nHMFWsUaDQNFtw=,33,30,149,149,1,2015-11-30,2016-01-03,0
3,wWeIJemrBKSCN5eueejlHoZB3ns6RHD0itRv2SZOcEk=,41,30,99,99,1,2015-11-30,2015-12-31,0
4,Z23GxUZWHQpJOxwlGBSTHRf1KgRfb0/hTVgVWo4eDCc=,41,30,99,99,1,2015-11-30,2015-12-31,0
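
The raw date columns hold integers such as 20161223 (see the members preview). A minimal sketch of the kind of conversion `utils.fix_time_in_df` appears to perform, assuming a `YYYYMMDD` layout; the toy frame and the direct use of `pd.to_datetime` are assumptions, not the helper's actual code.

```python
import pandas as pd

# Toy frame (assumed data) with dates stored as YYYYMMDD integers
raw = pd.DataFrame({
    "transaction_date": [20150930, 20151130],
    "membership_expire_date": [20151108, 20160103],
})

for col in ["transaction_date", "membership_expire_date"]:
    # Cast to string first so the explicit format can be applied
    raw[col] = pd.to_datetime(raw[col].astype(str), format="%Y%m%d")

print(raw.dtypes)
```

Passing an explicit `format` avoids ambiguous parsing and is usually faster than letting pandas infer the layout.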


### 2 Feature Engineering

#### 2.1 Generating features from transactions data


* **is_discount**
Recall that our dataset has two columns named “plan_list_price” and “actual_amount_paid”. From these we can tell whether a user bought the plan at a discount by checking whether the amount paid is smaller than the plan’s list price. This feature is stored in “is_discount”, where
	1 represents that the plan was bought at a discounted price
	0 represents that the plan was bought at the original price
We will also store the discount that the user received in “discount”.
 
* **amt_per_day**
We will now create a feature that calculates the per-day cost of a user’s subscription. It is expected that if the per-day cost of the subscription is high then the propensity of the user to churn increases. We will store this information in a column called “amt_per_day”.
 
* **membership_duration**
We also expect long-standing customers to have a lower probability of churning, so we will create a feature “membership_duration” that holds the number of days between the transaction date and the membership expiry date of each transaction.
 
After creating and storing the above-mentioned features in “transactions.csv”, we will generate a profile report for it.
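
The features described above can be sketched in plain pandas on a toy frame. The data is assumed, and the vectorized expressions are an illustration rather than the `utils` implementation used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy transactions (assumed data); the second row has a zero-day plan
toy_tx = pd.DataFrame({
    "plan_list_price": [149, 0, 180],
    "actual_amount_paid": [149, 149, 150],
    "payment_plan_days": [30, 0, 30],
    "transaction_date": pd.to_datetime(["2015-09-30", "2015-09-30", "2016-01-01"]),
    "membership_expire_date": pd.to_datetime(["2015-11-08", "2015-10-31", "2016-01-31"]),
})

# discount / is_discount: list price minus amount paid, flagged when positive
toy_tx["discount"] = toy_tx["plan_list_price"] - toy_tx["actual_amount_paid"]
toy_tx["is_discount"] = (toy_tx["discount"] > 0).astype(int)

# amt_per_day: division by a zero-day plan yields inf, which is replaced with 0
toy_tx["amt_per_day"] = (
    toy_tx["actual_amount_paid"] / toy_tx["payment_plan_days"]
).replace([np.inf, -np.inf], 0)

# membership_duration in days, plus a flag for durations above 30 days
toy_tx["membership_duration"] = (
    toy_tx["membership_expire_date"] - toy_tx["transaction_date"]
).dt.days
toy_tx["more_than_30"] = (toy_tx["membership_duration"] > 30).astype(int)

print(toy_tx[["discount", "is_discount", "amt_per_day", "membership_duration", "more_than_30"]])
```

Note that a negative `discount` (the user paid more than the list price) is treated as "no discount" by the `> 0` check.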


In [15]:
#these functions are also present in utils.py
# def get_two_column_operations(dataframe, columns_1, columns_2, operator):
#     if operator == "+":
#         return dataframe[columns_1]+dataframe[columns_2]
#     elif operator == "-":
#         return dataframe[columns_1]-dataframe[columns_2]
#     elif operator == "/":
#         return dataframe[columns_1]/dataframe[columns_2]
#     elif operator == "*":
#         return dataframe[columns_1]*dataframe[columns_2]
    
# def get_timedelta_division(dataframe, column, td_type='D'):
#     return dataframe[column] /np.timedelta64(1,td_type)

# def get_replace_value_in_df(dataframe, column, value, replace_with):
#     return dataframe[column].replace(value,replace_with) 

In [16]:
%%time

transactions['discount'] =  utils.get_two_column_operations(transactions, 'plan_list_price', 'actual_amount_paid', "-")

condition = "1 if x > 0 else 0"
transactions['is_discount'] = utils.get_apply_condiiton_on_column(transactions, 'discount', condition)


transactions['amt_per_day'] = utils.get_two_column_operations(transactions, 'actual_amount_paid', 'payment_plan_days', "/")
transactions['amt_per_day'] = utils.get_replace_value_in_df(transactions, 'amt_per_day', [np.inf, -np.inf], replace_with=0)


transactions['membership_duration'] = utils.get_two_column_operations(transactions, 'membership_expire_date', 'transaction_date', "-")
transactions['membership_duration'] = utils.get_timedelta_division(transactions, "membership_duration", td_type='D')
transactions['membership_duration'] = utils.get_convert_column_dtype(transactions, 'membership_duration', data_type='int')

condition = "1 if x>30 else 0"
transactions['more_than_30'] = utils.get_apply_condiiton_on_column(transactions, 'membership_duration', condition)

CPU times: user 1min 6s, sys: 410 ms, total: 1min 7s
Wall time: 1min 6s


In [17]:
transactions.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,discount,is_discount,amt_per_day,membership_duration,more_than_30
0,Qw6UVFUknPVOLxSSsejinxU/8a5/AgmiWMvPoEt0rik=,39,30,149,149,1,2015-09-30,2015-11-08,0,0,0,4.966667,39,1
1,o/7kLhrPCLGM9bhiD72KnzyESFAA8e0j0hQ5E0ZV6Uk=,23,0,0,149,1,2015-09-30,2015-10-31,0,-149,0,0.0,31,1
2,fDxZokyT74FiBLf96N1JTmW0szBM+nHMFWsUaDQNFtw=,33,30,149,149,1,2015-11-30,2016-01-03,0,0,0,4.966667,34,1
3,wWeIJemrBKSCN5eueejlHoZB3ns6RHD0itRv2SZOcEk=,41,30,99,99,1,2015-11-30,2015-12-31,0,0,0,3.3,31,1
4,Z23GxUZWHQpJOxwlGBSTHRf1KgRfb0/hTVgVWo4eDCc=,41,30,99,99,1,2015-11-30,2015-12-31,0,0,0,3.3,31,1


We will apply different aggregations to each column to derive additional features that better capture the relationship between the independent and dependent variables.

In [18]:
agg = {'payment_method_id':['count','nunique'], # Number of past transactions; number of distinct payment methods used
       'payment_plan_days':['mean', 'nunique'] , # Average plan length in days; number of distinct plan lengths
       'plan_list_price':'mean', # Average list price charged to the user
       'actual_amount_paid':'mean', # Average amount paid by the user
       'is_auto_renew':['mean','max'], # Share of auto-renewed transactions; whether auto-renew was ever on
       'transaction_date':['min','max','count'], # First and last transaction dates; number of transactions
       'membership_expire_date':'max' , # Membership expiry date of the user's last subscription
       'is_cancel':['mean','max'], # Share of cancelled transactions; whether the user ever cancelled
       'discount' : 'mean', # Average discount given to the customer
       'is_discount':['mean','max'], # Share of discounted transactions; whether the user ever received a discount
       'amt_per_day' : 'mean', # Average amount the user pays per plan day
       'membership_duration' : 'mean' ,# Average membership duration in days
       'more_than_30' : 'sum' # Number of transactions whose membership duration exceeds 30 days
        }

In [19]:
transactions_features = utils.get_groupby(transactions, by_column='msno', agg_dict=agg, agg_func = 'mean', simple_agg_flag=False, reset_index=True)
transactions_features.columns= transactions_features.columns.get_level_values(0)+'_'+transactions_features.columns.get_level_values(1)
transactions_features.rename(columns = {'msno_':'msno','payment_plan_days_nunique':'change_in_plan', 'payment_method_id_count':'total_payment_channels',
                                        'payment_method_id_nunique':'change_in_payment_methods','is_cancel_max':'is_cancel_change_flag',
                                        'is_auto_renew_max':'is_autorenew_change_flag','transaction_date_count':'total_transactions'}, inplace = True)
transactions_features.head()

Unnamed: 0,msno,total_payment_channels,change_in_payment_methods,payment_plan_days_mean,change_in_plan,plan_list_price_mean,actual_amount_paid_mean,is_auto_renew_mean,is_autorenew_change_flag,transaction_date_min,transaction_date_max,total_transactions,membership_expire_date_max,is_cancel_mean,is_cancel_change_flag,discount_mean,is_discount_mean,is_discount_max,amt_per_day_mean,membership_duration_mean,more_than_30_sum
0,++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=,12,1,30.0,1,149.0,149.0,1.0,1,2016-03-15,2017-02-15,12,2017-03-15,0.0,0,0.0,0.0,0,4.966667,30.416667,7
1,++0/NopttBsaAn6qHZA2AWWrDg7Me7UOMs1vsyo4tSI=,12,1,30.0,1,149.0,149.0,1.0,1,2016-03-20,2017-02-20,12,2017-03-20,0.0,0,0.0,0.0,0,4.966667,30.416667,7
2,++0BJXY8tpirgIhJR14LDM1pnaRosjD1mdO1mIKxlJA=,16,2,34.1875,3,160.1875,160.1875,0.0,0,2015-09-02,2017-02-24,16,2017-03-26,0.0,0,0.0,0.0,0,4.594271,36.25,2
3,++1G0wVY14Lp0VXak1ymLhPUdXPSFJVBnjWwzGxBKJs=,18,1,30.0,1,99.0,99.0,1.0,1,2015-09-09,2017-02-09,18,2017-03-09,0.0,0,0.0,0.0,0,3.3,30.388889,10
4,++1GCIyXZO7834NjDKmcK1lBVLQi9PsN6sOC7wfW+8g=,8,1,30.0,1,99.0,99.0,1.0,1,2016-07-05,2017-02-04,8,2017-03-04,0.0,0,0.0,0.0,0,3.3,30.25,4


In [20]:
transactions_features.shape

(385591, 21)
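
The per-user aggregation and column flattening can be illustrated on a toy frame. This is a sketch with assumed data using plain `groupby().agg()`, not the `utils.get_groupby` implementation; the extra `nunique` on `is_auto_renew` is added here only for comparison.

```python
import pandas as pd

# Toy transactions (assumed data): user "a" switches auto-renew off, user "b" never changes
toy = pd.DataFrame({
    "msno": ["a", "a", "a", "b"],
    "payment_method_id": [41, 41, 39, 33],
    "is_auto_renew": [1, 1, 0, 1],
})

agg = {
    "payment_method_id": ["count", "nunique"],
    "is_auto_renew": ["mean", "max", "nunique"],
}

features = toy.groupby("msno").agg(agg).reset_index()
# Flatten the two-level column index: ("is_auto_renew", "max") -> "is_auto_renew_max"
features.columns = ["_".join(filter(None, col)) for col in features.columns]

print(features)
```

In this toy example `is_auto_renew_max` is 1 for both users even though only user "a" changed state, whereas `is_auto_renew_nunique` is 2 for "a" and 1 for "b". If the goal is to flag a change of state, `nunique` may be the more direct signal.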

#### 2.2 Generating features from user logs

Here we will engineer features that better represent a user’s behavior. We will try to measure the user’s engagement with the platform.

* **login_freq**
A simple way to quantify a user’s engagement is to count the number of times the user has used the platform in a given period of time. We create this feature and store it in “login_freq”. We expect that a user who is engaged with the platform will have a lower propensity to churn.
 
* **last_login**
A user who has not been active recently has a higher propensity to churn. We create a feature that records the last login of a user and store it in the “last_login” column.
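
Both engagement features reduce to a single group-by over the log dates. A minimal sketch on assumed toy data, using pandas named aggregation rather than the `utils.get_groupby` helper:

```python
import pandas as pd

# Toy user logs (assumed data): one row per user per active day
toy_logs = pd.DataFrame({
    "msno": ["a", "a", "a", "b"],
    "date": pd.to_datetime(["2016-01-03", "2016-03-02", "2016-06-28", "2016-08-25"]),
})

engagement = (
    toy_logs.groupby("msno")["date"]
    .agg(login_freq="count", last_login="max")  # number of active days, most recent one
    .reset_index()
)

print(engagement)
```

Named aggregation produces flat column names directly, so no multi-level column clean-up is needed afterwards.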

In [21]:
user_logs.head()

Unnamed: 0,msno,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
0,kvn0x25i/D06AX1K3Sv9djeZA5oRsjPm8ysAl4rzBYs=,20151029,8,1,1,2,54,58,13576.0
1,kvn0x25i/D06AX1K3Sv9djeZA5oRsjPm8ysAl4rzBYs=,20160103,0,0,0,1,27,28,6232.0
2,kvn0x25i/D06AX1K3Sv9djeZA5oRsjPm8ysAl4rzBYs=,20160302,19,5,2,0,69,78,17520.0
3,kvn0x25i/D06AX1K3Sv9djeZA5oRsjPm8ysAl4rzBYs=,20160428,26,0,0,0,1,27,278.25
4,kvn0x25i/D06AX1K3Sv9djeZA5oRsjPm8ysAl4rzBYs=,20160628,2,0,0,0,0,2,11.34375


In [22]:
utils.get_data_describe(user_logs)

Unnamed: 0,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
count,4828886.0,4828886.0,4828886.0,4828886.0,4828886.0,4828886.0,4828886.0,4828886.0
mean,20160860.03,6.93,1.69,1.04,1.11,28.28,28.77,
std,5344.84,14.63,4.38,2.15,2.48,37.4,31.84,
min,20150101.0,0.0,0.0,0.0,0.0,0.0,1.0,-inf
25%,20160323.0,1.0,0.0,0.0,0.0,6.0,8.0,1843.0
50%,20160812.0,2.0,1.0,0.0,0.0,16.0,18.0,4396.0
75%,20161129.0,7.0,2.0,1.0,1.0,35.0,38.0,9528.0
max,20170228.0,2574.0,846.0,327.0,784.0,8396.0,2020.0,inf


In [23]:
user_logs['date'] =  utils.fix_time_in_df(user_logs, column_name='date', expand=False)
user_logs_transformed = utils.get_fix_skew_with_log(user_logs, ['num_25','num_50','num_75','num_985','num_100','num_unq','total_secs'], 
                                              replace_inf = True, replace_inf_with = 0)
user_logs_transformed.head()

Unnamed: 0,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,msno,date
0,2.079442,0.0,0.0,0.693147,3.988984,4.060443,9.515625,kvn0x25i/D06AX1K3Sv9djeZA5oRsjPm8ysAl4rzBYs=,2015-10-29
1,0.0,0.0,0.0,0.0,3.295837,3.332205,8.734375,kvn0x25i/D06AX1K3Sv9djeZA5oRsjPm8ysAl4rzBYs=,2016-01-03
2,2.944439,1.609438,0.693147,0.0,4.234107,4.356709,9.773438,kvn0x25i/D06AX1K3Sv9djeZA5oRsjPm8ysAl4rzBYs=,2016-03-02
3,3.258096,0.0,0.0,0.0,0.0,3.295837,5.628906,kvn0x25i/D06AX1K3Sv9djeZA5oRsjPm8ysAl4rzBYs=,2016-04-28
4,0.693147,0.0,0.0,0.0,0.0,0.693147,2.427734,kvn0x25i/D06AX1K3Sv9djeZA5oRsjPm8ysAl4rzBYs=,2016-06-28


In [24]:
utils.get_data_describe(user_logs_transformed)

Unnamed: 0,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
count,4828886.0,4828886.0,4828886.0,4828886.0,4828886.0,4828886.0,4828648.0
mean,1.14,0.39,0.25,0.27,2.62,2.78,
std,1.19,0.68,0.51,0.54,1.31,1.17,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,-6.91
25%,0.0,0.0,0.0,0.0,1.79,2.08,7.51
50%,0.69,0.0,0.0,0.0,2.77,2.89,8.38
75%,1.95,0.69,0.0,0.0,3.56,3.64,9.16
max,7.85,6.74,5.79,6.66,9.04,7.61,11.09
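
The transformed values above are consistent with a natural logarithm in which infinities are replaced by 0 (for example, 8 becomes 2.079442 and 0 becomes 0.0). The sketch below reproduces that behaviour on assumed toy values and contrasts it with `np.log1p`; it is an illustration, not the code of `utils.get_fix_skew_with_log`.

```python
import numpy as np
import pandas as pd

# Toy play counts (assumed data), including a zero
counts = pd.Series([8, 1, 0, 2], name="num_25", dtype="float64")

# Plain log: log(0) is -inf, which is then replaced with 0
with np.errstate(divide="ignore"):
    plain_log = np.log(counts).replace([np.inf, -np.inf], 0)

# log1p computes log(1 + x), so zero counts need no special handling
shifted_log = np.log1p(counts)

print(pd.DataFrame({"raw": counts, "log": plain_log, "log1p": shifted_log}))
```

With the plain log, counts of 0 and 1 both map to 0 and become indistinguishable. `log1p` keeps them apart, which may matter for sparse columns such as `num_75` and `num_985`.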


In [25]:
user_logs_transformed_base = utils.get_groupby(user_logs_transformed,'msno', agg_dict=None, agg_func = 'mean', simple_agg_flag=True, reset_index=True)
user_logs_transformed_base.head()

Unnamed: 0,msno,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
0,++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=,1.212655,0.426138,0.471913,0.584749,2.38954,2.062883,8.078125
1,++0/NopttBsaAn6qHZA2AWWrDg7Me7UOMs1vsyo4tSI=,1.174083,0.585266,0.310613,0.42515,1.780883,2.394411,7.570312
2,++0BJXY8tpirgIhJR14LDM1pnaRosjD1mdO1mIKxlJA=,0.276101,0.268682,0.061034,0.176559,2.661876,2.832618,8.242188
3,++1G0wVY14Lp0VXak1ymLhPUdXPSFJVBnjWwzGxBKJs=,0.658754,0.155307,0.198628,0.180648,2.417614,2.583583,7.972656
4,++1GCIyXZO7834NjDKmcK1lBVLQi9PsN6sOC7wfW+8g=,0.346574,0.173287,0.0,0.0,3.005417,2.81224,8.5625


In [26]:
agg_dict = { 'date':['count','max'] }
user_logs_transformed_dates = utils.get_groupby(user_logs_transformed,'msno', agg_dict=agg_dict, agg_func = 'mean', simple_agg_flag=False, reset_index=True)
user_logs_transformed_dates.columns = user_logs_transformed_dates.columns.droplevel()
user_logs_transformed_dates.rename(columns = {'count':'login_freq', 'max': 'last_login'}, inplace = True)
user_logs_transformed_dates.reset_index(inplace=True)
user_logs_transformed_dates.drop('index',inplace=True,axis=1)
user_logs_transformed_dates.columns = ['msno','login_freq','last_login']
user_logs_transformed_dates.head()

Unnamed: 0,msno,login_freq,last_login
0,++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=,24,2017-02-07
1,++0/NopttBsaAn6qHZA2AWWrDg7Me7UOMs1vsyo4tSI=,8,2016-12-08
2,++0BJXY8tpirgIhJR14LDM1pnaRosjD1mdO1mIKxlJA=,18,2016-12-14
3,++1G0wVY14Lp0VXak1ymLhPUdXPSFJVBnjWwzGxBKJs=,16,2016-12-26
4,++1GCIyXZO7834NjDKmcK1lBVLQi9PsN6sOC7wfW+8g=,4,2016-08-25


In [27]:
user_logs_final = utils.get_merge(user_logs_transformed_base, user_logs_transformed_dates, on = 'msno') 
user_logs_final.head()

Unnamed: 0,msno,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,login_freq,last_login
0,++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=,1.212655,0.426138,0.471913,0.584749,2.38954,2.062883,8.078125,24,2017-02-07
1,++0/NopttBsaAn6qHZA2AWWrDg7Me7UOMs1vsyo4tSI=,1.174083,0.585266,0.310613,0.42515,1.780883,2.394411,7.570312,8,2016-12-08
2,++0BJXY8tpirgIhJR14LDM1pnaRosjD1mdO1mIKxlJA=,0.276101,0.268682,0.061034,0.176559,2.661876,2.832618,8.242188,18,2016-12-14
3,++1G0wVY14Lp0VXak1ymLhPUdXPSFJVBnjWwzGxBKJs=,0.658754,0.155307,0.198628,0.180648,2.417614,2.583583,7.972656,16,2016-12-26
4,++1GCIyXZO7834NjDKmcK1lBVLQi9PsN6sOC7wfW+8g=,0.346574,0.173287,0.0,0.0,3.005417,2.81224,8.5625,4,2016-08-25


### Joining the dataset

In [28]:
print(members.shape)
print(train.shape)
print(transactions_features.shape)
print(user_logs_final.shape)

(4348970, 6)
(385591, 2)
(385591, 21)
(324000, 10)


In [46]:
%%time
train_df_v01 = utils.get_merge(members, train, on='msno', axis=1, how='inner')
train_df_v02 = utils.get_merge(train_df_v01, transactions_features, on='msno', axis=1, how='inner')
train_df_final = utils.get_merge(train_df_v02, user_logs_final, on='msno', axis=1, how='inner')
train_df_final.head()

CPU times: user 4.39 s, sys: 467 ms, total: 4.86 s
Wall time: 4.83 s


Unnamed: 0,msno,city,bd,gender,registered_via,registration_init_time,is_churn,total_payment_channels,change_in_payment_methods,payment_plan_days_mean,change_in_plan,plan_list_price_mean,actual_amount_paid_mean,is_auto_renew_mean,is_autorenew_change_flag,transaction_date_min,transaction_date_max,total_transactions,membership_expire_date_max,is_cancel_mean,is_cancel_change_flag,discount_mean,is_discount_mean,is_discount_max,amt_per_day_mean,membership_duration_mean,more_than_30_sum,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,login_freq,last_login
0,/7XuYVGXYHPggWsdtok0JEurQ10CtUO1Y8dDgy1/B0M=,1,6.3,2,7,2016-12-23,0,3,1,30.0,1,149.0,149.0,1.0,1,2016-12-23,2017-02-22,3,2017-03-22,0.0,0,0.0,0.0,0,4.966667,29.666667,1,0.0,0.0,0.0,0.0,2.289867,2.289867,7.808594,4,2017-02-14
1,gB3/kawEQSauWFArU9Z0kZo+ikw9GqJv0rXqNbpVnTY=,1,6.3,2,7,2016-12-23,0,3,1,30.0,1,99.0,99.0,1.0,1,2016-12-23,2017-02-23,3,2017-03-23,0.0,0,0.0,0.0,0,3.3,30.0,2,0.274653,0.0,0.0,0.0,2.845647,2.28193,8.359375,4,2017-02-22
2,2aFAPs3QmxD+bNcCe8beuWcI7SZHg1k+1irALOxiw3k=,15,23.0,1,4,2016-12-24,0,3,1,30.0,1,149.0,149.0,1.0,1,2016-12-27,2017-02-27,3,2017-03-26,0.0,0,0.0,0.0,0,4.966667,29.0,0,0.0,0.0,0.0,0.0,4.708342,4.6837,10.234375,4,2017-01-16
3,FjEZAhwFky8sWoaNGTp+p/r3/hH30WxLr396iSho3gs=,1,6.3,2,7,2016-12-25,0,3,1,30.0,1,99.0,99.0,1.0,1,2016-12-25,2017-02-24,3,2017-03-24,0.0,0,0.0,0.0,0,3.3,29.666667,1,0.621227,0.173287,0.0,0.346574,3.070758,2.640511,8.632812,4,2017-02-20
4,C5PNTuQxUQmHOXPptQnokhqH1XQoAHHL8pMWIX0nAh0=,1,6.3,2,7,2016-12-25,0,3,1,30.0,1,99.0,99.0,1.0,1,2016-12-25,2017-02-24,3,2017-03-24,0.0,0,0.0,0.0,0,3.3,29.666667,1,0.0,0.0,0.0,0.0,1.595831,1.499937,7.082031,3,2017-02-09
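
Because `user_logs_final` has 324,000 rows against 385,591 labelled users, the inner joins above can keep at most 324,000 users. A small sketch, on assumed toy frames and with plain `pd.merge` instead of `utils.get_merge`, of how to check key uniqueness and list the users an inner join would drop:

```python
import pandas as pd

# Toy frames (assumed data): user "c" has a label but no logs
left = pd.DataFrame({"msno": ["a", "b", "c"], "is_churn": [0, 1, 0]})
right = pd.DataFrame({"msno": ["a", "b", "d"], "login_freq": [24, 8, 4]})

# validate raises MergeError if msno is duplicated on either side
merged = pd.merge(left, right, on="msno", how="inner", validate="one_to_one")

# A left join with indicator=True shows which labelled users have no match
audit = pd.merge(left, right, on="msno", how="left", indicator=True)
dropped = audit.loc[audit["_merge"] == "left_only", "msno"].tolist()

print(merged.shape, dropped)
```

Whether dropping such users is acceptable depends on the modelling goal; keeping them with a left join and imputing their log features is the usual alternative.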


#### Registration Duration
* It is important to understand how long the customer has been part of the system. We can calculate this as the number of days between 'registration_init_time' and 'membership_expire_date_max'.
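
The calculation amounts to a date subtraction converted to whole days. A minimal sketch with dates taken from the first and third rows of the preview above, using plain pandas in place of the `utils` helpers:

```python
import pandas as pd

toy = pd.DataFrame({
    "registration_init_time": pd.to_datetime(["2016-12-23", "2016-12-24"]),
    "membership_expire_date_max": pd.to_datetime(["2017-03-22", "2017-03-26"]),
})

# Timedelta between last expiry and registration, expressed in days
toy["registration_duration"] = (
    toy["membership_expire_date_max"] - toy["registration_init_time"]
).dt.days

print(toy)
```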

In [47]:
train_df_final['registration_duration'] = utils.get_two_column_operations(train_df_final, 'membership_expire_date_max', 'registration_init_time', "-")
train_df_final['registration_duration'] = utils.get_timedelta_division(train_df_final, "registration_duration", td_type='D')
train_df_final['registration_duration'] = utils.get_convert_column_dtype(train_df_final, 'registration_duration', data_type='int')

In [48]:
train_df_final.head()

Unnamed: 0,msno,city,bd,gender,registered_via,registration_init_time,is_churn,total_payment_channels,change_in_payment_methods,payment_plan_days_mean,change_in_plan,plan_list_price_mean,actual_amount_paid_mean,is_auto_renew_mean,is_autorenew_change_flag,transaction_date_min,transaction_date_max,total_transactions,membership_expire_date_max,is_cancel_mean,is_cancel_change_flag,discount_mean,is_discount_mean,is_discount_max,amt_per_day_mean,membership_duration_mean,more_than_30_sum,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,login_freq,last_login,registration_duration
0,/7XuYVGXYHPggWsdtok0JEurQ10CtUO1Y8dDgy1/B0M=,1,6.3,2,7,2016-12-23,0,3,1,30.0,1,149.0,149.0,1.0,1,2016-12-23,2017-02-22,3,2017-03-22,0.0,0,0.0,0.0,0,4.966667,29.666667,1,0.0,0.0,0.0,0.0,2.289867,2.289867,7.808594,4,2017-02-14,89
1,gB3/kawEQSauWFArU9Z0kZo+ikw9GqJv0rXqNbpVnTY=,1,6.3,2,7,2016-12-23,0,3,1,30.0,1,99.0,99.0,1.0,1,2016-12-23,2017-02-23,3,2017-03-23,0.0,0,0.0,0.0,0,3.3,30.0,2,0.274653,0.0,0.0,0.0,2.845647,2.28193,8.359375,4,2017-02-22,90
2,2aFAPs3QmxD+bNcCe8beuWcI7SZHg1k+1irALOxiw3k=,15,23.0,1,4,2016-12-24,0,3,1,30.0,1,149.0,149.0,1.0,1,2016-12-27,2017-02-27,3,2017-03-26,0.0,0,0.0,0.0,0,4.966667,29.0,0,0.0,0.0,0.0,0.0,4.708342,4.6837,10.234375,4,2017-01-16,92
3,FjEZAhwFky8sWoaNGTp+p/r3/hH30WxLr396iSho3gs=,1,6.3,2,7,2016-12-25,0,3,1,30.0,1,99.0,99.0,1.0,1,2016-12-25,2017-02-24,3,2017-03-24,0.0,0,0.0,0.0,0,3.3,29.666667,1,0.621227,0.173287,0.0,0.346574,3.070758,2.640511,8.632812,4,2017-02-20,89
4,C5PNTuQxUQmHOXPptQnokhqH1XQoAHHL8pMWIX0nAh0=,1,6.3,2,7,2016-12-25,0,3,1,30.0,1,99.0,99.0,1.0,1,2016-12-25,2017-02-24,3,2017-03-24,0.0,0,0.0,0.0,0,3.3,29.666667,1,0.0,0.0,0.0,0.0,1.595831,1.499937,7.082031,3,2017-02-09,89


In [None]:
%%time
utils.get_data_profile(train_df_final,html_save_path=None, 
                     embed_in_cell=True,take_sample=False, sample_frac=0.01, 
                dataframe_name='train_df_final')

### Saving the dataset

In [50]:
%%time
utils.get_save_intermediate_data(train_df_final, path=final_processed_data_path, filename="final_train_data_process")

CPU times: user 11.7 s, sys: 34.2 ms, total: 11.8 s
Wall time: 11.8 s


('Data Saved Here :',
 '/home/data/processed/final_train_data_process_1660309884.csv')