# Problem Statement:

**Explore the datasets and develop a model to predict customer churn over time.**

**By:** Yusuf Firoz

## Step 2. Final Dataset Preparation

**Main steps:**
* Import the required libraries and the 4 pickle objects created in **Step 1**.
* Join the 4 tables on **subscription_id**.
* The joined table has **1684288** entries and 70 columns.
* Generate a new feature, **Total_delivery_duration**: the duration between the first and last delivery dates.
* Generate a new feature, **Gap_of_Cancellation**: the duration between the last delivery date and the cancellation date.
* Add the churn status of each subscriber.

**Data Cleaning:**

**In this step we remove all the illogical data:**
* A delivery date cannot be after the cancellation date, i.e. Last_event_date.
    * This removes 63884 rows.
* The first error reporting date cannot be earlier than the first delivery date.
    * This removes 94390 rows.
* The first pause date cannot be before the first delivery date.
    * This removes 738768 rows.

* 787246 entries and 70 columns remain after cleaning.
* Drop the categorical and date columns.
* Fill all null values with **0**.
* Generate a pickle object for the final dataset.
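
The row counts quoted in the cleaning summary can be cross-checked with a few lines of arithmetic. This is only a sanity check on the figures stated above; the rule labels used as dictionary keys are hypothetical names chosen for this sketch and do not appear in the notebook.

```python
# Row counts quoted in the cleaning summary
initial_rows = 1684288
removed = {
    "delivery_after_cancellation": 63884,     # hypothetical label
    "error_date_vs_first_delivery": 94390,    # hypothetical label
    "pause_date_vs_first_delivery": 738768,   # hypothetical label
}

# Subtract each removal in turn and report the running total
remaining = initial_rows
for rule, n_removed in removed.items():
    remaining -= n_removed
    print(f"{rule}: removed {n_removed}, {remaining} rows remain")

# Fraction of the joined rows that survives all three rules
share_kept = remaining / initial_rows
print(f"kept {share_kept:.1%} of the joined rows")
```

The running totals match the intermediate counts noted in the cleaning cell, and the third rule alone accounts for most of the removed rows.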




**Import required libraries**

In [3]:
import numpy as np
import pandas as pd
from datetime import datetime  
from time import time 
import pickle 
from IPython.display import display

**Import all the pickle objects**

In [7]:
boxes = pd.read_pickle('boxes.pickle')
errors = pd.read_pickle('errors.pickle')
pauses = pd.read_pickle('pauses.pickle')
cancels = pd.read_pickle('cancels.pickle')

**Join Tables**

In [8]:
# Joining all the data frames on 'subscription_id'

boxes_pauses = pd.merge(boxes,pauses,how='left',on='subscription_id')
boxes_pauses_errors = pd.merge(boxes_pauses,errors,how='left',on='subscription_id')
final = pd.merge(boxes_pauses_errors,cancels,how='left',on='subscription_id')
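
A left join keeps every row of the left table and fills unmatched columns with NaN. The sketch below illustrates this on toy data; the frames `boxes_demo` and `pauses_demo` and their values are made up for illustration. It also assumes each table holds one row per `subscription_id`, which the notebook does not state; under that assumption the `validate` argument of `pd.merge` can guard against accidental row multiplication.

```python
import pandas as pd

# Toy stand-ins for two of the pickled tables (values are made up)
boxes_demo = pd.DataFrame({"subscription_id": [1, 2, 3], "n_boxes": [5, 2, 7]})
pauses_demo = pd.DataFrame({"subscription_id": [1, 3], "n_pauses": [1, 4]})

# Left join: every subscription in boxes_demo is kept,
# subscriptions without a pause record get NaN in n_pauses
joined_demo = pd.merge(
    boxes_demo,
    pauses_demo,
    how="left",
    on="subscription_id",
    validate="one_to_one",  # raises MergeError if a key is duplicated on either side
)
print(joined_demo)
```

If a table legitimately contains several rows per subscription, `validate` would have to be relaxed (for example to `"one_to_many"`), and the joined row count would then exceed that of the left table.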


**Create new Features**

In [9]:
#Total duration between first and last delivery date.
final['Total_delivery_duration'] = final['Last_delivery_date'].subtract(final['First_delivery_date']) 

#Total duration between cancellation date and last delivery date.
final['Gap_of_Cancellation'] = final['Last_event_date'].subtract(final['Last_delivery_date']) 
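
Subtracting two datetime columns yields a timedelta column, and a missing date on either side propagates as `NaT`. A minimal sketch on made-up dates (the frame `dates_demo` is hypothetical; the column names mirror those used above):

```python
import pandas as pd

# Made-up dates; the second subscriber has no cancellation event
dates_demo = pd.DataFrame({
    "First_delivery_date": pd.to_datetime(["2016-01-04", "2016-02-01"]),
    "Last_delivery_date": pd.to_datetime(["2016-03-14", "2016-02-29"]),
    "Last_event_date": pd.to_datetime(["2016-03-20", None]),
})

# Duration between first and last delivery
dates_demo["Total_delivery_duration"] = (
    dates_demo["Last_delivery_date"] - dates_demo["First_delivery_date"]
)

# Duration between last delivery and cancellation (NaT if never cancelled)
dates_demo["Gap_of_Cancellation"] = (
    dates_demo["Last_event_date"] - dates_demo["Last_delivery_date"]
)
print(dates_demo[["Total_delivery_duration", "Gap_of_Cancellation"]])
```

The `NaT` in `Gap_of_Cancellation` marks subscribers without a cancellation, which is why nulls have to be handled before these columns are converted to integers.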


**Add the label**

In [11]:
#Tag the data by adding churn status of subscriber:
final['churn_status'] = pd.notnull(final['Last_event_date']).astype(int)
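
The label is 1 when a cancellation date exists and 0 otherwise. The toy example below (hypothetical frame `events_demo`, made-up dates) shows the same construction and how the mean of the label gives the churn rate:

```python
import pandas as pd

# Made-up cancellation dates; None means the subscriber never cancelled
events_demo = pd.DataFrame({
    "Last_event_date": pd.to_datetime(["2016-03-20", None, "2016-05-02"]),
})

# 1 = churned (a cancellation date is present), 0 = still active
events_demo["churn_status"] = events_demo["Last_event_date"].notna().astype(int)

# Because the label is 0/1, its mean is the share of churned subscribers
churn_rate = events_demo["churn_status"].mean()
print(events_demo)
print(f"churn rate: {churn_rate:.2f}")
```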

**Remove illogical Data**

In [12]:
#Delivery date cannot be after cancellation date i.e. Last_event_date
#This will remove 63884 rows.
#final[final['Last_delivery_date'] > final['Last_event_date']]
#(.loc replaces .ix, which was removed in pandas 1.0)
final.drop(final.loc[final['Last_delivery_date'] > final['Last_event_date']].index, inplace=True)
#We have total 1620404 entries and 70 columns after cleaning

#First error reporting date cannot be earlier than first delivery date.
#This will remove 94390 rows.
#final[final['First_delivery_date'] > final['first_reported_date']]
#(.loc replaces .ix, which was removed in pandas 1.0)
final.drop(final.loc[final['First_delivery_date'] > final['first_reported_date']].index, inplace=True)
#We have total 1526014 entries and 70 columns after cleaning

#First Pause date cannot be before first delivery date
#This will remove 738768 rows
#Note: as written, the mask selects rows where First_pause_start is later than First_delivery_date
#final[final['First_pause_start'] > final['First_delivery_date']]
#(.loc replaces .ix, which was removed in pandas 1.0)
final.drop(final.loc[final['First_pause_start'] > final['First_delivery_date']].index, inplace=True)
#We have total 787246 entries and 70 columns after cleaning
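
Comparisons involving `NaT` evaluate to `False`, so rows with a missing date are never selected by these masks and therefore survive the cleaning. The sketch below shows this on made-up data (hypothetical frame `rows_demo`); it filters with an inverted boolean mask, which is an equivalent alternative to dropping by index rather than the notebook's own call.

```python
import pandas as pd

# Made-up rows: valid, illogical (delivery after cancellation), never cancelled
rows_demo = pd.DataFrame({
    "Last_delivery_date": pd.to_datetime(["2016-03-14", "2016-04-01", "2016-02-29"]),
    "Last_event_date": pd.to_datetime(["2016-03-20", "2016-03-25", None]),
})

# True only where both dates exist and the delivery is after the cancellation
bad = rows_demo["Last_delivery_date"] > rows_demo["Last_event_date"]

# Keep everything that is not flagged, including rows with a missing date
cleaned_demo = rows_demo.loc[~bad]
print(f"removed {int(bad.sum())} row(s), {len(cleaned_demo)} remain")
```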

**Remove categorical and date features**

In [13]:
# Delete extra columns like date, Product type, Channel type etc.
final.drop(['started_week','Product','Channel','First_delivery_date',
            'Last_delivery_date','First_pause_start','Last_pause_end','first_reported_date',
           'last_reported_date','First_event_date','Last_event_date'], axis=1, inplace=True)


* **Fill all the null with 0**
* **Convert days to integer**

In [14]:
# Convert the timedelta columns to whole days first (missing values become NaN);
# filling a timedelta column with 0 before using .dt fails on recent pandas versions
duration_cols = ['Total_pause_duration', 'Total_delivery_duration', 'Gap_of_Cancellation']
for col in duration_cols:
    final[col] = final[col].dt.days

# Fill all the nulls with 0
final = final.fillna(0)

# Store the day counts as integers
final[duration_cols] = final[duration_cols].astype(int)

# Convert subscription_id to the index of final dataset
final.set_index('subscription_id', inplace=True)
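
As a small illustration of the day conversion, the sketch below turns a timedelta series with a missing value into integer day counts. The series `durations_demo` is made up; note that `.dt.days` keeps only the whole-day component, so 6 days 12 hours becomes 6.

```python
import pandas as pd

# Made-up durations, including one missing value
durations_demo = pd.Series([
    pd.Timedelta(days=70),
    pd.NaT,
    pd.Timedelta(days=6, hours=12),
])

# .dt.days gives the whole-day component (NaN for NaT),
# which is then filled with 0 and cast to int
days_demo = durations_demo.dt.days.fillna(0).astype(int)
print(days_demo.tolist())
```

Filling with 0 makes a missing duration indistinguishable from a genuine zero-day duration, so a downstream model sees both cases as the same value.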

**Pickle file for the final dataset**

In [16]:
#Create pickle file for the final dataset
with open('final.pickle', 'wb') as f:
    pickle.dump(final, f)
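
A quick round trip can confirm that a pickled DataFrame reloads unchanged, including its index. The sketch below uses a made-up frame (`frame_demo`) and a temporary directory so it does not touch `final.pickle`; it uses `DataFrame.to_pickle`, a pandas shortcut for the `pickle.dump` call above.

```python
import os
import tempfile

import pandas as pd

# Made-up frame indexed by subscription_id, like the final dataset
frame_demo = pd.DataFrame(
    {"churn_status": [1, 0]},
    index=pd.Index([101, 102], name="subscription_id"),
)

# Write to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "final_demo.pickle")
    frame_demo.to_pickle(path)
    restored_demo = pd.read_pickle(path)

# equals() compares values, dtypes and index
round_trip_ok = restored_demo.equals(frame_demo)
print(f"round trip identical: {round_trip_ok}")
```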