# Telecom Churn Prediction

 > Developers - Muni, Sreedhar K

# Problem Statement

> In the telecommunication industry, customers tend to switch operators when they are not offered attractive schemes
and offers. It is therefore very important for any telecom operator to keep its existing customers from churning to other operators.
The goal of this case study is to build an ML model that predicts, from past data, whether a customer will churn in a particular month.


# Problem data
<br>
<a href="https://www.kaggle.com/competitions/telecom-churn-case-study-hackathon-c41/overview">Competition link</a>
</br>
<br>
<a href="https://cdn.upgrad.com/uploads/production/a1e63cc1-7b2a-4d87-886f-fcb90bcda68b/Upgrad+hackathon.pdf">Upgrad Hackathon details</a>
</br>
<br>
<a href="https://www.kaggle.com/competitions/telecom-churn-case-study-hackathon-c41/data">Dataset</a>
<br>
<br>
Please note that you need to submit only from one account on Kaggle and the team name should be: <br><b>Name_of_member1_Name_of_member2</b>

# Business Objective
To reduce customer churn, telecom companies need to predict which customers are at high risk of churning. The given dataset contains customer-level information for three consecutive months, June, July and August, which are encoded as 6, 7 and 8. The business objective is to predict which customers will churn in the next month by analyzing this dataset.

High Value Customers:

One of the primary goals is to identify high-value customers who are likely to churn, since most of the profit comes from high-value customers.
Customers who are likely to churn will start decreasing their recharge amount and their use of other facilities. To identify high-value customers, total_rech_data can be calculated and the dataset filtered to the customers whose value is greater than the 70th percentile of the data.
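The 70th-percentile filter described above can be sketched as follows. This is only an illustration on made-up numbers: the function name `filter_high_value`, the choice of months 6 and 7, and the formula (call recharge amount plus `total_rech_data * av_rech_amt_data`, averaged per month) are assumptions, not necessarily what the notebook computes.

```python
import pandas as pd

# Toy frame standing in for the real training data (values are made up).
df = pd.DataFrame({
    "total_rech_amt_6":   [100, 500, 0, 900, 250],
    "total_rech_amt_7":   [120, 450, 50, 800, 300],
    "total_rech_data_6":  [1, 3, None, 5, 2],
    "total_rech_data_7":  [0, 2, None, 6, 2],
    "av_rech_amt_data_6": [25, 150, None, 200, 100],
    "av_rech_amt_data_7": [0, 150, None, 250, 100],
})

def filter_high_value(frame, months=(6, 7), quantile=0.70):
    """Hypothetical helper: keep rows whose average monthly recharge
    (calls + data) is above the given percentile."""
    total = 0
    for m in months:
        # Assumed formula: data recharge amount = count * average amount;
        # a missing value is read as "no data recharge".
        data_amt = (frame[f"total_rech_data_{m}"].fillna(0)
                    * frame[f"av_rech_amt_data_{m}"].fillna(0))
        total = total + frame[f"total_rech_amt_{m}"] + data_amt
    avg = total / len(months)
    cutoff = avg.quantile(quantile)
    return frame[avg > cutoff], cutoff

high_value, cutoff = filter_high_value(df)
print(cutoff)
print(high_value.index.tolist())
```

Computing the cutoff on the training data only and reusing it for the test data would keep the two sets comparable.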

 ## Steps:
 
0. [EDA](#EDA)<br>
<ul>
    <li>Load library</li>
    <li>Data Load</li>
    <li>Data Overview</li>
    <li>Metadata Information</li>
</ul>
1. [Data_Cleaning_and_Missing_Data_Analysis](#Data_Cleaning_and_Missing_Data_Analysis)<br>
2. [Outlier Analysis & Treatment Assumption values > Q3+1.5IQR and values < Q1-1.5IQR will be treated](#Outlier_Analysis_and_Treatment_Assumption_values)<br>
3. [Transforming_Categorical_Columns](#Transforming_Categorical_Columns)<br>
<ul>
    <li>Filter High-Value Customers</li>
    <ul>
        <li>calculate total data recharge amount</li>
    </ul>
    <li>Display the correlation matrix again to analyze correlation coefficient between features</li>
</ul>
4. [Univariate_Analysis](#Univariate_Analysis)<br>
5. [Bivariate_Analysis](#Bivariate_Analysis)<br>
6. [Multivariate_Analysis](#Multivariate_Analysis)<br>
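Step 2 above flags values above Q3 + 1.5·IQR or below Q1 − 1.5·IQR. A minimal sketch of one possible treatment, capping such values at the fences, is shown below. Whether the notebook caps or removes outliers is not stated here, so capping and the helper name `cap_outliers_iqr` are assumptions.

```python
import pandas as pd

def cap_outliers_iqr(series, k=1.5):
    """Hypothetical helper: clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up usage values with one extreme point.
values = pd.Series([10, 12, 11, 13, 12, 11, 500])
capped = cap_outliers_iqr(values)
print(capped.tolist())
```

Capping keeps every row, which matters when the test set must retain all customers for submission.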


> Model Preparation

- Training and Test data split
- Feature Scaling - StandardScaler
- Strategy steps
- Handle Imbalance dataset using SMOTE
- PCA - Dimensionality Reduction
- Case1 : 
- - Split train data into train and test split
- - Created below models using Hyper Parameter Tuning
- - - LOGISTICREGRESSION
- - - RANDOMFOREST
- - - ADABOOST
- - - XGBOOST
- - - Made predictions by using combination of Random Forest + Adaboost + XGBOOST
- Case2 : 
- - Use the entire train dataset for model building using K-fold cross-validation
- - Created below models on entire train set
- - - RANDOMFOREST
- - - ADABOOST
- - - XGBOOST
- - - Made predictions by using combination of Random Forest + Adaboost + XGBOOST
- Model Evaluation & Assessment
- Prediction
- - - Made predictions on combination of case1 and case2 
- - - Important Features
- Conclusion & Analysis
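The scaling, PCA and cross-validation steps listed above can be chained in a single scikit-learn `Pipeline`, so that the scaler and PCA are refit inside every fold. The sketch below runs on synthetic data. It uses `class_weight="balanced"` as a stand-in for SMOTE, and the 95% explained-variance target and 5 folds are illustrative assumptions rather than the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (about 10% positives).
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           weights=[0.9, 0.1], random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),        # feature scaling
    ("pca", PCA(n_components=0.95)),    # keep components explaining 95% of variance
    # Assumption: class weighting replaces SMOTE in this sketch.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Stratified folds keep the churn ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(scores.mean())
```

Swapping the final step for a random forest, AdaBoost or XGBoost estimator leaves the rest of the pipeline unchanged.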

<hr>
<hr>

<h1><a id='EDA'>EDA</a><br></h1>
<ul>
    <li>Load library</li>
    <li>Data Load</li>
    <li>Data Overview</li>
    <li>Metadata Information</li>
</ul>


# Load Library

In [1]:
!pip uninstall -y scikit-learn
!pip install scikit-learn

Found existing installation: scikit-learn 1.6.1
Uninstalling scikit-learn-1.6.1:
  Successfully uninstalled scikit-learn-1.6.1
Collecting scikit-learn
  Using cached scikit_learn-1.6.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Using cached scikit_learn-1.6.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
Installing collected packages: scikit-learn
Successfully installed scikit-learn-1.6.1

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [2]:
import warnings

import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from IPython.display import display, HTML
from xgboost import XGBClassifier

%matplotlib inline
warnings.filterwarnings("ignore")

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,GridSearchCV,KFold,StratifiedKFold
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score,roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.decomposition import IncrementalPCA
from sklearn.ensemble import RandomForestClassifier

from imblearn.combine import SMOTETomek
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.metrics import sensitivity_specificity_support

# Data Load

In [3]:
data = pd.read_csv("datasets/train.csv")
test = pd.read_csv("datasets/test.csv")

In [4]:
pd.set_option('display.max_columns',500)

In [5]:
pd.set_option('display.max_rows',500)

In [6]:
display(
    data.head(10)
)

Unnamed: 0,id,circle_id,loc_og_t2o_mou,std_og_t2o_mou,loc_ic_t2o_mou,last_date_of_month_6,last_date_of_month_7,last_date_of_month_8,arpu_6,arpu_7,arpu_8,onnet_mou_6,onnet_mou_7,onnet_mou_8,offnet_mou_6,offnet_mou_7,offnet_mou_8,roam_ic_mou_6,roam_ic_mou_7,roam_ic_mou_8,roam_og_mou_6,roam_og_mou_7,roam_og_mou_8,loc_og_t2t_mou_6,loc_og_t2t_mou_7,loc_og_t2t_mou_8,loc_og_t2m_mou_6,loc_og_t2m_mou_7,loc_og_t2m_mou_8,loc_og_t2f_mou_6,loc_og_t2f_mou_7,loc_og_t2f_mou_8,loc_og_t2c_mou_6,loc_og_t2c_mou_7,loc_og_t2c_mou_8,loc_og_mou_6,loc_og_mou_7,loc_og_mou_8,std_og_t2t_mou_6,std_og_t2t_mou_7,std_og_t2t_mou_8,std_og_t2m_mou_6,std_og_t2m_mou_7,std_og_t2m_mou_8,std_og_t2f_mou_6,std_og_t2f_mou_7,std_og_t2f_mou_8,std_og_t2c_mou_6,std_og_t2c_mou_7,std_og_t2c_mou_8,std_og_mou_6,std_og_mou_7,std_og_mou_8,isd_og_mou_6,isd_og_mou_7,isd_og_mou_8,spl_og_mou_6,spl_og_mou_7,spl_og_mou_8,og_others_6,og_others_7,og_others_8,total_og_mou_6,total_og_mou_7,total_og_mou_8,loc_ic_t2t_mou_6,loc_ic_t2t_mou_7,loc_ic_t2t_mou_8,loc_ic_t2m_mou_6,loc_ic_t2m_mou_7,loc_ic_t2m_mou_8,loc_ic_t2f_mou_6,loc_ic_t2f_mou_7,loc_ic_t2f_mou_8,loc_ic_mou_6,loc_ic_mou_7,loc_ic_mou_8,std_ic_t2t_mou_6,std_ic_t2t_mou_7,std_ic_t2t_mou_8,std_ic_t2m_mou_6,std_ic_t2m_mou_7,std_ic_t2m_mou_8,std_ic_t2f_mou_6,std_ic_t2f_mou_7,std_ic_t2f_mou_8,std_ic_t2o_mou_6,std_ic_t2o_mou_7,std_ic_t2o_mou_8,std_ic_mou_6,std_ic_mou_7,std_ic_mou_8,total_ic_mou_6,total_ic_mou_7,total_ic_mou_8,spl_ic_mou_6,spl_ic_mou_7,spl_ic_mou_8,isd_ic_mou_6,isd_ic_mou_7,isd_ic_mou_8,ic_others_6,ic_others_7,ic_others_8,total_rech_num_6,total_rech_num_7,total_rech_num_8,total_rech_amt_6,total_rech_amt_7,total_rech_amt_8,max_rech_amt_6,max_rech_amt_7,max_rech_amt_8,date_of_last_rech_6,date_of_last_rech_7,date_of_last_rech_8,last_day_rch_amt_6,last_day_rch_amt_7,last_day_rch_amt_8,date_of_last_rech_data_6,date_of_last_rech_data_7,date_of_last_rech_data_8,total_rech_data_6,total_rech_data_7,total_rech_data_8,max_rech_data_6,max_rech_data_7,max_rech_data_8,count_rech_2g_6,count_rech_2g_7,count_rech_2g_8,count_rech_3g_6,count_rech_3g_7,count_rech_3g_8,av_rech_amt_data_6,av_rech_amt_data_7,av_rech_amt_data_8,vol_2g_mb_6,vol_2g_mb_7,vol_2g_mb_8,vol_3g_mb_6,vol_3g_mb_7,vol_3g_mb_8,arpu_3g_6,arpu_3g_7,arpu_3g_8,arpu_2g_6,arpu_2g_7,arpu_2g_8,night_pck_user_6,night_pck_user_7,night_pck_user_8,monthly_2g_6,monthly_2g_7,monthly_2g_8,sachet_2g_6,sachet_2g_7,sachet_2g_8,monthly_3g_6,monthly_3g_7,monthly_3g_8,sachet_3g_6,sachet_3g_7,sachet_3g_8,fb_user_6,fb_user_7,fb_user_8,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g,churn_probability
0,0,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,31.277,87.009,7.527,48.58,124.38,1.29,32.24,96.68,2.33,0.0,0.0,0.0,0.0,0.0,0.0,2.23,0.0,0.28,5.29,16.04,2.33,0.0,0.0,0.0,0.0,0.0,0.0,7.53,16.04,2.61,46.34,124.38,1.01,18.75,80.61,0.0,0.0,0.0,0.0,0.0,0.0,0.0,65.09,204.99,1.01,0.0,0.0,0.0,8.2,0.63,0.0,0.38,0.0,0.0,81.21,221.68,3.63,2.43,3.68,7.79,0.83,21.08,16.91,0.0,0.0,0.0,3.26,24.76,24.71,0.0,7.61,0.21,7.46,19.96,14.96,0.0,0.0,0.0,0.0,0.0,0.0,7.46,27.58,15.18,11.84,53.04,40.56,0.0,0.0,0.66,0.0,0.0,0.0,1.11,0.69,0.0,3,2,2,77,65,10,65,65,10,6/22/2014,7/10/2014,8/24/2014,65,65,0,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,,,,1958,0.0,0.0,0.0,0
1,1,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,0.0,122.787,42.953,0.0,0.0,0.0,0.0,25.99,30.89,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.01,29.79,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,30.73,31.66,0.0,0.0,0.0,0.0,30.73,31.66,1.68,19.09,10.53,1.41,18.68,11.09,0.35,1.66,3.4,3.44,39.44,25.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.44,39.44,25.04,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,3,4,5,0,145,50,0,145,50,6/12/2014,7/10/2014,8/26/2014,0,0,0,,7/8/2014,,,1.0,,,145.0,,,0.0,,,1.0,,,145.0,,0.0,352.91,0.0,0.0,3.96,0.0,,122.07,,,122.08,,,0.0,,0,0,0,0,0,0,0,1,0,0,0,0,,1.0,,710,0.0,0.0,0.0,0
2,2,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,60.806,103.176,0.0,0.53,15.93,0.0,53.99,82.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.53,12.98,0.0,24.11,0.0,0.0,0.0,0.0,0.0,2.14,0.0,0.0,24.64,12.98,0.0,0.0,2.94,0.0,28.94,82.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28.94,84.99,0.0,0.0,0.0,0.0,2.89,1.38,0.0,0.0,0.0,0.0,56.49,99.36,0.0,4.51,6.16,6.49,89.86,25.18,23.51,0.0,0.0,0.0,94.38,31.34,30.01,11.69,0.0,0.0,18.21,2.48,6.38,0.0,0.0,0.0,0.0,0.0,0.0,29.91,2.48,6.38,124.29,33.83,36.64,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,2,4,2,70,120,0,70,70,0,6/11/2014,7/22/2014,8/24/2014,70,50,0,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,,,,882,0.0,0.0,0.0,0
3,3,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,156.362,205.26,111.095,7.26,16.01,0.0,68.76,78.48,50.23,0.0,0.0,0.0,0.0,0.0,1.63,6.99,3.94,0.0,37.91,44.89,23.63,0.0,0.0,0.0,0.0,0.0,8.03,44.91,48.84,23.63,0.26,12.06,0.0,15.33,25.93,4.6,0.56,0.0,0.0,0.0,0.0,0.0,16.16,37.99,4.6,0.0,0.0,0.0,14.95,9.13,25.61,0.0,0.0,0.0,76.03,95.98,53.84,24.98,4.84,23.88,53.99,44.23,57.14,7.23,0.81,0.0,86.21,49.89,81.03,0.0,0.0,0.0,8.89,0.28,2.81,0.0,0.0,0.0,0.0,0.0,0.0,8.89,0.28,2.81,95.11,50.18,83.84,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,4,3,160,240,130,110,110,50,6/15/2014,7/21/2014,8/25/2014,110,110,50,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,,,,982,0.0,0.0,0.0,0
4,4,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,240.708,128.191,101.565,21.28,4.83,6.13,56.99,38.11,9.63,53.64,0.0,0.0,15.73,0.0,0.0,10.16,4.83,6.13,36.74,19.88,4.61,11.99,1.23,5.01,0.0,9.85,0.0,58.91,25.94,15.76,0.0,0.0,0.0,4.35,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.35,0.0,0.0,0.0,0.0,0.0,0.0,17.0,0.0,0.0,0.0,0.0,63.26,42.94,15.76,5.44,1.39,2.66,10.58,4.33,19.49,5.51,3.63,6.14,21.54,9.36,28.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21.54,9.36,28.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,13,10,8,290,136,122,50,41,30,6/25/2014,7/26/2014,8/30/2014,25,10,30,6/25/2014,7/23/2014,8/20/2014,7.0,7.0,6.0,25.0,41.0,25.0,7.0,6.0,6.0,0.0,1.0,0.0,175.0,191.0,142.0,390.8,308.89,213.47,0.0,0.0,0.0,0.0,35.0,0.0,0.0,35.12,0.0,0.0,0.0,0.0,0,0,0,7,6,6,0,0,0,0,1,0,1.0,1.0,1.0,647,0.0,0.0,0.0,0
5,5,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,86.193,83.351,88.462,1.39,1.78,0.45,17.14,24.26,3.44,0.0,0.0,0.0,0.0,0.0,0.0,1.39,1.78,0.45,3.56,20.63,3.44,0.0,0.0,0.0,11.5,15.28,0.0,4.96,22.41,3.89,0.0,0.0,0.0,0.0,0.65,0.0,2.08,0.0,0.0,0.0,0.0,0.0,2.08,0.65,0.0,0.0,0.0,0.0,11.5,18.69,0.0,0.0,0.0,0.0,18.54,41.76,3.89,17.51,37.94,34.18,221.79,296.31,125.66,27.96,38.13,41.08,267.28,372.39,200.93,0.0,0.0,0.0,3.83,4.46,0.0,0.0,6.44,8.13,0.0,0.0,0.0,3.83,10.91,8.13,271.29,383.51,209.43,0.0,0.0,0.11,0.0,0.0,0.0,0.18,0.19,0.24,9,8,10,100,90,100,30,30,30,6/30/2014,7/27/2014,8/28/2014,30,20,30,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,,,,698,0.0,0.0,0.0,0
6,6,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,54.173,95.13,6.386,38.81,40.56,19.61,31.63,54.18,5.69,0.0,0.0,0.0,0.0,0.0,0.0,38.81,24.51,19.61,31.63,54.18,5.58,0.0,0.0,0.0,0.0,0.0,0.0,70.44,78.69,25.19,0.0,16.05,0.0,0.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16.05,0.11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,70.44,94.74,25.31,94.09,54.59,86.83,7.38,0.81,31.53,0.16,3.36,3.46,101.64,58.78,121.83,3.58,4.34,2.95,0.91,1.16,9.78,0.0,0.18,0.0,0.0,0.0,0.0,4.49,5.69,12.73,106.14,64.48,134.56,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,2,3,130,0,130,110,0,130,6/29/2014,7/19/2014,8/26/2014,110,0,130,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,,,,1083,0.0,0.0,0.0,0
7,7,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,167.861,167.869,167.866,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.36,6.04,28.44,13.41,9.16,8.73,0.0,0.0,4.14,24.78,15.21,41.33,0.0,0.0,0.0,0.0,0.0,0.0,0.33,0.68,2.49,0.0,0.0,0.0,0.33,0.68,2.49,25.11,15.89,43.83,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,3,2,198,198,198,198,198,198,6/20/2014,7/22/2014,8/28/2014,198,198,0,6/20/2014,7/22/2014,8/20/2014,1.0,1.0,1.0,198.0,198.0,198.0,1.0,1.0,1.0,0.0,0.0,0.0,198.0,198.0,198.0,167.53,6.29,5.4,177.9,151.58,271.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,0,0,0,0,0,0,0,0,0,1.0,1.0,1.0,584,82.26,73.56,177.14,0
8,8,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,163.55,90.25,51.726,0.0,0.0,0.0,47.81,50.88,21.74,28.26,11.31,47.81,47.81,50.88,21.74,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,3,1,200,0,150,200,0,150,6/28/2014,7/30/2014,8/19/2014,200,0,150,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,,,,2455,0.0,0.0,0.0,1
9,9,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,101.984,259.265,65.686,27.49,111.78,53.38,110.51,124.04,85.54,0.0,7.23,0.0,0.0,32.28,0.0,27.49,94.74,53.38,103.44,104.88,77.99,2.41,3.43,0.0,0.0,0.0,0.0,133.36,203.06,131.38,0.0,0.0,0.0,4.64,0.0,7.55,0.0,0.0,0.0,0.0,0.0,0.0,4.64,0.0,7.55,0.0,0.0,0.0,0.0,0.48,0.0,0.0,0.0,0.0,138.01,203.54,138.93,60.88,259.66,60.48,250.21,243.13,191.64,13.94,24.46,15.11,325.04,527.26,267.24,0.0,0.0,0.0,0.0,6.64,13.64,3.51,0.0,0.0,0.0,0.0,0.0,3.51,6.64,13.64,328.56,533.91,280.89,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,3,2,128,160,279,128,110,149,6/27/2014,7/16/2014,8/28/2014,128,0,149,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,,,,2530,0.0,0.0,0.0,0


In [7]:
display(
    data.describe()
)

Unnamed: 0,id,circle_id,loc_og_t2o_mou,std_og_t2o_mou,loc_ic_t2o_mou,arpu_6,arpu_7,arpu_8,onnet_mou_6,onnet_mou_7,onnet_mou_8,offnet_mou_6,offnet_mou_7,offnet_mou_8,roam_ic_mou_6,roam_ic_mou_7,roam_ic_mou_8,roam_og_mou_6,roam_og_mou_7,roam_og_mou_8,loc_og_t2t_mou_6,loc_og_t2t_mou_7,loc_og_t2t_mou_8,loc_og_t2m_mou_6,loc_og_t2m_mou_7,loc_og_t2m_mou_8,loc_og_t2f_mou_6,loc_og_t2f_mou_7,loc_og_t2f_mou_8,loc_og_t2c_mou_6,loc_og_t2c_mou_7,loc_og_t2c_mou_8,loc_og_mou_6,loc_og_mou_7,loc_og_mou_8,std_og_t2t_mou_6,std_og_t2t_mou_7,std_og_t2t_mou_8,std_og_t2m_mou_6,std_og_t2m_mou_7,std_og_t2m_mou_8,std_og_t2f_mou_6,std_og_t2f_mou_7,std_og_t2f_mou_8,std_og_t2c_mou_6,std_og_t2c_mou_7,std_og_t2c_mou_8,std_og_mou_6,std_og_mou_7,std_og_mou_8,isd_og_mou_6,isd_og_mou_7,isd_og_mou_8,spl_og_mou_6,spl_og_mou_7,spl_og_mou_8,og_others_6,og_others_7,og_others_8,total_og_mou_6,total_og_mou_7,total_og_mou_8,loc_ic_t2t_mou_6,loc_ic_t2t_mou_7,loc_ic_t2t_mou_8,loc_ic_t2m_mou_6,loc_ic_t2m_mou_7,loc_ic_t2m_mou_8,loc_ic_t2f_mou_6,loc_ic_t2f_mou_7,loc_ic_t2f_mou_8,loc_ic_mou_6,loc_ic_mou_7,loc_ic_mou_8,std_ic_t2t_mou_6,std_ic_t2t_mou_7,std_ic_t2t_mou_8,std_ic_t2m_mou_6,std_ic_t2m_mou_7,std_ic_t2m_mou_8,std_ic_t2f_mou_6,std_ic_t2f_mou_7,std_ic_t2f_mou_8,std_ic_t2o_mou_6,std_ic_t2o_mou_7,std_ic_t2o_mou_8,std_ic_mou_6,std_ic_mou_7,std_ic_mou_8,total_ic_mou_6,total_ic_mou_7,total_ic_mou_8,spl_ic_mou_6,spl_ic_mou_7,spl_ic_mou_8,isd_ic_mou_6,isd_ic_mou_7,isd_ic_mou_8,ic_others_6,ic_others_7,ic_others_8,total_rech_num_6,total_rech_num_7,total_rech_num_8,total_rech_amt_6,total_rech_amt_7,total_rech_amt_8,max_rech_amt_6,max_rech_amt_7,max_rech_amt_8,last_day_rch_amt_6,last_day_rch_amt_7,last_day_rch_amt_8,total_rech_data_6,total_rech_data_7,total_rech_data_8,max_rech_data_6,max_rech_data_7,max_rech_data_8,count_rech_2g_6,count_rech_2g_7,count_rech_2g_8,count_rech_3g_6,count_rech_3g_7,count_rech_3g_8,av_rech_amt_data_6,av_rech_amt_data_7,av_rech_amt_data_8,vol_2g_mb_6,vol_2g_mb_7,vol_2g_mb_8,vol_3g_mb_6,vol_3g_mb_7,vol_3g_mb_8,arpu_3g_6,arpu_3g_7,arpu_3g_8,arpu_2g_6,arpu_2g_7,arpu_2g_8,night_pck_user_6,night_pck_user_7,night_pck_user_8,monthly_2g_6,monthly_2g_7,monthly_2g_8,sachet_2g_6,sachet_2g_7,sachet_2g_8,monthly_3g_6,monthly_3g_7,monthly_3g_8,sachet_3g_6,sachet_3g_7,sachet_3g_8,fb_user_6,fb_user_7,fb_user_8,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g,churn_probability
count,69999.0,69999.0,69297.0,69297.0,69297.0,69999.0,69999.0,69999.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,69999.0,69999.0,69999.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,69999.0,69999.0,69999.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,67231.0,67312.0,66296.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,17568.0,17865.0,18417.0,17568.0,17865.0,18417.0,17568.0,17865.0,18417.0,17568.0,17865.0,18417.0,17568.0,17865.0,18417.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,17568.0,17865.0,18417.0,17568.0,17865.0,18417.0,17568.0,17865.0,18417.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,69999.0,17568.0,17865.0,18417.0,69999.0,69999.0,69999.0,69999.0,69999.0
mean,34999.0,109.0,0.0,0.0,0.0,283.134365,278.185912,278.858826,133.153275,133.894438,132.978257,198.874771,197.153383,196.543577,9.765435,7.014568,7.004892,14.186457,9.842191,9.771783,46.904854,46.166503,45.686109,93.238231,90.79924,91.121447,3.743179,3.777031,3.661652,1.126025,1.361052,1.42084,143.893585,140.75012,140.476486,80.619382,83.775851,83.471486,88.15211,91.538615,90.586999,1.126377,1.084062,1.057739,0.0,0.0,0.0,169.900601,176.401217,175.118852,0.845763,0.8111,0.841648,3.958619,4.976783,5.045027,0.462581,0.024425,0.033059,306.451436,310.572674,304.513065,48.043255,47.882736,47.256388,107.152439,106.489856,108.154731,12.050672,12.563665,11.716763,167.255126,166.945103,167.136761,9.476958,9.873468,9.910217,20.734858,21.685359,21.089042,2.146273,2.199395,2.075179,0.0,0.0,0.0,32.360632,33.760809,33.07703,199.71064,201.878029,198.486034,0.061932,0.033371,0.040392,7.394167,8.171162,8.348424,0.854063,1.01968,0.963214,7.566522,7.706667,7.224932,328.139788,322.376363,323.846355,104.569265,104.137573,107.540351,63.426949,59.294218,62.489478,2.467612,2.679989,2.652441,126.5,126.402071,125.374925,1.865323,2.056311,2.016018,0.602288,0.623678,0.636423,192.831096,201.45594,196.815792,51.773924,51.240204,50.127506,122.171882,128.934444,135.486541,90.069931,89.115767,90.618564,86.8639,85.846074,86.348404,0.025273,0.024069,0.021013,0.079287,0.083401,0.08093,0.388863,0.441406,0.449492,0.075815,0.07773,0.081958,0.075344,0.081444,0.085487,0.916325,0.909544,0.890319,1220.639709,68.108597,65.93583,60.07674,0.101887
std,20207.115084,0.0,0.0,0.0,0.0,334.213918,344.366927,351.924315,299.963093,311.277193,311.896596,316.818355,322.482226,324.089234,57.374429,55.960985,53.408135,73.469261,58.511894,64.618388,150.971758,154.739002,153.71688,162.046699,153.852597,152.997805,13.319542,13.56811,13.009193,5.741811,7.914113,6.542202,252.034597,246.313148,245.342359,255.098355,266.693254,267.021929,255.771554,267.532089,270.032002,8.136645,8.325206,7.696853,0.0,0.0,0.0,392.0466,409.299501,410.697098,29.747486,29.220073,29.563367,15.854529,22.229842,17.708507,4.768437,1.71643,2.232547,465.502866,479.13177,477.936832,140.499757,147.761124,141.249368,168.455999,165.452459,166.223461,39.416076,43.495179,38.606895,252.576231,254.688718,249.28841,51.664472,56.137824,54.248186,80.294236,87.31451,81.534344,16.522232,16.171533,15.865403,0.0,0.0,0.0,104.381082,114.14223,108.469864,290.114823,296.771338,288.336731,0.164823,0.137322,0.148417,60.951165,63.604165,63.09757,12.149144,13.225373,11.697686,7.041452,7.050614,7.195597,404.211068,411.07012,426.181405,121.407701,120.782543,124.39675,97.954876,95.429492,101.996729,2.79461,3.073472,3.101265,109.352573,109.459266,109.648799,2.566377,2.799916,2.728246,1.279297,1.40123,1.457058,190.623115,198.346141,192.280532,212.513909,211.114667,213.101403,554.869965,554.096072,568.310234,193.600413,195.82699,189.907986,171.321203,178.06728,170.297094,0.156958,0.153269,0.143432,0.294719,0.304802,0.299254,1.494206,1.651012,1.63245,0.358905,0.383189,0.381821,0.573003,0.634547,0.680035,0.276907,0.286842,0.312501,952.426321,269.328659,267.899034,257.22681,0.302502
min,0.0,109.0,0.0,0.0,0.0,-2258.709,-1289.715,-945.808,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,-20.38,-26.04,-24.49,-35.83,-13.09,-55.83,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,180.0,0.0,0.0,0.0,0.0
25%,17499.5,109.0,0.0,0.0,0.0,93.581,86.714,84.095,7.41,6.675,6.41,34.86,32.24,31.575,0.0,0.0,0.0,0.0,0.0,0.0,1.66,1.65,1.61,9.92,10.09,9.83,0.0,0.0,0.0,0.0,0.0,0.0,17.235,17.59,17.2375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,44.78,42.91,38.71,3.03,3.26,3.28,17.39,18.61,18.94,0.0,0.0,0.0,30.63,32.71,32.81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,38.64,41.34,38.29,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,3.0,3.0,110.0,100.0,90.0,30.0,30.0,30.0,0.0,0.0,0.0,1.0,1.0,1.0,25.0,25.0,25.0,1.0,1.0,1.0,0.0,0.0,0.0,82.0,92.0,84.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,468.0,0.0,0.0,0.0,0.0
50%,34999.0,109.0,0.0,0.0,0.0,197.484,191.588,192.234,34.11,32.28,32.1,96.48,91.885,91.8,0.0,0.0,0.0,0.0,0.0,0.0,11.91,11.58,11.74,41.03,40.17,40.35,0.0,0.0,0.0,0.0,0.0,0.0,65.19,63.43,63.52,0.0,0.0,0.0,3.98,3.71,3.3,0.0,0.0,0.0,0.0,0.0,0.0,11.73,11.26,10.505,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,145.28,141.23,138.36,15.74,15.83,16.04,56.46,56.93,58.21,0.88,0.91,0.93,92.43,92.51,93.89,0.0,0.0,0.0,2.04,2.06,2.03,0.0,0.0,0.0,0.0,0.0,0.0,5.91,5.98,5.83,114.78,116.33,114.61,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,6.0,5.0,229.0,220.0,225.0,110.0,110.0,98.0,30.0,30.0,30.0,1.0,2.0,1.0,145.0,145.0,145.0,1.0,1.0,1.0,0.0,0.0,0.0,154.0,154.0,154.0,0.0,0.0,0.0,0.0,0.0,0.0,0.52,0.42,0.84,11.3,8.8,9.09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,868.0,0.0,0.0,0.0,0.0
75%,52498.5,109.0,0.0,0.0,0.0,370.791,365.3695,369.909,119.39,115.8375,115.06,232.99,227.63,229.345,0.0,0.0,0.0,0.0,0.0,0.0,40.74,39.76,39.895,110.43,107.54,109.245,2.06,2.08,2.03,0.0,0.0,0.0,167.88,163.9325,165.615,31.02,31.3,30.76,53.745,54.64,52.66,0.0,0.0,0.0,0.0,0.0,0.0,146.335,151.645,149.015,0.0,0.0,0.0,2.4,3.66,4.0025,0.0,0.0,0.0,374.305,380.045,370.895,46.98,45.69,46.28,132.02,131.01,134.38,8.14,8.23,8.09,208.325,205.53,208.06,4.06,4.18,4.0525,14.96,15.83,15.31,0.0,0.0,0.0,0.0,0.0,0.0,26.78,28.16,27.615,251.07,249.47,249.71,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,10.0,9.0,438.0,430.0,436.0,120.0,128.0,144.0,110.0,110.0,130.0,3.0,3.0,3.0,177.0,177.0,179.0,2.0,2.0,2.0,1.0,1.0,1.0,252.0,252.0,252.0,0.0,0.0,0.0,0.0,0.0,0.0,122.07,120.86,122.07,122.07,122.07,122.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1813.0,0.0,0.0,0.0,0.0
max,69998.0,109.0,0.0,0.0,0.0,27731.088,35145.834,33543.624,7376.71,8157.78,10752.56,8362.36,7043.98,14007.34,2850.98,4155.83,4169.81,3775.11,2812.04,5337.04,6431.33,7400.66,10752.56,4696.83,4557.14,4961.33,617.58,815.33,588.29,342.86,916.24,351.83,10643.38,7674.78,11039.91,7366.58,8133.66,8014.43,8314.76,6622.54,13950.04,628.56,465.79,354.16,0.0,0.0,0.0,8432.99,8155.53,13980.06,5900.66,5490.28,5681.54,1023.21,2372.51,1075.08,800.89,270.24,394.93,10674.03,8285.64,14043.06,5315.59,9324.66,10696.23,4450.74,4455.83,6274.19,1872.34,1983.01,1676.58,7454.63,9669.91,10830.16,3336.38,4708.71,3930.24,5647.16,6141.88,5512.76,1351.11,1136.08,1394.89,0.0,0.0,0.0,5712.11,6745.76,5658.74,7716.14,9699.01,10830.38,19.76,13.46,16.86,6789.41,5289.54,4127.01,1362.94,1495.94,1209.86,170.0,138.0,138.0,35190.0,40335.0,45320.0,4010.0,3299.0,4449.0,4010.0,3100.0,4449.0,61.0,54.0,60.0,1555.0,1555.0,1555.0,42.0,48.0,44.0,29.0,34.0,45.0,5920.0,4365.0,4076.0,10285.9,7873.55,11117.61,45735.4,28144.12,30036.06,5054.37,4980.9,3716.9,5054.35,4809.36,3483.17,1.0,1.0,1.0,4.0,5.0,5.0,42.0,48.0,44.0,9.0,16.0,16.0,29.0,33.0,41.0,1.0,1.0,1.0,4337.0,12916.22,9165.6,11166.21,1.0


# Data Overview

TypeError: DataFrame.info() got an unexpected keyword argument 'null_counts'
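The cell that raised this error is not shown. The message itself indicates that `DataFrame.info()` was called with `null_counts`, a keyword that pandas deprecated in favour of `show_counts` and removed in pandas 2.0. A minimal sketch of the equivalent call on a made-up frame, assuming the intent was to list non-null counts per column:

```python
import io
import pandas as pd

# Small made-up frame; in the notebook this would be `data`.
df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None]})

buf = io.StringIO()
# `show_counts=True` is the replacement for the removed `null_counts=True`.
df.info(verbose=True, show_counts=True, buf=buf)
summary = buf.getvalue()
print(summary)
```

The `buf` argument is used here only to capture the text; without it, `info()` prints directly to the cell output.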

# Metadata Information

In [None]:
# print() is used instead of display() so that "\n" renders as a line break
print("Rows     :", data.shape[0])
print("Columns  :", data.shape[1])
print("\nMissing values :", data.isnull().sum().values.sum())
print("\nUnique values :\n", data.nunique())
print("\nFeatures :\n", data.columns.tolist())

<hr>
<h1><a id='Data_Cleaning_and_Missing_Data_Analysis'>Data_Cleaning_and_Missing_Data_Analysis</a><br></h1>
<ul>
    <li>Data_Cleaning</li>
    <li>Missing Data Treatment Analysis Summary</li>
</ul>

In [None]:
## Dropping the id column since every value is unique
## Dropping circle_id since every row has the value 109

test_v1 = test.drop(columns=['id','circle_id'])
data_v1 = data.drop(columns=['id','circle_id'])

In [None]:
## Number of rows in which every column is null

display("Number of rows with null values in all columns = {}".format(data_v1.isnull().all(axis=1).sum()))

In [None]:
## Number of columns in which every row is null

display("Number of columns with null values in all rows = {}".format(data_v1.isnull().all(axis=0).sum()))

In [None]:
# Dropping columns that are null in every row, i.e. 100% null values

data_v1.dropna(axis=1, how='all',inplace=True)
test_v1.dropna(axis=1, how='all',inplace=True)

In [None]:
# Check Duplicate Rows - No Duplicate rows found

duplicate = data_v1[data_v1.duplicated()]

display(duplicate)

In [None]:
## Dropping the columns below since each of them holds a single value only
single_value_cols = ['loc_og_t2o_mou', 'std_og_t2o_mou', 'loc_ic_t2o_mou',
                     'std_og_t2c_mou_6', 'std_og_t2c_mou_7', 'std_og_t2c_mou_8',
                     'std_ic_t2o_mou_6', 'std_ic_t2o_mou_7', 'std_ic_t2o_mou_8']

test_v1 = test_v1.drop(columns=single_value_cols)

data_v1 = data_v1.drop(columns=single_value_cols)

In [None]:
display(data_v1.shape)

## We now have 69999 rows and 161 columns

In [None]:
# Percentage of missing values per column
pd.options.display.max_rows = None  # this option displays all rows
round(100 * data_v1.isnull().sum()/len(data_v1.index),2)

In [None]:
# Segregating categorical, continuous and date columns
# Object-typed columns are the date columns; everything else is numeric
dtcols = [col for col in data_v1.columns if data_v1[col].dtype == 'object']
con = [col for col in data_v1.columns if data_v1[col].dtype != 'object']

# Numeric flag/count columns that are treated as categorical
cat = ['night_pck_user_6' , 'night_pck_user_8' , 'night_pck_user_7' , 'monthly_2g_6' , 'monthly_2g_7'
       ,'monthly_2g_8','sachet_2g_6','sachet_2g_7','sachet_2g_8','monthly_3g_6','monthly_3g_7',
       'monthly_3g_8','sachet_3g_6','sachet_3g_7','sachet_3g_8','fb_user_6','fb_user_7', 'fb_user_8']

con = [column for column in con if column not in cat]

print("Numerical Columns:", len(con), "\nCategorical Columns:", len(cat), "\nDate Columns:", len(dtcols))

## Missing Data Treatment Analysis Summary ##

> ### Replacing null values in the categorical variables below with -1, considering that no recharge has been made
- - 'night_pck_user_6', 'night_pck_user_8', 'night_pck_user_7','monthly_2g_6', 'monthly_2g_7', 'monthly_2g_8', 'sachet_2g_6',
- - 'sachet_2g_7', 'sachet_2g_8', 'monthly_3g_6', 'monthly_3g_7','monthly_3g_8', 'sachet_3g_6', 'sachet_3g_7', 'sachet_3g_8',
- - 'fb_user_6', 'fb_user_7', 'fb_user_8'

> ### Dropping these columns since the percentage of missing values is more than 70%

- - 'arpu_3g_6', 'arpu_3g_7', 'arpu_3g_8', 'arpu_2g_6', 'arpu_2g_7','arpu_2g_8'

> ### Based on the analysis, imputing null values in the recharge_amount_columns list with 0, considering that no recharge has been made by the customer.
Also, the minimum value of the recharge amount columns is 1, so a missing value plausibly means that no recharge took place.

- - 'total_rech_data_6', 'total_rech_data_7', 'total_rech_data_8','max_rech_data_6', 'max_rech_data_7', 'max_rech_data_8','av_rech_amt_data_6', 'av_rech_amt_data_7', 'av_rech_amt_data_8'

> ### Imputing the count recharge columns with zero, considering that no recharge has been made by the customer

- - 'count_rech_2g_6', 'count_rech_2g_7', 'count_rech_2g_8','count_rech_3g_6', 'count_rech_3g_7', 'count_rech_3g_8'

> ### Dropping the date columns below since they will not be of much use; for the other date columns we will calculate an age
- - 'last_date_of_month_6','last_date_of_month_7', 'last_date_of_month_8'

> ### Replacing missing age data with -1, considering that no recharge has been made by the user

- - 'date_of_last_rech_6age', 'date_of_last_rech_7age', 'date_of_last_rech_8age'
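How the "age" columns are derived is not shown in this part of the notebook. One plausible reading, sketched below on made-up rows, is the number of days between a customer's last recharge and the end of that month, with -1 marking customers who have no recharge date. The month-end reference date is an assumption.

```python
import pandas as pd

# Made-up rows in the same m/d/yyyy format as the dataset.
df = pd.DataFrame({
    "date_of_last_rech_6": ["6/22/2014", "6/12/2014", None],
})

# Assumption: age is measured against the last day of the month.
month_end = pd.Timestamp("2014-06-30")
last_rech = pd.to_datetime(df["date_of_last_rech_6"], format="%m/%d/%Y")

# Missing dates become NaT, then NaN days, then the -1 sentinel.
df["date_of_last_rech_6age"] = (month_end - last_rech).dt.days.fillna(-1).astype(int)
print(df)
```

Because real ages are never negative, the -1 sentinel stays distinguishable from a recharge made on the last day of the month (age 0).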

> ### For the other continuous columns, missing values have been replaced with the median since the columns contain outliers
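The imputation rules summarised above can be sketched as one helper applied to both frames. The column lists are shortened to one column each and the values are made up. Taking the medians from the training frame and reusing them for the test frame is an assumption, chosen so that no test-set statistics leak into preprocessing.

```python
import pandas as pd

train = pd.DataFrame({
    "fb_user_8":         [1.0, None, 0.0, None],
    "total_rech_data_8": [2.0, None, 1.0, None],
    "onnet_mou_8":       [10.0, None, 30.0, 1000.0],
})
test = pd.DataFrame({
    "fb_user_8":         [None, 1.0],
    "total_rech_data_8": [None, 3.0],
    "onnet_mou_8":       [None, 5.0],
})

flag_cols = ["fb_user_8"]              # categorical flags -> -1
recharge_cols = ["total_rech_data_8"]  # recharge amounts/counts -> 0
continuous_cols = ["onnet_mou_8"]      # other continuous columns -> median

# The median is robust to outliers such as the 1000.0 above.
medians = train[continuous_cols].median()

def impute(frame):
    """Hypothetical helper applying the three rules to a copy of `frame`."""
    out = frame.copy()
    out[flag_cols] = out[flag_cols].fillna(-1)
    out[recharge_cols] = out[recharge_cols].fillna(0)
    out[continuous_cols] = out[continuous_cols].fillna(medians)
    return out

train_imp, test_imp = impute(train), impute(test)
print(test_imp)
```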

In [None]:
# Percentage of missing values per column
pd.options.display.max_rows = None  # this option displays all rows
round(100 * data_v1.isnull().sum()/len(data_v1.index),2)

In [None]:
# Replacing null values in the categorical variables below with -1, considering no recharge has been made

flag_cols = ['night_pck_user_6', 'night_pck_user_7', 'night_pck_user_8',
             'fb_user_6', 'fb_user_7', 'fb_user_8']

data_v1[flag_cols] = data_v1[flag_cols].fillna(-1)

# Fill test_v1 from its own columns; assigning data_v1 columns here would overwrite the test values
test_v1[flag_cols] = test_v1[flag_cols].fillna(-1)

In [None]:
# Analysing average revenue per user (ARPU) data
# Dropping these columns because more than 70% of their values are missing
average_revenue_per_user = data.columns[data.columns.str.contains('arpu_3g|arpu_2g')]

pd.options.display.max_rows = None  # display all rows
display("Before",round(100 * data_v1[average_revenue_per_user].isnull().sum()/len(data_v1.index),2))

# `columns=` already names the axis, so no separate `axis` argument is needed
data_v1 = data_v1.drop(columns=average_revenue_per_user)

test_v1 = test_v1.drop(columns=average_revenue_per_user)


In [None]:
# Handling Missing Values Recharge Columns : 

recharge_amount_columns = data.columns[data.columns.str.contains('av_rech|total_rech_data|max_rech_data|amt_data')]

recharge_amount_columns

# List of recharge columns contain below information and percentage of missing values

# total_rech_data_6           74.90
# total_rech_data_7           74.48
# total_rech_data_8           73.69

# max_rech_data_6             74.90
# max_rech_data_7             74.48
# max_rech_data_8             73.69

# av_rech_amt_data_6          74.90
# av_rech_amt_data_7          74.48
# av_rech_amt_data_8          73.69

display(
    data_v1[recharge_amount_columns].describe()
)

In [None]:
# check - 'total_rech_data_6/7/8' and 'date_of_last_rech_data_6/7/8' both have null values at the same index. 

total_rech_data_6_index = data_v1['total_rech_data_6'].isnull()
date_of_last_rech_data_6_index = data_v1['date_of_last_rech_data_6'].isnull()

if total_rech_data_6_index.equals(date_of_last_rech_data_6_index):
    display('The indexes for NULL values for month 6 are equal')
    
total_rech_data_7_index = data_v1['total_rech_data_7'].isnull()
date_of_last_rech_data_7_index = data_v1['date_of_last_rech_data_7'].isnull()

if total_rech_data_7_index.equals(date_of_last_rech_data_7_index):
    display('The indexes for NULL values for month 7 are equal')
    
total_rech_data_8_index = data_v1['total_rech_data_8'].isnull()
date_of_last_rech_data_8_index = data_v1['date_of_last_rech_data_8'].isnull()

if total_rech_data_8_index.equals(date_of_last_rech_data_8_index):
    display('The indexes for NULL values for month 8 are equal')

- Based on the analysis, null values in the recharge_amount_columns list are imputed with 0, on the assumption that the customer made no recharge.
- The minimum observed value in these columns is 1, so a 0 cannot be confused with a real recharge.
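
The month-by-month null check above can also be written as a loop. The sketch below uses a toy frame with invented values, and `nulls_coincide` is a hypothetical helper name. It zero-fills a month only when the recharge count and the recharge date are missing on exactly the same rows.

```python
import numpy as np
import pandas as pd

# Toy frame: values are invented, column names follow the dataset's pattern.
df = pd.DataFrame({
    "total_rech_data_6": [2.0, np.nan, 1.0],
    "date_of_last_rech_data_6": ["6/21/2014", None, "6/5/2014"],
    "total_rech_data_7": [np.nan, np.nan, 4.0],
    "date_of_last_rech_data_7": [None, None, "7/9/2014"],
})

def nulls_coincide(frame, month):
    """True when recharge count and recharge date are missing on the same rows."""
    count_null = frame[f"total_rech_data_{month}"].isnull()
    date_null = frame[f"date_of_last_rech_data_{month}"].isnull()
    return bool((count_null == date_null).all())

aligned = {month: nulls_coincide(df, month) for month in (6, 7)}

# Zero-fill only the months where the two null masks coincide.
for month, ok in aligned.items():
    if ok:
        col = f"total_rech_data_{month}"
        df[col] = df[col].fillna(0)
```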

In [None]:
# Replacing the values with 0 
for column in recharge_amount_columns:
    data_v1[column] = data_v1[column].fillna(0)
    test_v1[column] = test_v1[column].fillna(0)

# Check % of missing values after the imputation
round(100 * data_v1[recharge_amount_columns].isnull().sum()/len(data_v1.index),2)

In [None]:
### Imputing Count recharge columns with zero considering no recharge has been made by the customer

In [None]:
# Replacing the values with 0

count_rech = data_v1.columns[data_v1.columns.str.contains('count_rech')]


for column in count_rech:
    data_v1[column] = data_v1[column].fillna(0)
    test_v1[column] = test_v1[column].fillna(0)

# Check % of missing values after the imputation
round(100 * data_v1[count_rech].isnull().sum()/len(data_v1.index),2)

In [None]:
# Dropping the date columns below since they are of little use; the other date columns are converted to an age in days
# 'last_date_of_month_6',
#  'last_date_of_month_7',
#  'last_date_of_month_8'

data_v1 = data_v1.drop(columns=['last_date_of_month_6','last_date_of_month_7','last_date_of_month_8'],axis=1)
test_v1 = test_v1.drop(columns=['last_date_of_month_6','last_date_of_month_7','last_date_of_month_8'],axis=1)

In [None]:
dtcols.remove('last_date_of_month_6')
dtcols.remove('last_date_of_month_7')
dtcols.remove('last_date_of_month_8')

In [None]:
test_v1[dtcols].head()

In [None]:
# Handling date columns 
# Replacing Missing age data with -1 considering no recharge has been made by the user

for column in dtcols:
    colstr = column + 'age'
    display(colstr)
    data_v1[column] = pd.to_datetime(data_v1[column])
    data_v1[colstr] = data_v1[column].max() - data_v1[column]
    data_v1[colstr] = data_v1[colstr].dt.days
    data_v1 = data_v1.drop(columns=column,axis=1)
    data_v1[colstr] = data_v1[colstr].fillna(-1)
    

for column in dtcols:
    colstr = column + 'age'
    display(colstr)
    test_v1[column] = pd.to_datetime(test_v1[column])
    test_v1[colstr] = test_v1[column].max() - test_v1[column]
    test_v1[colstr] = test_v1[colstr].dt.days
    test_v1 = test_v1.drop(columns=column,axis=1)
    test_v1[colstr] = test_v1[colstr].fillna(-1)
    

In [None]:
#Segregating cat , con and date columns
cat = []
con = []
for i in data_v1.columns:
    if data_v1[i].dtype == 'object':
        cat.append(i)
    else:
        con.append(i)

dtcols = cat

cat = ['night_pck_user_6' , 'night_pck_user_8' , 'night_pck_user_7' , 'monthly_2g_6' , 'monthly_2g_7'
       ,'monthly_2g_8','sachet_2g_6','sachet_2g_7','sachet_2g_8','monthly_3g_6','monthly_3g_7',
       'monthly_3g_8','sachet_3g_6','sachet_3g_7','sachet_3g_8','fb_user_6','fb_user_7', 'fb_user_8']

con = [column for column in con if column not in cat]

display("Numerical Columns" ,len(con),"\nCategorical Columns",len(cat),"\nDate Columns" ,len(dtcols))

In [None]:
data_v1[con].describe()

In [None]:
# Missing Value data treatment onnet_mou
onnet_mou = data_v1.columns[data_v1.columns.str.contains('onnet_mou')]

display(data_v1[onnet_mou].describe())

#Replacing the values with median since we have outliers here

for column in onnet_mou:
    data_v1[column] = data_v1[column].fillna(data_v1[column].median())
    test_v1[column] = test_v1[column].fillna(data_v1[column].median())


In [None]:
# Missing Value data treatment offnet_mou
offnet_mou = data_v1.columns[data_v1.columns.str.contains('offnet_mou')]

#Replacing the values with median since we have outliers here

for column in offnet_mou:
    data_v1[column] = data_v1[column].fillna(data_v1[column].median())
    test_v1[column] = test_v1[column].fillna(data_v1[column].median())



In [None]:
# Missing Value data treatment roaming
roam_col = data_v1.columns[data_v1.columns.str.contains('roam')]

#Replacing the values with median since we have outliers here

for column in roam_col:
    data_v1[column] = data_v1[column].fillna(data_v1[column].median())
    test_v1[column] = test_v1[column].fillna(data_v1[column].median())


In [None]:
# Missing Value data treatment local calls
local_calls = data_v1.columns[data_v1.columns.str.contains('loc')]

#Replacing the values with median since we have outliers here

for column in local_calls:
    data_v1[column] = data_v1[column].fillna(data_v1[column].median())
    test_v1[column] = test_v1[column].fillna(data_v1[column].median())

In [None]:
# Missing Value data treatment std
std_calls = data_v1.columns[data_v1.columns.str.contains('std')]

#Replacing the values with median since we have outliers here

for column in std_calls:
    data_v1[column] = data_v1[column].fillna(data_v1[column].median())
    test_v1[column] = test_v1[column].fillna(data_v1[column].median())

In [None]:
# Missing Value data treatment isd
isd_calls = data_v1.columns[data_v1.columns.str.contains('isd')]

#Replacing the values with median since we have outliers here

for column in isd_calls:
    data_v1[column] = data_v1[column].fillna(data_v1[column].median())
    test_v1[column] = test_v1[column].fillna(data_v1[column].median())

In [None]:
# Missing Value data treatment spl
spl_list = data_v1.columns[data_v1.columns.str.contains('spl')]

#Replacing the values with median since we have outliers here

for column in spl_list:
    data_v1[column] = data_v1[column].fillna(data_v1[column].median())
    test_v1[column] = test_v1[column].fillna(data_v1[column].median())

In [None]:
# Missing Value data treatment other
other_list = data_v1.columns[data_v1.columns.str.contains('other')]

#Replacing the values with median since we have outliers here

for column in other_list:
    data_v1[column] = data_v1[column].fillna(data_v1[column].median())
    test_v1[column] = test_v1[column].fillna(data_v1[column].median())

In [None]:
display(data_v1.isnull().sum())

In [None]:

display ("Rows     : " ,data_v1.shape[0])
display ("Columns  : " ,data_v1.shape[1])
display ("\nFeatures : \n" ,data_v1.columns.tolist())
display ("\nMissing values :  ", data_v1.isnull().sum().values.sum())
display ("\nUnique values :  \n",data_v1.nunique())

In [None]:
data_v1[con].describe()

In [None]:
display ("Rows     : " ,test_v1.shape[0])
display ("Columns  : " ,test_v1.shape[1])
display ("\nFeatures : \n" ,test_v1.columns.tolist())
display ("\nMissing values :  ", test_v1.isnull().sum().values.sum())
display ("\nUnique values :  \n",test_v1.nunique())

<hr>
<h1><a id='Outlier_Analysis_and_Treatment_Assumption_values'>Data Cleaning - Outlier Analysis</a><br></h1>
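
The capping used in this section can be sketched as follows. It assumes fences built from the 5th and 95th percentiles and widened by 1.5 times their spread. `cap_outliers` is a hypothetical helper and the series are synthetic. The fences are learned from the training column and applied unchanged to the second series.

```python
import pandas as pd

def cap_outliers(train_col, other_col, low_q=0.05, high_q=0.95, k=1.5):
    """Cap both series at fences derived from the training column only."""
    low, high = train_col.quantile(low_q), train_col.quantile(high_q)
    spread = high - low
    lower_bound, upper_bound = low - k * spread, high + k * spread
    return (train_col.clip(lower=lower_bound, upper=upper_bound),
            other_col.clip(lower=lower_bound, upper=upper_bound),
            (lower_bound, upper_bound))

# Synthetic column: 1..100 plus one extreme value.
train_col = pd.Series([float(v) for v in range(1, 101)] + [10_000.0])
test_col = pd.Series([50.0, -9_999.0, 20_000.0])

train_capped, test_capped, bounds = cap_outliers(train_col, test_col)
```

`Series.clip` returns a new series, so the result has to be assigned back to the frame. This avoids the chained-indexing pattern `df[col][mask] = value`, which may not write through to the original frame.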

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
col = ['churn_probability','date_of_last_rech_data_6age', 'date_of_last_rech_data_7age', 'date_of_last_rech_8age', 'date_of_last_rech_6age', 'date_of_last_rech_data_8age', 'date_of_last_rech_7age']

for i in col:
    con.remove(i)

In [None]:
# Outlier Analysis

outliers = []
out_cols = con
out_summary = []

for i in out_cols:
    Q3 = data_v1[i].quantile(.95)
    Q1 = data_v1[i].quantile(.05)
    IQR = Q3-Q1
    lower_bound = Q1-1.5*IQR
    upper_bound = Q3+1.5*IQR
    
    if ((data_v1[i].min() < lower_bound) or (data_v1[i].max() > upper_bound)):
        out_summary.append("attribute \"{}\" with min value : {} -> max value : {} -> IQR {} -> lower_bound : {} match is {} -> upper_bound : {} match is {}".format(i,data_v1[i].min(),data_v1[i].max(),IQR,Q1-1.5*IQR,(data_v1[i].min() < lower_bound),Q3+1.5*IQR,data_v1[i].max() > upper_bound))
        outliers.append(i)


# List of outliers satisfying lower or upper bound        
for i in range(0,len(out_summary)):
    display("\nOutlier column with stats : \n\n{}\n".format(out_summary[i]))
    

# Outlier treatment: cap values at the fences computed from data_v1
# Note: Q1/Q3 here are the 5th/95th percentiles, so "IQR" is the 5-95 percentile range

for i in outliers:
    Q3 = data_v1[i].quantile(.95)
    Q1 = data_v1[i].quantile(.05)
    IQR = Q3-Q1
    lower_bound = Q1-1.5*IQR
    upper_bound = Q3+1.5*IQR
    # clip replaces the chained assignment data_v1[i][mask] = value, which may not update the frame
    data_v1[i] = data_v1[i].clip(lower=lower_bound, upper=upper_bound)
    test_v1[i] = test_v1[i].clip(lower=lower_bound, upper=upper_bound)

display(outliers)

display("After outliers treatment\n\n",data_v1[out_cols].describe())

<h1><a id='Transforming_Categorical_Columns'>Deriving and Transforming Columns</a><br></h1>
<ul>
    <li>Filter High-Value Customers</li>
    <ul>
        <li>calculate total data recharge amount</li>
    </ul>
    <li>Display the correlation matrix again to analyze correlation coefficient between features</li>
</ul>

In [None]:
display(
    data_v1.head()
)

# Filter High-Value Customers

## calculate total data recharge amount
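
A minimal sketch of the high-value filter on toy data (five invented customers). The monthly total is the call recharge amount plus the data recharge amount, where the data recharge amount is the number of data recharges times the average amount per recharge. Customers at or above the 70th percentile of the June-July average are kept.

```python
import pandas as pd

# Toy data: five customers, invented values.
df = pd.DataFrame({
    "total_rech_amt_6":   [100, 200, 300, 400, 500],
    "total_rech_amt_7":   [100, 200, 300, 400, 500],
    "total_rech_data_6":  [0, 1, 0, 2, 1],        # number of data recharges
    "total_rech_data_7":  [0, 1, 0, 2, 1],
    "av_rech_amt_data_6": [0, 50, 0, 100, 500],   # average amount per data recharge
    "av_rech_amt_data_7": [0, 50, 0, 100, 500],
})

monthly_total = {}
for month in (6, 7):
    data_amount = df[f"total_rech_data_{month}"] * df[f"av_rech_amt_data_{month}"]
    monthly_total[month] = df[f"total_rech_amt_{month}"] + data_amount

df["av_amt_data_6_7"] = (monthly_total[6] + monthly_total[7]) / 2
cutoff = df["av_amt_data_6_7"].quantile(0.7)
high_value = df[df["av_amt_data_6_7"] >= cutoff].reset_index(drop=True)
```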

In [None]:
# calculate the total data recharge amount for June and July --> number of recharges * average recharge amount

data_v1['total_data_rech_6'] = data_v1.total_rech_data_6 * data_v1.av_rech_amt_data_6
data_v1['total_data_rech_7'] = data_v1.total_rech_data_7 * data_v1.av_rech_amt_data_7


In [None]:
# add total data recharge and total recharge to get total combined recharge amount for a month
# calculate total recharge amount for June and July --> recharge amount + data recharge amount
data_v1['amt_data_6'] = data_v1.total_rech_amt_6 + data_v1.total_data_rech_6
data_v1['amt_data_7'] = data_v1.total_rech_amt_7 + data_v1.total_data_rech_7


In [None]:
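# calculate average recharge done by customer in June and July
data_v1['av_amt_data_6_7'] = (data_v1.amt_data_6 + data_v1.amt_data_7)/2
# test_v1 has no amt_data_* columns, so build the same average from its own recharge columns
test_v1['av_amt_data_6_7'] = ((test_v1.total_rech_amt_6 + test_v1.total_rech_data_6 * test_v1.av_rech_amt_data_6)
                              + (test_v1.total_rech_amt_7 + test_v1.total_rech_data_7 * test_v1.av_rech_amt_data_7))/2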


In [None]:
# look at the 70th percentile recharge amount
display("Recharge amount at 70th percentile: {0}".format(data_v1.av_amt_data_6_7.quantile(0.7)))

In [None]:
# Retain customers whose average recharge is at or above the 70th percentile


data_v1_filtered = data_v1.loc[data_v1['av_amt_data_6_7'] >= data_v1['av_amt_data_6_7'].quantile(0.7),:]
data_v1_filtered = data_v1_filtered.reset_index(drop=True)
data_v1_filtered.shape

In [None]:
#Dropping variables that are used to filter high-value customers

data_v1_filtered = data_v1_filtered.drop(['total_data_rech_6','total_data_rech_7','amt_data_6','amt_data_7'],axis=1)

In [None]:
data_v1_filtered.shape

In [None]:
data_v1_filtered['onnet_mou_diff'] = data_v1_filtered.onnet_mou_8 - ((data_v1_filtered.onnet_mou_6 + data_v1_filtered.onnet_mou_7)/2)

data_v1_filtered['offnet_mou_diff'] = data_v1_filtered.offnet_mou_8 - ((data_v1_filtered.offnet_mou_6 + data_v1_filtered.offnet_mou_7)/2)

data_v1_filtered['roam_ic_mou_diff'] = data_v1_filtered.roam_ic_mou_8 - ((data_v1_filtered.roam_ic_mou_6 + data_v1_filtered.roam_ic_mou_7)/2)

data_v1_filtered['roam_og_mou_diff'] = data_v1_filtered.roam_og_mou_8 - ((data_v1_filtered.roam_og_mou_6 + data_v1_filtered.roam_og_mou_7)/2)

data_v1_filtered['loc_og_mou_diff'] = data_v1_filtered.loc_og_mou_8 - ((data_v1_filtered.loc_og_mou_6 + data_v1_filtered.loc_og_mou_7)/2)

data_v1_filtered['std_og_mou_diff'] = data_v1_filtered.std_og_mou_8 - ((data_v1_filtered.std_og_mou_6 + data_v1_filtered.std_og_mou_7)/2)

data_v1_filtered['isd_og_mou_diff'] = data_v1_filtered.isd_og_mou_8 - ((data_v1_filtered.isd_og_mou_6 + data_v1_filtered.isd_og_mou_7)/2)

data_v1_filtered['spl_og_mou_diff'] = data_v1_filtered.spl_og_mou_8 - ((data_v1_filtered.spl_og_mou_6 + data_v1_filtered.spl_og_mou_7)/2)

data_v1_filtered['total_og_mou_diff'] = data_v1_filtered.total_og_mou_8 - ((data_v1_filtered.total_og_mou_6 + data_v1_filtered.total_og_mou_7)/2)

data_v1_filtered['loc_ic_mou_diff'] = data_v1_filtered.loc_ic_mou_8 - ((data_v1_filtered.loc_ic_mou_6 + data_v1_filtered.loc_ic_mou_7)/2)

data_v1_filtered['std_ic_mou_diff'] = data_v1_filtered.std_ic_mou_8 - ((data_v1_filtered.std_ic_mou_6 + data_v1_filtered.std_ic_mou_7)/2)

data_v1_filtered['isd_ic_mou_diff'] = data_v1_filtered.isd_ic_mou_8 - ((data_v1_filtered.isd_ic_mou_6 + data_v1_filtered.isd_ic_mou_7)/2)

data_v1_filtered['spl_ic_mou_diff'] = data_v1_filtered.spl_ic_mou_8 - ((data_v1_filtered.spl_ic_mou_6 + data_v1_filtered.spl_ic_mou_7)/2)

data_v1_filtered['total_ic_mou_diff'] = data_v1_filtered.total_ic_mou_8 - ((data_v1_filtered.total_ic_mou_6 + data_v1_filtered.total_ic_mou_7)/2)

data_v1_filtered['total_rech_num_diff'] = data_v1_filtered.total_rech_num_8 - ((data_v1_filtered.total_rech_num_6 + data_v1_filtered.total_rech_num_7)/2)

data_v1_filtered['total_rech_amt_diff'] = data_v1_filtered.total_rech_amt_8 - ((data_v1_filtered.total_rech_amt_6 + data_v1_filtered.total_rech_amt_7)/2)

data_v1_filtered['max_rech_amt_diff'] = data_v1_filtered.max_rech_amt_8 - ((data_v1_filtered.max_rech_amt_6 + data_v1_filtered.max_rech_amt_7)/2)

data_v1_filtered['total_rech_data_diff'] = data_v1_filtered.total_rech_data_8 - ((data_v1_filtered.total_rech_data_6 + data_v1_filtered.total_rech_data_7)/2)

data_v1_filtered['max_rech_data_diff'] = data_v1_filtered.max_rech_data_8 - ((data_v1_filtered.max_rech_data_6 + data_v1_filtered.max_rech_data_7)/2)

data_v1_filtered['av_rech_amt_data_diff'] = data_v1_filtered.av_rech_amt_data_8 - ((data_v1_filtered.av_rech_amt_data_6 + data_v1_filtered.av_rech_amt_data_7)/2)

data_v1_filtered['vol_2g_mb_diff'] = data_v1_filtered.vol_2g_mb_8 - ((data_v1_filtered.vol_2g_mb_6 + data_v1_filtered.vol_2g_mb_7)/2)

data_v1_filtered['vol_3g_mb_diff'] = data_v1_filtered.vol_3g_mb_8 - ((data_v1_filtered.vol_3g_mb_6 + data_v1_filtered.vol_3g_mb_7)/2)

In [None]:
test_v1['onnet_mou_diff'] = test_v1.onnet_mou_8 - ((test_v1.onnet_mou_6 + test_v1.onnet_mou_7)/2)

test_v1['offnet_mou_diff'] = test_v1.offnet_mou_8 - ((test_v1.offnet_mou_6 + test_v1.offnet_mou_7)/2)

test_v1['roam_ic_mou_diff'] = test_v1.roam_ic_mou_8 - ((test_v1.roam_ic_mou_6 + test_v1.roam_ic_mou_7)/2)

test_v1['roam_og_mou_diff'] = test_v1.roam_og_mou_8 - ((test_v1.roam_og_mou_6 + test_v1.roam_og_mou_7)/2)

test_v1['loc_og_mou_diff'] = test_v1.loc_og_mou_8 - ((test_v1.loc_og_mou_6 + test_v1.loc_og_mou_7)/2)

test_v1['std_og_mou_diff'] = test_v1.std_og_mou_8 - ((test_v1.std_og_mou_6 + test_v1.std_og_mou_7)/2)

test_v1['isd_og_mou_diff'] = test_v1.isd_og_mou_8 - ((test_v1.isd_og_mou_6 + test_v1.isd_og_mou_7)/2)

test_v1['spl_og_mou_diff'] = test_v1.spl_og_mou_8 - ((test_v1.spl_og_mou_6 + test_v1.spl_og_mou_7)/2)

test_v1['total_og_mou_diff'] = test_v1.total_og_mou_8 - ((test_v1.total_og_mou_6 + test_v1.total_og_mou_7)/2)

test_v1['loc_ic_mou_diff'] = test_v1.loc_ic_mou_8 - ((test_v1.loc_ic_mou_6 + test_v1.loc_ic_mou_7)/2)

test_v1['std_ic_mou_diff'] = test_v1.std_ic_mou_8 - ((test_v1.std_ic_mou_6 + test_v1.std_ic_mou_7)/2)

test_v1['isd_ic_mou_diff'] = test_v1.isd_ic_mou_8 - ((test_v1.isd_ic_mou_6 + test_v1.isd_ic_mou_7)/2)

test_v1['spl_ic_mou_diff'] = test_v1.spl_ic_mou_8 - ((test_v1.spl_ic_mou_6 + test_v1.spl_ic_mou_7)/2)

test_v1['total_ic_mou_diff'] = test_v1.total_ic_mou_8 - ((test_v1.total_ic_mou_6 + test_v1.total_ic_mou_7)/2)

test_v1['total_rech_num_diff'] = test_v1.total_rech_num_8 - ((test_v1.total_rech_num_6 + test_v1.total_rech_num_7)/2)

test_v1['total_rech_amt_diff'] = test_v1.total_rech_amt_8 - ((test_v1.total_rech_amt_6 + test_v1.total_rech_amt_7)/2)

test_v1['max_rech_amt_diff'] = test_v1.max_rech_amt_8 - ((test_v1.max_rech_amt_6 + test_v1.max_rech_amt_7)/2)

test_v1['total_rech_data_diff'] = test_v1.total_rech_data_8 - ((test_v1.total_rech_data_6 + test_v1.total_rech_data_7)/2)

test_v1['max_rech_data_diff'] = test_v1.max_rech_data_8 - ((test_v1.max_rech_data_6 + test_v1.max_rech_data_7)/2)

test_v1['av_rech_amt_data_diff'] = test_v1.av_rech_amt_data_8 - ((test_v1.av_rech_amt_data_6 + test_v1.av_rech_amt_data_7)/2)

test_v1['vol_2g_mb_diff'] = test_v1.vol_2g_mb_8 - ((test_v1.vol_2g_mb_6 + test_v1.vol_2g_mb_7)/2)

test_v1['vol_3g_mb_diff'] = test_v1.vol_3g_mb_8 - ((test_v1.vol_3g_mb_6 + test_v1.vol_3g_mb_7)/2)
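
The derived `_diff` columns all follow one pattern: the month 8 value minus the mean of months 6 and 7. A compact way to express this, shown here as a sketch with a hypothetical helper `add_diff_features` and a toy frame, is to loop over the column prefixes so that the same function can be applied to both the training and the test frame.

```python
import pandas as pd

def add_diff_features(frame, prefixes):
    """Add `<prefix>_diff` = month 8 value minus the mean of months 6 and 7."""
    out = frame.copy()
    for prefix in prefixes:
        baseline = (out[f"{prefix}_6"] + out[f"{prefix}_7"]) / 2
        out[f"{prefix}_diff"] = out[f"{prefix}_8"] - baseline
    return out

# Toy frame with two of the prefixes used above; values are invented.
toy = pd.DataFrame({
    "onnet_mou_6": [100.0, 10.0], "onnet_mou_7": [200.0, 30.0], "onnet_mou_8": [50.0, 20.0],
    "vol_2g_mb_6": [1.0, 0.0], "vol_2g_mb_7": [3.0, 0.0], "vol_2g_mb_8": [2.0, 5.0],
})
toy = add_diff_features(toy, ["onnet_mou", "vol_2g_mb"])
```

A negative difference means usage dropped in August relative to the earlier months, which is the signal these features are meant to capture.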

In [None]:
display(
    test_v1.shape
)

In [None]:
display(
    data_v1_filtered.shape
)

In [None]:
display(
    data_v1_filtered.isnull().sum()
)

In [None]:
#Testing

In [None]:
display(
    data_v1_filtered.shape
)

# display the correlation matrix again to analyze correlation coefficient between features

In [None]:
display(
    data_v1_filtered['churn_probability'].value_counts()
)

<h1><a id='Univariate_Analysis'>Univariate Analysis</a><br></h1>

In [None]:
display(
    data_v1_filtered.describe(percentiles=[.10,.25,.50,.75,.90,.95,.99])
)

In [None]:
# Boxplots of the columns whose names match `vars`, plotted against churn probability.
def plot_analysis(data_frame, vars):
    # Select the columns from the frame that was passed in, not from a global
    var_cats =  data_frame.columns[data_frame.columns.str.contains(vars)]
    display(var_cats)
    plt.figure(figsize=(15, 150))
    for i in enumerate(var_cats):
        plt.subplot(20, 2,i[0]+1)
        ax = sb.boxplot(data = data_frame, x=i[1], y='churn_probability') 
        ax.set_xticklabels(ax.get_xticklabels(),rotation = 90) 
        ax.set_title(i[1] + " vs Churn prob.", fontsize=14) 
        ax.set_xlabel(i[1], fontsize=14)
        ax.set_ylabel("Churn Probability", fontsize=15)
    plt.show()

In [None]:
display(
    sb.countplot(x="churn_probability",data = data_v1_filtered)
)

# The dataset is imbalanced: the share of churned customers is much lower than the share of non-churned customers

# Bivariate Analysis

In [None]:
sb.set(rc={'figure.figsize':(11.7,8.27)})
bins = [0, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
labels = [0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
sb.countplot(x = pd.cut(round(((data_v1_filtered['aon']/30)/12),1), bins = bins, labels = labels), hue = data_v1_filtered['churn_probability'])
plt.show()

# As age on network increases, the churn count decreases: the longer a customer has been on the network, the less likely they are to churn
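
The same relationship can be checked numerically by binning tenure and computing the churn rate per band. The sketch below uses invented values and hypothetical band edges. It converts `aon` from days to years the same way as the plot above (30-day months, 12 months per year).

```python
import pandas as pd

toy = pd.DataFrame({
    "aon": [200, 300, 500, 900, 1500, 3000],   # age on network, in days
    "churn_probability": [1, 1, 0, 1, 0, 0],
})

tenure_years = (toy["aon"] / 30) / 12
bins = [0, 1, 2, 5, 12]
labels = ["<=1y", "1-2y", "2-5y", "5-12y"]
toy["tenure_band"] = pd.cut(tenure_years, bins=bins, labels=labels)

# Mean of a 0/1 target per band is the churn rate of that band.
churn_rate_by_band = toy.groupby("tenure_band", observed=True)["churn_probability"].mean()
```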

In [None]:
def colbox(cols):
    n = len(cols)
    plt.figure(figsize=(40, 25))
    plt.suptitle('Incoming Calls Usage')
    for i in range(n):
        # Top row: boxplot of the column split by churn
        plt.subplot(2,n,i+1)
        K = pd.concat([data_v1_filtered[cols[i]],data_v1_filtered['churn_probability']], axis=1)
        K = pd.melt(K,id_vars="churn_probability",var_name="features",value_name='value')
        sb.boxplot(x="features", y="value", hue="churn_probability",data = K)
        # Bottom row: distribution of the same column in the raw data
        plt.subplot(2,n,n+i+1)
        sb.distplot(data[cols[i]])

In [None]:
# Analysis : Incoming Minutes of Usage 
cols =["total_ic_mou_6","total_ic_mou_7","total_ic_mou_8"]
colbox(cols)

# Inference : The distribution of total incoming minutes of usage is concentrated on the left, with a long right tail

# The higher the total MOU, the lower the probability of churn

In [None]:
# Analysis : Outgoing Minutes of Usage 
cols = [['total_og_mou_6'],
        ['total_og_mou_7'],
        ['total_og_mou_8']]
colbox(cols)

# For June and July, customers with higher outgoing usage show relatively more churn

In [None]:
cols = ['total_rech_num_6','total_rech_num_7','total_rech_num_8']
colbox(cols)

# Total Recharge Number Analysis 

# For June, a higher total recharge count is associated with more churn


In [None]:
def Bivariate_box(cols):
    plt.figure(figsize=(60, 45))
    for i in range(0,7):
        plt.subplot(3,3,i+1)
        K = pd.concat([data_v1_filtered[cols[i]],data_v1_filtered['churn_probability']], axis=1)
        K = pd.melt(K,id_vars="churn_probability",var_name="features",value_name='value')
        sb.boxplot(x="features", y="value", hue="churn_probability",data = K)
        plt.xticks()    
        plt.suptitle('2G-3G Volume')

In [None]:
def filter_columns(data_frame, prefix): 
    columns_with_prefix = []
    for col in data_frame.columns.tolist():
        if prefix in col: 
            columns_with_prefix.append(col) 
    return columns_with_prefix

In [None]:
# As roaming usage increases, churn increases
# As outgoing STD usage increases, churn increases
cols = [
        ['vol_2g_mb_6','vol_2g_mb_7','vol_2g_mb_8'],
        ['vol_3g_mb_6','vol_3g_mb_7','vol_3g_mb_8'],
        ['monthly_2g_6','monthly_2g_7','monthly_2g_8'],
        ['monthly_3g_6','monthly_3g_7','monthly_3g_8'],
        ['sachet_2g_6','sachet_2g_7','sachet_2g_8'],
        ['sachet_3g_6','sachet_3g_7','sachet_3g_8'],
        ['jun_vbc_3g','jul_vbc_3g','aug_vbc_3g'],
        ['roam_ic_mou_6','roam_ic_mou_7','roam_ic_mou_8'],
        ['roam_og_mou_6','roam_og_mou_7','roam_og_mou_8'],
        ['std_og_mou_6','std_og_mou_7','std_og_mou_8'],
        ['isd_og_mou_6','isd_og_mou_7','isd_og_mou_8']
       ]
# plot for the 2g-3g volume
plt.figure(figsize=(25, 20))
plt.subplots_adjust(hspace=0.5)
for i in range(0,11):
    plt.subplot(4,3,i+1)
    K = pd.concat([data_v1_filtered[cols[i]], data_v1_filtered['churn_probability']], axis=1)
    K = pd.melt(K,id_vars="churn_probability",var_name="features",value_name='value')
    sb.boxplot(x="features", y="value", hue="churn_probability", data=K)
    plt.xticks()    
    plt.suptitle('2G-3G Volume')

In [None]:
#Correlation Matrix

# Most of the features are highly correlated, so PCA is used to handle multicollinearity and reduce dimensionality
plt.figure(figsize = (25, 20))

sb.heatmap(data_v1.corr())

plt.show()

In [None]:
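# Splitting the data into features (X) and target (Y) before the train/test split

# Positional axis arguments to drop() are not accepted by recent pandas versions, so name the columns explicitly
X = data_v1_filtered.drop(columns=["churn_probability"])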
Y = data_v1_filtered.churn_probability

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(X, Y, test_size = 0.3, random_state = 42, stratify = Y)

In [None]:
# Aggregating the Categorical Columns

train = pd.concat([xtrain, ytrain], axis=1)

# aggregate the categorical variables
display(train.groupby('night_pck_user_6').churn_probability.mean())
display(train.groupby('night_pck_user_7').churn_probability.mean())
display(train.groupby('night_pck_user_8').churn_probability.mean())
display(train.groupby('fb_user_6').churn_probability.mean())
display(train.groupby('fb_user_7').churn_probability.mean())
display(train.groupby('fb_user_8').churn_probability.mean())

# replace categories with aggregated values in each categorical column
mapping = {'night_pck_user_6' : {-1: 0.107529, 0: 0.084044, 1: 0.127596},
           'night_pck_user_7' : {-1: 0.114231, 0: 0.065649, 1: 0.068750},
           'night_pck_user_8' : {-1: 0.126636, 0: 0.032644, 1: 0.034602},
           'fb_user_6'        : {-1: 0.107529, 0: 0.105496, 1: 0.083258},
           'fb_user_7'        : {-1: 0.114231, 0: 0.087029, 1: 0.063630},
           'fb_user_8'        : {-1: 0.126636, 0: 0.062458, 1: 0.029049}
          }
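xtrain.replace(mapping, inplace = True)
xtest.replace(mapping, inplace = True)
# test_v1 is passed through the same scaler later, so it needs the same encoding
test_v1.replace(mapping, inplace = True)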
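
The mapping above is typed in by hand from the displayed group means. A sketch of computing the same kind of mapping programmatically is shown below, using toy data and hypothetical frame names `train` and `valid`. The category means are learned from the training split only and then applied to any other frame.

```python
import pandas as pd

train = pd.DataFrame({
    "fb_user_8": [-1, -1, 0, 0, 1, 1, 1, 1],
    "churn_probability": [1, 0, 1, 0, 0, 0, 0, 1],
})
valid = pd.DataFrame({"fb_user_8": [1, -1, 0]})

cat_cols = ["fb_user_8"]

# One {category: mean churn} mapping per column, learned from the training split only.
mapping = {
    col: train.groupby(col)["churn_probability"].mean().to_dict()
    for col in cat_cols
}

train_encoded = train.copy()
valid_encoded = valid.copy()
for col in cat_cols:
    train_encoded[col] = train[col].map(mapping[col])
    valid_encoded[col] = valid[col].map(mapping[col])
```

Deriving the mapping in code keeps it in sync with the data if the filtering or the split changes. A category that never appears in the training split would map to NaN and would need a fallback value.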

In [None]:
data_v1_filtered.shape

# Principal Component Analysis

In [None]:
round(100*data_v1_filtered['churn_probability'].value_counts()/len(data_v1_filtered.index),2)

In [None]:
#Churn Distribution
pie_chart = data_v1_filtered['churn_probability'].value_counts()*100.0 /len(data_v1_filtered)
ax = pie_chart.plot.pie(autopct='%.1f%%', labels = ['No', 'Yes'],figsize =(8,6), fontsize = 14 )                                                                           
ax.set_ylabel('Churn',fontsize = 12)
ax.set_title('Churn Distribution', fontsize = 12)
plt.show()

# Data Scaling
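
A minimal sketch of the scaling step on toy arrays (the array names are illustrative). The scaler learns the mean and standard deviation from the training split only, and the same statistics are then reused for every other frame.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
x_valid = np.array([[2.0, 400.0]])

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learns mean and std from the training split
x_valid_scaled = scaler.transform(x_valid)      # reuses the training statistics
```

Calling `fit_transform` on the validation or test data instead would leak information about those rows into the preprocessing.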

In [None]:
# Scaling the data - Using Standard Scaler
col = list(xtrain.columns)
# Fit the scaler on the training split only, then reuse it for the other frames
scaler = StandardScaler()
xtrain_scaled = scaler.fit_transform(xtrain)
xtest_scaled = scaler.transform(xtest)
test_v1_scaled = scaler.transform(test_v1)

# Applying Principal Component Analysis to the scaled training data
# (the earlier extra fit on the unscaled xtrain was overwritten by this call, so it is dropped)
pca = PCA()
xtrain_pca = pca.fit_transform(xtrain_scaled)

# PCA

In [None]:
#  feature variance Graph
var_cumu = np.cumsum(pca.explained_variance_ratio_)
fig = plt.figure(figsize=[12,8])
# Guide lines: 75 components (vertical) and 95% cumulative variance (horizontal)
plt.vlines(x=75, ymax=1, ymin=0, colors="r", linestyles="--")
plt.hlines(y=0.95, xmax=len(var_cumu), xmin=0, colors="g", linestyles="--")
plt.plot(var_cumu)
plt.ylabel("Cumulative variance explained")
plt.xlabel("PCA Components")
plt.show()

In [None]:
df_pca = pd.DataFrame({'PC1':pca.components_[0],'PC2':pca.components_[1], 'PC3':pca.components_[2],'Feature':col})
df_pca.head(10)

In [None]:
np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)

# 75 components are enough to explain 95% of the variance in the dataset, so 75 components are selected for modelling
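
The component count can also be read off the cumulative explained variance in code rather than from the plot. The sketch below runs on synthetic data, so it does not reproduce the figure of 75. It shows two ways to find the smallest number of components reaching a 95% threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 3 latent factors spread over 10 correlated features, plus small noise.
latent = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 10))
x = latent @ loadings + 0.05 * rng.normal(size=(500, 10))

pca = PCA().fit(x)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# First position where the cumulative ratio reaches 0.95, converted to a count.
n_components_95 = int(np.argmax(cumulative >= 0.95) + 1)

# scikit-learn makes the same choice when n_components is a fraction.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(x)
```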

# Handling the imbalanced dataset using SMOTE
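
To illustrate the idea behind SMOTE, here is a simplified NumPy sketch. It is not the imbalanced-learn implementation, and `smote_like` is a hypothetical helper. Each synthetic minority row is placed on the line segment between a minority row and one of its nearest minority neighbours. The sampling ratio of 0.8 is read as the target minority-to-majority ratio after resampling.

```python
import numpy as np

def smote_like(x_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating towards nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class.
    dists = np.linalg.norm(x_min[:, None, :] - x_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest minority neighbours per row
    base = rng.integers(0, len(x_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                   # position along the connecting segment
    return x_min[base] + gap * (x_min[picked] - x_min[base])

rng = np.random.default_rng(1)
x_major = rng.normal(0.0, 1.0, size=(100, 2))
x_minor = rng.normal(3.0, 0.5, size=(10, 2))

ratio = 0.8                                        # target minority/majority ratio
n_new = int(ratio * len(x_major)) - len(x_minor)   # rows to synthesise
x_synth = smote_like(x_minor, n_new)
x_minor_resampled = np.vstack([x_minor, x_synth])
```

Only the training split should be resampled. The validation and test data keep their original class balance so that the reported metrics reflect real conditions.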

In [None]:
display("Applying SMOTE to reduce class imbalance")
import imblearn
# sampling_strategy=0.8: oversample the minority class to 80% of the majority class size
smote = imblearn.over_sampling.SMOTE(sampling_strategy=.8)
# fit_resample is the current name of the method formerly called fit_sample
x_smote,y_smote = smote.fit_resample(xtrain_scaled,ytrain)
display("Shape of train dataset after SMOTE : "+ str(x_smote.shape))

# Applying PCA : Principal Component Analysis
pca = IncrementalPCA(n_components=75)    
x_train_smote_pca = pca.fit_transform(x_smote)
x_test_smote_pca = pca.transform(xtest_scaled)
test_v1_scaled_pca = pca.transform(test_v1_scaled)


In [None]:
display("Shape of train dataset after PCA : "+str(x_train_smote_pca.shape))

# After SMOTE and PCA, the shape of the train dataset is (27040, 75)

In [None]:
x_test_smote_pca.shape

In [None]:
from collections import Counter

display(Counter(ytrain))
display(Counter(y_smote))

# Function to Evaluate metrics
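
For reference, sensitivity and specificity can be derived directly from the confusion matrix. The sketch below uses a small hand-made label vector (invented values) and scikit-learn's `confusion_matrix`, whose rows are true classes and whose columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)        # share of churners that were caught
specificity = tn / (tn + fp)        # share of non-churners correctly left alone
accuracy = (tp + tn) / len(y_true)
```

On an imbalanced problem such as churn, sensitivity is usually the more informative of these numbers, because a model can reach high accuracy while missing most churners.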

In [None]:
def evaluate_model(dt_classifier,ytrain,ytest,xtrain,xtest):
    # Predict once per split and reuse the labels for every metric
    train_pred = dt_classifier.predict(xtrain)
    test_pred = dt_classifier.predict(xtest)
    display("Train Accuracy :", accuracy_score(ytrain, train_pred))
    display("Train Confusion Matrix:")
    display(confusion_matrix(ytrain, train_pred))
    display("-"*50)
    display("Test Accuracy :", accuracy_score(ytest, test_pred))
    display("Test Confusion Matrix:")
    display(confusion_matrix(ytest, test_pred))
    display("recall_score",round(metrics.recall_score(ytest,test_pred),2))
    display("precision_score",round(metrics.precision_score(ytest,test_pred),2))
    # Note: this AUC is computed from hard labels, not from predicted probabilities
    display("auc",round(metrics.roc_auc_score(ytest,test_pred),2))
    display("f1",round(metrics.f1_score(ytest,test_pred),2))
    sensitivity, specificity, _ = sensitivity_specificity_support(ytest, test_pred, average='binary')
    display("Sensitivity: {}".format(round(sensitivity, 2)), "Specificity: {}".format(round(specificity, 2)))

# Model Creation 

## Model 1 : Logistic Regression Without hyperparameter tuning

In [None]:
# Logistic Regression without Hyper Parameter Tuning

#Training the model on the train data
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

lr = LogisticRegression()
model = lr.fit(x_train_smote_pca,y_smote)
#Making prediction on the test data
pred_probs_test = model.predict_proba(x_test_smote_pca)[:,1]
# The value shown here is ROC AUC computed from predicted probabilities, not accuracy
display("Logistic Regression AUC : "+"{:2.2}".format(metrics.roc_auc_score(ytest, pred_probs_test)))
evaluate_model(lr,y_smote,ytest,x_train_smote_pca,x_test_smote_pca)

## Model 2 : Logistic Regression With hyperparameter tuning

In [None]:
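# Logistic Regression with Hyper Parameter Tuning

# liblinear supports both the l1 and l2 penalties searched below; the default solver does not support l1
logistic = LogisticRegression(solver='liblinear')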

# hyperparameter space
params = {'C': [0.1, 0.5, 1, 2, 3, 4, 5, 10], 'penalty': ['l1', 'l2']}

# create 5 folds
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 4)


# create gridsearch object
lrmodel = GridSearchCV(estimator=logistic, cv=folds, param_grid=params, scoring='roc_auc', n_jobs=-1, verbose=1)

# fit model
lrmodel.fit(x_train_smote_pca, y_smote)

# display best hyperparameters
display("Best AUC: ", lrmodel.best_score_)
display("Best hyperparameters: ", lrmodel.best_params_)

evaluate_model(lrmodel,y_smote,ytest,x_train_smote_pca,x_test_smote_pca)

## Inferences : Logistic Regression gives 89% accuracy before tuning and 92% accuracy after tuning
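
When comparing models on an imbalanced target it helps to be explicit about which metric is quoted and how it is computed. The sketch below uses invented scores to show that ROC AUC computed from predicted probabilities differs from ROC AUC computed from hard labels (which is what `evaluate_model` passes in), and that both differ from accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.60, 0.40, 0.90])  # predicted churn probabilities
y_label = (y_score >= 0.5).astype(int)                     # hard labels at a 0.5 cut-off

auc_from_scores = roc_auc_score(y_true, y_score)   # uses the full ranking
auc_from_labels = roc_auc_score(y_true, y_label)   # ranking collapsed to two values
accuracy = accuracy_score(y_true, y_label)
```

Passing `predict_proba(...)[:, 1]` to `roc_auc_score` keeps the ranking information that thresholding throws away.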


## Model 3 : Random Forest without hyperparameter tuning

In [None]:
# Random Forest

rfc = RandomForestClassifier()
rfc.fit(x_train_smote_pca,y_smote)


evaluate_model(rfc,y_smote,ytest,x_train_smote_pca,x_test_smote_pca)

In [None]:
# Tuning 1 - Random Forest Classifier

# run a random forest model on train data
max_features = int(round(np.sqrt(xtrain.shape[1])))    # number of variables to consider to split each node
display(max_features)

rf_model = RandomForestClassifier(n_estimators=100, max_features=max_features, class_weight={0:0.1, 1: 0.9}, oob_score=True, random_state=4, verbose=1)

rf_model.fit(x_train_smote_pca, y_smote)

display("OOB Score",rf_model.oob_score_)

# Evaluate the model fitted in this cell (rf_model), not the earlier untuned rfc
evaluate_model(rf_model,y_smote,ytest,x_train_smote_pca,x_test_smote_pca)

# Random Forest with Hyper Parameter Tuning

In [None]:
# Tuning 2 - Random Forest Classifier - Max Depth

rf = RandomForestClassifier(random_state=42, n_jobs=-1,oob_score=True)
params = {
    'max_depth': range(2, 40, 5)
}

grid_search = GridSearchCV(estimator=rf,
                           param_grid=params,
                           cv = 4,
                           n_jobs=-1, verbose=1, scoring="accuracy",return_train_score=True)
grid_search.fit(x_train_smote_pca, y_smote)

In [None]:
evaluate_model(grid_search,y_smote,ytest,x_train_smote_pca,x_test_smote_pca)
scores = pd.DataFrame(grid_search.cv_results_)

In [None]:
for key in params.keys():
    hyperparameters = key
    break
plt.figure(figsize=(16,5))
plt.plot(scores["param_"+hyperparameters], scores["mean_train_score"], label="training accuracy")
plt.plot(scores["param_"+hyperparameters], scores["mean_test_score"], label="test accuracy")
plt.xlabel(hyperparameters)
plt.ylabel("Accuracy")
plt.legend()
plt.show()

# From the above plot, max depth can be 12 or 18, since the curve flattens after 18

In [None]:
# Tuning 3 - Random Forest Classifier - parameters = {'n_estimators': range(100, 800, 200)}

rf = RandomForestClassifier(random_state=42, n_jobs=-1,oob_score=True)
params = {
    'n_estimators': range(100, 800, 200)
}

grid_search = GridSearchCV(estimator=rf,
                           param_grid=params,
                           cv = 4,
                           n_jobs=-1, verbose=1, scoring="accuracy",return_train_score=True)
grid_search.fit(x_train_smote_pca, y_smote)

evaluate_model(grid_search,y_smote,ytest,x_train_smote_pca,x_test_smote_pca)
scores = pd.DataFrame(grid_search.cv_results_)

for key in params.keys():
    hyperparameters = key
    break
plt.figure(figsize=(16,5))
plt.plot(scores["param_"+hyperparameters], scores["mean_train_score"], label="training accuracy")
plt.plot(scores["param_"+hyperparameters], scores["mean_test_score"], label="test accuracy")
plt.xlabel(hyperparameters)
plt.ylabel("Accuracy")
plt.legend()
plt.show()

# Accuracy seems roughly constant across n_estimators, so let's take 200

In [None]:
# Tuning 4 - Random Forest Classifier - parameters = {'max_features': [20,30,40,50,60]}

rf = RandomForestClassifier(random_state=42, n_jobs=-1,oob_score=True)
params = {
    'max_features': [20,30,40,50,60]
}

grid_search = GridSearchCV(estimator=rf,
                           param_grid=params,
                           cv = 4,
                           n_jobs=-1, verbose=1, scoring="accuracy",return_train_score=True)
grid_search.fit(x_train_smote_pca, y_smote)

evaluate_model(grid_search,y_smote,ytest,x_train_smote_pca,x_test_smote_pca)
scores = pd.DataFrame(grid_search.cv_results_)

for key in params.keys():
    hyperparameters = key
    break
plt.figure(figsize=(16,5))
plt.plot(scores["param_"+hyperparameters], scores["mean_train_score"], label="training accuracy")
plt.plot(scores["param_"+hyperparameters], scores["mean_test_score"], label="test accuracy")
plt.xlabel(hyperparameters)
plt.ylabel("Accuracy")
plt.legend()
plt.show()

# Let's take max_features as 40, since accuracy starts declining after this point

In [None]:
# Tuning 5 - Random Forest Classifier - parameters = {'min_samples_leaf': range(1, 100, 10)}

rf = RandomForestClassifier(random_state=42, n_jobs=-1,oob_score=True)
params = {
    'min_samples_leaf': range(1, 100, 10)
}

grid_search = GridSearchCV(estimator=rf,
                           param_grid=params,
                           cv = 4,
                           n_jobs=-1, verbose=1, scoring="accuracy",return_train_score=True)
grid_search.fit(x_train_smote_pca, y_smote)

evaluate_model(grid_search,y_smote,ytest,x_train_smote_pca,x_test_smote_pca)
scores = pd.DataFrame(grid_search.cv_results_)

for key in params.keys():
    hyperparameters = key
    break
plt.figure(figsize=(16,5))
plt.plot(scores["param_"+hyperparameters], scores["mean_train_score"], label="training accuracy")
plt.plot(scores["param_"+hyperparameters], scores["mean_test_score"], label="test accuracy")
plt.xlabel(hyperparameters)
plt.ylabel("Accuracy")
plt.legend()
plt.show()

# Accuracy decreases as min_samples_leaf increases, while very small values may overfit, so let's take 20

In [None]:
# Tuning 6 - Random Forest Classifier - parameters = {'min_samples_split': range(10, 100, 10)}

rf = RandomForestClassifier(random_state=42, n_jobs=-1,oob_score=True)
params = {
    'min_samples_split': range(10, 100, 10)
}

grid_search = GridSearchCV(estimator=rf,
                           param_grid=params,
                           cv = 4,
                           n_jobs=-1, verbose=1, scoring="accuracy",return_train_score=True)
grid_search.fit(x_train_smote_pca, y_smote)

evaluate_model(grid_search,y_smote,ytest,x_train_smote_pca,x_test_smote_pca)
scores = pd.DataFrame(grid_search.cv_results_)

for key in params.keys():
    hyperparameters = key
    break
plt.figure(figsize=(16,5))
plt.plot(scores["param_"+hyperparameters], scores["mean_train_score"], label="training accuracy")
plt.plot(scores["param_"+hyperparameters], scores["mean_test_score"], label="test accuracy")
plt.xlabel(hyperparameters)
plt.ylabel("Accuracy")
plt.legend()
plt.show()

# Take min_samples_split as 40, since accuracy starts decreasing after this point

In [None]:
# Tuning 7 - Random Forest Classifier - final model after all the tuning

rf = RandomForestClassifier(random_state=42, n_jobs=-1,oob_score=True)
params = {
    'max_depth': [12],
    'min_samples_split' : [40],
    'min_samples_leaf' : [10,20],
    'max_features' : [40],
    'n_estimators' : [200]
}

rf_final_model = GridSearchCV(estimator=rf,
                           param_grid=params,
                           cv = 4,
                           n_jobs=-1, verbose=1, scoring="accuracy",return_train_score=True)
rf_final_model.fit(x_train_smote_pca, y_smote)

evaluate_model(rf_final_model,y_smote,ytest,x_train_smote_pca,x_test_smote_pca)
scores = pd.DataFrame(rf_final_model.cv_results_)


# Inference Random Forest Model - This is the best result we got from random forest after hyperparameter tuning

> Train Accuracy : 94.1%

> Test Accuracy : 89.0%, Recall : 0.79 and Precision : 0.42
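
Because churners are the minority class, accuracy alone can hide the trade-off between precision and recall. As a rough single-number summary, the F1 score (the harmonic mean of the two) can be computed from the precision and recall reported above. The helper name below is an illustrative assumption, not something defined in this notebook.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported above for the tuned random forest (test data)
rf_f1 = f1_from_precision_recall(precision=0.42, recall=0.79)
print(round(rf_f1, 3))
```

A high recall with a low precision means most churners are caught, but many flagged customers would not actually have churned.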

# MODEL 4 : ADABOOST

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
from sklearn.ensemble import AdaBoostClassifier
# Using AdaBoost to predict churn
adaboost =  AdaBoostClassifier(n_estimators=200, random_state=1)
adaboost.fit(x_train_smote_pca, y_smote)

In [None]:
display('Accuracy of the Train model is:  ',accuracy_score(y_smote, adaboost.predict(x_train_smote_pca)))
display('Accuracy of the Test model is:  ',accuracy_score(ytest, adaboost.predict(x_test_smote_pca)))

In [None]:
evaluate_model(adaboost,y_smote,ytest,x_train_smote_pca,x_test_smote_pca)

# Model 5 : ADABOOST : xtrain and y train directly without using PCA

In [None]:
from sklearn.ensemble import AdaBoostClassifier
# Using AdaBoost to predict churn
adaboost =  AdaBoostClassifier(n_estimators=200, random_state=1)
adaboost.fit(xtrain, ytrain)

In [None]:
display('Accuracy of the Train model is:  ',accuracy_score(ytrain, adaboost.predict(xtrain)))
display('Accuracy of the Test model is:  ',accuracy_score(ytest, adaboost.predict(xtest)))

In [None]:
evaluate_model(adaboost,ytrain,ytest,xtrain,xtest)

In [None]:
# Hyperparameter Tuning Adaboost
params = {
        'n_estimators' : [50,100, 200], # number of boosting rounds (weak learners)
        'algorithm': ['SAMME', 'SAMME.R'], # note: 'SAMME.R' is deprecated in recent scikit-learn releases
        }

folds = 5

# the grid above holds only 3 x 2 = 6 combinations, so 6 iterations cover it fully
param_comb = 6

random_search_ada = RandomizedSearchCV(adaboost, param_distributions=params, n_iter=param_comb, scoring='accuracy', n_jobs=-1, cv=folds, verbose=3, random_state=42)
random_search_ada.fit(xtrain, ytrain)
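
When every entry in `param_distributions` is a list, the number of distinct candidates is the product of the list lengths, and a randomized search cannot explore more candidates than that, whatever `n_iter` requests. A small sketch that counts the candidates for the grid above, using only the standard library:

```python
from itertools import product

params = {
    'n_estimators': [50, 100, 200],
    'algorithm': ['SAMME', 'SAMME.R'],
}

# Every combination of the listed values, one dict per candidate
combinations = [dict(zip(params, combo)) for combo in product(*params.values())]
n_candidates = len(combinations)
print(n_candidates)
```

With so few candidates, an exhaustive `GridSearchCV` would evaluate the same set of models.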

In [None]:
display('Accuracy of the Train model is:  ',accuracy_score(ytrain, random_search_ada.predict(xtrain)))
display('Accuracy of the Test model is:  ',accuracy_score(ytest, random_search_ada.predict(xtest)))

In [None]:
evaluate_model(random_search_ada,ytrain,ytest,xtrain,xtest)

In [None]:
display('\n Best estimator:')
display(random_search_ada.best_estimator_)
display('\n Best accuracy for %d-fold search with %d parameter combinations:' % (folds, param_comb))
display(random_search_ada.best_score_ )
display('\n Best hyperparameters:')
display(random_search_ada.best_params_)

# The best accuracy we got from AdaBoost is 94% on train and 93.9% on test data, with a recall of 52% and a precision of 71%

# XGBOOST

In [None]:
### XG Boost - Model 1 
# fit model on training data with default hyperparameters

model = XGBClassifier()
model.fit(x_train_smote_pca, y_smote)
evaluate_model(model,y_smote,ytest,x_train_smote_pca,x_test_smote_pca)

# Let's fine-tune the model for improved accuracy

In [None]:
# hyperparameter tuning with XGBoost - Model 2

# number of cross-validation folds
folds = 5

# 'min_child_weight': [1, 5, 7, 10],
# 'gamma': [0.1, 0.5, 1, 1.5, 5]
# specify range of hyperparameters
param_grid = {'learning_rate': [0.1,0.2,0.3], 
             'subsample': [0.3,0.4,0.5]
             }          


# specify model
xgb_model = XGBClassifier(max_depth=2, n_estimators=200,n_jobs=-1)

# set up GridSearchCV()
model_cv = GridSearchCV(estimator = xgb_model, 
                        param_grid = param_grid, 
                        scoring= 'accuracy', # accuracy
                        cv = folds, 
                        n_jobs = -1,
                        verbose = 1,
                        return_train_score=True)

# fit the model
model_cv.fit(x_train_smote_pca, y_smote)
evaluate_model(model_cv,y_smote,ytest,x_train_smote_pca,x_test_smote_pca)

In [None]:
# displaying the optimal accuracy score and hyperparameters
display('We get best score of '+str(round(model_cv.best_score_,2)) +' using parameters '+str(model_cv.best_params_))

In [None]:
# chosen hyperparameters - Model 3
params = {'learning_rate': 0.3,
          'max_depth': 3, 
          'n_estimators':200,
          'subsample':0.4,
          'gamma': 1,
         'objective':'binary:logistic'}

# fit model on training data, unpacking the chosen hyperparameters
# (passing the dict as `params=` would not apply them to the classifier)
model_1 = XGBClassifier(**params, min_child_weight=1, scale_pos_weight=1)
model_1.fit(x_train_smote_pca, y_smote)
evaluate_model(model_1,y_smote,ytest,x_train_smote_pca,x_test_smote_pca)

In [None]:
## Kaggle_CSV 

# XGBoost gives good accuracy and recall compared to the other models, but instead of using one model

# we will use a combination of multiple models

## Derive the output of all models and predict test data based on the combination

# XGBOOST

In [None]:
test_v1.shape

In [None]:
# XGBoost
churn_probability_model1 = model_1.predict(test_v1_scaled_pca)
csvdata = {'id':test['id'],'churn_probability_xgboost':churn_probability_model1}
df = pd.DataFrame(csvdata)
display(df.shape)
#df[['id','churn_probability_xgboost']].to_csv('samplechurn_model1.csv',index=False)



In [None]:
df['churn_probability_xgboost'].value_counts()

# Logistic Regression

In [None]:
# Logistic Regression
churn_probability_logistic = lrmodel.predict(test_v1_scaled_pca)
df['churn_probability_logistic'] = pd.DataFrame(churn_probability_logistic,columns=['churn_probability_logistic'])
display(df.shape)
#df[['id','churn_probability_xgboost']].to_csv('samplechurn_model1.csv',index=False)



In [None]:
df['churn_probability_logistic'].value_counts()

# Random Forest

In [None]:
# Random Forest
churn_probability_randomforest = rf_final_model.predict(test_v1_scaled_pca)
df['churn_probability_randomforest'] = pd.DataFrame(churn_probability_randomforest,columns=['churn_probability_randomforest'])
display(df.shape)
#df[['id','churn_probability_xgboost']].to_csv('samplechurn_model1.csv',index=False)



In [None]:
df['churn_probability_randomforest'].value_counts()

In [None]:
test_v1.shape

In [None]:
df.head()

# Adaboost

In [None]:
# Adaboost
churn_probability_adaboost = random_search_ada.predict(test_v1)
df['churn_probability_adaboost'] = pd.DataFrame(churn_probability_adaboost,columns=['churn_probability_adaboost'])
display(df.shape)
#df[['id','churn_probability_xgboost']].to_csv('samplechurn_model1.csv',index=False)



In [None]:
df.head()

In [None]:
df['churn_probability_adaboost'].value_counts()

In [None]:
df['total_churn'] = df['churn_probability_xgboost'] + df['churn_probability_adaboost'] + df['churn_probability_randomforest']

In [None]:
df['Total_Case1'] = df['total_churn'].apply(lambda x : 1 if x>1 else 0)

In [None]:
df['Total_Case1'].value_counts()
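
Summing three 0/1 predictions and keeping rows where the sum exceeds 1 is a majority vote: a customer is flagged only when at least two of the three models predict churn. A minimal sketch with made-up predictions (the arrays below are hypothetical and not taken from the dataset):

```python
import numpy as np

# Hypothetical 0/1 predictions from three models for five customers
xgb_pred = np.array([1, 0, 1, 0, 1])
ada_pred = np.array([1, 0, 0, 0, 1])
rf_pred = np.array([0, 0, 1, 1, 1])

# Number of models voting "churn" for each customer
total_votes = xgb_pred + ada_pred + rf_pred

# Flag churn when at least two of the three models agree
majority = (total_votes > 1).astype(int)
print(majority.tolist())
```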

In [None]:
# df[['id','churn_probability']].to_csv('final_model.csv',index=False)

> # To improve accuracy further, let's build another set of models on the full training data instead of splitting it
- We will again use these models in a linear combination

# Apply Algorithms on X and Y train directly

In [None]:
trainX = X.copy()
trainY = Y.copy()

display(trainX.shape,trainY.shape)

In [None]:
# Aggregating the Categorical Columns

Dtrain = pd.concat([trainX, trainY], axis=1)

# aggregate the categorical variables
display(Dtrain.groupby('night_pck_user_6').churn_probability.mean())
display(Dtrain.groupby('night_pck_user_7').churn_probability.mean())
display(Dtrain.groupby('night_pck_user_8').churn_probability.mean())
display(Dtrain.groupby('fb_user_6').churn_probability.mean())
display(Dtrain.groupby('fb_user_7').churn_probability.mean())
display(Dtrain.groupby('fb_user_8').churn_probability.mean())



In [None]:
# replace categories with aggregated values in each categorical column
mapping = {'night_pck_user_6' : {-1: 0.099925, 0: 0.067941, 1: 0.109091},
           'night_pck_user_7' : {-1: 0.116903, 0: 0.056424, 1: 0.062500},
           'night_pck_user_8' : {-1: 0.142189, 0: 0.030350, 1: 0.029412},
           'fb_user_6'        : {-1: 0.099925, 0: 0.080432, 1: 0.068025},
           'fb_user_7'        : {-1: 0.116903, 0: 0.070099, 1: 0.055439},
           'fb_user_8'        : {-1: 0.142189, 0: 0.076923, 1: 0.024883}
          }
trainX.replace(mapping, inplace = True)
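
The hard-coded values in `mapping` are the per-category churn means shown by the `groupby` calls above. Deriving the mapping in code avoids transcription errors and keeps it in sync if the data changes. A minimal sketch on a toy frame (the data below is made up, and only one column is shown):

```python
import pandas as pd

# Toy stand-in for the training data; values are illustrative only
toy = pd.DataFrame({
    "fb_user_8": [-1, -1, 0, 0, 1, 1, 1, 1],
    "churn_probability": [1, 0, 0, 0, 0, 0, 1, 0],
})
cat_cols = ["fb_user_8"]

# Mean churn rate per category, for each categorical column
mapping = {
    c: toy.groupby(c)["churn_probability"].mean().to_dict()
    for c in cat_cols
}

# Replace each category with its mean churn rate
encoded = toy[cat_cols].copy()
for c in cat_cols:
    encoded[c] = toy[c].map(mapping[c])

print(mapping)
```

The same `mapping`, learned on the training data only, would then be applied to the test set so that both share one encoding.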

In [None]:
#Scaling on entire set 

# Scaling the data - Using Standard Scaler
col = list(trainX.columns)
# Data Scaling
scaler1 = StandardScaler()
trainX_scaled = scaler1.fit_transform(trainX)
testX_v1_scaled = scaler1.transform(test_v1)

# Applying Principal Component Analysis on the scaled data
pca = PCA()
trainx_pca = pca.fit_transform(trainX_scaled)

# PCA

In [None]:
#  feature variance Graph
var_cumu = np.cumsum(pca.explained_variance_ratio_)
fig = plt.figure(figsize=[12,8])
plt.vlines(x=15, ymax=1, ymin=0, colors="r", linestyles="--")
plt.hlines(y=0.95, xmax=30, xmin=0, colors="g", linestyles="--")
plt.plot(var_cumu)
plt.ylabel("Cumulative variance explained")
plt.xlabel("no of pca components")
plt.show()

In [None]:
df_pca = pd.DataFrame({'PC1':pca.components_[0],'PC2':pca.components_[1], 'PC3':pca.components_[2],'Feature':col})
df_pca.head(10)

In [None]:
np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
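
The cumulative variance can also be used to pick the number of components programmatically, for example the smallest count that explains at least 95% of the variance. A sketch with made-up variance ratios standing in for `pca.explained_variance_ratio_`:

```python
import numpy as np

# Illustrative explained-variance ratios (not from this dataset)
explained_variance_ratio = np.array([0.5, 0.3, 0.1, 0.06, 0.04])

cumulative = np.cumsum(explained_variance_ratio)

# Index of the first component where the cumulative share reaches 95%, plus one
n_components_95 = int(np.argmax(cumulative >= 0.95) + 1)
print(n_components_95)
```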

# Handling the imbalanced dataset using SMOTE

In [None]:
display("Applying SMOTE to balance the classes")

# sampling_strategy=0.8: oversample the minority class to 80% of the majority class size
smote = SMOTE(sampling_strategy=0.8)
trainX_smote,trainY_smote = smote.fit_resample(trainX_scaled,trainY)
display("Shape of train dataset after SMOTE : "+ str(trainX_smote.shape))

# Applying PCA : Principal Component Analysis
pca = IncrementalPCA(n_components=80)
trainX_smote_pca = pca.fit_transform(trainX_smote)
testX_v1_scaled_pca = pca.transform(testX_v1_scaled)

In [None]:
display("Shape of train dataset after PCA : "+str(trainX_smote_pca.shape))
from collections import Counter

display(Counter(trainY))
display(Counter(trainY_smote))
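
The two counters can be sanity-checked by hand. Assuming a float sampling strategy is read as the desired minority-to-majority ratio after resampling (as imbalanced-learn documents it), the expected class counts follow from simple arithmetic. The function name and the counts below are illustrative assumptions:

```python
def smote_counts(n_majority, n_minority, sampling_strategy):
    """Expected class counts after over-sampling the minority to the given ratio."""
    # Number of synthetic minority samples needed to reach the target ratio
    n_new = int(n_majority * sampling_strategy - n_minority)
    return {"majority": n_majority, "minority": n_minority + n_new}


# Made-up class counts: 1000 non-churners and 100 churners
counts = smote_counts(n_majority=1000, n_minority=100, sampling_strategy=0.8)
print(counts)
```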

In [None]:
from sklearn.model_selection import KFold

# RandomForest

In [None]:
# Random Forest Classifier - trained on the full (unsplit) training data

cv = KFold(n_splits=4, shuffle=True, random_state=1)

Drf = RandomForestClassifier(random_state=42, n_jobs=-1,oob_score=True)
params = {
    'max_depth': [12],
    'min_samples_split' : [40],
    'min_samples_leaf' : [10,20],
    'max_features' : [40],
    'n_estimators' : [200]
}

Drf_final_model = GridSearchCV(estimator=Drf,
                           param_grid=params,
                           cv = cv,
                           n_jobs=-1, verbose=1, scoring="accuracy",return_train_score=True)
Drf_final_model.fit(trainX_smote_pca, trainY_smote)

scores = pd.DataFrame(Drf_final_model.cv_results_)



In [None]:
display("Train Accuracy :", accuracy_score(trainY_smote, Drf_final_model.predict(trainX_smote_pca)))
display("Train Confusion Matrix:")
display(confusion_matrix(trainY_smote, Drf_final_model.predict(trainX_smote_pca)))

# AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier
# Using AdaBoost to predict churn
Dadaboost =  AdaBoostClassifier(n_estimators=200, random_state=1)
Dadaboost.fit(trainX, trainY)

In [None]:
display('Accuracy of the Train model is:  ',accuracy_score(trainY, Dadaboost.predict(trainX)))

In [None]:
# Hyperparameter Tuning Adaboost
params = {
        'n_estimators' : [50,100, 200], # number of boosting rounds (weak learners)
        'algorithm': ['SAMME', 'SAMME.R'], # note: 'SAMME.R' is deprecated in recent scikit-learn releases
        }

cv = KFold(n_splits=5, shuffle=True, random_state=1)

# the grid above holds only 3 x 2 = 6 combinations, so 6 iterations cover it fully
param_comb = 6

Drandom_search_ada = RandomizedSearchCV(Dadaboost, param_distributions=params, n_iter=param_comb, scoring='accuracy', n_jobs=-1, cv=cv, verbose=3, random_state=42)
Drandom_search_ada.fit(trainX, trainY)

In [None]:
display('Accuracy of the Train model is:  ',accuracy_score(trainY, Drandom_search_ada.predict(trainX)))

# XGBOOST

In [None]:
### XG Boost - Model initial
# fit model on the full scaled training data with fixed hyperparameters

xgb_model = XGBClassifier(max_depth=2, n_estimators=200,n_jobs=-1,learning_rate=.3,subsample=.5)
xgb_model.fit(trainX_scaled, trainY)
# importances = pd.DataFrame(data={
#     'Attribute': X_train.columns,
#     'Importance': model.feature_importances_
# })
# importances = importances.sort_values(by='Importance', ascending=False)

In [None]:
# hyperparameter tuning with XGBoost - Model Direct

# creating a KFold object 
cv = KFold(n_splits=5, shuffle=True, random_state=1)

# 'min_child_weight': [1, 5, 7, 10],
# 'gamma': [0.1, 0.5, 1, 1.5, 5]
# specify range of hyperparameters
param_grid = {'learning_rate': [0.1,0.2,0.3], 
             'subsample': [0.3,0.4,0.5]
             }          


# specify model
Dxgb_model = XGBClassifier(max_depth=2, n_estimators=200,n_jobs=-1)

# set up GridSearchCV()
Dmodel_cv = GridSearchCV(estimator = Dxgb_model, 
                        param_grid = param_grid, 
                        scoring= 'accuracy', # accuracy
                        cv = cv, 
                        n_jobs = -1,
                        verbose = 1,
                        return_train_score=True)

# fit the model
Dmodel_cv.fit(trainX_smote_pca, trainY_smote)



In [None]:
# displaying the optimal accuracy score and hyperparameters
display('We get best score of '+str(round(Dmodel_cv.best_score_,2)) +' using parameters '+str(Dmodel_cv.best_params_))

In [None]:
# chosen hyperparameters - Model 3
params = {'learning_rate': 0.3,
          'max_depth': 3, 
          'n_estimators':200,
          'subsample':0.4,
          'gamma': 1,
         'objective':'binary:logistic'}

# fit model on training data, unpacking the chosen hyperparameters
# (passing the dict as `params=` would not apply them to the classifier)
Dmodel_1 = XGBClassifier(**params, min_child_weight=1, scale_pos_weight=1)
Dmodel_1.fit(trainX_smote_pca, trainY_smote)

display('Accuracy of the Train model is:  ',accuracy_score(trainY_smote, Dmodel_1.predict(trainX_smote_pca)))

# Calculate the Test Results based on the multiple model outputs

In [None]:
# Random Forest 
Dchurn_probability_randomforest = Drf_final_model.predict(testX_v1_scaled_pca)
df['Dchurn_probability_randomforest'] = pd.DataFrame(Dchurn_probability_randomforest,columns=['Dchurn_probability_randomforest'])
display(df.shape)


In [None]:
# XGBOOST 
Dchurn_XGBOOST = Dmodel_1.predict(testX_v1_scaled_pca)
df['Dchurn_XGBOOST'] = pd.DataFrame(Dchurn_XGBOOST,columns=['Dchurn_XGBOOST'])
display(df.shape)

In [None]:
# AdaBoost (model tuned on the full training data)
Dchurn_probability_adaboost = Drandom_search_ada.predict(test_v1)
df['Dchurn_probability_adaboost'] = pd.DataFrame(Dchurn_probability_adaboost,columns=['Dchurn_probability_adaboost'])
display(df.shape)

In [None]:
df.head()

In [None]:
df['Dtotal_churn'] = df['Dchurn_XGBOOST'] + df['Dchurn_probability_adaboost'] + df['Dchurn_probability_randomforest']

In [None]:
df['Total_Case2'] = df['Dtotal_churn'].apply(lambda x : 1 if x>1 else 0)

In [None]:
df['Total_sum'] = df['Total_Case2'] + df['Total_Case1']

In [None]:
df['churn_probability'] = df['Total_sum'].apply(lambda x : 1 if x>=1 else 0)

In [None]:
df['churn_probability'].value_counts()
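
Thresholding `Total_sum` at `>= 1` means a customer is flagged as churning when either of the two majority votes says so, which is a logical OR. This tends to favour recall over precision. A sketch with hypothetical vote vectors (not taken from the dataset):

```python
import numpy as np

# Hypothetical majority-vote outputs for five customers
case1 = np.array([1, 0, 0, 1, 0])  # models trained on the train split
case2 = np.array([0, 0, 1, 1, 0])  # models trained on the full training data

total_sum = case1 + case2

# Flag churn if at least one of the two ensembles predicts churn
final = (total_sum >= 1).astype(int)

# Equivalent formulation as an explicit logical OR
final_or = np.logical_or(case1, case2).astype(int)
print(final.tolist())
```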

In [None]:
df[['id','churn_probability']].to_csv('Submission.csv',index=False)

# Feature Importance using the Xgboost Model

In [None]:
xgb_model.feature_importances_

In [None]:
import plotly.offline as py 
py.init_notebook_mode(connected=True) 
import plotly.graph_objs as go
import plotly.tools as tls

In [None]:
# Scatter plot of feature importances
trace = go.Scatter(
    y=xgb_model.feature_importances_,
    x=xtrain.columns.values,
    mode='markers',
    marker=dict(
        sizemode='diameter',
        sizeref=1.3,
        size=12,
        color=xgb_model.feature_importances_,
        colorscale='Portland',
        showscale=True
    ),
    text=xtrain.columns.values
)
data = [trace]
layout = go.Layout(
    autosize=True,
    title='XGBOOST Model Feature Importance',
    hovermode='closest',
    xaxis=dict(ticklen=5, showgrid=False, zeroline=False, showline=False),
    yaxis=dict(title='Feature Importance', showgrid=False, zeroline=False, ticklen=5, gridwidth=2),
    showlegend=False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter')

In [None]:
# Top 20 features based on XGBoost feature importance
#y = model_1.feature_importances_, x = xtrain.columns.values,
results = pd.DataFrame()
results['columns'] = xtrain.columns.values
results['importances'] = xgb_model.feature_importances_ *100
results.sort_values(by = 'importances', ascending = False, inplace=True)
results[:20]

# Conclusion
- The telecom company should provide good offers to customers who use services while roaming.
- The telecom company should provide STD and ISD packages to reduce the churn rate.

In [None]:
import datetime
import pytz
print("Current Time in IST:", datetime.datetime.now(pytz.utc).astimezone(pytz.timezone('Asia/Kolkata')).strftime('%Y-%m-%d %H:%M:%S'))