## Problem Statement
In the telecom industry, customers can choose among multiple service providers and actively switch from one operator to another. In this highly competitive market, the industry experiences an average annual churn rate of 15-25%. Because acquiring a new customer costs 5-10 times more than retaining an existing one, customer retention has become even more important than customer acquisition.

For many incumbent operators, retaining highly profitable customers is the number one business goal. To reduce customer churn, telecom companies need to predict which customers are at high risk of churn. In this project, you will analyze customer-level data of a leading telecom firm and build predictive models to identify customers at high risk of churn.

In this competition, your goal is to build a machine learning model that can predict churning customers from the usage features provided.



## The notebook consists of the following 4 parts

* Data Understanding, Preparation, and Pre-Processing
* Exploratory Data Analysis
* Feature Engineering and Variable Transformation
* Model Selection, Model Building, and Prediction

### Data Understanding, Preparation, and Pre-Processing

#### Importing Libraries

In [1]:
import pandas as pd

In [2]:
#reading data
df=pd.read_csv("train (1).csv")

In [3]:
#checking shape of data
df.shape

(69999, 172)

In [4]:
## checking sample of data
df.head()

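| | id | circle_id | loc_og_t2o_mou | std_og_t2o_mou | loc_ic_t2o_mou | last_date_of_month_6 | last_date_of_month_7 | last_date_of_month_8 | arpu_6 | arpu_7 | ... | sachet_3g_7 | sachet_3g_8 | fb_user_6 | fb_user_7 | fb_user_8 | aon | aug_vbc_3g | jul_vbc_3g | jun_vbc_3g | churn_probability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 109 | 0.0 | 0.0 | 0.0 | 6/30/2014 | 7/31/2014 | 8/31/2014 | 31.277 | 87.009 | ... | 0 | 0 | NaN | NaN | NaN | 1958 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 1 | 109 | 0.0 | 0.0 | 0.0 | 6/30/2014 | 7/31/2014 | 8/31/2014 | 0.0 | 122.787 | ... | 0 | 0 | NaN | 1.0 | NaN | 710 | 0.0 | 0.0 | 0.0 | 0 |
| 2 | 2 | 109 | 0.0 | 0.0 | 0.0 | 6/30/2014 | 7/31/2014 | 8/31/2014 | 60.806 | 103.176 | ... | 0 | 0 | NaN | NaN | NaN | 882 | 0.0 | 0.0 | 0.0 | 0 |
| 3 | 3 | 109 | 0.0 | 0.0 | 0.0 | 6/30/2014 | 7/31/2014 | 8/31/2014 | 156.362 | 205.26 | ... | 0 | 0 | NaN | NaN | NaN | 982 | 0.0 | 0.0 | 0.0 | 0 |
| 4 | 4 | 109 | 0.0 | 0.0 | 0.0 | 6/30/2014 | 7/31/2014 | 8/31/2014 | 240.708 | 128.191 | ... | 1 | 0 | 1.0 | 1.0 | 1.0 | 647 | 0.0 | 0.0 | 0.0 | 0 |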


In [5]:
#checking null values and data types in columns
df.info(verbose=True,show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69999 entries, 0 to 69998
Data columns (total 172 columns):
 #    Column                    Non-Null Count  Dtype  
---   ------                    --------------  -----  
 0    id                        69999 non-null  int64  
 1    circle_id                 69999 non-null  int64  
 2    loc_og_t2o_mou            69297 non-null  float64
 3    std_og_t2o_mou            69297 non-null  float64
 4    loc_ic_t2o_mou            69297 non-null  float64
 5    last_date_of_month_6      69999 non-null  object 
 6    last_date_of_month_7      69600 non-null  object 
 7    last_date_of_month_8      69266 non-null  object 
 8    arpu_6                    69999 non-null  float64
 9    arpu_7                    69999 non-null  float64
 10   arpu_8                    69999 non-null  float64
 11   onnet_mou_6               67231 non-null  float64
 12   onnet_mou_7               67312 non-null  float64
 13   onnet_mou_8               66296 non-null  fl

##### The dataset has 135 float columns, 28 integer columns, and 9 string-type columns.
##### Out of 70k records, many columns have fewer than 20k non-null values, so they are only about 25% populated. Columns with more than 70% missing values can be dropped.

In [6]:
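#collecting columns in which more than 70% of the values are missing
lst=[]
for i in df.columns:
    #int() truncates the percentage, so e.g. 70.9% becomes 70 and the column is kept
    if int((df[i].isna().sum()/df.shape[0])*100)>70:
        lst.append(i)
print(len(lst),lst)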

30 ['date_of_last_rech_data_6', 'date_of_last_rech_data_7', 'date_of_last_rech_data_8', 'total_rech_data_6', 'total_rech_data_7', 'total_rech_data_8', 'max_rech_data_6', 'max_rech_data_7', 'max_rech_data_8', 'count_rech_2g_6', 'count_rech_2g_7', 'count_rech_2g_8', 'count_rech_3g_6', 'count_rech_3g_7', 'count_rech_3g_8', 'av_rech_amt_data_6', 'av_rech_amt_data_7', 'av_rech_amt_data_8', 'arpu_3g_6', 'arpu_3g_7', 'arpu_3g_8', 'arpu_2g_6', 'arpu_2g_7', 'arpu_2g_8', 'night_pck_user_6', 'night_pck_user_7', 'night_pck_user_8', 'fb_user_6', 'fb_user_7', 'fb_user_8']


##### The above 30 columns are dropped because they don't have sufficient data

In [7]:
#columns= already selects the column axis, so the misleading axis=0 is removed
df.drop(columns=lst,inplace=True)

In [8]:
#checking new shape
df.shape

(69999, 142)
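The missing-value filter above can also be written without an explicit loop. The following is a minimal sketch on a toy frame, not the notebook's actual data: the frame `toy` and its column names are made up for illustration. It keeps the same rule as the loop (truncate the percentage to an integer, then compare with 70).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (hypothetical columns and values).
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "half_missing": [np.nan, np.nan, 2.0, 3.0],       # 50% missing
    "complete": [1.0, 2.0, 3.0, 4.0],                 # 0% missing
})

# isna() yields booleans; their column-wise mean is the fraction of missing values.
missing_pct = toy.isna().mean() * 100

# Same rule as the loop: truncate to an integer percentage, then require > 70.
sparse_cols = missing_pct.index[missing_pct.astype(int) > 70].tolist()

# Drop the sparse columns and check what is left.
reduced = toy.drop(columns=sparse_cols)
print(sparse_cols, reduced.shape)
```

Looking at `missing_pct` directly before choosing a cutoff also shows how close any column sits to the 70% threshold.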

In [9]:
#dropping rows having all null values
df=df.dropna(axis=0,how='all')

In [10]:
#checking new shape
df.shape

(69999, 142)

##### There is no change in the number of rows after dropping rows with all null values. This means no row in the dataset is entirely null.

##### Checking whether any column has all-unique values or only a single non-null value across all rows, because such columns add little information to the model

In [11]:
# counting unique non-null values per column
k=df.nunique()
# keeping columns that are constant (1 unique value) or unique in every row
j=k[(k==1) | (k==df.shape[0])]
j.index

Index(['id', 'circle_id', 'loc_og_t2o_mou', 'std_og_t2o_mou', 'loc_ic_t2o_mou',
       'last_date_of_month_6', 'last_date_of_month_7', 'last_date_of_month_8',
       'std_og_t2c_mou_6', 'std_og_t2c_mou_7', 'std_og_t2c_mou_8',
       'std_ic_t2o_mou_6', 'std_ic_t2o_mou_7', 'std_ic_t2o_mou_8'],
      dtype='object')

In [12]:
#dropping the columns 
df.drop(j.index,axis=1,inplace=True)

In [13]:
#checking new shape
df.shape

(69999, 128)
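The uniqueness check above can be tried on a small example. This is a sketch under assumptions: the frame `toy` and its values are invented for illustration. Note that `nunique()` ignores nulls by default, which is why a column holding one repeated value plus some missing entries still counts as having a single unique value.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset.
toy = pd.DataFrame({
    "id": [10, 11, 12, 13],                 # unique in every row
    "circle_id": [109, 109, 109, 109],      # constant
    "loc_og_t2o_mou": [0.0, np.nan, 0.0, 0.0],  # constant apart from a null
    "arpu": [31.3, 0.0, 31.3, 156.4],       # varies, but not unique per row
})

# Number of distinct non-null values in each column.
counts = toy.nunique()

# Columns with one distinct value, or as many distinct values as rows.
uninformative = counts[(counts == 1) | (counts == len(toy))].index.tolist()

kept = toy.drop(columns=uninformative)
print(uninformative, list(kept.columns))
```

One caveat with this rule: a continuous feature could, in principle, also be unique in every row, so it is worth reading the flagged names before dropping them.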

In [14]:
df['churn_probability'].value_counts()

0    62867
1     7132
Name: churn_probability, dtype: int64

##### We can see that there are around 62k records for loyal customers and around 7k records for churned customers, so the classes are imbalanced (roughly 10% churn).
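The class balance can be expressed as proportions instead of raw counts. The sketch below rebuilds the counts printed above as a Series; on the real frame the same shares should come from `df['churn_probability'].value_counts(normalize=True)`.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 62867, 1: 7132}, name="churn_probability")

# Share of each class in the data.
share = counts / counts.sum()

churn_rate = share.loc[1]                          # fraction of churned customers
imbalance_ratio = counts.loc[0] / counts.loc[1]    # loyal customers per churned one

print(round(churn_rate, 4), round(imbalance_ratio, 2))
```

With roughly nine loyal customers per churned one, a model that always predicts 0 would already be right about 90% of the time, so accuracy alone is likely a weak yardstick here.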

In [15]:
df.info(verbose=True,show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69999 entries, 0 to 69998
Data columns (total 128 columns):
 #    Column               Non-Null Count  Dtype  
---   ------               --------------  -----  
 0    arpu_6               69999 non-null  float64
 1    arpu_7               69999 non-null  float64
 2    arpu_8               69999 non-null  float64
 3    onnet_mou_6          67231 non-null  float64
 4    onnet_mou_7          67312 non-null  float64
 5    onnet_mou_8          66296 non-null  float64
 6    offnet_mou_6         67231 non-null  float64
 7    offnet_mou_7         67312 non-null  float64
 8    offnet_mou_8         66296 non-null  float64
 9    roam_ic_mou_6        67231 non-null  float64
 10   roam_ic_mou_7        67312 non-null  float64
 11   roam_ic_mou_8        66296 non-null  float64
 12   roam_og_mou_6        67231 non-null  float64
 13   roam_og_mou_7        67312 non-null  float64
 14   roam_og_mou_8        66296 non-null  float64
 15   loc_og_t2t_mou_6 

In [32]:
#checking on object data types
df[['date_of_last_rech_6','date_of_last_rech_7','date_of_last_rech_8']].head()

| | date_of_last_rech_6 | date_of_last_rech_7 | date_of_last_rech_6.1 |
|---|---|---|---|
| 0 | 6/22/2014 | 7/10/2014 | 6/22/2014 |
| 1 | 6/12/2014 | 7/10/2014 | 6/12/2014 |
| 2 | 6/11/2014 | 7/22/2014 | 6/11/2014 |
| 3 | 6/15/2014 | 7/21/2014 | 6/15/2014 |
| 4 | 6/25/2014 | 7/26/2014 | 6/25/2014 |


In [35]:
#converting strings to dates
df['date_of_last_rech_6']=pd.to_datetime(df['date_of_last_rech_6'])
df['date_of_last_rech_7']=pd.to_datetime(df['date_of_last_rech_7'])
df['date_of_last_rech_8']=pd.to_datetime(df['date_of_last_rech_8'])
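Since the sample rows show dates written as month/day/year, the parsing can be made explicit. This is a minimal sketch on a made-up Series, assuming every date follows the `%m/%d/%Y` pattern seen in the sample; `errors="coerce"` turns anything that does not match into `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical raw values: two valid dates, one missing, one malformed.
raw = pd.Series(["6/22/2014", "7/10/2014", None, "not a date"])

# An explicit format avoids per-value format guessing;
# unparseable entries become NaT because of errors="coerce".
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

print(parsed.dtype, parsed.isna().sum())
```

Comparing the null count before and after parsing is a quick way to notice values that were silently coerced.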

In [39]:
df['date_of_last_rech_7'].isna().sum()

1234

##### Filling null values in date columns with mode

In [None]:
#mode() returns a Series, so its first element is taken to fill with a scalar
for col in ['date_of_last_rech_6','date_of_last_rech_7','date_of_last_rech_8']:
    df[col]=df[col].fillna(df[col].mode()[0])
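One detail is worth checking when filling with the mode: `Series.mode()` returns a Series, not a scalar, and `fillna` aligns a Series argument on the index. The sketch below uses an invented Series of dates to show the difference between passing the mode Series and passing its first element.

```python
import pandas as pd

# Hypothetical date column with two missing entries (index 1 and 4).
dates = pd.Series(pd.to_datetime(
    ["2014-06-30", None, "2014-06-30", "2014-06-28", None]
))

mode_series = dates.mode()  # a Series with index [0]; the mode is 2014-06-30

# Passing the Series: values are matched by index label, so only a null
# at index 0 could be filled. Here the nulls at 1 and 4 stay untouched.
aligned_fill = dates.fillna(mode_series)

# Passing the scalar: every null is replaced with the most frequent date.
scalar_fill = dates.fillna(mode_series.iloc[0])

print(aligned_fill.isna().sum(), scalar_fill.isna().sum())
```

If nulls still remain after a mode fill, index alignment of this kind is a likely cause.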

##### Filling all remaining null values with the mean of each column

In [41]:
for i in df.columns[df.isnull().sum()!=0]:
    #assigning the result back instead of calling fillna in place on a column selection
    df[i]=df[i].fillna(df[i].mean())

In [42]:
df.columns[df.isnull().sum()!=0]

Index([], dtype='object')

##### All null values were filled.
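Mean imputation can also be done in a single call, because `DataFrame.fillna` accepts a Series of per-column fill values. This is a sketch on a toy frame with hypothetical values, not a replacement for the loop above.

```python
import numpy as np
import pandas as pd

# Hypothetical usage columns with a few gaps.
toy = pd.DataFrame({
    "onnet_mou": [10.0, np.nan, 30.0],
    "offnet_mou": [np.nan, 4.0, 6.0],
    "aon": [100, 200, 300],
})

# Column means, computed while skipping nulls (the pandas default).
col_means = toy.mean(numeric_only=True)

# Each null is replaced with the mean of its own column.
filled = toy.fillna(col_means)

print(filled)
```

The mean is sensitive to extreme values, so the imputed numbers shift if a column contains large outliers.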

### Exploratory Data Analysis

#### Importing EDA Libraries

In [46]:
import matplotlib.pyplot as plt
import seaborn as sns

#### Univariate Analysis on numeric columns

In [None]:
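#starting with numerical columns: one boxplot per column
for i,col in enumerate(df.select_dtypes(include='number').columns):
    plt.figure(i)
    sns.boxplot(data=df,x=col)
    #showing each figure as it is drawn so figures do not accumulate
    plt.show()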
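With more than a hundred numeric columns, a numeric summary can complement the boxplots. Seaborn's boxplot draws whiskers at 1.5 times the interquartile range by default, and the sketch below counts, per column, how many values fall outside those fences. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical numeric columns; "arpu" contains one extreme value.
toy = pd.DataFrame({
    "arpu": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0],
    "aon": [600, 700, 650, 720, 680, 690],
})

numeric = toy.select_dtypes(include="number")

# Quartiles and interquartile range per column.
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles, i.e. the points a boxplot
# with default whiskers would draw individually.
outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
outlier_counts = outside.sum()

print(outlier_counts.sort_values(ascending=False))
```

Sorting the counts makes it easy to decide which columns deserve a closer look in the plots.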