# TELCO CHURN PREDICTION EDA

### Problem Statement:

Churn measures how many customers have left your brand by cancelling their subscription or ceasing to pay for your services. This is bad news for any business, as it is commonly estimated to cost five times as much to attract a new customer as to keep an existing one. A high customer churn rate will hit your company's finances hard. With machine learning (ML), you can anticipate which customers are likely to abandon your services.

#### Dataset credits: https://www.kaggle.com/blastchar/telco-customer-churn

Each row represents a customer; each column contains one customer attribute, as described in the column metadata.

The data set includes information about:

- Customers who left within the last month: the column is called Churn
- Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information: how long they've been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers: gender, age range, and whether they have partners and dependents

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split,KFold,cross_val_score,StratifiedShuffleSplit
from sklearn.metrics import classification_report,confusion_matrix
# Note: newer releases of this package are published as ydata-profiling
from pandas_profiling import ProfileReport
from sklearn.cluster import KMeans
from datetime import datetime, timedelta,date

In [None]:
!pip install chart_studio

In [None]:
import chart_studio.plotly as py
import plotly.offline as pyoff
import plotly.graph_objs as go

### Loading data

In [None]:
df_data = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df_data.head()

### EDA

In [None]:
profile = ProfileReport(df_data,title='Profile Report')
profile.to_widgets()

### Data Observation:

####  1) The dataset contains 7043 rows and 21 columns, with no values reported as missing (NaN).
####  2) The data contains 13 categorical, 6 Boolean and 2 numerical features.
####  3) The target variable Churn shows a retention rate of 73.4% and a churn rate of 26.6%.
####  4) The TotalCharges column contains whitespace-only entries; these rows need to be dropped and the column converted to float.
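
These observations can also be checked without the profiling widget. The sketch below uses a tiny made-up frame (`toy` is hypothetical, not the real dataset) to show why a NaN count of zero and the presence of blank TotalCharges entries are not contradictory: `isna()` does not flag whitespace-only strings, so they have to be counted explicitly.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Telco data (values are made up).
toy = pd.DataFrame({
    "tenure": [1, 0, 24, 60],
    "TotalCharges": ["29.85", " ", "1500.5", "6000"],
    "Churn": ["Yes", "No", "No", "No"],
})

# NaN-based check: a blank string is not counted as missing.
nan_count = int(toy["TotalCharges"].isna().sum())

# Whitespace-only entries have to be detected explicitly.
blank_count = int(toy["TotalCharges"].str.strip().eq("").sum())

# Class balance of the target, as fractions of all rows.
churn_share = toy["Churn"].value_counts(normalize=True)

print(toy.shape, nan_count, blank_count)
print(churn_share)
```

On the real data the same three checks would be run on `df_data`.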

### Label Encoding

In [None]:
def label_encoder(data):
    le = preprocessing.LabelEncoder()
    data = le.fit_transform(data)
    return data
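
`LabelEncoder` assigns integer codes to the sorted class labels, so for a Yes/No target "No" becomes 0 and "Yes" becomes 1, and the mean of the encoded column is the churn rate. A minimal sketch of that mapping using only NumPy, on made-up sample values (it mirrors the encoder's scheme rather than calling it):

```python
import numpy as np

# Made-up sample of the target column.
churn = np.array(["No", "Yes", "No", "No", "Yes"])

# np.unique returns the sorted classes; return_inverse=True also returns
# the integer code of each element relative to those sorted classes.
classes, codes = np.unique(churn, return_inverse=True)

# Mean of the 0/1 codes is the share of "Yes" rows.
churn_rate = codes.mean()

print(classes)
print(codes)
print(churn_rate)
```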

### RGB Bar Plots

In [None]:
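pyoff.init_notebook_mode()

def ploting(data, col):
    """Bar plot of the mean churn rate for each value of `col`."""
    # Churn must already be encoded as 0/1 so that its mean is a rate
    df_plot = data.groupby(col).Churn.mean().reset_index()
    n_bars = len(df_plot)
    colors = ['green', 'blue', 'red', 'yellow']
    plot_data = [
        go.Bar(
            x=df_plot[col],
            y=df_plot['Churn'],
            # one width and one colour per bar; colours repeat past four bars
            width=[0.2] * n_bars,
            marker=dict(color=[colors[i % len(colors)] for i in range(n_bars)])
        )
    ]

    plot_layout = go.Layout(
        xaxis={"type": "category"},
        yaxis={"title": "Churn Rate"},
        title=col,
        plot_bgcolor='rgb(243,243,243)',
        paper_bgcolor='rgb(243,243,243)',
    )
    fig = go.Figure(data=plot_data, layout=plot_layout)
    pyoff.iplot(fig)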

In [None]:
category_cols = ['gender', 'Partner', 'Dependents',
                 'PhoneService', 'MultipleLines', 'InternetService',
                 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
                 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
                 'PaymentMethod']
df_data['Churn'] = label_encoder(df_data['Churn'])
df_data.head()

## Categorical features v/s Churn Rate

### Categorical variables against Churn Rate: Plot Observations

| Category | Churn |
| :- | :- |
| Gender | Female customers have a higher churn rate than male customers |
| Partner | Customers without partners have a churn rate 13 percentage points higher than customers with partners |
| Dependents | Customers without dependents have a churn rate 15 percentage points higher than customers with dependents |
| PhoneService | Customers with phone service have a higher churn rate |
| MultipleLines | Customers with multiple lines have a higher churn rate |
| Internet Service | 42% of customers with fiber optic internet service churn |
| Online Security | 42% of customers without online security churn |
| Online Backup | 40% of customers without online backup churn |
| Device Protection | 40% of customers without device protection churn |
| Tech Support | 42% of customers without tech support churn |
| Streaming TV, Streaming Movies | Customers without these services have a higher churn rate |
| Contract | Customers on month-to-month contracts have a higher churn rate; as expected, it is lower for one-year and two-year contracts |
| Paperless Billing | 34% of customers with paperless billing churn |
| Payment Method | 45% of customers paying by electronic check churn |

In [None]:
for col in category_cols:
    ploting(df_data,col)
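
The bar heights in these plots are group means of the 0/1 Churn column. As a rough illustration on made-up rows (the `toy` frame and its values are hypothetical), the per-category rate and the gap between two categories in percentage points can be computed like this:

```python
import pandas as pd

# Hypothetical toy data; Churn is already encoded as 0/1.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "One year"],
    "Churn": [1, 0, 0, 0, 1, 1],
})

# Mean of a 0/1 column per group = share of churners in that group.
rates = toy.groupby("Contract")["Churn"].mean()

# Difference between two groups, expressed in percentage points.
gap_pp = (rates["Month-to-month"] - rates["Two year"]) * 100

print(rates)
print(round(gap_pp, 1))
```

Reading the values from `rates` instead of from the chart avoids estimating percentages by eye.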

### Tenure v/s Churn Rate Plot
##### The longer the tenure, the lower the churn rate, as the tenure scatter plot below shows.
##### After quantile-based binning of tenure (plotted below), churn is high among customers with a tenure between 1 and 9: 50% of them churn.

### Quantile Based Binning

In [None]:
quantile_list = [0, .25, .5, .75, 1.]
quantiles = df_data['tenure'].quantile(quantile_list)
quantiles

In [None]:
quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
df_data['tenure_quantiles'] = pd.qcut(df_data['tenure'],q=quantile_list, labels=quantile_labels)
df_data.head()

In [None]:
ploting(df_data,'tenure_quantiles')
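
To report which value range a quantile label such as `0-25Q` covers, `pd.qcut` can also return the bin edges. A small sketch on made-up tenure values (the numbers are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up tenure values in months.
tenure = pd.Series([1, 2, 5, 9, 12, 20, 29, 40, 55, 60, 70, 72])

labels = ["0-25Q", "25-50Q", "50-75Q", "75-100Q"]

# retbins=True additionally returns the numeric edges behind the labels.
binned, edges = pd.qcut(tenure, q=[0, .25, .5, .75, 1.],
                        labels=labels, retbins=True)

# Quantile bins hold (roughly) the same number of rows each.
counts = binned.value_counts(sort=False)

print(counts)
print(edges)
```

The first and last entries of `edges` are the minimum and maximum of the column, so the range of the lowest bin can be read off as `edges[0]` to `edges[1]`.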

In [None]:
df_plot = df_data.groupby('tenure').Churn.mean().reset_index()


plot_data = [
    go.Scatter(
        x=df_plot['tenure'],
        y=df_plot['Churn'],
        mode='markers',
        name='Low',
        marker= dict(size= 8,
            line= dict(width=2),
            color= 'green',
            opacity= 0.9
           ),
    )
]

plot_layout = go.Layout(
        yaxis= {'title': "Churn Rate"},
        xaxis= {'title': "Tenure"},
        title='Tenure based Churn rate',
        plot_bgcolor  = "rgb(243,243,243)",
        paper_bgcolor  = "rgb(243,243,243)",
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

## Monthly Charges v/s Churn Rate
#### After quantile-based binning of monthly charges (plotted below), churn is high among customers whose monthly charges are between Rs.70 and Rs.90: 37% of them churn.

In [None]:
quantile_list = [0, .25, .5, .75, 1.]
quantiles = df_data['MonthlyCharges'].quantile(quantile_list)
quantiles

In [None]:
quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
df_data['MonthlyCharges_quantiles'] = pd.qcut(df_data['MonthlyCharges'],q=quantile_list, labels=quantile_labels)
df_data.head()

In [None]:
ploting(df_data,'MonthlyCharges_quantiles')

In [None]:
df_plot = df_data.copy()
df_plot['MonthlyCharges'] = df_plot['MonthlyCharges'].astype(int)
df_plot = df_plot.groupby('MonthlyCharges').Churn.mean().reset_index()


plot_data = [
    go.Scatter(
        x=df_plot['MonthlyCharges'],
        y=df_plot['Churn'],
        mode='markers',
        name='Low',
        marker= dict(size= 8,
            line= dict(width=2),
            color= 'red',
            opacity= 0.8
           ),
    )
]

plot_layout = go.Layout(
        yaxis= {'title': "Churn Rate"},
        xaxis= {'title': "Monthly Charges"},
        title='Monthly Charge vs Churn rate',
        plot_bgcolor  = "rgb(243,243,243)",
        paper_bgcolor  = "rgb(243,243,243)",
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)

## TotalCharges v/s Churn Rate
#### The TotalCharges column contains whitespace-only entries; these rows need to be dropped and the column converted to float.
#### After quantile-based binning of total charges (plotted below), churn is high among customers whose total charges are between Rs.18 and Rs.401: 43% of them churn.

In [None]:
df_data = df_data.replace(r'^\s*$', np.nan, regex=True)

In [None]:
df_data.isna().sum()

In [None]:
df_data = df_data.dropna()

In [None]:
df_data['TotalCharges'] = df_data['TotalCharges'].astype('float64')
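
An alternative to the regex replace across the whole frame is to convert just the one column with `pd.to_numeric`, which can turn unparseable entries into NaN in a single step. This is a sketch of that alternative on a made-up frame (`toy` is hypothetical), not the approach used in the cells above:

```python
import pandas as pd

# Hypothetical frame with one whitespace-only TotalCharges entry.
toy = pd.DataFrame({
    "customerID": ["a", "b", "c"],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# errors="coerce" turns anything unparseable (here the blank) into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Drop only the rows where this column could not be parsed.
clean = toy.dropna(subset=["TotalCharges"])

print(clean["TotalCharges"].dtype, len(clean))
```

Restricting `dropna` to `subset=["TotalCharges"]` keeps rows that might have gaps in other, unrelated columns.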

### Quantile Based Binning

In [None]:
quantile_list = [0, .25, .5, .75, 1.]
quantiles = df_data['TotalCharges'].quantile(quantile_list)
quantiles

In [None]:
quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
df_data['TotalCharges_quantiles'] = pd.qcut(df_data['TotalCharges'],q=quantile_list, labels=quantile_labels)
df_data.head()

In [None]:
ploting(df_data,'TotalCharges_quantiles')

## EDA Conclusion:

####  1) The dataset contains 7043 rows and 21 columns, with no values reported as missing (NaN).
####  2) The data contains 13 categorical, 6 Boolean and 2 numerical features.
####  3) The target variable Churn shows a retention rate of 73.4% and a churn rate of 26.6%.
####  4) The TotalCharges column contains whitespace-only entries; these rows need to be dropped and the column converted to float.
####  5) 43% of customers whose total charges are between Rs.18 and Rs.401 churn.
####  6) 37% of customers whose monthly charges are between Rs.70 and Rs.90 churn.
####  7) 50% of customers with a tenure between 1 and 9 churn.
####  8) Categorical features v/s churn rate:
| Category | Churn |
| :- | :- |
| Gender | Female customers have a higher churn rate than male customers |
| Partner | Customers without partners have a churn rate 13 percentage points higher than customers with partners |
| Dependents | Customers without dependents have a churn rate 15 percentage points higher than customers with dependents |
| PhoneService | Customers with phone service have a higher churn rate |
| MultipleLines | Customers with multiple lines have a higher churn rate |
| Internet Service | 42% of customers with fiber optic internet service churn |
| Online Security | 42% of customers without online security churn |
| Online Backup | 40% of customers without online backup churn |
| Device Protection | 40% of customers without device protection churn |
| Tech Support | 42% of customers without tech support churn |
| Streaming TV, Streaming Movies | Customers without these services have a higher churn rate |
| Contract | Customers on month-to-month contracts have a higher churn rate; as expected, it is lower for one-year and two-year contracts |
| Paperless Billing | 34% of customers with paperless billing churn |
| Payment Method | 45% of customers paying by electronic check churn |