# Customer Churn Analysis and Prediction

The goal of this project is to learn from a service provider's historical customer data using machine learning, and to build a model that predicts which customers are likely to leave the service in the future based on their behaviour and profile.

* Churn rate is a critical metric of customer satisfaction. Low churn rates mean happy customers; high churn rates mean customers are leaving you. Even a small monthly or quarterly churn rate compounds over time: 1% monthly churn translates to roughly 11.4% yearly churn.
* According to Forbes, it takes a lot more money (up to five times more) to get new customers than to keep the ones you already have. Churn tells you how many existing customers are leaving your business, so lowering churn has a big positive impact on your revenue streams.
* Churn is a good indicator of growth potential. Churn rates track lost customers, and growth rates track new customers. Comparing the two shows whether your business is growing over time: if growth is higher than churn, the business is growing; if churn is higher than growth, it is shrinking.
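
The compounding effect mentioned in the first bullet can be checked with a few lines of arithmetic. The sketch below assumes a constant monthly churn rate that applies each month to the customers who are still left; the function name `annual_churn` is illustrative and not part of the project code.

```python
def annual_churn(monthly_rate: float, months: int = 12) -> float:
    """Fraction of the starting customers lost after `months` months,
    assuming the same churn rate applies each month to whoever is left."""
    retained = (1 - monthly_rate) ** months  # share of customers still active
    return 1 - retained


print(f"{annual_churn(0.01):.2%}")  # 1% monthly churn -> 11.36% yearly
print(f"{annual_churn(0.05):.2%}")  # a higher monthly rate compounds much faster
```

Because the rate applies to a shrinking base, the yearly figure is slightly below the simple sum of twelve monthly rates.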

You can classify churn as:

1. Customer and revenue churn
2. Voluntary and involuntary churn

**Customer and revenue churn:** Customer churn is simply the rate at which customers cancel their subscriptions. Also known as subscriber churn or logo churn, it is expressed as a percentage. Revenue churn, on the other hand, is the loss in your monthly recurring revenue (MRR) over a period, measured relative to the MRR at the beginning of that period. Customer churn and revenue churn aren’t always the same: you might have no customer churn but still have revenue churn if customers are downgrading subscriptions. Negative churn is an ideal situation that only applies to revenue churn: the new revenue from your existing customers (through cross-sells, upsells, and new signups) is more than the revenue you lose from cancellations and downgrades.

**churn rate** = (Number of customers lost during a time frame **/** Number of customers at beginning of the time frame) * **100**

**revenue churn rate** = (MRR at the beginning of month - MRR at the end of month - any upgrades) **/** MRR at the beginning of month
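
As a minimal sketch, the two formulas above can be written as small Python functions. The function and parameter names are illustrative, the numbers are made up, and the revenue function mirrors the formula exactly as written (including how it subtracts upgrades), so treat it as an illustration rather than a standard definition.

```python
def customer_churn_rate(customers_lost: int, customers_at_start: int) -> float:
    """Customer churn rate as a percentage of the customers at the start."""
    return customers_lost / customers_at_start * 100


def revenue_churn_rate(mrr_start: float, mrr_end: float, upgrades: float = 0.0) -> float:
    """Revenue churn as a fraction of starting MRR, following the formula above."""
    return (mrr_start - mrr_end - upgrades) / mrr_start


# Hypothetical month: 50 of 1,000 customers cancel
print(customer_churn_rate(50, 1000))

# Hypothetical month: MRR drops from 10,000 to 9,800 while upgrades bring in 500.
# The result is negative, which is the "negative churn" case described earlier.
print(revenue_churn_rate(10_000, 9_800, upgrades=500))
```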

**Voluntary and involuntary churn:** Voluntary churn is when the customer decides to cancel and takes the necessary steps to exit the service. It can be caused by dissatisfaction or by not receiving the value they expected. Involuntary churn happens due to situations such as expired payment details, server errors, insufficient funds, and other unpredictable circumstances.

Customer satisfaction, happiness, and loyalty can be achieved to a certain degree, but churn will always be a part of the business. Churn can happen because of:

* Bad customer service (poor service quality, response rate, or overall customer experience),
* Finance issues (fees and rates),
* Customer needs change,
* Dissatisfaction (your service failed to meet expectations),
* Customers don’t see the value, 
* Customers switch to competitors,
* Long-time customers don’t feel appreciated.

A 0% churn rate is practically impossible. The goal is to keep the churn rate as low as possible at all times.

The flow of this project is as below:

* Reading the data in Python
* Defining the problem statement
* Identifying the Target variable
* Looking at the distribution of Target variable
* Basic Data exploration
* Rejecting useless columns
* Visual Exploratory Data Analysis for data distribution (histograms and bar charts)
* Feature Selection based on data distribution
* Outlier treatment
* Missing Values treatment
* Visual correlation analysis
* Statistical correlation analysis (Feature Selection)
* Converting data to numeric for ML
* Sampling and K-fold cross validation
* Trying multiple classification algorithms
* Selecting the best Model
* Deploying the best model in production

### Reading the data in Python

**Data Description:**

* **customerID:** Customer ID unique for each customer
* **gender:** Whether the customer is a male or a female
* **SeniorCitizen:** Whether the customer is a senior citizen or not (1, 0)
* **Partner:** Whether the customer has a partner or not (Yes, No)
* **Dependents:** Whether the customer has dependents or not (Yes, No)
* **PhoneService:** Whether the customer has a phone service or not (Yes, No)
* **MultipleLines:** Whether the customer has multiple lines or not (Yes, No, No phone service)
* **InternetService:** Customer’s internet service provider (DSL, Fiber optic, No)
* **OnlineSecurity:** Whether the customer has online security or not (Yes, No, No internet service)
* **OnlineBackup:** Whether the customer has an online backup or not (Yes, No, No internet service)
* **DeviceProtection:** Whether the customer has device protection or not (Yes, No, No internet service)
* **TechSupport:** Whether the customer has tech support or not (Yes, No, No internet service)
* **StreamingTV:** Whether the customer has streaming TV or not (Yes, No, No internet service)
* **StreamingMovies:** Whether the customer has streaming movies or not (Yes, No, No internet service)
* **Contract:** The contract term of the customer (Month-to-month, One year, Two years)
* **PaperlessBilling:** Whether the customer uses paperless billing or not (Yes, No)
* **PaymentMethod:** The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
* **tenure:** Number of months the customer has stayed with the company
* **MonthlyCharges:** The amount charged to the customer monthly
* **TotalCharges:** The total amount charged to the customer
* **Churn:** Whether the customer churned or not (Yes or No)

In [2]:
# suppressing the warnings
import warnings
warnings.filterwarnings('ignore')

In [7]:
# Reading the dataset
import pandas as pd
import numpy as np

import plotly.express as px #for visualization
import matplotlib.pyplot as plt #for visualization 

#displaying all columns
pd.set_option('display.max_columns', None)

churn_data=pd.read_csv("churn.csv")
print('Shape before deleting duplicate values:', churn_data.shape)

# Removing duplicate rows if any
churn_data=churn_data.drop_duplicates()
print('Shape After deleting duplicate values:', churn_data.shape)

# Printing sample data
# Start observing the Quantitative/Categorical/Qualitative variables
churn_data.head(10)

Shape before deleting duplicate values: (7043, 21)
Shape After deleting duplicate values: (7043, 21)


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes
5,9305-CDSKC,Female,0,No,No,8,Yes,Yes,Fiber optic,No,No,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,Yes
6,1452-KIOVK,Male,0,No,Yes,22,Yes,Yes,Fiber optic,No,Yes,No,No,Yes,No,Month-to-month,Yes,Credit card (automatic),89.1,1949.4,No
7,6713-OKOMC,Female,0,No,No,10,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,No,Mailed check,29.75,301.9,No
8,7892-POOKP,Female,0,Yes,No,28,Yes,Yes,Fiber optic,No,No,Yes,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes
9,6388-TABGU,Male,0,No,Yes,62,Yes,No,DSL,Yes,Yes,No,No,No,No,One year,No,Bank transfer (automatic),56.15,3487.95,No


### Defining the problem statement:

**Analyse the churn rate and create a model to predict whether a customer is going to churn.**

**Target Variable:** Churn

**Predictors:** gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService,\
    MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection,\
    TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod,\
    MonthlyCharges, TotalCharges
    
* Churn = **Yes**: the customer is leaving
* Churn = **No**: the customer is not leaving

### Determining the type of Machine Learning

Because the target variable is categorical (Yes/No), this is a supervised machine learning problem, and we need to build a classification model.

### Looking at the distribution of Target variable

* If the target variable's distribution is too skewed, predictive modeling becomes difficult, because a model can score well simply by predicting the majority class most of the time.

In [16]:
# Count customers per Churn class; naming the columns explicitly keeps this
# independent of how different pandas versions label value_counts() output
target_variable = churn_data["Churn"].value_counts().rename_axis("Category").reset_index(name="Count")
fig = px.pie(target_variable, values='Count', names='Category', color_discrete_sequence=["green", "red"],
             title='Distribution of Churn', width=600, height=400)
fig.show()

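The target variable is imbalanced: 5,174 of the 7,043 customers (about 73%) did not churn. This imbalance has to be accounted for.

**Approaches to handle an imbalanced dataset:**

1. Choosing a proper evaluation metric (e.g. F1 score)
2. Resampling (oversampling the minority class or undersampling the majority class)
3. SMOTE (synthetic minority oversampling)

Here, we use the F1 score as the evaluation metric instead of accuracy.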
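
To illustrate why accuracy alone can mislead on this dataset, the sketch below scores a trivial baseline that predicts "No" for every customer. It uses the class counts reported by `describe` later in this notebook (5,174 "No" out of 7,043 rows); the helper `precision_recall_f1` is written here for illustration and is not part of the project code.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall and F1 for the positive (churn) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


n_no, n_yes = 5174, 7043 - 5174  # class counts taken from the dataset summary

# Baseline that predicts "No" for everyone: no churner is ever flagged
tp, fp, fn, tn = 0, 0, n_yes, n_no

accuracy = (tp + tn) / (n_no + n_yes)
_, _, f1 = precision_recall_f1(tp, fp, fn)

print(f"accuracy = {accuracy:.3f}")       # accuracy = 0.735
print(f"F1 (churn class) = {f1:.3f}")     # F1 (churn class) = 0.000
```

The baseline looks respectable on accuracy while being useless for finding churners, which the F1 score for the churn class makes visible.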

### Basic Data Exploration

In [17]:
churn_data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [18]:
churn_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [19]:
churn_data.describe(include='all')

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
count,7043,7043,7043.0,7043,7043,7043.0,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043.0,7043.0,7043
unique,7043,2,,2,2,,2,3,3,3,3,3,3,3,3,3,2,4,,6531.0,2
top,7590-VHVEG,Male,,No,No,,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,,,No
freq,1,3555,,3641,4933,,6361,3390,3096,3498,3088,3095,3473,2810,2785,3875,4171,2365,,11.0,5174
mean,,,0.162147,,,32.371149,,,,,,,,,,,,,64.761692,,
std,,,0.368612,,,24.559481,,,,,,,,,,,,,30.090047,,
min,,,0.0,,,0.0,,,,,,,,,,,,,18.25,,
25%,,,0.0,,,9.0,,,,,,,,,,,,,35.5,,
50%,,,0.0,,,29.0,,,,,,,,,,,,,70.35,,
75%,,,0.0,,,55.0,,,,,,,,,,,,,89.85,,


In [20]:
churn_data.nunique()

customerID          7043
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                73
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1585
TotalCharges        6531
Churn                  2
dtype: int64

#### Observations

* customerID - qualitative. Rejected
* gender - categorical. Selected
* SeniorCitizen - categorical. Selected
* Partner - categorical. Selected
* Dependents - categorical. Selected
* tenure - quantitative. Selected
* PhoneService - categorical. Selected
* MultipleLines - categorical. Selected
* InternetService - categorical. Selected
* OnlineSecurity - categorical. Selected
* OnlineBackup - categorical. Selected
* DeviceProtection - categorical. Selected
* TechSupport - categorical. Selected
* StreamingTV - categorical. Selected
* StreamingMovies - categorical. Selected
* Contract - categorical. Selected
* PaperlessBilling - categorical. Selected
* PaymentMethod - categorical. Selected
* MonthlyCharges - quantitative. Selected
* TotalCharges - quantitative. Selected
* Churn - **Target variable**, categorical. Selected
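
One detail worth noting from the `describe` output: `TotalCharges` reports `unique` and `freq` values but no mean, which suggests pandas read the column as text rather than as numbers. A minimal sketch of one way to convert it is shown below on a small made-up frame; the assumption that the non-numeric entries are blank strings is a guess and should be checked against the real data.

```python
import pandas as pd

# Toy stand-in for churn_data; the blank string mimics a non-numeric entry (assumed)
sample = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " ", "108.15"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
sample["TotalCharges"] = pd.to_numeric(sample["TotalCharges"], errors="coerce")

print(sample["TotalCharges"].dtype)         # numeric dtype after conversion
print(sample["TotalCharges"].isna().sum())  # entries that could not be parsed
```

Any values coerced to NaN would then be handled in the missing-values treatment step listed in the project flow.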

### Removing useless variables from data

In [21]:
UselessColumns = ['customerID']
churn_data = churn_data.drop(UselessColumns,axis=1)
churn_data.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


### Visual Exploratory Data Analysis

* **Categorical variables:** Bar plot
* **Continuous variables:** Histogram

* **Demographic customer information:**
gender, SeniorCitizen, Partner, Dependents

* **Services that each customer has signed up for:**
PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies

* **Customer account information:**
tenure, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges

In [74]:
value_counts_df = churn_data['gender'].value_counts().to_frame().reset_index()
value_counts_df


Unnamed: 0,index,gender
0,Male,3555
1,Female,3488


In [77]:
def bar(feature, df=churn_data):
    """Plot grouped bars of customer counts per category of `feature`, split by Churn."""
    # Count customers for each (feature value, Churn) pair
    temp_df = df.groupby([feature, 'Churn']).size().reset_index(name='Count')

    fig = px.bar(temp_df, x=feature, y='Count', color='Churn', title=f'Churn count by {feature}',
                 barmode="group", color_discrete_sequence=["green", "red"], width=600, height=400)
    fig.show()

In [79]:
bar('gender')
bar('SeniorCitizen')
bar('Partner')
bar('Dependents')
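
The bar charts above plot customer counts, so categories of different sizes are hard to compare directly. As a complement, the sketch below computes the share of churned customers within each category. It runs on a small made-up frame that reuses the dataset's column names; `churn_rate_by` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for churn_data with the same column names (values are made up)
sample = pd.DataFrame({
    "SeniorCitizen": [0, 0, 0, 0, 1, 1, 1, 1],
    "Churn": ["No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes"],
})


def churn_rate_by(feature: str, df: pd.DataFrame) -> pd.Series:
    """Share of customers with Churn == 'Yes' within each category of `feature`."""
    churned = df["Churn"] == "Yes"            # boolean Series, True for churners
    return churned.groupby(df[feature]).mean()  # mean of booleans = proportion


rates = churn_rate_by("SeniorCitizen", sample)
print(rates)
```

On the real data the same call would be `churn_rate_by('SeniorCitizen', churn_data)`, and likewise for the other demographic columns.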