# Churn prediction
Churn prediction is one of the most popular Big Data use cases in business. It consists of detecting customers who are likely to cancel a subscription to a service.

Although churn prediction originated with large telecom companies, it concerns businesses of all sizes, including startups. Thanks to prediction services and APIs, predictive analytics is no longer exclusive to big players that can afford to hire teams of data scientists.

As an example of how churn prediction can improve a business, consider businesses that sell subscriptions: telecom companies, SaaS companies, and any other company that sells a service for a monthly fee.

There are three possible strategies those businesses can use to generate more revenue: acquire more customers, upsell existing customers, or increase customer retention. All the efforts made as part of one of the strategies have a cost, and what we’re ultimately interested in is the return on investment: the ratio between the extra revenue that results from these efforts and their cost[[**1**](https://neilpatel.com/blog/improve-by-predicting-churn/#:~:text=Churn%20prediction%20is%20one%20of,a%20subscription%20to%20a%20service.&text=This%20can%20be%20telecom%20companies,service%20for%20a%20monthly%20fee.)]

![](https://miro.medium.com/max/844/1*MyKDLRda6yHGR_8kgVvckg.png)

In this study, we try to predict customer churn using a Random Forest classifier and a Naive Bayes classifier.

Predictor variables:
1.    CustomerID                   
1.   MonthlyRevenue             
1.     MonthlyMinutes            
1.     TotalRecurringCharge      
1.     DirectorAssistedCalls      
1.     OverageMinutes             
1.    RoamingCalls              
1.    PercChangeMinutes         
1.     PercChangeRevenues        
1.    DroppedCalls               
1.    BlockedCalls               
1.    UnansweredCalls           
1.    CustomerCareCalls         
1.    ThreewayCalls             
1.    ReceivedCalls              
1.    OutboundCalls              
1.    InboundCalls              
1.    PeakCallsInOut            
1.   OffPeakCallsInOut          
1.   DroppedBlockedCalls        
1.    CallForwardingCalls        
1.    CallWaitingCalls           
1.    MonthsInService           
1.   UniqueSubs               
1.    ActiveSubs                
1.   ServiceArea                
1.   Handsets                  
1.   HandsetModels              
1.    CurrentEquipmentDays      
1.   AgeHH1                     
1.    AgeHH2                    
1.    ChildrenInHH              
1.    HandsetRefurbished         
1.   HandsetWebCapable          
1.    TruckOwner                 
1.   RVOwner                   
1.    Homeownership            
1.    BuysViaMailOrder           
1.    RespondsToMailOffers       
1.    OptOutMailings            
1.   NonUSTravel               
1.    OwnsComputer              
1.    HasCreditCard             
1.   RetentionCalls            
1.    RetentionOffersAccepted    
1.   NewCellphoneUser         
1.    NotNewCellphoneUser       
1.    ReferralsMadeBySubscriber  
1.    IncomeGroup                
1.   OwnsMotorcycle           
1.   AdjustmentsToCreditRating  
1.    HandsetPrice               
1.    MadeCallToRetentionTeam    
1.   CreditRating               
1.    PrizmCode                 
1.    Occupation                
1.   MaritalStatus              

Import libraries

In [None]:
import numpy as np
import pylab as pl
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
from sklearn.utils import shuffle
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import cross_val_score, GridSearchCV
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

Read the dataset

In [None]:
train = pd.read_csv("../input/datasets-for-churn-telecom/cell2celltrain.csv")
test = pd.read_csv("../input/datasets-for-churn-telecom/cell2cellholdout.csv")

In [None]:
train.info()
train[0:10]

In [None]:
#Churn : Yes:1 , No:0
Churn = {'Yes': 1,'No': 0} 
  
# traversing through dataframe 
# values where key matches 
train.Churn = [Churn[item] for item in train.Churn] 
print(train)

# Handling missing data

Some might quibble over our usage of missing. By “missing” we simply mean NA (“not available”) or “not present for whatever reason”. Many data sets simply arrive with missing data, either because it exists and was not collected or it never existed.



In [None]:
print("Any missing sample in training set:",train.isnull().values.any())
print("Any missing sample in test set:",test.isnull().values.any(), "\n")

Here we handle missing values by filling them with zero rather than dropping the rows that contain them. Alternatives to filling with a single number such as zero include imputation or interpolation from the valid values. You could do this in place using the isnull() method as a mask, but because it is such a common operation pandas provides the fillna() method, which returns a copy of the data with the null values replaced. The cells below use replace(np.nan, 0), which has the same effect.
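The options mentioned above can be compared side by side on a small example. This is a minimal sketch on a hypothetical toy frame: the column names mirror the dataset, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "MonthlyRevenue": [50.0, np.nan, 70.0, 30.0],
    "MonthlyMinutes": [200.0, 400.0, np.nan, 100.0],
})

# Option 1: replace every missing value with zero (the approach used in this notebook).
filled_zero = toy.fillna(0)

# Option 2: replace missing values with the median of their own column.
filled_median = toy.fillna(toy.median())

# Option 3: the same zero fill, written with an isnull() mask.
filled_mask = toy.mask(toy.isnull(), 0)

print(filled_zero)
print(filled_median)
```

Filling with zero is simple, but it pulls the column mean toward zero; a median fill keeps the filled values inside the observed range.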

In [None]:
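# Per-column alternative (median fill), kept for reference:
#train['MonthlyRevenue'].fillna((train['MonthlyRevenue'].median()), inplace=True)

# Fill missing values with zero in one column
train['MonthlyRevenue'] = train['MonthlyRevenue'].replace(np.nan, 0)

# Fill missing values with zero in the whole DataFrame.
# After this line no NaN remains, so the per-column fills in the next cells change nothing.
train = train.replace(np.nan, 0)

print(train)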



In [None]:
# for column
#train['MonthlyMinutes'].fillna((train['MonthlyMinutes'].median()), inplace=True)
train['MonthlyMinutes'] = train['MonthlyMinutes'].replace(np.nan, 0)

# for whole dataframe
train = train.replace(np.nan, 0)

# inplace
train.replace(np.nan, 0, inplace=True)

print(train)

In [None]:
# for column
#train['TotalRecurringCharge'].fillna((train['TotalRecurringCharge'].median()), inplace=True)
train['TotalRecurringCharge'] = train['TotalRecurringCharge'].replace(np.nan, 0)

# for whole dataframe
train = train.replace(np.nan, 0)

# inplace
train.replace(np.nan, 0, inplace=True)

print(train)

In [None]:
# for column
#train['DirectorAssistedCalls'].fillna((train['DirectorAssistedCalls'].median()), inplace=True)
train['DirectorAssistedCalls'] = train['DirectorAssistedCalls'].replace(np.nan, 0)

# for whole dataframe
train = train.replace(np.nan, 0)

# inplace
train.replace(np.nan, 0, inplace=True)

print(train)

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
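def FunLabelEncoder(df):
    # Encode every string (object) column as integer codes, in place.
    # Note: the encoder is refit for each column of each DataFrame,
    # so the codes for train and test are not guaranteed to match.
    for c in df.columns:
        if df.dtypes[c] == object:
            le.fit(df[c].astype(str))
            df[c] = le.transform(df[c].astype(str))
    return df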

In [None]:
train = FunLabelEncoder(train)
train.info()
train.iloc[235:300,:]

In [None]:
test = FunLabelEncoder(test)
test.info()
test.iloc[235:300,:]

In [None]:
test = test.drop(columns=['Churn'])
test = test.dropna(how='any')
print(test.shape)

In [None]:
# Frequency distribution of classes
train_outcome = pd.crosstab(index=train["Churn"],  # Make a crosstab
                              columns="count")      # Name the count column

train_outcome

In [None]:
# Distribution of Churn
train.Churn.value_counts()[0:30].plot(kind='bar')
plt.show()

# Plotting Heatmap
A heatmap is a graphical representation of numerical data in which the individual values of a matrix are shown as colors.
The colors can denote the frequency of an event, the performance of various metrics in the data set, and so on. Different color schemes can be chosen depending on the data to be plotted [[2](https://vwo.com/blog/heatmap/)].

In [None]:
train = train[['CustomerID','MonthlyRevenue','MonthlyMinutes','TotalRecurringCharge','DirectorAssistedCalls','OverageMinutes',
         'RoamingCalls','PercChangeMinutes','PercChangeRevenues','DroppedCalls','BlockedCalls','UnansweredCalls','CustomerCareCalls',
         'ThreewayCalls','ReceivedCalls','OutboundCalls','InboundCalls','PeakCallsInOut','OffPeakCallsInOut','DroppedBlockedCalls','CallForwardingCalls'
         ,'CallWaitingCalls','MonthsInService','UniqueSubs','ActiveSubs','ServiceArea','Handsets','HandsetModels',              
'CurrentEquipmentDays','AgeHH1','AgeHH2','ChildrenInHH','HandsetRefurbished','HandsetWebCapable','TruckOwner','RVOwner','Homeownership','BuysViaMailOrder','RespondsToMailOffers','OptOutMailings',          
'NonUSTravel','OwnsComputer','HasCreditCard','RetentionCalls','RetentionOffersAccepted','NewCellphoneUser',          
'NotNewCellphoneUser','ReferralsMadeBySubscriber','IncomeGroup','OwnsMotorcycle','AdjustmentsToCreditRating', 
'HandsetPrice','MadeCallToRetentionTeam','CreditRating','PrizmCode','Occupation','MaritalStatus','Churn']] #Subsetting the data
cor = train.corr() #Calculate the correlation of the above variables
sns.heatmap(cor, square = True) #Plot the correlation as heat map

As shown above, the heatmap displays the pairwise correlation among the variables. The color bar at the side maps each shade to a correlation value, with lighter shades representing higher correlation. The following variables appear important for customer churn behavior:
1. TotalRecurringCharge
1. RoamingCalls
1. DroppedCalls
1. CustomerCareCalls
1. OutboundCalls
1. OffPeakCallsInOut
1. CallWaitingCalls
1. ActiveSubs
1. HandsetModels
1. AgeHH2
1. HandsetWebCapable
1. Homeownership
1. OptOutMailings
1. HasCreditCard
1. NewCellphoneUser
1. IncomeGroup
1. HandsetPrice
1. PrizmCode
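
Reading importance off a heatmap by eye is subjective. One way to make it more reproducible is to rank the features by the absolute value of their correlation with `Churn`. The sketch below does this on synthetic stand-in data (the column names come from the dataset, the values do not), so it illustrates the technique rather than reproducing the list above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data, generated only to illustrate the ranking step.
churn = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "DroppedCalls": churn * 2.0 + rng.normal(size=n),  # built to correlate with Churn
    "HandsetPrice": rng.normal(size=n),                # built to be unrelated to Churn
    "Churn": churn,
})

# Absolute correlation of each feature with the target, strongest first.
ranking = (
    toy.corr()["Churn"]
    .drop("Churn")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

On the real data, the same expression applied to `train` would give a ranking that can be compared with the list above. Note that correlation only captures linear relationships.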

# Splitting data

Data for training and testing
We set aside part of the data so that we can check that the trained classification algorithm generalizes well to new data. This study holds out 30% of the samples for testing and trains on the remaining 70%, a ratio we assume to be reasonable.
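
To make the idea concrete, here is a minimal sketch of what a shuffled hold-out split does. The helper `holdout_split` is hypothetical and written only for illustration; the notebook itself uses scikit-learn's `train_test_split`, whose rounding and shuffling details may differ.

```python
import numpy as np

def holdout_split(n_rows, test_size=0.3, seed=9):
    """Return shuffled (train_idx, test_idx) index arrays for a hold-out split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)            # random order of the row indices
    n_test = int(round(n_rows * test_size))    # number of rows held out for testing
    return order[n_test:], order[:n_test]

train_idx, test_idx = holdout_split(10)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two sets, so the test rows are never seen during training.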

In [None]:
from sklearn.model_selection import train_test_split
Y = train['Churn']
X = train.drop(columns=['Churn'])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=9)

In [None]:
print('X train shape: ', X_train.shape)
print('Y train shape: ', Y_train.shape)
print('X test shape: ', X_test.shape)
print('Y test shape: ', Y_test.shape)

## 1. Random forest classification

Random forest is a supervised learning algorithm that builds a forest of randomized decision trees, most of the time trained with the bagging method. The essential idea of bagging is to average many noisy but approximately unbiased models and thereby reduce the variance. Each tree is constructed using the following algorithm:

* Let $N$ be the number of training cases and $M$ the number of variables in the classifier.
* Let $m$ be the number of input variables used to determine the decision at a given node; $m<M$.
* Choose a training set for this tree by sampling $N$ times with replacement from the $N$ training cases, and use the cases left out of the sample to estimate the error.
* For each node of the tree, randomly choose $m$ variables on which to base the decision. Calculate the best split of the training set on these $m$ variables.

To predict, a new case is pushed down the tree and assigned the label of the terminal node where it ends up. This is repeated over all the trees in the ensemble, and the label that gets the most votes is reported as the prediction. We set the number of trees in the forest to 100.
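
The steps above can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not scikit-learn's implementation: each tree is reduced to a single split (a stump), the data are synthetic, and the helper names `majority`, `fit_stump`, `fit_forest` and `predict_forest` are hypothetical.

```python
import numpy as np

def majority(labels):
    """Most frequent label among non-negative integer labels (0 if labels is empty)."""
    return int(np.bincount(labels, minlength=2).argmax())

def fit_stump(X, y, feature_ids):
    """Find the single split among feature_ids with the fewest training errors."""
    best = None
    for j in feature_ids:
        for t in np.unique(X[:, j]):
            goes_left = X[:, j] <= t
            left_label = majority(y[goes_left])
            right_label = majority(y[~goes_left])
            errors = np.sum(np.where(goes_left, left_label, right_label) != y)
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    return best[1:]

def fit_forest(X, y, n_trees, m, seed=9):
    """Train n_trees stumps, each on a bootstrap sample and m random variables."""
    rng = np.random.default_rng(seed)
    n_rows, n_features = X.shape
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n_rows, size=n_rows)             # N draws with replacement
        feats = rng.choice(n_features, size=m, replace=False)   # m < M candidate variables
        forest.append(fit_stump(X[rows], y[rows], feats))
    return forest

def predict_forest(forest, X):
    """Collect one vote per tree and return the majority label for each row."""
    votes = np.array([np.where(X[:, j] <= t, left, right)
                      for j, t, left, right in forest])
    return np.array([majority(column) for column in votes.T])

# Synthetic data: the label depends on feature 0 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

forest = fit_forest(X, y, n_trees=101, m=2)
accuracy = np.mean(predict_forest(forest, X) == y)
print(accuracy)
```

Trees that never see the informative variable vote close to randomly, but the majority vote across the ensemble washes most of that noise out, which is the variance reduction that bagging aims for.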

In [None]:
from sklearn.ensemble import RandomForestClassifier

# We define the model
rfcla = RandomForestClassifier(n_estimators=100,random_state=9,n_jobs=-1)

# We train model
rfcla.fit(X_train, Y_train)

# We predict target values
Y_predict5 = rfcla.predict(X_test)

In [None]:
# The confusion matrix
rfcla_cm = confusion_matrix(Y_test, Y_predict5)
f, ax = plt.subplots(figsize=(5,5))
sns.heatmap(rfcla_cm, annot=True, linewidth=0.7, linecolor='black', fmt='g', ax=ax, cmap="BuPu")
plt.title('Random Forest Classification Confusion Matrix')
plt.xlabel('Y predict')
plt.ylabel('Y test')
plt.show()

In [None]:
# Test score
score_rfcla = rfcla.score(X_test, Y_test)
print(score_rfcla)

## 2. Naive bayes classification

The naive Bayes classifier is a probabilistic classifier based on Bayes' theorem with strong independence assumptions between the features. Using Bayes' theorem $\left(P(X|Y)=\frac{P(Y|X)P(X)}{P(Y)}\right)$, we can find the probability of $X$ happening, given that $Y$ has occurred. Here, $Y$ is the evidence and $X$ is the hypothesis. The assumption made is that the features are independent of one another given the class: the value of one feature does not affect another. Hence it is called naive. In this case we assume the values of each feature are sampled from a Gaussian distribution and therefore we use Gaussian Naive Bayes.
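
A short worked example of the formula may help. The probabilities below are hypothetical numbers chosen only to illustrate the arithmetic, with $X$ = "the customer churns" and $Y$ = "the customer called customer care".

```python
# Hypothetical probabilities, chosen only to illustrate the arithmetic.
p_churn = 0.30               # P(X): prior probability that a customer churns
p_calls_given_churn = 0.60   # P(Y|X): share of churners who called customer care
p_calls_given_stay = 0.20    # P(Y|not X): share of non-churners who called

# P(Y) by the law of total probability.
p_calls = p_calls_given_churn * p_churn + p_calls_given_stay * (1 - p_churn)

# Bayes' theorem: P(X|Y) = P(Y|X) P(X) / P(Y).
p_churn_given_calls = p_calls_given_churn * p_churn / p_calls
print(round(p_churn_given_calls, 4))  # 0.5625
```

Under these assumed numbers, observing a customer care call raises the churn probability from 0.30 to about 0.56.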

In [None]:
from sklearn.naive_bayes import GaussianNB

# We define the model
nbcla = GaussianNB()

# We train model
nbcla.fit(X_train, Y_train)

# We predict target values
Y_predict3 = nbcla.predict(X_test)

In [None]:
# The confusion matrix
nbcla_cm = confusion_matrix(Y_test, Y_predict3)
f, ax = plt.subplots(figsize=(5,5))
sns.heatmap(nbcla_cm, annot=True, linewidth=0.7, linecolor='black', fmt='g', ax=ax, cmap="BuPu")
plt.title('Naive Bayes Classification Confusion Matrix')
plt.xlabel('Y predict')
plt.ylabel('Y test')
plt.show()

In [None]:
# Test score
score_nbcla = nbcla.score(X_test, Y_test)
print(score_nbcla)

# Comparison of classification techniques

# Test score


In [None]:
Testscores = pd.Series([score_rfcla,score_nbcla, ], 
                        index=['Random Forest Score','Naive Bayes Score' ]) 
print(Testscores)

# ROC Curve
A ROC (receiver operating characteristic) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied.
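
To show what "varying the threshold" means, the following sketch computes the points of a ROC curve by hand on four made-up predictions. The helper `roc_points` is hypothetical and is not how scikit-learn's `roc_curve` is implemented internally; it only illustrates the definition.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays, one point per threshold, from strictest to loosest."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    thresholds = np.sort(np.unique(scores))[::-1]
    positives = np.sum(y_true == 1)
    negatives = np.sum(y_true == 0)
    fpr, tpr = [0.0], [0.0]  # threshold above every score: nothing is flagged
    for t in thresholds:
        flagged = scores >= t  # cases predicted positive at this threshold
        tpr.append(np.sum(flagged & (y_true == 1)) / positives)
        fpr.append(np.sum(flagged & (y_true == 0)) / negatives)
    return np.array(fpr), np.array(tpr)

# Made-up labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, scores)

# Area under the curve by the trapezoidal rule.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(fpr, tpr, auc)
```

A classifier that ranks every positive above every negative reaches the top-left corner and an area of 1, while the dashed diagonal in the plots below corresponds to random guessing with an area of 0.5.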

In [None]:
from sklearn.metrics import roc_curve
# Random Forest Classification
Y_predict5_proba = rfcla.predict_proba(X_test)
Y_predict5_proba = Y_predict5_proba[:, 1]
fpr, tpr, thresholds = roc_curve(Y_test, Y_predict5_proba)
plt.subplot(331)
plt.plot([0,1],[0,1],'k--')
plt.plot(fpr,tpr, label='Random Forest')
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.title('ROC Curve Random Forest')
plt.grid(True)
plt.subplots_adjust(top=2, bottom=0.08, left=0.10, right=1.4, hspace=0.45, wspace=0.45)
plt.show()

# Naive Bayes Classification
Y_predict3_proba = nbcla.predict_proba(X_test)
Y_predict3_proba = Y_predict3_proba[:, 1]
fpr, tpr, thresholds = roc_curve(Y_test, Y_predict3_proba)
plt.subplot(332)
plt.plot([0,1],[0,1],'k--')
plt.plot(fpr,tpr, label='Naive Bayes')
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.title('ROC Curve Naive Bayes')
plt.grid(True)
plt.subplots_adjust(top=2, bottom=0.08, left=0.10, right=1.4, hspace=0.45, wspace=0.45)
plt.show()

# Conclusion

In this study, Random Forest performs better than Naive Bayes. Random Forest can handle categorical features well, and it copes with high-dimensional spaces as well as a large number of training examples. Naive Bayes, with its independence assumption, is probably not expressive enough to represent complex churn behavior.