# Title : Telecom Customer Churn

Aim : To predict the behaviour of customers in order to retain them.  
In short, analyze the data and use machine learning techniques to help prevent the loss of clients/customers in the telecom industry.

Definition : (Source : Investopedia)        
The churn rate, also known as the rate of attrition or customer churn, is the rate at which customers stop doing business with an entity. It is most commonly expressed as the percentage of service subscribers who discontinue their subscriptions within a given time period.
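
As a small worked example of the definition above, the churn rate for a period can be computed as the share of subscribers lost during that period. The function name `churn_rate` and the subscriber counts below are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the churn rate for a period as a percentage."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start * 100


# Hypothetical numbers: 1000 subscribers at the start of a month, 27 cancel.
monthly_churn = churn_rate(1000, 27)
print(f"Monthly churn rate: {monthly_churn:.1f}%")
```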

Import Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Read the data
data = pd.read_csv(r"../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [None]:
data.shape

1. The data consists of 7043 records with 21 features.
2. Around 11 records in the "TotalCharges" column contain only a blank space. These are replaced with NaN and then dropped, since they are very few in number.
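
An alternative way to handle the blank values, sketched here on a made-up toy frame rather than the real file, is to let `pd.to_numeric` coerce anything that is not a number to NaN. This converts the column to a float type in the same step. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", " "]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Count the coerced values before dropping the affected rows.
n_missing = int(toy["TotalCharges"].isna().sum())
toy = toy.dropna(subset=["TotalCharges"])
print(n_missing, toy["TotalCharges"].dtype)
```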

In [None]:
data['TotalCharges'] = data["TotalCharges"].replace(" ",np.nan)

In [None]:
data.dropna(inplace = True)

In [None]:
data.isnull().sum()

In [None]:
data.nunique()

'nunique' shows how many unique values each feature has.
Knowing this makes it easier to decide how to plot and analyze each feature.

In [None]:
data.head()

## Exploratory Data Analysis 

In [None]:
# Variables and their unique values for analysis
for item in data.columns:
    print(item," : ", data[item].unique())

From the output above, we can see that features such as 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingMovies' and 'StreamingTV' contain the extra values 'No internet service' or 'No phone service' in addition to 'No'.
Hence these are replaced by 'No' for easier analysis.

In [None]:
# Assign the result back instead of calling replace(inplace=True) on a single column,
# which may not modify the DataFrame in recent pandas versions
data['MultipleLines'] = data['MultipleLines'].replace('No phone service', 'No')

replace_columns = ['OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingMovies','StreamingTV']
data[replace_columns] = data[replace_columns].replace('No internet service', 'No')

In [None]:
data.nunique()

Exploring Individual Features

In [None]:
def plot_feature(feature):
    ax = sns.countplot(x = feature,data = data)
    total = len(data)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}'.format(height/total * 100),
            ha="center")     
    plt.title("Customer Distribution by {}".format(feature)) 
    plt.show()

In [None]:
plot_feature('Churn')
plot_feature('gender')
plot_feature('Partner')
plot_feature('SeniorCitizen')
plot_feature('PhoneService')
plot_feature('MultipleLines')
plot_feature('InternetService')
plot_feature('OnlineSecurity')
plot_feature('Contract')

Observations :
    1. Distribution of Customers by Churn 
        73% - No and 27% - Yes
    2. There are equal numbers of male and female customers (50-50%)
    3. 48% of customers have partners while 52% do not
    4. Senior citizens are a minority (around 16%), while the remaining 84% are not senior citizens
    5. 10% of customers do not have Phone Service
    6. Distribution of Customers by Multiple Lines 
        58% - No and 42% - Yes
    7. Distribution of Customers by Internet Service 
        a. 34% - DSL 
        b. 44% - Fiber Optic Service
        c. 22% - No Internet Service
    8. Distribution of Customers by Online Security 
        71% - No and 29% - Yes
    9. Distribution of Customers by Contract
        a. 55% - Month to Month 
        b. 21% - One Year Contract
        c. 24% - Two Year Contract

Exploring Features with respect to Churn

In [None]:
def plot_bar(d,var1,var2):
    grp = d.groupby(var1)[var2].value_counts()
    grp.unstack().plot(kind = 'bar')
    plt.xlabel(var1)
    plt.ylabel("Count of Churn Customers")
    plt.title("Churn Customer Distribution by {}".format(var1)) 

In [None]:
plot_bar(data,'gender','Churn');
plot_bar(data,'SeniorCitizen','Churn');
plot_bar(data,'Partner','Churn');
plot_bar(data,'Dependents','Churn');
plot_bar(data,'PhoneService','Churn');
plot_bar(data,'MultipleLines','Churn');
plot_bar(data,'InternetService','Churn');
plot_bar(data,'OnlineSecurity','Churn');
plot_bar(data,'OnlineBackup','Churn');
plot_bar(data,'DeviceProtection','Churn');
plot_bar(data,'TechSupport','Churn');
plot_bar(data,'StreamingTV','Churn');
plot_bar(data,'StreamingMovies','Churn');
plot_bar(data,'Contract','Churn');
plot_bar(data,'PaperlessBilling','Churn');
plot_bar(data,'PaymentMethod','Churn');

Observations :
    1. Churn rate is almost the same for male and female customers
    2. In absolute numbers, most churned customers are not senior citizens, but there are far fewer senior citizens overall. Relative to the size of each group, the churn rate is higher for senior citizens than for younger customers
    3. Customers without partners, as well as those without dependents, have a higher churn rate
    4. Customers with phone service have a higher chance of churn, while those without phone service have a minimal chance of churn
    5. Customers with Internet Service type "Fiber Optic" have a higher chance of churn
    6. Customers who don't have TechSupport have a higher churn rate
    7. Month-to-month contract customers have a very high churn rate
    8. Customers with paperless billing have a higher churn rate
    9. Customers paying by Electronic Check churn more than those using any other payment method
    10. MultipleLines, StreamingTV and StreamingMovies do not have much effect on churn
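
Observation 2 compares churn relative to group size rather than raw counts. One way to read such rates directly, sketched here on a made-up toy frame, is `pd.crosstab` with `normalize="index"`, which converts each row of counts into row percentages. The frame `toy` and its values are assumptions for illustration and do not reproduce the dataset's figures.

```python
import pandas as pd

# Toy data (made up): six non-senior customers and two senior customers.
toy = pd.DataFrame({
    "SeniorCitizen": [0, 0, 0, 0, 0, 0, 1, 1],
    "Churn": ["No", "No", "No", "No", "Yes", "Yes", "Yes", "No"],
})

# normalize="index" divides each row by its row total, giving the
# share of churners within each group instead of absolute counts.
rates = pd.crosstab(toy["SeniorCitizen"], toy["Churn"], normalize="index") * 100
print(rates.round(1))
```

In this toy example the non-senior group has more churned customers in absolute terms (2 versus 1), yet the senior group has the higher churn rate, which is the same pattern described in the observation.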

In [None]:
# sns.distplot is deprecated; sns.histplot (seaborn >= 0.11) draws the same histogram
ax = sns.histplot(data['tenure'], bins=int(200/6), edgecolor='black')
ax.set_ylabel('Number of Customers')
ax.set_xlabel('Tenure in months')
ax.set_title('Distribution of Customers by tenure')

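The histogram shows a large number of customers with only one or two months of tenure (customers new to the service). The grouped analysis below shows that short-tenure customers also have the highest churn.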

In [None]:
def tenure_count(row):
    if row['tenure'] <= 12 :
        return 'tenure_0-12'
    elif (row['tenure'] > 12 and row['tenure'] <= 24):
        return 'tenure_12-24'
    elif (row['tenure'] > 24 and row['tenure'] <= 36):
        return 'tenure_24-36'
    elif (row['tenure'] > 36 and row['tenure'] <= 48):
        return 'tenure_36-48'
    elif (row['tenure'] > 48 and row['tenure'] <= 60):
        return 'tenure_48-60'
    else:
        return 'tenure_60+'
    
data['grp_tenure'] = data.apply(tenure_count,axis = 1)
data['grp_tenure'].value_counts()

In [None]:
plot_bar(data,'grp_tenure','Churn')

1. A large number of customers have a tenure greater than 60 months, and their churn is low relative to the size of the group.
2. Customers with a tenure of 12 months or less are the most likely to leave the service (high churn).
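
The same tenure groups can also be built without a row-wise function. The sketch below uses `pd.cut` on a made-up series of tenure values; the bin edges follow the group labels used above (0-12, 12-24, 24-36, 36-48, 48-60, 60+), and the names `bins`, `labels` and `grp` are assumptions for illustration.

```python
import pandas as pd

# Made-up tenure values that sit on and around each bin edge.
tenure = pd.Series([1, 12, 13, 24, 25, 36, 48, 60, 61, 72], name="tenure")

bins = [0, 12, 24, 36, 48, 60, float("inf")]
labels = ["tenure_0-12", "tenure_12-24", "tenure_24-36",
          "tenure_36-48", "tenure_48-60", "tenure_60+"]

# right=True makes each bin include its upper edge, e.g. (12, 24];
# include_lowest=True also places a tenure of 0 in the first bin.
grp = pd.cut(tenure, bins=bins, labels=labels, right=True, include_lowest=True)
print(grp.value_counts().sort_index())
```

Declaring the edges once in a list makes a mistyped boundary easier to spot than in a chain of `elif` conditions.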


In [None]:
data['TotalCharges'] = data['TotalCharges'].astype(float)

In [None]:
plt.scatter(x = data['MonthlyCharges'],y = data['TotalCharges'],c = 'green')
plt.xlabel("Monthly Charges")
plt.ylabel("Total Charges")
plt.title("Relation between Monthly and Total Charges")

As the monthly charge increases, the total charge also tends to increase.

In [None]:
ax = sns.kdeplot(data[data['Churn'] == 'No']['MonthlyCharges'])
ax = sns.kdeplot(data[data['Churn'] == 'Yes']['MonthlyCharges'],color = 'Red')
ax.set_xlabel('Monthly Charges')
ax.set_ylabel('Density')
ax.set_title('Distribution of Monthly charges by churn')

In the graph above, the red line (churned customers) shows that the probability of churn is higher when monthly charges are higher.

In [None]:
ax = sns.kdeplot(data[data['Churn'] == 'No']['TotalCharges'])
ax = sns.kdeplot(data[data['Churn'] == 'Yes']['TotalCharges'],color = 'Red')
ax.set_xlabel('Total Charges')
ax.set_ylabel('Density')
ax.set_title('Distribution of Total charges by churn')

In the graph above, the red line (churned customers) shows that the churn rate is higher when total charges are on the lower side.

Converting Categorical values into Numerical

Label Encoding: I used label encoding here. One-hot encoding could also be used.
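
For comparison, here is a minimal sketch of one-hot encoding with `pd.get_dummies`. Label encoding maps the categories of a column to the integers 0, 1, 2, ..., which implies an ordering that nominal categories do not have. One-hot encoding instead creates one 0/1 column per category. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame (made up) with one categorical and one numeric column.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "MonthlyCharges": [70.0, 55.5, 20.1, 99.9],
})

# Each category of "Contract" becomes its own 0/1 indicator column;
# numeric columns are passed through unchanged.
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)
print(encoded.columns.tolist())
```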

In [None]:
data.drop(['customerID'],axis = 1,inplace = True)

In [None]:
objList = data.select_dtypes(include = "object").columns
print (objList)

In [None]:
#Label Encoding for object to numeric conversion
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for feat in objList:
    data[feat] = le.fit_transform(data[feat].astype(str))

print (data.info())

In [None]:
y = data['Churn']
data.drop(['Churn'],axis = 1, inplace = True)
Train_x = data

In [None]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X=sc.fit_transform(Train_x)

# Implement Machine Learning Algorithms

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.25,random_state = 0)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

Logistic Regression

In [None]:
LR = LogisticRegression(solver = 'liblinear')
LR.fit(X_train,y_train)
y_pred_LR = LR.predict(X_test)
LR_Score = accuracy_score(y_test, y_pred_LR)
print("Accuracy Using LR : ", LR_Score)

In [None]:
#Weights of the Variables
pd.Series(LR.coef_[0],index = Train_x.columns.values)

It seems that Total Charges and Monthly Charges are positively related to churn, while Tenure, Phone Service and Contract are negatively related to it. Judging by their weights, all of these are important factors in predicting churn.
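
To make the sign and size of such weights easier to read, a coefficient can be converted to an odds ratio with `exp(coef)`. Because the features were standardized, each coefficient is the change in log-odds of churn per one standard deviation increase in that feature. The coefficient values below are hypothetical and are NOT the values produced by the fitted model.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients for illustration only (not from the fitted model).
coefs = pd.Series({"tenure": -0.8, "Contract": -0.6, "MonthlyCharges": 0.5})

# exp(coef) is the multiplicative change in the odds of churn:
# below 1 lowers the odds, above 1 raises them.
odds_ratios = np.exp(coefs)

# Rank features by the absolute size of their weight.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(pd.DataFrame({"coef": ranked, "odds_ratio": odds_ratios[ranked.index]}))
```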


Random Forest

In [None]:
RF = RandomForestClassifier(n_estimators=600,max_features=15,
                            n_jobs = -1,random_state=0,
                            min_samples_leaf=50,oob_score=True,
                            max_leaf_nodes=30 )
RF.fit(X_train,y_train)
y_pred_RF = RF.predict(X_test)
RF_Score = accuracy_score(y_test, y_pred_RF)
print("Accuracy Using RF  : ", RF_Score)

In [None]:
imp_features = pd.Series(RF.feature_importances_,index = Train_x.columns.values)
imp_features.sort_values()[-5:].plot(kind = 'bar')

Top five important features according to Random Forest
1. Contract
2. Monthly Charges
3. Tenure
4. Total Charges
5. Internet Service

These features are also weighted highly by Logistic Regression. Hence, we can conclude that these features should be considered when predicting churn.

Support Vector Machines

In [None]:
SVM = SVC(kernel='rbf',C =1) 
SVM.fit(X_train,y_train)
y_pred_SVM = SVM.predict(X_test)
SVM_Score = accuracy_score(y_test, y_pred_SVM)
print("Accuracy Using SVM  : ", SVM_Score)

Gaussian Naive Bayes

In [None]:
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
y_pred_GB = gaussian.predict(X_test)
GB_Score = accuracy_score(y_test, y_pred_GB)
print("Accuracy Using Gaussian Naive Bayes : ", GB_Score)

XGBOOST

In [None]:
XGB = XGBClassifier(n_estimators=100,learning_rate = 0.1,max_depth = 4)
XGB.fit(X_train, y_train)
y_pred_XGB = XGB.predict(X_test)
XGB_Score = accuracy_score(y_test, y_pred_XGB)
print("Accuracy Using XGBoost : ", XGB_Score)

LIGHTGBM Classifier

In [None]:
lgbm = LGBMClassifier(n_estimators=100,learning_rate = 0.1,max_depth = 5)
lgbm.fit(X_train, y_train)
y_pred_LGBM = lgbm.predict(X_test)
LGBM_Score = accuracy_score(y_test,y_pred_LGBM )
print("Accuracy Using LightGBM Classifier : ", LGBM_Score)

In [None]:
# LabelEncoder maps 'No' to 0 and 'Yes' to 1, so rows and columns are ordered [Not-Churn, Churn]
labels = ['Not-Churn', 'Churn']
cm = confusion_matrix(y_test, y_pred_LGBM)
print(cm)

In [None]:
ax = plt.subplot()
# annot=True writes the count in each cell; fmt='d' shows it as an integer
sns.heatmap(cm, annot=True, fmt='d', ax=ax)

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['Not-Churn', 'Churn'])
ax.yaxis.set_ticklabels(['Not-Churn', 'Churn']);

In [None]:
Results = pd.DataFrame({'Model': ['Logistic Regression','Gaussian Naive Bayes','SVM','Random Forest','XG_Boost','LightGBM'],
                        'Accuracy Score' : [LR_Score,GB_Score,SVM_Score,RF_Score,XGB_Score,LGBM_Score]})

In [None]:
Final_Results = Results.sort_values(by = 'Accuracy Score', ascending=False)
Final_Results = Final_Results.set_index('Model')
print(Final_Results)

The results above show that the accuracy of LightGBM, Random Forest, Logistic Regression and XGBoost is nearly the same.  
Accuracy may be further improved by using one-hot encoding rather than the label encoding used above.
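
Since about 73% of customers do not churn, accuracy alone can look high even for a model that misses many churners. The sketch below derives precision and recall for the churn class from a confusion matrix, along with the accuracy of always predicting "Not-Churn". The matrix `cm_example` contains made-up counts for illustration and is NOT the output of any model above.

```python
import numpy as np

# Hypothetical confusion matrix (made-up counts): rows are true labels,
# columns are predicted labels, both in the order [Not-Churn, Churn].
cm_example = np.array([[1150, 150],
                       [220, 240]])

tn, fp, fn, tp = cm_example.ravel()

accuracy = (tp + tn) / cm_example.sum()
precision = tp / (tp + fp)   # of the predicted churners, how many really churned
recall = tp / (tp + fn)      # of the real churners, how many were caught

# Accuracy of a trivial model that always predicts "Not-Churn".
baseline = cm_example[0].sum() / cm_example.sum()

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} baseline={baseline:.3f}")
```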

Reference : https://www.kaggle.com/bandiatindra/telecom-churn-prediction