<a href="https://colab.research.google.com/github/srilav/machinelearning/blob/main/M6_NB_Case_Study_Customer_Churn_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Case Study: Customer Churn Analysis

## Learning Objectives

At the end of the experiment, you will be able to

* understand what customer churn is
* know the importance of predicting customer churn
* build a prediction model for a credit card company dataset
* build a prediction model for a telecommunication company dataset

## Information

Customer churn, or customer attrition, is the phenomenon where customers of a business no longer purchase from or interact with the business. A high churn rate means that a large number of customers no longer want to purchase goods and services from the business. The customer churn rate, or customer attrition rate, is the percentage of customers who are not likely to make another purchase from the business.
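
As a minimal sketch of the churn rate calculation, assuming the common convention of dividing the customers lost during a period by the customers at the start of that period (the function name `churn_rate` and the figures are hypothetical):

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the churn rate for a period as a percentage."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_lost / customers_at_start


# Hypothetical example: 1,000 customers at the start of the quarter, 50 of them left
print(churn_rate(1000, 50))  # 5.0
```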

Customer churn happens when customers decide to not continue purchasing products/services from an organization and end their association. It is an integral parameter for the organization since acquiring a new customer could cost even more than retaining an existing customer. Customer churn can prove to be a roadblock for an exponentially growing organization and a retention strategy should be decided in order to avoid an increase in customer churn rates.

To know more about customer churn, click [here](https://www.questionpro.com/blog/customer-churn/).

### Importance of Predicting Customer Churn

The ability to predict that certain customers are at a very high risk of churning represents a substantial revenue maintenance source for any business:

* Acquiring new customers is costly, but losing existing customers costs the business even more. For the best business outcomes, the existing customer base should be happy to purchase repeatedly from your brand.

* Increasing market competition encourages organizations to focus not only on new business but also on retaining existing customers.

* The most important step towards preventing customer churn is to start rewarding existing customers for regular purchases and support.

* Customer churn usually results from an entire customer journey and not just a few incidents. To avoid customer churn, organizations should start offering purchase incentives to customers who are likely to churn soon.

* A customer’s intention to stop using a particular product/service is usually a decision formed over time. Various factors lead to this decision, and it is important for organizations to understand each of them so that customers can be convinced to stay and keep making purchases.

### Customer Churn Analysis

Here we will go through some consumer data and see how we can leverage data insights and predictive modeling in order to improve customer retention.

**Dataset Description**

Our first customer dataset is from a **credit card company**, where we are able to review customer attributes such as gender, age, tenure, balance, number of products they are subscribed to, their estimated salary and whether they stopped the subscription.

Here, tenure represents the number of months the customer has stayed with the company.

In [None]:
#@title Run this cell to download the datasets
from IPython import get_ipython

ipython = get_ipython()

notebook = "M6_AST_10_Customer_Churn_Analysis_C"  # name of the notebook

# run_line_magic replaces the deprecated ipython.magic call
ipython.run_line_magic("sx", "wget https://raw.githubusercontent.com/anilak1978/customer_churn/master/Churn_Modeling.csv")
ipython.run_line_magic("sx", "wget https://raw.githubusercontent.com/anilak1978/customer-churn/master/bigml_59c28831336c6604c800002a.csv")


### Import required packages

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, roc_curve, roc_auc_score, f1_score

In [None]:
# Read data
df = pd.read_csv("Churn_Modeling.csv")
df.head()

In [None]:
# Shape of dataset
df.shape

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# data types
df.dtypes

In [None]:
# Looking at the summary
df.describe()

From the above summary statistics, we see that the average customer age is 39, the average tenure is 5 months, and the average estimated salary is about 100K.

In [None]:
# Columns of dataset
df.columns

In [None]:
# Looking at CreditScore for Churn and No churn data
sns.boxplot(x=df['Exited'], y=df['CreditScore'])
plt.show()

In [None]:
# Mean CreditScore for Churn and No churn data
df.groupby('Exited')['CreditScore'].mean()

The above results let us compare the mean credit score of customers who churned with that of customers who did not.

In [None]:
# Visualize Geography and Churn columns
sns.barplot(x='Geography', y='Exited', data = df, ci=None)
plt.show()

The churn rate is higher for Germany than France and Spain.
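
The bar heights in a plot like the one above are the mean of the 0/1 `Exited` column within each group, which is the churn rate of that group. A minimal sketch with a small made-up DataFrame (the values are illustrative, not taken from the dataset) shows how to get the same kind of numbers as a table:

```python
import pandas as pd

# Toy data for illustration only; the notebook itself would use df
toy = pd.DataFrame({
    "Geography": ["France", "France", "Germany", "Germany", "Spain", "Spain"],
    "Exited":    [0,        1,        1,         1,         0,       0],
})

# Mean of a 0/1 column per group = share of churned customers in that group
churn_by_geo = toy.groupby("Geography")["Exited"].mean()
print(churn_by_geo)
```

On the real data, `df.groupby('Geography')['Exited'].mean()` should return the values drawn in the bar plot.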

In [None]:
# Visualize Gender and Churn columns
sns.barplot(x='Gender', y='Exited', data = df, ci=None)
plt.show()

The churn rate is higher for female customers.

In [None]:
# Looking at Age of Churn and No churn data
sns.boxplot(x=df['Exited'], y=df['Age'])
plt.show()

We can see a significant difference between the age of churn and no churn customers.

In [None]:
# Visualize Tenure and Churn columns
sns.barplot(x='Tenure', y='Exited', data = df, ci=None)
plt.show()

From the above plot, we can see that 0 and 1 are the top two tenures with the highest churn rate.

In [None]:
# Looking at Balance for Churn and No churn data
sns.boxplot(x=df['Exited'], y=df['Balance'])
plt.show()

In [None]:
# Visualize Number of Products and Churn columns
sns.barplot(x='NumOfProducts', y='Exited', data = df, ci=None)
plt.show()

In [None]:
# Visualize Has credit card and Churn columns
sns.barplot(x='HasCrCard', y='Exited', data = df, ci=None)
plt.show()

In the above plot, there is very little difference in churn rate between customers who have a credit card and those who do not.

In [None]:
# Visualize Is active member and Churn columns
sns.barplot(x='IsActiveMember', y='Exited', data = df, ci=None)
plt.show()

In [None]:
# Looking at EstimatedSalary for Churn and No churn data
sns.boxplot(x=df['Exited'], y=df['EstimatedSalary'])
plt.show()

In [None]:
# Looking at Geography and Gender Distribution against Estimated Salary
# catplot is figure-level and creates its own figure, so set its size with height/aspect
sns.catplot(x="Geography", y="EstimatedSalary", hue="Gender", kind="box", data=df, height=6, aspect=1.5)
plt.title("Geography VS Estimated Salary")
plt.xlabel("Geography")
plt.ylabel("Estimated Salary")
plt.show()

When we look at the gender and geographic distribution of estimated salary, we see that in France male customers' estimated average salary is slightly higher than that of female customers, whereas in Germany and Spain female customers’ estimated average salary is higher.

Based on our basic exploratory analysis, we can define the important customer attributes that can give us the best insight in order to predict the type of customers that can churn.

In this dataset, we can select credit score, geography, gender, age, tenure, balance, number of products, is active member and estimated salary attributes as the feature set and exited as the target variable.

In [None]:
# Feature set
X = df[["CreditScore", "Geography", "Gender", "Age", "Tenure", "Balance", "NumOfProducts", "IsActiveMember", "EstimatedSalary"]].values
# Target
y = df[["Exited"]]
X[0:5], y[0:5]

Update the categorical variables to numerical variables:

In [None]:
# preprocessing categorical variables
geography = LabelEncoder()
geography.fit(["France", "Spain", "Germany"])
X[:,1] = geography.transform(X[:,1])

gender = LabelEncoder()
gender.fit(["Female", "Male"])
X[:,2] = gender.transform(X[:,2])
X[0:5]
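
`LabelEncoder` assigns integer codes in sorted order of the class labels, regardless of the order passed to `fit`. The short check below shows the resulting mapping for the geography labels. Note that integer codes impose an artificial ordering on the categories; one-hot encoding is a common alternative when a model may be sensitive to that ordering.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["France", "Spain", "Germany"])

# classes_ is sorted alphabetically; the position of a label is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'France': 0, 'Germany': 1, 'Spain': 2}
```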

Splitting into training and testing set:

In [None]:
# Split train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
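
If the churn classes turn out to be imbalanced, a plain random split can leave the training and test sets with slightly different churn proportions. The sketch below uses synthetic labels (not the dataset) to show that passing `stratify` to `train_test_split` keeps the class ratio the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels for illustration: 20% churn (1), 80% no churn (0)
labels = np.array([1] * 200 + [0] * 800)
features = np.arange(len(labels)).reshape(-1, 1)

# stratify=labels keeps the 20/80 ratio in both the training and the test set
_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=3
)
print(y_tr.mean(), y_te.mean())
```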

For this dataset, let’s use DecisionTreeClassifier and RandomForestClassifier to build models and make predictions, and then evaluate both to see which one performs better.

In [None]:
# Create model using DecisionTree Classifier and fit training data
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

# Prediction
dt_pred = dt_model.predict(X_test)
dt_pred[0:5]

In [None]:
# Evaluating the prediction model
accuracy_score(y_test, dt_pred)

In [None]:
# Create Random Forest model
rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train, y_train.values.ravel())

# Prediction using rf_model
rf_pred = rf_model.predict(X_test)
rf_pred[0:5]

In [None]:
# Evaluate the model
accuracy_score(y_test, rf_pred)

Based on the accuracy scores, about 79% of the predictions are correct with the Decision Tree Classifier and about 85% with the Random Forest Classifier. In this case, Random Forest performs better.

Now, let’s look at the second customer dataset to see if we can improve the analysis and the prediction models.

**Dataset Description**

Here we are looking at a **telecommunication company** and its existing customer attributes such as their current plan, charges, location in terms of state, number of customer service calls, account length and churn.

In [None]:
# Read data
df1 = pd.read_csv("bigml_59c28831336c6604c800002a.csv")
df1.head()

In [None]:
# Shape of dataset
df1.shape

In [None]:
# Check for missing values
df1.isnull().sum()

In [None]:
# Datatypes
df1.dtypes

There is no missing data in the dataset and the data types are correct.

In [None]:
# Visualize State and Churn columns
plt.figure(figsize=(20,6))
sns.barplot(x='state', y='churn', data = df1, ci=None)
plt.show()

When we look at the state and churn we see that California and New Jersey are the top two states with the highest churn rate.

In [None]:
# Visualize International plan and Churn columns
sns.barplot(x='international plan', y='churn', data = df1, ci=None)
plt.show()

In [None]:
# Visualize Voice mail plan and Churn columns
sns.barplot(x='voice mail plan', y='churn', data = df1, ci=None)
plt.show()

We also see that the churn rate is higher for customers with the international plan and lower for customers with a voice mail plan.

One possible reason for the significantly higher churn among international plan customers is that they join when they have to travel abroad for a short period and leave once the trip is over.

In [None]:
# Relationship between Customer service calls and Churn columns
sns.regplot(x=df1['customer service calls'], y=df1['churn'], marker='.')
plt.xlabel('Customer service calls')
plt.ylabel("Churn")
plt.show()

Poor customer service is one of the well-known reasons for customer churn. In this case, the above plot shows a positive linear relationship between the number of customer service calls and the churn rate.

Now, let’s develop multiple different models and evaluate them to see which one would be the best fit to solve the business problem of customer churn.

In [None]:
# Feature selection
X1 = df1[["account length", "international plan", "total day charge", "total night charge", "total intl charge", "customer service calls", "state"]]
# Target selection
y1 = df1["churn"]
X1[0:5]

Convert the categorical variables to numeric variables in order to create the model:

In [None]:
# Encode state with one-hot encoding
X1 = pd.get_dummies(X1, columns=["state"])
X1 = X1.values

# Preprocess to update str variables to numerical variables
international_plan = LabelEncoder()
international_plan.fit(["no", "yes"])
X1[:,1] = international_plan.transform(X1[:,1])
X1[0:5]

In [None]:
# Scaling data
sc = StandardScaler()
X1_scaled = sc.fit_transform(X1)

Splitting into training and testing set:

In [None]:
# Create training and testing set
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1_scaled, y1, test_size=0.2, stratify=y1, random_state=3)
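
Fitting `StandardScaler` on the full dataset before splitting lets test-set statistics influence the training features. A common alternative, sketched below on synthetic data from `make_classification` (not the telecom dataset), is to split first and wrap the scaler and the model in a pipeline, so that the scaler is fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; weights makes the positive class the minority
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.85, 0.15], random_state=3
)

# Split first, so the test rows stay unseen during scaling
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=3
)

# The pipeline fits the scaler on X_tr only and reuses it when predicting on X_te
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_tr, y_tr)
print("Test accuracy:", pipe.score(X_te, y_te))
```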

Let’s create a model using the Support Vector Machine.

In [None]:
# Creating the svm model and fitting training set
svc_model = SVC(probability=True)
svc_model.fit(X_train1, y_train1)
# Prediction
svc_pred = svc_model.predict(X_test1)
print(svc_pred[0:5])

# Accuracy score
print("Accuracy score: ", accuracy_score(y_test1, svc_pred))

The accuracy score of the SVM model for predicting churn of the telecommunication company customers is 0.85. However, we should analyze this further as the data is imbalanced.

We can review additional evaluation metrics, such as the confusion matrix, which gives us the number of true positives, false positives, true negatives and false negatives, as well as precision, recall and F1 score.

In [None]:
# Confusion matrix
confusion_matrix(y_test1, svc_pred)

The model predicts 564 True Negatives, 6 False Positives, 90 False Negatives, 7 True Positives.
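
The remaining metrics follow directly from these four counts. The sketch below recomputes them by hand from the numbers quoted above (the helper name `metrics_from_counts` is hypothetical). It also computes the accuracy of a trivial baseline that always predicts "no churn", which is a useful reference point when the classes are imbalanced.

```python
def metrics_from_counts(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute common classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Accuracy of always predicting the negative class ("no churn")
        "baseline_accuracy": (tn + fp) / total,
    }


# Counts quoted above for the SVM model
svm_metrics = metrics_from_counts(tn=564, fp=6, fn=90, tp=7)
for name, value in svm_metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is about 0.856 while the always-"no churn" baseline reaches about 0.855, and recall is only about 0.07, which is why accuracy alone is not enough here.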

In [None]:
# Precision score for svm
print("Precision: ", precision_score(y_test1, svc_pred))

In [None]:
# Recall score for svm
print("Recall: ", recall_score(y_test1, svc_pred))

In [None]:
# Probability for each prediction
prob_2 = svc_model.predict_proba(X_test1)[:,1]

# ROC curve giving the false and true positive predictions
fpr, tpr, thresholds = roc_curve(y_test1, prob_2)
plt.plot(fpr, tpr)
plt.title("ROC curve")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()

In [None]:
# Area under the curve
auc = roc_auc_score(y_test1, prob_2)
print("Area under curve: ", auc)

In [None]:
# F1_score
f1_ = f1_score(y_test1, svc_pred)
print("F1-score: ", f1_)

Let’s create another model using RandomForestClassifier.

In [None]:
# Create model using RandomForestClassifier and fit the training set
rf_model1 = RandomForestClassifier(n_estimators=100, random_state=4)
rf_model1.fit(X_train1, y_train1)

# Create prediction
rf_pred1 = rf_model1.predict(X_test1)
rf_pred1[0:5]

In [None]:
# Accuracy score
accuracy_score(y_test1, rf_pred1)

We can see that the accuracy score for the Random Forest Classifier is higher than for the Support Vector Machine.

In [None]:
# Confusion matrix to find precision and recall
confusion_matrix(y_test1, rf_pred1)

The model predicts 558 True Negatives, 12 False Positives, 52 False Negatives, 45 True Positives.

Even though the False Positive count went up slightly, there are far more True Positives than with the SVM model.

In [None]:
# Precision score
print("Precision: ", precision_score(y_test1, rf_pred1))

In [None]:
# Recall score
print("Recall: ", recall_score(y_test1, rf_pred1))

In [None]:
# Probability for each prediction
prob = rf_model1.predict_proba(X_test1)[:,1]

# ROC curve giving the false and true positive predictions
fpr, tpr, thresholds = roc_curve(y_test1, prob)
plt.plot(fpr, tpr)
plt.title("ROC curve")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()

In [None]:
# Area under the curve
auc = roc_auc_score(y_test1, prob)
print("Area under curve: ", auc)

In [None]:
# F1_score
f1 = f1_score(y_test1, rf_pred1)
print("F1-score: ", f1)

We can further look at the feature importance to see what features have the most impact on the prediction.

In [None]:
# Importance of each feature
importances = rf_model1.feature_importances_

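# Visualize the feature importance, sorted from most to least important
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10,5))
plt.bar(range(X1.shape[1]), importances[indices])
# Label each bar with its original column index, since the bars are sorted
plt.xticks(range(X1.shape[1]), indices, rotation=90, fontsize=7)
plt.ylabel("Feature Importance")
plt.xlabel("Column Index")
plt.show()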

Based on the feature importance, we can remove the state feature from our model.
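
Since the importances are plotted by position, it can help to pair each one with its column name and to add up the one-hot `state_*` columns into a single score. The sketch below does this on a small synthetic DataFrame (the data and the target rule are made up for illustration). On the real data, the column names would need to be saved from the `pd.get_dummies` result before calling `.values`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for the telecom features
demo = pd.DataFrame({
    "customer service calls": rng.integers(0, 8, size=n),
    "total day charge": rng.normal(30, 9, size=n),
    "state": rng.choice(["CA", "NJ", "TX"], size=n),
})
# Made-up target: churn is driven by service calls only
target = (demo["customer service calls"] >= 4).astype(int)

features = pd.get_dummies(demo, columns=["state"])
model = RandomForestClassifier(n_estimators=50, random_state=4)
model.fit(features, target)

# Pair importances with column names, then sum the one-hot state columns
importance = pd.Series(model.feature_importances_, index=features.columns)
is_state = importance.index.str.startswith("state_")
summary = importance[~is_state].copy()
summary["state (all dummies)"] = importance[is_state].sum()
print(summary.sort_values(ascending=False))
```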

From the above results, we can see that both the precision and the recall of the SVM are much lower than those of the Random Forest Classifier, although the area under the ROC curve (AUC) is the same for both models, at 0.8.

Of the two predictive models, the second one, created with the Random Forest Classifier, is the better choice. We can also tune this model by adjusting its hyperparameters and removing the state variable from the feature set for better prediction.
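
One way to tune the Random Forest is a cross-validated grid search. The sketch below runs `GridSearchCV` on synthetic data from `make_classification`; the parameter grid and the choice of recall as the scoring metric are assumptions for illustration, not settings taken from this case study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (about 15% positives)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.85, 0.15], random_state=4
)

# Small illustrative grid; a real search would likely cover more values
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=4),
    param_grid,
    scoring="recall",  # favour catching churners over overall accuracy
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```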

Using existing consumer insights from data, companies can predict customers’ possible needs and issues, define proper strategies and solutions for them, meet their expectations and retain their business. Based on predictive analysis and modeling, businesses can take a targeted approach by segmenting customers and offering them customized solutions.