<a href="https://colab.research.google.com/github/socialx-indonesia/bda-tpcc/blob/main/python/003_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*SocialX Indonesia - Muhammad Apriandito*


---






# **Classification: Predicting Customer Churn**
Acquiring customers is a central goal of any business, but retaining them is a separate challenge. In increasingly competitive markets, a company that fails to look after its customers falls behind, and the effort spent acquiring them is wasted. Customer churn, also known as customer attrition, occurs when customers stop using a business's products and services. It is one of the most critical factors for a business to keep evaluating, especially a growing one.

A telecommunications company has a customer churn problem: its churn rate is very high, and the company needs a way to lower it.

The company provided data on 7,043 customers for analysis. The dataset contains information about:

* Customers who left within the last month – the column is called Churn

* Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies

* Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges

* Demographic info about customers – gender, age range, and if they have partners and dependents

### **Import Data**

The first step is to import the data provided by management into the Python environment. To do this, all required packages must be installed and loaded. Because we use Google Colaboratory, where these packages are already installed, we only need to load them.

In [None]:
# Load packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import random
import warnings

# Load modules
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import tree
from sklearn.naive_bayes import GaussianNB 
from sklearn import metrics

# Set Parameter
random.seed(10)
plt.rcParams['figure.figsize'] = (16, 9)
warnings.filterwarnings('ignore')

In [None]:
# Import Data to Google Colab
df = pd.read_csv('https://raw.githubusercontent.com/socialx-indonesia/bda-tpcc/main/data/customer_churn.csv', sep = ';')

In [None]:
# Show the first 5 rows
df.head()

In [None]:
# Prints the dataset information
df.info()

### **Data Exploration**
As a first step, management wants to know how many customers churn and to explore why, by comparing churn across variables in the data such as gender, partnership status, contract type, contract duration, and others.

In [None]:
sns.countplot(x="Churn", 
              data=df)

In [None]:
sns.countplot(x="gender",
              hue="Churn", 
              data=df)

In [None]:
sns.countplot(x="InternetService",
              hue="Churn",
              data=df)

In [None]:
sns.countplot(x="StreamingTV",
              hue="Churn",
              data = df)
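
The count plots show raw counts, which can be hard to compare when the groups differ in size. As a rough complement, the sketch below computes the churn share within each category using `pd.crosstab`. The small `toy` frame and its `Yes`/`No` labels are hypothetical stand-ins, not the real dataset; in the notebook the same call could be pointed at `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `df`.
toy = pd.DataFrame({
    "InternetService": ["DSL", "DSL", "Fiber optic", "Fiber optic",
                        "Fiber optic", "No", "No", "No"],
    "Churn": ["No", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
})

# normalize="index" turns each row into proportions that sum to 1,
# so every cell is the share of that category with that Churn label.
rates = pd.crosstab(toy["InternetService"], toy["Churn"], normalize="index")
print(rates)
```

Each row then reads as "of the customers in this category, what fraction churned", which is easier to compare across categories than bar heights.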

### **Make a Customer Churn Prediction Model using Machine Learning**
The findings from data exploration do not satisfy management. Knowing that customers without an internet subscription are less likely to churn is not actionable. Management wants to predict directly whether a given customer will churn, so that it can act to prevent churn earlier.

Machine learning is a subfield of AI that allows computers to learn from data. Here, the model is expected to learn customer churn patterns so that it can predict whether a customer will churn in the future.

Building a machine learning model consists of two stages: training and testing. Training extracts patterns from the data, and testing evaluates how well the model predicts.

#### **Data Preprocessing**

#### **Set Feature and Target**

In [None]:
# Select Features
feature = df[['Partner', 'tenure', 'MonthlyCharges']]

In [None]:
# Select Target
target = df['Churn']

#### **Set Training and Testing Data**
Before modeling, the data must be divided into two parts: the training data used to build the model and the test data used to measure its performance. A common split is 70% training and 30% test.


In [None]:
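# Set Training and Testing Data (70:30)
# random_state fixes the shuffle so the split is reproducible
# (random.seed above does not affect scikit-learn's shuffling)
X_train, X_test, y_train, y_test = train_test_split(feature, target, shuffle=True, test_size=0.3, random_state=10)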

# Show the Training and Testing Data
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
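
When churners are a minority, a purely random split can leave the two parts with different churn shares. One possible refinement, shown here as a sketch on hypothetical toy data rather than the real dataset, is to pass `stratify` so that both parts keep roughly the same class proportions. The names `X`, `y`, `X_tr`, and `X_te` are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 customers, 30 of whom churned (label 1).
X = pd.DataFrame({"tenure": range(100)})
y = pd.Series([1] * 30 + [0] * 70)

# stratify=y keeps the churn share roughly equal in both parts;
# random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=10
)

print(len(X_tr), len(X_te))      # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # churn share in each part
```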

#### **Training: Creating The Prediction Model**
The next step is to choose the algorithms used to make classification predictions. In this module, we will use a decision tree and a naive Bayes classifier.

In [None]:
# Modeling Decision Tree
dtc = tree.DecisionTreeClassifier(min_impurity_decrease=0.01)
dtc.fit(X_train, y_train)

# Predict to Test Data 
y_pred_dtc = dtc.predict(X_test)

In [None]:
# Modeling Naive Bayes Classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict to Test Data
y_pred_gnb= gnb.predict(X_test)
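
A single train/test split gives one estimate of performance, and that estimate depends on which rows land in the test set. The notebook imports `cross_val_score` but does not use it; the sketch below shows one way it could be applied to compare the two classifiers. The data comes from `make_classification`, a synthetic stand-in and not the churn dataset, and the choice of 5 folds is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the three churn features (not the real data).
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    random_state=10,
)

models = {
    "decision tree": DecisionTreeClassifier(min_impurity_decrease=0.01,
                                            random_state=10),
    "naive bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validation: each fold serves once as the validation part.
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(name, scores[name].mean())
```

Averaging over folds tends to give a steadier comparison between models than a single split.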

#### **Testing: Evaluate the Model**
The last step is to evaluate the models. This assessment measures how well each model predicts by comparing the predicted values with the actual values.

In [None]:
# Decision Tree: Accuracy, Precision, Recall, F1 Score, Cohen's Kappa
acc_dtc = metrics.accuracy_score(y_test, y_pred_dtc)
prec_dtc = metrics.precision_score(y_test, y_pred_dtc)
rec_dtc = metrics.recall_score(y_test, y_pred_dtc)
f1_dtc = metrics.f1_score(y_test, y_pred_dtc)
kappa_dtc = metrics.cohen_kappa_score(y_test, y_pred_dtc)

print("Accuracy:", acc_dtc)
print("Precision:", prec_dtc)
print("Recall:", rec_dtc)
print("F1 Score:", f1_dtc)
print("Cohens Kappa Score:", kappa_dtc)

In [None]:
# Naive Bayes: Accuracy, Precision, Recall, F1 Score, Cohen's Kappa
acc_gnb = metrics.accuracy_score(y_test, y_pred_gnb)
prec_gnb = metrics.precision_score(y_test, y_pred_gnb)
rec_gnb = metrics.recall_score(y_test, y_pred_gnb)
f1_gnb = metrics.f1_score(y_test, y_pred_gnb)
kappa_gnb = metrics.cohen_kappa_score(y_test, y_pred_gnb)

print("Accuracy:", acc_gnb)
print("Precision:", prec_gnb)
print("Recall:", rec_gnb)
print("F1 Score:", f1_gnb)
print("Cohens Kappa Score:", kappa_gnb)

As shown above, the two models perform differently. The better performer is the naive Bayes model, with an accuracy of 75%.
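
Accuracy alone can hide weak performance on the churn class when churners are the minority, which is why precision and recall are reported as well. The sketch below works through how these metrics follow from the confusion matrix. The labels are hypothetical (1 = churn, 0 = no churn) and are not results from the models above.

```python
from sklearn import metrics

# Hypothetical labels: 1 = churn, 0 = no churn.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary 0/1 labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # share of predicted churners who churned
recall = tp / (tp + fn)                     # share of actual churners who were caught

print(accuracy, precision, recall)
```

In this toy case the accuracy is 0.7 while recall is only one third: two of the three churners are missed even though most predictions are correct.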

#### **Deployment**
This month there is a new customer. Check whether this customer will churn or not.

In [None]:
# Create new data (column names must match the features used for training)
df_new = pd.DataFrame([[1, 1, 2985]], columns = ['Partner', 'tenure', 'MonthlyCharges'])
df_new

In [None]:
# Predict using Naive Bayes Classifier
predicted_nb = pd.DataFrame(gnb.predict(df_new), columns = ['churn'])
pred_churn = pd.concat([df_new, predicted_nb], axis=1)

# Show Prediction
pred_churn.head()
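
A hard churn/no-churn label does not say how confident the model is. As a possible extension, `GaussianNB` also exposes `predict_proba`, which returns a probability per class. The sketch below uses a tiny hypothetical training frame with the same three column names as the notebook's features; the values and the 0/1 labels are made up for illustration.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

cols = ["Partner", "tenure", "MonthlyCharges"]

# Hypothetical training rows: long-tenure, low-charge customers stay (0);
# short-tenure, high-charge customers churn (1).
train = pd.DataFrame(
    [[1, 60, 20.0], [0, 55, 25.0], [1, 48, 30.0],
     [0, 2, 90.0], [1, 1, 85.0], [0, 3, 95.0]],
    columns=cols,
)
labels = [0, 0, 0, 1, 1, 1]

model = GaussianNB().fit(train, labels)

# The new frame uses the same column names, in the same order, as training.
new_customer = pd.DataFrame([[1, 1, 88.0]], columns=cols)

proba = model.predict_proba(new_customer)[0]
# classes_ gives the label order of the probability columns.
churn_probability = proba[list(model.classes_).index(1)]
print(round(churn_probability, 3))
```

A probability makes it possible to rank customers by risk and choose a follow-up threshold that fits the retention budget, instead of treating every predicted churner the same.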