
# Import Packages

In [3]:
import sys
import os
import gdown
import logging
logging.getLogger('matplotlib.font_manager').setLevel(level=logging.CRITICAL)

import warnings
warnings.filterwarnings('ignore')

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

random_state = 1234 # get reproducible trees


# Prepare Data

**Restartnet** is a telecommunication company that has been the market leader in Wakanda since 1990 and was the first to create high-speed mobile internet that integrates satellite and ground cable.

Over the last 5 years, competition has become fiercer as new competitors have emerged. Many Restartnet customers are moving to those competitors, and the Restartnet CEO is concerned about the issue.

After digging into the data, the Restartnet CEO realized that the churn rate is high, at 25%.



As **CEO Analysts**, we take the initiative to find which customers are likely to churn by building a **customer churn model**, so that we can offer engagement packages to the right customers.

After we provide the list of customers, we calculate the impact for the company.

Assumptions:

* For each customer that churns, we lose $500.

* The engagement program costs $100 per customer, and

* All customers that get engagement will stay.


The **data** is provided in this [link](https://drive.google.com/file/d/1jAFn03vk055D9gZrrzM70_cdPyUDg-bv/view), which consists of a sample of **unique customers** who bought an internet package from Restartnet between 2010 and 2020. The customer data consists of demographic data and a summary of their transactions with Restartnet. The data definitions are listed below.

Data Definition:

| Field           | Description                                     |
|-----------------|-------------------------------------------------|
| customerID      | Customer's unique identifier                     |
| gender          | Whether the customer is a male or a female      |
| SeniorCitizen   | Whether the customer is a senior citizen or not |
| Partner         | Whether the customer has a partner or not       |
| Dependents      | Whether the customer has dependents or not      |
| tenure          | Number of months the customer has stayed        |
| PhoneService    | Whether the customer has a phone service or not |
| MultipleLines   | Whether the customer has multiple lines or not  |
| InternetService | Customer's internet service provider            |
| OnlineSecurity  | Whether the customer has online security or not |
| OnlineBackup    | Whether the customer has online backup or not   |
| DeviceProtection| Whether the customer has device protection or not |
| TechSupport     | Whether the customer has tech support or not    |
| StreamingTV     | Whether the customer has streaming TV or not    |
| StreamingMovies | Whether the customer has streaming movies or not|
| Contract        | The contract term of the customer               |
| PaperlessBilling| Whether the customer has paperless billing or not |
| PaymentMethod   | The customer's payment method                   |
| MonthlyCharges  | The amount charged to the customer monthly      |
| TotalCharges    | The total amount charged to the customer        |
| Churn           | Whether the customer churned or not              |



In [4]:
# Download Data
gdrive_url = "https://drive.google.com/file/d/1jAFn03vk055D9gZrrzM70_cdPyUDg-bv/view"
file_name = 'churn_data.csv'
gdown.download(gdrive_url, file_name, fuzzy=True)


Downloading...
From: https://drive.google.com/uc?id=1jAFn03vk055D9gZrrzM70_cdPyUDg-bv
To: /content/churn_data.csv
100%|██████████| 977k/977k [00:00<00:00, 101MB/s]


'churn_data.csv'

In [5]:
df = pd.read_csv('churn_data.csv')

In [6]:
numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
categorical_features = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod']

features = numeric_features + categorical_features
target = 'Churn'

print("numeric_features : ", numeric_features)
print("categorical_features : ", categorical_features)
print("features: ", features)
print("target: ", target)
print("columns used: ", features + [target])


numeric_features :  ['tenure', 'MonthlyCharges', 'TotalCharges']
categorical_features :  ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
features:  ['tenure', 'MonthlyCharges', 'TotalCharges', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
target:  Churn
columns used:  ['tenure', 'MonthlyCharges', 'TotalCharges', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']


In [7]:
df = df[ features + [target] ]


In [8]:
# Fill missing TotalCharges values with 0
df['TotalCharges'] = df['TotalCharges'].fillna(0)

# Handle Categorical Data
## one-hot encode each categorical feature into separate indicator columns
df = pd.get_dummies(df, columns = categorical_features)
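
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a toy frame. The `toy` data is made up for illustration and is not taken from the churn dataset. Each category value becomes its own indicator column, which is why the feature list grows after this step.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the encoding.
toy = pd.DataFrame({
    "tenure": [1, 24, 60],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# One indicator column is created per distinct value of "Contract";
# the numeric column "tenure" is left untouched.
encoded = pd.get_dummies(toy, columns=["Contract"])
print(list(encoded.columns))
```

The new column names follow the pattern `<feature>_<value>`, which matches names such as `Contract_One year` in the training frame.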


In [9]:
# transform target to 1 if Yes, 0 if No
df[target] = (df[target] == 'Yes').astype(int)

In [10]:
# Split data
## Assuming df_test is new, unseen data
df_train, df_test = train_test_split(df, test_size=0.33, random_state=random_state)

In [11]:
df_train.head()

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Churn,gender_Female,gender_Male,SeniorCitizen_0,SeniorCitizen_1,Partner_No,Partner_Yes,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
2632,55,64.75,3617.1,0,True,False,True,False,True,False,...,True,False,False,True,False,True,False,False,False,True
1210,17,69.0,1149.65,1,False,True,True,False,False,True,...,False,True,False,False,False,True,False,False,True,False
5018,72,19.7,1379.8,0,True,False,True,False,False,True,...,False,False,False,True,True,False,False,True,False,False
4891,4,65.6,250.1,0,False,True,True,False,False,True,...,False,True,False,False,True,False,False,False,True,False
3794,8,54.75,445.85,0,False,True,True,False,False,True,...,False,True,False,False,False,True,False,False,False,True


In [12]:
df_train.columns

Index(['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn', 'gender_Female',
       'gender_Male', 'SeniorCitizen_0', 'SeniorCitizen_1', 'Partner_No',
       'Partner_Yes', 'Dependents_No', 'Dependents_Yes', 'PhoneService_No',
       'PhoneService_Yes', 'MultipleLines_No',
       'MultipleLines_No phone service', 'MultipleLines_Yes',
       'InternetService_DSL', 'InternetService_Fiber optic',
       'InternetService_No', 'OnlineSecurity_No',
       'OnlineSecurity_No internet service', 'OnlineSecurity_Yes',
       'OnlineBackup_No', 'OnlineBackup_No internet service',
       'OnlineBackup_Yes', 'DeviceProtection_No',
       'DeviceProtection_No internet service', 'DeviceProtection_Yes',
       'TechSupport_No', 'TechSupport_No internet service', 'TechSupport_Yes',
       'StreamingTV_No', 'StreamingTV_No internet service', 'StreamingTV_Yes',
       'StreamingMovies_No', 'StreamingMovies_No internet service',
       'StreamingMovies_Yes', 'Contract_Month-to-month', 'Contract_One year',

In [13]:
features = list(df_train.columns)
features.remove(target)

features

['tenure',
 'MonthlyCharges',
 'TotalCharges',
 'gender_Female',
 'gender_Male',
 'SeniorCitizen_0',
 'SeniorCitizen_1',
 'Partner_No',
 'Partner_Yes',
 'Dependents_No',
 'Dependents_Yes',
 'PhoneService_No',
 'PhoneService_Yes',
 'MultipleLines_No',
 'MultipleLines_No phone service',
 'MultipleLines_Yes',
 'InternetService_DSL',
 'InternetService_Fiber optic',
 'InternetService_No',
 'OnlineSecurity_No',
 'OnlineSecurity_No internet service',
 'OnlineSecurity_Yes',
 'OnlineBackup_No',
 'OnlineBackup_No internet service',
 'OnlineBackup_Yes',
 'DeviceProtection_No',
 'DeviceProtection_No internet service',
 'DeviceProtection_Yes',
 'TechSupport_No',
 'TechSupport_No internet service',
 'TechSupport_Yes',
 'StreamingTV_No',
 'StreamingTV_No internet service',
 'StreamingTV_Yes',
 'StreamingMovies_No',
 'StreamingMovies_No internet service',
 'StreamingMovies_Yes',
 'Contract_Month-to-month',
 'Contract_One year',
 'Contract_Two year',
 'PaperlessBilling_No',
 'PaperlessBilling_Yes',
 'P

# Evaluation metrics comparison from several models

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 47 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   tenure                                   7043 non-null   int64  
 1   MonthlyCharges                           7043 non-null   float64
 2   TotalCharges                             7043 non-null   float64
 3   Churn                                    7043 non-null   int64  
 4   gender_Female                            7043 non-null   bool   
 5   gender_Male                              7043 non-null   bool   
 6   SeniorCitizen_0                          7043 non-null   bool   
 7   SeniorCitizen_1                          7043 non-null   bool   
 8   Partner_No                               7043 non-null   bool   
 9   Partner_Yes                              7043 non-null   bool   
 10  Dependents_No                            7043 no

## Train & Evaluate Decision Tree Classifier

with specs
```
max depth = 7
class weight = balanced
random state = 1234
```

Train the decision tree with the parameters specified above.

In [15]:
# import model
from sklearn.tree import DecisionTreeClassifier

# initiate model

model_tree = DecisionTreeClassifier(max_depth=7, random_state=random_state, class_weight = "balanced")

# Train model

model_tree.fit(df_train[features].values, df_train[target].values)


Evaluation

In [17]:
# Evaluate Precision, Recall, and F1 using Test Data

prediction = model_tree.predict(df_test[features])
label = df_test[target].values

#Print
print("precision_score\t:" ,precision_score(label, prediction))
print("recall_score \t:" ,recall_score(label, prediction))
print("f1_score \t:" ,f1_score(label, prediction))

precision_score	: 0.46303901437371664
recall_score 	: 0.754180602006689
f1_score 	: 0.573791348600509
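
As a sanity check, the three scores can be rebuilt by hand from confusion-matrix counts. The sketch below assumes counts inferred from this notebook's own outputs (the Case 3 cost figures and the 598 churners in the test set); it is not part of the original assignment code.

```python
# Counts inferred from the notebook outputs (an assumption of this sketch):
#   missed churners = 73500 / 500 = 147  -> false negatives
#   predicted churn = 97400 / 100 = 974  -> true positives + false positives
#   actual churners = 598
fn = 147
predicted_churn = 974
actual_churn = 598

tp = actual_churn - fn      # churners the tree correctly flagged
fp = predicted_churn - tp   # stayers wrongly flagged as churn

precision = tp / (tp + fp)  # share of flagged customers who really churn
recall = tp / (tp + fn)     # share of real churners that were flagged
f1 = 2 * precision * recall / (precision + recall)

print("precision:", precision)
print("recall   :", recall)
print("f1       :", f1)
```

The high recall and modest precision reflect `class_weight="balanced"`: the tree flags many customers, catching most churners at the price of more false alarms.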


## Train & Evaluate Random Forest

with specs
```
n estimators = 10
max_depth = 3
random_state=random_state
class_weight = 'balanced'
```

In [18]:
# import model
from sklearn.ensemble import RandomForestClassifier

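# initiate model
# NOTE: these hyperparameters (n_estimators=75, max_depth=8) differ from the
# specs listed above (n estimators = 10, max_depth = 3); the outputs shown
# below come from the values used in this cell.
model_rf = RandomForestClassifier(n_estimators= 75, max_depth=8, random_state=1234, class_weight = "balanced")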

# Train model
model_rf.fit(df_train[features].values, df_train[target].values)

In [20]:
# Evaluate Precision, Recall, and F1 using Test Data
prediction_rf= model_rf.predict(df_test[features])
label_rf= df_test[target].values

#Print
print("precision_score\t:" ,round(precision_score(label_rf, prediction_rf),4))
print("recall_score \t:" ,round(recall_score(label_rf, prediction_rf),4))
print("f1_score \t:" ,round(f1_score(label_rf, prediction_rf),4))

precision_score	: 0.5158
recall_score 	: 0.7625
f1_score 	: 0.6154


## Train & Evaluate Your own model

Feel free to pick any classification model in https://scikit-learn.org/stable/supervised_learning.html

You are required to reach an f1_score higher than `0.61`.


In [21]:
# import model
from sklearn.ensemble import RandomForestClassifier

# initiate model
model_roy = RandomForestClassifier(n_estimators= 75, max_depth=8, random_state=1234, class_weight = "balanced")


# Train model
model_roy.fit(df_train[features].values, df_train[target].values)


In [22]:
# Evaluate Precision, Recall, and F1 using Test Data
prediction_roy= model_roy.predict(df_test[features])
label_roy= df_test[target].values

#Print
print("precision_score\t:" ,precision_score(label_roy, prediction_roy))
print("recall_score \t:" ,recall_score(label_roy, prediction_roy))
print("f1_score \t:" ,f1_score(label_roy, prediction_roy))

precision_score	: 0.5158371040723982
recall_score 	: 0.7625418060200669
f1_score 	: 0.6153846153846154


# Business impact comparison from several models

Recall the assumptions:

* For each customer that churns, we lose $500.

* The engagement program costs $100 per customer, and

* All customers that get engagement will stay.
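
These assumptions reduce to one cost formula: lost value from churners we fail to engage, plus the cost of every engagement we send. A minimal sketch follows; the helper name `campaign_cost` and its parameter names are hypothetical, and the customer counts are the test-set totals reported in this notebook.

```python
def campaign_cost(missed_churners, targeted_customers,
                  loss_per_churn=500, cost_per_engagement=100):
    """Total cost = value lost to missed churners + cost of engagements sent."""
    value_lost = missed_churners * loss_per_churn
    engagement_cost = targeted_customers * cost_per_engagement
    return value_lost + engagement_cost

# No engagement: all 598 churners are missed and nobody is targeted.
cost_no_engagement = campaign_cost(missed_churners=598, targeted_customers=0)

# Engage everyone: no churner is missed and all 2325 customers are targeted.
cost_engage_all = campaign_cost(missed_churners=0, targeted_customers=2325)

print(cost_no_engagement, cost_engage_all)
```

For a model-driven campaign, `missed_churners` is the number of false negatives and `targeted_customers` is the number of customers predicted to churn.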

----
We want to compare the business impact of:
* Case 1: if no engagement program
* Case 2: if we send the engagement program to all users
* Case 3: if we send the engagement program based on the decision tree above (`model_tree`)
* Case 4: if we send the engagement program based on the random forest above (`model_rf`)
* Case 5: if we send the engagement program based on the best model above (`model_roy`)

----

First, we count the total customers and the churned customers in the test dataset.

In [23]:
total_customer = len(df_test)
real_churn = len(df_test.loc[df_test[target] == 1])

print("Total customer \t:", total_customer)
print("Total churn \t:", real_churn)

Total customer 	: 2325
Total churn 	: 598


Save the assumptions into variables.

In [24]:
churn_value_lost_per_customer = 500
engagement_cost_per_customer = 100

print("Churn Value Lost per customer\t:", churn_value_lost_per_customer)
print("Engagement Cost per customer\t:", engagement_cost_per_customer)

Churn Value Lost per customer	: 500
Engagement Cost per customer	: 100


## Case 1: if no engagement program

In [25]:
print("CASE 1: If no engagement program")

value_lost_case1 = real_churn * churn_value_lost_per_customer
engagement_cost_case1 = 0 # because no engagement
total_cost_case1 = value_lost_case1 + engagement_cost_case1
print("\t Value Lost \t: $", value_lost_case1)
print("\t Engagement cost: $", engagement_cost_case1)
print("\t Total cost \t: $",  total_cost_case1)


CASE 1: If no engagement program
	 Value Lost 	: $ 299000
	 Engagement cost: $ 0
	 Total cost 	: $ 299000


## Case 2: if we send engagement program to all user

In [26]:
print("Case 2: if we send engagement program to all user")

value_lost_case2 = 0 # because no customer lost
engagement_cost_case2 = total_customer * engagement_cost_per_customer
total_cost_case2 = value_lost_case2 + engagement_cost_case2
print("\t Value Lost \t: $", value_lost_case2)
print("\t Engagement cost: $", engagement_cost_case2)
print("\t Total cost \t: $",  total_cost_case2)


Case 2: if we send engagement program to all user
	 Value Lost 	: $ 0
	 Engagement cost: $ 232500
	 Total cost 	: $ 232500


It looks like sending the engagement program to all customers is more beneficial for the company than doing nothing (232500 < 299000).
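
The reason engaging everyone beats doing nothing can be read off the cost ratio: blanket engagement is cheaper whenever the churn rate exceeds engagement cost divided by churn loss. The sketch below reuses the figures from this notebook; the variable names `break_even_churn_rate` and `observed_churn_rate` are introduced here for illustration.

```python
churn_value_lost_per_customer = 500
engagement_cost_per_customer = 100
total_customer = 2325
real_churn = 598

# Engaging all costs total_customer * 100; doing nothing costs real_churn * 500.
# Engaging all is cheaper when real_churn / total_customer > 100 / 500.
break_even_churn_rate = engagement_cost_per_customer / churn_value_lost_per_customer
observed_churn_rate = real_churn / total_customer

engage_all_is_cheaper = observed_churn_rate > break_even_churn_rate
print("break-even churn rate:", break_even_churn_rate)
print("observed churn rate  :", round(observed_churn_rate, 4))
print("engage all is cheaper:", engage_all_is_cheaper)
```

With a churn rate only a few points above the break-even rate, the margin is thin, which is why targeting with a model has room to do better.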

But let's see how the models perform.

## Case 3: if we send engagement program based on above decision tree (`model_tree`)

Tip: you need to find
* how many customers are predicted to churn (`predict_churn`)
* how many customers actually churn **but** are predicted to stay (`real_churn_predict_stay`)

Hint: you can use the confusion matrix
```python
confusion_matrix(y_true_test, y_pred_test)
```
Explore the indexing of the `confusion_matrix` result, e.g. use `[0,0]` to get an individual count from the matrix
```python
confusion_matrix(y_true_test, y_pred_test)[0,0]
```

As a reminder, this is the layout of the confusion matrix:
![Confusion matrix](https://miro.medium.com/v2/resize:fit:974/1*H_XIN0mknyo0Maw4pKdQhw.png)
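
The indexing can be tried on a tiny example first. In scikit-learn, rows of the matrix are the true labels and columns are the predicted labels, so with labels `0` (stay) and `1` (churn) the cell `[1,0]` holds real churners predicted as stay. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = stay, 1 = churn.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)

# Flattening row by row gives the counts in this order:
# [0,0] true stay / predicted stay, [0,1] true stay / predicted churn,
# [1,0] true churn / predicted stay, [1,1] true churn / predicted churn.
tn, fp, fn, tp = cm.ravel()

predict_churn = fp + tp          # everyone flagged as churn
real_churn_predict_stay = fn     # churners the model missed

print(cm)
print("predict_churn:", predict_churn)
print("real_churn_predict_stay:", real_churn_predict_stay)
```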

In [27]:
print("CASE 3:  if we send engagement program based on above decision tree (model_tree)")

conf_mat = confusion_matrix(label, prediction)
conf_mat

real_stay_predict_stay = conf_mat[0,0]
real_stay_predict_churn = conf_mat[0,1]
real_churn_predict_stay = conf_mat[1,0]
real_churn_predict_churn = conf_mat[1,1]

real_churn = real_churn_predict_stay + real_churn_predict_churn
predict_churn = real_stay_predict_churn + real_churn_predict_churn



print("\t Value Lost \t: $",  real_churn_predict_stay * churn_value_lost_per_customer)
print("\t Engagement cost: $", predict_churn * engagement_cost_per_customer)
print("\t Total cost \t: $",  (real_churn_predict_stay * churn_value_lost_per_customer) + (predict_churn * engagement_cost_per_customer))




CASE 3:  if we send engagement program based on above decision tree (model_tree)
	 Value Lost 	: $ 73500
	 Engagement cost: $ 97400
	 Total cost 	: $ 170900


## Case 4: if we send engagement program based on above random forest (`model_rf`)  

In [28]:
print("Case 4: if we send engagement program based on above random forest (model_rf) ")
conf_mat_rf = confusion_matrix(label_rf, prediction_rf)
conf_mat_rf

real_stay_predict_stay_rf= conf_mat_rf[0,0]
real_stay_predict_churn_rf = conf_mat_rf[0,1]
real_churn_predict_stay_rf = conf_mat_rf[1,0]
real_churn_predict_churn_rf = conf_mat_rf[1,1]

real_churn_rf = real_churn_predict_stay_rf + real_churn_predict_churn_rf
predict_churn_rf = real_stay_predict_churn_rf + real_churn_predict_churn_rf

print("\t Value Lost \t: $",  real_churn_predict_stay_rf * churn_value_lost_per_customer)
print("\t Engagement cost: $", predict_churn_rf * engagement_cost_per_customer)
print("\t Total cost \t: $",  (real_churn_predict_stay_rf * churn_value_lost_per_customer) + (predict_churn_rf * engagement_cost_per_customer))


Case 4: if we send engagement program based on above random forest (model_rf) 
	 Value Lost 	: $ 71000
	 Engagement cost: $ 88400
	 Total cost 	: $ 159400


## Case 5: if we send engagement program based on the best model above (`model_roy`)

In [30]:
print("Case 5: if we send engagement program based on above the best model (model)")
conf_mat_best = confusion_matrix(label_roy, prediction_roy)
conf_mat_best

real_stay_predict_stay_best= conf_mat_best[0,0]
real_stay_predict_churn_best = conf_mat_best[0,1]
real_churn_predict_stay_best = conf_mat_best[1,0]
real_churn_predict_churn_best = conf_mat_best[1,1]

real_churn_best = real_churn_predict_stay_best + real_churn_predict_churn_best
predict_churn_best = real_stay_predict_churn_best + real_churn_predict_churn_best

print("\t Value Lost \t: $",  real_churn_predict_stay_best * churn_value_lost_per_customer)
print("\t Engagement cost: $", predict_churn_best * engagement_cost_per_customer)
print("\t Total cost \t: $",  (real_churn_predict_stay_best * churn_value_lost_per_customer) + (predict_churn_best * engagement_cost_per_customer))


Case 5: if we send engagement program based on above the best model (model)
	 Value Lost 	: $ 71000
	 Engagement cost: $ 88400
	 Total cost 	: $ 159400


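Based on the calculations above, compared with the no-engagement baseline (Case 1):

Case 3:
we can reduce the cost from 299000 to 170900, or 42.8%

Case 4:
we can reduce the cost from 299000 to 159400, or 46.7%

Case 5:
we can reduce the cost from 299000 to 159400, or 46.7%

Cases 4 and 5 give identical results because `model_rf` and `model_roy` were trained with the same hyperparameters.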
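
The percentages can be recomputed from the total costs printed in each case's output. The sketch below hard-codes those printed totals; the dictionary names are introduced here for illustration.

```python
# Total costs as printed in the case outputs above.
baseline = 299000  # Case 1: no engagement program
total_costs = {
    "Case 2": 232500,
    "Case 3": 170900,
    "Case 4": 159400,
    "Case 5": 159400,
}

# Saving relative to the no-engagement baseline, in percent.
savings_pct = {
    case: round((baseline - cost) / baseline * 100, 1)
    for case, cost in total_costs.items()
}

for case, pct in savings_pct.items():
    print(f"{case}: cost {total_costs[case]}, saving {pct}%")
```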