# Churn Prediction Model 

In [18]:
%pip install --upgrade pip
%pip install --upgrade scipy

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [19]:
# import libraries 
import pandas as pd 
from sklearn import metrics 
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from imblearn.combine import SMOTEENN

In [20]:
# load dataset 
df = pd.read_csv("/Users/tonyzhang/Desktop/customerchurn/data/telco_data_clean.csv")

In [21]:
df.head()

Unnamed: 0,SeniorCitizen,MonthlyCharges,TotalCharges,Churn,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,...,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,tenure_group_1 - 12,tenure_group_13 - 24,tenure_group_25 - 36,tenure_group_37 - 48,tenure_group_49 - 60,tenure_group_61 - 72
0,0,29.85,29.85,0,1,0,0,1,1,0,...,0,0,1,0,1,0,0,0,0,0
1,0,56.95,1889.5,0,0,1,1,0,1,0,...,0,0,0,1,0,0,1,0,0,0
2,0,53.85,108.15,1,0,1,1,0,1,0,...,0,0,0,1,1,0,0,0,0,0
3,0,42.3,1840.75,0,0,1,1,0,1,0,...,1,0,0,0,0,0,0,1,0,0
4,0,70.7,151.65,1,1,0,1,0,1,0,...,0,0,1,0,1,0,0,0,0,0


## 1. Create X and y variables

In [22]:
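# note: "Unnamed: 0" looks like the row index written out with the CSV; it stays in X as a feature here
X = df.drop("Churn", axis=1)
y = df['Churn']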

### 1.1 Train test split 

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  

## 2. Decision Tree Classifier 

#### 2.1 Initialize and Fit Model

In [24]:
dt_model = DecisionTreeClassifier(criterion='gini', max_depth=6, min_samples_leaf=8, random_state=0)
dt_model.fit(X_train, y_train)


#### 2.2 Model Prediction

In [25]:

# model prediction
y_pred = dt_model.predict(X_test)
y_pred

array([0, 0, 0, ..., 1, 0, 0])

The model prediction (y_pred) is an array of 0s and 1s: one binary prediction for each row of X_test, where 1 means the customer is predicted to churn and 0 means the customer is predicted not to churn.

#### 2.3 Evaluate Model Performance

In [26]:
print(classification_report(y_test, y_pred, labels=[0, 1]))

              precision    recall  f1-score   support

           0       0.83      0.88      0.86      1038
           1       0.60      0.51      0.55       369

    accuracy                           0.78      1407
   macro avg       0.72      0.70      0.71      1407
weighted avg       0.77      0.78      0.78      1407



#### Summary of Classification Report
- Precision: ratio of true positive predictions to all positive predictions (true positives + false positives)
- Recall: ratio of true positive predictions to all actual positives (true positives + false negatives)

The model performs better on class 0 than on class 1, as shown by the higher precision, recall, and f1-score for class 0. 
The overall accuracy of 78% could be improved by raising the model's performance on class 1, where recall is only 0.51: about half of the customers who actually churn are missed. 

The weighted average scores are closer to the class 0 scores because class 0 has a much larger support than class 1 (1038 vs 369), so its metrics carry more weight. To address this imbalance we apply a resampling technique, SMOTEENN, which combines SMOTE over-sampling of the minority class with Edited Nearest Neighbours cleaning. 
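
The pull of the larger class on the weighted average can be checked by hand. The sketch below is only an illustration: it copies the rounded per-class scores and supports from the report above into a hypothetical `report` dictionary and recomputes both averages, so the results can differ from the printed report in the last digit.

```python
# Per-class scores and supports, copied from the classification report above.
# The dictionary layout is an assumption made for this illustration.
report = {
    0: {"precision": 0.83, "recall": 0.88, "support": 1038},
    1: {"precision": 0.60, "recall": 0.51, "support": 369},
}
total_support = sum(row["support"] for row in report.values())


def macro_avg(metric):
    """Unweighted mean: every class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(metric):
    """Mean weighted by support: larger classes count more."""
    return sum(row[metric] * row["support"] for row in report.values()) / total_support


for metric in ("precision", "recall"):
    print(f"{metric}: macro={macro_avg(metric):.3f} weighted={weighted_avg(metric):.3f}")
```

For both metrics the weighted value sits above the macro value, because class 0 accounts for roughly three quarters of the test rows.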

#### 2.4 Confusion Matrix 

https://glassboxmedicine.com/2019/02/17/measuring-performance-the-confusion-matrix/


In [27]:
print(confusion_matrix(y_test, y_pred))

[[915 123]
 [181 188]]


- 915 true negatives (actual 0, predicted 0)
- 188 true positives (actual 1, predicted 1)
- 123 false positives (actual 0 but predicted 1)
- 181 false negatives (actual 1 but predicted 0)
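
As a cross-check, the class 1 row of the classification report can be recomputed from these four counts. This is a minimal sketch that treats churn (1) as the positive class and relies on scikit-learn's layout for `confusion_matrix`: true labels on rows, predicted labels on columns. The variable names are chosen for the illustration.

```python
# Confusion matrix as printed above: rows = true labels, columns = predicted labels.
cm = [[915, 123],
      [181, 188]]
(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)            # share of predicted churners who actually churned
recall = tp / (tp + fn)               # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.60 recall=0.51 f1=0.55 accuracy=0.78
```

These values match the class 1 row and the accuracy line of the report in 2.3.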

#### 2.5 SMOTEENN to balance instances 


In [28]:
sm = SMOTEENN()
X_resampled, y_resampled = sm.fit_resample(X, y)

In [29]:
# rerun the DecisionTreeClassifier on the newly resampled dataset.
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=0)
dt_model_resampled = DecisionTreeClassifier(criterion='gini', max_depth=6, min_samples_leaf=8, random_state=0)
dt_model_resampled.fit(Xr_train, yr_train)

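y_pred_resampled = dt_model_resampled.predict(Xr_test)
# note: the arguments are passed as (y_pred, y_true), the reverse of classification_report's (y_true, y_pred) order,
# so the precision and recall columns below are swapped and support counts predicted labels
print(classification_report(y_pred_resampled, yr_test, labels=[0, 1]))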

              precision    recall  f1-score   support

           0       0.93      0.94      0.94       513
           1       0.95      0.94      0.95       648

    accuracy                           0.94      1161
   macro avg       0.94      0.94      0.94      1161
weighted avg       0.94      0.94      0.94      1161



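Scores are much higher after resampling the dataset. Because SMOTEENN was applied to the full dataset before the train/test split, the test set here also contains resampled points, so these scores are not directly comparable with the ones in 2.3. We will now implement a RandomForestClassifier to improve the model's performance and robustness.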
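
One way to keep the evaluation honest is to split first and resample only the training portion, so the test set keeps its original class balance. The sketch below illustrates that ordering only. It uses plain random oversampling as a stand-in for SMOTEENN so that it runs without extra packages, and the function name and toy data are hypothetical. With imbalanced-learn, the same ordering would mean calling `fit_resample` on `X_train, y_train` instead of on `X, y`.

```python
import random
from collections import Counter


def split_then_oversample(rows, labels, test_fraction=0.2, seed=0):
    """Hold out a test set first, then balance only the training portion."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Group the training indices by class label.
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)

    # Duplicate random rows of the smaller classes until all classes are equal in size.
    target = max(len(members) for members in by_class.values())
    balanced_idx = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced_idx.extend(members + extra)
    rng.shuffle(balanced_idx)

    train = ([rows[i] for i in balanced_idx], [labels[i] for i in balanced_idx])
    test = ([rows[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Toy data (hypothetical): 75 non-churners (0) and 25 churners (1).
rows = [[float(i)] for i in range(100)]
labels = [0] * 75 + [1] * 25

(train_X, train_y), (test_X, test_y) = split_then_oversample(rows, labels)
train_counts = Counter(train_y)
print("train:", dict(train_counts))
print("test: ", dict(Counter(test_y)))
```

The training classes come out balanced, while the held-out rows are untouched originals that never appear in the training data.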

## 3. Random Forest Classifier 

In [30]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=6, min_samples_leaf=8, random_state=0)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
print(classification_report(y_test, y_pred, labels=[0,1]))

              precision    recall  f1-score   support

           0       0.82      0.93      0.87      1038
           1       0.68      0.44      0.53       369

    accuracy                           0.80      1407
   macro avg       0.75      0.68      0.70      1407
weighted avg       0.78      0.80      0.78      1407



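Compared with the DecisionTreeClassifier before resampling, accuracy is slightly higher (0.80 vs 0.78) and class 1 precision improves (0.68 vs 0.60), but class 1 recall drops (0.44 vs 0.51). Let's apply SMOTEENN to the dataset and see if metrics improve for the Random Forest Classifier. 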

#### 3.1 Apply SMOTEENN

In [31]:
sm = SMOTEENN()
X_resampled_rf, y_resampled_rf = sm.fit_resample(X, y)
Xr_train_rf, Xr_test_rf, yr_train_rf, yr_test_rf = train_test_split(X_resampled_rf, y_resampled_rf, test_size=0.2, random_state=0)

rf_model_resampled = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=6, min_samples_leaf=8, random_state=0)
rf_model_resampled.fit(Xr_train_rf, yr_train_rf)
y_pred_resampled_rf = rf_model_resampled.predict(Xr_test_rf)

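# note: the arguments are passed as (y_pred, y_true), the reverse of classification_report's (y_true, y_pred) order,
# so the precision and recall columns below are swapped and support counts predicted labels
print(classification_report(y_pred_resampled_rf, yr_test_rf, labels=[0,1]))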

              precision    recall  f1-score   support

           0       0.89      0.93      0.91       498
           1       0.95      0.91      0.93       681

    accuracy                           0.92      1179
   macro avg       0.92      0.92      0.92      1179
weighted avg       0.92      0.92      0.92      1179



In [32]:
# note: with (y_pred, y_true) as arguments, rows are predicted labels and columns are true labels
print(confusion_matrix(y_pred_resampled_rf, yr_test_rf))

[[465  33]
 [ 58 623]]
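
Because the predictions were passed as the first argument, the rows of this matrix are predicted labels and the columns are true labels. Transposing it gives the usual layout, from which the class 1 scores and the true supports can be recovered. This is a minimal sketch and the variable names are chosen for the illustration.

```python
# Matrix as printed above: rows = predicted labels, columns = true labels.
printed = [[465, 33],
           [58, 623]]

# Transpose to the usual layout: rows = true labels, columns = predicted labels.
cm = [list(column) for column in zip(*printed)]
(tn, fp), (fn, tp) = cm

support = [sum(row) for row in cm]    # number of true samples per class
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
accuracy = (tn + tp) / sum(support)

print("support:", support)
print(f"class 1 precision={precision_1:.2f} recall={recall_1:.2f} accuracy={accuracy:.4f}")
```

Accuracy is unaffected by the transposition, so it agrees with the 0.9228 score reported for the reloaded model in section 4.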


## 4. Save the Model 

In [33]:
import pickle 

# save the model to disk
filename = 'finalized_model.sav'

# write to the file; the with block closes the file handle afterwards
with open(filename, 'wb') as f:
    pickle.dump(rf_model_resampled, f)

In [35]:
# load the model from disk; the with block closes the file handle afterwards
with open(filename, 'rb') as f:
    load_model = pickle.load(f)

# test the loaded model
model_score_r1 = load_model.score(Xr_test_rf, yr_test_rf) 
model_score_r1

0.9228159457167091