In [47]:
import pandas as pd
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model
import pickle
from IPython.display import Code



In [17]:
df = pd.read_csv("preped_churn_data.csv")

## Initialize the AutoML environment

In [18]:
automl_setup = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,Session id,8517
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 13)"
4,Transformed data shape,"(7043, 13)"
5,Transformed train set shape,"(4930, 13)"
6,Transformed test set shape,"(2113, 13)"
7,Numeric features,12
8,Preprocess,True
9,Imputation type,simple


The output summarizes the configuration and transformations that PyCaret's `setup` function applied for this binary classification task on the churn dataset. The session ID `8517` is the random seed for the run (it reappears as `random_state` in the model parameters), so passing it back as `session_id` makes the experiment reproducible. The target variable is `Churn`, and the dataset has 7043 rows and 13 columns.

The transformations leave the column count at 13. The rows are split roughly 70/30 into a training set of 4930 rows and a test set of 2113 rows. There are 12 numeric features, preprocessing is enabled, and the imputation type is `simple`: mean for numeric features and mode for categorical ones. The full setup output also reports stratified K-fold cross-validation with 10 folds, use of all available CPU cores (`-1`) without GPU acceleration, no experiment logging, the default experiment name `clf-default-name`, and the unique session identifier (USI) `ebba`. The table above shows only the first ten entries of that output.
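The train and test row counts in the setup table follow from the split fraction. The sketch below reproduces them, assuming PyCaret's default `train_size` of 0.7 and scikit-learn-style rounding in which the test count is rounded up. `split_sizes` is a hypothetical helper written for this illustration, not a PyCaret function.

```python
import math

def split_sizes(n_rows, train_size=0.7):
    """Return (n_train, n_test) for a fractional train/test split.

    Assumption: the test count is rounded up and the training set
    takes the remaining rows.
    """
    n_test = math.ceil(n_rows * (1 - train_size))
    n_train = n_rows - n_test
    return n_train, n_test

n_train, n_test = split_sizes(7043)
print(n_train, n_test)  # 4930 2113
```

Both counts match the `(4930, 13)` and `(2113, 13)` shapes reported by `setup`.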

## Compare various classification models and select the best-performing one

In [19]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7982,0.835,0.5298,0.6459,0.5812,0.4502,0.4546,2.499
gbc,Gradient Boosting Classifier,0.7949,0.8369,0.5069,0.6436,0.5662,0.4347,0.4405,0.477
ada,Ada Boost Classifier,0.7915,0.8357,0.4999,0.6351,0.5586,0.4249,0.4306,0.173
ridge,Ridge Classifier,0.7909,0.0,0.4641,0.6482,0.5401,0.4098,0.4197,0.031
lda,Linear Discriminant Analysis,0.7907,0.8258,0.5275,0.6245,0.5713,0.4343,0.4374,0.056
lightgbm,Light Gradient Boosting Machine,0.7858,0.8273,0.5153,0.6156,0.5606,0.4205,0.4237,29.083
rf,Random Forest Classifier,0.7757,0.8018,0.4862,0.5951,0.5343,0.3887,0.3926,0.412
svm,SVM - Linear Kernel,0.7732,0.0,0.4099,0.6202,0.4797,0.3466,0.3642,0.057
knn,K Neighbors Classifier,0.7671,0.7422,0.4565,0.5776,0.5096,0.3598,0.3642,0.071
et,Extra Trees Classifier,0.758,0.7771,0.4839,0.5502,0.5147,0.3545,0.3559,0.461


The table reports cross-validated metrics for each candidate model, sorted by accuracy, which is the default ranking metric of `compare_models`. Logistic Regression (`lr`) ranks first with an accuracy of 0.7982, an AUC of 0.8350, a recall of 0.5298, a precision of 0.6459, an F1 score of 0.5812, a Kappa of 0.4502, and an MCC of 0.4546. It also has the highest recall, F1, Kappa, and MCC in the table, although a recall of about 0.53 means it misses nearly half of the actual churners.

The lead is narrow and does not hold for every metric. The Gradient Boosting Classifier (`gbc`, AUC 0.8369) and Ada Boost Classifier (`ada`, AUC 0.8357) have slightly higher AUC, and the Ridge Classifier (`ridge`) has slightly higher precision (0.6482). The AUC of 0.0 for `ridge` and `svm` is most likely a placeholder rather than a measured score, since these models do not output class probabilities. On accuracy, Logistic Regression outperforms the Gradient Boosting, Ada Boost, Ridge, Linear Discriminant Analysis (`lda`), LightGBM (`lightgbm`), Random Forest (`rf`), SVM (`svm`), K Neighbors (`knn`), and Extra Trees (`et`) classifiers.
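Which model counts as "best" depends on the metric used for ranking. The sketch below re-ranks the comparison table by several metrics, using values copied from the output above. `best_by` is a hypothetical helper for this illustration. In PyCaret itself, the `sort` argument of `compare_models` changes the ranking metric.

```python
# (model id, accuracy, AUC, recall, precision) copied from the comparison table
results = [
    ("lr", 0.7982, 0.8350, 0.5298, 0.6459),
    ("gbc", 0.7949, 0.8369, 0.5069, 0.6436),
    ("ada", 0.7915, 0.8357, 0.4999, 0.6351),
    ("ridge", 0.7909, 0.0, 0.4641, 0.6482),
    ("lda", 0.7907, 0.8258, 0.5275, 0.6245),
    ("lightgbm", 0.7858, 0.8273, 0.5153, 0.6156),
    ("rf", 0.7757, 0.8018, 0.4862, 0.5951),
    ("svm", 0.7732, 0.0, 0.4099, 0.6202),
    ("knn", 0.7671, 0.7422, 0.4565, 0.5776),
    ("et", 0.7580, 0.7771, 0.4839, 0.5502),
]
METRIC_COLUMN = {"accuracy": 1, "auc": 2, "recall": 3, "precision": 4}

def best_by(metric):
    """Return the id of the model with the highest value for the given metric."""
    col = METRIC_COLUMN[metric]
    return max(results, key=lambda row: row[col])[0]

winners = {metric: best_by(metric) for metric in METRIC_COLUMN}
print(winners)
```

Logistic Regression wins on accuracy and recall, while `gbc` leads on AUC and `ridge` on precision, so the choice of ranking metric matters when the margins are this small.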


In [25]:
best_model.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 1000,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 8517,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [23]:
selected_rows = df.iloc[20:32]
selected_rows

Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,Churn,TotalCharges_to_MonthlyCharges_ratio,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,Contract_Month-to-month,Contract_One year,Contract_Two year
20,1,0,39.65,39.65,1,1.0,0,0,0,0,0,0,0
21,12,1,19.8,202.25,0,10.214646,1,0,1,0,1,1,0
22,1,1,20.15,20.15,1,1.0,0,0,1,1,0,0,0
23,58,1,59.9,3505.1,0,58.51586,0,1,1,0,1,0,1
24,49,1,59.6,2970.3,0,49.837248,0,1,1,0,0,0,0
25,30,1,55.3,1530.6,0,27.678119,1,0,1,0,0,0,0
26,47,1,99.35,4749.15,1,47.802214,0,0,0,0,0,0,0
27,1,0,30.2,30.2,1,1.0,0,0,0,0,0,0,0
28,72,1,90.25,6369.45,0,70.575623,0,1,1,0,1,0,1
29,17,1,64.7,1093.1,1,16.8949,0,0,1,1,0,0,0


## Predict churn for the selected rows using best_model

In [24]:
predict_model(best_model, selected_rows)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.6667,0.8571,0.4,0.6667,0.5,0.2727,0.2928


Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,TotalCharges_to_MonthlyCharges_ratio,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,Contract_Month-to-month,Contract_One year,Contract_Two year,Churn,prediction_label,prediction_score
20,1,0,39.650002,39.650002,1.0,0,0,0,0,0,0,0,1,1,0.6476
21,12,1,19.799999,202.25,10.214646,1,0,1,0,1,1,0,0,0,0.9219
22,1,1,20.15,20.15,1.0,0,0,1,1,0,0,0,1,0,0.7724
23,58,1,59.900002,3505.100098,58.515862,0,1,1,0,1,0,1,0,0,0.9869
24,49,1,59.599998,2970.300049,49.83725,0,1,1,0,0,0,0,0,0,0.9184
25,30,1,55.299999,1530.599976,27.67812,1,0,1,0,0,0,0,0,0,0.804
26,47,1,99.349998,4749.149902,47.802216,0,0,0,0,0,0,0,1,0,0.5762
27,1,0,30.200001,30.200001,1.0,0,0,0,0,0,0,0,1,1,0.591
28,72,1,90.25,6369.450195,70.575623,0,1,1,0,1,0,1,0,0,0.9732
29,17,1,64.699997,1093.099976,16.894899,0,0,1,1,0,0,0,1,0,0.6854
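
The one-row metrics table above can be reproduced from a confusion matrix. The counts in this sketch are an assumption: the notebook does not print them, and they were chosen because they are consistent with the reported values for 12 rows. AUC is left out because it depends on the predicted scores, not only on the counts.

```python
import math

# Assumed confusion counts for the 12 selected rows (not printed by the notebook)
tp, fp, fn, tn = 2, 1, 3, 6
n = tp + fp + fn + tn

accuracy = (tp + tn) / n
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Cohen's kappa: observed agreement corrected for agreement expected by chance
p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
kappa = (accuracy - p_chance) / (1 - p_chance)

# Matthews correlation coefficient
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [("Accuracy", accuracy), ("Recall", recall),
                    ("Prec.", precision), ("F1", f1),
                    ("Kappa", kappa), ("MCC", mcc)]:
    print(f"{name}: {value:.4f}")
```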


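Interpreting a few rows:

Row 20 - A customer with 1 month of tenure, no phone service (PhoneService = 0), and MonthlyCharges of 39.65. The model predicts Churn (prediction_label = 1) with a score of 0.6476, which matches the actual label.

Row 21 - A customer with 12 months of tenure, phone service, MonthlyCharges of 19.80, and TotalCharges of 202.25. The model predicts No Churn (prediction_label = 0) with a high score of 0.9219, which matches the actual label.

Row 22 - A customer with 1 month of tenure, phone service, and MonthlyCharges of 20.15. The model predicts No Churn (prediction_label = 0) with a score of 0.7724, but the customer actually churned (Churn = 1), so this is a false negative.

In each case `prediction_score` is the probability the model assigns to the label it predicted, not the probability of churn. A score of 0.7724 next to a No Churn label therefore corresponds to a churn probability of about 0.23. Across the 12 selected rows the model reaches an accuracy of 0.6667 and a recall of 0.4, so it misses most of the churners in this sample. Because the rows come from the same DataFrame that was passed to `setup`, some of them were probably seen during training, so these numbers are not an independent test.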
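To compare rows on a single scale, the score can be converted into a churn probability. This is a minimal sketch, assuming `prediction_score` is the probability of the predicted label (PyCaret's default output when raw scores are not requested). `churn_probability` is a hypothetical helper, and the values are copied from rows 20 to 22 of the prediction table.

```python
def churn_probability(prediction_label, prediction_score):
    """Convert a (label, score) pair into the probability of churn.

    Assumes the score is the probability of the predicted label, so a
    No Churn prediction with score s implies a churn probability of 1 - s.
    """
    if prediction_label == 1:
        return prediction_score
    return 1.0 - prediction_score

# row index -> (prediction_label, prediction_score)
rows = {20: (1, 0.6476), 21: (0, 0.9219), 22: (0, 0.7724)}

for idx, (label, score) in rows.items():
    print(idx, round(churn_probability(label, score), 4))
```

Under this assumption, row 22 carries a churn probability of roughly 0.23, which is low for a customer who did churn.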

## Save best model

In [26]:
save_model(best_model, 'LR')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'MonthlyCharges', 'TotalCharges',
                                              'TotalCharges_to_MonthlyCharges_ratio',
                                              'PaymentMethod_Bank transfer '
                                              '(automatic)',
                                              'PaymentMethod_Credit card '
                                              '(automatic)',
                                              'PaymentMethod_Electronic check',
                                              'PaymentMethod_Mailed check',
                                              'Contr...
                  TransformerWrapper(exclude=None, include=None,
                                     transformer=CleanColumnNames(match='[\\]\\[\

In [27]:
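# Pickle only the fitted estimator held in best_model;
# save_model above stores the full transformation pipeline as well
with open('LR_model.pk', 'wb') as f:
    pickle.dump(best_model, f)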

In [28]:
with open('LR_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)
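
The two cells above are a plain pickle round trip: the object is serialized to a file and an equal copy is read back. The sketch below shows the same pattern in isolation. The dictionary is a hypothetical stand-in for the fitted estimator, and a temporary directory is used so that nothing is left on disk.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted estimator (a plain dict keeps this self-contained)
stand_in_model = {"name": "LR", "params": {"C": 1.0, "penalty": "l2", "solver": "lbfgs"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "LR_model.pk")

    # Serialize the object to disk
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Read it back as a new, independent object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)  # True: same contents
print(restored is stand_in_model)  # False: a separate copy
```

Unpickling executes code embedded in the file, so this pattern is only appropriate for files from a trusted source.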

## Create new_data from the selected rows

In [29]:
new_data = selected_rows.drop('Churn', axis=1).copy()
new_data.to_csv('new_churn_data.csv', index=False)

## Predict churn for the new data

In [30]:
loaded_model.predict(new_data)

array([1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1], dtype=int8)

In [32]:
loaded_lr = load_model('LR')
predict_model(loaded_lr, new_data)

Transformation Pipeline and Model Successfully Loaded


Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,TotalCharges_to_MonthlyCharges_ratio,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,Contract_Month-to-month,Contract_One year,Contract_Two year,prediction_label,prediction_score
20,1,0,39.650002,39.650002,1.0,0,0,0,0,0,0,0,1,0.6476
21,12,1,19.799999,202.25,10.214646,1,0,1,0,1,1,0,0,0.9219
22,1,1,20.15,20.15,1.0,0,0,1,1,0,0,0,0,0.7724
23,58,1,59.900002,3505.100098,58.515862,0,1,1,0,1,0,1,0,0.9869
24,49,1,59.599998,2970.300049,49.83725,0,1,1,0,0,0,0,0,0.9184
25,30,1,55.299999,1530.599976,27.67812,1,0,1,0,0,0,0,0,0.804
26,47,1,99.349998,4749.149902,47.802216,0,0,0,0,0,0,0,0,0.5762
27,1,0,30.200001,30.200001,1.0,0,0,0,0,0,0,0,1,0.591
28,72,1,90.25,6369.450195,70.575623,0,1,1,0,1,0,1,0,0.9732
29,17,1,64.699997,1093.099976,16.894899,0,0,1,1,0,0,0,0,0.6854


## Display predict_churn.py script

In [43]:
# Pass the path as filename= so Code shows the file's contents, not the literal string
code_display = Code(filename='predict_churn.py')
code_display

## Execute script

In [45]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


0        Churn
1     No Churn
2     No Churn
3     No Churn
4     No Churn
5     No Churn
6     No Churn
7        Churn
8     No Churn
9     No Churn
10    No Churn
11       Churn
Name: Churn_prediction, dtype: object


The script loads the saved model and prints one prediction per row of the new data. The `Churn_prediction` series holds the predicted labels: `Churn` for customers predicted to leave and `No Churn` for customers predicted to stay. Three of the 12 rows (positions 0, 7, and 11) are predicted to churn, matching the array returned earlier by `loaded_model.predict(new_data)`.
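The contents of `predict_churn.py` are not shown in this notebook, so the following is only a guess at its final step: turning numeric predictions into readable labels. The `LABELS` mapping and the variable names are assumptions. The input values are copied from the array returned by `loaded_model.predict(new_data)`.

```python
# Assumed mapping from numeric class to display label
LABELS = {1: "Churn", 0: "No Churn"}

# Predictions copied from the loaded_model.predict(new_data) output
raw_predictions = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]

churn_prediction = [LABELS[p] for p in raw_predictions]

for position, label in enumerate(churn_prediction):
    print(position, label)
```

The resulting labels line up with the script output: `Churn` at positions 0, 7, and 11, and `No Churn` elsewhere.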

## Summary

The goal of this notebook is to predict customer churn with PyCaret. The churn data is read from a CSV file into a pandas DataFrame, and an AutoML environment is set up with PyCaret's `setup` function, with `Churn` as the target variable. `compare_models` then evaluates a range of classification models and returns the one with the highest accuracy, Logistic Regression.

Twelve rows of the dataset (positions 20 through 31) are extracted, and the best model predicts the target for them. The model is saved twice: as `LR` with PyCaret's `save_model` function, which also stores the transformation pipeline, and as `LR_model.pk` with pickle. The pickled model is then loaded back into memory.

A new dataset (`new_data`) is created by copying the selected rows and dropping the `Churn` column. The unpickled model predicts the target for this new dataset, and the `LR` model is also loaded again with PyCaret's `load_model` function to make the same predictions.

Finally, the `predict_churn.py` script is displayed with IPython's `Code` display and run with `%run predict_churn.py`, which makes churn predictions using the saved model.