In [None]:
import pandas as pd
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model
import pickle
from IPython.display import Code

## Load dataset

In [3]:
df = pd.read_csv("new_churn_data.csv")
df

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Churn,MonthlyCharges_log,TotalCharges_Tenure_Ratio,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year
0,1,29.85,29.85,0,3.396185,29.850000,1.000000,0,0,0,0,0,0,0
1,34,56.95,1889.50,0,4.042174,55.573529,0.030140,0,0,1,1,1,1,0
2,2,53.85,108.15,1,3.986202,54.075000,0.497920,0,0,1,1,0,0,0
3,45,42.30,1840.75,0,3.744787,40.905556,0.022980,1,0,1,0,1,1,0
4,2,70.70,151.65,1,4.258446,75.825000,0.466205,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7027,24,84.80,1990.50,0,4.440296,82.937500,0.042602,0,0,1,1,1,1,0
7028,72,103.20,7362.90,0,4.636669,102.262500,0.014016,0,1,1,0,1,1,0
7029,11,29.60,346.45,0,3.387774,31.495455,0.085438,0,0,0,0,0,0,0
7030,4,74.40,306.60,1,4.309456,76.650000,0.242661,0,0,1,1,0,0,0


## Initialize the auto ML environment

In [4]:
automl_setup = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,Session id,8322
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 14)"
4,Transformed data shape,"(7032, 14)"
5,Transformed train set shape,"(4922, 14)"
6,Transformed test set shape,"(2110, 14)"
7,Numeric features,13
8,Preprocess,True
9,Imputation type,simple


The output summarizes the setup information for the PyCaret auto ML environment.

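Session id: 8322 - The random seed for the session. PyCaret generates one when session_id is not passed to setup, and reusing the same value makes the train/test split and the results reproducible.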

Target: Churn - The target variable for the classification task is Churn.

Target type: Binary - The target variable is binary, indicating a binary classification task (Churn or no Churn).

Original data shape: (7032, 14) - The original dataset has 7032 rows and 14 columns.

Transformed data shape: (7032, 14) - The transformed dataset after preprocessing remains the same size as the original dataset.

Transformed train set shape: (4922, 14) - The training set after preprocessing contains 4922 samples.

Transformed test set shape: (2110, 14) - The test set after preprocessing contains 2110 samples.

Numeric features: 13 - There are 13 numeric features in the dataset.

Preprocess: True - The data has been preprocessed.

Imputation type: simple - Simple imputation method has been used for handling missing values.

Numeric imputation: mean - Mean imputation has been applied to numeric features.

Categorical imputation: mode - Mode imputation has been applied to categorical features.

Fold Generator: StratifiedKFold - Stratified K-Fold cross-validation is used during model training.

Number: 10 - 10 folds are used in cross-validation.

CPU Jobs: -1 - The number of CPU jobs is set to -1, allowing PyCaret to utilize all available CPUs.

Use GPU: False - GPU acceleration is not utilized for model training.

Log Experiment: False - Logging of the experiment is turned off.

Experiment Name: clf-default-name - The default name for the classification experiment is 'clf-default-name'.

USI: 627f - A unique identifier for the experiment setup.
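The train and test shapes above follow from the split fraction, and the fold generator keeps the class ratio in each fold. Below is a minimal sketch, assuming PyCaret's default train_size of 0.7 (the notebook does not set it explicitly). The helper names `split_sizes` and `stratified_folds` are hypothetical, and the folding logic is a simplified illustration of the idea behind StratifiedKFold, not its actual implementation.

```python
import math
from collections import Counter, defaultdict


def split_sizes(n_rows, train_size=0.7):
    """Return (train, test) row counts for a fractional train split."""
    n_train = math.floor(n_rows * train_size)
    return n_train, n_rows - n_train


def stratified_folds(labels, n_folds):
    """Assign each row index to a fold so every fold keeps roughly the class ratio.

    Rows of each class are dealt round-robin across the folds.
    """
    folds = defaultdict(list)
    seen_per_class = Counter()
    for idx, label in enumerate(labels):
        fold = seen_per_class[label] % n_folds
        folds[fold].append(idx)
        seen_per_class[label] += 1
    return [folds[f] for f in range(n_folds)]


# 7032 rows with a 70/30 split, as in the setup summary.
train_rows, test_rows = split_sizes(7032)
print(train_rows, test_rows)

# Toy labels (not the churn data): 15 non-churners and 5 churners in 5 folds.
toy_labels = [0] * 15 + [1] * 5
toy_folds = stratified_folds(toy_labels, n_folds=5)
fold_class_counts = [Counter(toy_labels[i] for i in fold) for fold in toy_folds]
print(fold_class_counts)
```

Each toy fold ends up with the same mix of classes, which is the property that makes stratified folds useful when one class (here, churners) is the minority.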

In [5]:
automl_type = type(automl_setup)
automl_type


pycaret.classification.oop.ClassificationExperiment

## Compare models and select the best one

In [7]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.8007,0.8379,0.5306,0.6553,0.5855,0.4564,0.4613,0.047
ridge,Ridge Classifier,0.7989,0.0,0.4725,0.6735,0.5545,0.43,0.4417,0.081
lr,Logistic Regression,0.7987,0.8384,0.5207,0.6521,0.5784,0.4485,0.4537,3.067
ada,Ada Boost Classifier,0.7913,0.8368,0.5153,0.6329,0.5671,0.4318,0.4363,0.221
gbc,Gradient Boosting Classifier,0.7899,0.8351,0.4947,0.6339,0.5547,0.4203,0.4263,0.56
lightgbm,Light Gradient Boosting Machine,0.7822,0.8241,0.5077,0.6105,0.5535,0.4112,0.4148,31.412
rf,Random Forest Classifier,0.7755,0.8012,0.4847,0.5966,0.534,0.3884,0.3925,0.507
knn,K Neighbors Classifier,0.7615,0.7426,0.4343,0.568,0.4916,0.3396,0.3451,0.089
et,Extra Trees Classifier,0.758,0.7716,0.4717,0.554,0.5086,0.3497,0.3522,0.384
dummy,Dummy Classifier,0.7343,0.5,0.0,0.0,0.0,0.0,0.0,0.03


This output summarizes the performance metrics of various machine learning models trained on the prepared churn dataset, including accuracy, area under the curve (AUC), recall, precision, F1 score, Kappa, Matthews correlation coefficient (MCC), and training time in seconds.

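The best performing model based on accuracy:

Linear Discriminant Analysis (LDA) achieved the highest accuracy at 80.07%, followed closely by Ridge Classifier at 79.89%.
Logistic Regression (LR) is another strong performer with an accuracy of 79.87%. For reference, the Dummy Classifier baseline reaches 73.43%, so the best models improve on it by about 6.6 percentage points. The AUC of 0.0 reported for Ridge Classifier most likely reflects that this model does not provide probability estimates, not that it ranks customers poorly.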
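The leaderboard is sorted by accuracy, which is the default sort metric of compare_models. A different metric can give a different winner. The sketch below re-ranks the models using values copied from the table above. The `leaderboard` dictionary and the `best_by` helper are illustrative names, not part of PyCaret.

```python
# Metric values copied from the compare_models() leaderboard above.
leaderboard = {
    "lda":      {"Accuracy": 0.8007, "AUC": 0.8379, "Recall": 0.5306, "Prec.": 0.6553, "F1": 0.5855},
    "ridge":    {"Accuracy": 0.7989, "AUC": 0.0,    "Recall": 0.4725, "Prec.": 0.6735, "F1": 0.5545},
    "lr":       {"Accuracy": 0.7987, "AUC": 0.8384, "Recall": 0.5207, "Prec.": 0.6521, "F1": 0.5784},
    "ada":      {"Accuracy": 0.7913, "AUC": 0.8368, "Recall": 0.5153, "Prec.": 0.6329, "F1": 0.5671},
    "gbc":      {"Accuracy": 0.7899, "AUC": 0.8351, "Recall": 0.4947, "Prec.": 0.6339, "F1": 0.5547},
    "lightgbm": {"Accuracy": 0.7822, "AUC": 0.8241, "Recall": 0.5077, "Prec.": 0.6105, "F1": 0.5535},
    "rf":       {"Accuracy": 0.7755, "AUC": 0.8012, "Recall": 0.4847, "Prec.": 0.5966, "F1": 0.5340},
    "knn":      {"Accuracy": 0.7615, "AUC": 0.7426, "Recall": 0.4343, "Prec.": 0.5680, "F1": 0.4916},
    "et":       {"Accuracy": 0.7580, "AUC": 0.7716, "Recall": 0.4717, "Prec.": 0.5540, "F1": 0.5086},
    "dummy":    {"Accuracy": 0.7343, "AUC": 0.5,    "Recall": 0.0,    "Prec.": 0.0,    "F1": 0.0},
}


def best_by(metric):
    """Return the model id with the highest value for the given metric."""
    return max(leaderboard, key=lambda name: leaderboard[name][metric])


winners = {metric: best_by(metric) for metric in ("Accuracy", "AUC", "Recall", "Prec.", "F1")}
print(winners)

# Accuracy gain of the top model over the majority-class baseline.
baseline_gain = round(leaderboard["lda"]["Accuracy"] - leaderboard["dummy"]["Accuracy"], 4)
print(baseline_gain)
```

If catching churners matters more than overall accuracy, passing a different metric to compare_models through its sort parameter (for example `sort='Recall'`) is one way to change the ranking; whether that is the right choice depends on the cost of missing a churner versus contacting a loyal customer.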

**Interpreting the results**

Accuracy: Indicates the proportion of correctly classified instances out of the total instances.

AUC: Represents the area under the receiver operating characteristic (ROC) curve, which measures the model's ability to distinguish between classes.

Recall: Denotes the proportion of actual positive cases that were correctly identified by the model.

Precision: Indicates the proportion of positive identifications that were actually correct.

F1 Score: Harmonic mean of precision and recall, providing a balance between the two metrics.

Kappa: Measures the agreement between predicted and actual classifications, considering the possibility of the agreement occurring by chance.

MCC (Matthews Correlation Coefficient): Another measure of the quality of binary classifications, considering both false positives and false negatives.

Training Time (TT): Indicates the time taken by each model to train on the dataset.

In [8]:
best_model_info = best_model
best_model_info

## Select specific rows

In [9]:
selected_rows = df.iloc[:15]
selected_rows

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Churn,MonthlyCharges_log,TotalCharges_Tenure_Ratio,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year
0,1,29.85,29.85,0,3.396185,29.85,1.0,0,0,0,0,0,0,0
1,34,56.95,1889.5,0,4.042174,55.573529,0.03014,0,0,1,1,1,1,0
2,2,53.85,108.15,1,3.986202,54.075,0.49792,0,0,1,1,0,0,0
3,45,42.3,1840.75,0,3.744787,40.905556,0.02298,1,0,1,0,1,1,0
4,2,70.7,151.65,1,4.258446,75.825,0.466205,0,0,0,0,0,0,0
5,8,99.65,820.5,1,4.601664,102.5625,0.12145,0,0,0,0,0,0,0
6,22,89.1,1949.4,0,4.489759,88.609091,0.045706,0,1,1,0,0,0,0
7,10,29.75,301.9,0,3.392829,30.19,0.098543,0,0,1,1,0,0,0
8,28,104.8,3046.05,1,4.652054,108.7875,0.034405,0,0,0,0,0,0,0
9,62,56.15,3487.95,0,4.028027,56.257258,0.016098,1,0,1,0,1,1,0


## Use best model to predict target variable

In [10]:
predict_model(best_model, selected_rows)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,0.7333,0.86,0.6,0.6,0.6,0.4,0.4


Unnamed: 0,tenure,MonthlyCharges,TotalCharges,MonthlyCharges_log,TotalCharges_Tenure_Ratio,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year,Churn,prediction_label,prediction_score
0,1,29.85,29.85,3.396185,29.85,1.0,0,0,0,0,0,0,0,0,1,0.6476
1,34,56.950001,1889.5,4.042174,55.573528,0.03014,0,0,1,1,1,1,0,0,0,0.9583
2,2,53.849998,108.150002,3.986202,54.075001,0.49792,0,0,1,1,0,0,0,1,0,0.6443
3,45,42.299999,1840.75,3.744787,40.905556,0.02298,1,0,1,0,1,1,0,0,0,0.965
4,2,70.699997,151.649994,4.258446,75.824997,0.466205,0,0,0,0,0,0,0,1,1,0.7753
5,8,99.650002,820.5,4.601664,102.5625,0.12145,0,0,0,0,0,0,0,1,1,0.8431
6,22,89.099998,1949.400024,4.489759,88.609093,0.045706,0,1,1,0,0,0,0,0,0,0.5757
7,10,29.75,301.899994,3.392829,30.190001,0.098543,0,0,1,1,0,0,0,0,0,0.9234
8,28,104.800003,3046.050049,4.652054,108.787498,0.034405,0,0,0,0,0,0,0,1,1,0.699
9,62,56.150002,3487.949951,4.028027,56.257259,0.016098,1,0,1,0,1,1,0,0,0,0.9649


**Model Performance Metrics**

        Model - LDA
        Accuracy: 73.33%
        AUC: 86.00%
        Recall: 60.00%
        Precision: 60.00%
        F1 Score: 60.00%
        Kappa: 40.00%
        MCC: 40.00%

These metrics evaluate the LDA model on the 15 selected rows. An accuracy of 73.33% means that 11 of the 15 predictions are correct. An AUC of 0.86 indicates that the model ranks churners above non-churners well. Recall and precision are both 60.00%: the model identifies 60.00% of the actual churners, and 60.00% of its churn predictions are correct, which gives an F1 score of 60.00%. Kappa and MCC of 0.40 indicate moderate agreement and correlation between predicted and actual labels beyond chance. Because these rows come from the same DataFrame that was passed to setup, some of them were probably used for training, and 15 rows is a very small sample, so these figures should not be read as an estimate of performance on unseen data.
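The reported values can be reproduced from a confusion matrix. The notebook does not print one, so the counts below (3 true positives, 2 false positives, 2 false negatives, 8 true negatives) are inferred from the metrics for the 15 rows and should be treated as an assumption. The function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Accuracy": accuracy,
        "Recall": recall,
        "Prec.": precision,
        "F1": f1,
        "Kappa": kappa,
        "MCC": mcc,
    }


# Counts inferred from the reported metrics on the 15 selected rows (assumption).
metrics = binary_metrics(tp=3, fp=2, fn=2, tn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```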

## Determining wrong predictions

In [11]:
predicted_rows = predict_model(best_model, selected_rows)
wrong_predictions = (predicted_rows['Churn'] != predicted_rows['prediction_label']).sum()

print("Number of times the model was wrong:", wrong_predictions)


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,0.7333,0.86,0.6,0.6,0.6,0.4,0.4


Number of times the model was wrong: 4


Out of the 15 predictions made, the model was incorrect about the churn status of 4 customers, which matches the accuracy of 11/15 (73.33%) reported above.
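The count says how many predictions were wrong, but not which ones. The sketch below locates the mismatches using the Churn and prediction_label values of the ten rows displayed in the prediction table (indices 0 to 9). Rows 10 to 14 are not shown in the output, so they are left out here.

```python
# Actual and predicted labels for the ten displayed rows (indices 0-9).
actual = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]

# Indices where the prediction disagrees with the actual label.
wrong_indices = [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]

# Split the errors by type: predicted churn but stayed, or predicted stay but churned.
false_positives = [i for i in wrong_indices if predicted[i] == 1]
false_negatives = [i for i in wrong_indices if predicted[i] == 0]

print("Wrong predictions at indices:", wrong_indices)
print("False positives:", false_positives)
print("False negatives:", false_negatives)
```

With the full DataFrame, the same rows can be obtained by filtering, for example `predicted_rows[predicted_rows['Churn'] != predicted_rows['prediction_label']]`.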

## Save to disk

In [12]:
save_model(best_model, 'LDA')


Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'MonthlyCharges',
                                              'TotalCharges',
                                              'MonthlyCharges_log',
                                              'TotalCharges_Tenure_Ratio',
                                              'MonthlyCharges_to_TotalCharges_Ratio',
                                              'Bank transfer (automatic)',
                                              'Credit card (automatic)',
                                              'Electronic check', 'Mailed check',
                                              'Month-to-month', 'One year',
                                              'Two y...
                                                               strategy='most_frequent',
                                                      

## Use pickle serialization to save the best_model

In [13]:
with open('LDA_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

## Load the saved model using pickle deserialization

In [14]:
with open('LDA_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)
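As a quick check that a pickle round trip returns an equivalent object, the sketch below writes a stand-in object to a temporary file and reads it back. The dictionary `model_state` is a placeholder for a fitted estimator, not the actual LDA model.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted estimator; any picklable Python object behaves the same way.
model_state = {"name": "LDA", "classes": [0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "LDA_model.pk")

    # Serialize to disk in binary mode.
    with open(path, "wb") as f:
        pickle.dump(model_state, f)

    # Deserialize from the same file.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_state)
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from trusted sources. Note also that the pickle file here holds only the object passed to pickle.dump, whereas save_model reports saving the transformation pipeline together with the model.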

## Create new_data

In [15]:
new_data = selected_rows.copy()
new_data.drop('Churn', axis=1, inplace=True)
new_data.to_csv('newest_churn_data.csv', index=False)

## Predict churn for the loaded data

In [16]:
loaded_model.predict(new_data)

array([1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1], dtype=int8)

1: Predicted churn (the model predicts that the customer will churn).
0: Predicted no churn (the model predicts that the customer will not churn).

Interpreting the predictions:

    The first customer is predicted to churn.
    The second customer is predicted not to churn.
    The third customer is predicted not to churn.
    And so on, for each customer in the new data.

In [17]:
# The saved pipeline contains the LDA model, so name the variable accordingly
loaded_lda = load_model('LDA')
predict_model(loaded_lda, new_data)

Transformation Pipeline and Model Successfully Loaded


Unnamed: 0,tenure,MonthlyCharges,TotalCharges,MonthlyCharges_log,TotalCharges_Tenure_Ratio,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year,prediction_label,prediction_score
0,1,29.85,29.85,3.396185,29.85,1.0,0,0,0,0,0,0,0,1,0.6476
1,34,56.950001,1889.5,4.042174,55.573528,0.03014,0,0,1,1,1,1,0,0,0.9583
2,2,53.849998,108.150002,3.986202,54.075001,0.49792,0,0,1,1,0,0,0,0,0.6443
3,45,42.299999,1840.75,3.744787,40.905556,0.02298,1,0,1,0,1,1,0,0,0.965
4,2,70.699997,151.649994,4.258446,75.824997,0.466205,0,0,0,0,0,0,0,1,0.7753
5,8,99.650002,820.5,4.601664,102.5625,0.12145,0,0,0,0,0,0,0,1,0.8431
6,22,89.099998,1949.400024,4.489759,88.609093,0.045706,0,1,1,0,0,0,0,0,0.5757
7,10,29.75,301.899994,3.392829,30.190001,0.098543,0,0,1,1,0,0,0,0,0.9234
8,28,104.800003,3046.050049,4.652054,108.787498,0.034405,0,0,0,0,0,0,0,1,0.699
9,62,56.150002,3487.949951,4.028027,56.257259,0.016098,1,0,1,0,1,1,0,0,0.9649
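In the table above, prediction_score appears to be the probability of the predicted label, not the probability of churn: row 1 is labelled 0 with a score of 0.9583. Under that assumption, a churn probability can be derived as in this sketch. The helper `churn_probability` is a hypothetical name, and the input pairs are copied from the first five rows.

```python
# (prediction_label, prediction_score) pairs copied from rows 0-4 of the table above.
rows = [(1, 0.6476), (0, 0.9583), (0, 0.6443), (0, 0.965), (1, 0.7753)]


def churn_probability(label, score):
    """Convert a score for the predicted label into a probability of churn (class 1).

    Assumes the score is the probability of the predicted label.
    """
    return score if label == 1 else 1 - score


churn_probs = [round(churn_probability(label, score), 4) for label, score in rows]
print(churn_probs)
```

Having a single churn probability per customer makes it easier to rank customers by risk or to try a decision threshold other than 0.5. The raw_score parameter of predict_model may be an alternative way to obtain per-class scores directly.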


## Python module to predict churn

In [18]:
Code('predict_churn.py')

In [19]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


0        Churn
1     No Churn
2     No Churn
3     No Churn
4        Churn
5        Churn
6     No Churn
7     No Churn
8        Churn
9     No Churn
10    No Churn
11    No Churn
12    No Churn
13    No Churn
14       Churn
Name: Churn_prediction, dtype: object


The output indicates the churn predictions for each customer in the dataset. Each entry in the output corresponds to a customer, and it shows whether the model predicts that the customer will churn or not churn.
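The contents of predict_churn.py are not displayed in this notebook. The sketch below shows one way such a module could be written, and it is not the actual file. It assumes the saved pipeline is named 'LDA', that the input is newest_churn_data.csv, and that the output Series is named Churn_prediction as in the printed result. The function names are hypothetical.

```python
LABELS = {0: "No Churn", 1: "Churn"}


def label_predictions(raw_labels):
    """Map numeric predictions (0/1) to readable churn labels."""
    return [LABELS[int(value)] for value in raw_labels]


def predict_churn(csv_path="newest_churn_data.csv", model_name="LDA"):
    """Load new data and a saved PyCaret pipeline, then return labelled predictions."""
    # Imported here so label_predictions stays usable without pandas or PyCaret installed.
    import pandas as pd
    from pycaret.classification import load_model, predict_model

    new_data = pd.read_csv(csv_path)
    model = load_model(model_name)
    scored = predict_model(model, data=new_data)
    return pd.Series(
        label_predictions(scored["prediction_label"]), name="Churn_prediction"
    )


# Example of the label mapping on the first three predictions shown earlier.
example = label_predictions([1, 0, 0])
print(example)
```

In a notebook, calling `predict_churn()` after the model has been saved would return a Series similar to the one printed above.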

## Comparison with actual churn status

Prediction for index 0: Churn
Actual churn status - No Churn (mismatch)

Prediction for index 1: No Churn
Actual churn status - No Churn

Prediction for index 2: No Churn
Actual churn status - Churn (mismatch)

Prediction for index 3: No Churn
Actual churn status - No Churn

Prediction for index 4: Churn
Actual churn status - Churn

Prediction for index 5: Churn
Actual churn status - Churn

Prediction for index 6: No Churn
Actual churn status - No Churn

Prediction for index 7: No Churn
Actual churn status - No Churn

Prediction for index 8: Churn
Actual churn status - Churn

Prediction for index 9: No Churn
Actual churn status - No Churn

Predictions for indices 10 to 14: No Churn, No Churn, No Churn, No Churn, Churn
Actual churn status - not shown in the truncated table output above

For the ten rows whose actual status is displayed, the predictions match in 8 cases and differ at indices 0 and 2. Since the model was wrong 4 times across all 15 rows, two of the predictions for indices 10 to 14 are also incorrect.

## Summary

Necessary libraries and functions were imported for our task, which involves building a churn prediction model using PyCaret, a Python library for automating machine learning workflows.

We successfully achieved the following:

Loaded and prepared the churn data.

Set up an auto ML environment and compared classification models.

Selected the best-performing model by accuracy, which was LDA.

Predicted the churn status for 15 specific rows of data using the selected model.

Saved the best-performing model to a file using PyCaret's save_model function.

Serialized and deserialized the model using pickle.

Predicted the churn status for new data using both the model loaded with pickle and the pipeline loaded with PyCaret's load_model function.