In [1]:
import pandas as pd
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model
import pickle
from IPython.display import Code

## Load data

In [3]:
df = pd.read_csv("prepped_churn_data.csv")
df

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Churn,MonthlyCharges_log,TotalCharges_Tenure_Ratio,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year
0,1,29.85,29.85,0,3.396185,29.850000,1.000000,0,0,0,0,0,0,0
1,34,56.95,1889.50,0,4.042174,55.573529,0.030140,0,0,1,1,1,1,0
2,2,53.85,108.15,1,3.986202,54.075000,0.497920,0,0,1,1,0,0,0
3,45,42.30,1840.75,0,3.744787,40.905556,0.022980,1,0,1,0,1,1,0
4,2,70.70,151.65,1,4.258446,75.825000,0.466205,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7027,24,84.80,1990.50,0,4.440296,82.937500,0.042602,0,0,1,1,1,1,0
7028,72,103.20,7362.90,0,4.636669,102.262500,0.014016,0,1,1,0,1,1,0
7029,11,29.60,346.45,0,3.387774,31.495455,0.085438,0,0,0,0,0,0,0
7030,4,74.40,306.60,1,4.309456,76.650000,0.242661,0,0,1,1,0,0,0


## Initialize the AutoML environment

In [4]:
automl_setup = setup(df, target='Churn')

| | Description | Value |
| --- | --- | --- |
| 0 | Session id | 7041 |
| 1 | Target | Churn |
| 2 | Target type | Binary |
| 3 | Original data shape | (7032, 14) |
| 4 | Transformed data shape | (7032, 14) |
| 5 | Transformed train set shape | (4922, 14) |
| 6 | Transformed test set shape | (2110, 14) |
| 7 | Numeric features | 13 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |


The output summarizes the setup information for the PyCaret auto ML environment.

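Session id: 7041 - The random seed for the session. No `session_id` was passed to `setup`, so PyCaret generated one; reusing this value makes the train/test split and the model results reproducible.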

Target: Churn - The target variable for the classification task is Churn.

Target type: Binary - The target variable is binary, indicating a binary classification task (Churn or no Churn).

Original data shape: (7032, 14) - The original dataset has 7032 rows and 14 columns.

Transformed data shape: (7032, 14) - The transformed dataset after preprocessing remains the same size as the original dataset.

Transformed train set shape: (4922, 14) - The training set after preprocessing contains 4922 samples, roughly 70% of the data.

Transformed test set shape: (2110, 14) - The test set after preprocessing contains the remaining 2110 samples.

Numeric features: 13 - There are 13 numeric features in the dataset.

Preprocess: True - PyCaret's preprocessing pipeline is enabled.

Imputation type: simple - Missing values, if any, are filled with a simple imputation strategy.

Numeric imputation: mean - Numeric features are imputed with the column mean.

Categorical imputation: mode - Categorical features are imputed with the most frequent value.

Fold Generator: StratifiedKFold - Stratified K-Fold cross-validation is used during model training.

Number: 10 - 10 folds are used in cross-validation.

CPU Jobs: -1 - The number of CPU jobs is set to -1, allowing PyCaret to utilize all available CPUs.

Use GPU: False - GPU acceleration is not utilized for model training.

Log Experiment: False - Logging of the experiment is turned off.

Experiment Name: clf-default-name - The default name for the classification experiment is 'clf-default-name'.

USI: cc2a - A unique identifier for the experiment setup.
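The train and test shapes follow from the split fraction. As a rough check, assuming PyCaret's default `train_size` of 0.7 (the value is not shown in the output above) and scikit-learn-style rounding of the test set, the two sizes can be reproduced by hand:

```python
import math

n_rows = 7032        # original data shape reported by setup()
train_size = 0.7     # assumption: PyCaret's default split fraction

# The test set size is rounded up; the training set takes the remainder.
n_test = math.ceil(n_rows * (1 - train_size))
n_train = n_rows - n_test

print(n_train, n_test)  # 4922 2110
```

Both numbers match the transformed train and test set shapes in the setup summary.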

In [5]:
automl_type = type(automl_setup)
automl_type

pycaret.classification.oop.ClassificationExperiment

## Compare and select best model

In [6]:
best_model = compare_models()

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ridge | Ridge Classifier | 0.7942 | 0.0 | 0.4564 | 0.667 | 0.5413 | 0.4145 | 0.4273 | 0.077 |
| lr | Logistic Regression | 0.7938 | 0.8351 | 0.4962 | 0.6478 | 0.5614 | 0.4297 | 0.4366 | 11.275 |
| lda | Linear Discriminant Analysis | 0.7936 | 0.8321 | 0.5069 | 0.6437 | 0.5665 | 0.4335 | 0.4393 | 0.229 |
| ada | Ada Boost Classifier | 0.7903 | 0.8321 | 0.4978 | 0.636 | 0.5581 | 0.4233 | 0.429 | 0.29 |
| gbc | Gradient Boosting Classifier | 0.7901 | 0.8366 | 0.4733 | 0.6453 | 0.5455 | 0.4132 | 0.422 | 0.696 |
| lightgbm | Light Gradient Boosting Machine | 0.7848 | 0.8271 | 0.5023 | 0.6176 | 0.5538 | 0.414 | 0.418 | 33.597 |
| rf | Random Forest Classifier | 0.7741 | 0.803 | 0.4725 | 0.595 | 0.5265 | 0.3807 | 0.3852 | 0.502 |
| knn | K Neighbors Classifier | 0.7666 | 0.7547 | 0.448 | 0.5801 | 0.5049 | 0.3555 | 0.361 | 0.078 |
| et | Extra Trees Classifier | 0.7657 | 0.7786 | 0.4832 | 0.571 | 0.5229 | 0.3691 | 0.3718 | 0.392 |
| svm | SVM - Linear Kernel | 0.7625 | 0.0 | 0.443 | 0.605 | 0.4796 | 0.3401 | 0.3641 | 0.123 |


This output summarizes the performance metrics of the machine learning models trained on the prepped churn dataset: accuracy, area under the curve (AUC), recall, precision, F1 score, Kappa, Matthews correlation coefficient (MCC), and training time in seconds.

The best-performing model based on accuracy:

The Ridge Classifier achieved the highest accuracy at 79.42%, followed closely by Logistic Regression at 79.38% and LDA at 79.36%.
The top three are separated by less than 0.1 percentage points, so the ranking among them should not be over-interpreted.
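Accuracy is only one way to rank the leaderboard. The sketch below re-ranks the scores copied from the table above by other metrics; the helper name `best_by` is made up for illustration. PyCaret's `compare_models` also accepts a `sort` argument that changes the ranking metric, which would be the more direct route inside the notebook.

```python
# Scores copied from the compare_models() table: (Accuracy, AUC, Recall, F1)
leaderboard = {
    "ridge":    (0.7942, 0.0,    0.4564, 0.5413),
    "lr":       (0.7938, 0.8351, 0.4962, 0.5614),
    "lda":      (0.7936, 0.8321, 0.5069, 0.5665),
    "ada":      (0.7903, 0.8321, 0.4978, 0.5581),
    "gbc":      (0.7901, 0.8366, 0.4733, 0.5455),
    "lightgbm": (0.7848, 0.8271, 0.5023, 0.5538),
    "rf":       (0.7741, 0.803,  0.4725, 0.5265),
    "knn":      (0.7666, 0.7547, 0.448,  0.5049),
    "et":       (0.7657, 0.7786, 0.4832, 0.5229),
    "svm":      (0.7625, 0.0,    0.443,  0.4796),
}
METRICS = ("Accuracy", "AUC", "Recall", "F1")


def best_by(metric):
    """Return the model id with the highest value for the given metric."""
    idx = METRICS.index(metric)
    return max(leaderboard, key=lambda name: leaderboard[name][idx])


winners = {metric: best_by(metric) for metric in METRICS}
print(winners)
# {'Accuracy': 'ridge', 'AUC': 'gbc', 'Recall': 'lda', 'F1': 'lda'}
```

For a churn problem, where missing a churner is often costlier than a false alarm, ranking by recall or F1 may be the more useful view.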

**Interpreting the results**

Accuracy: Indicates the proportion of correctly classified instances out of the total instances.

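AUC: Represents the area under the receiver operating characteristic (ROC) curve, which measures the model's ability to distinguish between classes. The 0.0 shown for the Ridge Classifier and the linear-kernel SVM most likely means the metric could not be computed, since these estimators do not expose predicted probabilities; it should not be read as a real score.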

Recall: Denotes the proportion of actual positive cases that were correctly identified by the model.

Precision: Indicates the proportion of positive identifications that were actually correct.

F1 Score: Harmonic mean of precision and recall, providing a balance between the two metrics.

Kappa: Measures the agreement between predicted and actual classifications, considering the possibility of the agreement occurring by chance.

MCC (Matthews Correlation Coefficient): Another measure of the quality of binary classifications, considering both false positives and false negatives.

Training Time (TT): Indicates the time taken by each model to train on the dataset.
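AUC is the least intuitive of these metrics. One way to read it is as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it that way on a small set of made-up labels and scores (they are illustrative and do not come from the churn data):

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores, for illustration only.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]

auc = pairwise_auc(y_true, scores)
print(round(auc, 4))  # 0.8333
```

Five of the six positive-negative pairs are ordered correctly, giving 5/6.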

In [7]:
best_model_info = best_model
best_model_info

## Select specific rows

In [8]:
selected_rows = df.iloc[400:415]
selected_rows

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Churn,MonthlyCharges_log,TotalCharges_Tenure_Ratio,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year
400,32,19.75,624.15,0,2.983153,19.504687,0.031643,1,0,1,0,1,1,0
401,11,20.05,237.7,0,2.998229,21.609091,0.08435,0,1,1,0,1,1,0
402,69,99.45,7007.6,1,4.599655,101.55942,0.014192,0,1,1,0,0,0,0
403,68,55.9,3848.8,0,4.023564,56.6,0.014524,1,0,1,0,1,1,0
404,20,19.7,419.4,0,2.980619,20.97,0.046972,0,0,1,1,1,0,1
405,72,19.8,1468.75,0,2.985682,20.399306,0.013481,0,1,1,0,1,0,1
406,60,95.4,5812.0,0,4.558079,96.866667,0.016414,1,0,1,0,1,1,0
407,32,93.95,2861.45,0,4.542763,89.420312,0.032833,0,0,1,1,0,0,0
408,1,19.9,19.9,1,2.99072,19.9,1.0,0,0,1,1,0,0,0
409,1,19.6,19.6,1,2.97553,19.6,1.0,0,0,1,1,0,0,0


## Utilize best model to predict churn

In [9]:
predict_model(best_model, selected_rows)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Ridge Classifier | 0.8 | 0.625 | 0.25 | 1.0 | 0.4 | 0.3284 | 0.4432 |


Unnamed: 0,tenure,MonthlyCharges,TotalCharges,MonthlyCharges_log,TotalCharges_Tenure_Ratio,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year,Churn,prediction_label
400,32,19.75,624.150024,2.983154,19.504688,0.031643,1,0,1,0,1,1,0,0,0
401,11,20.049999,237.699997,2.998229,21.609091,0.08435,0,1,1,0,1,1,0,0,0
402,69,99.449997,7007.600098,4.599655,101.559418,0.014192,0,1,1,0,0,0,0,1,0
403,68,55.900002,3848.800049,4.023564,56.599998,0.014524,1,0,1,0,1,1,0,0,0
404,20,19.700001,419.399994,2.980619,20.969999,0.046972,0,0,1,1,1,0,1,0,0
405,72,19.799999,1468.75,2.985682,20.399305,0.013481,0,1,1,0,1,0,1,0,0
406,60,95.400002,5812.0,4.558079,96.866669,0.016414,1,0,1,0,1,1,0,0,0
407,32,93.949997,2861.449951,4.542763,89.420311,0.032833,0,0,1,1,0,0,0,0,0
408,1,19.9,19.9,2.99072,19.9,1.0,0,0,1,1,0,0,0,1,0
409,1,19.6,19.6,2.97553,19.6,1.0,0,0,1,1,0,0,0,1,0


Metrics

Model: Ridge Classifier
Accuracy: 80%
AUC: 62.5%
Recall: 25%
Precision: 100%
F1 Score: 40%
Kappa: 32.84%
MCC: 44.32%

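The model's precision is perfect on this sample: every customer it flagged as churning did churn. Its recall is low, however: it identified only 25% of the customers who actually churned and missed the rest. With just 15 rows, some of which may have been part of the training split, these figures are an illustration rather than a reliable estimate of performance.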
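The reported metrics can be reproduced from confusion-matrix counts. The counts below are not printed by the notebook; they are inferred from the metrics (15 rows, 80% accuracy, 25% recall, 100% precision), so treat them as an assumption. AUC is left out because it needs the underlying scores rather than counts.

```python
import math

# Inferred counts for the 15 selected rows (assumption, see lead-in).
tp, fn, fp, tn = 1, 3, 0, 11
n = tp + fn + fp + tn

accuracy = (tp + tn) / n
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Cohen's kappa: observed agreement corrected for chance agreement.
p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
kappa = (accuracy - p_chance) / (1 - p_chance)

# Matthews correlation coefficient.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1", f1),
                    ("Kappa", kappa), ("MCC", mcc)]:
    print(f"{name}: {value:.4f}")
```

The printed values agree with the metrics row above (0.8, 1.0, 0.25, 0.4, 0.3284, 0.4432).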



## Incorrect predictions

In [10]:
predicted_rows = predict_model(best_model, selected_rows)
incorrect_predictions = (predicted_rows['Churn'] != predicted_rows['prediction_label']).sum()

print("Incorrect Predictions:", incorrect_predictions)


| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Ridge Classifier | 0.8 | 0.625 | 0.25 | 1.0 | 0.4 | 0.3284 | 0.4432 |


Incorrect Predictions: 3


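Of the 15 predictions, 3 were incorrect, which matches the 80% accuracy reported above. Because precision is 100%, none of the errors are false alarms: all three are customers who churned but were predicted not to.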
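A raw error count hides which kind of mistake was made. The sketch below tallies each outcome type. The labels for rows 400 to 409 are taken from the prediction table above; the actual labels for rows 410 to 414 are not displayed in the notebook and are inferred here from the reported metrics and the prediction array, so they are an assumption.

```python
from collections import Counter

# Actual Churn values: rows 400-409 from the table, rows 410-414 inferred.
y_true = [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
# Predicted labels for the same 15 rows.
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]


def outcome(actual, predicted):
    """Classify one prediction as TP, TN, FP or FN."""
    if actual == predicted:
        return "TP" if actual == 1 else "TN"
    return "FN" if actual == 1 else "FP"


counts = Counter(outcome(a, p) for a, p in zip(y_true, y_pred))
print(dict(counts))  # {'TN': 11, 'FN': 3, 'TP': 1}
```

All three errors are false negatives, which is the costly direction for churn prevention.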

## Save model

In [11]:
save_model(best_model, 'ridge')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'MonthlyCharges',
                                              'TotalCharges',
                                              'MonthlyCharges_log',
                                              'TotalCharges_Tenure_Ratio',
                                              'MonthlyCharges_to_TotalCharges_Ratio',
                                              'Bank transfer (automatic)',
                                              'Credit card (automatic)',
                                              'Electronic check', 'Mailed check',
                                              'Month-to-month', 'One year',
                                              'Two y...
                                                               strategy='most_frequent',
                                                      

## Use pickle to save and load the model (serialization and deserialization)

In [14]:
with open('ridge_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [16]:
with open('ridge_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)
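The two cells above follow the standard pickle round trip: write the object in binary mode, then read it back. The sketch below shows the same pattern on a hypothetical stand-in dictionary instead of a fitted model, writing to a temporary directory so nothing is left on disk.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model; any picklable object works.
model_state = {"name": "ridge", "coef": [0.12, -0.03], "intercept": -0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pk")

    with open(path, "wb") as f:      # serialize to bytes on disk
        pickle.dump(model_state, f)

    with open(path, "rb") as f:      # deserialize into a new object
        restored = pickle.load(f)

print(restored == model_state)  # True
```

Note that `pickle.dump(best_model, f)` stores exactly the object it is given, whereas `save_model` reports saving the transformation pipeline together with the model, so the two files are not necessarily interchangeable. Pickle files should also only be loaded from trusted sources, because unpickling can execute arbitrary code.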

## Create new data

In [17]:
new_data = selected_rows.copy()
new_data.drop('Churn', axis=1, inplace=True)
new_data.to_csv('new_churn_data.csv', index=False)

## Make predictions for churn on the loaded data

In [18]:
loaded_model.predict(new_data)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], dtype=int8)

In [19]:
loaded_ridge = load_model('ridge')
predict_model(loaded_ridge, new_data)

Transformation Pipeline and Model Successfully Loaded


Unnamed: 0,tenure,MonthlyCharges,TotalCharges,MonthlyCharges_log,TotalCharges_Tenure_Ratio,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year,prediction_label
400,32,19.75,624.150024,2.983154,19.504688,0.031643,1,0,1,0,1,1,0,0
401,11,20.049999,237.699997,2.998229,21.609091,0.08435,0,1,1,0,1,1,0,0
402,69,99.449997,7007.600098,4.599655,101.559418,0.014192,0,1,1,0,0,0,0,0
403,68,55.900002,3848.800049,4.023564,56.599998,0.014524,1,0,1,0,1,1,0,0
404,20,19.700001,419.399994,2.980619,20.969999,0.046972,0,0,1,1,1,0,1,0
405,72,19.799999,1468.75,2.985682,20.399305,0.013481,0,1,1,0,1,0,1,0
406,60,95.400002,5812.0,4.558079,96.866669,0.016414,1,0,1,0,1,1,0,0
407,32,93.949997,2861.449951,4.542763,89.420311,0.032833,0,0,1,1,0,0,0,0
408,1,19.9,19.9,2.99072,19.9,1.0,0,0,1,1,0,0,0,0
409,1,19.6,19.6,2.97553,19.6,1.0,0,0,1,1,0,0,0,0


## Python Module to predict churn

In [20]:
Code('predict_churn.py')

In [21]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


0     No Churn
1     No Churn
2     No Churn
3     No Churn
4     No Churn
5     No Churn
6     No Churn
7     No Churn
8     No Churn
9     No Churn
10       Churn
11    No Churn
12    No Churn
13    No Churn
14    No Churn
Name: Churn_prediction, dtype: object


The output indicates the churn predictions for each customer in the dataset. Each entry in the output corresponds to a customer, and it shows whether the model predicts that the customer will churn or not churn.
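The source of `predict_churn.py` is not reproduced in this notebook. The following is a hypothetical sketch of how such a module could be written, not the actual file: the function names and default arguments are assumptions, while `load_model`, `predict_model`, the `prediction_label` column, and the `Churn_prediction` series name come from the cells and outputs above.

```python
LABELS = {0: "No Churn", 1: "Churn"}


def label_predictions(predictions):
    """Translate 0/1 class predictions into readable labels."""
    return [LABELS[int(p)] for p in predictions]


def predict_churn(csv_path="new_churn_data.csv", model_name="ridge"):
    """Load a saved PyCaret pipeline and label each row of a CSV file."""
    # Imported here so the helper above works without PyCaret installed.
    import pandas as pd
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)            # pipeline written by save_model
    data = pd.read_csv(csv_path)
    scored = predict_model(model, data=data)  # adds a prediction_label column
    labels = label_predictions(scored["prediction_label"])
    return pd.Series(labels, index=scored.index, name="Churn_prediction")


print(label_predictions([0, 1, 0]))  # ['No Churn', 'Churn', 'No Churn']
```

Calling `predict_churn()` in a directory containing the saved model and `new_churn_data.csv` would be expected to return a series shaped like the output above.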

This notebook used PyCaret, a Python library for automating machine learning workflows, to build a churn prediction model.

We achieved the following:

Loaded and prepared the churn data.

Set up an auto ML environment and compared classification models.

Selected the best-performing model, which was the Ridge Classifier.

Predicted the churn status for 15 specific rows of data using the selected model.

Saved the best-performing model to a file using PyCaret's save_model function.

Serialized and deserialized the model using pickle.

Predicted the churn status for new data using both the pickled model and the model restored with PyCaret's load_model function.