In [1]:
import pandas as pd

## Load dataset

In [2]:
df = pd.read_csv("prepared_churn_data.csv")
df.head(15)

Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,Churn,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year
0,1,0,29.85,29.85,0,1.0,0,0,0,0,0,0,0
1,34,1,56.95,1889.5,0,0.03014,0,0,1,1,1,1,0
2,2,1,53.85,108.15,1,0.49792,0,0,1,1,0,0,0
3,45,0,42.3,1840.75,0,0.02298,1,0,1,0,1,1,0
4,2,1,70.7,151.65,1,0.466205,0,0,0,0,0,0,0
5,8,1,99.65,820.5,1,0.12145,0,0,0,0,0,0,0
6,22,1,89.1,1949.4,0,0.045706,0,1,1,0,0,0,0
7,10,0,29.75,301.9,0,0.098543,0,0,1,1,0,0,0
8,28,1,104.8,3046.05,1,0.034405,0,0,0,0,0,0,0
9,62,1,56.15,3487.95,0,0.016098,1,0,1,0,1,1,0


## Set up AutoML environment

In [3]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [4]:
automl = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,Session id,1589
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 13)"
4,Transformed data shape,"(7032, 13)"
5,Transformed train set shape,"(4922, 13)"
6,Transformed test set shape,"(2110, 13)"
7,Numeric features,12
8,Preprocess,True
9,Imputation type,simple


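The table above is the output of PyCaret's `setup` function, which initializes the machine learning environment. It summarizes the configuration and the preprocessing steps applied to the dataset. The table as shown stops at the imputation type; the remaining entries described below are not visible in it.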

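Session id: The random seed for the session, here 1589. PyCaret generates one at random when `session_id` is not passed to `setup`; passing the same value again makes the run reproducible.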

Target: The target variable for the machine learning task. Here it is `Churn`: the goal is to predict whether a customer will churn or not.

Target type: Specifies the nature of the target variable. Binary indicates that it is a binary classification task.

Original data shape: The shape of the original dataset before any preprocessing. Here it is (7032, 13), meaning 7032 rows and 13 columns.

Transformed data shape: The shape of the dataset after preprocessing. It remains the same in this case, indicating that no feature engineering or dimensionality reduction was performed.

Transformed train set shape: The shape of the training set after preprocessing. In this case, it is (4922, 13), meaning that 4922 samples are used for training.

Transformed test set shape: The shape of the test set after preprocessing is (2110, 13), showing that 2110 samples are used for testing.

Numeric features: The number of numeric features in the dataset, here 12 (every column except the target).

Preprocess: Indicates whether the preprocessing pipeline is enabled. `True` means that steps such as imputation are applied; scaling is not part of the defaults and is only applied when `normalize=True` is passed to `setup`.

Imputation type: Specifies the type of imputation used for missing values. `Simple` means basic imputation techniques were applied.

Numeric imputation: The strategy used for imputing missing values in numeric features. `Mean` means that the mean value was used.

Categorical imputation: The strategy used for imputing missing values in categorical features. `Mode` indicates that the mode (most frequent value) was used.

Fold Generator: The cross-validation strategy used. `StratifiedKFold` indicates that stratified k-fold cross-validation was employed.

Fold Number: The number of folds used in cross-validation, here 10.

CPU Jobs: The number of CPU cores used during parallel processing. `-1` usually means to use all available cores.


Experiment Name: The name assigned to the current machine learning experiment. In this case, it is `clf-default-name`.

USI: The unique session identifier, a short ID generated for this experiment run, here `73dc`.
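Several of the values above are PyCaret defaults rather than choices made in this notebook. The sketch below, which assumes PyCaret 3.x and its default `train_size` of 0.7, writes them out explicitly and checks that a 70/30 split reproduces the reported train and test sizes. `SETUP_KWARGS`, `init_experiment`, and `split_sizes` are illustrative names, not part of the original notebook.

```python
# Settings reported in the setup summary, written out explicitly.
# train_size=0.7 is assumed to be the PyCaret default that was in effect.
SETUP_KWARGS = {
    "target": "Churn",
    "session_id": 1589,                  # random seed, makes the split reproducible
    "train_size": 0.7,
    "fold_strategy": "stratifiedkfold",
    "fold": 10,
    "n_jobs": -1,                        # use all available CPU cores
}


def init_experiment(df):
    """Run PyCaret's setup with the explicit settings (requires pycaret)."""
    from pycaret.classification import setup  # imported lazily on purpose
    return setup(df, **SETUP_KWARGS)


def split_sizes(n_rows, train_size):
    """Row counts for a fractional train size, rounding the train count down.

    This is one rounding convention that reproduces the reported numbers.
    """
    n_train = int(n_rows * train_size)
    return n_train, n_rows - n_train


n_train, n_test = split_sizes(7032, SETUP_KWARGS["train_size"])
print(n_train, n_test)  # 4922 2110
```

Calling `init_experiment(df)` with the same `session_id` should give the same train/test split on every run, which is useful when comparing experiments.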

In [5]:
type(automl)

pycaret.classification.oop.ClassificationExperiment

In [6]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7917,0.8347,0.4793,0.6459,0.549,0.4178,0.4263,0.584
lda,Linear Discriminant Analysis,0.7915,0.8311,0.4869,0.6436,0.5532,0.4209,0.4285,0.204
ridge,Ridge Classifier,0.7905,0.0,0.4357,0.6617,0.5243,0.3975,0.4123,0.071
lr,Logistic Regression,0.7903,0.8353,0.4991,0.6362,0.5576,0.4232,0.4295,4.527
ada,Ada Boost Classifier,0.7812,0.8305,0.4762,0.614,0.5357,0.3958,0.4016,0.22
lightgbm,Light Gradient Boosting Machine,0.7785,0.8185,0.4877,0.6035,0.5385,0.3952,0.3996,50.77
rf,Random Forest Classifier,0.7633,0.7881,0.4502,0.5704,0.5019,0.3498,0.3547,0.611
knn,K Neighbors Classifier,0.7536,0.743,0.4364,0.5461,0.4838,0.325,0.3292,0.13
et,Extra Trees Classifier,0.7511,0.7656,0.4648,0.5381,0.4976,0.3337,0.3359,0.705
dummy,Dummy Classifier,0.7343,0.5,0.0,0.0,0.0,0.0,0.0,0.071


We used PyCaret's `compare_models` to select a model. The table lists the classification models that were evaluated, with metrics averaged over 10-fold cross-validation and rows sorted by accuracy.

Below is the best model by accuracy, with its metrics from the table:
```
    Best Model: Gradient Boosting Classifier (gbc)
        Accuracy: 0.7917
        AUC (Area Under the Curve): 0.8347
        Recall: 0.4793
        Precision: 0.6459
        F1 Score: 0.5490
        Kappa: 0.4178
        MCC (Matthews Correlation Coefficient): 0.4263
```

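Other models were also evaluated, and their respective performance metrics are presented in the table. The AUC of 0.0 for the Ridge Classifier means that AUC was not computed for it (the model does not expose predicted probabilities), not that it ranks customers poorly.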
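`compare_models` ranks by accuracy by default, so "best" here means highest cross-validated accuracy. As a rough illustration, the sketch below re-ranks a subset of rows copied from the table by other metrics in plain Python; the `results` dictionary and the `top_by` helper are illustrative names. In PyCaret itself, passing `sort='F1'` to `compare_models` requests that ordering directly.

```python
# Cross-validated scores copied from the comparison table
# (top four models by accuracy, plus the dummy baseline).
results = {
    "gbc":   {"Accuracy": 0.7917, "AUC": 0.8347, "Recall": 0.4793, "F1": 0.5490},
    "lda":   {"Accuracy": 0.7915, "AUC": 0.8311, "Recall": 0.4869, "F1": 0.5532},
    "lr":    {"Accuracy": 0.7903, "AUC": 0.8353, "Recall": 0.4991, "F1": 0.5576},
    "ada":   {"Accuracy": 0.7812, "AUC": 0.8305, "Recall": 0.4762, "F1": 0.5357},
    "dummy": {"Accuracy": 0.7343, "AUC": 0.5000, "Recall": 0.0000, "F1": 0.0000},
}


def top_by(metric):
    """Return the model id with the highest value for the given metric."""
    return max(results, key=lambda name: results[name][metric])


for metric in ("Accuracy", "AUC", "Recall", "F1"):
    print(f"{metric}: {top_by(metric)}")

# The dummy classifier always predicts the majority class, so its accuracy
# is the baseline that any useful model has to beat.
gain = results["gbc"]["Accuracy"] - results["dummy"]["Accuracy"]
print(f"Accuracy gain of gbc over the dummy baseline: {gain:.4f}")
```

The top models are within a fraction of a percentage point of each other on accuracy, while Logistic Regression leads on AUC, Recall, and F1. For churn, where missing a churner is often the costlier error, sorting by Recall or F1 may be the more relevant choice.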

In [7]:
best_model

## Select specific rows

In [8]:
df.iloc[90:100]

Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,Churn,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year
90,30,1,82.05,2570.2,0,0.031924,1,0,1,0,0,0,0
91,1,1,74.7,74.7,0,1.0,0,0,0,0,0,0,0
92,66,1,84.0,5714.25,0,0.0147,0,0,1,1,1,0,1
93,65,1,111.05,7107.0,0,0.015625,0,1,1,0,0,0,0
94,72,1,100.9,7459.05,0,0.013527,1,0,1,0,1,0,1
95,12,1,78.95,927.35,1,0.085135,0,0,0,0,0,0,0
96,71,1,66.85,4748.7,0,0.014078,0,1,1,0,1,1,0
97,5,1,21.05,113.85,1,0.184892,0,0,1,1,0,0,0
98,52,1,21.0,1107.2,0,0.018967,1,0,1,0,1,0,1
99,25,1,98.5,2514.5,1,0.039173,0,0,0,0,0,0,0


In [9]:
predict_model(best_model, df.iloc[90:100])

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,0.8,0.7619,0.6667,0.6667,0.6667,0.5238,0.5238


Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year,Churn,prediction_label,prediction_score
90,30,1,82.050003,2570.199951,0.031924,1,0,1,0,0,0,0,0,0,0.712
91,1,1,74.699997,74.699997,1.0,0,0,0,0,0,0,0,0,1,0.8048
92,66,1,84.0,5714.25,0.0147,0,0,1,1,1,0,1,0,0,0.9664
93,65,1,111.050003,7107.0,0.015625,0,1,1,0,0,0,0,0,0,0.7174
94,72,1,100.900002,7459.049805,0.013527,1,0,1,0,1,0,1,0,0,0.9753
95,12,1,78.949997,927.349976,0.085135,0,0,0,0,0,0,0,1,1,0.5302
96,71,1,66.849998,4748.700195,0.014078,0,1,1,0,1,1,0,0,0,0.9741
97,5,1,21.049999,113.849998,0.184892,0,0,1,1,0,0,0,1,0,0.8441
98,52,1,21.0,1107.199951,0.018967,1,0,1,0,1,0,1,0,0,0.9647
99,25,1,98.5,2514.5,0.039173,0,0,0,0,0,0,0,1,1,0.7361


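Observations on the churn predictions:

Model Performance: The model achieved an accuracy of 80% on these 10 rows, meaning 8 of the 10 predictions were correct. Because the rows come from the same DataFrame that was passed to `setup`, some of them were probably part of the training split, so this figure is not an independent estimate of performance.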

AUC Score: The AUC score is 0.7619, which suggests that the model has some ability to distinguish between churned and non-churned customers. AUC values closer to 1 indicate better discrimination ability.

Recall and Precision: The recall, precision, and F1 score are all approximately 0.67. Three of the 10 customers actually churned and the model flagged two of them (recall of 2/3); it predicted churn for three customers and was right for two (precision of 2/3). This is moderate performance, and with so few rows each figure shifts substantially with a single prediction.

Kappa and MCC: The Kappa statistic and MCC are both 0.5238, which suggests moderate agreement between the actual and predicted churn labels. These metrics take into account the possibility of the agreement occurring by chance.
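The metrics above can be recomputed by hand from the 10 rows, which makes it clearer what each one measures. The sketch below uses plain Python with the labels and scores copied from the `predict_model` output; it assumes that `prediction_score` is the probability of the predicted label, which is consistent with the table (for example, row 90 is labelled 0 with a score of 0.712). The variable names are illustrative.

```python
from math import sqrt

# Actual labels, predicted labels, and prediction scores for rows 90-99,
# copied from the predict_model output.
actual    = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1]
predicted = [0, 1, 0, 0, 0, 1, 0, 0, 0, 1]
score     = [0.712, 0.8048, 0.9664, 0.7174, 0.9753,
             0.5302, 0.9741, 0.8441, 0.9647, 0.7361]

# Confusion-matrix counts, with churn (1) as the positive class.
pairs = list(zip(actual, predicted))
tp = pairs.count((1, 1))
fp = pairs.count((0, 1))
fn = pairs.count((1, 0))
tn = pairs.count((0, 0))
n = len(pairs)

accuracy = (tp + tn) / n
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Cohen's kappa: observed agreement corrected for agreement expected by chance.
p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
kappa = (accuracy - p_chance) / (1 - p_chance)

# Matthews correlation coefficient.
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Convert the score of the predicted label into P(churn), then compute AUC
# as the share of (churner, non-churner) pairs ranked in the right order.
p_churn = [s if p == 1 else 1 - s for s, p in zip(score, predicted)]
pos = [p for p, a in zip(p_churn, actual) if a == 1]
neg = [p for p, a in zip(p_churn, actual) if a == 0]
auc = sum((x > y) + 0.5 * (x == y) for x in pos for y in neg) / (len(pos) * len(neg))

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"kappa={kappa:.4f} mcc={mcc:.4f} auc={auc:.4f}")
```

The counts come out as 2 true positives, 1 false positive, 1 false negative, and 6 true negatives, and the derived values agree with the reported 0.8, 0.6667, 0.5238, and 0.7619.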

## Save model to disk

In [10]:
save_model(best_model, 'GBC')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'MonthlyCharges', 'TotalCharges',
                                              'MonthlyCharges_to_TotalCharges_Ratio',
                                              'Bank transfer (automatic)',
                                              'Credit card (automatic)',
                                              'Electronic check', 'Mailed check',
                                              'Month-to-month', 'One year',
                                              'Two year'],
                                     transformer=SimpleImputer(ad...
                                             criterion='friedman_mse', init=None,
                                             learning_rate=0.1, loss='log_loss',
                                   

In [11]:
import pickle

## Save and load model

In [14]:
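# Pickle the fitted estimator directly. Unlike save_model, this stores
# only the model, not PyCaret's preprocessing pipeline.
with open('GBC_model.pk', 'wb') as f:
    pickle.dump(best_model, f)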

In [15]:
with open('GBC_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [23]:

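# Reuse rows 90-99 as stand-in "new" data and drop the target column
rows = df.iloc[90:100]
new_data = rows.drop(columns='Churn')
# Write without the index so the CSV contains only the feature columns
new_data.to_csv('new_churn_data.csv', index=False)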


new_data

Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year
90,30,1,82.05,2570.2,0.031924,1,0,1,0,0,0,0
91,1,1,74.7,74.7,1.0,0,0,0,0,0,0,0
92,66,1,84.0,5714.25,0.0147,0,0,1,1,1,0,1
93,65,1,111.05,7107.0,0.015625,0,1,1,0,0,0,0
94,72,1,100.9,7459.05,0.013527,1,0,1,0,1,0,1
95,12,1,78.95,927.35,0.085135,0,0,0,0,0,0,0
96,71,1,66.85,4748.7,0.014078,0,1,1,0,1,1,0
97,5,1,21.05,113.85,0.184892,0,0,1,1,0,0,0
98,52,1,21.0,1107.2,0.018967,1,0,1,0,1,0,1
99,25,1,98.5,2514.5,0.039173,0,0,0,0,0,0,0


## Predict churn for the loaded data

In [25]:
loaded_model.predict(new_data)

array([0, 1, 0, 0, 0, 1, 0, 0, 0, 1], dtype=int8)

In [26]:
loaded_gbc = load_model('GBC')


Transformation Pipeline and Model Successfully Loaded


In [27]:
predict_model(loaded_gbc, new_data)

Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year,prediction_label,prediction_score
90,30,1,82.050003,2570.199951,0.031924,1,0,1,0,0,0,0,0,0.712
91,1,1,74.699997,74.699997,1.0,0,0,0,0,0,0,0,1,0.8048
92,66,1,84.0,5714.25,0.0147,0,0,1,1,1,0,1,0,0.9664
93,65,1,111.050003,7107.0,0.015625,0,1,1,0,0,0,0,0,0.7174
94,72,1,100.900002,7459.049805,0.013527,1,0,1,0,1,0,1,0,0.9753
95,12,1,78.949997,927.349976,0.085135,0,0,0,0,0,0,0,1,0.5302
96,71,1,66.849998,4748.700195,0.014078,0,1,1,0,1,1,0,0,0.9741
97,5,1,21.049999,113.849998,0.184892,0,0,1,1,0,0,0,0,0.8441
98,52,1,21.0,1107.199951,0.018967,1,0,1,0,1,0,1,0,0.9647
99,25,1,98.5,2514.5,0.039173,0,0,0,0,0,0,0,1,0.7361


## Python module to make predictions

In [28]:
from IPython.display import Code
Code('predict_churn.py')

In [29]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


predictions:
0    No Churn
1       Churn
2    No Churn
3    No Churn
4    No Churn
5       Churn
6    No Churn
7    No Churn
8    No Churn
9       Churn
Name: Churn_prediction, dtype: object


The Python module loads the transformation pipeline and model successfully and makes predictions on the new data: 7 No Churn and 3 Churn, matching the labels returned by `predict_model` above.
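The source of `predict_churn.py` is not reproduced in this export, so the following is only a sketch of what such a module might look like. It assumes the module loads the `GBC` pipeline saved earlier, reads `new_churn_data.csv`, and maps the 0/1 labels to the strings seen in the output. The names `LABELS`, `to_labels`, and `predict_churn` are hypothetical.

```python
# Hypothetical sketch of a prediction module; the real predict_churn.py
# is not shown in the notebook export and may differ.
LABELS = {0: "No Churn", 1: "Churn"}


def to_labels(values):
    """Map 0/1 predictions to human-readable labels."""
    return [LABELS[int(v)] for v in values]


def predict_churn(csv_path="new_churn_data.csv", model_name="GBC"):
    """Load the saved pipeline and return labelled predictions.

    Requires pandas and pycaret; both are imported lazily so that the
    helper above can be used without them.
    """
    import pandas as pd
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)          # reads GBC.pkl written by save_model
    new_data = pd.read_csv(csv_path)        # feature columns only, no Churn
    preds = predict_model(model, data=new_data)
    preds["Churn_prediction"] = to_labels(preds["prediction_label"])
    return preds["Churn_prediction"]


# The label mapping applied to the predictions shown earlier in the notebook.
example = to_labels([0, 1, 0, 0, 0, 1, 0, 0, 0, 1])
print(example.count("No Churn"), example.count("Churn"))  # 7 3
```

Calling `predict_churn()` from a script or with `%run` would then print or return the labelled Series, provided `GBC.pkl` and `new_churn_data.csv` are in the working directory.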

## Summary

We started by importing the essential libraries: pandas for data manipulation and PyCaret for automating machine learning tasks. Next, we loaded the prepared churn data from a CSV file into a pandas DataFrame. Using PyCaret's `setup` function, we initialized the automated ML environment, specifying `Churn` as the target variable.

Once the environment was set up, we compared a range of classification models to find the top performer by accuracy, which turned out to be the Gradient Boosting Classifier. We then saved it twice: with PyCaret's `save_model` under the name `GBC`, and with Python's `pickle` module as `GBC_model.pk`.

Following this, we loaded the saved model using pickle deserialization and applied it to predict outcomes on a new dataset. This involved copying 10 rows of the DataFrame and omitting the Churn column.

In addition, we loaded the saved model with PyCaret's `load_model` function and made predictions on the same new dataset. Finally, we wrapped the steps from loading the model to making predictions on new data in a reusable Python module named `predict_churn.py`. The module was displayed in the notebook with IPython's `Code` class and executed with the `%run` magic command.