In [21]:
# ! pip install pycaret

In [22]:
import pandas as pd

## Load data

In [60]:
df = pd.read_csv("prepped_churn_data.csv")
df.head(15)

Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,Churn,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year
0,1,0,29.85,29.85,0,1.0,0,0,0,0,0,0,0
1,34,1,56.95,1889.5,0,0.03014,0,0,1,1,1,1,0
2,2,1,53.85,108.15,1,0.49792,0,0,1,1,0,0,0
3,45,0,42.3,1840.75,0,0.02298,1,0,1,0,1,1,0
4,2,1,70.7,151.65,1,0.466205,0,0,0,0,0,0,0
5,8,1,99.65,820.5,1,0.12145,0,0,0,0,0,0,0
6,22,1,89.1,1949.4,0,0.045706,0,1,1,0,0,0,0
7,10,0,29.75,301.9,0,0.098543,0,0,1,1,0,0,0
8,28,1,104.8,3046.05,1,0.034405,0,0,0,0,0,0,0
9,62,1,56.15,3487.95,0,0.016098,1,0,1,0,1,1,0


## Import PyCaret functions and classes

In [24]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

## Setup

In [25]:
automl = setup(df, target='Churn')


Unnamed: 0,Description,Value
0,Session id,663
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 13)"
4,Transformed data shape,"(7032, 13)"
5,Transformed train set shape,"(4922, 13)"
6,Transformed test set shape,"(2110, 13)"
7,Numeric features,12
8,Preprocess,True
9,Imputation type,simple


The `setup` function initializes the PyCaret environment for machine learning. Its output summarizes the configuration and the preprocessing steps applied to the dataset. The table above shows only the first rows of that summary.

Session id: An identifier for the current PyCaret session, which also serves as the random seed. Here it is 663.

Target: The target variable for the machine learning task. In this case, it is Churn, indicating that we are working on a binary classification problem where the goal is to predict whether a customer will churn or not.

Target type: Specifies the nature of the target variable. Binary indicates that it is a binary classification task.

Original data shape: The shape of the original dataset before any preprocessing. In this case, it was (7032, 13), meaning there are 7032 rows and 13 columns in the original dataset.

Transformed data shape: The shape of the dataset after preprocessing. It remains the same in this case, indicating that no feature engineering or dimensionality reduction was performed.

Transformed train set shape: The shape of the training set after preprocessing. In this case, it is (4922, 13), meaning that 4922 samples are used for training.

Transformed test set shape: The shape of the test set after preprocessing is (2110, 13), showing that 2110 samples are used for testing.

Numeric features: The number of numeric features in the dataset, here 12.

Preprocess: Indicates whether preprocessing was performed. `True` means that PyCaret's preprocessing pipeline, which includes steps such as imputation, was applied.

Imputation type: Specifies the type of imputation used for missing values. `Simple` means basic imputation techniques were applied.

Numeric imputation: The strategy used for imputing missing values in numeric features. `Mean` means that the mean value was used.

Categorical imputation: The strategy used for imputing missing values in categorical features. `Mode` indicates that the mode (most frequent value) was used.

Fold Generator: The cross-validation strategy used. `StratifiedKFold` indicates that stratified k-fold cross-validation was employed.

Fold Number: The number of folds used in cross-validation, here 10.

CPU Jobs: The number of CPU cores used during parallel processing. `-1` usually means to use all available cores.


Experiment Name: The name assigned to the current machine learning experiment. In this case, it is `clf-default-name`.

USI: A short unique identifier that PyCaret generates for the current session.
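The train and test shapes reported above are consistent with a 70/30 split of the 7032 rows. Below is a minimal sketch of that arithmetic, assuming a training fraction of 0.7 (the fraction itself does not appear in the output shown). `split_sizes` is a hypothetical helper written for this illustration, not a PyCaret function.

```python
import math

def split_sizes(n_rows, train_fraction=0.7):
    """Return (train_rows, test_rows) for a simple fractional split.

    Assumption: the training count is rounded down and the test set
    receives the remaining rows.
    """
    train_rows = math.floor(n_rows * train_fraction)
    return train_rows, n_rows - train_rows

train_rows, test_rows = split_sizes(7032)
print(train_rows, test_rows)  # 4922 2110
```

The two counts match the `Transformed train set shape` and `Transformed test set shape` rows of the setup summary.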

In [26]:
type(automl)

pycaret.classification.oop.ClassificationExperiment

## Compare models

In [27]:
best_model = compare_models()

`compare_models` trains the available classification models and scores each one with 10-fold cross-validation. The resulting summary lists every model together with its performance metrics.
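To illustrate what stratified k-fold cross-validation does, here is a simplified sketch in plain Python. `stratified_folds` is a hypothetical helper written for this example; it is not the scikit-learn `StratifiedKFold` implementation, and the toy labels are made up.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Assign each row index to a fold so that every fold keeps roughly
    the same class proportions as the full label list."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the rows of each class across the folds in turn.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

# Toy labels (made up): 20 non-churners and 10 churners.
labels = [0] * 20 + [1] * 10
for fold in stratified_folds(labels, n_folds=5):
    churners = sum(labels[i] for i in fold)
    print(len(fold), churners)  # each fold: 6 rows, 2 of them churners
```

Each fold then serves once as the validation set while the remaining folds are used for training, and the reported metric is the average over the folds.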

Below is the best model according to the metrics:
```
    Best Model: Logistic Regression (lr)
        Accuracy: 0.7997
        AUC (Area Under the Curve): 0.8419
        Recall: 0.5252
        Precision: 0.6549
        F1 Score: 0.5824
        Kappa: 0.4528
        MCC (Matthews Correlation Coefficient): 0.4579
        Training Time: 2.0710 seconds
```

Other models were also evaluated, and their respective performance metrics are presented in the table. The evaluation metrics include Accuracy, AUC, Recall, Precision, F1 Score, Kappa, MCC, and Training Time.
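Apart from AUC, which needs the predicted probabilities, the metrics in the table can all be derived from the four confusion-matrix counts. The following is a minimal sketch using made-up counts, not the notebook's results. `classification_metrics` is a hypothetical helper, and it does not guard against division by zero.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: agreement beyond what chance alone would produce.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    kappa = (accuracy - expected) / (1 - expected)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Accuracy": accuracy,
        "Recall": recall,
        "Prec.": precision,
        "F1": f1,
        "Kappa": kappa,
        "MCC": mcc,
    }

# Made-up counts for illustration only.
for name, value in classification_metrics(tp=50, fp=25, fn=50, tn=275).items():
    print(f"{name}: {value:.4f}")
```

Note that a model can reach a high accuracy while recall stays low, which is why the table reports several metrics side by side.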

In [28]:
best_model

Here we display `best_model` along with its hyperparameters. The best model was determined to be Logistic Regression.

## Select the second-to-last row from the DataFrame

In [29]:
df.iloc[-2:-1]

Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,Churn,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year
7030,4,1,74.4,306.6,1,0.242661,0,0,1,1,0,0,0


In [30]:
predict_model(best_model, df.iloc[-2:-1])

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.0,0,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year,Churn,prediction_label,prediction_score
7030,4,1,74.400002,306.600006,0.242661,0,0,1,1,0,0,0,1,0,0.5686


Here we call `predict_model` to predict the Churn label and assign a prediction score. The result is 0 (the predicted class is 'No Churn') with a prediction score of 0.5686.

**Interpretation**

The actual Churn value for this instance is 1, indicating that Churn occurred.
However, the model predicted a Churn label of 0 ('No Churn') with a prediction score of 0.5686.

**Possible issues**

A score of 0.5686 is close to 0.5, so the model was not confident about this instance. The model may not be well calibrated or might need further tuning. There could also be an issue with the evaluation setup or the way the model was trained.
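To make the relationship between the label and the score concrete, here is a minimal sketch. It assumes that the label comes from a 0.5 threshold on the predicted churn probability and that `prediction_score` is the probability of the predicted class. `label_and_score` is a hypothetical helper, not a PyCaret function.

```python
def label_and_score(p_churn, threshold=0.5):
    """Return (label, score): the predicted class and the probability
    assigned to that class."""
    if p_churn >= threshold:
        return 1, p_churn
    return 0, 1.0 - p_churn

# Under these assumptions, a churn probability of 0.4314 yields
# label 0 with a score of about 0.5686.
label, score = label_and_score(0.4314)
print(label, round(score, 4))  # 0 0.5686
```

Read this way, the model put the chance of churn for this customer at roughly 43%, just below the cutoff.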

## Save model to disk

In [31]:
save_model(best_model, 'LR')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'MonthlyCharges', 'TotalCharges',
                                              'MonthlyCharges_to_TotalCharges_Ratio',
                                              'Bank transfer (automatic)',
                                              'Credit card (automatic)',
                                              'Electronic check', 'Mailed check',
                                              'Month-to-month', 'One year',
                                              'Two year'],
                                     transformer=SimpleImputer(ad...
                  TransformerWrapper(exclude=None, include=None,
                                     transformer=CleanColumnNames(match='[\\]\\[\\,\\{\\}\\"\\:]+'))),
                 ('trained_mod

In [32]:
import pickle

## Save and load model

In [33]:
with open('LR_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [34]:
with open('LR_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [49]:
new_data = pd.concat([df.iloc[-1:].copy()] * 10, ignore_index=True)
new_data.drop('Churn', axis=1, inplace=True)

new_data

Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year
0,66,1,105.65,6844.5,0.015436,1,0,1,0,1,0,1
1,66,1,105.65,6844.5,0.015436,1,0,1,0,1,0,1
2,66,1,105.65,6844.5,0.015436,1,0,1,0,1,0,1
3,66,1,105.65,6844.5,0.015436,1,0,1,0,1,0,1
4,66,1,105.65,6844.5,0.015436,1,0,1,0,1,0,1
5,66,1,105.65,6844.5,0.015436,1,0,1,0,1,0,1
6,66,1,105.65,6844.5,0.015436,1,0,1,0,1,0,1
7,66,1,105.65,6844.5,0.015436,1,0,1,0,1,0,1
8,66,1,105.65,6844.5,0.015436,1,0,1,0,1,0,1
9,66,1,105.65,6844.5,0.015436,1,0,1,0,1,0,1


We create `new_data` by repeating the last row of the DataFrame 10 times and dropping the `Churn` column.

## Predict the target variable for the new_data

In [50]:
loaded_model.predict(new_data)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int8)

In [52]:
loaded_lr = load_model('LR')

Transformation Pipeline and Model Successfully Loaded


In [53]:
predict_model(loaded_lr, new_data)

Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,MonthlyCharges_to_TotalCharges_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Month-to-month,One year,Two year,prediction_label,prediction_score
0,66,1,105.650002,6844.5,0.015436,1,0,1,0,1,0,1,0,0.925
1,66,1,105.650002,6844.5,0.015436,1,0,1,0,1,0,1,0,0.925
2,66,1,105.650002,6844.5,0.015436,1,0,1,0,1,0,1,0,0.925
3,66,1,105.650002,6844.5,0.015436,1,0,1,0,1,0,1,0,0.925
4,66,1,105.650002,6844.5,0.015436,1,0,1,0,1,0,1,0,0.925
5,66,1,105.650002,6844.5,0.015436,1,0,1,0,1,0,1,0,0.925
6,66,1,105.650002,6844.5,0.015436,1,0,1,0,1,0,1,0,0.925
7,66,1,105.650002,6844.5,0.015436,1,0,1,0,1,0,1,0,0.925
8,66,1,105.650002,6844.5,0.015436,1,0,1,0,1,0,1,0,0.925
9,66,1,105.650002,6844.5,0.015436,1,0,1,0,1,0,1,0,0.925


In [55]:
# Save the DataFrame to a CSV file 
new_data.to_csv('new_churn_data.csv', index=False)


## Using Python Module to make predictions

In [58]:
from IPython.display import Code
Code('predict_churn.py')

In [59]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


predictions:
0    No Churn
1    No Churn
2    No Churn
3    No Churn
4    No Churn
5    No Churn
6    No Churn
7    No Churn
8    No Churn
9    No Churn
Name: Churn_prediction, dtype: object


The Python module successfully loads the transformation pipeline and model and makes predictions on the new data. All ten predictions are No Churn, which is expected because the ten rows are identical copies of one customer.
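The contents of `predict_churn.py` are not reproduced in this notebook. Below is a minimal sketch of what such a module could look like, assuming the `LR` model and the `new_churn_data.csv` file saved earlier. The function names are hypothetical, and the imports are deferred so that the label-mapping helper can be used without PyCaret installed.

```python
def label_to_text(label):
    """Map a numeric prediction (0 or 1) to a readable label."""
    return "Churn" if int(label) == 1 else "No Churn"


def make_predictions(csv_path="new_churn_data.csv", model_name="LR"):
    """Load the saved pipeline and return readable predictions.

    Assumes the model was saved with save_model(best_model, 'LR') and
    that the CSV has the same feature columns as the training data.
    """
    # Deferred imports: only needed when predictions are actually made.
    import pandas as pd
    from pycaret.classification import load_model, predict_model

    new_data = pd.read_csv(csv_path)
    model = load_model(model_name)
    scored = predict_model(model, data=new_data)
    predictions = scored["prediction_label"].map(label_to_text)
    return predictions.rename("Churn_prediction")


# Example usage (requires PyCaret and the saved files):
# print("predictions:")
# print(make_predictions())
```

Keeping the label mapping in its own function makes it easy to test separately from model loading.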

## Summary

We began by importing the necessary libraries, including pandas for data manipulation and PyCaret for automated machine learning tasks. Then, we loaded the prepped churn data from a CSV file into a pandas DataFrame. Using PyCaret's `setup` function, we initialized the AutoML environment, specifying the target variable as `Churn`. After setting up the environment, we compared different classification models to select the best-performing one, which was found to be Logistic Regression.

Once the best model was identified, it was saved both with PyCaret's `save_model` under the name `LR` and with Python's pickle serialization. Subsequently, we loaded the pickled model and used it to make predictions on a new dataset, built by repeating the last row of the DataFrame 10 times and dropping the `Churn` column.

Additionally, we demonstrated how to load the saved model using PyCaret's `load_model` function and make predictions on the same new dataset. Finally, we displayed a Python module named `predict_churn.py` using IPython's `Code` display feature and ran it with the `%run` magic command. The module encapsulates loading the model and making predictions on new data in a reusable script.