## Import libraries and modules

In [6]:
import pandas as pd
import pickle
from IPython.display import Code
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model
import numpy as np

In [2]:
df = pd.read_csv("cleaned_churn_data.csv")
df.tail(5)

Unnamed: 0,tenure,PhoneService,Contract,MonthlyCharges,TotalCharges,Churn,MonthlyCharges_to_tenure_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check
7038,24,1,1,84.8,1990.5,0,3.533333,0,0,1,1
7039,72,1,1,103.2,7362.9,0,1.433333,0,1,1,0
7040,11,0,0,29.6,346.45,0,2.690909,0,0,0,0
7041,4,1,0,74.4,306.6,1,18.6,0,0,1,1
7042,66,1,3,105.65,6844.5,0,1.600758,1,0,1,0


## Handle Infinity values in the dataset

In [7]:
columns_with_infinity = df.columns[np.isinf(df).any()]

print("Columns with Infinity values:", columns_with_infinity)

# Replace infinity values
df[columns_with_infinity] = df[columns_with_infinity].replace([np.inf, -np.inf], np.nan)

Columns with Infinity values: Index(['MonthlyCharges_to_tenure_Ratio'], dtype='object')


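The column `MonthlyCharges_to_tenure_Ratio` in the DataFrame (`df`) contains infinity values, presumably because the ratio divides by a `tenure` of 0 for some customers. We identify and print the affected columns, then replace the infinity values with NaN so that they are treated as missing values in later steps.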
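A minimal sketch of the same detect-and-replace step on a toy DataFrame. The values and the `ratio` column name are made up for illustration, and the sketch assumes the ratio is computed as `MonthlyCharges / tenure`, so that a tenure of 0 produces infinity.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): the second customer has a tenure of 0.
toy = pd.DataFrame({"tenure": [24, 0, 11], "MonthlyCharges": [84.8, 52.55, 29.6]})
toy["ratio"] = toy["MonthlyCharges"] / toy["tenure"]  # dividing by 0 gives inf

# Find the columns that contain +inf or -inf.
inf_columns = toy.columns[np.isinf(toy).any()]

# Replace infinities with NaN so they are handled as missing values later.
toy[inf_columns] = toy[inf_columns].replace([np.inf, -np.inf], np.nan)

n_missing = int(toy["ratio"].isna().sum())
print(list(inf_columns), n_missing)
```

Only the column that actually contains infinity is touched, which mirrors the selective replacement in the cell above.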

## Set up the AutoML environment

In [8]:
automl = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,Session id,4875
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 11)"
4,Transformed data shape,"(7043, 11)"
5,Transformed train set shape,"(4930, 11)"
6,Transformed test set shape,"(2113, 11)"
7,Numeric features,10
8,Rows with missing values,0.2%
9,Preprocess,True


This output summarizes the PyCaret AutoML setup for the binary classification task of predicting `Churn`.

The key points are:

The session ID is 4875, and the target variable is `Churn`, which PyCaret detects as binary. The original dataset has shape (7043, 11) and keeps that shape after transformation. The transformed training set has 4930 rows and the transformed test set has 2113 rows, roughly a 70/30 split. The 10 remaining columns are all treated as numeric features, and 0.2% of rows contain missing values, which include the NaN values introduced when the infinity values were replaced.

The setup also reports that preprocessing uses simple imputation: numeric features are imputed with the mean and categorical features with the mode. Cross-validation uses StratifiedKFold with 10 folds. The setup uses all available CPUs (CPU jobs set to -1) and no GPU acceleration. Experiment logging is turned off, and the experiment is named `clf-default-name` with a unique session identifier (USI) of 4ad6.

With the environment configured, the data is ready for model comparison and selection.
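The train and test row counts in the setup output follow from the split fraction. The sketch below assumes `train_size` was left at 0.7, which is taken here to be PyCaret's default; the total row count comes from the setup output.

```python
import math

n_rows = 7043      # rows in the original data, from the setup output
train_size = 0.7   # assumption: the default split fraction was used

n_train = math.floor(n_rows * train_size)  # rows used for training
n_test = n_rows - n_train                  # rows held out for testing

print(n_train, n_test)
```

The result matches the reported shapes of (4930, 11) and (2113, 11).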

In [12]:
# Access an element of the setup object, here the training features
automl_element = automl.get_config("X_train")
automl_element


Unnamed: 0,tenure,PhoneService,Contract,MonthlyCharges,TotalCharges,MonthlyCharges_to_tenure_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check
4827,16,1,0,81.000000,1312.150024,5.062500,0,0,0,0
4167,50,1,1,75.699997,3876.199951,1.514000,1,0,1,0
249,42,1,1,99.000000,4298.450195,2.357143,0,0,0,0
145,65,1,3,99.050003,6416.700195,1.523846,0,1,1,0
1062,34,1,0,50.200001,1815.300049,1.476471,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...
2927,1,1,0,69.900002,69.900002,69.900002,0,0,0,0
1240,25,1,0,20.150000,536.349976,0.806000,0,0,1,1
3538,71,1,3,100.199997,7209.000000,1.411268,0,1,1,0
4613,54,1,0,79.500000,4370.250000,1.472222,0,0,0,0


## Compare classification models

In [13]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.7955,0.8346,0.4587,0.6662,0.5427,0.417,0.4293,0.176
lr,Logistic Regression,0.7945,0.8411,0.4869,0.6522,0.5566,0.4267,0.435,8.791
gbc,Gradient Boosting Classifier,0.7937,0.8396,0.4977,0.6455,0.5612,0.4294,0.4361,0.614
ridge,Ridge Classifier,0.7911,0.0,0.3975,0.6827,0.5015,0.3811,0.4038,0.152
ada,Ada Boost Classifier,0.7888,0.8351,0.4954,0.6313,0.5542,0.4187,0.4245,0.229
lightgbm,Light Gradient Boosting Machine,0.785,0.8267,0.5084,0.6176,0.5569,0.4167,0.4207,42.939
nb,Naive Bayes,0.771,0.8142,0.6345,0.5611,0.595,0.4363,0.4383,0.088
knn,K Neighbors Classifier,0.7639,0.7424,0.448,0.5704,0.5011,0.3496,0.3543,0.288
rf,Random Forest Classifier,0.7621,0.7941,0.4649,0.5639,0.5086,0.3538,0.3573,0.595
et,Extra Trees Classifier,0.7505,0.7689,0.4718,0.5356,0.501,0.3357,0.3373,0.721


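The output ranks the classification models by their cross-validated metrics, sorted by accuracy.

Linear Discriminant Analysis (LDA) has the highest accuracy among the models listed, so `compare_models` returns it as `best_model`.

LDA (Linear Discriminant Analysis):
    Accuracy: 0.7955
    AUC: 0.8346
    Recall: 0.4587
    Precision: 0.6662
    F1 Score: 0.5427
    Kappa: 0.4170
    MCC: 0.4293
    Training Time (Sec): 0.1760

LDA leads on accuracy, but its recall of 0.4587 means it identifies fewer than half of the actual churners. Logistic Regression has the highest AUC (0.8411), and Naive Bayes has the highest recall (0.6345) and F1 score (0.5950). If catching churners matters more than overall accuracy, one of those models may be a better choice. The AUC of 0.0 for the Ridge Classifier is most likely a placeholder rather than a measured value, since that model does not output class probabilities.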
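Which model counts as "best" depends on the metric used for ranking. The sketch below re-ranks four of the models using values copied from the comparison table; the `best_by` helper is a hypothetical name introduced only for this illustration.

```python
# Cross-validated metrics copied from the comparison table.
results = {
    "lda": {"Accuracy": 0.7955, "AUC": 0.8346, "Recall": 0.4587, "F1": 0.5427},
    "lr":  {"Accuracy": 0.7945, "AUC": 0.8411, "Recall": 0.4869, "F1": 0.5566},
    "gbc": {"Accuracy": 0.7937, "AUC": 0.8396, "Recall": 0.4977, "F1": 0.5612},
    "nb":  {"Accuracy": 0.7710, "AUC": 0.8142, "Recall": 0.6345, "F1": 0.5950},
}


def best_by(metric):
    """Return the model ID with the highest value for the given metric."""
    return max(results, key=lambda name: results[name][metric])


winners = {metric: best_by(metric) for metric in ["Accuracy", "AUC", "Recall", "F1"]}
print(winners)
```

In PyCaret itself, the ranking metric can be changed through the `sort` argument of `compare_models`, for example `compare_models(sort="Recall")`.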


In [15]:
best_model

## Select rows 

In [17]:
rows = df.iloc[500:510]
rows

Unnamed: 0,tenure,PhoneService,Contract,MonthlyCharges,TotalCharges,Churn,MonthlyCharges_to_tenure_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check
500,34,1,1,116.25,3899.05,0,3.419118,0,1,1,0
501,71,1,3,80.7,5676.0,0,1.13662,0,1,1,0
502,70,1,1,65.2,4543.15,0,0.931429,1,0,1,0
503,52,1,0,84.05,4326.8,0,1.616346,1,0,1,0
504,69,1,3,79.45,5502.55,0,1.151449,0,1,1,0
505,20,1,0,94.1,1782.4,1,4.705,0,0,0,0
506,11,1,0,78.0,851.8,0,7.090909,0,0,1,1
507,2,1,0,94.2,167.5,1,47.1,0,0,0,0
508,6,1,0,80.5,502.85,1,13.416667,0,0,0,0
509,1,1,0,19.85,19.85,0,19.85,0,0,1,1


## Use best_model to predict churn for the rows

In [18]:
predicted_rows = predict_model(best_model, rows)
predicted_rows


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Unnamed: 0,tenure,PhoneService,Contract,MonthlyCharges,TotalCharges,MonthlyCharges_to_tenure_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Churn,prediction_label,prediction_score
500,34,1,1,116.25,3899.050049,3.419118,0,1,1,0,0,0,0.7025
501,71,1,3,80.699997,5676.0,1.13662,0,1,1,0,0,0,0.9704
502,70,1,1,65.199997,4543.149902,0.931429,1,0,1,0,0,0,0.9493
503,52,1,0,84.050003,4326.799805,1.616346,1,0,1,0,0,0,0.8606
504,69,1,3,79.449997,5502.549805,1.151449,0,1,1,0,0,0,0.9696
505,20,1,0,94.099998,1782.400024,4.705,0,0,0,0,1,1,0.5729
506,11,1,0,78.0,851.799988,7.090909,0,0,1,1,0,0,0.6743
507,2,1,0,94.199997,167.5,47.099998,0,0,0,0,1,1,0.9179
508,6,1,0,80.5,502.850006,13.416667,0,0,0,0,1,1,0.6583
509,1,1,0,19.85,19.85,19.85,0,0,1,1,0,0,0.862


We predict churn for the selected rows using the Linear Discriminant Analysis (LDA) model.

LDA Model Performance Metrics:

    Accuracy: 1.0000
    AUC: 1.0000
    Recall: 1.0000
    Precision: 1.0000
    F1 Score: 1.0000
    Kappa: 1.0000
    MCC: 1.0000

These perfect scores mean that the model predicted `Churn` correctly for all 10 selected rows. They do not show that the model is perfect in general: the sample is very small, the rows come from the same DataFrame that was passed to `setup` (so some of them were probably used for training), and the cross-validated accuracy reported by `compare_models` is 0.7955.

Each row shows the model's `prediction_label` (0 or 1) and `prediction_score`, the probability the model assigns to the predicted label. For example, row 500 is predicted as 0 (no churn) with a score of 0.7025.
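The metrics for these ten rows can be reproduced by hand from the `Churn` and `prediction_label` columns of the prediction output. This is a plain-Python sketch of the standard definitions, not a description of how PyCaret computes them internally.

```python
# Actual and predicted labels for rows 500-509, copied from the output.
actual = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]
predicted = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # churners caught
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # non-churners cleared
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # churners missed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, tn, fp, fn, accuracy, precision, recall)
```

With only three churners in the sample, a single misclassified churner would drop recall from 1.0 to about 0.67, which shows how little a ten-row check says about general performance.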

## Save model and serialize

In [19]:
save_model(best_model, 'LDA')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'MonthlyCharges',
                                              'TotalCharges',
                                              'MonthlyCharges_to_tenure_Ratio',
                                              'Bank transfer (automatic)',
                                              'Credit card (automatic)',
                                              'Electronic check',
                                              'Mailed check'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill...
                                                               strategy='m

In [21]:
with open('LDA_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [22]:
with open('LDA_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)
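The two cells above follow the standard pickle round trip: write the object in binary mode, then read it back. The sketch below shows the same pattern on a plain dictionary that stands in for the model; the file name and the dictionary contents are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model object.
stand_in_model = {"name": "LDA", "classes": [0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stand_in_model.pk")

    # Serialize: "wb" opens the file for writing bytes.
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Deserialize: "rb" opens the file for reading bytes.
    with open(path, "rb") as f:
        restored = pickle.load(f)

round_trip_ok = restored == stand_in_model
print(round_trip_ok)
```

Note the difference between the two approaches in the notebook: `save_model` reports saving the transformation pipeline together with the model, whereas `pickle.dump(best_model, f)` stores only the object that `best_model` refers to.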

## Create new data and save to csv

In [23]:
new_data = rows.copy()
new_data.drop('Churn', axis=1, inplace=True)
new_data.to_csv('new_churn_data.csv', index=False)


In [24]:
loaded_model_prediction = loaded_model.predict(new_data)
loaded_model_prediction

array([0, 0, 0, 0, 0, 1, 0, 1, 1, 0], dtype=int8)

The `loaded_model_prediction` array contains the binary prediction for each row in `new_data`: 0 means no churn and 1 means churn.

## Probability of churn for each new prediction

In [26]:
probability_of_churn = loaded_model.predict_proba(new_data)[:, 1]
probability_of_churn

array([0.29749369, 0.02959457, 0.05072708, 0.13935076, 0.03040274,
       0.57289406, 0.32565703, 0.91791234, 0.65828767, 0.13795148])

The `probability_of_churn` array contains the predicted probability of churn for each corresponding row in `new_data`. Each value is the model's estimated probability that the customer belongs to class 1 (churn). For example, the value 0.5729 for the sixth row in `new_data` corresponds to an estimated 57.29% likelihood of churn.
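These probabilities line up with the `prediction_score` column from the earlier `predict_model` output: the score is the probability of whichever label was predicted. The sketch below checks this against the printed values, assuming a decision threshold of 0.5.

```python
# Probabilities of churn, copied from the predict_proba output.
p_churn = [0.29749369, 0.02959457, 0.05072708, 0.13935076, 0.03040274,
           0.57289406, 0.32565703, 0.91791234, 0.65828767, 0.13795148]

# Assumption: a row is labelled as churn when its probability is at least 0.5.
labels = [1 if p >= 0.5 else 0 for p in p_churn]

# The score is p for a churn prediction and 1 - p for a no-churn prediction.
scores = [round(p if label == 1 else 1 - p, 4) for p, label in zip(p_churn, labels)]

print(labels)
print(scores)
```

So a high `prediction_score` next to a label of 0 signals confidence that the customer will stay, not that the customer will churn.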

## Get percentile

In [35]:
df_no_missing = df.dropna()

percentile_rank = (
    (loaded_model.predict_proba(df_no_missing.drop('Churn', axis=1))[:, 1] <= probability_of_churn).mean() * 100
)
percentile_rank

100.0

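The `percentile_rank` is 100.0. It is worth noting what the expression measures. `df_no_missing` is the full dataset with incomplete rows dropped, so it contains training and test rows together rather than only the training set, and `probability_of_churn` holds ten probabilities rather than one. A percentile rank is meaningful for a single prediction: it is the percentage of reference probabilities that are less than or equal to that prediction. A rank of 100.0 would mean that a prediction is greater than or equal to every probability in the reference data, which cannot hold for all ten rows at once, since three of them have churn probabilities below 0.1. To rank an individual customer, compare the reference probabilities against that customer's probability alone.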

In this context, where the target variable is binary (churn or not churn), a high predicted probability of churn (close to 1.0) often suggests a high level of confidence by the model that the new data point belongs to the positive class (churn).
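A per-customer percentile rank can be computed by comparing every reference probability with each new probability separately. The sketch below uses small made-up arrays; in the notebook, the reference values would come from `predict_proba` on the reference data, and the names `reference` and `new_probs` are hypothetical.

```python
import numpy as np

# Hypothetical reference probabilities and three new predictions.
reference = np.array([0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 0.90, 0.95])
new_probs = np.array([0.03, 0.57, 0.92])

# Broadcasting builds an (8, 3) boolean grid: one column per new prediction.
# The mean of each column is the fraction of reference values <= that prediction.
percentile_ranks = (reference[:, None] <= new_probs[None, :]).mean(axis=0) * 100

print(percentile_ranks)
```

This yields one rank per new prediction, so a low-probability customer and a high-probability customer no longer share a single number.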

In [38]:
loaded_lda = load_model('LDA')

Transformation Pipeline and Model Successfully Loaded


In [40]:
loaded_lda_prediction = predict_model(loaded_lda, new_data)
loaded_lda_prediction

Unnamed: 0,tenure,PhoneService,Contract,MonthlyCharges,TotalCharges,MonthlyCharges_to_tenure_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,prediction_label,prediction_score
500,34,1,1,116.25,3899.050049,3.419118,0,1,1,0,0,0.7025
501,71,1,3,80.699997,5676.0,1.13662,0,1,1,0,0,0.9704
502,70,1,1,65.199997,4543.149902,0.931429,1,0,1,0,0,0.9493
503,52,1,0,84.050003,4326.799805,1.616346,1,0,1,0,0,0.8606
504,69,1,3,79.449997,5502.549805,1.151449,0,1,1,0,0,0.9696
505,20,1,0,94.099998,1782.400024,4.705,0,0,0,0,1,0.5729
506,11,1,0,78.0,851.799988,7.090909,0,0,1,1,0,0.6743
507,2,1,0,94.199997,167.5,47.099998,0,0,0,0,1,0.9179
508,6,1,0,80.5,502.850006,13.416667,0,0,0,0,1,0.6583
509,1,1,0,19.85,19.85,19.85,0,0,1,1,0,0.862


## Load external python script to predict churn

In [41]:
Code('predict_churn.py')

In [42]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


Predictions:
0    No Churn
1    No Churn
2    No Churn
3    No Churn
4    No Churn
5       Churn
6    No Churn
7       Churn
8       Churn
9    No Churn
Name: Churn_prediction, dtype: object


The `No Churn` and `Churn` labels in the `Churn_prediction` column indicate whether each corresponding entry is predicted as a churn or non-churn instance. 
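The contents of `predict_churn.py` are not shown here. One plausible way to produce such labels, offered only as an assumption about what the script might do, is to map the numeric predictions to strings with pandas:

```python
import pandas as pd

# Numeric predictions for the ten rows, as returned earlier by the model.
predictions = pd.Series([0, 0, 0, 0, 0, 1, 0, 1, 1, 0], name="Churn_prediction")

# Map each class to a readable label (hypothetical mapping: 0 = No Churn, 1 = Churn).
labelled = predictions.map({0: "No Churn", 1: "Churn"})

print(labelled)
```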

## Summary

The workflow begins by importing the required libraries and modules: pandas, NumPy, pickle, and PyCaret's classification functions. The cleaned churn data is loaded from a CSV file into a pandas DataFrame, infinity values are replaced with NaN, and PyCaret's AutoML environment is set up with `Churn` as the target variable. The setup is inspected by retrieving the training features with `get_config`. Several classification models are compared with PyCaret's `compare_models` function, which returns the model with the highest accuracy, here LDA. Ten rows (500-509) are then extracted from the DataFrame.

Using the best model, predictions are made for the selected rows, and the model is saved under the name `LDA` with `save_model`. The model is also serialized with pickle and loaded back. A new dataset is created by copying the 10 rows and dropping the `Churn` column. The loaded model then predicts the class and the probability of churn for each new row. An additional step calculates a percentile rank of the predictions within the distribution of predicted probabilities for the dataset.

Finally, the saved LDA pipeline is loaded with `load_model`, predictions are made for the new data, and the external script `predict_churn.py` produces labelled predictions. Together these steps demonstrate an end-to-end workflow, from loading and preparing the data to model selection, prediction, and serialization.