## Import libraries and modules

In [1]:
import pandas as pd
import pickle
from IPython.display import Code
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model
import numpy as np

In [2]:
df = pd.read_csv("cleaned_churn_data.csv")
df.tail(5)

Unnamed: 0,tenure,PhoneService,Contract,MonthlyCharges,TotalCharges,Churn,MonthlyCharges_to_tenure_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check
7038,24,1,1,84.8,1990.5,0,3.533333,0,0,1,1
7039,72,1,1,103.2,7362.9,0,1.433333,0,1,1,0
7040,11,0,0,29.6,346.45,0,2.690909,0,0,0,0
7041,4,1,0,74.4,306.6,1,18.6,0,0,1,1
7042,66,1,3,105.65,6844.5,0,1.600758,1,0,1,0


## Handle Infinity values in the dataset

In [3]:
columns_with_infinity = df.columns[np.isinf(df).any()]

print("Columns with Infinity values:", columns_with_infinity)

# Replace infinity values
df[columns_with_infinity] = df[columns_with_infinity].replace([np.inf, -np.inf], np.nan)

Columns with Infinity values: Index(['MonthlyCharges_to_tenure_Ratio'], dtype='object')


The `MonthlyCharges_to_tenure_Ratio` column of the DataFrame (`df`) contains infinity values. The cell above identifies and prints the affected column, then replaces the infinity values with NaN.
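As a minimal sketch of how such values can arise and be cleaned, the toy example below assumes the ratio is computed as `MonthlyCharges / tenure` (which is consistent with the rows shown above) and that some customers have a tenure of 0. The `toy` DataFrame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first customer has a tenure of 0
toy = pd.DataFrame({"tenure": [0, 2, 10], "MonthlyCharges": [20.0, 50.0, 80.0]})

# Dividing by a tenure of 0 yields inf rather than raising an error
toy["MonthlyCharges_to_tenure_Ratio"] = toy["MonthlyCharges"] / toy["tenure"]

# Same detection and replacement steps as in the notebook cell
inf_columns = toy.columns[np.isinf(toy).any()]
toy[inf_columns] = toy[inf_columns].replace([np.inf, -np.inf], np.nan)

n_missing = int(toy["MonthlyCharges_to_tenure_Ratio"].isna().sum())
print(list(inf_columns), n_missing)
```

Replacing infinity with NaN turns an unusable value into an ordinary missing value, which an imputation step can then handle.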

## AutoML environment

In [4]:
automl = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,Session id,6993
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 11)"
4,Transformed data shape,"(7043, 11)"
5,Transformed train set shape,"(4930, 11)"
6,Transformed test set shape,"(2113, 11)"
7,Numeric features,10
8,Rows with missing values,0.2%
9,Preprocess,True


This output summarizes the setup of the PyCaret AutoML environment for the binary classification task of predicting `Churn`.

Key points:

The session ID is 6993, and the target variable is `Churn`, detected as binary. The original dataset has shape (7043, 11), and the transformed data keeps the same shape. The transformed training set has 4930 rows and the transformed test set has 2113 rows, roughly a 70/30 split. There are 10 numeric features, and 0.2% of rows have missing values, presumably the rows whose infinity values were replaced with NaN above.

Preprocessing uses simple imputation: numeric features are imputed with the mean and categorical features with the mode. Cross-validation uses StratifiedKFold with 10 folds. The setup uses all available CPUs (CPU jobs set to -1) and no GPU acceleration. Experiment logging is turned off, and the experiment is named `clf-default-name` with a unique session identifier (USI) of df3d.

With the setup complete, the environment is ready for model comparison and selection.
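To make the mean imputation step concrete, here is a small sketch using plain pandas. It is not PyCaret's internal code: the `toy` DataFrame and the `ratio` column are hypothetical, and the mean is taken over the whole column here, whereas a pipeline would normally learn it from the training data only.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric feature with one missing value
toy = pd.DataFrame({"ratio": [1.0, np.nan, 3.0, 5.0]})

# Mean imputation: replace NaN with the mean of the observed values
column_mean = toy["ratio"].mean()  # NaN is skipped by default, so the mean is 3.0
imputed = toy["ratio"].fillna(column_mean)

print(imputed.tolist())  # [1.0, 3.0, 3.0, 5.0]
```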

In [5]:
# Retrieve the training features from the experiment object returned by setup()
automl_element = automl.get_config("X_train")
automl_element


Unnamed: 0,tenure,PhoneService,Contract,MonthlyCharges,TotalCharges,MonthlyCharges_to_tenure_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check
1086,11,1,0,89.699997,1047.699951,8.154546,1,0,1,0
6871,52,1,0,94.599998,5025.799805,1.819231,1,0,1,0
669,70,0,3,57.799999,4039.300049,0.825714,1,0,1,0
3864,30,1,0,100.199997,2983.800049,3.340000,0,0,0,0
3364,21,1,0,103.900002,2254.199951,4.947619,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
2889,58,0,1,50.000000,2919.850098,0.862069,0,0,1,1
1046,52,1,1,74.000000,3877.649902,1.423077,0,1,1,0
2969,65,1,3,109.300003,7337.549805,1.681538,0,0,0,0
572,11,1,1,64.900002,697.250000,5.900000,0,1,1,0


## Compare classification models

In [6]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.7929,0.8298,0.4357,0.6683,0.5265,0.4017,0.4172,0.205
gbc,Gradient Boosting Classifier,0.7919,0.8363,0.4839,0.645,0.5522,0.4203,0.4281,0.509
ada,Ada Boost Classifier,0.7909,0.8322,0.5053,0.637,0.5616,0.427,0.4331,0.203
lr,Logistic Regression,0.7907,0.8361,0.4503,0.6536,0.5318,0.4033,0.4155,3.695
ridge,Ridge Classifier,0.7876,0.0,0.3753,0.6821,0.4836,0.3639,0.3899,0.151
lightgbm,Light Gradient Boosting Machine,0.7836,0.8258,0.5007,0.6153,0.5513,0.4107,0.4151,33.732
svm,SVM - Linear Kernel,0.773,0.0,0.4004,0.6169,0.4705,0.3402,0.3571,0.137
rf,Random Forest Classifier,0.7677,0.7919,0.4687,0.5783,0.5164,0.3661,0.3704,0.457
nb,Naive Bayes,0.7653,0.8113,0.6215,0.5533,0.585,0.4224,0.424,0.133
knn,K Neighbors Classifier,0.7651,0.7432,0.4464,0.5757,0.5021,0.3517,0.3571,0.067


The output ranks the classification models evaluated by `compare_models` and lists the performance metrics for each.

The Linear Discriminant Analysis (LDA) model has the highest accuracy among the models listed.

```
LDA (Linear Discriminant Analysis):
        Accuracy: 0.7929
        AUC: 0.8298
        Recall: 0.4357
        Precision: 0.6683
        F1 Score: 0.5265
        Kappa: 0.4017
        MCC: 0.4172
        Training Time (Sec): 0.2050
```

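LDA has the highest accuracy, but its recall of 0.4357 means it identifies fewer than half of the actual churners. Among the models listed, Naive Bayes has the highest recall (0.6215) and F1 score (0.585), and the Gradient Boosting Classifier has the highest AUC (0.8363), so the best choice depends on which metric matters most for the churn use case.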
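Which model counts as best depends on the metric used for ranking. The sketch below copies a few rows from the comparison table into a plain pandas DataFrame (the `results` name is only illustrative) and looks up the top model for each metric.

```python
import pandas as pd

# A few rows copied from the comparison table above
results = pd.DataFrame(
    {
        "Accuracy": [0.7929, 0.7919, 0.7909, 0.7907, 0.7653],
        "AUC": [0.8298, 0.8363, 0.8322, 0.8361, 0.8113],
        "Recall": [0.4357, 0.4839, 0.5053, 0.4503, 0.6215],
        "F1": [0.5265, 0.5522, 0.5616, 0.5318, 0.5850],
    },
    index=["lda", "gbc", "ada", "lr", "nb"],
)

# For each metric column, find the model with the highest value
best_by_metric = results.idxmax()
print(best_by_metric.to_dict())
# {'Accuracy': 'lda', 'AUC': 'gbc', 'Recall': 'nb', 'F1': 'nb'}
```

In PyCaret itself, `compare_models` accepts a `sort` argument (for example `sort='F1'`) to rank by a metric other than accuracy.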


In [7]:
best_model

## Select rows

In [8]:
rows = df.iloc[7000:7010]
rows

Unnamed: 0,tenure,PhoneService,Contract,MonthlyCharges,TotalCharges,Churn,MonthlyCharges_to_tenure_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check
7000,67,1,1,20.55,1343.4,0,0.306716,0,0,0,0
7001,3,1,0,49.9,130.1,1,16.633333,0,0,1,1
7002,64,1,3,105.4,6794.75,0,1.646875,1,0,1,0
7003,26,0,0,35.75,1022.5,0,1.375,0,0,0,0
7004,38,1,0,95.1,3691.2,0,2.502632,0,1,1,0
7005,23,1,1,19.3,486.2,0,0.83913,0,1,1,0
7006,40,1,0,104.5,4036.85,1,2.6125,0,1,1,0
7007,72,0,3,63.1,4685.55,0,0.876389,1,0,1,0
7008,3,1,0,75.05,256.25,1,25.016667,0,1,1,0
7009,23,1,0,81.0,1917.1,1,3.521739,0,0,0,0


## Use best_model to predict churn for the rows

In [9]:
predicted_rows = predict_model(best_model, rows)
predicted_rows

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,0.7,1.0,0.25,1.0,0.4,0.2857,0.4082


Unnamed: 0,tenure,PhoneService,Contract,MonthlyCharges,TotalCharges,MonthlyCharges_to_tenure_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,Churn,prediction_label,prediction_score
7000,67,1,1,20.549999,1343.400024,0.306716,0,0,0,0,0,0,0.9274
7001,3,1,0,49.900002,130.100006,16.633333,0,0,1,1,1,0,0.7149
7002,64,1,3,105.400002,6794.75,1.646875,1,0,1,0,0,0,0.9466
7003,26,0,0,35.75,1022.5,1.375,0,0,0,0,0,0,0.7444
7004,38,1,0,95.099998,3691.199951,2.502632,0,1,1,0,0,0,0.763
7005,23,1,1,19.299999,486.200012,0.83913,0,1,1,0,0,0,0.9452
7006,40,1,0,104.5,4036.850098,2.6125,0,1,1,0,1,0,0.7288
7007,72,0,3,63.099998,4685.549805,0.876389,1,0,1,0,0,0,0.955
7008,3,1,0,75.050003,256.25,25.016666,0,1,1,0,1,1,0.5804
7009,23,1,0,81.0,1917.099976,3.521739,0,0,0,0,1,0,0.5804


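The selected rows are predicted using the Linear Discriminant Analysis (LDA) model.

LDA Model Performance Metrics:

    Accuracy: 0.7000
    AUC: 1.0000
    Recall: 0.2500
    Precision: 1.0000
    F1 Score: 0.4000
    Kappa: 0.2857
    MCC: 0.4082

These scores should be read with caution. With only 10 rows, each metric is a coarse estimate: the model labels 7 of the 10 rows correctly, and the recall of 0.25 means that only one of the four actual churners (row 7008) is predicted as churn. Precision is 1.0 because that single positive prediction is correct. The AUC of 1.0 indicates that the model scores every churner as more likely to churn than every non-churner in this sample, so the low recall comes from where the decision threshold falls rather than from the ranking. Some of these rows may also have been part of the training set, so this is not a held-out evaluation.

Each row shows the model's `prediction_label` (0 or 1) and `prediction_score`, the probability the model assigns to the label it predicted. For example, rows 7008 and 7009 both have a score of 0.5804, but for opposite labels.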
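The metrics reported for these 10 rows can be reproduced by hand from the `Churn` and `prediction_label` columns. The sketch below uses only the standard library and the usual textbook definitions; it is not PyCaret's implementation.

```python
import math

# Actual and predicted labels for rows 7000-7009, copied from the table above
actual = [0, 1, 0, 0, 0, 0, 1, 0, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# Confusion-matrix counts
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
n = len(actual)

accuracy = (tp + tn) / n
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

# Cohen's kappa: observed agreement compared with agreement expected by chance
p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
kappa = (accuracy - p_chance) / (1 - p_chance)

# Matthews correlation coefficient
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(accuracy, recall, precision, round(f1, 4), round(kappa, 4), round(mcc, 4))
# 0.7 0.25 1.0 0.4 0.2857 0.4082
```

With one true positive, three false negatives, and no false positives, precision is perfect while recall is low, which is exactly the pattern in the reported metrics.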

## Save and serialize model

In [10]:
save_model(best_model, 'LDA')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'MonthlyCharges',
                                              'TotalCharges',
                                              'MonthlyCharges_to_tenure_Ratio',
                                              'Bank transfer (automatic)',
                                              'Credit card (automatic)',
                                              'Electronic check',
                                              'Mailed check'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill...
                                                               strategy='m

In [11]:
with open('LDA_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [12]:
with open('LDA_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

## Create new data and save to csv

In [13]:
new_data = rows.copy()
new_data.drop('Churn', axis=1, inplace=True)
new_data.to_csv('new_churn_data.csv', index=False)

In [14]:
loaded_model_prediction = loaded_model.predict(new_data)
loaded_model_prediction

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0], dtype=int8)

The `loaded_model_prediction` array contains the binary prediction for each row in `new_data`: 0 means no churn and 1 means churn. Only the ninth row (index 7008) is predicted to churn.

## Probability of churn for each new prediction

In [15]:
probability_of_churn = loaded_model.predict_proba(new_data)[:, 1]
probability_of_churn

array([0.07255947, 0.28508461, 0.05336652, 0.25559998, 0.23702122,
       0.05475078, 0.27124794, 0.04503105, 0.5804218 , 0.41962034])

The `probability_of_churn` array contains the predicted probability of class 1 (churn) for each corresponding row in `new_data`. For example, a probability of 0.5804 means the model estimates a 58.04% likelihood of churn for the ninth row in `new_data`, the only row above 0.5 and therefore the only one labeled as churn.
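The link between these probabilities, the predicted labels, and the `prediction_score` column reported by `predict_model` can be sketched with NumPy. The 0.5 threshold is an assumption here, chosen because it reproduces the labels shown in this notebook.

```python
import numpy as np

# Probabilities of churn copied from the output above
probability_of_churn = np.array([
    0.07255947, 0.28508461, 0.05336652, 0.25559998, 0.23702122,
    0.05475078, 0.27124794, 0.04503105, 0.5804218, 0.41962034,
])

# Assumed decision rule: predict churn when the probability exceeds 0.5
labels = (probability_of_churn > 0.5).astype(int)

# The score is the probability of whichever label was predicted
scores = np.where(labels == 1, probability_of_churn, 1 - probability_of_churn)

print(labels.tolist())  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(np.round(scores, 4).tolist())
```

The rounded scores match the `prediction_score` values in the `predict_model` outputs, including 0.5804 for both of the last two rows: one is the probability of churn, the other the probability of no churn.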

In [17]:
loaded_lda = load_model('LDA')

Transformation Pipeline and Model Successfully Loaded


In [18]:
loaded_lda_prediction = predict_model(loaded_lda, new_data)
loaded_lda_prediction

Unnamed: 0,tenure,PhoneService,Contract,MonthlyCharges,TotalCharges,MonthlyCharges_to_tenure_Ratio,Bank transfer (automatic),Credit card (automatic),Electronic check,Mailed check,prediction_label,prediction_score
7000,67,1,1,20.549999,1343.400024,0.306716,0,0,0,0,0,0.9274
7001,3,1,0,49.900002,130.100006,16.633333,0,0,1,1,0,0.7149
7002,64,1,3,105.400002,6794.75,1.646875,1,0,1,0,0,0.9466
7003,26,0,0,35.75,1022.5,1.375,0,0,0,0,0,0.7444
7004,38,1,0,95.099998,3691.199951,2.502632,0,1,1,0,0,0.763
7005,23,1,1,19.299999,486.200012,0.83913,0,1,1,0,0,0.9452
7006,40,1,0,104.5,4036.850098,2.6125,0,1,1,0,0,0.7288
7007,72,0,3,63.099998,4685.549805,0.876389,1,0,1,0,0,0.955
7008,3,1,0,75.050003,256.25,25.016666,0,1,1,0,1,0.5804
7009,23,1,0,81.0,1917.099976,3.521739,0,0,0,0,0,0.5804


## Use Python script to predict churn

In [19]:
Code('predict_churn.py')

In [20]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


Predictions:
0    No Churn
1    No Churn
2    No Churn
3    No Churn
4    No Churn
5    No Churn
6    No Churn
7    No Churn
8       Churn
9    No Churn
Name: Churn_prediction, dtype: object
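The source of `predict_churn.py` is not shown in this notebook. As a hedged sketch of only the final labeling step, and assuming the script maps numeric predictions to text labels, output of this form can be produced with pandas as follows. The variable names are hypothetical; the label values are the predictions obtained earlier.

```python
import pandas as pd

# Numeric predictions for the 10 new rows (0 = no churn, 1 = churn)
predicted_labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 0], name="Churn_prediction")

# Map the numeric labels to readable text
readable = predicted_labels.map({0: "No Churn", 1: "Churn"})

print("Predictions:")
print(readable)
```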


## Summary

The workflow begins by importing the required libraries and modules: pandas, NumPy, pickle, and PyCaret's classification functions. The preprocessed churn data is then loaded from a CSV file into a pandas DataFrame, infinity values are replaced with NaN, and PyCaret's AutoML environment is set up with `Churn` as the target variable.

Next, the setup is inspected by retrieving the training features with `get_config`. PyCaret's `compare_models` function compares several classification models and returns the one with the highest accuracy, here LDA. Ten rows of the DataFrame (positions 7000 to 7009) are then extracted for further analysis.

The chosen model generates predictions for the selected rows and is saved under the name `LDA` with `save_model`. The model is also serialized with pickle and reloaded. A new dataset is created by copying the 10 rows without the `Churn` column, and the reloaded model predicts both the class and the probability of churn for each row.

Finally, the saved LDA pipeline is loaded with `load_model`, and predictions are made again for the new data, both in the notebook and through the `predict_churn.py` script. Together these steps cover data loading, setup, model selection, prediction, and model persistence. The approach supports churn prediction and gives a first view of model performance, although a 10-row check is too small to draw firm conclusions.