# Data science automation

This week covers automation techniques for data science with Python. We can automate many parts of the data science pipeline with Python: collecting data, processing it, cleaning it, and more. Here, we will show how to:

- use the PyCaret AutoML Python package to find an optimized ML model for our churn dataset
- create a Python script to ingest new data and make predictions on it

Often, the next step in fully operationalizing an ML pipeline like this is to use a cloud service to scale and serve the model. Options include AWS (for example AWS Lambda), GCP, or Azure ML deployment, with tools such as Docker and Kubernetes.

## Load data

In [7]:
# import libraries and load the data
import pandas as pd
df = pd.read_csv('prepped_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_monthly_ratio,tenure_total_ratio
7590-VHVEG,1,0,0,2,29.85,29.85,0,0.033501,0.033501
5575-GNVDE,34,1,1,3,56.95,1889.50,0,0.597015,0.017994
3668-QPYBK,2,1,0,3,53.85,108.15,1,0.037140,0.018493
7795-CFOCW,45,0,1,0,42.30,1840.75,0,1.063830,0.024447
9237-HQITU,2,1,0,2,70.70,151.65,1,0.028289,0.013188
...,...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,3,84.80,1990.50,0,0.283019,0.012057
2234-XADUH,72,1,1,1,103.20,7362.90,0,0.697674,0.009779
4801-JZAZL,11,0,0,2,29.60,346.45,0,0.371622,0.031751
8361-LTMKD,4,1,0,3,74.40,306.60,1,0.053763,0.013046


## AutoML with PyCaret

In [5]:
# to use PyCaret for classification tasks
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [8]:
automl = setup(df, target='Churn')

,Description,Value
0,Session id,1186
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 9)"
4,Transformed data shape,"(7032, 9)"
5,Transformed train set shape,"(4922, 9)"
6,Transformed test set shape,"(2110, 9)"
7,Numeric features,8
8,Preprocess,True
9,Imputation type,simple


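The setup() function initializes the PyCaret classification experiment: it infers the feature types, splits the data into training and test sets, and builds the preprocessing pipeline summarized above. The object it returns (stored in automl) can be used for further modeling tasks.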
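The shapes in the setup summary are consistent with a 70/30 train/test split. The short check below reproduces those numbers. The 0.7 training fraction is an assumption here (it is what we understand PyCaret's default train_size to be), so confirm it against your installed version.

```python
import math

n_rows = 7032      # rows in "Original data shape" from the setup summary
train_size = 0.7   # assumed training fraction; not stated in the notebook

# Round the training share down and give the remainder to the test set
n_train = math.floor(n_rows * train_size)
n_test = n_rows - n_train

print(n_train, n_test)
```

These values match the "Transformed train set shape" and "Transformed test set shape" rows of the summary.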


In [9]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.792,0.8361,0.5245,0.6342,0.5727,0.437,0.4415,1.104
gbc,Gradient Boosting Classifier,0.7918,0.8422,0.487,0.644,0.5539,0.4216,0.429,1.114
ada,Ada Boost Classifier,0.7905,0.8341,0.5031,0.6368,0.5606,0.4257,0.4318,0.375
lda,Linear Discriminant Analysis,0.7873,0.8248,0.5039,0.6283,0.5578,0.4201,0.4255,0.042
ridge,Ridge Classifier,0.7871,0.8249,0.4496,0.6473,0.5289,0.3972,0.4092,0.039
lightgbm,Light Gradient Boosting Machine,0.7818,0.8282,0.5199,0.606,0.5586,0.415,0.4178,0.653
rf,Random Forest Classifier,0.7737,0.813,0.4756,0.5949,0.527,0.381,0.386,0.906
et,Extra Trees Classifier,0.7674,0.7939,0.497,0.573,0.5312,0.3778,0.3802,0.591
knn,K Neighbors Classifier,0.7566,0.7524,0.4336,0.554,0.4847,0.3291,0.3341,0.666
svm,SVM - Linear Kernel,0.7391,0.7179,0.4484,0.5912,0.4678,0.3152,0.3464,0.067


compare_models() trains and cross-validates a set of classification models on the training data, prints a summary table of their performance metrics, and returns the best-performing model, which we store in best_model.
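Which model counts as "best" depends on the metric used for ranking. As a rough illustration (plain Python, not PyCaret's internals), the sketch below copies the Accuracy and AUC values for the top five rows of the table and picks the winner under each metric. The helper name best_by is hypothetical.

```python
# Accuracy and AUC copied from the top five rows of the compare_models() table
results = {
    "lr": {"Accuracy": 0.7920, "AUC": 0.8361},
    "gbc": {"Accuracy": 0.7918, "AUC": 0.8422},
    "ada": {"Accuracy": 0.7905, "AUC": 0.8341},
    "lda": {"Accuracy": 0.7873, "AUC": 0.8248},
    "ridge": {"Accuracy": 0.7871, "AUC": 0.8249},
}


def best_by(metric):
    """Return the model ID with the highest value for the given metric."""
    return max(results, key=lambda model_id: results[model_id][metric])


print(best_by("Accuracy"))  # lr
print(best_by("AUC"))       # gbc
```

Logistic regression wins on accuracy by a very small margin, while gradient boosting has the higher AUC. In PyCaret, the ranking metric can be changed with the sort argument, for example compare_models(sort='AUC').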

In [10]:
best_model

After running compare_models(), best_model holds the trained instance of the model that performed best on the metric used to rank the comparison (Accuracy by default for classification). Evaluating best_model in a cell displays that model, here Logistic Regression.

In [11]:
df.iloc[-2:-1].shape

(1, 9)

df.iloc[-2:-1].shape returns the shape of a positional slice of the DataFrame df. The slice -2:-1 starts at the second-to-last row and stops before the last row, so it selects a single row with all nine columns, which gives (1, 9).
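Positional slices with iloc follow the same start-inclusive, stop-exclusive rule as Python list slices. The sketch below shows the rule on a plain list, using the last five customer IDs from the prediction output later in this notebook, and contrasts the slice -2:-1 with 1:3.

```python
# Last five customer IDs, in the order they appear in the prediction output
customer_ids = ["6840-RESVB", "2234-XADUH", "4801-JZAZL", "8361-LTMKD", "3186-AJIEK"]

second_to_last = customer_ids[-2:-1]  # starts at index -2, stops before index -1
rows_one_and_two = customer_ids[1:3]  # starts at index 1, stops before index 3

print(second_to_last)    # ['8361-LTMKD']
print(rows_one_and_two)  # ['2234-XADUH', '4801-JZAZL']
```

The slice -2:-1 therefore yields exactly one element, the second-to-last one, which is why the DataFrame slice has a single row.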

In [12]:
predict_model(best_model, df.iloc[-2:-1])

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,1.0,0,1.0,1.0,1.0,,0.0


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,tenure_monthly_ratio,tenure_total_ratio,Churn,prediction_label,prediction_score
8361-LTMKD,4,1,0,3,74.400002,306.600006,0.053763,0.013046,1,1,0.6094


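predict_model(best_model, df.iloc[-2:-1]) uses the trained best_model to make predictions on a subset of the DataFrame, here the second-to-last row. The output adds two columns to the input: prediction_label (the predicted class) and prediction_score (the model's score for that label). Because the slice contains only one row, the metrics table printed above it is not informative.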

## Saving and loading our model

In [13]:
save_model(best_model, 'lr')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'tenure_monthly_ratio',
                                              'tenure_total_ratio'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=na...
                                                               fill_value=None,
            

save_model(best_model, 'lr') saves the preprocessing pipeline together with the trained model under the name lr (PyCaret writes it as lr.pkl). Persisting the model lets us load it later without retraining.

In [14]:
import pickle

with open('lr_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

- import pickle brings in the standard-library module for serializing and deserializing Python objects, so we can save and load them.
- with open('lr_model.pk', 'wb') as f: opens a file named lr_model.pk in binary write mode ('wb'). The file is created if it does not exist and overwritten if it does.
- pickle.dump(best_model, f) serializes the best_model object and writes it to the open file f, so the model can later be loaded back into memory.

In [15]:
with open('lr_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

The with statement opens lr_model.pk in binary read mode ('rb'); if the file does not exist, a FileNotFoundError is raised. pickle.load(f) then reads the serialized object from f and deserializes it back into a Python object, here the saved model, which is assigned to loaded_model.
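The save and load steps can be tried without a trained model. The sketch below round-trips a stand-in object through a temporary file. The dictionary model_stub and the file name model_stub.pk are hypothetical placeholders, not part of the notebook.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: any picklable Python object behaves the same way
model_stub = {"name": "lr", "coef": [0.5, -1.2], "intercept": 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_stub.pk")  # hypothetical file name

    # Serialize the object to disk in binary write mode
    with open(path, "wb") as f:
        pickle.dump(model_stub, f)

    # Deserialize it back into a new Python object in binary read mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stub)  # True: same contents
print(restored is model_stub)  # False: a separate object
```

Unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.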

In [16]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([1], dtype=int8)

This creates a new DataFrame, new_data, by copying the second-to-last row of df. Using .copy() ensures that modifications to new_data do not affect the original DataFrame. We then remove the Churn column, because the model expects only the feature columns when making predictions. The axis=1 argument specifies that we are dropping a column, and inplace=True modifies new_data directly without reassigning it. Finally, the predict method of the loaded model generates predictions from the features in new_data; the output is an array of predicted labels, here array([1]).

In [20]:
# Load the model
loaded_lr = load_model('lr')

Transformation Pipeline and Model Successfully Loaded


load_model('lr') loads the pipeline and model that we saved earlier with save_model and assigns them to loaded_lr.

In [21]:
# Make predictions with the loaded model
predict_model(loaded_lr, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,tenure_monthly_ratio,tenure_total_ratio,prediction_label,prediction_score
8361-LTMKD,4,1,0,3,74.400002,306.600006,0.053763,0.013046,1,0.6094


predict_model(loaded_lr, new_data) makes predictions on new_data with the loaded model. The result matches the prediction from the in-memory model for the same customer: label 1 with a score of 0.6094.

In [63]:
from IPython.display import Code

Code('predict_churn.py')

Code('predict_churn.py') displays the contents of the Python script predict_churn.py in the notebook.
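The contents of the script are not reproduced in this export. A minimal sketch of what such a script might look like is given below, based only on the output it prints (a Churn_prediction column with "Churn" and "No Churn" labels). The function names load_data, label_predictions, and make_predictions are hypothetical, and the actual script may be organized differently.

```python
import pandas as pd


def load_data(filepath):
    """Load prepared churn data, indexed by customerID."""
    return pd.read_csv(filepath, index_col="customerID")


def label_predictions(predictions):
    """Rename prediction_label and map 1/0 to readable labels."""
    labeled = predictions.rename(columns={"prediction_label": "Churn_prediction"})
    labeled["Churn_prediction"] = labeled["Churn_prediction"].map(
        {1: "Churn", 0: "No Churn"}
    )
    return labeled["Churn_prediction"]


def make_predictions(df, model_name="lr"):
    """Load the saved PyCaret pipeline and predict churn for every row of df."""
    # Imported here so the helpers above can be used without PyCaret installed
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    predictions = predict_model(model, data=df)
    return label_predictions(predictions)
```

With these helpers, calling make_predictions(load_data('prepped_churn_data.csv')) would return one readable label per customer, similar to the output of the script shown in this notebook.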

In [64]:
# Run the prediction script
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.7931,0.8335,0.5249,0.6337,0.5742,0.4391,0.4425


Index(['tenure', 'PhoneService', 'Contract', 'PaymentMethod', 'MonthlyCharges',
       'TotalCharges', 'tenure_monthly_ratio', 'tenure_total_ratio', 'Churn',
       'prediction_label', 'prediction_score'],
      dtype='object')
predictions:
customerID
7590-VHVEG       Churn
5575-GNVDE    No Churn
3668-QPYBK    No Churn
7795-CFOCW    No Churn
9237-HQITU       Churn
                ...   
6840-RESVB    No Churn
2234-XADUH    No Churn
4801-JZAZL    No Churn
8361-LTMKD       Churn
3186-AJIEK    No Churn
Name: Churn_prediction, Length: 7032, dtype: object


## Summary

A Conda environment was set up with Python 3.10.14 and PyCaret installed to work with churn data. The data was loaded into a Pandas DataFrame and prepared for analysis, with 'Churn' as the target variable. The data was split into training and testing sets, and missing values were handled. After testing several algorithms, Logistic Regression was found to be the best, achieving an accuracy of 79.20% and an AUC score of 0.8361. The model was saved using both PyCaret and Python's pickle module. It was then used to predict whether a specific customer was likely to churn, showcasing how effective PyCaret is for automating churn prediction.