# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
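
To make the metric choice in the list above concrete, here is a minimal sketch that computes accuracy, precision, and recall by hand for the five true labels given in the assignment. The predicted labels are hypothetical, picked only for illustration; they are not output from any model in this notebook.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall as a dict."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


y_true = [1, 0, 0, 1, 0]  # true values listed in the assignment
y_pred = [1, 0, 0, 0, 0]  # hypothetical predictions that miss one churner

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these hypothetical predictions, accuracy looks respectable while recall shows that half of the churners were missed, which is why recall or AUC can be a better ranking metric for churn. In pycaret, the ranking metric can be changed with the `sort` argument, for example `compare_models(sort='Recall')`.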

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
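
The percentile challenge above can be sketched with the standard library alone. The helper name `percentile_of` and the training probabilities below are hypothetical, made up for illustration; in practice the reference list would be the churn probabilities predicted for the training set.

```python
from bisect import bisect_right


def percentile_of(value, reference):
    """Return the percentage (0-100) of reference values that are <= value."""
    ordered = sorted(reference)
    if not ordered:
        raise ValueError("reference distribution is empty")
    # bisect_right counts how many sorted values are <= value
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Hypothetical churn probabilities predicted on the training set
train_probs = [0.05, 0.10, 0.12, 0.20, 0.25, 0.31, 0.40, 0.55, 0.70, 0.85]

# Hypothetical churn probabilities for new customers
new_probs = [0.78, 0.08]

percentiles = [percentile_of(p, train_probs) for p in new_probs]
for prob, pct in zip(new_probs, percentiles):
    print(f"probability {prob:.2f} is at percentile {pct:.0f}")
```

If SciPy is available, `scipy.stats.percentileofscore(train_probs, p, kind='weak')` should give the same "less than or equal" percentile.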

In [1]:
# !pip install  --upgrade pycaret

In [2]:
# !pip list

In [70]:
import pandas as pd

df = pd.read_csv('./prepped_churn_data.csv')
df

Unnamed: 0.1,Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,churn,Monthlycharges_TotalCharges_ratio,tenure_ratio
0,0,7590-VHVEG,1,0,Month-to-month,Electronic check,29.85,29.85,No,0,1.000000,0.033501
1,1,5575-GNVDE,34,1,One year,Mailed check,56.95,1889.50,No,0,0.030140,0.597015
2,2,3668-QPYBK,2,1,Month-to-month,Mailed check,53.85,108.15,Yes,1,0.497920,0.037140
3,3,7795-CFOCW,45,0,One year,Bank transfer (automatic),42.30,1840.75,No,0,0.022980,1.063830
4,4,9237-HQITU,2,1,Month-to-month,Electronic check,70.70,151.65,Yes,1,0.466205,0.028289
...,...,...,...,...,...,...,...,...,...,...,...,...
7027,7038,6840-RESVB,24,1,One year,Mailed check,84.80,1990.50,No,0,0.042602,0.283019
7028,7039,2234-XADUH,72,1,One year,Credit card (automatic),103.20,7362.90,No,0,0.014016,0.697674
7029,7040,4801-JZAZL,11,0,Month-to-month,Electronic check,29.60,346.45,No,0,0.085438,0.371622
7030,7041,8361-LTMKD,4,1,Month-to-month,Mailed check,74.40,306.60,Yes,1,0.242661,0.053763


# AutoML with pycaret

In [71]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [72]:
automl = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,Session id,3942
1,Target,Churn
2,Target type,Binary
3,Target mapping,"No: 0, Yes: 1"
4,Original data shape,"(7032, 12)"
5,Transformed data shape,"(7032, 17)"
6,Transformed train set shape,"(4922, 17)"
7,Transformed test set shape,"(2110, 17)"
8,Numeric features,8
9,Categorical features,3


In [73]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.868
nb,Naive Bayes,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.049
ridge,Ridge Classifier,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.055
qda,Quadratic Discriminant Analysis,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.049
ada,Ada Boost Classifier,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.049
et,Extra Trees Classifier,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.115
svm,SVM - Linear Kernel,0.7365,0.7486,0.7365,0.7409,0.7163,0.28,0.305,0.058
rf,Random Forest Classifier,0.7357,1.0,0.7357,0.6461,0.6251,0.0078,0.0394,0.144
dt,Decision Tree Classifier,0.7343,0.5,0.7343,0.5391,0.6217,0.0,0.0,0.048
gbc,Gradient Boosting Classifier,0.7343,1.0,0.7343,0.5391,0.6217,0.0,0.0,0.171
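
Six models scoring 1.0 on every metric is a strong hint of target leakage rather than a genuinely perfect fit. In the data preview above, the numeric `churn` column appears to be a 0/1 copy of the `Churn` target, and the saved pipeline lists it among the numeric features. The sketch below shows one way to detect such columns before calling `setup`; the helper name `find_target_copies` is hypothetical, and the sample frame reuses a few columns from the first five rows of the preview.

```python
import pandas as pd


def find_target_copies(df, target):
    """Return feature columns whose values map one-to-one onto the target."""
    suspects = []
    for col in df.columns.drop(target):
        # Each feature value maps to a single target value ...
        forward = df.groupby(col)[target].nunique().max()
        # ... and each target value maps to a single feature value.
        backward = df.groupby(target)[col].nunique().max()
        if forward == 1 and backward == 1:
            suspects.append(col)
    return suspects


# A few columns from the first five rows of the data preview
sample = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK",
                   "7795-CFOCW", "9237-HQITU"],
    "tenure": [1, 34, 2, 45, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70],
    "Churn": ["No", "No", "Yes", "No", "Yes"],
    "churn": [0, 0, 1, 0, 1],
})

leaky = find_target_copies(sample, "Churn")
print(leaky)

# Drop the leaky column and the identifier before modelling
clean = sample.drop(columns=leaky + ["customerID"])
print(list(clean.columns))
```

On the full dataset, the same idea would be to pass the cleaned frame to `setup(clean, target='Churn')` and rerun `compare_models()`; the scores would then be expected to fall to more realistic values.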


In [74]:
best_model

In [75]:
df.iloc[-2:-1].shape

(1, 12)

In [76]:
predict_model(best_model, df.iloc[-2:-1])

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,1.0,0,1.0,1.0,1.0,,0.0


Unnamed: 0.1,Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,churn,Monthlycharges_TotalCharges_ratio,tenure_ratio,Churn,prediction_label,prediction_score
7030,7041,8361-LTMKD,4,1,Month-to-month,Mailed check,74.400002,306.600006,1,0.242661,0.053763,Yes,Yes,0.9928


# Saving and loading our model

In [84]:
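# The file name 'GBC' is only a label: the pipeline saved here wraps the
# logistic regression chosen by compare_models, as the output below shows.
save_model(best_model, 'GBC')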

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['Unnamed: 0', 'tenure',
                                              'PhoneService', 'MonthlyCharges',
                                              'TotalCharges', 'churn',
                                              'Monthlycharges_TotalCharges_ratio',
                                              'tenure_ratio'],
                                     transformer=...
                  TransformerWrapper(exclude=None, include=None,
                                     transformer=CleanColumnNames(match='[\\]\\[\\,\\{\\}\\"\\:]+'))),
                 ('trained_model',
                  LogisticRegression(C=1.0, class_weight=

In [85]:
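import pickle

# Note: this pickles only the object returned by compare_models. As far as I
# can tell, unlike save_model it does not bundle the preprocessing pipeline.
with open('GBC_model.pk', 'wb') as f:
    pickle.dump(best_model, f)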

In [86]:
with open('GBC_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [87]:
loaded_lda = load_model('GBC')

Transformation Pipeline and Model Successfully Loaded


In [88]:
# Take the second-to-last row and drop the target so it resembles unseen data
new_data = df.iloc[-2:-1].drop(columns='Churn')
loaded_lda.predict(new_data)

0    Yes
Name: Churn, dtype: object

In [89]:
predict_model(loaded_lda, new_data)

Unnamed: 0.1,Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,churn,Monthlycharges_TotalCharges_ratio,tenure_ratio,prediction_label,prediction_score
7030,7041,8361-LTMKD,4,1,Month-to-month,Mailed check,74.400002,306.600006,1,0.242661,0.053763,Yes,0.9928


In [90]:
from IPython.display import Code

Code('predict_churn.py')
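
The contents of predict_churn.py are not reproduced in this export. As a rough idea of what such a module could contain, here is a minimal sketch. It assumes that `prediction_score` is the probability of the predicted label (so rows predicted as "No" need `1 - score`), that the positive label is "Yes", and that the model was saved under the name 'GBC' as above. The function names are hypothetical.

```python
import pandas as pd


def churn_probability(predictions, positive_label="Yes"):
    """Convert prediction_label / prediction_score columns into P(churn)."""
    is_positive = predictions["prediction_label"] == positive_label
    score = predictions["prediction_score"]
    # Keep the score for predicted churners, use 1 - score otherwise
    return score.where(is_positive, 1 - score).rename("churn_probability")


def predict_churn(df, model_name="GBC"):
    """Load the saved pipeline and return the churn probability per row."""
    # Imported lazily so churn_probability can be used without PyCaret
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    return churn_probability(predict_model(model, data=df))


# First row mirrors the prediction shown above; the second row is hypothetical
example = pd.DataFrame({
    "prediction_label": ["Yes", "No"],
    "prediction_score": [0.9928, 0.75],
})
probs = churn_probability(example)
print(probs)
```

For the assignment, the function would be called as `predict_churn(pd.read_csv('new_churn_data.csv'))` and the result printed.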

# Summary

1. Data Preparation
- Loaded prepped_churn_data.csv and set up PyCaret with `Churn` as the target.

2. Model Training & Selection
- Compared ML models with `compare_models()`, which ranks by accuracy by default, and kept the top-ranked model (logistic regression).
- Six models scored 1.0 on every metric. Scores this perfect most likely reflect target leakage, because the numeric `churn` column, which appears to be a 0/1 copy of `Churn`, was left among the features.
- Saved the model using PyCaret's `save_model` and pickle.

3. Prediction Pipeline
- Created a function to predict churn from new_churn_data.csv.
- Optionally returns churn probabilities and percentiles.

4. Testing & Results
- Tested on new_churn_data.csv, comparing predictions with true values [1, 0, 0, 1, 0].
