# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
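
The percentile challenge above can be sketched without any ML library. This is a minimal illustration, not the course's reference solution: `percentile_of` is a hypothetical helper name, and `train_probs` is invented data standing in for the churn probabilities the model would produce on the training set.

```python
from bisect import bisect_right


def percentile_of(value, sorted_reference):
    """Return the percent of reference values that are <= value (0 to 100)."""
    if not sorted_reference:
        raise ValueError("reference distribution is empty")
    # bisect_right counts how many sorted reference values are <= value
    rank = bisect_right(sorted_reference, value)
    return 100.0 * rank / len(sorted_reference)


# Hypothetical training-set churn probabilities, for illustration only
train_probs = sorted([0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.90, 0.95])

new_prob = 0.78
pct = percentile_of(new_prob, train_probs)
print(f"P(churn) = {new_prob:.2f}, percentile = {pct:.0f}")
```

In a real script, the reference list would be built once from the model's predicted probabilities on the training data and reused for every new prediction.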

In [65]:
import pandas as pd

# Load churn data
churn_data = pd.read_csv('churn_data.csv', index_col='customerID')

In [66]:
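# Import only the pycaret functions this notebook uses
from pycaret.classification import compare_models, predict_model, save_model, setup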

# Initialize the setup
automl = setup(data=churn_data, target='Churn')


| | Description | Value |
|---|---|---|
| 0 | Session id | 6911 |
| 1 | Target | Churn |
| 2 | Target type | Binary |
| 3 | Target mapping | No: 0, Yes: 1 |
| 4 | Original data shape | (7043, 7) |
| 5 | Transformed data shape | (7043, 12) |
| 6 | Transformed train set shape | (4930, 12) |
| 7 | Transformed test set shape | (2113, 12) |
| 8 | Numeric features | 3 |
| 9 | Categorical features | 3 |


In [67]:
# Compare models and rank them by AUC instead of the default metric (accuracy);
# other metrics such as 'Accuracy' or 'Recall' can be passed to `sort` as well
best_model = compare_models(sort='AUC')


| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| ada | Ada Boost Classifier | 0.7884 | 0.8327 | 0.7884 | 0.7774 | 0.7793 | 0.4164 | 0.4226 | 0.061 |
| gbc | Gradient Boosting Classifier | 0.7907 | 0.8326 | 0.7907 | 0.7787 | 0.7795 | 0.4139 | 0.423 | 0.097 |
| lr | Logistic Regression | 0.7897 | 0.8323 | 0.7897 | 0.7787 | 0.7809 | 0.4208 | 0.4265 | 0.07 |
| ridge | Ridge Classifier | 0.7858 | 0.8237 | 0.7858 | 0.7718 | 0.7725 | 0.3928 | 0.4036 | 0.031 |
| lda | Linear Discriminant Analysis | 0.7901 | 0.8237 | 0.7901 | 0.7809 | 0.7832 | 0.4301 | 0.434 | 0.027 |
| lightgbm | Light Gradient Boosting Machine | 0.7826 | 0.8228 | 0.7826 | 0.772 | 0.7748 | 0.4065 | 0.4108 | 0.124 |
| nb | Naive Bayes | 0.6884 | 0.8109 | 0.6884 | 0.7918 | 0.7065 | 0.371 | 0.4156 | 0.023 |
| rf | Random Forest Classifier | 0.768 | 0.7954 | 0.768 | 0.7583 | 0.7615 | 0.3745 | 0.3772 | 0.104 |
| et | Extra Trees Classifier | 0.754 | 0.7712 | 0.754 | 0.7459 | 0.7491 | 0.3455 | 0.3469 | 0.088 |
| svm | SVM - Linear Kernel | 0.7623 | 0.7432 | 0.7623 | 0.757 | 0.7434 | 0.3247 | 0.3478 | 0.033 |
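
Since the models above are ranked by AUC, it may help to see what that metric measures. The sketch below uses the pairwise definition: AUC is the fraction of (churner, non-churner) pairs in which the churner receives the higher score, with ties counted as half. `pairwise_auc` is a hypothetical helper, the labels reuse the true values listed in the assignment, and the scores are invented for illustration; they are not outputs of any model in this notebook.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties = 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 0, 1, 0]                    # true values from the assignment text
ranked_scores = [0.9, 0.4, 0.6, 0.7, 0.2]   # invented: churners always score higher
flat_scores = [0.5, 0.5, 0.5, 0.5, 0.5]     # invented: no ranking information

print(pairwise_auc(labels, ranked_scores))
print(pairwise_auc(labels, flat_scores))
```

Because AUC depends only on how the scores rank the rows, it is unaffected by the classification threshold, which is one reason to prefer it over accuracy on an imbalanced target such as churn.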


In [68]:
best_model

In [69]:
# Sanity check: predict on a single row (the second-to-last customer) of the training data
predict_model(best_model, data=churn_data.iloc[-2:-1])

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Ada Boost Classifier | 1.0 | 0 | 1.0 | 1.0 | 1.0 | | 0.0 |


| customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | Churn | prediction_label | prediction_score |
|---|---|---|---|---|---|---|---|---|---|
| 8361-LTMKD | 4 | Yes | Month-to-month | Mailed check | 74.400002 | 306.600006 | Yes | Yes | 0.5007 |


In [70]:
# Save the model to disk
save_model(best_model, 'best_churn_model')


Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'MonthlyCharges',
                                              'TotalCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,...
                                     include=['Contract', 'PaymentMethod'],
                                     transformer=OneHotEncoder(cols=['Contract',
                                                       

In [71]:
from IPython.display import Code

Code('predict_churn.py')

In [72]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


predictions:
           Churn_prediction
customerID                 
9305-CKSKC            Churn
1452-KNGVK            Churn
6723-OKKJM            Churn
7832-POPKP            Churn
6348-TACGU            Churn
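
The output above shows predicted labels, while the assignment asks for the probability of churn for each row. The contents of predict_churn.py are not shown here, so the following is only a sketch of one way such a function could return probabilities. It assumes pycaret 3 behaviour, where `predict_model` adds `prediction_label` and `prediction_score` columns and the score is the probability of the predicted label; `churn_probability` and `predict_churn_probability` are hypothetical names, and `'Yes'` is assumed to be the churn label, as in the prediction table earlier.

```python
def churn_probability(label, score, positive_label="Yes"):
    """Convert a (predicted label, score of that label) pair into P(churn).

    Assumes the score is the probability of the predicted label, so a
    'No' prediction with score 0.8 means P(churn) = 1 - 0.8.
    """
    return score if label == positive_label else 1.0 - score


def predict_churn_probability(df, model_name="best_churn_model"):
    """Return a pandas Series with the churn probability for each row of df."""
    # Imported lazily so the helper above works without pycaret installed
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)          # loads best_churn_model.pkl
    preds = predict_model(model, data=df)   # adds prediction_label / prediction_score
    probs = preds.apply(
        lambda row: churn_probability(row["prediction_label"], row["prediction_score"]),
        axis=1,
    )
    return probs.rename("churn_probability")
```

To use it, one would read the new data with `pd.read_csv('new_churn_data.csv', index_col='customerID')` and print the result of `predict_churn_probability` on that dataframe.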


# Summary


First, I created a virtual environment with Python 3.9.11, which supports the pycaret package. I then ran pycaret's `setup` on the prepared churn data and used `compare_models` to find the best-performing model for predicting customer churn, ranking the models by AUC. The Ada Boost Classifier ranked first with an AUC of 0.8327 and an accuracy of 0.7884, although Gradient Boosting (AUC 0.8326) and Logistic Regression (AUC 0.8323) were nearly tied with it. I saved the best model to disk and wrote a Python script, predict_churn.py, that loads the model and predicts churn for new data. Running the script on new_churn_data.csv printed a predicted label of "Churn" for all five customers. Compared with the true values [1, 0, 0, 1, 0], only two of the five predictions are correct, so the model over-predicts churn on this small sample.
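
The five printed predictions can be scored against the true values listed in the assignment. The sketch below is a plain-Python check, assuming that "Churn" maps to 1 and that the true values are in the same row order as the script output; `score_predictions` is a hypothetical helper name.

```python
def score_predictions(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct non-churners
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


y_true = [1, 0, 0, 1, 0]  # true values given in the assignment
y_pred = [1, 1, 1, 1, 1]  # script output: "Churn" for every customer

print(score_predictions(y_true, y_pred))
```

Recall is perfect because every actual churner is flagged, but precision and accuracy are both 0.4, since three of the five "Churn" predictions are false alarms.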