# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
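
For the first optional challenge, the percentile of a new prediction can be read off the sorted training-set probabilities. The sketch below is only an illustration: `percentile_of` is a hypothetical helper name and `train_probs` holds made-up values, not output from this notebook.

```python
from bisect import bisect_right


def percentile_of(train_probs, p):
    """Return the percentage of training-set probabilities that are <= p."""
    ordered = sorted(train_probs)
    return 100.0 * bisect_right(ordered, p) / len(ordered)


# Made-up training-set churn probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]

# Eight of the ten values are <= 0.78, so this lands at the 80th percentile.
print(percentile_of(train_probs, 0.78))
```

In practice `train_probs` would be the churn probabilities predicted for the training rows, computed once and saved alongside the model.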

In [1]:
import pandas as pd
df = pd.read_csv('prepared_churn_data_updated.csv',index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
7590-VHVEG,1,0,0,3,29.85,29.85,0,29.850000
5575-GNVDE,34,1,1,2,56.95,1889.50,0,55.573529
3668-QPYBK,2,1,0,2,53.85,108.15,1,54.075000
7795-CFOCW,45,0,1,1,42.30,1840.75,0,40.905556
9237-HQITU,2,1,0,3,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,2,84.80,1990.50,0,82.937500
2234-XADUH,72,1,1,0,103.20,7362.90,0,102.262500
4801-JZAZL,11,0,0,3,29.60,346.45,0,31.495455
8361-LTMKD,4,1,0,2,74.40,306.60,1,76.650000


In [2]:
!python --version

Python 3.10.16


Using pycaret to find an ML algorithm that performs best on the data

In [3]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [4]:
automl = setup(df, target='Churn')

,Description,Value
0,Session id,8281
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 8)"
4,Transformed data shape,"(7032, 8)"
5,Transformed train set shape,"(4922, 8)"
6,Transformed test set shape,"(2110, 8)"
7,Numeric features,7
8,Preprocess,True
9,Imputation type,simple


In [5]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7936,0.8336,0.5115,0.6417,0.5678,0.4347,0.4404,0.301
ridge,Ridge Classifier,0.7914,0.821,0.4442,0.6616,0.5302,0.4029,0.4168,0.008
lda,Linear Discriminant Analysis,0.7893,0.821,0.4962,0.6345,0.556,0.4206,0.4267,0.011
gbc,Gradient Boosting Classifier,0.7875,0.8317,0.4801,0.6322,0.5451,0.41,0.417,0.116
ada,Ada Boost Classifier,0.7863,0.8334,0.4908,0.626,0.5493,0.4121,0.4178,0.034
lightgbm,Light Gradient Boosting Machine,0.7848,0.8211,0.4924,0.6222,0.5486,0.4101,0.4156,0.095
rf,Random Forest Classifier,0.77,0.8029,0.4671,0.5849,0.5188,0.3703,0.3747,0.073
et,Extra Trees Classifier,0.7641,0.7861,0.4832,0.5677,0.5207,0.3659,0.3688,0.071
knn,K Neighbors Classifier,0.7639,0.7439,0.4412,0.5724,0.4977,0.347,0.3523,0.22
qda,Quadratic Discriminant Analysis,0.7464,0.8161,0.7201,0.5165,0.6015,0.4229,0.4356,0.01


In [6]:
# compare_models() sorts the table by Accuracy by default, so best_model holds the model with the highest Accuracy.
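
The choice of metric changes which model comes out on top. As a rough check, the sketch below re-ranks a few rows copied from the table above in plain Python; `leaderboard` and `best_by` are hypothetical names used only for this illustration. In pycaret itself, the same effect should be available by passing the metric name to the `sort` argument, for example `compare_models(sort='AUC')`.

```python
# A few rows copied from the compare_models() output above.
leaderboard = [
    {"model": "lr", "Accuracy": 0.7936, "AUC": 0.8336, "Recall": 0.5115, "F1": 0.5678},
    {"model": "gbc", "Accuracy": 0.7875, "AUC": 0.8317, "Recall": 0.4801, "F1": 0.5451},
    {"model": "ada", "Accuracy": 0.7863, "AUC": 0.8334, "Recall": 0.4908, "F1": 0.5493},
    {"model": "qda", "Accuracy": 0.7464, "AUC": 0.8161, "Recall": 0.7201, "F1": 0.6015},
]


def best_by(metric):
    """Return the id of the model with the highest value for the given metric."""
    return max(leaderboard, key=lambda row: row[metric])["model"]


for metric in ("Accuracy", "AUC", "Recall", "F1"):
    print(metric, "->", best_by(metric))
```

With these rows, Logistic Regression leads on Accuracy and AUC, while Quadratic Discriminant Analysis leads on Recall and F1, which matters if catching churners is more important than overall accuracy.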

In [7]:
best_model

In [8]:
# Logistic Regression (LR) ranks first by Accuracy, so best_model is used for the predictions below.

In [9]:
# Predicting on a single row (the second-to-last row of df) as a stand-in for new data.

In [10]:
df.iloc[-2:-1].shape

(1, 8)

In [11]:
predict_model(best_model, df.iloc[-2:-1])

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,1.0,0,1.0,1.0,1.0,,0.0


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,Churn,prediction_label,prediction_score
8361-LTMKD,4,1,0,2,74.400002,306.600006,76.650002,1,1,0.5717


In [12]:
# Passed the best model and the new data (as a DataFrame) to pycaret's predict_model function.
# It added two columns to the output: "prediction_label" and "prediction_score".

Saving and loading model

In [13]:
save_model(best_model, 'LR')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'charge_per_tenure'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('c...
                                                         

In [14]:
# Saved the trained model. pycaret's save_model function writes the transformation pipeline and the model to disk as a pickle file.
# A pickle file stores Python objects in binary format on disk.

In [15]:
import pickle
with open('LR.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [16]:
# open() opens the file LR.pk in binary write mode ('wb') so the model can be written to it. The 'with' statement closes the file automatically when the block exits.

In [17]:
with open('LR.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [18]:
# Opens the pickle file in binary read mode ('rb') and loads the model back.

In [19]:
new_datafile = df.iloc[-2:-1].copy()
new_datafile.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_datafile)

array([1], dtype=int8)

In [20]:
# The model above was trained in this same notebook. Next, reload the model that was saved with pycaret's save_model.

In [21]:
loadedmodel_new = load_model('LR')

Transformation Pipeline and Model Successfully Loaded


In [22]:
predict_model(loadedmodel_new, new_datafile)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,prediction_label,prediction_score
8361-LTMKD,4,1,0,2,74.400002,306.600006,76.650002,1,0.5717


In [23]:
# Used load_model to reload the saved model and predict_model to make predictions with it.

Creating a Python module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe

In [24]:
from IPython.display import Code
Code('predict_new_churn.py')
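
The contents of predict_new_churn.py are not shown in this export. A minimal sketch of what such a module might look like follows; it is not the actual file. The names `churn_probability` and `predict_churn` are hypothetical, and the code assumes that `prediction_score` is the probability of the predicted label, so rows predicted as 0 need `1 - score` to get the probability of churn.

```python
import pandas as pd


def churn_probability(predictions: pd.DataFrame) -> pd.Series:
    """Convert pycaret's label/score columns into P(churn) for each row.

    Assumption: prediction_score is the probability of the predicted label,
    so it is flipped for rows whose prediction_label is 0.
    """
    label = predictions["prediction_label"]
    score = predictions["prediction_score"]
    # Keep the score where churn (1) was predicted, otherwise use 1 - score.
    prob = score.where(label == 1, 1 - score)
    return prob.rename("churn_probability")


def predict_churn(df: pd.DataFrame, model_name: str = "LR") -> pd.Series:
    """Load the saved pipeline and return P(churn) for each row of df."""
    # Imported here so churn_probability can be used without pycaret installed.
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    return churn_probability(predict_model(model, data=df))
```

A script built on this could read new_churn_data.csv with `pd.read_csv(..., index_col='customerID')` and print the result of `predict_churn`.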

In [25]:
%run predict_new_churn.py

Transformation Pipeline and Model Successfully Loaded


predictions:
customerID
9305-CKSKC       Churn
1452-KNGVK    No Churn
6723-OKKJM       Churn
7832-POPKP       Churn
6348-TACGU       Churn
Name: Contract, dtype: object
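
The printed labels can be compared with the true values given in the assignment, [1, 0, 0, 1, 0]. The sketch below assumes the true values are listed in the same row order as the printed predictions, with 1 meaning Churn and 0 meaning No Churn.

```python
# True labels from the assignment text and the labels printed above,
# assuming both follow the same row order (1 = Churn, 0 = No Churn).
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # churners predicted as churn
fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-churners predicted as churn
fn = sum(t == 1 and p == 0 for t, p in pairs)  # churners that were missed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-churners predicted correctly

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Under that assumption, both churners are caught, but two of the three non-churners are also flagged as churn. Five rows are too few to draw conclusions about the model.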


# Summary

I used pycaret on week two's churn data to find the best ML model. I created a new Conda environment and lowered the Python version to 3.10.16 because pycaret does not support the newer Python 3.12.
I modified the columns of week two's churn data to match new_churn_data.csv.
compare_models sorted the models by Accuracy by default, and Logistic Regression (LR) was the best model by that metric. I used this model to predict on the data and saved the trained model as a pickle file. I used load_model to reload the saved model and predict_model to make predictions with it.
I created a Python file that loads the new data and prints out the predictions for it. I tested the Python module and function with the new data, new_churn_data.csv.
In the run shown above, the best model (LR) has a cross-validated accuracy of 0.7936 and an AUC of 0.8336.
