# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
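
For the percentile challenge above, one possible approach is sketched here: keep the training-set churn probabilities in sorted order and look up where a new probability falls. The helper name `percentile_of` and the `train_probs` values are made up for illustration; in practice the list would come from the model's predictions on the training data.

```python
from bisect import bisect_right


def percentile_of(prob, sorted_train_probs):
    """Percent of training probabilities that are <= prob."""
    rank = bisect_right(sorted_train_probs, prob)
    return 100.0 * rank / len(sorted_train_probs)


# Hypothetical training-set churn probabilities (illustration only).
train_probs = sorted([0.05, 0.10, 0.15, 0.22, 0.30, 0.41, 0.48, 0.55, 0.67, 0.90])

print(percentile_of(0.78, train_probs))  # 90.0
```

With these made-up values, nine of the ten training probabilities are at or below 0.78, so that prediction sits at the 90th percentile.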

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('updated_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,0,29.85,3.396185,0
5575-GNVDE,34,1,1,1,56.95,7.544068,0
3668-QPYBK,2,1,0,1,53.85,4.683519,1
7795-CFOCW,45,0,1,2,42.30,7.517928,0
9237-HQITU,2,1,0,0,70.70,5.021575,1
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,7.596141,0
2234-XADUH,72,1,1,3,103.20,8.904209,0
4801-JZAZL,11,0,0,0,29.60,5.847739,0
8361-LTMKD,4,1,0,1,74.40,5.725544,1


In [10]:
conda install -c conda-forge pycaret -y

Solving environment: done

added / updated specs:
    - pycaret

(Install log truncated. Packages downloaded from conda-forge include catboost-0.26, libxgboost-1.1.1, and pycaret-2.2.3.)

In [11]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [12]:
automl = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,session_id,5692
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7032, 7)"
5,Missing Values,False
6,Numeric Features,3
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


In [39]:
automl[6]

In [16]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7974,0.8462,0.5068,0.6619,0.5735,0.4438,0.451,0.034
ada,Ada Boost Classifier,0.7944,0.8412,0.5348,0.6425,0.5833,0.4484,0.452,0.12
gbc,Gradient Boosting Classifier,0.7907,0.8393,0.5,0.6443,0.5624,0.4278,0.4342,0.213
ridge,Ridge Classifier,0.7899,0.0,0.4396,0.6663,0.5291,0.4013,0.416,0.011
lda,Linear Discriminant Analysis,0.7899,0.8385,0.4978,0.6422,0.5603,0.4252,0.4315,0.012
catboost,CatBoost Classifier,0.7891,0.8374,0.4992,0.6395,0.5599,0.4241,0.4302,1.163
lightgbm,Light Gradient Boosting Machine,0.784,0.8311,0.5182,0.6197,0.5632,0.4215,0.4252,0.052
rf,Random Forest Classifier,0.7704,0.7984,0.4872,0.5893,0.5328,0.3826,0.386,0.297
knn,K Neighbors Classifier,0.7649,0.7755,0.5007,0.5746,0.5337,0.3779,0.3803,0.024
et,Extra Trees Classifier,0.758,0.7757,0.488,0.5574,0.5196,0.3591,0.361,0.209


In [25]:
best_model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=5692, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [26]:
best = compare_models(sort = 'Kappa')

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
ada,Ada Boost Classifier,0.7944,0.8412,0.5348,0.6425,0.5833,0.4484,0.452,0.088
lr,Logistic Regression,0.7974,0.8462,0.5068,0.6619,0.5735,0.4438,0.451,0.529
gbc,Gradient Boosting Classifier,0.7907,0.8393,0.5,0.6443,0.5624,0.4278,0.4342,0.168
lda,Linear Discriminant Analysis,0.7899,0.8385,0.4978,0.6422,0.5603,0.4252,0.4315,0.013
catboost,CatBoost Classifier,0.7891,0.8374,0.4992,0.6395,0.5599,0.4241,0.4302,1.09
lightgbm,Light Gradient Boosting Machine,0.784,0.8311,0.5182,0.6197,0.5632,0.4215,0.4252,0.052
ridge,Ridge Classifier,0.7899,0.0,0.4396,0.6663,0.5291,0.4013,0.416,0.01
rf,Random Forest Classifier,0.7704,0.7984,0.4872,0.5893,0.5328,0.3826,0.386,0.22
nb,Naive Bayes,0.6924,0.8157,0.8407,0.4612,0.5954,0.3799,0.4248,0.014
knn,K Neighbors Classifier,0.7649,0.7755,0.5007,0.5746,0.5337,0.3779,0.3803,0.022


In [None]:
I selected Kappa as a metric that might be the best choice for this model, noting that its values are
among the lowest of the reported metrics. Sorting by Kappa ranks the AdaBoost classifier (`ada`, 0.4484)
just ahead of logistic regression (0.4438), so AutoML now picks `ada` as the best model for my dataset.
Next I will sort by AUC instead.
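
To make the Kappa column easier to read, here is a minimal sketch of how Cohen's kappa can be computed for binary labels. It compares the observed agreement (the accuracy) with the agreement expected by chance, which is why Kappa is lower than accuracy in the tables. The `y_pred` values are hypothetical, and `cohen_kappa` is an illustrative helper, not PyCaret's own implementation.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length sequences of binary labels (0/1).

    Undefined (division by zero) when chance agreement equals 1, for example
    when both sequences contain a single class only.
    """
    n = len(y_true)
    # Observed agreement: fraction of matching labels (this is the accuracy).
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement from the marginal frequency of each class.
    p_true_1 = sum(y_true) / n
    p_pred_1 = sum(y_pred) / n
    p_chance = p_true_1 * p_pred_1 + (1 - p_true_1) * (1 - p_pred_1)
    return (p_observed - p_chance) / (1 - p_chance)


y_true = [1, 0, 0, 1, 0]  # true values listed in the assignment
y_pred = [1, 0, 0, 0, 0]  # hypothetical predictions, for illustration only

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                               # 0.8
print(round(cohen_kappa(y_true, y_pred), 4))  # 0.5455
```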

In [28]:
best = compare_models(sort = 'AUC')

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7974,0.8462,0.5068,0.6619,0.5735,0.4438,0.451,0.036
ada,Ada Boost Classifier,0.7944,0.8412,0.5348,0.6425,0.5833,0.4484,0.452,0.083
gbc,Gradient Boosting Classifier,0.7907,0.8393,0.5,0.6443,0.5624,0.4278,0.4342,0.162
lda,Linear Discriminant Analysis,0.7899,0.8385,0.4978,0.6422,0.5603,0.4252,0.4315,0.012
catboost,CatBoost Classifier,0.7891,0.8374,0.4992,0.6395,0.5599,0.4241,0.4302,1.16
lightgbm,Light Gradient Boosting Machine,0.784,0.8311,0.5182,0.6197,0.5632,0.4215,0.4252,0.048
nb,Naive Bayes,0.6924,0.8157,0.8407,0.4612,0.5954,0.3799,0.4248,0.013
rf,Random Forest Classifier,0.7704,0.7984,0.4872,0.5893,0.5328,0.3826,0.386,0.279
et,Extra Trees Classifier,0.758,0.7757,0.488,0.5574,0.5196,0.3591,0.361,0.205
knn,K Neighbors Classifier,0.7649,0.7755,0.5007,0.5746,0.5337,0.3779,0.3803,0.026


In [None]:
Sorting by AUC, logistic regression again comes out as the best model for my dataset (AUC 0.8462).
To return the top three models instead of only the best one, I can pass `n_select=3` to `compare_models`,
which ranks by the default metric, accuracy.

In [29]:
top3 = compare_models(n_select = 3)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7974,0.8462,0.5068,0.6619,0.5735,0.4438,0.451,0.034
ada,Ada Boost Classifier,0.7944,0.8412,0.5348,0.6425,0.5833,0.4484,0.452,0.094
gbc,Gradient Boosting Classifier,0.7907,0.8393,0.5,0.6443,0.5624,0.4278,0.4342,0.159
ridge,Ridge Classifier,0.7899,0.0,0.4396,0.6663,0.5291,0.4013,0.416,0.01
lda,Linear Discriminant Analysis,0.7899,0.8385,0.4978,0.6422,0.5603,0.4252,0.4315,0.012
catboost,CatBoost Classifier,0.7891,0.8374,0.4992,0.6395,0.5599,0.4241,0.4302,1.115
lightgbm,Light Gradient Boosting Machine,0.784,0.8311,0.5182,0.6197,0.5632,0.4215,0.4252,0.047
rf,Random Forest Classifier,0.7704,0.7984,0.4872,0.5893,0.5328,0.3826,0.386,0.219
knn,K Neighbors Classifier,0.7649,0.7755,0.5007,0.5746,0.5337,0.3779,0.3803,0.03
et,Extra Trees Classifier,0.758,0.7757,0.488,0.5574,0.5196,0.3591,0.361,0.191


In [30]:
df.iloc[-1].shape

(7,)

In [31]:
df.iloc[-2:-1].shape

(1, 7)

In [None]:
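The shapes differ because `df.iloc[-1]` selects a single row by position and returns a 1D Series, so its
shape only reports the number of columns, `(7,)`. The slice `df.iloc[-2:-1]` returns a 2D DataFrame with one
row (the second-to-last row), so its shape is `(1, 7)`.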
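
The difference can be reproduced on a tiny stand-in DataFrame. This is only a sketch: the two rows are copied from the data preview, and most columns are left out for brevity. It also shows two ways to get the last row as a 2D DataFrame, which is the kind of input passed to `predict_model` in the next cell.

```python
import pandas as pd

# Tiny stand-in for the churn data (two rows, three of the seven columns).
df = pd.DataFrame(
    {"tenure": [11, 4], "MonthlyCharges": [29.60, 74.40], "Churn": [0, 1]},
    index=pd.Index(["4801-JZAZL", "8361-LTMKD"], name="customerID"),
)

last_as_series = df.iloc[-1]   # single integer position -> 1D Series
last_as_frame = df.iloc[[-1]]  # list of positions -> 2D DataFrame with one row
also_frame = df.iloc[-1:]      # slice -> 2D DataFrame with one row

print(last_as_series.shape)  # (3,)
print(last_as_frame.shape)   # (1, 3)
print(also_frame.shape)      # (1, 3)
```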

In [32]:
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Label,Score
8361-LTMKD,4,1,0,1,74.4,5.725544,1,0,0.5042


In [None]:
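`predict_model` adds two columns to the data: `Label` holds the predicted class and `Score` holds a
probability. Here the label is 0 while the score is 0.5042, which suggests that `Score` is the probability
of the predicted label rather than of class 1. Under that reading, the probability of churn for this row
is 1 - 0.5042 = 0.4958, just under the 0.5 cutoff, which is why the predicted label is 0.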

In [33]:
save_model(best_model, 'LR')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs', 'passthrough'), ('pca', 'passthrough'),
                 ['trained_model',
                  LogisticRegression(C=1.0, class_weight=None, dual=False,
                 

In [None]:
We save the best model from the earlier `compare_models` run, together with its preprocessing pipeline,
so that it can be loaded from a Python file later.

In [56]:
import pickle

with open('LR_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [57]:
with open('LR_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [58]:
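# Take one row and drop the target so only the 6 feature columns remain.
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
# The pickled object is the bare estimator without PyCaret's preprocessing,
# so predicting on the raw columns raises the ValueError shown below.
loaded_model.predict(new_data)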

ValueError: X has 6 features per sample; expecting 11

In [59]:
loaded_lr = load_model('LR')

Transformation Pipeline and Model Successfully Loaded


In [60]:
predict_model(loaded_lr, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Label,Score
8361-LTMKD,4,1,0,1,74.4,5.725544,0,0.5042


In [None]:
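I saved the PyCaret model, then loaded it and made a prediction to confirm that it works. Unlike the pickled
estimator above, which failed because it expected 11 preprocessed features, the pipeline loaded with `load_model`
applies the preprocessing itself and returns the same prediction as before (label 0, score 0.5042).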
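
The load-and-predict steps can be wrapped in a function that returns a churn probability per row, as the assignment asks. The sketch below is not the actual contents of `predict_churn.py`; the names `churn_probability` and `predict_churn` are hypothetical. It assumes the PyCaret 2.x output columns `Label` and `Score` seen in this notebook, and it assumes that `Score` is the probability of the predicted label (consistent with label 0 and score 0.5042 in the output above), so rows predicted as 0 are converted with 1 - Score.

```python
import pandas as pd


def churn_probability(predictions: pd.DataFrame) -> pd.Series:
    """Convert Label/Score columns into the probability of churn (class 1).

    Assumption: Score is the probability of the predicted Label, so rows
    predicted as 0 need 1 - Score to get the probability of class 1.
    """
    predicted_churn = predictions["Label"].astype(int) == 1
    proba = predictions["Score"].where(predicted_churn, 1 - predictions["Score"])
    return proba.rename("Churn_probability")


def predict_churn(df: pd.DataFrame, model_name: str = "LR") -> pd.Series:
    """Load the saved PyCaret pipeline and return P(churn) for each row."""
    # Imported here so the helper above can be used without PyCaret installed.
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    return churn_probability(predict_model(model, data=df))


# Check the conversion on the row predicted above (Label 0, Score 0.5042).
example = pd.DataFrame({"Label": [0], "Score": [0.5042]}, index=["8361-LTMKD"])
print(churn_probability(example).round(4))
```

Calling `predict_churn(new_data)` would need PyCaret installed and the `LR` model file saved earlier in the working directory.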

In [68]:
from IPython.display import Code

Code('predict_churn.py')

In [69]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
7590-VHVEG       Churn
5575-GNVDE    No churn
3668-QPYBK    No churn
7795-CFOCW    No churn
9237-HQITU       Churn
                ...   
6840-RESVB    No churn
2234-XADUH    No churn
4801-JZAZL    No churn
8361-LTMKD    No churn
3186-AJIEK    No churn
Name: Churn_prediction, Length: 7032, dtype: object


In [None]:
I created a separate Python module, `predict_churn.py`, that takes in new data and makes predictions.
I then displayed and ran the module to check that it loads the saved model and reads the data it needs.
The output contains a binary prediction, churn or no churn, for each customer, so the model is working, but it is
not perfect. I need to look out for false positives and false negatives to make sure we interpret predictions on new data correctly.
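
One simple way to check for false positives and false negatives is to compare predicted labels with the known true values. The sketch below uses the true values for `new_churn_data.csv` given in the assignment; the `y_pred` list is hypothetical and does not come from this model, and `count_errors` is an illustrative helper.

```python
def count_errors(y_true, y_pred):
    """Return (false_positives, false_negatives) for binary 0/1 labels."""
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return false_positives, false_negatives


y_true = [1, 0, 0, 1, 0]  # true values for new_churn_data.csv, from the assignment
y_pred = [1, 0, 1, 0, 0]  # hypothetical predictions, for illustration only

fp, fn = count_errors(y_true, y_pred)
print(f"false positives: {fp}, false negatives: {fn}")
# false positives: 1, false negatives: 1
```

A false negative here is a customer who churned but was predicted not to, which is often the more costly mistake for a retention campaign.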

# Summary

Write a short summary of the process and results here.