# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare their performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
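
For the percentile challenge, one possible approach is to sort the training-set churn probabilities once and then look up where a new probability falls. The sketch below uses only the standard library; `percentile_of` and the `train_probs` values are hypothetical names and numbers made up for illustration, not output from this notebook.

```python
from bisect import bisect_right


def percentile_of(prob, train_probs_sorted):
    """Return the percentage of training probabilities that are <= prob.

    train_probs_sorted must already be sorted in ascending order.
    """
    if not train_probs_sorted:
        raise ValueError("need at least one training probability")
    # bisect_right counts how many sorted values are <= prob.
    rank = bisect_right(train_probs_sorted, prob)
    return 100.0 * rank / len(train_probs_sorted)


# Hypothetical training-set churn probabilities (illustration only).
train_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.31, 0.40, 0.55, 0.62, 0.70, 0.85])

pct = percentile_of(0.78, train_probs)
print(f"A churn probability of 0.78 is at the {pct:.0f}th percentile")
```

In a real script, `train_probs` would come from the model's predicted churn probabilities on the training data.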

In [4]:
import pandas as pd

df = pd.read_csv('data/prepped_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,totcharges_to_tenure_ratio
7590-VHVEG,1,0,0,2,29.85,29.85,0,29.850000
5575-GNVDE,34,1,2,3,56.95,1889.50,0,55.573529
3668-QPYBK,2,1,0,3,53.85,108.15,1,54.075000
7795-CFOCW,45,0,2,0,42.30,1840.75,0,40.905556
9237-HQITU,2,1,0,2,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,2,3,84.80,1990.50,0,82.937500
2234-XADUH,72,1,2,1,103.20,7362.90,0,102.262500
4801-JZAZL,11,0,0,2,29.60,346.45,0,31.495455
8361-LTMKD,4,1,0,3,74.40,306.60,1,76.650000


In [5]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [6]:
automl = setup(df, target='Churn')

,Description,Value
0,session_id,8380
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7032, 8)"
5,Missing Values,False
6,Numeric Features,4
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


In [15]:
automl[6]

customerID,tenure,MonthlyCharges,TotalCharges,PhoneService_0,Contract_0,Contract_2,Contract_3,PaymentMethod_0,PaymentMethod_1,PaymentMethod_2,PaymentMethod_3
7590-VHVEG,1.0,29.850000,29.850000,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
5575-GNVDE,34.0,56.950001,1889.500000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3668-QPYBK,2.0,53.849998,108.150002,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
7795-CFOCW,45.0,42.299999,1840.750000,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
9237-HQITU,2.0,70.699997,151.649994,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
6840-RESVB,24.0,84.800003,1990.500000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
2234-XADUH,72.0,103.199997,7362.899902,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4801-JZAZL,11.0,29.600000,346.450012,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
8361-LTMKD,4.0,74.400002,306.600006,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0


In [16]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7956,0.8391,0.49,0.6501,0.5573,0.4282,0.4362,0.139
catboost,CatBoost Classifier,0.7913,0.8351,0.4961,0.6349,0.5555,0.4222,0.4285,1.318
ada,Ada Boost Classifier,0.7911,0.8373,0.5046,0.6314,0.5589,0.4248,0.4305,0.071
lda,Linear Discriminant Analysis,0.7901,0.8246,0.5099,0.6248,0.5605,0.4248,0.4291,0.01
lr,Logistic Regression,0.7887,0.8325,0.4891,0.6281,0.5489,0.414,0.4201,0.592
lightgbm,Light Gradient Boosting Machine,0.7861,0.8282,0.5046,0.6159,0.5539,0.4152,0.4192,0.234
ridge,Ridge Classifier,0.7851,0.0,0.4322,0.6361,0.5134,0.3823,0.3946,0.008
rf,Random Forest Classifier,0.7796,0.8045,0.4992,0.5994,0.5438,0.4002,0.4037,0.158
xgboost,Extreme Gradient Boosting,0.7767,0.8182,0.4899,0.5935,0.536,0.3909,0.3945,0.379
knn,K Neighbors Classifier,0.7678,0.7493,0.4314,0.5811,0.4942,0.348,0.355,0.026
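
The leaderboard above is sorted by accuracy, but the assignment asks for a deliberate choice of metric, and the winner changes with that choice. As a rough illustration that uses only the numbers in the table, the sketch below picks the top model under several metrics. `leaderboard` and `best_by` are names introduced here for illustration. With pycaret itself, passing a metric name such as `sort='Recall'` to `compare_models` should produce the corresponding ordering; that call was not run in this notebook.

```python
# Cross-validated scores copied from the compare_models() table above.
leaderboard = {
    "gbc":      {"Accuracy": 0.7956, "AUC": 0.8391, "Recall": 0.4900, "Prec.": 0.6501, "F1": 0.5573},
    "catboost": {"Accuracy": 0.7913, "AUC": 0.8351, "Recall": 0.4961, "Prec.": 0.6349, "F1": 0.5555},
    "ada":      {"Accuracy": 0.7911, "AUC": 0.8373, "Recall": 0.5046, "Prec.": 0.6314, "F1": 0.5589},
    "lda":      {"Accuracy": 0.7901, "AUC": 0.8246, "Recall": 0.5099, "Prec.": 0.6248, "F1": 0.5605},
    "lr":       {"Accuracy": 0.7887, "AUC": 0.8325, "Recall": 0.4891, "Prec.": 0.6281, "F1": 0.5489},
    "lightgbm": {"Accuracy": 0.7861, "AUC": 0.8282, "Recall": 0.5046, "Prec.": 0.6159, "F1": 0.5539},
    "ridge":    {"Accuracy": 0.7851, "AUC": 0.0,    "Recall": 0.4322, "Prec.": 0.6361, "F1": 0.5134},
    "rf":       {"Accuracy": 0.7796, "AUC": 0.8045, "Recall": 0.4992, "Prec.": 0.5994, "F1": 0.5438},
    "xgboost":  {"Accuracy": 0.7767, "AUC": 0.8182, "Recall": 0.4899, "Prec.": 0.5935, "F1": 0.5360},
    "knn":      {"Accuracy": 0.7678, "AUC": 0.7493, "Recall": 0.4314, "Prec.": 0.5811, "F1": 0.4942},
}


def best_by(metric):
    """Return the model ID with the highest score for the given metric."""
    return max(leaderboard, key=lambda name: leaderboard[name][metric])


for metric in ("Accuracy", "AUC", "Recall", "Prec.", "F1"):
    print(f"{metric:>8}: {best_by(metric)}")
```

On these numbers, gradient boosting leads on accuracy, AUC, and precision, while linear discriminant analysis leads on recall and F1. The gaps are small, so the choice of metric matters more than the raw ranking.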


In [17]:
best_model

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=8380, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [18]:
df.iloc[-2:-1].shape

(1, 8)

In [19]:
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,totcharges_to_tenure_ratio,Label,Score
8361-LTMKD,4,1,0,3,74.4,306.6,1,76.65,1,0.6029


In [20]:
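# Note: despite the 'LDA' in the file name, best_model here is the
# GradientBoostingClassifier shown in the previous output.
save_model(best_model, 'LDA_Churn')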

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                                             learning_rate=0.1, loss='deviance',
                                             max_depth=3, max_features=None,
                                             max_leaf_nodes=None,
                                             min_i

In [21]:
import pickle

# This pickles only the fitted estimator, not pycaret's preprocessing pipeline.
with open('LDA_Churn_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [25]:
with open('LDA_Churn_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [None]:
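# Take one row as "new" data and drop the target column.
new_data = df.iloc[-2:-1].drop(columns='Churn')
# loaded_model is the bare estimator, so it expects the transformed features
# (see automl[6]) rather than these raw columns; this call is likely to fail.
# The load_model/predict_model cells that follow apply the saved pipeline.
loaded_model.predict(new_data)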

In [31]:
loaded_lda = load_model('LDA_Churn')

Transformation Pipeline and Model Successfully Loaded


In [32]:
predict_model(loaded_lda, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,totcharges_to_tenure_ratio,Label,Score
8361-LTMKD,4,1,0,3,74.4,306.6,76.65,1,0.6029
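
The assignment asks for the probability of churn, while `predict_model` returns a `Label` and a `Score`. The sketch below converts one into the other under two assumptions that should be checked against the installed pycaret version: that `Score` is the probability of the predicted `Label` (in some versions it may instead be the probability of the positive class), and that `Label` 1 means churn. `churn_probability` is a hypothetical helper name.

```python
def churn_probability(label, score):
    """Convert a (Label, Score) pair into P(churn).

    Assumption: score is the probability of the predicted label,
    and label 1 means churn.
    """
    if label not in (0, 1):
        raise ValueError(f"expected label 0 or 1, got {label!r}")
    # If the model predicted "no churn", P(churn) is the complement.
    return score if label == 1 else 1.0 - score


# Row 8361-LTMKD from the output above: Label=1, Score=0.6029.
print(churn_probability(1, 0.6029))

# Hypothetical row predicted as "no churn" with Score=0.9.
print(round(churn_probability(0, 0.9), 4))
```

Applied to a DataFrame, the same rule can be written as a row-wise or vectorized expression over the `Label` and `Score` columns.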


In [38]:
from IPython.display import Code

Code('predict_churn.py')

In [39]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC       Churn
1452-KNGVK    No churn
6723-OKKJM    No churn
7832-POPKP    No churn
6348-TACGU    No churn
Name: Churn_prediction, dtype: object
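
The predictions printed above can be compared with the true values given in the assignment, `[1, 0, 0, 1, 0]`. The sketch below does this by hand with the standard library. It assumes the true values are listed in the same row order as `new_churn_data.csv`, which the assignment text does not state explicitly.

```python
# True values from the assignment text (assumed to be in file order).
true_labels = [1, 0, 0, 1, 0]

# Predictions printed by predict_churn.py above, in the same order.
predicted = ["Churn", "No churn", "No churn", "No churn", "No churn"]
pred_labels = [1 if p == "Churn" else 0 for p in predicted]

pairs = list(zip(true_labels, pred_labels))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churn correctly flagged
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churn correctly passed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed churners

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp) if (tp + fp) else float("nan")
recall = tp / (tp + fn) if (tp + fn) else float("nan")

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With only five rows these figures are anecdotal, but they show the pattern visible in the cross-validation table: the model is fairly accurate overall while missing some of the customers who actually churn.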


# Summary

Write a short summary of the process and results here.