# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
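
For the percentile challenge, one possible approach is to compare each new probability against the sorted probabilities predicted on the training set. The sketch below uses only the standard library. The values in `train_probs` are made up for illustration (they are not results from this notebook), and `percentile_of` is a hypothetical helper name.

```python
from bisect import bisect_right


def percentile_of(prob, sorted_train_probs):
    """Return the percent of training probabilities that are <= prob."""
    rank = bisect_right(sorted_train_probs, prob)
    return 100.0 * rank / len(sorted_train_probs)


# Made-up training-set churn probabilities (illustration only).
train_probs = sorted([0.05, 0.10, 0.15, 0.22, 0.30, 0.41, 0.55, 0.63, 0.70, 0.85])

for p in (0.78, 0.12):
    print(f"p={p:.2f} is at percentile {percentile_of(p, train_probs):.0f}")
```

In practice, `train_probs` would be replaced by the churn probabilities the saved model predicts for the training rows.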

In [105]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

df = pd.read_csv('/Users/nidhi/Documents/Intro to DS/classes/Week1/prepped_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,TotalCharges_tenure_ratio
7590-VHVEG,1,0,0,0,29.85,29.85,0,29.850000
5575-GNVDE,34,1,1,1,56.95,1889.50,0,55.573529
3668-QPYBK,2,1,0,1,53.85,108.15,1,54.075000
7795-CFOCW,45,0,1,2,42.30,1840.75,0,40.905556
9237-HQITU,2,1,0,0,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0,82.937500
2234-XADUH,72,1,1,3,103.20,7362.90,0,102.262500
4801-JZAZL,11,0,0,0,29.60,346.45,0,31.495455
8361-LTMKD,4,1,0,1,74.40,306.60,1,76.650000


In [107]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [109]:
automl = setup(df, target='Churn')

,Description,Value
0,Session id,1343
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 8)"
4,Transformed data shape,"(7032, 8)"
5,Transformed train set shape,"(4922, 8)"
6,Transformed test set shape,"(2110, 8)"
7,Numeric features,7
8,Preprocess,True
9,Imputation type,simple


In [111]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7936,0.8312,0.5061,0.6436,0.5656,0.4329,0.4389,0.013
gbc,Gradient Boosting Classifier,0.7911,0.8346,0.4771,0.6488,0.5481,0.4165,0.4259,0.083
ada,Ada Boost Classifier,0.7897,0.8294,0.4931,0.6371,0.5542,0.4199,0.4268,0.026
lda,Linear Discriminant Analysis,0.7869,0.8167,0.4801,0.633,0.5447,0.4091,0.4166,0.005
ridge,Ridge Classifier,0.7865,0.8166,0.4251,0.6527,0.5134,0.3846,0.3999,0.005
lightgbm,Light Gradient Boosting Machine,0.7851,0.8204,0.5069,0.6197,0.5559,0.4162,0.421,0.211
rf,Random Forest Classifier,0.781,0.8012,0.4855,0.6128,0.5407,0.3996,0.4049,0.061
et,Extra Trees Classifier,0.7668,0.782,0.4901,0.5731,0.5275,0.3741,0.3766,0.043
knn,K Neighbors Classifier,0.7591,0.7392,0.4312,0.5636,0.4877,0.3339,0.3396,0.01
nb,Naive Bayes,0.7401,0.8092,0.6988,0.5086,0.5883,0.4054,0.4167,0.004
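
The table above is sorted by accuracy, but the top model changes with the metric. As a small sketch, the code below re-ranks the rows copied from the table by each metric in plain Python. The names `results`, `METRICS`, and `best_by` are hypothetical helpers, not part of pycaret. To the best of my knowledge, `compare_models` also accepts a `sort` argument (for example `sort='AUC'`) to do this ranking directly.

```python
# Rows copied from the compare_models() table above:
# (model id, accuracy, AUC, recall, precision, F1)
results = [
    ("lr",       0.7936, 0.8312, 0.5061, 0.6436, 0.5656),
    ("gbc",      0.7911, 0.8346, 0.4771, 0.6488, 0.5481),
    ("ada",      0.7897, 0.8294, 0.4931, 0.6371, 0.5542),
    ("lda",      0.7869, 0.8167, 0.4801, 0.6330, 0.5447),
    ("ridge",    0.7865, 0.8166, 0.4251, 0.6527, 0.5134),
    ("lightgbm", 0.7851, 0.8204, 0.5069, 0.6197, 0.5559),
    ("rf",       0.7810, 0.8012, 0.4855, 0.6128, 0.5407),
    ("et",       0.7668, 0.7820, 0.4901, 0.5731, 0.5275),
    ("knn",      0.7591, 0.7392, 0.4312, 0.5636, 0.4877),
    ("nb",       0.7401, 0.8092, 0.6988, 0.5086, 0.5883),
]

# Column position of each metric inside a row tuple.
METRICS = {"Accuracy": 1, "AUC": 2, "Recall": 3, "Prec.": 4, "F1": 5}


def best_by(metric):
    """Return the id of the model with the highest value for the metric."""
    col = METRICS[metric]
    return max(results, key=lambda row: row[col])[0]


for metric in METRICS:
    print(f"{metric:8s} -> {best_by(metric)}")
```

For churn, where missing a churner is often the costly error, recall or F1 may be a more useful ranking than accuracy. On this table both would favor Naive Bayes over logistic regression.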


In [113]:
best_model 

In [115]:
df.iloc[-2:-1].shape

(1, 8)

In [117]:
predict_model(best_model, df.iloc[-2:-1])

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,1.0,0,1.0,1.0,1.0,,0.0


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,TotalCharges_tenure_ratio,Churn,prediction_label,prediction_score
8361-LTMKD,4,1,0,1,74.400002,306.600006,76.650002,1,1,0.5698


In [119]:
save_model(best_model, 'LR')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'TotalCharges_tenure_ratio'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean...
                                                               fill_value=N

In [121]:
import pickle

with open('LR.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [123]:
with open('LR.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [125]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([1], dtype=int8)

In [127]:
loaded_lr = load_model('LR')

Transformation Pipeline and Model Successfully Loaded


In [129]:
predict_model(loaded_lr, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,TotalCharges_tenure_ratio,prediction_label,prediction_score
8361-LTMKD,4,1,0,1,74.400002,306.600006,76.650002,1,0.5698


In [131]:
from IPython.display import Code

Code('predict_churn.py')

In [133]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


predictions:
0       Churn
1    No churn
2    No churn
3    No churn
4    No churn
Name: churn_prediction, dtype: object


# Summary

In this assignment, I loaded my prepared churn data and used pycaret for autoML. I ran compare_models() to find the best model, so the best_model object now holds the highest-scoring model, logistic regression (LR), ranked by accuracy, which is the default metric.

Next I selected a single row of my dataset (the second-to-last row, via df.iloc[-2:-1]) and made a prediction with best_model. This adds a 'prediction_label' column with the predicted label, 1, and a 'prediction_score' column with the probability of the predicted class (here class 1), 0.5698. The true Churn value for this row is also 1, so the model predicted it correctly.

I then saved the trained model in two ways: with pycaret's save_model, which also stores the transformation pipeline, and as a pickle file. I wrote a Python file named predict_churn.py that loads the saved model, takes in new data, and makes predictions. In it, I renamed prediction_label to churn_prediction and replaced the 0s and 1s with 'No churn' and 'Churn'.
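
The contents of predict_churn.py are not reproduced in this notebook. As a rough sketch of the kind of function the assignment asks for, the code below returns the class-1 (churn) probability for each row of a dataframe. It assumes the model exposes scikit-learn's `predict_proba` and `classes_`. The name `churn_probabilities` and the `_StubModel` stand-in (used only so the example runs without the saved model) are hypothetical.

```python
import numpy as np
import pandas as pd


def churn_probabilities(df, model):
    """Return the probability of churn (class 1) for each row of df."""
    # Drop the target column if the caller left it in the dataframe.
    features = df.drop(columns=["Churn"], errors="ignore")
    proba = model.predict_proba(features)
    # Find which predict_proba column corresponds to class 1.
    churn_col = list(model.classes_).index(1)
    return pd.Series(proba[:, churn_col], index=df.index, name="churn_probability")


class _StubModel:
    """Stand-in for the trained model: churn probability falls with tenure."""

    classes_ = np.array([0, 1])

    def predict_proba(self, X):
        p = np.clip(1.0 - X["tenure"].to_numpy() / 72.0, 0.0, 1.0)
        return np.column_stack([1.0 - p, p])


# Two made-up customers (illustration only).
demo = pd.DataFrame({"tenure": [1, 72]}, index=["cust-a", "cust-b"])
result = churn_probabilities(demo, _StubModel())
print(result)
```

With the real model, the stub would be replaced by the unpickled LR.pk object or, most likely, the pipeline returned by `load_model('LR')`. Unlike 'prediction_score', which follows the predicted label, this always reports the probability of churn.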

I tested the script on the new churn data by running the file with the Jupyter "magic" command %run. The predictions were 1, 0, 0, 0, 0 and the true values are 1, 0, 0, 1, 0, so 4 of the 5 predictions are correct, with one false negative (the fourth customer). The model works reasonably well but is not perfect, and five rows are too few to judge it reliably.
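
The comparison above can be checked with a few lines of plain Python. This is a minimal sketch that counts the confusion-matrix cells for the five new rows and derives accuracy, precision, and recall from them. The variable names are my own.

```python
y_true = [1, 0, 0, 1, 0]  # true values given in the assignment
y_pred = [1, 0, 0, 0, 0]  # labels printed by predict_churn.py

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # churners caught
tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-churners kept
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
fn = sum(t == 1 and p == 0 for t, p in pairs)  # churners missed

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else float("nan")
recall = tp / (tp + fn) if (tp + fn) else float("nan")

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On these five rows the accuracy is 0.80, but recall is only 0.50 because one of the two churners was missed.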