# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
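
The percentile challenge above can be sketched without any ML library. This is a minimal illustration, assuming the training-set churn probabilities are already available as a plain list; the function name `percentile_of_score` and the sample values are hypothetical and not part of the assignment files.

```python
from bisect import bisect_right


def percentile_of_score(train_probs, new_prob):
    """Return the percentage of training probabilities that are <= new_prob."""
    ordered = sorted(train_probs)
    if not ordered:
        raise ValueError("train_probs must not be empty")
    # bisect_right gives the count of sorted values that are <= new_prob
    return 100.0 * bisect_right(ordered, new_prob) / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only
train_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]

# 8 of the 10 sample values are <= 0.78
print(percentile_of_score(train_probs, 0.78))  # 80.0
```

With real data, `train_probs` would be the churn probabilities predicted for the training rows, so the percentile depends on that distribution rather than on these sample numbers.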

In [1]:
import pandas as pd
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [2]:
df = pd.read_csv('prepped_churn_data2.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,monthly_to_total_ratio,contract_to_tunure_ratio
7590-VHVEG,1,No,0,Electronic check,29.85,29.85,0,1.000000,0.000000
5575-GNVDE,34,Yes,1,Mailed check,56.95,1889.50,0,0.030140,0.029412
3668-QPYBK,2,Yes,0,Mailed check,53.85,108.15,1,0.497920,0.000000
7795-CFOCW,45,No,1,Bank transfer (automatic),42.30,1840.75,0,0.022980,0.022222
9237-HQITU,2,Yes,0,Electronic check,70.70,151.65,1,0.466205,0.000000
...,...,...,...,...,...,...,...,...,...
6840-RESVB,24,Yes,1,Mailed check,84.80,1990.50,0,0.042602,0.041667
2234-XADUH,72,Yes,1,Credit card (automatic),103.20,7362.90,0,0.014016,0.013889
4801-JZAZL,11,No,0,Electronic check,29.60,346.45,0,0.085438,0.000000
8361-LTMKD,4,Yes,0,Mailed check,74.40,306.60,1,0.242661,0.000000


In [3]:
automl = setup(data=df, target='Churn')
best_model = compare_models(sort='AUC')

,Description,Value
0,Session id,1507
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 9)"
4,Transformed data shape,"(7032, 12)"
5,Transformed train set shape,"(4922, 12)"
6,Transformed test set shape,"(2110, 12)"
7,Numeric features,6
8,Categorical features,2
9,Preprocess,True


,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7966,0.8368,0.4992,0.6531,0.565,0.4357,0.4428,0.175
gbc,Gradient Boosting Classifier,0.7911,0.8351,0.4817,0.6435,0.5502,0.418,0.4259,0.044
ridge,Ridge Classifier,0.7944,0.8334,0.4495,0.6677,0.5365,0.4111,0.4248,0.008
lda,Linear Discriminant Analysis,0.7936,0.8334,0.5,0.6437,0.5621,0.4301,0.4363,0.007
ada,Ada Boost Classifier,0.7871,0.8285,0.4877,0.6291,0.5487,0.4124,0.4186,0.017
lightgbm,Light Gradient Boosting Machine,0.779,0.8161,0.4916,0.6035,0.5412,0.3977,0.4017,0.474
qda,Quadratic Discriminant Analysis,0.7353,0.8151,0.7554,0.5022,0.6018,0.4157,0.4371,0.007
nb,Naive Bayes,0.7592,0.8029,0.6422,0.5394,0.586,0.4181,0.4216,0.007
rf,Random Forest Classifier,0.7674,0.7919,0.474,0.5774,0.5196,0.3683,0.372,0.042
et,Extra Trees Classifier,0.7495,0.7669,0.4739,0.5319,0.501,0.3346,0.3357,0.029
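
The comparison table ranks models by AUC, but the other columns tell different stories: of the models listed, Quadratic Discriminant Analysis has the highest recall (0.7554) and the lowest accuracy and precision. The sketch below shows how accuracy, precision, recall, and F1 relate to confusion-matrix counts. The function name and the counts are hypothetical and are not taken from any model in this notebook. AUC is not included because it is computed from ranked probabilities rather than from a single confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of the rows predicted as churn, how many really churned
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the rows that really churned, how many were caught
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=50, fp=25, fn=50, tn=275)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

If missing a churner costs more than contacting a loyal customer unnecessarily, recall may be a more useful sort key than accuracy.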


In [4]:
best_model

In [5]:
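# Take the second-to-last row as a one-row sample of "new" data
new_df = df.iloc[-2:-1].copy()
# Drop the target column so the sample looks like unlabeled data
new_df.drop('Churn', axis=1, inplace=True)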

In [6]:
predictions = predict_model(best_model, data=new_df)
predictions

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,monthly_to_total_ratio,contract_to_tunure_ratio,prediction_label,prediction_score
8361-LTMKD,4,Yes,0,Mailed check,74.400002,306.600006,0.242661,0.0,0,0.5628
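
The assignment asks for the probability of churn, while the output above reports `prediction_label` and `prediction_score`. To the best of my understanding, PyCaret's `prediction_score` is the probability of the predicted label, not of the positive class, so a score of 0.5628 with label 0 would mean a churn probability below 0.5. The helper below is a minimal sketch under that assumption; `churn_probability` is a hypothetical name, not a PyCaret function.

```python
def churn_probability(prediction_label, prediction_score):
    """Convert a (label, score) pair into the probability of churn (label 1).

    Assumes prediction_score is the probability of the predicted label.
    """
    if prediction_label not in (0, 1):
        raise ValueError("prediction_label must be 0 or 1")
    if prediction_label == 1:
        return prediction_score
    # Label 0 was predicted, so churn is the complementary event
    return 1.0 - prediction_score


# Values from the row for customer 8361-LTMKD shown above
print(round(churn_probability(0, 0.5628), 4))  # 0.4372
```

An alternative that may avoid the conversion is calling `predict_model` with `raw_score=True`, which should add one score column per class.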


In [9]:
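# Saves the preprocessing pipeline and model to GBM.pkl.
# Note: the top model by AUC above is Logistic Regression, despite the file name.
save_model(best_model, 'GBM')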

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'Contract',
                                              'MonthlyCharges', 'TotalCharges',
                                              'monthly_to_total_ratio',
                                              'contract_to_tunure_ratio'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('...
                                                          

In [10]:
test_model = load_model('GBM')

# predict() returns class labels (0 = no churn, 1 = churn), not probabilities
test_model.predict(new_df)

Transformation Pipeline and Model Successfully Loaded


array([0], dtype=int8)

In [17]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.7982,0.8405,0.5051,0.6565,0.5709,0.4419,0.4484


predictions:
           Churn_prediction
customerID                 
7590-VHVEG            Churn
5575-GNVDE            Churn
3668-QPYBK            Churn
7795-CFOCW            Churn
9237-HQITU            Churn
...                     ...
6840-RESVB            Churn
2234-XADUH            Churn
4801-JZAZL            Churn
8361-LTMKD            Churn
3186-AJIEK            Churn

[7032 rows x 1 columns]
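
The rows visible in the output above all carry the label `Churn`, so it may be worth tallying the labels across the full output and, for `new_churn_data.csv`, comparing predictions with the true values given in the assignment. The sketch below shows one way to do both with plain Python lists. The function names and `example_predictions` are hypothetical; only `true_values` comes from the assignment text.

```python
from collections import Counter


def label_counts(labels):
    """Count how often each predicted label appears."""
    return Counter(labels)


def accuracy(predicted, actual):
    """Return the fraction of positions where predicted and actual agree."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


true_values = [1, 0, 0, 1, 0]          # from the assignment text
example_predictions = [1, 0, 0, 0, 0]  # hypothetical, for illustration only

print(label_counts(example_predictions))
print(accuracy(example_predictions, true_values))  # 0.8
```

A tally in which every row receives the same label would suggest checking how the script maps numeric predictions to label strings.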


# Summary

Write a short summary of the process and results here.

We used the PyCaret library to find the best model for predicting churn for our phone company, ranking the candidates by AUC; Logistic Regression came out on top with an AUC of 0.8368. We then ran the saved model to predict which customers are likely to leave. I had to modify the predict_churn.py file because I didn’t have a copy of the prepared churn data file, so I used my own. I found this process incredibly useful, but I had to run it at least 100 times to get it to work without errors.