# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
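
The percentile challenge above can be sketched with the standard library alone. This is a minimal illustration, not the assignment's required solution: the function name `percentile_of_score` and the ten training probabilities are hypothetical placeholders, and in practice the training probabilities would come from running the saved model on the training data.

```python
from bisect import bisect_right


def percentile_of_score(sorted_train_probs, prob):
    """Return the percentage of training probabilities that are <= prob.

    sorted_train_probs must already be sorted in ascending order.
    """
    if not sorted_train_probs:
        raise ValueError("need at least one training probability")
    # bisect_right counts how many sorted values are <= prob.
    rank = bisect_right(sorted_train_probs, prob)
    return 100.0 * rank / len(sorted_train_probs)


# Hypothetical churn probabilities standing in for training-set predictions.
train_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.33, 0.41, 0.55, 0.62, 0.78, 0.91])

# With these made-up values, 0.78 sits at the 90th percentile.
print(percentile_of_score(train_probs, 0.78))
```

Sorting the training probabilities once and using a binary search keeps each lookup cheap when many new rows are scored.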

## Load Data

In [1]:
import pandas as pd

df = pd.read_csv('prepped_churn_data.csv', index_col='customerID')
# Remove the engineered feature, since we will be predicting on the original features and pycaret will do its own feature engineering
df.drop('AutoPay', axis=1, inplace=True)
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,0,29.85,29.85,0
5575-GNVDE,34,1,1,1,56.95,1889.50,0
3668-QPYBK,2,1,0,1,53.85,108.15,1
7795-CFOCW,45,0,1,2,42.30,1840.75,0
9237-HQITU,2,1,0,0,70.70,151.65,1
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0
2234-XADUH,72,1,1,3,103.20,7362.90,0
4801-JZAZL,11,0,0,0,29.60,346.45,0
8361-LTMKD,4,1,0,1,74.40,306.60,1


## Pycaret

In [2]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [3]:
automl = setup(df, target='Churn')

,Description,Value
0,session_id,2869
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(7043, 7)"
5,Missing Values,0
6,Numeric Features,3
7,Categorical Features,3
8,Ordinal Features,0
9,High Cardinality Features,0


In [4]:
automl

({'parameter': 'Hyperparameters',
  'auc': 'AUC',
  'confusion_matrix': 'Confusion Matrix',
  'threshold': 'Threshold',
  'pr': 'Precision Recall',
  'error': 'Prediction Error',
  'class_report': 'Class Report',
  'rfe': 'Feature Selection',
  'learning': 'Learning Curve',
  'manifold': 'Manifold Learning',
  'calibration': 'Calibration Curve',
  'vc': 'Validation Curve',
  'dimension': 'Dimensions',
  'feature': 'Feature Importance',
  'feature_all': 'Feature Importance (All)',
  'boundary': 'Decision Boundary',
  'lift': 'Lift Chart',
  'gain': 'Gain Chart',
  'tree': 'Decision Tree',
  'ks': 'KS Statistic Plot'},
 False,
             tenure  PhoneService  Contract  PaymentMethod  MonthlyCharges  \
 customerID                                                                  
 7590-VHVEG       1             0         0              0           29.85   
 5575-GNVDE      34             1         1              1           56.95   
 3668-QPYBK       2             1         0            

In [5]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7966,0.8349,0.4683,0.6508,0.5429,0.4169,0.4272,0.045
catboost,CatBoost Classifier,0.7937,0.8327,0.4793,0.6361,0.546,0.4161,0.4234,1.327
lr,Logistic Regression,0.7935,0.8301,0.495,0.6319,0.5538,0.4222,0.4283,0.167
ada,Ada Boost Classifier,0.7925,0.8323,0.48,0.6327,0.5451,0.414,0.4212,0.026
lda,Linear Discriminant Analysis,0.7915,0.8216,0.5028,0.6221,0.5551,0.4211,0.4258,0.003
ridge,Ridge Classifier,0.7905,0.0,0.4261,0.6471,0.5123,0.3863,0.4008,0.003
lightgbm,Light Gradient Boosting Machine,0.7903,0.8255,0.4855,0.6231,0.5451,0.4116,0.4175,0.151
xgboost,Extreme Gradient Boosting,0.7793,0.8132,0.4745,0.5932,0.5263,0.385,0.3896,68.335
rf,Random Forest Classifier,0.7757,0.8013,0.4855,0.5803,0.5279,0.3825,0.3855,0.092
et,Extra Trees Classifier,0.7684,0.7761,0.498,0.5613,0.5272,0.3746,0.3762,0.084
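
The choice of metric changes which model counts as "best". As a rough illustration, the sketch below re-ranks the leaderboard rows shown above by each metric using plain Python. The dictionary layout and the helper name `best_by` are my own; within pycaret itself, `compare_models` accepts a `sort` argument (for example `sort='AUC'`) that serves the same purpose.

```python
# Leaderboard values copied from the compare_models() output above:
# model id -> (Accuracy, AUC, Recall, Prec., F1)
LEADERBOARD = {
    "gbc": (0.7966, 0.8349, 0.4683, 0.6508, 0.5429),
    "catboost": (0.7937, 0.8327, 0.4793, 0.6361, 0.5460),
    "lr": (0.7935, 0.8301, 0.4950, 0.6319, 0.5538),
    "ada": (0.7925, 0.8323, 0.4800, 0.6327, 0.5451),
    "lda": (0.7915, 0.8216, 0.5028, 0.6221, 0.5551),
    "ridge": (0.7905, 0.0, 0.4261, 0.6471, 0.5123),
    "lightgbm": (0.7903, 0.8255, 0.4855, 0.6231, 0.5451),
    "xgboost": (0.7793, 0.8132, 0.4745, 0.5932, 0.5263),
    "rf": (0.7757, 0.8013, 0.4855, 0.5803, 0.5279),
    "et": (0.7684, 0.7761, 0.4980, 0.5613, 0.5272),
}
METRICS = ("Accuracy", "AUC", "Recall", "Prec.", "F1")


def best_by(metric):
    """Return the id of the model with the highest value for the given metric."""
    column = METRICS.index(metric)
    return max(LEADERBOARD, key=lambda model_id: LEADERBOARD[model_id][column])


winners = {metric: best_by(metric) for metric in METRICS}
print(winners)
```

On these numbers, gradient boosting leads on accuracy, AUC, and precision, while linear discriminant analysis leads on recall and F1, which matters if missing a churning customer is the costlier error.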


In [6]:
best_model

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=2869, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [7]:
df.iloc[-2:-1].shape

(1, 7)

In [8]:
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Label,Score
8361-LTMKD,4,1,0,1,74.4,306.6,1,1,0.5319


In [9]:
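# Save the full preprocessing pipeline together with the model.
# Note: 'LR' is only a file name; the best model here is the gradient boosting classifier.
save_model(best_model, 'LR')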

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                                             learning_rate=0.1, loss='deviance',
                                             max_depth=3, max_features=None,
                                             max_leaf_nodes=None,
                                             min_i

In [10]:
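import pickle

# Pickling best_model directly stores only the fitted estimator,
# not the pycaret preprocessing pipeline that save_model() includes.
with open('LR_model.pk', 'wb') as f:
    pickle.dump(best_model, f)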

In [11]:
with open('LR_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)
loaded_model

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=2869, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [12]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)

In [13]:
loaded_lr = load_model('LR')

Transformation Pipeline and Model Successfully Loaded


In [14]:
predict_model(loaded_lr, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Label,Score
8361-LTMKD,4,1,0,1,74.4,306.6,1,0.5319
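
The assignment asks for the probability of churn, but `Score` is not necessarily that. The sketch below assumes the behavior of recent pycaret 2.x releases, where `Score` is the probability of whichever `Label` was predicted; older releases may differ, so check this against the installed version. The helper name `churn_probability` is hypothetical, and the second example row is made up.

```python
def churn_probability(label, score):
    """Convert a (Label, Score) pair into the probability of churn (class 1).

    Assumes Score is the probability of the predicted Label.
    """
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    return score if label == 1 else 1.0 - score


# First pair is the row predicted above; the second is a made-up 'no churn' row.
rows = [(1, 0.5319), (0, 0.75)]
probs = [churn_probability(label, score) for label, score in rows]
print(probs)
```

With a DataFrame returned by `predict_model`, the same rule could be applied row by row to the `Label` and `Score` columns.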


## Making a module out of our predictor

In [15]:
from IPython.display import Code

Code('predict_churn.py')

In [16]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC       Churn
1452-KNGVK    No churn
6723-OKKJM    No churn
7832-POPKP    No churn
6348-TACGU    No churn
Name: Churn_prediction, dtype: object
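
As a quick check, the predictions above can be compared with the true values given in the assignment, [1, 0, 0, 1, 0]. The sketch below assumes the true values are listed in the same order as the rows printed above; the helper name `binary_metrics` is my own, and five rows are far too few to draw conclusions about model quality.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# True values from the assignment text, assumed to match the printed row order.
y_true = [1, 0, 0, 1, 0]
# Predictions printed by predict_churn.py above, mapped to 1/0.
printed = ["Churn", "No churn", "No churn", "No churn", "No churn"]
y_pred = [1 if label == "Churn" else 0 for label in printed]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Under that ordering assumption, four of the five rows are classified correctly: the one churn prediction is right (precision 1.0), but one of the two actual churners is missed (recall 0.5).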


## Summary

Up to this point I have implemented the churn work in three separate AutoML packages. All three implementations were classifiers: one using auto-sklearn, another using h2o, and this week's using pycaret. In this notebook, pycaret's `compare_models()` ranked the gradient boosting classifier highest on accuracy (0.7966, with an AUC of 0.8349), and I saved that model to disk.

Of the three packages, h2o with hyperparameter tuning was my favorite, entirely because it handled more data preprocessing steps than the others. All three were sufficient for accelerating our time to a satisfactory model, and for rapid iteration and implementation this seems like the way to go when first encountering a business problem.

I do worry that the abstraction away from the algorithms being tested means I will not sufficiently understand the "best" model and will have more trouble explaining the results that come out of it. Explainability packages such as LIME (https://github.com/marcotcr/lime) could assist in this regard. I believe that digging deeper into the mechanics of the major algorithms will still be necessary for the intuitive understanding I am looking for when I put ML models into a production environment.

The modularization of our solution and the involvement of GitHub are both very big steps for me in seeing how I can use this in my day-to-day work. I can see how exposing a good classifier as an API could be a very helpful tool, one that I can now build thanks to the past few weeks of work.