- Wiley Winters
- Assignment Week 5
- September 25, 2022

# DS Automation Assignment

Using our prepared churn data from week 2:
- use TPOT to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
    - REMEMBER: TPOT only finds the optimized processing pipeline and model. It doesn't create the model. 
        - You can use `tpot.export('my_model_name.py')` (assuming you called your TPOT object tpot) and it will save a Python template with an example of the optimized pipeline. 
        - Use the template code saved from the `export()` function in your program.
- create a Python script/file/module using code from the exported template above that
    - create a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

Import required packages and functions

In [1]:
import pandas as pd
import numpy as np
import timeit
from tpot import TPOTClassifier, TPOTRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.ensemble import ExtraTreesClassifier
from xgboost import XGBClassifier

# Suppress warnings for the tpot model selection
import warnings
warnings.filterwarnings('ignore')

Since the datafile location can change, I like to put it in its own cell.

In [2]:
file = '../data/prepped_churn_data.csv'

Read in datafile and take a quick look at it.

In [3]:
df = pd.read_csv(file, index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,2,29.85,29.85,0
5575-GNVDE,34,1,1,3,56.95,1889.50,0
3668-QPYBK,2,1,0,3,53.85,108.15,1
7795-CFOCW,45,0,1,0,42.30,1840.75,0
9237-HQITU,2,1,0,2,70.70,151.65,1
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,3,84.80,1990.50,0
2234-XADUH,72,1,1,1,103.20,7362.90,0
4801-JZAZL,11,0,0,2,29.60,346.45,0
8361-LTMKD,4,1,0,3,74.40,306.60,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7043 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tenure          7043 non-null   int64  
 1   PhoneService    7043 non-null   int64  
 2   Contract        7043 non-null   int64  
 3   PaymentMethod   7043 non-null   int64  
 4   MonthlyCharges  7043 non-null   float64
 5   TotalCharges    7043 non-null   float64
 6   Churn           7043 non-null   int64  
dtypes: float64(2), int64(5)
memory usage: 440.2+ KB


Break out our train and test dataframes

In [5]:
features = df.drop('Churn', axis=1)
targets = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(features, targets,
                                                    stratify=targets,
                                                    random_state=42)
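
The `stratify=targets` argument keeps the churn ratio the same in the train and test splits. A minimal sketch of that effect, assuming a made-up array with 25% positives rather than the real churn data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the churn data: 800 rows, 25% positives (assumed ratio).
X = np.arange(800).reshape(-1, 1)
y = np.array([1] * 200 + [0] * 600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# With stratify=y, both splits keep (approximately) the positive-class share
# of the full data. The default test size is 25% of the rows.
print(len(y_tr), len(y_te))
print(y.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, the class shares in the two splits can drift apart by chance, which matters more as the dataset gets smaller or more imbalanced.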

Running *TPOTClassifier* with the settings provided in the lecture.

In [6]:
%%time
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2,
                      n_jobs=-1, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7996936011008858

Generation 2 - Current best internal CV score: 0.8000725681603165

Generation 3 - Current best internal CV score: 0.8000725681603165

Generation 4 - Current best internal CV score: 0.8000725681603165

Generation 5 - Current best internal CV score: 0.8000725681603165

Best pipeline: XGBClassifier(input_matrix, learning_rate=0.1, max_depth=2, min_child_weight=2, n_estimators=100, n_jobs=1, subsample=0.45, verbosity=0)
0.7921635434412265
CPU times: user 22.9 s, sys: 8.42 s, total: 31.3 s
Wall time: 2min 19s
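
The total of 300 in the progress bar follows from the TPOT settings: TPOT evaluates `population_size + generations * offspring_size` pipelines. A small worked check, assuming `offspring_size` was left at its default (equal to `population_size`), since it is not passed in the cell above:

```python
# Settings from the TPOTClassifier call above.
generations = 5
population_size = 50
# Assumption: offspring_size was not set, so it defaults to population_size.
offspring_size = population_size

total_pipelines = population_size + generations * offspring_size
print(total_pipelines)  # 300, the total shown in the progress bar
```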


The *TPOTClassifier* selected an `XGBClassifier()` model, with the hyperparameters shown above, as the best pipeline for this dataset under TPOT's default accuracy scoring.  I will export it for later use.

In [7]:
tpot.export('../scripts/xgbc_model.py')

Set up an XGBClassifier() model for evaluation, using the hyperparameters TPOT selected

In [8]:
xgbc = XGBClassifier(learning_rate=0.1, max_depth=2, min_child_weight=2,
                     n_estimators=100, n_jobs=-1, subsample=0.45, verbosity=0)
xgbc.fit(X_train, y_train)
print(xgbc.score(X_train, y_train))
print(xgbc.score(X_test, y_test))

0.810109806891329
0.7910278250993753


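The train and test scores are within range of the internal CV score. Next, look at the negative predictive value (NPV, the share of predicted non-churners that really did not churn) and a classification report. The true positive rate is the recall for class 1 in the report.

In [9]:
predictions = xgbc.predict(X_test)
# A binary confusion matrix flattens in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, predictions).flatten()
print('NPV: '+str(tn /(tn +fn)))

NPV: 0.8255977496483825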
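
For reference, scikit-learn lays out a binary confusion matrix with true classes as rows and predicted classes as columns, so flattening it gives `tn, fp, fn, tp`. A minimal sketch on made-up labels (not the churn data) showing how the true positive rate and the negative predictive value are each derived from those four counts:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only, not the churn data).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# For labels {0, 1}, rows are true classes and columns are predicted classes,
# so flattening yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # true positive rate = recall for class 1
npv = tn / (tn + fn)  # negative predictive value = precision for class 0
print(tn, fp, fn, tp)
print('TPR:', tpr, 'NPV:', npv)
```

In a classification report, the TPR appears as the recall of class 1 and the NPV as the precision of class 0.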


In [10]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.83      0.91      0.86      1294
           1       0.65      0.47      0.54       467

    accuracy                           0.79      1761
   macro avg       0.74      0.69      0.70      1761
weighted avg       0.78      0.79      0.78      1761



Scores are not bad.  For a classification model, maximizing the AUC should improve the model's ability to predict 0 classes as 0 and 1 classes as 1; see <a href="https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5">Understanding AUC - ROC Curve</a>. This run uses the parameters provided in the lecture, but with *scoring* set to **roc_auc**.
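
To make the difference between the two metrics concrete, here is a minimal sketch on made-up labels and probabilities (not the churn data). Accuracy needs a cutoff to turn probabilities into classes, while ROC AUC only looks at how well the scores rank positives above negatives:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels and predicted churn probabilities (illustrative only).
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.45, 0.4, 0.8, 0.2, 0.35, 0.05])

# Accuracy depends on a cutoff (0.5 here); AUC uses only the score ranking.
y_pred = (y_prob >= 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)
print('accuracy:', acc)
print('roc_auc :', auc)
```

Here two of the three positives fall below the 0.5 cutoff and count as errors for accuracy, yet they still outrank most negatives, so the AUC is higher than the accuracy.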

In [11]:
%%time
auc_tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2,
                          n_jobs=-1, scoring='roc_auc', random_state=42)
auc_tpot.fit(X_train, y_train)
print(auc_tpot.score(X_test, y_test))

Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.8402497539950419

Generation 2 - Current best internal CV score: 0.8402497539950419

Generation 3 - Current best internal CV score: 0.8402497539950419

Generation 4 - Current best internal CV score: 0.8402497539950419

Generation 5 - Current best internal CV score: 0.8402497539950419

Best pipeline: ExtraTreesClassifier(input_matrix, bootstrap=True, criterion=entropy, max_features=0.8, min_samples_leaf=19, min_samples_split=5, n_estimators=100)
0.8438899350982462
CPU times: user 20.9 s, sys: 10.5 s, total: 31.4 s
Wall time: 3min 6s


The internal CV score is now a ROC AUC (0.840) rather than an accuracy (0.800 in the first run), so the two are not directly comparable.  TPOT selected the *ExtraTreesClassifier()* as the model to use when scoring is set to **roc_auc**.

In [12]:
auc_tpot.export('../scripts/etc_tpot_model.py')
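
The assignment asks for a function that returns the probability of churn for each row. A minimal sketch of how a model with these hyperparameters might be wrapped, assuming synthetic stand-in data instead of the real churn file; `predict_churn_proba` is a hypothetical name, not part of the exported template:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

def predict_churn_proba(model, df):
    """Return P(Churn = 1) for each row of df as a Series indexed like df."""
    # Column 1 of predict_proba corresponds to model.classes_[1], here class 1.
    proba = model.predict_proba(df)[:, 1]
    return pd.Series(proba, index=df.index, name='churn_probability')

# Synthetic stand-in data (assumption): two of the churn features, random values.
rng = np.random.default_rng(42)
train = pd.DataFrame({'tenure': rng.integers(0, 73, 200),
                      'MonthlyCharges': rng.uniform(18, 120, 200)})
labels = (train['tenure'] < 12).astype(int)  # made-up rule, not real churn

model = ExtraTreesClassifier(bootstrap=True, criterion='entropy',
                             max_features=0.8, min_samples_leaf=19,
                             min_samples_split=5, n_estimators=100,
                             random_state=42)
model.fit(train, labels)

new_rows = train.head(5)
churn_proba = predict_churn_proba(model, new_rows)
print(churn_proba)
```

Returning a Series keyed by the dataframe index keeps each probability attached to its customer, which is easier to read than a bare array.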

In [13]:
etc = ExtraTreesClassifier(bootstrap=True, criterion='entropy',
                           max_features=0.8, min_samples_leaf=19, 
                           min_samples_split=5, n_estimators=100)
etc.fit(X_train, y_train)
print(etc.score(X_train, y_train))
print(etc.score(X_test, y_test))

0.8048087845513063
0.7995457126632595


In [14]:
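predictions = etc.predict(X_test)
# A binary confusion matrix flattens in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, predictions).flatten()
print('NPV: '+str(tn /(tn +fn)))

NPV: 0.8278745644599304


Slight improvement on NPV; the true positive rate (recall for class 1 in the report below) is unchanged at 0.47.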

In [15]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.83      0.92      0.87      1294
           1       0.67      0.47      0.55       467

    accuracy                           0.80      1761
   macro avg       0.75      0.69      0.71      1761
weighted avg       0.79      0.80      0.79      1761



Some improvement on the weighted averages.

-----------------------------------------------------------------------------------------
# Summary
I started this exercise using the **pycaret** libraries and packages.  It took some time, but I was able to get it to work . . . somewhat.  After watching the lecture, I switched to **tpot**.  It appears to be more stable, but does not encapsulate the pipeline build to the same degree as **pycaret**.  Both techniques produced similar results on the new_churn_data set.  The results for both are [1 0 0 0 0], which does not match the stated answer of [1 0 0 1 0].  I am not sure where the problem is, and I will conduct a more thorough analysis in the future.

The quality of the training, test, and new_data is really important.  If I train a model using six features and the new data has seven, the prediction method will not work.  The method I used in the python script is really tailored to this exercise and on the job, I would take more time to build in error checking routines to make sure that the new data matches the training data.
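
As an illustration of the kind of error checking described above, here is a minimal sketch that compares the columns of new data against the training features before predicting. `check_features` is a hypothetical helper, and dropping extra columns (rather than raising on them) is a design choice made for this example:

```python
import pandas as pd

def check_features(new_df, expected_columns):
    """Raise ValueError if new_df lacks expected columns; drop any extras."""
    missing = [c for c in expected_columns if c not in new_df.columns]
    if missing:
        raise ValueError(f'new data is missing columns: {missing}')
    # Return only the training columns, in training order.
    return new_df[list(expected_columns)]

# Feature names taken from the df.info() output earlier in the notebook.
expected = ['tenure', 'PhoneService', 'Contract', 'PaymentMethod',
            'MonthlyCharges', 'TotalCharges']

# Made-up row with an extra seventh column (hypothetical name).
new_df = pd.DataFrame([[5, 1, 0, 2, 70.0, 350.0, 'x']],
                      columns=expected + ['ExtraColumn'])
clean = check_features(new_df, expected)
print(list(clean.columns))
```

A stricter version could also compare dtypes or value ranges, but column names and order are the checks that prevent the prediction call from failing outright.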

Even with autoML processes, it can still take time to evaluate a set of models to find the best fit for the dataset.  With all of the work I've put into the **tpot** notebook, I still feel **pycaret** has more potential for easily selecting a model and building a pipeline around it.  Both packages support GPU; therefore, I may enable that feature and see how much performance improvement my kind of old GPU can provide.

I noticed that the scikit-learn `predict()` method returns a *numpy.ndarray*, which I wasn't sure how to handle.  I tried something like

```
for i in range(len(new_data)):
   # new_data[i] looks up a column labelled i, not row i
   print('Churn = '+(new_data[i], predictions[i]))
```
but received a key error, so I just printed the array.
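
Since `predict()` returns one value per input row, in row order, one way to pair each customer with its prediction is to iterate over the dataframe index and the array together. A minimal sketch, assuming made-up customer IDs and predictions in place of the real new data:

```python
import numpy as np
import pandas as pd

# Stand-in for the new data and the array returned by predict() (made-up IDs).
new_data = pd.DataFrame({'tenure': [1, 40, 12]},
                        index=pd.Index(['A-1', 'B-2', 'C-3'], name='customerID'))
predictions = np.array([1, 0, 0])

# Pair each row label with its prediction instead of indexing new_data[i].
for customer_id, churn in zip(new_data.index, predictions):
    print(f'{customer_id}: Churn = {churn}')

# Or attach the array as a Series so it lines up with the rows.
labeled = pd.Series(predictions, index=new_data.index, name='Churn')
print(labeled)
```

Positional row access with `new_data.iloc[i]` would also work inside the original loop; the `zip` form avoids indexing altogether.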