# Code For Practitioners

This notebook helps practitioners identify which classification technique to use and whether the classification objective should be to maximize accuracy or the F1 score.

## Input

Before executing the code below, do the following:
<ol>
<li>Prepare the file <i>data.csv</i> in the same folder as this notebook. The file should contain one row for each appointment and one column for each attribute. Make sure that the data set also contains a column named <b>NS</b>, which indicates whether an appointment resulted in a show (NS=0) or in a no-show (NS=1).</li>
<li>Download the file <i>delta_costs.txt</i> into the same folder as this notebook.</li>
</ol>

If any of the following imports fails, install the missing package and run the cell again.

In [None]:
import pandas as pd
import sklearn as sk
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
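from sklearn.ensemble import AdaBoostClassifier
# Imported explicitly so that sk.linear_model and sk.tree are guaranteed to resolve in the classifier list below
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier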

In [None]:
df = pd.read_csv('data.csv')

In [None]:
delta_costs = pd.read_table('delta_costs.txt')
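
Before going further, it can help to confirm that both files have the expected shape. The sketch below is only an illustration: `check_inputs` is a hypothetical helper, not part of the original notebook, and the toy tables stand in for the real files. The column names it checks are the ones that the later cells of this notebook read.

```python
import pandas as pd

# Columns that the later cells read from delta_costs.txt.
REQUIRED_COST_COLUMNS = ["q", "auc", "max_cost_decrease",
                         "acc_cost_decrease", "f1_cost_decrease"]


def check_inputs(df, delta_costs):
    """Return a list of problems found in the two input tables (empty if none)."""
    problems = []
    if "NS" not in df.columns:
        problems.append("data.csv has no column named NS")
    elif not set(df["NS"].dropna().unique()) <= {0, 1}:
        problems.append("NS must contain only 0 (show) and 1 (no-show)")
    missing = [c for c in REQUIRED_COST_COLUMNS if c not in delta_costs.columns]
    if missing:
        problems.append("delta_costs.txt lacks columns: " + ", ".join(missing))
    return problems


# Toy tables standing in for the real files (all values are made up).
toy_df = pd.DataFrame({"age": [30, 45, 52], "NS": [0, 1, 0]})
toy_costs = pd.DataFrame({"q": [0.8], "auc": [0.7], "max_cost_decrease": [5.0],
                          "acc_cost_decrease": [1.0], "f1_cost_decrease": [4.0]})
print(check_inputs(toy_df, toy_costs))
```

With the real data, the call would be `check_inputs(df, delta_costs)`; an empty list means that both tables passed these basic checks.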

## Find best classifier

X is the set of independent variables

In [None]:
X = df.drop('NS',axis=1)
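
The scikit-learn classifiers used in this notebook expect numeric features, so any text-valued attribute in X has to be encoded first. The following is a minimal sketch under that assumption; the toy table and its column names (`age`, `weekday`) are invented for illustration, and one-hot encoding with `pd.get_dummies` is only one possible choice.

```python
import pandas as pd

# Toy appointments table; column names and values are invented for illustration.
toy = pd.DataFrame({
    "age": [34, 51, 27, 63],
    "weekday": ["Mon", "Tue", "Mon", "Fri"],
    "NS": [0, 1, 0, 0],
})

toy_X = toy.drop("NS", axis=1)
# One-hot encode the text column so that every feature is numeric.
toy_X = pd.get_dummies(toy_X, columns=["weekday"], dtype=int)
print(toy_X.columns.tolist())
```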

Y is the dependent variable NS

In [None]:
Y = df.NS

q is the observed show rate, i.e. the fraction of appointments that resulted in a show (NS=0)

In [None]:
q = 1.0 - Y.mean()

In [None]:
# Candidate classifiers
clfs = [sk.ensemble.RandomForestClassifier(n_jobs=-1, random_state=0),
        sk.naive_bayes.GaussianNB(),
        sk.linear_model.LogisticRegression(n_jobs=-1),
        sk.tree.DecisionTreeClassifier(random_state=0),
        sk.ensemble.AdaBoostClassifier(random_state=0),
        QuadraticDiscriminantAnalysis(),
        MLPClassifier(random_state=0)]  # sk.svm.SVC() is disabled
clfsNames = {str(cl)[:15]: cl for cl in clfs}
nfolds = 10
# Use the same shuffled folds for every classifier so that the AUCs are comparable
kf = KFold(n_splits=nfolds, random_state=2, shuffle=True)
maxAUC = -1
bestCL = None
for cl in clfs:
    print('Testing ' + type(cl).__name__)
    # Mean AUC over the folds
    auc = cross_val_score(cl, X, y=Y, cv=kf, scoring='roc_auc').mean()
    if auc > maxAUC:
        bestCL = cl
        maxAUC = auc
print('The best classification technique is: ' + str(bestCL) + '. The cross-validated AUC is ' + str(round(maxAUC, 4)))
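
The loop above keeps only the winner. When two techniques score almost the same AUC, it may be useful to see the whole ranking. The sketch below shows one way to do that, assuming synthetic data from `make_classification` in place of X and Y and only two fast classifiers; the names `candidates` and `ranking` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and Y (about 20% positives, mimicking no-shows).
toy_X, toy_Y = make_classification(n_samples=500, n_features=8,
                                   weights=[0.8, 0.2], random_state=0)

candidates = {
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}
kf = KFold(n_splits=10, shuffle=True, random_state=2)

# Mean cross-validated AUC per classifier, sorted from best to worst.
ranking = pd.Series({
    name: cross_val_score(clf, toy_X, toy_Y, cv=kf, scoring="roc_auc").mean()
    for name, clf in candidates.items()
}).sort_values(ascending=False)
print(ranking.round(4))
```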

## Determine which threshold to use

In [None]:
# Find the entries in delta_costs that are closest to q and maxAUC.
# idxmin() returns an index label, so select with .loc rather than .iloc.
closestQ = delta_costs.loc[(delta_costs.q - q).abs().idxmin(), 'q']
closestAUC = delta_costs.loc[(delta_costs.auc - maxAUC).abs().idxmin(), 'auc']

# Retrieve the cost reductions for that (q, AUC) pair.
# closestQ and closestAUC are taken unrounded from the table, so the equality tests match exactly.
row = delta_costs.loc[(delta_costs.q == closestQ) & (delta_costs.auc == closestAUC)].iloc[0]
max_delta_cost = row['max_cost_decrease']
accuracy_delta_cost = row['acc_cost_decrease']
f1_delta_cost = row['f1_cost_decrease']

print('By using predictive analytics with AUC = ' + str(closestAUC) + ' and q = ' +str(closestQ) +
      ', the maximum cost reduction is ' + str(max_delta_cost) + '%')
print('By selecting the threshold that maximizes accuracy, the cost reduction is ' + str(accuracy_delta_cost) + '%')
print('By selecting the threshold that maximizes the F1 score, the cost reduction is ' + str(f1_delta_cost) + '%')
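
The cell above reports which objective yields the larger cost reduction, but it does not compute the threshold itself. One possible way to find it, sketched here under several assumptions, is to scan a grid of thresholds over out-of-fold predicted probabilities. The synthetic data, the `LogisticRegression` stand-in for the selected classifier, and the 0.05-step grid are all illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for X and Y (about 20% positives, mimicking no-shows).
toy_X, toy_Y = make_classification(n_samples=500, n_features=8,
                                   weights=[0.8, 0.2], random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=2)
# Out-of-fold probability of the positive class (no-show) for every row.
proba = cross_val_predict(LogisticRegression(max_iter=1000), toy_X, toy_Y,
                          cv=kf, method="predict_proba")[:, 1]

# Score every candidate threshold on both objectives.
thresholds = np.linspace(0.05, 0.95, 19)
acc = [accuracy_score(toy_Y, (proba >= t).astype(int)) for t in thresholds]
f1 = [f1_score(toy_Y, (proba >= t).astype(int), zero_division=0) for t in thresholds]

best_acc_threshold = thresholds[int(np.argmax(acc))]
best_f1_threshold = thresholds[int(np.argmax(f1))]
print("Threshold maximizing accuracy:", round(best_acc_threshold, 2))
print("Threshold maximizing F1 score:", round(best_f1_threshold, 2))
```

With the real data, `toy_X`, `toy_Y` and the stand-in classifier would be replaced by X, Y and bestCL. Using out-of-fold probabilities keeps the threshold from being tuned on predictions the model made for rows it was trained on.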