In this notebook, we explore the decision tree model and tune its maximum depth. We work with three training sets, each with a different subset of features.
- train_0 contains all the original features (ingredients).
- train_1 contains features found by scraping a website for food-related words.
- train_2 contains features found by keeping the ingredients that occurred at least 50 times in the training data.
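
As an illustration of the frequency filter behind train_2, here is a minimal sketch. It assumes the features are 0/1 indicator columns (one per ingredient) next to the `id` and `cuisine` columns. The helper name `keep_frequent_features` and the toy data are hypothetical and are not the code that produced `train_trimmed.csv`.

```python
import pandas as pd

def keep_frequent_features(df, min_count=50, id_cols=("id", "cuisine")):
    """Keep indicator columns whose column sum is at least min_count.

    Hypothetical helper: assumes every non-id column is a 0/1 indicator.
    """
    feature_cols = [c for c in df.columns if c not in id_cols]
    counts = df[feature_cols].sum(axis=0)  # occurrences per ingredient
    frequent = counts[counts >= min_count].index.tolist()
    kept_ids = [c for c in id_cols if c in df.columns]
    return df[kept_ids + frequent]

# Toy data: "salt" occurs 3 times, "saffron" only once
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "cuisine": ["a", "b", "a", "b"],
    "salt": [1, 1, 1, 0],
    "saffron": [0, 0, 1, 0],
})
trimmed = keep_frequent_features(toy, min_count=2)
print(list(trimmed.columns))
```

With `min_count=2`, only `salt` survives the filter in the toy data; the notebook's threshold of 50 would be passed as `min_count=50`.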

In [None]:
## Import packages 
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score
from sklearn.tree import plot_tree

In [None]:
## Import data
#train_0 = pd.read_csv("../Data/original_data.csv") # Left commented out: loading is very slow and using train_0 always crashes the Python kernel
train_1 = pd.read_csv("../Data/key_words_data.csv")
train_2 = pd.read_csv("../Data/train_trimmed.csv")

train_list = {1:train_1, 2:train_2} #Do one at a time in case the kernel crashes

## Tuning the depth of the tree

In [None]:
## Make stratified k fold splits of the data
skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=429)

cv_accuracy_score = {i:[] for i in train_list} #To store the cross-validation accuracy score
cv_recall_score = {i:[] for i in train_list}
cv_f1_score = {i:[] for i in train_list}
max_depth = {0:100,1:50,2:200} #To store the maximum depths for hyperparameter tuning

## Use cross-validation to tune the depth of the decision tree
for i,train in train_list.items(): 
    X_train = train.drop(columns=["id","cuisine"])
    y_train = train["cuisine"]
        
    for d in range(1,max_depth[i]+1):
        clf = DecisionTreeClassifier(max_depth=d)
        accuracy_list = [] #To store the accuracy for each of the 5 cross-validation folds
        f1_list = []
        recall_list = []
        for train_idx, test_idx in skf.split(X_train,y_train):
            X_train_train, y_train_train = X_train.iloc[train_idx], y_train.iloc[train_idx]
            X_holdout, y_holdout = X_train.iloc[test_idx], y_train.iloc[test_idx]
            clf.fit(X_train_train,y_train_train)
            prediction = clf.predict(X_holdout)
            accuracy_list.append(accuracy_score(y_holdout,prediction))
            f1_list.append(f1_score(y_holdout,prediction,average="weighted"))
            recall_list.append(recall_score(y_holdout,prediction,average="weighted"))
            
        cv_accuracy_score[i].append(np.mean(accuracy_list))
        cv_f1_score[i].append(np.mean(f1_list))
        cv_recall_score[i].append(np.mean(recall_list))

In [None]:
## Store the KPIs for further analysis
kpi_df = {}
for i in train_list:
    df = pd.DataFrame(
        np.transpose([range(1,max_depth[i]+1), cv_accuracy_score[i],
                      cv_f1_score[i], cv_recall_score[i]]),
        columns=["depth","cv accuracy","cv f1 score","cv recall score"])
    kpi_df[i] = df

## Export KPI data as a csv
for i in train_list: 
    filename = "decision_tree_hyperparameter_tuning_train_" + str(i) + ".csv"
    kpi_df[i].to_csv(filename, index=False)

In [None]:
## Plot accuracy vs depth
for i in train_list:
    plt.figure()
    plt.title("Cross-validation accuracy vs depth of decision tree for train data "+str(i))
    plt.xlabel("Depth of decision tree")
    plt.ylabel("Accuracy")
    plt.plot(range(1,max_depth[i]+1),cv_accuracy_score[i])
    plt.show()

## Results after tuning the max_depth 
- train_1 - the best max_depth is 35, with accuracy = 59%
- train_2 - max_depth ~114 gives about 58% accuracy; the accuracy then increases slowly, reaching 59% at about max_depth 137
- train_0 - the run stops around max_depth 14, when the kernel crashes
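
The depths above can also be read off the stored scores programmatically. The sketch below is one possible selection rule, not necessarily the one used here: it returns the smallest depth whose score is within a chosen tolerance of the best score, which favours shallower trees when the curve flattens out. The function name `pick_depth`, the `tolerance` parameter, and the toy scores are all hypothetical.

```python
def pick_depth(depths, scores, tolerance=0.0):
    """Return (depth, best_score) for the smallest depth whose score is
    within `tolerance` of the best score. Hypothetical helper."""
    best = max(scores)
    candidates = [d for d, s in zip(depths, scores) if s >= best - tolerance]
    return min(candidates), best

# Toy scores, not taken from the notebook's results
toy_depths = [1, 2, 3, 4, 5]
toy_scores = [0.40, 0.55, 0.58, 0.59, 0.585]

strict_depth, best_score = pick_depth(toy_depths, toy_scores)
relaxed_depth, _ = pick_depth(toy_depths, toy_scores, tolerance=0.015)
print(strict_depth, relaxed_depth)
```

In the notebook, the inputs would be `range(1, max_depth[i]+1)` and `cv_accuracy_score[i]` for each training set `i`.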

## Conclusion 
In the context of decision trees, the final models use max_depth 35 and 114 for the training sets train_1 and train_2, respectively.
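
A minimal sketch of fitting a final tree at a chosen depth and checking it on held-out data follows. It uses synthetic data from `make_classification` as a stand-in for the cuisine data, and the 80/20 split and fixed `random_state` are assumptions for illustration, not part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and cuisine labels
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=429)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=429)

# max_depth=35 is the value reported for train_1; the tree may stop earlier
final_clf = DecisionTreeClassifier(max_depth=35, random_state=429)
final_clf.fit(X_tr, y_tr)

test_accuracy = accuracy_score(y_te, final_clf.predict(X_te))
print(final_clf.get_depth(), test_accuracy)
```

`max_depth` is only an upper bound, so `get_depth()` reports how deep the fitted tree actually grew.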