In this notebook, we explore a decision-tree-based model (a random forest) and tune its hyperparameters. We work with two training sets, each with a different subset of features.
- train_1 contains features found by scraping all food-related words from a website.
- train_2 contains features found by picking ingredients that occurred at least 50 times in the training data.
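
The frequency threshold behind train_2 can be sketched as follows. This is a minimal illustration, not the code that produced `train_trimmed.csv`: it assumes each recipe is a list of ingredient strings and that an ingredient is counted once per recipe. The helper names `frequent_ingredients` and `one_hot` are hypothetical, and the toy data uses a threshold of 2 rather than 50.

```python
from collections import Counter

def frequent_ingredients(recipes, min_count=50):
    """Return the ingredients that appear in at least `min_count` recipes."""
    # Count each ingredient once per recipe (assumption), so repeats
    # inside a single recipe do not inflate the count.
    counts = Counter(ing for recipe in recipes for ing in set(recipe))
    return sorted(ing for ing, n in counts.items() if n >= min_count)

def one_hot(recipes, vocabulary):
    """Encode each recipe as a 0/1 row, one column per vocabulary entry."""
    rows = []
    for recipe in recipes:
        present = set(recipe)
        rows.append([int(ing in present) for ing in vocabulary])
    return rows

# Toy data (hypothetical), thresholded at 2 occurrences instead of 50
toy = [["salt", "soy sauce"], ["salt", "basil"], ["salt", "soy sauce", "rice"]]
vocab = frequent_ingredients(toy, min_count=2)
features = one_hot(toy, vocab)
```

Rare ingredients such as "basil" and "rice" in the toy data are dropped, which keeps the feature matrix narrow at the cost of discarding some signal.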

In [8]:
## Import packages 
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from itertools import product
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score
from sklearn.tree import plot_tree

In [15]:
## Import data
#train_0 = pd.read_csv("../Data/original_data.csv") # This line runs very slowly on my computer and using train_0 always leads to python kernel crashing
train_1 = pd.read_csv("../Data/key_words_data.csv")
train_2 = pd.read_csv("../Data/train_trimmed.csv")

train_list = {2: train_2}  # Run one training set at a time in case the kernel crashes

## Tuning the depth and n_estimators of a random forest

In [20]:
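# Range of all hyperparameters
max_depth_list = range(200,201,30)  # currently yields only max_depth=200
n_estimators_list = range(60,80,5)  # 60, 65, 70, 75
# Materialize as a list: product() returns a one-shot iterator that would be
# empty for any training set after the first
hyperparameter_list = list(product(max_depth_list,n_estimators_list))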

# Cross-validated KPI metrics, keyed by training set and then by (max_depth, n_estimators)
cv_accuracy_score = {i:{} for i in train_list}
cv_recall_score = {i:{} for i in train_list}
cv_f1_score = {i:{} for i in train_list}

## Make stratified k fold splits of the data
skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=429)

## Use cross-validation to tune the hyperparameters
for i,train in train_list.items(): 
    X_train = train.drop(columns=["id","cuisine"])
    y_train = train["cuisine"]
    
    for hyperparameter in hyperparameter_list: 
        max_depth, n_estimators = hyperparameter
        clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
        
        ## To store KPI metrics for each of the 5 cross-validation folds
        accuracy_list = []
        f1_list = []
        recall_list = []
        
        for train_idx, test_idx in skf.split(X_train,y_train):
            X_train_train, y_train_train = X_train.iloc[train_idx], y_train.iloc[train_idx]
            X_holdout, y_holdout = X_train.iloc[test_idx], y_train.iloc[test_idx]
            clf.fit(X_train_train,y_train_train)
            prediction = clf.predict(X_holdout)
            accuracy_list.append(accuracy_score(y_holdout,prediction))
            f1_list.append(f1_score(y_holdout,prediction,average="weighted"))
            recall_list.append(recall_score(y_holdout,prediction,average="weighted"))
            
        cv_accuracy_score[i][hyperparameter] = np.mean(accuracy_list)
        cv_f1_score[i][hyperparameter] = np.mean(f1_list)
        cv_recall_score[i][hyperparameter] = np.mean(recall_list)

        print("Train set",i,". Done with parameters ", hyperparameter, 
              ". Accuracy = ", cv_accuracy_score[i][hyperparameter]) #Output to keep track of the process

Train set 2 . Done with parameters  (200, 60) . Accuracy =  0.7018655003542525
Train set 2 . Done with parameters  (200, 65) . Accuracy =  0.7033740962135042
Train set 2 . Done with parameters  (200, 70) . Accuracy =  0.7014128978900837
Train set 2 . Done with parameters  (200, 75) . Accuracy =  0.7028209091022595
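
The printed accuracies above can also be ranked programmatically instead of by eye. The sketch below copies the four values from the output into a plain dictionary shaped like `cv_accuracy_score[2]`; the variable names `accuracy` and `best_params` are introduced here for illustration only.

```python
# Mean cross-validated accuracies copied from the printed output,
# keyed by (max_depth, n_estimators)
accuracy = {
    (200, 60): 0.7018655003542525,
    (200, 65): 0.7033740962135042,
    (200, 70): 0.7014128978900837,
    (200, 75): 0.7028209091022595,
}

# Pick the hyperparameter tuple with the highest mean accuracy
best_params = max(accuracy, key=accuracy.get)
best_max_depth, best_n_estimators = best_params

# Spread between the best and worst setting in this run
spread = max(accuracy.values()) - min(accuracy.values())
print(best_params, round(accuracy[best_params], 4), round(spread, 4))
```

The spread across these four settings is only about 0.2 percentage points, so the ranking may be sensitive to the random seed of the forest and of the fold split.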


## Results
Because a full grid search is computationally intensive, we tuned the hyperparameters by fixing one and varying the other. Repeating this a few times gave the following near-optimal values:
- train_1: max_depth=50, n_estimators=75 gives an accuracy of 71%
- train_2: max_depth=200, n_estimators=65 gives an accuracy of 70.3%
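
The fix-one, vary-the-other procedure can be written as a small coordinate-wise search. This is a sketch under assumptions, not the exact procedure run for this notebook: `coordinate_search` and `toy_score` are hypothetical names, and the toy scoring function stands in for the mean cross-validated accuracy, which is expensive to compute.

```python
def coordinate_search(score_fn, grids, start, n_rounds=3):
    """Tune one hyperparameter at a time while holding the others fixed.

    score_fn: maps a dict of hyperparameters to a score (higher is better).
    grids:    dict mapping each hyperparameter name to its candidate values.
    start:    dict with the initial value of each hyperparameter.
    """
    best = dict(start)
    best_score = score_fn(best)
    for _ in range(n_rounds):
        improved = False
        for name, candidates in grids.items():
            for value in candidates:
                trial = {**best, name: value}  # change only one parameter
                score = score_fn(trial)
                if score > best_score:
                    best, best_score, improved = trial, score, True
        if not improved:
            break  # no single-parameter change helps any more
    return best, best_score

# Artificial score with its peak at max_depth=30, n_estimators=40
def toy_score(params):
    return -(params["max_depth"] - 30) ** 2 - (params["n_estimators"] - 40) ** 2

grids = {"max_depth": [10, 20, 30, 40], "n_estimators": [20, 30, 40, 50]}
best, best_score = coordinate_search(
    toy_score, grids, start={"max_depth": 10, "n_estimators": 20}
)
```

In the notebook's setting, `score_fn` would wrap the stratified 5-fold loop and return the mean accuracy. This kind of search evaluates far fewer combinations than a full grid, but it can stop at a local optimum, which is why the values above are described as near-optimal.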