# On Linear Discrimiant Analysis, Quadractic Discriminant Analysis, and Naive Bayes Models

This document checks three models (abbreviated LDA, QDA, NB) on four datasets (called keywords, trimmed, removed, original) using k-fold cross-validation (5 splits).

Layout:
- Summary/which model is the best.
- Background and functions used in the check.
- Results for Linear Discriminant Analysis (and k-fold cross-validation).
- Results for Quadratic Discriminant Analysis (and k-fold cross-validation).
- Results for Naive Bayes (and k-fold cross-validation).
- Bagging results on the best of the three models.

Indicators used to evaluate results:
- The main indicator was the model's overall accuracy.
- Additionally, each cuisine type has its own accuracy, precision, true positive rate, and false positive rate; each of which was compared across the different cuisine types using minimum, maximum, and average.

Dataset Names (ordered from least features to most features):
- Keywords (this data set replaced the ingredients by a list of key words)
- Trimmed (this data set trimmed the original list of ingredients down to only those that appear at least 50 times)
- Removed (this data set had many brand names, adjectives and descriptors removed)
- Original (this is the uncleaned dataset)

# Best Model (Summary of results below)

Of the three models (LDA, QDA and NB), Linear Discriminant Analysis was the best. 

Best model in this document: Bagged Linear Discriminant Analysis.

Datasets (in order from best to worst performing) along with their bagged LDA accuracy results (averaged across the splits and rounded to two decimal places): 

1. Removed - 73.63% accuracy
2. Original - 72.82% accuracy
3. Trimmed - 71.36% accuracy
4. Keywords - 67.01% accuracy 
   
Note: non-bagged LDA accuracies were 73.49%, 73.26%, 70.47% and 66.5%, respectfully. Additionally, non-bagged had a significantly shorter run time.

# Background Functions and loading

-- Packages and Data --

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

In [2]:
original = pd.read_csv("original_data.csv")
trimmed = pd.read_csv("train_trimmed.csv")
keywords = pd.read_csv("key_words_data.csv")
removed = pd.read_csv("remove_adj_data.csv")

In [3]:
from sklearn.model_selection import StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

kfold = StratifiedKFold(n_splits = 5,
                           shuffle = True,
                           random_state = 429)

In [4]:
LDA = LinearDiscriminantAnalysis()
QDA = QuadraticDiscriminantAnalysis()
NB = GaussianNB() # Naive Bayes

-- Functions --

In [5]:
## This is the confusion matrix for the dataframe
## This will be used in all functions in this document

## First input is the correct answers
    ## Specifically, our data is a subset of rows of a dataframe, one of whose columns is 'cuisine'.
    ## For example, if we are using the entire original data set (from github) to fit the model and run predictions,
        ## and name the dataframe "original", then we would use original['cuisine']
## Second input is the prediction

def our_matrix(data, prediction):
    return(confusion_matrix(data, prediction))

In [6]:
## percentage of times the correct answer was predicted

## First input: the correct answers
## Second input: the prediction

def total_accuracy(data, prediction):
    cm = our_matrix(data, prediction)
    denom = 0
    for i in range(20):
        denom = denom + cm[i][i]
    return(denom/len(data))

In [7]:
## For each cuisine type, we can find its true positive, false positive, false negative, and false positive values using these functions

## First input: correct answers
## Second input: cuisine type of interest
    ## Warning: name must be entered with quotes, e.g. 'italian'
## Third input: the model's predictions

## True Positive for "cuisine_type" based on the "prediction"
def tp_for_cuisine_type(data, cuisine_type, prediction):
    cm = our_matrix(data, prediction)
    i = data.value_counts().index.get_loc(cuisine_type)
    return(cm[i][i])

## False Positive for "cuisine_type" based on the "prediction"
def fp_for_cuisine_type(data, cuisine_type, prediction):
    cm = our_matrix(data, prediction)
    n = data.value_counts().index.get_loc(cuisine_type)
    summation = 0
    for i in range(20):
        if i != n:
            summation = summation + cm[n][i]
    return(summation)

## False Negative for "cuisine_type" based on the "prediction"
def fn_for_cuisine_type(data, cuisine_type, prediction):
    cm = our_matrix(data, prediction)
    n = data.value_counts().index.get_loc(cuisine_type)
    summation = 0
    for i in range(20):
        if i != n:
            summation = summation + cm[i][n]
    return(summation)

## True Negative for "cuisine_type" based on the "prediction"
def tn_for_cuisine_type(data, cuisine_type, prediction):
    cm = our_matrix(data, prediction)
    n = data.value_counts().index.get_loc(cuisine_type)
    summation = 0
    for i in range(20):
        for j in range(20):
            if (i != n) and (j != n):
                summation = summation + cm[i][j]
    return(summation)

Reminder for myself: What do the functions mean? Note: "cuisine_type" is a fixed value.
- Accuracy - probability of correct prediction (this is in the form "cuisine_type" or "not cuisine_type")
- Precision - probability that recipe with predicted "cuisine_type" is "cuisine_type"
- True positive rate - probability that recipe with "cuisine_type" is predicted to be "cuisine_type"
- True negative rate - probability that recipe with "not cuisine_type" is predicted to be "not cuisine_type"

In [8]:
## Performance Indicators for each cuisine type

## First input: correct answers
## Second input: cuisine type of interest
    ## Warning: name must be entered with quotes, e.g. 'italian'
## Third input: the model's predictions

def accuracy_for_cuisine_type(data, cuisine_type, prediction):
    denom = tp_for_cuisine_type(data, cuisine_type, prediction) + tn_for_cuisine_type(data, cuisine_type, prediction)
    numer = denom + fp_for_cuisine_type(data, cuisine_type, prediction) + fn_for_cuisine_type(data, cuisine_type, prediction)
    return(denom/numer)

def precision_for_cuisine_type(data, cuisine_type, prediction):
    denom = tp_for_cuisine_type(data, cuisine_type, prediction)
    numer = denom + fp_for_cuisine_type(data, cuisine_type, prediction)
    return(denom/numer)

def true_positive_rate_for_cuisine_type(data, cuisine_type, prediction):
    denom = tp_for_cuisine_type(data, cuisine_type, prediction)
    numer = denom + fn_for_cuisine_type(data, cuisine_type, prediction)
    return(denom/numer)

def true_negative_rate_for_cuisine_type(data, cuisine_type, prediction):
    denom = tn_for_cuisine_type(data, cuisine_type, prediction)
    numer = denom + fp_for_cuisine_type(data, cuisine_type, prediction)
    return(denom/numer)

In [9]:
## Overall Performance Indicators

## First input: dataframe column with the correct information, i.e. the correct answers
## Second input: a "_for_cuisine_type" indicator
## Third input: the predications from the model

def indicator_average(data, function, prediction):
    summation = 0
    for i in range(20):
        summation = summation + function(data, data.value_counts().index[i], prediction)
    return(summation/20)

## Fourth input: boolean
    ## True returns the minimum value, False returns the associated cuisine type
def min_of_indicator(data, function, prediction, boolean):
    values = []
    for i in range(20):
        values.append(function(data, data.value_counts().index[i], prediction))
    minimum = min(values)
    temp_location = values.index(minimum)
    location = data.value_counts().index[temp_location]
    if boolean == True:
        return(minimum)
    if boolean == False:
        return(location)

## Fourth input: boolean
    ## True returns the maximum value, False returns the associated cuisine type
def max_of_indicator(data, function, prediction, boolean):
    values = []
    for i in range(20):
        values.append(function(data, data.value_counts().index[i], prediction))
    maximum = max(values)
    temp_location = values.index(maximum)
    location = data.value_counts().index[temp_location]
    if boolean == True:
        return(maximum)
    if boolean == False:
        return(location)

In [10]:
## produces a dataframe containing some of the overall performance indicators for some of the indicators (accuracy, precision, true positive, true negative)

def reduced_overall_indicator_df(data, prediction):
    c1 = ['Average', 'Minimum Value', 'Maximum Value']
    c2 = [indicator_average(data, accuracy_for_cuisine_type, prediction), min_of_indicator(data, accuracy_for_cuisine_type, prediction, True), max_of_indicator(data, accuracy_for_cuisine_type, prediction, True)]
    c3 = [indicator_average(data, precision_for_cuisine_type, prediction), min_of_indicator(data, precision_for_cuisine_type, prediction, True), max_of_indicator(data, precision_for_cuisine_type, prediction, True)]
    c4 = [indicator_average(data, true_positive_rate_for_cuisine_type, prediction), min_of_indicator(data, true_positive_rate_for_cuisine_type, prediction, True), max_of_indicator(data, true_positive_rate_for_cuisine_type, prediction, True)]
    c5 = [indicator_average(data, true_negative_rate_for_cuisine_type, prediction), min_of_indicator(data, true_negative_rate_for_cuisine_type, prediction, True), max_of_indicator(data, true_negative_rate_for_cuisine_type, prediction, True)]
    data = {'Accuracy':c2, 'Precision':c3, 'True Positive Rate':c4, 'True Negative Rate':c5}
    df = pd.DataFrame(data, index = c1)
    return(df)

In [11]:
## an error function for a prediction

## First input: correct answers
## second input: model's prediction

def error_function(data, prediction):
    print("The total accuracy of the prediction is", total_accuracy(data, prediction))
    print()
    print("Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.")
    print("If we compare those four indicators across all of the cuisines we get the following table:")
    print()
    print(reduced_overall_indicator_df(data, prediction))

In [12]:
def cross_validation_with_model(dataframe, model):
    i = 0
    X = dataframe[dataframe.columns[2:]]
    y = dataframe['cuisine']
    for train_index, test_index in kfold.split(X, y):
        # getting the kfold training data and holdout data
        X_train = X.iloc[train_index]
        X_holdout = X.iloc[test_index]
        y_train = y.iloc[train_index]
        y_holdout = y.iloc[test_index]
        # fiting and predicting
        model.fit(X_train, y_train)
        pred = model.predict(X_holdout)
        # error
        print("----------------------------------------------------------")
        print()
        print("Split", i)
        error_function(y_holdout, pred)
        print()
        i = i + 1

In [13]:
def cross_validation_every_dataset(model):
    print("__________________________________________________________")
    print("Key Words Data Set")
    print("__________________________________________________________")
    cross_validation_with_model(keywords, model)
    print("__________________________________________________________")
    print("Trimmed Data Set")
    print("__________________________________________________________")
    cross_validation_with_model(trimmed, model)
    print("__________________________________________________________")
    print("Removed Adjectives Data Set")
    print("__________________________________________________________")
    cross_validation_with_model(removed, model)
    print("__________________________________________________________")
    print("Original Data Set")
    print("__________________________________________________________")
    cross_validation_with_model(original, model)

# Linear Discriminant Analysis Results

In [14]:
cross_validation_every_dataset(LDA)

__________________________________________________________
Key Words Data Set
__________________________________________________________
----------------------------------------------------------

Split 0
The total accuracy of the prediction is 0.6634820867379007

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.966348   0.544056            0.613147            0.981618
Minimum Value  0.896417   0.268817            0.308725            0.947690
Maximum Value  0.990949   0.811215            0.886695            0.994377

----------------------------------------------------------

Split 1
The total accuracy of the prediction is 0.6599622878692646

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those fo

# Quadratic Discriminant Analysis Results

In [16]:
cross_validation_every_dataset(QDA)

__________________________________________________________
Key Words Data Set
__________________________________________________________




----------------------------------------------------------

Split 0
The total accuracy of the prediction is 0.135512256442489

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.913551   0.229818            0.300688            0.955980
Minimum Value  0.677561   0.000000            0.000000            0.811292
Maximum Value  0.990446   0.854305            0.941176            0.995836





----------------------------------------------------------

Split 1
The total accuracy of the prediction is 0.13928346951602766

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.913928   0.227003            0.323797            0.956145
Minimum Value  0.601257   0.000000            0.000000            0.812007
Maximum Value  0.989566   0.874172            0.956989            0.995931





----------------------------------------------------------

Split 2
The total accuracy of the prediction is 0.13714644877435575

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.913715   0.226459            0.327467            0.956023
Minimum Value  0.743432   0.000000            0.000000            0.810595
Maximum Value  0.990195   0.879518            0.928571            0.996900





----------------------------------------------------------

Split 3
The total accuracy of the prediction is 0.1431803896920176

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.914318   0.232170            0.318375            0.956309
Minimum Value  0.688749   0.000000            0.000000            0.811007
Maximum Value  0.989566   0.879518            0.968254            0.997040





----------------------------------------------------------

Split 4
The total accuracy of the prediction is 0.13930098063867236

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.913930   0.218240            0.335281            0.956087
Minimum Value  0.773699   0.000000            0.000000            0.813719
Maximum Value  0.989691   0.849398            0.963964            0.995860

__________________________________________________________
Trimmed Data Set
__________________________________________________________




----------------------------------------------------------

Split 0
The total accuracy of the prediction is 0.34582023884349467

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.934582   0.266889            0.310667            0.965567
Minimum Value  0.834821   0.000000            0.000000            0.834322
Maximum Value  0.988184   0.801303            0.936228            0.991243





----------------------------------------------------------

Split 1
The total accuracy of the prediction is 0.35788812067881837

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.935789   0.286001            0.340223            0.966319
Minimum Value  0.830295   0.000000            0.000000            0.829540
Maximum Value  0.988058   0.866883            0.929348            0.994021





----------------------------------------------------------

Split 2
The total accuracy of the prediction is 0.361659333752357

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.936166   0.287733            0.331867            0.966500
Minimum Value  0.829667   0.000000            0.000000            0.829342
Maximum Value  0.988058   0.837662            0.922368            0.992800





----------------------------------------------------------

Split 3
The total accuracy of the prediction is 0.35210559396605906

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:



  return(denom/numer)
  return(denom/numer)
  return(denom/numer)
  return(denom/numer)
  return(denom/numer)
  return(denom/numer)


               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.935211   0.278704                 NaN            0.965938
Minimum Value  0.828536   0.000000                 NaN            0.829636
Maximum Value  0.988309   0.808442                 NaN            0.991454





----------------------------------------------------------

Split 4
The total accuracy of the prediction is 0.34837817450339453

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:



  return(denom/numer)
  return(denom/numer)
  return(denom/numer)


               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.934838   0.272563                 NaN            0.965818
Minimum Value  0.829645   0.000000            0.000000            0.829039
Maximum Value  0.988182   0.837662            0.924147            0.992605

__________________________________________________________
Removed Adjectives Data Set
__________________________________________________________




----------------------------------------------------------

Split 0
The total accuracy of the prediction is 0.4179761156505343

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.941798   0.209598            0.404430            0.968587
Minimum Value  0.694532   0.007519            0.015419            0.904622
Maximum Value  0.989189   0.887755            0.888889            0.989303





----------------------------------------------------------

Split 1
The total accuracy of the prediction is 0.42928975487115023

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.942929   0.224928            0.398716            0.969064
Minimum Value  0.697297   0.009524            0.020115            0.911367
Maximum Value  0.988812   0.870453            0.844444            0.989544





----------------------------------------------------------

Split 2
The total accuracy of the prediction is 0.42438717787554997

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.942439   0.215860            0.366294            0.968902
Minimum Value  0.693275   0.007463            0.018809            0.907620
Maximum Value  0.988812   0.886407            0.761905            0.989421





----------------------------------------------------------

Split 3
The total accuracy of the prediction is 0.4116907605279698

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:



  return(denom/numer)
  return(denom/numer)
  return(denom/numer)


               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.941169   0.200993                 NaN            0.968142
Minimum Value  0.693526   0.000000            0.000000            0.905860
Maximum Value  0.989189   0.869260            0.818182            0.989426





----------------------------------------------------------

Split 4
The total accuracy of the prediction is 0.41639426703545385

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.941639   0.207204            0.354270            0.968376
Minimum Value  0.686573   0.006211            0.008869            0.907882
Maximum Value  0.987931   0.871811            0.740741            0.989535

__________________________________________________________
Original Data Set
__________________________________________________________




----------------------------------------------------------

Split 0
The total accuracy of the prediction is 0.42149591451917034

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.942150   0.216005            0.370211            0.968716
Minimum Value  0.716656   0.000000            0.000000            0.909723
Maximum Value  0.989315   0.868622            0.833333            0.989551





----------------------------------------------------------

Split 1
The total accuracy of the prediction is 0.42891263356379633

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.942891   0.223798            0.367304            0.968977
Minimum Value  0.713891   0.000000            0.000000            0.915641
Maximum Value  0.988686   0.852585            0.733333            0.989050





----------------------------------------------------------

Split 2
The total accuracy of the prediction is 0.4222501571338781

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.942225   0.219112            0.381021            0.968641
Minimum Value  0.711879   0.019048            0.015291            0.910179
Maximum Value  0.989441   0.858328            0.916667            0.989551





----------------------------------------------------------

Split 3
The total accuracy of the prediction is 0.40930232558139534

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.940930   0.193133            0.464573            0.967987
Minimum Value  0.716279   0.000000            0.000000            0.909737
Maximum Value  0.988938   0.852041            1.000000            0.989177





----------------------------------------------------------

Split 4
The total accuracy of the prediction is 0.4182801106361579

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.941828   0.210055            0.396744            0.968408
Minimum Value  0.703294   0.007519            0.006198            0.911485
Maximum Value  0.989062   0.850765            0.800000            0.989300



# Naive Bayes Results

In [17]:
cross_validation_every_dataset(NB)

__________________________________________________________
Key Words Data Set
__________________________________________________________
----------------------------------------------------------

Split 0
The total accuracy of the prediction is 0.13199245757385292

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.913199   0.256558            0.307797            0.955871
Minimum Value  0.664865   0.005607            0.031467            0.812174
Maximum Value  0.968322   0.819048            0.961905            0.996362

----------------------------------------------------------

Split 1
The total accuracy of the prediction is 0.13060967944688875

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those 

# Bagging on Best Model

The best model was Linear Discriminant Analysis (LDA) so, this is the model we will be bagging.

In [19]:
from sklearn.ensemble import BaggingClassifier

In [20]:
bag = BaggingClassifier(base_estimator = LDA,
                           bootstrap = True,
                           n_estimators = 100, # maybe change this number
                           max_samples = 3000)

Bagging results on all datasets

In [21]:
cross_validation_every_dataset(bag)

__________________________________________________________
Key Words Data Set
__________________________________________________________
----------------------------------------------------------

Split 0
The total accuracy of the prediction is 0.6708988057825267

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those four indicators across all of the cuisines we get the following table:

               Accuracy  Precision  True Positive Rate  True Negative Rate
Average        0.967090   0.518440            0.639640            0.982094
Minimum Value  0.892772   0.180124            0.381579            0.955548
Maximum Value  0.991075   0.828283            0.892857            0.993516

----------------------------------------------------------

Split 1
The total accuracy of the prediction is 0.6612193588937775

Each cuisine type has its own accuracy, precision, true positive rate, and false positive rate.
If we compare those fo