### Cuisine Type Prediction
#### Built using Python version 3.6.5
This project is based on the Kaggle competition titled "What's Cooking?" (https://www.kaggle.com/c/whats-cooking/data). 
The dataset contains the full ingredients list of nearly 40,000 recipes. The purpose of the project is to predict which category of cuisine a given recipe belongs to, out of a list of 20 possible types (Korean, Chinese, Moroccan, Japanese, Filipino, Mexican, Southern US, Irish, Thai, Italian, Vietnamese, Jamaican, Indian, British, Russian, Cajun Creole, Greek, French, Brazilian, Spanish). 

#### Methods

This program uses Pandas to extract the information and subsequently builds an array of input features based on most commonly occurring ingredients within each cuisine category, and whether or not a given recipe contains each of those ingredients (a binary indicator). 

Random forests are the chosen algorithm for this prediction task, because of their generally strong performance in classification tasks, particularly for high-dimensional feature datasets. 

#### Results Summary

Highest Test Set Accuracy Obtained: <b>70.4%</b> <br>

Model Type: <b>Random Forest Classifier</b> <br>

Hyperparameters: <b>n_estimators (number of trees) = 400</b>

In [1]:
import pandas as pd
import numpy as np
from collections import Counter 

# Import downloaded data from JSON file

train = pd.read_json('train.json')

ingredients = train['ingredients'].reset_index(drop=True)
cuisine_type = list(set(train['cuisine']))
common_ing_dict = {'cuisine':[], 'topN': []}

# Extract top N most common words from each cuisine type to use as features
N = 200

for cuisine in cuisine_type:
        
    word_list = []    
    a = train[train['cuisine']==cuisine].reset_index()['ingredients']
 
    combined_str = ""

    for i in a:
        word_list = word_list + i

# The following lines of code are for alternate method used to examine list word by word:
#         combined_str = (combined_str + " " + word_list[len(word_list)-1]).lower()    
#     word_list = combined_str.split()
    
    word_count = Counter(word_list)
    word_freq_table = pd.DataFrame({'word':list(word_count.keys()), 'count':list(word_count.values())})
    word_freq_table = word_freq_table.sort_values(by=['count'], ascending=False)
    
    common_ing_dict['cuisine'].append(type)
    common_ing_dict['topN'].append(list(word_freq_table['word'][0:N]))
    

In [2]:
common_ing_df = pd.DataFrame({'cuisine':common_ing_dict['cuisine'],'topN':common_ing_dict['topN']})

# List of unique words from compiling top N most common ingredients from each cuisine
feature_words = list(set(sum(common_ing_df['topN'], [])))

#### Note About "Secondary" Features:

The following section of code was intended to improve the model's test results accuracy by building additional features. This time, the most common pairs of ingredients (i.e. ingredients which appear together in the same recipe with high frequency) were selected for each cuisine type, and the presence or absence of those pairs in a given recipe (binary indicator) became the basis for the "secondary" features.

In [3]:
# Take 10,000 unique pair samples from randomly chosen recipes in cuisine category
# Count occurrences of each pair in all recipes and sort by frequency

top_pair_features = []

for cuisine in cuisine_type:
    
    a = train[train['cuisine']=='greek']['ingredients']
    a = a.reset_index()['ingredients']

    combined_str = "" 
    str_list = []

    for i in a:
        for j in i:
            combined_str = combined_str + j + " "
        str_list.append(combined_str.split())  
        combined_str = ""
    
    length = len(a)-1
    
    # Number of total random pair samples from dataset
    N_samples = 10000
    
    # Number of pair selections per cuisine type
    N_pairs = 10

    import random

    sample_list = []

    sample_pairs = []

    for x in range(N_samples):
        sample_list.append(random.randint(0,length))

    for i in sample_list:
        if len(str_list[i]) < 2:
            continue
        newlist = random.sample(str_list[i],2)
        newlist.sort()
        if (newlist[0]==newlist[1]):
            continue
        if newlist in sample_pairs:
            continue
        else:
            sample_pairs.append(newlist)

    counter = 0
    pair_counts = []

    for i in sample_pairs:
        for j in a:
            if i[0] in j and i[1] in j:
                counter = counter + 1
        pair_counts.append(counter)
        counter = 0

    top_pairs = pd.DataFrame({'pairs':sample_pairs,'count':pair_counts}).sort_values(by=['count'], ascending=False)
    top_pairs = top_pairs.reset_index()['pairs'][0:N_pairs]

    for i in top_pairs:
        if i in top_pair_features:
            continue
        else:
            top_pair_features.append(i)

In [4]:
#Secondary features cutoff

N_sec = 200

recipe_data = {'cuisine':[], 'primary_features':[], 'secondary_features':[]}

for j in range(0,len(train['ingredients'])):
    test = []
    for i in feature_words:
        if i in train['ingredients'][j]:
            test.append(1)
        else:
            test.append(0)
    recipe_data['cuisine'].append(train['cuisine'][j])
    recipe_data['primary_features'].append(test)
    
for j in range(0,len(train['ingredients'])):
    test = []
    for i in top_pair_features[0:min(N_sec,len(top_pair_features))]:
        if i[0] in train['ingredients'][j] and i[1] in train['ingredients'][j]:
            test.append(1)
        else:
            test.append(0)
    recipe_data['secondary_features'].append(test)
    
data = (np.concatenate([np.array(recipe_data['primary_features']), np.array(recipe_data['secondary_features'])], 1))

target = []

labels = pd.Series(recipe_data['cuisine'])

for i in labels:
    target.append(cuisine_type.index(i))
    
target = np.array(target)
recipe = {'data':data, 'target':target, 'target_names':cuisine_type}

recipe['target'] = recipe['target'][0:39774]

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(recipe['data'], recipe['target'], random_state=42)

In [6]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=400, random_state=0)
forest.fit(X_train, y_train)

print("Training score: {:.3f}".format(forest.score(X_train, y_train)))
print("Test score: {:.3f}".format(forest.score(X_test, y_test)))

Training score: 0.996
Test score: 0.704


In [7]:
# Neural network classifier method:

# from sklearn.neural_network import MLPClassifier

# clf = MLPClassifier(solver='sgd', alpha=1, hidden_layer_sizes=[1000], max_iter=500)

# clf.fit(X_train, y_train)

# print("Training score: {:.3f}".format(clf.score(X_train, y_train)))
# print("Test score: {:.3f}".format(clf.score(X_test, y_test)))