### Part 2 Contents:

1) Creating a Classification model for cuisine category prediction. <br>
2) Performing pre-processing on the text, and choosing a relevant model for prediction. <br>
3) Noting future work that can be done for improving the model performance.

### Reading Train Dataset

In [1]:
# Importing necessary libraries

import pandas as pd
import json

In [2]:
# Reading file contents of train.json

with open("Recipe Ingredients/train.json") as train_file:
    train_json = json.load(train_file)
train_json[0]

{'id': 10259,
 'cuisine': 'greek',
 'ingredients': ['romaine lettuce',
  'black olives',
  'grape tomatoes',
  'garlic',
  'pepper',
  'purple onion',
  'seasoning',
  'garbanzo beans',
  'feta cheese crumbles']}

In [3]:
# Constructing train_df from the train.json contents

train_df = pd.DataFrame(train_json)
train_df.head()

| | id | cuisine | ingredients |
|---|---|---|---|
| 0 | 10259 | greek | [romaine lettuce, black olives, grape tomatoes... |
| 1 | 25693 | southern_us | [plain flour, ground pepper, salt, tomatoes, g... |
| 2 | 20130 | filipino | [eggs, pepper, salt, mayonaise, cooking oil, g... |
| 3 | 22213 | indian | [water, vegetable oil, wheat, salt] |
| 4 | 13162 | indian | [black pepper, shallots, cornflour, cayenne pe... |


### Splitting train_df into 80% Training and 20% Validation sets

In [4]:
# Splitting train_df into 80% Training and 20% Validation sets

from sklearn.model_selection import train_test_split

X = train_df.drop('cuisine', axis=1)
Y = train_df['cuisine']
train_X, val_X, train_Y, val_Y = train_test_split(X, Y, test_size=0.2, random_state=101)

In [5]:
# Getting overview of training and validation sets shapes

print("X.shape: ",X.shape)
print("Y.shape: ",Y.shape)
print("-----------------------------")
print("train_X.shape: ",train_X.shape)
print("val_X.shape: ",val_X.shape)
print("train_Y.shape: ",train_Y.shape)
print("val_Y.shape: ",val_Y.shape)

X.shape:  (39774, 2)
Y.shape:  (39774,)
-----------------------------
train_X.shape:  (31819, 2)
val_X.shape:  (7955, 2)
train_Y.shape:  (31819,)
val_Y.shape:  (7955,)
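
The split above samples rows at random, so rarer cuisines may end up slightly over- or under-represented in the validation set. As a hedged aside, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. The toy labels below are made up for illustration and are not the recipe data.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (assumption, not the recipe dataset): 8 samples of class "a", 2 of class "b"
toy_X = list(range(10))
toy_Y = ["a"] * 8 + ["b"] * 2

# stratify=toy_Y asks for the same class proportions in both halves
toy_train_X, toy_val_X, toy_train_Y, toy_val_Y = train_test_split(
    toy_X, toy_Y, test_size=0.5, stratify=toy_Y, random_state=101
)

print(Counter(toy_train_Y))
print(Counter(toy_val_Y))
```

In the notebook this would correspond to passing `stratify=Y` in the call above, which would change the sampled rows and therefore the scores reported later.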


## Create a classification model to classify the cuisine
- What are some models that you can think of?
- What are the tradeoffs?

What are some models that you can think of?

We are dealing with text here: the list of ingredients (X variable) is the input, and we need to predict the cuisine category. So we will pre-process the ingredients list and perform Count Vectorization on it. We will also label-encode the cuisine category (Y variable). <br>
Next, we will use the <i>Multinomial Naive Bayes Algorithm</i> to classify the cuisine. We will train our model and get an accuracy score using the Validation set. <br>
Lastly, we will use our model to predict the cuisine categories for the test data.

Another alternative is to use an LSTM (Long Short-Term Memory) model for prediction.

What are the tradeoffs?

The following are some of the trade-offs:

1) If a test recipe contains a new ingredient, our model will not recognise it and will treat it as an OOV (Out of Vocabulary) word. Since we are using Count Vectorizer, whose default behaviour is to ignore OOV words, that ingredient is disregarded by the model even if it is important. This may affect the prediction accuracy.

2) The more vocabulary we introduce to the model, the longer the training process takes, and the larger the model becomes.
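
The first trade-off can be seen on a tiny example. This is a minimal sketch with made-up recipes (not the dataset), using the same `token_pattern` as the notebook; the ingredient `saffron` is assumed to be missing from the training text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training text: each string stands for one recipe
toy_train = ["garlic pepper", "feta-cheese garlic"]

toy_cv = CountVectorizer(token_pattern='[a-zA-Z-]+')
toy_cv.fit(toy_train)
toy_vocab = sorted(toy_cv.vocabulary_)   # matrix columns follow sorted token order

# "saffron" was never seen during fit, so transform() silently drops it
toy_row = toy_cv.transform(["garlic saffron"]).toarray()[0]

print(toy_vocab)
print(toy_row)
```

Only the `garlic` column is non-zero in the transformed row, so the model receives no signal at all from the unseen ingredient.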

#### Multinomial Naive Bayes Algorithm:

This supervised Bayesian approach is popular in Natural Language Processing. It calculates the likelihood of each category for a given sample and outputs the category with the highest probability. One assumption of Naive Bayes is that each feature is treated as independent of the other features, given the category. The term multinomial refers to how the features are modelled: each category is assumed to generate word counts from a multinomial distribution, which suits count-based text features. It does not refer to the number of outcomes; Naive Bayes handles both binary and multi-class problems.<br>

In our case, each ingredient list corresponds to a sample, and the 20 cuisines correspond to categories. <br><br>
<img align="left" src="Algorithms/naive_bayes.png" height="400" width="400">
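
To make the calculation concrete, here is a minimal from-scratch sketch of Multinomial Naive Bayes with add-one (Laplace) smoothing. It is not the scikit-learn implementation used later; the function names `train_mnb` and `predict_mnb` and the four toy recipes are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Denominator: all token occurrences in class c, plus smoothing mass
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_likelihood

def predict_mnb(model, doc):
    """Return the class with the highest log posterior; OOV tokens are skipped."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in doc if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy recipes (made up): each recipe is a list of ingredient tokens
toy_docs = [
    ["feta", "olives", "garlic"],
    ["soy-sauce", "ginger", "garlic"],
    ["feta", "oregano"],
    ["soy-sauce", "rice"],
]
toy_labels = ["greek", "chinese", "greek", "chinese"]

toy_model = train_mnb(toy_docs, toy_labels)
print(predict_mnb(toy_model, ["feta", "garlic"]))
print(predict_mnb(toy_model, ["soy-sauce", "saffron"]))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps an ingredient that never occurred in a class from forcing that class's probability to zero.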

### Assigning Training Set to Feature X and Target Y

In [6]:
# Creating our feature data of ingredients: X and target data of cuisine: Y from training dataset

X = train_X['ingredients']
Y = train_Y

combined_X = pd.concat([X, val_X['ingredients']])

Why are we creating combined_X, and what problem may it mitigate?

We are including the validation data along with the training dataset in order to increase the vocabulary of the model. This helps in handling OOV ingredients to some extent. However, we won't be using the validation data for model training, so ingredients that appear only in the validation set get a vocabulary entry but no training counts.

### Analyzing Text Data from Feature X

In [7]:
# Analyzing the recipes having special characters or unwanted texts that need to be cleaned

recipes_to_be_cleaned = []
for recipe in combined_X:
    for ingredient in recipe:
        if "oz." in ingredient or "%" in ingredient:
            recipes_to_be_cleaned.append(recipe)
            break
    if len(recipes_to_be_cleaned)>5:
        break

In [8]:
# Showing a few examples from the data that need text pre-processing

print(recipes_to_be_cleaned[0], end="\n\n")
print(recipes_to_be_cleaned[5])

['water', 'mint leaves', 'corn starch', 'granulated sugar', '1% low-fat milk', 'sweetened condensed milk', 'large egg whites', 'vanilla extract', 'nonfat evaporated milk', 'brown sugar', 'large eggs', 'cream cheese']

['celery ribs', 'boneless skinless chicken breasts', 'cayenne pepper', 'medium shrimp', 'olive oil', 'diced tomatoes', 'flat leaf parsley', 'green bell pepper', 'brown rice', 'ham', 'onions', '(    oz.) tomato sauce', 'garlic', 'bay leaf']


### Pre-processing Text Data from Feature X

In [9]:
# Method for pre-processing text data from the ingredients list
# This pre-processing includes removing digits, parentheses, oz and % representations
# It also joins the words of a multi-word ingredient into one token, e.g. diced tomatoes -> diced-tomatoes

import re

def getCleanedRecipe(recipe):
    """
    This method takes a recipe's ingredients list as its input, performs pre-processing and returns the final ingredients list.
    Note: the list is modified in place, so the caller's original list is changed as well.
    """ 
    
    for i in range(len(recipe)):
        recipe[i] = re.sub(r'\d+','',recipe[i])      # remove digits
        recipe[i] = re.sub(r'[%.]','',recipe[i])     # remove % and . characters
        recipe[i] = recipe[i].replace('oz)','')      # remove the leftover "oz)" unit
        recipe[i] = recipe[i].replace('(','')        # remove the opening parenthesis
        recipe[i] = recipe[i].lstrip()               # drop leading whitespace
        
        # Join multi-word ingredients with hyphens so each ingredient becomes one token
        listI = recipe[i].split(" ")
        if len(listI) > 1:
            ingredient = "-".join(listI)
            recipe[i] = ingredient
            
    return recipe

In [10]:
# Demonstration of method: getCleanedRecipe
getCleanedRecipe(['(    oz.) tomato sauce','(10 oz.) frozen chopped spinach','1% low-fat milk','(14.5 oz.) diced tomatoes'])

['tomato-sauce', 'frozen-chopped-spinach', 'low-fat-milk', 'diced-tomatoes']
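
Because `getCleanedRecipe` writes back into the list it receives, the lists stored in the DataFrame are changed as a side effect. A possible side-effect-free variant is sketched below; the name `clean_recipe` is hypothetical, and the use of `split()` (which collapses repeated whitespace) is a small deviation from the original `lstrip()` plus `split(" ")`.

```python
import re

def clean_recipe(recipe):
    """Return a cleaned copy of the ingredients list without modifying the input."""
    cleaned = []
    for ingredient in recipe:
        text = re.sub(r'\d+', '', ingredient)              # remove digits
        text = re.sub(r'[%.]', '', text)                   # remove % and . characters
        text = text.replace('oz)', '').replace('(', '')    # remove unit and parenthesis
        cleaned.append("-".join(text.split()))             # join words with hyphens
    return cleaned

raw = ['(    oz.) tomato sauce', '(10 oz.) frozen chopped spinach',
       '1% low-fat milk', '(14.5 oz.) diced tomatoes']
print(clean_recipe(raw))
print(raw[0])   # the input list is left untouched
```

For the four demonstration strings this gives the same tokens as the original method, while `raw` keeps its original contents.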

In [11]:
# Pre-processing all the ingredients list for feature: X

def getCleanedData(X):
    """
    This method takes the feature X, performs pre-processing on the ingredients list of each recipe.
    It later joins all the ingredients to make a complete sentence and stores it in a list: transformed_X.
    The transformed_X along with the index from feature X (index= recipe id) is returned as output.
    """
    
    transformed_X = []
    for recipe in X:
        recipe = getCleanedRecipe(recipe)
        final_text = " ".join(recipe)
        transformed_X.append(final_text)
        
    transformed_X = pd.Series(transformed_X, index = X.index)
    return transformed_X

In [12]:
# Getting cleaned text strings: transformed_X, and printing one example to illustrate the process
# Note: getCleanedRecipe modifies the lists in place, so combined_X[0] below already shows the cleaned ingredients

transformed_X = getCleanedData(combined_X)
print("Text: ",combined_X[0])
print("Cleaned Text: ",transformed_X[0])

Text:  ['romaine-lettuce', 'black-olives', 'grape-tomatoes', 'garlic', 'pepper', 'purple-onion', 'seasoning', 'garbanzo-beans', 'feta-cheese-crumbles']
Cleaned Text:  romaine-lettuce black-olives grape-tomatoes garlic pepper purple-onion seasoning garbanzo-beans feta-cheese-crumbles


### Vectorization of Feature Data

In [13]:
# Vectorization of transformed_X
# Count Vectorizer tokenizes each record and builds a sparse document-term matrix over all the words in the data.
# Each record is encoded as a row of counts: how many times each vocabulary word occurs in that record.
# A vocabulary is also constructed, which maps each unique word found in the data to its column index in the matrix.

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(token_pattern='[a-zA-Z-]+')
cv.fit_transform(transformed_X)

<39774x6788 sparse matrix of type '<class 'numpy.int64'>'
	with 430393 stored elements in Compressed Sparse Row format>

In [14]:
# Getting the number of unique words in the data: transformed_X
print("Vocabulary Length:",len(cv.vocabulary_))

Vocabulary Length: 6788
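
To make the roles of the matrix and the vocabulary concrete, here is a small sketch on two made-up strings. It assumes the same `token_pattern` as above: the matrix holds occurrence counts, while `vocabulary_` maps each token to a column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

demo_cv = CountVectorizer(token_pattern='[a-zA-Z-]+')
demo_matrix = demo_cv.fit_transform(["garlic olive-oil garlic", "olive-oil"])

print(demo_cv.vocabulary_)     # token -> column index
print(demo_matrix.toarray())   # one row per recipe, one count per token
```

The repeated `garlic` in the first string is stored as a count of 2, not as a presence flag. Passing `binary=True` to `CountVectorizer` would give 0/1 indicators instead.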


In [15]:
# We have created the vocabulary using training + validation data
# Now we are building actual feature data: feature_X of training set, which will be fed to the model for training

feature_X = getCleanedData(X)
vectorized_feature_X = cv.transform(feature_X).toarray()

print("vectorized_feature_X.shape: ", vectorized_feature_X.shape)

vectorized_feature_X.shape:  (31819, 6788)


### Label Encoding of Target Variable Y

In [16]:
# Label Encoding the target variable Y which includes the cuisine names

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le_Y = le.fit_transform(Y)

In [17]:
# Viewing the first 5 label-encoded values

print(Y[:5])
print(le_Y[:5])

15161     french
37977      irish
4727     mexican
11068    mexican
30324     indian
Name: cuisine, dtype: object
[ 5  8 13 13  7]


In [18]:
# Viewing the classes of the Label Encoder: these are our cuisine categories
le.classes_

array(['brazilian', 'british', 'cajun_creole', 'chinese', 'filipino',
       'french', 'greek', 'indian', 'irish', 'italian', 'jamaican',
       'japanese', 'korean', 'mexican', 'moroccan', 'russian',
       'southern_us', 'spanish', 'thai', 'vietnamese'], dtype=object)
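
The encoding shown above amounts to numbering the sorted class names. The following plain-Python sketch reproduces the five encoded values printed earlier; the dictionary names are assumptions for illustration and are not part of `LabelEncoder`.

```python
cuisines = ['brazilian', 'british', 'cajun_creole', 'chinese', 'filipino',
            'french', 'greek', 'indian', 'irish', 'italian', 'jamaican',
            'japanese', 'korean', 'mexican', 'moroccan', 'russian',
            'southern_us', 'spanish', 'thai', 'vietnamese']

# Number the class names in sorted order, and build the reverse lookup
class_to_id = {name: i for i, name in enumerate(sorted(cuisines))}
id_to_class = {i: name for name, i in class_to_id.items()}

sample = ['french', 'irish', 'mexican', 'mexican', 'indian']
encoded = [class_to_id[c] for c in sample]
decoded = [id_to_class[i] for i in encoded]

print(encoded)
print(decoded)
```

The reverse lookup plays the role of `le.inverse_transform`, which is used later to turn predicted integers back into cuisine names.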

### Model Building using Multinomial Naive Bayes Algorithm

In [19]:
# Applying Multinomial Naive Bayes Algorithm and training the model

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()

# Training the model
mnb.fit(vectorized_feature_X, le_Y)

MultinomialNB()

### Model Prediction on Validation Data

In [20]:
# Preparing validation dataset feature to be passed to the model for prediction

val_X = val_X['ingredients']  # selecting the column directly; dropping 'id' first is unnecessary
val_X.head()

10621    [vegetable-oil-cooking-spray, prune-puree, bre...
37221    [soy-sauce, ginger, varnish-clams, sugar, rice...
16752    [chicken-broth, lemon, pepper, garlic-cloves, ...
11585    [unsalted-butter, all-purpose-flour, baking-po...
29307    [ricotta-cheese, linguine, large-garlic-cloves...
Name: ingredients, dtype: object

In [21]:
# Pre-processing, Count Vectorization and Prediction on Validation Data

val_cleaned = getCleanedData(val_X)
val_vectorized = cv.transform(val_cleaned).toarray()
val_Y_predicted = mnb.predict(val_vectorized)

### Model Evaluation

In [22]:
# Evaluating model on predictions of Validation set

from sklearn.metrics import accuracy_score

val_Y_predicted = le.inverse_transform(val_Y_predicted)
accuracy_score(val_Y, val_Y_predicted)

0.7253299811439347
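
Accuracy is the fraction of recipes whose predicted cuisine matches the true one, so a single number can hide cuisines that are predicted poorly. The sketch below, on made-up labels rather than the validation set, computes accuracy by hand together with a per-class recall; the function names are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Made-up labels for illustration
toy_true = ["greek", "greek", "indian", "indian", "indian", "thai"]
toy_pred = ["greek", "italian", "indian", "indian", "thai", "thai"]

print(accuracy(toy_true, toy_pred))
print(per_class_recall(toy_true, toy_pred))
```

scikit-learn offers `classification_report` and `confusion_matrix` in `sklearn.metrics` for the same kind of per-class breakdown on `val_Y` and `val_Y_predicted`.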

### Saving Recipe Classifier Model

In [23]:
# Saving the model to disk

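import pickle

# Note: the file is written in pickle format, despite the .h5 extension
filename = 'recipe-classifier-model-using-MultinomialNB.h5'
with open(filename, 'wb') as model_file:
    pickle.dump(mnb, model_file)
print("Model Saved Successfully")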

Model Saved Successfully
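
The saved `mnb` on its own cannot turn raw ingredients into cuisine names: the fitted `cv` and `le` are needed as well. One possible approach, which is an assumption and not what the notebook does, is to pickle all three objects together. The helper names and file name below are hypothetical, and plain stand-in values are used so that the sketch runs without the trained objects.

```python
import os
import pickle
import tempfile

def save_bundle(path, **objects):
    """Pickle several named objects into one file."""
    with open(path, 'wb') as f:
        pickle.dump(objects, f)

def load_bundle(path):
    """Load the dictionary of named objects back from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-ins for the fitted cv, le and mnb objects (assumption for illustration)
bundle_path = os.path.join(tempfile.mkdtemp(), 'recipe-classifier-bundle.pkl')
save_bundle(bundle_path,
            vectorizer={'garlic': 0},
            label_encoder=['greek', 'indian'],
            model='placeholder')

restored = load_bundle(bundle_path)
print(sorted(restored))
```

In the notebook the call would look like `save_bundle(path, vectorizer=cv, label_encoder=le, model=mnb)`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.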


### Model Prediction using Test Data

In [24]:
# Reading file contents of test.json

with open("Recipe Ingredients/test.json") as test_file:
    test_json = json.load(test_file)
test_json[0]

{'id': 18009,
 'ingredients': ['baking powder',
  'eggs',
  'all-purpose flour',
  'raisins',
  'milk',
  'white sugar']}

In [25]:
# Constructing test_df from the test.json contents

test_df = pd.DataFrame(test_json)
test_df.head()

| | id | ingredients |
|---|---|---|
| 0 | 18009 | [baking powder, eggs, all-purpose flour, raisi... |
| 1 | 28583 | [sugar, egg yolks, corn starch, cream of tarta... |
| 2 | 41580 | [sausage links, fennel bulb, fronds, olive oil... |
| 3 | 29752 | [meat cuts, file powder, smoked sausage, okra,... |
| 4 | 35687 | [ground black pepper, salt, sausage casings, l... |


In [26]:
# Getting feature test_X: ingredients to be passed for prediction

test_X = test_df['ingredients']
test_X.head()

0    [baking powder, eggs, all-purpose flour, raisi...
1    [sugar, egg yolks, corn starch, cream of tarta...
2    [sausage links, fennel bulb, fronds, olive oil...
3    [meat cuts, file powder, smoked sausage, okra,...
4    [ground black pepper, salt, sausage casings, l...
Name: ingredients, dtype: object

In [27]:
# Pre-processing, Count Vectorization and Prediction on Test Data

test_cleaned = getCleanedData(test_X)
test_vectorized = cv.transform(test_cleaned).toarray()
Y_predicted = mnb.predict(test_vectorized)
Y_predicted = le.inverse_transform(Y_predicted)

### Saving Predicted Output as JSON File

In [28]:
# Saving the cuisine predictions into dataframe

Y_predicted_df = pd.DataFrame(test_df['id'])
Y_predicted_df['cuisine'] = Y_predicted
Y_predicted_df.head()

| | id | cuisine |
|---|---|---|
| 0 | 18009 | southern_us |
| 1 | 28583 | southern_us |
| 2 | 41580 | italian |
| 3 | 29752 | cajun_creole |
| 4 | 35687 | italian |


In [29]:
# Saving the dataframe of cuisine predictions to disk in JSON Format

Y_predicted_df.to_json("Recipe Ingredients/test_output_using_MultinomialNB.json", orient='records', indent=3)
print("Output File Saved Successfully")

Output File Saved Successfully


### Future Work:

We will use an LSTM model for the model building and training phases, and then compare the results of the two models on the basis of their accuracy scores as well as their output predictions.

-----------------------------------------------------------------------------------------------