# Cuisine Classification

In this project, I classify recipes into the cuisines they belong to on the basis of their ingredients. The dataset was provided by the Yummly API and is taken from the 'What's Cooking' Kaggle competition.
Kaggle supplies the data already split into train and test sets, but the test set has no labels. Like other researchers (see the end for citations), I therefore treat the provided train set as the full dataset and split it into my own train and test sets to measure how well the models perform.

Flow of this project: the dataset consists of text inputs (ingredient lists), each of which must be classified into one of several classes. I preprocess the text with Natural Language Processing techniques and then feed it to different classifiers.

In [184]:
# Importing the necessary packages
import numpy as np
import pandas as pd
import re # For pattern matching - Regular Expressions
import nltk # For TextProcessing and exploring string distances

from nltk.stem import WordNetLemmatizer # Used for Natural Language Processing

'''WordNetLemmatizer converts a word to its base form (its lemma). It is not a Parts of Speech tagger,
   although it accepts an optional POS argument and treats words as nouns by default.
   I learnt more about WordNetLemmatizer and other lemmatization techniques from the below mentioned sources:
   https://en.wikipedia.org/wiki/Lemmatisation
   https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
   https://www.geeksforgeeks.org/python-lemmatization-with-nltk/'''

from sklearn.feature_extraction.text import TfidfVectorizer

'''TF-IDF Vectorization converts textual data to numeric vectors that can be fed to a model.
   I learnt more about TFIDF vectorization from the below mentioned sources:
   https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.Xri8s2gzZPY
   https://monkeylearn.com/blog/beginners-guide-text-vectorization/
   https://monkeylearn.com/blog/what-is-tf-idf/
   https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html'''

from sklearn.model_selection import GridSearchCV # For hyperparameter tuning
from sklearn.model_selection import train_test_split # For splitting the data into train and test

''' the function train_test_split used to live in sklearn.cross_validation, but that module has been removed in favour of sklearn.model_selection'''

'''Importing all classification models we will be using'''

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import AdaBoostClassifier

'''Importing all metrics for comparing results'''
from sklearn.metrics import classification_report

In [66]:
# The WordNet data used by WordNetLemmatizer needs to be downloaded first
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to C:\Users\Pratiksha
[nltk_data]     Sharma\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [83]:
# Loading the dataset into a dataframe
# The dataset was in a json format
df=pd.read_json("whats-cooking/train.json")

In [84]:
# Knowing more about this dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39774 entries, 0 to 39773
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           39774 non-null  int64 
 1   cuisine      39774 non-null  object
 2   ingredients  39774 non-null  object
dtypes: int64(1), object(2)
memory usage: 932.3+ KB


The dataset has no null values, so no cleaning (other than text processing) is required.

In [85]:
# Viewing the first few rows to understand the format of the dataset
df.head()

      id      cuisine                                        ingredients
0  10259        greek  [romaine lettuce, black olives, grape tomatoes...
1  25693  southern_us  [plain flour, ground pepper, salt, tomatoes, g...
2  20130     filipino  [eggs, pepper, salt, mayonaise, cooking oil, g...
3  22213       indian                [water, vegetable oil, wheat, salt]
4  13162       indian  [black pepper, shallots, cornflour, cayenne pe...


In [93]:
# Checking which cuisines are present in the dataset and how many recipes each one has
df["cuisine"].value_counts()

italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: cuisine, dtype: int64

**There are a total of 20 cuisines or 20 classes that a recipe can be classified into.**
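The counts above are far from uniform, so it helps to know what accuracy a trivial classifier would reach before judging the models. The following is a minimal sketch that uses only the counts printed above; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({
    "italian": 7838, "mexican": 6438, "southern_us": 4320, "indian": 3003,
    "chinese": 2673, "french": 2646, "cajun_creole": 1546, "thai": 1539,
    "japanese": 1423, "greek": 1175, "spanish": 989, "korean": 830,
    "vietnamese": 825, "moroccan": 821, "british": 804, "filipino": 755,
    "irish": 667, "jamaican": 526, "russian": 489, "brazilian": 467,
})

total = counts.sum()                          # total number of recipes
majority_baseline = counts.max() / total      # accuracy of always predicting 'italian'
imbalance_ratio = counts.max() / counts.min() # largest class vs smallest class

print(f"total recipes: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

Always predicting 'italian' would be right a little under 20% of the time, which is the floor any model here should beat.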

## Next we'll move on to NLP.

We start by lemmatizing the words in our list of ingredients, i.e. bringing them into their base form. (This is distinct from 'Parts of Speech' tagging, which labels each word with its grammatical role.) Check above for links to detailed information.

In [91]:
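'''We use strip here to trim any leading or trailing white spaces from our inputs after lemmatization.
   ' '.join is used for joining all lemmatized ingredients into a single sentence or input string.
   Each ingredient is passed to the lemmatizer as one whole string. The column name still says
   'POStagging', but the step performed here is lemmatization.'''

lemmatizer = WordNetLemmatizer() # Created once and reused for every ingredient

df['ingredients_afterPOStagging']=[' '.join([lemmatizer.lemmatize(re.sub('[^a-zA-Z]',' ',element)
                                                        ) for element in ing]).strip() for ing in df['ingredients']]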

In [92]:
'''Seeing the results of our lemmatization'''

print(df['ingredients'][4])
print(df['ingredients_afterPOStagging'][4])

['black pepper', 'shallots', 'cornflour', 'cayenne pepper', 'onions', 'garlic paste', 'milk', 'butter', 'salt', 'lemon juice', 'water', 'chili powder', 'passata', 'oil', 'ground cumin', 'boneless chicken skinless thigh', 'garam masala', 'double cream', 'natural yogurt', 'bay leaf']
black pepper shallot cornflour cayenne pepper onion garlic paste milk butter salt lemon juice water chili powder passata oil ground cumin boneless chicken skinless thigh garam masala double cream natural yogurt bay leaf


We can see from the example above that the word 'shallots' was transformed to 'shallot', 'onions' to 'onion', and so on.
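The examples that changed are single-word ingredients. The vocabulary printed further down still contains both 'apple' and 'apples', which suggests that plurals inside multi-word ingredients are not reduced when the whole ingredient string is handed to the lemmatizer. Below is a minimal sketch of word-by-word normalization. `naive_singular` is a hypothetical stand-in so that the block runs without NLTK data; it is not the lemmatizer used in this notebook.

```python
import re

def naive_singular(word):
    """Hypothetical stand-in for a lemmatizer: strips one trailing 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize_ingredients(ingredients, lemmatize=naive_singular):
    """Clean each ingredient, split it into words, and lemmatize word by word."""
    words = []
    for ingredient in ingredients:
        cleaned = re.sub(r"[^a-zA-Z]", " ", ingredient)  # same cleaning as the notebook
        words.extend(lemmatize(word) for word in cleaned.split())
    return " ".join(words)

result = normalize_ingredients(["black olives", "red onions", "garlic cloves"])
print(result)
```

To try this with WordNet, one could pass `WordNetLemmatizer().lemmatize` as the `lemmatize` argument. Doing so changes the vocabulary, so the scores reported below would no longer apply as-is.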

In [96]:
# Now we'll separate our inputs from our class labels
x=df['ingredients_afterPOStagging']
y=df['cuisine']

In [97]:
# Splitting the dataset into train and test in the ratio 7:3
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=24,test_size=0.3)
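The split above is purely random. Since the classes are imbalanced, one option worth considering is a stratified split, which keeps the class proportions equal in train and test. This is a sketch on toy labels, not what the notebook does; the `stratify` argument is the only difference from the call above.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with a 70/30 imbalance (illustrative, not the recipe data)
labels = ["italian"] * 70 + ["russian"] * 30
texts = [f"recipe {i}" for i in range(len(labels))]

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=24, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```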

Now we need to transform our data into a numeric form that can be fed to a model. We use Term Frequency - Inverse Document Frequency (TF-IDF) to vectorize our text. We create a TF-IDF model that learns its vocabulary from our set of ingredient strings (the 'documents') and weights each word on the basis of the number of times it appears in a document and the number of documents it is present in: a word that occurs in few documents gets a higher weight than one that occurs almost everywhere. **By documents I mean the ingredient strings.**
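To make the weighting concrete, here is a small sketch on a toy corpus. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`; the toy documents are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "salt pepper olive oil",
    "salt soy sauce ginger",
    "salt butter flour",
]

tfidf = TfidfVectorizer()            # defaults: smooth_idf=True, norm='l2'
matrix = tfidf.fit_transform(docs)
vocab = tfidf.vocabulary_            # maps each word to its column index

n_docs = len(docs)

def manual_idf(doc_freq):
    """Smoothed IDF as computed by scikit-learn's default settings."""
    return np.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_salt = tfidf.idf_[vocab["salt"]]      # 'salt' occurs in all 3 documents
idf_ginger = tfidf.idf_[vocab["ginger"]]  # 'ginger' occurs in 1 document
row_norms = np.linalg.norm(matrix.toarray(), axis=1)

print("idf(salt)  :", idf_salt, "manual:", manual_idf(3))
print("idf(ginger):", idf_ginger, "manual:", manual_idf(1))
print("row norms  :", row_norms)
```

The word that appears in every document receives the lowest IDF, and each document vector is scaled to unit length.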

In [112]:
'''The parameter stop_words='english' uses a built-in list of common English words (such as 'and', 'the', 'of')
   that carry little information and can confuse our model if they are present. To avoid this, TfidfVectorizer
   does not add words like these to its vocabulary'''
'''The parameter max_df sets the highest document frequency a word may have: here, words that appear in more
   than 55% of all documents are ignored'''

vectorizer=TfidfVectorizer(stop_words='english',ngram_range = ( 1 , 1 ),analyzer="word", max_df = .55, token_pattern=r'\w+')
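The effect of `stop_words` and `max_df` is easy to see on a toy corpus. This sketch reuses the settings from the cell above on four made-up documents; `toy_vectorizer` is an illustrative name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "salt and pepper",
    "salt and butter",
    "salt and soy sauce",
    "flour and sugar",
]

toy_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.55, token_pattern=r"\w+")
toy_vectorizer.fit(docs)

# 'and' is removed as an English stop word.
# 'salt' appears in 3 of 4 documents (75% > 55%), so max_df removes it.
kept = sorted(toy_vectorizer.vocabulary_)
print(kept)
```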

In [121]:
'''The function fit_transform fits the TFIDF model on our training input (learning the vocabulary and the IDF
   weights) and then uses it to transform all training strings into a sparse matrix.'''
vectorized_x_train=vectorizer.fit_transform(x_train)

'''Since the vectorizer was already fitted on the training set in the last step, we do not fit it again on the
   test set; we just transform the test set into a sparse matrix that uses the same vocabulary '''
vectorized_x_test=vectorizer.transform(x_test)

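'''This is what our model's vocabulary looks like'''
# On scikit-learn 1.0 or newer, use vectorizer.get_feature_names_out() instead (get_feature_names was removed in 1.2)
print(vectorizer.get_feature_names())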

['abalone', 'absinthe', 'abura', 'acai', 'accent', 'accompaniment', 'achiote', 'acid', 'acini', 'ackee', 'acorn', 'acting', 'active', 'added', 'adobo', 'adzuki', 'agar', 'agave', 'age', 'aged', 'ahi', 'ai', 'aioli', 'ajinomoto', 'ajwain', 'aka', 'alaskan', 'albacore', 'alcohol', 'ale', 'aleppo', 'alexia', 'alfalfa', 'alfredo', 'allspice', 'almond', 'almondmilk', 'almonds', 'aloe', 'alum', 'amaranth', 'amarena', 'amaretti', 'amaretto', 'amba', 'amber', 'amberjack', 'amchur', 'american', 'aminos', 'ammonium', 'amontillado', 'ampalaya', 'anaheim', 'anasazi', 'ancho', 'anchovies', 'anchovy', 'andouille', 'anejo', 'angel', 'anglaise', 'angled', 'angostura', 'anise', 'anisette', 'anjou', 'annatto', 'ao', 'aonori', 'apple', 'apples', 'applesauce', 'applewood', 'apricot', 'arak', 'arame', 'arbol', 'arborio', 'arepa', 'argo', 'arhar', 'armagnac', 'arrabbiata', 'arrowroot', 'artichok', 'artichoke', 'artichokes', 'artisan', 'arugula', 'asada', 'asadero', 'asafetida', 'asafoetida', 'asiago', 'asia
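The difference between `fit_transform` on the training set and `transform` on the test set can be checked on a toy example. This is a sketch with made-up documents: words that never occurred in training have no column, so they are silently ignored at test time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["salt pepper butter", "soy sauce ginger", "butter flour sugar"]
test_docs = ["butter saffron", "ginger soy wasabi"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them, learns nothing new

# Words seen only at test time are not part of the vocabulary
unseen = [w for w in ("saffron", "wasabi") if w not in vec.vocabulary_]

print("train shape:", train_matrix.shape)
print("test shape: ", test_matrix.shape)
print("ignored at test time:", unseen)
```

Both matrices have the same number of columns, which is what allows a model trained on one to score the other.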

## **The next step is creation, training and testing of our models:**

**Multinomial Logistic Regression**

In [123]:
classify=GridSearchCV(LogisticRegression(),{'C':[1,10]}) # Creation of model
classify.get_params

<bound method BaseEstimator.get_params of GridSearchCV(cv=None, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None, param_grid={'C': [1, 10]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)>

In [124]:
# Training this model
classify=classify.fit(vectorized_x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [138]:
print('Multinomial Logistic Regression: ',classify.score(vectorized_x_test, y_test))

Multinomial Logistic Regression:  0.7836252409285176
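The warning printed during training means that the lbfgs solver stopped at the default `max_iter=100` (visible in the parameter dump above) before converging. A possible remedy, following the warning's own suggestion, is to raise `max_iter`. The sketch below uses synthetic data from `make_classification` rather than the recipe data, and the value 1000 is an assumption, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem standing in for the real data
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

model = LogisticRegression(max_iter=1000)  # default is 100
model.fit(X, y)

# n_iter_ reports how many iterations the solver actually used
iterations_used = int(model.n_iter_.max())
print("iterations used:", iterations_used)
```

In the notebook, the equivalent change would be `GridSearchCV(LogisticRegression(max_iter=1000), {'C':[1,10]})`; whether this changes the reported score has not been checked here.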


**Random Forest Classifier**

In [130]:
'''We will be creating 300 trees'''

classify2= RandomForestClassifier(n_estimators=300) # Creation of model
classify2=classify2.fit(vectorized_x_train,y_train) # Training this model

In [131]:
# Predictions of this model
predictions2=classify2.predict(vectorized_x_test)

In [135]:
'''To measure the performance of this classifier'''
print(classification_report(y_test,predictions2 )) # print renders the report as a readable table

              precision    recall  f1-score   support

   brazilian       0.81      0.43      0.56       156
     british       0.73      0.24      0.36       223
cajun_creole       0.82      0.66      0.73       467
     chinese       0.72      0.89      0.80       819
    filipino       0.84      0.52      0.64       231
      french       0.61      0.52      0.56       792
       greek       0.87      0.50      0.64       376
      indian       0.83      0.92      0.87       898
       irish       0.78      0.29      0.42       201
     italian       0.69      0.92      0.79      2312
    jamaican       0.97      0.53      0.69       143
    japanese       0.83      0.64      0.72       421
      korean       0.88      0.64      0.74       241
     mexican       0.84      0.93      0.88      1969
    moroccan       0.91      0.60      0.72       274
     russian       0.86      0.21      0.34       147
 southern_us       0.63      0.77      0.69      1246
     sp

In [139]:
print('Random Forest Classifier',classify2.score(vectorized_x_test, y_test))

Random Forest Classifier 0.7467527025894578
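The per-class report above shows low recall for small classes such as russian (0.21) and british (0.24), even though overall accuracy is about 0.75. Accuracy alone can hide this. The sketch below, on made-up labels, contrasts accuracy with macro-averaged F1, which weights every class equally.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy case: the classifier never predicts the minority class
y_true = ["italian"] * 8 + ["russian"] * 2
y_pred = ["italian"] * 10

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without a warning
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("accuracy:", accuracy)
print("macro F1:", macro_f1)
```

Accuracy looks respectable here while macro F1 is much lower, because the minority class is missed entirely.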


**AdaBoost Classifier**

In [140]:
'''We will be creating 300 stumps'''

classify3= AdaBoostClassifier(n_estimators=300) # Creation of model
classify3=classify3.fit(vectorized_x_train,y_train) # Training this model

In [141]:
# Predictions of this model
predictions3=classify3.predict(vectorized_x_test)

In [155]:
print('AdaBoost Classifier with 300 stumps',classify3.score(vectorized_x_test, y_test))

AdaBoost Classifier with 300 stumps 0.4418000502807341


With an accuracy of about 44%, AdaBoost performs far worse than the other models.

**SGD Classifier**

In [160]:
classify4=SGDClassifier() # Creation of model
classify4=classify4.fit(vectorized_x_train,y_train) # Training this model

In [170]:
print('SGD Classifier: ',classify4.score(vectorized_x_test, y_test))

SGD Classifier:  0.7734852928852761


**LinearSVC Classifier : Support Vector Machine**

In [172]:
classify5=GridSearchCV(LinearSVC(C=0.8,dual=False),{'C':[1,10]}) # Creation of model; the grid's C values override the initial C=0.8

In [173]:
classify5=classify5.fit(vectorized_x_train,y_train) # Training

In [175]:
print('LinearSVC Classifier: ',classify5.score(vectorized_x_test, y_test))

LinearSVC Classifier:  0.7860554764099555
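The score above comes from the best estimator that `GridSearchCV` refits after the search, but the cell does not show which `C` was chosen. The following sketch shows how the search results can be inspected; it uses synthetic data from `make_classification`, and `cv=5` is set explicitly as an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic 3-class problem standing in for the real data
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

search = GridSearchCV(LinearSVC(dual=False), {"C": [1, 10]}, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 3))

# Mean cross-validated accuracy for every candidate in the grid
for params, mean in zip(search.cv_results_["params"],
                        search.cv_results_["mean_test_score"]):
    print(params, round(mean, 3))
```

In the notebook, the same attributes are available on `classify` and `classify5` after fitting.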


We try creating another TF-IDF vectorizer by allowing max_df to be 0.7 to see whether it improves any of our models.

In [180]:
vectorizer2=TfidfVectorizer(stop_words='english',ngram_range = ( 1 , 1 ),analyzer="word", max_df = .70 , token_pattern=r'\w+')
secvectorized_x_train=vectorizer2.fit_transform(x_train) # Kept sparse, like the test matrix; todense() returns a memory-heavy np.matrix
secvectorized_x_test=vectorizer2.transform(x_test)

In [181]:
'''Trying with Logistic Regression'''

secclassify=GridSearchCV(LogisticRegression(),{'C':[1,10]})
secclassify=secclassify.fit(secvectorized_x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [183]:
print('New Multinomial Logistic Regression Classifier: ',secclassify.score(secvectorized_x_test, y_test))

New Multinomial Logistic Regression Classifier:  0.7856364702924663


This improvement is not significant, but further exploration with different parameters might yield a better result.
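One way to carry out that exploration is to put the vectorizer and the classifier into a single `Pipeline` and let `GridSearchCV` tune both at once, so the vectorizer is refitted inside every cross-validation fold. This is a sketch on a tiny made-up corpus; the grid values, `cv=2` and `max_iter=1000` are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus (not the recipe data)
italian = ["olive oil basil tomato salt", "pasta parmesan basil salt",
           "tomato mozzarella olive oil salt", "garlic pasta olive oil salt"]
japanese = ["soy sauce mirin rice salt", "rice nori wasabi soy sauce",
            "miso tofu soy sauce salt", "mirin sake rice soy sauce"]
texts = italian + japanese
labels = ["italian"] * 4 + ["japanese"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", token_pattern=r"\w+")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameter names are prefixed with the pipeline step they belong to
param_grid = {"tfidf__max_df": [0.55, 0.70, 1.0], "clf__C": [1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```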

**We can rank our models by test accuracy this way:**
    1. Linear SVC                         : 78.61%
    2. New Multinomial Logistic Regression: 78.56%
    3. Multinomial Logistic Regression    : 78.36%
    4. SGD Classifier                     : 77.35%
    5. Random Forest                      : 74.68%
    6. AdaBoost Classifier                : 44.18%
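The ranking can also be produced programmatically rather than by hand. This sketch copies the test accuracies from the cell outputs above into a pandas Series and sorts them; values are rounded to two decimals, so the last digit can differ from a truncated figure.

```python
import pandas as pd

# Test-set accuracies copied from the cell outputs above
scores = pd.Series({
    "Linear SVC": 0.7860554764099555,
    "New Multinomial Logistic Regression": 0.7856364702924663,
    "Multinomial Logistic Regression": 0.7836252409285176,
    "SGD Classifier": 0.7734852928852761,
    "Random Forest": 0.7467527025894578,
    "AdaBoost Classifier": 0.4418000502807341,
})

# Sort from best to worst and express as percentages
ranking = (scores.sort_values(ascending=False) * 100).round(2)
print(ranking.to_string())
```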

**References**

S. Kalajdziski, G. Radevski, I. Ivanoska, K. Trivodaliev and B. R. Stojkoska, "Cuisine classification using recipe's ingredients," 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, 2018, pp. 1074-1079, doi: 10.23919/MIPRO.2018.8400196.

**Further Readings**

S. Jayaraman, T. Choudhury and P. Kumar, "Analysis of classification models based on cuisine prediction using machine learning," 2017 International Conference On Smart Technologies For Smart Nation (SmartTechCon), Bangalore, 2017, pp. 1485-1490, doi: 10.1109/SmartTechCon.2017.8358611.