Food is a basic necessity of all living creatures. Humans are the only species that has created its own cuisines from naturally available ingredients. The food industry is one of the most popular forms of business. There are thousands of cuisines across the world, each unique in its own way. Even though some cuisines share common ingredients, there is always something distinctive that makes the ultimate difference in the food.

This dataset contains the ingredients of each recipe and the label of its cuisine. Can we predict the cuisine by looking at the ingredients? This is a Natural Language Processing (NLP) text classification problem.

We start by loading the required libraries.

In [1]:
## Importing the packages/libraries.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import spacy

nlp = spacy.load('en_core_web_sm')
stopwords = spacy.lang.en.stop_words.STOP_WORDS

## Suppressing the warning
import warnings
warnings.filterwarnings('ignore')


### Loading, cleaning, and a sneak peek at the dataset

In [2]:
## Import the dataset
df = pd.read_csv("../data/cuisine.csv")

## Text Preprocessing and Cleaning
def cleaning(txt):
    tokens = [token.lemma_.lower().strip(" ") for token in nlp(txt) if token.text.isalpha()]
    return ' '.join(tokens)

df['cleanedingredients'] = df['ingredients'].apply(cleaning)

In [11]:
df.iloc[:,[2,1]].head()

,cleanedingredients,cuisine
0,romaine lettuce black olive grape tomato garli...,greek
1,plain flour ground pepper salt tomato grind bl...,southern_us
2,water vegetable oil wheat salt,indian
3,black pepper shallot cornflour cayenne pepper ...,indian
4,sugar pistachio nut white almond bark flour va...,italian
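To make the later choice between unigrams and bigrams concrete, here is a minimal sketch of what `CountVectorizer` extracts from one cleaned recipe. The recipe string is row 2 of the preview above; the variable names are illustrative and not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One cleaned recipe, copied from row 2 of the preview above.
recipe = ["water vegetable oil wheat salt"]

# ngram_range=(1, 1) keeps single words only;
# ngram_range=(1, 2) also adds pairs of adjacent words.
unigram_vect = CountVectorizer(ngram_range=(1, 1)).fit(recipe)
bigram_vect = CountVectorizer(ngram_range=(1, 2)).fit(recipe)

# vocabulary_ maps each extracted feature to its column index.
unigram_features = sorted(unigram_vect.vocabulary_)
bigram_features = sorted(bigram_vect.vocabulary_)

print(unigram_features)
print(bigram_features)
```

With bigrams enabled, multi-word ingredients such as "vegetable oil" become features of their own, but so do pairs that merely happen to be adjacent in the cleaned string, such as "oil wheat".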


In [13]:
df.iloc[:,[2,1]].tail()

,cleanedingredients,cuisine
32596,low fat sour cream grate parmesan cheese salt ...,italian
32597,shred cheddar cheese crush cheese cracker ched...,mexican
32598,kraft zesty italian dress purple onion broccol...,italian
32599,boneless chicken skinless thigh mince garlic s...,chinese
32600,green chile jalapeno chilie onion ground black...,mexican


In [15]:
df.iloc[:,[2,1]].groupby('cuisine').count()

cuisine,cleanedingredients
cajun_creole,1546
chinese,2673
french,2646
greek,1175
indian,3003
italian,7838
japanese,1423
mexican,6438
southern_us,4320
thai,1539
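The classes are noticeably imbalanced, so it helps to know what accuracy a trivial classifier would reach before judging any model. The sketch below uses only the counts printed above and computes the accuracy of always predicting the most frequent cuisine; the variable names are illustrative.

```python
# Class counts copied from the groupby output above.
counts = {
    "cajun_creole": 1546,
    "chinese": 2673,
    "french": 2646,
    "greek": 1175,
    "indian": 3003,
    "italian": 7838,
    "japanese": 1423,
    "mexican": 6438,
    "southern_us": 4320,
    "thai": 1539,
}

total = sum(counts.values())

# A classifier that always predicts the most frequent cuisine
# is correct exactly as often as that cuisine occurs.
majority_cuisine = max(counts, key=counts.get)
baseline_accuracy = counts[majority_cuisine] / total

print(f"{total} recipes, majority class: {majority_cuisine}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Always predicting the largest class would be right about 24% of the time, which is the floor any trained model should clearly beat.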


### Splitting the data, training, cross validation and predicting

In [3]:
## Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(df['cleanedingredients'], df['cuisine'], test_size=0.30, random_state=42)

## Building the model 
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression()),
])

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'clf__solver': ('saga', 'lbfgs'),
}

## Grid Search Cross Validation
grid_search = GridSearchCV(pipeline, 
                           parameters, 
                           cv=5,
                           n_jobs=-1, 
                           verbose=1)

print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:: ", parameters)
grid_search.fit(X_train, y_train)
print("Best score: %0.3f" % round(grid_search.best_score_ * 100,2))
print("Best best estimator :::: ", grid_search.best_estimator_)

print("Testing Accuracy::",round(accuracy_score(y_test, grid_search.predict(X_test)) * 100, 2))

Performing grid search...
pipeline: ['vect', 'clf']
parameters::  {'vect__max_df': (0.5, 0.75, 1.0), 'vect__ngram_range': ((1, 1), (1, 2)), 'clf__solver': ('saga', 'lbfgs')}
Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  6.3min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  8.0min finished


Best score: 84.500
Best best estimator ::::  Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip...penalty='l2', random_state=None, solver='saga',
          tol=0.0001, verbose=0, warm_start=False))])
Testing Accuracy:: 85.25


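Logistic regression performed best among the algorithms tried. The model may not give the best possible accuracy, but it performs well. Note that the "Best score" reported by the grid search (84.5%) is the mean 5-fold cross-validation accuracy, not the training accuracy. The test accuracy (85.25%) is slightly higher than the cross-validation score, which suggests the selected model generalizes to unseen recipes about as well as cross-validation predicted and is not badly overfitting.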
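`GridSearchCV.best_score_` is the mean cross-validated score of the best candidate, so judging overfitting properly needs three numbers: training accuracy, cross-validation accuracy, and test accuracy. The sketch below shows how to obtain all three. It runs on a small synthetic corpus built from hypothetical ingredient pools (made up for illustration, not taken from the dataset) and adds `stratify=` to the split, which the notebook does not use.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical ingredient pools, invented for this illustration only.
pools = {
    "italian": ["tomato", "basil", "parmesan", "olive", "garlic", "pasta"],
    "mexican": ["tortilla", "jalapeno", "cilantro", "lime", "bean", "garlic"],
    "indian": ["cumin", "turmeric", "ginger", "chili", "lentil", "garlic"],
}

# Build 40 synthetic "recipes" per cuisine, each with 4 ingredients.
rng = random.Random(0)
texts, labels = [], []
for cuisine, pool in pools.items():
    for _ in range(40):
        texts.append(" ".join(rng.sample(pool, 4)))
        labels.append(cuisine)

# stratify keeps the class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.30, random_state=42, stratify=labels
)

model = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Cross-validation accuracy is estimated on the training split only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

model.fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"Training accuracy:         {train_acc:.3f}")
print(f"Cross-validation accuracy: {cv_scores.mean():.3f}")
print(f"Test accuracy:             {test_acc:.3f}")
```

A training accuracy far above the cross-validation and test accuracies would point to overfitting; similar cross-validation and test accuracies indicate that the cross-validation estimate was reliable.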

There is scope for improvement to make it more accurate.
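One way to find where that improvement might come from is to look past overall accuracy at per-cuisine errors, using the `confusion_matrix` and `classification_report` functions that are imported at the top but not used. The labels and predictions below are hypothetical toy values chosen for illustration; in the notebook they would presumably be `y_test` and `grid_search.predict(X_test)`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions, for illustration only.
labels = ["indian", "italian", "mexican"]
y_true = ["italian", "italian", "italian", "mexican",
          "mexican", "indian", "indian", "indian"]
y_pred = ["italian", "italian", "mexican", "mexican",
          "mexican", "indian", "italian", "indian"]

# Rows are true cuisines, columns are predicted cuisines,
# both in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Per-class precision, recall and F1 show which cuisines are weakest.
print(classification_report(y_true, y_pred, labels=labels))
```

Off-diagonal cells identify which pairs of cuisines the model confuses, which is a natural starting point for deciding what features or preprocessing to try next.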