## Recipe Cuisine

You've just joined the data team at an online publishing company. One of your verticals is a food publication. A product manager on your team wants to build a feature for this vertical that enables users to query by cuisine, not just by ingredients. Most of your recipes are unlabeled, and it's infeasible to label them by hand. Luckily, you have a small training set of about 10,000 recipes with labeled cuisines.

Design and execute a method to predict the cuisine of a recipe given only its ingredients. 

* Data Due Diligence: All-Purpose Flour and Flour are likely the same ingredient, but red onions and yellow onions are incredibly different.
* For each major cuisine, what are the driving ingredients that characterize it? What are the features of a cuisine that drive misclassification in your method above?
* How could you design this to be robust enough to understand similarities and substitutions between ingredients?
* Your product manager indicates a likelihood that you will only need to write a guideline for an outsourced team to hand label the remaining corpus. How would you go about writing this guide for a few major cuisines?
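
For the due-diligence bullet, one possible starting point is to normalize ingredient strings before vectorizing them. The sketch below is a minimal illustration, not a finished cleaning step: `normalize_ingredient` and the `NEUTRAL_MODIFIERS` list are hypothetical names and choices, and the modifier list would need review against the real data.

```python
import re

# Hypothetical list of modifiers that rarely change what an ingredient is.
# Colors and varieties (red, yellow, green) are deliberately NOT listed,
# so "red onions" and "yellow onions" stay distinct.
NEUTRAL_MODIFIERS = {"all-purpose", "plain", "large", "fresh", "chopped"}


def normalize_ingredient(name):
    """Lowercase, collapse whitespace, and drop neutral modifiers."""
    tokens = re.split(r"\s+", name.strip().lower())
    kept = [t for t in tokens if t not in NEUTRAL_MODIFIERS]
    # Fall back to the original tokens if every token was a modifier.
    return " ".join(kept or tokens)


print(normalize_ingredient("All-Purpose Flour"))  # flour
print(normalize_ingredient("Flour"))              # flour
print(normalize_ingredient("red onions"))         # red onions
print(normalize_ingredient("yellow onions"))      # yellow onions
```

Keeping the modifier list explicit makes each merge decision reviewable, which matters because some adjectives do change the ingredient.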



In [2]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import pandas as pd
import nltk

from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

### parameters

In [3]:
recipesAll = pd.read_json('recipies.json')

In [4]:
recipesAll.head()

Unnamed: 0,id,cuisine,ingredients
0,10259,greek,"[romaine lettuce, black olives, grape tomatoes..."
1,25693,southern_us,"[plain flour, ground pepper, salt, tomatoes, g..."
2,20130,filipino,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,22213,indian,"[water, vegetable oil, wheat, salt]"
4,13162,indian,"[black pepper, shallots, cornflour, cayenne pe..."


# How many types of cuisine are there?

In [5]:
recipesAll.cuisine.unique()

array(['greek', 'southern_us', 'filipino', 'indian', 'jamaican',
       'spanish', 'italian', 'mexican', 'chinese', 'british', 'thai',
       'vietnamese', 'cajun_creole', 'brazilian', 'french', 'japanese',
       'irish', 'korean', 'moroccan', 'russian'], dtype=object)

In [6]:
len(recipesAll.cuisine.unique())

20

### How is the data distributed among classes?

In [7]:
recipesAll.cuisine.value_counts()

italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: cuisine, dtype: int64

### Stick in a bargraph of cuisines
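
Until a plotted bar graph is added, a text-based sketch can make the class imbalance visible. It uses the five largest counts from the `value_counts()` output above and the sum of all 20 counts (39774). `text_bar` is a hypothetical helper, and a plotting library could render the same data as a real chart.

```python
# Counts copied from the value_counts() output above (five largest classes).
cuisine_counts = {
    "italian": 7838,
    "mexican": 6438,
    "southern_us": 4320,
    "indian": 3003,
    "chinese": 2673,
}
TOTAL_RECIPES = 39774  # sum of all 20 class counts listed above


def text_bar(count, scale=200):
    """Return one '#' per `scale` recipes (hypothetical helper)."""
    return "#" * (count // scale)


for cuisine, count in cuisine_counts.items():
    share = count / TOTAL_RECIPES
    print(f"{cuisine:<12} {count:>5} {share:6.1%} {text_bar(count)}")
```

Italian and Mexican together account for roughly 36% of the recipes, so overall accuracy will be dominated by those two classes.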

## How many types of ingredients are there?

In [45]:
subRecipe = recipesAll.head(1000)
type(subRecipe['ingredients'][1])
corpus = recipesAll['ingredients'].apply(lambda x: ' '.join(x)).to_list()
corpus[0]

'romaine lettuce black olives grape tomatoes garlic pepper purple onion seasoning garbanzo beans feta cheese crumbles'

In [28]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [29]:
X = vectorizer.fit_transform(corpus)
X

<1000x1206 sparse matrix of type '<class 'numpy.int64'>'
	with 19156 stored elements in Compressed Sparse Row format>

In [30]:
X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

<1x1206 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [43]:
cuisine_labels = recipesAll.cuisine
type(cuisine_labels)

pandas.core.series.Series

In [48]:
# hold out 25% of the recipes, stratified by cuisine, so a confusion matrix can quantify which cuisines are mistaken for one another
X_train, X_test, y_train, y_test = train_test_split(corpus, cuisine_labels, test_size=0.25, stratify=cuisine_labels, random_state=123)

from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression




In [49]:
vectorizer = CountVectorizer(ngram_range=(1, 2))

# tokenize and build the vocabulary on the training split only
matrix_train = vectorizer.fit_transform(X_train)
matrix_test = vectorizer.transform(X_test)

lr_clf = LogisticRegression(max_iter=1000, random_state=123, multi_class='auto', solver='lbfgs')
lr_clf.fit(matrix_train, y_train)

y_pred = lr_clf.predict(matrix_test)

# scikit-learn metrics take (y_true, y_pred) in that order
print('accuracy %s' % accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
cm_lr_test = confusion_matrix(y_test, y_pred)
print(cm_lr_test)

accuracy 0.7813757039420757
              precision    recall  f1-score   support

   brazilian       0.75      0.68      0.71       117
     british       0.58      0.48      0.53       201
cajun_creole       0.79      0.66      0.72       386
     chinese       0.82      0.86      0.84       668
    filipino       0.74      0.60      0.66       189
      french       0.58      0.65      0.61       662
       greek       0.75      0.70      0.72       294
      indian       0.89      0.90      0.89       751
       irish       0.65      0.49      0.56       167
     italian       0.80      0.87      0.84      1960
    jamaican       0.90      0.73      0.81       131
    japanese       0.82      0.71      0.76       356
      korean       0.87      0.74      0.80       207
     mexican       0.90      0.91      0.90      1610
    moroccan       0.81      0.75      0.78       205
     russian       0.55      0.43      0.48       122
 southern_us       0.72      0.79      0.75      1080
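
The report above gives per-class scores (russian, british and irish have the lowest f1) but does not say which cuisines are confused with which. A minimal sketch for ranking the most frequent mistakes follows. `most_confused_pairs` is a hypothetical helper and the labels are toy values, not results from the model above.

```python
from collections import Counter


def most_confused_pairs(y_true, y_pred, top_n=3):
    """Count (true, predicted) pairs where the prediction was wrong."""
    errors = Counter(
        (true, pred) for true, pred in zip(y_true, y_pred) if true != pred
    )
    return errors.most_common(top_n)


# Toy labels for illustration only; not results from the model above.
toy_true = ["french", "french", "italian", "irish", "british", "french"]
toy_pred = ["italian", "italian", "italian", "irish", "british", "french"]
toy_pred[3] = "british"  # one irish recipe predicted as british

print(most_confused_pairs(toy_true, toy_pred))
```

In this notebook the same function could presumably be called as `most_confused_pairs(y_test, y_pred)`, since both are iterables of cuisine labels.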

In [32]:
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
vectorizer.get_feature_names()

['abbamele',
 'acai',
 'achiote',
 'acorn',
 'acting',
 'active',
 'adobo',
 'aged',
 'ahi',
 'aioli',
 'ale',
 'alfredo',
 'all',
 'allspice',
 'almond',
 'almonds',
 'amaretti',
 'amaretto',
 'amchur',
 'aminos',
 'ancho',
 'anchovy',
 'and',
 'andouille',
 'angel',
 'anise',
 'anjou',
 'annatto',
 'apple',
 'apples',
 'applewood',
 'apricot',
 'apricots',
 'arame',
 'arborio',
 'arrowroot',
 'artichoke',
 'artichokes',
 'arugula',
 'asadero',
 'asafetida',
 'asafoetida',
 'asiago',
 'asian',
 'asparagus',
 'avocado',
 'baby',
 'back',
 'bacon',
 'bagels',
 'bags',
 'baguette',
 'baking',
 'balls',
 'balsamic',
 'bamboo',
 'banana',
 'bananas',
 'barbecue',
 'bark',
 'barley',
 'bars',
 'base',
 'basil',
 'basmati',
 'bay',
 'bbq',
 'bean',
 'beans',
 'beansprouts',
 'beaten',
 'beef',
 'beer',
 'beets',
 'bell',
 'belly',
 'berries',
 'bertolli',
 'beverages',
 'bibb',
 'bird',
 'biscuit',
 'biscuits',
 'bittersweet',
 'black',
 'blackberries',
 'blanc',
 'blanched',
 'blend',
 'blo
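
The vocabulary above contains fragments such as 'all', 'and' and 'acting' because the default tokenizer splits every ingredient phrase into single words. An alternative worth comparing is to treat each whole ingredient phrase as one feature. The sketch below does this with plain Python; `build_vocabulary` and `multi_hot` are hypothetical helpers, and the two recipes reuse ingredients visible in the `head()` output.

```python
def build_vocabulary(recipes):
    """Map each distinct ingredient phrase to a column index."""
    phrases = sorted({ingredient for recipe in recipes for ingredient in recipe})
    return {phrase: index for index, phrase in enumerate(phrases)}


def multi_hot(recipe, vocabulary):
    """Encode one recipe as a 0/1 vector over whole ingredient phrases."""
    row = [0] * len(vocabulary)
    for ingredient in recipe:
        if ingredient in vocabulary:  # ignore phrases unseen when fitting
            row[vocabulary[ingredient]] = 1
    return row


# Ingredients visible in the head() output above (second list is partial).
recipes = [
    ["water", "vegetable oil", "wheat", "salt"],
    ["romaine lettuce", "black olives", "grape tomatoes"],
]
vocabulary = build_vocabulary(recipes)
print(vocabulary)
print(multi_hot(["salt", "black olives", "unknown item"], vocabulary))
```

With scikit-learn, a comparable effect should be achievable by passing a callable `analyzer` to `CountVectorizer` and feeding it the ingredient lists directly. Whether whole phrases beat word-level tokens on this dataset is an empirical question.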

In [29]:
from collections import Counter

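# Each row of 'ingredients' is a list, and lists are unhashable, so
# Counter(subRecipe['ingredients']) raises "TypeError: unhashable type: 'list'".
# Flatten the rows into one list of ingredient strings before counting.
words = [ingredient for recipe in subRecipe['ingredients'] for ingredient in recipe]

ingredient_counts = Counter(words)
ingredient_counts.keys()    # the distinct ingredients, like list(set(words))
ingredient_counts.values()  # how often each ingredient occurs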

In [8]:
listofIngredients = recipesAll['ingredients'].values.tolist()
len(listofIngredients)

39774

In [11]:
# flatten the list of recipes into one list of ingredient strings
flat_list = [item for sublist in listofIngredients for item in sublist]


In [12]:
len(flat_list)

428275

In [13]:
values = array(flat_list)
print(values)
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

['romaine lettuce' 'black olives' 'grape tomatoes' ... 'roma tomatoes'
 'celery' 'dried oregano']
[5222  956 3033 ... 5221 1418 2334]


In [42]:
# Count how often each element occurs in a list.
# Counter makes one pass over the list; calling my_list.count() inside the
# loop would rescan the whole list for every element.
from collections import Counter

def CountFrequency(my_list):
    freq = Counter(my_list)
    for key, value in freq.items():
        print("% s : % d" % (key, value))

In [43]:
CountFrequency(flat_list)

romaine lettuce :  7
black olives :  8
grape tomatoes :  9
garlic :  211
pepper :  107
purple onion :  52
seasoning :  7
garbanzo beans :  11
feta cheese crumbles :  10
plain flour :  6
ground pepper :  15
salt :  440
tomatoes :  68
ground black pepper :  132
thyme :  14
eggs :  84
green tomatoes :  3
yellow corn meal :  12
milk :  52
vegetable oil :  101
mayonaise :  15
cooking oil :  14
green chilies :  17
grilled chicken breasts :  1
garlic powder :  38
yellow onion :  35
soy sauce :  78
butter :  116
chicken livers :  4
water :  168
wheat :  2
black pepper :  82
shallots :  30
cornflour :  4
cayenne pepper :  40
onions :  198
garlic paste :  8
lemon juice :  44
chili powder :  60
passata :  2
oil :  47
ground cumin :  65
boneless chicken skinless thigh :  9
garam masala :  20
double cream :  2
natural yogurt :  1
bay leaf :  20
sugar :  143
fresh ginger root :  13
ground cinnamon :  41
vanilla extract :  24
ground ginger :  16
powdered sugar :  12
baking powder :  44
olive oil :  2

In [47]:
fdist = nltk.FreqDist(flat_list)
for word, frequency in fdist.most_common(25):
    print(u'{};{}'.format(word, frequency)) 

salt;440
garlic;211
olive oil;210
onions;198
water;168
sugar;143
ground black pepper;132
garlic cloves;132
butter;116
pepper;107
vegetable oil;101
all-purpose flour;101
eggs;84
green onions;84
black pepper;82
soy sauce;78
kosher salt;74
unsalted butter;69
tomatoes;68
large eggs;67
ground cumin;65
carrots;65
extra-virgin olive oil;61
chili powder;60
jalapeno chilies;56


In [51]:
len(fdist)

1817
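
The overall counts above are dominated by staples such as salt, which appear in every cuisine and so say little about any one of them. One way to surface the ingredients that characterize a cuisine is to compare an ingredient's frequency within the cuisine to its frequency overall (its lift). The sketch below is a minimal illustration on toy data; `characteristic_ingredients` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def characteristic_ingredients(recipes, top_n=2):
    """Rank ingredients per cuisine by lift: P(ing | cuisine) / P(ing)."""
    overall = Counter()
    per_cuisine = defaultdict(Counter)
    recipes_per_cuisine = Counter()
    for cuisine, ingredients in recipes:
        recipes_per_cuisine[cuisine] += 1
        for ingredient in set(ingredients):  # count once per recipe
            overall[ingredient] += 1
            per_cuisine[cuisine][ingredient] += 1

    total = len(recipes)
    result = {}
    for cuisine, counts in per_cuisine.items():
        lift = {
            ing: (n / recipes_per_cuisine[cuisine]) / (overall[ing] / total)
            for ing, n in counts.items()
        }
        # Sort by lift (descending), then name, so ties are deterministic.
        ranked = sorted(lift, key=lambda ing: (-lift[ing], ing))
        result[cuisine] = ranked[:top_n]
    return result


# Toy recipes for illustration only; not drawn from the dataset.
toy = [
    ("indian", ["salt", "garam masala", "onions"]),
    ("indian", ["salt", "garam masala", "water"]),
    ("italian", ["salt", "olive oil", "basil"]),
    ("italian", ["salt", "olive oil", "onions"]),
]
print(characteristic_ingredients(toy))
```

Salt drops out of every top list even though it is the most common ingredient. Note that rare ingredients (here, water and basil) tie with genuinely typical ones, so on real data a minimum-count threshold would likely be needed.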

In [35]:
unique_data = [list(x) for x in set(tuple(x) for x in setIngredients)]
unique_data

[['urad dal',
  'jalapeno chilies',
  'cumin seed',
  'curry leaves',
  'salt',
  'ginger',
  'canola oil'],
 ['sugar',
  'celery',
  'lettuce',
  'mandarin orange segments',
  'almonds',
  'dressing',
  'green onions'],
 ['pitted kalamata olives',
  'grated parmesan cheese',
  'penne pasta',
  'cauliflower',
  'olive oil',
  'crushed red pepper',
  'pepper',
  'garlic',
  'fresh parsley',
  'capers',
  'whole peeled tomatoes',
  'salt'],
 ['garlic',
  'chopped parsley',
  'curry powder',
  'brown lentils',
  'homemade chicken stock',
  'salt',
  'onions',
  'olive oil',
  'hot curry powder'],
 ['olive oil',
  'red potato',
  'vegan parmesan cheese',
  'garlic',
  'fresh chives',
  'dried oregano'],
 ['sugar', 'water', 'shredded coconut', 'glutinous rice flour'],
 ['eggs',
  'pepper',
  'coarse salt',
  'purple onion',
  'feta cheese crumbles',
  'couscous',
  'boiling water',
  'pinenuts',
  'roasted red peppers',
  'red pepper flakes',
  'yellow onion',
  'greek yogurt',
  'ground be

In [22]:
#recipesAll.ingredients.value_counts()

#pd.set_option('display.max_rows', 500)

### Clean data

### Split data into train and validation

In [17]:
train_set, test_set = train_test_split(recipesAll, test_size = 0.2, random_state = 42)
len(train_set)

31819

In [18]:
len(test_set)

7955

In [14]:

doc1 = "Can I eat the Pizza".lower()
doc2 = "You can eat the Pizza".lower()
doc1 = doc1.split()
doc2 = doc2.split()
doc1_array = array(doc1)
doc2_array = array(doc2)
doc3 = doc1+doc2
# doc3 = set(doc3)
data = list(doc3)


values = array(data)
print(values)
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

# binary encode (on scikit-learn >= 1.2 this argument is named sparse_output)
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)


# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
print(inverted)

['can' 'i' 'eat' 'the' 'pizza' 'you' 'can' 'eat' 'the' 'pizza']
[0 2 1 4 3 5 0 1 4 3]
[[1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0.]]
['can']
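
One-hot vectors like those above are mutually orthogonal, so every token is equally far from every other and the encoding carries no notion of similarity. For the substitution question, one option is to describe each ingredient by the other ingredients it appears with: substitutes rarely share a recipe but tend to share a context. The sketch below is a minimal illustration on toy data; `context_vectors` and `cosine` are hypothetical helpers.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(recipes):
    """For each ingredient, count the other ingredients it appears with."""
    contexts = defaultdict(Counter)
    for recipe in recipes:
        unique = set(recipe)
        for ingredient in unique:
            for other in unique - {ingredient}:
                contexts[ingredient][other] += 1
    return contexts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy recipes for illustration only; not drawn from the dataset.
toy = [
    ["butter", "flour", "sugar"],
    ["margarine", "flour", "sugar"],
    ["soy sauce", "ginger", "garlic"],
]
ctx = context_vectors(toy)
print(cosine(ctx["butter"], ctx["margarine"]))  # identical contexts
print(cosine(ctx["butter"], ctx["soy sauce"]))  # no shared context
```

Butter and margarine never appear together here, yet their contexts match, which is the signal a substitution-aware model could use. Learned embeddings are a heavier alternative built on the same idea.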
