# What are ingredients?

In the previous kernel (https://www.kaggle.com/rejasupotaro/representations-for-ingredients), I compared representations for this dataset without looking at what the ingredients themselves are.

Looking through the ingredients, I found some interesting things that might help in understanding cuisines.

0. Outliers
1. Special characters
2. Upper cases
3. Apostrophes
4. Hyphens
5. Numbers
6. Units
7. Region names
8. Accents
9. Unique ingredients
10. Language
11. Misspellings

In [None]:
import json
import langdetect
import re
import time
import unidecode
import ipywidgets as widgets
import numpy as np
import pandas as pd
from ipywidgets import interact
from nltk.stem import WordNetLemmatizer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_validate, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, MultiLabelBinarizer

In [None]:
train = pd.read_json('../input/train.json')
test = pd.read_json('../input/test.json')
train.head()

In [None]:
df = pd.concat([train, test], sort=False)
df['ingredients_text'] = df['ingredients'].apply(lambda x: ', '.join(x))
df['num_ingredients'] = df['ingredients'].apply(lambda x: len(x))
raw_ingredients = [ingredient for ingredients in df.ingredients.values for ingredient in ingredients]
df.head()

## 0. Outliers

I found some recipes that consist of only one ingredient, like the ones below.

- water => Japanese
- butter => Indian
- butter => French

This could confuse models. Unfortunately, such recipes also exist in the test dataset.

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns

plt.figure(figsize=(16,4))
sns.countplot(x='num_ingredients', data=df)

Let's look at the single-ingredient recipes.

In [None]:
df[df['num_ingredients'] <= 1]

By the way, are all ingredients valid? For example, do ingredients of two characters or fewer make sense?

In [None]:
[ingredient for ingredient in raw_ingredients if len(ingredient) <= 2]

## 1. Special characters

Let's see which non-letter characters appear in the ingredients.

  - "Bertolli**®** Alfredo Sauce"
  - "Progresso**™** Chicken Broth"
  - "green bell pepper**,** slice"
  - "half **&** half"
  - "asafetida **\(**powder**\)**"
  - "Spring**!** Water"

In [None]:
' '.join(sorted([char for char in set(' '.join(raw_ingredients)) if re.findall('[^A-Za-z]', char)]))
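
One possible way to act on this, as a minimal sketch: drop symbols such as ®, ™, and parentheses while keeping letters, digits, and spaces. The helper name `strip_special_characters` and the choice of which characters to keep are my assumptions here, not something the dataset dictates.

```python
import re

def strip_special_characters(ingredient):
    """Drop symbols such as ®, ™, ! and parentheses; turn '&' into 'and'."""
    ingredient = ingredient.replace('&', ' and ')
    # \w keeps Unicode letters (including accented ones), digits and underscore.
    # '%', '-' and apostrophes are kept because later sections look at them.
    ingredient = re.sub(r"[^\w\s%'’-]", ' ', ingredient)
    # Collapse the whitespace runs left behind by the removals.
    return re.sub(r'\s+', ' ', ingredient).strip()

for sample in ['Bertolli® Alfredo Sauce', 'half & half', 'asafetida (powder)']:
    print(sample, '=>', strip_special_characters(sample))
```

Whether "&" should become "and" or simply be dropped is a judgment call; it is mapped to "and" here so that "half & half" stays readable.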

## 2. Upper cases

An uppercase letter may indicate a proper noun.

- Company name
  - "**Oscar Mayer** Deli Fresh Smoked Ham"
- Region name
  - "**Shaoxing** wine"
  - "**California** bay leaves"
  - "**Italian** parsley leaves"

In [None]:
list(set([ingredient for ingredient in raw_ingredients if re.findall('[A-Z]+', ingredient)]))[:5]

## 3. Apostrophes

- "Zatarain’s Jambalaya Mix"
- "Breakstone’s Sour Cream"
- "sheep’s milk cheese"

Possessive apostrophes would be useful if they were common in the dataset, but let's check how many ingredients actually contain one.

In [None]:
list(set([ingredient for ingredient in raw_ingredients if '’' in ingredient]))

## 4. Hyphens ( `-` )

It might be okay to replace "-" with " ".

- "chicken-apple sausage"
- "chocolate-hazelnut spread"
- "bone-in chicken breasts"

In [None]:
list(set([ingredient for ingredient in raw_ingredients if re.findall('-', ingredient)]))[:5]

## 5. Numbers

Numbers indicate a quantity or a concentration.

- "1% low-fat milk"
- "40% less sodium taco seasoning"
- "mexican style 4 cheese blend"

Strictly speaking, quantities could be a factor in identifying the cuisine, but only a few ingredients come with a quantity in this dataset.

In [None]:
list(set([ingredient for ingredient in raw_ingredients if re.findall('[0-9]', ingredient)]))[:5]

## 6. Units

Units come with numbers.

  - "(15 **oz**.) refried beans"
  - "2 1/2 to 3 **lb**. chicken, cut into serving pieces"
  - "pork chops, 1 **inch** thick"
  
Some units are used only in specific regions, which might be useful for classifiers.

In [None]:
units = ['inch', 'oz', 'lb', 'ounc', '%'] # ounc is a misspelling of ounce?

@interact(unit=units)
def f(unit):
    ingredients_df = pd.DataFrame([ingredient for ingredient in raw_ingredients if unit in ingredient], columns=['ingredient'])
    return ingredients_df.groupby(['ingredient']).size().reset_index(name='count').sort_values(['count'], ascending=False)
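
Note that a plain substring test such as `unit in ingredient` also matches words that merely contain the unit, for example "oz" inside "mozzarella" or "inch" inside "pinch". Here is a minimal sketch of a stricter match using word boundaries. The helper `contains_unit` is hypothetical, and the last two sample strings are made up for illustration.

```python
import re

def contains_unit(ingredient, unit):
    """Return True if `unit` appears as a whole word (optionally plural).

    This only works for alphabetic units; '%' has no word boundary around it.
    """
    pattern = r'\b' + re.escape(unit) + r'(?:es|s)?\b'
    return re.search(pattern, ingredient.lower()) is not None

samples = [
    '(15 oz.) refried beans',
    'pork chops, 1 inch thick',
    'mozzarella cheese',  # contains "oz" only as part of a word
    'pinch of salt',      # contains "inch" only as part of a word
]
for unit in ['oz', 'inch']:
    print(unit, [s for s in samples if contains_unit(s, unit)])
```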

## 7. Region names

In [None]:
keywords = [
    # These indicate the cuisine directly
    'american', 'greek', 'filipino', 'indian', 'jamaican', 'spanish', 'italian', 'mexican', 'chinese', 'thai',
    'vietnamese', 'cajun', 'creole', 'french', 'japanese', 'irish', 'korean', 'moroccan', 'russian',
    # Region names I found in the dataset
    'tokyo', 'shaoxing', 'california'
]

@interact(keyword=keywords)
def f(keyword):
    # Compare in lowercase so that e.g. 'shaoxing' also matches "Shaoxing wine".
    ingredients_df = pd.DataFrame([ingredient for ingredient in raw_ingredients if keyword in ingredient.lower()], columns=['ingredient'])
    return ingredients_df.groupby(['ingredient']).size().reset_index(name='count').sort_values(['count'], ascending=False)

## 8. Accents

Some accents are used only in a specific region. Can we use this information?

- "pumpkin pur**é**e"
- "crème fra**î**che"
- "Ni**ç**oise olives"

In [None]:
accents = ['â', 'ç', 'è', 'é', 'í', 'î', 'ú']

@interact(accent=accents)
def f(accent):
    ingredients_df = pd.DataFrame([ingredient for ingredient in raw_ingredients if accent in ingredient], columns=['ingredient'])
    return ingredients_df.groupby(['ingredient']).size().reset_index(name='count').sort_values(['count'], ascending=False)
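
Instead of maintaining the list of accents by hand, the accented characters could be discovered from the data. This is a rough sketch using the standard `unicodedata` module; the helper `accented_characters` is hypothetical, and the samples are the examples quoted in this kernel plus one plain string.

```python
import unicodedata
from collections import Counter

def accented_characters(text):
    """Return the characters that decompose into a base letter plus a combining mark."""
    found = []
    # Normalize to NFC first so that each accented letter is a single code point.
    for char in unicodedata.normalize('NFC', text):
        decomposed = unicodedata.normalize('NFD', char)
        if len(decomposed) > 1 and any(unicodedata.combining(c) for c in decomposed):
            found.append(char)
    return found

samples = ['pumpkin purée', 'crème fraîche', 'Niçoise olives', 'Kahlúa', 'olive oil']
counts = Counter(char for s in samples for char in accented_characters(s))
print(sorted(counts))
```

Running the same function over `raw_ingredients` would give a count per accent, which could then be grouped by cuisine.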

## 9. Unique ingredients

Some ingredients are used only in a specific region.

I picked some ingredients used in `cuisine == 'japanese'`.

### brown rice => genmai => 玄米

<img src="https://kinarino.k-img.com/system/press_eye_catches/000/025/367/aea26213187ab5c5f69ed43c3480e24038c3d06f.jpg?1477883419" width="480">

### bonito => katsuo => 鰹

<img src="http://qoonell.me/wordpress/wp-content/uploads/2015/10/20070412155803000.jpg" width="480">

### salmon roe => ikura => いくら

<img src="https://cdn.macaro-ni.jp/image/summary/33/33186/40f8825cdd5438a6361373fae156a7b5.jpg" width="480">

In [None]:
lemmatizer = WordNetLemmatizer()
def preprocess(ingredients):
    ingredients = ' '.join(ingredients).lower().replace('-', ' ')
    ingredients = re.sub(r"\d+", "", ingredients)  # raw string avoids an invalid-escape warning
    return [lemmatizer.lemmatize(ingredient) for ingredient in ingredients.split()]

ingredients_df = df.groupby(['cuisine'])['ingredients'].sum().apply(lambda ingredients: preprocess(ingredients)).reset_index()
unique_ingredients = []
for cuisine in ingredients_df['cuisine'].unique():
    target = set(ingredients_df[ingredients_df['cuisine'] == cuisine]['ingredients'].values[0])
    others = set(ingredients_df[ingredients_df['cuisine'] != cuisine]['ingredients'].sum())
    unique_ingredients.append({
        'cuisine': cuisine,
        'ingredients': target - others
    })
pd.DataFrame(unique_ingredients, columns=['cuisine', 'ingredients'])
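
The same "used in exactly one cuisine" idea can also be computed in a single pass without pandas. This is only a sketch on toy data: the function name `unique_per_cuisine` and the three toy recipes are made up for illustration.

```python
from collections import defaultdict

def unique_per_cuisine(recipes):
    """recipes: iterable of (cuisine, ingredients) pairs.

    Returns {cuisine: set of ingredients seen in no other cuisine}.
    Cuisines without any unique ingredient are left out of the result.
    """
    cuisines_by_ingredient = defaultdict(set)
    for cuisine, ingredients in recipes:
        for ingredient in ingredients:
            cuisines_by_ingredient[ingredient].add(cuisine)

    unique = defaultdict(set)
    for ingredient, cuisines in cuisines_by_ingredient.items():
        if len(cuisines) == 1:
            unique[next(iter(cuisines))].add(ingredient)
    return dict(unique)

toy_recipes = [
    ('japanese', ['bonito', 'rice', 'soy sauce']),
    ('chinese', ['rice', 'soy sauce', 'shaoxing wine']),
    ('french', ['butter', 'creme fraiche']),
]
print(unique_per_cuisine(toy_recipes))
```

Unlike the cell above, which splits ingredients into words before comparing, this version compares whatever tokens it is given, so it works for whole ingredient names as well.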

## 10. Language

Related to accents, we can guess which language an ingredient name comes from by looking at its sequence of characters.

- tofu => Japanese
- purée => French

Can we extract language information from ingredients? In this case, we need to think about how to detect the ingredient language.

In [None]:
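# langdetect is non-deterministic on short texts; fix the seed for reproducible results.
langdetect.DetectorFactory.seed = 0
text_languages = []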
for text in [
    'ein, zwei, drei, vier',
    'purée',
    'taco',
    'tofu',
    'tangzhong',
    'xuxu',
]:
    text_languages.append({
        'text': text,
        'detected language': langdetect.detect(text)
    })
pd.DataFrame(text_languages, columns=['text', 'detected language'])

## 11. Misspellings

I found that there are some misspellings in the dataset.

- ounc (ounce)
- wasabe (wasabi)
- ...
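
One way to look for more candidates is fuzzy matching against frequent words. This is a minimal sketch with the standard `difflib` module; the helper `likely_misspellings`, the cutoff of 0.8, and the small `common` vocabulary are all assumptions for illustration. On the real data, `common` would be the frequent words and the candidates would be the rare ones.

```python
import difflib

def likely_misspellings(rare_words, common_words, cutoff=0.8):
    """Map each rare word to its closest common word, if one is similar enough."""
    suggestions = {}
    for word in rare_words:
        matches = difflib.get_close_matches(word, common_words, n=1, cutoff=cutoff)
        if matches and matches[0] != word:
            suggestions[word] = matches[0]
    return suggestions

common = ['ounce', 'wasabi', 'butter', 'water']
print(likely_misspellings(['ounc', 'wasabe', 'tofu'], common))
```

The suggestions still need a manual check, since a rare word that is close to a common one is not necessarily a misspelling.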

# Normalize

I saw ingredients that contain special characters, misspellings, and so on, so I should probably normalize them.

In [None]:
from IPython.display import display  # used at the end of this cell

ingredients = ['romaine lettuce', 'Eggs', 'Beef demi-glace', 'Sugar 10g', 'Pumpkin purée', 'Kahlúa']
labels = [widgets.Label(ingredient) for ingredient in ingredients]

lower_checkbox = widgets.Checkbox(value=False, description='lower', indent=False)
lemmatize_checkbox = widgets.Checkbox(value=False, description='lemmatize', indent=False)
remove_hyphens_checkbox = widgets.Checkbox(value=False, description='remove hyphens', indent=False)
remove_numbers_checkbox = widgets.Checkbox(value=False, description='remove numbers', indent=False)
strip_accents_checkbox = widgets.Checkbox(value=False, description='strip accents', indent=False)

lemmatizer = WordNetLemmatizer()
def lemmatize(sentence):
    return ' '.join([lemmatizer.lemmatize(word) for word in sentence.split()])
assert lemmatize('eggs') == 'egg'

def remove_numbers(sentence):
    words = []
    for word in sentence.split():
        if re.findall('[0-9]', word): continue
        if len(word) > 0: words.append(word)
    return ' '.join(words)

def update_ingredients(widget):
    for i, ingredient in enumerate(ingredients):
        processed = ingredient
        if lower_checkbox.value: processed = processed.lower()
        if lemmatize_checkbox.value: processed = lemmatize(processed)
        if remove_hyphens_checkbox.value: processed = processed.replace('-', ' ')
        if remove_numbers_checkbox.value: processed = remove_numbers(processed)
        if strip_accents_checkbox.value: processed = unidecode.unidecode(processed)
        if processed == ingredient:
            labels[i].value = ingredient
        else:
            labels[i].value = f'{ingredient} => {processed}'

lower_checkbox.observe(update_ingredients)
lemmatize_checkbox.observe(update_ingredients)
remove_hyphens_checkbox.observe(update_ingredients)
remove_numbers_checkbox.observe(update_ingredients)
strip_accents_checkbox.observe(update_ingredients)

display(widgets.VBox([
    widgets.Box([lower_checkbox, lemmatize_checkbox, remove_hyphens_checkbox, remove_numbers_checkbox, strip_accents_checkbox]),
    widgets.VBox(labels)
]))
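
For use outside the notebook, the same steps could be wrapped in a plain function. This is a sketch under some assumptions: it uses the standard `unicodedata` module instead of `unidecode` to strip accents (the two can differ for non-Latin scripts), and it leaves out lemmatization to avoid the NLTK dependency. The function name `normalize` and its keyword flags are hypothetical.

```python
import re
import unicodedata

def normalize(ingredient, lower=True, remove_hyphens=True,
              remove_numbers=True, strip_accents=True):
    """Apply the selected normalization steps to one ingredient string."""
    if lower:
        ingredient = ingredient.lower()
    if remove_hyphens:
        ingredient = ingredient.replace('-', ' ')
    if remove_numbers:
        # Drop any whitespace-separated token that contains a digit, e.g. "10g".
        ingredient = ' '.join(
            word for word in ingredient.split() if not re.search(r'[0-9]', word)
        )
    if strip_accents:
        # Decompose accented letters, then drop the combining marks.
        decomposed = unicodedata.normalize('NFKD', ingredient)
        ingredient = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return ingredient

for sample in ['Beef demi-glace', 'Sugar 10g', 'Pumpkin purée', 'Kahlúa']:
    print(sample, '=>', normalize(sample))
```

Each step is behind its own flag, mirroring the checkboxes above, so the effect of a single step on model accuracy can be tested in isolation.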