# Contents

1. [Project birds-eye view](#Project-at-a-glance)
    - [Survey existing projects](#Survey); take notes (on datasets; research questions)
    - [Data](#Data)
         - [Get data](#Get-data)
         - [Explore data](#Explore-data)
         - [Pre-process data](#Preprocess-data)
    - [Label recipes](#Label-recipes-as-W/NW) (based on ingredients? based on name?) as W or NW 
         - Define W/NW depending on:
             - if non-fusion:
                 - country of origin (straightforward for non-fusion recipes)
             - if fusion:
                 - recipes can be conceptualized as principal ingredient + form; ingredient and form can be of independent origins (e.g., matcha (NW) muffins (W); chocolate (W) zongzi (NW))—what to do in this case?
                     - option 1: exclude 
                     - option 2: label based on country of origin of form
    - Analyze words that are most marked for being NW

2. [Developing a classifier](#Develop-classifier)
    - Assemble dataset of labeled W/NW recipes 
         - [Wikipedia cuisine portal](#Get-Wikipedia-data)
         - Recipe collections (NYT, Bon Appétit, Food52, etc.)
         - manual labeling to understand potential problem cases (e.g. fusion recipes; long names)
    - Experiment w/ different models, baselines
        - Filter adjectives and other modifiers from recipe titles so just nouns remain
        - Look for explicit country-of-origin modifiers (e.g. “Turkish”, “Asian”)
    - Some qualitative analysis

3. Analysis
    - Assemble multiple different corpora of recipes/food-writing (preferably at least some distinct from training data)
         - NYT Cooking columns
         - Bon Appétit, Food52, food blogs
         - Cookbooks (would be especially suited for longitudinal analysis)
    - Ask research questions
        - H: Length of adjectives in recipe title longer for NW than W (just introduced as in; no need to “hype up”)
        - Do recipes belonging to multiple countries (e.g., baklava) get described differently when it’s framed as Western (=Greek) vs. Non-Western (=Turkish)?

# Project at a glance

## Survey

- Data
    - scraper from foodnetwork, epicurious, allrecipes (https://github.com/rtlee9/recipe-box)
    - Food.com recipes (https://www.kaggle.com/shuyangli94/food-com-recipes-and-user-interactions?select=PP_recipes.csv)
    - dataset w/ cuisine annotations (https://old.datahub.io/dataset/recipe-dataset)
    - Recipe1M+ (https://luisrita.medium.com/recipe1m-dataset-2ecb62a43804)
- Analysis
    - Priya Krishna and Yewande Komolafe conversation (https://www.bonappetit.com/story/recipe-writing-whitewashed) with ideas for differences
        - modifiers indexing “approachability” (“It's why modifiers like “simple” or “weeknight” that many publications (including BA) have historically added always make me laugh. Indian is my weeknight food. There's no weeknight dal and then dal. It's all just dal. It’s as if our food needs to be made more approachable.”)
        - “translating” into familiar terms (“The recipe title reads "Kadhi," and then in parentheses "Turmeric Yogurt Soup.” Of course, people in the South Asian community were like, "What are you doing calling it a soup?””)
        - having a technical term for everything (vs. intuitive description) (“It blew my mind when I first moved here how there’s a name for everything. Like, in Nigerian cooking there's this process where you take a starch and you pound it, but there’s no technical name for it. I think the way we write recipes now almost demands that we have one word for a given technique. When you throw spices in oil, it's called blooming. But do you bloom onions, too? Do you bloom dried fish?”)
        - italicizing unfamiliar terms (“Last year when my book was coming out, I had to take a stand against italicizing non-English words. It's a way that Western publications literally "other" non-white foods: they make them look different. But why can't dal and jollof rice and macaroni and cheese all exist in the same font style?”)
    

## Data

### Get data

- Recipe1M+ dataset: problems with download :(
- Bon Appetit/NYT Cooking: scraped using modified code from [recipe-box](https://github.com/rtlee9/recipe-box)
    - Modified code to additionally scrape free-form text and keywords
    - `python get_recipes.py --nyt --ba` from within `./recipe-box/src` to get recipes
    - `python get_recipes.py --nyt_cuisine` to get the titles of recipes belonging to each cuisine category (as categorized by NYT); BA does not practice any categorization

In [167]:
import os
import json
import glob
from collections import defaultdict

In [2]:
DOMAIN_CODES = {
    'Bon Appetit': 'ba',
    'NYT Cooking': 'nyt',
}

In [3]:
path_to_jsons = {code: 
                 'recipe-box/data/recipes_raw_{}.json'.format(code)
                for code in DOMAIN_CODES.values()}
path_to_jsons

{'ba': 'recipe-box/data/recipes_raw_ba.json',
 'nyt': 'recipe-box/data/recipes_raw_nyt.json'}

In [4]:
with open(path_to_jsons['ba']) as f:
    ba_data = json.load(f)
    
with open(path_to_jsons['nyt']) as f:
    nyt_data = json.load(f)

Scraped 10.3K and 8.9K recipes from BA, NYT respectively.

In [5]:
len(ba_data.keys()),len(nyt_data.keys())

(10264, 8910)

Let's also collect the titles of recipes categorized by cuisine, from NYT:

In [142]:
# These are the cuisine filters, which I have assigned to 
# 'west' or 'non-west' based on my own intuitions

nyt_cuisine_types = {'west': [
    'American','Australian','Austrian','Belgian','British',
    'Canadian','Eastern%20European','French','German','Greek',
    'Icelandic','Irish','Italian','Jewish','Mediterranean',
    'New%20England','Portuguese','Provencal','Russian','Scandinavian',
    'Southern','Southwestern','Spanish'],
                    'non_west': [
    'African','Asian','Brazilian','Cajun','Caribbean',
    'Central%20American','Chinese','Creole','Cuban','Ethiopian',
    'Filipino','Indian','Indonesian','Japanese','Korean',
    'Latin%20American','Malaysian','Mexican','Middle%20Eastern',
    'Moroccan','Pakistani','South%20American','Thai','Tibetan',
    'Turkish','Vietnamese']}

cuisine2type = {cuisine: cuisine_type 
               for cuisine_type in ['west','non_west']
               for cuisine in nyt_cuisine_types[cuisine_type]}

We'll create a dictionary with (cuisine, list of recipes) key, val pairs.

In [143]:
recipes_per_cuisine = {}

for file in glob.glob('./recipe-box/data/per_cuisine/*.txt'):
    with open(file, 'r') as f:
        recipe_urls = f.read().splitlines()
    recipe_urls = ['http://cooking.nytimes.com/'+r for r in recipe_urls]
    recipe_titles = [" ".join([c for c in r.split('/')[-1].split('-')
                              if c.isalpha()]) 
                     for r in recipe_urls]
    recipes_per_cuisine[file.split('/')[-1][:-4]] = recipe_urls

And another dictionary with (recipe URL, set of cuisines) key, val pairs (as a recipe can belong to multiple cuisines).

In [144]:
recipe2cuisine = defaultdict(set)

for cuisine in recipes_per_cuisine:
    for r in recipes_per_cuisine[cuisine]:
        recipe2cuisine[r].add(cuisine)

In [145]:
recipe2cuisine['http://cooking.nytimes.com//recipes/1980-macadamia-meringue-triangles']

{'French'}

In [15]:
all_cuisine_urls = [recipes_per_cuisine[cuisine]
                    for cuisine in recipes_per_cuisine]
all_cuisine_urls = [item for sublist in all_cuisine_urls 
                   for item in sublist]
len(all_cuisine_urls)

15419

### Explore data

In [64]:
import pandas as pd
import numpy as np
import pickle

Let's create a dataframe to store all the recipes, with the URL as the index.

In [17]:
df = pd.read_json(path_to_jsons['ba'],orient='index').append(
pd.read_json(path_to_jsons['nyt'],orient='index'))
df.shape

(19174, 4)

In [18]:
df.tail()

Unnamed: 0,title,author,description,steps
http://cooking.nytimes.com//recipes/1739-hootenholler-whiskey-quick-bread,Hootenholler Whiskey Quick Bread,Alex Witchel,This recipe is adapted from the “I Hate to Coo...,"First, take the bourbon out of the cupboard an..."
http://cooking.nytimes.com//recipes/1740-braised-chicken-with-artichokes-and-mushrooms,Braised Chicken With Artichokes and Mushrooms,Alex Witchel,"The men who ruled the world in the late 1950s,...",Preheat the oven to 375 degrees. Mix together ...
http://cooking.nytimes.com//recipes/1878-roquefort-and-pear-eggnog,Roquefort-and-Pear Eggnog,Amanda Hesser,This fascinating recipe came to The Times in a...,"One to two days before making the eggnog, comb..."
http://cooking.nytimes.com//recipes/1877-1958-eggnog,1958: Eggnog,Amanda Hesser,This recipe appeared in The Times in an articl...,"In an electric mixer, beat the egg yolks with ..."
http://cooking.nytimes.com//recipes/1015818-white-bark-balls,White Bark Balls,Jennifer Steinhauer,,"In a medium bowl, combine Rice Krispies, peanu..."


We'll add a column indicating whether the recipe comes from BA or NYT.

In [20]:
df['source'] = ['ba' if url.startswith('http://www.bonappetit.com')
               else 'nyt' for url in df.index.values]
df['source'].value_counts()

ba     10264
nyt     8910
Name: source, dtype: int64

It looks like there are 5.5K recipes that do not have a cuisine category label:

In [21]:
len(set(df.loc[df['source']=='nyt'].index.values).difference(
    set(all_cuisine_urls)))

5506

Let's see who the recipes are coming from over at NYT and BA:

In [22]:
df.loc[df['source']=='nyt']['author'].value_counts()

Martha Rose Shulman                                1632
Melissa Clark                                       963
Mark Bittman                                        664
David Tanis                                         578
Sam Sifton                                          248
                                                   ... 
Recipe from Gladys Puglla-Jimenez                     1
Recipe from Philip Greene                             1
Recipe from “VegNews Holiday Cookie Collection”       1
Recipe from Ree Drummond                              1
Recipe from Robbie Richter and Zakary Pelaccio        1
Name: author, Length: 1809, dtype: int64

In [23]:
df.loc[df['source']=='ba']['author'].value_counts()

Claire Saffitz                                    474
The Bon Appétit Test Kitchen                      376
Chris Morocco                                     374
Andy Baraghani                                    298
Alison Roman                                      291
                                                 ... 
Bluewater Cafe New Canaan CT                        1
Clyde Cooper s Barbecue Raleigh North Carolina      1
Victoria Granof                                     1
Judith Fertig and Karen Adler                       1
Adam Sachs                                          1
Name: author, Length: 1563, dtype: int64

It looks like some regularization is called for. We'll do that in the [Regularize authors](#Regularize-authors) section below.

Finally, let's explore the cuisine breakdown of recipes from NYT.

In [24]:
counts_per_cuisine = {item[0]: len(item[1])
 for item in sorted(recipes_per_cuisine.items(), 
      key=lambda x: len(x[1]), reverse=True)}

counts_per_cuisine

{'American': 2554,
 'Italian': 1656,
 'Central%20American': 1610,
 'Latin%20American': 1610,
 'South%20American': 1610,
 'French': 1494,
 'Asian': 631,
 'Mediterranean': 460,
 'Southern': 404,
 'Mexican': 388,
 'Chinese': 294,
 'Indian': 287,
 'Jewish': 217,
 'Spanish': 202,
 'British': 172,
 'Greek': 170,
 'Japanese': 139,
 'Thai': 118,
 'Caribbean': 116,
 'Southwestern': 110,
 'Moroccan': 105,
 'Vietnamese': 99,
 'Scandinavian': 95,
 'African': 92,
 'Korean': 92,
 'German': 78,
 'Irish': 60,
 'Turkish': 60,
 'Cajun': 54,
 'Creole': 52,
 'Russian': 52,
 'Eastern%20European': 47,
 'Portuguese': 42,
 'Middle%20Eastern': 41,
 'Indonesian': 28,
 'Austrian': 25,
 'Filipino': 24,
 'Brazilian': 22,
 'Cuban': 18,
 'Provencal': 18,
 'Canadian': 17,
 'New%20England': 14,
 'Belgian': 12,
 'Australian': 11,
 'Malaysian': 10,
 'Pakistani': 4,
 'Ethiopian': 2,
 'Tibetan': 2,
 'Icelandic': 1}

In [25]:
counts_per_cuisine_type = defaultdict(int)
for cuisine_type in ['west','non_west']:
    for cuisine in nyt_cuisine_types[cuisine_type]:
        counts_per_cuisine_type[cuisine_type] += \
        counts_per_cuisine[cuisine]
        
counts_per_cuisine_type

defaultdict(int, {'west': 7911, 'non_west': 7508})

### Preprocess data

#### Regularize 

In [26]:
def regularize_author(auth_str):
    if auth_str is not None:
        return auth_str.replace('Recipe from ','')
    return

import re
def regularize_title(title_str):
    if title_str is not None:
        return re.sub(r'[^a-zA-Z0-9]+',' ',title_str.lower())
    return

In [27]:
df['author'] = df['author'].apply(regularize_author)

In [28]:
df.loc[df['source']=='nyt']['author'].value_counts()

Martha Rose Shulman              1634
Melissa Clark                     963
Mark Bittman                      670
David Tanis                       580
Sam Sifton                        248
                                 ... 
Bill Smith                          1
Ree Drummond                        1
Niloufer Ichaporia King             1
"The Divvies Bakery Cookbook"       1
Nate Dumas                          1
Name: author, Length: 1755, dtype: int64

In [29]:
df['reg_title'] = df['title'].apply(regularize_title)

#### Deduplicate

Let's check for recipes from the same author with the same title (regardless of whether they're from BA or NYT):

In [30]:
df.loc[df.duplicated(subset=['author','reg_title'],keep='first')]

Unnamed: 0,title,author,description,steps,source,reg_title
http://www.bonappetit.com/recipe/chocolate-pizzettes,Pizzettes,Chris Morocco,The short bake time in this recipe isn’t a typ...,"Preheat oven to 375°. Mix flour, chocolate chi...",ba,pizzettes
http://www.bonappetit.com/recipe/buche-de-noel-recipe,Bûche de Noël,Claire Saffitz,A little oil in this bûche de noel recipe help...,"Preheat oven to 375°. Coat an 18x13"" rimmed ba...",ba,b che de no l
http://www.bonappetit.com/recipe/extra-buttery-mashed-spuds,Extra-Buttery Mashed Potatoes,Dawn Perry,"For this mashed potato recipe, drying the cook...",Place potatoes in a large pot and pour in cold...,ba,extra buttery mashed potatoes
http://www.bonappetit.com/recipe/brads-campsite-jambalaya-2,Brad’s Campsite Jambalaya,Zach DeSart,If you decide to make this best jambalaya reci...,Prepare grill for high heat. Heat oil in a lar...,ba,brad s campsite jambalaya
http://www.bonappetit.com/recipe/turmeric-tonic-2,Turmeric Tonic,Alison Roman,"Find turmeric at specialty, Asian, and health ...","Pass turmeric, ginger, and lemon (with peel) t...",ba,turmeric tonic
...,...,...,...,...,...,...
http://cooking.nytimes.com//recipes/1014579-la-zucca-magicas-orange-and-olive-salad,La Zucca Magica’s Orange and Olive Salad,Mark Bittman,"The combination of sweet, juicy, tart (and col...","In a food processor, combine olives and thyme,...",nyt,la zucca magica s orange and olive salad
http://cooking.nytimes.com//recipes/1016526-cauliflower-gratin-with-goat-cheese-topping,Cauliflower Gratin with Goat Cheese Topping,Martha Rose Shulman,"Of all of the many gratins that I make, this i...",Preheat the oven to 450ºF. Oil a 2-quart grati...,nyt,cauliflower gratin with goat cheese topping
http://cooking.nytimes.com//recipes/1016147-simple-marinara-sauce,Simple Marinara Sauce,Martha Rose Shulman,Recipes hardly come easier. This marinara sauc...,Pulse the chopped tomatoes in a food processor...,nyt,simple marinara sauce
http://cooking.nytimes.com//recipes/1016478-simple-marinara-sauce,Simple Marinara Sauce,Martha Rose Shulman,This is the marinara sauce I make all winter. ...,Pulse the chopped tomatoes in a food processor...,nyt,simple marinara sauce


There are 108 recipes with duplicate versions.

Let's save the regularized, deduplicated dataframe of recipes.

In [31]:
df = df.drop_duplicates(subset=['author','reg_title'],keep='first')
df.shape

(19066, 6)

In [332]:
df.to_csv('./data/recipes_df.tsv',sep='\t',header=True,index=True)

In [65]:
df = pd.read_csv('./data/recipes_df.tsv',sep='\t',index_col=0)
df.shape

(19066, 6)

In [66]:
df.head()

Unnamed: 0,title,author,description,steps,source,reg_title
http://www.bonappetit.com/recipe/rustic-shrimp-toasts,Rustic Shrimp Toasts,Christian David Reynoso,This version of shrimp toast is inspired by my...,Heat broiler (set to low if you have that opti...,ba,rustic shrimp toasts
http://www.bonappetit.com/recipe/tomato-and-egg-drop-noodle-soup,Tomato and Egg Drop Noodle Soup,Hetty McKinnon,"The combination of tomato, egg, and noodles is...",Fill a small Dutch oven or large saucepan with...,ba,tomato and egg drop noodle soup
http://www.bonappetit.com/recipe/macadamia-and-brown-butter-blondies,Macadamia and Brown Butter Blondies,Roxana Jullapat,"At her L.A. bakery Friends and Family, Roxana ...","Preheat oven to 350°. Lightly coat a 9""-diamet...",ba,macadamia and brown butter blondies
http://www.bonappetit.com/recipe/grilled-mushrooms-and-root-vegetables,Grilled Mushrooms and Root Vegetables,Maricela Vega,“My great-grandmothers were Indigenous and mos...,"Purée Sesame Crème, Allium Confit, chopped her...",ba,grilled mushrooms and root vegetables
http://www.bonappetit.com/recipe/spiced-pecans,Spiced Pecans,Maricela Vega,Pecans are a favorite Southern source of prote...,Preheat oven to 350°. Grind fennel seeds in a ...,ba,spiced pecans


#### Use spaCy to tokenize, lemmatize, and parse titles, descriptions, & steps

In [250]:
import spacy
nlp = spacy.load("en_core_web_sm")

import os
import glob

Can we use the final part of each recipe's URL as a GUID? (I.e., does each recipe have a distinct URL tail?) It appears so:

In [252]:
url_tails = [x.split('/')[-1] for x in df.index.values]
len(url_tails),len(set(url_tails))

(19174, 19174)

In [253]:
url_tails[:10]

['rustic-shrimp-toasts',
 'tomato-and-egg-drop-noodle-soup',
 'macadamia-and-brown-butter-blondies',
 'grilled-mushrooms-and-root-vegetables',
 'spiced-pecans',
 'granola-scones',
 'chocolate-buckwheat-cake',
 'pumpkin-hot-sauce',
 'scratchy-throat-soother',
 'blueberry-spelt-muffins']

In [254]:
url_tails[-10:]

['7649-spicy-pork-belly-with-green-olives-and-lemon',
 '1017675-clementine-clafoutis',
 '1016062-red-lentil-soup-with-lemon',
 '1016234-pernil',
 '1292-gingerbread-apple-cocktail',
 '1739-hootenholler-whiskey-quick-bread',
 '1740-braised-chicken-with-artichokes-and-mushrooms',
 '1878-roquefort-and-pear-eggnog',
 '1877-1958-eggnog',
 '1015818-white-bark-balls']

We'll save each spaCy-processed recipe to 
`./data/preprocessed/{GUID}.json`, with each `.json` having the structure
```
{
    'title': {
        'tokens': ['rustic shrimp toasts'], 
        'lemmas': ['rustic','shrimp','toast'],
        'pos': ['ADJ','NOUN','NOUN'],
        'dep_label': ['amod','compound','ROOT'],
        'head': ['toasts','toasts','toasts'],
    },
    'desc': {
        'tokens': ['this','version','of','shrimp','toast','is',...]
        'lemmas': ['this','version','of','shrimp','toast','be',...],
        'pos': ['DET','NOUN','ADP','NOUN','NOUN','VERB',...],
        'dep_label': ['det','nsubjpass','prep','compound','pobj','auxpass',...],
        'head': ['version','inspired','version','toast','of','inspired',...],
    },
    'steps': {
        'tokens': ['heat','broiler','(','set',...]
        'lemmas': ['heat','broiler','(','set',...],
        'pos': ['NOUN','NOUN','PUNCT','VERB',...],
        'dep_label': ['compound','ROOT','punct','acl',...],
        'head': ['broiler','broiler','broiler','broiler',...],
    }
}
```

In [272]:
from collections import defaultdict

In [297]:
progress = 0
for row_ix,row in df.iterrows():
    
    # Create json object to store results
    json_out = defaultdict(lambda: defaultdict(list))
    
    # Run spaCy pipeline for each of the 3 text components
    for obj in ['title', 'description', 'steps']:
        if row[obj] is not None:
            doc = nlp(row[obj].lower()) # convert to lowercase
            for token in doc:
                json_out[obj]['tokens'].append(token.text)
                json_out[obj]['lemmas'].append(token.lemma_)
                json_out[obj]['pos'].append(token.pos_)
                json_out[obj]['dep_label'].append(token.dep_)
                json_out[obj]['head'].append(token.head.text)
            
    # Write results
    fname = os.path.join('data','preprocessed',row_ix.split('/')[-1])       
    with open(fname, 'w') as outfile:
        json.dump(json_out, outfile)
    
    
    progress += 1
    if progress % 500 == 0:
        print('Processed {} out of {} rows.'.format(progress,len(df)))

Processed 500 out of 19174 rows.
Processed 1000 out of 19174 rows.
Processed 1500 out of 19174 rows.
Processed 2000 out of 19174 rows.
Processed 2500 out of 19174 rows.
Processed 3000 out of 19174 rows.
Processed 3500 out of 19174 rows.
Processed 4000 out of 19174 rows.
Processed 4500 out of 19174 rows.
Processed 5000 out of 19174 rows.
Processed 5500 out of 19174 rows.
Processed 6000 out of 19174 rows.
Processed 6500 out of 19174 rows.
Processed 7000 out of 19174 rows.
Processed 7500 out of 19174 rows.
Processed 8000 out of 19174 rows.
Processed 8500 out of 19174 rows.
Processed 9000 out of 19174 rows.
Processed 9500 out of 19174 rows.
Processed 10000 out of 19174 rows.
Processed 10500 out of 19174 rows.
Processed 11000 out of 19174 rows.
Processed 11500 out of 19174 rows.
Processed 12000 out of 19174 rows.
Processed 12500 out of 19174 rows.
Processed 13000 out of 19174 rows.
Processed 13500 out of 19174 rows.
Processed 14000 out of 19174 rows.
Processed 14500 out of 19174 rows.
Proce

## Label recipes as W/NW

### Approach 1: NYT existing cuisine labels

Add a column indicating W/NW label for each recipe, according to the following scheme:
    <br>* if a recipe has one cuisine category, use that category's label
    <br>* if a recipe has multiple cuisine categories, use the most majority label
    <br>* if a recipe has multiple cuisine categories and there is a tie among majority labels, ignore

In [140]:
from collections import Counter

def get_nyt_cuisine_label(r):
    cuisine_cats = list(recipe2cuisine[r])
    #print('cuisine_cats:',cuisine_cats)
    if len(cuisine_cats) == 0:
        return
    elif len(cuisine_cats) == 1:
        return cuisine2type[cuisine_cats[0]]
    else:
        #print([cuisine2type[cuisine_cat]
        #                       for cuisine_cat in cuisine_cats])
        #print(Counter([cuisine2type[cuisine_cat]
        #                       for cuisine_cat in cuisine_cats]))
        c = Counter([cuisine2type[cuisine_cat]
                               for cuisine_cat in cuisine_cats]).\
                    most_common()
        if len(c) == 1:
            return c[0][0]
        else:
            majority_label = c[0][0]
            if majority_label != c[1][0]:
                return majority_label
            else:
                print('Tie')

In [146]:
df['cuisine_type'] = [get_nyt_cuisine_label(x) for x in df.index.values]

In [147]:
df['cuisine_type'].value_counts()

non_west    1847
west        1550
Name: cuisine_type, dtype: int64

In [177]:
df.loc[df['cuisine_type']=='west']#['title'].values

Unnamed: 0,title,author,description,steps,source,reg_title,cuisine_type
http://cooking.nytimes.com//recipes/1021792-spicy-chorizo-pasta,Spicy Chorizo Pasta,Ali Slagle,Macaroni and chorizo is classic Spanish comfor...,Bring a large pot of salted water to a boil. A...,nyt,spicy chorizo pasta,west
http://cooking.nytimes.com//recipes/1021774-sauerkraut-jeon-korean-pancakes,Sauerkraut Jeon (Korean Pancakes),J. Kenji López-Alt,"Jeon are savory Korean vegetable, meat or seaf...","Prepare the dipping sauce: In a small bowl, st...",nyt,sauerkraut jeon korean pancakes,west
http://cooking.nytimes.com//recipes/1021791-cheesy-beer-bread,Cheesy Beer Bread,Erin Jeanne McDowell,This easy bread recipe uses beer and baking po...,Heat the oven to 375 degrees. Lightly grease a...,nyt,cheesy beer bread,west
http://cooking.nytimes.com//recipes/1021775-earlonnes-chicken-and-brown-rice,Earlonne’s Chicken and Brown Rice,Samin Nosrat,"Layered with savory, satisfying flavors, this ...",Season the chicken generously with salt and pe...,nyt,earlonne s chicken and brown rice,west
http://cooking.nytimes.com//recipes/1021767-mulling-spice-cake-with-cream-cheese-frosting,Mulling-Spice Cake With Cream-Cheese Frosting,Tara Bench,The spices in this cake from “Live Life Delici...,Heat oven to 350 degrees. Prepare the cakes: C...,nyt,mulling spice cake with cream cheese frosting,west
...,...,...,...,...,...,...,...
http://cooking.nytimes.com//recipes/1018618-meatball-sausages-soutzoukakia-smyrneika,Meatball Sausages (Soutzoukakia Smyrneika),Diane Kochilas,This is a soulful dish: handmade meatball saus...,For the sauce: In a medium saucepan over mediu...,nyt,meatball sausages soutzoukakia smyrneika,west
http://cooking.nytimes.com//recipes/1015667-sweet-tart-crust,Sweet Tart Crust,Mark Bittman,,"Combine the flour, salt, and sugar in the cont...",nyt,sweet tart crust,west
http://cooking.nytimes.com//recipes/1015542-crisp-skinned-tilefish-over-provencal-cabbage,Crisp-Skinned Tilefish Over Provencal Cabbage,Mark Bittman,,Turn the heat to medium-high under a heavy ski...,nyt,crisp skinned tilefish over provencal cabbage,west
http://cooking.nytimes.com//recipes/1017675-clementine-clafoutis,Clementine Clafoutis,Mark Bittman,Clafoutis is a classic French dessert most oft...,Heat oven to 350 degrees. Prepare a gratin dis...,nyt,clementine clafoutis,west


--> This doesn't seem to yield very much labeled data, and we still don't have a way of labeling the BA recipes. Moreover, the labels seem lacking in accuracy (e.g., Korean jeon is American). Results are noisy (see [Initial analysis](#Initial-analysis)).

In [649]:
recipe2cuisine['http://cooking.nytimes.com//recipes/1021774-sauerkraut-jeon-korean-pancakes']

{'American'}

### Approach 2: Using explicit cues in recipe titles

Let's look for adjectives like "Turkish", "Argentine". First we'll get a big list of all such adjectives for different regions/countries from Wikipedia.

In [2]:
from urllib import request
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
import re

In [3]:
url = "https://en.wikipedia.org/wiki/Demonym#Suffixation"
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
}
soup = BeautifulSoup(request.urlopen(
        request.Request(url, headers=HEADERS)).read(), "html.parser")

In [54]:
demonyms = set(["African", "Antarctican", "Asian", "Australian", "European", "North American", 
                "South American", "Central American", "American", "Oceanian"])

divs = soup.find_all('div',attrs={'class':'div-col'})
for ix,div in enumerate(divs):
    #print(ix)
    uls = div.find_all('ul')
    for ul in uls:
        lis = ul.find_all('li')
        for li in lis:
            dems = re.split('also |more commonly|less commonly|/|or |Or |demonym|, ', li.text.split(' → ')[-1])
            #print('first pass:',dems)
            dems = [re.sub(r'[^ \-a-zA-Z0-9+]','',x) for x in dems]
            #print('sub punctuation:',dems)
            dems = set([x for x in dems if len(x) > 0 and x[0].isupper()])
            #print('final:', dems)
            
            demonyms |= dems

In [70]:
# Remove final -s and lowercase to get adjectival form

demonyms = [d[:-1].lower() for d in demonyms]
len(demonyms)

844

In [83]:
demonyms[:5]

['silhillian', 'calcuttan', 'dubaiite', 'palaungges', 'transnistrian']

Look for these adjectives in recipe titles:

In [75]:
def has_demonym(s):
    return len(set(s.lower().split()).intersection(demonyms)) > 0

def get_demonyms(s):
    return set(s.lower().split()).intersection(demonyms)

df['has_demonym'] = df['title'].apply(has_demonym)
df['demonyms'] = df['title'].apply(get_demonyms)

In [76]:
df['has_demonym'].value_counts()

False    18697
True       369
Name: has_demonym, dtype: int64

In [85]:
df.loc[df['has_demonym']].head(10)

Unnamed: 0,title,author,description,steps,source,reg_title,has_demonym,demonyms
http://www.bonappetit.com/recipe/salvadoran-quesadilla,Salvadoran Quesadilla,,Not to be confused with the Mexican dish by th...,Place a rack in middle of oven; preheat to 375...,ba,salvadoran quesadilla,True,{salvadoran}
http://www.bonappetit.com/recipe/italian-chopped-salad,Italian Chopped Salad,Chris Morocco,"This winter salad with peak season chicories, ...",Heat butter and 2 Tbsp. oil in a large nonstic...,ba,italian chopped salad,True,{italian}
http://www.bonappetit.com/recipe/jamaican-beef-patties,Jamaican Beef Patties,Shani Jones,These delectable flaky pastries from can be fo...,"Pulse flour, salt, and turmeric in a food proc...",ba,jamaican beef patties,True,{jamaican}
http://www.bonappetit.com/recipe/thai-tea-ice-cream,Swirled No-Churn Thai Tea Ice Cream,Sarah Jampel,The hardest part of this recipe is tracking do...,Bring 1½ cups cream to a bare simmer in a smal...,ba,swirled no churn thai tea ice cream,True,{thai}
http://www.bonappetit.com/recipe/chrissy-teigen-thai-soy-garlic-fried-ribs-recipe,Chrissy Teigen’s Thai Soy-Garlic Fried Ribs,Chrissy Teigen,Making ribs at home can be sooooo intimidating...,"Place ribs in a large bowl. Add soy sauce, gar...",ba,chrissy teigen s thai soy garlic fried ribs,True,{thai}
http://www.bonappetit.com/recipe/italian-sundaes-with-nutella,Italian Sundaes with Nutella,Ignacio Mattos,Fior di latte (“milk flower”) is a fresh cow’s...,Place a scoop of gelato into each chilled bowl...,ba,italian sundaes with nutella,True,{italian}
http://www.bonappetit.com/recipe/arugula-with-italian-plums-and-parmesan,Arugula with Italian Plums and Parmesan,Ignacio Mattos,"For the best play between sweet, hot, and salt...","Toss plums, cocktail onions, lemon juice, and ...",ba,arugula with italian plums and parmesan,True,{italian}
http://www.bonappetit.com/recipe/german-apple-cake,Buttery German Apple Cake,,This gorgeous cake was unanimously crowned the...,Preheat oven to 350°. Grease bottom and sides ...,ba,buttery german apple cake,True,{german}
http://www.bonappetit.com/recipe/smashed-cucumber-salad-with-italian-dressing,Smashed Cucumber Salad with Italian Dressing,Chris Morocco,Smashing the cucumbers lets the crunchy veg so...,Gently smash cucumbers with a rolling pin or t...,ba,smashed cucumber salad with italian dressing,True,{italian}
http://www.bonappetit.com/recipe/weeknight-pad-thai,Weeknight Pad Thai,Claire Saffitz,We're all about taking shortcuts when it comes...,"First, some prep: Cut 1 bunch scallions crossw...",ba,weeknight pad thai,True,{thai}


In [82]:
from collections import Counter

counted_dems = Counter([item for sublist in df.loc[df['has_demonym']]['demonyms'].values for item in sublist])
sorted(counted_dems.items(), key=lambda x:x[1], reverse=True)

[('italian', 54),
 ('thai', 54),
 ('moroccan', 46),
 ('mexican', 33),
 ('indian', 20),
 ('japanese', 16),
 ('sicilian', 14),
 ('persian', 10),
 ('tunisian', 10),
 ('jamaican', 8),
 ('roman', 8),
 ('german', 7),
 ('belgian', 6),
 ('russian', 6),
 ('cuban', 6),
 ('hungarian', 6),
 ('american', 5),
 ('breton', 5),
 ('peruvian', 4),
 ('salvadoran', 3),
 ('malaysian', 3),
 ('colombian', 3),
 ('brazilian', 3),
 ('indonesian', 3),
 ('hamburger', 2),
 ('haitian', 2),
 ('bangladeshi', 2),
 ('andalusian', 2),
 ('georgian', 2),
 ('javanese', 1),
 ('malagasy', 1),
 ('libyan', 1),
 ('syrian', 1),
 ('singaporean', 1),
 ('america', 1),
 ('ghanaian', 1),
 ('genovese', 1),
 ('armenian', 1),
 ('chilean', 1),
 ('alaskan', 1),
 ('egyptian', 1),
 ('iranian', 1),
 ('baja', 1),
 ('victorian', 1),
 ('valencian', 1),
 ('algerian', 1),
 ('bosnian', 1),
 ('majorcan', 1),
 ('macedonian', 1),
 ('bohemian', 1),
 ('ukrainian', 1),
 ('bulgarian', 1),
 ('liberian', 1),
 ('ligurian', 1),
 ('galician', 1)]

Now I'll code these for W/NW:

In [86]:
# with open('coded_demonyms.txt','w') as f:
#     for key in counted_dems:
#         f.write(key+'\t-1\n')

In [89]:
coded_demonyms = pd.read_csv('coded_demonyms.txt',sep='\t',header=None)
coded_demonyms.columns = ['demonym','label']
coded_demonyms['label'].value_counts()

nw    30
w     24
-1     1
Name: label, dtype: int64

In [96]:
coded_demonyms = coded_demonyms.loc[coded_demonyms['label']!='-1']

In [98]:
dem_to_label = dict(zip(coded_demonyms['demonym'],coded_demonyms['label']))

In [99]:
dem_to_label

{'salvadoran': 'nw',
 'italian': 'w',
 'jamaican': 'nw',
 'thai': 'nw',
 'german': 'w',
 'persian': 'nw',
 'belgian': 'w',
 'japanese': 'nw',
 'russian': 'w',
 'mexican': 'nw',
 'moroccan': 'nw',
 'indian': 'nw',
 'american': 'w',
 'cuban': 'nw',
 'roman': 'w',
 'malaysian': 'nw',
 'colombian': 'nw',
 'peruvian': 'nw',
 'sicilian': 'w',
 'breton': 'w',
 'tunisian': 'nw',
 'hungarian': 'w',
 'javanese': 'nw',
 'brazilian': 'nw',
 'malagasy': 'nw',
 'libyan': 'nw',
 'syrian': 'nw',
 'singaporean': 'nw',
 'haitian': 'nw',
 'indonesian': 'nw',
 'america': 'w',
 'bangladeshi': 'nw',
 'ghanaian': 'nw',
 'genovese': 'w',
 'armenian': 'nw',
 'chilean': 'nw',
 'alaskan': 'w',
 'andalusian': 'w',
 'egyptian': 'nw',
 'iranian': 'nw',
 'baja': 'nw',
 'victorian': 'w',
 'valencian': 'w',
 'algerian': 'nw',
 'bosnian': 'w',
 'majorcan': 'w',
 'macedonian': 'w',
 'bohemian': 'w',
 'ukrainian': 'w',
 'bulgarian': 'w',
 'liberian': 'nw',
 'georgian': 'w',
 'ligurian': 'w',
 'galician': 'w'}

Finally, I'll add a separate column to the df indicating whether it's W/NW based on these explicit indicators.

In [104]:
def get_demonym_label(dem_set):
    labels = [dem_to_label[dem] for dem in dem_set if dem != 'hamburger']
    if len(labels) > 0:
        if len(labels) == 1:
            return labels[0]
        else:
            c = Counter(labels).most_common()
            if len(c) == 1:
                return c[0][0]
            else:
                majority_label = c[0][0]
                if majority_label != c[1][0]:
                    return majority_label
                else:
                    print('Tie')
            return 

df['demonym_label'] = df['demonyms'].apply(get_demonym_label)

In [112]:
df.loc[df['demonym_label']=='nw']['source'].value_counts()/df['source'].value_counts()

ba     0.009335
nyt    0.016200
Name: source, dtype: float64

In [113]:
df.loc[df['demonym_label']=='w']['source'].value_counts()/df['source'].value_counts()

ba     0.005404
nyt    0.008212
Name: source, dtype: float64

### Approach 3: Classifier trained on Wikipedia data

# Develop classifier 

### Get Wikipedia data

Root pages: 
    - W: https://en.wikipedia.org/wiki/European_cuisine
    - NW: https://en.wikipedia.org/wiki/Asian_cuisine
    
For each root page:
    - A. Get all hyperlinks with anchor text matching the pattern "<X> cuisine"
    - B. For each cuisine hyperlink:
        - Grab all items occurring in a list -> individual dishes/ingredients
        - Add to dictionary tracking ingredients/dishes of a cuisine

Part A.

In [37]:
from urllib import request
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
}

root_wiki_pages = {
    'western': 'https://en.wikipedia.org/wiki/European_cuisine',
    'non_western': 'https://en.wikipedia.org/wiki/Asian_cuisine'
}

def get_cuisine_hyperlinks(root_page):
    """
    Retrieves all hyperlinks with anchor text matching the pattern 
    "<X> cuisine" from a wikipedia URL.
    @param root_page: str wiki URL
    """
    
    soup = BeautifulSoup(request.urlopen(
            request.Request(root_page, headers=HEADERS)).read(), 
                         "html.parser")
    main_div = soup.find('div',attrs={'class':'mw-parser-output'})
    uls = main_div.find_all('ul',recursive=False)
    lis = []
    for ul in uls:
        lis.extend(ul.find_all('li'))
    as_ = []
    for li in lis:
        as_.extend(li.find_all('a'))
    hyperlinks_and_anchors = [(a['href'],a['title']) for a in as_ 
                             if 'href' in a.attrs and 
                             'title' in a.attrs]
    cuisine_hyperlinks = [a[0] for a in hyperlinks_and_anchors
                         if a[1].endswith(' cuisine')]

    return ["https://en.wikipedia.org/"+x for x in cuisine_hyperlinks]

cuisine_hyperlinks = {}
for cuisine_type in ['western','non_western']:
    cuisine_hyperlinks[cuisine_type] = get_cuisine_hyperlinks(
        root_wiki_pages[cuisine_type])   

Note: Russian cuisine is categorized as both W and NW; and Jewish cuisine is categorized as NW.

In [153]:
cuisine_hyperlinks['non_western']

['https://en.wikipedia.org//wiki/Russian_cuisine',
 'https://en.wikipedia.org//wiki/Chinese_cuisine',
 'https://en.wikipedia.org//wiki/Taiwanese_cuisine',
 'https://en.wikipedia.org//wiki/Taiwanese_cuisine',
 'https://en.wikipedia.org//wiki/Japanese_cuisine',
 'https://en.wikipedia.org//wiki/North_Korean_cuisine',
 'https://en.wikipedia.org//wiki/South_Korean_cuisine',
 'https://en.wikipedia.org//wiki/Singaporean_cuisine',
 'https://en.wikipedia.org//wiki/Mongolian_cuisine',
 'https://en.wikipedia.org//wiki/Indian_cuisine',
 'https://en.wikipedia.org//wiki/Pakistani_cuisine',
 'https://en.wikipedia.org//wiki/Bangladeshi_cuisine',
 'https://en.wikipedia.org//wiki/Nepali_cuisine',
 'https://en.wikipedia.org//wiki/Thai_cuisine',
 'https://en.wikipedia.org//wiki/Malaysian_cuisine',
 'https://en.wikipedia.org//wiki/Singaporean_cuisine',
 'https://en.wikipedia.org//wiki/Filipino_cuisine',
 'https://en.wikipedia.org//wiki/Vietnamese_cuisine',
 'https://en.wikipedia.org//wiki/Indonesian_cuisin

In [163]:
set(cuisine_hyperlinks['western']).intersection(
set(cuisine_hyperlinks['non_western']))

{'https://en.wikipedia.org//wiki/Russian_cuisine'}

Part B.

In [141]:
def find_footer(el):
    h2s = el.find_all('h2')
    if len(h2s) > 0:
        ix = 0
        text = h2s[ix].text 
        while not (text.startswith('See also') or \
                  text.startswith('Bibliography') or \
                  text.startswith('External links') or \
                  text.startswith('Notes'))\
            and ix < len(h2s)-1:
            ix += 1
            text = h2s[ix].text

        return h2s[ix]
    else:
        return None

def get_cuisine_elements(cuisine_page):
    """
    Retrieves all dishes/ingredients of a cuisine.
    @param cuisine_page: str wiki URL for the cuisine of interest
    """
    
    soup = BeautifulSoup(request.urlopen(
            request.Request(cuisine_page, headers=HEADERS)).read(), 
                         "html.parser")
    main_div = soup.find('div',attrs={'class':'mw-parser-output'})
    
    # Get dishes in tables
    main_tabs = main_div.find_all('table',attrs={'class':'wikitable'})
    print('Found {} tables.'.format(len(main_tabs)))
    dishes = []
    for main_tab in main_tabs:
        trs = main_tab.find_all('tr')
        for tr in trs[1:]: # skip the table header
            td = tr.find('td')
            if td is not None:
                dishes.append(td.text.strip())
    print('Found {} dishes within tables.'.format(len(dishes)))
    
    # Get dishes in lists before 'See also/Bibligraphy/Notes' sections
    footer = find_footer(main_div)
    if footer is not None:
        last_ul_before_footer = footer.find_previous_sibling('ul')
        if last_ul_before_footer is not None:
            uls = main_div.find_all('ul',recursive=False)
            print("Found {} lists.".format(len(uls)))
            filtered_uls = []
            ix_ul = 0
            ul = uls[ix_ul]
            while ul is not last_ul_before_footer:
                filtered_uls.append(ul)
                ix_ul += 1
                ul = uls[ix_ul]
            filtered_uls.append(uls[ix_ul])

            lis = []
            for ul in filtered_uls:
                lis.extend([li for li in ul.find_all('li')
                           if ('class' not in li.attrs) or 
                           ('caption' not in li['class'][0])])

            # get only italicized portion if present
            for li in lis:
                try:
                    dish = li.find('i').text.strip()
                except AttributeError:
                    dish = li.text.strip()
                dishes.append(dish)  
        else:
            print("Found no lists before footer sections.")
    else:
        print("No footer found!")
    
    cleaned_dishes = [d.split(' - ')[0].split(': ')[0].split(' – ')[0]\
            .split(' is ')[0].split(', ')[0].split('\n')
            for d in dishes]
    
    return [item for sublist in cleaned_dishes for item in sublist
           if len(item) > 0]

def get_dishes(cuisine_url):
    """
    Wrapper function that gets dishes from 'X Cuisine' and 
    'List of X dishes' Wiki pages.
    @param cuisine_url: str Wikipedia url for the 'X Cuisine' page
    """
    url_cuisine = cuisine_url.split('/')[-1].split('_')[0] 
    print('Getting dishes for {} cuisine'.format(url_cuisine))
    out1 = set(get_cuisine_elements(cuisine_url))
    url2 = "https://en.wikipedia.org/wiki/List_of_{}_dishes".format(
                url_cuisine)
    try:
        out2 = set(get_cuisine_elements(url2))
    except HTTPError:
        out2 = set()
    
    print('Found {} dishes in List page.'.format(len(out2)))
    return list(out1 | out2)

In [155]:
#cuisine_elements = defaultdict(dict)
# for cuisine_type in ['non_western']:#['western','non_western']:
#     for cuisine_hyperlink in cuisine_hyperlinks[cuisine_type]:
#         cuisine_elements[cuisine_type][cuisine_hyperlink] = \
#             get_dishes(cuisine_hyperlink)

In [148]:
all_W_els = [item for sublist in cuisine_elements['western'].values() 
             for item in sublist]
all_W_els_set = set(all_W_els)

all_NW_els = [item for sublist in cuisine_elements['non_western'].values() 
             for item in sublist]
all_NW_els_set = set(all_NW_els)

len(all_W_els),len(all_W_els_set),len(all_NW_els),len(all_NW_els_set)

(5651, 5241, 3179, 3052)

In [158]:
# with open('./data/cuisine_labeled/wiki_western.txt','w') as f:
#     for el in all_W_els_set:
#         f.write(el+'\n')
        
# with open('./data/cuisine_labeled/wiki_nonwestern.txt','w') as f:
#     for el in all_NW_els_set:
#         f.write(el+'\n')

### Qualitative exploration

In [129]:
with open('./data/cuisine_labeled/wiki_western.txt','r') as f:
    W_els = f.read().splitlines()
with open('./data/cuisine_labeled/wiki_nonwestern.txt','r') as f:
    NW_els = f.read().splitlines()
len(W_els),len(NW_els)

(5241, 3052)

Some observations:
- Some imbalance in data available
- Some dishes have rather long names; some also have descriptions attached that were not cleaned through rule-based heuristics (e.g., italics, colons, punctuation).
- There is some overlap (N=113) in W/NW dishes due to some common food terms; Russian cuisine seems to be in both.

In [161]:
W_NW_overlap = set(W_els).intersection(set(NW_els))
len(W_NW_overlap)

113

In [160]:
set(W_els).intersection(set(NW_els))

{'Baguette',
 'Baranka',
 'Beef Stroganoff',
 'Beef Stroganoff served with rice',
 'Beer',
 'Bitterballen',
 'Blini',
 'Borodinsky bread',
 'Borscht',
 'Bublik',
 'Buckwheat kasha',
 'Caviar',
 'Caviar butterbrot',
 'Chicken Kiev',
 'Chicken cutlets',
 'Chilled soups based on kvass',
 'Chocolate',
 'Courgette caviar',
 'Crêpes',
 'Dressed Herring',
 'Dressed herring (Seld pod shuboi)',
 'Eastern European style sauerkraut',
 'Escabeche',
 'Firni',
 'Fish soups such as ukha.',
 'Fruits',
 'Golubtsy',
 'Grain- and vegetable-based soups.',
 'Guriev porridge',
 'Hutspot',
 'Ice cream',
 'Julienne',
 'Kalach',
 'Karavai',
 'Karelsky pirog',
 'Kasha',
 'Kasha served with jam',
 'Kasha with milk',
 'Kholodets',
 'Khren',
 'Khrenovina',
 'Kissel',
 'Kissel served with bananas and grapes',
 'Knish',
 'Kompot',
 'Kulich',
 'Kulyebyaka',
 'Kurnik',
 'Kurnik (pirog)',
 'Kutia',
 'Kvass',
 'Light soups and stews based on water and vegetables',
 'Lor',
 'Makarony po-flotski',
 'Meat',
 'Medovukha',
 

### Train classifier

#### Create train, dev, test sets

In [194]:
import numpy as np
from sklearn.model_selection import train_test_split

X, Y = np.array(W_els+NW_els), np.array(
    ['w']*len(W_els)+['nw']*len(NW_els))
X_train, X_dev_test, Y_train, Y_dev_test = train_test_split(
    X, Y, test_size=0.33, random_state=42)
X_dev, X_test, Y_dev, Y_test = train_test_split(
    X_dev_test, Y_dev_test, test_size=0.6, random_state=42)

assert len(X_train) == len(Y_train)
assert len(X_dev) == len(Y_dev)
assert len(X_test) == len(Y_test)

print("Created tr/de/te splits with {}/{}/{} examples.".format(
    len(X_train),len(X_dev),len(X_test)))

print("Writing to files...")
parent_dir = './data/classifier/wiki'
os.mkdir(parent_dir)
pd.DataFrame({'recipe_name': X_train, 'label': Y_train}).to_csv(
    parent_dir+'/train.tsv',sep='\t',header=True,index=False)
pd.DataFrame({'recipe_name': X_dev, 'label': Y_dev}).to_csv(
    parent_dir+'/dev.tsv',sep='\t',header=True,index=False)
pd.DataFrame({'recipe_name': X_test, 'label': Y_test}).to_csv(
    parent_dir+'/test.tsv',sep='\t',header=True,index=False)

Created tr/de/te splits with 5556/1094/1643 examples.
Writing to files...


#### Create gold eval sets

* manually labeled sample of BA/NYT recipes (`gold_manual`)
* `gold_manual`, but filtering to only nouns (`gold_manual_nouns`)
* sample of BA/NYT recipes labeled via explicit demonym mentions (`gold_dem`)

In [148]:
# Create manually labeled OOD test set on BA/NYT recipes
# by randomly sample from recipes w/o a label
gold_manual = df.loc[pd.isnull(df['cuisine_type'])].sample(n=150)
gold_manual['source'].value_counts()

ba     107
nyt     43
Name: source, dtype: int64

In [186]:
# with open('./data/cuisine_labeled/gold_sample.tsv','w') as f:
#     f.write('title\tlabel\n')
#     for title in gold_manual['title'].values:
#         f.write(title+'\t'+str(-1)+'\n')

In [163]:
gold_manual = pd.read_csv('./data/cuisine_labeled/gold_sample.tsv',
                            sep='\t',header=0)
gold_manual.shape

(150, 3)

In [164]:
gold_manual['label'].value_counts()

w     119
nw     31
Name: label, dtype: int64

In [173]:
# Create `gold_manual_nouns`

def get_title_nouns(recipe_url):
    """Looks up the spaCy processed json given a recipe's URL and 
    returns the nouns in the recipe's title"""
    
    recipe_fname = os.path.join('data','preprocessed',recipe_url.split('/')[-1])
    with open(recipe_fname,'r') as f:
        json_obj = json.load(f)
    
    title_nouns = [l for ix_l,l in enumerate(json_obj['title']['lemmas'])
                    if json_obj['title']['pos'][ix_l] == 'NOUN']
    
    return ' '.join(title_nouns)

In [177]:
gold_manual['gold_manual_nouns'] = gold_manual['URL'].apply(lambda x: get_title_nouns(x))

In [178]:
gold_manual.head()

Unnamed: 0,title,label,URL,gold_manual_nouns
0,"Large White Bean, Tuna and Spinach Salad",w,http://cooking.nytimes.com//recipes/1014616-la...,bean tuna spinach salad
1,Smashed and Loaded Crispy Potatoes,w,http://www.bonappetit.com/recipe/smashed-and-l...,potato
2,Margarita,w,http://cooking.nytimes.com//recipes/1016358-ma...,margarita
3,Nita's Crazy Cake,w,http://www.bonappetit.com/recipe/nita-s-crazy-...,cake
4,Warm Apple-Cornmeal Upside-Down Cake,w,http://www.bonappetit.com/recipe/warm-apple-co...,apple cake


In [152]:
# Use explicit demonym indicators as another gold sample 
gold_dem = df.loc[~pd.isnull(df['demonym_label'])][['title','demonym_label']]
gold_dem.reset_index(drop=True,inplace=True)
gold_dem.shape

(367, 2)

In [153]:
gold_dem

Unnamed: 0,title,demonym_label
0,Salvadoran Quesadilla,nw
1,Italian Chopped Salad,w
2,Jamaican Beef Patties,nw
3,Swirled No-Churn Thai Tea Ice Cream,nw
4,Chrissy Teigen’s Thai Soy-Garlic Fried Ribs,nw
...,...,...
362,White Russian,w
363,Italian Mushroom and Celery Salad,w
364,Italian Bread Salad (Panzanella),w
365,South Indian Eggplant Curry,nw


In [179]:
X_gold_manual, Y_gold_manual = np.array(gold_manual['title'].apply(
                                    lambda x: x.lower())), np.array(gold_manual['label'])

X_gold_manual_noun, Y_gold_manual_noun = np.array(gold_manual['gold_manual_nouns'].apply(
                                    lambda x: x.lower())), np.array(gold_manual['label'])

X_gold_dem, Y_gold_dem = np.array(gold_dem['title'].apply(
                                    lambda x: x.lower())), np.array(gold_dem['demonym_label'])

In [186]:
gold_data_dict = {
    'manual': {'X': X_gold_manual, 'Y': Y_gold_manual},
    'manual_nouns': {'X': X_gold_manual_noun, 'Y': Y_gold_manual_noun},
    'dem': {'X': X_gold_dem, 'Y': Y_gold_dem},
}

In [196]:
parent_dir = './data/classifier/GOLD'
os.mkdir(parent_dir)
for key in gold_data_dict:
    pd.DataFrame({'recipe_name': gold_data_dict[key]['X'],
                  'label': gold_data_dict[key]['Y']}).to_csv(parent_dir+'/'+key+'.tsv',
                                                    sep='\t',header=True,index=False)

#### Establish baselines

Baselines:
- guess majority class (W)
- Naive Bayes
- linear SVM
- logistic regression

In [126]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier, LogisticRegression

In [131]:
from sklearn.pipeline import Pipeline

NB_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

SVM_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])

LR_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(random_state=42, 
                               class_weight='balanced')),
])

NB_clf.fit(X_train, Y_train)
SVM_clf.fit(X_train, Y_train)
LR_clf.fit(X_train, Y_train)



Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                                    fit_intercept=True, intercept_scaling

In [191]:
print('************* Test set performance *************')

MC_predicted = np.array(['w']*len(X_test))
print("\tMajority class acc: {0:10}".format(
    round(np.mean(MC_predicted == Y_test),3)))

NB_predicted = NB_clf.predict(X_test)
print("\tNaive Bayes acc: {0:12}".format(
    round(np.mean(NB_predicted == Y_test),3)))

SVM_predicted = SVM_clf.predict(X_test)
print("\tSVM acc: {0:21}".format(
    round(np.mean(SVM_predicted == Y_test),3)))

LR_predicted = LR_clf.predict(X_test)
print("\tLogistic Regression acc: {0:2}".format(
    round(LR_clf.score(X_test, Y_test),3)))

print('\n************ Gold sample (manual) performance ************')

MC_predicted = np.array(['w']*len(gold_data_dict['manual']['X']))
print("\tMajority class acc: {0:10}".format(
    round(np.mean(MC_predicted == gold_data_dict['manual']['Y']),3)))

NB_predicted = NB_clf.predict(gold_data_dict['manual']['X'])
print("\tNaive Bayes acc: {0:12}".format(
    round(np.mean(NB_predicted == gold_data_dict['manual']['Y']),3)))

SVM_predicted = SVM_clf.predict(gold_data_dict['manual']['X'])
print("\tSVM acc: {0:21}".format(
    round(np.mean(SVM_predicted == gold_data_dict['manual']['Y']),3)))

LR_predicted = LR_clf.predict(gold_data_dict['manual']['X'])
print("\tLogistic Regression acc: {0:2}".format(
    round(LR_clf.score(X_gold, gold_data_dict['manual']['Y']),3)))

print('\n************ Gold sample (manual, nouns only) performance ************')

MC_predicted = np.array(['w']*len(gold_data_dict['manual_nouns']['X']))
print("\tMajority class acc: {0:10}".format(
    round(np.mean(MC_predicted == gold_data_dict['manual_nouns']['Y']),3)))

NB_predicted = NB_clf.predict(gold_data_dict['manual_nouns']['X'])
print("\tNaive Bayes acc: {0:12}".format(
    round(np.mean(NB_predicted == gold_data_dict['manual_nouns']['Y']),3)))

SVM_predicted = SVM_clf.predict(gold_data_dict['manual_nouns']['X'])
print("\tSVM acc: {0:21}".format(
    round(np.mean(SVM_predicted == gold_data_dict['manual_nouns']['Y']),3)))

LR_predicted = LR_clf.predict(gold_data_dict['manual_nouns']['X'])
print("\tLogistic Regression acc: {0:2}".format(
    round(LR_clf.score(X_gold, gold_data_dict['manual_nouns']['Y']),3)))

print('\n************ Gold sample (demonym labelled) performance ************')

MC_predicted = np.array(['w']*len(gold_data_dict['dem']['X']))
print("\tMajority class acc: {0:10}".format(
    round(np.mean(MC_predicted == gold_data_dict['dem']['Y']),3)))

NB_predicted = NB_clf.predict(gold_data_dict['dem']['X'])
print("\tNaive Bayes acc: {0:12}".format(
    round(np.mean(NB_predicted == gold_data_dict['dem']['Y']),3)))

SVM_predicted = SVM_clf.predict(gold_data_dict['dem']['X'])
print("\tSVM acc: {0:21}".format(
    round(np.mean(SVM_predicted == gold_data_dict['dem']['Y']),3)))

LR_predicted = LR_clf.predict(gold_data_dict['dem']['X'])
print("\tLogistic Regression acc: {0:2}".format(
    round(LR_clf.score(X_gold_dem, gold_data_dict['dem']['Y']),3)))

************* Test set performance *************
	Majority class acc:      0.644
	Naive Bayes acc:         0.82
	SVM acc:                 0.716
	Logistic Regression acc: 0.813

************ Gold sample (manual) performance ************
	Majority class acc:      0.793
	Naive Bayes acc:         0.82
	SVM acc:                 0.793
	Logistic Regression acc: 0.773

************ Gold sample (manual, nouns only) performance ************
	Majority class acc:      0.793
	Naive Bayes acc:         0.82
	SVM acc:                   0.8
	Logistic Regression acc: 0.773

************ Gold sample (demonym labelled) performance ************
	Majority class acc:      0.349
	Naive Bayes acc:        0.559
	SVM acc:                 0.447
	Logistic Regression acc: 0.629


#### BERT

# Initial analysis

## Q: Which words occur most often in W vs. NW recipes?

In [592]:
import math

def compute_lor(c1, c2, smoothing=True):
    """
    Computes log odds ratios between words in Counted corpus 1 vs. 2.
    @param c1: Counter object of words in corpus 1.
    @param c2: Counter object of words in corpus 2.
    @smoothing: if True, applies +0.5 smoothing to difference between 
                words occurring in one corpus but not the other;
                if False, only looks at words occuring in both corpora
    """
    
    c1_vocab, c2_vocab = set(c1.keys()), set(c2.keys())
    
    if smoothing:
        joint_vocab = c1_vocab.union(c2_vocab)
        c1 = {w: c1[w]+0.5 for w in joint_vocab}
        c2 = {w: c2[w]+0.5 for w in joint_vocab}
    else:
        joint_vocab = c1_vocab.intersection(c2_vocab)
    
    # convert counts to frequencies
    N1, N2 = sum(c1.values()), sum(c2.values())
    f1, f2 = {w: c1[w]/N1 for w in joint_vocab}, {w: c2[w]/N2 
                                              for w in joint_vocab}
    
    # frequencies > odds
    o1, o2 = {w: f1[w]/(1-f1[w]) for w in f1}, {w: f2[w]/(1-f2[w]) 
                                                for w in f2}
    
    # odds > log odds ratios
    lor = {w: math.log(o1[w]/o2[w]) for w in joint_vocab}
    
#     print(c1,c2)
#     print(N1,N2)
#     print(f1,f2)
#     print(o1,o2)
    
    return lor

In [600]:
recipe_words = {'west': defaultdict(list),
                'non_west': defaultdict(list)}

for cuisine_type in ['west','non_west']:
    cuisine_df = df.loc[df['cuisine_type']==cuisine_type]
    #print(cuisine_df.shape)
    for ix_row,row in cuisine_df.iterrows():
        fname = ix_row.split('/')[-1]
        with open('./data/preprocessed/'+fname,'r') as file:
            json_obj = json.load(file)
            
            for field in ['title','description','steps']:
                if field in json_obj:
                    recipe_words[cuisine_type][field].extend(
                        json_obj[field]['lemmas'])

In [643]:
for key in recipe_words:
    for key2 in recipe_words[key]:
        print(key,key2,len(recipe_words[key][key2]))

west title 7686
west description 145368
west steps 398272
non_west title 9572
non_west description 206389
non_west steps 475658


In [605]:
lor = compute_lor(Counter(recipe_words['non_west']['title']), 
                  Counter(recipe_words['west']['title']))

In [606]:
len(lor)

2397

In [610]:
sorted(lor.items(), key=lambda x: x[1], reverse=True)

[('stir', 4.503861836578962),
 ('tofu', 4.180563009554212),
 ('thai', 4.101295073425935),
 ('burger', 3.4728147786587362),
 ('indian', 3.420078034895612),
 ('blueberry', 3.420078034895612),
 ('mango', 3.364415182113205),
 ('vietnamese', 3.364415182113205),
 ('soba', 3.242868347797634),
 ('turkish', 3.1760839976120367),
 ('smoothie', 3.1760839976120367),
 ('soy', 3.1045320665866414),
 ('dal', 3.1045320665866414),
 ('bowl', 3.1045320665866414),
 ('choy', 3.027478067049331),
 ('taco', 2.9746974948300253),
 ('curry', 2.9460309734873875),
 ('bok', 2.9440035083495584),
 ('peanut', 2.8547798006349074),
 ('mexican', 2.8529387890219655),
 ('dumpling', 2.752762397980366),
 ('pho', 2.752762397980366),
 ('asian', 2.752762397980366),
 ('hash', 2.641443839021169),
 ('blini', 2.51618778085223),
 ('quesadilla', 2.51618778085223),
 ('dress', 2.51618778085223),
 ('moroccan', 2.3740899217460094),
 ('japanese', 2.3740899217460094),
 ('five', 2.3729940306290582),
 ('tamale', 2.3729940306290582),
 ('thigh',

In [611]:
sorted(lor.items(), key=lambda x: x[1], reverse=False)

[('matzo', -3.6908227608807125),
 ('provençal', -3.4898138276341766),
 ('gazpacho', -3.329245749244246),
 ('quiche', -3.329245749244246),
 ('sicilian', -3.2381612759350022),
 ('kugel', -3.1379651349732582),
 ('au', -3.1379651349732582),
 ('sardine', -3.026626830154166),
 ('fava', -3.026626830154166),
 ('mediterranean', -3.026626830154166),
 ('paella', -3.026626830154166),
 ('irish', -3.026626830154166),
 ('biscotti', -2.901351030184325),
 ('galette', -2.901351030184325),
 ('lewis', -2.7581375422179906),
 ('pomme', -2.7581375422179906),
 ('penne', -2.7581375422179906),
 ('way', -2.7581375422179906),
 ('edna', -2.7581375422179906),
 ('romesco', -2.7581375422179906),
 ('clafoutis', -2.7581375422179906),
 ('cassoulet', -2.7581375422179906),
 ('rillette', -2.7581375422179906),
 ('gremolata', -2.7581375422179906),
 ('risotto', -2.6636960331391943),
 ('liver', -2.5909708259164783),
 ('ratatouille', -2.5909708259164783),
 ('greek', -2.392954093540887),
 ('rabbit', -2.3901875115004376),
 ('grav

In [612]:
lor_desc = compute_lor(Counter(recipe_words['non_west']['description']), 
                  Counter(recipe_words['west']['description']))
len(lor_desc)

11982

In [613]:
sorted(lor_desc.items(), key=lambda x: x[1], reverse=True)

[('mango', 4.334354845285789),
 ('filipino', 3.8661244400967356),
 ('dal', 3.8661244400967356),
 ('tamarind', 3.668736750632716),
 ('chaat', 3.511532330683479),
 ('ghee', 3.3249275390310467),
 ('dosa', 3.2722790965830564),
 ('soba', 3.2722790965830564),
 ('pho', 3.2722790965830564),
 ('nigerian', 3.2722790965830564),
 ('choy', 3.216704536487851),
 ('soy', 3.1702099988714756),
 ('bok', 3.157859327546698),
 ('haitian', 3.095334261669317),
 ('thai', 3.04260167041034),
 ('dashi', 3.0286381782967724),
 ('charcoal', 2.9571745054629277),
 ('indian', 2.9373543593457874),
 ('mochi', 2.8802087554972733),
 ('hash', 2.8802087554972733),
 ('raman', 2.8802087554972733),
 ('kombu', 2.8802087554972733),
 ('understand', 2.796822437750869),
 ('philippine', 2.796822437750869),
 ('ramadan', 2.796822437750869),
 ('din', 2.796822437750869),
 ('cashews', 2.796822437750869),
 ('jamaica', 2.796822437750869),
 ('mirin', 2.796822437750869),
 ('thailand', 2.796822437750869),
 ('roti', 2.796822437750869),
 ('haban

In [614]:
sorted(lor_desc.items(), key=lambda x: x[1], reverse=False)

[('provençal', -4.8049158263371625),
 ('kugel', -4.189023143960974),
 ('ratatouille', -3.772809885453966),
 ('trahana', -3.7061119034915024),
 ('biscotti', -3.7061119034915024),
 ('edna', -3.6346463320892246),
 ('sicily', -3.6346463320892246),
 ('spaetzle', -3.6346463320892246),
 ('tapas', -3.6346463320892246),
 ('watson', -3.474290467304752),
 ('babka', -3.474290467304752),
 ('trifle', -3.383312081809864),
 ('pimentón', -3.2832220160073757),
 ('provence', -3.2476181619407662),
 ('quiche', -3.24043310212017),
 ('quealy', -3.1719897736953016),
 ('burrata', -3.1719897736953016),
 ('orecchiette', -3.1719897736953016),
 ('apulia', -3.1719897736953016),
 ('pâte', -3.1719897736953016),
 ('madeleine', -3.1719897736953016),
 ('antipasto', -3.1719897736953016),
 ('combo', -3.1719897736953016),
 ('clafoutis', -3.1719897736953016),
 ('paella', -3.0904061467294004),
 ('rillette', -3.0469144269765835),
 ('bud', -3.0468200235831002),
 ('homey', -3.0468200235831002),
 ('caponata', -3.0468200235831002

In [644]:
lor_steps = compute_lor(Counter(recipe_words['non_west']['steps']), 
                  Counter(recipe_words['west']['steps']))
len(lor_steps)

5756

In [645]:
sorted(lor_steps.items(), key=lambda x: x[1], reverse=True)

[('dal', 4.41882602858806),
 ('choy', 4.398415067035915),
 ('bok', 4.398415067035915),
 ('kombu', 4.086356796999109),
 ('mango', 4.028365359633544),
 ('tofu', 3.8569836722305393),
 ('blueberry', 3.830993386434223),
 ('marshmallow', 3.7554816542807417),
 ('horizontal', 3.6737994436386074),
 ('sake', 3.6303122418914864),
 ('ghee', 3.4524938452272487),
 ('husk', 3.320211247080631),
 ('mirin', 3.320211247080631),
 ('habanero', 3.3201447748350645),
 ('gyoza', 3.3201447748350645),
 ('tamale', 3.2576223280769185),
 ('tamarind', 3.190986976140511),
 ('grass', 3.190928863805802),
 ('fenugreek', 3.190928863805802),
 ('onigiri', 3.190928863805802),
 ('cupcake', 3.190928863805802),
 ('dosa', 3.1194678100555793),
 ('scotch', 3.1194678100555793),
 ('obe', 3.1194678100555793),
 ('ata', 3.1194678100555793),
 ('unfurl', 3.1194678100555793),
 ('cashew', 3.0425046791557406),
 ('thai', 3.0425046791557406),
 ('wall', 3.0425046791557406),
 ('compress', 3.0425046791557406),
 ('wafer', 3.0425046791557406),
 (

In [646]:
sorted(lor_desc.items(), key=lambda x: x[1], reverse=False)
# homey, definitive, underrated, dynamic, insanely, modest, 

[('provençal', -4.8049158263371625),
 ('kugel', -4.189023143960974),
 ('ratatouille', -3.772809885453966),
 ('trahana', -3.7061119034915024),
 ('biscotti', -3.7061119034915024),
 ('edna', -3.6346463320892246),
 ('sicily', -3.6346463320892246),
 ('spaetzle', -3.6346463320892246),
 ('tapas', -3.6346463320892246),
 ('watson', -3.474290467304752),
 ('babka', -3.474290467304752),
 ('trifle', -3.383312081809864),
 ('pimentón', -3.2832220160073757),
 ('provence', -3.2476181619407662),
 ('quiche', -3.24043310212017),
 ('quealy', -3.1719897736953016),
 ('burrata', -3.1719897736953016),
 ('orecchiette', -3.1719897736953016),
 ('apulia', -3.1719897736953016),
 ('pâte', -3.1719897736953016),
 ('madeleine', -3.1719897736953016),
 ('antipasto', -3.1719897736953016),
 ('combo', -3.1719897736953016),
 ('clafoutis', -3.1719897736953016),
 ('paella', -3.0904061467294004),
 ('rillette', -3.0469144269765835),
 ('bud', -3.0468200235831002),
 ('homey', -3.0468200235831002),
 ('caponata', -3.0468200235831002

In [616]:
lor['easy']

-0.7805638808004122

In [618]:
lor['weeknight']

1.4171110455639093

In [620]:
lor['simple']

0.04392395293215739

In [625]:
lor['spicy']

0.6021711164252307

In [627]:
lor['goddess']

1.7536761542539143

In [647]:
lor['humble']

KeyError: 'humble'

In [617]:
lor_desc['easy']

-0.07758974599004735

In [619]:
lor_desc['weeknight']

0.15290602051926946

In [621]:
lor_desc['simple']

0.06363119326268343

In [622]:
lor_desc['substitute']

0.1234448266003417

In [623]:
lor_desc['substitution']

0.28032406583492797

In [624]:
lor_desc['spicy']

0.8911380367621159

In [628]:
lor_desc['goddess']

1.6072007012164082

In [631]:
lor_desc['travel']

-0.12742024933952922

In [634]:
lor_desc['strange']

-0.3387254718849736

In [635]:
lor_desc['unfamiliar']

0.24906871194428504

In [636]:
lor_desc['comfort']

-0.11808106424902055

In [648]:
lor_desc['humble']

-1.638093438219648

In [637]:
lor_desc['familiar']

0.17211230550998355

In [638]:
lor_desc['tasty']

-0.05914364326770409

In [639]:
lor_desc['delicious']

-0.11155703037071332

In [642]:
lor_desc['fatty']

-0.14767600466069639

## Q: Do modifiers like "simple", "easy", "weeknight", "quick" tend to be applied to NW dishes?

In [352]:
from collections import Counter

In [387]:
def collect_pos(pos_label, field='title', lemmas=True):
    matches = []
    fnames = glob.glob('./data/preprocessed/*')
    
    for ix_fname, fname in enumerate(fnames):
        with open(fname, 'r') as file:
            json_obj = json.load(file)
            pos_match_ixs = [pos_ix for pos_ix,pos in 
                             enumerate(json_obj[field]['pos'])
                            if pos==pos_label]
            if lemmas:
                matches.extend(
                    [json_obj[field]['lemmas'][pos_match_ix]
                    for pos_match_ix in pos_match_ixs])
            else:
                matches.extend(
                    [json_obj[field]['tokens'][pos_match_ix]
                     for pos_match_ix in pos_match_ixs])
        
        if ix_fname % 500 == 0:
            print('Searched through {} files out of {} total.'.format(
                ix_fname, len(fnames)))
    
    return matches

def collect_modifiees(adj_set, field='title', lemmas=True):
    """
    Retrieves nouns modified by a set of adjectives within recipes
    @param adj_set: set of adjectives
    @param field: str part of a recipe to look for modified nouns 
                  possible values: {'title','description','steps'}
    @param lemmas: boolean for whether to return lemmas or original tokens
    """
    matches = []
    fnames = glob.glob('./data/preprocessed/*')
    
    for ix_fname, fname in enumerate(fnames):
        with open(fname, 'r') as file:
            json_obj = json.load(file)
            
            if field in json_obj:
            
                # find ixs of lemmas in adj_set with 'amod' dep. label
                match_ixs = [ix for ix,dep in 
                                 enumerate(json_obj[field]['dep_label'])
                                if dep=='amod' and \
                             json_obj[field]['lemmas'][ix] in adj_set]

                # return heads of those ixs
                matches.extend(
                        [json_obj[field]['head'][match_ix]
                        for match_ix in match_ixs])
        
        if ix_fname % 500 == 0:
            print('Searched through {} files out of {} total.'.format(
                ix_fname, len(fnames)))
    
    return matches

In [373]:
fnames = glob.glob('./data/preprocessed/*')
with open(fnames[0], 'r') as file:
    json_obj = json.load(file)

In [370]:
test = "This recipe is easy."
doc = nlp(test)
for tok in doc:
    print(tok.text,tok.pos_,tok.dep_,tok.head)

This DET det recipe
recipe NOUN nsubj is
is VERB ROOT is
easy ADJ acomp is
. PUNCT punct is


In [388]:
# title_easy_heads = collect_modifiees(set(
#     ['simple','easy','quick','weeknight']), field='title')

description_easy_heads = collect_modifiees(set(
    ['simple','easy','quick','weeknight']), field='description')

Searched through 0 files out of 19174 total.
Searched through 500 files out of 19174 total.
Searched through 1000 files out of 19174 total.
Searched through 1500 files out of 19174 total.
Searched through 2000 files out of 19174 total.
Searched through 2500 files out of 19174 total.
Searched through 3000 files out of 19174 total.
Searched through 3500 files out of 19174 total.
Searched through 4000 files out of 19174 total.
Searched through 4500 files out of 19174 total.
Searched through 5000 files out of 19174 total.
Searched through 5500 files out of 19174 total.
Searched through 6000 files out of 19174 total.
Searched through 6500 files out of 19174 total.
Searched through 7000 files out of 19174 total.
Searched through 7500 files out of 19174 total.
Searched through 8000 files out of 19174 total.
Searched through 8500 files out of 19174 total.
Searched through 9000 files out of 19174 total.
Searched through 9500 files out of 19174 total.
Searched through 10000 files out of 19174 to

In [389]:
len(title_easy_heads),len(description_easy_heads)

(163, 1662)

In [369]:
sorted(Counter(title_easy_heads).items(), 
       key=lambda x: x[1], reverse=True)

[('sauce', 15),
 ('bread', 9),
 ('soup', 9),
 ('syrup', 6),
 ('chicken', 5),
 ('pickles', 5),
 ('cake', 4),
 ('pudding', 3),
 ('tomatillo', 3),
 ('dressing', 3),
 ('kimchi', 3),
 ('stock', 2),
 ('gratin', 2),
 ('turkey', 2),
 ('steak', 2),
 ('cream', 2),
 ('tart', 2),
 ('lamb', 2),
 ('aioli', 2),
 ('mousse', 2),
 ('ribs', 2),
 ('fish', 1),
 ('grilled', 1),
 ('confit', 1),
 ('fudge', 1),
 ('bouillabaisse', 1),
 ('chops', 1),
 ('ginger', 1),
 ('potatoes', 1),
 ('fries', 1),
 ('asparagus', 1),
 ('quesadilla', 1),
 ('slaw', 1),
 ('kebabs', 1),
 ('pasta', 1),
 ('salad', 1),
 ('clams', 1),
 ('scallops', 1),
 ('flounder', 1),
 ('patties', 1),
 ('yeasted', 1),
 ('mushrooms', 1),
 ('grits', 1),
 ('dough', 1),
 ('beans', 1),
 ('salmon', 1),
 ('shrimp', 1),
 ('peaches', 1),
 ('sardines', 1),
 ('kulfi', 1),
 ('minestrone', 1),
 ('baklava', 1),
 ('yogurt', 1),
 ('rhubarb', 1),
 ('bean', 1),
 ('jam', 1),
 ('amba', 1),
 ('matzo', 1),
 ('broth', 1),
 ('punch', 1),
 ('caramel', 1),
 ('adobo', 1),
 ('gy

In [391]:
sorted(Counter(description_easy_heads).items(), 
       key=lambda x: x[1], reverse=True)

[('dish', 102),
 ('recipe', 77),
 ('sauce', 65),
 ('salad', 57),
 ('way', 48),
 ('meal', 42),
 ('syrup', 41),
 ('dinner', 38),
 ('cooking', 32),
 ('soup', 31),
 ('version', 23),
 ('cake', 23),
 ('work', 21),
 ('—', 20),
 ('meals', 19),
 ('method', 19),
 ('recipes', 19),
 ('pasta', 17),
 ('bread', 15),
 ('dessert', 15),
 ('broth', 15),
 ('dressing', 12),
 ('dishes', 12),
 ('one', 12),
 ('’s', 12),
 ('glaze', 12),
 ('chicken', 11),
 ('vinaigrette', 10),
 ('combination', 10),
 ('ingredients', 10),
 ('fry', 9),
 ('marinade', 9),
 ('stew', 9),
 ('dough', 9),
 ('breads', 9),
 ('technique', 9),
 ('mixture', 8),
 ('pastas', 8),
 ('drink', 8),
 ('stocks', 8),
 ('breakfast', 7),
 ('pleasure', 7),
 ('supper', 7),
 ('appetizer', 7),
 ('dip', 7),
 ('preparation', 7),
 ('tart', 7),
 ('time', 7),
 ('salads', 6),
 ('crust', 6),
 ('form', 6),
 ('stock', 6),
 ('sauté', 6),
 ('process', 6),
 ('steps', 6),
 ('menu', 6),
 ('salsa', 5),
 ('topping', 5),
 ('tacos', 5),
 ('cocktail', 5),
 ('ways', 5),
 ('side

## Most common adjectives occurring in recipe titles

In [350]:
title_adjs = collect_pos('ADJ', field='title')

0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
5500
6000
6500
7000
7500
8000
8500
9000
9500
10000
10500
11000
11500
12000
12500
13000
13500
14000
14500
15000
15500
16000
16500
17000
17500
18000
18500
19000


In [351]:
len(title_adjs)

13788

In [353]:
counted_title_adjs = Counter(title_adjs)

In [354]:
sorted(counted_title_adjs.items(), key=lambda x: x[1], reverse=True)

[('roasted', 828),
 ('red', 501),
 ('green', 472),
 ('sweet', 404),
 ('spicy', 393),
 ('garlic', 307),
 ('black', 267),
 ('fresh', 258),
 ('white', 251),
 ('brown', 198),
 ('eggplant', 177),
 ('cranberry', 169),
 ('parmesan', 161),
 ('rosemary', 151),
 ('hot', 149),
 ('crispy', 146),
 ('broccoli', 141),
 ('orange', 133),
 ('whole', 133),
 ('creamy', 126),
 ('sour', 124),
 ('smoked', 124),
 ('baked', 121),
 ('pecan', 120),
 ('wild', 110),
 ('avocado', 110),
 ('sesame', 108),
 ('warm', 103),
 ('classic', 97),
 ('chili', 94),
 ('almond', 94),
 ('blue', 91),
 ('good', 91),
 ('mixed', 90),
 ('couscous', 84),
 ('seared', 83),
 ('olive', 82),
 ('mango', 77),
 ('chile', 77),
 ('short', 73),
 ('french', 70),
 ('swiss', 69),
 ('chipotle', 68),
 ('thai', 67),
 ('quick', 67),
 ('italian', 65),
 ('pomegranate', 65),
 ('horseradish', 63),
 ('smoky', 60),
 ('simple', 60),
 ('crisp', 59),
 ('free', 59),
 ('new', 59),
 ('crème', 57),
 ('strawberry', 57),
 ('asian', 56),
 ('moroccan', 55),
 ('dark', 52)