<center>
    <b>
        <font size="+3">
            Food Pairing and Data Science
        </font>
    </b>
    <br>
    <br>
    Vincent Choo
</center>

# Introduction

Food pairing is...

# Acquiring Data

First, we'll need a database of flavor compounds in each kind of food ingredient.

Several databases exist, such as [FoodDB](http://foodb.ca/), [FlavorNet](http://www.flavornet.org/), and [FlavorDB](https://www.ncbi.nlm.nih.gov/pubmed/29059383), but not all associate foods with the compounds they contain. The one at FlavorDB does, so we scrape our data from the FlavorDB [website](https://cosylab.iiitd.edu.in/flavordb/) to generate [Pandas DataFrames](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.html).

The data at FlavorDB is scattered across JSON files on the site, where each JSON file corresponds to a particular food and the flavor molecules in it. Fortunately, the files each have a numerical id, so we can grab the JSON files by iterating over URLs of the right form.

In [1]:
# import the relevant Python packages
import urllib.request
import json

import numpy as np
import pandas as pd
import math

In [2]:
# JSON files are at addresses of this form
def flavordb_entity_url(x):
    return "https://cosylab.iiitd.edu.in/flavordb/entities_json?id="+str(x)

# translates the JSON file at the specified web address into a dictionary
def get_flavordb_entity(x):
    with urllib.request.urlopen(flavordb_entity_url(x)) as url:
        return json.loads(url.read().decode())
    return None

Then, we convert the JSON files into ``DataFrames``.

In [3]:
# the names of the columns in the raw JSON objects
def flavordb_entity_cols():
    return [
        'entity_id', 'entity_alias_readable', 'entity_alias_synonyms',
        'natural_source_name', 'category_readable', 'molecules'
    ]


# define the names of the columns in the dataframes we want to generate
def flavordb_df_cols():
    return [
        'entity id', 'alias', 'synonyms',
        'scientific name', 'category', 'molecules'
    ]


def molecules_df_cols():
    return ['pubchem id', 'common name', 'flavor profile']    

In [4]:
def clean_flavordb_dataframes(flavor_df, molecules_df):
    strtype = type('')
    settype = type(set())
    
    for k in ['alias', 'scientific name', 'category']:
        flavor_df[k] = [
            elem.lower() if isinstance(elem, strtype) else ''
            for elem in flavor_df[k]
        ]
    
    flavor_df['synonyms'] = [
        elem if isinstance(elem, settype) else (
            set(elem.lower().split(', ') if isinstance(elem, strtype) else [''])
        )
        for elem in flavor_df['synonyms']
    ]
    
    molecules_df['flavor profile'] = [
        set([x.lower() for x in elem])
        for elem in molecules_df['flavor profile']
    ]
    
    return flavor_df, molecules_df

In [5]:
# generate dataframes from some of the JSON objects
def get_flavordb_dataframes(start, end):
    # make intermediate values to make dataframes from
    flavordb_data = []
    molecules_dict = {}
    missing = [] # numbers of the missing JSON files during iteration
    
    flavordb_cols = flavordb_entity_cols()
    
    for i in range(start, end):
        # get the ith food entity, as a JSON dict
        try:
            fdbe = get_flavordb_entity(i + 1)

            # get only the relevant fields (columns) of the dict
            flavordb_series = [fdbe[k] for k in flavordb_cols[:-1]]
            flavordb_series.append(
                set([m['pubchem_id'] for m in fdbe['molecules']])
            )
            flavordb_data.append(flavordb_series)

            # update the molecules database with the data in 'molecules' field
            for m in fdbe['molecules']:
                if m['pubchem_id'] not in molecules_dict:
                    molecules_dict[m['pubchem_id']] = [
                        m['common_name'],
                        set(m['flavor_profile'].split('@'))
                    ]
        except urllib.error.HTTPError as e:
            if e.code == 404: # if the JSON file is missing
                missing.append(i)
            else:
                raise RuntimeError(
                    'Error while fetching JSON object from ' + flavordb_entity_url(x)
                ) from e
            

    # generate the dataframes
    flavordb_df = pd.DataFrame(
        flavordb_data,
        columns=flavordb_df_cols()
    )
    molecules_df = pd.DataFrame(
        [
            [k, v[0], v[1]]
             for k, v in molecules_dict.items()
        ],
        columns=molecules_df_cols()
    )
    
    # clean up the dataframe columns
    flavordb_df, molecules_df = clean_flavordb_dataframes(flavordb_df, molecules_df)
    
    return [flavordb_df, molecules_df, missing]

It takes a while to download all of these JSON files, so make sure to save your download progress!

In [6]:
# updates & saves the download progress of your dataframes
def update_flavordb_dataframes(df0, df1, ranges):
    df0_old = df0
    df1_old = df1
    missing_old = []

    # time how long it took to download the files
    import time
    start = time.time()

    # save the download progress in increments of 50 JSON files
    try:
        # as of today, it looks like FlavorDB has about 1000 distinct entities
        for a, b in ranges:
            df0_new, df1_new, missing_new = get_flavordb_dataframes(a, b)
            
            df0_old = df0_old.append(df0_new, ignore_index=True)
            df1_old = df1_old.append(df1_new, ignore_index=True)
            missing_old.extend(missing_new)
        
        return df0_old, df1_old, missing_old
    except:
        raise # always throw the error so you know what happened
    finally:
        # even if you throw an error, you'll have saved them as csv files
        df0_old.to_csv('flavordb.csv')
        df1_old.to_csv('molecules.csv')

        end = time.time()
        mins = (end - start) / 60.0
        print('Downloading took: '+ str(mins) + ' minutes')

In [None]:
# take new dataframes
df0 = pd.DataFrame(columns=flavordb_df_cols())
df1 = pd.DataFrame(columns=molecules_df_cols())
# fill them with JSON files up to id = 1000
ranges = [(50 * i, 50 * (i + 1)) for i in range(20)]
# update & save the dataframes as csv files
update_flavordb_dataframes(df0, df1, ranges)

While downloading the JSON files, you'll notice that some of them are missing due to ``HTTPError``s. The first time you download, you might, say, notice that 43 entries are missing.

In [7]:
# get the missing entries
def missing_entity_ids(flavor_df):
    out = []
    entity_id_set = set(flavor_df['entity id'])
    for i in range(1, 1 + max(entity_id_set)):
        if i not in entity_id_set:
            out.append(i)
    return out


# loads the dataframes from csv files
def load_db():
    settype = type(set())
    
    df0 = pd.read_csv('flavordb.csv')[flavordb_df_cols()]
    df0['synonyms'] = [eval(x) if isinstance(x, settype) else x for x in df0['synonyms']]
    df0['molecules'] = [eval(x) for x in df0['molecules']]
    
    df1 = pd.read_csv('molecules.csv')[molecules_df_cols()]
    df1['flavor profile'] = [eval(x) for x in df1['flavor profile']]
    
    df0, df1 = clean_flavordb_dataframes(df0, df1)
    
    return df0, df1, missing_entity_ids(df0)

In [8]:
# missing_ids = the missing ids that are less than the max one found
flavor_df, molecules_df, missing_ids = load_db()
flavor_df.to_csv('flavordb.csv')
molecules_df.to_csv('molecules.csv')
flavor_df

Unnamed: 0,entity id,alias,synonyms,scientific name,category,molecules
0,1,bakery products,{bakery products},poacceae,bakery,"{27457, 7976, 31252, 26808, 22201, 26331}"
1,2,bread,{bread},poacceae,bakery,"{1031, 1032, 644104, 527, 8723, 31260, 15394, ..."
2,3,rye bread,{rye bread},rye,bakery,"{644104, 7824, 643731, 8468, 1049, 5372954, 80..."
3,4,wheaten bread,"{soda scones, soda farls}",wheat,bakery,"{6915, 5365891, 12170, 8082, 31251, 7958, 1049..."
4,5,white bread,{white bread},wheat,bakery,"{7361, 994, 7362, 10883, 11173, 5365891, 11559..."
5,6,wholewheat bread,{wholewheat bread},wheat,bakery,"{107905, 8194, 10883, 13187, 5283329, 5283335,..."
6,7,wort,{wort},barley,beverage,"{13187, 9862, 135, 18827, 7824, 61712, 19602, ..."
7,8,arrack,{arak},grape,beverage alcoholic,"{1031, 240, 31249, 6584, 7997}"
8,9,beer,{beer},poacceae,beverage alcoholic,"{229888, 62465, 8194, 8193, 1031, 644104, 5283..."
9,10,bantu beer,"{pombe, millet beer, malwa, kaffir beer, opaqu...",eragrostideae,beverage alcoholic,"{6560, 8038, 7654, 7147, 1068, 14286, 527, 240..."


In [9]:
molecules_df

Unnamed: 0,pubchem id,common name,flavor profile
0,22201,"2,3-Dimethylpyrazine","{peanut, peanut butter, butter, cocoa, leather..."
1,31252,"2,5-Dimethylpyrazine","{medicine, roasted nuts, roast beef, woody, co..."
2,26331,2-Ethylpyrazine,"{peanut, peanut butter, butter, bitter, woody,..."
3,27457,2-Ethyl-3-Methylpyrazine,"{peanut, earthy, roast, hazelnut, corn, potato..."
4,7976,2-Methylpyrazine,"{peanut, chocolate, green, cocoa, popcorn, roa..."
5,26808,"2,3,5-Trimethylpyrazine","{peanut, earthy, roast, hazelnut, cocoa, potat..."
6,323,coumarin,"{sweet, new mown hay, bitter, green, tonka}"
7,7150,Methyl Benzoate,"{sweet, prune, floral, herb, lettuce, cananga,..."
8,11509,3-Hexanone,"{ether, sweet, grape, waxy, fruity, rum}"
9,637566,Geraniol,"{rose, sweet, floral, geranium, citrus, waxy, ..."


In [10]:
print('Missing IDs: ' + str(missing_ids))

Missing IDs: [406, 407, 420, 479, 483, 599, 605, 666, 681, 689, 692, 760, 761, 779, 797, 798, 801, 802, 804, 808, 809, 811, 812, 813, 816, 819, 838, 844, 866, 877, 888, 892, 903, 910, 922, 940, 946, 957, 966, 973, 974, 975, 976]


The missing IDs might be due to a bad internet connection, as opposed to the content actually missing, so redownload them just to be sure.

In [11]:
ranges = [(i-1, i) for i in missing_ids]
flavor_df, molecules_df, missing_ids = update_flavordb_dataframes(flavor_df, molecules_df, ranges)
print('# of missing IDs: ' + str(len(missing_ids)))
print('Missing IDs: ' + str(missing_ids))

Downloading took: 0.8541397333145142 minutes
# of missing IDs: 43
Missing IDs: [405, 406, 419, 478, 482, 598, 604, 665, 680, 688, 691, 759, 760, 778, 796, 797, 800, 801, 803, 807, 808, 810, 811, 812, 815, 818, 837, 843, 865, 876, 887, 891, 902, 909, 921, 939, 945, 956, 965, 972, 973, 974, 975]


Done! Now we have a large database of foods. But how do we know if the database is complete enough? Let's test how many foods FlavorDB knows.

In [13]:
foods = ['caramel', 'urchin', 'liver', 'haggis',
         'blood', 'cheese', 'pawpaw', 'rose',
         'durian', 'squirrel', 'kombu', 'whale',
         'white fish', 'whitefish']

# check if any food matches (or is a substring of) an alias in the database
{f : any([f in alias for alias in flavor_df['alias']])
 for f in foods}

{'caramel': False,
 'urchin': False,
 'liver': False,
 'haggis': False,
 'blood': False,
 'cheese': True,
 'pawpaw': True,
 'rose': True,
 'durian': True,
 'squirrel': True,
 'kombu': True,
 'whale': True,
 'white fish': False,
 'whitefish': True}

Hmmm. This database is not exactly complete. While the database certainly includes some uncommon foods like whale, durian, pawpaw, and rose, it is also missing others such as sea urchin, liver, and blood. In addition, common terms, like "white fish", which refers to several species of fish, are left out entirely ("whitefish" refers to a single species of fish).

Of course, we wouldn't expect this database to have the food compounds of caramel, because even today, the process of caramelization is extremely complex and not well-understood, so complete information on caramel shouldn't be there.

# 