# Introduction

In this chapter, the two dataframes we processed in the previous chapters are joined: the `ingredients_df` will be joined onto the `food_df` of the `density_db`.

# Setup

In [1]:
#| default_exp density.food_match

In [2]:
#| export
from pyprojroot import here
root = here()
import sys
sys.path.append(str(root))

In [3]:
#| export
import pandas as pd
import numpy as np
import seaborn as sns

from pyprojroot import here
root = here()

from food_database.utils.join_utils import *

import nltk
from nltk.corpus import wordnet
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

from thefuzz import fuzz

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import torch
from sklearn.metrics.pairwise import cosine_similarity

import re
import json
import pickle
from itertools import groupby

from tqdm import tqdm
tqdm.pandas()

from food_database.utils.utils import *

In [4]:
pd.options.mode.chained_assignment = None  # default='warn'

In [5]:
ingredients_df = pd.read_feather('../data/local/recipe/partial/ingredients/0.feather')
expanded_ingredients_df = pd.read_feather('../data/local/recipe/partial/expanded_ingredients/0.feather', dtype_backend='pyarrow')
food_df = pd.read_feather('../data/local/density/full/food/0.feather')

In [6]:
expanded_ingredients_df

Unnamed: 0_level_0,Unnamed: 1_level_0,name.name.nouns.4,name.name.nouns.3,name.name.nouns.2,name.name.nouns.1,name.name.nouns.0,name.name.others.0,name.name.others.1,name.name.others.2,name.name.others.3,name.description.nouns.5,...,name.description.nouns.3,name.description.nouns.2,name.description.nouns.1,name.description.nouns.0,name.description.others.0,name.description.others.1,name.description.others.2,name.description.others.3,name.description.others.4,name.description.others.5
recipe,ingredient,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
1746116,0,,,,,butter,,,,,,...,,lake,,land,,,,,,
1746116,1,,,,,sugar,,,,,,...,,,,,,,,,,
1746116,2,,,,,egg,,,,,,...,,lake,,land,,,,,,
1746116,3,,,,,vanilla,,,,,,...,,,,,,,,,,
1746116,4,,,,,flour,,,,,,...,,,,,all-purpose,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
931097,9,,,,,onion,red,,,,,...,,,,,,,,,,
931097,10,,,,pepper,bell,red,,,,,...,,,,,,,,,,
931097,11,,,,rice,jasmine,,,,,,...,,,,,,,,,,
931097,12,,,,,chicken,,,,,,...,,broth,,reduced-sodium,,,,,,


In [7]:
#| export 
col_types = ['name.name', 'name.description']
split_types = ['nouns', 'others']

In [8]:
col_types, split_types

(['name.name', 'name.description'], ['nouns', 'others'])

In [9]:
# split each comma-separated description part into its own row
exploded_food_df = food_df.explode('description_list')['description_list'].to_frame('description')
exploded_food_df.head()

Unnamed: 0_level_0,description
fdc_id,Unnamed: 1_level_1
167525,tostada shell
167525,corn
167526,bread
167526,salvadoran sweet cheese quesadilla salvadorena
167527,bread
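
As a minimal sketch of what `explode` does in the cell above, assume a toy frame with the same `description_list` column. The rows below are made up for illustration and are not taken from `food_df`.

```python
import pandas as pd

# Toy stand-in for food_df; only the column used by the explode step is included.
toy_food_df = pd.DataFrame(
    {"description_list": [["tostada shell", "corn"], ["bread"]]},
    index=pd.Index([167525, 167527], name="fdc_id"),
)

# explode() emits one row per list element and repeats the fdc_id index label,
# so a single food can be reached through any of its description parts.
toy_exploded = (
    toy_food_df.explode("description_list")["description_list"]
    .to_frame("description")
)

print(toy_exploded)
print(toy_exploded.index.tolist())
```

Because the index label is repeated, a later `.index.unique()` call is enough to map matched description parts back to whole foods.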


# Fuzzy Search

## Embedding Distance

This was tested, but it wasn't found to be useful in our case. The hope was that it could be used to find synonym/phrase matches, eg. courgette: zucchini, ribeye: cut of beef. However, this wasn't the case: these embeddings were trained more towards understanding meaning in full sentences than in short phrases.

## Levenshtein Distance

We want to compute the Levenshtein distance between two strings in order to catch misspellings.

It would be great if this could be done in a vectorised fashion like embedding distances; however, there isn't a built-in function that does this here. Instead, the distance has to be calculated for each search, which should be okay as the calculation is quite fast.
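
For intuition, here is a minimal pure-Python sketch of the classic Levenshtein distance. It is only an illustration: the notebook itself relies on `fuzz.ratio` from `thefuzz`, which returns a 0 to 100 similarity score rather than a raw edit count, and the function name `levenshtein` below is our own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes or substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("rasberry", "raspberry"))
print(levenshtein("kitten", "sitting"))
```

A misspelling such as `rasberry` is a single edit away from `raspberry`, which is why an edit-distance based score is a good fit for catching typos.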

#### Original Fuzzy Search

In [10]:
#| export
def fuzzy_search(food_df_description, search_word, threshold=90):

    if not search_word:
        return False
    
    # check if full word match of the *words in string*
    if contains_whole_word(food_df_description, search_word):
        return True

    # check Levenshtein-based similarity of the whole *string*
    fuzzy_score = fuzz.ratio(food_df_description, search_word)
    if fuzzy_score >= threshold:
        return True
    
    return False
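
The two-stage check above can be sketched with the standard library only. This is an assumption-laden stand-in: `sketch_contains_whole_word` is a hypothetical replacement for the `contains_whole_word` helper imported from `food_database.utils` (whose real behaviour is not shown here and may differ), and `difflib.SequenceMatcher` is swapped in for `fuzz.ratio`, so scores may differ from those of `thefuzz`.

```python
import re
from difflib import SequenceMatcher


def sketch_contains_whole_word(text: str, word: str) -> bool:
    # Hypothetical stand-in: True when `word` appears in `text` as a whole word.
    return re.search(rf"\b{re.escape(word)}\b", text) is not None


def sketch_fuzzy_search(description: str, search_word: str, threshold: int = 90) -> bool:
    # Empty search terms never match.
    if not search_word:
        return False
    # Stage 1: exact whole-word match inside the description.
    if sketch_contains_whole_word(description, search_word):
        return True
    # Stage 2: similarity of the full strings; difflib's ratio is in [0, 1],
    # so scale it to the 0-100 range used by the threshold.
    score = 100 * SequenceMatcher(None, description, search_word).ratio()
    return score >= threshold


print(sketch_fuzzy_search("tostada shell", "shell"))
print(sketch_fuzzy_search("raspberry", "rasberry"))
print(sketch_fuzzy_search("bread", "flour"))
```

The whole-word stage handles multi-word descriptions, while the similarity stage only helps when the description and the search word are of similar length, as with a one-letter misspelling.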

In [11]:
match_df = exploded_food_df[exploded_food_df['description'].apply(fuzzy_search, args=('rasberry',))]
print(match_df.shape)
food_df.loc[match_df.index.unique()]

(7, 1)


Unnamed: 0_level_0,data_type,description,description_list,description_length,description_list_length,default_word_count,exclusion_word_count,volume_exists,portion_exists
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
167755,sr_legacy_food,"Raspberries, raw","[raspberry, raw]",16,2,1,0,True,False
167756,sr_legacy_food,"Raspberries, canned, red, heavy syrup pack, so...","[raspberry, canned, red, heavy syrup pack, sol...",62,5,0,1,True,False
167757,sr_legacy_food,"Raspberries, frozen, red, sweetened","[raspberry, frozen, red, sweetened]",35,4,0,1,True,True
168209,sr_legacy_food,"Raspberries, frozen, red, unsweetened","[raspberry, frozen, red, unsweetened]",37,4,0,1,True,False
172755,sr_legacy_food,"Danish pastry, fruit, enriched (includes apple...","[danish pastry, fruit, enriched includes apple...",95,8,0,0,False,True
2344775,survey_fndds_food,"Raspberries, raw","[raspberry, raw]",16,2,1,0,True,False
2344776,survey_fndds_food,"Raspberries, frozen","[raspberry, frozen]",19,2,0,1,True,False


# Full Search

Now that this fuzzy matching works, we need to implement the top-level function which selects which values to search for.

In order to make this function reusable, we don't want any dataframe-specific changes (ie. search term orderings) to be applied here. Rather, these should be applied in the preprocessing, whose results are accepted as inputs to this function. The function makes room for this by simply taking a dataframe with the search terms given as columns in an ordered way, ie. the `expanded_ingredients_df` created in the [previous chapter](./01-recipes-db-explore.ipynb).

This function then simply becomes an eliminative fuzzy search done on a dataframe's column.

### Ingredient Transform

Functionality to allow for manual transformations of search terms, to aid the search algorithm with its common mistakes. 

In [12]:
#| export
with open(f'{root}/data/globals/density/ingredient_transforms.json', 'r') as f:
    default_transforms = json.load(f)

In [33]:
#| export
def transform_ingredient(ingredient):
    ingredient = ingredient[ingredient.notnull()]
    for default_instance in default_transforms.values():
        # every key word of the transform must be present in the ingredient
        matches = [key_word in ingredient.values for key_word in default_instance['key']]
        if all(matches):
            # only match if there are no additional ingredient name nouns
            if len(matches) == len(ingredient.index[ingredient.index.str.startswith('name.name.nouns')]):
                transformed = pd.Series(default_instance['value'], dtype='string', name=ingredient.name)
                # keep the non-noun terms (eg. adjectives, description words) unchanged
                transformed = pd.concat([transformed, ingredient[~ingredient.index.str.startswith('name.name.nouns')]])
                transformed = transformed[transformed.notnull()]
                return transformed
    return ingredient

In [34]:
ingredient = expanded_ingredients_df.loc[931097,	10]
ingredient[ingredient.notnull()]

name.name.nouns.1     pepper
name.name.nouns.0       bell
name.name.others.0       red
Name: (931097, 10), dtype: string[pyarrow]

In [15]:
ingredient2 = ingredient.copy(deep=True)
ingredient2['name.name.nouns.2'] = 'lettuce'

In [16]:
transform_ingredient(ingredient)

name.name.nouns.0      sweet
name.name.nouns.1     pepper
name.name.others.0       red
Name: (931097, 10), dtype: object
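
To make the rule format concrete, here is a simplified sketch on plain dicts. The entry in `toy_transforms` is hypothetical: the real `ingredient_transforms.json` is not shown in this chapter, and its shape is only inferred from how `transform_ingredient` reads the `key` and `value` fields and from the output above.

```python
# Hypothetical entry; the real ingredient_transforms.json may look different.
toy_transforms = {
    "bell pepper": {
        "key": ["bell", "pepper"],
        "value": {"name.name.nouns.0": "sweet", "name.name.nouns.1": "pepper"},
    }
}

NOUN_PREFIX = "name.name.nouns"


def sketch_transform(ingredient: dict, transforms: dict = toy_transforms) -> dict:
    # Drop empty slots, mirroring ingredient[ingredient.notnull()].
    ingredient = {col: word for col, word in ingredient.items() if word is not None}
    noun_cols = [col for col in ingredient if col.startswith(NOUN_PREFIX)]
    for rule in transforms.values():
        all_keys_present = all(key in ingredient.values() for key in rule["key"])
        # Only rewrite when the rule accounts for every name noun.
        if all_keys_present and len(rule["key"]) == len(noun_cols):
            rest = {col: word for col, word in ingredient.items() if col not in noun_cols}
            return {**rule["value"], **rest}
    return ingredient


bell_pepper = {
    "name.name.nouns.1": "pepper",
    "name.name.nouns.0": "bell",
    "name.name.others.0": "red",
}
print(sketch_transform(bell_pepper))
print(sketch_transform({**bell_pepper, "name.name.nouns.2": "lettuce"}))
```

The length check is what stops a rule for "bell pepper" from rewriting an ingredient that has an extra noun, which is the case the `ingredient2` test below exercises.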

In [17]:
assert all([value in transform_ingredient(ingredient).values for value in ['sweet', 'pepper', 'red']])  
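# an extra noun ('lettuce') means no transform applies, so only the null slots are dropped
assert list(transform_ingredient(ingredient2)) == list(ingredient2[ingredient2.notnull()])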

## Main Functions

The main matching functions, which are separated into two:

1. Initial Search: Eliminative search of the dataframe with the search terms provided.
2. Search Selection: Selecting the best fit with various search criteria.

In [18]:
ingredient = expanded_ingredients_df.loc[931097, 10]
print(ingredient[ingredient.notnull()])

name.name.nouns.1     pepper
name.name.nouns.0       bell
name.name.others.0       red
Name: (931097, 10), dtype: string[pyarrow]


In [19]:
ingredient = expanded_ingredients_df.loc[2005640,	4	]
ingredient[ingredient.notnull()]

name.name.nouns.0    flour
Name: (2005640, 4), dtype: string[pyarrow]

### Initial Search

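A simple search to compile the food entries that could be a possible match. As written, the first search term that returns any matches sets the candidate pool, and later terms are searched only within that pool.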

In [20]:
#| export
def match_food_df_on_ingredient(ingredient, exploded_food_df, debug=False):

    ingredient = ingredient[ingredient.notnull()]

    matched_food_df = exploded_food_df.copy(deep=True)
    matched_idxs = []

    debug_idxs = {col: {} for col in ingredient.index}

    for search_col, search_word in ingredient.items():
            
            current_match_idxs = list((matched_food_df.index[matched_food_df['description'].apply(fuzzy_search, args=(search_word,))]).unique())
            debug_idxs[search_col] = {'value': search_word, 'size': len(current_match_idxs), 'idxs': {'fuzzy': current_match_idxs, 'current': current_match_idxs}}

            if current_match_idxs: matched_idxs.extend(current_match_idxs)  
            if matched_idxs: matched_food_df = matched_food_df.loc[matched_idxs]
            debug_idxs[search_col]['idxs']['selected'] = matched_idxs

    matched_idxs = list(set(matched_idxs))
    
    if debug:
        return matched_idxs, debug_idxs
    else:
        return matched_idxs
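
The loop above can be mirrored in plain Python to make its behaviour easier to see. This is a simplified sketch of how we read the function, not a replacement for it: `sketch_candidate_pool`, the toy rows and the `whole_word` predicate are all made up for illustration.

```python
def sketch_candidate_pool(search_terms, exploded_rows, matches):
    """exploded_rows: list of (food_id, description) pairs; matches: predicate(description, term)."""
    pool = list(exploded_rows)
    matched_ids = []
    for term in search_terms:
        hits = {food_id for food_id, description in pool if matches(description, term)}
        matched_ids.extend(hits)
        if matched_ids:
            # Narrow the rows searched by the next term to foods already matched.
            keep = set(matched_ids)
            pool = [(food_id, description) for food_id, description in pool if food_id in keep]
    return set(matched_ids)


toy_rows = [
    (1, "pepper"), (1, "sweet"), (1, "red"),
    (2, "pepper"), (2, "black"),
    (3, "onion"), (3, "red"),
]


def whole_word(description, term):
    # Toy predicate standing in for fuzzy_search.
    return term in description.split()


print(sketch_candidate_pool(["pepper", "red"], toy_rows, whole_word))
print(sketch_candidate_pool(["saffron", "red"], toy_rows, whole_word))
```

In the first call the pool is fixed by `pepper`, so food 3 is never considered even though it matches `red`. In the second call `saffron` matches nothing, so `red` gets to define the pool instead.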

In [21]:
food_df[food_df['description'].str.lower().str.contains('flour')].head(35)

Unnamed: 0_level_0,data_type,description,description_list,description_length,description_list_length,default_word_count,exclusion_word_count,volume_exists,portion_exists
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
167535,sr_legacy_food,"Tortillas, ready-to-bake or -fry, flour, shelf...","[tortilla, ready-to-bake -fry, flour, shelf st...",53,4,0,0,False,True
168885,sr_legacy_food,"Rye flour, dark","[rye flour, dark]",15,2,0,0,True,False
168886,sr_legacy_food,"Rye flour, medium","[rye flour, medium]",17,2,0,0,True,False
168887,sr_legacy_food,"Rye flour, light","[rye flour, light]",16,2,0,0,True,False
168888,sr_legacy_food,"Triticale flour, whole-grain","[triticale flour, whole-grain]",28,2,1,0,True,False
168893,sr_legacy_food,"Wheat flour, whole-grain","[wheat flour, whole-grain includes food usda's...",24,2,1,0,True,False
168894,sr_legacy_food,"Wheat flour, white, all-purpose, enriched, ble...","[wheat flour, white, all-purpose, enriched, bl...",51,5,0,0,True,False
168895,sr_legacy_food,"Wheat flour, white, all-purpose, self-rising, ...","[wheat flour, white, all-purpose, self-rising,...",54,5,0,0,True,False
168896,sr_legacy_food,"Wheat flour, white, bread, enriched","[wheat flour, white, bread, enriched]",35,4,0,0,True,False
168898,sr_legacy_food,"Rice flour, brown","[rice flour, brown]",17,2,0,0,True,False


In [22]:
match_idxs, debug_idxs = match_food_df_on_ingredient(ingredient, exploded_food_df, True)
match_df = food_df.loc[match_idxs]
match_df

Unnamed: 0_level_0,data_type,description,description_list,description_length,description_list_length,default_word_count,exclusion_word_count,volume_exists,portion_exists
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2343304,survey_fndds_food,"Tortilla, flour","[tortilla, flour]",15,2,0,0,False,True
2343307,survey_fndds_food,"Taco shell, flour","[taco shell, flour]",17,2,0,0,False,True
169741,sr_legacy_food,"Oat flour, partially debranned","[oat flour, partially debranned]",30,2,0,0,True,False
175037,sr_legacy_food,"Tortillas, ready-to-bake or -fry, flour, refri...","[tortilla, ready-to-bake -fry, flour, refriger...",53,4,0,0,False,True
172435,sr_legacy_food,"Peanut flour, low fat","[peanut flour, low fat]",21,2,0,1,True,False
169748,sr_legacy_food,"Corn flour, whole-grain, white","[corn flour, whole-grain, white]",30,3,1,0,True,False
169749,sr_legacy_food,"Corn flour, yellow, masa, enriched","[corn flour, yellow, masa, enriched]",34,4,0,0,True,False
169754,sr_legacy_food,"Wheat flour, white, all-purpose, enriched, cal...","[wheat flour, white, all-purpose, enriched, ca...",60,5,0,0,True,False
2343066,survey_fndds_food,"Bread, NS as to major flour","[bread, n major flour]",27,2,1,0,False,True
172444,sr_legacy_food,"Soy flour, low-fat","[soy flour, low-fat]",18,2,0,1,True,False


### Search Selection

In [23]:
#| export
with open(f'{root}/data/globals/default_words.json', 'r') as f:
    default_words = json.load(f)['density']

with open(f'{root}/data/globals/exclusion_words.json', 'r') as f:
    exclusion_words = json.load(f)['density']

def calculate_match_stats(food_descriptions, ingredient_values):

    ingredient_noun_values, ingredient_other_values = ingredient_values[:2]
    ingredient_description_values = [*ingredient_values[2], *ingredient_values[3]]

    ingredient_match_position = 99
    description_match_position = 99
    description_match_word_count = 99
    ingredient_nouns_match_count = 0
    ingredient_nouns_whole_match_count = 0
    ingredient_others_match_count = 0
    ingredient_others_whole_match_count = 0
    ingredient_description_match_count = 0
    default_word_count = 0
    exclusion_word_count = 0
    remaining_word_count = 99

    for description_idx, food_description in enumerate(food_descriptions):

        noun_matches = [(ingredient_idx, ingredient_noun_value) for ingredient_idx, ingredient_noun_value in enumerate(ingredient_noun_values)
                if fuzzy_search(food_description, ingredient_noun_value)]
        if noun_matches:
            if description_match_position == 99: 
                description_match_position = description_idx - default_word_count # not including default words in description eg. spice, cinnamon.
                description_match_word_count = len(food_description.split(' '))
            if ingredient_match_position == 99: ingredient_match_position = noun_matches[0][0]
            ingredient_noun_values = [value for i, value in enumerate(ingredient_noun_values) if i not in list(zip(*noun_matches))[0]]
        ingredient_nouns_match_count += len(noun_matches)
        ingredient_nouns_whole_match_count += len([noun_match for noun_match in noun_matches if contains_whole_word(food_description, noun_match[1])])

        description_matches = [(ingredient_idx, ingredient_description_value) for ingredient_idx, ingredient_description_value in enumerate(ingredient_description_values)
                if fuzzy_search(food_description, ingredient_description_value)]
        ingredient_description_match_count += len(description_matches)

        other_matches = [(ingredient_idx, ingredient_other_value) for ingredient_idx, ingredient_other_value in enumerate(ingredient_other_values)
                if fuzzy_search(food_description, ingredient_other_value)]
        ingredient_others_match_count += len(other_matches)
        ingredient_others_whole_match_count += len([other_match for other_match in other_matches if contains_whole_word(food_description, other_match[1])])
        if other_matches: 
            ingredient_other_values = [value for i, value in enumerate(ingredient_other_values) if i not in list(zip(*other_matches))[0]] # removing so it can't match twice
        default_word_count += len([default_word for default_word in default_words if contains_whole_word(food_description, default_word)])
        exclusion_word_count += len([exclusion_word for exclusion_word in exclusion_words if contains_whole_word(food_description, exclusion_word)])


    remaining_word_count = np.sum([len(food_description.split(" ")) for food_description in food_descriptions]) + len(ingredient_other_values) - default_word_count - ingredient_nouns_whole_match_count - ingredient_others_match_count 

    return (
        ingredient_match_position,
        description_match_position,
        description_match_word_count,
        ingredient_nouns_match_count,
        ingredient_nouns_whole_match_count,
        ingredient_others_match_count,
        ingredient_others_whole_match_count,
        ingredient_description_match_count,
        default_word_count,
        exclusion_word_count,
        remaining_word_count
    )

In [24]:
# finding word noun order (adjectives come before the main noun)
ingredient_values = []
for name_type in ['name', 'description']:
    for word_type in ['nouns', 'others']:
        name_cols = [col for col in ingredient.index[ingredient.notnull()] if col.startswith(f'name.{name_type}.{word_type}')]
        name_cols.reverse()
        ingredient_values.append(ingredient[name_cols].values)

print(ingredient_values)
print(match_df.iloc[0]['description_list'])

match_stats = calculate_match_stats(match_df.iloc[0]['description_list'], ingredient_values)
match_stats

[<ArrowExtensionArray>
['flour']
Length: 1, dtype: string[pyarrow], <ArrowExtensionArray>
[]
Length: 0, dtype: string[pyarrow], <ArrowExtensionArray>
[]
Length: 0, dtype: string[pyarrow], <ArrowExtensionArray>
[]
Length: 0, dtype: string[pyarrow]]
['tortilla' 'flour']


(0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1)

In [25]:
#| export
def select_from_matches(match_df, ingredient, sort_order, return_df=False):
    
    ingredient_names = []
    for name_type in ['name', 'description']:
        for word_type in ['nouns', 'others']:
            name_cols = [col for col in ingredient.index[ingredient.notnull()] if col.startswith(f'name.{name_type}.{word_type}')]
            name_cols.reverse()
            ingredient_names.append(ingredient[name_cols].values)

    match_df['ingredient_match_position'], \
    match_df['description_match_position'],\
    match_df['description_match_word_count'],\
    match_df['ingredient_nouns_match_count'],\
    match_df['ingredient_nouns_whole_match_count'],\
    match_df['ingredient_others_match_count'],\
    match_df['ingredient_others_whole_match_count'],\
    match_df['ingredient_description_match_count'],\
    match_df['default_word_count'],\
    match_df['exclusion_word_count'],\
    match_df['description_remaining_word_count'] = zip(*match_df['description_list'].apply(calculate_match_stats, args=(ingredient_names, )))
    
    # ordering dataset by priority
    sort_order = list(zip(*sort_order.items()))
    match_df = match_df.sort_values(
        by=list(sort_order[0]), ascending=list(sort_order[1])
    )

    if return_df:
        result = match_df.reindex(columns=['description_list', *list(sort_order[0])])
    else:
        result = match_df.iloc[0].name if not match_df.empty else pd.NA

    return result

In [26]:
#| export 
sort_order = {
    'ingredient_nouns_whole_match_count': False,
    'ingredient_nouns_match_count': False,
    'exclusion_word_count': True,
    'description_match_position': True,
    'ingredient_match_position': True,
    'description_match_word_count': True,
    'ingredient_others_whole_match_count': False,
    'ingredient_others_match_count': False,
    'ingredient_description_match_count': False,
    'default_word_count': False,
    'description_remaining_word_count': True,
    'description_list_length': True,
    'data_type': True
}
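
As a minimal sketch of how such a priority dict drives the ranking, the toy example below unpacks it into the `by` and `ascending` arguments of `sort_values`, in the same way `select_from_matches` does. The column names are shortened and the values are made up for illustration.

```python
import pandas as pd

# Each key is a column to sort by; the value is the `ascending` flag for that column.
toy_sort_order = {
    "nouns_whole_match_count": False,  # more matched nouns first
    "exclusion_word_count": True,      # fewer exclusion words first
    "description_list_length": True,   # shorter descriptions first
}

toy_matches = pd.DataFrame(
    {
        "nouns_whole_match_count": [1, 1, 0, 1],
        "exclusion_word_count": [1, 0, 0, 0],
        "description_list_length": [2, 4, 1, 2],
    },
    index=pd.Index([101, 102, 103, 104], name="fdc_id"),
)

# Dicts keep insertion order, so the key order is the tie-breaking priority.
columns, ascending = zip(*toy_sort_order.items())
ranked = toy_matches.sort_values(by=list(columns), ascending=list(ascending))

print(ranked)
print(ranked.index[0])
```

Earlier keys dominate later ones: a later column only matters when all earlier columns are tied.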

For a few ingredients (flour, red onion, cheddar) the best-ranked `food_df` matches don't contain density values, so below `volume_exists` is placed first in the sort order.

In [27]:
select_from_matches(match_df, ingredient, {f'volume_exists': False, **sort_order}, True)

Unnamed: 0_level_0,description_list,volume_exists,ingredient_nouns_whole_match_count,ingredient_nouns_match_count,exclusion_word_count,description_match_position,ingredient_match_position,description_match_word_count,ingredient_others_whole_match_count,ingredient_others_match_count,ingredient_description_match_count,default_word_count,description_remaining_word_count,description_list_length,data_type
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
168888,"[triticale flour, whole-grain]",True,1,1,0,0,0,2,0,0,0,1,1,2,sr_legacy_food
170687,"[buckwheat flour, whole-groat]",True,1,1,0,0,0,2,0,0,0,1,1,2,sr_legacy_food
168943,"[sorghum flour, whole-grain]",True,1,1,0,0,0,2,0,0,0,1,1,2,sr_legacy_food
169748,"[corn flour, whole-grain, white]",True,1,1,0,0,0,2,0,0,0,1,2,3,sr_legacy_food
170290,"[corn flour, whole-grain, yellow]",True,1,1,0,0,0,2,0,0,0,1,2,3,sr_legacy_food
174273,"[soy flour, full-fat, raw]",True,1,1,0,0,0,2,0,0,0,1,2,3,sr_legacy_food
173262,"[sorghum flour, refined, unenriched]",True,1,1,0,0,0,2,0,0,0,1,2,3,sr_legacy_food
168913,"[wheat flour, bread, unenriched]",True,1,1,0,0,0,2,0,0,0,1,2,3,sr_legacy_food
169714,"[rice flour, white, unenriched]",True,1,1,0,0,0,2,0,0,0,1,2,3,sr_legacy_food
169761,"[wheat flour, white, all-purpose, unenriched]",True,1,1,0,0,0,2,0,0,0,1,3,4,sr_legacy_food


## Full Function

In [28]:
#| export
def match_ingredient(ingredient, food_df, exploded_food_df):

    unit_type = ingredient['unit_type']
    if unit_type == 'weight': return pd.NA
    ingredient = ingredient.drop('unit_type')
    local_sort_order = {f'{unit_type}_exists': False, **sort_order}

    ingredient = transform_ingredient(ingredient)

    searched_df = food_df.loc[match_food_df_on_ingredient(ingredient, exploded_food_df)]
    if searched_df.empty: return pd.NA

    selected_food_idx = select_from_matches(searched_df, ingredient, local_sort_order)
    
    return selected_food_idx

TODO: I don't like how the `unit_type` is obtained here: it has to be joined onto the ingredient and then removed again inside `match_ingredient`. This is more efficient than doing an index lookup inside the function, but mutating the object like that still seems like bad code.

Overall, though, this is definitely worth doing, as it sets us up with much better and more reliable results.
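
The way the unit type takes priority can be sketched in isolation. The helper name `with_unit_priority` and the shortened sort order are hypothetical; the sketch also assumes that `unit_type` values such as `volume` line up with `food_df` columns such as `volume_exists`.

```python
toy_sort_order = {
    "ingredient_nouns_whole_match_count": False,
    "exclusion_word_count": True,
}


def with_unit_priority(unit_type: str, base_order: dict) -> dict:
    # Dicts preserve insertion order, so the unit-specific flag becomes the
    # first sort key; False means rows where the flag is True are ranked first.
    return {f"{unit_type}_exists": False, **base_order}


print(list(with_unit_priority("volume", toy_sort_order)))
```

Building a new dict per ingredient leaves the shared `sort_order` untouched, so each row can get its own leading key.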

In [36]:
ingredient = expanded_ingredients_df.join(ingredients_df['unit_type']).loc[ingredient.name]
ingredient[ingredient.notnull()]

name.name.nouns.1     pepper
name.name.nouns.0       bell
name.name.others.0       red
unit_type             volume
Name: (931097, 10), dtype: object

In [37]:
match_ingredient(ingredient, food_df, exploded_food_df)

2345321

# Full Dataframe Join

Testing on a sample of the dataframe.

In [38]:
expanded_ingredients_df.shape

(2450, 21)

In [39]:
food_ids = expanded_ingredients_df.join(ingredients_df['unit_type']).progress_apply(match_ingredient, args=(food_df, exploded_food_df), axis=1)

In [40]:
food_ids = food_ids.rename('food_id')

In [41]:
pd.set_option('display.max_rows', None)

In [42]:
results_df = ingredients_df.join(food_ids).join(food_df, on='food_id')[['name.name', 'name.description', 'comment', 'description']]

In [43]:
results_df.loc[1723278,	5]

name.name              yellow pepper
name.description       yellow pepper
comment                         <NA>
description         Pepper, raw, NFS
Name: (1723278, 5), dtype: object

In [44]:
results_df.head(100)

Unnamed: 0_level_0,Unnamed: 1_level_0,name.name,name.description,comment,description
recipe,ingredient,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1746116,0,butter,land lake butter,,"Butter, NFS"
1746116,1,sugar,sugar,,"Sugar, NFS"
1746116,2,egg,land lake egg,(yolks only),"Egg, whole, raw, fresh"
1746116,3,vanilla,vanilla,,"Vanilla extract, imitation, no alcohol"
1746116,4,flour,all-purpose flour,,"Wheat flour, white, all-purpose, unenriched"
1746116,5,caramel,caramel,,"Candy, caramel"
1746116,6,cream,land lake heavy whipping cream,,"Cream, fluid, heavy whipping"
1746116,7,pecan,pecan half,,"Pecans, NFS"
1746116,8,semi-sweet chocolate chip,real semi-sweet chocolate chip,,"Cookie, chocolate chip"
1746116,9,shortening,shortening,,"Shortening, NS as to vegetable or animal"


In [45]:
pd.reset_option('display.max_rows')

# Postprocessing

## NA Values

We want to reduce NA values (non-joins) as much as possible, as a single unmatched ingredient can make a whole recipe unusable.

In [46]:
expanded_ingredients_df.shape, food_ids.shape

((2450, 21), (2450,))

In [47]:
na_expanded_ingredients_df = expanded_ingredients_df[food_ids.isna()]
na_expanded_ingredients_df.shape, expanded_ingredients_df.shape

((216, 21), (2450, 21))

### Thesaurus Synonyms

Some ingredients aren't matched simply because they go by multiple names, eg. aubergine and eggplant. We want to minimise this, which can be done by also searching through synonyms.
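
The idea can be sketched with a hand-written synonym table. Everything here is hypothetical: the notebook's `find_alt_words` helper (imported from `food_database.utils`) is not shown and may work quite differently, for example by querying a thesaurus.

```python
# Hypothetical hand-written synonym table, for illustration only.
TOY_SYNONYMS = {
    "aubergine": ["eggplant"],
    "courgette": ["zucchini"],
}


def sketch_alt_words(word):
    # Return alternative names for a word, or an empty list when none are known.
    if not isinstance(word, str):
        return []
    return TOY_SYNONYMS.get(word.lower(), [])


def sketch_synonym_match(word, known_foods):
    # Try the word itself first, then each synonym, and return the first hit.
    for candidate in [word, *sketch_alt_words(word)]:
        if candidate in known_foods:
            return candidate
    return None


toy_known_foods = {"eggplant", "zucchini", "flour"}
print(sketch_synonym_match("aubergine", toy_known_foods))
print(sketch_synonym_match("sultana", toy_known_foods))
```

Words without any known synonym still fall through as unmatched, so this step can only reduce the number of NA values, not eliminate them.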

In [48]:
#| export
def create_na_synonyms_df(na_expanded_ingredients_df):

    na_synonyms_df = na_expanded_ingredients_df.copy(deep=True)

    for col in na_synonyms_df.columns:
        na_synonyms_df[col] = na_synonyms_df[col].apply(find_alt_words)

    na_synonyms_df = na_synonyms_df.map(lambda x: [] if not isinstance(x, list) else x)

    for col in na_synonyms_df.columns:
        expanded = pd.DataFrame(na_synonyms_df[col].tolist(), index=na_synonyms_df.index)
        expanded.columns = [col + '.' + str(c) for c in expanded.columns]
        na_synonyms_df = na_synonyms_df.join(expanded)
        na_synonyms_df.drop(columns=[col], inplace=True)

    return na_synonyms_df

In [49]:
na_synonyms_df = create_na_synonyms_df(na_expanded_ingredients_df)
na_synonyms_df.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,name.name.nouns.2.0,name.name.nouns.2.1,name.name.nouns.2.2,name.name.nouns.2.3,name.name.nouns.2.4,name.name.nouns.2.5,name.name.nouns.2.6,name.name.nouns.2.7,name.name.nouns.2.8,name.name.nouns.2.9,...,name.description.others.1.0,name.description.others.1.1,name.description.others.1.2,name.description.others.1.3,name.description.others.1.4,name.description.others.1.5,name.description.others.1.6,name.description.others.1.7,name.description.others.1.8,name.description.others.1.9
recipe,ingredient,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
1828339,3,,,,,,,,,,,...,,,,,,,,,,
1828339,6,,,,,,,,,,,...,,,,,,,,,,
1828339,7,,,,,,,,,,,...,,,,,,,,,,
1703,3,,,,,,,,,,,...,,,,,,,,,,
1262123,0,,,,,,,,,,,...,,,,,,,,,,
1262123,4,,,,,,,,,,,...,,,,,,,,,,
767911,0,,,,,,,,,,,...,,,,,,,,,,
1106838,1,,,,,,,,,,,...,,,,,,,,,,
1106838,2,,,,,,,,,,,...,,,,,,,,,,
1042455,5,,,,,,,,,,,...,,,,,,,,,,


In [50]:
na_synonym_food_ids = na_synonyms_df.join(ingredients_df['unit_type']).progress_apply(match_ingredient, axis=1, args=(food_df, exploded_food_df))
na_synonym_food_ids.isna().sum(), na_synonym_food_ids.shape[0]

100%|██████████| 216/216 [00:02<00:00, 99.60it/s] 


(199, 216)

In this run, the synonym search matches only 17 of the 216 previously unmatched ingredients (about 8%), leaving 199 unmatched.

In [51]:
food_ids.fillna(na_synonym_food_ids, inplace=True)
food_ids.isna().sum()

199

### Evaluating

Let's see what the rest of the not-found ingredients are.

In [52]:
expanded_ingredients_df.join(ingredients_df).join(food_ids.rename('food_id')).query('food_id.isnull()')[['name.name', 'name.description']]

Unnamed: 0_level_0,Unnamed: 1_level_0,name.name,name.description
recipe,ingredient,Unnamed: 2_level_1,Unnamed: 3_level_1
1828339,3,long grain brown rice,long grain brown rice
1828339,6,dried apricot,dried apricot
1828339,7,sultana,sultana
1703,3,mincemeat,mincemeat
1262123,0,linguine,linguine
...,...,...,...
599284,1,potato,frozen shredded hash brown potato
1357213,3,cinnamin,cinnamin
2006319,2,raspberry,raspberry
2006319,3,blackberry,blackberry


These are mostly obscure words, so we're okay with not having matched them.

### Handling NA Values

We can either remove the values, or use a default density value (1.0) for them. Here we take the first option and drop every recipe that contains an unmatched ingredient.

In [53]:
ingredients_df = ingredients_df.join(food_ids.rename('food_id'))
ingredients_df = ingredients_df.drop(ingredients_df.index[ingredients_df['food_id'].isna()].get_level_values(0), axis=0, level=0)
ingredients_df = ingredients_df.drop('food_id', axis=1)
ingredients_df.shape

(1406, 10)
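
Both options can be sketched on a toy frame with the same `(recipe, ingredient)` MultiIndex. The ids and density values below are made up for illustration; only the default of 1.0 comes from the text above.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (2, 0), (2, 1)], names=["recipe", "ingredient"]
)
toy_ingredients = pd.DataFrame(
    {"name.name": ["butter", "sultana", "sugar", "egg"]}, index=index
)
# None marks an ingredient that could not be matched to a food.
toy_food_ids = pd.Series([101, None, 102, 103], index=index, name="food_id")

# Option 1: drop every recipe that has at least one unmatched ingredient.
incomplete_recipes = (
    toy_food_ids[toy_food_ids.isna()].index.get_level_values("recipe").unique()
)
complete_only = toy_ingredients.drop(index=incomplete_recipes, level="recipe")

# Option 2: keep every recipe and fall back to a default density of 1.0.
toy_density = pd.Series([0.9, float("nan"), 0.85, 1.03], index=index)
with_default = toy_density.fillna(1.0)

print(complete_only)
print(with_default)
```

Option 1 trades coverage for accuracy, since one unmatched ingredient removes the whole recipe, while option 2 keeps every recipe at the cost of an approximate density.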

# Saving

In [54]:
food_ids.to_frame('food_id').to_feather('../data/local/density/partial/food_ids/0.feather')

In [55]:
from nbdev import nbdev_export 
nbdev_export()

# Misc Investigating

In [56]:
food_df[food_df['description'].str.lower().str.contains('tomato')]

Unnamed: 0_level_0,data_type,description,description_list,description_length,description_list_length,default_word_count,exclusion_word_count,volume_exists,portion_exists
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
167704,sr_legacy_food,"Salad dressing, bacon and tomato","[salad dressing, bacon tomato]",32,2,0,0,True,False
167708,sr_legacy_food,"Tomato and vegetable juice, low sodium","[tomato vegetable juice, low sodium]",38,2,0,1,True,False
168125,sr_legacy_food,"Babyfood, dinner, macaroni, beef and tomato sa...","[babyfood, dinner, macaroni, beef tomato sauce...",58,5,0,0,True,False
168567,sr_legacy_food,"Tomatoes, sun-dried","[tomato, sun-dried]",19,2,0,0,True,True
168962,sr_legacy_food,"Turnover, meat- and cheese-filled, tomato-base...","[turnover, meat- cheese-filled, tomato-based s...",74,5,0,2,False,True
169074,sr_legacy_food,"Tomato sauce, canned, no salt added","[tomato sauce, canned]",35,2,0,1,True,False
169384,sr_legacy_food,"Tomatoes, sun-dried, packed in oil, drained","[tomato, sun-dried, packed oil, drained]",43,4,0,0,True,True
169775,sr_legacy_food,"Turnover, cheese-filled, tomato-based sauce, f...","[turnover, cheese-filled, tomato-based sauce, ...",63,5,0,1,False,True
170050,sr_legacy_food,"Tomatoes, red, ripe, cooked","[tomato, red, ripe, cooked]",27,4,0,1,True,True
170051,sr_legacy_food,"Tomatoes, red, ripe, canned, packed in tomato ...","[tomato, red, ripe, canned, packed tomato juice]",51,5,0,1,True,True


In [57]:
expanded_ingredients_df_full = pd.read_feather('../data/local/recipe/full/expanded_ingredients/7_filtered.feather', dtype_backend='pyarrow')
ingredients_df_full = pd.read_feather('../data/local/recipe/full/ingredients/6_filtered.feather')

In [58]:
ingredient = expanded_ingredients_df_full.loc[1374679,	3	]
ingredient = transform_ingredient(ingredient)
ingredient[ingredient.notnull()]


name.name.nouns.0    spearmint
Name: (1374679, 3), dtype: string

In [59]:
transform_ingredient(ingredient)

name.name.nouns.0    spearmint
Name: (1374679, 3), dtype: string

In [60]:
match_idxs, debug_idxs = match_food_df_on_ingredient(ingredient, exploded_food_df, True)
match_df = food_df.loc[match_idxs]
match_df

Unnamed: 0_level_0,data_type,description,description_list,description_length,description_list_length,default_word_count,exclusion_word_count,volume_exists,portion_exists
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
173475,sr_legacy_food,"Spearmint, fresh","[spearmint, fresh]",16,2,1,0,True,False
172239,sr_legacy_food,"Spearmint, dried","[spearmint, dried]",16,2,0,0,True,False


In [61]:
select_from_matches(match_df, ingredient, sort_order, True)

Unnamed: 0_level_0,description_list,ingredient_nouns_whole_match_count,ingredient_nouns_match_count,exclusion_word_count,description_match_position,ingredient_match_position,description_match_word_count,ingredient_others_whole_match_count,ingredient_others_match_count,ingredient_description_match_count,default_word_count,description_remaining_word_count,description_list_length,data_type
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
173475,"[spearmint, fresh]",1,1,0,0,0,1,0,0,0,1,0,2,sr_legacy_food
172239,"[spearmint, dried]",1,1,0,0,0,1,0,0,0,0,1,2,sr_legacy_food


# To-Do's

For the problems which require specific decision trees for selection, I should add them to a to-do list and figure out how to implement them all together, instead of making a never-ending list of edits that creates an unreadable function.

- Powders/Spices: check the quantity/unit measure; if it is small, look for keywords such as 'ground' and 'spice'.