# Introduction

Our ingredients are measured in different units: some by weight, some by volume, and others in various serving quantities. To reach the end goal of concentrations, it is easiest to standardise them all to one unit. In our case we want to convert everything to weight in grams, and to convert non-weight measurements we need the densities of the ingredients.

This section is concerned with finding an appropriate density database and processing it before joining the ingredients database onto it.

# Setup

In [1]:
from pyprojroot import here
root = here()
import sys
sys.path.append(str(root))

In [2]:

import pandas as pd
import numpy as np

import nltk
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

import json
from itertools import groupby
import re
import string

from tqdm import tqdm
tqdm.pandas()

from ast import literal_eval

from recipe_dataset.utils.utils import *
from recipe_dataset.utils.parallel import *

import pickle

from nltk.corpus import wordnet

In [3]:
pd.options.mode.chained_assignment = None  # default='warn'

In [4]:
pd.set_option('max_colwidth', None)

# Datasets Research


There are online volume-weight conversion calculators, which must make use of a density table for foods. Since there's no API endpoint that would let me scrape all of the data, I need to find a database.

### Primary Examples

The largest online calculator https://www.aqua-calc.com/calculate/food-volume-to-weight claims to just use data from the [USDA Food Database](https://fdc.nal.usda.gov/download-datasets.html). However, this database doesn't contain explicit information about density - deriving it would involve another calculation (there's a [paper](https://www.researchgate.net/publication/241098096_Using_database_values_to_determine_food_density) written specifically on this). I have suspicions that it's using some other form of information - one way to check is to compare the reference food name formats with each other.

Another website https://khymos.org/2014/01/23/volume-to-weight-calculator-for-the-kitchen/ has taken data from both http://www.faqs.org/faqs/cooking/faq/ and the USDA Food Database. I'm skeptical about the latter too, because the reference names don't seem to match up and it seems to have a much smaller database. Nevertheless, it provides a simple table of the data which I can use to try it out. I doubt it will contain all the required ingredients, but it is worth a try.

### Khymos Database

We have a simple density table here.

In [5]:
density_df = pd.read_csv('../data/datasets/density/food_densities_khymos/food_densities_khymos.csv', usecols=['food', 'g/ml'])
density_df

Unnamed: 0,food,g/ml
0,"allspice, ground",0.42
1,"almonds, ground",0.36
2,"almonds, sliced",0.39
3,"almonds, whole",0.66
4,anchovies,1.02
...,...,...
270,wheat germ,0.53
271,wild rice,0.61
272,"wine, red",0.99
273,"wine, white",0.99


This is a small database which definitely will not have all of our required ingredients.
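For the ingredients it does cover, a table like this is enough to turn a volume into grams by multiplying by the density. The snippet below is a minimal sketch of that arithmetic: the two density values are taken from the table above, while `TBSP_ML`, `densities_g_per_ml` and `volume_to_grams` are hypothetical names, and the tablespoon is assumed to be a US tablespoon.

```python
# Assumption: a US tablespoon, expressed in millilitres.
TBSP_ML = 14.7868

# Two rows copied from the Khymos table above (g/ml).
densities_g_per_ml = {
    'allspice, ground': 0.42,
    'almonds, whole': 0.66,
}

def volume_to_grams(food, volume_ml, densities=densities_g_per_ml):
    """Convert a volume in millilitres to grams using a density lookup."""
    return volume_ml * densities[food]

# One tablespoon of ground allspice, in grams.
allspice_grams = volume_to_grams('allspice, ground', TBSP_ML)
print(round(allspice_grams, 2))
```

Any ingredient missing from the lookup raises a `KeyError`, which is exactly the coverage problem described above.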

### USDA FoodData Central Database

In [6]:
food_df = pd.read_csv('../data/datasets/density/FoodData_Central_csv_2023-04-20/food.csv', index_col='fdc_id')
food_portion_df = pd.read_csv('../data/datasets/density/FoodData_Central_csv_2023-04-20/food_portion.csv', index_col=['fdc_id', 'id'])

food_df.shape, food_portion_df.shape

((1913739, 4), (47837, 9))

In [7]:
food_df.head()

Unnamed: 0_level_0,data_type,description,food_category_id,publication_date
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1105904,branded_food,WESSON Vegetable Oil 1 GAL,,2020-11-13
1105905,branded_food,SWANSON BROTH BEEF,,2020-11-13
1105906,branded_food,CAMPBELL'S SLOW KETTLE SOUP CLAM CHOWDER,,2020-11-13
1105907,branded_food,CAMPBELL'S SLOW KETTLE SOUP CHEESE BROCCOLI,,2020-11-13
1105898,experimental_food,Discrepancy between the Atwater factor predicted and empirically measured energy values of almonds in human diets,,2020-10-30


In [8]:
food_df['description'] = food_df['description'].fillna('')

In [9]:
food_df[food_df['description'].str.contains('Pepper, raw')]

Unnamed: 0_level_0,data_type,description,food_category_id,publication_date
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2345319,survey_fndds_food,"Pepper, raw, NFS",6420.0,2022-10-28


In [10]:
food_portion_df.loc[food_df[food_df['description'].str.contains('Pepper, raw')].iloc[0].name]

Unnamed: 0_level_0,seq_num,amount,measure_unit_id,portion_description,modifier,gram_weight,data_points,footnote,min_year_acquired
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
286043,1.0,,9999,1 piece,61667,10.0,,,
286044,2.0,,9999,1 ring,61820,10.0,,,
286045,3.0,,9999,1 miniature,61467,30.0,,,
286046,4.0,,9999,1 regular,62862,120.0,,,
286047,5.0,,9999,1 cup,10205,150.0,,,
286048,6.0,,9999,Quantity not specified,90000,20.0,,,


### Decision

The USDA database will be used, as it undoubtedly has more foods to match against. Its vast selection brings its own challenges, so it might seem best to try the cleaner Khymos database first. However, because we would need the USDA database for the missing foods anyway, we will begin with it.

# USDA Dataset

This, on the other hand, is a massive database. It contains far more ingredients than we need, which is its own challenge. We will need to filter the database as much as possible, so that it can be searched not only efficiently but also accurately.

As well as searching, there is another challenge: it doesn't provide explicit density information. What it has instead is a set of portions for each food, with a measured gram weight for each. It's simple enough to get the density from these, by taking a volumetric portion measure and dividing its gram weight by its volume. We can also use this information to get the gram weight for other measures, for example one *whole* pepper or a coffee serving.

This means that we need a way of selecting not only the right food from `food_df`, but also the right portion from its corresponding `food_portion_df` entries.
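As a rough illustration of the density calculation, the sketch below uses the "1 cup" and "1 regular" portions of "Pepper, raw, NFS" shown earlier (150 g and 120 g). The names `CUP_ML` and `density_g_per_ml` are hypothetical, and the cup is assumed to be a US customary cup; the USDA files themselves do not state the volume in millilitres.

```python
# Assumption: '1 cup' means a US customary cup, in millilitres.
CUP_ML = 236.588

def density_g_per_ml(gram_weight, volume_ml):
    """Density implied by a volumetric portion and its measured gram weight."""
    return gram_weight / volume_ml

# '1 cup' of raw pepper weighs 150 g in the portion table.
pepper_density = density_g_per_ml(150.0, CUP_ML)
print(round(pepper_density, 3))

# Non-volumetric portions need no density at all:
# two 'regular' peppers at 120 g each.
two_peppers_grams = 2 * 120.0
```

The second calculation is why picking the right portion row matters as much as picking the right food: a count-based measure maps straight to grams.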

## Other Database Fields 

In [11]:
food_category_wwei_df = pd.read_csv('../data/datasets/density/FoodData_Central_csv_2023-04-20/wweia_food_category.csv')
food_category_wwei_df[food_category_wwei_df['wweia_food_category_description'].str.contains("beef")]

Unnamed: 0,wweia_food_category,wweia_food_category_description
15,2004,Ground beef


In [12]:
food_portion_df.loc[2341335]

Unnamed: 0_level_0,seq_num,amount,measure_unit_id,portion_description,modifier,gram_weight,data_points,footnote,min_year_acquired
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
270013,1.0,,9999,"1 cup, cooked, diced",10049,135.0,,,
270014,2.0,,9999,1 piece,61667,60.0,,,
270015,3.0,,9999,1 slice,61935,60.0,,,
270016,4.0,,9999,"1 oz, cooked",40040,28.35,,,
270017,16.0,,9999,Quantity not specified,90000,85.0,,,


In [13]:
food_df[food_df['food_category_id'] == 2202].head(20)

Unnamed: 0_level_0,data_type,description,food_category_id,publication_date
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2341335,survey_fndds_food,"Chicken, NS as to part and cooking method, NS as to skin eaten",2202.0,2022-10-28
2341336,survey_fndds_food,"Chicken, NS as to part and cooking method, skin eaten",2202.0,2022-10-28
2341337,survey_fndds_food,"Chicken, NS as to part and cooking method, skin not eaten",2202.0,2022-10-28
2341338,survey_fndds_food,"Chicken, NS as to part, baked, broiled, or roasted, NS as to skin eaten",2202.0,2022-10-28
2341339,survey_fndds_food,"Chicken, NS as to part, baked, broiled, or roasted, skin eaten",2202.0,2022-10-28
2341340,survey_fndds_food,"Chicken, NS as to part, baked, broiled, or roasted, skin not eaten",2202.0,2022-10-28
2341341,survey_fndds_food,"Chicken, NS as to part, rotisserie, NS as to skin eaten",2202.0,2022-10-28
2341342,survey_fndds_food,"Chicken, NS as to part, rotisserie, skin eaten",2202.0,2022-10-28
2341343,survey_fndds_food,"Chicken, NS as to part, rotisserie, skin not eaten",2202.0,2022-10-28
2341344,survey_fndds_food,"Chicken, NS as to part, stewed, NS as to skin eaten",2202.0,2022-10-28


### Food Category

Just an idea: we could use the food category to help with matching against the reference foods.

# Processing Food DF

In [14]:
food_df_full = food_df.copy(deep=True)

## Dtypes

In [15]:
food_df = food_df.convert_dtypes()
food_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1913739 entries, 1105904 to 2554914
Data columns (total 4 columns):
 #   Column            Dtype 
---  ------            ----- 
 0   data_type         string
 1   description       string
 2   food_category_id  Int64 
 3   publication_date  string
dtypes: Int64(1), string(3)
memory usage: 74.8 MB


## Lemmatization

Reducing each word to its base form (lemma).

## Filtering Ingredients

We are only interested in raw, whole foods, so we will try to filter the dataset down to those.

### Food `data_type`

Many of the unnecessary non-whole ingredients can be filtered out through their `data_type` category.

In [16]:
food_df.value_counts('data_type')

data_type
branded_food                1845297
sub_sample_food               44537
sr_legacy_food                 7793
market_acquistion              6402
survey_fndds_food              5624
sample_food                    2904
agricultural_acquisition        810
foundation_food                 307
experimental_food                65
Name: count, dtype: int64

In [17]:
food_df = food_df[~food_df['data_type'].str.strip().isin(['branded_food','sub_sample_food', 'market_acquistion', 'agricultural_acquisition'])]
food_df.shape

(16693, 4)

In [18]:
# ordering categories according to importance (to aid with searching)
# note: remaining data types not listed here (sample_food, experimental_food) become NaN
food_df['data_type'] = pd.Categorical(food_df['data_type'], categories=['foundation_food', 'survey_fndds_food', 'sr_legacy_food'], ordered=True)
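To show how an ordered categorical can aid searching, here is a small sketch on toy data: sorting by the column follows the category order rather than alphabetical order, so the preferred source comes first. The toy descriptions and the names `priority`, `toy`, `ranked` and `best` are illustrative assumptions, not part of the pipeline.

```python
import pandas as pd

# Same priority order as in the cell above.
priority = ['foundation_food', 'survey_fndds_food', 'sr_legacy_food']

# Toy candidates for one search term (illustrative values only).
toy = pd.DataFrame({
    'description': ['Flour, wheat', 'Wheat flour, white', 'Flour, white'],
    'data_type': ['sr_legacy_food', 'foundation_food', 'survey_fndds_food'],
})
toy['data_type'] = pd.Categorical(toy['data_type'], categories=priority, ordered=True)

# Sorting follows the category order, not alphabetical order.
ranked = toy.sort_values('data_type')
best = ranked.iloc[0]

# Values outside the listed categories are turned into NaN.
unlisted = pd.Series(pd.Categorical(['sample_food'], categories=priority, ordered=True))
print(ranked['data_type'].tolist(), unlisted.isna().all())
```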

## Syncing Food DF & Portion DF Entries

Only the foods with entries in the portion DF will be of use.

In [19]:
food_df = food_df[food_df.index.isin(food_portion_df.reset_index(1).index.unique())]

In [20]:
food_portion_df = food_portion_df[food_portion_df.index.get_level_values(0).isin(food_df.index.unique())]
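The two-way sync can be seen on a toy example. The sketch below mirrors the two cells above on made-up frames (`foods`, `portions` and their values are assumptions for illustration): foods without portions are dropped first, then portions whose food is no longer present.

```python
import pandas as pd

# Toy food table indexed by fdc_id (illustrative values only).
foods = pd.DataFrame(
    {'description': ['Peppers, raw', 'Mangos, raw', 'Wild rice, raw']},
    index=pd.Index([1, 2, 3], name='fdc_id'),
)

# Toy portion table with an (fdc_id, id) MultiIndex, as in food_portion_df.
portions = pd.DataFrame(
    {'gram_weight': [150.0, 10.0, 165.0]},
    index=pd.MultiIndex.from_tuples([(1, 10), (1, 11), (4, 12)], names=['fdc_id', 'id']),
)

# Keep only foods that have at least one portion row.
portion_food_ids = portions.index.get_level_values('fdc_id')
foods_synced = foods[foods.index.isin(portion_food_ids)]

# Keep only portion rows whose food survived the step above.
portions_synced = portions[portion_food_ids.isin(foods_synced.index)]
print(foods_synced.index.tolist(), portions_synced.index.tolist())
```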

## NA's

In [21]:
food_df = food_df[~food_df['description'].isnull()]

## Parsing Food Descriptions

Ingredient names come in the form of a 'description', which is a comma-separated string containing the ingredient name as well as other descriptors of the specific type. Their ordering depends on the study that was done, i.e. the `data_type`. These specifics can often be ignored, as we want to deal with the fundamental ingredient (i.e. raw) in the dataset. However, there are factors that should be considered (e.g. eggs whole/yolk/white, specific part of meat, dried ingredients, type of flour).

To help work with the data, we parse each one of these into a list of descriptions.

In [22]:
food_df['description']

fdc_id
167512     Pillsbury Golden Layer Buttermilk Biscuits, Artificial Flavor, refrigerated dough
167513                              Pillsbury, Cinnamon Rolls with Icing, refrigerated dough
167514                      Kraft Foods, Shake N Bake Original Recipe, Coating for Pork, dry
167515                                        George Weston Bakeries, Thomas English Muffins
167516                                            Waffles, buttermilk, frozen, ready-to-heat
                                                 ...                                        
2346351                                            Sports drink, low calorie (Powerade Zero)
2346352                                                            Sports drink, low calorie
2346353                                              Fluid replacement, electrolyte solution
2346354                                               Fluid replacement, 5% glucose in water
2346355                                                        

In [23]:
food_df['description_list'] = food_df['description'].apply(lambda description: description.split(', '))

## Filtering

In [24]:
food_df[food_df['description'].str.lower().str.contains('flour') & food_df['description'].str.lower().str.contains('wheat')]

Unnamed: 0_level_0,data_type,description,food_category_id,publication_date,description_list
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
167938,sr_legacy_food,"Pan Dulce, La Ricura, Salpora de Arroz con Azucar, cookie-like, contains wheat flour and rice flour",,2019-04-01,"[Pan Dulce, La Ricura, Salpora de Arroz con Azucar, cookie-like, contains wheat flour and rice flour]"
168869,sr_legacy_food,"Cornmeal, yellow, self-rising, bolted, with wheat flour added, enriched",,2019-04-01,"[Cornmeal, yellow, self-rising, bolted, with wheat flour added, enriched]"
168893,sr_legacy_food,"Wheat flour, whole-grain (Includes foods for USDA's Food Distribution Program)",,2019-04-01,"[Wheat flour, whole-grain (Includes foods for USDA's Food Distribution Program)]"
168894,sr_legacy_food,"Wheat flour, white, all-purpose, enriched, bleached",,2019-04-01,"[Wheat flour, white, all-purpose, enriched, bleached]"
168895,sr_legacy_food,"Wheat flour, white, all-purpose, self-rising, enriched",,2019-04-01,"[Wheat flour, white, all-purpose, self-rising, enriched]"
168896,sr_legacy_food,"Wheat flour, white, bread, enriched",,2019-04-01,"[Wheat flour, white, bread, enriched]"
168913,sr_legacy_food,"Wheat flours, bread, unenriched",,2019-04-01,"[Wheat flours, bread, unenriched]"
168924,sr_legacy_food,"Cornmeal, white, self-rising, bolted, with wheat flour added, enriched",,2019-04-01,"[Cornmeal, white, self-rising, bolted, with wheat flour added, enriched]"
168936,sr_legacy_food,"Wheat flour, white, all-purpose, enriched, unbleached",,2019-04-01,"[Wheat flour, white, all-purpose, enriched, unbleached]"
169723,sr_legacy_food,"Wheat flour, white, cake, enriched",,2019-04-01,"[Wheat flour, white, cake, enriched]"


In [25]:
food_df.query('description.str.lower().str.contains("mango")')

Unnamed: 0_level_0,data_type,description,food_category_id,publication_date,description_list
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
167785,sr_legacy_food,"Mango nectar, canned",,2019-04-01,"[Mango nectar, canned]"
167795,sr_legacy_food,"Fruit juice smoothie, NAKED JUICE, MIGHTY MANGO",,2019-04-01,"[Fruit juice smoothie, NAKED JUICE, MIGHTY MANGO]"
169090,sr_legacy_food,"Mangosteen, canned, syrup pack",,2019-04-01,"[Mangosteen, canned, syrup pack]"
169910,sr_legacy_food,"Mangos, raw",,2019-04-01,"[Mangos, raw]"
171341,sr_legacy_food,"Babyfood, fruit dessert, mango with tapioca",,2019-04-01,"[Babyfood, fruit dessert, mango with tapioca]"
171933,sr_legacy_food,"Beverages, V8 V-FUSION Juices, Peach Mango",,2019-04-01,"[Beverages, V8 V-FUSION Juices, Peach Mango]"
172254,sr_legacy_food,"Babyfood, GERBER, 3rd Foods, apple, mango and kiwi",,2019-04-01,"[Babyfood, GERBER, 3rd Foods, apple, mango and kiwi]"
173173,sr_legacy_food,"Beverages, FUZE, orange mango, fortified with vitamins A, C, E, B6",,2019-04-01,"[Beverages, FUZE, orange mango, fortified with vitamins A, C, E, B6]"
173186,sr_legacy_food,"Beverages, V8 SPLASH Smoothies, Peach Mango",,2019-04-01,"[Beverages, V8 SPLASH Smoothies, Peach Mango]"
174165,sr_legacy_food,"Beverages, V8 SPLASH Juice Drinks, Mango Peach",,2019-04-01,"[Beverages, V8 SPLASH Juice Drinks, Mango Peach]"


### Filtering Brands/Meals

We want to avoid brand names / full meals showing up in the `food_df`.

There are a number of ways of doing this; the following section investigates them:

In [26]:
food_df[food_df['description'].str.contains('flour') & food_df['description'].str.contains('wheat')]

Unnamed: 0_level_0,data_type,description,food_category_id,publication_date,description_list
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
167938,sr_legacy_food,"Pan Dulce, La Ricura, Salpora de Arroz con Azucar, cookie-like, contains wheat flour and rice flour",,2019-04-01,"[Pan Dulce, La Ricura, Salpora de Arroz con Azucar, cookie-like, contains wheat flour and rice flour]"
168869,sr_legacy_food,"Cornmeal, yellow, self-rising, bolted, with wheat flour added, enriched",,2019-04-01,"[Cornmeal, yellow, self-rising, bolted, with wheat flour added, enriched]"
168924,sr_legacy_food,"Cornmeal, white, self-rising, bolted, with wheat flour added, enriched",,2019-04-01,"[Cornmeal, white, self-rising, bolted, with wheat flour added, enriched]"
170687,sr_legacy_food,"Buckwheat flour, whole-groat",,2019-04-01,"[Buckwheat flour, whole-groat]"


In [27]:
food_df

Unnamed: 0_level_0,data_type,description,food_category_id,publication_date,description_list
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
167512,sr_legacy_food,"Pillsbury Golden Layer Buttermilk Biscuits, Artificial Flavor, refrigerated dough",,2019-04-01,"[Pillsbury Golden Layer Buttermilk Biscuits, Artificial Flavor, refrigerated dough]"
167513,sr_legacy_food,"Pillsbury, Cinnamon Rolls with Icing, refrigerated dough",,2019-04-01,"[Pillsbury, Cinnamon Rolls with Icing, refrigerated dough]"
167514,sr_legacy_food,"Kraft Foods, Shake N Bake Original Recipe, Coating for Pork, dry",,2019-04-01,"[Kraft Foods, Shake N Bake Original Recipe, Coating for Pork, dry]"
167515,sr_legacy_food,"George Weston Bakeries, Thomas English Muffins",,2019-04-01,"[George Weston Bakeries, Thomas English Muffins]"
167516,sr_legacy_food,"Waffles, buttermilk, frozen, ready-to-heat",,2019-04-01,"[Waffles, buttermilk, frozen, ready-to-heat]"
...,...,...,...,...,...
2346351,survey_fndds_food,"Sports drink, low calorie (Powerade Zero)",7104,2022-10-28,"[Sports drink, low calorie (Powerade Zero)]"
2346352,survey_fndds_food,"Sports drink, low calorie",7104,2022-10-28,"[Sports drink, low calorie]"
2346353,survey_fndds_food,"Fluid replacement, electrolyte solution",7206,2022-10-28,"[Fluid replacement, electrolyte solution]"
2346354,survey_fndds_food,"Fluid replacement, 5% glucose in water",7206,2022-10-28,"[Fluid replacement, 5% glucose in water]"


In [28]:
food_df['description'][food_df['description'].str.contains('with')]

fdc_id
167513                   Pillsbury, Cinnamon Rolls with Icing, refrigerated dough
167523            Pie crust, deep dish, frozen, unbaked, made with enriched flour
167546                                  Candies, honey-combed, with peanut butter
167551                              Snacks, popcorn, caramel-coated, with peanuts
167552                           Snacks, popcorn, caramel-coated, without peanuts
                                            ...                                  
2346180                                               Oatmeal beverage with water
2346181                                                Oatmeal beverage with milk
2346184                                     Cornmeal beverage with chocolate milk
2346188    Fruit flavored drink, with high vitamin C, powdered, not reconstituted
2346257                                                        Liqueur with cream
Name: description, Length: 3125, dtype: string

Oftentimes these brands and meals appear in lengthy phrases.

In [29]:
def filter_long_phrases(food_description):
    # True if any comma-separated phrase is longer than 30 characters
    # note: excluding bracketed text from the count as this is sometimes important
    return any(len(re.sub(r'\([^)]*\)', '', phrase)) > 30 for phrase in food_description.split(','))

food_df[food_df['description'].apply(filter_long_phrases)]

Unnamed: 0_level_0,data_type,description,food_category_id,publication_date,description_list
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
167512,sr_legacy_food,"Pillsbury Golden Layer Buttermilk Biscuits, Artificial Flavor, refrigerated dough",,2019-04-01,"[Pillsbury Golden Layer Buttermilk Biscuits, Artificial Flavor, refrigerated dough]"
167531,sr_legacy_food,"Crackers, cream, La Moderna Rikis Cream Crackers",,2019-04-01,"[Crackers, cream, La Moderna Rikis Cream Crackers]"
167548,sr_legacy_food,"Snacks, granola bars, soft, uncoated, peanut butter and chocolate chip",,2019-04-01,"[Snacks, granola bars, soft, uncoated, peanut butter and chocolate chip]"
167581,sr_legacy_food,"Candies, NESTLE, GOOBERS Chocolate Covered Peanuts",,2019-04-01,"[Candies, NESTLE, GOOBERS Chocolate Covered Peanuts]"
167583,sr_legacy_food,"Candies, TWIZZLERS Strawberry Twists Candy",,2019-04-01,"[Candies, TWIZZLERS Strawberry Twists Candy]"
...,...,...,...,...,...
2346140,survey_fndds_food,"Fruit punch, made with fruit juice and soda",7204,2022-10-28,"[Fruit punch, made with fruit juice and soda]"
2346155,survey_fndds_food,"Vegetable and fruit juice drink, with high vitamin C",7204,2022-10-28,"[Vegetable and fruit juice drink, with high vitamin C]"
2346168,survey_fndds_food,"Vegetable and fruit juice drink, with high vitamin C, diet",7106,2022-10-28,"[Vegetable and fruit juice drink, with high vitamin C, diet]"
2346169,survey_fndds_food,"Vegetable and fruit juice drink, with high vitamin C, light",7204,2022-10-28,"[Vegetable and fruit juice drink, with high vitamin C, light]"


In [30]:
food_df = food_df[~food_df['description'].apply(filter_long_phrases)]
food_df.shape

(11930, 5)

Foods containing 'with' are, the majority of the time, bloated and too specific.

In [31]:
food_df[food_df['description'].str.contains('with')]

Unnamed: 0_level_0,data_type,description,food_category_id,publication_date,description_list
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
167513,sr_legacy_food,"Pillsbury, Cinnamon Rolls with Icing, refrigerated dough",,2019-04-01,"[Pillsbury, Cinnamon Rolls with Icing, refrigerated dough]"
167523,sr_legacy_food,"Pie crust, deep dish, frozen, unbaked, made with enriched flour",,2019-04-01,"[Pie crust, deep dish, frozen, unbaked, made with enriched flour]"
167546,sr_legacy_food,"Candies, honey-combed, with peanut butter",,2019-04-01,"[Candies, honey-combed, with peanut butter]"
167551,sr_legacy_food,"Snacks, popcorn, caramel-coated, with peanuts",,2019-04-01,"[Snacks, popcorn, caramel-coated, with peanuts]"
167552,sr_legacy_food,"Snacks, popcorn, caramel-coated, without peanuts",,2019-04-01,"[Snacks, popcorn, caramel-coated, without peanuts]"
...,...,...,...,...,...
2346179,survey_fndds_food,"Horchata beverage, made with milk",7220,2022-10-28,"[Horchata beverage, made with milk]"
2346180,survey_fndds_food,Oatmeal beverage with water,7220,2022-10-28,[Oatmeal beverage with water]
2346181,survey_fndds_food,Oatmeal beverage with milk,7220,2022-10-28,[Oatmeal beverage with milk]
2346188,survey_fndds_food,"Fruit flavored drink, with high vitamin C, powdered, not reconstituted",9999,2022-10-28,"[Fruit flavored drink, with high vitamin C, powdered, not reconstituted]"


In [32]:
food_df = food_df[~food_df['description'].str.contains('with')]
food_df.shape

(9601, 5)
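One detail worth being aware of: `str.contains('with')` is a substring test, so it also removes descriptions containing 'without' (as in the popcorn row above). The sketch below contrasts it with a whole-word regular expression; the whole-word variant is shown only as an alternative, not as what the notebook does.

```python
import pandas as pd

# Two descriptions from the output above, plus one unaffected entry.
descriptions = pd.Series([
    'Snacks, popcorn, caramel-coated, with peanuts',
    'Snacks, popcorn, caramel-coated, without peanuts',
    'Mangos, raw',
])

# Substring match, as used in the notebook: also hits 'without'.
substring_hits = descriptions.str.contains('with')

# Alternative: match 'with' only as a whole word.
whole_word_hits = descriptions.str.contains(r'\bwith\b')

print(substring_hits.tolist(), whole_word_hits.tolist())
```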

In [33]:
food_df.loc[171474]

data_type                                                      sr_legacy_food
description           Chicken, broilers or fryers, breast, meat and skin, raw
food_category_id                                                         <NA>
publication_date                                                   2019-04-01
description_list    [Chicken, broilers or fryers, breast, meat and skin, raw]
Name: 171474, dtype: object

Brands are often written as fully capitalised words.

In [34]:
food_df[food_df['description'].apply(lambda x: bool(re.search(r'\b[A-Z0-9]{2,}\b', x)))]

Unnamed: 0_level_0,data_type,description,food_category_id,publication_date,description_list
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
167562,sr_legacy_food,"Candies, ALMOND JOY Candy Bar",,2019-04-01,"[Candies, ALMOND JOY Candy Bar]"
167563,sr_legacy_food,"Candies, TWIZZLERS CHERRY BITES",,2019-04-01,"[Candies, TWIZZLERS CHERRY BITES]"
167564,sr_legacy_food,"Candies, NESTLE, BIT-O'-HONEY Candy Chews",,2019-04-01,"[Candies, NESTLE, BIT-O'-HONEY Candy Chews]"
167565,sr_legacy_food,"Candies, NESTLE, BUTTERFINGER Bar",,2019-04-01,"[Candies, NESTLE, BUTTERFINGER Bar]"
167582,sr_legacy_food,"Candies, NESTLE, BABY RUTH Bar",,2019-04-01,"[Candies, NESTLE, BABY RUTH Bar]"
...,...,...,...,...,...
2346339,survey_fndds_food,"Energy drink, sugar-free (NOS)",7104,2022-10-28,"[Energy drink, sugar-free (NOS)]"
2346344,survey_fndds_food,Energy drink (XS),7206,2022-10-28,[Energy drink (XS)]
2346345,survey_fndds_food,Energy drink (XS Gold Plus),7206,2022-10-28,[Energy drink (XS Gold Plus)]
2346349,survey_fndds_food,"Sports drink, NFS",7206,2022-10-28,"[Sports drink, NFS]"


However, we want to make sure this isn't catching any useful upper-cased words. One such frequent example is NFS (Not Further Specified). Let's see which are the most common:

In [35]:
def get_re_match(search_string, regex):
    # return the first substring matching `regex`, or None if there is no match
    match = re.search(regex, search_string)
    if match:
        return match.group(0)

In [36]:
_ = food_df['description'].apply(get_re_match, args=(r'\b[A-Z0-9]{2,}\b',))
_ = _[_.notnull()]
_.value_counts().head(10)

description
NFS       245
NS        177
QUAKER     54
USDA       51
100        39
10         32
SILK       27
MARS       26
MALT       24
V8         20
Name: count, dtype: int64

In [37]:
food_df[food_df['description'].str.contains("USDA")].head()

Unnamed: 0_level_0,data_type,description,food_category_id,publication_date,description_list
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
167776,sr_legacy_food,"Pears, raw, bartlett (Includes foods for USDA's Food Distribution Program)",,2019-04-01,"[Pears, raw, bartlett (Includes foods for USDA's Food Distribution Program)]"
167778,sr_legacy_food,"Pears, raw, bosc (Includes foods for USDA's Food Distribution Program)",,2019-04-01,"[Pears, raw, bosc (Includes foods for USDA's Food Distribution Program)]"
167779,sr_legacy_food,"Pears, raw, green anjou (Includes foods for USDA's Food Distribution Program)",,2019-04-01,"[Pears, raw, green anjou (Includes foods for USDA's Food Distribution Program)]"
167816,sr_legacy_food,"Pork, fresh, leg (ham), rump half, separable lean only, raw (Includes foods for USDA's Food Distribution Program)",,2019-04-01,"[Pork, fresh, leg (ham), rump half, separable lean only, raw (Includes foods for USDA's Food Distribution Program)]"
168165,sr_legacy_food,"Raisins, dark, seedless (Includes foods for USDA's Food Distribution Program)",,2019-04-01,"[Raisins, dark, seedless (Includes foods for USDA's Food Distribution Program)]"


In [38]:
food_df['description'] = food_df['description'].apply(lambda x: x.replace(" (Includes foods for USDA's Food Distribution Program)", ""))

In [39]:
food_df = food_df[~(food_df['description'].apply(lambda x: bool(re.search(r'\b[A-Z0-9]{2,}\b', x))) & ~(food_df['description'].str.contains('NFS')) & ~(food_df['description'].str.contains('NS')))]
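The boolean mask above is dense, so here is the same rule written as a plain function on a few descriptions taken from earlier outputs. `UPPER_TOKEN` and `is_dropped` are hypothetical names used only for this sketch. Note that the character class also includes digits, which is why numbers such as 100 show up in the counts above.

```python
import re

# Same pattern as in the notebook: a run of 2+ upper-case letters or digits.
UPPER_TOKEN = re.compile(r'\b[A-Z0-9]{2,}\b')

def is_dropped(description):
    """True if the row would be removed by the mask above."""
    has_upper_token = bool(UPPER_TOKEN.search(description))
    # Substring checks, as in the notebook: NFS / NS rows are always kept.
    is_exempt = 'NFS' in description or 'NS' in description
    return has_upper_token and not is_exempt

samples = [
    'Candies, NESTLE, BUTTERFINGER Bar',   # brand in capitals
    'Energy drink (XS)',                   # brand in capitals
    'Sports drink, NFS',                   # exempt via NFS
    'Chicken, NS as to part and cooking method, skin eaten',  # exempt via NS
    'Mangos, raw',                         # no upper-case token at all
]
results = [is_dropped(s) for s in samples]
print(results)
```

Because the exemptions are substring checks, any description that merely contains the letters "NS" is kept as well.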

What about series of capitalised words?

In [40]:
food_df[food_df['description'].apply(lambda x: bool(re.search(r'([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)', x)))]

Unnamed: 0_level_0,data_type,description,food_category_id,publication_date,description_list
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
167514,sr_legacy_food,"Kraft Foods, Shake N Bake Original Recipe, Coating for Pork, dry",,2019-04-01,"[Kraft Foods, Shake N Bake Original Recipe, Coating for Pork, dry]"
167515,sr_legacy_food,"George Weston Bakeries, Thomas English Muffins",,2019-04-01,"[George Weston Bakeries, Thomas English Muffins]"
167520,sr_legacy_food,"Pie Crust, Cookie-type, Graham Cracker, Ready Crust",,2019-04-01,"[Pie Crust, Cookie-type, Graham Cracker, Ready Crust]"
167521,sr_legacy_food,"Pie Crust, Cookie-type, Chocolate, Ready Crust",,2019-04-01,"[Pie Crust, Cookie-type, Chocolate, Ready Crust]"
167522,sr_legacy_food,"Pie, Dutch Apple, Commercially Prepared",,2019-04-01,"[Pie, Dutch Apple, Commercially Prepared]"
...,...,...,...,...,...
2346334,survey_fndds_food,Energy Drink,7206,2022-10-28,[Energy Drink]
2346338,survey_fndds_food,"Energy drink, sugar free (No Fear)",7104,2022-10-28,"[Energy drink, sugar free (No Fear)]"
2346340,survey_fndds_food,Energy drink (Ocean Spray Cran-Energy Juice Drink),7206,2022-10-28,[Energy drink (Ocean Spray Cran-Energy Juice Drink)]
2346341,survey_fndds_food,"Energy drink, sugar-free (Red Bull)",7104,2022-10-28,"[Energy drink, sugar-free (Red Bull)]"


Although this is not as convincing, it does seem like these are all branded foods that aren't useful to us.
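To make clear what the pattern picks up, the sketch below runs it on a few descriptions from the outputs above and returns the first run of two or more consecutive title-cased words. `TITLE_RUN` and `first_title_run` are hypothetical names for illustration.

```python
import re

# Same pattern as in the notebook: a title-cased word followed by
# one or more further title-cased words, separated by single whitespace.
TITLE_RUN = re.compile(r'([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)')

def first_title_run(description):
    """Return the first run of consecutive title-cased words, or None."""
    match = TITLE_RUN.search(description)
    return match.group(0) if match else None

samples = [
    'George Weston Bakeries, Thomas English Muffins',
    'Pie, Dutch Apple, Commercially Prepared',
    'Energy drink, sugar-free (Red Bull)',
    'Sports drink, NFS',
    'Waffles, buttermilk, frozen, ready-to-heat',
]
runs = [first_title_run(s) for s in samples]
print(runs)
```

A single capitalised word at the start of a description is not matched, so ordinary entries such as "Waffles, buttermilk, ..." pass through.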

In [41]:
food_df = food_df[~food_df['description'].apply(lambda x: bool(re.search(r'([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)', x)))]

In [42]:
food_df.shape

(8191, 5)

### Filtering "no"s

We don't want descriptions that explicitly state they do *not* contain a search term to come up during the search.

In [43]:
def filter_nos(description_list):
    description_list = [description for description in description_list
     if not bool(re.search(r'^no\b', description))]
    return description_list

In [44]:
assert filter_nos(['sloppy joe', 'no bun']) == ['sloppy joe']

In [45]:
food_df['description_list'] = food_df['description_list'].apply(filter_nos)

In [46]:
food_df.loc[2341797]

data_type            survey_fndds_food
description         Sloppy joe, no bun
food_category_id                  3002
publication_date            2022-10-28
description_list          [Sloppy joe]
Name: 2341797, dtype: object

## Cleaning

In [47]:
food_df['description_list'] = food_df['description_list'].apply(lambda description_list: [clean_ingredient_string(x) for x in description_list])

In [48]:
food_df.loc[2341797]

data_type            survey_fndds_food
description         Sloppy joe, no bun
food_category_id                  3002
publication_date            2022-10-28
description_list          [sloppy joe]
Name: 2341797, dtype: object

## Column Aggregations

Here we are adding columns that are useful for the joining process.

In [49]:
food_df['description_length'] = food_df['description'].apply(len)
food_df['description_list_length'] = food_df['description_list'].apply(len)
food_df = food_df[food_df['description_list_length'] != 1] # remove ingredients which do not follow the comma separated format (found that these are not useful).
food_df.head()

Unnamed: 0_level_0,data_type,description,food_category_id,publication_date,description_list,description_length,description_list_length
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
167516,sr_legacy_food,"Waffles, buttermilk, frozen, ready-to-heat",,2019-04-01,"[waffle, buttermilk, frozen, ready-to-heat]",42,4
167517,sr_legacy_food,"Waffle, buttermilk, frozen, ready-to-heat, toasted",,2019-04-01,"[waffle, buttermilk, frozen, ready-to-heat, toasted]",50,5
167518,sr_legacy_food,"Waffle, buttermilk, frozen, ready-to-heat, microwaved",,2019-04-01,"[waffle, buttermilk, frozen, ready-to-heat, microwaved]",53,5
167519,sr_legacy_food,"Waffle, plain, frozen, ready-to-heat, microwave",,2019-04-01,"[waffle, plain, frozen, ready-to-heat, microwave]",47,5
167524,sr_legacy_food,"Waffles, chocolate chip, frozen, ready-to-heat",,2019-04-01,"[waffle, chocolate chip, frozen, ready-to-heat]",46,4


In [50]:
with open('../data/globals/default_words.json', 'r') as f: 
    default_words = json.load(f)['density']

with open('../data/globals/exclusion_words.json', 'r') as f: 
    exclusion_words = json.load(f)['density']

food_df['default_word_count'] = food_df['description_list'].apply(count_list_matches, args=(default_words,))
food_df['exclusion_word_count'] = food_df['description_list'].apply(count_list_matches, args=(exclusion_words,))

## Syncing Food DF & Portion DF Entries

Only the foods with entries in the portion DF will be of use.

In [51]:
food_df = food_df[food_df.index.isin(food_portion_df.reset_index(1).index.unique())]

In [52]:
food_portion_df = food_portion_df[food_portion_df.index.get_level_values(0).isin(food_df.index.unique())]
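
The two-way sync above can be illustrated on toy data. This is only a sketch: the frames, ids and values below are made up, and only the index logic mirrors the cells above.

```python
import pandas as pd

# Toy stand-ins for food_df and food_portion_df; all values are illustrative.
foods = pd.DataFrame(
    {"description": ["Waffle, plain", "Milk, whole", "Bread, white"]},
    index=pd.Index([1, 2, 3], name="fdc_id"),
)
portions = pd.DataFrame(
    {"gram_weight": [35.0, 244.0, 30.0]},
    index=pd.MultiIndex.from_tuples([(1, 10), (2, 20), (4, 40)], names=["fdc_id", "id"]),
)

# Keep only foods that have at least one portion ...
portion_food_ids = portions.index.get_level_values("fdc_id").unique()
foods = foods[foods.index.isin(portion_food_ids)]

# ... and only portions whose food survived the first filter.
portions = portions[portions.index.get_level_values("fdc_id").isin(foods.index)]

synced_food_ids = foods.index.tolist()
synced_portion_food_ids = portions.index.get_level_values("fdc_id").unique().tolist()
```

Food 3 has no portions and portion (4, 40) has no food, so both are dropped and the two frames end up covering the same set of `fdc_id` values.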

# Processing Food Portions DF

## Dtypes

In [53]:
food_portion_df = food_portion_df.convert_dtypes()
food_portion_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 22521 entries, (167516, 81553) to (2346355, 290505)
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   seq_num              22521 non-null  Int64  
 1   amount               9758 non-null   Float64
 2   measure_unit_id      22521 non-null  Int64  
 3   portion_description  12764 non-null  string 
 4   modifier             22452 non-null  string 
 5   gram_weight          22521 non-null  Float64
 6   data_points          2151 non-null   Int64  
 7   footnote             0 non-null      Int64  
 8   min_year_acquired    159 non-null    Int64  
dtypes: Float64(2), Int64(5), string(2)
memory usage: 3.8 MB


## Cleaning Portion Description

In [54]:
def clean_portion_description(description):
    if not isinstance(description, str): return pd.NA
    description = description.lower()
    description = description.replace('quantity not specified', '')
    description = re.sub(r'guideline amount \w+', '', description)
    description = re.sub(r'^1 ', '', description)
    description = clean_ingredient_string(description)
    return description

food_portion_df['portion_cleaned'] = food_portion_df['portion_description'].apply(clean_portion_description)
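
To see what the string steps do in isolation, here is a minimal sketch that applies only the lower-casing, replacement and regex steps. The project helper `clean_ingredient_string` is deliberately left out, and the function name `strip_portion_boilerplate` and the final `.strip()` are assumptions made for this illustration.

```python
import re

def strip_portion_boilerplate(description):
    """Apply only the replace/regex steps; clean_ingredient_string is omitted here."""
    description = description.lower()
    description = description.replace("quantity not specified", "")
    # Drops "guideline amount" plus the single word after it (e.g. "per").
    description = re.sub(r"guideline amount \w+", "", description)
    # Drops a leading "1 " so that "1 cup" becomes "cup".
    description = re.sub(r"^1 ", "", description)
    return description.strip()

examples = ["1 cup", "Quantity not specified", "Guideline amount per fl oz of beverage"]
stripped = [strip_portion_boilerplate(text) for text in examples]
```

In this sketch the last example keeps the word "of". The notebook output shows `fl oz beverage` for the same input, so `clean_ingredient_string` presumably also removes such filler words.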

In [55]:
food_portion_df

Unnamed: 0_level_0,Unnamed: 1_level_0,seq_num,amount,measure_unit_id,portion_description,modifier,gram_weight,data_points,footnote,min_year_acquired,portion_cleaned
fdc_id,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
167516,81553,1,1.0,9999,,"waffle, square",39.0,10,,,
167516,81554,2,1.0,9999,,"waffle, round",38.0,40,,,
167517,81555,1,1.0,9999,,oz,28.0,,,,
167517,81556,2,1.0,9999,,"waffle round (4"" dia)",33.0,,,,
167518,81557,1,1.0,9999,,waffle,35.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...
2346354,290501,4,,9999,Quantity not specified,90000,120.0,,,,
2346355,290502,3,,9999,Quantity not specified,90000,248.0,,,,
2346355,290503,4,,9999,1 fl oz (no ice),30001,31.0,,,,fl oz ice
2346355,290504,5,,9999,1 fl oz (with ice),30009,23.0,,,,fl oz ice


In [56]:
_ = food_portion_df[['portion_cleaned', 'portion_description']][food_portion_df['portion_cleaned'] != food_portion_df['portion_description']]
_[_['portion_description'].notnull()].head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,portion_cleaned,portion_description
fdc_id,id,Unnamed: 2_level_1,Unnamed: 3_level_1
2340760,267796,cup,1 cup
2340760,267797,,Quantity not specified
2340760,267798,fl oz,1 fl oz
2340761,267799,cup,1 cup
2340761,267800,fl oz,1 fl oz
2340761,267801,individual school container,1 individual school container
2340761,267802,fl oz beverage,Guideline amount per fl oz of beverage
2340761,267803,cup hot cereal,Guideline amount per cup of hot cereal
2340761,267804,,Quantity not specified
2340762,267805,cup,1 cup


## Selecting Portion Measures

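The portion measure appears in different columns (`measure_unit`, `portion_description` or `modifier`), mostly depending on the source study but sometimes with no apparent pattern. Here we gather all of these into one combined description so that the appropriate measure can be selected.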

In [57]:
measure_unit_df = pd.read_csv('../data/datasets/density/FoodData_Central_csv_2023-04-20/measure_unit.csv', index_col='id')
measure_unit_df = measure_unit_df.convert_dtypes()
measure_unit_df.loc[measure_unit_df['name'] == 'undetermined', 'name'] = pd.NA  # avoid chained assignment
measure_unit_df

Unnamed: 0_level_0,name
id,Unnamed: 1_level_1
1000,cup
1001,tablespoon
1002,teaspoon
1003,liter
1004,milliliter
...,...
1117,bunch
1118,Tablespoons
1119,Banana
1120,Onion


In [58]:
food_portion_df.drop(columns=['measure_unit', 'description', 'combined', 'unit_tags', 'unit_remainders'], inplace=True, errors='ignore')

In [59]:
food_portion_df = food_portion_df.join(measure_unit_df.rename({'name':'measure_unit'}, axis=1), on='measure_unit_id')
food_portion_df

Unnamed: 0_level_0,Unnamed: 1_level_0,seq_num,amount,measure_unit_id,portion_description,modifier,gram_weight,data_points,footnote,min_year_acquired,portion_cleaned,measure_unit
fdc_id,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
167516,81553,1,1.0,9999,,"waffle, square",39.0,10,,,,
167516,81554,2,1.0,9999,,"waffle, round",38.0,40,,,,
167517,81555,1,1.0,9999,,oz,28.0,,,,,
167517,81556,2,1.0,9999,,"waffle round (4"" dia)",33.0,,,,,
167518,81557,1,1.0,9999,,waffle,35.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
2346354,290501,4,,9999,Quantity not specified,90000,120.0,,,,,
2346355,290502,3,,9999,Quantity not specified,90000,248.0,,,,,
2346355,290503,4,,9999,1 fl oz (no ice),30001,31.0,,,,fl oz ice,
2346355,290504,5,,9999,1 fl oz (with ice),30009,23.0,,,,fl oz ice,


In [60]:
def combine_food_portion_description(portion):

    description = ''
    for col in ['measure_unit', 'portion_cleaned', 'modifier']:
        column_description = portion[col]
        if pd.isnull(column_description): continue
        if (col == 'modifier' and column_description.isnumeric()): continue
        description = description + " " + clean_portion_description(column_description)

    description = description.strip()

    return description

combined_description = food_portion_df.apply(combine_food_portion_description, axis=1).astype(str)
combined_description

fdc_id   id    
167516   81553           waffle square
         81554            waffle round
167517   81555                      oz
         81556     waffle round 4 "dia
167518   81557                  waffle
                          ...         
2346354  290501                       
2346355  290502                       
         290503              fl oz ice
         290504              fl oz ice
         290505              fl oz nfs
Length: 22521, dtype: object
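
The combination logic can be sketched on plain values. This simplified version skips the re-cleaning of each field, and the function name `combine_fields` is hypothetical. The inputs are modelled on rows shown in this notebook.

```python
def combine_fields(measure_unit, portion_cleaned, modifier):
    """Join the non-missing fields, skipping modifiers that are bare numeric codes."""
    parts = []
    for name, value in [
        ("measure_unit", measure_unit),
        ("portion_cleaned", portion_cleaned),
        ("modifier", modifier),
    ]:
        if value is None:
            continue
        # Survey rows store a numeric code in `modifier`, which carries no unit text.
        if name == "modifier" and value.isnumeric():
            continue
        parts.append(value)
    return " ".join(parts)

serving = combine_fields("serving", None, "1/2 cup")
iced_drink = combine_fields(None, "fl oz ice", "30001")
unspecified = combine_fields(None, None, "90000")
```

The third case yields an empty string, which is exactly the kind of row handled in the NA's section.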

In [61]:
food_portion_df['combined_description'] = combined_description

In [62]:
food_portion_df

Unnamed: 0_level_0,Unnamed: 1_level_0,seq_num,amount,measure_unit_id,portion_description,modifier,gram_weight,data_points,footnote,min_year_acquired,portion_cleaned,measure_unit,combined_description
fdc_id,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
167516,81553,1,1.0,9999,,"waffle, square",39.0,10,,,,,waffle square
167516,81554,2,1.0,9999,,"waffle, round",38.0,40,,,,,waffle round
167517,81555,1,1.0,9999,,oz,28.0,,,,,,oz
167517,81556,2,1.0,9999,,"waffle round (4"" dia)",33.0,,,,,,"waffle round 4 ""dia"
167518,81557,1,1.0,9999,,waffle,35.0,,,,,,waffle
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2346354,290501,4,,9999,Quantity not specified,90000,120.0,,,,,,
2346355,290502,3,,9999,Quantity not specified,90000,248.0,,,,,,
2346355,290503,4,,9999,1 fl oz (no ice),30001,31.0,,,,fl oz ice,,fl oz ice
2346355,290504,5,,9999,1 fl oz (with ice),30009,23.0,,,,fl oz ice,,fl oz ice


## NA's

Some food portions have no description at all. We want to remove these, but only for foods that also have at least one portion with a description.

In [63]:
null_idxs = food_portion_df.index[food_portion_df['combined_description'] == ''].unique()
len(null_idxs)

2769

In [64]:
non_null_food_idxs = food_portion_df.reset_index(1).index[food_portion_df['combined_description'] != ''].unique()
len(non_null_food_idxs)

7455

In [65]:
null_idxs_with_alternatives = [idx for idx in null_idxs if idx[0] in non_null_food_idxs]
null_idxs_without_alternatives = [idx for idx in null_idxs if idx[0] not in non_null_food_idxs]

In [66]:
food_portion_df = food_portion_df[~food_portion_df.index.isin(null_idxs_with_alternatives)]
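
The rule applied above can be sketched with a plain dictionary keyed by `(fdc_id, id)`. The keys and descriptions are made up for illustration.

```python
# Toy portions keyed by (fdc_id, id); an empty string means "no description".
portions = {
    (1, 10): "cup",
    (1, 11): "",       # food 1 has a described alternative, so this row can go
    (2, 20): "",       # food 2 has no described portion at all
    (3, 30): "slice",
}

# Foods with at least one described portion.
described_foods = {fdc_id for (fdc_id, _), text in portions.items() if text != ""}
empty_keys = [key for key, text in portions.items() if text == ""]

with_alternatives = [key for key in empty_keys if key[0] in described_foods]
without_alternatives = [key for key in empty_keys if key[0] not in described_foods]

# Drop only the empty rows whose food still has something else to offer.
remaining = {key: text for key, text in portions.items() if key not in with_alternatives}
```

Rows like `(2, 20)` survive this step and have to be inspected separately.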

What about the ones without alternatives?

In [67]:
food_portion_df.loc[null_idxs_without_alternatives].join(food_df, on='fdc_id')

Unnamed: 0_level_0,Unnamed: 1_level_0,seq_num,amount,measure_unit_id,portion_description,modifier,gram_weight,data_points,footnote,min_year_acquired,portion_cleaned,...,combined_description,data_type,description,food_category_id,publication_date,description_list,description_length,description_list_length,default_word_count,exclusion_word_count
fdc_id,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
172079,89840,1,1.0,9999,,each,25.0,8,,,,...,,sr_legacy_food,"Fast foods, miniature cinnamon rolls",,2019-04-01,"[fast food, miniature cinnamon roll]",36,2,0,0


Just one obscure ingredient, which can simply be removed.

In [68]:
food_portion_df = food_portion_df[~food_portion_df.index.isin(null_idxs_without_alternatives)]

## Syncing Food DF & Portion DF Entries

Only the foods with entries in the portion DF will be of use.

In [69]:
food_df = food_df[food_df.index.isin(food_portion_df.reset_index(1).index.unique())]

In [70]:
food_portion_df = food_portion_df[food_portion_df.index.get_level_values(0).isin(food_df.index.unique())]

## Tagging Units

Just as we did in the previous chapter for `ingredients_df`, we tag the units here, homogenising them so that they can be matched with each other.

In [71]:
with open('../data/globals/unit_conversions.json') as f:
    unit_list = json.load(f)

In [72]:
unit_tags = parallel_apply(food_portion_df['combined_description'], tag_units, meta=pd.Series(dtype='object'))
unit_tags = unit_tags.apply(literal_eval)
food_portion_df['unit_tags'], food_portion_df['unit_remainders'], food_portion_df['unit_type'] = zip(*unit_tags)

[11/02/2024 11:25:37] [INFO] [recipe_dataset.utils.parallel] [parallel_apply():47] [PID:3853 TID:139957401994048] Commencing parallel apply
[11/02/2024 11:25:37] [INFO] [recipe_dataset.utils.parallel] [parallel_apply():48] [PID:3853 TID:139957401994048] DF shape: (19752,) | DF size: 278.72 KB
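
The implementation of `tag_units` is not shown in this notebook. The following is only a toy illustration of the general idea of homogenising units. The alias table and the function name `toy_tag_units` are hypothetical and much simpler than whatever the project actually does (for instance, multi-word units such as `fl oz` are not handled).

```python
# Hypothetical alias table: surface form -> canonical unit tag.
ALIASES = {
    "oz": "ounce",
    "ounce": "ounce",
    "lb": "pound",
    "cup": "cup",
    "cups": "cup",
    "tbsp": "tablespoon",
    "serving": "portion",
}

def toy_tag_units(description):
    """Split a description into canonical tags (units and numbers) and remainder words."""
    tags, remainders = [], []
    for token in description.split():
        if token in ALIASES:
            tags.append(ALIASES[token])
            continue
        try:
            # Numbers are kept as tags in a normalised string form, e.g. "8" -> "8.0".
            tags.append(str(float(token)))
        except ValueError:
            remainders.append(token)
    return tags, remainders

container_tags, container_remainders = toy_tag_units("container 8 oz")
```

Once every spelling of a unit maps to a single tag, descriptions such as `8 oz` and `8 ounce` become directly comparable.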


In [73]:
food_portion_df[['combined_description', 'unit_tags', 'unit_remainders']]

Unnamed: 0_level_0,Unnamed: 1_level_0,combined_description,unit_tags,unit_remainders
fdc_id,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
167516,81553,waffle square,[],"[waffle, square]"
167516,81554,waffle round,[],"[waffle, round]"
167517,81555,oz,[ounce],[]
167517,81556,"waffle round 4 ""dia",[4.0],"[waffle, round, dia]"
167518,81557,waffle,[],[waffle]
...,...,...,...,...
2346354,290499,fl oz,[fluid_ounce],[]
2346354,290500,bottle 4 oz,"[4.0, ounce]",[bottle]
2346355,290503,fl oz ice,[fluid_ounce],[ice]
2346355,290504,fl oz ice,[fluid_ounce],[ice]


In [74]:
food_portion_df[food_portion_df['unit_tags'].isnull()]

Unnamed: 0_level_0,Unnamed: 1_level_0,seq_num,amount,measure_unit_id,portion_description,modifier,gram_weight,data_points,footnote,min_year_acquired,portion_cleaned,measure_unit,combined_description,unit_tags,unit_remainders,unit_type
fdc_id,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1


### Portion Sizes

One additional piece of information would be useful here. A portion description sometimes specifies the actual size of the portion (for example `package 10 oz`) or how many units it contains (for example `serving 1/2 cup`). We want to extract this amount together with its unit, which can be done with an additional pass over the unit tags.

In [75]:
food_portion_df.loc[332282.0]

Unnamed: 0_level_0,seq_num,amount,measure_unit_id,portion_description,modifier,gram_weight,data_points,footnote,min_year_acquired,portion_cleaned,measure_unit,combined_description,unit_tags,unit_remainders,unit_type
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
121416,1,1.0,1049,,1/2 cup,135.0,36,,2009,,serving,serving 1/2 cup,"[portion, 0.5, cup]",[],volume


In [76]:
portion = food_portion_df.loc[(332282, 121416)]
portion

seq_num                                   1
amount                                  1.0
measure_unit_id                        1049
portion_description                    <NA>
modifier                            1/2 cup
gram_weight                           135.0
data_points                              36
footnote                               <NA>
min_year_acquired                      2009
portion_cleaned                        <NA>
measure_unit                        serving
combined_description        serving 1/2 cup
unit_tags               [portion, 0.5, cup]
unit_remainders                          []
unit_type                            volume
Name: (332282, 121416), dtype: object

In [77]:
food_portion_df[food_portion_df['unit_tags'].apply(lambda x: any([i[0].isnumeric() for i in x]))].sample(20, random_state=777)

Unnamed: 0_level_0,Unnamed: 1_level_0,seq_num,amount,measure_unit_id,portion_description,modifier,gram_weight,data_points,footnote,min_year_acquired,portion_cleaned,measure_unit,combined_description,unit_tags,unit_remainders,unit_type
fdc_id,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
2341553,271244,1,,9999,"1 hen (1-1/4 lb, raw) (yield after cooking, bone and skin removed)",60854,250.0,,,,hen 1-1/4 lb raw yield cooking bone skin removed,,hen 1-1/4 lb raw yield cooking bone skin removed,"[1.0, 0.25, pound]","[hen, raw, yield, cooking, bone, skin, removed]",weight
171847,89462,1,2.0,9999,,cookie 1 serving,36.0,,,,,,cookie 1 serving,"[1.0, portion]",[cookie],portion
173638,93003,1,1.0,9999,,unit (yield from 1 lb ready-to-cook chicken),113.0,,,,,,unit yield 1 lb ready-to-cook chicken,"[1.0, pound]","[unit, yield, ready, to, cook, chicken]",weight
171966,89643,3,1.0,9999,,"can (6.5 oz), drained",125.0,,,,,,can 65 oz drained,"[can, 65.0, ounce]",[drained],volume
2344182,282215,6,,9999,"1 personal size pizza (5-7"" diameter)",64363,240.0,,,,"personal size pizza 5-7 ""diameter",,"personal size pizza 5-7 ""diameter","[5.0, 7.0]","[personal, size, pizza, diameter]",portion
170028,85928,4,1.0,9999,,"small (1-3/4"" to 2-1/4"" dia.)",92.0,,,,,,"small 1-3/4 ""2-1/4"" dia","[whole, 1.0, 3.0, 2.0, 0.25]",[dia],portion
169972,85786,2,1.0,9999,,package (10 oz),284.0,,,,,,package 10 oz,"[package, 10.0, ounce]",[],weight
170849,87395,6,1.0,9999,,package (6 oz),170.0,,,,,,package 6 oz,"[package, 6.0, ounce]",[],weight
171379,88595,6,1.0,9999,,jar Beech-Nut Stage 2 (4 oz),113.0,,,,,,jar beech-nut stage 2 4 oz,"[whole, 6.0, ounce]","[jar, beech, stage]",weight
171284,88363,2,1.0,9999,,container (8 oz),227.0,,,,,,container 8 oz,"[8.0, ounce]",[container],weight


In [78]:
def get_portion_measure(portion_unit_tags):
    # Returns the last (amount, unit) pair in which a numeric tag is immediately followed by a known unit.
    known_units = [*unit_list['weight'].keys(), *unit_list['volume'].keys(), *unit_list['portion'].keys()]
    modifier = (pd.NA, pd.NA)
    for i, unit_tag in enumerate(portion_unit_tags[:-1]):  # the last tag has no successor
        if unit_tag[0].isnumeric() and portion_unit_tags[i+1] in known_units:
            modifier = (float(unit_tag), portion_unit_tags[i+1])
    return modifier

assert get_portion_measure(portion['unit_tags']) == (0.5, 'cup')
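
To make the behaviour on edge cases easier to see, here is a self-contained sketch of the same extraction. `KNOWN_UNITS` is a hypothetical stand-in for the keys loaded from `unit_conversions.json`, and `None` is used in place of `pd.NA`.

```python
# Hypothetical stand-in for the weight, volume and portion unit keys.
KNOWN_UNITS = {"ounce", "pound", "cup", "fluid_ounce", "portion"}

def extract_portion_measure(unit_tags):
    """Return the last (amount, unit) pair where a numeric tag directly precedes a unit."""
    measure = (None, None)
    # Pair every tag with its successor; the last tag therefore never starts a pair.
    for current, following in zip(unit_tags, unit_tags[1:]):
        if current[0].isnumeric() and following in KNOWN_UNITS:
            measure = (float(current), following)
    return measure

half_cup = extract_portion_measure(["portion", "0.5", "cup"])
package = extract_portion_measure(["package", "10.0", "ounce"])
diameter_only = extract_portion_measure(["5.0", "7.0"])
mixed_fraction = extract_portion_measure(["1.0", "0.25", "pound"])
```

Numbers that are not followed by a unit (such as diameters) are ignored. A mixed fraction tagged as two separate numbers, as in the `1-1/4 lb` row of the sample above, yields only the fractional part (0.25 pound), so such rows may deserve a second look.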

In [79]:
food_portion_df['portion_amount'], food_portion_df['portion_unit'] = zip(*food_portion_df['unit_tags'].apply(get_portion_measure))

In [80]:
food_portion_df[food_portion_df['unit_tags'].apply(lambda x: any([i[0].isnumeric() for i in x]))].sample(20, random_state=777)

Unnamed: 0_level_0,Unnamed: 1_level_0,seq_num,amount,measure_unit_id,portion_description,modifier,gram_weight,data_points,footnote,min_year_acquired,portion_cleaned,measure_unit,combined_description,unit_tags,unit_remainders,unit_type,portion_amount,portion_unit
fdc_id,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
2341553,271244,1,,9999,"1 hen (1-1/4 lb, raw) (yield after cooking, bone and skin removed)",60854,250.0,,,,hen 1-1/4 lb raw yield cooking bone skin removed,,hen 1-1/4 lb raw yield cooking bone skin removed,"[1.0, 0.25, pound]","[hen, raw, yield, cooking, bone, skin, removed]",weight,0.25,pound
171847,89462,1,2.0,9999,,cookie 1 serving,36.0,,,,,,cookie 1 serving,"[1.0, portion]",[cookie],portion,1.0,portion
173638,93003,1,1.0,9999,,unit (yield from 1 lb ready-to-cook chicken),113.0,,,,,,unit yield 1 lb ready-to-cook chicken,"[1.0, pound]","[unit, yield, ready, to, cook, chicken]",weight,1.0,pound
171966,89643,3,1.0,9999,,"can (6.5 oz), drained",125.0,,,,,,can 65 oz drained,"[can, 65.0, ounce]",[drained],volume,65.0,ounce
2344182,282215,6,,9999,"1 personal size pizza (5-7"" diameter)",64363,240.0,,,,"personal size pizza 5-7 ""diameter",,"personal size pizza 5-7 ""diameter","[5.0, 7.0]","[personal, size, pizza, diameter]",portion,,
170028,85928,4,1.0,9999,,"small (1-3/4"" to 2-1/4"" dia.)",92.0,,,,,,"small 1-3/4 ""2-1/4"" dia","[whole, 1.0, 3.0, 2.0, 0.25]",[dia],portion,,
169972,85786,2,1.0,9999,,package (10 oz),284.0,,,,,,package 10 oz,"[package, 10.0, ounce]",[],weight,10.0,ounce
170849,87395,6,1.0,9999,,package (6 oz),170.0,,,,,,package 6 oz,"[package, 6.0, ounce]",[],weight,6.0,ounce
171379,88595,6,1.0,9999,,jar Beech-Nut Stage 2 (4 oz),113.0,,,,,,jar beech-nut stage 2 4 oz,"[whole, 6.0, ounce]","[jar, beech, stage]",weight,6.0,ounce
171284,88363,2,1.0,9999,,container (8 oz),227.0,,,,,,container 8 oz,"[8.0, ounce]",[container],weight,8.0,ounce


In [81]:
food_portion_df['amount'] = food_portion_df['amount'].fillna(1.0)  # assign back rather than filling a column selection in place

## Removing Weight Values

Portions described purely by a weight unit are of no use here, as they only map a weight onto `gram_weight` (weight -> gram_weight). Let's remove these.

In [82]:
def filter_weight(portion):
    if not len(portion['unit_remainders']) > 0:
        if len(portion['unit_tags']) > 0:
            if all([unit in unit_list['weight'].keys() for unit in portion['unit_tags']]):
                return False
    return True

filtered = food_portion_df[food_portion_df.apply(filter_weight, axis=1)]
food_portion_df = pd.concat([filtered, food_portion_df[~food_portion_df.reset_index(1).index.isin(filtered.reset_index(1).index.unique())]]) # add back those where all portions for the fdc_id were removed
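
The filter predicate can be sketched on plain lists. `WEIGHT_UNITS` is a hypothetical stand-in for `unit_list['weight'].keys()`, and the function name `keep_portion` is made up for this illustration.

```python
# Hypothetical stand-in for the keys of unit_list['weight'].
WEIGHT_UNITS = {"ounce", "pound", "gram"}

def keep_portion(unit_tags, unit_remainders):
    """Return False only for portions described by weight units and nothing else."""
    if unit_remainders or not unit_tags:
        return True
    return not all(tag in WEIGHT_UNITS for tag in unit_tags)

pure_weight = keep_portion(["ounce"], [])
sized_package = keep_portion(["package", "10.0", "ounce"], [])
bottle = keep_portion(["4.0", "ounce"], ["bottle"])
untagged = keep_portion([], ["waffle"])
```

Only the first case is dropped. Anything with a non-weight tag, a number or a leftover word is kept, since it may still describe a countable portion.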

## Highlighting Density Entries

Keeping in mind that the purpose of this dataset is to calculate densities, the only directly useful portions here are those containing volume entries. However, we don't want to remove the ones without a volume, as they might still be useful for determining portion weights. Instead, we add Boolean fields to `food_df`, `volume_exists` and `portion_exists`, which flag the portion types available for each food.

In [83]:
food_df.loc[171474]

data_type                                                           sr_legacy_food
description                Chicken, broilers or fryers, breast, meat and skin, raw
food_category_id                                                              <NA>
publication_date                                                        2019-04-01
description_list                  [chicken, broiler fryer, breast, meat skin, raw]
description_length                                                              55
description_list_length                                                          5
default_word_count                                                               1
exclusion_word_count                                                             0
Name: 171474, dtype: object

In [84]:
def check_portion_type(food):
    # All portion rows for this food, looked up by fdc_id.
    portions = food_portion_df.loc[food.name]
    volume_units = unit_list['volume'].keys()
    portion_units = unit_list['portion'].keys()
    # A food qualifies if any tag of any of its portions is a volume (or portion) unit.
    volume = any(tag in volume_units for tags in portions['unit_tags'] for tag in tags)
    portion = any(tag in portion_units for tags in portions['unit_tags'] for tag in tags)

    return volume, portion

check_portion_type(food_df.iloc[0])

(False, False)

In [85]:
food_df['volume_exists'], food_df['portion_exists'] = zip(*food_df.apply(check_portion_type, axis=1))
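
As a reminder of why volume portions matter, here is a rough sketch of the density calculation they enable. The conversion factors are standard US customary values. The names `cup` and `fluid_ounce` mirror tags seen in `unit_tags`, while the other keys and the function name `density_g_per_ml` are assumptions for this illustration.

```python
# Millilitres per US customary unit.
ML_PER_UNIT = {
    "cup": 236.588,
    "fluid_ounce": 29.5735,
    "tablespoon": 14.7868,
    "teaspoon": 4.92892,
}

def density_g_per_ml(gram_weight, amount, unit):
    """Density implied by a portion of `amount` `unit`s weighing `gram_weight` grams."""
    return gram_weight / (amount * ML_PER_UNIT[unit])

# The "serving 1/2 cup" portion shown earlier weighs 135 g.
half_cup_density = density_g_per_ml(135.0, 0.5, "cup")
# A 1 fl oz portion weighing 30 g.
fl_oz_density = density_g_per_ml(30.0, 1.0, "fluid_ounce")
```

The first example works out to roughly 1.14 g/ml and the second to roughly 1.01 g/ml. Portions without a volume unit cannot be converted this way, which is why they are only flagged rather than used for densities.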

## Filtering Non-Density Foods

In [86]:
food_df.shape

(7455, 11)

In [87]:
food_df = food_df[food_df['volume_exists'] | food_df['portion_exists']]  # keep foods with a volume or a portion entry
food_df.shape

(5302, 11)

## Column Selection

In [88]:
food_df

Unnamed: 0_level_0,data_type,description,food_category_id,publication_date,description_list,description_length,description_list_length,default_word_count,exclusion_word_count,volume_exists,portion_exists
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
167525,sr_legacy_food,"Tostada shells, corn",,2019-04-01,"[tostada shell, corn]",20,2,0,0,False,True
167526,sr_legacy_food,"Bread, salvadoran sweet cheese (quesadilla salvadorena)",,2019-04-01,"[bread, salvadoran sweet cheese quesadilla salvadorena]",55,2,0,0,False,True
167527,sr_legacy_food,"Bread, pound cake type, pan de torta salvadoran",,2019-04-01,"[bread, pound cake type, pan de torta salvadoran]",47,3,0,0,False,True
167528,sr_legacy_food,"Pastry, Pastelitos de Guava (guava pastries)",,2019-04-01,"[pastry, pastelitos de guava guava pastry]",44,2,0,0,False,True
167532,sr_legacy_food,"Bread, white wheat",,2019-04-01,"[bread, white wheat]",18,2,0,0,False,True
...,...,...,...,...,...,...,...,...,...,...,...
2346349,survey_fndds_food,"Sports drink, NFS",7206,2022-10-28,"[sport drink, nfs]",17,2,1,0,True,False
2346352,survey_fndds_food,"Sports drink, low calorie",7104,2022-10-28,"[sport drink, low calorie]",25,2,0,1,True,False
2346353,survey_fndds_food,"Fluid replacement, electrolyte solution",7206,2022-10-28,"[fluid replacement, electrolyte solution]",39,2,0,0,True,False
2346354,survey_fndds_food,"Fluid replacement, 5% glucose in water",7206,2022-10-28,"[fluid replacement, 5% glucose water]",38,2,0,0,True,False


In [89]:
food_df.drop(['food_category_id', 'publication_date'], axis=1, inplace=True)

There's a lot of unnecessary data here; let's keep only what we need.

In [90]:
food_portion_df

Unnamed: 0_level_0,Unnamed: 1_level_0,seq_num,amount,measure_unit_id,portion_description,modifier,gram_weight,data_points,footnote,min_year_acquired,portion_cleaned,measure_unit,combined_description,unit_tags,unit_remainders,unit_type,portion_amount,portion_unit
fdc_id,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
167516,81553,1,1.0,9999,,"waffle, square",39.0,10,,,,,waffle square,[],"[waffle, square]",portion,,
167516,81554,2,1.0,9999,,"waffle, round",38.0,40,,,,,waffle round,[],"[waffle, round]",portion,,
167517,81556,2,1.0,9999,,"waffle round (4"" dia)",33.0,,,,,,"waffle round 4 ""dia",[4.0],"[waffle, round, dia]",portion,,
167518,81557,1,1.0,9999,,waffle,35.0,,,,,,waffle,[],[waffle],portion,,
167519,81558,1,1.0,9999,,"waffle, round (4""dia)",32.0,,,,,,"waffle round 4 ""dia",[4.0],"[waffle, round, dia]",portion,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
175303,95995,2,1.0,9999,,lb,453.6,,,,,,lb,[pound],[],weight,,
323604,119060,1,1.0,1038,,,28.4,1,,2017,,oz,oz,[ounce],[],weight,,
323697,119063,1,1.0,1038,,,28.4,1,,2017,,oz,oz,[ounce],[],weight,,
329596,119694,1,1.0,1038,,,28.4,1,,2017,,oz,oz,[ounce],[],weight,,


In [91]:
food_portion_df.drop(columns=['measure_unit_id','portion_description','modifier','data_points','footnote','min_year_acquired','measure_unit', 'portion_cleaned'],inplace=True)
food_portion_df.rename(columns={'combined_description': 'description'}, inplace=True)

In [92]:
food_portion_df

Unnamed: 0_level_0,Unnamed: 1_level_0,seq_num,amount,gram_weight,description,unit_tags,unit_remainders,unit_type,portion_amount,portion_unit
fdc_id,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
167516,81553,1,1.0,39.0,waffle square,[],"[waffle, square]",portion,,
167516,81554,2,1.0,38.0,waffle round,[],"[waffle, round]",portion,,
167517,81556,2,1.0,33.0,"waffle round 4 ""dia",[4.0],"[waffle, round, dia]",portion,,
167518,81557,1,1.0,35.0,waffle,[],[waffle],portion,,
167519,81558,1,1.0,32.0,"waffle round 4 ""dia",[4.0],"[waffle, round, dia]",portion,,
...,...,...,...,...,...,...,...,...,...,...
175303,95995,2,1.0,453.6,lb,[pound],[],weight,,
323604,119060,1,1.0,28.4,oz,[ounce],[],weight,,
323697,119063,1,1.0,28.4,oz,[ounce],[],weight,,
329596,119694,1,1.0,28.4,oz,[ounce],[],weight,,


## Syncing Food DF & Portion DF Entries

Only the foods with entries in the portion DF will be of use.

In [93]:
food_df = food_df[food_df.index.isin(food_portion_df.reset_index(1).index.unique())]

In [94]:
food_portion_df = food_portion_df[food_portion_df.index.get_level_values(0).isin(food_df.index.unique())]

# Saving

In [95]:
food_df

Unnamed: 0_level_0,data_type,description,description_list,description_length,description_list_length,default_word_count,exclusion_word_count,volume_exists,portion_exists
fdc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
167525,sr_legacy_food,"Tostada shells, corn","[tostada shell, corn]",20,2,0,0,False,True
167526,sr_legacy_food,"Bread, salvadoran sweet cheese (quesadilla salvadorena)","[bread, salvadoran sweet cheese quesadilla salvadorena]",55,2,0,0,False,True
167527,sr_legacy_food,"Bread, pound cake type, pan de torta salvadoran","[bread, pound cake type, pan de torta salvadoran]",47,3,0,0,False,True
167528,sr_legacy_food,"Pastry, Pastelitos de Guava (guava pastries)","[pastry, pastelitos de guava guava pastry]",44,2,0,0,False,True
167532,sr_legacy_food,"Bread, white wheat","[bread, white wheat]",18,2,0,0,False,True
...,...,...,...,...,...,...,...,...,...
2346349,survey_fndds_food,"Sports drink, NFS","[sport drink, nfs]",17,2,1,0,True,False
2346352,survey_fndds_food,"Sports drink, low calorie","[sport drink, low calorie]",25,2,0,1,True,False
2346353,survey_fndds_food,"Fluid replacement, electrolyte solution","[fluid replacement, electrolyte solution]",39,2,0,0,True,False
2346354,survey_fndds_food,"Fluid replacement, 5% glucose in water","[fluid replacement, 5% glucose water]",38,2,0,0,True,False


In [96]:
food_portion_df

Unnamed: 0_level_0,Unnamed: 1_level_0,seq_num,amount,gram_weight,description,unit_tags,unit_remainders,unit_type,portion_amount,portion_unit
fdc_id,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
167525,81568,1,1.0,12.3,piece,[piece],[],portion,,
167525,81569,2,3.0,37.0,piece mean serving weight aggregated brand,"[piece, portion]","[mean, weight, aggregated, brand]",portion,,
167526,81570,1,1.0,55.0,serving approximate serving size,"[portion, portion]","[approximate, size]",portion,,
167526,81571,2,1.0,399.0,cake square average weight whole item,[whole],"[cake, square, average, weight, item]",portion,,
167527,81572,1,1.0,55.0,serving,[portion],[],portion,,
...,...,...,...,...,...,...,...,...,...,...
2346354,290499,2,1.0,30.0,fl oz,[fluid_ounce],[],volume,,
2346354,290500,3,1.0,120.0,bottle 4 oz,"[4.0, ounce]",[bottle],weight,4.0,ounce
2346355,290503,4,1.0,31.0,fl oz ice,[fluid_ounce],[ice],volume,,
2346355,290504,5,1.0,23.0,fl oz ice,[fluid_ounce],[ice],volume,,


In [97]:
food_df.to_feather('../data/local/density/full/food/0.feather')
food_portion_df.to_feather('../data/local/density/full/food_portion/0.feather')