In [1]:
%load_ext autoreload
%autoreload 2

In [14]:
import pandas as pd
import numpy as np

# Load USA food data

### Nutrient info in databases
- https://www.nzdl.org/cgi-bin/library.cgi?e=d-00000-00---off-0mhl--00-0----0-10-0---0---0direct-10---4-------0-1l--11-en-50---20-about---00-0-1-00-0-0-11-1-0utfZz-8-00&a=d&c=mhl&cl=CL1.1&d=HASHc5831578d1d2af498d537a.5.2.4

### Foundation Food Field Description
https://fdc.nal.usda.gov/docs/Foundation_Foods_Documentation_Apr2020.pdf

Foundation Foods does not provide data on all nutrients. This is because of the uniqueness of the
data: 
- Some nutrients are not found in certain foods (e.g., cholesterol in plant foods, protein in oils).
- Some nutrients in a food have not yet been analyzed. Data analyses are continually conducted
and as data on nutrients are obtained, values will be added to food profiles.

#### Proximate fields
“Proximate component” refers to the following macronutrients: water (moisture), protein, total lipid
(fat), total carbohydrate, and ash. Except for a few food items, nutrient profiles contain values for the
proximate components and at least one other nutrient.
- Carbohydrate content, referred to as “carbohydrate by difference” in the tables, is expressed as the
difference between 100 and the sum of the percentages of water, protein, total lipid (fat), ash, and
alcohol (when present). **Note: this value is a percentage.**
- “Sugars, total NLEA” refers to the sum of the values for individual monosaccharides (galactose, glucose,
and fructose) and disaccharides (sucrose, lactose, and maltose), which are those sugars analyzed for
nutrition labelling. Because the analyses of total dietary fiber, total sugars, and starch content are
conducted separately and reflect the analytical variability inherent in the measurement process, the
sum of these carbohydrate fractions may not equal the carbohydrate-by-difference value or may even
exceed it.
- Food energy is expressed in kcal and kJ. One kcal equals 4.184 kJ. The data represent physiologically
available energy, which is the value remaining after digestive and urinary losses are deducted from gross
energy (Merrill and Watt, 1973). Most energy values are calculated using the default factors of 4, 9, and
4 for protein, fat, and carbohydrates, respectively. Calorie factors for protein, fat, and carbohydrates are
included in the Food Descriptions table for many food items. For foods containing alcohol, a factor of 6.93 is used to calculate kcal/g of alcohol (Merrill and Watt, 1973).
- Vitamins reported in the database include ascorbic acid, thiamin, riboflavin, niacin, pantothenic acid,
vitamin B6, vitamin B12, folate, choline, vitamin A, vitamin D, vitamin E, and vitamin K. Many of the values
were obtained in small sample sizes, often of composited samples.
- Protein. The values for protein are calculated from the amount of total nitrogen in the food using the
nitrogen-to-protein conversion factors recommended by Jones (1941) for most food items. The factor applied
to each food item is provided in the NFactor field in the Food Description table. Values in Foundation Foods
are now listed as “calculated.” This differs from the approach taken in SR Legacy, which denotes protein as
“analytical.”
- Lipid component. Fatty acid values are expressed in g per 100 g of food. Logically, the sum of the fatty acids may not add
up to the value for total lipid. Total lipid values used on food labels represent the amount of triglyceride
that would produce the amount of lipid fatty acids determined using gas chromatography, as required
by the NLEA.
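
The proximate arithmetic described above (carbohydrate by difference, the default 4/9/4 energy factors, the 6.93 kcal/g alcohol factor, and the kcal to kJ conversion) can be sketched as follows. This is a minimal illustration, not code from FoodData Central or `nutritransform`: the function names are hypothetical and the input values are made up.

```python
KJ_PER_KCAL = 4.184  # one kcal equals 4.184 kJ


def carbohydrate_by_difference(water, protein, fat, ash, alcohol=0.0):
    """Carbohydrate in g per 100 g: 100 minus the other proximate components."""
    return 100.0 - (water + protein + fat + ash + alcohol)


def energy_kcal(protein, fat, carbohydrate, alcohol=0.0,
                factors=(4.0, 9.0, 4.0), alcohol_factor=6.93):
    """Energy in kcal per 100 g using per-gram factors for protein, fat, carbohydrate."""
    protein_f, fat_f, carb_f = factors
    return (protein_f * protein + fat_f * fat
            + carb_f * carbohydrate + alcohol_factor * alcohol)


# Made-up proximate values in g per 100 g (not taken from the database)
carbs = carbohydrate_by_difference(water=60.0, protein=8.0, fat=10.0, ash=2.0)
kcal = energy_kcal(protein=8.0, fat=10.0, carbohydrate=carbs)
kj = kcal * KJ_PER_KCAL
print(carbs, kcal, kj)
```

Food-specific calorie factors, where the Food Descriptions table provides them, could be passed through the `factors` argument instead of the defaults.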

In [None]:
from nutritransform import load_food_data

In [4]:
food_data_cooked = load_food_data() # only cooked data
food_data_cooked.head()

# food_data = load_food_data(filter_uncooked=False) # all data, not filtered

Unnamed: 0,name,parent_name,"Cryptoxanthin, beta",Lycopene,"Tocopherol, delta","Tocotrienol, gamma","Tocotrienol, delta","Vitamin C, total ascorbic acid",Thiamin,Riboflavin,...,PUFA 20:3 n-6,"Fluoride, F",PUFA 18:2 i,SFA 13:0,Phytosterols,PUFA 2:4 n-6,MUFA 18:1-11 t (18:1t n-7),bow,title_simple,title_simple_reversed
0,"Hummus, commercial","Hummus, commercial",3.0,0.0,1.3,0.0,0.0,0.0,0.15,0.115,...,,,,,,,,"[Hummus, commercial]",hummus commercial,commercial hummus
2,"Beans, snap, green, canned, regular pack, drai...","Beans, snap, green, canned, regular pack, drai...",,,,,,,,,...,,,,,,,,"[Beans, snap, green, canned, regular pack, dra...",beans snap green canned regular pack drained s...,drained solids regular pack canned green snap ...
4,"Nuts, almonds, dry roasted, with salt added","Nuts, almonds, dry roasted, with salt added",9.0,0.0,0.0,0.0,0.0,0.0,0.079,1.57,...,,,,,,,,"[Nuts, almonds, dry roasted, with salt added]",nuts almonds dry roasted with salt added,dry roasted almonds nuts with salt added
8,"Egg, white, dried","Egg, white, dried",,,,,,,,,...,,,,,,,,"[Egg, white, dried]",egg white dried,dried white egg
9,"Onion rings, breaded, par fried, frozen, prepa...","Onion rings, breaded, par fried, frozen, prepa...",,,1.96,0.0,0.0,1.6,0.196,0.116,...,,,,,,,,"[Onion rings, breaded, par fried, frozen, prep...",onion rings breaded par fried frozen prepared ...,heated in oven prepared frozen par fried bread...


# Get embeddings for concatenated (or reversed) title

In [16]:
from nutritransform import generate_embedding_dict, retrieve_nutrition, nutrition_df_apply_thresholds, compute_metric_df

In [29]:
title_col = 'title_simple_reversed'
food_to_label = ['chicken with broccoli', 'doener kebap', 'beef steak', 'burek']

In [18]:
embedded_db_dict = generate_embedding_dict(food_data_cooked[title_col].values)

In [30]:
# Build a DataFrame with the n closest entries per query. Choose a larger n (e.g. for a grid search), or the exact value if you are sure about the config.
df_res_gt = retrieve_nutrition(food_to_label, food_data_cooked, embedded_db_dict, title_col=title_col, n=5)
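
How `retrieve_nutrition` ranks candidates is not shown in this notebook. The `similarity` column it returns suggests a nearest-neighbour lookup over the title embeddings, which could look roughly like the sketch below. Cosine similarity, the helper name `top_n_by_cosine`, and the toy 2-d vectors are all assumptions made for illustration.

```python
import numpy as np


def top_n_by_cosine(query_vec, db_vecs, n=5):
    """Return indices and cosine similarities of the n database rows closest to the query."""
    query = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = db @ query                 # cosine similarity per database row
    order = np.argsort(-sims)[:n]     # highest similarity first
    return order, sims[order]


# Toy 2-d "embeddings"; real sentence embeddings have hundreds of dimensions
db_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
idx, sims = top_n_by_cosine(np.array([1.0, 0.1]), db_vecs, n=2)
print(idx, sims)
```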

In [31]:
df_res_gt.head()

Unnamed: 0,name,parent_name,"Cryptoxanthin, beta",Lycopene,"Tocopherol, delta","Tocotrienol, gamma","Tocotrienol, delta","Vitamin C, total ascorbic acid",Thiamin,Riboflavin,...,PUFA 18:2 i,SFA 13:0,Phytosterols,PUFA 2:4 n-6,MUFA 18:1-11 t (18:1t n-7),bow,title_simple,title_simple_reversed,similarity,match_title
0,Chicken with gravy,Poultry mixed dishes,0.0,0.0,,,,0.0,0.056,0.143,...,,,,,,[Chicken with gravy],chicken with gravy,chicken with gravy,0.753817,chicken with broccoli
1,"Babyfood, dinner, broccoli and chicken, junior","Babyfood, dinner, broccoli and chicken, junior",0.0,0.0,0.0,0.0,0.0,17.7,0.019,0.098,...,,,,,,"[Babyfood, dinner, broccoli and chicken, junior]",babyfood dinner broccoli and chicken junior,junior broccoli and chicken dinner babyfood,0.740719,chicken with broccoli
2,Beef and broccoli,Stir-fry and soy-based sauce mixtures,0.0,0.0,,,,40.0,0.06,0.127,...,,,,,,[Beef and broccoli],beef and broccoli,beef and broccoli,0.738814,chicken with broccoli
3,Biryani with chicken,Rice mixed dishes,5.0,310.0,,,,6.2,0.073,0.079,...,,,,,,[Biryani with chicken],biryani with chicken,biryani with chicken,0.734347,chicken with broccoli
4,Chicken with mole sauce,Poultry mixed dishes,17.0,267.0,,,,0.1,0.061,0.144,...,,,,,,[Chicken with mole sauce],chicken with mole sauce,chicken with mole sauce,0.71917,chicken with broccoli


In [32]:
# Group by query and select the rows to use, depending on the parameters (n, sim_thresh)
df_nutres_thresh = nutrition_df_apply_thresholds(df_res_gt, n=3, sim_thresh=0.0)

In [33]:
# Aggregate the selected rows into the final values (here: the mean)
df_nutres_metrics = compute_metric_df(df_nutres_thresh, np.mean)
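
The two steps above (keep the best matches per query, then aggregate their nutrient values) can be approximated in plain pandas. This is a sketch under assumptions, not the `nutritransform` implementation: the helper `top_matches`, the `>=` threshold semantics, and the toy data are hypothetical.

```python
import pandas as pd

# Toy match table: two queries with made-up similarities and thiamin values
matches = pd.DataFrame({
    "match_title": ["a", "a", "a", "b", "b"],
    "similarity": [0.9, 0.8, 0.3, 0.7, 0.6],
    "Thiamin": [0.10, 0.20, 0.90, 0.30, 0.50],
})


def top_matches(df, n, sim_thresh):
    """Keep at most n rows per query whose similarity is at least sim_thresh."""
    kept = df[df["similarity"] >= sim_thresh]
    kept = kept.sort_values("similarity", ascending=False)
    return kept.groupby("match_title", sort=False).head(n)


kept = top_matches(matches, n=2, sim_thresh=0.5)
# Average the nutrient over the retained matches of each query
estimates = kept.groupby("match_title")["Thiamin"].mean()
print(estimates)
```

Swapping `.mean()` for `.median()` would give an estimate that is less sensitive to a single poorly matched entry.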

In [38]:
# Check out how it leverages semantic context: the embedding of this query seems to encode "Germanness"
df_res_gt[df_res_gt.match_title == 'doener kebap']

Unnamed: 0,name,parent_name,"Cryptoxanthin, beta",Lycopene,"Tocopherol, delta","Tocotrienol, gamma","Tocotrienol, delta","Vitamin C, total ascorbic acid",Thiamin,Riboflavin,...,PUFA 18:2 i,SFA 13:0,Phytosterols,PUFA 2:4 n-6,MUFA 18:1-11 t (18:1t n-7),bow,title_simple,title_simple_reversed,similarity,match_title
0,"Knackwurst, knockwurst, pork, beef","Knackwurst, knockwurst, pork, beef",0.0,0.0,,,,0.0,0.342,0.14,...,,,0.0,,,"[Knackwurst, knockwurst, pork, beef]",knackwurst knockwurst pork beef,beef pork knockwurst knackwurst,0.46176,doener kebap
1,"Cookie, Lebkuchen",Cookies and brownies,0.0,0.0,,,,0.0,0.355,0.264,...,,,,,,"[Cookie, Lebkuchen]",cookie lebkuchen,lebkuchen cookie,0.442114,doener kebap
2,"Bagel, pumpernickel",Bagels and English muffins,0.0,0.0,,,,0.0,0.403,0.209,...,,,,,,"[Bagel, pumpernickel]",bagel pumpernickel,pumpernickel bagel,0.435495,doener kebap
3,"Bread, pumpernickel",Yeast breads,0.0,0.0,,,,0.0,0.327,0.305,...,,,,,,"[Bread, pumpernickel]",bread pumpernickel,pumpernickel bread,0.424304,doener kebap
4,Knockwurst,Sausages,0.0,0.0,,,,0.0,0.342,0.14,...,,,,,,[Knockwurst],knockwurst,knockwurst,0.415284,doener kebap


In [40]:
# The other queries are matched fairly accurately
df_res_gt[df_res_gt.match_title == 'beef steak']

Unnamed: 0,name,parent_name,"Cryptoxanthin, beta",Lycopene,"Tocopherol, delta","Tocotrienol, gamma","Tocotrienol, delta","Vitamin C, total ascorbic acid",Thiamin,Riboflavin,...,PUFA 18:2 i,SFA 13:0,Phytosterols,PUFA 2:4 n-6,MUFA 18:1-11 t (18:1t n-7),bow,title_simple,title_simple_reversed,similarity,match_title
0,"Beef, steak, chuck","Beef, excludes ground",0.0,0.0,,,,0.0,0.088,0.241,...,,,,,,"[Beef, steak, chuck]",beef steak chuck,chuck steak beef,0.880018,beef steak
1,"Beef, steak, flank","Beef, excludes ground",0.0,0.0,,,,0.0,0.082,0.345,...,,,,,,"[Beef, steak, flank]",beef steak flank,flank steak beef,0.851766,beef steak
2,"Beef, sandwich steak","Beef, excludes ground",0.0,0.0,,,,0.0,0.056,0.151,...,,,,,,"[Beef, sandwich steak]",beef sandwich steak,sandwich steak beef,0.832156,beef steak
3,"Beef, steak, tenderloin","Beef, excludes ground",0.0,0.0,,,,0.0,0.07,0.438,...,,,,,,"[Beef, steak, tenderloin]",beef steak tenderloin,tenderloin steak beef,0.831113,beef steak
4,"Beef, roast","Beef, excludes ground",0.0,0.0,,,,0.0,0.073,0.258,...,,,,,,"[Beef, roast]",beef roast,roast beef,0.818014,beef steak


# Load all food data from Reddit posts

70% of the title has to specifically describe what the food is. Backstory to the food / where the ingredients come from do not fall within the 70%. Use a comment on the post for backstory. No ALL CAPS, emojis or NON-OC.

Add a Tag at the Front of Your Title:
- [I ate] - food you ate in a restaurant
- [Homemade] - food you made at home
- [Pro/Chef] - food you made if you work in the food industry


Rule changes over the years:
 - "Please describe the food in the title" https://www.reddit.com/r/food/comments/3vvk91/mod_post_rule_clarification_we_require/
 - "Going forward, all link posts must have "only one" of the following tags" - with tag description
     - https://www.reddit.com/r/food/comments/56nlo5/mod_post_psa_update_please_see_the_following_rule/
 - Announced in February 2020 (tags, 100% of titles need to be food, etc.)
        - https://www.reddit.com/r/food/comments/exg935/a_massive_rule_overhaul/
        - when did they take effect?
 - New title rules (70% of the title must describe the food only; take effect in December 2021)
        - https://www.reddit.com/r/food/comments/rhok28/announcement_new_titles_rules_are_going_to_be/
 - post flairs were gone but came back in June 2020: https://www.reddit.com/r/food/comments/gwsbg9/post_flairs_are_back/
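
Given the tag rule above, a title can be split into its tag and the food description with a small regular expression. This is only an illustrative sketch: `split_tag` is a hypothetical helper, and whether `filter_relevant_data` does anything similar is not shown here. Titles posted before the tag rules took effect would simply come back without a tag.

```python
import re

# Matches one of the three allowed tags at the front of a title, case-insensitively
TAG_PATTERN = re.compile(r"^\s*\[(i ate|homemade|pro/chef)\]\s*(.*)$", re.IGNORECASE)


def split_tag(title):
    """Return (tag, remaining title); tag is None when no known tag is present."""
    match = TAG_PATTERN.match(title)
    if match is None:
        return None, title.strip()
    return match.group(1).lower(), match.group(2).strip()


print(split_tag("[Homemade] Chicken with broccoli"))
print(split_tag("Burek"))
```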

In [None]:
from nutritransform import filter_relevant_data, load_all_food_submissions

In [None]:
df_food_subs = filter_relevant_data(load_all_food_submissions())