###### To-Do

Add palmitic acid, stearic acid, EPA, and DHA to the cleaned food df and constraints

###### attributions

[1) Set default .head() to 3 rows -- Ted Petrou, Dunder Data](https://medium.com/dunder-data/pandas-trick-1-change-the-default-number-of-rows-returned-from-the-head-method-bc7c21ce0d53)

###### imports

In [1]:
import pandas as pd # for dataframe analysis
import numpy as np # for arrays functionality, i.e. .where()...
import re
from functools import partialmethod # for changing default of pandas .head()
import difflib # compare naming of nutrients between constraints and foods DFs
from notebooks_module.data_munging import format_cols, float_match, \
find_duplicate_columns, find_exact_matches, getUnit, getNumber

###### set defaults

In [2]:
pd.DataFrame.head = partialmethod(pd.DataFrame.head, n=3)

### Whfoods

###### data cleaning

In [3]:
whfoods = pd.read_csv('../01_Data/whfoods.csv')
whfoods.index = range(whfoods.shape[0])
whfoods.head()

Unnamed: 0,"Asparagus, Cooked",Unnamed: 1,Unnamed: 2,"Avocado, cubed, raw",Unnamed: 4,Unnamed: 5,"Beet Greens, boiled",Unnamed: 7,Unnamed: 8,"Beets, sliced, cooked",...,Unnamed: 341,"Sage, dried",Unnamed: 343,Unnamed: 344,"Thyme, fresh",Unnamed: 346,Unnamed: 347,"Turmeric, ground",Unnamed: 349,Unnamed: 350
0,BASIC MACRONUTRIENTS AND CALORIES,,,BASIC MACRONUTRIENTS AND CALORIES,,,BASIC MACRONUTRIENTS AND CALORIES,,,BASIC MACRONUTRIENTS AND CALORIES,...,,BASIC MACRONUTRIENTS AND CALORIES,,,BASIC MACRONUTRIENTS AND CALORIES,,,BASIC MACRONUTRIENTS AND CALORIES,,
1,nutrient,amount,DRI/DV,nutrient,amount,DRI/DV,nutrient,amount,DRI/DV,nutrient,...,DRI/DV,nutrient,amount,DRI/DV,nutrient,amount,DRI/DV,nutrient,amount,DRI/DV
2,,,(%),,,(%),,,(%),,...,(%),,,(%),,,(%),,,(%)


In [4]:
# investigating dimensions of dataset

nfoods = len(whfoods.columns)/3 # there are 117 foods in the dataset.
nfoods
whfoods.shape
nnutrients = whfoods.shape[0]

There are 174 "nutrient" rows, though many of these are not real nutrients.

**Initial assessment of dataset**\
\
*SHAPE:* There are three columns per food, 174 rows, and 351 columns.  \
\
*INFORMATION:* There is no serving size in grams for each food, though from inspecting the Excel spreadsheet it appears possible to reconstruct this value from the amounts of the macronutrients, micronutrients, water, and ash.\
\
*FORMATTING:* For each food, the three columns give, in order, the name of the component, its amount (in grams, milligrams, or micrograms), and the percent DRI/DV where applicable.

**Action Items**
1) Generate two dataframes -- one for amount, one for DRI/DV
2) Standardize missing value indicators
3) Eliminate rows with one unique value.
4) Separate out units for the amount df into a reference dictionary of {nutrient1: 'mg', nutrient2: 'g', nutrient3: 'mcg', ...}
5) Build the LP problem with PuLP

In [5]:
# initial pass on removing extraneous rows, columns.
whfoods = whfoods.dropna(how='all',axis=1).dropna(how='all', axis=0)

Fewer than 173 of the rows are nutrient categories, since some rows are supercategory headings such as 'Minerals', 'INDIVIDUAL FATTY ACIDS', 'Monounsaturated Fats', 'INDIVIDUAL AMINO ACIDS', 'OTHER COMPONENTS', etc., and others are extraneous rows such as 'nutrient' or NaN.

The dataset could be filtered further by removing rows where every food has 0.00 (g, mg, etc.) or every food has '--', as is the case for the sweeteners, caffeine, and alcohol.

I'll create two dictionaries, one of the foods and the raw nutrient values and another of the foods and DRI/DV.  These can be used to generate data frames.  The raw nutrient values and the DRI/DV info can be used to generate the 100% DRI/DV values for each nutrient.  Rounding errors can be minimized by using the food with the highest DRI/DV for a given nutrient to generate the recommendation.
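The derivation described above can be sketched as follows. This is only a minimal illustration, assuming the DRI/DV values are rounded whole-number percentages of a single reference intake. The helper name `estimate_full_dv` is hypothetical and not part of the notebook. The sample numbers are the dietary fiber amounts and DRI/DV percentages for asparagus and avocado shown in the previews further down.

```python
def estimate_full_dv(amounts, percents):
    """Estimate the amount that corresponds to 100% DRI/DV for one nutrient.

    amounts  -- {food: amount in the nutrient's unit}
    percents -- {food: rounded percent DRI/DV for that amount}
    """
    # Keep only foods with a nonzero percentage and a known amount.
    usable = {f: p for f, p in percents.items() if p and f in amounts}
    if not usable:
        return None
    # The food with the largest percentage has the smallest relative
    # rounding error, so it gives the most reliable estimate.
    best = max(usable, key=usable.get)
    return amounts[best] / usable[best] * 100


fiber_g = {'asparagus_cooked': 3.60, 'avocado_cubed_raw': 10.05}
fiber_pct = {'asparagus_cooked': 13, 'avocado_cubed_raw': 36}
fiber_full_dv = estimate_full_dv(fiber_g, fiber_pct)
print(round(fiber_full_dv, 2))
```

Both foods imply a reference intake close to 28 g; the avocado row is used because its percentage is larger.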

###### column name formatting

In [6]:
# gather all the foods into a list
foods = [food for inx,food in enumerate(whfoods.columns) if inx%3 ==0]
len(foods) # verify the number of foods

117

In [7]:
new_cols = []

for i, food in enumerate(foods):
    # Replace spaces and commas with underscores, collapse repeated
    # underscores, make lowercase, and finally replace hyphens
    food = re.sub(r'-','_',re.sub(r'_+', '_', 
                  re.sub(r',+', '_', 
                         re.sub(r' +', '_', food))).lower())

    new_cols = new_cols + [food, f'nv_{food}', f'drv_{food}']

whfoods.columns = new_cols
foods = [food for inx,food in enumerate(whfoods.columns) if inx%3 ==0]

whfoods.columns

Index(['asparagus_cooked', 'nv_asparagus_cooked', 'drv_asparagus_cooked',
       'avocado_cubed_raw', 'nv_avocado_cubed_raw', 'drv_avocado_cubed_raw',
       'beet_greens_boiled', 'nv_beet_greens_boiled', 'drv_beet_greens_boiled',
       'beets_sliced_cooked',
       ...
       'drv_rosemary_fresh', 'sage_dried', 'nv_sage_dried', 'drv_sage_dried',
       'thyme_fresh', 'nv_thyme_fresh', 'drv_thyme_fresh', 'turmeric_ground',
       'nv_turmeric_ground', 'drv_turmeric_ground'],
      dtype='object', length=351)

###### separate nutrient val, drv info

In [8]:
# collect a nested dictionary of... {food: {nutrient:nutrient_val}}
nutrient_vals = \
{food:
    {whfoods.loc[i,food]: 
     whfoods.iloc[i,int(np.where(whfoods.columns.values==food)[0][0])+1] 
     for i in range (2, nnutrients)
    } for food in foods
}

# collect a nested dictionary of... {food: {nutrient:nutrient_drv}}
nutrient_drv = \
{food:
    {whfoods.loc[i,food]: 
     whfoods.iloc[i,int(np.where(whfoods.columns.values==food)[0][0])+2] 
     for i in range (2, nnutrients)
    } for food in foods
}

# convert to DataFrame
nv_df = pd.DataFrame(nutrient_vals)
drv_df = pd.DataFrame(nutrient_drv)

# Transposing so that the foods are like "observations" in long format
nv_df = nv_df.T
drv_df = drv_df.T

###### Cleaning

Remove all columns that have uniform values

In [9]:
# credit: ChatGPT
def remove_columns_with_same_value(df):
    unique_counts = df.nunique()
    columns_to_remove = unique_counts[(unique_counts == 1)|(unique_counts == 0)].index
    df = df.drop(columns=columns_to_remove)
    return df

In [10]:
nv_df = remove_columns_with_same_value(nv_df)

In [11]:
nv_df.head()

Unnamed: 0,Protein,Carbohydrates,Fat - total,Dietary Fiber,Calories,Starch,Total Sugars,Monosaccharides,Fructose,Glucose,...,Sugar Alcohols (Total),Glycerol,Inositol,Mannitol,Sorbitol,Xylitol,Artificial Sweeteners (Total),Aspartame,Saccharin,Caffeine
asparagus_cooked,4.32 g,7.40 g,0.40 g,3.60 g,39.6,-- g,2.34 g,2.18 g,1.42 g,0.76 g,...,-- g,-- g,-- g,-- g,-- g,-- g,-- mg,-- mg,-- mg,0.00 mg
avocado_cubed_raw,3.00 g,12.80 g,21.99 g,10.05 g,240.0,-- g,0.99 g,0.89 g,0.18 g,0.56 g,...,-- g,-- g,-- g,-- g,-- g,-- g,-- mg,-- mg,-- mg,0.00 mg
beet_greens_boiled,3.70 g,7.86 g,0.29 g,4.18 g,38.88,-- g,0.86 g,-- g,-- g,-- g,...,-- g,-- g,-- g,-- g,-- g,-- g,-- mg,-- mg,-- mg,0.00 mg


In [12]:
drv_df = remove_columns_with_same_value(drv_df)

In [13]:
drv_df.head()

Unnamed: 0,Protein,Carbohydrates,Fat - total,Dietary Fiber,Calories,Vitamin B1,Vitamin B2,Vitamin B3,Vitamin B6,Vitamin B12,...,Iron,Magnesium,Manganese,Molybdenum,Phosphorus,Potassium,Selenium,Sodium,Zinc,Omega-3 Fatty Acids
asparagus_cooked,9,3,1,13,2,24,19,12,8,0,...,9,6,12,--,14,9,20,2,10,2
avocado_cubed_raw,6,6,28,36,13,8,15,16,23,0,...,5,10,9,--,11,15,1,1,9,8
beet_greens_boiled,7,3,0,15,2,14,32,5,11,0,...,15,23,32,--,8,28,2,23,7,0


Reformatting columns for nv_df & drv_df

In [14]:
nv_df = format_cols(nv_df)
drv_df = format_cols(drv_df)

###### Formatting

Replace '--' with '0.00'. This assumes that '--' means a zero amount. If '--' instead means that no value was recorded, it should be replaced with NaN.

In [15]:
for col in nv_df.columns:
    nv_df[col] = nv_df[col].str.replace('--','0.00')
nv_df = remove_columns_with_same_value(nv_df)

###### Strip units, create {nutrient: unit} dictionary

The whfoods data contain the following units:\
mg (ATE), mcg (RE), mcg (RAE), IU, g, mg, mcg
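As an illustration of this step, a value such as `'4.32 g'` or `'-- mg'` can be split into a number and a unit with a single regular expression. This is a sketch only: `split_amount` is a hypothetical helper, not one of the functions imported from `notebooks_module`, and it returns `None` for `'--'` instead of committing to `0.00`.

```python
import re

# A number (or the '--' placeholder), then an optional unit such as
# 'g', 'mg', 'mcg', 'IU', 'mg (ATE)', 'mcg (RE)' or 'mcg (RAE)'.
AMOUNT_RE = re.compile(
    r'^\s*(?P<num>--|\d+(?:\.\d+)?)\s*'
    r'(?P<unit>(?:mcg|mg|g|IU)(?:\s*\((?:ATE|RE|RAE)\))?)?\s*$',
    re.IGNORECASE)


def split_amount(text):
    """Return (value, unit); value is None for '--', unit is None if absent."""
    m = AMOUNT_RE.match(text)
    if m is None:
        raise ValueError(f'unrecognized amount: {text!r}')
    num = m.group('num')
    value = None if num == '--' else float(num)
    return value, m.group('unit')


print(split_amount('4.32 g'))
print(split_amount('-- mg'))
```

Keeping the parenthesized qualifier with the unit distinguishes, for example, `mcg (RE)` from `mcg (RAE)`.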

In [16]:
# Generate a dictionary of units for the different info types (serving_size, 
# calories, etc.)
units_dict = {}
units_pattern = re.compile(r"[mcgiu\)\(]+",re.I) # generate pattern object
for r in range(1,len(nv_df.columns)):  # for each of the nutrient labels
    s = nv_df.iat[0,r] # get nutrient value
    m = units_pattern.search(s) # check pattern object against nutrient value, 
                        # generating match object
    try:
        units_dict[nv_df.columns[r]] = m.group()
    except:
        i=0
        while not m:
            # loop through until we have a unit or decide there are no units
            i = i+1 # go to next value
            s = nv_df.iat[i,r] # store value
            m = units_pattern.search(s) # search value against the pattern
            try: # store match if there is one
                units_dict[nv_df.columns[r]] = m.group() 
            except: 
                if i==nv_df.shape[0]-1: # if no units found by end, None units
                    m = "None"
                    units_dict[nv_df.columns[r]] = m

In [17]:
# convert drv from str to int
for nutrient in drv_df.columns.values[[0,1]]:
    try: drv_df[[nutrient]] = drv_df[[nutrient]].astype("int")
    except: pass

In [18]:
float_pattern = r'^\d+\.\d+'
for col in nv_df.columns:
    nv_df[col] = nv_df[col].apply(lambda x: float_match(float_pattern, x))

In [19]:
nv_df = remove_columns_with_same_value(nv_df)
drv_df = remove_columns_with_same_value(drv_df)

##### Removing duplicate nutrient columns

The columns that need investigating are: 
* 'Folate', 'Folate (DFE)', and 'Folate (food)' - same for almost all foods.  Keeping 'Folate (DFE)'
* 'Vitamin B3', 'Vitamin B3 (Niacin Equivalents)'
*  'Vitamin A International Units (IU)',
       'Vitamin A mcg Retinol Activity Equivalents (RAE)',
       'Vitamin A mcg Retinol Equivalents (RE)',
       'Retinol mcg Retinol Equivalents (RE)',
       'Carotenoid mcg Retinol Equivalents (RE)', 'Alpha-Carotene'
* 'Vitamin E mg Alpha-Tocopherol Equivalents (ATE)',
       'Vitamin E International Units (IU)', 'Vitamin E mg',

In [20]:
dup_cols_nv = find_duplicate_columns(nv_df)
dup_cols_drv = find_duplicate_columns(drv_df)

In [21]:
dup_cols_nv

['folate',
 'folate_food',
 'vitamin_e_mg_alpha_tocopherol_equivalents_ate',
 'vitamin_e_mg',
 'acetic_acid',
 'sugar_alcohols_total',
 'xylitol']

I'll drop ['folate_food',
 'vitamin_e_mg',
 'acetic_acid',
 'sugar_alcohols_total',
 'xylitol'] from nv_df since the first two are duplicates and the others have no informative values.

In [22]:
nv_df = nv_df.drop(['folate_food', 'vitamin_e_mg', 'acetic_acid', 
                    'sugar_alcohols_total', 'xylitol'],axis=1)

In [23]:
nv_df.shape # there are now 124 nutrients in nv_df

(117, 124)

###### Vitamin D

In [24]:
# ratio of IU to mcg for foods with at least one nonzero vitamin D value
vitamin_d = nv_df.filter(regex='vitamin_d', axis=1)
vitamin_d = vitamin_d[~(vitamin_d == 0).all(axis=1)]
vitamin_d['vitamin_d_international_units_iu'].values / vitamin_d['vitamin_d_mcg']

mushrooms_crimini_raw                       30.857143
mushrooms_shiitake_cooked                   39.803922
chicken_pasture_raised_breast_roasted       51.545455
lamb_grass_fed_lean_loin_roasted            20.636364
turkey_pasture_raised_light_meat_roasted    33.352941
cheese_grass_fed_cheddar_whole_milk         40.000000
cow's_milk_grass_fed                        39.132075
eggs_pasture_raised_large_hard_boiled       39.545455
yogurt_grass_fed_whole_milk                 19.600000
cod_pacific_fillet_baked                    40.029412
salmon_wild_coho_broiled                    39.924278
sardines_atlantic_canned                    40.250575
scallops_steamed                                  inf
shrimp_large_steamed                        41.272727
tuna_yellowfin_fillet_baked                 40.964758
Name: vitamin_d_mcg, dtype: float64
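Most of the ratios above sit near 40, which matches the standard conversion of 1 mcg vitamin D = 40 IU. Rows far from 40, or `inf` where the mcg value is 0, may reflect rounding of small amounts or inconsistent records. A minimal sketch for flagging such rows follows. The function name, the 5% tolerance, and the sample values are assumptions made for illustration.

```python
IU_PER_MCG_VITAMIN_D = 40  # standard conversion: 1 mcg vitamin D = 40 IU


def vitamin_d_mismatch(iu, mcg, rel_tol=0.05):
    """Return True when the IU and mcg values disagree by more than rel_tol."""
    expected_iu = mcg * IU_PER_MCG_VITAMIN_D
    if expected_iu == 0:
        # A nonzero IU value with zero mcg is inconsistent (ratio is inf).
        return iu != 0
    return abs(iu - expected_iu) / expected_iu > rel_tol


# Hypothetical (IU, mcg) pairs
print(vitamin_d_mismatch(40.0, 1.0))   # consistent
print(vitamin_d_mismatch(20.6, 1.0))   # roughly half the expected IU
print(vitamin_d_mismatch(2.0, 0.0))    # IU reported but mcg is zero
```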

###### Folate

In [25]:
folate_cols = [col for col in nv_df.columns if 'folate' in col]

# drop ['Folate', 'Folate (food)'] columns and keep the descriptive 
# 'Folate (DFE)' but rename it to the more concise 'folate'

nv_df = nv_df.drop(columns = ['folate'])
nv_df.rename({'folate_dfe':'folate'},axis=1,inplace=True)

###### Vitamin E

In [26]:
ve_cols = [col for col in nv_df.columns if 'vitamin_e' in col]
# I'll get rid of vitamin_e_international_units_iu and keep only the more modern alpha tocopherol equivalent and simplify the name to vitamin_e.
nv_df = nv_df.drop(columns=['vitamin_e_international_units_iu']).rename(
    {'vitamin_e_mg_alpha_tocopherol_equivalents_ate':'vitamin_e'},axis=1)

###### Vitamin A

In [27]:
va_cols =  nv_df.filter(regex='vitamin_a|carotenoid|retinol').columns
va_df = nv_df[nv_df[va_cols].\
                 apply(lambda x: x.nunique() != 1,axis = 1).values][
            va_cols]

To compare, I'll extract the floating point numbers from the data.

In [28]:
sum(va_df[va_cols[2]] == va_df[va_cols[4]])
# 104 foods are the same for 'Vitamin A mcg Retinol Equivalents (RE)' and 'Carotenoid mcg Retinol Equivalents (RE)'
(va_df[va_df[va_cols[2]] != va_df[va_cols[4]]][[va_cols[2],va_cols[4]]]).head()

Unnamed: 0,vitamin_a_mcg_retinol_equivalents_re,carotenoid_mcg_retinol_equivalents_re
chicken_pasture_raised_breast_roasted,0.0,6.8
lamb_grass_fed_lean_loin_roasted,0.0,34.23
turkey_pasture_raised_light_meat_roasted,3.4,0.0


No two columns are the same, so for now I'll keep all columns that pertain to Vitamin A.

Background:

Retinol Equivalents (RE) are used by the World Health Organization and the Food and Agriculture Organization, while Retinol Activity Equivalents (RAE) are used by the FDA.  Both use the same conversion for animal-based retinol but different conversion rates of carotenoids into vitamin A.  RE is more optimistic about carotenoid conversion, and the recommended minimum intake in mcg RE is lower than the FDA's recommendation in mcg RAE.  \
\
Provitamin A, or Carotenoids, are found only in plants while preformed vitamin A or retinol is found only in animals.  Carotenoids are not toxic at high doses while retinol is.\
\
A promising approach is to constrain the vitamin A contributed by animal products while counting plant sources of carotenoid precursors to vitamin A (Beta-Carotene and Cryptoxanthin) towards the minimum requirement.  
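To make the two conventions concrete: under RAE, 1 mcg of retinol counts the same as 12 mcg of beta-carotene or 24 mcg of alpha-carotene or beta-cryptoxanthin, while under RE the corresponding figures are 6 mcg and 12 mcg. The sketch below applies those factors and separates preformed retinol from the carotenoid contribution. The dictionary keys, function names, and sample amounts are hypothetical and may not match the column names in nv_df.

```python
# mcg of each compound that equals 1 mcg of retinol under each convention
RAE_DIVISORS = {'retinol': 1, 'beta_carotene': 12,
                'alpha_carotene': 24, 'beta_cryptoxanthin': 24}
RE_DIVISORS = {'retinol': 1, 'beta_carotene': 6,
               'alpha_carotene': 12, 'beta_cryptoxanthin': 12}


def retinol_equivalents(amounts_mcg, divisors):
    """Total vitamin A activity in mcg under the given convention."""
    return sum(amounts_mcg.get(name, 0.0) / d for name, d in divisors.items())


def split_vitamin_a_rae(amounts_mcg):
    """Return (preformed retinol, carotenoid contribution) in mcg RAE."""
    preformed = amounts_mcg.get('retinol', 0.0)
    total = retinol_equivalents(amounts_mcg, RAE_DIVISORS)
    return preformed, total - preformed


# Hypothetical food: amounts in mcg
sample = {'retinol': 100.0, 'beta_carotene': 1200.0, 'alpha_carotene': 240.0}
print(retinol_equivalents(sample, RAE_DIVISORS))
print(retinol_equivalents(sample, RE_DIVISORS))
print(split_vitamin_a_rae(sample))
```

With this split, an upper bound could apply to the preformed part only, while the total counts towards the minimum.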

In [29]:
carotenoid_cols = nv_df.filter(regex='carotene|cryptoxanthin').columns.values
va_df[carotenoid_cols] = nv_df[carotenoid_cols]

To avoid deleting useful information, I'll simply create a new column 'vitamin_a' which will be a duplicate of 'vitamin_a_mcg_retinol_activity_equivalents_rae' since this is the new FDA standard.

In [30]:
nv_df['vitamin_a'] = nv_df['vitamin_a_mcg_retinol_activity_equivalents_rae']

###### Vitamin B3

In [31]:
vb3_cols = nv_df.filter(regex='vitamin_b3').columns.values

In [32]:
unique_rows = nv_df[nv_df[vb3_cols].\
                 apply(lambda x: x.nunique() != 1,axis = 1).values][vb3_cols]

The 'vitamin_b3' and 'vitamin_b3_niacin_equivalents' columns differ for most foods.  Niacin equivalents also count tryptophan: 60 mg of tryptophan contributes 1 mg of niacin, according to [Niacin - Harvard Health](https://www.hsph.harvard.edu/nutritionsource/niacin-vitamin-b3/#:~:text=RDA%3A%20Niacin%20is%20measured%20in,mg%20NE%20for%20lactating%20women.)
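The conversion can be written as a one-line formula. The sketch below assumes both inputs are in milligrams; the function name and the sample values are hypothetical. If tryptophan is stored in grams, it would need to be multiplied by 1000 first.

```python
def niacin_equivalents(niacin_mg, tryptophan_mg):
    """mg NE = preformed niacin (mg) + tryptophan (mg) / 60."""
    return niacin_mg + tryptophan_mg / 60


# Hypothetical food with 10 mg niacin and 300 mg tryptophan
print(niacin_equivalents(10.0, 300.0))
```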

In [33]:
nv_df.rename(columns = {'vitamin_b3_niacin_equivalents':'niacin'}, 
             inplace = True)

###### Export nv_df and drv_df to csv

In [34]:
nv_df.to_csv('../01_Data/whfoods_nv.csv')
drv_df.to_csv('../01_Data/whfoods_drv.csv')

Note: nv_df can be related to drv_df to derive the DRI/DV in mass units.

### Constraints

###### Merge constraints

In [35]:
constraints = pd.read_csv('../01_Data/constraints.csv')
aa_constraints = pd.read_csv('../01_Data/AminoAcids.csv')

In [36]:
constraints.columns, aa_constraints.columns

(Index(['nutrient', 'Min', 'Max', 'notes:'], dtype='object'),
 Index(['AminoAcid', 'Min'], dtype='object'))

In [37]:
# format, make compatible, and merge amino acid constraints with the rest.
constraints.rename(columns = lambda x: x.lower(), inplace = True)
aa_constraints.rename(columns = lambda x: x.lower(), inplace = True)
constraints.set_index('nutrient',inplace=True)
constraints.drop(columns = ['notes:'],inplace = True)
constraints.dropna(how = 'all', inplace = True)
aa_constraints.rename(columns = {'aminoacid':'nutrient'}, inplace = True)
aa_constraints.set_index('nutrient', inplace = True)
aa_constraints.index.name = None
all_constraints = pd.concat([constraints,aa_constraints],axis = 0)

In [38]:
# credit: ChatGPT
all_constraints.index.values[[bool(val) for val in 1- np.array(
    find_exact_matches(all_constraints.index.values, nv_df.columns.values))]]

array(['total_fat', 'riboflavin', 'thiamin', 'vitamin_a_rae', 'vitamin_d',
       'irom', 'phosphorous', 'zink', 'carbohydrate', 'fiber', 'fat',
       'saturated_fatty_acids', 'cystine + methionine',
       'phenylalanine + tyrosine'], dtype=object)
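Some of the unmatched names above are near-misses of food column names ('irom', 'zink', 'phosphorous'), while others are true synonyms ('riboflavin' for vitamin B2). A rough sketch of how `difflib`, imported at the top of the notebook, could suggest renames for the near-misses follows. `suggest_renames` is a hypothetical helper and the short target list stands in for `nv_df.columns`.

```python
import difflib


def suggest_renames(unmatched, targets, cutoff=0.6):
    """Map each unmatched name to its closest target, or None if none is close."""
    suggestions = {}
    for name in unmatched:
        close = difflib.get_close_matches(name, targets, n=1, cutoff=cutoff)
        suggestions[name] = close[0] if close else None
    return suggestions


unmatched = ['irom', 'zink', 'phosphorous', 'riboflavin']
targets = ['iron', 'zinc', 'phosphorus', 'vitamin_b2']  # stand-in for nv_df.columns
suggested = suggest_renames(unmatched, targets)
print(suggested)
```

Misspellings get a suggestion, but string similarity cannot find synonyms such as 'riboflavin', so those still need a manual mapping.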

###### Make constraints and nutrients compatible

In [39]:
nv_df['phenylalanine_tyrosine'] = nv_df['phenylalanine'] + nv_df['tyrosine']
nv_df['cysteine_methionine'] = nv_df['cysteine'] + nv_df['methionine']
constraints.loc['soluble_fiber'] = {'min':'6 g', 'max': np.nan}
nv_df.rename({'dietary_fiber':'fiber'}, axis=1, inplace=True)
all_constraints.rename({'zink':'zinc', 'total_fat':'fat_total',}, 
                       axis=0, inplace=True)

In [40]:
all_constraints.rename(index={
    'total_fat': 'fat_total',
    'riboflavin':'vitamin_b2',
    'thiamin':'vitamin_b1',
    'vitamin_d':'vitamin_d_mcg',
    'vitamin_e':'vitamin_e_mg_alpha_tocopherol_equivalents_ate',
    'irom':'iron',
    'phosphorous':'phosphorus',
    'carbohydrate':'carbohydrates',
    'fat':'fat_total',
    'saturated_fatty_acids':'saturated_fat',
    'cystine + methionine':'cysteine_methionine',
    'phenylalanine + tyrosine':'phenylalanine_tyrosine',
    'zink':'zinc'
}, inplace=True)
all_constraints.drop(index=['vitamin_a_rae', 
        'vitamin_e_mg_alpha_tocopherol_equivalents_ate'], inplace = True)
constraints = all_constraints.copy()

In [41]:
constraints_units_dict = {}

units_dict = \
{idx:getUnit((constraints.loc[idx,'min'], constraints.loc[idx,'max'])) 
 for idx in constraints.index}
constraints['min'] = [getNumber(val) for val in constraints['min'].values]
constraints['max'] = [getNumber(val) for val in constraints['max'].values]

Unit conversions: give the constraints the same units as the foods.

In [42]:
constraints = constraints.drop_duplicates()

# some constraints need to be converted from mcg to mg (copper) 
# or from mg to g (amino acids) while others (potassium) need to be converted 
# from g to mg 
amino_acids = ['phenylalanine_tyrosine', 'leucine', 'valine', 'threonine', 
    'histidine', 'lysine','cysteine_methionine', 'methionine']
# build a new list so that 'copper' is not also appended to amino_acids
convert_up = amino_acids + ['copper']
convert_down = ['potassium']

constraints.loc[convert_up, ['min','max']] /= 1000
constraints.loc[convert_down, ['min','max']] *= 1000

units_updates = dict(zip(amino_acids,['g/kg/d']*len(amino_acids)))
units_updates.update({'copper':'mg','potassium':'mg'})
units_dict.update(units_updates)

nv_df_cnstr = nv_df[constraints.index.values]

nv_df_cnstr.to_csv('../02_Data_formatted/nv_df.csv')
constraints.to_csv('../02_Data_formatted/constraints.csv')
pd.DataFrame.from_dict(units_dict, orient='index', columns=['Units']).to_csv(
    '../02_Data_formatted/nutrient_units.csv')

In [43]:
constraints

Unnamed: 0,min,max
calories,2000.0,2000.0
protein,56.0,160.0
fat_total,22.2,78.0
saturated_fat,0.0,12.0
cholesterol,0.0,
sodium,1500.0,2300.0
choline,550.0,3500.0
folate,400.0,1000.0
niacin,16.0,35.0
pantothenic_acid,5.0,


In [None]:
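import pulp as plp
from pulp import PULP_CBC_CMD

# infeasibilitySearch and formulateMP are not defined in this notebook;
# they are assumed to be provided elsewhere before this cell is run.
f_i = infeasibilitySearch(foods,constraints)
i_c = f_i[1]
feasible = f_i[0]
f_c = constraints.iloc[feasible]
model = formulateMP(foods,f_c)
plp.listSolvers(onlyAvailable=True)
model.solve(solver=PULP_CBC_CMD(msg=False))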