# Calculating the Satiety Index

One of the most important concepts in nutrition is the satiety index: the ratio of how filling a food is to how many calories it contains. People who eat satiating foods tend to consume fewer calories overall, which helps them lose weight. Understanding satiety also matters for gaining weight, since eating less filling foods makes it easier to consume excess calories.

Unfortunately, the satiety index is difficult to measure. [Some work](https://www.researchgate.net/publication/15701207_A_Satiety_Index_of_common_foods) has been done to measure the satiety index of common foods, but no comprehensive research exists. The closest I have found is [this article](https://optimisingnutrition.com/calculating-satiety/), which uses a public dataset to investigate how satiating each macronutrient is.

If we are interested in how satiating various foods are, we can take a similar approach to that article but look at individual foods instead of macronutrients. We will use the same dataset from Kaggle, which has records of 10k people's eating habits plus their daily calorie goals. We will create a sparse (day record x food consumed) matrix and use linear regression to estimate how strongly each food contributes to meeting the corresponding calorie goal.
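To make the regression idea concrete, here is a minimal sketch on made-up numbers. The three foods, the calorie values, and the `true_ratios` vector are all invented for illustration and are not taken from the dataset. Each row is one diary day, each column is one food, and `lsqr` recovers the per-food ratios that map calories eaten to the calorie goal.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import lsqr

# Rows = diary days, columns = foods, values = calories logged (all made up)
toy_foods = ["oatmeal", "soda", "chicken"]
toy_days = csr_matrix(np.array([
    [300.0,   0.0, 200.0],
    [  0.0, 500.0, 100.0],
    [400.0, 250.0,   0.0],
    [150.0,   0.0, 600.0],
    [  0.0, 350.0, 350.0],
]))

# Hypothetical "goal calories per calorie eaten" for each food
true_ratios = np.array([1.5, 0.8, 1.0])
toy_goals = toy_days @ true_ratios

# Least-squares fit: goal ~ sum over foods of (ratio * calories eaten)
estimated_ratios = lsqr(toy_days, toy_goals, atol=1e-12, btol=1e-12)[0]
for food, ratio in zip(toy_foods, estimated_ratios):
    print(f"{food}: {ratio:.2f}")
```

Under this model, a ratio above one means each logged calorie of that food accounts for more than one calorie of the goal, which is the sense in which a food would be read as more filling.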

In [351]:
import json
from datetime import datetime
import pandas as pd
import re
from scipy.sparse import csr_matrix, csc_matrix, hstack, vstack
from scipy.sparse.linalg import lsqr
from itertools import compress
from operator import itemgetter

import os
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
import nltk
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer
from rake_nltk import Rake



# Part 1: Reading the data and transforming it into df format

The raw data is a tab-separated text file with JSON-like entries for nutrition logs. We are interested only in the names of the foods consumed and their calorie content, so we can transform each JSON log into a series mapping foods to calories.

In [543]:
df = pd.read_csv('../data/mfp-diaries.tsv', sep = '\t', header = None)
df.columns = ['PERSON_ID','DATE','NUTRITION','GOALS']

def get_float_value(x):
    # Strip thousands separators (e.g. '1,050') and store the value as an int
    # rather than a str, to conserve memory later
    return int(re.sub(',', '', x))

def flatten(xss):
    return [x for xs in xss for x in xs]

goal_calories = [json.loads(x['GOALS'])['total'][0]['value'] for k,x in df.iterrows()]

start = datetime.now()
calorie_records = [pd.Series(
                                {y['name'] : get_float_value(y['nutritions'][0]['value'])
                                 for y in flatten([z['dishes'] for z in json.loads(x['NUTRITION'])])
                                }
                            )
                for k, x in df.iterrows()]
print(datetime.now() - start)

0:03:25.172914
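To make the record structure concrete, here is a small sketch using a made-up diary entry. The dish names and calorie values are invented, and only the fields the cell above actually reads (`dishes`, `name`, and the first `nutritions` item's `value`) are included. The helper `parse_calories` is a hypothetical name used only for this illustration.

```python
import json
import pandas as pd

# A made-up entry shaped like the NUTRITION column: a list of meals, each with
# a list of dishes whose first nutrition item is taken to be the calorie count
sample_nutrition = json.dumps([
    {"dishes": [
        {"name": "Oatmeal, 1 cup", "nutritions": [{"value": "150"}]},
        {"name": "Banana, 1 medium", "nutritions": [{"value": "105"}]},
    ]},
    {"dishes": [
        {"name": "Burrito, 1 large", "nutritions": [{"value": "1,050"}]},
    ]},
])

def parse_calories(nutrition_json):
    """Return a Series mapping each dish name to its integer calorie count."""
    meals = json.loads(nutrition_json)
    return pd.Series({
        dish["name"]: int(dish["nutritions"][0]["value"].replace(",", ""))
        for meal in meals
        for dish in meal["dishes"]
    })

sample_record = parse_calories(sample_nutrition)
print(sample_record)
```

As with the dict comprehension in the cell above, a dish logged twice on the same day keeps only its last calorie value, because later keys overwrite earlier ones.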


# Part 2: Creating a sparse matrix from the records 

Our goal is a dataframe with one row per daily journal and one column per food consumed, but that is infeasible as a dense table. There are ~600k daily entries and ~1M unique foods listed. Although the vast majority of cells would simply be zero, the resulting dense dataframe would still be far too large to hold in memory.

Fortunately, we can use a sparse representation to have a dataframe-esque object without explicitly writing every zero.
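A rough back-of-the-envelope calculation shows the scale of the difference. The row and column counts are the approximate figures quoted above. The figure of 10 foods logged per day is purely an assumption for illustration, not something measured from the data.

```python
n_days = 600_000       # approximate number of daily entries
n_foods = 1_000_000    # approximate number of unique foods
bytes_per_value = 8    # one 64-bit number per cell

# Dense storage: every cell is written out, zeros included
dense_bytes = n_days * n_foods * bytes_per_value
print(f"dense: {dense_bytes / 1e12:.1f} TB")

# Assumption for illustration only: about 10 foods logged per day
assumed_foods_per_day = 10
nonzeros = n_days * assumed_foods_per_day

# CSR keeps one value and one 32-bit column index per nonzero,
# plus one 32-bit row pointer per row (and one extra at the end)
sparse_bytes = nonzeros * (bytes_per_value + 4) + (n_days + 1) * 4
print(f"sparse (CSR): {sparse_bytes / 1e6:.1f} MB")
```

Under these assumptions the dense table would need about 4.8 TB, while the sparse form would fit in tens of megabytes.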

In [298]:
start = datetime.now()

all_foods = set().union(*[list(c.index) for c in calorie_records])
tot_foods = len(all_foods)
mapping = {food : i for food, i in zip(all_foods,range(tot_foods))}
sparse_records = [csr_matrix((calorie_record.values,
                               ([0] * len(calorie_record)
                                ,[mapping[z] for z in calorie_record.index]
                               )
                              )
                             ,shape = (1, tot_foods)
                            ) for calorie_record in calorie_records]
sparse_matrix = vstack(sparse_records)
print(datetime.now() - start)

0:02:00.643453
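An alternative sketch, not the method used above: the same kind of matrix can be assembled in a single call from flat (row, column, value) arrays, which avoids creating one small matrix per diary day. The `toy_` names and their values are made up, and whether this is faster on the full dataset is untested here.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Toy stand-ins for calorie_records (values are made up)
toy_records = [
    pd.Series({"Oatmeal": 150, "Banana": 105}),
    pd.Series({"Burrito": 1050}),
    pd.Series({"Banana": 105, "Burrito": 900}),
]

# Sorted only so that the column order is predictable in this example
toy_foods = sorted(set().union(*[r.index for r in toy_records]))
toy_mapping = {food: i for i, food in enumerate(toy_foods)}

# One (row, column, value) triplet per logged food
rows = np.concatenate([np.full(len(r), i) for i, r in enumerate(toy_records)])
cols = np.array([toy_mapping[f] for r in toy_records for f in r.index])
vals = np.concatenate([r.values for r in toy_records])

# Build once in COO form, then convert to CSR for slicing and arithmetic
toy_matrix = coo_matrix((vals, (rows, cols)),
                        shape=(len(toy_records), len(toy_foods))).tocsr()
print(toy_foods)
print(toy_matrix.toarray())
```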


# Part 3: Calculating inferred satiety indexes by food

We overcame the technical hurdle of holding the dataset in memory, but there is also a mathematical hurdle: we have more columns than rows, so the regression is underdetermined. To avoid this curse of dimensionality, we need to reduce the number of features somehow.
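The trouble with having more columns than rows can be seen in a tiny made-up example with two diary days and three foods. All numbers below are invented for illustration. Very different sets of coefficients fit the goals exactly, so the data alone cannot tell them apart.

```python
import numpy as np

# Two diary days, three foods: more unknown ratios than equations
toy_X = np.array([[100.0, 200.0,   0.0],
                  [  0.0, 100.0, 300.0]])
toy_goals = np.array([500.0, 700.0])

# Two very different sets of ratios that both reproduce the goals exactly
ratios_a = np.array([3.0, 1.0, 2.0])
ratios_b = np.array([5.0, 0.0, 7.0 / 3.0])
print(toy_X @ ratios_a, toy_X @ ratios_b)

# lstsq returns just one of the infinitely many exact fits (the minimum-norm one)
ratios_min_norm = np.linalg.lstsq(toy_X, toy_goals, rcond=None)[0]
print(ratios_min_norm)
```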

One approach is to lump all of the rare foods together. This will add some noise to the model, but it will also allow the model to focus on the more common foods and, hopefully, come up with solid estimates.

In [388]:
start = datetime.now()
# Column sums as a flat array; note these are total logged calories per food, not entry counts
food_counts = np.asarray(sparse_matrix.sum(axis = 0)).ravel()
food_count_significant = food_counts >= 1000 #we will lump all the 'insignificant' foods together
significant_food_matrix = sparse_matrix[:, food_count_significant]
insignificant_food_matrix = sparse_matrix[:, ~food_count_significant]

# Row sums of the lumped foods form a single 'Other' calorie column
insignificant_food_calories = insignificant_food_matrix.sum(axis = 1)

adjusted_significant_food_matrix = hstack([significant_food_matrix
                                           , csr_matrix(insignificant_food_calories)]
                                          , format = 'csr')
res = lsqr(adjusted_significant_food_matrix
           , goal_calories
           , x0 = np.ones(adjusted_significant_food_matrix.shape[1])) #start from a ratio of 1 for every food

# mapping preserves insertion order, so its keys line up with the matrix columns
significant_foods = list(compress(mapping.keys(), food_count_significant))
significant_food_ratios = pd.Series(res[0], index = significant_foods + ['Other'])
print(datetime.now() - start)

0:07:38.607748


In [528]:
significant_food_ratios.sort_values(ascending = False)


Fresh - Green Onion, Chopped, 1/4 cup                                                               22.564728
Kirkland Signature (Costco) - Extra Strength Glucosamine Hci 1500 Mg With Msm 1500 Mg, 2 Tablets    12.591241
Tesco - Organic Spinach, 50 g                                                                        9.197528
Generic - Tea With 40ml Whole Milk, 1 Mug                                                            8.924278
Generic - Green Beens Boiled, 3 cup (125 grams)                                                      8.692831
                                                                                                      ...    
Kirkland - Vitamin C 500 Mg Chewable, 2 tablet                                                      -5.455760
Nescafe Taster's Choice - Single Serve Packet - Hazelnut Instant Coffee, 2 packet                   -5.476055
Generic - Rice, Jasmin, Boiled, 150 g                                                               -6.706598
Classic - 

In [546]:
significant_food_matrix.sum()/sparse_matrix.sum()

0.604063870539085

Even after reducing our features significantly (from 1M to ~157k), the results still look fairly unreliable. We don't have the option of excluding more foods because we are already cutting out 40% of all calories consumed, and cutting out more than that would add excessive noise. So what other options do we have?

# Part 4: Finding satiety index by category

One step we could take is categorizing foods by type. "Potato" and "Sweet Potato" could be grouped together, for example.

To do this we will embed the food descriptions with Sentence-BERT, then categorize them using K-means clustering. LDA is another option, but it is not appropriate here because it models each document as a mixture of several topics, whereas we want exactly one category per food. We will also use rake-nltk to summarize each category with a more readable label. In an ideal world we would do this analysis for every food, but it would take a long time, so we will use the same procedure as before and limit ourselves to the common foods.
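The sketch below illustrates the "one category per food" idea on four made-up food names. To keep it runnable without downloading a model, it swaps in TF-IDF vectors as a stand-in for the Sentence-BERT embeddings. That substitution, the `toy_` names, and the choice of two clusters are all assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

toy_names = [
    "Potato, baked",
    "Sweet Potato, mashed",
    "Chicken Breast, grilled",
    "Chicken Thigh, roasted",
]

# Stand-in for the Sentence-BERT embeddings: one TF-IDF vector per food name
toy_vectors = TfidfVectorizer().fit_transform(toy_names)

# K-means gives each food exactly one cluster label
toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_vectors)

# Group the food names by their assigned cluster
toy_clusters = {}
for name, label in zip(toy_names, toy_km.labels_):
    toy_clusters.setdefault(int(label), []).append(name)
print(toy_clusters)
```

Each food lands in exactly one cluster, so the calories of all foods in a cluster can later be summed into a single regression column.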

In [754]:
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

stemmer = SnowballStemmer("english")

additional_stopwords = {'oz','ozs','cup','cups','small','medium','large','gram','grams','pound','serving','tbsp'
                       ,'container','order','serving(s)','tbls','mini','inch','servings','standard','white','black'
                       ,'regular','homemade'}
full_stopwords = gensim.parsing.preprocessing.STOPWORDS.union(additional_stopwords)

# Pre-processing helpers applied to every food description in the dataset
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize and lemmatize
def preprocess(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in full_stopwords and len(token) > 3:
            result.append(lemmatize_stemming(token))
            
    return result

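rake_nltk_var = Rake() # RAKE keyword extractor used to label each cluster

def get_summary_of_cluster(cluster):
    # Join the names of every food in the cluster and return the top-ranked RAKE phrase
    text = ". ".join([mapping_inverted[x] for x in cluster_map[cluster]])
    rake_nltk_var.extract_keywords_from_text(text)
    keyword_extracted = rake_nltk_var.get_ranked_phrases()
    return keyword_extracted[0]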



In [None]:
start = datetime.now()

processed_docs = []

for doc in significant_foods:
    processed_docs.append(preprocess(doc))

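model = SentenceTransformer('all-MiniLM-L6-v2')
# encode() expects one string per food, so join each token list back into a single string;
# descriptions with no surviving tokens fall back to a placeholder
embeddings = model.encode([" ".join(z) if z else 'blank' for z in processed_docs])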

km = KMeans(n_clusters = 300)
km.fit(embeddings)

cluster_map = {}
for food, label in zip(significant_foods, km.labels_):
    cluster_map[label] = cluster_map.get(label,[]) + [mapping[food]]
cluster_list = list(cluster_map.keys())

clumped_matrix = np.concatenate([sparse_matrix[:,cluster_map[cluster]].sum(axis = 1) for cluster in cluster_list]
                                , axis = 1)
clumped_matrix = np.concatenate([clumped_matrix, insignificant_food_calories]
                                , axis = 1)

res = np.linalg.lstsq(clumped_matrix, goal_calories, rcond = None)

mapping_inverted = {v: k for k, v in mapping.items()}
res_final = pd.Series({get_summary_of_cluster(cluster) : res[0][cluster_list.index(cluster)] 
                       for cluster in cluster_list})

print(datetime.now() - start)

In [None]:
pd.set_option('display.max_rows', 300)
res_final.sort_values(ascending = False)



There is some interesting information here. Pure protein products at the top make sense superficially, and could indicate something real. However, the coefficients generally look close to one, suggesting that MyFitnessPal users are largely hitting their calorie targets regardless of what kinds of food they choose to eat. 

This is not necessarily an indictment of the concept of a satiety index. People who track their own calories are likely capable of adjusting their food intake, relative to what it would naturally be, to meet their calorie targets. They might even be misreporting their intake, consciously or unconsciously, to keep it in line with their goals.