# Nutrient clusters

## Set-up

In [1]:
try:
    # If we are on Google Colab we need to upgrade pandas...
    from google.colab import widgets
    !pip install pandas --upgrade
except ModuleNotFoundError:
    # If we are not on google colab we pass
    pass


from distutils.version import LooseVersion
import pandas as pd
# As we are using features from pandas 0.23, we need to 
# check that the correct version is used.
assert LooseVersion(pd.__version__) > LooseVersion('0.23'), """
If you are on Google Colab and this fails, make sure you "restart runtime"
after running the cell that installs the newest version of Pandas.
If you are not on Google Colab and this fails, please update your
Pandas version: pip install --upgrade pandas
"""

In [2]:
# This block of code is here to make the notebook Google Colab compatible
try:
    # If we are not on Google Colab (we assume that you don't have
    # the google.colab module on your computer...),
    # the line below will raise an error that will be caught :)
    from google.colab import widgets
    # We clean the content of the directory, so that we can use
    #  git clone directly in it
    !rm -rf *
    !rm -rf .*
    !git clone https://github.com/striantafyllouEPFL/healthy-candies.git .
    # We get the data and initialize everything
    !python ./init_project.py
except ModuleNotFoundError:
    pass

In [3]:
import pandas as pd
import numpy as np

# Project specific module/functions
from healthy_candies.load import load_data

# Machine learning
from sklearn.preprocessing import RobustScaler
from sklearn.cluster import KMeans

## Intuition behind the clustering analysis

For this part of our project, we are concerned with the nutrition facts of the products. Nutrient information is also supplied by ordinary contributors and is therefore subject to inconsistencies and missing data. We will need to address these inconsistencies through a number of pre-processing steps before fitting our clustering model.

As shown below, we observe a large number of missing values for most of the nutrients.

In [4]:
data = load_data(limit_have_nutri_score=False)
cols = data.columns

# Identify nutrition facts columns
nf_cols = cols[cols.str.contains(r'100g')] \
    .drop(['nutrition-score-fr_100g', 'nutrition-score-uk_100g'])

print('Number of features: {}'.format(len(nf_cols)))

nf_cols_percent_available = data[nf_cols].count()/len(data)
print('Percentage of available data for each feature')
display(nf_cols_percent_available.sort_values(ascending=False).head(20))

Number of features: 103
Percentage of available data for each feature


energy_100g                 0.852649
proteins_100g               0.850595
fat_100g                    0.844954
carbohydrates_100g          0.844586
sugars_100g                 0.829701
salt_100g                   0.822351
sodium_100g                 0.822301
saturated-fat_100g          0.807846
fiber_100g                  0.392432
cholesterol_100g            0.207059
trans-fat_100g              0.206083
calcium_100g                0.205408
vitamin-c_100g              0.202903
iron_100g                   0.202747
vitamin-a_100g              0.198045
potassium_100g              0.036393
polyunsaturated-fat_100g    0.034057
monounsaturated-fat_100g    0.034001
vitamin-pp_100g             0.017607
vitamin-b1_100g             0.016951
dtype: float64

We have to discard nutrients with $4\%$ or less available data, since they would provide little information to our model.

We will attempt two iterations of the analysis with the rest of the nutrients: one with only the $8$ nutrients for which we have sufficient data (over $80\%$), and one that also includes the other $7$ nutrients, for which only $20\%$ to $40\%$ of the data is available but which we think might add to the expressive power of our model.

It is here that we should consider what a missing value really means in the context of our dataset. The eight nutrients with sufficient data are, intuitively, those that appear more often on nutrition labels than the other seven. A missing value could also mean that a contributor skipped filling in the nutrients, or that the OCR failed to detect them in the scan provided.

However, a missing value could also stand for a zero value, which in turn raises questions about the zero values that are present. The design decision that emerges is whether we should treat zero as a missing value or not. Zero is a valid nutrient value for many products (e.g. water has zero calories). On the other hand, we can easily imagine a user filling in zeros for unknown nutrient information during product submission. We continue our provisional analysis by treating zeros as valid values, but this is a choice we will re-evaluate during our analysis.
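To make the two readings of a zero concrete, here is a minimal sketch on a made-up frame (the values and the `toy` name are illustrative assumptions, not taken from the dataset). It reports the share of missing and of exactly-zero entries per nutrient, and then shows how the missing share changes if zeros are treated as missing.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the nutrient columns (values are made up).
toy = pd.DataFrame({
    'energy_100g': [0.0, 250.0, np.nan, 1200.0],
    'fiber_100g': [0.0, np.nan, np.nan, 3.5],
})

# Share of missing and of exactly-zero entries per nutrient.
summary = pd.DataFrame({
    'missing': toy.isna().mean(),
    'zero': toy.eq(0).mean(),
})
print(summary)

# Alternative reading: treat every zero as a missing value.
toy_zero_as_nan = toy.replace(0, np.nan)
print(toy_zero_as_nan.isna().mean())
```

Running the same two summaries on `data[nf_cols]` would indicate how much the choice matters for each nutrient.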

In [5]:
# As an example of the process in the context of this milestone,
# we are going forward with the full 15 nutrients.
nf_cols_pruned = nf_cols[nf_cols_percent_available > 0.197]

Another challenge that immediately follows is what to do with the missing data. We can safely drop products where information is missing for all of the nutrients, but what about the rest?

A common strategy in such cases is imputation: replacing the missing data with substituted values. We will consider both simple imputation by mean or median, and more advanced strategies such as replacing by values from a fitted regression model or multiple imputation.

An important problem with imputation is that it can introduce a lot of noise and limit the validity of the results. Especially in our case, where nutrients vary widely between products, we might wish to avoid this additional noise. In that case we would work with a subset that has as little missing data as possible.
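The trade-off between the two options can be sketched on a small made-up frame (the values and variable names below are assumptions for illustration only): median imputation keeps every product but invents values, while restricting to complete rows keeps only observed values but shrinks the sample.

```python
import numpy as np
import pandas as pd

# Made-up nutrient values with a few gaps.
toy = pd.DataFrame({
    'fat_100g': [1.0, 3.0, np.nan, 50.0],
    'sugars_100g': [10.0, np.nan, 2.0, 4.0],
})

# Option 1: simple imputation, filling each column with its own median.
imputed = toy.fillna(toy.median())

# Option 2: complete-case subset, keeping only fully observed rows.
complete = toy.dropna()

print(imputed)
print('{} of {} rows are complete'.format(len(complete), len(toy)))
```

The median is used here rather than the mean because a single extreme product (such as the 50 g of fat above) shifts the mean far more than the median.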

In [6]:
for t in range(8, 16, 1):
    print('{} products have fully available data for at least'
          ' {} of the 15 nutrients.'
          .format(len(data[data[nf_cols_pruned].count(axis=1) >= t]), t))

551182 products have fully available data for at least 8 of the 15 nutrients.
278907 products have fully available data for at least 9 of the 15 nutrients.
150550 products have fully available data for at least 10 of the 15 nutrients.
142799 products have fully available data for at least 11 of the 15 nutrients.
136958 products have fully available data for at least 12 of the 15 nutrients.
134372 products have fully available data for at least 13 of the 15 nutrients.
131133 products have fully available data for at least 14 of the 15 nutrients.
126480 products have fully available data for at least 15 of the 15 nutrients.


We can see that the number of products with information for at least 10 nutrients is only slightly larger than the number with information for all 15. For the 15-nutrient analysis, we will therefore use only products with fully available data for all 15 nutrients. The case with 8 nutrients is covered separately by our 8-nutrient analysis, as mentioned above.

In [7]:
t = 15
data_nf = data[data[nf_cols_pruned].count(axis=1) >= t]

A final consideration before moving on to the machine learning model has to do with the available nutrient values themselves. Our information is per 100g/ml and not per serving, which means that the nutrients are not directly comparable across products. For example, one would normally eat a much smaller quantity of salt (in grams) than of yoghurt. Interestingly enough, the nutrient profiling system used for the calculation of the _NutriScore_ does not seem to account for this issue and calculates the nutritional rating directly on the 100g/ml values instead. We will follow suit and hope to exclude most of the noise through regular outlier handling.

The rest of the work will have to do with building and training our clustering model. We will first need to scale our data.

In [8]:
# Provisionally scale data based on the interquartile range (IQR) to
# ensure robustness to outliers
scaler = RobustScaler()
scaled_data = scaler.fit_transform(data_nf[nf_cols_pruned])

# Remove products with any feature more than 3 IQRs from the median
# (RobustScaler centres on the median and scales by the IQR, not the std)
scaled_data_3std = scaled_data[(scaled_data.min(axis=1) >= -3)
                               & (scaled_data.max(axis=1) <= 3)]
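
As a rough illustration of what the scaling step computes, the sketch below reproduces `RobustScaler` with its default settings by hand on a made-up single-feature array (the array `x` and the names `manual` and `mask` are illustrative assumptions). Each value is centred on the column median and divided by the interquartile range, so a threshold of 3 on the scaled data means "within 3 IQRs of the median".

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up values for a single nutrient, including one extreme product.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual version: centre on the median, divide by the interquartile range.
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

# RobustScaler with default settings performs the same transformation.
scaled = RobustScaler().fit_transform(x)
print(np.allclose(manual, scaled))

# Keep rows whose every feature lies within 3 IQRs of the median.
mask = (np.abs(scaled) <= 3).all(axis=1)
print(mask)
```

Because the median and the IQR are barely affected by the extreme value, the ordinary products stay on a comparable scale while the extreme one is clearly flagged.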

Then, we will need to train our clustering model. _scikit-learn_ provides implementations of several clustering approaches, which we will have to evaluate in the context of our data. As an example, we simply train a basic k-means model and display the derived centroids, transformed back to the original units.

In [9]:
kmeans = KMeans(n_clusters=10, random_state=0) \
    .fit(scaled_data_3std)

pd.DataFrame(
    scaler
    .inverse_transform(kmeans.cluster_centers_), columns=nf_cols_pruned
)

|  | energy_100g | fat_100g | saturated-fat_100g | trans-fat_100g | cholesterol_100g | carbohydrates_100g | sugars_100g | fiber_100g | proteins_100g | salt_100g | sodium_100g | vitamin-a_100g | vitamin-c_100g | calcium_100g | iron_100g |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1225.620606 | 5.09451 | 0.804289 | 0.00489 | 0.001116 | 53.391885 | 4.911961 | 3.176929 | 8.465931 | 0.78859 | 0.310469 | 6e-06 | 0.000121 | 0.043401 | 0.002285 |
| 1 | 1500.230987 | 5.713919 | 2.016506 | 0.050052 | 0.001306 | 74.756481 | 51.734276 | 1.337414 | 2.697101 | 0.420565 | 0.165577 | 5e-06 | 0.000145 | 0.017433 | 0.000779 |
| 2 | 501.044836 | 4.11182 | 1.077207 | 0.006419 | 0.004572 | 17.466109 | 6.171329 | 1.720585 | 3.410711 | 0.72827 | 0.286721 | 7.3e-05 | 0.003916 | 0.037723 | 0.000814 |
| 3 | 2039.045307 | 26.27739 | 14.191414 | 0.010991 | 0.008638 | 59.991365 | 42.138392 | 2.849181 | 5.613262 | 0.517748 | 0.203838 | 1.6e-05 | 0.000177 | 0.068473 | 0.001994 |
| 4 | 1691.45944 | 14.790664 | 2.789275 | 0.002 | 0.000676 | 60.258066 | 15.516814 | 8.059393 | 10.27306 | 0.684319 | 0.269417 | 1.3e-05 | 0.000295 | 0.062502 | 0.002819 |
| 5 | 2436.761135 | 48.211011 | 6.974454 | 0.001517 | 0.000252 | 25.543515 | 7.949795 | 7.456849 | 19.647889 | 0.714954 | 0.281479 | 7e-06 | 0.00021 | 0.107943 | 0.003514 |
| 6 | 1508.444128 | 16.462981 | 3.199896 | 0.019617 | 0.002866 | 47.007341 | 8.362428 | 2.196319 | 6.09116 | 2.631371 | 1.035971 | 1.7e-05 | 0.000273 | 0.04449 | 0.001632 |
| 7 | 821.184326 | 10.468794 | 3.173026 | 0.02562 | 0.039442 | 11.298159 | 2.545277 | 0.560003 | 14.148657 | 1.509633 | 0.594345 | 2.9e-05 | 0.000597 | 0.046339 | 0.000952 |
| 8 | 344.041958 | 2.221978 | 0.838072 | 0.005597 | 0.003764 | 12.640672 | 7.122898 | 0.748684 | 2.842364 | 0.378765 | 0.14912 | 3.2e-05 | 0.000332 | 0.055322 | 0.000267 |
| 9 | 1253.966642 | 14.714266 | 7.02652 | 0.020944 | 0.034298 | 38.003755 | 23.694447 | 0.918899 | 4.474438 | 0.563073 | 0.221682 | 0.000103 | 0.000304 | 0.087705 | 0.000883 |


How do we proceed from here? One part of the process involves training, tuning and evaluating a number of different clustering methods, some of which may be more suitable for our problem than k-means. Both hard and soft clustering approaches might be of interest here.
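One possible way to structure that comparison is sketched below on synthetic data (the two blobs in `X`, the candidate range of `k`, and the choice of silhouette score and Gaussian mixture are all assumptions for illustration, not decisions made in this project). It picks the number of k-means clusters by silhouette score, then fits a Gaussian mixture as an example of soft clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the scaled nutrient matrix: two separated blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Hard clustering: compare candidate values of k by silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)

# Soft clustering: a Gaussian mixture gives membership probabilities
# for every product instead of a single label.
gmm = GaussianMixture(n_components=best_k, random_state=0).fit(X)
probs = gmm.predict_proba(X)
print(best_k, probs.shape)
```

On the real data, the silhouette score would likely need to be computed on a random sample, since it scales quadratically with the number of products.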

Another part is about diving into the clusters and exploring the products that comprise them. Which categories do they belong to, what is their nutritional rating, and how do they relate to our analysis of packaging colours?
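A first pass at these questions could look like the sketch below. It runs on a made-up frame, and the column names `cluster`, `category` and `grade` are hypothetical placeholders rather than the actual columns of the dataset.

```python
import pandas as pd

# Made-up products; 'cluster', 'category' and 'grade' are hypothetical columns.
products = pd.DataFrame({
    'cluster': [0, 0, 0, 1, 1],
    'category': ['biscuits', 'biscuits', 'cereals', 'cheese', 'cheese'],
    'grade': ['d', 'e', 'c', 'd', 'd'],
})

# Most frequent category in each cluster.
top_category = products.groupby('cluster')['category'].agg(
    lambda s: s.value_counts().idxmax())

# Distribution of nutritional grades within each cluster.
grade_table = pd.crosstab(products['cluster'], products['grade'])

print(top_category)
print(grade_table)
```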

In this notebook, we define the challenges we expect to face during the final phase of our analysis. We will proceed by addressing these challenges all the way to our eventual insights.