# 🍎 Exploring the Open Food Facts API

**DS205 W01 NB01 – Advanced Data Manipulation (Winter Term 2025/2026)**

<div style="font-family: system-ui; padding: 20px 30px 20px 20px; background-color: #FFFFFF; border-left: 8px solid #ED9255; border-radius: 8px; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);max-width:600px;color:#212121;">

**Student Notebook**
- 📅 Date: 19 January
- 👤 Name: Jon
- 🎯 Purpose: Learn to collect data from APIs and transform it for analysis

🥅 **Learning Goals**

<ul style="margin: 0.2em 0 0.4em 0; padding-left: 1.25em; font-size:1em; list-style-type:none;font-size:0.85em;color:#666666">

  <li style="margin-bottom:0.15em; padding-left:0.4em; text-indent:-0.4em;">
    <span style="display:inline-block;font-weight:450;width:0.75em">i)</span> Understand what an API is and how to request data from one,
  </li>
  <li style="margin-bottom:0.15em; padding-left:0.4em; text-indent:-0.4em;">
    <span style="display:inline-block;font-weight:450;width:0.75em">ii)</span> Learn the structure of JSON data and how Python represents it,
  </li>
  <li style="margin-bottom:0.15em; padding-left:0.4em; text-indent:-0.4em;">
    <span style="display:inline-block;font-weight:450;width:0.75em">iii)</span> Practise inspecting unfamiliar data systematically,
  </li>
  <li style="padding-left:0.4em; text-indent:-0.4em;">
    <span style="display:inline-block;font-weight:450;width:0.75em">iv)</span> Transform JSON data into a pandas DataFrame for analysis.
  </li>
</ul>

</div>

⚙️ **Importing libraries**

Here are the libraries we are using today:

In [None]:
import json
import requests

import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt

## Section 1. Collect data (20 min)

<span style="display:inline-block; background:#fff; width:100x; border:1.5px solid #E0E0E0; border-radius:7px; padding:0.65em 0.65em 0.65em 0.325em; margin:0 0.175em 0 0;">
 👤 <strong>TEACHING MOMENT</strong>
</span> <span style="font-size:0.6em">Whenever you see a 👤 <strong>TEACHING MOMENT</strong>, follow along with your instructor and ask questions along the way!</span>


In this section, you will learn (or recap) how to request data from an API.

<div style="max-width: 600px; font-size:0.85em; border-left: 6px solid #275d9c; background: #f6f9ff; margin: 1.2em 0; padding: 1.15em 1em 1.1em 1.5em; border-radius: 7px; box-shadow: 0 2px 7px rgba(39,80,150,0.03);">

<span style="font-size:1.5em">What is an API?</span>

An **API** (*Application Programming Interface*) is a system that lets programs communicate directly with each other. Unlike a website you view in your browser, which shows text and images, an API provides **structured data** for computer programs to use.

Today we will work with data from [Open Food Facts 🌐](https://world.openfoodfacts.org/). Instead of viewing a product in your browser, as you can on their main website, we will use their API to request detailed product data directly (nutritional information, ingredients, brands, and more) with code, retrieved in a format that is ready for analysis.

An API is just one way of getting data from a source: it is the website owner's way of giving you permission to access their organised data. Next week you'll see a way of getting data when it is not structured so nicely (*web scraping*).

</div>


🎯 **ACTION POINT 01:**

1. **Familiarise yourself with the website:** go to the [Open Food Facts 🌐](https://world.openfoodfacts.org/) website and search for the last product you bought at a supermarket in London.
   Note what kind of information/data about that product is returned to you on the page.

2. **Check out the API documentation:**
    The maintainers of Open Food Facts have made their data available via an API. They provide [comprehensive documentation](https://openfoodfacts.github.io/openfoodfacts-server/api/) about the data available and how to retrieve it. Most importantly, they have a [Reference page for their API (v2)](https://openfoodfacts.github.io/openfoodfacts-server/api/ref-v2/), which describes the technical details of what you can do with that API.
    <span style="color:#666666;font-size:0.85em">💡 **TIP:** you might want to bookmark those links in case you need them in the next few weeks!</span>

Don't worry about trying to understand everything. There's a ton of text and information in those pages and, right now, it's more important to know that these reference pages exist.


### Using `requests`

In Python, the most standard way of collecting data from an API like this is by using a library called [`requests`](https://requests.readthedocs.io/en/latest/).

The code below illustrates how to collect data from the [`/api/v2/search` endpoint](https://openfoodfacts.github.io/openfoodfacts-server/api/ref-v2/#get-/api/v2/search):

In [None]:
# The Open Food Facts API endpoint for searching products
# as described in the Reference page
endpoint_url = "https://world.openfoodfacts.org/api/v2/search"

# Parameters for our search
params = {
    "categories_tags_en": "Breakfast cereals",
    "countries_tags": "en:united-kingdom",
    "page_size": 50,
    "fields": "product_name,brands,categories,nutriments,nova_group,ingredients_text"
}

# Send a request to the API address
# This is the equivalent of hitting Enter on a browser and waiting for the page to load
# ⚠️ This request below does take a few seconds. Don't panic if you have to wait a bit or if you need to re-run!
response = requests.get(endpoint_url, params=params)

# Check if the request was successful
print(f"Status code: {response.status_code}")
print("Request successful!" if response.status_code == 200 else "❌ Request failed")
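
To get a feel for what `requests.get(endpoint_url, params=params)` sends, it can help to build the final URL by hand. The sketch below uses the standard library's `urllib.parse.urlencode` as a stand-in. It approximates how the `params` dictionary ends up in the URL; it is not the actual `requests` implementation, and the exact escaping may differ.

```python
from urllib.parse import urlencode

endpoint_url = "https://world.openfoodfacts.org/api/v2/search"

# A subset of the search parameters used above
params = {
    "categories_tags_en": "Breakfast cereals",
    "countries_tags": "en:united-kingdom",
    "page_size": 50,
}

# Each key/value pair becomes key=value; pairs are joined with '&'
# and appended to the endpoint after a '?'
query_string = urlencode(params)
full_url = f"{endpoint_url}?{query_string}"

print(full_url)
```

Pasting a URL like this into a browser is a quick way to check what an endpoint returns before writing more code.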

A request returns a response with a status code. [Status codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status) run from 100 to 599. The one you want is **200**! It means that everything worked as intended.
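
Status codes are grouped into families by their first digit. The sketch below uses the standard library's `http.HTTPStatus` to look up the official phrase for a code. The helper name `describe_status` is made up for this illustration.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable description of an HTTP status code."""
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    family = families.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in the standard library
        phrase = "non-standard code"
    return f"{code} {phrase} ({family})"


for code in (200, 404, 429, 503):
    print(describe_status(code))
```

`requests` also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses instead of requiring a manual check.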

Once you've got a successful response, you can parse its JSON body into Python objects:

In [None]:
# Convert the response to Python objects
data = response.json()

# Uncomment the line below to see the parsed JSON in full
# data

## Section 2. Making sense of the data (15 min)

<span style="display:inline-block; background:#fff; width:100x; border:1.5px solid #E0E0E0; border-radius:7px; padding:0.65em 0.65em 0.65em 0.325em; margin:0 0.175em 0 0;">
 👤 <strong>TEACHING MOMENT</strong>
</span> <span style="font-size:0.6em">Whenever you see a 👤 <strong>TEACHING MOMENT</strong>, follow along with your instructor and ask questions along the way!</span>

<div style="max-width: 600px; font-size:0.85em; border-left: 6px solid #275d9c; background: #f6f9ff; margin: 1.2em 0; padding: 1.15em 1em 1.1em 1.5em; border-radius: 7px; box-shadow: 0 2px 7px rgba(39,80,150,0.03);">

### What is JSON?

The API returns data in JSON format (JavaScript Object Notation). JSON is a text-based format for representing structured data. It is human-readable and widely used for data exchange.

When Python reads JSON, it converts it to native Python types:
- JSON **objects** `{...}` become Python **dictionaries** `dict`
- JSON **arrays** `[...]` become Python **lists** `list`
- JSON **strings** become Python `str`
- JSON **numbers** become Python `int` or `float`
- JSON **booleans** `true`/`false` become Python `bool`

Real-world JSON is often deeply nested: dictionaries containing lists of dictionaries, and so on.

</div>
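
The type mapping above can be checked with a tiny hand-written JSON string. The product values below are invented for illustration and do not come from the API. Note one extra mapping not listed above: JSON `null` becomes Python `None`.

```python
import json

# A made-up JSON document, written as plain text
raw_text = (
    '{"name": "Oat flakes", "tags": ["cereal", "oats"], '
    '"nova_group": 1, "sugars_100g": 1.2, "organic": true, "brand": null}'
)

# json.loads parses a JSON string into native Python objects
parsed = json.loads(raw_text)

for key, value in parsed.items():
    print(f"{key!r}: {value!r} -> {type(value).__name__}")
```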

You can save JSON to a file:

In [None]:
# How to save JSON to file:
with open('open-food-facts-50-cereals.json', 'w') as f:
    json.dump(data, f)

You can read JSON from a file:

In [None]:
# How to read JSON from a file into Python
with open('open-food-facts-50-cereals.json', 'r') as f:
    data2 = json.load(f)

In [None]:
# Confirm that these two objects have the same data
data == data2
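
If you plan to open the saved file in a text editor, a couple of optional arguments can make it easier to read. The sketch below writes a small made-up dictionary to a temporary folder (so it does not touch your real files), using `indent=2` for pretty-printing and an explicit `encoding`.

```python
import json
import tempfile
from pathlib import Path

# A made-up, miniature version of an API response
sample = {"count": 1, "products": [{"product_name": "Toy flakes", "nova_group": 3}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.json"

    # indent=2 spreads the JSON over several lines
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=2)

    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

    file_text = path.read_text(encoding="utf-8")

print(file_text)
print(f"Round trip preserved the data: {reloaded == sample}")
```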

### 2.1 Systematic Data Inspection

When you receive data from an unfamiliar source, you need a systematic approach to understand its structure. Here is the recommended pattern:

1. **Check the type:** `type(data)` tells you if you have a dict or list

   * **If it's a dict:** Use `data.keys()` to see available fields, then `data['key_name']` to access values

   * **If it's a list:** Use `len(data)` to see how many items, then `data[0]` to inspect the first item

2. **Repeat:** Apply the same pattern to nested structures

This pattern works for any unfamiliar data, not just APIs.

In [None]:
# Step 1: What type is our data?
print(f"Type: {type(data)}")
print()

# Step 2: Since it's a dict, what keys does it have?
print(f"Keys: {list(data.keys())}")
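
The inspection pattern from Section 2.1 can also be wrapped in a small recursive helper. This is only a sketch: the function name `summarise_structure` and the toy dictionary are invented for illustration, and the helper looks at just the first item of each list.

```python
def summarise_structure(obj, name="data", depth=0, max_depth=3):
    """Return one line per level: type, plus key count (dict) or length (list)."""
    indent = "  " * depth
    if isinstance(obj, dict):
        lines = [f"{indent}{name}: dict with {len(obj)} keys"]
        if depth < max_depth:
            for key, value in obj.items():
                lines += summarise_structure(value, repr(key), depth + 1, max_depth)
    elif isinstance(obj, list):
        lines = [f"{indent}{name}: list with {len(obj)} items"]
        if obj and depth < max_depth:
            # Only the first item is inspected, as in the manual pattern
            lines += summarise_structure(obj[0], "[0]", depth + 1, max_depth)
    else:
        lines = [f"{indent}{name}: {type(obj).__name__}"]
    return lines


# Toy data, deliberately different from the real API response
toy = {
    "total": 2,
    "items": [
        {"label": "A", "details": {"weight_g": 10.0}},
        {"label": "B"},
    ],
}

summary = summarise_structure(toy)
print("\n".join(summary))
```

Working through the steps by hand first is still worthwhile: the helper only automates a pattern you already understand.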

🎯 **ACTION POINT 02**

Just judging by what you see above, what do you think is contained in each of these keys and what is their Python type?

Useful links:

* **Possible [Python types](https://www.w3schools.com/python/python_datatypes.asp):** `int`, `float`, `bool`, `list`, `dict`, `str`
* **[Open Food Facts API - v2 reference](https://openfoodfacts.github.io/openfoodfacts-server/api/ref-v2/)** (if you are already familiar with APIs)

_Remove the `<placeholder>`'s below (including the `<` and `>`) and replace with your answer_

| Key         | Python Type    | Description                            |
|-------------|---------------|----------------------------------------|
| `count`     | <placeholder>  | <placeholder>                          |
| `page`      | <placeholder>  | <placeholder>                          |
| `page_count`| <placeholder>  | <placeholder>                          |
| `page_size` | <placeholder>  | <placeholder>                          |
| `products`  | <placeholder>  | <placeholder>                          |
| `skip`      | <placeholder>  | <placeholder>                          |

🎯 **ACTION POINT 03**

Use the cell below to write Python code to confirm your suspicions.

💡 **Remember** to use `data['key_name']` to access values when `data` is a dictionary.

In [None]:
# Write your Python code here

### 2.2 What's in a product?

In [None]:
# The 'products' key likely contains the actual product data
# Let's check what type it is
print(f"Type of 'products': {type(data['products'])}")
print(f"Number of products: {len(data['products'])}")

In [None]:
# Look at the first product
first_product = data['products'][0]
print(f"Type of first product: {type(first_product)}")
print(f"Keys in first product: {list(first_product.keys())[:10]}...")  # Show first 10 keys

In [None]:
# Uncomment to see all the data on the first product
# first_product

In [None]:
# Let's look at some interesting fields
print(f"Product name: {first_product['product_name']}")
print(f"Brand: {first_product['brands']}")
print(f"Categories: {first_product['categories']}")
print(f"NOVA group: {first_product['nova_group']}")
print(f"Ingredients: {first_product['ingredients_text'][:100]}...")  # First 100 chars
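
Accessing a key with square brackets raises a `KeyError` if that key is absent, and crowd-sourced data often has gaps. A more forgiving option is `dict.get()`, sketched below on a made-up product that lacks some fields.

```python
# A made-up product with no 'nova_group' or 'ingredients_text' key
product = {"product_name": "Toy granola", "brands": "Example Brand"}

# .get() returns None (or a default you choose) instead of raising KeyError
nova = product.get("nova_group")

# 'or' also covers the case where the key exists but holds None or ""
ingredients = product.get("ingredients_text") or "(not provided)"
preview = ingredients[:100]

print(f"NOVA group: {nova}")
print(f"Ingredients: {preview}")
```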

## Section 3. It's more productive to work with tables (25 min)

<span style="display:inline-block; background:#fff; width:100x; border:1.5px solid #E0E0E0; border-radius:7px; padding:0.65em 0.65em 0.65em 0.325em; margin:0 0.175em 0 0;">
 👤 <strong>TEACHING MOMENT</strong>
</span> <span style="font-size:0.6em">Whenever you see a 👤 <strong>TEACHING MOMENT</strong>, follow along with your instructor and ask questions along the way!</span>

<div style="max-width: 600px; font-size:0.85em; border-left: 6px solid #275d9c; background: #f6f9ff; margin: 1.2em 0; padding: 1.15em 1em 1.1em 1.5em; border-radius: 7px; box-shadow: 0 2px 7px rgba(39,80,150,0.03);">

<span style="font-size:1.5em">What is a DataFrame?</span>

While dictionaries are great for representing structured data, they become unwieldy when you have many records. A **DataFrame** (from the [`pandas` library](https://pandas.pydata.org/docs/user_guide/index.html)) organises data into rows and columns, making it much easier to:

- Filter records
- Calculate statistics
- Create visualisations
- Export to other formats

A DataFrame is a 2D labelled data structure. Think of it as a spreadsheet or SQL table:
- Each **row** is one record (in our case, one product)
- Each **column** is one field (product name, brand, calories, etc.)

A **Series** is a single column or a single row from a DataFrame: a 1D labelled array.

</div>

<span style="font-size:0.85em;color:#666666">💡 **TIP:** If you have not worked with pandas before, you might want to reserve a bit of time to read their [Intro to data structures](https://pandas.pydata.org/docs/user_guide/dsintro.html) page later</span>

Because `data['products']` is a **list of dictionaries**, we can easily convert it to a DataFrame:

In [None]:
df = pd.DataFrame(data['products'])
df

In [None]:
# How to look at just one single row
# Just specify the row number to iloc
df.iloc[20]
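
To see what the conversion does on a scale you can check by eye, here is a sketch with two made-up records. It shows that each dictionary key becomes a column, that a key missing from one record is filled with `NaN`, and that a single row selected with `.iloc` is a Series.

```python
import pandas as pd

# Two made-up records; the second one has no 'brands' key
records = [
    {"product_name": "Toy flakes", "brands": "Brand A", "nova_group": 3},
    {"product_name": "Toy oats", "nova_group": 1},
]

toy_df = pd.DataFrame(records)
print(toy_df)

# Selecting one row by position returns a Series indexed by column name
first_row = toy_df.iloc[0]
print(type(first_row).__name__)
print(first_row["product_name"])
```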

### Exploring your DataFrame

pandas provides several methods to quickly understand your data:

- `.head(n)`: Show first n rows (default 5)

- `.tail(n)`: Show last n rows

- `.info()`: Show column types and non-null counts

- `.describe()`: Show statistical summary of numeric columns

- `.shape`: Number of rows and columns as a tuple

In [None]:
# Basic information about the DataFrame
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print()

# Column types and missing values
df.info()

In [None]:
# Statistical summary of numeric columns
df.describe()

### 3.1 Check proportions of categorical data


In [None]:
# How many times does each brand show up in the data?
df['brands'].value_counts(sort=True)

In [None]:
# Same for the NOVA group
df['nova_group'].value_counts()
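
`value_counts()` returns raw counts by default. To get proportions, or to see how many values are missing, two optional arguments help: `normalize=True` and `dropna=False`. The sketch below uses a small made-up Series rather than the real data.

```python
import pandas as pd

# Made-up NOVA groups, including one missing value
nova = pd.Series([4, 4, 3, 4, None, 1], name="nova_group")

# Counts, with missing values shown as their own category
counts = nova.value_counts(dropna=False)
print(counts)

# Proportions among the non-missing values (they sum to 1)
shares = nova.value_counts(normalize=True)
print(shares)
```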

### 3.2 Understanding NOVA Classification

The **NOVA classification** is a framework for grouping foods based on the extent and purpose of food processing. Developed by researchers at the University of São Paulo, it classifies foods into four groups. Research has linked higher consumption of ultra-processed foods (Group 4) with various health concerns. The classification helps us understand not just what we eat, but how it's been processed. For more information, see the [NOVA classification Wikipedia article](https://en.wikipedia.org/wiki/Nova_classification).

<style type="text/css">
#T_1f4d7_row0_col0 {
  background-color: #4caf50;
  text-align: left;
}
#T_1f4d7_row0_col1, #T_1f4d7_row0_col2, #T_1f4d7_row1_col1, #T_1f4d7_row1_col2, #T_1f4d7_row2_col1, #T_1f4d7_row2_col2, #T_1f4d7_row3_col1, #T_1f4d7_row3_col2 {
  text-align: left;
}
#T_1f4d7_row1_col0 {
  background-color: #ffeb3b;
  text-align: left;
}
#T_1f4d7_row2_col0 {
  background-color: #ff9800;
  text-align: left;
}
#T_1f4d7_row3_col0 {
  background-color: #f44336;
  text-align: left;
}
</style>
<table id="T_1f4d7">
  <thead>
    <tr>
      <th id="T_1f4d7_level0_col0" class="col_heading level0 col0" >NOVA Group</th>
      <th id="T_1f4d7_level0_col1" class="col_heading level0 col1" >Description</th>
      <th id="T_1f4d7_level0_col2" class="col_heading level0 col2" >Examples</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td id="T_1f4d7_row0_col0" class="data row0 col0" >1</td>
      <td id="T_1f4d7_row0_col1" class="data row0 col1" >Unprocessed or minimally processed foods</td>
      <td id="T_1f4d7_row0_col2" class="data row0 col2" >Fresh fruits, vegetables, grains, plain milk</td>
    </tr>
    <tr>
      <td id="T_1f4d7_row1_col0" class="data row1 col0" >2</td>
      <td id="T_1f4d7_row1_col1" class="data row1 col1" >Processed culinary ingredients</td>
      <td id="T_1f4d7_row1_col2" class="data row1 col2" >Oils, salt, sugar, butter</td>
    </tr>
    <tr>
      <td id="T_1f4d7_row2_col0" class="data row2 col0" >3</td>
      <td id="T_1f4d7_row2_col1" class="data row2 col1" >Processed foods</td>
      <td id="T_1f4d7_row2_col2" class="data row2 col2" >Canned vegetables, cheese, bread</td>
    </tr>
    <tr>
      <td id="T_1f4d7_row3_col0" class="data row3 col0" >4</td>
      <td id="T_1f4d7_row3_col1" class="data row3 col1" >Ultra-processed foods</td>
      <td id="T_1f4d7_row3_col2" class="data row3 col2" >Soft drinks, packaged snacks, instant noodles</td>
    </tr>
  </tbody>
</table>



<details style="border: 1px solid #0051a5;border-left: 5px solid #0051a5;border-radius: 0.25rem;margin: 1em 0;">
<summary style="cursor: pointer; font-weight: 600; background-color: #f0f7fb;color: #0051a5;padding:0.5em;"> Click to see how I created the styled table above</summary>

<div style="margin-top: 0.5em; color: #212121;margin-left:1em">

```python
# Create a table showing NOVA groups with their descriptions
nova_table = pd.DataFrame({
    'NOVA Group': [1, 2, 3, 4],
    'Description': [
        'Unprocessed or minimally processed foods',
        'Processed culinary ingredients',
        'Processed foods',
        'Ultra-processed foods'
    ],
    'Examples': [
        'Fresh fruits, vegetables, grains, plain milk',
        'Oils, salt, sugar, butter',
        'Canned vegetables, cheese, bread',
        'Soft drinks, packaged snacks, instant noodles'
    ]
})

# Display the table with styled NOVA group numbers
def color_nova_group(val):
    """Colour NOVA group numbers according to standard colours."""
    if val == 1:
        return 'background-color: #4caf50'  # Green
    elif val == 2:
        return 'background-color: #ffeb3b'  # Yellow
    elif val == 3:
        return 'background-color: #ff9800'  # Orange
    elif val == 4:
        return 'background-color: #f44336'  # Red
    return ''

# Note: recent pandas versions (2.1+) deprecate Styler.applymap in favour of Styler.map
styled_table = nova_table.style.applymap(
    color_nova_group,
    subset=['NOVA Group']
).set_properties(**{'text-align': 'left'}).hide(axis=0)
print(styled_table.to_html())
```

</div>

</details>

🏅 **TIP:** The **NOVA classification** will be relevant for your first graded assignment (Problem Set 1)

### 3.3 Working with nested data

The `df['nutriments']` column is unusual: each cell holds a dictionary rather than a single value.

In [None]:
# Uncomment to look at it
# df['nutriments']

You can use the all-powerful `pd.json_normalize()` function to expand that **nested data** into a proper table:

In [None]:
df_nutriments = pd.json_normalize(df['nutriments'])

# Uncomment the line below to see what that function does
# df_nutriments

You can combine the two tables side by side by **concatenating** them horizontally (`axis=1`):

In [None]:
clean_df = pd.concat([df.drop(columns=['nutriments']), 
                      df_nutriments], axis=1)

clean_df.head(3)
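
The same two steps are easier to follow on a tiny example. The sketch below uses two made-up products, passes a plain list of dictionaries to `pd.json_normalize()`, and then concatenates the result with the remaining columns. The values are invented, and real data may need extra handling (for instance, if a product had no `nutriments` dictionary at all).

```python
import pandas as pd

# Two made-up products with a nested 'nutriments' dictionary
products = [
    {"product_name": "Toy flakes", "nutriments": {"sugars_100g": 10.0, "salt_100g": 0.5}},
    {"product_name": "Toy puffs", "nutriments": {"sugars_100g": 22.5}},
]
toy_df = pd.DataFrame(products)

# Expand each dictionary into its own set of columns
toy_nutriments = pd.json_normalize(toy_df["nutriments"].tolist())

# axis=1 places the tables side by side, matching rows by index label
flat = pd.concat([toy_df.drop(columns=["nutriments"]), toy_nutriments], axis=1)
print(flat)
```

Because `pd.concat(..., axis=1)` aligns rows by index label, both tables need the same index. If you filter a DataFrame first, resetting its index with `.reset_index(drop=True)` before normalising avoids misaligned rows.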

`.describe()` now summarises many more numeric columns:

In [None]:
clean_df.describe()

In [None]:
clean_df.info()

In [None]:
clean_df.columns.tolist()

You can always focus on just a single column:

In [None]:
# Access a single column (returns a Series)
energy_series = clean_df['energy-kcal']
print(f"Type: {type(energy_series)}")
print(f"Mean energy: {energy_series.mean():.1f} kcal per 100g")
print(f"Max energy: {energy_series.max():.1f} kcal per 100g")

## Section 4: Visualising the data (remaining time / take-home)

⏸️ **Continue at your own pace**

Now that we have our data in a DataFrame, we can create visualisations to understand patterns. 

### Visualising Nutritional Data

Let's create histograms for our core nutritional measurements to see how they're distributed across our breakfast cereals dataset.

In [None]:
nutrition_cols = ['energy-kcal', 'fat_100g', 'sugars_100g', 'fiber_100g', 'proteins_100g', 'salt_100g']
df_melted = clean_df.melt(
    id_vars=['nova_group'],
    value_vars=nutrition_cols,
    var_name='nutrient',
    value_name='value'
)

# Create FacetGrid with histograms
g = sns.FacetGrid(df_melted, col='nutrient', col_wrap=3, height=3, sharex=False, sharey=False)
g.map(plt.hist, 'value', bins=15, edgecolor='black', alpha=0.7)
g.set_axis_labels('Value (per 100g)', 'Count')

g.set_titles('{col_name}', fontweight='bold', fontsize=11, 
             bbox={'facecolor': 'white', 
                   'edgecolor': '#212121', 
                   'boxstyle':'round,pad=0.5', 
                   'linewidth':1})

# Adjust layout
plt.tight_layout()
plt.show()
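
The `melt()` call above reshapes the table from *wide* (one column per nutrient) to *long* (one row per product-nutrient pair), which is the shape `FacetGrid` needs to draw one panel per nutrient. A sketch on a made-up two-row table shows the effect:

```python
import pandas as pd

# Made-up wide table: one row per product, one column per nutrient
wide = pd.DataFrame({
    "nova_group": [4, 1],
    "fat_100g": [2.0, 7.0],
    "sugars_100g": [30.0, 1.0],
})

# Long table: the nutrient columns are stacked into 'nutrient' / 'value' pairs
long = wide.melt(
    id_vars=["nova_group"],
    value_vars=["fat_100g", "sugars_100g"],
    var_name="nutrient",
    value_name="value",
)

print(wide)
print(long)
```

Two products and two nutrients give four rows in the long table.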

### NOVA Classification Distribution

Let's see how our breakfast cereals are distributed across the NOVA groups using a donut chart.

In [None]:
# Count products by NOVA group
nova_counts = df['nova_group'].value_counts().sort_index()

# Define colours for each NOVA group
nova_colours = {
    1: '#4caf50',  # Green
    2: '#ffeb3b',  # Yellow
    3: '#ff9800',  # Orange
    4: '#f44336'   # Red
}

# Create donut chart
fig, ax = plt.subplots(figsize=(5, 5))

# Outer pie (donut)
wedges, texts, autotexts = ax.pie(
    nova_counts.values,
    labels=[f'Group {int(i)}' for i in nova_counts.index],  # int() avoids labels like 'Group 4.0'
    colors=[nova_colours[i] for i in nova_counts.index],
    autopct='%1.1f%%',
    startangle=90,
    pctdistance=0.85
)

# Inner circle to create donut effect
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
ax.add_artist(centre_circle)

# Add title
ax.set_title('Distribution of Breakfast Cereals by NOVA Classification', 
             fontsize=14, fontweight='bold', pad=20)

# Make percentage text larger and bold
for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_fontweight('bold')
    autotext.set_fontsize(10)

plt.tight_layout()
plt.show()

## Section 5: Playground

⏸️ **Continue at your own pace**

Now it's your turn! Use the data you've collected to explore and discover something interesting.

**Suggested activities:**

1. What's the relationship between sugar content and energy?
2. Which product has the highest protein content?
3. 🏆 **Challenge:** Can you collect the next batch of 50 breakfast cereals? (you will need to familiarise yourself with the [API documentation](https://openfoodfacts.github.io/openfoodfacts-server/api/ref-v2/))

Share your findings with your lab partner and discuss what you discovered.

In [None]:
# Your code here

💭 **Personal Reflection:**

What did you find most surprising about the data? What questions would you want to answer if you had more time?

- [*Write your notes here*]
- 