# 🗓️ W01 | Lecture: From API to Insights: The Complete Pipeline

**DS205 W01 NB01 – Advanced Data Manipulation (Winter Term 2025/2026)**

<div style="font-family: system-ui; padding: 20px 30px 20px 20px; background-color: #FFFFFF; border-left: 8px solid #ED9255; border-radius: 8px; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);max-width:600px;color:#212121;">

**Lecture Demonstration Notebook**
- 📅 Date: 19 January 2026
- 👤 Instructor: Dr Jon Cardoso-Silva
- 🎯 Purpose: Demonstrate a full data pipeline, from API collection to sophisticated analysis

🥅 **Learning Goals**

<ul style="margin: 0.2em 0 0.4em 0; padding-left: 1.25em; font-size:1em; list-style-type:none;font-size:0.85em;color:#666666">

  <li style="margin-bottom:0.15em; padding-left:0.4em; text-indent:-0.4em;">
    <span style="display:inline-block;font-weight:450;width:0.75em">i)</span> Collect large datasets from APIs using pagination,
  </li>
  <li style="margin-bottom:0.15em; padding-left:0.4em; text-indent:-0.4em;">
    <span style="display:inline-block;font-weight:450;width:0.75em">ii)</span> Apply systematic data inspection methodology,
  </li>
  <li style="margin-bottom:0.15em; padding-left:0.4em; text-indent:-0.4em;">
    <span style="display:inline-block;font-weight:450;width:0.75em">iii)</span> Transform nested JSON data into analysis-ready feature matrices,
  </li>
  <li style="padding-left:0.4em; text-indent:-0.4em;">
    <span style="display:inline-block;font-weight:450;width:0.75em">iv)</span> Preview advanced analysis techniques with UMAP clustering and interactive visualisation.
  </li>
</ul>

</div>

⚙️ **Importing libraries**

Before we begin, make sure you have all required libraries installed:

```bash
pip install pandas numpy requests tqdm plotly umap-learn matplotlib seaborn scikit-learn openpyxl
```

Here are the libraries we are using today:

In [1]:
import json
import time
from pathlib import Path

from tqdm import tqdm

import requests

import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go

import umap

## Section 1: Data Collection

In this section, we'll collect a large dataset of bread products from the Open Food Facts API. This demonstrates how to work with APIs that require pagination to collect substantial datasets.

<div style="background-color: #fcfcfc; width:80%; margin-left: 1em; color: #212121; padding: 1em; border-radius: 0.5em; border: 1px solid #03a9f4; border-left: 5px solid #03a9f4;">

**What just happened?** We're using the Open Food Facts API to collect product data. Unlike the lab notebook, where we collected 50 products, here we'll collect up to 1,500 to enable more sophisticated analysis. That requires pagination: the API returns results one page at a time (here 100 products per page), so we request successive pages until we reach our page limit or a page comes back empty.

</div>
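The pagination pattern can be tried without network access. The sketch below is a minimal illustration, not the notebook's collection code: `paginate`, `fetch_page` and `fake_fetch` are hypothetical names introduced here, and `fake_fetch` stands in for the real `requests.get` call.

```python
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[int], list], max_pages: int) -> Iterator[dict]:
    """Yield items page by page until a page comes back empty or max_pages is reached."""
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:  # an empty page signals that there is nothing left
            break
        yield from items


def fake_fetch(page: int, page_size: int = 100, total: int = 250) -> list:
    """Hypothetical stand-in for an API call: serves `total` fake products in pages."""
    start = (page - 1) * page_size
    end = min(start + page_size, total)
    return [{"product_name": f"bread-{i}"} for i in range(start, end)]


collected = list(paginate(fake_fetch, max_pages=15))
print(len(collected))  # 250: two full pages, one partial page, then an empty page stops the loop
```

Passing the fetch function in as an argument keeps the stopping logic separate from the HTTP details, so the loop can be checked offline before it is pointed at a real API.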

In [2]:
# The Open Food Facts API endpoint for searching products
endpoint_url = "https://world.openfoodfacts.org/api/v2/search"

# Parameters for our search - we'll search for "bread" products
# This gives us a diverse dataset with many variations
# Note: API v2 does not support search_terms, so we use categories_tags_en instead
params_base = {
    "categories_tags_en": "Breads",
    "countries_tags": "en:united-kingdom",
    "page_size": 100,  # Maximum page size for efficiency
    "fields": "product_name,brands,categories,nutriments,nova_group,ingredients_text"
}

In [3]:
# Collect data with pagination using a tqdm progressbar
all_products = []
max_pages = 15  # Collect up to 1500 products (15 pages × 100 products)

print("🔄 Collecting data from Open Food Facts API...\n")

with tqdm(total=max_pages, desc="Pages", unit="page") as pbar:
    for page in range(1, max_pages + 1):
        params = params_base.copy()
        params["page"] = page

        try:
            response = requests.get(endpoint_url, params=params, timeout=30)

            if response.status_code != 200:
                print(f"❌ Error on page {page}: Status code {response.status_code}")
                break

            data = response.json()
            # Use .get() instead of data['products'] to avoid KeyError
            products = data.get("products", [])

            if not products:
                print(f"✅ No more products found. Collected {len(all_products)} products total.")
                break

            all_products.extend(products)
            tqdm.write(f"📦 Page {page}: Collected {len(products)} products (Total: {len(all_products)})")

            # Rate limiting: be respectful to the API
            time.sleep(0.5)

            pbar.update(1)

        except requests.exceptions.RequestException as e:
            print(f"❌ Error on page {page}: {e}")
            break

print()
print(f"✅ Successfully collected {len(all_products)} products")

🔄 Collecting data from Open Food Facts API...

📦 Page 1: Collected 100 products (Total: 100)
📦 Page 2: Collected 100 products (Total: 200)
📦 Page 3: Collected 100 products (Total: 300)
📦 Page 4: Collected 100 products (Total: 400)
📦 Page 5: Collected 100 products (Total: 500)
📦 Page 6: Collected 100 products (Total: 600)
📦 Page 7: Collected 100 products (Total: 700)
📦 Page 8: Collected 100 products (Total: 800)
📦 Page 9: Collected 100 products (Total: 900)
📦 Page 10: Collected 100 products (Total: 1000)
📦 Page 11: Collected 100 products (Total: 1100)
📦 Page 12: Collected 100 products (Total: 1200)
📦 Page 13: Collected 100 products (Total: 1300)
📦 Page 14: Collected 100 products (Total: 1400)
📦 Page 15: Collected 100 products (Total: 1500)
Pages: 100%|██████████| 15/15 [00:39<00:00,  2.63s/page]

✅ Successfully collected 1500 products

In [4]:
# Save the collected data to a JSON file for reproducibility
# This allows us to work with the data without making repeated API calls
output_file = Path("open-food-facts-bread-products.json")
with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(all_products, f, indent=2, ensure_ascii=False)

print(f"💾 Data saved to {output_file}")

💾 Data saved to open-food-facts-bread-products.json
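Saving to disk only pays off if a later run reads the file instead of calling the API again. A minimal sketch of that load-or-collect pattern follows. `load_or_collect`, `collect` and `pretend_api` are hypothetical names introduced for illustration, and the demo writes to a temporary directory rather than the notebook's output file.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_collect(cache_file: Path, collect: Callable[[], list]) -> list:
    """Return cached records if the file exists; otherwise collect and cache them."""
    if cache_file.exists():
        with open(cache_file, encoding="utf-8") as f:
            return json.load(f)
    records = collect()
    with open(cache_file, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    return records


calls = []  # records how many times the pretend API is hit


def pretend_api() -> list:
    """Hypothetical collector standing in for the paginated API loop."""
    calls.append(1)
    return [{"product_name": "Sourdough"}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "products.json"
    first = load_or_collect(cache, pretend_api)   # collects and writes the file
    second = load_or_collect(cache, pretend_api)  # reads the file, no second call

print(first == second, len(calls))  # True 1
```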


## Section 2: Systematic Data Inspection

Now that we have our data, let's apply the systematic inspection methodology from the lab notebook. This pattern works for any unfamiliar dataset:

1. **Check the type:** `type(data)` tells you if you have a dict or list
2. **If it's a dict:** Use `data.keys()` to see available fields
3. **If it's a list:** Use `len(data)` to see how many items, then `data[0]` to inspect the first item
4. **Repeat:** Apply the same pattern to nested structures

<div style="background-color: #fcfcfc; width:80%; margin-left: 1em; color: #212121; padding: 1em; border-radius: 0.5em; border-left: 5px solid #ff9800;">

**Key Idea:** This systematic approach to data inspection is transferable to any dataset you encounter. It helps you understand structure before attempting analysis.

</div>
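The four-step pattern can also be written as a small recursive helper. This is a sketch under stated assumptions: `describe` is a hypothetical function written for this illustration, and it only looks at the first item of each list, which assumes the items share a structure.

```python
def describe(data, name="data", depth=0, max_depth=2):
    """Return report lines: the type of `data`, then its keys (dict) or length (list)."""
    lines = []
    indent = "  " * depth
    if isinstance(data, dict):
        lines.append(f"{indent}{name}: dict with keys {list(data.keys())}")
        if depth < max_depth:
            # Repeat the same inspection on nested containers only
            for key, value in data.items():
                if isinstance(value, (dict, list)):
                    lines.extend(describe(value, f"{name}[{key!r}]", depth + 1, max_depth))
    elif isinstance(data, list):
        lines.append(f"{indent}{name}: list of {len(data)} items")
        if data and depth < max_depth:
            # Inspect the first item as a representative of the list
            lines.extend(describe(data[0], f"{name}[0]", depth + 1, max_depth))
    else:
        lines.append(f"{indent}{name}: {type(data).__name__}")
    return lines


# Toy record shaped like one product, not taken from the collected data
sample = [{"product_name": "Sourdough", "nutriments": {"fat_100g": 1.2, "salt_100g": 0.9}}]
report = describe(sample)
print("\n".join(report))
```

The `max_depth` argument keeps the report short on deeply nested responses; raise it when you need to look further down.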

In [5]:
# Step 1: What type is our data?
print(f"Type: {type(all_products)}")
print(f"Number of products: {len(all_products)}")
print()

# Step 2: Since it's a list, let's inspect the first product
if all_products:
    first_product = all_products[0]
    print(f"Type of first product: {type(first_product)}")
    print(f"Keys in first product: {list(first_product.keys())[:10]}...")  # Show first 10 keys

Type: <class 'list'>
Number of products: 1500

Type of first product: <class 'dict'>
Keys in first product: ['brands', 'categories', 'ingredients_text', 'nova_group', 'nutriments', 'product_name']...


In [6]:
# Let's look at a specific product to understand the structure
if all_products:
    sample = all_products[0]
    print("Sample product structure:")
    print(f"  Product name: {sample.get('product_name', 'N/A')}")
    print(f"  Brand: {sample.get('brands', 'N/A')}")
    print(f"  NOVA group: {sample.get('nova_group', 'N/A')}")
    print(f"  Nutriments type: {type(sample.get('nutriments', {}))}")

Sample product structure:
  Product name: WHITE CIABATTIN
  Brand: JASON'S SOURDOUGH
  NOVA group: 3
  Nutriments type: <class 'dict'>


## Section 3: Converting to DataFrame

Working with lists of dictionaries is manageable for small datasets, but for analysis we need a structured format. Let's convert our data to a pandas DataFrame.

In [7]:
# Convert the list of products to a DataFrame
df = pd.DataFrame(all_products)

print(f"📊 DataFrame shape: {df.shape[0]} rows × {df.shape[1]} columns")
print()
print("First few rows:")
df.head()

📊 DataFrame shape: 1500 rows × 6 columns

First few rows:


| | brands | categories | ingredients_text | nova_group | nutriments | product_name |
|---|---|---|---|---|---|---|
| 0 | JASON'S SOURDOUGH | Plant-based foods and beverages,Plant-based fo... | Wheat Flour (Fortified with Calcium Carbonate,... | 3.0 | {'added-sugars': 0, 'added-sugars_100g': 0, 'a... | WHITE CIABATTIN |
| 1 | Jason's Sourdough | Plant-based foods and beverages,Plant-based fo... | Wheat Flour (Wheat Flour, Calcium Carbonate, I... | 3.0 | {'added-sugars': 0, 'added-sugars_100g': 0, 'a... | Proper Sourdough |
| 2 | Jasons | Plant-based foods and beverages,Plant-based fo... | Wheat Flour (Wheat Flour, Calcium Carbonate, I... | 3.0 | {'added-sugars': 0, 'added-sugars_100g': 0, 'a... | Sourdough Grains & Seeds |
| 3 | Jason's | Plant-based foods and beverages,Plant-based fo... | Wheat Flour (Wheat Flour, Calcium Carbonate, I... | 3.0 | {'carbohydrates': 44.8, 'carbohydrates_100g': ... | Sourdough |
| 4 | Ryvita | Plant-based foods and beverages,Plant-based fo... | wholegrain rye flour, salt | 3.0 | {'carbohydrates': 7.1, 'carbohydrates_100g': 6... | Dark Rye Crispbread |


In [8]:
# Get an overview of the data structure
print("DataFrame info:")
df.info()

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   brands            1380 non-null   object 
 1   categories        1500 non-null   object 
 2   ingredients_text  1340 non-null   object 
 3   nova_group        1284 non-null   float64
 4   nutriments        1500 non-null   object 
 5   product_name      1439 non-null   object 
dtypes: float64(1), object(5)
memory usage: 70.4+ KB


In [9]:
# Check for missing values in key columns
key_columns = ['product_name', 'brands', 'nova_group', 'nutriments']
print("Missing values in key columns:")
print(df[key_columns].isnull().sum())

Missing values in key columns:
product_name     61
brands          120
nova_group      216
nutriments        0
dtype: int64


## Section 4: Feature Engineering

The `nutriments` column contains nested dictionaries with nutritional information. We need to extract this data and create a feature matrix suitable for analysis.

<div style="background-color: #fcfcfc; width:80%; margin-left: 1em; color: #212121; padding: 1em; border-radius: 0.5em; border: 1px solid #03a9f4; border-left: 5px solid #03a9f4;">

**What just happened?** The `nutriments` field is a dictionary within each product record. We'll use `pd.json_normalize()` to expand this nested structure into separate columns, making it ready for analysis.

</div>
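To make the expansion concrete, here is a plain-Python sketch of what happens to flat dictionaries like these: every key seen in any record becomes a column, and records that lack a key get a missing value. This is an illustration of the idea, not pandas' implementation. `expand_records` is a hypothetical helper, and it uses `None` where pandas would show `NaN`.

```python
def expand_records(records):
    """Turn a list of flat dicts into (columns, rows), filling absent keys with None."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:  # each distinct key becomes one column
                columns.append(key)
    rows = [[record.get(col) for col in columns] for record in records]
    return columns, rows


# Toy nutriments dictionaries, not taken from the collected data
nutriments = [
    {"fat_100g": 1.2, "salt_100g": 0.9},
    {"fat_100g": 3.6, "fiber_100g": 4.0},
]
columns, rows = expand_records(nutriments)
print(columns)
print(rows)
```

Because a column is created for every key that appears in at least one product, a field reported by only a handful of products still widens the whole table, which is one reason the expanded frame ends up with far more columns than any single product has.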

In [10]:
# Extract nutrient data from nested dictionaries
df_nutriments = pd.json_normalize(df['nutriments'])

print(f"📊 Extracted {df_nutriments.shape[1]} nutrient columns")
print()
print("Nutrient columns:")
print(df_nutriments.columns.tolist()[:15])  # Show first 15 columns

📊 Extracted 328 nutrient columns

Nutrient columns:
['added-sugars', 'added-sugars_100g', 'added-sugars_serving', 'added-sugars_unit', 'added-sugars_value', 'caffeine', 'caffeine_100g', 'caffeine_serving', 'caffeine_unit', 'caffeine_value', 'carbohydrates', 'carbohydrates_100g', 'carbohydrates_serving', 'carbohydrates_unit', 'carbohydrates_value']


In [11]:
# Combine the original DataFrame with nutrient data
# Drop the nested nutriments column and add the expanded columns
df_combined = pd.concat([
    df.drop(columns=['nutriments']),
    df_nutriments
], axis=1)

print(f"📊 Combined DataFrame: {df_combined.shape[0]} rows × {df_combined.shape[1]} columns")
df_combined.head(3)

📊 Combined DataFrame: 1500 rows × 333 columns


| | brands | categories | ingredients_text | nova_group | product_name | added-sugars | added-sugars_100g | added-sugars_serving | added-sugars_unit | added-sugars_value | ... | vitamin-pp_unit | vitamin-pp_value | fiber_modifier | carbohydrates_modifier | fat_modifier | proteins_modifier | salt_modifier | saturated-fat_modifier | sodium_modifier | sugars_modifier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | JASON'S SOURDOUGH | Plant-based foods and beverages,Plant-based fo... | Wheat Flour (Fortified with Calcium Carbonate,... | 3.0 | WHITE CIABATTIN | 0.0 | 0.0 | 0.0 | g | 0.0 | ... | | | | | | | | | | |
| 1 | Jason's Sourdough | Plant-based foods and beverages,Plant-based fo... | Wheat Flour (Wheat Flour, Calcium Carbonate, I... | 3.0 | Proper Sourdough | 0.0 | 0.0 | 0.0 | g | 0.0 | ... | | | | | | | | | | |
| 2 | Jasons | Plant-based foods and beverages,Plant-based fo... | Wheat Flour (Wheat Flour, Calcium Carbonate, I... | 3.0 | Sourdough Grains & Seeds | 0.0 | 0.0 | 0.0 | g | 0.0 | ... | | | | | | | | | | |


In [12]:
# Select relevant nutrient features for clustering
# We'll focus on key nutritional components
nutrient_features = [
    'energy-kcal_100g',
    'fat_100g',
    'saturated-fat_100g',
    'carbohydrates_100g',
    'sugars_100g',
    'fiber_100g',
    'proteins_100g',
    'salt_100g',
    'sodium_100g'
]

# Check which features are available
available_features = [f for f in nutrient_features if f in df_combined.columns]
print(f"✅ Available nutrient features: {len(available_features)}")
print(available_features)

✅ Available nutrient features: 9
['energy-kcal_100g', 'fat_100g', 'saturated-fat_100g', 'carbohydrates_100g', 'sugars_100g', 'fiber_100g', 'proteins_100g', 'salt_100g', 'sodium_100g']


In [23]:
# Create feature matrix for clustering
# Keep only products that have both a NOVA group (required for colour coding)
# and an energy value, since the filter below checks both columns
df_analysis = df_combined[df_combined['nova_group'].notna() & df_combined['energy-kcal_100g'].notna()].copy()

print(f"📊 Products with NOVA classification: {len(df_analysis)}")

# Select the nutrient features that are available
feature_cols = [col for col in available_features if col in df_analysis.columns]
df_features = df_analysis[feature_cols].copy()

# Handle missing values: fill with median for each column
for col in feature_cols:
    median_val = df_features[col].median()
    df_features[col] = df_features[col].fillna(median_val)

print(f"📊 Feature matrix: {df_features.shape[0]} products × {df_features.shape[1]} features")

📊 Products with NOVA classification: 1229
📊 Feature matrix: 1229 products × 9 features
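The median fill used in the cell above can be seen on a single toy column. The sketch below is plain Python rather than pandas. `fill_with_median` is a hypothetical helper, and `None` plays the role of a missing value.

```python
from statistics import median


def fill_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:  # nothing to compute a median from; leave the column untouched
        return list(values)
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Toy fibre values with two gaps
fiber = [3.0, None, 5.5, 3.9, None]
print(fill_with_median(fiber))  # [3.0, 3.9, 5.5, 3.9, 3.9]
```

The median is a common choice here because, unlike the mean, it is not pulled around by a few extreme values. The cost is that every filled product receives the same value for that feature.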


In [24]:
# Check the feature matrix
print("Feature matrix summary:")
df_features.describe()

Feature matrix summary:


| | energy-kcal_100g | fat_100g | saturated-fat_100g | carbohydrates_100g | sugars_100g | fiber_100g | proteins_100g | salt_100g | sodium_100g |
|---|---|---|---|---|---|---|---|---|---|
| count | 1229.0 | 1229.0 | 1229.0 | 1229.0 | 1229.0 | 1229.0 | 1229.0 | 1229.0 | 1229.0 |
| mean | 283.485704 | 5.228942 | 1.151024 | 47.673393 | 4.009486 | 4.69429 | 9.440097 | 1.053935 | 0.421578 |
| std | 82.414564 | 5.378179 | 1.758879 | 13.468189 | 4.044089 | 3.56667 | 3.320379 | 0.905526 | 0.362196 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 240.0 | 1.9 | 0.4 | 41.5 | 2.3 | 3.0 | 8.06 | 0.78 | 0.312 |
| 50% | 265.0 | 3.6 | 0.625 | 46.4 | 3.0 | 3.9 | 9.2 | 0.9 | 0.36 |
| 75% | 300.0 | 6.6 | 1.17 | 52.0 | 4.0 | 5.5 | 10.6 | 1.1 | 0.44 |
| max | 1756.0 | 54.0 | 18.4 | 275.6 | 47.0 | 77.0 | 50.8 | 17.4 | 6.96 |
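The summary above includes maxima that cannot be right for a per-100 g value: 275.6 g of carbohydrates in 100 g of product, and 1756 kcal per 100 g when pure fat provides only about 900 kcal per 100 g. Such entries are possibly data-entry or unit errors in the crowd-sourced database. A simple range check can flag them before analysis. The sketch below is one way to do that: the `LIMITS` thresholds and the `implausible_fields` helper are assumptions chosen for illustration, and the two records are toy data that reuse the median and maximum values from the summary.

```python
# Upper bounds per 100 g. These limits are illustrative assumptions:
# a gram-based nutrient cannot exceed 100 g per 100 g, and pure fat
# (about 9 kcal per gram) caps energy at roughly 900 kcal per 100 g.
LIMITS = {
    "energy-kcal_100g": 900,
    "fat_100g": 100,
    "carbohydrates_100g": 100,
    "proteins_100g": 100,
}


def implausible_fields(record, limits=LIMITS):
    """Return the names of fields whose value is negative or above its limit."""
    flagged = []
    for field, upper in limits.items():
        value = record.get(field)
        if value is not None and not (0 <= value <= upper):
            flagged.append(field)
    return flagged


# Toy records: one built from the medians, one from the maxima in the summary
products = [
    {"product_name": "ok loaf", "energy-kcal_100g": 265, "carbohydrates_100g": 46.4},
    {"product_name": "odd loaf", "energy-kcal_100g": 1756, "carbohydrates_100g": 275.6},
]
for p in products:
    print(p["product_name"], implausible_fields(p))
```

Whether flagged products should be dropped, corrected, or kept is a judgement call. The point of the check is to make that decision explicit, since extreme values can distort any distance-based method applied afterwards.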


## Section 5: Advanced Analysis Preview - UMAP Clustering

Now we'll apply UMAP (Uniform Manifold Approximation and Projection) to reduce our nine nutrient features to 2D for visualisation. This is a preview of the advanced techniques you'll learn in Week 08 and beyond.

<div style="background-color: #fcfcfc; width:80%; margin-left: 1em; color: #212121; padding: 1em; border-radius: 0.5em; border-left: 5px solid #ff9800;">

**Key Idea:** UMAP is a dimensionality reduction technique that preserves the local structure of data while reducing it to 2D or 3D for visualisation. It's similar to techniques like t-SNE but often faster and better at preserving global structure. In Week 08, you'll learn about embeddings and vector databases, which use similar principles.

</div>

In [25]:
# Apply UMAP dimensionality reduction
# Set random seed for reproducibility
umap_model = umap.UMAP(
    n_components=2,
    random_state=42,
    n_neighbors=15,
    min_dist=0.1,
    metric='euclidean'
)

print("🔄 Applying UMAP dimensionality reduction...")
umap_embedding = umap_model.fit_transform(df_features.values)

print(f"✅ Reduced {df_features.shape[1]} dimensions to 2D")
print(f"   Shape: {umap_embedding.shape}")

🔄 Applying UMAP dimensionality reduction...

n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.

✅ Reduced 9 dimensions to 2D
   Shape: (1229, 2)
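One caveat about the cell above: with `metric='euclidean'`, distances are computed on the raw feature values, so a feature measured in hundreds (energy in kcal) outweighs one measured in single digits (salt in grams). A common preprocessing option, which this notebook does not apply, is to standardise each feature first, for example with scikit-learn's `StandardScaler`. The plain-Python sketch below shows the effect on toy numbers. `standardise` and `rows` are hypothetical helpers written for this illustration.

```python
from math import dist
from statistics import mean, pstdev


def standardise(columns):
    """Z-score each column: subtract its mean and divide by its standard deviation."""
    scaled = {}
    for name, values in columns.items():
        mu, sigma = mean(values), pstdev(values)
        scaled[name] = [(v - mu) / sigma if sigma else 0.0 for v in values]
    return scaled


def rows(columns):
    """Transpose a dict of columns into a list of row tuples."""
    return list(zip(*columns.values()))


# Toy values: products 0 and 1 differ mainly in salt, products 0 and 2 mainly in energy
raw = {
    "energy-kcal_100g": [260.0, 262.0, 300.0],
    "salt_100g": [0.5, 2.5, 0.6],
}

r = rows(raw)
s = rows(standardise(raw))
print(dist(r[0], r[1]), dist(r[0], r[2]))  # raw: the 40 kcal gap dwarfs the salt gap
print(dist(s[0], s[1]), dist(s[0], s[2]))  # scaled: both features contribute
```

On the raw values, product 2 looks far from product 0 and product 1 looks close, purely because of units. After standardising, the fourfold salt difference counts as much as the energy difference. Whether to scale is a modelling decision, and it would change the embedding shown in this notebook.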


In [26]:
# Add UMAP coordinates to our analysis DataFrame
df_analysis['umap_x'] = umap_embedding[:, 0]
df_analysis['umap_y'] = umap_embedding[:, 1]

# Ensure NOVA group is numeric for colour mapping
df_analysis['nova_group'] = pd.to_numeric(df_analysis['nova_group'], errors='coerce')

print("✅ UMAP coordinates added to DataFrame")

✅ UMAP coordinates added to DataFrame


## Section 6: Interactive Visualisation

Now we'll create an interactive visualisation using Plotly. This allows us to explore the data dynamically, seeing how different products cluster together and how NOVA classifications relate to nutritional profiles.

<div style="background-color: #fcfcfc; width:80%; margin-left: 1em; color: #212121; padding: 1em; border-radius: 0.5em; border: 1px solid #4caf50; border-left: 5px solid #4caf50;">

**✅ Your code is correct!** Plotly creates interactive visualisations that work in Jupyter notebooks and can be exported to HTML. The hover information lets you explore individual products while seeing the overall clustering patterns.

</div>

In [27]:
# Define NOVA group colours (standard colour scheme)
nova_colours = {
    1: '#4caf50',   # Green - Unprocessed or minimally processed
    2: '#ffeb3b',   # Yellow - Processed culinary ingredients
    3: '#ff9800',   # Orange - Processed foods
    4: '#f44336'    # Red - Ultra-processed foods
}

# Create colour mapping
df_analysis['nova_colour'] = df_analysis['nova_group'].map(nova_colours)

In [28]:
# Create interactive scatter plot with Plotly
fig = go.Figure()

# Add points for each NOVA group
for nova_group in sorted(df_analysis['nova_group'].dropna().unique()):
    mask = df_analysis['nova_group'] == nova_group
    group_data = df_analysis[mask]
    
    # Create hover text with product information
    hover_text = []
    for idx, row in group_data.iterrows():
        name = str(row.get('product_name', 'Unknown'))[:50]  # Truncate long names
        brand = str(row.get('brands', 'Unknown'))[:30]
        energy = row.get('energy-kcal_100g', 'N/A')
        if pd.notna(energy):
            energy = f"{energy:.0f} kcal"
        hover_text.append(
            f"<b>{name}</b><br>" +
            f"Brand: {brand}<br>" +
            f"NOVA Group: {int(nova_group)}<br>" +
            f"Energy: {energy}/100g"
        )
    
    fig.add_trace(go.Scatter(
        x=group_data['umap_x'],
        y=group_data['umap_y'],
        mode='markers',
        name=f'NOVA Group {int(nova_group)}',
        marker=dict(
            color=nova_colours[nova_group],
            size=8,
            opacity=0.7,
            line=dict(width=0.5, color='white')
        ),
        text=hover_text,
        hovertemplate='%{text}<extra></extra>',
        showlegend=True
    ))

# Update layout for professional appearance
fig.update_layout(
    title={
        'text': 'Bread Products: Nutritional Profile Clustering by NOVA Classification',
        'x': 0.5,
        'xanchor': 'center',
        'font': {'size': 18, 'family': 'Arial, sans-serif'}
    },
    xaxis_title='UMAP Dimension 1',
    yaxis_title='UMAP Dimension 2',
    width=900,
    height=700,
    template='plotly_white',
    hovermode='closest',
    legend=dict(
        title='NOVA Classification',
        x=1.02,
        y=1,
        bgcolor='rgba(255, 255, 255, 0.8)',
        bordercolor='rgba(0, 0, 0, 0.2)',
        borderwidth=1
    )
)

# Update axes
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='rgba(0, 0, 0, 0.1)')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='rgba(0, 0, 0, 0.1)')

# Display the figure
fig.show()

In [30]:
# Save the figure as HTML for sharing
html_file = "bread-products-umap-clustering.html"
fig.write_html(html_file)
print(f"💾 Interactive visualisation saved to {html_file}")

💾 Interactive visualisation saved to bread-products-umap-clustering.html


## Section 7: Insights Discussion

Let's explore what patterns emerge from our clustering visualisation.

<div style="background-color: #fcfcfc; width:80%; margin-left: 1em; color: #212121; padding: 1em; border-radius: 0.5em; border-left: 5px solid #ff9800;">

**Key Idea:** The UMAP visualisation reveals how products cluster based on their nutritional profiles. Products with similar nutrient compositions appear closer together in the 2D space. The NOVA classification colour coding helps us see if processing level relates to nutritional composition.

</div>

In [29]:
# Analyse NOVA group distribution
print("NOVA Group Distribution:")
nova_counts = df_analysis['nova_group'].value_counts().sort_index()
for group, count in nova_counts.items():
    pct = (count / len(df_analysis)) * 100
    print(f"  Group {int(group)}: {count} products ({pct:.1f}%)")

NOVA Group Distribution:
  Group 1: 16 products (1.3%)
  Group 3: 383 products (31.2%)
  Group 4: 830 products (67.5%)


In [31]:
# Calculate average nutrients by NOVA group
if 'energy-kcal_100g' in df_analysis.columns:
    print("\nAverage Energy (kcal/100g) by NOVA Group:")
    avg_energy = df_analysis.groupby('nova_group')['energy-kcal_100g'].mean()
    for group, energy in avg_energy.items():
        print(f"  Group {int(group)}: {energy:.1f} kcal/100g")


Average Energy (kcal/100g) by NOVA Group:
  Group 1: 252.9 kcal/100g
  Group 3: 308.6 kcal/100g
  Group 4: 272.5 kcal/100g
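When reading these averages, keep the group sizes in mind: Group 1 contains only 16 products, so its mean rests on far fewer observations than the others. Reporting the count and the median next to the mean makes that visible. The sketch below does so in plain Python on toy records. `summarise_by_group` is a hypothetical helper, and in pandas a similar table could be produced with `groupby(...).agg(['count', 'mean', 'median'])`.

```python
from collections import defaultdict
from statistics import mean, median


def summarise_by_group(records, group_key, value_key):
    """Return {group: (count, mean, median)} for records that have both fields."""
    groups = defaultdict(list)
    for record in records:
        group, value = record.get(group_key), record.get(value_key)
        if group is not None and value is not None:
            groups[group].append(value)
    return {g: (len(v), mean(v), median(v)) for g, v in sorted(groups.items())}


# Toy records, not taken from the collected dataset
toy = [
    {"nova_group": 1, "energy-kcal_100g": 250},
    {"nova_group": 4, "energy-kcal_100g": 270},
    {"nova_group": 4, "energy-kcal_100g": 280},
    {"nova_group": 4, "energy-kcal_100g": 1756},  # one extreme value
]
summary = summarise_by_group(toy, "nova_group", "energy-kcal_100g")
for group, (n, avg, mid) in summary.items():
    print(f"Group {group}: n={n}, mean={avg:.1f}, median={mid:.1f}")
```

In the toy data a single extreme value drags the Group 4 mean far above its median, which is the kind of gap worth checking before interpreting differences between groups.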


### Reflection Questions

💭 **Personal Reflection:**

- What patterns do you notice in the clustering? Do products from the same NOVA group cluster together?
- Are there any outliers or unexpected groupings?
- What questions would you want to investigate further with this data?

- [*Write your notes here*]

## Section 8: Looking Forward

This demonstration shows where systematic thinking leads. We started with raw API data, applied systematic inspection, engineered features, and used advanced techniques to reveal patterns.

<div style="background-color: #fcfcfc; width:80%; margin-left: 1em; color: #212121; padding: 1em; border-radius: 0.5em; border: 1px solid #03a9f4; border-left: 5px solid #03a9f4;">

**What just happened?** Throughout this course, you'll build on these foundations:

- **Weeks 01-05:** Master data collection (APIs, web scraping) and processing
- **Week 08+:** Learn about embeddings and vector databases (similar principles to UMAP)
- **Final Project:** Apply these techniques to climate data from TPI Centre

The food-to-climate bridge: We start with familiar food data to build confidence, then progress to complex climate research data.

</div>

### Course Progression

- **This Week (W01):** API collection and systematic data inspection
- **Next Week (W02):** Web scraping when APIs aren't available
- **Weeks 03-05:** Building production-ready APIs and collaborative workflows
- **Weeks 07-11:** Advanced techniques including embeddings, NLP, and climate data analysis

---

📖 **Additional Resources:**

- [Open Food Facts API Documentation](https://openfoodfacts.github.io/openfoodfacts-server/api/)
- [UMAP Documentation](https://umap-learn.readthedocs.io/)
- [Plotly Python Documentation](https://plotly.com/python/)
- [NOVA Classification Wikipedia](https://en.wikipedia.org/wiki/Nova_classification)