# Dataset Health Check — `price_nok_total` & `carbon_kg_total`

This notebook:
1. Loads your dataset (default: `../../data/openfoodfacts/step10.csv`).
2. Checks for **null values** in `price_nok_total` and `carbon_kg_total`.
3. Finds which **product_name** has the highest total price in **NOK** and the highest total **CO₂** (kg), summed across all rows that share that name.

**Tip:** Adjust the `DATA_PATH` below if your file lives elsewhere.
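
If the file may live in more than one place, a small guard can make a wrong path fail with a clearer message than the one `pd.read_csv` gives. This is a minimal sketch, not part of the original notebook; `resolve_data_path` is a hypothetical helper name.

```python
from pathlib import Path


def resolve_data_path(candidates):
    """Return the first candidate that exists as a file, else raise.

    Hypothetical helper for illustration; the notebook itself simply
    hard-codes DATA_PATH.
    """
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    tried = [str(c) for c in candidates]
    raise FileNotFoundError(f"None of these paths exist: {tried}")
```

It could be called as `DATA_PATH = resolve_data_path(['../../data/openfoodfacts/step10.csv', '../data/openfoodfacts/step10.csv'])`, which tries both relative locations before giving up.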

In [5]:
import pandas as pd
import numpy as np
from pathlib import Path

# ---- Config ----
DATA_PATH = Path('../../data/openfoodfacts/step10.csv')
REQ_COLS = ['product_name', 'price_nok_total', 'carbon_kg_total']

print('Reading:', DATA_PATH)
df = pd.read_csv(DATA_PATH, low_memory=False)
print('Rows:', len(df), '| Cols:', len(df.columns))
print('Columns:', list(df.columns))

Reading: ..\..\data\openfoodfacts\step10.csv
Rows: 80817 | Cols: 20
Columns: ['created_datetime', 'product_name', 'food_groups', 'energy-kcal_100g', 'energy_100g', 'fat_100g', 'saturated-fat_100g', 'carbohydrates_100g', 'sugars_100g', 'fiber_100g', 'proteins_100g', 'salt_100g', 'ingredients_tags', 'nutriscore_score', 'nutriscore_grade', 'created_t', 'energy-kj_100g', 'sodium_100g', 'price_nok_total', 'carbon_kg_total']


In [6]:
# ---- Basic column checks ----
missing = [c for c in REQ_COLS if c not in df.columns]
if missing:
    raise ValueError(f"Missing required columns: {missing}")

# Coerce to numeric for safety; unparseable values become NaN and are counted
# by the null diagnostics below (the source file itself is not modified).
df['price_nok_total'] = pd.to_numeric(df['price_nok_total'], errors='coerce')
df['carbon_kg_total'] = pd.to_numeric(df['carbon_kg_total'], errors='coerce')

# ---- Null diagnostics ----
nulls = df[['price_nok_total','carbon_kg_total']].isna().sum().rename('null_count')
print('\nNull counts:')
display(nulls.to_frame())

rows_with_nulls = df[df[['price_nok_total','carbon_kg_total']].isna().any(axis=1)]
print('\nRows with any null in target columns:', len(rows_with_nulls))
display(rows_with_nulls[['product_name','price_nok_total','carbon_kg_total']].head(10))


Null counts:


| | null_count |
|---|---|
| price_nok_total | 0 |
| carbon_kg_total | 0 |



Rows with any null in target columns: 0


| | product_name | price_nok_total | carbon_kg_total |
|---|---|---|---|
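
Because `pd.to_numeric(..., errors='coerce')` turns unparseable values into NaN, the null counts above cannot tell a genuinely missing value from a malformed one. The sketch below separates the two. It uses toy data, `coercion_report` is a hypothetical helper name, and it would need to run on the raw columns, before they are overwritten by the coercion step.

```python
import pandas as pd


def coercion_report(frame, columns):
    """Count, per column, values that are present but not parseable as numbers."""
    counts = {}
    for col in columns:
        raw = frame[col]
        coerced = pd.to_numeric(raw, errors='coerce')
        # Present in the raw data but NaN after coercion: not a valid number.
        counts[col] = int((raw.notna() & coerced.isna()).sum())
    return pd.Series(counts, name='unparseable_count')


# Toy data for illustration only (not taken from step10.csv).
demo = pd.DataFrame({
    'price_nok_total': ['12.5', 'n/a', None, '30'],
    'carbon_kg_total': [0.4, 1.2, 0.9, 2.0],
})
report = coercion_report(demo, ['price_nok_total', 'carbon_kg_total'])
print(report)
```

In the toy frame only `'n/a'` is reported as unparseable; the `None` entry is a plain missing value and is left to the ordinary null count.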


In [7]:
# ---- Top items by product_name ----
# Sum both totals per product_name, then pick the largest of each.
# dropna=False keeps rows with a missing product_name as their own group.
agg = (
    df.groupby('product_name', dropna=False)[['price_nok_total', 'carbon_kg_total']]
    .sum()
    .sort_values('price_nok_total', ascending=False)
)

top_nok_name = agg['price_nok_total'].idxmax()
top_nok_val  = agg.loc[top_nok_name, 'price_nok_total']
top_co2_name = agg['carbon_kg_total'].idxmax()
top_co2_val  = agg.loc[top_co2_name, 'carbon_kg_total']

print('Most expensive in NOK (by product_name, summed across rows):')
print(f"  {top_nok_name!r} -> {top_nok_val}")

print('\nHighest CO₂ total (by product_name, summed across rows):')
print(f"  {top_co2_name!r} -> {top_co2_val}")

Most expensive in NOK (by product_name, summed across rows):
  'Cookies' -> 19489

Highest CO₂ total (by product_name, summed across rows):
  'Cookies' -> 349.8
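
Since these totals are summed across rows, a name that appears many times can rank first even when each individual row is modest. A minimal sketch with toy data (not the real dataset) compares the sum with the row count and the per-row mean:

```python
import pandas as pd

# Toy data for illustration only (not taken from step10.csv).
demo = pd.DataFrame({
    'product_name': ['Cookies', 'Cookies', 'Cookies', 'Foie gras'],
    'price_nok_total': [40.0, 50.0, 60.0, 120.0],
})

# One row per name, with the summed total, the per-row mean, and the row count.
summary = demo.groupby('product_name', dropna=False)['price_nok_total'].agg(
    ['sum', 'mean', 'count']
)

top_by_sum = summary['sum'].idxmax()    # favours names with many rows
top_by_mean = summary['mean'].idxmax()  # typical value of a single row
print(summary)
print('Top by sum:', top_by_sum, '| Top by mean:', top_by_mean)
```

Which ranking is appropriate depends on the question being asked; the count column at least shows how many rows contribute to each total.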


In [8]:
# ---- Optional: show the Top 10 tables ----
top10_nok = agg.nlargest(10, 'price_nok_total')[['price_nok_total']]
top10_co2 = agg.nlargest(10, 'carbon_kg_total')[['carbon_kg_total']]

print('\nTop 10 by NOK:')
display(top10_nok)

print('\nTop 10 by CO₂:')
display(top10_co2)


Top 10 by NOK:


| product_name | price_nok_total |
|---|---|
| Cookies | 19489 |
| Foie gras de canard entier | 14237 |
| Milk chocolate | 13529 |
| Dark chocolate | 13125 |
| Chocolate candies | 12466 |
| Bloc de foie gras de canard | 11813 |
| Milk Chocolate | 11211 |
| Candy | 9943 |
| Chocolate | 8779 |
| Toaster pastries | 7437 |



Top 10 by CO₂:


| product_name | carbon_kg_total |
|---|---|
| Cookies | 349.8 |
| Milk chocolate | 280.048 |
| Chocolate candies | 267.654 |
| Dark chocolate | 238.957 |
| Milk Chocolate | 237.458 |
| Chocolate | 171.364 |
| Frosted sugar cookies | 160.584 |
| Frosted Sugar Cookies | 148.165 |
| Dark Chocolate | 137.089 |
| Foie gras de canard entier | 132.289 |
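
The Top 10 tables list names that differ only in capitalization as separate entries (for example `Milk chocolate` and `Milk Chocolate`), so their totals are split. One possible remedy is to group on a normalized key instead of the raw name. The sketch below uses toy data, and the normalization rule (trim whitespace, then case-fold) is an assumption, not something the notebook does.

```python
import pandas as pd

# Toy data for illustration only (not taken from step10.csv).
demo = pd.DataFrame({
    'product_name': ['Milk chocolate', 'Milk Chocolate', ' milk chocolate ', 'Candy'],
    'carbon_kg_total': [1.0, 2.0, 0.5, 3.0],
})

# Hypothetical normalization: trim surrounding whitespace and case-fold the name.
name_key = demo['product_name'].str.strip().str.casefold()

# Group on the normalized key so case and spacing variants are merged.
merged = (
    demo.groupby(name_key)['carbon_kg_total']
    .sum()
    .sort_values(ascending=False)
)
print(merged)
```

With the variants merged, the three chocolate rows count as one product, which can change the ranking relative to grouping on the raw names.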
