# SnackTrack ML: Dataset Download & Inspection

This notebook downloads all **10 Kaggle datasets** used by the SnackTrack ML training pipeline,
inspects their structure and data quality, and converts them to **Parquet** format for
fast loading in subsequent notebooks.

### Datasets overview

| Category | Datasets | Purpose |
|----------|----------|---------|
| **Health/Diet** | `diet_recommendations`, `medical_diet`, `daily_food_nutrition` | Nutritional guidelines, medical diet rules, daily food logs |
| **Recipe Corpus** | `foodcom_reviews`, `foodcom_interactions`, `epicurious`, `recipes_64k` | Large-scale recipe data with ratings and nutrition |
| **Ingredients** | `recipe_ingredients`, `global_food_nutrition` | Ingredient lists, allergen/nutrient databases |
| **Recommendation** | `food_recommendation` | Baseline food + ingredient + rating data |

### Prerequisites

- A [Kaggle API token](https://www.kaggle.com/docs/api) in `~/.kaggle/kaggle.json`
- Internet connection for first-time download
- ~2 GB free disk space for raw CSV + Parquet copies
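
A quick preflight check can catch a missing token or a full disk before the download cell runs. The sketch below is illustrative only and is not part of the SnackTrack utilities: `find_kaggle_credentials` and `free_gb` are hypothetical helpers. It assumes the usual Kaggle conventions, namely a `kaggle.json` file under `~/.kaggle/` or the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables.

```python
import os
import shutil
from pathlib import Path


def find_kaggle_credentials(home=None, env=None):
    """Return a short description of where credentials were found, or None.

    Hypothetical helper for illustration; not part of dataset_downloader.
    """
    home = Path(home) if home is not None else Path.home()
    env = os.environ if env is None else env

    token_file = home / ".kaggle" / "kaggle.json"
    if token_file.is_file():
        return f"token file: {token_file}"
    # Assumption: credentials may also be supplied through these two variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment variables"
    return None


def free_gb(path="."):
    """Free disk space, in GB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


source = find_kaggle_credentials()
print("Kaggle credentials:", source or "NOT FOUND")
print(f"Free disk space: {free_gb():.1f} GB (about 2 GB needed)")
```

If the first line reports `NOT FOUND`, the download step will most likely fail on authentication, so it is worth fixing the token first.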

In [None]:
%pip install kagglehub pyarrow pandas -q

In [None]:
import sys
from pathlib import Path

# Ensure the notebook can find the utils package
sys.path.insert(0, "..")

from notebooks.utils.dataset_downloader import DATASETS, download_all, list_datasets

In [None]:
# Show all 10 available datasets
list_datasets()

In [None]:
# Download every dataset (skips those already present on disk)
results = download_all()

# Quick summary
succeeded = sum(1 for v in results.values() if v is not None)
print(f"\nDownloaded {succeeded}/{len(DATASETS)} datasets successfully.")

## Dataset Inspection

For every downloaded dataset we:
1. Locate all CSV files in its subdirectory
2. Print **shape**, **column names**, **dtypes**, and **null counts**
3. Preview the first 3 rows
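
To make the per-file report concrete, here is a minimal sketch of the same checks on a toy DataFrame. The column names and values are made up for illustration and do not come from any of the ten datasets.

```python
import pandas as pd

# Toy frame standing in for one CSV file; the values are invented.
df = pd.DataFrame(
    {
        "name": ["oatmeal", "granola bar", None],
        "calories": [150.0, None, 95.0],
        "rating": [4, 5, 3],
    }
)

nulls = df.isnull().sum()
report = {
    "shape": df.shape,
    "columns": list(df.columns),
    "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
    "null_counts": nulls[nulls > 0].to_dict(),  # only columns that have gaps
}
for key, value in report.items():
    print(f"{key}: {value}")

print(df.head(3))
```

Filtering with `nulls[nulls > 0]` keeps the report short on wide tables, since fully populated columns are omitted.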

In [None]:
import pandas as pd

# DATA_DIR is resolved by the downloader; import it rather than guessing the
# path relative to the current working directory.
from notebooks.utils.dataset_downloader import DATA_DIR

for name, dest in sorted(results.items()):
    if dest is None:
        print(f"=== {name} === SKIPPED (download failed)\n")
        continue

    print(f"{'=' * 80}")
    print(f"  {name}")
    print(f"  Directory: {dest}")
    print(f"{'=' * 80}")

    csv_files = sorted(Path(dest).glob("*.csv"))
    json_files = sorted(Path(dest).glob("*.json"))
    all_files = csv_files + json_files

    if not all_files:
        print("  No CSV/JSON files found.\n")
        continue

    for fpath in all_files:  # may be CSV or JSON
        try:
            if fpath.suffix == ".json":
                df = pd.read_json(fpath)
            else:
                df = pd.read_csv(fpath, nrows=50_000)  # cap for inspection

            print(f"\n  File: {fpath.name}")
            # For CSVs the row count reflects the 50,000-row inspection cap.
            print(f"  Shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
            print(f"  Columns: {list(df.columns)}")
            print("\n  Dtypes:")
            print("    " + df.dtypes.to_string().replace("\n", "\n    "))
            print("\n  Null counts:")
            nulls = df.isnull().sum()
            nulls = nulls[nulls > 0]
            # An empty Series still renders as a non-empty string, so test .empty.
            if nulls.empty:
                print("    None")
            else:
                print("    " + nulls.to_string().replace("\n", "\n    "))
            print("\n  First 3 rows:")
            display(df.head(3))
        except Exception as e:
            print(f"  Could not load {fpath.name}: {e}")

    print()

## Standardize & Save as Parquet

We standardize every dataset to a consistent format:
- **Column names** are lowercased with spaces/special chars replaced by underscores
- **Parquet** files are written to `data/<dataset_name>.parquet` for fast columnar reads

If a dataset contains multiple CSV files, each one is saved as
`data/<dataset_name>__<file_stem>.parquet`.
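
The two rules above can be sketched in isolation, without pandas. The helper names `normalize_name` and `parquet_name`, the sample column names, and the file name `train.csv` are all hypothetical and used only for illustration.

```python
import re
from pathlib import Path


def normalize_name(col):
    """Lowercase, then collapse each run of non-alphanumerics into one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", str(col).lower()).strip("_")


def parquet_name(dataset, source_file, n_files):
    """Single-file datasets keep the dataset name; others append the file stem."""
    if n_files == 1:
        return f"{dataset}.parquet"
    return f"{dataset}__{Path(source_file).stem}.parquet"


# Made-up column names showing typical clean-ups.
for raw in ["Calories (kcal)", "Recipe ID", "  Total Fat % ", "protein_g"]:
    print(f"{raw!r:>20} -> {normalize_name(raw)!r}")

print(parquet_name("epicurious", "train.csv", n_files=1))
print(parquet_name("epicurious", "train.csv", n_files=2))
```

One caveat of this scheme: distinct source names such as `Fat (g)` and `fat_g` normalize to the same string, so a file containing both would end up with duplicate column names. Checking for duplicates before writing Parquet may be worthwhile.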

In [None]:
import re


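def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lowercased column names.

    Each run of spaces/special chars becomes a single underscore.
    """
    # str() guards against non-string labels (e.g. integer keys from JSON);
    # rename() returns a new frame instead of mutating the caller's object.
    return df.rename(
        columns=lambda col: re.sub(r"[^a-z0-9]+", "_", str(col).lower()).strip("_")
    )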


saved_files: list[str] = []

for name, dest in sorted(results.items()):
    if dest is None:
        continue

    csv_files = sorted(Path(dest).glob("*.csv"))
    json_files = sorted(Path(dest).glob("*.json"))
    all_files = csv_files + json_files

    if not all_files:
        continue

    for fpath in all_files:
        try:
            if fpath.suffix == ".json":
                df = pd.read_json(fpath)
            else:
                df = pd.read_csv(fpath)

            df = standardize_columns(df)

            # Choose output filename
            if len(all_files) == 1:
                out_name = f"{name}.parquet"
            else:
                out_name = f"{name}__{fpath.stem}.parquet"

            out_path = DATA_DIR / out_name
            df.to_parquet(out_path, index=False)
            saved_files.append(out_name)
            print(f"  Saved {out_name}  ({df.shape[0]:,} rows, {df.shape[1]} cols)")

        except Exception as e:
            print(f"  FAILED {fpath.name}: {e}")

print(f"\nTotal Parquet files written: {len(saved_files)}")

## Summary

This notebook has completed the following steps:

1. **Listed** all 10 Kaggle datasets required by the SnackTrack ML pipeline
2. **Downloaded** each dataset via `kagglehub` into `data/<dataset_name>/`
3. **Inspected** every CSV/JSON file: shapes, dtypes, nulls, and sample rows
4. **Standardized** column names (lowercase + underscores) and saved as **Parquet**

### Next steps

- **Notebook 01** (`01_eda_and_data_quality.ipynb`): Exploratory data analysis across all datasets
- **Notebook 02** (`02_content_based_analysis.ipynb`): Build ingredient/nutrition vectors and test content-based filtering