# IUCN Red List Enrichment for Threatened Species

Reads the list of **unique threatened species** (and optionally invasive) from the GBIF silver layer on S3,
then enriches each species with the full textual IUCN Red List assessment via the
[IUCN Red List API v4](https://api.iucnredlist.org/api-docs/index.html).

## ⚠️ IUCN Red List vs invasive species

| Source | Covers | API |
|--------|--------|-----|
| **IUCN Red List** | Conservation status (CR, EN, VU, NT, LC…) – *threatened species* | ✅ Yes – this notebook |
| **GISD** (Global Invasive Species Database) | Invasive alien species | ❌ No public REST API (data via GBIF DwC-A) |

**IUCN Red List API** is for **threatened species** (extinction risk). It is **not** an invasive species database.
Many invasive species are **not** listed as threatened (they are LC, NE, or otherwise unassessed) → low hit rate when querying IUCN for invasive-only species.

- **Threatened species (CR/EN/VU)** → high hit rate, rich textual profiles ✅  
- **Invasive species** → we still query IUCN (some are also assessed); expect many "not found". For invasive-specific enrichment, use GISD data via GBIF.

## Why per-species querying?

The IUCN API offers several strategies:

| Approach | Endpoint | Good for |
|----------|----------|----------|
| **Per-species** ✅ | `GET /taxa/scientific_name` + `GET /assessment/{id}` | Our use case – targeted enrichment for species in our dataset |
| Country bulk | `GET /countries/ES` | Returns **all** Red Listed species in Spain (~10k+), most not in our data |
| Category bulk | `GET /red_list_categories/CR` | All CR species globally, too broad |

Per-species querying is the right approach because:
- We only enrich species **already present in our silver layer** (no wasted calls)
- A local disk cache avoids re-hitting the API on repeated runs
- The assessment endpoint returns the **full narrative text** (rationale, habitat, threats, etc.) that the bulk endpoints omit

## Two-step API call per species

```
1.  GET /api/v4/taxa/scientific_name
        ?genus_name=Lynx&species_name=pardinus
        → assessment list → latest assessment_id

2.  GET /api/v4/assessment/{assessment_id}
        → full textual profile (rationale, habitat, threats, conservation…)
```
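
The two requests above can be sketched as plain URL builders, which is handy for logging or debugging a lookup by hand. This is a minimal illustration only: `taxa_lookup_url` and `assessment_url` are hypothetical helper names, the ID `12345` is a placeholder, and the `Authorization` header the notebook sends is left out.

```python
from typing import Optional
from urllib.parse import urlencode

IUCN_BASE = "https://api.iucnredlist.org/api/v4"


def taxa_lookup_url(genus: str, species: str, infra: Optional[str] = None) -> str:
    """Build the step-1 URL: taxon lookup by scientific name."""
    params = {"genus_name": genus, "species_name": species}
    if infra:
        params["infra_name"] = infra
    return f"{IUCN_BASE}/taxa/scientific_name?{urlencode(params)}"


def assessment_url(assessment_id: int) -> str:
    """Build the step-2 URL: full assessment by ID."""
    return f"{IUCN_BASE}/assessment/{assessment_id}"


print(taxa_lookup_url("Lynx", "pardinus"))
print(assessment_url(12345))  # placeholder ID, not a real assessment
```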

## Output

| Location | Content |
|----------|---------|
| **S3** `s3://ie-datalake/silver/iucn_species_profiles/country=XX/year=YYYY/` | Parquet + JSON – final data in silver layer |
| **Local** `iucn_enrichment/species_profiles.parquet` | Structured table – one row per species |
| **Local** `iucn_enrichment/species_profiles.json` | Array of dicts – ready for LLM/agentic pipeline |
| **Local** `iucn_enrichment/raw/{species}.json` | Raw API response per species (debug/audit) |
| **Local** `iucn_enrichment/iucn_cache.json` | Local call cache (saves API quota on re-runs) |

In [31]:
%pip install -q requests pyarrow s3fs pandas python-dotenv tqdm


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m26.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [8]:
import json
import logging
import os
import re
import time
from pathlib import Path

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as pa_ds
import pyarrow.parquet as pq
import requests
import s3fs
from dotenv import load_dotenv
from tqdm.notebook import tqdm

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%H:%M:%S",
    force=True,
)
log = logging.getLogger("iucn_enrichment")

In [9]:
# ─── Configuration ────────────────────────────────────────────────────────────

# Load secrets from .env (one level up from src/)
load_dotenv(Path("../.env"))
IUCN_API_KEY = os.environ.get("IUCN_API_KEY", "")   # .get() so the assert below reports a missing key
assert IUCN_API_KEY, "IUCN_API_KEY not found in .env"

# S3 silver layer location
S3_BUCKET        = "ie-datalake"
SILVER_PREFIX    = "silver/gbif"
AWS_PROFILE      = "486717354268_PowerUserAccess"     # change to match your SSO profile

# Which countries / years to extract species from
# Set to None to scan all available partitions
FILTER_COUNTRIES = ["ES"]            # e.g. ["ES", "PT"] or None
FILTER_YEARS     = [2024]            # e.g. [2024, 2023] or None

# Which species to enrich
INCLUDE_THREATENED = True            # CR, EN, VU – IUCN Red List is designed for these ✅
INCLUDE_INVASIVE   = False           # is_invasive_any – many not on Red List; low hit rate (use GISD for invasive-specific data)

# Output: local cache + S3 silver table
OUT_DIR = Path("../iucn_enrichment")
OUT_DIR.mkdir(parents=True, exist_ok=True)
(OUT_DIR / "raw").mkdir(exist_ok=True)

# S3 silver table for final species profiles (new table in silver layer)
SILVER_IUCN_PREFIX = "silver/iucn_species_profiles"   # s3://ie-datalake/silver/iucn_species_profiles/

# IUCN API
IUCN_BASE  = "https://api.iucnredlist.org/api/v4"
HEADERS    = {"Authorization": f"Bearer {IUCN_API_KEY}"}
RATE_DELAY = 0.5    # seconds between API calls
MAX_RETRIES = 3

THREATENED_CATS = {"CR", "EN", "VU"}

log.info("Config loaded. Output dir: %s", OUT_DIR.resolve())

20:26:36 [INFO] Config loaded. Output dir: /Users/jakubizewski/Desktop/repos/ie-microsoft-capstone/iucn_enrichment


In [26]:
# ─── S3 connection ────────────────────────────────────────────────────────────

import boto3
import pyarrow.dataset as pa_ds
import pyarrow.fs as pafs

session = boto3.Session(profile_name=AWS_PROFILE)
creds   = session.get_credentials().get_frozen_credentials()
_region = "eu-west-2"

# s3fs – used for partition listing and S3 writes
fs = s3fs.S3FileSystem(
    key=creds.access_key,
    secret=creds.secret_key,
    token=creds.token,
)

# PyArrow native S3FileSystem – used for READS (5–10× faster than s3fs for Parquet).
# Same pattern as gbif_silver_to_gold.ipynb.
fs_read = pafs.S3FileSystem(
    access_key=creds.access_key,
    secret_key=creds.secret_key,
    session_token=creds.token,
    region=_region,
)

pa.set_io_thread_count(min(16, (os.cpu_count() or 4) * 2))
pa.set_cpu_count(os.cpu_count() or 4)

# Quick connectivity test
try:
    fs.ls(f"{S3_BUCKET}/{SILVER_PREFIX}", detail=False)[:3]
    log.info("S3 ready (profile=%s, region=%s, io_threads=%d)", AWS_PROFILE, _region, pa.io_thread_count())
except Exception as e:
    log.error("S3 connection failed: %s", e)
    raise

20:47:22 [INFO] Found credentials in shared credentials file: ~/.aws/credentials
20:47:23 [INFO] S3 ready (profile=486717354268_PowerUserAccess, region=eu-west-2, io_threads=16)


## 1 · Extract unique threatened species from silver layer

(Optionally also invasive – see config; IUCN hit rate is low for invasive-only species.)

In [11]:
# ─── Discover silver partitions ───────────────────────────────────────────────

def list_silver_partitions(
    countries: list[str] | None = None,
    years: list[int] | None = None,
) -> list[dict]:
    """Return list of {country, year, path} dicts for silver partitions."""
    all_partitions = []
    country_dirs   = fs.ls(f"{S3_BUCKET}/{SILVER_PREFIX}", detail=False)

    for country_dir in country_dirs:
        country = country_dir.split("country=")[-1]
        if countries and country not in countries:
            continue
        year_dirs = fs.ls(country_dir, detail=False)
        for year_dir in year_dirs:
            year_str = year_dir.split("year=")[-1]
            try:
                year = int(year_str)
            except ValueError:
                continue
            if years and year not in years:
                continue
            all_partitions.append({"country": country, "year": year, "path": year_dir})

    log.info("Found %d silver partition(s) matching filters", len(all_partitions))
    return all_partitions


partitions = list_silver_partitions(FILTER_COUNTRIES, FILTER_YEARS)
for p in partitions:
    log.info("  country=%s  year=%s  path=%s", p["country"], p["year"], p["path"])

20:26:44 [INFO] Found 1 silver partition(s) matching filters
20:26:44 [INFO]   country=ES  year=2024  path=ie-datalake/silver/gbif/country=ES/year=2024
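
The partition listing relies on splitting paths at `country=` and `year=`. A stricter alternative, sketched here under the assumption that every partition path ends in `country=<code>/year=<4 digits>`, is a single regular expression that rejects anything else (marker files, stray folders). `parse_partition` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import Optional

# Matches the trailing Hive-style segment "country=XX/year=YYYY"
_PARTITION_RE = re.compile(r"country=(?P<country>[^/]+)/year=(?P<year>\d{4})/?$")


def parse_partition(path: str) -> Optional[dict]:
    """Return {'country', 'year', 'path'} for a partition path, else None."""
    m = _PARTITION_RE.search(path)
    if m is None:
        return None
    return {"country": m["country"], "year": int(m["year"]), "path": path}


print(parse_partition("ie-datalake/silver/gbif/country=ES/year=2024"))
print(parse_partition("ie-datalake/silver/gbif/_SUCCESS"))
```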


In [12]:
# ─── Extract unique species per partition ─────────────────────────────────────

# Columns we need from silver (minimal projection for speed)
SPECIES_COLS = [
    "species", "scientificName",           # name columns (one or both present)
    "genus", "family", "order",            # taxonomy
    "kingdom", "class",
    "speciesKey", "taxonKey",              # GBIF numeric IDs
    "iucn_cat", "iucnRedListCategory",     # IUCN category
    "is_threatened",
    "is_invasive_any",
]


def _find_col(df: pd.DataFrame, *candidates: str) -> str | None:
    """Case-insensitive, underscore-normalised column lookup."""
    norm = {c.lower().replace("_", ""): c for c in df.columns}
    for cand in candidates:
        key = cand.lower().replace("_", "")
        if key in norm:
            return norm[key]
    return None


def read_silver_partition(country: str, year: int) -> pd.DataFrame:
    """
    Read one silver partition from S3 with column projection.
    Uses pyarrow native S3FileSystem + scanner (same as gbif_silver_to_gold)
    – significantly faster than s3fs for large Parquet files.
    """
    native_path = f"{S3_BUCKET}/{SILVER_PREFIX}/country={country}/year={year}"
    log_path    = f"s3://{native_path}"

    info = fs_read.get_file_info(native_path)
    if info.type == pafs.FileType.NotFound:
        log.warning("Partition not found: %s", log_path)
        return pd.DataFrame()

    log.info("Reading silver: %s", log_path)
    t0 = time.time()

    dataset   = pa_ds.dataset(native_path, filesystem=fs_read, format="parquet")
    available = set(dataset.schema.names)
    avail_lower = {c.lower().replace("_", ""): c for c in available}
    project   = []
    for want in SPECIES_COLS:
        key = want.lower().replace("_", "")
        if key in avail_lower:
            project.append(avail_lower[key])

    log.info("Projecting %d / %d columns – starting download…", len(project), len(available))

    table = dataset.scanner(columns=project, use_threads=True).to_table()
    df    = table.to_pandas()

    df["source_country"] = country
    df["source_year"]    = year

    elapsed = time.time() - t0
    log.info("Loaded %d rows in %.1fs", len(df), elapsed)

    # Normalise IUCN category column
    iucn_col = _find_col(df, "iucn_cat", "iucnRedListCategory")
    if iucn_col:
        df["_iucn_cat_norm"] = df[iucn_col].astype(str).str.upper().str.strip()
    else:
        df["_iucn_cat_norm"] = pd.NA

    # Normalise invasive flag
    inv_col = _find_col(df, "is_invasive_any")
    if inv_col:
        # fillna first: a missing value (NaN) would otherwise cast to True
        df["_is_invasive"] = df[inv_col].fillna(False).astype(bool)
    else:
        df["_is_invasive"] = False

    # Apply filter
    mask = pd.Series(False, index=df.index)
    if INCLUDE_THREATENED:
        mask |= df["_iucn_cat_norm"].isin(THREATENED_CATS)
    if INCLUDE_INVASIVE:
        mask |= df["_is_invasive"]

    return df[mask]


# Collect species across all partitions
all_frames = []
for part in tqdm(partitions, desc="Reading silver partitions"):
    frame = read_silver_partition(part["country"], part["year"])
    all_frames.append(frame)
    log.info("  → %d matching rows", len(frame))

df_all = pd.concat(all_frames, ignore_index=True) if all_frames else pd.DataFrame()
log.info("Total matching rows: %d", len(df_all))


20:26:48 [INFO] Reading silver: s3://ie-datalake/silver/gbif/country=ES/year=2024
20:26:50 [INFO] Projecting 12 / 64 columns – starting download…
20:27:49 [INFO] Loaded 7421317 rows in 61.2s
20:27:53 [INFO]   → 149015 matching rows
20:27:53 [INFO] Total matching rows: 149015
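
The column lookup used above tolerates naming differences between silver tables (`iucn_cat` vs `iucnRedListCategory`, snake_case vs camelCase). The same idea can be shown without pandas, on a plain list of column names. `find_col` below is a hypothetical stand-alone variant of `_find_col`, written only to illustrate the normalisation.

```python
from typing import Iterable, Optional


def _norm(name: str) -> str:
    """Lower-case and drop underscores so snake_case and camelCase compare equal."""
    return name.lower().replace("_", "")


def find_col(columns: Iterable[str], *candidates: str) -> Optional[str]:
    """Return the first real column name matching any candidate, else None."""
    lookup = {_norm(c): c for c in columns}
    for cand in candidates:
        hit = lookup.get(_norm(cand))
        if hit is not None:
            return hit
    return None


cols = ["species", "iucnRedListCategory", "is_invasive_any"]
print(find_col(cols, "iucn_cat", "iucn_red_list_category"))
print(find_col(cols, "genus"))
```

Candidates are tried in order, so the preferred column name should be listed first.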


In [13]:
# ─── Deduplicate: one row per unique scientific name ──────────────────────────

def best_species_name(row: pd.Series) -> str | None:
    """Return the most complete scientific name from available columns."""
    for col in ("species", "scientificName"):
        val = row.get(col)
        if pd.notna(val) and str(val).strip():
            return str(val).strip()
    # Fallback: genus only (no species-level name is available)
    genus = row.get("genus")
    if pd.notna(genus) and str(genus).strip():
        return str(genus).strip()
    return None


df_all["_sci_name"] = df_all.apply(best_species_name, axis=1)
df_all = df_all.dropna(subset=["_sci_name"])

# One representative row per species (keep the most severe IUCN category)
species_df = (
    df_all
    .sort_values("_iucn_cat_norm")    # alphabetical order happens to match severity for CR < EN < VU only
    .drop_duplicates(subset=["_sci_name"], keep="first")
    .reset_index(drop=True)
)

log.info("Unique species to enrich: %d", len(species_df))
print("\nSample:")
display(
    species_df[["_sci_name", "_iucn_cat_norm", "_is_invasive", "source_country"]]
    .head(20)
)

20:28:43 [INFO] Unique species to enrich: 390



Sample:


|   | _sci_name | _iucn_cat_norm | _is_invasive | source_country |
|---|-----------|----------------|--------------|----------------|
| 0 | Puffinus mauretanicus | CR | False | ES |
| 1 | Oxalis corniculata | CR | False | ES |
| 2 | Odontites viscosus | CR | False | ES |
| 3 | Gallotia stehlini | CR | False | ES |
| 4 | Vanellus gregarius | CR | False | ES |
| 5 | Gyps rueppellii | CR | False | ES |
| 6 | Arenaria grandiflora | CR | False | ES |
| 7 | Myliobatis aquila | CR | False | ES |
| 8 | Anguilla anguilla | CR | False | ES |
| 9 | Echium acanthocarpum | CR | False | ES |


In [15]:
# ─── Summary of species to be enriched ───────────────────────────────────────

threatened_only = species_df[species_df["_iucn_cat_norm"].isin(THREATENED_CATS)]
invasive_only   = species_df[species_df["_is_invasive"]]
both            = species_df[species_df["_iucn_cat_norm"].isin(THREATENED_CATS) & species_df["_is_invasive"]]

print(f"{'─'*45}")
print(f"  Threatened (CR/EN/VU)           : {len(threatened_only):5d}")
print(f"  Invasive                        : {len(invasive_only):5d}")
print(f"  Both (threatened + invasive)    : {len(both):5d}")
print(f"  TOTAL unique species to enrich  : {len(species_df):5d}")
print(f"{'─'*45}")

if len(species_df) > 0:
    cat_dist = species_df["_iucn_cat_norm"].value_counts().sort_index()
    print("\nCategory distribution:")
    for cat, cnt in cat_dist.items():
        print(f"  {cat:5s}: {cnt}")

─────────────────────────────────────────────
  Threatened (CR/EN/VU)           :   390
  Invasive                        :     1
  Both (threatened + invasive)    :     1
  TOTAL unique species to enrich  :   390
─────────────────────────────────────────────

Category distribution:
  CR   : 59
  EN   : 150
  VU   : 181
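
Sorting category codes alphabetically gives the right severity order for CR, EN and VU, but not once other categories appear (for example when invasive species are included). A small sketch of an explicit ranking follows; `SEVERITY_RANK` and `severity_key` are hypothetical names, and sending every unlisted code to the end is an assumption made for this example.

```python
# Most severe first; codes not listed here sort last
SEVERITY_RANK = {"CR": 0, "EN": 1, "VU": 2, "NT": 3, "LC": 4}


def severity_key(category) -> int:
    """Sort key that orders IUCN category codes by extinction risk."""
    return SEVERITY_RANK.get(str(category).upper().strip(), len(SEVERITY_RANK))


codes = ["LC", "VU", "CR", "NT", "EN"]
print(sorted(codes))                    # alphabetical: LC and NT land before VU
print(sorted(codes, key=severity_key))  # by severity
```

With pandas, the same key could be applied through `sort_values(..., key=lambda s: s.map(severity_key))` before dropping duplicates.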


## 2 · IUCN API client with caching

In [16]:
# ─── Disk cache for IUCN responses ───────────────────────────────────────────
# Avoids re-hitting the API if you re-run the notebook.
# Key = scientific name (normalised), value = full assessment dict.

CACHE_PATH = OUT_DIR / "iucn_cache.json"

_cache: dict[str, dict | None] = {}   # None = looked up, nothing found
if CACHE_PATH.exists():
    with open(CACHE_PATH, encoding="utf-8") as f:
        _cache = json.load(f)
    log.info("Loaded %d cached species from %s", len(_cache), CACHE_PATH)


def _save_cache() -> None:
    with open(CACHE_PATH, "w", encoding="utf-8") as f:
        json.dump(_cache, f, ensure_ascii=False, indent=2)


# ─── HTTP helper with retry + rate limit ─────────────────────────────────────

def _get(url: str, params: dict | None = None) -> dict | None:
    """GET with retries and rate limiting. Returns parsed JSON, or None on 404 or once MAX_RETRIES attempts have failed."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            resp = requests.get(url, headers=HEADERS, params=params, timeout=20)
            if resp.status_code == 200:
                time.sleep(RATE_DELAY)
                return resp.json()
            elif resp.status_code == 404:
                return None  # Species not on Red List – not an error
            elif resp.status_code == 429:
                wait = 10 * attempt
                log.warning("Rate-limited (429). Waiting %ds…", wait)
                time.sleep(wait)
            elif resp.status_code == 401:
                raise PermissionError("IUCN API returned 401 – check IUCN_API_KEY in .env")
            else:
                log.warning("HTTP %d for %s (attempt %d)", resp.status_code, url, attempt)
                time.sleep(2 * attempt)
        except requests.RequestException as e:
            log.warning("Request error (attempt %d): %s", attempt, e)
            time.sleep(2 * attempt)
    return None
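
Because `_get` returns `None` both for a 404 and after exhausting its retries, a caller cannot tell "not on the Red List" from "the API was unavailable", and either result may end up cached. One possible refinement, sketched here rather than taken from the notebook, is to return an explicit outcome label. `get_with_outcome` is a hypothetical helper; it takes the HTTP call as a function so the logic can be exercised without network access.

```python
import time
from typing import Callable, Optional, Tuple


def get_with_outcome(
    call: Callable[[], Tuple[int, Optional[dict]]],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, Optional[dict]]:
    """Run `call` up to max_retries times and return (outcome, payload).

    outcome is "ok", "not_found" (HTTP 404) or "failed" (retries exhausted).
    """
    for attempt in range(1, max_retries + 1):
        status, payload = call()
        if status == 200:
            return "ok", payload
        if status == 404:
            return "not_found", None
        # Same waits as _get: longer pause for 429 than for other errors
        sleep(10 * attempt if status == 429 else 2 * attempt)
    return "failed", None


# Usage with a scripted fake instead of a real HTTP call
scripted = iter([(429, None), (500, None), (200, {"assessment_id": 1})])
waits: list = []
print(get_with_outcome(lambda: next(scripted), sleep=waits.append))
print(waits)
```

A caller could then cache only the `"ok"` and `"not_found"` outcomes and retry `"failed"` species on the next run.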

In [17]:
# ─── Scientific name parser ───────────────────────────────────────────────────

def parse_scientific_name(name: str) -> tuple[str, str, str | None]:
    """
    Split a scientific name into (genus, species_epithet, infra_name).

    Handles:
      - Binomials  : "Quercus robur"            → ("Quercus", "robur", None)
      - Trinomials : "Poa annua var. annua"      → ("Poa", "annua", "annua")
      - With author: "Lynx pardinus (Temminck)"  → ("Lynx", "pardinus", None)
    """
    # Strip author strings (everything in parentheses or after a capital+dot pattern)
    clean = re.sub(r"\s*\(.*?\)", "", name).strip()
    # Remove author: "Quercus robur L." → "Quercus robur"
    clean = re.sub(r"\s+[A-Z][^\s]*\.?\s*$", "", clean).strip()
    # Remove rank keywords (var., subsp., ssp., f.)
    clean = re.sub(r"\s+(var|subsp|ssp|f)\.", " ", clean, flags=re.IGNORECASE).strip()

    parts = clean.split()
    if len(parts) < 2:
        return (parts[0] if parts else name), "", None

    genus   = parts[0]
    species = parts[1]
    infra   = parts[2] if len(parts) >= 3 else None
    return genus, species, infra


# Quick test
for name in ["Lynx pardinus", "Poa annua var. annua", "Quercus robur L.", "Testudo graeca (Linnaeus, 1758)"]:
    g, s, i = parse_scientific_name(name)
    print(f"  {name:45s} → genus={g!r:20s} species={s!r:20s} infra={i!r}")

  Lynx pardinus                                 → genus='Lynx'               species='pardinus'           infra=None
  Poa annua var. annua                          → genus='Poa'                species='annua'              infra='annua'
  Quercus robur L.                              → genus='Quercus'            species='robur'              infra=None
  Testudo graeca (Linnaeus, 1758)               → genus='Testudo'            species='graeca'             infra=None


In [18]:
# ─── Core: fetch assessment for one species ───────────────────────────────────

def fetch_species_assessment(scientific_name: str) -> dict | None:
    """
    Return the full IUCN assessment dict for a species, or None if not found.

    Workflow:
      1. Check local cache → return immediately if hit
      2. GET /taxa/scientific_name  → find latest assessment_id
      3. GET /assessment/{id}       → full textual profile
      4. Write to cache
    """
    cache_key = scientific_name.lower().strip()

    # Cache hit
    if cache_key in _cache:
        return _cache[cache_key]

    genus, species_epithet, infra = parse_scientific_name(scientific_name)
    if not species_epithet:
        log.debug("Could not parse species epithet from %r – skipping", scientific_name)
        return None

    # Step 1 – find assessment ID
    params: dict = {"genus_name": genus, "species_name": species_epithet}
    if infra:
        params["infra_name"] = infra

    taxa_data = _get(f"{IUCN_BASE}/taxa/scientific_name", params=params)
    if not taxa_data:
        _cache[cache_key] = None
        return None

    assessments = taxa_data.get("assessments", [])
    if not assessments:
        _cache[cache_key] = None
        return None

    # Pick the latest global assessment (prefer scope_code=1 = Global)
    latest_assessments = [a for a in assessments if a.get("latest")]
    if not latest_assessments:
        latest_assessments = assessments  # fallback: nothing flagged latest, so consider every entry (no sorting is applied)

    # Prefer global scope over regional
    global_ass = [a for a in latest_assessments if a.get("scope", {}).get("code") == 1]
    target     = global_ass[0] if global_ass else latest_assessments[0]
    assessment_id = target["assessment_id"]

    # Step 2 – full assessment
    full = _get(f"{IUCN_BASE}/assessment/{assessment_id}")
    if not full:
        _cache[cache_key] = None
        return None

    # Persist raw JSON for audit
    safe_fname = re.sub(r"[^\w\s-]", "_", scientific_name).strip().replace(" ", "_")
    with open(OUT_DIR / "raw" / f"{safe_fname}.json", "w", encoding="utf-8") as fh:
        json.dump(full, fh, ensure_ascii=False, indent=2)

    _cache[cache_key] = full
    return full
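
The selection rule inside `fetch_species_assessment` (prefer entries flagged `latest`, then prefer global scope) can be isolated as a pure function, which makes it testable without calling the API. This is a minimal sketch that assumes the field names the notebook already relies on (`latest`, `scope.code`, `assessment_id`); `pick_assessment` and the sample records are hypothetical, and the response shape is not verified against the API documentation here.

```python
from typing import List, Optional


def pick_assessment(assessments: List[dict]) -> Optional[dict]:
    """Choose one assessment: latest first, global scope (code 1) preferred."""
    if not assessments:
        return None
    # Fall back to every entry when none is flagged as latest
    latest = [a for a in assessments if a.get("latest")] or assessments
    global_scope = [a for a in latest if (a.get("scope") or {}).get("code") == 1]
    return (global_scope or latest)[0]


# Made-up records for illustration only
sample = [
    {"assessment_id": 10, "latest": False, "scope": {"code": 1}},
    {"assessment_id": 20, "latest": True, "scope": {"code": 2}},
    {"assessment_id": 30, "latest": True, "scope": {"code": 1}},
]
print(pick_assessment(sample)["assessment_id"])
```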

## 3 · Run enrichment loop

In [19]:
# ─── Enrichment loop ─────────────────────────────────────────────────────────

species_names = species_df["_sci_name"].tolist()
total         = len(species_names)

raw_assessments: dict[str, dict | None] = {}
stats = {"found": 0, "not_found": 0, "cached": 0, "error": 0}

log.info("Starting IUCN enrichment for %d species…", total)

for name in tqdm(species_names, desc="IUCN API enrichment", unit="species"):
    cache_key = name.lower().strip()
    was_cached = cache_key in _cache

    result = fetch_species_assessment(name)
    raw_assessments[name] = result

    if result:
        if was_cached:
            stats["cached"] += 1
        else:
            stats["found"] += 1
    else:
        stats["not_found"] += 1

    # Save cache periodically (every 25 species processed) to avoid losing work.
    # Counting all outcomes avoids saving on every iteration when everything is a cache hit.
    if sum(stats.values()) % 25 == 0:
        _save_cache()

_save_cache()

print(f"\n{'─'*40}")
print(f"  Found (new API call) : {stats['found']:4d}")
print(f"  Found (cache hit)    : {stats['cached']:4d}")
print(f"  Not on Red List      : {stats['not_found']:4d}")
print(f"  Total                : {total:4d}")
print(f"{'─'*40}")

20:29:56 [INFO] Starting IUCN enrichment for 390 species…




────────────────────────────────────────
  Found (new API call) :  337
  Found (cache hit)    :    0
  Not on Red List      :   53
  Total                :  390
────────────────────────────────────────
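
Saving the cache periodically protects against losing work, but a run interrupted in the middle of a write could leave a truncated JSON file that fails to load next time. One common safeguard, shown here as a sketch and not as what the notebook does, is to write to a temporary file in the same directory and then swap it into place with `os.replace`. `save_json_atomic` is a hypothetical helper name.

```python
import json
import os
import tempfile
from pathlib import Path


def save_json_atomic(data, path: Path) -> None:
    """Write JSON to a temp file next to `path`, then rename it over `path`."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh, ensure_ascii=False, indent=2)
        os.replace(tmp_name, path)  # atomic when both paths are on one filesystem
    except BaseException:
        # Remove the partial temp file, then re-raise the original error
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "iucn_cache.json"
    save_json_atomic({"lynx pardinus": None}, target)
    print(json.loads(target.read_text(encoding="utf-8")))
```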


## 4 · Parse assessments into structured profiles

In [44]:
# ─── Parse one assessment into a flat structured dict ─────────────────────────

def _text(val) -> str | None:
    """Clean HTML tags from IUCN narrative text fields."""
    if not val:
        return None
    text = re.sub(r"<[^>]+>", " ", str(val))     # strip HTML tags
    text = re.sub(r"\s+", " ", text).strip()      # normalise whitespace
    return text if text else None


def _desc(val) -> str | None:
    """Extract string from IUCN API fields that can be str or dict like {'en': 'Terrestrial'}."""
    if not val:
        return None
    if isinstance(val, str):
        return val
    if isinstance(val, dict):
        return val.get("en") or val.get("name") or (next(iter(val.values()), None) if val else None)
    return str(val)


def parse_assessment(sci_name: str, raw: dict, source_row: pd.Series) -> dict:
    """
    Flatten a full IUCN assessment JSON into a structured profile dict.

    Fields extracted:
      Taxonomy   : scientific_name, common_names, kingdom, phylum, class, order, family, genus
      Assessment : assessment_id, assessment_year, red_list_version, scope
      Status     : iucn_category, iucn_category_description, population_trend, possibly_extinct
      Narrative  : rationale, habitat_ecology, population, range_description,
                   threats_text, conservation_text, research_needed, use_and_trade
      Structured : threats (list), conservation_actions (list), systems (list),
                   biogeographical_realms (list), habitats (list)
      Source     : source_country, source_year, is_invasive (from our silver layer)
    """
    taxon   = raw.get("taxon", {})
    cat     = raw.get("red_list_category") or {}
    trend   = raw.get("population_trend")  or {}

    # Common names – collected in API order (no language filtering is applied)
    common_names = []
    for cn in (raw.get("common_names") or []):
        if isinstance(cn, dict):
            common_names.append(cn.get("name", ""))
        else:
            common_names.append(str(cn))

    # Threats – structured list
    threats_list = []
    for t in (raw.get("threats") or []):
        if isinstance(t, dict):
            threats_list.append({
                "code":    t.get("code"),
                "title":   _desc(t.get("description")) or t.get("title") or t.get("name"),
                "stresses": [
                    x for s in (t.get("stresses") or [])
                    if isinstance(s, dict) and (x := _desc(s.get("title") or s.get("name")))
                ],
            })

    # Conservation actions – structured list (API uses description dict, not title/name)
    cons_list = []
    for ca in (raw.get("conservation_actions") or []):
        if isinstance(ca, dict):
            cons_list.append({"code": ca.get("code"), "title": _desc(ca.get("description")) or ca.get("title") or ca.get("name")})

    # Systems, realms, habitats (description/name can be dict {'en': '...'})
    systems  = [x for s in (raw.get("systems")  or []) if isinstance(s, dict) and (x := _desc(s.get("description") or s.get("name")))]
    realms   = [x for r in (raw.get("biogeographical_realms") or []) if isinstance(r, dict) and (x := _desc(r.get("description") or r.get("name")))]
    habitats = [x for h in (raw.get("habitats") or []) if isinstance(h, dict) and (x := _desc(h.get("description") or h.get("name")))]

    # Assessment year from date string
    date_str = raw.get("assessment_date") or raw.get("year_published") or ""
    year_match = re.search(r"(\d{4})", str(date_str))
    assessment_year = int(year_match.group(1)) if year_match else None

    profile = {
        # ── Taxonomy ────────────────────────────────────────────────────────
        "scientific_name":    sci_name,
        "common_names":       common_names[:5],   # top 5
        "kingdom":            taxon.get("kingdom_name"),
        "phylum":             taxon.get("phylum_name"),
        "class":              taxon.get("class_name"),
        "order":              taxon.get("order_name"),
        "family":             taxon.get("family_name"),
        "genus":              taxon.get("genus_name"),
        # ── Assessment metadata ──────────────────────────────────────────────
        "assessment_id":      raw.get("assessment_id"),
        "assessment_year":    assessment_year,
        "red_list_version":   raw.get("red_list_version"),
        "scope":              (raw.get("scope") or {}).get("description"),
        # ── Conservation status ──────────────────────────────────────────────
        "iucn_category":              cat.get("code"),
        "iucn_category_description":  _desc(cat.get("description")),
        "population_trend":           _desc(trend.get("description")) or trend.get("code"),
        "possibly_extinct":           raw.get("possibly_extinct", False),
        "possibly_extinct_in_wild":   raw.get("possibly_extinct_in_the_wild", False),
        # ── Narrative text (key fields for AI summarisation) ──────────────────
        # IUCN API v4 stores narrative fields under documentation
        "rationale":          _text(raw.get("documentation", {}).get("rationale") or raw.get("rationale")),
        "habitat_ecology":    _text(raw.get("documentation", {}).get("habitats")) or _text(raw.get("habitat")) or _text(raw.get("ecology")),
        "population":         _text(raw.get("documentation", {}).get("population")) or _text(raw.get("population")),
        "range_description":  _text(raw.get("documentation", {}).get("range")) or _text(raw.get("range")),
        "threats_text":       _text(raw.get("documentation", {}).get("threats")) or _text(raw.get("threats_text")) or _text(raw.get("threat")),
        "conservation_text":  _text(raw.get("documentation", {}).get("measures")) or _text(raw.get("conservation_actions_text")) or _text(raw.get("conservation_measures")),
        "research_needed":    _text(raw.get("documentation", {}).get("research_needed")) or _text(raw.get("research_needed")),
        "use_and_trade":      _text(raw.get("documentation", {}).get("use_trade")) or _text(raw.get("use_and_trade")),
        # ── Structured lists ─────────────────────────────────────────────────
        "threats":            threats_list,
        "conservation_actions": cons_list,
        "systems":            systems,
        "biogeographical_realms": realms,
        "habitats":           habitats,
        # ── Source provenance ────────────────────────────────────────────────
        "source_country":     source_row.get("source_country"),
        "source_year":        source_row.get("source_year"),
        "is_invasive":        bool(source_row.get("_is_invasive", False)),
        "gbif_iucn_cat":      source_row.get("_iucn_cat_norm"),
    }
    return profile
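
The `_text` helper strips HTML tags but leaves character entities such as `&amp;` untouched. If the narrative fields contain entities (an assumption, not something checked against real responses here), a variant that also decodes them could look like the sketch below. `clean_html` is a hypothetical name.

```python
import html
import re
from typing import Optional


def clean_html(val) -> Optional[str]:
    """Strip tags, decode HTML entities, and collapse whitespace."""
    if not val:
        return None
    text = re.sub(r"<[^>]+>", " ", str(val))   # strip HTML tags
    text = html.unescape(text)                 # decode entities such as &amp; and &nbsp;
    text = re.sub(r"\s+", " ", text).strip()   # \s also matches the non-breaking space
    return text or None


print(clean_html("<p>Found in&nbsp;Spain &amp; Portugal</p>"))
```

Decoding happens after tag removal so that an escaped `&lt;` in the text is not mistaken for markup.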

In [45]:
# ─── Build profiles table ─────────────────────────────────────────────────────

profiles: list[dict] = []

for _, row in species_df.iterrows():
    name = row["_sci_name"]
    raw  = raw_assessments.get(name)
    if raw is None:
        # Not on Red List – create minimal record so we know we tried
        profiles.append({
            "scientific_name":  name,
            "iucn_category":    None,
            "source_country":   row.get("source_country"),
            "source_year":      row.get("source_year"),
            "is_invasive":      bool(row.get("_is_invasive", False)),
            "gbif_iucn_cat":    row.get("_iucn_cat_norm"),
            "_iucn_found":      False,
        })
    else:
        profile = parse_assessment(name, raw, row)
        profile["_iucn_found"] = True
        profiles.append(profile)

profiles_found = sum(1 for p in profiles if p.get("_iucn_found"))
log.info("Parsed %d profiles  (%d with IUCN data, %d not found)",
         len(profiles), profiles_found, len(profiles) - profiles_found)

21:21:58 [INFO] Parsed 390 profiles  (337 with IUCN data, 53 not found)


## 5 · Save outputs

In [46]:
# ─── Save AI-ready JSON (array of dicts) ──────────────────────────────────────

# Keep only species with IUCN data for the AI pipeline
profiles_for_ai = [p for p in profiles if p.get("_iucn_found")]

json_path = OUT_DIR / "species_profiles.json"
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(profiles_for_ai, f, ensure_ascii=False, indent=2, default=str)

log.info("Saved AI-ready JSON: %s  (%d species)", json_path, len(profiles_for_ai))

21:22:02 [INFO] Saved AI-ready JSON: ../iucn_enrichment/species_profiles.json  (337 species)


In [47]:
# ─── Save structured Parquet ──────────────────────────────────────────────────
# Lists (threats, conservation_actions, etc.) are stored as JSON strings in Parquet
# for broad compatibility.

df_profiles = pd.DataFrame(profiles)

# Serialise list columns to JSON strings
list_cols = ["common_names", "threats", "conservation_actions",
             "systems", "biogeographical_realms", "habitats"]
for col in list_cols:
    if col in df_profiles.columns:
        df_profiles[col] = df_profiles[col].apply(
            lambda v: json.dumps(v, ensure_ascii=False) if isinstance(v, list) else v
        )

parquet_path = OUT_DIR / "species_profiles.parquet"
df_profiles.to_parquet(parquet_path, index=False, engine="pyarrow")

log.info("Saved Parquet: %s  (%d rows × %d cols)",
         parquet_path, len(df_profiles), len(df_profiles.columns))

21:22:05 [INFO] Saved Parquet: ../iucn_enrichment/species_profiles.parquet  (390 rows × 35 cols)
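
Since the list columns are written to Parquet as JSON strings, downstream readers need to decode them before use. A minimal sketch on a single record follows; `decode_list_columns` is a hypothetical helper and the sample row is made up. Values that are not strings (for example missing values on species without an assessment) are left untouched.

```python
import json

LIST_COLS = ["common_names", "threats", "conservation_actions",
             "systems", "biogeographical_realms", "habitats"]


def decode_list_columns(record: dict, list_cols=LIST_COLS) -> dict:
    """Return a copy of `record` with JSON-string list columns parsed back to lists."""
    out = dict(record)
    for col in list_cols:
        val = out.get(col)
        if isinstance(val, str):
            out[col] = json.loads(val)
    return out


row = {"scientific_name": "Lynx pardinus", "systems": '["Terrestrial"]', "habitats": None}
print(decode_list_columns(row))
```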


In [48]:
# ─── Write to S3 (silver layer) ───────────────────────────────────────────────────
# Final data must be available on S3 in the silver layer for downstream consumers.

import pyarrow.parquet as pq

# Partition by the first (country, year) we processed – represents this run's source.
# Note: if several partitions were read, all profiles are still written under this one.
_part = partitions[0] if partitions else {"country": "unknown", "year": 0}
s3_partition = f"country={_part['country']}/year={_part['year']}"
s3_base = f"{S3_BUCKET}/{SILVER_IUCN_PREFIX}/{s3_partition}"

# Parquet: use pq.write_to_dataset with s3fs (same pattern as gbif_silver_to_gold)
df_out = df_profiles.copy()
df_out["country"] = _part["country"]
df_out["year"]    = _part["year"]

table_out = pa.Table.from_pandas(df_out, preserve_index=False)
pq.write_to_dataset(
    table_out,
    root_path=f"s3://{s3_base}",
    filesystem=fs,
    compression="snappy",
    existing_data_behavior="delete_matching",
)

# JSON (AI-ready): upload as single file
s3_json_path = f"{s3_base}/species_profiles.json"
with fs.open(s3_json_path, "w", encoding="utf-8") as fh:  # match the local UTF-8 JSON file
    json.dump(profiles_for_ai, fh, ensure_ascii=False, indent=2, default=str)

log.info("Written to S3:")
log.info("  Parquet: s3://%s/", s3_base)
log.info("  JSON:   s3://%s", s3_json_path)

21:22:10 [INFO] Written to S3:
21:22:10 [INFO]   Parquet: s3://ie-datalake/silver/iucn_species_profiles/country=ES/year=2024/
21:22:10 [INFO]   JSON:   s3://ie-datalake/silver/iucn_species_profiles/country=ES/year=2024/species_profiles.json
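
The cell above writes everything under the first `(country, year)` partition only. If a run ever covers several partitions, one option is to derive a Hive-style prefix for each of them. This is a sketch under that assumption: the helper `partition_prefix` is hypothetical, the bucket and prefix values are copied from the log output of this run, and the second partition is purely illustrative.

```python
def partition_prefix(bucket, prefix, partition):
    """Build a Hive-style S3 key prefix such as bucket/prefix/country=ES/year=2024."""
    country = partition.get("country", "unknown")
    year = partition.get("year", 0)
    return f"{bucket}/{prefix}/country={country}/year={year}"

# First entry mirrors this run; the second entry is an illustrative placeholder.
example_partitions = [{"country": "ES", "year": 2024}, {"country": "PT", "year": 2024}]
for part in example_partitions:
    print("s3://" + partition_prefix("ie-datalake", "silver/iucn_species_profiles", part))
```

Each prefix could then be passed as `root_path` to a separate `pq.write_to_dataset` call on the rows that belong to that partition.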


In [49]:
# ─── Output summary ───────────────────────────────────────────────────────────

found_df = df_profiles[df_profiles["_iucn_found"] == True]

print(f"\n{'='*55}")
print("ENRICHMENT COMPLETE")
print(f"{'='*55}")
print(f"  Total species enriched   : {len(df_profiles):5d}")
print(f"  With IUCN assessment     : {len(found_df):5d}")
print(f"  Not on Red List          : {len(df_profiles)-len(found_df):5d}")
print(f"\n  Files written:")
print(f"    {json_path}")
print(f"    {parquet_path}")
print(f"    {OUT_DIR}/raw/<species>.json  ({len(list((OUT_DIR/'raw').glob('*.json')))} files)")
print(f"    {CACHE_PATH}")

if len(found_df) > 0:
    print(f"\n  IUCN category distribution:")
    for cat, cnt in found_df["iucn_category"].value_counts().items():
        print(f"    {str(cat):5s}: {cnt}")

    print(f"\n  Population trend distribution:")
    for trend, cnt in found_df["population_trend"].value_counts().items():
        print(f"    {str(trend):20s}: {cnt}")


ENRICHMENT COMPLETE
  Total species enriched   :   390
  With IUCN assessment     :   337
  Not on Red List          :    53

  Files written:
    ../iucn_enrichment/species_profiles.json
    ../iucn_enrichment/species_profiles.parquet
    ../iucn_enrichment/raw/<species>.json  (337 files)
    ../iucn_enrichment/iucn_cache.json

  IUCN category distribution:
    VU   : 136
    EN   : 105
    CR   : 42
    LC   : 30
    NT   : 12
    NA   : 6
    DD   : 3
    NE   : 2
    RE   : 1

  Population trend distribution:
    Decreasing          : 235
    Unknown             : 51
    Stable              : 33
    Increasing          : 15
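
`value_counts()` orders the categories by frequency, so threat levels appear mixed (for example `LC` before `NT`). A sketch of listing them by severity instead follows. The order in `SEVERITY_ORDER` is an assumed display order (global IUCN categories plus the regional codes `RE` and `NA`), and `order_by_severity` is a hypothetical helper.

```python
from collections import Counter

# Assumed display order: most to least severe, with non-evaluative codes last.
SEVERITY_ORDER = ["EX", "EW", "RE", "CR", "EN", "VU", "NT", "LC", "DD", "NA", "NE"]

def order_by_severity(categories):
    """Count category codes and return (code, count) pairs in severity order."""
    counts = Counter(categories)
    rank = {code: i for i, code in enumerate(SEVERITY_ORDER)}
    # Codes missing from SEVERITY_ORDER sort after all known ones, alphabetically.
    return sorted(counts.items(), key=lambda kv: (rank.get(kv[0], len(rank)), str(kv[0])))

sample_codes = ["VU", "CR", "EN", "VU", "LC", "CR", "VU"]  # toy input, not real results
for code, count in order_by_severity(sample_codes):
    print(f"{code:5s}: {count}")
```

In the notebook, the input would be something like `found_df["iucn_category"].dropna().tolist()`.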


## 6 · Preview: sample profiles ready for AI summarisation

In [50]:
# ─── Show one full profile as it would be passed to an LLM ───────────────────

def _join_vals(items, default="–"):
    """Join list items to string; handle dicts like {'en': 'x'} from IUCN API."""
    items = items or [default]
    parts = []
    for x in items:
        if isinstance(x, str):
            parts.append(x)
        elif isinstance(x, dict):
            parts.append(x.get("en") or x.get("name") or "")
        else:
            parts.append(str(x))
    return ", ".join(p for p in parts if p) or default

sample = next((p for p in profiles_for_ai if p.get("rationale")), None)

if sample:
    print(f"{'='*60}")
    print(f"SAMPLE PROFILE – {sample['scientific_name']}")
    print(f"{'='*60}")
    print(f"Common names    : {_join_vals(sample.get('common_names'))}")
    desc = sample.get('iucn_category_description')
    print(f"IUCN status     : {sample.get('iucn_category')} – {desc.get('en', desc) if isinstance(desc, dict) else desc}")
    trend = sample.get('population_trend')
    print(f"Population trend: {trend.get('en', trend) if isinstance(trend, dict) else trend}")
    print(f"Invasive (GBIF) : {sample.get('is_invasive')}")
    print(f"Systems         : {_join_vals(sample.get('systems'))}")
    print(f"Realms          : {_join_vals(sample.get('biogeographical_realms'))}")
    print()
    print("── RATIONALE ─────────────────────────────────────────")
    print(sample.get("rationale") or "(not available)")
    print()
    print("── HABITAT & ECOLOGY ─────────────────────────────────")
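    habitat = sample.get("habitat_ecology") or "(not available)"
    # Append an ellipsis only when the text was actually truncated
    print(habitat[:800] + (" …" if len(habitat) > 800 else ""))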
    print()
    print("── THREATS ───────────────────────────────────────────")
    for t in (sample.get("threats") or [])[:5]:
        print(f"  [{t.get('code')}] {t.get('title')}")
    print()
    print("── CONSERVATION ACTIONS ──────────────────────────────")
    for ca in (sample.get("conservation_actions") or [])[:5]:
        print(f"  [{ca.get('code')}] {ca.get('title')}")
else:
    print("No profiles with rationale text found (check API responses in iucn_enrichment/raw/).")

SAMPLE PROFILE – Puffinus mauretanicus
Common names    : –
IUCN status     : CR – Critically Endangered
Population trend: Decreasing
Invasive (GBIF) : False
Systems         : Terrestrial, Marine
Realms          : Palearctic

── RATIONALE ─────────────────────────────────────────
European regional assessment: Critically Endangered (CR) EU28 regional assessment:&#160; Critically Endangered (CR) This species is endemic to Europe and the EU28 as a breeder. It has an extremely small range, which meets the thresholds for Vulnerable under the range size criteria (criteria B and D2). The population size is small, and, combined with a continuing decline, meets the thresholds for Vulnerable under the population size criterion C. The population trend appears to be decreasing at a rate which is predicted to become extremely rapid and meets the thresholds for Critically Endangered (CR) under the population size reduction criterion A. It is therefore assessed as such within both Europe and the EU28.
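
The rationale printed above still contains the HTML entity `&#160;` (a non-breaking space) as delivered by the API. Before passing narrative fields to an LLM, it may be worth decoding such entities and normalising whitespace. A minimal sketch, assuming the fields are plain strings or `None`; `clean_narrative` is a hypothetical helper, not part of this notebook.

```python
import html
import re

def clean_narrative(text):
    """Decode HTML entities and collapse whitespace in a narrative field."""
    if not text:
        return text
    text = html.unescape(text)                # "&#160;" -> U+00A0, "&amp;" -> "&"
    return re.sub(r"\s+", " ", text).strip()  # \s also matches U+00A0 in str patterns

print(clean_narrative("EU28 regional assessment:&#160; Critically Endangered (CR)"))
# → EU28 regional assessment: Critically Endangered (CR)
```

The helper could be applied to fields such as `rationale` and `habitat_ecology` when the profiles are built, so the JSON and Parquet outputs both carry the cleaned text.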

In [51]:
# ─── Structured table preview ─────────────────────────────────────────────────

display_cols = [
    "scientific_name", "iucn_category", "iucn_category_description",
    "population_trend", "is_invasive", "possibly_extinct",
    "assessment_year", "source_country",
]
available_display = [c for c in display_cols if c in df_profiles.columns]

display(
    found_df[available_display]
    .sort_values("iucn_category", na_position="last")
    .head(30)
    .style.map(  # Styler.map replaces the deprecated Styler.applymap (pandas >= 2.1)
        lambda v: "background-color: #ffcccc" if v == "CR" else
                  "background-color: #ffd9a0" if v == "EN" else
                  "background-color: #fffacc" if v == "VU" else "",
        subset=["iucn_category"] if "iucn_category" in available_display else []
    )
)


|     | scientific_name | iucn_category | iucn_category_description | population_trend | is_invasive | possibly_extinct | assessment_year | source_country |
|-----|-----------------|---------------|---------------------------|------------------|-------------|------------------|-----------------|----------------|
| 0   | Puffinus mauretanicus | CR | Critically Endangered | Decreasing | False | False | 2020.0 | ES |
| 207 | Margaritifera margaritifera | CR | Critically Endangered | Decreasing | False | False | 2022.0 | ES |
| 36  | Limonium sventenii | CR | Critically Endangered | Unknown | False | False | 2011.0 | ES |
| 37  | Antirrhinum charidemi | CR | Critically Endangered | Decreasing | False | False | 2011.0 | ES |
| 39  | Acrostira bellamyi | CR | Critically Endangered | Decreasing | False | False | 2016.0 | ES |
| 41  | Gyrocaryum oppositifolium | CR | Critically Endangered | Decreasing | False | False | 2006.0 | ES |
| 42  | Carex furva | CR | Critically Endangered | Decreasing | False | False | 2017.0 | ES |
| 43  | Calotriton arnoldi | CR | Critically Endangered | Decreasing | False | False | 2021.0 | ES |
| 44  | Theodoxus baeticus | CR | Critically Endangered | Decreasing | False | False | 2010.0 | ES |
| 45  | Galeorhinus galeus | CR | Critically Endangered | Decreasing | False | False | 2020.0 | ES |
