# GBIF Silver → Gold: `gbif_cell_metrics`

Reads H3-indexed occurrences from the **silver** layer and aggregates them into
a **gold** metrics table partitioned by `country / year / h3_resolution`.

| Layer | S3 path |
|-------|---------|
| Silver in | `s3://ie-datalake/silver/gbif/country=XX/year=YYYY/` |
| Gold out  | `s3://ie-datalake/gold/gbif_cell_metrics/country=XX/year=YYYY/h3_resolution=N/` |

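As a minimal sketch of the Hive-style layout in the table above, the helpers below build both prefixes. The function names `silver_path` and `gold_path` are hypothetical and are not part of the pipeline code; the bucket and prefixes are the ones listed in the table.

```python
# Hypothetical helpers (not part of the pipeline) that build the
# Hive-style partition prefixes shown in the table above.
S3_BUCKET = "ie-datalake"
SILVER_PREFIX = "silver/gbif"
GOLD_PREFIX = "gold/gbif_cell_metrics"


def silver_path(country: str, year: int) -> str:
    """Prefix of one silver (country, year) partition."""
    return f"s3://{S3_BUCKET}/{SILVER_PREFIX}/country={country}/year={year}/"


def gold_path(country: str, year: int, h3_resolution: int) -> str:
    """Prefix of one gold (country, year, h3_resolution) slice."""
    return (
        f"s3://{S3_BUCKET}/{GOLD_PREFIX}"
        f"/country={country}/year={year}/h3_resolution={h3_resolution}/"
    )


print(silver_path("ES", 2024))
print(gold_path("ES", 2024, 8))
```

Each silver partition fans out into one gold slice per H3 resolution, so a single `(country, year)` read feeds several writes.
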
## Metrics per `(country, year, h3_resolution, h3_index)`

### Observation
| Column | Description |
|--------|-------------|
| `observation_count` | Total occurrence records |
| `species_richness_cell` | Distinct species, identified by the first available of speciesKey, taxonKey, species, scientificName |
| `unique_datasets` | Distinct datasetKey |
| `avg_coordinate_uncertainty_m` | Mean coordinateUncertaintyInMeters |
| `pct_uncertainty_gt_10km` | Share of records with uncertainty > 10 000 m |

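The two uncertainty columns treat missing values differently, which a small example makes concrete. In this sketch (pure Python, with a hypothetical helper `uncertainty_metrics` and made-up values), the mean skips records without an uncertainty, while the share above 10 km is taken over all records of the cell, so a missing value counts as "not above 10 km". It is meant to mirror the behaviour of the pandas code in this notebook, not to replace it.

```python
from typing import Optional


def uncertainty_metrics(values: list[Optional[float]]) -> tuple[Optional[float], float]:
    """Return (avg_coordinate_uncertainty_m, pct_uncertainty_gt_10km) for one cell.

    Assumes the cell has at least one record (values is non-empty).
    """
    known = [v for v in values if v is not None]
    # Mean over records that actually carry an uncertainty value
    avg = sum(known) / len(known) if known else None
    # Share over ALL records: a missing value is never "> 10 km"
    share_gt_10km = sum(1 for v in known if v > 10_000) / len(values)
    return avg, share_gt_10km


cell = [50.0, None, 25_000.0, 150.0]  # hypothetical uncertainties in metres
avg, pct = uncertainty_metrics(cell)
print(avg, pct)  # 8400.0 0.25
```

A cell whose records all lack an uncertainty therefore gets an empty average but a share of 0.0.
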
### IUCN / Threat
| Column | Description |
|--------|-------------|
| `n_assessed_species` | Distinct species with any IUCN category |
| `n_sp_cr / en / vu / nt / lc / dd / ne` | Distinct species per category; a column is written only when the partition contains at least one record of that category |
| `n_threatened_species` | Distinct CR + EN + VU species |
| `threat_score_weighted` | Σ weight(iucn) per **distinct** species; CR=5, EN=4, VU=3, NT=2 |

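To make the "distinct species" rule concrete, here is a minimal sketch of `threat_score_weighted` for a single cell. The helper name `threat_score` and the records are hypothetical; the weights are those in the table. If one species appears with several categories, the sketch keeps the most severe one, which is the approach taken by the pipeline code in this notebook.

```python
IUCN_WEIGHTS = {"CR": 5, "EN": 4, "VU": 3, "NT": 2}
# Severity order, used only to pick one category per species
SEVERITY = {"CR": 5, "EN": 4, "VU": 3, "NT": 2, "LC": 1, "DD": 0, "NE": 0}


def threat_score(records: list[tuple[int, str]]) -> int:
    """records: (species_id, iucn_category) pairs for one cell."""
    worst: dict[int, str] = {}
    for species_id, cat in records:
        current = worst.get(species_id)
        # Unknown categories rank below every known one
        if current is None or SEVERITY.get(cat, -1) > SEVERITY.get(current, -1):
            worst[species_id] = cat
    # Each distinct species contributes its weight exactly once
    return sum(IUCN_WEIGHTS.get(cat, 0) for cat in worst.values())


records = [(101, "VU"), (101, "VU"), (101, "VU"), (202, "CR"), (303, "LC")]
print(threat_score(records))  # 3 (VU) + 5 (CR) + 0 (LC) = 8
```

Three records of the same VU species add 3 to the score, not 9, so heavily sampled cells are not inflated.
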
### Diversity
| Column | Description |
|--------|-------------|
| `shannon_H` | Shannon-Wiener entropy (numerically stable) |
| `simpson_1_minus_D` | Simpson diversity index 1 − D |

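Both indices are functions of the species proportions p_i within a cell: Shannon is H = −Σ p_i ln p_i and Simpson is 1 − Σ p_i². The sketch below computes them for one cell from a species-count table; the helper `diversity` and the input are hypothetical, and the pipeline itself does the same arithmetic with vectorized pandas operations.

```python
import math
from collections import Counter


def diversity(species_ids: list[int]) -> tuple[float, float]:
    """Return (shannon_H, simpson_1_minus_D) for the records of one cell."""
    counts = Counter(species_ids)          # species-count table
    total = sum(counts.values())
    props = [n / total for n in counts.values()]
    shannon = -sum(p * math.log(p) for p in props)  # natural log
    simpson = 1.0 - sum(p * p for p in props)
    return shannon, simpson


h, s = diversity([1, 2, 3])  # three records, three different species
print(round(h, 6), round(s, 6))  # 1.098612 0.666667
```

A cell with a single species has p = 1, so both indices are zero; with S equally abundant species, H reaches its maximum ln S.
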
### Data Quality Index (0–1)
| Column | Description |
|--------|-------------|
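| `dqi` | Mean of the available components: species-id completeness, uncertainty quality (1 − `pct_uncertainty_gt_10km`), IUCN coverage. Coordinate completeness is 1.0 after silver cleaning and is not a separate term |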

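A short sketch of how the composite behaves, assuming the DQI is the plain mean of whichever components can be computed. The helper `dqi` and the cell below (141 records, all identified to species, none above 10 km uncertainty, 2 carrying an IUCN category) are hypothetical.

```python
from typing import Optional


def dqi(components: list[Optional[float]]) -> Optional[float]:
    """Mean of the components that are available (None = not computable)."""
    present = [c for c in components if c is not None]
    return sum(present) / len(present) if present else None


n_records = 141
c1 = 141 / n_records        # species-id completeness
c2 = 1.0 - 0 / n_records    # uncertainty quality: no record above 10 km
c3 = 2 / n_records          # IUCN coverage: 2 records carry a category
print(round(dqi([c1, c2, c3]), 6))  # 0.671395
```

Because most occurrence records have no IUCN category, the third component is often close to 0, which pulls many cells toward a DQI of about 2/3.
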
## Memory strategy
- Only **required columns** are loaded (pyarrow projection pushdown).
- One `(country, year)` partition is held in RAM at a time.
- Diversity metrics are computed on the **species-count table** (groupby result),
  not on the full record-level DataFrame.

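The projection in the first bullet depends on matching the wanted column names against whatever the silver schema actually contains. The sketch below shows one way to do that matching, ignoring case and underscores; `select_projection` and the example schema are hypothetical, and the resulting list is what would be passed as the `columns` argument of the pyarrow scanner.

```python
CANDIDATE_COLS = ["h3_8", "speciesKey", "datasetKey", "iucn_cat", "is_invasive_any"]


def _norm(name: str) -> str:
    """Normalise a column name: lower-case, underscores removed."""
    return name.lower().replace("_", "")


def select_projection(schema_names: list[str], wanted: list[str]) -> list[str]:
    """Return the schema's own spelling of every wanted column that exists."""
    by_norm = {_norm(c): c for c in schema_names}
    return [by_norm[_norm(w)] for w in wanted if _norm(w) in by_norm]


# Hypothetical silver schema with lower-cased GBIF names
schema = ["gbifid", "specieskey", "datasetkey", "h3_8", "iucn_cat", "eventdate"]
print(select_projection(schema, CANDIDATE_COLS))
# ['h3_8', 'specieskey', 'datasetkey', 'iucn_cat']
```

Columns that are wanted but absent (here `is_invasive_any`) are dropped silently, so downstream code has to check for each column before using it.
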
## Requirements
```
pip install pyarrow s3fs pandas numpy boto3
```


In [3]:
# ─────────────────────────────────────────────────────────────────────────────
# CONFIGURATION
# ─────────────────────────────────────────────────────────────────────────────

COUNTRIES: list[str] = ["ES"]
YEAR_START: int = 2024
YEAR_END:   int = 2024

# H3 resolutions to aggregate (must already be present in silver)
H3_RESOLUTIONS: list[int] = [9, 8, 7, 6]

S3_BUCKET:     str = "ie-datalake"
SILVER_PREFIX: str = "silver/gbif"
GOLD_PREFIX:   str = "gold/gbif_cell_metrics"
AWS_PROFILE:   str = "486717354268_PowerUserAccess"

PARQUET_COMPRESSION:    str = "snappy"
PARQUET_ROW_GROUP_SIZE: int = 250_000

# IUCN categories to pivot (only columns with ≥1 species will be written)
IUCN_ALL_CATS: list[str] = ["CR", "EN", "VU", "NT", "LC", "DD", "NE"]
IUCN_WEIGHTS:  dict[str, int] = {"CR": 5, "EN": 4, "VU": 3, "NT": 2}

In [4]:
# ─────────────────────────────────────────────────────────────────────────────
# SILVER → GOLD PIPELINE
# ─────────────────────────────────────────────────────────────────────────────

from __future__ import annotations

import logging
import os
import time
from typing import Optional

import boto3
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.fs as pafs
import pyarrow.parquet as pq
import s3fs

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%H:%M:%S",
    force=True,
)
log = logging.getLogger("gbif_gold")

# s3fs – used only for writes (pq.write_to_dataset)
fs = s3fs.S3FileSystem(profile=AWS_PROFILE)

# PyArrow native S3FileSystem – used for reads.
# The C++ S3 client pre-fetches row-groups in parallel and uses connection
# pooling, which is typically much faster than s3fs for reading Parquet files
# (the exact speed-up depends on file layout and network, so benchmark it).
_boto_session = boto3.Session(profile_name=AWS_PROFILE)
_creds = _boto_session.get_credentials().get_frozen_credentials()
# _region = _boto_session.region_name or "eu-west-2"
_region = "eu-west-2"
fs_read = pafs.S3FileSystem(
    access_key=_creds.access_key,
    secret_key=_creds.secret_key,
    session_token=_creds.token,
    region=_region,
)

# Maximise I/O and CPU parallelism for Arrow operations
pa.set_io_thread_count(min(16, (os.cpu_count() or 4) * 2))
pa.set_cpu_count(os.cpu_count() or 4)

log.info(
    "S3 ready (profile=%s, region=%s, io_threads=%d)",
    AWS_PROFILE, _region, pa.io_thread_count(),
)


# ══════════════════════════════════════════════════════════════════════════════
# UTILITIES
# ══════════════════════════════════════════════════════════════════════════════

def _find_col(df: pd.DataFrame, name: str) -> Optional[str]:
    """Case-insensitive column lookup, normalising underscores."""
    norm = name.lower().replace("_", "")
    for col in df.columns:
        if col.lower().replace("_", "") == norm:
            return col
    return None


def _resolve_species_col(df: pd.DataFrame) -> Optional[str]:
    """Return the best available species-identifier column."""
    for candidate in ("speciesKey", "taxonKey", "species", "scientificName"):
        col = _find_col(df, candidate)
        if col:
            return col
    return None


def _resolve_iucn_col(df: pd.DataFrame) -> Optional[str]:
    """Return the IUCN category column (from bronze enrichment or raw GBIF)."""
    for candidate in ("iucn_cat", "iucnRedListCategory"):
        col = _find_col(df, candidate)
        if col and df[col].notna().any():
            return col
    return None


# ══════════════════════════════════════════════════════════════════════════════
# 1. READ INPUT (with column projection)
# ══════════════════════════════════════════════════════════════════════════════

# Columns we want to read from silver (subset – much smaller than full schema)
_CANDIDATE_COLS: list[str] = [
    # H3 spatial index
    "h3_9", "h3_8", "h3_7", "h3_6",
    # Species identification
    "speciesKey", "taxonKey", "species", "scientificName",
    # Dataset
    "datasetKey",
    # Coordinate quality
    "coordinateUncertaintyInMeters",
    # IUCN
    "iucn_cat", "iucnRedListCategory",
    # Invasive flags (for future DQI extension)
    "is_invasive_any",
    # Partition keys (may already be present as columns)
    "country", "year",
]


def read_input(country: str, year: int) -> pd.DataFrame:
    """
    Read one silver partition from S3 with column projection.

    Uses pyarrow's native C++ S3FileSystem (fs_read) which pre-fetches
    row-groups in parallel – significantly faster than s3fs for large files.
    Only columns relevant to metric computation are loaded.
    """
    # Native pyarrow paths don't have an "s3://" prefix
    native_path = f"{S3_BUCKET}/{SILVER_PREFIX}/country={country}/year={year}"
    log_path    = f"s3://{native_path}"

    # Existence check via pyarrow native (avoids s3fs recursive listing)
    info = fs_read.get_file_info(native_path)
    if info.type == pafs.FileType.NotFound:
        raise FileNotFoundError(
            f"Silver partition not found: {log_path}. "
            "Run gbif_bronze_to_silver.ipynb first."
        )

    log.info("Reading silver: %s", log_path)
    t0 = time.time()

    dataset = ds.dataset(native_path, filesystem=fs_read, format="parquet")

    # Intersect desired columns with what actually exists in the schema
    available = set(dataset.schema.names)
    avail_lower = {c.lower().replace("_", ""): c for c in available}
    project = []
    for want in _CANDIDATE_COLS:
        key = want.lower().replace("_", "")
        if key in avail_lower:
            project.append(avail_lower[key])

    log.info("Projecting %d / %d columns – starting download…", len(project), len(available))

    # use_threads=True lets pyarrow read columns in parallel across row-groups
    table = dataset.scanner(columns=project, use_threads=True).to_table()
    df = table.to_pandas()

    elapsed = time.time() - t0
    log.info(
        "Loaded %d rows, %d columns in %.1fs (%.0f MB RAM)",
        len(df), len(df.columns), elapsed,
        df.memory_usage(deep=True).sum() / 1e6,
    )

    # Ensure partition key columns exist
    if "country" not in df.columns:
        df["country"] = country
    if "year" not in df.columns:
        df["year"] = int(year)

    return df


# ══════════════════════════════════════════════════════════════════════════════
# 2. COMPUTE METRICS PER H3 RESOLUTION
# ══════════════════════════════════════════════════════════════════════════════

def _diversity_vectorized(sp_counts: pd.Series) -> pd.DataFrame:
    """
    Compute Shannon-Wiener H and Simpson 1-D **without any Python-level loops**.

    sp_counts: Series indexed by (h3_cell, species_id) with observation counts.

    Strategy (fully vectorized):
      1. total_per_cell = groupby(h3).sum()  → broadcast via transform
      2. p_i = count / total  (element-wise)
      3. shannon_H = -Σ p_i * log(p_i)  per cell  (groupby.sum on the product)
      4. simpson   = 1 - Σ p_i²          per cell  (groupby.sum on p²)

    Eliminates the O(n_cells) Python-function-call overhead of groupby.apply().
    """
    h3_level = sp_counts.index.names[0]

    # Proportions: broadcast total back to each (cell, species) row
    total = sp_counts.groupby(level=h3_level).transform("sum")
    p = sp_counts / total                           # p_i for every (cell, species)

    # Shannon: -Σ p·log(p)  – clip to avoid log(0), zeros contribute 0
    log_p = np.log(p.clip(lower=1e-300))
    shannon = -(p * log_p).groupby(level=h3_level).sum().rename("shannon_H")

    # Simpson: 1 - Σ p²
    simpson = (1.0 - (p ** 2).groupby(level=h3_level).sum()).rename("simpson_1_minus_D")

    return pd.concat([shannon, simpson], axis=1).reset_index()


def compute_metrics(
    df: pd.DataFrame,
    country: str,
    year: int,
    h3_resolution: int,
) -> pd.DataFrame:
    """
    Aggregate all metrics for one (country, year, h3_resolution).

    Returns a DataFrame with one row per H3 cell and all metric columns.
    """
    h3_col    = f"h3_{h3_resolution}"
    sk_col    = _resolve_species_col(df)
    ds_col    = _find_col(df, "datasetKey")
    unc_col   = _find_col(df, "coordinateUncertaintyInMeters")
    iucn_col  = _resolve_iucn_col(df)

    if h3_col not in df.columns:
        raise ValueError(f"Column {h3_col!r} not found. Available: {list(df.columns)}")

    log.info(
        "  res=%d | rows=%d | sk=%s | unc=%s | iucn=%s",
        h3_resolution, len(df), sk_col, unc_col, iucn_col,
    )

    # ── 2a. Base counts ───────────────────────────────────────────────────────
    g = df.groupby(h3_col)
    agg = g.size().rename("observation_count").reset_index()

    # ── 2b. Species richness ─────────────────────────────────────────────────
    if sk_col:
        sr = g[sk_col].nunique().rename("species_richness_cell")
        agg = agg.merge(sr.reset_index(), on=h3_col, how="left")
    else:
        agg["species_richness_cell"] = pd.NA

    # ── 2c. Unique datasets ───────────────────────────────────────────────────
    if ds_col:
        ud = g[ds_col].nunique().rename("unique_datasets")
        agg = agg.merge(ud.reset_index(), on=h3_col, how="left")
    else:
        agg["unique_datasets"] = pd.NA

    # ── 2d. Coordinate uncertainty ────────────────────────────────────────────
    if unc_col:
        unc = pd.to_numeric(df[unc_col], errors="coerce")
        tmp = df.assign(_unc=unc)
        avg_unc = tmp.groupby(h3_col)["_unc"].mean().rename("avg_coordinate_uncertainty_m")
        pct_gt  = (
            tmp.assign(_gt10k=(unc > 10_000).astype(float))
               .groupby(h3_col)["_gt10k"].mean()
               .rename("pct_uncertainty_gt_10km")
        )
        agg = agg.merge(avg_unc.reset_index(), on=h3_col, how="left")
        agg = agg.merge(pct_gt.reset_index(),  on=h3_col, how="left")
    else:
        agg["avg_coordinate_uncertainty_m"] = pd.NA
        agg["pct_uncertainty_gt_10km"] = pd.NA

    # ── 2e. IUCN / Threat metrics ─────────────────────────────────────────────
    if iucn_col and sk_col:
        df_iucn = df.loc[
            df[iucn_col].notna() & (df[iucn_col].astype(str).str.strip() != "")
        ].copy()

        # n_assessed_species
        n_assessed = (
            df_iucn.groupby(h3_col)[sk_col].nunique().rename("n_assessed_species")
        )
        agg = agg.merge(n_assessed.reset_index(), on=h3_col, how="left")

        # n_sp_{cat} – only emit column when ≥1 species has that category
        present_cats = [
            c for c in IUCN_ALL_CATS if (df_iucn[iucn_col] == c).any()
        ]
        for cat in present_cats:
            col_name = f"n_sp_{cat.lower()}"
            cnt = (
                df_iucn[df_iucn[iucn_col] == cat]
                .groupby(h3_col)[sk_col].nunique()
                .rename(col_name)
            )
            agg = agg.merge(cnt.reset_index(), on=h3_col, how="left")

        # n_threatened_species (CR + EN + VU)
        df_thr = df_iucn[df_iucn[iucn_col].isin(["CR", "EN", "VU"])]
        n_thr = (
            df_thr.groupby(h3_col)[sk_col].nunique().rename("n_threatened_species")
        )
        agg = agg.merge(n_thr.reset_index(), on=h3_col, how="left")

        # threat_score_weighted – over DISTINCT (cell, species)
        # Vectorized: map each iucn value to a severity int, take max per
        # (cell, species), map back to weight, sum per cell.  No Python apply().
        _SEV_MAP = {"CR": 5, "EN": 4, "VU": 3, "NT": 2, "LC": 1, "DD": 0, "NE": 0}
        _SEV_TO_WEIGHT = {5: IUCN_WEIGHTS.get("CR", 0), 4: IUCN_WEIGHTS.get("EN", 0),
                          3: IUCN_WEIGHTS.get("VU", 0), 2: IUCN_WEIGHTS.get("NT", 0),
                          1: 0, 0: 0}
        sev = df_iucn[iucn_col].map(_SEV_MAP).fillna(-1).astype(np.int8)
        # max severity per (cell, species) → one row per distinct species per cell
        sp_max_sev = (
            df_iucn.assign(_sev=sev)
            .groupby([h3_col, sk_col])["_sev"].max()
        )
        weight = sp_max_sev.map(_SEV_TO_WEIGHT).fillna(0)
        threat_score = weight.groupby(level=h3_col).sum().rename("threat_score_weighted")
        agg = agg.merge(threat_score.reset_index(), on=h3_col, how="left")

    else:
        for col in ["n_assessed_species", "n_threatened_species", "threat_score_weighted"]:
            agg[col] = pd.NA

    # ── 2f. Diversity metrics (fully vectorized, no groupby.apply) ───────────
    if sk_col:
        # Species counts per (h3_cell, species). The diversity indices are
        # computed on this small count table, not on the record-level frame.
        sp_counts = df.groupby([h3_col, sk_col]).size().rename("_n")
        div = _diversity_vectorized(sp_counts)
        agg = agg.merge(div, on=h3_col, how="left")
    else:
        agg["shannon_H"] = pd.NA
        agg["simpson_1_minus_D"] = pd.NA

    # ── 2g. Data Quality Index (0–1) ──────────────────────────────────────────
    #
    # Components (each in [0, 1]):
    #   c1 – species-id completeness: pct of records with a valid species ID
    #   c2 – uncertainty quality: (1 - pct_uncertainty_gt_10km)  [if available]
    #   c3 – iucn coverage: (1 - pct_iucn_missing)             [if iucn column exists]
    #
    # DQI = mean of available components
    # Coordinate completeness is already 1.0 after silver cleaning (all rows have coords).

    dqi_parts: list[pd.Series] = []

    if sk_col:
        c1 = (
            df.assign(_has_sp=df[sk_col].notna().astype(float))
              .groupby(h3_col)["_has_sp"].mean()
              .rename("_c1")
        )
        dqi_parts.append(c1)

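    # Only include the uncertainty component when the source column exists;
    # otherwise the all-NA placeholder would be filled with 0 and score 1.0.
    if unc_col:
        c2 = (1.0 - agg.set_index(h3_col)["pct_uncertainty_gt_10km"].fillna(0)).rename("_c2")
        dqi_parts.append(c2)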

    if iucn_col:
        c3 = (
            df.assign(_iucn_missing=df[iucn_col].isna().astype(float))
              .groupby(h3_col)["_iucn_missing"].mean()
              .rsub(1)
              .rename("_c3")
        )
        dqi_parts.append(c3)

    if dqi_parts:
        dqi_df = pd.concat(dqi_parts, axis=1).reset_index()
        dqi_df["dqi"] = dqi_df.iloc[:, 1:].mean(axis=1)
        agg = agg.merge(dqi_df[[h3_col, "dqi"]], on=h3_col, how="left")
    else:
        agg["dqi"] = pd.NA

    # ── 2h. Partition metadata ────────────────────────────────────────────────
    agg.rename(columns={h3_col: "h3_index"}, inplace=True)
    agg["h3_resolution"] = h3_resolution
    agg["country"]       = country
    agg["year"]          = int(year)

    # Normalise int columns (nunique returns int64, NaN forces float64 after merge)
    int_cols = [
        "observation_count", "species_richness_cell", "unique_datasets",
        "n_assessed_species", "n_threatened_species",
    ] + [f"n_sp_{c.lower()}" for c in IUCN_ALL_CATS if f"n_sp_{c.lower()}" in agg.columns]
    for col in int_cols:
        if col in agg.columns:
            agg[col] = agg[col].astype("Int64")  # nullable integer

    return agg


# ══════════════════════════════════════════════════════════════════════════════
# 3. WRITE GOLD
# ══════════════════════════════════════════════════════════════════════════════

def write_gold(agg: pd.DataFrame, country: str, year: int, h3_resolution: int) -> str:
    """
    Write metrics for one (country, year, h3_resolution) slice to the gold layer.

    Path: s3://{S3_BUCKET}/{GOLD_PREFIX}/country={country}/year={year}/h3_resolution={h3_resolution}/
    """
    s3_root = (
        f"{S3_BUCKET}/{GOLD_PREFIX}"
        f"/country={country}/year={year}/h3_resolution={h3_resolution}"
    )
    log.info("Writing %d cells to s3://%s …", len(agg), s3_root)

    table = pa.Table.from_pandas(agg, preserve_index=False)
    pq.write_to_dataset(
        table,
        root_path=f"s3://{s3_root}",
        filesystem=fs,
        existing_data_behavior="delete_matching",
        row_group_size=PARQUET_ROW_GROUP_SIZE,
        compression=PARQUET_COMPRESSION,
        write_statistics=True,
    )
    s3_uri = f"s3://{s3_root}"
    log.info("Written: %s", s3_uri)
    return s3_uri


# ══════════════════════════════════════════════════════════════════════════════
# MAIN PIPELINE
# ══════════════════════════════════════════════════════════════════════════════

years = list(range(YEAR_END, YEAR_START - 1, -1))  # newest first
partition_plan = [(c, y) for c in COUNTRIES for y in years]

log.info(
    "Gold pipeline: %d (country×year) partitions × %d resolutions = %d total writes",
    len(partition_plan), len(H3_RESOLUTIONS), len(partition_plan) * len(H3_RESOLUTIONS),
)

completed: list[dict] = []
errors:    list[dict] = []

for country, year in partition_plan:
    log.info("\n── %s / %s ──────────────────────────────────────────────────", country, year)
    t_read = time.time()

    try:
        df = read_input(country, year)
    except FileNotFoundError as exc:
        log.error("%s", exc)
        errors.append({"country": country, "year": year, "h3_resolution": "all", "error": str(exc)})
        continue

    log.info("Read done in %.1fs", time.time() - t_read)

    for res in H3_RESOLUTIONS:
        t0 = time.time()
        try:
            agg     = compute_metrics(df, country, year, res)
            s3_uri  = write_gold(agg, country, year, res)
            elapsed = time.time() - t0

            completed.append({
                "country":      country,
                "year":         year,
                "h3_resolution": res,
                "n_cells":      len(agg),
                "s3_uri":       s3_uri,
                "elapsed_s":    round(elapsed, 1),
            })
            log.info("✓ res=%d | %d cells | %.1fs", res, len(agg), elapsed)

        except Exception as exc:
            log.error("✗ %s/%s res=%d: %s", country, year, res, exc, exc_info=True)
            errors.append({"country": country, "year": year, "h3_resolution": res, "error": str(exc)})


# ── Summary ───────────────────────────────────────────────────────────────────
print()
print("═" * 60)
print(f"Gold pipeline complete: {len(completed)} succeeded, {len(errors)} failed")
print("═" * 60)

if completed:
    print("\nCompleted:")
    display(pd.DataFrame(completed))

if errors:
    print("\nFailed:")
    display(pd.DataFrame(errors))

16:38:05 [INFO] Found credentials in shared credentials file: ~/.aws/credentials
16:38:05 [INFO] S3 ready (profile=486717354268_PowerUserAccess, region=eu-west-2, io_threads=16)
16:38:05 [INFO] Gold pipeline: 1 (country×year) partitions × 4 resolutions = 4 total writes
16:38:05 [INFO] 
── ES / 2024 ──────────────────────────────────────────────────
16:38:05 [INFO] Reading silver: s3://ie-datalake/silver/gbif/country=ES/year=2024
16:38:11 [INFO] Projecting 14 / 64 columns – starting download…
16:39:37 [INFO] Loaded 7421317 rows, 14 columns in 91.2s (1976 MB RAM)
16:39:37 [INFO] Read done in 91.9s
16:39:37 [INFO]   res=9 | rows=7421317 | sk=specieskey | unc=coordinateuncertaintyinmeters | iucn=iucn_cat
16:39:43 [INFO] Writing 272584 cells to s3://ie-datalake/gold/gbif_cell_metrics/country=ES/year=2024/h3_resolution=9 …
16:39:44 [INFO] Written: s3://ie-datalake/gold/gbif_cell_metrics/country=ES/year=2024/h3_resolution=9
16:39:44 [INFO] ✓ res=9 | 272584 cells | 7.4s
16:39:44 [INFO]   res=8


════════════════════════════════════════════════════════════
Gold pipeline complete: 4 succeeded, 0 failed
════════════════════════════════════════════════════════════

Completed:


|   | country | year | h3_resolution | n_cells | s3_uri | elapsed_s |
|---|---------|------|---------------|---------|--------|-----------|
| 0 | ES | 2024 | 9 | 272584 | s3://ie-datalake/gold/gbif_cell_metrics/countr... | 7.4 |
| 1 | ES | 2024 | 8 | 149508 | s3://ie-datalake/gold/gbif_cell_metrics/countr... | 5.4 |
| 2 | ES | 2024 | 7 | 59951 | s3://ie-datalake/gold/gbif_cell_metrics/countr... | 4.4 |
| 3 | ES | 2024 | 6 | 15029 | s3://ie-datalake/gold/gbif_cell_metrics/countr... | 3.9 |


In [5]:
# ─────────────────────────────────────────────────────────────────────────────
# VERIFY – read back one slice and inspect the schema + sample rows
# ─────────────────────────────────────────────────────────────────────────────

VERIFY_COUNTRY    = COUNTRIES[0]
VERIFY_YEAR       = YEAR_END
VERIFY_RESOLUTION = 8  # medium resolution for a readable preview

s3_path = (
    f"{S3_BUCKET}/{GOLD_PREFIX}"
    f"/country={VERIFY_COUNTRY}/year={VERIFY_YEAR}"
    f"/h3_resolution={VERIFY_RESOLUTION}"
)
print(f"Reading: s3://{s3_path}")

sample = ds.dataset(s3_path, filesystem=fs_read, format="parquet").to_table().to_pandas()

print(f"\nShape: {sample.shape[0]:,} cells × {sample.shape[1]} columns")
print(f"Columns: {list(sample.columns)}")

print("\nTop 10 cells by species richness:")
display(
    sample.sort_values("species_richness_cell", ascending=False)
          [["h3_index", "observation_count", "species_richness_cell",
            "shannon_H", "simpson_1_minus_D",
            "n_threatened_species", "threat_score_weighted", "dqi"]]
          .head(10)
          .reset_index(drop=True)
)

print("\nMetric summary:")
display(
    sample[[
        "observation_count", "species_richness_cell",
        "shannon_H", "simpson_1_minus_D", "dqi",
        "n_threatened_species", "threat_score_weighted",
        "avg_coordinate_uncertainty_m",
    ]].describe()
)

Reading: s3://ie-datalake/gold/gbif_cell_metrics/country=ES/year=2024/h3_resolution=8

Shape: 149,508 cells × 18 columns
Columns: ['h3_index', 'observation_count', 'species_richness_cell', 'unique_datasets', 'avg_coordinate_uncertainty_m', 'pct_uncertainty_gt_10km', 'n_assessed_species', 'n_sp_cr', 'n_sp_en', 'n_sp_vu', 'n_threatened_species', 'threat_score_weighted', 'shannon_H', 'simpson_1_minus_D', 'dqi', 'h3_resolution', 'country', 'year']

Top 10 cells by species richness:


|   | h3_index | observation_count | species_richness_cell | shannon_H | simpson_1_minus_D | n_threatened_species | threat_score_weighted | dqi |
|---|----------|-------------------|-----------------------|-----------|-------------------|----------------------|-----------------------|-----|
| 0 | 8839560545fffff | 1516 | 611 | 6.10566 | 0.996958 | 3 | 9.0 | 0.66139 |
| 1 | 88184b7687fffff | 13959 | 570 | 4.947319 | 0.988813 | 4 | 13.0 | 0.671896 |
| 2 | 88392d00e9fffff | 3282 | 548 | 5.688569 | 0.994793 | 5 | 18.0 | 0.665651 |
| 3 | 883919d9a3fffff | 2132 | 543 | 5.637586 | 0.993599 | 2 | 7.0 | 0.663852 |
| 4 | 88394460d7fffff | 8069 | 465 | 4.855625 | 0.985303 | 5 | 16.0 | 0.653654 |
| 5 | 883945c751fffff | 1638 | 458 | 5.458804 | 0.99244 | 2 | 9.0 | 0.666463 |
| 6 | 8839440b1bfffff | 1126 | 444 | 5.723554 | 0.995054 | 2 | 7.0 | 0.663114 |
| 7 | 8839444699fffff | 1260 | 428 | 5.474435 | 0.99069 | 1 | 4.0 | 0.666667 |
| 8 | 883971a053fffff | 1718 | 426 | 5.425145 | 0.992412 | 3 | 10.0 | 0.666279 |
| 9 | 8839442a01fffff | 672 | 425 | 5.904638 | 0.996774 | 4 | 17.0 | 0.662698 |



Metric summary:


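|   | observation_count | species_richness_cell | shannon_H | simpson_1_minus_D | dqi | n_threatened_species | threat_score_weighted | avg_coordinate_uncertainty_m |
|---|-------------------|-----------------------|-----------|-------------------|-----|----------------------|-----------------------|------------------------------|
| count | 149508.0 | 149508.0 | 148806.0 | 148806.0 | 149508.0 | 29088.0 | 29088.0 | 105528.0 |
| mean | 49.63826 | 12.054405 | 1.354522 | 0.505073 | 0.631973 | 1.361971 | 4.658622 | 7370.86 |
| std | 417.825464 | 22.323385 | 1.354814 | 0.408933 | 0.114038 | 0.853279 | 2.869122 | 36964.41 |
| min | 1.0 | 0.0 | -0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 0.71 |
| 25% | 1.0 | 1.0 | 0.0 | 0.0 | 0.666667 | 1.0 | 3.0 | 8.0 |
| 50% | 3.0 | 3.0 | 1.098612 | 0.666667 | 0.666667 | 1.0 | 4.0 | 61.0 |
| 75% | 16.0 | 13.0 | 2.507026 | 0.914601 | 0.666667 | 1.0 | 5.0 | 4340.982 |
| max | 49094.0 | 611.0 | 6.10566 | 0.996958 | 1.0 | 11.0 | 41.0 | 2499735.0 |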


In [6]:
sample.head(2)

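|   | h3_index | observation_count | species_richness_cell | unique_datasets | avg_coordinate_uncertainty_m | pct_uncertainty_gt_10km | n_assessed_species | n_sp_cr | n_sp_en | n_sp_vu | n_threatened_species | threat_score_weighted | shannon_H | simpson_1_minus_D | dqi | h3_resolution | country | year |
|---|----------|-------------------|-----------------------|-----------------|------------------------------|-------------------------|--------------------|---------|---------|---------|----------------------|-----------------------|-----------|-------------------|-----|---------------|---------|------|
| 0 | 8818004109fffff | 141 | 13 | 1 |  | 0.0 | 1 |  |  | 1 | 1 | 3.0 | 2.315729 | 0.889895 | 0.671395 | 8 | ES | 2024 |
| 1 | 88180055a1fffff | 215 | 18 | 1 |  | 0.0 | 4 |  | 1.0 | 3 | 4 | 13.0 | 2.428779 | 0.889995 | 0.691473 | 8 | ES | 2024 |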

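One way to sanity-check the diversity columns is against their theoretical bounds: for a cell with S distinct species, H ≤ ln S and 1 − D ≤ 1 − 1/S. The sketch below applies this to three rows copied from the "Top 10 cells by species richness" output; the helper `within_bounds` and the tolerance are hypothetical choices for illustration.

```python
import math

# (species_richness_cell, shannon_H, simpson_1_minus_D), copied from the
# first three rows of the top-10 output above
rows = [
    (611, 6.10566, 0.996958),
    (570, 4.947319, 0.988813),
    (548, 5.688569, 0.994793),
]


def within_bounds(richness: int, shannon: float, simpson: float,
                  tol: float = 1e-6) -> bool:
    """True when both indices respect their maxima for `richness` species."""
    max_shannon = math.log(richness)       # all species equally abundant
    max_simpson = 1.0 - 1.0 / richness
    return shannon <= max_shannon + tol and simpson <= max_simpson + tol


for richness, shannon, simpson in rows:
    print(richness, within_bounds(richness, shannon, simpson))
```

The same check could be run over the full `sample` frame; a violation would point to a mismatch between the richness and diversity computations.
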