# EOPF HEALPix Structure and Metadata (Sentinel-2)

## Description and Motivations

- Most existing EOPF metadata can be reused for products on the HEALPix grid, except those defining the spatial domain.
- Use the same product definition (e.g., for Sentinel-2, a product consists of a single acquisition "scene").
- Follow, as far as possible, the current EOPF data concepts, which rely on both STAC standards and CF conventions (*)
  - [The way spatial reference is specified for variables for EOPF data products follows the CF convention](https://cpm.pages.eopf.copernicus.eu/eopf-cpm/main/PSFD/2-main-data-concepts.html#dimensions-and-coordinate-variables)
  - [Metadata storage in EOPF data shall be based on the SpatioTemporal Asset Catalogue (STAC) standard for general geospatial information data](https://cpm.pages.eopf.copernicus.eu/eopf-cpm/main/PSFD/7-metadata.html#general-principles-and-design-justification)
- Define HEALPix metadata at the group level (single scale) using CF's grid mapping variables
  - Closely based on the [proposal of CF conventions for HEALPix grid parameters](https://github.com/cf-convention/cf-conventions/issues/433)
  - Ellipsoid of reference encoded using a separate grid mapping variable (optional?)
- HEALPix data is *inherently* multi-scale
  - Follow our own (for now) conventions for storing multi-scale data on HEALPix
  - For example: `measurements/reflectance/19`, `measurements/reflectance/16`, etc. (the number is the HEALPix refinement level) instead of `measurements/reflectance/r10m`, `measurements/reflectance/r20m`, etc.
  - Neither EOPF nor the CF conventions currently define a multi-scale concept; a discussion is ongoing for GeoZarr
- Enable HEALPix grid-aware product retrieval in a STAC catalog by adding STAC-specific metadata at the product level
  - Based on the STAC Grid extension (https://github.com/stac-extensions/grid)
  - `grid:code` value still to be defined (a unique cell at a fixed coarse level? a unique cell at the coarsest level found that encompasses the product's spatial extent? a cell range at a fixed level?)
  - Suggested format for `grid:code` value: `HEALPIX-I{indexing_scheme}-L{level}-{cell_id}`

(*) To our knowledge, discussions about EOPF, GeoZarr, and STAC- vs. CF-style specifications are still ongoing. We expect that the specification suggested here for products on the HEALPix grid can evolve, and should adapt fairly well, as those specifications develop.
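
To make the refinement levels mentioned above more concrete, the sketch below estimates the cell size at a given level. It assumes a spherical Earth with a mean radius of 6,371,008.8 m and treats each equal-area cell as a square; `healpix_cell_size_m` is a hypothetical helper for illustration, not part of EOPF or of any HEALPix library.

```python
import math

# Assumption: mean Earth radius in metres (spherical approximation).
EARTH_RADIUS_M = 6_371_008.8


def healpix_cell_size_m(level: int, radius_m: float = EARTH_RADIUS_M) -> float:
    """Approximate edge length (in metres) of a HEALPix cell at `level`."""
    if level < 0:
        raise ValueError("level must be non-negative")
    # 12 base cells, each split into 4 children at every refinement level
    n_cells = 12 * 4**level
    cell_area = 4 * math.pi * radius_m**2 / n_cells
    # treat the equal-area cell as a square to get a characteristic length
    return math.sqrt(cell_area)


for level in (19, 18, 16):
    print(f"level {level}: ~{healpix_cell_size_m(level):.1f} m")
```

Under these assumptions, levels 19, 18 and 16 correspond to cells roughly 12 m, 25 m and 100 m across, and each additional level halves the cell size (a refinement ratio of 4 in area).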

## Metadata Conversion Workflow (from EOPF UTM products)

- For each group containing variables defined on the HEALPix grid:
  - Clean up irrelevant metadata (all `proj:*` STAC attributes)
  - Update EOPF's "dimensions" and "coordinates" variable attributes with the HEALPix "cell" dimension and "cell_ids" coordinate names
  - Add HEALPix (and optionally reference ellipsoid) grid mapping variables
- For parent groups:
  - Conform to multi-scale convention: rename subgroups and add multi-scale metadata attribute
- At the product level:
  - Update STAC `bbox` and `geometry` attribute values to account for the grid cell shapes
  - Clean up irrelevant metadata (all `proj:*` STAC attributes)
  - Add `grid:code` STAC property
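
The `grid:code` format suggested above can be sketched as a pair of helpers that build and parse such a value. The function names and the validation rule (cell ids in the range `[0, 12 * 4**level)`) are assumptions made for illustration; which cell id to store for a product is still to be defined.

```python
import re

# Hypothetical helpers: the layout follows the format suggested in this
# notebook ("HEALPIX-I{indexing_scheme}-L{level}-{cell_id}"); it is not a
# registered STAC grid code.
_GRID_CODE_RE = re.compile(
    r"^HEALPIX-I(?P<indexing_scheme>[a-z]+)-L(?P<level>\d+)-(?P<cell_id>\d+)$"
)


def format_grid_code(indexing_scheme: str, level: int, cell_id: int) -> str:
    """Build a `grid:code` value, checking that the cell id exists at `level`."""
    n_cells = 12 * 4**level  # total number of HEALPix cells at this level
    if not 0 <= cell_id < n_cells:
        raise ValueError(f"cell_id {cell_id} is out of range for level {level}")
    return f"HEALPIX-I{indexing_scheme}-L{level}-{cell_id}"


def parse_grid_code(code: str) -> dict:
    """Split a `grid:code` value back into its three components."""
    match = _GRID_CODE_RE.match(code)
    if match is None:
        raise ValueError(f"not a HEALPix grid code: {code!r}")
    return {
        "indexing_scheme": match["indexing_scheme"],
        "level": int(match["level"]),
        "cell_id": int(match["cell_id"]),
    }


code = format_grid_code("nested", 10, 1234)
print(code, parse_grid_code(code))
```

Keeping the indexing scheme and level inside the code makes each value self-describing, so a catalog client could filter on the string prefix without opening the product.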


In [None]:
import pathlib

import xarray as xr

Reuse the Zarr store exported by the EOPF UTM to HEALPix conversion notebook.

In [None]:
dt = xr.open_datatree("sentinel-2-l1c_healpix.zarr", decode_timedelta=True)

In [None]:
# hard-coded mapping from Sentinel-2 resolution group name to HEALPix level
# (computed in the EOPF UTM conversion notebook but not saved as metadata)
s2rm_to_level = {"r10m": 19, "r20m": 18, "r60m": 16}


# HEALPix multiscale attribute convention
# (inspired from https://github.com/zarr-developers/geozarr-spec/issues/83#issuecomment-3292459330)
multiscales_attr = {
    "name": "healpix",
    "configuration": {"refinement_ratio": 4},
}


def _update_group(ds: xr.Dataset) -> xr.Dataset:
    """Update metadata of a single group."""

    if "cell_ids" not in ds.dims:
        return ds

    ds_healpix = ds.copy()

    # rename dimension "cell_ids" to "cell"
    # (as per https://github.com/cf-convention/cf-conventions/issues/433#issuecomment-3148535637)
    # keep "cell_ids" as an *auxiliary* coordinate (which doesn't have to be strictly monotonic)
    ds_healpix = ds_healpix.rename_dims(cell_ids="cell")

    # remove cell_ids coordinate attributes (they go to the grid mapping variable)
    # extract "level" attribute
    level = ds_healpix.cell_ids.attrs["level"]
    ds_healpix["cell_ids"].attrs.clear()

    # create healpix grid mapping variable
    ds_healpix["spatial_ref"] = (
        (),
        0,
        {
            "grid_mapping_name": "healpix",
            "indexing_scheme": "nested",
            "refinement_ratio": 4,
            "refinement_level": level,
        },
    )

    # reference ellipsoid (not part of the CF HEALPix proposal)
    # (note: lazy lat-lon coordinates should also have a grid_mapping attribute)
    # ds_healpix.coords["crs"] = ((), 0, pyproj.CRS.from_epsg(4326).to_cf())
    # ds_healpix.spatial_ref.attrs["reference_body"] = "crs"

    # update variable attributes
    # dimensions, coordinates and grid_mapping attributes
    # clean-up non-healpix variable attributes
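    for var in ds_healpix.variables.values():
        if "cell" in var.dims:
            # drop the "proj:*" attributes, which only apply to the UTM grid
            attrs = {k: v for k, v in var.attrs.items() if "proj:" not in k}
            eopf_attrs = attrs.get("_eopf_attrs")
            if eopf_attrs is not None:
                # ds.copy() is shallow: copy the nested dict so that the
                # input dataset's attributes are left untouched
                eopf_attrs = dict(eopf_attrs)
                eopf_attrs["dimensions"] = ["cell"]
                eopf_attrs["coordinates"] = ["cell_ids"]
                attrs["_eopf_attrs"] = eopf_attrs
            attrs["grid_mapping"] = "spatial_ref"
            var.attrs = attrs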

    return ds_healpix


def _update_stac_discovery(attrs: dict):
    """Update product level STAC."""

    # TODO: update the "bbox" and "geometry" attributes
    # so that they are consistent with the HEALPix cell geometries

    props = attrs["properties"].copy()

    # remove irrelevant properties for HEALPix
    for k in attrs["properties"]:
        if "proj:" in k:
            del props[k]

    # add STAC discovery "grid:code" property
    # "grid:code" value is formatted like "HEALPIX-I{indexing_scheme}-L{level}-{cell_id}"
    # use a placeholder cell id value for now (to be defined)
    props["grid:code"] = "HEALPIX-Inested-L10-1234"
    attrs["stac_extensions"].append(
        "https://stac-extensions.github.io/grid/v1.1.0/schema.json"
    )

    attrs["properties"] = props


def update_metadata(dt: xr.DataTree) -> xr.DataTree:
    """Update metadata of a single EOPF product.

    The product's data has already been converted to HEALPix.

    """
    # process each group
    updated_groups = {}
    multiscale_paths = set()

    for path, group in dt.subtree_with_keys:
        # DataTree paths always use "/" separators, whatever the platform
        path = pathlib.PurePosixPath(path)

        # detect multiscale and rename group according to
        # HEALPix refinement level
        if path.name in s2rm_to_level:
            path = path.parent / str(s2rm_to_level[path.name])
            multiscale_paths.add(str(path.parent))

        updated_groups[str(path)] = _update_group(group.ds)

    dt_metadata = xr.DataTree.from_dict(updated_groups, name=getattr(dt, "name", None))

    # add multiscale metadata
    for path, group in dt_metadata.subtree_with_keys:
        if path in multiscale_paths:
            group.attrs["multiscales"] = multiscales_attr

    # update product level metadata
    _update_stac_discovery(dt_metadata.attrs["stac_discovery"])

    return dt_metadata

In [None]:
dt_metadata = update_metadata(dt)

In [None]:
dt_metadata["measurements"]

In [None]:
dt_metadata.attrs["stac_discovery"]