# Metadata

This notebook builds a SpatioTemporal Asset Catalog (STAC) catalog, collection, and item for the Space2Stats database. It reads the source Parquet file and a separate metadata spreadsheet to produce STAC-compliant metadata.

In [44]:
from typing import Dict
import requests
import pandas as pd
import geopandas as gpd
from shapely.geometry import shape, Polygon
import h3

import shutil
import tempfile
from pathlib import Path

from pystac import Catalog, Collection, Item, Asset, CatalogType, get_stac_version, SpatialExtent
import fio_stac
from datetime import datetime, UTC
import ast
from os.path import join

import git, os

git_repo = git.Repo(os.getcwd(), search_parent_directories=True)
git_root = git_repo.git.rev_parse("--show-toplevel")

## Current Dataset

In [2]:
parquet_file = join(git_root, 'space2stats_api/src/local.parquet')

In [None]:
print(parquet_file)

In [4]:
df = pd.read_parquet(parquet_file)

In [5]:
gdf = df.copy()

In [None]:
len(gdf)

In [7]:
gdf.loc[:, 'geometry'] = gdf.apply(lambda x: Polygon(h3.h3_to_geo_boundary(x['hex_id'], geo_json=True)), axis=1)

In [8]:
gdf = gpd.GeoDataFrame(gdf, geometry='geometry', crs='EPSG:4326')

In [9]:
gdf.total_bounds

array([-179.99999562,  -89.98750455,  179.99999096,   89.98750455])

## Create STAC

In [10]:
print(get_stac_version())

1.0.0


### Content

For now, metadata fields are managed in an Excel spreadsheet.

In [11]:
overview = pd.read_excel("Space2Stats Metadata Content.xlsx", sheet_name="DDH Dataset", index_col="Field")
nada = pd.read_excel("Space2Stats Metadata Content.xlsx", sheet_name="NADA", index_col="Field")
feature_catalog = pd.read_excel("Space2Stats Metadata Content.xlsx", sheet_name="Feature Catalog")
sources = pd.read_excel("Space2Stats Metadata Content.xlsx", sheet_name="Sources")
sources.loc[:, "Variables"] = sources.apply(lambda x: ast.literal_eval(x['Variables']), axis=1)

In [12]:
overview.head()

| Field | Value |
|---|---|
| Title | Space2Stats Database |
| Description | A global dataset of geospatial variables at th... |
| TTL | Ben Stewart |
| Business Unit | DECSC |
| Collaborator | Andres Chamorro |


In [13]:
nada.head()

| Field | Group | Value |
|---|---|---|
| Title | Identification | Space2Stats Database |
| Identifier | Identification | GLO_2024_SPACE2STATS_GEO_v01 |
| Hierarchy level | Identification | dataset |
| Edition | Identification | v.1 |
| Edition Date | Identification | 2024-09-06 00:00:00 |


In [14]:
feature_catalog.head()

|   | variable | description | type | nodata |
|---|---|---|---|---|
| 0 | hex_id | H3 unique identifier | string |  |
| 1 | ogc_fid | Feature unique identifier | numeric |  |
| 2 | sum_pop_2020 | Total population, 2020 | numeric |  |
| 3 | sum_pop_f_0_2020 | Total population female, ages 0 to 1, 2020 | numeric |  |
| 4 | sum_pop_f_10_2020 | Total population female, ages 10 to 15, 2020 | numeric |  |


In [15]:
sources.head()

|   | Theme | Name | Description | Methodological Notes | Variables | Source Data | Citation source | Organization | Method | Resolution |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Demographics | Population | Gridded population disaggregated by gender. | Global raster files are processed for each hex... | [sum_pop_2020, sum_pop_f_0_2020, sum_pop_f_10_... | WorldPop gridded population, 2020, Unconstrain... | Stevens FR, Gaughan AE, Linard C, Tatem AJ (20... | World Pop, https://www.worldpop.org/methods/po... | sum | 100 mts |
| 1 | Socio-economic | Nighttime Lights | Sum of luminosity values measured by monthly c... | Monthly composites generated by NASA through t... | [ntl_sum_yyyymm] | World Bank - Light Every Night, https://regist... |  | NASA, World Bank | sum | 500 mts |
| 2 | Exposure | Flood Area | Area where flood depth is greater than 50 cm, ... | Flood data combines fluvial, pluvial, and coas... | [flood_area_100, flood_area_1000] | Fathom 3.0 High Resolution Global Flood Maps I... | Wing et al. (2024) A 30 m Global Flood Inundat... | Fathom, https://www.fathom.global/ | sum | 30 mts |
| 3 | Exposure | Population Exposed to Floods | Population where flood depth is greater than 5... | Flood data is intersected with population grid... | [flood_pop_100, flood_pop_1000] | Fathom 3.0 High Resolution Global Flood Maps I... | Wing et al. (2024) A 30 m Global Flood Inundat... | Fathom, https://www.fathom.global/ | sum of intersect | 30 mts and 100 mts |
| 4 | Conflict | Number of Conflict Events | Sum of conflict events (ACLED). | Conflict data is filtered for event types and ... | [acled_events_yyyy] | Armed Conflict Location and Event Data (ACLED)... | https://acleddata.com/article-categories/gener... | ACLED, https://acleddata.com/ | count | point data |


### Catalog  

The catalog holds a basic description of the project and dataset.  
It can link to a World Bank metadata page that follows the appropriate schema (DDH or NADA).  
See, for example, https://nada-demo.ihsn.org/index.php/catalog/55/ or https://datacatalog.worldbank.org/search/dataset/0064614/Harmonized-Sub-National-Food-Security-Data
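
One way to express such a link is a STAC link object with the `describedby` relation. The sketch below builds one as a plain dictionary. The helper name `make_metadata_link` is hypothetical, and the NADA demo URL from the text is used only as a placeholder, not as the dataset's actual metadata page.

```python
import json


def make_metadata_link(href: str, title: str) -> dict:
    """Build a STAC link object that points at an external metadata page."""
    return {
        "rel": "describedby",   # relation type: the target describes this catalog
        "href": href,            # placeholder URL, see the note above
        "type": "text/html",    # media type of the linked page
        "title": title,
    }


link = make_metadata_link(
    "https://nada-demo.ihsn.org/index.php/catalog/55/",
    "Example NADA metadata page",
)
print(json.dumps(link, indent=2))
```

With pystac, the same fields map onto `pystac.Link(rel, target, media_type, title)`, which can be attached to the catalog with `catalog.add_link(...)`.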

In [51]:
catalog = Catalog(
    id="space2stats-catalog", 
    description=overview.loc["Description Resource"].values[0],
    title=overview.loc["Title"].values[0],
    extra_fields={
        "License": overview.loc["License"].values[0],
        "Responsible Party": nada.loc["Responsible party", "Value"],
        "Purpose": nada.loc["Purpose", "Value"],
        "Keywords": ["space2stats", "sub-national", "h3", "hexagons", "global"]
        },
    href="https://worldbank.github.io/DECAT_Space2Stats/stac/catalog.json"
    )

### Collection  

In [52]:
spatial_extent = SpatialExtent([[-180.0, -90.0, 180.0, 90.0]])

In [53]:
# Create the STAC collection.
# Note: pystac's Collection expects an Extent (spatial + temporal);
# only the spatial part is supplied here.
collection = Collection(
    id="space2stats-collection",
    description="A collection of Space2Stats H3 Data",
    extent=spatial_extent,
    extra_fields={
        "Title": overview.loc["Title"].values[0],
        "Description": overview.loc["Description Resource"].values[0],
        "Keywords": ["space2stats", "sub-national", "h3", "hexagons", "global"]
    }
)


In [48]:
catalog.add_child(collection)
catalog

In [22]:
print([child.id for child in catalog.get_children()])


['space2stats-collection', 'space2stats-collection', 'space2stats-collection']
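
The repeated id in the output above is consistent with the `add_child` cell having been run several times against the same catalog object. A minimal sketch of a guard follows. `add_child_once` is a hypothetical helper, and `FakeCatalog` and `FakeChild` are stand-ins so the example runs without pystac. The helper relies only on `get_children()` and `add_child()`, the two methods the notebook already calls on the catalog.

```python
from dataclasses import dataclass, field


@dataclass
class FakeChild:
    """Stand-in for a collection: only the id matters here."""
    id: str


@dataclass
class FakeCatalog:
    """Stand-in exposing the two methods the helper relies on."""
    children: list = field(default_factory=list)

    def get_children(self):
        return iter(self.children)

    def add_child(self, child):
        self.children.append(child)


def add_child_once(catalog, child) -> bool:
    """Add `child` unless a child with the same id is already present."""
    if any(existing.id == child.id for existing in catalog.get_children()):
        return False
    catalog.add_child(child)
    return True


demo_catalog = FakeCatalog()
demo_child = FakeChild("space2stats-collection")
# Simulate re-running the cell three times.
results = [add_child_once(demo_catalog, demo_child) for _ in range(3)]
child_ids = [c.id for c in demo_catalog.get_children()]
print(results, child_ids)
```

Re-creating the catalog before adding children, as the later cells do, avoids the duplicates as well.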


### STAC Item

The item represents the global H3 Parquet file and carries a description for each column.

In [54]:
data_dict = []
for column in gdf.columns:
    if column == 'geometry':
        continue
    data_dict.append({
        "name": column,
        "description": feature_catalog.loc[feature_catalog['variable'] == column, 'description'].values[0],
        "type": str(gdf[column].dtype),
        })

In [55]:
gdf_types = gdf.dtypes.to_dict()
gdf_types = {k: str(v) for k, v in gdf_types.items()}

The item uses the [table](https://github.com/stac-extensions/table) extension. fio-stac also builds a `vector:layers` property; it is not clear yet whether that is necessary here.
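
The `table:columns` entries built above take each description from the feature catalog with `.values[0]`, which raises an `IndexError` when a column has no matching row. The sketch below shows a more tolerant variant on plain dictionaries. `build_table_columns`, the sample inputs, and the empty-string fallback are all assumptions made for illustration.

```python
def build_table_columns(dtypes: dict, descriptions: dict, skip=("geometry",)) -> list:
    """Build table-extension column entries (name, description, type)."""
    columns = []
    for name, dtype in dtypes.items():
        if name in skip:
            continue  # the geometry column is described by table:primary_geometry
        columns.append({
            "name": name,
            # Fall back to an empty description instead of raising
            "description": descriptions.get(name, ""),
            "type": str(dtype),
        })
    return columns


# Hypothetical inputs; "new_col" has no entry in the descriptions
sample_dtypes = {
    "hex_id": "object",
    "sum_pop_2020": "float64",
    "new_col": "int64",
    "geometry": "geometry",
}
sample_descriptions = {
    "hex_id": "H3 unique identifier",
    "sum_pop_2020": "Total population, 2020",
}

table_columns = build_table_columns(sample_dtypes, sample_descriptions)
missing = [c["name"] for c in table_columns if not c["description"]]
print(missing)
```

In the notebook, the inputs could come from `gdf.dtypes.to_dict()` and `dict(zip(feature_catalog['variable'], feature_catalog['description']))`. Listing the columns without a description makes gaps in the spreadsheet visible rather than failing on the first one.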

In [56]:
bb = gdf.total_bounds.tolist()
geom = Polygon.from_bounds(bb[0], bb[1], bb[2], bb[3])

item = Item(
    id="space2stats",
    geometry=geom.__geo_interface__,
    bbox=bb,
    datetime=datetime.now(UTC),  # timezone-aware timestamp (UTC is imported above)
    properties={
        "name": "Space2Stats H3 Data",
        "description": "GeoParquet dataset with h3 hexagons (level 6) covering the globe. Users can access data through an API, specifying variables and areas of interest.", 
        "table:primary_geometry" : "geometry",
        "table:columns" : data_dict,
        "vector:layers" : {
            "space2stats": gdf_types,
            }
        },  
    stac_extensions = ['https://stac-extensions.github.io/table/v1.2.0/schema.json']
    # assets={
    #     "data": Asset(href=out_file, media_type="application/geo+json")
    # } 
)
item

In [57]:
collection.add_item(item, title="Space2Stats Item")

In [58]:
catalog.add_child(collection)
catalog

In [59]:
print(list(catalog.get_children()))
print(list(catalog.get_items()))

[<Collection id=space2stats-collection>]
[]


In [60]:
catalog.describe()

* <Catalog id=space2stats-catalog>
    * <Collection id=space2stats-collection>
      * <Item id=space2stats>


### Assets

Assets can store additional information about authors, the sources of the input data, how the data was processed, and so on.  
A second asset links to the API docs.

In [71]:
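# Path where the sources table is written as JSON at the end of the notebook.
# Note: the asset href below ("./sources.json") does not match this file name.
sources_path = join(".", "stac", "sources_andres.json")  # "space2stats"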
asset = Asset(
    href="./sources.json",
    title="Sources Metadata",
    media_type="application/json",
    roles=["metadata"]
    )
asset

In [62]:
item.add_asset("sources-metadata", asset)

In [63]:
asset_api = Asset(
    href="https://space2stats.ds.io/docs",
    title="API Documentation",
    media_type="text/html",
    roles=["metadata"]
    )
asset_api

In [64]:
item.add_asset("api-docs", asset_api)

### Save Demo

In [65]:
print(catalog.get_self_href() is None)
print(item.get_self_href() is None)

False
True


In [66]:
# catalog.normalize_hrefs(join(".", "stac"))

In [67]:
print(catalog.get_self_href())
print(item.get_self_href())

https://worldbank.github.io/DECAT_Space2Stats/stac/catalog.json
None


In [69]:
catalog.save(catalog_type=CatalogType.RELATIVE_PUBLISHED, dest_href=join(".", "stac"))

ValueError: <Item id=space2stats> does not have a self_href set.
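
The error is consistent with the `normalize_hrefs` call being commented out a few cells earlier: the catalog was constructed with an `href`, but nothing has assigned one to the item. Under pystac's documented best-practices layout, `normalize_hrefs(root)` places the catalog at `root/catalog.json`, a child collection at `root/<collection id>/collection.json`, and an item at `root/<collection id>/<item id>/<item id>.json`. The sketch below only reproduces that path arithmetic in plain Python. `expected_hrefs` is a hypothetical helper, not part of pystac.

```python
import posixpath


def expected_hrefs(root: str, collection_id: str, item_id: str) -> dict:
    """Paths that a best-practices STAC layout would assign under `root`."""
    collection_dir = posixpath.join(root, collection_id)
    return {
        "catalog": posixpath.join(root, "catalog.json"),
        "collection": posixpath.join(collection_dir, "collection.json"),
        # Items get their own subdirectory named after the item id
        "item": posixpath.join(collection_dir, item_id, f"{item_id}.json"),
    }


hrefs = expected_hrefs("./stac", "space2stats-collection", "space2stats")
for name, href in hrefs.items():
    print(f"{name}: {href}")
```

In the notebook itself, uncommenting `catalog.normalize_hrefs(join(".", "stac"))` before calling `catalog.save(...)` should give the item a self href.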

In [72]:
sources.to_json(
    sources_path,
    orient='records',
    indent=4
    )