# Harmonized Landsat-Sentinel Catalog Creation

This notebook demonstrates how to create and store a [Zarr](https://zarr.readthedocs.io/en/stable/) catalog containing metadata and the available HLS scenes. The scenes are stored as cloud-optimized GeoTIFFs in Azure Blob Storage in the East US 2 region and are organized by the HLS tile reference system. Full dataset details are in the [HLS product description](https://hls.gsfc.nasa.gov/products-description/), and the specifics of the Microsoft AI for Earth version used in this example are on the [AI for Earth HLS page](https://microsoft.github.io/AIforEarthDataSets/data/hls.html). The catalog is written to Azure Blob Storage using the parameters set in the second code cell below, and the following notebooks use the catalog for data preprocessing, collection, and sampling. Several kinds of input for catalog creation are demonstrated: a geometry file, a table of tiles and years, and bounding boxes. The final cells create a catalog for sampling tiles with associated field data.

**Note: 'AZURE_STRG_ACCOUNT_KEY' can be found within a storage account in the Azure Portal as seen [here](../pictures/Storage_Key.jpg)**

In [1]:
import os
import sys

import fsspec
import geopandas as gpd
import pandas as pd
from tqdm import tqdm

sys.path.append("..")
from utils.hls.catalog import HLSBand
from utils.hls.catalog import HLSCatalog
from utils.hls.catalog import HLSTileLookup
from utils.hls.catalog import fia_csv_to_data_catalog_input

In [2]:
# Store environment variables for use in subsequent notebooks (fill in the empty strings first)
os.environ['AZURE_STRG_ACCOUNT_KEY'] = ''
os.environ['AZURE_STRG_ACCOUNT_NAME'] = ''
os.environ['CATALOG_BLOB_CONTAINER'] = ''
envdict = dict(os.environ)
%store envdict

Stored 'envdict' (dict)
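
Because the three values above are left blank in the repository, it can help to confirm they have been filled in before running the later cells. The following is a minimal sketch of such a check. `missing_env_vars` and `example_env` are hypothetical names introduced here for illustration and are not part of `utils.hls.catalog`.

```python
import os

# Variable names taken from the cell above
REQUIRED_VARS = (
    "AZURE_STRG_ACCOUNT_KEY",
    "AZURE_STRG_ACCOUNT_NAME",
    "CATALOG_BLOB_CONTAINER",
)


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required names that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Stand-in mapping for illustration; call with no arguments to check os.environ
example_env = {
    "AZURE_STRG_ACCOUNT_NAME": "myaccount",
    "CATALOG_BLOB_CONTAINER": "hls",
}
print(missing_env_vars(example_env))  # ['AZURE_STRG_ACCOUNT_KEY']
```

An empty list means all three settings are present and non-blank.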


In [3]:
lookup = HLSTileLookup()

Reading tile extents...
Read tile extents for 56686 tiles


In [4]:
bands = [
    HLSBand.COASTAL_AEROSOL,
    HLSBand.BLUE,
    HLSBand.GREEN,
    HLSBand.RED,
    HLSBand.NIR_NARROW,
    HLSBand.SWIR1,
    HLSBand.SWIR2,
    HLSBand.CIRRUS,
    HLSBand.QA,
]

In [5]:
# This example queries for all 2015 scenes over the continental US (CONUS) using a multipolygon GeoJSON
# Typical tiles have ~100-200 available images across all satellites, so catalog creation takes time for large queries
geom = gpd.read_file('test_data/conus_final.geojson').to_crs('EPSG:4326')
years = [2015]
conus_catalog = HLSCatalog.from_geom(geom, years, bands, lookup)

Searching for matching Landsat scenes...


100%|██████████| 988/988 [03:07<00:00,  5.26it/s]


Searching for matching Sentinel scenes...


100%|██████████| 988/988 [01:07<00:00, 14.69it/s]


In [None]:
# Note: the query above covers 2015 only, although the path name says 2015-2019
path = 'catalogs/hls_conus_2015-2019'
write_store = fsspec.get_mapper(
    f"az://{os.environ['CATALOG_BLOB_CONTAINER']}/{path}.zarr",
    account_name=os.environ['AZURE_STRG_ACCOUNT_NAME'],
    account_key=os.environ['AZURE_STRG_ACCOUNT_KEY']
)
conus_catalog.to_zarr(write_store)
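
The same `fsspec.get_mapper` call is repeated for each catalog written in this notebook, with only `path` changing. One way to keep the URL and credentials in a single place is sketched below. `catalog_store_args` and `demo_env` are hypothetical names used for illustration, and the values in `demo_env` are stand-ins, not real credentials.

```python
import os


def catalog_store_args(path, env=None):
    """Return (url, storage_options) for a catalog path (hypothetical helper)."""
    env = os.environ if env is None else env
    # Same URL layout as the cells in this notebook: az://<container>/<path>.zarr
    url = f"az://{env['CATALOG_BLOB_CONTAINER']}/{path}.zarr"
    storage_options = {
        "account_name": env["AZURE_STRG_ACCOUNT_NAME"],
        "account_key": env["AZURE_STRG_ACCOUNT_KEY"],
    }
    return url, storage_options


# Stand-in values for illustration only
demo_env = {
    "CATALOG_BLOB_CONTAINER": "my-container",
    "AZURE_STRG_ACCOUNT_NAME": "myaccount",
    "AZURE_STRG_ACCOUNT_KEY": "not-a-real-key",
}
url, options = catalog_store_args("catalogs/hls_test_tiles", demo_env)
print(url)  # az://my-container/catalogs/hls_test_tiles.zarr

# In the notebook, the result would feed the same call used in the cells here:
# write_store = fsspec.get_mapper(url, **options)
```

Keeping the helper free of `fsspec` makes it easy to inspect the target URL before anything is written to blob storage.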

In [6]:
# The example here queries for scattered scenes in the CONUS using a csv formatted as:
# tile, year
# xxxxx, xxxx
tilesdf = pd.read_csv('test_data/fd_test_tiles.csv')
test_catalog = HLSCatalog.from_tilesdf(tilesdf, bands=bands)

Searching for matching Landsat scenes...


100%|██████████| 48/48 [00:10<00:00,  4.53it/s]


Searching for matching Sentinel scenes...


100%|██████████| 48/48 [00:27<00:00,  1.75it/s]


In [None]:
path = 'catalogs/hls_test_tiles'
write_store = fsspec.get_mapper(
    f"az://{os.environ['CATALOG_BLOB_CONTAINER']}/{path}.zarr",
    account_name=os.environ['AZURE_STRG_ACCOUNT_NAME'],
    account_key=os.environ['AZURE_STRG_ACCOUNT_KEY']
)
test_catalog.to_zarr(write_store)

In [None]:
# Washington State, 2015-2019, using a bounding box in shapely bounds order: [min_lon, min_lat, max_lon, max_lat]
bbox = [-124.76074218749999, 45.44471679159555, -116.91650390625, 49.05227025601607]
years = [2015, 2016, 2017, 2018, 2019]
wa_catalog = HLSCatalog.from_bbox(bbox, years, bands, lookup)
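
Since catalog creation over a large area takes several minutes, a quick sanity check on the bounding box can catch swapped latitude and longitude values before the query starts. The sketch below assumes coordinates in degrees (EPSG:4326) and a box that does not cross the antimeridian. `validate_bbox` is a hypothetical helper, not part of `HLSCatalog`.

```python
def validate_bbox(bbox):
    """Check a [min_lon, min_lat, max_lon, max_lat] box in degrees (hypothetical helper)."""
    if len(bbox) != 4:
        raise ValueError("bbox must have exactly four values")
    min_lon, min_lat, max_lon, max_lat = bbox
    if not (-180 <= min_lon < max_lon <= 180):
        raise ValueError(f"invalid longitude range: {min_lon}, {max_lon}")
    if not (-90 <= min_lat < max_lat <= 90):
        raise ValueError(f"invalid latitude range: {min_lat}, {max_lat}")
    return tuple(bbox)


# The Washington State box from the cell above passes the check
wa_bbox = [-124.76074218749999, 45.44471679159555, -116.91650390625, 49.05227025601607]
print(validate_bbox(wa_bbox))

# A box given in (lat, lon) order is rejected
try:
    validate_bbox([45.44, -124.76, 49.05, -116.92])
except ValueError as err:
    print("rejected:", err)
```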

In [None]:
path = 'catalogs/hls_wa_2015-2019'
write_store = fsspec.get_mapper(
    f"az://{os.environ['CATALOG_BLOB_CONTAINER']}/{path}.zarr",
    account_name=os.environ['AZURE_STRG_ACCOUNT_NAME'],
    account_key=os.environ['AZURE_STRG_ACCOUNT_KEY']
)
wa_catalog.to_zarr(write_store)

In [None]:
# 2015-2019 Arizona

bbox = [-114.86206054687499, 31.306715155075167, -109.0283203125, 37.02886944696474]
years = [2015, 2016, 2017, 2018, 2019]
az_catalog = HLSCatalog.from_bbox(bbox, years, bands, lookup)

In [None]:
path = 'catalogs/hls_az_2015-2019'
write_store = fsspec.get_mapper(
    f"az://{os.environ['CATALOG_BLOB_CONTAINER']}/{path}.zarr",
    account_name=os.environ['AZURE_STRG_ACCOUNT_NAME'],
    account_key=os.environ['AZURE_STRG_ACCOUNT_KEY']
)
az_catalog.to_zarr(write_store)

In [None]:
# 2015-2019 Western US (Montana/Wyoming/Colorado/New Mexico and west)
bbox = [-124.78, 31.33, -102.04, 49.02]
years = [2015, 2016, 2017, 2018, 2019]
west_catalog = HLSCatalog.from_bbox(bbox, years, bands, lookup)

In [None]:
path = 'catalogs/hls_west_2015-2019'
write_store = fsspec.get_mapper(
    f"az://{os.environ['CATALOG_BLOB_CONTAINER']}/{path}.zarr",
    account_name=os.environ['AZURE_STRG_ACCOUNT_NAME'],
    account_key=os.environ['AZURE_STRG_ACCOUNT_KEY']
)
west_catalog.to_zarr(write_store)

In [None]:
# The from_point_pandas method accepts any DataFrame with the columns lat, lon, and year
# Below, a subsetted and reindexed csv from the FIA Datamart is used
# The csv is not included in the public repository
# Subsequent sampling in notebook 3 also requires a column 'INDEX' that uniquely identifies samples
df = fia_csv_to_data_catalog_input('./fia_no_pltcn.csv')
pt_catalog = HLSCatalog.from_point_pandas(df, bands, include_scenes=False)
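
The column requirements noted in the comments above can be checked before the catalog is built. This is a minimal sketch that takes a plain list of column names so it can be tried without the FIA csv. `missing_columns` and the two constants are hypothetical names introduced for illustration.

```python
POINT_COLUMNS = ("lat", "lon", "year")  # columns named in the cell comments
SAMPLING_COLUMNS = ("INDEX",)           # needed later for sampling in notebook 3


def missing_columns(columns, required=POINT_COLUMNS + SAMPLING_COLUMNS):
    """Return the required column names absent from `columns` (hypothetical helper)."""
    present = set(columns)
    return [name for name in required if name not in present]


print(missing_columns(["lat", "lon", "year", "INDEX"]))  # []
print(missing_columns(["LAT", "lon", "year"]))           # ['lat', 'INDEX']
```

With a DataFrame in hand, the call would be `missing_columns(df.columns)`. The match is case-sensitive, so a column named `LAT` is reported as missing.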

In [None]:
write_store = fsspec.get_mapper(
    # The 'fia' container is hard-coded here rather than read from CATALOG_BLOB_CONTAINER
    "az://fia/catalogs/fia_tiles.zarr",
    account_name=os.environ['AZURE_STRG_ACCOUNT_NAME'],
    account_key=os.environ['AZURE_STRG_ACCOUNT_KEY']
)
pt_catalog.to_zarr(write_store)