# How to create STAC Catalogs 
## STAC Community Sprint, Arlington, November 7th 2019

This notebook runs through some of the basics of using PySTAC to create a static STAC. It was part of a 30-minute presentation at the [community STAC sprint](https://github.com/radiantearth/community-sprints/tree/master/11052019-arlignton-va) in Arlington, VA, in November 2019.

This tutorial will require the `boto3`, `rasterio`, and `shapely` libraries:

In [1]:
!pip install boto3
!pip install rasterio
!pip install shapely



We can import pystac with the alias `stac` to access all of the API we need (saving a glorious 2 characters): 

In [2]:
import pystac as stac

## Creating a catalog from a local file

To give us some material to work with, let's download a single image from the [Spacenet 5 challenge](https://www.topcoder.com/challenges/30099956). We'll use a temporary directory to hold our single-item STAC.

In [3]:
import os
import urllib.request
from tempfile import TemporaryDirectory

tmp_dir = TemporaryDirectory()
img_path = os.path.join(tmp_dir.name, 'image.tif')

In [4]:
url = ('http://spacenet-dataset.s3.amazonaws.com/'
       'spacenet/SN5_roads/train/AOI_7_Moscow/MS/'
       'SN5_roads_train_AOI_7_Moscow_MS_chip996.tif')
urllib.request.urlretrieve(url, img_path)

('/var/folders/sv/zr8j0t4j1f726nhlt3vb8c300000gn/T/tmpytmf1tsx/image.tif',
 <http.client.HTTPMessage at 0x10817ed30>)

We want to create a Catalog. Let's check the pydocs for `Catalog` to see what information we'll need. (We use `__doc__` instead of `help()` here to avoid printing out all the docs for the class.)

In [5]:
print(stac.Catalog.__doc__)

A PySTAC Catalog represents a STAC catalog in memory.

    A Catalog is a :class:`~pystac.STACObject` that may contain children,
    which are instances of :class:`~pystac.Catalog` or :class:`~pystac.Collection`,
    as well as :class:`~pystac.Item` s.

    Args:
        id (str): Identifier for the catalog. Must be unique within the STAC.
        description (str): Detailed multi-line description to fully explain the catalog.
            `CommonMark 0.28 syntax <http://commonmark.org/>`_ MAY be used for rich text
            representation.
        title (str or None): Optional short descriptive one-line title for the catalog.
        stac_extensions (List[str]): Optional list of extensions the Catalog implements.
        href (str or None): Optional HREF for this catalog, which be set as the catalog's
            self link's HREF.

    Attributes:
        id (str): Identifier for the catalog.
        description (str): Detailed multi-line description to fully explain the catalog.
   

Let's just give an ID and a description. We don't have to worry about the HREF right now; that will be set later.

In [6]:
catalog = stac.Catalog(id='test-catalog', description='Tutorial catalog.')

There are no children or items in the catalog, since we haven't added anything yet.

In [7]:
print(list(catalog.get_children()))
print(list(catalog.get_items()))

[]
[]


We'll now create an Item to represent the image. Check the pydocs to see what you need to supply:

In [8]:
print(stac.Item.__doc__)

An Item is the core granular entity in a STAC, containing the core metadata
    that enables any client to search or crawl online catalogs of spatial 'assets' -
    satellite imagery, derived data, DEM's, etc.

    Args:
        id (str): Provider identifier. Must be unique within the STAC.
        geometry (dict): Defines the full footprint of the asset represented by this item,
            formatted according to `RFC 7946, section 3.1 (GeoJSON)
            <https://tools.ietf.org/html/rfc7946>`_.
        bbox (List[float]):  Bounding Box of the asset represented by this item using
            either 2D or 3D geometries. The length of the array must be 2*n where n is the
            number of dimensions.
        datetime (Datetime): Datetime associated with this item.
        properties (dict): A dictionary of additional metadata for the item.
        stac_extensions (List[str]): Optional list of extensions the Item implements.
        href (str or None): Optional HREF for this item, 

Using [rasterio](https://rasterio.readthedocs.io/en/stable/), we can pull out the bounding box of the image to use for the image metadata. If the image contained a NoData border, we would ideally pull out the footprint and save it as the geometry; in this case, we're working with a small chip that most likely has no NoData values.

In [9]:
import rasterio
from shapely.geometry import Polygon, mapping

def get_bbox_and_footprint(raster_uri):
    with rasterio.open(raster_uri) as ds:
        bounds = ds.bounds
        bbox = [bounds.left, bounds.bottom, bounds.right, bounds.top]
        footprint = Polygon([
            [bounds.left, bounds.bottom],
            [bounds.left, bounds.top],
            [bounds.right, bounds.top],
            [bounds.right, bounds.bottom]
        ])
        
        return (bbox, mapping(footprint))

In [10]:
bbox, footprint = get_bbox_and_footprint(img_path)
print(bbox)
print(footprint)

[37.6616853489879, 55.73478197572927, 37.66573047610874, 55.73882710285011]
{'type': 'Polygon', 'coordinates': (((37.6616853489879, 55.73478197572927), (37.6616853489879, 55.73882710285011), (37.66573047610874, 55.73882710285011), (37.66573047610874, 55.73478197572927), (37.6616853489879, 55.73478197572927)),)}
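
The footprint above is just the bounding box written out as a closed GeoJSON ring. As a minimal sketch (the helper names `bbox_to_footprint` and `footprint_to_bbox` are hypothetical and not part of PySTAC or rasterio), the same structure can be built without `shapely`, which may help when checking that `bbox` and `geometry` agree:

```python
def bbox_to_footprint(bbox):
    """Build a GeoJSON Polygon dict from [left, bottom, right, top].

    Hypothetical helper for illustration; not part of PySTAC or rasterio.
    """
    left, bottom, right, top = bbox
    ring = [
        [left, bottom],
        [left, top],
        [right, top],
        [right, bottom],
        [left, bottom],  # GeoJSON rings repeat the first position to close
    ]
    return {'type': 'Polygon', 'coordinates': [ring]}


def footprint_to_bbox(footprint):
    """Recover [left, bottom, right, top] from a Polygon's exterior ring."""
    xs = [x for x, _ in footprint['coordinates'][0]]
    ys = [y for _, y in footprint['coordinates'][0]]
    return [min(xs), min(ys), max(xs), max(ys)]


example_bbox = [37.6616853489879, 55.73478197572927,
                37.66573047610874, 55.73882710285011]
example_footprint = bbox_to_footprint(example_bbox)
print(footprint_to_bbox(example_footprint) == example_bbox)
```

The round trip only holds for axis-aligned rectangles; a true image footprint with a NoData border would need the real geometry.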


We're also using `datetime.utcnow()` to supply the required datetime property for our Item. Since this is a required property, you might often find yourself making up a time to fill in if you don't know the exact capture time.

In [11]:
from datetime import datetime

item = stac.Item(id='local-image',
                 geometry=footprint,
                 bbox=bbox,
                 datetime=datetime.utcnow(),
                 properties={})
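
When the capture time is known, it could be passed as the `datetime` argument in place of the current time. A minimal sketch with a made-up timestamp (the value below is an assumption for illustration, not the real acquisition time of this chip):

```python
from datetime import datetime, timezone

# Hypothetical capture time; replace with the real acquisition time if known.
capture_time = datetime(2019, 11, 5, 16, 43, 22, tzinfo=timezone.utc)

# An explicit UTC offset removes any ambiguity about how to read the value.
print(capture_time.isoformat())
```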

We haven't added it to a catalog yet, so its parent isn't set. Once we add it to the catalog, we can see it correctly links to its parent.

In [12]:
item.get_parent() is None

True

In [13]:
catalog.add_item(item)

In [14]:
item.get_parent()

<Catalog id=test-catalog>

`describe()` is a useful method on `Catalog` - but be careful when using it on large catalogs, as it will walk the entire tree of the STAC.

In [15]:
catalog.describe()

* <Catalog id=test-catalog>
  * <Item id=local-image>


### Adding Assets

We've created an Item, but there aren't any assets associated with it. Let's create one:

In [16]:
print(stac.Asset.__doc__)

An object that contains a link to data associated with the Item that can be
    downloaded or streamed.

    Args:
        href (str): Link to the asset object. Relative and absolute links are both allowed.
        title (str): Optional displayed title for clients and users.
        media_type (str): Optional description of the media type. Registered Media Types
            are preferred. See :class:`~pystac.MediaType` for common media types.
        properties (dict): Optional, additional properties for this asset. This is used by
            extensions as a way to serialize and deserialize properties on asset
            object JSON.

    Attributes:
        href (str): Link to the asset object. Relative and absolute links are both allowed.
        title (str): Optional displayed title for clients and users.
        media_type (str): Optional description of the media type. Registered Media Types
            are preferred. See :class:`~pystac.MediaType` for common media types.
       

In [17]:
item.add_asset(key='image', asset=stac.Asset(href=img_path, media_type=stac.MediaType.GEOTIFF))

<Item id=local-image>

At any time we can call `to_dict()` on STAC objects to see how the STAC JSON is shaping up. Notice the asset is now set:

In [18]:
import json
print(json.dumps(item.to_dict(), indent=4))

{
    "type": "Feature",
    "stac_version": "0.8.1",
    "id": "local-image",
    "properties": {
        "datetime": "2019-11-05 16:43:22Z"
    },
    "geometry": {
        "type": "Polygon",
        "coordinates": [
            [
                [
                    37.6616853489879,
                    55.73478197572927
                ],
                [
                    37.6616853489879,
                    55.73882710285011
                ],
                [
                    37.66573047610874,
                    55.73882710285011
                ],
                [
                    37.66573047610874,
                    55.73478197572927
                ],
                [
                    37.6616853489879,
                    55.73478197572927
                ]
            ]
        ]
    },
    "bbox": [
        37.6616853489879,
        55.73478197572927,
        37.66573047610874,
        55.73882710285011
    ],
    "links": [
        {
            "rel":

Note that the link `href` properties are `null`. This is OK, as we're working with the STAC in memory. Next, we'll talk about writing the catalog out, and how to set those HREFs.

### Saving the catalog

As the JSON above indicates, there are no HREFs set on these in-memory items. PySTAC uses the `self` link on STAC objects to track where the file lives. Because we haven't set them, they evaluate to `None`:

In [19]:
print(catalog.get_self_href() is None)
print(item.get_self_href() is None)

True
True


In order to set them, we can use `normalize_hrefs`. This method will create a normalized set of HREFs for each STAC object in the catalog, according to the [best practices document](https://github.com/radiantearth/stac-spec/blob/v0.8.1/best-practices.md#catalog-layout)'s recommendations on how to lay out a catalog.

In [20]:
catalog.normalize_hrefs(os.path.join(tmp_dir.name, 'stac'))

<Catalog id=test-catalog>

Now that we've normalized to a root directory (the temporary directory), we see that the `self` links are set:

In [21]:
print(catalog.get_self_href())
print(item.get_self_href())

/var/folders/sv/zr8j0t4j1f726nhlt3vb8c300000gn/T/tmpytmf1tsx/stac/catalog.json
/var/folders/sv/zr8j0t4j1f726nhlt3vb8c300000gn/T/tmpytmf1tsx/stac/local-image/local-image.json
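
The two paths printed above suggest the layout at work: the catalog sits at the root as `catalog.json`, and each item gets a directory named after its ID. The sketch below reproduces just that observed pattern with the standard library; the function names are hypothetical and this is not PySTAC's implementation:

```python
import posixpath


def catalog_href(root_href):
    """Path of the root catalog file under root_href."""
    return posixpath.join(root_href, 'catalog.json')


def item_href(root_href, item_id):
    """Path of an item that is a direct child of the root catalog."""
    return posixpath.join(root_href, item_id, '{}.json'.format(item_id))


example_root = '/tmp/example/stac'  # placeholder root, not the real temp dir
print(catalog_href(example_root))
print(item_href(example_root, 'local-image'))
```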


We can now call `save` on the catalog, which will recursively save all the STAC objects to their respective self HREFs.

`save` requires a `CatalogType` to be set. You can review the [API docs](https://pystac.readthedocs.io/en/stable/api.html#catalogtype) on `CatalogType` to see what each type means (unfortunately, `help` doesn't show docstrings for attributes).

In [22]:
catalog.save(catalog_type=stac.CatalogType.SELF_CONTAINED)

In [23]:
!ls {tmp_dir.name}/stac/*

/var/folders/sv/zr8j0t4j1f726nhlt3vb8c300000gn/T/tmpytmf1tsx/stac/catalog.json

/var/folders/sv/zr8j0t4j1f726nhlt3vb8c300000gn/T/tmpytmf1tsx/stac/local-image:
local-image.json


In [24]:
with open(catalog.get_self_href()) as f:
    print(f.read())

{
    "id": "test-catalog",
    "stac_version": "0.8.1",
    "description": "Tutorial catalog.",
    "links": [
        {
            "rel": "root",
            "href": "./catalog.json",
            "type": "application/json"
        },
        {
            "rel": "item",
            "href": "./local-image/local-image.json",
            "type": "application/json"
        }
    ]
}
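
The `./`-prefixed `href` values in the links above are paths relative to the file that contains them. As a rough sketch of that relationship (the helper name is hypothetical, and PySTAC's own logic may differ in edge cases), `posixpath.relpath` can derive one from two absolute paths:

```python
import posixpath


def make_relative_href(target_href, source_href):
    """Express target_href relative to the directory holding source_href."""
    start_dir = posixpath.dirname(source_href)
    rel = posixpath.relpath(target_href, start_dir)
    # Prefix same-directory and child paths with './' to match the JSON above.
    if not rel.startswith('../'):
        rel = './' + rel
    return rel


example_catalog = '/data/stac/catalog.json'  # placeholder paths
example_item = '/data/stac/local-image/local-image.json'

print(make_relative_href(example_item, example_catalog))
print(make_relative_href(example_catalog, example_item))
```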


In [25]:
with open(item.get_self_href()) as f:
    print(f.read())

{
    "type": "Feature",
    "stac_version": "0.8.1",
    "id": "local-image",
    "properties": {
        "datetime": "2019-11-05 16:43:22Z"
    },
    "geometry": {
        "type": "Polygon",
        "coordinates": [
            [
                [
                    37.6616853489879,
                    55.73478197572927
                ],
                [
                    37.6616853489879,
                    55.73882710285011
                ],
                [
                    37.66573047610874,
                    55.73882710285011
                ],
                [
                    37.66573047610874,
                    55.73478197572927
                ],
                [
                    37.6616853489879,
                    55.73478197572927
                ]
            ]
        ]
    },
    "bbox": [
        37.6616853489879,
        55.73478197572927,
        37.66573047610874,
        55.73882710285011
    ],
    "links": [
        {
            "rel":

As you can see, all links are saved with relative paths. That's because we used `catalog_type=CatalogType.SELF_CONTAINED`. If we save an Absolute Published catalog, we'll see absolute paths:

In [26]:
catalog.save(catalog_type=stac.CatalogType.ABSOLUTE_PUBLISHED)

Now the links included in the STAC item are all absolute:

In [27]:
with open(item.get_self_href()) as f:
    print(f.read())

{
    "type": "Feature",
    "stac_version": "0.8.1",
    "id": "local-image",
    "properties": {
        "datetime": "2019-11-05 16:43:22Z"
    },
    "geometry": {
        "type": "Polygon",
        "coordinates": [
            [
                [
                    37.6616853489879,
                    55.73478197572927
                ],
                [
                    37.6616853489879,
                    55.73882710285011
                ],
                [
                    37.66573047610874,
                    55.73882710285011
                ],
                [
                    37.66573047610874,
                    55.73478197572927
                ],
                [
                    37.6616853489879,
                    55.73478197572927
                ]
            ]
        ]
    },
    "bbox": [
        37.6616853489879,
        55.73478197572927,
        37.66573047610874,
        55.73882710285011
    ],
    "links": [
        {
            "rel":

Notice that the Asset HREF is absolute in both cases. We can make the Asset HREF relative to the STAC Item by using `.make_all_asset_hrefs_relative()`:

In [28]:
catalog.make_all_asset_hrefs_relative()
catalog.save(catalog_type=stac.CatalogType.SELF_CONTAINED)

In [29]:
with open(item.get_self_href()) as f:
    print(f.read())

{
    "type": "Feature",
    "stac_version": "0.8.1",
    "id": "local-image",
    "properties": {
        "datetime": "2019-11-05 16:43:22Z"
    },
    "geometry": {
        "type": "Polygon",
        "coordinates": [
            [
                [
                    37.6616853489879,
                    55.73478197572927
                ],
                [
                    37.6616853489879,
                    55.73882710285011
                ],
                [
                    37.66573047610874,
                    55.73882710285011
                ],
                [
                    37.66573047610874,
                    55.73478197572927
                ],
                [
                    37.6616853489879,
                    55.73478197572927
                ]
            ]
        ]
    },
    "bbox": [
        37.6616853489879,
        55.73478197572927,
        37.66573047610874,
        55.73882710285011
    ],
    "links": [
        {
            "rel":

### Creating an EO Item

In the code above, we encapsulated our imagery as a core STAC item. However, there's more information that we can encapsulate, given that we know this is a WorldView-3 image. We can do this by creating an `EOItem`, which is an Item that is extended via the [eo extension](https://github.com/radiantearth/stac-spec/tree/v0.8.1/extensions/eo):

In [30]:
print(stac.EOItem.__doc__)

EOItem represents a snapshot of the earth for a single date and time.

    Args:
        id (str): Provider identifier. Must be unique within the STAC.
        geometry (dict): Defines the full footprint of the asset represented by this item,
            formatted according to `RFC 7946, section 3.1 (GeoJSON)
            <https://tools.ietf.org/html/rfc7946>`_.
        bbox (List[float]):  Bounding Box of the asset represented by this item using
            either 2D or 3D geometries. The length of the array must be 2*n where n is the
            number of dimensions.
        datetime (Datetime): Datetime associated with this item.
        properties (dict): A dictionary of additional metadata for the item.
        gsd (float): Ground Sample Distance at the sensor.
        platform (str): Unique name of the specific platform to which the instrument is attached.
        instrument (str): Name of instrument or sensor used (e.g., MODIS, ASTER, OLI, Canon F-1).
        bands (List[Band]): 

To create the EOItem, we'll need to encode some more information. First, let's define the bands of WorldView-3:

In [31]:
# From: https://www.spaceimagingme.com/downloads/sensors/datasheets/DG_WorldView3_DS_2014.pdf

wv3_bands = [stac.Band(name='Coastal', description='Coastal: 400 - 450 nm', common_name='coastal'),
             stac.Band(name='Blue', description='Blue: 450 - 510 nm', common_name='blue'),
             stac.Band(name='Green', description='Green: 510 - 580 nm', common_name='green'),
             stac.Band(name='Yellow', description='Yellow: 585 - 625 nm', common_name='yellow'),
             stac.Band(name='Red', description='Red: 630 - 690 nm', common_name='red'),
             stac.Band(name='Red Edge', description='Red Edge: 705 - 745 nm', common_name='rededge'),
             stac.Band(name='Near-IR1', description='Near-IR1: 770 - 895 nm', common_name='nir08'),
             stac.Band(name='Near-IR2', description='Near-IR2: 860 - 1040 nm', common_name='nir09')]

We can now create an EO Item, and add it to our catalog:

In [32]:
eo_item = stac.EOItem(id='local-image-eo',
                      geometry=footprint,
                      bbox=bbox,
                      datetime=datetime.utcnow(),
                      properties={},
                      gsd=0.3,
                      platform="Maxar",
                      instrument="WorldView3",
                      bands=wv3_bands)

In [33]:
eo_item

<EOItem id=local-image-eo>

In [34]:
eo_item.add_asset(key='image', asset=stac.EOAsset(href=img_path, 
                                                  media_type=stac.MediaType.GEOTIFF, 
                                                  bands=list(range(0,8))))

<EOItem id=local-image-eo>

Let's clear the in-memory catalog, add the EO item, and save to a new STAC:

In [35]:
catalog.clear_items()
list(catalog.get_items())

[]

In [36]:
catalog.add_item(eo_item)
list(catalog.get_items())

[<EOItem id=local-image-eo>]

In [37]:
catalog.normalize_and_save(root_href=os.path.join(tmp_dir.name, 'stac-eo'), 
                           catalog_type=stac.CatalogType.SELF_CONTAINED)

Now, if we read the catalog from the filesystem, PySTAC recognizes the EOItem and loads it in with the correct type:

In [38]:
catalog2 = stac.Catalog.from_file(os.path.join(tmp_dir.name, 'stac-eo', 'catalog.json'))

In [39]:
list(catalog2.get_items())

[<EOItem id=local-image-eo>]

In [40]:
next(catalog2.get_all_items()).assets

{'image': <EOAsset href=/var/folders/sv/zr8j0t4j1f726nhlt3vb8c300000gn/T/tmpytmf1tsx/image.tif>}
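
The `EOAsset` shown above was created with `bands=list(range(0,8))`. Those integers are indices into the item's band list rather than band objects. A small sketch of that lookup, using plain lists in place of PySTAC objects (the variable and function names are illustrative):

```python
# Band names in the same order as the band list defined earlier.
example_band_names = ['Coastal', 'Blue', 'Green', 'Yellow',
                      'Red', 'Red Edge', 'Near-IR1', 'Near-IR2']


def resolve_band_names(band_names, asset_band_indices):
    """Map an asset's band indices to the names in the item's band list."""
    return [band_names[i] for i in asset_band_indices]


# An asset holding all eight bands, as in the tutorial.
print(resolve_band_names(example_band_names, list(range(0, 8))))

# A hypothetical RGB-only asset would reference just three of them.
print(resolve_band_names(example_band_names, [4, 2, 1]))
```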

In [41]:
import json

print(json.dumps(eo_item.to_dict(), indent=4))

{
    "type": "Feature",
    "stac_version": "0.8.1",
    "id": "local-image-eo",
    "properties": {
        "datetime": "2019-11-05 16:43:28Z",
        "eo:gsd": 0.3,
        "eo:platform": "Maxar",
        "eo:instrument": "WorldView3",
        "eo:bands": [
            {
                "name": "Coastal",
                "common_name": "coastal",
                "description": "Coastal: 400 - 450 nm"
            },
            {
                "name": "Blue",
                "common_name": "blue",
                "description": "Blue: 450 - 510 nm"
            },
            {
                "name": "Green",
                "common_name": "green",
                "description": "Green: 510 - 580 nm"
            },
            {
                "name": "Yellow",
                "common_name": "yellow",
                "description": "Yellow: 585 - 625 nm"
            },
            {
                "name": "Red",
                "common_name": "red",
                "description"

### Collections

Collections are a subtype of Catalog that have some additional properties to make them more searchable. They also can define common properties so that items in the collection don't have to duplicate common data for each item. Let's create a collection to hold common properties between two images from the Spacenet 5 challenge.

First we'll get another image, and its bbox and footprint:

In [42]:
url2 = ('http://spacenet-dataset.s3.amazonaws.com/'
        'spacenet/SN5_roads/train/AOI_7_Moscow/MS/'
        'SN5_roads_train_AOI_7_Moscow_MS_chip997.tif')
# Use a distinct filename so the first image is not overwritten.
img_path2 = os.path.join(tmp_dir.name, 'image2.tif')
urllib.request.urlretrieve(url2, img_path2)

('/var/folders/sv/zr8j0t4j1f726nhlt3vb8c300000gn/T/tmpytmf1tsx/image2.tif',
 <http.client.HTTPMessage at 0x119333908>)

In [43]:
bbox2, footprint2 = get_bbox_and_footprint(img_path2)

We can take a look at the pydocs for Collection to see what information we need to supply in order to satisfy the spec.

In [44]:
print(stac.Collection.__doc__)

A Collection extends the Catalog spec with additional metadata that helps
    enable discovery.

    Args:
        id (str): Identifier for the collection. Must be unique within the STAC.
        description (str): Detailed multi-line description to fully explain the collection.
            `CommonMark 0.28 syntax <http://commonmark.org/>`_ MAY be used for rich text
            representation.
        extent (Extent): Spatial and temporal extents that describe the bounds of
            all items contained within this Collection.
        title (str or None): Optional short descriptive one-line title for the collection.
        stac_extensions (List[str]): Optional list of extensions the Collection implements.
        href (str or None): Optional HREF for this collection, which be set as the collection's
            self link's HREF.
        license (str):  Collection's license(s) as a `SPDX License identifier
            <https://spdx.org/licenses/>`_ or `expression
            <https:/

Beyond what a Catalog requires, a Collection requires a license and an `Extent` that describes the range of space and time that the items it holds occupy.

In [45]:
print(stac.Extent.__doc__)

Describes the spatio-temporal extents of a Collection.

    Args:
        spatial (SpatialExtent): Potential spatial extent covered by the collection.
        temporal (TemporalExtent): Potential temporal extent covered by the collection.

    Attributes:
        spatial (SpatialExtent): Potential spatial extent covered by the collection.
        temporal (TemporalExtent): Potential temporal extent covered by the collection.
    


An Extent is composed of a SpatialExtent and a TemporalExtent. These hold one or more bounding boxes and time intervals, respectively, that completely cover the items contained in the collection.

Let's start with creating two new items - these will be core Items, not `EOItems`, although they will be imparted with `eo` information by the collection. This is why we add `eo` to the `stac_extensions`. We are also adding `EOAssets` to the Items, so that the assets have the proper `eo:bands` metadata associated with them:

In [46]:
collection_item1 = stac.Item(id='local-image-col-1',
                             geometry=footprint,
                             bbox=bbox,
                             datetime=datetime.utcnow(),
                             properties={},
                             stac_extensions=['eo'])
collection_item1.add_asset('image', stac.EOAsset(href=img_path, 
                                                 media_type=stac.MediaType.GEOTIFF, 
                                                 bands=list(range(0,8))))

collection_item2 = stac.Item(id='local-image-col-2',
                              geometry=footprint2,
                              bbox=bbox2,
                              datetime=datetime.utcnow(),
                              properties={},
                              stac_extensions=['eo'])
collection_item2.add_asset('image', stac.EOAsset(href=img_path2,
                                                 media_type=stac.MediaType.GEOTIFF,
                                                 bands=list(range(0,8))))

<Item id=local-image-col-2>

We can use our two items' metadata to find out what the proper bounds are:

In [47]:
from shapely.geometry import shape

unioned_footprint = shape(footprint).union(shape(footprint2))
collection_bbox = list(unioned_footprint.bounds)
spatial_extent = stac.SpatialExtent(bboxes=[collection_bbox])

In [48]:
collection_interval = sorted([collection_item1.datetime, collection_item2.datetime])
temporal_extent = stac.TemporalExtent(intervals=[collection_interval])

In [49]:
collection_extent = stac.Extent(spatial=spatial_extent, temporal=temporal_extent)
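
For axis-aligned boxes like these, the union's bounds are simply the minimum and maximum of the corner coordinates, and the temporal interval is the earliest and latest datetime. A dependency-free sketch of the same calculation, with made-up inputs (the function names are hypothetical, and boxes that cross the antimeridian are not handled):

```python
from datetime import datetime


def union_bbox(bboxes):
    """Smallest [left, bottom, right, top] box covering every input box."""
    lefts, bottoms, rights, tops = zip(*bboxes)
    return [min(lefts), min(bottoms), max(rights), max(tops)]


def covering_interval(datetimes):
    """Earliest and latest datetime, as a [start, end] pair."""
    return [min(datetimes), max(datetimes)]


# Illustrative values only; not the real chip bounds or capture times.
example_bboxes = [[37.0, 55.0, 37.5, 55.5], [37.25, 55.25, 38.0, 56.0]]
example_times = [datetime(2019, 11, 5, 12, 0), datetime(2019, 11, 4, 9, 30)]

print(union_bbox(example_bboxes))
print(covering_interval(example_times))
```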

We can list the common properties for the items, with their proper extension names, and use it in the Collection properties:

In [50]:
common_properties = { 'eo:bands': [b.to_dict() for b in wv3_bands], 
                      'eo:gsd': 0.3,
                      'eo:platform': 'Maxar',
                      'eo:instrument': 'WorldView3'
                    }

In [51]:
collection = stac.Collection(id='wv3-images',
                             description='Spacenet 5 images over Moscow',
                             extent=collection_extent,
                             properties=common_properties,
                             license='CC-BY-SA-4.0')

Now if we add our items to our Collection, and our Collection to our Catalog, we get the following STAC that can be saved:

In [52]:
collection.add_items([collection_item1, collection_item2])

In [53]:
catalog.clear_items()
catalog.clear_children()
catalog.add_child(collection)

In [54]:
catalog.describe()

* <Catalog id=test-catalog>
    * <Collection id=wv3-images>
      * <Item id=local-image-col-1>
      * <Item id=local-image-col-2>


In [55]:
catalog.normalize_and_save(root_href=os.path.join(tmp_dir.name, 'stac-collection'), 
                           catalog_type=stac.CatalogType.SELF_CONTAINED)

Notice our collection item does not have any of the `eo` metadata in its properties:

In [56]:
collection_item1.to_dict()

{'type': 'Feature',
 'stac_version': '0.8.1',
 'id': 'local-image-col-1',
 'properties': {'datetime': '2019-11-05 16:43:31Z'},
 'geometry': {'type': 'Polygon',
  'coordinates': (((37.6616853489879, 55.73478197572927),
    (37.6616853489879, 55.73882710285011),
    (37.66573047610874, 55.73882710285011),
    (37.66573047610874, 55.73478197572927),
    (37.6616853489879, 55.73478197572927)),)},
 'bbox': [37.6616853489879,
  55.73478197572927,
  37.66573047610874,
  55.73882710285011],
 'links': [{'rel': 'collection',
   'href': '../collection.json',
   'type': 'application/json'},
  {'rel': 'self',
   'href': '/var/folders/sv/zr8j0t4j1f726nhlt3vb8c300000gn/T/tmpytmf1tsx/stac-collection/wv3-images/local-image-col-1/local-image-col-1.json',
   'type': 'application/json'},
  {'rel': 'root', 'href': '../../catalog.json', 'type': 'application/json'},
  {'rel': 'parent', 'href': '../collection.json', 'type': 'application/json'}],
 'assets': {'image': {'href': '/var/folders/sv/zr8j0t4j1f726nhlt

However, when we read the catalog in, the collection information is merged with the item metadata, and we get `EOItem`s in our STAC:

In [57]:
catalog3 = stac.Catalog.from_file(os.path.join(tmp_dir.name, 'stac-collection', 'catalog.json'))
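
Conceptually, reading the catalog back combines the collection's common properties with each item's own properties. The sketch below models that merge with plain dictionaries, assuming item-level values take precedence over collection-level ones; it illustrates the idea and is not PySTAC's actual code:

```python
def merge_common_properties(collection_properties, item_properties):
    """Return item properties with collection-level defaults filled in.

    Assumption for this sketch: a key set on the item overrides the
    collection's value for the same key.
    """
    merged = dict(collection_properties)
    merged.update(item_properties)
    return merged


example_collection_props = {'eo:gsd': 0.3, 'eo:platform': 'Maxar'}
example_item_props = {'datetime': '2019-11-05 16:43:31Z'}

print(merge_common_properties(example_collection_props, example_item_props))
```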

In [58]:
catalog3.describe()

* <Catalog id=test-catalog>
    * <Collection id=wv3-images>
      * <EOItem id=local-image-col-1>
      * <EOItem id=local-image-col-2>


In [59]:
col_items = list(catalog3.get_all_items())

In [60]:
col_items[0].bands

[<Band name=Coastal>,
 <Band name=Blue>,
 <Band name=Green>,
 <Band name=Yellow>,
 <Band name=Red>,
 <Band name=Red Edge>,
 <Band name=Near-IR1>,
 <Band name=Near-IR2>]

### Cleanup

Don't forget to clean up the temporary directory!

In [61]:
tmp_dir.cleanup()

## Creating a STAC of imagery from Spacenet 5 data

Now, let's take what we've learned and create a Catalog with more data in it.


### Allowing PySTAC to read from AWS S3

PySTAC aims to be virtually zero-dependency (notwithstanding the why-isn't-this-in-stdlib `python-dateutil`), so out of the box it can't read from or write to anything but the local file system. However, we can hook into PySTAC's IO in the following way. Learn more about how to use `STAC_IO` in the [documentation on the topic](https://pystac.readthedocs.io/en/latest/concepts.html#using-stac-io):

In [62]:
from urllib.parse import urlparse
import boto3
from pystac import STAC_IO

def my_read_method(uri):
    parsed = urlparse(uri)
    if parsed.scheme == 's3':
        bucket = parsed.netloc
        key = parsed.path[1:]
        s3 = boto3.resource('s3')
        obj = s3.Object(bucket, key)
        return obj.get()['Body'].read().decode('utf-8')
    else:
        return STAC_IO.default_read_text_method(uri)

def my_write_method(uri, txt):
    parsed = urlparse(uri)
    if parsed.scheme == 's3':
        bucket = parsed.netloc
        key = parsed.path[1:]
        s3 = boto3.resource("s3")
        s3.Object(bucket, key).put(Body=txt)
    else:
        STAC_IO.default_write_text_method(uri, txt)

STAC_IO.read_text_method = my_read_method
STAC_IO.write_text_method = my_write_method
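
Both hooks rely on `urlparse` to split an S3 URI into a bucket and a key. A quick standalone check of that parsing, using a placeholder bucket and key (the helper name `split_s3_uri` is hypothetical):

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Return (bucket, key) for an s3:// URI, or None for anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3':
        return None
    # netloc is the bucket; path starts with '/', which is not part of the key.
    return parsed.netloc, parsed.path[1:]


print(split_s3_uri('s3://example-bucket/stac/catalog.json'))
print(split_s3_uri('/tmp/stac/catalog.json'))
```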

We'll need a utility to list keys for reading the lists of files from S3:

In [63]:
# From https://alexwlchan.net/2017/07/listing-s3-keys/

def get_s3_keys(bucket, prefix):
    """Generate all the keys in an S3 bucket."""
    s3 = boto3.client('s3')
    kwargs = {'Bucket': bucket, 'Prefix': prefix}
    while True:
        resp = s3.list_objects_v2(**kwargs)
        # 'Contents' is absent from the response when no keys match the prefix.
        for obj in resp.get('Contents', []):
            yield obj['Key']

        try:
            kwargs['ContinuationToken'] = resp['NextContinuationToken']
        except KeyError:
            break
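
The loop above keeps requesting pages until the response no longer carries a `NextContinuationToken`. To see that control flow without touching S3, the sketch below runs the same pattern against a small stand-in client; `FakeS3Client` and its canned pages are invented for this example and only mimic the response keys used above:

```python
class FakeS3Client:
    """Stand-in that serves canned pages shaped like list_objects_v2 responses."""

    def __init__(self, pages):
        self._pages = pages

    def list_objects_v2(self, **kwargs):
        # In this fake, the token is simply the index of the next page.
        index = int(kwargs.get('ContinuationToken', 0))
        page = {'Contents': [{'Key': key} for key in self._pages[index]]}
        if index + 1 < len(self._pages):
            page['NextContinuationToken'] = str(index + 1)
        return page


def list_keys(client, bucket, prefix):
    """Yield every key, following continuation tokens across pages."""
    kwargs = {'Bucket': bucket, 'Prefix': prefix}
    while True:
        resp = client.list_objects_v2(**kwargs)
        for obj in resp.get('Contents', []):
            yield obj['Key']
        if 'NextContinuationToken' not in resp:
            break
        kwargs['ContinuationToken'] = resp['NextContinuationToken']


fake_client = FakeS3Client([['a.tif', 'b.tif'], ['c.tif']])
print(list(list_keys(fake_client, 'example-bucket', 'example/prefix')))
```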

Let's make a STAC of imagery over Moscow as part of the Spacenet 5 challenge. As a first step, we can list out the imagery and extract IDs from each of the chips.

In [64]:
moscow_training_chip_uris = list(get_s3_keys(bucket='spacenet-dataset', 
                                             prefix='spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS'))

In [65]:
import re

chip_id_to_data = {}

def get_chip_id(uri):
    return re.search(r'.*\_chip(\d+)\.', uri).group(1)

for uri in moscow_training_chip_uris:
    chip_id = get_chip_id(uri)
    chip_id_to_data[chip_id] = { 'img': 's3://spacenet-dataset/{}'.format(uri) }
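
The chip IDs extracted this way are strings, so `'10'` sorts before `'2'` unless they are converted to integers. A brief sketch of the extraction on sample keys, plus a numeric sort; the sample keys are shortened placeholders, and `extract_chip_id` is a hypothetical name that mirrors `get_chip_id` but returns `None` when nothing matches:

```python
import re


def extract_chip_id(uri):
    """Return the digits between '_chip' and the file extension, or None."""
    match = re.search(r'_chip(\d+)\.', uri)
    return match.group(1) if match else None


example_keys = ['prefix/AOI_chip10.tif', 'prefix/AOI_chip2.tif',
                'prefix/AOI_chip0.tif', 'prefix/readme.txt']

chip_ids = [cid for cid in map(extract_chip_id, example_keys) if cid is not None]
print(sorted(chip_ids))           # string order
print(sorted(chip_ids, key=int))  # numeric order
```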

For this tutorial, we'll only take a subset of the data.

In [66]:
chip_id_to_data = dict(list(chip_id_to_data.items())[:10])

In [67]:
chip_id_to_data

{'0': {'img': 's3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip0.tif'},
 '1': {'img': 's3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip1.tif'},
 '10': {'img': 's3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip10.tif'},
 '100': {'img': 's3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip100.tif'},
 '1000': {'img': 's3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip1000.tif'},
 '1001': {'img': 's3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip1001.tif'},
 '1002': {'img': 's3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip1002.tif'},
 '1003': {'img': 's3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Mo

Let's turn each of those chips into a STAC Item that represents the image.

In [68]:
chip_id_to_items = {}

We'll create core `Item`s for our imagery, but mark them with the `eo` extension as we did above, and store the `eo` data in a `Collection`.

Note that the image CRS is WGS 84 (lat/lng). If it weren't, we'd have to reproject the footprint to WGS 84 to comply with the spec (which can easily be done with [pyproj](https://github.com/pyproj4/pyproj)).

Here we're taking advantage of `rasterio`'s ability to read S3 URIs, which only grabs the GeoTIFF metadata and does not pull the whole file down.

In [69]:
for chip_id in chip_id_to_data:
    img_uri = chip_id_to_data[chip_id]['img']
    print('Processing {}'.format(img_uri))
    bbox, footprint = get_bbox_and_footprint(img_uri)

    item = stac.Item(id='img_{}'.format(chip_id), 
                     geometry=footprint,
                     bbox=bbox,
                     datetime=datetime.utcnow(),
                     properties={},
                     stac_extensions=['eo']) 
    item.add_asset(key='ps-ms', 
                   asset=stac.EOAsset(href=img_uri,
                                      media_type=stac.MediaType.COG,
                                      bands=list(range(0, 8))))
    chip_id_to_items[chip_id] = item

Processing s3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip0.tif
Processing s3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip1.tif
Processing s3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip10.tif
Processing s3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip100.tif
Processing s3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip1000.tif
Processing s3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip1001.tif
Processing s3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip1002.tif
Processing s3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/PS-MS/SN5_roads_train_AOI_7_Moscow_PS-MS_chip1003.tif
Processin
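
The chips here are already in lat/lng, so no reprojection was needed. To illustrate what reprojecting a footprint involves without adding a dependency, the sketch below hand-codes one specific case: converting a ring from spherical Web Mercator (EPSG:3857) metres to longitude/latitude degrees. This is a swapped-in technique for illustration only. For any other CRS, pyproj is the practical choice. The names `mercator_to_lonlat` and `reproject_ring` and the sample coordinates are made up.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857

def mercator_to_lonlat(x, y):
    """Convert spherical Web Mercator metres to (lon, lat) in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat

def reproject_ring(ring):
    """Apply the conversion to every vertex of a polygon ring."""
    return [mercator_to_lonlat(x, y) for x, y in ring]

# A made-up square footprint in Web Mercator metres (closed ring).
ring = [(4194000.0, 7506000.0), (4194000.0, 7507000.0),
        (4195000.0, 7507000.0), (4195000.0, 7506000.0),
        (4194000.0, 7506000.0)]

# GeoJSON-style geometry with lon/lat coordinates, as the spec expects.
footprint = {'type': 'Polygon', 'coordinates': [reproject_ring(ring)]}
print(footprint['coordinates'][0][0])
```

The bounding box would need the same treatment: recompute it from the reprojected vertices rather than converting the original box corners and assuming it stays axis-aligned.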

### Creating the Collection

All of these images are over Moscow. Spacenet 5 has imagery for a couple of cities, so a Collection per city is a good way to keep the imagery separated. We can store all of the common `eo` metadata in the collection.

In [70]:
from shapely.geometry import (shape, MultiPolygon)

footprints = list(map(lambda i: shape(i.geometry).envelope, 
                      chip_id_to_items.values()))
collection_bbox = MultiPolygon(footprints).bounds
spatial_extent = stac.SpatialExtent(bboxes=[collection_bbox])
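
The collection bounding box computed above is the smallest box that covers every item footprint. For axis-aligned boxes the same result can be had from the item `bbox` lists directly. The sketch below shows the arithmetic on two made-up boxes. `union_bbox` is a hypothetical helper, not a PySTAC or shapely function.

```python
def union_bbox(bboxes):
    """Return [min_x, min_y, max_x, max_y] covering every 2D bbox given."""
    min_xs, min_ys, max_xs, max_ys = zip(*bboxes)
    return [min(min_xs), min(min_ys), max(max_xs), max(max_ys)]

# Two made-up item bounding boxes in lon/lat order.
bboxes = [
    [37.66, 55.73, 37.67, 55.74],
    [37.68, 55.72, 37.69, 55.73],
]
print(union_bbox(bboxes))  # [37.66, 55.72, 37.69, 55.74]
```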

In [71]:
datetimes = sorted(list(map(lambda i: i.datetime,
                            chip_id_to_items.values())))
temporal_extent = stac.TemporalExtent(intervals=[[datetimes[0], datetimes[-1]]])
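
Because every item above is stamped with `datetime.utcnow()`, this interval records when the notebook ran rather than when the imagery was captured. The same earliest/latest logic applies to real acquisition times. The sketch below uses made-up timestamps and `min`/`max`, which avoids sorting the whole list.

```python
from datetime import datetime, timezone

# Made-up acquisition times standing in for each item's datetime.
datetimes = [
    datetime(2019, 4, 2, 8, 30, tzinfo=timezone.utc),
    datetime(2019, 3, 15, 9, 5, tzinfo=timezone.utc),
    datetime(2019, 5, 20, 7, 45, tzinfo=timezone.utc),
]

# One [start, end] interval spanning all items.
interval = [min(datetimes), max(datetimes)]
print([d.isoformat() for d in interval])
# ['2019-03-15T09:05:00+00:00', '2019-05-20T07:45:00+00:00']
```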

In [72]:
collection_extent = stac.Extent(spatial=spatial_extent, temporal=temporal_extent)

In [73]:
common_properties = { 'eo:bands': [b.to_dict() for b in wv3_bands], 
                      'eo:gsd': 0.3,
                      'eo:platform': 'Maxar',
                      'eo:instrument': 'WorldView3'
                    }

In [74]:
collection = stac.Collection(id='wv3-images',
                             description='Spacenet 5 images over Moscow',
                             extent=collection_extent,
                             properties=common_properties,
                             license='CC-BY-SA-4.0')

In [75]:
collection.add_items(chip_id_to_items.values())

In [76]:
collection.describe()

* <Collection id=wv3-images>
  * <Item id=img_0>
  * <Item id=img_1>
  * <Item id=img_10>
  * <Item id=img_100>
  * <Item id=img_1000>
  * <Item id=img_1001>
  * <Item id=img_1002>
  * <Item id=img_1003>
  * <Item id=img_1004>
  * <Item id=img_1005>


Now, we can create a Catalog and add the collection.

In [77]:
catalog = stac.Catalog(id='spacenet5', description='Spacenet 5 Data (Test)')
catalog.add_child(collection)

In [78]:
catalog.describe()

* <Catalog id=spacenet5>
    * <Collection id=wv3-images>
      * <Item id=img_0>
      * <Item id=img_1>
      * <Item id=img_10>
      * <Item id=img_100>
      * <Item id=img_1000>
      * <Item id=img_1001>
      * <Item id=img_1002>
      * <Item id=img_1003>
      * <Item id=img_1004>
      * <Item id=img_1005>


## Adding label items to the Spacenet 5 catalog

We can use the [label extension](https://github.com/radiantearth/stac-spec/tree/v0.8.1/extensions/label) of the STAC spec to represent the training data in our STAC. For this, we need to grab the URIs of the road GeoJSON files:

In [79]:
moscow_training_geojson_uris = list(get_s3_keys(bucket='spacenet-dataset',
                                               prefix='spacenet/SN5_roads/train/AOI_7_Moscow/geojson_roads_speed/'))

In [80]:
for uri in moscow_training_geojson_uris:
    chip_id = get_chip_id(uri)
    if chip_id in chip_id_to_data:
        chip_id_to_data[chip_id]['label'] = 's3://spacenet-dataset/{}'.format(uri)
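
The loop above attaches a label only when a GeoJSON key shares a chip ID with one of the selected images, so an image with no matching label ends up without a `'label'` entry. A quick way to spot such chips is sketched below on made-up data. The dictionary contents and the `example-bucket` URIs are placeholders.

```python
# Made-up stand-in for chip_id_to_data: chip '1' has no label.
chip_data = {
    '0': {'img': 's3://example-bucket/img_chip0.tif',
          'label': 's3://example-bucket/roads_chip0.geojson'},
    '1': {'img': 's3://example-bucket/img_chip1.tif'},
}

# Chip IDs that have an image but no label attached.
missing = [chip_id for chip_id, data in chip_data.items() if 'label' not in data]
print(missing)  # ['1']
```

Running a check like this before building label items gives a clear list of gaps instead of a `KeyError` partway through a later loop.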

We'll add the LabelItems to their own subcatalog; since they don't inherit the Collection's `eo` properties, they shouldn't go in the Collection.

In [81]:
label_catalog = stac.Catalog(id='spacenet-data-labels', description='Labels for Spacenet 5')
catalog.add_child(label_catalog)

We can check the pydocs to see what a LabelItem needs in order to fit the spec:

In [82]:
print(stac.LabelItem.__doc__)

A Label Item represents a polygon, set of polygons, or raster data defining
    labels and label metadata and should be part of a Collection.

    Args:
        id (str): Provider identifier. Must be unique within the STAC.
        geometry (dict): Defines the full footprint of the asset represented by this item,
            formatted according to `RFC 7946, section 3.1 (GeoJSON)
            <https://tools.ietf.org/html/rfc7946>`_.
        bbox (List[float]):  Bounding Box of the asset represented by this item using
            either 2D or 3D geometries. The length of the array must be 2*n where n is the
            number of dimensions.
        datetime (Datetime): Datetime associated with this item.
        properties (dict): A dictionary of additional metadata for the item.
        label_desecription (str): A description of the label, how it was created,
            and what it is recommended for
        label_type (str): An ENUM of either vector label type or raster label type. Us

This loop creates our LabelItems and associates each with the appropriate source image Item.

In [83]:
for chip_id in chip_id_to_data:
    img_item = collection.get_item('img_{}'.format(chip_id))
    label_uri = chip_id_to_data[chip_id]['label']
    
    label_item = stac.LabelItem(id='label_{}'.format(chip_id),
                                geometry=img_item.geometry,
                                bbox=img_item.bbox,
                                datetime=datetime.utcnow(),
                                properties={},
                                label_description="SpaceNet 5 Road labels",
                                label_type=stac.LabelType.VECTOR,
                                label_tasks=['segmentation', 'regression'])
    label_item.add_source(img_item)
    label_item.add_geojson_labels(label_uri)
    
    label_catalog.add_item(label_item)

Now we have a STAC of training data!

In [84]:
catalog.describe()

* <Catalog id=spacenet5>
    * <Collection id=wv3-images>
      * <Item id=img_0>
      * <Item id=img_1>
      * <Item id=img_10>
      * <Item id=img_100>
      * <Item id=img_1000>
      * <Item id=img_1001>
      * <Item id=img_1002>
      * <Item id=img_1003>
      * <Item id=img_1004>
      * <Item id=img_1005>
    * <Catalog id=spacenet-data-labels>
      * <LabelItem id=label_0>
      * <LabelItem id=label_1>
      * <LabelItem id=label_10>
      * <LabelItem id=label_100>
      * <LabelItem id=label_1000>
      * <LabelItem id=label_1001>
      * <LabelItem id=label_1002>
      * <LabelItem id=label_1003>
      * <LabelItem id=label_1004>
      * <LabelItem id=label_1005>


In [85]:
label_item = catalog.get_child('spacenet-data-labels').get_item('label_1')
label_item.to_dict()

{'type': 'Feature',
 'stac_version': '0.8.1',
 'id': 'label_1',
 'properties': {'datetime': '2019-11-05 16:43:42Z',
  'label:description': 'SpaceNet 5 Road labels',
  'label:type': 'vector',
  'label:properties': None,
  'label:tasks': ['segmentation', 'regression']},
 'geometry': {'type': 'Polygon',
  'coordinates': (((37.68191035616281, 55.73478210707574),
    (37.68191035616281, 55.73882710285011),
    (37.68595535193718, 55.73882710285011),
    (37.68595535193718, 55.73478210707574),
    (37.68191035616281, 55.73478210707574)),)},
 'bbox': [37.68191035616281,
  55.73478210707574,
  37.68595535193718,
  55.73882710285011],
 'links': [{'rel': 'source', 'href': None, 'type': 'application/json'},
  {'rel': 'root', 'href': None, 'type': 'application/json'},
  {'rel': 'parent', 'href': None, 'type': 'application/json'}],
 'assets': {'labels': {'href': 's3://spacenet-dataset/spacenet/SN5_roads/train/AOI_7_Moscow/geojson_roads_speed/SN5_roads_train_AOI_7_Moscow_geojson_roads_speed_chip1.ge