# MLHub Dataset Download

The dataset downloader fetches a dataset's STAC catalog archive and the assets it links to, and supports partial downloads through filtering options. High-level features:

* Robustness
    * Asset download resuming.
    * Retry and backoff for http error conditions.
    * Error reporting for unrecoverable download errors.
    * **Checksum validation if an MD5 hash is available. TODO: consider removing this feature.**
* Performance
    * Scales to millions of assets.
    * Multithreaded workers: parallel downloads.
* Convenience
    * Temporal filter
    * Bounding box filter
    * GeoJSON intersection filter
    * STAC collection_id and item asset key filter
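
To illustrate the retry-and-backoff idea listed under Robustness, here is a minimal, generic sketch using only the standard library. It is not the downloader's actual implementation: `retry_with_backoff`, its parameters, and the simulated `flaky` download are hypothetical names chosen for illustration.

```python
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then try again.

    Hypothetical helper for illustration, not part of radiant_mlhub.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: report the error to the caller
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}
delays = []


def flaky():
    """Simulated download that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient error")
    return "ok"


# Record the delays instead of actually sleeping.
result = retry_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the sketch fast to run and easy to test; a real client would use the default `time.sleep`.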
    
## Filesystem layout of STAC Catalog and Assets

STAC archive file:

`{output_dir}/{dataset_id}/{dataset_id}.tar.gz` 

Unarchived STAC catalog:

`{output_dir}/{dataset_id}/catalog.json` 

Collection, Item and Asset layout:

`{output_dir}/{dataset_id}/{collection_id}/{item_id}/{asset_key}.{ext}` 

Common assets, e.g. `documentation.pdf`, are saved into a `_common` directory
instead of being duplicated for many items:

`{output_dir}/{dataset_id}/_common/{asset_key}.{ext}`

Asset database (tip: this file may be safely deleted to free up disk space):

`{output_dir}/{dataset_id}/mlhub_stac_assets.db`
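
The path templates above can be expressed with `pathlib`. This is only a sketch: `asset_path` is a hypothetical helper that mirrors the templates in this section, and the item id `item_0001` and the `tif` extension are placeholders rather than real catalog values.

```python
from pathlib import Path
from typing import Optional


def asset_path(output_dir, dataset_id, asset_key, ext,
               collection_id: Optional[str] = None,
               item_id: Optional[str] = None) -> Path:
    """Build an asset path following the layout templates (hypothetical helper)."""
    dataset_dir = Path(output_dir) / dataset_id
    if collection_id is None or item_id is None:
        # Common assets are stored once instead of once per item.
        return dataset_dir / "_common" / f"{asset_key}.{ext}"
    return dataset_dir / collection_id / item_id / f"{asset_key}.{ext}"


item_asset = asset_path("/tmp", "sen12floods", "B02", "tif",
                        collection_id="sen12floods_s2_source",
                        item_id="item_0001")
common_asset = asset_path("/tmp", "sen12floods", "documentation", "pdf")
print(item_asset)
print(common_asset)
```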

In [1]:
from radiant_mlhub.models import Dataset

The most basic usage is to fetch a dataset and then call its `download` method. By default, the output directory is the current working directory.

In [2]:
nasa_marine_debris = Dataset.fetch_by_id('nasa_marine_debris')
nasa_marine_debris.title

'Marine Debris Dataset for Object Detection in Planetscope Imagery'

In [3]:
nasa_marine_debris.download()

nasa_marine_debris: fetch stac catalog: 258KB [00:00, 75252.46KB/s]                                                     
unarchive nasa_marine_debris.tar.gz...: 100%|████████████████████████████████████| 2830/2830 [00:00<00:00, 14185.00it/s]
download assets: 100%|█████████████████████████████████████████████████████████████| 2825/2825 [00:19<00:00, 145.36it/s]


In [4]:
# shell commands: delete the asset database, then check the size of the dir:
! rm -rf nasa_marine_debris/mlhub_stac_assets.db
! du -sh nasa_marine_debris

98M	nasa_marine_debris


In [5]:
# shell command to cleanup
! rm -rf nasa_marine_debris*

## Logging

The Python `logging` module can be used to control the verbosity of the download. Turn on INFO or DEBUG messages to see additional output:

In [6]:
import logging
logging.basicConfig(level=logging.INFO)

In [7]:
nasa_marine_debris.download()

nasa_marine_debris: fetch stac catalog: 258KB [00:00, 34940.12KB/s]                                                     
INFO:radiant_mlhub.client.catalog_downloader:unarchive nasa_marine_debris.tar.gz...
unarchive nasa_marine_debris.tar.gz...: 100%|████████████████████████████████████| 2830/2830 [00:00<00:00, 14191.09it/s]
INFO:radiant_mlhub.client.catalog_downloader:create stac asset list...
INFO:radiant_mlhub.client.catalog_downloader:2825 unique assets in stac catalog.
download assets: 100%|█████████████████████████████████████████████████████████████| 2825/2825 [00:18<00:00, 152.37it/s]
INFO:radiant_mlhub.client.catalog_downloader:assets saved to /data/ME-1140-dataset-archive-support/radiant-mlhub/examples/nasa_marine_debris
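
The log records above come from a logger named `radiant_mlhub.client.catalog_downloader`. If you want more detail from this library without raising the verbosity of every other package, standard `logging` lets you set the level on the parent `radiant_mlhub` logger rather than on the root logger. A sketch, assuming the library does not override levels on its own loggers:

```python
import logging

# Keep the root logger quiet (warnings and above only).
logging.basicConfig(level=logging.WARNING)

# Loggers form a dotted hierarchy, so this level is inherited by
# radiant_mlhub.client.catalog_downloader unless that logger sets its own.
logging.getLogger("radiant_mlhub").setLevel(logging.DEBUG)

downloader_logger = logging.getLogger("radiant_mlhub.client.catalog_downloader")
print(logging.getLevelName(downloader_logger.getEffectiveLevel()))
```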


## Output directory options

By default, the output directory is the current working directory. The `output_dir` parameter takes a `str` or `pathlib.Path`. The directory will be created if it does not exist.

In [8]:
# output_dir as string
nasa_marine_debris.download(output_dir='/tmp')

nasa_marine_debris: fetch stac catalog: 258KB [00:00, 91752.62KB/s]                                                     
INFO:radiant_mlhub.client.catalog_downloader:unarchive nasa_marine_debris.tar.gz...
unarchive nasa_marine_debris.tar.gz...: 100%|████████████████████████████████████| 2830/2830 [00:00<00:00, 14454.68it/s]
INFO:radiant_mlhub.client.catalog_downloader:create stac asset list...
INFO:radiant_mlhub.client.catalog_downloader:2825 unique assets in stac catalog.
download assets: 100%|█████████████████████████████████████████████████████████████| 2825/2825 [00:12<00:00, 226.07it/s]
INFO:radiant_mlhub.client.catalog_downloader:assets saved to /tmp/nasa_marine_debris


In [10]:
# output_dir as Path object
from pathlib import Path
nasa_marine_debris.download(output_dir=Path.home() / 'my_projects' / 'ml_datasets')

INFO:radiant_mlhub.client.catalog_downloader:unarchive nasa_marine_debris.tar.gz...
unarchive nasa_marine_debris.tar.gz...: 100%|███████████████████████████████████| 2830/2830 [00:00<00:00, 115104.10it/s]
INFO:radiant_mlhub.client.catalog_downloader:create stac asset list...
INFO:radiant_mlhub.client.catalog_downloader:2825 unique assets in stac catalog.
download assets: 100%|████████████████████████████████████████████████████████████| 2825/2825 [00:01<00:00, 1874.55it/s]
INFO:radiant_mlhub.client.catalog_downloader:assets saved to /home/guidorice/my_projects/ml_datasets/nasa_marine_debris
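
The log lines above show the assets landing in a subdirectory named after the dataset inside `output_dir`. The sketch below imitates that behavior with `pathlib`, assuming the directory tree is created in the manner of `mkdir(parents=True, exist_ok=True)`. `resolve_dataset_dir` is a hypothetical helper, not a library function.

```python
import tempfile
from pathlib import Path
from typing import Union


def resolve_dataset_dir(output_dir: Union[str, Path], dataset_id: str) -> Path:
    """Accept a str or Path and create the dataset directory if needed."""
    dataset_dir = Path(output_dir).expanduser() / dataset_id
    dataset_dir.mkdir(parents=True, exist_ok=True)
    return dataset_dir


with tempfile.TemporaryDirectory() as tmp:
    # The same location given once as a string and once as a Path object.
    from_str = resolve_dataset_dir(tmp + "/my_projects/ml_datasets",
                                   "nasa_marine_debris")
    from_path = resolve_dataset_dir(Path(tmp) / "my_projects" / "ml_datasets",
                                    "nasa_marine_debris")
    same = from_str == from_path
    created = from_path.is_dir()

print(same, created)
```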


In [11]:
# shell commands to cleanup
! rm -rf nasa_marine_debris*
! rm -rf /tmp/nasa_marine_debris*
p = str(Path.home() / 'my_projects' / 'ml_datasets' / 'nasa_marine_debris')
# IPython expands {p} to the value of the Python variable p
! rm -rf {p}

## Large dataset performance

Let's try a somewhat larger dataset (tens of thousands of assets). After downloading the complete dataset, we'll explore the options for filtering assets. Filtering lets you limit the items and assets to those you are interested in before downloading.

This download example was run on a compute-optimized 16-core virtual machine in the Microsoft Azure West Europe region. You will likely see slower download performance on your machine, depending on the number of cores and network bandwidth.

In [12]:
sen12floods = Dataset.fetch_by_id('sen12floods')

In [13]:
%%time

sen12floods.download()

sen12floods: fetch stac catalog: 2060KB [00:00, 127699.36KB/s]                                                          
INFO:radiant_mlhub.client.catalog_downloader:unarchive sen12floods.tar.gz...
unarchive sen12floods.tar.gz...: 100%|█████████████████████████████████████████| 22278/22278 [00:01<00:00, 14239.53it/s]
INFO:radiant_mlhub.client.catalog_downloader:create stac asset list...
INFO:radiant_mlhub.client.catalog_downloader:39063 unique assets in stac catalog.
download assets: 100%|███████████████████████████████████████████████████████████| 39063/39063 [06:26<00:00, 101.06it/s]
INFO:radiant_mlhub.client.catalog_downloader:assets saved to /data/ME-1140-dataset-archive-support/radiant-mlhub/examples/sen12floods


CPU times: user 11min 44s, sys: 2min 15s, total: 14min
Wall time: 6min 40s


In [14]:
# shell commands to get size of assets
! rm -rf sen12floods/mlhub_stac_assets.db
! du -sh sen12floods

15G	sen12floods


In [15]:
# shell command to cleanup
! rm -rf sen12floods*

## Filtering assets options

Filters may be freely combined, except `bbox` and `intersects`, which are mutually exclusive.

### Filter by collection and asset keys

To download only the specified STAC collection ids and STAC item asset keys, create a dictionary in this format and pass it to the `collection_filter` parameter:

```
{ collection_id1: [ asset_key1, asset_key2, ...], collection_id2: [asset_key1, asset_key2, ...] , ... }
```
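
To make the dictionary format concrete, here is a sketch of how such a filter could select assets from a list of `(collection_id, asset_key)` pairs. It only illustrates the format: `select_assets` and the sample pairs are hypothetical, and the downloader's internal logic may differ (for example in how it treats collections that are absent from the filter).

```python
def select_assets(assets, collection_filter):
    """Keep assets whose collection is in the filter and whose key is listed."""
    return [
        (collection_id, asset_key)
        for collection_id, asset_key in assets
        if asset_key in collection_filter.get(collection_id, [])
    ]


# Made-up sample data for illustration.
sample_assets = [
    ("sen12floods_s2_source", "B02"),
    ("sen12floods_s2_source", "B11"),
    ("sen12floods_s2_labels", "labels"),
    ("sen12floods_s1_source", "VV"),
]
example_filter = {
    "sen12floods_s2_source": ["B02", "B03", "B04", "B08"],
    "sen12floods_s2_labels": ["labels", "documentation"],
}
selected = select_assets(sample_assets, example_filter)
print(selected)
```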

In [16]:
my_filter = dict(
    sen12floods_s2_source=['B02', 'B03', 'B04', 'B08'],   # Blue, Green, Red, NIR
    sen12floods_s2_labels=['labels', 'documentation'], 
)  
sen12floods.download(collection_filter=my_filter)

sen12floods: fetch stac catalog: 2060KB [00:00, 118918.57KB/s]                                                          
INFO:radiant_mlhub.client.catalog_downloader:unarchive sen12floods.tar.gz...
unarchive sen12floods.tar.gz...: 100%|█████████████████████████████████████████| 22278/22278 [00:01<00:00, 14177.67it/s]
INFO:radiant_mlhub.client.catalog_downloader:create stac asset list...
INFO:radiant_mlhub.client.catalog_downloader:39063 unique assets in stac catalog.
INFO:radiant_mlhub.client.catalog_downloader:filter collection ids and asset keys...
filter collection ids and asset keys...:   0%|                                                | 0/39063 [00:00<?, ?it/s]
INFO:radiant_mlhub.client.catalog_downloader:11181 assets after collection filter.
filter collection ids and asset keys...: 1052395it [00:00, 5777049.66it/s]
download assets: 100%|████████████████████████████████████████████████████████████| 11181/11181 [01:53<00:00, 98.82it/s]

In [17]:
# shell commands to get size of assets
! rm -rf sen12floods/mlhub_stac_assets.db
! du -sh sen12floods

4.6G	sen12floods


In [18]:
# shell command to cleanup
! rm -rf sen12floods*

### Filter by temporal range

The `datetime` filter selects STAC items by time: pass a tuple of two `datetime` objects to specify a range, or a single `datetime` to specify a single day.

In [19]:
from dateutil.parser import parse

my_start_date = parse("2019-04-01T00:00:00+0000")
my_end_date = parse("2019-04-07T00:00:00+0000")
sen12floods.download(datetime=(my_start_date, my_end_date))

sen12floods: fetch stac catalog: 2060KB [00:00, 117281.78KB/s]                                                          
INFO:radiant_mlhub.client.catalog_downloader:unarchive sen12floods.tar.gz...
unarchive sen12floods.tar.gz...: 100%|█████████████████████████████████████████| 22278/22278 [00:01<00:00, 14269.02it/s]
INFO:radiant_mlhub.client.catalog_downloader:create stac asset list...
INFO:radiant_mlhub.client.catalog_downloader:39063 unique assets in stac catalog.
INFO:radiant_mlhub.client.catalog_downloader:filter by temporal query...
filter by temporal query...: 1052395it [00:02, 646371.82it/s]
INFO:radiant_mlhub.client.catalog_downloader:1842 assets after temporal filter.
filter by temporal query...: 1052395it [00:02, 351032.70it/s]
download assets: 100%|██████████████████████████████████████████████████████████████| 1842/1842 [00:22<00:00, 80.31it/s]
INFO:radiant_mlhub.client.catalog_downloader:assets saved to /data/ME-11
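
A sketch of the temporal test, using only the standard library. It assumes timezone-aware datetimes, an inclusive range, and that a single `datetime` stands for the whole calendar day it falls on. `in_range` is a hypothetical helper, and the downloader's exact boundary handling may differ.

```python
from datetime import datetime, timedelta, timezone


def in_range(item_dt, query):
    """True if item_dt falls within query (a datetime or a (start, end) tuple)."""
    if isinstance(query, datetime):
        # Assumption: a single datetime selects its whole calendar day.
        start = query.replace(hour=0, minute=0, second=0, microsecond=0)
        end = start + timedelta(days=1)
        return start <= item_dt < end
    start, end = query
    return start <= item_dt <= end


start_date = datetime(2019, 4, 1, tzinfo=timezone.utc)
end_date = datetime(2019, 4, 7, tzinfo=timezone.utc)
item_datetime = datetime(2019, 4, 3, 10, 30, tzinfo=timezone.utc)  # made-up item time

inside_range = in_range(item_datetime, (start_date, end_date))
same_day = in_range(item_datetime, datetime(2019, 4, 3, tzinfo=timezone.utc))
other_day = in_range(item_datetime, datetime(2019, 4, 8, tzinfo=timezone.utc))
print(inside_range, same_day, other_day)
```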

### Filter by spatial bounding box (bbox)

Using the `bbox` filter lets you specify a bounding box in longitude/latitude coordinates (CRS EPSG:4326) for a spatial intersection test with each STAC item's bounding box. The example below lists the values in `[min_lng, min_lat, max_lng, max_lat]` order. Note: the `bbox` filter may not be used with the `intersects` filter (use one or the other).

In [21]:
my_bbox = [
        -13.278254,
        8.447033,
        -13.231551,
        8.493532
]
sen12floods.download(bbox=my_bbox)

INFO:radiant_mlhub.client.catalog_downloader:unarchive sen12floods.tar.gz...
unarchive sen12floods.tar.gz...: 100%|████████████████████████████████████████| 22278/22278 [00:00<00:00, 120440.60it/s]
INFO:radiant_mlhub.client.catalog_downloader:create stac asset list...
INFO:radiant_mlhub.client.catalog_downloader:39063 unique assets in stac catalog.
INFO:radiant_mlhub.client.catalog_downloader:filter by bounding box...
filter by bounding box...: 107520it [00:00, 1045769.97it/s]
INFO:radiant_mlhub.client.catalog_downloader:131 assets after bounding box filter.
filter by bounding box...: 1052395it [00:00, 3133664.79it/s]
download assets: 100%|██████████████████████████████████████████████████████████████| 131/131 [00:00<00:00, 2713.19it/s]
INFO:radiant_mlhub.client.catalog_downloader:assets saved to /data/ME-1140-dataset-archive-support/radiant-mlhub/examples/sen12floods
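
The intersection test between two bounding boxes can be sketched as follows, assuming both boxes use `[west, south, east, north]` order in EPSG:4326 and neither crosses the antimeridian. `bboxes_intersect` and the two sample item boxes are hypothetical.

```python
def bboxes_intersect(a, b):
    """True if boxes a and b ([west, south, east, north]) overlap or touch."""
    a_west, a_south, a_east, a_north = a
    b_west, b_south, b_east, b_north = b
    return (a_west <= b_east and b_west <= a_east
            and a_south <= b_north and b_south <= a_north)


query_bbox = [-13.278254, 8.447033, -13.231551, 8.493532]
nearby_item = [-13.30, 8.40, -13.25, 8.46]   # made-up box overlapping the query
distant_item = [10.0, 45.0, 10.5, 45.5]      # made-up box far from the query

overlaps_nearby = bboxes_intersect(query_bbox, nearby_item)
overlaps_distant = bboxes_intersect(query_bbox, distant_item)
print(overlaps_nearby, overlaps_distant)
```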


### Filter by GeoJSON area of interest

Using the `intersects` filter lets you specify a GeoJSON area of interest for a spatial intersection test with each STAC item's bounding box. Note: the `intersects` filter may not be used with the `bbox` filter (use one or the other).

In [22]:
import json
my_geojson = json.loads(
    """
    {
        "type": "Feature",
        "geometry": {
            "type": "Polygon",
            "coordinates": [
                [
                    [
                        -13.278048,
                        8.493532
                    ],
                    [
                        -13.278254,
                        8.447241
                    ],
                    [
                        -13.231762,
                        8.447033
                    ],
                    [
                        -13.231551,
                        8.493323
                    ],
                    [
                        -13.278048,
                        8.493532
                    ]
                ]
            ]           
        }
    }
    """
)
sen12floods.download(intersects=my_geojson)

INFO:radiant_mlhub.client.catalog_downloader:unarchive sen12floods.tar.gz...
unarchive sen12floods.tar.gz...: 100%|████████████████████████████████████████| 22278/22278 [00:00<00:00, 120482.21it/s]
INFO:radiant_mlhub.client.catalog_downloader:create stac asset list...
INFO:radiant_mlhub.client.catalog_downloader:39063 unique assets in stac catalog.
INFO:radiant_mlhub.client.catalog_downloader:filter by intersects...
filter by intersects...: 122880it [00:00, 1146170.23it/s]
INFO:radiant_mlhub.client.catalog_downloader:131 assets after intersects filter.
filter by intersects...: 1052395it [00:00, 3192921.97it/s]
download assets: 100%|██████████████████████████████████████████████████████████████| 131/131 [00:00<00:00, 2808.97it/s]
INFO:radiant_mlhub.client.catalog_downloader:assets saved to /data/ME-1140-dataset-archive-support/radiant-mlhub/examples/sen12floods
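
One way to relate the two spatial filters is to compute the bounding box (envelope) of the GeoJSON polygon. The sketch below does this with plain Python for the polygon used in this example. `polygon_envelope` is a hypothetical helper that handles only the exterior ring of a single `Polygon`.

```python
def polygon_envelope(feature):
    """Return [west, south, east, north] of a GeoJSON Polygon's exterior ring."""
    exterior_ring = feature["geometry"]["coordinates"][0]
    lngs = [lng for lng, lat in exterior_ring]
    lats = [lat for lng, lat in exterior_ring]
    return [min(lngs), min(lats), max(lngs), max(lats)]


aoi = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [-13.278048, 8.493532],
            [-13.278254, 8.447241],
            [-13.231762, 8.447033],
            [-13.231551, 8.493323],
            [-13.278048, 8.493532],
        ]],
    },
}
envelope = polygon_envelope(aoi)
print(envelope)  # [-13.278254, 8.447033, -13.231551, 8.493532]
```

The envelope equals the `bbox` values used in the bounding box example, which is consistent with both filters reporting 131 assets.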


In [23]:
# shell command to cleanup
! rm -rf sen12floods*

## STAC Catalog Only download

If you want to inspect the STAC catalog and write your own download client for the dataset assets, just pass the `catalog_only` option to the `download` method:

In [24]:
sen12floods.download(catalog_only=True)

sen12floods: fetch stac catalog: 2060KB [00:00, 127903.52KB/s]                                                          
INFO:radiant_mlhub.client.catalog_downloader:unarchive sen12floods.tar.gz...
unarchive sen12floods.tar.gz...: 100%|█████████████████████████████████████████| 22278/22278 [00:01<00:00, 14284.65it/s]
INFO:radiant_mlhub.client.catalog_downloader:catalog saved to /data/ME-1140-dataset-archive-support/radiant-mlhub/examples/sen12floods
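
As a starting point for a custom client, the sketch below walks a directory of unarchived STAC JSON files and lists each item's asset keys and hrefs. It relies only on the STAC convention that items are GeoJSON `Feature` objects with an `assets` mapping. `iter_asset_hrefs` is a hypothetical helper, and the tiny catalog written to a temporary directory is made-up test data, not real MLHub content.

```python
import json
import tempfile
from pathlib import Path


def iter_asset_hrefs(catalog_dir):
    """Yield (item_id, asset_key, href) for every STAC item under catalog_dir."""
    for json_path in sorted(Path(catalog_dir).rglob("*.json")):
        with open(json_path) as f:
            doc = json.load(f)
        if doc.get("type") != "Feature":
            continue  # skip catalog and collection documents
        for asset_key, asset in sorted(doc.get("assets", {}).items()):
            yield doc["id"], asset_key, asset["href"]


with tempfile.TemporaryDirectory() as tmp:
    # Build a minimal fake catalog: one catalog file and one item file.
    item_dir = Path(tmp) / "demo_collection" / "demo_item"
    item_dir.mkdir(parents=True)
    (Path(tmp) / "catalog.json").write_text(
        json.dumps({"type": "Catalog", "id": "demo"}))
    (item_dir / "stac.json").write_text(json.dumps({
        "type": "Feature",
        "id": "demo_item",
        "assets": {"labels": {"href": "https://example.com/labels.geojson"}},
    }))
    found = list(iter_asset_hrefs(tmp))

print(found)
```

Pointing `iter_asset_hrefs` at the directory created by `download(catalog_only=True)` should give you the list of hrefs to fetch, assuming the item JSON files sit beneath that directory.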
