<img src='https://radiant-assets.s3-us-west-2.amazonaws.com/PrimaryRadiantMLHubLogo.png' alt='Radiant MLHub Logo' width='300'/>

# How to use the Radiant MLHub API to browse and download the LandCoverNet dataset

This Jupyter notebook, which you may copy and adapt for any use, shows basic examples of how to use the API to download labels and source imagery for the LandCoverNet dataset. Full documentation for the API is available at [docs.mlhub.earth](http://docs.mlhub.earth).

We'll show you how to set up your authorization, list collection properties, and retrieve the items (the data contained within them) from those collections.

Each item in the collection is described in JSON format that complies with the STAC label extension specification.

## Citation

Alemohammad S.H., Ballantyne A., Bromberg Gaber Y., Booth K., Nakanuku-Diggs L., & Miglarese A.H. (2020) "LandCoverNet: A Global Land Cover Classification Training Dataset", Version 1.0, Radiant MLHub. \[Date Accessed\] [https://doi.org/10.34911/rdnt.d2ce8i](https://doi.org/10.34911/rdnt.d2ce8i)

## Dependencies

This notebook utilizes the [`radiant-mlhub` Python client](https://pypi.org/project/radiant-mlhub/) for interacting with the API. If you are running this notebook using Binder, then this dependency has already been installed. If you are running this notebook locally, you will need to install it yourself.

See the official [`radiant-mlhub` docs](https://radiant-mlhub.readthedocs.io/) for more documentation of the full functionality of that library.

## Authentication

### Create an API Key

Access to the Radiant MLHub API requires an API key. To get your API key, go to [dashboard.mlhub.earth](https://dashboard.mlhub.earth). If you have not used Radiant MLHub before, you will need to sign up and create a new account. Otherwise, sign in. In the **API Keys** tab, you can create one or more API keys. *Do not share* your API key with others: your usage may be limited and sharing your API key is a security risk.

### Configure the Client

Once you have your API key, you need to configure the `radiant_mlhub` library to use that key. There are a number of ways to configure this (see the [Authentication docs](https://radiant-mlhub.readthedocs.io/en/latest/authentication.html) for details). 

For these examples, we will set the `MLHUB_API_KEY` environment variable. Run the cell below to save your API key as an environment variable that the client library will recognize.

*If you are running this notebook locally and have configured a profile as described in the [Authentication docs](https://radiant-mlhub.readthedocs.io/en/latest/authentication.html), then you do not need to execute this cell.*


In [None]:
import os

os.environ['MLHUB_API_KEY'] = 'PASTE_YOUR_API_KEY_HERE'
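
To avoid leaving a key in a saved notebook, one option is to prompt for it only when the environment variable is not already set. The following is a minimal sketch of that idea; the helper name `ensure_api_key` and its `prompt` parameter are illustrative assumptions and are not part of `radiant_mlhub`.

```python
import os
from getpass import getpass


def ensure_api_key(env_var='MLHUB_API_KEY', prompt=getpass):
    """Return the API key from the environment, prompting for it if unset.

    `ensure_api_key` is a hypothetical helper, not part of radiant_mlhub.
    """
    key = os.environ.get(env_var, '').strip()
    if not key:
        # getpass hides the typed value so it does not appear in the notebook output
        key = prompt(f'Enter value for {env_var}: ').strip()
        if not key:
            raise ValueError(f'No value provided for {env_var}')
        os.environ[env_var] = key
    return key
```

Calling `ensure_api_key()` in place of the hard-coded assignment would set `MLHUB_API_KEY` for the rest of the session without storing the key in the notebook file.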

In [None]:
import urllib.parse
import re
from pathlib import Path
import itertools as it
from functools import partial
from concurrent.futures import ThreadPoolExecutor

from tqdm.notebook import tqdm
from radiant_mlhub import client, get_session

## Listing Collection Properties

The following cell makes a request to the API for the properties for the LandCoverNet labels collection and prints out a few important properties.

In [None]:
collection_id = 'ref_landcovernet_v1_labels'

collection = client.get_collection(collection_id)
print(f'Description: {collection["description"]}')
print(f'License: {collection["license"]}')
print(f'DOI: {collection["sci:doi"]}')
print(f'Citation: {collection["sci:citation"]}')
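
Indexing with `collection["sci:doi"]` raises a `KeyError` if a collection does not define that field. As a hedged sketch, the properties can instead be read with `dict.get` and a placeholder. The `describe_collection` helper and the sample dictionary below are illustrative assumptions, not part of the client library.

```python
def describe_collection(collection, keys=('description', 'license', 'sci:doi', 'sci:citation')):
    """Return printable 'key: value' lines, marking keys that are absent."""
    return [f'{key}: {collection.get(key, "<not provided>")}' for key in keys]


# Stand-in dictionary for illustration; a real one would come from
# client.get_collection(collection_id)
sample_collection = {'description': 'Example labels', 'license': 'CC-BY-4.0'}
for line in describe_collection(sample_collection):
    print(line)
```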

## Finding Possible Land Cover Labels

Each label item within the collection has a property which lists all of the possible land cover types and which ones are present in each label item. The code below prints out which land cover types are present in the dataset and we will reference these later in the notebook when we filter downloads.

In [None]:
items = client.list_collection_items(collection_id, limit=1)

first_item = next(items)

label_classes = first_item['properties']['label:classes']
for label_class in label_classes:
    print(f'Classes for {label_class["name"]}')
    for c in sorted(label_class['classes']):
        print(f'- {c}')
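
If the class names are needed as a plain list for later filtering, they can be collected rather than printed. This is a minimal sketch that assumes the `label:classes` structure used in the cell above (a list of dictionaries, each with a `classes` list). The helper name `collect_class_names` and the sample item are hypothetical.

```python
def collect_class_names(item):
    """Return the sorted, de-duplicated class names listed under label:classes."""
    names = set()
    for label_class in item.get('properties', {}).get('label:classes', []):
        names.update(label_class.get('classes', []))
    return sorted(names)


# Illustrative item; 'Example Class' is a placeholder, not a real land cover type
sample_item = {
    'properties': {
        'label:classes': [
            {'name': 'labels', 'classes': ['Woody Vegetation', 'Example Class']},
        ]
    }
}
print(collect_class_names(sample_item))
```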

## Downloading Assets

> **NOTE:** If you are running this notebook using Binder, these resources will be downloaded to the remote file system that the notebook is running on and **not to your local file system.** If you want to download the files to your machine, you will need to clone the repo and run the notebook locally.

### Create Download Helpers

The cell below creates the helper functions that we will use to select items from a collection and download the associated assets (source imagery or labels).

* **`get_items`**

    This is a [Python generator](https://realpython.com/introduction-to-python-generators/) that yields items from the given collection that match the criteria we give it. For instance, the following code will yield up to 10 items from the LandCoverNet labels collection that contain the `'Woody Vegetation'` label. When several classes are given, an item matches if it contains *at least one* of them:
    ```python
    get_items('ref_landcovernet_v1_labels', classes=['Woody Vegetation'], max_items=10)
    ```

* **`download`** 

    This function takes an item dictionary and an asset key and downloads the given asset. By default, the asset is downloaded to the `./data` directory, but this can be changed using the `output_dir` argument.

* **`filter_item`** 

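    This is a helper function used by the `get_items` function to filter items returned by `client.list_collection_items`.

* **`download_labels_and_source`**

    This function takes a label item, follows its `source` links to find the associated source imagery items, and downloads the matching label and source assets in parallel. The optional `assets` argument limits the download to the listed asset keys.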
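
The combination of a lazy filter and a cap on the number of results can be illustrated without calling the API. The sketch below uses a toy generator, `numbered_items`, as a stand-in for a paginated listing. Both it and `take_matching` are hypothetical names used only for illustration.

```python
import itertools as it


def numbered_items():
    """Toy stand-in for a paginated API listing: yields items indefinitely."""
    for i in it.count():
        label = 'even' if i % 2 == 0 else 'odd'
        yield {'id': f'item_{i}', 'properties': {'labels': [label]}}


def take_matching(items, wanted, max_items):
    """Yield up to max_items items whose labels overlap with wanted."""
    matching = filter(
        lambda item: any(label in wanted for label in item['properties']['labels']),
        items,
    )
    # islice stops pulling from the (endless) source once max_items are found
    yield from it.islice(matching, max_items)


first_three = list(take_matching(numbered_items(), wanted=['odd'], max_items=3))
print([item['id'] for item in first_three])  # ['item_1', 'item_3', 'item_5']
```

Because both `filter` and `islice` are lazy, only as many items are requested from the source as are needed to find `max_items` matches.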


In [None]:
items_pattern = re.compile(r'^/mlhub/v1/collections/(\w+)/items/(\w+)$')


def filter_item(item, classes=None, cloud_and_shadow=None, seasonal_snow=None):
    """Function to be used as an argument to Python's built-in filter function that filters out any items that 
    do not match the given classes, cloud_and_shadow, and/or seasonal_snow values.
    
    If any of these filter arguments are set to None, they will be ignored. For instance, using 
    filter_item(item, cloud_and_shadow=True) will only return items where item['properties']['cloud_and_shadow'] == 'true', 
    and will not filter based on classes/labels, or seasonal_snow.
    """
    # Match classes, if provided
    
    item_labels = item['properties'].get('labels', [])
    if classes is not None and not any(label in classes for label in item_labels):
        return False
    
    # Match cloud_and_shadow, if provided
    item_cloud_and_shadow = item['properties'].get('cloud_and_shadow', 'false') == 'true'
    if cloud_and_shadow is not None and item_cloud_and_shadow != cloud_and_shadow:
        return False
    
    # Match seasonal_snow, if provided
    item_seasonal_snow = item['properties'].get('seasonal_snow', 'false') == 'true'
    if seasonal_snow is not None and item_seasonal_snow != seasonal_snow:
        return False
    
    return True


def get_items(collection_id, classes=None, cloud_and_shadow=None, seasonal_snow=None, max_items=1):
    """Generator that yields up to max_items items that match the given classes, cloud_and_shadow, and seasonal_snow 
    values. Setting one of these filter arguments to None will cause that filter to be ignored (e.g. classes=None 
    means that items will not be filtered by class/label).
    """
    filter_fn = partial(
        filter_item, 
        classes=classes, 
        cloud_and_shadow=cloud_and_shadow, 
        seasonal_snow=seasonal_snow
    )
    filtered = filter(
        filter_fn, 

        # Note that we set the limit to None here because we want to limit based on our own filters. It is not 
        #  recommended to use limit=None for the client.list_collection_items method without implementing your 
        #  own limits because a collection may contain many thousands of items (the LandCoverNet source 
        #  imagery has over 150,000) and looping over these items without limit may take a very long time.
        client.list_collection_items(collection_id, limit=None)
    )
    yield from it.islice(filtered, max_items)
    

def download(item, asset_key, output_dir='./data'):
    """Downloads the given item asset by looking up that asset and then following the "href" URL."""

    # Try to get the given asset and return None if it does not exist
    asset = item.get('assets', {}).get(asset_key)
    if asset is None:
        print(f'Asset "{asset_key}" does not exist in this item')
        return None
    
    # Try to get the download URL from the asset and return None if it does not exist
    download_url = asset.get('href')
    if download_url is None:
        print(f'Asset {asset_key} does not have an "href" property, cannot download.')
        return None
    
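    session = get_session()
    r = session.get(download_url, allow_redirects=True, stream=True)
    # Fail early on HTTP errors instead of writing an error response to disk
    r.raise_for_status()
    
    filename = urllib.parse.urlsplit(r.url).path.split('/')[-1]
    output_path = Path(output_dir) / filename
    # Make sure the output directory exists when download() is called directly
    output_path.parent.mkdir(parents=True, exist_ok=True)
    
    with output_path.open('wb') as dst:
        for chunk in r.iter_content(chunk_size=512 * 1024):
            if chunk:
                dst.write(chunk)

    return output_path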
    

def download_labels_and_source(item, assets=None, output_dir='./data'):
    """Downloads all label and source imagery assets associated with a label item that match the given asset types.
    """
    
    # Follow all source links and add all assets from those
    def _get_download_args(link):
        # Get the item ID (last part of the link path)
        source_item_path = urllib.parse.urlsplit(link['href']).path
        source_item_collection, source_item_id = items_pattern.fullmatch(source_item_path).groups()
        source_item = client.get_collection_item(source_item_collection, source_item_id)

        source_download_dir = download_dir / 'source'
        source_download_dir.mkdir(exist_ok=True)
        
        matching_source_assets = [
            asset 
            for asset in source_item.get('assets', {}) 
            if assets is None or asset in assets
        ] 
        return [
            (source_item, asset, source_download_dir) 
            for asset in matching_source_assets
        ]

    
    download_args = []
    
    download_dir = Path(output_dir) / item['id']
    download_dir.mkdir(parents=True, exist_ok=True)
    
    labels_download_dir = download_dir / 'labels'
    labels_download_dir.mkdir(exist_ok=True)

    # Download the labels assets
    matching_assets = [
        asset 
        for asset in item.get('assets', {}) 
        if assets is None or asset in assets
    ]

    for asset in matching_assets:
        download_args.append((item, asset, labels_download_dir))
        
    source_links = [link for link in item['links'] if link['rel'] == 'source']
    
    with ThreadPoolExecutor(max_workers=16) as executor:
        for argument_batch in executor.map(_get_download_args, source_links):
            download_args += argument_batch
        
    print(f'Downloading {len(download_args)} assets...')
    with ThreadPoolExecutor(max_workers=16) as executor:
        with tqdm(total=len(download_args)) as pbar:
            for _ in executor.map(lambda triplet: download(*triplet), download_args):
                pbar.update(1)
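
The two URL-handling steps used by these helpers (extracting the collection and item IDs from a `source` link, and deriving a file name from a download URL) can be tried in isolation. The following is a minimal sketch with hypothetical URLs. `parse_item_link` and `filename_from_url` are illustrative names and, unlike the helper above, `parse_item_link` returns `None` when the path does not match instead of raising.

```python
import re
import urllib.parse

ITEM_PATH_PATTERN = re.compile(r'^/mlhub/v1/collections/(\w+)/items/(\w+)$')


def parse_item_link(href):
    """Return (collection_id, item_id) for an item URL, or None if the path does not match."""
    path = urllib.parse.urlsplit(href).path
    match = ITEM_PATH_PATTERN.fullmatch(path)
    return match.groups() if match else None


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return urllib.parse.urlsplit(url).path.split('/')[-1]


# Hypothetical URLs, used only for illustration
print(parse_item_link('https://example.com/mlhub/v1/collections/some_collection/items/some_item_01'))
print(filename_from_url('https://example.com/bucket/tile_B02.tif?signature=abc'))
```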
    

### Download Assets for 1 Item

The following cell navigates the API, collects the download links for the label and source imagery assets, and downloads them.

In this case we pass the `max_items` argument to the `get_items` function, which limits the number of label items fetched to just 1. We also pass a list of `assets` to the `download_labels_and_source` function, which limits the types of assets downloaded to only those included in the list. We limit the results in these two ways because there are nearly 2,000 label items and over 150,000 source items in the LandCoverNet collections, and each source item contains at least 13 assets representing the various Sentinel-2 bands. Attempting to download all items or all assets for even a few items can take a very long time.

In [None]:
items = get_items(
    collection_id,
    max_items=1
)
for item in items:
    download_labels_and_source(item, assets=['labels', 'B02', 'B03', 'B04'])
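
The helper above writes files to `./data/<item id>/labels` and `./data/<item id>/source`. To check what was downloaded, the directory tree can be summarized afterwards. This is a minimal sketch; `summarize_downloads` is a hypothetical helper that is not part of the notebook's download code.

```python
from collections import Counter
from pathlib import Path


def summarize_downloads(root='./data'):
    """Count downloaded files per '<item id>/<labels|source>' directory under root."""
    root = Path(root)
    counts = Counter()
    if not root.exists():
        return counts
    for path in root.rglob('*'):
        if path.is_file():
            relative = path.relative_to(root)
            # Group by the first two path components, e.g. '<item id>/labels'
            counts['/'.join(relative.parts[:2])] += 1
    return counts


for group, file_count in sorted(summarize_downloads().items()):
    print(f'{group}: {file_count} file(s)')
```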

### Filtering on Land Cover Type

We can specify which land cover types we want to download by adding the `classes` argument. This argument accepts a list of land cover types, and only label items that contain one or more of the specified classes will be downloaded. The possible land cover types are printed by the cell in the "Finding Possible Land Cover Labels" section above.

In [None]:
items = get_items(
    collection_id,
    classes=['Woody Vegetation'],
    max_items=1,
)
for item in items:
    download_labels_and_source(item, assets=['labels', 'B02', 'B03', 'B04'])
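
The `classes` filter matches an item when *any* requested class is present. If items containing *all* requested classes were wanted instead, a stricter check would be needed. The sketch below contrasts the two. `matches_all` is a hypothetical variant that the notebook's helpers do not implement, and the `'Example Class …'` names are placeholders.

```python
def matches_any(item_labels, classes):
    """True if at least one requested class is present (the behaviour described above)."""
    return any(label in classes for label in item_labels)


def matches_all(item_labels, classes):
    """True only if every requested class is present (a stricter, hypothetical variant)."""
    return set(classes).issubset(item_labels)


labels = ['Woody Vegetation', 'Example Class A']
requested = ['Woody Vegetation', 'Example Class B']
print(matches_any(labels, requested))  # True
print(matches_all(labels, requested))  # False
```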

### Download All Assets

Looping through all items and downloading the associated assets may be *very* time-consuming for larger datasets like LandCoverNet. Instead, MLHub provides TAR archives of all collections that can be downloaded using the `/archive/{collection_id}` endpoint. 

The following cell uses the `client.download_archive` function to download the `ref_landcovernet_v1_labels` archive to the `./data` directory.

In [None]:
client.download_archive(collection_id, output_dir='./data')
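
Once the archive has been downloaded, it can be unpacked with Python's standard `tarfile` module. The following is a minimal sketch that assumes you know the path of the downloaded archive file (the exact file name is not shown here). `extract_archive` is a hypothetical helper, and it refuses archive members that would be written outside the output directory.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, output_dir):
    """Extract a TAR archive, refusing members that would land outside output_dir."""
    output_dir = Path(output_dir).resolve()
    output_dir.mkdir(parents=True, exist_ok=True)
    # Mode defaults to 'r', which detects compression (e.g. gzip) automatically
    with tarfile.open(archive_path) as archive:
        for member in archive.getmembers():
            target = (output_dir / member.name).resolve()
            if output_dir != target and output_dir not in target.parents:
                raise ValueError(f'Unsafe path in archive: {member.name}')
        archive.extractall(output_dir)
    return output_dir
```

For example, `extract_archive(path_to_archive, './data')` would unpack the archive next to the individually downloaded items, where `path_to_archive` is whatever file the download step produced.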