# Training Data from Radiant MLHub

<a style="display: inline-block;" href="https://mybinder.org/v2/gh/RadiantMLHub/ml4eo-bootcamp-2021/main?filepath=Lecture%205%2Fexercises%2F4_training_data_from_radiant_mlhub.ipynb"><img src="https://mybinder.org/badge_logo.svg" alt="Launch in Binder"/></a>

Now that we have an understanding of the STAC spec and how it relates to some example training data, we can move on to using the Radiant MLHub API to obtain training datasets.

## Radiant MLHub

> Radiant MLHub is an open library for geospatial training data (and soon machine learning models) to advance machine learning applications on Earth Observations. It serves as a resource for a community of practice, giving data scientists benchmarks they can use to train and validate their models and improve its performance.
>
>  \- https://mlhub.earth

The Radiant MLHub API is a [STAC API-compliant](https://github.com/radiantearth/stac-api-spec) service that enables the search, discovery, and download of geospatial training datasets for use with ML models. MLHub Collections generally fall into two categories: source imagery and labels.

### Label Collections

Label Collections contain assets representing labels that could be used as training or test labels when training a machine learning model. Items in these collections implement the [STAC Label Extension](https://github.com/stac-extensions/label) and may contain vector or raster label assets (or both). Items in a label Collection will have at least one link in their `links` property that refers to the corresponding Item in one of the [Source Imagery Collections](#source-imagery-collections). Label collections may also implement the [STAC Scientific Extension](https://github.com/stac-extensions/scientific) to describe publications associated with the data and how they should be cited.

### Source Imagery Collections

Source imagery Collections contain assets representing source imagery that can be used in conjunction with labels from a label Collection to train a machine learning model. As described in the [Label Collections](#label-collections) section above, each label Item will have one or more links to a source imagery Item associated with the labels.
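The link from a label Item to its source imagery can be followed by filtering the Item's `links` on the `rel` field; the STAC Label Extension uses the `source` relation type for this. The following is a minimal sketch, assuming the Item has already been loaded as a plain dictionary. The Item ID, the `href` values, and the `source_links` helper are made up for illustration and are not part of any library.

```python
from typing import Any, Dict, List


def source_links(label_item: Dict[str, Any]) -> List[str]:
    """Return the hrefs of all links that point at source imagery Items."""
    return [
        link["href"]
        for link in label_item.get("links", [])
        if link.get("rel") == "source"
    ]


# Hypothetical, heavily trimmed label Item (not a real MLHub record).
example_label_item = {
    "type": "Feature",
    "id": "example_labels_0001",
    "links": [
        {"rel": "collection", "href": "../collection.json"},
        {"rel": "source", "href": "../../example_source/example_source_0001/stac.json"},
    ],
}

print(source_links(example_label_item))
```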

### Datasets

The Radiant MLHub API also defines "dataset" objects that collect associated source imagery and label collections into a single object. These "dataset" objects are *not* part of the STAC or STAC API spec. For instance, the `bigearthnet_v1` dataset is composed of 1 source imagery Collection (`bigearthnet_v1_source`) and 1 label Collection (`bigearthnet_v1_labels`). A request to `GET /datasets/bigearthnet_v1` returns the following:

```json
{
    "collections": [
        {
            "id": "bigearthnet_v1_source",
            "types": [
                "source_imagery"
            ]
        },
        {
            "id": "bigearthnet_v1_labels",
            "types": [
                "labels"
            ]
        }
    ],
    "id": "bigearthnet_v1",
    "title": "BigEarthNet"
}
```
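To see how the `types` field separates source imagery from labels, the response above can be grouped by type with the standard library alone. This is only an illustrative sketch: it parses the JSON shown above from a string rather than calling the API, and the variable names are arbitrary.

```python
import json
from collections import defaultdict

# The response body shown above, pasted in as a string.
response_text = """
{
    "collections": [
        {"id": "bigearthnet_v1_source", "types": ["source_imagery"]},
        {"id": "bigearthnet_v1_labels", "types": ["labels"]}
    ],
    "id": "bigearthnet_v1",
    "title": "BigEarthNet"
}
"""

dataset_response = json.loads(response_text)

# Map each collection type to the IDs of the collections that declare it.
collections_by_type = defaultdict(list)
for collection in dataset_response["collections"]:
    for collection_type in collection["types"]:
        collections_by_type[collection_type].append(collection["id"])

print(dict(collections_by_type))
```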

### Collection Archives

While it is possible to crawl all of the STAC Items associated with a Collection and download assets individually, the more typical use case is to download a Collection or dataset in its entirety. To make it easier to download all of the assets and STAC objects for a given Collection, Radiant MLHub provides a single [tarball archive](https://en.wikipedia.org/wiki/Tar_(computing)) download for most Collections. Collection archives can be downloaded from the `GET /archive/{collection_id}` endpoint.

Check out [this blog post](https://medium.com/radiant-earth-insights/archived-training-dataset-downloads-now-available-on-radiant-mlhub-7eb67daf094e) for a more detailed description of the collection archives and how to work with them.
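As a rough illustration of the endpoint pattern, the sketch below only builds the archive URL for a collection ID and does not send a request. The `API_ROOT` value is an assumption that should be checked against the API documentation, `archive_url` is a hypothetical helper, and a real request would also need the API key described in the Authentication section.

```python
from urllib.parse import quote

# Assumption: the public API root; confirm this against the MLHub API docs.
API_ROOT = "https://api.radiant.earth/mlhub/v1"


def archive_url(collection_id: str) -> str:
    """Build the URL for GET /archive/{collection_id} (hypothetical helper)."""
    # Percent-encode the ID so unexpected characters cannot alter the path.
    return f"{API_ROOT}/archive/{quote(collection_id, safe='')}"


print(archive_url("bigearthnet_v1_labels"))
```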

## `radiant-mlhub` Python Client

To make it easier to work with the MLHub API in Python code, Radiant Earth Foundation has released an open-source [Python client](https://radiant-mlhub.readthedocs.io/en/latest/) (`radiant-mlhub`). This client may be installed [from PyPI](https://pypi.org/project/radiant-mlhub/) or using `conda` ([in the `conda-forge` channel](https://anaconda.org/conda-forge/radiant-mlhub)).

Some advantages to using the Python client are:

* Easy authentication configuration
* Convenience methods for making requests using Python
* Ability to work with responses as [PySTAC](https://pystac.readthedocs.io/en/latest/) objects

## Authentication

### Create an API Key

Access to the Radiant MLHub API requires an API key. To get your API key, go to [dashboard.mlhub.earth](https://dashboard.mlhub.earth). If you have not used Radiant MLHub before, you will need to sign up and create a new account. Otherwise, sign in. In the **API Keys** tab, you can create the API key(s) you need. *Do not share* your API key with others: your usage may be limited, and sharing your key is a security risk.

### Configure the Client

Once you have your API key, you need to configure the `radiant_mlhub` library to use that key. There are a number of ways to configure this (see the [Authentication docs](https://radiant-mlhub.readthedocs.io/en/latest/authentication.html) for details). 

For these examples, we will set the `MLHUB_API_KEY` environment variable. Run the cell below to save your API key as an environment variable that the client library will recognize.

*If you are running this notebook locally and have configured a profile as described in the [Authentication docs](https://radiant-mlhub.readthedocs.io/en/latest/authentication.html), then you do not need to execute this cell.*


```python
import os

os.environ['MLHUB_API_KEY'] = 'PASTE_YOUR_API_KEY_HERE'
```
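A common slip is to run the cell with the placeholder text still in place. The check below is a small sketch, not part of the `radiant_mlhub` client: `api_key_is_configured` is a hypothetical helper that only inspects the environment variable, so it does not apply if you authenticate through a profile instead.

```python
import os

PLACEHOLDER = "PASTE_YOUR_API_KEY_HERE"


def api_key_is_configured(environ=os.environ) -> bool:
    """Return True if MLHUB_API_KEY is set to something other than the placeholder."""
    value = environ.get("MLHUB_API_KEY", "").strip()
    return bool(value) and value != PLACEHOLDER


if not api_key_is_configured():
    print("MLHUB_API_KEY is not set; paste your key into the cell above "
          "and re-run it (unless you are using a profile).")
```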

## Find a Dataset

In this exercise, we will use the `radiant_mlhub` client to find a dataset of interest and download the collection archives for that dataset. The [`radiant_mlhub.Dataset`](https://radiant-mlhub.readthedocs.io/en/latest/api/radiant_mlhub.html#radiant_mlhub.models.Dataset) class has some convenient methods for working with datasets and their associated collections.

First, let's import the libraries we'll be working with.

```python
from pathlib import Path
import tarfile

from radiant_mlhub import Dataset

tmp_dir = Path.cwd().parent / 'tmp'
tmp_dir.mkdir(exist_ok=True)
```

We can use the [`Dataset.list`](https://radiant-mlhub.readthedocs.io/en/latest/api/radiant_mlhub.html#radiant_mlhub.models.Dataset.list) method to get a [Python generator](https://realpython.com/introduction-to-python-generators/) of all the Radiant MLHub datasets. We will then loop through these datasets and print the IDs and titles.

```python
for dataset in Dataset.list():
    print(f'{dataset.id: <34} {dataset.title}')
```

```
idiv_asia_crop_type                A crop type dataset for consistent land cover classification in Central Asia
bigearthnet_v1                     BigEarthNet
microsoft_chesapeake               Chesapeake Land Cover
ref_african_crops_kenya_02         CV4A Kenya Crop Type Competition
ref_african_crops_uganda_01        Dalberg Data Insights Crop Type Uganda
rti_rwanda_crop_type               Drone Imagery Classification Training Dataset for Crop Types in Rwanda
ref_african_crops_tanzania_01      Great African Food Company Crop Type Tanzania
landcovernet_v1                    LandCoverNet
open_cities_ai_challenge           Open Cities AI Challenge
ref_african_crops_kenya_01         PlantVillage Crop Type Kenya
su_african_crops_ghana             Semantic Segmentation of Crop Type in Ghana
su_african_crops_south_sudan       Semantic Segmentation of Crop Type in South Sudan
sen12floods                        SEN12-FLOOD
ts_cashew_benin                    Smallholder Cashew Plantations in Ben
```
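Because the listing is a generator, it can be filtered lazily before anything is printed. The sketch below shows the idea with stand-in objects so that it runs without an API key. `fake_dataset_list` and `title_matches` are hypothetical names, and the stand-ins carry only the `id` and `title` attributes used above.

```python
from types import SimpleNamespace


def fake_dataset_list():
    """Stand-in for Dataset.list(): yields objects with `id` and `title`."""
    yield SimpleNamespace(id="bigearthnet_v1", title="BigEarthNet")
    yield SimpleNamespace(id="ref_african_crops_kenya_01", title="PlantVillage Crop Type Kenya")
    yield SimpleNamespace(id="sen12floods", title="SEN12-FLOOD")


def title_matches(datasets, keyword):
    """Lazily yield datasets whose title contains the keyword (case-insensitive)."""
    return (d for d in datasets if keyword.lower() in d.title.lower())


for dataset in title_matches(fake_dataset_list(), "crop"):
    # `: <34` left-aligns the ID and pads it with spaces to 34 characters.
    print(f"{dataset.id: <34} {dataset.title}")
```

With the real client, the same filter could be applied to `Dataset.list()` in place of the stand-in generator.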

For this exercise, let's work with the `su_african_crops_south_sudan` dataset. We can fetch this dataset using the [`Dataset.fetch`](https://radiant-mlhub.readthedocs.io/en/latest/api/radiant_mlhub.html#radiant_mlhub.models.Dataset.fetch) method.

```python
dataset = Dataset.fetch('su_african_crops_south_sudan')
```

Once we have the dataset, we can list the collections associated with it. The `Dataset.collections` property is a custom list-like class that allows you to list source imagery collections, label collections, or all collections associated with the dataset.

```python
# List all collections
print('All Collections\n---------------')
for collection in dataset.collections:
    print(collection.id)
print('')

# List source imagery collections
print('Source Imagery Collections\n---------------')
for collection in dataset.collections.source_imagery:
    print(collection.id)
print('')

# List label collections
print('Label Collections\n---------------')
for collection in dataset.collections.labels:
    print(collection.id)
print('')
```

```
All Collections
---------------
su_african_crops_south_sudan_labels
su_african_crops_south_sudan_source_planet
su_african_crops_south_sudan_source_s1
su_african_crops_south_sudan_source_s2

Source Imagery Collections
---------------
su_african_crops_south_sudan_source_planet
su_african_crops_south_sudan_source_s1
su_african_crops_south_sudan_source_s2

Label Collections
---------------
su_african_crops_south_sudan_labels
```



We can use the [`Dataset.download`](https://radiant-mlhub.readthedocs.io/en/latest/api/radiant_mlhub.html#radiant_mlhub.models.Dataset.download) method to download all of the collection archives associated with this dataset. This is probably the best approach when working with a dataset because it ensures you have all source imagery and labels for that dataset. However, this can take a significant amount of time due to the size of the archives.

For the purposes of this exercise, we will just download the label collection.

```python
label_collection = dataset.collections.labels[0]
archive_path = label_collection.download(tmp_dir)
```

We can now use the `tarfile` package to unpack the tarball archive into our temporary directory.

```python
# Extract the label collection into the directory
with tarfile.open(archive_path) as t:
    t.extractall(path=tmp_dir)
```
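When unpacking any downloaded tarball, it can be worth checking that no member would be written outside the target directory. The following is a minimal sketch of such a check, assuming nothing about the MLHub archives beyond their being tar files. `safe_extract` is a hypothetical helper that only validates member names (it does not inspect symlink targets), and the demo archive is built on the fly in a throwaway directory.

```python
import tarfile
import tempfile
from pathlib import Path


def safe_extract(archive_path: Path, dest: Path) -> None:
    """Extract a tarball, refusing members that would land outside `dest`."""
    dest = dest.resolve()
    with tarfile.open(archive_path) as archive:
        for member in archive.getmembers():
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        archive.extractall(path=dest)


# Demonstration on a tiny archive created in a temporary directory.
with tempfile.TemporaryDirectory() as scratch:
    scratch = Path(scratch)
    (scratch / "collection.json").write_text("{}")
    demo_archive = scratch / "demo.tar.gz"
    with tarfile.open(demo_archive, "w:gz") as archive:
        archive.add(scratch / "collection.json", arcname="demo_labels/collection.json")

    out_dir = scratch / "out"
    out_dir.mkdir()
    safe_extract(demo_archive, out_dir)
    extracted = sorted(p.relative_to(out_dir).as_posix() for p in out_dir.rglob("*"))

print(extracted)
```

In the notebook, the call would look like `safe_extract(archive_path, tmp_dir)`. Recent Python versions also accept a `filter` argument to `extractall` (for example `filter="data"`), which performs similar checks.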
    

```python
!ls ../tmp/su_african_crops_south_sudan_labels | head -n 20
```

```
_common
collection.json
su_african_crops_south_sudan_labels_000000
su_african_crops_south_sudan_labels_000001
su_african_crops_south_sudan_labels_000002
su_african_crops_south_sudan_labels_000003
su_african_crops_south_sudan_labels_000004
su_african_crops_south_sudan_labels_000005
su_african_crops_south_sudan_labels_000006
su_african_crops_south_sudan_labels_000007
su_african_crops_south_sudan_labels_000008
su_african_crops_south_sudan_labels_000009
su_african_crops_south_sudan_labels_000010
su_african_crops_south_sudan_labels_000011
su_african_crops_south_sudan_labels_000012
su_african_crops_south_sudan_labels_000013
su_african_crops_south_sudan_labels_000014
su_african_crops_south_sudan_labels_000015
su_african_crops_south_sudan_labels_000016
su_african_crops_south_sudan_labels_000017
```
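The same listing can be produced from Python, which is handy as a first step towards loading the labels. The sketch below reads the `id` from `collection.json` and collects the item directory names. It assumes that every subdirectory not starting with an underscore (such as `_common`) holds one label Item, and it runs against a small mock directory so that it works without the download. `summarize_collection` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def summarize_collection(collection_dir: Path):
    """Return the Collection ID and the sorted names of the item subdirectories."""
    collection = json.loads((collection_dir / "collection.json").read_text())
    item_names = sorted(
        p.name
        for p in collection_dir.iterdir()
        # Assumption: underscore-prefixed directories hold shared files, not Items.
        if p.is_dir() and not p.name.startswith("_")
    )
    return collection["id"], item_names


# Demonstration on a throwaway directory that mimics the layout listed above.
with tempfile.TemporaryDirectory() as scratch:
    demo_dir = Path(scratch) / "demo_labels"
    (demo_dir / "_common").mkdir(parents=True)
    (demo_dir / "demo_labels_000000").mkdir()
    (demo_dir / "demo_labels_000001").mkdir()
    (demo_dir / "collection.json").write_text(json.dumps({"id": "demo_labels"}))
    collection_id, item_names = summarize_collection(demo_dir)

print(collection_id, len(item_names))
```

For the extracted archive, the equivalent call would be `summarize_collection(tmp_dir / 'su_african_crops_south_sudan_labels')`.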
