<img src='https://radiant-assets.s3-us-west-2.amazonaws.com/PrimaryRadiantMLHubLogo.png' alt='Radiant MLHub Logo' width='300'/>

# Radiant Earth Spot the Crop Challenge
# A Guide to Accessing the Data on Radiant MLHub [Downloading Individual Band Archives]


This notebook walks you through the steps to access Radiant MLHub and download the data for the `Radiant Earth Spot the Crop Challenge`. In this updated notebook, you will learn how to download specific bands of Sentinel-2 data instead of the whole archive, which may be too large to download.

## Radiant MLHub API


The Radiant MLHub API gives access to open Earth imagery training data for machine learning applications. You can learn more about the repository at the [Radiant MLHub site](https://mlhub.earth) and about the organization behind it at the [Radiant Earth Foundation site](https://radiant.earth).

Full documentation for the API is available at [https://mlhub.earth/docs](https://mlhub.earth/docs).

Each item in our collection is described in JSON format compliant with the [STAC](https://stacspec.org/) [label extension](https://github.com/stac-extensions/label) definition.
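
To give a feel for what such an item looks like, here is a minimal sketch that reads a heavily trimmed, hypothetical label item. The `id`, asset key, and `href` values are illustrative assumptions modeled on the fields this notebook reads later (`id`, `assets`, and `links` with `rel` and `href`); real items contain many more fields.

```python
import json

# Hypothetical, heavily trimmed label item. The ID and hrefs are illustrative only.
item_json = '''
{
  "id": "ref_south_africa_crops_competition_v1_test_labels_0590",
  "assets": {
    "field_ids": {"href": "field_ids.tif"}
  },
  "links": [
    {"rel": "collection", "href": "../collection.json"},
    {"rel": "source", "href": "../../source_collection/source_item/stac.json"}
  ]
}
'''

item = json.loads(item_json)

# The tile ID is the last underscore-separated token of the item ID
tile_id = item['id'].split('_')[-1]

# Asset keys name the files attached to the item
asset_keys = list(item['assets'])

# Links with rel == 'source' point at the imagery items the labels were derived from
source_hrefs = [link['href'] for link in item['links'] if link['rel'] == 'source']

print(tile_id, asset_keys, len(source_hrefs))
```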

## Dependencies

This notebook uses the [`radiant-mlhub` Python client](https://pypi.org/project/radiant-mlhub/) to interact with the API and the [`pandas` library](https://pandas.pydata.org/) to organize the metadata. If you are running this notebook using Binder, these dependencies have already been installed. If you are running it locally, you will need to install them yourself.

See the official [`radiant-mlhub` docs](https://radiant-mlhub.readthedocs.io/) for more documentation of the full functionality of that library.

In [1]:
# Required libraries
import os
import tarfile
import json
import pandas as pd
from pathlib import Path
# Note: `_download` is a private helper of radiant_mlhub, so it may change between releases
from radiant_mlhub.client import _download as download_file

os.environ['MLHUB_API_KEY'] = 'N/A'

## Download Options

By editing the cell below, you can choose which bands of the Sentinel-2 imagery to download and whether or not to download the Sentinel-1 data.

In [2]:
DOWNLOAD_S1 = False # If you set this to True, the Sentinel-1 data will be downloaded

# Select which imagery bands you'd like to download here
DOWNLOAD_BANDS = {
    'B01': False,
    'B02': False,
    'B03': True,
    'B04': False,
    'B05': False,
    'B06': False,
    'B07': False,
    'B08': False,
    'B8A': False,
    'B09': False,
    'B11': False,
    'B12': False,
    'CLM': True
}
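
As a rough illustration of what these flags imply, the sketch below lists the archives that would be requested for a given selection. It assumes the archive naming pattern used by the download cell in the next section. `planned_archives` is a hypothetical helper written for this example, not part of `radiant-mlhub`.

```python
FOLDER_BASE = 'ref_south_africa_crops_competition_v1'

def planned_archives(download_bands, download_s1, splits=('train', 'test')):
    """Return the archive file names implied by the download options (hypothetical helper)."""
    archives = []
    for split in splits:
        # Labels are always downloaded
        archives.append(f'{FOLDER_BASE}_{split}_labels.tar.gz')
        # Sentinel-1 is a single archive per split
        if download_s1:
            archives.append(f'{FOLDER_BASE}_{split}_source_s1.tar.gz')
        # Sentinel-2 has one archive per band per split
        for band, wanted in download_bands.items():
            if wanted:
                archives.append(f'{FOLDER_BASE}_{split}_source_s2_{band}.tar.gz')
    return archives

example = planned_archives({'B03': True, 'B04': False, 'CLM': True}, download_s1=False)
for name in example:
    print(name)
```

With the selection shown, each split contributes one labels archive and two band archives.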

Downloading Datasets and Loading Asset File Paths into a Pandas Dataframe
===

The cells below show you how to download the datasets for this competition and read the STAC metadata into pandas dataframes. There are two dataframes, one for train and one for test, which contain the information you need to filter by datetime, satellite platform, and asset type. Each row also contains the file path of the asset it describes. Rows with a `None` value in the `datetime` and `satellite_platform` columns are assets that belong to the label item.

In [3]:
FOLDER_BASE = 'ref_south_africa_crops_competition_v1'

def download_archive(archive_name):
    """Download and extract one archive, skipping it if the extracted folder already exists."""
    if os.path.exists(archive_name.replace('.tar.gz', '')):
        return
    
    print(f'Downloading {archive_name} ...')
    download_url = f'https://radiant-mlhub.s3.us-west-2.amazonaws.com/archives/{archive_name}'
    download_file(download_url, '.')
    print(f'Extracting {archive_name} ...')
    with tarfile.open(archive_name) as tfile:
        tfile.extractall()
    os.remove(archive_name)

for split in ['train', 'test']:
    # Download the labels
    labels_archive = f'{FOLDER_BASE}_{split}_labels.tar.gz'
    download_archive(labels_archive)
    
    # Download Sentinel-1 data
    if DOWNLOAD_S1:
        s1_archive = f'{FOLDER_BASE}_{split}_source_s1.tar.gz'
        download_archive(s1_archive)
        

    # Download the selected Sentinel-2 bands
    for band, download in DOWNLOAD_BANDS.items():
        if not download:
            continue
        s2_archive = f'{FOLDER_BASE}_{split}_source_s2_{band}.tar.gz'
        download_archive(s2_archive)
        
def resolve_path(base, path):
    return Path(os.path.join(base, path)).resolve()
        
def load_df(collection_id):
    """Build a dataframe of asset file paths from a label collection's STAC metadata."""
    split = collection_id.split('_')[-2]
    with open(f'{collection_id}/collection.json', 'r') as f:
        collection = json.load(f)
    rows = []
    item_links = []
    for link in collection['links']:
        if link['rel'] != 'item':
            continue
        item_links.append(link['href'])
        
    for item_link in item_links:
        item_path = f'{collection_id}/{item_link}'
        current_path = os.path.dirname(item_path)
        with open(item_path, 'r') as f:
            item = json.load(f)
        tile_id = item['id'].split('_')[-1]
        for asset_key, asset in item['assets'].items():
            rows.append([
                tile_id,
                None,
                None,
                asset_key,
                str(resolve_path(current_path, asset['href']))
            ])
            
        for link in item['links']:
            if link['rel'] != 'source':
                continue
            source_item_id = link['href'].split('/')[-2]
            
            if source_item_id.find('_s1_') > 0 and not DOWNLOAD_S1:
                continue
            elif source_item_id.find('_s1_') > 0:
                for band in ['VV', 'VH']:
                    asset_path = Path(f'{FOLDER_BASE}_{split}_source_s1/{source_item_id}/{band}.tif').resolve()
                    date = '-'.join(source_item_id.split('_')[10:13])
                    
                    rows.append([
                        tile_id,
                        f'{date}T00:00:00Z',
                        's1',
                        band,
                        str(asset_path)
                    ])
                
            if source_item_id.find('_s2_') > 0:
                for band, download in DOWNLOAD_BANDS.items():
                    if not download:
                        continue
                    
                    asset_path = Path(f'{FOLDER_BASE}_{split}_source_s2_{band}/{source_item_id}_{band}.tif').resolve()
                    date = '-'.join(source_item_id.split('_')[10:13])
                    rows.append([
                        tile_id,
                        f'{date}T00:00:00Z',
                        's2',
                        band,
                        str(asset_path)
                    ])
            
    return pd.DataFrame(rows, columns=['tile_id', 'datetime', 'satellite_platform', 'asset', 'file_path'])

train_df = load_df(f'{FOLDER_BASE}_train_labels')
test_df = load_df(f'{FOLDER_BASE}_test_labels')
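
The cell above derives the acquisition date of each source item from its ID by splitting on underscores and joining tokens 10 to 12. The sketch below isolates that step. The example `href` is a hypothetical value shaped to match what the code expects, and reading the platform and tile ID from tokens 8 and 9 is an assumption made for this illustration, not something `load_df` does.

```python
def describe_source_item(href):
    """Split a source link href into the pieces used by load_df (illustrative helper)."""
    # The item ID is the name of the folder that contains the item's JSON file
    source_item_id = href.split('/')[-2]
    parts = source_item_id.split('_')
    return {
        'source_item_id': source_item_id,
        # Assumption: tokens 8 and 9 hold the platform and the tile ID
        'platform': parts[8],
        'tile_id': parts[9],
        # Tokens 10..12 are year, month, day, as in load_df
        'datetime': '-'.join(parts[10:13]) + 'T00:00:00Z',
    }

# Hypothetical href, modeled on the ID layout the code above relies on
example_href = (
    '../../ref_south_africa_crops_competition_v1_test_source_s2/'
    'ref_south_africa_crops_competition_v1_test_source_s2_0590_2017_04_01/stac.json'
)
info = describe_source_item(example_href)
print(info)
```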

Filter on Asset Types
===
This cell selects the rows of the test dataframe that hold the `field_ids` rasters for the labels.

In [4]:
test_df.loc[test_df['asset'] == 'field_ids']

| | tile_id | datetime | satellite_platform | asset | file_path |
|---|---|---|---|---|---|
| 1 | 0590 | | | field_ids | /Users/kevinbooth/Projects/notebooks/Projects/... |
| 158 | 1026 | | | field_ids | /Users/kevinbooth/Projects/notebooks/Projects/... |
| 257 | 0100 | | | field_ids | /Users/kevinbooth/Projects/notebooks/Projects/... |
| 414 | 0332 | | | field_ids | /Users/kevinbooth/Projects/notebooks/Projects/... |
| 495 | 0756 | | | field_ids | /Users/kevinbooth/Projects/notebooks/Projects/... |
| ... | ... | ... | ... | ... | ... |
| 128503 | 0376 | | | field_ids | /Users/kevinbooth/Projects/notebooks/Projects/... |
| 128660 | 1062 | | | field_ids | /Users/kevinbooth/Projects/notebooks/Projects/... |
| 128817 | 0382 | | | field_ids | /Users/kevinbooth/Projects/notebooks/Projects/... |
| 128898 | 0349 | | | field_ids | /Users/kevinbooth/Projects/notebooks/Projects/... |
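
Once the rows are filtered, a common next step is to look up the raster path for a given tile. The following is a minimal sketch on a toy dataframe with the same columns as `test_df`. The rows and file paths are made up for illustration.

```python
import pandas as pd

# Toy frame with the same columns as test_df; the paths are placeholders
toy_df = pd.DataFrame(
    [
        ['0590', None, None, 'field_ids', '/hypothetical/path/0590/field_ids.tif'],
        ['0590', '2017-04-01T00:00:00Z', 's2', 'B03', '/hypothetical/path/0590_B03.tif'],
        ['1026', None, None, 'field_ids', '/hypothetical/path/1026/field_ids.tif'],
    ],
    columns=['tile_id', 'datetime', 'satellite_platform', 'asset', 'file_path'],
)

# Same filter as the cell above, then map each tile ID to its field_ids raster
field_id_rows = toy_df.loc[toy_df['asset'] == 'field_ids']
field_id_paths = dict(zip(field_id_rows['tile_id'], field_id_rows['file_path']))

print(field_id_paths['0590'])
```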


Filter on Satellite Platform
===
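This cell selects only the assets that belong to the Sentinel-1 source imagery. The result is empty here because `DOWNLOAD_S1` was set to `False` above, so `load_df` skipped the Sentinel-1 items.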

In [5]:
test_df.loc[test_df['satellite_platform'] == 's1']

| | tile_id | datetime | satellite_platform | asset | file_path |
|---|---|---|---|---|---|


Filter on Datetime
===
This cell selects only the assets whose datetime falls within the specified range (start inclusive, end exclusive).

In [6]:
test_df.loc[(test_df['datetime'] >= '2017-04-01T00:00:00+0000') & (test_df['datetime'] < '2017-05-01T00:00:00+0000')]

| | tile_id | datetime | satellite_platform | asset | file_path |
|---|---|---|---|---|---|
| 5 | 0590 | 2017-04-01T00:00:00Z | s2 | B03 | /Users/kevinbooth/Projects/notebooks/Projects/... |
| 6 | 0590 | 2017-04-01T00:00:00Z | s2 | CLM | /Users/kevinbooth/Projects/notebooks/Projects/... |
| 7 | 0590 | 2017-04-04T00:00:00Z | s2 | B03 | /Users/kevinbooth/Projects/notebooks/Projects/... |
| 8 | 0590 | 2017-04-04T00:00:00Z | s2 | CLM | /Users/kevinbooth/Projects/notebooks/Projects/... |
| 9 | 0590 | 2017-04-11T00:00:00Z | s2 | B03 | /Users/kevinbooth/Projects/notebooks/Projects/... |
| ... | ... | ... | ... | ... | ... |
| 128990 | 0947 | 2017-04-14T00:00:00Z | s2 | CLM | /Users/kevinbooth/Projects/notebooks/Projects/... |
| 128991 | 0947 | 2017-04-21T00:00:00Z | s2 | B03 | /Users/kevinbooth/Projects/notebooks/Projects/... |
| 128992 | 0947 | 2017-04-21T00:00:00Z | s2 | CLM | /Users/kevinbooth/Projects/notebooks/Projects/... |
| 128993 | 0947 | 2017-04-24T00:00:00Z | s2 | B03 | /Users/kevinbooth/Projects/notebooks/Projects/... |
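
The filter above compares the `datetime` strings lexicographically, which works because every value shares the same fixed-width format. If you prefer real timestamp comparisons, one option is to parse the column with `pd.to_datetime` first. The sketch below shows this on a toy dataframe with made-up rows; it is an alternative approach, not the method used in this notebook.

```python
import pandas as pd

# Toy frame mirroring the dataframe columns (file_path omitted); values are illustrative
toy_df = pd.DataFrame(
    [
        ['0590', None, None, 'field_ids'],
        ['0590', '2017-04-01T00:00:00Z', 's2', 'B03'],
        ['0590', '2017-04-24T00:00:00Z', 's2', 'CLM'],
        ['0590', '2017-05-01T00:00:00Z', 's2', 'B03'],
    ],
    columns=['tile_id', 'datetime', 'satellite_platform', 'asset'],
)

# Parse the ISO 8601 strings into timezone-aware timestamps; None becomes NaT
parsed = pd.to_datetime(toy_df['datetime'], utc=True)

start = pd.Timestamp('2017-04-01', tz='UTC')
end = pd.Timestamp('2017-05-01', tz='UTC')

# Comparisons against NaT are False, so label rows drop out automatically
april = toy_df.loc[(parsed >= start) & (parsed < end)]
print(april)
```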
