## Ingesting Data

Ingest is the process of transforming ARD tiles into smaller chips of data. To trigger ingest, you HTTP PUT a `source` to the Aardvark REST API; the source contains the URL of the data and a checksum used to verify its integrity.

In [16]:
import requests
import io
import string
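
The checksum lets the ingest system confirm that the archive it downloaded matches the manifest. The manifest used later in this notebook is an `.md5_list` file, so the sketch below assumes MD5. `md5_hexdigest` is a hypothetical helper for checking a local copy of an archive; it is not part of the Aardvark API.

```python
import hashlib

def md5_hexdigest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the returned string with the checksum in the manifest tells you whether a local copy is intact.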

## Building a Collection of Sources

ARD tiles are organized by area using an HXXVXX nomenclature. Each tile directory should contain a text file (a manifest) with one or more lines, each giving a checksum and a file name for an archive in that directory. The entries in a manifest are transformed into a collection of dictionaries, each with an ID, a URL, and a checksum, using the `source` and `get_manifest` functions.

In [17]:
def source(base_url, line):
    """Transform a line into a dict with source attributes"""
    checksum, file = line.split("\t")
    file_url = base_url + file.strip()
    return {'id': file.strip(), 'uri': file_url, 'checksum': checksum }

def get_manifest(manifest_url, base_url):
    """Get a manifest and transform it into a list of source dicts"""
    res = requests.get(manifest_url)
    buffer = io.StringIO(res.text)
    return [source(base_url, line) for line in buffer]

def put_source(base_url, source):
    """Put a source, triggering ingest"""
    url = base_url.format(**source)
    return requests.put(url, source)
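
To see what `source` produces without making a request, the sketch below applies the same transformation to an in-memory manifest. The manifest text and base URL are made up for illustration. `parse_manifest` is a hypothetical variant that splits on any whitespace and skips blank lines, on the assumption that some manifests separate columns with spaces rather than a tab.

```python
import io

def parse_manifest(text, base_url):
    """Turn manifest text (checksum, then file name per line) into source dicts."""
    sources = []
    for line in io.StringIO(text):
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline
        checksum, name = line.split(None, 1)
        name = name.strip()
        sources.append({'id': name, 'uri': base_url + name, 'checksum': checksum})
    return sources

# Hypothetical manifest content; real checksums and file names will differ.
example = "0123456789abcdef0123456789abcdef\texample_SR.tar\n\n"
example_sources = parse_manifest(example, "https://example.com/h04v03/")
```

Each resulting dict has the same `id`, `uri`, and `checksum` keys that `put_source` expects.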

The `get_manifest` and `put_source` functions are used together to trigger ingest. Change the URL to the manifest and the URL to the Aardvark REST API as needed.

In [18]:
# This is a text file that contains a checksum and file name on each line.
manifest_url = "https://edclpdsftp.cr.usgs.gov/downloads/collections/tiles-l2-20170427/h04v03/h04v03.md5_list"

# This is the URL that can be used with each file name to produce an absolute URL
# to the archive. The system that performs ingest needs the full URL so that it
# can download data.
base_url = "https://edclpdsftp.cr.usgs.gov/downloads/collections/tiles-l2-20170427/h04v03/"

# Produce a list of sources.
sources = get_manifest(manifest_url, base_url)

Now, you can iterate over each source to trigger ingest. This makes a single HTTP PUT request for each item, so you may have to be patient when you are putting thousands of sources.

In [20]:
# This is the URL to the network resource used to trigger ingest. Notice that you
# are defining a new network resource for the source you want to ingest.
SOURCE_URL = "http://lcmap-test.cr.usgs.gov/v1/landsat/source/{id}"

# This is commented out so that you don't make a large number of requests
# accidentally. If you actually want to trigger ingest, uncomment it
# and evaluate this cell.
# results = [put_source(SOURCE_URL, source) for source in sources]
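
Before putting thousands of sources, it can help to try a handful and check the HTTP status codes. The sketch below is one way to do that. `put_some` is a hypothetical helper, and it takes the request function as an argument so that it can be exercised without touching the network. Treating any status of 400 or above as a failure is an assumption about the API.

```python
def put_some(sources, put, limit=5):
    """PUT at most `limit` sources; return (id, status_code) pairs for failures."""
    failures = []
    for src in sources[:limit]:
        res = put(src)
        if res.status_code >= 400:  # assumed to indicate a rejected source
            failures.append((src['id'], res.status_code))
    return failures

# With the functions defined above, usage might look like:
# failures = put_some(sources, lambda s: put_source(SOURCE_URL, s))
```

An empty list of failures suggests it is reasonable to continue with the remaining sources.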

## Checking Progress

The progress of ingest for a single scene can be obtained by performing an HTTP GET request. The response contains a list of `progress` entries sorted from oldest to newest that describe what has happened so far. You can use the last entry to count the number of missing/pending/started/finished/failed sources.

In [21]:
def get_source(base_url, source):
    """Get progress information about the source."""
    url = base_url.format(**source)
    return requests.get(url, source)

In [22]:
# This makes one HTTP GET request per source. For large lists, this can
# take a while!
saved_sources = [get_source(SOURCE_URL, source) for source in sources]

`saved_sources` contains a list of HTTP responses that need to be converted from JSON into something usable.

In [23]:
progress = [res.json() for res in saved_sources]
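
If a source has not been put yet, the response may not contain JSON, in which case `.json()` raises an error. A more defensive conversion, assuming that an unparseable body should be treated as an empty progress list, might look like this (`progress_or_empty` is a hypothetical helper):

```python
def progress_or_empty(res):
    """Return the decoded JSON body of a response, or [] if it is not valid JSON."""
    try:
        return res.json()
    except ValueError:  # requests raises a ValueError subclass for invalid JSON
        return []

# Usage with the responses collected above might look like:
# progress = [progress_or_empty(res) for res in saved_sources]
```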

You can list the most recent activity for the first ten sources like so:

In [25]:
some_progress = [p for p in progress if p]
source_last_activity = [(ps[-1]['id'], ps[-1]['progress_name']) for ps in some_progress]
source_last_activity[0:10]

[('LC08_CU_004003_20130320_20170426_C01_V01_BT.tar', 'scene-finish'),
 ('LC08_CU_004003_20130320_20170426_C01_V01_QA.tar', 'scene-finish'),
 ('LC08_CU_004003_20130320_20170426_C01_V01_SR.tar', 'scene-finish'),
 ('LC08_CU_004003_20130320_20170426_C01_V01_TA.tar', 'scene-finish'),
 ('LC08_CU_004003_20130325_20170426_C01_V01_BT.tar', 'scene-finish'),
 ('LC08_CU_004003_20130325_20170426_C01_V01_QA.tar', 'scene-finish'),
 ('LC08_CU_004003_20130325_20170426_C01_V01_SR.tar', 'scene-finish'),
 ('LC08_CU_004003_20130325_20170426_C01_V01_TA.tar', 'scene-finish'),
 ('LC08_CU_004003_20130404_20170426_C01_V01_BT.tar', 'scene-finish'),
 ('LC08_CU_004003_20130404_20170426_C01_V01_QA.tar', 'scene-finish')]
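
Counting the most recent `progress_name` per source gives a quick summary of how far ingest has gotten. The sketch below uses a small hand-made list in the same shape as the responses above; the ids and the `scene-start` name are illustrative assumptions, and `summarize` is a hypothetical helper.

```python
from collections import Counter

def summarize(progress):
    """Count sources by the name of their most recent progress entry."""
    return Counter(ps[-1]['progress_name'] for ps in progress if ps)

# Hand-made sample; 'scene-start' is a hypothetical progress name.
sample_progress = [
    [{'id': 'a.tar', 'progress_name': 'scene-start'},
     {'id': 'a.tar', 'progress_name': 'scene-finish'}],
    [{'id': 'b.tar', 'progress_name': 'scene-start'}],
    [],  # no activity recorded yet; skipped by the `if ps` filter
]
summary = summarize(sample_progress)
```

Sources with an empty progress list are left out of the count, matching the filter used for `some_progress`.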

## Retrieve Ingested Data

Once you have ingested sources, you may also want to verify that you are able to obtain results. You specify an x/y coordinate, a UBID, and an ISO 8601 time range.

In [163]:
# Notice that the x/y is slightly offset from the tile; this is done intentionally
# to confirm that the REST API properly finds the correct tile containing the point.
x, y, ubid, acquired = -1806583, 2999803, "LANDSAT_7/ETM/SRB2", '1980-01-01/2020-01-01'
params = {'ubid': ubid, 'x': x, 'y': y, 'acquired': acquired}
chips = requests.get("http://lcmap-test.cr.usgs.gov/v1/landsat/chips", params).json()
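
The `acquired` parameter is a pair of ISO 8601 dates separated by a slash. If you build the range from `date` objects rather than typing it, a small helper such as the hypothetical `acquired_range` below keeps the format consistent.

```python
from datetime import date

def acquired_range(start, end):
    """Format two dates as an ISO 8601 interval, e.g. '1980-01-01/2020-01-01'."""
    if start > end:
        raise ValueError("start must not be after end")
    return "{}/{}".format(start.isoformat(), end.isoformat())

example_acquired = acquired_range(date(1980, 1, 1), date(2020, 1, 1))
```

The result can be passed as the `acquired` value in `params`.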

Once you have retrieved chips, you can check the acquisition date and source. To work with chip data, see the related tutorial.

In [164]:
[(c['x'],c['y'],c['acquired']) for c in chips[0:10]]

[(-1806585, 2999805, '1999-07-14T18:41:52Z'),
 (-1806585, 2999805, '1999-07-30T18:41:57Z'),
 (-1806585, 2999805, '1999-08-15T18:41:56Z'),
 (-1806585, 2999805, '1999-08-31T18:42:01Z'),
 (-1806585, 2999805, '1999-09-16T18:41:54Z'),
 (-1806585, 2999805, '1999-10-02T18:42:06Z'),
 (-1806585, 2999805, '1999-10-18T18:41:59Z'),
 (-1806585, 2999805, '1999-11-03T18:41:56Z'),
 (-1806585, 2999805, '1999-12-21T18:41:58Z'),
 (-1806585, 2999805, '2000-01-22T18:41:52Z')]
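
The `acquired` values are UTC timestamps, so they can be parsed with the standard library to, for example, count chips per year. The sketch below uses three of the values shown above; `chips_per_year` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime

def chips_per_year(chips):
    """Count chips by acquisition year, parsing 'YYYY-MM-DDTHH:MM:SSZ' strings."""
    years = (datetime.strptime(c['acquired'], '%Y-%m-%dT%H:%M:%SZ').year
             for c in chips)
    return Counter(years)

# Three entries copied from the output above.
sample_chips = [
    {'x': -1806585, 'y': 2999805, 'acquired': '1999-11-03T18:41:56Z'},
    {'x': -1806585, 'y': 2999805, 'acquired': '1999-12-21T18:41:58Z'},
    {'x': -1806585, 'y': 2999805, 'acquired': '2000-01-22T18:41:52Z'},
]
per_year = chips_per_year(sample_chips)
```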