# Indexing data from Amazon (AWS S3)
This notebook explains how to index free, publicly accessible data stored in Amazon S3 buckets. Geospatial data stored in the cloud can be indexed as long as the machine running the datacube has internet access. Most of the data in these S3 buckets is free to use, and the requester does not even need an AWS account.

Indexing from cloud storage is encouraged because it minimises the user's disk usage and, in most cases, gives access to the datasets of a specific product for any time period and any geographical location.
## Description
The topics covered in this notebook include
* [Prerequisites for Cloud Indexing](#Prerequisites)
* [Data on Amazon S3](#Data-on-Amazon-S3)
* [S3 Indexing Process](#S3-Indexing-Process)
* [Recommended Next Steps](#Recommended-Next-Steps)

**Note:** *The commands are meant to be run in a command-line interface (such as a terminal in Linux). In a JupyterHub notebook, you can run them by placing a `!` before the command:*

`! <command>`
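
Outside a notebook, or when the output needs to be captured in a Python variable, the same commands can be run with the standard library's `subprocess` module. The sketch below shows one way to do that. The helper name `run_command` is made up for illustration, and the command it runs is a harmless call to the current Python interpreter rather than an ODC tool.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments and return its standard output.

    `run_command` is a hypothetical helper, not part of ODC or Jupyter.
    """
    # check=True raises CalledProcessError if the command exits with a non-zero status.
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


# Harmless stand-in for a real command such as ["s3-to-dc", "--help"].
print(run_command([sys.executable, "-c", "print('hello from a subprocess')"]))
```

Passing the command as a list of arguments avoids shell quoting issues, which matters once arguments contain wildcard characters.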

## Prerequisites
Users can store, index, and retrieve data from cloud object stores, such as Amazon S3 buckets, using the Open Data Cube (ODC). There are a few additional requirements, outlined below. With the help of the Python package [odc-apps-dc-tools](https://github.com/opendatacube/odc-tools/tree/develop/apps/dc_tools), users can index data available from S3 buckets.

The Python package `odc-apps-dc-tools` can be installed using the command:

`pip install odc-apps-dc-tools`

***Note: This package has already been installed in the JupyterHub environment you are currently using***

## Data on Amazon S3
For data available in S3 buckets, check out the following links:
* [Registry of Open Data on AWS](https://registry.opendata.aws/)
* [Digital Earth Australia - Public Data](https://data.dea.ga.gov.au/?prefix=)
* [Digital Earth Africa](https://explorer.dev.digitalearth.africa/products/alos_palsar_mosaic/extents)

## The `s3-to-dc` command
The `s3-to-dc` command is used to index the datasets available in S3 buckets. Use the command with the following syntax:

`s3-to-dc [OPTIONS] [URL] [PRODUCT NAME]`
 * The URL points to the dataset metadata documents (in YAML or JSON format) in cloud storage
 * The PRODUCT NAME refers to the name of the product the datasets belong to
 * The `s3-to-dc` command-line tool lets you specify additional options if required
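
To make the argument order concrete, the following sketch assembles an `s3-to-dc` command line from its three parts and quotes it for the shell. The helper `build_s3_to_dc_command` is hypothetical and exists only for this illustration. The URL, product name, and `--no-sign-request` option are the ones used in the indexing example later in this notebook.

```python
import shlex


def build_s3_to_dc_command(uri, product, options=()):
    """Return the s3-to-dc command line as a single shell-safe string.

    `build_s3_to_dc_command` is a hypothetical helper for illustration only.
    """
    args = ["s3-to-dc", *options, uri, product]
    # shlex.join quotes arguments containing shell-special characters such as `*`,
    # so the pattern reaches s3-to-dc instead of being expanded by the shell.
    return shlex.join(args)


command = build_s3_to_dc_command(
    "s3://dea-public-data/geomedian-australia/v2.1.0/L8/**/*.yaml",
    "ls8_nbart_geomedian_annual",
    options=["--no-sign-request"],
)
print(command)
```

The quoting is the reason the URL is wrapped in single quotes when the command is typed by hand: without it, the shell may try to expand the `*` characters itself.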

##### The cell below runs the help command on the s3-to-dc app

In [1]:
! s3-to-dc --help

Usage: s3-to-dc [OPTIONS] URI PRODUCT

  Iterate through files in an S3 bucket and add them to datacube

Options:
  --skip-lineage                  Default is not to skip lineage. Set to skip
                                  lineage altogether.

  --fail-on-missing-lineage / --auto-add-lineage
                                  Default is to fail if lineage documents not
                                  present in the database. Set auto add to try
                                  to index lineage documents.

  --verify-lineage                Default is no verification. Set to verify
                                  parent dataset definitions.

  --stac                          Expect STAC 1.0 metadata and attempt to
                                  transform to ODC EO3 metadata.

  --absolute                      Use absolute paths from the STAC document.
  --update                        If set, update instead of add datasets.
  --update-if-exists              If the dataset or pr

***Note***:
* *Use the `--stac` option if the metadata on the S3 bucket is in STAC format*
* *Accessing requester-pays buckets incurs charges, which are billed to the requester's AWS account*

## S3 Indexing Process
For this example we will be indexing Digital Earth Australia’s public data bucket, which you can browse at [data.dea.ga.gov.au](https://data.dea.ga.gov.au/).

Run the two commands below. The first adds the product definition for the Landsat Geomedian product, and the second adds all of the Geomedian datasets. This will take some time, but it adds a continental-scale product to the datacube.

1. `datacube product add https://data.dea.ga.gov.au/geomedian-australia/v2.1.0/product-definition.yaml`

2. `s3-to-dc --no-sign-request 's3://dea-public-data/geomedian-australia/v2.1.0/L8/**/*.yaml' ls8_nbart_geomedian_annual`

##### Indexing the product definition using the `datacube` command

In [None]:
! datacube product add https://data.dea.ga.gov.au/geomedian-australia/v2.1.0/product-definition.yaml

##### Indexing the datasets using the `s3-to-dc` command available through the [odc-apps-dc-tools](https://github.com/opendatacube/odc-tools/tree/develop/apps/dc_tools) package

In [None]:
! s3-to-dc --no-sign-request 's3://dea-public-data/geomedian-australia/v2.1.0/L8/**/*.yaml' ls8_nbart_geomedian_annual
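
To see which part of the bucket the command above targets, it can help to read the URI as a bucket name, a fixed key prefix, and a wildcard pattern. The sketch below splits the URI that way using plain string handling. The helper `split_s3_glob` is hypothetical and only illustrates how to read the argument; it makes no claim about how `s3-to-dc` parses it internally.

```python
def split_s3_glob(uri):
    """Split an s3:// glob into (bucket, fixed key prefix, wildcard pattern).

    Hypothetical helper for illustration; not part of odc-apps-dc-tools.
    """
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    bucket, _, key = uri[len(scheme):].partition("/")
    segments = key.split("/")
    # The fixed prefix is every leading path segment that contains no wildcard.
    first_wild = next(
        (i for i, seg in enumerate(segments) if any(c in seg for c in "*?[")),
        len(segments),
    )
    prefix = "/".join(segments[:first_wild])
    pattern = "/".join(segments[first_wild:])
    return bucket, prefix, pattern


bucket, prefix, pattern = split_s3_glob(
    "s3://dea-public-data/geomedian-australia/v2.1.0/L8/**/*.yaml"
)
print(bucket)   # dea-public-data
print(prefix)   # geomedian-australia/v2.1.0/L8
print(pattern)  # **/*.yaml
```

Assuming `**` has its usual glob meaning of "any number of nested folders", the pattern is expected to pick up every `.yaml` document beneath the `L8` prefix.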

## Recommended Next Steps
Loading datasets and plotting satellite images come after indexing. We therefore recommend working through the indexing notebooks to understand the steps involved and the different sources from which data can be indexed. The links below open the respective notebooks.

1. [Introduction to ODC Indexing](01_Introduction_to_ODC_Indexing.ipynb)
2. [Indexing Product Definition](02_Indexing_Product_Definition.ipynb)
3. [Indexing from Local File System](03_Indexing_from_Local_File_System.ipynb)
4. **Indexing from Amazon - AWS S3 (This Notebook)**
5. [Indexing using STAC](05_Indexing_using_STAC.ipynb)