# Indexing New Data
This notebook mainly covers the ODC indexing process.

ODC uses the concepts of *Product* and *Dataset* to represent the data indexed in its catalog. *Products* are collections of datasets that share the same set of measurements and some subset of metadata. A *Dataset* is the smallest independently described, inventoried and managed aggregation of data. Datasets are usually scenes stored in files that together make up a *Product*.

## Description
The topics covered in this notebook include:
* [What are Products and Product definitions?](#Products)
* [What are Datasets?](#Datasets)
* [Indexing in ODC](#Indexing)

## Products
Products are collections of `datasets` that share the same set of measurements and some subset of metadata.
## Product Definition
The Data Cube can handle many different types of data, and requires a bit of information up front to know what to do with them. This is the task of a Product Definition.

A *Product Definition* provides a short name, a description, some basic source metadata and (optionally) a list of measurements describing the type of data that will be contained in the Datasets of its type. In Landsat Surface Reflectance, for example, the measurements are the list of bands.

The measurements are an ordered list in which each entry specifies a name and some aliases, a data type (dtype), and some optional extras, including the units of the measurement, a nodata value, and even bit-level descriptions or, for reflectance data, the spectral response.

A typical product definition file would have the following structure:

```
name: product_name
description: A short description of the product
metadata_type: type of metadata

metadata:
    product:
        name: product_name

measurements:
    - name: '<measurement_name>'
      aliases: [band_4, sr_band4]
      dtype: <datatype>
      nodata: <no_data_value>
      units: '<unit_of_measurement>'
```
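
As a rough illustration of the structure above, the sketch below checks a product definition that has already been loaded into a Python dictionary. The helper `check_product_definition` is hypothetical, and treating the four top-level keys of the template as required is an assumption of this sketch; the ODC schema remains the authoritative reference.

```python
# Top-level keys taken from the template above; treating them as required
# is an assumption of this sketch, not the official ODC schema.
REQUIRED_TOP_LEVEL = ("name", "description", "metadata_type", "metadata")


def check_product_definition(doc):
    """Return a list of human-readable problems found in a product definition dict."""
    problems = [f"missing key: {key}" for key in REQUIRED_TOP_LEVEL if key not in doc]

    seen = set()
    # Measurements are optional, so an absent list is not reported.
    for measurement in doc.get("measurements", []):
        name = measurement.get("name")
        if name is None:
            problems.append("measurement without a name")
            continue
        if "dtype" not in measurement:
            problems.append(f"measurement {name!r} has no dtype")
        # A name or alias used twice would make band lookups ambiguous.
        for label in [name, *measurement.get("aliases", [])]:
            if label in seen:
                problems.append(f"duplicate name or alias: {label}")
            seen.add(label)
    return problems


example = {
    "name": "landsat8_example_product",
    "description": "Landsat 8 example product",
    "metadata_type": "eo3",
    "metadata": {"product": {"name": "landsat8_example_product"}},
    "measurements": [
        {"name": "red", "aliases": ["band_4", "sr_band4"], "dtype": "int16",
         "nodata": -9999, "units": "reflectance"},
    ],
}
print(check_product_definition(example))  # an empty list means no problems were found
```

Returning a list of problems rather than raising on the first one makes it easier to fix a hand-written definition in a single pass.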

### Example Product definition:
```
name: landsat8_example_product
description: Landsat 8 example product
metadata_type: eo3

metadata:
    product:
        name: landsat8_example_product
    # Alternatively, include specific items to match
    # properties:
        # eo:instrument: OLI_TIRS
        # eo:platform: landsat-8

measurements:
    - name: 'red'
      aliases: [band_4, sr_band4]
      dtype: int16
      nodata: -9999
      units: 'reflectance'
    - name: 'blue'
      aliases: [band_2, sr_band2]
      dtype: int16
      nodata: -9999
      units: 'reflectance'
    - name: 'pixel_qa'
      aliases: [pixel_quality, level2_qa]
      dtype: uint16
      nodata: 1
      units: 'bit_index'
      flags_definition:
        pixel_qa:
          bits: [0,1,2,3,4,5,6,7,8,9,10,11]
          description: Level 2 pixel quality band 
          values:
            1: Fill
            2: Clear
            4: Water
            8: Cloud shadow
            16: Snow
            32: Cloud
            64: Cloud Confidence Low Bit
            128: Cloud Confidence High Bit
            256: Cirrus Confidence Low Bit
            512: Cirrus Confidence High Bit
            1024: Terrain Occlusion
            2048: Unused  # Be careful of repeated names which could confuse the masking code
        # Alternatively or additionally, use the bit on/off method
        fill:
          bits: 0
          description: No data
          values: {0: false, 1: true}
        clear:
          bits: 1
          description: Clear
          values: {0: no_clear_land, 1: clear_land}
        # ...
        cloud_confidence:
          bits: [6, 7]
          description: Cloud confidence
          values: {0: none, 1: low, 2: medium, 3: high}
```
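
To make the bit on/off entries of `flags_definition` more concrete, here is a minimal sketch of how a raw `pixel_qa` value could be decoded by hand. It assumes that a multi-bit flag is read as the integer formed by the listed bits, with the lowest bit as the least significant one. The helpers `decode_flag` and `describe_pixel` are hypothetical and are not the masking code shipped with ODC.

```python
def decode_flag(value, bits):
    """Extract the integer stored in the given bit positions of `value`."""
    if isinstance(bits, int):
        bits = [bits]
    result = 0
    # The lowest listed bit becomes the least significant bit of the result.
    for position, bit in enumerate(sorted(bits)):
        result |= ((value >> bit) & 1) << position
    return result


# Subset of the flags_definition from the example product definition.
FLAGS = {
    "fill": {"bits": 0, "values": {0: False, 1: True}},
    "clear": {"bits": 1, "values": {0: "no_clear_land", 1: "clear_land"}},
    "cloud_confidence": {
        "bits": [6, 7],
        "values": {0: "none", 1: "low", 2: "medium", 3: "high"},
    },
}


def describe_pixel(value, flags=FLAGS):
    """Map a raw pixel_qa value to the label declared for each flag."""
    return {
        name: spec["values"][decode_flag(value, spec["bits"])]
        for name, spec in flags.items()
    }


# 66 = 0b01000010: bit 1 (clear) and bit 6 (low bit of cloud confidence) are set.
print(describe_pixel(66))
# {'fill': False, 'clear': 'clear_land', 'cloud_confidence': 'low'}
```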

* *To learn more about product definitions, visit the [ODC documentation](https://opendatacube.readthedocs.io/en/latest/about-core-concepts/products.html); to learn more about the keywords used in product definitions, visit the [Product Definition API](https://opendatacube.readthedocs.io/en/latest/about-core-concepts/products.html#product-definition-api).*

## Datasets
Datasets are a fundamental part of the Open Data Cube. A dataset is “The smallest aggregation of data independently described, inventoried, and managed.”

A dataset refers to the data for a particular geographic location (usually represented by longitude (x) and latitude (y)) at a particular time, captured by a platform (*for example, Landsat 8 or Sentinel-1*).

Dataset metadata documents define critical metadata about a dataset including:
* Available data measurements
* Platform and sensor names
* Geospatial extents and projection
* Acquisition time
* Provenance information

A typical dataset document would have the following structure:
```
# UUID of the dataset
id: f884df9b-4458-47fd-a9d2-1a52a2db8a1a
$schema: 'https://schemas.opendatacube.org/dataset'

# Product name
product:
  name: landsat8_example_product

# Native CRS, assumed to be the same across all bands
crs: "epsg:32660"

# Optional GeoJSON object in the units of native CRS.
# Defines a polygon such that all valid pixels across all bands
# are inside this polygon.
geometry:
  type: Polygon
  coordinates: [[..]]

# Mapping name:str -> { shape:     Tuple[ny: int, nx: int]
#                       transform: Tuple[float x 9]}
# Captures image size, and geo-registration
grids:
    default:  # "default" grid must be present
       shape: [7811, 7691]
       transform: [30, 0, 618285, 0, -30, -1642485, 0, 0, 1]
    pan:  # Landsat Panchromatic band is higher res image than other bands
       shape: [15621, 15381]
       transform: [15, 0, 618292.5, 0, -15, -1642492.5, 0, 0, 1]

# Per band storage information and references into `grids`
# Bands using the "default" grid should not need to reference it
measurements:
   pan:               # Band using non-default "pan" grid
     grid: "pan"      # should match the name used in `grids` mapping above
     path: "pan.tif"
   red:               # Band using "default" grid should omit `grid` key
     path: red.tif    # Path relative to the dataset location
   blue:
     path: blue.tif
   multiband_example:
     path: multi_band.tif
     band: 2          # int: 1-based index into multi-band file
   netcdf_example:    # just example, mixing TIFF and netcdf in one product is not recommended
     path: some.nc
     layer: some_var  # str: netcdf variable to read

# optional dataset location (useful for public datasets)
location: https://landsatonaws.com/L8/099/072/LC08_L1GT_099072_20200523_20200523_01_RT/metadata.yaml

# Dataset properties, prefer STAC standard names here
# Timestamp is the only compulsory field here
properties:
  eo:platform: landsat-8
  eo:instrument: OLI_TIRS

  # If it's a single time instance use datetime
  datetime: 2020-01-01T07:02:54.188Z  # Use UTC

  # When recording time range use dtr:{start,end}_datetime
  dtr:start_datetime: 2020-01-01T07:02:02.233Z
  dtr:end_datetime:   2020-01-01T07:03:04.397Z

  # ODC specific "extensions"
  odc:processing_datetime: 2020-02-02T08:10:00.000Z

  odc:file_format: GeoTIFF
  odc:region_code: "074071"   # provider specific unique identifier for the same location
                              # for Landsat '{:03d}{:03d}'.format(path, row)

  dea:dataset_maturity: final # one of: final| interim| nrt (near real time)
  odc:product_family: ard     # can be useful for larger installations

# Lineage only references UUIDs of direct source datasets
# Mapping name:str -> [UUID]
lineage: {}  # set to empty object if no lineage is defined
```
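
The `grids` entries in the document above pair a `shape` with a `transform`. As a rough illustration, the sketch below assumes the nine numbers form a row-major 3x3 affine matrix that maps (column, row) pixel coordinates to coordinates in the native CRS, and uses it to derive the bounding box of a grid. `grid_bounds` is a hypothetical helper, not part of ODC.

```python
def grid_bounds(shape, transform):
    """Return (left, bottom, right, top) in CRS units for one entry of `grids`."""
    ny, nx = shape  # shape is (rows, columns)
    # Row-major 3x3 affine matrix; the last row is [0, 0, 1] and is ignored here.
    a, b, c, d, e, f = transform[:6]

    def to_world(col, row):
        return (a * col + b * row + c, d * col + e * row + f)

    # Transform the four outer pixel corners and take their extremes.
    corners = [to_world(col, row) for col in (0, nx) for row in (0, ny)]
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


default_grid = {
    "shape": [7811, 7691],
    "transform": [30, 0, 618285, 0, -30, -1642485, 0, 0, 1],
}
print(grid_bounds(default_grid["shape"], default_grid["transform"]))
# (618285, -1876815, 849015, -1642485)
```

The negative fifth coefficient reflects that row numbers increase downwards while northing decreases, which is why the top edge comes from row 0.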

To learn more about dataset documents, visit [Open Data Cube Read the Docs](https://opendatacube.readthedocs.io/en/latest/about-core-concepts/dataset-documents.html).

## Indexing
Before geospatial data can be loaded in Jupyter notebooks, it needs to be indexed in the Data Cube. ODC uses a Postgres database to store product definitions and the locations of datasets. The data indexing process in ODC consists of three main steps:
1. **Indexing Product Definition:** A *Product* must be registered from its metadata. 
2. **Dataset Document Preparation:** In the second step, the metadata is extracted from each file (*Dataset*) that will be linked to the *Product*. To automate this process, ODC provides metadata extraction scripts for the following instruments/sensors: Landsat-5/7/8, Sentinel-1/2, ALOS-1/2, ASTER Digital Elevation Model (DEM), and MODIS.
3. **Indexing Dataset:** With the metadata of the prepared files, the third step is to register these Datasets in the ODC catalog.
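
The three steps above can be sketched in Python as follows. This is a minimal sketch rather than the definitive workflow: it assumes a datacube 1.8-style API (`Datacube`, `Doc2Dataset`, `index.products.add_document`, `index.datasets.add`), a configured database, and product and dataset documents that have already been loaded into dictionaries. The function names `index_product_and_dataset` and `main` are hypothetical.

```python
def index_product_and_dataset(index, resolver, product_doc, dataset_doc, uri):
    """Run steps 1 and 3 for one dataset document that was prepared in step 2."""
    # Step 1: register the product from its definition.
    product = index.products.add_document(product_doc)

    # Step 3a: turn the prepared metadata document into a Dataset object.
    dataset, error = resolver(dataset_doc, uri)
    if error is not None:
        raise ValueError(f"could not resolve dataset: {error}")

    # Step 3b: record the dataset in the catalog.
    index.datasets.add(dataset)
    return product, dataset


def main(product_doc, dataset_doc, uri):
    # Imported lazily so the helper above can be read and tested without ODC installed.
    from datacube import Datacube
    from datacube.index.hl import Doc2Dataset

    dc = Datacube(app="indexing-sketch")  # the app name is arbitrary
    resolver = Doc2Dataset(dc.index)
    return index_product_and_dataset(dc.index, resolver, product_doc, dataset_doc, uri)
```

Passing the index and resolver in as arguments keeps the order of the steps visible and lets the helper be exercised with stand-in objects before pointing it at a real database.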

<img align="center" src="../resources/ODC.png" width=600 height=186>
<div style="text-align: center;font-style: italic;">A simple diagram of what constitutes an ODC deployment.</div>

The above diagram shows how the ODC setup works. **Data** that is available locally or in cloud storage can be indexed into the Data Cube (storing datasets in the cloud means less local disk space is consumed). The **Infrastructure** is responsible for indexing and retrieving data. The **Apps** are the tools users work with to load and plot datasets.

### Indexed Datasets
An indexed dataset is available via a file location or an external URI, with associated metadata available in a format understood by the Data Cube. The pixel data does not need to be stored in the Data Cube.

Example:

<!---Example Table--->
<table align='left'>
  <tr>
    <th style="text-align:left;">Dataset</th>
    <th style="text-align:left;">Product Name</th>
  </tr>
  <tr>
    <td style="text-align:left;"><a href="https://registry.opendata.aws/sentinel-2-l2a-cogs/">Sentinel-2A and Sentinel-2B imagery</a></td>
    <td style="text-align:left;">s2_l2a</td>
  </tr>
  <tr>
    <td style="text-align:left;"><a href="https://planetarycomputer.microsoft.com/dataset/io-lulc">ESRI Landcover Classification</a></td>
    <td style="text-align:left;">io_lulc</td>
  </tr>
</table>

## Recommended Next Steps
Loading datasets and plotting satellite images come after indexing. We therefore recommend going through the indexing notebooks to understand the steps involved and the different sources from which data can be indexed. The links below open the respective notebooks.

1. **Introduction to ODC Indexing (this notebook)**
2. [Indexing Product Definition](02_Indexing_Product_Definition.ipynb)
3. [Indexing from Local File System](03_Indexing_from_Local_File_System.ipynb)
4. [Indexing from Amazon - AWS S3](04_Indexing_from_AWS_S3.ipynb)
5. [Indexing using STAC](05_Indexing_using_STAC.ipynb)