## Getting Started with OpenDataCube
A "datacube" is a digital information architecture that specialises in hosting and cataloguing spatial information.
Piksel is an Indonesian implementation of the [Open Data Cube](https://www.opendatacube.org/) infrastructure, and specialises in storing remotely sensed data, particularly from Earth Observation satellites such as [Landsat](https://landsat.gsfc.nasa.gov/) and [Sentinel-2](https://www.copernicus.eu/en/about-copernicus/infrastructure/discover-our-satellites).

The Piksel datacube currently contains analysis-ready satellite data, and in the future will also contain derivative data "products".
These datasets and products are often composed of a range of "measurements", such as the suite of remote sensing band values or statistical product summaries. Before running a query to load data from the datacube, it is useful to know what it contains.

This notebook demonstrates several straightforward ways to inspect the product and measurement contents of a datacube.

We acknowledge and give credit to the [Digital Earth Australia Knowledge Hub](https://knowledge.dea.ga.gov.au/notebooks/README/) for providing the content that informed the development of this notebook.

***

## Description
This notebook demonstrates how to connect to a datacube and interrogate the available products and measurements stored within.
Topics covered include:

* How to connect to a datacube
* How to list all the products
* How to list all the product measurements
* How to interactively visualise data in the datacube 

***

In [None]:
# Import required packages
import pandas as pd
from datacube import Datacube

# Set some configurations for displaying tables nicely
pd.set_option("display.max_colwidth", 200)
pd.set_option("display.max_rows", None)

In [None]:
# Connect to the Piksel datacube. We will then use the variable 'dc' to interact with the datacube throughout the notebook.
dc = Datacube()

## Product Discovery

Once a datacube instance has been created, users can explore the products and measurements stored within.

The following cell lists all products that are currently available in the Piksel datacube by using the `dc.list_products()` function. 

Products listed under **name** in the following table represent the product options available when querying the datacube. 
The table below provides some useful information about each product, including a brief product **description**, the data's **license**, and the product's default **crs** (coordinate reference system) and **resolution** if applicable.

In [None]:
# Use the dc.list_products() function to list all products that have been indexed into the Piksel datacube.
dc.list_products()

## List measurements

Most products are associated with a range of available measurements.
These can be individual satellite bands (e.g. Landsat's near-infrared band) or statistical product summaries.

The `dc.list_measurements()` function can be used to interrogate the measurements associated with a given product (specified by the **name** column from the table above).
For example, `s2_l2a` refers to the Sentinel-2 Level 2a product.

The table below includes a range of technical information about each band in the `s2_l2a` dataset, including any **aliases** which can be used to load the data, the data type or **dtype**, any **flags_definition** that are associated with the measurement (this information is used for tasks like cloud masking), and the measurement's **nodata** value.

Change the `product` name below and re-run the following cell to explore available measurements associated with other products.

In [None]:
# We see in the table of products that we have a Sentinel-2 product named 's2_l2a'.
# Let's use the list_measurements() function to explore what measurements are in that product.
# In this case, they are spectral bands (e.g. green, blue) and statistical product summaries.
# Stretch goal: change the product variable to 'ls9_c2l2_sr' to view the measurements of that product.

product = "s2_l2a"

measurements = dc.list_measurements()
measurements.loc[product]
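
To illustrate how `measurements.loc[product]` selects one product's rows, here is a minimal sketch on a stand-in table. It assumes, as the `.loc[product]` call above suggests, that the measurements table is indexed by product name and then measurement name. The rows and values below are hypothetical and are not read from the Piksel datacube.

```python
import pandas as pd

# Hypothetical stand-in rows: (product, measurement, dtype, nodata).
# These values are illustrative only, not taken from the real datacube.
rows = [
    ("s2_l2a", "red", "uint16", 0),
    ("s2_l2a", "green", "uint16", 0),
    ("ls9_c2l2_sr", "red", "uint16", 0),
]

# Build a table with a two-level index: product first, then measurement.
measurements = pd.DataFrame(
    rows, columns=["product", "measurement", "dtype", "nodata"]
).set_index(["product", "measurement"])

# Selecting on the first index level returns only that product's rows,
# leaving the measurement name as the remaining index.
s2 = measurements.loc["s2_l2a"]
print(s2)

# The measurement names can then be listed directly from the index.
s2_names = list(s2.index)
print(s2_names)
```

The same pattern works for any product name that appears in the first index level, which is why changing the `product` variable is enough to explore another product.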

## Dataset Searching & Querying

### Finding Dataset

We can use the `find_datasets` function to search the datacube for all datasets that match our search query. We can then store the results of our search to the `datasets` variable.

In [None]:
datasets = dc.find_datasets(product="s2_l2a", limit=1)
datasets

We can also search for datasets within a specific spatial extent or time period. To do this, we supply a spatiotemporal query (i.e. a range of x- and y-coordinates defining the spatial area to load, and a range of times).

`dc.find_datasets()` will then return a subset of datasets that match this query:

In [None]:
# Create a variable 'datasets' and set it to the results of the find_datasets function, using the
# x-y coordinate ranges (in decimal degrees) and time period (in YYYY-MM-DD format) below
x_range = (106.566, 106.644)
y_range = (-5.842, -5.875)
time_range = ("2025-06-01", "2025-06-30")

datasets = dc.find_datasets(
    product="s2_l2a", x=x_range, y=y_range, time=time_range
)

print(f"Found {len(datasets)} datasets")

# Select the first dataset returned from our query, and print its URI (the location of its metadata document)
dataset = datasets[0]
print(f"Here's the URI for the first dataset: {dataset.uri}")
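
A spatiotemporal query like the one above can also be assembled programmatically. The sketch below is one possible way to do this using only the standard library. The helper names `month_range`, `ordered_range` and `build_query` are hypothetical and are not part of the datacube API. Sorting each coordinate range so it reads (min, max) is a readability choice here, not a claim about what the datacube requires.

```python
import calendar


def month_range(year: int, month: int) -> tuple:
    """Return (first_day, last_day) of a month as YYYY-MM-DD strings."""
    last_day = calendar.monthrange(year, month)[1]
    return (f"{year:04d}-{month:02d}-01", f"{year:04d}-{month:02d}-{last_day:02d}")


def ordered_range(a: float, b: float) -> tuple:
    """Return the two values as a (min, max) pair."""
    return (min(a, b), max(a, b))


def build_query(product: str, x: tuple, y: tuple, year: int, month: int) -> dict:
    """Collect the search terms for one product, area and calendar month."""
    return {
        "product": product,
        "x": ordered_range(*x),
        "y": ordered_range(*y),
        "time": month_range(year, month),
    }


# Same area and month as the cell above.
query = build_query("s2_l2a", (106.566, 106.644), (-5.842, -5.875), 2025, 6)
print(query)
```

With a live datacube connection, a dictionary like this could be unpacked into the search, for example `dc.find_datasets(**query)`.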

## Inspecting dataset  

Once a dataset has been found in the datacube, we can explore its properties to better understand the data it contains.  
The `dataset.measurements` attribute lists all available measurements for the dataset, similar to the `dc.list_measurements()` output but specific to this individual dataset.  

Other attributes, such as the dataset’s **coordinate reference system** (`dataset.crs`) and various metadata fields accessible via `dataset.metadata`, provide important contextual information. For example, `dataset.metadata.cloud_cover` reports the percentage of the scene affected by cloud.  

By iterating through a collection of datasets, we can compare cloud cover values and identify the dataset with the least cloud contamination. This is useful for selecting the clearest available imagery for further analysis.  

In [None]:
dataset.measurements

In [None]:
print(f"The CRS of the dataset is: {dataset.crs}")

In [None]:
# Attributes and methods that are available
dir(dataset.metadata)

In [None]:
print(f"Cloud cover for the first dataset is: {dataset.metadata.cloud_cover:.3f}%")

In [None]:
# Let's find the dataset with the least cloud cover, and store it in the `least_cloudy_dataset` variable to explore more later
least = 101
least_cloudy_dataset = None
for ds in datasets:
    # print(f"Cloud cover for {ds.id} is: {ds.metadata.cloud_cover:.3f}%")
    if ds.metadata.cloud_cover < least:
        least = ds.metadata.cloud_cover
        least_cloudy_dataset = ds
print(
    f"The dataset with the least cloud cover is: {least_cloudy_dataset.id} with {least:.3f}%"
)
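
The loop above can also be written more compactly with Python's built-in `min()`. The sketch below uses hypothetical stand-in objects (built with `SimpleNamespace`) rather than real datacube datasets, so that it runs without a datacube connection. It assumes only that each item exposes `id` and `metadata.cloud_cover`, as the datasets above do.

```python
from types import SimpleNamespace


def least_cloudy(datasets):
    """Return the item with the lowest cloud cover, or None if the list is empty."""
    return min(datasets, key=lambda ds: ds.metadata.cloud_cover, default=None)


# Hypothetical stand-ins for datasets; the ids and cloud cover values are made up.
fake_datasets = [
    SimpleNamespace(id="scene-a", metadata=SimpleNamespace(cloud_cover=42.5)),
    SimpleNamespace(id="scene-b", metadata=SimpleNamespace(cloud_cover=3.2)),
    SimpleNamespace(id="scene-c", metadata=SimpleNamespace(cloud_cover=17.0)),
]

best = least_cloudy(fake_datasets)
print(f"Least cloudy: {best.id} with {best.metadata.cloud_cover:.3f}%")

# An empty search result gives None instead of raising an error.
print(least_cloudy([]))
```

Returning `None` for an empty list makes it easy to check whether the search found anything before trying to load data.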

## Load data  

Once we have identified the datasets we want to work with, we can load them directly using the `dc.load()` function.  
Here, we specify the `datasets` parameter to load only our selected dataset (in this case, the least cloudy one) and limit the measurements to the visible spectrum bands: `"red"`, `"blue"`, and `"green"`. We also set `output_crs` to the source dataset's CRS, and set `resolution` to 20, expressed in the units of that CRS.  

The result is returned as an `xarray.Dataset`, which provides a structured way to store and interact with multi-dimensional data. We will use xarray datasets extensively throughout this workshop to analyse satellite data.

We can quickly visualise the loaded data using the `.plot.imshow()` method to display the RGB bands as a static image, or interactively explore the imagery in a map viewer using `data.odc.explore()`.  

In [None]:
print("Available measurements:")
for meas in dataset.measurements.keys():
    print(meas)

In [None]:
data = dc.load(
    datasets=[least_cloudy_dataset],
    measurements=["red", "blue", "green"],
    output_crs=least_cloudy_dataset.crs,
    resolution=20,
    # chunks={"x": 512, "y": 512},  # Uncomment to use Dask for lazy loading
).squeeze()  # Remove singleton dimensions, which in this case is the time dimension
data

We can see that `dc.load` has returned an `xarray.Dataset` containing data from our single input dataset (the least cloudy one). 

> This `xarray.Dataset` includes:  
> **Dimensions**  
> This header identifies the number of pixels in the `x` and `y` directions. Because we loaded one dataset and called `.squeeze()`, the length-one `time` dimension has been removed; a load spanning several dates would also list the number of timesteps here.
> 
> **Coordinates**  
> - time identifies the time attributed to the returned timestep.
> - x and y provide coordinates for each pixel within the returned data.  
> - spatial_ref provides information about the spatial grid used to load the data
> 
> **Data variables**  
> These are the measurements that were loaded (here `red`, `blue` and `green`).
> For each measurement, the measured value at each pixel (y, x) is returned as an array.
> Each data variable is itself an `xarray.DataArray` object.
> 
> **Attributes**  
> Other important metadata or attributes for the loaded data

We can also inspect our loaded data by plotting it:

In [None]:
# Plot the data. If we set it up as a 3 band array, we
# can use the `plot.imshow()` function to visualize it as
# a true color image.
bands = ["red", "green", "blue"]
data[bands].to_array().plot.imshow(robust=True)
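
The `robust=True` option asks xarray to base the colour limits on the 2nd and 98th percentiles of the data rather than the extreme values, so a few very bright pixels do not wash out the image. The sketch below shows roughly the same idea with NumPy on synthetic data. The function name `percentile_stretch` is hypothetical, and this is an illustration of the technique, not xarray's internal implementation.

```python
import numpy as np


def percentile_stretch(band, lower=2.0, upper=98.0):
    """Scale values to 0-1 between two percentiles, clipping anything outside."""
    vmin, vmax = np.nanpercentile(band, [lower, upper])
    scaled = (band - vmin) / (vmax - vmin)
    return np.clip(scaled, 0.0, 1.0)


# Synthetic band values (not real satellite data), with one very bright outlier.
rng = np.random.default_rng(0)
band = rng.integers(0, 3000, size=(50, 50)).astype("float32")
band[0, 0] = 10000.0

stretched = percentile_stretch(band)
print(stretched.min(), stretched.max())
```

The outlier is clipped to the top of the range instead of stretching the scale, which is why the rest of the image keeps its contrast.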

In [None]:
# Or plot an interactive map
data.odc.explore(bands=bands, vmin=0, vmax=3000)