# Performing a basic analysis

Note that this notebook was inspired by [DE Africa's example](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks/blob/main/Beginners_guide/05_Basic_analysis.ipynb).

## Background
To understand the world around us, it's important to combine the key steps of loading, visualising, analysing, and interpreting satellite data.
To perform an analysis, we begin with a question and use these steps to reach an answer.

## Description
This notebook demonstrates how to conduct a basic analysis using Sentinel-2 and the Open Data Cube.

It will combine many of the steps that have been covered in the other beginner's notebooks.

In this notebook, the analysis question is _"How is the health of vegetation changing over time in a given area?"_

This could be related to a number of broader questions: 

* What is the effect of a new land use practice on a field of crops?
* How has a patch of forest changed after a fire? 
* How does proximity to water affect vegetation throughout the year?

For this notebook, the analysis question will be kept simple, without much real-world context. 
For more examples of notebooks that demonstrate how to use DE Africa to answer specific analysis questions, see the notebooks in the "Real world examples" folder. 

Topics covered in this notebook include:

1. Choosing a study area.
2. Loading data for the study area.
3. Plotting the chosen data and exploring how it changes with time.
4. Calculating a measure of vegetation health from the loaded data.
5. Exporting the data for further analysis.

***

## Getting started
To run this introduction to performing basic analysis with DE Africa data and the datacube, run all the cells in the notebook starting with the "Load packages" cell. For help with running notebook cells, refer back to the [Jupyter Notebooks notebook](01_Jupyter_notebooks.ipynb). 

### Load packages
The cell below imports the Python packages that are used for the analysis:

* `datacube` provides the ability to query and load data.
* `matplotlib` provides the ability to format and manipulate plots.
* `BoundingBox` from `odc.geo` is used to define the study area and view it on an interactive map.
* `configure_s3_access` from `odc.stac` sets up access to the imagery held in cloud storage.

In [None]:
import datacube
import matplotlib.pyplot as plt
from odc.geo.geom import BoundingBox
from odc.stac import configure_s3_access

### Connect to the datacube
The next step is to connect to the datacube database.
The resulting `dc` datacube object can then be used to load data.
`datacube.Datacube()` also accepts an optional `app` parameter: a name used to identify the notebook, which has no effect on the analysis. It is left unset here.

In [None]:
dc = datacube.Datacube()

## Step 1: Choose a study area

When working with the Open Data Cube, it's important to load only as much data as needed.
This helps keep an analysis running quickly and avoids the notebook crashing due to insufficient memory.

One way to set the study area is to set a central latitude and longitude coordinate pair, `(central_lat, central_lon)`, then specify how many degrees to include either side of the central latitude and longitude, known as the `buffer`.

### Location
Below, we have set the study area covering an agricultural area in Java, Indonesia.
To load a different area, you can provide your own `central_lat` and `central_lon` values.
One way to source these is to Google a location, or click directly on the map in [Google Maps](https://www.google.com/maps/place/21%C2%B007'25.4%22N+11%C2%B023'51.1%22W/@21.0925851,-11.555448,82035m/data=!3m1!1e3!4m14!1m7!3m6!1s0x10a06c0a948cf5d5:0x108270c99e90f0b3!2sAfrica!3b1!8m2!3d-8.783195!4d34.508523!3m5!1s0x0:0x0!7e2!8m2!3d21.1237127!4d-11.3975263).
Other options, all in Indonesian agricultural regions covered by this datacube instance, are:

* **North Sumatra Plantation Region, Indonesia**
```
central_lat = 3.6
central_lon = 98.7
```

* **Bogor Agricultural Area, Indonesia**
```
central_lat = -6.594
central_lon = 106.793
```


* **Central Java Rice Belt, Indonesia**
```
central_lat = -7.8
central_lon = 110.4
```

* **East Java Agricultural Zone, Indonesia**
```
central_lat = -7.9
central_lon = 112.6
```
> **Note**: If you change the study area latitude and longitude, you'll need to re-run all of the cells after to apply that change to the whole analysis.

### Buffer
Feel free to experiment with the `buffer` value to load different sized areas.
We recommend that you keep the `buffer` relatively small, no higher than `buffer=0.1` degrees.
This will help keep the loading times reasonable and prevent the notebook from crashing.
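
To get a feel for why the buffer matters, the rough estimate below converts a buffer in degrees into a pixel count and an uncompressed size in memory. It is only a back-of-the-envelope sketch: the function name `estimate_load_size` is made up for this illustration, one degree is assumed to be about 111.32 km (reasonable near the equator), and each value is assumed to take 2 bytes (as for 16-bit integer data).

```python
import math

# Assumption: approximate length of one degree near the equator, in metres
METRES_PER_DEGREE = 111_320


def estimate_load_size(buffer_deg, resolution_m=10, n_bands=4, n_times=1,
                       bytes_per_value=2):
    """Return (pixels per side, approximate size in megabytes) for a square area."""
    side_m = 2 * buffer_deg * METRES_PER_DEGREE
    pixels_per_side = math.ceil(side_m / resolution_m)
    n_bytes = pixels_per_side ** 2 * n_bands * n_times * bytes_per_value
    return pixels_per_side, n_bytes / 1e6


for buffer in (0.03, 0.1):
    pixels, megabytes = estimate_load_size(buffer)
    print(f"buffer={buffer}: about {pixels} x {pixels} pixels, "
          f"{megabytes:.1f} MB per time step")
```

Because the area grows with the square of the buffer, doubling the buffer roughly quadruples the amount of data to load.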

> **Extension**: Can you modify the code to use a different `buffer` value for latitude and longitude? 

> *Hint*: You may want two variables, `buffer_lat` and `buffer_lon` that you can set independently. You'll then need to update the definitions of `study_area_lat` and `study_area_lon` with their corresponding buffer value.

In [None]:
# Set the central latitude and longitude
central_lat = -6.53333
central_lon = 105.72639

# Set the buffer to load around the central coordinates
buffer = 0.03

# Compute the bounding box for the study area
study_area_lat = (central_lat - buffer, central_lat + buffer)
study_area_lon = (central_lon - buffer, central_lon + buffer)
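
One possible approach to the extension above is sketched here, using the `buffer_lat` and `buffer_lon` names suggested in the hint. The helper function `bounding_ranges` and the example buffer values are hypothetical, introduced only for this illustration.

```python
def bounding_ranges(central_lat, central_lon, buffer_lat, buffer_lon):
    """Return (lat_range, lon_range) tuples around a central point."""
    lat_range = (central_lat - buffer_lat, central_lat + buffer_lat)
    lon_range = (central_lon - buffer_lon, central_lon + buffer_lon)
    return lat_range, lon_range


# Example values: a study area that is wider (east-west) than it is tall
lat_range, lon_range = bounding_ranges(
    central_lat=-6.53333,
    central_lon=105.72639,
    buffer_lat=0.02,
    buffer_lon=0.05,
)
print(lat_range)
print(lon_range)
```

The two ranges could then be used in place of `study_area_lat` and `study_area_lon` in the rest of the notebook.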

After choosing the study area, it can be useful to visualise it on an interactive map.
This provides a sense of scale.
> **Note**: The interactive map also returns latitude and longitude values when clicked.
You can use this to generate new latitude and longitude values to try without leaving the notebook.

In [None]:
aoi = BoundingBox(
    left=study_area_lon[0],
    bottom=study_area_lat[0],
    right=study_area_lon[1],
    top=study_area_lat[1]
)
aoi.explore()

## Step 2: Loading data

When asking analysis questions about vegetation, it's useful to work with optical imagery, such as Sentinel-2 or Landsat.
The Sentinel-2 satellites have 10 metre resolution and go back to 2017. 

The code below sets up the required information to load the data.

In [None]:
# Set the data source - s2_l2a corresponds to Sentinel-2 Level-2A (surface reflectance)
set_product = "s2_l2a"

# Set the date range to load data over
set_time = ("2025-01-01", "2025-03-31")

# Set the measurements/bands to load
# For this analysis, we'll load the red, green, blue and near-infrared bands
set_measurements = [
    "red",
    "blue",
    "green",
    "nir"
]

# Set the coordinate reference system and output resolution
set_crs = 'EPSG:6933'
set_resolution = 10

After setting all of the necessary parameters, the `dc.load()` command is used to load the data:

In [None]:
configure_s3_access(aws_unsigned=True)

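# This will load the data into an xarray Dataset
# Because dask_chunks is set, pixels are only read when they are needed
# (for example when plotting), so those later steps may take a while to run...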
dataset = dc.load(
    product=set_product,
    x=study_area_lon,
    y=study_area_lat,
    time=set_time,
    measurements=set_measurements,
    output_crs=set_crs,
    resolution=set_resolution,
    group_by='solar_day',
    dask_chunks={"time": 1, "x": 512, "y": 512},
    driver="rio"
)

In [None]:
# Count the time steps; len(dataset) would count the data variables instead
num_images = dataset.sizes["time"]
print(f"There are {num_images} images available for {set_product}.")

Following the load step, printing the `dataset` object will give you insight into all of the data that was loaded.
Do this by running the next cell.

There's a lot of information to unpack, which is represented by the following aspects of the data:
- `Dimensions`: the names of data dimensions, frequently `time`, `x` and `y`, and number of entries in each
- `Coordinates`: the coordinate values for each point in the data cube
- `Data variables`: the observations loaded, frequently different spectral bands from a satellite
- `Attributes`: any useful information for the data, such as the `crs` (coordinate reference system)
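
Each of these aspects can also be read programmatically. The sketch below builds a small synthetic `xarray.Dataset` (random values, with made-up coordinates and a made-up `crs` attribute) rather than using the loaded `dataset`, so that it runs on its own; the same properties are available on the loaded data.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
times = np.array(["2025-01-05", "2025-01-10", "2025-01-15"], dtype="datetime64[ns]")
shape = (len(times), 4, 5)  # (time, y, x)

# Synthetic stand-in for a loaded dataset with two bands
example = xr.Dataset(
    data_vars={
        "red": (("time", "y", "x"), rng.integers(0, 3000, size=shape)),
        "nir": (("time", "y", "x"), rng.integers(0, 3000, size=shape)),
    },
    coords={"time": times, "y": np.arange(4) * -10.0, "x": np.arange(5) * 10.0},
    attrs={"crs": "EPSG:6933"},
)

print(dict(example.sizes))      # Dimensions and the number of entries in each
print(list(example.coords))     # Coordinate names
print(list(example.data_vars))  # Data variables
print(example.attrs)            # Attributes

# len() of a Dataset counts the data variables, not the time steps
print(len(example), example.sizes["time"])
```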

In [None]:
dataset

## Step 3: Plotting data

After loading the data, it is useful to view it to understand the resolution, which observations are impacted by cloud cover, and whether there are any obvious differences between time steps.

See if you can plot multiple images at the same time. This requires two changes to the code below:

1. Use a list of values as an argument to the index select function `isel`, i.e., `isel(time=[0, 1])`
2. Pass in some arguments to the `imshow` function, `col` and `col_wrap`. You can [read the documentation](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.plot.html) to find out more.
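
The reason a list is needed can be seen on a small synthetic array (made up for this illustration, not the loaded data): selecting with a single integer drops the `time` dimension, while selecting with a list keeps it. A remaining `time` dimension is what `col="time"` needs in order to draw one panel per time step.

```python
import numpy as np
import xarray as xr

# A tiny synthetic stack with 3 time steps of 2 x 2 pixels
stack = xr.DataArray(
    np.arange(12).reshape(3, 2, 2),
    dims=("time", "y", "x"),
)

single = stack.isel(time=1)        # integer index: the time dimension is dropped
several = stack.isel(time=[0, 1])  # list index: the time dimension is kept

print(single.dims)
print(several.dims, several.sizes["time"])
```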

In [None]:
# Set the time step to view
time_step = 1  # Change this to view different time steps

# Set the band combination to plot
bands = ["red", "green", "blue"]

# Generate the image by selecting one time step and plotting the chosen bands as RGB
one_time_step = dataset.isel(time=time_step)
one_time_step[bands].to_array().plot.imshow(
    size=8,
    robust=True
)

# Format the time stamp for use as the plot title
time_string = str(dataset.time.isel(time=time_step).values).split('.')[0]  

# Set the title and axis labels
ax = plt.gca()
ax.set_title(f"Timestep {time_string}", fontweight='bold', fontsize=16)
ax.set_xlabel('Easting (m)', fontweight='bold')
ax.set_ylabel('Northing (m)', fontweight='bold')

# Display the plot
plt.show()

### Visualise on a web map

Another way to view the images is to create a simple web map.

In [None]:
# We need to set vmin and vmax to ensure the image is displayed correctly
# The values here are set based on the expected range of pixel values for the Sentinel-2 RGB bands

one_time_step.odc.explore(vmin=0, vmax=3000)
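
As a rough illustration of what `vmin` and `vmax` do, the sketch below applies a simple linear stretch with NumPy: values at or below `vmin` become 0 (darkest), values at or above `vmax` become 1 (brightest), and everything in between is scaled linearly. This illustrates the general idea only; it is not a description of the internals of `explore()`, and both the function name `linear_stretch` and the pixel values are made up.

```python
import numpy as np


def linear_stretch(values, vmin, vmax):
    """Scale values linearly so vmin maps to 0 and vmax maps to 1, clipping the rest."""
    scaled = (np.asarray(values, dtype="float32") - vmin) / (vmax - vmin)
    return np.clip(scaled, 0.0, 1.0)


pixels = np.array([0, 750, 1500, 3000, 8000])
print(linear_stretch(pixels, vmin=0, vmax=3000))
```

Very bright features, such as clouds, often exceed the chosen `vmax` and are simply shown at full brightness.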

### Find a cloud-free image

Keep going through the images, by changing `time_step` above, until you find one that is relatively free of clouds. We'll use it later.

## Step 4: Calculate vegetation health

While it's possible to identify vegetation in the RGB image, it can be helpful to have a quantitative index to describe the health of vegetation directly. 

In this case, the [Normalised Difference Vegetation Index](https://en.wikipedia.org/wiki/Normalized_difference_vegetation_index) (NDVI) can help identify areas of healthy vegetation.
For remote sensing data such as satellite imagery, it is defined as

$$
\begin{aligned}
\text{NDVI} & = \frac{(\text{NIR} - \text{Red})}{(\text{NIR} + \text{Red})}, \\
\end{aligned}
$$

where $\text{NIR}$ is the near-infrared band of the data, and $\text{Red}$ is the red band.
NDVI can take on values from -1 to 1; high values indicate healthy vegetation and negative values indicate non-vegetation (such as water). 

The following code calculates the top and bottom of the fraction separately, then computes the NDVI value directly from these components.

In [None]:
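# Cast to float first: if the bands are stored as unsigned integers,
# nir - red would wrap around wherever red is greater than nir
nir = dataset.nir.astype("float32")
red = dataset.red.astype("float32")

# Calculate the components that make up the NDVI calculation
band_diff = nir - red
band_sum = nir + red

# Calculate NDVI and store it as a measurement in the original dataset
dataset["ndvi"] = band_diff / band_sum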
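
To check the formula by hand, the sketch below applies it to three made-up pixels with NumPy. It also illustrates a pitfall worth knowing about: if the bands are stored as unsigned 16-bit integers (an assumption here; check the `dtype` of your own data), subtracting a larger red value from a smaller NIR value wraps around instead of going negative, so converting to floating point first is the safer route.

```python
import numpy as np

# Made-up reflectance values stored as unsigned 16-bit integers:
# a vegetated pixel, a water-like pixel (red > nir), and a neutral pixel
nir = np.array([3000, 500, 2000], dtype="uint16")
red = np.array([500, 800, 2000], dtype="uint16")

# Convert to float before doing arithmetic
nir_f = nir.astype("float32")
red_f = red.astype("float32")
ndvi = (nir_f - red_f) / (nir_f + red_f)
print(ndvi)  # positive for vegetation, negative for the water-like pixel

# Without the conversion, the second pixel wraps around to a large positive number
wrapped = nir - red
print(wrapped)
```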

After calculating the NDVI values, it is possible to plot them by adding the `.plot()` method to `ndvi` (the variable that the values are stored in).
The code below will plot a single image, based on the time selected with the `ndvi_time_step` variable.
Try changing this value to plot the NDVI map at different time steps.
Do you notice any differences?

> **Extension 1**: Sometimes, it is valuable to change the colour scale to something that helps with intuitively understanding the image.
For example, the "viridis" colour map shows high values in greens/yellows (mapping to vegetation), and low values in blue (mapping to water).
Try modifying the `.plot(cmap="RdYlGn")` command below to use `cmap="viridis"` instead.

In [None]:
# Set the NDVI time step to view
ndvi_time_step = time_step

# This is the simple way to plot
# Note that high values are likely to be vegetation.
plt.figure(figsize=(8, 8))
dataset.ndvi.isel(time=ndvi_time_step).plot(cmap="RdYlGn", vmin=0, vmax=1)
plt.show()

> **Extension 2**: For the cell above, a single time step was selected using the `.isel()` method.
It is possible to plot all time steps by removing the `.isel()` method, and modifying the `.plot()` method to be `.plot(col='time', col_wrap=3)` where `time` is the timesteps for the images.
Plotting all of the time steps at once may make it easier to notice differences in vegetation over time.

In [None]:
# A faceted plot creates its own figure, so plt.figure() is not needed here
dataset.ndvi.plot(col='time', cmap="RdYlGn", vmin=0, vmax=1, col_wrap=3)
plt.show()

## Step 5: Exporting data

Sometimes, you will want to analyse satellite imagery in a GIS program, such as QGIS.
The `write_cog()` command from the Open Data Cube library allows loaded data to be exported to GeoTIFF, a commonly used file format for geospatial data. This example exports a single image, selected by `ndvi_time_step`. For more information on exporting multiple images, see the [Exporting GeoTIFFs notebook](../Frequently_used_code/Exporting_GeoTIFFs.ipynb).
> **Note**: the saved file will appear in the same directory as this notebook, and it can be downloaded from there for later use.

In [None]:
# You can change the name from example to your preferred name, if you like.
filename = "example.tif"

dataset.ndvi.isel(time=ndvi_time_step).odc.write_cog(filename, overwrite=True)
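
When exporting more than one time step, each file needs its own name. One simple option is to build the name from the acquisition date, as sketched below. The helper `cog_filename` and the `ndvi` prefix are hypothetical, and the timestamps are made up; in the notebook, each name could then be passed to `write_cog()` inside a loop over the time steps.

```python
import numpy as np


def cog_filename(timestamp, prefix="ndvi"):
    """Build a file name such as ndvi_2025-01-05.tif from a numpy datetime64."""
    date = str(timestamp)[:10]  # keep only the YYYY-MM-DD part
    return f"{prefix}_{date}.tif"


# Made-up acquisition times, in the same format as dataset.time.values
times = np.array(
    ["2025-01-05T03:10:21", "2025-01-10T03:10:19"], dtype="datetime64[ns]"
)

for i, timestamp in enumerate(times):
    name = cog_filename(timestamp)
    print(i, name)
    # In the notebook, the export for this time step would be:
    # dataset.ndvi.isel(time=i).odc.write_cog(name, overwrite=True)
```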

## Recommended next steps

### For this notebook
Many of the variables used in this analysis are configurable. We recommend returning to the
beginning of the notebook and re-running the analysis with a different location, dates, measurements, and so on.

You could try to do the same analysis using the cloud masking techniques used in the Case Study notebook on Sentinel-2
Machine Learning, and create a temporal composite, plotting the max NDVI over time.
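
The compositing step can be sketched on its own. Assuming cloudy pixels have already been masked out as NaN, taking the maximum along the time axis keeps the highest valid NDVI seen at each pixel. The example uses a made-up NumPy array with dimensions (time, y, x); with the notebook's xarray data, the equivalent reduction would be along the `time` dimension, for example `dataset.ndvi.max(dim="time")`.

```python
import numpy as np

# Made-up NDVI values for 3 time steps of 2 x 2 pixels; NaN marks masked (cloudy) pixels
ndvi_stack = np.array([
    [[0.2, np.nan], [0.5, 0.1]],
    [[0.6, 0.3], [np.nan, 0.4]],
    [[0.4, 0.7], [0.3, np.nan]],
])

# Maximum over the time axis, ignoring NaN values
max_ndvi = np.nanmax(ndvi_stack, axis=0)
print(max_ndvi)
```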

This will help you build the understanding needed to run your own analysis.

If you didn't try the extension activities the first time, try and work on these when you run through the notebook again.