# III. Opening a Sentinel-2 image with `Xarray`

---
**Author(s):** Quentin Yeche, Kenji Ose, Dino Ienco - [UMR TETIS](https://umr-tetis.fr) / [INRAE](https://www.inrae.fr/)

---

## 1. Introduction

In this notebook we will cover basic manipulation of multi-band imagery with `Xarray`. This is the last step before reaching the most complex use case of multi-band and multi-date data. Fortunately the libraries and concepts we cover here will transfer to time series.

## 2. Import libraries

As usual, we import all the required Python libraries. The new one is `xarray`, a package that makes working with labelled multi-dimensional arrays in Python simple, efficient, and fun! (*not sure*).

In [None]:
# STAC access
import pystac_client
import planetary_computer

# dataframes
import pandas as pd

# xarrays
import xarray as xr

# library for turning STAC objects into xarrays
import stackstac

# visualization
from matplotlib import pyplot as plt

# miscellaneous
import numpy as np
from IPython.display import display


## 3. Creating a `DataArray` from a STAC object

### 3.1. Getting a Sentinel-2 STAC Item 

As a practical use case let's consider that we have identified the STAC Item we're interested in (see [this notebook](Joensuu_01-STAC.ipynb) for a refresher), and we also have an area of interest defined as a bounding box.

In [None]:
# retrieving the relevant STAC Item
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
    )

# bounding box expressed in Lat/Lon
aoi_bounds = (3.875107329166124, 43.48641456618909, 4.118824575734205, 43.71739887308995)

item = catalog.get_collection('sentinel-2-l2a').get_item("S2A_MSIL2A_20201213T104441_R008_T31TEJ_20201214T083443")
#display(item)

assets = [asset.title for asset in item.assets.values()]
descriptions = pd.DataFrame(assets, columns=['Description'], index=pd.Series(item.assets.keys(), name='asset_key'))
descriptions

<span style='color:red'> **IMPORTANT:** The URLs that we obtain from the STAC Catalog are not valid indefinitely. They expire after about 30 minutes. If you see an error in the notebook, it is likely because the URL that you obtained by running the first few cells has now expired.</span> If that happens you should be able to just run the notebook again from the top to get a new URL. You can get longer-lasting URLs by signing up for Planetary Computer (which is free at the time of writing this notebook). More info [here](https://planetarycomputer.microsoft.com/docs/concepts/sas/). 

### 3.2. Using `stackstac` library

Now that we have our STAC Item we will use `stackstac` in order to create a multi-dimensional array for our multi-band data. This library is very powerful and will handle in a single call a lot of small inconveniences: resampling, data types, CRS, etc.

The function [`stackstac.stack`](https://stackstac.readthedocs.io/en/latest/api/main/stackstac.stack.html#stackstac.stack) has quite a few parameters, but most of them can often be left at their default values. Here are the ones we will use:
- `assets`. We don't want all the assets in the Item, especially since some of them are just metadata and have no dimension. Let's select all the bands: B02 to B08, B11 and B12
- `resolution`. As we can see in the table above, not all the bands have the same resolution. If we want them in a multi-dimensional array we need to choose a single resolution for all the bands. Here we choose 10 (for 10 meters), which means that the 20 m bands will be resampled to 10 m. The `resampling` parameter can be added to specify which resampling method to use, here the default value (`nearest`) is kept.
- `dtype`. By default the data type is `float64`. Sentinel-2 data is almost always distributed as unsigned 16-bit integers. Thus we will use `uint16`, which will save space and allow faster computations. See the note below for more details.
- `fill_value`. This is the value to use for missing data. It is not meant for an Asset whose array has some pixel values missing; it stands in for Assets that are missing altogether. If for some Items the relevant Asset is missing, then an array full of fill values is supplied instead. This kind of situation is certainly unusual for most workflows, but we still need to supply the parameter because the default value `np.nan` (NumPy's version of NaN, i.e. 'not a number') is only valid for floating point numbers, not for integers. Here we will use $2^{16} -1$. See the note below for more details.
- `bounds_latlon` and `bounds`. Only one of the two can be used. Here we use the former, which allows us to directly specify a bounding box as latitude and longitude. On the other hand `bounds` requires using the correct coordinate system as defined by the `epsg` parameter if specified, or the `proj:epsg` property on the STAC assets (and in this case all assets must have the same CRS) if the `epsg` parameter is left as default.

<details>
<summary><b>More on data types and fill values</b></summary>

Understanding data types for satellite imagery is not especially difficult. But as they rarely pose a problem it can be easy to overlook them. The following explanation focuses on Sentinel-2 data. Other satellite images can be very different, so some attention is needed when transferring or writing code for them. Moreover, even for Sentinel-2 data some things can differ depending on the data distributor (Theia, Planetary Computer, etc.).

For the most part the Copernicus programme, which manages the Sentinel missions, offers 3 main products for Sentinel-2: Level-1B, Level-1C, and Level-2A. They correspond to increasingly processed and refined data. For the vast majority of uses and users, Level-2A is the right choice. As described [here](https://sentinels.copernicus.eu/web/sentinel/missions/sentinel-2/data-products), Level-2A is reflectance data.

Reflectance is defined as the proportion of incident radiant energy that is reflected by a surface. It ranges from $0$ (total absorption) to $1$ (total reflectance). However, for reasons of efficiency and convenience, Sentinel-2 Level-2A is typically distributed as integers. The conversion is simply $10000 \times \text{reflectance}$, which results in integer values ranging from $1$ to $10000$. The value $0$ should not typically appear in Sentinel-2 data, as it is often reserved to denote NODATA pixels. This means that valid data should not fall between $10000$ and $2^{16}-1=65535$. Thus we can safely choose $2^{16} -1$ as the fill value, and we should never confuse it with valid data or NODATA values.
</details>
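
As a small illustration of the note above, the sketch below converts a few stored integers back to reflectance and masks the fill value. The pixel values are made up, and the sketch assumes the plain $10000 \times \text{reflectance}$ scaling described in the note; it ignores any additive offset that a given processing baseline or distributor may apply.

```python
import numpy as np

SCALE = 10_000                          # integer units per unit of reflectance
FILL_VALUE = np.iinfo(np.uint16).max    # largest uint16 value, i.e. 2**16 - 1

# Hypothetical stored values: three valid pixels and one fill value.
stored = np.array([312, 2875, 10_000, FILL_VALUE], dtype=np.uint16)

# Convert to floating point reflectance and mark fill pixels as NaN.
is_fill = stored == FILL_VALUE
reflectance = np.where(is_fill, np.nan, stored / SCALE)

print(FILL_VALUE)
print(reflectance)
```

Working in `uint16` and converting only when needed keeps the array four times smaller than its `float64` equivalent.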

In [None]:
bands = ['B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B11', 'B12']
FILL_VALUE = 2**16-1

array = stackstac.stack(
                    item,
                    assets = bands,
                    resolution=10,
                    dtype="uint16",
                    fill_value=FILL_VALUE,
                    bounds_latlon=aoi_bounds,
                    )

## 4. Xarray overview

### 4.1. Xarray structure

***Note:** `Xarray` is the name of the library. The objects are called `DataArrays` (or `Datasets`) but they tend to be called xarray regardless. In these notebooks we will favor DataArray or simply array.*

DataArrays are multi-dimensional arrays. Some of the advantages over more traditional arrays such as NumPy arrays include the following options:
- naming dimensions,
- providing support for holding various metadata alongside the numeric data.

DataArrays can use one of two objects to actually hold the multi-dimensional array: either a **[NumPy](https://numpy.org/doc/stable/user/absolute_beginners.html) array** or a **Dask array**. Dask is designed to handle large arrays that don't fit into memory. It also enables streaming. We created this DataArray from the URL we got from the STAC Catalog. As long as the values of the array are not explicitly needed for computation, they will not be downloaded and loaded into memory.

`Xarray` is closely related to `pandas` and `NumPy`. `DataArray` dimensions use pandas Index, and as mentioned previously the multi-dimensional array can be a NumPy object. Most NumPy methods have a DataArray equivalent. Just like pandas dataframes and NumPy arrays, Xarray does not do in-place processing. This means that most operations on a DataArray do not modify the array, they return a new modified array instead.

***Note:** It may seem wasteful and/or slow to create new objects instead of modifying them. But Xarray, NumPy and pandas were developed with this in mind so they are well optimized to that effect. In fact it is much more of a stylistic choice (i.e. how the code is written) than a fundamental inability to modify objects.*
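
To make the points about named dimensions and new objects concrete, here is a minimal sketch on a small hand-made DataArray. The array `toy` and its values are hypothetical and unrelated to the Sentinel-2 data.

```python
import numpy as np
import xarray as xr

# Hypothetical 2-band, 3-pixel array with a named, labelled `band` dimension.
toy = xr.DataArray(
    np.array([[1, 2, 3], [4, 5, 6]], dtype="uint16"),
    dims=("band", "x"),
    coords={"band": ["B02", "B03"]},
    attrs={"note": "toy data"},
)

doubled = toy * 2              # arithmetic returns a new DataArray
blue = toy.sel(band="B02")     # so does a selection

print(toy.values.tolist())     # the original values are unchanged
print(doubled.values.tolist())
print(blue.dims)
```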


In [None]:
#display the xarray object's structure

array

DataArrays display very nicely as the last line in a cell (or with an `IPython.display.display` call). From this visualization we can already gather a lot of information about our data and DataArrays in general.

The first table and accompanying cube visual essentially show the same information. The table is separated into two columns: array and chunks. Arrays are divided into chunks for reasons of efficiency. If you need to perform a lot of calculations on big arrays, it can become necessary to choose an appropriate chunk size to allow for better computation.

The rows of the table are fairly self-explanatory. The `Bytes` row gives the size of the array and a single chunk. It is important to note that even if the size is given, it does not mean that the data has been loaded. In fact at this point in the code the DataArray is created but the data has not been downloaded. But the size can still be calculated from the dimensions of the array and the data type.

The visual on the right gives a spatial interpretation of the table data. The rectangular cuboid represents the 2 spatial dimensions as well as the 9 bands we selected. We can see that the spatial dimensions are split into 3 rows and 2 columns. These smaller sections are the chunks. By looking at the table we can see that the chunks have spatial dimensions 1024 x 1024. Given our array of dimensions 1999 x 2590, the number of chunks is consistent.

The smaller square at the top-left represents the extra time dimension (of size 1 in this case since this is single-date data).
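
The figures in the table can be checked by hand. The sketch below redoes the arithmetic using only the sizes quoted above (1999 and 2590 pixels, chunks of 1024, 9 bands, 1 date) and assumes 2 bytes per value because we asked for `uint16`.

```python
import math

# Sizes quoted in the text; which one is x and which is y does not matter here.
spatial_sizes = (1999, 2590)
chunk_size = 1024            # spatial chunk size shown in the table
n_time, n_bands = 1, 9
bytes_per_value = 2          # uint16

# Number of chunks needed to cover each spatial dimension.
chunks_per_dim = [math.ceil(size / chunk_size) for size in spatial_sizes]
n_spatial_chunks = math.prod(chunks_per_dim)

# Size of the full array, computed without loading any data.
total_bytes = n_time * n_bands * math.prod(spatial_sizes) * bytes_per_value

print(chunks_per_dim, n_spatial_chunks)
print(f"{total_bytes / 2**20:.1f} MiB")
```

This is the same computation that lets the size be displayed before any data has been downloaded: only the shape and the data type are needed.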

### 4.2. Coordinates and dimensions
***Note:** The name Coordinate can be confusing, especially since it does not inherently relate in any way to spatial coordinates of any kind. DataArrays are used in fields other than GIS, but the terminology still applies. Feel free to understand "coordinates" as "metadata which may be aligned along the dimensions of the array".*

If we look at the Coordinates section we can see a lot of different rows. Feel free to explore these coordinates. You can see more about a coordinate by clicking the grey icon (stacked disks) to the right. You can also look at some coordinates more specifically with `array['coordinate_name']` or `array.coordinate_name`.

In [None]:
array

Here 4 of the rows in the Coordinates section have names in bold (or a `*` before their names if you prefer using `print(array)` instead): `time`, `band`, `x` and `y`. These coordinates are also **dimensions**. They are named dimensions, which means that we will be able to select data by directly specifying the dimension name. The dimensions contain values called labels, which makes it possible to select by these labels rather than by indices. For example in `array` the second dimension has the name `band` and 9 labels associated with it: `B02`, `B03`, `B04`, `B05`, `B06`, `B07`, `B08`, `B11`, and `B12`.

Most of the other coordinates can be seen as metadata gathered from the STAC Assets: the CRS, some information about the satellite, the cloud cover, full names for the bands... This is one of the advantages of Xarray: we can keep the relevant metadata as we perform calculations on the underlying numeric values.

The second column in the coordinate table can be useful. It indicates which dimension the coordinate relates to. As an obvious example let's consider the `title` coordinate. This coordinate refers to the full human-readable names of the bands. The row has `(band)` in the second column. This means that the `title` coordinate contains an array of the same size as the `band` dimension, and the values it contains refer to the same positions.

In the following cell we can see that the `title` coordinate is itself a DataArray with a single dimension: `band`.

On the other hand, most of the metadata we have in Coordinates has nothing specified for the second column. It simply means that there are no dimensions associated with these properties. Indeed values like `orbit` or `proj:epsg` don't have any dimensional aspect, they're just properties of the whole array.
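
The difference between a coordinate attached to a dimension and a coordinate without any dimension can be reproduced on a tiny example. The array below is hypothetical; its coordinate names simply mimic the ones found in our Sentinel-2 DataArray.

```python
import numpy as np
import xarray as xr

toy = xr.DataArray(
    np.zeros((2, 3)),
    dims=("band", "x"),
    coords={
        "band": ["B02", "B03"],                  # dimension coordinate
        "x": [570490, 570500, 570510],           # dimension coordinate
        "title": ("band", ["Blue", "Green"]),    # aligned along `band`
        "epsg": 32631,                           # scalar, no dimension
    },
)

print(toy["title"].dims)   # ('band',)
print(toy["epsg"].dims)    # ()

# A coordinate aligned along `band` follows the selection on `band`.
print(toy.sel(band="B03")["title"].item())
```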

In [None]:
# these two syntaxes are strictly equivalent
array['title']
array.title

In [None]:
# this coordinate name contains a colon ':'
# which cannot be used in a Python name
# so this is the only syntax which will work
array['proj:epsg']

#### 4.2.1. Looking at the `x` and `y` dimensions

It is fairly obvious that the `band` dimension contains the labels for the bands: `B02`, `B03` and so on. Similarly, the `time` dimension contains the date label (just one in this case). However the nature of the labels of the `x` and `y` dimensions is not necessarily obvious. These labels are the spatial coordinates for the pixels in the relevant CRS (in this case EPSG:32631, as specified by the `epsg` coordinate). By default the coordinates are for the top-left corner of the pixel, but it is possible to choose the center as the reference by using the `xy_coords` argument for `stackstac.stack`.

**Note:** Since we chose 10 meter as our common resolution, it makes sense that these spatial coordinates are incremented by 10 with every pixel.

In [None]:
display(array.x)
display(array.y)

Here's how to understand the `x` and `y` dimension labels with an example. The third label of the dimension `x` is 570510, and the second label for `y` is 4841090. This means that the top-left corner of the pixel located at (3, 2) in the array (i.e. third column from the left, and second row from the top) maps to the geometric point with coordinates (570510, 4841090) in EPSG:32631.
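
The mapping between array positions and spatial coordinates is simple enough to write down. The sketch below uses values inferred from the example above (10 m pixels, a first `x` label of 570490 and a first `y` label of 4841100, with `y` decreasing from top to bottom); the two helper functions are hypothetical and not part of Xarray or stackstac.

```python
# Inferred from the example: third x label 570510, second y label 4841090,
# 10 m resolution, labels referring to the top-left corner of each pixel.
X0, Y0, RES = 570490, 4841100, 10

def pixel_to_xy(col, row):
    """Top-left corner of the pixel at zero-based (col, row)."""
    return X0 + col * RES, Y0 - row * RES

def xy_to_pixel(x, y):
    """Zero-based (col, row) of the pixel containing the point (x, y)."""
    return int((x - X0) // RES), int((Y0 - y) // RES)

print(pixel_to_xy(2, 1))              # third column, second row
print(xy_to_pixel(570515, 4841085))   # a point inside that same pixel
```

In practice this bookkeeping is rarely needed, since `sel` works with the spatial labels and `isel` with the positions.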

### 4.3. Indexes and Attributes

The Indexes are simply the values associated with each dimension, available as a `PandasIndex` object. They can be useful in some situations, but those uses will not be covered here.

Finally the Attributes section contains some information about our imagery: the CRS, the affine transformation which links the spatial coordinates with row/column coordinates in the array, and the resolution. These Attributes would be used and modified if we were to re-project or resample our data.

## 5. Indexing and selecting

### 5.1. Different ways to select data in DataArray

When filtering and selecting data in a DataArray there are two choices to make:
 1. Are we referring to the dimension by its name or by its position?
 2. Are we using the dimension labels or simply the positional indices?

Since these two choices are independent, this leads to four different ways to select data in a DataArray, summarized in the table below:

| Dimension lookup | Index lookup | Syntax                |
|------------------|--------------|-----------------------|
| Positional       | By integer   | `array[:, 0]`         |
| Positional       | By label     | `array.loc[:, 'B02']` or <br> `array.loc[dict(band='B02')]`    |
| By name          | By integer   | `array.isel(band=0)` |
| By name          | By label     | `array.sel(band='B02')`  |

In [None]:
(
    (array[:,0]).equals(array.loc[:,'B02']) and
    (array[:,0]).equals(array.isel(band=0)) and
    (array[:,0]).equals(array.sel(band='B02'))
)

In most cases, the `sel` method is the most convenient, so we will favor it. `isel` is useful in situations where the integer index has a meaning. For example, if we want to manipulate our current array as if it were a regular image, `isel` lets us specify row/column positions instead of spatial coordinates.

***Note:** The two methods which use brackets have one advantage compared to `sel` and `isel`: they allow assigning values to a selection.*

In [None]:
array2 = array.copy(deep=True)
print(f"Value before assignment: {array2.loc[dict(band='B02', x=570490, y=4841100)].values}")
array2.loc[dict(band='B02', x=570490, y=4841100)] = 0
print(f"Value after assignment: {array2.loc[dict(band='B02', x=570490, y=4841100)].values}")

# If you uncomment the following line and try to execute, it will fail
# Indeed sel() and isel() are function calls, so they don't support assignment.
#array2.sel(band='B02', x=570490, y=4841100) = 0


### 5.2. Selecting single values

The simplest use case is to select a single value per dimension. Note that it is possible to chain multiple selections:

In [None]:
filtered = array.sel(band='B02').isel(x=658)
display(filtered)
filtered.dims

It is interesting to note that after selecting with a unique value for `band` and `x`, they still appear in coordinates but they are no longer dimensions.
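
If the dimension needs to be kept, one option is to pass a one-element list instead of a scalar. The sketch below shows both behaviors on a small hypothetical array.

```python
import numpy as np
import xarray as xr

toy = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("band", "x"),
    coords={"band": ["B02", "B03"], "x": [0, 10, 20]},
)

scalar_sel = toy.sel(band="B02")    # scalar label: `band` is dropped from dims
list_sel = toy.sel(band=["B02"])    # one-element list: `band` is kept

print(scalar_sel.dims, "band" in scalar_sel.coords)
print(list_sel.dims, list_sel.sizes["band"])
```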

### 5.3. Selecting multiple values

There are multiple ways to select multiple values at once. The easiest is to simply pass the labels (resp. integer indices) as a list to `sel` (resp. `isel`). It is important to note that the order in which the values are given is the order of the selection:

In [None]:
selection = array.sel(band=['B04', 'B02'])
display(selection)

print(selection.band.values)

***Note:** This time the dimension `band` is kept as multiple values still exist for it.*

### 5.4. Using boolean masks
Boolean masks are very powerful as they allow performing very diverse filtering operations.

If we want to select multiple possible values for a dimension, we can use a boolean mask with `isin`:

In [None]:
mask_x = array.band.isin(['B02', 'B04', 
                        'this label does not exist',
                        "but it doesn't matter"])
display(mask_x)

In [None]:
filtered = array.sel(band=mask_x)
display(filtered)
filtered.dims

There are two differences compared to the previous method of specifying a list directly:
 - with a list, all the values must appear in the dimension, or an error occurs. This is not the case with a mask.
 - with a list, the values are selected in the order in which they are given in the list. With a boolean mask the original order of the array is preserved.
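
Both differences can be checked on a small hypothetical array. In this sketch the label `"B99"` is deliberately absent from the `band` dimension.

```python
import numpy as np
import xarray as xr

toy = xr.DataArray(
    np.arange(3),
    dims=("band",),
    coords={"band": ["B02", "B03", "B04"]},
)

wanted = ["B04", "B02", "B99"]      # "B99" is a made-up label

# Selecting with a list fails as soon as one label is missing.
try:
    toy.sel(band=wanted)
    list_selection_failed = False
except KeyError:
    list_selection_failed = True

# Selecting with a boolean mask ignores the missing label.
mask = toy.band.isin(wanted)
masked = toy.sel(band=mask)

print(list_selection_failed)
print(masked.band.values.tolist())  # original order, not the order of `wanted`
```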

One issue of boolean masking is that in a single mask you must restrict yourself to a single dimension. But it is possible to use multiple masks at once.

In [None]:
# explicit parentheses: `&` binds more tightly than `|`
mask_x = ((array.x%20 == 0) & (array.x%30 == 0)) | (array.x < 578340)
mask_y = (array.y > 4840000)
array.sel(x=mask_x, y=mask_y)


### 5.5. Selecting based on a coordinate

Selection only works on dimensions. However there is a workaround to select based on a coordinate which isn't a dimension. Consider the following cell:

In [None]:
display(array.band)

If we look at the Coordinates for this DataArray, we can see that the `band` dimension, which is itself a DataArray, has kept all the coordinates which are not 'attached' to another dimension. This means that we can do the following:

In [None]:
mask = (array.band.title == 'Band 2 - Blue - 10m')
display(mask)
array.sel(band=mask)

### 5.6. Selecting in a range and slices

Although it is possible to do it with a boolean mask of two conditions, selecting in a range (spatially or a range of dates) is a fairly typical use case, so more dedicated tools exist. We use Python `slice` objects. What follows is a small introduction to slicing if you're not familiar with that concept.

#### 5.6.1. Reminder on Python slicing

In Python there are data types called sequences. Lists and arrays are examples of sequences. One characteristic of sequences is the ability to access elements in a variety of ways. The most basic way is to access a single element by its index:

In [None]:
a = list(chr(i+97) for i in range(10))
print(f"initial list: {a=}\n")
      
print("accessing with just an index")
print(f"{a[3]=}, {a[-1]=}, {a[-2]=}\n") # negative values start from the end

But we can also use a very similar syntax to select and create sub-lists:

In [None]:
print("accessing a range")
print(f"{a[2:5]=}\n")

print("accessing a range and leaving some arguments blank")
print(f"{a[:5]=}") # equivalent to a[0:5]
print(f"{a[3:]=}\n") # equivalent to a[3:len(a)]

print("accessing with a step")
print(f"{a[::2]=}") # every two elements
print(f"{a[::-2]=}") # every two elements in reverse

This kind of slicing is fairly easy to understand. But writing it out by hand can become tedious, especially if we want to slice multiple sequences in the same way, or if we want to describe a slice independently of any specific object. This is what `slice` objects are for. They turn the `x:y:z` syntax into an object we can reuse. The syntax is `slice(start, stop, step)`. Only `stop` is mandatory; `start` and `step` are optional and default to `None` if not specified. Here are some examples of slicing the same list with slice objects:


In [None]:
a = list(chr(i+97) for i in range(10))
print(f"initial list: {a=}\n")

print("accessing a range")
print(f"{a[slice(2,5)]=}\n")

print("accessing a range with just a stop value")
print(f"{a[slice(5)]=} \n") #equivalent to a[slice(0,5)]
print("accessing a range with just a start value")
print(f"{a[slice(3, None)]=}\n") #equivalent to a[slice(3, len(a))]

print("accessing with a step")
print(f"{a[slice(None, None, 2)]=}") #every two elements
print(f"{a[slice(None, None, -2)]=}") #every two elements in reverse

Of course it makes little sense to write a[slice(x,y,z)] instead of a[x:y:z]. Here's a more typical if simplistic reason to use `slice` objects:

In [None]:
letters = list(chr(i+97) for i in range(10))
letter_numbers = list(i for i in range(10))
print(f"{letters=}")
print(f"{letter_numbers=}\n")

print("let's slice both lists with the same object")
s = slice(3,8,2)
print(f"{letters[s]=}")
print(f"{letter_numbers[s]=}")

#### 5.6.2. Using slice objects with `sel` or `isel`

Selecting with a slice object is fairly straightforward:

In [None]:
array.sel(x=slice(580000, 580100), y=slice(None, None, 500))

However there are **three important details** to keep in mind.
 - Contrary to slicing with lists or NumPy arrays, both bounds are included. This means that `slice(a,b)` also includes `b`
 - When we use `sel`, the slice arguments we use are values, not indices.
 - Despite the previous point, the selection is still operated on indices

This means that when writing `sel(x=slice(a,b))`, the operation is equivalent to the following:

 1. Find the index with value `a`. Let's name it `id_a`.
 2. Find the index with value `b`. Let's name it `id_b`.
 3. If `id_a` > `id_b` then return nothing.
 4. Else return all the values with indices between `id_a` and `id_b` (bounds included).
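
The four steps can be mimicked in plain Python. This is only a simplified model for a dimension with unique labels: the helper `label_slice` is hypothetical, and the real lookup is delegated to a pandas index, which behaves differently in other cases (for example when a bound is absent from a sorted dimension).

```python
def label_slice(labels, start, stop):
    """Mimic the four steps for a list of unique labels."""
    id_a = labels.index(start)      # step 1: position of the start label
    id_b = labels.index(stop)       # step 2: position of the stop label
    if id_a > id_b:                 # step 3: start sits after stop
        return []
    return labels[id_a:id_b + 1]    # step 4: both bounds included

labels = [10, 30, 20, 50, 40]       # hypothetical unsorted labels

print(label_slice(labels, 30, 50))  # everything stored between 30 and 50
print(label_slice(labels, 50, 30))  # empty: 50 is stored after 30
```

Note that the first call returns 20 even though 20 is not between 30 and 50 numerically: what matters is where the labels are stored, not their values.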

These quirks can often show when dealing with unsorted dimensions. What follows is a small example with an unsorted dimension `x`.

In [None]:
x = np.array([3, 1, 2, 7, 6, 4, 5, 0])  # Values for the first dimension (not sorted)
y = np.arange(3)  # Values for the second dimension

data = np.random.rand(len(x), len(y))  # Random data values

# Create the DataArray
ds = xr.DataArray(data, coords={'x': x, 'y': y}, dims=['x', 'y'])

ds

In [None]:
display(ds.sel(x=slice(1,4)).x) #still includes 7 and 6 since indices are used
display(ds.sel(x=slice(5,7)).x) #nothing is selected since 5 comes after 7 in the order
ds.sel(x=slice(7,5)).x #this selects correctly

This somewhat surprising behavior is part of the reason why using unsorted dimensions is usually discouraged. Similarly, dimensions don't necessarily need to have unique values, but this also leads to confusing behavior. Fortunately in remote sensing we rarely if ever encounter these situations.
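
On a sorted dimension the behavior is much more predictable. The sketch below, on a small hypothetical array, contrasts the inclusive bounds of `sel` with `isel`, which slices by position and follows the usual Python convention of excluding the stop value.

```python
import numpy as np
import xarray as xr

toy = xr.DataArray(
    np.arange(5),
    dims=("x",),
    coords={"x": [0, 10, 20, 30, 40]},
)

by_label = toy.sel(x=slice(10, 30))      # labels 10 to 30, stop label included
by_position = toy.isel(x=slice(1, 3))    # positions 1 and 2, stop excluded

print(by_label.x.values.tolist())
print(by_position.x.values.tolist())
```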

## 6. A quick introduction to `xarray.where`

This section aims to show a few uses for [`xarray.where`](https://docs.xarray.dev/en/stable/generated/xarray.where.html). This function is a useful tool in many situations. It enables selecting data from arrays/DataArrays based on a boolean condition. Its signature is the following `xarray.where(cond, x, y)`:
 - `cond` is the boolean condition. It can be a scalar, an array or a DataArray
 - `x` values to choose when `cond` is True
 - `y` values to choose when `cond` is False

In [None]:
a = np.arange(10)
print(a)
xr.where(a>5, 0, a)

This first example is trivial, but it captures the essence of the `where` function. The complexity lies in writing the right boolean condition, and in understanding the shapes of the objects involved.

When we created the DataArray with [`stackstac.stack`](#creating-a-dataarray-from-a-stac-object) we specified a fill value to be used if data was missing. We can use `where` in combination with `sum` in order to check how many, if any, pixels were filled with that value.

In [None]:
xr.where(array == FILL_VALUE, 1, 0).sum().values

One last thing about `where` is that it allows us to create a modified array fairly easily based on conditions:

In [None]:
xr.where(array>2000, 2000, array) #useful for clipping values before creating visuals
array2 = xr.where(array.band=='B02', 0, array)
array2.sel(band='B02').values

## 7. Plotting

### 7.1. First try

Yet another advantage of Xarray is that it directly includes a lot of plotting tools from the `matplotlib` library. Let's play with them a bit. Since we are going to make a lot of calls to the underlying data array, this is one situation where using `compute` to force the data into memory is a good choice.

In [None]:
# We only take the visible bands for our visualization
bs = array.band.isin(['B02', 'B03', 'B04'])
ds_bands = array.sel(band=bs).isel(time=0).compute()

ds_bands.plot.imshow()

This first plot definitely leaves a lot to be desired. Let's try to improve it step by step.

### 7.2. Band order
The first issue is that the colors are all wrong. The yellow/orange at the bottom is supposed to be the sea. Sentinel-2 names its bands in order of increasing wavelengths, from blue to red to infrared.

In [None]:
ds_bands.title

We can indeed see that our bands are in BGR order, while the default standard for bands in regular images is RGB. There are three ways to deal with this issue:
 - using the convenient `center_wavelength` coordinate to sort our bands in descending order of wavelength with `sortby`
 - using advanced vectorized indexing
 - using a slice object to reverse the order in a selection

In [None]:
# sortby
display(ds_bands.sortby('center_wavelength', ascending=False).band)

# vectorized indexing on the first dimension
# which is band
display(ds_bands.loc[['B04','B03','B02']].band)


# slice object
ds_bands = ds_bands.sel(band=slice(None, None, -1))
ds_bands.band

In [None]:
ds_bands.plot.imshow()

### 7.3. Clipping and contrast

At this point the bands are in the right order, but the colors are still problematic. Conveniently the `imshow` function printed out a warning that the values have been clipped. This explains the overabundance of white, as it means that most of our integer values are above 255. This can be confirmed by running a quick histogram on the values:

In [None]:
ds_bands.plot.hist(bins=30)
plt.show() #not strictly necessary but it suppresses some ugly output

#### 7.3.1. Contrast and normalization - manual approach

Let's see how to adjust the values so that they fit between 0 and 1. This is similar to what we've done in the [previous notebook](Joensuu_02-Single-date.ipynb), but using Xarray functions. Here we will only use the 95th percentile, and we will clip anything above it.

In [None]:
# we specify the spatial dimensions because we want
# a value for each band separately
quantile = ds_bands.quantile(.95, dim=['x','y'])
display(quantile)
ds_bands_clipped = xr.where(ds_bands<quantile, ds_bands, quantile)

Now that we've clipped the values we can normalize the range so that the values fall between 0 and 1. To do that, we can simply calculate the following:

$$z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

where $x=(x_1,...,x_n)$ are the pixel values of a band
and $z_i$ is the $i^{th}$ normalized pixel value.

In [None]:
bands_max = ds_bands_clipped.max(dim=['x', 'y'])
bands_min = ds_bands_clipped.min(dim=['x', 'y'])
ds_bands_clipped = (ds_bands_clipped - bands_min)/(bands_max - bands_min)
ds_bands_clipped.plot.imshow()

#### 7.3.2. Adjusting with `imshow` directly

This last part was a useful way to learn about functions like `quantile` and about performing simple calculations. But rather than having to deal with contrast by hand, we can use some more convenient parameters of `imshow`.

The first are `vmax` and `vmin`. They specify cutoff bounds. Any value lower than `vmin` will be treated as equal to `vmin`. The same goes for `vmax`, but with values higher than `vmax`. When either `vmax` or `vmin` is specified, the normalization between `vmin` and `vmax` is also handled by `imshow`:

In [None]:
# we choose vmax close to the 95th percentile
# we found previously
ds_bands.plot.imshow(vmax=2900)

The second option is even more straightforward: `robust`. It doesn't require specifying any value, and it will compute the 2nd and 98th percentiles to be used as `vmin` and `vmax`:

In [None]:
ds_bands.plot.imshow(robust=True)

These parameters are very useful to produce very quick visuals. However they also are less customizable. With `robust` there is no way to choose the percentiles. With both `robust` and `vmax`/`vmin`, only a single value is computed for all the bands. In this case we can see that the percentiles are similar for all 3 bands so it doesn't matter much, but it is not always the case.
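
When the bands do have noticeably different ranges, one option is to compute the percentiles per band and rescale by hand, in the spirit of the manual approach above. The sketch below is one possible way to do it, on random hypothetical data; the 2 % and 98 % bounds are an arbitrary choice that mirrors `robust`.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
# Hypothetical 3-band image where each band has a different value range.
band_scale = np.array([1, 2, 3]).reshape(3, 1, 1)
toy = xr.DataArray(
    rng.integers(0, 3000, size=(3, 50, 50)) * band_scale,
    dims=("band", "y", "x"),
    coords={"band": ["B04", "B03", "B02"]},
)

# One pair of percentiles per band. The scalar `quantile` coordinate added
# by `quantile` is dropped so that it does not follow the result around.
low = toy.quantile(0.02, dim=["y", "x"]).drop_vars("quantile")
high = toy.quantile(0.98, dim=["y", "x"]).drop_vars("quantile")

# Rescale each band to [0, 1] with its own bounds, then clip the tails.
stretched = ((toy - low) / (high - low)).clip(min=0, max=1)

print(low.values, high.values)
print(float(stretched.min()), float(stretched.max()))
```

The result is a floating point array between 0 and 1 for every band, which can then be passed to `plot.imshow` without `vmin`, `vmax` or `robust`.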

### 7.4. Aspect ratio

The last issue with our visual is that the aspect ratio is wrong. By default the pixels of the array don't have to be squares with `imshow`. We can forcibly specify an aspect ratio by using the `aspect` parameter. However we must also specify the `size` parameter, which is the height of the plot in inches.

In [None]:
# calculating the aspect ratio as the
# size of the x dimension divided by
# the size of the y dimension
aspect = ds_bands.x.shape[0]/ds_bands.y.shape[0]

ds_bands.plot.imshow(vmax=2900, aspect=aspect, size=10)

Another option is to set the aspect `'equal'` directly with `matplotlib`. Keep in mind that this must be done after calling the `plot` function (or in this case `imshow`). This removes the requirement of calculating the aspect ratio. However a figure size should probably still be specified (with `matplotlib`) as otherwise the automatic margins are likely to be unsightly.

In [None]:
plt.figure(figsize=(15, 10))
ds_bands.plot.imshow(vmax=2900)
plt.gca().set_aspect('equal')
plt.show()

As a recap below are some short snippets to do some quick visualization. Keep in mind that Xarray uses `matplotlib` under the hood, so it is possible to use it in order to customize some aspects directly before calling `plt.show()`:

In [None]:
# removing the time dimension
# with size 1 with squeeze
ds_bands = array.squeeze()

plt.figure(figsize=(15, 10))
ds_bands[[2,1,0]].plot.imshow(robust=True, xticks=[])
plt.gca().set_aspect('equal')
# removing the axis ticks by setting them to an empty list
plt.yticks([])
plt.show()

# using matplotlib object-oriented API
fig, ax = plt.subplots(figsize=(15,10))
# we use the `ax` parameter to specify the Axes object to plot on
ds_bands.loc[['B04','B03','B02']].plot.imshow(ax=ax, vmax=2900)
# setting the aspect ratio
ax.set_aspect('equal')
# removing the axis ticks and labels
# turning the axis off completely would work
# but it would also remove the black frame around the image
#ax.axis('off')

# removing the ticks and their labels
ax.tick_params(axis='both', bottom=False, left=False, labelbottom=False, labelleft=False)
# removing 'x' and 'y' axis labels
ax.set(xlabel=None)
ax.set(ylabel=None)
# adding a title
ax.set_title('Setting a title with matplotlib')
fig.show()

## 8. Extra: Accessing values, loading and `compute`

This section is not fundamental to understanding xarrays. However it can provide some context on why some operations on xarrays are faster or slower.

**Note:** The following code cells contain code wrapped in functions. These functions serve no purpose other than allowing us to time their execution with the `@timeit` decorator.

Xarray has a lazy approach to computing when it comes to loading data. This means that the multi-dimensional array is not loaded into memory unless it is expressly needed. In this notebook we've learned about dimensions and coordinates. We've also seen how to filter an xarray. However, as long as we work on the dimensions or coordinates, there was no need to access the data itself, i.e. the underlying multi-dimensional array.

In fact, in this notebook there is only one situation in which we've had to access the multi-dimensional array: when we accessed it directly by using the `values` attribute. Here's the first instance where we did that:

In [None]:
# used to time code execution at the end of the notebook
from functools import wraps
import time
def timeit(func):
    @wraps(func)
    def timeit_wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        total_time = end_time - start_time
        print(f'Function {func.__name__}{args} {kwargs} Took {total_time:.4f} seconds')
        return result
    return timeit_wrapper

In [None]:
array2 = array.copy(deep=True)

@timeit
def access_values():
    print(f"Value before assignment: {array2.loc[dict(band='B02', x=570490, y=4841100)].values}")
access_values()

However, the call to `values` was done after we selected a pixel by using dimensions. Thus we didn't have to load the whole array into memory, only the relevant part. In this situation we had a single pixel to access, but since data is loaded in chunks, the whole chunk that the pixel belongs to was downloaded and loaded into memory. This explains why the previous cell took only a little time to execute.

If we use `values` without limiting the selection beforehand, this will download the whole array which is much slower:

In [None]:
@timeit
def access_values2():
    array.values
access_values2()

There are some instances where the naive assumption is that the array data *must* have been loaded. For example, in the next cell it is reasonable to assume that the whole array has to be downloaded:

In [None]:
@timeit
def create_where():
    return xr.where(array%2, array, 2*array)
where_array = create_where()
where_array

The explanation is that until we actually need to access the values of the array, there is no need to actually apply the boolean condition defined in `xarray.where`. For lack of a better word, this condition is virtual until we actually access the values:

In [None]:
@timeit
def access_values3():
    where_array.values
access_values3()

It is technically possible to forcibly load an array into memory by using the `compute` method, which will return a new xarray with the data loaded. However there are few situations where this is necessary. One obvious reason would be frequent calls to `values`, but that is usually a sign that the code should be rewritten to avoid these frequent calls.

In [None]:
@timeit
def compute():
    return array.compute()
where_array2 = compute()

In [None]:
@timeit
def where_on_loaded_xarray():
    return xr.where(where_array2 % 2, where_array2, 2*where_array2)
where_on_loaded_xarray()