# Advanced Indexing

## Learning Objectives

* Vectorized and Pointwise Indexing 
* Boolean Indexing & Masking
   * Dropping/Masking Data Using `where` and `isin`

## Overview

In the previous notebooks, we learned basic forms of indexing with xarray (positional and name based dimensions, integer and label based indexing), Datetime Indexing, and nearest neighbor lookups. In this tutorial, we will discover more advanced options for vectorized indexing and learn about additional useful methods relating to indexing/selecting data such as masking. 


First, let's import the packages needed for this notebook:

In [None]:
import numpy as np
import pandas as pd
import xarray as xr

In this tutorial, we'll use the air temperature tutorial dataset from the National Centers for Environmental Prediction.

In [None]:
ds = xr.tutorial.load_dataset("air_temperature")
da = ds.air
ds

## Vectorized Indexing

Like NumPy and pandas, Xarray supports indexing many array elements at once in a
*vectorized* manner. 

If you only provide integers, slices, or unlabeled arrays (arrays without dimension names, such as `np.ndarray` or `list`, but not `DataArray`), indexing is performed orthogonally (i.e. along independent axes, instead of using NumPy's broadcasting rules to vectorize indexers).

*Orthogonal* or *outer* indexing considers one-dimensional arrays in the same way as slices when deciding the output shapes. The principle of outer or orthogonal indexing is that the result mirrors the effect of independently indexing along each dimension with integer or boolean arrays, treating both the indexed and indexing arrays as one-dimensional. This method of indexing is analogous to vector indexing in programming languages like MATLAB, Fortran, and R, where each indexer component independently selects along its corresponding dimension. 

For example:

In [None]:
da[0, [2, 4, 10, 13], [1, 6, 7]].plot();  # -- orthogonal indexing
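To make the shape rule concrete, here is a minimal sketch on a small synthetic array (the name `toy` and its values are placeholders, not part of the tutorial dataset). It compares Xarray's orthogonal result with the equivalent outer selection in NumPy, built with `np.ix_`:

```python
import numpy as np
import xarray as xr

# Hypothetical 3-D array standing in for the air temperature data.
values = np.arange(2 * 5 * 6).reshape(2, 5, 6)
toy = xr.DataArray(values, dims=["time", "lat", "lon"])

# Xarray: each list selects independently along its own dimension.
outer = toy[0, [1, 3, 4], [0, 2]]
print(outer.dims, outer.shape)  # ('lat', 'lon') (3, 2)

# NumPy: np.ix_ builds the same outer (orthogonal) selection explicitly.
numpy_outer = values[0][np.ix_([1, 3, 4], [0, 2])]
print(np.array_equal(outer.values, numpy_outer))  # True
```

Three latitude indices and two longitude indices give a 3 x 2 result, just as two slices of those lengths would.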

For more flexibility, you can supply `DataArray` objects as indexers. The dimensions of the resulting array are given by the ordered union of the indexers' dimensions.

In the example below, the two `DataArray` indexers have different dimension names, so the selection is still orthogonal:

In [None]:
target_lat = xr.DataArray([31, 41, 42, 42], dims="degrees_north")
target_lon = xr.DataArray([200, 201, 202, 205], dims="degrees_east")

da.sel(lat=target_lat, lon=target_lon, method="nearest")  # -- orthogonal indexing

In the above example, you can see that the output shape is time x lats x lons, with one output dimension per indexer (`degrees_north` and `degrees_east`).

But what if we would like to find the information from the nearest grid cell to a collection of specified points (for example, weather stations or tower data)?


**Vectorized indexing** using `DataArray` objects may be used to extract information from the nearest grid cells of interest, for example, the climate model grid cells nearest to a collection of specified weather station latitudes and longitudes.

**To trigger vectorized indexing behavior you will need to provide the selection dimensions with a new shared output dimension name.** 

In the example below, the selections of the closest latitude and longitude are renamed to an output dimension named `points`:

In [None]:
# Define target latitude and longitude (where weather stations might be)
lat_points = xr.DataArray([31, 41, 42, 42], dims="points")
lon_points = xr.DataArray([200, 201, 202, 205], dims="points")
lat_points

In [None]:
lon_points

Now, retrieve data at the grid cells nearest to the target latitudes and longitudes (weather stations):

In [None]:
da.sel(lat=lat_points, lon=lon_points, method="nearest")

In [None]:
# For comparison: plain lists (no dimension names) are indexed orthogonally.
# Separate variable names keep `lat_points` and `lon_points` intact for later cells.
lat_list = [31, 41, 42, 42]
lon_list = [200, 201, 202, 205]
da.sel(lat=lat_list, lon=lon_list, method="nearest")  # -- orthogonal indexing

Notice that the vectorized selection with `DataArray` indexers has shape time x points, extracting a time series for each weather station, whereas plain lists are indexed orthogonally and give time x lat x lon.


In [None]:
da.sel(lat=lat_points, lon=lon_points, method="nearest").dims
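The difference between the two modes can be checked without downloading any data. The sketch below uses a hypothetical regular grid (`grid`, with random placeholder values) and made-up station coordinates, and prints the dimensions each call produces:

```python
import numpy as np
import xarray as xr

# Hypothetical regular grid; the values are random placeholders.
grid = xr.DataArray(
    np.random.rand(3, 4, 5),
    dims=["time", "lat", "lon"],
    coords={
        "lat": [30.0, 35.0, 40.0, 45.0],
        "lon": [200.0, 202.5, 205.0, 207.5, 210.0],
    },
)

station_lat = [31, 41, 42]
station_lon = [200, 201, 205]

# Plain lists: orthogonal indexing, every latitude paired with every longitude.
outer = grid.sel(lat=station_lat, lon=station_lon, method="nearest")
print(outer.dims, outer.shape)  # ('time', 'lat', 'lon') (3, 3, 3)

# DataArrays sharing the "points" dimension: one grid cell per station.
pointwise = grid.sel(
    lat=xr.DataArray(station_lat, dims="points"),
    lon=xr.DataArray(station_lon, dims="points"),
    method="nearest",
)
print(pointwise.dims, pointwise.shape)  # ('time', 'points') (3, 3)
```

With three stations, the orthogonal call returns a 3 x 3 block of grid cells per time step, while the vectorized call returns exactly three cells.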

:::{attention}
Please note that slices and sequences/arrays without named dimensions are treated as if they have the dimension that they index along.
:::

For example:

In [None]:
da.sel(lat=[20, 30, 40], lon=lon_points, method="nearest")
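As a small illustration of mixing the two kinds of indexers, the sketch below (again on a hypothetical `grid` with placeholder values) passes a plain list for `lat` and a `DataArray` with a `points` dimension for `lon`. The list keeps its own `lat` dimension, while the longitude selection contributes `points`:

```python
import numpy as np
import xarray as xr

# Hypothetical grid with placeholder values.
grid = xr.DataArray(
    np.zeros((2, 4, 5)),
    dims=["time", "lat", "lon"],
    coords={
        "lat": [20.0, 30.0, 40.0, 50.0],
        "lon": [200.0, 202.5, 205.0, 207.5, 210.0],
    },
)

lon_pts = xr.DataArray([200, 201, 202, 205], dims="points")

# The list is treated as if it had the dimension it indexes ("lat").
mixed = grid.sel(lat=[20, 30, 40], lon=lon_pts, method="nearest")
print(dict(mixed.sizes))
```

The result has three latitudes and four points for each time step, so the `lon` dimension is replaced by `points` while `lat` survives.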

:::{warning}
If an indexer is a `DataArray()`, its coordinates should not conflict with the selected subpart of the target array (except for the explicitly indexed dimensions with `.loc`/`.sel`). Otherwise, `IndexError` will be raised!
:::

## Pointwise Indexing

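Xarray's positional indexing deviates from NumPy when indexing with multiple arrays: an expression like `arr[[0, 1], [0, 1]]` is orthogonal in Xarray, whereas NumPy pairs the arrays element by element.

**Xarray's pointwise indexing supports indexing along multiple labeled dimensions with `DataArray` indexers that share a dimension, which reproduces NumPy's indexing behavior.**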


In [None]:
da = xr.DataArray(np.arange(56).reshape((7, 8)), dims=["x", "y"])
da

In [None]:
da.isel(x=xr.DataArray([0, 1, 6], dims="z"), y=xr.DataArray([0, 1, 0], dims="z"))

In the example above, three elements at `(ix, iy) = ((0, 0), (1, 1), (6, 0))` are selected and mapped along a new dimension `z`.  
If you want to add a coordinate to the new dimension `z`, you can supply a `DataArray` with a coordinate:



In [None]:
da.isel(
    x=xr.DataArray([0, 1, 6], dims="z", coords={"z": ["a", "b", "c"]}),
    y=xr.DataArray([0, 1, 0], dims="z"),
)
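The relationship to NumPy can be verified directly. This sketch rebuilds the same 7 x 8 array under the placeholder name `arr` and compares the pointwise `isel` result with NumPy's element-by-element fancy indexing, then shows that plain lists in Xarray remain orthogonal:

```python
import numpy as np
import xarray as xr

values = np.arange(56).reshape(7, 8)
arr = xr.DataArray(values, dims=["x", "y"])

# Indexers that share the dimension "z" are paired element by element.
ix = xr.DataArray([0, 1, 6], dims="z")
iy = xr.DataArray([0, 1, 0], dims="z")
picked = arr.isel(x=ix, y=iy)
print(picked.values)  # [ 0  9 48]

# Same elements as NumPy's fancy indexing with two integer arrays.
print(np.array_equal(picked.values, values[[0, 1, 6], [0, 1, 0]]))  # True

# Plain lists on the DataArray are orthogonal instead: a 3 x 3 block.
print(arr[[0, 1, 6], [0, 1, 0]].shape)  # (3, 3)
```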

Analogously, label-based pointwise-indexing is also possible by the `.sel` method:



In [None]:
da = xr.DataArray(
    np.random.rand(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)


times = xr.DataArray(pd.to_datetime(["2000-01-03", "2000-01-02", "2000-01-01"]), dims="new_time")


da.sel(space=xr.DataArray(["IA", "IL", "IN"], dims=["new_time"]), time=times)
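Because the cell above uses random values, it is hard to see which elements were picked. The following sketch repeats the selection on a hypothetical array named `table` filled with `np.arange`, so each selected value identifies its (time, space) pair:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Deterministic stand-in: value = 3 * time_index + space_index.
table = xr.DataArray(
    np.arange(12).reshape(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)

times = xr.DataArray(
    pd.to_datetime(["2000-01-03", "2000-01-02", "2000-01-01"]), dims="new_time"
)
states = xr.DataArray(["IA", "IL", "IN"], dims="new_time")

# Pairs: (Jan 3, IA), (Jan 2, IL), (Jan 1, IN)
picked = table.sel(time=times, space=states)
print(picked.dims)    # ('new_time',)
print(picked.values)  # [6 4 2]
```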

## Boolean Masking
### Masking with `where()`

Indexing methods on Xarray objects generally return a subset of the original data. However, it is sometimes useful to select an object with the same shape as the original data, but with some elements masked. To do this type of selection in Xarray, use `where()`:

In [None]:
# Let's replace the missing values (nan) with some placeholder

ds.air.where(ds.air.notnull(), -9999)

We can also use a condition to create a mask. For example, here we want to mask all the points with latitudes at or above 60° N.

In [None]:
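# `da` was reassigned in the pointwise indexing examples above,
# so point it back at the air temperature data before plotting.
da = ds.air
da[0, :, :].plot();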

In [None]:
da_masked = da.where(da.lat < 60)
da_masked[0, :, :].plot();

By default, `where` maintains the original size of the data. You can use the option `drop=True` to clip coordinate elements that are fully masked:


In [None]:
da_masked = da.where(da.lat < 60, drop=True)
da_masked[0, :, :].plot();
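The effect of `drop=True` on the array shape is easiest to see on a tiny example. The sketch below uses a hypothetical 4 x 3 array named `field` with made-up latitudes, two of which fall below 60:

```python
import numpy as np
import xarray as xr

# Hypothetical field with latitudes ordered north to south.
field = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=["lat", "lon"],
    coords={"lat": [75.0, 65.0, 55.0, 45.0], "lon": [200.0, 210.0, 220.0]},
)

# Without drop: same shape, masked rows are filled with NaN.
masked = field.where(field.lat < 60)
print(masked.shape)                # (4, 3)
print(int(masked.isnull().sum()))  # 6

# With drop=True: latitudes that are entirely masked are removed.
dropped = field.where(field.lat < 60, drop=True)
print(dropped.shape)               # (2, 3)
print(dropped.lat.values.tolist()) # [55.0, 45.0]
```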

### Selecting Values with `isin`

To check whether elements of an Xarray object equal a single value, you can compare with the equality operator `==` (e.g., `arr == 3`). To check against multiple values, use `isin()`:



Here is a simple example: 

In [None]:
x_da = xr.DataArray([1, 2, 3, 4, 5], dims=["x"])

# -- build a boolean mask that is True where the value is 2 or 4:
x_da.isin([2, 4])

`isin()` works particularly well with `where()` to support indexing by arrays that are not already labels of an array. 

For example, suppose we have another `DataArray` that holds the status flags of the data-collecting device for our data. Here, flags with values 0 and -1 indicate a malfunction, implying that the collected data may not be accurate, while positive flags indicate that the device was functioning correctly.

In [None]:
flags = xr.DataArray(np.random.randint(-1, 5, da.shape), dims=da.dims, coords=da.coords)
flags

Now, we want to see only the data for points where our measurement device is working correctly:

In [None]:
da_masked = da.where(flags.isin([1, 2, 3, 4, 5]), drop=True)
da_masked[0, :, :].plot();

<div class="alert alert-block alert-warning">
    
<strong>Warning:</strong> Please note that when done repeatedly, this type of indexing is significantly slower than using sel().
    
</div>
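Since the flags above are random, the masked result changes on every run. Here is a deterministic sketch of the same `isin()` plus `where()` pattern, using made-up `readings` and `flags` and assuming (as in the example above) that positive flags mean the device was working:

```python
import numpy as np
import xarray as xr

# Hypothetical sensor readings and status flags along one dimension.
readings = xr.DataArray(
    [10.0, 11.0, 12.0, 13.0, 14.0], dims="x", coords={"x": [0, 1, 2, 3, 4]}
)
flags = xr.DataArray([1, 0, 2, -1, 3], dims="x", coords={"x": [0, 1, 2, 3, 4]})

# Assumption for this sketch: flags 1, 2 and 3 mark valid measurements.
good = flags.isin([1, 2, 3])
print(good.values.tolist())  # [True, False, True, False, True]

# Keep the shape and mask invalid readings with NaN ...
kept = readings.where(good)
print(int(kept.isnull().sum()))  # 2

# ... or drop the invalid positions entirely.
valid_only = readings.where(good, drop=True)
print(valid_only.values.tolist())  # [10.0, 12.0, 14.0]
```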

## Align and Reindex 

Xarray enforces alignment between index Coordinates (that is, coordinates with the same name as a dimension, marked by *) on objects used in binary operations.
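As a minimal sketch of what this alignment means in practice, the example below adds two hypothetical arrays, `a` and `b`, whose `space` labels only partly overlap. Arithmetic keeps only the shared labels by default, while `xr.align` with `join="outer"` keeps all labels and fills the gaps with NaN:

```python
import xarray as xr

a = xr.DataArray([1, 2, 3], dims="space", coords={"space": ["IA", "IL", "IN"]})
b = xr.DataArray([10, 20, 30], dims="space", coords={"space": ["IL", "IN", "OH"]})

# Binary operations match values by label, not by position.
total = a + b
print(total.space.values.tolist())  # ['IL', 'IN']
print(total.values.tolist())        # [12, 23]

# An explicit outer join keeps every label from both arrays.
a_full, b_full = xr.align(a, b, join="outer")
print(a_full.sizes["space"])         # 4
print(int(a_full.isnull().sum()))    # 1
```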


In [None]:
da

In [None]:
da = xr.DataArray(
    np.random.rand(4, 3),
    [
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IL", "IA", "IN"]),
    ],
)
da

In [None]:
da.reindex(space=["IA", "CA"])

In [None]:
# Lowercase "h" is the hourly frequency alias; uppercase "H" is deprecated in recent pandas.
new_time = pd.date_range("2013-02-01", periods=20, freq="h")

In [None]:
da.reindex(time=new_time)
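Labels that are requested in `reindex` but absent from the original index are filled with NaN by default. The sketch below shows this on a small hypothetical array named `obs`, along with the `fill_value` argument for choosing a different placeholder:

```python
import xarray as xr

obs = xr.DataArray(
    [1.0, 2.0, 3.0], dims="space", coords={"space": ["IL", "IA", "IN"]}
)

# "IA" exists and keeps its value; "CA" is new and becomes NaN.
sub = obs.reindex(space=["IA", "CA"])
print(sub.values.tolist())  # [2.0, nan]

# A custom placeholder instead of NaN.
filled = obs.reindex(space=["IA", "CA"], fill_value=-9999.0)
print(filled.values.tolist())  # [2.0, -9999.0]
```

The same rule explains the time example above: when none of the new timestamps match the original index, every value in the reindexed array is NaN.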

## Additional Resources

- [Xarray Docs - Indexing and Selecting Data](https://docs.xarray.dev/en/stable/indexing.html)
