# General Dataset Utilities

Authors: [Tom Vo](https://github.com/tomvothecoder/) & [Stephen Po-Chedley](https://github.com/pochedls/)

Updated: 11/07/24 [xcdat v0.7.3]


## Overview

This notebook demonstrates the use of general utility methods available in `xcdat`, including
the reorientation of the longitude axis, centering of time coordinates using time bounds, and
adding and getting bounds.

The data used in this example can be found in the [xarray-data repository](https://github.com/pydata/xarray-data).


### Notebook Kernel Setup

Users can [install their own instance of xcdat](../getting-started-guide/installation.rst) and follow these examples using their own environment (e.g., with VS Code, Jupyter, Spyder, iPython) or [enable xcdat with existing JupyterHub instances](../getting-started-guide/getting-started-hpc-jupyter.rst).

First, create the conda environment:

```bash
conda create -n xcdat_notebook -c conda-forge xcdat xesmf matplotlib ipython ipykernel cartopy nc-time-axis gsw-xarray jupyter pooch
```

Then install the kernel from the `xcdat_notebook` environment using `ipykernel` and name the kernel with the display name (e.g., `xcdat_notebook`):

```bash
python -m ipykernel install --user --name xcdat_notebook --display-name xcdat_notebook
```

Finally, select the `xcdat_notebook` kernel in Jupyter.


In [1]:
import xcdat as xc

## Open a dataset

Datasets can be opened and read using `open_dataset()` or `open_mfdataset()` (multi-file).

Related APIs: [xcdat.open_dataset()](../generated/xcdat.open_dataset.rst) & [xcdat.open_mfdataset()](../generated/xcdat.open_mfdataset.rst)


Let's use the example dataset and save it to netCDF (`.nc`), then open it with xCDAT.


In [4]:
ds = xc.tutorial.open_dataset("air_temperature", use_cftime=True)
ds.to_netcdf("air_temperature.nc")



In [5]:
# NOTE: Opening a multi-file dataset results in data variables that are
# backed by dask arrays.
ds = xc.open_mfdataset("air_temperature.nc")
# print dataset
ds

Output (dask array summary, one row per array in the dataset):

| Bytes | Shape | Chunk shape | Dask graph | Data type |
| --- | --- | --- | --- | --- |
| 29.52 MiB | (2920, 25, 53) | (2920, 25, 53) | 1 chunks in 2 graph layers | float64 numpy.ndarray |
| 424 B | (53, 2) | (53, 2) | 1 chunks in 2 graph layers | float32 numpy.ndarray |
| 200 B | (25, 2) | (25, 2) | 1 chunks in 2 graph layers | float32 numpy.ndarray |
| 45.62 kiB | (2920, 2) | (2920, 2) | 1 chunks in 3 graph layers | object numpy.ndarray |


## Reorient the longitude axis

Longitude can be represented from 0° to 360° E or from 180° W to 180° E. `xcdat` allows you to convert between these axis systems.

- Related API: [xcdat.swap_lon_axis()](../generated/xcdat.swap_lon_axis.rst)
- Alternative solution: `xcdat.open_mfdataset(dataset_links, lon_orient=(-180, 180))`


In [6]:
ds.lon

In [7]:
ds2 = xc.swap_lon_axis(ds, to=(-180, 180))

In [8]:
ds2.lon
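
As a rough illustration of what the reorientation involves, the arithmetic for moving a single longitude between the two conventions can be sketched in plain Python. The helper names `to_180` and `to_360` are hypothetical and are not part of `xcdat`; `xc.swap_lon_axis()` operates on the whole dataset, including its coordinates and bounds, which this sketch ignores.

```python
def to_180(lon: float) -> float:
    """Map a longitude in degrees east onto the [-180, 180) convention."""
    return ((lon + 180.0) % 360.0) - 180.0


def to_360(lon: float) -> float:
    """Map a longitude in degrees east onto the [0, 360) convention."""
    return lon % 360.0


# Hypothetical sample values in the 0 to 360 convention.
lons_360 = [200.0, 250.0, 330.0]
lons_180 = [to_180(lon) for lon in lons_360]
print(lons_180)  # [-160.0, -110.0, -30.0]
```

With this formula a longitude of exactly 180 maps to -180, so the converted range is half-open. After converting a full global axis, the values would typically also need to be sorted to stay monotonic.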

## Center the time coordinates

A given point in time often represents some time period (e.g., a monthly average). In this situation, data providers may record the time as the beginning, middle, or end of the period. `center_times()` places the time coordinate in the center of the time interval (using time bounds to determine the center of the period).

- Related API: [xcdat.center_times()](../generated/xcdat.center_times.rst)
- Alternative solution: `xcdat.open_mfdataset(dataset_links, center_times=True)`

The time bounds used for centering time coordinates:


In [9]:
# We access the values with .values because it is a dask array.
ds.time_bnds.values

array([[cftime.DatetimeGregorian(2013, 1, 1, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(2013, 1, 1, 6, 0, 0, 0, has_year_zero=False)],
       [cftime.DatetimeGregorian(2013, 1, 1, 6, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(2013, 1, 1, 12, 0, 0, 0, has_year_zero=False)],
       [cftime.DatetimeGregorian(2013, 1, 1, 12, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(2013, 1, 1, 18, 0, 0, 0, has_year_zero=False)],
       ...,
       [cftime.DatetimeGregorian(2014, 12, 31, 6, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(2014, 12, 31, 12, 0, 0, 0, has_year_zero=False)],
       [cftime.DatetimeGregorian(2014, 12, 31, 12, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(2014, 12, 31, 18, 0, 0, 0, has_year_zero=False)],
       [cftime.DatetimeGregorian(2014, 12, 31, 18, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(2015, 1, 1, 0, 0, 0, 0, has_year_zero=False)]],
      dtype=object)

Before centering time coordinates:


In [10]:
ds.time

Now center the time coordinates:


In [11]:
ds3 = xc.center_times(ds)

After centering time coordinates:


In [12]:
ds3.time
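
Conceptually, centering replaces each time coordinate with the midpoint of its bounds. The sketch below shows that arithmetic for a single interval. It uses the standard library's `datetime` rather than `cftime` objects, and `center_time` is a hypothetical helper name, not an `xcdat` function.

```python
from datetime import datetime


def center_time(lower: datetime, upper: datetime) -> datetime:
    """Return the midpoint of a time interval given its lower and upper bounds."""
    return lower + (upper - lower) / 2


# First interval of the time bounds shown above: 00:00 to 06:00 on 2013-01-01.
lower = datetime(2013, 1, 1, 0)
upper = datetime(2013, 1, 1, 6)
print(center_time(lower, upper))  # 2013-01-01 03:00:00
```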

## Add bounds

Bounds are critical to many `xcdat` operations. For example, they are used in determining the weights in spatial or temporal averages and in regridding operations. `add_bounds()` will attempt to produce bounds if they do not exist in the original dataset.

- Related API: [xarray.Dataset.bounds.add_bounds()](../generated/xarray.Dataset.bounds.add_bounds.rst)
- Alternative solution: `xcdat.open_mfdataset(dataset_links, add_bounds=["X", "Y", "T"])`
  - (Assuming the file doesn't already have bounds for your desired axis/axes)
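
To see why bounds matter for weighting, consider latitude. The surface area of a sphere between two latitudes is proportional to the difference in the sine of those latitudes, so area weights can be derived directly from the latitude bounds. The following is a minimal sketch of that idea in plain Python; `latitude_weights` is a hypothetical helper, and the weighting inside `xcdat` may differ in detail.

```python
import math


def latitude_weights(lat_bounds):
    """Weight each latitude band by the difference in the sine of its bounds."""
    weights = []
    for lower, upper in lat_bounds:
        weights.append(math.sin(math.radians(upper)) - math.sin(math.radians(lower)))
    return weights


# Hypothetical bounds (degrees north) for three equal-width latitude bands.
lat_bounds = [(0.0, 30.0), (30.0, 60.0), (60.0, 90.0)]
weights = latitude_weights(lat_bounds)
print([round(w, 3) for w in weights])  # [0.5, 0.366, 0.134]
```

Although the three bands have the same width in degrees, the band nearest the equator receives the largest weight. Without bounds, the band edges would have to be guessed from the coordinate points.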


In [13]:
# Drop the existing time bounds to demonstrate adding bounds.
# We start with the dataset that has centered time points.
ds4 = ds3.drop_vars("time_bnds")

In [14]:
try:
    ds4.bounds.get_bounds("T")
except KeyError as e:
    print(e)

"No bounds data variables were found for the 'T' axis. Make sure the dataset has bound data vars and their names match the 'bounds' attributes found on their related time coordinate variables. Alternatively, you can add bounds with `ds.bounds.add_missing_bounds()` or `ds.bounds.add_bounds()`."


There are two options for adding time bounds. The midpoint method places bounds at the midpoints between time points, and the frequency method creates bounds based on the timestamp of each time point and the frequency of the data. This is the midpoint method:


In [15]:
# midpoint method
ds4 = ds4.bounds.add_time_bounds(method="midpoint")
# print results
ds4.bounds.get_bounds("T")
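
The idea behind the midpoint method can be sketched in plain Python with hypothetical mid-month time points. The helper `midpoint_bounds` is not an `xcdat` function, and extrapolating the outermost bounds from the nearest spacing is an assumption of this sketch.

```python
from datetime import datetime


def midpoint_bounds(times):
    """Place bounds halfway between adjacent time points.

    The outermost bounds are extrapolated using half of the nearest spacing,
    which is an assumption of this sketch.
    """
    # Interior edges: midpoints between each pair of neighboring time points.
    edges = [a + (b - a) / 2 for a, b in zip(times[:-1], times[1:])]
    # Extrapolate the first and last edges from the nearest spacing.
    first = times[0] - (times[1] - times[0]) / 2
    last = times[-1] + (times[-1] - times[-2]) / 2
    edges = [first, *edges, last]
    return list(zip(edges[:-1], edges[1:]))


# Hypothetical mid-month time points.
times = [datetime(2000, 1, 16, 12), datetime(2000, 2, 15, 12), datetime(2000, 3, 16, 12)]
bounds = midpoint_bounds(times)
for lower, upper in bounds:
    print(lower, upper)
```

For these points, the edge between January and February falls on 2000-01-31 12:00 rather than on 2000-02-01 00:00.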

Notice that the midpoint method places each bound halfway between adjacent time points, so for monthly data, for example, the bounds do not necessarily fall between the last moment of month `n` and the first moment of month `n+1`. The frequency method (`method="freq"`) tries to infer the correct bounds by taking into account the timestamps and the frequency of the data. The frequency method (below) is what is used when `add_bounds=["T"]` is specified in `open_dataset()` or `open_mfdataset()`.


In [16]:
# drop time bounds again
ds5 = ds4.drop_vars("time_bnds")
# timestamp / frequency method
ds5 = ds5.bounds.add_time_bounds(method="freq")
# print results
ds5.bounds.get_bounds("T")
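
For comparison, here is a minimal sketch of the frequency idea for a hypothetical monthly series: each time point is assigned the start of its month as the lower bound and the start of the next month as the upper bound. `monthly_bounds` is an illustrative helper only. `xcdat` infers the frequency from the data and supports other frequencies, which this sketch does not attempt.

```python
from datetime import datetime


def monthly_bounds(time: datetime):
    """Return (start of the month, start of the next month) for a time point."""
    lower = datetime(time.year, time.month, 1)
    if time.month == 12:
        # Roll over into January of the following year.
        upper = datetime(time.year + 1, 1, 1)
    else:
        upper = datetime(time.year, time.month + 1, 1)
    return lower, upper


lower, upper = monthly_bounds(datetime(2000, 1, 16, 12))
print(lower, upper)  # 2000-01-01 00:00:00 2000-02-01 00:00:00
```

Because the bounds depend only on the calendar period that contains each time point, they land on month edges wherever the time point sits within the month.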

Note that `ds.bounds.add_time_bounds(method="midpoint")` is the same as `ds.bounds.add_bounds("T")`. The latter method can also be used to add bounds to other axes (e.g., latitude), as shown below.


In [17]:
ds6 = ds.drop_vars("lat_bnds")
ds6 = ds6.bounds.add_bounds("Y")
ds6.lat_bnds

## Add missing bounds for all axes supported by xcdat (X, Y, T, Z)

- Related API: [xarray.Dataset.bounds.add_missing_bounds()](../generated/xarray.Dataset.bounds.add_missing_bounds.rst)


In [18]:
# We drop the dataset axes bounds to demonstrate generating missing bounds.
ds7 = ds.drop_vars(["time_bnds", "lat_bnds", "lon_bnds"])

In [19]:
ds7

Output (dask array summary, one row per array in the dataset):

| Bytes | Shape | Chunk shape | Dask graph | Data type |
| --- | --- | --- | --- | --- |
| 29.52 MiB | (2920, 25, 53) | (2920, 25, 53) | 1 chunks in 2 graph layers | float64 numpy.ndarray |


In [20]:
# add now-missing bounds
ds7 = ds7.bounds.add_missing_bounds(["X", "Y", "T"])
# print dataset
ds7

Output (dask array summary, one row per array shown):

| Bytes | Shape | Chunk shape | Dask graph | Data type |
| --- | --- | --- | --- | --- |
| 29.52 MiB | (2920, 25, 53) | (2920, 25, 53) | 1 chunks in 2 graph layers | float64 numpy.ndarray |


Note that `ds.bounds.add_missing_bounds()` uses `ds.bounds.add_bounds()` for the latitude and longitude axes and, for the time axis, `ds.bounds.add_time_bounds()` with the frequency method by default. If you click on the database symbol for `time_bnds` above, the bounds are slightly misaligned because the time axis was not centered before adding the time bounds. In this case, the user should call `xcdat.center_times()` and then `ds.bounds.add_missing_bounds()` (as shown earlier).


## Get the dimension coordinates for an axis

In `xarray`, you can get a dimension coordinate by directly referencing its name (e.g., `ds.lat`). `xcdat` provides an alternative, name-agnostic way to get dimension coordinates by passing the CF axis key to the applicable APIs.

- Related API: [xcdat.get_dim_coords()](../generated/xcdat.get_dim_coords.rst) & [xcdat.get_dim_keys()](../generated/xcdat.get_dim_keys.rst)

Helpful knowledge:

- This API uses `cf_xarray` to interpret CF axis names and coordinate names in the xarray object attributes. Refer to [Metadata Interpretation](../getting-started-guide/faqs.rst) for more information.

Xarray documentation on coordinates ([source](https://docs.xarray.dev/en/stable/user-guide/data-structures.html#coordinates)):

- There are two types of coordinates in xarray:

  - **dimension coordinates** are one dimensional coordinates with a name equal to their sole dimension (marked by \* when printing a dataset or data array). They are used for label based indexing and alignment, like the index found on a pandas DataFrame or Series. Indeed, these “dimension” coordinates use a pandas.Index internally to store their values.

  - **non-dimension coordinates** are variables that contain coordinate data, but are not a dimension coordinate. They can be multidimensional (see Working with Multidimensional Coordinates), and there is no relationship between the name of a non-dimension coordinate and the name(s) of its dimension(s). Non-dimension coordinates can be useful for indexing or plotting; otherwise, xarray does not make any direct use of the values associated with them. They are not used for alignment or automatic indexing, nor are they required to match when doing arithmetic (see Coordinates).

- Xarray’s terminology differs from the [CF terminology](https://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#terminology), where the “dimension coordinates” are called “coordinate variables”, and the “non-dimension coordinates” are called “auxiliary coordinate variables” (see [GH1295](https://github.com/pydata/xarray/issues/1295) for more details).


### 1. `axis` attr


In [21]:
ds.lat.attrs["axis"]

'Y'

### 2. `standard_name` attr


In [22]:
ds.lat.attrs["standard_name"]

'latitude'

In [23]:
"lat" in ds.dims

True

### Utilities to get the coordinate axis and coordinate axis key


In [24]:
xc.get_dim_coords(ds, axis="Y")

In [25]:
xc.get_dim_keys(ds, axis="X")

'lon'
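
The attribute checks shown above can be sketched as a simple lookup over coordinate attributes. This is a conceptual illustration only: `find_dim_key` and `STANDARD_NAMES` are hypothetical names, the mapping covers just three axes, and the actual interpretation done through `cf_xarray` is richer than this.

```python
from typing import Dict, Optional

# Hypothetical mapping from CF axis keys to one accepted `standard_name` each.
STANDARD_NAMES = {"X": "longitude", "Y": "latitude", "T": "time"}


def find_dim_key(coord_attrs: Dict[str, Dict[str, str]], axis: str) -> Optional[str]:
    """Return the name of the first coordinate whose attributes match the CF axis."""
    expected = STANDARD_NAMES.get(axis)
    for name, attrs in coord_attrs.items():
        # 1. Check the `axis` attribute.
        if attrs.get("axis") == axis:
            return name
        # 2. Fall back to the `standard_name` attribute.
        if expected is not None and attrs.get("standard_name") == expected:
            return name
    return None


# Hypothetical coordinate attributes; `lon` has no `axis` attribute here.
coord_attrs = {
    "lat": {"axis": "Y", "standard_name": "latitude"},
    "lon": {"standard_name": "longitude"},
    "time": {"axis": "T", "standard_name": "time"},
}
print(find_dim_key(coord_attrs, "X"))  # lon
```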