6 November 2023

# NetCDF

- **Network Common Data Form (NetCDF)** is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. NetCDF was initially developed at the Unidata Program Center and is supported on almost all platforms; parsers exist for most scientific programming languages.
    - files use the `.nc` extension
- large
- complex
- multi-dimensional
- metadata must be included
- array-oriented scientific data

### Data organization
- *Variables*
    - what is being measured
    - N-dimensional array of numbers
- *Dimensions*
    - what the variable is measured with respect to (time, location, etc.)
    - describe the axes of the arrays
- *Attributes*
    - how the variables and dimensions are measured
    - annotations about a variable, a dimension, or the whole file (how a variable was measured, sampling frequency, who took the measurements)

# xarray
- open-source `Python` package
- based on the netCDF data model
- works well with `Dask`, `Matplotlib`, and `Pandas`

In [1]:
import os              
import pandas as pd
import numpy as np

import xarray as xr   # This is the package we'll explore

In [3]:
# values of a single variable at each point of the coords 
temp_data = np.array([np.zeros((5,5)), # temp 0°C on 1st day
                      np.ones((5,5)),  # temp 1°C on 2nd day
                      np.ones((5,5))*2]).astype(int) # temp 2°C on 3rd day

temp_data # numpy array of mock temperature data

array([[[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]],

       [[2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2]]])

### Dimensions and Coordinates

Our first dimension is time, second is latitude, and third is longitude.

![alt text](netcdf_xarray_indexing.png "Title")

In [4]:
# names of the dimensions in the required order
dims = ('time', 'lat', 'lon')

# create coordinates to use for indexing along each dimension as a dictionary
coords = {'time' : pd.date_range("2022-09-01", "2022-09-03"),
          'lat' : np.arange(70, 20, -10),
          'lon' : np.arange(60, 110, 10)}  
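
A quick sanity check can catch mismatched coordinates before the array is built: each coordinate needs as many labels as the matching axis of the data has entries. This is only a sketch that rebuilds the mock data so it runs on its own; the name `expected_shape` is made up for illustration.

```python
import numpy as np
import pandas as pd

# rebuild the mock data: 3 days on a 5x5 grid
temp_data = np.array([np.zeros((5, 5)),
                      np.ones((5, 5)),
                      np.ones((5, 5)) * 2]).astype(int)

dims = ('time', 'lat', 'lon')
coords = {'time': pd.date_range("2022-09-01", "2022-09-03"),
          'lat': np.arange(70, 20, -10),
          'lon': np.arange(60, 110, 10)}

# number of labels along each dimension, in the order given by dims
expected_shape = tuple(len(coords[d]) for d in dims)

print(temp_data.shape)   # shape of the data
print(expected_shape)    # shape implied by the coordinates
```

If the two tuples differ, one of the coordinates has the wrong number of labels for its axis.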

In [5]:
# attributes (metadata) of the data array 
attrs = { 'title' : 'temperature across weather stations',
          'standard_name' : 'air_temperature',
          'units' : 'degree_C'}

In [6]:
# initialize xarray.DataArray
temp = xr.DataArray(data = temp_data, 
                    dims = dims,
                    coords = coords,
                    attrs = attrs)
temp

In [7]:
# update attributes
temp.attrs['description'] = 'simple example of an xarray.DataArray'

# add attributes to coordinates 
temp.time.attrs = {'description':'date of measurement'}

temp.lat.attrs['standard_name'] = 'grid_latitude'
temp.lat.attrs['units'] = 'degree_N'

temp.lon.attrs['standard_name'] = 'grid_longitude'
temp.lon.attrs['units'] = 'degree_E'
temp
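
The three pieces of the data model (variable, dimensions, attributes) can be read back from the `DataArray` once it is built. A minimal sketch, rebuilding a trimmed-down `temp` so it runs on its own; keeping only the `units` attribute is an assumption made for brevity.

```python
import numpy as np
import pandas as pd
import xarray as xr

# trimmed-down version of temp (only one attribute kept for brevity)
temp = xr.DataArray(
    data=np.array([np.zeros((5, 5)),
                   np.ones((5, 5)),
                   np.ones((5, 5)) * 2]).astype(int),
    dims=('time', 'lat', 'lon'),
    coords={'time': pd.date_range("2022-09-01", "2022-09-03"),
            'lat': np.arange(70, 20, -10),
            'lon': np.arange(60, 110, 10)},
    attrs={'units': 'degree_C'},
)

print(temp.dims)         # names of the dimensions, in order
print(temp.shape)        # length of each dimension
print(temp.lat.values)   # labels along the lat dimension
print(temp.attrs)        # metadata attached to the variable
print(temp.values[1])    # underlying numpy data for the 2nd day
```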

## Subsetting 

In [8]:
# access dimensions by position, then use integers for indexing
temp[0,3,2]
# the value on the 1st day, at the 4th latitude (4 down) and 3rd longitude (3 in)

In [9]:
# access dimensions by position, then use labels for indexing
temp.loc['2022-09-01', 40, 80]

In [10]:
# access dimensions by name, then use integers for indexing
temp.isel(time=0, lon=2, lat=3)

In [11]:
# access dimensions by name, then use labels for indexing
temp.sel(time='2022-09-01', lat=40, lon=80)
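
All four calls above pick out the same cell. The sketch below rebuilds `temp` so it runs on its own, checks that they agree, and then shows two label-based variations of `sel`: a `slice` of labels and `method='nearest'`. The names `band` and `nearest` are illustrative. Because `lat` is stored in descending order here, the slice is written from the larger label to the smaller one.

```python
import numpy as np
import pandas as pd
import xarray as xr

temp = xr.DataArray(
    data=np.array([np.zeros((5, 5)),
                   np.ones((5, 5)),
                   np.ones((5, 5)) * 2]).astype(int),
    dims=('time', 'lat', 'lon'),
    coords={'time': pd.date_range("2022-09-01", "2022-09-03"),
            'lat': np.arange(70, 20, -10),
            'lon': np.arange(60, 110, 10)},
)

# the four equivalent ways of reaching one cell
a = temp[0, 3, 2]                                 # position + integers
b = temp.loc['2022-09-01', 40, 80]                # position + labels
c = temp.isel(time=0, lat=3, lon=2)               # names + integers
d = temp.sel(time='2022-09-01', lat=40, lon=80)   # names + labels
print(a.item(), b.item(), c.item(), d.item())

# select a range of labels; both ends of the slice are included
band = temp.sel(lat=slice(60, 40))
print(band.lat.values)

# select the grid point closest to labels that are not on the grid
nearest = temp.sel(lat=43, lon=78, method='nearest')
print(nearest.lat.item(), nearest.lon.item())
```

Selecting by name (`isel`, `sel`) does not depend on the order of the dimensions, which makes it less error-prone than positional indexing.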

## Reduction

In [12]:
avg_temp = temp.mean(dim = 'time') 
# to keep the attributes of temp, add keep_attrs=True to the mean() call

avg_temp.attrs = {'title':'average temperature over three days'}
avg_temp
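
A reduction collapses the dimensions named in `dim` and keeps the rest. The sketch below rebuilds a trimmed-down `temp` (only `units` is kept as an attribute, an assumption for brevity) and shows `keep_attrs=True` along with a reduction over two dimensions at once; `daily_mean` is an illustrative name. Passing `keep_attrs` explicitly avoids relying on the default of the installed xarray version.

```python
import numpy as np
import pandas as pd
import xarray as xr

temp = xr.DataArray(
    data=np.array([np.zeros((5, 5)),
                   np.ones((5, 5)),
                   np.ones((5, 5)) * 2]).astype(int),
    dims=('time', 'lat', 'lon'),
    coords={'time': pd.date_range("2022-09-01", "2022-09-03"),
            'lat': np.arange(70, 20, -10),
            'lon': np.arange(60, 110, 10)},
    attrs={'units': 'degree_C'},
)

# average over time: the result keeps the lat and lon dimensions
avg_temp = temp.mean(dim='time', keep_attrs=True)
print(avg_temp.dims)
print(avg_temp.attrs)      # attributes carried over from temp

# average over the whole grid: one value per day remains
daily_mean = temp.mean(dim=['lat', 'lon'])
print(daily_mean.dims)
print(daily_mean.values)
```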

## `xarray.Dataset`

In [13]:
# make dictionaries with variables and attributes
data_vars = {'avg_temp': avg_temp,
            'temp': temp}

attrs = {'title':'temperature data at weather stations: daily and average',
        'description':'simple example of an xarray.Dataset'}

# create xarray.Dataset
temp_dataset = xr.Dataset( data_vars = data_vars,
                        attrs = attrs)

In [14]:
temp_dataset
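
Each variable in a `Dataset` is still a `DataArray` and can be pulled out by name, while selections made on the `Dataset` apply to every variable that has the selected dimension. A minimal sketch, rebuilding the data without attributes so it runs on its own; `ds` and `day2` are illustrative names.

```python
import numpy as np
import pandas as pd
import xarray as xr

temp = xr.DataArray(
    data=np.array([np.zeros((5, 5)),
                   np.ones((5, 5)),
                   np.ones((5, 5)) * 2]).astype(int),
    dims=('time', 'lat', 'lon'),
    coords={'time': pd.date_range("2022-09-01", "2022-09-03"),
            'lat': np.arange(70, 20, -10),
            'lon': np.arange(60, 110, 10)},
)

ds = xr.Dataset(data_vars={'avg_temp': temp.mean(dim='time'),
                           'temp': temp})

print(list(ds.data_vars))    # names of the variables
print(ds['temp'].dims)       # dictionary-style access
print(ds.avg_temp.dims)      # attribute-style access

# selecting on the Dataset subsets temp along time;
# avg_temp has no time dimension and is left unchanged
day2 = ds.sel(time='2022-09-02')
print(day2['temp'].dims)
print(day2['avg_temp'].dims)
```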

## Saving

# save file - don't forget the .nc extension!
temp_dataset.to_netcdf('temp_dataset.nc')

# open to check:
check = xr.open_dataset('temp_dataset.nc')
check