# The CF Conventions

The [Climate and Forecast Metadata Conventions](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html) define metadata that provide a definitive description of a file's data and its spatial and temporal properties, making it "self-describing." The conventions have been in development since 2003 and are most applicable to earth sciences data.

#### CF Data Model

A generalized ["data model"](https://doi.org/10.5194/gmd-10-4619-2017) of the CF conventions has recently been developed to "aid in development of CF-compliant software and to capture with a minimal set of elements all of the information contained in the CF conventions", which is meant to enable development of CF-compliant data in other encodings (non-netCDF) as well as software that can work with any CF-compliant dataset. I've spent some time with this paper and I'm not quite sure yet what the data model's role is in SNAP's work. 

# Implementation of CF-compliance

Structuring a netCDF file in a way that is CF-compliant is the primary application of all of this.

Making a file CF compliant isn't simply about adding the correct metadata. The data within the file must follow a certain structure. 

This notebook demonstrates implementing all **required** steps to achieve CF compliance via `xarray` in Python. Steps that are merely recommended by the CF conventions are noted as such to distinguish them from requirements. 

## Example: Sea ice indicators dataset

This example walks through making one of the final product datasets of the NOAA Arctic Indicators "Sea Ice Indicators" project, the freeze-up/break-up start/end dates, CF-compliant. 

The best practice here is probably to make the dataset CF-compliant via the script that produces the data. However, my plan is to make the dataset CF-compliant with another script in the pipeline for now, which provides an opportunity here to develop both that code and an example of making a dataset CF-compliant.

Here is what the dataset looks like when loaded by `xarray`: 

In [2]:
import os
import xarray as xr
import numpy as np

In [3]:
fubu_fp = "nsidc_0051_1979-2019_fubu.nc"

fubu = xr.load_dataset(fubu_fp)

fubu

### Coordinate system info

#### Spatial (a.k.a. horizontal) coordinate system

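We need to provide enough information to geolocate these data, because the coordinate variables of this dataset are projected x and y coordinates rather than longitude and latitude. This is based on the following CF-convention requirement: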

"*If the coordinate variables for a horizontal grid are not longitude and latitude, it is required that further information is provided to geo-locate the horizontal position.*" ([CF](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#coordinate-system)) 

There are two ways provided to meet this requirement:

1. provide the Auxiliary Coordinate Variables for latitude and longitude
2. provide information about the coordinate system in a "grid mapping" variable (enough info to derive latitude and longitude for each grid cell)

For now, we will go with #2 and provide info about the projection coordinate reference system via a grid mapping variable. Option #1 may be done in addition to the provision of a grid mapping variable, **but that is not required**.
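For reference, option #1 could be sketched roughly as follows. This is a minimal illustration, not part of the pipeline: the 3x3 grid is hypothetical (it is not the real NSIDC grid), and it assumes latitude and longitude are wanted on the projection's own geodetic CRS rather than on WGS84.

```python
import numpy as np
import xarray as xr
from pyproj import CRS, Transformer

# Hypothetical 3x3 grid of projected coordinates in metres (not the real NSIDC grid)
xc = np.array([-50_000.0, 0.0, 50_000.0])
yc = np.array([50_000.0, 0.0, -50_000.0])
ds = xr.Dataset(coords={"xc": xc, "yc": yc})

crs = CRS.from_epsg(3411)
# Assumption: lat/lon are expressed on the geodetic CRS underlying the projection
transformer = Transformer.from_crs(crs, crs.geodetic_crs, always_xy=True)

# 2-D arrays of projected coordinates with shape (yc, xc)
xx, yy = np.meshgrid(ds["xc"].values, ds["yc"].values)
lon, lat = transformer.transform(xx, yy)

# Attach the results as 2-D auxiliary coordinate variables
ds = ds.assign_coords(
    lat=(("yc", "xc"), lat, {"standard_name": "latitude", "units": "degrees_north"}),
    lon=(("yc", "xc"), lon, {"standard_name": "longitude", "units": "degrees_east"}),
)
```

The centre cell of this toy grid sits at the projection origin (the north pole), which gives a quick sanity check on the transformation.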

**`xarray` implementation**

Create the grid mapping variable in the dataset. Name and data type do not matter:

In [216]:
# there are no requirements for naming (as far as I can tell)
grid_mapping_varname = "crs"

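# the data type does not matter, it is just a dummy (scalar) variable.
# use an explicit integer value: an empty DataArray holds NaN, which does not cast cleanly to int
fubu[grid_mapping_varname] = xr.DataArray(np.int32(0))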

# verify the new variable is present
fubu["crs"]

The `grid_mapping_name` attribute of the grid mapping variable needs to adhere to guidelines in the FGDC "Content Standard for Digital Geospatial Metadata". The available mappings with the proper values for the `grid_mapping_name` attribute are given in the CF [grid mappings appendix](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#appendix-grid-mappings).

That appendix also gives ["Map Parameters" for the Stereographic mapping](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#appendix-grid-mappings) that specify which attributes must be defined. They provide [this link](http://geotiff.maptools.org/proj_list/stereographic.html) and [this table](https://cfconventions.org/wkt-proj-4.html) as resources for help translating between the Map parameters and OGC WKT parameters.

However, creation of the correct attributes can be done *automatically* with the `to_cf()` method of the `pyproj.crs.CRS` class! This assigns the grid mapping name, as well as all mandatory and optional "Map parameters."

Here is all that is assigned:

In [217]:
from pyproj.crs import CRS

fubu[grid_mapping_varname].attrs = CRS.from_epsg(3411).to_cf()

fubu[grid_mapping_varname].attrs

{'crs_wkt': 'PROJCRS["NSIDC Sea Ice Polar Stereographic North",BASEGEOGCRS["Unspecified datum based upon the Hughes 1980 ellipsoid",DATUM["Not specified (based on Hughes 1980 ellipsoid)",ELLIPSOID["Hughes 1980",6378273,298.279411123064,LENGTHUNIT["metre",1]]],PRIMEM["Greenwich",0,ANGLEUNIT["degree",0.0174532925199433]],ID["EPSG",4054]],CONVERSION["US NSIDC Sea Ice polar stereographic north",METHOD["Polar Stereographic (variant B)",ID["EPSG",9829]],PARAMETER["Latitude of standard parallel",70,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8832]],PARAMETER["Longitude of origin",-45,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8833]],PARAMETER["False easting",0,LENGTHUNIT["metre",1],ID["EPSG",8806]],PARAMETER["False northing",0,LENGTHUNIT["metre",1],ID["EPSG",8807]]],CS[Cartesian,2],AXIS["easting (X)",south,MERIDIAN[45,ANGLEUNIT["degree",0.0174532925199433]],ORDER[1],LENGTHUNIT["metre",1]],AXIS["northing (Y)",south,MERIDIAN[135,ANGLEUNIT["degree",0.0174532925199433]],ORDER[2],LENGTH

Confusingly, the `latitude_of_projection_origin` attribute is not created. This parameter is not present in the WKT for the [spatialreference.org](https://www.spatialreference.org/ref/epsg/nsidc-sea-ice-polar-stereographic-north/) EPSG 3411 entry, but it is available via the `to_dict()` method of the `pyproj.crs.CRS` class (and perhaps it's obvious that it is +90.0, anyway, since we're dealing with the north pole).

In [120]:
epsg_3411_proj = CRS.from_epsg(3411).to_dict()

epsg_3411_proj

{'proj': 'stere',
 'lat_0': 90,
 'lat_ts': 70,
 'lon_0': -45,
 'x_0': 0,
 'y_0': 0,
 'a': 6378273,
 'b': 6356889.449,
 'units': 'm',
 'no_defs': None,
 'type': 'crs'}

The omission of this parameter might be a bug in the `to_cf()` method, since it is treated like a required parameter for the Polar Stereographic grid mapping in the CF conventions (i.e., it doesn't say "optional"). We will add it manually just in case:

In [218]:
fubu[grid_mapping_varname].attrs["latitude_of_projection_origin"] = float(epsg_3411_proj["lat_0"])

fubu[grid_mapping_varname].attrs["latitude_of_projection_origin"]

90.0
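One way to sanity-check the grid mapping attributes is to rebuild a CRS from them with `CRS.from_cf()` and inspect the resulting parameters. The sketch below works on a plain dict rather than on the dataset, and mirrors the manual `latitude_of_projection_origin` addition made above. It only shows that the attributes can be parsed back into a polar stereographic CRS, not that the file is CF-compliant.

```python
from pyproj.crs import CRS

# Build the CF attributes as above, on a plain dict instead of the dataset
cf_attrs = CRS.from_epsg(3411).to_cf()
cf_attrs["latitude_of_projection_origin"] = 90.0

# Rebuild a CRS from the CF attributes and look at its PROJ parameters
rebuilt = CRS.from_cf(cf_attrs)
params = rebuilt.to_dict()

print(params["proj"], params["lat_0"], params["lat_ts"], params["lon_0"])
```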

The `standard_name` attribute of the coordinate variables associated with the grid mapping must be defined according to the requirements in the CF [grid mappings appendix](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#appendix-grid-mappings). These are `projection_x_coordinate` and `projection_y_coordinate` for the x and y rectangular coordinates:

In [219]:
fubu["xc"].attrs["standard_name"] = "projection_x_coordinate"
fubu["yc"].attrs["standard_name"] = "projection_y_coordinate"

The `units` attribute should also be set on these coordinate variables:

In [220]:
for coord_var in ["xc", "yc"]:
    fubu[coord_var].attrs["units"] = "m"

Grid mapping variables are associated with the data variables by the `grid_mapping` attribute. Assign this attribute for all four data variables:

In [221]:
indicator_names = ["freezeup_start", "freezeup_end", "breakup_start", "breakup_end"]

for indicator in indicator_names:
    fubu[indicator].attrs["grid_mapping"] = grid_mapping_varname
    
fubu[indicator_names[0]].attrs

{'grid_mapping': 'crs'}
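A small check like the following can confirm that every data variable points at a grid mapping variable that actually exists. It is only a sketch: `missing_grid_mappings` is a hypothetical helper, and the toy dataset stands in for `fubu`.

```python
import numpy as np
import xarray as xr


def missing_grid_mappings(ds):
    """Return names of data variables whose grid_mapping attribute is absent or dangling."""
    problems = []
    for name, var in ds.data_vars.items():
        if "grid_mapping_name" in var.attrs:
            continue  # this is the grid mapping variable itself
        target = var.attrs.get("grid_mapping")
        if target is None or target not in ds.variables:
            problems.append(name)
    return problems


# Toy stand-in for the real dataset: one variable is linked, one is not
demo = xr.Dataset(
    {
        "freezeup_start": (("yc", "xc"), np.zeros((2, 2)), {"grid_mapping": "crs"}),
        "breakup_start": (("yc", "xc"), np.zeros((2, 2))),
        "crs": ((), np.int32(0), {"grid_mapping_name": "polar_stereographic"}),
    }
)

print(missing_grid_mappings(demo))
```

Calling `missing_grid_mappings(fubu)` after the loop above should return an empty list if all four indicators were linked.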

#### Time coordinate

It is often the case that we would have a `time` coordinate variable as well. In fact, other datasets having a "day-of-year of some event" data variable have been structured to use a time coordinate variable, such as [this implementation](http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2017/019238.html) shared by a user to the CF mailing list. The full discussion around this can be found [in the 2017 archive](http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2017.txt), but I haven't found anything definitive for this use-case.

For this particular dataset, I think there is a better option than what is offered above.

The issue is with how [CF treats the `time` coordinate](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#time-coordinate) - the required units for time coordinate variables are "\<time unit> since YYYY-MM-DD 00:00:00". That encoding is not user-friendly for this dataset because it complicates comparison among different years, which is a typical use case. 

Instead, we will treat the year dimension as a [discrete axis](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#discrete-axis), wherein it effectively becomes an ordinal variable, which allows the actual data values to retain their more meaningful representation as "day-of-year" instead of "days since 1900-01-01". 
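To illustrate the difference, the sketch below encodes the same calendar date in two different years both ways. The reference date and the event dates are made up for illustration and are not taken from the dataset.

```python
from datetime import date, timedelta

# Hypothetical reference date for a "days since ..." encoding
REFERENCE = date(1900, 1, 1)


def to_days_since(d):
    """Encode a date as an integer number of days since the reference date."""
    return (d - REFERENCE).days


def days_since_to_doy(days):
    """Decode a 'days since' value back to a day-of-year (1-366)."""
    return (REFERENCE + timedelta(days=days)).timetuple().tm_yday


# Made-up event dates: the same calendar day in two different years
events = {1981: date(1981, 10, 15), 2001: date(2001, 10, 15)}

for year, event_date in events.items():
    encoded = to_days_since(event_date)
    print(year, encoded, days_since_to_doy(encoded))
```

The "days since" values differ by roughly twenty years' worth of days and need decoding before the two years can be compared, while the day-of-year values are directly comparable.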

**`xarray` implementation**

All we really need to do in this case is add some naming attributes to the `year` coordinate variable. We will just do that in the next section, "Variable naming attributes", where we do the same for the other variables.

#### Variable naming attributes

CF conventions provide two attributes for describing data variables: `standard_name` and `long_name`. Neither is required, but both are strongly encouraged, so we will include them where possible. The conventions include a set of [standard names](http://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html) to use if applicable, but the quantities we are working with here (day of year of freeze-up/break-up) do not match any of the names provided. 

**`xarray` implementation**

We will just create a `long_name` attribute for each variable. Since the unit we have, "day-of-year", is not a valid UDUNITS unit, we will omit the `units` attribute altogether (it is only encouraged, not required) and provide more info about these variables via the `comment` attribute:

In [222]:
for indicator in indicator_names:
    group, status = indicator.split("_")
    fubu[indicator].attrs["long_name"] = f"Day-of-year of {group[:-2].title()}-up {status}"
    # also explain the missing 'units' attribute in a comment
    fubu[indicator].attrs["comment"] = "No 'units' attribute is provided for this variable because UDUNITS does not have support for a 'day-of-year' variable. See the '' global attribute "
    
fubu["freezeup_start"].attrs

{'grid_mapping': 'crs',
 'long_name': 'Day-of-year of Freeze-up start',
 'comment': "No 'units' attribute is provided for this variable because UDUNITS does not have support for a 'day-of-year' variable. See the '' global attribute "}

Likewise, there is no standard name for "year" or "sea ice year", and we do not wish to represent year in the canonical way, but rather as ordered categories. Add a long name and a comment explaining why we are omitting units:

In [223]:
fubu["year"].attrs["long_name"] = "year of indicator observation"
fubu["year"].attrs["comment"] = "This variable corresponds to the 'year' discrete axis. It gives the integer value of the year (standard calendar) in which each indicator was defined, which is more straightforward than encoding the year of observation in the 'days since <timestamp>' format used for time coordinates with bounds."
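Before moving on, it can help to list which naming attributes each variable carries. The helper below is a hypothetical sketch (`naming_report` is not an `xarray` function), shown on a toy dataset rather than on `fubu`.

```python
import numpy as np
import xarray as xr


def naming_report(ds):
    """Map each variable name to flags for standard_name, long_name and units."""
    keys = ("standard_name", "long_name", "units")
    return {
        name: {key: key in var.attrs for key in keys}
        for name, var in ds.variables.items()
    }


# Toy stand-in: one described data variable and an undescribed coordinate
demo = xr.Dataset(
    {
        "freezeup_start": (
            ("year",),
            np.array([250.0, 260.0]),
            {"long_name": "Day-of-year of Freeze-up start"},
        )
    },
    coords={"year": ("year", np.array([1979, 1980]))},
)

report = naming_report(demo)
for name, flags in report.items():
    print(name, flags)
```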

#### Global attributes

CF recommends setting the following global attributes to describe the dataset (the conventions treat them as recommended rather than strictly required):

* `Conventions`
* `title`
* `institution`
* `source`
* `history`
* `references`
* `comment`
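A quick check for the attributes listed above can be sketched as follows. Both helpers are hypothetical, and the timestamped format of the `history` entry follows CF's suggestion that each line of `history` begin with a timestamp. The exact wording of the entry is an assumption.

```python
from datetime import datetime, timezone

# The global attributes listed above
CF_GLOBAL_ATTRS = (
    "Conventions", "title", "institution", "source",
    "history", "references", "comment",
)


def missing_global_attrs(attrs):
    """Return the listed attributes that are absent or empty in an attrs dict."""
    return [key for key in CF_GLOBAL_ATTRS if not attrs.get(key)]


def append_history(attrs, message):
    """Return a copy of attrs with a timestamped line appended to 'history'."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    entry = f"{stamp}: {message}"
    updated = dict(attrs)
    previous = attrs.get("history", "")
    updated["history"] = f"{previous}\n{entry}" if previous else entry
    return updated


# Toy attrs dict with some entries missing or empty
demo_attrs = {"Conventions": "CF-1.8", "title": "demo", "source": ""}
print(missing_global_attrs(demo_attrs))

demo_attrs = append_history(demo_attrs, "added CF metadata")
print(missing_global_attrs(demo_attrs))
```

The same functions can be applied to `fubu.attrs`, since `xarray` exposes dataset attributes as a dict.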

**`xarray` implementation**

These can be set up in a dict and assigned to the `attrs` of the `Dataset`:

In [9]:
fubu_attrs = {
    "Conventions": "CF-1.8",
    "title": "Arctic sea ice freeze-up and break-up dates derived from passive microwave satellite data, 1979-2018",
    "institution": "Scenarios Network for Alaska and Arctic Planning, International Arctic Research Center, University of Alaska Fairbanks",
    "source": "",
    "comment": "This dataset was developed as an extension of the work presented in Johnson and Eicken (2016, see 'references' attribute).\n",
    "references": "Mark Johnson, Hajo Eicken; Estimating Arctic sea-ice freeze-up and break-up from the satellite record: A comparison of different approaches in the Chukchi and Beaufort Seas. Elementa: Science of the Anthropocene 1 January 2016; 4 000124. doi: https://doi.org/10.12952/journal.elementa.000124"
}

# add some more info about the methods used to derive the data in the comment
with open("indicators_criteria.txt", mode="r") as f:
    fubu_attrs["comment"] += f.read()

fubu.attrs = fubu_attrs

print(fubu.attrs["comment"])

This dataset was developed as an extension of the work presented in Johnson and Eicken (2016, see 'references' attribute).

The indicators variables in this dataset we defined as follows:

Freeze-up Start
    Valid date range: 9/1 - 1/31
    Threshold: sum of mean and standard deviation of daily summer sea ice concentration (SIC), with a minimum of 15%
    Definition: day-of-year of first day the daily SIC exceeds the threshold
    Undefined if:
        a. daily SIC never exceeds this threshold
        b. mean summer SIC is > 25%
        c. subsequent Freeze-up End is not defined 

Freeze-up End
    Valid date range: 9/1 - 2/28
    Threshold: mean SIC in winter minus 10%, with a minimum of 15% and a maximum or 50%
    Definition: day-of-year of the first day after the Freeze-up Start date where the daily SIC exceeds the threshold for the following two weeks
    Undefined if: 
        a. daily SIC is above this threshold for every day of the search period
        b. Freeze-up Start is n

We can now test compliance with the conventions via the `cfchecks` command line utility, which is provided by the `cfchecker` package and can be installed via `pip install cfchecker`.

In [226]:
fubu.to_netcdf("test.nc")

In [228]:
# installed via pip install cfchecker

!cfchecks test.nc

CHECKING NetCDF FILE: test.nc
Using CF Checker Version 4.1.0
Checking against CF Version CF-1.8
Using Standard Name Table Version 77 (2021-01-19T13:38:50Z)
Using Area Type Table Version 10 (23 June 2020)
Using Standardized Region Name Table Version 4 (18 December 2018)


------------------
Checking variable: freezeup_start
------------------
INFO: (3.1): No units attribute set.  Please consider adding a units attribute for completeness.

------------------
Checking variable: freezeup_end
------------------
INFO: (3.1): No units attribute set.  Please consider adding a units attribute for completeness.

------------------
Checking variable: breakup_start
------------------
INFO: (3.1): No units attribute set.  Please consider adding a units attribute for completeness.

------------------
Checking variable: breakup_end
------------------
INFO: (3.1): No units attribute set.  Please consider adding a units attribute for completeness.

------------------
Checking variable: xc
-------------