Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle non-canonical NetCDF axis order #89

Closed
dblodgett-usgs opened this issue Jan 8, 2019 · 16 comments
Closed

Handle non-canonical NetCDF axis order #89

dblodgett-usgs opened this issue Jan 8, 2019 · 16 comments

Comments

@dblodgett-usgs
Copy link
Contributor

While XYZT is the typical axis order, it is by no means a guarantee. I tried permuting the dimensions of a file to be TYX and convinced read_ncdf to give me a very odd result. We should be relying on the dimension ids of variables to determine the axis order and enforcing a canonical axis order after reading out of NetCDF.

I'll see if I can work up a PR that handles this.

@edzer
Copy link
Member

edzer commented Jan 8, 2019

stars doesn't prescribe any order (or names), but has the labels "x" and "y" behind the x and y (spatial raster) dimensions, if there are any. It has (and uses) an aperm method to reorder when needed (e.g. by image/plot).

@dblodgett-usgs
Copy link
Contributor Author

Right. So it's important that we get the x and y spatial dimensions into the right slots in the st_dimensions of a stars object. I guess that's not quite a canonical axis order, but it is in a sense.

@edzer
Copy link
Member

edzer commented Mar 1, 2019

Do we have a test case or example for this? @adrfantini I recall a netcdf file that took infinite time to read through GDAL, was that one with non-standard axis order?

@adrfantini
Copy link

Do we have a test case or example for this? @adrfantini I recall a netcdf file that took infinite time to read through GDAL, was that one with non-standard axis order?

To be honest I do not recall this happening to me, but I might be wrong. Can you find the issue?

@dblodgett-usgs
Copy link
Contributor Author

This file had the axis order switched. https://github.com/r-spatial/stars/pull/88/files#diff-ae514b50c29888928e1d25553d71ec03

But I would actually say this should be closed until it becomes an issue with a real file. Theoretically people shouldn't be doing this. I know they do, but they shouldn't.

@adrfantini
Copy link

adrfantini commented Mar 3, 2019

I feel like stars should not aim at supporting generic netCDF files. stars should aim at fully supporting CF-compliant netCDF files. Workarounds for supporting non-compliant files should not be automatic in stars, instead we should make it so that manual user workarounds (e.g. redefining which dimensions are spatial) are possible with not too much effort.

EDIT: the difference between supporting generic and CF-compliant netCDF files should also be clearly stated in the docs.

@dblodgett-usgs
Copy link
Contributor Author

I 100% agree with this. This issue of axis order is somewhat grey though. COARDS relies on axis order and axis/dim name to link coordinate variables to dimensions. CF extends things to allow more generic treatment of axes so axis order can't always be assumed. In practice people (almost never) mess it up -- but I'll admit that I have -- and the files still work with CF-compliant implementations.

@edzer
Copy link
Member

edzer commented Mar 3, 2019

I also agree. OTOH, the file read comes in as

> r = read_ncdf('bcsd_obs_1999_borked.nc')
> r
stars object with 3 dimensions and 2 attributes
attribute(s):
   pr [mm/m]         tas [C]      
 Min.   :  0.59   Min.   :-0.421  
 1st Qu.: 56.14   1st Qu.: 8.899  
 Median : 81.88   Median :15.658  
 Mean   :101.26   Mean   :15.489  
 3rd Qu.:121.07   3rd Qu.:21.780  
 Max.   :848.55   Max.   :29.386  
 NA's   :7116     NA's   :7116    
dimension(s):
          from to offset delta  refsys point                    values    
time         1 12     NA    NA POSIXct    NA 1999-01-31,...,1999-12-31 [x]
longitude    1 81    -85 0.125      NA    NA                      NULL [y]
latitude     1 33     33 0.125      NA    NA                      NULL    

i.e. with [x] and [y] set wrongly. Pretty simple heuristics, e.g. look at dimension names and/or refsys, would set them right. Doing it now manually is a one-liner:

> attr(attr(r, "dimensions"), "raster")$dimensions = c("longitude", "latitude")
> r
stars object with 3 dimensions and 2 attributes
attribute(s):
   pr [mm/m]         tas [C]      
 Min.   :  0.59   Min.   :-0.421  
 1st Qu.: 56.14   1st Qu.: 8.899  
 Median : 81.88   Median :15.658  
 Mean   :101.26   Mean   :15.489  
 3rd Qu.:121.07   3rd Qu.:21.780  
 Max.   :848.55   Max.   :29.386  
 NA's   :7116     NA's   :7116    
dimension(s):
          from to offset delta  refsys point                    values    
time         1 12     NA    NA POSIXct    NA 1999-01-31,...,1999-12-31    
longitude    1 81    -85 0.125      NA    NA                      NULL [x]
latitude     1 33     33 0.125      NA    NA                      NULL [y]

@dblodgett-usgs
Copy link
Contributor Author

Yeah. It can also be figured out by linking the coordinate variable to the dimension id they are on then seeing that the data variable is defined YXT rather than XYT and correcting right up front.

@adrfantini
Copy link

@edzer I think that we should provide a simpler one-liner function, other than attr(attr(r, "dimensions"), "raster")$dimensions = c("longitude", "latitude"). Dealing with the attributes of the attributes is definitely not user-friendly. Something with st_dimensions() <- or a separate function (st_xy_dimensions)?

Note that the CF conventions say:

we recommend, but do not require (see Section 1.4, "Relationship to the COARDS Conventions" ), those dimensions to appear in the relative order T, then Z, then Y, then X in the CDL definition corresponding to the file. All other dimensions should, whenever possible, be placed to the left of the spatiotemporal dimensions.

So it is recommended, but NOT mandatory, to set a proper dimension ordering.
Also note that the file which @edzer is using as an example does have improper dimension ordering, but it also has a correctly defined coordinates = "time latitude longitude " attribute. This attribute is not mandatory, but if present has precedence over coordinates variables. The conventions say (search for coordinates attribute):

There are two methods used to identify variables that contain coordinate data. The first is to use the NUG-defined "coordinate variables." The use of coordinate variables is required for all dimensions that correspond to one dimensional space or time coordinates . In cases where coordinate variables are not applicable, the variables containing coordinate data are identified by the coordinates attribute.

However...

There is no restriction on the order in which the auxiliary coordinate variables appear in the coordinates attribute string.

So, in short, in the CF-conventions:

  • coordinates should by TZYX, but it is NOT mandatory. I would say that the vast majority of files respects this.
  • an additional coordinates attribute is possible which overrides the dimensions; this is particularly useful for curvilinear grids and netCDF features (e.g. time-series, see Support for reading station data timeseries (NetCDF example) #30 ) and should be used by stars for this.
  • the coordinates attribute also does not enforces any particular order

IMHO stars could do this:

  • Have some heuristics to check coordinates. Either assume TZYX and warn if heuristics say otherwise, or use heuristics directly and warn the user anyway. Do not warn if heuristics and TZYX agree.
  • The order of coordinates, if present, has precedence over dimension order and should be used for heuristics
  • In case something is messed up, a simple one-liner to set spatial dimensions should be available.

@dblodgett-usgs
Copy link
Contributor Author

dblodgett-usgs commented Apr 24, 2019

I just introduced the ncmeta function get_coord_vars in #171 which returns something like:

nc_coord_var(f)
#> # A tibble: 5 x 6
#>   variable       X     Y     Z     T     bounds   
#>   <chr>          <chr> <chr> <chr> <chr> <chr>    
#> 1 RAINNC_present XLONG XLAT  NA    Time  NA       
#> 2 Time           NA    NA    NA    Time  time_bnds
#> 3 T2_present     XLONG XLAT  NA    Time  NA       
#> 4 U10_present    XLONG XLAT  NA    Time  NA       
#> 5 V10_present    XLONG XLAT  NA    Time  NA  

It has a bunch of logic to figure out which variables belong with which coordinate.

There is still the issue of what NetCDF dimension order each variable and coordinate variable are defined on. This is really only an issue when lat and lon are 2d. e.g there is a chance that lat is defined [j,i] and a variable is defined [k,i,j]. We could debate if a function needs to be added to ncmeta to provide the NetCDF dimension order for each variable, or some other convenience to help deal with this nuance. I think all we'd really need is to add the dimids element of the RNetCDF::nc.var.inq() response as a list column in the response of ncmeta::nc_vars(). For the time being, it would be easy enough to add it ad-hoc in stars.

e.g.

nc <- RNetCDF::open.nc(system.file("nc/test_stageiv_xyt.nc", package = "stars"))
add_dimids <- function(nc, vars) {
  vars$dimids <- lapply(vars$name, function(x) RNetCDF::var.inq.nc(nc, x)$dimids)
}

@dblodgett-usgs
Copy link
Contributor Author

This issue has come up again in #199 -- should I implement the fix for it? I tried a while back by read_ncdf was in flux and I wasn't able to get my contribution in.

@edzer
Copy link
Member

edzer commented Aug 8, 2019

It seems much less in flux now - @mdsumner ?

@dblodgett-usgs
Copy link
Contributor Author

I got in touch over in @mdsumner's fork. Am working on it now.

@dblodgett-usgs
Copy link
Contributor Author

Given the work done in recent PRs, I think this can be considered fixed. I've had really good luck reading non-canonical axis order files with the latest implementation.

@edzer
Copy link
Member

edzer commented Sep 30, 2019

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants