Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'sub' does not allow to read all variables from a netcdf file, ignores lat and lon #65

Closed
adrfantini opened this issue Oct 11, 2018 · 22 comments

Comments

@adrfantini
Copy link

adrfantini commented Oct 11, 2018

I am reading the file from #64 : https://drive.google.com/open?id=1YnrUnpydoNdMUUSJ_TrTh-nBxl693dik
I want to read it as a curvilinear grid and as such I need to extract the variables lat and lon, which are bidimensional, and assign the curvilinear values to the tridimensional variable pr (although in this example the time dimension is redundant since it has length 1).

However, it seems that no matter the value of sub, read_stars only wants to read the pr variable:

identical(read_stars('/dev/shm/test.nc', sub='lat'), read_stars('/dev/shm/test.nc', sub='pr'))
[1] TRUE

read_ncdf from Mike's branch instead works, so a perfectly viable workaround exists.

I suspect this is a limitation of the underlying GDAL. Should I just default on reading netCDF with read_ncdf when possible?

@adrfantini adrfantini changed the title 'sub' does not seem to allow me to read all variables from a file 'sub' does not seem to allow me to read all variables from a netcdf file Oct 11, 2018
@adrfantini adrfantini changed the title 'sub' does not seem to allow me to read all variables from a netcdf file 'sub' does not seem to allow to read all variables from a netcdf file Oct 11, 2018
@adrfantini adrfantini changed the title 'sub' does not seem to allow to read all variables from a netcdf file 'sub' does not allow to read all variables from a netcdf file Oct 11, 2018
@adrfantini adrfantini changed the title 'sub' does not allow to read all variables from a netcdf file 'sub' does not allow to read all variables from a netcdf file, ignores lat and lon Oct 11, 2018
@edzer
Copy link
Member

edzer commented Oct 18, 2018

The logic behind the GDAL netcdf driver is this: if the file contains multiple variables, they are presented as sub-datasets, and you can select one of them. If there is only one, as in this case, the driver doesn't return any subdatasets but only a single one, and you cannot select. However, you can explicitly ask GDAL to read it:

> gdal_utils("info", "test.nc")

output:

Driver: netCDF/Network Common Data Format
Files: test.nc
Size is 424, 412
Coordinate System is `'
Metadata:
  NC_GLOBAL#CDI=Climate Data Interface version 1.9.1 (http://mpimet.mpg.de/cdi)
  NC_GLOBAL#CDO=Climate Data Operators version 1.9.1 (http://mpimet.mpg.de/cdo)
  NC_GLOBAL#contact=obc@dmi.dk
  NC_GLOBAL#Conventions=CF-1.6
  NC_GLOBAL#CORDEX_domain=EUR-11
  NC_GLOBAL#creation_date=2012-12-03 03:14:15
  NC_GLOBAL#driving_experiment=ICHEC-EC-EARTH,rcp85,r3i1p1
  NC_GLOBAL#driving_experiment_name=rcp85
  NC_GLOBAL#driving_model_ensemble_member=r3i1p1
  NC_GLOBAL#driving_model_id=ICHEC-EC-EARTH
  NC_GLOBAL#experiment=Scenario run using ECEARTH as driving model
  NC_GLOBAL#experiment_id=rcp85
  NC_GLOBAL#frequency=mon
  NC_GLOBAL#history=Wed Oct 03 11:53:06 2018: cdo -C seltimestep,1 /home/netapp-clima-scratch/fraffael/SPI/EC-EARTH_HIRHAM5/prOK_mon_2071-2098.nc /dev/shm/test.nc
Mon Oct 01 18:02:40 2018: cdo mulc,86400 pr_mon_2071-2098.nc prOK_mon_2071-2098.nc
Mon Oct 01 12:24:12 2018: cdo selyear,2071/2098 pr_mon_2071-2100.nc pr_mon_2071-2098.nc
Mon Oct  1 12:23:45 2018: ncrcat /home/esp-shared-a/RegionalModels/CORDEX/EUR-11/rcp85/MM/EC-EARTH_HIRHAM5/pr_EUR-11_ICHEC-EC-EARTH_rcp85_r3i1p1_DMI-HIRHAM5_v1_mon_207101-208012.nc /home/esp-shared-a/RegionalModels/CORDEX/EUR-11/rcp85/MM/EC-EARTH_HIRHAM5/pr_EUR-11_ICHEC-EC-EARTH_rcp85_r3i1p1_DMI-HIRHAM5_v1_mon_208101-209012.nc /home/esp-shared-a/RegionalModels/CORDEX/EUR-11/rcp85/MM/EC-EARTH_HIRHAM5/pr_EUR-11_ICHEC-EC-EARTH_rcp85_r3i1p1_DMI-HIRHAM5_v1_mon_209101-210012.nc pr_mon_2071-2100.nc
Mon Nov 18 04:53:38 2013: /usr/local/bin/ncks --mk_rec_dmn time pr_EUR-11_ICHEC-EC-EARTH_rcp85_r3i1p1_DMI-HIRHAM5_v1_mon_207101-208012.nc out1.nc
Fri Dec 07 08:23:08 2012: cdo setday,16 NAM-44/DMI/DMI-ECEARTH/rcp85/r3i1p1/DMI-HIRHAM5/v1/mon/tas/tas_NAM-44_DMI-ECEARTH_rcp85_r3i1p1_DMI-HIRHAM5_v1_mon_207101-208012.nc NAM-44/DMI/DMI-ECEARTH/rcp85/r3i1p1/DMI-HIRHAM5/v1/mon/tas/tas_NAM-44_DMI-ECEARTH_rcp85_r3i1p1_DMI-HIRHAM5_v1_mon_207101-208012_sub.nc
Fri Dec 07 08:23:07 2012: cdo monmean NAM-44/DMI/DMI-ECEARTH/rcp85/r3i1p1/DMI-HIRHAM5/v1/mon/tas/tas_NAM-44_DMI-ECEARTH_rcp85_r3i1p1_DMI-HIRHAM5_v1_mon_20710101-20801231.nc NAM-44/DMI/DMI-ECEARTH/rcp85/r3i1p1/DMI-HIRHAM5/v1/mon/tas/tas_NAM-44_DMI-ECEARTH_rcp85_r3i1p1_DMI-HIRHAM5_v1_mon_207101-208012.nc
Fri Dec 07 08:22:56 2012: cdo mergetime NAM-44/DMI/DMI-ECEARTH/rcp85/r3i1p1/DMI-HIRHAM5/v1/day/tas/tas_NAM-44_DMI-ECEARTH_rcp85_r3i1p1_DMI-HIRHAM5_v1_day_20710101-20751231.nc NAM-44/DMI/DMI-ECEARTH/rcp85/r3i1p1/DMI-HIRHAM5/v1/day/tas/tas_NAM-44_DMI-ECEARTH_rcp85_r3i1p1_DMI-HIRHAM5_v1_day_20760101-20801231.nc NAM-44/DMI/DMI-ECEARTH/rcp85/r3i1p1/DMI-HIRHAM5/v1/mon/tas/tas_NAM-44_DMI-ECEARTH_rcp85_r3i1p1_DMI-HIRHAM5_v1_mon_20710101-20801231.nc
  NC_GLOBAL#institute_id=DMI
  NC_GLOBAL#institution=Danish Meteorological Institute
  NC_GLOBAL#model_id=DMI-HIRHAM5
  NC_GLOBAL#NCO="4.6.4"
  NC_GLOBAL#nco_openmp_thread_number=1
  NC_GLOBAL#product=output
  NC_GLOBAL#project_id=CORDEX
  NC_GLOBAL#rcm_version_id=v1
  NC_GLOBAL#tracking_id=e9cc9b3b-9755-400b-b101-49c24b811781
  NETCDF_DIM_EXTRA={time}
  NETCDF_DIM_time_DEF={1,6}
  NETCDF_DIM_time_VALUES=44241.5
  pr#cell_methods=time: mean
  pr#coordinates=lat lon
  pr#long_name=Precipitation
  pr#missing_value=1e+20
  pr#standard_name=precipitation_flux
  pr#units=kg m-2 s-1
  pr#_FillValue=1e+20
  time#axis=T
  time#bounds=time_bnds
  time#calendar=proleptic_gregorian
  time#standard_name=time
  time#units=days since 1949-12-01 00:00:00
Subdatasets:
  SUBDATASET_1_NAME=NETCDF:"test.nc":time_bnds
  SUBDATASET_1_DESC=[1x2] time_bnds (64-bit floating-point)
  SUBDATASET_2_NAME=NETCDF:"test.nc":lon
  SUBDATASET_2_DESC=[412x424] longitude (64-bit floating-point)
  SUBDATASET_3_NAME=NETCDF:"test.nc":lat
  SUBDATASET_3_DESC=[412x424] latitude (64-bit floating-point)
  SUBDATASET_4_NAME=NETCDF:"test.nc":pr
  SUBDATASET_4_DESC=[1x412x424] precipitation_flux (32-bit floating-point)
Geolocation:
  LINE_OFFSET=0
  LINE_STEP=1
  PIXEL_OFFSET=0
  PIXEL_STEP=1
  SRS=GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]]
  X_BAND=1
  X_DATASET=NETCDF:"test.nc":lon
  Y_BAND=1
  Y_DATASET=NETCDF:"test.nc":lat
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0,  412.0)
Upper Right (  424.0,    0.0)
Lower Right (  424.0,  412.0)
Center      (  212.0,  206.0)
Band 1 Block=424x1 Type=Float32, ColorInterp=Undefined
  NoData Value=1.00000002004087734e+20
  Unit Type: kg m-2 s-1
  Metadata:
    cell_methods=time: mean
    coordinates=lat lon
    long_name=Precipitation
    missing_value=1e+20
    NETCDF_DIM_time=44241.5
    NETCDF_VARNAME=pr
    standard_name=precipitation_flux
    units=kg m-2 s-1
    _FillValue=1e+20
and read the sub-dataset as "main" array:
> (lon = read_stars('NETCDF:"test.nc":lon'))
stars object with 2 dimensions and 1 attribute
attribute(s):
 NETCDF:"test.nc":lon [°]
 Min.   :-44.594         
 1st Qu.: -7.067         
 Median : 10.091         
 Mean   : 10.225         
 3rd Qu.: 28.112         
 Max.   : 64.964         
dimension(s):
  from  to offset delta refsys point values    
x    1 424     NA    NA     NA    NA   NULL [x]
y    1 412     NA    NA     NA    NA   NULL [y]
Warning message:
In get_raster(affine = pr$geotransform[c(3, 5)], dimensions = c("x",  :
  setting NA affine values to zero

It's a bit convoluted, but this is how GDAL deals with these data.

I'd be happy to add a curvilinear argument to read_stars that takes two names, say c("lon", "lat"), and then automatically reads these arrays and sets them as the curvilinear coordinate values.

@adrfantini
Copy link
Author

I'd be happy to add a curvilinear argument to read_stars that takes two names, say c("lon", "lat"), and then automatically reads these arrays and sets them as the curvilinear coordinate values.

That would be handy.

edzer added a commit that referenced this issue Oct 18, 2018
@edzer
Copy link
Member

edzer commented Oct 18, 2018

This now works:

> library(stars)
Loading required package: abind
Loading required package: sf
Linking to GEOS 3.6.2, GDAL 2.2.3, PROJ 4.9.3
> (r = read_stars('test.nc', curvilinear = c('lon', 'lat')))
stars object with 3 dimensions and 1 attribute
attribute(s):
    test.nc      
 Min.   : 0.000  
 1st Qu.: 1.171  
 Median : 2.236  
 Mean   : 2.840  
 3rd Qu.: 3.961  
 Max.   :92.606  
dimension(s):
     from  to              offset delta                       refsys point
x       1 424                  NA    NA +proj=longlat +datum=WGS8...    NA
y       1 412                  NA    NA +proj=longlat +datum=WGS8...    NA
time    1   1 2071-01-16 12:00:00    NA                      POSIXct    NA
                                       values    
x    [424x412] -44.5939 [°], ..., 64.9644 [°] [x]
y      [424x412] 21.9878 [°], ..., 72.585 [°] [y]
time                                     NULL    
curvilinear grid
Warning messages:
1: In get_raster(affine = pr$geotransform[c(3, 5)], dimensions = c("x",  :
  setting NA affine values to zero
2: In get_raster(affine = pr$geotransform[c(3, 5)], dimensions = c("x",  :
  setting NA affine values to zero
3: In get_raster(affine = pr$geotransform[c(3, 5)], dimensions = c("x",  :
  setting NA affine values to zero

@adrfantini
Copy link
Author

adrfantini commented Oct 19, 2018

Thanks! Ideally we'll want this to work with read_ncdf too, @mdsumner .
@edzer This seems similar to what xarray does here, did you think of the issue they raise re inferring cell boundaries?

@edzer
Copy link
Member

edzer commented Oct 19, 2018

I also see room for making this automatic: without specifying curvilinear, we see

> (r = read_stars('test.nc'))
stars object with 3 dimensions and 1 attribute
attribute(s):
    test.nc      
 Min.   : 0.000  
 1st Qu.: 1.171  
 Median : 2.236  
 Mean   : 2.840  
 3rd Qu.: 3.961  
 Max.   :92.606  
dimension(s):
     from  to              offset delta  refsys point values    
x       1 424                  NA    NA      NA    NA   NULL [x]
y       1 412                  NA    NA      NA    NA   NULL [y]
time    1   1 2071-01-16 12:00:00    NA POSIXct    NA   NULL    
Warning message:
In get_raster(affine = pr$geotransform[c(3, 5)], dimensions = c("x",  :
  setting NA affine values to zero

i.e. no geotransform; in the metadata returned by gdal however we see a metadata element (in meta_bands) that has a flag coordinates=lon lat, which can be parsed to give defaults for curvilinear so to speak.

I was trying to automate this and looking into the s5p data in starsdata, which is HDF5, and where things are all just a little bit different... will try next week.

@mdsumner
Copy link
Member

I think we should use GDAL to discover the curvilinear arrays, then read and apply them with NetCDF. Rectilnear is a different game, still exploring

@mdsumner
Copy link
Member

To @edzer's last comment, yes GDAL has the best hueristics here and gdalwarp oftens does fine with no manual interpretation. I am still not au fait with GCPs vs geolocation arrays, and maybe there are other types too.

@mdsumner
Copy link
Member

If we can find an example of rectilnear grid that GDAL identifies coordinates for that would clear up a question for me. I think it might be an opportunity to improve that library to our benefit. It's very active in this area recently

@adrfantini
Copy link
Author

If we can find an example of rectilnear grid that GDAL identifies coordinates for that would clear up a question for me. I think it might be an opportunity to improve that library to our benefit. It's very active in this area recently

I'm not sure I understand, can you be a bit more specific? GDAL should be able to understand any regular rectilinear lat-lon grid, for example, but I'm clearly missing something.

@mdsumner
Copy link
Member

Oh that's a good point, I mean particularly non regular 1d arrays - gdal ignores them for the transform, but does it identify them as coordinates formally as it does for 2d ones? I need to show some examples. I definitely need more non-regular rectilinear cases.

@adrfantini
Copy link
Author

adrfantini commented Oct 19, 2018

At your command!
Here is a link to a file coming from an ocean model, which is regular in longitude but non regular in latitude.
Additionally, it has two grids inside since the model has an Arakawa staggered grid and different variables are calculated on the center or edge of the grid point. (I don't use this model so don't know much of it)

GDALINFO:

Warning 1: No UNIDATA NC_GLOBAL:Conventions attribute
Driver: netCDF/Network Common Data Format
Files: /home/esp-shared-a/GlobalModels/CMIP5/OCEAN/Control/CM2.1/CM2.1_regionmask.nc
Size is 512, 512
Coordinate System is `'
Metadata:
  NC_GLOBAL#filename=regionmask.nc
  NC_GLOBAL#MPP_IO_VERSION=$Id: mpp_io.F90,v 6.5.6.1.2.1.2.1.2.1 2003/02/19 15:47:51 fms Exp $
Subdatasets:
  SUBDATASET_1_NAME=NETCDF:"/home/esp-shared-a/GlobalModels/CMIP5/OCEAN/Control/CM2.1/CM2.1_regionmask.nc":tmask
  SUBDATASET_1_DESC=[200x360] tmask (32-bit floating-point)
  SUBDATASET_2_NAME=NETCDF:"/home/esp-shared-a/GlobalModels/CMIP5/OCEAN/Control/CM2.1/CM2.1_regionmask.nc":umask
  SUBDATASET_2_DESC=[200x360] umask (32-bit floating-point)
  SUBDATASET_3_NAME=NETCDF:"/home/esp-shared-a/GlobalModels/CMIP5/OCEAN/Control/CM2.1/CM2.1_regionmask.nc":geolon_t
  SUBDATASET_3_DESC=[200x360] geolon_t (32-bit floating-point)
  SUBDATASET_4_NAME=NETCDF:"/home/esp-shared-a/GlobalModels/CMIP5/OCEAN/Control/CM2.1/CM2.1_regionmask.nc":geolon_c
  SUBDATASET_4_DESC=[200x360] geolon_c (32-bit floating-point)
  SUBDATASET_5_NAME=NETCDF:"/home/esp-shared-a/GlobalModels/CMIP5/OCEAN/Control/CM2.1/CM2.1_regionmask.nc":geolat_t
  SUBDATASET_5_DESC=[200x360] geolat_t (32-bit floating-point)
  SUBDATASET_6_NAME=NETCDF:"/home/esp-shared-a/GlobalModels/CMIP5/OCEAN/Control/CM2.1/CM2.1_regionmask.nc":geolat_c
  SUBDATASET_6_DESC=[200x360] geolat_c (32-bit floating-point)
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0,  512.0)
Upper Right (  512.0,    0.0)
Lower Right (  512.0,  512.0)
Center      (  256.0,  256.0)

@mdsumner
Copy link
Member

Cool, that's a good one - definitely haven't seen that kind of grid in CMIP5 yet - and the Arakawa grids I've used before are way more complicated than this.

But, it looks like gridlon_t and gridlat_t only apply in a limited area? (Perhaps the central lines?). Can you map with only those axes ? (doesn't seem so)

@mdsumner
Copy link
Member

Or, are they pertinent only between a range of lats? (Greenland is obviously wrong, and the grid is less dense there as per geolon/lat_t, so does it apply only south of that, as per this mask classification?).

image

It's not that good a data set because there's no way to know down south what anything is, do you have a data set that uses these coordinates and not just the mask?

@adrfantini
Copy link
Author

I really do not know much, this is an ocean model and they might have changed the land/topography/bathymetry to better suite their flows... and I don't have any idea what those variables are for. Here is the same dataset with 2 timesteps for SST, in case you need it.

@mdsumner
Copy link
Member

Well that's interesting, there's a relationship between the mask and the sst, but not by anything explicit (300 and 360 is clear, but that's pretty tenuous). The xt in the sst file is not regular, but gridlon_t in the mask file is. They just don't seem to be related, so I'm still searching for some missing insight.

At any rate the SST data is more useful as an example, so I'm satisfied - thanks!

@adrfantini
Copy link
Author

adrfantini commented Oct 19, 2018

Yeah I'm sorry I can't dig more into it right now, I just went scavenging for a suitable example in our archive, but I don't know anyone using those models right now. But yeah, the two grids actually are slightly different, despite having similar names and being in the same folder... they might come from different simulations, I don't know.

@mdsumner
Copy link
Member

mdsumner commented Oct 22, 2018

A PR update to apply curvilinear to read_ncdf #68 ping @edzer

@edzer
Copy link
Member

edzer commented Oct 22, 2018

Did you close the PR inadvertently?

@mdsumner
Copy link
Member

ah yes, was right in the crux of the GH outage - I'm still getting some confusing behaviour, but probably quirks of caching here and there.

#69

In the meantime I cleaned up, and will try to stick to "mdsumner-dev" branch on my side for PRs.

@adrfantini
Copy link
Author

@edzer
I'm playing around with a dataset which I read as curvilinear lat-lon, and the subsetting with stars_dataset[sf_object] seems to fail with:

Error in xy_from_colrow(x, geotransform, inverse = TRUE) : 
  geotransform not invertible

MRE with the above file:

r = read_stars('test (2).nc', curvilinear = c('lon', 'lat'))
sf = st_sf(a=1, st_geometry(st_point(c(0, 30))), crs=4326) %>% st_buffer(1)
r[sf]

edzer added a commit that referenced this issue Oct 27, 2018
@edzer
Copy link
Member

edzer commented Oct 27, 2018

Should work now; note that cropping doesn't work (happy to receive a PR on this if there is a need).

@adrfantini
Copy link
Author

Sorry I'm late. Can confirm working. Cropping would be nice but not a priority for me, for now (since I transform to sf anyway).

@edzer edzer closed this as completed Dec 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants