
NaN/changed values in output when only reading data, saving and reading again #5490

Closed
lthUniBonn opened this issue Jun 18, 2021 · 9 comments

@lthUniBonn

What happened: When combining monthly ERA5 data and saving it for individual single locations, different values/NaN values appear when reading the single-location file back in.

What you expected to happen: Both should be the same. This works, e.g., when only one month is read.

Minimal Complete Verifiable Example:

import xarray as xr  # using version 0.18.2
import numpy as np
import dask

# Use the synchronous scheduler (only as many threads as requested CPUs; only
# one is requested, more threads don't seem to be used). This is needed on the
# cluster I work on, but I'm keeping it here in case it is relevant.
dask.config.set(scheduler='synchronous')

model_level_file_name_format = "{:d}_europe_{:d}_130_131_132_133_135.nc"
ml_files = [model_level_file_name_format.format(2012, 9),
            model_level_file_name_format.format(2012, 10)]
ds = xr.open_mfdataset(ml_files, decode_times=True)

# Select single-location data
lons = ds['longitude'].values
lats = ds['latitude'].values
i_lat, i_lon = 27, 30
ds_loc = ds.sel(latitude=lats[i_lat], longitude=lons[i_lon])

# Save to file
loc_file = 'europe_i_lat_{i_lat}_i_lon_{i_lon}.nc'.format(i_lat=i_lat, i_lon=i_lon)
ds_loc.to_netcdf(loc_file)

# Read in again
ds_loc_1 = xr.open_dataset(loc_file, decode_times=True)

print('Test all q values same: ', np.all(ds_loc.q.values == ds_loc_1.q.values))

Anything else we need to know?: I tested this using these two months; many times saving the output works, or the values are slightly different (in the 6th digit). Using a larger timespan (2010-2012), even NaN values appear. The issue is not clearly restricted to the q variable; I've not yet found the pattern.
I've included a more detailed assessment (output, data, code) at https://uni-bonn.sciebo.de/s/OLHhid8zJg65IFB:

  • only one month: no discrepancies
  • two months: discrepancies (in the second month)
  • 2010-2013: discrepancies and NaN values

I'm not sure where the issue might come from, but as the data is read in correctly at first, it does not seem to be on that side, which would then only leave the process of writing the netCDF output in xarray. I've tested this for a few years, and for two months I always get the result that not all q values are the same. Since I'm not sure where the problem lies, I'm not sure where to start for a more minimal example. Hope this is ok.
Cheers, Lavinia

Environment:
INSTALLED VERSIONS

commit: None
python: 3.9.4 | packaged by conda-forge | (default, May 10 2021, 22:13:33)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.25.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.0

xarray: 0.18.2
pandas: 1.2.4
numpy: 1.20.3
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.06.0
distributed: 2021.06.0
matplotlib: 3.4.2
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.2
conda: None
pytest: None
IPython: None
sphinx: None

@d70-t
Contributor

d70-t commented Jun 18, 2021

Are your input files on (exactly) the same grid? If not, combining the files might introduce NaNs to fill up mismatching cells. Furthermore, if you are working with NaNs, are you aware of:

In [1]: import numpy as np

In [2]: np.nan == np.nan
Out[2]: False

This is as it should be, per IEEE 754.
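Because of this, a check like the `np.all(ds_loc.q.values == ds_loc_1.q.values)` in the example above reports False whenever both arrays contain NaN in the same positions. A minimal sketch of NaN-aware comparisons in numpy:

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])
b = np.array([1.0, np.nan, 3.0])

# Elementwise == treats NaN as unequal to itself (IEEE 754)
print(np.all(a == b))                        # False

# NaN-aware alternatives treat aligned NaNs as equal
print(np.array_equal(a, b, equal_nan=True))  # True (exact comparison)
print(np.allclose(a, b, equal_nan=True))     # True (with tolerance)
```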

When writing out the files to netCDF, do you accidentally convert from 64bit float to 32bit float?

@lthUniBonn
Author

lthUniBonn commented Jun 18, 2021

Yes, they are generated on a 0.25x0.25 lat/lon grid in Europe, so the grids match (and when reading the original files there are no NaNs, which I think excludes this option).

The "all q values are the same" test is not meant for the case where I even find NaNs, but for the one where I don't see them. I should have included the output I get; see below, e.g., for the last test I ran.

It says that both the original and the read-back data are float32; that's what confuses me. I had also expected a difference in data type to be responsible, but at first glance that does not seem to be the case here.

Below that output I print a timespan of the original and the second dataset, where the values clearly differ in the last few digits. I can also include the test where it even returns NaN in some places. The full testing code and data are in the link if you want to see that, or I can post it here.

original
<xarray.Dataset>
Dimensions:    (level: 26, time: 1464)
Coordinates:
    longitude  float32 10.0
    latitude   float32 38.0
  * level      (level) int32 112 113 114 115 116 117 ... 132 133 134 135 136 137
  * time       (time) datetime64[ns] 2014-09-01 ... 2014-10-31T23:00:00
Data variables:
    t          (time, level) float32 dask.array<chunksize=(720, 26), meta=np.ndarray>
    q          (time, level) float32 dask.array<chunksize=(720, 26), meta=np.ndarray>
    u          (time, level) float32 dask.array<chunksize=(720, 26), meta=np.ndarray>
    v          (time, level) float32 dask.array<chunksize=(720, 26), meta=np.ndarray>
Attributes:
    Conventions:  CF-1.6
    history:      2021-02-05 01:00:40 GMT by grib_to_netcdf-2.16.0: /opt/ecmw...


read back in
<xarray.Dataset>
Dimensions:    (level: 26, time: 1464)
Coordinates:
    longitude  float32 ...
    latitude   float32 ...
  * level      (level) int32 112 113 114 115 116 117 ... 132 133 134 135 136 137
  * time       (time) datetime64[ns] 2014-09-01 ... 2014-10-31T23:00:00
Data variables:
    t          (time, level) float32 ...
    q          (time, level) float32 ...
    u          (time, level) float32 ...
    v          (time, level) float32 ...
Attributes:
    Conventions:  CF-1.6
    history:      2021-02-05 01:00:40 GMT by grib_to_netcdf-2.16.0: /opt/ecmw...
-----------------------------------
test for nan - np.any(np.isnan(array))
original
q:  False t:  False u:  False v:  False
read back in
q:  False t:  False u:  False v:  False
-----------------------------------
look at one of the problematic portions: (q.values[timespan] - values for same timespan original and read back in)
original
[0.01286593 0.01290165 0.01218289 0.01229404 0.01238789 0.0125237
 0.01275251 0.01274316 0.01292717 0.01308822 0.01309219 0.01304683
 0.01299834 0.01299749 0.01267057 0.01274089 0.01281064 0.01282141
 0.01286848 0.01291271 0.01302868 0.01290676 0.01276612 0.01273976
 0.01273635 0.01271169 0.01244998 0.01250867 0.01229999 0.01256708
 0.01265356 0.01276471 0.01274259 0.01243155 0.01195124 0.01166572
 0.01124779 0.01097304 0.01091747 0.01098779 0.01105896 0.01114317
 0.01122823 0.01133569 0.01147207 0.01155231 0.01154834 0.01154579
 0.01155486 0.01158009 0.0114715  0.01169464 0.01170598 0.01151034
 0.01124751 0.01127246 0.01125374 0.01128862 0.01127643 0.0112631
 0.01126225 0.01126594 0.01154182 0.01162574 0.01169833 0.01176354
 0.01183301 0.01184066 0.01187781 0.01194756 0.01208564 0.01224102
 0.01244346 0.01260706 0.01236549 0.01256538 0.0127528  0.01287415
 0.01304286 0.01327876 0.01366919 0.01396406 0.0142683  0.01445004
 0.01449626 0.01438228 0.01404204 0.01419486 0.01447329 0.01472309
 0.01493943 0.01512514 0.01532986 0.01552691 0.01566074 0.01577302
 0.01581669 0.015832   0.01564515 0.01568768]
read back in
[0.01286582 0.01290182 0.01218301 0.01229396 0.01238785 0.01252367
 0.01275264 0.01274299 0.01292705 0.01308811 0.01309219 0.01304692
 0.0129983  0.01299756 0.01267063 0.01274076 0.01281053 0.01282129
 0.01286842 0.01291258 0.01302873 0.01290664 0.012766   0.01273965
 0.01273631 0.01271182 0.01244982 0.01250883 0.0122999  0.01256709
 0.01265355 0.01276488 0.01274262 0.01243164 0.01195107 0.0116657
 0.01124785 0.01097287 0.01091757 0.01098771 0.01105896 0.0111432
 0.01122818 0.0113358  0.01147199 0.01155215 0.01154844 0.01154584
 0.01155474 0.01157998 0.01147162 0.01169465 0.01170615 0.01151021
 0.01124748 0.01127234 0.01125379 0.01128867 0.01127642 0.01126306
 0.01126232 0.01126603 0.01154175 0.01162562 0.01169836 0.01176367
 0.01183306 0.01184049 0.01187797 0.01194773 0.01208578 0.0122409
 0.01244352 0.01260717 0.01236559 0.01256523 0.01275264 0.01287398
 0.01304283 0.01327885 0.01366924 0.01396389 0.01426819 0.01445002
 0.01449641 0.01438211 0.01404219 0.01419471 0.0144734  0.01472315
 0.0149395  0.01512505 0.01532989 0.01552694 0.01566091 0.01577298
 0.01581677 0.01583198 0.01564532 0.01568763]
timespan:
2014-10-04T08:00:00.000000000 2014-10-08T11:00:00.000000000
Test all q values same:  False

@d70-t
Contributor

d70-t commented Jun 18, 2021

I've checked your example files. This is mostly related to the fact that the original data is encoded as short and uses scale_factor and add_offset:

In [35]: ds_loc.q.encoding
Out[35]: 
{'source': '/private/tmp/test_xarray/Minimal_test_data/2012_europe_9_130_131_132_133_135.nc',
 'original_shape': (720, 26, 36, 41),
 'dtype': dtype('int16'),
 'missing_value': -32767,
 '_FillValue': -32767,
 'scale_factor': 3.0672840096982675e-07,
 'add_offset': 0.010050721147263318}

Probably the scaling and adding are carried out in float64 but then rounded down to float32. When storing the dataset back to netCDF, xarray re-uses the information from the encoding attribute and goes back to int16, possibly creating even more rounding errors. Reading the data back in is then no longer reproducible.

Possibly related issues are #4826 and #3020

@keewis
Collaborator

keewis commented Jun 18, 2021

Related to that, there's also #5082, which proposes to drop the encoding more aggressively.

@lthUniBonn
Author

lthUniBonn commented Jun 19, 2021

> Probably the scaling and adding is carried out in float64, but then rounded down to float32.

This refers to how xarray reads the netCDF (and not something to do with the original data), right?

Is there a way to avoid this by not scaling/adding in the first place? If only the integer values were read, selected by index, and saved again, this should not happen anymore, right? I could try decode_cf=False for this...

@kmuehlbauer
Contributor

@lthUniBonn You would need to use kwarg mask_and_scale=False in the call to open_dataset. decode_cf=False will work too, but it completely disables any cf decoding.
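A self-contained sketch of that approach, with two tiny synthetic packed files standing in for the monthly ERA5 inputs (file names, coordinates, and packing parameters are made up). With mask_and_scale=False the packed int16 values pass through untouched, so selecting and re-saving them is lossless; note, however, that combining raw integers is only safe if all input files share the same scale_factor/add_offset:

```python
import numpy as np
import xarray as xr

# Create two tiny CF-packed files as stand-ins for the monthly inputs
for name, t0 in [("month_a.nc", 0), ("month_b.nc", 3)]:
    ds = xr.Dataset(
        {"q": (("time",), np.array([100, 200, 300], dtype="int16"))},
        coords={"time": t0 + np.arange(3)},
    )
    ds["q"].attrs.update(scale_factor=1e-5, add_offset=0.01)
    ds.to_netcdf(name)

# mask_and_scale=False keeps the raw int16 values;
# scale_factor/add_offset stay as plain attributes
combined = xr.open_mfdataset(["month_a.nc", "month_b.nc"], mask_and_scale=False)
print(combined["q"].dtype)  # int16

# Writing and re-reading the packed integers round-trips exactly
combined.to_netcdf("combined.nc")
back = xr.open_dataset("combined.nc", mask_and_scale=False)
print(np.array_equal(combined["q"].values, back["q"].values))  # True
```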

@kmuehlbauer
Contributor

Xref: #5739

@kmuehlbauer
Contributor

This is indeed an issue with scale_factor and add_offset as @d70-t has already mentioned.

That is not a problem per se, but those attributes are obviously different for different files. When concatenating, only the first file's attributes survive. That might already be the source of the above problem, as it can slightly change values.

An even bigger problem arises when the dynamic ranges of the decoded data (min/max) don't overlap. Then the data might be folded from the lower border to the upper border, or vice versa.
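This folding can be reproduced with plain numpy. A minimal sketch with hypothetical packing parameters (the real ERA5 values differ per file):

```python
import numpy as np

# Hypothetical CF packing parameters for two files whose value ranges differ
scale_a, offset_a = 1e-7, 0.010   # file 1 -- its parameters survive the concat
scale_b, offset_b = 1e-7, 0.020   # file 2

# A value from file 2, decoded with its own parameters
value = np.int16(1000) * scale_b + offset_b          # ~0.0201

# Re-packing it with file 1's parameters exceeds the int16 range ...
repacked = int(round((value - offset_a) / scale_a))  # 101000 > 32767
# ... so the cast wraps and the value "folds" to the wrong end of the range
folded = np.array(repacked).astype(np.int16)
print(folded * scale_a + offset_a)                   # ~0.007, far from 0.0201

# Even when the ranges do overlap, re-packed values snap to file 1's
# quantization grid, shifting them by up to scale/2 (the 6th-digit
# differences shown earlier in this thread).
```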

I've put an example into #5739. The suggestion for now is, as in @keewis' comment, to drop the encoding in such cases and use floating-point values for writing. You can then use the available compression options for floating-point data.

@kmuehlbauer
Contributor

It seems the situation is clear. Please reopen if there is more to discuss.
