
xr.to_netcdf() alters time dimension #8542

Closed
lohae opened this issue Dec 11, 2023 · 10 comments
Labels
topic-metadata Relating to the handling of metadata (i.e. attrs and encoding)

Comments

@lohae

lohae commented Dec 11, 2023

What is your issue?

Hi!
I was downloading some data from single files (15 min temporal resolution with some smaller gaps here and there) and wanted to save it for further processing. If I reopen the netCDF file, the time dimension is distorted in a way I cannot really understand. Basically, the 15 min spacing turns into something between the first timestamp (e.g. 2018-01-01 00:00:00) and a time only a few hours later. The timestamps are also unordered, as the latest time appears somewhere in the middle of the coordinate.

My steps are basically a download script, which is not really reproducible as it needs login tokens, but afterwards everything is purely xarray:

[screenshot: the xarray code used to open and combine the downloaded files]

Then I simply call ds.to_netcdf('filename.nc'), and when I re-open it with xr.open_dataset('filename.nc') I get the funny data below, where ds.time.max() is array('2018-01-01T09:06:04.000000000', dtype='datetime64[ns]') with argmax=array(7383, dtype=int64), so the coordinate is not even monotonically increasing.

[screenshot: the distorted time coordinate after re-opening the file]

Interestingly, saving the dataset, re-opening it, assigning the correct time values from the dataset I still had in memory (ds.assign_coords(time=correct_time)) and then saving it again seems to be a workaround, but I would like to understand whether it is me missing something or whether this might be a bug. I had to re-download quite a lot of data because of this, as I was not able to recover the correct time dimension from the altered one. If I open the corrupted file with decode_times=False it gives me seconds since 2018-01-01 with only len(np.unique(ds.time)) = 16384, whereas len(ds.time) = 174910.
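A rough sketch of the round trip and the workaround described above (the file names are placeholders and correct_time is simply the time coordinate of the in-memory dataset, not my actual script):

import xarray as xr

ds.to_netcdf('filename.nc')                # write the combined dataset
ds2 = xr.open_dataset('filename.nc')       # the time coordinate comes back distorted

# workaround: overwrite the distorted times with the ones that were correct in memory
correct_time = ds.time
ds2 = ds2.assign_coords(time=correct_time)
ds2.to_netcdf('filename_fixed.nc')         # this second file round-trips fine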

Thanks in advance!

@lohae lohae added the needs triage Issue that has not been reviewed by xarray team member label Dec 11, 2023

welcome bot commented Dec 11, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@kmuehlbauer
Contributor

@lohae Could you please check that the time units of the individual files are the same for all files?

@lohae
Author

lohae commented Dec 11, 2023

@kmuehlbauer hmm, difficult, unfortunately the data is stored in one netCDF file per 15 min step, and opening every one of the 170k+ files takes forever. But if that were the reason, why does it look OK before I save it to disk?

@kmuehlbauer
Contributor

That's just a guess. You would only have to check a few of the files where the times end up corrupted.

The guess is that the units differ between files and that something breaks when encoding again with the units from the first file.

If you can share the first file and one which gets corrupted, that would also be an option to get to the bottom of this.

@kmuehlbauer
Contributor

kmuehlbauer commented Dec 12, 2023

But if that were the reason, why does it look OK before I save it to disk?

Because the data is CF encoded on write, and obviously something happens during this step. The thing is that each file can have different time units; they will all be decoded correctly and look OK in the combined dataset. On write, only the units from the first file survive and everything is encoded with those units.

But normally for time units this should not be a problem. At least not for these time ranges here.

There might be more CF encoding involved, because the number your max value corresponds to (array('2018-01-01T09:06:04.000000000', dtype='datetime64[ns]') in seconds since 2018-01-01 is 9 * 60 * 60 + 6 * 60 + 4 = 32764) is almost at the int16 maximum (32767). That might indicate a packed scheme putting the times into two-byte integers. It could also be a bug which is uncovered by your special use case here.
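To illustrate how such a packed two-byte scheme can scramble an evenly spaced time axis (a rough numpy/pandas sketch of the arithmetic only, not xarray's actual encoding code path):

import numpy as np
import pandas as pd

# a 15 min time axis expressed as "seconds since 2018-01-01": 0, 900, 1800, ...
seconds = np.arange(200, dtype=np.int64) * 900

# packing into int16 silently wraps around once the offsets exceed 32767
packed = seconds.astype(np.int16)

# decoding the packed values back gives a non-monotonic, "shuffled" time axis
decoded = pd.Timestamp("2018-01-01") + pd.to_timedelta(packed, unit="s")
print(decoded.is_monotonic_increasing)   # False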

But without more information on your source data, we can only speculate.

One more thing you might check is the output of ds.time.encoding. This might at least indicate if a packed scheme is used or not.

Possible Workaround

A possible workaround is to drop .encoding.

ds = ds.drop_encoding()

This should create fresh time units on encode that fit your data. It is probably equivalent to your ds.assign_coords(time=correct_time), but without the need to reload.

Update: This will remove the decoding information of every data variable and coordinate, which might not be wanted. So removing .encoding from the time coordinate only might be better.
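For resetting only the time coordinate it could look like this (a sketch; the data variables keep their compression/packing settings):

# let xarray pick fresh, fitting time units on the next write
ds["time"].encoding = {}
ds.to_netcdf("filename.nc")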

@kmuehlbauer kmuehlbauer added topic-metadata Relating to the handling of metadata (i.e. attrs and encoding) and removed needs triage Issue that has not been reviewed by xarray team member labels Dec 12, 2023
@lohae
Author

lohae commented Dec 13, 2023

Thanks @kmuehlbauer, we are getting closer! I think the int overflow is what is happening, as ds.time.encoding returns the following (the same for all 170k+ files); note that I replaced the source with ??, as it contained my login credentials.

{'source': '??',
 'original_shape': (1,),
 'dtype': dtype('int16'),
 'units': 'seconds since 2018-01-01T00:00:00Z',
 'calendar': 'standard'}

As you pointed out, that means the maximum representable value would be 2018-01-01 09:06:07, given the maximum number of seconds after 2018-01-01 that fits into int16 (32767). After saving and loading the netCDF, ds.time.encoding returns an empty dict, and after ds.assign_coords(time=correct_time) the encoding looks like this:

{'zlib': False,
 'shuffle': False,
 'complevel': 0,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (512,),
 'source': ??,
 'original_shape': (174907,),
 'dtype': dtype('int64'),
 'units': 'minutes since 2018-01-01 00:00:00',
 'calendar': 'proleptic_gregorian'}

I still can't really figure out what exactly happens in the int overflow situation and why it happens in the first place. But I must also say that I am mostly a data user and do not have any deeper knowledge of the internals of xarray's (or netCDF's?) time format. Is it normal that time is stored in a time unit counted since the first time record of the dataset?

Also, removing the time encoding with ds.time.encoding = {} (not sure whether this is best practice) leads to a correct time dimension when saving.

So I have a solution/workaround now. However, I believe this can be quite annoying, as everything looks correct before saving but is not afterwards. In my case I was producing data for several sites in an automated way and only noticed the problem afterwards. Maybe I should implement some extra checks on the time dimension in future endeavors, as sketched below.
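For example, something along these lines right after writing (a sketch; the file name is a placeholder and ds is the in-memory dataset that was just written):

import xarray as xr

with xr.open_dataset("filename.nc") as check:
    # the round-tripped time axis should be strictly increasing and identical to the original
    assert check.indexes["time"].is_monotonic_increasing
    assert check.time.equals(ds.time)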

@kmuehlbauer
Contributor

Is it normal that time is stored in a time unit counted since the first time record of the dataset?

If there is a datetime64 array, xarray will encode it. If there are no time units, xarray will try to find suitable time units by inspecting the array:

def infer_datetime_units(dates) -> str:
    """Given an array of datetimes, returns a CF compatible time-unit string of
    the form "{time_unit} since {date[0]}", where `time_unit` is 'days',
    'hours', 'minutes' or 'seconds' (the first one that can evenly divide all
    unique time deltas in `dates`)
    """
    dates = np.asarray(dates).ravel()
    if np.asarray(dates).dtype == "datetime64[ns]":
        dates = to_datetime_unboxed(dates)
        dates = dates[pd.notnull(dates)]
        reference_date = dates[0] if len(dates) > 0 else "1970-01-01"
        # TODO: the strict enforcement of nanosecond precision Timestamps can be
        # relaxed when addressing GitHub issue #7493.
        reference_date = nanosecond_precision_timestamp(reference_date)
    else:
        reference_date = dates[0] if len(dates) > 0 else "1970-01-01"
        reference_date = format_cftime_datetime(reference_date)
    unique_timedeltas = np.unique(np.diff(dates))
    units = _infer_time_units_from_diff(unique_timedeltas)
    return f"{units} since {reference_date}"
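For a 15 min time axis starting at 2018-01-01 this infers something like "minutes since 2018-01-01 00:00:00", e.g. (a sketch using the internal helper, so the exact import path may differ between xarray versions):

import pandas as pd
from xarray.coding.times import infer_datetime_units

times = pd.date_range("2018-01-01", periods=8, freq="15min")
print(infer_datetime_units(times))   # e.g. "minutes since 2018-01-01 00:00:00"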

@kmuehlbauer
Contributor

@lohae Did you get along here, or is there something which should be addressed?

@lohae
Author

lohae commented Dec 21, 2023

@kmuehlbauer I think it's fine, I used the workaround and I am aware of this for the future! Thanks very much!

@kmuehlbauer
Contributor

@lohae Glad you can use the workaround. I'll close for now. Please reopen or open a follow-up issue if there is anything to do.

spencerkclark added a commit to spencerkclark/xarray that referenced this issue Dec 31, 2023
spencerkclark added a commit to spencerkclark/xarray that referenced this issue Jan 1, 2024
spencerkclark added a commit to spencerkclark/xarray that referenced this issue Jan 27, 2024
dcherian pushed a commit that referenced this issue Jan 29, 2024
…imedelta` (#8575)

* Add proof of concept dask-friendly datetime encoding

* Add dask support for timedelta encoding and more tests

* Minor error message edits; add what's new entry

* Add return type for new tests

* Fix typo in what's new

* Add what's new entry for update following #8542

* Add full type hints to encoding functions

* Combine datetime64 and timedelta64 zarr tests; add cftime zarr test

* Minor edits to what's new

* Address initial review comments

* Initial work toward addressing typing comments

* Restore covariant=True in T_DuckArray; add type: ignores

* Tweak netCDF3 error message

* Move what's new entry

* Remove extraneous text from merge in what's new

* Remove unused type: ignore comment

* Remove word from netCDF3 error message