
xr.to_netcdf() alters time dimension #8542

Closed
lohae opened this issue Dec 11, 2023 · 10 comments
Labels
topic-metadata Relating to the handling of metadata (i.e. attrs and encoding)

Comments

@lohae

lohae commented Dec 11, 2023

What is your issue?

Hi!
I was downloading some data from single files (15 min temporal resolution with some smaller gaps here and there) and wanted to save it for further processing. If I reopen the netCDF file, the time dimension is distorted in a way I cannot really understand. Basically, the 15 min spacing turns into something between the first timestamp (e.g. 2018-01-01 00:00:00) and a time only a few hours later. The timestamps are also unordered, as the latest time appears somewhere in the middle of the coordinate.

My steps are basically a download script, which is not really reproducible as it needs login tokens, but afterwards everything is purely xarray:

[screenshot: the xarray code used to open and combine the downloaded files]

Then I simply call ds.to_netcdf('filename.nc'), and when I re-open it with xr.open_dataset('filename.nc') I get the funny data below, where ds.time.max() is array('2018-01-01T09:06:04.000000000', dtype='datetime64[ns]') with argmax=array(7383, dtype=int64), so the coordinate is not even monotonically increasing.

[screenshot: the distorted time coordinate after re-opening the file]

Interestingly, saving the dataset, re-opening it, assigning the correct time values from the dataset I still had in memory (ds.assign_coords(time=correct_time)) and then saving it again seems to be a workaround, but I would like to understand whether it is me missing something or whether this might be a bug. I had to re-download quite a lot of data because of this, as I was not able to recover the correct time dimension from the altered one. If I open the corrupted file with decode_times=False it gives me seconds since 2018-01-01 with only len(np.unique(ds.time)) = 16384, whereas len(ds.time) = 174910.
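A rough sketch of the round trip and the workaround described above (the file names are placeholders and correct_time is simply the time coordinate of the in-memory dataset, not my actual script):

import xarray as xr

ds.to_netcdf('filename.nc')                # write the combined dataset
ds2 = xr.open_dataset('filename.nc')       # the time coordinate comes back distorted

# workaround: overwrite the distorted times with the ones that were correct in memory
correct_time = ds.time
ds2 = ds2.assign_coords(time=correct_time)
ds2.to_netcdf('filename_fixed.nc')         # this second file round-trips fine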

Thanks in advance!

@lohae lohae added the needs triage Issue that has not been reviewed by xarray team member label Dec 11, 2023

welcome bot commented Dec 11, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@kmuehlbauer
Contributor

@lohae Could you please check that the time units of the individual files are the same for all files?

@lohae
Author

lohae commented Dec 11, 2023

@kmuehlbauer hmm, difficult, unfortunately the data is stored in one netCDF file per 15 min step, and opening every one of the 170k+ files takes forever. But if that were the reason, why does it look OK before I save it to disk?

@kmuehlbauer
Contributor

That's just a guess. You would only have to check a few of the files where the times end up corrupted.

The guess is that the units differ between files and that something breaks when encoding again with the units from the first file.

If you can share the first file and one which gets corrupted, that would also be an option to get to the bottom of this.

@kmuehlbauer
Contributor

kmuehlbauer commented Dec 12, 2023

But if that were the reason, why does it look OK before I save it to disk?

Because the data is CF encoded on write, and obviously something happens during this step. The thing is that each file can have different time units; they will all be decoded correctly and look OK in the combined dataset. On write, only the units from the first file survive and everything is encoded with those units.

But normally for time units this should not be a problem. At least not for these time ranges here.

There might be more CF encoding involved, because the number your max value corresponds to (array('2018-01-01T09:06:04.000000000', dtype='datetime64[ns]') in seconds since 2018-01-01 is 9 * 60 * 60 + 6 * 60 + 4 = 32764) is almost at the int16 maximum (32767). That might indicate a packed scheme putting the times into two-byte integers. It could also be a bug which is uncovered by your special use case here.
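To illustrate how such a packed two-byte scheme can scramble an evenly spaced time axis (a rough numpy/pandas sketch of the arithmetic only, not xarray's actual encoding code path):

import numpy as np
import pandas as pd

# a 15 min time axis expressed as "seconds since 2018-01-01": 0, 900, 1800, ...
seconds = np.arange(200, dtype=np.int64) * 900

# packing into int16 silently wraps around once the offsets exceed 32767
packed = seconds.astype(np.int16)

# decoding the packed values back gives a non-monotonic, "shuffled" time axis
decoded = pd.Timestamp("2018-01-01") + pd.to_timedelta(packed, unit="s")
print(decoded.is_monotonic_increasing)   # False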

But without more information on your source data, we can only speculate.

One more thing you might check is the output of ds.time.encoding. This might at least indicate if a packed scheme is used or not.

Possible Workaround

A possible workaround is to drop .encoding.

ds = ds.drop_encoding()

This should create fresh time units on encode that fit your data. It is probably equivalent to your ds.assign_coords(time=correct_time), but without the need to reload.

Update: This will remove the decoding information of every data variable and coordinate, which might not be wanted. So removing .encoding from the time coordinate only might be better.
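For resetting only the time coordinate it could look like this (a sketch; the data variables keep their compression/packing settings):

# let xarray pick fresh, fitting time units on the next write
ds["time"].encoding = {}
ds.to_netcdf("filename.nc")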

@kmuehlbauer kmuehlbauer added topic-metadata Relating to the handling of metadata (i.e. attrs and encoding) and removed needs triage Issue that has not been reviewed by xarray team member labels Dec 12, 2023
@lohae
Author

lohae commented Dec 13, 2023

Thanks @kmuehlbauer, we are getting closer! I think the int overflow is what is happening, as ds.time.encoding returns the following (the same for all 170k+ files); note that I replaced the source with ??, as it contained my login credentials.

{'source': '??',
 'original_shape': (1,),
 'dtype': dtype('int16'),
 'units': 'seconds since 2018-01-01T00:00:00Z',
 'calendar': 'standard'}

As you pointed out, that means the maximum representable value would be 2018-01-01 09:06:07, given the maximum number of seconds after 2018-01-01 that fits into int16 (32767). After saving and loading the netCDF, ds.time.encoding returns an empty dict, and after ds.assign_coords(time=correct_time) the encoding looks like this:

{'zlib': False,
 'shuffle': False,
 'complevel': 0,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (512,),
 'source': ??,
 'original_shape': (174907,),
 'dtype': dtype('int64'),
 'units': 'minutes since 2018-01-01 00:00:00',
 'calendar': 'proleptic_gregorian'}

I still can't really figure out what exactly happens in the int overflow situation and why it happens in the first place. But I must also say that I am mostly a data user and do not have any deeper knowledge of the internals of xarray's (or netCDF's?) time format. Is it normal that time is stored in a time unit counted since the first time record of the dataset?

Also, removing the time encoding with ds.time.encoding = {} (not sure whether this is best practice) leads to a correct time dimension when saving.

So I have a solution/workaround now. However, I believe this can be quite annoying, as everything looks correct before saving but is not afterwards. In my case I was producing data for several sites in an automated way and only noticed the problem afterwards. Maybe I should implement some extra checks on the time dimension in future endeavors, as sketched below.
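For example, something along these lines right after writing (a sketch; the file name is a placeholder and ds is the in-memory dataset that was just written):

import xarray as xr

with xr.open_dataset("filename.nc") as check:
    # the round-tripped time axis should be strictly increasing and identical to the original
    assert check.indexes["time"].is_monotonic_increasing
    assert check.time.equals(ds.time)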

@kmuehlbauer
Contributor

Is it normal that time is stored in a time unit counted since the first time record of the dataset?

If there is a datetime64 array, xarray will encode it. If there are no time units, xarray will try to find suitable time units by inspecting the array:

def infer_datetime_units(dates) -> str:
    """Given an array of datetimes, returns a CF compatible time-unit string of
    the form "{time_unit} since {date[0]}", where `time_unit` is 'days',
    'hours', 'minutes' or 'seconds' (the first one that can evenly divide all
    unique time deltas in `dates`)
    """
    dates = np.asarray(dates).ravel()
    if np.asarray(dates).dtype == "datetime64[ns]":
        dates = to_datetime_unboxed(dates)
        dates = dates[pd.notnull(dates)]
        reference_date = dates[0] if len(dates) > 0 else "1970-01-01"
        # TODO: the strict enforcement of nanosecond precision Timestamps can be
        # relaxed when addressing GitHub issue #7493.
        reference_date = nanosecond_precision_timestamp(reference_date)
    else:
        reference_date = dates[0] if len(dates) > 0 else "1970-01-01"
        reference_date = format_cftime_datetime(reference_date)
    unique_timedeltas = np.unique(np.diff(dates))
    units = _infer_time_units_from_diff(unique_timedeltas)
    return f"{units} since {reference_date}"
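For a 15 min time axis starting at 2018-01-01 this infers something like "minutes since 2018-01-01 00:00:00", e.g. (a sketch using the internal helper, so the exact import path may differ between xarray versions):

import pandas as pd
from xarray.coding.times import infer_datetime_units

times = pd.date_range("2018-01-01", periods=8, freq="15min")
print(infer_datetime_units(times))   # e.g. "minutes since 2018-01-01 00:00:00"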

@kmuehlbauer
Contributor

@lohae Did you get along here, or is there something which should be addressed?

@lohae
Author

lohae commented Dec 21, 2023

@kmuehlbauer I think it's fine, I used the workaround and I am aware of this for the future! Thanks very much!

@kmuehlbauer
Contributor

@lohae Glad you can use the workaround. I'll close for now. Please reopen or open a follow-up issue if there is anything to do.

spencerkclark added a commit to spencerkclark/xarray that referenced this issue Dec 31, 2023
spencerkclark added a commit to spencerkclark/xarray that referenced this issue Jan 1, 2024
spencerkclark added a commit to spencerkclark/xarray that referenced this issue Jan 27, 2024
dcherian pushed a commit that referenced this issue Jan 29, 2024
…imedelta` (#8575)

* Add proof of concept dask-friendly datetime encoding

* Add dask support for timedelta encoding and more tests

* Minor error message edits; add what's new entry

* Add return type for new tests

* Fix typo in what's new

* Add what's new entry for update following #8542

* Add full type hints to encoding functions

* Combine datetime64 and timedelta64 zarr tests; add cftime zarr test

* Minor edits to what's new

* Address initial review comments

* Initial work toward addressing typing comments

* Restore covariant=True in T_DuckArray; add type: ignores

* Tweak netCDF3 error message

* Move what's new entry

* Remove extraneous text from merge in what's new

* Remove unused type: ignore comment

* Remove word from netCDF3 error message