Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.dt accessor returns int instead of float, resulting in misrepresentation of NaT values #7928

Closed
4 tasks done
idantene opened this issue Jun 19, 2023 · 12 comments · Fixed by #8084
Closed
4 tasks done

Comments

@idantene
Copy link

What happened?

With the latest xarray (this doesn't happen at least in version 2023.2.0), accessing .dt parts returns a strict int64 DataArray, resulting in wrongly presented missing values.

In [2]: s = pd.to_datetime(pd.Series(['2021-12-01', pd.NaT]))

In [3]: s
Out[3]: 
0   2021-12-01
1          NaT
dtype: datetime64[ns]

In [4]: s.dt.year
Out[4]: 
0    2021.0
1       NaN
dtype: float64

In [5]: s.to_xarray()
Out[5]: 
<xarray.DataArray (index: 2)>
array(['2021-12-01T00:00:00.000000000',                           'NaT'],
      dtype='datetime64[ns]')
Coordinates:
  * index    (index) int64 0 1

In [6]: s.to_xarray().dt.year
Out[6]: 
<xarray.DataArray 'year' (index: 2)>
array([                2021, -9223372036854775808])
Coordinates:
  * index    (index) int64 0 1

Notice how:

  1. The series and generated DataArray are both correctly datetime64[ns]
  2. The series' .dt.year accessor returns a float64 to accommodate the missing value (it will use int32 w/o the missing value).
  3. The DataArray's .dt.year returns a negative integer instead of nan.

Additionally, compare with the same snippet's output with xarray 2023.2.0:

In [3]: s.to_xarray().dt.year
Out [3]:
<xarray.DataArray 'year' (index: 2)>
array([2021.,   nan])
Coordinates:
  * index    (index) int64 0 1

What did you expect to happen?

The .dt accessor should return a float with missing values when needed.

Minimal Complete Verifiable Example

import pandas as pd

s = pd.to_datetime(pd.Series(['2021-12-01', pd.NaT]))
s.to_xarray().dt.year

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.19.0-1025-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2023.5.0
pandas: 2.0.2
numpy: 1.23.4
scipy: 1.10.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.6.0
distributed: 2023.6.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.6.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 59.6.0
pip: 23.1.2
conda: None
pytest: None
mypy: None
IPython: 8.14.0
sphinx: None

@idantene idantene added bug needs triage Issue that has not been reviewed by xarray team member labels Jun 19, 2023
@welcome
Copy link

welcome bot commented Jun 19, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@idantene
Copy link
Author

idantene commented Jun 19, 2023

FWIW I believe the source for this is the dtype changes in #7724, with the addition of forcing the cast to the request dtype as noted here.

@kmuehlbauer
Copy link
Contributor

@idantene Thanks for bringing this to attention. I can reproduce and it seems the unconditional cast (as you suggested) is the cause:

return access_method(values, name).astype(dtype, copy=False)

If NaT is present in the source data access_method will return float64 with nan. The cast to int64 will transform nan -> -9223372036854775808. I'm not really sure how and where to handle this, but one solution would be to check for presence of NaT or float64 return type and do not cast in these occasions.

@idantene
Copy link
Author

Hey @kmuehlbauer, thanks for addressing and confirming the issue!

I think both of those are valid approaches, and would surely address the issue at hand. However, these feel a bit like stop-gap measures, and I have to wonder (as someone who hasn't contributed to xarray yet) - why are the type casts even necessary, especially in the series case?
It seems like it can be directly off-loaded to pandas.

@kmuehlbauer
Copy link
Contributor

From reading the linked PR I think the main thing was to cast int32 to int64 (for xarray internal use). @keewis and @dcherian surely have more insight into that topic.

@dcherian
Copy link
Contributor

Yes I think we should cast to float like pandas does.

@dcherian dcherian added regression and removed needs triage Issue that has not been reviewed by xarray team member labels Jun 21, 2023
@kmuehlbauer
Copy link
Contributor

kmuehlbauer commented Jun 22, 2023

@dcherian pandas currently returns float64 in presence of NaT and int32 otherwise. As mentioned above, should xarray cast only int32 -> int64 and keep the pandas float64 behaviour? Or should we also cast to int32 -> float64. Anyway, this would need to happen inside

def _access_through_series(values, name):
to also make this work for dask arrays.

Side note: If isocalender is requested, xarray will already cast to int64 unconditionally.

Update: This already raises in current main branch (and probably earlier releases) with pandas version 2.0.1:

import pandas as pd
s = pd.to_datetime(pd.Series(['2021-12-01', pd.NaT]))
s.to_xarray().dt.isocalendar()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 3
      1 import pandas as pd
      2 s = pd.to_datetime(pd.Series(['2021-12-01', pd.NaT]))
----> 3 s.to_xarray().dt.isocalendar()

File /home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/core/accessor_dt.py:363, in DatetimeAccessor.isocalendar(self)
    360 if not is_np_datetime_like(self._obj.data.dtype):
    361     raise AttributeError("'CFTimeIndex' object has no attribute 'isocalendar'")
--> 363 values = _get_date_field(self._obj.data, "isocalendar", np.int64)
    365 obj_type = type(self._obj)
    366 data_vars = {}

File /home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/core/accessor_dt.py:122, in _get_date_field(values, name, dtype)
    118     return map_blocks(
    119         access_method, values, name, dtype=dtype, new_axis=new_axis, chunks=chunks
    120     )
    121 else:
--> 122     return access_method(values, name).astype(dtype, copy=False)

File /home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/core/accessor_dt.py:78, in _access_through_series(values, name)
     75     field_values = _season_from_months(months)
     76 elif name == "isocalendar":
     77     # isocalendar returns iso- year, week, and weekday -> reshape
---> 78     field_values = np.array(values_as_series.dt.isocalendar(), dtype=np.int64)
     79     return field_values.T.reshape(3, *values.shape)
     80 else:

File /home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/pandas/core/generic.py:1998, in NDFrame.__array__(self, dtype)
   1996 def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
   1997     values = self._values
-> 1998     arr = np.asarray(values, dtype=dtype)
   1999     if (
   2000         astype_is_view(values.dtype, arr.dtype)
   2001         and using_copy_on_write()
   2002         and self._mgr.is_single_block
   2003     ):
   2004         # Check if both conversions can be done without a copy
   2005         if astype_is_view(self.dtypes.iloc[0], values.dtype) and astype_is_view(
   2006             values.dtype, arr.dtype
   2007         ):

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NAType'

Looks like no test was catching this up to now.

@idantene
Copy link
Author

Any update on this @dcherian @kmuehlbauer?

We're stuck with 2023.2.0 because of this 😢

@kmuehlbauer
Copy link
Contributor

@idantene Thanks for checking back. I lost track over summer. I've a somewhat clumsy approach over in #8084. Would be great if you can test with your setup.

It checks for NaT and casts to float64 if that's the case. Unfortunately isocalendar needs special handling which requires NaT check twice for now. If anyone has suggestions to make this more slim, please comment in the PR.

@kmuehlbauer
Copy link
Contributor

@idantene @dcherian I think this problem should be fixed upstream. I've opened an issue pandas-dev/pandas#54657 over at pandas to see, if and how this can be aligned.

@idantene
Copy link
Author

Thanks @kmuehlbauer! I'll have a test run soon to verify this fix - it looks good on paper.

I'm curious as to how you'll fix this for dask, though it does not apply to our use case :D

@kmuehlbauer
Copy link
Contributor

I'm curious as to how you'll fix this for dask, though it does not apply to our use case :D

I've no idea how this can be made consistent for dask without a priory knowledge if there are NaT involved or not. As far as I checked it will return some RuntimeWarning: invalid value encountered in cast warning, which will at least let the user know about the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants