support 'datetime64' and 'timedelta64' types #835

Merged
merged 45 commits into from
Jun 10, 2021

Conversation

ianna
Collaborator

@ianna ianna commented Apr 15, 2021

issue #367

@ianna ianna marked this pull request as draft April 15, 2021 16:12
@ianna ianna force-pushed the ianna/datetime-and-timedelta-types branch from cab0f3d to 243959b Compare April 19, 2021 15:53
Collaborator Author

@ianna ianna left a comment

@jpivarski - I'm not sure if I should box it differently or leave it as uint64_t

@ianna
Collaborator Author

ianna commented Apr 22, 2021

It appears the test was failing because of a missing "dtype" attribute. A check for its presence has been added in content.cpp#2226.

____________________________________________________________ test_sort_zero_length_arrays _____________________________________________________________

    def test_sort_zero_length_arrays():
        array = ak.layout.IndexedArray64(
>           ak.layout.Index64([]), ak.layout.NumpyArray([1, 2, 3])
        )
E       AttributeError: 'list' object has no attribute 'dtype'

tests/test_0074-argsort-and-sort.py:493: AttributeError
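
The failure above comes from reading `.dtype` on a plain Python list. A minimal sketch of that kind of guard, written in Python rather than the C++ of content.cpp; `is_datetime_like` is a hypothetical helper name used only for illustration, not part of Awkward Array:

```python
import numpy as np


def is_datetime_like(obj):
    """Return True only if obj exposes a NumPy datetime64/timedelta64 dtype."""
    dtype = getattr(obj, "dtype", None)  # plain lists have no dtype attribute
    if dtype is None:
        return False
    # NumPy dtype kinds: "M" is datetime64, "m" is timedelta64
    return np.dtype(dtype).kind in ("M", "m")


print(is_datetime_like([1, 2, 3]))
print(is_datetime_like(np.array(["2021-06-03"], dtype="datetime64[D]")))
```

Checking for the attribute first lets non-NumPy inputs fall through to the ordinary conversion path instead of raising AttributeError.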

@sterbini

sterbini commented Apr 23, 2021

I would be very interested in the datetime support. Thanks!

@jpivarski jpivarski linked an issue Apr 27, 2021 that may be closed by this pull request
@ianna ianna force-pushed the ianna/datetime-and-timedelta-types branch from 5a19d26 to 866573b Compare June 2, 2021 13:01
@ianna ianna changed the title start work on 'datetime' and 'timedelta' types support 'datetime64' and 'timedelta64' types Jun 2, 2021
Collaborator Author

@ianna ianna left a comment

there are two issues to fix:

  • precision loss: for example, when np.datetime64("2020-05") is converted to seconds: np.datetime64("2020-05-01T20:56:24.000000")
  • build timedelta64 for a high-level ak.ArrayBuilder

@ianna
Collaborator Author

ianna commented Jun 3, 2021

  • precision loss: for example, when np.datetime64("2020-05") is converted to seconds: np.datetime64("2020-05-01T20:56:24.000000")
>>> dt = np.datetime64("2020-05")
>>> dt.dtype
dtype('<M8[M]')
>>> dt2 = np.datetime64("2020-05-01T20:56:24.000000")
>>> dt2
numpy.datetime64('2020-05-01T20:56:24.000000')
>>> dt2.astype(np.datetime64(1, 'M'))
numpy.datetime64('2020-05')

I think it could be left as is. @jpivarski ?

If not, possible solutions are to:

  1. convert all incoming np.datetime64 values to 's' (but what if a user wants to keep the given units?)
>>> dt.astype(np.int64)
604
>>> dt2.astype(np.int64)
1588366584000000
>>> dt.astype(np.datetime64(1, 'us')).astype(np.int64)
1588291200000000
>>> dt.astype(np.datetime64(1, 's')).astype(np.int64)
1588291200
  2. recalculate it later:
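>>> def leap_years_between(start, end):
...     # Count leap years strictly between `start` and `end`.
...     if start < end:
...         return leap_years_before(end) - leap_years_before(start + 1)
...     else:
...         raise ValueError("start must be less than end")
... 
>>> def leap_years_before(year):
...     # Count leap years strictly before `year` (Gregorian rule, integer division).
...     if year > 0:
...         year = year - 1
...         return (year // 4) - (year // 100) + (year // 400)
...     else:
...         raise ValueError("year must be positive")
... 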
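
The unit behaviour discussed above can be checked with plain NumPy. This is a small sketch that reuses the two values from the session; the dtype strings such as "datetime64[s]" are standard NumPy spellings, and nothing here is specific to Awkward Array:

```python
import numpy as np

dt = np.datetime64("2020-05")                      # month precision
dt2 = np.datetime64("2020-05-01T20:56:24.000000")  # microsecond precision

# The unit lives in the dtype, not in the stored integer.
unit, step = np.datetime_data(dt.dtype)
print(unit, step)

# Widening to a finer unit is exact: the month maps to its first instant.
as_seconds = dt.astype("datetime64[s]")
print(as_seconds, as_seconds.astype(np.int64))

# Narrowing to a coarser unit drops the day and the time of day.
print(dt2.astype("datetime64[M]") == dt)
```

Keeping the unit the user supplied avoids the question of which common unit to convert to; the stored integer only has meaning together with that unit.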

@jpivarski
Member

I think it could be left as is. @jpivarski ?

I don't see anything wrong here. Maybe you should ask the potential users of datetimes?

@ianna ianna marked this pull request as ready for review June 3, 2021 16:29
@ianna ianna requested a review from jpivarski June 3, 2021 16:29
Member

@jpivarski jpivarski left a comment

I tried it out and ran into a few bugs, most likely minor, and added them to the testing suite as test_more.

I also made a modification to the repr format:

<Array [2021-06-03T10:00, ... 2021-06-03T13:00] type='4 * datetime64'>

and

<Array [60 minutes, 60 minutes, 60 minutes] type='3 * timedelta64'>

The string representations of datetime/timedelta types don't agree with Datashape, though. Is it possible to fix this? For one thing, the type strings should drop the "64", but is there a way to get timezone information in datetime (e.g. datetime[tz='UTC'])?

There is one significant problem, though: whenever we get data from a datetime/timedelta NumpyArray into an np.ndarray, it's through an expensive-looking loop in Python. This should be a zero-copy view. The NumpyArray class already has methods like to_cupy and to_jax that circumvent the buffer interface (and have to be called explicitly by the code layer above it): we need a similar one for to_numpy that would be called when the dtype is not something that can be expressed in a buffer.

Actually, we can short-cut this: I created a view_int64 property on NumpyArray, which just drops the datetime/timedelta metadata so that you can send it through a buffer and add the metadata back in:

>>> akarray
<Array [2021-06-03T10:00, 2021-06-03T11:00] type='2 * datetime64'>
>>> akarray.layout
<NumpyArray format="M8[m]" shape="2" data="0x 78ad9c01 00000000 b4ad9c01 00000000" at="0x562eaa6b1e10"/>
>>> akarray.layout.view_int64
<NumpyArray format="l" shape="2" data="27045240 27045300" at="0x562eaa6b1e10"/>
>>> np.asarray(akarray.layout.view_int64)
array([27045240, 27045300])
>>> np.asarray(akarray.layout.view_int64).view(akarray.layout.format)
array(['2021-06-03T10:00', '2021-06-03T11:00'], dtype='datetime64[m]')

Use this to fix the two Python loops over array data.
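
The same round trip can be reproduced in NumPy alone, which shows why no Python loop is needed. This sketch uses only `ndarray.view` and `np.shares_memory`; it illustrates the idea and is not the `view_int64` implementation itself:

```python
import numpy as np

times = np.array(["2021-06-03T10:00", "2021-06-03T11:00"], dtype="datetime64[m]")

# Reinterpret the same 8-byte items as plain integers (no copy is made).
as_int = times.view(np.int64)
print(as_int)

# Reattach the datetime metadata by viewing with the original dtype.
restored = as_int.view(times.dtype)
print(restored)

# All three arrays share one buffer.
print(np.shares_memory(times, as_int), np.shares_memory(times, restored))
```

Because a view only changes how the bytes are interpreted, its cost does not depend on the length of the array.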

@ianna ianna force-pushed the ianna/datetime-and-timedelta-types branch from 5009477 to d34ead5 Compare June 5, 2021 16:19
ianna added 2 commits June 7, 2021 13:29
if py::isinstance(obj, py::module::import("numpy").attr("datetime64")) is true,
then py::isinstance(obj, py::module::import("numpy").attr("integer")) is also true
@ianna ianna force-pushed the ianna/datetime-and-timedelta-types branch from 8b9082b to 57ed8f3 Compare June 8, 2021 15:53
Collaborator Author

@ianna ianna left a comment

@jpivarski - please, have a look when you have time. thanks!

@ianna
Collaborator Author

ianna commented Jun 10, 2021

@jpivarski - here is an example where a pyarrow date64 array is converted correctly to a NumpyArray, but not to an Array, because of a units mismatch. I'm not sure how realistic it is.

>>> parr = pa.array([datetime.datetime(2002, 1, 23), datetime.datetime(2019, 2, 20)], type=pa.date64())
>>> parr.type
DataType(date64[ms])
>>> narr = ak.layout.NumpyArray(parr)
>>> narr
<NumpyArray format="M8[D]" shape="2" data="Wed Jan 23 00:00:00 2002
 Wed Feb 20 00:00:00 2019
" at="0x000109071000"/>
>>> array = ak.Array(parr)
>>> array
<Array [1970-01-01T00:16:51.744000000, ... ] type='2 * datetime'>
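
One plausible reading of the 1970 timestamp above is that millisecond counts were interpreted as nanoseconds. The NumPy-only sketch below reproduces that value under this assumption; it does not call pyarrow or Awkward Array, and the variable names are illustrative:

```python
import numpy as np

# The same two dates, stored as milliseconds since the epoch (like date64[ms]).
in_ms = np.array(["2002-01-23", "2019-02-20"], dtype="datetime64[ms]")
raw = in_ms.view(np.int64)
print(raw[0])

# Attaching the wrong unit to the same integers shifts the dates to 1970.
misread = raw.view("datetime64[ns]")
print(misread[0])

# Attaching the right unit recovers the original dates.
print(raw.view("datetime64[ms]")[0])
```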

@drahnreb
Contributor

drahnreb commented Jun 10, 2021

but is there a way to get timezone information in datetime (e.g. datetime[tz='UTC'])?

NumPy deprecated storing timezone information (see the end of this paragraph). Overall, I would say this feature is not really relevant. However, pyarrow does support timezones, and any persisted data will lose this information.

(/cc @jpivarski )

@jpivarski
Member

@ianna To answer your question, this is not an intended use-case (and it works "by accident" because this pyarrow array happens to be flat and interpretable as a buffer):

>>> parr = pa.array([datetime.datetime(2002, 1, 23), datetime.datetime(2019, 2, 20)], type=pa.date64())
>>> parr.type
DataType(date64[ms])
>>> narr = ak.layout.NumpyArray(parr)

The proper way to convert pyarrow data into Awkward Arrays is with ak.from_arrow.

The issues with pyarrow were because we rely on to_pandas_dtype to get a NumPy dtype (it's not well named) and this method is completely wrong for a lot of units. I'm going to file a bug report to the Arrow project.

The appearance of correct times in the low-level NumpyArray view was due to another bug—the two bugs cancelled—which was that the scale from datetime_util was an integer, but sometimes you need to scale down, not just up. I've promoted that quantity to double type (including the kernel that it's used in).
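
A small sketch of why an integer scale factor cannot express a conversion to a coarser unit. `NANOSECONDS_PER_UNIT` and `unit_scale` are hypothetical names for illustration only; they are not the datetime_util code or its kernel:

```python
# Nanoseconds per unit, for a few fixed-length datetime units.
NANOSECONDS_PER_UNIT = {"ns": 1, "us": 10**3, "ms": 10**6, "s": 10**9, "m": 60 * 10**9}


def unit_scale(from_unit, to_unit):
    """Factor that converts a count in from_unit into a count in to_unit."""
    return NANOSECONDS_PER_UNIT[from_unit] / NANOSECONDS_PER_UNIT[to_unit]


up = unit_scale("s", "ms")    # 1000.0: scaling up fits in an integer
down = unit_scale("ms", "s")  # 0.001: scaling down does not
print(up, down, int(down))    # truncating the second factor to an integer gives 0
```

With a floating-point factor, both directions go through the same multiplication.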

I think it's done! I noticed that we can't take datetime objects in the ak.from_iter constructor, but neither can NumPy (it makes a dtype="O" array). That might be nice to have, but it would be for another PR someday.
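
The NumPy behaviour mentioned here is easy to confirm. In this sketch the explicit `dtype="datetime64[us]"` is one way to get a datetime array from Python datetime objects; it is shown for comparison only and is not something this PR adds:

```python
import datetime

import numpy as np

stamps = [datetime.datetime(2002, 1, 23), datetime.datetime(2019, 2, 20)]

# Without a dtype, NumPy keeps the Python objects as they are.
inferred = np.array(stamps)
print(inferred.dtype)

# With an explicit dtype, the objects are converted to datetime64.
explicit = np.array(stamps, dtype="datetime64[us]")
print(explicit.dtype, explicit[0])
```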

Congrats! This was a long time coming!

@jpivarski jpivarski enabled auto-merge (squash) June 10, 2021 19:10
Successfully merging this pull request may close these issues.

Support for datetime[*] numpy dtype
4 participants