support 'datetime64' and 'timedelta64' types #835
@jpivarski - I'm not sure if I should box it differently or leave it as uint64_t
It appears the test was failing due to a missing
________ test_sort_zero_length_arrays ________
    def test_sort_zero_length_arrays():
        array = ak.layout.IndexedArray64(
>           ak.layout.Index64([]), ak.layout.NumpyArray([1, 2, 3])
        )
E       AttributeError: 'list' object has no attribute 'dtype'
tests/test_0074-argsort-and-sort.py:493: AttributeError
I would be very interested in the
There are two issues to fix:
- precision loss: for example, when a np.datetime64("2020-05") is converted to seconds, np.datetime64("2020-05-01T20:56:24.000000")
- build timedelta64 for a high-level ak.ArrayBuilder
>>> dt = np.datetime64("2020-05")
>>> dt.dtype
dtype('<M8[M]')
>>> dt2 = np.datetime64("2020-05-01T20:56:24.000000")
>>> dt2
numpy.datetime64('2020-05-01T20:56:24.000000')
>>> dt2.astype(np.datetime64(1, 'M'))
numpy.datetime64('2020-05')
I think it could be left as is. @jpivarski ? If not, possible solutions are to:
>>> dt.astype(np.int64)
604
>>> dt2.astype(np.int64)
1588366584000000
>>> dt.astype(np.datetime64(1, 'us')).astype(np.int64)
1588291200000000
>>> dt.astype(np.datetime64(1, 's')).astype(np.int64)
1588291200
>>> def leap_years_between(start, end):
...     if start < end:
...         return leap_years_before(end) - leap_years_before(start + 1)
...     else:
...         raise ValueError
...
>>> def leap_years_before(year):
...     if year > 0:
...         year = year - 1
...         # integer division: "/" would return a float in Python 3
...         return (year // 4) - (year // 100) + (year // 400)
...     else:
...         raise ValueError
...
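As a minimal sketch of the two ideas above (not code from this PR): casting both values to a common unit before taking the integer view avoids comparing month counts with microsecond counts, and the hand-written leap-year count can be cross-checked against `calendar.leapdays` from the standard library. The helper name `to_ticks` is hypothetical, introduced only for illustration.

```python
import calendar

import numpy as np


def to_ticks(value, unit="us"):
    """Cast a datetime64 scalar to `unit`, then return its int64 tick count."""
    return int(value.astype(f"datetime64[{unit}]").astype(np.int64))


def leap_years_before(year):
    """Count leap years in [1, year), using integer division throughout."""
    if year <= 0:
        raise ValueError("year must be positive")
    year -= 1
    return year // 4 - year // 100 + year // 400


dt = np.datetime64("2020-05")                      # month precision
dt2 = np.datetime64("2020-05-01T20:56:24.000000")  # microsecond precision

month_start = to_ticks(dt)     # start of 2020-05, in microseconds since epoch
stamp = to_ticks(dt2)          # same unit, so the two are directly comparable
elapsed = stamp - month_start  # microseconds since the start of the month

# Cross-check the hand-written count against the standard library.
agrees = all(
    leap_years_before(y) == calendar.leapdays(1, y) for y in range(1, 3000)
)
```

Once both values share a unit, subtraction and comparison on the integers behave as expected; whether the conversion should happen in the builder or at a higher level is a separate design choice.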
I don't see anything wrong here. Maybe you should ask the potential users of datetimes?
I tried it out and ran into a few bugs, most likely minor, and added them to the testing suite as test_more.
I also made a modification to the repr format:
<Array [2021-06-03T10:00, ... 2021-06-03T13:00] type='4 * datetime64'>
and
<Array [60 minutes, 60 minutes, 60 minutes] type='3 * timedelta64'>
The string representation of datetime/timedelta types doesn't agree with Datashape, though. Is it possible to fix this? For one thing, the type strings should drop the "64", but is there a way to get timezone information in datetime (e.g. datetime[tz='UTC'])?
There is one significant problem, though: whenever we get data from a datetime/timedelta NumpyArray into an np.ndarray, it's through an expensive-looking loop in Python. This should be a zero-copy view. The NumpyArray class already has methods like to_cupy and to_jax that circumvent the buffer interface (and have to be called explicitly by the code layer above it): we need a similar one for to_numpy that would be called when the dtype is not something that can be expressed in a buffer.
Actually, we can short-cut this: I created a view_int64 property on NumpyArray, which just drops the datetime/timedelta metadata so that you can send it through a buffer and add the metadata back in:
>>> akarray
<Array [2021-06-03T10:00, 2021-06-03T11:00] type='2 * datetime64'>
>>> akarray.layout
<NumpyArray format="M8[m]" shape="2" data="0x 78ad9c01 00000000 b4ad9c01 00000000" at="0x562eaa6b1e10"/>
>>> akarray.layout.view_int64
<NumpyArray format="l" shape="2" data="27045240 27045300" at="0x562eaa6b1e10"/>
>>> np.asarray(akarray.layout.view_int64)
array([27045240, 27045300])
>>> np.asarray(akarray.layout.view_int64).view(akarray.layout.format)
array(['2021-06-03T10:00', '2021-06-03T11:00'], dtype='datetime64[m]')
Use this to fix the two Python loops over array data.
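The same round trip can be sketched in pure NumPy, without Awkward Array. This is only an illustration of the idea behind view_int64, not the implementation in this PR. `ndarray.view` reinterprets the same memory, so neither direction copies data. The `memoryview` probe only records whether this NumPy build can export datetime64 through the buffer protocol; its result is not assumed here.

```python
import numpy as np

times = np.array(
    ["2021-06-03T10:00", "2021-06-03T11:00"], dtype="datetime64[m]"
)

# Drop the datetime metadata: same bytes, seen as plain int64 minute counts.
ticks = times.view(np.int64)

# Add the metadata back: again a view, not a copy.
restored = ticks.view(times.dtype)

# Probe the buffer protocol for the datetime64 array directly.
try:
    memoryview(times)
    exportable = True
except (ValueError, TypeError, BufferError):
    exportable = False

# The int64 view can always go through the buffer protocol.
ticks_buffer = memoryview(ticks)
```

Because `ticks` and `restored` share memory with `times`, the cost is independent of the array length, unlike a Python-level loop over the elements.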
if py::isinstance(obj, py::module::import("numpy").attr("datetime64")) is true, then py::isinstance(obj, py::module::import("numpy").attr("integer")) is also true
… do not manipulate the numpy array, return it as is
@jpivarski - please, have a look when you have time. thanks!
@jpivarski - here is an example where a pyarrow
>>> parr = pa.array([datetime.datetime(2002, 1, 23), datetime.datetime(2019, 2, 20)], type=pa.date64())
>>> parr.type
DataType(date64[ms])
>>> narr = ak.layout.NumpyArray(parr)
>>> narr
<NumpyArray format="M8[D]" shape="2" data="Wed Jan 23 00:00:00 2002
Wed Feb 20 00:00:00 2019
" at="0x000109071000"/>
>>> array = ak.Array(parr)
>>> array
<Array [1970-01-01T00:16:51.744000000, ... ] type='2 * datetime'>
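One reading that is consistent with the output above, offered as an assumption rather than a confirmed diagnosis: the millisecond counts stored by date64[ms] end up interpreted with a nanosecond unit. A pure-NumPy sketch (no pyarrow or Awkward Array involved) reproduces the same timestamp:

```python
import numpy as np

# The same two dates, stored as milliseconds since the epoch.
ms = np.array(["2002-01-23", "2019-02-20"], dtype="datetime64[ms]")
raw = ms.view(np.int64)

# Reinterpreting the millisecond counts as nanoseconds (assumed failure mode).
wrong = raw.view("datetime64[ns]")

# Reattaching the unit the integers were actually counted in.
right = raw.view("datetime64[ms]")

first_wrong = str(wrong[0])
first_right = str(right[0])
```

Here `first_wrong` matches the first element printed in the high-level array, which is why keeping the unit attached to the integers matters on every conversion path.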
NumPy deprecated storing timezone information (at the end of this paragraph). Overall, I would say it is a feature that is not really relevant. Though, (/cc @jpivarski)
@ianna To answer your question, this is not an intended use-case (and it works "by accident" because this pyarrow array happens to be flat and interpretable as a buffer):
>>> parr = pa.array([datetime.datetime(2002, 1, 23), datetime.datetime(2019, 2, 20)], type=pa.date64())
>>> parr.type
DataType(date64[ms])
>>> narr = ak.layout.NumpyArray(parr)
The proper way to convert pyarrow data into Awkward Arrays is with
The issues with pyarrow were because we rely on
The appearance of correct times in the low-level NumpyArray view was due to another bug (the two bugs cancelled), which was that the
I think it's done! I noticed that we can't take
Congrats! This was a long time coming!
Here's the Arrow issue: https://issues.apache.org/jira/browse/ARROW-13040
Until we start taking whichever version has this fix as a minimum (Arrow 5? 6?), we'll have to use special cases: and
issue #367