
Support for datetime[*] numpy dtype #367

Closed
drahnreb opened this issue Aug 4, 2020 · 18 comments · Fixed by #835

@drahnreb
Contributor

drahnreb commented Aug 4, 2020

Currently, a pandas DataFrame with a column of, e.g., datetime64[ns] dtype results in an obscure ValueError:
ValueError: cannot include dtype 'M' in a buffer

import pandas as pd
import awkward as ak

values = {'time': ['20190902093000', '20190913093000', '20190921200000']}
df = pd.DataFrame(values, columns=['time'])
df['time'] = pd.to_datetime(df['time'], format='%Y%m%d%H%M%S')
df.dtypes
# time    datetime64[ns]
# dtype: object

# current workaround: reinterpret the datetime64 column as integers
df['time'] = df.time.values.astype(int)
akarr = ak.Record({c: df[c].values for c in df.columns})
@drahnreb added the feature label on Aug 4, 2020
@jpivarski
Member

This is actually waiting on a pybind11 feature: it currently can't ingest datetime64 ('M') or timedelta64 ('m') arrays as py::buffer_info. I think that's being handled in pybind/pybind11#2078 or pybind/pybind11#2085 or both, I'm not sure. (Issue pybind/pybind11#1970 might also be related.)

Either way, we have two dtype enums ready and waiting:

https://github.com/scikit-hep/awkward-1.0/blob/46767d776dcdc825621f8151dbd4289225b09659/include/awkward/util.h#L26-L47

The bigger question, though, is what reducers would mean for these types. ak.sum could make sense for timedelta64 but not datetime64; ak.prod wouldn't make sense for either; ak.min and ak.max could be extended for both. (Maybe datetime64 could reuse the uint64 min/max and timedelta64 could reuse the int64 min/max...)
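
For reference, NumPy's existing reducers already embody these semantics; a quick plain-NumPy check (independent of Awkward):

    import numpy as np

    dates = np.array(["2019-01-01", "2020-01-01"], dtype="datetime64[D]")
    deltas = np.diff(dates)     # timedelta64[D] array

    np.min(dates)               # fine: numpy.datetime64('2019-01-01')
    np.sum(deltas)              # fine: numpy.timedelta64(365,'D')
    # np.sum(dates)             # raises TypeError: adding absolute dates is meaningless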

All the type coercions when concatenating non-temporal and temporal numbers would also have to be added, which is a lot of boilerplate, but straightforward.

@jpivarski
Member

It's actually not a pybind11 error, it's a NumPy error:

>>> import numpy as np
>>> memoryview(np.array(["2018-01-01", "2019-01-01", "2020-01-01"], "datetime64[s]"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot include dtype 'M' in a buffer

and that's why we get an error here:

https://github.com/scikit-hep/awkward-1.0/blob/c41360532ed0e36660fcc918965f49451d1d3677/src/python/content.cpp#L1674

To work around this, we'd have to detect it with array.dtype().kind() == 'M' and then get the data pointer, shape, strides, itemsize, and mock up a format manually using the py::array only, with no help from py::buffer_info.

In the py::array class,

https://github.com/pybind/pybind11/blob/6f3470f757cc5162d5f9115ea9e280e071c212fa/include/pybind11/numpy.h#L543-L834

I see everything that we need except the pointer (we'd have to go through .attr("ctypes").attr("data") to get a Python integer for the pointer, then unbox that as a size_t, then cast that as a void*) and the format string. There is no equivalent of the format string, but I could do py::str(array.dtype()) and cast that as a std::string to get something like "datetime64[s]". That's not a format, but combined with the util::dtype enum telling me it's a datetime64, it's enough to put in special logic to construct the appropriate Python dtype when turning it back into a NumPy array on the other side. It can't go through the buffer interface, but I should change the Python side to call to_numpy() as opposed to to_cupy() to accommodate GPU arrays anyway.
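
In Python terms, everything the workaround needs is reachable without the buffer protocol; here's a minimal NumPy sketch of the data the C++ side would gather (an illustration, not the actual implementation):

    import numpy as np

    a = np.array(["2018-01-01", "2019-01-01", "2020-01-01"], dtype="datetime64[s]")

    ptr = a.ctypes.data          # data pointer as a Python int (unbox as size_t, cast to void*)
    shape, strides = a.shape, a.strides
    itemsize = a.dtype.itemsize  # 8 bytes per datetime64 element
    fmt = str(a.dtype)           # 'datetime64[s]': a dtype name, not a buffer-format string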

@jpivarski
Member

I'd like to thank you, @YannickJadoul, for your help in the above. Feel free to "unwatch" this thread, since giving you credit also pulled you into the GitHub issue as a side-effect.

@stevesimmons

Hello, is there an ETA on fixing this? I was hoping it would land in 1.0.0.

The underlying numpy issue is 6 years old: numpy/numpy#4983.

I'd have a go, except this is probably not the best "first issue" for someone new to Awkward!

@jpivarski
Member

There isn't an ETA, though I've been eyeing it as a medium-priority item. It's on my increasingly-misnamed November bug-fixes project, one of the (currently 3) C++ tasks; the majority are Python tasks and much lower-hanging fruit.

If you're interested in working on this, I can help. In particular, I can give background on the status of the problem, answer questions on a PR (open a draft PR early so we can talk), and even meet on Zoom or another chat for "broader bandwidth" clarifications and context.

Here's some initial background to the status:

  • Since there are many NumPy versions out there already whose datetime dtypes don't support the buffer protocol, an Awkward implementation of this will have to bypass the buffer protocol. This won't be entirely novel, since CuPy arrays are already handled as a special case in the constructor and a method for output, and NumPy arrays with datetime type would have to be handled something like that. It means that np.asarray(numpyarray) won't work on these NumpyArray instances, but this isn't an issue for the high-level user interface. The high-level interface is an ak.Array class that wraps these NumpyArrays (and other layouts).
  • Datetime types are already stubbed out everywhere as enum values that have been commented out. So for instance, searching the repo for "datetime64" reveals all the places where a change is needed.
  • Most of the operations you might do on specialty types are not actually performed by Awkward Array. All ufuncs, for instance, which includes implementations of operator overloads like __add__, __mul__, etc., are passed to NumPy (between unwrapping and rewrapping the one-dimensional NumPy array buried inside the Awkward layout). So most of the things that would be different for date-time types, such as the fact that the difference of two datetime64 arrays is a timedelta64 array, do not need any explicit code in Awkward Array.
  • The exception to the above is reducers (see the base case that all reducers eventually reach, NumpyArray::reduce_next). The ReducerSum class, which implements ak.sum, for instance, needs to know how to add elements of an array and hence it needs to know what types they are. But among all the reducers, only ak.count, ak.count_nonzero, ak.min, ak.max, ak.argmin, and ak.argmax could be used on data of date-time type, and possibly ak.any and ak.all if we consider 0 to be false and any other value to be true. Maybe you can ak.sum timedelta64, but not datetime64. ALL of these would be implemented by view-casting the data as int64_t anyway, so there isn't any computation to do, just a proper accounting of types before and after the reduction (see the sketch after this list).
  • The NumpyArray class has somewhat redundant dtype information: itemsize is the number of bytes in each item, format is the string returned by Python's buffer protocol, and dtype is our own enum, added later to rein in the madness. Part of this "madness" is the platform-dependence of format. On the bright side, that means there's plenty of room to add all the metadata needed to track date-time types through the system. I know that date-times have units, and perhaps these can be put in the format string, since format won't be useful when the NumPy array doesn't pass through the buffer protocol. (I wish I hadn't ever used the buffer protocol; pybind11 made it look like the right way to do this, but there are exceptions like CuPy and date-times that have to work around it.) Since NumPy+Python might start representing date-time types in the format string someday and we don't know what convention they'll choose, perhaps this format string could be prepended by "ak" or something? (It will always be in conjunction with our dtype enum being equal to datetime64 or timedelta64, so maybe that kind of future-proofing is unnecessary.)
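
To make the view-casting in the reducer bullet concrete, here is a NumPy-level sketch (the real work would live in the C++ reducers; this just shows there's no computation beyond type bookkeeping):

    import numpy as np

    dt = np.array(["2019-09-02", "2019-09-13", "2019-09-21"], dtype="datetime64[D]")

    # reinterpret the same bytes as int64: no copy, no arithmetic
    ints = dt.view(np.int64)

    # reduce on the integers, then reapply the original date-time unit
    smallest = np.datetime64(int(ints.min()), "D")
    assert smallest == dt.min()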

The problem of adding date-time dtypes is similar to the problem of adding complex dtypes, which is PR #421, temporarily stopped, but @sjperkins intends to pick it up again. The difference with that case is that Python's buffer protocol does support complex numbers (format strings that start with "Z") and reducers like ak.sum and ak.prod need special calculations for complex types, but date-time types can just be passed through these calculations as though they were int64_t. So they're similar problems, but each is easier/harder in different ways.
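
A quick check of that asymmetry in plain NumPy (the complex format string comes straight from the buffer protocol, while the date-time case raises the error quoted earlier):

    import numpy as np

    memoryview(np.array([1 + 2j])).format               # 'Zd': complex128 passes through
    # memoryview(np.array([0], dtype="datetime64[s]"))  # ValueError: cannot include dtype 'M' in a buffer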

@stevesimmons

Thanks for the extended write-up, @jpivarski. I'll dive into the code this weekend and see if I can make some progress. It feels like a lot to assimilate for the time I have, so I won't promise a solution.

On the positive side, I am highly motivated to try Awkward for a large-scale problem (processing 3 million Arrow tables of 1k-1M rows, each with time histories of some related business entities). Time is an intrinsic part of the processing I need to do on this dataset.

@jpivarski jpivarski moved this from C++ to In Progress in Imminent fixes/features Dec 11, 2020
@jpivarski
Member

Hi @stevesimmons! Let me know if you're willing and able to do this (date-time types in Awkward Array).

If not, I'll move it out of "in progress," but it will be a high priority item for me. There's evidently a lot of interest.

Thanks!

@AllanLeanderRostockHansen

Hi @jpivarski
First of all, I'd like to thank you and your co-creators for Awkward Array – it's so nice to have a tool like this to handle... well, awkward and ragged data structures!

Secondly, while I don't have the skills required to help with the development of the datetime functionality, it's a feature I'd love to see included in Awkward, as I work with a lot of time series data... so please increment the "interest counter" by one ;)

@jpivarski
Member

I think this is the top-requested feature right now. I'll keep that in mind and prioritize it accordingly.

I just went searching for a formal upvote tool, but I couldn't find one that

  • stays open indefinitely (Slideo, which we used to good effect in the PyHEP workshop, only supports week-long events)
  • lets me pick a list of things to provide upvotes for
  • ideally, is something I could embed in the GitHub README, right next to the Roadmap.

GitHub suggests "thumbs up" on issues, but I don't see a way to make a leaderboard of most thumbs-up issues. No wait, yes I do.

I added instructions for upvoting issues, though it will take some time before enough people do this that the vote is very meaningful.

@AllanLeanderRostockHansen

AllanLeanderRostockHansen commented Jan 4, 2021

@jpivarski Does the sort-by-upvotes view of the open issues aggregate the total number of upvotes in the thread, or just the number of upvotes on the initial post?
At the time of writing, the second result seems to have a single upvote quite a bit down the thread... this behaviour might not be the intended one when using it to prioritize issues?

@jpivarski
Member

I'll modify the recommended search string to include a filter for nonzero reactions (on the initial comment): `is:issue is:open sort:reactions-+1-desc reactions:>0`. Only two issues actually have reactions on the initial comment because we've never done this voting thing before.

@jpivarski
Member

> I'll modify the recommended search string to include a filter for nonzero reactions

Done: 97124ef.

Now if anyone is trying to vote, they can check against this list to see if they've actually done it; this might be enough to see that the reaction has to be on the initial comment (or at least, it would prompt a search, rather than just assuming that it worked).

@stevesimmons

> Hi @stevesimmons! Let me know if you're willing and able to do this (date-time types in Awkward Array).
>
> If not, I'll move it out of "in progress," but it will be a high priority item for me. There's evidently a lot of interest.

I'm afraid I have too many other things on my plate right now and won't be able to work on this. Sorry about that, because this is a feature that lots of people clearly want.

@jpivarski jpivarski moved this from In Progress to C++ in Imminent fixes/features Feb 2, 2021
@jpivarski
Member

Thanks for giving it a look! I understand that things come up. Also, this does seem to be the most requested feature right now, so it's high on my priority list, too. I'll feel free to work on an implementation when I get a chance.

@ianna
Collaborator

ianna commented Jun 10, 2021

@drahnreb - please, let me know if this is what you'd expect. Thanks!

    values = {"time": ["20190902093000", "20190913093000", "20190921200000"]}
    df = pandas.DataFrame(values, columns=["time"])
    df["time"] = pandas.to_datetime(df["time"], format="%Y%m%d%H%M%S")
    array = ak.layout.NumpyArray(df)
    assert ak.to_list(array) == [
        np.datetime64("2019-09-02T09:30:00"),
        np.datetime64("2019-09-13T09:30:00"),
        np.datetime64("2019-09-21T20:00:00"),
    ]

@drahnreb
Contributor Author

@ianna - thanks for the huge PR! I could only follow it loosely and have apparently not commented a lot - sorry.

> let me know if this is what you'd expect

Yes, this looks good to me.

As for the type casting to int instead of datetime.datetime when dtype=datetime64[ns], I would handle it like NumPy does, as mentioned here.

I pulled your latest version 2aa2ed5 and did a few tests with a parquet dataset similar to the one I had when I raised this issue. Awkward does not convert the datetime field correctly at this point in time:
<Array [1970-01-19T16:32:31.555000000, ... ] type='14709 * ?datetime'>
instead of array(['2021-03-03T06:05:55.000000000', .... ], dtype='datetime64[ns]').
Meanwhile, pandas and pyarrow (converting to timestamp[us]) do handle it correctly.

I need to investigate and try to give you a reproducer. I will update you shortly.

@jpivarski
Member

@drahnreb I just got this message, too late to stop the auto-merge. Is it incorrect in main? (I corrected a few Arrow-related things at the end, which might be relevant.)

If it's still broken, fixing it can be a new PR.

@drahnreb
Contributor Author

@jpivarski Yes, it's fixed. Thanks, great update!

Not sure if this and my comment passed unnoticed.
This is yet another way of dealing with datetime types (besides the confusing and differing ways in pandas, numpy, and pyarrow, including your reported issue). But after giving it a bit of thought, it is probably the most concise way. It should be documented, though, that ak.to_list() always creates lists of np.datetime64 objects, independent of (but with the same) input unit.

# numpy converts to `py_datetime` iff possible for the dtype's unit...
>>> import numpy as np
>>> import awkward as ak
>>> np_ms = np.asarray([np.datetime64('2019-09-02T09:30:00', 'ms')])
>>> ak_ms = ak.Array(np_ms)
>>> np_ms.tolist()
[datetime.datetime(2019, 9, 2, 9, 30)]
>>> ak_ms.tolist()
[numpy.datetime64('2019-09-02T09:30:00.000000000')]

# ...but `py_datetime` does not support `ns` scaled units
>>> np_ns = np.asarray([np.datetime64('2019-09-02T09:30:00', 'ns')])
>>> ak_ns = ak.Array(np_ns)
>>> np_ns.tolist()
[1567416600000000000]
>>> ak_ns.tolist()
[numpy.datetime64('2019-09-02T09:30:00.000000000')]

# ...but `option` typed numpy arrays are of dtype object
>>> np_ms_opt = np.asarray([np.datetime64('2019-09-02T09:30:00', 'ms'), None])
>>> ak_ms_opt = ak.Array(np_ms_opt)
>>> np_ms_opt.tolist()
[numpy.datetime64('2019-09-02T09:30:00.000'), None]
>>> ak_ms_opt.tolist()
[numpy.datetime64('2019-09-02T09:30:00.000000000'), None]

I guess the reasoning for numpy is that a list should preferably hold Python-native datetime objects if possible/supported.
I do not see a problem here (people have been used to unintuitive datetime conversions for a long time). Since the current implementation is not vectorized, ak.to_list() is also not performance critical? One could argue: why not just convert to pure int64? (But that would prevent convenient arithmetic in pure Python, and then there is also the int96 coercion of timestamps...)
