Support for datetime[*] numpy dtype #367
This is actually waiting on a pybind11 feature: it currently can't ingest datetime64. Either way, we have two dtype enums ready and waiting. The bigger question, though, is what reducers would mean for these types: `ak.sum` could make sense for timedelta64 but not datetime64, `ak.prod` wouldn't make sense for either, though `ak.min` and `ak.max` could be extended for both. (Maybe the datetime64 could reuse the uint64 min/max and the timedelta64 could reuse the int64 min/max...) All the type coercions when concatenating non-temporal and temporal numbers would also have to be added, which is a lot of boilerplate, but straightforward.
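For what it's worth, the min/max-reuse idea can be sketched in plain NumPy, since both temporal dtypes store their values as 64-bit integers under the hood (`temporal_min` is a hypothetical name for illustration, not Awkward's API):

```python
import numpy as np

# datetime64 and timedelta64 are 64-bit integers under the hood, so a
# min/max reducer could compare the integer view and return the element
# at the winning index, keeping the original temporal dtype.
def temporal_min(a):
    return a[np.argmin(a.view(np.int64))]

dates = np.array(["2020-01-01", "2018-01-01", "2019-01-01"], dtype="datetime64[D]")
print(temporal_min(dates))  # 2018-01-01

deltas = np.array([5, -3, 2], dtype="timedelta64[s]")
print(temporal_min(deltas))  # -3 seconds
```

Sum for timedelta64 would work the same way: reduce the int64 view and reinterpret the result, since adding durations is just adding their integer tick counts.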
It's actually not a pybind11 error, it's a NumPy error:

```python
>>> memoryview(np.array(["2018-01-01", "2019-01-01", "2020-01-01"], "datetime64[s]"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot include dtype 'M' in a buffer
```

and that's why we get an error here. To work around this, we'd have to detect the temporal dtype ourselves; I see everything that we need except the pointer.
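A sketch of that workaround in plain NumPy: detect the temporal dtype, pass the buffer across as int64, and reattach the datetime64 interpretation on the other side (illustrative only, not the actual C++ fix):

```python
import numpy as np

dates = np.array(["2018-01-01", "2019-01-01", "2020-01-01"], dtype="datetime64[s]")

# The buffer protocol has no format code for the 'M' dtype...
try:
    memoryview(dates)
except ValueError as err:
    print(err)  # cannot include dtype 'M' in a buffer

# ...but a zero-copy int64 view of the same memory is accepted, so the
# dtype (with its unit) can travel out-of-band and be reapplied later.
mv = memoryview(dates.view(np.int64))
roundtrip = np.frombuffer(mv, dtype=dates.dtype)
print(roundtrip.tolist())
```

The view costs nothing (no copy), and `np.frombuffer` with the original dtype restores the temporal interpretation exactly.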
I'd like to thank you, @YannickJadoul, for your help in the above. Feel free to "unwatch" this thread, since giving you credit also pulled you into the GitHub issue as a side-effect.
Hello, is there an ETA on fixing this? I was hoping it would land in 1.0.0. The underlying numpy issue is 6 years old: numpy/numpy#4983. I'd have a go, except this is probably not the best "first issue" for someone new to Awkward! |
There isn't an ETA, though I've been eyeing it as a medium-priority item. It's on my increasingly-misnamed November bug-fixes project, one of the (currently 3) C++ tasks; the majority are Python tasks and much lower-hanging fruit. If you're interested in working on this, I can help. In particular, I can give background and status of the problem, answer questions on a PR (open a draft PR early so we can talk), and even meet on Zoom or other chat for "broader bandwidth" clarifications and context. Here's some initial background on the status:
The problem of adding date-time dtypes is similar to the problem of adding complex dtypes, which is PR #421, temporarily stopped, but @sjperkins intends to pick it up again. The difference with that case is that Python's buffer protocol does support complex numbers, whereas it rejects date-times.
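That contrast is easy to check at the Python level (just an illustration of the buffer-protocol difference, unrelated to either PR):

```python
import numpy as np

# complex128 has a PEP 3118 format code, so the buffer protocol accepts it...
c = np.array([1 + 2j, 3 - 4j], dtype=np.complex128)
print(memoryview(c).format)

# ...while datetime64 has none and is rejected outright.
d = np.array(["2020-01-01"], dtype="datetime64[D]")
try:
    memoryview(d)
    print("accepted")
except ValueError as err:
    print("rejected:", err)
```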
Thanks for the extended write-up, @jpivarski. I'll dive into the code this weekend and see if I can make some progress. It feels like a lot to assimilate for the time I have, so I won't promise a solution. On the positive side, I am highly motivated to try Awkward for a large-scale problem (processing 3m Arrow tables of 1k-1m rows, each with time histories of some related business entities). Time is an intrinsic part of the processing I need to do on this dataset.
Hi @stevesimmons! Let me know if you're willing and able to do this (date-time types in Awkward Array). If not, I'll move it out of "in progress," but it will be a high priority item for me. There's evidently a lot of interest. Thanks!
Hi @jpivarski. While I don't have the skills required to help with the development of the datetime functionality, it's a feature I'd love to see included in Awkward, as I work with a lot of time series data... so please increment the "interest counter" by one ;)
I think this is the top-requested feature right now. I'll keep that in mind and prioritize it accordingly. I just went searching for a formal upvote tool, but couldn't find one. GitHub suggests "thumbs up" reactions on issues, but I didn't see a way to make a leaderboard of most-thumbs-up issues. No wait, yes I do. I added instructions for upvoting issues, though it will take some time before enough people do this that the vote is very meaningful.
@jpivarski Does the sort-by-upvotes view of the open issues aggregate the total number of upvotes in the thread, or just the number of upvotes on the initial post? |
I'll modify the recommended search string to include a filter for nonzero reactions (on the initial comment): `is:issue is:open sort:reactions-+1-desc reactions:>0`. Only two issues actually have reactions on the initial comment because we've never done this voting thing before.
Done: 97124ef. Now if anyone is trying to vote, they can check against this list to see if they've actually done it; this might be enough to see that the reaction has to be on the initial comment (or at least, it would prompt a search, rather than just assuming that it worked).
I'm afraid I have too many other things on my plate right now and won't be able to work on this. Sorry about that, because this is a feature that lots of people clearly want.
Thanks for giving it a look! I understand that things come up. Also, this does seem to be the most requested feature right now, so it's high on my priority list, too. I'll feel free to work on an implementation when I get a chance.
@drahnreb - please, let me know if this is what you'd expect. Thanks!

```python
values = {"time": ["20190902093000", "20190913093000", "20190921200000"]}
df = pandas.DataFrame(values, columns=["time"])
df["time"] = pandas.to_datetime(df["time"], format="%Y%m%d%H%M%S")
array = ak.layout.NumpyArray(df)
assert ak.to_list(array) == [
    np.datetime64("2019-09-02T09:30:00"),
    np.datetime64("2019-09-13T09:30:00"),
    np.datetime64("2019-09-21T20:00:00"),
]
```
@ianna - thanks for the huge PR! I could only follow it loosely and have apparently not commented a lot - sorry.
Yes, this looks good to me. I pulled your latest version 2aa2ed5 and did a few tests, and I hit an issue that I need to investigate so I can give you a reproducer. I will update you shortly.
@drahnreb I just got this message, too late to stop the auto-merge. Is it incorrect in main? (I corrected a few Arrow-related things at the end, which might be relevant.) If it's still broken, fixing it can be a new PR.
@jpivarski Yes, it's fixed. Thanks, great update! Not sure if this and my comment passed unnoticed:

```python
# numpy converts to `py_datetime` iff possible and same type...
>>> np_ms = np.asarray([np.datetime64('2019-09-02T09:30:00', 'ms')])
>>> ak_ms = ak.Array(np_ms)
>>> np_ms.tolist()
[datetime.datetime(2019, 9, 2, 9, 30)]
>>> ak_ms.tolist()
[numpy.datetime64('2019-09-02T09:30:00.000000000')]

# ...but `py_datetime` does not support `ns` scaled units
>>> np_ns = np.asarray([np.datetime64('2019-09-02T09:30:00', 'ns')])
>>> ak_ns = ak.Array(np_ns)
>>> np_ns.tolist()
[1567416600000000000]
>>> ak_ns.tolist()
[numpy.datetime64('2019-09-02T09:30:00.000000000')]

# ...but `option` typed numpy arrays are of dtype object
>>> np_ms_opt = np.asarray([np.datetime64('2019-09-02T09:30:00', 'ms'), None])
>>> ak_ms_opt = ak.Array(np_ms_opt)
>>> np_ms_opt.tolist()
[numpy.datetime64('2019-09-02T09:30:00.000'), None]
>>> ak_ms_opt.tolist()
[numpy.datetime64('2019-09-02T09:30:00.000000000'), None]
```

I guess the reasoning for numpy is that a list should preferably hold Python-native datetime objects if possible/supported.
Currently, a pandas DataFrame with a column of e.g. datetime64[ns] dtype results in an obscure ValueError:

```
ValueError: cannot include dtype 'M' in a buffer
```

where the column's dtype is:

```
time    datetime64[ns]
dtype: object
```
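Until native support lands, one way to get such a column into buffer-friendly form is to pass the raw int64 values and carry the unit as side-band information (a user-level workaround sketch, not Awkward's API; assumes pandas is available):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"time": pd.to_datetime(["2019-09-02 09:30:00",
                                           "2019-09-13 09:30:00"])})
col = df["time"].to_numpy()  # dtype datetime64[ns]

# This is the failure the issue describes: 'M' has no buffer format code.
try:
    memoryview(col)
except ValueError as err:
    print(err)  # cannot include dtype 'M' in a buffer

# Workaround: view the same memory as int64 (ns since the epoch) and
# reinterpret on the consuming side.
raw = col.view(np.int64)
restored = raw.view("datetime64[ns]")
print((restored == col).all())  # True
```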