to_datetime poor performance parsing string datetimes #1571
I ran your file on my box and I get a performance difference of about >2x.

In [7]: %timeit using_to_datetime(test_data)
In [8]: %timeit faking_tz(test_data)
In [9]: %timeit concat_gmt_tz(test_data)

I think it's because pandas is using dateutil internally, while numpy uses its own parser, which is faster:

In [10]: from numpy.core._mx_datetime_parser import datetime_from_string as p2
In [11]: from dateutil.parser import parse as p1
In [12]: %timeit test_data.apply(p1)
In [13]: %timeit test_data.apply(p2)

I'll see whether we can convert pandas to use the faster date-parsing code.
Sorry, I should have said it before: I'm using version '1.8.0.dev-6a06466' of numpy and '0.8.0' of pandas. It seems I get very different performance due to recent improvements in numpy. The results I get are these:

to_datetime(): 8.36494483948

I also ran some simple tests using both numpy '1.8.0.dev-6a06466' and '1.6.2' to compare performance:

In [2]: import numpy as np

The results are ~25 times faster in the newer version. So parsing dates with the current development version of numpy seems to be significantly faster than using to_datetime. Maybe to_datetime could make use of the new numpy improvements in the future, or try to apply the same optimizations. It would be really nice to be able to use to_datetime with performance similar to what numpy offers. Thanks again and regards.
Thanks for the feedback!
It should be straightforward to optimize |
I was able to optimize the ISO8601 case and bring down the parsing time on 20000 strings from 1.87 seconds to 22.1 ms (85x improvement). Do you think this is adequate? I don't think things can get all that much faster than this.
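The kind of fast path described here can be sketched in pure Python: a fixed-layout ISO 8601 string can be parsed by slicing, skipping the overhead of a general-purpose parser that must guess the format of every string. This is an illustrative sketch only, not pandas' actual implementation (which does this at the C level):

```python
from datetime import datetime

def parse_iso8601(s):
    # Fixed-layout slicing for "YYYY-MM-DD HH:MM:SS" strings.
    # Much cheaper than dateutil.parser.parse, which has to infer
    # the format of each string from scratch.
    return datetime(int(s[0:4]), int(s[5:7]), int(s[8:10]),
                    int(s[11:13]), int(s[14:16]), int(s[17:19]))

print(parse_iso8601("2012-07-06 10:05:58"))
```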
Wow, it's definitely a huge improvement. It's even faster than doing .astype("datetime64"). These are the results I get now in terms of performance:

to_datetime(): 0.0160160064697

However, I've noticed the result values are different now (actually, the asserts in my sample code fail). It seems it is transforming the datetimes according to the local timezone, so, for instance, "2012-01-01 00:00:00" becomes "2011-12-31 23:00:00" in my timezone (CET). This is consistent with the results yielded by .astype("datetime64") and np.array([...], dtype="datetime64").
For my application I would like a timezone-agnostic parsing utility, just like the older to_datetime, but maybe it makes more sense for to_datetime to behave as it does now; I don't really know. This is a discussion I'm not fit to get into ;). Please let me know what you think about this.
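The one-hour shift described above is exactly what local-time interpretation produces: if a parser treats "2012-01-01 00:00:00" as CET wall time and stores the instant in UTC, the stored value lands one hour earlier. A stdlib sketch of that conversion (using a fixed +01:00 offset as a stand-in for winter CET):

```python
from datetime import datetime, timezone, timedelta

# Fixed +01:00 offset standing in for CET in winter (illustrative only).
CET = timezone(timedelta(hours=1))

# Interpreting the naive string as CET wall time...
local = datetime(2012, 1, 1, 0, 0, 0, tzinfo=CET)

# ...and normalizing to UTC shifts the instant back one hour,
# which is the discrepancy reported above.
utc = local.astimezone(timezone.utc)
print(utc)  # 2011-12-31 23:00:00+00:00
```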
I am able to reproduce the issue. I'll try to figure out a fix.
I was able to fix this, so strings are parsed as naive times now.
Awesome work! After this fix it even runs about 3x faster, on top of your first 85x improvement. That really makes a difference with large datasets. My sample code outputs these estimates now:

to_datetime(): 0.00525259971619

Thanks.
Hi,
I want to convert to datetime64 a Series that contains datetimes as strings. The format is '%Y-%m-%d %H:%M:%S' ('2012-07-06 10:05:58', for instance).
Casting the strings array into a datetime64 array in numpy (or using Series.astype("datetime64")) is fast, but it transforms the datetimes according to the local timezone, which is not the behavior I want in this case. Pandas' to_datetime function parses correctly, but it is much slower.
However, it is also possible to do the parsing right and fast with numpy by appending the "+0000" timezone suffix to every string before parsing/casting to datetime64. So I wonder, is there any reason why to_datetime() runs much slower than this approach?
Thanks and regards.
Some sample code to illustrate the issue:
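(The original snippet was not preserved in this copy of the thread. A minimal sketch of the comparison being described, with hypothetical names — `test_data`, `using_to_datetime`, `casting_to_datetime64` are stand-ins, not the reporter's actual code:)

```python
import numpy as np
import pandas as pd

# Stand-in for the reporter's data: a Series of '%Y-%m-%d %H:%M:%S' strings.
test_data = pd.Series(["2012-07-06 10:05:58", "2012-01-01 00:00:00"])

def using_to_datetime(s):
    # pandas' parser: yields naive timestamps, but was slow at the time.
    return pd.to_datetime(s)

def casting_to_datetime64(s):
    # numpy's parser: fast, but 2012-era numpy reinterpreted bare strings
    # in the local timezone; appending "+0000" to each string was the
    # workaround discussed above. (Modern numpy parses these strings as
    # naive, so the shift no longer occurs.)
    return s.values.astype("datetime64[s]")

print(using_to_datetime(test_data)[0])
print(casting_to_datetime64(test_data)[0])
```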