
to_datetime poor performance parsing string datetimes #1571

Closed
manuteleco opened this issue Jul 6, 2012 · 9 comments
Labels: Datetime (Datetime data dtype)

@manuteleco

Hi,

I want to convert a Series that contains datetimes as strings to datetime64. The format is '%Y-%m-%d %H:%M:%S' ('2012-07-06 10:05:58', for instance).

Casting the strings array into a datetime64 array in numpy (or using Series.astype("datetime64")) is fast, but it transforms the datetimes according to the local timezone, which is not the behavior I want in this case. Pandas to_datetime function does the parsing right, but it is much slower.

However, it is also possible to do the parsing right and fast with numpy by appending the "+0000" timezone suffix to every string before parsing/casting to datetime64. So I wonder, is there any reason why to_datetime() runs much slower than this approach?

Thanks and regards.

Some sample code to illustrate the issue:

import os
import time
import numpy as np
from datetime import datetime, timedelta
from pandas import DatetimeIndex, Series, to_datetime
from pandas.tseries.offsets import Minute

DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Decorator to execute code in a faked GMT timezone
def GMT_Timezone(func):
    def wrapper(*args, **kwargs):
        # Set timezone to GMT, remembering the previous TZ setting (if any)
        previous_tz = os.environ.get('TZ')
        os.environ['TZ'] = 'GMT'
        time.tzset()
        # Execute function
        result = func(*args, **kwargs)
        # Restore the previous timezone state
        if previous_tz is None:
            del os.environ['TZ']
        else:
            os.environ['TZ'] = previous_tz
        time.tzset()
        return result
    return wrapper

def generate_test_data():
    start_date = datetime(2012, 1, 1)
    end_date = datetime(2012, 1, 31)
    minutes = DatetimeIndex(start=start_date, end=end_date, freq=Minute())
    str_minutes = Series(minutes.map(lambda d: d.strftime(DATETIME_FORMAT)))
    return str_minutes

def simple_casting(data):
    return data.astype("datetime64")

def shifting_afterwards(data):
    casted = data.astype("datetime64")
    shifted = DatetimeIndex(casted) - timedelta(seconds=time.altzone)
    return Series(shifted)

def concat_gmt_tz(data):
    data = np.add(data, "+0000")
    return data.astype("datetime64")

def using_to_datetime(data):
    return to_datetime(data)

@GMT_Timezone
def faking_tz(data):
    return data.astype("datetime64")

if __name__ == '__main__':
    test_data = generate_test_data()

    # Some approaches to parsing the datetime string
    using_to_dt = using_to_datetime(test_data)
    simple = simple_casting(test_data)
    shifted = shifting_afterwards(test_data)
    faked_tz = faking_tz(test_data)
    concat_tz = concat_gmt_tz(test_data)

    # Simple casting does not work. During the process, numpy converts the date
    # into GMT according to the local timezone.
    assert not (simple == using_to_dt).all(), "Fails if run from GMT timezone."
    # Shifting after casting does not work either, the problem being that
    # we should shift using a different timezone depending on the date we are
    # parsing (summer vs winter) -> more complex + error-prone + slower
    assert not (shifted == using_to_dt).all()
    # Faking the timezone to be GMT avoids datetime transformations, so it seems
    # to work, although since we are messing with environmental variables
    # it gets risky (multithreading, other side-effects?)
    assert (faked_tz == using_to_dt).all()
    assert (concat_tz == using_to_dt).all()

    # Now to performance
    from timeit import Timer
    ITERATIONS = 5
    t = Timer("using_to_datetime(test_data)",
              "from __main__ import generate_test_data, using_to_datetime;" + \
              "test_data = generate_test_data()")
    t_using_to_dt = t.timeit(ITERATIONS) / ITERATIONS

    t = Timer("faking_tz(test_data)",
              "from __main__ import generate_test_data, faking_tz;" + \
              "test_data = generate_test_data()")
    t_faking_tz = t.timeit(ITERATIONS) / ITERATIONS

    t = Timer("concat_gmt_tz(test_data)",
              "from __main__ import generate_test_data, concat_gmt_tz;" + \
              "test_data = generate_test_data()")
    t_concat_tz = t.timeit(ITERATIONS) / ITERATIONS

    print "to_datetime():", t_using_to_dt
    print "faking tz:", t_faking_tz
    # to_datetime() ~90 times slower than .astype("datetime64")
    print "Ratio:", t_using_to_dt / t_faking_tz

    print

    print "to_datetime():", t_using_to_dt
    print "concat tz:", t_concat_tz
    # to_datetime() ~270 times slower than concat tz + .astype("datetime64")
    print "Ratio:", t_using_to_dt / t_concat_tz
@changhiskhan (Contributor)

I ran your file on my box and I get a performance difference of somewhat more than 2x.

In [7]: %timeit using_to_datetime(test_data)
1 loops, best of 3: 4 s per loop

In [8]: %timeit faking_tz(test_data)
1 loops, best of 3: 1.44 s per loop

In [9]: %timeit concat_gmt_tz(test_data)
1 loops, best of 3: 1.74 s per loop

I think it's because pandas is using dateutil internally while numpy uses its own parser that's faster:

In [10]: from numpy.core._mx_datetime_parser import datetime_from_string as p2

In [11]: from dateutil.parser import parse as p1

In [12]: %timeit test_data.apply(p1)
1 loops, best of 3: 3.75 s per loop

In [13]: %timeit test_data.apply(p2)
1 loops, best of 3: 1.22 s per loop

I'll see whether we can convert pandas to use the faster date parsing code.

@manuteleco (Author)

Sorry, I should have mentioned it before, but I'm using numpy version '1.8.0.dev-6a06466' and pandas '0.8.0'. It seems I'm getting very different performance numbers due to recent improvements in numpy.

The results I get are these:
to_datetime(): 8.36494483948
faking tz: 0.0932590007782
Ratio: 89.6958445799

to_datetime(): 8.36494483948
concat tz: 0.0312120437622
Ratio: 268.003752116

I also ran some simple tests using both numpy '1.8.0.dev-6a06466' and '1.6.2' to compare performance:

In [2]: import numpy as np
In [3]: from datetime import datetime
In [4]: DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
In [5]: str_datetime = datetime.now().strftime(DATETIME_FORMAT)
In [7]: %timeit np.datetime64(str_datetime)

The results are:

  • '1.8.0.dev-6a06466':
    100000 loops, best of 3: 3.83 us per loop
  • '1.6.2':
    10000 loops, best of 3: 95.9 us per loop

~25 times faster in the newer version.

So parsing with the current development version of numpy seems to be significantly faster than using to_datetime. Maybe to_datetime could make use of the new numpy improvements in the future, or apply the same optimizations. It would be really nice to be able to use to_datetime with performance similar to what numpy offers.

Thanks again and regards.

@changhiskhan (Contributor)

Thanks for the feedback!
I'll keep the issue open until we improve the performance on to_datetime.

@wesm (Member)

wesm commented Jul 8, 2012

It should be straightforward to optimize to_datetime for ISO8601 format (what you're describing). Currently it is very general and handles a lot more date formats than NumPy does. Will let you know when one of us gets a chance to work on it.
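
For illustration, here is a minimal sketch of what such an ISO8601 fast path could look like (not pandas' actual implementation): attempt a strict fixed-format parse first, and fall back to the general dateutil parser only when that fails.

from datetime import datetime
from dateutil.parser import parse as general_parse

ISO_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_one(value):
    try:
        # Fast path: strptime with a fixed format skips dateutil's
        # flexible (and comparatively slow) format detection.
        return datetime.strptime(value, ISO_FORMAT)
    except ValueError:
        # Anything non-ISO still goes through the general parser.
        return general_parse(value)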

@wesm (Member)

wesm commented Jul 11, 2012

I was able to optimize the ISO8601 case and bring down the parsing time on 20000 strings from 1.87 seconds to 22.1 ms (85x improvement). Do you think this is adequate? I don't think things can get all that much faster than this.

In [4]: to_datetime(strings)
Out[4]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01 00:00:00, ..., 2002-04-13 07:00:00]
Length: 20000, Freq: None, Timezone: None

In [5]: timeit to_datetime(strings)
1 loops, best of 3: 1.87 s per loop

In [6]: exit
18:54 ~/code/pandas  ((3824af1...))$ git checkout master
Previous HEAD position was 3824af1... BUG: override ndarray.tolist in Index for MultiIndex compat, close #1576
Switched to branch 'master'
Your branch is ahead of 'origin/master' by 1 commit.

...

18:55 ~/code/pandas  (master)$ ipy

In [1]: paste
rng = date_range('1/1/2000', periods=20000, freq='h')
strings = [x.strftime('%Y-%m-%d %H:%M:%S') for x in rng]
## -- End pasted text --

In [2]: timeit to_datetime(strings)
10 loops, best of 3: 22.1 ms per loop

In [3]: 1870 / 22.1
Out[3]: 84.61538461538461

@manuteleco (Author)

Wow, that is definitely a huge improvement. It's even faster than doing .astype("datetime64"). These are the results I get now in terms of performance:

to_datetime(): 0.0160160064697
faking tz: 0.0268998146057
Ratio: 0.595394678532

to_datetime(): 0.0160160064697
concat tz: 0.00913701057434
Ratio: 1.75287161369

However, I've noticed the result values are different now (the asserts in my sample code actually fail). It seems it is now transforming the datetimes according to the local timezone, so, for instance, "2012-01-01 00:00:00" becomes "2011-12-31 23:00:00" in my timezone (CET). This is consistent with the results yielded by .astype("datetime64") and np.array([...], dtype="datetime64"); a minimal reproduction snippet follows the list below.

  • Before:
    to_datetime(["2012-01-01 00:00:00"]) -> "2012-01-01 00:00:00"
    Series(["2012-01-01 00:00:00"]).astype("datetime64") -> "2011-12-31 23:00:00"
  • Now:
    to_datetime(["2012-01-01 00:00:00"]) -> "2011-12-31 23:00:00"
    Series(["2012-01-01 00:00:00"]).astype("datetime64") -> "2011-12-31 23:00:00"
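
A minimal snippet to reproduce the discrepancy (a sketch assuming a non-UTC local timezone such as CET):

from pandas import Series, to_datetime

s = Series(["2012-01-01 00:00:00"])
print to_datetime(s)          # now prints the shifted value, 2011-12-31 23:00:00 under CET
print s.astype("datetime64")  # numpy-style casting gives the same shifted value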

For my application I would like a timezone-agnostic parsing utility, just like the older to_datetime, but maybe it makes more sense for to_datetime to behave like it does now; I don't really know. This is a discussion I'm not fit to get into ;).

Please, let me know what you think about this.
Thanks for the amazing work you are all doing and sorry for the trouble ;)

@wesm (Member)

wesm commented Jul 13, 2012

I am able to reproduce the issue. I'll try to figure out a fix.

wesm closed this as completed in 5a25499 on Jul 13, 2012
@wesm (Member)

wesm commented Jul 13, 2012

I was able to fix this, so strings are parsed as naive times now.
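
As a quick sanity check of the naive-time behavior (a hypothetical snippet, not from the thread), the parsed value should now match the wall-clock string regardless of the local timezone:

from pandas import Timestamp, to_datetime

idx = to_datetime(["2012-01-01 00:00:00"])
# Naive parse: no timezone conversion is applied.
assert idx[0] == Timestamp("2012-01-01 00:00:00")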

@manuteleco (Author)

Awesome work! After this fix it even runs about 3x faster on top of your initial 85x improvement. That really makes a difference on large datasets.

My sample code outputs these estimates now:

to_datetime(): 0.00525259971619
faking tz: 0.0250147819519
Ratio: 0.209979832176

to_datetime(): 0.00525259971619
concat tz: 0.00868258476257
Ratio: 0.604958069494

Thanks.
