
to_datetime poor performance parsing string datetimes #1571

Closed
manuteleco opened this issue Jul 6, 2012 · 9 comments
Labels: Datetime (Datetime data dtype)

@manuteleco

Hi,

I want to convert a Series that contains datetimes as strings to datetime64. The format is '%Y-%m-%d %H:%M:%S' ('2012-07-06 10:05:58', for instance).

Casting the strings array into a datetime64 array in numpy (or using Series.astype("datetime64")) is fast, but it transforms the datetimes according to the local timezone, which is not the behavior I want in this case. Pandas to_datetime function does the parsing right, but it is much slower.

However, it is also possible to do the parsing right and fast with numpy by appending the "+0000" timezone suffix to every string before parsing/casting to datetime64. So I wonder, is there any reason why to_datetime() runs much slower than this approach?

Thanks and regards.

Some sample code to illustrate the issue:

import os
import time
import numpy as np
from datetime import datetime, timedelta
from pandas import DatetimeIndex, Series, to_datetime
from pandas.tseries.offsets import Minute

DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Decorator to execute code in a faked GMT timezone
def GMT_Timezone(func):
    def wrapper(*args, **kwargs):
        # Set timezone to GMT, remembering the previous TZ setting (if any)
        previous_tz = os.environ.get('TZ')
        os.environ['TZ'] = 'GMT'
        time.tzset()
        # Execute function
        result = func(*args, **kwargs)
        # Restore the previous timezone state
        if previous_tz is None:
            del os.environ['TZ']
        else:
            os.environ['TZ'] = previous_tz
        time.tzset()
        return result
    return wrapper

def generate_test_data():
    start_date = datetime(2012, 1, 1)
    end_date = datetime(2012, 1, 31)
    minutes = DatetimeIndex(start=start_date, end=end_date, freq=Minute())
    str_minutes = Series(minutes.map(lambda d: d.strftime(DATETIME_FORMAT)))
    return str_minutes

def simple_casting(data):
    return data.astype("datetime64")

def shifting_afterwards(data):
    casted = data.astype("datetime64")
    shifted = DatetimeIndex(casted) - timedelta(seconds=time.altzone)
    return Series(shifted)

def concat_gmt_tz(data):
    data = np.add(data, "+0000")
    return data.astype("datetime64")

def using_to_datetime(data):
    return to_datetime(data)

@GMT_Timezone
def faking_tz(data):
    return data.astype("datetime64")

if __name__ == '__main__':
    test_data = generate_test_data()

    # Some approaches to parsing the datetime string
    using_to_dt = using_to_datetime(test_data)
    simple = simple_casting(test_data)
    shifted = shifting_afterwards(test_data)
    faked_tz = faking_tz(test_data)
    concat_tz = concat_gmt_tz(test_data)

    # Simple casting does not work. During the process, numpy converts the date
    # into GMT according to the local timezone.
    assert not (simple == using_to_dt).all(), "Fails if run from GMT timezone."
    # Shifting after casting does not work either, the problem being that
    # we should shift using a different timezone depending on the date we are
    # parsing (summer vs winter) -> more complex + error-prone + slower
    assert not (shifted == using_to_dt).all()
    # Faking the timezone to be GMT avoids datetime transformations, so it seems
    # to work, although since we are messing with environmental variables
    # it gets risky (multithreading, other side-effects?)
    assert (faked_tz == using_to_dt).all()
    assert (concat_tz == using_to_dt).all()

    # Now to performance
    from timeit import Timer
    ITERATIONS = 5
    t = Timer("using_to_datetime(test_data)",
              "from __main__ import generate_test_data, using_to_datetime;" + \
              "test_data = generate_test_data()")
    t_using_to_dt = t.timeit(ITERATIONS) / ITERATIONS

    t = Timer("faking_tz(test_data)",
              "from __main__ import generate_test_data, faking_tz;" + \
              "test_data = generate_test_data()")
    t_faking_tz = t.timeit(ITERATIONS) / ITERATIONS

    t = Timer("concat_gmt_tz(test_data)",
              "from __main__ import generate_test_data, concat_gmt_tz;" + \
              "test_data = generate_test_data()")
    t_concat_tz = t.timeit(ITERATIONS) / ITERATIONS

    print "to_datetime():", t_using_to_dt
    print "faking tz:", t_faking_tz
    # to_datetime() ~90 times slower than .astype("datetime64")
    print "Ratio:", t_using_to_dt / t_faking_tz

    print

    print "to_datetime():", t_using_to_dt
    print "concat tz:", t_concat_tz
    # to_datetime() ~270 times slower than concat tz + .astype("datetime64")
    print "Ratio:", t_using_to_dt / t_concat_tz
@changhiskhan (Contributor)

I ran your file on my box and I get a performance difference of somewhat more than 2x.

In [7]: %timeit using_to_datetime(test_data)
1 loops, best of 3: 4 s per loop

In [8]: %timeit faking_tz(test_data)
1 loops, best of 3: 1.44 s per loop

In [9]: %timeit concat_gmt_tz(test_data)
1 loops, best of 3: 1.74 s per loop

I think it's because pandas is using dateutil internally while numpy uses its own parser that's faster:

In [10]: from numpy.core._mx_datetime_parser import datetime_from_string as p2

In [11]: from dateutil.parser import parse as p1

In [12]: %timeit test_data.apply(p1)
1 loops, best of 3: 3.75 s per loop

In [13]: %timeit test_data.apply(p2)
1 loops, best of 3: 1.22 s per loop

I'll see whether we can convert pandas to use the faster date parsing code.

@manuteleco (Author)

Sorry, I should have mentioned it before, but I'm using numpy version '1.8.0.dev-6a06466' and pandas '0.8.0'. It seems I'm getting very different performance numbers due to recent improvements in numpy.

The results I get are these:
to_datetime(): 8.36494483948
faking tz: 0.0932590007782
Ratio: 89.6958445799

to_datetime(): 8.36494483948
concat tz: 0.0312120437622
Ratio: 268.003752116

I also ran some simple tests using both numpy '1.8.0.dev-6a06466' and '1.6.2' to compare performance:

In [2]: import numpy as np
In [3]: from datetime import datetime
In [4]: DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
In [5]: str_datetime = datetime.now().strftime(DATETIME_FORMAT)
In [7]: %timeit np.datetime64(str_datetime)

The results are:

  • '1.8.0.dev-6a06466':
    100000 loops, best of 3: 3.83 us per loop
  • '1.6.2':
    10000 loops, best of 3: 95.9 us per loop

~25 times faster in the newer version.

So parsing with the current development version of numpy seems to be significantly faster than using to_datetime. Maybe to_datetime could make use of the new numpy improvements in the future, or apply the same optimizations. It would be really nice to be able to use to_datetime with performance similar to what numpy offers.

Thanks again and regards.

@changhiskhan (Contributor)

Thanks for the feedback!
I'll keep the issue open until we improve the performance on to_datetime.

@wesm (Member)

wesm commented Jul 8, 2012

It should be straightforward to optimize to_datetime for ISO8601 format (what you're describing). Currently it is very general and handles a lot more date formats than NumPy does. Will let you know when one of us gets a chance to work on it.
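
For illustration, here is a minimal sketch of what such an ISO8601 fast path could look like (not pandas' actual implementation): attempt a strict fixed-format parse first, and fall back to the general dateutil parser only when that fails.

from datetime import datetime
from dateutil.parser import parse as general_parse

ISO_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_one(value):
    try:
        # Fast path: strptime with a fixed format skips dateutil's
        # flexible (and comparatively slow) format detection.
        return datetime.strptime(value, ISO_FORMAT)
    except ValueError:
        # Anything non-ISO still goes through the general parser.
        return general_parse(value)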

@wesm (Member)

wesm commented Jul 11, 2012

I was able to optimize the ISO8601 case and bring down the parsing time on 20000 strings from 1.87 seconds to 22.1 ms (85x improvement). Do you think this is adequate? I don't think things can get all that much faster than this.

In [4]: to_datetime(strings)
Out[4]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01 00:00:00, ..., 2002-04-13 07:00:00]
Length: 20000, Freq: None, Timezone: None

In [5]: timeit to_datetime(strings)
1 loops, best of 3: 1.87 s per loop

In [6]: exit
18:54 ~/code/pandas  ((3824af1...))$ git checkout master
Previous HEAD position was 3824af1... BUG: override ndarray.tolist in Index for MultiIndex compat, close #1576
Switched to branch 'master'
Your branch is ahead of 'origin/master' by 1 commit.

...

18:55 ~/code/pandas  (master)$ ipy

In [1]: paste
rng = date_range('1/1/2000', periods=20000, freq='h')
strings = [x.strftime('%Y-%m-%d %H:%M:%S') for x in rng]
## -- End pasted text --

In [2]: timeit to_datetime(strings)
10 loops, best of 3: 22.1 ms per loop

In [3]: 1870 / 22.1
Out[3]: 84.61538461538461

@manuteleco (Author)

Wow, that is definitely a huge improvement. It's even faster than doing .astype("datetime64"). These are the results I get now in terms of performance:

to_datetime(): 0.0160160064697
faking tz: 0.0268998146057
Ratio: 0.595394678532

to_datetime(): 0.0160160064697
concat tz: 0.00913701057434
Ratio: 1.75287161369

However, I've noticed the result values are different now (the asserts in my sample code actually fail). It seems it is now transforming the datetimes according to the local timezone, so, for instance, "2012-01-01 00:00:00" becomes "2011-12-31 23:00:00" in my timezone (CET). This is consistent with the results yielded by .astype("datetime64") and np.array([...], dtype="datetime64"); a minimal reproduction snippet follows the list below.

  • Before:
    to_datetime(["2012-01-01 00:00:00"]) -> "2012-01-01 00:00:00"
    Series(["2012-01-01 00:00:00"]).astype("datetime64") -> "2011-12-31 23:00:00"
  • Now:
    to_datetime(["2012-01-01 00:00:00"]) -> "2011-12-31 23:00:00"
    Series(["2012-01-01 00:00:00"]).astype("datetime64") -> "2011-12-31 23:00:00"
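
A minimal snippet to reproduce the discrepancy (a sketch assuming a non-UTC local timezone such as CET):

from pandas import Series, to_datetime

s = Series(["2012-01-01 00:00:00"])
print to_datetime(s)          # now prints the shifted value, 2011-12-31 23:00:00 under CET
print s.astype("datetime64")  # numpy-style casting gives the same shifted value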

For my application I would like a timezone-agnostic parsing utility, just like the older to_datetime, but maybe it makes more sense for to_datetime to behave like it does now; I don't really know. This is a discussion I'm not fit to get into ;).

Please, let me know what you think about this.
Thanks for the amazing work you are all doing and sorry for the trouble ;)

@wesm (Member)

wesm commented Jul 13, 2012

I am able to reproduce the issue. I'll try to figure out a fix.

wesm closed this as completed in 5a25499 on Jul 13, 2012
@wesm (Member)

wesm commented Jul 13, 2012

I was able to fix this, so strings are parsed as naive times now.
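
As a quick sanity check of the naive-time behavior (a hypothetical snippet, not from the thread), the parsed value should now match the wall-clock string regardless of the local timezone:

from pandas import Timestamp, to_datetime

idx = to_datetime(["2012-01-01 00:00:00"])
# Naive parse: no timezone conversion is applied.
assert idx[0] == Timestamp("2012-01-01 00:00:00")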

@manuteleco (Author)

Awesome work! After this fix it even runs about 3x faster on top of your initial 85x improvement. That really makes a difference on large datasets.

My sample code outputs these estimates now:

to_datetime(): 0.00525259971619
faking tz: 0.0250147819519
Ratio: 0.209979832176

to_datetime(): 0.00525259971619
concat tz: 0.00868258476257
Ratio: 0.604958069494

Thanks.
