PERF: strftime is slow #44764

Open

Description

@auderson
Contributor

  • I have checked that this issue has not already been reported.
  • I have confirmed this issue exists on the latest version of pandas.
  • I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

I found that pd.DatetimeIndex.strftime is pretty slow when the data is large.

Below is a simple benchmark. method_b first extracts 'year', 'month', 'day', 'hour', 'minute', 'second' from the index, then converts them to strings with an f-string. Although it is written in pure Python, it takes significantly less time.

import time
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def timer(f):
    def inner(*args, **kwargs):
        s = time.time()
        result = f(*args, **kwargs)
        e = time.time()
        return e - s
    return inner


@timer
def method_a(index):
    return index.strftime("%Y-%m-%d %H:%M:%S")

@timer
def method_b(index):
    attrs = ('year', 'month', 'day', 'hour', 'minute', 'second')
    parts = [getattr(index, at) for at in attrs]
    b = []
    for year, month, day, hour, minute, second in zip(*parts):
        b.append(f'{year}-{month:02}-{day:02} {hour:02}:{minute:02}:{second:02}')
    b = pd.Index(b)
    return b

index = pd.date_range('2000', '2020', freq='1min')

@delayed
def profile(p):
    n = int(10 ** p)
    time_a = method_a(index[:n])
    time_b = method_b(index[:n])
    return n, time_a, time_b

records = Parallel(10, verbose=10)(profile(p) for p in np.arange(1, 7.1, 0.1))
pd.DataFrame(records, columns=['n', 'time_a', 'time_b']).set_index('n').plot(figsize=(10, 8))
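
The benchmark only measures time, so it is worth checking that the two approaches produce the same strings. A minimal sketch using only the standard library `datetime` (not pandas); the helper names `format_with_strftime` and `format_with_fstring` are made up for illustration:

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def format_with_strftime(values):
    # One strftime call per element: the format string is parsed every time.
    return [v.strftime(FMT) for v in values]


def format_with_fstring(values):
    # Same layout built directly from the datetime fields, as method_b does.
    return [
        f"{v.year}-{v.month:02}-{v.day:02} {v.hour:02}:{v.minute:02}:{v.second:02}"
        for v in values
    ]


# A small sample of four-digit-year datetimes, seven minutes apart.
start = datetime(2000, 1, 1)
sample = [start + timedelta(minutes=7 * i) for i in range(1000)]

same_output = format_with_strftime(sample) == format_with_fstring(sample)
```

This only covers four-digit years and naive datetimes; padding of `%Y` for smaller years can differ between platforms, so the f-string would need an explicit width there.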

image

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-63-generic
Version : #71-Ubuntu SMP Tue Jul 13 15:59:12 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.20.0
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : 0.9.3
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 7.29.0
pandas_datareader: 0.9.0
bs4 : None
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.22
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : 0.54.1

Prior Performance

No response

Activity

Labels added on Dec 5, 2021: Needs Triage, Performance
auderson (Contributor, Author) commented on Dec 12, 2021

After reading the source code, I think I know where the bottleneck comes from. Internally, strftime is called on every single element, so the format string is re-evaluated for each one. If that were done just once before the loop, performance could be much better.
I also noticed that format_array_from_datetime has a shortcut for format=None:

elif basic_format:
    dt64_to_dtstruct(val, &dts)
    res = (f'{dts.year}-{dts.month:02d}-{dts.day:02d} '
           f'{dts.hour:02d}:{dts.min:02d}:{dts.sec:02d}')
    if show_ns:
        ns = dts.ps // 1000
        res += f'.{ns + dts.us * 1000:09d}'
    elif show_us:
        res += f'.{dts.us:06d}'
    elif show_ms:
        res += f'.{dts.us // 1000:03d}'
    result[i] = res

Using this feature, method_c became the fastest:

@timer
def method_c(index):
    return index.strftime(None)

image

I suggest changing the following line:

basic_format = format is None and tz is None

to:

basic_format = (format is None or format == "%Y-%m-%d %H:%M:%S") and tz is None
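
A condition of this shape is sensitive to operator precedence: in Python, `and` binds tighter than `or`, so `a or b and c` means `a or (b and c)`. The sketch below uses plain Python stand-ins (the function names and the `DEFAULT_FORMAT` constant are hypothetical, not pandas code) to show where the two groupings disagree:

```python
DEFAULT_FORMAT = "%Y-%m-%d %H:%M:%S"


def basic_format_unparenthesized(fmt, tz):
    # Parsed as: fmt is None or (fmt == DEFAULT_FORMAT and tz is None)
    return fmt is None or fmt == DEFAULT_FORMAT and tz is None


def basic_format_parenthesized(fmt, tz):
    # The fast path is taken only when no timezone is involved.
    return (fmt is None or fmt == DEFAULT_FORMAT) and tz is None


# The two versions differ when fmt is None but a timezone is set.
differs = (
    basic_format_unparenthesized(None, "UTC"),
    basic_format_parenthesized(None, "UTC"),
)
```

With the parentheses, the existing behaviour for format=None with a timezone is preserved and the explicit default format simply joins the same fast path.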

Label added: Datetime; label removed: Needs Triage (on Dec 27, 2021)
smarie (Contributor) commented on Feb 22, 2022

I confirm that we identified the same performance issue on our side, with custom formats such as "%Y-%m-%dT%H:%M:%SZ".

It would be great to improve this in a future version! Would you like us to propose a PR? If so, some guidance would be appreciated.

jreback (Contributor) commented on Feb 22, 2022

PRs are how things are fixed

core can provide review

smarie (Contributor) commented on Feb 22, 2022

I opened a draft PR. It seems to me that we could run some kind of format string processor beforehand, in order to transform strftime patterns, e.g. %Y-%m-%d %H:%M:%S, into '{dts.year}-{dts.month:02d}-{dts.day:02d} {dts.hour:02d}:{dts.min:02d}:{dts.sec:02d}'.

I'll give it a try in the coming days.
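
To make the idea concrete, here is a rough pure-Python sketch of such a preprocessor. It is not the code from the draft PR: the names `compile_strftime` and `render`, the directive table, and the choice to return None for unsupported directives are all assumptions made for illustration. It converts a strftime pattern once into a `str.format` template, which can then be applied per element without re-parsing the pattern.

```python
import re
from datetime import datetime

# Hypothetical mapping from a few strftime directives to str.format fields.
_DIRECTIVES = {
    "%Y": "{year}",
    "%m": "{month:02d}",
    "%d": "{day:02d}",
    "%H": "{hour:02d}",
    "%M": "{minute:02d}",
    "%S": "{second:02d}",
    "%%": "%",
}

# A "%" followed by any single character.
_TOKEN = re.compile(r"%.", re.DOTALL)


def _escape(literal):
    # Literal braces must be doubled so str.format leaves them alone.
    return literal.replace("{", "{{").replace("}", "}}")


def compile_strftime(fmt):
    """Return a str.format template for fmt, or None if it is unsupported."""
    parts = []
    pos = 0
    for match in _TOKEN.finditer(fmt):
        token = match.group()
        if token not in _DIRECTIVES:
            return None  # caller should fall back to the regular strftime
        parts.append(_escape(fmt[pos:match.start()]))
        parts.append(_DIRECTIVES[token])
        pos = match.end()
    tail = fmt[pos:]
    if "%" in tail:
        return None  # dangling "%" at the end of the pattern
    parts.append(_escape(tail))
    return "".join(parts)


def render(template, dt):
    # Apply a precompiled template to one datetime.
    return template.format(
        year=dt.year, month=dt.month, day=dt.day,
        hour=dt.hour, minute=dt.minute, second=dt.second,
    )


template = compile_strftime("%Y-%m-%dT%H:%M:%SZ")
example = render(template, datetime(2021, 12, 5, 7, 8, 9))
```

Returning None for anything outside the table keeps the fast path conservative: locale-dependent directives such as %A or %b would still go through the existing strftime call.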

smarie (Contributor) commented on Mar 12, 2022

@auderson, just curious: did you try running your benchmark on Windows? It seems from my first benchmark results that it is even slower there (blue curve): #46116 (comment)

auderson (Contributor, Author) commented on Mar 13, 2022

@smarie I ran this on a Linux Jupyter notebook.

auderson (Contributor, Author) commented on Mar 13, 2022

This is my result on Windows 10: a bit faster than yours (#46116 (comment)), but still way slower than Linux.

Figure_1

EDIT

@smarie
I installed WSL on my desktop (CPU 3600X with 32GB 3000MHz dual-channel RAM) and ran it again:

image

Looks like strftime on Windows is slower than on Linux!

smarie (Contributor) commented on Mar 13, 2022

Thanks @auderson for this confirmation!

Metadata

Assignees: No one assigned
Labels: Datetime, Performance
Type: No type
Projects: No projects
Milestone: No milestone
Relationships: None yet
Participants: @jreback, @smarie, @mroeschke, @auderson