Description
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the master branch of pandas.
Reproducible Example
I found that pd.DatetimeIndex.strftime is quite slow when the data is large.
Below is a simple benchmark. method_b first extracts 'year', 'month', 'day', 'hour', 'minute' and 'second', then converts them to strings with an f-string. Although it is written in pure Python, it takes significantly less time.
import time

import numpy as np
import pandas as pd
from joblib import Parallel, delayed


def timer(f):
    def inner(*args, **kwargs):
        s = time.time()
        result = f(*args, **kwargs)
        e = time.time()
        return e - s
    return inner


@timer
def method_a(index):
    return index.strftime("%Y-%m-%d %H:%M:%S")


@timer
def method_b(index):
    attrs = ('year', 'month', 'day', 'hour', 'minute', 'second')
    parts = [getattr(index, at) for at in attrs]
    b = []
    for year, month, day, hour, minute, second in zip(*parts):
        b.append(f'{year}-{month:02}-{day:02} {hour:02}:{minute:02}:{second:02}')
    b = pd.Index(b)
    return b


index = pd.date_range('2000', '2020', freq='1min')


@delayed
def profile(p):
    n = int(10 ** p)
    time_a = method_a(index[:n])
    time_b = method_b(index[:n])
    return n, time_a, time_b


records = Parallel(10, verbose=10)(profile(p) for p in np.arange(1, 7.1, 0.1))
pd.DataFrame(records, columns=['n', 'time_a', 'time_b']).set_index('n').plot(figsize=(10, 8))
Installed Versions
INSTALLED VERSIONS
commit : 945c9ed
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-63-generic
Version : #71-Ubuntu SMP Tue Jul 13 15:59:12 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.4
numpy : 1.20.0
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : 0.9.3
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 7.29.0
pandas_datareader: 0.9.0
bs4 : None
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.22
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : 0.54.1
Prior Performance
No response
Activity
auderson commented on Dec 12, 2021
After reading the source code, I think I know where the bottleneck comes from. Internally, strftime is called on every single element, meaning that the format string is evaluated again for each one. If that could be done just once before the loop, performance would be much better.
I also noticed that format_array_from_datetime has a shortcut for format=None:
pandas/pandas/_libs/tslib.pyx
Lines 152 to 166 in 193ca73
Using this shortcut, method_c became the fastest:
I suggest changing the following line to
pandas/pandas/_libs/tslib.pyx
Line 134 in 193ca73
smarie commented on Feb 22, 2022
I confirm that we identified the same performance issue on our side, with custom formats such as "%Y-%m-%dT%H:%M:%SZ".
It would be great to improve this in a future version! Would you like us to propose a PR? If so, some guidance would be appreciated.
jreback commented on Feb 22, 2022
PRs are how things are fixed
core can provide review
smarie commented on Feb 22, 2022
I opened a draft PR. It seems to me that we could run some kind of format string processor beforehand, in order to transform strftime patterns such as
%Y-%m-%d %H:%M:%S
into
'{dts.year}-{dts.month:02d}-{dts.day:02d} {dts.hour:02d}:{dts.min:02d}:{dts.sec:02d}'.
I'll give it a try in the coming days.
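To make the idea above concrete, here is a minimal pure-Python sketch of such a format string processor. It is not the code from the draft PR: the helper name `compile_strftime`, the `_DIRECTIVES` table and the `dt` field name are assumptions made for illustration, and it works on standard-library `datetime` objects rather than on the `dts` struct used inside pandas. It only covers six numeric directives and raises on anything else, so a caller could fall back to the regular strftime path.

```python
import re
from datetime import datetime

# Hypothetical mapping from a small subset of strftime directives to
# str.format fields. Locale-dependent directives (%a, %b, %p, ...) are
# deliberately left out of this sketch.
_DIRECTIVES = {
    "%Y": "{dt.year}",
    "%m": "{dt.month:02d}",
    "%d": "{dt.day:02d}",
    "%H": "{dt.hour:02d}",
    "%M": "{dt.minute:02d}",
    "%S": "{dt.second:02d}",
}


def compile_strftime(fmt):
    """Translate `fmt` into a str.format template once, before any loop.

    Raises ValueError on an unsupported directive so that the caller can
    fall back to the regular strftime path.
    """
    def convert(match):
        token = match.group(0)
        if token == "%%":
            return "%"  # escaped percent sign
        if token.startswith("%"):
            if token not in _DIRECTIVES:
                raise ValueError(f"unsupported directive: {token}")
            return _DIRECTIVES[token]
        # Literal braces must be doubled to survive str.format.
        return token * 2

    return re.sub(r"%.|[{}]", convert, fmt)


# The format string is processed once; the loop only fills in fields.
template = compile_strftime("%Y-%m-%d %H:%M:%S")
stamps = [datetime(2000, 1, 1, 0, i) for i in range(3)]
formatted = [template.format(dt=dt) for dt in stamps]
```

For years with four digits the output should match what `strftime` produces with the same pattern. Years below 1000 are not padded here, which may differ from some platforms' `%Y` behaviour.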
smarie commented on Mar 12, 2022
@auderson, just curious: did you try running your benchmark on Windows? My first benchmark results suggest that it is even slower there (blue curve): #46116 (comment)
auderson commented on Mar 13, 2022
@smarie I ran this on a Linux Jupyter notebook.
auderson commented on Mar 13, 2022
This is my result on Windows 10: a bit faster than yours in #46116 (comment), but still much slower than on Linux.
EDIT
@smarie
I installed WSL on my desktop (3600X CPU with 32 GB of 3000 MHz dual-channel RAM) and ran it again:
It looks like strftime on Windows is slower than on Linux!
smarie commented on Mar 13, 2022
Thanks @auderson for this confirmation!
Performance improvement in :class:`BusinessHour`, ``repr`` is now 4 t…