PERF: Optimize membership check in column filtering for better performance #61045

Open
3 tasks done
allrob23 opened this issue Mar 4, 2025 · 0 comments · May be fixed by #61046
Labels
IO CSV (read_csv, to_csv), Performance (memory or execution speed performance)


allrob23 commented Mar 4, 2025

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Description

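Currently, the columns variable is a list of hashable elements returned by _filter_usecols. The dictionary comprehension at pandas/pandas/io/parsers/c_parser_wrapper.py#L262 tests every key of col_dict against that list, so each "k in columns" check is a linear scan of the list: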

col_dict = {k: v for k, v in col_dict.items() if k in columns}

Proposed Improvement

Convert columns to a set once, before the comprehension, so that each membership check takes O(1) time on average instead of O(n):

columns_set = set(columns)  # Convert once
col_dict = {k: v for k, v in col_dict.items() if k in columns_set}

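This avoids traversing the list once per key. With n keys in col_dict and m entries in columns, the filtering cost drops from O(n * m) to roughly O(n + m), and the result is unchanged because the elements are already hashable.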
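As a minimal sketch of why the change is behavior-preserving, the two variants can be compared on a small input. The function names filter_with_list and filter_with_set and the sample data are hypothetical, made up for illustration; they are not pandas internals.

```python
def filter_with_list(col_dict, columns):
    # Current form: each "k in columns" scans the list.
    return {k: v for k, v in col_dict.items() if k in columns}


def filter_with_set(col_dict, columns):
    # Proposed form: build the set once, then do hash lookups.
    columns_set = set(columns)
    return {k: v for k, v in col_dict.items() if k in columns_set}


# Hypothetical sample data standing in for parsed columns.
col_dict = {0: "a", 1: "b", 2: "c", 3: "d"}
columns = [3, 1]

result_list = filter_with_list(col_dict, columns)
result_set = filter_with_set(col_dict, columns)
print(result_list == result_set)
```

The output order follows the iteration order of col_dict, not of columns, so turning columns into an unordered set does not change the order of the filtered dictionary.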

Expected Benefits

  • Faster execution when columns contains many elements.
  • Improved efficiency in scenarios with frequent membership checks.
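A rough way to check the expected benefit is a micro-benchmark with timeit. This is only a sketch under assumptions: the size n = 2000, the synthetic integer keys, and the helper names with_list and with_set are all made up for illustration, and the measured times will vary by machine. It does not measure read_csv itself.

```python
import timeit

n = 2000  # assumed number of columns, chosen only for illustration
col_dict = {i: None for i in range(n)}
columns = list(range(n))


def with_list():
    # Membership test against a list: linear scan per key.
    return {k: v for k, v in col_dict.items() if k in columns}


def with_set():
    # Membership test against a set built once: hash lookup per key.
    columns_set = set(columns)
    return {k: v for k, v in col_dict.items() if k in columns_set}


list_time = timeit.timeit(with_list, number=5)
set_time = timeit.timeit(with_set, number=5)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

For a small number of columns the difference is likely negligible, since building the set has its own cost; the gap should widen as columns grows.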

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.10.8
python-bits : 64
OS : Linux
OS-release : 6.5.0-1025-azure
Version : #26~22.04.1-Ubuntu SMP Thu Jul 11 22:33:04 UTC 2024
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 1.26.4
pytz : 2025.1
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : 3.0.12
sphinx : 8.1.3
IPython : 8.33.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.3
blosc : None
bottleneck : 1.4.2
dataframe-api-compat : None
fastparquet : 2024.11.0
fsspec : 2025.2.0
html5lib : 1.1
hypothesis : 6.127.5
gcsfs : 2025.2.0
jinja2 : 3.1.5
lxml.etree : 5.3.1
matplotlib : 3.10.1
numba : 0.61.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : 2.9.10
pymysql : 1.4.6
pyarrow : 19.0.1
pyreadstat : 1.2.8
pytest : 8.3.5
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2025.2.0
scipy : 1.15.2
sqlalchemy : 2.0.38
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.9.0
xlrd : 2.0.1
xlsxwriter : 3.2.2
zstandard : 0.23.0
tzdata : 2025.1
qtpy : None
pyqt5 : None

Prior Performance

No response

allrob23 added the Needs Triage and Performance labels Mar 4, 2025
allrob23 linked a pull request Mar 4, 2025 that will close this issue
rhshadrach added the IO CSV label and removed the Needs Triage label Mar 4, 2025