ENH: Automate reading of data in chunks #61110

Open
1 of 3 tasks
acampove opened this issue Mar 12, 2025 · 1 comment
Labels
Enhancement, Needs Info (clarification about behavior needed to assess issue)

Comments

@acampove

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I have a file with 20 GB of data that I need to process. When I use a pandas DataFrame, the full 20 GB has to be loaded into memory, which makes the computer slow or even crashes it. Could this be made more efficient by automatically loading a chunk, processing it, writing it out, loading the next chunk, and so on? It is very important that the user does not have to do anything for this to happen.

This is possible; ROOT does it, for instance.

Feature Description

This would just work with normal DataFrames. There could be an option like

pd.chunk_size = 100

which would process 100 MB at a time, so that no more than 100 MB would be in memory.
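
For reference, pandas has no `pd.chunk_size` setting. The existing `chunksize` argument of `read_csv` counts rows rather than megabytes. The following is only a rough sketch of how a memory budget could be translated into a row count, assuming CSV input; the helper `rows_for_budget`, the column name `x`, and the sample data are hypothetical.

```python
import io

import pandas as pd


def rows_for_budget(source, budget_bytes, sample_rows=100):
    """Estimate how many rows fit in `budget_bytes`, based on a small sample."""
    sample = pd.read_csv(source, nrows=sample_rows)
    if sample.empty:
        return 1
    # deep=True also counts memory held by object (e.g. string) columns.
    total_bytes = sample.memory_usage(index=False, deep=True).sum()
    bytes_per_row = total_bytes / len(sample)
    # Always return at least one row so iteration can make progress.
    return max(int(budget_bytes // bytes_per_row), 1)


# Small in-memory stand-in for a large CSV file with one integer column.
source = io.StringIO("x\n" + "\n".join(str(n) for n in range(1000)))
chunk_rows = rows_for_budget(source, budget_bytes=800)

# Rewind and read with the estimated chunk size.
source.seek(0)
chunk_lengths = [len(chunk) for chunk in pd.read_csv(source, chunksize=chunk_rows)]
```

The estimate is only as good as the sample: rows with long strings later in the file can exceed the budget, so this is a heuristic rather than a hard memory limit.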

Alternative Solutions

Alternatively, we can use ROOT's RDataFrame:

import ROOT

rdf = ROOT.RDataFrame('tree', 'path_to_file.root')

Additional Context

No response

acampove added the Enhancement and Needs Triage (issue that has not been reviewed by a pandas team member) labels on Mar 12, 2025
@snitish
Member

snitish commented Mar 24, 2025

@acampove what is the format of your data? If it is CSV, you can use the chunksize argument of read_csv. See https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking
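
A minimal sketch of the chunked pattern suggested here, assuming the data is a CSV: with `chunksize` set, `read_csv` returns an iterator of DataFrames, so only one chunk is held in memory at a time. The helper `process_in_chunks`, the column name `value`, and the in-memory sample data are hypothetical and for illustration only.

```python
import io

import pandas as pd


def process_in_chunks(source, sink, chunksize=1000):
    """Read `source` in chunks, transform each chunk, and write it to `sink`.

    Returns the total number of rows processed.
    """
    total_rows = 0
    # With chunksize set, read_csv returns an iterator of DataFrames.
    with pd.read_csv(source, chunksize=chunksize) as reader:
        for i, chunk in enumerate(reader):
            # Placeholder transformation (hypothetical): double one column.
            chunk["value"] = chunk["value"] * 2
            # Write the header only once, with the first chunk.
            chunk.to_csv(sink, header=(i == 0), index=False)
            total_rows += len(chunk)
    return total_rows


# Small in-memory stand-in for a large CSV file.
source = io.StringIO("value\n" + "\n".join(str(n) for n in range(10)))
sink = io.StringIO()
rows = process_in_chunks(source, sink, chunksize=4)

# Read the output back to check that all chunks were written.
result = pd.read_csv(io.StringIO(sink.getvalue()))
```

When writing to a file path instead of an open handle, chunks after the first would need `mode="a"` in `to_csv` so they append rather than overwrite.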

snitish added the Needs Info (clarification about behavior needed to assess issue) label and removed the Needs Triage (issue that has not been reviewed by a pandas team member) label on Mar 24, 2025