Add support for reading large files in chunks #45

Open
dirkschneemann opened this issue Dec 2, 2016 · 1 comment

Comments

@dirkschneemann

Similar to the chunksize parameter of pandas.read_csv()...

Is this planned, or is it even already possible somehow? Since paratext will most likely be used for reading large CSV files that might not fit in memory (pandas is usually already fast enough for small ones), this would be very useful in my opinion.
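For reference, this is the pandas pattern being referred to. The paratext call in the trailing comment is purely hypothetical (there is no chunksize argument in paratext today); it only illustrates what the requested API might look like.

```python
import pandas as pd

def process(df: pd.DataFrame) -> None:
    """Placeholder for per-chunk work (aggregation, filtering, writing out, ...)."""
    print(len(df))

# The existing pandas pattern: with chunksize=N, read_csv returns an
# iterator of DataFrames of up to N rows each, so the full file never
# has to fit in memory at once.
for chunk in pd.read_csv("large_file.csv", chunksize=1000000):
    process(chunk)

# A hypothetical paratext equivalent (illustration only, not an existing API):
# import paratext
# for chunk in paratext.load_csv_to_pandas("large_file.csv", chunksize=1000000):
#     process(chunk)
```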

@deads (Contributor) commented Feb 17, 2017

Thank you for your feature request. I completely agree that this would be very useful and doable. It would require changes at the C++ level.

It would involve first running the ParaText::Generic::Chunker as before, but finding num_chunks*num_threads chunks instead. We would then spawn num_threads threads over the first set of chunks, return the results back to C++, and proceed to the next batch of num_threads chunks, repeating this process num_chunks times. We will add this feature to the roadmap.
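A rough sketch of that scheduling logic, under the assumptions stated in the comments: the real implementation would live at the C++ level as described above, and find_chunks / parse_chunk here are simplified stand-ins, not paratext APIs.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def find_chunks(path, n):
    """Stand-in for ParaText::Generic::Chunker: split the file into n byte
    ranges. (The real chunker also aligns boundaries to record breaks.)"""
    size = os.path.getsize(path)
    bounds = [size * i // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))

def parse_chunk(path, start, end):
    """Stand-in for the per-chunk worker: read and return the raw bytes."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)

def read_in_batches(path, num_chunks, num_threads):
    """Find num_chunks * num_threads chunks up front, then parse num_threads
    of them in parallel per iteration, so only one batch's worth of parsed
    data is materialized at a time."""
    chunks = find_chunks(path, num_chunks * num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for i in range(num_chunks):
            batch = chunks[i * num_threads:(i + 1) * num_threads]
            yield list(pool.map(lambda b: parse_chunk(path, *b), batch))

# Example usage (hypothetical file name):
# for batch in read_in_batches("large_file.csv", num_chunks=16, num_threads=8):
#     ...  # consume one batch of parsed chunks at a time
```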

tdenniston added a commit to tdenniston/paratext that referenced this issue Dec 7, 2017
This allows for reading through larger-than-memory CSV files.