Add support for reading large files in chunks #45

Open
dirkschneemann opened this issue Dec 2, 2016 · 1 comment

Comments

@dirkschneemann

Similar to the chunksize parameter of pandas.read_csv()...

Is this planned, or is it even already possible somehow? Since paratext will most likely be used for reading large CSV files that might not fit in memory (pandas is usually already fast enough for small ones), this would be very useful in my opinion.
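For reference, this is the pandas pattern being referred to. The paratext call in the trailing comment is purely hypothetical (there is no chunksize argument in paratext today); it only illustrates what the requested API might look like.

```python
import pandas as pd

def process(df: pd.DataFrame) -> None:
    """Placeholder for per-chunk work (aggregation, filtering, writing out, ...)."""
    print(len(df))

# The existing pandas pattern: with chunksize=N, read_csv returns an
# iterator of DataFrames of up to N rows each, so the full file never
# has to fit in memory at once.
for chunk in pd.read_csv("large_file.csv", chunksize=1000000):
    process(chunk)

# A hypothetical paratext equivalent (illustration only, not an existing API):
# import paratext
# for chunk in paratext.load_csv_to_pandas("large_file.csv", chunksize=1000000):
#     process(chunk)
```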

@deads (Contributor) commented Feb 17, 2017

Thank you for your feature request. I completely agree that this would be very useful and doable. It would require changes at the C++ level.

It would involve first running the ParaText::Generic::Chunker as before, but finding num_chunks*num_threads chunks instead. We would then spawn num_threads threads over the first set of chunks, return the results back to C++, and proceed to the next batch of num_threads chunks, repeating this process num_chunks times. We will add this feature to the roadmap.
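A rough sketch of that scheduling logic, under the assumptions stated in the comments: the real implementation would live at the C++ level as described above, and find_chunks / parse_chunk here are simplified stand-ins, not paratext APIs.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def find_chunks(path, n):
    """Stand-in for ParaText::Generic::Chunker: split the file into n byte
    ranges. (The real chunker also aligns boundaries to record breaks.)"""
    size = os.path.getsize(path)
    bounds = [size * i // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))

def parse_chunk(path, start, end):
    """Stand-in for the per-chunk worker: read and return the raw bytes."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)

def read_in_batches(path, num_chunks, num_threads):
    """Find num_chunks * num_threads chunks up front, then parse num_threads
    of them in parallel per iteration, so only one batch's worth of parsed
    data is materialized at a time."""
    chunks = find_chunks(path, num_chunks * num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for i in range(num_chunks):
            batch = chunks[i * num_threads:(i + 1) * num_threads]
            yield list(pool.map(lambda b: parse_chunk(path, *b), batch))

# Example usage (hypothetical file name):
# for batch in read_in_batches("large_file.csv", num_chunks=16, num_threads=8):
#     ...  # consume one batch of parsed chunks at a time
```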

tdenniston added a commit to tdenniston/paratext that referenced this issue Dec 7, 2017
This allows for reading through larger-than-memory CSV files.