
Question: What is the best way to load large data set? #275

Closed
zhmiao opened this issue Jan 24, 2017 · 2 comments

Comments


zhmiao commented Jan 24, 2017

Hi, I have some problems loading large datasets. The census example uses pandas to read .h5 files, or dask for castra. Since castra is no longer maintained, I used pandas, but it is really slow, so I ended up calling df = dd.from_pandas() to make a dask dataframe from the result. I also tried dd.read_hdf(); there the loading step is very fast, but only because the data stays on disk rather than being read into memory. Plotting from that dataframe can then be really slow, especially for interactive plots: every time I zoom, it re-reads the disk. Is there a better way to load large datasets? And when a dataset is too large for memory, what can I do to make the plotting faster?

jbednar (Member) commented Jan 24, 2017

Various file formats are discussed and benchmarked in #129. Personally, I recommend fastparquet, at least if you are using Python 3.

The osm.ipynb example in datashader shows how to set up dask to work out of core, when the data is too large for memory. It shouldn't be very difficult to get good performance, but you will probably have to study the options provided by fastparquet and dask (partition sizes, caching options, etc.) and experiment with them.

@jbednar jbednar closed this as completed Jan 24, 2017
zhmiao (Author) commented Jan 25, 2017

Thank you very much
