
Question: What is the best way to load large data set? #275

Closed
zhmiao opened this issue Jan 24, 2017 · 2 comments

Comments


zhmiao commented Jan 24, 2017

Hi, I have some problems loading large datasets. The census example uses pandas to read .h5 files, or dask for castra. Since castra is no longer maintained, I used pandas, but it is really slow, so I ended up calling df = dd.from_pandas() to make a dask dataframe from the result. I also tried dd.read_hdf(); there the loading step is very fast, but only because the data stays on disk rather than being read into memory. Plotting from that dataframe can then be really slow, especially for interactive plots: every time I zoom, it re-reads the disk. Is there a better way to load large datasets? And when a dataset is too large for memory, what can I do to make the plotting faster?

jbednar (Member) commented Jan 24, 2017

Various file formats are discussed and benchmarked in #129. Personally, I recommend fastparquet, at least if you are using Python 3.

The osm.ipynb example in datashader shows how to set up dask to work out of core, when the data is too large for memory. It shouldn't be very difficult to get good performance, but you will probably have to study the options provided by fastparquet and dask (partition sizes, caching options, etc.) and experiment with them.

@jbednar jbednar closed this as completed Jan 24, 2017
zhmiao (Author) commented Jan 25, 2017

Thank you very much
