General performance optimizations #313

Closed
gbrener opened this Issue Apr 19, 2017 · 8 comments

@gbrener
Contributor

gbrener commented Apr 19, 2017

Tasks

For the past couple of weeks I've been investigating datashader's performance and how we can improve it. I'm documenting my remaining tasks here in case I get pulled away onto a different project. Below is a list of the tasks/issues I'm currently addressing:

  • Extend the filetimes.py and filetimes.yml benchmarking environment to find the optimal file format for datashader/dask (Issue #129)
  • Benchmark numba compared to handwritten ufuncs in vaex (Issue #310)
  • Gather perf information about dask locking behavior (Issue #314)
  • Investigate why Cachey leads to better runtime performance for repeat datashader aggregations
  • Document memory usage findings (Issue #305)
  • Investigate how datashader's performance changes with data types (doubles vs floats, etc) (Issue #305)
  • Verify that repeat aggregations no longer depend on file format (Issue #129)
  • Investigate distributed scheduler vs threaded scheduler for single-machine use case (#331, #332, #334)
  • Identify issues hindering distributed scheduler from performing more effectively - credit goes to @martindurant ( #332, #336, #337 )

Performance takeaways

Below are some performance-related takeaways that fell out of my experiments and optimizations with datashader and dask:

General

  • Use the latest version of numba (>=0.33). It includes bug fixes that provide ~3-5x speedups in many cases (numba/numba#2345, numba/numba#2349, numba/numba#2350)

  • When interacting with data on the filesystem, store it in the Apache Parquet format when possible. Use Snappy compression when writing out Parquet files, and convert columns to categorical dtypes beforehand (when possible), since Parquet supports categoricals in its binary format (#129)
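
    A minimal sketch of the above (assumes a Parquet backend such as fastparquet or pyarrow is installed; the column and file names are hypothetical):

    import multiprocessing
    from dask import dataframe as dd

    # Convert a low-cardinality column to a categorical dtype before writing
    df['category_col'] = df['category_col'].astype('category')

    # Write snappy-compressed Parquet from a dask dataframe
    dask_df = dd.from_pandas(df, npartitions=multiprocessing.cpu_count())
    dask_df.to_parquet('data.parq', compression='snappy')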

  • Use the categorical dtype for columns with data that takes on a limited, fixed number of possible values. Categorical columns use a more memory-efficient data representation and are optimized for common operations such as sorting and finding uniques. Example of how to convert a column to the categorical dtype:

    df[colname] = df[colname].astype('category')
    
  • Using single-precision floats (np.float32) instead of double-precision floats (np.float64) shows promise for improving datashader's performance even further. In past experiments this cut down both the time to load data off of disk (assuming the data was written out in single precision) and datashader's aggregation times. Take care with this approach, as single precision (in any software application, not just datashader) produces different numerical results than double precision (#305)
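
    A minimal sketch of the above (the column name is hypothetical; verify that the reduced precision is acceptable for your data):

    import numpy as np

    # Downcast a double-precision column to single precision before writing/aggregating
    df['value_col'] = df['value_col'].astype(np.float32)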

  • When using pandas dataframes, there will be a speedup if you cache the cvs.x_range and cvs.y_range variables and pass them back into the Canvas() constructor during future instantiations. As of #344, dask dataframes automatically memoize the x_range and y_range calculations; this works for dask because dask dataframes are immutable, unlike pandas dataframes (#129)
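
    A sketch of the caching described above (plot dimensions and column names are hypothetical; assumes cvs.x_range and cvs.y_range hold the computed ranges after the first aggregation, as noted above):

    import datashader

    # First aggregation: ranges are computed from the data
    cvs = datashader.Canvas(plot_width=900, plot_height=525)
    agg = cvs.points(df, 'x', 'y')
    x_range, y_range = cvs.x_range, cvs.y_range  # cache the computed ranges

    # Later aggregations: reuse the cached ranges to skip the range computation
    cvs = datashader.Canvas(plot_width=900, plot_height=525,
                            x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'x', 'y')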

Single machine

  • A rule of thumb for the number of partitions to use when converting pandas dataframes into dask dataframes is multiprocessing.cpu_count(). This allows dask to use one thread per core for parallelizing computations (#129)

  • When the entire dataset fits into memory at once, persist the dataframe as a Dask dataframe prior to passing it into datashader (#129). One example of how to do this:

    from dask import dataframe as dd
    import multiprocessing
    import datashader

    # Persist the dask dataframe into memory before aggregating
    dask_df = dd.from_pandas(df, npartitions=multiprocessing.cpu_count()).persist()
    ...
    cvs = datashader.Canvas(...)
    agg = cvs.points(dask_df, ...)
    
  • When the entire dataset doesn't fit into memory at once, use the distributed scheduler (#331) without persisting (there is an outstanding issue #332 that illustrates the problem with the distributed scheduler + persist). Example:

    from dask import dataframe as dd
    from dask import distributed
    import multiprocessing
    import datashader

    # One single-threaded worker per core on the local machine
    cluster = distributed.LocalCluster(n_workers=multiprocessing.cpu_count(), threads_per_worker=1)
    dask_client = distributed.Client(cluster)
    dask_df = dd.from_pandas(df, npartitions=multiprocessing.cpu_count())  # Note: no "persist"
    ...
    cvs = datashader.Canvas(...)
    agg = cvs.points(dask_df, ...)
    

Multiple machines

  • Use the distributed scheduler to farm computations out to remote machines. client.persist(dask_df) may help in certain cases, but be sure to include distributed.wait() to block until the data is read into RAM on each worker (#332)
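
    Example (a sketch; the scheduler address is hypothetical, and dask_df is a dask dataframe as in the examples above):

    from dask import distributed

    # Connect to an already-running distributed scheduler
    dask_client = distributed.Client('tcp://scheduler-host:8786')

    # Persist the data onto the workers, then block until it is loaded into RAM
    dask_df = dask_client.persist(dask_df)
    distributed.wait(dask_df)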

@gbrener gbrener self-assigned this Apr 19, 2017

@gbrener

Contributor Author

gbrener commented Apr 20, 2017

Benchmarking with numba exposed a bug affecting performance under multithreaded workloads. Once it is fixed, there should be a significant performance increase for datashader (at least 3x in many cases): numba/numba#2345

@gbrener

Contributor Author

gbrener commented Apr 24, 2017

More performance improvements to come once this numba PR is merged: numba/numba#2349

@gbrener

Contributor Author

gbrener commented Apr 25, 2017

There's one outstanding numba issue that, when resolved, should unlock additional performance for datashader (particularly for sum and mean aggregations): numba/numba#2350. Thanks again for the outstanding work so far, @skram.

@gbrener

Contributor Author

gbrener commented Apr 25, 2017

Preliminary memory usage measurements and calculations were added to Issue #305.

gbrener added a commit that referenced this issue Apr 25, 2017

Add latest numba build, dbg info for memory usage
The recent numba development build includes a few performance fixes, so we want to be using it for benchmarking. Also, add some debug information related to memory usage, per issue #313
@jbednar

Collaborator

jbednar commented Apr 26, 2017

os.cpu_count()

That function didn't appear until Python 3.4, so we should probably recommend multiprocessing.cpu_count() instead, which is available on 2.7 as well. On my machine that function returns 8, same as os.cpu_count(); I'm not sure whether 4 or 8 is the better answer for this particular machine (i.e., with or without hyperthreading)...

@jbcrail jbcrail added this to the 0.5.0 milestone Apr 26, 2017

@jbednar jbednar referenced this issue Apr 26, 2017

Closed

Punch list for 0.5.0 #325

13 of 14 tasks complete
@gbrener

Contributor Author

gbrener commented May 1, 2017

Thanks @jbednar, I've updated the filetimes.py code and the suggestions at the top of the thread. Based on my experiments, I've seen an improvement with 8 instead of 4, which suggests that hyperthreading helps for our benchmarks.

@jbednar

Collaborator

jbednar commented May 1, 2017

Ok, great; that keeps it simple!

@jbednar

Collaborator

jbednar commented May 8, 2017

Thanks for all the great work, @gbrener! Reflecting these recommendations into our documentation is now on our to-do list.

@jbednar jbednar closed this May 8, 2017

@jbednar jbednar removed the ready label May 8, 2017

jbcrail added a commit to jbcrail/datashader that referenced this issue May 11, 2017

Add performance documentation
This is based on @gbrener's writeup on pyviz#129 and pyviz#313.

Related to pyviz#325