
PTDS Benchmarks #517

Open · quasiben opened this issue Feb 8, 2021 · 56 comments

Labels: 2 - In Progress, inactive-30d, inactive-90d, proposal

@quasiben (Member) commented Feb 8, 2021

With cupy/cupy#4322 merged in, we can begin profiling/benchmarking the performance of running more threads per worker on each GPU.

Setup

  • Build the latest CuPy from source: `python -m pip install .`
  • Launch workers with `CUPY_CUDA_PER_THREAD_DEFAULT_STREAM=1 dask-cuda-worker --nthreads=2`
  • Use at least 4 GPUs
  • Test `nthreads`: 1, 2, 4, 8, 32
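
For reference, a roughly equivalent setup from Python using `LocalCUDACluster` (a sketch; the `CUDA_VISIBLE_DEVICES` and `threads_per_worker` values simply mirror the parameters listed above, and the environment variable must be set before the workers import CuPy):

```python
# Sketch of the setup above using LocalCUDACluster instead of dask-cuda-worker.
# The PTDS environment variable has to be set before CuPy is imported in the
# worker processes, so it is set here before the cluster is created.
import os

os.environ["CUPY_CUDA_PER_THREAD_DEFAULT_STREAM"] = "1"

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster(
        CUDA_VISIBLE_DEVICES="0,1,2,3",  # at least 4 GPUs
        threads_per_worker=2,            # vary across 1, 2, 4, 8, 32
    )
    client = Client(cluster)
```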

Test Operations

  • sum
  • mean
  • svd
  • Matrix Multiply
  • Array Slicing
  • transpose + sum: (x + x.T).sum()

Note: @pentschev previously tested these operations on a single GPU.

Example of CuPy with Dask

```python
import cupy
import dask.array as da

rs = da.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))
x.sum().compute()
```
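
The transpose + sum operation from the list above follows the same pattern; a minimal sketch on the same array `x`:

```python
# Sketch: the transpose + sum benchmark on the Dask/CuPy array defined above.
(x + x.T).sum().compute()
```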

cc @charlesbluca

@jakirkham (Member)

We've also used adding a matrix to its transpose as a communication benchmark. It could be reasonable here to show that the communication overhead goes to zero while work is still being computed in parallel.

@quasiben (Member, Author) commented Feb 8, 2021

That's a good idea, @jakirkham. I updated the test operations.

@charlesbluca (Member)

Thanks for writing this up @quasiben!

Some questions:

  • Would this benchmark be better suited as a reproducible script, a one-off notebook with some plots to reflect the results, or both? Seems like it might be leaning towards notebook since the number of GPUs is flexible.
  • Could pybench be applicable in any way towards putting this together, or does it serve as more of an example in implementing the tests?

@quasiben (Member, Author) commented Feb 9, 2021

> Would this benchmark be better suited as a reproducible script, a one-off notebook with some plots to reflect the results, or both? Seems like it might be leaning towards notebook since the number of GPUs is flexible.

Generally, yes! Some of these operations are already built in here:
https://github.com/rapidsai/dask-cuda/blob/branch-0.18/dask_cuda/benchmarks/local_cupy.py

> Could pybench be applicable in any way towards putting this together, or does it serve as more of an example in implementing the tests?

Probably. Extending pybench seems like a reasonable way to automate running many of these operations. Many of the tests in the dask-cuda benchmark suite have nice output for a given set of parameters, but we don't have additional tooling for varying those parameters automatically. That would be very beneficial, especially if plotting were also automated!

I also think doing a bit of a dive on some of these operations with a manual setup would be useful. We could use tools like pynvtx and confirm kernels are launching on multiple threads.
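
As a rough illustration, here is a minimal sketch of annotating an operation with NVTX ranges via the `nvtx` Python package so that kernel launches can be inspected per host thread in an Nsight Systems timeline (the range name and array size here are illustrative):

```python
# Sketch: wrap a benchmarked operation in an NVTX range so it can be picked
# out per host thread in an Nsight Systems (nsys) timeline.
import cupy
import nvtx

x = cupy.random.normal(10, 1, size=(10000, 10000))

with nvtx.annotate("transpose_sum", color="green"):
    (x + x.T).sum()
    cupy.cuda.Device().synchronize()  # ensure the kernels finish inside the range
```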

@charlesbluca (Member)

After touching base with @pentschev, I think it may be easier to extend local_cupy.py with the additional test operations we want since it already takes in GPU count and nthreads as input - although I'm not too sure how to go about parametrizing the tests with pytest-benchmark.

For plotting, pybench already has some tools, although @pentschev mentioned that generating the plots in his example notebook took some effort, so I'm not sure how easy it would be to automate for wider use cases. I think a good start for now would be outputting benchmarking results as JSON that could be parsed by pybench's plotting utils, which could be done either:

  • Somewhere in the implementation of local_cupy.py
  • By pytest-benchmark using the --benchmark-json=PATH option
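
For the first option, a minimal sketch of dumping results directly from the benchmark script (field names here are illustrative, loosely mirroring the sample output later in this thread):

```python
# Sketch: write benchmark results to a JSON file so pybench's plotting
# utilities (or any other tooling) can consume them later.
import json

results = {
    "operation": "transpose_sum",
    "size": 10000,
    "chunk_size": 2500,
    "threads_per_worker": 1,
    "times": [{"wall_clock": 1.49, "npartitions": 16}],
}

with open("benchmark-results.json", "w") as f:
    json.dump(results, f, indent=2)
```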

@pentschev added the 2 - In Progress and proposal labels on Feb 10, 2021
@charlesbluca (Member)

Made a fork with some additional operations for testing and an argument to output JSON of the results:

Some more thought may need to go into device sync and warm up as in pybench, but for now I'll try this out and see if there are noticeable results.
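
For reference, a minimal sketch of the warm-up/sync pattern being referred to (time only after an untimed warm-up run, and synchronize the device before stopping the clock; the helper name is illustrative):

```python
# Sketch: run the operation once untimed to warm up (compilation, pool growth),
# then synchronize the GPU before reading the clock so all kernels are included.
from time import perf_counter

import cupy


def timed(op, *args):
    op(*args)                         # warm-up run, not timed
    cupy.cuda.Device().synchronize()
    t0 = perf_counter()
    op(*args)
    cupy.cuda.Device().synchronize()  # wait for all launched kernels
    return perf_counter() - t0
```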

@pentschev (Member)

Looks great, when you're ready, be sure to open a PR! 😄

@charlesbluca (Member)

Done! #524

From benchmarks using 4 GPUs and 1, 2, 4, and 8 threads per worker, there wasn't a remarkable difference in performance (plot of these results attached).

I'm going to scale back the number of GPUs and check out what the Nsys profile and Dask performance reports look like.

@jakirkham (Member)

Interesting. It might be worth comparing notes with Peter to make sure things are configured correctly.

A couple of questions: were you able to see multiple streams in use (cupy/cupy#4322 (comment))? Also, are you using the changes from PR cupy/cupy#4322? (Note this is not in a CuPy release yet.)

@pentschev (Member)

IIRC, I was only able to see a significant difference on transpose sum, and besides the CuPy PTDS support, I also needed rapidsai/rmm#633 and UCX. Could you try those out as well, @charlesbluca?

@charlesbluca (Member)

Sure! I'll run the benchmarks again and see how it looks.

rapids-bot pushed a commit that referenced this issue on Feb 12, 2021:
This PR adds the following operations to the local CuPy array benchmark:

- sum
- mean
- array slicing

This also adds an additional special argument, `--benchmark-json`, which takes an optional path to dump the results of the benchmark in JSON format. This would allow us to generate plots using the output, as discussed in #517.

Some thoughts:

- Should there be an additional argument to specify the array slicing interval (which is currently fixed at 3)?
- Could the JSON output be cleaned up? Currently, a (truncated) sample output file looks like:

```json
{
  "operation": "transpose_sum",
  "size": 10000,
  "second_size": 1000,
  "chunk_size": 2500,
  "compute_size": [
    10000,
    10000
  ],
  "compute_chunk_size": [
    2500,
    2500
  ],
  "ignore_size": "1.05 MB",
  "protocol": "tcp",
  "devs": "0,1,2,3",
  "threads_per_worker": 1,
  "times": [
    {
      "wall_clock": 1.4910394318867475,
      "npartitions": 16
    }
  ],
  "bandwidths": {
    "(00,01)": {
      "25%": "136.34 MB/s",
      "50%": "156.67 MB/s",
      "75%": "163.32 MB/s",
      "total_nbytes": "150.00 MB"
    }
  }
}
```

Authors:
  - Charles Blackmon-Luca (@charlesbluca)

Approvers:
  - Mads R. B. Kristensen (@madsbk)
  - Peter Andreas Entschev (@pentschev)

URL: #524
@charlesbluca (Member) commented Feb 12, 2021

It turns out the initial runs I did on TCP didn't have CUPY_CUDA_PER_THREAD_DEFAULT_STREAM=1 set - I ran both TCP and UCX tests again after fixing that and the results are significantly different:

In particular, it looks like mean, sum, and transpose sum have some noticeable speed up. Some observations:

  • It looks like most of the speed-up tends to plateau at 2-4 threads, with operations either running at the same speed or slower when bumping up to 8 threads per worker.
  • Mean on UCX runs slower at 8 threads than at 4, while it generally stays the same on TCP for this jump.
  • The same can be observed in reverse when looking at sum.
  • For some operations, TCP is faster than UCX (sum using 4 threads); not sure if this is an unexpected result.

@jakirkham (Member)

It might be worth creating larger data and/or using smaller chunks to see if this continues to hold up.

@pentschev (Member)

I forgot to mention: for UCX you should also pass `--ucx-net-devices auto` when running on a DGX-1. This will ensure each GPU utilizes the correct InfiniBand interface, which dramatically changes results for the right type of transfers. Could you also try running with that argument, @charlesbluca?

@charlesbluca (Member)

Sure! I'm running them now; I'll update the notebook once they finish.

I'll also play around with array/chunk size; any insight on how the size of the second array in SVD should scale?

Also are there any good examples of comparative bar plots that have more than one independent variable? I'm trying to think about the best way to show the impact of changing array size and thread count in one plot.

@charlesbluca (Member)

Updated the notebook; looks like the main difference with auto is that mean generally scales better, and slice runs faster regardless of thread count.

@pentschev (Member)

@charlesbluca as we discussed at our standup meeting last Friday, we also need to ensure UCX synchronizes on PTDS only, rather than on the default stream. Sorry for forgetting about this earlier, but could you try running the UCX benchmarks again with https://github.com/pentschev/distributed/tree/ptds-support?

@charlesbluca (Member)

Sure! Do I have to specify any options in the Dask configuration?

@pentschev (Member)

No, right now it will rely on your environment variables: https://github.com/pentschev/distributed/blob/c1cf54c6fc584efbc423ad0b49608817deb08d88/distributed/comm/ucx.py#L46-L49

We may change that in the future though.

@charlesbluca (Member)

Thanks for the clarification! Right now I'm trying out some different parameters for array/chunk size; for a holistic test of any single operation, I'm thinking we should try:

Array sizes (may need some smaller sizes to run for array multiplication):

  • 20,000
  • 40,000
  • 80,000

Chunk sizes:

  • 1/8 of array size (64 chunks)
  • 1/4 of array size (16 chunks)
  • 1/2 of array size (4 chunks)

Second size (SVD only)

  • 10% of array size
  • 20% of array size
  • 40% of array size

We could also play around with GPU count to see if that has a discernible impact. A sketch of the resulting parameter grid is below.
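
A minimal sketch enumerating that parameter grid (the fractions are just the chunk-size and second-size ratios listed above):

```python
# Sketch: enumerate the array-size / chunk-size combinations listed above.
from itertools import product

array_sizes = [20_000, 40_000, 80_000]
chunk_fractions = [1 / 8, 1 / 4, 1 / 2]   # 64, 16, and 4 chunks respectively
second_fractions = [0.1, 0.2, 0.4]        # second size, SVD only

for n, frac in product(array_sizes, chunk_fractions):
    chunk = int(n * frac)
    print(f"size={n}, chunk_size={chunk}, chunks={(n // chunk) ** 2}")
```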

@charlesbluca (Member)

I ended up using this testing suite for one operation (sum) and got some pretty strange results (plot attached).

In particular, we can see that the results aren't consistent with what was observed before with array/chunk size 40,000/10,000 - not sure if this is related to running all of these tests in short succession. For reference, the Dask profiles for this operation with 1 and 2 threads per worker are attached (2 threads should have been faster but weren't).

@charlesbluca (Member)

It looks like the results from before had something to do with generating the Dask performance report - removing that argument gave me results similar to the earliest benchmarks:

Playing around with array/chunk size, some interesting observations:

  • The overall array size seems to have an impact on what thread count is optimal - this can be seen when comparing the 20000/5000, 40000/10000, and 80000/20000 runs.
  • The smallest chunk size (64 chunks) had the best scaling performance, seeing improvements up to 16-32 threads per worker in all cases.
  • The largest chunk size (4 chunks) saw no improvement from increased thread count.

I'm interested in whether the smaller chunk size / larger thread count would continue scaling, although it might be impractical to have more than 32 threads per worker.

@charlesbluca (Member)

I think the different results when generating performance reports could have something to do with device synchronization not being done:

```python
# Execute the operations to benchmark
if args.profile is not None:
    async with performance_report(filename=args.profile):
        t1 = clock()
        await client.compute(func(*func_args))
        took = clock() - t1
else:
    t1 = clock()
    res = client.compute(func(*func_args))
    await client.gather(res)
    if args.type == "gpu":
        await client.run(lambda xp: xp.cuda.Device().synchronize(), xp)
    took = clock() - t1
```

Is there any reason device sync isn't being done when we opt for a performance report?
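
For reference, a sketch of how the same synchronization could be added to the profiling branch of the snippet above, mirroring the existing non-profiling path (just one possible fix, relying on the same surrounding script):

```python
# Sketch: synchronize devices before stopping the clock in the profiling
# branch as well, mirroring the non-profiling path.
if args.profile is not None:
    async with performance_report(filename=args.profile):
        t1 = clock()
        await client.compute(func(*func_args))
        if args.type == "gpu":
            await client.run(lambda xp: xp.cuda.Device().synchronize(), xp)
        took = clock() - t1
```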

@jakirkham (Member)

I don't think so. Sounds like a bug (thanks for finding that btw 😄). Would you like to submit a PR?

@pentschev (Member)

As John mentioned, this is indeed a bug. Could you submit a PR for that when you have a chance?

@pentschev (Member) commented Mar 9, 2021

We can do both. Since this is using CuPy and it has its own memory pool, it should work fine either way.

> We actually replace CuPy's memory pool for these benchmarks:
>
> `cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)`
>
> Is this what we want for the benchmarks?

Using RMM's default is a good balance, so I'd say yes. You can specify `--rmm-pool-size` if you want a different size when running.

EDIT: If you're asking about these benchmarks in particular, you can specify 90-95% of the GPU's total memory to prevent any further allocations.
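
For context, a minimal sketch of what the memory-pool setup amounts to on each worker (the 30 GiB figure is only an illustrative ~90-95% of a 32 GB GPU):

```python
# Sketch: create an RMM memory pool and route CuPy allocations through it,
# which is what the benchmark's memory-pool setup effectively does per worker.
import cupy
import rmm

rmm.reinitialize(pool_allocator=True, initial_pool_size=30 * 2**30)  # ~30 GiB
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)
```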

@charlesbluca (Member)

What I meant to say is that, in the benchmark code itself, it looks like `--rmm-pool-size` is not actually being used when supplied - I think that if it were, `args.rmm_pool_size` would be passed to `client.run(setup_memory_pool, ...)`.

@jakirkham (Member)

Sounds like a bug. Would you like to do a PR to fix that?

@charlesbluca (Member)

Sure! Is it okay if I roll this into a larger PR with the additional operations + NVTX traces?

@pentschev (Member)

Sure, I don't think there's much rush for this right now. And thanks for catching that, @charlesbluca!

@charlesbluca (Member)

No problem! Opened up PR #548.

@charlesbluca (Member)

Ran some benchmarks with column sum and gather; in general, it looks like performance is better with a larger chunk size, though increased thread counts don't seem to have a consistent impact (plots attached).

For sum, there is a spike in performance at the largest chunk size.

This is surprising to me, as it seems from the Nsys profile that there aren't too many kernel calls happening here, mostly UCX communication.

@jakirkham (Member) commented Mar 15, 2021

It might be worth trying this benchmark without a mask to see how it compares

@charlesbluca (Member)

Sure, what would the code look like running this without a mask?

@jakirkham (Member)

Sorry, scratch that.

What are the chunk sizes here? How many chunks are we using?

@charlesbluca (Member)

Chunk sizes are 31250, 62500, 125000, and 250000, so 16, 8, 4, and 2 chunks on the sliced array and 32, 16, 8, 4 chunks on the index array, respectively.

@charlesbluca (Member) commented Mar 16, 2021

Thanks for #553, @jakirkham!

Column masking behaves similarly to column sum (plots attached).

The main difference in the profiles (125000 vs. 250000) is that there's a lot more space between the kernel calls and the UCX communication.

It also looks like there are some PtoP copies happening only for the larger chunk size? I've highlighted where that's happening.

@pentschev (Member)

> It also looks like there are some PtoP copies happening only for the larger chunk size? I've highlighted where that's happening.

As I wrote in #536 (comment), this is probably due to `UCX_RNDV_THRESH`.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@quasiben (Member, Author) commented Aug 2, 2021

Still active

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
