
PTDS Benchmarks #517

Open · quasiben opened this issue Feb 8, 2021 · 56 comments

Labels: 2 - In Progress, inactive-30d, inactive-90d, proposal

@quasiben (Member) commented Feb 8, 2021

With cupy/cupy#4322 merged in, we can begin profiling/benchmarking the performance of running more threads per worker on each GPU.

Setup

  • Build the latest CuPy from source: `python -m pip install .`
  • Launch workers with `CUPY_CUDA_PER_THREAD_DEFAULT_STREAM=1 dask-cuda-worker --nthreads=2`
  • Use at least 4 GPUs
  • Test `nthreads`: 1, 2, 4, 8, 32
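
For reference, a roughly equivalent setup from Python using `LocalCUDACluster` (a sketch; the `CUDA_VISIBLE_DEVICES` and `threads_per_worker` values simply mirror the parameters listed above, and the environment variable must be set before the workers import CuPy):

```python
# Sketch of the setup above using LocalCUDACluster instead of dask-cuda-worker.
# The PTDS environment variable has to be set before CuPy is imported in the
# worker processes, so it is set here before the cluster is created.
import os

os.environ["CUPY_CUDA_PER_THREAD_DEFAULT_STREAM"] = "1"

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster(
        CUDA_VISIBLE_DEVICES="0,1,2,3",  # at least 4 GPUs
        threads_per_worker=2,            # vary across 1, 2, 4, 8, 32
    )
    client = Client(cluster)
```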

Test Operations

  • sum
  • mean
  • svd
  • Matrix Multiply
  • Array Slicing
  • transpose + sum: (x + x.T).sum()

Note: @pentschev previously tested these operations on a single GPU.

Example of CuPy with Dask

```python
import cupy
import dask.array as da

rs = da.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))
x.sum().compute()
```
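
The transpose + sum operation from the list above follows the same pattern; a minimal sketch on the same array `x`:

```python
# Sketch: the transpose + sum benchmark on the Dask/CuPy array defined above.
(x + x.T).sum().compute()
```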

cc @charlesbluca

@jakirkham (Member)

We've also used adding a matrix to its transpose as a communication benchmark. It could be reasonable here to show that the communication overhead goes to zero while work is still being computed in parallel.

@quasiben (Member, Author) commented Feb 8, 2021

That's a good idea, @jakirkham. I updated the test operations.

@charlesbluca (Member)

Thanks for writing this up @quasiben!

Some questions:

  • Would this benchmark be better suited as a reproducible script, a one-off notebook with some plots to reflect the results, or both? Seems like it might be leaning towards notebook since the number of GPUs is flexible.
  • Could pybench be applicable in any way towards putting this together, or does it serve as more of an example in implementing the tests?

@quasiben (Member, Author) commented Feb 9, 2021

> Would this benchmark be better suited as a reproducible script, a one-off notebook with some plots to reflect the results, or both? Seems like it might be leaning towards notebook since the number of GPUs is flexible.

Generally, yes! Some of these operations are already built in here:
https://github.com/rapidsai/dask-cuda/blob/branch-0.18/dask_cuda/benchmarks/local_cupy.py

> Could pybench be applicable in any way towards putting this together, or does it serve as more of an example in implementing the tests?

Probably. Extending pybench seems like a reasonable way to automate running many of these operations. Many of the tests in the dask-cuda benchmark suite have nice output for a given set of parameters, but we don't have additional tooling for varying those parameters automatically. That would be very beneficial, especially if plotting were also automated!

I also think doing a bit of a dive on some of these operations with a manual setup would be useful. We could use tools like pynvtx and confirm kernels are launching on multiple threads.
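
As a rough illustration, here is a minimal sketch of annotating an operation with NVTX ranges via the `nvtx` Python package so that kernel launches can be inspected per host thread in an Nsight Systems timeline (the range name and array size here are illustrative):

```python
# Sketch: wrap a benchmarked operation in an NVTX range so it can be picked
# out per host thread in an Nsight Systems (nsys) timeline.
import cupy
import nvtx

x = cupy.random.normal(10, 1, size=(10000, 10000))

with nvtx.annotate("transpose_sum", color="green"):
    (x + x.T).sum()
    cupy.cuda.Device().synchronize()  # ensure the kernels finish inside the range
```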

@charlesbluca (Member)

After touching base with @pentschev, I think it may be easier to extend local_cupy.py with the additional test operations we want since it already takes in GPU count and nthreads as input - although I'm not too sure how to go about parametrizing the tests with pytest-benchmark.

For plotting, pybench already has some tools, although @pentschev mentioned that generating the plots in his example notebook took some effort, so I'm not sure how easy it would be to automate for wider use cases. I think a good start for now would be outputting benchmarking results as JSON that could be parsed by pybench's plotting utils, which could be done either:

  • Somewhere in the implementation of local_cupy.py
  • By pytest-benchmark using the --benchmark-json=PATH option
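
For the first option, a minimal sketch of dumping results directly from the benchmark script (field names here are illustrative, loosely mirroring the sample output later in this thread):

```python
# Sketch: write benchmark results to a JSON file so pybench's plotting
# utilities (or any other tooling) can consume them later.
import json

results = {
    "operation": "transpose_sum",
    "size": 10000,
    "chunk_size": 2500,
    "threads_per_worker": 1,
    "times": [{"wall_clock": 1.49, "npartitions": 16}],
}

with open("benchmark-results.json", "w") as f:
    json.dump(results, f, indent=2)
```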

@pentschev added the 2 - In Progress and proposal labels on Feb 10, 2021
@charlesbluca (Member)

Made a fork with some additional operations for testing and an argument to output JSON of the results:

Some more thought may need to go into device sync and warm up as in pybench, but for now I'll try this out and see if there are noticeable results.
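
For reference, a minimal sketch of the warm-up/sync pattern being referred to (time only after an untimed warm-up run, and synchronize the device before stopping the clock; the helper name is illustrative):

```python
# Sketch: run the operation once untimed to warm up (compilation, pool growth),
# then synchronize the GPU before reading the clock so all kernels are included.
from time import perf_counter

import cupy


def timed(op, *args):
    op(*args)                         # warm-up run, not timed
    cupy.cuda.Device().synchronize()
    t0 = perf_counter()
    op(*args)
    cupy.cuda.Device().synchronize()  # wait for all launched kernels
    return perf_counter() - t0
```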

@pentschev (Member)

Looks great, when you're ready, be sure to open a PR! 😄

@charlesbluca (Member)

Done! #524

From benchmarks using 4 GPUs and 1, 2, 4, and 8 threads per worker, there wasn't a remarkable difference in performance (plot of these results attached).

I'm going to scale back the number of GPUs and check out what the Nsys profile and Dask performance reports look like.

@jakirkham (Member)

Interesting. It might be worth comparing notes with Peter to make sure things are configured correctly.

A couple of questions: were you able to see multiple streams in use (cupy/cupy#4322 (comment))? Also, are you using the changes from PR cupy/cupy#4322? (Note this is not in a CuPy release yet.)

@pentschev (Member)

IIRC, I was only able to see a significant difference on transpose sum, and besides the CuPy PTDS support, I also needed rapidsai/rmm#633 and UCX. Could you try those out as well, @charlesbluca?

@charlesbluca (Member)

Sure! I'll run the benchmarks again and see how it looks.

rapids-bot pushed a commit that referenced this issue on Feb 12, 2021:
This PR adds the following operations to the local CuPy array benchmark:

- sum
- mean
- array slicing

This also adds an additional special argument, `--benchmark-json`, which takes an optional path to dump the results of the benchmark in JSON format. This would allow us to generate plots using the output, as discussed in #517.

Some thoughts:

- Should there be an additional argument to specify the array slicing interval (which is currently fixed at 3)?
- Could the JSON output be cleaned up? Currently, a (truncated) sample output file looks like:

```json
{
  "operation": "transpose_sum",
  "size": 10000,
  "second_size": 1000,
  "chunk_size": 2500,
  "compute_size": [
    10000,
    10000
  ],
  "compute_chunk_size": [
    2500,
    2500
  ],
  "ignore_size": "1.05 MB",
  "protocol": "tcp",
  "devs": "0,1,2,3",
  "threads_per_worker": 1,
  "times": [
    {
      "wall_clock": 1.4910394318867475,
      "npartitions": 16
    }
  ],
  "bandwidths": {
    "(00,01)": {
      "25%": "136.34 MB/s",
      "50%": "156.67 MB/s",
      "75%": "163.32 MB/s",
      "total_nbytes": "150.00 MB"
    }
  }
}
```

Authors:
  - Charles Blackmon-Luca (@charlesbluca)

Approvers:
  - Mads R. B. Kristensen (@madsbk)
  - Peter Andreas Entschev (@pentschev)

URL: #524
@charlesbluca (Member) commented Feb 12, 2021

It turns out the initial runs I did on TCP didn't have CUPY_CUDA_PER_THREAD_DEFAULT_STREAM=1 set - I ran both TCP and UCX tests again after fixing that and the results are significantly different:

In particular, it looks like mean, sum, and transpose sum have some noticeable speed up. Some observations:

  • It looks like most of the speed-up tends to plateau at 2-4 threads, with operations either running at the same speed or slower when bumping up to 8 threads per worker.
  • Mean on UCX runs slower at 8 threads than at 4, while it generally stays the same on TCP for this jump.
  • The same can be observed in reverse when looking at sum.
  • For some operations, TCP is faster than UCX (sum using 4 threads); not sure if this is an unexpected result.

@jakirkham (Member)

It might be worth creating larger data and/or using smaller chunks to see if this continues to hold up.

@pentschev (Member)

I forgot to mention: for UCX you should also pass `--ucx-net-devices auto` when running on a DGX-1. This will ensure each GPU utilizes the correct InfiniBand interface, which dramatically changes results for the right type of transfers. Could you also try running with that argument, @charlesbluca?

@charlesbluca (Member)

Sure! I'm running them now; I'll update the notebook once they finish.

I'll also play around with array/chunk size; any insight on how the size of the second array in SVD should scale?

Also are there any good examples of comparative bar plots that have more than one independent variable? I'm trying to think about the best way to show the impact of changing array size and thread count in one plot.

@charlesbluca (Member)

Updated the notebook; looks like the main difference with auto is that mean generally scales better, and slice runs faster regardless of thread count.

@pentschev (Member)

@charlesbluca as we discussed at our standup meeting last Friday, we also need to ensure UCX synchronizes on PTDS only, rather than on the default stream. Sorry for forgetting about this earlier, but could you try running the UCX benchmarks again with https://github.com/pentschev/distributed/tree/ptds-support?

@charlesbluca (Member)

Sure! Do I have to specify any options in the Dask configuration?

@pentschev (Member)

No, right now it will rely on your environment variables: https://github.com/pentschev/distributed/blob/c1cf54c6fc584efbc423ad0b49608817deb08d88/distributed/comm/ucx.py#L46-L49

We may change that in the future though.

@charlesbluca (Member)

Thanks for the clarification! Right now I'm trying out some different parameters for array/chunk size; for a holistic test of any single operation, I'm thinking we should try:

Array sizes (may need some smaller sizes to run for array multiplication):

  • 20,000
  • 40,000
  • 80,000

Chunk sizes:

  • 1/8 of array size (64 chunks)
  • 1/4 of array size (16 chunks)
  • 1/2 of array size (4 chunks)

Second size (SVD only)

  • 10% of array size
  • 20% of array size
  • 40% of array size

We could also play around with GPU count to see if that has a discernible impact. A sketch of the resulting parameter grid is below.
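
A minimal sketch enumerating that parameter grid (the fractions are just the chunk-size and second-size ratios listed above):

```python
# Sketch: enumerate the array-size / chunk-size combinations listed above.
from itertools import product

array_sizes = [20_000, 40_000, 80_000]
chunk_fractions = [1 / 8, 1 / 4, 1 / 2]   # 64, 16, and 4 chunks respectively
second_fractions = [0.1, 0.2, 0.4]        # second size, SVD only

for n, frac in product(array_sizes, chunk_fractions):
    chunk = int(n * frac)
    print(f"size={n}, chunk_size={chunk}, chunks={(n // chunk) ** 2}")
```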

@charlesbluca (Member)

I ended up using this testing suite for one operation (sum) and got some pretty strange results (plot attached).

In particular, we can see that the results aren't consistent with what was observed before with array/chunk size 40,000/10,000 - not sure if this is related to running all of these tests in short succession. For reference, the Dask profiles for this operation with 1 and 2 threads per worker are attached (2 threads should have been faster but weren't).

@charlesbluca (Member)

It looks like the results from before had something to do with generating the Dask performance report - removing that argument gave me results similar to the earliest benchmarks:

Playing around with array/chunk size, some interesting observations:

  • The overall array size seems to have an impact on what thread count is optimal - this can be seen when comparing the 20000/5000, 40000/10000, and 80000/20000 runs.
  • The smallest chunk size (64 chunks) had the best scaling performance, seeing improvements up to 16-32 threads per worker in all cases.
  • The largest chunk size (4 chunks) saw no improvement from increased thread count.

I'm interested in whether the smaller chunk size / larger thread count would continue scaling, although it might be impractical to have more than 32 threads per worker.

@charlesbluca (Member)

I think the different results when generating performance reports could have something to do with device synchronization not being done:

```python
# Execute the operations to benchmark
if args.profile is not None:
    async with performance_report(filename=args.profile):
        t1 = clock()
        await client.compute(func(*func_args))
        took = clock() - t1
else:
    t1 = clock()
    res = client.compute(func(*func_args))
    await client.gather(res)
    if args.type == "gpu":
        await client.run(lambda xp: xp.cuda.Device().synchronize(), xp)
    took = clock() - t1
```

Is there any reason device sync isn't being done when we opt for a performance report?
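
For reference, a sketch of how the same synchronization could be added to the profiling branch of the snippet above, mirroring the existing non-profiling path (just one possible fix, relying on the same surrounding script):

```python
# Sketch: synchronize devices before stopping the clock in the profiling
# branch as well, mirroring the non-profiling path.
if args.profile is not None:
    async with performance_report(filename=args.profile):
        t1 = clock()
        await client.compute(func(*func_args))
        if args.type == "gpu":
            await client.run(lambda xp: xp.cuda.Device().synchronize(), xp)
        took = clock() - t1
```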

@jakirkham (Member)

I don't think so. Sounds like a bug (thanks for finding that btw 😄). Would you like to submit a PR?

@pentschev (Member)

As John mentioned, this is indeed a bug. Could you submit a PR for that when you have a chance?

@pentschev (Member) commented Mar 9, 2021

We can do both. Since this is using CuPy and it has its own memory pool, it should work fine either way.

> We actually replace CuPy's memory pool for these benchmarks:
>
> `cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)`
>
> Is this what we want for the benchmarks?

Using RMM's default is a good balance, so I'd say yes. You can specify `--rmm-pool-size` if you want a different size when running.

EDIT: If you're asking about these benchmarks in particular, you can specify 90-95% of the GPU's total memory to prevent any further allocations.
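
For context, a minimal sketch of what the memory-pool setup amounts to on each worker (the 30 GiB figure is only an illustrative ~90-95% of a 32 GB GPU):

```python
# Sketch: create an RMM memory pool and route CuPy allocations through it,
# which is what the benchmark's memory-pool setup effectively does per worker.
import cupy
import rmm

rmm.reinitialize(pool_allocator=True, initial_pool_size=30 * 2**30)  # ~30 GiB
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)
```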

@charlesbluca (Member)

What I meant to say is that, in the benchmark code itself, it looks like `--rmm-pool-size` is not actually being used when supplied - I think that if it were, `args.rmm_pool_size` would be passed to `client.run(setup_memory_pool, ...)`.

@jakirkham (Member)

Sounds like a bug. Would you like to do a PR to fix that?

@charlesbluca (Member)

Sure! Is it okay if I roll this into a larger PR with the additional operations + NVTX traces?

@pentschev (Member)

Sure, I don't think there's much rush for this right now. And thanks for catching that, @charlesbluca!

@charlesbluca (Member)

No problem! Opened up PR #548.

@charlesbluca (Member)

Ran some benchmarks with column sum and gather; in general, it looks like performance is better with a larger chunk size, though increased thread counts don't seem to have a consistent impact (plots attached).

For sum, there is a spike in performance at the largest chunk size.

This is surprising to me, as it seems from the Nsys profile that there aren't too many kernel calls happening here, mostly UCX communication.

@jakirkham (Member) commented Mar 15, 2021

It might be worth trying this benchmark without a mask to see how it compares

@charlesbluca (Member)

Sure, what would the code look like running this without a mask?

@jakirkham (Member)

Sorry, scratch that.

What are the chunk sizes here? How many chunks are we using?

@charlesbluca (Member)

Chunk sizes are 31250, 62500, 125000, and 250000, so 16, 8, 4, and 2 chunks on the sliced array and 32, 16, 8, 4 chunks on the index array, respectively.

@charlesbluca (Member) commented Mar 16, 2021

Thanks for #553, @jakirkham!

Column masking behaves similarly to column sum (plots attached).

The main difference in the profiles (125000 vs. 250000) is that there's a lot more space between the kernel calls and the UCX communication.

It also looks like there are some PtoP copies happening only for the larger chunk size? I've highlighted where that's happening.

@pentschev (Member)

> It also looks like there are some PtoP copies happening only for the larger chunk size? I've highlighted where that's happening.

As I wrote in #536 (comment), this is probably due to `UCX_RNDV_THRESH`.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@quasiben (Member, Author) commented Aug 2, 2021

Still active

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
