Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Use of cudaGetDeviceProperties in minimax prim (and others?) leads to bottlenecks #927

Closed
JohnZed opened this issue Aug 5, 2019 · 5 comments

Comments

@JohnZed
Copy link
Contributor

commented Aug 5, 2019

We commonly use cudaGetDeviceProperties to figure out max shared mem and a few other properties. Calling this function from a prim used in tight inner loops leads to slowdowns and appears to lead to cross-device contention in a multi-GPU context. (Just eliminating this call speeds up multi-GPU RF tree building by up to 20x in some cases.)

Ideally we'd just cache this info once at startup. Particularly in the OPG paradigm, a process' GPU properties should never change. Maybe we should add a new singleton-style accessor for device properties that stashes the properties for all GPUs in a static and just returns them as needed?

@JohnZed

This comment has been minimized.

Copy link
Contributor Author

commented Aug 5, 2019

Adding @teju85 who probably has better ideas than me on how to make this nice and general.

@JohnZed

This comment has been minimized.

Copy link
Contributor Author

commented Aug 5, 2019

Maybe a terrible, overcomplicating idea... but what about stashing this info (properties of current GPU) in the cumlHandle?

@cjnolet

This comment has been minimized.

Copy link
Collaborator

commented Aug 5, 2019

I agree with the and others, if this is impacting performance by 20x or more, we should probably be providing a utility / helper for this and using the utility everywhere.

Maybe providing the getter in the cumlHandle that uses a thread-local to cache the results when called for the first time? (since the cumlHandle is not advertised to be thread-safe).

@teju85

This comment has been minimized.

Copy link
Member

commented Aug 8, 2019

20x is HUGE!

I agree that caching this info inside cumlHandle is a good place to start. And any prim that needs these device information should explicitly ask for it in its API. Let me start a PR proposing these changes soon.

teju85 added a commit to teju85/cuml that referenced this issue Aug 8, 2019
updated minmax usage inside RF code to use the cached cudaDeviceProp
Issue rapidsai#927 should be addressed properly after this commit.
@teju85

This comment has been minimized.

Copy link
Member

commented Aug 8, 2019

@JohnZed @cjnolet PR #934 has been created for this. Can you folks review it?

@cjnolet cjnolet closed this Aug 16, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.