[BUG] Use of cudaGetDeviceProperties in minimax prim (and others?) leads to bottlenecks #927
We commonly use cudaGetDeviceProperties to figure out max shared mem and a few other properties. Calling this function from a prim used in tight inner loops leads to slowdowns and appears to lead to cross-device contention in a multi-GPU context. (Just eliminating this call speeds up multi-GPU RF tree building by up to 20x in some cases.)
Ideally we'd just cache this info once at startup. Particularly in the OPG paradigm, a process' GPU properties should never change. Maybe we should add a new singleton-style accessor for device properties that stashes the properties for all GPUs in a static and just returns them as needed?
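A minimal sketch of the singleton-style cache suggested above, assuming a hypothetical `DevicePropsCache` helper (not an existing cuML class). To keep it self-contained, the properties type and the query function are parameters; in real code `Props` would presumably be `cudaDeviceProp` and the query would wrap `cudaGetDeviceProperties(&prop, device)`.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <unordered_map>
#include <utility>

// Hypothetical helper: runs an expensive per-device query once per device
// and serves every later request from memory.
template <typename Props>
class DevicePropsCache {
 public:
  using Query = std::function<Props(int)>;

  explicit DevicePropsCache(Query query) : query_(std::move(query)) {
    assert(query_ && "query must be callable");
  }

  // Returns the cached properties for `device`, querying only on first use.
  // The mutex makes concurrent first-time lookups safe.
  const Props& get(int device) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(device);
    if (it == cache_.end()) {
      it = cache_.emplace(device, query_(device)).first;
    }
    // References to unordered_map elements stay valid across rehashing.
    return it->second;
  }

 private:
  Query query_;
  std::mutex mutex_;
  std::unordered_map<int, Props> cache_;
};

// Assumed CUDA usage (requires <cuda_runtime.h>, not compiled here):
//   DevicePropsCache<cudaDeviceProp> cache([](int dev) {
//     cudaDeviceProp p;
//     cudaGetDeviceProperties(&p, dev);
//     return p;
//   });
//   size_t smem = cache.get(dev).sharedMemPerBlock;
```

Holding one such object in a function-local static would give the "stash in a static" behavior; the lock is only contended on lookups, which are cheap compared with the driver call.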
I agree with the
Maybe provide the getter in the cumlHandle, using a thread-local to cache the results the first time it is called? (The cumlHandle is not advertised as thread-safe.)
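The thread-local variant could look roughly like the sketch below. `cachedPerThread` is a hypothetical free function, not part of the cumlHandle API, and the query is passed in so the example compiles without CUDA; in practice it would wrap `cudaGetDeviceProperties`.

```cpp
#include <unordered_map>

// Hypothetical getter: each thread keeps its own cache, so no lock is needed.
// The cost is one query per device per thread instead of one per process.
template <typename Props, typename Query>
const Props& cachedPerThread(int device, Query query) {
  // One map per thread for each (Props, Query) instantiation.
  thread_local std::unordered_map<int, Props> cache;
  auto it = cache.find(device);
  if (it == cache.end()) {
    it = cache.emplace(device, query(device)).first;
  }
  // The reference stays valid for the lifetime of the calling thread.
  return it->second;
}
```

Compared with a mutex-guarded static, this trades a little redundant work at thread start for a lock-free fast path.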