Improve collectives fingerprinting #97469
Comments
Regular user, first-time contributor, would love to try and tackle this issue! Any contextual info would be greatly appreciated!
Hey! I would love to work on this issue.
Hey, this will be my first contribution. I would like to try to help solve this issue.
Hello, I would also like to contribute to this issue. So far I have managed to access the CPython API to retrieve the

    #include <Python.h>
    #if PY_VERSION_HEX >= 0x03080000
    #define Py_BUILD_CORE
    #include <internal/pycore_interp.h>  // PyInterpreterState
    #include <internal/pycore_pystate.h> // _PyInterpreterState_GET
    #endif

    // Per-generation counters, mirroring CPython's struct gc_generation_stats.
    struct gc_stats {
        Py_ssize_t collections;
        Py_ssize_t collected;
        Py_ssize_t uncollectable;
    };

    struct gc_info {
        struct gc_stats stats[NUM_GENERATIONS];
    };

    // Returns a snapshot by value, so no uninitialized pointer is handed back.
    static struct gc_info get_gc_state(void) {
        PyInterpreterState *interp = _PyInterpreterState_GET();
        // GCState is a typedef private to gcmodule.c, so name the struct directly.
        struct _gc_runtime_state *state = &interp->gc;
        struct gc_info info;
        for (int i = 0; i < NUM_GENERATIONS; i++) {
            info.stats[i].collections = state->generation_stats[i].collections;
            info.stats[i].collected = state->generation_stats[i].collected;
            info.stats[i].uncollectable = state->generation_stats[i].uncollectable;
        }
        return info;
    }

to the
When I move ProcessGroupWrapper into
To the best of my understanding, I cannot just convert the Py_ssize_t into a "normal" ssize_t, so I am wondering how to proceed with this issue. Any help would be greatly appreciated @kumpera
@kumpera I think this requires additional changes, as ProcessGroupWrapper is currently part of the pure C++ implementation, where no Python headers are available. I'm not sure how to proceed.
🚀 The feature, motivation and pitch
When using TORCH_DISTRIBUTED_DEBUG=DETAIL, we collect collective fingerprints, which are quite helpful when troubleshooting issues like stragglers.
One recurring problem in distributed jobs is stragglers, in particular those triggered by Python GC activity. We should extend CollectiveFingerPrint to include two pieces of information: Python GC counts (for all 3 generations) and a reading from some monotonic clock source.
Those would enable us to detect such issues as part of TORCH_DISTRIBUTED_DEBUG=DETAIL.
One complication of this idea is that we currently compare fingerprints in a bitwise fashion, which won't work for the new fields, since some of this information is just advisory and is expected to differ between ranks.
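To make the idea concrete, here is a minimal Python sketch of a fingerprint that carries both strict and advisory fields. It is only an illustration under stated assumptions, not the actual CollectiveFingerPrint (which lives in the C++ ProcessGroupWrapper): the `Fingerprint` class, its field names, and the helper functions are all hypothetical. The only library calls it relies on are `gc.get_stats()` and `time.monotonic_ns()` from the Python standard library.

```python
import gc
import time
from dataclasses import dataclass, field
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical stand-in for CollectiveFingerPrint (illustration only)."""

    # Strict fields: these must match on every rank.
    op_type: str
    tensor_shapes: Tuple[Tuple[int, ...], ...]
    # Advisory fields: compare=False keeps them out of the generated __eq__.
    gc_collections: Tuple[int, ...] = field(default=(), compare=False)
    monotonic_ns: int = field(default=0, compare=False)


def gc_collection_counts() -> Tuple[int, ...]:
    """Number of completed GC runs per generation, oldest generation last."""
    return tuple(stats["collections"] for stats in gc.get_stats())


def make_fingerprint(op_type: str, shapes: Sequence[Sequence[int]]) -> Fingerprint:
    """Build a fingerprint and attach the advisory GC and clock readings."""
    return Fingerprint(
        op_type=op_type,
        tensor_shapes=tuple(tuple(shape) for shape in shapes),
        gc_collections=gc_collection_counts(),
        monotonic_ns=time.monotonic_ns(),
    )


def gc_delta(before: Fingerprint, after: Fingerprint) -> Tuple[int, ...]:
    """GC runs per generation that happened between two fingerprints."""
    return tuple(b - a for a, b in zip(before.gc_collections, after.gc_collections))


if __name__ == "__main__":
    first = make_fingerprint("allreduce", [(4, 4)])
    gc.collect()  # simulate GC activity between two collectives
    second = make_fingerprint("allreduce", [(4, 4)])
    print("strict fields match:", first == second)
    print("gc runs in between:", gc_delta(first, second))
    print("elapsed ns:", second.monotonic_ns - first.monotonic_ns)
```

The design choice here is to split the fingerprint into fields that take part in equality and fields that do not, instead of comparing the whole record bitwise. Note that a monotonic clock has an undefined reference point, so its readings are only meaningful as differences taken within one process; comparing raw values across ranks would not be valid.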
Alternatives
No response
Additional context
No response