Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve collectives fingerprinting #97469

Open
kumpera opened this issue Mar 23, 2023 · 5 comments
Open

Improve collectives fingerprinting #97469

kumpera opened this issue Mar 23, 2023 · 5 comments
Labels
good first issue module: c10d Issues/PRs related to collective communications and process groups triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@kumpera
Copy link
Contributor

kumpera commented Mar 23, 2023

馃殌 The feature, motivation and pitch

When using TORCH_DISTRIBUTED_DEBUG=DETAIL we collect collectives fingerprints and those are quite helpful when troubleshooting issues like stragglers.

One recurring problem in distributed jobs are stragglers and, in special, those triggered by python GC activity. We should extend CollectiveFingerPrint to include two pieces of information: python gc counts (for all 3 gens) and some monotonic clock source.

Those would enable us us to detect such issues as part of TORCH_DISTRIBUTED_DEBUG=DETAIL.

One complication of this idea is that we currently compare fingerprints in a bitwise fashion, which won't work since some of this information is just advisory.

Alternatives

No response

Additional context

No response

@kumpera kumpera added good first issue module: c10d Issues/PRs related to collective communications and process groups labels Mar 23, 2023
@ngimel ngimel added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 24, 2023
@juan-park
Copy link

Regular user, first time contributor, would love to try and tackle this issue! Any contextual info would be greatly appreciated!

@SriRangaTarun
Copy link

Hey! I would love to work on this issue.

@mechatronicsengineer95
Copy link

Hey, it will be for me first contributing. I would like to try and help for solving this issue.

@fleonce
Copy link
Contributor

fleonce commented Sep 3, 2023

Hello, I would also like to contribute to this issue. So far I have managed to access the cpython API to retrieve the
desired information from the garbage collection, however when adding the following file python_gc_hook.c

#if PY_VERSION_HEX >= 0x03080000
#define Py_BUILD_CORE
#include <internal/pycore_interp.h>                     // PyInterpreterState
#include <internal/pycore_pystate.h>            // _PyInterpreterState_GET
#endif

struct gc_stats {
    Py_ssize_t collections;
    Py_ssize_t collected;
    Py_ssize_t uncollectable;
};

struct gc_info {
    struct gc_stats stats[NUM_GENERATIONS];
};

static struct gc_info * get_gc_state() {
    PyInterpreterState *interp = _PyInterpreterState_GET();
    GCState* state = &interp->gc;

    struct gc_info *info;
    struct gc_stats stats[NUM_GENERATIONS];
    struct gc_generation_stats gc_stats[NUM_GENERATIONS];

    for(int i = 0; i < NUM_GENERATIONS; i++) {
        gc_stats[i] = state->generation_stats[i];
    }
    return info;
}

to the build_variables.bzl under the libtorch_python_distributed_core_sources key, I cannot include said file in ProcessGroupWrapper.cpp, because ProcessGroupWrapper then needs also to include "Python.h" to use the type "Py_ssize_t" (gc_generation_stats), however ProcessGroupWrapper is contained in libtorch_distributed_base_sources in build_variables.bzl, and as a consequence I receive the following error when compiling ProcessGroupWrapper.cpp:

torch/csrc/python_headers.h:12:10: fatal error: Python.h: No such file or directory
   12 | #include <Python.h>

When I move ProcessGroupWrapper into libtorch_python_distributed_core_sources however, linking libtorch_cpu will no longer work correctly:

/usr/bin/ld: /home/moritz/Dokumente/pytorch/build/lib/libtorch_cpu.so: undefined reference to `c10d::ProcessGroup::ProcessGroup(int, int)'
/usr/bin/ld: /home/moritz/Dokumente/pytorch/build/lib/libtorch_cpu.so: undefined reference to `typeinfo for c10d::ProcessGroup'
/usr/bin/ld: /home/moritz/Dokumente/pytorch/build/lib/libtorch_cpu.so: undefined reference to `c10d::ProcessGroup::getBackend(c10::DeviceType)'
/usr/bin/ld: /home/moritz/Dokumente/pytorch/build/lib/libtorch_cpu.so: undefined reference to `c10d::ProcessGroup::~ProcessGroup()'
/usr/bin/ld: /home/moritz/Dokumente/pytorch/build/lib/libtorch_cpu.so: undefined reference to `c10d::ProcessGroup::ProcessGroup(int, int)'

To my best understanding I cannot just convert the Py_ssize_t into a "normal" ssize_t, so I am wondering on how to proceed with this issue. Any help would be greatly appreciated @kumpera

@fleonce
Copy link
Contributor

fleonce commented Sep 26, 2023

@kumpera I think this requires additional changes as currently ProcessGroupWrapper is part of the pure C++ Implementation, where no Python Headers are available, I鈥檓 not sure how to proceed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue module: c10d Issues/PRs related to collective communications and process groups triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

No branches or pull requests

6 participants