Description
When we added Dynolog to the images used for training jobs, we started to see job failures at random throughout the cluster. Eventually, we tracked these down to segmentation faults in Kineto. Analyzing the resulting core dumps with Pystack made it clear that the segmentation faults came from a race condition in the Kineto process registration workflow.
Most of the core dumps had some variation of
(C) File "???", line 0, in libkineto::IpcFabricConfigClient::getLibkinetoOndemandConfig[abi:cxx11](int) (libtorch_cpu.so)
or
libkineto::DaemonConfigLoader::readBaseConfig[abi:cxx11]()
as the last torch/Kineto/Dynolog-related frames on the stack, or occurred together with the following error log:
what(): Bad file descriptor
Aborted (core dumped)
As this appears to be a race condition, reproducing it is non-trivial and requires running the container many times. I managed to reproduce it as follows:
Using the following Dockerfile:
FROM nvcr.io/nvidia/pytorch:25.01-py3
COPY requirement_files/requirements.txt /root/requirements.txt
# Copy the script that was running when the crash occurred
COPY *.py /root/
WORKDIR /root/
# Create Python virtual environment and install requirements
RUN python -m venv .venv && \
    . .venv/bin/activate && \
    pip install -q -r requirements.txt
ENTRYPOINT ["/bin/bash"]
where requirements.txt is:
torch
numpy
and the script is simply:
import torch

def main():
    a = torch.randn(3, 3).cuda()
    print(a)

if __name__ == "__main__":
    main()
Building and running the container:
docker build -t debug_core_dump_simple -f Dockerfile .
docker run --gpus all -it debug_core_dump_simple
Then, inside the container, install Dynolog, start it with IPC monitoring enabled, and run the script in a loop:
wget https://github.com/facebookincubator/dynolog/releases/download/v0.5.0/dynolog_0.5.0-0-amd64.deb
dpkg -i dynolog_0.5.0-0-amd64.deb
dynolog --enable-ipc-monitor &
export KINETO_USE_DAEMON=1
source .venv/bin/activate
for i in {1..100}; do echo "Iteration $i"; python simple_script.py; sleep 1; done
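Note that core files are only produced if core dumps are enabled in the environment. A minimal sketch (the limit and core pattern below are illustrative assumptions, not necessarily what our cluster uses):
# Inside the container, before running the loop: remove the core file size limit
ulimit -c unlimited
# On the host (kernel.core_pattern is global and shared with containers):
# write cores to the crashing process's working directory with an identifiable name
echo 'core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern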
I ran this through three iterations of 100 (300 total runs of the script), which produced 46 core files, i.e. a reproduction rate of roughly 15%.
To analyze the dumps, I used Pystack in the following way:
(for dump in ~/core*; do pystack core "$dump" --native-all --exhaustive; done) >> dump_output.txt
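To get a rough count of how many dumps hit the Kineto config path, the combined output can be searched for the symbols quoted above (just an illustration; the exact frame text may vary between builds):
# Count lines in the combined Pystack output that mention the Kineto config frames
grep -c 'IpcFabricConfigClient::getLibkinetoOndemandConfig' dump_output.txt
grep -c 'DaemonConfigLoader::readBaseConfig' dump_output.txt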
Happy to provide any further details as this was difficult to track down in the first place.