Race condition in Kineto config reading causing segmentation fault when using Dynolog #1113

@1linkovdim

Description

When we added Dynolog to the images used for our training jobs, jobs started failing at random across the cluster. We eventually tracked the failures down to segmentation faults in Kineto. After analyzing the resulting core dumps with PyStack, it was clear that the segmentation faults were coming from a race condition in the Kineto process-registration workflow.

Most of the core dumps had some variation of

(C) File "???", line 0, in libkineto::IpcFabricConfigClient::getLibkinetoOndemandConfig[abi:cxx11](int) (libtorch_cpu.so)

or

libkineto::DaemonConfigLoader::readBaseConfig[abi:cxx11]()

as the last torch/Kineto/Dynolog-related frames on the stack, or crashed with the following error log:

  what():  Bad file descriptor
Aborted (core dumped)
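For context, "Bad file descriptor" (EBADF) is what you get when one side of a race closes a file descriptor that the other side is still using. The following is an illustrative Python sketch only, not Kineto's actual code; it triggers the same errno deterministically by using a descriptor after it has been closed:

```python
import errno
import os

# Illustrative only: one side of the race closes the descriptor,
# the other side still tries to use it afterwards.
r, w = os.pipe()
os.close(r)            # the fd is closed...
err = None
try:
    os.read(r, 1)      # ...but is still used here
except OSError as e:
    err = e            # OSError with errno EBADF ("Bad file descriptor")
os.close(w)
print("read failed:", err)
```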

As this appears to be a race condition, reproducing it is non-trivial and requires running the container many times. I managed to reproduce it as follows:

Using the following Dockerfile:

FROM nvcr.io/nvidia/pytorch:25.01-py3

COPY requirement_files/requirements.txt /root/requirements.txt

# Copy the script that was running when the crash occurred
COPY *.py /root/

WORKDIR /root/

# Create Python virtual environment and install requirements
RUN python -m venv .venv && \
    . .venv/bin/activate && \
    pip install -q -r requirements.txt

ENTRYPOINT ["/bin/bash"]

where requirements.txt is:

torch
numpy

and the script is simply:


import torch


def main():
    a = torch.randn(3, 3).cuda()
    print(a)


if __name__ == "__main__":
    main()

Building and running the container, then installing and starting Dynolog inside it:

docker build -t debug_core_dump_simple -f Dockerfile .
docker run --gpus all -it debug_core_dump_simple
wget https://github.com/facebookincubator/dynolog/releases/download/v0.5.0/dynolog_0.5.0-0-amd64.deb
dpkg -i dynolog_0.5.0-0-amd64.deb
dynolog --enable-ipc-monitor &

export KINETO_USE_DAEMON=1
source .venv/bin/activate

for i in {1..100}; do
    echo "Iteration $i"
    python simple_script.py
    sleep 1
done
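For reference, the shell loop above can also be sketched as a small Python driver that counts runs killed by a signal (a negative returncode from subprocess, e.g. -11 for SIGSEGV or -6 for SIGABRT, the "Aborted" case seen here). The script path and run count below are placeholders:

```python
import subprocess
import sys

def count_crashes(cmd, runs):
    """Run cmd `runs` times; count runs that were killed by a signal."""
    crashes = 0
    for i in range(1, runs + 1):
        print(f"Iteration {i}")
        rc = subprocess.run(cmd).returncode
        if rc < 0:   # negative returncode => terminated by a signal
            crashes += 1
    return crashes

# In the actual repro you would use:
#   count_crashes([sys.executable, "simple_script.py"], 100)
# Demo with a command that always exits cleanly:
n = count_crashes([sys.executable, "-c", "pass"], 3)
print(n)
```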

I ran the loop three times (300 total runs of the script) and collected 46 core files, a ~15% reproduction rate.
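The arithmetic behind that rate, with the counts above hard-coded (in practice you would glob for the core files rather than type the number in):

```python
# Reproduction rate from the runs described above.
total_runs = 3 * 100   # three iterations of the 100-run loop
core_files = 46        # in practice: len(glob.glob(os.path.expanduser("~/core*")))
rate = 100 * core_files / total_runs
print(f"{core_files}/{total_runs} runs produced a core file ({rate:.1f}%)")
```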

To analyze the dumps, I used PyStack as follows:

(for dump in ~/core*; do   pystack core "$dump" --native-all --exhaustive; done) >> dump_output.txt
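To tally which frames dominate across dumps, a quick post-processing pass over the combined output can count occurrences of the two suspect symbols. This is a hypothetical helper; `sample` stands in for the contents of dump_output.txt:

```python
from collections import Counter

# Hypothetical post-processing: count how often each suspect frame
# appears in the combined PyStack output.
suspects = [
    "IpcFabricConfigClient::getLibkinetoOndemandConfig",
    "DaemonConfigLoader::readBaseConfig",
]

# Stand-in for open("dump_output.txt").read()
sample = """\
File "???", line 0, in libkineto::IpcFabricConfigClient::getLibkinetoOndemandConfig[abi:cxx11](int)
File "???", line 0, in libkineto::DaemonConfigLoader::readBaseConfig[abi:cxx11]()
File "???", line 0, in libkineto::IpcFabricConfigClient::getLibkinetoOndemandConfig[abi:cxx11](int)
"""

counts = Counter()
for line in sample.splitlines():
    for s in suspects:
        if s in line:
            counts[s] += 1
print(counts.most_common())
```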

Happy to provide any further details as this was difficult to track down in the first place.
