Description
When we added Dynolog to the images used for training jobs, we started to see job failures at random throughout the cluster. Eventually, we tracked these down to segmentation faults in Kineto. Analyzing the resulting core dumps with Pystack made it clear that the segmentation faults came from a race condition in the Kineto process registration workflow.
Most of the core dumps had some variation of
(C) File "???", line 0, in libkineto::IpcFabricConfigClient::getLibkinetoOndemandConfig[abi:cxx11](int) (libtorch_cpu.so)
or
libkineto::DaemonConfigLoader::readBaseConfig[abi:cxx11]()
as the last torch/Kineto/Dynolog-related frames on the stack, or occurred together with the following error log:
what(): Bad file descriptor
Aborted (core dumped)
As this appears to be a race condition, reproducing it is non-trivial and requires running the container many times. I managed to reproduce it as follows:
Using the following Dockerfile:
FROM nvcr.io/nvidia/pytorch:25.01-py3
COPY requirement_files/requirements.txt /root/requirements.txt
# Copy the script that was running when the crash occurred
COPY *.py /root/
WORKDIR /root/
# Create Python virtual environment and install requirements
RUN python -m venv .venv && \
    . .venv/bin/activate && \
    pip install -q -r requirements.txt
ENTRYPOINT ["/bin/bash"]
where requirements.txt is:
torch
numpy
and the script is simply:
import torch

def main():
    a = torch.randn(3, 3).cuda()
    print(a)

if __name__ == "__main__":
    main()
Building and running the container:
docker build -t debug_core_dump_simple -f Dockerfile .
docker run --gpus all -it debug_core_dump_simple
Then, inside the container, install Dynolog, start it with IPC monitoring enabled, and run the script in a loop:
wget https://github.com/facebookincubator/dynolog/releases/download/v0.5.0/dynolog_0.5.0-0-amd64.deb
dpkg -i dynolog_0.5.0-0-amd64.deb
dynolog --enable-ipc-monitor &
export KINETO_USE_DAEMON=1
source .venv/bin/activate
for i in {1..100}; do echo "Iteration $i"; python simple_script.py; sleep 1; done
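Note that core files are only produced if core dumps are enabled in the environment. A minimal sketch (the limit and core pattern below are illustrative assumptions, not necessarily what our cluster uses):
# Inside the container, before running the loop: remove the core file size limit
ulimit -c unlimited
# On the host (kernel.core_pattern is global and shared with containers):
# write cores to the crashing process's working directory with an identifiable name
echo 'core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern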
I ran this through three iterations of 100 (300 total runs of the script), which produced 46 core files, i.e. a reproduction rate of roughly 15%.
To analyze the dumps, I used Pystack in the following way:
(for dump in ~/core*; do pystack core "$dump" --native-all --exhaustive; done) >> dump_output.txt
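To get a rough count of how many dumps hit the Kineto config path, the combined output can be searched for the symbols quoted above (just an illustration; the exact frame text may vary between builds):
# Count lines in the combined Pystack output that mention the Kineto config frames
grep -c 'IpcFabricConfigClient::getLibkinetoOndemandConfig' dump_output.txt
grep -c 'DaemonConfigLoader::readBaseConfig' dump_output.txt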
Happy to provide any further details as this was difficult to track down in the first place.