
fix: make fnllm utils fork-safe for Celery workers #1975


Open: wants to merge 1 commit into base: main

droideronline

Detect process forks and recreate thread resources to prevent 'RuntimeError: threads can only be started once' in worker processes.

  • Add PID tracking for fork detection
  • Safe cleanup of inherited resources
  • Fresh event loop creation per process

Problem

GraphRAG fails in Celery worker processes with RuntimeError: threads can only be started once when executing queries that use embedding models. This occurs because the current run_coroutine_sync() implementation is not fork-safe.

Root Cause

When Celery creates worker processes by forking:

  1. Child processes inherit the parent's module-level globals (_thr, _loop, _pid)
  2. The inherited thread object exists, but its thread is not running: only the thread that called fork() survives in the child
  3. Code attempts to call _thr.start() on the inherited, already-started thread object
  4. Python raises "threads can only be started once" because a Thread object can be started at most once
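
The single-use rule in step 4 can be reproduced without forking. The sketch below is illustrative only and is not code from this PR: it starts a thread, lets it finish, and then calls start() on the same object again. A forked child is in a comparable position, holding a thread object that has already been started but is no longer alive. The function name restart_error_message is hypothetical.

```python
import threading


def restart_error_message() -> str:
    """Start a thread, let it finish, then try to start the same object again."""
    worker = threading.Thread(target=lambda: None, daemon=True)
    worker.start()
    worker.join()  # the thread has finished, so is_alive() is now False

    try:
        worker.start()  # a Thread object cannot be started a second time
    except RuntimeError as exc:
        return str(exc)
    return ""


print(restart_error_message())  # prints "threads can only be started once"
```

Checking is_alive() alone is therefore not enough to decide that start() is safe; the thread object has to be replaced.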

Solution

This PR implements fork detection and safe resource recreation:

  • Fork Detection: Track process ID to detect when code runs in a forked child process
  • Safe Cleanup: Use contextlib.suppress() to safely clean up inherited resources
  • Fresh Resources: Create new event loop and thread specific to the child process
  • Process Isolation: Each forked process gets its own async infrastructure

Changes Made

Modified Files:

  • graphrag/language_model/providers/fnllm/utils.py

Related Issues

#1974

Proposed Changes

  1. Added imports: os and contextlib for PID tracking and safe cleanup
  2. Enhanced run_coroutine_sync(): Added fork detection logic
  3. Process tracking: Compare current PID with stored PID to detect forks
  4. Resource cleanup: Safely stop inherited event loops before creating new ones
  5. Thread recreation: Create fresh thread and event loop for each process

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

@droideronline droideronline requested a review from a team as a code owner June 11, 2025 19:17
@droideronline
Author

@microsoft-github-policy-service agree

@droideronline
Author

@natoverse - Could you please review this PR when you have some time? Thanks.
