[CI] [Flaky test] distributed/test_shm_broadcast.py is flaky #5848

Closed
cadedaniel opened this issue Jun 25, 2024 · 3 comments · Fixed by #5801
@cadedaniel
Collaborator

Anything you want to discuss about vllm.

The distributed comm ops test failed on Buildkite with the stack trace below.

[2024-06-25T12:58:33Z] distributed/test_shm_broadcast.py:72:
[2024-06-25T12:58:33Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2024-06-25T12:58:33Z]
[2024-06-25T12:58:33Z] fn = <function worker_fn_wrapper.<locals>.wrapped_fn at 0x7f8cc92afa30>
[2024-06-25T12:58:33Z] world_size = 4
[2024-06-25T12:58:33Z]
[2024-06-25T12:58:33Z]     def distributed_run(fn, world_size):
[2024-06-25T12:58:33Z]         number_of_processes = world_size
[2024-06-25T12:58:33Z]         processes = []
[2024-06-25T12:58:33Z]         for i in range(number_of_processes):
[2024-06-25T12:58:33Z]             env = {}
[2024-06-25T12:58:33Z]             env['RANK'] = str(i)
[2024-06-25T12:58:33Z]             env['LOCAL_RANK'] = str(i)
[2024-06-25T12:58:33Z]             env['WORLD_SIZE'] = str(number_of_processes)
[2024-06-25T12:58:33Z]             env['LOCAL_WORLD_SIZE'] = str(number_of_processes)
[2024-06-25T12:58:33Z]             env['MASTER_ADDR'] = 'localhost'
[2024-06-25T12:58:33Z]             env['MASTER_PORT'] = '12345'
[2024-06-25T12:58:33Z]             p = multiprocessing.Process(target=fn, args=(env, ))
[2024-06-25T12:58:33Z]             processes.append(p)
[2024-06-25T12:58:33Z]             p.start()
[2024-06-25T12:58:33Z]
[2024-06-25T12:58:33Z]         for p in processes:
[2024-06-25T12:58:33Z]             p.join()
[2024-06-25T12:58:33Z]
[2024-06-25T12:58:33Z]         for p in processes:
[2024-06-25T12:58:33Z] >           assert p.exitcode == 0
[2024-06-25T12:58:33Z] E           AssertionError: assert 1 == 0
[2024-06-25T12:58:33Z] E            +  where 1 = <Process name='Process-1' pid=15885 parent=7 stopped exitcode=1>.exitcode
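
The assertion in the trace stops at the first process with a nonzero exit code, so it only shows that `Process-1` exited with 1 and says nothing about the other ranks. As a debugging aid, here is a minimal sketch of the same launcher that returns every failed rank together with its exit code. It is not the change made in #5801. The `start_method` and `master_port` parameters and the `example_worker` function are hypothetical additions for illustration and do not appear in the original test.

```python
import multiprocessing
import sys


def distributed_run(fn, world_size, start_method=None, master_port="12345"):
    """Launch `fn` once per rank and return {rank: exitcode} for failed ranks.

    `start_method` and `master_port` are hypothetical parameters added for
    this sketch; the launcher in the trace hard-codes port 12345 and uses
    the default multiprocessing start method.
    """
    # get_context(None) returns the platform's default context.
    ctx = multiprocessing.get_context(start_method)
    processes = []
    for rank in range(world_size):
        env = {
            "RANK": str(rank),
            "LOCAL_RANK": str(rank),
            "WORLD_SIZE": str(world_size),
            "LOCAL_WORLD_SIZE": str(world_size),
            "MASTER_ADDR": "localhost",
            "MASTER_PORT": master_port,
        }
        p = ctx.Process(target=fn, args=(env,))
        processes.append(p)
        p.start()

    # Wait for every rank before inspecting exit codes.
    for p in processes:
        p.join()

    # Collect all failures instead of asserting on the first one.
    return {
        rank: p.exitcode
        for rank, p in enumerate(processes)
        if p.exitcode != 0
    }


def example_worker(env):
    # Hypothetical worker for illustration: rank 1 fails on purpose.
    if env["RANK"] == "1":
        sys.exit(1)
```

A test could then use `failed = distributed_run(fn, 4)` followed by `assert not failed, f"failed ranks: {failed}"`, so that a flaky run reports every rank that exited abnormally in a single message.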
@cadedaniel
Collaborator Author

FYI @youkaichao

@youkaichao
Member

@cadedaniel should be fixed in #5801

@cadedaniel
Collaborator Author

awesome :)
