Cluster-mode deployment: if the supervisor is restarted, do all workers have to be restarted too? #2402

@paradin

Description

System Info

docker

Running Xinference with Docker?

  • docker
    pip install
    installation from source

Version info

0.15.2

The command used to start Xinference

Distributed setup: start the supervisor and the worker as usual.
The supervisor is started with supervisor-port specified.
Launch a model on the worker, e.g. bge-m3.

Reproduction

Restart the supervisor. The web UI no longer shows the running model bge-m3, and the model service becomes unavailable.
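
The symptom can also be checked from a script by comparing the model list before and after the restart. The following is a minimal sketch, assuming the supervisor's REST port serves an OpenAI-style `GET /v1/models` endpoint that returns `{"data": [{"id": ...}]}`. The helper names `list_model_ids` and `lost_after_restart` are illustrative and do not come from the report.

```python
import json
import urllib.request


def list_model_ids(endpoint, opener=urllib.request.urlopen):
    """Return the set of model IDs listed at <endpoint>/v1/models.

    Assumes an OpenAI-style response: {"data": [{"id": ...}, ...]}.
    `opener` is injectable so the function can be exercised without a server.
    """
    with opener(f"{endpoint}/v1/models") as resp:
        payload = json.load(resp)
    return {model["id"] for model in payload.get("data", [])}


def lost_after_restart(before, after):
    """Models that were listed before the restart but are missing afterwards."""
    return sorted(set(before) - set(after))
```

Calling `list_model_ids` against the supervisor endpoint once before and once after the restart, then passing both results to `lost_after_restart`, would list `bge-m3` if the behavior described above occurs.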

Expected behavior

  1. After the supervisor restarts, models that were already running keep working
  2. The model service remains available

Activity

added this to the v0.15 milestone on Oct 8, 2024
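pkunight commented on Oct 10, 2024

I've run into this problem too. The supervisor has to be started first and the workers afterwards, and the connection must not be interrupted after that. Otherwise, even if the supervisor restarts successfully, the workers keep reporting errors saying they cannot connect to the supervisor's IP address.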

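paradin (Author) commented on Oct 10, 2024

> I've run into this problem too. The supervisor has to be started first and the workers afterwards, and the connection must not be interrupted after that. Otherwise, even if the supervisor restarts successfully, the workers keep reporting errors saying they cannot connect to the supervisor's IP address.

If supervisor-port is specified when starting the supervisor, the workers can reconnect after the supervisor restarts (because the supervisor port is then fixed).
The problem is that the supervisor is currently stateful: after a restart the workers can still report_status, but they do not report the status of their running models.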
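
To make the gap concrete, here is a toy model of what re-reporting running models could look like. It is only a sketch under stated assumptions: `ToySupervisor`, `ToyWorker`, and the `report_status(worker, model_uids)` signature are hypothetical and are not Xinference's actual classes or protocol.

```python
from dataclasses import dataclass, field


@dataclass
class ToySupervisor:
    """Hypothetical supervisor whose state lives only in memory."""

    # worker address -> set of model UIDs that worker says it is running
    running: dict = field(default_factory=dict)

    def report_status(self, worker, model_uids):
        # Overwrite the entry so a restarted supervisor can rebuild its view
        # from the next heartbeat of each worker.
        self.running[worker] = set(model_uids)

    def list_models(self):
        return sorted(uid for uids in self.running.values() for uid in uids)


@dataclass
class ToyWorker:
    """Hypothetical worker that includes its running models in each heartbeat."""

    address: str
    models: set = field(default_factory=set)

    def heartbeat(self, supervisor):
        supervisor.report_status(self.address, self.models)


worker = ToyWorker("worker-1", {"bge-m3"})
supervisor = ToySupervisor()
worker.heartbeat(supervisor)

supervisor = ToySupervisor()               # simulate a supervisor restart
models_right_after_restart = supervisor.list_models()
worker.heartbeat(supervisor)               # next heartbeat restores the view
models_after_heartbeat = supervisor.list_models()
```

In this toy, the restarted supervisor starts with an empty view and recovers `bge-m3` as soon as the worker's next heartbeat arrives, without the worker being restarted.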

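If the supervisor could be made stateless (for example, by sharing its state through Redis), that would also remove the supervisor as a single point of failure.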
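
A rough sketch of the stateless idea follows. It assumes the set of running models is kept in an external store rather than in supervisor memory. `InMemoryStore` is a stand-in whose method names mirror the Redis set commands SADD, SREM, and SMEMBERS; `StatelessSupervisor` and the key name `running_models` are hypothetical and do not describe how Xinference is implemented.

```python
class InMemoryStore:
    """Stand-in for a shared store such as Redis (set operations only)."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        self._sets.setdefault(key, set()).add(member)

    def srem(self, key, member):
        self._sets.get(key, set()).discard(member)

    def smembers(self, key):
        # Return a copy so callers cannot mutate the stored set.
        return set(self._sets.get(key, set()))


class StatelessSupervisor:
    """Hypothetical supervisor that keeps no model state of its own."""

    KEY = "running_models"  # illustrative key name

    def __init__(self, store):
        self.store = store

    def register_model(self, model_uid):
        self.store.sadd(self.KEY, model_uid)

    def unregister_model(self, model_uid):
        self.store.srem(self.KEY, model_uid)

    def list_models(self):
        return sorted(self.store.smembers(self.KEY))


store = InMemoryStore()
first = StatelessSupervisor(store)
first.register_model("bge-m3")

# A restarted supervisor, or a second replica, attached to the same store
# sees the same running models.
second = StatelessSupervisor(store)
models_seen_by_second = second.list_models()
```

Because every instance reads from the shared store, a restart loses nothing, and several supervisor instances could in principle run side by side, which is what would address the single-point concern.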

ak47947 commented on Oct 12, 2024

I've run into this problem too. Unless it is fixed, Xinference cannot really be used in cluster mode.

github-actions commented on Oct 19, 2024

This issue is stale because it has been open for 7 days with no activity.

github-actions commented on Oct 24, 2024

This issue was closed because it has been inactive for 5 days since being marked as stale.

github-actions commented on Dec 19, 2024

This issue is stale because it has been open for 7 days with no activity.

github-actions commented on Dec 25, 2024

This issue was closed because it has been inactive for 5 days since being marked as stale.

Participants: @paradin, @qinxuye, @ak47947, @pkunight, @XprobeBot