Worker cannot connect to master #988

Closed
Double-bear opened this issue Feb 5, 2024 · 9 comments

@Double-bear

After starting the master, the worker fails to connect with an error:
master.sh

MASTER_IP=$(ifconfig | grep -o 'inet [0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+' | grep -v '127.0.0.1' | head -n 1 | awk '{print $2}')
xinference-supervisor -H "$MASTER_IP" --log-level=debug

worker.sh (MASTER_IP is the same as the master's IP)

MASTER_IP=$1
xinference-worker -e http://"$MASTER_IP":9997 --log-level=debug
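The pipeline in master.sh picks the first `inet` address reported by `ifconfig` that is not 127.0.0.1. A rough Python equivalent is sketched below for readers who want to check which address it would select. The function name and the sample `ifconfig` text are made up for illustration (only the IP 10.100.108.220 comes from this thread), and the filter is an exact match on 127.0.0.1 rather than grep's pattern match.

```python
import re
from typing import Optional

# Matches the "inet A.B.C.D" fragments that the grep -o pattern in master.sh extracts.
_INET_RE = re.compile(r"inet (\d+\.\d+\.\d+\.\d+)")


def first_non_loopback_ipv4(ifconfig_output: str) -> Optional[str]:
    """Return the first IPv4 address that is not 127.0.0.1, or None if there is none."""
    for match in _INET_RE.finditer(ifconfig_output):
        address = match.group(1)
        if address != "127.0.0.1":
            return address
    return None


# Hypothetical ifconfig output; interface names and netmasks are assumptions.
SAMPLE = """\
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.100.108.220  netmask 255.255.255.0  broadcast 10.100.108.255
"""

if __name__ == "__main__":
    print(first_non_loopback_ipv4(SAMPLE))
```

On a machine with several interfaces, "first" depends on the order `ifconfig` lists them, so the selected address is worth printing and checking before passing it to -H.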

The error is as follows:

2024-02-05 11:08:35,466 xinference.core.worker 121 INFO     Starting metrics export server at 0.0.0.0:None
2024-02-05 11:08:35,467 xinference.core.worker 121 INFO     Checking metrics export server...
2024-02-05 11:08:37,098 xinference.core.worker 121 INFO     Metrics server is started at: http://0.0.0.0:41273
Traceback (most recent call last):
  File "/usr/local/bin/xinference-worker", line 8, in <module>
    sys.exit(worker())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/cmdline.py", line 349, in worker
    main(
  File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/worker.py", line 94, in main
    loop.run_until_complete(task)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/worker.py", line 65, in _start_worker
    await start_worker_components(
  File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/worker.py", line 43, in start_worker_components
    await xo.create_actor(
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 78, in create_actor
    return await ctx.create_actor(actor_cls, *args, uid=uid, address=address, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 143, in create_actor
    return self._process_result_message(result)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 596, in create_actor
    await self._run_coro(message.message_id, actor.__post_create__())
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 368, in _run_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/worker.py", line 179, in __post_create__
    await self._supervisor_ref.add_worker(self.address)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 657, in send
    result = await self._run_coro(message.message_id, coro)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 368, in _run_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 44, in wrapped
    ret = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/supervisor.py", line 917, in add_worker
    worker_ref = await xo.actor_ref(address=worker_address, uid=WorkerActor.uid())
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 125, in actor_ref
    return await ctx.actor_ref(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 196, in actor_ref
    future = await self._call(actor_ref.address, message, wait=False)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 77, in _call
    return await self._caller.call(
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/core.py", line 180, in call
    client = await self.get_client(router, dest_address)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/core.py", line 68, in get_client
    client = await router.get_client(dest_address, from_who=self)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/router.py", line 143, in get_client
    client = await self._create_client(client_type, address, **kw)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/router.py", line 157, in _create_client
    return await client_type.connect(address, local_address=local_address, **kw)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/communication/socket.py", line 255, in connect
    (reader, writer) = await asyncio.open_connection(host=host, port=port, **kwargs)
  File "/usr/lib/python3.10/asyncio/streams.py", line 48, in open_connection
    transport, _ = await loop.create_connection(
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1076, in create_connection
    raise exceptions[0]
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1060, in create_connection
    sock = await self._connect_sock(
  File "/usr/lib/python3.10/asyncio/base_events.py", line 969, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 501, in sock_connect
    return await fut
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 541, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [address=10.100.108.220:25589, pid=269] [Errno 111] Connect call failed ('0.0.0.0', 49736)

The xinference version is the latest source code, pulled and then installed with pip install.
How should I resolve this problem?

@XprobeBot XprobeBot added this to the v0.8.5 milestone Feb 5, 2024
@codingl2k1
Contributor

It looks like it connected to 0.0.0.0. Try using the machine's actual IP, for example: xinference-worker -e http://10.100.108.220:9997 --log-level=debug

@Double-bear
Author

xinference-worker -e http://10.100.108.220:9997 --log-level=debug

I had been passing the worker's IP in as an argument. I have since hard-coded it in the script, but running it still gives the same error:
worker.sh

xinference-worker -e http://10.100.108.220:9997/ --log-level=debug

@codingl2k1
Contributor

10.100.108.220

Are you sure the IP is correct?

@Double-bear
Author

10.100.108.220

Are you sure the IP is correct?

This is the current log on the server side, so it should be correct, right?

2024-02-05 11:02:50,956 xinference.core.supervisor 269 INFO     Xinference supervisor 10.100.108.220:25589 started
2024-02-05 11:02:55,956 xinference.core.supervisor 269 DEBUG    Enter get_status, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fc16030c400>,), kwargs: {}
2024-02-05 11:02:55,956 xinference.core.supervisor 269 DEBUG    Leave get_status, elapsed time: 0 s
2024-02-05 11:02:57,448 xinference.api.restful_api 140 INFO     Starting Xinference at endpoint: http://10.100.108.220:9997
/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py:476: UserWarning:
            Xinference ui is not built at expected directory: /usr/local/lib/python3.10/dist-packages/xinference/web/ui/build/
            To resolve this warning, navigate to /usr/local/lib/python3.10/dist-packages/xinference/web/ui/
            And build the Xinference ui by running "npm run build"
  warnings.warn(
2024-02-05 11:08:37,100 xinference.core.supervisor 269 DEBUG    Enter add_worker, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fc16030c400>, '0.0.0.0:49736'), kwargs: {}
2024-02-05 11:40:37,037 xinference.core.supervisor 269 DEBUG    Enter add_worker, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fc16030c400>, '0.0.0.0:50276'), kwargs: {}
2024-02-05 11:49:15,908 xinference.core.supervisor 269 DEBUG    Enter add_worker, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fc16030c400>, '0.0.0.0:48767'), kwargs: {}
2024-02-05 13:54:49,201 xinference.core.supervisor 269 DEBUG    Enter add_worker, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fc16030c400>, '0.0.0.0:30828'), kwargs: {}
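
The add_worker entries in this log show the worker registering itself as 0.0.0.0:<port>, and the worker-side traceback reports the supervisor process (address=10.100.108.220:25589, pid=269) failing to connect to ('0.0.0.0', 49736). 0.0.0.0 is a wildcard bind address, not something another machine can dial back. A minimal sketch of a check for that condition follows. The function name is hypothetical and is not part of xinference or xoscar, and it only handles IPv4-style "host:port" strings like the ones in the log.

```python
import ipaddress


def is_connectable_advertised_address(address: str) -> bool:
    """Return False when the host part of 'host:port' is the wildcard 0.0.0.0."""
    host, _, _port = address.rpartition(":")
    try:
        # is_unspecified is True for 0.0.0.0, the "all interfaces" bind address.
        return not ipaddress.ip_address(host).is_unspecified
    except ValueError:
        # Not an IP literal (for example a hostname); this sketch does not resolve it.
        return True


if __name__ == "__main__":
    for addr in ("0.0.0.0:49736", "10.100.108.220:25589"):
        print(addr, is_connectable_advertised_address(addr))
```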

@codingl2k1
Contributor

The supervisor log does show Enter add_worker. Is the worker's error still the same as at the beginning?

@Double-bear
Author

The supervisor log does show Enter add_worker. Is the worker's error still the same as at the beginning?

Yes, it is still the same.

2024-02-05 13:54:47,482 xinference.core.worker 121 INFO     Starting metrics export server at 0.0.0.0:None
2024-02-05 13:54:47,483 xinference.core.worker 121 INFO     Checking metrics export server...
2024-02-05 13:54:49,170 xinference.core.worker 121 INFO     Metrics server is started at: http://0.0.0.0:41831
Traceback (most recent call last):
  File "/usr/local/bin/xinference-worker", line 8, in <module>
    sys.exit(worker())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/cmdline.py", line 349, in worker
    main(
  File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/worker.py", line 94, in main
    loop.run_until_complete(task)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/worker.py", line 65, in _start_worker
    await start_worker_components(
  File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/worker.py", line 43, in start_worker_components
    await xo.create_actor(
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 78, in create_actor
    return await ctx.create_actor(actor_cls, *args, uid=uid, address=address, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 143, in create_actor
    return self._process_result_message(result)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 596, in create_actor
    await self._run_coro(message.message_id, actor.__post_create__())
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 368, in _run_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/worker.py", line 179, in __post_create__
    await self._supervisor_ref.add_worker(self.address)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 657, in send
    result = await self._run_coro(message.message_id, coro)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 368, in _run_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 44, in wrapped
    ret = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/supervisor.py", line 917, in add_worker
    worker_ref = await xo.actor_ref(address=worker_address, uid=WorkerActor.uid())
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 125, in actor_ref
    return await ctx.actor_ref(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 196, in actor_ref
    future = await self._call(actor_ref.address, message, wait=False)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 77, in _call
    return await self._caller.call(
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/core.py", line 180, in call
    client = await self.get_client(router, dest_address)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/core.py", line 68, in get_client
    client = await router.get_client(dest_address, from_who=self)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/router.py", line 143, in get_client
    client = await self._create_client(client_type, address, **kw)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/router.py", line 157, in _create_client
    return await client_type.connect(address, local_address=local_address, **kw)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/communication/socket.py", line 255, in connect
    (reader, writer) = await asyncio.open_connection(host=host, port=port, **kwargs)
  File "/usr/lib/python3.10/asyncio/streams.py", line 48, in open_connection
    transport, _ = await loop.create_connection(
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1076, in create_connection
    raise exceptions[0]
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1060, in create_connection
    sock = await self._connect_sock(
  File "/usr/lib/python3.10/asyncio/base_events.py", line 969, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 501, in sock_connect
    return await fut
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 541, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [address=10.100.108.220:25589, pid=269] [Errno 111] Connect call failed ('0.0.0.0', 30828)

@aresnow1
Contributor

aresnow1 commented Feb 5, 2024

In distributed mode, use -H on the worker to specify the current worker's IP.
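
Putting this together with the earlier worker.sh, the worker invocation would carry both -e (the supervisor endpoint) and -H (the worker's own IP). The sketch below only assembles that command line from the flags already mentioned in this thread. The function name is hypothetical, and 10.100.108.221 is a made-up worker IP used purely for illustration.

```python
import shlex
from typing import List


def build_worker_command(supervisor_ip: str, worker_ip: str, port: int = 9997) -> List[str]:
    """Assemble the xinference-worker argv using the flags discussed in this thread."""
    return [
        "xinference-worker",
        "-e", f"http://{supervisor_ip}:{port}",  # supervisor endpoint, as in worker.sh
        "-H", worker_ip,                          # this worker's own reachable IP
        "--log-level=debug",
    ]


if __name__ == "__main__":
    # 10.100.108.221 is an assumed worker IP, not taken from the issue.
    print(shlex.join(build_worker_command("10.100.108.220", "10.100.108.221")))
```

With -H set to an address the supervisor can reach, the worker should no longer register as 0.0.0.0:<port>, which matches the outcome reported in the next reply.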

@Double-bear
Author

In distributed mode, use -H on the worker to specify the current worker's IP.

That worked, thank you!

@Double-bear
Author

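In distributed mode, use -H on the worker to specify the current worker's IP.

I have one more question. My machine has eight GPUs, and I used four of them to launch a qwen 72b model, but it ran out of memory (OOM) at launch. Each GPU has 80 GB of memory, which should definitely be enough. Is there anything else I need to set at startup? Here are the commands in my script: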

pip install xinference
MASTER_IP=$(ifconfig | grep -o 'inet [0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+' | grep -v '127.0.0.1' | head -n 1 | awk '{print $2}')
xinference-local -H "$MASTER_IP" --port 9997 --log-level=debug

@XprobeBot XprobeBot modified the milestones: v0.8.5, v0.9.0 Feb 6, 2024
@XprobeBot XprobeBot modified the milestones: v0.9.0, v0.9.1 Feb 22, 2024
@XprobeBot XprobeBot modified the milestones: v0.9.1, v0.9.2, v0.9.3 Mar 1, 2024