
ListAndWatch fails when managing large-memory GPUs such as NVIDIA Tesla V100 #19

zzr93 opened this issue Nov 26, 2021 · 16 comments

zzr93 commented Nov 26, 2021

This issue is an extension of #18

What happened:
Applied volcano-device-plugin on a server with 8 x V100 GPUs, but describing the node shows volcano.sh/gpu-memory: 0:

[screenshot: node description showing volcano.sh/gpu-memory: 0]

The same situation did not occur when using T4 or P4 GPUs.
Tracing the kubelet logs, I found the following error message:

[screenshot: kubelet error log]

It seems the sync message is too large.

What caused this bug:
volcano-device-plugin mocks each GPU as a list of devices (every device in this list is treated as a 1MB memory block) so that different workloads can share one GPU through the Kubernetes device plugin mechanism. When a large-memory GPU such as the V100 is used, the device list grows so large that the ListAndWatch response exceeds the message size bound, and ListAndWatch fails as a result.
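
To make the failure mode concrete, here is a minimal sketch (not the plugin's actual code) of how advertising one fake device per 1MB of GPU memory inflates the ListAndWatch response. It assumes the standard Kubernetes device plugin API types from k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1 and a 32GB V100; the helper name and device ID format are illustrative:

```go
package main

import (
	"fmt"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// fakeDevices builds the fake device list for one GPU, advertising one
// device per blockSizeMB of GPU memory (the current plugin uses 1MB blocks).
func fakeDevices(gpuIndex, memMB, blockSizeMB int) []*pluginapi.Device {
	devs := make([]*pluginapi.Device, 0, memMB/blockSizeMB)
	for i := 0; i < memMB/blockSizeMB; i++ {
		devs = append(devs, &pluginapi.Device{
			ID:     fmt.Sprintf("gpu-%d-block-%d", gpuIndex, i),
			Health: pluginapi.Healthy,
		})
	}
	return devs
}

func main() {
	// 8 x V100 32GB: with 1MB blocks this is 8 * 32768 = 262144 devices in a
	// single ListAndWatch response, which is what overflows the sync message;
	// with 10MB blocks it drops to 26208.
	for _, block := range []int{1, 10} {
		total := 0
		for gpu := 0; gpu < 8; gpu++ {
			total += len(fakeDevices(gpu, 32*1024, block))
		}
		fmt.Printf("block size %2dMB -> %6d fake devices\n", block, total)
	}
}
```

On smaller cards such as the T4 (16GB) or P4 (8GB) the list stays far shorter, which is consistent with the problem only showing up on the V100.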

Solutions:
The key is to shrink the device list, so we can treat each device as a 10MB memory block and rework the whole bookkeeping process around that assumption. This granularity is accurate enough for almost all production environments.
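
If the block size is raised to 10MB, the bookkeeping on the request side just has to round each gpu-memory request up to whole blocks. A minimal sketch of that rounding, assuming a hypothetical helper (blocksFor is not an existing function in the plugin):

```go
package main

import "fmt"

// blockSizeMB is the proposed granularity; the current plugin effectively uses 1.
const blockSizeMB = 10

// blocksFor rounds a pod's gpu-memory request (in MB) up to whole blocks so a
// request is never under-allocated; worst-case over-allocation is 9MB per request.
func blocksFor(requestMB int) int {
	return (requestMB + blockSizeMB - 1) / blockSizeMB
}

func main() {
	for _, req := range []int{512, 1000, 4097} {
		blocks := blocksFor(req)
		fmt.Printf("request %4dMB -> %3d devices (%dMB accounted)\n",
			req, blocks, blocks*blockSizeMB)
	}
}
```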

Thor-wl added the bug and bug/import-soon labels and removed the bug label Nov 26, 2021

Thor-wl commented Nov 26, 2021

Thanks for your report and debugging. The analysis is valuable and we will fix it as soon as possible.

Thor-wl self-assigned this Nov 26, 2021

Thor-wl commented Nov 30, 2021

Requesting more input on how large a block should be (the default is 1MB) so that the choice suits all supported GPU cards.


zzr93 commented Dec 1, 2021

100MB per block should work fine. Inference services usually consume hundreds to thousands of MB of memory (training services usually consume much more than that), so in practice we do not care about memory fragments smaller than 100MB.
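
For scale, a worked comparison (illustrative, assuming a 32GB V100) of how the per-GPU device count and worst-case fragmentation trade off at the block sizes discussed in this thread:

```go
package main

import "fmt"

func main() {
	const gpuMemMB = 32 * 1024 // 32GB V100
	for _, block := range []int{1, 10, 100} {
		fmt.Printf("block %3dMB: %5d devices per GPU, worst-case waste %2dMB per workload\n",
			block, gpuMemMB/block, block-1)
	}
}
```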


Thor-wl commented Dec 2, 2021

> 100MB per block should work fine. Inference services usually consume hundreds to thousands of MB of memory (training services usually consume much more than that), so in practice we do not care about memory fragments smaller than 100MB.

I see. I'll take this issue to the weekly meeting for discussion. Would you like to share your ideas at the meeting?


zzr93 commented Dec 2, 2021

> > 100MB per block should work fine. Inference services usually consume hundreds to thousands of MB of memory (training services usually consume much more than that), so in practice we do not care about memory fragments smaller than 100MB.
>
> I see. I'll take this issue to the weekly meeting for discussion. Would you like to share your ideas at the meeting?

My pleasure, I will be there.


Thor-wl commented Dec 3, 2021

See you at 15:00.


jasonliu747 commented Dec 3, 2021

> See you at 15:00.

Awww that's sweet. 🥺

@lakerhu999

Is this issue resolved at present?


Thor-wl commented Jan 18, 2022

> Is this issue resolved at present?

Not yet. We are considering a graceful way to make the fix without modifying gRPC directly.

@lakerhu999

Any update on this issue?


Thor-wl commented Feb 16, 2022

> Any update on this issue?

Not yet. Sorry, I have been busy developing another feature recently. Will fix this ASAP.

@lakerhu999

This is still a bug in our product, the same as this issue. Once it is fixed, please close this issue.


Thor-wl commented Mar 1, 2022

> This is still a bug in our product, the same as this issue. Once it is fixed, please close this issue.

OK, the fix is still on the way. I'll close the issue once the bug is fixed.


pauky commented Apr 11, 2022

How is this going?

@shinytang6

#22 may resolve this issue

@XueleiQiao

Has our latest image been published on the public network? @shinytang6
