Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GPU Numbers Support #15

Merged
merged 1 commit into from
Jul 18, 2022
Merged

Add GPU Numbers Support #15

merged 1 commit into from
Jul 18, 2022

Conversation

peiniliu
Copy link
Contributor

@peiniliu peiniliu commented Aug 23, 2021

@Thor-wl
Copy link
Contributor

Thor-wl commented Oct 15, 2021

Is there any UT and E2E for this feature?

@peiniliu
Copy link
Contributor Author

Not really, I think.
This is a volcano device plugin, the functionality should be tested through volcano. NVIDIA GPUs numbers can now be defined via container level resource requirements using the 'volcano.sh/gpu-numbers'.

@Thor-wl
Copy link
Contributor

Thor-wl commented Oct 16, 2021

OK. Thanks, Peini!

Copy link
Member

@shinytang6 shinytang6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @peiniliu, btw there is another important point to think about: whether to support gpu number & gpu share at the same time or according to the configuration.

e.g. if a user applies for one single gpu, theoretically the available memory of the gpu will decrease, but it cannot be perceived at the device plugin level right now.

@william-wang
Copy link
Member

Does anyone have gpu env to have a testing against #12? We can not merge the pr without the testing on GPU env.

@Thor-wl Thor-wl added the help wanted Extra attention is needed label Jul 11, 2022
@peiniliu
Copy link
Contributor Author

Hi,
@shinytang6 As discussed, we plan to do the GPU number and GPU share separately, we support configuration on the device plugin to specify GPUStrategy whether be number or share. The design doc now can be reviewed.
@william-wang I bought a GKE cluster to have a test. I ll test it in the following weeks. I ll let you know if all things are done.
Best,
Peini

@peiniliu
Copy link
Contributor Author

Hi all,
gpu share and gpu number allocation features are tested through GKE with a single GPU node. Preliminary results can be found with in this doc.
https://docs.google.com/document/d/1Q8obEBpYkIgFGQvvjnwTEBvTvWtXzMfpQSIdjYCWXeo/edit?usp=sharing
Pls let me know if there are some misunderstandings or any other comments. Also it will be better to fix this ASAP given the GKE nodes need to be paid :)
Best,
Peini

@william-wang
Copy link
Member

@peiniliu Thank you so much for the testing. We'll review the change as soon as possible:)

Makefile Outdated Show resolved Hide resolved
pkg/apis/config.go Outdated Show resolved Hide resolved
Copy link
Contributor

@Thor-wl Thor-wl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's generally OK for me now. Please rebase all the commits into one and add --signoff flag when execute git commit.

main.go Outdated Show resolved Hide resolved
main.go Show resolved Hide resolved
pkg/apis/config.go Show resolved Hide resolved
pkg/plugin/nvidia/server.go Show resolved Hide resolved
pkg/plugin/nvidia/utils.go Outdated Show resolved Hide resolved
pkg/plugin/nvidia/server.go Outdated Show resolved Hide resolved
Copy link
Contributor

@Thor-wl Thor-wl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure the commits you submit are ready.

@Thor-wl
Copy link
Contributor

Thor-wl commented Jul 12, 2022

/hold

@Thor-wl Thor-wl removed the help wanted Extra attention is needed label Jul 12, 2022
@peiniliu
Copy link
Contributor Author

peiniliu commented Jul 12, 2022 via email

@Thor-wl Thor-wl added the hold the PR cannot be merged label Jul 12, 2022
doc/config.md Outdated Show resolved Hide resolved
doc/config.md Outdated Show resolved Hide resolved
doc/config.md Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
pkg/apis/config.go Show resolved Hide resolved
pkg/apis/config.go Show resolved Hide resolved
pkg/plugin/nvidia/utils.go Outdated Show resolved Hide resolved
pkg/plugin/nvidia/utils.go Outdated Show resolved Hide resolved
pkg/plugin/nvidia/server.go Outdated Show resolved Hide resolved
pkg/plugin/nvidia/server.go Outdated Show resolved Hide resolved
volcano-device-plugin-GKE.yml Outdated Show resolved Hide resolved
@shinytang6
Copy link
Member

In addition to the above comments, generally the implementation LGTM

@william-wang
Copy link
Member

@peiniliu Overall, Look good to me now. Please combine 21 commits to one commit.

@Thor-wl
Copy link
Contributor

Thor-wl commented Jul 15, 2022

@peiniliu Overall, Look good to me now. Please combine 21 commits to one commit.

+1

volcano-device-plugin.yml Show resolved Hide resolved
volcano-device-plugin-GKE.yml Outdated Show resolved Hide resolved
Signed-off-by: peiniliu <peini.liu@gmail.com>
@peiniliu
Copy link
Contributor Author

Thanks all,
I think it is ready now.

Copy link
Member

@william-wang william-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Copy link
Member

@shinytang6 shinytang6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the effort!
/lgtm
/approve

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hold the PR cannot be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants