Possible update to GPU feature #535

Open
marcodelapierre opened this issue Apr 1, 2022 · 14 comments

Comments

@marcodelapierre
Contributor

This thought came out of the issue on MPI #527, so thanks @georgiastuart for the inspiration!

Current interface of the GPU feature:

  • SHPC settings:
    gpu:  # amd, nvidia or null
  • container recipe:
    gpu:  # true or false

I have realised that the current interface does not specify, for a given recipe, whether the corresponding package/container was built for Nvidia or AMD cards, even though this is known beforehand.
As a consequence, it is limiting in the (probably unlikely?) scenario where a centre has both Nvidia and AMD cards.
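
For reference, the current behaviour roughly amounts to the sketch below (a minimal illustration only; the function and variable names are made up here, not actual shpc internals):

```python
# Minimal sketch of the *current* behaviour, assuming the settings "gpu" value
# is the site-wide vendor ("amd", "nvidia" or None) and the recipe "gpu" value
# is a plain boolean. Names are illustrative, not actual shpc code.
def current_gpu_flag(recipe_gpu, settings_gpu):
    if not recipe_gpu:
        return None            # recipe does not request GPU support
    if settings_gpu == "nvidia":
        return "--nv"          # Singularity flag for Nvidia GPUs
    if settings_gpu == "amd":
        return "--rocm"        # Singularity flag for AMD GPUs
    return None                # settings gpu is null: nothing to add
```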

Updated interface/configuration, for consideration:

  • SHPC settings:
    gpu:  # null, amd, nvidia, or a list of several, e.g. [nvidia, amd]  # future-proof for Intel GPUs and so on
  • container recipe:
    gpu:  # amd, nvidia or false/null
  • New implementation, looking at the recipe entry (a rough sketch follows below):
    • if gpu == false, do nothing
    • if gpu == amd, add the --rocm flag if the global setting contains amd; ignore if the latter is null (?)
    • if gpu == nvidia, add the --nv flag if the global setting contains nvidia; ignore if the latter is null (?)
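
A rough sketch of that resolution, assuming the settings value is normalised to a list (function and variable names are made up for illustration, not actual shpc code):

```python
# Rough sketch of the proposed logic: the recipe "gpu" value is "amd", "nvidia",
# False or None, and the settings "gpu" value may be a string, a list
# (e.g. ["nvidia", "amd"]) or None. Illustrative only.
def proposed_gpu_flag(recipe_gpu, settings_gpu):
    supported = settings_gpu or []
    if isinstance(supported, str):
        supported = [supported]
    if recipe_gpu == "nvidia" and "nvidia" in supported:
        return "--nv"
    if recipe_gpu == "amd" and "amd" in supported:
        return "--rocm"
    return None   # recipe has no GPU build, or the centre has no matching cards
```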

Small implication: update the docs, and update the few preexisting recipes that have "gpu: true" (all Nvidia, apart from "tensorflow/tensorflow", which needs checking).

What do you think @vsoch?

@vsoch
Member

vsoch commented Apr 1, 2022

I like it! And I wonder if we need to make it easier to parse these feature groups - e.g. recent changes to default_version have a similar kind of logic: check the value and act differently depending on the case, and we would soon have the same for mpi (and maybe others in the future). I can give this a shot at implementation, although I want to work on update first (probably this weekend), since I think the binoc runs are generating incorrect listings!

@vsoch
Member

vsoch commented Apr 2, 2022

Haven't gotten to try this out yet - worked on the update functionality today! Not sure I made progress there, but this is next on my TODO to play around with.

@vsoch
Member

vsoch commented Apr 8, 2022

Okay, this is next in my queue @marcodelapierre! I haven't forgotten!

@marcodelapierre
Contributor Author

No rush!
My goodness, this week I am completely flooded with work, too.

@vsoch
Member

vsoch commented Apr 8, 2022

I'm not terribly flooded (yet, knock on wood!), but I like working on one new shpc feature at a time! So in Linux terms let's just say my brain works fairly serially, or in HPC terms I'm single threaded, within a single project. 🧵 😆

@marcodelapierre
Contributor Author

Haha, always a great metaphor, love it! 😄
Despite my job, my brain is proudly single threaded too, lol - that's as much as it can do!

[In other SHPC issues, hopefully I will get to comment on the environments/views, it is a very powerful concept, and I do have a scenario to share with you and the other contributors]

@vsoch
Member

vsoch commented Apr 8, 2022

@marcodelapierre one quick question! So this approach:

  • if gpu == false, do nothing
  • if gpu == amd, add the --rocm flag if the global setting contains amd; ignore if the latter is null (?)
  • if gpu == nvidia, add the --nv flag if the global setting contains nvidia; ignore if the latter is null (?)

Assumes that a container can only be built for one GPU type. E.g., tensorflow/tensorflow could be matched to nvidia, but not amd. Is that correct? And would we not run into issues with different tags being intended for different GPUs? This does feel like something that should still be general in the container recipe, to not hard-code a bias (e.g., true/false), but then on a particular install it should be up to the admin to decide the customizations. Our previous approach assumed a center is using one GPU type, and currently the admin would need a "one off" to install the same container name with a different GPU. Is that the action that is annoying / can be improved upon? Some more thinking:

  1. shpc supports pointing to a different settings.yml (config) on any command. Would it not work to just have a set of configs to point to for different kinds of installs?
  2. we could make it easier to "one off" a particular config setting - I think right now the -c flag does this, but I haven't tested it and maybe it's not shared enough.
  3. This does feel like it's a bit related to an environment or view. E.g., once I create an environment or view, I could say "I want all the containers here to install with amd". I'm starting to think that might be the best way moving forward.

So TLDR: I think we want to make this easy and support it, but we want to ensure that we don't hard-code a preference into a container.yaml that might be different / change with tags, and I think we should find the right way to scope this (e.g., scoping it in a view would make sense, I think!)
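
To make point 3 concrete, a view-scoped preference might behave something like the sketch below (purely hypothetical; the view structure and its "gpu" key are invented for illustration, not an existing shpc schema):

```python
# Hypothetical illustration of a view-scoped GPU preference: the view's gpu
# choice (if any) overrides the global settings value at install time.
def effective_gpu(settings_gpu, view=None):
    if view and view.get("gpu"):
        return view["gpu"]
    return settings_gpu

# e.g. effective_gpu("nvidia", view={"gpu": "amd"}) -> "amd"
```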

@marcodelapierre
Contributor Author

Great points - sorry @vsoch, I have been swamped these days; trying to catch up!

To be honest, I would consider it very unlikely that a single container image tag contains builds for multiple GPU vendors (happy to be proven wrong...).
If nothing else, because it would imply maintaining both CUDA and ROCm in the same image... which seems an unnecessary nightmare to me. So I think most, if not all, Dockerfile writers would avoid this situation.
I am saying this because it seems like, on the SHPC side, supporting a single tag with two vendors would add a lot of complexity, so I first stopped for a second to think about whether it is a likely usage scenario.

But on the other hand...

One scenario that I agree we definitely need to support is the one where different tags of the same image are built for different vendors. To this end...
remember we were talking about tag-specific customisations in the recipe? That would do the job, wouldn't it?

Here is the issue on this feature: #536
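
As a purely hypothetical illustration of that idea (the override layout, key names and tag name below are invented, not the actual #536 design), per-tag values could resolve along these lines:

```python
# Hypothetical sketch: a recipe-level default gpu value plus per-tag overrides,
# so different tags of the same image can target different vendors.
recipe = {
    "gpu": "nvidia",                                   # default for most tags
    "overrides": {"some-rocm-tag": {"gpu": "amd"}},    # hypothetical tag override
}

def recipe_gpu_for_tag(recipe, tag):
    override = recipe.get("overrides", {}).get(tag, {})
    return override.get("gpu", recipe.get("gpu"))

# e.g. recipe_gpu_for_tag(recipe, "some-rocm-tag") -> "amd"
```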

So, bottom line, I agree with you that we need to improve this aspect of the functionality, starting from the case where multiple tags of the same image support distinct vendors.

What do you think?

@marcodelapierre
Contributor Author

Thinking more about environments in this context, and your point on AMD+Nvidia containers... why not?!
In the end, once there are envs/views that manage this, it would be a matter of allowing a list of values for the container.yaml GPU feature; the environment setting would then pick the desired one.
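
A minimal sketch of that idea, assuming the recipe advertises a list of supported vendors and the environment/view (or settings) supplies the preferred one (names are illustrative only):

```python
# Sketch of the "list of values" idea: the recipe lists every vendor it was
# built for, and the environment/view picks one of them at install time.
def pick_gpu_flag(recipe_gpus, preferred):
    flags = {"nvidia": "--nv", "amd": "--rocm"}
    if preferred in (recipe_gpus or []):
        return flags.get(preferred)
    return None

# e.g. pick_gpu_flag(["nvidia", "amd"], "amd") -> "--rocm"
```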

I am not really adding much here, just paraphrasing your thoughts, which I can say I support!

@vsoch
Member

vsoch commented Apr 22, 2022

Just to loop back to the discussion here - when you review #545, think of it in the context of some of these questions. E.g., if we can find a way to customize a specific module install (still maintaining symbolic links, or something else?), I think we could handle specifics like this.

@marcodelapierre
Contributor Author

See my comment on #545,
where I suggest the following:

  1. tackle clusters with a single GPU vendor using the current approach, plus a small update (see below)
  2. tackle clusters with multiple vendors using Views

If we restrict the scope of the current issue to a single GPU vendor, then I would just suggest changing the functionality inside the container yaml, from

gpu: # True or False

to

gpu:  # amd, nvidia or false/null

On the grounds that typically a container is only built for one vendor.
However, if you think it's better to keep the flexibility, then no update is needed to the current functionality.

@vsoch
Member

vsoch commented Apr 28, 2022

This is next in the queue after views! I did start working on it, actually, but paused for views in case this turns out to be a subset of them (although right now it looks like it will be in addition to them).

@vsoch
Member

vsoch commented Jul 11, 2022

@marcodelapierre now that we have views, could there be a way to allow this additional customization through them?

@marcodelapierre
Contributor Author

Hi @vsoch,

I think we could provide the functionality in two ways:

  1. via updated shpc/container settings: see my first message in this issue. This allows specifying which GPU vendor a container has been built for, AND also handles multiple vendors within the settings in a compact way. Now that SHPC has the overrides functionality, it would be good to make sure that container features are also compatible with overrides, e.g. to handle the case where different tags of the same container repo support different GPU vendors.
  2. via SHPC views

My personal preference is the first, as it seems to be both simple and flexible at the same time. However, we've also learnt that it is good to provide multiple ways to achieve the same setup, as different people/centres will have different preferences. SHPC views seem great at providing this additional flexibility in setups.
