
Indexing: handle cpu & single-gpu without using multiprocessing & dist. data parallel #290

Merged (8 commits) on Jan 14, 2024

Conversation

@Anmol6 (Contributor) commented on Jan 11, 2024

No description provided.

@Anmol6 changed the title from "Handle rank 1 case without mp" to "Indexing: handle cpu & single-gpu without using multiprocessing & dist. data parallel" on Jan 11, 2024
@Anmol6 marked this pull request as ready for review on January 11, 2024 at 11:04
@bclavie (Collaborator) commented on Jan 12, 2024

I think after the latest batch of fixes it looks good now! Thank you so much for doing this!

@@ -30,6 +30,8 @@ class RunSettings:
     total_visible_gpus = torch.cuda.device_count()
     gpus: int = DefaultVal(total_visible_gpus)
 
+    use_rank1_fork: bool = DefaultVal(False)
@bclavie (Collaborator) commented on this line, Jan 12, 2024
Just a quick note (well, two):

  • Should this not be the opposite logic? i.e. defaults to True (and the associated logic changes in the code)? I feel like it'd make more sense with the naming
  • Do we want this to be a config item rather than a flag passed to the function? I think either works fine and maybe having it as part of config is more fitting for how the rest of the code is structured.
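As a minimal sketch of the trade-off in the second point: with the setting living on the config, a launcher can branch on it directly, running single-device indexing in the calling process and only falling back to torch.multiprocessing plus a distributed process group when there are multiple ranks (or when the fork is explicitly requested). Only use_rank1_fork and gpus come from the diff above; the launcher structure, run_indexing, and _distributed_worker are hypothetical names for illustration, not ColBERT's actual code.

# Hypothetical launcher sketch; only use_rank1_fork and gpus come from the diff above.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run_indexing(rank: int, nranks: int):
    # Placeholder for the actual per-rank indexing work.
    print(f"indexing on rank {rank} of {nranks}")

def _distributed_worker(rank: int, nranks: int, port: str):
    # Each spawned process joins a process group before doing its share of the work.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = port
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=nranks)
    try:
        run_indexing(rank, nranks)
    finally:
        dist.destroy_process_group()

def launch(config):
    single_device = config.gpus <= 1  # CPU-only or a single visible GPU
    if single_device and not config.use_rank1_fork:
        # No fork, no process group: run indexing directly in the calling process.
        run_indexing(rank=0, nranks=1)
    else:
        # Multi-GPU (or fork explicitly requested): one worker process per rank.
        nranks = max(config.gpus, 1)
        mp.spawn(_distributed_worker, args=(nranks, "29500"), nprocs=nranks)

Under a structure like this, flipping the default of use_rank1_fork only changes which branch a bare config takes, which is why the naming/default question in the first bullet matters.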

@Anmol6 (Contributor, Author) replied:
I was using "fork" in the sense of a code fork, but I'll revert to "fork" as in multiprocessing. Thanks for explaining the terminology!

@bclavie (Collaborator) replied:

Thanks, all looking good to me now! Just needs Omar's final approval 👍

@fblissjr commented on Jan 12, 2024

This fixed a problem I was having downstream in RAGatouille, which was always running in distributed mode on WSL2 with a single GPU. Thank you for the PR!

bclavie/RAGatouille#30 (comment)

Edit: it looks like the trainer still always forces distributed torch; the collection indexer change fixed indexing, though. Definitely the right direction.
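For context on the remaining trainer issue, the usual way to avoid forcing distributed torch is to guard the DistributedDataParallel wrapping (and the process-group requirement) on the number of ranks, roughly as in the sketch below. This is a generic pattern, not the ColBERT trainer's actual code; setup_model_for_training, rank, and nranks are assumed names.

# Generic sketch: wrap in DistributedDataParallel only when there is more than one rank.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_model_for_training(model: torch.nn.Module, rank: int, nranks: int):
    if torch.cuda.is_available():
        device = torch.device(f"cuda:{rank}")
    else:
        device = torch.device("cpu")
    model = model.to(device)

    if nranks > 1:
        # Distributed path: requires torch.distributed.init_process_group to have run.
        assert dist.is_initialized(), "call init_process_group before wrapping in DDP"
        model = DDP(model, device_ids=[rank] if device.type == "cuda" else None)
    # Single-rank (CPU or one GPU) path: return the bare model, no process group needed.
    return model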
