Indexing: handle cpu & single-gpu without using multiprocessing & dist. data parallel #290
Conversation
I think after the latest batch of fixes it looks good now! Thank you so much for doing this!
colbert/infra/config/settings.py
(Outdated)
@@ -30,6 +30,8 @@ class RunSettings:
     total_visible_gpus = torch.cuda.device_count()
     gpus: int = DefaultVal(total_visible_gpus)

+    use_rank1_fork: bool = DefaultVal(False)
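For context, the idea behind a flag like this is roughly the following branching in the launcher: when at most one device is visible, run the indexer directly in the current process instead of spawning workers and initializing torch.distributed. This is a minimal sketch with hypothetical names (`launch`, `index_shard`), not ColBERT's actual code:

```python
import multiprocessing as mp

def index_shard(rank, nranks, collection):
    # Hypothetical per-rank worker: processes the slice of the
    # collection assigned to this rank (illustrative only).
    return [f"rank{rank}:{doc}" for doc in collection[rank::nranks]]

def launch(collection, num_gpus):
    # Sketch of the PR's branching: on CPU or a single GPU, run
    # in-process with no multiprocessing and no distributed setup.
    if num_gpus <= 1:
        return index_shard(0, 1, collection)
    # Multi-device path: one worker process per device, as before.
    with mp.Pool(num_gpus) as pool:
        return pool.starmap(
            index_shard,
            [(r, num_gpus, collection) for r in range(num_gpus)],
        )
```

The single-device branch is what avoids the always-distributed behavior reported downstream.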
Just a quick note (well, two):
- Should this not be the opposite logic, i.e. default to True (with the associated logic changes in the code)? I feel like that would make more sense with the naming.
- Do we want this to be a config item rather than a flag passed to the function? I think either works fine, and maybe having it as part of the config is more fitting for how the rest of the code is structured.
I was using "fork" in the sense of a code fork, but I will revert to "fork" as in multiprocessing! Thanks for explaining the terminology!
Thanks, all looking good to me now! Just needs Omar's final approval 👍
This fixed a problem I was having downstream in RAGatouille, which always ran in distributed mode on WSL2 with a single GPU. Thank you for the PR! bclavie/RAGatouille#30 (comment)
Edit: it looks like the trainer still always forces distributed torch; the collection indexer change fixed indexing, though. Definitely the right direction.
No description provided.