DDP at eval stage + DDP metrics #2518
base: develop
Conversation
Supporting eval/validation with DDP is an important step, thanks for tackling it @Adel-Moumen. Tests for DDP and making metrics work across processes are crucial. However, I think there may be some unnecessary complexity here. For example, I'm not convinced the `DistributedState` singleton adds anything on top of what `torch.distributed` already does.
Some options I can think of to address the complexity:
- Add `accelerate` as a dependency and use their version of these functions, so we don't have to maintain it.
- Add `torchmetrics` as a dependency (https://github.com/Lightning-AI/torchmetrics); they already have most of the metrics with distributed support, so again we don't have to maintain it.
- Force a relatively simple and consistent format for the metrics -- then we don't need much more than `torch.distributed` to sync them (see the sketch after this list).
I know we're trying to avoid adding dependencies, but these ones are relatively small and well worth it.
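A minimal sketch of what that third option could look like, assuming every metric can be reduced to a summable numerator/denominator pair; `sync_ratio_metric` and its arguments are hypothetical names for illustration, not part of this PR:

```python
import torch
import torch.distributed as dist

def sync_ratio_metric(local_errors: float, local_total: float, device) -> float:
    """All-reduce a numerator/denominator pair and return the global ratio."""
    stats = torch.tensor([local_errors, local_total],
                         dtype=torch.float64, device=device)
    if dist.is_available() and dist.is_initialized():
        # Sum the local counts across all ranks; every rank gets the totals.
        dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    errors, total = stats.tolist()
    return errors / max(total, 1.0)
```

For instance, a recipe could compute `wer = 100 * sync_ratio_metric(num_edits, num_ref_tokens, device)` at the end of the eval loop, with no extra machinery beyond `torch.distributed`.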
Hi @pplantinga, indeed, it does not in its current state. As I said in my replies, this PoC aimed at showing one way of handling DDP information in the future. Depending on your backend, you may need to follow different steps to achieve the same thing, and having a convenient dataclass would have provided a single interface with all the backend-specific logic under the hood. Following your comments, I decided to remove this class, as it can be fully replaced by `torch.distributed`.
Regarding the complexity, I have a hard time seeing where it comes from. I am slightly biased since I am the one who spent the most time understanding how to synchronize metrics, but the only real complexity comes from the different backends we can encounter in SpeechBrain (e.g. gloo vs. nccl), each of which needs different utility functions. I don't think you can simplify that away unless we drop gloo support :) (which I strongly oppose; see the sketch below for what the backend split looks like).
Again, I am biased since I did the implementation, but I really don't think we need a third-party package to handle the metrics. As I demonstrated, synchronizing can be summarized in fewer than 250 lines of code (excluding docstrings/tests). If you install accelerate just for the metrics, you get thousands of lines of code that are mostly useless to us because they also support TPUs, Megatron, FSDP, etc. I like the idea of a SpeechBrain Trainer that does not rely on many toolkits and that stays easily hackable.
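To illustrate the backend split, here is a hedged sketch (not the PR's actual code) of a generic object gather: nccl can only move CUDA tensors, while gloo stages through the CPU, so a "gather anything" helper has to be backend-aware:

```python
import torch
import torch.distributed as dist

def gather_objects(obj):
    """Gather arbitrary picklable objects from all ranks onto every rank."""
    if not (dist.is_available() and dist.is_initialized()):
        return [obj]  # single-process fallback
    world_size = dist.get_world_size()
    out = [None] * world_size
    # all_gather_object pickles objects into tensors internally; with the
    # nccl backend those staging tensors must live on this rank's CUDA
    # device, so the current device has to be set correctly first.
    if dist.get_backend() == "nccl":
        torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    dist.all_gather_object(out, obj)
    return out
```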
Please jump into the discussion (one more time please, it's an important topic :p) @TParcollet @asumagic :) (note: I made some changes based on your comments [thanks again @TParcollet @asumagic @pplantinga!]; feel free to get back to me if you think I should now move on to more tests / reporting more results, etc.) :)
@pplantinga I do not like the idea of adding accelerate and torch-metrics as dependencies. I do not trust the stability of either, and I hate depending on other things for something easy, like here. However, I agree that the singleton class seems to be too much IMHO. I also agree on the homogeneity of the metrics, but maybe that could be another PR?
The small variations in CER and WER might be related to the fact that when the dataset is not divisible by the number of processes, a few samples are duplicated (see the sketch below): pytorch/pytorch#25162
Indeed! Good point.
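A small standalone illustration of the duplication described in the linked issue, using only stock PyTorch:

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

# 10 samples split across 4 ranks: DistributedSampler pads to 12 by
# repeating the first indices, so two samples are scored twice.
dataset = TensorDataset(torch.arange(10))
for rank in range(4):
    sampler = DistributedSampler(dataset, num_replicas=4, rank=rank,
                                 shuffle=False)
    print(rank, list(sampler))
# -> rank 2 sees index 0 again and rank 3 sees index 1 again.
# Passing drop_last=True avoids duplication by dropping the tail instead,
# at the cost of skipping a few samples.
```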
What does this PR do?
Goal: Add support for distributed metrics inside SpeechBrain.
Why: We currently run training on a distributed system, which significantly reduces SpeechBrain's training time. However, due to the lack of support for distributed metrics, we perform the evaluation stages (validation & testing) on a single GPU, even when many more are available. This makes SpeechBrain uncompetitive compared to other toolkits because evaluation inference time is notably slow. Utilizing all available GPUs during testing could divide the testing time by roughly N (minus a small constant for communication overhead), where N is the number of GPUs. For example, training a Conformer transducer model typically requires 4 GPUs, while we use only 1 GPU during the evaluation stages, which can prolong the process to hours; leveraging all 4 GPUs offers a considerable speedup. This pull request (PR) is also part of our efforts to make SpeechBrain an attractive toolkit for "large-scale" training, which will need to run evaluation on tons of data.
How: To achieve this, I introduced various functions to gather tensors/objects across processes. Please refer to the file `distributed_metrics.py` for a better understanding of the available functions (a rough sketch of the resulting eval flow is shown below).
Issues: Since this PR can significantly enhance the inference speed of SpeechBrain, it's crucial to introduce it with caution. I've written extensive unit tests covering 100% of the metrics in SpeechBrain, all of which are functional in Distributed Data Parallel (DDP) mode. Additionally, I manually executed some Automatic Speech Recognition (ASR) recipes and verified the accuracy of the results. Based on my tests, we observe consistent results, indicating that this PR should not compromise our outcomes. However, I've only manually covered ASR tests, and I believe it's prudent to dedicate some time to investigating Text-to-Speech (TTS) models as well.
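For intuition, here is a rough sketch of the eval flow this enables; `evaluate_shard`, `model.transcribe`, `batch["ref_words"]`, and `scorer` are hypothetical stand-ins for recipe-specific pieces, not the actual API in `distributed_metrics.py`:

```python
import torch.distributed as dist

def evaluate_shard(model, dataloader):
    """Decode this rank's shard of the test set (hypothetical loop)."""
    hyps, refs = [], []
    for batch in dataloader:
        hyps.extend(model.transcribe(batch))   # assumed recipe-specific method
        refs.extend(batch["ref_words"])
    return hyps, refs

def distributed_score(model, dataloader, scorer):
    hyps, refs = evaluate_shard(model, dataloader)
    if dist.is_available() and dist.is_initialized():
        # Gather every rank's hypotheses/references, then flatten so the
        # corpus-level metric is computed once over the full test set.
        gathered_h = [None] * dist.get_world_size()
        gathered_r = [None] * dist.get_world_size()
        dist.all_gather_object(gathered_h, hyps)
        dist.all_gather_object(gathered_r, refs)
        hyps = [h for shard in gathered_h for h in shard]
        refs = [r for shard in gathered_r for r in shard]
    return scorer(hyps, refs)  # e.g. a WER function over the full corpus
```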
Results
Note: we sometimes see very small variations, e.g. 1.91 CER vs. 1.92. I believe this is due to how CUDA works: with multiple GPUs, the same operation may be executed by different CUDA kernels, which are not bit-identical.
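As an aside, one way to test the CUDA-kernel hypothesis would be PyTorch's deterministic mode (at some speed cost); this is just a diagnostic sketch, not part of the PR:

```python
import os
import torch

# Must be set before the first CUDA context is created, as required for
# deterministic cuBLAS behavior.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.use_deterministic_algorithms(True)  # error out on nondeterministic ops
torch.backends.cudnn.benchmark = False    # disable autotuned kernel selection
```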
DDP No Sync Metrics
w2v2 + CTC + librispeech + 2 GPUs (no syncs) + test BS = 1
Whisper Transformer + librispeech + 2 GPUs (no sync) + test BS = 1
DDP Sync Metrics
w2v2 + CTC + librispeech + 2 GPUs + test BS = 1
Whisper Transformer + librispeech + 2 GPUs (sync) + test BS = 1
To Do