DynamicBatchSampler #1670
-
cc @anautsch maybe as well
-
Hi @bofenghuang yeah, that warping idea is on me. It resolves time resolution in a latent statistical space rather than requiring it to be defined explicitly for every audio collection one happens to work with. I write the following mostly to make sure we are on the same page - from your message I assume you know it intuitively already; it is just so others can follow when reading this once it is archived. Please take a look at our tutorial for context. The prequel to your question:
Now, how to get these buckets with their fancy properties? As for your question: here, the goal was to get an initial handle on:
For this latent space, your question is about:
The answer might be depressingly simple: to get the PR & tutorial out for later discussions like this one. My gut feeling is that a distribution fit should play out better than some arbitrarily assumed distribution. Would you be up to dive into tests on this topic? It would/could also make sense to then move away from the log-normal assumption and use a general fit - in the end, what matters here is that quantiles become linear in their handling through warping distributions. Another issue might also rest in: for the rest of the dynamic batching, why not simply have three or five bucket types and sort the rest in from long to short audios?
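To make the contrast concrete, here is a minimal sketch (not the actual SpeechBrain code; function names and the scaling step are illustrative assumptions) of fixed lognorm(s=1) quantile boundaries versus boundaries from a log-normal fitted on the observed durations:

```python
# Minimal sketch, not the actual SpeechBrain implementation: contrast a fixed
# lognorm(s=1) assumption with a lognorm fitted on the observed durations.
import numpy as np
from scipy import stats

def boundaries_fixed_lognorm(num_buckets, max_batch_length):
    # interior quantiles of a standard log-normal, rescaled linearly
    quantiles = np.linspace(0, 1, num_buckets + 1)[1:-1]
    latent = stats.lognorm.ppf(quantiles, s=1)
    return latent / latent.max() * max_batch_length

def boundaries_fitted_lognorm(durations, num_buckets, max_batch_length):
    # alternative discussed here: fit shape/scale on the dataset first
    s, loc, scale = stats.lognorm.fit(durations, floc=0)
    quantiles = np.linspace(0, 1, num_buckets + 1)[1:-1]
    latent = stats.lognorm.ppf(quantiles, s, loc=loc, scale=scale)
    return latent / latent.max() * max_batch_length
```

The point of the warping is only that quantiles become linear to handle; whether the latent distribution is assumed or fitted is exactly the open question here.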
-
Hi @anautsch, thank you very much for your explanations! And yeah, my question was exactly why the arbitrary lognorm is used rather than one with parameters fitted on the dataset. When checking the bucketing log in
Based on this super tutorial, I have added some other bucketing methods and ended up with the following results on MiniLibriSpeech:
I have also measured the time for creating the buckets to make sure we don't add too much time overhead (see
Same config used as in the tutorial:

```python
batch_size = 32
max_batch_len = 17 * 32
num_buckets_list = [5, 10, 15, 20, 60]
```
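Since the k-means variant itself is not shown in this thread, here is a rough sketch of how duration-based k-means bucketing could look (assuming scikit-learn; names are illustrative, not the ones in the linked code):

```python
# Hypothetical sketch of k-means bucketing on utterance durations
# (not the code linked below); bucket boundaries are taken as the
# longest duration inside each cluster.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_boundaries(durations, num_buckets, seed=0):
    durations = np.asarray(durations, dtype=float)
    labels = KMeans(n_clusters=num_buckets, n_init=10,
                    random_state=seed).fit_predict(durations.reshape(-1, 1))
    return sorted(durations[labels == k].max() for k in range(num_buckets))
```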
The code is here. I can make a PR if this approach is approved, and I look forward to further discussions :) Cheers
-
@bofenghuang neat! It's encouraging to see your enthusiasm :D @popcornell worked a lot on the final tutorial! As you demonstrated, num_buckets helps DynamicBatching to do something, but it is far from what one would think should happen (10% of the created buckets are actually used; the rest remain unfilled). Therefore, I'd be curious whether it makes sense to have any distributional assumption at all - or to treat this entirely on the categorical level (have 4, 5, 6, ... bucket types to be filled, whatever they are - the limit is VRAM). What I try to say:
That's also what your findings on kmeans support: fitting distributions can help, but in the end we get a dataset and that's our entire population to take care of during operation. (New dataset, new task - everything back to the start.) Yet, what does the batch creation of DynamicBatching imply for the overall training? How do we test that for one and many datasets - how do we get the guarantees that we need? If the number of total batches created is small and padding is small - how much random permutation of the files in these batches is then possible? Or would there be only one solution for how to draw a batch after kmeans? What's your take: is there benefit in offering many choices here for which DynamicBatching approach to use? Are there "relevant" ones?
DynamicBatching has a few use cases which need better testing to check whether they are fulfilled:
@TParcollet @popcornell please add to that list what I forgot - there might be more requirements on DynamicBatching ^^" The dimensions along which a better DynamicBatching needs to be investigated also include:
What we observe on small datasets might not hold on large datasets. How about discussing a strategy next week for testing and developing DynamicBatching further?
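One concrete way to compare sampler variants across small and large datasets could be to measure batch count and padding overhead per epoch. A rough sketch, with assumed names: `sampler` yields lists of example indices and `durations[i]` holds the length of example i.

```python
# Rough testing sketch: count batches and estimate the padding overhead of a
# sampler over one epoch. `sampler` and `durations` are assumed inputs.
def batch_stats(sampler, durations):
    num_batches, padded_total, audio_total = 0, 0.0, 0.0
    for batch in sampler:
        longest = max(durations[i] for i in batch)
        padded_total += longest * len(batch)              # size after padding
        audio_total += sum(durations[i] for i in batch)   # actual audio
        num_batches += 1
    padding_ratio = 1.0 - audio_total / padded_total
    return num_batches, padding_ratio
```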
-
@anautsch nice! Looking forward to our discussion :)
-
Hi @anautsch, as discussed last week, I came back with the tests on more datasets:
I used
Here are the results.

MiniLibriSpeech
LibriSpeech
CommonVoice IT
CommonVoice FR
Cheers
-
Thank you @bofenghuang; quite some basis to find the next step! As per Slack: Random Sampling & After sorting are without the dynamic batching warping. I'll rewrite your results - let's see if this formatting helps. I'll add: %init time = sampler initialization time / total time
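A minimal sketch (hypothetical names) of how the two timings could be taken separately - sampler construction on one side, iterating all batches of one epoch on the other:

```python
# Illustrative sketch of the two separate timings: init time (bucket/batch
# composition at construction) vs. total time (iterating one full epoch).
import time

def measure_sampler(sampler_cls, dataset, **kwargs):
    t0 = time.perf_counter()
    sampler = sampler_cls(dataset, **kwargs)
    init_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    num_batches = sum(1 for _ in sampler)
    total_time = time.perf_counter() - t0
    # last value corresponds to the %init time metric above
    return num_batches, init_time, total_time, init_time / total_time
```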
I think about them as: the init time is extra to the total time. After looking at the script, they are two separate timings. The total time refers to all batches within an epoch. To put this number in relation, one might think of distributed data parallel (DDP) & shards, where batches are composed once (init time effort), and then each batch is randomly sampled among the available batches (sets of batches can be delegated as shards to specific multi-GPU nodes). The results below are for 1 single epoch.

Rnd = Random Sampling

MiniLibriSpeech
Log-normal and beta appear alike; log-normal will scale better with more epochs. K-means has a large init penalty but promises reduced padding (yet, 70+ batches are a lot compared with some 40 batches). Distribution fitting helps a lot compared to the original dynamic batching; but: could that one be better on its own?

LibriSpeech
Beta suggests slight gains over log-normal, and k-means sits in the same boat. There is no clear advantage of one distribution fit over the other; it would be a nitty-gritty argument.

CommonVoice IT
K-means > log-norm > beta, especially for DDP & shards. None of the algorithms tackled the padding issue to a satisfying degree!

CommonVoice FR
The init penalty behaviour is odd. None of the distribution fits is really good at getting rid of padding. While the run-time is somewhat similar, k-means has the fewest batches. It looks like all of them have their justification, and it should be up to the user to decide which to take. For this, a tool to assist batch composition could be helpful! Yet, as for a standard configuration, I do not know. As we cannot control the users & their datasets, the best option might be the beta one. @bofenghuang what's your take? @TParcollet @mravanelli any ideas on the impact of the number of batches on the training outcome? (does it matter or not - and if somewhat, how?)
-
@anautsch thanks for the rewriting. I didn't realize the dynamic sampler with the beta distribution has the lowest initialization time (except on LibriSpeech). However, when fitting the beta distribution on datasets, I keep getting this warning. I don't know if this stops you from using it as the default option.
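For context, the kind of beta fit being discussed might look like the sketch below (assuming scipy; synthetic durations stand in for a real dataset, and the warning text itself is not reproduced here):

```python
# Sketch of fitting a beta distribution to utterance durations (scipy assumed).
# Durations are rescaled to lie strictly inside (0, 1), since beta has bounded
# support; synthetic data is used here only as a placeholder.
import numpy as np
from scipy import stats

durations = np.random.default_rng(0).lognormal(mean=1.5, sigma=0.7, size=1000)
scaled = durations / (durations.max() + 1e-3)
a, b, loc, scale = stats.beta.fit(scaled, floc=0, fscale=1)
boundaries = stats.beta.ppf(np.linspace(0, 1, 11)[1:-1], a, b)  # 10 buckets
```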
For me they all have similar behavior in terms of number of batches and ratio of padding, better than the original fixed one. DynamicBatchSampler with k-means is my personal favorite because it always has the best batching results, and the one-off bucketing time overhead at the beginning is insignificant compared to hours or even days of training (the "total time" reported is just one iteration of the dataloader without real training, so basically I/O). But maybe it shouldn't be the default option, so as not to add another dependency, especially when it doesn't have an obvious gain compared to lognorm and beta. I put all the code here.
-
Dependency is a valid remark. Gosh, I needed to re-read this line a bit
"total_time" it's the total time necessary to add up the number padding in ratio to total audio per batch for 1 epoch! Then, the actual criterion is the number of batches - everything else just puts some nuances to the discussion. Got it, sorry I was slow here... Just to summarise per dataset, the approach with fewest num_batches (this indicates only fastest epochs, it does not guarantee it):
Interesting that fewest padding is not always fewest batches! Yeah, k-means is all over the place. Would you be up for creating a PR with the following?
I sense I might be missing something - what should the next iteration of dynamic batching include, from your view?
-
Yeah it was the same "total time" used in the tutorial.
I think this is related to the datasets, the number of buckets, and how the batches are generated from the buckets. In the current implementation, a batch is generated when its size reaches the predefined limit (see speechbrain/dataio/sampler.py, lines 594 to 599 and lines 487 to 489, at a8d5532). In my version, I defined
And yeah! I would love to make a PR. But I'm not sure if we should make
Now we come back to the question we have discussed. How could we quantify the randomness? How will this approach impact the training performance?
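Not the referenced sampler.py code, just a schematic of the bucket-filling logic described above (names and the exact emission condition are assumptions):

```python
# Schematic of the described logic, not the actual SpeechBrain sampler:
# examples fall into the bucket whose boundary covers their duration, and a
# batch is emitted once the bucket's accumulated length would exceed the budget.
def fill_buckets(indices, durations, boundaries, max_batch_length):
    open_batch = [[] for _ in boundaries]
    open_len = [0.0] * len(boundaries)
    batches = []
    for idx in indices:
        b = next((k for k, ub in enumerate(boundaries) if durations[idx] <= ub),
                 len(boundaries) - 1)
        if open_batch[b] and open_len[b] + durations[idx] > max_batch_length:
            batches.append(open_batch[b])          # emit the full batch
            open_batch[b], open_len[b] = [], 0.0
        open_batch[b].append(idx)
        open_len[b] += durations[idx]
    batches.extend(bk for bk in open_batch if bk)  # flush partial buckets
    return batches
```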
-
@Adel-Moumen I feel like this is a very, very important discussion that could lead to a major improvement of dynamic batching, training-speed wise. But we need someone interested in doing this. @bofenghuang do you feel like you could turn this into a new PR and explain what needs to be discussed?
-
Hi,
I'm trying to understand the implementation of `DynamicBatchSampler`. I would like to know, in `_get_boundaries_through_warping`, why would you use the lognorm of `s=1` to get the quantiles and linearly scale up to `max_batch_length`. What about using `lognorm.fit`?
cc: @popcornell
Thanks in advance!