DynamicBatchSampler #1670
-
cc @anautsch maybe as well
-
Hi @bofenghuang yeah, that warping idea is on me. It resolves time resolution in a latent statistical space rather than requiring it to be defined explicitly for every audio collection one happens to work with. I write the following mostly to make sure we are on the same page - from your message I assume you know it intuitively already; it is just so others can follow when reading this once it is archived. Please take a look at our tutorial for context. The prequel to your question:
Now, how to get these buckets with their fancy properties? As for your question: here, the goal was to get an initial handle on:
For this latent space, your question is about:
The answer might be depressingly simple: to get the PR & tutorial out for later discussions like this one. My gut feeling is that a distribution fit should play out better than some arbitrarily assumed distribution. Would you be up to dive into tests on this topic? It would/could also make sense to then move away from the log-normal assumption and use a general fit - in the end, what matters here is that quantiles become linear in their handling through warping distributions. Another issue might also rest in: for the rest of the dynamic batching, why not simply have three or five bucket types and sort the rest in from long to short audios?
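To make the contrast concrete, here is a minimal sketch (not the actual SpeechBrain code; function names and the scaling step are illustrative assumptions) of fixed lognorm(s=1) quantile boundaries versus boundaries from a log-normal fitted on the observed durations:

```python
# Minimal sketch, not the actual SpeechBrain implementation: contrast a fixed
# lognorm(s=1) assumption with a lognorm fitted on the observed durations.
import numpy as np
from scipy import stats

def boundaries_fixed_lognorm(num_buckets, max_batch_length):
    # interior quantiles of a standard log-normal, rescaled linearly
    quantiles = np.linspace(0, 1, num_buckets + 1)[1:-1]
    latent = stats.lognorm.ppf(quantiles, s=1)
    return latent / latent.max() * max_batch_length

def boundaries_fitted_lognorm(durations, num_buckets, max_batch_length):
    # alternative discussed here: fit shape/scale on the dataset first
    s, loc, scale = stats.lognorm.fit(durations, floc=0)
    quantiles = np.linspace(0, 1, num_buckets + 1)[1:-1]
    latent = stats.lognorm.ppf(quantiles, s, loc=loc, scale=scale)
    return latent / latent.max() * max_batch_length
```

The point of the warping is only that quantiles become linear to handle; whether the latent distribution is assumed or fitted is exactly the open question here.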
-
Hi @anautsch, thank you very much for your explanations! And yeah, my question was exactly why the arbitrary lognorm is used rather than one with parameters fitted on the dataset. When checking the bucketing log in
Based on this super tutorial, I have added some other bucketing methods and ended up with the following results on MiniLibriSpeech:
I have also measured the time for creating the buckets to make sure we don't add too much time overhead (see
Same config used as in the tutorial:

```python
batch_size = 32
max_batch_len = 17 * 32
num_buckets_list = [5, 10, 15, 20, 60]
```
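Since the k-means variant itself is not shown in this thread, here is a rough sketch of how duration-based k-means bucketing could look (assuming scikit-learn; names are illustrative, not the ones in the linked code):

```python
# Hypothetical sketch of k-means bucketing on utterance durations
# (not the code linked below); bucket boundaries are taken as the
# longest duration inside each cluster.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_boundaries(durations, num_buckets, seed=0):
    durations = np.asarray(durations, dtype=float)
    labels = KMeans(n_clusters=num_buckets, n_init=10,
                    random_state=seed).fit_predict(durations.reshape(-1, 1))
    return sorted(durations[labels == k].max() for k in range(num_buckets))
```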
The code is here. I can make a PR if this approach is approved, and I look forward to further discussions :) Cheers
-
@bofenghuang neat! It's encouraging to see your enthusiasm :D @popcornell worked a lot on the final tutorial! As you demonstrated, num_buckets helps DynamicBatching to do something, but it is far from what one would think should happen (10% of the created buckets are actually used; the rest remain unfilled). Therefore, I'd be curious whether it makes sense to have any distributional assumption at all - or to treat this entirely on the categorical level (have 4, 5, 6, ... bucket types to be filled, whatever they are - the limit is VRAM). What I try to say:
That's also what your findings on kmeans support: fitting distributions can help, but in the end we get a dataset and that's our entire population to take care of during operation. (New dataset, new task - everything back to the start.) Yet, what does the batch creation of DynamicBatching imply for the overall training? How do we test that for one and many datasets - how do we get the guarantees that we need? If the number of total batches created is small and padding is small - how much random permutation of the files in these batches is then possible? Or would there be only one solution for how to draw a batch after kmeans? What's your take: is there benefit in offering many choices here for which DynamicBatching approach to use? Are there "relevant" ones?
DynamicBatching has a few use cases which need better testing to check whether they are fulfilled:
@TParcollet @popcornell please add to that list what I forgot - there might be more requirements on DynamicBatching ^^" The dimensions along which a better DynamicBatching needs to be investigated also include:
What we observe on small datasets might not hold on large datasets. How about discussing a strategy next week for testing and developing DynamicBatching further?
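One concrete way to compare sampler variants across small and large datasets could be to measure batch count and padding overhead per epoch. A rough sketch, with assumed names: `sampler` yields lists of example indices and `durations[i]` holds the length of example i.

```python
# Rough testing sketch: count batches and estimate the padding overhead of a
# sampler over one epoch. `sampler` and `durations` are assumed inputs.
def batch_stats(sampler, durations):
    num_batches, padded_total, audio_total = 0, 0.0, 0.0
    for batch in sampler:
        longest = max(durations[i] for i in batch)
        padded_total += longest * len(batch)              # size after padding
        audio_total += sum(durations[i] for i in batch)   # actual audio
        num_batches += 1
    padding_ratio = 1.0 - audio_total / padded_total
    return num_batches, padding_ratio
```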
-
@anautsch nice! Looking forward to our discussion :)
-
Hi @anautsch, as discussed last week, I came back with the tests on more datasets:
I used
Here are the results.

MiniLibriSpeech
LibriSpeech
CommonVoice IT
CommonVoice FR
Cheers
-
Thank you @bofenghuang; quite some basis to find the next step! As per Slack: Random Sampling & After sorting are without the dynamic batching warping. I'll rewrite your results - let's see if this formatting helps. I'll add: %init time = sampler initialization time / total time
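A minimal sketch (hypothetical names) of how the two timings could be taken separately - sampler construction on one side, iterating all batches of one epoch on the other:

```python
# Illustrative sketch of the two separate timings: init time (bucket/batch
# composition at construction) vs. total time (iterating one full epoch).
import time

def measure_sampler(sampler_cls, dataset, **kwargs):
    t0 = time.perf_counter()
    sampler = sampler_cls(dataset, **kwargs)
    init_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    num_batches = sum(1 for _ in sampler)
    total_time = time.perf_counter() - t0
    # last value corresponds to the %init time metric above
    return num_batches, init_time, total_time, init_time / total_time
```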
I think about them as: the init time is extra to the total time. After looking at the script, they are two separate timings. The total time refers to all batches within an epoch. To put this number in relation, one might think of distributed data parallel (DDP) & shards, where batches are composed once (init time effort), and then each batch is randomly sampled among the available batches (sets of batches can be delegated as shards to specific multi-GPU nodes). The results below are for 1 single epoch.

Rnd = Random Sampling

MiniLibriSpeech
Log-normal and beta appear alike; log-normal will scale better with more epochs. K-means has a large init penalty but promises reduced padding (yet, 70+ batches are a lot compared with some 40 batches). Distribution fitting helps a lot compared to the original dynamic batching; but: could that one be better on its own?

LibriSpeech
Beta suggests slight gains over log-normal, and k-means sits in the same boat. There is no clear advantage of one distribution fit over the other; it would be a nitty-gritty argument.

CommonVoice IT
K-means > log-norm > beta, especially for DDP & shards. None of the algorithms tackled the padding issue to a satisfying degree!

CommonVoice FR
The init penalty behaviour is odd. None of the distribution fits is really good at getting rid of padding. While the run-time is somewhat similar, k-means has the fewest batches. It looks like all of them have their justification, and it should be up to the user to decide which to take. For this, a tool to assist batch composition could be helpful! Yet, as for a standard configuration, I do not know. As we cannot control the users & their datasets, the best option might be the beta one. @bofenghuang what's your take? @TParcollet @mravanelli any ideas on the impact of the number of batches on the training outcome? (does it matter or not - and if somewhat, how?)
-
@anautsch thanks for the rewriting. I didn't realize the dynamic sampler with the beta distribution has the lowest initialization time (except on LibriSpeech). However, when fitting the beta distribution on datasets, I keep getting this warning. I don't know if this stops you from using it as the default option.
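For context, the kind of beta fit being discussed might look like the sketch below (assuming scipy; synthetic durations stand in for a real dataset, and the warning text itself is not reproduced here):

```python
# Sketch of fitting a beta distribution to utterance durations (scipy assumed).
# Durations are rescaled to lie strictly inside (0, 1), since beta has bounded
# support; synthetic data is used here only as a placeholder.
import numpy as np
from scipy import stats

durations = np.random.default_rng(0).lognormal(mean=1.5, sigma=0.7, size=1000)
scaled = durations / (durations.max() + 1e-3)
a, b, loc, scale = stats.beta.fit(scaled, floc=0, fscale=1)
boundaries = stats.beta.ppf(np.linspace(0, 1, 11)[1:-1], a, b)  # 10 buckets
```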
For me they all have similar behavior in terms of number of batches and ratio of padding, better than the original fixed one. DynamicBatchSampler with k-means is my personal favorite because it always has the best batching results, and the one-off bucketing time overhead at the beginning is insignificant compared to hours or even days of training (the "total time" reported is just one iteration of the dataloader without real training, so basically I/O). But maybe it shouldn't be the default option, so as not to add another dependency, especially when it doesn't have an obvious gain compared to lognorm and beta. I put all the code here.
-
Dependency is a valid remark. Gosh, I needed to re-read this line a bit
"total_time" it's the total time necessary to add up the number padding in ratio to total audio per batch for 1 epoch! Then, the actual criterion is the number of batches - everything else just puts some nuances to the discussion. Got it, sorry I was slow here... Just to summarise per dataset, the approach with fewest num_batches (this indicates only fastest epochs, it does not guarantee it):
Interesting that fewest padding is not always fewest batches! Yeah, k-means is all over the place. Would you be up for creating a PR with the following?
I sense I might be missing something - what should the next iteration of dynamic batching include, from your view?
-
Yeah it was the same "total time" used in the tutorial.
I think this is related to the datasets, the number of buckets, and how the batches are generated from the buckets. In the current implementation, a batch is generated when its size reaches the predefined limit (see speechbrain/dataio/sampler.py, lines 594 to 599 and lines 487 to 489, at a8d5532). In my version, I defined
And yeah! I would love to make a PR. But I'm not sure if we should make
Now we come back to the question we have discussed. How could we quantify the randomness? How will this approach impact the training performance?
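Not the referenced sampler.py code, just a schematic of the bucket-filling logic described above (names and the exact emission condition are assumptions):

```python
# Schematic of the described logic, not the actual SpeechBrain sampler:
# examples fall into the bucket whose boundary covers their duration, and a
# batch is emitted once the bucket's accumulated length would exceed the budget.
def fill_buckets(indices, durations, boundaries, max_batch_length):
    open_batch = [[] for _ in boundaries]
    open_len = [0.0] * len(boundaries)
    batches = []
    for idx in indices:
        b = next((k for k, ub in enumerate(boundaries) if durations[idx] <= ub),
                 len(boundaries) - 1)
        if open_batch[b] and open_len[b] + durations[idx] > max_batch_length:
            batches.append(open_batch[b])          # emit the full batch
            open_batch[b], open_len[b] = [], 0.0
        open_batch[b].append(idx)
        open_len[b] += durations[idx]
    batches.extend(bk for bk in open_batch if bk)  # flush partial buckets
    return batches
```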
-
@Adel-Moumen I feel like this is a very, very important discussion that could lead to a major improvement of dynamic batching, training-speed wise. But we need someone interested in doing this. @bofenghuang do you feel like you could turn this into a new PR and explain what needs to be discussed?
-
Hi,
I'm trying to understand the implementation of `DynamicBatchSampler`. I would like to know, in `_get_boundaries_through_warping`, why would you use the lognorm of `s=1` to get the quantiles and linearly scale up to `max_batch_length`. What about using `lognorm.fit`?
cc: @popcornell
Thanks in advance!