
Increase beam search speed and reduce memory usage #1957

Closed
yuyan2do opened this issue Apr 2, 2020 · 4 comments
Comments


yuyan2do commented Apr 2, 2020

🚀 Feature Request

Increase generation speed and reduce memory usage by making the model aware of beam search. This is a follow-up to the speed improvements in #1851.

Motivation

In beam search, the encoder states are duplicated N times and kept in GPU memory during incremental decoding. Avoiding this duplication reduces the encoder-state memory usage to 1/N and allows the batch size to be increased by up to Nx for much better speed. (N is the beam size.)

Measured with the BART model on the CNN/DailyMail summarization dataset, this change reduces memory usage to 1/4 and enables a 4x batch size, which gives a 2x speedup. See the table below for details.

Pitch

I created PR #1958 with the changes below.

  1. [Major change] In encoder-decoder attention, reduce the saved states from [batch_size*beam_size] to [batch_size], and change the attention logic so it is compatible with both the original and the reduced states (see the sketch after this list).
  2. Do not duplicate the encoder state; this contributes ~10% of the speedup.
  3. In the beam search code, only generate the necessary n-grams; this contributes ~20% of the speedup.
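
For illustration, here is a minimal sketch of the idea behind change 1 (this is not the PR's actual code; the function name and tensor shapes are assumptions): the encoder keys/values stay at [batch_size, ...] and are broadcast across beams, while only the decoder queries carry the beam dimension.

```python
import torch

def beamable_cross_attention(q, k, v, beam_size):
    """Illustrative only.
    q:    [batch * beam, num_heads, tgt_len, head_dim]  (decoder queries)
    k, v: [batch, num_heads, src_len, head_dim]          (shared encoder states)
    """
    bsz_x_beam, num_heads, tgt_len, head_dim = q.shape
    bsz = bsz_x_beam // beam_size

    # Fold the beam dimension into the query tensor so the batch dims line up.
    q = q.view(bsz, beam_size, num_heads, tgt_len, head_dim)
    k = k.unsqueeze(1)  # [batch, 1, heads, src_len, head_dim], broadcast over beams
    v = v.unsqueeze(1)

    attn = torch.softmax(
        torch.matmul(q, k.transpose(-1, -2)) / head_dim ** 0.5, dim=-1
    )                                    # [batch, beam, heads, tgt_len, src_len]
    out = torch.matmul(attn, v)          # [batch, beam, heads, tgt_len, head_dim]
    return out.view(bsz_x_beam, num_heads, tgt_len, head_dim)
```

The point is that k and v never materialize a beam dimension, which is where the 1/beam_size saving in cached encoder-decoder state comes from.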

Additional context

[Image: table of memory usage and generation speed, before vs. after the change]
(beam=4, lenpen=2.0, max_len_b=140, min_len=55, max_source_positions=1024)
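
For reference, those settings match the usual fairseq BART CNN/DailyMail generation recipe; a rough example of running them through the public hub interface (this is `bart.sample`, not the attached benchmark script, and `no_repeat_ngram_size=3` is the usual CNN/DailyMail setting added here as an assumption) is:

```python
import torch

# Load the CNN/DailyMail-finetuned BART model via torch.hub.
bart = torch.hub.load('pytorch/fairseq', 'bart.large.cnn')
bart.cuda()
bart.eval()
bart.half()

articles = ["(a long CNN/DailyMail article goes here)"]
summaries = bart.sample(
    articles,
    beam=4, lenpen=2.0, max_len_b=140, min_len=55, no_repeat_ngram_size=3,
)
print(summaries[0])
```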

To benchmark the speed, run "CUDA_VISIBLE_DEVICES=0 python generation_speed_test.py".
cnndm_128.txt
generation_speed_test.py


yuyan2do commented Apr 7, 2020

Added profiling results for one transformer layer.

Before the change (batch size 32): 2.8 ms per layer.
The first part of the GPU computation is very sparse.
[Image: profiler trace before the change]

After the change (batch size 128): 3.3 ms per layer.
The GPU computation is denser than with the small batch size.
[Image: profiler trace after the change]
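
The issue does not say which profiler produced these traces; a generic way to collect comparable per-op timings with torch.profiler (the layer dimensions below are assumptions, roughly BART-large-like) would be:

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
# One decoder layer with BART-large-like dimensions (assumption).
layer = torch.nn.TransformerDecoderLayer(d_model=1024, nhead=16).to(device).eval()
tgt = torch.randn(1, 32, 1024, device=device)        # one decoding step, batch 32
memory = torch.randn(1024, 32, 1024, device=device)  # encoder output, src_len 1024

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof, torch.no_grad():
    layer(tgt, memory)

print(prof.key_averages().table(
    sort_by="cuda_time_total" if device == "cuda" else "cpu_time_total",
    row_limit=20))
```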


yuyan2do commented Apr 9, 2020

@myleott Could you please help review this PR? It reduces the memory used for encoder-decoder attention to 1/beam_size (the original size is layer_number * [batch_size * beam_size, input_length, hidden_dim]) and enables a larger batch size for faster generation.
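
As a back-of-the-envelope illustration of that formula (the concrete layer count, hidden size, and fp16 assumption below are mine, not numbers from the PR):

```python
# Cached encoder-decoder attention states: layer_number tensors of shape
# [batch_size * beam_size, input_length, hidden_dim], for both keys and values.
layers, batch, beam, src_len, hidden = 12, 32, 4, 1024, 1024   # BART-large-like
bytes_per_elem = 2   # fp16
kv_tensors = 2       # cached keys and values per layer

original = layers * kv_tensors * (batch * beam) * src_len * hidden * bytes_per_elem
reduced  = layers * kv_tensors * batch * src_len * hidden * bytes_per_elem
print(f"original: {original / 2**30:.1f} GiB, reduced: {reduced / 2**30:.1f} GiB")
# original: 6.0 GiB, reduced: 1.5 GiB  ->  a factor of beam_size (4x) saved
```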


stale bot commented Jul 21, 2021

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale bot added the stale label Jul 21, 2021

stale bot commented Apr 18, 2022

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

stale bot closed this as completed Apr 18, 2022
facebook-github-bot pushed a commit that referenced this issue Apr 20, 2022
Summary:
Implements beamable encoder-decoder cross attention. This removes the need to duplicate the encoder states beam_size times during inference, which gives both a big memory improvement (enabling larger batch sizes on GPU) and better compute efficiency by greatly reducing the time spent in reorder_encoder_out.

This is inspired by the work in [fastseq](https://arxiv.org/abs/2106.04718), which has a more in-depth analysis.

There was an old [PR](#1958) for fairseq implementing this feature, but it was not merged and was eventually closed. This change revives and refactors that PR and also adds support for dynamically changing the beam_size when calling `hub_interface.generate()`.
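
To make the reorder_encoder_out point concrete, here is an illustrative sketch (not the actual fairseq method; shapes and the helper name are hypothetical) of why the reorder becomes cheap once encoder states are shared across beams:

```python
import torch

def reorder_shared_encoder_out(encoder_out, new_order, beam_size):
    """Illustrative only.
    encoder_out: [batch, src_len, hidden], one entry per source sentence.
    new_order:   [batch * beam] indices into the old batch*beam layout,
                 as produced during beam search.
    """
    # All beams of a sentence share one row, so only per-sentence indices matter:
    # take one index per beam group and map it back to a sentence index.
    sent_order = new_order.view(-1, beam_size)[:, 0] // beam_size
    return encoder_out.index_select(0, sent_order)
```

With duplicated states, the same step would instead index_select over [batch * beam] rows of every cached encoder tensor at every decoding step.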

## Benchmarking

**CPU Performance** (On-demand devserver)
batch size: 1 | beam size: 4
50.4s/it -> 22.3s/it | **2.25X Speedup**

batch size: 2 | beam size: 4
53.1s/it -> 25.8s/it | **2.06X Speedup**

batch size: 1 | beam size: 8
65.8s/it -> 23.8s/it | **2.76X Speedup**

**GPU Performance**

Reported in detail [here](#1957)

Currently this optimization is only enabled for our custom BART model used in the workplace summarization demo, to unblock landing this quickly.
This should be upstreamed to TransformerModel after syncing with the fairseq folks.

Reviewed By: xwhan

Differential Revision: D35722467

fbshipit-source-id: a420f73ff5b9ec0cdf40c59464b6ed1794114906
lzzk pushed a commit to lzzk/fairseq that referenced this issue Jul 24, 2022