
Increase beam search speed and reduce memory usage #1957

Closed
yuyan2do opened this issue Apr 2, 2020 · 4 comments
Comments


yuyan2do commented Apr 2, 2020

🚀 Feature Request

Increase generation speed and reduce memory usage by making the model aware of beam search. This is a follow-up to the speed improvements in #1851.

Motivation

In beam search, the encoder states are duplicated N times and kept in GPU memory during incremental decoding. Avoiding this duplication reduces the encoder-state memory usage to 1/N and allows the batch size to be increased by up to Nx for much better speed. (N is the beam size.)

Measured with the BART model on the CNN/DailyMail summarization dataset, this change reduces memory usage to 1/4 and enables a 4x batch size, which gives a 2x speedup. See the table below for details.

Pitch

I created PR #1958 with the changes below.

  1. [Major change] In encoder-decoder attention, reduce the saved states from [batch_size*beam_size] to [batch_size], and change the attention logic so it is compatible with both the original and the reduced states (see the sketch after this list).
  2. Do not duplicate the encoder state; this contributes ~10% of the speedup.
  3. In the beam search code, only generate the necessary n-grams; this contributes ~20% of the speedup.
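
For illustration, here is a minimal sketch of the idea behind change 1 (this is not the PR's actual code; the function name and tensor shapes are assumptions): the encoder keys/values stay at [batch_size, ...] and are broadcast across beams, while only the decoder queries carry the beam dimension.

```python
import torch

def beamable_cross_attention(q, k, v, beam_size):
    """Illustrative only.
    q:    [batch * beam, num_heads, tgt_len, head_dim]  (decoder queries)
    k, v: [batch, num_heads, src_len, head_dim]          (shared encoder states)
    """
    bsz_x_beam, num_heads, tgt_len, head_dim = q.shape
    bsz = bsz_x_beam // beam_size

    # Fold the beam dimension into the query tensor so the batch dims line up.
    q = q.view(bsz, beam_size, num_heads, tgt_len, head_dim)
    k = k.unsqueeze(1)  # [batch, 1, heads, src_len, head_dim], broadcast over beams
    v = v.unsqueeze(1)

    attn = torch.softmax(
        torch.matmul(q, k.transpose(-1, -2)) / head_dim ** 0.5, dim=-1
    )                                    # [batch, beam, heads, tgt_len, src_len]
    out = torch.matmul(attn, v)          # [batch, beam, heads, tgt_len, head_dim]
    return out.view(bsz_x_beam, num_heads, tgt_len, head_dim)
```

The point is that k and v never materialize a beam dimension, which is where the 1/beam_size saving in cached encoder-decoder state comes from.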

Additional context

[Image: table of memory usage and generation speed, before vs. after the change]
(beam=4, lenpen=2.0, max_len_b=140, min_len=55, max_source_positions=1024)
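
For reference, those settings match the usual fairseq BART CNN/DailyMail generation recipe; a rough example of running them through the public hub interface (this is `bart.sample`, not the attached benchmark script, and `no_repeat_ngram_size=3` is the usual CNN/DailyMail setting added here as an assumption) is:

```python
import torch

# Load the CNN/DailyMail-finetuned BART model via torch.hub.
bart = torch.hub.load('pytorch/fairseq', 'bart.large.cnn')
bart.cuda()
bart.eval()
bart.half()

articles = ["(a long CNN/DailyMail article goes here)"]
summaries = bart.sample(
    articles,
    beam=4, lenpen=2.0, max_len_b=140, min_len=55, no_repeat_ngram_size=3,
)
print(summaries[0])
```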

To benchmark the speed, run "CUDA_VISIBLE_DEVICES=0 python generation_speed_test.py".
cnndm_128.txt
generation_speed_test.py


yuyan2do commented Apr 7, 2020

Added profiling results for one transformer layer.

Before the change (batch size 32): 2.8 ms per layer.
The first part of the GPU computation is very sparse.
[Image: profiler trace before the change]

After the change (batch size 128): 3.3 ms per layer.
The GPU computation is denser than with the small batch size.
[Image: profiler trace after the change]
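
The issue does not say which profiler produced these traces; a generic way to collect comparable per-op timings with torch.profiler (the layer dimensions below are assumptions, roughly BART-large-like) would be:

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
# One decoder layer with BART-large-like dimensions (assumption).
layer = torch.nn.TransformerDecoderLayer(d_model=1024, nhead=16).to(device).eval()
tgt = torch.randn(1, 32, 1024, device=device)        # one decoding step, batch 32
memory = torch.randn(1024, 32, 1024, device=device)  # encoder output, src_len 1024

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof, torch.no_grad():
    layer(tgt, memory)

print(prof.key_averages().table(
    sort_by="cuda_time_total" if device == "cuda" else "cpu_time_total",
    row_limit=20))
```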


yuyan2do commented Apr 9, 2020

@myleott Could you please help review this PR? It reduces the memory used for encoder-decoder attention to 1/beam_size (the original size is layer_number * [batch_size * beam_size, input_length, hidden_dim]) and enables a larger batch size for faster generation.
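
As a back-of-the-envelope illustration of that formula (the concrete layer count, hidden size, and fp16 assumption below are mine, not numbers from the PR):

```python
# Cached encoder-decoder attention states: layer_number tensors of shape
# [batch_size * beam_size, input_length, hidden_dim], for both keys and values.
layers, batch, beam, src_len, hidden = 12, 32, 4, 1024, 1024   # BART-large-like
bytes_per_elem = 2   # fp16
kv_tensors = 2       # cached keys and values per layer

original = layers * kv_tensors * (batch * beam) * src_len * hidden * bytes_per_elem
reduced  = layers * kv_tensors * batch * src_len * hidden * bytes_per_elem
print(f"original: {original / 2**30:.1f} GiB, reduced: {reduced / 2**30:.1f} GiB")
# original: 6.0 GiB, reduced: 1.5 GiB  ->  a factor of beam_size (4x) saved
```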


stale bot commented Jul 21, 2021

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale bot added the stale label Jul 21, 2021

stale bot commented Apr 18, 2022

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

stale bot closed this as completed Apr 18, 2022
facebook-github-bot pushed a commit that referenced this issue Apr 20, 2022
Summary:
Implements beamable encoder-decoder cross attention. This removes the need to duplicate the encoder states beam_size times during inference, which gives both a big memory improvement (enabling larger batch sizes on GPU) and better compute efficiency by greatly reducing the time spent in reorder_encoder_out.

This is inspired by the work in [fastseq](https://arxiv.org/abs/2106.04718), which has a more in-depth analysis.

There was an old [PR](#1958) for fairseq implementing this feature, but it was not merged and was eventually closed. This change revives and refactors that PR and also adds support for dynamically changing the beam_size when calling `hub_interface.generate()`.
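
To make the reorder_encoder_out point concrete, here is an illustrative sketch (not the actual fairseq method; shapes and the helper name are hypothetical) of why the reorder becomes cheap once encoder states are shared across beams:

```python
import torch

def reorder_shared_encoder_out(encoder_out, new_order, beam_size):
    """Illustrative only.
    encoder_out: [batch, src_len, hidden], one entry per source sentence.
    new_order:   [batch * beam] indices into the old batch*beam layout,
                 as produced during beam search.
    """
    # All beams of a sentence share one row, so only per-sentence indices matter:
    # take one index per beam group and map it back to a sentence index.
    sent_order = new_order.view(-1, beam_size)[:, 0] // beam_size
    return encoder_out.index_select(0, sent_order)
```

With duplicated states, the same step would instead index_select over [batch * beam] rows of every cached encoder tensor at every decoding step.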

## Benchmarking

**CPU Performance** (On-demand devserver)
batch size: 1 | beam size: 4
50.4s/it -> 22.3s/it | **2.25X Speedup**

batch size: 2 | beam size: 4
53.1s/it -> 25.8s/it | **2.06X Speedup**

batch size: 1 | beam size: 8
65.8s/it -> 23.8s/it | **2.76X Speedup**

**GPU Performance**

Reported in detail [here](#1957)

Currently this optimization is only enabled for our custom BART model used in the workplace summarization demo, to unblock landing this quickly.
This should be upstreamed to TransformerModel after syncing with the fairseq folks.

Reviewed By: xwhan

Differential Revision: D35722467

fbshipit-source-id: a420f73ff5b9ec0cdf40c59464b6ed1794114906
lzzk pushed a commit to lzzk/fairseq that referenced this issue Jul 24, 2022