
Add Encoder-decoder model support and T5 Model support #3117

Draft
wants to merge 17 commits into main

Conversation

js8544
Contributor

@js8544 commented Feb 29, 2024

Add support for encoder-decoder models, with T5 as an example. There are two main differences between enc-dec and decoder-only models (a rough sketch follows the list):

  1. The KV cache blocks for prompt and generated tokens are separated.
  2. The generated tokens always start with <decoder_start_token_id>, and this token is added during the prompt phase.
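A minimal sketch (not code from this PR; the function and variable names are illustrative) of how an encoder-decoder request differs from a decoder-only one during the prompt phase:

```python
# Illustrative sketch only, not taken from this PR.
from typing import List, Tuple

def split_enc_dec_inputs(
    prompt_token_ids: List[int],
    decoder_start_token_id: int,
) -> Tuple[List[int], List[int]]:
    # The encoder consumes the whole prompt; its KVs live in their own blocks.
    encoder_input_ids = prompt_token_ids
    # The decoder's sequence is seeded with <decoder_start_token_id> during the
    # prompt phase (T5 uses the pad token, id 0); generated tokens are then
    # appended to it, and their KVs go into a separate set of blocks.
    decoder_input_ids = [decoder_start_token_id]
    return encoder_input_ids, decoder_input_ids
```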

T5 has a custom bias in its attention, so I also added a custom bias argument to the CUDA kernel.
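For reference, here is a rough PyTorch sketch (not the PR's CUDA kernel) of where such a bias enters attention; the kernel change exposes an equivalent optional bias added to the QK^T logits before the softmax:

```python
import torch

def attention_with_bias(q, k, v, bias=None):
    # q, k, v: [num_heads, seq_len, head_dim]; bias: [num_heads, seq_len, seq_len]
    scores = torch.matmul(q, k.transpose(-1, -2))  # T5 does not scale by sqrt(head_dim)
    if bias is not None:
        scores = scores + bias  # T5's learned relative position bias
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)
```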

I can run the t5-small model successfully, but the outputs of t5-large and larger models become NaN at some point. I am still digging into this issue. Also, t5-small is only 20% faster than transformers on my machine, so there is likely a lot of room for performance improvement.

@robertgshaw2-neuralmagic
Collaborator

Thanks @js8544, will review

@afeldman-nm
Contributor

Thanks @js8544 ! Taking a look

@seanxcwang

The issue of NaN may arise due to numerical overflow when using FP16 for computation in these models (T5-Large).

@afeldman-nm
Contributor

@js8544 please review this PR against your feature branch:

js8544#1

It adds a T5 encoder/decoder example file and finishes merging upstream main into your PR.

@esmeetu added the new model label Mar 2, 2024
@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Mar 5, 2024

@afeldman-nm left a few specific comments. My overarching reaction is as follows:

The implementation does not modify any of the input_metadata or kv_cache allocation logic; instead it puts the cross-attention KVs and self-attention KVs in the same block table. This makes the attention and prepare_decode functions much more complicated, because we have to keep the block tables around in the model and constantly pad things.

So my question is:

  • Should we continue down this path, or should we adjust and keep two separate block tables for cross- and self-attention, make KV cache allocations to each of these explicitly, and pass the two different block tables around explicitly? (Both options are sketched below.)
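A hypothetical sketch of the two options (field names are illustrative, not from the PR):

```python
from dataclasses import dataclass, field
from typing import List

# Option A (current PR): cross-attention and self-attention KVs share one block
# table, so attention code must compute offsets/padding to locate each part.
@dataclass
class SingleBlockTableSeq:
    block_table: List[int] = field(default_factory=list)  # prompt (cross) + generated (self) blocks
    num_prompt_blocks: int = 0  # needed to find where self-attention blocks begin

# Option B (proposed): keep the two KV regions in separate block tables and pass
# both around explicitly, so no padding arithmetic is needed inside attention.
@dataclass
class DualBlockTableSeq:
    cross_attn_block_table: List[int] = field(default_factory=list)
    self_attn_block_table: List[int] = field(default_factory=list)
```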

@js8544
Contributor Author

js8544 commented Mar 5, 2024

I don't mind having two separate block tables. I chose to have them in one because it makes minimal changes to existing components: in fact it only adds a couple of ifs in ModelRunner, and all other changes are limited to the enc-dec implementation. IIUC, having them separate would also require changes to Scheduler, Sequence, and InputMetadata.

Also cc @zhuohan123 for ideas and comments.

@robertgshaw2-neuralmagic
Collaborator

Note: I was wrong about this breaking decoders; that was a silly comment.

@js8544 It's a good point. Perhaps we can instead make the indexing into the block tables more transparent about what is going on, so the code is easier to follow. We will think more about it.

@js8544
Contributor Author

js8544 commented Mar 5, 2024

I agree. The current indexing scheme (by computing paddings each time) is painful and hard to read.

@ywang96 mentioned this pull request Mar 11, 2024
@js8544
Contributor Author

js8544 commented Mar 12, 2024

The issue of NaN may arise due to numerical overflow when using FP16 for computation in these models (T5-Large).

Yeah, I noticed that the original transformers implementation of T5 also suffers from this. Using BF16 or FP32 should work.
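For example, assuming the T5 support from this PR branch (the dtype argument itself is a standard vLLM engine option), forcing BF16 or FP32 avoids the overflow:

```python
from vllm import LLM, SamplingParams

# t5-large in FP16 overflows and produces NaNs; BF16 (or FP32) keeps it stable.
llm = LLM(model="t5-large", dtype="bfloat16")  # or dtype="float32"
outputs = llm.generate(
    ["translate English to German: The house is wonderful."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```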

@js8544
Contributor Author

js8544 commented Mar 12, 2024

@zhuohan123 @WoosukKwon Would you guys mind taking a look at this PR? T5 seems to be working now.

@afeldman-nm
Contributor

FYI I think this PR has some conflicts with recent changes to the main branch. I am looking at resolving them.

This PR was previously passing all of the tests, so I am hoping that once the recent conflicts are resolved we will be ready to merge. @zhuohan123 @WoosukKwon

@VarnithChordia

Hi,

Can this feature be extended to the T5-3b/11b and FLAN-T5-xl/xxl models as well? We observe some errors in both cases.

@Abineshik

FYI I think this PR has some conflicts with recent changes to the main branch. I am looking at resolving them.

This PR was previously passing all of the tests, so I am hoping that once the recent conflicts are resolved we will be ready to merge. @zhuohan123 @WoosukKwon

Hello, any update on resolving the conflicts and merging?

@afeldman-nm
Contributor

Hello @Abineshik, yes, things are moving apace. Thanks for checking in.

I determined it is probably best for encoder-decoder models to have separate block tables for self- and cross-attention KVs, as opposed to packing all KVs into a single block table, which had been the existing approach. Having two block tables makes it easier to work with the vLLM Attention wrapper when implementing encoder self-attention and decoder cross-attention, and the indexing arithmetic becomes cleaner.

I am almost finished making this change, and then I will update the PR.
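To illustrate the cleaner arithmetic (names below are hypothetical, not from the branch): with separate block tables, each attention type maps a token position to a physical KV slot by a direct lookup, without prompt-length padding:

```python
BLOCK_SIZE = 16

def slot_for_position(block_table, position, block_size=BLOCK_SIZE):
    # Map a logical token position to its physical KV cache slot using the
    # block table that owns it; no offset into a shared table is needed.
    block_number = block_table[position // block_size]
    return block_number * block_size + position % block_size

# Cross-attention KVs are written once from the encoder output (prompt positions);
# self-attention KVs grow by one slot per decode step. Each uses its own table.
cross_attn_block_table = [0, 1]   # blocks holding encoder (prompt) KVs
self_attn_block_table = [7]       # blocks holding decoder self-attention KVs
slot = slot_for_position(self_attn_block_table, position=3)
```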

@rohithkrn

@afeldman-nm how is the change you are working on going?

@afeldman-nm
Contributor

@afeldman-nm how is the change you are working on going?

Work is still ongoing, but I hope to finish soon!
