
[New Model]: LLaVA-NeXT-Video support #5124

Open
AmazDeng opened this issue May 30, 2024 · 4 comments
Labels
new model Requests to new models

Comments

@AmazDeng

The model to consider.

The llava-next-video project has already been released, and its test results are quite good. Are there any plans to support it?
https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT-Video.md
Currently, Hugging Face Transformers does not support this model.

The closest model vllm already supports.

No response

What's your difficulty of supporting the model you want?

No response

@AmazDeng added the new model label May 30, 2024
@ywang96
Collaborator

ywang96 commented Jul 5, 2024

Hi there @AmazDeng! It looks like this model is already supported on transformers. However, multi-image per prompt (which is essentially how video prompting is done) is currently not supported in vLLM, but this is definitely one of the top priorities on our roadmap!
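
While vLLM support is pending, a minimal sketch of running the model through Transformers, assuming transformers >= 4.42 and the llava-hf/LLaVA-NeXT-Video-7B-hf checkpoint (both are assumptions for illustration; the frame array below is a placeholder for real sampled video frames):

```python
import numpy as np
import torch
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"  # assumed checkpoint for illustration
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder clip: replace with frames sampled from a real video,
# shaped (num_frames, height, width, 3) as uint8.
frames = np.zeros((8, 336, 336, 3), dtype=np.uint8)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens in this video."},
            {"type": "video"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=frames, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```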

@AmazDeng
Author

AmazDeng commented Jul 5, 2024

> transformers

Yes, the latest version of Transformers now supports the llava-next-video model. However, its inference speed is very slow, so I hope you can support this model soon.
I also have another question: why does vLLM still not support passing inputs_embeds directly? If you know, could you explain the reason?
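
To make the inputs_embeds question concrete: Transformers already lets decoder-only models skip the token-embedding lookup and consume precomputed embeddings via generate(inputs_embeds=...). A minimal sketch, with an arbitrary placeholder checkpoint (nothing in this thread specifies one):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-7b-v1.5"  # placeholder checkpoint, assumption for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

input_ids = tokenizer("Describe the video.", return_tensors="pt").input_ids.to(model.device)
# Embed the tokens ourselves; a multimodal wrapper would splice projected
# image/video features into this tensor before generation.
inputs_embeds = model.get_input_embeddings()(input_ids)  # (1, seq_len, hidden_size)

output = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```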

@ywang96
Collaborator

ywang96 commented Jul 5, 2024

> Why does vLLM still not support passing inputs_embeds directly? If you know, could you explain the reason?

I do think that's something we should support (and there's indeed an issue for this #416). This will be another API change so we need to make sure everything's compatible.

At least as a first step, we do plan to support image embeddings as input (instead of PIL.Image) for vision language models. This will be part of our Q3 roadmap.
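
A hedged sketch of what "image embeddings as input" means in practice: compute the vision features once offline and hand only the resulting tensor to the engine. The CLIP tower, checkpoint, and file names below are assumptions for illustration; the exact tensor format vLLM will accept is defined by that roadmap item, not by this snippet.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

vision_id = "openai/clip-vit-large-patch14-336"  # LLaVA-style vision tower (assumed)
image_processor = CLIPImageProcessor.from_pretrained(vision_id)
vision_tower = CLIPVisionModel.from_pretrained(vision_id, torch_dtype=torch.float16).eval()

image = Image.open("frame_000.jpg")  # hypothetical input frame
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values.half()

with torch.no_grad():
    # (1, num_patches + 1, hidden_size) patch features; a multimodal projector
    # would map these into the language model's embedding space.
    image_embeds = vision_tower(pixel_values).last_hidden_state

# Persist and later pass this tensor instead of the raw PIL.Image.
torch.save(image_embeds, "frame_000_embeds.pt")
```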

@TKONIY

TKONIY commented Jul 19, 2024

> Hi there @AmazDeng! It looks like this model is already supported on transformers. However, multi-image per prompt (which is essentially how video prompting is done) is currently not supported in vLLM, but this is definitely one of the top priorities on our roadmap!

I am working on implementing LLaVA-NeXT-Video support in #6571.
