
[New Model]: LLaVA-NeXT-Video support #5124

Open
AmazDeng opened this issue May 30, 2024 · 4 comments
Labels
new model Requests to new models

Comments

@AmazDeng

The model to consider.

The llava-next-video project has already been released, and its test results are quite good. Are there any plans to support it?
https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT-Video.md
Currently, Hugging Face Transformers does not support this model.

The closest model vllm already supports.

No response

What's your difficulty of supporting the model you want?

No response

@AmazDeng added the new model label May 30, 2024
@ywang96
Collaborator

ywang96 commented Jul 5, 2024

Hi there @AmazDeng! It looks like this model is already supported on transformers. However, multi-image per prompt (which is essentially how video prompting is done) is currently not supported in vLLM, but this is definitely one of the top priorities on our roadmap!
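
While vLLM support is pending, a minimal sketch of running the model through Transformers, assuming transformers >= 4.42 and the llava-hf/LLaVA-NeXT-Video-7B-hf checkpoint (both are assumptions for illustration; the frame array below is a placeholder for real sampled video frames):

```python
import numpy as np
import torch
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"  # assumed checkpoint for illustration
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder clip: replace with frames sampled from a real video,
# shaped (num_frames, height, width, 3) as uint8.
frames = np.zeros((8, 336, 336, 3), dtype=np.uint8)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens in this video."},
            {"type": "video"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=frames, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```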

@AmazDeng
Author

AmazDeng commented Jul 5, 2024

> transformers

Yes, the latest version of Transformers now supports the llava-next-video model. However, its inference speed is very slow, so I hope you can support this model soon.
I also have another question: why does vLLM still not support passing inputs_embeds directly? If you know, could you explain the reason?
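
To make the inputs_embeds question concrete: Transformers already lets decoder-only models skip the token-embedding lookup and consume precomputed embeddings via generate(inputs_embeds=...). A minimal sketch, with an arbitrary placeholder checkpoint (nothing in this thread specifies one):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-7b-v1.5"  # placeholder checkpoint, assumption for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

input_ids = tokenizer("Describe the video.", return_tensors="pt").input_ids.to(model.device)
# Embed the tokens ourselves; a multimodal wrapper would splice projected
# image/video features into this tensor before generation.
inputs_embeds = model.get_input_embeddings()(input_ids)  # (1, seq_len, hidden_size)

output = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```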

@ywang96
Collaborator

ywang96 commented Jul 5, 2024

> Why does vLLM still not support passing inputs_embeds directly? If you know, could you explain the reason?

I do think that's something we should support (and there's indeed an issue for this #416). This will be another API change so we need to make sure everything's compatible.

At least as a first step, we do plan to support image embeddings as input (instead of PIL.Image) for vision language models. This will be part of our Q3 roadmap.
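
A hedged sketch of what "image embeddings as input" means in practice: compute the vision features once offline and hand only the resulting tensor to the engine. The CLIP tower, checkpoint, and file names below are assumptions for illustration; the exact tensor format vLLM will accept is defined by that roadmap item, not by this snippet.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

vision_id = "openai/clip-vit-large-patch14-336"  # LLaVA-style vision tower (assumed)
image_processor = CLIPImageProcessor.from_pretrained(vision_id)
vision_tower = CLIPVisionModel.from_pretrained(vision_id, torch_dtype=torch.float16).eval()

image = Image.open("frame_000.jpg")  # hypothetical input frame
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values.half()

with torch.no_grad():
    # (1, num_patches + 1, hidden_size) patch features; a multimodal projector
    # would map these into the language model's embedding space.
    image_embeds = vision_tower(pixel_values).last_hidden_state

# Persist and later pass this tensor instead of the raw PIL.Image.
torch.save(image_embeds, "frame_000_embeds.pt")
```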

@TKONIY

TKONIY commented Jul 19, 2024

> Hi there @AmazDeng! It looks like this model is already supported on transformers. However, multi-image per prompt (which is essentially how video prompting is done) is currently not supported in vLLM, but this is definitely one of the top priorities on our roadmap!

I am working on implementing LLaVA-NeXT-Video support in #6571.
