
[Feature] Add vision language model support. #3042

Merged
32 commits merged on Mar 25, 2024

Conversation

xwjiang2010
Contributor

@xwjiang2010 xwjiang2010 commented Feb 26, 2024

Vision Language Support

This PR adds vision language support to vLLM.
The changes are mainly to the API; the core logic of vLLM is kept untouched.

The design goal is to enable all vision language models, although the POC is done using Llava-7b.

Usage

The usage looks like this:

import torch

from vllm import LLM
from vllm.sequence import MultiModalData

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type="pixel_values",
    image_token_id=32000,
    image_input_shape="1,3,336,336",
    image_feature_size=576,
)

prompt = "<image>" * 576 + "What is the content of this image?"

# This should be generated offline or by another online component.
# See tests/images/ for samples.
images = torch.load("xxx")

llm.generate(prompt,
             multi_modal_data=MultiModalData(type=MultiModalData.Type.IMAGE,
                                             data=images))

Feature list

  • Allow the vLLM entrypoint to take images as input.
  • Expand SequenceGroup's and SequenceGroupMetadata's APIs to take images.
  • Expand the vLLM engine to take images.
  • Expand the contract between engine and worker to include images (only for the prompting phase).
  • Add the Llava model.
  • Support hosting the vision tower inside vLLM or leaving it outside. This allows for maximum flexibility and configurability when balancing scalability and latency, and is configured through VisionLanguageConfig (a sketch of both modes follows this list).
  • Works with other vLLM features: TP > 1, preemption, CUDA graphs, etc.
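For illustration, a minimal sketch of the two hosting modes (the pixel-values settings mirror the usage example above; the image-features shape "1,576,1024" is the value used later in this thread and should be treated as an assumption):

from vllm import LLM
from vllm.config import VisionLanguageConfig

# Vision tower hosted inside vLLM: feed raw pixel values.
llm_pixels = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type=VisionLanguageConfig.ImageInputType.PIXEL_VALUES,
    image_token_id=32000,
    image_input_shape="1,3,336,336",
    image_feature_size=576,
)

# Vision tower hosted outside vLLM: feed pre-computed CLIP features
# (vision tower output, before the multi-modal projector).
llm_features = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type=VisionLanguageConfig.ImageInputType.IMAGE_FEATURES,
    image_token_id=32000,
    image_input_shape="1,576,1024",
    image_feature_size=576,
)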

Reviewability

The PR works end to end. I have tested it locally through test_llava.py, a correctness test I added that compares transformers' results with vLLM's results.
Depending on the vLLM team's preference, we can either review this PR as-is (in which case I need some more work to fix CI failures), or I can break it down into smaller PRs to facilitate review.

Future work

  • Benchmark both FTL (first token latency) and ITL (inter-token latency).
  • Avoid the inefficiency of using transformers' native implementation of the vision tower.
  • Scalability studies of models of varying sizes.

Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
@xwjiang2010 xwjiang2010 changed the title [Do not review] [Feature] Support Llava. [Feature] Support Llava. Feb 26, 2024
@xwjiang2010 xwjiang2010 changed the title [Feature] Support Llava. [Feature] Add vision language model support. Feb 26, 2024
# encoding.
# Each request should have at least `image_feature_size` tokens.
if self.vision_language_config:
    max_num_seqs = min(
Collaborator


I currently don't understand this part, can you make the comments above clearer? (and also move them into the "if" condition) :)

Contributor Author


Rephrased, PTAL.

Member


Perhaps add a warning here since this will be "overriding" user configurations.
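For illustration only, the override plus warning could look roughly like this (a sketch; logger, max_num_seqs, and max_num_batched_tokens are assumed from the surrounding context, not copied from the PR):

if self.vision_language_config:
    # Each sequence in the profiling run must hold at least
    # `image_feature_size` image placeholder tokens, so cap max_num_seqs.
    capped = min(
        max_num_seqs,
        max_num_batched_tokens //
        self.vision_language_config.image_feature_size)
    if capped < max_num_seqs:
        logger.warning(
            "Overriding max_num_seqs from %d to %d to fit "
            "image_feature_size=%d tokens per request.", max_num_seqs, capped,
            self.vision_language_config.image_feature_size)
    max_num_seqs = capped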

@esmeetu esmeetu added the "new model" label Mar 2, 2024
@zhuohan123 zhuohan123 self-assigned this Mar 3, 2024
@zhuohan123
Collaborator

@pcmoritz Please let me know when you finish the first pass on this PR and when I can start reviewing!

@robertgshaw2-neuralmagic
Sponsor Collaborator

robertgshaw2-neuralmagic commented Mar 5, 2024

I think this PR looks very promising!

I think it would be a good idea to implement the vision tower using the vLLM primitives, so that it can:

  • use tensor parallelism (I think right now the vision model is replicated on every rank)
  • use the inference-only kernels

Additionally, the other note I had is that it is somewhat hard to follow what the datatype of the image inputs should be, since they are passed around as raw torch tensors. It might be nice to make a datatype (even if it is just an alias of a torch tensor) that makes it explicit whether the input is pixel values or embedding values. This would make the code more readable, since this was confusing to me at first.

Note: we are working on encoder-decoder support (to enable Whisper). We will use a similar structure for Whisper's multimodality to what you have here for Llava.
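As an illustration of the suggestion above, a minimal sketch of such a wrapper (names are hypothetical; the PR ultimately exposes MultiModalData with type and data fields, as in the usage example at the top):

import enum
from dataclasses import dataclass

import torch


class ImageInputKind(enum.Enum):
    PIXEL_VALUES = enum.auto()    # raw (1, 3, H, W) tensor for the vision tower
    IMAGE_FEATURES = enum.auto()  # (1, num_patches, hidden_size) tower output


@dataclass
class ImageInput:
    kind: ImageInputKind
    data: torch.Tensor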

@Pernekhan
Contributor

It looks good overall.

I have a few suggestions:

  1. I believe we don't need VisionLanguageConfig. The fields could be derived from the HF config (see the sketch after this list):

    • image_feature_size could be derived as (image_size * image_size) / (patch_size * patch_size), i.e. how many patch_size tiles are needed to cover the image. For Llava 1.5 it's (336 * 336) / (14 * 14) = 576.
    • image_token_id could be taken from the HF config's image_token_index.
    • image_input_shape also makes the API very restrictive, as Llava 1.6 has multiple image sizes.
    • image_input_type could always be pixel_values for simplicity.
  2. Consider making the image_request argument more generic to allow other multi-modality in the future. I propose that instead of image_request we have a datatype with type and data fields, so that future multimodal functionalities don't need to introduce a new argument each time.
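A hedged sketch of deriving these values from the HF config (attribute names follow transformers' LlavaConfig / CLIPVisionConfig; anything beyond that is an assumption):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("llava-hf/llava-1.5-7b-hf")
vision_cfg = cfg.vision_config

image_token_id = cfg.image_token_index                             # 32000
patches_per_side = vision_cfg.image_size // vision_cfg.patch_size  # 336 // 14 = 24
image_feature_size = patches_per_side ** 2                         # 576
image_input_shape = (1, 3, vision_cfg.image_size, vision_cfg.image_size)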

@junior-zsy

@xwjiang2010 I executed this code

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type="pixel_values",
    image_token_id=32000,
    image_input_shape=(1, 3, 336, 336),
    image_feature_size=576,
)

prompt = "<image>" * 576 + "What is the content of this image?"

images=torch.load("xxx")  # This should be generated offline or by another online component. See tests/images/ for samples

llm.generate(prompt, images=images) 

and encountered an error message: return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (7056x336 and 1024x4096)

@xwjiang2010
Contributor Author

xwjiang2010 commented Mar 8, 2024

I think it would be a good idea to implement the vision tower using the vLLM primitives, so that it can:

  • use tensor parallelism (I think right now the vision model is replicated on every rank)
  • use the inference-only kernels

@zhuohan123 @pcmoritz
I remember you preferred using HF's implementation for vision-related stuff the last time we chatted. Could you clarify vLLM's position on this again? I am happy either way; it's mostly about the complexity and maintainability of the repo, so it's your call.

Additionally, the other note I had is that it is somewhat hard to follow what the datatype of the image inputs should be, since they are passed around as raw torch tensors. It might be nice to make a datatype (even if it is just an alias of a torch tensor) that makes it explicit whether the input is pixel values or embedding values. This would make the code more readable, since this was confusing to me at first.

@robertgshaw2-neuralmagic I think this is great feedback, also echoed by @Pernekhan. We should do that!

@pcmoritz
Collaborator

pcmoritz commented Mar 8, 2024

Yes, we want to first get a very simple implementation of the vision tower in before we do something more advanced. We can implement the vision model with vLLM primitives as a follow-up later if it is worth the complexity (but we should do benchmarks first to ensure it will be worth the additional complexity). If there are contributions towards this effort, that would certainly speed things up (either implementations or benchmarking).

@zhuohan123
Collaborator

Yes, we want to first get a very simple implementation of the vision tower in before we do something more advanced. We can implement the vision model with vLLM primitives as a follow-up later if it is worth the complexity (but we should do benchmarks first to ensure it will be worth the additional complexity). If there are contributions towards this effort, that would certainly speed things up (either implementations or benchmarking).

+1, let's merge a simple version where we don't maintain the vision model code by ourselves first. We can optimize the performance later.

@xwjiang2010
Contributor Author

@Pernekhan
Thanks for the comment.

I believe we don't need VisionLanguageConfig. The fields could be derived from the HF config.

  • image_feature_size could be derived as (image_size * image_size) / (patch_size * patch_size), i.e. how many patch_size tiles are needed to cover the image. For Llava 1.5 it's (336 * 336) / (14 * 14) = 576.
  • image_token_id could be taken from the HF config's image_token_index.
  • image_input_shape also makes the API very restrictive, as Llava 1.6 has multiple image sizes.
  • image_input_type could always be pixel_values for simplicity.

Since this is an API discussion, I think we should align ASAP.
While I agree that some information can be inferred from hf_config, I still think we should be explicit. The reasons are:

  • How hf_config maps to these configs is model-dependent. Managing that mapping for various models should not be vLLM's responsibility.
  • image_input_shape: what we really need is the maximum input shape in the worst case. This is used during the dry run to determine how many GPU blocks the cache manager can have.
  • pixel_values is indeed the default. However, as models get larger, it's likely that one may want to run the encoder somewhere else. To my knowledge, some people have requested the flexibility of feeding in features or even joint embeddings directly.

@xwjiang2010
Contributor Author

@junior-zsy It's likely that the images arg isn't quite right. Can you compare it with the pytest fixture used by test_llava.py?

@alexv-cerebras

I guess it should be like this:

import torch

from vllm import LLM
from vllm.config import VisionLanguageConfig

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type=VisionLanguageConfig.ImageInputType.PIXEL_VALUES,
    image_feature_size=576,
    image_token_id=32000,
    image_input_shape=(1, 3, 336, 336),
)
prompt = "<image>" * 576 + "What is the content of this image?"

# This should be generated offline or by another online component.
# See tests/images/ for samples.
images = torch.load("xxx")

output = llm.generate(prompt, image_request=images)

@xwjiang2010
Contributor Author

I guess it should be like this: (snippet quoted above)

Yes, you are exactly right. Did the snippet work for you?

@alexv-cerebras

Yes, you are exactly right. Did the snippet work for you?

Yes, it worked, thank you!

from vllm.transformers_utils.tokenizer import get_tokenizer

_TEST_DIR = os.path.dirname(__file__)
_TEST_PROMPTS = [os.path.join(_TEST_DIR, "prompts", "example.txt")]
_LONG_PROMPTS = [os.path.join(_TEST_DIR, "prompts", "summary.txt")]

_PIXEL_VALUES_FILES = [
    "images/stop_sign_pixel_values.pt", "images/cherry_blossom_pixel_values.pt"
Collaborator


Can we generate these programmatically from the .jpg files and not check them in?

Contributor Author


Your comment makes me think I need to document which lines of code pixel_values and image_features correspond to.
As for a programmatic way of generating these, pixel_values is easy to do, but image_features is not.

Contributor Author


Added more comments under llava.py.

tests/conftest.py (outdated, resolved)
hf_vision_config = config.vision_config
self.vision_language_config = vision_language_config

assert self.vision_language_config
Collaborator


This can't fail if the type signature above is correct :)

Contributor Author


I would rather not rely on type hinting. I added some useful user-facing information that will show up when someone does

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
)

instead of

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type=VisionLanguageConfig.ImageInputType.PIXEL_VALUES,
    image_token_id=32000,
    image_input_shape=(1, 3, 336, 336),
    image_feature_size=576,
)

@ywang96
Member

ywang96 commented Apr 1, 2024

@xwjiang2010 is there a plan to update this for Llava 1.6? 1.6 is vastly better than 1.5 in terms of accuracy.

I tried using sglang from the llava repo and hit tons of problems; hoping the vLLM team can make it work for fast concurrent inference!

I will be working on a PR for Llava 1.6 - ideally by the end of this week.

@pseudotensor

@ywang96 Amazing!

@xwjiang2010
Contributor Author

xwjiang2010 commented Apr 1, 2024

@ywang96 I am doing a bit of a POC on Llava 1.6. There should be no major blocker other than dynamically figuring out the number of <image> placeholders. There are other bits and pieces around performance implications, which I hope can be solved in a similar fashion to disaggregated prefill.
I am happy to chat offline, and I totally agree that 1.6 support is essential for unlocking a ton of exciting applications!

@lightmatmul

Thank you for your great work!
I was wondering: what is the difference between image features and pixel values, at least performance-wise?

@Iven2132

Iven2132 commented Apr 3, 2024

Hi @xwjiang2010, can I use my fine-tuned Llava on vLLM? I'm first downloading my fine-tuned model from HF, and then in the LLM class I'm setting model="llava-hf/llava-1.5-7b-hf". Am I doing things correctly? Can you please help me?

import os
import requests
from modal import Image, Secret, Stub, enter, exit, gpu, method
import subprocess

MODEL_DIR = "/model"
BASE_MODEL = "myfine-tuned-model-huggingface-repo"

def download_model_and_image():
    from huggingface_hub import snapshot_download
    from transformers.utils import move_cache

    snapshot_download(
        BASE_MODEL,
        local_dir=MODEL_DIR,
        token=os.environ["HF_TOKEN"],
        ignore_patterns=["*.pt", "*.gguf"],
    )
    move_cache()

image = (
    Image.from_registry(
        "nvidia/cuda:12.1.1-devel-ubuntu22.04", add_python="3.10"
    )
    .pip_install(
        "vllm==0.4.0",
        "huggingface_hub==0.19.4",
        "hf-transfer==0.1.4",
        "torch==2.1.2",
        "aws-shell",
        "requests" 
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    .run_function(
        download_model_and_image,
        secrets=[Secret.from_name("huggingface-secret")],
        timeout=60 * 20,
    )
)

stub = Stub(f"example-vllm", image=image)

GPU_CONFIG = gpu.A100(memory=80, count=1)

@stub.cls(gpu=GPU_CONFIG, secrets=[Secret.from_name("huggingface-secret")])
class Model:
    @enter()
    def load(self):
        from vllm import LLM

        if GPU_CONFIG.count > 1:
            import ray

            ray.shutdown()
            ray.init(num_gpus=GPU_CONFIG.count)

        self.llm = LLM(
            model="llava-hf/llava-1.5-7b-hf",
            image_input_type="image_features",
            image_token_id=32000,
            image_input_shape="1,576,1024",
            image_feature_size=576,
        )

        s3_bucket_path = "s3://air-example-data-2/vllm_opensource_llava/"
        local_directory = "images"
        os.makedirs(local_directory, exist_ok=True)
        subprocess.check_call([
            "aws",
            "s3",
            "sync",
            s3_bucket_path,
            local_directory,
            "--no-sign-request",
        ])


    @method()
    def generate(self, user_questions):
        import torch
        from vllm import LLM
        from vllm.sequence import MultiModalData

        prompt = "<image>" * 576 + (
            "\nUSER: What is the content of this image?\nASSISTANT:")

        images = torch.load("images/stop_sign_image_features.pt")

        outputs = self.llm.generate(prompt,
                                    multi_modal_data=MultiModalData(
                                        type=MultiModalData.Type.IMAGE, data=images))
        for o in outputs:
            generated_text = o.outputs[0].text
            print(generated_text)

    @exit()
    def stop_engine(self):
        if GPU_CONFIG.count > 1:
            import ray
            ray.shutdown()

@stub.local_entrypoint()
def main():
    model = Model()
    questions = [
        "Implement a Python function to compute the Fibonacci numbers.",
    ]
    model.generate.remote(questions)

@chricro

chricro commented Apr 9, 2024

@ywang96 thank you for the work you're doing; I can't wait to see the results!

@alsichcan

@xwjiang2010
First of all, thank you for your amazing work. Your work has paved the way for my research!

I am working on developing an OpenAI-compatible server for LLaVa #3873 and have encountered a couple of points where I seek your guidance and wish to offer some suggestions.

  1. Output from LLaVa Example Code:
    While executing llava_example.py, I observed that the output appears to be truncated:
    The image features several elements set in a city environment. There is a stop sign
    Could you confirm if this output is as designed, or are there additional configurations needed to ensure complete and detailed responses?

  2. Image Preparation Guidelines:
    The instructions for preparing images are somewhat vague, mentioning only that this task should be handled by an external component in llava_example.py. However, by delving into the comments within llava.py, one can uncover further details on preparing image inputs:

PIXEL_VALUES: 
- https://github.com/huggingface/transformers/blob/07bdbeb/src/transformers/models/llava/modeling_llava.py#L353
IMAGE_FEATURES:
- https://github.com/huggingface/transformers/blob/07bdbeb/src/transformers/models/llava/modeling_llava.py#L430
before going through the multi modal projector.
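For reference, a minimal sketch of producing both inputs offline with Hugging Face transformers (the layer/index choices follow the modeling_llava.py links above; the file names and exact calls are illustrative assumptions):

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("stop_sign.jpg")

# PIXEL_VALUES: the resized/normalized tensor the vision tower consumes,
# shape (1, 3, 336, 336) for Llava 1.5.
pixel_values = processor.image_processor(image, return_tensors="pt")["pixel_values"]
torch.save(pixel_values, "stop_sign_pixel_values.pt")

# IMAGE_FEATURES: the vision tower's hidden states before the multi-modal
# projector, shape (1, 576, 1024) for Llava 1.5 (penultimate layer, CLS dropped).
with torch.no_grad():
    vision_out = model.vision_tower(pixel_values, output_hidden_states=True)
    image_features = vision_out.hidden_states[-2][:, 1:]
torch.save(image_features, "stop_sign_image_features.pt")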

To enhance user convenience, I propose the integration of a feature that automates the conversion of raw image files into image features, eliminating the need for users to manually prepare .pt files for utilizing LLaVa with vLLM. This enhancement would align the process more closely with the OpenAI API's format, which accepts images via URL or base64-encoded local files in formats such as PNG, JPEG, WEBP, and GIF.

Thank you for considering these points. I am eager to hear your thoughts and look forward to continuing to leverage the impressive capabilities of your work.

Best regards,

@DarkLight1337
Member

DarkLight1337 commented Apr 10, 2024

  1. Output from LLaVa Example Code:
    While executing llava_example.py, I observed that the output appears to be truncated:
    The image features several elements set in a city environment. There is a stop sign
    Could you confirm if this output is as designed, or are there additional configurations needed to ensure complete and detailed responses?

You have to set the max_tokens parameter to a higher value to avoid truncated output.
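For example (a sketch reusing llm, prompt, and images from llava_example.py; the default max_tokens in SamplingParams is small, so raise it explicitly):

from vllm import SamplingParams
from vllm.sequence import MultiModalData

sampling_params = SamplingParams(temperature=0, max_tokens=256)
outputs = llm.generate(prompt,
                       sampling_params,
                       multi_modal_data=MultiModalData(
                           type=MultiModalData.Type.IMAGE, data=images))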

  1. Image Preparation Guidelines:
    The instructions for preparing images are somewhat vague, mentioning only that this task should be handled by an external component in llava_example.py. However, by delving into the comments within llava.py, one can uncover further details on preparing image inputs:
PIXEL_VALUES: 
- https://github.com/huggingface/transformers/blob/07bdbeb/src/transformers/models/llava/modeling_llava.py#L353
IMAGE_FEATURES:
- https://github.com/huggingface/transformers/blob/07bdbeb/src/transformers/models/llava/modeling_llava.py#L430
before going through the multi modal projector.

I would like to add to your point. The current example script requires the use of S3, which is not convenient to set up. While developing support for the OpenAI image input API, I personally passed URLs of online images for testing. Perhaps the example should be modified later so that S3 is no longer required.

@ywang96
Member

ywang96 commented Apr 10, 2024

@xwjiang2010 First of all, thank you for your amazing work. [...] To enhance user convenience, I propose the integration of a feature that automates the conversion of raw image files into image features, eliminating the need for users to manually prepare .pt files for utilizing LLaVa with vLLM.

@alsichcan I personally agree with your point. That's why I've been taking time to think about the best way to put such a helper module in vLLM and integrate it with the current vision language model framework; this could also be the module that bridges the engine and the API server if we eventually build an image API into it as well.

@DarkLight1337
Member

DarkLight1337 commented Apr 18, 2024

@WoosukKwon I think you should close #1286 and #1751 as well, since they have been resolved by this PR.

@ywang96
Member

ywang96 commented Apr 18, 2024

@DarkLight1337 @alsichcan FYI - while working on adding support for Llava-Next, I realized the current design for vision models is too specific to Llava 1.5 and probably not generalizable enough to support other multi-modal models, along with missing pieces for end-to-end inference with the API server that have been addressed in #3978.

I'm working on an RFC to share some thoughts on refactoring and will send it out tomorrow.

@Iven2132

Iven2132 commented Apr 21, 2024

Hey @ywang96, can I use my fine-tuned PEFT Llava model with vLLM? I'm writing a notebook for Brev that I want to share with the world, but I'm stuck on this problem. Can you please help me out? Here is the fine-tuned model on Hugging Face: marksuccsmfewercoc/llava-1.5-7b-hf-ft-mix-vsft

@clj55

clj55 commented Jun 7, 2024

I tried using the llava_example.py from https://docs.vllm.ai/en/latest/getting_started/examples/llava_example.html but am encountering ModuleNotFoundError: No module named 'vllm.multimodal'

I pip installed vLLM version 0.4.3.

Does anyone know what the issue is?

@DarkLight1337
Member

I tried using the llava_example.py from https://docs.vllm.ai/en/latest/getting_started/examples/llava_example.html but am encountering ModuleNotFoundError: No module named 'vllm.multimodal'

I pip installed vLLM version 0.4.3.

Does anyone know what the issue is?

You are using the docs for the latest version, not v0.4.3. The API has changed since then.

@SaltFish11

SaltFish11 commented Jul 3, 2024

This is very interesting work, but I have two questions that I hope the author can answer:

  1. The ViT module uses modules from the CLIP model in the Hugging Face repository rather than vLLM-based attention. Does Llava support TP > 1?
  2. Does Llava currently support prefix caching?

@ywang96
Member

ywang96 commented Aug 4, 2024

This is very interesting work, but I have two questions that I hope the author can answer:

  1. The ViT module uses modules from the CLIP model in the Hugging Face repository rather than vLLM-based attention. Does Llava support TP > 1?
  2. Does Llava currently support prefix caching?

  1. We do support TP for VLMs, except that the ViTs in the model are currently not TP'ed but replicated on each GPU. The reason is that we didn't see a performance benefit from TP'ing these ViTs, but this is definitely something we want to work on more closely in the future.
  2. It should still support the text part, but we currently don't have a good way to cache the image embeddings, so the input will go through the ViT regardless.

Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024