Multiple image embeds in one prompt? #12

Open
Quasimondo opened this issue Jan 24, 2024 · 4 comments

@Quasimondo

By slightly augmenting the code, I tried to embed two images into the prompt in the hope that the model would be able to make comparisons between them, but so far it looks like it only ever sees the last embed. I am wondering whether this approach is feasible at all, and what would be required to make it work?

This is my change in sample.py:

import argparse

import torch
from PIL import Image

parser = argparse.ArgumentParser()
parser.add_argument("--image", type=str, required=True)
parser.add_argument("--image2", type=str, required=True)
parser.add_argument("--prompt", type=str, required=False)
args = parser.parse_args()

# vision_encoder is set up earlier in sample.py; each call returns the
# embedding sequence for one image.
image = Image.open(args.image)
image_embeds = vision_encoder(image)
print("image_embeds", image_embeds.size())

image2 = Image.open(args.image2)
image_embeds2 = vision_encoder(image2)
print("image_embeds2", image_embeds2.size())

# Stack the two embeddings along dim 0 so the batch dimension indexes images.
image_embeds = torch.cat((image_embeds, image_embeds2), 0)
print("image_embeds", image_embeds.size())

And this is my change in text_model.py:

def input_embeds(self, prompt, image_embeds):
    embeds = []

    def _add_toks(toks):
        embeds.append(self.text_emb(toks))

    def _tokenize(txt):
        return self.tokenizer(
            txt, return_tensors="pt", add_special_tokens=False
        ).input_ids.to(self.model.device)

    # Add BOS token
    _add_toks(
        torch.tensor([[self.tokenizer.bos_token_id]], device=self.model.device)
    )

    if "<image>" not in prompt:
        embeds.append(self.text_emb(_tokenize(prompt)))
    else:
        assert prompt.count("<image>") == 1
        before, after = prompt.split("<image>")

        if image_embeds.size(0) == 1:
            # Single image: unchanged from the original code path.
            embeds.append(self.text_emb(_tokenize(f"{before}<image>")))
            embeds.append(image_embeds.to(self.model.device))
            embeds.append(self.text_emb(_tokenize(f"</image>{after}")))
        else:
            # Multiple images: expand the single <image> placeholder into one
            # labeled <image>...</image> span per embedding in the batch.
            if len(before) > 0:
                embeds.append(self.text_emb(_tokenize(before)))
            for i in range(image_embeds.size(0)):
                embeds.append(self.text_emb(_tokenize(f"Image #{i + 1}: <image>")))
                embeds.append(image_embeds[i].unsqueeze(0).to(self.model.device))
                embeds.append(self.text_emb(_tokenize("</image>")))
            if len(after) > 0:
                embeds.append(self.text_emb(_tokenize(after)))

    return torch.cat(embeds, dim=1)
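
As a sanity check on shapes, here is a minimal standalone sketch (dummy tensors, hypothetical dimensions, not moondream's actual sizes) of why concatenating along dim=1 works: text-token embeddings and per-image embeddings are all shaped (1, seq_len, hidden_dim), so they line up on the sequence axis.

import torch

# Hypothetical sizes for illustration only.
hidden_dim = 2048
text_part = torch.randn(1, 5, hidden_dim)  # embeddings for 5 prompt tokens
img_a = torch.randn(1, 729, hidden_dim)    # embeddings for image 1
img_b = torch.randn(1, 729, hidden_dim)    # embeddings for image 2

# Interleave text and image pieces into one sequence along dim 1.
seq = torch.cat([text_part, img_a, text_part, img_b], dim=1)
print(seq.size())  # torch.Size([1, 1468, 2048])
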
@Quasimondo (Author)

Ah, I just found this issue on the LLaVA repository, and it looks like this would require a different training approach:
haotian-liu/LLaVA#197
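
For context, a hedged sketch of what a single interleaved multi-image training sample might look like, reusing the "Image #N: <image>...</image>" convention from the patch above; this is a hypothetical shape for illustration, not LLaVA's or moondream's actual data schema:

# Hypothetical multi-image training sample (illustrative shape only).
example = {
    "images": ["park.jpg", "sofa.jpg"],  # two images per sample instead of one
    "prompt": (
        "Image #1: <image></image> Image #2: <image></image> "
        "Is the same dog present in both images?"
    ),
    "answer": "Yes, the same brown dog appears in both images.",
}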

@vikhyat (Owner)

vikhyat commented Jan 24, 2024

Right, the current version doesn't have the ability to understand multiple images. I expect to train a version that addresses this in the near future. Can I ask what types of comparisons you're interested in?

@Quasimondo (Author)

Ideally it would be capable of any kind of comparison, but for a start it would already be nice if it could point out things like "object A is only present in image 1" or "image 1 is a photo, whilst image 2 is a comic, but both are portraits".

@sujitvasanth

sujitvasanth commented Jan 29, 2024

It took a bit of prompt engineering and image stitching, but I got moondream1 comparing rudimentary images.
My prompts:

  • here are 2 webcam images, upper and lower; is the dog in upper, lower, both or none? What colour is the dog?
  • here's 2 webcams, whats held up in upper vs lower picture?

[Attachments: Image 1, Image 2]

What I learned:
  • Images placed side by side don't work; there is left/right confusion between the images.
  • Stacking the images one above the other (upper/lower) works when the photos have a border and an area of white to split them; a stitching sketch follows below.
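
A minimal sketch of that stitching step using Pillow; the helper name, border width, and gap size are illustrative choices, not values from this thread:

from PIL import Image, ImageOps

def stack_vertically(path_a, path_b, border=8, gap=40):
    # Black border around each frame, then a white gap between them,
    # mirroring the "border + area of white" layout described above.
    a = ImageOps.expand(Image.open(path_a).convert("RGB"), border=border, fill="black")
    b = ImageOps.expand(Image.open(path_b).convert("RGB"), border=border, fill="black")
    width = max(a.width, b.width)
    canvas = Image.new("RGB", (width, a.height + gap + b.height), "white")
    canvas.paste(a, ((width - a.width) // 2, 0))
    canvas.paste(b, ((width - b.width) // 2, a.height + gap))
    return canvas

stack_vertically("upper.jpg", "lower.jpg").save("stacked.jpg")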

I think some fine-tuning on answering questions with multiple photos in one image would help, but then again Phi-1.5 can't add up to 10, so it really needs a better LLM backbone.
