Multiple image embeds in one prompt? #12

Open
Quasimondo opened this issue Jan 24, 2024 · 4 comments

@Quasimondo

By slightly augmenting the code, I tried to embed two images into the prompt in the hope that the model would be able to make comparisons between them, but so far it looks like it only ever sees the last embed. I am wondering whether this approach is feasible at all, and what would be required to make it work?

This is my change in sample.py:

import argparse

import torch
from PIL import Image

parser = argparse.ArgumentParser()
parser.add_argument("--image", type=str, required=True)
parser.add_argument("--image2", type=str, required=True)
parser.add_argument("--prompt", type=str, required=False)
args = parser.parse_args()

# vision_encoder is set up earlier in sample.py; each call returns the
# embedding sequence for one image.
image = Image.open(args.image)
image_embeds = vision_encoder(image)
print("image_embeds", image_embeds.size())

image2 = Image.open(args.image2)
image_embeds2 = vision_encoder(image2)
print("image_embeds2", image_embeds2.size())

# Stack the two embeddings along dim 0 so the batch dimension indexes images.
image_embeds = torch.cat((image_embeds, image_embeds2), 0)
print("image_embeds", image_embeds.size())

And this is my change in text_model.py:

def input_embeds(self, prompt, image_embeds):
    embeds = []

    def _add_toks(toks):
        embeds.append(self.text_emb(toks))

    def _tokenize(txt):
        return self.tokenizer(
            txt, return_tensors="pt", add_special_tokens=False
        ).input_ids.to(self.model.device)

    # Add BOS token
    _add_toks(
        torch.tensor([[self.tokenizer.bos_token_id]], device=self.model.device)
    )

    if "<image>" not in prompt:
        embeds.append(self.text_emb(_tokenize(prompt)))
    else:
        assert prompt.count("<image>") == 1
        before, after = prompt.split("<image>")

        if image_embeds.size(0) == 1:
            # Single image: unchanged from the original code path.
            embeds.append(self.text_emb(_tokenize(f"{before}<image>")))
            embeds.append(image_embeds.to(self.model.device))
            embeds.append(self.text_emb(_tokenize(f"</image>{after}")))
        else:
            # Multiple images: expand the single <image> placeholder into one
            # labeled <image>...</image> span per embedding in the batch.
            if len(before) > 0:
                embeds.append(self.text_emb(_tokenize(before)))
            for i in range(image_embeds.size(0)):
                embeds.append(self.text_emb(_tokenize(f"Image #{i + 1}: <image>")))
                embeds.append(image_embeds[i].unsqueeze(0).to(self.model.device))
                embeds.append(self.text_emb(_tokenize("</image>")))
            if len(after) > 0:
                embeds.append(self.text_emb(_tokenize(after)))

    return torch.cat(embeds, dim=1)
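
As a sanity check on shapes, here is a minimal standalone sketch (dummy tensors, hypothetical dimensions, not moondream's actual sizes) of why concatenating along dim=1 works: text-token embeddings and per-image embeddings are all shaped (1, seq_len, hidden_dim), so they line up on the sequence axis.

import torch

# Hypothetical sizes for illustration only.
hidden_dim = 2048
text_part = torch.randn(1, 5, hidden_dim)  # embeddings for 5 prompt tokens
img_a = torch.randn(1, 729, hidden_dim)    # embeddings for image 1
img_b = torch.randn(1, 729, hidden_dim)    # embeddings for image 2

# Interleave text and image pieces into one sequence along dim 1.
seq = torch.cat([text_part, img_a, text_part, img_b], dim=1)
print(seq.size())  # torch.Size([1, 1468, 2048])
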
@Quasimondo (Author)

Ah, I just found this issue on the LLaVA repository, and it looks like this would require a different training approach:
haotian-liu/LLaVA#197
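
For context, a hedged sketch of what a single interleaved multi-image training sample might look like, reusing the "Image #N: <image>...</image>" convention from the patch above; this is a hypothetical shape for illustration, not LLaVA's or moondream's actual data schema:

# Hypothetical multi-image training sample (illustrative shape only).
example = {
    "images": ["park.jpg", "sofa.jpg"],  # two images per sample instead of one
    "prompt": (
        "Image #1: <image></image> Image #2: <image></image> "
        "Is the same dog present in both images?"
    ),
    "answer": "Yes, the same brown dog appears in both images.",
}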

@vikhyat (Owner)

vikhyat commented Jan 24, 2024

Right, the current version doesn't have the ability to understand multiple images. I expect to train a version that addresses this in the near future. Can I ask what types of comparisons you're interested in?

@Quasimondo (Author)

Ideally it would be capable of any kind of comparison, but for a start it would already be nice if it could point out things like "object A is only present in image 1" or "image 1 is a photo, whilst image 2 is a comic, but both are portraits".

@sujitvasanth

sujitvasanth commented Jan 29, 2024

It took a bit of prompt engineering and image stitching, but I got moondream1 comparing rudimentary images.
My prompts:

  • here are 2 webcam images, upper and lower; is the dog in upper, lower, both or none? What colour is the dog?
  • here's 2 webcams, whats held up in upper vs lower picture?

[Attachments: Image 1, Image 2]

What I learned:
  • Images placed side by side don't work; there is left/right confusion between the images.
  • Stacking the images one above the other (upper/lower) works when the photos have a border and an area of white to split them; a stitching sketch follows below.
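
A minimal sketch of that stitching step using Pillow; the helper name, border width, and gap size are illustrative choices, not values from this thread:

from PIL import Image, ImageOps

def stack_vertically(path_a, path_b, border=8, gap=40):
    # Black border around each frame, then a white gap between them,
    # mirroring the "border + area of white" layout described above.
    a = ImageOps.expand(Image.open(path_a).convert("RGB"), border=border, fill="black")
    b = ImageOps.expand(Image.open(path_b).convert("RGB"), border=border, fill="black")
    width = max(a.width, b.width)
    canvas = Image.new("RGB", (width, a.height + gap + b.height), "white")
    canvas.paste(a, ((width - a.width) // 2, 0))
    canvas.paste(b, ((width - b.width) // 2, a.height + gap))
    return canvas

stack_vertically("upper.jpg", "lower.jpg").save("stacked.jpg")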

I think some fine-tuning on answering questions with multiple photos in one image would help, but then again Phi-1.5 can't add up to 10, so it really needs a better LLM backbone.
