# Image tasks with VLM
IDEFICS is an open-access visual language model based on Flamingo, which was developed by DeepMind. This notebook follows the Hugging Face IDEFICS tutorial.
Reference:
- [idefics tutorial](https://huggingface.co/docs/transformers/v4.41.2/en/tasks/idefics)

In [1]:
# Quote the version specifiers so the shell does not treat ">=" as output redirection
!pip install "bitsandbytes>=0.39.0" "transformers==4.41.2"

In [2]:
_MODEL_ID = "HuggingFaceM4/idefics-9b"

In [10]:
import torch
from transformers import AutoProcessor, IdeficsForVisionText2Text, BitsAndBytesConfig

In [11]:
processor = AutoProcessor.from_pretrained(_MODEL_ID)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [12]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Store the weights in 4-bit precision
    bnb_4bit_compute_dtype=torch.float16,  # Run the computations in 16-bit float
)
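
As a rough sanity check on why 4-bit loading helps, the sketch below estimates the memory taken by the weights alone. The parameter count of 9 billion is an assumed round figure taken from the model name, and the estimate ignores quantization metadata, any layers that stay unquantized, and activations.

```python
def approx_weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Back-of-the-envelope size of the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 9e9  # assumption: round figure suggested by the "9b" in the model id

fp16_gib = approx_weight_memory_gib(NUM_PARAMS, 16)
int4_gib = approx_weight_memory_gib(NUM_PARAMS, 4)
print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:  ~{int4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the 16-bit footprint; the real figure on a GPU will be somewhat higher.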

In [13]:
model = IdeficsForVisionText2Text.from_pretrained(_MODEL_ID, device_map="auto", quantization_config=quantization_config)    

Loading checkpoint shards: 100%|██████████| 19/19 [00:46<00:00,  2.43s/it]
You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.


# Image captioning - Image2Text task
The model accepts interleaved text and images as input.
To caption an image, it needs only the processed image; no text prompt is required.

In [37]:
image_path = [
    "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80"
]

In [43]:
processed_inputs = processor(image_path, return_tensors="pt").to("cuda")
# The processed inputs contain token ids (input_ids) plus pixel_values, which holds the normalized image pixels.
print(processed_inputs)
print(processed_inputs['input_ids'].shape)
print(processed_inputs['pixel_values'].shape)

{'input_ids': tensor([[    1, 32000, 32001, 32000]], device='cuda:1'), 'attention_mask': tensor([[1, 1, 1, 1]], device='cuda:1'), 'pixel_values': tensor([[[[[-0.5660, -0.5806, -0.5660,  ..., -0.7850, -0.7850, -0.7850],
           [-0.5660, -0.5660, -0.5514,  ..., -0.7704, -0.7850, -0.7850],
           [-0.5514, -0.5368, -0.5514,  ..., -0.7704, -0.7704, -0.7850],
           ...,
           [-1.1791, -1.1207, -1.1499,  ..., -1.1499, -1.1645, -1.1207],
           [-1.1645, -1.1791, -1.2083,  ..., -0.8726, -0.9893, -1.0769],
           [-1.2375, -1.2375, -1.2375,  ..., -1.0477, -1.0331, -1.1353]],

          [[-0.2213, -0.2063, -0.1913,  ..., -0.7166, -0.7316, -0.7316],
           [-0.1463, -0.1313, -0.1163,  ..., -0.7166, -0.7166, -0.7316],
           [-0.0862, -0.0712, -0.0562,  ..., -0.7166, -0.7166, -0.7316],
           ...,
           [-0.8366, -0.7016, -0.7166,  ..., -1.1518, -1.1218, -1.0918],
           [-0.8366, -0.8366, -0.8366,  ..., -1.0167, -1.0467, -1.0617],
           [-0.95
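
The `input_ids` printed above are `[1, 32000, 32001, 32000]`. Judging from the decoded prompts later in this notebook, `1` is `<s>`, `32000` is `<fake_token_around_image>`, and `32001` is `<image>`. The sketch below is a simplified, string-level illustration of that layout. `sketch_prompt_layout` is a hypothetical helper, not the processor's actual implementation; it treats any item starting with `http` as an image, uses placeholder URLs, and does not model details such as whitespace handling.

```python
IMAGE_BLOCK = "<fake_token_around_image><image><fake_token_around_image>"

def sketch_prompt_layout(items: list[str]) -> str:
    """Replace each image URL with the image placeholder tokens; keep text as-is."""
    parts = ["<s>"]  # beginning-of-sequence token
    for item in items:
        if item.startswith("http"):
            parts.append(IMAGE_BLOCK)  # the pixels themselves travel in pixel_values
        else:
            parts.append(item)
    return "".join(parts)

caption_only = sketch_prompt_layout(["https://example.com/puppy.jpg"])
prompted = sketch_prompt_layout(["https://example.com/statue.jpg", "This is an image of"])
print(caption_only)
print(prompted)
```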

In [39]:
# Get the token ids for the bad words
bad_words_id = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

In [40]:
# Ban <image> and <fake_token_around_image>: the model may try to emit them even though no new image is being fed in.
generated_output_ids = model.generate(**processed_inputs, max_new_tokens=20, bad_words_ids=bad_words_id)
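
Conceptually, banning a token amounts to making it unselectable at every decoding step. The toy example below illustrates the idea with plain Python lists and greedy selection. It is not the `transformers` implementation, and the vocabulary and scores are made up for illustration.

```python
import math

def mask_banned(scores: list[float], banned_ids: set[int]) -> list[float]:
    """Set the score of every banned token id to -inf so it can never win."""
    return [-math.inf if i in banned_ids else s for i, s in enumerate(scores)]

def greedy_pick(scores: list[float]) -> int:
    """Return the index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)

toy_vocab = ["a", "puppy", "<image>", "<fake_token_around_image>"]  # made-up vocabulary
toy_scores = [0.1, 0.3, 0.9, 0.5]                                   # made-up scores
banned = {2, 3}                                                     # ids of the two image tokens

unmasked_choice = toy_vocab[greedy_pick(toy_scores)]
masked_choice = toy_vocab[greedy_pick(mask_banned(toy_scores, banned))]
print(unmasked_choice, "->", masked_choice)
```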

In [41]:
decoded_text = processor.batch_decode(generated_output_ids, skip_special_tokens=True)[0]
print(decoded_text)

A puppy in a flower bed


# Prompted image captioning
Pass a text prompt together with the image; the model continues the text to produce the caption.

In [60]:
prompt = [
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg",
    "This is an image of"
]

In [61]:
inputs = processor(prompt, return_tensors="pt").to("cuda")
print(inputs)
print(f"Decoded tokens = {processor.decode(inputs['input_ids'][0], skip_special_tokens=False)}")

{'input_ids': tensor([[    1, 32000, 32001, 32000,   910,   338,   385,  1967,   310]],
       device='cuda:1'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1'), 'pixel_values': tensor([[[[[1.4632, 1.4632, 1.4632,  ..., 1.4048, 1.4048, 1.4048],
           [1.4632, 1.4632, 1.4632,  ..., 1.4048, 1.4048, 1.4048],
           [1.4486, 1.4486, 1.4486,  ..., 1.4194, 1.4194, 1.4194],
           ...,
           [1.3026, 1.3026, 1.3026,  ..., 1.4340, 1.4340, 1.4340],
           [1.2734, 1.2880, 1.2880,  ..., 1.4340, 1.4340, 1.4340],
           [1.2734, 1.2880, 1.2880,  ..., 1.4340, 1.4340, 1.4340]],

          [[1.5796, 1.5796, 1.5796,  ..., 1.5646, 1.5646, 1.5646],
           [1.5796, 1.5796, 1.5796,  ..., 1.5646, 1.5646, 1.5646],
           [1.5646, 1.5646, 1.5646,  ..., 1.5796, 1.5796, 1.5796],
           ...,
           [1.4446, 1.4446, 1.4446,  ..., 1.5496, 1.5496, 1.5496],
           [1.4295, 1.4446, 1.4446,  ..., 1.5496, 1.5496, 1.5496],
           [1.4145, 1.4295

In [62]:
generated_output = model.generate(**inputs, max_new_tokens=100, bad_words_ids=bad_words_id)
decoded_output = processor.batch_decode(generated_output, skip_special_tokens=True)[0]
print(f"Generated output token ids = {generated_output}")
print(f"Decoded text = {decoded_output}")

Generated output token ids = tensor([[    1, 32000, 32001, 32000,   910,   338,   385,  1967,   310,   278,
          6666,   434,   310, 10895,  1017,   297,  1570,  3088,  4412, 29889,
             2]], device='cuda:1')
Decoded text = This is an image of the Statue of Liberty in New York City.
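
The generated ids printed above start with the prompt ids, so the decoded text repeats the prompt. If only the continuation is wanted, one option is to slice off the prompt length before decoding. The sketch below shows the slicing on plain lists, using the ids from the output above; with tensors the equivalent would be something like `generated_output[:, inputs['input_ids'].shape[1]:]`.

```python
# Token ids copied from the notebook output above
prompt_ids = [1, 32000, 32001, 32000, 910, 338, 385, 1967, 310]
generated = [1, 32000, 32001, 32000, 910, 338, 385, 1967, 310, 278,
             6666, 434, 310, 10895, 1017, 297, 1570, 3088, 4412, 29889,
             2]

# The output begins with the prompt, so everything after it is newly generated
assert generated[:len(prompt_ids)] == prompt_ids
new_token_ids = generated[len(prompt_ids):]
print(f"{len(new_token_ids)} new tokens: {new_token_ids}")
```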


# Few shot prompting
Few-shot prompting relies on in-context learning: the prompt contains one or more solved examples, and the model continues the final, unsolved example by predicting the next tokens in the same format.

In [86]:
few_shot_prompt = [
    "User:",
    "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
    "Describe this image.\n Assistant: An image of A puppy in a flower bed. Fun fact: the dog is 1 foot tall.\n",
    "User:",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg",
    "Describe this image.\n Assistant: ",
]   
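
Few-shot prompts like the one above follow a regular pattern, so they can be assembled from (image URL, answer) pairs. `build_few_shot_prompt` below is a hypothetical helper written for this notebook's prompt format. It is not part of `transformers`, and the example URLs are placeholders.

```python
def build_few_shot_prompt(examples, query_image_url, instruction="Describe this image."):
    """Build an interleaved prompt list: solved examples first, then the open query."""
    prompt = []
    for image_url, answer in examples:
        prompt += ["User:", image_url, f"{instruction}\n Assistant: {answer}\n"]
    # The final turn ends with an open "Assistant: " for the model to complete
    prompt += ["User:", query_image_url, f"{instruction}\n Assistant: "]
    return prompt

demo_prompt = build_few_shot_prompt(
    examples=[(
        "https://example.com/puppy.jpg",  # placeholder URL
        "An image of A puppy in a flower bed. Fun fact: the dog is 1 foot tall.",
    )],
    query_image_url="https://example.com/statue.jpg",  # placeholder URL
)
for part in demo_prompt:
    print(repr(part))
```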

In [87]:
inputs = processor(few_shot_prompt, return_tensors="pt").to("cuda")
print(inputs)
print(processor.decode(inputs['input_ids'][0], skip_special_tokens=False))
print(f"input_ids.shape = {inputs['input_ids'].shape}")  # [B, S]: batch size, sequence length
print(f"pixel_values.shape = {inputs['pixel_values'].shape}")  # [B, num_images, C, H, W]
print(f"image_attention_mask.shape = {inputs['image_attention_mask'].shape}")  # [B, S, num_images]

{'input_ids': tensor([[    1,  4911, 29901, 32000, 32001, 32000, 20355,   915,   445,  1967,
         29889,    13,  4007, 22137, 29901,   530,  1967,   310,   319,  2653,
         23717,   297,   263, 28149,  6592, 29889, 13811,  2114, 29901,   278,
         11203,   338, 29871, 29896,  3661, 15655, 29889,    13,  2659, 29901,
         32000, 32001, 32000, 20355,   915,   445,  1967, 29889,    13,  4007,
         22137, 29901]], device='cuda:1'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1]], device='cuda:1'), 'pixel_values': tensor([[[[[-0.5660, -0.5806, -0.5660,  ..., -0.7850, -0.7850, -0.7850],
           [-0.5660, -0.5660, -0.5514,  ..., -0.7704, -0.7850, -0.7850],
           [-0.5514, -0.5368, -0.5514,  ..., -0.7704, -0.7704, -0.7850],
           ...,
           [-1.1791, -1.1207, -1.1499,  ..., -1.1499, -1.1645, -1.1207],
   

In [88]:
generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_id)

In [89]:
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

User: Describe this image.
 Assistant: An image of A puppy in a flower bed. Fun fact: the dog is 1 foot tall.
User: Describe this image.
 Assistant: An image of The Statue of Liberty. Fun fact: the statue is 111 feet tall.
User:

Describe
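
The decoded text above contains the whole prompt, and the model keeps going into a new `User:` turn once it has answered. A small post-processing step can isolate the final answer. `extract_last_answer` is a hypothetical helper that assumes the `User:` / `Assistant:` markers used in this notebook.

```python
def extract_last_answer(decoded: str, answer_tag: str = "Assistant:", next_turn_tag: str = "User:") -> str:
    """Return the text after the last answer tag, cut off where a new turn starts."""
    after_tag = decoded.rsplit(answer_tag, 1)[-1]
    return after_tag.split(next_turn_tag, 1)[0].strip()

# Decoded text copied from the notebook output above
decoded = (
    "User: Describe this image.\n"
    " Assistant: An image of A puppy in a flower bed. Fun fact: the dog is 1 foot tall.\n"
    "User: Describe this image.\n"
    " Assistant: An image of The Statue of Liberty. Fun fact: the statue is 111 feet tall.\n"
    "User:\n\nDescribe"
)
answer = extract_last_answer(decoded)
print(answer)
```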


# VQA
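Visual question answering (VQA) asks open-ended questions about an image. It differs slightly from image captioning: the prompt contains a specific question for the model to answer.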

In [112]:
prompt = [
    "Instruction: Provide an answer to the question. Use the image to answer.\n",
    "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
    "Question: Where are these people and what is the girl's name? Answer:",
]
inputs = processor(prompt, return_tensors="pt").to("cuda")
print(processor.tokenizer.decode(inputs['input_ids'][0], skip_special_tokens=False))

<s> Instruction: Provide an answer to the question. Use the image to answer.
<fake_token_around_image><image><fake_token_around_image> Question: Where are these people and what is the girl's name? Answer:


In [118]:
generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_id, do_sample=False)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

Instruction: Provide an answer to the question. Use the image to answer.
 Question: Where are these people and what is the girl's name? Answer: They are in a park in New York City. The girl's name is Lily.



# Image classification

In [119]:
_CATEGORIES = [
    'animals','vegetables', 'city landscape', 'cars', 'office'
]

In [122]:
prompts = [
    f"Instruction: Classify the image into one of the following categories: {_CATEGORIES}.\n",
    "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
    "Category: ",
]

In [125]:
inputs = processor(prompts, return_tensors="pt").to("cuda")
print(inputs)
print(processor.decode(inputs['input_ids'][0], skip_special_tokens=False))

{'input_ids': tensor([[    1,  2799,  4080, 29901,  4134,  1598,   278,  1967,   964,   697,
           310,   278,  1494, 13997, 29901,  6024, 11576,  1338,   742,   525,
           345,   657,  1849,   742,   525, 12690, 24400,   742,   525, 29883,
          1503,   742,   525, 20205, 13359,    13, 32000, 32001, 32000, 17943,
         29901]], device='cuda:1'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:1'), 'pixel_values': tensor([[[[[ 0.4559,  0.3391,  0.2515,  ...,  0.2953,  0.2223, -0.5806],
           [ 0.4267,  0.2953,  0.2223,  ...,  0.3829,  0.1493, -0.7850],
           [ 0.3391,  0.1931,  0.0325,  ...,  0.3975,  0.0617, -0.9893],
           ...,
           [-0.5076, -0.4930, -0.5222,  ..., -0.8142, -0.7850, -0.7996],
           [-0.5368, -0.4930, -0.4784,  ..., -0.7996, -0.7704, -0.7558],
           [-0.4346, -0.3908, -0.4200,  ..., -0.7996, -0.7

In [136]:
bad_words_id = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
generated_ids = model.generate(**inputs, max_new_tokens=3, bad_words_ids=bad_words_id)
print(f"Model output = {processor.batch_decode(generated_ids, skip_special_tokens=True)[0]}")

Model output = Instruction: Classify the image into one of the following categories: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
 Category: Vegetables
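
The model answers `Vegetables` while the category list uses lowercase `vegetables`, so a downstream consumer may want to map the free-text answer back onto the list. `match_category` below is a hypothetical helper that assumes the answer follows the last `Category:` marker, as in the output above.

```python
def match_category(model_text: str, categories: list[str]):
    """Map the text after the last 'Category:' onto a known category, ignoring case."""
    tail = model_text.rsplit("Category:", 1)[-1].strip().lower()
    for category in categories:
        if tail.startswith(category.lower()):
            return category
    return None  # the model answered with something outside the list

categories = ['animals', 'vegetables', 'city landscape', 'cars', 'office']
model_output = (
    "Instruction: Classify the image into one of the following categories: "
    "['animals', 'vegetables', 'city landscape', 'cars', 'office'].\n"
    " Category: Vegetables"
)
predicted = match_category(model_output, categories)
print(predicted)
```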


# Image-guided text generation
Prompt the model to write a story about the image.

In [144]:
prompt = [
    "Instruction: Use the image to write a story.\n",
    "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80",
    "Story: \n"
]

In [145]:
inputs = processor(prompt, return_tensors="pt").to("cuda")
print(processor.decode(inputs['input_ids'][0], skip_special_tokens=False))

<s> Instruction: Use the image to write a story.
<fake_token_around_image><image><fake_token_around_image> Story: 



In [150]:
generated_ids = model.generate(**inputs, max_new_tokens=200, bad_words_ids=bad_words_id, num_beams=2)

In [151]:
print(f"Generated story = {processor.batch_decode(generated_ids, skip_special_tokens=True)[0]}")

Generated story = Instruction: Use the image to write a story.
 Story: 
I was walking down the street when I saw a house with a red door.  I thought to myself, “I wonder what’s on the other side of that red door?”  So, I knocked on the door.  The door opened, and there stood a beautiful woman.  She was wearing a red dress, and she had red lipstick on her lips.  She said to me, “Come in, come in.  I’ve been waiting for you.”

I walked into the house, and there was a red couch.  I sat down on the couch, and the woman sat down next to me.  She said to me, “I’ve been waiting for you, too.”

I said to her, “What do you mean, you’ve been waiting for me?”

She said to me, “I’ve been waiting for you for a long time.”

I said to her,


# Chatting

In [152]:
_FINED_TUNED_ID = "HuggingFaceM4/idefics-9b-instruct"
# Load the instruction-tuned checkpoint for chatting, not the base model
model = IdeficsForVisionText2Text.from_pretrained(_FINED_TUNED_ID, device_map="auto", quantization_config=quantization_config)

Loading checkpoint shards: 100%|██████████| 19/19 [00:48<00:00,  2.56s/it]
You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.


In [154]:
processor = AutoProcessor.from_pretrained(_FINED_TUNED_ID)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [160]:
prompts = [
    "User: What is in this image?",
    "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
    "<end_of_utterance>",  # This is a special token that tells the model that the user has finished speaking.
    "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",
    "\nUser:",
    "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
    "And who is that?<end_of_utterance>",
    "\nAssistant:",
]

In [162]:
inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to("cuda")   
print(f"pixel_values.shape = {inputs['pixel_values'].shape}")
print(processor.decode(inputs['input_ids'][0], skip_special_tokens=False))

pixel_values.shape = torch.Size([1, 2, 3, 224, 224])
<s> User: What is in this image?<fake_token_around_image><image><fake_token_around_image><end_of_utterance> 
Assistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance> 
User:<fake_token_around_image><image><fake_token_around_image> And who is that?<end_of_utterance> 
Assistant:


In [163]:
exit_token = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
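
The cell above looks up the id of `<end_of_utterance>`, presumably so that generation can stop at the end of the assistant's turn, for example by passing it as `eos_token_id` to `model.generate`. That intent is an assumption, since the generation call is not shown in this section. The toy sketch below shows the stopping rule on plain lists; the id `32002` and the other token ids are placeholders, not values read from the tokenizer.

```python
def truncate_at_stop(token_ids: list[int], stop_ids: set[int]) -> list[int]:
    """Keep tokens up to and including the first stop token, if one occurs."""
    for position, token_id in enumerate(token_ids):
        if token_id in stop_ids:
            return token_ids[: position + 1]
    return token_ids  # no stop token found: keep everything

STOP_ID = 32002  # placeholder id standing in for <end_of_utterance>
toy_generated = [450, 11203, 338, STOP_ID, 13, 2659, 29901]  # made-up ids
kept = truncate_at_stop(toy_generated, {STOP_ID})
print(kept)
```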