# Run Llama 3.2 Vision Models on Snowflake's ML Container runtime

Let's use Meta's Llama-3.2-11B-Vision-Instruct and Llama-3.2-90B-Vision-Instruct models with Hugging Face transformers!  

- Turn an image into text
- Turn an image of a table into a JSON representation
- Understand an invoice document

These multimodal models are capable of visual understanding: they take both text and images as input. For more information, check out the [Llama 3.2 11B](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) and [Llama 3.2 90B](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct) model cards.

These models require GPUs. The 11B version requires more than 30 GB of GPU memory, and the larger model requires a set of 8 A100s. For AWS customers, these correspond to the GPU_NV_M and GPU_NV_L compute pools, respectively.
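
As a rough sanity check on these sizing figures, weight memory can be estimated as parameter count times bytes per parameter. The sketch below assumes bfloat16 weights (2 bytes each, matching the `torch_dtype` used later in this notebook) and treats 11B and 90B as nominal parameter counts. The function name is hypothetical, and the estimate ignores activations, the KV cache, and image inputs, so real requirements are higher than these weight-only numbers.

```python
def estimate_weight_memory_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Estimate memory for model weights only, in GB (1 GB = 1e9 bytes).

    Assumption: bfloat16 weights use 2 bytes per parameter.
    """
    return num_params_billion * 1e9 * bytes_per_param / 1e9


# Nominal parameter counts taken from the model names.
for name, billions in [("11B", 11), ("90B", 90)]:
    print(f"Llama 3.2 {name}: about {estimate_weight_memory_gb(billions):.0f} GB for weights alone")
```

The 90B estimate is far more than a single GPU can hold, which is consistent with that model needing several GPUs.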

You will need to fill out a form to get access to the Meta models.

Meta Llama 3.2 is licensed under the LLAMA 3.2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring their compliance with the terms of this license and the Llama 3.2 Acceptable Use Policy.

Note: Meta does not grant rights for the multimodal models in the Llama 3.2 license to users domiciled in the European Union, or companies with principal place of business in the European Union. See the Llama 3.2 Acceptable Use Policy for more information.

### Standard Imports

In [None]:
# Import python packages
# We can also use Snowpark for our analyses!
from snowflake.snowpark.context import get_active_session
session = get_active_session()


### Upgrade transformers
If you get an error here, make sure you have an external access integration that allows installing packages. You will also need an external access integration for connecting to the Hugging Face Hub to download the model.

In [None]:
!pip install "transformers>=4.45.0" --upgrade
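
To confirm the upgrade took effect, the installed version can be compared against the `4.45.0` minimum. This is a minimal sketch: the helper names are hypothetical, and the comparison looks only at the leading numeric parts of the version string, so a pre-release such as `4.45.0rc1` is treated like `4.45.0`.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def version_tuple(v: str) -> tuple:
    """Keep the leading numeric components, e.g. "4.46.0.dev0" -> (4, 46, 0)."""
    parts = []
    for piece in v.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.45.0") -> bool:
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = version("transformers")
    print(f"transformers {installed}, new enough: {meets_minimum(installed)}")
except PackageNotFoundError:
    print("transformers is not installed")
```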

### Import task specific libraries

In [None]:
import requests
import torch
from PIL import Image
from huggingface_hub import login
from transformers import MllamaForConditionalGeneration, AutoProcessor

### Set parameters and log in to the Hugging Face Hub

Add your [Hugging Face token](https://huggingface.co/settings/tokens) for downloading Llama. Please note that the [Llama vision models](https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf) are gated: Meta requires you to submit a form on Hugging Face and be granted access before you can download the model.

In [None]:
hf_token = "hf_XXX"
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
#model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"
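
Hardcoding a token in a notebook makes it easy to leak. As one alternative, the sketch below reads the token from the `HF_TOKEN` environment variable and rejects the `hf_XXX` placeholder. The helper name `resolve_hf_token` is hypothetical, and whether an environment variable is the right place for secrets in your setup is an assumption to check.

```python
import os


def resolve_hf_token(fallback=None):
    """Return a Hugging Face token from HF_TOKEN, or from the fallback value.

    Raises ValueError if no token is found or the placeholder was left in place.
    """
    token = os.environ.get("HF_TOKEN", fallback)
    if not token or token == "hf_XXX":
        raise ValueError("Set HF_TOKEN or pass a real Hugging Face token.")
    return token


# Usage in this notebook (left commented out because it raises without a token):
# hf_token = resolve_hf_token()
```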

Let's use transformers, which makes it easy to use the new Llama models. Transformers has added support for [Llama vision models](https://huggingface.co/docs/transformers/en/model_doc/mllama) through the `MllamaForConditionalGeneration` class. Other transformer models require an AutoClass or a model-specific class; see the [transformers docs](https://huggingface.co/docs/transformers/en/model_doc/auto) for more.

Setting the `device_map` option to `auto` means that, on multi-GPU systems, the model will automatically use [Big Model Inference](https://huggingface.co/docs/accelerate/concept_guides/big_model_inference) from Hugging Face. This distributes the large vision language model across multiple GPUs!


In [None]:
login(hf_token)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
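
To see how `device_map="auto"` placed the model, you can inspect `model.hf_device_map`, a dictionary mapping module names to devices that is set when a model is loaded with a device map. The sketch below counts modules per device. The helper name and the module names in the example dictionary are hypothetical.

```python
from collections import Counter


def summarize_device_map(device_map: dict) -> dict:
    """Count how many modules were placed on each device.

    Devices may be GPU indices (ints) or strings such as "cpu" or "disk",
    so they are converted to strings for a uniform summary.
    """
    return dict(Counter(str(device) for device in device_map.values()))


# Illustrative input only; in the notebook, pass model.hf_device_map instead.
example_map = {"block_a": 0, "block_b": 0, "block_c": 1, "head": "cpu"}
print(summarize_device_map(example_map))
```

Any modules reported on `cpu` or `disk` mean the model did not fit entirely in GPU memory, and generation will be slower.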

### Image to Text

Let's use this image and convert it to text.

![image](https://huggingface.co/spaces/rajistics/llamavision/resolve/main/llama.png)

Let's make this fun by asking the model to turn its description of the image into a haiku.

In [None]:
url = "https://huggingface.co/spaces/rajistics/llamavision/resolve/main/llama.png"

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]

image = Image.open(requests.get(url, stream=True).raw)
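
Every example in this notebook uses the same message shape: a single user turn containing an image placeholder followed by a text prompt. A small helper can build it. The name `build_messages` is hypothetical and not part of transformers.

```python
def build_messages(prompt: str) -> list:
    """Build a single-turn chat message with one image placeholder and a text prompt."""
    return [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ]}
    ]


messages = build_messages("If I had to write a haiku for this one, it would be: ")
```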

In [None]:
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=300)
print(processor.decode(output[0]))

#### Results

<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>  
<|image|>If I had to write a haiku for this one, it would be: <|eot_id|><|start_header_id|>assistant<|end_header_id|>  
**Haiku: Cool Llama in Shades**  
In sunglasses, llama's gaze  
Chillin' with a smile, so bright  
Winter's chill, no fear<|eot_id|>
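
The printed result echoes the prompt and keeps special tokens such as `<|eot_id|>`. To print only the model's reply, one option is to drop the prompt tokens before decoding and pass `skip_special_tokens=True`. This is a minimal sketch in which the helper name is hypothetical and the notebook lines are shown as comments because they need the loaded model.

```python
def generated_only(output_ids, prompt_length):
    """Return only the tokens produced after the prompt.

    Works on any sliceable sequence, including a 1-D tensor such as output[0].
    """
    return output_ids[prompt_length:]


# In the notebook, using `inputs`, `output`, and `processor` from the cells above:
# prompt_length = inputs["input_ids"].shape[-1]
# reply_ids = generated_only(output[0], prompt_length)
# print(processor.decode(reply_ids, skip_special_tokens=True))
```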

### Table to Text/JSON

![image](https://huggingface.co/spaces/rajistics/llamavision/resolve/main/data-table-example1.png)

In [None]:
url = "https://huggingface.co/spaces/rajistics/llamavision/resolve/main/data-table-example1.png"
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Parse the table into a JSON representation where the methods are keys and the datasets are subkeys. "}
    ]}
]

image = Image.open(requests.get(url, stream=True).raw)

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=300)
print(processor.decode(output[0]))
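
The reply is free text, so the JSON may be wrapped in a code fence and can be cut off if generation reaches `max_new_tokens`. Before using the result downstream, it helps to extract and validate it. The sketch below assumes the reply contains at most one fenced block, and the helper name `extract_json` is hypothetical.

```python
import json
import re

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example


def extract_json(reply: str):
    """Return the parsed JSON from a model reply, or None if missing or invalid."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    # Fall back to the whole reply when no complete fenced block is found.
    candidate = match.group(1) if match else reply
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

A return value of `None` is a signal to retry, for example with a larger `max_new_tokens`.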

#### Results

<|begin_of_text|><|start_header_id|>user<|end_header_id|>  
<|image|>Parse the table into a JSON representation where the methods are keys and the datasets are subkeys. <|eot_id|><|start_header_id|>assistant<|end_header_id|>  
Sure, here is the table data in a JSON representation:  
```  
{  
  "Salt Concentration (%)": {  
    "0": {  
      "Trial #1": 77.23,  
      "Trial #2": 74.50,  
      "Trial #3": 64.88,  
      "Trial #4": 75.27,  
      "Trial #5": 54.66  
    },  
    "3": {  
      "Trial #1": 85.23,  
      "Trial #2": 92.82,  
      "Trial #3": 78.91,  
      "Trial #4": 60.71,  
      "Trial #5": 57.96  
    },  
    "6": {  
      "Trial #1": 88.39,  
      "Trial #2": 100.05,  
      "Trial #3": 73.66,  
      "Trial #4": 66.51,  
      "Trial #5": 64.54  
    },  
    "9": {  
      "Trial #1": 80.71,  
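      "Trial #2": 100.05,  
```

The output is cut off here, most likely because generation reached the `max_new_tokens=300` limit.
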
### Document Understanding

![image](https://huggingface.co/spaces/huggingface-projects/llama-3.2-vision-11B/resolve/main/examples/invoice.png)

In [None]:
url = "https://huggingface.co/spaces/huggingface-projects/llama-3.2-vision-11B/resolve/main/examples/invoice.png"
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "How long does it take from invoice date to due date? Be short and concise."}
    ]}
]

image = Image.open(requests.get(url, stream=True).raw)

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=300)
print(processor.decode(output[0]))

#### Results 

<|begin_of_text|><|start_header_id|>user<|end_header_id|>  
<|image|>How long does it take from invoice date to due date? Be short and concise.<|eot_id|><|start_header_id|>assistant<|end_header_id|>  
To calculate the time difference between the invoice date and the due date, we need to subtract the invoice date from the due date.  
**Invoice Date:** 11/02/2019  
**Due Date:** 26/02/2019  
**Calculation:**  
*   Difference in days = Due Date - Invoice Date  
*   Difference in days = 26/02/2019 - 11/02/2019  
*   Difference in days = 15 days  
The time difference between the invoice date and the due date is **15 days**.<|eot_id|>
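
The model's arithmetic can be checked with the standard library. This sketch assumes the dates are in day/month/year order, which is the only valid reading since `26` cannot be a month. The helper name is hypothetical.

```python
from datetime import datetime


def days_between(start: str, end: str, fmt: str = "%d/%m/%Y") -> int:
    """Return the number of days from start to end, both given as date strings."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days


print(days_between("11/02/2019", "26/02/2019"))  # 15
```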
