# Run Llama-3.2-11B-Vision-Instruct on the ML Container Runtime using GPU_NV_M

Let's use the Llama-3.2-11B-Vision-Instruct and Llama-3.2-90B-Vision-Instruct models from Meta with Hugging Face transformers to:

- Turn an image into text.
- Turn an image of a table into a JSON representation.
- Answer a question about a document image.

These multimodal models are capable of visual understanding: they take both text and images as input. For more information, see the [Llama 3.2 11B](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) and [Llama 3.2 90B](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct) model cards.

Meta Llama 3.2 is licensed under the LLAMA 3.2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring their compliance with the terms of this license and the Llama 3.2 Acceptable Use Policy.

Note: Meta does not grant rights for the multimodal models in the Llama 3.2 license to users domiciled in the European Union, or companies with principal place of business in the European Union. See the Llama 3.2 Acceptable Use Policy for more information.

### Standard Imports

In [None]:
# Import python packages
# We can also use Snowpark for our analyses!
from snowflake.snowpark.context import get_active_session
session = get_active_session()


### Upgrade transformers
If you get an error here, make sure the notebook has an external access integration that allows it to install packages. You will also need an external access integration to connect to the Hugging Face Hub and download the model.

In [None]:
!pip install "transformers>=4.45.0" --upgrade
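
After the upgrade, it can help to confirm which version the kernel actually sees, since a notebook may need a restart before a newly installed package is picked up. The following is a minimal sketch using only the standard library; the helper names `version_tuple` and `meets_minimum` are hypothetical, and pre-release suffixes such as `.dev0` or `rc1` are simply ignored in the comparison.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = "4.45.0"  # minimum taken from the pip command above


def version_tuple(text):
    """Return the leading numeric parts, padded to three, e.g. '4.46.0.dev0' -> (4, 46, 0)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric piece such as 'dev0'
        parts.append(int(match.group()))
    parts += [0] * (3 - len(parts))  # pad short versions like '4.45'
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_TRANSFORMERS):
    """Hypothetical helper: True if `installed` is at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = version("transformers")
    print(f"transformers {installed}: meets minimum = {meets_minimum(installed)}")
except PackageNotFoundError:
    print("transformers is not installed in this environment")
```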

### Import task-specific libraries

In [None]:
import requests
import torch
from PIL import Image
from huggingface_hub import login
from transformers import MllamaForConditionalGeneration, AutoProcessor

### Set parameters and log in to the Hugging Face Hub

Add your [Hugging Face token](https://huggingface.co/settings/tokens) for downloading Llama. Note that the [Llama vision models](https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf) are gated: Meta requires you to submit a form on Hugging Face to request access, and you need to do that before you can download the model.

In [None]:
hf_token = "hf_XXX"
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
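
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. One alternative, sketched here under the assumption that the token has been placed in an environment variable, is to read it at runtime. The helper `get_hf_token` is hypothetical, and the variable name `HF_TOKEN` is an assumption about how your environment is set up.

```python
import os


def get_hf_token(env=None, var_name="HF_TOKEN"):
    """Hypothetical helper: read the token from an environment mapping.

    `env` defaults to os.environ; `var_name` is an assumed variable name.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} before running this notebook")
    return token


# In the notebook this would replace the hard-coded value:
# hf_token = get_hf_token()
```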

Let's use transformers, which makes it easy to use the new Llama models. Setting the `device_map` option to `auto` means that on multi-GPU systems the model is automatically placed across the available devices using [Big Model Inference](https://huggingface.co/docs/accelerate/concept_guides/big_model_inference) from Hugging Face.

In [None]:
login(hf_token)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
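
To get a feel for why `torch.bfloat16` and multi-GPU placement matter, the arithmetic below estimates the memory taken by the weights alone. It is a rough sketch: the parameter counts are the nominal 11B and 90B from the model names, the helper `weight_memory_gb` is hypothetical, and activations, the KV cache, and framework overhead are not included.

```python
# Bytes needed to store one parameter in each dtype
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params, dtype="bfloat16"):
    """Hypothetical helper: approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal parameter counts taken from the model names (an approximation)
for name, params in [("11B", 11e9), ("90B", 90e9)]:
    bf16 = weight_memory_gb(params, "bfloat16")
    fp32 = weight_memory_gb(params, "float32")
    print(f"{name}: about {bf16:.0f} GB in bfloat16, about {fp32:.0f} GB in float32")
```

By this estimate the 11B weights need roughly 22 GB in bfloat16, and the 90B weights roughly 180 GB, which is why the larger model generally has to be spread across several GPUs.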

### Image to Text

Let's use this image and convert it to text.

![image](https://miro.medium.com/v2/resize:fit:800/format:webp/1*3BMNlDaKPOlijIX-1erbJw.jpeg)

In [None]:
# URL of the image shown above
url = "https://miro.medium.com/v2/resize:fit:800/format:webp/1*3BMNlDaKPOlijIX-1erbJw.jpeg"
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]

image = Image.open(requests.get(url, stream=True).raw)

In [None]:
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=300)
print(processor.decode(output[0]))
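
The printed text above contains the prompt as well as the reply, because `generate` returns the input token ids followed by the newly generated ones. A minimal sketch of keeping only the reply is to slice off the prompt before decoding. The helper `new_token_ids` is hypothetical, and the demonstration uses plain integers in place of real token ids.

```python
def new_token_ids(output_ids, prompt_length):
    """Hypothetical helper: drop the first `prompt_length` ids from a 1-D sequence."""
    if prompt_length > len(output_ids):
        raise ValueError("prompt_length is longer than the output sequence")
    return output_ids[prompt_length:]


# Stand-in ids: the first three play the role of the prompt
demo_output = [101, 102, 103, 7, 8, 9]
print(new_token_ids(demo_output, 3))

# With the notebook's variables, the same idea would look like this:
# prompt_length = inputs["input_ids"].shape[-1]
# reply_ids = new_token_ids(output[0], prompt_length)
# print(processor.decode(reply_ids, skip_special_tokens=True))
```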

### Table to Text/JSON

![image](https://mrkremerscience.com/wp-content/uploads/2013/08/data-table-example1.png)

In [None]:
url = "https://mrkremerscience.com/wp-content/uploads/2013/08/data-table-example1.png"
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Parse the table into a JSON representation where the methods are keys and the datasets are subkeys. "}
    ]}
]

image = Image.open(requests.get(url, stream=True).raw)

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=300)
print(processor.decode(output[0]))
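
The model returns free text, so the JSON is often surrounded by an introduction or closing remark. One way to recover a usable object, sketched here with the standard library only, is to scan for the first position where a JSON object parses cleanly. The helper `extract_first_json` is hypothetical, and the sample reply is made up for illustration; it is not actual model output.

```python
import json


def extract_first_json(text):
    """Hypothetical helper: return the first JSON object found in `text`, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start`
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace, keep scanning
        return obj
    return None


# Made-up reply used only to exercise the helper
sample_reply = 'Here is the table: {"Trial 1": {"Distance": 10}} Hope this helps.'
print(extract_first_json(sample_reply))
```

If the helper returns `None`, the reply contained no parseable object, which can happen when `max_new_tokens` cuts the JSON off before its closing brace.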

### Document Understanding

![image](https://huggingface.co/spaces/huggingface-projects/llama-3.2-vision-11B/resolve/main/examples/invoice.png)

In [None]:
url = "https://huggingface.co/spaces/huggingface-projects/llama-3.2-vision-11B/resolve/main/examples/invoice.png"
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "How long does it take from invoice date to due date? Be short and concise."}
    ]}
]

image = Image.open(requests.get(url, stream=True).raw)

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=300)
print(processor.decode(output[0]))
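
The three examples above repeat the same steps with a different image and prompt, so they could be folded into small helpers. This is a sketch rather than part of the original notebook: `build_messages` and `ask` are hypothetical names, and `ask` only chains the calls already used above on the `model` and `processor` passed to it.

```python
def build_messages(prompt):
    """Hypothetical helper: one user turn with an image slot and a text prompt."""
    return [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ]}
    ]


def ask(image, prompt, model, processor, max_new_tokens=300):
    """Hypothetical helper: run the same steps as the cells above and return the text."""
    input_text = processor.apply_chat_template(
        build_messages(prompt), add_generation_prompt=True
    )
    inputs = processor(image, input_text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output[0])


# Example use with the notebook's objects:
# print(ask(image, "Be short and concise. What is the invoice total?", model, processor))
```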