# Aria Inference Recipes

Here is a vLLM version of the inference recipe, intended to give users faster inference.

## Section 3: Multi-page PDF Understanding (VLLM)

This section shows our recommended recipe for understanding a multi-page PDF (e.g. arXiv papers, financial reports, slides, scanned books) with the Aria model. We use the [LongVideoBench](https://arxiv.org/pdf/2407.15754) paper (July 2024, ^1) as an example and walk through an end-to-end tutorial, from a `.pdf` file to several types of responses.

By default, we use the split-image setting, since images in PDFs are information-rich.

^1: Given the model's knowledge cutoff, this paper was never seen during training.



### [General] Load Model and Processor

To maximize the input length that Aria (vLLM version) can handle on a single 80GB GPU, we recommend the following parameters:

- `max_model_len`: 38400
- `gpu_memory_utilization`: 0.84

These settings allow very long inputs, with up to 64 high-resolution (980 resolution) or 256 mid-resolution (490 resolution) images, to be fed to Aria on a single GPU, which covers our long-context evaluation cases in Sections 3 and 4. Enjoy!

In [1]:
# load Aria model & tokenizer with vllm

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import requests
import torch
from PIL import Image

from transformers import AutoTokenizer
from vllm import LLM, ModelRegistry, SamplingParams
from vllm.model_executor.models import _MULTIMODAL_MODELS

from aria.vllm.aria import AriaForConditionalGeneration

ModelRegistry.register_model(
    "AriaForConditionalGeneration", AriaForConditionalGeneration
)
_MULTIMODAL_MODELS["AriaForConditionalGeneration"] = (
    "aria",
    "AriaForConditionalGeneration",
)

model_id_or_path = "rhymes-ai/Aria"

model = LLM(
        model=model_id_or_path,
        tokenizer=model_id_or_path,
        dtype="bfloat16",
        limit_mm_per_prompt={"image": 256},
        enforce_eager=True,
        trust_remote_code=True,
        max_model_len=38400,
        gpu_memory_utilization=0.84,
    )

tokenizer = AutoTokenizer.from_pretrained(
        model_id_or_path, trust_remote_code=True, use_fast=False
    )

2024-10-04 20:54:40,866	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


INFO 10-04 20:54:41 config.py:1652] Downcasting torch.float32 to torch.bfloat16.
INFO 10-04 20:54:41 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/cpfs/29cd2992fe666f2a/user/zhoufan/yivl_open_source/models/uf_sft_0929_seqlen8k_from_sft0916_afterlong_iden_1600', speculative_config=None, tokenizer='/cpfs/29cd2992fe666f2a/user/zhoufan/yivl_open_source/models/uf_sft_0929_seqlen8k_from_sft0916_afterlong_iden_1600', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=38400, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observabili

Loading safetensors checkpoint shards:   0% Completed | 0/12 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   8% Completed | 1/12 [00:02<00:27,  2.53s/it]
Loading safetensors checkpoint shards:  17% Completed | 2/12 [00:05<00:27,  2.74s/it]
Loading safetensors checkpoint shards:  25% Completed | 3/12 [00:08<00:24,  2.71s/it]
Loading safetensors checkpoint shards:  33% Completed | 4/12 [00:10<00:21,  2.74s/it]
Loading safetensors checkpoint shards:  42% Completed | 5/12 [00:13<00:18,  2.70s/it]
Loading safetensors checkpoint shards:  50% Completed | 6/12 [00:15<00:13,  2.32s/it]
Loading safetensors checkpoint shards:  58% Completed | 7/12 [00:18<00:12,  2.52s/it]
Loading safetensors checkpoint shards:  67% Completed | 8/12 [00:20<00:10,  2.56s/it]
Loading safetensors checkpoint shards:  75% Completed | 9/12 [00:23<00:07,  2.56s/it]
Loading safetensors checkpoint shards:  83% Completed | 10/12 [00:26<00:05,  2.66s/it]
Loading safetensors checkpoint shards:  92% Completed | 11/12

INFO 10-04 20:55:17 model_runner.py:1025] Loading model weights took 47.1793 GB
INFO 10-04 20:55:19 gpu_executor.py:122] # GPU blocks: 2650, # CPU blocks: 936
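
As a quick sanity check on the log line above, the KV-cache capacity can be estimated from the reported block count. The sketch below assumes vLLM's default block size of 16 tokens per block (the `block_size` value does not appear in the log excerpt, so treat it as an assumption); `kv_cache_capacity` is a hypothetical helper, not part of vLLM.

```python
def kv_cache_capacity(num_gpu_blocks: int, block_size: int = 16) -> int:
    """Number of tokens the GPU KV cache can hold.

    block_size=16 is an assumed default, not a value read from the log.
    """
    return num_gpu_blocks * block_size


max_model_len = 38400                  # value passed to LLM(...) above
capacity = kv_cache_capacity(2650)     # "# GPU blocks: 2650" from the log
print(capacity, capacity >= max_model_len)
```

Under that assumption the cache holds 42,400 tokens, which is enough for one sequence at the full `max_model_len` of 38400.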


### Installing PyMuPDF & Defining the PDF2Image Function

To convert a PDF into images, we use the PyMuPDF package, installed as follows.

In [None]:
%%sh
pip install PyMuPDF

In [2]:
import fitz  # PyMuPDF
from PIL import Image, ImageFile

def pdf_to_images(pdf_path):
    # Open the PDF file using PyMuPDF
    doc = fitz.open(pdf_path)
    
    # Store each page as a PIL image
    images = []
    
    for page_num in range(doc.page_count):
        page = doc.load_page(page_num)
        
        # Convert page to a pixmap (image representation in PyMuPDF)
        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
        
        # Create a PIL image from the pixmap's byte data
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        images.append(img)
    
    doc.close()
    
    return images
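
`fitz.Matrix(2, 2)` scales the page by 2x in each direction. PDF page sizes are expressed in points (72 per inch), so a zoom of 2 corresponds to roughly 144 DPI. The sketch below only does this arithmetic: `rendered_size` is a hypothetical helper, the US Letter page (612 x 792 pt) is an assumed example, and PyMuPDF itself is not called, so the pixel size it reports may differ slightly.

```python
def rendered_size(width_pt: float, height_pt: float, zoom: float = 2.0):
    """Approximate (width_px, height_px, dpi) for a page rendered at `zoom`."""
    dpi = 72 * zoom  # PDF points are 1/72 inch
    return round(width_pt * zoom), round(height_pt * zoom), dpi


# Assumed example: a US Letter page at the zoom used in pdf_to_images
print(rendered_size(612, 792))
```

A higher zoom gives sharper text at the cost of more pixels per page, so it is worth keeping in mind together with the `max_image_size` setting used later.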




In [3]:
images = pdf_to_images("visuals/longvideobench.pdf")[:9]  # limit to 9 pages, removing appendix and references

In [4]:
from typing import List

def get_placeholders_for_multiple_pages(images: List):
    contents = []
    for i, _ in enumerate(images):
        contents.extend(
            [
                {"text": f"Page {i+1}: ", "type": "text"},
                {"text": None, "type": "image"},
                {"text": "\n", "type": "text"}
            ]
        )
    return contents

contents = get_placeholders_for_multiple_pages(images)
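
We assume here that each image placeholder in `contents` should correspond to exactly one entry in the image list passed to the model. A minimal sketch of a check for that, using a hypothetical helper `count_image_slots` and a hand-built two-page example in the same format as above:

```python
from typing import Dict, List


def count_image_slots(contents: List[Dict]) -> int:
    """Count the entries of type "image" in a message content list."""
    return sum(1 for item in contents if item.get("type") == "image")


# Hand-built content list for two pages, mirroring the structure above
example = [
    {"text": "Page 1: ", "type": "text"},
    {"text": None, "type": "image"},
    {"text": "\n", "type": "text"},
    {"text": "Page 2: ", "type": "text"},
    {"text": None, "type": "image"},
    {"text": "\n", "type": "text"},
]
print(count_image_slots(example))
```

In the notebook, the same check would compare `count_image_slots(contents)` against `len(images)` before calling the model.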

#### Pages Visualization

Let's visualize the paper as follows:

In [5]:
from PIL import Image

def create_image_gallery(images, columns=3, spacing=20, bg_color=(200, 200, 200)):
    """
    Combine multiple images into a single larger image in a grid format.
    
    Parameters:
        images (list of PIL.Image): Images to display, assumed to share one size.
        columns (int): Number of columns in the gallery.
        spacing (int): Space (in pixels) between the images in the gallery.
        bg_color (tuple): Background color of the gallery (R, G, B).
    
    Returns:
        PIL.Image: A single combined image.
    """
    # Use the first image's size as the cell size
    img_width, img_height = images[0].size  # Assuming all images are of the same size

    # Calculate rows needed for the gallery
    rows = (len(images) + columns - 1) // columns

    # Calculate the size of the final gallery image
    gallery_width = columns * img_width + (columns - 1) * spacing
    gallery_height = rows * img_height + (rows - 1) * spacing

    # Create a new image with the calculated size and background color
    gallery_image = Image.new('RGB', (gallery_width, gallery_height), bg_color)

    # Paste each image into the gallery
    for index, img in enumerate(images):
        row = index // columns
        col = index % columns

        x = col * (img_width + spacing)
        y = row * (img_height + spacing)

        gallery_image.paste(img, (x, y))

    return gallery_image
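
The grid arithmetic in `create_image_gallery` can be checked on its own. The sketch below repeats the same formulas in a hypothetical helper `gallery_size`; the 1224 x 1584 page size is an assumed example, not a value taken from the notebook.

```python
def gallery_size(n_images, img_width, img_height, columns=3, spacing=20):
    """Return (rows, gallery_width, gallery_height) for a grid of equal-size images."""
    rows = (n_images + columns - 1) // columns  # ceiling division
    width = columns * img_width + (columns - 1) * spacing
    height = rows * img_height + (rows - 1) * spacing
    return rows, width, height


print(gallery_size(9, 1224, 1584))   # 9 pages fill a 3 x 3 grid
print(gallery_size(10, 1224, 1584))  # a 10th page starts a 4th row
```

Because the width always reserves `columns` cells, a partially filled last row simply leaves background color in the unused cells.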

In [24]:
create_image_gallery(images).save("longvideobench_gallery.jpg")

### Task 1: Find and Narrate Figures in the Paper

The first task is to find and describe every figure in the paper, an ability of LMMs that an OCR + LLM pipeline cannot replicate.

In [6]:

messages = [
    {
        "role": "user",
        "content": [
            *contents,
            {"text": "Please narrate what each Figure (in total 4 Figures) is about in this paper.", "type": "text"},
        ],
    }
]

text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    outputs = model.generate(
            {
                "prompt_token_ids": text,
                "multi_modal_data": {
                    "image": images,
                    "max_image_size": 980,  # [Optional] The max image patch size, default `980`
                    "split_image": True,  # [Optional] whether to split the images, default `False`
                },
            },
            sampling_params=SamplingParams(max_tokens=4096, top_k=1, stop=["<|im_end|>"])
        )
    generated_tokens = outputs[0].outputs[0].token_ids
    result = tokenizer.decode(generated_tokens)

print(result)

Processed prompts: 100%|██████████| 1/1 [00:16<00:00, 16.27s/it, est. speed input: 715.35 toks/s, output: 17.64 toks/s]

1. **Figure 1**: This figure illustrates the concept of LONGVIDEOBENCH, a benchmark for long-context video-language understanding. It features a question about changes in a woman's backpack over time, with multiple frames showing different scenarios. The figure highlights the challenge of understanding long-term context in video data.

2. **Figure 2**: This figure provides examples of the 17 categories of referring reasoning questions included in LONGVIDEOBENCH. Each example shows a question related to specific video contexts, demonstrating the diversity and complexity of the questions designed to test the models' understanding of long-term video content.

3. **Figure 3**: This figure outlines the video and subtitle collection process for LONGVIDEOBENCH. It includes a flowchart showing the steps from downloading videos to annotating them, ensuring high-quality data for evaluating video-language models.

4. **Figure 4**: This figure presents the accuracy of various models (both propriet
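
The request dictionary passed to `model.generate` has the same shape for every task in this section, so it can be built by a small helper. This is a sketch only: `build_request` is a hypothetical name, and the keys simply mirror the ones used in the cell above.

```python
from typing import Any, Dict, List


def build_request(
    prompt_token_ids: List[int],
    images: List[Any],
    max_image_size: int = 980,
    split_image: bool = True,
) -> Dict[str, Any]:
    """Assemble the prompt dictionary in the same layout as the cell above."""
    return {
        "prompt_token_ids": prompt_token_ids,
        "multi_modal_data": {
            "image": images,
            "max_image_size": max_image_size,
            "split_image": split_image,
        },
    }


# Dummy stand-ins for token ids and page images, for illustration only
request = build_request([1, 2, 3], ["page-1", "page-2"])
print(sorted(request["multi_modal_data"]))
```

With the model loaded as in the first cell, `build_request(text, images)` could take the place of the inline dictionary in each task.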




### Task 2: Summarize the Paper


The second task is to summarize the paper. Ideally, the summary should draw not only on the abstract, introduction, and conclusion, but also on the important points made throughout the paper.

Aria can produce such a summary. See the result below, and try it on more papers.

In [7]:

messages = [
    {
        "role": "user",
        "content": [
            *contents,
            {"text": "Please provide an in-detail summary of the paper.", "type": "text"},
        ],
    }
]

text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    outputs = model.generate(
            {
                "prompt_token_ids": text,
                "multi_modal_data": {
                    "image": images,
                    "max_image_size": 980,  # [Optional] The max image patch size, default `980`
                    "split_image": True,  # [Optional] whether to split the images, default `False`
                },
            },
            sampling_params=SamplingParams(max_tokens=4096, top_k=1, stop=["<|im_end|>"])
        )
    generated_tokens = outputs[0].outputs[0].token_ids
    result = tokenizer.decode(generated_tokens)


print(result)

Processed prompts: 100%|██████████| 1/1 [00:38<00:00, 38.63s/it, est. speed input: 301.05 toks/s, output: 19.44 toks/s]

The paper titled "LONGVIDEOBENCH: A Benchmark for Long-context Interleaved Video-Language Understanding" by Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li introduces a comprehensive benchmark for evaluating Large Multimodal Models (LMMs) in understanding long-duration videos. The paper highlights the challenges in processing long-context inputs and presents LONGVIDEOBENCH, a novel benchmark designed to address these challenges.

### Key Points:

1. **Introduction to LONGVIDEOBENCH**:
   - The benchmark is introduced to measure the performance of LMMs on long-duration videos, which are videos up to an hour long.
   - It includes 3,763 videos with subtitles across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding.

2. **Referring Reasoning Task**:
   - The benchmark focuses on the referring reasoning task, where models need to interpret and reason about specific video contexts.
   - It includes 6,678 human-annotated multiple-choice questions 




### Task 3: Detailed Question-Answering

For the third task, we ask Aria a detail-oriented question whose answer appears in the middle of the paper.

In [8]:

messages = [
    {
        "role": "user",
        "content": [
            *contents,
            {"text": "According to the paper, what are the two major difficulties in understanding long videos? Reply me in Latex format.", "type": "text"},
        ],
    }
]

text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    outputs = model.generate(
            {
                "prompt_token_ids": text,
                "multi_modal_data": {
                    "image": images,
                    "max_image_size": 980,  # [Optional] The max image patch size, default `980`
                    "split_image": True,  # [Optional] whether to split the images, default `False`
                },
            },
            sampling_params=SamplingParams(max_tokens=4096, top_k=1, stop=["<|im_end|>"])
        )
    generated_tokens = outputs[0].outputs[0].token_ids
    result = tokenizer.decode(generated_tokens)

    
print(result)

Processed prompts: 100%|██████████| 1/1 [00:10<00:00, 10.61s/it, est. speed input: 1097.33 toks/s, output: 17.25 toks/s]

The two major difficulties in understanding long videos, as outlined in the paper, are:

1. **Retrieving details from long videos**: Existing Large Multimodal Models (LMMs) often struggle to extract specific details from long sequences. To accurately assess tasks in LONGVIDEOBENCH, there is a need for models to focus on granular details such as objects, events, or attributes, rather than providing a summary or topic overview.

2. **Reasoning contextual relations in long videos**: Questions in LONGVIDEOBENCH require models to analyze the interconnections among diverse contents. This involves understanding the relationships among objects, events, or attributes within the video, which is significantly challenging for extensive inputs. The tasks demand models to derive the correct answer by examining the context and relations across multiple moments in the video.<|im_end|>
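
The decoded output above ends with the stop string `<|im_end|>`. If that is unwanted, it can be trimmed after decoding. A minimal sketch with a hypothetical helper `strip_stop`:

```python
def strip_stop(text: str, stop: str = "<|im_end|>") -> str:
    """Remove one trailing stop string, then any trailing whitespace."""
    if text.endswith(stop):
        text = text[: -len(stop)]
    return text.rstrip()


print(strip_stop("across multiple moments in the video.<|im_end|>"))
```

Hugging Face tokenizers also accept `skip_special_tokens=True` in `decode`, which may give the same result if `<|im_end|>` is registered as a special token for this tokenizer (not verified here).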



