# Aria Inference Recipes

## Section 3: Multi-page PDF Understanding

This section shows recipes for understanding a multi-page PDF (e.g. arXiv papers, financial reports, slides, scanned books) with the Aria model. We use the [LongVideoBench](https://arxiv.org/pdf/2407.15754) paper (July 2024, ^1) as an example and walk through an end-to-end tutorial, from a `.pdf` file to several types of responses.

By default, we use the split-image setting, since page images in PDFs are information-rich.

^1: Given the model's knowledge cutoff, this paper was not seen during training.



### [General] Load Model and Processor

Because multi-page inputs grow large quickly, we load the model and process the images on two 80GB GPUs (GPU 0 and 1). If you hit an out-of-memory (OOM) error, make more GPUs visible to the model. The `device_map="auto"` argument automatically shards the model parameters across all visible GPUs.

In [1]:
# load Aria model & processor

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id_or_path = "rhymes-ai/Aria"

model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)

processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)



Loading checkpoint shards: 100%|██████████| 12/12 [00:36<00:00,  3.03s/it]
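
To see how `device_map="auto"` distributed the model, one option is to summarize the `hf_device_map` attribute that `transformers` attaches to models loaded with a device map. The sketch below is a minimal helper under stated assumptions: the function name `summarize_device_map` and the sample mapping are hypothetical, and the module names in a real mapping will differ.

```python
from collections import Counter
from typing import Dict, Union


def summarize_device_map(device_map: Dict[str, Union[int, str]]) -> Dict[str, int]:
    """Count how many entries of a device map sit on each device."""
    counts = Counter(str(device) for device in device_map.values())
    return dict(sorted(counts.items()))


# Hypothetical mapping for illustration only; with a loaded model, pass
# `model.hf_device_map` instead.
example_map = {
    "vision_tower": 0,
    "language_model.layers.0": 0,
    "language_model.layers.1": 1,
    "language_model.lm_head": 1,
}
print(summarize_device_map(example_map))  # {'0': 2, '1': 2}
```

If many entries land on `cpu` or `disk`, the visible GPUs probably do not have enough free memory, and exposing more GPUs through `CUDA_VISIBLE_DEVICES` is worth trying.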


### Installing PyMuPDF & Defining PDF2Image Function

To convert a PDF into images, we use the PyMuPDF package. We install it as follows.

In [None]:
%%sh
pip install PyMuPDF

In [2]:
import fitz  # PyMuPDF
from PIL import Image, ImageFile

def pdf_to_images(pdf_path):
    # Open the PDF file using PyMuPDF
    doc = fitz.open(pdf_path)
    
    # Store each page as a PIL image
    images = []
    
    for page_num in range(doc.page_count):
        page = doc.load_page(page_num)
        
        # Convert page to a pixmap (image representation in PyMuPDF)
        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
        
        # Create a PIL image from the pixmap's byte data
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        images.append(img)
    
    doc.close()
    
    return images
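
The `fitz.Matrix(2, 2)` argument scales the page by a factor of 2 in each direction. Since PDF page sizes are given in points (1/72 inch), a zoom of 2 corresponds to roughly 144 DPI. The following sketch estimates the rendered pixel size for a given zoom. `rendered_size` and `effective_dpi` are hypothetical helpers, and the exact pixmap dimensions PyMuPDF returns may differ by a pixel because of rounding.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF user-space units per inch


def rendered_size(width_pt: float, height_pt: float, zoom: float = 2.0) -> Tuple[int, int]:
    """Approximate pixel size of a page rendered with a uniform zoom factor."""
    return round(width_pt * zoom), round(height_pt * zoom)


def effective_dpi(zoom: float) -> float:
    """Resolution implied by a uniform zoom factor."""
    return POINTS_PER_INCH * zoom


# US Letter is 612 x 792 points.
print(rendered_size(612, 792))  # (1224, 1584)
print(effective_dpi(2.0))       # 144.0
```

Lowering the zoom reduces the image size and memory use, at the cost of legibility for small text.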




In [3]:
images = pdf_to_images("visuals/longvideobench.pdf")[:9]  # keep the first 9 pages, dropping the references and appendix

In [4]:
from typing import List

def get_placeholders_for_multiple_pages(images: List):
    contents = []
    for i, _ in enumerate(images):
        contents.extend(
            [
                {"text": f"Page {i+1}: ", "type": "text"},
                {"text": None, "type": "image"},
                {"text": "\n", "type": "text"}
            ]
        )
    return contents

contents = get_placeholders_for_multiple_pages(images)
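
Each `{"type": "image"}` entry is a placeholder that is expected to be filled, in order, by one element of the `images` list passed to the processor, so the two counts should agree. A quick sanity check can be sketched as follows, using a hypothetical helper `count_image_placeholders` and a hand-written two-page list in the same format:

```python
from typing import Dict, List, Optional


def count_image_placeholders(contents: List[Dict[str, Optional[str]]]) -> int:
    """Return the number of image placeholders in a chat `content` list."""
    return sum(1 for item in contents if item.get("type") == "image")


# Hand-written equivalent of the placeholders for a two-page document.
two_page_contents = [
    {"text": "Page 1: ", "type": "text"},
    {"text": None, "type": "image"},
    {"text": "\n", "type": "text"},
    {"text": "Page 2: ", "type": "text"},
    {"text": None, "type": "image"},
    {"text": "\n", "type": "text"},
]

num_pages = 2
assert count_image_placeholders(two_page_contents) == num_pages
```

With the real data, the same check would compare the count against `len(images)`.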

#### Pages Visualization

Let's visualize the paper as follows:

In [22]:
from PIL import Image

def create_image_gallery(images, columns=3, spacing=20, bg_color=(200, 200, 200)):
    """
    Combine multiple images into a single larger image in a grid format.
    
    Parameters:
        images (list of PIL.Image): Images to arrange in the gallery.
        columns (int): Number of columns in the gallery.
        spacing (int): Space (in pixels) between the images in the gallery.
        bg_color (tuple): Background color of the gallery (R, G, B).
    
    Returns:
        PIL.Image: A single combined image.
    """
    # Use the first image's size as the cell size (all images are assumed to share it)
    img_width, img_height = images[0].size

    # Calculate rows needed for the gallery
    rows = (len(images) + columns - 1) // columns

    # Calculate the size of the final gallery image
    gallery_width = columns * img_width + (columns - 1) * spacing
    gallery_height = rows * img_height + (rows - 1) * spacing

    # Create a new image with the calculated size and background color
    gallery_image = Image.new('RGB', (gallery_width, gallery_height), bg_color)

    # Paste each image into the gallery
    for index, img in enumerate(images):
        row = index // columns
        col = index % columns

        x = col * (img_width + spacing)
        y = row * (img_height + spacing)

        gallery_image.paste(img, (x, y))

    return gallery_image

In [24]:
create_image_gallery(images).save("longvideobench_gallery.jpg")
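
`create_image_gallery` assumes every page has the same size. For PDFs with mixed page sizes, one possible approach is to size each grid cell to the largest page. The sketch below computes only the layout, without drawing anything. `grid_layout` is a hypothetical helper and not part of the recipe above, and the 1224 x 1584 page size in the example is an assumed value (US Letter at zoom 2).

```python
from typing import List, Tuple


def grid_layout(
    sizes: List[Tuple[int, int]], columns: int = 3, spacing: int = 20
) -> Tuple[Tuple[int, int], List[Tuple[int, int]]]:
    """Return the gallery size and the top-left paste offset of each image."""
    # Size every cell to the largest width and height among the images
    cell_w = max(w for w, _ in sizes)
    cell_h = max(h for _, h in sizes)
    rows = (len(sizes) + columns - 1) // columns

    gallery_size = (
        columns * cell_w + (columns - 1) * spacing,
        rows * cell_h + (rows - 1) * spacing,
    )
    offsets = [
        ((i % columns) * (cell_w + spacing), (i // columns) * (cell_h + spacing))
        for i in range(len(sizes))
    ]
    return gallery_size, offsets


# Nine example pages of an assumed 1224 x 1584 pixels, three per row.
size, offsets = grid_layout([(1224, 1584)] * 9)
print(size)        # (3712, 4792)
print(offsets[4])  # (1244, 1604)
```

At this resolution the gallery is several thousand pixels on each side, so downscaling the pages before building it may be preferable for a quick preview.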

### Task 1: Find and Narrate Figures in the Paper

The first task is to find and describe all the figures in this paper. An LMM can do this directly from the page images, which is an ability that an OCR + LLM pipeline cannot replace.

In [5]:

messages = [
    {
        "role": "user",
        "content": [
            *contents,
            {"text": "Please narrate what each Figure (in total 4 Figures) is about in this paper.", "type": "text"},
        ],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=images, return_tensors="pt", split_image=True)
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=2048,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        do_sample=False,
    )
    # Drop the prompt tokens so that only the newly generated tokens are decoded
    output_ids = output[0][inputs["input_ids"].shape[1]:]
    result = processor.decode(output_ids, skip_special_tokens=True)

print(result)




**Figure 1:** This figure illustrates the LONGVIDEOBENCH benchmark, which features referring questions that reference specific video contexts to answer questions about them. The left side (a) shows an example of a referring query where a woman with a red top and black backpack is described, and the reader is asked what changes occur to her backpack. The right side (b) shows a graph comparing the performance of different models (GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo, Gemini-1.5-Flash) on the benchmark as the number of frames increases.

**Figure 2:** This figure provides examples of the 17 categories of referring reasoning questions in the LONGVIDEOBENCH. It is divided into two levels: Perception (L1) and Relation (L2). Each category is illustrated with examples, such as identifying objects, events, or attributes in the video context.

**Figure 3:** This figure depicts the video and subtitle collection process for LONGVIDEOBENCH. It shows how videos are downloaded, transcribed, and anno
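
The tasks in this section reuse the same page placeholders and change only the final question. A small helper can make that explicit. `build_messages` below is a hypothetical convenience function, shown with a one-page placeholder list so that it runs on its own.

```python
from typing import Any, Dict, List


def build_messages(contents: List[Dict[str, Any]], question: str) -> List[Dict[str, Any]]:
    """Append a question to the page placeholders as a single user turn."""
    return [
        {
            "role": "user",
            "content": [*contents, {"text": question, "type": "text"}],
        }
    ]


# One-page placeholder list, in the same format as `contents` in this section.
page_contents = [
    {"text": "Page 1: ", "type": "text"},
    {"text": None, "type": "image"},
    {"text": "\n", "type": "text"},
]
messages = build_messages(page_contents, "Please provide an in-detail summary of the paper.")
print(messages[0]["content"][-1]["text"])
```

The returned list has the same shape as the `messages` variable passed to `processor.apply_chat_template` in each task.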

### Task 2: Summarize the Paper


The second task is to summarize the paper. Ideally, the summary should draw not only on the abstract, introduction, and conclusion, but also on the important points made throughout the paper.

Aria is able to provide such a summary. See the results below, and try it on more papers.

In [6]:

messages = [
    {
        "role": "user",
        "content": [
            *contents,
            {"text": "Please provide an in-detail summary of the paper.", "type": "text"},
        ],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=images, return_tensors="pt", split_image=True)
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=2048,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        do_sample=False,
    )
    # Drop the prompt tokens so that only the newly generated tokens are decoded
    output_ids = output[0][inputs["input_ids"].shape[1]:]
    result = processor.decode(output_ids, skip_special_tokens=True)

print(result)


The paper titled "LONGVIDEOBENCH: A Benchmark for Long-context Interleaved Video-Language Understanding" introduces a comprehensive benchmark designed to evaluate the performance of Large Multimodal Models (LMMs) in understanding long-duration videos. The authors, Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li, highlight the limitations of existing benchmarks that primarily focus on short videos and do not adequately test the capabilities of LMMs to handle long-context multimodal inputs.

The paper is structured into several sections:

1. **Introduction**:
   - Discusses the growth in the processing capabilities of foundation models, which now handle longer contexts.
   - Emphasizes the need for benchmarks that can evaluate these models on long-duration videos.
   - Introduces LONGVIDEOBENCH as a solution to this gap.

2. **The Referring Reasoning Task**:
   - Identifies the primary challenges in multimodal long-context understanding.
   - Defines the referring reasoning task, which in

### Task 3: Detailed Question-Answering

For the third task, we ask Aria a detail-oriented question whose answer lies in the middle of the paper.

In [7]:

messages = [
    {
        "role": "user",
        "content": [
            *contents,
            {"text": "According to the paper, what are the two major difficulties in understanding long videos? Reply me in Latex format.", "type": "text"},
        ],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=images, return_tensors="pt", split_image=True)
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        do_sample=False,
    )
    # Drop the prompt tokens so that only the newly generated tokens are decoded
    output_ids = output[0][inputs["input_ids"].shape[1]:]
    result = processor.decode(output_ids, skip_special_tokens=True)

print(result)



The two major difficulties in understanding long videos, as outlined in the paper, are:

\begin{enumerate}
    \item \textbf{Retrieving details from long videos:} Existing Large Multimodal Models (LMMs) often struggle to extract specific details from long sequences. To accurately assess tasks in LONGVIDEOBENCH, a focus on granular details such as objects, events, or attributes is necessary, rather than a summary or topic overview.
    \item \textbf{Reasoning contextual relations in long videos:} Beyond mere retrieval, it is significantly challenging for LMMs to reason about the relationships among different elements within a long video. Questions in LONGVIDEOBENCH are designed to compel LMMs to interpret the interconnections among diverse context clues spread across the video, necessitating a deep understanding of the temporal and contextual dynamics.
\end{enumerate}<|im_end|>
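
The output above ends with a literal `<|im_end|>`, which suggests that the stop string can survive decoding. A minimal sketch for trimming it afterwards follows; `strip_stop_strings` is a hypothetical helper, not part of the Aria or `transformers` API.

```python
from typing import Sequence


def strip_stop_strings(text: str, stop_strings: Sequence[str] = ("<|im_end|>",)) -> str:
    """Remove one trailing stop string (and trailing whitespace) if present."""
    cleaned = text.rstrip()
    for stop in stop_strings:
        if cleaned.endswith(stop):
            return cleaned[: -len(stop)].rstrip()
    return cleaned


print(strip_stop_strings("\\end{enumerate}<|im_end|>"))  # \end{enumerate}
```

In the cells of this section, it could be applied as `result = strip_stop_strings(result)` before printing.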
