## Instructions from the CLI

This tutorial tests the performance of [olmOCR](https://github.com/allenai/olmocr).

The project's blog is [here](https://olmocr.allenai.org/blog).

Before installing olmOCR, make sure the required system packages (poppler and fonts) are present.

```bash
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
```
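To confirm that the system packages above ended up on the `PATH`, a quick check from Python can help. This is a minimal sketch: the tool list (`pdftoppm` and `pdfinfo`, both shipped with `poppler-utils`) is an assumption chosen for illustration, and the helper name `missing_tools` is hypothetical.

```python
import shutil

# Binaries shipped with poppler-utils. Which ones olmOCR actually calls
# is an assumption here; adjust the list to your setup.
REQUIRED_TOOLS = ["pdftoppm", "pdfinfo"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools found.")
```

If anything is reported missing, rerun the `apt-get install` step before continuing.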

Then install olmOCR from source in a fresh conda environment:

```bash
conda create -n olmocr python=3.11
conda activate olmocr

git clone https://github.com/allenai/olmocr.git
cd olmocr

pip install -e .[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```

Make sure you have uploaded one or more PDF documents.

Then, from the root of the cloned `olmocr` repository, run:

```bash
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
```

Results are saved under `localworkspace/results` and can be displayed with:

```bash
cat localworkspace/results/output_*.jsonl
```

In [None]:
from pathlib import Path
import json

def read_jsonl_file(file_path: str) -> list[dict]:
    """
    Reads a JSONL (JSON Lines) file and returns a list of dictionaries.

    Args:
        file_path (str): Path to the .jsonl file

    Returns:
        list[dict]: List of parsed JSON objects
    """
    data = []
    path = Path(file_path)

    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip empty lines
                data.append(json.loads(line))

    return data

# Example usage
file_path = "/content/localworkspace/results/output_4529a88ad87a0a3581876cf677a3e908b7a8ba9e.jsonl"
jsonl_data = read_jsonl_file(file_path)

# Print the first item for inspection
print(json.dumps(jsonl_data[0], indent=2))


{
  "id": "5cc91d9dfb17134a7743458ae19394a303e0d18c",
  "text": "Decision Process Sequence Diagram\n### Process Details\n\n| Process Type | Assign Value |\n|--------------|--------------|\n|              |              |\n## Decision Process Tree Components\n\n| Segmentation Tree |\n|-------------------|\n| [MY][BCD][Limit Assignment and Mgmt] DSR Cap DPT |\n\n[MY][BCD][Limit Assignment and Mgmt] DSR Cap DPT\n| Ref | Leaf Node ID | Leaf Node Name, Sc 700-750 | Outcome Cap |\n|-----|--------------|-----------------------------|-------------|\n| 121 | NTC, 4W, <4K, KV, Sc 700-750 | DSR Cap 35% |\n| 122 | NTC, 4W, <4K, KV, Sc 650-<700 | DSR Cap 35% |\n| 123 | NTC, 4W, <4K, KV, Sc 600-<650 | DSR Cap 35% |\n| 124 | NTC, 4W, <4K, KV, Sc 550-<600 | DSR Cap 30% |\n| 125 | NTC, 4W, <4K, KV, Sc 500-<550 | DSR Cap 30% |\n| 126 | NTC, 4W, <4K, KV, Sc 450-<500 | DSR Cap 25% |\n| 127 | NTC, 4W, <4K, KV, Sc 400-<450 | DSR Cap 20% |\n| 128 | NTC, 4W, <4K, KV, Sc 350-<400 | DSR Cap 20% |\n| 129 | NTC, 
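The file name above contains a hash that changes from run to run, so hard-coding it is fragile. A minimal sketch that loads every `output_*.jsonl` file in the results directory instead is shown below. The helper name `load_all_results` is hypothetical, and the relative path assumes the notebook runs from the directory that contains `localworkspace`.

```python
import json
from pathlib import Path


def load_all_results(results_dir: str) -> list[dict]:
    """Load every record from all output_*.jsonl files in results_dir."""
    records = []
    for path in sorted(Path(results_dir).glob("output_*.jsonl")):
        with path.open("r", encoding="utf-8") as f:
            # One JSON object per non-empty line
            records.extend(json.loads(line) for line in f if line.strip())
    return records


records = load_all_results("localworkspace/results")
print(f"Loaded {len(records)} document(s)")
for record in records:
    # "id" and "text" are the two fields visible in the output above
    print(record.get("id"), len(record.get("text", "")))
```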

In [None]:
print(jsonl_data[0]['text'])

Decision Process Sequence Diagram
### Process Details

| Process Type | Assign Value |
|--------------|--------------|
|              |              |
## Decision Process Tree Components

| Segmentation Tree |
|-------------------|
| [MY][BCD][Limit Assignment and Mgmt] DSR Cap DPT |

[MY][BCD][Limit Assignment and Mgmt] DSR Cap DPT
| Ref | Leaf Node ID | Leaf Node Name, Sc 700-750 | Outcome Cap |
|-----|--------------|-----------------------------|-------------|
| 121 | NTC, 4W, <4K, KV, Sc 700-750 | DSR Cap 35% |
| 122 | NTC, 4W, <4K, KV, Sc 650-<700 | DSR Cap 35% |
| 123 | NTC, 4W, <4K, KV, Sc 600-<650 | DSR Cap 35% |
| 124 | NTC, 4W, <4K, KV, Sc 550-<600 | DSR Cap 30% |
| 125 | NTC, 4W, <4K, KV, Sc 500-<550 | DSR Cap 30% |
| 126 | NTC, 4W, <4K, KV, Sc 450-<500 | DSR Cap 25% |
| 127 | NTC, 4W, <4K, KV, Sc 400-<450 | DSR Cap 20% |
| 128 | NTC, 4W, <4K, KV, Sc 350-<400 | DSR Cap 20% |
| 129 | NTC, 4W, <4K, KV, Sc <350 | DSR Cap 20% |
| 130 | NTC, 4W, <4K, KV, Sc <350 | DSR Cap 20% |
|
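One quick way to judge the extracted tables is to split the pipe-delimited lines into cells and compare row widths. In the output above, the header row has four cells while each data row has three, which suggests the column structure was not recovered exactly. The sketch below is a simple line-based splitter, not a full Markdown parser, and the helper name `parse_pipe_rows` is hypothetical.

```python
def parse_pipe_rows(text: str) -> list[list[str]]:
    """Split Markdown pipe-table lines into lists of cell strings."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines of the form "| ... |"; a lone "|" is skipped
        if not (line.startswith("|") and line.endswith("|")) or len(line) < 2:
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # Skip separator rows such as |-----|-----|
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    return rows


# Lines copied from the OCR output above
sample = "\n".join([
    "| Ref | Leaf Node ID | Leaf Node Name, Sc 700-750 | Outcome Cap |",
    "|-----|--------------|-----------------------------|-------------|",
    "| 121 | NTC, 4W, <4K, KV, Sc 700-750 | DSR Cap 35% |",
    "|",
])

rows = parse_pipe_rows(sample)
header, body = rows[0], rows[1:]
mismatched = [row for row in body if len(row) != len(header)]
print(f"header cells: {len(header)}, mismatched rows: {len(mismatched)}")
```

The full `text` field contains several tables, so in practice the rows would need to be grouped per table before comparing widths.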

## Instructions from Python

In [None]:
%%capture

! pip install olmocr

Collecting olmocr
  Downloading olmocr-0.1.60-py3-none-any.whl.metadata (26 kB)
Collecting cached-path (from olmocr)
  Downloading cached_path-1.7.1-py3-none-any.whl.metadata (19 kB)
Collecting pypdf>=5.2.0 (from olmocr)
  Downloading pypdf-5.4.0-py3-none-any.whl.metadata (7.3 kB)
Collecting pypdfium2 (from olmocr)
  Downloading pypdfium2-4.30.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48.2/48.2 kB 1.7 MB/s eta 0:00:00
Collecting lingua-language-detector (from olmocr)
  Downloading lingua_language_detector-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (32 kB)
Collecting ftfy (from olmocr)
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Collecting markdown2 (from olmocr)
  Downloading markdown2-2.5.3-py3-none-any.whl.metadata (2.1 kB)
Collecting boto3 (from olmocr)
  Downloading boto3-1.37.21-py3-none-any.whl.metadata (6.7 kB)
Coll

In [None]:
import torch
import base64
import urllib.request

from io import BytesIO
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

from olmocr.data.renderpdf import render_pdf_to_base64png
from olmocr.prompts import build_finetuning_prompt
from olmocr.prompts.anchor import get_anchor_text

# Initialize the model
model = Qwen2VLForConditionalGeneration.from_pretrained("allenai/olmOCR-7B-0225-preview", torch_dtype=torch.bfloat16).eval()
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Grab a sample PDF
# urllib.request.urlretrieve("https://molmo.allenai.org/paper.pdf", "./paper.pdf")

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Qwen2VLForConditionalGeneration(
  (visual): Qwen2VisionTransformerPretrainedModel(
    (patch_embed): PatchEmbed(
      (proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
    )
    (rotary_pos_emb): VisionRotaryEmbedding()
    (blocks): ModuleList(
      (0-31): 32 x Qwen2VLVisionBlock(
        (norm1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (norm2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (attn): VisionSdpaAttention(
          (qkv): Linear(in_features=1280, out_features=3840, bias=True)
          (proj): Linear(in_features=1280, out_features=1280, bias=True)
        )
        (mlp): VisionMlp(
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (act): QuickGELUActivation()
          (fc2): Linear(in_features=5120, out_features=1280, bias=True)
        )
      )
    )
    (merger): PatchMerger(
      (ln_q): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
      (mlp): Seq

The function `render_pdf_to_base64png` requires the `poppler-utils` system package:

```bash
apt-get install -y poppler-utils
```

In [None]:
! apt-get install -y poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.6).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.


In [None]:
# Render page 1 to an image
image_base64 = render_pdf_to_base64png("/content/test.pdf", 1, target_longest_image_dim=1024)

# Build the prompt, using document metadata
anchor_text = get_anchor_text("/content/test.pdf", 1, pdf_engine="pdfreport", target_length=4000)
prompt = build_finetuning_prompt(anchor_text)

# Build the full prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
        ],
    }
]

# Apply the chat template and processor
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
main_image = Image.open(BytesIO(base64.b64decode(image_base64)))

inputs = processor(
    text=[text],
    images=[main_image],
    padding=True,
    return_tensors="pt",
)
inputs = {key: value.to(device) for (key, value) in inputs.items()}


# Generate the output
output = model.generate(
    **inputs,
    temperature=0.8,
    max_new_tokens=50,
    num_return_sequences=1,
    do_sample=True,
)

# Decode the output
prompt_length = inputs["input_ids"].shape[1]
new_tokens = output[:, prompt_length:]
text_output = processor.tokenizer.batch_decode(
    new_tokens, skip_special_tokens=True
)

print(text_output)
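# Reference output for the Molmo sample PDF (https://molmo.allenai.org/paper.pdf), cut off at 50 new tokens:
# ['{"primary_language":"en","is_rotation_valid":true,"rotation_correction":0,"is_table":false,"is_diagram":false,"natural_text":"Molmo and PixMo:\\nOpen Weights and Open Data\\nfor State-of-the']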


['{"primary_language":"en","is_rotation_valid":true,"rotation_correction":0,"is_table":false,"is_diagram":false,"natural_text":"DOMAIN: GFIN\\nSOLUTION: DAX\\nUSER: stefani.diorani']
