<!-- <div style="background-image: linear-gradient(rgba(255, 255, 255, 0.5), rgba(255, 255, 255, 0.5)), url('assets/paper.png')"> -->

## Converting PDF documents to Markdown with Google Gemini

<img src="assets/is_all_you_need.png" width="400" style="display: inline" />
<img src="assets/lowry.png" width="300" style="display: inline" />


### The problem

Unlike Markdown or HTML, the data in a PDF page does not always come in its natural reading order. 
It's more like a canvas, with a list of elements like:
- `("World", font, location, transforms, ...)`
- `("Hello", font, location, transforms, ...)`
- `([Image], location, transforms, ...)`

that are placed at the correct location during rendering. It's generally not possible to reliably extract text  
in the natural order, and it gets especially messy with formulas:

![Attention](assets/attention.png)
```
In practice, we compute the attention function on a set of queries simultaneously, packed together | {...}
into a matrix | {...}
 Q | {...}
. The keys and values are also packed together into matrices | {...}
 K | {...}
 and | {...}
 V | {...}
 . We compute | {...}
the matrix of outputs as: | {...}
Attention( | {...}
Q, K, V | {...}
 ) = softmax( | {...}
QK | {...}
T | {...}
√ | {...}
d | {...}
k | {...}
) | {...}
V | {...}
(1) | {...}
```
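
To make the ordering problem concrete, here is a minimal sketch with hypothetical spans (the texts and coordinates are invented for illustration, not read from a real PDF). Sorting by position fixes a simple line, but a superscript sits slightly above its baseline and jumps ahead of the text it belongs to, which is roughly what happens to formulas.

```python
# Hypothetical spans in the order they are stored in the file: (text, x, y).
# The coordinates are made up for illustration.
spans = [
    ("World", 60, 100),
    ("Hello", 10, 100),
    ("T", 95, 96),  # a superscript sits slightly above the baseline
]

def naive_reading_order(spans):
    """Sort spans top-to-bottom, then left-to-right."""
    return [text for text, x, y in sorted(spans, key=lambda s: (s[2], s[1]))]

stored_order = [text for text, _, _ in spans]
sorted_order = naive_reading_order(spans)
print(stored_order)  # ['World', 'Hello', 'T']
print(sorted_order)  # ['T', 'Hello', 'World']
```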

On top of this, many older PDFs are scanned paper documents, and contain poorly  
recognized text, no text at all, full-page images instead of text, or images of page fragments.

Especially for a scanned PDF, we need to use AI to correctly extract the text  
and images from the document.

### Procedure

- Extract text and images from the document.
- 🤖 Ask Gemini if the images look good.
    - If the images look good, we don't need to do the extra extraction step.
    - If they don't look good:
        - 🤖 Use Gemini to find bounding boxes for all images in the page.
        - Extract images from the bounding boxes.
- 🤖 Give the page image, extracted text, and extracted images to Gemini, and generate markdown with references to images.
- Combine the markdown for all pages into a single document.
- 🤖 Pass the full document to Gemini and ask it to brush up the formatting to make it consistent across pages.
    - This step requires multi-step generation and context caching.
- Save the resulting markdown and images.
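
The control flow of the procedure can be sketched as follows. This is only an outline under stated assumptions: `Page`, `convert`, and every callable passed to it are hypothetical stand-ins for the model calls, chosen so the sketch runs without any API access.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    text: str
    images: list = field(default_factory=list)
    markdown: str = ""

def convert(pages, images_look_good, find_and_crop_images, to_markdown, brush_up):
    """Control flow of the procedure; every callable is a stand-in for a model call."""
    for page in pages:
        if not images_look_good(page):
            page.images = find_and_crop_images(page)  # bounding boxes -> crops
        page.markdown = to_markdown(page)
    combined = "\n\n".join(page.markdown for page in pages)
    return brush_up(combined)

# Toy stand-ins so the sketch runs on its own.
pages = [Page("Intro", ["ok.png"]), Page("Results", ["slice_1.png", "slice_2.png"])]
result = convert(
    pages,
    images_look_good=lambda p: len(p.images) == 1,
    find_and_crop_images=lambda p: ["figure.png"],
    to_markdown=lambda p: f"{p.text} ({', '.join(p.images)})",
    brush_up=str.strip,
)
print(result)
```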

### UI
There is a UI at the end of the notebook. Run the notebook (with `RUN_DEV_CELLS=False` for speed) and use it to convert your PDFs:

<img src="assets/ui.png" width="400" />

### API key
Set `GEMINI_API_KEY="your very secret key"` in the environment

### Findings

During this fun project I found:
- Gemini works best with simple and concise tasks. Initially I tried to combine multiple tasks into one step, and the model was struggling.
- Gemini is able to find figures/plots/images in a PDF page, but it's not 100% reliable on its own.
- Gemini is excellent at converting between plain text and different text-based formats.

#### 🚩🥽 Recitation error
Gemini can fail with the finish reason `RECITATION`, which triggers when the model generates large chunks of  
material that matches one of Google's databases, regardless of license. This of course makes using Gemini for format  
conversion challenging, as the generated text will often match large chunks of known material.

**GOOGLE: Pretty please, allow recitation if the recited material is also fully/largely present in the model's input!**

Luckily, this mechanism can be fooled easily. I ask the model to insert
```
[end of paragraph]
```
after every paragraph, and cut them out later. This seems to be enough to avoid this error.
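
The cut-out step can be sketched with a tolerant pattern. `strip_paragraph_markers` is a hypothetical helper, not the notebook's own function; it assumes the marker may appear with or without angle brackets and may be preceded by spaces.

```python
import re

# Matches "[end of paragraph]" and the "[<end of paragraph>]" variant,
# plus any spaces or tabs directly before the marker.
MARKER = re.compile(r"[ \t]*\[<?end of paragraph>?\]")

def strip_paragraph_markers(markdown: str) -> str:
    """Remove the paragraph markers the model was asked to insert."""
    return MARKER.sub("", markdown)

generated = "First paragraph. [end of paragraph]\n\nSecond paragraph.[<end of paragraph>]\n"
cleaned = strip_paragraph_markers(generated)
print(cleaned)
```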
</div>

---

### There are two ways to run this notebook
#### `RUN_DEV_CELLS=True` - will test the functions as I did during development; this is rather slow. Enable it to follow the inner workings.
#### `RUN_DEV_CELLS=False` - runs very fast and you can use the UI at the end of the notebook to convert your PDFs into markdown!

In [1]:
RUN_DEV_CELLS=True

In [44]:
import io
import re

import fitz
from PIL import Image
import os
import shutil
from tqdm.auto import tqdm
import google.generativeai as genai
from google.generativeai.types import HarmCategory, HarmBlockThreshold
from google.generativeai.protos import Candidate
from google.generativeai import caching
import datetime

from typing import Any

from time import sleep
from dataclasses import dataclass, field
import random
import string
from IPython.display import Markdown
from pathlib import Path
import warnings

import json
from typing_extensions import TypedDict

from dotenv import load_dotenv
load_dotenv()

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

In [3]:
# I tried with Flash, but the results were significantly worse, especially for finding the bounding boxes.
# It might be possible to use Flash for some of the sub-tasks.

## You need to specify the model postfix (002) for the cache to work.
MODEL_NAME="gemini-1.5-pro-002"

# Gemini is a bit trigger-happy on the filters, and some research papers might get flagged.
SAFETY_SETTINGS = {
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
}

# Gemini has another annoying feature - it detects when the generation is reciting copyrighted content,
# which you'd expect would happen all the time when converting a PDF to markdown.

# Surprisingly, it only triggers infrequently and rather randomly, even with temperature=0,
# and does not trigger if I retry the exact same request.

# As a workaround, retry this many times if the request fails (likely with a RECITATION reason)
# I've seen the requests infrequently fail for other reasons, so we will retry for any errors.
MAX_RETRIES=5
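
The retry idea can be sketched in isolation. This is a minimal illustration, not the notebook's helper: `call_with_retries` and `flaky_request` are hypothetical, and retrying on raised exceptions is an assumption about what "any errors" should cover (the notebook's helper checks finish reasons).

```python
def call_with_retries(call, retries=5):
    """Run `call` until it succeeds, retrying on any exception."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return call()
        except Exception as error:  # assumption: every failure is worth a retry
            last_error = error
            print(f"Attempt {attempt} of {retries} failed: {error}")
    raise RuntimeError(f"All {retries} attempts failed") from last_error

# A fake request that fails twice before succeeding, standing in for a model call.
outcomes = iter([ValueError("RECITATION"), ValueError("RECITATION"), "# Page 1"])

def flaky_request():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome

result = call_with_retries(flaky_request)
print(result)
```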


In [4]:


@dataclass
class PDFImage:
    image: Image.Image
    name: str
    bbox: list[int] = field(default_factory=list)  # For images we extract later, keep the bbox.
    export: bool = False                           # Export images will be saved as files.

@dataclass
class PDFPage:
    page_num: int
    extracted_text: str
    page_image: Image.Image
    extracted_images: list[PDFImage]

    # Used to keep track of intermediate steps for debugging.
    extracted_images_analyzed: Any = field(default_factory=dict)
    images_proposed: list[Any] = field(default_factory=list)

    markdown: str = ""

@dataclass
class PDFDocument:
    pages: list[PDFPage]
    markdown: str = ""

In [5]:
# For the demo in the description.
# pdf = fitz.open("AIAYN.pdf")

# texts = []
# for block in pdf[3].get_text("dict")["blocks"]:
#     if (block["type"] == 0):
#         for line in block["lines"]:
#             for span in line["spans"]:
#                 print(span["text"] + " | {...}", )

### Use PyMuPDF to extract text and images from a PDF. No AI yet. :)

In [6]:
def extract_pdf_content(pdf_path):
    pdf = fitz.open(pdf_path)

    pages: list[PDFPage] = []
    num_images = 0

    for page_num, pdf_page in enumerate(tqdm(pdf, desc="Extracting PDF content", unit="page")):
        extracted_images: list[PDFImage] = []

        # Get images
        image_list = pdf_page.get_images()
        for img in image_list:
            xref = img[0]
            base_image = pdf.extract_image(xref)

            extracted_images.append(PDFImage(
                image=Image.open(io.BytesIO(base_image["image"])),
                name=f"image_{num_images}.png"

            ))
            num_images += 1

        pages.append(PDFPage(
            page_image=Image.open(io.BytesIO(pdf_page.get_pixmap(dpi=300, annots=False).tobytes())),
            extracted_text=pdf_page.get_text(),
            extracted_images=extracted_images,
            page_num=page_num
        ))

    return PDFDocument(pages=pages)

### Let's try with a well-formatted PDF

In [None]:
if RUN_DEV_CELLS:
    aiayn = extract_pdf_content("AIAYN.pdf")

In [8]:
def show_pdf_page(pdf_page: PDFPage):
    display(pdf_page.page_image.resize((350, 500)))
    for image in pdf_page.extracted_images:
        img = image.image
        width, height = img.size
        aspect = width / height
        if width > height:
            new_width = 200
            new_height = int(200 / aspect)
        else:
            new_height = 200
            new_width = int(200 * aspect)
        display(img.resize((new_width, new_height)))
    print(pdf_page.extracted_text)

In [None]:
if RUN_DEV_CELLS:
    show_pdf_page(aiayn.pages[3])

### Not bad. We lost formatting, the formulas are messed up, and we don't know where to place the images in the page. We can clean up the text using Gemini, and use the extracted hi-res images as is, with some extra annotations.

### Let's try with a scanned document that has worse formatting.

In [None]:
if RUN_DEV_CELLS:
    lowry = extract_pdf_content("Lowry.pdf")
    show_pdf_page(lowry.pages[6])

### This is a mess. Not only is the text all over the place, the extracted images are just slices of the page.

### We will need to use Gemini to both re-OCR the text, and locate the images in the page.

---

In [11]:
## Handle retries, caching, and errors.

def gemini_generate(model,
                    messages,
                    generation_config=None,
                    safety_settings=SAFETY_SETTINGS,
                    retries=MAX_RETRIES):

    attempts = 0
    response = None

    while attempts < retries:
        if attempts > 0:
            warnings.warn(f"Retry {attempts} of maximum {retries}", stacklevel=1)

        response = model.generate_content(
            messages,
            generation_config=generation_config,
            safety_settings=safety_settings
        )

        # Check for an empty candidate list first; indexing it would raise an IndexError.
        if len(response.candidates) == 0:
            warnings.warn("No candidates received from the model.", stacklevel=1)
        elif response.candidates[0].finish_reason == Candidate.FinishReason.STOP:
            return response.text
        else:
            warnings.warn(f"Unexpected finish reason {response.candidates[0].finish_reason.name}", stacklevel=1)

        attempts += 1

    # If we reach here, we have failed all retries.
    warnings.warn(str(response), stacklevel=1)
    if response is None or len(response.candidates) == 0:
        raise Exception("No candidates received from the model.")
    raise Exception(f"Unexpected finish reason {response.candidates[0].finish_reason}")


In [12]:
good_enough_prompt="""
You are given a page from a PDF document, and images automatically extracted from it.

First, check if the original page actually has images (figures, plots, diagrams, etc) in it.
If no, return an empty list for the image ratings.
If yes, for all provided images, determine if the image is extracted perfectly.

A perfectly extracted image fully captures one or more figure, plot, or diagram in the page.

Poorly extracted images:
- Extracted image captures only part of a figure, plot, or diagram.
- Extracted image is an image of the full page with mostly text in it.

Be picky.
It's ok if all images are good or bad. Briefly explain your reasoning before making a decision.
"""

# I found that even for this simple task a bit of CoT makes a night-and-day difference.
# The model will generate the values in alphabetical order of the keys, so we need to place reasoning before decision.
class ImageRating(TypedDict):
    a_name: str
    b_reason: str
    c_good: bool

class NeedManualExtraction(TypedDict):
    a_page_has_images: bool
    image_ratings: list[ImageRating]


def analyze_extracted_images(page: PDFPage):
    model = genai.GenerativeModel(model_name=MODEL_NAME)

    images = []
    for image in page.extracted_images:
        images.append(f"Extracted image '{image.name}':\n")
        images.append(image.image)

    result = gemini_generate(
        model,
        ["PDF Page:\n", page.page_image, "Extracted images:\n", *images, good_enough_prompt],
        generation_config=genai.GenerationConfig(
            max_output_tokens=2000,
            temperature=0,
            top_k=1,
            response_mime_type="application/json",
            response_schema=NeedManualExtraction,
        ),
        safety_settings=SAFETY_SETTINGS
    )

    page.extracted_images_analyzed = json.loads(result)
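
The `a_`, `b_`, `c_` key prefixes rely on the ordering behaviour described in the comment above, which is the notebook's observation rather than something verified here. The sketch below only illustrates the naming trick: once keys are sorted alphabetically, the reasoning lands ahead of the verdict. The rating values are invented.

```python
import json

# A made-up rating, written with the verdict first on purpose.
rating = {
    "c_good": False,
    "a_name": "image_3.png",
    "b_reason": "Only the left half of the plot is visible.",
}

# Sorting keys alphabetically puts the reasoning ahead of the verdict.
ordered = json.dumps(rating, sort_keys=True)
keys_in_order = list(json.loads(ordered))
print(keys_in_order)  # ['a_name', 'b_reason', 'c_good']
```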


In [None]:
if RUN_DEV_CELLS:
    analyze_extracted_images(aiayn.pages[0])
    analyze_extracted_images(aiayn.pages[3])
    analyze_extracted_images(lowry.pages[0])
    analyze_extracted_images(lowry.pages[1])
    analyze_extracted_images(lowry.pages[6])

    print("AIAYN no images: " + json.dumps(aiayn.pages[0].extracted_images_analyzed, indent=2))
    print("AIAYN with good images: " + json.dumps(aiayn.pages[3].extracted_images_analyzed, indent=2))

    print("Lowry no images: " + json.dumps(lowry.pages[0].extracted_images_analyzed, indent=2))
    print("Lowry no images: " + json.dumps(lowry.pages[1].extracted_images_analyzed, indent=2))
    print("Lowry bad images: " + json.dumps(lowry.pages[6].extracted_images_analyzed, indent=2))

### Gemini correctly recognized whether a page has images in it, and for a page with images it correctly recognized whether the images are good or not.
### Looks promising. I will assume that the page contains good images if the model gives 👍 for more than 80% of the images.
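
A small worked example of the threshold, assuming the same comparison as the check in the next cell (a page passes only when strictly more than 80% of its ratings are good). `mostly_good` is a hypothetical name used for illustration.

```python
def mostly_good(ratings, threshold=0.8):
    """True when strictly more than `threshold` of the ratings are good."""
    return sum(ratings) > threshold * len(ratings)

all_good = mostly_good([True, True, True, True, True])
four_of_five = mostly_good([True, True, True, True, False])
nothing_extracted = mostly_good([])

print(all_good)           # True: 5 of 5
print(four_of_five)       # False: exactly 80% is not enough
print(nothing_extracted)  # False: no images were extracted at all
```

The empty case matters: a page that has figures but produced no extracted images should still go through the bounding-box step.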

In [18]:
def page_needs_image_extraction(page: PDFPage):
    analyze_extracted_images(page)

    if not page.extracted_images_analyzed.get("a_page_has_images", False):
        # Remove the wrongly extracted images from pages without images to avoid confusing the model later
        page.extracted_images = []
        return False

    good_images = [rating.get("c_good", False) for rating in page.extracted_images_analyzed.get("image_ratings", [])]

    # <= to account for the case where there are images, but none have been extracted.
    if good_images.count(True) <= 0.8 * len(good_images):
        # Images bad overall - remove everything.
        page.extracted_images = []
        return True
    return False


In [None]:
if RUN_DEV_CELLS:
    print(page_needs_image_extraction(aiayn.pages[3]))
    print(page_needs_image_extraction(lowry.pages[6]))

### Now, if the page needs AI-assisted image extraction...

In [23]:
# I think Gemini prefers this order of coordinates.
image_extraction_prompt = """
Annotate the images in the document with a meaningful filename (.png) and a bounding box.
Only provide one annotation for each image. Do not annotate tables.

JSON schema:
[
    {
        a_caption: "Image caption",
        bbox: [top, left, bottom, right]
    },
]
"""

class ImageAnnotation(TypedDict):
    a_caption: str
    bbox: list[int]


def page_find_bboxes(page: PDFPage):
    model = genai.GenerativeModel(model_name=MODEL_NAME)
    result = gemini_generate(
        model,
        [page.page_image, image_extraction_prompt],
        generation_config=genai.GenerationConfig(
            temperature=0,
            top_k=1,
            response_mime_type="application/json",
            response_schema=list[ImageAnnotation]
        ),
        safety_settings=SAFETY_SETTINGS
    )

    page.images_proposed = []

    # Sanity check the bounding boxes and unswap coordinates if they seem to have been swapped (happens infrequently)
    for image in json.loads(result):
        bbox = image.get("bbox") or []  # A missing bbox becomes an empty list and is skipped below.
        if len(bbox) == 4:
            page.images_proposed.append({
                "a_caption": image.get("a_caption", "unknown"),
                "bbox": [min(bbox[0], bbox[2]), min(bbox[1], bbox[3]), max(bbox[0], bbox[2]), max(bbox[1], bbox[3])]
            })

In [24]:
if RUN_DEV_CELLS:
    page_find_bboxes(lowry.pages[6])

In [25]:
from PIL import ImageDraw

def visualize_bboxes(page: PDFPage):
    img = page.page_image.copy()
    draw = ImageDraw.Draw(img)

    colors = ["red", "green", "blue", "yellow", "purple", "orange", "pink", "brown", "cyan", "magenta"]
    width, height = img.size

    for i, bbox in enumerate([image["bbox"] for image in page.images_proposed]):
        # The order of coordinates is different in PIL
        bbox_img = [
            bbox[1] * width / 1000,   # xmin
            bbox[0] * height / 1000,  # ymin
            bbox[3] * width / 1000,   # xmax
            bbox[2] * height / 1000,  # ymax
        ]
        draw.rectangle(bbox_img, outline=colors[i % len(colors)], width=5)

    return img
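
The coordinate conversion used above can be checked on its own. This sketch assumes, as the code above does, that boxes arrive as `[ymin, xmin, ymax, xmax]` on a 0-1000 scale; `to_pixel_box` is a hypothetical helper and the page size is an example value.

```python
def to_pixel_box(bbox, width, height):
    """Convert [ymin, xmin, ymax, xmax] on a 0-1000 scale to (left, upper, right, lower) pixels."""
    ymin, xmin, ymax, xmax = bbox
    return (
        round(xmin * width / 1000),
        round(ymin * height / 1000),
        round(xmax * width / 1000),
        round(ymax * height / 1000),
    )

# A box covering the top-left quarter of a 2550x3300 page (US Letter at 300 dpi).
pixel_box = to_pixel_box([0, 0, 500, 500], width=2550, height=3300)
print(pixel_box)  # (0, 0, 1275, 1650)
```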

In [None]:
if RUN_DEV_CELLS:
    display(visualize_bboxes(lowry.pages[6]).resize((350, 500)))

### It's not perfect, but might be good enough.

> I tried adding more bboxes that are shifted/scaled around the predicted one, extracting the images and asking Gemini to score how well they are cropped, but for some reason Gemini had a very hard time figuring out which ones had the best crop.

I will scale up the bbox by 15% and hope it captures the whole image.
A dedicated fine-tuned object detection model would give better results, but let's stick to the challenge.
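
As a worked example of the arithmetic: growing a box by 15% about its center adds 7.5% of its width and height on each side, and anything that leaves the 0-1000 grid is clamped. `grow_box` is a hypothetical helper; it rounds to the nearest integer to keep the example's numbers exact, which may differ by one unit from a truncating implementation.

```python
def grow_box(bbox, scale):
    """Scale [ymin, xmin, ymax, xmax] about its center and clamp to the 0-1000 grid."""
    ymin, xmin, ymax, xmax = bbox
    cy, cx = (ymin + ymax) / 2, (xmin + xmax) / 2
    half_h, half_w = (ymax - ymin) * scale / 2, (xmax - xmin) * scale / 2
    grown = [cy - half_h, cx - half_w, cy + half_h, cx + half_w]
    return [max(0, min(1000, round(v))) for v in grown]

inner = grow_box([100, 200, 300, 600], 1.15)
edge = grow_box([0, 0, 400, 1000], 1.15)
print(inner)  # [85, 170, 315, 630]
print(edge)   # [0, 0, 430, 1000]
```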

In [27]:
def scale_bbox(bbox: list[int], scale):
    assert(len(bbox) == 4)

    ymin, xmin, ymax, xmax = bbox

    width = xmax - xmin
    height = ymax - ymin

    center_x = (xmin + xmax) / 2
    center_y = (ymin + ymax) / 2

    scaled_width = width * scale
    scaled_height = height * scale

    # Named `scaled` so the local variable does not shadow the function itself.
    scaled = [
        center_y - scaled_height/2, center_x - scaled_width/2,
        center_y + scaled_height/2, center_x + scaled_width/2
    ]

    return [max(0, min(1000, int(x))) for x in scaled]

def page_scale_bboxes(page: PDFPage, scale):
    for image in page.images_proposed:
        image["bbox"] = scale_bbox(image["bbox"], scale)

In [None]:
if RUN_DEV_CELLS:
    page_scale_bboxes(lowry.pages[6], 1.15)
    display(visualize_bboxes(lowry.pages[6]).resize((350, 500)))

### We get a bit more text around the image, but I think it's a better alternative to cropping the image.

Another solution would be to pass the images with rendered bboxes to Gemini and ask it to adjust the bboxes, but I did not try that.

In [29]:
# Adds the extracted images to the page.

# I will give a random name to extracted images to avoid biasing the model based on the image number.
# Store the names in a list to avoid duplicates.
random_names = []

def extract_images_from_page(page: PDFPage):

    # Drop the previously extracted bad images.
    page.extracted_images = []

    for bbox in [image["bbox"] for image in page.images_proposed]:
        ymin, xmin, ymax, xmax = bbox
        width, height = page.page_image.size

        bbox_img = page.page_image.crop((
            xmin * width / 1000, ymin * height / 1000,
            xmax * width / 1000, ymax * height / 1000
        ))

        while True:
            random_name = ''.join(random.choices(string.ascii_lowercase + string.digits, k=5))
            if random_name not in random_names:
                random_names.append(random_name)
                break

        page.extracted_images.append(PDFImage(
            image=bbox_img,
            name=f"image_{random_name}.png",
            bbox=bbox,
        ))

In [30]:
if RUN_DEV_CELLS:
    extract_images_from_page(lowry.pages[6])

In [38]:
# Remove the anti-reciting hack.
def remove_recitation_hack(markdown: str):
    return markdown.replace("[end of paragraph]", "")

In [49]:

common_markdown_prompt = """

Use markdown heading levels for sections.

Insert 2 new lines to force a line break.

For images:
<img src="filename" width="x">

For tables:
Use markdown tables, do not include images of tables.

For math formulas:
Use markdown formulas `$...$ ` for inline formulas, and block formulas \n$$\n...\n$$\n for block formulas.
Use \\tag{n} if you need to number a block formula.

For chemical formulas:
Use <sub> and <sup>

For figure captions:
Use markdown quotes (> Figure n: caption)

Escape symbols in text that would be wrongly interpreted as markdown (#, $, *, etc)

Insert [end of paragraph] after each paragraph.

Wrap the output in ```markdown.
"""

page_markdown_prompt = """
Convert the page to markdown. Convert all text as is, don't skip any parts, don't change wording.
Ignore purely decorative elements.

If a page contains figures, plots, diagrams, etc and an image for it is available, include it in markdown.
Place each image roughly where it appears in the page. Avoid breaking sentences to fit images.

""" + common_markdown_prompt


def page_to_markdown(page: PDFPage):
    model = genai.GenerativeModel(model_name=MODEL_NAME)
    image_messages = []
    for img in page.extracted_images:
        image_messages.append(f"Image variant {img.name}:\n")
        image_messages.append(img.image)

    markdown = gemini_generate(
        model,
        [
            page.page_image,
            f"Extracted page text:\n{page.extracted_text}",
            "Extracted images:",
            *image_messages,
            page_markdown_prompt,
            # Gemini often wraps the output in ```markdown or ```text. I prefill it for consistency.
            "```markdown\n", "model"
        ],
        generation_config=genai.GenerationConfig(
            temperature=0,
            top_k=1,
            response_mime_type="text/plain"
        ),
        safety_settings=SAFETY_SETTINGS
    )

    prefix = "```markdown\n"
    suffix = "```"

    if markdown.startswith(prefix):
        markdown = markdown[len(prefix):]
        if markdown.endswith(suffix):
            markdown = markdown[:-len(suffix)]

    page.markdown = remove_recitation_hack(markdown)



In [40]:
## Save the images so we can render them.
def save_markdown_images(page: PDFPage, directory: str="."):
    directory = Path(directory)
    # Save the images mentioned in the markdown
    os.makedirs(directory, exist_ok=True)
    for img in page.extracted_images:
        if (img.name in page.markdown):
            img.image.save(directory / img.name)

In [42]:
# We saved the images in an output dir, but the image references in Markdown don't have this
# prefix. To render markdown in the notebook, I will need to prepend the directory to the image names.

def patch_dir_base(text: str, image_names: list[str], prefix: Path):
    prefix = Path(prefix)

    # With no names the pattern would be empty, match everywhere, and raise a KeyError below.
    if not image_names:
        return text

    replacements = {name: str(prefix / name) for name in image_names}
    pattern = '|'.join(map(re.escape, replacements.keys()))
    return re.sub(pattern, lambda m: replacements[m.group()], text)

In [None]:
# Let's try with the AIAYN page.
if RUN_DEV_CELLS:
    page_to_markdown(aiayn.pages[3])
    save_markdown_images(aiayn.pages[3], "aiayn_dev")
    print(aiayn.pages[3].markdown)
    display(Markdown(patch_dir_base(aiayn.pages[3].markdown,
                                    [image.name for image in aiayn.pages[3].extracted_images],
                                    "aiayn_dev")))

### Looks very good! Let's try with the Lowry page.

In [None]:
if RUN_DEV_CELLS:
    page_to_markdown(lowry.pages[6])
    save_markdown_images(lowry.pages[6], "lowry_dev")
    print(lowry.pages[6].markdown)
    display(Markdown(patch_dir_base(lowry.pages[6].markdown,
                                    [image.name for image in lowry.pages[6].extracted_images],
                                    "lowry_dev")))

### It looks decent too, let's tie it all together!

### Let's process multiple pages in parallel to make it faster.

Note: I'm not sure what's a reasonable number of parallel workers that does not get rate-limited with a free API key.
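
A minimal sketch of the pattern, with a stand-in `process` function instead of real model calls. The cap of 2 workers is a hypothetical value for illustration, not a measured rate limit.

```python
from concurrent.futures import ThreadPoolExecutor

def process(page):
    # Stand-in for the per-page model calls; mutates the page in place.
    page["markdown"] = f"# Page {page['num']}"

pages = [{"num": n} for n in range(5)]
max_workers = 2  # hypothetical cap; lower it if requests get rate-limited

with ThreadPoolExecutor(max_workers=min(len(pages), max_workers)) as executor:
    # list(...) drains the iterator so exceptions raised in workers surface here.
    list(executor.map(process, pages))

combined = "\n\n".join(page["markdown"] for page in pages)
print(combined)
```

Because each worker writes into its own page object, the pages can finish in any order and the combined document still comes out in page order.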

In [52]:
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def process_single_page(page, master_pbar):
    pbar = tqdm(desc=f"Page {page.page_num}", total=3)
    pbar.set_postfix_str("Checking images...")
    need_extraction = page_needs_image_extraction(page)
    pbar.update(1)
    if (need_extraction):
        pbar.set_postfix_str("Generating BBoxes...")
        page_find_bboxes(page)
        page_scale_bboxes(page, 1.15)
        # bboxes = scale_bboxes(bboxes, 1.15)
#         if len(bboxes):
#             print(f"Found bboxes: {bboxes}")
        extract_images_from_page(page)
    pbar.update(1)

    pbar.set_postfix_str("Converting to markdown...")
    page_to_markdown(page)
    pbar.update(1)
    pbar.close()
    master_pbar.update(1)

def pdf_to_markdown(pdf: PDFDocument, max_workers=16):
    with tqdm(total=len(pdf.pages), desc="Processing pages", unit="page") as pbar:
        with ThreadPoolExecutor(max_workers=min(len(pdf.pages), max_workers)) as executor:
            process_fn = partial(process_single_page, master_pbar=pbar)
            list(executor.map(process_fn, pdf.pages))
    pdf.markdown = remove_recitation_hack("\n\n".join([page.markdown for page in pdf.pages]))


In [None]:
if RUN_DEV_CELLS:
    pdf_to_markdown(lowry)
    for page in lowry.pages:
        save_markdown_images(page, "lowry_dev")

    display(Markdown(patch_dir_base(lowry.markdown,
                                    [image.name for page in lowry.pages for image in page.extracted_images],
                                    "lowry_dev")))

    # display(Markdown(lowry.markdown))

### Now let's do a second pass.

Gemini did a decent job at converting individual pages to markdown, but it made a couple of mistakes because it did not have access to the other pages while processing each page.
I will pass the whole document through Gemini again to improve consistency between pages, brush up the formatting, and optionally make links to the bibliography work properly.

---

### To process the entire document, we need to use a few tricks:
- It's possible the whole document will not fit into the output token limit (8K), so we need to generate the output in steps.
- Cache the input (whole document + page images + extracted images) to save cost.
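
Generating in steps can be sketched without any API: each call continues from the text produced so far, and the loop stops when the model reports it is finished or a length cap is hit. `generate_in_steps` and `fake_generate` are hypothetical; the fake model simply emits at most 10 characters per call to mimic a small output limit.

```python
def generate_in_steps(generate, max_total_chars):
    """Call `generate(prefix)` repeatedly, feeding the text so far back in as the prefix.

    `generate` is a stand-in for a model call; it returns (new_text, finished).
    """
    text = ""
    while len(text) < max_total_chars:
        chunk, finished = generate(text)
        text += chunk
        if finished:
            break
    return text

# Fake model: emits at most 10 characters per call, like a small output-token limit.
DOCUMENT = "# Title\n\nA paragraph that does not fit into a single response."

def fake_generate(prefix, limit=10):
    rest = DOCUMENT[len(prefix):]
    return rest[:limit], len(rest) <= limit

result = generate_in_steps(fake_generate, max_total_chars=2 * len(DOCUMENT))
print(result == DOCUMENT)  # True
```

The length cap is the safety net: if the model never signals that it is done, the loop still ends once the output grows well past the input size.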

In [58]:
## Handle caching, multi-step generation, retries, and errors.

# For debugging purposes:

multistep_log = []

def gemini_generate_multistep(model,
                    messages,
                    pbar=None,             # If we want to update a progress bar.
                    max_total_chars=None,  # We might want to limit the length of generated text to avoid an infinite loop.
                    generation_config=None,
                    safety_settings=SAFETY_SETTINGS,
                    retries=MAX_RETRIES):

    attempts = 0
    tokens_generated = 0



    if pbar: pbar.set_postfix_str("Caching input...")

    # print(f"Caching input: {messages}")

    cache = caching.CachedContent.create(
        model=model.model_name,
        contents=messages,
        ttl=datetime.timedelta(hours=1) # We will delete the cache manually after we are done.
    )
    cache_model = genai.GenerativeModel.from_cached_content(cache)

    multistep_log.append({"msg": "Cached input", "cache": cache, "cached_messages": messages})

    # Gemini requires at least one message besides cache. Luckily we always start with a markdown block.
    extra = "```markdown\n"

    try:
        while attempts < retries:
            if pbar: pbar.set_postfix_str("Generating content...")
            if (attempts > 0):
                print(f"Retry {attempts} of maximum {retries}")

            response = cache_model.generate_content(
                [{"role": "model", "parts": [{"text": extra}]}],
                generation_config=generation_config,
                safety_settings=safety_settings
            )

            if len(response.candidates) == 0:
                print("No candidates received from the model:")
                print(f"Input: {messages}")
                print(f"Extra: `{extra}`")
                print(response)
                attempts += 1
                continue


            if response.candidates[0].finish_reason.name in ["MAX_TOKENS", "STOP"]:

                multistep_log.append({
                    "msg": "Response received",
                    "extra": [{"role": "model", "parts": [{"text": extra}]}],
                    "text": response.text,
                    "reason": response.candidates[0].finish_reason.name,
                    "cache": cache,
                    })


                if response.candidates[0].finish_reason == Candidate.FinishReason.MAX_TOKENS:
                    if pbar:
                        pbar.update(len(response.text))
                    attempts = 0
                    extra += response.text
                    # print(response.usage_metadata)
                    tokens_generated += response.usage_metadata.candidates_token_count
                    # `extra` already includes this chunk, so don't add response.text again.
                    if max_total_chars is not None and len(extra) > max_total_chars:
                        print("Max total characters reached.")
                        if pbar:
                            pbar.set_postfix_str("Max total characters reached.")
                            pbar.refresh()
                        return extra, response.usage_metadata

                    continue

                if response.candidates[0].finish_reason == Candidate.FinishReason.STOP:
                    if pbar:
                        pbar.set_postfix_str("Finished.")
                        pbar.update(len(response.text))

                    multistep_log.append({
                        "msg": "Finished",
                        "extra": [{"role": "model", "parts": [{"text": extra}]}],
                        "text": response.text,
                        "reason": response.candidates[0].finish_reason.name,
                        "cache": cache,
                        })

                    return extra + response.text, response.usage_metadata


            # We can only reach here if the finish reason is not STOP or MAX_TOKENS.
            print(f"Unexpected finish reason {response.candidates[0].finish_reason.name}")
            attempts += 1

        if (attempts >= retries):
            print("Max retries reached")
            print(response)
            raise Exception("Max retries reached.")

    finally:
        if cache: cache.delete()

    return response

In [60]:
brush_it_up_prompt = """
Brush up formatting in this markdown document. Process the document in full.
Do not omit any parts of the document. Do not summarize any parts of the document.
Never change the wording or structure of the document, keep content where it is.
Remove decorative elements and page separators.

If the text has special symbols that would be incorrectly interpreted as markdown formatting, escape them.

Keep any images in the document as they are.
""" + common_markdown_prompt


def brush_it_up(document: PDFDocument,  pbar):
    model = genai.GenerativeModel(model_name=MODEL_NAME)
    if not document.markdown:
        print("The document is missing markdown!")
        # Return a pair so callers that unpack (markdown, usage) don't crash.
        return document.markdown, None

    prompt = brush_it_up_prompt

    messages = []
    # for page in document.pages:
    #     messages.append(f"Page: {page.page_num}\n")
    #     messages.append(page.page_image)

    messages.append(document.markdown)
    messages.append(prompt)


    markdown, usage = gemini_generate_multistep(
                model,
                messages,
                # We don't want to create an infinite loop; 1.5x the input size gives a good safety margin.
                max_total_chars=len(document.markdown) * 1.5,
                pbar=pbar,
                generation_config=genai.GenerationConfig(
                    # I set it to 1000 so we can update the progress bar more often.
                    # Don't want to deal with streaming for this purpose.
                    max_output_tokens=1000,
                    temperature=0,
                    top_k=1,
                    response_mime_type="text/plain")
            )

    pbar.set_postfix_str("Finished.")

    pbar.refresh()
    pbar.close()

    prefix = "```markdown\n"
    suffix = "```"

    if markdown.startswith(prefix):
        markdown = markdown[len(prefix):]
        if markdown.endswith(suffix):
            markdown = markdown[:-len(suffix)]


    return remove_recitation_hack(markdown), usage
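
The fence-stripping step at the end of `brush_it_up` only removes the closing fence when the response ends with it exactly. Below is a minimal sketch of a slightly more tolerant variant that also accepts trailing whitespace after the closing fence. The helper name `strip_markdown_fence` is hypothetical and not part of the notebook.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def strip_markdown_fence(text: str) -> str:
    """Remove one enclosing markdown code fence, if the text starts with one."""
    prefix = FENCE + "markdown\n"
    if not text.startswith(prefix):
        # Nothing to strip: return the text unchanged.
        return text
    body = text[len(prefix):]
    # Tolerate newlines or spaces after the closing fence.
    trimmed = body.rstrip()
    if trimmed.endswith(FENCE):
        body = trimmed[:-len(FENCE)]
    return body


if __name__ == "__main__":
    wrapped = FENCE + "markdown\n# Title\n" + FENCE + "\n"
    print(repr(strip_markdown_fence(wrapped)))
```

If the model stops before emitting the closing fence, only the opening fence is removed and the rest of the text is returned as is.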

In [None]:
if RUN_DEV_CELLS:
    md, usage = brush_it_up(lowry, tqdm(desc="Second pass", unit="character", total=len(lowry.markdown)))
    print(f"Usage on the last call:\n{usage}")
    display(Markdown(patch_dir_base(md,
                                    [image.name for page in lowry.pages for image in page.extracted_images],
                                    "lowry_dev")))

### This looks very nice!
### Gemini fixed a few inconsistencies between pages and correctly escaped all markdown syntax in the text.
### Let's tie it all together!

In [62]:
def convert_pdf_to_markdown(pdf_path: str, output_dir: str = "output", second_pass=True):
    output_dir = Path(output_dir)

    document = extract_pdf_content(pdf_path)
    pdf_to_markdown(document)
    os.makedirs(output_dir, exist_ok=True)
    for page in document.pages:
        save_markdown_images(page, output_dir)
    if second_pass:
        # Keep the first-pass result so the two passes can be compared.
        (output_dir / "document-preliminary.md").write_text(document.markdown, encoding="utf-8")
        document.markdown_pass1 = document.markdown
        document.markdown, usage = brush_it_up(
            document,
            tqdm(desc="Second pass", unit="character", total=len(document.markdown)))
        (output_dir / "document-final.md").write_text(document.markdown, encoding="utf-8")
        print(f"Usage on the last call: {usage}")
    else:
        (output_dir / "document.md").write_text(document.markdown, encoding="utf-8")
    return document

In [None]:
if RUN_DEV_CELLS:
    multistep_log = []
    lowry = convert_pdf_to_markdown("Lowry.pdf", "lowry")

### I had a look at the markdown in the "lowry" directory, and it looks good. I will now convert the AIAYN PDF.

### Now, the competition requires us to use a context of at least 100k tokens. I will convert the Gemini 1.5 paper, which is 150 pages long.

In [None]:
# This takes a really long time to run.

RUN_SUPER_LONG_CONTEXT = True

if RUN_DEV_CELLS and RUN_SUPER_LONG_CONTEXT:
    convert_pdf_to_markdown("gemini-1.5.pdf", "gemini1.5")

### Now the final touch - a simple UI

In [1]:
import ipywidgets as widgets
from IPython.display import display, Markdown
from pathlib import Path
import os
import shutil
import tempfile
import base64

In [2]:
upload = widgets.FileUpload(accept='.pdf', multiple=False)
second_pass = widgets.Checkbox(description="Two passes (slow but better results)",
                               disabled=True,
                               indent=False,
                               value=True)
parallel = widgets.Checkbox(description="Parallel processing (might rate-limit free API keys)",
                            disabled=True,
                            indent=False,
                            value=True)
convert_btn = widgets.Button(description='Convert to Markdown', disabled=True)
status = widgets.HTML(value='')
download_btn = widgets.Button(description='Download Results', disabled=True)
download_link = widgets.HTML()
show_btn = widgets.Button(description='Display the document', disabled=True)

container = widgets.VBox()

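# mkdtemp() may return a relative or an absolute path depending on the Python version,
# so resolve both sides before making the path relative to the working directory.
temp_dir = Path(tempfile.mkdtemp(dir=".")).resolve().relative_to(Path.cwd().resolve())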

input_path = None
prefix = None
filename = None

def on_upload_change(change):
    global input_path
    global prefix
    global filename
    if upload.value:
        filename = upload.value[0]["name"]
        file_content = upload.value[0]["content"]

        # Save uploaded file
        input_path = temp_dir/filename
        with open(input_path, 'wb') as f:
            f.write(file_content)

        # Prepare the output directory for the conversion
        prefix = temp_dir/'output'
        os.makedirs(prefix, exist_ok=True)

        # Enable conversion when a file is uploaded.
        parallel.disabled = False
        second_pass.disabled = False
        convert_btn.disabled = False


zip_filename = None
document = None
def on_convert_click(b):
    global zip_filename
    global document

    convert_btn.disabled = True
    try:
        # convert_pdf_to_markdown() as defined above takes no `parallel`
        # argument, so the checkbox value is not forwarded here.
        document = convert_pdf_to_markdown(input_path,
                                           prefix,
                                           second_pass.value)

        # Create zip (make_archive appends ".zip" itself, so strip it from the base name)
        zip_filename = f'{filename}-output.zip'
        zip_path = temp_dir/zip_filename
        shutil.make_archive(str(zip_path)[:-4], 'zip', prefix)

        # Enable download and show
        download_btn.disabled = False
        show_btn.disabled = False
        status.value = '<span style="color: green">Conversion complete!</span>'
    except Exception as e:
        download_btn.disabled = True
        status.value = f'<span style="color: red">Error: {str(e)}</span>'
    # Keep the button disabled; rerun the cell to reset the UI.
    convert_btn.disabled = True


def on_download_click(b):
    with open(os.path.join(temp_dir, zip_filename), 'rb') as f:
        content = f.read()
        b64 = base64.b64encode(content).decode()
    download_link.value = f"""
    <a download="{zip_filename}"
       href="data:application/zip;base64,{b64}"
       target="_blank">Click to download</a>
    """

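markdown_display_handle = None

def on_show_click(b):
    global markdown_display_handle
    # patch_dir_base() takes the markdown text, the image names, and the directory,
    # as in the dev cell above.
    image_names = [image.name for page in document.pages for image in page.extracted_images]
    markdown_display_handle = display(
        Markdown(patch_dir_base(document.markdown, image_names, str(prefix))),
        display_id=True)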

upload.observe(on_upload_change, names='value')
convert_btn.on_click(on_convert_click)
download_btn.on_click(on_download_click)
show_btn.on_click(on_show_click)



container.children = [upload, status, second_pass, parallel, convert_btn, download_btn, download_link, show_btn]

# Display UI
display(container)


### Rerun the cell above to reset the UI
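
The download step in the UI cell zips the output directory and embeds the archive in the page as a base64 data URI. Here is a minimal standalone sketch of that idea, which can be tried without the widgets. The function names `zip_directory` and `build_download_link` are hypothetical, and the sketch assumes the archive is small enough to embed in a link.

```python
import base64
import html
import shutil
import tempfile
from pathlib import Path


def zip_directory(source_dir: Path, zip_path: Path) -> Path:
    """Zip the contents of source_dir into zip_path (expected to end in .zip)."""
    # make_archive appends the extension itself, so pass the name without ".zip".
    archive = shutil.make_archive(str(zip_path.with_suffix("")), "zip", source_dir)
    return Path(archive)


def build_download_link(zip_path: Path) -> str:
    """Return an HTML anchor that embeds the zip file as a base64 data URI."""
    b64 = base64.b64encode(zip_path.read_bytes()).decode("ascii")
    # Escape the file name because it ends up inside an HTML attribute.
    name = html.escape(zip_path.name, quote=True)
    return (f'<a download="{name}" '
            f'href="data:application/zip;base64,{b64}">Click to download</a>')


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "output"
        out.mkdir()
        (out / "document.md").write_text("# Hello\n", encoding="utf-8")
        archive = zip_directory(out, Path(tmp) / "paper.pdf-output.zip")
        print(build_download_link(archive)[:60])
```

A data URI keeps the UI self-contained, but the whole archive is held in the notebook output, so very large results may be better served as a file.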

In [None]:
### Now, one requirement of the challenge was to use at least 100k token context.
### The PDFs I've been using for development/demonstration are a bit short for that.
### Let's convert the Gemini 1.5 paper. It's 150 pages and will take a long time to process.