# AIMS.au Inference on PDFs Demo

This notebook shows how to parse a PDF file and turn its content into preprocessed sentences
that a baseline model may process and convert into relevance predictions.

Here, we assume that the model needs nothing but the text itself to generate its predictions:
no metadata, no images, and no other inputs. This scenario applies only to the baseline models
used as part of the proposed AIMS.au benchmark.

For a demo showing how to parse the statements contained in the full dataset,
see [this notebook](./data_parsing_demo.ipynb).

This notebook was last updated on 2024-07-31 for framework v0.5.2.

In [None]:
import fitz
import numpy as np
import pandas as pd
import torch

import qut01

In [None]:
# first step: provide a path to a local PDF file that you want to process
input_pdf_path = "/tmp/statement_to_analyze.pdf"

# we will open this PDF and extract its raw text using PyMuPDF (fitz), an open source library
with fitz.open(input_pdf_path) as pdf_reader:
    statement_text = []
    for page_idx in range(pdf_reader.page_count):
        page = pdf_reader.load_page(page_idx)
        statement_text.append(page.get_text("text"))
    statement_text = "\n".join(statement_text)
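# note: if the above PDF contains only scanned images (i.e. no embedded text), nothing is extracted here
# (we strip whitespace first, since joining empty pages still leaves newline characters behind)
assert statement_text.strip(), "no text could be extracted from the PDF"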
print(f"STATEMENT TEXT:\n\n{statement_text}")
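
The extraction above returns empty strings for pages that hold only scanned images. As a minimal sketch, assuming the per-page strings are kept in a list before they are joined, the check below flags such pages. The helper name `find_pages_without_text` is hypothetical; it is not part of PyMuPDF or of the `qut01` framework.

```python
def find_pages_without_text(page_texts):
    """Return the indices of pages whose extracted text is empty or whitespace-only."""
    return [idx for idx, text in enumerate(page_texts) if not text.strip()]


# hand-written strings standing in for the outputs of `page.get_text("text")`
example_page_texts = ["First page text.", "   \n", "", "Last page text."]
empty_page_indices = find_pages_without_text(example_page_texts)
print(f"pages without extractable text: {empty_page_indices}")
```

Pages flagged this way would likely need an OCR step before they can contribute any sentences.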

In [None]:
# next, we'll convert this raw text into a list of sentences
# (note: we create a fake data batch dict to use existing processing functions)
statement_processed_data = qut01.data.statement_utils.StatementProcessedData.create(
    statement_tensor_data={"fitz/text": statement_text},
    load_annotations=False,  # assume none exist
)
print(f"statement contains {len(statement_processed_data.sentences)} sentences")
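
For intuition only, the sketch below shows what turning raw PDF text into sentences can involve: line breaks inside a sentence are collapsed, then the text is split after sentence-ending punctuation. This is an assumption-laden illustration and not how `qut01` splits sentences; the function name `naive_split_sentences` is hypothetical, and this naive rule will wrongly split after abbreviations such as "e.g.".

```python
import re


def naive_split_sentences(raw_text):
    """Collapse whitespace, then split after '.', '!' or '?' when followed by a space."""
    flattened = re.sub(r"\s+", " ", raw_text).strip()
    if not flattened:
        return []
    return re.split(r"(?<=[.!?])\s+", flattened)


# raw text with line breaks in the middle of sentences, as PDF extraction often produces
example_text = "The first sentence spans\ntwo lines. The second one\ndoes too! Is this the third?"
for sentence in naive_split_sentences(example_text):
    print(repr(sentence))
```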

In [ ]:
# the next step involves loading a pretrained model, and feeding it the sentences...
# (note: these steps will depend a LOT on the type of model you intend to use)

# TODO: load a pretrained model and run it on `statement_processed_data.sentences`
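
Since the model-specific steps are left open above, the sketch below only illustrates the general shape such an inference loop might take: sentences are grouped into batches, each batch is scored, and scores are thresholded into relevance predictions. Everything here is hypothetical, including the names `iter_batches`, `predict_relevance`, and `placeholder_score_fn`; the keyword-based scorer is a stand-in for a real pretrained model and is not part of the AIMS.au baselines.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def placeholder_score_fn(sentence_batch):
    """Stand-in for a real model: scores 1.0 if a sentence mentions 'risk', else 0.0."""
    return [1.0 if "risk" in sentence.lower() else 0.0 for sentence in sentence_batch]


def predict_relevance(sentences, score_fn, batch_size=2, threshold=0.5):
    """Return one (sentence, score, is_relevant) tuple per input sentence."""
    results = []
    for batch in iter_batches(sentences, batch_size):
        for sentence, score in zip(batch, score_fn(batch)):
            results.append((sentence, score, score >= threshold))
    return results


# hand-written sentences standing in for `statement_processed_data.sentences`
example_sentences = [
    "We identified risks in our supply chain.",
    "Our office hours are listed online.",
    "Risk assessments are performed annually.",
]
for sentence, score, is_relevant in predict_relevance(example_sentences, placeholder_score_fn):
    print(f"{score:.1f} relevant={is_relevant} :: {sentence}")
```

With a real model, `score_fn` would typically tokenize each batch and return one probability per sentence, while the batching and thresholding logic could stay the same.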