# Why preprocess academic papers before handing them to an LLM?

As with most data projects, 80–90% of our success comes down to handling the data properly; in this case, PDFs of journal articles from arXiv. In modern literature-review workflows, the biggest bottleneck is not finding papers but making their PDF-bound knowledge accessible to large language model (LLM) tooling.

| Goal | Reason we **can’t** give raw PDFs to an LLM |
|------|---------------------------------------------|
| *Structure* – know what’s **title / abstract / sections / references** | PDF stores nothing about logical structure; it’s just coordinates on a page. |
| *Clean text* – remove headers, footers, page numbers | These artifacts confuse token‐based models and waste context length. |
| *Stable math and code* | Equations often arrive as images or scattered glyphs; code fragments lose indentation. |
| *Consistent token‑budget* | PDFs contain ligatures, hyphenated line‐breaks, multiple encodings; the same sentence can expand to many more tokens if not normalised. |
| *Rich metadata* | Author lists, publication year, DOI, citation graph – essential for literature reviews but hidden inside the PDF. |
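
To make the *Clean text* and *Consistent token‑budget* rows concrete, here is a minimal sketch of the kind of normalisation that naive PDF text extraction needs. The function name `normalise_pdf_text` and the heuristics are illustrative assumptions, not part of GROBID or Nougat; real papers need more careful rules (for example, this version would also join a genuinely hyphenated compound that happens to break across lines).

```python
import re
import unicodedata


def normalise_pdf_text(raw: str) -> str:
    """Clean text pulled from a PDF page (illustrative heuristics only)."""
    # NFKC folds compatibility characters: the "fi" ligature (U+FB01) becomes "fi".
    text = unicodedata.normalize("NFKC", raw)
    # Re-join words hyphenated across a line break: "classi-\nfier" -> "classifier".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that contain nothing but a page number.
    kept = [line for line in text.split("\n")
            if not re.fullmatch(r"\s*\d{1,4}\s*", line)]
    # Collapse runs of spaces and tabs left over from column layout.
    return re.sub(r"[ \t]+", " ", "\n".join(kept)).strip()


# Hypothetical extraction output containing ligatures, a broken word and a page number.
sample = "The \ufb01rst classi-\n\ufb01er is e\ufb03cient.\n12\nNext page text."
print(normalise_pdf_text(sample))
# The first classifier is efficient.
# Next page text.
```

The pipeline below avoids most of this hand-written cleanup by letting GROBID and Nougat do the heavy lifting.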

There are many workflows for converting PDFs into a format an LLM can consume. What follows is one fairly robust, general-purpose workflow built on two tools, GROBID and Nougat.

Instead of handing opaque, page-oriented files straight to an LLM and wasting precious context on headers, line breaks, and garbled equations, we first translate each PDF into a pair of machine-friendly artefacts: a TEI (Text Encoding Initiative) XML document created by GROBID that captures the paper’s logical skeleton (title, authors, section hierarchy, references), and a Markdown transcription generated by Nougat that preserves readable text, formulas, and tables.

By fusing GROBID’s structure with Nougat’s clean content, we produce lightweight JSON records whose fields (metadata, sectioned body text, citation list) are ready for chunking, embedding, and LLM querying, turning a static corpus into an interactive, searchable knowledge base.
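
The fusion step can be sketched as follows. This is a minimal illustration, assuming the TEI header layout GROBID typically produces (title under `titleStmt`, authors as `persName` elements with `forename`/`surname` children); the function name `build_record`, the record fields, and the toy inputs are hypothetical and stand in for real GROBID and Nougat output.

```python
import json
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace, so every lookup needs the prefix.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def build_record(tei_xml: str, markdown: str) -> dict:
    """Combine header metadata from TEI with body text from Markdown."""
    root = ET.fromstring(tei_xml)
    header = root.find("tei:teiHeader", TEI_NS)  # assumes a header is present
    title = header.findtext(".//tei:titleStmt/tei:title",
                            default="", namespaces=TEI_NS)
    authors = []
    for pers in header.iterfind(".//tei:author/tei:persName", TEI_NS):
        # Join whatever name parts are present (forename, surname, ...).
        parts = [child.text.strip() for child in pers if child.text]
        authors.append(" ".join(parts))
    return {"title": title.strip(), "authors": authors, "body": markdown.strip()}


# Toy inputs that mimic the shape of real output.
sample_tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title level="a" type="main">A Toy Paper</title></titleStmt>
    <sourceDesc><biblStruct><analytic>
      <author><persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName></author>
      <author><persName><forename type="first">Alan</forename><surname>Turing</surname></persName></author>
    </analytic></biblStruct></sourceDesc>
  </fileDesc></teiHeader>
</TEI>"""
sample_md = "# A Toy Paper\n\n## 1 Introduction\n\nWe study $x^2$.\n"

record = build_record(sample_tei, sample_md)
print(json.dumps(record, indent=2))
```

Taking metadata from TEI and body text from Markdown plays to each tool's strength; a fuller version would also split the body into sections and attach the reference list.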

| Tool | Core task | Why it matters for LLM prep |
|------|-----------|----------------------------|
| **GROBID**<br>(GeneRation Of BIbliographic Data) | • Converts scholarly PDFs → TEI‑XML<br>• Extracts **title, authors, affiliations, abstract, section hierarchy, in‑text citations, reference list** | Gives us a *semantic* view of a paper – every logical part in its own XML tag. Perfect for:<br>• building citation graphs<br>• slicing the text into meaningful chunks (“Introduction”, “Methods”…)<br>• attaching metadata to embeddings or RAG pipelines |
| **Nougat**<br>(Facebook AI Research) | • Finetuned vision transformer that transcribes PDFs → **Markdown**<br>• Keeps **mathematical formulas, tables and figures** in readable form | Produces human‑friendly, LLM‑friendly plain text while *preserving* layout cues:<br>• equations as LaTeX blocks, not garbled symbols<br>• headings as `#`, lists as `-`<br>• tables in Markdown grid format |

Before we get into the code, let’s discuss the file types we will be using.

| Format | What it really is | ✅ Strengths for researchers | ⚠️ Typical pain‑points |
|--------|------------------|----------------------------|------------------------|
| **PDF**<br>*Portable Document Format* | A **page‑oriented binary canvas**. Text, figures, & glyphs are stored as drawing commands. Optimized for what the *page looks* like. | • Camera‑ready, citable rendering<br>• Opens everywhere | • Lacks semantic structure—“paragraphs” are just XY‑coordinates<br>• Hard to recover math & tables without heavy layout analysis |
| **TEI XML**<br>(output of **GROBID**) | Standards‑based XML where every logical unit (`<title>`, `<biblStruct>`, etc.) is explicitly tagged per the **Text Encoding Initiative**. | • Rich scholarly metadata<br>• Interoperable with library/archival tools<br>• Coordinates (optional) for linking text ↔ page | • Verbose; not human‑friendly<br>• Needs chunking/cleaning before LLM ingestion |
| **Markdown**<br>(output of **Nougat**) | Plain‑text markup: headings with `#`, LaTeX math `$…$`, pipe tables, embedded figure refs. | • Easy to read & diff in Git<br>• Preserves equations and table layout far better than naïve text extraction | • Drops deep metadata (affiliations, reference IDs)<br>• Heading levels depend on model quality—may require post‑fixes |
| **JSON**<br>(your downstream corpus) | Your own schema, e.g. ```{"title":…,"authors":[…],"sections":[…],"references":[…]}``` | • Compact & stream‑friendly<br>• First‑class citizen in Python / JS & vector DBs<br>• Perfect for chunk‑embedding pipelines | • Must map TEI+Markdown consistently<br>• Versioning of the pipeline is critical to avoid schema drift |

In [2]:
import os, time, shutil #shutil will allow us a bit more power to read and write files
import xml.etree.ElementTree as ET
from google.colab import files
# Mount Google Drive (ensure this is run first)
from google.colab import drive
drive.mount('/content/drive', force_remount=True)



base_directory = "/content/drive/MyDrive/Colab_Notebooks/AI/"
PDF_DIR = base_directory+"arxiv_pdfs"   # input PDFs
OUT_XML = base_directory+"grobid_xml"   # GROBID TEI files
OUT_MD  = base_directory+"arxiv_markdowns2"  # Nougat markdowns
!mkdir -p "$OUT_XML" "$OUT_MD"

Mounted at /content/drive


In [None]:
# JDK for GROBID
!apt-get update -qq
!apt-get install -y openjdk-17-jdk git build-essential

# Verify Java
!java -version

# GROBID server: clone the repository and build it from source
%cd /content
!git clone --depth 1 https://github.com/kermitt2/grobid.git
%cd grobid
!./gradlew clean install -x test  # ≈ 4–6 min


# ────────────────────────────────────────────────────────────
#   Launch the GROBID web service (runs on port 8070)
#     './gradlew run' keeps the JVM alive; we fork it to the background
# ────────────────────────────────────────────────────────────
import subprocess, time, requests, os, textwrap, sys, json, glob, shutil, pathlib, tqdm, random

# Start the service silently in the background
service = subprocess.Popen(["./gradlew", "run"], stdout=subprocess.DEVNULL)

# Wait for the /isalive ping to return 200
print("⏳  Starting GROBID …")
for _ in range(60):
    try:
        if requests.get("http://localhost:8070/api/isalive", timeout=1).status_code == 200:
            print("✅  GROBID is up on http://localhost:8070")
            break
    except requests.exceptions.RequestException:  # not up yet: refused connection or timed-out ping
        pass
    time.sleep(10)
else:
    raise RuntimeError("GROBID failed to start – check the build log above.")

# Install the GROBID Python client
!pip install -q git+https://github.com/kermitt2/grobid-client-python.git
# Nougat: clone the facebookresearch repository next to (not inside) the GROBID checkout
%cd /content
!git clone https://github.com/facebookresearch/nougat.git
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
%cd nougat
!pip install .
import nougat
print(nougat.__version__)
# (Optional) utilities
!pip install tqdm rich python-docx

0.1.18
Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.1.2


# Using Grobid to generate TEI XML files

An overview of the code before proceeding:

| Step | What it does | Why it matters to **Grobid newcomers** |
|------|--------------|----------------------------------------|
| **Import client** | Loads the official Python wrapper for Grobid’s REST API. | You control Grobid from Python instead of cURL; easier in notebooks. |
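| **Create folders** | Makes sure your input/output directories exist. | The client reads PDFs from the input folder you specify and writes TEI‑XML to the `output` folder (here `OUT_XML`); if no output folder is given, the TEI files land alongside the PDFs. A wrong or empty input path simply produces no output, which is easy to miss. |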
| **Instantiate `GrobidClient`** | Points the client to the running Grobid server. | Grobid itself is a Java service (often started via Docker). The client just sends HTTP calls. |
| **Configure runtime** | Sets time‑outs, batch size, retry delay. | Prevents the notebook from hanging on huge PDFs and avoids hammering the server. |
| **`process()` call** | Sends every PDF in `PDF_DIR` to Grobid’s `processFulltextDocument` service, which parses → TEI‑XML. | `processFulltextDocument` is Grobid’s “hero” endpoint: it extracts the full article structure—metadata, section hierarchy, references, tables, formulas—into **TEI**, an XML dialect for scholarly texts. |
| **`consolidate_citations=True`** | Asks Grobid to match each extracted reference against an external bibliographic service (CrossRef by default) and fill in missing or incorrect fields such as DOIs. | Produces cleaner, more complete bibliographies for downstream tasks (citation graphs, reference matching, etc.), at the cost of slower processing. |
| **End message** | Confirms where the new `.tei.xml` files live. | You’ll feed these TEI files into later preprocessing steps (e.g. reference parsing, section splitting, JSON export). |
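
As a preview of the reference-parsing step mentioned in the last row, the sketch below pulls title, year, and DOI out of the `<biblStruct>` entries in a TEI document. It assumes the usual GROBID layout (references under `<listBibl>`, the date in a `when` attribute, DOIs in `<idno type="DOI">`); the function name `extract_references` and the sample entries are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_references(tei_xml: str) -> list:
    """Return one dict per <biblStruct> found inside a <listBibl>."""
    root = ET.fromstring(tei_xml)
    refs = []
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        # The first <title> is the article title when present, else the book/journal.
        title = bibl.findtext(".//tei:title", default="", namespaces=TEI_NS)
        date = bibl.find(".//tei:date", TEI_NS)
        doi = bibl.findtext(".//tei:idno[@type='DOI']", default=None,
                            namespaces=TEI_NS)
        year = date.get("when", "")[:4] if date is not None else ""
        refs.append({"title": title.strip(), "year": year or None, "doi": doi})
    return refs


# Invented entries shaped like GROBID output (not real publications).
sample_tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><back>
<div type="references"><listBibl>
  <biblStruct xml:id="b0">
    <analytic>
      <title level="a" type="main">First Example Paper</title>
      <idno type="DOI">10.0000/example</idno>
    </analytic>
    <monogr><title level="j">Example Journal</title>
      <imprint><date type="published" when="2017-06-12"/></imprint></monogr>
  </biblStruct>
  <biblStruct xml:id="b1">
    <monogr><title level="m">An Example Book</title>
      <imprint><date type="published" when="1999"/></imprint></monogr>
  </biblStruct>
</listBibl></div></back></text></TEI>"""

for ref in extract_references(sample_tei):
    print(ref)
```

Entries that consolidation could not match will often lack a DOI, so downstream code should treat `doi` as optional.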

In [None]:
#─────────────────────────────────────────────────────────────────────────────
# 1.  Import the Grobid Python client
#─────────────────────────────────────────────────────────────────────────────

from grobid_client.grobid_client import GrobidClient
print("✅  GrobidClient imported from", GrobidClient.__module__)
#─────────────────────────────────────────────────────────────────────────────
# 2.  Ensure the input/output folders exist
#─────────────────────────────────────────────────────────────────────────────
# Grobid reads PDF files from  PDF_DIR   and writes TEI‑XML files to  OUT_XML
# If the folders don’t exist yet, create them so the next call won’t crash.
os.makedirs(PDF_DIR,  exist_ok=True)
os.makedirs(OUT_XML, exist_ok=True)
#─────────────────────────────────────────────────────────────────────────────
# 3.  Instantiate a GrobidClient
#─────────────────────────────────────────────────────────────────────────────
# • grobid_server points to the REST service (a local build or the default Docker
#   container listens on http://localhost:8070). Adjust if you deployed Grobid elsewhere.
client = GrobidClient(grobid_server="http://localhost:8070")
#─────────────────────────────────────────────────────────────────────────────
# 4.  Tweak runtime parameters
#─────────────────────────────────────────────────────────────────────────────
client.config["timeout"]     = 180   # seconds to wait for each PDF (big files)
client.config["batch_size"]  = 100   # how many PDFs the client queues per batch
client.config["sleep_time"]  =   8   # seconds to wait before retrying when the server is busy


#─────────────────────────────────────────────────────────────────────────────
# 5.  Launch the actual extraction job
#─────────────────────────────────────────────────────────────────────────────

client.process(
    "processFulltextDocument",   # Grobid service: full‑text TEI extraction
    input_path=PDF_DIR,          # folder containing *.pdf files
    output=OUT_XML,              # where *.tei.xml files will be written
    n=2,                         # use 2 parallel worker threads (good for Colab)
    consolidate_citations=True,  # ask Grobid to enrich references via an external lookup (CrossRef)
    force=True,                  # overwrite any TEI that already exists
    verbose=True                 # print progress to the notebook
)


print(f"\nDone! TEI files written to: {OUT_XML}")
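
Because a failed conversion is easy to overlook in a long progress log, it can help to check afterwards that every PDF has a TEI counterpart. The helper below is a sketch under one assumption: that the client names its output `<pdf-stem>.tei.xml` or `<pdf-stem>.grobid.tei.xml`, depending on its version. `missing_tei` is a hypothetical name, and the demo uses empty placeholder files in a temporary directory rather than real output.

```python
import tempfile
from pathlib import Path


def missing_tei(pdf_dir, xml_dir) -> list:
    """Return the names of PDFs that have no matching TEI file in xml_dir."""
    tei_names = {p.name for p in Path(xml_dir).glob("*.tei.xml")}
    missing = []
    for pdf in sorted(Path(pdf_dir).iterdir()):
        if pdf.suffix.lower() != ".pdf":
            continue
        # Assumed naming schemes; adjust if your client version differs.
        candidates = {f"{pdf.stem}.tei.xml", f"{pdf.stem}.grobid.tei.xml"}
        if not candidates & tei_names:
            missing.append(pdf.name)
    return missing


# Demo with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    pdfs, xmls = Path(tmp, "pdfs"), Path(tmp, "xml")
    pdfs.mkdir()
    xmls.mkdir()
    for name in ("2301.00001v1.pdf", "2301.00002v1.pdf"):
        (pdfs / name).touch()
    (xmls / "2301.00001v1.grobid.tei.xml").touch()
    print(missing_tei(pdfs, xmls))  # ['2301.00002v1.pdf']
```

In the notebook you would call `missing_tei(PDF_DIR, OUT_XML)` and re-run or inspect the PDFs it lists.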




In [None]:
# Workflow in a nutshell:
#   1. Create a temporary scratch directory for Nougat artifacts.
#   2. Loop over PDFs → call `nougat <pdf> -o <tmp_subdir>`.
#   3. Collect the single `.mmd` file, move it to **OUT_MD**.
#   4. Delete the bulky, per‑PDF scratch folder to keep Drive tidy.
#─────────────────────────────────────────────────────────────────────────────

import subprocess      # lets us call the Nougat CLI from Python
temp_dir = '/content/temp_nougat_output' # Where Nougat will stash *intermediate* outputs (page PNGs, json, etc.)
#─────────────────────────────────────────────────────────────────────────────
# 1. Find every PDF in our input folder
#─────────────────────────────────────────────────────────────────────────────

pdf_files = [f for f in os.listdir(PDF_DIR) if f.lower().endswith('.pdf')]
#─────────────────────────────────────────────────────────────────────────────
# 2. Convert each PDF
#─────────────────────────────────────────────────────────────────────────────

for pdf_file in pdf_files:
    pdf_path = os.path.join(PDF_DIR, pdf_file)
    base_filename = os.path.splitext(pdf_file)[0]

    # Nougat outputs markdown into a folder; define this intermediate path
    intermediate_output_path = os.path.join(temp_dir, base_filename)

    print(f"\nProcessing {pdf_file}...")

    # Run Nougat
    result = subprocess.run(
        ["nougat", pdf_path, "-o", intermediate_output_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True
    )
    #─────────────────────────────────────────────────────────────────────────
    # 3. If Nougat succeeded, move *.mmd to our final markdown folder
    #─────────────────────────────────────────────────────────────────────────
    if result.returncode == 0:
        # Nougat places .mmd file inside a directory it created
        nougat_mmd_file = os.path.join(intermediate_output_path, f"{base_filename}.mmd")
        final_markdown_path = os.path.join(OUT_MD, f"{base_filename}.mmd")

        # Move markdown file to desired directory
        if os.path.exists(nougat_mmd_file):
            shutil.move(nougat_mmd_file, final_markdown_path)
            print(f"✅ Markdown saved to {final_markdown_path}")
        else:
            print(f"❌ Nougat markdown file not found at {nougat_mmd_file}")

        # Remove the intermediate output directory
        shutil.rmtree(intermediate_output_path, ignore_errors=True)
    else:
        print(f"❌ Error converting {pdf_file}: {result.stderr}")

#─────────────────────────────────────────────────────────────────────────────
# 4. Global cleanup: remove the top‑level temp directory (once, after the loop)
#─────────────────────────────────────────────────────────────────────────────
shutil.rmtree(temp_dir, ignore_errors=True)

| What you should know | Why it matters |
|----------------------|----------------|
| **Nougat vs. OCR** – Nougat is a deep‑learning model trained on arXiv articles; it understands LaTeX layout, math zones, tables, and multi‑column text. | You retain high‑fidelity equations & table structure that OCR or `pdftotext` would mangle – crucial for LLM reasoning or downstream parsing. |
| **Markdown output** (`*.mmd`) | Easier to tokenize, chunk, or embed than HTML/TEI while still preserving semantic structure (headings, lists, code blocks). |
| **Intermediate files** – Nougat writes PNGs / JSON per page. | They are only useful for visual inspection; we delete them to save your Google Drive quota. |
| **Return codes** – `returncode == 0` signals success. | Always check; malformed PDFs or memory limits can cause silent failures. |
| **Batching** – Running in a loop like above is safer in Colab than passing an entire folder (Nougat can be memory‑heavy). | Prevents the session from crashing on one bad PDF. |
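
Once the `.mmd` files exist, a natural next step is to cut them into heading-based sections for chunking. The sketch below is one simple way to do that, not a Nougat feature: `split_sections` is a hypothetical helper. It also flags `[MISSING_PAGE...]` placeholders, which, to my understanding, Nougat inserts when a page is empty or fails to decode; treat that marker format as an assumption and check it against your own output.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")   # Markdown ATX headings
MISSING = re.compile(r"\[MISSING_PAGE[^\]]*\]")   # assumed Nougat placeholder


def split_sections(markdown: str) -> list:
    """Split Markdown into sections, one per heading, keeping heading level."""
    sections = [{"heading": "", "level": 0, "lines": []}]  # text before any heading
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append({"heading": match.group(2),
                             "level": len(match.group(1)), "lines": []})
        else:
            sections[-1]["lines"].append(line)
    result = []
    for sec in sections:
        text = "\n".join(sec["lines"]).strip()
        if sec["heading"] or text:  # skip the empty leading section
            result.append({"heading": sec["heading"], "level": sec["level"],
                           "text": text,
                           "has_missing_page": bool(MISSING.search(text))})
    return result


# Toy transcription shaped like a Nougat .mmd file.
sample_md = ("# Title\n\n## 1 Introduction\n\nSome text with $E = mc^2$.\n\n"
             "## 2 Method\n\n[MISSING_PAGE_FAIL:3]\n")
for sec in split_sections(sample_md):
    print(sec["level"], sec["heading"], sec["has_missing_page"])
```

Note that a line starting with `# ` inside a code listing would be mistaken for a heading; that is acceptable for a first pass but worth handling before building a production corpus.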