## Mistral OCR

In [44]:
import os
from mistralai import Mistral
from dotenv import load_dotenv

# 1. Load your API key from .env or env variable
load_dotenv()
client = Mistral(api_key=os.getenv("MISTRAL_API_KEY"))

# 2. Upload the local PDF file (the context manager closes the handle afterwards)
with open("sample_doc.pdf", "rb") as pdf_file:
    uploaded_pdf = client.files.upload(
        file={
            "file_name": "sample_doc.pdf",
            "content": pdf_file,
        },
        purpose="ocr",
    )

# 3. Retrieve a signed URL for that file
signed_url = client.files.get_signed_url(file_id=uploaded_pdf.id)

# 4. Use that URL in the OCR call
ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": signed_url.url,
    },
    include_image_base64=True
)
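
With `include_image_base64=True`, the response can carry each page's embedded images as base64 strings. The helper below is a minimal sketch for turning one such string into raw bytes. It assumes the string is either a `data:` URI (for example `data:image/jpeg;base64,...`) or bare base64; `decode_image_payload` is an illustrative name, not part of the Mistral SDK, and the sample payload is synthetic.

```python
import base64


def decode_image_payload(payload: str) -> bytes:
    """Decode a base64 image string, with or without a data-URI prefix."""
    # Assumption: a data URI looks like "data:<mime>;base64,<data>".
    if payload.startswith("data:") and "," in payload:
        payload = payload.split(",", 1)[1]
    return base64.b64decode(payload)


# Self-contained check with a synthetic payload (not real OCR output)
sample = "data:image/jpeg;base64," + base64.b64encode(b"fake-image-bytes").decode("ascii")
decoded = decode_image_payload(sample)
print(decoded)
```

In this notebook the helper could be applied to the entries of `ocr_response.pages[i].images`, assuming each entry exposes its data in an `image_base64` attribute.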

In [45]:
print(ocr_response.pages[2].markdown)

# Differential Diagnosis 

Lysosomal storage disorders. The main clinical features in alpha-mannosidosis - intellectual disability, ataxia, coarse face, and dysostosis multiplex - may overlap with other lysosomal storage disorders (e.g., mucopolysaccharidosis type I and II). However, the distinctive clinical features associated with other lysosomal storage disorders, the availability of biochemical testing in clinical laboratories, and an understanding of their natural history should help in distinguishing between them.

Table 2. Genes of Interest in the Differential Diagnosis of Alpha-Mannosidosis

| Gene(s) | Disorder | MOI | Key Clinical Features of Disorder |  |
| :--: | :--: | :--: | :--: | :--: |
|  |  |  | Overlapping w/alpha- <br> mannosidosis | Distinguishing from alpha- <br> mannosidosis |
| $A B C C 9$ <br> KCNJ8 | Cantú syndrome | AD | - Coarse facial features <br> - Thickened ribs | - Heart defects <br> - Hypertrichosis |
| ARSB <br> ARSK <br> GALNS <br> GLB1 <br> GNS |  |
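
The OCR output above returns the table as a Markdown pipe table, with `<br>` marking line breaks inside a cell. The following is a rough sketch of how such a table could be split into rows of plain strings for downstream use. `parse_markdown_table` is a hypothetical helper, and the sample is an abbreviated version of the rows printed above.

```python
def parse_markdown_table(markdown: str) -> list[list[str]]:
    """Split a Markdown pipe table into rows of cell strings."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not a table row
        # Drop the outer pipes, then split on the inner ones
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the alignment row, e.g. | :--: | :--: |
        if all(cell and set(cell) <= set(":-") for cell in cells):
            continue
        # Join the <br>-separated pieces of each cell with a single space
        rows.append(
            [" ".join(part.strip() for part in cell.split("<br>")) for cell in cells]
        )
    return rows


sample = """| Gene(s) | Disorder | MOI |
| :--: | :--: | :--: |
| $A B C C 9$ <br> KCNJ8 | Cantú syndrome | AD |"""

rows = parse_markdown_table(sample)
for row in rows:
    print(row)
```

Note that the sketch keeps OCR quirks such as `$A B C C 9$` untouched; it only restructures the rows.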

## pypdf

In [25]:
from pypdf import PdfReader

# Open the PDF file
reader = PdfReader("sample_doc.pdf")
num_pages = len(reader.pages)
print(f"Total pages: {num_pages}")

# Parse and store text for all pages
all_text = [page.extract_text() for page in reader.pages]

# Print the output from page 3 (index 2)
page_number = 2  # Python uses 0-based indexing
if page_number < num_pages:
    print(f"Text from page 3:\n{all_text[page_number]}")
else:
    print("This PDF does not have a page 3.")

Ignoring wrong pointing object 61 0 (offset 0)


Total pages: 4
Text from page 3:
Differential Diagnosis
Lysosomal storage disorders. /T_he main clinical features in alpha-mannosidosis – intellectual disability, ataxia, 
coarse face, and dysostosis multiplex – may overlap with other lysosomal storage disorders (e.g., 
mucopolysaccharidosis type I and II). However, the distinctive clinical features associated with other lysosomal 
storage disorders, the availability of biochemical testing in clinical laboratories, and an understanding of their 
natural history should help in distinguishing between them.
Table 2. Genes of Interest in the Diﬀerential Diagnosis of Alpha-Mannosidosis
Gene(s) Disorder MOI
Key Clinical Features of Disorder
Overlapping w/alpha-
mannosidosis
Distinguishing from alpha-
mannosidosis
ABCC9 
KCNJ8 Cantú syndrome AD • Coarse facial features
• /T_hickened ribs
• Heart defects
• Hypertrichosis
ARSB 
ARSK 
GALNS 
GLB1 
GNS 
GUSB 
HGSNAT 
HYAL1 
IDS 
IDUA 
NAGLU 
SGSH 
Mucopolysaccharidoses (OMIM 
PS607014)
AR
XL/uni0
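
The extracted text above shows two kinds of font-related damage: the "Th" ligature comes out as the glyph name `/T_h`, and "ff" comes out as the single ligature character `ﬀ` (U+FB00). A possible post-processing step is sketched below. The `GLYPH_FIXES` mapping is an assumption based only on this sample's output, and `repair_extracted_text` is a hypothetical helper rather than anything pypdf provides.

```python
import unicodedata

# Assumption: these glyph-name leftovers are specific to this PDF's fonts.
GLYPH_FIXES = {"/T_h": "Th"}


def repair_extracted_text(text: str) -> str:
    """Expand Unicode ligatures and patch known glyph-name leftovers."""
    # NFKC maps compatibility characters such as U+FB00 to plain "ff"
    text = unicodedata.normalize("NFKC", text)
    for broken, fixed in GLYPH_FIXES.items():
        text = text.replace(broken, fixed)
    return text


print(repair_extracted_text("/T_he Di\ufb00erential Diagnosis"))
# → The Differential Diagnosis
```

NFKC normalization also rewrites other compatibility characters, so it is worth checking that nothing meaningful in the document depends on them before applying it broadly.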

## pdfplumber

In [26]:
import pdfplumber

pdf_path = "sample_doc.pdf"

all_text = []
all_tables = []

with pdfplumber.open(pdf_path) as pdf:
    num_pages = len(pdf.pages)
    print(f"Total pages: {num_pages}")

    for page in pdf.pages:
        # Extract all text on the page
        text = page.extract_text()
        all_text.append(text)

        # Extract all tables on the page (list of tables)
        tables = page.extract_tables()
        all_tables.append(tables)

# ---- Print output from page 3 (index 2) ----

page_number = 2  # 0-based indexing for page 3

if page_number < num_pages:
    print("------ TEXT ------")
    print(all_text[page_number])

    print("\n------ TABLES ------")
    tables = all_tables[page_number]
    if tables:
        for t_idx, table in enumerate(tables):
            print(f"\nTable {t_idx+1}:")
            for row in table:
                print(row)
    else:
        print("No tables found on this page.")
else:
    print("This PDF does not have a page 3.")

CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox


Total pages: 4




------ TEXT ------
Alpha-Mannosidosis 11
Differential Diagnosis
Lysosomal storage disorders. The main clinical features in alpha-mannosidosis – intellectual disability, ataxia,
coarse face, and dysostosis multiplex – may overlap with other lysosomal storage disorders (e.g.,
mucopolysaccharidosis type I and II). However, the distinctive clinical features associated with other lysosomal
storage disorders, the availability of biochemical testing in clinical laboratories, and an understanding of their
natural history should help in distinguishing between them.
Table 2. Genes of Interest in the Differential Diagnosis of Alpha-Mannosidosis
Key Clinical Features of Disorder
Gene(s) Disorder MOI
Overlapping w/alpha- Distinguishing from alpha-
mannosidosis mannosidosis
ABCC9 • Coarse facial features • Heart defects
Cantú syndrome AD
KCNJ8 • Thickened ribs • Hypertrichosis
ARSB
ARSK
GALNS
GLB1
GNS
• Coarse facial features
GUSB Mucopolysaccharidoses (OMIM AR • Short stature
• Dysostosis multiplex
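
The `CropBox missing from /Page` lines in the output above, like the `Ignoring wrong pointing object` lines printed by pypdf, are warnings rather than extracted text. One way to keep them out of the cell output is sketched below, assuming both libraries emit these messages through standard `logging` loggers named after their packages (`pdfminer`, which pdfplumber builds on, and `pypdf`).

```python
import logging

# Assumption: the warnings come from loggers under these package names.
for noisy in ("pdfminer", "pypdf"):
    logging.getLogger(noisy).setLevel(logging.ERROR)
```

Running this before opening the PDF leaves errors visible while hiding warning-level messages from both parsers.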

## n8n parser

In [46]:
import requests
from pypdf import PdfReader, PdfWriter
import io

# Hosted n8n webhook (superseded by the local test webhook assigned below)
# webhook_url = "https://congliu.app.n8n.cloud/webhook/76137e77-2a43-4b79-8945-3ec3b3341fa6"
webhook_url = "http://localhost:5678/webhook-test/d9c46738-586c-4b2c-b51a-96b34e72fe11"
pdf_path = "sample_doc.pdf"

# Extract page 3 (index 2) and write to a BytesIO buffer
reader = PdfReader(pdf_path)
writer = PdfWriter()

if len(reader.pages) > 2:
    writer.add_page(reader.pages[2])
    pdf_bytes = io.BytesIO()
    writer.write(pdf_bytes)
    pdf_bytes.seek(0)

    files = {'file': ('page3.pdf', pdf_bytes, 'application/pdf')}
    response = requests.post(webhook_url, files=files)

    print("Status code:", response.status_code)
    print("Response:")
    print(response.text)
else:
    print("This PDF does not have a page 3.")


Ignoring wrong pointing object 15 0 (offset 0)
Ignoring wrong pointing object 19 0 (offset 0)
Ignoring wrong pointing object 22 0 (offset 0)
Ignoring wrong pointing object 24 0 (offset 0)
Ignoring wrong pointing object 86 0 (offset 0)
Ignoring wrong pointing object 789 0 (offset 0)
Ignoring wrong pointing object 795 0 (offset 0)
Ignoring wrong pointing object 801 0 (offset 0)
Ignoring wrong pointing object 807 0 (offset 0)


Status code: 200
Response:
{"message":"Workflow was started"}
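
The body shown above acknowledges the request but carries no `text` key, so calling `response.json().get('text')` on it would return `None`. A stricter accessor could look like the sketch below. It assumes the workflow is meant to reply with a JSON object containing a `text` field; `extract_text` is a hypothetical helper name.

```python
import json


def extract_text(body: str) -> str:
    """Return the 'text' field of a JSON response body, or raise if absent."""
    payload = json.loads(body)
    text = payload.get("text") if isinstance(payload, dict) else None
    if text is None:
        # Assumption: a successful run replies with {"text": "..."}.
        raise ValueError(f"No 'text' field in response: {body!r}")
    return text


try:
    extract_text('{"message":"Workflow was started"}')
except ValueError as err:
    print(err)
```

In the notebook this would be called as `extract_text(response.text)`, which fails loudly when the webhook only confirms that the workflow started.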


In [43]:
print(response.json().get('text'))

Differential Diagnosis
Lysosomal storage disorders. !e main clinical features in alpha-mannosidosis – intellectual disability, ataxia,
coarse face, and dysostosis multiplex – may overlap with other lysosomal storage disorders (e.g.,
mucopolysaccharidosis type I and II). However, the distinctive clinical features associated with other lysosomal
storage disorders, the availability of biochemical testing in clinical laboratories, and an understanding of their
natural history should help in distinguishing between them.
Table 2. Genes of Interest in the Differential Diagnosis of Alpha-Mannosidosis
Gene(s) Disorder MOI
Key Clinical Features of Disorder
Overlapping w/alpha-
mannosidosis
Distinguishing from alpha-
mannosidosis
ABCC9
KCNJ8 
Cantú syndrome AD 
• Coarse facial features
• !ickened ribs
• Heart defects
• Hypertrichosis
ARSB
ARSK
GALNS
GLB1
GNS
GUSB
HGSNAT
HYAL1
IDS
IDUA
NAGLU
SGSH
Mucopolysaccharidoses (OMIM
PS607014)
AR
XL 
1
• Coarse facial features
• Dysostosis multiplex
• ID
• 