# 01_extract_text.ipynb  
### PDF Text Extraction and Cleaning
This notebook extracts text from raw patent PDFs, cleans it, and prepares standardized `.txt` files for later chunking and embedding.

## Setup and Imports
Load required libraries and define project paths for raw PDFs and cleaned text output.

In [1]:
from pathlib import Path
import pdfminer.high_level
import re
from pypdf import PdfReader

PROJECT_ROOT = Path("..").resolve()
RAW_DIR = PROJECT_ROOT / "data" / "raw" / "patents"
TXT_DIR = PROJECT_ROOT / "data" / "processed" / "txt"

TXT_DIR.mkdir(parents=True, exist_ok=True)

RAW_DIR, TXT_DIR

(WindowsPath('C:/Users/sully/RAGPROJ/data/raw/patents'),
 WindowsPath('C:/Users/sully/RAGPROJ/data/processed/txt'))

## PDF to Text Extraction
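The `pdf_to_text` function extracts text from a PDF with pdfminer and falls back to pypdf when pdfminer raises an error or returns only whitespace. Some patents, such as scanned documents with no text layer, may yield no extractable text from either library; in that case the function returns an empty string.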


In [2]:
def pdf_to_text(pdf_path: Path) -> str:
    """
    Extract text with pdfminer first; fall back to pypdf if pdfminer
    raises or returns only whitespace. Returns "" if both fail.
    """
    text = ""
    # First try pdfminer
    try:
        text = pdfminer.high_level.extract_text(str(pdf_path)) or ""
    except Exception:
        # Catch Exception rather than using a bare except, so that
        # KeyboardInterrupt and SystemExit still propagate
        text = ""

    # Fall back to pypdf
    if not text.strip():
        try:
            reader = PdfReader(str(pdf_path))
            pages = [page.extract_text() or "" for page in reader.pages]
            text = "\n".join(pages)
        except Exception:
            text = ""

    return text
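
The function above discards the reason for a failure and does not report which library produced the text. As a minimal sketch of the same try-then-fall-back idea, the hypothetical helper `extract_with_fallback` below takes an ordered list of named extractors and returns both the text and the name of the extractor that succeeded. The helper, its signature, and the stand-in extractors are illustrative assumptions, not part of the notebook; in practice the callables would wrap `pdfminer.high_level.extract_text` and `PdfReader`.

```python
from pathlib import Path
from typing import Callable, Sequence, Tuple

Extractor = Tuple[str, Callable[[Path], str]]

def extract_with_fallback(pdf_path: Path, extractors: Sequence[Extractor]) -> Tuple[str, str]:
    """Return (text, extractor_name), or ("", "none") if every extractor fails."""
    for name, extract in extractors:
        try:
            text = extract(pdf_path) or ""
        except Exception as exc:
            # Keep going, but report why this extractor failed
            print(f"{name} failed on {pdf_path.name}: {exc!r}")
            continue
        if text.strip():
            return text, name
    return "", "none"

# Stand-in extractors for illustration only (hypothetical, no PDF parsing)
def broken(path: Path) -> str:
    raise ValueError("simulated parser error")

def blank(path: Path) -> str:
    return "   \n"

def working(path: Path) -> str:
    return "claim text"

text, used = extract_with_fallback(
    Path("example.pdf"),
    [("broken", broken), ("blank", blank), ("working", working)],
)
print(used)
```

Recording the extractor name per file could make it easier to spot patents where the fallback was needed and the output deserves a closer look.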

## Text Cleaning
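Strips leading numbers from lines (intended to catch page numbers), removes figure labels through the end of their line, rejoins words hyphenated across line breaks, and collapses all whitespace into single spaces. The result is one continuous line of text suitable for chunking.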


In [3]:
def clean_text(text: str) -> str:
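    # Strip a leading number from each line (meant to catch page numbers;
    # this also strips numbers that begin ordinary content lines)
    text = re.sub(r'^\s*\d+\s+', ' ', text, flags=re.MULTILINE)
    # Remove everything from a FIG/Fig label to the end of its line
    text = re.sub(r'FIG\.?\s*\d+.*', ' ', text)
    text = re.sub(r'Fig\.?\s*\d+.*', ' ', text)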
    # Fix hyphenated line breaks (in case any remain)
    text = text.replace("-\n", "")
    # Replace all newlines with spaces
    text = text.replace("\n", " ")
    # Collapse multiple spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
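
The first substitution in `clean_text` matches any number at the start of a line, not only lines that consist of a page number. The sketch below contrasts that pattern with a stricter alternative that drops only digit-only lines. The helper name `strip_page_number_lines` and the sample text are assumptions made for illustration; whether the stricter rule suits these patents would need checking against real extractions.

```python
import re

def strip_page_number_lines(text: str) -> str:
    # Drop lines that consist solely of digits (plus optional spaces or tabs)
    return re.sub(r'^[ \t]*\d+[ \t]*$', '', text, flags=re.MULTILINE)

sample = "end of page\n12\n3 layers are used in the model"

# Pattern used in clean_text: also removes the "3" that starts a content line
loose = re.sub(r'^\s*\d+\s+', ' ', sample, flags=re.MULTILINE)

# Stricter variant: removes only the standalone "12"
strict = strip_page_number_lines(sample)

print(repr(loose))
print(repr(strict))
```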

## Process All PDFs
Extracts and cleans every PDF in the raw directory.  
Skips files that already have corresponding `.txt` outputs.


In [4]:
def process_all_pdfs(raw_dir: Path = RAW_DIR, out_dir: Path = TXT_DIR):
    pdf_files = sorted([p for p in raw_dir.iterdir() if p.suffix.lower() == ".pdf"])
    print(f"Found {len(pdf_files)} PDFs")

    for pdf in pdf_files:
        out_path = out_dir / (pdf.stem + ".txt")

        # 🔒 If you already created a txt by hand, don't touch it
        if out_path.exists():
            print(f"Skipping {pdf.name} (txt already exists)")
            continue

        raw = pdf_to_text(pdf)
        if not raw.strip():
            print(f"WARNING: no text extracted from {pdf.name}")
            continue

        clean = clean_text(raw)
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(clean)

    print("Done!")
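
The skip-if-exists rule can be previewed before any extraction runs. The sketch below is a hypothetical dry-run helper, `plan_extraction`, that applies the same case-insensitive `.pdf` filter and the same `<stem>.txt` naming as the cell above, then splits the PDFs into those still pending and those that would be skipped. The helper and the temporary directory layout are assumptions for illustration only.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

def plan_extraction(raw_dir: Path, out_dir: Path) -> Tuple[List[Path], List[Path]]:
    """Split PDFs in raw_dir into (pending, skipped) by whether a .txt exists."""
    pending, skipped = [], []
    for pdf in sorted(p for p in raw_dir.iterdir() if p.suffix.lower() == ".pdf"):
        target = out_dir / f"{pdf.stem}.txt"
        (skipped if target.exists() else pending).append(pdf)
    return pending, skipped

# Demonstrate on a throwaway directory tree rather than the real project paths
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "raw"
    out_dir = Path(tmp) / "txt"
    raw_dir.mkdir()
    out_dir.mkdir()
    (raw_dir / "A1.pdf").touch()
    (raw_dir / "B2.PDF").touch()    # upper-case suffix is still picked up
    (raw_dir / "notes.md").touch()  # non-PDF files are ignored
    (out_dir / "A1.txt").write_text("pasted by hand", encoding="utf-8")

    pending, skipped = plan_extraction(raw_dir, out_dir)
    pending_names = [p.name for p in pending]
    skipped_names = [p.name for p in skipped]

print("pending:", pending_names)
print("skipped:", skipped_names)
```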

## Test Extraction
Run a quick test extraction to inspect the cleaned text and verify that extraction works correctly.


In [5]:
test_pdf = RAW_DIR / "US9037464.pdf"  # change to any PDF file name in RAW_DIR

raw = pdf_to_text(test_pdf)
clean = clean_text(raw)

print("Raw length:", len(raw))
print("Clean length:", len(clean))
print(clean[:1500])  # peek at the first 1500 chars

Raw length: 60769
Clean length: 56671
USOO9037464B1 United States Patent (12) Mikolov et al. (10) Patent No.: (45) Date of Patent: US 9,037.464 B1 May 19, 2015 (54) COMPUTING NUMERIC REPRESENTATIONS OF WORDS INA HIGH-DIMIENSIONAL SPACE 6,092,043 A * 7/2000 Squires et al. ................ TO4/251 8,566,102 B1 * 10/2013 Bangalore et al. . TO4/270.1 2013/0262467 A1* 10/2013 Zhang et al. .................. 707f737 ck (71) Applicant: Google Inc., Mountain View, CA (US) (72) Inventors: Tomas Mikolov, Jersey City, NJ (US); Kai Chen, San Bruno, CA (US); Gregory S. Corrado, San Francisco, CA Machine Learning Research, 3:1137-1155, 2003. (US); Jeffrey A. Dean, Palo Alto, CA (US) OTHER PUBLICATIONS Bengio and LeCun, “Scaling learning algorithms towards AI. Large Scale Kernel Machines, MIT Press, 41 pages, 2007. Bengio et al., “A neural probabilistic language model.” Journal of Brants et al., "Large language models in machine translation." Pro ceedings of the Joint Conference on Empirical Me

## Process All PDFs (with Progress Messages)
Redefines `process_all_pdfs` so that it prints a progress line for each file. This definition replaces the earlier one.


In [6]:
def process_all_pdfs(raw_dir: Path = RAW_DIR, out_dir: Path = TXT_DIR):
    pdf_files = sorted([p for p in raw_dir.iterdir() if p.suffix.lower() == ".pdf"])
    
    print(f"Found {len(pdf_files)} PDFs in {raw_dir}")
    for pdf in pdf_files:
        base_name = pdf.stem  # e.g., "US9037464"
        out_path = out_dir / f"{base_name}.txt"
        
        if out_path.exists():
            print(f"Skipping {pdf.name} (already processed)")
            continue
        
        print(f"Processing {pdf.name}...")
        raw = pdf_to_text(pdf)
        if not raw.strip():
            print(f"  WARNING: no text extracted from {pdf.name}")
            continue
        
        clean = clean_text(raw)
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(clean)

    print("Done!")

## Run Full Extraction
Execute the PDF extraction and cleaning pipeline on all patents.

In [7]:
process_all_pdfs()

Found 33 PDFs in C:\Users\sully\RAGPROJ\data\raw\patents
Skipping US10452978.pdf (already processed)
Skipping US10740433.pdf (already processed)
Skipping US10902563.pdf (already processed)
Skipping US11003865.pdf (already processed)
Skipping US11023715.pdf (already processed)
Skipping US11238332.pdf (already processed)
Skipping US11295552.pdf (already processed)
Skipping US11328398.pdf (already processed)
Skipping US11562147.pdf (already processed)
Skipping US11636570.pdf (already processed)
Skipping US11749857.pdf (already processed)
Skipping US11900261.pdf (already processed)
Skipping US11921824.pdf (already processed)
Skipping US11961514.pdf (already processed)
Skipping US11989527.pdf (already processed)
Skipping US11991338.pdf (already processed)
Skipping US12148421.pdf (already processed)
Skipping US12182506.pdf (already processed)
Skipping US12217382.pdf (already processed)
Skipping US12271791B2.pdf (already processed)
Skipping US12282696B2.pdf (already processed)
Skipping US2021

## Normalize All Text Files
Re-cleans each `.txt` file to ensure consistent formatting, including manually pasted ones.


In [8]:
def normalize_all_text_files(txt_dir: Path = TXT_DIR):
    txt_files = sorted(txt_dir.glob("*.txt"))
    print(f"Normalizing {len(txt_files)} text files...")

    for txt_file in txt_files:
        raw = txt_file.read_text(encoding="utf-8", errors="ignore")
        cleaned = clean_text(raw)
        txt_file.write_text(cleaned, encoding="utf-8")
        print(f"✔ Normalized: {txt_file.name}")

    print("\n✨ Done! All text files are now cleaned and uniform.")
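
Because `normalize_all_text_files` overwrites each file in place, it may be worth checking how `clean_text` behaves on text that is already a single line, such as previously cleaned output or a hand-pasted paragraph. The sketch below copies `clean_text` from the cleaning cell so that it runs on its own; the helper `is_stable` and the sample strings are assumptions for illustration. It suggests two things to watch for: a figure reference in single-line text removes everything after it, and a second pass can strip a further leading number.

```python
import re

def clean_text(text: str) -> str:
    # Copied from the cleaning cell so this block is self-contained
    text = re.sub(r'^\s*\d+\s+', ' ', text, flags=re.MULTILINE)
    text = re.sub(r'FIG\.?\s*\d+.*', ' ', text)
    text = re.sub(r'Fig\.?\s*\d+.*', ' ', text)
    text = text.replace("-\n", "")
    text = text.replace("\n", " ")
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def is_stable(text: str) -> bool:
    """True if cleaning the cleaned text again leaves it unchanged."""
    once = clean_text(text)
    return clean_text(once) == once

multi_line = "The encoder is shown in\nFIG. 2 block diagram\nand maps words to vectors."
single_line = "The encoder is shown in FIG. 2 and maps words to vectors."

# With line breaks, only the figure line is removed
print(clean_text(multi_line))
# Without line breaks, everything after the figure reference is removed
print(clean_text(single_line))
# A second pass strips another leading number, so cleaning is not idempotent here
print(is_stable("12 34 apples"))
```

One cautious option is to compare each file's length before and after normalization and review any file that shrinks sharply before overwriting it.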

## Run Normalization
Apply normalization to all cleaned text files.


In [9]:
normalize_all_text_files()

Normalizing 33 text files...
✔ Normalized: US10452978.txt
✔ Normalized: US10740433.txt
✔ Normalized: US10902563.txt
✔ Normalized: US11003865.txt
✔ Normalized: US11023715.txt
✔ Normalized: US11238332.txt
✔ Normalized: US11295552.txt
✔ Normalized: US11328398.txt
✔ Normalized: US11562147.txt
✔ Normalized: US11636570.txt
✔ Normalized: US11749857.txt
✔ Normalized: US11900261.txt
✔ Normalized: US11921824.txt
✔ Normalized: US11961514.txt
✔ Normalized: US11989527.txt
✔ Normalized: US11991338.txt
✔ Normalized: US12148421.txt
✔ Normalized: US12182506.txt
✔ Normalized: US12217382.txt
✔ Normalized: US12271791B2.txt
✔ Normalized: US12282696B2.txt
✔ Normalized: US20210183484A1.txt
✔ Normalized: US20220101113A1.txt
✔ Normalized: US20230252224A1.txt
✔ Normalized: US20240185001A1.txt
✔ Normalized: US20240256792A1.txt
✔ Normalized: US20240346254A1.txt
✔ Normalized: US8332207.txt
✔ Normalized: US8812291.txt
✔ Normalized: US9037464.txt
✔ Normal