# 📚 Agricultural PDF Text Extraction Pipeline

This notebook extracts text from agricultural PDFs and creates a unified dataset for LLM fine-tuning.

## Features:
- **Smart Detection**: Automatically detects digital vs scanned PDFs by sampling the first few pages
- **Fast Extraction**: Uses PyMuPDF for digital PDFs (no OCR pass needed)
- **High-Quality OCR**: Uses Chandra for scanned PDFs
- **JSONL Output**: One JSON record per page, one record per line
- **Rich Metadata**: Includes source file, page numbers, extraction method, and timestamps

## Output Format:
```json
{"id": "chapter02-p1", "doc": "chapter02", "page": 1, "total_pages": 24, "text": "...", "source": "chapter02.pdf", "method": "digital", "timestamp": "..."}
```
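
As a quick sanity check on this schema, here is a minimal sketch that reports which expected keys a JSONL line is missing. The helper name `validate_record` and the `REQUIRED_KEYS` set are illustrative assumptions, not part of the pipeline itself; the key list mirrors what the extraction functions in Steps 4 and 5 write.

```python
import json

# Assumed key set: the fields written by the extraction functions in Steps 4 and 5.
REQUIRED_KEYS = {"id", "doc", "page", "total_pages", "text", "source", "method", "timestamp"}

def validate_record(line):
    """Parse one JSONL line and return a sorted list of missing keys (hypothetical helper)."""
    record = json.loads(line)
    return sorted(REQUIRED_KEYS - record.keys())

example = json.dumps({
    "id": "chapter02-p1", "doc": "chapter02", "page": 1, "total_pages": 24,
    "text": "...", "source": "chapter02.pdf", "method": "digital", "timestamp": "...",
})
print(validate_record(example))  # an empty list means no keys are missing
```

Running this over every line of the finished dataset is a cheap way to catch malformed records before training.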

## 🔧 Step 1: Install Dependencies

In [None]:
# Install required packages
!pip install pymupdf tqdm -q

# Uncomment below if you have scanned PDFs that need OCR
# !pip install chandra-ocr -q

## 📁 Step 2: Configuration & Setup

In [None]:
from pathlib import Path
import json
from datetime import datetime
import fitz  # PyMuPDF
from tqdm.auto import tqdm  # Picks the notebook or console progress bar automatically

# ==================== CONFIGURATION ====================
PDF_DIR = Path("pdfs")  # Input directory with PDFs
OUTPUT_DIR = Path("extracted_data")  # Output directory
DATASET_FILE = OUTPUT_DIR / "agricultural_dataset.jsonl"  # Final dataset

# Analysis settings
SAMPLE_PAGES = 5  # Pages to check for digital text
MIN_CHARS_PER_PAGE = 100  # Threshold for digital vs scanned
# =======================================================

# Create output directory
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"📂 Input: {PDF_DIR.absolute()}")
print(f"📂 Output: {OUTPUT_DIR.absolute()}")
print(f"📄 Dataset: {DATASET_FILE.name}")

## 🔍 Step 3: Analyze PDFs

In [None]:
def analyze_pdf(pdf_path, sample_pages=SAMPLE_PAGES):
    """
    Analyze PDF to determine if it's digital (has extractable text) or scanned (needs OCR).
    
    Args:
        pdf_path: Path to PDF file
        sample_pages: Number of pages to check
    
    Returns:
        dict with analysis results
    """
    doc = fitz.open(pdf_path)
    total_chars = 0
    pages_to_check = min(sample_pages, len(doc))
    
    # Check first N pages for text content
    for i in range(pages_to_check):
        text = doc[i].get_text().strip()
        total_chars += len(text)
    
    avg_chars = total_chars / pages_to_check if pages_to_check > 0 else 0
    is_digital = avg_chars >= MIN_CHARS_PER_PAGE
    
    result = {
        "path": pdf_path,
        "name": pdf_path.name,
        "stem": pdf_path.stem,
        "size_mb": pdf_path.stat().st_size / (1024 * 1024),
        "total_pages": len(doc),
        "avg_chars_per_page": round(avg_chars, 1),
        "is_digital": is_digital,
        "extraction_method": "digital" if is_digital else "ocr"
    }
    
    doc.close()
    return result

# Find and analyze all PDFs
pdf_files = sorted(PDF_DIR.glob("*.pdf"))
pdf_analysis = []  # Defined up front so later steps still run when no PDFs are found
print(f"Found {len(pdf_files)} PDF files\n")

if not pdf_files:
    print("❌ No PDF files found! Please check the PDF_DIR path.")
else:
    print("Analyzing PDFs...")
    print("=" * 80)
    
    pdf_analysis = []
    digital_count = 0
    scanned_count = 0
    
    for pdf in tqdm(pdf_files, desc="Analyzing"):
        info = analyze_pdf(pdf)
        pdf_analysis.append(info)
        
        if info["is_digital"]:
            digital_count += 1
        else:
            scanned_count += 1
    
    # Display results
    print("\n📊 Analysis Results:")
    print("=" * 80)
    for info in pdf_analysis:
        status = "🟢 DIGITAL" if info["is_digital"] else "🔴 SCANNED (needs OCR)"
        print(f"{info['name']:<50} | {info['total_pages']:3d} pages | {info['size_mb']:5.1f} MB | {status}")
    
    print("\n" + "=" * 80)
    print(f"📈 Summary: {digital_count} digital, {scanned_count} scanned (total: {len(pdf_files)})")
    
    if scanned_count > 0:
        print("\n⚠️  WARNING: Some PDFs need OCR. This will be handled in Step 5.")

## ⚡ Step 4: Extract Digital PDFs (Fast Method)

In [None]:
def extract_digital_pdf(pdf_info):
    """
    Extract text from digital PDFs using PyMuPDF (fast, no OCR needed).
    
    Args:
        pdf_info: Dictionary with PDF information from analyze_pdf()
    
    Returns:
        List of records (one per page)
    """
    doc = fitz.open(pdf_info['path'])
    records = []
    
    for page_num in range(len(doc)):
        page = doc[page_num]
        text = page.get_text("text").strip()
        
        # Only include pages with actual content
        if text:
            record = {
                "id": f"{pdf_info['stem']}-p{page_num + 1}",
                "doc": pdf_info['stem'],
                "page": page_num + 1,
                "total_pages": pdf_info['total_pages'],
                "text": text,
                "source": pdf_info['name'],
                "method": "digital",
                "timestamp": datetime.utcnow().isoformat() + "Z"
            }
            records.append(record)
    
    doc.close()
    return records

# Extract all digital PDFs
digital_pdfs = [info for info in pdf_analysis if info["is_digital"]]
all_records = []

if digital_pdfs:
    print(f"Extracting {len(digital_pdfs)} digital PDFs...\n")
    
    for pdf_info in tqdm(digital_pdfs, desc="Extracting digital PDFs"):
        records = extract_digital_pdf(pdf_info)
        all_records.extend(records)
        print(f"  ✓ {pdf_info['name']}: {len(records)} pages extracted")
    
    print(f"\n✅ Extracted {len(all_records)} pages from digital PDFs")
else:
    print("No digital PDFs found. All PDFs require OCR.")

## 🤖 Step 5: Extract Scanned PDFs (OCR Method)

**Note**: This step uses Chandra OCR for scanned PDFs. If you don't have scanned PDFs, you can skip this step.
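
Because the OCR step shells out to the `chandra` command, it can help to confirm the tool is on your `PATH` before starting a long run. A minimal sketch, assuming the executable is named `chandra` as in the cell below; `ocr_available` is a hypothetical helper name:

```python
import shutil

def ocr_available(executable="chandra"):
    """Return True if the OCR command-line tool is found on PATH (hypothetical helper)."""
    return shutil.which(executable) is not None

if not ocr_available():
    print("chandra not found on PATH; install it (see Step 1) before running the OCR step.")
```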

In [None]:
import subprocess
import re

def extract_scanned_pdf_with_chandra(pdf_info):
    """
    Extract text from scanned PDFs using Chandra OCR.
    
    Args:
        pdf_info: Dictionary with PDF information
    
    Returns:
        List of records (one per page)
    """
    # Run Chandra OCR
    cmd = [
        "chandra",
        str(pdf_info['path']),
        str(OUTPUT_DIR),
        "--method", "hf",  # Use HuggingFace method (local, no server needed)
        "--no-images"  # Skip images to save space
    ]
    
    print(f"  Running OCR on {pdf_info['name']}...")
    subprocess.run(cmd, check=True, capture_output=True, text=True)
    
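    # Find the generated markdown file. Assumption: depending on the Chandra
    # version, output lands either directly in OUTPUT_DIR or in a per-document
    # subdirectory, so check both locations.
    candidates = [
        OUTPUT_DIR / f"{pdf_info['stem']}.md",
        OUTPUT_DIR / pdf_info['stem'] / f"{pdf_info['stem']}.md",
    ]
    md_file = next((p for p in candidates if p.exists()), None)
    
    if md_file is None:
        print(f"  ⚠️  Warning: Expected output file {candidates[0]} not found")
        return []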
    
    # Read and parse the markdown output
    with open(md_file, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Split by page markers (Chandra typically uses ---); if no marker is
    # present, split() returns the whole content as a single page
    pages = content.split('\n---\n')
    
    # Create records
    records = []
    for page_num, page_text in enumerate(pages, start=1):
        page_text = page_text.strip()
        if page_text:
            record = {
                "id": f"{pdf_info['stem']}-p{page_num}",
                "doc": pdf_info['stem'],
                "page": page_num,
                "total_pages": len(pages),
                "text": page_text,
                "source": pdf_info['name'],
                "method": "ocr",
                "timestamp": datetime.utcnow().isoformat() + "Z"
            }
            records.append(record)
    
    return records

# Extract scanned PDFs
scanned_pdfs = [info for info in pdf_analysis if not info["is_digital"]]

if scanned_pdfs:
    print(f"\nProcessing {len(scanned_pdfs)} scanned PDFs with Chandra OCR...")
    print("⚠️  This may take several minutes per document.\n")
    
    for pdf_info in scanned_pdfs:
        try:
            records = extract_scanned_pdf_with_chandra(pdf_info)
            all_records.extend(records)
            print(f"  ✓ {pdf_info['name']}: {len(records)} pages extracted")
        except Exception as e:
            print(f"  ❌ Error processing {pdf_info['name']}: {e}")
    
    print(f"\n✅ OCR extraction complete")
else:
    print("\n✅ No scanned PDFs found. All PDFs were digital.")

## 💾 Step 6: Save Complete Dataset

In [None]:
# Save all records to JSONL
with open(DATASET_FILE, 'w', encoding='utf-8') as f:
    for record in all_records:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

# Calculate statistics
total_pages = len(all_records)
total_docs = len(set(r['doc'] for r in all_records))
total_chars = sum(len(r['text']) for r in all_records)
avg_chars_per_page = total_chars / total_pages if total_pages > 0 else 0

print("="* 80)
print("📊 DATASET STATISTICS")
print("="* 80)
print(f"Total documents:        {total_docs}")
print(f"Total pages:            {total_pages:,}")
print(f"Total characters:       {total_chars:,}")
print(f"Avg chars per page:     {avg_chars_per_page:,.0f}")
print(f"\nDataset saved to:       {DATASET_FILE}")
print(f"File size:              {DATASET_FILE.stat().st_size / (1024*1024):.2f} MB")
print("="* 80)

# Show extraction method breakdown
digital_count = sum(1 for r in all_records if r['method'] == 'digital')
ocr_count = sum(1 for r in all_records if r['method'] == 'ocr')
print(f"\n📈 Extraction Methods:")
print(f"  Digital (fast):       {digital_count} pages")
print(f"  OCR (Chandra):        {ocr_count} pages")

## 👀 Step 7: Preview Dataset

In [None]:
# Load and preview first 3 records (islice stops reading after the third line)
from itertools import islice

with open(DATASET_FILE, 'r', encoding='utf-8') as f:
    samples = [json.loads(line) for line in islice(f, 3)]

print("📖 DATASET PREVIEW (First 3 records)\n")

for i, sample in enumerate(samples, 1):
    print("=" * 80)
    print(f"Record {i}/{len(samples)}")
    print("=" * 80)
    print(f"ID:              {sample['id']}")
    print(f"Document:        {sample['doc']}")
    print(f"Page:            {sample['page']}/{sample['total_pages']}")
    print(f"Source:          {sample['source']}")
    print(f"Method:          {sample['method']}")
    print(f"Text length:     {len(sample['text'])} characters")
    print(f"\nText preview (first 500 chars):")
    print("-" * 80)
    preview = sample['text'][:500]
    print(preview + ("..." if len(sample['text']) > 500 else ""))
    print("\n")

## 📋 Step 8: Document-Level Statistics

In [None]:
# Analyze each document
from collections import defaultdict

doc_stats = defaultdict(lambda: {'pages': 0, 'chars': 0})

for record in all_records:
    doc_stats[record['doc']]['pages'] += 1
    doc_stats[record['doc']]['chars'] += len(record['text'])
    doc_stats[record['doc']]['source'] = record['source']
    doc_stats[record['doc']]['method'] = record['method']

print("📚 DOCUMENT-LEVEL STATISTICS")
print("=" * 90)
print(f"{'Document':<45} | {'Pages':>6} | {'Chars':>10} | {'Method':>8}")
print("=" * 90)

for doc_name, stats in sorted(doc_stats.items()):
    print(f"{stats['source']:<45} | {stats['pages']:>6} | {stats['chars']:>10,} | {stats['method']:>8}")

print("=" * 90)

## ✅ Next Steps

Your dataset is ready! Here's what you can do next:

### 1. Load the Dataset
```python
import json

# Load all records
with open('extracted_data/agricultural_dataset.jsonl', 'r', encoding='utf-8') as f:
    data = [json.loads(line) for line in f]

# Or load with HuggingFace datasets
from datasets import load_dataset
dataset = load_dataset('json', data_files='extracted_data/agricultural_dataset.jsonl')
```

### 2. Generate Q&A Pairs (Next Phase)
- Use GPT-4 or Claude to generate question-answer pairs
- Use the text as context for creating training examples
- Format for instruction fine-tuning
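
The steps above could be sketched roughly as follows. This only builds the prompt text and the output record; it does not call any model API. The function names, the prompt wording, and the `instruction`/`input`/`output` field layout are illustrative assumptions, so adapt them to whatever your fine-tuning framework expects.

```python
def build_qa_prompt(record, num_questions=3):
    """Build a Q&A-generation prompt that uses one extracted page as context."""
    return (
        f"Read the following passage from {record['source']} (page {record['page']}) "
        f"and write {num_questions} question-answer pairs that can be answered "
        "from the passage alone.\n\n"
        f"Passage:\n{record['text']}"
    )

def to_instruction_example(question, answer, record):
    """Wrap one generated Q&A pair in an instruction-tuning record (assumed layout)."""
    return {
        "instruction": question,
        "input": "",
        "output": answer,
        "source_id": record["id"],  # keeps the link back to the extracted page
    }

record = {"id": "chapter02-p1", "source": "chapter02.pdf", "page": 1, "text": "..."}
prompt = build_qa_prompt(record)
print(prompt)
```

Keeping `source_id` on every example makes it possible to trace a generated pair back to the page it came from.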

### 3. Create Embeddings for RAG
- Use OpenAI embeddings or open-source alternatives
- Store in vector database (Pinecone, Chroma, FAISS)
- Build retrieval-augmented generation system
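
Full pages are often too long to embed as a single unit, so a common first step is to split each page into overlapping chunks. A minimal character-based sketch; `chunk_text`, `chunk_record`, and the default sizes are assumptions chosen for illustration, and the embedding and vector-store calls are deliberately left out.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

def chunk_record(record, **kwargs):
    """Turn one page record into chunk records that keep the page metadata."""
    return [
        {"id": f"{record['id']}-c{i}", "doc": record["doc"], "page": record["page"], "text": chunk}
        for i, chunk in enumerate(chunk_text(record["text"], **kwargs), start=1)
    ]

print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

Each chunk record can then be passed to the embedding model of your choice, with `id`, `doc`, and `page` stored alongside the vector for citation at retrieval time.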

### 4. Fine-tune an LLM
- Use for continued pre-training on agricultural domain
- Or create instruction dataset first, then fine-tune
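
Whichever route you take, it usually helps to hold out a validation set first. A minimal sketch, assuming you want to split by document so that pages from the same PDF never end up on both sides; `split_by_document`, the 10% default, and the fixed seed are illustrative choices rather than part of this notebook.

```python
import random

def split_by_document(records, val_fraction=0.1, seed=42):
    """Split page records into train/validation sets without sharing documents."""
    docs = sorted({r["doc"] for r in records})
    random.Random(seed).shuffle(docs)  # seeded so the split is reproducible
    # Hold out at least one document when there is more than one to choose from
    n_val = max(1, round(len(docs) * val_fraction)) if len(docs) > 1 else 0
    val_docs = set(docs[:n_val])
    train = [r for r in records if r["doc"] not in val_docs]
    val = [r for r in records if r["doc"] in val_docs]
    return train, val

records = [{"doc": d, "page": p} for d in "abcd" for p in (1, 2)]
train, val = split_by_document(records, val_fraction=0.25)
print(len(train), len(val))
```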

---

**Dataset Location**: `extracted_data/agricultural_dataset.jsonl`  
**Format**: One JSON object per line (JSONL)  
**Ready for**: LLM training, Q&A generation, embeddings, RAG