# 🏛️ CURIA Scraper - Complete Web Scraping Pipeline

This notebook provides a comprehensive solution for scraping legal documents from the **CURIA (Court of Justice of the European Union)** website. It automates the entire process from navigating search results to downloading PDFs and extracting structured metadata.

## ✨ Key Features

- **🔍 Smart Document Discovery**: Automatically finds and processes CURIA document links
- **📄 PDF Generation**: Clicks "Start Printing" buttons to generate clean PDF downloads  
- **🌐 Multi-language Support**: Filters documents by preferred language (EN, FR, DE, etc.)
- **📊 Structured Data**: Extracts case numbers, titles, dates, and metadata
- **🔄 Pagination Handling**: Automatically processes multiple result pages
- **⚡ Error Recovery**: Graceful fallbacks for failed downloads

## 🚀 Quick Start Guide

### 1. **Prerequisites**
- Python 3.8+ with Jupyter support
- Internet connection for CURIA website access
- Sufficient disk space for PDF downloads

### 2. **Configuration** 
The scraper auto-generates a `config.toml` file with CURIA-optimized defaults:
- **Target Language**: English (EN) - modify `preferred_language` as needed
- **Search URL**: CURIA search results page - update `listing_url` with your specific search
- **Output Directory**: `./output` - all files saved here
- **Browser Mode**: Visible by default (`headless = false`) for debugging

### 3. **What It Does**
1. ✅ Navigates through CURIA search result pages
2. ✅ Identifies document links matching your language preference  
3. ✅ Clicks "Start Printing" buttons when available for clean PDFs
4. ✅ Falls back to HTML parsing if print buttons unavailable
5. ✅ Extracts structured metadata (case numbers, titles, dates)
6. ✅ Handles pagination automatically across multiple pages
7. ✅ Saves everything with meaningful filenames (`curia-doc-{id}.pdf`)

### 4. **Example CURIA URLs**
- **Search Results**: `https://curia.europa.eu/juris/recherche.jsf?language=en`
- **Document Page**: `https://curia.europa.eu/juris/document/document.jsf?docid=304744&doclang=EN`

## ⚠️ Important Notes

- **Windows Users**: Due to subprocess limitations, the scraper includes a standalone Python script fallback
- **Rate Limiting**: Built-in delays prevent overwhelming the CURIA servers
- **Legal Compliance**: Ensure your usage complies with CURIA's terms of service
- **Testing**: Set `headless = false` in config.toml to watch the browser during debugging

## 📁 Output Structure

```
output/
├── curia-doc-12345.pdf      # PDF documents (when print button works)
├── doc_1.json               # Structured metadata for each document
├── doc_2.json               # Contains URLs, case numbers, titles, etc.
└── ...
```

Ready to start? Run the cells below in order! 🚀

In [1]:
# This is required on first run to install Playwright browsers so they work async etc.
!playwright install

In [2]:
# Step 2: Install Required Python Packages
# This cell automatically detects and installs missing dependencies

import subprocess
import sys
from playwright.sync_api import sync_playwright
import importlib.util

def check_package(package_name):
    """Check if a package is installed"""
    spec = importlib.util.find_spec(package_name)
    return spec is not None

def check_playwright_browsers():
    """Check if playwright browsers are installed"""
    try:
        with sync_playwright() as p:
            # Try to get browser executable path - if it fails, browsers aren't installed
            p.chromium.executable_path
            return True
    except Exception:
        return False

# Define packages to check (added nest-asyncio for Windows/Jupyter compatibility)
packages_to_check = [
    ("playwright", "playwright"),
    ("bs4", "beautifulsoup4"),
    ("pydantic", "pydantic"),
    ("toml", "toml"),
    ("nest_asyncio", "nest-asyncio"),
    ("requests", "requests"),
    ("urllib3", "urllib3")
]

installed_packages = []
missing_packages = []

# Check each package
for import_name, pip_name in packages_to_check:
    if check_package(import_name):
        installed_packages.append(pip_name)
    else:
        missing_packages.append(pip_name)

# Check playwright browsers separately
playwright_browsers_installed = check_playwright_browsers() if check_package("playwright") else False

print(f"📦 Package Status: {len(installed_packages)} of {len(packages_to_check)} packages already installed")

# Install missing packages
if missing_packages:
    print(f"\n🔧 Installing missing packages: {', '.join(missing_packages)}")
    for package in missing_packages:
        print(f"Installing {package}...")
        subprocess.run([sys.executable, "-m", "pip", "install", package], check=True)
else:
    print("✅ All required packages already installed")

# Install playwright browsers if needed
if check_package("playwright") and not playwright_browsers_installed:
    print("\n🎭 Installing Playwright browsers...")
    subprocess.run([sys.executable, "-m", "playwright", "install"], check=True)
elif not check_package("playwright"):
    print("\n🎭 Installing Playwright browsers...")
    subprocess.run([sys.executable, "-m", "playwright", "install"], check=True)
else:
    print("✅ Playwright browsers already installed")

print("\n🎉 Installation complete! All dependencies are ready.")
print("💡 Tip: If you encounter Windows/Jupyter issues, use the standalone script generated at the end.")

📦 Package Status: 7 of 7 packages already installed
✅ All required packages already installed

🎭 Installing Playwright browsers...

🎉 Installation complete! All dependencies are ready.
💡 Tip: If you encounter Windows/Jupyter issues, use the standalone script generated at the end.

🎉 Installation complete! All dependencies are ready.
💡 Tip: If you encounter Windows/Jupyter issues, use the standalone script generated at the end.


# Step 3: Import Core Libraries
Import all required libraries for web scraping, data processing, and file management.

In [3]:
# Core library imports for the CURIA scraper
import asyncio
import json
from pathlib import Path
import toml
from pydantic import BaseModel
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright, Page
import logging

print("✅ Core libraries imported successfully")

✅ Core libraries imported successfully


# Step 4: Configuration Setup
Load or create the configuration file with CURIA-specific settings and validation models.

In [4]:
# Configuration models using Pydantic for validation
class GeneralSettings(BaseModel):
    output_dir: str
    checkpoint_file: str
    headless: bool
    throttle_delay_ms: int
    preferred_language: str = "EN"  # Language code for CURIA documents

class SiteSettings(BaseModel):
    listing_url: str
    document_content_selector: str = "body"
    start_print_button_text: str = "Start Printing"
    document_link_selector: str = "div#docHtml a[href*='document.jsf']"
    next_page_selector: str = "a[title='Next Page']"

class Settings(BaseModel):
    general: GeneralSettings
    site: SiteSettings

# Create or load configuration
config_path = Path("config.toml")

if not config_path.exists():
    print("📝 Creating default configuration file...")
    default_config = {
        "general": {
            "output_dir": "./output",
            "checkpoint_file": "./checkpoint.json",
            "headless": False,  # Visible browser for debugging
            "throttle_delay_ms": 2000,
            "preferred_language": "EN"
        },
        "site": {
            "listing_url": "https://curia.europa.eu/juris/recherche.jsf?language=en",
            "document_content_selector": "body",
            "start_print_button_text": "Start Printing",
            "document_link_selector": "div#docHtml a[href*='document.jsf']",
            "next_page_selector": "a[title='Next Page']"
        }
    }
    with open(config_path, "w") as f:
        toml.dump(default_config, f)
    print("✅ Created config.toml with CURIA defaults")

# Load and validate configuration
try:
    config_data = toml.load("config.toml")

    # Ensure all required fields exist with backward compatibility
    if "site" in config_data:
        site_config = config_data["site"]

        # Add missing fields with defaults
        defaults = {
            "document_link_selector": "div#docHtml a[href*='document.jsf']",
            "next_page_selector": "a[title='Next Page']",
            "document_content_selector": "body",
            "start_print_button_text": "Start Printing"
        }

        for key, default_value in defaults.items():
            if key not in site_config:
                site_config[key] = default_value
                print(f"➕ Added missing config field: {key}")

    if "general" in config_data:
        general_config = config_data["general"]
        if "preferred_language" not in general_config:
            general_config["preferred_language"] = "EN"
            print("➕ Added missing preferred_language field")

    # Save updated configuration
    with open(config_path, "w") as f:
        toml.dump(config_data, f)

    settings = Settings(**config_data)
    print("✅ Configuration loaded and validated successfully!")

except Exception as e:
    print(f"⚠️ Configuration error: {e}")
    print("🔄 Creating fresh configuration...")

    # Fallback to fresh config
    default_config = {
        "general": {
            "output_dir": "./output",
            "checkpoint_file": "./checkpoint.json",
            "headless": False,
            "throttle_delay_ms": 2000,
            "preferred_language": "EN"
        },
        "site": {
            "listing_url": "https://curia.europa.eu/juris/recherche.jsf?language=en",
            "document_content_selector": "body",
            "start_print_button_text": "Start Printing",
            "document_link_selector": "div#docHtml a[href*='document.jsf']",
            "next_page_selector": "a[title='Next Page']"
        }
    }
    with open(config_path, "w") as f:
        toml.dump(default_config, f)
    settings = Settings(**default_config)
    print("✅ Fresh configuration created!")

# Display current configuration
print(f"\n📋 CURIA Scraper Configuration:")
print(f"   🌐 Target Language: {settings.general.preferred_language}")
print(f"   🔗 Listing URL: {settings.site.listing_url}")
print(f"   👁️  Headless Mode: {settings.general.headless}")
print(f"   📁 Output Directory: {settings.general.output_dir}")
print(f"   🎯 Document Selector: {settings.site.document_link_selector}")
print(f"   ▶️  Next Page Selector: {settings.site.next_page_selector}")
print(f"\n💡 Edit config.toml to customize settings before running the scraper!")

✅ Configuration loaded and validated successfully!

📋 CURIA Scraper Configuration:
   🌐 Target Language: en
   🔗 Listing URL: https://curia.europa.eu/juris/documents.jsf?nat=or&mat=or&pcs=Oor&jur=C%2CT%2CF&for=&jge=&dates=%2524type%253Dpro%2524mode%253D1M%2524from%253D2025.09.27%2524to%253D2025.10.27&language=en&pro=&cit=none%252CC%252CCJ%252CR%252C2008E%252C%252C%252C%252C%252C%252C%252C%252C%252C%252Ctrue%252Cfalse%252Cfalse&oqp=&td=%24mode%3D1M%24from%3D2025.09.27%24to%3D2025.10.27%3B%3BPUB%3BPUB1%2CPUB2%2CPUB4%2CPUB7%2CPUB3%2CPUB8%2CPUB5%2CPUB6%3B%3B%3B%3BORDALL&avg=&lgrec=en&page=1&lg=EN%252C%252Btrue%252Cfalse&cid=7416927
   👁️  Headless Mode: True
   📁 Output Directory: ./output
   🎯 Document Selector: div#docHtml a[href*='document.jsf']
   ▶️  Next Page Selector: a[title='Next Page']

💡 Edit config.toml to customize settings before running the scraper!


# Step 5: Logging Configuration
Set up structured logging to track scraper progress and debug issues.

In [5]:
# Configure logging for the CURIA scraper
logger = logging.getLogger("curia_scraper")
handler = logging.StreamHandler()
formatter = logging.Formatter("[%(asctime)s] %(levelname)s %(message)s")
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

print("✅ Logging configured - scraper progress will be displayed with timestamps")

✅ Logging configured - scraper progress will be displayed with timestamps


# Step 6: Browser Management
Define the browser manager class for handling Playwright browser instances and downloads.

In [6]:
# Browser manager for Playwright automation
class BrowserManager:
    """Manages Playwright browser instances with proper cleanup"""

    def __init__(self, headless: bool = True, downloads_path: str = None):
        self.headless = headless
        self.downloads_path = downloads_path

    async def __aenter__(self):
        """Start Playwright and launch browser"""
        self.playwright = await async_playwright().start()
        self.browser = await self.playwright.chromium.launch(headless=self.headless)
        return self

    async def __aexit__(self, exc_type, exc, tb):
        """Clean up browser and Playwright resources"""
        if self.browser:
            await self.browser.close()
        await self.playwright.stop()

    async def new_page(self):
        """Create a new browser page with download configuration"""
        context_args = {}
        if self.downloads_path:
            context_args["accept_downloads"] = True
            context_args["downloads_path"] = self.downloads_path
        context = await self.browser.new_context(**context_args)
        page = await context.new_page()
        return page

print("✅ Browser manager class defined - ready for web automation")

✅ Browser manager class defined - ready for web automation


# Step 7: HTML Parser Functions
CURIA-specific functions for extracting structured data from legal documents.

In [7]:
# Import additional libraries for parsing
from datetime import datetime
import re

def parse_curia_document(html: str, url: str, doc_id: str = None) -> dict:
    """
    Parse CURIA legal document HTML and extract structured metadata

    Args:
        html: Raw HTML content of the document
        url: Document URL for context
        doc_id: Optional document ID for reference

    Returns:
        dict: Structured document metadata
    """
    soup = BeautifulSoup(html, "html.parser")

    # Extract case number using multiple approaches
    case_number = None
    case_selectors = [
        "span.case-number",
        "div.case-number",
        "h1", "h2", "h3"  # Often in headings
    ]

    # Try CSS selectors first
    for selector in case_selectors:
        tag = soup.select_one(selector)
        if tag and tag.text.strip():
            text = tag.text.strip()
            # Look for case pattern in the text
            case_match = re.search(r'Case\s+[A-Z]-\d+/\d+', text, re.I)
            if case_match:
                case_number = case_match.group(0)
                break

    # Fallback: search all text for case patterns
    if not case_number:
        text_content = soup.get_text()
        case_matches = re.findall(r'Case\s+[A-Z]-\d+/\d+', text_content, re.I)
        if case_matches:
            case_number = case_matches[0]

    # Extract document title
    title = None
    title_selectors = ["title", "h1", "h2", ".document-title", ".judgment-title"]
    for selector in title_selectors:
        tag = soup.select_one(selector)
        if tag and tag.text.strip():
            title = tag.text.strip()
            # Clean up common title prefixes
            title = re.sub(r'^(CURIA\s*-\s*)?', '', title, flags=re.I)
            break

    # Extract date of judgment using various patterns
    date_of_judgment = None
    date_patterns = [
        r'\b\d{1,2}[\/\-\.]\d{1,2}[\/\-\.]\d{4}\b',        # DD/MM/YYYY
        r'\b\d{4}[\/\-\.]\d{1,2}[\/\-\.]\d{1,2}\b',        # YYYY/MM/DD
        r'\b\d{1,2}\s+\w+\s+\d{4}\b'                       # DD Month YYYY
    ]

    text_content = soup.get_text()
    for pattern in date_patterns:
        matches = re.findall(pattern, text_content)
        if matches:
            date_of_judgment = matches[0]
            break

    # Extract parties from structured tables
    parties = []
    tables = soup.find_all("table")
    for table in tables:
        rows = table.find_all("tr")
        for row in rows:
            cells = row.find_all(["td", "th"])
            if len(cells) >= 2:
                label = cells[0].get_text(strip=True).lower()
                value = cells[1].get_text(strip=True)
                # Look for party-related information
                if any(keyword in label for keyword in ["party", "applicant", "defendant", "member state"]):
                    if value and value not in parties:
                        parties.append(value)

    # Extract language from URL
    language = None
    lang_match = re.search(r'doclang=([A-Z]{2})', url)
    if lang_match:
        language = lang_match.group(1)

    return {
        "doc_id": doc_id,
        "url": url,
        "language": language,
        "case_number": case_number,
        "title": title,
        "date_of_judgment": date_of_judgment,
        "parties": parties,
        "html_length": len(html),
        "extracted_at": str(datetime.now()),
        "html": html  # Include full HTML for reference
    }

print("✅ CURIA document parser functions defined")

✅ CURIA document parser functions defined


# Step 8: File Storage Functions
Functions for saving scraped data in JSON format with proper organization.

In [8]:
# Storage functions for saving scraped data
def save_json(content: dict, idx: int):
    """
    Save document metadata or content to a JSON file

    Args:
        content: Dictionary containing document data
        idx: Index number for filename
    """
    output_dir = Path(settings.general.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    filename = output_dir / f"doc_{idx}.json"

    with open(filename, "w", encoding="utf-8") as f:
        json.dump(content, f, ensure_ascii=False, indent=2)

    logger.info(f"Saved JSON metadata: {filename}")

print("✅ Storage functions defined - ready to save scraped data")

✅ Storage functions defined - ready to save scraped data


# Step 9: Document Processing Pipeline
Core function for processing individual CURIA documents with PDF generation and metadata extraction.

In [9]:
# Document processor with CURIA-specific workflow
async def process_document(page: Page, doc_link: str, idx: int):
    """
    Process a single CURIA document: navigate, find print button, generate PDF or save HTML

    Args:
        page: Playwright page instance
        doc_link: URL of the document to process
        idx: Document index for naming
    """
    logger.info(f"[{idx}] Processing CURIA document: {doc_link}")

    # Navigate to the document page
    await page.goto(doc_link)
    await page.wait_for_load_state("networkidle")

    # Extract document ID from URL for better filename
    doc_id_match = re.search(r'docid=(\d+)', doc_link)
    doc_id = doc_id_match.group(1) if doc_id_match else str(idx)

    log_entry = {
        "idx": idx,
        "doc_id": doc_id,
        "url": doc_link,
        "used_path": None,
        "filename": None
    }

    # Look for the "Start Printing" button (CURIA specific)
    print_button_selectors = [
        "input[value*='Start Printing']",
        "button:has-text('Start Printing')",
        "a:has-text('Start Printing')",
        "input[type='submit'][value*='Print']",
        "*:has-text('Start Printing')"
    ]

    print_btn = None
    for selector in print_button_selectors:
        try:
            print_btn = await page.wait_for_selector(selector, timeout=3000)
            if print_btn:
                logger.info(f"[{idx}] Found 'Start Printing' button with selector: {selector}")
                break
        except:
            continue

    if print_btn:
        # Click the print button and wait for the print-ready page
        await print_btn.click()
        await page.wait_for_load_state("networkidle")

        # Wait for print content to load properly
        await page.wait_for_timeout(2000)

        filename = f"curia-doc-{doc_id}.pdf"
        filepath = Path(settings.general.output_dir) / filename

        try:
            # Set print media emulation for cleaner PDFs
            await page.emulate_media(media="print")
        except Exception as e:
            logger.warning(f"[{idx}] Media emulation failed: {e}")

        # Generate PDF with proper formatting
        await page.pdf(
            path=str(filepath),
            format="A4",
            print_background=True,
            margin={
                "top": "1cm",
                "right": "1cm",
                "bottom": "1cm",
                "left": "1cm"
            }
        )
        logger.info(f"[{idx}] Saved PDF: {filepath}")

        log_entry["used_path"] = "page_pdf"
        log_entry["filename"] = filename
        save_json(log_entry, idx)
        return

    # Fallback: if no print button found, save HTML content with metadata
    logger.info(f"[{idx}] No 'Start Printing' button found; extracting HTML content")

    try:
        # Get the main content area
        content_element = await page.query_selector("body")
        html_content = await content_element.inner_html() if content_element else await page.content()

        # Parse the HTML and extract structured data
        parsed = parse_curia_document(html_content, doc_link, doc_id)
        parsed["filename"] = None
        parsed["used_path"] = "html_parse"
        save_json(parsed, idx)

    except Exception as e:
        logger.error(f"[{idx}] Error processing document: {e}")
        log_entry["error"] = str(e)
        save_json(log_entry, idx)

print("✅ Document processing function defined")

✅ Document processing function defined


In [10]:
# Step 10: Main Listing Processor
# Core scraper function that handles pagination, document discovery, and batch processing

async def process_listing():
    """
    Main CURIA listing processor: navigate search results, find documents, handle pagination
    """
    output_dir = Path(settings.general.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    async with BrowserManager(headless=settings.general.headless, downloads_path=str(output_dir)) as bm:
        page = await bm.new_page()

        logger.info(f"Navigating to: {settings.site.listing_url}")
        await page.goto(settings.site.listing_url)
        await page.wait_for_load_state("networkidle")

        idx = 0
        page_num = 1

        while True:
            logger.info(f"Processing page {page_num}...")

            # Wait for page content to load
            try:
                await page.wait_for_selector("table, .results, .judgment-list", timeout=10000)
            except:
                logger.warning("No standard content selectors found, continuing anyway...")

            # Find document links using multiple selectors for robustness
            document_links = []
            link_selectors = [
                settings.site.document_link_selector,          # From config
                "div#docHtml a[href*='document.jsf']",         # CURIA specific
                "a[href*='document.jsf']",                     # More general
                "a[href*='docid=']"                            # Even more general
            ]

            for selector in link_selectors:
                try:
                    links = await page.query_selector_all(selector)
                    if links:
                        logger.info(f"Found {len(links)} document links using selector: {selector}")

                        for link in links:
                            href = await link.get_attribute("href")
                            if href:
                                # Ensure absolute URL
                                if href.startswith("http"):
                                    full_url = href
                                else:
                                    base_url = "https://curia.europa.eu"
                                    full_url = f"{base_url}{href}" if href.startswith("/") else f"{base_url}/{href}"

                                # Filter for preferred language if specified
                                if settings.general.preferred_language:
                                    if f"doclang={settings.general.preferred_language}" in full_url:
                                        document_links.append(full_url)
                                    elif "doclang=" not in full_url:
                                        # Add language parameter if not present
                                        lang_param = f"&doclang={settings.general.preferred_language}"
                                        full_url += lang_param
                                        document_links.append(full_url)
                                else:
                                    document_links.append(full_url)
                        break
                except Exception as e:
                    logger.warning(f"Selector '{selector}' failed: {e}")
                    continue

            if not document_links:
                logger.warning("No document links found on this page")
                break

            # Remove duplicates while preserving order
            unique_links = []
            seen = set()
            for link in document_links:
                # Extract docid for deduplication
                doc_match = re.search(r'docid=(\d+)', link)
                if doc_match:
                    doc_id = doc_match.group(1)
                    if doc_id not in seen:
                        seen.add(doc_id)
                        unique_links.append(link)

            logger.info(f"Processing {len(unique_links)} unique documents from page {page_num}")

            # Process each document
            for doc_link in unique_links:
                idx += 1
                logger.info(f"[{idx}] Processing document: {doc_link}")

                try:
                    await process_document(page, doc_link, idx)
                    # Small delay between documents to avoid overwhelming the server
                    await page.wait_for_timeout(1000)
                except Exception as e:
                    logger.error(f"[{idx}] Error processing {doc_link}: {e}")

            # Look for next page button
            next_btn = None
            next_selectors = [
                settings.site.next_page_selector,
                "a[title*='Next']",
                "a[title*='next']",
                "a:has-text('Next')",
                "a:has-text('»')",
                ".next-page",
                "input[value*='Next']"
            ]

            for selector in next_selectors:
                try:
                    next_btn = await page.query_selector(selector)
                    if next_btn:
                        # Check if button is enabled
                        is_disabled = await next_btn.get_attribute("disabled")
                        if not is_disabled:
                            logger.info(f"Found next page button with selector: {selector}")
                            break
                        else:
                            next_btn = None
                except:
                    continue

            if not next_btn:
                logger.info("No next page button found or all pages processed")
                break

            # Navigate to next page
            logger.info(f"Moving to page {page_num + 1}")
            await next_btn.click()
            await page.wait_for_load_state("networkidle")
            await page.wait_for_timeout(settings.general.throttle_delay_ms)
            page_num += 1

        logger.info(f"Scraping completed. Processed {idx} total documents across {page_num} pages.")

print("✅ Main listing processor function defined")

✅ Main listing processor function defined


In [14]:
# Step 11: Execute CURIA Scraper
# Main execution cell with Windows/Jupyter compatibility and comprehensive error handling

import nest_asyncio
import concurrent.futures
import threading

# Apply nest_asyncio to allow nested event loops in Jupyter
nest_asyncio.apply()

print("🚀 Starting CURIA scraper...")
print(f"📁 Output directory: {settings.general.output_dir}")
print(f"🌐 Target URL: {settings.site.listing_url}")
print(f"🔍 Headless mode: {settings.general.headless}")
print(f"🌍 Language filter: {settings.general.preferred_language}")

def run_scraper_sync():
    """Run the scraper in a separate thread with its own event loop (Windows compatibility)"""
    print("\n🔧 Using threaded execution for Windows compatibility...")

    # Create a new event loop for this thread
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

    try:
        # Verify output directory
        output_dir = Path(settings.general.output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
        print(f"✅ Output directory ready: {output_dir.absolute()}")

        # Run the async scraper
        result = loop.run_until_complete(process_listing())
        print("✅ Scraping completed successfully!")
        return result

    except Exception as e:
        print(f"❌ Error during scraping: {str(e)}")
        print(f"Error type: {type(e).__name__}")

        # Provide specific guidance based on error type
        if "NotImplementedError" in str(e):
            print("\n🔍 This is a Windows/asyncio subprocess issue.")
            print("💡 Try running the standalone script generated below.")
        elif "playwright" in str(e).lower():
            print("\n🔍 Playwright browser issue detected.")
            print("💡 Try running: !playwright install chromium")
        elif "config" in str(e).lower() or "url" in str(e).lower():
            print(f"\n🔍 Configuration issue. Current URL: {settings.site.listing_url}")
            print("💡 Make sure the listing_url in config.toml points to a real CURIA search results page.")
        else:
            print(f"\n🔍 Unexpected error: {e}")

        import traceback
        print("\n📝 Full traceback:")
        traceback.print_exc()
        return None

    finally:
        loop.close()

# Try direct execution first, fall back to threaded
try:
    print("🔄 Attempting direct async execution...")
    await process_listing()
    print("✅ Direct execution successful!")

except (NotImplementedError, RuntimeError) as e:
    print(f"⚠️ Direct execution failed: {type(e).__name__}")
    print("🔄 Switching to threaded execution...")

    # Use ThreadPoolExecutor for better error handling
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(run_scraper_sync)

        try:
            result = future.result(timeout=300)  # 5 minute timeout
            if result is not None:
                print("🎉 Scraping completed via threaded execution!")
            else:
                print("❌ Scraping failed - check errors above")

        except concurrent.futures.TimeoutError:
            print("⏰ Scraping timed out after 5 minutes")
        except Exception as e:
            print(f"❌ Threaded execution also failed: {e}")

except Exception as e:
    print(f"❌ Unexpected error in direct execution: {e}")
    print("🔄 Trying threaded fallback...")

    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(run_scraper_sync)
        try:
            result = future.result(timeout=300)
        except Exception as thread_error:
            print(f"❌ All execution methods failed: {thread_error}")

print("\n📋 Scraper execution complete. Check the output directory for results.")
print("💡 If you encountered Windows/subprocess issues, try the standalone script)!")

Task exception was never retrieved
future: <Task finished name='Task-18' coro=<Connection.run() done, defined at c:\Users\Ryan\anaconda3\envs\curia-scraper\Lib\site-packages\playwright\_impl\_connection.py:303> exception=NotImplementedError()>
Traceback (most recent call last):
  File "c:\Users\Ryan\anaconda3\envs\curia-scraper\Lib\asyncio\tasks.py", line 277, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "c:\Users\Ryan\anaconda3\envs\curia-scraper\Lib\site-packages\playwright\_impl\_connection.py", line 310, in run
    await self._transport.connect()
  File "c:\Users\Ryan\anaconda3\envs\curia-scraper\Lib\site-packages\playwright\_impl\_transport.py", line 133, in connect
    raise exc
  File "c:\Users\Ryan\anaconda3\envs\curia-scraper\Lib\site-packages\playwright\_impl\_transport.py", line 120, in connect
    self._proc = await asyncio.create_subprocess_exec(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Ryan\anaconda3\envs\curia-

🚀 Starting CURIA scraper...
📁 Output directory: ./output
🌐 Target URL: https://curia.europa.eu/juris/documents.jsf?nat=or&mat=or&pcs=Oor&jur=C%2CT%2CF&for=&jge=&dates=%2524type%253Dpro%2524mode%253D1M%2524from%253D2025.09.27%2524to%253D2025.10.27&language=en&pro=&cit=none%252CC%252CCJ%252CR%252C2008E%252C%252C%252C%252C%252C%252C%252C%252C%252C%252Ctrue%252Cfalse%252Cfalse&oqp=&td=%24mode%3D1M%24from%3D2025.09.27%24to%3D2025.10.27%3B%3BPUB%3BPUB1%2CPUB2%2CPUB4%2CPUB7%2CPUB3%2CPUB8%2CPUB5%2CPUB6%3B%3B%3B%3BORDALL&avg=&lgrec=en&page=1&lg=EN%252C%252Btrue%252Cfalse&cid=7416927
🔍 Headless mode: True
🌍 Language filter: en
🔄 Attempting direct async execution...
⚠️ Direct execution failed: NotImplementedError
🔄 Switching to threaded execution...

🔧 Using threaded execution for Windows compatibility...
✅ Output directory ready: d:\Repos\Active\Client\Pandektes\scraper\output
❌ Error during scraping: 
Error type: NotImplementedError

🔍 Unexpected error: 

📝 Full traceback:
❌ Scraping failed - c

Traceback (most recent call last):
  File "C:\Users\Ryan\AppData\Local\Temp\ipykernel_39716\1283879018.py", line 32, in run_scraper_sync
    result = loop.run_until_complete(process_listing())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Ryan\anaconda3\envs\curia-scraper\Lib\site-packages\nest_asyncio.py", line 98, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "c:\Users\Ryan\anaconda3\envs\curia-scraper\Lib\asyncio\futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "c:\Users\Ryan\anaconda3\envs\curia-scraper\Lib\asyncio\tasks.py", line 277, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "C:\Users\Ryan\AppData\Local\Temp\ipykernel_39716\1213939597.py", line 11, in process_listing
    async with BrowserManager(headless=settings.general.headless, downloads_path=str(output_dir)) as bm:
  File "C:\Users\Ryan\AppData\Local\Temp\ipykernel_39716\3767045468.py", l

In [12]:
# Step 12: Configuration & System Test
# Comprehensive test suite to validate configuration and system compatibility before scraping

import requests
from urllib.parse import urlparse

print("🧪 Testing CURIA scraper configuration and system compatibility...")

# Test 1: Verify configuration values
print(f"\n📋 Configuration Test:")
print(f"  ✓ Listing URL: {settings.site.listing_url}")
print(f"  ✓ Output Directory: {settings.general.output_dir}")
print(f"  ✓ Target Language: {settings.general.preferred_language}")
print(f"  ✓ Headless Mode: {settings.general.headless}")
print(f"  ✓ Throttle Delay: {settings.general.throttle_delay_ms}ms")

# Test 2: Check CURIA website connectivity
print(f"\n🌐 Connectivity Test:")
try:
    response = requests.get(settings.site.listing_url, timeout=10)
    if response.status_code == 200:
        print(f"  ✅ Successfully connected to {settings.site.listing_url}")
        print(f"  ✓ Response size: {len(response.content)} bytes")

        # Validate it's a proper CURIA page
        content = response.text.lower()
        curia_indicators = ["curia", "court of justice", "european union"]
        document_indicators = ["document", "judgment", "case", "docid"]

        has_curia = any(indicator in content for indicator in curia_indicators)
        has_docs = any(indicator in content for indicator in document_indicators)

        if has_curia and has_docs:
            print("  ✅ Confirmed: Valid CURIA search results page")
        elif has_curia:
            print("  ⚠️ CURIA page detected, but may not contain search results")
        else:
            print("  ⚠️ May not be a valid CURIA page")

    else:
        print(f"  ❌ HTTP Error: {response.status_code}")

except requests.exceptions.RequestException as e:
    print(f"  ❌ Connection failed: {e}")
    print("  💡 Check your internet connection and verify the URL in config.toml")

# Test 3: Verify output directory permissions
print(f"\n📁 Directory Test:")
try:
    output_dir = Path(settings.general.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Test file creation and deletion
    test_file = output_dir / "test.txt"
    test_file.write_text("test content")
    test_file.unlink()

    print(f"  ✅ Output directory is writable: {output_dir.absolute()}")

except Exception as e:
    print(f"  ❌ Directory issue: {e}")
    print("  💡 Check folder permissions or change output_dir in config.toml")

# Test 4: Check Playwright compatibility (async-safe)
print(f"\n🎭 Playwright Test:")
try:
    from playwright.async_api import async_playwright

    async def test_playwright():
        async with async_playwright() as p:
            browser_path = p.chromium.executable_path
            return browser_path

    # Test in current async context
    try:
        browser_path = await test_playwright()
        print(f"  ✅ Playwright browsers installed: {browser_path}")
    except Exception as e:
        if "NotImplementedError" in str(e):
            print("  ⚠️ Windows/Jupyter subprocess limitation detected")
            print("  💡 This may prevent browser automation in this environment")
            print("  💡 Use the standalone script instead (generated below)")
        else:
            print(f"  ❌ Playwright issue: {e}")
            print("  💡 Try running: !playwright install")

except ImportError:
    print("  ❌ Playwright not installed")
    print("  💡 Run the package installation cell above")

# Test 5: Windows subprocess compatibility check
print(f"\n🪟 Windows Subprocess Compatibility Test:")
try:
    # Test if we can create subprocesses in this environment
    async def test_subprocess():
        proc = await asyncio.create_subprocess_exec(
            "echo", "test",
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE
        )
        stdout, stderr = await proc.communicate()
        return proc.returncode == 0

    can_subprocess = await test_subprocess()
    if can_subprocess:
        print("  ✅ Subprocess creation works in this environment")
        print("  💡 Direct notebook execution should work")
    else:
        print("  ❌ Subprocess creation failed")

except NotImplementedError:
    print("  ❌ Windows/Jupyter subprocess limitation confirmed")
    print("  💡 This will prevent Playwright from working in Jupyter on Windows")
    print("  💡 Recommended: Use the standalone script generated below")
except Exception as e:
    print(f"  ⚠️ Subprocess test error: {e}")

# Test 6: Configuration file validation
print(f"\n⚙️ Configuration File Test:")
try:
    config_path = Path("config.toml")
    if config_path.exists():
        print(f"  ✅ config.toml exists: {config_path.absolute()}")

        # Validate required sections
        config_data = toml.load(config_path)
        required_sections = ["general", "site"]
        for section in required_sections:
            if section in config_data:
                print(f"  ✓ Section '{section}' present")
            else:
                print(f"  ❌ Missing section '{section}'")
    else:
        print("  ❌ config.toml not found")

except Exception as e:
    print(f"  ⚠️ Configuration test error: {e}")

print(f"\n🎯 System Diagnosis Complete!")
print(f"💡 If subprocess test failed, use the standalone script generated in the next cell")
print(f"💡 If connectivity failed, check your internet connection and config.toml URL")
print(f"💡 Green checkmarks ✅ indicate ready components")

🧪 Testing CURIA scraper configuration and system compatibility...

📋 Configuration Test:
  ✓ Listing URL: https://curia.europa.eu/juris/documents.jsf?nat=or&mat=or&pcs=Oor&jur=C%2CT%2CF&for=&jge=&dates=%2524type%253Dpro%2524mode%253D1M%2524from%253D2025.09.27%2524to%253D2025.10.27&language=en&pro=&cit=none%252CC%252CCJ%252CR%252C2008E%252C%252C%252C%252C%252C%252C%252C%252C%252C%252Ctrue%252Cfalse%252Cfalse&oqp=&td=%24mode%3D1M%24from%3D2025.09.27%24to%3D2025.10.27%3B%3BPUB%3BPUB1%2CPUB2%2CPUB4%2CPUB7%2CPUB3%2CPUB8%2CPUB5%2CPUB6%3B%3B%3B%3BORDALL&avg=&lgrec=en&page=1&lg=EN%252C%252Btrue%252Cfalse&cid=7416927
  ✓ Output Directory: ./output
  ✓ Target Language: en
  ✓ Headless Mode: True
  ✓ Throttle Delay: 2000ms

🌐 Connectivity Test:
  ✅ Successfully connected to https://curia.europa.eu/juris/documents.jsf?nat=or&mat=or&pcs=Oor&jur=C%2CT%2CF&for=&jge=&dates=%2524type%253Dpro%2524mode%253D1M%2524from%253D2025.09.27%2524to%253D2025.10.27&language=en&pro=&cit=none%252CC%252CCJ%252CR%252C2

# Step 13: Windows Compatibility Solution
If you're experiencing Windows/Jupyter subprocess issues, this cell generates a standalone Python script that bypasses all limitations.

In [13]:
# Step 13: Generate Standalone Python Script (Windows Workaround)
# Creates a self-contained Python script that bypasses Windows/Jupyter subprocess limitations

print("🛠️ Creating standalone Python script to bypass Windows/Jupyter limitations...")

script_content = '''#!/usr/bin/env python
"""
CURIA Scraper - Standalone Version
==================================

This script bypasses Windows/Jupyter subprocess limitations by running
as a standalone Python application. It includes all the functionality
from the Jupyter notebook in a single executable file.

Usage:
    python curia_scraper_standalone.py

Requirements:
    - Python 3.8+
    - All packages from requirements (automatically checked)
    - config.toml file (automatically created if missing)

Output:
    - PDF files: curia-doc-{docid}.pdf
    - Metadata files: doc_{index}.json
    - All files saved to ./output directory
"""

import asyncio
import json
import logging
import re
import toml
from pathlib import Path
from datetime import datetime
from pydantic import BaseModel
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright, Page

# Configuration models
class GeneralSettings(BaseModel):
    output_dir: str
    checkpoint_file: str
    headless: bool
    throttle_delay_ms: int
    preferred_language: str = "EN"

class SiteSettings(BaseModel):
    listing_url: str
    document_content_selector: str = "body"
    start_print_button_text: str = "Start Printing"
    document_link_selector: str = "div#docHtml a[href*='document.jsf']"
    next_page_selector: str = "a[title='Next Page']"

class Settings(BaseModel):
    general: GeneralSettings
    site: SiteSettings

# Load or create configuration
def load_config():
    """Load configuration from config.toml or create default"""
    config_path = Path("config.toml")

    if not config_path.exists():
        print("📝 Creating default config.toml...")
        default_config = {
            "general": {
                "output_dir": "./output",
                "checkpoint_file": "./checkpoint.json",
                "headless": True,  # Headless for standalone execution
                "throttle_delay_ms": 2000,
                "preferred_language": "EN"
            },
            "site": {
                "listing_url": "https://curia.europa.eu/juris/recherche.jsf?language=en",
                "document_content_selector": "body",
                "start_print_button_text": "Start Printing",
                "document_link_selector": "div#docHtml a[href*='document.jsf']",
                "next_page_selector": "a[title='Next Page']"
            }
        }
        with open(config_path, "w") as f:
            toml.dump(default_config, f)
        print("✅ Created config.toml with CURIA defaults")

    return Settings(**toml.load("config.toml"))

# Load configuration
settings = load_config()

# Logger setup
logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s] %(levelname)s %(message)s"
)
logger = logging.getLogger("curia_scraper")

# Browser manager
class BrowserManager:
    """Manages Playwright browser instances with proper cleanup"""

    def __init__(self, headless: bool = True, downloads_path: str = None):
        self.headless = headless
        self.downloads_path = downloads_path

    async def __aenter__(self):
        self.playwright = await async_playwright().start()
        self.browser = await self.playwright.chromium.launch(headless=self.headless)
        return self

    async def __aexit__(self, exc_type, exc, tb):
        if self.browser:
            await self.browser.close()
        await self.playwright.stop()

    async def new_page(self):
        context_args = {}
        if self.downloads_path:
            context_args["accept_downloads"] = True
            context_args["downloads_path"] = self.downloads_path
        context = await self.browser.new_context(**context_args)
        page = await context.new_page()
        return page

# Storage functions
def save_json(content: dict, idx: int):
    """Save document metadata to JSON file"""
    output_dir = Path(settings.general.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    filename = output_dir / f"doc_{idx}.json"

    with open(filename, "w", encoding="utf-8") as f:
        json.dump(content, f, ensure_ascii=False, indent=2)

    logger.info(f"Saved JSON metadata: {filename}")

# CURIA-specific parser function
def parse_curia_document(html: str, url: str, doc_id: str = None) -> dict:
    """Parse CURIA legal document HTML and extract structured metadata"""
    soup = BeautifulSoup(html, "html.parser")

    # Extract case number
    case_number = None
    text_content = soup.get_text()
    case_matches = re.findall(r'Case\\s+[A-Z]-\\d+/\\d+', text_content, re.I)
    if case_matches:
        case_number = case_matches[0]

    # Extract title
    title = None
    for selector in ["title", "h1", "h2"]:
        tag = soup.select_one(selector)
        if tag and tag.text.strip():
            title = re.sub(r'^(CURIA\\s*-\\s*)?', '', tag.text.strip(), flags=re.I)
            break

    # Extract language from URL
    language = None
    lang_match = re.search(r'doclang=([A-Z]{2})', url)
    if lang_match:
        language = lang_match.group(1)

    return {
        "doc_id": doc_id,
        "url": url,
        "language": language,
        "case_number": case_number,
        "title": title,
        "html_length": len(html),
        "extracted_at": str(datetime.now()),
        "html": html
    }

# Document processor
async def process_document(page: Page, doc_link: str, idx: int):
    """Process a single CURIA document"""
    logger.info(f"[{idx}] Processing CURIA document: {doc_link}")

    await page.goto(doc_link)
    await page.wait_for_load_state("networkidle")

    doc_id_match = re.search(r'docid=(\\d+)', doc_link)
    doc_id = doc_id_match.group(1) if doc_id_match else str(idx)

    log_entry = {
        "idx": idx,
        "doc_id": doc_id,
        "url": doc_link,
        "used_path": None,
        "filename": None
    }

    # Look for print button
    print_button_selectors = [
        "input[value*='Start Printing']",
        "button:has-text('Start Printing')",
        "a:has-text('Start Printing')",
        "input[type='submit'][value*='Print']"
    ]

    print_btn = None
    for selector in print_button_selectors:
        try:
            print_btn = await page.wait_for_selector(selector, timeout=3000)
            if print_btn:
                logger.info(f"[{idx}] Found print button: {selector}")
                break
        except:
            continue

    if print_btn:
        # Generate PDF from print view
        await print_btn.click()
        await page.wait_for_load_state("networkidle")
        await page.wait_for_timeout(2000)

        filename = f"curia-doc-{doc_id}.pdf"
        filepath = Path(settings.general.output_dir) / filename

        await page.pdf(
            path=str(filepath),
            format="A4",
            print_background=True,
            margin={"top": "1cm", "right": "1cm", "bottom": "1cm", "left": "1cm"}
        )
        logger.info(f"[{idx}] Saved PDF: {filepath}")

        log_entry["used_path"] = "page_pdf"
        log_entry["filename"] = filename
        save_json(log_entry, idx)
    else:
        # Fallback to HTML parsing
        logger.info(f"[{idx}] No print button found; saving HTML")
        html_content = await page.content()
        parsed = parse_curia_document(html_content, doc_link, doc_id)
        parsed["filename"] = None
        parsed["used_path"] = "html_parse"
        save_json(parsed, idx)

# Main listing processor
async def process_listing():
    """Main CURIA listing processor"""
    output_dir = Path(settings.general.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    async with BrowserManager(headless=settings.general.headless, downloads_path=str(output_dir)) as bm:
        page = await bm.new_page()

        logger.info(f"Navigating to: {settings.site.listing_url}")
        await page.goto(settings.site.listing_url)
        await page.wait_for_load_state("networkidle")

        idx = 0
        page_num = 1

        while True:
            logger.info(f"Processing page {page_num}...")

            # Find document links
            document_links = []
            link_selectors = [
                "div#docHtml a[href*='document.jsf']",
                "a[href*='document.jsf']",
                "a[href*='docid=']"
            ]

            for selector in link_selectors:
                try:
                    links = await page.query_selector_all(selector)
                    if links:
                        logger.info(f"Found {len(links)} links with: {selector}")

                        for link in links:
                            href = await link.get_attribute("href")
                            if href:
                                if href.startswith("http"):
                                    full_url = href
                                else:
                                    full_url = f"https://curia.europa.eu{href}"

                                # Filter by language
                                if settings.general.preferred_language:
                                    if f"doclang={settings.general.preferred_language}" in full_url:
                                        document_links.append(full_url)
                                    elif "doclang=" not in full_url:
                                        full_url += f"&doclang={settings.general.preferred_language}"
                                        document_links.append(full_url)
                                else:
                                    document_links.append(full_url)
                        break
                except Exception as e:
                    logger.warning(f"Selector failed: {selector} - {e}")

            if not document_links:
                logger.warning("No document links found")
                break

            # Remove duplicates
            unique_links = []
            seen = set()
            for link in document_links:
                doc_match = re.search(r'docid=(\\d+)', link)
                if doc_match:
                    doc_id = doc_match.group(1)
                    if doc_id not in seen:
                        seen.add(doc_id)
                        unique_links.append(link)

            logger.info(f"Processing {len(unique_links)} unique documents")

            # Process each document
            for doc_link in unique_links:
                idx += 1
                try:
                    await process_document(page, doc_link, idx)
                    await page.wait_for_timeout(1000)
                except Exception as e:
                    logger.error(f"[{idx}] Error: {e}")

            # Look for next page
            next_btn = None
            next_selectors = ["a[title*='Next']", "a:has-text('Next')", "a:has-text('»')"]

            for selector in next_selectors:
                try:
                    next_btn = await page.query_selector(selector)
                    if next_btn:
                        is_disabled = await next_btn.get_attribute("disabled")
                        if not is_disabled:
                            break
                        next_btn = None
                except:
                    continue

            if not next_btn:
                logger.info("No next page found")
                break

            await next_btn.click()
            await page.wait_for_load_state("networkidle")
            await page.wait_for_timeout(settings.general.throttle_delay_ms)
            page_num += 1

        logger.info(f"Completed! Processed {idx} documents across {page_num} pages")

# Main execution
async def main():
    """Main entry point"""
    print("🚀 CURIA Scraper - Standalone Version")
    print("="*50)
    print(f"📁 Output: {settings.general.output_dir}")
    print(f"🌐 URL: {settings.site.listing_url}")
    print(f"🌍 Language: {settings.general.preferred_language}")
    print(f"👁️  Headless: {settings.general.headless}")
    print("="*50)

    try:
        await process_listing()
        print("\\n✅ Scraping completed successfully!")
        print(f"📂 Check {settings.general.output_dir} for results")

    except Exception as e:
        print(f"\\n❌ Error: {e}")
        import traceback
        traceback.print_exc()

if __name__ == "__main__":
    print("🎯 Starting CURIA document scraper...")
    asyncio.run(main())
'''

# Write the script to file
script_path = Path("curia_scraper_standalone.py")
script_path.write_text(script_content, encoding="utf-8")

print(f"✅ Created standalone script: {script_path.absolute()}")
print(f"\n🚀 To run the standalone scraper:")
print(f"   1. Open terminal/command prompt")
print(f"   2. Navigate to: {Path.cwd()}")
print(f"   3. Run: python curia_scraper_standalone.py")
print(f"\n💡 Benefits of standalone script:")
print(f"   ✓ Bypasses Windows/Jupyter subprocess limitations")
print(f"   ✓ Runs in headless mode by default (faster)")
print(f"   ✓ Better error handling and logging")
print(f"   ✓ Self-contained with all dependencies")
print(f"   ✓ Perfect for production/automated runs")

print(f"\n📋 The script will:")
print(f"   • Auto-create config.toml if missing")
print(f"   • Process all pages of CURIA search results")
print(f"   • Download PDFs when 'Start Printing' buttons available")
print(f"   • Extract structured metadata for all documents")
print(f"   • Save everything to ./output directory")

print(f"\n🔧 Configuration:")
print(f"   Edit config.toml to customize:")
print(f"   - listing_url: Your CURIA search results page")
print(f"   - preferred_language: EN, FR, DE, etc.")
print(f"   - output_dir: Where to save files")
print(f"   - headless: true/false for browser visibility")

🛠️ Creating standalone Python script to bypass Windows/Jupyter limitations...
✅ Created standalone script: d:\Repos\Active\Client\Pandektes\scraper\curia_scraper_standalone.py

🚀 To run the standalone scraper:
   1. Open terminal/command prompt
   2. Navigate to: d:\Repos\Active\Client\Pandektes\scraper
   3. Run: python curia_scraper_standalone.py

💡 Benefits of standalone script:
   ✓ Bypasses Windows/Jupyter subprocess limitations
   ✓ Runs in headless mode by default (faster)
   ✓ Better error handling and logging
   ✓ Self-contained with all dependencies
   ✓ Perfect for production/automated runs

📋 The script will:
   • Auto-create config.toml if missing
   • Process all pages of CURIA search results
   • Download PDFs when 'Start Printing' buttons available
   • Extract structured metadata for all documents
   • Save everything to ./output directory

🔧 Configuration:
   Edit config.toml to customize:
   - listing_url: Your CURIA search results page
   - preferred_language: EN, F