### 1. Scrape and Extract Website Data
A web scraper with GPU detection, JavaScript rendering (using Pyppeteer), and content extraction.



# DESYContentProcessor (Part 2)
## Processes and extracts content from the mapped URLs

## 📊 Comparison of Scraping Packages for Mixed HTTPS and JavaScript Pages (Async, with Wait Functions)

| Feature | `requests` + `BeautifulSoup` | `aiohttp` + `BeautifulSoup` | `httpx` + `selectolax` | `Scrapy` | `Playwright (Python)` | `Pyppeteer` |
|---------|------------------------------|------------------------------|------------------------|----------|------------------------|-------------|
| 🏗️ Created by | Community (Python) | Community (Python) | Community (Python) | Community (Python) | Microsoft | Community (unofficial, port of Puppeteer) |
| 🛠️ Language | Python | Python | Python | Python | Python | Python |
| 🌐 Browser Support | N/A (HTTP/HTTPS) | N/A (HTTP/HTTPS) | N/A (HTTP/HTTPS) | N/A (HTTP/HTTPS) | **Chromium, Firefox, WebKit** (Safari) | Only **Chromium** |
| 📦 Install | `pip install requests beautifulsoup4` | `pip install aiohttp beautifulsoup4` | `pip install httpx selectolax` | `pip install scrapy` | `pip install playwright` | `pip install pyppeteer` |
| 🧠 API Complexity | Simple and intuitive | Requires async programming knowledge | Requires async programming knowledge | Complex (framework, but well-organized) | More powerful, a bit more complex | Simple, Puppeteer-style |
| ⏱️ Speed | Moderate for small-scale scraping | 🚀 Very fast (async requests) | 🚀 Very fast (async requests) | 🚀 Very fast (but higher overhead) | ✅ Fast (even with JavaScript pages) | ✅ Moderate (similar to Puppeteer) |
| 📦 Resource Usage | Moderate | Low (async, non-blocking) | Low (async, non-blocking) | High (especially for large projects) | Moderate | Moderate |
| 🪟 Handle popups / iframes | ❌ Not applicable | ❌ Not applicable | ❌ Not applicable | ❌ Not applicable | ✅ Excellent support | ✅ Basic support |
| 📱 Mobile emulation | ❌ Not applicable | ❌ Not applicable | ❌ Not applicable | ❌ Not applicable | ✅ Easy and built-in | ✅ Possible with extra setup |
| 🌍 Geolocation / Permissions | ❌ Not applicable | ❌ Not applicable | ❌ Not applicable | ❌ Not applicable | ✅ Built-in features | ❌ Not directly supported |
| 🔍 Code generation tool | ❌ Not available | ❌ Not available | ❌ Not available | ⚠️ Templates only (`scrapy startproject`, `scrapy genspider`) | ✅ `playwright codegen` tool | ❌ Not available |
| 🚀 Performance | Good for small to medium-scale tasks | 🚀 Excellent for scraping many pages | 🚀 Excellent for scraping many pages | 🚀 Very good for large projects | ✅ Great for both static and JS-heavy sites | ✅ Good for JS-heavy sites |
| 🔒 Anti-bot evasion | ❌ Basic headers for evasion | ❌ Basic headers for evasion | ❌ Basic headers for evasion | ❌ Basic headers for evasion | ✅ Better stealth for JS-heavy sites | ❌ Easily detected (headless Chromium) |
| 🛑 Project Status | ✅ Actively maintained | ✅ Actively maintained | ✅ Actively maintained | ✅ Actively maintained | ✅ Actively developed & supported | ❌ No longer actively maintained |

---

### ✅ Pros and ❌ Cons Summary

#### `requests` + `BeautifulSoup`
- ✅ **Very simple** and **easy to use** for beginners
- ✅ Well-documented and widely used
- ❌ **Not fast** for large-scale scraping (can be slow with 2 million URLs)
- ❌ Does **not handle JS** or dynamic content
- ❌ Does not support async handling or waiting for many pages to load

#### `aiohttp` + `BeautifulSoup`
- ✅ **Async** and **very fast** when scraping multiple pages concurrently
- ✅ **Low resource usage** (non-blocking)
- ❌ Requires **async** programming knowledge
- ❌ Does **not handle JS** or dynamic content
- ❌ Does not support advanced features like popups, mobile emulation, or permissions


#### `Scrapy`
- ✅ Best for **large-scale** scraping projects (200,000+ URLs)
- ✅ **Built-in crawling** and **link-following** features
- ✅ Supports **pipelines** for data cleaning and processing
- ✅ **Highly efficient** for handling huge numbers of URLs concurrently (async)
- ❌ Overkill for simple scraping tasks
- ❌ Does **not handle JS** or dynamic content

#### `Playwright (Python)`
- ✅ Handles **JavaScript-heavy** pages well and supports **async** functions
- ✅ **Fast** and reliable even with dynamic content
- ✅ **Multiple browser support** (Chromium, Firefox, and WebKit, the engine behind Safari)
- ✅ **Built-in auto-waiting** for elements, reducing issues with timing
- ❌ **Slightly more complex** compared to requests-based solutions
- ❌ **Slower** than aiohttp or httpx for static pages but still fast enough for mixed sites

#### `Pyppeteer`
- ✅ Works well with **JavaScript-heavy sites** using **Chromium** browser
- ✅ **Simple** API, **similar to Puppeteer**
- ✅ Good for **JavaScript rendering** when simple browser automation is needed
- ❌ **Slower** than Playwright and **limited to Chromium**
- ❌ Not actively maintained anymore (may have issues with recent updates)
- ❌ **Basic anti-bot evasion**, less stealthy compared to Playwright
- ❌ **No native support** for multiple browsers like Playwright

---

### 📌 **Best Methods for Scraping 2,000,000 Mixed HTTPS + JavaScript URLs:**

#### **1. For Static HTTPS Pages:**
If the pages are **static** (not requiring JS to render), the best approach would be:
- **`aiohttp` + `BeautifulSoup`** or **`httpx` + `selectolax`** for **speed** and **low resource usage**. These tools handle many pages concurrently using **async**.
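
The async approach described above usually comes down to running many requests at once while capping how many are in flight. The following is a minimal sketch of that pattern using only the standard library. `fetch_all`, `fake_fetch`, and the `max_concurrency` parameter are hypothetical names introduced for illustration; in a real scraper `fake_fetch` would be replaced by an `aiohttp` or `httpx` request.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=10):
    """Run fetch(url) for every URL with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:
                # Return the error instead of raising so one bad URL
                # does not abort the whole batch.
                return url, exc

    # gather preserves the order of the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


async def fake_fetch(url):
    """Stand-in for a real HTTP request (hypothetical, no network access)."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"
```

A call such as `asyncio.run(fetch_all(urls, fake_fetch, max_concurrency=50))` returns a list of `(url, result)` pairs, where `result` is either the page body or the exception that was raised for that URL.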

#### **2. For JavaScript-Rendered Pages:**
- **`Playwright (Python)`** is the **best option** to handle **both static and dynamic pages** for **2,000,000 URLs**. It combines **async support**, **browser automation**, and **dynamic content handling** in one package.
- **`Pyppeteer`** remains an option if you prefer a **simple, Puppeteer-style API** and only need **Chromium**, but keep in mind that it is no longer actively maintained. **Playwright** is the better choice if you want **multiple browser support** and more **advanced features** (like handling popups or mobile emulation).
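
As a rough illustration of JavaScript rendering with Playwright's async API, the sketch below loads a page in headless Chromium, waits for network activity to settle and for a selector to appear, and returns the rendered HTML. The function name `render_page`, the default selector, and the timeout value are assumptions chosen for the example, not settings taken from this project.

```python
async def render_page(url: str, wait_selector: str = "body", timeout_ms: int = 30_000) -> str:
    """Return the rendered HTML of url after its JavaScript has run."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            # "networkidle" waits until the page has stopped making requests
            await page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            # Block until the element we care about is present and visible
            await page.wait_for_selector(wait_selector, timeout=timeout_ms)
            return await page.content()
        finally:
            await browser.close()
```

Running it requires `pip install playwright` followed by `playwright install chromium`; a call would look like `asyncio.run(render_page("https://example.org"))`. Launching a browser per page is expensive, so a scraper working at scale would normally keep one browser open and reuse it across pages.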

#### **3. For Large-Scale Crawling:**
- If your project is primarily **large-scale crawling**, **`Scrapy`** handles a workload of **2 million URLs** efficiently through concurrent requests, with built-in **crawling** and **link-following**. However, **`Scrapy`** does not render **JavaScript** on its own, so you will need to pair it with **Playwright** for dynamic content.

---

### ⚡ **Final Thought:**
- **Playwright** is the **best choice** for handling **both static and dynamic pages** at scale, especially when scraping a mix of **2,000,000 URLs**. It can easily handle **JavaScript-heavy sites** and has advanced features that make it robust for large projects.
- **Pyppeteer** is a **simpler alternative** for JavaScript rendering, but it's **limited to Chromium** and is no longer actively maintained. If you need **more control and performance**, **Playwright** is the better tool.
- For **pure static page scraping**, **`aiohttp`** or **`httpx`** will be the fastest and most efficient tools.
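
One way to combine these recommendations is to fetch every page with the fast static client first and send only the pages that look like empty JavaScript shells to the browser. The sketch below shows a possible heuristic for that decision using the standard library's HTML parser. The names `_PageStats` and `needs_js_rendering` and the `min_text_chars` threshold are hypothetical; the default `min_links=3` is an assumption that simply mirrors the value of the `JS_DETECTION_MIN_LINKS` constant defined in the class further down.

```python
from html.parser import HTMLParser


class _PageStats(HTMLParser):
    """Count links and visible text characters in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style bodies; count only visible text
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def needs_js_rendering(html: str, min_links: int = 3, min_text_chars: int = 200) -> bool:
    """Guess whether a statically fetched page is an empty JavaScript shell."""
    stats = _PageStats()
    stats.feed(html)
    stats.close()
    # A page with few links and little visible text was probably not rendered yet
    return stats.links < min_links and stats.text_chars < min_text_chars
```

Pages for which `needs_js_rendering` returns `True` would be queued for Playwright, while the rest are parsed directly. The thresholds are starting points and should be tuned against a sample of real pages.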

# Size-based and structure-based chunking


import json
import os
import asyncio
import logging
import hashlib
import re
import unicodedata
import random
from typing import Dict, List, Optional, Tuple, Set
from bs4 import BeautifulSoup
from langchain.schema import Document
from tqdm import tqdm
from datetime import datetime
from urllib.parse import urlparse
import aiohttp
import ssl
import time
import nest_asyncio
from langdetect import detect
from playwright.async_api import async_playwright
import psutil

# Remove all existing handlers
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)
    
logging.basicConfig(
    filename='desy_scraper.log',       # ✅ Your log file name
    filemode='w',                      # 'w' to overwrite each run, 'a' to append
    level=logging.DEBUG,               # Set to DEBUG to capture everything
    format='%(asctime)s - %(levelname)s - %(message)s',
    encoding='utf-8'
)

#logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
#logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Set directories
user_data_dir = "/afs/desy.de/user/t/taheri/scratch/cache/pyppeteer_cache"
download_dir = "/afs/desy.de/user/t/taheri/scratch/cache/pyppeteer_downloads"
os.makedirs(user_data_dir, exist_ok=True)
os.makedirs(download_dir, exist_ok=True)

# Apply nest_asyncio for Jupyter compatibility
nest_asyncio.apply()


        # Compile individual patterns for unwanted tags

class DESYContentProcessor:
    # Constants
    MIN_CHUNK_CHARS = 30      # Main threshold for final chunks
    MIN_INITIAL_CHARS = 20    # Initial content validation
    MIN_TEXT_SAMPLE_LENGTH = 50  # Reduced from 200
    MAX_CONTENT_AREA_SIZE = 50000
    DEFAULT_TIMEOUT = 300
    JS_DETECTION_MIN_LINKS = 3
    NON_HTML_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.gif', '.bmp', '.svg', '.pdf', 
                          '.mp4', '.mp3', '.avi', '.mov', '.wmv', '.zip', '.tar', 
                          '.gz', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx', ".xml"}

    
    def __init__(
        self,
        max_depth: int,
        content_tags: List[str] = [
            "p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "ul", "ol",
            "td", "th", "tr", "table", "caption", "dt", "dd", "span",
            "article", "section", "main", "div",
            "div.teaser-text", "div.content", "div.text-block",
            "div.publication-item", "div.news-item", "div.portlet-body",
            "div.event-details", "div.indico-content", "div.publication-list",
            "div.event-description", "div.news-content", "div.status-report",
            "div.status", "div.monitor", "div.experiment", "div.results",
            "p[id]", "table.i-table", "div.timetable"
        ],
        excluded_keywords: List[str] = ["cookie", "privacy", "copyright", "disclaimer", "login", "password"],
        chunk_size: int = 1000,
        chunk_overlap: int = 200,
        batch_size: int = 10,
        timeout: int = 300,  # Increased from 30
        js_wait_time: int = 10000,  # Increased from 4000
        js_scroll: bool = True
    ):
        self.browser = None
        self.context = None
        #self.max_workers: int = 5 #48
        #self.max_workers = max_workers

        # Fall back to 1 because psutil.cpu_count() can return None on some platforms
        logical_cpus = psutil.cpu_count(logical=True) or 1
        total_ram_gb = psutil.virtual_memory().total / 1e9
        
        # 🛠 Tune max_workers based on CPU and RAM
        self.max_workers = max(1, min(
            logical_cpus * 2,                          # Concurrency factor (2x CPUs)
            int((total_ram_gb // 2) * logical_cpus),   # Memory-aware cap (0 below 2 GB, hence the max(1, ...))
            200                                        # Optional hard ceiling to avoid overload
        ))
        
        # 🎭 Adjust JS rendering separately (lighter by default)
        js_slots = max(4, self.max_workers // 6)  # JS rendering is heavier per task
       
        self.js_semaphore = asyncio.Semaphore(js_slots)
#        self.js_semaphore = asyncio.Semaphore(2) # ⚖️ Limits concurrent JS rendering tasks (e.g., n = 5 or 10)



        self.browser_lock = asyncio.Lock()       # 🔐 Protects Playwright init and page creation
        self.session_lock = asyncio.Lock()       # 🔐 Protects aiohttp session creation
        self.url_lock = asyncio.Lock()           # 🔐 Protects shared data: processed_urls, url_to_documents_map, error_urls, redirected_urls
        



        self.cookie_text_patterns = [
            r'cookie[- ]?banner',
            r'cookie[- ]?consent',
            r'diese website verwendet cookies',
            r'we use cookies',
            r'accept all cookies',
            r'cookie einstellungen',
            r'cookie policy',
            r'consent to cookies',
            r'diese seite nutzt cookies',
            r'cookie notice',
            r'cookie preferences',
            r'cookie declaration',
            r'cookie information',
            r'cookie settings',
            r'cookie usage'
        ]
        
        # CRITICAL: These should be applied first as they have the biggest impact
        self.critical_patterns = [
            # Scripts and styles - highest priority
            re.compile(r'<(script|style)[^>]*>.*?</\1>', re.DOTALL | re.IGNORECASE),
            
            # Navigation structures - very high impact
            re.compile(r'<nav\b[^>]*>.*?</nav>', re.DOTALL | re.IGNORECASE),
            re.compile(r'<(?:header|footer)\b[^>]*>.*?</(?:header|footer)>', re.DOTALL | re.IGNORECASE),
            
            # Forms - high impact, often contain unwanted elements
            re.compile(r'<form\b[^>]*>.*?</form>', re.DOTALL | re.IGNORECASE),
            
            # Large container elements by ID - high impact
            re.compile(r'<(?:div|section|nav|ul|header)\b[^>]*id\s*=\s*[\'"](?:footer|overall|wrapper|icons|search_icon|phone_icon|close_gcs|mobile_menu_header|mobile_menu|mobile_dropdown|mobile_loading|mobile_dropdown_content|top|logoarea|topleft|topright|topmenu|menu|main_menu|header|leftmenu|rightmenu)\b[^\'\"]*[\'"][^>]*>.*?</(?:div|section|nav|ul|header)>', re.DOTALL | re.IGNORECASE),
            
            # Cookie removal patterns - ADDED FROM CODE 1
            re.compile(r'<(div|section|aside|footer)[^>]*id=["\']?[^"\'>]*\b(cookie|consent|privacy|banner|notice|preferences)\b[^"\'>]*["\']?[^>]*>.*?</\1>', re.DOTALL | re.IGNORECASE),
            re.compile(r'<(div|section|aside|footer)[^>]*class=["\'][^"\'>]*\b(cookie|consent|banner|popup|notice|preferences|privacy|cookie-consent-wrapper|cookie-bar-wrapper)[^"\'>]*["\'][^>]*>.*?</\1>', re.DOTALL | re.IGNORECASE),
            re.compile(r'<(div|section|aside|footer)[^>]*style=["\'][^"\']*display\s*:\s*none[^"\']*["\'][^>]*>.*?</\1>', re.DOTALL | re.IGNORECASE),
            re.compile(r'<[^>]+class=["\'][^"\'>]*\bcookie-bar__inner\b[^"\'>]*["\'][^>]*>.*?</[^>]+>', re.DOTALL | re.IGNORECASE),
            re.compile(r'<!--\s*Cookie\s+Bar\s*-->.*?<!--\s*End\s+Cookie\s+Bar\s*-->', re.DOTALL | re.IGNORECASE),
            re.compile(r'<div[^>]*id=["\']?cookie-bar["\']?[^>]*>.*?</div>', re.DOTALL | re.IGNORECASE),
            re.compile(r'<nav\b[^>]*id\s*=\s*[\'"](?:leftmenu|topmenu|menu)[^\'\"]*[\'"][^>]*>.*?</nav>', re.DOTALL | re.IGNORECASE),
            re.compile(r'<ul\b[^>]*id\s*=\s*[\'"](?:main_menu|menu)[^\'\"]*[\'"][^>]*>.*?</ul>', re.DOTALL | re.IGNORECASE),
            re.compile(r'<li\b[^>]*class\s*=\s*[\'"][^\'\"]*\b(?:inactive|active|ZMSFolder\d*|ZMSDocument\d*)\b[^\'\"]*[\'"][^>]*>.*?</li>', re.DOTALL | re.IGNORECASE),
        ]
        
        # HIGH: Important patterns for navigation and UI elements
        self.high_priority_patterns = [
            # Navigation by class - high frequency
            re.compile(r'<(?:div|ul|ol|section)\b[^>]*(?:class|id)\s*=\s*[\'"][^\'\"]*\b(?:breadcrumb|bread[-_]?nav|nav|navigation|tagline|menu[-_]?bar|top[-_]?nav|site[-_]?nav|main[-_]?navigation|nav[-_]?container|sub[-_]?nav|menu[-_]?container|menu|sub[-_]?menu|nav[-_]?menu|quick[-_]?nav|quick[-_]?links)\b[^\'\"]*[\'"][^>]*>.*?</(?:div|ul|ol|section)>', re.DOTALL | re.IGNORECASE),
            re.compile(r'<(?:div|ul|ol|section|li)\b[^>]*(?:class|id)\s*=\s*[\'"][^\'\"]*\b(?:breadcrumb|bread[-_]?nav|nav|navigation|tagline|menu[-_]?bar|top[-_]?nav|site[-_]?nav|main[-_]?navigation|nav[-_]?container|sub[-_]?nav|menu[-_]?container|menu|sub[-_]?menu|nav[-_]?menu|quick[-_]?nav|quick[-_]?links|topright[-_]?button|wrapper)\b[^\'\"]*[\'"][^>]*>.*?</(?:div|ul|ol|section|li)>', re.DOTALL | re.IGNORECASE),
            
            # Header and footer containers
            re.compile(r'<(?:header|footer)\b[^>]*(?:id\s*=\s*[\'"]header[\'"])?[^>]*>.*?</(?:header|footer)>', re.DOTALL | re.IGNORECASE),
            re.compile(r'<div\b[^>]*(?:class|id)\s*=\s*[\'"][^\'\"]*\b(?:header|footer|site[-_]?footer|page[-_]?footer|site[-_]?header|nav[-_]?footer|group[-_]?header|banner[-_]?header|wrapper)\b[^\'\"]*[\'"][^>]*>.*?</div>', re.DOTALL | re.IGNORECASE),
            
            # Cookie and consent banners - high user impact
            re.compile(r'<(?:div|section|aside)\b[^>]*(?:class|id)\s*=\s*[\'"][^\'\"]*\b(?:cookies?|consent|banner|popup|modal|cookie[-_]?notices?|cookie[-_]?consents?|cookie[-_]?policys?|gdpr|privacy[-_]?banner)\b[^\'\"]*[\'"][^>]*>.*?</(?:div|section|aside)>', re.DOTALL | re.IGNORECASE),
            
            # Sidebar and widget areas
            re.compile(r'<(?:div|aside|section)\b[^>]*(?:class|id)\s*=\s*[\'"][^\'\"]*\b(?:sidebar|left|right|side[-_]?nav|widget[-_]?area|nav[-_]?panel)\b[^\'\"]*[\'"][^>]*>.*?</(?:div|aside|section)>', re.DOTALL | re.IGNORECASE),
        ]
        
        # MEDIUM: Specific element patterns
        self.medium_priority_patterns = [
            # Search elements
            re.compile(r'<div\b[^>]*(?:class|id)\s*=\s*[\'"][^\'\"]*\b(?:search|search[-_]?form|search[-_]?box|search[-_]?bar|cse[-_]?search[-_]?form)\b[^\'\"]*[\'"][^>]*>.*?</div>', re.DOTALL | re.IGNORECASE),
            
            # Mobile elements
            re.compile(r'<(?:div|nav|ul)\b[^>]*(?:class|id)\s*=\s*[\'"][^\'\"]*\bmobile(?:[-_]?(?:nav|menu|back|toggle|dropdown|loading))?\b[^\'\"]*[\'"][^>]*>.*?</(?:div|nav|ul)>', re.DOTALL | re.IGNORECASE),
            
            # Language switchers
            re.compile(r'<(?:div|ul|select)\b[^>]*(?:class|id)\s*=\s*[\'"][^\'\"]*\b(?:lang|language|lang[-_]?switch)\b[^\'\"]*[\'"][^>]*>.*?</(?:div|ul|select)>', re.DOTALL | re.IGNORECASE),
            
            # Overlays and modals
            re.compile(r'<(?:div|section)\b[^>]*(?:class|id)\s*=\s*[\'"][^\'\"]*\b(?:overlay|modal[-_]?overlay|popup[-_]?overlay)\b[^\'\"]*[\'"][^>]*>.*?</(?:div|section)>', re.DOTALL | re.IGNORECASE),
            
            # Buttons and UI elements
            re.compile(r'<(?:button|input|div)\b[^>]*(?:class|id)\s*=\s*[\'"][^\'\"]*\b(?:btns?|buttons?|btt|topright[-_]?button)\b[^\'\"]*[\'"][^>]*(?:>.*?</(?:button|input|div)>|/??>)', re.DOTALL | re.IGNORECASE),
          
            # Remove DOI links
            re.compile(r'<a\b[^>]*href\s*=\s*[\'"][^\'\"]*\b(?:doi\.org|journals\.aps\.org|dx\.doi\.org|DOI:)[^\'\"]*[\'"][^>]*>.*?</a>', re.DOTALL | re.IGNORECASE),
            
            # Wrapper and container elements that contain navigation - FROM CODE 1
            re.compile(r'<(?:div|section)\b[^>]*(?:class|id)\s*=\s*[\'"][^\'\"]*\b(?:wrapper|container|main[-_]?container|page[-_]?wrapper|site[-_]?wrapper)\b[^\'\"]*[\'"][^>]*>(?:(?!<(?:main|article|content)\b).)*?</(?:div|section)>', re.DOTALL | re.IGNORECASE),
        ]
        
        # LOW: Fine-grained cleanup patterns
        self.low_priority_patterns = [
            # Specific list items - lower impact (the class attribute is required here;
            # leaving it optional would strip every <li> element on the page)
            re.compile(r'<li\b[^>]*class\s*=\s*[\'"][^\'\"]*\b(?:inactive|folder|nav[-_]?item|menu[-_]?item|ZMSFolder\d*|ZMSDocument\d*)\b[^\'\"]*[\'"][^>]*>.*?</li>', re.DOTALL | re.IGNORECASE),
            
            # Footnotes and references
            re.compile(r'<(?:div|section|aside|span)\b[^>]*(?:class|id)\s*=\s*[\'"][^\'\"]*\b(?:footnotes?|foot[-_]?notes?|references?|citations?|endnotes?)\b[^\'\"]*[\'"][^>]*>.*?</(?:div|section|aside|span)>', re.DOTALL | re.IGNORECASE),
            
            # Specific links
            re.compile(r'<a\b[^>]*(?:id\s*=\s*[\'"](?:mobile_back_to_desy|mobile[-_]?nav[-_]?toggle|search|phone)[\'"]|(?:class|id)\s*=\s*[\'"][^\'\"]*\b(?:inactive|ZMSFolder\d*|ZMSDocument\d*)\b[^\'\"]*[\'"]|href\s*=\s*[\'"][^\'\"]*(?:index_print|desy\.de|testbeam\.desy\.de)[^\'\"]*[\'"]|title\s*=\s*[\'"][^\'\"]*(?:Change\s+language|DESY\s+Homepage|to\s+[\'"]Accelerators[\'"])[^\'\"]*[\'"]|target\s*=\s*[\'"]_blank[\'"])[^>]*>.*?</a>', re.DOTALL | re.IGNORECASE),
            
            # Specific images
            re.compile(r'<img\b[^>]*(?:id\s*=\s*[\'"][^\'\"]*(?:phonebook_icon|print_icon|lang_icon|desylogo)[^\'\"]*[\'"]|alt\s*=\s*[\'"][^\'\"]*(?:phone\s+book|Diese\s+Seite\s+drucken|loading|DESY\s+Logo)[^\'\"]*[\'"]|src\s*=\s*[\'"][^\'\"]*(?:loading\.gif|logo_desy\.gif|arrow_large_white\.png)[^\'\"]*[\'"])[^>]*/?>', re.IGNORECASE),
            
            # ARIA and accessibility
            re.compile(r'<[^>]*(?:role\s*=\s*[\'"]navigation[\'"]|aria-label\s*=\s*[\'"][^\'\"]*[\'"])[^>]*>.*?</[^>]+>', re.DOTALL | re.IGNORECASE),
            
            # Nested inactive lists - FROM CODE 1
            re.compile(r'<ul\b[^>]*>(?:\s*<li\b[^>]*(?:class|id)\s*=\s*[\'"][^\'\"]*\b(?:inactive|ZMSFolder\d*|ZMSDocument\d*)\b[^\'\"]*[\'"][^>]*>.*?</li>\s*)+</ul>', re.DOTALL | re.IGNORECASE),
        ]
        
        # SPECIALIZED: Domain-specific patterns
        self.specialized_patterns = [
            # DESY institutional content - FROM CODE 1
            re.compile(r'Deutsches\s+Elektronen-Synchrotron\s+DESY\s+A\s+Research\s+Centre\s+of\s+the\s+Helmholtz\s+Association', re.IGNORECASE),
            re.compile(r'Data\s+Privacy\s+Policy\s*\|\s*Declaration\s+of\s+Accessibility\s*\|\s*Imprint\s*©[^.]*', re.IGNORECASE),
            re.compile(r'A\s+Research\s+Centre\s+of\s+the\s+Helmholtz\s+Association', re.IGNORECASE),
            re.compile(r'©\s*\d{4}\s*Deutsches\s+Elektronen-Synchrotron\s+DESY.*?(?:Helmholtz\s+Association)?', re.IGNORECASE),
            re.compile(r'Deutsches\s*Elektronen-Synchrotron', re.IGNORECASE),
            re.compile(r'Data\s+Privacy\s+Policy\s*\|.*?(?:Imprint|©)', re.IGNORECASE),
            re.compile(r'Impressum\s*/\s*Datenschutz\s*/\s*Erklärung\s+zur\s+Barrierefreiheit', re.IGNORECASE),
            re.compile(r'\bSprungnavigation\b', re.IGNORECASE),
            re.compile(r'\bZielgruppennavigation\b', re.IGNORECASE),
            re.compile(r'\bServicefunktionen\b', re.IGNORECASE),
            re.compile(r'\bBreadcrumb\b', re.IGNORECASE),
            re.compile(r'\bFooter\b', re.IGNORECASE),
            re.compile(r'\bDesy\s+Global\b', re.IGNORECASE),
            re.compile(r'\bZum\s+Untermenü\b', re.IGNORECASE),
            re.compile(r'\bZum\s+Inhalt\b', re.IGNORECASE),
            re.compile(r'\bZum\s+Hauptmenu\b', re.IGNORECASE),
            re.compile(r'\bInfos\s*&\s*Services\b', re.IGNORECASE),
            re.compile(r'\bLeichte\s+Sprache\b', re.IGNORECASE),
            re.compile(r'\bGebärdensprache\b', re.IGNORECASE)
        ]
        
        # CLEANUP: Final cleanup patterns
        self.cleanup_patterns = [
            # HTML comments
            re.compile(r'<!--\s*(?://wrapper\s*//\s*-->.*?<!--\s*/standard_html_header\s*--|/?\s*standard_html_header\s*-->)', re.DOTALL | re.IGNORECASE),
            re.compile(r'<!--[^>]*(?:wrapper|overall|standard_html)[^>]*-->', re.DOTALL | re.IGNORECASE),
            re.compile(r'<!--[^>]*tal:attributes[^>]*-->', re.IGNORECASE),
            re.compile(r'<!--a\s+tal:.*?</a-->', re.DOTALL | re.IGNORECASE),
            
            # SVG and other media
            re.compile(r'<svg[^>]*>.*?</svg>', re.DOTALL | re.IGNORECASE),
            
            # Attributes and styling
            re.compile(r'title\s*=\s*[\'"][^\'\"]*(?:Aktuelle|Seminare|Events)[^\'\"]*[\'"]', re.IGNORECASE),
            re.compile(r'<[^>]*style\s*=\s*[\'"][^\'\"]*(?:display\s*:\s*block|text-align\s*:\s*right|margin|opacity)[^\'\"]*[\'"][^>]*>', re.IGNORECASE),
        ]
        
        # TEXT-BASED: Patterns for text content cleanup - ENHANCED FROM CODE 1
        self.text_cleanup_patterns = [
            # Safe patterns
            re.compile(r'\bNavigation\b', re.IGNORECASE),
            re.compile(r'\bDatenschutzerklärung\b', re.IGNORECASE),
            re.compile(r'\bErklärung\s+zur\s+Barrierefreiheit\b', re.IGNORECASE),
            re.compile(r'\bBack\s+to\s+Top\b', re.IGNORECASE),
            re.compile(r'\b(?:nav|menu|breadcrumb|navigation)\s*[:\-\|]\s*', re.IGNORECASE),

            # Moderate patterns
            re.compile(r'\b(?:Home|Startseite|Kontakt|Suche|Login|Anmelden)\b', re.IGNORECASE),
            re.compile(r'\b(?:Archiv|Archive)\s*\d{4}', re.IGNORECASE),
            re.compile(r'\b(?:Page\s+\d+|Seite\s+\d+|\d+\s+of\s+\d+)\b', re.IGNORECASE),
            re.compile(r'\b(?:cookie|gdpr|popup|consent)\b', re.IGNORECASE),

            # Aggressive patterns (disabled)
            # re.compile(r'\b(?:LinkedIn|Twitter|Facebook|Instagram)\b', re.IGNORECASE),
            # re.compile(r'\b(?:News|Events?|Seminare|Aktuelles)\b', re.IGNORECASE),
            # re.compile(r'\b(?:About|Über\s+uns|Presse\s*&\s*Kommunikation)\b', re.IGNORECASE),
            # re.compile(r'\b(?:DOOR\s+-\s+DESY|User\s+consortia|Sample\s+Environment)\b', re.IGNORECASE),
            # re.compile(r'\b(?:index_ger|index_en|[A-Za-z]*Folder\d*)\b', re.IGNORECASE),
        ]

 
        
        # Whitespace pattern - FROM CODE 1 (SEPARATE FOR EFFICIENCY)
        self.whitespace_pattern = re.compile(r'[\xa0\u202f\n\r\t\s]+')
        
       

        self.max_depth = max_depth
        # Copy the list arguments so the shared mutable defaults are never modified in place
        self.content_tags = list(content_tags)
        self.excluded_keywords = list(excluded_keywords)
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.batch_size = batch_size
        self.timeout = timeout
        self.processed_hashes: Set[str] = set()
        self.full_text_hashes: Set[str] = set()

        self.processed_urls: Set[str] = set()
        self.max_hashes = 100000  # Limit to 100,000 hashes
        self.max_urls = 10000     # Limit to 10,000 URLs
    
        self.error_urls: Dict[str, str] = {}
        # Track redirected URLs separately for easier debugging
        self.redirected_urls: Dict[str, str] = {}
        self.js_wait_time = js_wait_time
        self.js_scroll = js_scroll
        
        self.session = None
        self.progress_bar = None
        self.ssl_bypass_domains = set()
        self.url_to_documents_map = {}    

        self.page_character_counts = {}  # Character counts per page, keyed by URL
  


        self.domain_configs = {
            "petra3.desy.de": {"timeout": 500, "max_connections": 2, "retry_delay": 3, "js_wait_time": 12000},
            "indico.desy.de": {"timeout": 500, "max_connections": 2, "retry_delay": 5, "js_wait_time": 15000},
            "pitz.desy.de": {"timeout": 500, "max_connections": 2, "retry_delay": 3, "js_wait_time": 12000},
        }
        

        self.domain_configs.update({
            "www.desy.de": { "timeout": 900, "max_connections": 1, "retry_delay": 3, "js_wait_time": 60000},
            "desy.de": {"timeout": 500, "max_connections": 1, "retry_delay": 3, "js_wait_time": 30000},
            "newsletter.desy.de": { "timeout": 900, "max_connections": 1, "retry_delay": 3, "js_wait_time": 60000},
            "connect.desy.de": { "timeout": 500, "max_connections": 1, "retry_delay": 3, "js_wait_time": 30000},
            "astroparticle-physics.desy.de": { "timeout": 500, "max_connections": 1, "retry_delay": 3, "js_wait_time": 30000},
            "innovation.desy.de": { "timeout": 500, "max_connections": 1, "retry_delay": 3, "js_wait_time": 30000},
            "petra4.desy.de": { "timeout": 900, "max_connections": 1, "retry_delay": 3, "js_wait_time": 90000},
            "accelerators.desy.de": { "timeout": 900, "max_connections": 1, "retry_delay": 3, "js_wait_time": 60000},
            "v22.desy.de": { "timeout": 500, "max_connections": 1, "retry_delay": 3, "js_wait_time": 30000},
            "photon-science.desy.de": { "timeout": 500, "max_connections": 1, "retry_delay": 3, "js_wait_time": 30000},
            "particle-physics.desy.de": { "timeout": 900, "max_connections": 1, "retry_delay": 3, "js_wait_time": 60000},
            "pr.desy.de": { "timeout": 500, "max_connections": 1, "retry_delay": 3, "js_wait_time": 30000},
            "fh.desy.de": { "timeout": 900, "max_connections": 1, "retry_delay": 3, "js_wait_time": 60000}
        })
        #"requires_js": False ,
        self.default_domain_config = { "timeout": timeout, "max_connections": 10, "retry_delay": 2}  # Reduced from 20
        




        self.debug_mode = False #True #False



    
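    def add_to_processed_hashes(self, content_hash: str):
        """Record a content hash, halving the set once max_hashes is reached."""
        if len(self.processed_hashes) >= self.max_hashes:
            # Sets are unordered, so the entries dropped here are arbitrary, not the oldest
            logger.warning("Processed hashes limit reached, dropping an arbitrary half of the entries")
            self.processed_hashes = set(list(self.processed_hashes)[self.max_hashes//2:])
        self.processed_hashes.add(content_hash)
    
    def add_to_processed_urls(self, url: str):
        """Record a processed URL, halving the set once max_urls is reached."""
        if len(self.processed_urls) >= self.max_urls:
            # Sets are unordered, so the entries dropped here are arbitrary, not the oldest
            logger.warning("Processed URLs limit reached, dropping an arbitrary half of the entries")
            self.processed_urls = set(list(self.processed_urls)[self.max_urls//2:])
        self.processed_urls.add(url)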

    
   
    def track_page_character_count(self, url: str, content: str, title: str = "", language: str = "en", depth: int = 0):
        if len(content) < self.MIN_CHUNK_CHARS:
            logger.warning(f"Skipping character count tracking for {url}: Content length {len(content)} below threshold")
            return
        self.page_character_counts[url] = {
            'url': url,
            'title': title,
            'character_count': len(content),
            'word_count': len(content.split()) if content else 0,
            'language': language,
            'depth': depth
           # 'timestamp': datetime.now().isoformat()
        }

    
    def should_skip_url(self, url: str) -> bool:
        """Check if URL should be skipped based on file extension."""
        path = urlparse(url).path.lower()
        # str.endswith accepts a tuple, so one call checks every non-HTML extension
        return path.endswith(tuple(self.NON_HTML_EXTENSIONS))




 


    def detect_language(self, soup: BeautifulSoup, text_sample: str, url: Optional[str] = None) -> str:
        """Return a language code, trying the URL suffix, langdetect, then HTML metadata."""
        # URLs ending in _ger.html are German pages
        if url and url.lower().endswith('_ger.html'):
            #logger.info(f"URL {url} identified as German due to _ger.html suffix")
            return 'de'
                
        # Prioritize langdetect for sufficient text
        if text_sample and len(text_sample) >= self.MIN_TEXT_SAMPLE_LENGTH:
            try:
                detected_lang = detect(text_sample[:1000])
                #logger.info(f"Language detected via langdetect for {url}: {detected_lang}")
                return detected_lang
            except Exception as e:
                logger.warning(f"Langdetect failed for {url}: {e}")
        
        # Fallback to HTML attributes
        html_lang = None
        if soup.html and soup.html.get('lang'):
            html_lang = soup.html.get('lang').strip().lower()
            logger.debug(f"HTML lang attribute: {html_lang}")
        if not html_lang and soup.html and soup.html.get('xml:lang'):
            html_lang = soup.html.get('xml:lang').strip().lower()
            logger.debug(f"XML lang attribute: {html_lang}")
        if not html_lang:
            meta_lang = soup.find('meta', attrs={'http-equiv': 'content-language'})
            if meta_lang and meta_lang.get('content'):
                html_lang = meta_lang.get('content').strip().lower()
                logger.debug(f"Meta content-language: {html_lang}")
        if not html_lang:
            meta_lang = soup.find('meta', attrs={'property': 'og:locale'})
            if meta_lang and meta_lang.get('content'):
                html_lang = meta_lang.get('content').strip().lower()
                logger.debug(f"Meta og:locale: {html_lang}")
        if html_lang:
            # Split on '-' and '_' so both 'de-DE' and og:locale values such as 'de_DE' reduce to 'de'
            html_lang = re.sub(r'[^a-z]', '', re.split(r'[-_]', html_lang)[0])
            if len(html_lang) == 2:
                #logger.info(f"Language from HTML attributes for {url}: {html_lang}")
                return html_lang
        
        # Final fallback
        logger.warning(f"No language detected for {url}, defaulting to 'en'")
        return 'en'

    def _apply_pattern_group(self, text: str, patterns: list, group_name: str) -> str:
        """Apply each compiled pattern in order, removing every match from the text."""
        for i, pattern in enumerate(patterns):
            # subn returns the number of substitutions, so debug logging
            # does not need a separate findall pass over the text
            text, matches = pattern.subn('', text)
            if self.debug_mode and matches > 0:
                logger.debug(f"{group_name}[{i}]: {matches} matches")
        
        return text



        

    def clean_content(self, text: str) -> str:
        if not text:
            return ""
                  
        try:
            soup = BeautifulSoup(text, 'html.parser')
            
            main_content_selectors = [
                # Semantic HTML5 tags (most reliable)
                'main', 'article', 'section[class*="content"]',
            
                #  Highly specific content blocks
                'div[class*="main-content"]',
                'div[class*="content-section"]',
                'div[class*="text-block"]',
            
                #  Common structured content containers
                'div[id="content"]', 'div[id="main"]', 'div[id="bodyContent"]',
                'div[class*="content"]', 'div[class*="text"]', 'div[class*="body"]',
            
                #  Layout wrappers (lower priority)
                'div[class*="page"]',
                'div[class*="container"]',
            
                #  Non-semantic but sometimes useful
                'center'
            ]


            main_content = None
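            for selector in main_content_selectors:
                main_content = soup.select_one(selector)
                # Stop at the first (highest-priority) selector that matches;
                # without this, only the last selector's result would survive.
                if main_content is not None:
                    break
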
    
            if not main_content:
                main_content = soup.body or soup
             
    
            #  Move structural cleanup here — before converting to string
            selectors = [
                'div[id="overall"]', 'div[class="wrapper"]', 'header[id="header"]',
                'div[id="mobile_menu_header"]', 'div[id="mobile_menu"]', 'div[id="mobile_dropdown"]',
                'div[id="top"]', 'div[id="logoarea"]', 'div[id="topleft"]', 'div[id="topright"]',
                'div[id="topmenu"]', 'nav[id="menu"]', 'ul[id="main_menu"]',
                'nav', 'ul[id*="menu" i]', 'ol[id*="menu" i]',
                'div[id="icons"]', 'div[class="topright_button"]',
                'li[class*="ZMS"]', 'a[class*="ZMS"]',
                'img[class="imgNoborder"]', 'img[id*="logo"]', 'img[id*="icon"]',
                'a[target="_blank"]', 'a[href*="doi.org"]', 'a[href*="DOI"]',
                'a[href*="journals.aps.org"]', 'a[href*="dx.doi.org"]', 'a[href*="doi:"]',
                'a[href*="abstract"]', 'a[href*="citation"]',
                'div[class="clear"]', 'div[class="loading"]',
                'footer', 'div[id*="footer" i]', 'div[class*="footer" i]', 'div[class*="copyright" i]',
                'div[class*="teaser" i]', 'div[class*="LinkElement" i]', 'div[class*="quicklinks" i]', 
                'div[class*="ZMS" i]', 'div[id*="teaser" i]', 'div[id*="quicklinks" i]',
                '[data-cookie]', '[data-consent]', '[class*="cookie" i]', '[class*="consent" i]', 
                '[style*="display:none" i]', '[style*="visibility:hidden" i]',
                'div[id="quick_nav_container"]',
                'a[href*="data_privacy_policy"]', 'a[href*="declaration_of_accessibility"]',
                'ul[style*="padding-bottom"]',
                'button[class*="btt"]', 'div[class*="btt"]',
                'ul[class*="footer__links"]', 'div[class*="footer__logos"]',
                'img[alt*="Logo"]', 'a[href*="linkedin"]', 'a[href*="twitter"]',
                'li[class*="ZMSFolder"]', 'li[class*="ZMSDocument"]',
                'a[class*="ZMSFolder"]', 'a[class*="ZMSDocument"]',
                'p.hidden.showforprint',
                'p[class*="hidden"][class*="showforprint" i]',
                '[class*="showforprint" i]', '[class*="show-for-print" i]',
                '[class*="hidden" i][class*="print" i]',
                '[class~="showforprint"]', '[class~="hidden"]',
                'a[class*="print" i]', 'a[class*="changelang" i]',
                    #  Enhanced nav/menu cleanup
                'nav', 'header', 'footer',
                'div[class*="nav" i]', 'div[id*="nav" i]',
                'div[class*="menu" i]', 'div[id*="menu" i]',
                'ul[class*="menu" i]', 'ul[id*="menu" i]',
                'li[class*="menu" i]', 'li[id*="menu" i]',
                'a[class*="menu" i]', 'a[id*="menu" i]',
                'section[class*="nav" i]', 'section[class*="menu" i]',
                'ul[class*="nav" i]',  # catches <ul class="nav">
                'ul[id*="nav" i]',     # just in case
                'div[id*="content-nav" i]',  # catches <div id="content-nav">
                'div[id="page-footer"]',
                'ul[id="footer-nav"]',
            ]

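            # Never remove main_content itself or any of its ancestors
            protected = {id(main_content)} | {id(p) for p in main_content.parents}

            for selector in selectors:
                # Lists, not sets: bs4 Tags hash by markup, so a set would drop
                # distinct elements that happen to render identically.
                elements = list(main_content.select(selector)) + list(soup.select(selector))
                for element in elements:
                    # The original `element.extractable` guard was always falsy
                    # (bs4 treats unknown attributes as child-tag lookups), so
                    # nothing was ever removed. `decomposed` requires bs4 >= 4.9.
                    if id(element) in protected or getattr(element, "decomposed", False):
                        continue
                    element.decompose()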
    
            # 🧼 Handle lingering <li> outside content manually
            for li in main_content.find_all("li"):
                if not li.find_parent(id="content"):
                    li.decompose()
    
            # 🔧 Remove DOI links before conversion
            doi_href_pattern = re.compile(r'(doi\.org|journals\.aps\.org|dx\.doi\.org|DOI:)', re.IGNORECASE)
            for a_tag in main_content.find_all('a', href=True):
                if doi_href_pattern.search(a_tag['href']):
                    if self.debug_mode:
                        logger.debug(f"Removing DOI link: {a_tag}")
                    a_tag.decompose()
    
            text = str(main_content)
    
        except Exception as e:
            logger.error(f"Main content extraction failed: {e}")
            text = re.sub(r'<[^>]+>', ' ', text)
    
        # 🔧 Regex pattern cleaning
        text = self._apply_pattern_group(text, self.critical_patterns, "CRITICAL")
        text = self._apply_pattern_group(text, self.high_priority_patterns, "HIGH")
        text = self._apply_pattern_group(text, self.medium_priority_patterns, "MEDIUM")
        text = self._apply_pattern_group(text, self.low_priority_patterns, "LOW")
        text = self._apply_pattern_group(text, self.specialized_patterns, "SPECIALIZED")
        text = self._apply_pattern_group(text, self.cleanup_patterns, "CLEANUP")
    
        # 🧼 Final DOM pass if any HTML remains
        if '<' in text and '>' in text:
            try:
                soup = BeautifulSoup(text, 'html.parser')
    
                # `string=` replaces the deprecated `text=` keyword (bs4 >= 4.4)
                for el in soup.find_all(string=re.compile(r'©\s*\d{4}\s*Deutsches\s*Elektronen-Synchrotron\s*DESY', re.I)):
                    el.replace_with('')
    
                for el in soup.find_all(string=True):
                    text_content = (el.string or "").lower()
                    for pattern in self.cookie_text_patterns:
                        if re.search(pattern, text_content, re.I):
                            parent = el.parent
                            for _ in range(4):
                                if parent and parent.name in ['div', 'section', 'aside', 'p', 'span']:
                                    if self.debug_mode:
                                        logger.debug(f"Removed parent {parent.name} with cookie text")
                                    parent.decompose()
                                    break
                                parent = parent.parent if parent else None
                            break
    
                text = soup.get_text(separator=' ', strip=True)
    
            except Exception as e:
                logger.error(f"HTML parsing failed: {e}")
                text = re.sub(r'<[^>]+>', ' ', text)
    
        # Final cleanup
        text = self._apply_pattern_group(text, self.text_cleanup_patterns, "TEXT_CLEANUP")
        text = self.whitespace_pattern.sub(' ', text)
    
        # 🔁 Remove duplicate DOIs
        doi_pattern = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Z0-9]+\b', re.IGNORECASE)
        seen_dois = set()
    
        def replace_doi(match):
            doi = match.group(0)
            if doi in seen_dois:
                if self.debug_mode:
                    logger.debug(f"Removed duplicate DOI: {doi}")
                return ''
            seen_dois.add(doi)
            return doi
    
        text = doi_pattern.sub(replace_doi, text)
    
        if self.debug_mode:
            #logger.setLevel(logging.DEBUG)

            problematic_terms = ['<', '>', 'nav', 'menu', 'cookie', 'consent', 'cookie-consent', 'cookie-banner']
            found_terms = [term for term in problematic_terms if term in text.lower()]
            if found_terms:
                logger.warning(f"Residual unwanted content detected: {', '.join(found_terms)}")
                for term in found_terms:
                    if term in ['<', '>']:
                        continue
                    pattern = re.compile(rf'.{{0,150}}{re.escape(term)}.{{0,150}}', re.IGNORECASE)
                    matches = pattern.findall(text)
                    if matches:
                        logger.debug(f"Context for '{term}': {matches[:3]}")
                        #logger.warning(f"Context for '{term}': {matches[:3]}")
                text = re.sub(r'<[^>]+>', ' ', text)
    
        return re.sub(r'\s+', ' ', text).strip()

    
    

    
    def is_login_page(self, soup: BeautifulSoup) -> bool:
        """Check if the page is a login page."""
        login_indicators = [
            soup.find('form', {'id': lambda x: x and 'login' in x.lower()}),
            soup.find('form', {'action': lambda x: x and 'login' in x.lower()}),
            soup.find('input', {'name': 'username'}),
            soup.find('input', {'name': 'password', 'type': 'password'}),
            soup.find('button', string=re.compile(r'log\s*in|sign\s*in', re.I)),
            soup.find('input', {'value': re.compile(r'log\s*in|sign\s*in', re.I)}),
            soup.find('div', class_=['login-box', 'auth-form']),
            soup.find('a', string=re.compile(r'log\s*in|sign\s*in|authenticate', re.I))
        ]
        # Check the page title as well as the presence of any login indicator
        title_matches = soup.title and re.search(r'log\s*in|sign\s*in', soup.title.text, re.I)
        # Coerce to bool so the declared return type holds (re.search returns a Match)
        return bool(title_matches) or any(login_indicators)

    
    def is_not_found_page(self, soup: BeautifulSoup) -> bool:
        """Check if the page is a 'not found' page."""
        error_phrases = [
            'not found', 'page doesn\'t exist', '404', 'page not found', 
            'does not exist', 'could not be found', 'site error', 
            'error was encountered', 'error occurred'
        ]
        page_text = soup.get_text(strip=True).lower()
    
        # Check title for error phrases
        if soup.title and any(phrase in soup.title.text.lower() for phrase in error_phrases):
            return True


        if re.search(r'error.*encountered.*publishing', page_text, re.I):
            return True


        # Check headings for error phrases
        for heading in soup.find_all(['h1', 'h2', 'h3']):
            if any(phrase in heading.get_text(strip=True).lower() for phrase in error_phrases):
                return True
        
        # Check entire page text for error phrases
        if any(phrase in page_text for phrase in error_phrases):
            return True
        
        # Check if page content is too short (indicates error page)
        if len(page_text) < self.MIN_TEXT_SAMPLE_LENGTH:
            return True
        
        return False

    def extract_list_metadata(self, soup: BeautifulSoup) -> str:
        """Extract metadata from structured lists."""
        content_parts = []
        
        # Find lists that might contain publication info
        
        lists = soup.find_all(['ul', 'ol', 'dl'], class_=['publication-list', 'pub-list'])

        for list_elem in lists:
            items = list_elem.find_all(['li', 'dt', 'dd'])
            if len(items) > 1:  # At least two items (threshold relaxed from 3)
                list_content = []
                for item in items:
                    item_text = item.get_text(strip=True)
                    if item_text and len(item_text) > 10:  # Meaningful content
                        # Check if item contains metadata indicators
                        if any(keyword in item_text.lower() for keyword in 
                              ['author', 'title', 'journal', 'doi', 'isbn', 'vol', 'pp', 'year', '20']):
                            list_content.append(item_text)
                
                if len(list_content) > 2:  # Multiple metadata items
                    content_parts.extend(list_content)
        
        return "\n".join(content_parts)

    def extract_table_metadata(self, soup: BeautifulSoup) -> str:
        """Extract publication metadata from tables."""
        content_parts = []
        
        # Find tables containing publication data
        tables = soup.find_all('table')
        for table in tables:
            rows = table.find_all('tr')
            if len(rows) < 2:  # Skip tables with fewer than two rows
                continue
                
            # Check if this looks like a publication table
            table_text = table.get_text(strip=True).lower()
            if any(keyword in table_text for keyword in ['author', 'title', 'journal', 'publication', 'presenter', 'date', 'conference']):
                
                for row in rows:
                    cells = row.find_all(['td', 'th'])
                    if len(cells) > 1:  # Multi-column row
                        cell_texts = []
                        for cell in cells:
                            cell_text = cell.get_text(strip=True)
                            if cell_text and len(cell_text) > 3:  # Meaningful content
                                cell_texts.append(cell_text)
                        
                        if cell_texts:
                            # Format as: "Title | Author | Journal" etc.
                            content_parts.append(" | ".join(cell_texts))
            else:
                # Even if no keywords, extract substantial table content
                for row in rows:
                    cells = row.find_all(['td', 'th'])
                    if len(cells) > 1:
                        cell_texts = [cell.get_text(strip=True) for cell in cells if cell.get_text(strip=True)]
                        if len(cell_texts) > 1 and any(len(text) > 15 for text in cell_texts):
                            content_parts.append(" | ".join(cell_texts))
        
        return "\n".join(content_parts)
    
        
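    def extract_content(self, soup: BeautifulSoup, use_tags: bool = True) -> Tuple[str, str]:
        """Extract cleaned, deduplicated text from the page using a single strategy.

        Returns the same cleaned string twice. `use_tags` is accepted but is
        currently unused in this method. Note that `soup` is modified in place.
        """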
        if not soup:
            logger.warning("No soup provided for content extraction")
            return "", ""
    
        all_text_parts = []
        processed_elements = set()
        seen_text_hashes = set()
    
        # 🔧 Expanded early cleanup: Remove known footer, legal, and social media elements
        selectors = (    
            '[id*="nav" i], [class*="nav" i], '
            '[id*="menu" i], [class*="menu" i], '
            '[id*="sidebar" i], [class*="sidebar" i], '
            '[id*="quicklinks" i], [class*="quicklinks" i], '
            'p.copyright, div.copyright, footer, '
            '[class*="footer" i], [id*="footer" i], '
            '[class*="impressum" i], [id*="impressum" i], '
            '[class*="datenschutz" i], [id*="datenschutz" i], '
            '[class*="legal" i], [id*="legal" i], '
            '[class*="social" i], [id*="social" i], '
            '[class*="share" i], [id*="share" i], '
            '[class*="links" i], [id*="links" i], '
            '[class*="bottom" i], [id*="bottom" i], '
            '[class*="contact" i], [id*="contact" i], '
            '[class*="mastodon" i], [class*="facebook" i], '
            '[class*="instagram" i], [class*="linkedin" i], '
            '[class*="twitter" i], [class*="rss" i], '
            'a[href*="impressum"], a[href*="datenschutz"], '
            'a[href*="privacy"], a[href*="accessibility"], '
            'a[href*="kontakt"], a[href*="contact"], '
            'a[href*="social"], a[href*="linkedin"], '
            'a[href*="twitter"], a[href*="facebook"], '
            'a[href*="instagram"], a[href*="mastodon"], '
            'a[href*="rss"]'            
        )
        for nav_elem in soup.select(selectors):
            nav_elem.decompose()
    
        def should_skip_element(element):
            return (
                element.get('id') in ['cookie-bar', 'footer', 'page-footer', 'site-footer'] or 
                any(cls in element.get('class', []) for cls in [
                    'cookie-bar', 'LinkElementTitle', 'ZMSTeaserContainer', 'footer', 'copyright',
                    'link', 'site-footer', 'ZMSDocument0'
                ]) or 
                element.name == 'li' or 
                element.find_parent('li') or 
                element.find_parent(attrs={'id': re.compile(r'(footer|page-footer|site-footer)', re.I)})
            )
    
        comprehensive_tags = [
            'p[id]', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
            'div.content-section', 'div.module', 'div.text', 'div.content', 
            'div.text-block', 'div.main-content', 'div.publication-item',
            'div.news-item', 'div.event-details', 'div.news-content',
            'div.status-report', 'div.status', 'div.monitor',
            *self.content_tags,
            'table', 'table.i-table', 'caption', 'td', 'th', 'tr',
            'section', 'article', 'main', 'span', 'div'
        ]
    
        for tag in comprehensive_tags:
            if "." in tag:
                tag_name, tag_class = tag.split(".", 1)
                elements = soup.find_all(tag_name, class_=tag_class)
            elif tag.startswith('p['):
                elements = soup.find_all('p', id=True)
            else:
                elements = soup.find_all(tag)
    
            for element in elements:
                if id(element) in processed_elements or should_skip_element(element):
                    continue
                if any(id(ancestor) in processed_elements for ancestor in element.parents):
                    continue
                if any(id(descendant) in processed_elements for descendant in element.descendants if hasattr(descendant, 'name')):
                    continue
    
                raw_html = str(element)
                cleaned_text = self.clean_content(raw_html)
    
                cleaned_text = self._apply_pattern_group(cleaned_text, self.critical_patterns, "CRITICAL")
                cleaned_text = self._apply_pattern_group(cleaned_text, self.high_priority_patterns, "HIGH")
                cleaned_text = self._apply_pattern_group(cleaned_text, self.medium_priority_patterns, "MEDIUM")
                cleaned_text = self._apply_pattern_group(cleaned_text, self.low_priority_patterns, "LOW")
                cleaned_text = self._apply_pattern_group(cleaned_text, self.specialized_patterns, "SPECIALIZED")
                cleaned_text = self._apply_pattern_group(cleaned_text, self.cleanup_patterns, "CLEANUP")
    
                if not cleaned_text or len(cleaned_text) < self.MIN_CHUNK_CHARS:
                    continue
    
                normalized_for_hash = re.sub(r'\s+', ' ', cleaned_text.lower().strip())
                content_hash = hashlib.md5(normalized_for_hash.encode()).hexdigest()
                if content_hash in seen_text_hashes:
                    continue
    
                seen_text_hashes.add(content_hash)
                processed_elements.add(id(element))
                for descendant in element.descendants:
                    if hasattr(descendant, 'name'):
                        processed_elements.add(id(descendant))
    
                all_text_parts.append(cleaned_text)
    
                if self.debug_mode:
                    logger.debug(f"Added cleaned text from <{element.name}> (tag: {tag}), length: {len(cleaned_text)}")
                    logger.debug(f"Preview: {cleaned_text[:200]}")
    
        content = "\n".join(all_text_parts)
    
        # 🧼 Final DOM-based cleanup (same as clean_content)
        try:
            soup_final = BeautifulSoup(content, 'html.parser')
    
            # `string=` replaces the deprecated `text=` keyword (bs4 >= 4.4)
            for el in soup_final.find_all(string=re.compile(r'©\s*\d{4}.*?DESY', re.I)):
                el.replace_with('')
    
            for el in soup_final.find_all(string=True):
                text_content = (el.string or "").lower()
                for pattern in self.cookie_text_patterns:
                    if re.search(pattern, text_content, re.I):
                        parent = el.parent
                        for _ in range(4):
                            if parent and parent.name in ['div', 'section', 'aside', 'p', 'span']:
                                if self.debug_mode:
                                    logger.debug(f"Removed parent {parent.name} with cookie text")
                                parent.decompose()
                                break
                            parent = parent.parent if parent else None
                        break
    
            content = soup_final.get_text(separator=' ', strip=True)
        except Exception as e:
            logger.error(f"Final HTML parsing failed: {e}")
    
        # Final regex cleanup
        content = self._apply_pattern_group(content, self.text_cleanup_patterns, "TEXT_CLEANUP")
        content = self.whitespace_pattern.sub(' ', content)
    
        # Remove duplicate DOIs
        doi_pattern = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Z0-9]+\b', re.IGNORECASE)
        seen_dois = set()
    
        def replace_doi(match):
            doi = match.group(0)
            if doi in seen_dois:
                if self.debug_mode:
                    logger.debug(f"Removed duplicate DOI: {doi}")
                return ''
            seen_dois.add(doi)
            return doi
    
        content = doi_pattern.sub(replace_doi, content)
    
        return content, content

    
###########################
    #For fixed_size
    
    def create_chunks(self, text: str, metadata: Dict, chunk_type: str = "character") -> List[Document]:
        if not text:
            return []
            
        # Enhanced text cleaning - do this FIRST before any processing
        cleaned_text = self.clean_content(text)
        # The MIN_CHUNK_CHARS length check is intentionally relaxed here:
        # only a completely empty result is rejected.
        if not cleaned_text:
            return []

    
        def split_text_by_size(cleaned_text: str, max_size: int, overlap_size: int, min_chars: int) -> List[str]:
            """
            Improved text splitting function that handles sentence boundaries and overlaps better.
            """            
            if len(cleaned_text) <= max_size:
                return [cleaned_text]
            
            chunks = []
            start = 0
            min_chunk_size = max(max_size // 2, max_size - overlap_size)
            text_len = len(cleaned_text)  # Cache length
            
            while start < text_len:
                end = min(start + max_size, text_len)
                
                # If this isn't the last chunk, try to find a good break point
                if end < text_len:
                    # Look for sentence endings in the last portion of the chunk
                    #search_start = max(end - 200, start + min_chunk_size)
                    search_start = max(end - int(max_size * 0.3), start + min_chunk_size)
                    search_zone = cleaned_text[search_start:end]

                    # Sentence boundaries: terminal punctuation followed by whitespace
                    # or the end of the search zone, or a blank line
                    sentence_pattern = re.compile(r'[.!?]\s+|[.!?]$|\n\s*\n')
                    matches = list(sentence_pattern.finditer(search_zone))
                    
                    if matches:
                        # Use the last sentence boundary found
                        match = matches[-1]
                        match_end = search_start + match.end()
                        end = match_end - len(match.group().lstrip('.!?'))
                    else:
                        # Fall back to word boundary
                        word_boundary = cleaned_text.rfind(' ', search_start, end)
                        if word_boundary > start:
                            end = word_boundary
                
                # Extract and validate chunk
                chunk = cleaned_text[start:end].strip()
                if len(chunk) >= min_chars:
                    chunks.append(chunk)
                
                # Check if we've reached the end
                if end >= text_len:
                    break
                
                # Calculate next start position with proper overlap
                ideal_next_start = end - overlap_size
                
                if ideal_next_start <= start:
                    next_start = start + min_chunk_size
                else:
                    word_start = cleaned_text.find(' ', ideal_next_start)
                    next_start = word_start + 1 if word_start != -1 and word_start < end else ideal_next_start
                
                start = next_start
            
            return chunks
        
        # Split the cleaned text (not the raw input) using the instance settings
        texts = split_text_by_size(cleaned_text, self.chunk_size, self.chunk_overlap, self.MIN_CHUNK_CHARS)
        
        # Pre-compile commonly used values
        section_title = metadata.get("section_title", "")
        section_level = metadata.get("section_level", 0)
        total_chunks = len(texts)
        
        unique_chunks = []
        chunk_fingerprints = set()
        
        for i, chunk in enumerate(texts):
            # Create fingerprint once (avoid duplicate normalization)
            normalized_text = re.sub(r'\s+', ' ', chunk.strip().lower())
            text_fingerprint = hashlib.md5((normalized_text + section_title).encode()).hexdigest()
            
            if text_fingerprint in chunk_fingerprints:
                continue
            
            chunk_fingerprints.add(text_fingerprint)
            
            # Use original chunk for content hash (avoid re-encoding)
            content_hash = hashlib.md5(chunk.encode()).hexdigest()
            
            if content_hash not in self.processed_hashes:
                #self.processed_hashes.add(content_hash)
                self.add_to_processed_hashes(content_hash)
                
                # Build metadata once
                chunk_metadata = {
                    **metadata,  # Dict unpacking makes a shallow copy of the base metadata
                    "chunk_index": i,
                    "total_chunks": total_chunks,
                    "chunk_type": chunk_type,
                    "section_title": section_title,
                    "section_level": section_level
                }
                
                if i > 0:
                    chunk_metadata["continued"] = True
                
                unique_chunks.append(Document(page_content=chunk, metadata=chunk_metadata))
        if chunk_type == "character":
            # Routine progress message, so log at INFO rather than CRITICAL
            logger.info(f"Chunking result: {len(unique_chunks)} chunks for {metadata.get('source')} in create_chunks")
        
        return unique_chunks

    
    def create_full_text_chunks(self, text: str, metadata: Dict, chunk_type: str = "full_text") -> List[Document]:
        if not text:
            return []
        
        # Enhanced text cleaning - do this FIRST before any processing
        cleaned_text = self.clean_content(text)
        #logger.debug(f"Cleaned text length: {len(cleaned_text)}")
        if not cleaned_text or len(cleaned_text) < self.MIN_CHUNK_CHARS:
            #logger.debug(f"Skipping due to short cleaned content (< {self.MIN_CHUNK_CHARS})")
            return []
    
        def split_text_for_full_coverage(t: str, max_size: int) -> List[str]:
            """
            Simple text splitting that ensures NO content loss - just cuts at max_size
            """
            if len(t) <= max_size:
                return [t]
            
            chunks = []
            start = 0
            text_len = len(t)
            
            while start < text_len:
                end = min(start + max_size, text_len)
                chunk = t[start:end].strip()
                if chunk:  # Only add non-empty chunks
                    chunks.append(chunk)
                start = end  # Next chunk starts exactly where this one ended
            
            return chunks
        
        # Split text using the simple function
        texts = split_text_for_full_coverage(cleaned_text, self.chunk_size)
        #logger.debug(f"Split into {len(texts)} chunks after cleaning")

        
        # Keep all the deduplication logic from original function
        section_title = metadata.get("section_title", "")
        section_level = metadata.get("section_level", 0)
        total_chunks = len(texts)
        
        unique_chunks = []
        chunk_fingerprints = set()
        
        for i, chunk in enumerate(texts):
            # Create fingerprint once (avoid duplicate normalization)
            normalized_text = re.sub(r'\s+', ' ', chunk.strip().lower())
            text_fingerprint = hashlib.md5((normalized_text + section_title).encode()).hexdigest()
            
            if text_fingerprint in chunk_fingerprints:
                continue
            
            chunk_fingerprints.add(text_fingerprint)
            
            # Use original chunk for content hash (avoid re-encoding)
            content_hash = hashlib.md5(chunk.encode()).hexdigest()

            # New hash is defined
            if content_hash not in self.full_text_hashes:
                #self.add_to_processed_hashes(content_hash)
                self.full_text_hashes.add(content_hash)
                
                # Build metadata once
                chunk_metadata = {
                    **metadata,
                    "chunk_index": i,
                    "total_chunks": total_chunks,
                    "chunk_type": "full_text",  # Force full_text type
                    "section_title": section_title,
                    "section_level": section_level
                }
                
                if i > 0:
                    chunk_metadata["continued"] = True
                
                unique_chunks.append(Document(page_content=chunk, metadata=chunk_metadata))
        
        return unique_chunks
            
    
# Final        
    def create_structure_based_chunks(self, soup: BeautifulSoup, url: str, depth: int, language: str) -> List[Document]:
        if self.is_login_page(soup) or self.is_not_found_page(soup):
            logger.warning(f"Skipping {url}: Login or not found page")
            return []
        
        chunks = []
        processed_elements = set()
        #detected_language = self.detect_language(soup, soup.get_text(strip=True), url)
        page_title = soup.title.text.strip() if soup.title else "No title"
        #logger.info(f"Processing URL: {url}, Title: {page_title}")
        logger.info(f"Processing URL: {url}")
        
        # Define header tags for hierarchical processing
        header_tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']
        
        def add_section_to_chunks(section: Dict, metadata: Dict) -> List[Document]:
            if not section["content"]:
                return []
            content_text = "\n".join(section["content"]).strip()
            full_text = f"{section['title']}\n\n{content_text}" if section["title"] else content_text
            if len(full_text) < self.MIN_CHUNK_CHARS:
                return []
                
            # Reuse extract_content so structural chunks get the same cleaning and deduplication
            cleaned_content, _ = self.extract_content(BeautifulSoup(full_text, 'html.parser'), use_tags=False)
            if not cleaned_content:
                return []
            # Named section_chunks to avoid confusion with the outer `chunks` list
            section_chunks = self.create_chunks(cleaned_content, metadata, chunk_type="structural")
            return [chunk for chunk in section_chunks if len(chunk.page_content) >= self.MIN_CHUNK_CHARS]
        
        # Process semantic sections
        section_tags = [
            "section", "article", "main", "div.content-section", "div.module", "div.text",
            "div.content", "div.text-block", "div.main-content", "div.container", "div.row",
            "div.card", "div.content-main", "div.teaser-text", "div.publication-item",
            "div.news-item", "div.portlet-body", "div.event-details", "div.indico-content",
            "div.publication-list", "div.event-description", "div.news-content",
            "div.status-report", "div.status", "div.monitor", "div.experiment", "div.results",
            "div.timetable", "p", "p[id]", "span", "table", "table.i-table", "caption",
            "td", "th", "tr", "ul", "ol", "li", *header_tags
        ]
        
        # Try semantic sections first
        for tag in section_tags:
            if "." in tag:
                tag_name, tag_class = tag.split(".", 1)
                elements = soup.find_all(tag_name, class_=tag_class)
            elif tag.startswith('p['):
                elements = soup.find_all('p', id=True)
            else:
                elements = soup.find_all(tag)
            for element in elements:
                # Skip if this element or its parent was already processed
                if id(element) in processed_elements:
                    continue
                
                # Skip if any ancestor was already processed
                if any(id(ancestor) in processed_elements for ancestor in element.parents):
                    continue
      
                text = element.get_text(separator=" ", strip=True)
                if text and len(text) > self.MIN_INITIAL_CHARS:
                    processed_elements.add(id(element))
                    # Also mark all descendants as processed
                    for descendant in element.descendants:
                        if hasattr(descendant, 'name'):
                            processed_elements.add(id(descendant))

                    # Build the metadata once per element (it was previously rebuilt
                    # for every descendant inside the loop above)
                    header = element.find(header_tags)
                    metadata = {
                        "source": url,
                        "section_title": header.get_text(strip=True) if header else page_title,
                        "section_level": 1,
                        "language": language,
                        "depth": depth,
                        "title": page_title
                    }
                    # Collect text and defer chunking to add_section_to_chunks
                    section = {"title": metadata["section_title"], "content": [text], "level": 1}
                    chunks.extend(add_section_to_chunks(section, metadata))
        
        # Hierarchical processing for headers and content
        if not chunks:
            #logger.debug(f"No chunks from semantic sections for {url}, trying hierarchical processing")
            active_sections = {}
            elements = soup.find_all([*header_tags, 'p', 'li', 'td'])
            for element in elements:
                tag_name = element.name
                if tag_name in header_tags:
                    text = element.get_text(strip=True)
                    if not text:
                        continue
                    level = int(tag_name[1])
                    for i in range(level, 7):
                        if i in active_sections:
                            metadata = {
                                "source": url,
                                "section_title": active_sections[i]["title"],
                                "section_level": active_sections[i]["level"],
                                "language": language,
                                "depth": depth,
                                "title": page_title
                            }
                            chunks.extend(add_section_to_chunks(active_sections[i], metadata))
                            del active_sections[i]
                    active_sections[level] = {"title": text, "content": [], "level": level}
                else:
                    text = element.get_text(strip=True)
                    if text and active_sections:
                        active_sections[max(active_sections.keys())]["content"].append(text)
            for level in sorted(active_sections.keys()):
                metadata = {
                    "source": url,
                    "section_title": active_sections[level]["title"],
                    "section_level": active_sections[level]["level"],
                    "language": language,
                    "depth": depth,
                    "title": page_title
                }
                chunks.extend(add_section_to_chunks(active_sections[level], metadata))
        
        # Fallback to body
        if not chunks:
            #logger.warning(f"No chunks from semantic or hierarchical processing for {url}, falling back to body")
            body = soup.find('body') or soup
            body_text = body.get_text(separator=" ", strip=True)
            if body_text:
                metadata = {
                    "source": url,
                    "section_title": page_title,
                    "section_level": 0,
                    "language": language,
                    "depth": depth,
                    "title": page_title
                }
                chunks.extend(self.create_chunks(body_text, metadata, chunk_type="structural"))
                
        # Routine progress message, so log at INFO rather than CRITICAL
        logger.info(f"Structural chunking produced {len(chunks)} chunks for {url}")
        
        return chunks



    
    async def create_session(self, url: Optional[str] = None) -> aiohttp.ClientSession:
        """Return the shared aiohttp session, recreating it after more than 50 requests."""
        async with self.session_lock:
            if not hasattr(self, 'session_request_count'):
                self.session_request_count = 0
    
            if self.session and not self.session.closed and self.session_request_count > 50:
                await self.session.close()
                self.session = None
                self.session_request_count = 0
                await asyncio.sleep(0.1)
    
            if self.session and not self.session.closed:
                #logger.info(f"[Session] Current count: {self.session_request_count} | Session closed: False")
                return self.session
    
            domain = urlparse(url).netloc if url else None
            domain_config = self.domain_configs.get(domain, self.default_domain_config)
            # NOTE: certificate verification is disabled here, so servers are not authenticated
            ssl_context = ssl.create_default_context()
            ssl_context.check_hostname = False
            ssl_context.verify_mode = ssl.CERT_NONE

            # Pick a random User-Agent for each new session
            user_agents = [
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/125.0.0.0 Safari/537.36',
                'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Version/14.0 Safari/605.1.15',
                'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36'
            ]

            self.session = aiohttp.ClientSession(
                connector=aiohttp.TCPConnector(
                    limit=domain_config.get("max_connections", 10),
                    ttl_dns_cache=300,
                    ssl=ssl_context,
                    force_close=False,
                    enable_cleanup_closed=True,
                    resolver=aiohttp.AsyncResolver()
                ),
                timeout=aiohttp.ClientTimeout(total=self.timeout, connect=10, sock_read=20),
                headers={
                    'User-Agent': random.choice(user_agents), 
                    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                    'Accept-Language': 'en-US,en;q=0.5',
                    'Connection': 'keep-alive'
                }
            )
            return self.session
    

    
    async def close_session(self):
        """Close the aiohttp session."""
        if self.session and not self.session.closed:
            await self.session.close()
            self.session = None
            logger.info("Session closed successfully")

    
    
    async def fetch_simple(self, url: str) -> Optional[str]:
        """Fetch URL content using aiohttp."""
        for attempt in range(3):
            try:
                parsed_url = urlparse(url)
                domain = parsed_url.netloc
                domain_config = self.domain_configs.get(domain, self.default_domain_config)
                session = await self.create_session(url)
                # Per-request headers: a randomly chosen Referer plus the target's own Origin
                headers = {
                    "Referer": random.choice([
                        "https://google.com",
                        "https://duckduckgo.com",
                        "https://www.bing.com",
                        f"https://{parsed_url.netloc}/",
                        "https://twitter.com",
                        "https://facebook.com"
                    ]),
                    "Origin": f"{parsed_url.scheme}://{parsed_url.netloc}"
                }

                timeout_config = aiohttp.ClientTimeout(
                    total=domain_config.get("timeout", self.timeout),  # upper bound for the whole request (seconds)
                    connect=60,       # seconds to acquire a connection (pool wait plus connecting)
                    sock_connect=30,  # seconds to establish a new socket connection
                    sock_read=120     # seconds allowed for each read from the server
                )
                # Small random delay between requests to mimic human browsing
                await asyncio.sleep(random.uniform(0.5, 1.5))
                
                async with session.get(
                    url,
                    headers=headers,
                    allow_redirects=True,
                    timeout=timeout_config
                ) as response:
                    self.session_request_count += 1
                    
                    if response.status == 200:
                        resolved_url = str(response.url)
                        text = await response.text(errors='replace')
    
                        # Soft block detection: very short bodies or known block phrases
                        lowered = text.lower()
                        if len(text.strip()) < 500 or 'access denied' in lowered or 'javascript required' in lowered:
                            logger.warning(f"[Soft block?] Suspiciously low content from {url}")
                            logger.debug(f"[{url}] Response preview: {text.strip()[:300]}")
                            self.error_urls[url] = "Soft block or empty content"

                            # Optional fallback to JS rendering
                            if hasattr(self, 'fetch_with_js'):
                                logger.info(f"Falling back to JS rendering for {url}")
                                return await self.fetch_with_js(url)
                            return None

                        if resolved_url != url:
                            logger.info(f"Redirected {url} to {resolved_url}")
                            self.redirected_urls[url] = resolved_url  # Track redirect
                            if resolved_url in self.processed_urls:
                                # Target already processed: mark the source URL as done too
                                self.add_to_processed_urls(url)
                                return None

                        # The body was already decoded above; reuse it instead of reading again
                        return text
                    else:
                        self.error_urls[url] = f"HTTP status: {response.status}"
                        return None
        
            except Exception as e:
                logger.warning(f"[Attempt {attempt+1}] Error fetching {url}: {e}")
                if attempt == 2:
                    self.error_urls[url] = f"{str(e)}"
                    return None
                await asyncio.sleep(2 ** attempt)  # exponential backoff

    async def init_browser(self):
        """Launch the shared headless Chromium instance if it is not running yet."""
        async with self.browser_lock:
            # Check only self.browser: this method never sets self.context (fetch_with_js
            # creates a fresh context per request), so also testing self.context here
            # would launch a second browser whenever two callers race for the lock
            if self.browser is None:
                self.playwright = await async_playwright().start()
                self.browser = await self.playwright.chromium.launch(
                    headless=True,
                    args=[
                        "--no-sandbox",
                        "--disable-setuid-sandbox",
                        "--disable-gpu-vsync"
                    ]
                )

    async def fetch_with_js(self, url: str) -> Optional[str]:
        for attempt in range(3):
            try:
                async with self.js_semaphore:  # 🔒 Limit concurrent JS fetches
                    if self.browser is None:
                        await self.init_browser()
    
                    # Pick a random User-Agent and Referer for this request's browser context
                    user_agents = [
                        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/125.0.0.0 Safari/537.36',
                        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Version/14.0 Safari/605.1.15',
                        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36'
                    ]
                    chosen_user_agent = random.choice(user_agents)

                    referers = [
                        "https://google.com",
                        # "https://duckduckgo.com",
                        # "https://www.bing.com",
                        f"https://{urlparse(url).netloc}/",
                        # "https://twitter.com",
                        # "https://facebook.com"
                        f"https://{urlparse(url).netloc}/index.html",
                        "https://www.desy.de/",
                        "https://desy.de/"
                    ]
                    chosen_referer = random.choice(referers)
                    
                    # Base request headers
                    extra_headers = {
                        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
                        "Accept-Language": "en-US,en;q=0.5",
                        "Connection": "keep-alive",
                        "Referer": chosen_referer,
                        "Origin": f"{urlparse(url).scheme}://{urlparse(url).netloc}"
                    }

                    # Additional browser-like headers
                    extra_headers.update({
                        "DNT": "1",
                        "Sec-Fetch-Site": "cross-site",
                        #"Sec-Fetch-Site": "same-origin",
                        "Sec-Fetch-Mode": "navigate",
                        "Sec-Fetch-Dest": "document",
                        "Upgrade-Insecure-Requests": "1"
                    })
                    
                    # Create a fresh, isolated browser context for this request
                    context = await self.browser.new_context(
                        user_agent=chosen_user_agent,
                        locale="en-US",
                        viewport={"width": 1280, "height": 800},
                        extra_http_headers=extra_headers,
                        ignore_https_errors=True  # Ignore TLS certificate errors
                    )
                    try:
                        page = await context.new_page()

                        domain = urlparse(url).netloc
                        domain_config = self.domain_configs.get(domain, self.default_domain_config)
                        timeout_ms = domain_config.get("timeout", self.timeout) * 1000
                        js_wait_time = domain_config.get("js_wait_time", self.js_wait_time)  # milliseconds
                        consent_timeout = domain_config.get("consent_timeout", 300)  # milliseconds

                        # Try progressively weaker load conditions
                        try:
                            response = await page.goto(url, wait_until='networkidle', timeout=timeout_ms)
                        except Exception:
                            try:
                                response = await page.goto(url, wait_until='domcontentloaded', timeout=timeout_ms)
                            except Exception:
                                response = await page.goto(url, timeout=timeout_ms // 2)

                        if response and not response.ok:
                            logger.warning(f"[{url}] JS response status: {response.status}")

                        if response:
                            final_url = page.url
                            if final_url != url:
                                logger.info(f"JS redirected {url} to {final_url}")
                                async with self.url_lock:
                                    self.redirected_urls[url] = final_url
                                    if final_url in self.processed_urls:
                                        self.add_to_processed_urls(url)
                                        return None

                            # Dismiss a cookie/consent banner if one is present
                            try:
                                await page.click(
                                    'button:has-text("Accept"), a:has-text("OK"), div:has-text(" Agree"), '
                                    'button:has-text("Consent"), button:has-text("Zustimmen")',
                                    timeout=consent_timeout
                                )
                            except Exception:
                                pass

                            await asyncio.sleep(js_wait_time / 1000)

                            if self.js_scroll:
                                await self._scroll_page(page)

                            content = await page.content()

                            if len(content) > 5_000_000:
                                logger.warning(f"Page content from {url} too large to process safely")
                                return None

                            if 'login' in page.url.lower() or 'auth' in page.url.lower():
                                async with self.url_lock:
                                    self.error_urls[url] = f"Redirected to login page: {page.url}"
                                return None

                            soup = BeautifulSoup(content, 'html.parser')
                            if len(soup.get_text(strip=True)) >= 100 and len(soup.find_all(['p', 'div', 'section'])) >= 5:
                                return content

                            logger.warning(f"Low content density in JS render for {url}, attempt {attempt + 1}")
                    finally:
                        # Closing the context also closes its pages, on every exit path
                        # (early return, exception, or a goto() that returned no response)
                        await context.close()

                    await asyncio.sleep(2)
            except Exception as e:
                logger.warning(f"[Attempt {attempt + 1}] JS error on {url}: {e}")
                if attempt == 2:
                    async with self.url_lock:
                        self.error_urls[url] = f"JS rendering failed: {str(e)}"
                    return None
                await asyncio.sleep(2 ** attempt)


    
    async def _scroll_page(self, page):
        """Scroll page to load lazy-loaded content."""
        try:
            height = await page.evaluate('document.body.scrollHeight')
            for i in range(0, height, 300):
                await page.evaluate(f'window.scrollTo(0, {i})')
                await asyncio.sleep(0.1)
            await page.evaluate('window.scrollTo(0, 0)')
            await asyncio.sleep(0.1)
        except Exception as e:
            logger.error(f"Error scrolling page: {str(e)}")


    async def debug_character_extraction(self, url: str, soup: BeautifulSoup) -> Dict:
        """Debug helper to understand why character extraction fails"""
        debug_info = {
            'url': url,
            'has_soup': soup is not None,
            'title': soup.title.text.strip() if soup and soup.title else "No title",
            'total_text_length': len(soup.get_text(strip=True)) if soup else 0,
            'tag_extraction_results': {},
            'fallback_results': {}
        }
        
        if not soup:
            debug_info['error'] = "No soup object"
            return debug_info

        for tag in self.content_tags[:5]:  # Test first 5 tags
            if "." in tag:
                # Split on the first dot only, so a tag with several dots cannot break unpacking
                tag_name, tag_class = tag.split(".", 1)
                elements = soup.find_all(tag_name, class_=tag_class)
            else:
                elements = soup.find_all(tag)
            
            tag_text_length = sum(len(elem.get_text(strip=True)) for elem in elements)
            debug_info['tag_extraction_results'][tag] = {
                'elements_found': len(elements),
                'total_text_length': tag_text_length
            }
        main_content = soup.find('main') or soup.find('body') or soup
        debug_info['fallback_results'] = {
            'main_content_tag': main_content.name if main_content else None,
            'main_content_length': len(main_content.get_text(strip=True)) if main_content else 0
        }
        
        return debug_info

    
    async def fetch_with_retry(self, fetch_function, url: str, max_retries: int = 3) -> Optional[str]:
        """Retry fetching with exponential backoff."""
        domain = urlparse(url).netloc
        base_delay = self.domain_configs.get(domain, self.default_domain_config).get("retry_delay", 2)
        for attempt in range(max_retries):
            if attempt > 0:
                delay = base_delay * (2 ** (attempt - 1)) * (0.5 + random.random())
                await asyncio.sleep(delay)
            try:
                return await fetch_function(url)
            except Exception as e:
                if attempt == max_retries - 1:
                    self.error_urls[url] = str(e)
                    return None
        return None




        
    async def fetch_url_async(self, url: str) -> Optional[BeautifulSoup]:
        if self.should_skip_url(url):
            async with self.url_lock:
                self.error_urls[url] = "Non-HTML content (skipped by extension)"
            return None
    
        content = await self.fetch_simple(url)
        if content:
            soup = BeautifulSoup(content, "html.parser")
    
            if self.is_login_page(soup):
                logger.warning(f"[{url}] Skipping login page")
                self.error_urls[url] = "Login page detected"
                return None
    
            if self.is_not_found_page(soup):
                logger.warning(f"[{url}] Skipping not-found page")
                self.error_urls[url] = "404 or content error detected"
                return None
    
            text_content = soup.get_text(strip=True)
            structure_tags = soup.find_all(['p', 'div', 'section', 'article'])
    
            # ✅ Dynamic JS dependency check
            weak_text = len(text_content) < 200
            low_structure = len(structure_tags) < 5
            #few_links = len(soup.find_all('a')) < 3
            has_js_warning = "javascript required" in content.lower()
    
            scripts = soup.find_all("script", src=True)
            external_js_count = sum(1 for s in scripts if 'zmi.js' in s['src'] or s['src'].startswith("/++resource++"))
            has_noscript_warning = bool(soup.find("noscript"))
            js_suspect = external_js_count > 1 or has_noscript_warning
    
            # ✅ Logging useful debug signals for diagnostics
            #logger.debug(f"[{url}] Text length: {len(text_content)}, Tags: {len(structure_tags)}, Links: {len(soup.find_all('a'))}, Scripts: {external_js_count}")
    
            if weak_text or low_structure or has_js_warning or js_suspect: #or few_links
                logger.info(f"[{url}] HTML signal suggests possible JS-dependency — triggering JS fallback")
                js_content = await self.fetch_with_js(url)
    
                if js_content:
                    soup = BeautifulSoup(js_content, "html.parser")
    
                    if self.is_login_page(soup):
                        logger.warning(f"[{url}] JS fallback landed on login page")
                        self.error_urls[url] = "Login page detected (post-JS)"
                        return None
    
                    if self.is_not_found_page(soup):
                        logger.warning(f"[{url}] JS fallback returned not-found page")
                        self.error_urls[url] = "404 (post-JS)"
                        return None
    
                    return soup
    
                return None
    
            return soup
    
        # If fetch_simple failed entirely, try JS rendering
        js_content = await self.fetch_with_js(url)
        if js_content:
            soup = BeautifulSoup(js_content, "html.parser")
    
            if self.is_login_page(soup):
                logger.warning(f"[{url}] JS-only fallback landed on login page")
                self.error_urls[url] = "Login page detected (pure JS)"
                return None
    
            if self.is_not_found_page(soup):
                logger.warning(f"[{url}] JS-only fallback returned not-found page")
                self.error_urls[url] = "404 (pure JS)"
                return None
    
            return soup
    
        return None
    
        
    
        
    async def process_url(self, url: str, depth: int) -> Tuple[List[Document], List[Document], List[Document]]:
        """Enhanced process_url with debugging for character extraction failures"""

        # Return cached documents if this URL was already processed
        if url in self.processed_urls and url in self.url_to_documents_map:
            cached_docs = self.url_to_documents_map[url]

            # Use .get so a cached document without a chunk_type does not raise KeyError
            char_docs = [doc for doc in cached_docs if doc.metadata.get("chunk_type") == "character"]
            struct_docs = [doc for doc in cached_docs if doc.metadata.get("chunk_type") == "structural"]
            full_docs = [doc for doc in cached_docs if doc.metadata.get("chunk_type") == "full_text"]
            return char_docs, struct_docs, full_docs

        # Skip if this URL redirects to a target that has already been processed
        if url in self.redirected_urls and self.redirected_urls[url] in self.processed_urls:
            logger.info(f"Skipping {url}: Redirected to {self.redirected_urls[url]}, which is already processed.")
            self.add_to_processed_urls(url)
            return [], [], []

    
        soup = await self.fetch_with_retry(self.fetch_url_async, url)
        if not soup or self.is_login_page(soup) or self.is_not_found_page(soup):
            # Keep a more specific reason if the fetch step already recorded one
            self.error_urls.setdefault(url, "Failed to fetch or invalid page")
            self.add_to_processed_urls(url)
            return [], [], []

        # No separate HTML cleaning step runs here; cleaned_soup is the fetched soup as-is
        cleaned_soup = soup

        title = cleaned_soup.title.text.strip() if cleaned_soup.title else "No title"
        content, text_sample = self.extract_content(cleaned_soup, use_tags=True)
        detected_language = self.detect_language(cleaned_soup, text_sample, url)
        self.track_page_character_count(url, content, title=title, language=detected_language, depth=depth)
    
        # Character-based chunks
        char_docs = []
        if len(content) >= self.MIN_CHUNK_CHARS:
            char_metadata = {
                "source": url,
                "title": title,
                "depth": depth,
                "language": detected_language
            }
            char_docs = self.create_chunks(content, char_metadata, chunk_type="character")

        # Structural chunks
        struct_docs = self.create_structure_based_chunks(cleaned_soup, url, depth, detected_language)

        # Full-text document (the whole page as a single Document)
        full_docs = []
        if len(content) >= self.MIN_CHUNK_CHARS:
            full_metadata = {
                "source": url,
                "title": title,
                "depth": depth,
                "language": detected_language,
                "chunk_type": "full_text"
            }
            content_hash = hashlib.md5(content.encode()).hexdigest()
            self.add_to_processed_hashes(content_hash)
            full_docs = [Document(page_content=content, metadata=full_metadata)]


        self.track_extraction_results(
            url,
            character_success=bool(char_docs),
            structural_success=bool(struct_docs),
            character_count=len(char_docs),
            structural_count=len(struct_docs)
        )
        

        all_docs = char_docs + struct_docs + full_docs
        if all_docs:  # Only mark as processed if something was extracted
            self.add_to_processed_urls(url)
            self.url_to_documents_map[url] = all_docs
        else:
            if url not in self.redirected_urls:
                logger.warning(f"⚠️ No chunks extracted for {url} — skipping marking as processed")
            # else:
            #     logger.info(f"[{url}] No chunks extracted — but URL was redirected to {self.redirected_urls[url]}")
            
        
        return char_docs, struct_docs, full_docs




    def save_character_counts(self, final: bool = False):
        """Save character counts to JSON file."""
        try:
            filename = "page_character_counts_final.json" if final else f"page_character_counts_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
            
            # Create summary statistics
            total_pages = len(self.page_character_counts)
            total_characters = sum(page['character_count'] for page in self.page_character_counts.values())
            total_words = sum(page['word_count'] for page in self.page_character_counts.values())
            avg_chars_per_page = total_characters / total_pages if total_pages > 0 else 0
            
            # Group by language
            language_stats = {}
            for page in self.page_character_counts.values():
                lang = page['language']
                if lang not in language_stats:
                    language_stats[lang] = {'pages': 0, 'characters': 0, 'words': 0}
                language_stats[lang]['pages'] += 1
                language_stats[lang]['characters'] += page['character_count']
                language_stats[lang]['words'] += page['word_count']
            
            data = {
                'timestamp': datetime.now().isoformat(),
                'summary': {
                    'total_pages': total_pages,
                    'total_characters': total_characters,
                    'total_words': total_words,
                    'average_characters_per_page': round(avg_chars_per_page, 2),
                    'language_breakdown': language_stats
                },
                'pages': list(self.page_character_counts.values())
            }
            
            with open(filename, 'w', encoding='utf-8') as f:
                json.dump(data, f, ensure_ascii=False, indent=2)
            
            logger.info(f"Character counts saved to {filename}")
            logger.info(f"Summary: {total_pages} pages, {total_characters:,} characters, {total_words:,} words")
            
        except Exception as e:
            logger.error(f"Error saving character counts: {e}")


    
    def _save_progress(self, documents: List[Document], processed_urls: Set[str], error_urls: Dict[str, str], final: bool = False, chunk_type: str = "character"):
        """Save processing progress to disk."""
        try:
            prefix = {"character": "processor_sized_base", "structural": "processor_structural_base", "full_text": "processor_full_text_base"}[chunk_type]
            filename = f"{prefix}_results_final.json" if final else f"{prefix}_progress_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
            data = {
                'timestamp': datetime.now().isoformat(),
                'processed_urls_count': len(processed_urls),
                'documents_count': len(documents),
                'error_urls_count': len(error_urls),
                'processed_urls': list(processed_urls),
                'error_urls': error_urls,
                'document_metadata': [doc.metadata for doc in documents]
            }
            with open(filename, 'w', encoding='utf-8') as f:
                json.dump(data, f, ensure_ascii=False, indent=2)
            text_filename = f"{prefix}_text_chunks_final.json" if final else f"{prefix}_text_chunks_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
            text_data = {
                'timestamp': datetime.now().isoformat(),
                'text_chunks': [{'content': doc.page_content, 'metadata': doc.metadata} for doc in documents]
            }
            with open(text_filename, 'w', encoding='utf-8') as f:
                json.dump(text_data, f, ensure_ascii=False, indent=2)
            #logger.info(f"Progress saved to {filename} and {text_filename}")
        except Exception as e:
            logger.error(f"Error saving progress: {e}")


    async def process_urls_from_mapping(self, url_map_file: str, batch_size=None, limit=None) -> Dict[str, List[Document]]:
        """Process URLs from a mapping file with all chunking strategies."""
        batch_size = batch_size or self.batch_size
        try:
            with open(url_map_file, 'r', encoding='utf-8') as f:
                url_map = json.load(f)
    
            urls_to_process = []
            skipped_urls = []
            for depth in range(self.max_depth + 1):
                depth_key = str(depth)
                if depth_key in url_map["urls_by_depth"]:
                    for url in url_map["urls_by_depth"][depth_key]:
                        if self.should_skip_url(url):
                            skipped_urls.append(url)
                        else:
                            urls_to_process.append((url, depth))
    
            # Deduplicate URLs while preserving first occurrence
            unique_urls = {}
            for url, depth in urls_to_process:
                if url not in unique_urls:
                    unique_urls[url] = depth
            unique_urls_to_process = list(unique_urls.items())
    
            if limit:
                unique_urls_to_process = unique_urls_to_process[:limit]
    
            if skipped_urls:
                logger.info(f"Skipped {len(skipped_urls)} URLs with non-HTML extensions")
            logger.info(f"Loaded {len(unique_urls_to_process)} unique URLs from mapping file")
    
            self.progress_bar = tqdm(total=len(unique_urls_to_process), desc="Processing URLs", unit="URL", dynamic_ncols=True)
            semaphore = asyncio.Semaphore(self.max_workers)
            character_chunks, structural_chunks, full_text_chunks = [], [], []
            failed_urls = []
    
            async def process_with_semaphore(url: str, depth: int) -> Tuple[List[Document], List[Document], List[Document]]:
                async with semaphore:
                    char_docs, struct_docs, full_docs = await self.process_url(url, depth)
                    self.progress_bar.update(1)
                    return char_docs, struct_docs, full_docs
    
            
            batch_size = batch_size or min(30, self.max_workers * 2)

    
            for i in range(0, len(unique_urls_to_process), batch_size):
                batch = unique_urls_to_process[i:i+batch_size]
    
                
                #process = psutil.Process()
                #logger.info(f"[Memory Before Batch {i//batch_size + 1}] RSS: {process.memory_info().rss / (1024 ** 2):.2f} MB")
                #logger.info(f"Processing batch {i//batch_size + 1}/{(len(unique_urls_to_process)-1)//batch_size + 1}")
    
                tasks = [process_with_semaphore(url, depth) for url, depth in batch]
                results = await asyncio.gather(*tasks, return_exceptions=True)
    
                #logger.info(f"[Memory After Batch {i//batch_size + 1}] RSS: {process.memory_info().rss / (1024 ** 2):.2f} MB")
    
                for j, result in enumerate(results):
                    url, depth = batch[j]
                    if isinstance(result, BaseException):
                        self.error_urls[url] = str(result)
                        failed_urls.append((url, depth))
                        continue
    
                    char_docs, struct_docs, full_docs = result
                    total_chunks = len(char_docs) + len(struct_docs) + len(full_docs)
    
                    if total_chunks > 0:
                        self.processed_urls.add(url)
                        if url in self.redirected_urls:
                            self.processed_urls.add(self.redirected_urls[url])
                    else:
                    
                        if (
                            url not in self.redirected_urls and
                            url not in self.error_urls  # ✅ Suppress warning for known bad pages
                        ):
                            # An empty page is not fatal, so log at WARNING (same level as the retry pass)
                            logger.warning(f"⚠️ No chunks extracted for {url} (processed but empty)")


                    character_chunks.extend(char_docs)
                    structural_chunks.extend(struct_docs)
                    full_text_chunks.extend(full_docs)
    
            if failed_urls:
                logger.info(f"Retrying {len(failed_urls)} failed URLs")
                retry_semaphore = asyncio.Semaphore(max(1, self.max_workers // 2))
    
                async def retry_process(url: str, depth: int):
                    async with retry_semaphore:
                        await asyncio.sleep(2)
                        return await self.process_url(url, depth)
    
                retry_batch_size = max(1, self.max_workers // 2)
                for i in range(0, len(failed_urls), retry_batch_size):
                    retry_batch = failed_urls[i:i+retry_batch_size]
                    retry_tasks = [retry_process(url, depth) for url, depth in retry_batch]
                    retry_results = await asyncio.gather(*retry_tasks, return_exceptions=True)
    
                    for j, result in enumerate(retry_results):
                        url, depth = retry_batch[j]
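                        if isinstance(result, BaseException):
                            # Keep the most recent error message for this URL
                            self.error_urls[url] = str(result)
                            continue

                        char_docs, struct_docs, full_docs = result
                        total_chunks = len(char_docs) + len(struct_docs) + len(full_docs)

                        if total_chunks > 0:
                            # The retry succeeded, so the URL no longer counts as an error
                            self.error_urls.pop(url, None)
                            self.processed_urls.add(url)
                            if url in self.redirected_urls:
                                self.processed_urls.add(self.redirected_urls[url])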

                        else:
                            if (
                                url not in self.redirected_urls and
                                url not in self.error_urls
                            ):
                                logger.warning(f"⚠️ After retry: No chunks extracted for {url}")

        
    
                        character_chunks.extend(char_docs)
                        structural_chunks.extend(struct_docs)
                        full_text_chunks.extend(full_docs)
    
            self.all_documents = character_chunks
            self.structural_documents = structural_chunks
            self.full_text_documents = full_text_chunks

            #Sara
            # self._save_progress(character_chunks, self.processed_urls, self.error_urls, final=True, chunk_type="character")
            # self._save_progress(structural_chunks, self.processed_urls, self.error_urls, final=True, chunk_type="structural")
            # self._save_progress(full_text_chunks, self.processed_urls, self.error_urls, final=True, chunk_type="full_text")
            self.save_character_counts(final=True)
    
            logger.info(f"Processing complete: {len(self.processed_urls)}/{len(unique_urls_to_process)} URLs, {len(self.error_urls)} errors")
            logger.info(f"Documents: {len(character_chunks)} character, {len(structural_chunks)} structural, {len(full_text_chunks)} full-text")
    
            # ✅ Save redirected URLs to file
            if self.redirected_urls:
                with open("redirected_urls.json", "w", encoding="utf-8") as f:
                    json.dump(self.redirected_urls, f, ensure_ascii=False, indent=2)
                logger.info(f"Saved {len(self.redirected_urls)} redirected URLs to redirected_urls.json")
    
            return {
                'character_chunks': character_chunks,
                'structural_chunks': structural_chunks,
                'full_text_chunks': full_text_chunks
            }
    
        finally:
            await self.close_session()
            self.session = None
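            # Compare with None: a tqdm bar with total=0 evaluates as False
            if self.progress_bar is not None:
                self.progress_bar.close()
            # Clean up Playwright browser and context
            if self.context:
                await self.context.close()
                self.context = None
            if self.browser:
                await self.browser.close()
                self.browser = None
            # getattr guards against both a missing attribute and one set to None
            if getattr(self, "playwright", None):
                await self.playwright.stop()
                self.playwright = None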
        

    def track_extraction_results(self, url, character_success, structural_success, character_count=0, structural_count=0, error_msg=None):
        """Track which extraction methods succeeded for each URL"""
        if not hasattr(self, 'extraction_log'):
            self.extraction_log = {}
        
        self.extraction_log[url] = {
            'character_chunks_success': character_success,
            'structural_chunks_success': structural_success,
            'character_chunks_count': character_count,
            'structural_chunks_count': structural_count,
            'timestamp': datetime.now().isoformat(),
            'error_message': error_msg,
        }
        
        # Log extraction results
        logger.info(f"URL: {url} | Character: {'✓' if character_success else '✗'} ({character_count}) | Structural: {'✓' if structural_success else '✗'} ({structural_count})")
        if error_msg:
            logger.warning(f"Error for {url}: {error_msg}")

    
    

    
        
    def print_extraction_summary(self):
        """Print a simple summary of extraction results"""
        if not hasattr(self, 'extraction_log'):
            print("No extraction log available")
            return
        
        total_urls = len(self.extraction_log)
        character_successes = sum(1 for log in self.extraction_log.values() if log['character_chunks_success'])
        structural_successes = sum(1 for log in self.extraction_log.values() if log['structural_chunks_success'])
        both_failed = sum(1 for log in self.extraction_log.values() 
                         if not log['character_chunks_success'] and not log['structural_chunks_success'])
        
        print(f"\n--- EXTRACTION SUMMARY ---")
        print(f"Total URLs: {total_urls}")
        print(f"Character method succeeded: {character_successes}/{total_urls}")
        print(f"Structural method succeeded: {structural_successes}/{total_urls}")
        print(f"Both methods failed: {both_failed}")
        # These are total failures per method, including URLs where both methods failed
        print(f"Character method failed: {total_urls - character_successes}")
        print(f"Structural method failed: {total_urls - structural_successes}")
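
The batching pattern used in `process_urls_from_mapping` (semaphore-bounded tasks, `asyncio.gather(..., return_exceptions=True)` per batch, then one retry pass over the failures) can be tried in isolation with the minimal sketch below. It is not the class's implementation: `fake_process`, `run_batched`, and `fail_once` are hypothetical names introduced for illustration, and clearing a URL's error entry after a successful retry is one possible policy, assumed here.

```python
import asyncio
from typing import Dict, List, Set, Tuple


async def fake_process(url: str, fail_once: Set[str]) -> List[str]:
    """Hypothetical stand-in for process_url: fails once for URLs in fail_once."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    if url in fail_once:
        fail_once.discard(url)
        raise RuntimeError(f"transient failure for {url}")
    return [f"chunk of {url}"]


async def run_batched(
    urls: List[str], fail_once: Set[str], max_workers: int = 4, batch_size: int = 3
) -> Tuple[List[str], Dict[str, str]]:
    semaphore = asyncio.Semaphore(max_workers)  # caps concurrent tasks
    docs: List[str] = []
    errors: Dict[str, str] = {}
    failed: List[str] = []

    async def guarded(url: str) -> List[str]:
        async with semaphore:
            return await fake_process(url, fail_once)

    async def run_batch(batch: List[str], is_retry: bool) -> None:
        # return_exceptions=True keeps one failure from cancelling the whole batch
        results = await asyncio.gather(*(guarded(u) for u in batch), return_exceptions=True)
        for url, result in zip(batch, results):
            if isinstance(result, Exception):
                errors[url] = str(result)
                if not is_retry:
                    failed.append(url)  # queue for the single retry pass
            else:
                errors.pop(url, None)  # assumed policy: success clears the error
                docs.extend(result)

    for i in range(0, len(urls), batch_size):
        await run_batch(urls[i:i + batch_size], is_retry=False)

    retry_list = list(failed)
    for i in range(0, len(retry_list), batch_size):
        await run_batch(retry_list[i:i + batch_size], is_retry=True)

    return docs, errors
```

Calling `asyncio.run(run_batched(urls, fail_once={...}))` returns the collected chunks and whatever errors remain after the retry pass. Results are matched to URLs with `zip(batch, results)`, which relies on `asyncio.gather` returning results in the order the awaitables were passed.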

    

In [None]:
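def load_urls_from_mapping_file(url_map_file):
    """Return the top-level keys of a JSON mapping file, assuming a flat {url: metadata} layout.

    A file that nests URLs under "urls_by_depth" yields section names here, not URLs.
    """
    import json
    with open(url_map_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return list(data.keys())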

def batch_urls(urls, batch_size):
    for i in range(0, len(urls), batch_size):
        yield urls[i:i + batch_size]


def export_merged_results(merged, prefix="processor"):
    import json
    import orjson
    from datetime import datetime

    # 📌 Use consistent timestamp everywhere
    timestamp = datetime.now()
    timestamp_str = timestamp.strftime("%Y%m%d_%H%M%S")
    iso_time = timestamp.isoformat()

    # ✅ Validate merged structure
    required_keys = ['character_counts_data', 'structural_chunks', 'full_text_chunks', 'character_chunks', 'processed_urls', 'error_urls', 'url_stats']
    for key in required_keys:
        if key not in merged or not merged[key]:
            logger.warning(f"Missing or empty data for {key} in merged results")

    # def write_json(filename, data):
    #     with open(filename, "w", encoding="utf-8") as f:
    #         json.dump(data, f, indent=2, ensure_ascii=False)

    def write_json(filename, data):
        with open(filename, "wb") as f:  # Binary mode for orjson
            f.write(orjson.dumps(data))    

    def format_text_chunks(docs):
        # ✅ Sanity check for misalignment
        assert len(docs["text_chunks"]) == len(docs["document_metadata"]), "Mismatch in chunks and metadata"

        return {
            "timestamp": iso_time,
            "text_chunks": [
                {
                    "content": doc,
                    "metadata": meta
                }
                for doc, meta in zip(docs["text_chunks"], docs["document_metadata"])
            ]
        }

    def format_character_counts(data):
        total_pages = sum(len(v) for v in data.values())
        total_characters = sum(item["character_count"] for v in data.values() for item in v)
        total_words = sum(item["metadata"].get("word_count", 0) for v in data.values() for item in v)

        return {
            "timestamp": iso_time,
            "summary": {
                "total_pages": total_pages,
                "total_characters": total_characters,
                "total_words": total_words,
                "average_characters_per_page": round(total_characters / total_pages, 2) if total_pages else 0
            },
            "pages": [item for v in data.values() for item in v]
        }

    write_json(f"page_character_counts_final.json", format_character_counts(merged["character_counts_data"])) #_{timestamp_str}

    # 🔹 Structural
    write_json(f"{prefix}_structural_base_text_chunks_final.json", format_text_chunks(merged["structural_chunks"]))
    write_json(f"{prefix}_structural_base_results_final.json", merged["structural_chunks"]["document_metadata"])

    # 🔹 Full-text
    write_json(f"{prefix}_full_text_base_text_chunks_final.json", format_text_chunks(merged["full_text_chunks"]))
    write_json(f"{prefix}_full_text_base_results_final.json", merged["full_text_chunks"]["document_metadata"])

    # 🔹 Character-based (the "sized_base" name matches the prefix used by _save_progress)
    write_json(f"{prefix}_sized_base_text_chunks_final.json", format_text_chunks(merged["character_chunks"]))
    write_json(f"{prefix}_sized_base_results_final.json", merged["character_chunks"]["document_metadata"])

    # 🔹 URL tracking
    write_json(f"{prefix}_processed_urls_final.json", list(merged["processed_urls"]))
    write_json(f"{prefix}_error_urls_final.json", list(merged["error_urls"]))

    # 🔹 Serialize safe url_stats (handles sets, defaultdicts)
    safe_url_stats = {
        k: list(v) if isinstance(v, set) else v
        for k, v in dict(merged["url_stats"]).items()
    }
    write_json(f"{prefix}_url_stats_final.json", safe_url_stats)


from collections import defaultdict
from itertools import chain

def merge_batch_results(all_results):
    merged = {
        'character_chunks': {'text_chunks': [], 'document_metadata': []},
        'structural_chunks': {'text_chunks': [], 'document_metadata': []},
        'full_text_chunks': {'text_chunks': [], 'document_metadata': []},
        'character_counts_data': {
            'character_chunks': [], 'structural_chunks': [], 'full_text_chunks': []
        },
        'processed_urls': set(),
        'error_urls': set(),
        'url_stats': defaultdict(list),
    }

    for key in ['character_chunks', 'structural_chunks', 'full_text_chunks']:
        merged[key]['text_chunks'] = list(chain.from_iterable(r[key]['text_chunks'] for r in all_results))
        merged[key]['document_metadata'] = list(chain.from_iterable(r[key]['document_metadata'] for r in all_results))
        merged['character_counts_data'][key] = list(chain.from_iterable(r['character_counts_data'][key] for r in all_results))

    merged['processed_urls'].update(chain.from_iterable(r['processed_urls'] for r in all_results))
    merged['error_urls'].update(chain.from_iterable(r['error_urls'] for r in all_results))

    for result in all_results:
        for stat_key, stat_val in result['url_stats'].items():
            if isinstance(stat_val, (list, set)):
                merged['url_stats'][stat_key].extend(stat_val)
            elif isinstance(stat_val, dict):
                merged['url_stats'][stat_key].append(stat_val)  # Store as list of dicts
            elif isinstance(stat_val, int):
                merged['url_stats'][stat_key].append(stat_val)  # Accumulate in list

    merged['processed_urls'] = list(merged['processed_urls'])
    merged['error_urls'] = list(merged['error_urls'])

    return merged
    



def process_mapped_urls(url_map_file, max_depth, batch_size, limit=None):    
    """Process URLs from a mapping file up to the specified depth."""
    
    async def _run():
        all_urls = load_urls_from_mapping_file(url_map_file)
        if limit:
            all_urls = all_urls[:limit]    
        all_results = []

        for batch_num, url_batch in enumerate(batch_urls(all_urls, batch_size), 1):
            print(f"\n🔹 Processing batch {batch_num} with {len(url_batch)} URLs...")
      
            processor = DESYContentProcessor(
                max_depth=max_depth,
                chunk_size=500,
                chunk_overlap=75,              
            )

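            try:
                # NOTE: process_urls_from_mapping reads the whole mapping file and does its
                # own batching; url_batch is only used for the progress message above.
                results = await processor.process_urls_from_mapping(
                    url_map_file, batch_size, limit=limit
                )
            except Exception as e:
                # Avoid halting the entire pipeline; continue to the next batch.
                # Catching Exception (not BaseException) lets Ctrl+C and task cancellation propagate.
                logger.warning(f"Batch {batch_num} failed: {type(e).__name__} - {e}")
                continue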

            # Track URLs
            character_urls = set(doc.metadata.get('source', '') for doc in results['character_chunks'])
            structural_urls = set(doc.metadata.get('source', '') for doc in results['structural_chunks'])
            full_text_urls = set(doc.metadata.get('source', '') for doc in results['full_text_chunks'])

            # 🔧 CHANGED: helper for DRY character counting
            def build_chunk_metadata(docs):
                return [
                    {
                        'url': doc.metadata.get('source', ''),
                        'chunk_index': i,
                        'character_count': len(doc.page_content),
                        'metadata': doc.metadata
                    }
                    for i, doc in enumerate(docs)
                ]

            character_counts_data = {
                'character_chunks': build_chunk_metadata(results['character_chunks']),
                'structural_chunks': build_chunk_metadata(results['structural_chunks']),
                'full_text_chunks': build_chunk_metadata(results['full_text_chunks']),
            }

            all_processed_urls = character_urls | structural_urls | full_text_urls
            for url in all_processed_urls:
                processor.track_extraction_results(
                    url=url,
                    character_success=url in character_urls,
                    structural_success=url in structural_urls,
                    character_count=sum(1 for doc in results['character_chunks'] if doc.metadata.get('source') == url),
                    structural_count=sum(1 for doc in results['structural_chunks'] if doc.metadata.get('source') == url),
                )

            missing_in_structural = character_urls - structural_urls
            missing_in_character = structural_urls - character_urls

            batch_result = {
                'character_chunks': {
                    'text_chunks': [doc.page_content for doc in results['character_chunks']],
                    'document_metadata': [doc.metadata for doc in results['character_chunks']],
                },
                'structural_chunks': {
                    'text_chunks': [doc.page_content for doc in results['structural_chunks']],
                    'document_metadata': [doc.metadata for doc in results['structural_chunks']],
                },
                'full_text_chunks': {
                    'text_chunks': [doc.page_content for doc in results['full_text_chunks']],
                    'document_metadata': [doc.metadata for doc in results['full_text_chunks']],
                },
                'character_counts_data': character_counts_data,
                'processed_urls': list(processor.processed_urls),
                'error_urls': processor.error_urls,
                'url_stats': {
                    'total_urls_processed': len(processor.processed_urls),
                    'total_urls_with_errors': len(processor.error_urls),
                    'error_urls': list(processor.error_urls),
                    'redirected_urls': processor.redirected_urls,
                    'error_url_names': {url: extract_url_name(url) for url in processor.error_urls},
                    'total_character_urls': len(character_urls),
                    'total_structural_urls': len(structural_urls),
                    'total_full_text_urls': len(full_text_urls),
                    'missing_in_structural': list(missing_in_structural),
                    'missing_in_character': list(missing_in_character),
                    'url_names': {url: extract_url_name(url) for url in processor.processed_urls},
                    'character_url_names': {url: extract_url_name(url) for url in character_urls},
                    'structural_url_names': {url: extract_url_name(url) for url in structural_urls},
                    'full_text_url_names': {url: extract_url_name(url) for url in full_text_urls},
                },
            }

            all_results.append(batch_result)

            # 🔧 CHANGED: optional anti-bot delay
            #await asyncio.sleep(random.uniform(2, 6))

        final_result = merge_batch_results(all_results)
        return final_result
        
    return asyncio.run(_run())


def extract_url_name(url):
    """Extract a readable name from a URL"""
    from urllib.parse import urlparse
    
    try:
        parsed = urlparse(url)
        # Remove www. if present and get domain
        domain = parsed.netloc.replace('www.', '')
        
        # Get the path without trailing slash
        path = parsed.path.rstrip('/')
        
        # Split path into segments
        segments = [seg for seg in path.split('/') if seg]
        
        if not segments:
            return domain  # Just return domain if no path segments
            
        # Use the last segment as the name (usually most specific)
        name = segments[-1].replace('-', ' ').replace('_', ' ')
        
        # Clean up the name
        name = ' '.join(word.capitalize() for word in name.split())
        
        return f"{domain} - {name}" if name else domain
    except Exception:
        return url  # Return original URL if parsing fails


def process_mapped_urls_safe(url_map_file, max_depth, batch_size, limit=None):    
    """Enhanced synchronous wrapper with better error handling"""
    start_time = datetime.now()
    try:        
        result = process_mapped_urls(url_map_file, max_depth, batch_size, limit=limit)
        if isinstance(result, BaseException):
            logger.error(f"process_mapped_urls failed: {type(result).__name__} - {result}")
            return None

        end_time = datetime.now()
        duration = end_time - start_time
        print(f"Processing completed in: {duration}")
        
        # Extract URL statistics for reporting (.get avoids a KeyError if the key is missing)
        url_stats = result.get('url_stats', {})

        def stat_total(key):
            # merge_batch_results stores integer stats as a list with one entry per batch
            value = url_stats.get(key, 0)
            return sum(value) if isinstance(value, list) else value

        # Chunk summaries
        print(f"🧩 Total character-based chunks: {len(result['character_chunks']['text_chunks'])}")
        print(f"🔧 Total structure-based chunks: {len(result['structural_chunks']['text_chunks'])}")
        print(f"📄 Total full-text documents: {len(result['full_text_chunks']['text_chunks'])}")
        print(f"🔗 Total processed URLs: {stat_total('total_urls_processed')}")
        print(f"❌ Total URLs with errors: {len(result.get('error_urls', []))}")

        # Report on URL distribution
        print("\n📊 URL Distribution:")
        print(f"Character chunks: {stat_total('total_character_urls')}")
        print(f"Structural chunks: {stat_total('total_structural_urls')}")
        print(f"Full-text chunks: {stat_total('total_full_text_urls')}")
        
        # Report on missing URLs
        missing_in_structural = url_stats.get('missing_in_structural', [])
        missing_in_character = url_stats.get('missing_in_character', [])

        print("\n🕳️ Missing URLs Analysis:")
        print(f"Missing in structural: {len(missing_in_structural)}")
        print(f"Missing in character: {len(missing_in_character)}")
        

        # 🔧 CHANGED: safer logging of failed URLs
        if missing_in_character:
            logger.info(f"⚠️ URLs failing character-based chunking: {len(missing_in_character)}")
            for url in missing_in_character:
                log_entry = result.get('processor', {}).extraction_log.get(url, {}) if 'processor' in result else {}
                logger.info(
                    f"Failed URL: {url}, "
                    f"Content Length: {log_entry.get('content_length', 0)}, "
                    f"Error: {log_entry.get('error_message', 'Unknown')}"
                )

        # 🔧 CHANGED: avoid printing if processor object was removed
        processor_obj = result.get('processor')
        if processor_obj and hasattr(processor_obj, 'print_extraction_summary'):
            processor_obj.print_extraction_summary()

        print(f"✅ Finished processing file: {url_map_file}")
        return result

    except Exception as e:
        # 🔧 CHANGED: log full traceback for better debugging
        import traceback
        logger.error(f"🔥 Critical error in processor: {e}", exc_info=True)
        traceback.print_exc()
        return None





def save_character_counts_json(character_counts_data, export_prefix="desy_final"):
    import json
    from datetime import datetime
    timestamp_obj = datetime.now()

    timestamp_str = timestamp_obj.strftime("%Y%m%d_%H%M%S")
    filename = f"{export_prefix}_character_counts_{timestamp_str}.json"    
    

    full_text_data = character_counts_data.get("full_text_chunks", [])
    total_characters = sum(item.get("character_count", 0) for item in full_text_data)

    # ✅ Optional: wrap with summary and timestamp like your original format
    data = {
        "timestamp": timestamp_obj.isoformat(),
        "pages": full_text_data,
        "summary": {
            "total_pages": len(full_text_data),
            "total_characters": total_characters,
            "average_characters_per_page": round(total_characters / len(full_text_data), 2) if full_text_data else 0
        }
    }


    try:
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        print(f"✅ Full-text character counts saved to {filename}")
    except Exception as e:
        print(f"❌ Failed to save full-text character counts: {e}")
        



def save_url_stats(url_stats, export_prefix="desy_final"):
    import json
    import orjson
    from datetime import datetime
    

    # 🔧 CHANGED: Use consistent timestamp object for filenames and metadata
    # timestamp_obj = datetime.now()
    # timestamp_str = timestamp_obj.strftime("%Y%m%d_%H%M%S")

    url_stats = url_stats or {}  # Use the argument that was passed in, not a global `result`
    url_names = url_stats.get("url_names", {})
    missing_structural = url_stats.get("missing_in_structural", [])
    missing_character = url_stats.get("missing_in_character", [])

    # 🔧 CHANGED: Convert sets/defaultdict to plain dict for JSON serialization
    safe_url_stats = {
        k: list(v) if isinstance(v, set) else v
        for k, v in dict(url_stats).items()
    }

    stats_file = f"{export_prefix}_url_stats.json" #_{timestamp_str}
    # with open(stats_file, 'w', encoding='utf-8') as f:
    #     json.dump(safe_url_stats, f, indent=2, ensure_ascii=False)
    with open(stats_file, 'wb') as f:
        f.write(orjson.dumps(safe_url_stats))

    # Missing structural URLs
    missing_structural_file = f"{export_prefix}_missing_structural.txt" #_{timestamp_str}
    with open(missing_structural_file, 'w', encoding='utf-8') as f:
        for url in missing_structural:
            name = url_names.get(url, '')
            f.write(f"{url}\t{name}\n")

    # Missing character URLs
    missing_character_file = f"{export_prefix}_missing_character.txt" #_{timestamp_str}
    with open(missing_character_file, 'w', encoding='utf-8') as f:
        for url in missing_character:
            name = url_names.get(url, '')
            f.write(f"{url}\t{name}\n")

    print(f"\n📊 URL statistics saved to: {stats_file}")
    print(f"📄 Missing structural URLs saved to: {missing_structural_file}")
    print(f"📄 Missing character URLs saved to: {missing_character_file}")




# Example usage
# Find the most recent URL map file or specify it directly

import contextlib
import json
import sys

files_to_scrape = [
    "Zero_text_scraped_urls.json",
    "desy_url_map_20250425_155033_urls=200_000.json"
]

all_results = []
seen_urls = set()


def map_urls_to_depth(url_map):
    """Return {url: shallowest depth} built from the map's "urls_by_depth" section."""
    depth_dict = {}
    for depth_key, url_list in url_map.get("urls_by_depth", {}).items():
        try:
            depth_int = int(depth_key)
        except (ValueError, TypeError):
            print(f"⚠️ Skipping non-numeric depth key: {depth_key}")
            continue
        for url in url_list:
            # Keep the smallest depth at which each URL was seen
            if url not in depth_dict or depth_int < depth_dict[url]:
                depth_dict[url] = depth_int
    return depth_dict


# Redirect stdout to a log file while scraping. The context managers close the
# file and restore the previous stdout even if an unexpected error is raised.
with open('output_log.txt', 'w', encoding='utf-8') as log_file, \
        contextlib.redirect_stdout(log_file):
    for map_file in files_to_scrape:
        try:
            print(f"🧭 Scraping from file: {map_file}")

            # Load URLs from mapping file before scraping
            with open(map_file, 'r', encoding='utf-8') as f:
                url_map = json.load(f)

            # Collect only the URLs that no earlier map file has contributed
            url_depth_map = map_urls_to_depth(url_map)
            new_urls = [url for url in url_depth_map if url not in seen_urls]
            seen_urls.update(new_urls)

            # Skip if no new URLs to scrape
            if not new_urls:
                print(f"⚠️ All URLs in {map_file} are already scraped — skipping.")
                continue

            result = process_mapped_urls_safe(map_file, max_depth=2, batch_size=100, limit=1000)
            if result:
                all_results.append(result)

        except FileNotFoundError:
            print(f"❌ {map_file} not found. Please run the URL mapper to generate it.")
        except json.JSONDecodeError as e:
            print(f"❌ {map_file} is not valid JSON: {e}")

# Merge and export all results together
merged = merge_batch_results(all_results)
export_merged_results(merged, prefix="desy_final")
print(f"🔍 Type of merged: {type(merged)}")
print(f"🔍 Keys in merged (if dict): {list(merged.keys()) if isinstance(merged, dict) else 'Not a dict'}")

save_url_stats(merged["url_stats"], export_prefix="desy_final")
#save_url_stats(merged, export_prefix="desy_final")
save_character_counts_json(merged['character_counts_data'], export_prefix="desy_final")
