[Bug]: Excluded DOM elements are not fully removed from parsed output #1982

@Khushal-Savalakha

Description

crawl4ai version

0.8.6

Expected Behavior

When excluded_tags=["header"] is configured, the entire <header> element — including all nested child elements such as <nav>, <ul>, <li>, and their contents — should be completely removed from the scraped output.
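The expected semantics can be pinned down with a small standard-library sketch. This is not crawl4ai's implementation; `SubtreeExcluder` and `strip_excluded` are hypothetical names used only for illustration, and the sketch assumes the excluded tags are container elements with a closing tag (not void elements such as `<img>`). Comments and doctype declarations are not re-emitted.

```python
from html.parser import HTMLParser


class SubtreeExcluder(HTMLParser):
    """Re-emit HTML, skipping excluded elements together with everything inside them."""

    def __init__(self, excluded_tags):
        # convert_charrefs=False keeps entities such as &amp; intact in the output.
        super().__init__(convert_charrefs=False)
        self.excluded = {tag.lower() for tag in excluded_tags}
        self.depth = 0   # > 0 while the parser is inside an excluded element
        self.parts = []  # pieces of the output document

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded:
            self.depth += 1
        if self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags (<br/>) never change the nesting depth.
        if self.depth == 0 and tag not in self.excluded:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.excluded:
            self.depth = max(0, self.depth - 1)
            return
        if self.depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append(f"&#{name};")


def strip_excluded(html_text, excluded_tags):
    """Return html_text without the excluded elements and their descendants."""
    parser = SubtreeExcluder(excluded_tags)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


page = "<body><header><nav><ul><li>Home</li></ul></nav></header><p>Body &amp; more</p></body>"
print(strip_excluded(page, ["header"]))
```

With `excluded_tags=["header"]`, nothing that was nested inside `<header>` should survive, which is the behavior this issue asks for.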

Current Behavior

The <header> element is removed, but its child elements (such as <nav>, <ul>, and <li>) are automatically promoted to the parent <body> element and still appear in the scraped content. No error or warning is generated during this process.

Is this reproducible?

Yes

Inputs Causing the Bug

Any HTML with nested children inside an excluded tag.
Example: <header> containing <nav> containing <ul>

Steps to Reproduce

1. Set excluded_tags=["header"] in CrawlerRunConfig
2. Crawl a page that has <header><nav><ul>...</ul></nav></header>
3. Check the markdown/cleaned output
4. Observe that the <nav> and <ul> content still appears in the output
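Step 3 can be made mechanical with a small standard-library helper that reports which forbidden tags are still present in the cleaned HTML. This is only a sketch: `TagCollector` and `leaked_tags` are hypothetical names, and passing the crawl result's cleaned HTML (for example `result.cleaned_html`) is an assumption about where the output is read from.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the names of all start tags seen in a document."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def leaked_tags(cleaned_html, forbidden=("header", "nav")):
    """Return the forbidden tag names that still occur in cleaned_html, sorted."""
    collector = TagCollector()
    collector.feed(cleaned_html)
    collector.close()
    return sorted(collector.tags & set(forbidden))


# Output shaped like the one described in this report: <header> is gone, <nav> is not.
buggy_output = "<body><nav><ul><li>Home</li></ul></nav><p>Body text</p></body>"
print(leaked_tags(buggy_output))
```

An empty list means none of the forbidden tags remain; any other result reproduces the bug.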

Code snippets

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def main():
    # Exclude the whole <header> element from the scraped output
    config = CrawlerRunConfig(excluded_tags=["header"])

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown)  # <nav> content still visible ❌


asyncio.run(main())

OS

Ubuntu 20.04

Python version

3.10.13

Browser

Not applicable (using AsyncWebCrawler in headless mode)

Browser version

Not applicable

Error logs & Screenshots (if applicable)

No error is thrown. The bug is silent: the child elements survive in the output without any warning.

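Suspected root cause (not verified against the crawl4ai source): the excluded element is unwrapped instead of removed, so its children are promoted to the parent.
Note that in lxml, element.getparent().remove(element) removes the element together with its whole subtree; it is lxml.html's element.drop_tag() that removes only the tag and merges the children into the parent.
Possible fix: remove the full subtree (for example with element.drop_tree()), or call element.clear() before removal as a defensive measure.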
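The root-cause hypothesis can be checked directly against lxml, independently of crawl4ai. The sketch below assumes lxml is installed; `SAMPLE` and `apply_to_header` are illustrative names, not part of any library. It compares three ways of getting rid of `<header>`: `getparent().remove()`, `drop_tree()`, and `drop_tag()`.

```python
from lxml import html  # third-party dependency: pip install lxml

SAMPLE = (
    "<html><body>"
    "<header><nav><ul><li>Home</li></ul></nav></header>"
    "<p>Body text</p>"
    "</body></html>"
)


def apply_to_header(operation):
    """Parse SAMPLE, apply `operation` to its <header>, and return the serialized document."""
    doc = html.fromstring(SAMPLE)
    header = doc.find(".//header")
    operation(header)
    return html.tostring(doc, encoding="unicode")


# Removes the element and its whole subtree.
removed = apply_to_header(lambda el: el.getparent().remove(el))
# Also removes the whole subtree (lxml.html API).
dropped_tree = apply_to_header(lambda el: el.drop_tree())
# Removes only the tag; children are merged into the parent.
dropped_tag = apply_to_header(lambda el: el.drop_tag())

print("remove():   ", removed)
print("drop_tree():", dropped_tree)
print("drop_tag(): ", dropped_tag)
```

Only `drop_tag()` leaves `<nav>` behind in the parent, which matches the symptom described in this report. If crawl4ai turns out to use a different removal path, this comparison still helps narrow down which kind of call produces the promoted children.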
