crawl4ai version
0.8.6
Expected Behavior
When excluded_tags=["header"] is configured, the entire <header> element — including all nested child elements such as <nav>, <ul>, <li>, and their contents — should be completely removed from the scraped output.
Current Behavior
The <header> element is removed, but its child elements (such as <nav>, <ul>, and <li>) are automatically promoted to the parent <body> element and still appear in the scraped content. No error or warning is generated during this process.
Is this reproducible?
Yes
Inputs Causing the Bug
Any HTML with nested children inside an excluded tag.
Example: <header> containing <nav> containing <ul>
Steps to Reproduce
1. Set excluded_tags=["header"] in CrawlerRunConfig
2. Crawl a page that has <header><nav><ul>...</ul></nav></header>
3. Check the markdown/cleaned output
4. <nav> and <ul> content still appears in the output
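Steps 3 and 4 can be automated with a small check on the output text. This is a minimal sketch, not part of crawl4ai: `leaked_markers` is a hypothetical helper, and the marker strings are assumed to be link labels that appear only inside the excluded `<header>` of the page under test.

```python
def leaked_markers(output: str, markers: list[str]) -> list[str]:
    """Return the markers that still appear in the scraped output.

    `markers` are strings expected to occur only inside the excluded
    element (for example, navigation link labels). Matching is
    case-insensitive; an empty result means nothing leaked.
    """
    lowered = output.lower()
    return [m for m in markers if m.lower() in lowered]


# Example with a hand-written markdown string standing in for result.markdown
sample_output = "# Title\n* Home\n* About\n\nBody text"
print(leaked_markers(sample_output, ["Home", "Pricing"]))
```

In a real run, pass `result.markdown` as `output`; a non-empty list confirms the bug.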
Code snippets
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(excluded_tags=["header"])
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown)  # <nav> content still visible ❌

asyncio.run(main())
OS
Ubuntu 20.04
Python version
3.10.13
Browser
Not applicable (using AsyncWebCrawler in headless mode)
Browser version
Not applicable
Error logs & Screenshots (if applicable)
No error thrown. The bug is silent — children silently survive in output.
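Suspected root cause (not verified against the crawl4ai source): the excluded element is unwrapped rather than removed with its subtree, so its children are promoted to the parent. Note that in plain lxml, element.getparent().remove(element) drops the whole subtree; child promotion is the behavior of drop_tag() and etree.strip_tags(), so the exclusion path may be using one of those.
Possible fix: remove the full subtree (for example with drop_tree() or remove()), or call element.clear() before removal.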
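To help narrow down the root cause, the sketch below compares two lxml removal strategies on the HTML shape from this report. It is only an illustration of lxml behavior, not crawl4ai's actual code path: `SAMPLE_HTML`, `remove_subtree`, and `unwrap_tag` are hypothetical names introduced here, and it assumes the `lxml` package is installed.

```python
from lxml import html

# Hypothetical sample page with the nesting described in this report
SAMPLE_HTML = (
    "<html><body>"
    "<header><nav><ul><li>Home</li></ul></nav></header>"
    "<main><p>Content</p></main>"
    "</body></html>"
)


def remove_subtree(source: str, tag: str) -> str:
    """Remove every <tag> element together with all of its descendants."""
    tree = html.fromstring(source)
    for el in tree.xpath(f"//{tag}"):
        # remove() detaches the element and its whole subtree
        # (it also discards any tail text that follows the element)
        el.getparent().remove(el)
    return html.tostring(tree, encoding="unicode")


def unwrap_tag(source: str, tag: str) -> str:
    """Drop only the <tag> wrapper; its children move up to the parent."""
    tree = html.fromstring(source)
    for el in tree.xpath(f"//{tag}"):
        # drop_tag() keeps children and text, which matches the reported symptom
        el.drop_tag()
    return html.tostring(tree, encoding="unicode")


print(remove_subtree(SAMPLE_HTML, "header"))
print(unwrap_tag(SAMPLE_HTML, "header"))
```

If the cleaned output from crawl4ai resembles the second result, with `<nav>` surviving directly under `<body>`, the exclusion logic is probably unwrapping the tag somewhere rather than removing the subtree.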