[Bug]: After successful FETCH, and failed SCRAPE (COMPLETE being marked as failed), no error messages or failure reason is shown #1949

@tonicava

Description

crawl4ai version

0.8.5, 0.8.6

Expected Behavior

When scraping fails, the crawler should log an error message or a failure reason.

Description of the issue:

  • After FETCH is marked as successful, no SCRAPE is performed (0.00s time taken) and therefore no EXTRACT either, so COMPLETE is marked as failed. No error messages or logs are shown, even with every available verbose parameter set to True.
  • This behaviour has been observed on valid pages derived from a base URL, while for the base URL itself the whole process works fine.
  • If one of the derived pages is then passed in as the base URL, scraping and extraction run fine for it, but scraping again fails silently for the pages derived from it.

I could not find this error pattern in the documentation or in past issues.

The crawler config is shown in full in the "Code snippets" section.
It uses BFSDeepCrawlStrategy for crawling and LLMExtractionStrategy for extraction.

Current Behavior

[FETCH]... ↓ | ✓ | ⏱: 1.51s
[SCRAPE].. ◆ | ✓ | ⏱: 0.00s
[EXTRACT]. ■ | ✓ | ⏱: 0.00s
[COMPLETE] ● | ✗ | ⏱: 1.51s

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

Code snippets

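from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig, LLMConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Browser settings: headless, stealth enabled, random user agent
browser_cfg = BrowserConfig(
    user_agent_mode="random",
    enable_stealth=True,
    headless=True,
    viewport_width=1280,
    viewport_height=720,
    text_mode=False
)
# Two URL filters: an include list, then an exclude list (reverse=True)
filter_chain = FilterChain(
    [
        URLPatternFilter(
            patterns=[
                ...,
            ]
        ),
        URLPatternFilter(
            patterns=[
                ...
            ],
            reverse=True
        )
    ]
)
async with AsyncWebCrawler(verbose=True, config=browser_cfg, thread_safe=True) as crawler:
    crawl_config = CrawlerRunConfig(
        # LLM-based extraction against the JSON schema of a pydantic model
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(
                ...
            ),
            schema=Model.model_json_schema(),
            extraction_type="schema",
            instruction="""
                ...
                """,
            ...
        ),
        # Breadth-first deep crawl, limited to internal links
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            include_external=False,
            max_pages=10,
            filter_chain=filter_chain,
        ),
        cache_mode=CacheMode.BYPASS,
        exclude_external_links=True,
        remove_overlay_elements=True,
        process_iframes=True,
        stream=True,
        magic=True,
        verbose=True,
        scan_full_page=True
    )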
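
Since the console log gives no reason for the failed COMPLETE step, one way to narrow this down is to look at each result object directly. The snippet below is only a sketch: `describe_result` is a hypothetical helper (not part of crawl4ai), and it assumes each result exposes `url`, `success`, `status_code` and `error_message` attributes, as the library's crawl result objects are documented to do. Stand-in objects are used so it runs without a browser.

```python
from types import SimpleNamespace


def describe_result(result) -> str:
    """Summarise one crawl result, including any failure reason it carries.

    Hypothetical helper; attribute names are assumed to match the crawl result.
    """
    url = getattr(result, "url", "<unknown url>")
    if getattr(result, "success", False):
        return f"OK   {url}"
    status = getattr(result, "status_code", None)
    # error_message may be missing or empty; make that case visible too
    reason = getattr(result, "error_message", None) or "<no error_message set>"
    return f"FAIL {url} (status={status}): {reason}"


# Stand-in objects (assumption: they mimic the fields of a real crawl result)
ok = SimpleNamespace(
    url="https://example.com/", success=True, status_code=200, error_message=None
)
failed = SimpleNamespace(
    url="https://example.com/derived", success=False, status_code=None, error_message=""
)

print(describe_result(ok))
print(describe_result(failed))
```

With `stream=True`, this could be called on every result yielded inside the `async with` block (for example in an `async for` loop over the awaited `crawler.arun(...)` call) to check whether the derived pages carry an error message that the logger does not print.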

OS

Databricks

Python version

3.12.3

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Metadata

Assignees

No one assigned

    Labels

    🐞 Bug (Something isn't working), 🩺 Needs Triage (Needs attention of maintainers)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests