[Bug]: LLMContentFilter is ignored #603

@Natabu

Description

crawl4ai version

0.4.248

Expected Behavior

The filter is applied and the page content is passed to the LLM.

Current Behavior

The filter is not applied.

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

1. Attempt to use the example code: https://crawl4ai.com/mkdocs/core/markdown-generation/#43-llmcontentfilter
2. Change provider to: "ollama/deepseek-r1:1.5b"
3. Enable cache bypass
4. Update URL to something more substantial
5. Run code

Code snippets

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter

async def main():
    # Initialize LLM filter with specific instruction
    filter = LLMContentFilter(
        provider="ollama/deepseek-r1:1.5b",  # or your preferred provider
        api_token="your-api-token",  # or use environment variable
        instruction="""
        Focus on extracting the core educational content.
        Include:
        - Key concepts and explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        Format the output as clean markdown with proper code blocks and headers.
        """,
        chunk_token_threshold=4096,  # Adjust based on your needs
        verbose=True
    )

    config = CrawlerRunConfig(
        content_filter=filter,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.crossplane.io/latest/concepts/environment-configs", config=config)
        print(result.fit_markdown)  # Filtered markdown content

        # print(filter.filter_content(result.html, True)) # running the filter manually on the content works


if __name__ == "__main__":
    asyncio.run(main())
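
A possible workaround, offered as a sketch rather than a confirmed fix: the snippet above imports `DefaultMarkdownGenerator` but never uses it, and the filter works when called by hand, so the assumption here is that the crawler only runs a content filter when it is attached to the markdown generator and passed as `markdown_generator`, not as `content_filter` on `CrawlerRunConfig`. The helper `pick_fit_markdown` is hypothetical (not part of crawl4ai); it exists because the attribute holding the filtered text is assumed to vary between versions (`markdown_v2.fit_markdown`, `markdown.fit_markdown`, or `fit_markdown`).

```python
import asyncio


def pick_fit_markdown(result):
    """Return the filtered markdown from whichever attribute the result exposes.

    Hypothetical helper, not part of crawl4ai. It assumes the filtered text may
    live on result.markdown_v2, result.markdown, or directly on the result,
    depending on the installed version.
    """
    for holder_name in ("markdown_v2", "markdown"):
        holder = getattr(result, holder_name, None)
        fit = getattr(holder, "fit_markdown", None)
        if fit:
            return fit
    return getattr(result, "fit_markdown", None)


async def main():
    # Imported lazily so the helper above can be exercised without crawl4ai.
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
    from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
    from crawl4ai.content_filter_strategy import LLMContentFilter

    # Same filter arguments as in the report above.
    llm_filter = LLMContentFilter(
        provider="ollama/deepseek-r1:1.5b",
        api_token="your-api-token",
        instruction="Keep the core educational content; drop navigation, sidebars, and footers.",
        chunk_token_threshold=4096,
        verbose=True,
    )

    # Assumption: the filter must be wrapped in the markdown generator to be used.
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(content_filter=llm_filter),
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://docs.crossplane.io/latest/concepts/environment-configs",
            config=config,
        )
        print(pick_fit_markdown(result))
```

Run it with `asyncio.run(main())`. If the verbose filter output appears in the log between the scrape and completion lines, the filter is being invoked through the generator.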

OS

Windows

Python version

3.12.8

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://docs.crossplane.io/latest/concepts/environ... | Status: True | Time: 1.08s
[SCRAPE].. ◆ Processed https://docs.crossplane.io/latest/concepts/environ... | Time: 101ms
[COMPLETE] ● https://docs.crossplane.io/latest/concepts/environ... | Status: True | Total: 1.18s

Labels

Under Test (bug fix / feature request that's under testing), Bug (something isn't working), Documentation (improvements or additions to documentation)
