crawl4ai version
0.4.248
Expected Behavior
The filter is applied and the page content is passed to the LLM.
Current Behavior
The filter is not applied.
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
1. Attempt to use the example code: https://crawl4ai.com/mkdocs/core/markdown-generation/#43-llmcontentfilter
2. Change provider to: "ollama/deepseek-r1:1.5b"
3. Enable cache bypass
4. Update URL to something more substantial
5. Run code
Code snippets
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter

async def main():
    # Initialize LLM filter with specific instruction
    filter = LLMContentFilter(
        provider="ollama/deepseek-r1:1.5b",  # or your preferred provider
        api_token="your-api-token",  # or use environment variable
        instruction="""
        Focus on extracting the core educational content.
        Include:
        - Key concepts and explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        Format the output as clean markdown with proper code blocks and headers.
        """,
        chunk_token_threshold=4096,  # Adjust based on your needs
        verbose=True
    )

    config = CrawlerRunConfig(
        content_filter=filter,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.crossplane.io/latest/concepts/environment-configs", config=config)
        print(result.fit_markdown)  # Filtered markdown content
        # print(filter.filter_content(result.html, True))  # running the filter manually on the content works

if __name__ == "__main__":
    asyncio.run(main())
OS
Windows
Python version
3.12.8
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://docs.crossplane.io/latest/concepts/environ... | Status: True | Time: 1.08s
[SCRAPE].. ◆ Processed https://docs.crossplane.io/latest/concepts/environ... | Time: 101ms
[COMPLETE] ● https://docs.crossplane.io/latest/concepts/environ... | Status: True | Total: 1.18s