Open
Labels
✊🏼 On-hold: Currently put on pause, due to a blocker
🐞 Bug: Something isn't working
📌 Root caused: Identified the root cause of bug
Description
crawl4ai version
0.7.4
Expected Behavior
When cache_mode=CacheMode.ENABLED is set on CrawlerRunConfig and the URL has previously been crawled and cached, result.extracted_content should be filled with freshly generated LLM output.
Current Behavior
When cache_mode=CacheMode.ENABLED is set on CrawlerRunConfig and the URL has previously been crawled and cached, result.extracted_content is empty.
Is this reproducible?
Yes
Code snippets
import os
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy

class Product(BaseModel):
    name: str
    price: str

async def main():
    # 1. Define the LLM extraction strategy
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv('OPENAI_API_KEY')),
        schema=Product.schema_json(),  # Or use model_json_schema()
        extraction_type="schema",
        instruction="Extract all product objects with 'name' and 'price' from the content.",
        chunk_token_threshold=1000,
        overlap_rate=0.0,
        apply_chunking=True,
        input_format="markdown",  # or "html", "fit_markdown"
        extra_args={"temperature": 0.0, "max_tokens": 800}
    )

    # 2. Build the crawler config
    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.ENABLED  # (!!) BYPASS is working
    )

    # 3. Create a browser config if needed
    browser_cfg = BrowserConfig(headless=True)

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # 4. Let's say we want to crawl a single page
        result = await crawler.arun(
            url="https://example.com/products",
            config=crawl_config
        )

        if result.success:
            # 5. The extracted content is presumably JSON
            data = json.loads(result.extracted_content)
            print("Extracted items:", data)

            # 6. Show usage stats
            llm_strategy.show_usage()  # prints token usage
        else:
            print("Error:", result.error_message)

# Entry point so the snippet runs as a script
if __name__ == "__main__":
    asyncio.run(main())

The code example is taken from the LLM Strategies documentation.
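Because extracted_content comes back empty on a cache hit, the json.loads call in step 5 is likely to raise instead of printing anything. A minimal guard, sketched here with the standard library only, makes the symptom easier to detect. The helper names parse_extracted_content and needs_uncached_retry are hypothetical and are not part of crawl4ai.

```python
import json
from typing import Any, Optional


def parse_extracted_content(extracted_content: Optional[str]) -> Optional[Any]:
    """Parse extracted_content as JSON; return None if it is missing, blank, or invalid."""
    if extracted_content is None or not extracted_content.strip():
        return None
    try:
        return json.loads(extracted_content)
    except json.JSONDecodeError:
        return None


def needs_uncached_retry(extracted_content: Optional[str]) -> bool:
    """Hypothetical check: True when a result carries no usable extraction."""
    return parse_extracted_content(extracted_content) is None
```

In the snippet above, if needs_uncached_retry(result.extracted_content) returns True, one possible workaround is to call crawler.arun again with a second CrawlerRunConfig that sets cache_mode=CacheMode.BYPASS, since BYPASS is reported as working. This only works around the symptom and does not address the root cause.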
OS
macOS
Python version
3.11.4
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response