# 🚀 Crawl4AI v0.7.5 - Complete Feature Walkthrough

Welcome to Crawl4AI v0.7.5! This notebook demonstrates all the new features introduced in this release.

## 📋 What's New in v0.7.5

1. **🔧 Docker Hooks System** - NEW! Complete pipeline customization with user-provided Python functions
2. **🔒 HTTPS Preservation** - Secure internal link handling
3. **🤖 Enhanced LLM Integration** - Custom providers with temperature control
4. **🛠️ Multiple Bug Fixes** - Community-reported issues resolved

---

## 📦 Setup and Installation

First, let's make sure we have the latest version installed:

In [4]:
# Install or upgrade to v0.7.5 (uncomment the next line to run it)
# !pip install -U crawl4ai==0.7.5 --quiet

# Import required modules
import asyncio
import nest_asyncio
nest_asyncio.apply()  # For Jupyter compatibility

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import FilterChain, URLPatternFilter, BFSDeepCrawlStrategy
from crawl4ai import hooks_to_string

print("✅ Crawl4AI v0.7.5 ready!")

✅ Crawl4AI v0.7.5 ready!


---

## 🔧 Feature 1: Docker Hooks System (NEW! 🆕)

### What is it?
v0.7.5 introduces a **completely new Docker Hooks System** that lets you inject custom Python functions at 8 key points in the crawling pipeline. This gives you full control over:
- Authentication setup
- Performance optimization
- Content processing
- Custom behavior at each stage

### Three Ways to Use Docker Hooks

The Docker Hooks System offers three approaches, all part of this new feature:

1. **String-based hooks** - Write hooks as strings for REST API
2. **Using `hooks_to_string()` utility** - Convert Python functions to strings
3. **Docker Client auto-conversion** - Pass functions directly (most convenient)

All three approaches are NEW in v0.7.5!
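
As a rough illustration of the string-based approach, the sketch below assembles a request body that carries a hook as source text. The field names (`urls`, `hooks.code`, `hooks.timeout`) and the helper `build_crawl_payload` are assumptions made for this example, not a confirmed schema; check the Docker API reference before relying on them. Nothing is sent over the network here.

```python
import json

# Hook source as a plain string; the server is assumed to compile and run it.
BLOCK_IMAGES_HOOK = '''
async def hook(page, context, **kwargs):
    # Abort image requests to speed up page loads
    await context.route(
        "**/*.{png,jpg,jpeg,gif,webp}",
        lambda route: route.abort(),
    )
    return page
'''


def build_crawl_payload(urls, hooks, timeout=30):
    """Assemble a JSON-serializable request body (field names are assumptions)."""
    return {
        "urls": list(urls),
        "hooks": {"code": dict(hooks), "timeout": timeout},
    }


payload = build_crawl_payload(
    ["https://example.com"],
    {"on_page_context_created": BLOCK_IMAGES_HOOK},
)

# Show the beginning of the serialized body
print(json.dumps(payload, indent=2)[:300])
```

With a Crawl4AI Docker server running, a body like this would typically be posted to its crawl endpoint with an HTTP client; the exact URL and port depend on your deployment.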

---

## 🔒 Feature 2: HTTPS Preservation for Internal Links

### Problem
When crawling HTTPS sites, internal links sometimes get downgraded to HTTP, breaking authentication and causing security warnings.

### Solution
The new `preserve_https_for_internal_links=True` parameter keeps the `https://` scheme on all internal links.

In [9]:
async def demo_https_preservation():
    """
    Demonstrate HTTPS preservation with deep crawling
    """
    print("🔒 Testing HTTPS Preservation\n")
    print("=" * 60)
    
    # Setup URL filter for quotes.toscrape.com
    url_filter = URLPatternFilter(
        patterns=[r"^(https:\/\/)?quotes\.toscrape\.com(\/.*)?$"]
    )
    
    # Configure crawler with HTTPS preservation
    config = CrawlerRunConfig(
        exclude_external_links=True,
        preserve_https_for_internal_links=True,  # 🆕 NEW in v0.7.5
        cache_mode=CacheMode.BYPASS,
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            max_pages=5,
            filter_chain=FilterChain([url_filter])
        )
    )
    
    async with AsyncWebCrawler() as crawler:
        # With deep_crawl_strategy, arun() returns a list of CrawlResult objects
        results = await crawler.arun(
            url="https://quotes.toscrape.com",
            config=config
        )
        
        # Analyze the first result (the list is empty if nothing was crawled)
        if results:
            first_result = results[0]
            internal_links = [link['href'] for link in first_result.links['internal']]
            
            # Split internal links by scheme to check HTTPS preservation
            https_links = [link for link in internal_links if link.startswith('https://')]
            http_links = [link for link in internal_links if link.startswith('http://')]
            
            print("\n📊 Results:")
            print(f"  Pages crawled: {len(results)}")
            print(f"  Total internal links (from first page): {len(internal_links)}")
            print(f"  HTTPS links: {len(https_links)} ✅")
            print(f"  HTTP links: {len(http_links)} {'⚠️' if http_links else ''}")
            if internal_links:
                print(f"  HTTPS preservation rate: {len(https_links)/len(internal_links)*100:.1f}%")
            
            print(f"\n🔗 Sample HTTPS-preserved links:")
            for link in https_links[:5]:
                print(f"  → {link}")
        else:
            print(f"\n⚠️ No results returned")
    
    print("\n" + "=" * 60)
    print("✅ HTTPS Preservation Demo Complete!\n")

# Run the demo
await demo_https_preservation()

🔒 Testing HTTPS Preservation




📊 Results:
  Pages crawled: 5
  Total internal links (from first page): 47
  HTTPS links: 47 ✅
  HTTP links: 0 
  HTTPS preservation rate: 100.0%

🔗 Sample HTTPS-preserved links:
  → https://quotes.toscrape.com/
  → https://quotes.toscrape.com/login
  → https://quotes.toscrape.com/author/Albert-Einstein
  → https://quotes.toscrape.com/tag/change/page/1
  → https://quotes.toscrape.com/tag/deep-thoughts/page/1

✅ HTTPS Preservation Demo Complete!
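
To make the idea concrete, here is a minimal standalone sketch of what "preserving HTTPS" could mean for a single link. It is not Crawl4AI's implementation; the function name `preserve_https` and the same-host rule are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def preserve_https(link: str, base_url: str) -> str:
    """Upgrade http:// to https:// for links on the same host as an HTTPS base URL."""
    base = urlsplit(base_url)
    target = urlsplit(link)
    same_host = target.hostname == base.hostname
    if base.scheme == "https" and target.scheme == "http" and same_host:
        target = target._replace(scheme="https")
    return urlunsplit(target)


# Internal link on an HTTPS site: the scheme is upgraded
print(preserve_https("http://quotes.toscrape.com/login", "https://quotes.toscrape.com"))
# External link: left untouched
print(preserve_https("http://other.example/page", "https://quotes.toscrape.com"))
```

Restricting the upgrade to links on the crawled host avoids rewriting external sites that may not serve HTTPS at all.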



---

## 🤖 Feature 3: Enhanced LLM Integration

### What's New
- Custom `temperature` parameter for creativity control
- `base_url` for custom API endpoints
- Better multi-provider support

### Example with Custom Temperature

In [10]:
from crawl4ai import LLMExtractionStrategy, LLMConfig
from pydantic import BaseModel, Field
import os

# Define extraction schema
class Article(BaseModel):
    title: str = Field(description="Article title")
    summary: str = Field(description="Brief summary of the article")
    main_topics: list[str] = Field(description="List of main topics covered")

async def demo_enhanced_llm():
    """
    Demonstrate enhanced LLM integration with custom temperature
    """
    print("🤖 Testing Enhanced LLM Integration\n")
    print("=" * 60)
    
    # Check for API key
    api_key = os.getenv('OPENAI_API_KEY')
    if not api_key:
        print("⚠️ Note: Set OPENAI_API_KEY environment variable to test LLM extraction")
        print("For this demo, we'll show the configuration only.\n")
        
        print("📝 Example LLM Configuration with new v0.7.5 features:")
        print("""
llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token="your-api-key",
        temperature=0.7,  # 🆕 NEW: Control creativity (0.0-2.0)
        base_url="custom-endpoint"  # 🆕 NEW: Custom API endpoint
    ),
    schema=Article.model_json_schema(),
    extraction_type="schema",
    instruction="Extract article information"
)
        """)
        return
    
    # Create LLM extraction strategy with custom temperature
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token=api_key,
            temperature=0.3,  # 🆕 Lower temperature for more focused extraction
        ),
        schema=Article.model_json_schema(),
        extraction_type="schema",
        instruction="Extract the article title, a brief summary, and main topics discussed."
    )
    
    config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS
    )
    
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Artificial_intelligence",
            config=config
        )
        
        if result.success:
            print("\n✅ LLM Extraction Successful!")
            print(f"\n📄 Extracted Content:")
            print(result.extracted_content)
        else:
            print(f"\n❌ Extraction failed: {result.error_message}")
    
    print("\n" + "=" * 60)
    print("✅ Enhanced LLM Demo Complete!\n")

# Run the demo
await demo_enhanced_llm()

🤖 Testing Enhanced LLM Integration






✅ LLM Extraction Successful!

📄 Extracted Content:
[
    {
        "title": "Artificial intelligence",
        "summary": "Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. AI can be applied in various fields and has numerous applications, including health, finance, and military.",
        "main_topics": [
            "Goals",
            "Techniques",
            "Applications",
            "Ethics",
            "History",
            "Philosophy",
            "Future",
            "In fiction"
        ],
        "error": false
    },
    {
        "title": "Artificial intelligence",
        "summary": "The article discusses artificial intelligence (AI), its various techniques, applications, and advancements, particularly focusing on machine learning, deep learning, and neural networks. It highlights the evolution of AI technologies, including generative pre-trained transformers (GPT), and their

### Creating Reusable Hook Functions

The hook functions below are plain Python async functions for the Docker Hooks System; they are defined once and reused in the examples that follow:

In [11]:
# Define reusable hooks as Python functions

async def block_images_hook(page, context, **kwargs):
    """
    Performance optimization: Block images to speed up crawling
    """
    print("[Hook] Blocking images for faster loading...")
    await context.route(
        "**/*.{png,jpg,jpeg,gif,webp,svg,ico}",
        lambda route: route.abort()
    )
    return page

async def set_viewport_hook(page, context, **kwargs):
    """
    Set consistent viewport size for rendering
    """
    print("[Hook] Setting viewport to 1920x1080...")
    await page.set_viewport_size({"width": 1920, "height": 1080})
    return page

async def add_custom_headers_hook(page, context, url, **kwargs):
    """
    Add custom headers before navigation
    """
    print(f"[Hook] Adding custom headers for {url}...")
    await page.set_extra_http_headers({
        'X-Crawl4AI-Version': '0.7.5',
        'X-Custom-Header': 'docker-hooks-demo',
        'Accept-Language': 'en-US,en;q=0.9'
    })
    return page

async def scroll_page_hook(page, context, **kwargs):
    """
    Scroll page to load lazy-loaded content
    """
    print("[Hook] Scrolling page to load lazy content...")
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    await page.wait_for_timeout(1000)
    await page.evaluate("window.scrollTo(0, 0)")
    await page.wait_for_timeout(500)
    return page

async def log_page_metrics_hook(page, context, **kwargs):
    """
    Log page metrics before extracting HTML
    """
    metrics = await page.evaluate('''
        () => ({
            images: document.images.length,
            links: document.links.length,
            scripts: document.scripts.length,
            title: document.title
        })
    ''')
    print(f"[Hook] Page Metrics - Title: {metrics['title']}")
    print(f"        Images: {metrics['images']}, Links: {metrics['links']}, Scripts: {metrics['scripts']}")
    return page

print("✅ Reusable hook library created!")
print("\n📚 Available hooks:")
print("  • block_images_hook - Speed optimization")
print("  • set_viewport_hook - Consistent rendering")
print("  • add_custom_headers_hook - Custom headers")
print("  • scroll_page_hook - Lazy content loading")
print("  • log_page_metrics_hook - Page analytics")

✅ Reusable hook library created!

📚 Available hooks:
  • block_images_hook - Speed optimization
  • set_viewport_hook - Consistent rendering
  • add_custom_headers_hook - Custom headers
  • scroll_page_hook - Lazy content loading
  • log_page_metrics_hook - Page analytics


### Using hooks_to_string() Utility

The new `hooks_to_string()` utility converts Python function objects to strings that can be sent to the Docker API:

In [12]:
# Convert functions to strings using the NEW utility
hooks_as_strings = hooks_to_string({
    "on_page_context_created": block_images_hook,
    "before_goto": add_custom_headers_hook,
    "before_retrieve_html": scroll_page_hook,
})

print(f"✅ Converted {len(hooks_as_strings)} hook functions to string format")
print("\n📝 Example of converted hook (first 200 chars):")
print(hooks_as_strings["on_page_context_created"][:200] + "...")

print("\n💡 Benefits of hooks_to_string():")
print("  ✓ Write hooks as Python functions (IDE support, type checking)")
print("  ✓ Automatically converts to string format for Docker API")
print("  ✓ Reusable across projects")
print("  ✓ Easy to test and debug")

✅ Converted 3 hook functions to string format

📝 Example of converted hook (first 200 chars):
async def block_images_hook(page, context, **kwargs):
    """
    Performance optimization: Block images to speed up crawling
    """
    print("[Hook] Blocking images for faster loading...")
    awai...

💡 Benefits of hooks_to_string():
  ✓ Write hooks as Python functions (IDE support, type checking)
  ✓ Automatically converts to string format for Docker API
  ✓ Reusable across projects
  ✓ Easy to test and debug


### 8 Available Hook Points

The Docker Hooks System provides 8 strategic points where you can inject custom behavior:

1. **on_browser_created** - Browser initialization
2. **on_page_context_created** - Page context setup
3. **on_user_agent_updated** - User agent configuration
4. **before_goto** - Pre-navigation setup
5. **after_goto** - Post-navigation processing
6. **on_execution_started** - JavaScript execution start
7. **before_retrieve_html** - Pre-extraction processing
8. **before_return_html** - Final HTML processing
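
Because hooks travel to the server as strings keyed by hook point, a small local check can catch typos before a request is sent. The sketch below is a hypothetical helper, not part of Crawl4AI: `check_hooks` and the rule that each hook must contain a top-level `async def` are assumptions based on the examples in this notebook.

```python
import ast

# The eight hook points listed above
HOOK_POINTS = {
    "on_browser_created", "on_page_context_created", "on_user_agent_updated",
    "before_goto", "after_goto", "on_execution_started",
    "before_retrieve_html", "before_return_html",
}


def check_hooks(hooks: dict[str, str]) -> list[str]:
    """Return a list of problems found; an empty list means the hooks look sane."""
    problems = []
    for name, source in hooks.items():
        if name not in HOOK_POINTS:
            problems.append(f"{name}: unknown hook point")
            continue
        try:
            tree = ast.parse(source)
        except SyntaxError as exc:
            problems.append(f"{name}: syntax error ({exc.msg})")
            continue
        # Assumption: a hook is a top-level async function
        if not any(isinstance(node, ast.AsyncFunctionDef) for node in tree.body):
            problems.append(f"{name}: no top-level async function found")
    return problems


good = {"before_goto": "async def hook(page, context, url, **kwargs):\n    return page\n"}
print(check_hooks(good))
print(check_hooks({"on_load": "async def hook(page, **kwargs):\n    return page\n"}))
```

The check only parses the source; it never executes it, so it is safe to run on untrusted hook strings.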

### Complete Docker Hooks Demo

**Note**: The standalone script **`v0.7.5_docker_hooks_demo.py`** provides comprehensive, runnable examples of the entire Docker Hooks System, including:
- String-based hooks with REST API
- `hooks_to_string()` utility usage
- Docker Client with automatic conversion
- Complete pipeline with all 8 hook points

---

## 🛠️ Feature 4: Bug Fixes Summary

### Major Fixes in v0.7.5

1. **URL Processing** - Fixed '+' sign preservation in query parameters
2. **Proxy Configuration** - Enhanced proxy string parsing (the old `proxy` parameter is deprecated in favor of `proxy_config`)
3. **Docker Error Handling** - Better error messages with status codes
4. **Memory Management** - Fixed leaks in long-running sessions
5. **JWT Authentication** - Fixed Docker JWT validation
6. **Playwright Stealth** - Fixed stealth features
7. **API Configuration** - Fixed config handling
8. **Deep Crawl Strategy** - Resolved JSON encoding errors
9. **LLM Provider Support** - Fixed custom provider integration
10. **Performance** - Resolved backoff strategy failures

### New Proxy Configuration Example

In [13]:
# OLD WAY (Deprecated)
# browser_config = BrowserConfig(proxy="http://proxy:8080")

# NEW WAY (v0.7.5)
browser_config_with_proxy = BrowserConfig(
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "optional-username",  # Optional
        "password": "optional-password"   # Optional
    }
)

print("✅ New proxy configuration format demonstrated")
print("\n📝 Benefits:")
print("  • More explicit and clear")
print("  • Better authentication support")
print("  • Consistent with industry standards")

✅ New proxy configuration format demonstrated

📝 Benefits:
  • More explicit and clear
  • Better authentication support
  • Consistent with industry standards
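
If existing code still carries proxies as a single URL string, a small conversion helper can ease the move to the dictionary format. The sketch below is a hypothetical helper (`proxy_string_to_config` is not a Crawl4AI function); it only assumes the `server`, `username`, and `password` keys shown in the cell above.

```python
from urllib.parse import urlsplit


def proxy_string_to_config(proxy: str) -> dict:
    """Split 'scheme://user:pass@host:port' into server/username/password keys."""
    parts = urlsplit(proxy)
    if not parts.hostname:
        raise ValueError(f"Cannot parse proxy string: {proxy!r}")
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    config = {"server": server}
    # Credentials are optional; only include them when present
    if parts.username:
        config["username"] = parts.username
    if parts.password:
        config["password"] = parts.password
    return config


print(proxy_string_to_config("http://user:secret@proxy.example.com:8080"))
```

The resulting dictionary can then be passed as `proxy_config=` in the same way as the hand-written one above.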


---

## 🎯 Complete Example: Combining Multiple Features

Let's create a real-world example that uses multiple v0.7.5 features together:

In [15]:
async def complete_demo():
    """
    Comprehensive demo combining multiple v0.7.5 features
    """
    print("🎯 Complete v0.7.5 Feature Demo\n")
    print("=" * 60)
    
    # Use function-based hooks (NEW Docker Hooks System)
    print("\n1️⃣ Using Docker Hooks System (NEW!)")
    hooks = {
        "on_page_context_created": set_viewport_hook,
        "before_goto": add_custom_headers_hook,
        "before_retrieve_html": log_page_metrics_hook
    }
    
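    # Convert to strings using the NEW utility. The strings are only prepared here
    # for the Docker API; they are not applied to the local AsyncWebCrawler run below.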
    hooks_strings = hooks_to_string(hooks)
    print(f"   ✓ Converted {len(hooks_strings)} hooks to string format")
    print("   ✓ Ready to send to Docker API")
    
    # Use HTTPS preservation
    print("\n2️⃣ Enabling HTTPS Preservation")
    url_filter = URLPatternFilter(
        patterns=[r"^(https:\/\/)?example\.com(\/.*)?$"]
    )
    
    config = CrawlerRunConfig(
        exclude_external_links=True,
        preserve_https_for_internal_links=True,  # v0.7.5 feature
        cache_mode=CacheMode.BYPASS,
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=1,
            max_pages=3,
            filter_chain=FilterChain([url_filter])
        )
    )
    print("   ✓ HTTPS preservation enabled")
    
    # Use new proxy config format
    print("\n3️⃣ Using New Proxy Configuration Format")
    browser_config = BrowserConfig(
        headless=True,
        # proxy_config={  # Uncomment if you have a proxy
        #     "server": "http://proxy:8080"
        # }
    )
    print("   ✓ New proxy config format ready")
    
    # Run the crawl
    print("\n4️⃣ Executing Crawl with All Features")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # With deep_crawl_strategy, returns a list
        results = await crawler.arun(
            url="https://example.com",
            config=config
        )
        
        if results:
            result = results[0]  # Get first result
            print("   ✓ Crawl successful!")
            print(f"\n📊 Results:")
            print(f"   • Pages crawled: {len(results)}")
            print(f"   • Title: {result.metadata.get('title', 'N/A')}")
            print(f"   • Content length: {len(result.markdown.raw_markdown)} characters")
            print(f"   • Links found: {len(result.links['internal']) + len(result.links['external'])}")
        else:
            print(f"   ⚠️ No results returned")
    
    print("\n" + "=" * 60)
    print("✅ Complete Feature Demo Finished!\n")

# Run complete demo
await complete_demo()

🎯 Complete v0.7.5 Feature Demo


1️⃣ Using Docker Hooks System (NEW!)
   ✓ Converted 3 hooks to string format
   ✓ Ready to send to Docker API

2️⃣ Enabling HTTPS Preservation
   ✓ HTTPS preservation enabled

3️⃣ Using New Proxy Configuration Format
   ✓ New proxy config format ready

4️⃣ Executing Crawl with All Features


   ✓ Crawl successful!

📊 Results:
   • Pages crawled: 1
   • Title: Example Domain
   • Content length: 119 characters
   • Links found: 0

✅ Complete Feature Demo Finished!



---

## 🎓 Summary

### What We Covered

✅ **HTTPS Preservation** - Maintain secure protocols throughout crawling  
✅ **Enhanced LLM Integration** - Custom temperature and provider configuration  
✅ **Docker Hooks System (NEW!)** - Complete pipeline customization with 3 approaches  
✅ **hooks_to_string() Utility (NEW!)** - Convert functions for Docker API  
✅ **Bug Fixes** - New proxy config and multiple improvements  

### Key Highlight: Docker Hooks System 🌟

The **Docker Hooks System** is completely NEW in v0.7.5. It offers:
- 8 strategic hook points in the pipeline
- 3 ways to use hooks (strings, utility, auto-conversion)
- Full control over crawling behavior
- Support for authentication, optimization, and custom processing

### Next Steps

1. **Docker Hooks Demo** - See `v0.7.5_docker_hooks_demo.py` for complete Docker Hooks examples
2. **Documentation** - Visit [docs.crawl4ai.com](https://docs.crawl4ai.com) for full reference
3. **Examples** - Check [GitHub examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples)
4. **Community** - Join [Discord](https://discord.gg/jP8KfhDhyN) for support

---

## 📚 Resources

- 📖 [Full Documentation](https://docs.crawl4ai.com)
- 🐙 [GitHub Repository](https://github.com/unclecode/crawl4ai)
- 📝 [Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md)
- 💬 [Discord Community](https://discord.gg/jP8KfhDhyN)
- 🐦 [Twitter](https://x.com/unclecode)

---

**Happy Crawling with v0.7.5! 🚀**