## Using models
    - Two examples
    - The first uses a local model
    - The second uses a Gemini model

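### 1. Using Ollama's llama3.2
    - **A crucial point: get the order wrong and no data comes out**
        - Build the schema dictionary first
        - Then create the JsonCssExtractionStrategy
        - Then create the CrawlerRunConfig
        - Call arun(config=<the CrawlerRunConfig instance>)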

In [34]:
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import LLMConfig, AsyncWebCrawler, CacheMode, CrawlerRunConfig
import json
from pprint import pprint

# Generate a schema (one-time cost)
#html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</span></div>"
html = "<div class='item'><h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a></div>"

# Using Ollama (open source, no token needed)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config = LLMConfig(provider="ollama/llama3.2", api_token=None)  # Not needed for Ollama
)

# Use the schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(schema)

# Very important: you must create a CrawlerRunConfig instance
# and pass the strategy with the extraction_strategy keyword argument;
# otherwise result.extracted_content will be None

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    extraction_strategy=strategy
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url = f"raw://{html}",
        config=config
    )

    print("===== Schema generated by llama3.2 =====")
    pprint(schema)
    data = json.loads(result.extracted_content)
    print("===== Extracted result =====")
    pprint(data)


{'baseFields': [{'attribute': 'href',
                 'name': 'data_href',
                 'type': 'attribute'}],
 'baseSelector': '.item',
 'fields': [{'name': 'title', 'selector': 'h2', 'type': 'text'},
            {'attribute': 'href',
             'name': 'link',
             'selector': 'a',
             'type': 'attribute'}],
 'name': 'Item List'}
[{'link': 'https://example.com/item1', 'title': 'Item 1'}]
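The comments in the cell above note that `result.extracted_content` is `None` when the strategy is not passed through `CrawlerRunConfig`, in which case `json.loads` would raise a `TypeError`. A minimal sketch of a defensive parser follows; the helper name `parse_extracted` is hypothetical and not part of crawl4ai.

```python
import json


def parse_extracted(content):
    """Return extracted rows as a list, or [] when there is nothing to parse."""
    # content is None (or empty) when no extraction result is available,
    # e.g. when the extraction_strategy argument was not passed.
    if not content:
        return []
    data = json.loads(content)
    # Wrap a lone dict so callers can always iterate over a list.
    return data if isinstance(data, list) else [data]


print(parse_extracted(None))
print(parse_extracted('[{"link": "https://example.com/item1", "title": "Item 1"}]'))
```

In the cell above, `data = parse_extracted(result.extracted_content)` could then stand in for the direct `json.loads` call, so that a misconfigured run yields an empty list rather than an exception.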


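### 2. Using Gemini (be careful with the API key; do not upload it to GitHub)
    - **A crucial point: get the order wrong and no data comes out**
        - Build the schema dictionary first
        - Then create the JsonCssExtractionStrategy
        - Then create the CrawlerRunConfig
        - Call arun(config=<the CrawlerRunConfig instance>)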
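One common way to keep the key out of the notebook, and therefore out of GitHub, is to read it from an environment variable. The sketch below assumes a variable named `GEMINI_API_KEY`; both that name and the helper `get_api_key` are illustrative choices, not something this notebook or crawl4ai prescribes.

```python
import os


def get_api_key(var_name="GEMINI_API_KEY"):
    """Read an API key from the environment instead of hard-coding it."""
    key = os.environ.get(var_name)
    if not key:
        # Fail early with a clear message rather than sending an empty token.
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Possible usage (assumed, mirroring the LLMConfig call in the next cell):
# llm_config = LLMConfig(provider="gemini/gemini-2.5-flash", api_token=get_api_key())
```

The key then lives in the shell or in an untracked local file, and the notebook can be committed safely.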

In [31]:
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import LLMConfig, AsyncWebCrawler, CacheMode, CrawlerRunConfig
import json
from pprint import pprint

# Generate a schema (one-time cost)
html = """
<html>
    <body>
        <div class='product'>
            <h2>Gaming Laptop</h2>
            <span class='price'>$999.99</span>
        </div>
    </body>
</html>
"""

# Using Gemini (requires an API key)

schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config = LLMConfig(
        provider="gemini/gemini-2.5-flash",
        api_token="gemini api key")  # Required for Gemini; never commit a real key
)

# Manually written schema
# schema = {
#     'name': 'Product Details',
#     'baseSelector': '.product',
#     'fields': [
#         {'name': 'title', 
#          'selector': 'h2', 
#          'type': 'text'},
#         {'name': 'price',
#          'selector': '.price',
#          'type': 'text'}]
# }

# Use the schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(schema,verbose=True)

# 3. Configure the crawler run
config = CrawlerRunConfig(
    cache_mode = CacheMode.BYPASS,
    extraction_strategy=strategy
)
async with AsyncWebCrawler() as crawler:
    raw_url = f"raw://{html}"
    result = await crawler.arun(
        url = raw_url,
        config=config
    )
    print("====== Schema generated by Gemini ======")
    print(schema)
    print("======= Extracted result =======")
    data = json.loads(result.extracted_content)
    print(data)

{'name': 'Product Details', 'baseSelector': '.product', 'baseFields': [], 'fields': [{'name': 'title', 'selector': 'h2', 'type': 'text'}, {'name': 'price', 'selector': '.price', 'type': 'text'}]}
[{'title': 'Gaming Laptop', 'price': '$999.99'}]
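Both cells describe schema generation as a one-time cost, yet re-running a cell calls the LLM again. A minimal sketch of caching the generated schema on disk follows; the helper `load_or_generate_schema` and the file name in the usage comment are hypothetical, and `generate` is any zero-argument callable that returns the schema dictionary.

```python
import json
from pathlib import Path


def load_or_generate_schema(path, generate):
    """Load a cached schema from `path`, or call `generate()` once and save it."""
    path = Path(path)
    if path.exists():
        # Cache hit: reuse the saved schema without calling the LLM.
        return json.loads(path.read_text(encoding="utf-8"))
    schema = generate()
    # Cache miss: persist the schema so later runs skip generation.
    path.write_text(json.dumps(schema, ensure_ascii=False, indent=2), encoding="utf-8")
    return schema


# Possible usage (assumed, reusing the call from the cells above):
# schema = load_or_generate_schema(
#     "product_schema.json",
#     lambda: JsonCssExtractionStrategy.generate_schema(html, llm_config=llm_config),
# )
```

A saved file also makes it easy to inspect and hand-correct a schema when the model produces unexpected selectors.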
