# Crawl4AI Cloud — Interactive Tutorial

Welcome to Crawl4AI Cloud. This notebook walks you through the platform's key features with **real, runnable examples**. Run each cell to see live results from the API.

**What you'll learn:**
- **One-Shot Extraction** — Generate a schema once with AI, then reuse it at no additional AI cost
- **LLM Extraction** — Extract structured data from unstructured pages
- **Sync & Batch Crawl** — Quick tests with immediate results
- **Async Crawl** — Scale to hundreds of URLs with job queues
- **Deep Crawl** — Discover and crawl entire sites
- **Advanced Configuration** — Content filtering, wait conditions, and more

---

### Setup

Replace `YOUR_API_KEY` below with your real key. Get one from [crawl4ai.com/keys](https://crawl4ai.com/keys).

In [None]:
!pip install -q crawl4ai-cloud-sdk

In [None]:
import json
from crawl4ai_cloud import AsyncWebCrawler, CrawlerRunConfig

API_KEY = "YOUR_API_KEY"  # <-- Replace with your key
crawler = AsyncWebCrawler(api_key=API_KEY)

---
## 1. One-Shot Extraction

Most web pages have repeating patterns — product listings, search results, article feeds. One-shot extraction lets you pull structured data from these pages using a schema-based approach:

1. **Generate a schema once with AI** — it analyzes the HTML and produces CSS selectors
2. **Apply the schema to any page** with the same structure — no AI costs, just fast pattern matching

This is the most cost-effective extraction method at scale. AI runs once, then you reuse the schema on every page that shares the same layout.

### 1a. Generate Schema

Point the schema generator at a page and describe what you want in natural language. AI analyzes the HTML and produces a set of CSS selectors.

In [None]:
schema = await crawler.generate_schema(
    urls=["https://books.toscrape.com"],
    query="Extract all book titles, prices, and ratings"
)
print(json.dumps(schema.schema, indent=2))

The result is a schema: a set of CSS selectors. AI figured out the page structure, but **applying the schema incurs no AI cost**. You can also edit the schema by hand to fine-tune selectors or add fields.
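
As a rough illustration of hand-editing, the sketch below assumes the generated schema follows the `baseSelector` / `fields` layout used by the open-source Crawl4AI `JsonCssExtractionStrategy`. The selectors and the `add_field` helper are hypothetical, so compare them with what `generate_schema()` actually returned for your page before relying on them.

```python
import copy
import json

# Hypothetical schema in the baseSelector/fields layout (an assumption;
# print your generated schema to see its real contents).
example_schema = {
    "name": "Books",
    "baseSelector": "article.product_pod",
    "fields": [
        {"name": "title", "selector": "h3 a", "type": "attribute", "attribute": "title"},
        {"name": "price", "selector": ".price_color", "type": "text"},
    ],
}


def add_field(schema, name, selector, field_type="text", **extra):
    """Return a copy of `schema` with one field added, or replaced if the name exists."""
    updated = copy.deepcopy(schema)
    field = {"name": name, "selector": selector, "type": field_type, **extra}
    # Drop any existing field with the same name so repeated edits do not duplicate it.
    updated["fields"] = [f for f in updated["fields"] if f["name"] != name]
    updated["fields"].append(field)
    return updated


edited = add_field(example_schema, "availability", ".instock.availability")
print(json.dumps(edited, indent=2))
```

Working on a copy keeps the generated schema intact, so you can fall back to it if a hand-written selector turns out to match nothing.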

### 1b. Apply Schema

Now use that schema to extract data. This is pure CSS pattern matching — fast and free of AI costs.

In [None]:
result = await crawler.run(
    "https://books.toscrape.com",
    config=CrawlerRunConfig(
        extraction_strategy={
            "type": "json_css",
            "schema": schema.to_dict()
        }
    )
)
extracted = result.extracted_content
if isinstance(extracted, str):
    extracted = json.loads(extracted)
print(f"Extracted {len(extracted)} items:")
print(json.dumps(extracted[:3], indent=2))
if len(extracted) > 3:
    print(f"... and {len(extracted) - 3} more")

Notice the speed: there is no AI call, just CSS selector matching. Next, apply the **same schema** to page 2.

### 1c. Reuse on Another Page

Same schema, different page — same speed. This is the power of one-shot extraction.

In [None]:
result2 = await crawler.run(
    "https://books.toscrape.com/catalogue/page-2.html",
    config=CrawlerRunConfig(
        extraction_strategy={
            "type": "json_css",
            "schema": schema.to_dict()
        }
    )
)
extracted2 = result2.extracted_content
if isinstance(extracted2, str):
    extracted2 = json.loads(extracted2)
print(f"Page 2 — extracted {len(extracted2)} items:")
print(json.dumps(extracted2[:3], indent=2))
if len(extracted2) > 3:
    print(f"... and {len(extracted2) - 3} more")

**One schema, unlimited pages.** When the site redesigns, regenerate the schema once. That's the one-shot pattern.
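
One way to notice that a redesign has broken a schema is to watch how often each field comes back empty. The sketch below is a heuristic under stated assumptions, not part of the SDK: `as_items` and `fill_rate` are hypothetical helper names, and the sample data stands in for `result.extracted_content`.

```python
import json


def as_items(content):
    """Normalize extracted content (JSON string, list, dict, or None) to a list."""
    if content is None:
        return []
    if isinstance(content, str):
        content = json.loads(content)
    return content if isinstance(content, list) else [content]


def fill_rate(items, field):
    """Fraction of items whose `field` is present and non-empty."""
    if not items:
        return 0.0
    filled = sum(1 for item in items if item.get(field) not in (None, "", []))
    return filled / len(items)


# Stand-in data; in the notebook you would pass result.extracted_content.
sample = '[{"title": "A", "price": "1.00"}, {"title": "B", "price": ""}]'
items = as_items(sample)
for field in ("title", "price"):
    print(f"{field}: {fill_rate(items, field):.0%} filled")
```

If a field's fill rate drops sharply between runs, that is a reasonable signal to regenerate the schema; the cutoff you pick is a judgment call for your own data.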

---
## 2. LLM Extraction

Schema-based extraction handles repeating patterns. But what about unstructured content — a recipe's ingredients, an article's key takeaways, specific facts from prose?

LLM extraction works here. AI reads the page and extracts exactly what you ask for.

You can use our managed AI model or bring your own API key. We're continuously fine-tuning a specialized model for schema generation and JSON extraction — this is an active area of development.

**When to use which:**
- **Schema-based (one-shot)** — catalogs, listings, repeating patterns (fast, no AI cost at scale)
- **LLM extraction** — individual pages with unique, unstructured content

In [None]:
llm_result = await crawler.run(
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    config=CrawlerRunConfig(
        extraction_strategy={
            "type": "llm",
            "instruction": "Extract the book title, price, description, and availability status"
        }
    )
)
llm_content = llm_result.extracted_content
if isinstance(llm_content, str):
    try:
        llm_content = json.loads(llm_content)
    except (json.JSONDecodeError, TypeError):
        pass
print(json.dumps(llm_content, indent=2) if isinstance(llm_content, (dict, list)) else llm_content)

---
## 3. Sync & Batch Crawl

Everything above uses crawling under the hood. Sync and batch crawls are your tools for **quick tests and immediate results** — up to 10 URLs at a time.

They're great for experimentation before moving to async or deep crawl for real workloads.

### 3a. Sync Crawl

Crawl a single URL and get markdown, metadata, and links back in the same call.

In [None]:
sync = await crawler.run("https://example.com")
print("Markdown output:")
print(sync.markdown.raw_markdown[:500])
print(f"\nExternal links found: {len(sync.links.get('external', []))}")

### 3b. Batch Crawl

Crawl multiple URLs in one request using `run_many()`. With `wait=True`, results come back together when all URLs are done.

In [None]:
job = await crawler.run_many(
    urls=[
        "https://example.com",
        "https://httpbin.org/html",
        "https://books.toscrape.com"
    ],
    wait=True
)
print(f"Completed: {job.progress.completed}/{job.progress.total}")
print(f"Failed: {job.progress.failed}")

Sync and batch are great for experimentation. For real workloads — dozens to thousands of URLs — use async crawling or deep crawl.

---
## 4. Async Crawl

Async crawl handles **real-world scale**. Submit a job with up to 100 URLs, get a job ID back immediately, then poll for results. No connection timeouts, no waiting.
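
If you have more URLs than one job accepts, a simple approach is to split the list into batches and submit one job per batch. The sketch below is plain Python with no SDK calls; `chunked` is a hypothetical helper and the URL list is made up to show the batch sizes.

```python
def chunked(urls, size):
    """Yield consecutive slices of `urls`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Hypothetical URL list, only to show the shape of the batches.
urls = [f"https://example.com/page-{i}" for i in range(250)]
batches = list(chunked(urls, 100))
print([len(batch) for batch in batches])  # [100, 100, 50]
```

Each batch can then be passed to `run_many()` in the same way as the single job in the next cell.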

### 4a. Submit a Job

Without `wait=True`, `run_many()` returns immediately with a job ID. The crawling happens in the background.

In [None]:
job = await crawler.run_many(
    urls=[
        "https://example.com",
        "https://books.toscrape.com",
        "https://httpbin.org/html",
        "https://quotes.toscrape.com",
        "https://webscraper.io/test-sites/e-commerce/allinone"
    ]
)
print(f"Job ID: {job.id}")
print(f"Status: {job.status}")
print(f"URLs: {job.urls_count}")

### 4b. Poll for Results

Use `wait_job()` to poll until the job completes. You can also use `get_job()` for a single status check, or set up a webhook for push notifications.
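
To make the roles of `poll_interval` and `timeout` concrete, here is a minimal sketch of a generic polling loop. It is not the SDK's implementation: `poll_until_done`, the fake job, and the status strings `"running"`, `"completed"`, and `"failed"` are all assumptions for illustration.

```python
import asyncio
import time


async def poll_until_done(check_status, poll_interval=2.0, timeout=300.0):
    """Call `check_status()` until it reports a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = await check_status()
        if status in ("completed", "failed"):  # assumed terminal statuses
            return status
        # Stop if the next sleep would carry us past the deadline.
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        await asyncio.sleep(poll_interval)


# Fake job that reports "running" twice and then "completed".
_statuses = iter(["running", "running", "completed"])


async def fake_status():
    return next(_statuses)


# In a notebook cell, use `await poll_until_done(...)` instead of asyncio.run().
final = asyncio.run(poll_until_done(fake_status, poll_interval=0.01, timeout=1.0))
print(final)
```

A shorter interval returns sooner after the job finishes but makes more status requests; the timeout bounds how long your code waits, not how long the job itself may run.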

In [None]:
completed = await crawler.wait_job(job.id, poll_interval=2.0, timeout=300)
print(f"Status: {completed.status}")
print(f"Progress: {completed.progress.completed}/{completed.progress.total}")
print(f"Failed: {completed.progress.failed}")

if completed.is_complete:
    url = await crawler.download_url(completed.id)
    print(f"\nDownload results: {url[:80]}...")

---
## 5. Deep Crawl

Deep crawl is how you crawl **an entire site**. It works in two stages:

1. **Scan** — discovers all URLs using sitemaps, link analysis, and site structure
2. **Extract** — crawls the discovered pages with your configuration

Most tools discover pages one at a time through link traversal. Our scan phase finds all URLs upfront — you see the full map before crawling a single page.

### Scanning Strategies

- **`map`** (sitemap-based) — reads sitemaps and available internet data. Fast and reliable for established, well-indexed sites.
- **`bfs`** (breadth-first) — traverses links level by level. Use when sitemaps aren't available (new sites, poor indexing).
- **`dfs`** (depth-first) — follows links deep before going wide.
- **`best_first`** — priority-based with keyword scoring.
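
The difference between breadth-first and depth-first discovery is easiest to see on a toy link graph. The sketch below is an in-memory illustration only: the `LINKS` graph and the `discover` function are hypothetical, and the platform's scanner may behave differently in detail.

```python
from collections import deque

# Toy link graph: page -> pages it links to (hypothetical site).
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": [],
    "/a/2": [],
    "/b/1": [],
}


def discover(start, strategy="bfs", max_depth=2, max_urls=20):
    """Return pages in discovery order for a breadth-first or depth-first walk."""
    frontier = deque([(start, 0)])
    seen = set()
    order = []
    while frontier and len(order) < max_urls:
        # BFS takes the oldest entry (queue); DFS takes the newest (stack).
        page, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        children = LINKS.get(page, [])
        # Reverse for DFS so the first link on the page is explored first.
        for child in (children if strategy == "bfs" else reversed(children)):
            if child not in seen:
                frontier.append((child, depth + 1))
    return order


print(discover("/", "bfs"))  # ['/', '/a', '/b', '/a/1', '/a/2', '/b/1']
print(discover("/", "dfs"))  # ['/', '/a', '/a/1', '/a/2', '/b', '/b/1']
```

With a small `max_urls`, the two orders return different subsets: breadth-first favors pages near the start URL, while depth-first may spend the whole budget inside one branch.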

You only need to scan once. The URL map is cached. Re-extract anytime without re-scanning.

### 5a. Scan with Map Strategy

In [None]:
scan = await crawler.deep_crawl(
    "https://docs.crawl4ai.com",
    strategy="map",
    scan_only=True,
    wait=True,
    timeout=120
)
print(f"Discovered {scan.discovered_count} URLs")
print(f"\nSample URLs:")
if scan.urls:
    for u in scan.urls[:10]:
        print(f"  {u.url}")
    if len(scan.urls) > 10:
        print(f"  ... and {len(scan.urls) - 10} more")

### 5b. Scan with BFS (Link Traversal)

For sites without good sitemaps, use BFS. It traverses links level by level to discover pages. Slower than sitemap-based scanning but works universally.

> **Tip:** You can combine `include_patterns` and `exclude_patterns` to filter which URLs get discovered.
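
The include-then-exclude logic can be sketched client-side, for example to filter a list of URLs a scan has already returned. The pattern syntax the API accepts is not specified in this tutorial, so the glob-style patterns below are an assumption, and `filter_urls` is a hypothetical helper rather than an SDK function.

```python
from fnmatch import fnmatchcase


def filter_urls(urls, include_patterns=None, exclude_patterns=None):
    """Keep URLs that match any include pattern and no exclude pattern."""
    kept = []
    for url in urls:
        # With no include patterns, every URL is a candidate.
        included = not include_patterns or any(
            fnmatchcase(url, p) for p in include_patterns
        )
        excluded = any(fnmatchcase(url, p) for p in (exclude_patterns or []))
        if included and not excluded:
            kept.append(url)
    return kept


urls = [
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
    "https://books.toscrape.com/index.html",
]
kept = filter_urls(
    urls,
    include_patterns=["*/catalogue/*"],
    exclude_patterns=["*/category/*"],
)
print(kept)
```

Here exclusion wins over inclusion, which is a common convention but also an assumption about how the two options interact.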

In [None]:
bfs_scan = await crawler.deep_crawl(
    "https://books.toscrape.com",
    strategy="bfs",
    max_depth=2,
    max_urls=20,
    scan_only=True,
    wait=True,
    timeout=120
)
print(f"BFS discovered {bfs_scan.discovered_count} URLs (max_depth=2, max_urls=20)")
if bfs_scan.urls:
    for u in bfs_scan.urls[:5]:
        print(f"  {u.url}")

---
## 6. Advanced Configuration

Every crawl accepts a `CrawlerRunConfig` — the same configuration object from the open-source [Crawl4AI](https://github.com/unclecode/crawl4ai) library. It controls:

- **JavaScript execution** — run JS code, wait for elements, click buttons
- **Wait conditions** — wait for network idle, specific selectors, or timeouts
- **Content filtering** — BM25 relevance scoring to return only relevant sections
- **Screenshots & PDFs** — capture visual snapshots
- **Element removal** — strip overlays, popups, cookie banners

Here's an example combining crawling with content filtering. The `content_filter` uses BM25 to return only sections relevant to your query — no AI needed, great for building focused context.
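
For intuition about what a BM25 filter does before you run the live cell, the sketch below scores three made-up sections against a query with the standard Okapi BM25 formula and keeps those above a threshold. The tokenizer, the `k1` and `b` defaults, the sample text, and the way the threshold is applied are all assumptions; the platform's own scoring and threshold scale may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, sections, k1=1.5, b=0.75):
    """Score each section against the query with the Okapi BM25 formula."""
    if not sections:
        return []
    docs = [tokenize(s) for s in sections]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of sections containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Longer-than-average sections are penalized through `b`.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


sections = [
    "Authentication setup: create an API key and pass it to the client.",
    "Deep crawl discovers URLs using sitemaps and link traversal.",
    "Setup takes a minute: install the SDK with pip.",
]
scores = bm25_scores("authentication setup", sections)
kept = [s for s, score in zip(sections, scores) if score >= 1.0]
print(kept)
```

Sections that share no terms with the query score zero, which is why a narrow query or a high threshold can leave nothing behind.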

> **Tip:** Full `CrawlerRunConfig` reference at [docs.crawl4ai.com](https://docs.crawl4ai.com). Join the [Discord](https://discord.gg/crawl4ai) for help.

In [None]:
adv = await crawler.run(
    "https://docs.crawl4ai.com",
    config=CrawlerRunConfig(
        wait_until="networkidle",
        page_timeout=30000,
        remove_overlay_elements=True,
        content_filter={
            "type": "bm25",
            "query": "authentication setup",
            "threshold": 1.0
        }
    )
)
md = adv.markdown.raw_markdown if adv.markdown else ""
if md:
    print(f"Filtered markdown ({len(md)} chars — only authentication-related sections):")
    print(md[:1000])
else:
    print("No sections matched the filter (try a broader query or lower threshold)")

---
## Next Steps

You've seen the core features. Here's where to go from here:

- **API Reference** — Full endpoint documentation at [docs.crawl4ai.com](https://docs.crawl4ai.com)
- **SDKs** — [Python](https://github.com/unclecode/crawl4ai) · [Node.js](https://github.com/unclecode/crawl4ai-js) · [Go](https://github.com/unclecode/crawl4ai-go)
- **Community** — Join us on [Discord](https://discord.gg/crawl4ai); we're active and happy to help
- **Open Source** — [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai) — 60k+ stars

### Quick Reference

| Method | Use Case |
|--------|----------|
| `crawler.run(url)` | Single URL, immediate result |
| `crawler.run_many(urls, wait=True)` | Multiple URLs, wait for all |
| `crawler.run_many(urls)` | Background job, returns a job ID; poll with `wait_job()` |
| `crawler.deep_crawl(url, strategy="map")` | Discover all URLs on a site |
| `crawler.generate_schema(urls, query)` | Generate CSS extraction schema with AI |
| `CrawlerRunConfig(extraction_strategy={...})` | Configure extraction, filtering, JS, etc. |

In [None]:
await crawler.close()
print("Done! All examples completed.")