# Supacrawler Python SDK - Parse Examples

This notebook demonstrates how to use the Parse API to extract structured data from web pages using natural-language prompts.


In [2]:
# Setup and imports
import os
from dotenv import load_dotenv
from supacrawler import SupacrawlerClient

load_dotenv()

# Initialize client - uses local engine if no API key provided
SUPACRAWLER_API_KEY = os.environ.get("SUPACRAWLER_API_KEY")
client = SupacrawlerClient(api_key=SUPACRAWLER_API_KEY)

print("Parse API client initialized!")
print(f"Using {'hosted API' if SUPACRAWLER_API_KEY else 'local engine'}")


Parse API client initialized!
Using hosted API


## Basic Parse Examples

The Parse API understands natural language prompts and automatically decides whether to scrape single pages or crawl multiple pages.


In [6]:
# Extract blog article metadata from a single page
response = client.parse("""Parse blog articles from https://supacrawler.com/blog and return JSON with:
  - Title
  - Published date
  - Summary (2 sentences)
  - Tags
  - Reading time""")

print("Response:")
print(f"Job enqueued: {response.success}")
print(f"Job ID: {response.job_id}")

Response:
Job enqueued: True
Job ID: 5ecf3b7c-a461-4647-a27b-3c06c48206f2


In [8]:
parse_out = client.wait_for_parse(response.job_id)
parse_out

ParseJob(success=True, job_id='5ecf3b7c-a461-4647-a27b-3c06c48206f2', status='completed', type='parse', results={'parse_result': {'data': [{'published_date': 'August 28, 2025', 'reading_time': '10 minutes', 'summary': 'Retrieval-Augmented Generation (RAG) is a powerful technique that enhances Large Language Models (LLMs) by providing them with up-to-date, external knowledge. This guide provides a complete, end-to-end walkthrough of how to crawl an entire website to build a knowledge base, chunk the crawled content, embed the chunks into vectors, and store and query those vectors in a PostgreSQL database using the `pgvector` extension.', 'tags': 'rag, ai, llm, pgvector, supabase, langchain, llamaindex', 'title': 'Building a Production-Ready RAG Pipeline with Supacrawler and pgvector'}, {'published_date': 'August 26, 2025', 'reading_time': '5 minutes', 'summary': "Watching websites for changes is a perfect use case for automation. Instead of manually checking or building brittle polling 
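
The call to `wait_for_parse` blocks until the job reaches a final state. How the SDK does this internally is not shown in this notebook; the sketch below only illustrates the general polling pattern. `poll_until_done`, `fetch_status`, and the `"processing"` and `"failed"` status names are assumptions made for illustration and are not part of Supacrawler. Only `"completed"` appears in the output above.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], str],
    interval_seconds: float = 1.0,
    timeout_seconds: float = 60.0,
) -> str:
    """Call fetch_status repeatedly until it reports a terminal state.

    fetch_status is a hypothetical callable standing in for whatever
    request looks up the job; it is not a Supacrawler function.
    """
    # "completed" is taken from the output above; "failed" is an assumed name.
    terminal_states = {"completed", "failed"}
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_seconds}s")
        time.sleep(interval_seconds)


# Simulated job that reports "processing" twice before completing.
_fake_statuses = iter(["processing", "processing", "completed"])
final_status = poll_until_done(lambda: next(_fake_statuses), interval_seconds=0.01)
print(final_status)
```

A timeout keeps a stuck job from blocking the notebook indefinitely, which is worth having in any loop of this kind.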

In [9]:
parse_out.data

{'parse_result': {'data': [{'published_date': 'August 28, 2025',
    'reading_time': '10 minutes',
    'summary': 'Retrieval-Augmented Generation (RAG) is a powerful technique that enhances Large Language Models (LLMs) by providing them with up-to-date, external knowledge. This guide provides a complete, end-to-end walkthrough of how to crawl an entire website to build a knowledge base, chunk the crawled content, embed the chunks into vectors, and store and query those vectors in a PostgreSQL database using the `pgvector` extension.',
    'tags': 'rag, ai, llm, pgvector, supabase, langchain, llamaindex',
    'title': 'Building a Production-Ready RAG Pipeline with Supacrawler and pgvector'},
   {'published_date': 'August 26, 2025',
    'reading_time': '5 minutes',
    'summary': "Watching websites for changes is a perfect use case for automation. Instead of manually checking or building brittle polling scripts, use Supacrawler's Watch API to schedule automatic checks, detect content cha
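
The records sit two levels deep, under `parse_result` and then `data`, and `tags` comes back as a single comma-separated string. The sketch below shows one way to flatten that into a list of article dicts with `tags` split into a list. It runs on a trimmed copy of the first record above rather than on a live response, and `extract_articles` is a hypothetical helper, not an SDK function.

```python
from typing import Any

# Trimmed copy of the first record from the output above.
sample_results: dict[str, Any] = {
    "parse_result": {
        "data": [
            {
                "published_date": "August 28, 2025",
                "reading_time": "10 minutes",
                "tags": "rag, ai, llm, pgvector, supabase, langchain, llamaindex",
                "title": "Building a Production-Ready RAG Pipeline with Supacrawler and pgvector",
            },
        ]
    }
}


def extract_articles(results: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the article records with `tags` split into a list of strings."""
    records = results.get("parse_result", {}).get("data", [])
    articles = []
    for record in records:
        article = dict(record)  # copy so the original result is left untouched
        raw_tags = article.get("tags", "")
        if isinstance(raw_tags, str):
            article["tags"] = [t.strip() for t in raw_tags.split(",") if t.strip()]
        articles.append(article)
    return articles


articles = extract_articles(sample_results)
for article in articles:
    print(article["title"], article["tags"])
```

Since the prompt did not pin down a schema, field names and types may vary between runs, which is why the helper uses `.get` with defaults instead of direct indexing.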

## Parse with JSON Schema

Pass a JSON schema to constrain the shape of the extracted data:


In [None]:
# Parse with structured schema
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "content": {"type": "string"},
        "links": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["title", "content"]
}

response = client.parse(
    "Extract the page title, main content, and any links from https://httpbin.org/html",
    schema=schema,
    output_format="json"
)

print("Structured extraction:")
print(f"Success: {response.success}")
if response.has_data:
    import json
    print("Extracted data:")
    if isinstance(response.data, dict):
        print(json.dumps(response.data, indent=2))
    else:
        print(response.data)
elif response.error:
    print(f"Error: {response.error}")
else:
    print("No data available")
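
Whether the returned data is guaranteed to match the schema is not shown in this notebook, so it can be worth checking the result before using it. The sketch below is a deliberately small, standard-library check covering only `required` and top-level `type` for the types used above. `check_against_schema` and both sample payloads are hypothetical; a full validator such as the `jsonschema` package would cover much more of the specification.

```python
from typing import Any

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "content": {"type": "string"},
        "links": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "content"],
}

# Only the JSON types that appear in the schema above are mapped here.
_JSON_TYPES = {"string": str, "array": list, "object": dict}


def check_against_schema(data: Any, schema: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key not in data:
            continue
        expected = _JSON_TYPES.get(spec.get("type"))
        if expected is not None and not isinstance(data[key], expected):
            problems.append(f"field {key!r} is not of type {spec['type']}")
    return problems


# Hypothetical payloads, not real API output.
good = {"title": "Example", "content": "Body text", "links": ["https://example.com"]}
bad = {"title": "Example", "links": "https://example.com"}

print(check_against_schema(good, schema))
print(check_against_schema(bad, schema))
```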


## Different Output Formats

The `output_format` parameter selects how results are returned. The examples in this notebook use `"json"`, `"csv"`, and `"markdown"`:


In [None]:
# CSV format example
csv_response = client.parse(
    "Extract any information from https://httpbin.org/html in a table format",
    output_format="csv"
)

print("CSV Output:")
print(f"Success: {csv_response.success}")
if csv_response.has_data:
    print("CSV Data:")
    print(csv_response.data)
elif csv_response.error:
    print(f"Error: {csv_response.error}")

print("\n" + "="*50 + "\n")

# Markdown format example
md_response = client.parse(
    "Summarize the content from https://httpbin.org/html",
    output_format="markdown"
)

print("Markdown Output:")
print(f"Success: {md_response.success}")
if md_response.has_data:
    print("Markdown Data:")
    print(md_response.data)
elif md_response.error:
    print(f"Error: {md_response.error}")
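
If the CSV result arrives as a plain string (an assumption, since the cell above has not been executed here), the standard-library `csv` module can turn it into rows for further processing. The sketch below uses a made-up payload; the real columns depend on the page and the prompt, and `csv_to_rows` is a hypothetical helper.

```python
import csv
import io

# Hypothetical CSV payload standing in for a CSV-format result.
csv_text = "title,author\nExample Title,Jane Doe\n"


def csv_to_rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text with a header row into one dict per data row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = csv_to_rows(csv_text)
for row in rows:
    print(row)
```

Using `csv.DictReader` rather than splitting on commas by hand keeps quoted fields that contain commas intact.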
