# Crawl a Blog Post with Firecrawl’s Python SDK (Markdown/HTML/JSON)

This notebook scrapes a single URL with Firecrawl, saves the result in several formats (Markdown, HTML, JSON), and shows how to access common `Document` fields.

**Target URL:**
- https://blog1.neuralengineer.org/openai-moderation-api-multimodal-llm-with-omni-moderation-latest-text-image-63b42d5f57a7

> Note: Make sure you have permission to crawl the target site and that you follow its terms/robots policies.


## Setup

Requirements:
- Python 3.10+ (the code uses `X | None` type annotations, which raise `TypeError` on Python 3.9)
- A Firecrawl API key available as `FIRECRAWL_API_KEY` (loaded from `.env`)

Install dependencies:
- `pip install firecrawl python-dotenv beautifulsoup4 markdownify`
- (optional) `pip install dicttoxml` for XML export

Create a `.env` file in your project root:

```bash
FIRECRAWL_API_KEY="your_api_key_here"
```


In [1]:
# Optional: install packages (skip this cell if they are already installed)
%pip install firecrawl python-dotenv beautifulsoup4 markdownify dicttoxml


Note: you may need to restart the kernel to use updated packages.


In [3]:
import os
from dotenv import load_dotenv
from firecrawl import Firecrawl

URL = "https://blog1.neuralengineer.org/openai-moderation-api-multimodal-llm-with-omni-moderation-latest-text-image-63b42d5f57a7"

load_dotenv(".env")  # reads .env into environment

api_key = os.environ.get("FIRECRAWL_API_KEY")
if not api_key:
    raise RuntimeError("Missing FIRECRAWL_API_KEY in environment (create .env or export it)")

app = Firecrawl(api_key=api_key)


## Crawl a Specific URL

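`app.scrape` fetches a single page and can return several output formats in one call. `only_main_content=True` asks Firecrawl to drop navigation, headers, and footers, which is a good first pass at reducing boilerplate.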


In [None]:
result = app.scrape(
    URL,
    formats=["markdown", "html", "links", "images"],
    only_main_content=True,
)

type(result), result.model_dump(exclude_none=True).keys()
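
Before post-processing, it can help to confirm that each requested format actually came back. The helper below is a minimal sketch: `summarize_document` is a hypothetical name, and it assumes only that the returned object exposes the requested formats as attributes (`markdown`, `html`, `links`, `images`). A stand-in object is used so the snippet runs without an API call.

```python
from types import SimpleNamespace


def summarize_document(doc) -> dict:
    """Return the length of each requested field, or None if it is missing."""
    summary = {}
    for field in ("markdown", "html", "links", "images"):
        value = getattr(doc, field, None)
        summary[field] = len(value) if value is not None else None
    return summary


# Stand-in for illustration only; in the notebook you would pass `result`.
sample = SimpleNamespace(
    markdown="# Title",
    html="<h1>Title</h1>",
    links=["https://example.com"],
    images=None,
)
print(summarize_document(sample))
```

A `None` entry means the format was not returned, which is worth catching before the HTML is handed to the extraction helpers.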


## Post-processing: Save Only the Primary Article Content

Even with `only_main_content=True`, post-processing the returned HTML can help you reliably isolate the article element across sites.


In [None]:
from bs4 import BeautifulSoup
from markdownify import markdownify as md
import html as html_lib
import json
import re

def extract_primary_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.select("script, style, noscript, header, footer, nav, aside"):
        tag.decompose()

    primary = soup.find("article") or soup.find("main") or soup.body
    return str(primary) if primary else html

def _get_meta(soup: BeautifulSoup, *, name: str | None = None, prop: str | None = None) -> str | None:
    attrs: dict[str, str] = {}
    if name:
        attrs["name"] = name
    if prop:
        attrs["property"] = prop
    tag = soup.find("meta", attrs=attrs)
    return tag.get("content") if tag else None

def extract_metadata(page_html: str) -> dict:
    soup = BeautifulSoup(page_html, "html.parser")
    meta: dict[str, object] = {
        "author": _get_meta(soup, name="author") or _get_meta(soup, prop="article:author"),
        "published_date": _get_meta(soup, prop="article:published_time"),
        "thumbnail_image": _get_meta(soup, prop="og:image"),
        "featured": None,
        "reading_time_minutes": None,
    }

    for script in soup.find_all("script", attrs={"type": "application/ld+json"}):
        try:
            data = json.loads(script.get_text(strip=True) or "null")
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            t = item.get("@type")
            if t in {"Article", "BlogPosting", "NewsArticle"}:
                author = item.get("author")
                if isinstance(author, dict):
                    meta["author"] = meta["author"] or author.get("name")
                meta["published_date"] = meta["published_date"] or item.get("datePublished")
                img = item.get("image")
                if isinstance(img, str):
                    meta["thumbnail_image"] = meta["thumbnail_image"] or img
                elif isinstance(img, list) and img and isinstance(img[0], str):
                    meta["thumbnail_image"] = meta["thumbnail_image"] or img[0]

    text_head = soup.get_text(" ", strip=True)[:5000]
    m = re.search(r"(\d+)\s*min\s*read", text_head, re.IGNORECASE)
    meta["reading_time_minutes"] = int(m.group(1)) if m else None
    meta["featured"] = True if re.search(r"\bFeatured\b", text_head) else None

    time_tag = soup.find("time")
    if time_tag:
        meta["published_date"] = meta["published_date"] or time_tag.get("datetime") or time_tag.get_text(" ", strip=True)

    if not meta["author"]:
        byline = soup.find("a", href=re.compile(r"byline", re.IGNORECASE))
        if byline:
            meta["author"] = byline.get_text(" ", strip=True) or None
    if not meta["thumbnail_image"]:
        img = soup.find("img", src=re.compile(r"^https?://"))
        if img:
            meta["thumbnail_image"] = img.get("src")
    return meta

def clean_primary_html(primary_html: str) -> tuple[dict, str]:
    soup = BeautifulSoup(primary_html, "html.parser")
    container = soup.find("article") or soup.find("main") or soup

    top_children = [c for c in container.children if getattr(c, "name", None)]

    def remove_subscribe_popup(root) -> None:
        pattern = re.compile(r"stories in\s+your\s+inbox|join medium for free", re.IGNORECASE)
        for tag in list(root.find_all(["div", "section", "form"], recursive=True)):
            text = tag.get_text(" ", strip=True)
            if not text:
                continue
            if pattern.search(text) and "subscribe" in text.lower():
                tag.decompose()

    remove_subscribe_popup(container)

    if len(top_children) == 1 and len(top_children[0].find_all("p")) >= 5:
        container = top_children[0]

    title_tag = container.find("h1")
    title = title_tag.get_text(" ", strip=True) if title_tag else None
    if title_tag:
        title_tag.decompose()

    for tag in container.select("header"):
        tag.decompose()

    def is_boilerplate(text: str) -> bool:
        t = text.strip()
        if not t:
            return True
        # Medium-style action labels, matched on word boundaries
        if re.search(r"\bFollow\b|\bListen\b|\bShare\b", t):
            return True
        if re.search(r"\bFeatured\b", t):
            return True
        if re.search(r"\d+\s*min\s*read", t, re.IGNORECASE):
            return True
        if re.search(r"Press enter or click to view image", t, re.IGNORECASE):
            return True
        return False

    first_content_p = None
    for p in container.find_all("p"):
        t = p.get_text(" ", strip=True)
        if len(t) >= 200 and not is_boilerplate(t):
            first_content_p = p
            break

    if first_content_p is not None:
        node = first_content_p
        while node is not None and node is not container:
            for sib in list(getattr(node, "previous_siblings", [])):
                if getattr(sib, "name", None):
                    sib.decompose()
            node = node.parent

    return {"title": title}, str(container)

def to_front_matter(d: dict) -> str:
    def esc(v: object) -> str:
        if v is None:
            return "null"
        if isinstance(v, bool):
            return "true" if v else "false"
        if isinstance(v, (int, float)):
            return str(v)
        return json.dumps(str(v))
    keys = ["title", "author", "published_date", "reading_time_minutes", "thumbnail_image", "featured", "source_url"]
    lines = ["---"] + [f"{k}: {esc(d.get(k))}" for k in keys] + ["---", ""]
    return "\n".join(lines)

def build_html_document(meta: dict, body_html: str) -> str:
    title = html_lib.escape(str(meta.get("title") or ""))
    author = html_lib.escape(str(meta.get("author") or ""))
    published = html_lib.escape(str(meta.get("published_date") or ""))
    image = html_lib.escape(str(meta.get("thumbnail_image") or ""))
    read_mins = html_lib.escape(str(meta.get("reading_time_minutes") or ""))

    head_lines = [
        '<meta charset="utf-8">',
        f'<title>{title}</title>' if title else '',
        f'<meta name="author" content="{author}">' if author else '',
        f'<meta name="reading_time_minutes" content="{read_mins}">' if read_mins else '',
        f'<meta property="article:published_time" content="{published}">' if published else '',
        f'<meta property="og:image" content="{image}">' if image else '',
    ]
    head = "".join([line for line in head_lines if line]).strip()

    meta_entries = []
    if title:
        meta_entries.append(f"<h1>{title}</h1>")
    if author:
        meta_entries.append(f"<div><strong>Author:</strong> {author}</div>")
    if read_mins:
        meta_entries.append(f"<div><strong>Reading time:</strong> {read_mins} min</div>")
    if published:
        meta_entries.append(f"<div><strong>Published:</strong> {published}</div>")


    meta_block = ""
    if meta_entries:
        meta_block = f"<div class=\"article-meta\">{''.join(meta_entries)}</div>"

    body_content = f"{meta_block}{body_html}"
    return f"<!doctype html><html><head>{head}</head><body>{body_content}</body></html>"

primary_html = extract_primary_html(result.html or "")
page_meta = extract_metadata(result.html or "")
primary_meta, cleaned_primary_html = clean_primary_html(primary_html)

doc_meta = result.metadata.model_dump(exclude_none=True) if getattr(result, "metadata", None) else {}

def first(*vals: object) -> object:
    for v in vals:
        if v is None:
            continue
        if isinstance(v, str) and not v.strip():
            continue
        return v
    return None

merged = dict(page_meta)
merged["author"] = first(doc_meta.get("author"), page_meta.get("author"))
merged["published_date"] = first(doc_meta.get("published_time"), doc_meta.get("article:published_time"), page_meta.get("published_date"))
merged["thumbnail_image"] = first(
    doc_meta.get("og_image"),
    doc_meta.get("og:image"),
    doc_meta.get("twitter:image:src"),
    page_meta.get("thumbnail_image"),
)

rt = first(doc_meta.get("twitter:data1"), doc_meta.get("twitter:data2"))
if isinstance(rt, str):
    m_rt = re.search(r"(\d+)\s*min", rt, re.IGNORECASE)
    if m_rt:
        merged["reading_time_minutes"] = int(m_rt.group(1))

metadata = {**merged, **primary_meta, "source_url": URL}

primary_md = md(primary_html, heading_style="ATX")
clean_primary_md = md(cleaned_primary_html, heading_style="ATX")
primary_text = BeautifulSoup(primary_html, "html.parser").get_text("\n", strip=True)
clean_primary_text = BeautifulSoup(cleaned_primary_html, "html.parser").get_text("\n", strip=True)

{
    "title": metadata.get("title"),
    "author": metadata.get("author"),
    "published_date": metadata.get("published_date"),
    "reading_time_minutes": metadata.get("reading_time_minutes"),
    "thumbnail_image": metadata.get("thumbnail_image"),
    "featured": metadata.get("featured"),
}


## Persisting Output in Multiple Formats

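The cells below write the raw Firecrawl output (`page.*`), the cleaned article (`primary.*`), and the merged metadata (`metadata.json`) to `out/`. `page.json` is written only when the scrape returned a JSON payload.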


In [None]:
from pathlib import Path

Path("out").mkdir(parents=True, exist_ok=True)


In [None]:
if result.json is not None:
    with open("out/page.json", "w", encoding="utf-8") as fp:
        json.dump(result.json, fp, ensure_ascii=False, indent=2)

with open("out/page.html", "w", encoding="utf-8") as fp:
    fp.write(result.html or "")

with open("out/page.md", "w", encoding="utf-8") as fp:
    fp.write(result.markdown or "")

with open("out/primary.md", "w", encoding="utf-8") as fp:
    fp.write(to_front_matter(metadata) + clean_primary_md)

with open("out/primary.html", "w", encoding="utf-8") as fp:
    fp.write(build_html_document(metadata, cleaned_primary_html))

with open("out/primary.txt", "w", encoding="utf-8") as fp:
    fp.write(clean_primary_text)

with open("out/metadata.json", "w", encoding="utf-8") as fp:
    json.dump(metadata, fp, ensure_ascii=False, indent=2)

print("Wrote out/page.* and out/primary.*" + (" (+ out/page.json)" if result.json is not None else ""))
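
To read `out/primary.md` back later, the front matter can be split from the body. The sketch below assumes the exact layout produced by `to_front_matter` (a `---` line, one `key: value` pair per line with JSON-encoded values, then a closing `---` line). `parse_front_matter` is a hypothetical helper, not a general YAML parser.

```python
import json


def parse_front_matter(text: str) -> tuple[dict, str]:
    """Split '---'-delimited front matter from the body and decode each value."""
    if not text.startswith("---\n"):
        return {}, text  # no front matter present
    end = text.find("\n---\n", 3)
    if end == -1:
        return {}, text  # opening delimiter without a closing one
    meta = {}
    for line in text[4:end].split("\n"):
        key, sep, raw = line.partition(": ")
        if sep:
            # Values were written as JSON scalars (null, true, 5, "text").
            meta[key] = json.loads(raw)
    body = text[end + len("\n---\n"):]
    return meta, body


sample = (
    '---\n'
    'title: "Hello: world"\n'
    'author: null\n'
    'reading_time_minutes: 5\n'
    'featured: true\n'
    '---\n'
    '# Heading\n'
)
meta, body = parse_front_matter(sample)
print(meta)
print(body)
```

If the file has been edited by hand and a value is no longer valid JSON, `json.loads` raises, so wrap the call if you need to tolerate that.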


### Optional: Structured JSON

Firecrawl’s JSON output is schema-driven: add a format object with `type: "json"`, a prompt, and a JSON Schema to `formats`, and the extracted object comes back on `.json`.


In [4]:
from pydantic import BaseModel

class PageSchema(BaseModel):
    title: str | None = None
    summary: str | None = None

structured = app.scrape(
    URL,
    formats=[
        "markdown",
        {"type": "json", "prompt": "Extract a title and short summary.", "schema": PageSchema.model_json_schema()},
    ],
)
print(structured.json)


{'title': 'OpenAI Moderation API: multimodal LLM with `omni-moderation-latest` (text + image)', 'summary': "This article discusses the OpenAI Moderation API's new multimodal capabilities, enabling classification of both text and images in content moderation, alongside practical usage guidelines and comparisons with older models."}


### Optional: Save as XML

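Requires `dicttoxml`. `result.json` is only populated when the scrape requested a `json` format (as in the structured example above), so the cell raises if it is missing.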


In [None]:
# from dicttoxml import dicttoxml
#
# if result.json is None:
#     raise RuntimeError("No JSON payload available; include 'json' in formats")
#
# xml_bytes = dicttoxml(result.json, custom_root="page", attr_type=False)
# with open("out/page.xml", "wb") as fp:
#     fp.write(xml_bytes)
#
# print("Wrote out/page.xml")
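
If you would rather not add a dependency, a flat dictionary can be serialized with the standard library. This is a sketch under two assumptions: the dictionary is flat (no nested lists or dicts), and every key is a valid XML element name, which holds for plain identifiers such as `title` or `published_date` but not for keys like `og:image`. `flat_dict_to_xml` is a hypothetical helper and is not equivalent to `dicttoxml`.

```python
import xml.etree.ElementTree as ET


def flat_dict_to_xml(data: dict, root_name: str = "page") -> bytes:
    """Serialize a flat dict to XML, one child element per key."""
    root = ET.Element(root_name)
    for key, value in data.items():
        child = ET.SubElement(root, key)
        # None becomes an empty element; everything else is stringified.
        child.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# Sample data for illustration; in the notebook you could pass `metadata`.
sample_meta = {"title": "Example & more", "author": None, "reading_time_minutes": 5}
xml_bytes = flat_dict_to_xml(sample_meta)
print(xml_bytes.decode("utf-8"))
```

Special characters such as `&` are escaped by `ElementTree`, so the output stays well-formed.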


## Inspect Useful Fields

Firecrawl returns a typed `Document`. These fields are often the most useful for downstream processing.


In [None]:
len(result.links or []), len(result.images or [])


In [None]:
result.metadata
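
A common next step is to keep only the links that point back to the same site, for example to decide which pages to scrape next. The sketch below assumes `links` is a list of absolute URL strings. `same_host_links` is a hypothetical helper, and the sample URLs are placeholders.

```python
from urllib.parse import urlparse


def same_host_links(links: list[str], base_url: str) -> list[str]:
    """Keep links whose host matches base_url, dropping exact duplicates."""
    host = urlparse(base_url).netloc.lower()
    seen = set()
    kept = []
    for link in links:
        if urlparse(link).netloc.lower() == host and link not in seen:
            seen.add(link)
            kept.append(link)
    return kept


sample_links = [
    "https://blog.example.com/a",
    "https://other.example.org/b",
    "https://blog.example.com/a",
    "https://BLOG.example.com/c",
]
print(same_host_links(sample_links, "https://blog.example.com/post"))
```

In the notebook this would be called as `same_host_links(result.links or [], URL)`.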


## Large Extractions

For site-level work, start a crawl job and poll it until it finishes (or configure a webhook). The cell below is left commented out because this notebook only needs a single page.


In [None]:
# job = app.crawl(
#     "https://blog1.neuralengineer.org/",
#     include_paths=["/openai-moderation-api-multimodal-llm-with-omni-moderation-latest-text-image-63b42d5f57a7"],
#     limit=1,
# )
# job
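
The polling side can be sketched independently of the SDK. The loop below assumes you can supply a callable that returns the current job object and that the object has a `status` attribute. The terminal status names are assumptions to check against the Firecrawl docs, and `wait_for_job` is a hypothetical helper. A fake status sequence stands in for real API calls.

```python
import time
from types import SimpleNamespace
from typing import Callable

# Assumed terminal status names; verify against the Firecrawl documentation.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(
    get_status: Callable[[], object],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call get_status until the job reaches a terminal status or times out."""
    deadline = time.monotonic() + timeout
    while True:
        job = get_status()
        if getattr(job, "status", None) in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("crawl job did not finish before the timeout")
        sleep(interval)


# Fake status sequence for illustration; no network calls are made.
states = iter(
    [
        SimpleNamespace(status="scraping"),
        SimpleNamespace(status="scraping"),
        SimpleNamespace(status="completed"),
    ]
)
final = wait_for_job(lambda: next(states), sleep=lambda seconds: None)
print(final.status)
```

Injecting `sleep` keeps the loop testable without waiting. In real use you would leave the default and pass a callable that queries the job status through the SDK.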


## References
- https://docs.firecrawl.dev/introduction
- https://pypi.org/project/firecrawl/
