# Scrapfly Extraction API

Hi, this is a quick introduction to Scrapfly's Extraction API, a service for extracting data from HTML and text documents using AI and schema-based parsing.

It's designed to be simple but powerful, and in this intro we'll take a look at three of its components:

1. LLM based extraction - which parses data using AI prompts.
2. AI Auto Extractor - which extracts predefined objects like products, reviews and real estate listings.
3. Extraction rules - traditional data parsing simplified through JSON.

Before we begin, make sure to grab your API key from the [Scrapfly dashboard](https://scrapfly.io/dashboard).

For the examples, we'll be using Python with the HTTPX library, but the Extraction API operates through simple URL parameters and can be accessed with any HTTP client, like Fetch in JavaScript or cURL.

To start, set up the HTTP client with some default values:

In [2]:
import httpx

client = httpx.Client(
    # tip: raise the timeout, as LLM queries can take several seconds
    timeout=60,
    params={
        # your Scrapfly API key
        "key": "YOUR SCRAPFLY KEY",  
    },
)

Let's start by taking a look at LLM extraction.

# LLM based extraction

The Extraction API can extract data from HTML pages using LLM (Large Language Model) parsers. To do this, the HTML content is sent to Scrapfly along with a text prompt describing what to extract.

In [27]:
# Example of a very simple HTML product page
html = """
<!DOCTYPE html>
<html>
  <body>
    <h1>Product</h1>
    <p>
      This product is available 
      in every store location
      for 16.99 USD for limited time!
    </p>
  </body>
</html>
"""


In [26]:
from IPython import display

resp = client.post(
  "https://api.scrapfly.io/extraction",
  content=html,  # raw request body; httpx deprecates data= for str/bytes
  headers={
    "content-type": "text/html",
  },
  params={
    "extraction_prompt": "extract price and currency in json format"
  }
)
print(resp.url)
display.JSON(resp.json())

https://api.scrapfly.io/extraction?key=scp-test-14644f57089d47c0aeb705b776792366&extraction_prompt=extract%20price%20and%20currency%20in%20json%20format


<IPython.core.display.JSON object>
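
Since the whole request is driven by URL parameters, the printed URL above can be reproduced without any HTTP client. The following is a minimal sketch using only the standard library; `build_extraction_url` is a hypothetical helper name, not part of any Scrapfly SDK, and the key is a placeholder.

```python
from urllib.parse import quote, urlencode

# Endpoint used throughout this intro
API_URL = "https://api.scrapfly.io/extraction"


def build_extraction_url(**params: str) -> str:
    """Return the extraction endpoint URL with percent-encoded query parameters."""
    # quote_via=quote encodes spaces as %20, matching the URL printed above
    return f"{API_URL}?{urlencode(params, quote_via=quote)}"


url = build_extraction_url(
    key="YOUR SCRAPFLY KEY",
    extraction_prompt="extract price and currency in json format",
)
print(url)
```

The same URL can then be used from cURL or any other client by POSTing the document as the request body with a `content-type: text/html` header.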

## Integration with Scraper API

If you don't have access to HTML datasets yet, see [Scrapfly's scraper API](https://scrapfly.io/docs/scrape-api/getting-started), which integrates with the Extraction API directly and can scrape the data for you.

## Prompts can be any question

Prompts are not limited to extraction. The AI can also answer general questions based on the provided input or summarize the entire content.

In this example, we ask a freeform question about the product's availability using a very simple prompt of "what locations is this product available in?"

In [2]:
html = """
<!DOCTYPE html>
<html>
  <body>
    <h1>Product</h1>
    <p>
      This product is available 
      in every store location
      for 16.99 USD for limited time!
    </p>
  </body>
</html>
"""
resp = client.post(
    "https://api.scrapfly.io/extraction",
    content=html,
    headers={
        "content-type": "text/html",
    },
    params={
        "extraction_prompt": "what locations is this product available in?"
    }
)
print(resp.url)
display.JSON(resp.json())

https://api.scrapfly.io/extraction?key=scp-test-14644f57089d47c0aeb705b776792366&extraction_prompt=what%20locations%20is%20this%20product%20available%20in%3F


<IPython.core.display.JSON object>

Here we can see that the AI directly responded to us with a text answer.

## Real Life Example

In this example, we POST the page HTML with a prompt asking to extract the product specification in JSON format.

In response, the API returns JSON containing what it considers to be the product specification, such as description, features, packaging and price data.

In [38]:
# html from https://web-scraping.dev/product/1
from pathlib import Path

html = Path('product_1.html').read_text()
print("https://web-scraping.dev/product/1")
display.IFrame("https://web-scraping.dev/product/1", 1000, 800)

https://web-scraping.dev/product/1


In [14]:
# let's ask it to extract data as JSON
resp = client.post(
    "https://api.scrapfly.io/extraction",
    content=html,
    headers={
        "content-type": "text/html",
    },
    params={
        "extraction_prompt": "extract product specification in JSON format"
    }
)
print(resp.url)
display.JSON(resp.json())

https://api.scrapfly.io/extraction?key=scp-test-14644f57089d47c0aeb705b776792366&extraction_prompt=extract%20product%20specification%20in%20JSON%20format


<IPython.core.display.JSON object>

Prompts can also be as complex and detailed as you'd like. For example, to extract reviews from our web-scraping.dev example, we can mix structured data extraction with freeform questions, allowing the AI to do a bit of reasoning.

In [6]:
# html from https://web-scraping.dev/reviews
html = Path('reviews_1.html').read_text()
resp = client.post(
    "https://api.scrapfly.io/extraction",
    content=html,
    headers={
        "content-type": "text/html",
    },
    params={
        "extraction_prompt": """
        extract review data in JSON format for fields: 
        date, text also include tone field 
        which estimates the tone of the review
        which can be: positive, negative or neutral
        """
    }
)
print(resp.url)
from IPython import display
display.JSON(resp.json())

https://api.scrapfly.io/extraction?key=scp-test-14644f57089d47c0aeb705b776792366&extraction_prompt=%0A%20%20%20%20%20%20%20%20extract%20review%20data%20in%20JSON%20format%20for%20fields%3A%20date%2C%20text%0A%20%20%20%20%20%20%20%20also%20include%20tone%20field%20which%20estimates%20the%20tone%20of%20the%20review%20%0A%20%20%20%20%20%20%20%20


<IPython.core.display.JSON object>
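
The printed URL above contains `%0A` and long runs of `%20`: the newlines and indentation of the triple-quoted prompt are sent as part of the prompt text. One way to keep prompts readable in code while sending a compact string is to collapse the whitespace first. This is a small sketch; `normalize_prompt` is a hypothetical helper, and whether the extra whitespace affects results is not something this intro establishes.

```python
def normalize_prompt(prompt: str) -> str:
    """Collapse newlines and runs of whitespace into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empties
    return " ".join(prompt.split())


prompt = normalize_prompt("""
    extract review data in JSON format for fields: date, text
    also include tone field which estimates the tone of the review
""")
print(prompt)
```

The resulting string can be passed as `extraction_prompt` in place of the raw multi-line literal.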

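Here we asked the model to extract the date and text of each review as structured data and to estimate each review's tone. This gives us both the exact data on the page and an AI-generated evaluation.

Requests can also include a `url` parameter with the base URL of the document, as in this example that extracts product links: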

In [8]:
html = """
<!DOCTYPE html>
<html>
  <body>
    <h1>Products</h1>
    <ul>
      <li><a href="/product/1.html">Product 1</a></li>
      <li><a href="/product/2.html">Product 2</a></li>
      <li><a href="/product/3.html">Product 3</a></li>
    </ul>
  </body>
</html>
"""
resp = client.post(
    "https://api.scrapfly.io/extraction",
    content=html,
    headers={
        "content-type": "text/html",
    },
    params={
        "extraction_prompt": "extract product links in json format",
        "url": "https://web-scraping.dev/product/1"  # base url
    }
)
print(resp.json())
from IPython import display
display.JSON(resp.json())

{'content_type': 'application/json', 'data': ['https://www.example.com/product1', 'https://www.example.com/product2', 'https://www.example.com/product3']}


<IPython.core.display.JSON object>
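
As the printed response shows, the API wraps its answer in an envelope with `content_type` and `data` keys. Because prompts can return either structured JSON or a freeform text answer, it can help to unwrap the envelope in one place. The sketch below assumes every response has these two keys and that JSON may occasionally arrive as a string; `unwrap_result` is a hypothetical helper, not part of Scrapfly's API.

```python
import json
from typing import Any


def unwrap_result(result: dict[str, Any]) -> Any:
    """Return the payload of an extraction response envelope.

    Assumption: the body has `content_type` and `data` keys, as in the
    printed response above. JSON delivered as a string is parsed.
    """
    data = result.get("data")
    content_type = result.get("content_type", "")
    if isinstance(data, str) and "json" in content_type:
        return json.loads(data)
    return data


# Envelope copied from the printed output above
sample = {
    "content_type": "application/json",
    "data": [
        "https://www.example.com/product1",
        "https://www.example.com/product2",
        "https://www.example.com/product3",
    ],
}
links = unwrap_result(sample)
print(links)
```

In the notebook this would be called as `unwrap_result(resp.json())`.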

Next, let's take a look at a different AI extraction method: predefined auto parsing.

# AI Auto Parser

Scrapfly also includes some pre-built extraction models for common data parsing scenarios like parsing products, reviews, articles or real estate listings.

For this example, let's take a look at the web-scraping.dev/product/1 page, which contains product and review data.

In [6]:
# html from https://web-scraping.dev/product/1
html = Path('product_1.html').read_text()

resp = client.post(
    "https://api.scrapfly.io/extraction",
    content=html,
    headers={
        "content-type": "text/html",
    },
    params={
        "extraction_model": "product",
    }
)

from IPython import display
display.JSON(resp.json())

<IPython.core.display.JSON object>

Here we set the `extraction_model` parameter to `product`, and Scrapfly AI tries to extract each field of a predefined product schema, which includes the standard product fields like price, description, brand name and so on.

You can find the [full schema definitions in Scrapfly docs](https://scrapfly.io/docs/extraction-api/automatic-ai#models).

The advantage of auto parsing over LLM prompts is that the output structure is static and predictable, and no prompt engineering is required. The `data_quality` meta field also includes an estimate of how much data was found.

The same page can also be parsed with the `review_list` model to extract its reviews:

In [9]:
# html from https://web-scraping.dev/product/1
html = Path('product_1.html').read_text()
resp = client.post(
    "https://api.scrapfly.io/extraction",
    content=html,
    headers={
        "content-type": "text/html",
    },
    params={
        "extraction_model": "review_list",
    }
)
from IPython import display
display.JSON(resp.json())

<IPython.core.display.JSON object>
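
The `data_quality` meta field mentioned earlier can be checked before trusting a result. The sketch below is an illustration only: the response bodies are not shown in this intro, so the key names `fulfillment_percent` and `errors` are assumptions to verify against the schema documentation, and `quality_summary` is a hypothetical helper.

```python
from typing import Any


def quality_summary(result: dict[str, Any]) -> str:
    """Summarize the data_quality block of an auto-extraction response.

    Assumption: data_quality holds `fulfillment_percent` (a number) and
    `errors` (a list); these key names are not confirmed by this intro.
    """
    quality = result.get("data_quality") or {}
    percent = quality.get("fulfillment_percent")
    errors = quality.get("errors") or []
    if percent is None:
        return "no data_quality information in response"
    return f"{percent:.0f}% of schema fields found, {len(errors)} validation error(s)"


# Hypothetical payload, for illustration only
print(quality_summary({"data_quality": {"fulfillment_percent": 75.0, "errors": []}}))
```

Reading the fields with `.get()` keeps the helper from failing if the actual response uses different names.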

To have even more control over the data extraction, let's take a look at the extraction rules feature next.

# Extraction Rules

Extraction rules let you specify exact data parsing instructions. They replicate common data parsing tools in a unified parsing system expressed through a single JSON definition.

For example, here we parse the reviews page and define a `date_posted` field for finding date objects. We locate the elements using CSS selectors and apply formatters to output the dates in the desired format.

This gives us full control over data extraction and once we define our ruleset we can apply it to any document.


In [44]:
# html from https://web-scraping.dev/reviews (with js render)
html = Path('reviews_1.html').read_text()

# 1. create extraction template as JSON
rules = """
{  
  "source": "html",
  "selectors": [
    {
      "name": "date_posted",
      "type": "css",
      "query": "[data-testid='review-date']::text",
      "multiple": true,
      "formatters": [ {
        "name": "datetime",
        "args": {"format": "%Y, %b %d — %A"}
      } ]
    }
  ]
}
"""

# 2. base64 encode the template
from base64 import urlsafe_b64encode
rules = urlsafe_b64encode(rules.encode()).decode()

resp = client.post(
    "https://api.scrapfly.io/extraction",
    content=html,
    headers={
        "content-type": "text/html",
    },
    params={
        # use ephemeral: prefix for on-demand templates (templates can also be defined in dashboard soon)
        "extraction_template": "ephemeral:" + rules,
    }
)
print(resp.url)
display.JSON(resp.json())

https://api.scrapfly.io/extraction?key=scp-test-14644f57089d47c0aeb705b776792366&url=https%3A%2F%2Fweb-scraping.dev&extraction_template=ephemeral%3ACnsgIAogICJzb3VyY2UiOiAiaHRtbCIsCiAgInNlbGVjdG9ycyI6IFsKICAgIHsKICAgICAgIm5hbWUiOiAiZGF0ZV9wb3N0ZWQiLAogICAgICAidHlwZSI6ICJjc3MiLAogICAgICAicXVlcnkiOiAiW2RhdGEtdGVzdGlkPSdyZXZpZXctZGF0ZSddOjp0ZXh0IiwKICAgICAgIm11bHRpcGxlIjogdHJ1ZSwKICAgICAgImZvcm1hdHRlcnMiOiBbIHsKICAgICAgICAibmFtZSI6ICJkYXRldGltZSIsCiAgICAgICAgImFyZ3MiOiB7ImZvcm1hdCI6ICIlWSwgJWIgJWQg4oCUICVBIn0KICAgICAgfSBdCiAgICB9CiAgXQp9Cg%3D%3D


<IPython.core.display.JSON object>
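
Writing the template as a Python dictionary instead of a JSON string avoids syntax slips and makes rule sets easier to reuse. The following sketch wraps the two steps above (serialize, then base64 encode) and adds a decoder for inspecting a logged URL. `encode_template` and `decode_template` are hypothetical helper names; only the `ephemeral:` prefix and the selector fields come from the example above.

```python
import json
from base64 import urlsafe_b64decode, urlsafe_b64encode


def encode_template(template: dict) -> str:
    """Serialize a rule set and return it as an `ephemeral:` parameter value."""
    raw = json.dumps(template).encode("utf-8")
    return "ephemeral:" + urlsafe_b64encode(raw).decode("ascii")


def decode_template(value: str) -> dict:
    """Inverse of encode_template, useful for inspecting a logged request URL."""
    encoded = value.removeprefix("ephemeral:")
    return json.loads(urlsafe_b64decode(encoded))


# Same selector as in the example above, without the formatter
template = {
    "source": "html",
    "selectors": [
        {
            "name": "date_posted",
            "type": "css",
            "query": "[data-testid='review-date']::text",
            "multiple": True,
        }
    ],
}
value = encode_template(template)
print(value[:40])
print(decode_template(value) == template)
```

The returned `value` can be passed directly as the `extraction_template` parameter.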

This is just the tip of the iceberg when it comes to extraction rule features. For more in-depth examples, see the full documentation page.

This short intro to Scrapfly's Extraction API should get you started on your data parsing journey. For more, see our [full documentation at scrapfly.io/docs](https://scrapfly.io/docs/extraction-api/getting-started).



















