<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/data_connectors/WebPageDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Page Reader

Demonstrates our web page readers.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
%pip install llama-index llama-index-readers-web

In [None]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
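
Several cells below suggest setting logging to DEBUG for more detailed output. A minimal sketch of one way to do that with the standard `logging` module, assuming you want to change the root logger for the whole notebook session:

```python
import logging

# Raise the root logger's verbosity from INFO to DEBUG.
root_logger = logging.getLogger()
root_logger.setLevel(logging.DEBUG)

# getEffectiveLevel() reports the level the logger will actually apply.
current_level = logging.getLevelName(root_logger.getEffectiveLevel())
print(current_level)
```

Switch back with `root_logger.setLevel(logging.INFO)` once the extra detail is no longer needed.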

#### Using SimpleWebPageReader

In [None]:
from llama_index.core import SummaryIndex
from llama_index.readers.web import SimpleWebPageReader
from IPython.display import Markdown, display
import os

In [None]:
# NOTE: the html_to_text=True option requires html2text to be installed
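
Because `html_to_text=True` depends on the optional `html2text` package, it can help to check for it before building the reader. A minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of LlamaIndex:

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


has_html2text = is_installed("html2text")
if not has_html2text:
    print("html2text is missing; run `%pip install html2text` first")
```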

In [None]:
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["http://paulgraham.com/worked.html"]
)

In [None]:
documents[0]

In [None]:
index = SummaryIndex.from_documents(documents)

In [None]:
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")

In [None]:
display(Markdown(f"<b>{response}</b>"))

# Using Spider Reader 🕷
[Spider](https://spider.cloud/?ref=llama_index) describes itself as the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md#benchmark-results) crawler. It converts any website into pure HTML, markdown, metadata, or text, and lets you crawl with custom actions using AI.

Spider supports high-performance proxies to prevent detection, caching of AI actions, webhooks for crawl status, scheduled crawls, and more.

**Prerequisites:** you need a Spider API key to use this loader. You can get one at [spider.cloud](https://spider.cloud).

In [None]:
# Scrape single URL
from llama_index.readers.web import SpiderWebReader

spider_reader = SpiderWebReader(
    api_key="YOUR_API_KEY",  # Get one at https://spider.cloud
    mode="scrape",
    # params={} # Optional parameters see more on https://spider.cloud/docs/api
)

documents = spider_reader.load_data(url="https://spider.cloud")
print(documents)



Crawl a domain, following all deeper subpages:

In [None]:
# Crawl domain with deeper crawling following subpages
from llama_index.readers.web import SpiderWebReader

spider_reader = SpiderWebReader(
    api_key="YOUR_API_KEY",
    mode="crawl",
    # params={} # Optional parameters see more on https://spider.cloud/docs/api
)

documents = spider_reader.load_data(url="https://spider.cloud")
print(documents)



For guides and documentation, visit [Spider](https://spider.cloud/docs/api)

# Using Browserbase Reader 🅱️

[Browserbase](https://browserbase.com) is a serverless platform for running headless browsers. It offers advanced debugging, session recordings, stealth mode, integrated proxies, and captcha solving.

## Installation and Setup

- Get an API key and Project ID from [browserbase.com](https://browserbase.com) and set them as environment variables (`BROWSERBASE_API_KEY`, `BROWSERBASE_PROJECT_ID`).
- Install the [Browserbase SDK](http://github.com/browserbase/python-sdk):

In [None]:
%pip install browserbase
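
Before constructing the reader, it may be worth confirming that both environment variables named above are set. A minimal sketch; `missing_env_vars` is a hypothetical helper, not part of the Browserbase SDK:

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


required = ["BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID"]
missing = missing_env_vars(required)
if missing:
    print("Set these variables before creating the reader:", ", ".join(missing))
```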

In [None]:
from llama_index.readers.web import BrowserbaseWebReader

In [None]:
reader = BrowserbaseWebReader()
docs = reader.load_data(
    urls=[
        "https://example.com",
    ],
    # Set to True to return only the page's text content
    text_content=False,
)

### Using FireCrawl Reader 🔥


Firecrawl is an API that turns entire websites into clean, LLM-accessible markdown.

Using Firecrawl to gather an entire website

In [None]:
from llama_index.readers.web import FireCrawlWebReader

In [None]:
# using firecrawl to crawl a website
firecrawl_reader = FireCrawlWebReader(
    api_key="<your_api_key>",  # Replace with your actual API key from https://www.firecrawl.dev/
    mode="crawl",  # "crawl" follows links across the site; "scrape" fetches a single page
    params={"additional": "parameters"},  # Optional additional parameters
)

# Load documents by crawling the site, starting from this URL
documents = firecrawl_reader.load_data(url="http://paulgraham.com/")

In [None]:
index = SummaryIndex.from_documents(documents)

In [None]:
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")

In [None]:
display(Markdown(f"<b>{response}</b>"))

Using firecrawl for a single page


In [None]:
# Initialize the FireCrawlWebReader with your API key and desired mode
from llama_index.readers.web.firecrawl_web.base import FireCrawlWebReader

firecrawl_reader = FireCrawlWebReader(
    api_key="<your_api_key>",  # Replace with your actual API key from https://www.firecrawl.dev/
    mode="scrape",  # "scrape" fetches a single page; "crawl" follows links across the site
    params={"additional": "parameters"},  # Optional additional parameters
)

# Load documents from a single page URL
documents = firecrawl_reader.load_data(url="http://paulgraham.com/worked.html")


In [None]:
index = SummaryIndex.from_documents(documents)

In [None]:
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")

In [None]:
display(Markdown(f"<b>{response}</b>"))

Using FireCrawl's extract mode to extract structured data from URLs

In [None]:
# Initialize the FireCrawlWebReader with your API key and extract mode
from llama_index.readers.web.firecrawl_web.base import FireCrawlWebReader

firecrawl_reader = FireCrawlWebReader(
    api_key="<your_api_key>",  # Replace with your actual API key from https://www.firecrawl.dev/
    mode="extract",  # Use extract mode to extract structured data
    params={
        "prompt": "Extract the title, author, and main points from this essay",
        # Required prompt parameter for extract mode
    },
)

# Load documents by providing a list of URLs to extract data from
documents = firecrawl_reader.load_data(
    urls=[
        "https://www.paulgraham.com",
        "https://www.paulgraham.com/worked.html",
    ]
)

In [None]:
index = SummaryIndex.from_documents(documents)

In [None]:
# Query the extracted structured data
query_engine = index.as_query_engine()
response = query_engine.query("What are the main points from these essays?")

display(Markdown(f"<b>{response}</b>"))

# Using Hyperbrowser Reader ⚡

[Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy to use solutions for any webscraping needs, such as scraping a single page or crawling an entire site.

Key Features:
- Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches
- Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright
- Powerful APIs - Easy to use APIs for scraping/crawling any site, and much more
- Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies

For more information, visit the [Hyperbrowser website](https://hyperbrowser.ai) or the [Hyperbrowser docs](https://docs.hyperbrowser.ai).

## Installation and Setup

- Head to [Hyperbrowser](https://app.hyperbrowser.ai/) to sign up and generate an API key. Once you've done this, set the `HYPERBROWSER_API_KEY` environment variable or pass the key to the `HyperbrowserWebReader` constructor.
- Install the [Hyperbrowser SDK](https://github.com/hyperbrowserai/python-sdk):

In [None]:
%pip install hyperbrowser

In [None]:
from llama_index.readers.web import HyperbrowserWebReader

reader = HyperbrowserWebReader(api_key="your_api_key_here")
docs = reader.load_data(
    urls=["https://example.com"],
    operation="scrape",
)
docs

#### Using TrafilaturaWebReader

In [None]:
from llama_index.readers.web import TrafilaturaWebReader

In [None]:
documents = TrafilaturaWebReader().load_data(
    ["http://paulgraham.com/worked.html"]
)

In [None]:
index = SummaryIndex.from_documents(documents)

In [None]:
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")

In [None]:
display(Markdown(f"<b>{response}</b>"))

### Using RssReader

In [None]:
from llama_index.core import SummaryIndex
from llama_index.readers.web import RssReader

documents = RssReader().load_data(
    ["https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"]
)

index = SummaryIndex.from_documents(documents)

# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What happened in the news today?")

## Using ScrapFly
ScrapFly is a web scraping API with headless browser capabilities, proxies, and anti-bot bypass. It extracts web page data as LLM-accessible markdown or text. Install the ScrapFly Python SDK using pip:
```shell
pip install scrapfly-sdk
```

Basic usage of ScrapflyReader:

In [None]:
from llama_index.readers.web import ScrapflyReader

# Initiate ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your ScrapFly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"]
)

ScrapflyReader also accepts a ScrapeConfig object for customizing the scrape request. See the documentation for full feature details and API parameters: https://scrapfly.io/docs/scrape-api/getting-started

In [None]:
from llama_index.readers.web import ScrapflyReader

# Initiate ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your ScrapFly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"],
    scrape_config=scrapfly_scrape_config,  # Pass the scrape config
    scrape_format="markdown",  # The scrape result format, either `markdown`(default) or `text`
)

# Using ZyteWebReader

ZyteWebReader lets you access the content of a web page in different modes ("article", "html-text", "html"). 
It also lets you change settings such as browser rendering and JavaScript, which many sites require before their relevant content becomes accessible. All supported options are listed here: https://docs.zyte.com/zyte-api/usage/reference.html

To install dependencies:
```shell
pip install zyte-api
```

To get your Zyte API key, visit: https://docs.zyte.com/zyte-api/get-started.html

In [None]:
from llama_index.readers.web import ZyteWebReader

# Required to run it in notebook
# import nest_asyncio
# nest_asyncio.apply()


# Initiate ZyteWebReader with your Zyte API key
zyte_reader = ZyteWebReader(
    api_key="your ZYTE API key here",
    mode="article",  # or "html-text" or "html"
)

urls = [
    "https://www.zyte.com/blog/web-scraping-apis/",
    "https://www.zyte.com/blog/system-integrators-extract-big-data/",
]

documents = zyte_reader.load_data(
    urls=urls,
)

print(len(documents[0].text))

5871


Browser rendering and JavaScript can be enabled by passing the corresponding parameters during initialization. 

In [None]:
zyte_dw_params = {
    "browserHtml": True,  # Enable browser rendering
    "javascript": True,  # Enable JavaScript
}

# Initiate ZyteWebReader with your Zyte API key and use default "article" mode
zyte_reader = ZyteWebReader(
    api_key="your ZYTE API key here",
    download_kwargs=zyte_dw_params,
)

# Load documents from URLs
documents = zyte_reader.load_data(
    urls=urls,
)

In [None]:
len(documents[0].text)

4355

Set "continue_on_failure" to False if you'd like to stop when any request fails.
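
The flag's effect can be pictured with a plain-Python loop. This is only a sketch of the general continue-on-failure pattern, not ZyteWebReader's actual implementation; `load_all` and `fake_fetch` are hypothetical names introduced for illustration:

```python
def load_all(urls, fetch, continue_on_failure=True):
    """Fetch each URL; skip failures or re-raise them depending on the flag."""
    results = []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:
            if not continue_on_failure:
                raise  # stop at the first failed request
            print(f"Skipping {url}: {exc}")
    return results


def fake_fetch(url):
    """Stand-in for a network call; fails for URLs containing 'bad'."""
    if "bad" in url:
        raise ValueError("request failed")
    return f"content of {url}"


pages = load_all(["https://a.example", "https://bad.example"], fake_fetch)
print(len(pages))
```

With the default `continue_on_failure=True`, the failing URL is skipped and the remaining results are returned; with `False`, the same call raises on the first failure.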

In [None]:
zyte_reader = ZyteWebReader(
    api_key="your ZYTE API key here",
    mode="html-text",
    download_kwargs=zyte_dw_params,
    continue_on_failure=False,
)

# Load documents from URLs
documents = zyte_reader.load_data(
    urls=urls,
)

In [None]:
len(documents[0].text)

17488

In the default mode ("article") only the article text is extracted, while in "html-text" mode the full text of the web page is extracted, so the resulting text is significantly longer. 
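
Using the lengths printed above (4355 characters for "article" and 17488 for "html-text", both with browser rendering enabled), the difference can be quantified. The numbers are copied from the outputs in this notebook and will vary as the pages change:

```python
# Character counts printed by the cells above (same URL, browser rendering on).
article_len = 4355
html_text_len = 17488

ratio = html_text_len / article_len
print(f"html-text is about {ratio:.1f}x longer than article")
```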

# Using AgentQLWebReader 🐠

Use AgentQL to scrape structured data from a website.

In [None]:
from llama_index.readers.web import AgentQLWebReader
from llama_index.core import VectorStoreIndex
from IPython.display import Markdown, display

In [None]:
# Using AgentQL to crawl a website
agentql_reader = AgentQLWebReader(
    api_key="YOUR_API_KEY",  # Replace with your actual API key from https://dev.agentql.com
    params={
        "is_scroll_to_bottom_enabled": True
    },  # Optional additional parameters
)

# Load documents from a single page URL
document = agentql_reader.load_data(
    url="https://www.ycombinator.com/companies?batch=W25",
    query="{ company[] { name location description industry_category link(a link to the company's detail on Ycombinator)} }",
)

In [None]:
index = VectorStoreIndex.from_documents(document)
query_engine = index.as_query_engine()
response = query_engine.query(
    "Find companies that are working on web agent, list their names, locations and link"
)

display(Markdown(f"<b>{response}</b>"))

# Using OxylabsWebReader

OxylabsWebReader allows a user to scrape any website with different parameters while bypassing most of the anti-bot tools. Check out the [Oxylabs documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api/other-websites) to get the full list of parameters.

Get credentials by creating an Oxylabs account [here](https://oxylabs.io/).


In [None]:
from llama_index.readers.web import OxylabsWebReader


reader = OxylabsWebReader(
    username="OXYLABS_USERNAME", password="OXYLABS_PASSWORD"
)

documents = reader.load_data(
    [
        "https://sandbox.oxylabs.io/products/1",
        "https://sandbox.oxylabs.io/products/2",
    ]
)

print(documents[0].text)

The Legend of Zelda: Ocarina of Time | Oxylabs Scraping Sandbox

[![]()logo](/)

Game platforms:

* **All**

* [Nintendo platform](/products/category/nintendo)

+ wii
+ wii-u
+ nintendo-64
+ switch
+ gamecube
+ game-boy-advance
+ 3ds
+ ds

* [Xbox platform](/products/category/xbox-platform)

* **Dreamcast**

* [Playstation platform](/products/category/playstation-platform)

* **Pc**

* **Stadia**

Go Back

Note!This is a sandbox website used for web scraping. Information listed in this website does not have any real meaning and should not be associated with the actual products.

The Legend of Zelda: Ocarina of Time

The Legend of Zelda: Ocarina of Time
------------------------------------

**Developer:** Nintendo**Platform:****Type:** singleplayer

As a young boy, Link is tricked by Ganondorf, the King of the Gerudo Thieve

Another example with parameters for selecting the geolocation, user agent type, JavaScript rendering, headers, and cookies.

In [None]:
documents = reader.load_data(
    [
        "https://sandbox.oxylabs.io/products/3",
    ],
    {
        "geo_location": "Berlin, Germany",
        "render": "html",
        "user_agent_type": "mobile",
        "context": [
            {"key": "force_headers", "value": True},
            {"key": "force_cookies", "value": True},
            {
                "key": "headers",
                "value": {
                    "Content-Type": "text/html",
                    "Custom-Header-Name": "custom header content",
                },
            },
            {
                "key": "cookies",
                "value": [
                    {"key": "NID", "value": "1234567890"},
                    {"key": "1P JAR", "value": "0987654321"},
                ],
            },
            {"key": "http_method", "value": "get"},
            {"key": "follow_redirects", "value": True},
            {"key": "successful_status_codes", "value": [808, 909]},
        ],
    },
)
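
In the request above, `context` is a list of key/value pairs. If that feels verbose, a small helper can build it from a plain dict. A sketch only; `to_context` is a hypothetical helper and not part of the Oxylabs SDK or LlamaIndex:

```python
def to_context(options: dict) -> list:
    """Convert {"name": value} into the [{"key": name, "value": value}] form."""
    return [{"key": key, "value": value} for key, value in options.items()]


context = to_context(
    {
        "force_headers": True,
        "http_method": "get",
        "follow_redirects": True,
    }
)
print(context[0])
```

The resulting list can be placed under the `"context"` key of the parameter dict passed as the second argument to `load_data`.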