---
sidebar_label: ZyteURLLoader
---

# ZyteURLLoader

This notebook provides a quick overview for getting started with the Zyte URL [document loader](https://python.langchain.com/v0.2/docs/concepts/#document-loaders). Given the URL of a webpage, it lets you access the text of that page in various formats. For detailed documentation of all ZyteURLLoader features and configurations, head to the [API reference](https://python.langchain.com/v0.2/api_reference/community/document_loaders/langchain_community.document_loaders.zyte.ZyteURLLoader.html).

## Overview

### Integration details


| Class | Package | Local | Serializable | 
| :--- | :--- | :---: | :---: |
| [ZyteURLLoader](https://python.langchain.com/v0.2/api_reference/community/document_loaders/langchain_community.document_loaders.zyte.ZyteURLLoader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ✅ | ❌ |


### Loader features

| Source | Document Lazy Loading | Native Async Support |
| :---: | :---: | :---: |
| ZyteURLLoader | ✅ | ✅ |

## Setup

To access the ZyteURLLoader document loader you'll need to install the `langchain-community` and `zyte-api` packages, and optionally `html2text`.

### Credentials

Head to [ZYTE-AI-SCRAPING](https://www.zyte.com/ai-web-scraping/) to sign up for Zyte API and generate an API key. Once you've done this, set the `ZYTE_API_KEY` environment variable:

In [None]:
import getpass
import os

os.environ["ZYTE_API_KEY"] = getpass.getpass("Enter your Zyte API key: ")
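
Outside an interactive notebook (for example in a script or a scheduled job), prompting with `getpass` is not practical. The following is a minimal sketch that assumes the key has already been exported in the environment. `require_zyte_api_key` is a hypothetical helper name used for illustration and is not part of any library.

```python
import os


def require_zyte_api_key(env_var: str = "ZYTE_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message.

    Hypothetical helper for illustration; not part of langchain or zyte-api.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it before creating the loader."
        )
    return key
```

Failing early with a named variable tends to be easier to debug than an authentication error raised later by the first request.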

If you want best-in-class automated tracing of your model calls, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting the lines below:

In [None]:
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

### Installation


In [26]:
%pip install -qU langchain_community zyte-api html2text

## Initialization

Now we can instantiate our document loader and get the content in different loading modes. There are three modes: `html`, `html-text`, and `article`. We try all three on a single URL and compare the results.


In [25]:
# nest_asyncio lets the loader run inside a notebook,
# where an event loop is already running.
# %pip install -qU nest_asyncio

In [1]:
import os

import nest_asyncio
from langchain_community.document_loaders import ZyteURLLoader

nest_asyncio.apply()

In [2]:
urls = [
    "https://www.rte.ie/sport/paris-2024/2024/0806/1463620-paris-2024-coyle-and-sweetnam-dont-figure-in-medal/"
]
zyte_api_key = os.environ.get("ZYTE_API_KEY")

In [3]:
loader_html = ZyteURLLoader(urls, mode="html", api_key=zyte_api_key)
loader_html_text = ZyteURLLoader(urls, mode="html-text", api_key=zyte_api_key)
loader_article = ZyteURLLoader(urls, mode="article", api_key=zyte_api_key)

ZyteURLLoader takes the following parameters:
- `urls`: URLs to load. Each is loaded into its own document.
- `api_key`: Zyte API key.
- `mode`: Determines how the text is extracted for the page content. It can take one of the following values: `html`, `html-text`, `article`.
- `continue_on_failure`: If True, continue loading other URLs if one fails.
- `n_conn`: The maximum number of concurrent requests to use.
- `download_kwargs`: Any additional download arguments to pass to Zyte API. This lets you enable browser rendering, toggle JavaScript, or set a wait timeout. See: https://docs.zyte.com/zyte-api/usage/reference.html
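
Because the mode is a plain string, a typo only surfaces once the loader is used. The sketch below checks the arguments up front and returns them as a dictionary. `build_loader_args` and `VALID_MODES` are hypothetical names introduced here, and the sketch assumes the constructor accepts the parameter names listed above as keyword arguments.

```python
from typing import Any, Dict, List, Optional

# The three modes named in this notebook.
VALID_MODES = ("html", "html-text", "article")


def build_loader_args(
    urls: List[str],
    api_key: str,
    mode: str,
    *,
    continue_on_failure: bool,
    n_conn: int,
    download_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Validate loader arguments and return them as a keyword dictionary.

    Hypothetical helper for illustration; not part of langchain_community.
    """
    if not urls:
        raise ValueError("urls must contain at least one URL")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    if n_conn < 1:
        raise ValueError("n_conn must be at least 1")
    args: Dict[str, Any] = {
        "urls": list(urls),
        "api_key": api_key,
        "mode": mode,
        "continue_on_failure": continue_on_failure,
        "n_conn": n_conn,
    }
    # Only forward download_kwargs when the caller supplied some.
    if download_kwargs:
        args["download_kwargs"] = dict(download_kwargs)
    return args


# Usage (not executed here; it needs a valid key and network access):
# loader = ZyteURLLoader(**build_loader_args(
#     urls, zyte_api_key, "article", continue_on_failure=True, n_conn=5))
```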

## Load

Here we show different modes in which a URL can be loaded. 

In [4]:
page_html = loader_html.load()
page_html_text = loader_html_text.load()
page_article = loader_article.load()

We check the beginning of the page content and the length of the text to get an idea of the content.

In [5]:
print(page_html[0].page_content[:400])





<!DOCTYPE html>
<html class="no-js" lang="en">
<head>

<script src="https://cdn.cookielaw.org/scripttemplates/otSDKStub.js" type="text/javascript" charset="UTF-8" data-domain-script="a58df52b-2812-4cc9-99c6-cf9bcfe5af8b"></script>
<script src="https://www.rte.ie/djstatic/dotie/privacy/cookie-functions.js?v=20240912v38843"></script>
<script type="text/javascript">
    var optanonCallbacks = [
 


In [6]:
len(page_html[0].page_content)

143056

In [7]:
print(page_html_text[0].page_content[:200])

skip to main content

Your browser does not support Javascript. Please turn Javascript on to get the
best experience from rte.ie

Menu

[ Weather ](https://www.rte.ie/weather/)

[ __ Ireland's Nationa


In [8]:
len(page_html_text[0].page_content)

19140

In [9]:
print(page_article[0].page_content[:200])

Paris 2024: No luck for Daniel Coyle and Shane Sweetnam in individual final

There was no joy for Ireland's Daniel Coyle and Shane Sweetnam in the show jumping individual final, with the gold medal wo


In [10]:
len(page_article[0].page_content)

3840

The length of the page content varies considerably between loading modes. In `article` mode we get only the relevant article content of the webpage, so the result is much shorter.
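
To make the comparison concrete, the sketch below expresses each mode's content length as a fraction of the largest one. `content_lengths` and `relative_sizes` are hypothetical helper names. They rely only on each document exposing a `page_content` attribute, as the loaded documents above do.

```python
from typing import Any, Dict, Sequence


def content_lengths(docs_by_mode: Dict[str, Sequence[Any]]) -> Dict[str, int]:
    """Map each mode name to the length of its first document's page_content."""
    return {mode: len(docs[0].page_content) for mode, docs in docs_by_mode.items()}


def relative_sizes(lengths: Dict[str, int]) -> Dict[str, float]:
    """Express each length as a fraction of the largest one, rounded to 3 places."""
    largest = max(lengths.values(), default=0)
    if largest == 0:
        # Nothing to compare against; report every mode as 0.0.
        return {mode: 0.0 for mode in lengths}
    return {mode: round(n / largest, 3) for mode, n in lengths.items()}


# Usage with the documents loaded above:
# lengths = content_lengths(
#     {"html": page_html, "html-text": page_html_text, "article": page_article}
# )
# relative_sizes(lengths)
```

With the lengths printed above (143056, 19140 and 3840 characters), the `html-text` output is roughly 13% of the raw HTML and the `article` output is under 3%.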

### Page content with browser rendering enabled (JS off)

At times we need the page content as rendered by a browser. This can be done by setting the `browserHtml` argument. We disable JavaScript in this example.

In [11]:
urls = ["https://www.whatismybrowser.com/detect/is-javascript-enabled/"]

In [12]:
kwargs = {"browserHtml": True, "javascript": False}
loader_browser_text = ZyteURLLoader(
    urls, mode="html-text", api_key=zyte_api_key, download_kwargs=kwargs
)

In [13]:
page_browser = loader_browser_text.load()

In [14]:
print(page_browser[0].page_content[:500])

[ ](/)

[WhatIsMyBrowser.com](/)

  * [My browser](/)
  * [Guides](/guides/)
  * [Detect my settings](/detect/)
  * [Tools](/developers/tools/)

  1. [ Homepage  ](/)
  2. [ Detect my settings  ](/detect/)
  3. [ Is JavaScript enabled?  ](/detect/is-javascript-enabled/)

# Is JavaScript enabled?

Updated at: Jun 26, 2024

No

JavaScript is enabled in your web browser. Congratulations; you're one step
closer to having a fully featured online experience.

## [Need help enabling JavaScript?](/guide


### Page content with JS enabled

In [15]:
kwargs = {
    "browserHtml": True,
    "javascript": True,
}
loader_browser_js_text = ZyteURLLoader(
    urls, mode="html-text", api_key=zyte_api_key, download_kwargs=kwargs
)

In [16]:
page_browser_js = loader_browser_js_text.load()

In [17]:
print(page_browser_js[0].page_content[:500])

[ ](/)

[WhatIsMyBrowser.com](/)

  * [My browser](/)
  * [Guides](/guides/)
  * [Detect my settings](/detect/)
  * [Tools](/developers/tools/)

  1. [ Homepage  ](/)
  2. [ Detect my settings  ](/detect/)
  3. [ Is JavaScript enabled?  ](/detect/is-javascript-enabled/)

# Is JavaScript enabled?

Updated at: Jun 26, 2024

Yes

JavaScript is enabled in your web browser. Congratulations; you're one step
closer to having a fully featured online experience.

## [Need help enabling JavaScript?](/guid


### Question answering from some news articles

In [27]:
#!pip install -qU chromadb langchain-openai

In [5]:
openai_api_key = os.environ.get("OPENAI_API_KEY")

In [6]:
urls = [
    # Sports
    "https://www.rte.ie/sport/paris-2024/2024/0806/1463620-paris-2024-coyle-and-sweetnam-dont-figure-in-medal/",
    "https://www.rte.ie/sport/paris-2024/2024/0806/1463576-kellie-harrington-olympics-yang/",
    "https://www.rte.ie/sport/paris-2024/2024/0806/1463615-paris-2024-healy-and-osullivan-into-1500m-repechage/",
    "https://www.rte.ie/sport/football/2024/0806/1463614-gaa-invites-bids-for-rights-to-gaago-broadcast-package/",
    # Business
    "https://www.rte.ie/news/business/2024/0806/1463642-cso-monthly-unemployment-figures/",
    "https://www.rte.ie/news/business/2024/0806/1463597-tokyos-nikkei-index-recovers/",
    "https://www.rte.ie/news/business/2024/0806/1463591-aib-services-pmi/",
]

In [7]:
loader_articles = ZyteURLLoader(urls, mode="article", api_key=zyte_api_key)

In [8]:
pages = loader_articles.load()

In [11]:
len(pages)

7

In [12]:
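# Chroma lives in langchain_community; importing it from langchain.vectorstores is deprecated.
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings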

We use OpenAI embeddings to store the semantic information about the articles.

In [13]:
embedding = OpenAIEmbeddings(api_key=openai_api_key)

In [14]:
vectordb = Chroma.from_documents(
    documents=pages,
    embedding=embedding,
)

In [15]:
question = "Who won the gold medal in individual jumping in 2024 olympics?"

In [16]:
ret_docs = vectordb.similarity_search(question, k=1)

In [17]:
# ChatOpenAI was already imported alongside OpenAIEmbeddings above.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", api_key=openai_api_key, temperature=0)

In [18]:
from langchain.chains import RetrievalQA

In [19]:
qa_chain = RetrievalQA.from_chain_type(
    llm, return_source_documents=True, retriever=vectordb.as_retriever()
)

In [20]:
result = qa_chain.invoke(question)

In [21]:
result["result"]

"The gold medal in individual jumping at the 2024 Olympics was won by Germany's Christian Kukuk after a jump-off."

In [22]:
question2 = "How is the employment rate changing for men in recent months?"

In [23]:
result = qa_chain.invoke(question2)

In [24]:
result["result"]

'The unemployment rate for men in July was 4.7%, up from the revised rate of 4.5% in June and also higher than the rate of 4.5% in July of the previous year. This indicates a slight increase in unemployment for men in recent months.'
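
Since the chain was created with `return_source_documents=True`, the result also carries the retrieved documents under the `source_documents` key. The sketch below shows one way to check which articles an answer was based on. `summarize_sources` is a hypothetical helper. It makes no assumption about which metadata keys the loader sets and returns the metadata as-is.

```python
from typing import Any, Dict, List


def summarize_sources(result: Dict[str, Any], preview_chars: int = 80) -> List[Dict[str, Any]]:
    """Return a short text preview and the metadata of each source document.

    Hypothetical helper for illustration; expects the dictionary returned by
    a chain created with return_source_documents=True.
    """
    summaries = []
    for doc in result.get("source_documents", []):
        # Collapse whitespace so the preview fits on one line.
        text = " ".join(doc.page_content.split())
        summaries.append(
            {"preview": text[:preview_chars], "metadata": dict(doc.metadata)}
        )
    return summaries


# Usage with the result above:
# for source in summarize_sources(result):
#     print(source["preview"], source["metadata"])
```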

## Lazy Load

In [4]:
urls = [
    "https://www.rte.ie/sport/paris-2024/2024/0806/1463620-paris-2024-coyle-and-sweetnam-dont-figure-in-medal/",
    "https://www.rte.ie/sport/football/2024/0806/1463614-gaa-invites-bids-for-rights-to-gaago-broadcast-package/",
]
loader_html = ZyteURLLoader(urls, mode="html", api_key=zyte_api_key)

In [6]:
page = []
for doc in loader_html.lazy_load():
    page.append(doc)
    if len(page) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        page = []

In [7]:
len(page)

2
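
In the loop above, documents left in `page` after the final iteration are never passed to the paged operation. That is why `len(page)` is 2 here: the batch size of 10 was never reached. The sketch below is one way to handle this with a small generator that also yields the final partial batch. `in_batches` is a hypothetical helper name.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, including the last partial batch."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    # Emit whatever remains once the input is exhausted.
    if batch:
        yield batch


# Usage with the loader above (not executed here):
# for batch in in_batches(loader_html.lazy_load(), 10):
#     ...  # do some paged operation, e.g. index.upsert(batch)
```

Because `lazy_load()` is consumed one document at a time, at most `size` documents are held in memory at once.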

## API reference

For detailed documentation of all ZyteURLLoader features and configurations, head to the API reference: https://python.langchain.com/v0.2/api_reference/community/document_loaders/langchain_community.document_loaders.zyte.ZyteURLLoader.html