In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

True

# Recursive URL

The RecursiveUrlLoader lets you recursively scrape all child links from a root URL and parse them into Documents.

In [2]:
from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    # max_depth=2,
    # use_async=False,
    # extractor=None,
    # metadata_extractor=None,
    # exclude_dirs=(),
    # timeout=10,
    # check_response_status=True,
    # continue_on_failure=True,
    # prevent_outside=True,
    # base_url=None,
    # ...
)

# Load
Use ```.load()``` to synchronously load all Documents into memory, with one Document per visited URL. Starting from the initial URL, the loader recurses through all linked URLs up to the specified ```max_depth```.
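To make the depth limit concrete, here is a toy sketch of a depth-bounded recursive crawl over an in-memory link graph. It is not the loader's actual implementation: `TOY_SITE` and `crawl` are hypothetical names invented for this illustration, and the exact depth-counting convention used by the library may differ from the one assumed here (the root counts as depth 0 and pages at depth `max_depth` or deeper are skipped).

```python
from typing import Dict, Iterator, List, Optional, Set

# Hypothetical in-memory "site": each URL maps to the URLs it links to.
TOY_SITE: Dict[str, List[str]] = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/"],  # links back to the root to show cycle handling
    "/b": [],
    "/a/1": ["/a/1/x"],
    "/a/1/x": [],
}


def crawl(
    url: str,
    max_depth: int,
    visited: Optional[Set[str]] = None,
    depth: int = 0,
) -> Iterator[str]:
    """Yield each reachable URL once, stopping below max_depth (assumed convention)."""
    if visited is None:
        visited = set()
    # Stop at the depth limit, and never visit the same URL twice.
    if depth >= max_depth or url in visited:
        return
    visited.add(url)
    yield url
    for child in TOY_SITE.get(url, []):
        yield from crawl(child, max_depth, visited, depth + 1)


print(list(crawl("/", max_depth=2)))  # ['/', '/a', '/b']
```

Under this assumed convention, `max_depth=2` yields the root and its direct children only; raising the limit pulls in deeper pages at the cost of more requests.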

Let's run through a basic example of how to use the ```RecursiveUrlLoader``` on the [Python 3.9 documentation](https://docs.python.org/3.9/).

Note that the loader needs no advance knowledge of the page HTML structure; by default, each Document's page content is the raw HTML of the visited page:



In [4]:
docs = loader.load()


In [6]:
len(docs)

24

In [7]:
for doc in docs:
    print(doc.page_content)


<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta charset="utf-8" /><title>3.9.20 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0">
    
    <link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    
    <script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
    <script src="_static/jquery.js"></script>
    <script src="_static/underscore.js"></script>
    <script src="_static/doctools.js"></script>
    <script src="_static/language_data.js"></script>
    
    <script src="_static/sidebar.js"></script>
    
    <link rel="search" type="application/opensearchdescription+xml"
          title="Search within Python 3.9.20 documentation"
          href="_static/opensearch.xml"/>
    <link rel="author" title="About these documents" href="about.html" />
    <link rel="index" title="I

In [9]:
docs[0].metadata

{'source': 'https://docs.python.org/3.9/',
 'content_type': 'text/html',
 'title': '3.9.20 Documentation',
 'language': None}

Great! The first document looks like the root page we started from. Let's look at the metadata of the next document:



In [10]:
docs[1].metadata

{'source': 'https://docs.python.org/3.9/license.html',
 'content_type': 'text/html',
 'title': 'History and License — Python 3.9.20 documentation',
 'language': None}

That URL looks like a child of our root page, which is great! Let's move on from metadata to examine the content of one of our documents:

In [11]:
print(docs[0].page_content[:300])


<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta charset="utf-8" /><title>3.9.20 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0">
    
    <link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
    <link rel=


# Lazy loading
If we're loading a large number of Documents and our downstream operations can be done over subsets of all loaded Documents, we can lazily load our Documents one at a time to minimize our memory footprint:

In [12]:
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(pages)

        pages = []

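One detail of the paging loop is easy to miss: when the total number of Documents is not a multiple of the page size, the final partial page is still sitting in `pages` after the loop ends and needs its own flush. A small generic batching helper avoids that. The sketch below uses only the standard library; `in_batches` is a hypothetical helper name, and `range(25)` stands in for `loader.lazy_load()` so the example runs without network access.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, including the final partial batch."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for loader.lazy_load(): any iterable works here.
for pages in in_batches(range(25), 10):
    # do some paged operation on `pages` here
    print(len(pages))  # prints 10, 10, 5
```

Because the helper consumes its input lazily, at most one batch is held in memory at a time, which preserves the memory benefit of lazy loading.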


# Adding an Extractor
By default, the loader sets the raw HTML from each link as the Document page content.

To parse this HTML into a more human/LLM-friendly format, you can pass in a custom ```extractor``` function:

In [13]:
import re

from bs4 import BeautifulSoup


def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()


loader = RecursiveUrlLoader("https://docs.python.org/3.9/", extractor=bs4_extractor)
docs = loader.load()
print(docs[0].page_content[:200])


3.9.20 Documentation

Download
Download these documents
Docs by version

Python 3.14 (in development)
Python 3.13 (pre-release)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-
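An extractor is just a function from an HTML string to a text string, so it can be written without third-party packages. The sketch below is an alternative to the BeautifulSoup version above, not a replacement for it: it uses only the standard library's `html.parser`, and it additionally skips the contents of `<script>` and `<style>` elements. The names `_TextCollector` and `stdlib_extractor` are hypothetical, and the sample HTML is made up for the demonstration.

```python
import re
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def stdlib_extractor(html: str) -> str:
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any buffered text
    text = "".join(collector.chunks)
    # Same whitespace normalization as bs4_extractor: collapse blank-line runs.
    return re.sub(r"\n\n+", "\n\n", text).strip()


sample = (
    "<html><head><title>Demo</title>\n"
    "<style>p { color: red; }</style></head>\n"
    "<body><p>Hello</p>\n\n\n\n<p>world</p></body></html>"
)
print(repr(stdlib_extractor(sample)))  # 'Demo\n\nHello\n\nworld'
```

Assuming the ```extractor``` parameter accepts any callable with this string-to-string shape, as `bs4_extractor` above suggests, such a function could be passed as `extractor=stdlib_extractor`.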


In [14]:
docs[0]

Document(metadata={'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.20 Documentation', 'language': None}, page_content='3.9.20 Documentation\n\nDownload\nDownload these documents\nDocs by version\n\nPython 3.14 (in development)\nPython 3.13 (pre-release)\nPython 3.12 (stable)\nPython 3.11 (security-fixes)\nPython 3.10 (security-fixes)\nPython 3.9 (security-fixes)\nPython 3.8 (security-fixes)\nPython 3.7 (EOL)\nPython 3.6 (EOL)\nPython 3.5 (EOL)\nPython 3.4 (EOL)\nPython 3.3 (EOL)\nPython 3.2 (EOL)\nPython 3.1 (EOL)\nPython 3.0 (EOL)\nPython 2.7 (EOL)\nPython 2.6 (EOL)\nAll versions\n\nOther resources\n\nPEP Index\nBeginner\'s Guide\nBook List\nAudio/Visual Talks\nPython Developer’s Guide\n\nNavigation\n\nindex\n\nmodules |\n\nPython »\n\n3.9.20 Documentation »\n    \n\n                     |\n                \n\nPython 3.9.20 documentation\n\n  Welcome! This is the official documentation for Python 3.9.20.\n  \nParts of the documentation:\n\nWhat\'s 