In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

True

# Advanced parsing

This method is appropriate if we want more granular control or processing of the page content. Below, instead of generating one Document per page and controlling its content via BeautifulSoup, we generate multiple Document objects representing distinct structures on a page. These structures can include section titles and their corresponding body texts, lists or enumerations, tables, and more.

Under the hood, this approach uses the `langchain-unstructured` library. See the integration docs for more information about using Unstructured with LangChain.

In [2]:
from langchain_unstructured import UnstructuredLoader

page_url = "https://python.langchain.com/docs/how_to/chatbots_memory/"
loader = UnstructuredLoader(web_url=page_url)

docs = []
async for doc in loader.alazy_load():
    docs.append(doc)

INFO: NumExpr defaulting to 4 threads.
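The cell above relies on a top-level `async for`, which works in Jupyter but not in a plain Python script. As a minimal sketch of the same collection pattern outside a notebook, the example below wraps the loop in a coroutine and runs it with `asyncio.run`. The generator `fake_alazy_load` is a hypothetical stand-in for `loader.alazy_load()` so that the example runs offline; it yields strings rather than `Document` objects.

```python
import asyncio
from typing import AsyncIterator


async def fake_alazy_load() -> AsyncIterator[str]:
    """Hypothetical stand-in for loader.alazy_load(); yields strings, not Documents."""
    for text in ["How to add memory to chatbots", "note"]:
        await asyncio.sleep(0)  # give control back to the event loop, as real I/O would
        yield text


async def collect() -> list[str]:
    # Same pattern as the notebook cell, but inside a coroutine.
    return [item async for item in fake_alazy_load()]


# asyncio.run creates an event loop, runs the coroutine, and closes the loop.
collected = asyncio.run(collect())
print(len(collected))
```

In a script that uses the real loader, the body of `collect` would presumably iterate over `loader.alazy_load()` instead of the stand-in.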


In [4]:
print(len(docs))

66


Note that with no advance knowledge of the page HTML structure, we recover a natural organization of the body text:



In [5]:
for doc in docs:
    print(doc.page_content)

How to add memory to chatbots
A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including:
Simply stuffing previous messages into a chat model prompt.
The above, but trimming old messages to reduce the amount of distracting information the model has to deal with.
More complex modifications like synthesizing summaries for long running conversations.
We'll go into more detail on a few techniques below!
note
This how-to guide previously built a chatbot using RunnableWithMessageHistory. You can access this version of the guide in the v0.2 docs.
As of the v0.3 release of LangChain, we recommend that LangChain users take advantage of LangGraph persistence to incorporate memory into new LangChain applications.
If your code is already relying on RunnableWithMessageHistory or BaseChatMessageHistory, you do not need to make any changes. We do not plan on deprecating this functionality in the near fut

# Extracting content from specific sections

Each `Document` object represents an element of the page. Its metadata contains useful information, such as its category:

In [5]:
print(f"{doc.metadata}\n")
print(doc.page_content)

{'source': 'https://python.langchain.com/docs/how_to/chatbots_memory/'}

How to add memory to chatbots | A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including: | Simply stuffing previous messages into a chat model prompt. | The above, but trimming old messages to reduce the amount of distracting information the model has to deal with. | More complex modifications like synthesizing summaries for long running conversations. | We'll go into more detail on a few techniques below! | note | This how-to guide previously built a chatbot using | RunnableWithMessageHistory | . You can access this version of the guide in the | v0.2 docs | . | As of the v0.3 release of LangChain, we recommend that LangChain users take advantage of | LangGraph persistence | to incorporate | memory | into new LangChain applications. | If your code is already relying on | RunnableWithMessageHistory | or | BaseChatMe

In [6]:
for doc in docs:
    print(f'{doc.metadata["category"]}: {doc.page_content}')

Title: How to add memory to chatbots
NarrativeText: A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including:
ListItem: Simply stuffing previous messages into a chat model prompt.
ListItem: The above, but trimming old messages to reduce the amount of distracting information the model has to deal with.
ListItem: More complex modifications like synthesizing summaries for long running conversations.
NarrativeText: We'll go into more detail on a few techniques below!
Title: note
NarrativeText: This how-to guide previously built a chatbot using RunnableWithMessageHistory. You can access this version of the guide in the v0.2 docs.
NarrativeText: As of the v0.3 release of LangChain, we recommend that LangChain users take advantage of LangGraph persistence to incorporate memory into new LangChain applications.
NarrativeText: If your code is already relying on RunnableWithMessageHistory or BaseCh
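Because each element carries a `category`, it is straightforward to tally or filter elements by type. The sketch below illustrates this with a hypothetical `Element` dataclass as an offline stand-in for `Document` (it only mimics the `page_content` and `metadata` attributes). The sample texts and categories are copied from the output above.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# The first seven elements from the output above.
sample_docs = [
    Element("How to add memory to chatbots", {"category": "Title"}),
    Element("A key feature of chatbots is their ability to use content of previous conversation turns as context.", {"category": "NarrativeText"}),
    Element("Simply stuffing previous messages into a chat model prompt.", {"category": "ListItem"}),
    Element("The above, but trimming old messages to reduce the amount of distracting information the model has to deal with.", {"category": "ListItem"}),
    Element("More complex modifications like synthesizing summaries for long running conversations.", {"category": "ListItem"}),
    Element("We'll go into more detail on a few techniques below!", {"category": "NarrativeText"}),
    Element("note", {"category": "Title"}),
]

# Count how many elements fall into each category.
category_counts = Counter(doc.metadata.get("category", "Unknown") for doc in sample_docs)

# Keep only the body paragraphs, dropping titles and list items.
narrative_docs = [doc for doc in sample_docs if doc.metadata.get("category") == "NarrativeText"]

print(dict(category_counts))
```

With real `Document` objects from the loader, the same two expressions should work unchanged, since they only read `doc.metadata`.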

In [8]:
from IPython.display import display, JSON
display(JSON(docs[0].metadata))

<IPython.core.display.JSON object>

In [10]:
for doc in docs:
    print(doc.metadata)

{'category_depth': 0, 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://python.langchain.com/docs/how_to/chatbots_memory/', 'category': 'Title', 'element_id': 'fe037635255efdcb12c13437486c388b'}
{'languages': ['eng'], 'filetype': 'text/html', 'parent_id': 'fe037635255efdcb12c13437486c388b', 'url': 'https://python.langchain.com/docs/how_to/chatbots_memory/', 'category': 'NarrativeText', 'element_id': '6d776151467bd8f096fc404de3ad502a'}
{'category_depth': 1, 'languages': ['eng'], 'filetype': 'text/html', 'parent_id': 'fe037635255efdcb12c13437486c388b', 'url': 'https://python.langchain.com/docs/how_to/chatbots_memory/', 'category': 'ListItem', 'element_id': '465d3d600e2200c5575addbe51297f9d'}
{'category_depth': 1, 'languages': ['eng'], 'filetype': 'text/html', 'parent_id': 'fe037635255efdcb12c13437486c388b', 'url': 'https://python.langchain.com/docs/how_to/chatbots_memory/', 'category': 'ListItem', 'element_id': '7215434059e03b088853eda85b628e98'}
{'category_depth': 1, 'langu

Elements may also have parent-child relationships: for example, a paragraph might belong to a section with a title. If a section is of particular interest (e.g., for indexing), we can isolate the corresponding `Document` objects.

As an example, below we load the content of the "Setup" sections for two web pages:

In [11]:
from typing import List

from langchain_core.documents import Document


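async def _get_setup_docs_from_url(url: str) -> List[Document]:
    """Return the elements whose parent is the page's "Setup" title."""
    loader = UnstructuredLoader(web_url=url)

    setup_docs = []
    # Sentinel that matches no element: using None here would also match
    # elements that have no "parent_id" key at all.
    parent_id = -1
    async for doc in loader.alazy_load():
        # Remember the element ID of the "Setup" title once we reach it.
        if doc.metadata["category"] == "Title" and doc.page_content.startswith("Setup"):
            parent_id = doc.metadata["element_id"]
        # Collect the direct children of that title.
        if doc.metadata.get("parent_id") == parent_id:
            setup_docs.append(doc)

    return setup_docs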


page_urls = [
    "https://python.langchain.com/docs/how_to/chatbots_memory/",
    "https://python.langchain.com/docs/how_to/chatbots_tools/",
]
setup_docs = []
for url in page_urls:
    page_setup_docs = await _get_setup_docs_from_url(url)
    setup_docs.extend(page_setup_docs)

In [12]:
from collections import defaultdict

setup_text = defaultdict(str)

for doc in setup_docs:
    url = doc.metadata["url"]
    setup_text[url] += f"{doc.page_content}\n"

dict(setup_text)

{'https://python.langchain.com/docs/how_to/chatbots_memory/': 'You\'ll need to install a few packages, and have your OpenAI API key set as an environment variable named OPENAI_API_KEY:\n%pip install --upgrade --quiet langchain langchain-openai langgraph\n\nimport getpass\nimport os\n\nif not os.environ.get("OPENAI_API_KEY"):\n    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")\n',
 'https://python.langchain.com/docs/how_to/chatbots_tools/': 'For this guide, we\'ll be using a tool calling agent with a single tool for searching the web. The default will be powered by Tavily, but you can switch it out for any similar tool. The rest of this section will assume you\'re using Tavily.\nYou\'ll need to sign up for an account on the Tavily website, and install the following packages:\n%pip install --upgrade --quiet langchain-community langchain-openai tavily-python langgraph\n\nimport getpass\nimport os\n\nif not os.environ.get("OPENAI_API_KEY"):\n    os.environ["OPENAI_API_KE
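The helper above keeps only the direct children of one title. A more general variant, sketched below, groups every element under the title it points to via `parent_id`. It uses plain dictionaries as offline stand-ins for `Document` objects; the element IDs are copied from the metadata printed earlier, and the function name `group_by_parent_title` is hypothetical.

```python
from collections import defaultdict

# Stand-ins for the first four elements; IDs copied from the metadata output above.
elements = [
    {
        "page_content": "How to add memory to chatbots",
        "metadata": {"category": "Title", "element_id": "fe037635255efdcb12c13437486c388b"},
    },
    {
        "page_content": "A key feature of chatbots is their ability to use content of previous conversation turns as context.",
        "metadata": {"category": "NarrativeText", "parent_id": "fe037635255efdcb12c13437486c388b", "element_id": "6d776151467bd8f096fc404de3ad502a"},
    },
    {
        "page_content": "Simply stuffing previous messages into a chat model prompt.",
        "metadata": {"category": "ListItem", "parent_id": "fe037635255efdcb12c13437486c388b", "element_id": "465d3d600e2200c5575addbe51297f9d"},
    },
    {
        "page_content": "The above, but trimming old messages to reduce the amount of distracting information the model has to deal with.",
        "metadata": {"category": "ListItem", "parent_id": "fe037635255efdcb12c13437486c388b", "element_id": "7215434059e03b088853eda85b628e98"},
    },
]


def group_by_parent_title(elements: list[dict]) -> dict[str, list[str]]:
    """Map each title's text to the texts of the elements that name it as parent."""
    # Index titles by element ID so children can look them up.
    titles = {
        el["metadata"]["element_id"]: el["page_content"]
        for el in elements
        if el["metadata"].get("category") == "Title"
    }
    sections = defaultdict(list)
    for el in elements:
        parent_id = el["metadata"].get("parent_id")
        if parent_id in titles:
            sections[titles[parent_id]].append(el["page_content"])
    return dict(sections)


sections = group_by_parent_title(elements)
print(list(sections))
```

This sketch handles only one level of nesting (title to direct children). Deeper hierarchies would need to follow `parent_id` links repeatedly.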

# Vector search over page content

Once we have loaded the page contents into LangChain `Document` objects, we can index them (e.g., for a RAG application) in the usual way. Below we use OpenAI embeddings, although any LangChain embeddings model will suffice.

In [13]:
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

vector_store = InMemoryVectorStore.from_documents(setup_docs, OpenAIEmbeddings())
retrieved_docs = vector_store.similarity_search("Install Tavily", k=2)
for doc in retrieved_docs:
    print(f'Page {doc.metadata["url"]}: {doc.page_content[:300]}\n')

INFO: HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


Page https://python.langchain.com/docs/how_to/chatbots_tools/: You'll need to sign up for an account on the Tavily website, and install the following packages:

Page https://python.langchain.com/docs/how_to/chatbots_tools/: For this guide, we'll be using a tool calling agent with a single tool for searching the web. The default will be powered by Tavily, but you can switch it out for any similar tool. The rest of this section will assume you're using Tavily.
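To make the retrieval step less opaque, here is a toy sketch of the ranking idea behind similarity search: embed the query and each text as a vector, then sort the texts by cosine similarity to the query. It substitutes a simple bag-of-words count vector for OpenAI embeddings, so it illustrates the principle only and is not how `OpenAIEmbeddings` or `InMemoryVectorStore` are implemented. The names `embed` and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (an assumption, not a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, texts: list[str], k: int = 2) -> list[str]:
    """Return the k texts most similar to the query, most similar first."""
    query_vec = embed(query)
    ranked = sorted(texts, key=lambda t: cosine_similarity(query_vec, embed(t)), reverse=True)
    return ranked[:k]


# Sample texts taken from the "Setup" sections loaded above.
texts = [
    "You'll need to sign up for an account on the Tavily website, and install the following packages:",
    "You'll need to install a few packages, and have your OpenAI API key set as an environment variable named OPENAI_API_KEY:",
    "import getpass",
]
results = top_k("Install Tavily", texts, k=1)
print(results[0])
```

A learned embedding model captures meaning beyond exact word overlap, which is why real retrieval can surface passages that share no tokens with the query.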

