# Lab 22: Document Loading from Web Pages

This lab shows how to load web page content into LangChain `Document` objects with `WebBaseLoader`. You'll learn how to:
- Create a `WebBaseLoader` for a web page URL
- Fetch a page and convert it into a list of `Document` objects
- Read a web document's structure: its `page_content` and `metadata`
- Check what was loaded before using it in LangChain workflows
- Prepare web data for further processing and analysis

In [None]:
# Import LangChain's web document loader
# WebBaseLoader fetches a URL and parses the HTML with BeautifulSoup (requires the beautifulsoup4 package)
from langchain_community.document_loaders import WebBaseLoader

In [None]:
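# Configure the OpenAI API key for downstream processing
# Web loading itself does not use the key; it is only needed if you pass the loaded content to an OpenAI model
# Replace the placeholder with your own key, and avoid saving a real key in a shared notebook
import os
os.environ["OPENAI_API_KEY"] = "your-api-key"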
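
As an alternative to writing the key into the notebook, the cell above could read it from the environment and prompt only when it is missing. This is a minimal sketch; `get_api_key` is a hypothetical helper name, not part of LangChain or the OpenAI SDK.

```python
import os
from getpass import getpass


def get_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting only if it is missing.

    Hypothetical helper for this lab, not a LangChain or OpenAI function.
    """
    key = os.environ.get(var_name)
    if not key:
        # getpass does not echo the input, so the key is not shown in the output.
        key = getpass(f"Enter {var_name}: ")
        os.environ[var_name] = key
    return key
```

Calling `get_api_key()` once at the top of the notebook leaves `OPENAI_API_KEY` set for any later cell that needs it.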

In [None]:
# Define the URL of the web page to load and process
# This example uses a tech news article about Meta's AI assistant and Llama 3
# You can replace this with any publicly accessible web page URL
URL = "https://www.theverge.com/2024/4/18/24133808/meta-ai-assistant-llama-3-chatgpt-openai-rival"
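
If you swap in your own URL, a quick syntax check can catch a missing scheme before the loader tries to fetch it. The sketch below uses only the standard library; `is_loadable_url` is a hypothetical helper, and it checks the URL's shape only, not whether the page exists or is publicly accessible.

```python
from urllib.parse import urlparse


def is_loadable_url(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host name."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_loadable_url("https://www.theverge.com/2024/4/18/24133808/meta-ai-assistant-llama-3-chatgpt-openai-rival"))  # True
print(is_loadable_url("www.theverge.com"))  # False: no scheme
```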

In [None]:
# Initialize the WebBaseLoader with the specified URL
# The loader will fetch and parse the web page content when load() is called
loader = WebBaseLoader(URL)

In [None]:
# Load the web page content; a single call to load() is enough
# This fetches the HTML and extracts the page's text
# The result is a list of Document objects containing the extracted content
data = loader.load()
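
To get a feel for what "extracting text from HTML" involves, here is a rough standard-library sketch that runs on an inline HTML string. It is not `WebBaseLoader`'s implementation (the loader relies on BeautifulSoup); `TextExtractor`, `html_to_text`, and `sample_html` are illustrative names. Note that text from navigation elements is kept along with the article text, which is why loaded pages often need cleanup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped tag.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample_html = """
<html><head><title>Demo page</title><style>p { color: red; }</style></head>
<body><nav>Home</nav><p>Article text.</p><script>var x = 1;</script></body></html>
"""
print(html_to_text(sample_html))
```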

In [None]:
# Check the number of documents loaded from the web page
# WebBaseLoader returns one Document per URL, so a single URL gives a list of length 1
len(data)

In [None]:
# Display the loaded document(s)
# Each Document has page_content (the extracted text) and metadata (source URL, title, etc.)
# The text is taken from the whole page, so it can include navigation and footer text along with the article
data
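
Printing `data` directly can be hard to read for a long page. A small summary helper, sketched below, prints one line per document instead. `StandInDocument` is a hypothetical stand-in that mirrors only the `page_content` and `metadata` attributes so the example runs without network access; `summarize` is likewise an illustrative name and should also work on the list returned by `loader.load()`, assuming each item exposes those two attributes.

```python
from dataclasses import dataclass, field


@dataclass
class StandInDocument:
    """Hypothetical stand-in mirroring the two attributes used in this lab."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize(docs, preview_chars: int = 80) -> list:
    """Return one summary line per document: source, length, and a text preview."""
    lines = []
    for doc in docs:
        # Collapse runs of whitespace so the preview fits on one line.
        text = " ".join(doc.page_content.split())
        source = doc.metadata.get("source", "unknown")
        lines.append(f"{source} | {len(text)} chars | {text[:preview_chars]}")
    return lines


sample = [StandInDocument("Meta   releases\n\nLlama 3", {"source": "https://example.com/post"})]
for line in summarize(sample):
    print(line)
```

With the real loader output, the equivalent call would be `summarize(data)`.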