# LangChain Web Loading Experiments

This notebook demonstrates how to use LangChain's web loading capabilities to:
1. Load and parse web content
2. Extract structured data
3. Clean and process HTML content
4. Handle different types of web pages
5. Integrate with LLMs for content analysis

In [2]:
# Import required libraries
import sys
from typing import List
from dotenv import load_dotenv
from pydantic import BaseModel, Field

from langchain_community.document_loaders import WebBaseLoader

# Add src to path for imports
sys.path.append('../src')
from agents.llmtools import get_llm

# Load environment variables
load_dotenv()

INFO     [browser_use] BrowserUse logging setup complete with level info
INFO     [root] Anonymized telemetry enabled. See https://github.com/gregpr07/browser-use for more information.


True

## Basic Web Loading

Let's start with basic web page loading and see what we get.

In [6]:
# Load a simple web page
url = "https://graygrids.com/blog/product-hunt-alternatives"
loader = WebBaseLoader(url)

# Load the page
docs = loader.load()

# Print basic info
print(f"Number of documents: {len(docs)}")
print("\nMetadata from first document:")
print(docs[0].metadata)
print("\nFirst 500 characters of content:")
print(docs[0].page_content[:500])

Number of documents: 1

Metadata from first document:
{'source': 'https://graygrids.com/blog/product-hunt-alternatives', 'title': 'Top 11 Product Hunt Alternatives to Try in 2025 | GrayGrids', 'description': 'Discover Product Hunt Alternatives: Top Websites, Directories, and Communities to Successfully Launch Your Next Product, Startup, and To...', 'language': 'No language found.'}

First 500 characters of content:
Top 11 Product Hunt Alternatives to Try in 2025 | GrayGrids0Hours0Mins0Secs60% OFF on GrayGrids - Apply Coupon Code: BLFCM2024Explore All DealsTemplatesComponentsTailwind BuilderResourcesInspirationsImage CompressorMeta Tags GeneratorBrandsTailGridsUI Components, Blocks and Toolkit for Tailwind CSSTailAdminOpen-Source Tailwind CSS Admin Dashboard TemplateLineicons8400+ Line Icons for Designers and DevelopersFormBoldForm API and Backend Solution for All PlatformsPlainAdminFree Vanilla JS Multipur
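
The first 500 characters above are mostly navigation labels run together with no separators. One way to reduce that noise is to skip boilerplate containers and join the remaining text with spaces. The sketch below uses only Python's standard `html.parser` on an inline HTML string, so it runs offline; the class name `MainTextExtractor`, the `SKIP_TAGS` set, and the sample markup are illustrative assumptions, not part of LangChain or of the page loaded above.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate (an assumption; tune per site).
SKIP_TAGS = {"nav", "header", "footer", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect visible text while ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._parts.append(text)

    def text(self):
        # Join with spaces so adjacent elements do not run together.
        return " ".join(self._parts)


def extract_main_text(html):
    """Return the text of an HTML string with boilerplate containers removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample markup, not taken from the page above.
sample_html = """
<html><body>
  <nav><a href="/templates">Templates</a><a href="/components">Components</a></nav>
  <main><h1>Example Title</h1><p>First paragraph.</p></main>
  <footer>Copyright</footer>
</body></html>
"""
cleaned = extract_main_text(sample_html)
print(cleaned)
```

The same idea could be applied to the raw HTML of a page before it is handed to an LLM, which also shortens the prompt.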


## Structured Data Extraction

Next, let's extract structured data from the loaded page by describing the desired fields in a Pydantic model and asking the LLM to fill it in.

In [11]:
from langchain_core.documents import Document


class WebContent(BaseModel):
    """Model for structured web content"""
    links: List[str] = Field(description="List of links to competitors of Product Hunt")
    competitorNames: List[str] = Field(description="name of the competitors")

def extract_structured_content(doc: Document) -> WebContent:
    """Extract structured content from a web page"""
    llm = get_llm()
    prompt = f"""Extract the following information from this web page content:
    - The root URLs of all sites that appear to be competitors to Product Hunt
    - The name of each of those competitors
    
    Content:
    {doc.page_content}
    """
    
    structured_llm = llm.with_structured_output(WebContent)
    return structured_llm.invoke(prompt)

# Try it out
structured_content = extract_structured_content(docs[0])
print("Extracted Content:")
print("\nImportant Links:")
for link in structured_content.links:
    print(f"- {link}")

Extracted Content:

Important Links:
- https://resource.fyi
- https://indiehackers.com
- https://betalist.com
- https://www.reddit.com
- https://news.ycombinator.com
- https://www.capterra.com
- https://startupstash.com
- https://launchingnext.com
- https://ctrlalt.com
- https://sideprojectors.com
- https://saasworthy.com
- https://betapage.co
- https://alternativeto.net
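
Because the model returns URLs as free-form strings, the list may contain duplicates, deep links, or inconsistent `www.` prefixes. A small post-processing step can normalize it. The following is a minimal sketch using only the standard library; `normalize_links` is a hypothetical helper and the sample input is invented for illustration.

```python
from urllib.parse import urlparse


def normalize_links(links):
    """Reduce each URL to scheme + host, drop a leading 'www.', and de-duplicate in order."""
    seen = set()
    result = []
    for link in links:
        parsed = urlparse(link.strip())
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue  # skip anything that is not an absolute http(s) URL
        host = parsed.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        root = f"{parsed.scheme}://{host}"
        if root not in seen:
            seen.add(root)
            result.append(root)
    return result


# Invented sample input, not actual model output.
raw_links = [
    "https://www.reddit.com",
    "https://reddit.com/r/startups",
    "https://betalist.com",
    "not a url",
]
clean_links = normalize_links(raw_links)
print(clean_links)
```

In the notebook this could be called as `normalize_links(structured_content.links)` before printing.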


## Text Splitting and Processing

Let's look at how to split and process longer web content.

In [None]:
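# This import is missing from the earlier cells (requires the langchain-text-splitters package)
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a text splitter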
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

# Split the document
splits = text_splitter.split_documents(docs)

print(f"Number of splits: {len(splits)}")
print("\nFirst split:")
print(splits[0].page_content)
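
To make `chunk_size` and `chunk_overlap` concrete: with `length_function=len`, the first caps each chunk's length in characters and the second is the number of characters consecutive chunks may share. The sketch below is a simplified fixed-window splitter that shows only that arithmetic. It is not how `RecursiveCharacterTextSplitter` works internally, since that class prefers to break on separators such as paragraph and line breaks; `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


# 2500 characters of synthetic text
text = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(text)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With these settings each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.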

## Advanced Web Loading

Now let's configure the loader with additional options: SSL verification, failure handling, a request rate limit, and custom request headers.

In [None]:
# Load with custom options
advanced_loader = WebBaseLoader(
    web_paths=["https://python.langchain.com/docs/modules/data_connection/document_loaders/"],
    verify_ssl=True,
    continue_on_failure=True,
    requests_per_second=2,
    requests_kwargs={
        "headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
        }
    }
)

# Load the page using the options configured above
advanced_docs = advanced_loader.load()

print(f"Loaded {len(advanced_docs)} documents")
print("\nMetadata from first document:")
print(advanced_docs[0].metadata)
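
The `continue_on_failure` option names a common pattern: when one URL in a batch fails, log it and move on instead of aborting the whole load. The sketch below illustrates that pattern in plain Python with a stand-in `fetch` callable so it runs offline. It is not `WebBaseLoader`'s implementation, and `load_all` and `fake_fetch` are hypothetical names.

```python
import logging

logger = logging.getLogger(__name__)


def load_all(urls, fetch, continue_on_failure=True):
    """Fetch each URL; on error, either skip it with a warning or re-raise."""
    pages = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as exc:
            if not continue_on_failure:
                raise
            logger.warning("Skipping %s: %s", url, exc)
    return pages


def fake_fetch(url):
    # Stand-in for a real HTTP request so the example runs without network access.
    if "broken" in url:
        raise ConnectionError("simulated network failure")
    return f"<html>content of {url}</html>"


pages = load_all(
    ["https://example.com/a", "https://example.com/broken", "https://example.com/b"],
    fake_fetch,
)
print(list(pages))
```

Passing `continue_on_failure=False` makes the first failing URL raise, which is usually preferable when every page is required.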

## Content Analysis with LLM

Finally, let's analyze the web content using an LLM.

In [None]:
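# This import is missing from the earlier cells
from langchain_core.output_parsers import StrOutputParser


def analyze_content(doc: Document) -> str: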
    """Analyze web content using LLM"""
    llm = get_llm()
    prompt = f"""Analyze this web page content and provide:
    1. Main topic and purpose
    2. Key points or takeaways
    3. Technical complexity level
    4. Target audience
    
    Content:
    {doc.page_content}
    """
    
    chain = llm | StrOutputParser()
    return chain.invoke(prompt)

# Try the analysis
analysis = analyze_content(docs[0])
print("Content Analysis:")
print(analysis)
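
The prompt above embeds the full page text, which for long pages may exceed the model's context window. One simple mitigation is to cap the content length before building the prompt. The sketch below assumes a character budget as a rough proxy for tokens; `build_analysis_prompt` and the 8000-character default are illustrative assumptions, not values from this notebook.

```python
def build_analysis_prompt(page_content, max_chars=8000):
    """Build the analysis prompt, truncating overly long page content."""
    content = page_content
    if len(content) > max_chars:
        # Keep the beginning of the page and mark that the rest was cut.
        content = content[:max_chars] + "\n[content truncated]"
    return (
        "Analyze this web page content and provide:\n"
        "1. Main topic and purpose\n"
        "2. Key points or takeaways\n"
        "3. Technical complexity level\n"
        "4. Target audience\n\n"
        f"Content:\n{content}\n"
    )


# Synthetic long input to show the truncation path.
prompt = build_analysis_prompt("7" * 10000, max_chars=100)
print(prompt)
```

Truncating from the start of the page keeps the title and introduction, which is often enough for a topic-level summary; splitting the page and summarizing chunk by chunk is an alternative when the full text matters.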