# LangChain Web Loading Experiments

This notebook demonstrates how to use LangChain's web loading capabilities to:
1. Load and parse web content
2. Extract structured data
3. Clean and process HTML content
4. Handle different types of web pages
5. Integrate with LLMs for content analysis

In [1]:
# Import required libraries
import sys
from typing import List
from dotenv import load_dotenv
from pydantic import BaseModel, Field

from langchain_community.document_loaders import WebBaseLoader

# Add src to path for imports
sys.path.append('../src')
from agents.llmtools import get_llm

# Load environment variables
load_dotenv()

USER_AGENT environment variable not set, consider setting it to identify your requests.


INFO     [browser_use] BrowserUse logging setup complete with level info
INFO     [root] Anonymized telemetry enabled. See https://github.com/gregpr07/browser-use for more information.


True
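
The `USER_AGENT` warning in the output above appears to be emitted when the loader module is imported without that variable set. A minimal sketch of one way to silence it, assuming it runs before the `langchain_community` import; the agent string is a hypothetical placeholder, not a value from this project:

```python
import os

# Hypothetical identifier (an assumption); replace it with something
# that describes your own project or contact details.
# setdefault keeps any value already provided by the shell or a .env file.
os.environ.setdefault("USER_AGENT", "web-loading-experiments/0.1")

print(os.environ["USER_AGENT"])
```

Adding `USER_AGENT=...` to the `.env` file should work as well, provided `load_dotenv()` is called before the loader is imported.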

## Basic Web Loading

Let's start with basic web page loading and see what we get.

In [4]:
# Load a simple web page
url = "https://graygrids.com/blog/product-hunt-alternatives"
loader = WebBaseLoader(url)

# Load the page
docs = loader.load()

# Print basic info
print(f"Number of documents: {len(docs)}")
print("\nMetadata from first document:")
print(docs[0].metadata)
print("\nFirst 500 characters of content:")
print(docs[0].page_content[:500])

Number of documents: 1

Metadata from first document:
{'source': 'https://graygrids.com/blog/product-hunt-alternatives', 'title': 'Top 11 Product Hunt Alternatives to Try in 2025 | GrayGrids', 'description': 'Discover Product Hunt Alternatives: Top Websites, Directories, and Communities to Successfully Launch Your Next Product, Startup, and To...', 'language': 'No language found.'}

First 500 characters of content:
Top 11 Product Hunt Alternatives to Try in 2025 | GrayGrids0Hours0Mins0Secs60% OFF on GrayGrids - Apply Coupon Code: BLFCM2024Explore All DealsTemplatesComponentsTailwind BuilderResourcesInspirationsImage CompressorMeta Tags GeneratorBrandsTailGridsUI Components, Blocks and Toolkit for Tailwind CSSTailAdminOpen-Source Tailwind CSS Admin Dashboard TemplateLineicons8400+ Line Icons for Designers and DevelopersFormBoldForm API and Backend Solution for All PlatformsPlainAdminFree Vanilla JS Multipur
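
The content above runs navigation and promotional text straight into the article text. As a rough illustration of the cleaning step, here is a standard-library sketch that skips text nested inside tags that usually hold page chrome. The tag list and the sample HTML are assumptions made for this example, not something taken from the loaded page:

```python
from html.parser import HTMLParser

# Tags whose text is usually page chrome rather than article content.
# This list is an assumption; adjust it for the site being loaded.
SKIP_TAGS = {"script", "style", "nav", "header", "footer"}


class MainTextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self) -> str:
        # One line per text node keeps block boundaries visible.
        return "\n".join(self._chunks)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up HTML for illustration only.
sample_html = """
<html><head><title>Demo</title><style>p {color: red}</style></head>
<body>
<nav><a href="/templates">Templates</a><a href="/components">Components</a></nav>
<h1>Product Hunt Alternatives</h1>
<p>BetaList is a launch directory.</p>
<footer>Copyright</footer>
</body></html>
"""
cleaned = extract_main_text(sample_html)
print(cleaned)
```

Joining text nodes with newlines also avoids the fused words seen above (for example `GrayGrids0Hours0Mins`), which helps when the text is later passed to an LLM.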


## Structured Data Extraction

Now let's extract structured data from web pages using Pydantic models.

In [17]:
from typing import List

from langchain_core.documents import Document
from pydantic import BaseModel, Field


# Competitor must be defined before CompetitorList references it
class Competitor(BaseModel):
    """A single competitor mentioned on the page."""
    name: str = Field(
        description="Name of the competitor"
    )
    description: str = Field(
        description="Brief description of the competitor, if available",
    )
    url: str = Field(
        description="URL of the competitor",
    )


class CompetitorList(BaseModel):
    """Container for all competitors extracted from one page."""
    competitors: List[Competitor]
    
def extract_structured_content(doc: Document) -> CompetitorList:
    """Extract structured content from a web page"""
    llm = get_llm()
    prompt = f"""Extract the following information from this web page content:
    - All links to top level domains that appear to be competitors to Product Hunt
    - For each competitor also map their name and a brief description if available.
    
    Content:
    {doc.page_content}
    """
    
    structured_llm = llm.with_structured_output(CompetitorList)
    return structured_llm.invoke(prompt)

# Try it out
structured_content = extract_structured_content(docs[0])

for comp in structured_content.competitors:
    print(comp.name)
    print(comp.url)
    print("--")

Resource.fyi
https://resource.fyi
--
Indie Hackers
https://indiehackers.com
--
BetaList
https://betalist.com
--
Reddit
https://www.reddit.com
--
Y Combinator (news.ycombinator.com)
https://news.ycombinator.com
--
Capterra
https://www.capterra.com
--
StartupStash
https://startupstash.com
--
LaunchingNext
https://launchingnext.com
--
ctrlalt
https://ctrlalt.com
--
SideProjectors
https://sideprojectors.com
--
SaaSworthy
https://saasworthy.com
--
BetaPage
https://betapage.co
--
Alternativeto
https://alternativeto.net
--
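
The prompt asks for top-level domains, but the model is free to return URLs with or without `www.` or with a path. A small post-processing sketch, using only the standard library, that reduces each URL to `https://host` and drops duplicates; the helper names and sample URLs are made up for this example:

```python
from urllib.parse import urlsplit


def normalize_site_url(url: str) -> str:
    """Reduce a URL to 'https://host', dropping 'www.', path, and query."""
    # Assume https when the scheme is missing so urlsplit can find the host.
    parts = urlsplit(url if "://" in url else f"https://{url}")
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return f"https://{host}" if host else ""


def dedupe_sites(urls):
    """Return normalized site URLs in first-seen order, without repeats."""
    seen = set()
    unique = []
    for url in urls:
        site = normalize_site_url(url)
        if site and site not in seen:
            seen.add(site)
            unique.append(site)
    return unique


# Illustrative inputs, not actual model output.
sample_urls = [
    "https://www.reddit.com",
    "https://reddit.com/r/startups",
    "https://betalist.com",
    "betalist.com/startups",
]
unique_sites = dedupe_sites(sample_urls)
print(unique_sites)
```

In the notebook this could be applied to `[comp.url for comp in structured_content.competitors]`. Subdomains such as `news.ycombinator.com` are kept as they are.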


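## Web Search with Tavily

For comparison, let's ask Tavily's answer tool the same question directly.

In [36]: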
from langchain_community.tools.tavily_search import TavilyAnswer

# TavilyAnswer returns a short, direct answer instead of a list of results
tavily_answer = TavilyAnswer()
query = "what are alternatives to Product Hunt"

# Search
answer = tavily_answer.invoke(query)

print(answer)

Based on the data provided, some alternatives to Product Hunt include LaunchingNext, The Startup Pitch, Softpedia, Starter Story, Betafy, BetaList, Maker Mag, Silicon.news, AI Site Hunt, Indiehackers.com, and Slant.
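
To see how far the two approaches agree, the names from both outputs above can be compared after a light normalization. This is only a sketch: the normalization rule (lowercase, drop a trailing domain suffix, strip non-alphanumerics) is an assumption and will miss looser matches.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase, drop a trailing domain suffix, and strip non-alphanumerics."""
    name = name.strip().lower()
    name = re.sub(r"\.(com|net|co|fyi|news)$", "", name)
    return re.sub(r"[^a-z0-9]", "", name)


# Names printed by the structured extraction cell above.
extracted_names = [
    "Resource.fyi", "Indie Hackers", "BetaList", "Reddit",
    "Y Combinator (news.ycombinator.com)", "Capterra", "StartupStash",
    "LaunchingNext", "ctrlalt", "SideProjectors", "SaaSworthy",
    "BetaPage", "Alternativeto",
]

# Names listed in the Tavily answer above.
tavily_names = [
    "LaunchingNext", "The Startup Pitch", "Softpedia", "Starter Story",
    "Betafy", "BetaList", "Maker Mag", "Silicon.news", "AI Site Hunt",
    "Indiehackers.com", "Slant",
]

shared = sorted(
    {normalize_name(n) for n in extracted_names}
    & {normalize_name(n) for n in tavily_names}
)
print(shared)
```

Only three names appear in both lists, which suggests the single blog post and the Tavily answer draw on largely different sources.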
