Website link: https://python.langchain.com/docs/integrations/document_loaders/#webpages

### LangChain documentation on **loading and processing web pages**:

---

### **1. Overview**
- **Web Pages**: Contain text, images, and multimedia, typically represented in HTML.  
- **Challenges**: Extracting content can be complex due to varying structures, dynamic content, and the need to handle HTML/JavaScript.  
- **LangChain Loaders**: Provide tools to fetch and process web content efficiently.

---

### **2. Supported Loaders**
LangChain supports multiple web page loaders, including:

#### **a. WebBaseLoader**
- **Description**: A general-purpose loader for fetching content from web pages.  
- **Use Case**: Extracting text content from static web pages (it fetches and parses HTML but does not execute JavaScript).  
- **Example**:
  ```python
  from langchain_community.document_loaders import WebBaseLoader
  loader = WebBaseLoader(web_paths=["https://example.com"])
  documents = loader.load()
  ```

#### **b. SeleniumURLLoader**
- **Description**: A loader that uses Selenium to fetch content from dynamic web pages (e.g., pages with JavaScript).  
- **Use Case**: Extracting content from web pages that require rendering or interaction.  
- **Example**:
  ```python
  from langchain_community.document_loaders import SeleniumURLLoader
  loader = SeleniumURLLoader(urls=["https://example.com"])
  documents = loader.load()
  ```

#### **c. PlaywrightURLLoader**
- **Description**: A loader that uses Playwright to fetch content from dynamic web pages.  
- **Use Case**: Similar to `SeleniumURLLoader`, but drives the browser through Playwright instead of Selenium.  
- **Example**:
  ```python
  from langchain_community.document_loaders import PlaywrightURLLoader
  loader = PlaywrightURLLoader(urls=["https://example.com"])
  documents = loader.load()
  ```

#### **d. RSSFeedLoader**
- **Description**: A loader for fetching content from RSS feeds.  
- **Use Case**: Extracting structured content (e.g., news articles, blog posts) from RSS feeds.  
- **Example**:
  ```python
  from langchain_community.document_loaders import RSSFeedLoader
  loader = RSSFeedLoader(urls=["https://example.com/feed"])
  documents = loader.load()
  ```

---

### **3. Choosing the Right Loader**
- **Static Web Pages**: Use `WebBaseLoader`.  
- **Dynamic Web Pages**: Use `SeleniumURLLoader` or `PlaywrightURLLoader`.  
- **RSS Feeds**: Use `RSSFeedLoader`.  
- **Custom Needs**: LangChain allows you to create custom loaders for specific use cases.
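
As a rough illustration of the custom-loader option, the sketch below mimics the loader shape (a `lazy_load()` generator that yields documents) using only the standard library. `SimpleDocument`, `_TextExtractor`, and `InlineHTMLLoader` are hypothetical names, not LangChain classes; a real custom loader would typically subclass LangChain's `BaseLoader` and yield its `Document` objects.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Dict, Iterator, List


@dataclass
class SimpleDocument:
    """Stand-in for a LangChain Document (hypothetical, for illustration)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script>/<style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


class InlineHTMLLoader:
    """Hypothetical loader over already-fetched HTML, keyed by source URL."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.pages = pages

    def lazy_load(self) -> Iterator[SimpleDocument]:
        for source, html in self.pages.items():
            parser = _TextExtractor()
            parser.feed(html)
            parser.close()
            yield SimpleDocument(" ".join(parser.parts), {"source": source})

    def load(self) -> List[SimpleDocument]:
        return list(self.lazy_load())


pages = {"https://example.com": "<h1>Title</h1><script>x=1</script><p>Body text</p>"}
docs = InlineHTMLLoader(pages).load()
print(docs[0].page_content)  # Title Body text
```

Keeping `load()` as a thin wrapper around `lazy_load()` means the same class serves both the eager and the incremental use.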

---

### **4. Additional Features**
- **Metadata Extraction**: Most loaders extract metadata (e.g., URL, title) along with the content.  
- **Lazy Loading**: Some loaders support lazy loading (`lazy_load()`), which is useful for processing large amounts of content incrementally.  
- **Customization**: Loaders can be customized to handle specific web page structures or extract specific elements.

---

### **5. Example Workflow**
Here’s an example of using `WebBaseLoader` to fetch content from multiple URLs:
```python
from langchain_community.document_loaders import WebBaseLoader

urls = ["https://example.com/page1", "https://example.com/page2"]
loader = WebBaseLoader(urls)
documents = loader.load()

for doc in documents:
    print(f"URL: {doc.metadata['source']}")
    print(f"Content: {doc.page_content[:200]}...")  # Print first 200 characters
```

---

### **6. Advanced Parsing**
- **Tool**: `UnstructuredLoader`.  
- **Use Case**: Extracts multiple `Document` objects per page, representing sections, lists, tables, etc.  
- **Example**:
  ```python
  from langchain_unstructured import UnstructuredLoader
  loader = UnstructuredLoader(web_url="https://example.com")
  docs = loader.load()
  ```
- **Granular Control**: Each `Document` object includes metadata (e.g., category, parent-child relationships) for advanced processing.

---

### **7. Extracting Specific Sections**
- **Method**: Use metadata (e.g., `category`, `parent_id`) to isolate specific sections (e.g., "Setup").  
- **Example**:
  ```python
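  from langchain_unstructured import UnstructuredLoader

  async def _get_setup_docs_from_url(url):
      loader = UnstructuredLoader(web_url=url)
      setup_docs = []
      parent_id = None
      async for doc in loader.alazy_load():
          # Remember the "Setup" heading, then collect the elements under it
          if doc.metadata.get("category") == "Title" and doc.page_content.startswith("Setup"):
              parent_id = doc.metadata.get("element_id")
          elif parent_id is not None and doc.metadata.get("parent_id") == parent_id:
              setup_docs.append(doc)
      return setup_docs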
  ```
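
The `parent_id` linkage can be tried without fetching a page. In the sketch below, plain dictionaries stand in for a document's content plus metadata; the sample text is made up, and it assumes each element carries an `element_id` that its children reference through `parent_id`. `section_children` is a hypothetical helper name.

```python
from typing import Dict, List

# Made-up elements in page order; each dict mimics content plus metadata.
elements: List[Dict[str, str]] = [
    {"element_id": "t1", "category": "Title", "text": "Overview"},
    {"element_id": "n1", "category": "NarrativeText", "text": "Intro.", "parent_id": "t1"},
    {"element_id": "t2", "category": "Title", "text": "Setup"},
    {"element_id": "n2", "category": "NarrativeText", "text": "Install it.", "parent_id": "t2"},
    {"element_id": "l1", "category": "ListItem", "text": "Set the API key.", "parent_id": "t2"},
]


def section_children(items: List[Dict[str, str]], heading: str) -> List[Dict[str, str]]:
    """Return elements whose parent_id points at the first matching Title."""
    parent_id = None
    children = []
    for item in items:
        if parent_id is None and item["category"] == "Title" and item["text"].startswith(heading):
            parent_id = item["element_id"]
        elif parent_id is not None and item.get("parent_id") == parent_id:
            children.append(item)
    return children


setup = section_children(elements, "Setup")
print([item["text"] for item in setup])  # ['Install it.', 'Set the API key.']
```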

---

### **8. Vector Search**
- **Use Case**: Index web content for retrieval-augmented generation (RAG).  
- **Example**:
  ```python
  from langchain_core.vectorstores import InMemoryVectorStore
  from langchain_openai import OpenAIEmbeddings
  vector_store = InMemoryVectorStore.from_documents(docs, OpenAIEmbeddings())
  retrieved_docs = vector_store.similarity_search("query", k=2)
  ```
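
To show what a similarity search does conceptually, here is a toy version that runs without an API key. Word counts stand in for real embeddings, and `embed`, `cosine`, and `similarity_search` are illustrative functions, not the `InMemoryVectorStore` implementation; the sample texts are made up.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_search(texts: List[str], query: str, k: int = 2) -> List[Tuple[float, str]]:
    """Rank texts by cosine similarity to the query and keep the top k."""
    query_vec = embed(query)
    scored = [(cosine(embed(text), query_vec), text) for text in texts]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


texts = [
    "Tata Motors share price rises",
    "Sitemap loaders fetch every listed page",
    "HDFC Bank share price today",
]
results = similarity_search(texts, "share price of Tata Motors", k=2)
print([text for _, text in results])
# ['Tata Motors share price rises', 'HDFC Bank share price today']
```

A real embedding model captures meaning beyond shared words, but the ranking step (score every document against the query, keep the top `k`) has the same shape.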

---

### **9. Key Takeaways**
- **Multiple Options**: LangChain provides a variety of web loaders to suit different needs.  
- **Dynamic Content**: Loaders like `SeleniumURLLoader` and `PlaywrightURLLoader` can handle dynamic web pages.  
- **Structured Data**: Loaders like `RSSFeedLoader` are optimized for structured content.  
- **Ease of Use**: Every loader returns `Document` objects through the same `load()` / `lazy_load()` interface, so they plug into the rest of a LangChain pipeline in the same way.  
- **Advanced Features**: Use `UnstructuredLoader` for granular control over web content and metadata.  
- **Integration**: Loaded documents can be indexed and used in downstream tasks like RAG.

---


## Webpage Loaders
- Load the web pages and extract their text with `WebBaseLoader`, which parses HTML using BeautifulSoup.
- Use an LLM to extract meaningful data from the page text.

In [1]:
from langchain_community.document_loaders import WebBaseLoader

urls = ['https://economictimes.indiatimes.com/markets/stocks/news',
        'https://www.livemint.com/latest-news',
        'https://www.livemint.com/latest-news/page-2',
        'https://www.livemint.com/latest-news/page-3',
        'https://www.moneycontrol.com/']

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [2]:
loader = WebBaseLoader(web_paths=urls)

In [3]:
docs = []
async for doc in loader.alazy_load():
    docs.append(doc)
print(docs)

In [4]:
def format_docs(docs):
    return "\n\n".join([x.page_content for x in docs])

In [5]:
context = format_docs(docs)

In [6]:
print(context)


Stocks in News Today - Latest News on Stocks, Stock in News | The Economic TimesBenchmarks Nifty23,813.4063.21FEATURED FUNDS★★★★★Canara Robeco Bluechip Equity Fund Regular-Growt..5Y Return17.33 %
                Invest NowFEATURED FUNDS★★★★★Canara Robeco Equity Hybrid Fund Direct-Growth5Y Return16.86 %
                Invest NowEnglish EditionEnglish Editionहिन्दीગુજરાતીमराठीবাংলাಕನ್ನಡമലയാളംதமிழ்తెలుగు | 28 December, 2024, 02:04 PM IST | Today's ePaper
            			        My Watchlist
                            SubscribeSign InJoin Value & Valuation MasterclassHomeETPrimeMarketsMarket DataNewsIndustryRisePoliticsWealthMFTechCareersOpinionNRIPanacheLuxuryVideosMore MenuStocksNewsLive BlogStock Live BlogEarningsPodcastMarket ClassroomDons of Dalal StreetRecosStock Reports PlusNewMy ScreenerCandlestick ScreenerStock ScreenerStock WatchMarket CalendarStock Price QuotesOptionsIPOs/FPOsExpert ViewsInvestment IdeasCommoditiesViewsNewsOthersMentha OilPrecious MetalsGold MGoldSilverGold Pet

In [7]:
import re

def text_clean(text):
    # \s+ matches spaces, tabs, and newlines, so one pass collapses
    # every run of whitespace to a single space.
    return re.sub(r'\s+', ' ', text)

In [8]:
context = text_clean(context)

In [9]:
print(context)

Stocks in News Today - Latest News on Stocks, Stock in News | The Economic TimesBenchmarks Nifty23,813.4063.21FEATURED FUNDS★★★★★Canara Robeco Bluechip Equity Fund Regular-Growt..5Y Return17.33 % Invest NowFEATURED FUNDS★★★★★Canara Robeco Equity Hybrid Fund Direct-Growth5Y Return16.86 % Invest NowEnglish EditionEnglish Editionहिन्दीગુજરાતીमराठीবাংলাಕನ್ನಡമലയാളംதமிழ்తెలుగు | 28 December, 2024, 02:04 PM IST | Today's ePaper My Watchlist SubscribeSign InJoin Value & Valuation MasterclassHomeETPrimeMarketsMarket DataNewsIndustryRisePoliticsWealthMFTechCareersOpinionNRIPanacheLuxuryVideosMore MenuStocksNewsLive BlogStock Live BlogEarningsPodcastMarket ClassroomDons of Dalal StreetRecosStock Reports PlusNewMy ScreenerCandlestick ScreenerStock ScreenerStock WatchMarket CalendarStock Price QuotesOptionsIPOs/FPOsExpert ViewsInvestment IdeasCommoditiesViewsNewsOthersMentha OilPrecious MetalsGold MGoldSilverGold PetalSilver MicroSilver MGold GuineaOil & EnergyNatural GasCrude OilCrude Oil MiniBase
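
Because `\s+` also matches newlines, `text_clean` flattens the text onto a single line. If paragraph breaks matter for later chunking, a variant along these lines might be preferable; `clean_keep_paragraphs` is a hypothetical name and not part of the notebook, and the sample string is made up.

```python
import re


def clean_keep_paragraphs(text: str) -> str:
    """Collapse spaces/tabs inside lines but keep blank-line paragraph breaks."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # 3+ newlines -> one blank line
    return text.strip()


sample = "Nifty  23,813\t\tup\n\n\n\n   Invest   Now  \nMy Watchlist"
print(repr(clean_keep_paragraphs(sample)))
# 'Nifty 23,813 up\n\nInvest Now\nMy Watchlist'
```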

#### Stock Market Data Processing with LLM

In [10]:
from scripts import llm

input_variables=['context', 'question'] input_types={} partial_variables={} messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], input_types={}, partial_variables={}, template='You are helpful AI assistant who answer user question based on the provided context.'), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='Answer user question based on the provided context ONLY! If you do not know the answer, just say "I don\'t know".\n            ### Context:\n            {context}\n\n            ### Question:\n            {question}\n\n            ### Answer:'), additional_kwargs={})]


In [12]:
context = context[:6999]

In [13]:
response = llm.ask_llm(context, "What is todays news?")
# response = llm.ask_llm(context, "Extract stock market news from the given text.")


In [14]:
print(response)

Today's news includes:

- 13 smallcap stocks offering double-digit returns in a flattish market week.
- Tata Motors Share Price increased by 1.31%.
- 12 stocks have record dates for dividends, bonuses, splits, or rights issues next week.
- 2025 stock picks from Religare Broking with up to 29% upside potential.
- 21 stocks showing bearish trends on MACD charts.
- Defence sector growth will be modest.
- NBFC, QSR stocks may recover in 2025.
- Mkt may see 13% annual return in 2025.
- Reduce LTCG taxes to boost participation.
- Time to allocate 20% capital to unlisted stocks.


In [15]:
context = format_docs(docs)

In [19]:
context = text_clean(context)

In [16]:
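def chunk_text(text, chunk_size, overlap=100):
    # Consecutive chunks share `overlap` characters, so the window
    # advances by chunk_size - overlap each step.
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
    return chunks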

In [22]:
chunks = chunk_text(context, 6999)

In [23]:
print(chunks)

["Stocks in News Today - Latest News on Stocks, Stock in News | The Economic TimesBenchmarks Nifty23,813.4063.21FEATURED FUNDS★★★★★Canara Robeco Bluechip Equity Fund Regular-Growt..5Y Return17.33 % Invest NowFEATURED FUNDS★★★★★Canara Robeco Equity Hybrid Fund Direct-Growth5Y Return16.86 % Invest NowEnglish EditionEnglish Editionहिन्दीગુજરાતીमराठीবাংলাಕನ್ನಡമലയാളംதமிழ்తెలుగు | 28 December, 2024, 02:04 PM IST | Today's ePaper My Watchlist SubscribeSign InJoin Value & Valuation MasterclassHomeETPrimeMarketsMarket DataNewsIndustryRisePoliticsWealthMFTechCareersOpinionNRIPanacheLuxuryVideosMore MenuStocksNewsLive BlogStock Live BlogEarningsPodcastMarket ClassroomDons of Dalal StreetRecosStock Reports PlusNewMy ScreenerCandlestick ScreenerStock ScreenerStock WatchMarket CalendarStock Price QuotesOptionsIPOs/FPOsExpert ViewsInvestment IdeasCommoditiesViewsNewsOthersMentha OilPrecious MetalsGold MGoldSilverGold PetalSilver MicroSilver MGold GuineaOil & EnergyNatural GasCrude OilCrude Oil MiniBa

In [25]:
len(chunks)

16
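
The chunk count follows from the step size: `chunk_text` advances by `chunk_size - overlap` characters, so a text of length `n` yields `ceil(n / (chunk_size - overlap))` chunks. The sketch below checks that on a short made-up string (the function is repeated in compact form so the block runs on its own) and derives the range of context lengths consistent with 16 chunks.

```python
import math


def chunk_text(text, chunk_size, overlap=100):
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Small made-up example: step = 4 - 1 = 3
demo = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(demo)  # ['abcd', 'defg', 'ghij', 'j']
assert len(demo) == math.ceil(10 / 3)

# With chunk_size=6999 and overlap=100 the step is 6899, so 16 chunks
# means 15 * 6899 < len(context) <= 16 * 6899.
step = 6999 - 100
low, high = 15 * step + 1, 16 * step
print(low, high)  # 103486 110384
```

Note that the last chunk can lie entirely inside the previous chunk's range, as `'j'` does here, so the final LLM call may repeat content already seen.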

In [24]:
question = "Extract stock market news from the given text."

chunk_summary = []
for chunk in chunks:
    response = llm.ask_llm(chunk, question)
    chunk_summary.append(response)

In [30]:
chunk_summary

['Here are the extracted stock market news from the given text:\n\n1. Mazagon Dock Share Price\n2. Sensex Today\n3. Transrail Lighting IPO Listing Today\n4. Mamata Machinery Share Price Live\n5. DAM Capital Advisors Share Price Live\n6. Sanathan Textiles IPO GMP\n7. HDFC Bank Share Price\n8. Vedanta Share Price\n9. Dixon Tech Share Price\n10. Jungle Camps India IPO GMP\n11. IT stocks to thrive in 2025\n12. Pharma stocks to thrive in 2025\n13. Defence sector growth will be modest\n14. NBFC stocks may recover in 2025\n15. Mkt may see 13% annual return in 2025\n16. Reduce LTCG taxes to boost participation\n17. Time to allocate 20% capital to unlisted stks\n18. 2025 stock picks from Religare Broking with up to 29% upside potential\n19. 6 fundamentally strong stocks across various sectors poised for robust performance in 2025 with up to 29% upside\n20. Tata Motors Share Price\n21. Intellect Design Arena and Amber Enterprises led 13 smallcap stocks delivering double-digit returns\n22. Mahind

In [29]:
len(chunk_summary)

16

In [33]:
for chunk in chunk_summary:
    print(chunk)
    print("\n\n")
    

Here are the extracted stock market news from the given text:

1. Mazagon Dock Share Price
2. Sensex Today
3. Transrail Lighting IPO Listing Today
4. Mamata Machinery Share Price Live
5. DAM Capital Advisors Share Price Live
6. Sanathan Textiles IPO GMP
7. HDFC Bank Share Price
8. Vedanta Share Price
9. Dixon Tech Share Price
10. Jungle Camps India IPO GMP
11. IT stocks to thrive in 2025
12. Pharma stocks to thrive in 2025
13. Defence sector growth will be modest
14. NBFC stocks may recover in 2025
15. Mkt may see 13% annual return in 2025
16. Reduce LTCG taxes to boost participation
17. Time to allocate 20% capital to unlisted stks
18. 2025 stock picks from Religare Broking with up to 29% upside potential
19. 6 fundamentally strong stocks across various sectors poised for robust performance in 2025 with up to 29% upside
20. Tata Motors Share Price
21. Intellect Design Arena and Amber Enterprises led 13 smallcap stocks delivering double-digit returns
22. Mahindra & Mahindra, Adani Port

In [31]:
summary = "\n\n".join(chunk_summary)

In [32]:
print(summary)

Here are the extracted stock market news from the given text:

1. Mazagon Dock Share Price
2. Sensex Today
3. Transrail Lighting IPO Listing Today
4. Mamata Machinery Share Price Live
5. DAM Capital Advisors Share Price Live
6. Sanathan Textiles IPO GMP
7. HDFC Bank Share Price
8. Vedanta Share Price
9. Dixon Tech Share Price
10. Jungle Camps India IPO GMP
11. IT stocks to thrive in 2025
12. Pharma stocks to thrive in 2025
13. Defence sector growth will be modest
14. NBFC stocks may recover in 2025
15. Mkt may see 13% annual return in 2025
16. Reduce LTCG taxes to boost participation
17. Time to allocate 20% capital to unlisted stks
18. 2025 stock picks from Religare Broking with up to 29% upside potential
19. 6 fundamentally strong stocks across various sectors poised for robust performance in 2025 with up to 29% upside
20. Tata Motors Share Price
21. Intellect Design Arena and Amber Enterprises led 13 smallcap stocks delivering double-digit returns
22. Mahindra & Mahindra, Adani Port

In [34]:
# question = "Write a detailed report in Markdown from the given context."
question = """Write a detailed market news report in markdown format. Think carefully then write the report."""
response = llm.ask_llm(summary, question)

In [39]:
import os 

os.makedirs('data',exist_ok=True)

with open('data/report.md', 'w', encoding='utf-8') as f:
    f.write(response)

In [42]:
with open("data/summary.md", "w", encoding="utf-8") as f:
    f.write(summary)
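
The steps above amount to a map-reduce pattern: summarize each chunk, join the partial summaries, then ask for a final report. A compact sketch follows, assuming a callable with the same `(context, question)` argument order as `llm.ask_llm`; `fake_llm` is a stand-in so the flow can be run without a model, and `map_reduce_report` is a hypothetical helper name.

```python
from typing import Callable, List


def map_reduce_report(
    chunks: List[str],
    ask: Callable[[str, str], str],
    map_question: str,
    reduce_question: str,
) -> str:
    """Summarize each chunk (map), then build one report from the summaries (reduce)."""
    partial = [ask(chunk, map_question) for chunk in chunks]
    combined = "\n\n".join(partial)
    return ask(combined, reduce_question)


def fake_llm(context: str, question: str) -> str:
    """Stand-in for a real model call: echoes the question and context size."""
    return f"[{question}] {len(context)} chars"


report = map_reduce_report(
    ["chunk one", "chunk two!"],
    fake_llm,
    map_question="Extract news",
    reduce_question="Write report",
)
print(report)
```

Passing the model call in as an argument keeps the pipeline testable and makes it easy to swap prompts or models later.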

### Site Map -- https://python.langchain.com/docs/integrations/document_loaders/sitemap/

`SitemapLoader` extends `WebBaseLoader`: it loads a sitemap from a given URL, then scrapes and loads every page listed in it, returning each page as a `Document`.

The scraping is done concurrently, with a default limit of 2 requests per second. If you control the scraped server or are not concerned about the load you place on it, you can raise this limit. Raising it speeds up scraping but may cause the server to block you, so be careful.
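
To make the rate limit concrete: 2 requests per second means consecutive requests start at least 0.5 s apart. Below is a minimal throttle sketch with the clock and sleep functions injectable, so it can be exercised without real waiting. `Throttle` is a hypothetical helper, not how `SitemapLoader` implements its limit internally.

```python
import time
from typing import Callable


class Throttle:
    """Spaces out calls so at most `requests_per_second` start per second."""

    def __init__(
        self,
        requests_per_second: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next request may start; return the time waited."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay


# Simulated clock: three back-to-back requests at t = 10.0
fake_now = [10.0]
throttle = Throttle(2.0, clock=lambda: fake_now[0], sleep=lambda s: None)
print([throttle.wait() for _ in range(3)])  # [0.0, 0.5, 1.0]
```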

In [43]:
from langchain_community.document_loaders.sitemap import SitemapLoader

## Load

In [45]:
sitemap_loader = SitemapLoader(web_path="https://api.python.langchain.com/sitemap.xml")


In [49]:
import nest_asyncio

# Apply nest_asyncio to allow nested event loops
nest_asyncio.apply()

docs = sitemap_loader.load()
docs[0]

Fetching pages: 100%|##########| 18/18 [00:01<00:00, 11.39it/s]


Document(metadata={'source': 'https://api.python.langchain.com/en/latest/', 'loc': 'https://api.python.langchain.com/en/latest/', 'lastmod': '2024-12-09T14:05:30.040082+00:00', 'changefreq': 'weekly', 'priority': '1'}, page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n')

In [50]:
print(docs[0].metadata)

{'source': 'https://api.python.langchain.com/en/latest/', 'loc': 'https://api.python.langchain.com/en/latest/', 'lastmod': '2024-12-09T14:05:30.040082+00:00', 'changefreq': 'weekly', 'priority': '1'}
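
The extra metadata keys (`loc`, `lastmod`, `changefreq`, `priority`) mirror the child elements of each `<url>` entry in the sitemap XML. The sketch below parses a small made-up sitemap with the standard library to show where those fields come from; it illustrates the file format only and is not `SitemapLoader`'s own parsing code.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Made-up sitemap content for illustration
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-12-09</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://example.com/docs/</loc>
  </url>
</urlset>"""


def parse_sitemap(xml_text: str) -> List[Dict[str, str]]:
    """Return one dict per <url> entry, keyed by child tag name."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.iter(f"{SITEMAP_NS}url"):
        entry = {child.tag.replace(SITEMAP_NS, ""): (child.text or "").strip() for child in url}
        entries.append(entry)
    return entries


for entry in parse_sitemap(SAMPLE):
    print(entry)
```

Only `loc` is required by the sitemap format, which is why the second entry produces a dictionary with a single key.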


In [51]:
sitemap_loader.requests_per_second = 2
# Optional: avoid `[SSL: CERTIFICATE_VERIFY_FAILED]` issue
sitemap_loader.requests_kwargs = {"verify": False}

## Lazy Load

In [53]:
# page = []
# for doc in sitemap_loader.lazy_load():
#     page.append(doc)
#     if len(page) >= 10:
#         # do some paged operation, e.g.
#         # index.upsert(page)
#
#         page = []
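
The paging idea in the cell above can be tried with any iterator. In this sketch a plain generator stands in for `sitemap_loader.lazy_load()`, and `batched` is a hypothetical helper; unlike the commented loop, it also yields the final partial page.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, including a final partial batch."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        page = list(islice(iterator, size))
        if not page:
            return
        yield page


def fake_lazy_load(n: int) -> Iterator[str]:
    """Stand-in for a loader's lazy_load(): yields documents one at a time."""
    for i in range(n):
        yield f"doc-{i}"


for page in batched(fake_lazy_load(25), size=10):
    # A real pipeline might index the page here, e.g. index.upsert(page)
    print(len(page))
```

Only one page of documents is held in memory at a time, which is the main reason to prefer `lazy_load()` over `load()` for large sitemaps.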

### Recursive URL  
The RecursiveUrlLoader lets you recursively scrape all child links from a root URL and parse them into Documents.

LangChain documentation:

---

### **1. Recursive URL Loader**
**Page**: [Recursive URL Loader](https://python.langchain.com/docs/integrations/document_loaders/recursive_url/)  
**Summary**:  
- **Purpose**: Fetches content from a URL and recursively follows links to extract content from linked pages.  
- **Use Case**: Extracting content from a website and its subpages.  
- **Example**:
  ```python
  from langchain_community.document_loaders import RecursiveUrlLoader
  loader = RecursiveUrlLoader(url="https://example.com", max_depth=2)
  documents = loader.load()
  ```
- **Key Features**:  
  - **`max_depth`**: Controls how deep the loader should follow links.  
  - **Customization**: Can exclude specific URLs or patterns.  
- **Applications**: Web scraping, content aggregation, and building knowledge bases.
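
The effect of a depth limit can be sketched on an in-memory link graph, with no network access. `SITE` and `crawl` are made up for illustration; here the root counts as depth 0 and only pages at depth below `max_depth` are visited, which is this sketch's own convention and may not match `RecursiveUrlLoader`'s exact semantics.

```python
from collections import deque
from typing import Dict, List

# Made-up link graph: page -> links found on that page
SITE: Dict[str, List[str]] = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/setup", "/"],
    "/blog": ["/blog/post-1"],
    "/docs/setup": ["/docs/setup/advanced"],
    "/blog/post-1": [],
    "/docs/setup/advanced": [],
}


def crawl(root: str, max_depth: int) -> List[str]:
    """Breadth-first crawl that visits pages at depth < max_depth (root is depth 0)."""
    visited = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue
        visited.append(page)
        for link in SITE.get(page, []):
            if link not in seen:  # avoid revisiting pages and looping on cycles
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


print(crawl("/", max_depth=2))  # ['/', '/docs', '/blog']
print(crawl("/", max_depth=3))
```

The `seen` set matters as much as the depth limit: `/docs` links back to `/`, and without it the crawl would fetch the root again.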

---

### **2. Sitemap Loader**
**Page**: [Sitemap Loader](https://python.langchain.com/docs/integrations/document_loaders/sitemap/)  
**Summary**:  
- **Purpose**: Fetches content from URLs listed in a website’s sitemap.  
- **Use Case**: Extracting structured content from websites with a sitemap.  
- **Example**:
  ```python
  from langchain_community.document_loaders import SitemapLoader
  loader = SitemapLoader(web_path="https://example.com/sitemap.xml")
  documents = loader.load()
  ```
- **Key Features**:  
  - **Efficiency**: Directly uses the sitemap to identify URLs, avoiding the need for crawling.  
  - **Filtering**: Can filter URLs based on patterns or criteria.  
- **Applications**: Content indexing, SEO analysis, and structured data extraction.
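
Pattern-based filtering can be sketched in plain Python. The function below keeps URLs that match at least one regular expression; its name and the sample URLs are illustrative. `SitemapLoader` exposes filtering through a `filter_urls` argument, and its exact matching behavior should be checked against the API reference.

```python
import re
from typing import Iterable, List


def filter_urls(urls: Iterable[str], patterns: List[str]) -> List[str]:
    """Keep URLs that match at least one regex pattern (anchored at the start)."""
    compiled = [re.compile(p) for p in patterns]
    return [url for url in urls if any(rx.match(url) for rx in compiled)]


urls = [
    "https://example.com/docs/intro",
    "https://example.com/blog/2024/news",
    "https://example.com/docs/api/loaders",
]
print(filter_urls(urls, [r"https://example\.com/docs/.*"]))
# ['https://example.com/docs/intro', 'https://example.com/docs/api/loaders']
```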

---

### **3. Unstructured File Loader**
**Page**: [Unstructured File Loader](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file/)  
**Summary**:  
- **Purpose**: Loads and processes unstructured text files (e.g., PDFs, Word documents, HTML).  
- **Use Case**: Extracting text from files with complex layouts or formats.  
- **Example**:
  ```python
  from langchain_community.document_loaders import UnstructuredFileLoader
  loader = UnstructuredFileLoader(file_path="example.pdf")
  documents = loader.load()
  ```
- **Key Features**:  
  - **Supports Multiple Formats**: Handles PDFs, Word documents, HTML, and more.  
  - **Granular Extraction**: Can extract sections, tables, and other structures.  
- **Applications**: Document analysis, content extraction, and data preprocessing.

---

### **4. Firecrawl Loader**
**Page**: [Firecrawl Loader](https://python.langchain.com/docs/integrations/document_loaders/firecrawl/)  
**Summary**:  
- **Purpose**: Fetches content from web pages using the Firecrawl service.  
- **Use Case**: Extracting content from dynamic or JavaScript-heavy web pages.  
- **Example**:
  ```python
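  from langchain_community.document_loaders import FireCrawlLoader
  # api_key is a placeholder; mode="scrape" loads just the given URL
  loader = FireCrawlLoader(api_key="YOUR_API_KEY", url="https://example.com", mode="scrape")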
  documents = loader.load()
  ```
- **Key Features**:  
  - **Dynamic Content**: Handles pages that require JavaScript rendering.  
  - **Ease of Use**: Simple integration with the Firecrawl service.  
- **Applications**: Web scraping, content aggregation, and dynamic page analysis.

---

### **Key Takeaways**
1. **Recursive URL Loader**: Fetches content from a URL and its linked pages, useful for scraping entire websites.  
2. **Sitemap Loader**: Extracts content from URLs listed in a sitemap, ideal for structured websites.  
3. **Unstructured File Loader**: Processes complex file formats (e.g., PDFs, Word documents) for text extraction.  
4. **Firecrawl Loader**: Handles dynamic or JavaScript-heavy web pages using the Firecrawl service.  

These loaders provide flexible and efficient ways to extract and process content from various sources, enabling applications like web scraping, content indexing, and document analysis.