***

### [Different types of Document Loaders](https://python.langchain.com/docs/how_to/#document-loaders)

#### [How to load web pages](https://python.langchain.com/docs/how_to/document_loader_web/)

***

## Webpage Loaders
- Load the webpage and extract the data using the `WebBaseLoader` and `BeautifulSoup` libraries.
- Use LLM to extract meaningful data from the webpage.

### Project 1: Share Market Data Analysis Based on Global Cues
- We will extract the data from the stock market website and analyze the data to understand the impact of global cues on the Indian share market.

#### Stock Market Data Extraction

In [1]:
from dotenv import load_dotenv,find_dotenv
load_dotenv(find_dotenv())

True

In [2]:
import re
from typing import List

from langchain_community.document_loaders import WebBaseLoader
from langchain.schema import Document

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [3]:
urls = [
    'https://economictimes.indiatimes.com/markets/stocks/news',
    'https://www.livemint.com/latest-news',
    'https://www.livemint.com/latest-news/page-2',
    'https://www.livemint.com/latest-news/page-3'
    'https://www.moneycontrol.com/'
]

In [4]:
loader = WebBaseLoader(web_path=urls)

In [5]:
loader

<langchain_community.document_loaders.web_base.WebBaseLoader at 0x11c0053fdf0>

In [None]:
docs = []

async for doc in loader.alazy_load():
    docs.append(doc)

In [7]:
docs

[Document(metadata={'source': 'https://economictimes.indiatimes.com/markets/stocks/news', 'title': 'Stocks in News Today - Latest News on Stocks, Stock in News | The Economic Times', 'description': 'Stocks in News - Find the latest Stocks in News on The Economic Times. Get Stocks Analysis, News on Stocks and more.', 'language': 'en'}, page_content="Stocks in News Today - Latest News on Stocks, Stock in News | The Economic TimesBenchmarks Nifty23,907.25557.35FEATURED FUNDS★★★★★Canara Robeco Infrastructure Direct-Growth5Y Return29.96 %\n                Invest NowFEATURED FUNDS★★★★★Canara Robeco Flexi Cap Fund Direct-Growth5Y Return20.06 %\n                Invest NowEnglish EditionEnglish Editionहिन्दीગુજરાતીमराठीবাংলাಕನ್ನಡമലയാളംதமிழ்తెలుగు | 23 November, 2024, 12:31 PM IST | Today's ePaper\n        \t\t\t        My Watchlist\n                        \n                        Subscribe\n                    Sign InHomeETPrimeMarketsMarket DataNewsIndustryRisePoliticsWealthMFTechCareersOpin

In [8]:
def format_docs(docs: List[Document], verbose: bool = False) -> str:
    """
    Formats a list of Document objects into a single string with their content joined by double newlines.

    Args:
        docs (List[Document]): List of Document objects to format.
        verbose (bool): If True, displays progress messages. Default is False.

    Returns:
        str: A single string containing the page content of all documents, separated by double newlines.

    Raises:
        ValueError: If the input is not a list or contains non-Document objects.
        Exception: If the list is empty or no valid content is found.
    """
    if not isinstance(docs, list):
        raise ValueError("The input 'docs' must be a list of Document objects.")

    if not all(isinstance(doc, Document) for doc in docs):
        raise ValueError("All elements in 'docs' must be instances of the Document class.")

    if verbose:
        print(f"Received {len(docs)} document(s) to format.")

    content_list = [doc.page_content for doc in docs if doc.page_content]

    if not content_list:
        raise Exception("No valid content found in the provided documents.")

    result = "\n\n".join(content_list)

    return result


In [9]:
context = format_docs(docs=docs)

In [10]:
print(context)

Stocks in News Today - Latest News on Stocks, Stock in News | The Economic TimesBenchmarks Nifty23,907.25557.35FEATURED FUNDS★★★★★Canara Robeco Infrastructure Direct-Growth5Y Return29.96 %
                Invest NowFEATURED FUNDS★★★★★Canara Robeco Flexi Cap Fund Direct-Growth5Y Return20.06 %
                Invest NowEnglish EditionEnglish Editionहिन्दीગુજરાતીमराठीবাংলাಕನ್ನಡമലയാളംதமிழ்తెలుగు | 23 November, 2024, 12:31 PM IST | Today's ePaper
        			        My Watchlist
                        
                        Subscribe
                    Sign InHomeETPrimeMarketsMarket DataNewsIndustryRisePoliticsWealthMFTechCareersOpinionNRIPanacheLuxuryVideosMore MenuStocksStock LiveblogNewsLive BlogEarningsPodcastMarket ClassroomDons of Dalal StreetRecosStock Reports PlusNewMy ScreenerCandlestick ScreenerStock ScreenerStock WatchMarket CalendarStock Price QuotesOptionsIPOs/FPOsExpert ViewsInvestment IdeasCommoditiesViewsNewsOthersMentha OilPrecious MetalsGold MGoldSilverGold PetalSilver

In [11]:

def clean_text(text: str, verbose: bool = False) -> str:
    """
    Cleans the input text by normalizing whitespace characters.
    - Reduces multiple newlines to a single newline.
    - Reduces multiple tabs to a single tab.
    - Replaces multiple spaces with a single space.

    Args:
        text (str): The input text to be cleaned.
        verbose (bool): If True, displays progress messages. Default is False.

    Returns:
        str: The cleaned and normalized text.

    Raises:
        ValueError: If the input is not a string.
    """
    if not isinstance(text, str):
        raise ValueError("The input 'text' must be a string.")

    if verbose:
        print("Starting text cleaning...")

    original_length = len(text)
    if verbose:
        print(f"Original text length: {original_length} characters.")

    # Reduce multiple newlines to a single newline
    text = re.sub(pattern="\n{2,}", repl="\n", string=text)
    if verbose:
        print("Normalized multiple newlines.")

    # Reduce multiple tabs to a single tab
    text = re.sub(pattern="\t{2,}", repl="\t", string=text)
    if verbose:
        print("Normalized multiple tabs.")

    # Replace multiple spaces with a single space, excluding newlines
    text = re.sub(pattern="[^\S\n]+", repl=" ", string=text)
    if verbose:
        print("Normalized multiple spaces (excluding newlines).")

    cleaned_length = len(text)
    if verbose:
        print(f"Cleaning complete. Cleaned text length: {cleaned_length} characters.")

    return text

In [None]:
context = clean_text(context,verbose=True)

Starting text cleaning...
Original text length: 56030 characters.
Normalized multiple newlines.
Normalized multiple tabs.
Normalized multiple spaces (excluding newlines).
Cleaning complete. Cleaned text length: 51673 characters.


In [13]:
print(context)

Stocks in News Today - Latest News on Stocks, Stock in News | The Economic TimesBenchmarks Nifty23,907.25557.35FEATURED FUNDS★★★★★Canara Robeco Infrastructure Direct-Growth5Y Return29.96 %
 Invest NowFEATURED FUNDS★★★★★Canara Robeco Flexi Cap Fund Direct-Growth5Y Return20.06 %
 Invest NowEnglish EditionEnglish Editionहिन्दीગુજરાતીमराठीবাংলাಕನ್ನಡമലയാളംதமிழ்తెలుగు | 23 November, 2024, 12:31 PM IST | Today's ePaper
 My Watchlist
 
 Subscribe
 Sign InHomeETPrimeMarketsMarket DataNewsIndustryRisePoliticsWealthMFTechCareersOpinionNRIPanacheLuxuryVideosMore MenuStocksStock LiveblogNewsLive BlogEarningsPodcastMarket ClassroomDons of Dalal StreetRecosStock Reports PlusNewMy ScreenerCandlestick ScreenerStock ScreenerStock WatchMarket CalendarStock Price QuotesOptionsIPOs/FPOsExpert ViewsInvestment IdeasCommoditiesViewsNewsOthersMentha OilPrecious MetalsGold MGoldSilverGold PetalSilver MicroSilver MGold GuineaOil & EnergyNatural GasCrude OilCrude Oil MiniBase MetalsAluminiumZinc MiniLead MiniCopp

In [38]:
from scripts import ask_llm

In [20]:
response = ask_llm(context=context,
                   question="Extract the stock market news from the given context.")

print(response)

The article discusses the current state of India's economy and provides information on IPOs, and stocks in India. The main points are that:

1. **Stock Prices:** Current stock prices for top Indian companies like Hero MotoCorp, HINDALCO Industry Insights
2. **Market Analysis:** NTPC Green Energy IPO allotment date: November 15th
3. **Market Analysis**: NSE Index rises: 2-day rise of 60 points from 64,000.50 to 64,040.60.
4. **Mint News:** Reliance Industries, JSW Steel to consider buyback offer.
5. **Stock Market Trends:** 
    * Indian stocks rally as Niti Aayog advises cautious stance on interest rates
6. **Market Research:**
    * UltraTech Cement rises 4% on plans to invest Rs 1,500 crores in plant expansion.
    * HCL Technologies Q2 net falls 14% on revenue drop due to economic downturn.
7. **Stock Market Trends:**
    * IndusInd Bank shares fall amid concerns over bank’s credit growth slowing down
    * Hero MotoCorp shares slide as company cuts production forecast


In [33]:
response = ask_llm(context=context[:10_000],
                   question="Extract stock market news from the given text.")

print(response)

Here are the extracted stock market news:

1. **36 smallcaps shine with double-digit gains**: Indian equity markets rebounded strongly this week, with the Sensex closing nearly 2% higher. Small-cap stocks led the rally.
2. **Most Adani stocks gain; Ambuja, ACC and Adani Ports offer value**: Sanghi Industries, Ambuja Cements and ACC rose 3-4% on Friday and were among the top gainers. Adani Enterprises, the group's flagship company, gained 2.2%.
3. **FPIs sell big in oil & gas and financial stocks, add IT**: So far in November, foreign investors sold shares worth ₹25,254.6 crore. This comes after foreign outflows hit record high of ₹1 lakh crore in October following rebound Chinese markets and disappointing corporate earnings season
4. **Sebi begins inquiry into alleged false statements by Adani Group**: The regulator has sought information from stock exchanges regarding Adani Green Energy's communications on US investigations, they said, and will consider law.
5. **Bull's Eye: know why 

In [34]:
def chunk_text(text: str, chunk_size: int, overlap: int = 100, verbose: bool = False) -> list:
    """
    Splits the input text into overlapping chunks of a specified size.

    Args:
        text (str): The input text to be chunked.
        chunk_size (int): The size of each chunk.
        overlap (int): The number of overlapping characters between chunks. Default is 100.
        verbose (bool): If True, displays progress messages. Default is False.

    Returns:
        list: A list of text chunks.

    Raises:
        ValueError: If `chunk_size` is not positive or less than `overlap`.
    """
    if not isinstance(text, str):
        raise ValueError("The input 'text' must be a string.")

    if chunk_size <= 0:
        raise ValueError("The 'chunk_size' must be a positive integer.")

    if overlap < 0:
        raise ValueError("The 'overlap' must be a non-negative integer.")

    if chunk_size <= overlap:
        raise ValueError("The 'chunk_size' must be greater than the 'overlap'.")

    if verbose:
        print(f"Starting chunking: text length = {len(text)}, chunk_size = {chunk_size}, overlap = {overlap}")

    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i + chunk_size]
        chunks.append(chunk)

        if verbose:
            print(f"Chunk created: start = {i}, end = {min(i + chunk_size, len(text))}, length = {len(chunk)}")

    if verbose:
        print(f"Chunking complete. Total chunks created: {len(chunks)}")

    return chunks


In [36]:
chunks = chunk_text(text=context,chunk_size=10_000)

In [37]:
chunks

["Stocks in News Today - Latest News on Stocks, Stock in News | The Economic TimesBenchmarks Nifty23,907.25557.35FEATURED FUNDS★★★★★Canara Robeco Infrastructure Direct-Growth5Y Return29.96 %\n Invest NowFEATURED FUNDS★★★★★Canara Robeco Flexi Cap Fund Direct-Growth5Y Return20.06 %\n Invest NowEnglish EditionEnglish Editionहिन्दीગુજરાતીमराठीবাংলাಕನ್ನಡമലയാളംதமிழ்తెలుగు | 23 November, 2024, 12:31 PM IST | Today's ePaper\n My Watchlist\n \n Subscribe\n Sign InHomeETPrimeMarketsMarket DataNewsIndustryRisePoliticsWealthMFTechCareersOpinionNRIPanacheLuxuryVideosMore MenuStocksStock LiveblogNewsLive BlogEarningsPodcastMarket ClassroomDons of Dalal StreetRecosStock Reports PlusNewMy ScreenerCandlestick ScreenerStock ScreenerStock WatchMarket CalendarStock Price QuotesOptionsIPOs/FPOsExpert ViewsInvestment IdeasCommoditiesViewsNewsOthersMentha OilPrecious MetalsGold MGoldSilverGold PetalSilver MicroSilver MGold GuineaOil & EnergyNatural GasCrude OilCrude Oil MiniBase MetalsAluminiumZinc MiniLead 

In [39]:
question = "Extract stock market news from the given text."
chunk_summary = []

for chunk in chunks:
    response = ask_llm(
        context = chunk,
        question = question
    )
    chunk_summary.append(response)

In [40]:
chunk_summary

["Here are some key points related to stock market news extracted from the provided text:\n\n*   **Small-cap stocks gain**: The domestically focused small-cap Russell 2000 index outperformed large-cap indexes and rose 1.8% for the week, closing at its highest in more than a week.\n*   **Adani stocks rise**: Most Adani stocks gained, with Ambuja Cements and ACC rising by 3-4% on Friday. Adani Enterprises, the group's flagship company, gained 2.2%.\n*   **FPIs sell big in oil & gas and financial stocks**: Foreign investors sold shares worth ₹25,254.6 crore so far in November, following a rebound in Chinese markets and a disappointing corporate earnings season.\n*   **Sebi begins inquiry into alleged false statements by Adani Group** The regulator has sought information from stock exchanges about Adani Green energy communications on US investigations and will consider law",
 'Based on the provided snippet, here is a summary of extracted stock market news:\n\n1. **Reliance Power** has incr

In [41]:
for chunk in chunk_summary:
    print(chunk)
    print("\n\n")
    break

Here are some key points related to stock market news extracted from the provided text:

*   **Small-cap stocks gain**: The domestically focused small-cap Russell 2000 index outperformed large-cap indexes and rose 1.8% for the week, closing at its highest in more than a week.
*   **Adani stocks rise**: Most Adani stocks gained, with Ambuja Cements and ACC rising by 3-4% on Friday. Adani Enterprises, the group's flagship company, gained 2.2%.
*   **FPIs sell big in oil & gas and financial stocks**: Foreign investors sold shares worth ₹25,254.6 crore so far in November, following a rebound in Chinese markets and a disappointing corporate earnings season.
*   **Sebi begins inquiry into alleged false statements by Adani Group** The regulator has sought information from stock exchanges about Adani Green energy communications on US investigations and will consider law





In [42]:
finall_summary = "\n\n".join(chunk_summary)

In [44]:
question = """Write a detailed market news report in markdown format. Think carefully then write the report."""
response = ask_llm(finall_summary, question)

In [45]:
print(response)

# Market News Report
## Small-cap Stocks Shine Amidst Turbulent Market

The domestic stock market witnessed a resurgence of small-cap stocks, with the Russell 2000 index outperforming large-cap indexes and rising by 1.8% for the week. This marked its highest close in over a week, indicating a positive trend in smaller companies.

## Adani Stocks on the Up
Most Adani group stocks gained significantly, with Ambuja Cements and ACC rising by 3-4%. Adani Enterprises, the flagship company of the group, saw a 2.2% increase in its share price. This marked a significant turnaround for the beleaguered conglomerate, which has been facing several challenges in recent times.

## Foreign Investors Sell Big
Foreign investors sold shares worth ₹25,254.6 crore so far in November, citing a rebound in Chinese markets and disappointing corporate earnings season. This was a clear indication of market volatility and investor sentiment.

## Regulatory Scrutiny
The Securities and Exchange Board of India (SEBI

In [46]:
import os

os.makedirs("./data",exist_ok=True)

with open("./data/report.md",'w') as f:
    f.write(response)

***

In [54]:
async def get_report(urls, chunk_question, final_question, save=False, verbose=False):
    """
    Generates a report by fetching data from web URLs, processing it into chunks, summarizing, 
    and answering a final question.

    Args:
        urls (list or str): URL(s) to fetch content from.
        chunk_question (str): The question to ask for summarizing each chunk.
        final_question (str): The final question to ask after summarizing all chunks.
        save (bool): If True, saves the final report to a file. Default is False.
        verbose (bool): If True, displays progress messages. Default is False.

    Returns:
        str or None: The final response if `save` is False; otherwise, saves the report and returns None.
    """
    if verbose:
        print("Initializing WebBaseLoader...")

    loader = WebBaseLoader(web_path=urls)
    docs = []

    if verbose:
        print("Loading documents asynchronously...")

    # Load documents asynchronously
    async for doc in loader.alazy_load():
        docs.append(doc)

    if verbose:
        print(f"Loaded {len(docs)} documents.")

    # Format and clean the document content
    context = format_docs(docs=docs)
    context = clean_text(context, verbose=verbose)

    # Chunk the cleaned text
    if verbose:
        print("Splitting the context into chunks...")
    chunks = chunk_text(text=context, chunk_size=10_000)

    if verbose:
        print(f"Created {len(chunks)} chunks.")

    # Summarize each chunk
    chunk_summary = []
    for idx, chunk in enumerate(chunks, start=1):
        if verbose:
            print(f"Summarizing chunk {idx}/{len(chunks)}...")
        response = ask_llm(context=chunk, question=chunk_question)
        chunk_summary.append(response)

    if verbose:
        print("Combining chunk summaries into a final summary...")

    # Combine all chunk summaries
    final_summary = "\n\n".join(chunk_summary)

    # Ask the final question
    if verbose:
        print("Asking the final question...")
    response = ask_llm(context=final_summary, question=final_question)

    # Save or return the report
    if save:
        if verbose:
            print("Saving the report to a file...")
        os.makedirs("./data", exist_ok=True)
        with open("./data/report.md", 'w') as f:
            f.write(response)
        if verbose:
            print("Report saved successfully!")
    else:
        if verbose:
            print("Returning the final response...")
        return response

In [57]:
import asyncio

chunk_question = "Extract stock market news from the given text."
final_question = "Write a detailed market news report in markdown format. Think carefully then write the report."

# Call the function using 'await' directly
response = await get_report(urls, chunk_question, final_question, save=False, verbose=True)

if response:
    print("Final Response:")
    print(response)

Initializing WebBaseLoader...
Loading documents asynchronously...
Loaded 4 documents.
Starting text cleaning...
Original text length: 55779 characters.
Normalized multiple newlines.
Normalized multiple tabs.
Normalized multiple spaces (excluding newlines).
Cleaning complete. Cleaned text length: 51443 characters.
Splitting the context into chunks...
Created 6 chunks.
Summarizing chunk 1/6...
Summarizing chunk 2/6...
Summarizing chunk 3/6...
Summarizing chunk 4/6...
Summarizing chunk 5/6...
Summarizing chunk 6/6...
Combining chunk summaries into a final summary...
Asking the final question...
Returning the final response...
Final Response:
**Market News Report**

### Top Gainers of the Past Three Sessions

As per ETMarkets' analysis, four stocks have consistently increased in price, trading volume, and delivery volume over the past three sessions (November 19-22). The top gainers include:

#### Sanghi Industries
*   Stock Price: Rose by 3-4% on Friday
*   Industry: Energy

#### Ambuja C