# **Web Page RAG**

Authored by [Kalyan KS](https://www.linkedin.com/in/kalyanksnlp/). For updates on LLMs, RAG, and agents, you can follow me on [Twitter](https://x.com/kalyan_kpl).

- Step-1 : Extract the web page text
- Step-2 : Chunk the extracted web page text
- Step-3 : Create a vector store with the extracted web page text chunks
- Step-4 : Create a retriever which will return the relevant chunks
- Step-5 : Build context from the relevant chunk texts
- Step-6 : Build the RAG chain using the RAG prompt, the LLM and a string output parser.
- Step-7 : Run the RAG chain to get the answer.

## **Install and import libraries**


In [None]:
!pip install -qU langchain langchain-community langchain-text-splitters
!pip install -qU langchain-openai langchain-chroma

In [None]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda



## **Set up LLM API Key**

- Save your OpenAI API key in Google Colab Secrets under the name `OPENAI_API_KEY`.

In [None]:
from google.colab import userdata
import os
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
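
Outside Colab, `google.colab.userdata` is not available. A minimal sketch of a fallback, assuming the key is either already exported as an environment variable or typed in interactively; the helper name `load_openai_key` is hypothetical and not part of the original notebook:

```python
import os
from getpass import getpass
from typing import MutableMapping


def load_openai_key(env: MutableMapping[str, str] = os.environ) -> str:
    """Return OPENAI_API_KEY from `env`, prompting for it only if it is missing."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        # Hypothetical fallback: ask interactively without echoing the key.
        key = getpass("Enter your OpenAI API key: ")
        env["OPENAI_API_KEY"] = key
    return key
```

Calling `load_openai_key()` once before creating the embedding model or the LLM should leave `os.environ` in the same state as the Colab cell above.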

## **Extract web page text**

In [None]:
from typing import List
from langchain_core.documents import Document

def wp_text(page_url: str) -> List[Document]:
    """
    Extracts text from the web page using WebBaseLoader.

    Parameters:
    page_url (str): The URL of the web page.

    Returns:
    List[Document]: A list of Document objects containing the text.
    """

    print("Web page text is extracted...")

    loader = WebBaseLoader(page_url)
    webpage_text = loader.load()

    return webpage_text

In [None]:
page_url = "https://x.ai/blog/grok-2"
webpage_text = wp_text(page_url)

Web page text is extracted...


In [None]:
print(webpage_text)

[Document(metadata={'source': 'https://x.ai/blog/grok-2', 'title': 'Grok-2 Beta Release', 'description': 'We announce our new Grok-2 and Grok-2 mini models.', 'language': 'en'}, page_content='Grok-2 Beta ReleaseGrokAPIBlogAboutCareersMenuAugust 13, 2024Grok-2 Beta ReleaseGrok-2 is our frontier language model with state-of-the-art reasoning capabilities. This release includes two members of the Grok family: Grok-2 and Grok-2 mini. Both models are now being released to Grok users on the 𝕏 platform.We are excited to release an early preview of Grok-2, a significant step forward from our previous model Grok-1.5, featuring frontier capabilities in chat, coding, and reasoning. At the same time, we are introducing Grok-2 mini, a small but capable sibling of Grok-2. An early version of Grok-2 has been tested on the LMSYS leaderboard under the name "sus-column-r." At the time of this blog post, it is outperforming both Claude 3.5 Sonnet and GPT-4-Turbo.Grok-2 and Grok-2 mini are current

In [None]:
print(f"Number of documents = {len(webpage_text)}")

Number of documents = 1


## **Chunk web page text**

In [None]:
def wp_chunk(webpage_text: List[Document]) -> List[Document]:
    """
    Splits extracted web page text into smaller chunks using RecursiveCharacterTextSplitter.

    Parameters:
    webpage_text (List[Document]): A list of Document objects containing extracted web page text.

    Returns:
    List[Document]: A list of chunked Document objects.
    """

    print("Web page text is chunked....")

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = text_splitter.split_documents(webpage_text)

    return chunks

In [None]:
chunks = wp_chunk(webpage_text)

Web page text is chunked....


In [None]:
print(f"Number of chunks = {len(chunks)}")

Number of chunks = 9
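
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts fixed-size character windows. This is not how `RecursiveCharacterTextSplitter` works internally (it splits on separators such as paragraph and line breaks and aims to keep each chunk at or below `chunk_size`); the function name `sliding_window_chunks` is hypothetical and used only for illustration:

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between consecutive window starts
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the notebook's settings (1000 and 100), each window in this sketch would start 900 characters after the previous one, so neighbouring chunks share 100 characters.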


In [None]:
print(chunks[0])

page_content='Grok-2 Beta ReleaseGrokAPIBlogAboutCareersMenuAugust 13, 2024Grok-2 Beta ReleaseGrok-2 is our frontier language model with state-of-the-art reasoning capabilities. This release includes two members of the Grok family: Grok-2 and Grok-2 mini. Both models are now being released to Grok users on the 𝕏 platform.We are excited to release an early preview of Grok-2, a significant step forward from our previous model Grok-1.5, featuring frontier capabilities in chat, coding, and reasoning. At the same time, we are introducing Grok-2 mini, a small but capable sibling of Grok-2. An early version of Grok-2 has been tested on the LMSYS leaderboard under the name "sus-column-r." At the time of this blog post, it is outperforming both Claude 3.5 Sonnet and GPT-4-Turbo.Grok-2 and Grok-2 mini are currently in beta on 𝕏, and we are also making both models available through our enterprise API later this month.Grok-2 language model and chat capabilitiesWe introduced an early version 

## **Create Vector Store**

In [None]:
# Set the chroma DB path
current_dir = "/content/rag"
persistent_directory = os.path.join(current_dir, "db", "chroma_db_wp")

In [None]:
def create_vector_store(chunks: List[Document], db_path: str) -> Chroma:
    """
    Creates a Chroma vector store from chunked documents.

    Parameters:
    chunks (List[Document]): A list of chunked Document objects.
    db_path (str): The directory path to persist the vector store.

    Returns:
    Chroma: A Chroma vector store containing the embedded documents.
    """

    print("Chroma vector store is created...\n")

    embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
    db = Chroma.from_documents(documents=chunks, embedding=embedding_model, persist_directory=db_path)

    return db

In [None]:
db = create_vector_store(chunks, persistent_directory)

Chroma vector store is created...



## **Retrieve relevant chunks**

In [None]:
def retrieve_context(db: Chroma, query: str) -> List[Document]:
    """
    Retrieves relevant document chunks from the Chroma vector store based on a query.

    Parameters:
    db (Chroma): The Chroma vector store containing embedded documents.
    query (str): The query string to search for relevant document chunks.

    Returns:
    List[Document]: A list of retrieved relevant document chunks.
    """

    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 2})
    print("Relevant chunks are retrieved...\n")
    relevant_chunks = retriever.invoke(query)

    return relevant_chunks

In [None]:
query = 'What is Grok 2?'

relevant_chunks = retrieve_context(db, query)

Relevant chunks are retrieved...



In [None]:
print(f"Number of relevant chunks = {len(relevant_chunks)}")

Number of relevant chunks = 2
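
Similarity search boils down to ranking chunk embeddings by their closeness to the query embedding and keeping the top `k`. A minimal sketch with plain Python lists, assuming cosine similarity as the metric (the notebook does not state which metric the Chroma collection uses); `cosine_similarity` and `top_k` are hypothetical helpers, and real embeddings would replace the toy vectors you pass in:

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], chunk_vecs: Sequence[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]
```

The default `k=2` mirrors the `search_kwargs={"k": 2}` setting used in `retrieve_context`.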


In [None]:
for i, chunk in enumerate(relevant_chunks):
    print(f"Chunk-{i}")
    print(chunk)
    print("\n")

Chunk-0
page_content='Grok-2 Beta ReleaseGrokAPIBlogAboutCareersMenuAugust 13, 2024Grok-2 Beta ReleaseGrok-2 is our frontier language model with state-of-the-art reasoning capabilities. This release includes two members of the Grok family: Grok-2 and Grok-2 mini. Both models are now being released to Grok users on the 𝕏 platform.We are excited to release an early preview of Grok-2, a significant step forward from our previous model Grok-1.5, featuring frontier capabilities in chat, coding, and reasoning. At the same time, we are introducing Grok-2 mini, a small but capable sibling of Grok-2. An early version of Grok-2 has been tested on the LMSYS leaderboard under the name "sus-column-r." At the time of this blog post, it is outperforming both Claude 3.5 Sonnet and GPT-4-Turbo.Grok-2 and Grok-2 mini are currently in beta on 𝕏, and we are also making both models available through our enterprise API later this month.Grok-2 language model and chat capabilitiesWe introduced an early 

## **Build context**

In [None]:
def build_context(relevant_chunks: List[Document]) -> str:
    """
    Builds a context string from retrieved relevant document chunks.

    Parameters:
    relevant_chunks (List[Document]): A list of retrieved relevant document chunks.

    Returns:
    str: A concatenated string containing the content of the relevant chunks.
    """

    print("Context is built from relevant chunks")
    context = "\n\n".join([chunk.page_content for chunk in relevant_chunks])

    return context

In [None]:
context = build_context(relevant_chunks)

Context is built from relevant chunks


In [None]:
print(context)

Grok-2 Beta ReleaseGrokAPIBlogAboutCareersMenuAugust 13, 2024Grok-2 Beta ReleaseGrok-2 is our frontier language model with state-of-the-art reasoning capabilities. This release includes two members of the Grok family: Grok-2 and Grok-2 mini. Both models are now being released to Grok users on the 𝕏 platform.We are excited to release an early preview of Grok-2, a significant step forward from our previous model Grok-1.5, featuring frontier capabilities in chat, coding, and reasoning. At the same time, we are introducing Grok-2 mini, a small but capable sibling of Grok-2. An early version of Grok-2 has been tested on the LMSYS leaderboard under the name "sus-column-r." At the time of this blog post, it is outperforming both Claude 3.5 Sonnet and GPT-4-Turbo.Grok-2 and Grok-2 mini are currently in beta on 𝕏, and we are also making both models available through our enterprise API later this month.Grok-2 language model and chat capabilitiesWe introduced an early version of Grok-2


## **Combine all the steps into one function**

In [None]:
import os
from typing import Dict
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

def get_context(inputs: Dict[str, str]) -> Dict[str, str]:
    """
    Creates or loads a vector store for the web page text and retrieves relevant chunks based on a query.

    Args:
        inputs (Dict[str, str]): A dictionary containing the following keys:
            - 'page_url' (str): Web page URL
            - 'query' (str): The user query.
            - 'db_path' (str): Path to the vector database.

    Returns:
        Dict[str, str]: A dictionary containing:
            - 'context' (str): Extracted relevant context.
            - 'query' (str): The user query.
    """
    page_url, query, db_path = inputs['page_url'], inputs['query'], inputs['db_path']

    # Create new vector store if it does not exist
    if not os.path.exists(db_path):
        print("Creating a new vector store...\n")
        webpage_text = wp_text(page_url)
        chunks = wp_chunk(webpage_text)
        db = create_vector_store(chunks, db_path)

    # Load the existing vector store
    else:
        print("Loading the existing vector store\n")
        embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
        db = Chroma(persist_directory=db_path, embedding_function=embedding_model)

    relevant_chunks = retrieve_context(db, query)
    context = build_context(relevant_chunks)

    return {'context': context, 'query': query}
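
`get_context` decides whether to create or load the store only by checking if `db_path` exists, so passing a different `page_url` with the same `db_path` would load the store built for the earlier page. One possible workaround, sketched here as an assumption rather than as part of the original notebook, is to derive the path from the URL; `db_path_for_url` is a hypothetical helper:

```python
import hashlib
import os


def db_path_for_url(base_dir: str, page_url: str) -> str:
    """Build a store path that is stable for one URL and distinct across URLs."""
    # A short SHA-256 prefix keeps the directory name readable.
    digest = hashlib.sha256(page_url.encode("utf-8")).hexdigest()[:16]
    return os.path.join(base_dir, "db", f"chroma_db_wp_{digest}")
```

The returned path could then be passed as `db_path` in the input dictionary, so that each web page gets its own persisted store.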

## **Build RAG chain**

In [None]:
template = """ You are an AI model trained for question answering. You should answer the
  given question based on the given context only.
  Question : {query}
  \n
  Context : {context}
  \n
  If the answer is not present in the given context, respond as: The answer to this question is not available
  in the provided content.
  """

rag_prompt = ChatPromptTemplate.from_template(template)

llm = ChatOpenAI(model='gpt-4o-mini')

str_parser = StrOutputParser()

rag_chain = (
    RunnableLambda(get_context)
    | rag_prompt
    | llm
    | str_parser
)
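
The `|` operator chains the steps so that each step's output becomes the next step's input. The sketch below mimics that data flow with plain Python functions; it is not LangChain's implementation, and `pipe`, `fake_get_context`, `fake_prompt`, and `fake_llm` are hypothetical stand-ins that make no API calls:

```python
from functools import reduce
from typing import Any, Callable, Dict


def pipe(*steps: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Compose steps left to right: pipe(f, g)(x) == g(f(x))."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def fake_get_context(inputs: Dict[str, str]) -> Dict[str, str]:
    # Stand-in for get_context: returns a fixed context instead of querying Chroma.
    return {"context": "Grok-2 is a language model.", "query": inputs["query"]}


def fake_prompt(values: Dict[str, str]) -> str:
    # Stand-in for rag_prompt: fills the two template variables.
    return f"Question: {values['query']}\nContext: {values['context']}"


def fake_llm(prompt: str) -> str:
    # Stand-in for the LLM: reports the prompt length instead of generating text.
    return f"(model reply to a {len(prompt)}-character prompt)"


# str.strip plays the role of the string output parser in this toy chain.
toy_chain = pipe(fake_get_context, fake_prompt, fake_llm, str.strip)
```

Calling `toy_chain({"query": "What is Grok 2?"})` passes the dictionary through the four steps in order, which is the same shape of data flow that `rag_chain.invoke(...)` follows.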

## **Run RAG chain**

In [None]:
# Set the chroma DB path
current_dir = "/content/rag"
persistent_directory = os.path.join(current_dir, "db", "chroma_db_wp")

In [None]:
# Web page URL
page_url = "https://x.ai/blog/grok-2"

In [None]:
# Write the query
query = 'What is Grok 2?'

In [None]:
answer = rag_chain.invoke({'page_url':page_url, 'query':query, 'db_path':persistent_directory})

Loading the existing vector store

Relevant chunks are retrieved...

Context is built from relevant chunks


In [None]:
print(f"Query:{query}\n")
print(f"Generated answer:{answer}")

Query:What is Grok 2?

Generated answer:Grok-2 is a frontier language model that features state-of-the-art reasoning capabilities, representing a significant advancement over its predecessor, Grok-1.5. It is designed for chat, coding, and reasoning tasks, and it is currently available in beta on the 𝕏 platform. Alongside Grok-2, a smaller version named Grok-2 mini is also released. Grok-2 has been tested and shown to outperform other models like Claude 3.5 Sonnet and GPT-4-Turbo in various performance metrics, including graduate-level science knowledge, general knowledge, and math problems, as well as excelling in vision-based tasks.
