TASK-2 : **Chat with Website Using RAG Pipeline**

**Overview**

The goal is to implement a Retrieval-Augmented Generation (RAG) pipeline that lets users
query structured and unstructured data extracted from websites. The system crawls and scrapes
website content, converts it into embeddings, and stores the text and embeddings in a database
(here, a SQLite table that serves as a simple vector store). Users can then query the system and
receive accurate, context-rich responses generated by a selected LLM.


***Installing the necessary packages***

In [None]:
!pip install requests
!pip install beautifulsoup4
!pip install sentence-transformers
!pip install torch torchvision torchaudio
!pip install transformers




In [None]:
#Importing the packages
import requests
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer, util
import sqlite3
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

#Defining the models
# Constants
EMBEDDING_MODEL_NAME = 'all-MiniLM-L6-v2'  # SentenceTransformer for embeddings
LLM_MODEL_NAME = 'bigscience/bloom-560m'  # Hugging Face model for response generation
DATABASE_NAME = 'embeddings.db'

# Step 1: Crawl and Scrape Website
def crawl_and_scrape(url):
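    # A timeout keeps a slow or unresponsive site from hanging the notebook
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Fail early on HTTP errors (4xx/5xx)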
    soup = BeautifulSoup(response.text, 'html.parser')
    texts = [p.get_text().strip() for p in soup.find_all('p') if p.get_text().strip()]
    return texts

#Explanation:
#Purpose: Fetches the page at the given URL and extracts its paragraph text. Links are not followed, so only that one page is scraped.
#Steps:
#1.requests.get(url) downloads the page.
#2.BeautifulSoup(response.text, 'html.parser') parses the HTML content.
#3.A list comprehension collects the text of every <p> tag and filters out empty or whitespace-only paragraphs.
#Output: A list of non-empty text strings from the <p> tags of the webpage.



# Step 2: Generate Embeddings
def generate_embeddings(texts, model):
    return model.encode(texts, convert_to_tensor=True)

#Explanation:
#Purpose: Converts the list of text strings into high-dimensional vectors (embeddings) using a pre-trained SentenceTransformer model.
#Steps:
#1.model.encode() generates an embedding for each text.
#2.convert_to_tensor=True ensures the output is a PyTorch tensor.
#Output: A tensor with one row per input text. Each row is that text's embedding (384 dimensions for all-MiniLM-L6-v2), a point in vector space.


# Step 3: Store Embeddings in SQLite Database
def setup_database():
    conn = sqlite3.connect(DATABASE_NAME)
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS embeddings (
                    id INTEGER PRIMARY KEY,
                    text TEXT,
                    embedding BLOB
                )''')
    conn.commit()
    conn.close()

#Explanation:
#Purpose: Prepares a SQLite database to store texts and their embeddings persistently.

#Steps:
#1.Connects to (or creates) a SQLite database file named DATABASE_NAME.
#2.Creates a table embeddings, if it does not already exist (CREATE TABLE IF NOT EXISTS), with columns:
#  id: An integer primary key.
#  text: The scraped text.
#  embedding: The binary representation of the embedding.
#3.Commits the change and closes the connection.
#Output: A database ready to store texts and embeddings.



def store_embeddings(texts, embeddings):
    conn = sqlite3.connect(DATABASE_NAME)
    c = conn.cursor()
    for text, embedding in zip(texts, embeddings):
        # Move to CPU first: .numpy() raises an error on a tensor that lives on the GPU
        blob = embedding.cpu().numpy().tobytes()
        c.execute("INSERT INTO embeddings (text, embedding) VALUES (?, ?)", (text, blob))
    conn.commit()
    conn.close()

#Explanation:
#Purpose: Stores texts and their embeddings in the SQLite database.
#Steps:
#1.Opens a connection to the database.
#2.Iterates over texts and their corresponding embeddings.
#3.Converts each embedding to raw bytes (tobytes()) for storage in the BLOB column.
#4.Inserts each text and its embedding into the embeddings table.
#Output: The database is populated with texts and their embeddings. Rows are appended on every call, so scraping the same page twice stores its texts twice.

# Step 4: Query Handling and Cosine Similarity
def retrieve_relevant_chunks(query, model):
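    # Keep the query embedding on the CPU so it can be compared with the embeddings loaded from SQLite
    query_embedding = model.encode(query, convert_to_tensor=True).cpu()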
    conn = sqlite3.connect(DATABASE_NAME)
    c = conn.cursor()
    c.execute("SELECT text, embedding FROM embeddings")
    rows = c.fetchall()
    conn.close()

    texts = []
    similarities = []
    for text, embedding_blob in rows:
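        # Copy the BLOB into a writable buffer and decode it as float32; the extra torch.tensor() wrapper is not needed
        embedding = torch.frombuffer(bytearray(embedding_blob), dtype=torch.float32)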
        similarity = util.pytorch_cos_sim(query_embedding, embedding)[0][0].item()
        texts.append(text)
        similarities.append(similarity)

    # Sort by similarity and return top chunks
    sorted_texts = [text for _, text in sorted(zip(similarities, texts), reverse=True)]
    return sorted_texts[:5]  # Return top 5 chunks

#Explanation:
#Purpose: Finds the most relevant pieces of text from the database for a given query using cosine similarity.
#Steps:
#1.Encodes the query into an embedding.
#2.Fetches all stored texts and embeddings from the database.
#3.Decodes each binary embedding using torch.frombuffer and calculates its cosine similarity with the query embedding using util.pytorch_cos_sim.
#4.Sorts texts based on similarity scores in descending order.
#5.Returns the top 5 most relevant texts.
#Output: A list of up to 5 text chunks, most similar first.



# Step 5: Response Generation
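def generate_response(retrieved_chunks, query, model, tokenizer):
    # Join the chunks into plain text instead of placing the Python list repr in the prompt
    context = "\n".join(retrieved_chunks)
    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    # max_new_tokens bounds only the generated text, independent of the prompt length
    outputs = model.generate(**inputs, max_new_tokens=200, num_beams=3, no_repeat_ngram_size=2)
    # Decode only the newly generated tokens so the prompt is not echoed in the response
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()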

#Explanation:
#Purpose: Uses a language model to generate a response to the query based on the retrieved relevant texts.
#Steps:
#1.Constructs a prompt combining the retrieved texts (retrieved_chunks) and the query.
#2.Tokenizes the prompt using tokenizer.
#3.Generates a response using the language model (model.generate()), with parameters:
#  max_new_tokens=200: generates at most 200 tokens beyond the prompt.
#  num_beams=3: uses beam search with 3 beams.
#  no_repeat_ngram_size=2: prevents any 2-token sequence from appearing twice.
#4.Decodes the generated tokens into a readable string.
#Output: A string representing the generated response.


# Main Workflow
def main():
    # Models
    embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME)
    llm_model = AutoModelForCausalLM.from_pretrained(LLM_MODEL_NAME)
    llm_tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL_NAME)

    # Setup database
    setup_database()

    # Crawl and scrape websites
    urls = ["https://www.uchicago.edu/", "https://www.washington.edu/"]
    for url in urls:
        print(f"Crawling and scraping: {url}")
        texts = crawl_and_scrape(url)
        print(f"Found {len(texts)} texts. Generating embeddings...")

        embeddings = generate_embeddings(texts, embedding_model)
        store_embeddings(texts, embeddings)
        print(f"Stored {len(texts)} embeddings.")

    # Query and respond
    user_query = input("Enter your question: ")
    print("Retrieving relevant chunks...")
    relevant_chunks = retrieve_relevant_chunks(user_query, embedding_model)

    print("Generating response...")
    response = generate_response(relevant_chunks, user_query, llm_model, llm_tokenizer)
    print(f"Response:\n{response}")

if __name__ == "__main__":
    main()

#Explanation:
#1.Loads the embedding and language models.
#2.Sets up the database.
#3.Scrapes and stores website content into the database.
#4.Accepts a query from the user.
#5.Retrieves relevant texts for the query.
#6.Generates a response using the LLM.

Crawling and scraping: https://www.uchicago.edu/
Found 1 texts. Generating embeddings...
Stored 1 embeddings.
Crawling and scraping: https://www.washington.edu/
Found 15 texts. Generating embeddings...
Stored 15 embeddings.
Enter your question: What major achievements have come from Stanford University's research programs?
Retrieving relevant chunks...


Generating response...
Response:
Context: ['David Baker, professor of biochemistry at the UW School of Medicine in Seattle, received the 2024 Nobel Prize in Chemistry. Nobel Week wove stately traditions with imaginative recognitions.', 'David Baker, professor of biochemistry at the UW School of Medicine in Seattle, received the 2024 Nobel Prize in Chemistry. Nobel Week wove stately traditions with imaginative recognitions.', 'David Baker, professor of biochemistry at the UW School of Medicine in Seattle, received the 2024 Nobel Prize in Chemistry. Nobel Week wove stately traditions with imaginative recognitions.', 'David Baker, professor of biochemistry at the UW School of Medicine in Seattle, received the 2024 Nobel Prize in Chemistry. Nobel Week wove stately traditions with imaginative recognitions.', 'David Baker, professor of biochemistry at the UW School of Medicine in Seattle, received the 2024 Nobel Prize in Chemistry. Nobel Week wove stately traditions with imaginative recognit
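The logged response above begins with the prompt itself, with the context rendered as a Python list literal. This is what results when a list is formatted directly into the prompt string and the full output sequence is decoded, since the output of `generate()` for a causal language model contains the prompt tokens followed by the new tokens. The sketch below illustrates one way to handle both points. It uses plain Python lists as stand-ins for token IDs so that it runs without a model. `build_prompt`, `strip_prompt`, and `max_context_chars` are hypothetical names that are not part of the notebook, and the example sentences are made up.

```python
def build_prompt(chunks, query, max_context_chars=1500):
    """Join chunks as plain text lines and cap the context length.

    The character cap is a crude stand-in (an assumption of this sketch);
    truncating by token count would be more precise.
    """
    context = "\n".join(chunks)[:max_context_chars]
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"


def strip_prompt(output_ids, prompt_length):
    """Keep only the token IDs generated after the prompt."""
    return output_ids[prompt_length:]


# Made-up chunks, for illustration only.
prompt = build_prompt(["UW is in Seattle.", "UChicago is in Chicago."], "Where is UW?")
print(prompt)

# Stand-in token IDs: the first three represent the prompt, the rest the answer.
prompt_ids = [101, 102, 103]
output_ids = prompt_ids + [7, 8, 9]
print(strip_prompt(output_ids, len(prompt_ids)))  # [7, 8, 9]
```

With the Hugging Face objects used in this notebook, the equivalent slice would be `outputs[0][inputs["input_ids"].shape[1]:]`, decoded with `tokenizer.decode(...)`.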

Output:

Query:

Enter your question: What major achievements have come from Stanford University's research programs?

Answer:

The Stanford Research Institute (SRI) is the largest research organization in the United States. The SRI is responsible for the development and implementation of the National Institutes of Health's (NIH) National Center for Biotechnology Information (NCBI) and National Science Foundation (NSF) funded research projects. In addition to the NIH and NSF funded projects, the Institute also has a number of non-funded projects that are funded by other organizations, such as the U.S. Department of Energy (DOE), the European Union (EU), and the World Health Organization (WHO). The Institute is also a member of several international research consortia, including the International Union of Pure and Applied Chemistry (IUPAC), International Society for Chemical Engineering (ISCE), The International Association of Chemical Engineers (IAE); the American Chemical Society (ACS), American Society of Testing and Materials (ASTM), National Academy of Sciences (NAS).

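This is the answer the model generates for the query above. Note that the scraped pages (uchicago.edu and washington.edu) are not about Stanford, so the retrieved context is unrelated to the question and the answer appears not to draw on it.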
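The retrieved context in the log contains the same paragraph five times. One plausible cause, assumed here rather than confirmed, is that the notebook was run more than once against the same `embeddings.db`: `store_embeddings` appends rows on every call, so each run stores the scraped texts again. A minimal sketch of one way to make storage idempotent follows, assuming a `UNIQUE` constraint on the `text` column together with `INSERT OR IGNORE`. The function names `setup_database_unique` and `store_embeddings_once` are hypothetical, and an in-memory database stands in for the file.

```python
import sqlite3


def setup_database_unique(conn):
    """Create the embeddings table with a UNIQUE constraint on text."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS embeddings (
               id INTEGER PRIMARY KEY,
               text TEXT UNIQUE,
               embedding BLOB
           )"""
    )
    conn.commit()


def store_embeddings_once(conn, texts, blobs):
    """Insert (text, blob) pairs, skipping any text that is already stored."""
    conn.executemany(
        "INSERT OR IGNORE INTO embeddings (text, embedding) VALUES (?, ?)",
        zip(texts, blobs),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for embeddings.db
setup_database_unique(conn)

texts = ["first paragraph", "second paragraph"]
blobs = [b"\x00\x00\x80?", b"\x00\x00\x00@"]  # little-endian float32 1.0 and 2.0

store_embeddings_once(conn, texts, blobs)
store_embeddings_once(conn, texts, blobs)  # simulate running the notebook twice

row_count = conn.execute("SELECT COUNT(*) FROM embeddings").fetchone()[0]
print(row_count)  # 2
```

Because `CREATE TABLE IF NOT EXISTS` leaves an existing table untouched, a database file created without the constraint would need to be deleted or migrated before this takes effect.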

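Step 4 ranks the stored chunks by cosine similarity after decoding each BLOB back into a float32 vector. The sketch below mirrors that logic using only the standard library (`array` and `math` in place of PyTorch and sentence-transformers), so the ranking can be checked by hand. The helper names and the two-dimensional vectors are made up for illustration. Real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
import math
from array import array


def to_blob(vector):
    """Pack floats as raw float32 bytes, like the BLOB column holds."""
    return array("f", vector).tobytes()


def from_blob(blob):
    """Unpack raw float32 bytes back into a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return list(values)


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, rows, k=5):
    """rows: iterable of (text, blob). Returns up to k texts, most similar first."""
    scored = [(cosine_similarity(query_vector, from_blob(blob)), text)
              for text, blob in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up 2-D vectors standing in for real embeddings.
rows = [
    ("about sports", to_blob([0.0, 1.0])),
    ("about chemistry", to_blob([1.0, 0.0])),
    ("about biology", to_blob([1.0, 1.0])),
]
print(rank_chunks([1.0, 0.0], rows, k=2))  # ['about chemistry', 'about biology']
```

Every stored row is scored on each query, as in `retrieve_relevant_chunks`, so the cost grows linearly with the number of rows.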

