<a href="https://colab.research.google.com/github/yashika-ishi/CSI_Assignments_2025/blob/main/Week8_Assignment_RAG_Q%26A_chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Celebal Technologies**
*Celebal Summer Internship (CSI)*
<br>
WEEK-8
<br>
Assignment : RAG Q&A chatbot
<br>
Description:
<br>
1. RAG Q&A chatbot using document retrieval and generative AI for intelligent response generation (any lightweight model from Hugging Face may be used, or a licensed LLM such as OpenAI, Claude, Grok, or Gemini if free credits are available)
    <br>Resources:<br>
https://www.kaggle.com/datasets/sonalisingh1411/loan-approval-prediction?select=Training+Dataset.csv

***By: Yashika***

# **Step 1: Installing Required Libraries**

In [1]:
pip install langchain sentence-transformers faiss-cpu transformers torch


In [3]:
pip install langchain-community


# **Step 2: Importing Libraries**

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# **Step 3: Document Collection**

In [7]:
# Document Collection (Example Text)
document_content = """
The Amazon rainforest is the largest rainforest in the world, covering an area of approximately 6.7 million square kilometers. It spans across nine countries, with the majority of it located in Brazil. The Amazon is incredibly biodiverse, home to an estimated 10% of the world's known species. It plays a crucial role in regulating global climate patterns due to its vast carbon absorption capacity, often referred to as the "lungs of the Earth." Deforestation is a major threat to the Amazon, driven by cattle ranching, agriculture, and logging. Protecting the Amazon is vital for climate stability and biodiversity preservation.
"""


# **Step 4: Text Preprocessing: Chunking**

In [8]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    add_start_index=True,
)
texts = text_splitter.create_documents([document_content])

print(f"Number of chunks created: {len(texts)}")
# for i, chunk in enumerate(texts):
#     print(f"Chunk {i+1}:\n{chunk.page_content}\n---")

Number of chunks created: 1
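
Only one chunk is produced because the sample document is shorter than `chunk_size=1000` characters, so there is nothing to split. To see how `chunk_size` and `chunk_overlap` interact on longer text, here is a minimal sketch assuming a plain fixed-width sliding window. The `sliding_chunks` helper is hypothetical and written only for illustration; `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph breaks and spaces, so its real chunk boundaries will differ.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


# Synthetic inputs, invented for illustration only
short_text = "a" * 600
long_text = "".join(str(i % 10) for i in range(2500))

print(len(sliding_chunks(short_text, 1000, 200)))
print([len(c) for c in sliding_chunks(long_text, 1000, 200)])
```

With these settings each window starts 800 characters after the previous one, so neighbouring chunks share 200 characters. That overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.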


# **Step 5: Text Preprocessing: Embedding**

In [9]:
# Using a small, efficient sentence-transformer model for embeddings
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)
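
The embedding model turns each chunk (and later each question) into a fixed-length vector, and retrieval then compares vectors instead of raw text. A common way to compare them is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real `all-MiniLM-L6-v2` vectors have 384 dimensions, and the function name `cosine_similarity` is my own, not part of LangChain.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration (not real model output)
query_vec = [1.0, 0.0, 1.0]
related_vec = [2.0, 0.0, 2.0]    # same direction as the query
unrelated_vec = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query_vec, related_vec))
print(cosine_similarity(query_vec, unrelated_vec))
```

Vectors pointing the same way score close to 1 regardless of their length, while orthogonal vectors score 0, which is why cosine similarity is a popular measure for sentence embeddings.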

# **Step 6: Vector Store/Database (FAISS)**

In [10]:
print("Creating FAISS vector store...")
vectorstore = FAISS.from_documents(texts, embeddings)
print("FAISS vector store created.")

Creating FAISS vector store...
FAISS vector store created.
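
Conceptually, a flat (exhaustive) vector index answers a query by measuring the distance from the query vector to every stored vector and returning the closest ones. The sketch below shows that idea in plain Python with squared Euclidean distance. It is an illustration under that assumption, not FAISS's actual implementation, and the names `l2_squared` and `search` are hypothetical.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(index_vectors, query, k):
    """Return up to k (position, distance) pairs, nearest first."""
    scored = sorted((l2_squared(vec, query), i) for i, vec in enumerate(index_vectors))
    return [(i, dist) for dist, i in scored[:k]]


# Toy 2-dimensional "embeddings", invented for illustration
stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
query = [0.9, 1.2]

print(search(stored, query, k=2))
# An index holding a single vector can return at most one hit, even with k=2
print(search(stored[:1], query, k=2))
```

The second call mirrors this notebook's situation: with only one chunk in the store, a retriever asked for the top 2 results can surface at most one source.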


# **Step 7: Generative AI (Lightweight Hugging Face LLM)**

In [11]:
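from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

# distilgpt2 is a very small base model (not instruction-tuned), so it often struggles with coherent Q&A.
# Note: "sshleifer/tiny-gpt2" is an even smaller checkpoint intended mainly for testing, so it is unlikely to do better.
# For usable local RAG answers, an instruction-tuned model such as Mistral-7B-Instruct-v0.1 or
# Llama-2-7b-chat-hf is preferable, given a GPU with enough VRAM (roughly 8 GB or more, typically with quantization).
model_name = "distilbert/distilgpt2"

print(f"Loading generative model: {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Create a text-generation pipeline.
# max_new_tokens caps the output so a small model cannot ramble on with irrelevant text.
# By default this pipeline returns the prompt followed by the continuation,
# so the chain's "result" will begin with the prompt text.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100,       # Limit output length for small models
    temperature=0.7,          # Below 1.0 sharpens the token distribution
    do_sample=True,           # Sample instead of greedy decoding
    top_p=0.95,               # Nucleus sampling threshold
    repetition_penalty=1.1,   # Mildly discourage repeated tokens
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token, so reuse EOS
)

# Wrap the pipeline so LangChain can use it as an LLM
llm = HuggingFacePipeline(pipeline=pipe)
print("Generative model loaded.")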

Loading generative model: distilbert/distilgpt2...


Device set to use cpu


Generative model loaded.
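
The pipeline above samples its output (`do_sample=True`) using `temperature=0.7` and `top_p=0.95`. To make those two knobs concrete, here is a minimal sketch of the textbook formulation: divide the logits by the temperature before the softmax, then keep only the most probable tokens whose cumulative probability reaches `top_p`. The logits are invented, the helper names are hypothetical, and this is not the Transformers implementation; `repetition_penalty` is left out for brevity.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature < 1 sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the highest-probability tokens until their mass reaches top_p, then renormalize."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token_id: p / total for token_id, p in kept}


# Invented logits for a 4-token vocabulary
logits = [2.0, 1.0, 0.0, -3.0]
probs = softmax_with_temperature(logits, temperature=0.7)
candidates = top_p_filter(probs, top_p=0.95)

# Draw one token id from the filtered, renormalized distribution
next_token = random.choices(list(candidates), weights=list(candidates.values()))[0]
print(candidates, next_token)
```

Lower temperatures and lower `top_p` values both make the output more predictable; for a model as small as distilgpt2, sampling settings can reduce rambling but cannot make up for the model's limited ability to follow the prompt.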


# **Step 8: RAG Chain Setup**

In [12]:
# Create a retriever from the vectorstore
retriever = vectorstore.as_retriever(search_kwargs={"k": 2}) # Retrieve top 2 relevant chunks

# Create the RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff", # "stuff" combines all retrieved documents into one prompt
    retriever=retriever,
    return_source_documents=True # To see which documents were used
)
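
The `"stuff"` chain type simply places every retrieved chunk into a single prompt together with the question. A minimal sketch of that assembly step follows. The template wording is modeled on the prompt that appears in the chatbot output in Step 9; the trailing `Helpful Answer:` line and the `build_stuff_prompt` helper are my assumptions for illustration, not code taken from LangChain.

```python
# Template text is an assumption modeled on the prompt echoed in this notebook's output
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)


# Hypothetical retrieved chunks, shortened for illustration
retrieved = [
    "Deforestation is a major threat to the Amazon.",
    "The Amazon spans across nine countries.",
]
prompt = build_stuff_prompt(retrieved, "What threatens the Amazon?")
print(prompt)
```

Because everything goes into one prompt, this approach only works while the combined chunks fit in the model's context window, which is a real constraint for small models such as distilgpt2.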

# **Step 9: Q&A Chatbot Loop**

In [13]:
print("\n--- RAG Q&A Chatbot ---")
print("Type 'exit' to quit.")

while True:
    query = input("\nYour question: ")
    if query.strip().lower() == 'exit':  # Ignore surrounding whitespace and letter case
        print("Goodbye!")
        break

    print("Searching for relevant information and generating response...")
    response = qa_chain.invoke({"query": query})

    print("\nChatbot:", response["result"])

    if response["source_documents"]:
        print("\n--- Sources Used ---")
        for i, doc in enumerate(response["source_documents"]):
            print(f"Source {i+1}:")
            print(doc.page_content[:200] + "...") # Print first 200 chars of source
            # You might also want to print doc.metadata if available (e.g., page number, file name)
    else:
        print("No specific source documents found for this query within the context.")


--- RAG Q&A Chatbot ---
Type 'exit' to quit.

Your question: What are the main threats to the Amazon rainforest?
Searching for relevant information and generating response...

Chatbot: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

The Amazon rainforest is the largest rainforest in the world, covering an area of approximately 6.7 million square kilometers. It spans across nine countries, with the majority of it located in Brazil. The Amazon is incredibly biodiverse, home to an estimated 10% of the world's known species. It plays a crucial role in regulating global climate patterns due to its vast carbon absorption capacity, often referred to as the "lungs of the Earth." Deforestation is a major threat to the Amazon, driven by cattle ranching, agriculture, and logging. Protecting the Amazon is vital for climate stability and biodiversity preservation.

Question: What are
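
The chatbot's reply begins by repeating the whole prompt (instructions, context, and question), because a text-generation pipeline returns the prompt followed by the continuation unless told otherwise. One way to show only the model's answer is to cut the result at the marker that ends the prompt. The sketch below assumes that marker is `Helpful Answer:`, which is an assumption on my part since the output above is cut off before that point; `extract_answer` is a hypothetical helper.

```python
def extract_answer(result, marker="Helpful Answer:"):
    """Return the text after the first occurrence of marker, or the whole text if absent."""
    _, found, tail = result.partition(marker)
    return tail.strip() if found else result.strip()


# Made-up strings for illustration (not real model output)
echoed = "Use the following pieces of context...\n\nQuestion: Example?\nHelpful Answer: An example reply."
print(extract_answer(echoed))
print(extract_answer("No marker in this text."))
```

In the loop this could be applied as `extract_answer(response["result"])` before printing. An alternative worth trying is passing `return_full_text=False` when building the text-generation pipeline, which asks the pipeline to return only the newly generated text.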