## **DocChat RAG Bot — LangChain + Chroma + Gemini**

**Goal:**
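Create a chatbot that answers questions from uploaded PDF documents. LangChain handles loading and chunking, a Hugging Face sentence-transformer model produces the embeddings, a vector database stores them (Pinecone in this notebook; Chroma and FAISS are imported as alternatives), and Gemini generates the answers.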

### Import Dependencies

In [1]:
import os
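# Loaders and vector stores live in langchain_community; the langchain.* paths are deprecated
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma, FAISS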
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain.chains import RetrievalQA
from dotenv import load_dotenv
from langchain_huggingface import HuggingFaceEmbeddings

load_dotenv()




True

### Set Up Gemini API Key

In [19]:
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')
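
`os.getenv` returns `None` when a variable is missing, so a forgotten key only surfaces later as an authentication error. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_env` (it is not part of any library used here):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper for this notebook, not a library function.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value


# Usage (assumes load_dotenv() has already run):
# GOOGLE_API_KEY = require_env("GOOGLE_API_KEY")
```

The same helper could be reused for `PINECONE_API_KEY` further down.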

### Load Your Document

In [21]:
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader

In [22]:
def load_pdf(data):
    loader = DirectoryLoader(data,
                    glob="*.pdf",
                    loader_cls=PyPDFLoader)
    
    documents = loader.load()

    return documents

In [23]:
extracted_data = load_pdf("../data/")
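
Before loading, it can help to confirm which files the `*.pdf` pattern will pick up. With the loader's default settings this pattern should match only files directly inside the directory, not in subfolders. The sketch below mirrors that with the standard library; `list_pdfs` is a hypothetical helper, not a LangChain function:

```python
from pathlib import Path


def list_pdfs(data_dir: str) -> list[Path]:
    """List the PDF files directly inside data_dir (non-recursive), sorted by name."""
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Data directory not found: {root}")
    return sorted(root.glob("*.pdf"))


# Usage (path taken from the cell above):
# for pdf in list_pdfs("../data/"):
#     print(pdf.name)
```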

### Split Text into Chunks

In [24]:
def text_split(extracted_data):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 20)
    text_chunks = text_splitter.split_documents(extracted_data)

    return text_chunks

In [25]:

text_chunks = text_split(extracted_data)
print("Number of chunks:", len(text_chunks))

Number of chunks: 2524
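
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunks will differ. `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 20) -> list[str]:
    """Split text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is how overlap keeps a sentence that straddles a boundary retrievable from either side.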


In [26]:

def download_hugging_face_embeddings():
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    return embeddings

In [27]:
embeddings = download_hugging_face_embeddings()

In [28]:

embeddings

HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, query_encode_kwargs={}, multi_process=False, show_progress=False)

In [29]:

query_result = embeddings.embed_query("Hello world")
print("Length", len(query_result))

Length 384
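
Each text becomes a 384-dimensional vector, and retrieval ranks stored vectors by how close they are to the query vector. The Pinecone index used later in this notebook reports `'metric': 'cosine'`, so a pure-Python sketch of cosine similarity may make the scores easier to interpret (`cosine_similarity` here is an illustrative function, not the one Pinecone runs internally):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen for this sketch: a zero vector is similar to nothing.
    return dot / (norm_a * norm_b)


# Usage with the embeddings object from the cells above:
# v1 = embeddings.embed_query("fiscal deficit")
# v2 = embeddings.embed_query("government borrowing")
# print(cosine_similarity(v1, v2))
```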


### Pinecone

In [30]:
from pinecone import Pinecone
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')


In [32]:
pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index("rag-chatbot-economy")

In [43]:
from sentence_transformers import SentenceTransformer

# Load the same model you used to create your index's data
# Example: 'all-MiniLM-L6-v2' has a dimension of 384
model = SentenceTransformer('all-MiniLM-L6-v2') 

# Your query text
query_text = "Tell me about UPSCprep"

# Generate the embedding. This will be a list of 384 numbers.
query_embedding = model.encode(query_text).tolist() 

# print(len(query_embedding)) # This should print 384
# print(query_embedding[:5])  # Look at the first few numbers

In [44]:
query_response = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True
)

# Print the score and text of each match
for match in query_response['matches']:
    print(f"Score: {match['score']:.4f}")
    if 'text' in match['metadata']:
        print(f"Text: {match['metadata']['text']}")
    print("-" * 20)

Score: 0.4454
Text: Weekly current affairs sessions.
Prelims + Mains coverage 
(Basics to Advanced)
Prelims and Mains Test Series
with evaluation.
UPSC ESSENTIALS 
+ MENTORSHIP 2024
Enroll Now : courses.upscprep.com
₹ 18499/-PRICE
Simplify your UPSC journey 
with expert mentors
Abhijeet Yadav, Co-Founder
( AIR 653, CSE 2017 )
Adv.Shashank Ratnoo, Co-Founder 
( AIR 688, CSE 2015 )
--------------------
Score: 0.4454
Text: Weekly current affairs sessions.
Prelims + Mains coverage 
(Basics to Advanced)
Prelims and Mains Test Series
with evaluation.
UPSC ESSENTIALS 
+ MENTORSHIP 2024
Enroll Now : courses.upscprep.com
₹ 18499/-PRICE
Simplify your UPSC journey 
with expert mentors
Abhijeet Yadav, Co-Founder
( AIR 653, CSE 2017 )
Adv.Shashank Ratnoo, Co-Founder 
( AIR 688, CSE 2015 )
--------------------
Score: 0.3780
Text: and knowledge economy. This includes initiatives to 

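In [ ]:
# Embed the chunks and upsert them into the Pinecone index.
# Run this once to populate the index; re-running it adds the same chunks again.
from langchain_pinecone import PineconeVectorStore

docsearch = PineconeVectorStore.from_documents(
    documents=text_chunks,
    embedding=embeddings,
    index_name="rag-chatbot-economy"
)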
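
The query output above returns the same passage twice with an identical score, which suggests the index holds duplicate chunks (possibly because a page repeats across the source PDFs; that cause is an assumption). A minimal sketch of dropping exact duplicates before indexing, using a hypothetical helper `dedupe_by_text`:

```python
import hashlib
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def dedupe_by_text(items: Iterable[T], get_text: Callable[[T], str]) -> list[T]:
    """Keep the first item for each distinct text, comparing after whitespace normalization."""
    seen: set[str] = set()
    unique: list[T] = []
    for item in items:
        # Collapse runs of whitespace so trivially different copies hash the same.
        normalized = " ".join(get_text(item).split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(item)
    return unique


# Usage with LangChain documents, which expose their text as `page_content`:
# unique_chunks = dedupe_by_text(text_chunks, lambda doc: doc.page_content)
```

Deduplicating means the `top_k` slots are not spent on repeated passages.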

### RAG Chain

In [33]:
from langchain_pinecone import PineconeVectorStore

In [34]:
import os
from dotenv import load_dotenv
from langchain_pinecone import PineconeVectorStore
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate



In [45]:
import os
import google.generativeai as genai
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer


PINECONE_INDEX_NAME = "rag-chatbot-economy"
# Important: Use the same model you used for creating embeddings in your DB
EMBEDDING_MODEL_NAME = 'all-MiniLM-L6-v2'

In [46]:
# Initialize Pinecone
try:
    pc = Pinecone(api_key=PINECONE_API_KEY)
    index = pc.Index(PINECONE_INDEX_NAME)
    print("✅ Successfully connected to Pinecone index.")
    print(index.describe_index_stats())
except Exception as e:
    print(f"❌ Error connecting to Pinecone: {e}")

# Initialize the embedding model
try:
    embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME)
    print("✅ Embedding model loaded successfully.")
except Exception as e:
    print(f"❌ Error loading embedding model: {e}")

# Initialize the Gemini model
try:
    genai.configure(api_key=GOOGLE_API_KEY)
    llm = genai.GenerativeModel('gemini-2.0-flash')
    print("✅ Gemini model loaded successfully.")
except Exception as e:
    print(f"❌ Error configuring Gemini: {e}")

✅ Successfully connected to Pinecone index.
{'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 2524}},
 'total_vector_count': 2524,
 'vector_type': 'dense'}
✅ Embedding model loaded successfully.
✅ Gemini model loaded successfully.
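
The stats show an index dimension of 384, which has to match the length of every query vector. A small guard like the following sketch can catch a mismatched embedding model before the query is sent (`check_dimension` is a hypothetical helper, and the expected value is passed in explicitly rather than read from the stats object):

```python
def check_dimension(vector: list[float], expected_dim: int) -> None:
    """Raise a descriptive error if a query vector does not match the index dimension."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions but the index expects {expected_dim}; "
            "make sure the query uses the same embedding model as the indexed data."
        )


# Usage (384 is the 'dimension' value printed above):
# check_dimension(embedding_model.encode("test query").tolist(), expected_dim=384)
```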


In [47]:
def get_rag_response(user_query):
    """
    Takes a user query and returns a response from Gemini based on Pinecone context.
    """
    # Step 1: Retrieve from Pinecone
    query_embedding = embedding_model.encode(user_query).tolist()
    query_results = index.query(
        vector=query_embedding,
        top_k=3,
        include_metadata=True
    )
    
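    # Skip matches whose metadata has no 'text' field instead of raising a KeyError
    context_chunks = [
        match['metadata']['text']
        for match in query_results['matches']
        if 'text' in match['metadata']
    ]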
    context_string = "\n\n".join(context_chunks)

    # Step 2: Augment the Prompt for Gemini
    prompt_template = f"""
    You are a professional, helpful assistant for the UPSC exam. Your goal is to provide clear, concise, and well-formatted answers based on the provided context.

    Please answer the user's question using only the information given in the following context.
    If the information is not available in the context, politely state that you don't have enough information to answer.

    When you formulate the answer, follow these rules:
    - Synthesize the information into a helpful response. Do not just copy-paste from the context.
    - Use bullet points for lists or to break down key points.
    - Use bolding to highlight important terms.
    - Maintain a positive and professional tone.

    CONTEXT:
    {context_string}

    QUESTION:
    {user_query}

    ANSWER:
    """

    # Step 3: Generate the Response with Gemini
    try:
        response = llm.generate_content(prompt_template)
        return response.text
    except Exception as e:
        return f"An error occurred while generating the response: {e}"
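
The retrieval scores printed earlier are fairly low (about 0.38 to 0.45), so it may be worth filtering weak matches before they reach the prompt. The sketch below builds the context string from matches with an optional score threshold. It assumes the matches are plain dictionaries shaped like the ones printed earlier; Pinecone's response objects may need converting first. Both `build_context` and `min_score` are hypothetical names.

```python
def build_context(matches: list[dict], min_score: float = 0.0, separator: str = "\n\n") -> str:
    """Join the text of matches that have a 'text' metadata field and meet the score threshold."""
    texts = []
    for match in matches:
        metadata = match.get("metadata") or {}
        text = metadata.get("text")
        if text and match.get("score", 0.0) >= min_score:
            texts.append(text.strip())
    return separator.join(texts)


example_matches = [
    {"score": 0.9, "metadata": {"text": "Club goods are excludable."}},
    {"score": 0.1, "metadata": {"text": "Unrelated advertisement."}},
    {"score": 0.8, "metadata": {}},  # No text stored for this vector
]
print(build_context(example_matches, min_score=0.5))
# → Club goods are excludable.
```

An empty result is a useful signal in its own right: the caller can skip the model call and report that no relevant context was found.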


In [48]:
print("\n--- UPSC Chatbot ---")
print("Ask a question about the documents you've stored. Type 'exit' to quit.")

while True:
    user_input = input("\nYou: ")
    if user_input.lower() == 'exit':
        print("Goodbye!")
        break
    
    # Get the response from our RAG function
    ai_response = get_rag_response(user_input)
    print(f"\nAI: {ai_response}")


--- UPSC Chatbot ---
Ask a question about the documents you've stored. Type 'exit' to quit.

AI: **Club goods** are those that are **excludable** but **non-rivalrous**.

Here's a breakdown of the key characteristics:

*   **Excludable:** Access can be restricted to those who pay or meet membership criteria.
*   **Non-rivalrous:** One person's consumption does not reduce the availability of the good for others.

Some examples of club goods are:

*   Cable television
*   Private golf courses
*   Subscription-based services like Netflix


AI: Here's a breakdown of the types of goods discussed:

*   **Consumer Goods:** These are purchased by individuals or households for personal use to fulfill needs and desires. Examples include food, clothing, electronics, and furniture.
*   **Capital Goods:** These are used by businesses to produce other goods or provide services, and are not for direct consumption.
*   **Common Goods:** These are **rivalrous** but **non-excludable**, meaning that whil