
<a href="https://colab.research.google.com/github/ruparee/rag-pipeline-tutorial-notebook/blob/main/rag-pipeline-tutorial-notebook.ipynb" target="_parent">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



# 🔍 **RAG Pipeline with Pinecone & Sentence Transformers**
This notebook implements a **Retrieval-Augmented Generation (RAG) pipeline** using:
- **Google Colab's Secure Secret Management** (`userdata.get()`)
- **Pinecone for vector storage**
- **`sentence-transformers` for local embeddings**
- **Fixes for API limits, mismatched dimensions, and deletion protection**


In [None]:

# ✅ Access secret keys securely in Google Colab
import os

from google.colab import userdata

PINECONE_API_KEY = userdata.get('PINECONE_API_KEY')
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

# Ensure keys are set before proceeding.
# Note: the cells in this notebook only use the Pinecone key; the OpenAI key
# is only needed if you add a generation step on top of retrieval.
assert PINECONE_API_KEY, "Pinecone API Key is missing!"
assert OPENAI_API_KEY, "OpenAI API Key is missing!"

# Expose the keys as environment variables for the client libraries
os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

print("✅ API keys loaded securely!")
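
Outside Colab, `google.colab` cannot be imported, so the cell above stops at its import line. A minimal sketch of a fallback, assuming the same secret names are also exported as environment variables. `load_secret` is a hypothetical helper for illustration, not part of Colab or Pinecone:

```python
import os


def load_secret(name):
    """Return a secret from Colab's secret store, else from the environment.

    `load_secret` is a hypothetical helper for illustration only.
    """
    try:
        # Only importable inside Google Colab
        from google.colab import userdata
    except ImportError:
        userdata = None

    value = None
    if userdata is not None:
        try:
            value = userdata.get(name)
        except Exception:
            # Assumption: a missing or inaccessible secret raises here
            value = None

    # Fall back to an environment variable of the same name
    return value or os.environ.get(name)
```

With this in place, `load_secret("PINECONE_API_KEY")` could stand in for `userdata.get("PINECONE_API_KEY")` when the notebook runs locally.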


In [None]:

from pinecone import Pinecone

# ✅ Initialize Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index_name = "rag-pipeline-tutorial"

# ✅ Delete the existing index (if any) so it can be recreated from scratch
existing_indexes = pc.list_indexes().names()
if index_name in existing_indexes:
    # Deletion protection must be disabled before the index can be deleted
    pc.configure_index(index_name, deletion_protection="disabled")
    print(f"✅ Deletion protection disabled for index: {index_name}")

    pc.delete_index(index_name)
    print(f"✅ Index '{index_name}' deleted successfully.")
else:
    print("✅ No existing index found. Proceeding to create a new one.")


In [None]:

from pinecone import ServerlessSpec

# ✅ Create a new Pinecone index with the correct dimension (384 for local embeddings)
pc.create_index(
    name=index_name,
    dimension=384,  # Matches the output size of `all-MiniLM-L6-v2`
    metric="euclidean",
    deletion_protection="enabled",  # Guards the index against accidental deletion
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # Serverless index spec
)
print(f"✅ New Pinecone index '{index_name}' created with dimension 384.")
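
Pinecone rejects vectors whose length differs from the index dimension, so it can help to check lengths before sending anything. A minimal sketch in plain Python. `check_dimension` is a hypothetical helper, and the default of 384 is taken from the index created above:

```python
def check_dimension(vectors, expected_dim=384):
    """Raise ValueError if any vector's length differs from expected_dim.

    `check_dimension` is a hypothetical helper for illustration only.
    """
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"Vector {position} has dimension {len(vector)}, "
                f"but the index expects {expected_dim}."
            )
    return True
```

Calling it on the output of the embedding step before storing documents would surface a mismatch (for example, 1536-dimensional vectors from a different model) as a clear local error rather than a failed upsert.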


In [None]:

from sentence_transformers import SentenceTransformer
from langchain_pinecone import PineconeVectorStore
from langchain_core.embeddings import Embeddings

# ✅ Load a local embedding model (384D)
embeddings_model = SentenceTransformer("all-MiniLM-L6-v2")

# ✅ Wrapper to ensure compatibility with LangChain
class LocalEmbeddings(Embeddings):
    def embed_documents(self, texts):
        # One vector per input text: list[list[float]]
        return embeddings_model.encode(texts, convert_to_numpy=True).tolist()

    def embed_query(self, text):
        # A single flat vector: list[float] (encode a str, not a one-item list)
        return embeddings_model.encode(text, convert_to_numpy=True).tolist()

embeddings = LocalEmbeddings()

print("✅ Local embeddings model loaded successfully!")
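
LangChain's `Embeddings` interface expects `embed_documents` to return one vector per text and `embed_query` to return a single flat vector. The sketch below illustrates that shape contract with a toy encoder that needs no extra dependencies. `ToyEmbeddings` and its hashing scheme are illustrative stand-ins, not the real model, and its vectors carry no semantic meaning:

```python
import hashlib


class ToyEmbeddings:
    """Stand-in with the same method shapes as a LangChain Embeddings class."""

    def __init__(self, dim=8):
        # SHA-256 yields 32 bytes, so this toy scheme supports at most 32 dims
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def _encode(self, text):
        # Derive `dim` floats in [0, 1] from a hash of the text
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[i] / 255.0 for i in range(self.dim)]

    def embed_documents(self, texts):
        # list[list[float]]: one vector per text
        return [self._encode(t) for t in texts]

    def embed_query(self, text):
        # list[float]: a single flat vector
        return self._encode(text)
```

A wrapper whose `embed_query` returns a nested list instead of a flat one would not satisfy this contract, which is a common source of errors at query time.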


In [None]:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# ✅ Example documents (Replace with your actual dataset)
docs = [
    "Vector databases store high-dimensional vectors used for semantic search.",
    "Pinecone is a serverless vector database optimized for AI applications.",
    "Large Language Models (LLMs) use vector databases to improve retrieval accuracy."
]

# ✅ Split documents into chunks
# `create_documents` accepts a list of strings and returns `Document` objects,
# which is what `PineconeVectorStore.from_documents` expects.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)
split_docs = text_splitter.create_documents(docs)

print(f"✅ Loaded and split {len(split_docs)} document chunks!")
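
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking. It cuts fixed-width character windows and ignores separators, whereas `RecursiveCharacterTextSplitter` prefers to break on paragraphs, lines, and spaces, so its output will generally differ. `split_with_overlap` is a hypothetical helper:

```python
def split_with_overlap(text, chunk_size=100, chunk_overlap=10):
    """Split text into fixed-width character windows that overlap.

    Simplified illustration only; not the algorithm LangChain uses.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when it is short enough.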


In [None]:

# ✅ Store document vectors in Pinecone
vectorstore = PineconeVectorStore.from_documents(split_docs, embeddings, index_name=index_name)
print("✅ Documents successfully stored in Pinecone!")


In [None]:

# ✅ Run a similarity search query
query = "What is a vector database?"
results = vectorstore.similarity_search(query)

# ✅ Print retrieved results
for i, doc in enumerate(results):
    print(f"Result {i+1}: {doc.page_content}")
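
Conceptually, a similarity search embeds the query and ranks the stored vectors by their distance to it. The brute-force sketch below shows that ranking with the Euclidean metric used for the index above. It is an illustration only and says nothing about how Pinecone implements search internally; `euclidean`, `cosine_similarity`, and `top_k` are hypothetical helpers:

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return indices of the k vectors closest to query by Euclidean distance."""
    order = sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
    return order[:k]
```

For unit-length vectors the two measures agree on ranking, because the squared Euclidean distance equals `2 - 2 * cosine_similarity`. If the embeddings are normalized, choosing `euclidean` or `cosine` as the index metric should therefore return results in the same order.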



## 🚀 **Next Enhancements**
1. **Improve retrieval quality** – Fine-tune embeddings for domain-specific knowledge.
2. **Optimize query performance** – Implement vector caching strategies.
3. **Enhance batch processing** – Improve bulk vector updates in Pinecone.
4. **Implement Hybrid Search** – Combine **Vector + Keyword Search** for better accuracy.
5. **Use Re-Ranking models** – Apply `cross-encoder` to improve ranking.
6. **Expand Data Sources** – Integrate a more diverse document set.
7. **Integrate a Chatbot** – Build an AI chatbot using the Pinecone knowledge base.
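
As a starting point for the batch-processing item above, bulk updates are often sent in fixed-size groups rather than one request per vector. A minimal sketch of the grouping step only. `batched` is a hypothetical helper, and the batch size of 100 is an illustrative assumption, not a Pinecone limit:

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of items with at most batch_size elements.

    `batched` is a hypothetical helper for illustration only.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each yielded slice could then be embedded and written to the index in a single call, which keeps request sizes predictable as the dataset grows.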

🔹 This notebook **sidesteps embedding API limits by embedding locally, matches the index dimension (384) to the embedding model, and handles Pinecone deletion protection when recreating the index**.  
💡 Feel free to experiment and extend the pipeline with the listed enhancements! 🎯  
