## Data Ingestion


In [1]:
### Document Structure

from langchain_core.documents import Document

In [2]:
doc = Document(
    page_content="this is the main text content I am using to create RAG",
    metadata = {
        "source": "example.txt",
        "pages": 1,
        "author":"predator",
        "date_created":"2025-12-24"
    }
)

doc

Document(metadata={'source': 'example.txt', 'pages': 1, 'author': 'predator', 'date_created': '2025-12-24'}, page_content='this is the main text content I am using to create RAG')

In [3]:
##create a simple directory

import os
os.makedirs("../data/txt_files",exist_ok=True)

In [4]:
sample_texts = {
    "../data/txt_files/python_intro.txt":"""Python Programming Introduction
    
Python is a high-level, interpreted, general-purpose programming language designed to be simple, readable, and powerful.

Key features :

🧠 Easy to learn & read – clear, English-like syntax

⚡ Interpreted – runs code line by line, no compilation

🌍 Cross-platform – works on Windows, macOS, Linux

📚 Large standard library – built-in tools for many tasks

🔌 Rich ecosystem – thousands of third-party libraries

🧩 Object-oriented & functional – supports multiple paradigms

🚀 Versatile – used in web, data science, AI, automation, and more

    """,

     "../data/txt_files/ml_intro.txt":""" Machine Learning Introduction

    Machine Learning (ML) is a branch of artificial intelligence that enables computers to learn from data and improve performance without being explicitly programmed.

Key features

📊 Data-driven – learns patterns from data

🤖 Self-improving – performance improves with more data

🧠 Predictive – makes predictions or decisions

🔁 Automated learning – reduces manual rule-based coding

📈 Scalable – works with large and complex datasets

🌐 Wide applications – used in vision, speech, recommendation, fraud detection

"""
}

for filepath,content in sample_texts.items():
    with open(filepath,'w',encoding="utf-8") as f:
        f.write(content)    

print("Sample text file created")

Sample text file created


In [5]:
###TextLoader

from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/txt_files/python_intro.txt",encoding="utf-8")
print(loader.load())



[Document(metadata={'source': '../data/txt_files/python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted, general-purpose programming language designed to be simple, readable, and powerful.\n\nKey features :\n\n🧠 Easy to learn & read – clear, English-like syntax\n\n⚡ Interpreted – runs code line by line, no compilation\n\n🌍 Cross-platform – works on Windows, macOS, Linux\n\n📚 Large standard library – built-in tools for many tasks\n\n🔌 Rich ecosystem – thousands of third-party libraries\n\n🧩 Object-oriented & functional – supports multiple paradigms\n\n🚀 Versatile – used in web, data science, AI, automation, and more\n\n    ')]


In [6]:
###Directory Loader

from langchain_community.document_loaders import DirectoryLoader

## Load all text files from the directory

dir_loader = DirectoryLoader(
    "../data/txt_files",
     glob="**/*.txt", ## pattern to match file
     loader_cls = TextLoader, ## loader class to use
     loader_kwargs={'encoding':'utf-8'},
     show_progress=False
)

documents = dir_loader.load()
documents

[Document(metadata={'source': '..\\data\\txt_files\\ml_intro.txt'}, page_content=' Machine Learning Introduction\n\n    Machine Learning (ML) is a branch of artificial intelligence that enables computers to learn from data and improve performance without being explicitly programmed.\n\nKey features\n\n📊 Data-driven – learns patterns from data\n\n🤖 Self-improving – performance improves with more data\n\n🧠 Predictive – makes predictions or decisions\n\n🔁 Automated learning – reduces manual rule-based coding\n\n📈 Scalable – works with large and complex datasets\n\n🌐 Wide applications – used in vision, speech, recommendation, fraud detection\n\n'),
 Document(metadata={'source': '..\\data\\txt_files\\python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted, general-purpose programming language designed to be simple, readable, and powerful.\n\nKey features :\n\n🧠 Easy to learn & read – clear, English-like syntax\n\n⚡ Interpreted – runs code lin

In [7]:
from langchain_community.document_loaders import PyMuPDFLoader
import glob

pdf_documents = []

for path in glob.glob("../data/pdfs/*.pdf"):
    loader = PyMuPDFLoader(path)
    docs = loader.load()
    for d in docs:
        d.metadata["source"] = path   # ensure 'source' holds the path used for grouping below
    pdf_documents.extend(docs)

from collections import Counter
print("Loaded PDFs:", Counter(d.metadata["source"] for d in pdf_documents))


Loaded PDFs: Counter({'../data/pdfs\\JavaInterviewQuestions-UdemyCourse.pdf': 109, '../data/pdfs\\Engineering Physics Notes.pdf': 69, '../data/pdfs\\Exception-Handling-in-Java.pdf': 9})
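
The `source` keys in the counter above mix forward slashes and backslashes because the glob pattern was expanded on Windows. If those keys are later used for grouping or filtering, one option is to normalize them first. A minimal sketch, assuming the paths were produced on Windows; `normalize_source` is a hypothetical helper, not part of the notebook:

```python
from pathlib import PureWindowsPath


def normalize_source(path: str) -> str:
    """Return the path with forward slashes only (hypothetical helper).

    PureWindowsPath treats both slash styles as separators, so this works
    for the mixed paths shown in the output above on any operating system.
    """
    return PureWindowsPath(path).as_posix()


raw = "../data/pdfs\\Exception-Handling-in-Java.pdf"
clean = normalize_source(raw)
file_name = PureWindowsPath(raw).name

print(clean)
print(file_name)
```

On a POSIX system a backslash can be a legal character inside a file name, so this normalization is only appropriate when the paths are known to come from Windows.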


In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_documents(documents,chunk_size=1000,chunk_overlap=200):
    """
    Split documents into smaller chunks for better RAG performance.
    
    Parameters:
    - chunk_size: Maximum characters per chunk (adjust based on your LLM)
    - chunk_overlap: Characters to overlap between chunks (preserves context)
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,        # Maximum characters per chunk
        chunk_overlap=chunk_overlap,  # Characters shared by adjacent chunks
        length_function=len,          # Measure length in characters
        separators=["\n\n", "\n", " ", ""]  # Tried in order: paragraph, line, word, character
    )
    # Actually split the documents
    split_docs = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(split_docs)} chunks")
    
    # Show what a chunk looks like
    if split_docs:
        print("\nExample chunk:")
        print(f"Content: {split_docs[0].page_content[:200]}...")
        print(f"Metadata: {split_docs[0].metadata}")
    
    return split_docs
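
To see how `chunk_size` and `chunk_overlap` interact, the sketch below cuts a string into fixed-size windows that share characters with their neighbours. It is a deliberately simplified stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries the separators in order); `sliding_window_chunks` is a hypothetical name.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows start (chunk_size - chunk_overlap) characters apart,
    so each one repeats the last chunk_overlap characters of the previous one.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

A larger overlap keeps more shared context between neighbouring chunks but produces more chunks to embed and store.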

## Embedding and VectorStoreDB


In [9]:
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List,Dict,Any,Tuple
from sklearn.metrics.pairwise import cosine_similarity
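
The cell above imports `cosine_similarity`, the measure typically used to compare sentence embeddings. As a rough illustration of what that measure computes, here is a pure-Python sketch on toy 3-dimensional vectors; the vectors and the `cosine_sim` helper are made up for the example and stand in for real embeddings.

```python
import math
from typing import Sequence


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings
query_vec = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]   # same direction, different length
orthogonal = [0.0, 3.0, 0.0]       # unrelated direction

sim_same = cosine_sim(query_vec, same_direction)
sim_orth = cosine_sim(query_vec, orthogonal)
print(sim_same, sim_orth)
```

Because the measure ignores vector length, only the direction of an embedding matters when two texts are compared this way.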

In [10]:
class EmbeddingManager:
    """Handles document embedding generation using SentenceTransformer"""

    def __init__(self,model_name: str = "all-MiniLM-L6-v2"):
        """
        Initialize the embedding manager

        Args:
            model_name: Hugging Face model name for sentence embeddings
        """

        self.model_name = model_name
        self.model = None
        self._load_model()

    def _load_model(self):
        """Load the sentence Transformer model"""
        try:
            print(f"Loading embedding model: {self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded successfully. Embedding dimension : {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise

    def generate_embeddings(self,texts: List[str])->np.ndarray:
        """
        Generate embeddings for a list of texts

        Args:
            texts: List of text strings to embed

        Returns:
            numpy array of embeddings with shape (len(texts), embedding_dim)
        """

        if not self.model:
            raise ValueError("Model not loaded")
        
        print(f"Generating embeddings for {len(texts)} texts...")
        embeddings = self.model.encode(texts,show_progress_bar=True)
        print(f"Generated embeddings with shape : {embeddings.shape}")
        return embeddings
    



### Initialize the embedding manager

embedding_manager= EmbeddingManager()
embedding_manager

Loading embedding model: all-MiniLM-L6-v2
Model loaded successfully. Embedding dimension : 384


<__main__.EmbeddingManager at 0x1c7fb270590>

## Vector StoreDB

In [11]:
class VectorStore:
    """ manages document embeddings in a chromaDB vector store """

    def __init__(self,collection_name: str= "pdf_documents", persist_directory: str= "../data/vector_store"):
        """  
        Initialize the vector store

        Args:
            collection_name: Name of the ChromaDB collection
            persist_directory: Directory in which to persist the vector store

        """

        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()

    def _initialize_store(self):
        """  
        Initialize chromaDB client and collection
        """

        try:
            # Create persistent chromaDB client
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path=self.persist_directory)

            # Get or create collection

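            self.collection = self.client.get_or_create_collection(
                name = self.collection_name,
                metadata={
                    "Description": "Pdf document embedding for RAG",
                    # Use cosine distance so that `1 - distance` is a cosine similarity.
                    # This only takes effect when the collection is first created.
                    "hnsw:space": "cosine",
                }
            )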
            print(f"Vector Store Initialized. Collections: {self.collection_name}")
            print(f"Existing documents in collections   : {self.collection.count()}")

        except Exception as e:
            print(f"Error Initializing Vector Store: {e}")
            raise

    def add_documents(self, documents: List[Any], embeddings: np.ndarray):
        """ 
        Add documents and their embeddings to the vector store

        Args:
            documents: List of LangChain Document objects
            embeddings: Corresponding embeddings for the documents
        """

        if len(documents) != len(embeddings):
            raise ValueError("Number of documents must match number of embeddings")
        print(f"Adding{len(documents)} documents to the vector store..")

        # Prepare data for chromaDB

        ids = []
        metadatas = []
        documents_text =[]
        embeddings_list = []

        for i,(doc, embedding) in enumerate(zip(documents, embeddings)):
            # Generate unique ID

            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)

            # Prepare metadata

            metadata = dict(doc.metadata)
            metadata['doc_index'] = i
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)



            # Document Content

            documents_text.append(doc.page_content)

            # Embeddings

            embeddings_list.append(embedding.tolist())

        # Add to collection

        try:
            self.collection.add(
                ids=ids,
                embeddings=embeddings_list,
                metadatas=metadatas,
                documents=documents_text

            )

            print(f"Successfully added {len(documents)} documents to vector store")
            print(f"Total documents in collections: {self.collection.count()}")
        
        except Exception as e:
            print(f"Error adding document to vector store {e}")
            raise


vectorstore = VectorStore()
vectorstore






Vector Store Initialized. Collections: pdf_documents
Existing documents in collections   : 87


<__main__.VectorStore at 0x1c7fd5fa3c0>

In [12]:
from collections import Counter

Counter(doc.metadata["source"] for doc in pdf_documents)


Counter({'../data/pdfs\\JavaInterviewQuestions-UdemyCourse.pdf': 109,
         '../data/pdfs\\Engineering Physics Notes.pdf': 69,
         '../data/pdfs\\Exception-Handling-in-Java.pdf': 9})

In [13]:
# Convert text to embeddings
split_docs = split_documents(pdf_documents)
texts = [doc.page_content for doc in split_docs]

# Generate the embeddings 

embeddings = embedding_manager.generate_embeddings(texts)

# Store in the vector database
vectorstore.add_documents(split_docs,embeddings)

Split 187 documents into 481 chunks

Example chunk:
Content: Engineering Physics           I B.Tech                         CSE/EEE/IT & ECE 
 
GRIET 
 1 
 
 
 
I.B.Tech (CSE/EEE/IT & ECE) 
 
Engineering Physics  Syllabus 
  
UNIT-I 
1. Crystal Structures: Latt...
Metadata: {'producer': 'Microsoft® Word 2013', 'creator': 'Microsoft® Word 2013', 'creationdate': '2015-08-20T11:09:44+05:30', 'source': '../data/pdfs\\Engineering Physics Notes.pdf', 'file_path': '../data/pdfs\\Engineering Physics Notes.pdf', 'total_pages': 69, 'format': 'PDF 1.5', 'title': 'Engineering Physics           I B.Tech                         CSE/EEE/IT & ECE', 'author': 'Rajesh', 'subject': '', 'keywords': '', 'moddate': '2015-08-20T11:09:44+05:30', 'trapped': '', 'modDate': "D:20150820110944+05'30'", 'creationDate': "D:20150820110944+05'30'", 'page': 0}
Generating embeddings for 481 texts...


Batches: 100%|██████████| 16/16 [00:16<00:00,  1.06s/it]


Generated embeddings with shape : (481, 384)
Adding481 documents to the vector store..
Successfully added 481 documents to vector store
Total documents in collections: 568
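
Because `add_documents` builds each ID from a fresh `uuid.uuid4()`, every run appends new entries (here the count goes from 87 to 568), and re-running the cell would store the same chunks again under new IDs. One possible alternative is to derive the ID from the chunk itself. A minimal sketch, assuming the `source` and `page` metadata keys seen in the output above; `stable_chunk_id` is a hypothetical helper, not part of the notebook:

```python
import hashlib
from typing import Any, Dict


def stable_chunk_id(content: str, metadata: Dict[str, Any]) -> str:
    """Build an ID from source, page, and text (hypothetical helper).

    The same chunk always hashes to the same ID, so a repeated run can be
    detected instead of silently adding duplicates.
    """
    key = f"{metadata.get('source', '')}|{metadata.get('page', '')}|{content}"
    return "doc_" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


meta = {"source": "../data/pdfs/example.pdf", "page": 0}  # made-up metadata
first = stable_chunk_id("some chunk text", meta)
second = stable_chunk_id("some chunk text", meta)
other = stable_chunk_id("different chunk text", meta)

print(first == second)  # same input, same ID
print(first == other)   # different text, different ID
```

With IDs like these, the collection's `upsert` method could be used in place of `add` so that a second run overwrites existing entries; that pairing is a suggestion, not something the notebook does. Two chunks with identical text on the same page would share an ID, which is acceptable for deduplication but worth knowing.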


## Retriever Pipeline from VectorStore

In [14]:
class RAGRetriever:
    """ Handles query based retrieval from the vector store """

    def __init__(self,vector_store: VectorStore, embedding_manager: EmbeddingManager):
        """ 
            Initialize the retriever

            Args:
                vector_store: Vector store containing document embeddings
                embedding_manager: Manager for generating query embeddings

        """

        self.vector_store = vector_store
        self.embedding_manager = embedding_manager

    def retrieve(self, query: str, top_k: int=5, score_threshold: float = 0.0) -> List[Dict[str,Any]]:
        """
        Retrieve relevant documents for a query

        Args:
            query: The search query
            top_k: Number of top results to return
            score_threshold: Minimum similarity score threshold

        Returns:
            List of dictionaries containing the retrieved documents and metadata
        """ 

        print(f"Retrieving documents for query: '{query}' ")
        print(f"Top K: {top_k}, Score Threshold {score_threshold}")


        ## Generate query embedding
        query_embedding = self.embedding_manager.generate_embeddings([query])[0]

        # Search in Vector Store
        try:
            results = self.vector_store.collection.query(
                query_embeddings=[query_embedding.tolist()],
                n_results=top_k
            )

            # Process Result

            retrieved_doc = []

            if results['documents'] and results['documents'][0]:
                documents = results['documents'][0]
                metadatas = results['metadatas'][0]
                distances = results['distances'][0]
                ids = results['ids'][0]

                for i, (doc_id,document,metadata,distance) in enumerate(zip(ids,documents,metadatas,distances)):
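                    # Convert distance to a similarity score. `1 - distance` is a cosine
                    # similarity only if the collection uses cosine distance
                    # (hnsw:space = "cosine"); ChromaDB's default space is squared L2.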

                    similarity_score = 1 - distance

                    if similarity_score >= score_threshold:
                        retrieved_doc.append({
                            'id': doc_id,
                            'content': document,
                            'metadata': metadata,
                            'similarity_score': similarity_score,
                            'distance': distance,
                            'rank': i+1

                        })
                print(f"Retrieved {len(retrieved_doc)} document after filtering")
            else:
                print("No document found")

            return retrieved_doc

        except Exception as e:
            print(f"Error during Retrieval : {e}")
            return []
        
rag_retriever = RAGRetriever(vectorstore,embedding_manager)
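
The `1 - distance` conversion in `retrieve` yields a cosine similarity only when the collection measures cosine distance. As a rough illustration, assuming unit-length embedding vectors and Chroma's documented `cosine` (1 - cos) and `l2` (squared Euclidean) spaces, both distances can be mapped back to the same similarity. The helper names below are hypothetical:

```python
import math
from typing import List, Sequence


def unit(v: Sequence[float]) -> List[float]:
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; valid as written only for unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def squared_l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def similarity_from_distance(distance: float, space: str) -> float:
    """Map a distance back to cosine similarity, assuming unit-length vectors."""
    if space == "cosine":
        return 1.0 - distance
    if space == "l2":
        # For unit vectors, squared L2 = 2 - 2 * cos, so cos = 1 - d / 2
        return 1.0 - distance / 2.0
    raise ValueError(f"unsupported space: {space}")


a = unit([1.0, 1.0, 0.0])
b = unit([1.0, 0.0, 0.0])
sim_cos = similarity_from_distance(cosine_distance(a, b), "cosine")
sim_l2 = similarity_from_distance(squared_l2_distance(a, b), "l2")
print(sim_cos, sim_l2)
```

If the vectors are not unit length, the `l2` branch no longer recovers the cosine similarity, and a `score_threshold` tuned for one space will not carry over to the other.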





In [15]:
rag_retriever

<__main__.RAGRetriever at 0x1c78009c980>

In [19]:
rag_retriever.retrieve("The Benefits of Exception Handling")

Retrieving documents for query: 'The Benefits of Exception Handling' 
Top K: 5, Score Threshold 0.0
Generating embeddings for 481 texts...


Batches: 100%|██████████| 16/16 [00:16<00:00,  1.03s/it]

Generated embeddings with shape : (481, 384)
Retrieved 5 document after filtering





[{'id': 'doc_619a1766_0',
  'content': 'Engineering Physics           I B.Tech                         CSE/EEE/IT & ECE \n \nGRIET \n 1 \n \n \n \nI.B.Tech (CSE/EEE/IT & ECE) \n \nEngineering Physics  Syllabus \n  \nUNIT-I \n1. Crystal Structures: Lattice points, Space lattice, Basis, Bravais lattice, unit cell and lattice parameters, \nSeven Crystal Systems with 14 Bravais lattices , Atomic Radius, Co-ordination Number and Packing \nFactor of SC, BCC, FCC, Miller Indices, Inter planer spacing of Cubic crystal system. \n2. Defects in Crystals: Classification of defects, Point Defects: Vacancies, Substitution, Interstitial, \nConcentration of Vacancies, Frenkel and Schottky Defects, Edge and Screw Dislocations (Qualitative \ntreatment), Burger’s Vector. \n3. Principles of Quantum Mechanics: Waves and Particles, de Broglie Hypothesis, Matter Waves, \nDavisson and Germer’s Experiment, Heisenberg’s Uncertainty Principle, Schrodinger’s Time \nIndependent Wave Equation-Physical Significance 

In [None]:
rag_retriever.retrieve("What are differences between String	and	StringBuffer?)")

Retrieving documents for query: 'What are differences	between	String	and	StringBuffer?)' 
Top K: 5, Score Threshold 0.0
Generating embeddings for 481 texts...


Batches: 100%|██████████| 16/16 [00:16<00:00,  1.05s/it]

Generated embeddings with shape : (481, 384)
Retrieved 5 document after filtering





[{'id': 'doc_619a1766_0',
  'content': 'Engineering Physics           I B.Tech                         CSE/EEE/IT & ECE \n \nGRIET \n 1 \n \n \n \nI.B.Tech (CSE/EEE/IT & ECE) \n \nEngineering Physics  Syllabus \n  \nUNIT-I \n1. Crystal Structures: Lattice points, Space lattice, Basis, Bravais lattice, unit cell and lattice parameters, \nSeven Crystal Systems with 14 Bravais lattices , Atomic Radius, Co-ordination Number and Packing \nFactor of SC, BCC, FCC, Miller Indices, Inter planer spacing of Cubic crystal system. \n2. Defects in Crystals: Classification of defects, Point Defects: Vacancies, Substitution, Interstitial, \nConcentration of Vacancies, Frenkel and Schottky Defects, Edge and Screw Dislocations (Qualitative \ntreatment), Burger’s Vector. \n3. Principles of Quantum Mechanics: Waves and Particles, de Broglie Hypothesis, Matter Waves, \nDavisson and Germer’s Experiment, Heisenberg’s Uncertainty Principle, Schrodinger’s Time \nIndependent Wave Equation-Physical Significance 