## Data Ingestion


In [18]:
### Document Structure

from langchain_core.documents import Document

In [19]:
doc = Document(
    page_content="this is the main text content I am using to create RAG",
    metadata = {
        "source": "example.txt",
        "pages": 1,
        "author":"predator",
        "date_created":"2025-12-24"
    }
)

doc

Document(metadata={'source': 'example.txt', 'pages': 1, 'author': 'predator', 'date_created': '2025-12-24'}, page_content='this is the main text content I am using to create RAG')

In [20]:
##create a simple directory

import os
os.makedirs("../data/txt_files",exist_ok=True)

In [21]:
sample_texts = {
    "../data/txt_files/python_intro.txt":"""Python Programming Introduction
    
Python is a high-level, interpreted, general-purpose programming language designed to be simple, readable, and powerful.

Key features :

🧠 Easy to learn & read – clear, English-like syntax

⚡ Interpreted – runs code line by line, no compilation

🌍 Cross-platform – works on Windows, macOS, Linux

📚 Large standard library – built-in tools for many tasks

🔌 Rich ecosystem – thousands of third-party libraries

🧩 Object-oriented & functional – supports multiple paradigms

🚀 Versatile – used in web, data science, AI, automation, and more

    """,

     "../data/txt_files/ml_intro.txt":""" Machine Learning Introduction

    Machine Learning (ML) is a branch of artificial intelligence that enables computers to learn from data and improve performance without being explicitly programmed.

Key features

📊 Data-driven – learns patterns from data

🤖 Self-improving – performance improves with more data

🧠 Predictive – makes predictions or decisions

🔁 Automated learning – reduces manual rule-based coding

📈 Scalable – works with large and complex datasets

🌐 Wide applications – used in vision, speech, recommendation, fraud detection

"""
}

for filepath,content in sample_texts.items():
    with open(filepath,'w',encoding="utf-8") as f:
        f.write(content)    

print("Sample text file created")

Sample text file created


In [22]:
###TextLoader

from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/txt_files/python_intro.txt",encoding="utf-8")
print(loader.load())

[Document(metadata={'source': '../data/txt_files/python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted, general-purpose programming language designed to be simple, readable, and powerful.\n\nKey features :\n\n🧠 Easy to learn & read – clear, English-like syntax\n\n⚡ Interpreted – runs code line by line, no compilation\n\n🌍 Cross-platform – works on Windows, macOS, Linux\n\n📚 Large standard library – built-in tools for many tasks\n\n🔌 Rich ecosystem – thousands of third-party libraries\n\n🧩 Object-oriented & functional – supports multiple paradigms\n\n🚀 Versatile – used in web, data science, AI, automation, and more\n\n    ')]


In [23]:
###Directory Loader

from langchain_community.document_loaders import DirectoryLoader

## Load all text files from the directory

dir_loader = DirectoryLoader(
    "../data/txt_files",
    glob="**/*.txt",  # pattern used to match files
    loader_cls=TextLoader,  # loader class applied to each matched file
    loader_kwargs={'encoding': 'utf-8'},
    show_progress=False
)

documents = dir_loader.load()
documents

[Document(metadata={'source': '..\\data\\txt_files\\ml_intro.txt'}, page_content=' Machine Learning Introduction\n\n    Machine Learning (ML) is a branch of artificial intelligence that enables computers to learn from data and improve performance without being explicitly programmed.\n\nKey features\n\n📊 Data-driven – learns patterns from data\n\n🤖 Self-improving – performance improves with more data\n\n🧠 Predictive – makes predictions or decisions\n\n🔁 Automated learning – reduces manual rule-based coding\n\n📈 Scalable – works with large and complex datasets\n\n🌐 Wide applications – used in vision, speech, recommendation, fraud detection\n\n'),
 Document(metadata={'source': '..\\data\\txt_files\\python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted, general-purpose programming language designed to be simple, readable, and powerful.\n\nKey features :\n\n🧠 Easy to learn & read – clear, English-like synt

In [24]:
from langchain_community.document_loaders import PyMuPDFLoader
import glob

pdf_documents = []

for path in glob.glob("../data/pdfs/*.pdf"):
    loader = PyMuPDFLoader(path)
    docs = loader.load()
    for d in docs:
        d.metadata["source"] = path   # 🔑 REQUIRED
    pdf_documents.extend(docs)

from collections import Counter
print("Loaded PDFs:", Counter(d.metadata["source"] for d in pdf_documents))


Loaded PDFs: Counter({'../data/pdfs\\JavaInterviewQuestions-UdemyCourse.pdf': 109, '../data/pdfs\\Engineering Physics Notes.pdf': 69, '../data/pdfs\\Exception-Handling-in-Java.pdf': 9})
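
The `source` keys in the counter above mix `/` and `\\` because `glob` on Windows joins the directory pattern and the file name with a backslash. If a consistent key is wanted (for example, to group or filter by file later), one option is to normalize the path before writing it to the metadata. A minimal sketch, using `PureWindowsPath` only so that the example behaves the same on any OS; `normalize_source` is a hypothetical helper, not part of LangChain:

```python
from pathlib import PureWindowsPath


def normalize_source(path: str) -> str:
    """Return the path with forward slashes only (hypothetical helper)."""
    # PureWindowsPath accepts both "/" and "\\" as separators,
    # so mixed paths such as "../data/pdfs\\file.pdf" parse cleanly.
    return PureWindowsPath(path).as_posix()


raw = "../data/pdfs\\Exception-Handling-in-Java.pdf"
print(normalize_source(raw))       # path with forward slashes only
print(PureWindowsPath(raw).name)   # bare file name, usable as a short label
```

Inside the loading loop this would replace `path` in the `d.metadata["source"] = path` assignment, if normalized keys are preferred.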


In [25]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_documents(documents,chunk_size=1000,chunk_overlap=200):
    """
    Split documents into smaller chunks for better RAG performance.
    
    Parameters:
    - chunk_size: Maximum characters per chunk (adjust based on your LLM)
    - chunk_overlap: Characters to overlap between chunks (preserves context)
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, # Maximum characters per chunk (default 1000)
        chunk_overlap=chunk_overlap, # Characters shared by adjacent chunks (default 200)
        length_function=len, # How to measure length
        separators=["\n\n", "\n", " ", ""] # Split hierarchy
    )
    # Actually split the documents
    split_docs = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(split_docs)} chunks")
    
    # Show what a chunk looks like
    if split_docs:
        print(f"\nExample chunk:")
        print(f"Content: {split_docs[0].page_content[:200]}...")
        print(f"Metadata: {split_docs[0].metadata}")
    
    return split_docs
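
To make `chunk_size` and `chunk_overlap` concrete, the sketch below cuts a string into fixed-size character windows, each starting `chunk_size - chunk_overlap` characters after the previous one. This is a deliberate simplification: `RecursiveCharacterTextSplitter` additionally tries to break on the separators listed above, so its chunk boundaries will differ. `sliding_chunks` is a hypothetical helper written only for illustration.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(text, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

With these toy values each window repeats the last four characters of the previous one, which is the role the 200-character overlap plays for the 1000-character chunks above.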

## Embedding and VectorStoreDB


In [26]:
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List,Dict,Any,Tuple
from sklearn.metrics.pairwise import cosine_similarity

In [27]:
class EmbeddingManager:
    """Handles document embedding generation using SentenceTransformer"""

    def __init__(self,model_name: str = "all-MiniLM-L6-v2"):
        """
            Initialize the embedding manager

            Args:
                model_name: Hugging Face model name for sentence embeddings
        """

        self.model_name = model_name
        self.model = None
        self._load_model()

    def _load_model(self):
        """Load the SentenceTransformer model."""
        try:
            print(f"Loading embedding model: {self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded successfully. Embedding dimension : {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise

    def generate_embeddings(self,texts: List[str])->np.ndarray:
        """
        Generate embeddings for a list of texts.

        Args:
            texts: List of text strings to embed

        Returns:
            numpy array of embeddings with shape (len(texts), embedding_dim)
        """

        if not self.model:
            raise ValueError("Model not loaded")
        
        print(f"Generating embeddings for {len(texts)} texts...")
        embeddings = self.model.encode(texts,show_progress_bar=True)
        print(f"Generated embeddings with shape : {embeddings.shape}")
        return embeddings
    



### Initialize the embedding manager

embedding_manager= EmbeddingManager()
embedding_manager

Loading embedding model: all-MiniLM-L6-v2
Model loaded successfully. Embedding dimension : 384


<__main__.EmbeddingManager at 0x20562165550>

## Vector StoreDB

In [28]:
class VectorStore:
    """Manages document embeddings in a ChromaDB vector store."""

    def __init__(self,collection_name: str= "pdf_documents", persist_directory: str= "../data/vector_store"):
        """
        Initialize the vector store.

        Args:
            collection_name: Name of the ChromaDB collection
            persist_directory: Directory in which to persist the vector store
        """

        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()

    def _initialize_store(self):
        """  
        Initialize chromaDB client and collection
        """

        try:
            # Create persistent chromaDB client
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path=self.persist_directory)

            # Get or create collection

            self.collection = self.client.get_or_create_collection(
                name = self.collection_name,
                metadata={"Description": "Pdf document embedding for RAG"}
            )
            print(f"Vector Store Initialized. Collections: {self.collection_name}")
            print(f"Existing documents in collections   : {self.collection.count()}")

        except Exception as e:
            print(f"Error Initializing Vector Store: {e}")
            raise

    def add_documents(self, documents: List[Any], embeddings: np.ndarray):
        """ 
        Add documents and their embeddings to the vector store

        Args:
            documents: List of LangChain Document objects
            embeddings: Corresponding embeddings for the documents
        """

        if len(documents) != len(embeddings):
            raise ValueError("Number of documents must match number of embeddings")
        print(f"Adding {len(documents)} documents to the vector store..")

        # Prepare data for chromaDB

        ids = []
        metadatas = []
        documents_text =[]
        embeddings_list = []

        for i,(doc, embedding) in enumerate(zip(documents, embeddings)):
            # Generate unique ID

            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)

            # Prepare metadata

            metadata = dict(doc.metadata)
            metadata['doc_index'] = i
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)



            # Document Content

            documents_text.append(doc.page_content)

            # Embeddings

            embeddings_list.append(embedding.tolist())

        # Add to collection

        try:
            self.collection.add(
                ids=ids,
                embeddings=embeddings_list,
                metadatas=metadatas,
                documents=documents_text

            )

            print(f"Successfully added {len(documents)} documents to vector store")
            print(f"Total documents in collections: {self.collection.count()}")
        
        except Exception as e:
            print(f"Error adding document to vector store {e}")
            raise


vectorstore = VectorStore()
vectorstore






Vector Store Initialized. Collections: pdf_documents
Existing documents in collections   : 481


<__main__.VectorStore at 0x205621652b0>

In [29]:
from collections import Counter

Counter(doc.metadata["source"] for doc in pdf_documents)


Counter({'../data/pdfs\\JavaInterviewQuestions-UdemyCourse.pdf': 109,
         '../data/pdfs\\Engineering Physics Notes.pdf': 69,
         '../data/pdfs\\Exception-Handling-in-Java.pdf': 9})

In [30]:
# Convert text to embeddings
split_docs = split_documents(pdf_documents)
texts = [doc.page_content for doc in split_docs]

# Generate the embeddings 

embeddings = embedding_manager.generate_embeddings(texts)

# Store in the vector database
vectorstore.add_documents(split_docs,embeddings)

Split 187 documents into 481 chunks

Example chunk:
Content: Engineering Physics           I B.Tech                         CSE/EEE/IT & ECE 
 
GRIET 
 1 
 
 
 
I.B.Tech (CSE/EEE/IT & ECE) 
 
Engineering Physics  Syllabus 
  
UNIT-I 
1. Crystal Structures: Latt...
Metadata: {'producer': 'Microsoft® Word 2013', 'creator': 'Microsoft® Word 2013', 'creationdate': '2015-08-20T11:09:44+05:30', 'source': '../data/pdfs\\Engineering Physics Notes.pdf', 'file_path': '../data/pdfs\\Engineering Physics Notes.pdf', 'total_pages': 69, 'format': 'PDF 1.5', 'title': 'Engineering Physics           I B.Tech                         CSE/EEE/IT & ECE', 'author': 'Rajesh', 'subject': '', 'keywords': '', 'moddate': '2015-08-20T11:09:44+05:30', 'trapped': '', 'modDate': "D:20150820110944+05'30'", 'creationDate': "D:20150820110944+05'30'", 'page': 0}
Generating embeddings for 481 texts...


Batches: 100%|██████████| 16/16 [00:16<00:00,  1.05s/it]


Generated embeddings with shape : (481, 384)
Adding481 documents to the vector store..
Successfully added 481 documents to vector store
Total documents in collections: 962
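
The total of 962 is twice the 481 chunks produced in this run: the collection already held 481 entries when it was opened, and because `add_documents` builds each ID from a random `uuid4`, re-running the cell stores every chunk a second time under a new ID. One way to make the step repeatable is to derive the ID from the chunk itself, so the same chunk always maps to the same ID. The sketch below is an assumption about how that could look, not the notebook's method; `stable_chunk_id` is a hypothetical helper.

```python
import hashlib


def stable_chunk_id(source: str, page: int, content: str) -> str:
    """Build a deterministic ID from a chunk's source, page and text."""
    # "\x1f" (unit separator) keeps the three fields from running together.
    key = f"{source}\x1f{page}\x1f{content}".encode("utf-8")
    return "doc_" + hashlib.sha256(key).hexdigest()[:16]


first = stable_chunk_id("notes.pdf", 0, "Exception Handling \nin Java")
again = stable_chunk_id("notes.pdf", 0, "Exception Handling \nin Java")
other = stable_chunk_id("notes.pdf", 1, "Exception Handling \nin Java")
print(first == again)  # same chunk, same ID
print(first == other)  # different page, different ID
```

With IDs like these, the collection's `upsert` method could be used in place of `add`, so a re-run overwrites existing entries rather than duplicating them. Identical text appearing twice on the same page would collide, so a chunk index can be added to the key if that matters.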


## Retriever Pipeline from VectorStore

In [31]:
class RAGRetriever:
    """Handles query-based retrieval from the vector store."""

    def __init__(self,vector_store: VectorStore, embedding_manager: EmbeddingManager):
        """
        Initialize the retriever.

        Args:
            vector_store: Vector store containing the document embeddings
            embedding_manager: Manager for generating query embeddings
        """

        self.vector_store = vector_store
        self.embedding_manager = embedding_manager

    def retrieve(self, query: str, top_k: int=5, score_threshold: float = 0.0) -> List[Dict[str,Any]]:
        """
        Retrieve relevant documents for a query.

        Args:
            query: The search query
            top_k: Number of top results to return
            score_threshold: Minimum similarity score threshold

        Returns:
            List of dictionaries containing the retrieved documents and metadata
        """ 

        print(f"Retrieving documents for query: '{query}' ")
        print(f"Top K: {top_k}, Score Threshold {score_threshold}")


        ## Generate query embedding
        query_embedding = self.embedding_manager.generate_embeddings([query])[0]

        # Search in Vector Store
        try:
            results = self.vector_store.collection.query(
                query_embeddings=[query_embedding.tolist()],
                n_results=top_k
            )

            # Process Result

            retrieved_doc = []

            if results['documents'] and results['documents'][0]:
                documents = results['documents'][0]
                metadatas = results['metadatas'][0]
                distances = results['distances'][0]
                ids = results['ids'][0]

                for i, (doc_id,document,metadata,distance) in enumerate(zip(ids,documents,metadatas,distances)):
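                    # Convert distance to a similarity-style score. This equals cosine
                    # similarity only if the collection uses cosine distance
                    # ("hnsw:space": "cosine"); Chroma's default space is squared L2.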

                    similarity_score = 1 - distance

                    if similarity_score >= score_threshold:
                        retrieved_doc.append({
                            'id': doc_id,
                            'content': document,
                            'metadata': metadata,
                            'similarity_score': similarity_score,
                            'distance': distance,
                            'rank': i+1

                        })
                print(f"Retrieved {len(retrieved_doc)} document after filtering")
            else:
                print("No document found")

            return retrieved_doc

        except Exception as e:
            print(f"Error during Retrieval : {e}")
            return []
        
rag_retriever = RAGRetriever(vectorstore,embedding_manager)
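
The `1 - distance` conversion in `retrieve` yields cosine similarity only when the collection measures cosine distance. The collection above is created without an `hnsw:space` setting, and Chroma's documented default space is squared L2, so the reported `similarity_score` should be read with that in mind. For unit-length vectors the two quantities are related by `squared_l2 = 2 - 2 * cosine_similarity`. The pure-Python sketch below checks that identity on toy vectors; all helper names are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(distance: float) -> float:
    """Recover cosine similarity from squared L2 (unit-length vectors only)."""
    return 1.0 - distance / 2.0


a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])
d = squared_l2(a, b)
print(round(cosine_sim(a, b), 6))
print(round(cosine_from_squared_l2(d), 6))  # matches the line above
```

In Chroma versions that accept the `hnsw:space` metadata key, passing `"hnsw:space": "cosine"` when the collection is first created requests cosine distance directly; an existing collection keeps the space it was created with.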





In [32]:
rag_retriever

<__main__.RAGRetriever at 0x20538beef90>

In [33]:
rag_retriever.retrieve("The Benefits of Exception Handling")

Retrieving documents for query: 'The Benefits of Exception Handling' 
Top K: 5, Score Threshold 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 93.44it/s]

Generated embeddings with shape : (1, 384)
Retrieved 5 document after filtering





[{'id': 'doc_08e16ffe_202',
  'content': 'Exception Handling \nin Java',
  'metadata': {'moddate': "D:20240927195246Z00'00'",
   'author': '',
   'creationDate': "D:20240927195246Z00'00'",
   'title': '',
   'trapped': '',
   'source': '../data/pdfs\\Exception-Handling-in-Java.pdf',
   'producer': 'GPL Ghostscript 10.02.0',
   'keywords': '',
   'creator': 'pdf-lib (https://github.com/Hopding/pdf-lib)',
   'total_pages': 9,
   'content_length': 27,
   'format': 'PDF 1.4',
   'creationdate': "D:20240927195246Z00'00'",
   'file_path': '../data/pdfs\\Exception-Handling-in-Java.pdf',
   'page': 0,
   'subject': '',
   'modDate': "D:20240927195246Z00'00'",
   'doc_index': 202},
  'similarity_score': 0.4134162664413452,
  'distance': 0.5865837335586548,
  'rank': 1},
 {'id': 'doc_ac9838f9_202',
  'content': 'Exception Handling \nin Java',
  'metadata': {'creationdate': "D:20240927195246Z00'00'",
   'format': 'PDF 1.4',
   'producer': 'GPL Ghostscript 10.02.0',
   'title': '',
   'page': 0,
 

In [34]:
rag_retriever.retrieve("What are differences between String	and	StringBuffer?)")

Retrieving documents for query: 'What are differences between String	and	StringBuffer?)' 
Top K: 5, Score Threshold 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 69.85it/s]

Generated embeddings with shape : (1, 384)
Retrieved 5 document after filtering





[{'id': 'doc_fb43db68_269',
  'content': 'performance benefits. \n• \nBoth String and StringBuffer are thread-safe. \n• \nStringBuffer is implemented by using synchronized keyword on all methods. \nWhat are differences between StringBuilder and StringBuffer? \nStringBuilder is not thread safe. So, it performs better in situations where thread safety is not required. \nCan you give examples of different utility methods in String class? \nString class defines a number of methods to get information about the string content. \nString str = "abcdefghijk"; \nGet information from String \nFollowing methods help to get information from a String. \n//char charAt(int paramInt) \nSystem.out.println(str.charAt(2)); //prints a char - c \nSystem.out.println("ABCDEFGH".length());//8 \nSystem.out.println("abcdefghij".toString()); //abcdefghij \nSystem.out.println("ABC".equalsIgnoreCase("abc"));//true \n \n//Get All characters from index paramInt \n//String substring(int paramInt)',
  'metadata': {

In [35]:
def retrieve_and_print(query):
    results = rag_retriever.retrieve(query)
    
    print(f"\n{'='*20} RESULTS {'='*20}")
    for doc in results:
        print(f"Content: {doc['content'].strip()}")
        print(f"Source: {doc['metadata']['source']}")
        print("-" * 30)

# Example usage:
retrieve_and_print("Explain Electrons in a periodic potential Bloch Theorem:")


Retrieving documents for query: 'Explain Electrons in a periodic potential Bloch Theorem:' 
Top K: 5, Score Threshold 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 77.71it/s]

Generated embeddings with shape : (1, 384)
Retrieved 5 document after filtering

Content: 1
exp(E−Ef)/KT + 1  
Here F(E) is called Fermi – Dirac probability function. It indicates that the fraction of all energy 
state (E) occupied under thermal equilibrium ‘K’ is Boltzmann constant. 
 
4) Explain the motion of an electron in periodic potential using Bloch theorem? (or) 
Explain Band theory of solids in detail. (or) Discuss the Kronig- penny model for the 
motion of an electron in a periodic potential. 
 
Electrons in a periodic potential –Bloch Theorem:   
An electron moves through + ve ions, it experiences varying potential. The potential of the 
electron at the +ve ions site is zero and is maximum in between two +ve ions sites. 
The potential experienced by an eˉ, when it passes though +ve ions shown in fig.  
 
 
eˉ    (+)      (+)     (+)     (+) 
 
       (+)      (+)     (+)     (+) 
 
       (+)      (+)     (+)     (+) 
 
       (+)      (+)     (+)     (+) 
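
Because the collection holds each chunk twice (the first two hits for the exception-handling query share the same `content` and differ only in the random part of their `id`), the top 5 results can contain duplicates. A minimal post-processing sketch, assuming result dictionaries shaped like those returned by `retrieve`; `dedupe_by_content` is a hypothetical helper and the third sample entry is made up for the demonstration:

```python
from typing import Any, Dict, List


def dedupe_by_content(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first (best-ranked) result for each distinct content string."""
    seen = set()
    unique = []
    for result in results:
        key = result["content"].strip()
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


hits = [
    {"id": "doc_08e16ffe_202", "content": "Exception Handling \nin Java", "rank": 1},
    {"id": "doc_ac9838f9_202", "content": "Exception Handling \nin Java", "rank": 2},
    # Made-up entry, present only to show that distinct content is kept.
    {"id": "doc_example_3", "content": "Some other chunk", "rank": 3},
]
for hit in dedupe_by_content(hits):
    print(hit["rank"], hit["id"])
```

This only hides the symptom at query time; requesting a larger `top_k` before deduplicating, or avoiding the duplicate inserts in the first place, would be needed to get five distinct chunks back.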
 
 
 



