# RAG Application with Automated Document Parser

This notebook demonstrates how to build a complete RAG (Retrieval-Augmented Generation) application using the automated-document-parser package and LangChain with HuggingFace models.

RAG enables you to:
- Load and process documents from various formats
- Create embeddings and store them in a vector database
- Retrieve relevant context for user queries
- Generate accurate answers based on your documents
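
The four steps above can be sketched without any external libraries. This is a toy illustration, not the notebook's implementation: it assumes a bag-of-words "embedding", a plain list as the "vector store", and it stops at building the prompt instead of calling a model. The names `embed`, `cosine`, and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, store, k=1):
    """Return the k stored texts most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# 1. Load documents (hard-coded here instead of read from files)
documents = [
    "Python is a high-level, interpreted programming language.",
    "Machine Learning is a subset of artificial intelligence.",
]

# 2. Embed each document and keep (text, vector) pairs as the "vector store"
store = [(text, embed(text)) for text in documents]

# 3. Retrieve context for a question
question = "What is machine learning?"
context = retrieve(question, store, k=1)

# 4. Build the prompt a language model would receive
prompt = f"Context:\n{context[0]}\n\nQuestion: {question}\n\nAnswer:"
print(prompt)
```

The rest of the notebook replaces each toy piece with a real component: a sentence-transformer for embeddings, FAISS for storage and search, and a HuggingFace model for generation.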

## 1. Import Required Libraries

Import all necessary components for our RAG application.

In [1]:
import os
from pathlib import Path

# Our custom document parser
from automated_document_parser import DocumentParser

# LangChain components
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
from langchain_core.prompts import PromptTemplate

print("All libraries imported successfully!")

All libraries imported successfully!


## 3. Configure HuggingFace Models

We'll use free, small HuggingFace models for testing. No API key required!

In [2]:
# Configure models to use
# We'll use small, efficient models for testing

# Embedding model - small and fast
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # 384 dimensions, 22M parameters

# Language model for generation - small model for testing
LLM_MODEL = "google/flan-t5-small"  # roughly 80M parameters, good for testing

print(f"Embedding model: {EMBEDDING_MODEL}")
print(f"Language model: {LLM_MODEL}")
print("No API key required - models run locally!")

Embedding model: sentence-transformers/all-MiniLM-L6-v2
Language model: google/flan-t5-small
No API key required - models run locally!


## 4. Create Sample Documents

Let's create some sample documents for our RAG system to demonstrate the functionality.

In [3]:
# Create sample documents directory
sample_docs_dir = Path("sample_docs")
sample_docs_dir.mkdir(exist_ok=True)

# Create sample text file
with open(sample_docs_dir / "python_basics.txt", "w") as f:
    f.write("""
Python Programming Basics

Python is a high-level, interpreted programming language known for its simplicity and readability.
It was created by Guido van Rossum and first released in 1991.

Key Features:
- Easy to learn and read
- Dynamically typed
- Object-oriented
- Large standard library
- Cross-platform compatibility

Common Use Cases:
- Web development (Django, Flask)
- Data science and machine learning
- Automation and scripting
- API development
""")

# Create another sample file
with open(sample_docs_dir / "machine_learning.txt", "w") as f:
    f.write("""
Introduction to Machine Learning

Machine Learning is a subset of artificial intelligence that enables systems to learn
and improve from experience without being explicitly programmed.

Types of Machine Learning:
1. Supervised Learning - Learning from labeled data
2. Unsupervised Learning - Finding patterns in unlabeled data
3. Reinforcement Learning - Learning through trial and error

Popular ML Libraries:
- Scikit-learn for classical ML
- TensorFlow and PyTorch for deep learning
- Pandas for data manipulation
- NumPy for numerical computations
""")

# Create a CSV file
with open(sample_docs_dir / "data.csv", "w") as f:
    f.write("""language,year_created,paradigm
Python,1991,Multi-paradigm
JavaScript,1995,Multi-paradigm
Java,1995,Object-oriented
Rust,2010,Multi-paradigm
""")

print(f"Sample documents created in {sample_docs_dir}/")
print(f"   - python_basics.txt")
print(f"   - machine_learning.txt")
print(f"   - data.csv")

Sample documents created in sample_docs/
   - python_basics.txt
   - machine_learning.txt
   - data.csv


## 5. Load Documents Using Automated Document Parser

Now we'll use our custom DocumentParser to load documents of different formats. The parser detects each file's type from its extension and selects the matching loader.
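
Extension-based detection can be pictured as a lookup table from file suffix to loader function. The sketch below illustrates that general technique and is not the actual internals of `DocumentParser`: `load_text`, `load_csv`, `LOADERS`, and `parse_file` are hypothetical names, and plain dicts stand in for LangChain `Document` objects. The CSV loader emits one document per data row, which is an assumption chosen to mirror the per-row documents reported for `data.csv` in the cell output below.

```python
import csv
import tempfile
from pathlib import Path


def load_text(path):
    """Return the whole file as a single document."""
    return [{"page_content": path.read_text(encoding="utf-8"),
             "metadata": {"source": str(path), "file_type": path.suffix}}]


def load_csv(path):
    """Return one document per data row, formatted as 'column: value' lines."""
    with path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return [{"page_content": "\n".join(f"{k}: {v}" for k, v in row.items()),
             "metadata": {"source": str(path), "file_type": path.suffix, "row": i}}
            for i, row in enumerate(rows)]


# Hypothetical registry: file suffix -> loader function
LOADERS = {".txt": load_text, ".csv": load_csv}


def parse_file(file_path):
    """Pick a loader from the file extension; fail clearly on unknown types."""
    path = Path(file_path)
    loader = LOADERS.get(path.suffix.lower())
    if loader is None:
        raise ValueError(f"Unsupported file type: {path.suffix!r}")
    return loader(path)


# Usage with temporary files
with tempfile.TemporaryDirectory() as tmp:
    txt = Path(tmp) / "notes.txt"
    txt.write_text("Python was first released in 1991.", encoding="utf-8")
    table = Path(tmp) / "data.csv"
    table.write_text("language,year_created\nPython,1991\nRust,2010\n", encoding="utf-8")

    txt_docs = parse_file(txt)
    csv_docs = parse_file(table)
    print(len(txt_docs), len(csv_docs))
```

Keeping the mapping in one dictionary means supporting a new format only requires writing a loader and registering its suffix.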

In [4]:
# Initialize the DocumentParser
parser = DocumentParser()

# Get all files in the sample directory
file_paths = [str(f) for f in sample_docs_dir.glob("*") if f.is_file()]

# Parse all documents using our automated parser
parsed_docs = parser.parse_multiple(file_paths)

# Display loaded documents
total_docs = sum(len(docs) for docs in parsed_docs.values())
print(f"Loaded {total_docs} documents from {len(parsed_docs)} files:")
for file_path, docs in parsed_docs.items():
    print(f"  - {Path(file_path).name}: {len(docs)} document(s)")

# Flatten all documents into a single list
all_documents = []
for docs in parsed_docs.values():
    all_documents.extend(docs)

print(f"\nTotal document chunks: {len(all_documents)}")

# Display first document as example
if all_documents:
    print("\n--- Sample Document ---")
    print(f"Content preview: {all_documents[0].page_content[:200]}...")
    print(f"Metadata: {all_documents[0].metadata}")

Loaded 6 documents from 3 files:
  - python_basics.txt: 1 document(s)
  - data.csv: 4 document(s)
  - machine_learning.txt: 1 document(s)

Total document chunks: 6

--- Sample Document ---
Content preview: 
Python Programming Basics

Python is a high-level, interpreted programming language known for its simplicity and readability.
It was created by Guido van Rossum and first released in 1991.

Key Featu...
Metadata: {'source': '/Users/pulkit/Desktop/test/document_parser/notebooks/sample_docs/python_basics.txt', 'file_name': 'python_basics.txt', 'file_type': '.txt'}


## 6. Split Documents into Chunks

For better retrieval accuracy, we split documents into smaller chunks. This ensures:
- Each chunk fits within the embedding model's context window
- More precise retrieval of relevant information
- Better matching between queries and document segments
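
To make the chunk size and overlap parameters concrete, here is a simplified fixed-window splitter. It is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line boundaries, which this toy `split_with_overlap` function (a hypothetical name) does not attempt.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.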

In [5]:
# Split documents into smaller chunks
# (RecursiveCharacterTextSplitter was already imported in section 1)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

# Split all documents
split_documents = text_splitter.split_documents(all_documents)

print(f"Split {len(all_documents)} documents into {len(split_documents)} chunks")
print(f"\nSample chunk:")
print(f"Content: {split_documents[0].page_content[:200]}...")
print(f"Metadata: {split_documents[0].metadata}")

Split 6 documents into 7 chunks

Sample chunk:
Content: Python Programming Basics

Python is a high-level, interpreted programming language known for its simplicity and readability.
It was created by Guido van Rossum and first released in 1991.

Key Featur...
Metadata: {'source': '/Users/pulkit/Desktop/test/document_parser/notebooks/sample_docs/python_basics.txt', 'file_name': 'python_basics.txt', 'file_type': '.txt'}


## 7. Create Embeddings and Vector Store

Initialize HuggingFace embeddings and create a FAISS vector store. The vector store enables:
- Fast similarity search across thousands of documents
- Efficient retrieval of relevant context
- Semantic matching beyond keyword search
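
As a rough illustration of what a similarity search does, the sketch below ranks a few hand-made vectors against a query by brute force. The three-dimensional vectors and the names `normalize`, `dot`, and `l2_distance` are made up for the example; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and FAISS uses optimized index structures rather than a Python loop. The sketch also shows why normalizing embeddings is convenient: for unit-length vectors, ranking by smallest Euclidean distance and ranking by largest cosine similarity produce the same order.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up 3-dimensional "embeddings" keyed by chunk label
chunks = {
    "python_basics": normalize([0.9, 0.1, 0.2]),
    "machine_learning": normalize([0.2, 0.9, 0.1]),
    "csv_row": normalize([0.5, 0.4, 0.6]),
}
query = normalize([1.0, 0.2, 0.1])

# Rank by cosine similarity, highest first
by_cosine = sorted(chunks, key=lambda name: dot(query, chunks[name]), reverse=True)

# Rank by Euclidean distance, lowest first
by_l2 = sorted(chunks, key=lambda name: l2_distance(query, chunks[name]))

print(by_cosine)
print(by_l2)
```

The two rankings agree because, for unit vectors, the squared distance equals 2 minus twice the dot product, so a larger similarity always means a smaller distance.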

In [6]:
# Initialize HuggingFace embeddings
embeddings = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL,
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)

print(f"Using embedding model: {EMBEDDING_MODEL}")
print("Creating vector store...")

# Create FAISS vector store
vectorstore = FAISS.from_documents(
    documents=split_documents,
    embedding=embeddings
)

print(f"Vector store created with {vectorstore.index.ntotal} embeddings")

# Test similarity search
query = "What is Python?"
results = vectorstore.similarity_search(query, k=2)
print(f"\nTest search for: '{query}'")
print(f"Found {len(results)} relevant chunks")
print(f"\nTop result preview:")
print(results[0].page_content[:200])

Using embedding model: sentence-transformers/all-MiniLM-L6-v2
Creating vector store...
Vector store created with 7 embeddings

Test search for: 'What is Python?'
Found 2 relevant chunks

Top result preview:
Python Programming Basics

Python is a high-level, interpreted programming language known for its simplicity and readability.
It was created by Guido van Rossum and first released in 1991.

Key Featur


## 8. Configure Language Model

Set up the HuggingFace language model for generating answers. We're using flan-t5-small, a compact model that's perfect for testing and learning.

In [7]:
# Initialize the language model pipeline
from transformers import pipeline

print(f"Loading language model: {LLM_MODEL}")
print("This may take a moment on first run...")

text_generation_pipeline = pipeline(
    "text2text-generation",
    model=LLM_MODEL,
    max_length=512,
    device=-1  # Use CPU
)

# Create LangChain LLM wrapper
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

print("Language model loaded successfully")

Loading language model: google/flan-t5-small
This may take a moment on first run...


Device set to use cpu


Language model loaded successfully


## 9. Create Simple Q&A Function

Now we'll create a simple question-answering function that:
1. **Retrieves**: Searches the vector store for relevant documents
2. **Generates**: Uses the LLM to answer based on retrieved context
3. **Returns**: Provides a single answer without maintaining history

In [8]:
# Create a simple Q&A function
def answer_question(question, retriever, llm):
    """
    Answer a question based on retrieved documents.

    Args:
        question: The user's question
        retriever: Vector store retriever (the number of documents it
            returns is set via search_kwargs when the retriever is created)
        llm: Language model

    Returns:
        tuple: (answer, source_documents)
    """
    # Retrieve relevant documents
    docs = retriever.invoke(question)
    
    # Combine document contents
    context = "\n\n".join([doc.page_content for doc in docs])
    
    # Create prompt
    prompt = f"""Answer the question based on the context below. If the answer cannot be found in the context, say "I don't have enough information to answer that."

Context:
{context}

Question: {question}

Answer:"""
    
    # Generate answer
    answer = llm.invoke(prompt)
    
    return answer, docs

# Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

print("Simple Q&A function created successfully!")
print("\nReady to answer questions based on your documents!")


Simple Q&A function created successfully!

Ready to answer questions based on your documents!


## 10. Ask Questions and Get Answers

Now let's test our Q&A function! The system will:
1. Take your question
2. Search for relevant document chunks
3. Generate an answer based on the context
4. Return the answer with source attribution
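
Because several retrieved chunks can come from the same file, a source list may repeat file names. One possible way to report each file once, in retrieval order, is sketched below. `unique_sources` is a hypothetical helper, and the `Doc` dataclass is a minimal stand-in for a LangChain `Document` so that the example runs on its own.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(docs):
    """Return source file names in first-seen order, without repeats."""
    names = []
    for doc in docs:
        source = doc.metadata.get("source")
        name = Path(source).name if source else "Unknown"
        if name not in names:
            names.append(name)
    return names


retrieved = [
    Doc("Python Programming Basics ...", {"source": "sample_docs/python_basics.txt"}),
    Doc("language: Python ...", {"source": "sample_docs/data.csv"}),
    Doc("language: Rust ...", {"source": "sample_docs/data.csv"}),
    Doc("no metadata here"),
]
print(unique_sources(retrieved))
```

Documents without a `source` entry are grouped under a single `Unknown` label rather than raising an error.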

In [9]:
# Ask questions about the documents
questions = [
    "What is Python and when was it created?",
    "What are the types of machine learning?",
    "Which programming language was created in 2010?"
]

for question in questions:
    print(f"\nQuestion: {question}")
    print("-" * 80)
    
    answer, source_docs = answer_question(question, retriever, llm)
    
    print(f"Answer: {answer}")
    print(f"\nSources used:")
    for i, doc in enumerate(source_docs, 1):
        source = doc.metadata.get('source', 'Unknown')
        file_name = Path(source).name if source != 'Unknown' else 'Unknown'
        print(f"  {i}. {file_name}")


Question: What is Python and when was it created?
--------------------------------------------------------------------------------
Answer: Python Programming Basics Python is a high-level, interpreted programming language known for its simplicity and readability. It was created by Guido van Rossum and first released in 1991.

Sources used:
  1. python_basics.txt
  2. data.csv
  3. data.csv

Question: What are the types of machine learning?
--------------------------------------------------------------------------------
Answer: a subset of artificial intelligence that enables systems to learn and im