# Simple RAG System Demo

This notebook demonstrates how to use DocChunker together with LangChain to build a simple Retrieval-Augmented Generation (RAG) pipeline. We will:
- Process a DOCX document into retrievable chunks.
- Index the chunks using FAISS as our vector store.
- Retrieve relevant chunks based on a user query.
- Use a language model to generate an answer using the retrieved context.

Note: To run this demo, please install the optional dependencies by running:

```bash
pip install "docchunker[dev]"
```
This will install LangChain, FAISS, OpenAI, and other libraries needed for the RAG pipeline demonstration.
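
Before running the cells, it can help to confirm that the optional dependencies are importable. The following is a minimal sketch; the import names in `REQUIRED_MODULES` are assumptions based on the libraries listed above (for example, FAISS is normally imported as `faiss`), and `missing_modules` is a hypothetical helper, not part of DocChunker.

```python
from importlib.util import find_spec

# Top-level import names assumed for the libraries this demo mentions.
REQUIRED_MODULES = ["docchunker", "langchain", "faiss", "openai"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All optional dependencies are importable.")
```

If any module is reported missing, re-run the install command above in the same environment that the notebook kernel uses.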

## Setup and Imports

In [None]:
import os
from pathlib import Path
from docchunker import DocChunker

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

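# Keep an already exported OPENAI_API_KEY; otherwise fall back to the placeholder, which you must replace
os.environ.setdefault("OPENAI_API_KEY", "your-openai-api-key")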
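
Hardcoding a key in a notebook makes it easy to commit by accident. One alternative, sketched below, reads the key from the environment and prompts for it only when it is unset. `resolve_api_key` is a hypothetical helper introduced for illustration; only the `OPENAI_API_KEY` variable name comes from the cell above.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the key from OPENAI_API_KEY, prompting only if it is unset."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value, so the key does not end up in cell output.
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key
```

In the notebook this could be used as `os.environ["OPENAI_API_KEY"] = resolve_api_key()`. The `env` and `prompt` parameters exist only so the function can be exercised without a real environment or terminal.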

## Process Document into Chunks

In [5]:
current_dir = Path.cwd()
samples_dir = current_dir.parent / "data" / "samples"
doc_path = samples_dir / "complex_document.docx"

# Initialize DocChunker with a target chunk size of 200 tokens (the optional overlap settings are omitted here)
chunker = DocChunker(chunk_size=200)

# Process the document (returns a list of Chunk objects)
chunks = chunker.process_document(str(doc_path))

print(f"Processed document and generated {len(chunks)} chunks.")

Processed document and generated 86 chunks.
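
A quick sanity check on the chunk sizes can catch a misconfigured `chunk_size` before anything is embedded. The sketch below summarizes lengths in whitespace-separated words, which is only a rough proxy for model tokens. `summarize_lengths` is a hypothetical helper, and the `sample` strings are made-up stand-ins for chunk texts.

```python
from statistics import mean


def summarize_lengths(texts):
    """Return count, min, mean and max length of the texts in whitespace-separated words."""
    lengths = [len(text.split()) for text in texts]
    if not lengths:
        return {"count": 0, "min": 0, "mean": 0.0, "max": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
    }


# Made-up stand-ins for chunk texts.
sample = ["one two three", "four five", "six seven eight nine"]
print(summarize_lengths(sample))
```

With the chunks from the cell above, the call would be `summarize_lengths([chunk.text for chunk in chunks])`.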


## Build a Vector Store from Chunks

In [8]:
# Extract the text of each chunk (no metadata is passed to the vector store here)
documents = [chunk.text for chunk in chunks]

# Initialize embeddings (using OpenAI's embeddings as an example; use a dummy if you don't have an API key)
embeddings = OpenAIEmbeddings()  # or use: embeddings = YourCustomEmbeddings()

# Build a FAISS vector store from the documents
vector_store = FAISS.from_texts(documents, embeddings)

print("Vector store created with FAISS.")

Vector store created with FAISS.
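
Conceptually, a flat vector index answers a query by comparing the query embedding with every stored embedding and returning the closest ones. The sketch below does this by brute force with squared Euclidean distance in plain Python. It assumes that the FAISS store built above uses a flat L2 index, which is my understanding of the LangChain default but is not confirmed by this notebook. The two-dimensional vectors are made up for illustration; real embeddings have many more dimensions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec, doc_vecs, k=3):
    """Return the indices of the k stored vectors closest to the query."""
    order = sorted(range(len(doc_vecs)), key=lambda i: squared_l2(query_vec, doc_vecs[i]))
    return order[:k]


# Made-up 2-D "embeddings"; index i corresponds to chunk i.
doc_vecs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
print(nearest([1.0, 1.0], doc_vecs, k=2))
```

FAISS performs the same kind of search with optimized native code and optional approximate indexes, which matters once the number of chunks grows well beyond the 86 used here.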


## Build the RAG Pipeline
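
The `stuff` chain type used in this section places all retrieved chunks into a single prompt together with the question. The sketch below shows that assembly step in plain Python. The prompt wording and the `build_stuff_prompt` helper are assumptions for illustration and are not LangChain's actual template.

```python
def build_stuff_prompt(question, contexts):
    """Concatenate the retrieved contexts and the question into one prompt string."""
    joined = "\n\n".join(contexts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{joined}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up contexts standing in for the k retrieved chunks.
prompt = build_stuff_prompt(
    "Which study introduced recursive table parsing?",
    ["Study: Patel & Garcia | Year: 2024", "Study: Zhang et al. | Year: 2023"],
)
print(prompt)
```

Because every retrieved chunk is included verbatim, the prompt length grows with both `k` and the chunk size, so the two values need to fit within the model's context window together.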

In [None]:
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

llm = OpenAI(temperature=0)

rag_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

query = "What is the key innovation introduced in Johnson & Williams's 2023 paper?"
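# invoke() replaces the deprecated run(); RetrievalQA returns the answer under the "result" key
result = rag_chain.invoke({"query": query})["result"]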

print("Query:", query)
print("\nAnswer (generated by the RAG system):\n", result)
# invoke() replaces the deprecated get_relevant_documents()
print("Context used for the answer:\n", retriever.invoke(query))

Query: What is the key innovation introduced in Johnson & Williams's 2023 paper?

Answer (generated by the RAG system):
  The key innovation introduced in Johnson & Williams's 2023 paper is a hierarchical attention mechanism.



Context used for the answer:
 [Document(id='0d930ec6-5a78-4c3e-b5e6-d4725867f59e', metadata={}, page_content='H1: Complex Document for Chunking Tests\nH2: Section 8: Mixed Content with References\nH3: Research Findings on Document Processing Techniques\n---\nStudy: Smith et al. | Year: 2022 | Key Innovation: Structure-aware chunking | Performance Improvement: 37% higher semantic coherence\nStudy: Johnson & Williams | Year: 2023 | Key Innovation: Hierarchical attention mechanism | Performance Improvement: 42% improvement in QA accuracy\nStudy: Zhang et al. | Year: 2023 | Key Innovation: Multi-modal embeddings | Performance Improvement: 28% better image-text alignment\nStudy: Patel & Garcia | Year: 2024 | Key Innovation: Recursive table parsing | Performance Improvement: 53% reduction in information loss'), Document(id='e3c34870-9712-4221-bf0a-0585a0d87ecb', metadata={}, page_content='H1: Complex Document for Chunking Tests\nH2: Section 8: Mixed Content with References\nH3: Research Find