# RAG System Demo Notebook

This notebook walks through the core pipeline of the RAG question-answering system: document ingestion, embedding and indexing, retrieval, answer generation, and summarization.


## Setup and Imports

In [4]:
!pip install langchain



In [6]:
import sys
import os

# Add src directory to path
sys.path.append('src')

from ingest import ingest_documents
from embed import build_index_from_chunks
from retriever import create_retriever_from_chunks
from generator import create_generator
from langchain.schema import Document

import logging
logging.basicConfig(level=logging.INFO)




## 1. Document Ingestion

First, let's ingest some sample documents and see how they are processed.

In [None]:
# Create a sample document
sample_doc_content = """
Machine Learning Fundamentals

Machine learning is a subset of artificial intelligence that focuses on developing algorithms 
that can learn and make decisions from data without being explicitly programmed.

Types of Machine Learning:
1. Supervised Learning: Uses labeled data to train models
2. Unsupervised Learning: Finds patterns in unlabeled data
3. Reinforcement Learning: Learns through interaction with environment

Deep Learning:
Deep learning is a subset of machine learning that uses neural networks with multiple layers.
It has been particularly successful in image recognition, natural language processing, and speech recognition.
"""

# Save to file (explicit encoding keeps the result the same across platforms)
with open('../sample_ml_doc.txt', 'w', encoding='utf-8') as f:
    f.write(sample_doc_content)

print("Sample document created!")

In [None]:
# Ingest the document
document_paths = ['../sample_ml_doc.txt']
chunks = ingest_documents(document_paths)

print(f"Created {len(chunks)} chunks from {len(document_paths)} document(s)")
print("\nFirst few chunks:")
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i+1}:")
    print(f"Content: {chunk['content'][:100]}...")
    print(f"Source: {chunk['source_document']}")
    print(f"Chunk ID: {chunk['chunk_id']}")
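
The implementation of `ingest_documents` lives in `src/ingest.py` and is not shown in this notebook. As a rough illustration of what chunking can look like, here is a minimal sketch that assumes fixed-size character windows with overlap. The function name `chunk_text` and the parameters `chunk_size` and `overlap` are hypothetical; only the dictionary keys (`content`, `source_document`, `chunk_id`) are taken from the cell above.

```python
def chunk_text(text, source="example.txt", chunk_size=200, overlap=50):
    """Split text into overlapping character windows.

    Hypothetical helper for illustration only; the real ingest module
    may split on sentences, tokens, or paragraphs instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip windows that contain only whitespace
            chunks.append({
                "content": piece,
                "source_document": source,
                "chunk_id": len(chunks),
            })
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Toy input: 450 characters cycling through the lowercase alphabet
demo_text = "".join(chr(97 + i % 26) for i in range(450))
demo_chunks = chunk_text(demo_text, chunk_size=200, overlap=50)
print(f"{len(demo_chunks)} chunks, sizes: {[len(c['content']) for c in demo_chunks]}")
```

The overlap means the tail of one chunk repeats at the head of the next, so a sentence cut by a window boundary still appears whole in at least one chunk when it is shorter than the overlap.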

## 2. Embedding and Indexing

Now let's create embeddings and build a FAISS index for similarity search.

In [None]:
# Build FAISS index
embedding_manager = build_index_from_chunks(chunks, index_path='demo_index')
print("FAISS index created successfully!")
print(f"Index contains {embedding_manager.index.ntotal} vectors")
print(f"Embedding dimension: {embedding_manager.dimension}")

In [None]:
# Test similarity search
query = "What is deep learning?"
results = embedding_manager.search(query, k=3)

print(f"Search results for: '{query}'")
print("=" * 50)
for i, result in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(f"Score: {result['score']:.3f}")
    print(f"Content: {result['content'][:150]}...")
    print(f"Source: {result['chunk_metadata'].get('source_document', 'Unknown')}")
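
To make the idea behind the search call concrete, the sketch below ranks a handful of toy vectors against a query vector by brute force. It assumes cosine similarity as the score; the metric and index type actually used in `src/embed.py` are not shown in this notebook, so treat this as an illustration of the concept rather than the project's implementation. The names `cosine_similarity` and `brute_force_search` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def brute_force_search(query_vec, vectors, k=3):
    """Score every stored vector against the query and keep the top k."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"index": i, "score": s} for s, i in scored[:k]]


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = brute_force_search([1.0, 0.1], toy_vectors, k=2)
for hit in top:
    print(f"index={hit['index']} score={hit['score']:.3f}")
```

A vector index such as FAISS serves the same purpose, returning the nearest stored vectors for a query, but is built to do so efficiently over large collections.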

## 3. LangChain Retrieval

Let's use LangChain for more advanced retrieval capabilities.

In [None]:
# Create LangChain retriever
retriever = create_retriever_from_chunks(chunks, save_path='langchain_demo_store')
print("LangChain retriever created successfully!")

In [None]:
# Test retrieval with LangChain
query = "What are the types of machine learning?"
documents = retriever.retrieve_documents(query)

print(f"Retrieved {len(documents)} documents for: '{query}'")
print("=" * 50)
for i, doc in enumerate(documents, 1):
    print(f"\nDocument {i}:")
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata: {doc.metadata}")

## 4. Question Answering

Now let's use the generator to answer questions based on retrieved documents.

In [None]:
# Create generator
generator = create_generator()
print("Generator created successfully!")

In [None]:
# Test question answering
questions = [
    "What is machine learning?",
    "What are the main types of machine learning?",
    "How is deep learning different from machine learning?",
    "What applications is deep learning good for?"
]

for question in questions:
    print(f"\nQuestion: {question}")
    print("=" * 60)
    
    # Retrieve relevant documents
    docs = retriever.retrieve_documents(question)
    
    # Generate answer
    result = generator.answer_question(question, docs)
    
    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['confidence']:.2f}")
    print(f"Sources: {len(result['sources'])} documents")
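
How `answer_question` combines the question with the retrieved documents is internal to `src/generator.py`. One common approach, sketched here under that caveat, is to join the retrieved text into a context block and place it in front of the question. The `Doc` class is a stand-in exposing the `page_content` and `metadata` attributes used earlier in this notebook; `build_prompt`, `max_chars`, and the prompt wording are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_prompt(question, docs, max_chars=2000):
    """Assemble a context-plus-question prompt from retrieved documents."""
    parts = []
    used = 0
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source_document", "Unknown")
        block = f"[{i}] (source: {source})\n{doc.page_content.strip()}"
        if used + len(block) > max_chars:
            break  # stop before the context exceeds the character budget
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


demo_docs = [
    Doc("Deep learning uses neural networks with multiple layers.",
        {"source_document": "sample_ml_doc.txt"}),
]
demo_prompt = build_prompt("What is deep learning?", demo_docs)
print(demo_prompt)
```

Numbering each context block makes it possible to report which retrieved passages contributed to the prompt, which is one way a `sources` field like the one printed above could be populated.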

## 5. Document Summarization

Let's test the summarization functionality.

In [None]:
# Test summarization
all_docs = retriever.retrieve_documents("machine learning deep learning")
summary_result = generator.summarize_documents(all_docs)

print("Document Summary:")
print("=" * 40)
print(summary_result['summary'])
print("\nSummary Statistics:")
print(f"Documents processed: {summary_result['num_documents']}")
print(f"Input length: {summary_result['input_length']} characters")

## 6. Performance Analysis

Let's analyze the performance of different components.

In [None]:
import time

# Measure retrieval performance
test_queries = [
    "machine learning",
    "deep learning",
    "supervised learning",
    "neural networks",
    "artificial intelligence"
]

retrieval_times = []
generation_times = []

for query in test_queries:
    # Measure retrieval time (perf_counter is a monotonic, high-resolution clock)
    start_time = time.perf_counter()
    docs = retriever.retrieve_documents(query)
    retrieval_time = time.perf_counter() - start_time
    retrieval_times.append(retrieval_time)

    # Measure generation time
    start_time = time.perf_counter()
    result = generator.answer_question(query, docs)
    generation_time = time.perf_counter() - start_time
    generation_times.append(generation_time)

print("Performance Analysis:")
print("=" * 30)
print(f"Average retrieval time: {sum(retrieval_times)/len(retrieval_times):.3f} seconds")
print(f"Average generation time: {sum(generation_times)/len(generation_times):.3f} seconds")
print(f"Total average time: {(sum(retrieval_times) + sum(generation_times))/len(test_queries):.3f} seconds")
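
A single pass over five queries gives a noisy average, since the first call often includes one-time costs such as model loading. The sketch below shows one way to repeat each call and report the median alongside the mean. The helper `time_calls` and its `repeats` parameter are hypothetical, and the lambda stands in for a real call such as `retriever.retrieve_documents`.

```python
import statistics
import time


def time_calls(func, inputs, repeats=3):
    """Call func on every input `repeats` times and summarize the durations."""
    durations = []
    for item in inputs:
        for _ in range(repeats):
            start = time.perf_counter()
            func(item)
            durations.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "max": max(durations),
        "calls": len(durations),
    }


# Stand-in workload; replace the lambda with the function being measured
stats = time_calls(lambda q: q.upper(), ["machine learning", "deep learning"])
print(f"calls={stats['calls']} mean={stats['mean']:.6f}s median={stats['median']:.6f}s")
```

The median is less sensitive than the mean to a slow first call, so a large gap between the two is a hint that warm-up cost is dominating the measurement.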

## 7. Cleanup

Clean up temporary files created during the demo.

In [None]:
import os
import shutil

# Clean up files
files_to_remove = [
    '../sample_ml_doc.txt',
    'demo_index.faiss',
    'demo_index_metadata.json'
]

dirs_to_remove = [
    'langchain_demo_store'
]

for file_path in files_to_remove:
    if os.path.exists(file_path):
        os.remove(file_path)
        print(f"Removed: {file_path}")

for dir_path in dirs_to_remove:
    if os.path.exists(dir_path):
        shutil.rmtree(dir_path)
        print(f"Removed directory: {dir_path}")

print("\nCleanup completed!")
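
An alternative to removing files by hand is to write all demo artifacts into a temporary directory that is deleted automatically. This is a sketch of that pattern using only the standard library; whether `index_path` and `save_path` accept paths inside such a directory depends on the project's modules and is an assumption here.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Everything created under workdir disappears when the block exits
    doc_path = os.path.join(workdir, "sample_ml_doc.txt")
    with open(doc_path, "w", encoding="utf-8") as f:
        f.write("Machine Learning Fundamentals\n")
    existed_inside = os.path.exists(doc_path)

removed_after = not os.path.exists(workdir)
print(f"Existed inside block: {existed_inside}, removed afterwards: {removed_after}")
```

With this pattern the cleanup also runs when an earlier step raises an exception, which the explicit removal cell above does not guarantee.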

## Conclusion

This notebook demonstrated the core functionality of the RAG system:

1. **Document Ingestion**: Processing and chunking documents
2. **Embedding & Indexing**: Creating vector representations and a FAISS index
3. **Retrieval**: Finding relevant document chunks for queries
4. **Generation**: Answering questions and summarizing documents
5. **Performance**: Measuring system response times

The system successfully processes documents, creates searchable embeddings, retrieves relevant content, and generates coherent answers with source attribution.

### Next Steps

- Try with your own documents
- Experiment with different models
- Tune parameters for your use case
- Deploy using the provided Docker configuration