# Streamlit Document Q&A Application

> **Created by [Build Fast with AI](https://www.buildfastwithai.com)**

This notebook demonstrates how to build a document question-answering application using Streamlit, Gemini 3 Pro, and RAG (Retrieval-Augmented Generation).

## What you'll learn:
- Building document upload interfaces
- Extracting text from various file formats
- Creating vector embeddings
- Implementing RAG for Q&A
- Building interactive document viewers
- Managing document collections

## 1. Installation and Setup

In [None]:
!pip install -q streamlit google-generativeai chromadb langchain langchain-google-genai pypdf python-docx

In [None]:
import os
import google.generativeai as genai
from IPython.display import Markdown, display

In [None]:
# Configure API key
try:
    from google.colab import userdata
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
except Exception:  # not running in Colab, or the secret is not set
    GOOGLE_API_KEY = os.environ.get('GOOGLE_API_KEY', 'your-api-key-here')

genai.configure(api_key=GOOGLE_API_KEY)

## 2. Document Processing Utilities

In [None]:
# Document processing functions
from typing import List, Dict
import re

def extract_text_from_txt(file) -> str:
    """Extract text from TXT file."""
    return file.read().decode('utf-8')

def extract_text_from_pdf(file) -> str:
    """Extract text from PDF file."""
    try:
        # pypdf is the package installed above (it supersedes PyPDF2)
        from pypdf import PdfReader
        pdf = PdfReader(file)
        text = ""
        for page in pdf.pages:
            # Guard against pages with no extractable text
            text += page.extract_text() or ""
        return text
    except ImportError:
        return "pypdf not installed. Run: pip install pypdf"

def extract_text_from_docx(file) -> str:
    """Extract text from DOCX file."""
    try:
        from docx import Document
        doc = Document(file)
        text = "\n".join([paragraph.text for paragraph in doc.paragraphs])
        return text
    except ImportError:
        return "python-docx not installed. Run: pip install python-docx"

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into overlapping chunks."""
    words = text.split()
    chunks = []
    
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        
        if i + chunk_size >= len(words):
            break
    
    return chunks

# Test document processing
sample_text = """Artificial Intelligence (AI) is transforming the world. 
Machine learning, a subset of AI, enables computers to learn from data. 
Deep learning uses neural networks with multiple layers to process complex patterns.
Natural language processing helps computers understand human language.
Computer vision enables machines to interpret visual information."""

chunks = chunk_text(sample_text, chunk_size=20, overlap=5)
print(f"Created {len(chunks)} chunks from sample text\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk}\n")
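
`chunk_text` advances by `chunk_size - overlap` words per chunk, so an `overlap` equal to or larger than `chunk_size` gives a zero or negative step and the function either raises or returns nothing. The sketch below is a hypothetical variant (the name `chunk_text_checked` is not part of the notebook) that validates its arguments first and shows where each chunk starts.

```python
from typing import List

def chunk_text_checked(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Word-based chunking with argument validation (illustrative variant of chunk_text)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

# 50 synthetic words, chunks of 20 with 5 words shared between neighbours
demo_words = " ".join(f"w{i}" for i in range(50))
demo = chunk_text_checked(demo_words, chunk_size=20, overlap=5)
print(len(demo))            # 3
print(demo[1].split()[0])   # w15
```

With a step of 15 words, the chunks start at words 0, 15, and 30, and each chunk repeats the last 5 words of the previous one.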

## 3. Basic Document Q&A System

In [None]:
class SimpleDocumentQA:
    """Simple document Q&A without vector search."""
    
    def __init__(self, document_text: str):
        self.document = document_text
        self.model = genai.GenerativeModel('gemini-3-pro')
    
    def ask(self, question: str) -> str:
        """Ask a question about the document."""
        prompt = f"""
        Based on the following document, answer the question.
        If the answer is not in the document, say "I cannot find that information in the document."
        
        Document:
        {self.document}
        
        Question: {question}
        
        Answer:
        """
        
        response = self.model.generate_content(prompt)
        return response.text

# Test with sample document
doc_text = """
The Python programming language was created by Guido van Rossum and first released in 1991.
Python is known for its simple, readable syntax and comprehensive standard library.
It is widely used in web development, data science, artificial intelligence, and automation.
Python 3.0 was released in 2008 and introduced several breaking changes from Python 2.
The Python Software Foundation manages the development of Python.
"""

qa_system = SimpleDocumentQA(doc_text)

questions = [
    "Who created Python?",
    "When was Python first released?",
    "What is Python used for?",
    "What is the capital of France?"  # Not in document
]

for q in questions:
    print(f"\nQ: {q}")
    answer = qa_system.ask(q)
    print(f"A: {answer}")
    print("-" * 80)

## 4. RAG-Based Document Q&A with ChromaDB

In [None]:
import chromadb
from chromadb.utils import embedding_functions

class RAGDocumentQA:
    """RAG-based document Q&A with vector search."""
    
    def __init__(self):
        self.client = chromadb.Client()
        self.embedding_function = embedding_functions.DefaultEmbeddingFunction()
        self.collection = self.client.get_or_create_collection(  # safe to re-run this cell
            name="documents",
            embedding_function=self.embedding_function
        )
        self.model = genai.GenerativeModel('gemini-3-pro')
    
    def add_document(self, text: str, doc_id: str, chunk_size: int = 500):
        """Add a document to the collection."""
        chunks = chunk_text(text, chunk_size=chunk_size)
        
        for i, chunk in enumerate(chunks):
            self.collection.add(
                documents=[chunk],
                ids=[f"{doc_id}_chunk_{i}"],
                metadatas=[{"doc_id": doc_id, "chunk_index": i}]
            )
        
        return len(chunks)
    
    def ask(self, question: str, n_results: int = 3) -> Dict:
        """Ask a question and retrieve relevant context."""
        # Search for relevant chunks
        results = self.collection.query(
            query_texts=[question],
            n_results=n_results
        )
        
        # Combine relevant chunks
        context = "\n\n".join(results['documents'][0])
        
        # Generate answer
        prompt = f"""
        Answer the question based on the following context.
        If you cannot answer based on the context, say so.
        
        Context:
        {context}
        
        Question: {question}
        
        Answer:
        """
        
        response = self.model.generate_content(prompt)
        
        return {
            "answer": response.text,
            "context": context,
            "sources": results['metadatas'][0]
        }

# Test RAG system
rag_qa = RAGDocumentQA()

# Add multiple documents
docs = {
    "python_intro": """Python is a high-level programming language created by Guido van Rossum.
    It was first released in 1991 and emphasizes code readability. Python supports multiple
    programming paradigms including procedural, object-oriented, and functional programming.""",
    
    "python_applications": """Python is widely used in many domains. In web development,
    frameworks like Django and Flask are popular. For data science, libraries like NumPy,
    Pandas, and Scikit-learn are essential. Python is also the leading language for
    artificial intelligence and machine learning with libraries like TensorFlow and PyTorch.""",
    
    "python_features": """Python features include dynamic typing, automatic memory management,
    and a comprehensive standard library. The language uses indentation for code blocks
    instead of braces. Python's syntax is designed to be clean and easy to read."""
}

for doc_id, text in docs.items():
    chunks = rag_qa.add_document(text, doc_id)
    print(f"Added {doc_id}: {chunks} chunks")

# Ask questions
print("\n" + "="*80 + "\n")

questions = [
    "What is Python used for in data science?",
    "Who created Python?",
    "What are Python's main features?"
]

for q in questions:
    print(f"Question: {q}\n")
    result = rag_qa.ask(q)
    print(f"Answer: {result['answer']}\n")
    print(f"Sources: {result['sources']}")
    print("\n" + "="*80 + "\n")

## 5. Streamlit Document Q&A App - Basic Version

In [None]:
# Save this as document_qa_basic.py
# Raw string (r'''): keeps the "\n" escapes in the app code literal when written to disk
basic_app_code = r'''
import streamlit as st
import google.generativeai as genai
import os

st.set_page_config(
    page_title="Document Q&A",
    page_icon="📄",
    layout="wide"
)

st.title("📄 Document Question & Answer")
st.caption("Upload a document and ask questions about it")

# API Key
api_key = os.environ.get('GOOGLE_API_KEY')
if not api_key:
    api_key = st.sidebar.text_input("Google API Key", type="password")

if api_key:
    genai.configure(api_key=api_key)
    
    # File upload
    uploaded_file = st.file_uploader(
        "Upload a document (TXT, PDF, or DOCX)",
        type=["txt", "pdf", "docx"]
    )
    
    if uploaded_file:
        # Extract text based on file type
        if uploaded_file.type == "text/plain":
            text = uploaded_file.read().decode("utf-8")
        elif uploaded_file.type == "application/pdf":
            from pypdf import PdfReader
            pdf = PdfReader(uploaded_file)
            text = ""
            for page in pdf.pages:
                # Guard against pages with no extractable text
                text += page.extract_text() or ""
        elif uploaded_file.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
            from docx import Document
            doc = Document(uploaded_file)
            text = "\n".join([para.text for para in doc.paragraphs])
        
        # Store in session state
        st.session_state.document_text = text
        
        # Display document info
        col1, col2, col3 = st.columns(3)
        with col1:
            st.metric("Characters", len(text))
        with col2:
            st.metric("Words", len(text.split()))
        with col3:
            st.metric("Lines", text.count("\n"))
        
        # Show document preview
        with st.expander("📖 Document Preview"):
            st.text_area("Content", text[:2000] + "..." if len(text) > 2000 else text, height=300)
        
        st.divider()
        
        # Q&A interface
        st.subheader("Ask Questions")
        
        question = st.text_input("Enter your question:")
        
        if st.button("Get Answer", type="primary"):
            if question:
                with st.spinner("Finding answer..."):
                    model = genai.GenerativeModel('gemini-3-pro')
                    
                    prompt = f"""
                    Based on the following document, answer the question.
                    If the answer is not in the document, say so clearly.
                    
                    Document:
                    {text}
                    
                    Question: {question}
                    
                    Answer:
                    """
                    
                    response = model.generate_content(prompt)
                    
                    st.success("Answer:")
                    st.markdown(response.text)
            else:
                st.warning("Please enter a question")
    else:
        st.info("👆 Upload a document to get started")
else:
    st.warning("Please enter your API key in the sidebar")
'''

with open('document_qa_basic.py', 'w', encoding='utf-8') as f:
    f.write(basic_app_code)

print("Basic Document Q&A app saved to document_qa_basic.py")
print("\nTo run: streamlit run document_qa_basic.py")

## 6. Advanced Document Q&A App with RAG

In [None]:
# Save this as document_qa_advanced.py
# Raw string (r'''): keeps the "\n" escapes in the app code literal when written to disk
advanced_app_code = r'''
import streamlit as st
import google.generativeai as genai
import chromadb
from chromadb.utils import embedding_functions
import os
from pypdf import PdfReader
from docx import Document

st.set_page_config(
    page_title="Advanced Document Q&A",
    page_icon="🔍",
    layout="wide"
)

# Custom CSS
st.markdown("""
<style>
    .stApp {
        max-width: 1200px;
        margin: 0 auto;
    }
    .upload-section {
        background-color: #f0f2f6;
        padding: 2rem;
        border-radius: 10px;
        margin-bottom: 2rem;
    }
</style>
""", unsafe_allow_html=True)

st.title("🔍 Advanced Document Q&A with RAG")
st.caption("Upload multiple documents and ask questions using semantic search")

# Sidebar configuration
with st.sidebar:
    st.header("⚙️ Configuration")
    
    api_key = os.environ.get('GOOGLE_API_KEY')
    if not api_key:
        api_key = st.text_input("Google API Key", type="password")
    
    st.divider()
    
    st.subheader("RAG Parameters")
    chunk_size = st.slider("Chunk Size", 200, 1000, 500, 50)
    n_results = st.slider("Results to Retrieve", 1, 5, 3)
    
    st.divider()
    
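    if st.button("🗑️ Clear All Documents"):
        if "collection" in st.session_state:
            # Also delete the stored chunks, not just the bookkeeping dict
            stored_ids = st.session_state.collection.get()["ids"]
            if stored_ids:
                st.session_state.collection.delete(ids=stored_ids)
            st.session_state.documents = {}
            st.rerun()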

# Helper functions
def chunk_text(text, chunk_size=500, overlap=100):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        if i + chunk_size >= len(words):
            break
    return chunks

def extract_text(uploaded_file):
    if uploaded_file.type == "text/plain":
        return uploaded_file.read().decode("utf-8")
    elif uploaded_file.type == "application/pdf":
        pdf = PdfReader(uploaded_file)
        return "".join([page.extract_text() or "" for page in pdf.pages])
    elif uploaded_file.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        doc = Document(uploaded_file)
        return "\n".join([para.text for para in doc.paragraphs])
    return ""

if api_key:
    genai.configure(api_key=api_key)
    
    # Initialize ChromaDB
    if "collection" not in st.session_state:
        client = chromadb.Client()
        embedding_function = embedding_functions.DefaultEmbeddingFunction()
        st.session_state.collection = client.get_or_create_collection(
            name="documents",
            embedding_function=embedding_function
        )
        st.session_state.documents = {}
    
    # Document upload section
    st.subheader("📤 Upload Documents")
    
    uploaded_files = st.file_uploader(
        "Choose files",
        type=["txt", "pdf", "docx"],
        accept_multiple_files=True
    )
    
    if uploaded_files:
        for uploaded_file in uploaded_files:
            doc_id = uploaded_file.name
            
            if doc_id not in st.session_state.documents:
                with st.spinner(f"Processing {doc_id}..."):
                    text = extract_text(uploaded_file)
                    chunks = chunk_text(text, chunk_size=chunk_size)
                    
                    # Add to ChromaDB
                    for i, chunk in enumerate(chunks):
                        st.session_state.collection.add(
                            documents=[chunk],
                            ids=[f"{doc_id}_chunk_{i}"],
                            metadatas=[{"doc_id": doc_id, "chunk_index": i}]
                        )
                    
                    st.session_state.documents[doc_id] = {
                        "text": text,
                        "chunks": len(chunks)
                    }
                    
                    st.success(f"✅ {doc_id} processed ({len(chunks)} chunks)")
    
    # Display loaded documents
    if st.session_state.documents:
        st.divider()
        st.subheader("📚 Loaded Documents")
        
        for doc_id, info in st.session_state.documents.items():
            with st.expander(f"📄 {doc_id}"):
                col1, col2 = st.columns(2)
                with col1:
                    st.metric("Chunks", info["chunks"])
                with col2:
                    st.metric("Characters", len(info["text"]))
                
                st.text_area(
                    "Preview",
                    info["text"][:500] + "..." if len(info["text"]) > 500 else info["text"],
                    height=100,
                    key=f"preview_{doc_id}"
                )
        
        st.divider()
        
        # Q&A Interface
        st.subheader("💬 Ask Questions")
        
        # Initialize chat history
        if "qa_history" not in st.session_state:
            st.session_state.qa_history = []
        
        # Display chat history
        for qa in st.session_state.qa_history:
            with st.chat_message("user"):
                st.write(qa["question"])
            with st.chat_message("assistant"):
                st.write(qa["answer"])
                if qa.get("sources"):
                    st.caption(f"📎 Sources: {', '.join(set([s['doc_id'] for s in qa['sources']]))}")
        
        # Question input
        question = st.chat_input("Ask a question about your documents...")
        
        if question:
            with st.chat_message("user"):
                st.write(question)
            
            with st.chat_message("assistant"):
                with st.spinner("Searching and generating answer..."):
                    # Search for relevant chunks
                    results = st.session_state.collection.query(
                        query_texts=[question],
                        n_results=n_results
                    )
                    
                    context = "\n\n".join(results['documents'][0])
                    sources = results['metadatas'][0]
                    
                    # Generate answer
                    model = genai.GenerativeModel('gemini-3-pro')
                    prompt = f"""
                    Answer the question based on the following context from uploaded documents.
                    Be specific and cite relevant information from the context.
                    If the answer is not in the context, say so.
                    
                    Context:
                    {context}
                    
                    Question: {question}
                    
                    Answer:
                    """
                    
                    response = model.generate_content(prompt)
                    answer = response.text
                    
                    st.write(answer)
                    
                    # Show sources
                    unique_docs = set([s['doc_id'] for s in sources])
                    st.caption(f"📎 Sources: {', '.join(unique_docs)}")
                    
                    # Save to history
                    st.session_state.qa_history.append({
                        "question": question,
                        "answer": answer,
                        "sources": sources
                    })
    else:
        st.info("👆 Upload documents to start asking questions")
else:
    st.warning("Please enter your API key in the sidebar")
'''

with open('document_qa_advanced.py', 'w', encoding='utf-8') as f:
    f.write(advanced_app_code)

print("Advanced Document Q&A app saved to document_qa_advanced.py")
print("\nTo run: streamlit run document_qa_advanced.py")

## 7. Document Summarization

In [None]:
class DocumentSummarizer:
    """Summarize documents using Gemini."""
    
    def __init__(self):
        self.model = genai.GenerativeModel('gemini-3-pro')
    
    def summarize(self, text: str, style: str = "concise") -> str:
        """Summarize text in different styles."""
        style_prompts = {
            "concise": "Provide a brief, concise summary in 2-3 sentences.",
            "detailed": "Provide a detailed summary covering all main points.",
            "bullet": "Provide a summary as bullet points of key information.",
            "executive": "Provide an executive summary for business stakeholders."
        }
        
        prompt = f"""
        {style_prompts.get(style, style_prompts["concise"])}
        
        Document:
        {text}
        
        Summary:
        """
        
        response = self.model.generate_content(prompt)
        return response.text
    
    def extract_key_points(self, text: str, n_points: int = 5) -> str:
        """Extract key points from document."""
        prompt = f"""
        Extract the {n_points} most important key points from this document.
        Format as a numbered list.
        
        Document:
        {text}
        
        Key Points:
        """
        
        response = self.model.generate_content(prompt)
        return response.text

# Test summarization
long_doc = """
Machine learning is a subset of artificial intelligence that enables computers to learn 
from data without being explicitly programmed. The field has evolved significantly over 
the past decades, with deep learning emerging as a powerful technique for handling 
complex patterns in large datasets.

Supervised learning involves training models on labeled data, where the correct output 
is known. Common algorithms include linear regression, decision trees, and neural networks. 
These methods are widely used in applications like image classification, speech recognition, 
and predictive analytics.

Unsupervised learning, on the other hand, works with unlabeled data to discover hidden 
patterns. Clustering algorithms like K-means and dimensionality reduction techniques like 
PCA are popular unsupervised learning methods. These are useful for exploratory data 
analysis and feature engineering.

Reinforcement learning is a third paradigm where agents learn through trial and error, 
receiving rewards or penalties for their actions. This approach has achieved remarkable 
success in game playing, robotics, and autonomous systems.
"""

summarizer = DocumentSummarizer()

print("ORIGINAL DOCUMENT:")
print("=" * 80)
print(long_doc)

print("\n\nCONCISE SUMMARY:")
print("=" * 80)
print(summarizer.summarize(long_doc, style="concise"))

print("\n\nBULLET POINT SUMMARY:")
print("=" * 80)
print(summarizer.summarize(long_doc, style="bullet"))

print("\n\nKEY POINTS:")
print("=" * 80)
print(summarizer.extract_key_points(long_doc, n_points=3))

## 8. Document Comparison Tool

In [None]:
class DocumentComparer:
    """Compare multiple documents."""
    
    def __init__(self):
        self.model = genai.GenerativeModel('gemini-3-pro')
    
    def compare(self, doc1: str, doc2: str) -> str:
        """Compare two documents."""
        prompt = f"""
        Compare these two documents and provide:
        1. Main similarities
        2. Key differences
        3. Unique points in each document
        
        Document 1:
        {doc1}
        
        Document 2:
        {doc2}
        
        Comparison:
        """
        
        response = self.model.generate_content(prompt)
        return response.text

# Test comparison
doc1 = "Python is a high-level programming language known for its simplicity and readability. It's widely used in web development, data science, and automation."
doc2 = "JavaScript is a programming language primarily used for web development. It runs in browsers and enables interactive web pages. Node.js allows JavaScript to run on servers."

comparer = DocumentComparer()
comparison = comparer.compare(doc1, doc2)

print("DOCUMENT COMPARISON:")
print("=" * 80)
display(Markdown(comparison))

## 9. Running the Streamlit Apps

### From Jupyter/Colab:

```python
# Launch Streamlit in the background with subprocess (local Jupyter)
import subprocess
subprocess.Popen(["streamlit", "run", "document_qa_basic.py"])
```

### From Terminal:

```bash
# Basic version
streamlit run document_qa_basic.py

# Advanced version
streamlit run document_qa_advanced.py
```

### From Colab with ngrok:

```python
!pip install pyngrok
from pyngrok import ngrok

# Set auth token
ngrok.set_auth_token("YOUR_TOKEN")

# Start streamlit
!streamlit run document_qa_advanced.py --server.port 8501 &

# Create tunnel
public_url = ngrok.connect(8501)
print(public_url)
```

## 10. Best Practices for Document Q&A Systems

### Performance Optimization:

1. **Chunking Strategy**: Balance chunk size for context vs. precision
2. **Overlap**: Use overlapping chunks to preserve context
3. **Embeddings**: Cache embeddings to avoid recomputation
4. **Batch Processing**: Process multiple documents in parallel
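
As a rough illustration of the embedding-cache point above, the sketch below keys an in-memory cache on a hash of the chunk text so that identical chunks are embedded only once. `EmbeddingCache` and `toy_embed` are hypothetical names, and `toy_embed` is a placeholder that stands in for a real embedding model.

```python
import hashlib
from typing import Callable, Dict, List

class EmbeddingCache:
    """In-memory cache keyed by a hash of the chunk text (illustrative only)."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self.embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # how many times the embedding function was actually called

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self.embed_fn(text)
        return self._store[key]

def toy_embed(text: str) -> List[float]:
    # Placeholder, NOT a real embedding: character count and word count
    return [float(len(text)), float(len(text.split()))]

cache = EmbeddingCache(toy_embed)
cache.get("Python is readable")
cache.get("Python is readable")  # served from the cache
print(cache.misses)  # 1
```

A persistent store (a file or database keyed the same way) would be needed for the cache to survive restarts; that part is left out here.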

### Accuracy Improvements:

1. **Metadata**: Add document metadata for better filtering
2. **Reranking**: Rerank retrieved chunks before answering
3. **Citation**: Always cite sources in answers
4. **Validation**: Verify answers against source material
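
To make the reranking step concrete, here is a minimal sketch that reorders already-retrieved chunks by how many distinct question words they contain. This lexical overlap score is a simple stand-in chosen for illustration; production systems typically use a dedicated reranking model. The function name `rerank_by_overlap` is hypothetical.

```python
import re
from typing import List

def rerank_by_overlap(question: str, chunks: List[str], top_k: int = 2) -> List[str]:
    """Order chunks by the number of distinct question words they contain."""
    q_words = set(re.findall(r"\w+", question.lower()))

    def score(chunk: str) -> int:
        return len(q_words & set(re.findall(r"\w+", chunk.lower())))

    # sorted() is stable, so ties keep their original retrieval order
    return sorted(chunks, key=score, reverse=True)[:top_k]

retrieved = [
    "Python uses indentation for code blocks.",
    "Guido van Rossum created Python in 1991.",
    "Django and Flask are web frameworks.",
]
print(rerank_by_overlap("Who created Python?", retrieved, top_k=1))
# ['Guido van Rossum created Python in 1991.']
```

In the RAG classes above, a function like this could sit between the `collection.query` call and prompt construction, retrieving more chunks than needed and keeping only the best few.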

### User Experience:

1. **Preview**: Show document previews before processing
2. **Progress**: Display processing progress for large files
3. **Suggestions**: Offer sample questions
4. **Export**: Allow exporting Q&A history
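
One possible way to export the Q&A history is to render it as Markdown. The sketch below assumes history entries shaped like those stored in `st.session_state.qa_history` in the advanced app (`question`, `answer`, and a `sources` list of metadata dicts); `history_to_markdown` is a hypothetical helper name.

```python
from typing import Dict, List

def history_to_markdown(qa_history: List[Dict]) -> str:
    """Render Q&A entries as a Markdown document, one section per question."""
    lines = ["# Q&A History", ""]
    for i, qa in enumerate(qa_history, 1):
        lines.append(f"## {i}. {qa['question']}")
        lines.append("")
        lines.append(qa["answer"])
        # De-duplicate and sort document names so the output is stable
        doc_ids = sorted({s["doc_id"] for s in qa.get("sources", [])})
        if doc_ids:
            lines.append("")
            lines.append(f"Sources: {', '.join(doc_ids)}")
        lines.append("")
    return "\n".join(lines)

sample_history = [{
    "question": "Who created Python?",
    "answer": "Guido van Rossum.",
    "sources": [
        {"doc_id": "b.txt", "chunk_index": 0},
        {"doc_id": "a.txt", "chunk_index": 1},
        {"doc_id": "a.txt", "chunk_index": 2},
    ],
}]
print(history_to_markdown(sample_history))
```

In a Streamlit app, the returned string could be passed as the `data` argument of `st.download_button` to offer it as a file.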

### Production Considerations:

1. **File Validation**: Validate file types and sizes
2. **Error Handling**: Handle corrupt or unsupported files
3. **Rate Limiting**: Implement API rate limiting
4. **Persistence**: Store embeddings in persistent vector DB
5. **Security**: Sanitize inputs and manage access control
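
The file-validation item can be sketched as a small check that runs before any text extraction. The allowed extensions mirror the upload widgets above; the 10 MB limit and the name `validate_upload` are assumptions made for this example.

```python
import os
from typing import Optional

ALLOWED_EXTENSIONS = {".txt", ".pdf", ".docx"}
MAX_FILE_BYTES = 10 * 1024 * 1024  # assumed limit of 10 MB

def validate_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return an error message, or None if the file looks acceptable."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return f"Unsupported file type: {ext or '(none)'}"
    if size_bytes == 0:
        return "File is empty"
    if size_bytes > MAX_FILE_BYTES:
        return "File exceeds the size limit"
    return None

print(validate_upload("report.PDF", 2048))   # None
print(validate_upload("script.exe", 2048))   # Unsupported file type: .exe
```

With Streamlit, the `name` and `size` attributes of an uploaded file could be passed in directly. An extension check alone does not prove the content is what it claims to be, so extraction errors still need handling.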

## Next Steps

Expand your document Q&A capabilities:
- Add support for more file formats (CSV, Excel, Markdown)
- Implement document clustering and categorization
- Build custom document collections
- Add export to various formats
- Integrate with cloud storage (Google Drive, Dropbox)
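
As a starting point for the first item, a CSV extractor could follow the same shape as the `extract_text_from_*` helpers in section 2: accept a binary file object and return plain text. The sketch below is one possible approach, flattening each row into a `header: value` line so that chunks stay readable; `extract_text_from_csv` is a hypothetical name, and UTF-8 encoding with a header row is assumed.

```python
import csv
import io

def extract_text_from_csv(file) -> str:
    """Flatten a CSV file into one 'header: value; ...' line per row."""
    content = file.read().decode("utf-8")  # assumes UTF-8 input
    reader = csv.DictReader(io.StringIO(content))  # first row is treated as headers
    rows = []
    for row in reader:
        rows.append("; ".join(f"{key}: {value}" for key, value in row.items()))
    return "\n".join(rows)

sample_csv = io.BytesIO(b"name,year\nPython,1991\nJavaScript,1995\n")
print(extract_text_from_csv(sample_csv))
# name: Python; year: 1991
# name: JavaScript; year: 1995
```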

---

## Learn More

Build powerful document processing systems with the **[Gen AI Crash Course](https://www.buildfastwithai.com/genai-course)** by Build Fast with AI!

**Created by [Build Fast with AI](https://www.buildfastwithai.com)**