# Complete Offline RAG System with Ollama

## Overview

This notebook implements a **complete Retrieval-Augmented Generation (RAG) system** that runs entirely offline once the models listed under Prerequisites have been downloaded.

### What is RAG?
- **Retrieval**: Finding the passages in your documents that are most relevant to the question
- **Augmented**: Adding those passages to the prompt so the model answers from them, not only from its training data
- **Generation**: Having the LLM write an answer grounded in the retrieved passages
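
The three stages can be sketched in a few lines. This is a toy illustration only: it scores passages by word overlap instead of embeddings and stops at building the prompt, where a real system would call an LLM. The names `tokenize`, `retrieve`, and `build_prompt` are hypothetical and not part of this notebook.

```python
import re
from typing import List


def tokenize(text: str) -> set:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: List[str], top_k: int = 1) -> List[str]:
    """Retrieval: rank passages by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: List[str]) -> str:
    """Augmentation: place the retrieved passages in front of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"CONTEXT:\n{joined}\n\nQUESTION: {question}\nANSWER:"


passages = [
    "FAISS stores vectors and finds nearest neighbours.",
    "Ollama runs language models on the local machine.",
]
question = "Which tool runs language models locally?"
context = retrieve(question, passages)
prompt = build_prompt(question, context)  # Generation would send this to the LLM
print(prompt)
```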

### Prerequisites
1. Install Ollama: https://ollama.com/download
2. Run: `ollama pull llama3.2`
3. Run: `ollama pull nomic-embed-text`
4. Create a `documents` folder and add your PDF/Markdown/HTML files

## Step 1: Install Required Libraries

### Library Choices Explained:

| Library | Why This One? | Alternatives |
|---------|---------------|-------------|
| **faiss-cpu** | Fast, mature vector search that works offline | ChromaDB (heavier), LanceDB (newer) |
| **numpy** | Standard array library, required by FAISS | No real alternative |
| **PyPDF2** | Pure Python, simple API (development has since continued under the name `pypdf`) | pdfplumber (slower), PyMuPDF (C dependencies) |
| **beautifulsoup4** | Robust HTML parsing that tolerates malformed markup | lxml used directly (harder install), the standard library's `html.parser` on its own (basic) |
| **markdown** | Clean MD to HTML conversion, from which plain text is extracted | mistune (overkill), regex (error-prone) |

In [1]:
# Install required packages 
!pip install --upgrade pip setuptools wheel
!pip install faiss-cpu numpy PyPDF2==3.0.1 beautifulsoup4==4.12.2 markdown==3.4.4



## Step 2: Import Libraries

In [2]:
import os
import json
import subprocess
import re
from pathlib import Path
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass

#for document processing
import PyPDF2
from bs4 import BeautifulSoup
import markdown

#for vector operations
import numpy as np
import faiss

print("all libraries successfully imported")

all libraries successfully imported


## Step 3: Data Structures

### Why Dataclasses?
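- Type hints document each field and let static checkers (e.g. mypy) catch errors early; they are not enforced at runtime
- Self-documenting
- Less boilerplate than hand-written classes (`__init__`, `__repr__`, and `__eq__` are generated)
- Clearer than dicts for structured data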
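
A small sketch of what `@dataclass` generates and what it does not. The `Point` class is a hypothetical example, not part of this notebook; it shows that field annotations serve readers and static checkers, and are not enforced when the code runs.

```python
from dataclasses import dataclass, asdict


@dataclass
class Point:
    """Hypothetical example class with two annotated fields."""
    x: int
    y: int


a = Point(1, 2)
b = Point(1, 2)

print(a)            # uses the generated __repr__
print(a == b)       # generated __eq__ compares field by field
print(asdict(a))    # convert to a plain dict, e.g. for JSON

# Annotations are not checked at runtime: this does not raise.
c = Point("not", "ints")
print(c)
```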

In [3]:
@dataclass
class Chunk:
    """Text chunk with metadata and embedding."""
    id: str
    text: str
    vector: Optional[np.ndarray]
    metadata: Dict

## Step 4: Document Loading

### Design Decisions:
1. **Static methods**: No instance state needed
2. **Separate per format**: Easier to extend
3. **Error handling**: Continues on failure
4. **Page-level tracking**: Enables precise citations
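
One way to see why per-format loaders are easy to extend is a dispatch table keyed by file extension. This is a hypothetical variation, not the structure used in the cell below (which uses an `if`/`elif` chain); `load_txt`, `LOADERS`, and `load_any` are illustrative names.

```python
from pathlib import Path
from typing import Callable, Dict, List


def load_txt(file_path: str) -> List[Dict]:
    """Hypothetical loader: return a plain-text file as a single section."""
    text = Path(file_path).read_text(encoding="utf-8")
    return [{
        "text": text,
        "metadata": {"source": Path(file_path).name, "page": 1, "type": "txt"},
    }]


# Map each extension to the function that knows how to read it.
# Supporting a new format means adding one entry here.
LOADERS: Dict[str, Callable[[str], List[Dict]]] = {
    ".txt": load_txt,
}


def load_any(file_path: str) -> List[Dict]:
    """Look up a loader by extension; unsupported files yield no sections."""
    loader = LOADERS.get(Path(file_path).suffix.lower())
    if loader is None:
        return []
    try:
        return loader(file_path)
    except Exception as e:  # keep going if one file is unreadable
        print(f"error loading {file_path}: {e}")
        return []
```

Because the lookup lowercases the suffix, `note.TXT` and `note.txt` are handled the same way, matching the `ext = file_path.suffix.lower()` step in `load_documents`.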

In [4]:
class DocumentLoader:
    """load PDF, Markdown, and HTML documents."""
    
    @staticmethod
    def load_pdf(file_path: str) -> List[Dict]:
        """extract text from PDF, page by page for citations."""
        chunks = []
        try:
            with open(file_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file, strict=False)
                print(f"Processing {os.path.basename(file_path)}: {len(pdf_reader.pages)} pages")

                for page_num, page in enumerate(pdf_reader.pages):
                    try:
                        text = page.extract_text()
                        if text and text.strip():
                            chunks.append({
                                'text': text,
                                'metadata': {
                                    'source': os.path.basename(file_path),
                                    'page': page_num + 1,
                                    'type': 'pdf'
                                }
                            })
                            print(f"  Page {page_num + 1}: extracted {len(text)} characters")
                        else:
                            print(f"  Page {page_num + 1}: no text found")
                    except Exception as e:
                        print(f"  Page {page_num + 1}: error - {e}")
                        continue

                print(f"Successfully extracted {len(chunks)} pages from {os.path.basename(file_path)}")
        except Exception as e:
            print(f"error loading PDF {file_path}: {e}")
        return chunks
    
    @staticmethod
    def load_markdown(file_path: str) -> List[Dict]:
        """convert Markdown to text via HTML."""
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                md_content = file.read()
                html = markdown.markdown(md_content)
                soup = BeautifulSoup(html, 'html.parser')
                text = soup.get_text()
                
                return [{
                    'text': text,
                    'metadata': {
                        'source': os.path.basename(file_path),
                        'page': 1,
                        'type': 'markdown'
                    }
                }]
        except Exception as e:
            print(f"error loading Markdown {file_path}: {e}")
            return []
    
    @staticmethod
    def load_html(file_path: str) -> List[Dict]:
        """extract text from HTML, removing scripts and styles."""
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                soup = BeautifulSoup(file.read(), 'html.parser')
                for script in soup(["script", "style"]):
                    script.decompose()
                text = soup.get_text()
                
                return [{
                    'text': text,
                    'metadata': {
                        'source': os.path.basename(file_path),
                        'page': 1,
                        'type': 'html'
                    }
                }]
        except Exception as e:
            print(f"error loading HTML {file_path}: {e}")
            return []
    
    @staticmethod
    def load_documents(directory: str) -> List[Dict]:
        """load all supported documents from a directory."""
        documents = []
        doc_dir = Path(directory)
        
        if not doc_dir.exists():
            print(f"Creating {directory}...")
            doc_dir.mkdir(parents=True)
            print(f"add documents to {directory} and run again.")
            return documents
        
        for file_path in doc_dir.rglob('*'):
            if file_path.is_file():
                ext = file_path.suffix.lower()
                
                if ext == '.pdf':
                    documents.extend(DocumentLoader.load_pdf(str(file_path)))
                elif ext in ['.md', '.markdown']:
                    documents.extend(DocumentLoader.load_markdown(str(file_path)))
                elif ext in ['.html', '.htm']:
                    documents.extend(DocumentLoader.load_html(str(file_path)))
        
        print(f"loaded {len(documents)} document sections")
        return documents

print("document loader ready!")

document loader ready!


## Step 5: Text Chunking

### Why Chunking?
- **Problem**: Whole documents exceed the embedding model's token limit, and a single vector for a long document blurs its topics together
- **Solution**: Split into smaller, overlapping pieces and embed each one

### Why Overlap?
- Prevents information loss at boundaries
- Example: "...safety protocol. First, wear PPE..." 
  - Without overlap: "First, wear PPE" loses context
  - With overlap: Previous chunk includes "safety protocol"
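
The effect of overlap can be seen with a stripped-down, fixed-width splitter. This is a simplified sketch, not the `TextChunker` defined below: it ignores sentence boundaries, and `sliding_chunks` is a hypothetical name.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed-width windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "Follow the safety protocol. First, wear PPE. Then check the valves."
for piece in sliding_chunks(text, chunk_size=30, overlap=10):
    print(repr(piece))
```

Each printed window starts with the last 10 characters of the previous one, so a phrase cut at one boundary still appears whole in a neighbouring chunk.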

### Chunk Size Choice (750 chars):
- **Too small** (200): Loses context
- **Too large** (2000): Less precise retrieval
- **750**: Sweet spot for most documents (roughly 150-200 tokens of English text)
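
A rough way to reason about these numbers: with a fixed window, each new chunk advances by `chunk_size - overlap` characters, so the chunk count for a text can be estimated in advance. The sketch below is an approximation under that assumption; the chunker in this notebook also cuts at sentence boundaries, which can only shorten chunks and so may produce more. The 4-characters-per-token figure is a common rule of thumb for English, not a property of any specific tokenizer, and both function names are hypothetical.

```python
import math


def estimate_chunks(text_len: int, chunk_size: int = 750, overlap: int = 100) -> int:
    """Estimate how many fixed-width overlapping chunks cover `text_len` chars."""
    if text_len <= 0:
        return 0
    step = chunk_size - overlap
    return max(1, math.ceil((text_len - overlap) / step))


def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; assumes about 4 characters per token."""
    return round(num_chars / chars_per_token)


print(estimate_chunks(4000))   # chunks needed for a 4000-character page
print(estimate_tokens(750))    # approximate tokens in one 750-character chunk
```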

### Sentence Boundary Detection:
- Breaks after `.`, `!`, or `?` found in the last 20% of the chunk, not mid-sentence
- Better comprehension by LLM

In [5]:
class TextChunker:
    """text chunking with overlap and sentence boundaries."""
    
    @staticmethod
    def clean_text(text: str) -> str:
        """normalize whitespace and remove special characters."""
        text = re.sub(r'\s+', ' ', text)  #collapse whitespace runs (incl. newlines) to one space
        text = re.sub(r'[^\w\s\.\,\!\?\-\:\;]', '', text)  #drop all but word chars, whitespace, and . , ! ? - : ;
        return text.strip()
    
    @staticmethod
    def chunk_text(
        text: str,
        chunk_size: int = 750,
        overlap: int = 100,
        metadata: Optional[Dict] = None
    ) -> List[Chunk]:
        """split text into overlapping chunks at sentence boundaries.
        
        Args:
            text: raw text to clean and split
            chunk_size: target size in characters (roughly 150-200 tokens)
            overlap: number of characters shared between consecutive chunks
            metadata: source info for citations
        """
        text = TextChunker.clean_text(text)
        chunks = []
        
        if not text:
            return chunks
        
        start = 0
        chunk_index = 0
        
        while start < len(text):
            end = start + chunk_size
            
            #break at sentence boundary(last 20% of chunk)
            if end < len(text):
                search_start = end - int(chunk_size * 0.2)
                sentence_end = max(
                    text.rfind('.', search_start, end),
                    text.rfind('!', search_start, end),
                    text.rfind('?', search_start, end)
                )
                
                if sentence_end != -1 and sentence_end > start:
                    end = sentence_end + 1
            
            chunk_text = text[start:end].strip()
            
            if chunk_text:
                chunk_metadata = metadata.copy() if metadata else {}
                chunk_metadata['chunk_index'] = chunk_index
                chunk_id = f"{chunk_metadata.get('source', 'unknown')}_{chunk_index}"
                
                chunks.append(Chunk(
                    id=chunk_id,
                    text=chunk_text,
                    vector=None,
                    metadata=chunk_metadata
                ))
                
                chunk_index += 1
            
            start = end - overlap  #move with overlap
            
            if start >= len(text) - overlap:
                break
        
        return chunks

print("text chunker ready!")

text chunker ready!


## Step 6: Ollama Embedder

### What are Embeddings?
- Convert text → vector of numbers (768 dimensions)
- Similar text → similar vectors
- Enables semantic search
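
To make "similar text, similar vectors" concrete, the sketch below compares hand-made 3-dimensional vectors with cosine similarity. The vectors are invented for illustration; real embeddings from `nomic-embed-text` have 768 dimensions and come from the model, not from hand-picked numbers.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embeddings of three short texts.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.2, 0.9]

print(cosine_similarity(cat, kitten))   # related texts: close to 1
print(cosine_similarity(cat, invoice))  # unrelated texts: close to 0
```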

### Why Ollama?
- **Runs locally**: No API keys, no cloud
- **Free**: No usage costs
- **Private**: Data never leaves your machine

### Why nomic-embed-text?
- **Size**: 274MB (lightweight)
- **Quality**: Good accuracy for general text
- **Dimensions**: 768 (standard size)
- **Alternative**: all-MiniLM-L6-v2 (384 dims, smaller but less accurate)

### HTTP API vs CLI:
- **HTTP**: Direct, faster, better for production
- **CLI**: Subprocess overhead
- **Choice**: HTTP API for efficiency
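
As a sketch of what the HTTP route involves, the helpers below build the JSON request for Ollama's `/api/embeddings` endpoint (the one used by the cell that follows) and parse a reply. `build_embedding_request` and `parse_embedding_response` are hypothetical helper names; the sample reply is a made-up stand-in, and no server is contacted.

```python
import json
from typing import List, Tuple


def build_embedding_request(model: str, text: str) -> Tuple[str, str, dict]:
    """Return (path, body, headers) for a POST to Ollama's embeddings endpoint."""
    body = json.dumps({"model": model, "prompt": text})
    headers = {"Content-Type": "application/json"}
    return "/api/embeddings", body, headers


def parse_embedding_response(raw: str) -> List[float]:
    """Extract the vector from a response body; raise if it is missing."""
    data = json.loads(raw)
    if "embedding" not in data:
        raise ValueError(f"unexpected response: {data}")
    return [float(x) for x in data["embedding"]]


path, body, headers = build_embedding_request("nomic-embed-text", "hello world")
print(path, body)

# Stand-in for a server reply (shortened; a real vector has 768 entries).
sample_reply = '{"embedding": [0.1, -0.2, 0.3]}'
print(parse_embedding_response(sample_reply))
```

With `http.client`, these values would be passed as `conn.request("POST", path, body, headers)` on a connection to `localhost:11434`.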

In [None]:
class OllamaEmbedder:
    """Generate embeddings using Ollama's embedding model."""
    
    def __init__(self, model_name: str = "nomic-embed-text"):
        self.model_name = model_name
        self._verify_model()
    
    def _verify_model(self):
        """Check if model is available locally (no automatic download)."""
        try:
            result = subprocess.run(
                ['ollama', 'list'],
                capture_output=True,
                text=True,
                check=True
            )
            if self.model_name not in result.stdout:
                raise RuntimeError(
                    f"Model '{self.model_name}' not found locally.\n"
                    f"Please download it first using:\n"
                    f"  ollama pull {self.model_name}\n"
                    f"This is a one-time setup step that requires internet connection."
                )
            print(f"Found embedding model: {self.model_name}")
        except subprocess.CalledProcessError as e:
            raise RuntimeError(
                f"Cannot connect to Ollama service.\n"
                f"Please ensure Ollama is installed and running.\n"
                f"Error: {e}"
            )
        except FileNotFoundError:
            raise RuntimeError(
                "Ollama not found on your system.\n"
                "Please install Ollama from: https://ollama.com/download\n"
                "This is a one-time setup step."
            )
    
    def embed_text(self, text: str) -> np.ndarray:
        """Generate embedding vector for text using HTTP API."""
        import http.client

        conn = http.client.HTTPConnection("localhost", 11434, timeout=30)
        try:
            headers = {'Content-Type': 'application/json'}
            
            payload = json.dumps({
                "model": self.model_name,
                "prompt": text
            })
            
            conn.request("POST", "/api/embeddings", payload, headers)
            response = conn.getresponse()
            body = response.read().decode()
            if response.status != 200:
                raise RuntimeError(f"HTTP {response.status}: {body}")
            data = json.loads(body)
            
            return np.array(data['embedding'], dtype=np.float32)
            
        except Exception as e:
            print(f"Embedding error: {e}")
            # Fallback: a zero vector keeps the pipeline running, but it
            # carries no meaning and will never be a useful search hit.
            return np.zeros(768, dtype=np.float32)
        finally:
            conn.close()  # always release the connection
    
    def embed_chunks(self, chunks: List[Chunk]) -> List[Chunk]:
        """generate embeddings for all chunks with progress."""
        print(f"Generating embeddings for {len(chunks)} chunks...")
        
        for i, chunk in enumerate(chunks):
            if i % 10 == 0 and i > 0:
                print(f"  Progress: {i}/{len(chunks)}")
            chunk.vector = self.embed_text(chunk.text)
        
        print("embeddings complete!")
        return chunks

print("embedder ready!")

embedder ready!


## Step 7: Vector Database

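### Why FAISS?
- **Speed**: Highly optimized similarity search (C++ core with Python bindings)
- **Offline**: No external services
- **Simple**: A flat index (`IndexFlatIP` here) = exact search, no tuning
- **Mature**: Developed at Meta (Facebook AI Research) and used in production

### FAISS Index Types:
- **IndexFlatL2 / IndexFlatIP**: Exact search by L2 distance / inner product, suited to small datasets (<1M vectors)
- **IndexIVFFlat**: Approximate, faster for large datasets
- **IndexHNSWFlat**: Graph-based, very fast
- **Choice**: `IndexFlatIP` for simplicity and accuracy

### Similarity Metric:
- The code L2-normalizes every vector and searches by inner product, which then equals cosine similarity
- `search` reports cosine distance = 1 - cosine similarity: 0 = same direction, 2 = opposite
- Lower = more similar
- Alternative: Euclidean (L2) distance with `IndexFlatL2`; on normalized vectors it produces the same ranking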
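
For unit-length vectors the two metrics are tied together by the identity `||a - b||^2 = 2 - 2*cos(a, b)`, so ranking by L2 distance and ranking by cosine similarity give the same order. A small pure-Python check of that identity, using made-up vectors (the helper names are illustrative):

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length (what faiss.normalize_L2 does per row)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cos_sim = dot(a, b)            # inner product of unit vectors = cosine similarity
print(squared_l2(a, b))        # squared Euclidean distance
print(2 - 2 * cos_sim)         # same value, up to floating-point error
print(1 - cos_sim)             # the distance a "1 - similarity" conversion yields
```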

### Persistence:
- Save index + metadata separately
- Vectors in binary (fast)
- Metadata in JSON (readable)

In [7]:
class VectorDatabase:
    """FAISS-based vector storage and retrieval with Cosine Similarity."""
    
    def __init__(self, dimension: int = 768):
        self.dimension = dimension
        #use IndexFlatIP (inner product); on L2-normalized vectors this is cosine similarity
        self.index = faiss.IndexFlatIP(dimension) 
        self.chunks: List[Chunk] = []
    
    def add_chunks(self, chunks: List[Chunk]):
        """add chunk embeddings to the index."""
        vectors = np.array([chunk.vector for chunk in chunks], dtype=np.float32)
        
        #normalize vectors for cosine similarity
        faiss.normalize_L2(vectors)
        
        self.index.add(vectors)
        self.chunks.extend(chunks)
        print(f"Added {len(chunks)} chunks (total: {len(self.chunks)})")
    
    def search(self, query_vector: np.ndarray, top_k: int = 5) -> List[Tuple[Chunk, float]]:
        """find top-k most similar chunks using cosine similarity.
        
        Returns:
            List of (chunk, distance) tuples
            Distance is (1 - cosine_similarity), so lower = more similar
        """
        query_vector = query_vector.reshape(1, -1).astype(np.float32)
        
        #normalize query vector for cosine similarity
        faiss.normalize_L2(query_vector)
        
        #search (returns similarity scores, not distances)
        similarities, indices = self.index.search(query_vector, top_k)
        
        results = []
        for idx, similarity in zip(indices[0], similarities[0]):
            #FAISS pads results with -1 when the index holds fewer than top_k vectors
            if 0 <= idx < len(self.chunks):
                #convert similarity to distance: distance = 1 - similarity
                distance = 1 - similarity
                results.append((self.chunks[idx], float(distance)))
        
        return results
    
    def save(self, directory: str):
        """persist database to disk."""
        os.makedirs(directory, exist_ok=True)
        
        #save FAISS index
        faiss.write_index(self.index, os.path.join(directory, 'faiss.index'))
        
        #save chunks metadata (JSON)
        chunks_data = [{'id': chunk.id,
            'text': chunk.text,
            'metadata': chunk.metadata
        } for chunk in self.chunks]
        
        with open(os.path.join(directory, 'chunks.json'), 'w', encoding='utf-8') as f:
            json.dump(chunks_data, f, indent=2)
        
        print(f"database saved to {directory}")
    
    def load(self, directory: str, embedder) -> bool:
        """load database from disk."""
        index_path = os.path.join(directory, 'faiss.index')
        chunks_path = os.path.join(directory, 'chunks.json')
        
        if not os.path.exists(index_path) or not os.path.exists(chunks_path):
            print(f"no database found in {directory}")
            return False
        
        #load FAISS index
        self.index = faiss.read_index(index_path)
        
        #load chunks
        with open(chunks_path, 'r', encoding='utf-8') as f:
            chunks_data = json.load(f)
        
        #reconstruct chunks; each vector is re-embedded, which can be slow for large collections
        print("reconstructing chunk vectors...")
        self.chunks = []
        for data in chunks_data:
            chunk = Chunk(
                id=data['id'],
                text=data['text'],
                vector=embedder.embed_text(data['text']),
                metadata=data['metadata']
            )
            self.chunks.append(chunk)
        
        print(f"database loaded: {len(self.chunks)} chunks")
        return True

print("vector database ready!")

vector database ready!


## Step 8: LLM Interface

### Why Llama 3.2?
- **Quality**: Strong open-weight model for its size
- **Size**: About 2GB for the default `llama3.2` (3B) tag, manageable on consumer hardware
- **Offline**: Runs locally
- **Fast**: Decent speed on CPU

### Temperature Parameter:
- **0.0**: Near-deterministic (always picks the most likely token)
- **0.3**: Slightly creative (good for QA)
- **0.7**: More creative (good for writing)
- **1.0**: Very creative (more likely to drift from the context)
- **Choice**: 0.3 balances accuracy and naturalness
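
Temperature works by dividing the model's raw token scores (logits) by T before the softmax, which sharpens the distribution for small T and flattens it for large T. The sketch below uses three invented logits to show the effect; it illustrates the general mechanism and is not Ollama's internal code.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (0 is usually treated as greedy argmax)")
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.3, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lowest temperature nearly all probability mass sits on the top-scoring token; as T rises, the alternatives become more likely to be sampled.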


In [None]:
class OllamaLLM:
    """LLM interface using Ollama CLI (more reliable on CPU)."""
    
    def __init__(self, model_name: str = "llama3.2"):
        self.model_name = model_name
        self._verify_model()
    
    def _verify_model(self):
        """Check if model is available locally."""
        try:
            result = subprocess.run(
                ['ollama', 'list'],
                capture_output=True,
                text=True,
                check=True
            )
            if self.model_name not in result.stdout:
                raise RuntimeError(
                    f"Model '{self.model_name}' not found locally.\n"
                    f"Please download it first using:\n"
                    f"  ollama pull {self.model_name}\n"
                    f"This is a one-time setup step that requires internet connection."
                )
            print(f"Found LLM model: {self.model_name}")
        except subprocess.CalledProcessError as e:
            raise RuntimeError(
                f"Cannot connect to Ollama service.\n"
                f"Please ensure Ollama is installed and running.\n"
                f"Error: {e}"
            )
        except FileNotFoundError:
            raise RuntimeError(
                "Ollama not found on your system.\n"
                "Please install Ollama from: https://ollama.com/download\n"
                "This is a one-time setup step."
            )
    
    def generate(self, prompt: str, temperature: float = 0.3) -> str:
        """generate response using Ollama CLI (more reliable on CPU).

        Args:
            prompt: Complete prompt with context and question
            temperature: Creativity (0.0=deterministic, 1.0=creative).
                Note: not forwarded to `ollama run`, so the model's
                default sampling settings apply.
        """
        try:
            print(f"  Generating with {self.model_name} ...")
            
            # Use subprocess with CLI - more reliable than HTTP on CPU
            result = subprocess.run(
                ['ollama', 'run', self.model_name],
                input=prompt,
                capture_output=True,
                text=True,
                timeout=300,  # 5 minutes timeout
                encoding='utf-8'
            )
            
            if result.returncode != 0:
                error_msg = result.stderr or "Unknown error"
                print(f" Ollama error: {error_msg}")
                return f"Error: {error_msg}"
            
            answer = result.stdout.strip()
            
            if not answer:
                print(f" Empty response")
                return "Error: Empty response from LLM"
            
            print(f" Generated {len(answer)} characters")
            return answer
            
        except subprocess.TimeoutExpired:
            print(f"  Timeout after 5 minutes")
            return "Error: Generation timed out. Try a simpler question or smaller context."
        except Exception as e:
            error_msg = f"Error: {str(e)}"
            print(f"  {error_msg}")
            return error_msg

print("LLM interface ready!")

LLM interface ready!
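
The `generate` method above accepts `temperature`, but the `ollama run` command it invokes is not given that value. One way to apply it would be Ollama's HTTP endpoint `/api/generate`, which takes sampling settings under `options`. The sketch below only builds and parses the JSON; `build_generate_request` and `parse_generate_response` are hypothetical helper names, the sample reply is a made-up stand-in, and no server is contacted.

```python
import json
from typing import Tuple


def build_generate_request(model: str, prompt: str, temperature: float = 0.3) -> Tuple[str, str]:
    """Return (path, body) for a non-streaming POST to /api/generate."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,                          # one JSON object instead of a stream
        "options": {"temperature": temperature},  # sampling settings go here
    })
    return "/api/generate", body


def parse_generate_response(raw: str) -> str:
    """Return the generated text from a non-streaming reply."""
    data = json.loads(raw)
    if "response" not in data:
        raise ValueError(f"unexpected response: {data}")
    return data["response"].strip()


path, body = build_generate_request("llama3.2", "Say hello.", temperature=0.3)
print(path, body)

# Stand-in for a server reply; a real one carries additional fields.
sample_reply = '{"model": "llama3.2", "response": " Hello! ", "done": true}'
print(parse_generate_response(sample_reply))
```

Generation on CPU can take minutes, so a real call would need a generous timeout, similar to the 300 seconds used by the CLI version.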


## Step 9: Complete RAG System

### System Architecture:
1. **Ingest**: Load docs → Chunk → Embed → Store
2. **Query**: Question → Embed → Search → Retrieve chunks
3. **Generate**: Build prompt with context → LLM → Answer

### Distance Threshold:
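- Filters out irrelevant chunks
- Distances are cosine distances (1 - cosine similarity), ranging from 0 (same direction) to 2 (opposite)
- **1.5** (the default): Very lenient on this scale; the example in Step 11 uses 0.6
- Lower = stricter (may miss relevant info)
- Higher = lenient (may include noise)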
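
How the threshold acts on search results can be sketched with plain tuples. The distances below are the five values printed in the Step 11 sample output; `filter_by_distance` is a hypothetical helper that mirrors the list comprehension in `query`.

```python
from typing import List, Tuple


def filter_by_distance(results: List[Tuple[str, float]], threshold: float) -> List[Tuple[str, float]]:
    """Keep only results whose distance is strictly below the threshold."""
    return [(label, dist) for label, dist in results if dist < threshold]


results = [
    ("flora.pdf p9", 0.2233),
    ("flora.pdf p9", 0.2614),
    ("flora.pdf p1", 0.2660),
    ("flora.pdf p1", 0.2805),
    ("flora.pdf p10", 0.2863),
]

# Print how many chunks survive each threshold.
for threshold in (0.25, 0.27, 0.6):
    kept = filter_by_distance(results, threshold)
    print(threshold, len(kept))
```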

### Prompt Engineering:
- Clear instructions: "Answer only from context"
- Structured format: CONTEXT → QUESTION → INSTRUCTIONS
- Citation requirement: Forces source attribution
- Safety: Refuses to guess without context

In [18]:
class RAGSystem:
    """complete RAG orchestration."""
    
    def __init__(
        self,
        documents_dir: str = "documents",
        db_dir: str = "vector_db",
        llm_model: str = "llama3.2",
        embedding_model: str = "nomic-embed-text"
    ):
        self.documents_dir = documents_dir
        self.db_dir = db_dir
        
        print("initializing RAG System...")
        self.embedder = OllamaEmbedder(embedding_model)
        self.llm = OllamaLLM(llm_model)
        self.vector_db = VectorDatabase()
        print("RAG System initialized!")
    
    def ingest_documents(
        self,
        chunk_size: int = 750,
        overlap: int = 100,
        force_rebuild: bool = False
    ):
        """Build or load vector database."""
        
        #try loading existing database
        if not force_rebuild and os.path.exists(self.db_dir):
            print("loading existing database...")
            if self.vector_db.load(self.db_dir, self.embedder):
                return
        
        print("Building new database...")
        
        #load documents
        documents = DocumentLoader.load_documents(self.documents_dir)
        if not documents:
            print("no documents found!")
            return
        
        #chunk documents
        all_chunks = []
        for doc in documents:
            chunks = TextChunker.chunk_text(
                doc['text'],
                chunk_size=chunk_size,
                overlap=overlap,
                metadata=doc['metadata']
            )
            all_chunks.extend(chunks)
        
        print(f"created {len(all_chunks)} chunks")
        
        #generate embeddings
        all_chunks = self.embedder.embed_chunks(all_chunks)
        
        #store in vector DB
        self.vector_db.add_chunks(all_chunks)
        
        #save for future use
        self.vector_db.save(self.db_dir)
    
    def query(
        self,
        question: str,
        top_k: int = 5,
        distance_threshold: float = 1.5
    ) -> Dict:
        """Answer question using RAG.
        
        Returns:
            {
                'answer': Generated answer,
                'sources': List of source chunks,
                'confidence': 'high'|'medium'|'low'
            }
        """
        print(f"\n Question: {question}")
        
        #embed query
        query_vector = self.embedder.embed_text(question)
        
        #search vector DB
        results = self.vector_db.search(query_vector, top_k=top_k)
        
        #filter by threshold
        filtered_results = [
            (chunk, dist) for chunk, dist in results
            if dist < distance_threshold
        ]
        
        if not filtered_results:
            return {
                'answer': "Insufficient context to answer this question.",
                'sources': [],
                'confidence': 'low'
            }
        
        #build context from chunks
        context_parts = []
        sources = []
        
        for i, (chunk, distance) in enumerate(filtered_results):
            context_parts.append(
                f"[Source {i+1}: {chunk.metadata['source']}, "
                f"Page {chunk.metadata.get('page', 'N/A')}]\n{chunk.text}\n"
            )
            sources.append({
                'id': chunk.id,
                'source': chunk.metadata['source'],
                'page': chunk.metadata.get('page', 'N/A'),
                'distance': distance
            })
        
        context = "\n".join(context_parts)
        
        #build prompt
        prompt = f"""You are a helpful AI assistant. Answer the question based ONLY on the provided context.

CONTEXT:
{context}

QUESTION: {question}

INSTRUCTIONS:
1. Answer based only on the context above
2. Cite source numbers (e.g., "According to Source 1...")
3. If context is insufficient, state that clearly
4. Be concise but thorough

ANSWER:"""
        
        #generate answer
        print("Generating answer...")
        answer = self.llm.generate(prompt, temperature=0.3)
        
        return {
            'answer': answer,
            'sources': sources,
            'confidence': 'high' if len(filtered_results) >= 3 else 'medium'
        }

print("RAG System class ready!")

RAG System class ready!


## Step 10: Initialize System

### Configuration Options:
- `documents_dir`: Where your documents are
- `db_dir`: Where vector database is saved
- `chunk_size`: 500-1000 (750 is balanced)
- `overlap`: 10-20% of chunk_size
- `force_rebuild`: Set True to rebuild from scratch

In [19]:
# Initialize RAG system
rag = RAGSystem(
    documents_dir="documents",
    db_dir="vector_db",
    llm_model="llama3.2",
    embedding_model="nomic-embed-text"
)

# Build/load database
rag.ingest_documents(
    chunk_size=750,
    overlap=100,
    force_rebuild=True  
)

initializing RAG System...
Found embedding model: nomic-embed-text
Found LLM model: llama3.2
RAG System initialized!
Building new database...
Processing flora.pdf: 10 pages
  Page 1: extracted 4059 characters
  Page 2: extracted 4664 characters
  Page 3: extracted 2931 characters
  Page 4: extracted 3098 characters
  Page 5: extracted 3477 characters
  Page 6: extracted 3609 characters
  Page 7: extracted 3364 characters
  Page 8: extracted 3834 characters
  Page 9: extracted 4666 characters
  Page 10: extracted 1916 characters
Successfully extracted 10 pages from flora.pdf
loaded 10 document sections
created 61 chunks
Generating embeddings for 61 chunks...
  Progress: 10/61
  Progress: 20/61
  Progress: 30/61
  Progress: 40/61
  Progress: 50/61
  Progress: 60/61
embeddings complete!
Added 61 chunks (total: 61)
database saved to vector_db


## Step 11: Ask Questions!

### Tips for Good Questions:
- Be specific
- Use keywords from your documents
- One topic per question

### Tuning Parameters:
- `top_k`: More = more context, slower
- `distance_threshold`: Lower = stricter matching

In [17]:
#example: Ask a question
question = "What problem does FLoRA aim to address in the context of parameter-efficient fine-tuning (PEFT) for large language models?"

result = rag.query(
    question=question,
    top_k=5,
    distance_threshold=0.6
)

# Display results
print("\n" + "="*60)
print("ANSWER:")
print("="*60)
print(result['answer'])
print("\n" + "="*60)
print(f"CONFIDENCE: {result['confidence'].upper()}")
print("="*60)
print("\nSOURCES:")
for i, source in enumerate(result['sources'], 1):
    print(f"  {i}. {source['source']} (Page {source['page']}) - Distance: {source['distance']:.4f}")
print("="*60)


 Question: What problem does FLoRA aim to address in the context of parameter-efficient fine-tuning (PEFT) for large language models?
Generating answer...
  Generating with llama3.2 ...
 Generated 411 characters

ANSWER:
According to Sources 3 and 4, FLoRA aims to address the problem of parameter-efficient fine-tuning for large language models (LLMs) by proposing a family of fused forward-backward adapters (FFBAs) that combine ideas from LoRA and parallel adapters to improve fine-tuning accuracies. Specifically, it seeks to reduce inference-time latencies while maintaining or improving accuracy for similar parameter budgets.

CONFIDENCE: HIGH

SOURCES:
  1. flora.pdf (Page 9) - Distance: 0.2233
  2. flora.pdf (Page 9) - Distance: 0.2614
  3. flora.pdf (Page 1) - Distance: 0.2660
  4. flora.pdf (Page 1) - Distance: 0.2805
  5. flora.pdf (Page 10) - Distance: 0.2863


## Step 12: Batch Processing (Optional)

Process multiple questions at once.

In [20]:
# List of questions
questions = [
    "What is the main topic?",
    "What are the key points discussed?",
    "Are there any specific recommendations?"
]

# Process all questions
for i, q in enumerate(questions, 1):
    print(f"\n{'='*60}")
    print(f"Question {i}/{len(questions)}")
    print(f"{'='*60}")
    
    result = rag.query(q, top_k=3)
    
    print(f"Q: {q}")
    print(f"A: {result['answer'][:200]}...")
    print(f"Confidence: {result['confidence']}")
    print(f"Sources: {len(result['sources'])}")


Question 1/3

 Question: What is the main topic?
Generating answer...
  Generating with llama3.2 ...
 Generated 1048 characters
Q: What is the main topic?
A: Based on the provided context, it appears that the main topic revolves around advancements and studies related to natural language processing (NLP) and large language models, particularly in the areas...
Confidence: high
Sources: 3

Question 2/3

 Question: What are the key points discussed?
Generating answer...
  Generating with llama3.2 ...
 Generated 883 characters
Q: What are the key points discussed?
A: Based on the provided context, it appears that there are multiple papers related to parameter-efficient fine-tuning and prompt tuning for natural language processing tasks.

The key points discussed i...
Confidence: high
Sources: 3

Question 3/3

 Question: Are there any specific recommendations?
Generating answer...
  Generating with llama3.2 ...
 Generated 808 characters
Q: Are there any specific recommendations?
A: Based o

## Step 13: Advanced Customization

### Adjust Chunk Size

In [None]:
# For code or technical docs: smaller chunks
# rag.ingest_documents(chunk_size=500, overlap=75, force_rebuild=True)

# For long-form content: larger chunks
# rag.ingest_documents(chunk_size=1000, overlap=150, force_rebuild=True)

### Adjust Retrieval

In [None]:
# distance_threshold is a cosine distance in [0, 2]; the Step 11 example uses 0.6

# More context (slower, more comprehensive)
# result = rag.query(question, top_k=10, distance_threshold=0.8)

# Less context (faster, more focused)
# result = rag.query(question, top_k=3, distance_threshold=0.4)

### Use Different Models

In [None]:
# Faster but smaller model (pull it first: ollama pull llama3.2:1b)
# rag_fast = RAGSystem(llm_model="llama3.2:1b")

# Alternative model
# rag_alt = RAGSystem(llm_model="mistral")

## Step 14: Database Statistics

In [None]:
# Get database stats
print(f"Total chunks: {len(rag.vector_db.chunks)}")

# Count by source
sources = {}
for chunk in rag.vector_db.chunks:
    source = chunk.metadata['source']
    sources[source] = sources.get(source, 0) + 1

print("\nChunks per document:")
for source, count in sorted(sources.items(), key=lambda x: x[1], reverse=True):
    print(f"  {source}: {count} chunks")

## Step 15: Save/Export Results

In [None]:
# Save results to JSON
def save_qa_results(questions_and_answers, filename="qa_results.json"):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(questions_and_answers, f, indent=2, ensure_ascii=False)
    print(f"✅ Results saved to {filename}")

# Example:
# qa_pairs = []
# for q in questions:
#     result = rag.query(q)
#     qa_pairs.append({'question': q, 'answer': result['answer']})
# save_qa_results(qa_pairs)