# Retrieval Augmented Generation

## Foundation 
 - Information retrieval - fetches relevant context
 - Generative models - produce grounded responses.
 - Reduces hallucinations
 - Allows LLMs to access private or up-to-date knowledge.
 - **Indexing Phase** :
    - **Data Ingestion -> Chunking -> Embedding -> Vector Storage**
    - Chunking Methods
    - Vector Storage Options : Pinecone/Weaviate/FAISS
 - **Retrieval Phase** :
    - **Query -> Embedding -> Vector search -> Similarity -> Top-k results**
    - Retrieval Types
    - Beam Search
 - **Generation Phase** :
    - Retrieved context is injected into a structured prompt.
    - Generating the final answer.
 - Embedding Models
    - OpenAI text-embedding
    - BERT, SBERT
    - Cohere Embed
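
The three phases above can be sketched end to end in a few lines. This is a toy illustration, not a production setup: `embed` is a hypothetical stand-in that uses word counts instead of a real embedding model, and the in-memory list stands in for a vector store such as FAISS.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): lowercase word counts instead of a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Indexing phase: keep each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Retrieval phase: embed the query, score every chunk, keep the top-k."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, context_chunks):
    """Generation phase input: inject the retrieved context into a structured prompt."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


chunks = [
    "FAISS is a library for vector similarity search.",
    "Chunking splits documents into smaller passages.",
    "BM25 is a keyword-based ranking function.",
]
question = "Which library does vector similarity search?"
index = build_index(chunks)
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In a real system, `embed` would call one of the embedding models listed above and `build_index` would write to a vector store; the control flow stays the same.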

## RAG Evaluation
 - RAG must be evaluated in three separate stages: retrieval quality, generation quality, and end-to-end task performance.
 - Libraries : RAGAS, TruLens, Promptfoo

**Retrieval evaluation measures** whether the system fetched the right chunks before generation.
$$
\text{Recall@k} = \frac{\text{\# of relevant docs retrieved in top-}k}{\text{Total \# of relevant docs}}
$$

$$
\text{Precision@k} = \frac{\text{\# of relevant docs retrieved in top-}k}{\text{Total \# of retrieved docs } (k)}
$$
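
A minimal sketch of the two formulas above, assuming each document has a unique ID, `retrieved` is the ranked list returned by the retriever, and `relevant` is the ground-truth set. The function names and IDs are illustrative only.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant (denominator is k)."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k if k else 0.0


retrieved = ["d3", "d1", "d7", "d2"]   # ranked retriever output
relevant = {"d1", "d2", "d5"}          # ground-truth relevant docs

recall = recall_at_k(retrieved, relevant, k=4)        # 2 of 3 relevant docs found
precision = precision_at_k(retrieved, relevant, k=4)  # 2 of 4 retrieved docs relevant
print(recall, precision)
```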

**Generation Evaluation Metrics**: These evaluate the quality of the model's final answer after the retrieved context has been combined with the query.
 - Faithfulness : Is the answer free of hallucinations?
 - Groundedness : Is the answer fully supported by the retrieved context?
 - Answer Relevance : Does the generated answer address the user's query?
 - Correctness : Is the answer factually right? Does it match a known correct answer?
 - Toxicity : Does the answer contain policy violations, PII leakage, or offensive language?

**Product Experience Metrics**:
 - Task Success Rate : Does the RAG system solve the user's problem?
 - User Satisfaction :
 - Latency : Time taken for the retrieval lookup and for first-token generation.
 - Cost Per Query : Embedding cost + Storage Cost + Generation Cost.
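
One way to capture the latency metric above is to time the lookup and the first streamed token separately. This is a sketch under assumptions: `fake_retrieve` and `fake_generate_stream` are hypothetical stand-ins for a real retriever and a streaming LLM call, and time to first token is measured from the start of the request.

```python
import time


def measure_latency(retrieve_fn, generate_stream_fn, query):
    """Return lookup time and time to first token (both in seconds) for one query."""
    start = time.perf_counter()
    context = retrieve_fn(query)                 # retrieval lookup
    lookup_done = time.perf_counter()
    stream = generate_stream_fn(query, context)
    first_token = next(stream)                   # waits until the first token is produced
    first_token_at = time.perf_counter()
    return {
        "lookup_s": lookup_done - start,
        "time_to_first_token_s": first_token_at - start,
        "first_token": first_token,
    }


# Hypothetical stand-ins: swap in the real retriever and streaming generation call.
def fake_retrieve(query):
    return ["some retrieved chunk"]


def fake_generate_stream(query, context):
    for token in ["Hello", " world"]:
        yield token


stats = measure_latency(fake_retrieve, fake_generate_stream, "hi")
print(stats)
```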

**Evaluation Methods**
- Offline Evaluation (Development) : Using ground truth.
- Online Evaluation (After deployment) : Tracking, feedback, A/B testing

# Prompt Engineering

## Types of Prompting 
* **Zero Shot Prompting**
    - Giving the model a task without any examples.
    - Relies on the model's pretrained knowledge to perform the task.
    - Example : 'Summarize this paragraph'
* **One-Shot Prompting**
    - You give one example to show the model the pattern.
* **Few Shot Prompting**
    - Giving multiple examples to teach the model the structure or style.
* **Chain-of-Thought Prompting**
    - Asking the model to think step by step
    - Improves logical reasoning and complex problem-solving
* **Self-Consistency Prompting**
    - Getting multiple chain-of-thought answers and picking the most common one.
    - Improves reasoning reliability
* **Role Prompting**
    - Assigning a role to the model so it behaves like a specific expert.
* **Instruction Prompting**
    - Clear, direct instructions telling the model exactly what to do
* **ReAct Prompting - Reason + Act**
    - Model alternates between reasoning steps and taking actions.
    - Used in agents
* **Retrieval-Augmented Prompting**
    - You pass external documents or context into the prompt (used in RAG)
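
Two of the prompting types above are easy to sketch in code. The helper below builds a zero-, one-, or few-shot prompt depending on how many examples it receives, and the second function applies self-consistency by majority vote. Both function names and the `Input:`/`Output:` template are assumptions for illustration, and `sample_fn` is a hypothetical stand-in for a sampled LLM call.

```python
from collections import Counter


def few_shot_prompt(instruction, examples, query):
    """Instruction, then worked examples, then the new input.

    With no examples this is a zero-shot prompt; with one, a one-shot prompt.
    """
    lines = [instruction, ""]
    for example_input, example_output in examples:
        lines += [f"Input: {example_input}", f"Output: {example_output}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


def self_consistent_answer(sample_fn, prompt, n=5):
    """Sample n answers for the same prompt and return the most common one."""
    answers = [sample_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


prompt = few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("I loved it", "positive"), ("Terrible service", "negative")],
    "The food was great",
)

# Hypothetical sampler: replays canned final answers instead of calling a model.
canned = iter(["42", "41", "42", "42", "40"])
final = self_consistent_answer(lambda p: next(canned), "What is 6 * 7? Think step by step.")
print(prompt)
print(final)
```

With a real model, `sample_fn` would run a chain-of-thought prompt at a non-zero temperature and return only the extracted final answer, so that the vote compares answers rather than full reasoning traces.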
 
## Structure of a good Prompt 

# Retrieval Pipeline
* **Step 1 : Query Processing**
* **Step 2 : Query Embedding** - use an embedding model 
* **Step 3 : Document Chunking** 
    * **Chunking strategies**
        - Fixed size chunks (e.g., 512 tokens)
        - Sentence-based
        - Paragraph-based
        - Overlapping chunks
        - Semantic chunking (LLM-chunking)
* **Step 4 : Store the Document Embedding in a vector database**
   - Store chunk text, metadata, and embedding vector
   - Multi-vector embedding : ColBERT
     - Stores multiple vectors for a single chunk (ColBERT keeps one per token) instead of a single pooled vector
* **Step 5 : Retrieval**
   - Dense Retriever Stage : Use similarity search to retrieve the top-k chunks
   - Hybrid Retriever : Vector search sometimes misses exact keywords, while BM25 fails at semantic meaning.
      - So combine keyword search + vector search
      - Merge the results : Reciprocal Rank Fusion / Weighted Score Fusion
* **Step 6 : Reranking**
  - Rank the retrieved chunks again using a re-ranker.
  - A cross-encoder is a BERT-style transformer that takes the query and a candidate chunk together and outputs a relevance score. This is usually more accurate than vector similarity, but slower, so it is applied only to the retrieved candidates.
* **Metadata Filtering**
  - Restricts retrieval using structured fields such as document type, version, date, product ID, ...
  - Ensures that retrieval considers only relevant documents.
  - We can apply filters before retrieval to narrow the search space or after retrieval to clean up results.
* **Context Assembly**
  - Merge relevant chunks
  - Remove duplicates
* **LLM Prompt Construction**
  - System Prompt
  - Instructions
  - Frame the Context
  - User Query
* **Generation**
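
Several of the steps above are small enough to sketch directly: overlapping fixed-size chunking (Step 3), Reciprocal Rank Fusion for hybrid retrieval (Step 5), metadata filtering, and context assembly with de-duplication. This is a rough sketch, not a reference implementation. The function names, the `{"text": ..., "meta": ...}` chunk layout, and the sample data are all assumptions; `k=60` is a commonly used RRF constant.

```python
def chunk_with_overlap(tokens, size=512, overlap=64):
    """Fixed-size chunks where consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end
        start += size - overlap
    return chunks


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists: each list adds 1 / (k + rank) to a doc's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def filter_by_metadata(chunks, **required):
    """Keep chunks whose metadata matches every required field."""
    return [c for c in chunks
            if all(c["meta"].get(field) == value for field, value in required.items())]


def assemble_context(chunks):
    """Merge chunk texts in rank order, dropping exact duplicates."""
    seen, parts = set(), []
    for c in chunks:
        if c["text"] not in seen:
            seen.add(c["text"])
            parts.append(c["text"])
    return "\n\n".join(parts)


windows = chunk_with_overlap(list(range(10)), size=4, overlap=1)

dense_ranking = ["c1", "c2", "c3"]    # from vector search
keyword_ranking = ["c3", "c1", "c4"]  # from keyword search such as BM25
fused = reciprocal_rank_fusion([dense_ranking, keyword_ranking])

store = {
    "c1": {"text": "Reset via settings.", "meta": {"doc_type": "manual"}},
    "c2": {"text": "Old reset steps.", "meta": {"doc_type": "forum"}},
    "c3": {"text": "Reset via settings.", "meta": {"doc_type": "manual"}},
    "c4": {"text": "Hold the button 5 s.", "meta": {"doc_type": "manual"}},
}
candidates = filter_by_metadata([store[i] for i in fused], doc_type="manual")
context = assemble_context(candidates)
print(windows, fused, context, sep="\n")
```

Here the filter runs after retrieval to clean up results; applying the same predicate before the search would narrow the search space instead.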

# Fine Tuning Topics
- **Full Fine Tuning**
  - Trains the entire model on new domain-specific data.
  - Takes a pre-trained model and continues training it.
  - Updates all the weights.
  - Requires substantial computational resources.
  - Used when maximum performance on a specific task is required.
  - Suited to specialized models for critical apps.
    
- **Parameter-Efficient Fine-Tuning (PEFT)**: Fine-tune only a small subset of model parameters while keeping the majority frozen.
   - LoRA : Low-Rank Adaptation
     - Adds small low-rank adapter matrices to model layers
     - Trains only these low-rank matrices; the base weights stay frozen
     - Adapters can be merged back into the base model after training.
   - QLoRA : Quantized LoRA
     - Adds quantization to LoRA
     - Keeps the frozen base model in 4-bit precision and trains LoRA adapters on top of it
     - Extremely memory efficient: can run on a consumer GPU.

- **Fine-tune when**
  - A specific style/tone adaptation is needed.
  - Specialized reasoning patterns are needed: code-generation LLMs, proofs, ...
  - The base model lacks domain knowledge and that knowledge is not easily retrievable.
  - Latency is critical (RAG has an additional retrieval step).

- **RAG when**:
  - We want to reduce hallucination.
  - Combining information from multiple sources.
  - Data privacy is critical
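
Returning to LoRA from the PEFT notes above: the adapter is a pair of small matrices `B` (d x r) and `A` (r x k) whose product is added to the frozen weight `W` (d x k), scaled by `alpha / r`. The sketch below uses plain Python lists to show the merge step and the parameter savings. The function names and toy numbers are illustrative assumptions; real training would use a library such as PEFT.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_merge(W, A, B, alpha, r):
    """Merge the adapter into the base weight: W + (alpha / r) * (B @ A)."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def lora_param_counts(d, k, r):
    """Trainable parameters: full fine-tuning of W vs. the LoRA matrices B and A."""
    return {"full": d * k, "lora": r * (d + k)}


W = [[1, 0], [0, 1]]   # frozen base weight (2 x 2)
B = [[1], [2]]         # trainable, 2 x 1 (rank r = 1)
A = [[0.5, 0]]         # trainable, 1 x 2
merged = lora_merge(W, A, B, alpha=2, r=1)

counts = lora_param_counts(d=4096, k=4096, r=8)
print(merged)   # [[2.0, 0.0], [2.0, 1.0]]
print(counts)   # {'full': 16777216, 'lora': 65536}
```

For a 4096 x 4096 layer at rank 8, the adapter trains 65,536 parameters instead of about 16.8 million, which is where the memory savings come from.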

# LLM - Inference Parameters

**Topics to cover**
- Function calling
- Agents
- vLLM
- Token throughput
- Latency reduction
- Model distillation
- Model orchestration (vLLM, TensorRT-LLM, Ollama)
- SLMs : how they are useful
- Mixture of experts
- gpt-oss PPT contents
- LoRA/QLoRA in detail

**Projects**
- Summarization of the model development document, and a corresponding RAG
- Model from sivakumar : ask for model ID