# Retrieval Augmented Generation

## Foundation 
 - Information retrieval: fetches relevant context.
 - Generative models: produce responses grounded in that context.
 - Reduces hallucinations.
 - Allows LLMs to access private or up-to-date knowledge.
 - **Indexing Phase** :
    - **Data Ingestion -> Chunking -> Embedding -> Vector Storage**
    - Chunking Methods
    - Vector Storage Options : Pinecone/Weaviate/FAISS
 - **Retrieval Phase** :
    - **Query -> Embedding -> Vector search -> Similarity -> Top-k results**
    - Retrieval Types
    - Beam Search
 - **Generation Phase** :
    - Retrieved context is injected into a structured prompt.
    - Generating the final answer.
 - Embedding Models
    - OpenAI text-embedding
    - BERT, SBERT
    - Cohere Embed
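
The indexing and retrieval phases above can be sketched end to end in a few lines. This is a toy illustration only: `embed` is a bag-of-words stand-in for a real embedding model, the "vector store" is a plain Python list, and all function names are hypothetical rather than taken from any library.

```python
import math
from collections import Counter

def chunk_text(text, chunk_size=8):
    """Split text into fixed-size chunks of `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def embed(text):
    """Toy embedding: word counts (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(documents, chunk_size=8):
    """Indexing phase: ingest -> chunk -> embed -> store."""
    index = []
    for doc in documents:
        for chunk in chunk_text(doc, chunk_size):
            index.append((chunk, embed(chunk)))
    return index

def retrieve(index, query, k=2):
    """Retrieval phase: embed the query -> similarity search -> top-k chunks."""
    q = embed(query)
    scored = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in scored[:k]]

docs = [
    "FAISS is a library for efficient similarity search over dense vectors",
    "Chunking splits long documents into smaller passages before embedding",
]
index = build_index(docs)
top = retrieve(index, "similarity search library", k=1)
print(top)
```

In a real system the list would be replaced by a vector store such as FAISS, Pinecone, or Weaviate, and `embed` by one of the embedding models listed above.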

## RAG Evaluation
 - RAG should be evaluated at three separate levels: retrieval quality, generation quality, and end-to-end task performance.
 - Libraries : RAGAS, TruLens, Promptfoo

**Retrieval evaluation measures** whether the system fetched the right chunks before generation.
$$
\text{Recall@}k = \frac{\text{number of relevant docs retrieved in top-}k}{\text{total number of relevant docs}}
$$

$$
\text{Precision@}k = \frac{\text{number of relevant docs retrieved in top-}k}{\text{total number of retrieved docs } (k)}
$$
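
The two formulas above translate directly into code. The sketch below assumes that `retrieved` is a ranked list of unique document IDs and `relevant` is the ground-truth set; the function names and the sample IDs are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k if k > 0 else 0.0

retrieved = ["d3", "d1", "d7", "d2", "d9"]  # ranked retriever output (made-up IDs)
relevant = {"d1", "d2", "d4"}               # ground-truth relevant docs (made-up IDs)

print(recall_at_k(retrieved, relevant, k=3))     # 1 of 3 relevant docs found
print(precision_at_k(retrieved, relevant, k=3))  # 1 of 3 retrieved docs is relevant
```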

**Generation Evaluation Metrics**: These evaluate the quality of the model's final answer after the retrieved context has been combined with the query.
 - Faithfulness : Is the answer free of hallucinations?
 - Groundedness : Is the answer fully supported by the retrieved context?
 - Answer Relevance : Does the generated answer address the user's query?
 - Correctness : Is the answer factually right? Does it match a known correct answer?
 - Toxicity : Does the answer contain policy violations, PII leakage, or offensive language?

**Product Experience Metrics**:
 - Task Success Rate : Does the RAG system solve the user's problem?
 - User Satisfaction : Explicit user feedback, e.g., ratings or thumbs up/down.
 - Latency : Time taken for the retrieval lookup and for generating the first token.
 - Cost Per Query : Embedding cost + Storage Cost + Generation Cost.

**Evaluation Methods**
- Offline Evaluation (during development) : Uses a ground-truth dataset.
- Online Evaluation (after deployment) : Tracking, user feedback, A/B testing.

# Prompt Engineering

## Types of Prompting 
* **Zero Shot Prompting**
    - Giving the model a task without any examples.
    - Relies on the model's pretrained knowledge to perform the task.
    - Example : 'Summarize this paragraph'
* **One-Shot Prompting**
    - You give one example to show the model the pattern.
* **Few Shot Prompting**
    - Giving multiple examples to teach the model the structure or style.
* **Chain-of-Thought Prompting**
    - Asking the model to think step by step
    - Improves logical reasoning and complex problem-solving
* **Self-Consistency Prompting**
    - Sampling multiple chain-of-thought answers and picking the most common one.
    - Improves reasoning reliability.
* **Role Prompting**
    - Assigning a role to the model so it behaves like a specific expert.
* **Instruction Prompting**
    - Clear, direct instructions telling the model exactly what to do.
* **ReAct Prompting - Reason + Act**
    - Model alternates between reasoning steps and taking actions.
    - Used in agents.
* **Retrieval-Augmented Prompting**
    - You pass external documents or context into the prompt (used in RAG).
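
Self-consistency prompting, described above, reduces to a majority vote over several sampled answers. A minimal sketch follows, assuming a hypothetical `sample_fn` that returns the final answer extracted from one chain-of-thought completion; `fake_sampler` is a canned stand-in for a real LLM call made with temperature above zero.

```python
from collections import Counter

def self_consistent_answer(sample_fn, prompt, n_samples=5):
    """Sample several answers and return the most common one with its vote share."""
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Hypothetical stand-in for an LLM: returns canned final answers in order.
_canned = iter(["42", "41", "42", "42", "40"])

def fake_sampler(prompt):
    return next(_canned)

answer, agreement = self_consistent_answer(
    fake_sampler, "What is 6 * 7? Think step by step.", n_samples=5
)
print(answer, agreement)
```

The vote share can double as a rough confidence signal: low agreement suggests the question deserves a retry or a human check.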
 
## Structure of a good Prompt 

* Clear → No ambiguity
* Specific → Defines scope
* Context-rich → Gives necessary background
* Instructional → Contains a clear verb (Explain, Generate, Evaluate, Translate…)
* Constrained → Boundaries, rules, or format
* Modular → Easy to reuse or extend
* Deterministic → Minimizes randomness
* Evaluatable → Output format that can be checked

# Retrieval Pipeline
* **Step 1 : Query Processing**
* **Step 2 : Query Embedding** - use an embedding model 
* **Step 3 : Document Chunking** 
    * **Chunking strategies**
        - Fixed size chunks (e.g., 512 tokens)
        - Sentence-based
        - Paragraph-based
        - Overlapping chunks
        - Semantic chunking (LLM-chunking)
* **Step 4 : Store the Document Embedding in a vector database**
   - Store the chunk text, metadata, and embedding vector.
   - Multi-vector embedding : ColBERT
     - A multi-vector index stores several vectors for a single chunk (ColBERT keeps one per token) instead of one pooled vector.
* **Step 5 : Retrieval**
   - Dense Retriever Stage : Use similarity search to retrieve the top-k chunks.
   - Hybrid Retriever : Vector search sometimes misses exact keywords, while BM25 fails at semantic meaning.
      - So combine keyword search with vector search.
      - Merge the results : Reciprocal Rank Fusion / Weighted Score Fusion
* **Step 6 : Reranking**
  - Ranking the retrieved chunks again using a re-ranker.
  - A cross-encoder is a BERT-style transformer that takes the query and a candidate chunk together and outputs a relevance score. This is usually more accurate than vector similarity but slower, so it is applied only to the retrieved top-k chunks.
* **Metadata-Filtering**
  - Restricts retrieval using structured fields such as document type, version, date, or product ID.
  - It ensures that retrieval considers only relevant documents.
  - We can apply filters before retrieval to narrow the search space or after retrieval to clean up results.
* **Context Assembly**
  - Merge relevant chunks
  - Remove duplicates
* **LLM Prompt Construction**
  - System Prompt
  - Instructions
  - Frame the Context
  - User Query
* **Generation**
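
The later pipeline steps (merging hybrid results, assembling context, and building the prompt) can be sketched as follows. Assumptions: the chunk IDs, chunk texts, and prompt wording are invented for illustration, and `k=60` is the constant conventionally used for Reciprocal Rank Fusion rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

def assemble_context(chunk_ids, chunk_store, max_chunks=3):
    """Merge chunks in ranked order, dropping exact duplicate texts."""
    seen, parts = set(), []
    for cid in chunk_ids:
        text = chunk_store[cid]
        if text in seen:
            continue
        seen.add(text)
        parts.append(text)
        if len(parts) == max_chunks:
            break
    return "\n\n".join(parts)

def build_prompt(context, question):
    """System prompt + instructions + framed context + user query."""
    return (
        "You are a helpful assistant.\n"
        "Answer using only the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

bm25_ranking = ["c2", "c1", "c4"]   # keyword search results (made up)
dense_ranking = ["c1", "c3", "c2"]  # vector search results (made up)
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])

chunk_store = {
    "c1": "RRF merges ranked lists.",
    "c2": "BM25 matches exact keywords.",
    "c3": "RRF merges ranked lists.",  # duplicate of c1 on purpose
    "c4": "Dense search matches meaning.",
}
context = assemble_context(fused, chunk_store, max_chunks=3)
prompt = build_prompt(context, "How are results merged?")
print(fused)
print(prompt)
```

RRF uses only ranks, so it needs no score normalization between BM25 and vector similarity; weighted score fusion would require putting both score scales on a common range first.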

# Fine Tuning Topics
- **Full Fine Tuning**
  - Trains the entire model on new domain-specific data.
  - Starts from a pre-trained model.
  - Continues training on the new data.
  - Updates all the weights.
  - Requires substantial computational resources.
  - Used when maximum performance on a specific task is required.
  - Suited to specialized models for critical applications.
    
- **Parameter Efficient Fine-Tuning (PEFT)**: Fine-tune only a small subset of model parameters while keeping the majority frozen.
   - LoRA : Low-Rank Adaptation
     - Adds small adapters to model layers.
     - Trains only these low-rank matrices.
     - Adapters can be merged back into the base model after training.
   - QLoRA : Quantized LoRA
     - Adds quantization to LoRA.
     - Quantizes the frozen base model to 4-bit precision and trains LoRA adapters on top of it.
     - Extremely memory efficient: many models can be fine-tuned on a single consumer GPU.

- **Use fine-tuning when**:
  - A specific style or tone is needed.
  - Specialized reasoning patterns are needed (code generation, proofs, ...).
  - The base model lacks domain knowledge that is not easily retrievable.
  - Latency is critical (RAG adds an extra retrieval step).

- **Use RAG when**:
  - We want to reduce hallucinations by grounding answers in retrieved sources.
  - Information must be combined from multiple sources.
  - Data privacy is critical.
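
To make the LoRA idea above concrete, the sketch below merges a low-rank update into a frozen weight matrix, `W + (alpha / r) * B @ A`, and compares trainable parameter counts. It uses plain Python lists and tiny made-up matrices for readability; the helper names are hypothetical and the layer size 4096 x 4096 is only an example. In actual LoRA training, `B` typically starts at zero so the adapter initially leaves the model unchanged.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_merge(W, A, B, alpha, r):
    """Merge the adapter into the base weights: W + (alpha / r) * B @ A."""
    delta = matmul(B, A)
    scale = alpha / r
    return [
        [W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
        for i in range(len(W))
    ]

def trainable_params(d_out, d_in, r):
    """LoRA trains B (d_out x r) and A (r x d_in) instead of W (d_out x d_in)."""
    return r * (d_in + d_out)

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen base weights, 2 x 3
B = [[1.0], [2.0]]                      # d_out x r, here r = 1
A = [[0.5, 0.0, 1.0]]                   # r x d_in
merged = lora_merge(W, A, B, alpha=2, r=1)
print(merged)

full = 4096 * 4096
lora = trainable_params(4096, 4096, r=8)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")
```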

# LLM - Inference Parameters

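* temperature : controls the randomness of token selection; lower values give more deterministic output.
* top p : samples only from the smallest set of tokens whose cumulative probability reaches p (nucleus sampling).
* top k : samples only from the k most probable tokens.
* max tokens : upper limit on the number of generated tokens.
* frequency penalty : penalizes tokens in proportion to how often they have already appeared, which reduces repetition.
* seed : fixes the random number generator so that sampling can be reproduced.
* beam search : keeps several candidate sequences at each step and returns the highest-scoring one instead of sampling.
* minimum new tokens : lower limit on the number of tokens generated before the model may stop.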

| Parameter                | Controls              | Typical Range | Best For                   |
| ------------------------ | --------------------- | ------------- | -------------------------- |
| **Temperature**          | Creativity            | 0–1.2         | Tone/style + randomness    |
| **Top-p**                | Probability mass      | 0.8–1.0       | Balanced sampling          |
| **Top-k**                | # of candidate tokens | 20–200        | Precise randomness control |
| **Max tokens**           | Output length         | app-dependent | Cost & length control      |
| **Repetition penalty**   | Avoid loops           | 1.05–1.3      | Avoiding repetition        |
| **Frequency/presence penalty** | Token frequency | 0–1.5 | Variety in reply |
| **Stop sequences**       | Termination           | n/a           | Structured output          |
| **Logit bias**           | Token forcing         | n/a           | JSON, safety, formatting   |
| **Seed**                 | Determinism           | int           | Reproducible outputs       |
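
How temperature, top-k, top-p, and seed interact is easier to see in code. The following is a simplified sketch over a made-up three-token vocabulary, not the implementation of any particular inference engine; in particular, real engines differ on whether probabilities are renormalized between the top-k and top-p steps.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token from `logits` (dict: token -> raw score)."""
    if temperature <= 0:
        return max(logits, key=logits.get)  # treat 0 as greedy decoding

    # Temperature scaling followed by a numerically stable softmax.
    scaled = {t: s / temperature for t, s in logits.items()}
    m = max(scaled.values())
    exp = {t: math.exp(s - m) for t, s in scaled.items()}
    total = sum(exp.values())
    probs = sorted(((t, e / total) for t, e in exp.items()),
                   key=lambda x: x[1], reverse=True)

    if top_k is not None:
        probs = probs[:top_k]  # keep the k most probable tokens

    if top_p is not None:
        kept, cumulative = [], 0.0
        for token, p in probs:  # keep the smallest prefix reaching mass top_p
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        probs = kept

    tokens = [t for t, _ in probs]
    weights = [p for _, p in probs]  # random.choices accepts relative weights
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = {"cat": 2.0, "dog": 1.0, "fish": 0.1}  # made-up scores
seeded = random.Random(0)                        # seed -> reproducible sampling
print(sample_token(logits, temperature=0))
print(sample_token(logits, temperature=0.7, top_k=2, rng=seeded))
```

Lower temperatures sharpen the distribution before sampling, while top-k and top-p only trim its tail, which is why they are often tuned together.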



# Agents

An agent is an LLM-powered entity that can plan, reason, take actions using tools, and interact with an environment to complete tasks autonomously.

## Abilities
* Plan -> break the task into steps
* Use tools -> APIs, databases, code executors
* Observe results -> feedback loop
* Adjust behavior -> reflection
* Communicate with other agents -> collaboration

## Types
* Reflex agents : simple and rule-based (if X happens, do Y). Used in monitoring and alerting.
* Stateful agents : maintain memory such as the conversation, previous steps, and long-term user preferences.

## Agentic Workflows
* Agents chained together with dependencies.
* Research Agent -> Analysis agent -> Writing agent -> QA Agent

## Creating single agents

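1. Role (persona) : defines the agent's expertise.
2. Abilities/Tools : the toolset the agent has access to.
3. Reasoning pattern : CoT, ReAct, Plan -> Execute, Reflect -> Improve.
4. Control loop : repeats reasoning, acting, and observing until the task is done or a step limit is reached.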
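
The four ingredients above fit together in a small control loop. The sketch below is an illustration under stated assumptions: `scripted_policy` is a hypothetical stand-in for the LLM's reasoning step, `calculator` is a toy tool, and `max_steps` acts as a simple guardrail against endless loops.

```python
import operator

def calculator(expression):
    """Toy tool: evaluate 'a op b' where op is one of + - * /."""
    a, op, b = expression.split()
    ops = {"+": operator.add, "-": operator.sub,
           "*": operator.mul, "/": operator.truediv}
    return ops[op](float(a), float(b))

TOOLS = {"calculator": calculator}

def scripted_policy(task, history):
    """Hypothetical stand-in for the LLM: choose the next step from the history."""
    if not history:
        return {"thought": "I need to compute this.",
                "action": "calculator", "input": task}
    return {"thought": "I have the result.",
            "final": str(history[-1]["observation"])}

def run_agent(task, policy, tools, max_steps=5):
    """Control loop: reason -> act -> observe, until a final answer or step limit."""
    history = []
    for _ in range(max_steps):
        step = policy(task, history)
        if "final" in step:
            return step["final"], history
        observation = tools[step["action"]](step["input"])
        history.append({**step, "observation": observation})
    return None, history  # guardrail: give up after max_steps

answer, trace = run_agent("6 * 7", scripted_policy, TOOLS)
print(answer)
```

Swapping `scripted_policy` for a real model call, with the role in the system prompt and the tool list in the instructions, gives the usual ReAct-style agent.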

## Multi-Agent Systems

A multi-agent system contains multiple agents collaborating like a team. Each agent:
1. has a specific role
2. communicates with other agents
3. can critique other agents' outputs
4. can share memory

## Key Concepts in Agent Orchestration
* Task Decomposition - Breaking the task into smaller subtasks
* Tool Routing - Choosing the right tool dynamically
* Memory Management - Long-term memory, short-term scratchpads, shared memory
* Safety + Guardrails - Prevent loops, bad API calls, dangerous actions
* Error Recovery - Handle failed tool calls, missing info, ambiguous requirements

# Token Throughput

* Input Throughput : How fast the model can ingest tokens
* Output Throughput : How fast the model can generate tokens.
* Higher token throughput generally means faster responses.

## Factors affecting throughput
* Model size
* Hardware
* Parallelization strategies
* Quantization level (4-bit, 8-bit)
* Inference engine
* Batch size


Throughput is one metric for LLM serving performance. 

# Latency
* Total time from sending a prompt → receiving a response.
* This includes:
  * Queueing time
  * Tokenization
  * Model inference time
  * Output token streaming
  * Tool use delays
  * Network overhead
* Techniques to reduce latency:
  * Model quantization
  * Smaller models (SLMs)
  * Distilled models
  * Optimized inference engines (TensorRT-LLM, vLLM)
  * System design techniques (e.g., caching, streaming responses)

High Throughput + low latency = optimal user experience
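
Latency and output throughput can both be derived from the arrival times of streamed tokens. The sketch below assumes timestamps in seconds measured on the client side; the function name, the dictionary keys, and the sample timings are illustrative.

```python
def serving_metrics(request_time, token_times):
    """Compute latency and throughput from per-token arrival times (seconds)."""
    ttft = token_times[0] - request_time            # time to first token
    total_latency = token_times[-1] - request_time  # prompt sent -> last token
    decode_time = token_times[-1] - token_times[0]
    # Tokens generated after the first one, per second of decoding.
    if decode_time > 0:
        output_throughput = (len(token_times) - 1) / decode_time
    else:
        output_throughput = float("inf")
    return {
        "ttft_s": ttft,
        "total_latency_s": total_latency,
        "output_tokens_per_s": output_throughput,
    }

# Made-up timings: request sent at t = 0, five tokens streamed back.
m = serving_metrics(0.0, [0.5, 1.0, 1.5, 2.0, 2.5])
print(m)
```

Time to first token is dominated by queueing and prompt processing, while the decode rate reflects output throughput, so the two are usually tracked separately.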