![RAG Architecture](RAG_Architecture.png)

### RAG Components
1. Document Loading
2. Document Splitting
3. Vectorstores and Embedding
4. Retrieval
5. Question Answering

## 1) Loaders
In LangChain, a document loader is a utility that helps you load data from different sources into a standardized document format so that it can be processed further (cleaned, split, embedded, retrieved, etc.).
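
To make the "standardized document format" concrete, here is a minimal standard-library sketch of what a loader does. `SimpleDocument` and `load_text_file` are hypothetical names invented for this illustration; they only mimic the `page_content` / `metadata` shape of the LangChain `Document` objects printed later in this notebook and are not LangChain's implementation.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path):
    """Read one UTF-8 text file and wrap it in a single-item document list."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Usage: write a throwaway file, then load it back as a document.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("RAG starts with loading data.", encoding="utf-8")
    loaded = load_text_file(sample_path)

print(loaded[0].page_content)
print(loaded[0].metadata)
```

Whatever the source (PDF, CSV, web page), the downstream steps only rely on this text-plus-metadata shape.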

#### Examples:
1. Text Loader
2. CSV Loader
3. PDF Loader
4. YouTube Loader
5. WebBase Loader

In [None]:
# Install dependencies
!pip install langchain
!pip install langchain-community
!pip install pypdf

#### PDF Loader

In [7]:
from langchain_community.document_loaders import PyPDFLoader

In [11]:
loader = PyPDFLoader("MachineLearning-Lecture01.pdf")
pages = loader.load()

In [12]:
print(type(pages))

<class 'list'>


In [13]:
print(type(pages[1]))

<class 'langchain_core.documents.base.Document'>


In [14]:
print(pages[1])

page_content='many biologers are there here? Wow, just a few, not many. I'm surprised. Anyone from 
statistics? Okay, a few. So where are the rest of you from?  
Student : iCME.  
Instructor (Andrew Ng) : Say again?  
Student : iCME.  
Instructor (Andrew Ng) : iCME. Cool.  
Student : [Inaudible].  
Instructor (Andrew Ng) : Civi and what else?  
Student : [Inaudible]  
Instructor (Andrew Ng) : Synthesis, [inaudible] systems. Yeah, cool.  
Student : Chemi.  
Instructor (Andrew Ng) : Chemi. Cool.  
Student : [Inaudible].  
Instructor (Andrew Ng) : Aero/astro. Yes, right. Yeah, okay, cool. Anyone else?  
Student : [Inaudible].  
Instructor (Andrew Ng) : Pardon? MSNE. All right. Cool. Yeah.  
Student : [Inaudible].  
Instructor (Andrew Ng) : Pardon?  
Student : [Inaudible].  
Instructor (Andrew Ng) : Endo -  
Student : [Inaudible].  
Instructor (Andrew Ng) : Oh, I see, industry. Okay. Cool. Great, great. So as you can 
tell from a cross-section of this class, I think we're a very diverse 

In [23]:
print(pages[1].metadata)

{'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'PScript5.dll Version 5.2.2', 'creationdate': '2008-07-11T11:25:23-07:00', 'author': '', 'moddate': '2008-07-11T11:25:23-07:00', 'title': '', 'source': 'MachineLearning-Lecture01.pdf', 'total_pages': 22, 'page': 1, 'page_label': '2'}


#### YouTube Loader

In [None]:
!pip install langchain_yt_dlp

In [53]:
from langchain_yt_dlp.youtube_loader import YoutubeLoaderDL

# Basic transcript loading
loader = YoutubeLoaderDL.from_youtube_url(
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ", add_video_info=True
)

In [54]:
documents = loader.load()

In [55]:
print(documents)

[Document(metadata={'source': 'dQw4w9WgXcQ', 'title': 'Rick Astley - Never Gonna Give You Up (Official Video) (4K Remaster)', 'description': 'The official video for “Never Gonna Give You Up” by Rick Astley. \n\nNever: The Autobiography 📚 OUT NOW! \nFollow this link to get your copy and listen to Rick’s ‘Never’ playlist ❤️ #RickAstleyNever\nhttps://linktr.ee/rickastleynever\n\n“Never Gonna Give You Up” was a global smash on its release in July 1987, topping the charts in 25 countries including Rick’s native UK and the US Billboard Hot 100.  It also won the Brit Award for Best single in 1988. Stock Aitken and Waterman wrote and produced the track which was the lead-off single and lead track from Rick’s debut LP “Whenever You Need Somebody”.  The album was itself a UK number one and would go on to sell over 15 million copies worldwide.\n\nThe legendary video was directed by Simon West – who later went on to make Hollywood blockbusters such as Con Air, Lara C

#### WebBase Loader


In [56]:
# Loaders live in langchain_community (same package as PyPDFLoader above)
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://github.com/sanketana/GenAI-Foudations/blob/main/Week06_RAG_1/notes.md")
docs = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [None]:
print(docs[0])

## 2) Splitters
A document splitter is a utility that takes a large document (or multiple documents) and breaks it into smaller, more manageable chunks of text.

#### Types of Splitters in LangChain
1. CharacterTextSplitter
2. RecursiveCharacterTextSplitter
3. TokenTextSplitter
4. Markdown / Code Splitters

![Example Splitter](Example_Splitter.png)

| Feature                   | CharacterTextSplitter                                         | RecursiveCharacterTextSplitter                                      |
|---------------------------|---------------------------------------------------------------|---------------------------------------------------------------------|
| Splitting method          | Splits on one separator, then merges pieces up to `chunk_size` | Tries hierarchical separators (paragraph → line → word → character) |
| Preserves semantic meaning| ❌ Only at the single separator; no finer fallback             | ✅ Keeps chunks aligned to natural text boundaries                   |
| Default separators        | `"\n\n"` (paragraph break only)                               | `["\n\n", "\n", " ", ""]` (paragraph, line, space, char)            |
| Chunk size handling       | Not enforced: a piece with no separator stays whole and may exceed `chunk_size` (a warning is logged) | Uses the coarsest separator that keeps each chunk ≤ `chunk_size` |
| Chunk overlap             | ✅ Supported                                                   | ✅ Supported                                                         |
| Output consistency        | Chunk sizes depend on where the separator falls               | Chunk sizes vary, but stay within `chunk_size`                      |
| Performance               | Faster, simpler                                               | Slightly slower due to recursive splitting logic                    |
| Readability of chunks     | Good only when the separator matches the text structure       | Better (complete sentences/paragraphs where possible)               |
| Best use case             | Text with one reliable delimiter; testing                     | Long docs, PDFs, transcripts, RAG pipelines                         |
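
The fallback through progressively finer separators can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's code: `toy_recursive_split` is a hypothetical helper, it ignores `chunk_overlap`, and it assumes the same separator order as the defaults listed in the table.

```python
def toy_recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer_separators = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they still fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(toy_recursive_split(piece, chunk_size, finer_separators))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee"
toy_chunks = toy_recursive_split(sample, chunk_size=7)
print(toy_chunks)
```

The first paragraph is too long for one chunk, so it is re-split on spaces, while the second paragraph fits and is kept whole.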

In [1]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [61]:
# Example text
text = """Artificial Intelligence is changing the world. It is being used in healthcare, education, and entertainment. 

However, AI also raises ethical concerns. Bias, privacy, and misuse are important issues."""

In [62]:
chunk_size = 50
chunk_overlap = 10

In [63]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [64]:
char_chunks = c_splitter.split_text(text)
print(char_chunks)

Created a chunk of size 109, which is longer than the specified 50


['Artificial Intelligence is changing the world. It is being used in healthcare, education, and entertainment.', 'However, AI also raises ethical concerns. Bias, privacy, and misuse are important issues.']


In [65]:
for chunk in char_chunks:
    print(chunk)

Artificial Intelligence is changing the world. It is being used in healthcare, education, and entertainment.
However, AI also raises ethical concerns. Bias, privacy, and misuse are important issues.


In [66]:
rec_chunks = r_splitter.split_text(text)
print(rec_chunks)

['Artificial Intelligence is changing the world. It', 'world. It is being used in healthcare, education,', 'and entertainment.', 'However, AI also raises ethical concerns. Bias,', 'Bias, privacy, and misuse are important issues.']


In [67]:
for chunk in rec_chunks:
    print(chunk)

Artificial Intelligence is changing the world. It
world. It is being used in healthcare, education,
and entertainment.
However, AI also raises ethical concerns. Bias,
Bias, privacy, and misuse are important issues.


## 3) Vectorstores and Embeddings

#### Combining loading and splitting

In [None]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("MachineLearning-Lecture01.pdf")
docs = loader.load()
print(docs)

In [72]:
import pprint
pprint.pp(docs[0].metadata)

{'producer': 'Acrobat Distiller 8.1.0 (Windows)',
 'creator': 'PScript5.dll Version 5.2.2',
 'creationdate': '2008-07-11T11:25:23-07:00',
 'author': '',
 'moddate': '2008-07-11T11:25:23-07:00',
 'title': '',
 'source': 'MachineLearning-Lecture01.pdf',
 'total_pages': 22,
 'page': 0,
 'page_label': '1'}


In [73]:
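# Bulk-load several PDFs
loaders = [
    # Lecture01 is listed twice, so its chunks are indexed twice
    # (see the duplicate-chunk failure case further down).
    PyPDFLoader("MachineLearning-Lecture01.pdf"),
    PyPDFLoader("MachineLearning-Lecture01.pdf"),
    PyPDFLoader("MachineLearning-Lecture02.pdf"),
    PyPDFLoader("MachineLearning-Lecture03.pdf"),
]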
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [74]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [75]:
splits = text_splitter.split_documents(docs)

In [76]:
print(len(splits))

208


### Embeddings

In [None]:
!pip install langchain-openai
!pip install python-dotenv

In [133]:
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
import os

load_dotenv()

# Access your API key
api_key = os.getenv("OPENAI_API_KEY")
print("API Key:", api_key[:5] + "*****")  # just to verify it's loaded

API Key: sk-pr*****


In [100]:
embedding = OpenAIEmbeddings(
    model="text-embedding-3-small"
)
# Can also explicitly pass key as openai_api_key=api_key

In [87]:
coffee1 = "I enjoy drinking coffee in the morning."
coffee2 = "I love having a cup of filter coffee when I wake up"
market = "The stock market had a big crash yesterday."
mug = "I crashed the stock of my coffee mug yesterday."

In [101]:
coffee1_embedding = embedding.embed_query(coffee1)
coffee2_embedding = embedding.embed_query(coffee2)
market_embedding = embedding.embed_query(market)
mug_embedding = embedding.embed_query(mug)

In [90]:
import numpy as np

In [94]:
np.dot(np.array(coffee1_embedding), np.array(coffee2_embedding))

np.float64(0.6272893868874276)

In [95]:
np.dot(np.array(coffee1_embedding), np.array(market_embedding))

np.float64(0.07100796107510074)
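
The dot products above act as similarity scores: the two coffee sentences score far higher than the coffee/stock-market pair. The dot product equals cosine similarity only when both vectors have length 1, which is assumed here for the OpenAI embeddings. For vectors that are not normalized, cosine similarity can be computed explicitly. The sketch below uses only the standard library and made-up 2-D vectors in place of real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-D vectors standing in for real embeddings (values are made up).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(round(same_direction, 6), round(orthogonal, 6))
```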

### Vectorstores

In [None]:
!pip install chromadb

In [98]:
# Vector store integrations live in langchain_community
from langchain_community.vectorstores import Chroma

In [99]:
persist_directory = 'docs/chroma'
!rm -rf ./docs/chroma

In [102]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [103]:
print(vectordb._collection.count())

208


### Similarity Search

In [104]:
question = "is there an email i can ask for help"
docs = vectordb.similarity_search(question, k=3)

In [105]:
len(docs)

3

In [106]:
docs[0].page_content

"cs229-qa@cs.stanford.edu. This goes to an account that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions.  \nIf you're asking questions about homework problems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me appropriately and get the response back to \nyou quickly.  \nLet's see. Skipping ahead - let's see - for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thing that I think will help you to succeed and \ndo well in this class and even help you to enjoy this class more is if you form a study \ngroup.  \nSo start looking around where you're sitting now or at the end of class today, mingle a \nlittle bit and get to know your classmates. I strongly encourage you to for

### Failure Cases - Duplicate Chunks in Search Results

In [119]:
question = "what did they say about the matlab"
docs = vectordb.similarity_search(question, k=5)
docs[0]

Document(metadata={'page': 8, 'total_pages': 22, 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'title': '', 'moddate': '2008-07-11T11:25:23-07:00', 'creationdate': '2008-07-11T11:25:23-07:00', 'author': '', 'creator': 'PScript5.dll Version 5.2.2', 'page_label': '9', 'source': 'MachineLearning-Lecture01.pdf'}, page_content='those homeworks will be done in either MATLAB or in Octave, which is sort of - I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your own home computer or something if you \ndon\'t have a MATLAB licen

### Failure Cases - Semantic Lookup ignoring Metadata

In [120]:
question = "what did they say about regression in the third lecture"
docs = vectordb.similarity_search(question, k=5)
for doc in docs:
    print(doc.metadata)

{'title': '', 'page_label': '1', 'page': 0, 'author': '', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creationdate': '2008-07-11T11:25:03-07:00', 'source': 'MachineLearning-Lecture03.pdf', 'moddate': '2008-07-11T11:25:03-07:00', 'creator': 'PScript5.dll Version 5.2.2', 'total_pages': 16}
{'source': 'MachineLearning-Lecture03.pdf', 'page_label': '15', 'moddate': '2008-07-11T11:25:03-07:00', 'creator': 'PScript5.dll Version 5.2.2', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'title': '', 'total_pages': 16, 'creationdate': '2008-07-11T11:25:03-07:00', 'page': 14, 'author': ''}
{'creationdate': '2008-07-11T11:25:03-07:00', 'moddate': '2008-07-11T11:25:03-07:00', 'page_label': '7', 'author': '', 'creator': 'PScript5.dll Version 5.2.2', 'source': 'MachineLearning-Lecture03.pdf', 'page': 6, 'title': '', 'total_pages': 16, 'producer': 'Acrobat Distiller 8.1.0 (Windows)'}
{'source': 'MachineLearning-Lecture03.pdf', 'author': '', 'page': 2, 'producer': 'Acrobat Distiller 8.1.0 (Window

## 4) Retrieval
Retrieval fetches the most relevant pieces of external information (document chunks, knowledge-base entries, etc.) and passes them to the LLM as extra context before it generates an answer.

#### Types of Retrieval


| Attribute            | Vector Similarity                          | BM25 / Keyword                  | Hybrid Search                     | Re-ranking                          | Structured Retrieval                  |
|----------------------|-------------------------------------------|---------------------------------|----------------------------------|------------------------------------|--------------------------------------|
| How It Works          | Embed query & docs, find nearest vectors | TF-IDF based keyword match       | Combines vector + keyword search  | Retrieve many, rerank with model   | Query structured DB or API            |
| Strengths             | Captures semantic meaning                 | Exact keyword matching           | Balances semantic & lexical      | High precision ranking             | Accurate for structured facts         |
| Weaknesses            | Misses exact keywords                     | Fails on semantic similarity     | More complex infra               | Expensive at scale                  | Needs schema alignment                |
| When to Use           | General semantic search                   | Legal, technical, IDs           | Production-grade RAG             | Customer-facing apps, high accuracy | Enterprise + DB + knowledge graph    |
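
As an illustration of the "Hybrid Search" column, one common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF), where each document scores `1 / (k + rank)` in every list it appears in. The sketch below is an assumption about how the combination could be done, not something this notebook's pipeline uses; the document ids and both rankings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; earlier ranks contribute more."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists from a vector search and a keyword search.
vector_ranking = ["doc_a", "doc_b", "doc_c"]
keyword_ranking = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Documents that rank well in both lists (`doc_a`, `doc_c`) rise above those found by only one method.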

#### Maximum Marginal Relevance (MMR)
- The top-k most similar chunks are not always the best set to return.
- If they all say nearly the same thing, the LLM gets a very narrow view of the topic. MMR picks chunks that are relevant to the query and also different from the chunks already selected.
- Examples:
1. **Wikipedia / Photosynthesis**  
   - **Without MMR**: Top-k retrieves 3 paragraphs all about the light reaction.  
   - **With MMR**: You get **light reaction + Calvin cycle + chloroplast structure**, which is comprehensive and non-repetitive.  

2. **News Articles / Election Results**  
   - **Without MMR**: Top-k might select 3 snippets all repeating **who won**.  
   - **With MMR**: You get **winner + voter turnout + reactions from parties and citizens**, which gives broader context.  

3. **Product Reviews / Smartphone Pros**  
   - **Without MMR**: Top-k selects 3 reviews all saying **camera is good**.  
   - **With MMR**: You get **camera + battery + display quality**, which highlights diverse advantages.
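
The selection rule behind these examples can be sketched in plain Python: at each step, pick the candidate with the best trade-off between relevance to the query and similarity to what has already been picked. `mmr_select` and the toy vectors are hypothetical, written for this illustration; the weight `lambda_mult=0.5` is an assumed value, and this is not the vector store's own implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to the closest already-selected document.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: documents 0 and 1 are exact duplicates, document 2 is different.
query = [1.0, 1.0]
docs_toy = [[1.0, 0.9], [1.0, 0.9], [0.0, 1.0]]
top_k = sorted(range(len(docs_toy)), key=lambda i: cosine(query, docs_toy[i]), reverse=True)[:2]
picked = mmr_select(query, docs_toy, k=2)
print("top-k:", top_k, "MMR:", picked)
```

Plain top-k returns both copies of the duplicate, while MMR keeps one copy and uses the second slot for the different document.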

In [121]:
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question,k=3)
docs_ss[0].page_content[:100]

'those homeworks will be done in either MATLAB or in Octave, which is sort of - I \nknow some people c'

In [122]:
docs_ss[1].page_content[:100]

'those homeworks will be done in either MATLAB or in Octave, which is sort of - I \nknow some people c'

In [123]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)
docs_mmr[0].page_content[:100]

'those homeworks will be done in either MATLAB or in Octave, which is sort of - I \nknow some people c'

In [124]:
docs_mmr[1].page_content[:100]

'least squares regression being a bad idea for classification problems and then I did a \nbunch of mat'

#### Metadata Based Search
Augment similarity search with an exact metadata filter wherever possible.

E.g., to search for regression in the third lecture transcript, `MachineLearning-Lecture03.pdf`, add the metadata filter `source: MachineLearning-Lecture03.pdf`.
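
Conceptually, a metadata filter restricts the candidate set by exact match and similarity ranking then happens only among the survivors. The sketch below shows that idea with made-up vectors and a hypothetical `filtered_search` helper; how a real vector store applies the filter internally is not covered here.

```python
def filtered_search(query_vec, records, metadata_filter, k=3):
    """Keep records whose metadata matches every filter key, then rank by dot product."""
    def score(record):
        return sum(q * v for q, v in zip(query_vec, record["vector"]))

    matching = [
        record for record in records
        if all(record["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    return sorted(matching, key=score, reverse=True)[:k]


# Toy records: vectors are made up; sources reuse the lecture file names.
records = [
    {"vector": [0.9, 0.1], "metadata": {"source": "MachineLearning-Lecture01.pdf", "page": 8}},
    {"vector": [0.8, 0.2], "metadata": {"source": "MachineLearning-Lecture03.pdf", "page": 0}},
    {"vector": [0.2, 0.8], "metadata": {"source": "MachineLearning-Lecture03.pdf", "page": 6}},
]
hits = filtered_search([1.0, 0.0], records, {"source": "MachineLearning-Lecture03.pdf"}, k=2)
print([hit["metadata"]["page"] for hit in hits])
```

The Lecture01 record has the highest raw score but is excluded because its `source` does not match the filter.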

In [126]:
question = "what did they say about regression in the third lecture?"

docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"MachineLearning-Lecture03.pdf"}
)

In [127]:
for d in docs:
    print(d.metadata)

{'author': '', 'source': 'MachineLearning-Lecture03.pdf', 'page': 0, 'page_label': '1', 'title': '', 'creationdate': '2008-07-11T11:25:03-07:00', 'moddate': '2008-07-11T11:25:03-07:00', 'total_pages': 16, 'creator': 'PScript5.dll Version 5.2.2', 'producer': 'Acrobat Distiller 8.1.0 (Windows)'}
{'author': '', 'moddate': '2008-07-11T11:25:03-07:00', 'source': 'MachineLearning-Lecture03.pdf', 'page': 14, 'creationdate': '2008-07-11T11:25:03-07:00', 'page_label': '15', 'title': '', 'total_pages': 16, 'creator': 'PScript5.dll Version 5.2.2', 'producer': 'Acrobat Distiller 8.1.0 (Windows)'}
{'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'author': '', 'total_pages': 16, 'title': '', 'moddate': '2008-07-11T11:25:03-07:00', 'creationdate': '2008-07-11T11:25:03-07:00', 'page': 6, 'page_label': '7', 'source': 'MachineLearning-Lecture03.pdf', 'creator': 'PScript5.dll Version 5.2.2'}


#### Metadata Self Query Retriever

SelfQueryRetriever uses an LLM to dynamically translate your natural language question into both semantic + metadata queries so that your vector DB returns the most relevant and context-aware chunks.

##### What it does internally:
- Determines semantic intent
- Determines metadata filters if applicable
- Sends the query to vector DB with filters
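
The three steps above can be mimicked without an LLM to show the shape of the result: a semantic query string plus a metadata filter. In this sketch a regular expression stands in for the LLM, and `ToyStructuredQuery`, `toy_query_constructor`, and the ordinal-to-file mapping are all hypothetical names made up for illustration; the real retriever delegates this translation to the language model.

```python
import re
from dataclasses import dataclass, field

# Hypothetical mapping from ordinal words to lecture numbers in this notebook.
ORDINAL_TO_NUMBER = {"first": 1, "second": 2, "third": 3}
LECTURE_PATTERN = re.compile(
    r"\s*\b(?:in|from) the (first|second|third) lecture\b", re.IGNORECASE
)


@dataclass
class ToyStructuredQuery:
    """Semantic part of the question plus an exact-match metadata filter."""
    query: str
    filter: dict = field(default_factory=dict)


def toy_query_constructor(question):
    """Split a question into semantic text and a `source` filter, if one is mentioned."""
    match = LECTURE_PATTERN.search(question)
    if match is None:
        return ToyStructuredQuery(query=question.strip(" ?"))
    number = ORDINAL_TO_NUMBER[match.group(1).lower()]
    semantic_part = (question[:match.start()] + question[match.end():]).strip(" ?")
    source = f"MachineLearning-Lecture{number:02d}.pdf"
    return ToyStructuredQuery(query=semantic_part, filter={"source": source})


structured = toy_query_constructor("what did they say about regression in the third lecture?")
print(structured.query)
print(structured.filter)
```

The semantic part goes to the similarity search and the filter is passed alongside it, which is the same split the retriever produces automatically.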

##### Why this is powerful
- You don’t need to manually write filters or craft queries.
- The LLM automatically “understands” the schema and selects relevant docs.
- Works well for RAG pipelines, especially with metadata-rich corpora (like CS229 transcripts with topics, lecture numbers, etc.).

In [134]:
from langchain_openai import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [135]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `MachineLearning-Lecture01.pdf`, `MachineLearning-Lecture02.pdf`, or `MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [136]:
document_content_description = "Lecture notes"
llm = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [137]:
question = "what did they say about regression in the third lecture?"

In [139]:
docs = retriever.invoke(question)

In [140]:
for d in docs:
    print(d.metadata)

{'title': '', 'moddate': '2008-07-11T11:25:03-07:00', 'page_label': '3', 'creator': 'PScript5.dll Version 5.2.2', 'creationdate': '2008-07-11T11:25:03-07:00', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'source': 'MachineLearning-Lecture03.pdf', 'author': '', 'total_pages': 16, 'page': 2}
{'author': '', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'total_pages': 16, 'creator': 'PScript5.dll Version 5.2.2', 'source': 'MachineLearning-Lecture03.pdf', 'title': '', 'moddate': '2008-07-11T11:25:03-07:00', 'page_label': '11', 'page': 10, 'creationdate': '2008-07-11T11:25:03-07:00'}
{'total_pages': 16, 'page': 11, 'author': '', 'creationdate': '2008-07-11T11:25:03-07:00', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'source': 'MachineLearning-Lecture03.pdf', 'page_label': '12', 'moddate': '2008-07-11T11:25:03-07:00', 'creator': 'PScript5.dll Version 5.2.2', 'title': ''}
{'source': 'MachineLearning-Lecture03.pdf', 'total_pages': 16, 'title': '', 'creationdate': '2008-07-11T11:25:0