# Advanced RAG

1 - Load Document
2 - Split Document
3 - Web Scraping
4 - Embedding
5 - Similarity vs MMR

https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStore.html

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

## 1 - Load Document

LangChain provides a document loader for each file type. Two common ones:

1. For PDFs: use PyPDFLoader, which returns one Document per page, with the page number stored in the metadata.

In [3]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("pdfs/sangkuriang.pdf")
loader.load()

[Document(metadata={'source': 'pdfs/sangkuriang.pdf', 'page': 0}, page_content='Sangkuriang Story\nThe legend tells that, long ago, there lived a beautiful woman named Dayang Sumbi, the daughter\nof the king of Sumbing Perbangkara. Her beautiful face made Dayang Sumbi contested by the\nprinces.\nAs a princess from the kingdom, Dayang Sumbi has a weaving hobby. One time, when she was\nbusy weaving cloth, suddenly her loom fell. Instead of taking it herself, Dayang Sumbi said an oath:\nif the one who took the loom were a man, then she would take him as her husba nd, but if the one\nwho took the loom were a woman, she would make her a sister.\nUnexpectedly, sometime later, there came a male dog named Si Tuma ng, which brought Dayang\nSumbi’s loom. Finally, to fulfill her oath, Dayang Sumbi married Tumang (long story short, Tumang\nwas a god who was expelled from heaven). From that marriage, a son named Sangkuria ng was\nborn.\nTime went on until Sangkuriang grew into a handsome boy. One d

2. For text files: use TextLoader, which returns the whole file as a single Document.

In [6]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader('books/sangkuriang.txt')

doc = loader.load()

doc

[Document(metadata={'source': 'books/sangkuriang.txt'}, page_content='Sangkuriang Story\nThe legend tells that, long ago, there lived a beautiful woman named Dayang Sumbi, the daughter of the king of Sumbing Perbangkara. Her beautiful face made Dayang Sumbi contested by the princes.\nAs a princess from the kingdom, Dayang Sumbi has a weaving hobby. One time, when she was busy weaving cloth, suddenly her loom fell. Instead of taking it herself, Dayang Sumbi said an oath: if the one who took the loom were a man, then she would take him as her husband, but if the one who took the loom were a woman, she would make her a sister.\nUnexpectedly, sometime later, there came a male dog named Si Tumang, which brought Dayang Sumbi’s loom. Finally, to fulfill her oath, Dayang Sumbi married Tumang (long story short, Tumang was a god who was expelled from heaven). From that marriage, a son named Sangkuriang was born.\nTime went on until Sangkuriang grew into a handsome boy. One day, Sangkuriang found

## Text Splitter

### 1. Character Text Splitter
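Splits text on a single separator and merges the pieces into chunks of at most chunk_size characters.
With separator="" (as below) it cuts at exact character counts, which gives uniform chunk sizes regardless of content structure but can break words in the middle.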

In [24]:
from langchain.text_splitter import CharacterTextSplitter

# Split the document into chunks
splitter = CharacterTextSplitter(separator="", chunk_size=250, chunk_overlap=0)
splitter.split_documents(doc)

[Document(metadata={'source': 'books/sangkuriang.txt'}, page_content='Sangkuriang Story\nThe legend tells that, long ago, there lived a beautiful woman named Dayang Sumbi, the daughter of the king of Sumbing Perbangkara. Her beautiful face made Dayang Sumbi contested by the princes.\nAs a princess from the kingdom, Dayan'),
 Document(metadata={'source': 'books/sangkuriang.txt'}, page_content='g Sumbi has a weaving hobby. One time, when she was busy weaving cloth, suddenly her loom fell. Instead of taking it herself, Dayang Sumbi said an oath: if the one who took the loom were a man, then she would take him as her husband, but if the one w'),
 Document(metadata={'source': 'books/sangkuriang.txt'}, page_content='ho took the loom were a woman, she would make her a sister.\nUnexpectedly, sometime later, there came a male dog named Si Tumang, which brought Dayang Sumbi’s loom. Finally, to fulfill her oath, Dayang Sumbi married Tumang (long story short, Tumang wa'),
 Document(metadata={'sour
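
The output above shows chunks that end mid-word ("Dayan" / "g Sumbi") because the cut happens at a fixed character count. The following is a minimal pure-Python sketch of that idea, not LangChain's implementation; `split_by_characters` is a hypothetical helper, and its `chunk_size` and `chunk_overlap` parameters only mimic the meaning of the splitter arguments.

```python
def split_by_characters(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into fixed-size character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "Dayang Sumbi has a weaving hobby."
chunks = split_by_characters(sample, chunk_size=10)
overlapped = split_by_characters(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
print(overlapped)
```

With `chunk_overlap=0` the chunks concatenate back to the original text. With a positive overlap, each chunk starts with the last few characters of the previous one, which helps when a fact straddles a chunk boundary.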

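### 2. Sentence Transformers Token Text Splitter
Splits text into chunks by token count, using the tokenizer of a sentence-transformers model (all-mpnet-base-v2 by default).
Despite the name, it does not split at sentence boundaries; its purpose is to keep each chunk within the token window of the embedding model.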

In [None]:
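from langchain.text_splitter import SentenceTransformersTokenTextSplitter

# tokens_per_chunk sets the chunk length in model tokens; chunk_size is not used for token counting here
splitter = SentenceTransformersTokenTextSplitter(tokens_per_chunk=250)
splitter.split_documents(doc)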

### 3. Token Text Splitter
Splits text into chunks by token count, using a tiktoken encoding (gpt2 by default).
Useful for models with strict token limits.

In [27]:
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(chunk_overlap=0, chunk_size=250)
splitter.split_documents(doc)

[Document(metadata={'source': 'books/sangkuriang.txt'}, page_content='Sangkuriang Story\nThe legend tells that, long ago, there lived a beautiful woman named Dayang Sumbi, the daughter of the king of Sumbing Perbangkara. Her beautiful face made Dayang Sumbi contested by the princes.\nAs a princess from the kingdom, Dayang Sumbi has a weaving hobby. One time, when she was busy weaving cloth, suddenly her loom fell. Instead of taking it herself, Dayang Sumbi said an oath: if the one who took the loom were a man, then she would take him as her husband, but if the one who took the loom were a woman, she would make her a sister.\nUnexpectedly, sometime later, there came a male dog named Si Tumang, which brought Dayang Sumbi’s loom. Finally, to fulfill her oath, Dayang Sumbi married Tumang (long story short, Tumang was a god who was expelled from heaven). From that marriage, a son named Sangkuriang was born.\nTime went on until Sangkuriang grew into a handsome boy. One day, Sangkuriang found

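### 4. Recursive Character Text Splitter
Tries a list of separators in order (paragraph breaks, line breaks, spaces, then single characters) until every chunk fits within the character limit.
Balances coherence against the size limit: chunks break at paragraph or word boundaries where possible, never mid-word unless a single word exceeds the limit.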

In [28]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=250, chunk_overlap=0)
splitter.split_documents(doc)

[Document(metadata={'source': 'books/sangkuriang.txt'}, page_content='Sangkuriang Story\nThe legend tells that, long ago, there lived a beautiful woman named Dayang Sumbi, the daughter of the king of Sumbing Perbangkara. Her beautiful face made Dayang Sumbi contested by the princes.'),
 Document(metadata={'source': 'books/sangkuriang.txt'}, page_content='As a princess from the kingdom, Dayang Sumbi has a weaving hobby. One time, when she was busy weaving cloth, suddenly her loom fell. Instead of taking it herself, Dayang Sumbi said an oath: if the one who took the loom were a man, then she would'),
 Document(metadata={'source': 'books/sangkuriang.txt'}, page_content='take him as her husband, but if the one who took the loom were a woman, she would make her a sister.'),
 Document(metadata={'source': 'books/sangkuriang.txt'}, page_content='Unexpectedly, sometime later, there came a male dog named Si Tumang, which brought Dayang Sumbi’s loom. Finally, to fulfill her oath, Dayang Sumbi mar
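
To make the recursive strategy concrete, here is a simplified pure-Python sketch. It is an illustration under stated assumptions, not LangChain's code: `recursive_split` is a hypothetical function, it ignores chunk overlap, and it drops empty pieces produced by repeated separators. The separator order `"\n\n"`, `"\n"`, `" "`, `""` matches the splitter's documented defaults.

```python
from typing import Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split on the coarsest separator first, falling back to finer ones (hypothetical helper)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at the character limit
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces from repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


result = recursive_split("aaa bbb ccc\n\nddd eee", chunk_size=7)
print(result)
```

In this example the paragraph break is tried first; the first paragraph is still too long, so it is split again on spaces, while the second paragraph fits and is kept whole.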

## Web Scraping

In [22]:
from langchain_community.document_loaders import WebBaseLoader

# WebBaseLoader loads web pages and extracts their content
urls = ["https://www.apple.com/"]

# Create a loader for web content
loader = WebBaseLoader(urls)
apple_doc = loader.load()

apple_doc[0].page_content

"\n\n\n\n\n\n\n\n\n\n\n\n\nApple\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nApple\n\nAppleStoreMaciPadiPhoneWatchVisionAirPodsTV & HomeEntertainmentAccessoriesSupport\n\n\n0+\n\n\n\n\n\n\n\n\n\n\t\t\t\t\t\t\t\t\tFor a limited time, shop your state's eligible products tax‑free — online and\xa0in‑store. Learn\xa0more\n\n\n\n\n\n\n\n\n\n\n\xa0\n\nApple Vision Pro\nYou’ve never seen everything like this before.\n\nLearn more\nBuy\n\nStream Napoleon on Apple\xa0TV+\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\xa0\n\n\xa0\n\n\n\n\n\nBuy Mac or\xa0iPad for\xa0college\n\n\n\n\n\n\n\n\n\n\nwith education savings\n\n\n\n\n\n\n\nGet a gift card up\xa0to\xa0$150*\n\n\n\n\n\n\n\n\nOnly at the Apple\xa0Store\n\n\n\n\nShop\n\n\n\n\n\n\nBuy Mac or\xa0iPad for\xa0college\n\n\n\n\n\n\n\n\n\n\nwith education savings\n\n\n\n\n\n\n\nGet AirPods with\xa0Mac*\n\n\n\n\n\n\n\n\nApple\xa0Pencil with\xa0iPad*\n\n\n\n\n\n\n\n\nOnly at the Apple\xa0Store\n\n\n\n\nShop\n\n\n \n\n\n\n\n\xa0\n\nMacBook\xa0Air\nLean. Mean. M3 mac

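The raw page_content above is dominated by runs of newlines, tabs, and non-breaking spaces (`\xa0`), which waste space in every chunk. A minimal cleanup sketch follows; `normalize_whitespace` is a hypothetical helper, not part of LangChain, and the sample string is a shortened stand-in for the scraped text.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse whitespace noise left over from HTML extraction (hypothetical helper)."""
    text = text.replace("\xa0", " ")        # non-breaking space -> regular space
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)  # drop empty lines


raw = "\n\n\nApple\n\n\n\t\t\tFor a limited time\xa0only\n\n\xa0\n\nApple Vision Pro\n"
cleaned = normalize_whitespace(raw)
print(cleaned)
```

The same function could be applied to `apple_doc[0].page_content` before the document is passed to a text splitter.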
## YouTube Scraping

## Embedding Deep Dive

## Retriever Deep Dive

## Modify Vector DB

In [None]:
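# Assumes documents (a list of Document chunks) and vectorstore were created in earlier cells
# Use each chunk's position as a string ID so it can be updated or deleted later
ids = [str(i) for i in range(len(documents))]

vectorstore.add_documents(documents=documents, ids=ids)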
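
Positional IDs such as "0", "1", "2" collide as soon as a second file is added. One possible alternative, sketched here as an assumption rather than a recommendation from the source, is to derive a deterministic UUID from the source path and chunk position using the standard library. `make_chunk_ids` is a hypothetical helper.

```python
import uuid


def make_chunk_ids(source: str, count: int) -> list[str]:
    """Build one deterministic UUID per chunk from its source and position (hypothetical helper)."""
    return [str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#{i}")) for i in range(count)]


ids = make_chunk_ids("books/sangkuriang.txt", 3)
print(ids)
```

The resulting list can be passed as `ids=` to `vectorstore.add_documents` as in the cell above. Because the IDs are reproducible, re-ingesting the same file yields the same IDs; whether the store then overwrites or rejects the existing entries depends on the vector store implementation.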