![](images/rag-arch-high.png)

#### Preparing data for retrieval 
![](images/rag-stages.png)

![](images/lcel-workflow.png)

![](images/chunk-size.png)

Chunk size can be measured as a number of characters or a number of tokens. The two are not interchangeable, because the number of characters in a token can vary.
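To make the difference concrete, here is a small sketch that counts characters and tokens for the same string. The `toy_tokenize` function is an assumption made for illustration (a regex that treats each word and each punctuation mark as one token). Real model tokenizers split text differently, but the point that token length varies still holds.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer (an assumption): each word or punctuation mark is one token."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Chunking splits long documents, e.g. manuals, into retrievable pieces."
tokens = toy_tokenize(sample)

char_count = len(sample)                      # size measured in characters
token_count = len(tokens)                     # size measured in (toy) tokens
token_lengths = [len(t) for t in tokens]      # characters per token

print(f"characters: {char_count}")
print(f"tokens:     {token_count}")
print(f"characters per token: min {min(token_lengths)}, max {max(token_lengths)}")
```

The same text gives two very different sizes, so a chunk_size of 100 means something different depending on which unit the splitter counts.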

#### Character text splitter

In [2]:
from langchain_text_splitters import CharacterTextSplitter
# splitting text into chunks
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=100,
    chunk_overlap=10
)
text = '''This is a longer paragraph. It contains multiple sentences to demonstrate how the text splitter works. 
# The text splitter will break this paragraph into smaller chunks based on the specified chunk size and overlap. 
# This is useful for processing large texts in manageable pieces. Let's see how it performs with this example.'''
chunks = text_splitter.split_text(text)
print(chunks)
# Use a list comprehension: printing a bare generator expression only shows "<generator object ...>"
print([len(chunk) for chunk in chunks])

Created a chunk of size 103, which is longer than the specified 100
Created a chunk of size 113, which is longer than the specified 100


['This is a longer paragraph. It contains multiple sentences to demonstrate how the text splitter works.', '# The text splitter will break this paragraph into smaller chunks based on the specified chunk size and overlap.', "# This is useful for processing large texts in manageable pieces. Let's see how it performs with this example."]
[102, 112, 110]
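The "longer than the specified 100" warnings above appear because a splitter with a single separator has nowhere else to cut when one line is longer than chunk_size. The sketch below is a simplified, plain-Python illustration of that behaviour, not LangChain's actual implementation; the function name `split_on_separator` is hypothetical and overlap handling is left out to keep it short.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Greedy sketch: split on one separator, then pack pieces into chunks."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size is kept whole:
            # with only one separator there is nowhere else to cut.
            current = piece
    if current:
        chunks.append(current)
    return chunks


demo = "short line\n" + "x" * 120 + "\nanother short line\nand one more"
demo_chunks = split_on_separator(demo, separator="\n", chunk_size=100)
print([len(c) for c in demo_chunks])
```

The 120-character line ends up as its own oversized chunk, while the short lines around it are packed together.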


#### Recursive character text splitter

In [3]:

from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    separators=["\n", ".", "!", "?", ";"],
    chunk_size=100,
    chunk_overlap=10
)
chunks = splitter.split_text(text)
print(chunks)
# Use a list comprehension: printing a bare generator expression only shows "<generator object ...>"
print([len(chunk) for chunk in chunks])

['This is a longer paragraph', '. It contains multiple sentences to demonstrate how the text splitter works.', '\n# The text splitter will break this paragraph into smaller chunks based on the specified chunk size and overlap', '.', '# This is useful for processing large texts in manageable pieces', ". Let's see how it performs with this example."]
[26, 76, 112, 1, 64, 46]
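The idea behind recursive splitting is to try the separators in order and only fall back to the next one for pieces that are still too long. Below is a minimal plain-Python sketch of that idea. It is an assumption-laden simplification (the name `recursive_split` is hypothetical, separators are dropped at cut points, and there is no overlap), so its output will not match RecursiveCharacterTextSplitter exactly.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Sketch of recursive splitting: try separators in order, coarsest first."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks: list[str] = []
    buffer = ""
    for piece in text.split(separator):
        if len(piece) > chunk_size:
            # Flush what we have, then split the oversized piece with the next separator.
            if buffer:
                chunks.append(buffer)
                buffer = ""
            chunks.extend(recursive_split(piece, remaining, chunk_size))
            continue
        candidate = piece if not buffer else buffer + separator + piece
        if len(candidate) <= chunk_size:
            buffer = candidate
        else:
            chunks.append(buffer)
            buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks


demo = ("First paragraph. It has two sentences.\n\n"
        "Second paragraph is a single long sentence without breaks")
demo_chunks = recursive_split(demo, separators=["\n\n", ". ", " "], chunk_size=40)
for chunk in demo_chunks:
    print(len(chunk), repr(chunk))
```

The first paragraph fits as a whole, while the second one is too long for the paragraph and sentence separators and is only broken up at the word level.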


#### Splitting pdf documents into chunks

In [11]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-5.0.1-py3-none-any.whl (294 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.0.1


In [None]:
from langchain_community.document_loaders import PyPDFLoader
pdf_loader = PyPDFLoader(file_path='../../../data/pilot-manual-787.pdf')
documents = pdf_loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
# the only change is calling split_documents() instead of split_text()
chunks = splitter.split_documents(documents)
print(chunks)
# Document objects have no len(); measure the text in page_content instead
print([len(chunk.page_content) for chunk in chunks])
print(f"Total number of chunks: {len(chunks)}")

[Document(metadata={'source': 'data/pilot-manual-787.pdf', 'page': 0}, page_content='PMDG 7 37 0.00.1 \n  TUTORIAL FLIGHT  \n \nFor Simulator Use Only  DO NOT DUPLICATE  JULY 2022  \n \n \n \n \n \nPMDG 737 \n \n \nTutorial Flight  \n \nCopyright © 2011-2022 \nPMDG Simulations  \nAll Rights Reserved'), Document(metadata={'source': 'data/pilot-manual-787.pdf', 'page': 1}, page_content='0.00.2 PMDG 7 37  \nTUTORIAL FLIGHT    \n \n JULY 2022 DO NOT DUPLICATE  For Simulator Use Only  \nDISCLAIMER AND COPYRIGHT INFORMATION   \nThis manual was compiled for use only with the PMDG 737  simulation  for \nMicrosoft Flight Simulator . The information contained within this manual is derived \nfrom multiple sources and is not subject to revision or checking for accuracy. This \nmanual is not to be used for training or familiarity with any aircraft. This manual is \nnot assumed to pro vide operating procedures for use on any aircraft and is written \nfor entertainment purposes.  \n \nIt is a violati

#### Loading and chunking Python files without context

In [11]:
from langchain_community.document_loaders import PythonLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language
loader = PythonLoader('chunking.py')
python_data = loader.load()
print(python_data[0])
# chunking without context
python_splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=10) 

page_content='# from langchain_text_splitters import CharacterTextSplitter

# # splitting text into chunks
# text_splitter = CharacterTextSplitter(
#     separator="\n",
#     chunk_size=100,
#     chunk_overlap=10
# )
# text = '''This is a longer paragraph. It contains multiple sentences to demonstrate how the text splitter works. 
# The text splitter will break this paragraph into smaller chunks based on the specified chunk size and overlap. 
# This is useful for processing large texts in manageable pieces. Let's see how it performs with this example.'''
# chunks = text_splitter.split_text(text)
# print (chunks)
# print (len(chunk) for chunk in chunks)

# from langchain_text_splitters import RecursiveCharacterTextSplitter
# splitter = RecursiveCharacterTextSplitter(
#     separators=["\n", ".", "!", "?", ";"],
#     chunk_size=100,
#     chunk_overlap=10
# )
# # chunks = splitter.split_text(text)
# # print (chunks)
# # print (len(chunk) for chunk in chunks)

# # splitting documents i

#### Chunking with context

In [15]:
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language = Language.PYTHON, chunk_size=150, chunk_overlap=10)
chunks = python_splitter.split_documents(python_data)
for i, chunk in enumerate (chunks[:3]):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")

Chunk 1:
# loading pdf file. 
import openai
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings

Chunk 2:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

Chunk 3:
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI 
from langchain_core.runnables import RunnablePassthrough



#### Advanced splitting methods 
1. The previous splits are not context-based: they ignore the context of the surrounding text.
2. The splits are made using characters rather than tokens. If we split documents using characters rather than tokens, we risk retrieving chunks and creating a retrieval prompt that exceeds the maximum amount of text the model can process at once, also called the model's context window. All language models break text into tokens, which are smaller units of text, for processing.
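The risk in point 2 comes down to simple arithmetic: the retrieved chunks, the rest of the prompt, and the space reserved for the answer all have to fit into the context window together. Here is a small sketch of that check. The function name `fits_context_window` and all the numbers are hypothetical and only meant to illustrate the calculation.

```python
def fits_context_window(chunk_token_counts: list[int], prompt_overhead: int,
                        answer_budget: int, context_window: int) -> bool:
    """Return True if chunks + prompt text + reserved answer space fit in the window."""
    total = sum(chunk_token_counts) + prompt_overhead + answer_budget
    return total <= context_window


# Hypothetical numbers, all measured in tokens
CONTEXT_WINDOW = 8_000    # assumed model limit
PROMPT_OVERHEAD = 300     # instructions plus the user question
ANSWER_BUDGET = 1_000     # space reserved for the model's reply

small_chunks = [500] * 4      # four retrieved chunks of 500 tokens each
large_chunks = [2_500] * 4    # four retrieved chunks of 2,500 tokens each

print(fits_context_window(small_chunks, PROMPT_OVERHEAD, ANSWER_BUDGET, CONTEXT_WINDOW))
print(fits_context_window(large_chunks, PROMPT_OVERHEAD, ANSWER_BUDGET, CONTEXT_WINDOW))
```

This check only works if the chunk sizes are known in tokens, which is why splitting on tokens makes the budget predictable.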

![](images/tokens-chunking.png)

A chunk is a group of characters or tokens. In the figure above, there are 5 tokens in one chunk and 4 tokens in the next. The 4 is actually an overlap, because the number of tokens in a chunk, i.e. the chunk size, is constant.

#### Splitting on tokens
The TokenTextSplitter can be used to perform token splitting. It requires the name of the encoding to use, which should be the encoding used by the large language model. We can retrieve it using the tiktoken.encoding_for_model() method and extract the name with the .name attribute. Remember, chunk_size and chunk_overlap now represent tokens rather than characters. We'll use the .split_text() method to split example_string and view the chunks.

In [3]:
import tiktoken
from langchain_text_splitters import TokenTextSplitter
example_string = "Mary had a little lamb, its fleece was white as snow."

encoding = tiktoken.encoding_for_model('gpt-4o-mini')
splitter = TokenTextSplitter(encoding_name=encoding.name, chunk_size=10, chunk_overlap=2)
chunks = splitter.split_text(example_string)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: \n no. of tokens : {len(encoding.encode(chunk))}\n{chunk}\n")


Chunk 1:
Mary had a little lamb, its fleece was white

Chunk 2:
 was white as snow.

Chunk 1: 
 no. of tokens : 10
Mary had a little lamb, its fleece was white

Chunk 2: 
 no. of tokens : 5
 was white as snow.
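The token counts above (10 and 5, with 2 tokens shared) are what a sliding window with a step of chunk_size - chunk_overlap produces over 13 tokens. The sketch below shows that windowing logic on placeholder token IDs. It is a simplified illustration, not TokenTextSplitter's source code, and the name `token_windows` as well as the assumption of 13 tokens are mine.

```python
def token_windows(tokens: list, chunk_size: int, chunk_overlap: int) -> list[list]:
    """Slide a fixed-size window over the tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
        start += step
    return windows


token_ids = list(range(13))  # placeholder IDs standing in for real tokens
windows = token_windows(token_ids, chunk_size=10, chunk_overlap=2)
print([len(w) for w in windows])
print(windows[0][-2:], windows[1][:2])  # the shared (overlapping) tokens
```

The last window is shorter simply because the text runs out; every other window has exactly chunk_size tokens.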



#### Semantic splitting 
Semantic splitting is a method that splits text into chunks based on the meaning of the text rather than its length. This method is useful when the text contains multiple sentences or paragraphs that are related to each other. An embedding model is used to detect where the meaning shifts between neighbouring sentences, and the text is split at those boundaries so that related sentences stay in the same chunk.

In [16]:
!pip install rank_bm25 

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank-bm25
Successfully installed rank-bm25-0.2.2


In [17]:
import os 
from langchain_openai import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain_core.documents import Document 
openai_api_key = os.environ.get('OPENAI_API_KEY')
text = """Mary had a little lamb, its fleece was white as snow. And everywhere that Mary went, the lamb was sure to go. It followed her to school one day, which was against the rule. It made the children laugh and play to see a lamb at school. And so the teacher turned it out, but still it lingered near, And waited patiently about till Mary did appear. "Why does the lamb love Mary so?" the eager children cry. "Why, Mary loves the lamb, you know." the teacher did reply.  """
embeddings = OpenAIEmbeddings(api_key=openai_api_key, model='text-embedding-3-small')
semantic_splitter = SemanticChunker(
    embeddings=embeddings, 
    breakpoint_threshold_type="gradient", 
    breakpoint_threshold_amount=0.8)
document = Document(page_content=text)
chunks = semantic_splitter.split_documents([document])
print (chunks[0])

page_content='Mary had a little lamb, its fleece was white as snow.'
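To see the idea without calling an embedding API, here is a toy sketch: each sentence gets a vector, we measure the cosine distance between neighbouring sentences, and we start a new chunk wherever the distance jumps above a threshold. The two-dimensional vectors and the fixed threshold are made up for illustration. SemanticChunker uses real embeddings and its own breakpoint rules (such as the "gradient" type above), so treat this only as a rough picture of the mechanism.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; close to 0 for similar directions, larger for different ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(sentences: list[str], vectors: list[list[float]],
                    threshold: float) -> list[str]:
    """Start a new chunk whenever neighbouring sentences are farther apart than threshold."""
    chunks = [[sentences[0]]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev_vec, vec) > threshold:
            chunks.append([sentence])      # meaning shifted: open a new chunk
        else:
            chunks[-1].append(sentence)    # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "Mary had a little lamb.",
    "Its fleece was white as snow.",
    "Python was first released in 1991.",
    "It is popular for machine learning.",
]
# Hand-made toy vectors (an assumption): the first two point one way, the last two another.
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

for chunk in semantic_chunks(sentences, toy_vectors, threshold=0.5):
    print(chunk)
```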


#### Optimizing document retrieval 
![](images/opti-rag.png)
![](images/sparse-dense-ret.png)

#### Sparse retrieval methods 
TF-IDF and BM25 are two popular methods for encoding text as sparse vectors. TF-IDF, or Term Frequency-Inverse Document Frequency, creates a sparse vector that measures a term's frequency in a document and its rarity in other documents. This helps in identifying words that best represent the document's unique content. BM25, or Best Matching 25, is an improvement on TF-IDF that prevents high-frequency words from being over-emphasized in the encoding. Let's try out BM25 for RAG retrieval.
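As a rough illustration of what a sparse TF-IDF vector looks like, here is a plain-Python sketch. It uses one common textbook weighting (relative term frequency times log(N / document frequency)) with whitespace tokenization. Libraries use slightly different formulas and smoothing, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Return one sparse vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["the cat sat", "the dog barked", "the cat purred"]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Each vector only stores the terms that occur in its document, which is what makes it sparse. A word that appears in every document ("the") gets a weight of zero, while a word unique to one document ("sat") gets the highest weight.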

#### BM25Retriever - a sparse retrieval technique. 
The BM25Retriever class can be used to create a retriever from documents or text, just like the retrievers we have already used. Let's start with a small example of three statements about Python. We can use the .from_texts() method to create the retriever from these strings. The k value sets the number of items returned by the retriever when invoked.

In [19]:
from langchain_community.retrievers import BM25Retriever
chunks = ["Python was created by Guido van Rossum and first released in 1991.",
          "Python is a popular language for machine learning",
          "The PyTorch library is a popular library for AI ML"]
bm25_retriever = BM25Retriever.from_texts(chunks, k=3)
results = bm25_retriever.invoke ("When was Python created?")
print ("Most relevant document:")
print (results[0].page_content)

Most relevant document:
Python was created by Guido van Rossum and first released in 1991.
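To get a feeling for why the first statement wins, here is a minimal plain-Python sketch of BM25 scoring on the same three statements. It uses a common form of the Okapi BM25 formula with assumed parameters k1=1.5 and b=0.75, a non-negative IDF variant, and a simple regex tokenizer. The rank_bm25 package behind BM25Retriever differs in details such as tokenization and IDF handling, so the exact scores will not match.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only word characters (an assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with a basic Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            freq = counts[term]
            if freq == 0:
                continue  # query term not in this document
            # Rare terms get a higher IDF; the +1 keeps the value non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Longer-than-average documents are penalised slightly.
            length_norm = 1 - b + b * len(tokens) / avg_len
            # The term-frequency part saturates, so repeated words are not over-emphasized.
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        scores.append(score)
    return scores


statements = [
    "Python was created by Guido van Rossum and first released in 1991.",
    "Python is a popular language for machine learning",
    "The PyTorch library is a popular library for AI ML",
]
scores = bm25_scores("When was Python created?", statements)
ranked = sorted(zip(scores, statements), reverse=True)
for score, statement in ranked:
    print(f"{score:.3f}  {statement}")
```

The first statement matches three query terms ("was", "python", "created"), the second only matches "python", and the third matches none, so it scores zero.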


#### BM25 in RAG
