### ***Various Techniques in Semantic Chunking*** 

***What is Semantic Chunking?*** <br>
Semantic chunking breaks information into *meaningful*, *manageable pieces* or *"chunks"* *based on their meaning*, rather than on arbitrary divisions such as a fixed length or formatting.  <br>
It improves the understanding of a text by dividing it into units that each represent a distinct concept.




### ***1. Install and Import Libraries***

In [1]:
%%capture
! pip install -qU semantic-chunkers
! pip install -qU datasets==2.19.1
! pip install -qU langchain 
! pip install -qU pypdf
! pip install -qU langchain-community

In [3]:
from datasets import load_dataset
from semantic_router.encoders import HuggingFaceEncoder
from semantic_chunkers import StatisticalChunker
from semantic_chunkers import ConsecutiveChunker
from semantic_chunkers import CumulativeChunker
from langchain_community.document_loaders import PyPDFLoader

### ***2. Loading the Data***
***I am working with a set of Kubernetes notes that I made in PDF format. I load them with the PyPDFLoader from LangChain.***

In [4]:
file_path = (
    "/home/wassim/Downloads/KubernetesNotes.pdf"
)
loader = PyPDFLoader(file_path)
pages = loader.load_and_split()

pages[0]

Document(metadata={'source': '/home/wassim/Downloads/KubernetesNotes.pdf', 'page': 0}, page_content="Kubernetes For Everyone\nKubernetes introduction and features\nHow Kubernetes works?\nIn Kubernetes, there is a master node and multiple worker nodes, each worker node can handle\nmultiple pods.\nPods are just a bunch of containers clustered together as a working unit. You can start designing\nyour applications using pods.\nOnce your pods are ready, you can specify pod definitions to the master node, and how many you\nwant to deploy. From this point, Kubernetes is in control.\nIt takes the pods and deploys them to the worker nods. If a worker node goes down, Kubernetes\nstarts new pods on a functioning worker node.\nThis makes the process of managing the containers easy and simple.\nIt makes it easy to build and add more features and improving the application to attain higher\ncustomer satisfaction.\nFinally, no matter what technology you're invested in, Kubernetes can help you.\nImage 

In [5]:
pages[:]

[Document(metadata={'source': '/home/wassim/Downloads/KubernetesNotes.pdf', 'page': 0}, page_content="Kubernetes For Everyone\nKubernetes introduction and features\nHow Kubernetes works?\nIn Kubernetes, there is a master node and multiple worker nodes, each worker node can handle\nmultiple pods.\nPods are just a bunch of containers clustered together as a working unit. You can start designing\nyour applications using pods.\nOnce your pods are ready, you can specify pod definitions to the master node, and how many you\nwant to deploy. From this point, Kubernetes is in control.\nIt takes the pods and deploys them to the worker nods. If a worker node goes down, Kubernetes\nstarts new pods on a functioning worker node.\nThis makes the process of managing the containers easy and simple.\nIt makes it easy to build and add more features and improving the application to attain higher\ncustomer satisfaction.\nFinally, no matter what technology you're invested in, Kubernetes can help you.\nImage

In [6]:
# Take the text content of each page and combine it into one large string (content)
content_list = []
for page in pages:
    content_list.append(page.page_content)
content = ''.join(content_list)
len(content)


31747

In [7]:
print(content)

Kubernetes For Everyone
Kubernetes introduction and features
How Kubernetes works?
In Kubernetes, there is a master node and multiple worker nodes, each worker node can handle
multiple pods.
Pods are just a bunch of containers clustered together as a working unit. You can start designing
your applications using pods.
Once your pods are ready, you can specify pod definitions to the master node, and how many you
want to deploy. From this point, Kubernetes is in control.
It takes the pods and deploys them to the worker nods. If a worker node goes down, Kubernetes
starts new pods on a functioning worker node.
This makes the process of managing the containers easy and simple.
It makes it easy to build and add more features and improving the application to attain higher
customer satisfaction.
Finally, no matter what technology you're invested in, Kubernetes can help you.
Image credits: Source: Knoldus IncWhat is the Master node and Worker node in #Kubernetes?
Explained below,
#Containerizati
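
In the printed text above, "Knoldus Inc" runs straight into "What is the Master node", which suggests that two pages were fused by the empty-string join. The sketch below uses two hypothetical stand-in strings (not the real `page_content` values) to show how joining with a newline keeps such a boundary visible. Note that switching the separator would also change the reported length of `content`.

```python
# Hypothetical stand-ins for the text of two consecutive pages.
page_texts = [
    "Image credits: Source: Knoldus Inc",
    "What is the Master node and Worker node in #Kubernetes?",
]

# Joining with an empty string fuses the last word of one page
# with the first word of the next.
fused = "".join(page_texts)

# Joining with a newline keeps the page boundary in the text.
separated = "\n".join(page_texts)

print(fused)
print(separated)
```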

### ***3. Loading the Embedding Model (all-MiniLM-L6-v2)***

In [8]:
! pip install semantic-router[local]

Defaulting to user installation because normal site-packages is not writeable


In [9]:
import torch 
encoder = HuggingFaceEncoder(name="sentence-transformers/all-MiniLM-L6-v2")

### ***4. Various Chunking Techniques***

> ### ***Statistical Chunking***
This technique automatically calculates a threshold for chunking based on the similarity between different sections of the text, which makes it practical for large-scale document processing without manual tuning.
<br>
**Scenario:**

You have a research paper that contains various sections discussing different topics related to machine learning. You want to chunk the document into smaller sections for processing, where each section represents a coherent topic.

Instead of manually deciding where to split the text, you use Statistical Chunking to determine the most appropriate splits based on the local similarity of sentences or paragraphs. The chunker dynamically identifies points in the text where the similarity between adjacent chunks drops below a calculated threshold, suggesting a natural break in the topic.
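
To make the idea of a calculated threshold concrete, here is a deliberately simplified sketch. It is not the algorithm used by `StatisticalChunker`: the rule (mean minus `k` standard deviations), the function name `split_points`, and the similarity scores are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def split_points(similarities, k=1.0):
    """Derive a threshold from the scores themselves and return the break indices.

    similarities[i] is the similarity between piece i and piece i + 1.
    A break is placed wherever the score falls below mean - k * std.
    """
    threshold = mean(similarities) - k * pstdev(similarities)
    breaks = [i for i, score in enumerate(similarities) if score < threshold]
    return threshold, breaks


# Hypothetical similarity scores between adjacent sentences.
sims = [0.82, 0.79, 0.21, 0.85, 0.80, 0.77, 0.12, 0.81]
threshold, breaks = split_points(sims)

print(f"threshold={threshold:.2f}, breaks after pieces {breaks}")
```

Because the threshold is computed from the distribution of scores, the same code adapts to encoders whose similarity values sit in different ranges.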

In [10]:
chunker = StatisticalChunker(
    encoder=encoder,
    #min_split_tokens=200,
    #max_split_tokens=500,
)

In [11]:
chunks = chunker(docs=[content])

2024-10-19 16:11:35 INFO semantic_chunkers.utils.logger Single document exceeds the maximum token limit of 300. Splitting to sentences before semantically merging.
100%|██████████| 9/9 [00:02<00:00,  3.68it/s]


In [12]:
chunker.print(chunks[0])

Split 1, tokens 133, triggered by: 0.21
Kubernetes For Everyone Kubernetes introduction and features How Kubernetes works? In Kubernetes, there is a master node and multiple worker nodes, each worker node can handle multiple pods. Pods are just a bunch of containers clustered together as a working unit. You can start designing your applications using pods. Once your pods are ready, you can specify pod definitions to the master node, and how many you want to deploy. From this point, Kubernetes is in control. It takes the pods and deploys them to the worker nods. If a worker node goes down, Kubernetes starts new pods on a functioning worker node. This makes the process of managing the containers easy and simple.
----------------------------------------------------------------------------------------


Split 2, tokens 123, triggered by: 0.12
It makes it easy to build and add more features and improving the application to attain higher customer satisfaction. Finally, no matte

> ### ***Consecutive Chunking***
Consecutive Chunking is a simple technique that compares the embedding of each split with the embedding of the split that follows it and starts a new chunk wherever their similarity falls below a fixed `score_threshold` (0.2 in the cell below). It is efficient and easy to implement, but the threshold has to be chosen by hand rather than calculated from the content. <br>
**Advantages:** Simple, only one comparison per pair of neighbouring splits, suitable for large-scale data <br>
**Disadvantages:** The threshold needs manual tuning for the encoder in use; because only adjacent splits are compared, a single dissimilar line can produce a very small chunk or cut a sentence in two (see Split 2 in the output below)
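
The `score_threshold` passed to `ConsecutiveChunker` below is compared against the similarity of neighbouring splits. A minimal sketch of that idea follows; it is not the library's implementation, and the function names, the two-dimensional vectors, and the example sentences are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consecutive_chunks(splits, embeddings, score_threshold):
    """Start a new chunk whenever a split is too dissimilar from the one before it."""
    if not splits:
        return []
    chunks = [[splits[0]]]
    for i in range(1, len(splits)):
        if cosine(embeddings[i - 1], embeddings[i]) < score_threshold:
            chunks.append([])  # similarity dropped: open a new chunk
        chunks[-1].append(splits[i])
    return chunks


# Hypothetical splits with made-up 2-D embeddings.
splits = [
    "Pods group containers.",
    "Pods run on worker nodes.",
    "Image credits go to the source.",
]
embeddings = [[0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]

result = consecutive_chunks(splits, embeddings, score_threshold=0.5)
print(result)
```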

In [13]:
chunker = ConsecutiveChunker(
    encoder=encoder, 
    score_threshold=0.2  
)

In [14]:
chunks = chunker(docs=[content])

100%|██████████| 9/9 [00:01<00:00,  4.60it/s]
100%|██████████| 569/569 [00:00<00:00, 37972.30it/s]


In [15]:
chunker.print(chunks[0])

Split 1, tokens None, triggered by: 0.11
Kubernetes For Everyone Kubernetes introduction and features How Kubernetes works? In Kubernetes, there is a master node and multiple worker nodes, each worker node can handle multiple pods. Pods are just a bunch of containers clustered together as a working unit.
----------------------------------------------------------------------------------------


Split 2, tokens None, triggered by: 0.17
You can start designing
----------------------------------------------------------------------------------------


Split 3, tokens None, triggered by: 0.18
your applications using pods. Once your pods are ready, you can specify pod definitions to the master node, and how many you
----------------------------------------------------------------------------------------


Split 4, tokens None, triggered by: 0.12
want to deploy. From this point, Kubernetes is in control. It takes the pods and deploys them to the worker nods. If 

In [19]:
print(chunks)

[[Chunk(splits=['Kubernetes For Everyone', 'Kubernetes introduction and features', 'How Kubernetes works?', 'In Kubernetes, there is a master node and multiple worker nodes, each worker node can handle', 'multiple pods.', 'Pods are just a bunch of containers clustered together as a working unit.'], is_triggered=True, triggered_score=0.10919749985378029, token_count=None, metadata=None), Chunk(splits=['You can start designing'], is_triggered=True, triggered_score=0.17153231394583307, token_count=None, metadata=None), Chunk(splits=['your applications using pods.', 'Once your pods are ready, you can specify pod definitions to the master node, and how many you'], is_triggered=True, triggered_score=0.18204440036315453, token_count=None, metadata=None), Chunk(splits=['want to deploy.', 'From this point, Kubernetes is in control.', 'It takes the pods and deploys them to the worker nods.', 'If a worker node goes down, Kubernetes', 'starts new pods on a functioning worker node.', 'This makes th
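
The printed structure above is a list with one inner list of `Chunk` objects per input document, and each `Chunk` carries its sentences in `splits`. The sketch below shows one way to turn such a list into plain strings. The `Chunk` dataclass here is only a stand-in that mirrors the fields visible in the output, and `chunk_texts` is a hypothetical helper, not part of the library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """Stand-in mirroring the fields shown in the printed output above."""
    splits: List[str]
    is_triggered: bool = False
    triggered_score: Optional[float] = None
    token_count: Optional[int] = None
    metadata: Optional[dict] = None


def chunk_texts(doc_chunks):
    """Join the splits of each chunk into a single string."""
    return [" ".join(chunk.splits) for chunk in doc_chunks]


# Two chunks copied from the printed output, scores rounded.
example = [
    Chunk(splits=["You can start designing"], is_triggered=True, triggered_score=0.17),
    Chunk(
        splits=[
            "your applications using pods.",
            "Once your pods are ready, you can specify pod definitions to the master node, and how many you",
        ],
        is_triggered=True,
        triggered_score=0.18,
    ),
]

texts = chunk_texts(example)
for text in texts:
    print(text)
```

With the real objects from the notebook, the equivalent call would be `chunk_texts(chunks[0])`, assuming the attributes are accessible under the names shown in the output.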

> ### ***Cumulative Chunking***
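Cumulative Chunking is a technique where the current chunk grows progressively as the text is processed. The splits gathered so far are concatenated and embedded, and that running embedding is compared with the next split; when the similarity falls below `score_threshold`, the current chunk is closed and a new one starts. This method is useful when you want each decision to retain the context from earlier parts of the chunk while adding new information. The cost is that the growing text is re-embedded at every step, which makes it slower than Consecutive Chunking.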
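
A rough sketch of chunking against an accumulated context is shown below. It is not the `CumulativeChunker` implementation: the bag-of-words `embed` function is a toy stand-in for a real encoder, and the function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real encoder)."""
    return Counter(text.lower().replace(".", "").split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cumulative_chunks(splits, score_threshold):
    """Compare the text accumulated in the current chunk with the next split."""
    if not splits:
        return []
    chunks = [[splits[0]]]
    for next_split in splits[1:]:
        running_text = " ".join(chunks[-1])  # everything gathered so far
        if cosine(embed(running_text), embed(next_split)) < score_threshold:
            chunks.append([next_split])  # too dissimilar: start a new chunk
        else:
            chunks[-1].append(next_split)
    return chunks


# Hypothetical splits for illustration.
splits = [
    "Pods group containers",
    "Pods run on worker nodes",
    "Worker nodes host pods",
    "Image credits belong elsewhere",
]

result = cumulative_chunks(splits, score_threshold=0.2)
print(result)
```

The running text is rebuilt and re-embedded on every iteration, which is why this approach does more work per split than comparing neighbours only.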


In [20]:
chunker = CumulativeChunker(
    encoder=encoder, 
    score_threshold=0.2
)

In [21]:
chunks = chunker(docs=[content])

100%|██████████| 570/570 [00:07<00:00, 73.60it/s]


In [22]:
chunker.print(chunks[0])

Split 1, tokens None, triggered by: 0.11
Kubernetes For Everyone Kubernetes introduction and features How Kubernetes works? In Kubernetes, there is a master node and multiple worker nodes, each worker node can handle multiple pods. Pods are just a bunch of containers clustered together as a working unit.
----------------------------------------------------------------------------------------


Split 2, tokens None, triggered by: 0.17
You can start designing
----------------------------------------------------------------------------------------


Split 3, tokens None, triggered by: 0.09
your applications using pods. Once your pods are ready, you can specify pod definitions to the master node, and how many you want to deploy. From this point, Kubernetes is in control. It takes the pods and deploys them to the worker nods. If a worker node goes down, Kubernetes starts new pods on a functioning worker node. This makes the process of managing the containers easy and 