# Semantic Splitter

The **SimilarSentenceSplitter** divides a piece of text into groups of sentences based on their similarity. It first uses a sentence splitter to break the input text into individual sentences, then uses a similarity model to measure how similar each sentence is to its neighboring sentences.

The goal is to create groups of sentences where each group contains related sentences, according to the specified similarity model. The method starts with the first sentence in the first group and then iterates through the remaining sentences. It decides whether to add a sentence to the current group based on its similarity to the previous sentence.

The **group_max_sentences** parameter controls the maximum number of sentences allowed in each group. If a group reaches this limit, a new group is started. A new group is also started if the similarity between consecutive sentences falls below the specified **similarity_threshold**.

In simpler terms, this method organizes a text into clusters of sentences, where sentences within each cluster are considered similar to each other. It's useful for identifying coherent and related chunks of information within a larger body of text.
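
The grouping rule described above can be sketched in plain Python. This is only an illustration of the idea, not the package's actual code: `group_sentences` and `jaccard_similarity` are hypothetical names, word overlap stands in for a real similarity model, and the default values for `similarity_threshold` and `group_max_sentences` are assumptions.

```python
from typing import Callable, List


def jaccard_similarity(a: str, b: str) -> float:
    """Toy stand-in for a similarity model: word overlap between two sentences."""
    words_a = {w.strip(".,!?").lower() for w in a.split()}
    words_b = {w.strip(".,!?").lower() for w in b.split()}
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def group_sentences(
    sentences: List[str],
    similarity: Callable[[str, str], float] = jaccard_similarity,
    similarity_threshold: float = 0.2,  # assumed default, for illustration only
    group_max_sentences: int = 5,       # assumed default, for illustration only
) -> List[List[str]]:
    """Greedily group consecutive sentences that are similar to their predecessor."""
    if not sentences:
        return []
    groups = [[sentences[0]]]  # the first sentence opens the first group
    for previous, current in zip(sentences, sentences[1:]):
        group_is_full = len(groups[-1]) >= group_max_sentences
        is_related = similarity(previous, current) >= similarity_threshold
        if group_is_full or not is_related:
            groups.append([current])    # start a new group
        else:
            groups[-1].append(current)  # extend the current group
    return groups


sentences = [
    "Dogs are loyal pets.",
    "Cats are independent pets.",
    "Rockets fly into space.",
    "Rockets need fuel to fly.",
]
groups = group_sentences(sentences)
print(groups)
```

With the toy similarity, the two pet sentences land in one group and the two rocket sentences in another. Swapping in an embedding-based similarity changes the scores but not the control flow.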


**Related Repo:** [https://github.com/agamm/semantic-split](https://github.com/agamm/semantic-split). A local copy is also available [here](../02_semantic_splitting/), but this notebook uses the packaged version.

## Using the semantic_split Module

In [None]:
from semantic_split import SimilarSentenceSplitter, SentenceTransformersSimilarity, SpacySentenceSplitter

text = """
  Dogs are amazing.
  Cats must be the easiest pets around.
  Lion is a ferocious animal.
  Rose is a beautiful flower.
  Robots are advanced now with AI.
  Flying in space can only be done by Artificial intelligence."""

model = SentenceTransformersSimilarity()
sentence_splitter = SpacySentenceSplitter()
splitter = SimilarSentenceSplitter(model, sentence_splitter)
res = splitter.split(text)
res

We use part of LangChain's implementation to load the PDF, and set up our own document-creation and embedding functions so that an open-source option can be used as well.

In [None]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("../docs/Intelizign Leave Policy.pdf")
pages = loader.load()

pages[:2]

In [4]:
from typing import List, Optional

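def create_documents(texts: List[str], metadatas: Optional[List[dict]] = None) -> List[List[str]]:
    """Split every text into semantic chunks; each chunk is a list of related sentences."""
    # Relies on the `splitter` created earlier; `metadatas` is accepted but not used yet.
    semantic_chunks = []
    for text in texts:
        semantic_chunks.extend(splitter.split(text))
    return semantic_chunks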
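
Since `create_documents` receives `metadatas`, one possible extension is to keep each chunk paired with the metadata of the page it came from. The sketch below is an assumption about how that could look, not part of the notebook: `create_chunk_records`, the `split` argument, and `toy_split` are hypothetical names, and the splitter is injected so the example runs without any model.

```python
from typing import Callable, Dict, List, Optional


def create_chunk_records(
    texts: List[str],
    split: Callable[[str], List[List[str]]],
    metadatas: Optional[List[dict]] = None,
) -> List[Dict]:
    """Return one record per semantic chunk, carrying a copy of the source metadata."""
    metadatas = metadatas or [{} for _ in texts]
    if len(metadatas) != len(texts):
        raise ValueError("texts and metadatas must have the same length")
    records = []
    for text, metadata in zip(texts, metadatas):
        for sentences in split(text):
            records.append({"sentences": sentences, "metadata": dict(metadata)})
    return records


def toy_split(text: str) -> List[List[str]]:
    """Stand-in splitter: every sentence becomes its own one-sentence chunk."""
    return [[part.strip() + "."] for part in text.split(".") if part.strip()]


records = create_chunk_records(
    ["First sentence. Second sentence."],
    toy_split,
    metadatas=[{"page": 0}],
)
print(records)
```

In the notebook, `splitter.split` could be passed as `split` in place of the stand-in.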

In [9]:
# Embed the chunks (placeholder: returns None until an embedding backend is wired in)
def embed_chunks(chunks: List[str]) -> List[List[float]]:
    pass
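
One way the placeholder could be filled in is to inject the encoder, so any embedding backend can be plugged in. This is a minimal sketch under that assumption: `embed_chunks_with`, `Encoder`, and `toy_encode` are hypothetical names, and the toy encoder only exists to make the example runnable without downloading a model.

```python
from typing import Callable, List, Sequence

# Any callable that maps a list of sentences to one vector per sentence.
Encoder = Callable[[List[str]], Sequence[Sequence[float]]]


def embed_chunks_with(chunks: List[str], encode: Encoder) -> List[List[float]]:
    """Embed each sentence of one semantic chunk using the supplied encoder."""
    if not chunks:
        return []
    # Normalise whatever the encoder returns (tuples, arrays) to lists of floats.
    return [[float(x) for x in vector] for vector in encode(chunks)]


def toy_encode(sentences: List[str]) -> List[tuple]:
    """Stand-in encoder: a 2-d vector of (character count, word count)."""
    return [(len(s), len(s.split())) for s in sentences]


vectors = embed_chunks_with(["Dogs bark.", "Cats purr loudly."], toy_encode)
print(vectors)
```

With the sentence-transformers package installed, the `encode` method of a `SentenceTransformer` model (for example `all-MiniLM-L6-v2`) could be passed instead of `toy_encode`. That choice of model is a suggestion, not something this notebook prescribes.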

In [None]:
texts, metadatas = [], []
for doc in pages:
    texts.append(doc.page_content)
    metadatas.append(doc.metadata)

semantic_chunks = create_documents(texts, metadatas=metadatas)
print(semantic_chunks[10:20])

# Create embeddings of the contextually relevant sentences.
embedded_chunks = [embed_chunks(item) for item in semantic_chunks]
print(embedded_chunks[:5])

## Using the LangChain Semantic Splitter

In [None]:
from dotenv import load_dotenv
env_loaded = load_dotenv(".env")

print(f"Env loaded: {env_loaded}")

In [None]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("../docs/Intelizign Leave Policy.pdf")
pages = loader.load()

pages[:2]

In [None]:
from langchain_openai.embeddings import AzureOpenAIEmbeddings

# The Azure endpoint, API key, and API version are expected to come from the environment loaded above.
azure_openai_embeddings = AzureOpenAIEmbeddings(model="text-embedding-3-small")

embedding_arr = azure_openai_embeddings.embed_query(text="This is a test call.")

embedding_arr[:20]

In [None]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_text_splitter = SemanticChunker(azure_openai_embeddings)

docs = semantic_text_splitter.split_documents(pages)

# Print every chunk with its metadata for a quick visual check.
for doc in docs:
    print("*" * 50)
    print("METADATA:")
    print(doc.metadata)
    print("CONTENT:")
    print(doc.page_content)
    print("*" * 50)

### Further Exploration

The breakpoint threshold type can be configured through the `breakpoint_threshold_type` argument, which accepts `percentile`, `standard_deviation`, `interquartile`, or `gradient`.

Refer to the documentation [here](https://python.langchain.com/docs/how_to/semantic-chunker/#breakpoints) for detailed information along with the examples.
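
To make the percentile option more concrete, here is a rough sketch of the idea: measure the cosine distance between consecutive sentence embeddings and split wherever the distance exceeds a chosen percentile of all distances. This is an approximation written for illustration, not LangChain's code. The function names are hypothetical, the 95th-percentile default is an assumption, and the embeddings are hand-made 2-d vectors.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(position), math.ceil(position)
    if lower == upper:
        return ordered[lower]
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def percentile_breakpoints(embeddings: Sequence[Sequence[float]], q: float = 95.0) -> List[int]:
    """Return indexes i where a split falls between sentence i and sentence i + 1."""
    distances = [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]
    if not distances:
        return []
    threshold = percentile(distances, q)
    return [i for i, d in enumerate(distances) if d > threshold]


# Two sentences pointing one way, then two pointing another way.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
breakpoints = percentile_breakpoints(embeddings)
print(breakpoints)
```

Here the only large distance is between the second and third vectors, so the single breakpoint falls after index 1. The other threshold types follow the same pattern and differ only in how the threshold is derived from the distances.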