# Sentence Splitter

Character-based splitting tends to cut sentences midway. Tuning chunk size and overlap reduces the problem, but sentences can still be cut off prematurely. Let's explore an approach that respects sentence boundaries instead.

The **SpacySentenceTokenizer** takes a piece of text and divides it into smaller chunks, each containing a fixed number of sentences (fewer at the end of the text). It uses the spaCy library to analyze the input text and identify individual sentences.

The method lets you control the chunks through the stride and overlap parameters. *The stride sets how many sentences the window advances between consecutive chunks, and the overlap sets how many following sentences are appended to each chunk, so a chunk holds up to `overlap + 1` sentences*.
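To make the two parameters concrete, here is a minimal sketch of the windowing on a list of sentences that has already been split. It assumes a chunk starts every `stride` sentences and holds up to `overlap + 1` sentences; `window_sentences` is a hypothetical helper name used only for illustration, not part of spaCy or LangChain.

```python
from typing import List


def window_sentences(sentences: List[str], stride: int = 1, overlap: int = 0) -> List[str]:
    """Start a chunk every `stride` sentences; each chunk holds up to `overlap + 1` sentences."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    if overlap < 0:
        raise ValueError("overlap must not be negative")
    chunks = []
    for start in range(0, len(sentences), stride):
        # The slice is shorter than overlap + 1 near the end of the list
        window = sentences[start : start + overlap + 1]
        chunks.append(" ".join(window))
    return chunks


sentences = ["A.", "B.", "C.", "D.", "E."]
print(window_sentences(sentences, stride=2, overlap=1))
# ['A. B.', 'C. D.', 'E.']
print(window_sentences(sentences, stride=1, overlap=1))
# ['A. B.', 'B. C.', 'C. D.', 'D. E.', 'E.']
```

Note that consecutive chunks share sentences only when `overlap + 1` is larger than `stride`; with `stride=2, overlap=1` every sentence appears exactly once.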

## Using the spaCy Package Directly

In [1]:
from typing import List, Optional
from langchain_core.documents import Document

import spacy

class SpacySentenceTokenizer:
    def __init__(self, spacy_model="en_core_web_sm"):
        # Try loading the model
        try:
            self.nlp = spacy.load(spacy_model)
        except OSError:
            # If model is not found, download it and load again
            print(f"Downloading model {spacy_model}...")
            spacy.cli.download(spacy_model)
            self.nlp = spacy.load(spacy_model)

    def create_documents(self, documents: List[str], metadatas: Optional[List[dict]] = None, overlap: int = 0, stride: int = 1) -> List[Document]:
        chunks = []
        if not metadatas:
            # One independent dict per document; [{}] * n would share a single dict
            metadatas = [{} for _ in documents]
        for doc, metadata in zip(documents, metadatas):
            # Pass by keyword: split_text takes stride before overlap
            text_chunks = self.split_text(doc, stride=stride, overlap=overlap)
            for chunk_text in text_chunks:
                chunks.append(Document(page_content=chunk_text, metadata=metadata))
        return chunks

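    def split_text(self, text: str, stride: int = 1, overlap: int = 1) -> List[str]:
        """Return chunks of up to `overlap + 1` sentences, one chunk every `stride` sentences."""
        sentences = list(self.nlp(text).sents)
        chunks = []
        # Each chunk starts `stride` sentences after the previous one
        for i in range(0, len(sentences), stride):
            chunk_text = " ".join(str(sent) for sent in sentences[i: i + overlap + 1])
            chunks.append(chunk_text)
        return chunks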

The example below shows how a sentence with a pronoun like “they” needs the previous sentence for context. Our brute-force overlap approach helps here, but it is also redundant in places and leads to longer chunks 🙁

In [2]:
text = "I love dogs. They are amazing. Cats must be the easiest pets around. Tesla robots are advanced now with AI. They will take us to mars."
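To see both the benefit and the redundancy on this text, here is a small sketch that applies the same windowing with `stride=1` and `overlap=1`. The sentences are split by hand (an assumption, so the sketch runs without spaCy installed), and the counting step is added only for illustration.

```python
from collections import Counter

# Sentences from the example text, split by hand so no spaCy model is needed
sentences = [
    "I love dogs.",
    "They are amazing.",
    "Cats must be the easiest pets around.",
    "Tesla robots are advanced now with AI.",
    "They will take us to mars.",
]

stride, overlap = 1, 1
chunks = [
    sentences[i : i + overlap + 1]
    for i in range(0, len(sentences), stride)
]

for chunk in chunks:
    print(" ".join(chunk))

# Count how many chunks each sentence ends up in
appearances = Counter(sentence for chunk in chunks for sentence in chunk)
print(appearances["They are amazing."])  # 2
```

The first chunk keeps "They" next to "dogs", which is what we want. The second chunk, however, pairs "They are amazing." with the unrelated sentence about cats, and every sentence except the first is stored twice.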


In [3]:
text = """Character splitting poses an issue as it tends to cut sentences midway. Despite attempts to address this using chunk size and overlap, sentences can still be cut off prematurely. Let's explore a novel approach that considers sentence boundaries instead.

The **SpacySentenceTokenizer** takes a piece of text and divides it into smaller chunks, with each chunk containing a certain number of sentences. It uses the Spacy library to analyze the input text and identify individual sentences.

The method allows you to control the size of the chunks by specifying the stride and overlap parameters. *The stride determines how many sentences are skipped between consecutive chunks, and the overlap determines how many sentences from the previous chunk are included in the next one*."""

In [None]:
tokenizer = SpacySentenceTokenizer()
split_chunks = tokenizer.split_text(text, stride=3, overlap=2)

print(split_chunks)

## Using LangChain SpacyTextSplitter

In [5]:
text = "I love dogs. They are amazing. Cats must be the easiest pets around. Tesla robots are advanced now with AI. They will take us to mars."

In [None]:
from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=500)

texts = text_splitter.split_text(text)
for text_chunk in texts:
    print("*" * 50)
    print(text_chunk)
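As a rough mental model of what `chunk_size` does here (an assumption about the behaviour, not LangChain's actual code), we can think of the splitter as detecting sentences and then packing whole sentences into a chunk until a character budget is reached. The hypothetical `pack_sentences` helper below sketches that greedy packing on hand-split sentences; the separator and the small `chunk_size` are chosen only to make the effect visible.

```python
from typing import List


def pack_sentences(sentences: List[str], chunk_size: int, separator: str = " ") -> List[str]:
    """Greedily join whole sentences into chunks of at most `chunk_size` characters.

    A single sentence longer than `chunk_size` is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        # Characters this sentence would add, including the separator if needed
        extra = len(sentence) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # Budget exceeded: close the current chunk and start a new one
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(sentence)
        current.append(sentence)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


sentences = [
    "I love dogs.",
    "They are amazing.",
    "Cats must be the easiest pets around.",
]
print(pack_sentences(sentences, chunk_size=40))
# ['I love dogs. They are amazing.', 'Cats must be the easiest pets around.']
```

Under this model, a short text with a large budget such as `chunk_size=500` would end up in a single chunk, so a smaller value is needed to see several chunks.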

In [9]:
# Document loaders live in the langchain_community package in current LangChain releases
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("../docs/Intelizign Leave Policy.pdf")
pages = loader.load()

In [None]:
from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=500)
docs = text_splitter.split_documents(pages)

for doc in docs:
    print("*" * 50)
    print("METADATA:")
    print(doc.metadata)
    print("CONTENT:")
    print(doc.page_content)
    print("*" * 50)
