## Level 4: Semantic Chunking <a id="SemanticChunking"></a>
Isn't it weird that we have a global constant for chunk size? Isn't it even weirder that our normal chunking mechanisms don't take into account the actual content?

I'm not the only one who thinks so.

<!-- <div style="text-align: center;">
    <img src="static/SemanticChunkingtweet.png" style="max-width:50%; height:auto;"><br>
    <span><i><a href="https://twitter.com/thesephist/status/1724159343237456248?s=46">Source</a></i></span>
</div> -->

There has to be a better way - let's explore and find out.

Embeddings represent the semantic meaning of a string. They don't do much on their own, but when compared to embeddings of other texts you can start to infer the relationship between chunks. I want to lean into this property and explore using embeddings to find clusters of semantically similar texts.
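To make "comparing embeddings" concrete, here is a minimal sketch of cosine similarity, the usual way to compare two embedding vectors. The three vectors are toy values I made up for illustration, not output from a real model, and `cosine_similarity` is a hypothetical helper rather than anything from the code used later.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values, not from a real model)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(cat, kitten)   # related texts: similarity near 1
sim_far = cosine_similarity(cat, invoice)    # unrelated texts: similarity near 0
print(sim_close > sim_far)
```

A single vector tells you little; it is the relative similarity between pairs that carries the signal.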

The hypothesis is that semantically similar chunks should be held together.

I tried a few methods:
1) **Hierarchical clustering with positional reward** - I wanted to see how hierarchical clustering of sentence embeddings would do. But because I chose to split on sentences, there was an issue with short sentences that follow a long one. You know? (like this last sentence). They could change the meaning of a chunk, so I added a positional reward that made clusters more likely to form between sentences that sit next to each other. This ended up being ok, but tuning the parameters was slow and the results were suboptimal.
2) **Find break points between sequential sentences** - Next up I tried a walk method. I started at the first sentence, got the embedding, then compared it to sentence #2, then compared #2 and #3 and so on. I was looking for "break points" where embedding distance was large. If it was above a threshold, then I considered it the start of a new semantic section. I originally tried taking embeddings of every sentence, but this turned out to be too noisy. So I ended up taking groups of 3 sentences (a window), then got an embedding, then dropped the first sentence, and added the next one. This worked out a bit better.
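The walk in method #2 can be sketched roughly as follows. This is a simplified illustration under my own assumptions: the embeddings are hand-written 2-dimensional toy vectors, the fixed `threshold=0.3` is arbitrary, and `find_break_points` and `split_at_break_points` are hypothetical names, not the functions used in the cells below.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; larger means the two vectors are less alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def find_break_points(embeddings, threshold):
    """Indices i where the distance between item i and item i+1 exceeds threshold."""
    return [
        i for i in range(len(embeddings) - 1)
        if cosine_distance(embeddings[i], embeddings[i + 1]) > threshold
    ]

def split_at_break_points(sentences, break_points):
    """Join sentences into chunks, starting a new chunk after each break point."""
    chunks, start = [], 0
    for i in break_points:
        chunks.append(" ".join(sentences[start:i + 1]))
        start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks

sentences = ["Cats purr.", "Kittens play.", "Invoices are due.", "Pay by Friday."]
# Toy embeddings (hypothetical): the first two point one way, the last two another
embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

breaks = find_break_points(embeddings, threshold=0.3)
chunks = split_at_break_points(sentences, breaks)
print(chunks)
```

With these toy values the only large jump is between the second and third sentence, so the text splits into two chunks there.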

I'll show method #2 here. It's not perfect by any means, but it's a good starting point for an exploration, and I'd love to hear how you think it could be improved.

First, let's load up our essay that we'll run through. I'm just doing a single essay here to keep the tokens down.

We'll be using Paul Graham's [MIT essay](https://paulgraham.com/mit.html).

Great, now that we have our sentences, I want to combine the sentence before and after so that we reduce noise and capture more of the relationships between sequential sentences.

Let's create a function so we can use it again. The `buffer_size` is configurable so you can select how big of a window you want. Keep this number in mind for the later steps. I'll just use `buffer_size=1` for now.
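A minimal sketch of what such a function could look like, assuming the sentences are already available as a list of strings. The name `combine_sentences` and its signature are my own illustration here; the version inside the `SemChunk` module may differ.

```python
def combine_sentences(sentences, buffer_size=1):
    """For each sentence, join it with up to `buffer_size` neighbours on each side."""
    combined = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)                    # clamp at the first sentence
        end = min(len(sentences), i + buffer_size + 1)     # clamp at the last sentence
        combined.append(" ".join(sentences[start:end]))
    return combined

sentences = ["A.", "B.", "C.", "D."]
windows = combine_sentences(sentences, buffer_size=1)
print(windows)
```

With `buffer_size=1` each window holds up to three sentences, and the windows at the two ends are shorter because there is no neighbour on one side.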

In [1]:
import spacy
import torch
from SemChunk import Document
from SemChunk_backupv3 import Document as Document_backup
import numpy as np
import torch.nn.functional as F

from transformers import AutoTokenizer, AutoModel



In [2]:
with open('mit.txt') as file:
    document = file.read()

In [3]:
# Load the all-MiniLM-L6-v2 tokenizer and model from a local copy of the HuggingFace checkpoint
tokenizer = AutoTokenizer.from_pretrained('D:\\DSAI\\Pre-Trained Models\\all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('D:\\DSAI\\Pre-Trained Models\\all-MiniLM-L6-v2')



---
Method with optimised cosine-distance calculation

---

In [4]:
doc = Document(document)

In [5]:
chunks = doc.get_semantic_chunks(tokenizer=tokenizer, model=model, starting_threshold=95, step=5)

__get_sentences_timing: 904.0005207061768ms
__get_token_length_timing: 27.00042724609375ms
__combine_sentences_timing: 1.0001659393310547ms
__get_embeddings_timing: 5683.032751083374ms
__combine_sentence_embeddings_timing: 0.0ms
__calculate_cosine_distances_timing: 6.001710891723633ms
95 4362
__get_optimal_chunks_timing: 9.99903678894043ms
get_semantic_chunks_timing: 6635.033369064331ms


---
Original method, without the optimised cosine-distance calculation

---

In [None]:
model()

In [6]:
doc_backup = Document_backup(document)

In [10]:
chunks_backup = doc_backup.get_semantic_chunks(tokenizer=tokenizer, model=model, starting_threshold=95, step=5)

__get_sentences_timing: 775.3374576568604ms
__get_token_length_timing: 24.0020751953125ms
__combine_sentences_timing: 1.001596450805664ms
__get_embeddings_timing: 5565.690040588379ms
__combine_sentence_embeddings_timing: 0.0ms
__calculate_cosine_distances_timing: 227.00095176696777ms
95 4362
__get_optimal_chunks_timing: 7.999658584594727ms
get_semantic_chunks_timing: 6605.029344558716ms
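The two runs differ mainly in the `__calculate_cosine_distances_timing` line. I don't show the internals of either `Document` class here, so the following is only a sketch of one common way to get that kind of speedup: replacing a Python loop over consecutive pairs with a single vectorised NumPy expression. Both function names are hypothetical, and the random matrix stands in for real embeddings (384 columns, to match the output size of all-MiniLM-L6-v2).

```python
import numpy as np

def cosine_distances_loop(embeddings):
    """Distance between each row and the next, computed one pair at a time."""
    distances = []
    for i in range(len(embeddings) - 1):
        a, b = embeddings[i], embeddings[i + 1]
        sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        distances.append(1.0 - sim)
    return np.array(distances)

def cosine_distances_vectorized(embeddings):
    """Same result, but normalise all rows once and multiply shifted slices."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = np.sum(normed[:-1] * normed[1:], axis=1)  # row i dotted with row i+1
    return 1.0 - sims

# Stand-in data (assumption): 200 random vectors instead of real sentence embeddings
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))

loop_result = cosine_distances_loop(embeddings)
vec_result = cosine_distances_vectorized(embeddings)
print(np.allclose(loop_result, vec_result))
```

Both versions return one distance per pair of neighbouring rows; the vectorised one just avoids the per-pair Python overhead.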


In [11]:
6800.036430358887/6635.033369064331

1.0248684599031375

In [12]:
6800.036430358887/3399

2.000599126319178
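As a quick sanity check on where the time actually goes, the per-step timings printed by the two runs can be compared directly. The numbers below are copied from the output above; the dictionary layout and variable names are just my own way of organising them.

```python
# Timings in milliseconds, copied from the two printed runs
optimised = {
    "cosine": 6.001710891723633,
    "embeddings": 5683.032751083374,
    "total": 6635.033369064331,
}
original = {
    "cosine": 227.00095176696777,
    "embeddings": 5565.690040588379,
    "total": 6605.029344558716,
}

# How much faster the cosine-distance step became
cosine_speedup = original["cosine"] / optimised["cosine"]

# Fraction of the optimised run spent computing embeddings
embedding_share = optimised["embeddings"] / optimised["total"]

print(f"cosine step speedup: {cosine_speedup:.1f}x")
print(f"embedding share of total: {embedding_share:.1%}")
```

The cosine-distance step gets roughly 38 times faster, but embeddings account for about 86% of the total, so the end-to-end time of these two printed runs differs by less than 1%. Any further speedup would have to come from the embedding step.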