## Level 4: Semantic Chunking <a id="SemanticChunking"></a>
Isn't it weird that we have a global constant for chunk size? Isn't it even weirder that our normal chunking mechanisms don't take into account the actual content?

I'm not the only one who thinks so.

<!-- <div style="text-align: center;">
    <img src="static/SemanticChunkingtweet.png" style="max-width:50%; height:auto;"><br>
    <span><i><a href="https://twitter.com/thesephist/status/1724159343237456248?s=46">Source</a></i></span>
</div> -->

There has to be a better way - let's explore and find out.

Embeddings represent the semantic meaning of a string. They don't do much on their own, but when compared to embeddings of other texts you can start to infer the relationship between chunks. I want to lean into this property and explore using embeddings to find clusters of semantically similar texts.
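
To make "compared to embeddings of other texts" concrete, here's a minimal sketch of cosine similarity on made-up vectors. The three tiny "embeddings" and the helper name `cosine_similarity_1d` are assumptions for illustration only; real vectors from the model used below have 384 dimensions.

```python
import numpy as np

def cosine_similarity_1d(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-d "embeddings", not real model output.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

sim_related = cosine_similarity_1d(cat, kitten)     # close to 1
sim_unrelated = cosine_similarity_1d(cat, invoice)  # close to 0
```

The number on its own means little; it's the comparison between pairs that tells you which texts belong together.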

The hypothesis is that semantically similar chunks should be held together.

I tried a few methods:
1) **Hierarchical clustering with positional reward** - I wanted to see how hierarchical clustering of sentence embeddings would do. But because I chose to split on sentences, there was an issue with short sentences that follow a long one. You know? (Like this last sentence.) They could change the meaning of a chunk, so I added a positional reward that made clusters more likely to form between adjacent sentences. This ended up being OK, but tuning the parameters was slow and suboptimal.
2) **Find break points between sequential sentences** - Next up I tried a walk method. I started at the first sentence, got the embedding, then compared it to sentence #2, then compared #2 and #3 and so on. I was looking for "break points" where embedding distance was large. If it was above a threshold, then I considered it the start of a new semantic section. I originally tried taking embeddings of every sentence, but this turned out to be too noisy. So I ended up taking groups of 3 sentences (a window), then got an embedding, then dropped the first sentence, and added the next one. This worked out a bit better.
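
The break-point idea in method #2 can be sketched in a few lines. This is a simplified illustration, assuming you already have one distance per adjacent pair of sentences and a fixed threshold; the function name `split_at_breakpoints` and the numbers are hypothetical, not taken from the notebook.

```python
def split_at_breakpoints(sentences, distances, threshold):
    """Group sentences into chunks, starting a new chunk after every gap
    whose distance exceeds `threshold`.

    distances[i] is the distance between sentence i and sentence i + 1,
    so len(distances) == len(sentences) - 1.
    """
    if len(distances) != len(sentences) - 1:
        raise ValueError("expected one distance per adjacent sentence pair")
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large jump in meaning: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

sentences = ["A1.", "A2.", "B1.", "B2.", "B3."]
distances = [0.08, 0.45, 0.07, 0.10]  # made-up values
chunks = split_at_breakpoints(sentences, distances, threshold=0.3)
# chunks == ["A1. A2.", "B1. B2. B3."]
```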

I'll show method #2 here. It's not perfect by any means, but it's a good starting point for exploration, and I'd love to hear how you think it could be improved.

First, let's load up our essay that we'll run through. I'm just doing a single essay here to keep the tokens down.

We'll be using Paul Graham's [MIT essay](https://paulgraham.com/mit.html).

Great, now that we have our sentences, I want to combine each one with the sentence before and after it so that we reduce noise and capture more of the relationships between sequential sentences.

Let's create a function so we can use it again. The `buffer_size` is configurable so you can select how big of a window you want. Keep this number in mind for the later steps. I'll just use `buffer_size=1` for now.

In [1]:
import spacy
import torch
import SemChunk as sc
from SemChunk import Document
import numpy as np
import torch.nn.functional as F

from transformers import AutoTokenizer, AutoModel


In [2]:
with open('mit.txt') as file:
    document = file.read()

In [3]:
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('D:\\DSAI\\Pre-Trained Models\\all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('D:\\DSAI\\Pre-Trained Models\\all-MiniLM-L6-v2')



In [4]:
doc = Document(document)

In [5]:
chunks = doc.get_semantic_chunks(tokenizer=tokenizer, model=model, starting_threshold=95, step=5)

__get_sentences_timing: 1099.0056991577148ms
__get_token_length_timing: 25.99954605102539ms
__combine_sentences_timing: 0.0ms
__get_embeddings_timing: 5604.042291641235ms
__combine_sentence_embeddings_timing: 0.0ms
__calculate_cosine_distances_timing: 4.999876022338867ms
95 4362
__get_optimal_chunks_timing: 7.998943328857422ms
get_semantic_chunks_timing: 6746.0479736328125ms
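
The `starting_threshold=95, step=5` arguments read like a percentile cut-off that gets lowered until the chunks are small enough. That is only my guess at the intent; the sketch below is hypothetical and is not SemChunk's actual logic. The names `chunk_token_lengths`, `pick_percentile`, and `max_tokens`, and all the numbers, are assumptions.

```python
import numpy as np

def chunk_token_lengths(token_lengths, distances, threshold):
    """Sum sentence token lengths per chunk, splitting where distance > threshold."""
    totals, running = [], token_lengths[0]
    for length, distance in zip(token_lengths[1:], distances):
        if distance > threshold:
            totals.append(running)
            running = 0
        running += length
    totals.append(running)
    return totals

def pick_percentile(token_lengths, distances, max_tokens, start=95, step=5):
    """Lower the percentile cut-off until every chunk fits in `max_tokens`.

    Hypothetical: a guess at what `starting_threshold` and `step` might control.
    """
    for percentile in range(start, -1, -step):
        threshold = np.percentile(distances, percentile)
        lengths = chunk_token_lengths(token_lengths, distances, threshold)
        if max(lengths) <= max_tokens:
            return percentile, lengths
    return None, None  # even the lowest cut-off leaves an oversized chunk

token_lengths = [10, 12, 8, 15, 9, 11, 7, 14, 6]                 # made-up token counts
distances = [0.10, 0.50, 0.20, 0.40, 0.15, 0.30, 0.05, 0.25]     # made-up distances
percentile, lengths = pick_percentile(token_lengths, distances, max_tokens=30)
```

A lower percentile means more break points, so chunks get smaller as the loop steps down.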


In [92]:
# from sklearn.metrics.pairwise import cosine_similarity

In [93]:
# A = np.array([[1,2,3], [2,3,4], [5,6,7]])
# B = np.array([[1,2,3], [1,2,3], [1,1,1]])

In [176]:
# cosine_sim_1 = """
# def cos_sim(anchor_array, comparison_array):   
#     return cosine_similarity(anchor_array, comparison_array)

# """

In [177]:
# np.mean(timeit.repeat(stmt=cosine_sim_1, number=1000000, repeat=30))

In [178]:
# cosine_sim_2 = """
# def cos_sim(anchor_array, comparison_array):   
#     # return np.dot(comparison_array, anchor_array)/(norm(comparison_array, axis=1)*norm(anchor_array))
#     # print (np.dot(comparison_array, anchor_array))
#     # print (norm(comparison_array, axis=1))
#     # print (norm(anchor_array, axis=1))
#     return np.dot(comparison_array, anchor_array)/(norm(comparison_array, axis=1)*norm(anchor_array, axis=1))

# """

In [179]:
# np.mean(timeit.repeat(stmt=cosine_sim_2, number=1000000, repeat=30))

---
scikit-learn's `cosine_similarity` is already of comparable speed, so there's no point optimising it.

---
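
One thing to watch when benchmarking with `timeit`: if the statement string only contains a `def`, you time the function definition rather than the call. Here's a minimal sketch that times the call itself, using a NumPy-only cosine similarity so it runs without scikit-learn. The function and the small arrays are illustrative assumptions, and the repeat counts are kept low on purpose.

```python
import timeit
import numpy as np

def cos_sim(anchor, comparisons):
    """Cosine similarity of one vector against each row of a 2-D array."""
    return comparisons @ anchor / (
        np.linalg.norm(comparisons, axis=1) * np.linalg.norm(anchor)
    )

anchor = np.array([1.0, 2.0, 3.0])
comparisons = np.array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])

# The timed statement is the call, not the definition.
timings = timeit.repeat(
    stmt=lambda: cos_sim(anchor, comparisons),
    number=10_000,
    repeat=5,
)
best = min(timings)  # seconds per 10,000 calls; the minimum is the least noisy
```

Swapping the lambda for a call to another implementation gives a like-for-like comparison on the same inputs.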

In [180]:
import spacy

# Load in nlp model
nlp = spacy.load("en_core_web_sm")

def get_sentences(document):
    return [str(sentence) for sentence in nlp(document).sents]

In [181]:
def get_token_length(sentences, tokenizer):
    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    return torch.sum(encoded_input["attention_mask"], axis=1)

In [182]:
def combine_sentences(single_sentences_list, buffer_size=1):
    indexed_sentences = [{'sentence': x, 'index' : i} for i, x in enumerate(single_sentences_list)]
    # Go through each sentence dict
    for i in range(len(indexed_sentences)):

        # Create a string that will hold the sentences which are joined
        combined_sentence = ''

        # Add sentences before the current one, based on the buffer size.
        for j in range(i - buffer_size, i):
            # Check if the index j is not negative (to avoid index out of range like on the first one)
            if j >= 0:
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += indexed_sentences[j]['sentence'] + ' '

        # Add the current sentence
        combined_sentence += indexed_sentences[i]['sentence']

        # Add sentences after the current one, based on the buffer size
        for j in range(i + 1, i + 1 + buffer_size):
            # Check if the index j is within the range of the sentences list
            if j < len(indexed_sentences):
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += ' ' + indexed_sentences[j]['sentence']

        # Then add the whole thing to your dict
        # Store the combined sentence in the current sentence dict
        indexed_sentences[i]['combined_sentence'] = combined_sentence
        # sentences[i]['combined_sentences_indexes'] = [i-buffer_size, i, i+buffer_size] # Don't need the indexes anymore

    return indexed_sentences
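
To sanity-check what the window does, here's a compact slicing-based sketch that should produce the same combined strings for a toy input. `window_sentences` is a hypothetical helper for illustration; it returns plain strings rather than the dicts used above.

```python
def window_sentences(sentences, buffer_size=1):
    """Join each sentence with up to `buffer_size` neighbours on each side."""
    combined = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)                   # clamp at the first sentence
        end = min(len(sentences), i + buffer_size + 1)    # clamp at the last sentence
        combined.append(" ".join(sentences[start:end]))
    return combined

windows = window_sentences(["S0.", "S1.", "S2.", "S3."], buffer_size=1)
# windows == ["S0. S1.", "S0. S1. S2.", "S1. S2. S3.", "S2. S3."]
```

Note that the first and last windows are shorter because there's nothing to borrow on one side.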

In [183]:
def get_embeddings(sentences, model, tokenizer):
    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    respective_token_len = torch.sum(encoded_input["attention_mask"], axis=1)

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

    # return sentence_embeddings, respective_token_len

    # Map into dictionary for ease of future lookup
    emb_token_pair = [{'embedding': emb, 'token_length': length} for emb, length in zip(sentence_embeddings, respective_token_len)]

    return dict(enumerate(emb_token_pair))

In [267]:
def combine_sentence_embeddings(combined_sentences, embeddings):
    for i, sentence in enumerate(combined_sentences):
        sentence['combined_sentence_embedding'] = embeddings[i]['embedding']
        sentence['token_length'] = embeddings[i]["token_length"]

    return combined_sentences

In [268]:
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_distances(sentences):
    distances = []
    for i in range(len(sentences) - 1):
        embedding_current = sentences[i]['combined_sentence_embedding']
        embedding_next = sentences[i + 1]['combined_sentence_embedding']
        
        # Calculate cosine similarity
        similarity = cosine_similarity([embedding_current], [embedding_next])[0][0]
        
        # Convert to cosine distance
        distance = 1 - similarity

        # Append cosine distance to the list
        distances.append(distance)

        # Store distance in the dictionary
        sentences[i]['distance_to_next'] = distance

    # Optionally handle the last sentence
    # sentences[-1]['distance_to_next'] = None  # or a default value

    return distances, sentences

In [188]:
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
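
If the masking in `mean_pooling` is hard to picture, here's a NumPy analogue on a tiny made-up batch. It's a sketch of the same idea, not a drop-in replacement for the torch function, and `mean_pool` is a hypothetical name.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len) of 0/1
    """
    mask = attention_mask[..., None].astype(float)    # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # padded tokens contribute 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

# One sentence with three token slots; the third slot is padding.
tokens = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)
# pooled == [[2.0, 2.0]]: the padded 100s are ignored
```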

In [184]:
single_sentences_list = get_sentences(document)

In [185]:
respective_sentence_length = get_token_length(single_sentences_list, tokenizer)

In [186]:
combined_sentences = combine_sentences(single_sentences_list)

In [269]:
embeddings = get_embeddings([x['combined_sentence'] for x in combined_sentences], model, tokenizer)

In [270]:
combined_sentences = combine_sentence_embeddings(combined_sentences, embeddings)

In [271]:
distances, _ = calculate_cosine_distances(combined_sentences)

In [272]:
distances

[0.08057284355163574,
 0.1375684142112732,
 0.17561519145965576,
 0.06512320041656494,
 0.18760502338409424,
 0.1476793885231018,
 0.16683268547058105,
 0.18630290031433105,
 0.15812575817108154,
 0.07983028888702393,
 0.08562672138214111,
 0.1622338891029358,
 0.07978606224060059,
 0.18610018491744995,
 0.06589436531066895,
 0.09110867977142334,
 0.45044267177581787,
 0.2242092490196228,
 0.32570064067840576,
 0.071067214012146,
 0.18791961669921875,
 0.10814130306243896,
 0.09627401828765869,
 0.23668622970581055,
 0.2781423330307007,
 0.11351317167282104,
 0.2664569020271301,
 0.27265340089797974,
 0.2593122720718384,
 0.2754741311073303,
 0.09199053049087524,
 0.32527297735214233,
 0.7885842025279999,
 0.3085896968841553,
 0.1764904260635376,
 0.5832135677337646,
 0.38720273971557617,
 0.3229341506958008,
 0.3504432439804077,
 0.17462849617004395,
 0.20303332805633545,
 0.272402286529541,
 0.21797770261764526,
 0.41525721549987793,
 0.12195217609405518,
 0.39996421337127686,
 0.183

In [253]:
# embeddings[0]['embedding']

In [254]:
# embeddings[1]['embedding']

In [204]:
embeddings[1]['embedding'].size()

torch.Size([384])

In [235]:
# torch.resize(embeddings[1]['embedding'])

In [233]:
# torch.cat((embeddings[0]['embedding'].reshape(-1,1), embeddings[1]['embedding'].reshape(-1,1)), axis=1)

In [244]:
all_embeddings = [embeddings[index]['embedding'].reshape(-1,1) for index in embeddings]

In [245]:
all_embeddings = torch.cat(all_embeddings, axis=1).T

In [252]:
torch.eq(embeddings[277]['embedding'], all_embeddings[277])

tensor([True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, Tr

In [265]:
cosine_similarity(all_embeddings, all_embeddings)[277][278]

0.7833367

In [279]:
cosine_similarity([embeddings[1]['embedding']], [embeddings[2]['embedding']])

array([[0.8624316]], dtype=float32)

In [284]:
check_dist = 1-cosine_similarity(all_embeddings, all_embeddings)
check_dist_values = check_dist.diagonal(1)
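
Since only the first off-diagonal is needed, the full N x N similarity matrix can be skipped. A minimal sketch, assuming the rows are already L2-normalised (as they are after `F.normalize`); `adjacent_cosine_distances` is a hypothetical helper and the vectors are made up.

```python
import numpy as np

def adjacent_cosine_distances(embeddings):
    """Cosine distance between each row and the next, for unit-length rows."""
    embeddings = np.asarray(embeddings, dtype=float)
    # For L2-normalised vectors, cosine similarity is just the dot product.
    similarities = np.sum(embeddings[:-1] * embeddings[1:], axis=1)
    return 1.0 - similarities

rows = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])  # made-up unit vectors
fast = adjacent_cosine_distances(rows)
full = (1.0 - rows @ rows.T).diagonal(1)  # same values via the full matrix
```

This does O(N) dot products instead of O(N^2), which starts to matter on long documents.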

In [291]:
check_dist_values.tolist()

[0.08057254552841187,
 0.13756811618804932,
 0.17561495304107666,
 0.06512308120727539,
 0.18760502338409424,
 0.14767926931381226,
 0.16683244705200195,
 0.18630284070968628,
 0.15812557935714722,
 0.07983016967773438,
 0.08562678098678589,
 0.16223376989364624,
 0.07978606224060059,
 0.18610018491744995,
 0.0658942461013794,
 0.09110856056213379,
 0.4504425525665283,
 0.22420907020568848,
 0.32570070028305054,
 0.0710674524307251,
 0.18791961669921875,
 0.10814177989959717,
 0.09627443552017212,
 0.23668622970581055,
 0.27814245223999023,
 0.11351287364959717,
 0.2664564251899719,
 0.27265334129333496,
 0.2593120336532593,
 0.2754738926887512,
 0.09199070930480957,
 0.3252732753753662,
 0.7885842323303223,
 0.3085898756980896,
 0.17649054527282715,
 0.5832134485244751,
 0.3872026205062866,
 0.32293426990509033,
 0.35044336318969727,
 0.17462831735610962,
 0.20303308963775635,
 0.2724020481109619,
 0.2179776430130005,
 0.41525715589523315,
 0.12195241451263428,
 0.3999640941619873,
 0

In [286]:
np.array(distances)

array([0.08057284, 0.13756841, 0.17561519, 0.0651232 , 0.18760502,
       0.14767939, 0.16683269, 0.1863029 , 0.15812576, 0.07983029,
       0.08562672, 0.16223389, 0.07978606, 0.18610018, 0.06589437,
       0.09110868, 0.45044267, 0.22420925, 0.32570064, 0.07106721,
       0.18791962, 0.1081413 , 0.09627402, 0.23668623, 0.27814233,
       0.11351317, 0.2664569 , 0.2726534 , 0.25931227, 0.27547413,
       0.09199053, 0.32527298, 0.7885842 , 0.3085897 , 0.17649043,
       0.58321357, 0.38720274, 0.32293415, 0.35044324, 0.1746285 ,
       0.20303333, 0.27240229, 0.2179777 , 0.41525722, 0.12195218,
       0.39996421, 0.18376541, 0.32601511, 0.14166689, 0.1647563 ,
       0.46901512, 0.18905544, 0.69110075, 0.315795  , 0.12639666,
       0.45682788, 0.1916374 , 0.19710988, 0.2999019 , 0.18545187,
       0.13161129, 0.29354095, 0.17007947, 0.12362266, 0.32723784,
       0.05673009, 0.37116116, 0.39677846, 0.22782773, 0.16113961,
       0.34259498, 0.15751094, 0.30207372, 0.48306412, 0.06345

In [290]:
np.isclose(check_dist_values, np.array(distances), atol=1e-8)

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,
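
A note on why this check passes even though the two lists differ around the seventh decimal place: `np.isclose` tests `|a - b| <= atol + rtol * |b|`, and `rtol` defaults to `1e-5`. So `atol=1e-8` alone isn't what makes these values "close". A small sketch using the first value from each list above:

```python
import numpy as np

a = 0.08057254552841187   # first value from the matrix-based result
b = 0.08057284355163574   # first value from the pairwise loop

default_check = np.isclose(a, b, atol=1e-8)            # rtol defaults to 1e-5
strict_check = np.isclose(a, b, rtol=0.0, atol=1e-8)   # absolute tolerance only
```

With the default relative tolerance the pair counts as close; with `rtol=0.0` it does not, because the gap is roughly `3e-7`.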

In [6]:
# doc.get_sentences()
# doc.__get_sentences()

In [7]:
# doc.single_sentences_list[:5]

In [7]:
# doc.get_token_length(tokenizer)

In [8]:
# doc.respective_sentence_length

In [9]:
# doc.combine_sentences()

In [10]:
# doc.combined_sentences[:5]

In [11]:
# doc.get_embeddings(model, tokenizer)

In [18]:
# doc.embeddings

In [13]:
# doc.combine_sentence_embeddings()

In [14]:
# doc.combined_sentences[:5]

In [15]:
# doc.calculate_cosine_distances()

In [16]:
# doc.semantic_distances

In [None]:
# output = doc.get_optimal_chunks(starting_threshold=95, step=5)
# output

In [4]:
single_sentences_list = sc.get_sentences(document)

In [5]:
respective_sentence_length = sc.get_token_length(single_sentences_list, tokenizer)

In [6]:
combined_sentences = sc.combine_sentences(single_sentences_list)

In [7]:
embeddings = sc.get_embeddings([x['combined_sentence'] for x in combined_sentences], model, tokenizer)

In [8]:
combined_sentences = sc.combine_sentence_embeddings(combined_sentences, embeddings)

In [9]:
distances, combined_sentences = sc.calculate_cosine_distances(combined_sentences)

In [18]:
chunks = sc.get_optimal_chunks(distances, combined_sentences)
# chunks[:5]

AttributeError: module 'SemChunk' has no attribute 'get_optimal_chunks'