[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/weaviate-features/services-research/late_chunking.ipynb)

In [33]:
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer
import numpy as np
import torch


In [3]:
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

In [4]:
def chunk_by_tokens(input_text: str, tokenizer: callable, chunk_size: int = 512):
    """
    Split the input text into chunks of at most chunk_size tokens.

    Returns the chunk strings and, for each chunk, its (start, end) token
    span, with end exclusive.
    """
    tokens = tokenizer(input_text, return_offsets_mapping=True, add_special_tokens=False)
    token_offsets = tokens['offset_mapping']
    
    chunks = []
    span_annotations = []
    
    for i in range(0, len(token_offsets), chunk_size):
        chunk_end = min(i + chunk_size, len(token_offsets))
        if chunk_end - i > 0:
            start_offset = token_offsets[i][0]
            end_offset = token_offsets[chunk_end - 1][1]
            chunks.append(input_text[start_offset:end_offset])
            span_annotations.append((i, chunk_end))
    
    return chunks, span_annotations
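
The offset-based slicing above can be tried without downloading a model. The following is a minimal sketch, assuming a hypothetical whitespace "tokenizer" (`toy_offsets`) that stands in for the real tokenizer's `offset_mapping`; the names `toy_offsets` and `chunk_by_offsets` are illustrative and not part of the notebook.

```python
import re


def toy_offsets(text):
    """Hypothetical stand-in for offset_mapping: one (start, end) pair per word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def chunk_by_offsets(text, offsets, chunk_size):
    """Group consecutive tokens into chunks and record their token spans."""
    chunks, spans = [], []
    for i in range(0, len(offsets), chunk_size):
        end = min(i + chunk_size, len(offsets))
        # Slice the original string from the first token's start character
        # to the last token's end character.
        chunks.append(text[offsets[i][0]:offsets[end - 1][1]])
        spans.append((i, end))  # token indices, end exclusive
    return chunks, spans


toy_text = "late chunking pools token embeddings per span"
toy_chunks, toy_spans = chunk_by_offsets(toy_text, toy_offsets(toy_text), chunk_size=3)
print(toy_chunks)  # ['late chunking pools', 'token embeddings per', 'span']
print(toy_spans)   # [(0, 3), (3, 6), (6, 7)]
```

The spans index tokens while the chunk strings are cut by character offsets, which is why both are returned: the strings are what gets displayed, and the spans are what the pooling step consumes.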

In [5]:
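def late_chunking(
    model_output: torch.Tensor, span_annotation: list, max_length=None
):
    """
    Mean-pool the token embeddings over each (start, end) token span,
    yielding one embedding per chunk for each document in the batch.
    """
    token_embeddings = model_output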
    outputs = []
    for embeddings, annotations in zip(token_embeddings, span_annotation):
        if (
            max_length is not None
        ):  # drop or clip annotations that go beyond the max length of the model
            annotations = [
                (start, min(end, max_length - 1))
                for (start, end) in annotations
                if start < (max_length - 1)
            ]
        pooled_embeddings = [
            embeddings[start:end].sum(dim=0) / (end - start)
            for start, end in annotations
            if (end - start) >= 1
        ]
        pooled_embeddings = [
            embedding.detach().cpu().numpy() for embedding in pooled_embeddings
        ]
        outputs.append(pooled_embeddings)

    return outputs
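
The pooling step in `late_chunking` is a per-span mean over rows of the token-embedding matrix. The sketch below reproduces that idea with NumPy only, on a tiny made-up matrix, so the arithmetic can be checked by hand. `pool_spans` and `toy_tokens` are hypothetical names for illustration, not the notebook's implementation.

```python
import numpy as np


def pool_spans(token_embeddings, spans):
    """Mean-pool rows [start, end) for each span; empty spans are skipped."""
    return [
        token_embeddings[start:end].mean(axis=0)
        for start, end in spans
        if end > start
    ]


# Six "tokens" with two-dimensional embeddings: [0, 1], [2, 3], ..., [10, 11]
toy_tokens = np.arange(12, dtype=float).reshape(6, 2)
pooled = pool_spans(toy_tokens, [(0, 2), (2, 6)])
print(pooled[0])  # [1. 2.]
print(pooled[1])  # [7. 8.]
```

In late chunking the rows being averaged come from a single forward pass over the whole document, so each chunk vector can reflect context outside its own span.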

In [17]:
text = """As Weaviate celebrates its fifth anniversary, we've had the privilege of collaborating with tens of thousands of developers, gaining invaluable insights into the evolving landscape of AI projects and strategies. Our users constantly push the boundaries of what’s possible. As they continue to scale their applications in production, they guide the evolution of our product and the market itself. The need for optionality One of the main reasons developers choose Weaviate is the optionality it offers in terms of machine learning models, frameworks, and deployment. With new AI models and tools emerging daily, it's crucial to build systems that allow flexibility for tech stacks to evolve. This optionality, combined with ease of use, helps teams scale AI prototypes into production faster. Flexibility is also vital when it comes to architecture. Different use cases have different requirements. For example, we work with many software companies and those operating in regulated industries. They often require multi-tenancy to isolate data and maintain compliance. When building a Retrieval Augmented Generation (RAG) application, using account or user-specific data to contextualize results, data must remain within a dedicated tenant for its user group. Weaviate’s native, multi-tenant architecture shines for customers who need to prioritize data privacy while maintaining fast retrieval and accuracy. On the other hand, we support some very large scale single-tenant use cases that orient toward real-time data access. Many of these are in e-commerce and industries that compete on speed and customer experience. Even the slightest latency can send their users looking elsewhere. These use cases leverage our HNSW index on hot storage and vector compression to ensure low latency. The point is, there is no one-size-fits-all solution so optionality is key. 
I’m very proud that through learning from our customers and community, we’re building a solution that supports diverse use cases and the evolving needs of developers. Introducing hot, warm, and cold storage tiers It’s amazing to see our customers' products gain popularity, attracting more users, and in many cases, tenants. However, as multi-tenant use cases scale, infrastructure costs can quickly become prohibitive. Since multi-tenancy is a core tenet of our architecture, the next logical step for us was to build a way to help customers drive more efficient resource consumption. We’re pleased to offer tenant offloading and hot, warm, and cold storage tiers as part of our latest release. Weaviate users (Open Source and Enterprise Cloud) can now deactivate or offload tenants to less-expensive warm or cold storage and reactivate them dynamically, based on the unique patterns of their use case. Here’s what it might look like in practice: One of our customers develops an email platform with tens of thousands of users. 80% of their users are only active during a 12-hour window (US business hours). With our new storage tiers, they can offload tenants to cold storage to save on infrastructure costs when users are inactive. When a user comes online, they can quickly warm up the tenant. This way they reduce storage costs while still offering performance that meets the needs of their customers. alt The Weaviate AI Unit To adapt to this product change and the evolving AI stack, we’ve introduced a new pricing unit to our Enterprise Cloud offering. An AI Unit (AIU) is a Weaviate-specific unit that can be applied to hot, warm, and cold storage tiers and compute costs. AIUs enable customers to better monitor usage and improve budgeting. In addition to resource costs, AIUs will apply to new AI-native Apps as they are released (more on that next). 
Apps and tools to fuel AI-native development As we continue to listen to our community, it’s clear that developers need an AI-native framework offering not just flexibility, but also modular GUI tools to interact with their data and accelerate their use cases. We’re excited about a new line of AI-native apps and tools that will help developers and business users accelerate common use cases. Recommender App Our first app is a Recommender service, now in private beta. The Recommender is a fully managed, low-code way to build scalable recommendation systems. It offers configurable endpoints for item-to-item, item-to-user, and user-to-user recommendation scenarios across multimodal data. Sign up for the private beta here, and stay tuned for more Apps updates coming soon. alt Weaviate Cloud Tools Lastly, new Weaviate Cloud Tools give developers and non-technical users an easier way to manage, explore, and interact with their data within Weaviate Cloud. The Query and Collections tools are available now in the Weaviate Cloud Console. It’s been an exciting few months, and I’m ecstatic to continue learning from our community and empowering developers to build the future of AI-native possibilities. To dive deeper into our latest product updates, join our upcoming webinar."""
chunks, span_annotations = chunk_by_tokens(text, tokenizer, 128)

In [7]:
for i, chunk in enumerate(chunks):
    print(f"*** Chunk {i} Start ***")
    print(chunk)
    print(f"*** Chunk {i} End ***")


*** Chunk 0 Start ***
As Weaviate celebrates its fifth anniversary, we've had the privilege of collaborating with tens of thousands of developers, gaining invaluable insights into the evolving landscape of AI projects and strategies. Our users constantly push the boundaries of what’s possible. As they continue to scale their applications in production, they guide the evolution of our product and the market itself. The need for optionality One of the main reasons developers choose Weaviate is the optionality it offers in terms of machine learning models, frameworks, and deployment. With new AI models and tools emerging daily, it's crucial to build systems that allow flexibility for
*** Chunk 0 End ***
*** Chunk 1 Start ***
tech stacks to evolve. This optionality, combined with ease of use, helps teams scale AI prototypes into production faster. Flexibility is also vital when it comes to architecture. Different use cases have different requirements. For example, we work with many softwar

In [8]:
token_embeddings = model.encode(text, output_value="token_embeddings").unsqueeze(0)

In [9]:
chunk_embeddings = late_chunking(token_embeddings, [span_annotations])
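
One detail that may be worth verifying here: `chunk_by_tokens` builds its spans with `add_special_tokens=False`, whereas the token embeddings returned by the model may include special tokens (for BERT-style tokenizers, typically a leading `[CLS]` and a trailing `[SEP]`). If that is the case, the span indices would be shifted by one relative to the embedding rows. Comparing `token_embeddings.shape[1]` with `span_annotations[-1][1]` should show whether the lengths differ. The helper below is a hedged sketch of how spans could be realigned, assuming exactly `offset` leading special tokens; `shift_spans` is a hypothetical name and not part of the notebook.

```python
def shift_spans(spans, offset=1):
    """Shift every (start, end) token span by `offset` positions.

    Assumption: the embedding matrix has `offset` extra rows at the front
    (e.g. one [CLS] token) that the spans did not account for.
    """
    return [(start + offset, end + offset) for start, end in spans]


shifted = shift_spans([(0, 128), (128, 256)], offset=1)
print(shifted)  # [(1, 129), (129, 257)]
```

If the lengths already match, no shift is needed and the spans can be passed through unchanged.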

In [10]:
def cos_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

In [11]:
query_text = "what do customers need to prioritze?"
query = model.encode(query_text)
print(f"The query is: {query_text}")

naive_results = []
late_results = []

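for chunk_n, chunk in enumerate(chunks):
    # Naive chunking: embed each chunk on its own, without document context
    naive_sim = cos_sim(query, model.encode(chunk))
    # Late chunking: reuse the span-pooled embedding from the full-document pass
    late_sim = cos_sim(query, chunk_embeddings[0][chunk_n])

    naive_results.append((naive_sim, chunk_n))
    late_results.append((late_sim, chunk_n))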

# Sort results in descending order of cosine similarity
naive_results.sort(reverse=True)
late_results.sort(reverse=True)

print("\nTop 2 results for Naive Chunking:")
for sim, chunk_n in naive_results[:2]:
    print(f"Chunk {chunk_n}: Cosine Similarity = {sim}")
    print(chunks[chunk_n].strip())
    print("----")

print("\nTop 2 results for Late Chunking:")
for sim, chunk_n in late_results[:2]:
    print(f"Chunk {chunk_n}: Cosine Similarity = {sim}")
    print(chunks[chunk_n].strip())
    print("----")

The query is: what do customers need to prioritze?

Top 2 results for Naive Chunking:
Chunk 8: Cosine Similarity = 0.7555344104766846
product updates, join our upcoming webinar.
----
Chunk 3: Cosine Similarity = 0.7479638457298279
diverse use cases and the evolving needs of developers. Introducing hot, warm, and cold storage tiers It’s amazing to see our customers' products gain popularity, attracting more users, and in many cases, tenants. However, as multi-tenant use cases scale, infrastructure costs can quickly become prohibitive. Since multi-tenancy is a core tenet of our architecture, the next logical step for us was to build a way to help customers drive more efficient resource consumption. We’re pleased to offer tenant offloading and hot, warm, and cold storage tiers as part of our latest release. Weaviate users (Open
----

Top 2 results for Late Chunking:
Chunk 2: Cosine Similarity = 0.7008509039878845
data privacy while maintaining fast retrieval and accuracy. On the other h