# LateChunkQdrant

Based on the [Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models](https://arxiv.org/abs/2409.04701) paper.

This notebook explains how to use `LateChunkEmbeddings` with the `LateChunkQdrant` vector store.

**Notes:**
- The key idea behind Late Chunking is to embed the entire text first and split it into chunks afterwards, so that each chunk embedding retains context from the whole text. To implement Late Chunking in LangChain, we use the `LateChunkQdrant` vector store, which applies this technique.

- It can be combined with any text splitter available in LangChain, or you can customize it with the [Chunk](https://github.com/jina-ai/late-chunking/blob/main/chunked_pooling/chunking.py) implementation used in the paper. The example in this notebook follows the same method as the authors.
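
The pooling step behind this idea can be illustrated with a minimal sketch. It assumes the long-context model has already produced one embedding per token and that chunk boundaries are given as token index ranges. The helper name `late_chunk_pool` and the toy vectors are hypothetical, and this is not the code used inside `LateChunkQdrant`.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def late_chunk_pool(
    token_embeddings: Sequence[Vector],
    spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Mean-pool token embeddings over each (start, end) span, end exclusive."""
    pooled = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        # Average each dimension across the tokens that belong to this chunk.
        pooled.append([sum(vec[i] for vec in window) / len(window) for i in range(dim)])
    return pooled


# Toy stand-in for token-level model output: 4 tokens, 2 dimensions.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]

# Two chunks: tokens 0-1 and tokens 2-3.
spans = [(0, 2), (2, 4)]

chunk_embeddings = late_chunk_pool(token_embeddings, spans)
print(chunk_embeddings)  # [[2.0, 1.0], [1.0, 2.0]]
```

Because the token embeddings come from a single pass over the whole text, each pooled chunk vector carries information from outside its own span, which is the point of chunking late.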

## Setup

In [None]:
%pip install -qU langchain-jina qdrant-client beautifulsoup4 transformers

### Credentials
To access Jina embedding models, go to https://jina.ai/embeddings/ and get an API key.

In [2]:
import os
import getpass

if not os.getenv("JINA_API_KEY"):
    os.environ["JINA_API_KEY"] = getpass.getpass("Enter your key: ") # "jina_*"

## Instantiation

In [3]:
from langchain_jina import LateChunkEmbeddings

text_embeddings = LateChunkEmbeddings(
    jina_api_key=os.environ.get("JINA_API_KEY"),
    model_name="jina-embeddings-v3"
)

Late chunking also needs a text splitter to define the chunk boundaries. Here we use `RecursiveCharacterTextSplitter` with a chunk size of 200 characters and no overlap.

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)

In the late chunking process, the entire text being embedded must fit within the model's context length limit (8192 tokens). We therefore load the tokenizer from the `transformers` library to measure and validate the input length.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v3')

text_splitter.tokenizer = tokenizer 
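
A length check of this kind might look like the sketch below. The helpers `count_tokens` and `fits_context` are hypothetical names, and the whitespace tokenizer is only a stand-in so the example runs without downloading a model. With the tokenizer loaded above, `tokenizer.tokenize` could be passed instead, keeping in mind that special tokens may add a few tokens on top of the count.

```python
from typing import Callable, List

MAX_TOKENS = 8192  # context limit stated in this notebook


def count_tokens(text: str, tokenize: Callable[[str], List[str]]) -> int:
    """Return the number of tokens the given tokenize function produces."""
    return len(tokenize(text))


def fits_context(
    text: str,
    tokenize: Callable[[str], List[str]],
    max_tokens: int = MAX_TOKENS,
) -> bool:
    """Check whether the text fits within the token budget."""
    return count_tokens(text, tokenize) <= max_tokens


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption): real subword tokenizers yield more tokens.
    return text.split()


sample = "Late chunking embeds the whole text first"
print(fits_context(sample, whitespace_tokenize))                # True
print(fits_context(sample, whitespace_tokenize, max_tokens=3))  # False
```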

### LateChunkQdrant

Configure the parameters used by the database:

In [None]:
class Config:
    ROOT = "qdrantDB"
    CLT_NAME = "demo"
    TOPK = 5

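Here we create the `LateChunkQdrant` database. If the collection already exists on disk it is loaded; otherwise a new one is built from `state_of_the_union.txt`. `Config.TOPK` sets the number of returned documents to 5.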

In [None]:
import os
from qdrant_client import QdrantClient
from langchain_community.docstore.document import Document
from langchain_jina import LateChunkQdrant


client = QdrantClient()

vectorstore = LateChunkQdrant(
    client, 
    collection_name=Config.CLT_NAME,
    embeddings=text_embeddings, 
    text_splitter=text_splitter
)

# Reuse the collection if it has already been persisted under Config.ROOT.
if os.path.isdir(os.path.join(Config.ROOT, "collection", Config.CLT_NAME)):
    print(f"===== Load existing collection: {Config.CLT_NAME} ======")
    vectorstore = vectorstore.from_existing_collection(
        embedding=text_embeddings, 
        path=Config.ROOT,
        collection_name=Config.CLT_NAME,  # was Config.COLLECTION_NAME, which is not defined
        text_splitter=text_splitter
    )

else:
    print(f"===== Create new collection: {Config.CLT_NAME} ======")
    with open("state_of_the_union.txt") as f:
        state_of_the_union = f.read()

    documents  = [
        Document(
            page_content=state_of_the_union, 
            metadata={"source": "state_of_the_union.txt"}
        ),
    ]

    vectorstore = vectorstore.from_documents(
        documents=documents, 
        embedding=text_embeddings, 
        text_splitter=text_splitter,
        path=Config.ROOT, 
        collection_name=Config.CLT_NAME
    )



## Query vector store

Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent.

### Query directly

#### Similarity search
Performing a simple similarity search can be done as follows:

In [None]:
results = vectorstore.similarity_search(
    "What did the president say about ketanji brown jackson?",
    k=3,
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. [{'source': 'state_of_the_union.txt', '_id': 'fe9823ebb87b47839c39c93781a3fffe', '_collection_name': 'demo'}]
* And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. [{'source': 'state_of_the_union.txt', '_id': 'f3c77ec9d0fe44cbb953bde75ae586bb', '_collection_name': 'demo'}]
* As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. [{'source': 'state_of_the_union.txt', '_id': 'b213772f1bc74d12a099a62a2d04aa63', '_collection_name': 'demo'}]


#### Similarity search with score

If you want to execute a similarity search and receive the corresponding scores you can run:

In [24]:
results = vectorstore.similarity_search_with_score(
    "what did the president say about ketanji brown jackson?", 
    k=3
)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")

* [SIM=0.289770] One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. [{'source': 'state_of_the_union.txt', '_id': 'fe9823ebb87b47839c39c93781a3fffe', '_collection_name': 'demo'}]
* [SIM=0.287874] And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. [{'source': 'state_of_the_union.txt', '_id': 'f3c77ec9d0fe44cbb953bde75ae586bb', '_collection_name': 'demo'}]
* [SIM=0.272244] As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. [{'source': 'state_of_the_union.txt', '_id': 'b213772f1bc74d12a099a62a2d04aa63', '_collection_name': 'demo'}]
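
The returned scores can be used to drop weak matches. The sketch below assumes that a higher score means a closer match, as the descending order of the output above suggests. `filter_by_score` is a hypothetical helper and the labels are stand-ins for the returned `Document` objects.

```python
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def filter_by_score(results: List[Tuple[T, float]], min_score: float) -> List[Tuple[T, float]]:
    """Keep only the (item, score) pairs whose score reaches the threshold."""
    return [(item, score) for item, score in results if score >= min_score]


# Stand-in (label, score) pairs reusing the scores printed above.
scored = [("chunk-1", 0.289770), ("chunk-2", 0.287874), ("chunk-3", 0.272244)]

kept = filter_by_score(scored, min_score=0.28)
print([label for label, _ in kept])  # ['chunk-1', 'chunk-2']
```

The same function works on the list returned by `similarity_search_with_score`, since it only relies on each element being an (item, score) pair.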


#### Search by vector

You can also search by vector:

In [None]:
results = vectorstore.similarity_search_by_vector(
    embedding=text_embeddings.embed_query("Protect Americans from COVID-19"),
    k=1
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

* First, stay protected with vaccines and treatments. We know how incredibly effective vaccines are. If you’re vaccinated and boosted you have the highest degree of protection. [{'source': 'state_of_the_union.txt', '_id': '3b89fdcd5f134708819d4c7581cfc122', '_collection_name': 'demo'}]


### Query by turning into retriever

In [35]:
retriever = vectorstore.as_retriever(
    search_kwargs={"k": Config.TOPK}
)
retriever.invoke("What did the president say about ketanji brown jackson?")

[Document(metadata={'source': 'state_of_the_union.txt', '_id': 'fe9823ebb87b47839c39c93781a3fffe', '_collection_name': 'demo'}, page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.'),
 Document(metadata={'source': 'state_of_the_union.txt', '_id': 'f3c77ec9d0fe44cbb953bde75ae586bb', '_collection_name': 'demo'}, page_content='And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.'),
 Document(metadata={'source': 'state_of_the_union.txt', '_id': 'b213772f1bc74d12a099a62a2d04aa63', '_collection_name': 'demo'}, page_content='As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.'),
 Document(metadata={'source': 'state_of_the_union.txt', '

## Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following sections:

- [Tutorials](/docs/tutorials/)
- [How-to: Question and answer with RAG](https://python.langchain.com/docs/how_to/#qa-with-rag)
- [Retrieval conceptual docs](https://python.langchain.com/docs/concepts/retrieval)

## API reference

For detailed documentation of all `LateChunkQdrant` vector store features and configurations, head to the API reference: https://python.langchain.com/api_reference/chroma/vectorstores/langchain_chroma.vectorstores.Qdrant.html

### Tips
- Late chunking addresses the problem of text segments losing meaning when they are embedded without their surrounding context. It is particularly effective with coherent documents, where each part relates to the whole.

- For very long documents, the full document context may not be required. Dividing the text into chapters or larger sections still provides enough context for the embedding model to process all tokens accurately.

- For large inputs (multiple documents, hundreds of pages), `LateChunkQdrant` automatically splits them into smaller batches, which keeps processing manageable and better suited to the available hardware.
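
One way to picture section-level batching is the greedy grouping sketched below. It illustrates the general idea only and is not the batching logic inside `LateChunkQdrant`. The helper `batch_by_token_budget` and the word-count stand-in for a tokenizer are assumptions made for the example.

```python
from typing import Callable, List


def batch_by_token_budget(
    sections: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 8192,
) -> List[List[str]]:
    """Greedily group consecutive sections so each batch stays within max_tokens."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for section in sections:
        n = count_tokens(section)
        if n > max_tokens:
            raise ValueError("a single section exceeds the token budget; split it further")
        # Start a new batch when adding this section would overflow the budget.
        if current and current_tokens + n > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(section)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


def word_count(text: str) -> int:
    # Stand-in for a real tokenizer-based count (assumption).
    return len(text.split())


sections = ["a b c", "d e", "f g h i", "j"]
batches = batch_by_token_budget(sections, word_count, max_tokens=5)
print(batches)  # [['a b c', 'd e'], ['f g h i', 'j']]
```

Keeping consecutive sections together preserves local context within each batch, at the cost of losing context across batch boundaries.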