In [None]:
%%capture
!pip install llama-index==0.10.37 openai==1.30.1 llama-index-embeddings-openai==0.1.9 qdrant-client==1.9.1 llama-index-vector-stores-qdrant==0.2.8 llama-index-llms-openai==0.1.19

In [None]:
import os
import sys
from getpass import getpass
import nest_asyncio

from IPython.display import Markdown, display

from dotenv import load_dotenv

nest_asyncio.apply()

load_dotenv("")

sys.path.append('../helpers')

from utils import setup_llm, setup_embed_model, setup_vector_store

In [None]:
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY') or getpass("Enter your OPENAI_API_KEY: ")

In [None]:
QDRANT_URL = os.environ.get('QDRANT_URL') or getpass("Enter your Qdrant URL:")

In [None]:
QDRANT_API_KEY = os.environ.get('QDRANT_API_KEY') or getpass("Enter your Qdrant API Key:")

In [None]:
from utils import get_documents_from_docstore

senpai_documents = get_documents_from_docstore("../data/words-of-the-senpais")

# Optimizing Chunk Size

In this lesson, we'll explore what chunking is, how it affects the indexing and retrieval process, and how you can customize chunk size and overlap to optimize your results.

> **The Chunking Commandment:** Your goal is not to chunk for chunking sake, our goal is to get our data in a format where it can be retrieved for value later.
>
> -- Greg Kamradt, [5 Levels Of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)

## Understanding Chunking

When documents are ingested into an index, `LlamaIndex` splits them into smaller pieces called "chunks."  This process is known as chunking. By default, LlamaIndex uses a *chunk size* of 1024 and a *chunk overlap* of 20. 

But what do these numbers mean, and how do they impact the indexing and retrieval process?

### Chunk Size

The chunk size determines the maximum number of tokens that each chunk will contain. A token is a word or a piece of a word, so a token count is usually somewhat higher than a word count. With a default chunk size of 1024, `LlamaIndex` will split your documents into chunks that are no longer than 1024 tokens each.

#### **🤏 Smaller Chunk Size**

*   More precise and focused embeddings

*   Beneficial for retrieving specific information

#### **👐 Larger Chunk Size**

*   More general embeddings with broader context

*   Useful for document overviews, but may miss details

### Chunk Overlap

*   Shared tokens between adjacent chunks (default: 20)

*   Maintains context and prevents information loss
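
To make size and overlap concrete, here is a minimal sketch of a fixed-size sliding window over a list of tokens. The helper name `sliding_window_chunks` and the toy integer "tokens" are assumptions for illustration only; LlamaIndex's splitters also respect separators and sentence boundaries, so their output will differ in detail.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where each window repeats the last chunk_overlap tokens of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


# Toy example: ten "tokens", windows of four, one token shared between neighbours
toy_tokens = list(range(10))
toy_chunks = sliding_window_chunks(toy_tokens, chunk_size=4, chunk_overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

Each window starts with the last token of the previous one, which is the "shared tokens" idea in miniature.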

I recommend taking a look at [this chunk visualizer](https://huggingface.co/spaces/m-ric/chunk_visualizer) to get an intuitive sense for chunk size and overlap.

## 🤔 The Impact of Chunk Size
 
I recommend reading [this blog post](https://www.llamaindex.ai/blog/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5) by the LlamaIndex team.

 #### **📏 Relevance and Granularity**

*   Smaller chunks (e.g., 128 tokens) offer granularity but risk missing vital information or lacking sufficient context.

*   Larger chunks (e.g., 512 tokens) are more likely to capture the necessary context, but also run the risk of including irrelevant information.

*   Faithfulness and Relevancy metrics help assess response quality. 

 #### **🎯 Chunk Size and Use Case**

*   **Question Answering:** Shorter, specific chunks for precise answers.

*   **Summarization:** Longer chunks to capture the overall context.

 #### **⏳ Response Generation Time**

*   Larger chunks provide more context but may slow down the system.

*   Balancing comprehensiveness with speed is crucial.
    
 #### **⚖️ Finding the Optimal Size**

*   Testing various chunk sizes is essential for specific use cases and datasets. 

*   Balancing information capture with efficiency is key.

### Considerations When Customizing Chunk Size

When deciding on a chunk size, there are a few things to keep in mind:

| Factor | Description |
|--------|-------------|
| 📄 **Data Characteristics** | The optimal chunk size depends on the data you're indexing. Long, detailed documents may require a larger chunk size to capture more context. A smaller chunk size may be more appropriate for short, focused passages. |
| 🔍 **Retrieval Requirements** | If you need to retrieve very specific details, a smaller chunk size may be better. If you're looking for more general information, a larger chunk size may suffice. |
| 🔢 **Similarity Parameters** | With a smaller chunk size, the embeddings become more specific, and as a result, there might be more relevant chunks that match a given query. To accommodate this increase in potentially relevant chunks, it is advisable to increase the `similarity_top_k` parameter. This adjustment ensures that the query engine does not overlook relevant results due to a too narrow top-k selection. |

### There are [various methods](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules) you can use to chunk your documents. 

| Parser Type | Splitter Name | Description |
|-------------|---------------|-------------|
| 📁 File-Based Node Parsers | 📄`SimpleFileNodeParser` | The simplest flow: `FlatFileReader` + `SimpleFileNodeParser`, which automatically uses the best node parser for each type of content. Then, you may want to chain the file-based node parser with a text-based node parser to account for the actual length of the text. |
| | 🌐`HTMLNodeParser` | This node parser uses beautifulsoup to parse raw HTML. By default, it will parse a select subset of HTML tags, but you can override this. The default tags are: ["p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "b", "i", "u", "section"] |
| | 🎭`JSONNodeParser` | The `JSONNodeParser` parses raw JSON. |
| | 📝`MarkdownNodeParser` | The `MarkdownNodeParser` parses raw markdown text. |
| ✂️ Text-Splitters | 💻`CodeSplitter` | Splits raw code-text based on the language it is written in. |
| | 🦜🔗`LangchainNodeParser` | You can also wrap any existing text splitter from langchain with a node parser. |
| | 📜`SentenceSplitter` | The `SentenceSplitter` attempts to split text while respecting the boundaries of sentences. |
| | 🪟`SentenceWindowNodeParser` | The `SentenceWindowNodeParser` splits all documents into individual sentences. The resulting nodes also contain the surrounding "window" of sentences around each node in the metadata.|
| | 🧠`SemanticSplitterNodeParser` | Instead of chunking text with a fixed chunk size, the semantic splitter adaptively picks the breakpoint in-between sentences using embedding similarity. This ensures that a "chunk" contains sentences that are semantically related to each other. |
| | 🪙`TokenTextSplitter` | The `TokenTextSplitter` attempts to split to a consistent chunk size according to raw token counts. |
| 🔗 Relation-Based Node Parsers | 🌿`HierarchicalNodeParser` | This node parser will chunk nodes into hierarchical nodes. This means a single input will be chunked into several hierarchies of chunk sizes, with each node containing a reference to its parent node. |


## We're only going to focus on a few strategies

I'll show you how to split/chunk text using each method below.


 - 🪙`TokenTextSplitter`
 
 - 📜`SentenceSplitter`

### We'll cover these in later lessons
 
 - 🪟`SentenceWindowNodeParser`

 - 🧠`SemanticSplitterNodeParser`

# 🪙 [`TokenTextSplitter`](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/node_parser/text/token.py)

The primary function is to divide a given text into smaller chunks, ensuring each chunk stays within a specified token limit. 

### **How it Works**

1.  **Tokenization:** It utilizes a tokenizer to break down the text into individual tokens (words or subwords).  The default tokenizer is the `tiktoken` tokenizer for GPT-3.5-Turbo.

2.  **Chunking:** It then groups these tokens into chunks, ensuring each chunk's size is within the defined `chunk_size` limit. 

3.  **Overlap Handling:** To maintain context and coherence between chunks, it can incorporate an overlap, specified by `chunk_overlap`, where the last few tokens of one chunk are repeated at the beginning of the next.

### Arguments you need to know

*   **`chunk_size`**: Controls the maximum token count for each chunk. Defaults to 1024.

*   **`chunk_overlap`**: Determines the number of overlapping tokens between consecutive chunks. Defaults to 20.

*   **`separator`**: Specifies the primary character used to split the text into words. Defaults to space (`" "`). 

*   **`backup_separators`**: Provides additional characters for splitting if the primary separator isn't sufficient. Defaults to new line character (`"\n"`).

Note: The order of splitting is: 1. split by separator, 2. split by backup separators (if any), 3. split by characters

*   **`include_metadata`**: Enables or disables the inclusion of metadata within each chunk. Defaults to `True`.

* **`include_prev_next_rel`**: Enables or disables tracking the relationship between nodes. Defaults to `True`.
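
The splitting order in the note above (separator, then backup separators, then characters) can be sketched as a small recursive function. This is a simplified illustration, not the library's implementation: the name `split_with_fallback` and the `fits` callback are assumptions, and the sketch only covers the splitting step, not the merging of pieces back into chunks that follows it.

```python
def split_with_fallback(text, fits, separator=" ", backup_separators=("\n",)):
    """Break text into pieces that satisfy fits(piece), trying the primary
    separator first, then each backup separator, then single characters."""
    if fits(text) or len(text) <= 1:
        return [text]
    for sep in (separator, *backup_separators):
        parts = [part for part in text.split(sep) if part]
        if len(parts) > 1:
            break  # this separator actually divided the text
    else:
        parts = list(text)  # last resort: split into characters
    pieces = []
    for part in parts:
        pieces.extend(split_with_fallback(part, fits, separator, backup_separators))
    return pieces


# Hypothetical size check: a piece "fits" if it has at most five characters
def fits_five_chars(piece):
    return len(piece) <= 5


fallback_example = split_with_fallback("hello world\nfoo", fits_five_chars)
print(fallback_example)
```

In the example, the space handles the first split and the newline is only used for the piece that is still too long.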

### Usage Example

The basic usage pattern is as follows (you don't need to pass any arguments if you want to keep the default values):

```python
from llama_index.core.node_parser import TokenTextSplitter

splitter = TokenTextSplitter()

nodes = splitter.get_nodes_from_documents(documents)
```

I'll limit our exploration to `chunk_sizes = [64, 128, 256, 512]` and hold `chunk_overlap` fixed at 16 tokens.

In [None]:
senpai_documents[42].text

In [None]:
from llama_index.core.node_parser import TokenTextSplitter

example_split = TokenTextSplitter(chunk_size=64, chunk_overlap=16).split_text(senpai_documents[42].text)

In [None]:
example_split

In [None]:
len(example_split[0].split(' '))

In [None]:
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def num_tokens_from_string(string: str, encoding=encoding) -> int:
    """Returns the number of tokens in a text string."""
    return len(encoding.encode(string))

In [None]:
tokens = encoding.encode(example_split[0])

for token in tokens:
    print(encoding.decode_single_token_bytes(token))

In [None]:
num_tokens_from_string(example_split[0])

In [None]:
from llama_index.core.node_parser import TokenTextSplitter

def token_splitter(chunk_size, documents):
    splitter = TokenTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=16,
        )
    nodes = splitter.get_nodes_from_documents(documents)
    return nodes

In [None]:
token_splitter_results = {}

chunk_sizes = [64, 128, 256, 512]

# Iterate over each chunk size and perform token splitting
for size in chunk_sizes:
    key = f"token_split_chunk_size_{size}"
    token_splitter_results[key] = token_splitter(size, senpai_documents)

In [None]:
for key, value in token_splitter_results.items():
    print(f"With {key} we get {len(value)} chunks.")
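
As a rough sanity check on counts like these, you can estimate how many chunks a fixed-size window with overlap produces for a single text. This is an idealized back-of-the-envelope sketch: `estimate_chunk_count` and the 10,000-token document are assumptions, and real splitters break on separators and work per document, so actual counts will differ.

```python
import math


def estimate_chunk_count(total_tokens, chunk_size, chunk_overlap):
    """Idealized chunk count for one text split by a sliding window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1  # everything fits in a single chunk
    step = chunk_size - chunk_overlap  # new tokens contributed by each extra chunk
    return math.ceil((total_tokens - chunk_size) / step) + 1


# Hypothetical 10,000-token document with the same overlap used in this lesson
for size in [64, 128, 256, 512]:
    print(f"chunk_size={size}: about {estimate_chunk_count(10_000, size, 16)} chunks")
```

Because the overlap is fixed, it eats a larger share of each small chunk, so halving the chunk size more than doubles the number of chunks.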

# [📜`SentenceSplitter`](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/node_parser/text/sentence.py)

The `SentenceSplitter` class, as its name suggests, specializes in splitting text while trying to keep complete sentences and paragraphs together. This is in contrast to the `TokenTextSplitter`, which focuses on token limits.

### How it Works

1. **Initial Splitting**

    *   The text is first divided into paragraphs using the specified `paragraph_separator` (defaults to triple newline characters `"\n\n\n"`).

    *   Each paragraph is then further split using a "chunking tokenizer" (defaults to [`PunktSentenceTokenizer`](https://www.nltk.org/api/nltk.tokenize.PunktSentenceTokenizer.html) from the `nltk` library), which looks for sentence boundaries.

    *   If these methods don't yield enough splits, it resorts to a backup regex and the default separators (`CHUNKING_REGEX = "[^,.;。？！]+[,.;。？！]?"`).

2. **Chunking with Sentence Awareness**

    *   The resulting splits are grouped into chunks, keeping sentences together as much as possible. 

    *   It considers the `is_sentence` flag for each split during this process.

    *   Chunk size and overlap still play a role, but sentence boundaries are given preference.

3. **Overlap Handling**

    *   Similar to `TokenTextSplitter`, it incorporates overlap between chunks to maintain context. 

    *   However, it prioritizes using the last complete sentence for overlap rather than just the last few tokens.
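
The grouping idea can be sketched in plain Python. This is a simplified illustration under stated assumptions: it detects sentences with a basic regex instead of Punkt, counts words instead of tokens, and leaves out overlap. The helper name `sentence_chunks` is hypothetical.

```python
import re


def sentence_chunks(text, max_words):
    """Greedily pack whole sentences into chunks of at most max_words words.
    A single sentence longer than max_words is kept whole in its own chunk."""
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]?", text) if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))  # close the chunk at a sentence boundary
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample_text = "One two three. Four five. Six seven eight nine. Ten."
sample_chunks = sentence_chunks(sample_text, max_words=5)
for chunk in sample_chunks:
    print(repr(chunk))
```

No chunk ends in the middle of a sentence, even though a chunk may end up shorter than the limit.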

### Arguments you need to know

*   **`chunk_size`**: The target token size for each chunk.

*   **`chunk_overlap`**: The number of overlapping tokens between chunks.

*   **`separator`**: The default separator for splitting (e.g., space).

*   **`paragraph_separator`**: The string used to identify paragraph breaks.

*   **`secondary_chunking_regex`**: A backup regex for splitting if the primary methods are insufficient.

### Usage Example

```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=256, chunk_overlap=50)

nodes = splitter.get_nodes_from_documents(documents)
```

### When to Use SentenceSplitter

*   When preserving complete sentences and paragraphs is essential for understanding the context.

*   When dealing with text where sentence boundaries are meaningful (e.g., legal documents, narratives).

*   When you want to avoid having broken sentences at the beginning or end of chunks.

In [None]:
from llama_index.core.node_parser import SentenceSplitter

SentenceSplitter(chunk_size=64, chunk_overlap=16).split_text(senpai_documents[42].text)

In [None]:
def sentence_splitter(chunk_size, documents):
    splitter = SentenceSplitter(
        chunk_size=chunk_size,
        chunk_overlap=16,
        )
    nodes = splitter.get_nodes_from_documents(documents)
    return nodes

In [None]:
sentence_splitter_results = {}

# Iterate over each chunk size and perform sentence splitting
for size in chunk_sizes:
    key = f"sentence_split_chunk_size_{size}"
    sentence_splitter_results[key] = sentence_splitter(size, senpai_documents)

In [None]:
for key, value in sentence_splitter_results.items():
    print(f"With {key} we get {len(value)} chunks.")

### Recap: `TokenTextSplitter` vs `SentenceSplitter`

`TokenTextSplitter` splits the text into chunks based on a specified number of tokens. It uses a tokenizer to break down the text into individual tokens (words or subwords), and then groups these tokens into chunks of a specified size. If the text doesn't divide evenly into the specified chunk size, the last chunk will contain the remaining tokens, which could be less than the specified chunk size.

`SentenceSplitter`, on the other hand, splits the text into chunks based on sentences. It uses a sentence boundary detection algorithm to identify where sentences begin and end, and then groups these sentences into chunks. The size of these chunks can vary depending on the length of the sentences.

# Select Strategy for Ingestion

In [None]:
import random
random.seed(0)
# Randomly select a key from the combined token-splitter and sentence-splitter results
strategies = list(token_splitter_results.keys()) + list(sentence_splitter_results.keys())
random_key = random.choice(strategies)
print(f"Randomly selected key: {random_key}")

# Ingest to Qdrant

In [None]:
from llama_index.core.settings import Settings
from utils import setup_llm, setup_embed_model, setup_vector_store

COLLECTION_NAME = "wots_sentence_split_chunk_size_256"

setup_llm(
    provider="openai",
    api_key=OPENAI_API_KEY, 
    temperature=0.75, 
    model="gpt-4o", 
    system_prompt="""Use ONLY the provided context and generate a complete, coherent answer to the user's query. 
    Your response must be grounded in the provided context and relevant to the essence of the user's query.
    """
    )

setup_embed_model(
    provider="openai", 
    model="text-embedding-3-small",
    api_key=OPENAI_API_KEY
    )

vector_store = setup_vector_store(QDRANT_URL, QDRANT_API_KEY, COLLECTION_NAME)

In [None]:
from utils import ingest

# what splitter are we gonna use?
sent_splitter = SentenceSplitter(chunk_size=256, chunk_overlap=16)

transforms = [sent_splitter, Settings.embed_model]

split_nodes = ingest(
    documents=senpai_documents,
    transformations=transforms,
    vector_store=vector_store
)

## Build Index Over VectorStore

In [None]:
from llama_index.core import StorageContext
from utils import create_index

index = create_index(
    from_where="vector_store", 
    embed_model=Settings.embed_model,
    vector_store=vector_store,
    )

# Create Query Engine

The `refine` response mode creates and refines an answer by sequentially going through each retrieved text chunk, which makes a separate LLM call per node (retrieved chunk).

I am explicitly setting the response mode to `compact`, which is similar to `refine` but concatenates the chunks beforehand, resulting in fewer LLM calls. In recent LlamaIndex versions, `compact` is also the library default.
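
To see why `compact` needs fewer LLM calls than `refine`, here is a rough sketch that only counts calls. The function names and the 1,000-token context budget are assumptions for illustration; the real synthesizer also reserves room for the prompt template and the query.

```python
def refine_call_count(chunk_token_counts):
    """refine: one LLM call per retrieved chunk."""
    return len(chunk_token_counts)


def compact_call_count(chunk_token_counts, context_budget):
    """compact: pack chunks together until the budget is full, one call per pack."""
    calls, used = 0, 0
    for n_tokens in chunk_token_counts:
        if used == 0 or used + n_tokens > context_budget:
            calls += 1  # start a new packed prompt
            used = 0
        used += n_tokens
    return calls


# Five retrieved chunks of 256 tokens each, hypothetical budget of 1,000 tokens
retrieved = [256] * 5
print("refine calls: ", refine_call_count(retrieved))
print("compact calls:", compact_call_count(retrieved, context_budget=1000))
```

With small chunks and a large context window, the gap between the two modes grows quickly.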

You can [visit the LlamaIndex docs](https://docs.llamaindex.ai/en/stable/module_guides/querying/response_synthesizers/) to learn more about the choices you have here. Just note that this is also a hyperparameter that you have control over, which will also impact your generation. 

I will leave this up to you to hack around with.

I'm also going to change the value of `similiarty_top_k` from its default value of 2 to 5. This is an arbitrary choice and simply meant to illustrate that it's a hyperparameter under your control which will affect your generation results. Increasing this value raises the probability of fetching the most relevant documents from the vector database.
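
Conceptually, top-k retrieval just keeps the k highest-scoring chunks. A minimal sketch follows; the chunk names and similarity scores are made up for illustration, and `top_k` is a hypothetical helper rather than a LlamaIndex function.

```python
import heapq


def top_k(scores, k):
    """Return the k (name, score) pairs with the highest similarity scores."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Made-up similarity scores between a query and five chunks
similarity_scores = {
    "chunk_a": 0.91,
    "chunk_b": 0.40,
    "chunk_c": 0.88,
    "chunk_d": 0.75,
    "chunk_e": 0.10,
}

top_2 = [name for name, _ in top_k(similarity_scores, 2)]
top_4 = [name for name, _ in top_k(similarity_scores, 4)]
print(top_2)
print(top_4)
```

A larger k never drops a chunk that a smaller k would have returned; it only adds lower-ranked ones, which is why it costs more context but is less likely to miss something relevant.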

#### Vector Store Query Mode

The query engine also has a parameter for `vector_store_query_mode`, and there are [several choices you can make here](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/vector_stores/types.py).

`Maximal Marginal Relevance (MMR)` balances relevance and diversity when selecting a subset of items from a larger set.

The key idea behind MMR is to iteratively select items that are both highly relevant to the query and also different from the items already selected. This is achieved by maximizing a score that combines two components:

1. **Relevance**: The similarity between the item and the query (e.g. cosine similarity)

2. **Diversity**: The maximum similarity between the item and any of the items already selected

The MMR score is a linear combination of these components, controlled by a parameter `λ` (lambda) that determines the trade-off between relevance and diversity:

`MMR = argmax [λ * Relevance(item, query) - (1-λ) * max(Similarity(item, selected_item))]`

- If λ is close to 1, MMR puts more emphasis on relevance. 

- If λ is close to 0, MMR favors diversity.

MMR can improve the retrieval component by selecting a diverse set of relevant passages. This helps capture different aspects of the query and provides the language model with a richer context for generating the final answer.

The benefits of using MMR in RAG include:

1. It avoids selecting passages that contain very similar information.

2. It increases the chances of covering different facets of the query.

3. It provides the language model with a diverse set of relevant passages, which can lead to more comprehensive and well-rounded answers.
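
The MMR formula above can be sketched as a greedy selection loop in plain Python. This is an illustration of the scoring rule only, not how any particular vector store implements it; `mmr_select`, the toy two-dimensional embeddings, and the λ values are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query, candidates, k, lam):
    """Greedily pick k candidate indices, scoring each remaining item as
    lam * relevance - (1 - lam) * max similarity to anything already picked."""
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        relevance = cosine(candidates[i], query)
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
        )
        return lam * relevance - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings: candidates 0 and 1 are near-duplicates, candidate 2 is different
query_vec = [1.0, 0.0]
candidate_vecs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

relevance_only = mmr_select(query_vec, candidate_vecs, k=2, lam=1.0)
diversity_heavy = mmr_select(query_vec, candidate_vecs, k=2, lam=0.3)
print(relevance_only, diversity_heavy)
```

With λ = 1 the two near-duplicates are returned together; with a lower λ the second pick switches to the less similar candidate.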

### I will leave it up to you to experiment with using MMR (or, not using it) as well as experimenting with different λ values.

The pattern for how to use it is there for you to see.

In [None]:
from utils import create_query_engine

query_engine = create_query_engine(
    index=index, 
    mode="query",
    response_mode="compact",
    similiarty_top_k=5,
    vector_store_query_mode="mmr", 
    vector_store_kwargs={"mmr_threshold": 0.42}
    )

### We can't forget about the prompt!

In [None]:
from utils import display_prompt_dict
display_prompt_dict(query_engine.get_prompts())

In [None]:
from prompts import ANSWER_GEN_PROMPT

print(ANSWER_GEN_PROMPT)

In [None]:
from llama_index.core import PromptTemplate

ANSWER_GEN_PROMPT_TEMPLATE = PromptTemplate(ANSWER_GEN_PROMPT)

query_engine.update_prompts({'response_synthesizer:text_qa_template':ANSWER_GEN_PROMPT_TEMPLATE})

In [None]:
display_prompt_dict(query_engine.get_prompts())

### Instantiate Query Pipeline

In [None]:
from utils import create_query_pipeline

from llama_index.core.query_pipeline import InputComponent

input_component = InputComponent()

chain = [input_component, query_engine]

query_pipeline = create_query_pipeline(chain)

In [None]:
Settings.llm

In [None]:
query_pipeline.run(input="How can I become the best in the world at what I do?")

In [None]:
query_pipeline.run(input="How can I build my brand and make a name for myself in order to be uniquely qualified for emerging opportunities in technology?")

In [None]:
query_pipeline.run(input="How can I set up systems to be the most successful version of myself while working the least hard possible?")