#### SimilarityPostprocessor Overview
- **SimilarityPostprocessor** filters retrieved nodes by a similarity score threshold (`similarity_cutoff`).
- Nodes scoring below the threshold are removed before response generation.
- Only nodes that are semantically close to the query reach the language model, which keeps the response focused on relevant content.
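
The thresholding rule above can be sketched in plain Python. This is a minimal illustration of the idea, not LlamaIndex's implementation: `ScoredNode` and `filter_by_similarity` are hypothetical names, the scores are made up, and dropping nodes that have no score is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredNode:
    """Hypothetical stand-in for a retrieved node and its similarity score."""
    node_id: str
    score: Optional[float]


def filter_by_similarity(nodes: List[ScoredNode], cutoff: float) -> List[ScoredNode]:
    """Keep nodes scoring at or above the cutoff; drop unscored nodes (assumption)."""
    return [n for n in nodes if n.score is not None and n.score >= cutoff]


# Illustrative scores, not taken from a real retrieval run.
candidates = [
    ScoredNode("a", 0.91),
    ScoredNode("b", 0.72),
    ScoredNode("c", None),
]
kept = filter_by_similarity(candidates, cutoff=0.8)
print([n.node_id for n in kept])
```

Raising the cutoff trades recall for precision: fewer nodes survive, but those that do are closer to the query.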


In [1]:
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

In [3]:
reader = SimpleDirectoryReader('files/postprocessors')

In [4]:
documents = reader.load_data()

In [5]:
len(documents)

1

In [7]:
#documents

#### How LlamaIndex divides documents into nodes

1. **Document Loading**:
   - `SimpleDirectoryReader('files/postprocessors')` loads documents from a specified directory.
   - Each document may be large (e.g., a long article or text file).

2. **Document Chunking**:
   - Calling `documents = reader.load_data()` only returns a list of `Document` objects; it does not split them. The split into smaller pieces called "nodes" happens later, when a node parser runs (for example, inside `VectorStoreIndex.from_documents`).
   - Chunking is often based on the length of the text (number of words, tokens, or paragraphs) to ensure each node is manageable for the language model.

3. **Nodes**:
   - A node consists of a contiguous section of the document, such as a paragraph, a few sentences, or a fixed number of tokens.
   - These nodes are later indexed or processed by `VectorStoreIndex`, enabling efficient search and retrieval.

**Summary:** Chunking breaks large documents into smaller nodes that fit within the language model's context and can be embedded and retrieved individually.
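
To make the idea of fixed-size chunks concrete, here is a minimal sketch of a sliding window with overlap. It is an illustration only: it counts whitespace-separated words rather than model tokens, and `chunk_words` is a hypothetical helper, not a LlamaIndex function.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of `chunk_size` words that share `chunk_overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for i, chunk in enumerate(chunk_words(sample, chunk_size=4, chunk_overlap=1)):
    print(i, chunk)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary is still seen with some context in both chunks.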


In [8]:
# Create an index from the documents (this will chunk them into nodes)
index = VectorStoreIndex.from_documents(documents)

In [10]:
# Access the nodes stored in the index's docstore (a dict keyed by node ID)
nodes = index.docstore.docs

# Iterate over the nodes and print them
for node_id, node in nodes.items():
    print(f"Node ID: {node_id}")
    print(f"Node Content:\n{node.get_text()}\n")

Node ID: e9f253fb-6554-4113-a7fd-49291bd96059
Node Content:
In the quaint village of Lavender Hollow, nestled snugly within the embrace of towering trees and rolling hills, a ginger tabby cat named Fluffy resided in a cozy cottage that exuded warmth and tranquility. Fluffy, with her enchanting amber eyes and majestic ginger fur, had become an integral part of the tapestry of life that unfolded within the village.

It was a beautiful morning, with ethereal rays of sunlight streaming through the curtains, casting a soft golden glow upon Fluffy's slumbering form. As her eyes fluttered open, Fluffy could sense the promise of excitement and adventure that awaited her just outside the window. It was a tranquil Wednesday morning, February 7th, 2024, and the world outside seemed to beckon her with promises of new discoveries and delightful escapades.

With a graceful stretch, Fluffy emerged from her cozy bed and padded over to the kitchen, where the Joneses, a family known for their unwavering

#### How Chunking is Done:

- When you load documents using `SimpleDirectoryReader` and pass them to `VectorStoreIndex`, 
  the documents are split (or chunked) into smaller pieces called **nodes**.
- The purpose of this chunking is to ensure that large documents are broken into manageable sections 
  that the model can process efficiently.

#### Default Chunking:

- By default, LlamaIndex splits documents with a sentence-aware splitter (`SentenceSplitter`).
- It targets a fixed chunk size measured in tokens (1024 in recent releases; check your installed version) and prefers to break at sentence or paragraph boundaries rather than mid-sentence.

#### Custom Chunking:

- You can customize how documents are chunked by using a different text splitter or node parser (for example, `TokenTextSplitter`) or by setting chunking parameters such as `chunk_size` and `chunk_overlap`.
- For example, you can define how long each chunk should be or how the splits should happen (e.g., by sentences or by tokens).
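
As a contrast to purely size-based splitting, the sketch below packs whole sentences into chunks up to a word budget. It is a simplified illustration under stated assumptions: `chunk_by_sentences` is a hypothetical helper, sentences are detected with a naive regular expression, and length is counted in words rather than tokens. It is not how any LlamaIndex splitter is implemented.

```python
import re
from typing import List


def chunk_by_sentences(text: str, max_words: int) -> List[str]:
    """Pack whole sentences into chunks of at most `max_words` words.

    A single sentence longer than `max_words` becomes its own oversized chunk.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "Fluffy woke up. She stretched and yawned. "
    "Then she walked to the kitchen. Breakfast was ready."
)
for chunk in chunk_by_sentences(sample, max_words=8):
    print(repr(chunk))
```

Splitting on sentence boundaries keeps each chunk readable on its own, at the cost of chunks that vary in length.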


#### Using custom chunking (TextSplitter)

In [12]:
from llama_index.core.text_splitter import TokenTextSplitter

In [13]:
reader = SimpleDirectoryReader('files/postprocessors')

In [27]:
documents = reader.load_data()

In [32]:
# Customize chunking using TokenTextSplitter
#text_splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=20)  # Customize chunk size and overlap

In [33]:
from llama_index.core.node_parser import SimpleNodeParser

In [38]:
# Initialize the parser
parser = SimpleNodeParser.from_defaults(chunk_size=200, chunk_overlap=20)

# Parse documents into nodes
nodes = parser.get_nodes_from_documents(documents)

In [39]:
len(nodes)

11

Back to the main example: retrieve nodes for a query, then filter them with `SimilarityPostprocessor`.

In [40]:
reader    = SimpleDirectoryReader('files/postprocessors')
documents = reader.load_data()

index     = VectorStoreIndex.from_documents(documents)

In [41]:
# VectorStoreIndex takes no retriever mode; by default it returns the top 2 nodes (similarity_top_k=2)
retriever = index.as_retriever()

In [42]:
nodes = retriever.retrieve(
    "What did Fluffy found in the gentle stream?"
)

In [43]:
print('Initial nodes:')
for node in nodes:
    print(f"Node: {node.node_id} - Score: {node.score}")

Initial nodes:
Node: 329318c0-ed03-450e-8fb2-25dd75221023 - Score: 0.8739163932659445
Node: ff128de5-4559-4ac8-ac1a-4eb518cae11d - Score: 0.855709528796912


In [44]:
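# Drop nodes whose similarity score is below 0.86.
# The nodes themselves are passed to postprocess_nodes(), not to the constructor.
pp = SimilarityPostprocessor(similarity_cutoff=0.86)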

In [45]:
remaining_nodes = pp.postprocess_nodes(nodes)

In [46]:
print('Remaining nodes:')
for node in remaining_nodes:
    print(f"Node: {node.node_id} - Score: {node.score}")

Remaining nodes:
Node: 329318c0-ed03-450e-8fb2-25dd75221023 - Score: 0.8739163932659445
