# LLM-Based Chunking

## Propositions

Instead of using passages or sentences as the retrieval unit, the paper [Dense X Retrieval: What Retrieval Granularity Should We Use?](https://chentong0.github.io/factoid-wiki/) introduces a retrieval unit for dense retrieval called "propositions." Propositions are atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format.

The three principles below define propositions as atomic expressions of meaning in text:

* Each proposition should represent a distinct piece of meaning in the text, collectively embodying the semantics of the entire text.
* A proposition must be minimal and cannot be further divided into separate propositions.
* A proposition should contextualize itself and be self-contained, encompassing all the necessary context from the text (e.g., coreference) to interpret its meaning.

### Using the open-source Propositionizer model (Flan-T5 architecture)

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import json

model_name = "chentong00/propositionizer-wiki-flan-t5-large"
# Run on the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# use_fast=False selects the slow (Python) tokenizer; fast tokenizers became the
# default in transformers v4.0.0 (https://github.com/huggingface/transformers/releases/tag/v4.0.0)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
```

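```python
title = "General Information"
section = ""
content = "Cats love dogs. Think They are amazing. Dogs must be the easiest pets around. Tesla robots are advanced now with AI. They will take us to mars."

# The model takes the title, section, and content combined into a single prompt string.
input_text = f"Title: {title}. Section: {section}. Content: {content}"

# Tokenize the prompt, generate on the selected device, then move the result back to the CPU.
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids.to(device), max_new_tokens=512).cpu()

# Decode the generated token ids back into text, dropping special tokens.
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(output_text)
```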
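
The decoded output is a single string, so it usually needs to be turned into individual propositions before indexing. The sketch below assumes the model emits a JSON array of strings; that format is an assumption here and is worth checking against your own output. The helper `parse_propositions` and the `example_output` string are hypothetical, written only for illustration.

```python
import json
from typing import List


def parse_propositions(output_text: str) -> List[str]:
    """Parse decoded model output into a list of proposition strings.

    Assumes the output is a JSON array of strings (an assumption, not a
    guarantee). Returns an empty list when the text is not such an array.
    """
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only non-empty string entries, trimmed of surrounding whitespace.
    return [p.strip() for p in parsed if isinstance(p, str) and p.strip()]


# Hand-written example string for illustration, not real model output.
example_output = '["Cats love dogs.", "Dogs must be the easiest pets around."]'
for proposition in parse_propositions(example_output):
    print(proposition)
```

Returning an empty list on malformed output keeps a batch job running when a single generation is truncated or otherwise invalid; logging the failure instead may be preferable in practice.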

## Multi-vector indexing

Another approach is multi-vector indexing: each document is represented by several vectors, and semantic search runs against vectors derived from something other than the full raw text. There are several ways to create multiple vectors per document.


**Smaller chunks:**

Divide a document into smaller chunks and embed those chunks (LangChain implements this as `ParentDocumentRetriever`).

**Summary:**

Generate a summary for each document and embed it along with, or instead of, the document.

**Hypothetical questions:**

Form hypothetical questions that each document would be appropriate to answer, and embed them along with, or instead of, the document.

Each of these uses either a text2text model or an LLM with a prompt to produce the derived chunk. The system then indexes both the newly generated chunk and the original text, which improves the recall of the retrieval system. More details on these techniques are available in LangChain’s official [documentation](https://python.langchain.com/docs/how_to/multi_vector/).
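
The idea common to all three methods can be sketched in plain Python: several derived vectors point back to one parent document, and a search over those vectors returns the parent. This is a minimal illustration, not LangChain's implementation. The `MultiVectorIndex` class, the bag-of-words `embed` function (a toy stand-in for a real embedding model), and the hand-written derived texts (which an LLM would normally generate) are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MultiVectorIndex:
    """Stores several vectors per document; search returns parent documents."""

    def __init__(self) -> None:
        self.docstore: Dict[str, str] = {}            # parent doc id -> full text
        self.vectors: List[Tuple[Counter, str]] = []  # (vector, parent doc id)

    def add(self, doc_id: str, document: str, derived_texts: List[str]) -> None:
        """Index each derived text (chunk, summary, or question) under its parent."""
        self.docstore[doc_id] = document
        for text in derived_texts:
            self.vectors.append((embed(text), doc_id))

    def search(self, query: str, k: int = 1) -> List[str]:
        """Rank derived vectors by similarity, then return up to k distinct parents."""
        q = embed(query)
        ranked = sorted(self.vectors, key=lambda item: cosine(q, item[0]), reverse=True)
        doc_ids: List[str] = []
        for _, doc_id in ranked:
            if doc_id not in doc_ids:
                doc_ids.append(doc_id)
            if len(doc_ids) == k:
                break
        return [self.docstore[d] for d in doc_ids]


index = MultiVectorIndex()
index.add(
    "doc-cats",
    "Cats love dogs. Dogs must be the easiest pets around.",
    ["Which pets are the easiest to keep?", "Cats and dogs get along."],
)
index.add(
    "doc-robots",
    "Tesla robots are advanced now with AI. They will take us to Mars.",
    ["What will take us to Mars?", "Robots with AI are advanced."],
)
print(index.search("easiest pets"))
```

Keeping the parent text in a separate document store means the derived texts can be short and query-like while the retriever still hands the full document to the downstream model.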