# RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

In [None]:
from pypdf import PdfReader
from litellm import completion
import os
import io

In [None]:
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

FILE_PATH = "generate_wikipage.pdf"
reader = PdfReader(FILE_PATH)
num_of_pages = len(reader.pages)
print(f"Number of pages: {num_of_pages}")

Number of pages: 27


In [None]:
# You can revise how much of the pdf file to use for this demo
text = ""
for page_num in range(num_of_pages):
    page = reader.pages[page_num]
    text += page.extract_text() + ' '

In [None]:
print(text[:100])

Assisting in Writing Wikipedia-like Articles From Scratch
with Large Language Models
Yijia Shao Yuch


1) **Building**: RAPTOR recursively embeds, clusters, and summarizes chunks of text to construct a tree with varying levels of summarization from the bottom up. You can create a tree from the text extracted from the PDF above using `RA.add_documents(text)`.

2) **Querying**: At inference time, the RAPTOR model retrieves information from this tree, integrating data across lengthy documents at different abstraction levels. You can perform queries on the tree with `RA.answer_question`.
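To make the bottom-up construction concrete, here is a toy sketch. It is not RAPTOR's implementation: `ToyNode` and `build_toy_tree` are hypothetical names, adjacent chunks are grouped instead of being clustered by embedding, and a simple truncation stands in for the LLM summary. It only illustrates how each layer of summaries becomes the input of the next layer.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    index: int
    text: str
    children: set = field(default_factory=set)


def build_toy_tree(chunks, group_size=2):
    """Group the nodes of one layer, 'summarize' each group into a parent, repeat."""
    all_nodes = {i: ToyNode(i, chunk) for i, chunk in enumerate(chunks)}
    layer = list(all_nodes.values())
    num_layers = 0
    while len(layer) > 1:
        parents = []
        for start in range(0, len(layer), group_size):
            group = layer[start:start + group_size]
            # Stand-in for an LLM summary: keep the first three words of each child.
            summary = " / ".join(" ".join(n.text.split()[:3]) for n in group)
            parent = ToyNode(len(all_nodes), summary, {n.index for n in group})
            all_nodes[parent.index] = parent
            parents.append(parent)
        layer = parents
        num_layers += 1
    return all_nodes, layer, num_layers


chunks = [
    "alpha beta gamma delta",
    "epsilon zeta eta theta",
    "iota kappa lambda mu",
    "nu xi omicron pi",
]
all_nodes, roots, num_layers = build_toy_tree(chunks)
print(len(all_nodes), num_layers, sorted(roots[0].children))
```

Four leaf chunks produce two intermediate summaries and one root, so every parent stores the indices of the nodes it summarizes.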

### Building the tree

In [None]:
from raptor import RetrievalAugmentation

2024-03-20 13:29:18,784 - Loading faiss.
2024-03-20 13:29:18,807 - Successfully loaded faiss.


In [None]:
RA = RetrievalAugmentation()

# construct the tree
RA.add_documents(text)

2024-03-20 13:29:22,228 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <raptor.SummarizationModels.GPT3TurboSummarizationModel object at 0x28911d450>
            Embedding Models: {'OpenAI': <raptor.EmbeddingModels.OpenAIEmbeddingModel object at 0x28f98b990>}
            Cluster Embedding Model: OpenAI
        
        Reduction Dimension: 10
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}
        
2024-03-20 13:29:22,228 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selec
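The config output above lists `Max Tokens: 100`, the size limit for each leaf chunk. The following is a minimal sketch of sentence-based chunking under that kind of limit. It is an illustration only: `split_into_chunks` is a hypothetical helper, and whitespace-separated words stand in for tokens, whereas the config shows the `cl100k_base` tokenizer.

```python
import re


def split_into_chunks(text, max_tokens=100):
    """Pack whole sentences into chunks of at most max_tokens words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and current_len + n_words > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


demo_chunks = split_into_chunks("One two three. Four five six. Seven eight.", max_tokens=6)
print(demo_chunks)
```

A single sentence longer than the limit is kept whole in this sketch, so it ends up as one oversized chunk.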

### Querying from the tree

```python
question = "..."  # any question about the document
RA.answer_question(question=question)
```

In [None]:
question = "what is storm?"

answer = RA.answer_question(question=question)

print("Answer: ", answer)

2024-03-20 13:59:57,205 - Using collapsed_tree


Answer:  STORM is a writing system designed to aid in the synthesis of topic outlines through retrieval and multi-perspective question asking. It operates by discovering diverse perspectives on a given topic, simulating conversations, gathering conversations from sources, refining outlines, and adding trusted sources. The system aims to address challenges at the pre-writing stage, specifically focusing on how to research a topic and create an outline before starting to write. STORM has shown significant improvements in organization and coverage in article creation, outperforming other baseline models. It follows a multi-stage approach involving generating questions, reading and asking experts, splitting queries, and searching for information to enhance the research capabilities of Large Language Models (LLMs).
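The log line `Using collapsed_tree` names the retrieval mode. In the collapsed-tree approach described in the RAPTOR paper, nodes from all layers are compared with the query in one flat pool rather than layer by layer. The sketch below illustrates that ranking step with hand-made two-dimensional vectors. `collapsed_tree_top_k` and `toy_vecs` are hypothetical, and the sketch keeps a fixed number of nodes instead of applying a token budget.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def collapsed_tree_top_k(query_vec, node_vecs, top_k=5):
    """Rank every node, leaf or summary, in one flat pool and keep the top_k indices."""
    ranked = sorted(
        node_vecs,
        key=lambda idx: cosine_similarity(query_vec, node_vecs[idx]),
        reverse=True,
    )
    return ranked[:top_k]


# Hand-made embeddings for four nodes; real embeddings come from the embedding model.
toy_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.7, 0.7], 3: [0.9, 0.1]}
top_nodes = collapsed_tree_top_k([1.0, 0.0], toy_vecs, top_k=2)
print(top_nodes)
```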


In [None]:
# Save the tree by calling RA.save("path/to/save")
SAVE_PATH = "demo/paper"
RA.save(SAVE_PATH)

2024-03-20 13:31:22,883 - Tree successfully saved to demo/paper


In [None]:
# Load the tree back by passing the saved path to RetrievalAugmentation

RA = RetrievalAugmentation(tree=SAVE_PATH)

answer = RA.answer_question(question=question)
print("Answer: ", answer)

2024-03-20 14:01:22,458 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <raptor.SummarizationModels.GPT3TurboSummarizationModel object at 0x292e36ed0>
            Embedding Models: {'OpenAI': <raptor.EmbeddingModels.OpenAIEmbeddingModel object at 0x29292dcd0>}
            Cluster Embedding Model: OpenAI
        
        Reduction Dimension: 10
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}
        
2024-03-20 14:01:22,462 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selec

Answer:  STORM is a writing system designed to aid in the synthesis of topic outlines through retrieval and multi-perspective question asking. It operates by discovering diverse perspectives on a given topic, simulating conversations, gathering conversations from sources, refining outlines, and adding trusted sources. The system aims to address challenges at the pre-writing stage, specifically focusing on how to research a topic and create an outline before starting to write. STORM has shown significant improvements in organization and coverage compared to previous work, with a 25% absolute increase in organization and a 10% increase in coverage. It follows a multi-stage approach involving pre-writing and writing stages to create outlines with multi-level section headings, refined using topics, references, and conversations to generate full articles. STORM outperforms other baseline models in article generation and has been evaluated positively by experts for offering more depth in art
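Since building the tree calls the summarization and embedding models many times, it can be convenient to reuse a saved tree when one exists. The helper below is a small sketch of that pattern. `load_or_build` is a hypothetical name, and the two callables are supplied by the caller.

```python
import os


def load_or_build(save_path, build_fn, load_fn):
    """Return load_fn(save_path) if a saved tree exists, otherwise build_fn()."""
    if os.path.exists(save_path):
        return load_fn(save_path)
    return build_fn()


# In this notebook the two callables could look like this (needs raptor and an API key):
# def build():
#     ra = RetrievalAugmentation()
#     ra.add_documents(text)
#     ra.save(SAVE_PATH)
#     return ra
#
# RA = load_or_build(SAVE_PATH, build, lambda path: RetrievalAugmentation(tree=path))
```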

## Using other Open Source Models for Summarization/QA/Embeddings (using HuggingFace)

> Note: Running these models through HuggingFace `transformers` is extremely slow on CPU. If you do not have a GPU, use a framework such as Ollama to serve custom LLMs instead.

If you want to use other models such as Llama or Mistral, you can define your own model classes by extending RAPTOR's base classes and pass them to RAPTOR through `RetrievalAugmentationConfig`.

In [None]:
import torch
from raptor import BaseSummarizationModel, BaseQAModel, BaseEmbeddingModel, RetrievalAugmentationConfig
from transformers import AutoTokenizer, pipeline

In [None]:
# To use Gemma, you need to authenticate with HuggingFace. Skip this step if you have already downloaded the model.
from huggingface_hub import login
login()


In [None]:
from transformers import AutoTokenizer, pipeline
import torch

# You can define your own Summarization model by extending the base Summarization Class.
class GEMMASummarizationModel(BaseSummarizationModel):
    def __init__(self, model_name="google/gemma-2b-it"):
        # Initialize the tokenizer and the pipeline for the GEMMA model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.summarization_pipeline = pipeline(
            "text-generation",
            model=model_name,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),  # Use "cpu" if CUDA is not available
        )

    def summarize(self, context, max_tokens=150):
        # Format the prompt for summarization
        messages=[
            {"role": "user", "content": f"Write a summary of the following, including as many key details as possible: {context}:"}
        ]

        prompt = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

        # Generate the summary using the pipeline
        outputs = self.summarization_pipeline(
            prompt,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.95
        )

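        # The text-generation pipeline returns the prompt followed by the completion,
        # so slice off the prompt to keep only the generated summary
        summary = outputs[0]["generated_text"][len(prompt):].strip()
        return summary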


In [None]:
class GEMMAQAModel(BaseQAModel):
    def __init__(self, model_name= "google/gemma-2b-it"):
        # Initialize the tokenizer and the pipeline for the model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.qa_pipeline = pipeline(
            "text-generation",
            model=model_name,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
        )

    def answer_question(self, context, question):
        # Apply the chat template for the context and question
        messages=[
              {"role": "user", "content": f"Given Context: {context} Give the best full answer amongst the option to question {question}"}
        ]
        prompt = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

        # Generate the answer using the pipeline
        outputs = self.qa_pipeline(
            prompt,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.95
        )

        # Extracting and returning the generated answer
        answer = outputs[0]["generated_text"][len(prompt):]
        return answer

In [None]:
from sentence_transformers import SentenceTransformer
class SBertEmbeddingModel(BaseEmbeddingModel):
    def __init__(self, model_name="sentence-transformers/multi-qa-mpnet-base-cos-v1"):
        self.model = SentenceTransformer(model_name)

    def create_embedding(self, text):
        return self.model.encode(text)


In [None]:
RAC = RetrievalAugmentationConfig(
    summarization_model=GEMMASummarizationModel(),
    qa_model=GEMMAQAModel(),
    embedding_model=SBertEmbeddingModel(),
)
RA = RetrievalAugmentation(config=RAC)

# Build the tree with the custom models before querying it
RA.add_documents(text)

question = "what is storm?"

answer = RA.answer_question(question=question)

print("Answer: ", answer)

## Exploring Tree structure

In [None]:
tree = RA.tree

In [None]:
nodes = tree.all_nodes
n_layers = tree.num_layers

In [None]:
tree.root_nodes

{134: <raptor.tree_structures.Node at 0x2a5a0ee90>,
 135: <raptor.tree_structures.Node at 0x2a5a0f510>,
 136: <raptor.tree_structures.Node at 0x2a5a0f690>,
 137: <raptor.tree_structures.Node at 0x2a5a0f490>,
 138: <raptor.tree_structures.Node at 0x2a5a0e750>}

In [None]:
def print_tree_layers(root_nodes, all_nodes):
    """
    Walks the tree from the root nodes and prints each node's index and text, layer by layer.

    Args:
      root_nodes: A dictionary mapping node index to the root Node objects.
      all_nodes: A dictionary mapping node index to every Node object in the tree.
    """

    current_layer = list(root_nodes.values())  # Start from the root nodes
    level = 0
    while current_layer:
        print(f"================= Level {level} ================= ")
        next_layer = []
        for node in current_layer:
            print(f"Index: {node.index}, Text: {node.text}\n")
            # Look up each child by index, skipping indices missing from all_nodes
            next_layer.extend(
                all_nodes[child_index]
                for child_index in node.children
                if child_index in all_nodes
            )

        current_layer = next_layer
        level += 1

print_tree_layers(tree.root_nodes, tree.all_nodes)


Index: 134, Text: The document discusses various papers and presentations related to natural language processing (NLP) and information retrieval. Some key details include:

1. Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf presented a paper titled "FLAIR: An easy-to-use framework for state-of-the-art NLP" at the 2019 Conference.

2. Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Z

Index: 135, Text: The STORM research project focuses on the pre-writing stage of creating articles, particularly for the 2022 Winter Olympics opening ceremony. It introduces a method called STORM, involving two stages: pre-writing and writing. In the pre-writing stage, the system creates an outline (O) with multi-level section headings, refined using topics (t), references (R), and conversations to generate the full article (S). Evaluations of the outline are done using metrics like heading soft recall and heading entity recall

Index: 136, Text: Researchers con
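The same root-to-leaf walk can report how many nodes sit at each level instead of printing their text. The sketch below assumes only the `index` and `children` attributes used above. `StubNode` is a stand-in class so the example runs without a built tree, and `count_nodes_per_level` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class StubNode:
    index: int
    children: set = field(default_factory=set)


def count_nodes_per_level(root_nodes, all_nodes):
    """Return {level: node count}, where level 0 holds the root nodes."""
    counts = {}
    current = list(root_nodes.values())
    level = 0
    while current:
        counts[level] = len(current)
        # Move one step down by resolving every child index of the current level.
        current = [
            all_nodes[child]
            for node in current
            for child in node.children
            if child in all_nodes
        ]
        level += 1
    return counts


stub_nodes = {
    0: StubNode(0), 1: StubNode(1), 2: StubNode(2),
    3: StubNode(3, {0, 1}), 4: StubNode(4, {2}),
    5: StubNode(5, {3, 4}),
}
level_counts = count_nodes_per_level({5: stub_nodes[5]}, stub_nodes)
print(level_counts)
```

With a real tree, the call would be `count_nodes_per_level(tree.root_nodes, tree.all_nodes)`.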

## Generating an image from the tree

In [None]:
# Make sure Graphviz is installed and available on your system PATH

from graphviz import Digraph

def create_graph(tree):
    dot = Digraph()

    # Add nodes
    for index, node in tree.all_nodes.items():
        dot.node(str(index), label=str(index))

    # Add edges
    for index, node in tree.all_nodes.items():
        for child_index in node.children:
            dot.edge(str(index), str(child_index))

    # print(dot)
    return dot

# Create and display the graph
graph = create_graph(tree)
graph.attr(layout='dot')  # other layout engines: neato, fdp, sfdp, twopi, circo
graph.render('tree_graph', format='png', cleanup=True)

'tree_graph.png'
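The graph above labels each node with its index only. To make the image easier to read, a label could also carry a short preview of the node text. The helper below is a sketch using only the standard library. `node_label` is a hypothetical name, and the preview width is an arbitrary choice.

```python
import textwrap


def node_label(index, text, width=40):
    """Build a one-line label: the node index plus a shortened preview of its text."""
    preview = textwrap.shorten(text, width=width, placeholder="...")
    return f"{index}: {preview}"


label = node_label(135, "The STORM research project focuses on the pre-writing stage", width=30)
print(label)
```

Inside `create_graph`, the node call would then become `dot.node(str(index), label=node_label(index, node.text))`.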

In [None]:
from litellm import completion

# Optional: pass this function as logger_fn to log the details of each model call
def my_custom_logging_fn(model_call_dict):
    print(f"model call details: {model_call_dict}")

# Streaming response generation
def generate_responses():
    response = completion(
        model="ollama/gemma:2b",
        messages=[{ "content": "who are you?", "role": "user"}],
        api_base="http://localhost:11434",
        stream=True,
        # logger_fn=my_custom_logging_fn
    )
    for chunk in response:
        content = chunk['choices'][0]['delta'].content
        if content is not None:  # Skip over responses with None content
            yield content

# Usage
for response in generate_responses():
    print(response, end="")

13:35:26 - LiteLLM:INFO:

POST Request Sent from LiteLLM:
curl -X POST \
http://localhost:11434/api/generate \
-d '{'model': 'gemma:2b', 'prompt': 'who are you?', 'options': {}, 'stream': True}'



I am a large language model, trained by Google. I am a conversational AI that can engage in human-like conversations on a wide range of topics.

**Here are some of my capabilities:**

* Natural language processing (NLP)
* Natural language generation (NLG)
* Machine learning
* Knowledge base access
* Question answering
* Summarization
* Translation
* Storytelling

I am still under development, but I am constantly learning and improving. I am here to assist you with your queries and provide you with information and entertainment.

In [None]:
# Non-Streaming response generation


response = completion(
        model="ollama/gemma:2b",
        messages=[{ "content": "who are you?", "role": "user"}],
        api_base="http://localhost:11434",
    )

print(response.choices[0].message.content)

## Custom Summarization model using Ollama

> You need to install and run Ollama and pull the model you want to use, e.g. `gemma:2b`. See the docs [here](https://ollama.com/).
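Before running the cells in this section, it can help to confirm that the local Ollama server is up. The sketch below sends a GET request to Ollama's model-list endpoint (`/api/tags`) on the default address used in this notebook. `ollama_is_reachable` is a hypothetical helper, and the timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request


def ollama_is_reachable(api_base="http://localhost:11434", timeout=2.0):
    """Return True if a GET to Ollama's model-list endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{api_base}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat the server as not running.
        return False


if not ollama_is_reachable():
    print("Ollama does not appear to be running on http://localhost:11434")
```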

In [None]:
from raptor import BaseSummarizationModel


# You can define your own Summarization model by extending the base Summarization Class.
class GEMMASummarizationModel(BaseSummarizationModel):
    def __init__(self, model_name="ollama/gemma:2b"):
        self.model = model_name

    def summarize(self, context, max_tokens=150):
        # Format the prompt for summarization
        messages=[
            { "content": "You are an expert in summarizing text.", "role": "system"},
            {"role": "user", "content": f"Write a summary of the following, including as many key details as possible: {context}:"}
        ]

        response = completion(
            model=self.model,
            messages=messages,
            max_tokens=max_tokens,  # cap the summary at the length RAPTOR requests
            api_base="http://localhost:11434"
        )
        return response.choices[0].message.content


2024-03-20 11:14:17,242 - Loading faiss.
2024-03-20 11:14:17,267 - Successfully loaded faiss.


In [None]:
from raptor import BaseQAModel

class GEMMAQAModel(BaseQAModel):
    def __init__(self, model_name= "ollama/gemma:2b"):
        self.model = model_name

    def answer_question(self, context, question):
        # Apply the chat template for the context and question
        messages=[
              {"role": "user", "content": f"Given Context: {context} Give the best full answer amongst the option to question {question}"}
        ]
        response = completion(
            model=self.model,
            messages=messages,
            api_base="http://localhost:11434"
        )
        return response.choices[0].message.content

In [None]:
from raptor import BaseEmbeddingModel

from sentence_transformers import SentenceTransformer

class SBertEmbeddingModel(BaseEmbeddingModel):
    def __init__(self, model_name="sentence-transformers/multi-qa-mpnet-base-cos-v1"):
        self.model = SentenceTransformer(model_name)

    def create_embedding(self, text):
        return self.model.encode(text)

In [None]:
from raptor import RetrievalAugmentationConfig, RetrievalAugmentation

RAC = RetrievalAugmentationConfig(summarization_model=GEMMASummarizationModel(), qa_model=GEMMAQAModel(), embedding_model=SBertEmbeddingModel())

2024-03-20 11:14:43,364 - Load pretrained SentenceTransformer: sentence-transformers/multi-qa-mpnet-base-cos-v1
2024-03-20 11:14:43,708 - Use pytorch device: cpu


In [None]:
RA = RetrievalAugmentation(config=RAC)

2024-03-20 11:14:47,401 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <__main__.GEMMASummarizationModel object at 0x2a07a3ad0>
            Embedding Models: {'EMB': <__main__.SBertEmbeddingModel object at 0x2a07ad090>}
            Cluster Embedding Model: EMB
        
        Reduction Dimension: 10
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}
        
2024-03-20 11:14:47,406 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization 

In [None]:
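# Uncomment to silence log output:
# import logging; logging.disable(logging.CRITICAL)

# Use only the first two pages of the PDF for this demo
text = ""
for page_num in range(2):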
    page = reader.pages[page_num]
    text += page.extract_text() + ' '


RA.add_documents(text)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
question = "what is storm?"

answer = RA.answer_question(question=question)

print("Answer: ", answer)

2024-03-20 13:35:50,257 - Using collapsed_tree


Answer:  STORM is a writing system designed to aid in the synthesis of topic outlines through retrieval and multi-perspective question asking. It operates by discovering diverse perspectives on a given topic, simulating conversations, gathering conversations from sources, refining outlines, and adding trusted sources. The system aims to address challenges at the pre-writing stage, specifically focusing on how to research a topic and create an outline before starting to write. STORM has shown significant improvements in organization and coverage in article creation, outperforming other baseline models. It follows a multi-stage approach involving generating questions, reading and asking experts, splitting queries, and searching for information to enhance the research capabilities of Large Language Models (LLMs).
