### RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

#### Overview

###### RAPTOR is an advanced information retrieval and question-answering system that combines hierarchical document summarization, embedding-based retrieval, and contextual answer generation. It aims to efficiently handle large document collections by creating a multi-level tree of summaries, allowing for both broad and detailed information retrieval.

#### Motivation

###### Traditional retrieval systems often struggle with large document sets, either missing important details or getting overwhelmed by irrelevant information. RAPTOR addresses this by creating a hierarchical structure of the document collection, allowing it to navigate between high-level concepts and specific details as needed.

#### Key Components

###### 1. Tree Building: Creates a hierarchical structure of document summaries.
###### 2. Embedding & Clustering: Organizes documents and summaries based on semantic similarity.
###### 3. Vectorstore: Efficiently stores and retrieves document and summary embeddings.
###### 4. Contextual Retriever: Selects the most relevant information for a given query.
###### 5. Answer Generation: Produces coherent responses based on retrieved information.

#### Method Details

##### Tree Building

###### 1. Start with original documents at level 0.
###### 2. For each level:
######          -> Embed the texts using a language model
######          -> Cluster the embeddings (e.g., using a Gaussian Mixture Model)
######          -> Generate summaries for each cluster
######          -> Use these summaries as the texts for the next level
###### 3. Continue until reaching a single summary or a maximum level.
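
###### A minimal sketch of the loop above, assuming the embedding, clustering, and summarization steps are passed in as plain functions. The names `build_tree`, `toy_embed`, `toy_cluster`, and `toy_summarize` are hypothetical and exist only for illustration; in practice the stand-ins would be replaced by an embedding model, a clustering routine such as a Gaussian Mixture Model, and an LLM summarizer.

```python
from typing import Callable, Dict, List

Vector = List[float]


def build_tree(
    texts: List[str],
    embed: Callable[[List[str]], List[Vector]],
    cluster: Callable[[List[Vector]], List[int]],
    summarize: Callable[[List[str]], str],
    max_levels: int = 3,
) -> Dict[int, List[str]]:
    """Return {level: texts}, where level 0 holds the original documents."""
    tree: Dict[int, List[str]] = {0: list(texts)}
    for level in range(1, max_levels + 1):
        current = tree[level - 1]
        if len(current) <= 1:
            break  # a single summary is the root; nothing left to merge
        labels = cluster(embed(current))
        # Group the texts of this level by their cluster label.
        groups: Dict[int, List[str]] = {}
        for text, label in zip(current, labels):
            groups.setdefault(label, []).append(text)
        # One summary per cluster becomes a text of the next level.
        tree[level] = [summarize(groups[label]) for label in sorted(groups)]
    return tree


# Toy stand-ins (assumptions) so the sketch runs without an API key.
def toy_embed(texts: List[str]) -> List[Vector]:
    return [[float(len(t))] for t in texts]


def toy_cluster(vectors: List[Vector]) -> List[int]:
    # Pair up neighbours: items 0 and 1 -> cluster 0, items 2 and 3 -> cluster 1, ...
    return [i // 2 for i in range(len(vectors))]


def toy_summarize(texts: List[str]) -> str:
    return " + ".join(texts)


tree = build_tree(["a", "b", "c", "d"], toy_embed, toy_cluster, toy_summarize)
for level, level_texts in tree.items():
    print(level, level_texts)
```

###### With four inputs and pairwise clustering, the sketch produces two summaries at level 1 and a single root summary at level 2, at which point the loop stops.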

##### Embedding Retrieval

###### 1. Embed all documents and summaries from all levels of the tree.
###### 2. Store these embeddings in a vectorstore (e.g., FAISS) for efficient similarity search.
###### 3. For a given query:
######        -> Embed the query
######        -> Retrieve the most similar documents/summaries from the vectorstore
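
###### The retrieval step can be sketched without a vectorstore as a brute-force cosine-similarity search over entries from every level of the tree. This is a plain-Python stand-in for what a library such as FAISS does far more efficiently; the function names (`cosine`, `retrieve`) and the hand-written two-dimensional vectors are assumptions made for illustration, not real embeddings.

```python
import math
from typing import List, Tuple

Vector = List[float]
Entry = Tuple[int, str]  # (tree level, text)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Vector, entries: List[Entry],
             vectors: List[Vector], k: int = 2) -> List[Entry]:
    """Return the k entries whose vectors are most similar to the query."""
    scored = sorted(
        zip(vectors, entries),
        key=lambda pair: cosine(query_vec, pair[0]),
        reverse=True,
    )
    return [entry for _, entry in scored[:k]]


# Documents (level 0) and a summary (level 1) live in the same flat store.
entries = [(0, "cats purr"), (0, "dogs bark"), (1, "pets make sounds")]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # made-up embeddings

top = retrieve([0.9, 0.1], entries, vectors, k=2)
print(top)
```

###### Because summaries and original documents share one index, a single query can return a mix of levels: here the closest leaf document and the level-1 summary.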

##### Contextual Information

###### 1. Take the retrieved documents/summaries.
###### 2. Use a language model to extract only the most relevant parts for the given query.
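
###### To make the idea of the extraction step concrete, the sketch below swaps the language model for a crude keyword-overlap filter: it keeps only the sentences that share at least one word with the query. This is not how an LLM-based extractor works internally; `extract_relevant` is a hypothetical helper that only illustrates the input and output shape of the step.

```python
import re
from typing import List


def extract_relevant(query: str, passages: List[str]) -> List[str]:
    """Keep, per passage, only the sentences sharing a word with the query."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    kept = []
    for passage in passages:
        # Split on whitespace that follows sentence-ending punctuation.
        sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
        hits = [
            s for s in sentences
            if query_terms & set(re.findall(r"\w+", s.lower()))
        ]
        if hits:  # passages with no matching sentence are dropped entirely
            kept.append(" ".join(hits))
    return kept


passages = [
    "Rain forms when water vapor condenses. The museum opens at nine.",
    "Tickets cost five dollars.",
]
kept = extract_relevant("What causes rain?", passages)
print(kept)
```

###### A real extractor would judge relevance semantically rather than by shared words, so common words in the query would not cause false matches as they can here.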

##### Answer Generation

###### 1. Combine the relevant parts into a context.
###### 2. Use a language model to generate an answer based on this context and the original query.
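
###### The two steps above amount to assembling a prompt and handing it to a model. The sketch below assumes a hypothetical prompt wording and takes the model as a plain callable (`generate`), so it runs without an API key; the template text and the names `build_prompt` and `answer` are illustrative, not taken from any library.

```python
from typing import Callable, List

# Hypothetical prompt wording, chosen for illustration.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)


def build_prompt(question: str, relevant_parts: List[str]) -> str:
    """Join the extracted parts into one context and fill the template."""
    context = "\n\n".join(relevant_parts)
    return PROMPT_TEMPLATE.format(context=context, question=question)


def answer(question: str, relevant_parts: List[str],
           generate: Callable[[str], str]) -> str:
    """Generate an answer with any callable that maps a prompt to text."""
    return generate(build_prompt(question, relevant_parts))


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; it only reports the prompt size.
    return f"(model output for a {len(prompt)}-character prompt)"


parts = ["Rain forms when water vapor condenses."]
prompt = build_prompt("What causes rain?", parts)
print(prompt)
print(answer("What causes rain?", parts, fake_llm))
```

###### Passing the model in as a callable keeps the prompt assembly testable on its own; a chat model's invoke method could be wrapped to fit the same signature.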

#### Benefits of this Approach

###### 1. Scalability: Can handle large document collections by working with summaries at different levels.
###### 2. Flexibility: Capable of providing both high-level overviews and specific details.
###### 3. Context-Awareness: Retrieves information from the most appropriate level of abstraction.
###### 4. Efficiency: Uses embeddings and vectorstore for fast retrieval.
###### 5. Traceability: Maintains links between summaries and original documents, allowing for source verification.
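
###### Traceability depends on each summary remembering which texts it was built from. One possible way to represent this, shown as a sketch with a hypothetical `Node` record and `source_documents` helper, is to store child identifiers on every summary node and walk them down to level 0.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    node_id: str
    text: str
    level: int
    children: List[str] = field(default_factory=list)  # ids one level down


def source_documents(node_id: str, nodes: Dict[str, Node]) -> List[Node]:
    """Collect the original (childless) documents beneath a node."""
    node = nodes[node_id]
    if not node.children:
        return [node]  # an original document is its own source
    leaves: List[Node] = []
    for child_id in node.children:
        leaves.extend(source_documents(child_id, nodes))
    return leaves


nodes = {
    "d1": Node("d1", "doc one", 0),
    "d2": Node("d2", "doc two", 0),
    "d3": Node("d3", "doc three", 0),
    "s1": Node("s1", "summary of d1 and d2", 1, ["d1", "d2"]),
    "s2": Node("s2", "summary of d3", 1, ["d3"]),
    "root": Node("root", "summary of everything", 2, ["s1", "s2"]),
}

sources = [n.node_id for n in source_documents("root", nodes)]
print(sources)
```

###### Given a retrieved summary, the same walk returns the documents that support it, which is what makes source verification possible.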

#### Conclusion

###### RAPTOR represents a significant advancement in information retrieval and question-answering systems. By combining hierarchical summarization with embedding-based retrieval and contextual answer generation, it offers a powerful and flexible approach to handling large document collections. The system's ability to navigate different levels of abstraction allows it to provide relevant and contextually appropriate answers to a wide range of queries.

###### While RAPTOR shows great promise, future work could focus on optimizing the tree-building process, improving summary quality, and enhancing the retrieval mechanism to better handle complex, multi-faceted queries. Additionally, integrating this approach with other AI technologies could lead to even more sophisticated information processing systems.

### Imports & Setup

In [None]:
import numpy as np
import pandas as pd
from typing import List, Dict, Any
from sklearn.mixture import GaussianMixture
from langchain.chains.llm import LLMChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.schema import AIMessage
from langchain.docstore.document import Document

import matplotlib.pyplot as plt
import logging
import os
import sys
from dotenv import load_dotenv

# Add the parent directory to the path since we work with notebooks
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from helper_functions import *
from evaluation.evaluate_rag import *

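# Load environment variables from a .env file
load_dotenv()

# Set the OpenAI API key, failing early with a clear message if it is missing
# (assigning None to os.environ would raise a TypeError)
openai_api_key = os.getenv("OPENAI_API_KEY")
if openai_api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
os.environ["OPENAI_API_KEY"] = openai_api_key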


##### Define Logging, LLMs & Embeddings

In [None]:
# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

embeddings = OpenAIEmbeddings()
llm = ChatOpenAI(model_name="gpt-4o-mini")

##### Helper Functions

In [None]:
def extract_text(item):
    """Extract text content from either a string or AIMessage object."""
    if isinstance(item, AIMessage):
        return item.content