# bRAG: Indexing and Advanced Retrieval

![indexing_overview](./image/indexing_overview.png)

## Preface: Chunking

We don't explicitly cover document chunking / splitting.

For an excellent review of document chunking, see this video from Greg Kamradt:

https://www.youtube.com/watch?v=8OJC21T2SL4

## Pre-requisites (optional but recommended)

### Only do the first step if you have never created a virtual environment for this repository. Otherwise, make sure that the Python Kernel that you selected is from your `venv/` folder.

In [23]:
# Create virtual environment
! python3 -m venv ../venv

In [24]:
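# Activate virtual Python environment
# (each `!` command runs in its own subshell, so this activation does not persist across cells;
# the kernel selection checked in the next cell is what actually determines the interpreter)
! source ../venv/bin/activate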

In [25]:
# If your Python is not from your venv path, ensure that your IDE's kernel selection (on the top right corner) is set to the correct path 
# (your path output should contain "...venv/bin/python")

! which python

/Users/taha/Desktop/bRAGAI/code/gh/bRAG-langchain/venv/bin/python


In [26]:
# Install all packages
! pip3 install -r ../requirements.txt --quiet

### * If you choose to skip the pre-requisites and install only the packages specific to this notebook using your global Python path environment, execute the command below; otherwise, proceed to the next step.

In [27]:
! pip3 install --quiet langchain_community tiktoken langchain-openai langchainhub chromadb langchain youtube-transcript-api pytube yt_dlp

## Environment

`(1) Packages`

In [10]:
import os
from dotenv import load_dotenv

# Load all environment variables from .env file
load_dotenv()

# Access the environment variables
langchain_tracing_v2 = os.getenv('LANGCHAIN_TRACING_V2')
langchain_endpoint = os.getenv('LANGCHAIN_ENDPOINT')
langchain_api_key = os.getenv('LANGCHAIN_API_KEY')

## LLM
openai_api_key = os.getenv('OPENAI_API_KEY')

## Pinecone Vector Database
pinecone_api_key = os.getenv('PINECONE_API_KEY')
pinecone_api_host = os.getenv('PINECONE_API_HOST')
index_name = os.getenv('PINECONE_INDEX_NAME')


`(2) LangSmith`

https://docs.smith.langchain.com/

In [11]:
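# Re-export the LangSmith settings; skip any that are missing, since os.environ rejects None
for _name, _value in [('LANGCHAIN_TRACING_V2', langchain_tracing_v2),
                      ('LANGCHAIN_ENDPOINT', langchain_endpoint),
                      ('LANGCHAIN_API_KEY', langchain_api_key)]:
    if _value is not None:
        os.environ[_name] = _value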

`(3) API Keys`

In [12]:
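# os.environ rejects None, so only set the key when it was found in the environment
if openai_api_key is not None:
    os.environ['OPENAI_API_KEY'] = openai_api_key
openai_model = "gpt-3.5-turbo"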

## bRAG: Multi-representation Indexing

Flow: 

![multi_representation](./image/multi-representation.png)

Docs:

https://blog.langchain.dev/semi-structured-multi-modal-rag/

https://python.langchain.com/docs/how_to/multi_vector/

Paper:

https://arxiv.org/abs/2312.06648

In [13]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [14]:
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(model="gpt-3.5-turbo",max_retries=0)
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})

In [15]:
from langchain.storage import InMemoryByteStore
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Docs linked to summaries
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))



In [16]:
query = "Memory in agents"
sub_docs = vectorstore.similarity_search(query,k=1)
sub_docs[0]

Document(metadata={'doc_id': '7e716fb3-088f-41f2-a80f-4c5f3a447c01'}, page_content='The document discusses the concept of building LLM (large language model) powered autonomous agents. It explains the key components of such agents, including planning, memory, and tool use, with examples and case studies like AutoGPT, GPT-Engineer, and ChemCrow. It highlights challenges such as finite context length, reliability of the natural language interface, and planning difficulties. The article provides a detailed overview of how LLMs can be used to create powerful general problem-solving agents, with a focus on task decomposition, memory types, and tool use capabilities.')

In [17]:
# invoke() replaces the deprecated get_relevant_documents(); the n_results argument had no effect
retrieved_docs = retriever.invoke(query)
retrieved_docs[0].page_content[0:500]

Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2


"\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n|\n\n\n\n\n\n\nPosts\n\n\n\n\nArchive\n\n\n\n\nSearch\n\n\n\n\nTags\n\n\n\n\nFAQ\n\n\n\n\nemojisearch.app\n\n\n\n\n\n\n\n\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\n \n\n\nTable of Contents\n\n\n\nAgent System Overview\n\nComponent One: Planning\n\nTask Decomposition\n\nSelf-Reflection\n\n\nComponent Two: Memory\n\nTypes of Memory\n\nMaximum Inner Product Search (MIPS"

A related idea is the [parent document retriever](https://python.langchain.com/docs/how_to/parent_document_retriever/).
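
The pattern used above, searching over compact summaries and then returning the full parent document through a shared `doc_id`, can be sketched without any external services. This is a minimal illustration, not LangChain's implementation: the bag-of-words `embed` function, the `cosine` helper, the `retrieve` function, and the toy documents are all assumptions made up for this example. They stand in for `OpenAIEmbeddings`, Chroma, `MultiVectorRetriever`, and the loaded blog posts.

```python
import math
import uuid
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical full documents and hand-written summaries (an LLM writes these in the notebook)
full_docs = [
    "Long article about agents ... planning, memory, tool use, many details ...",
    "Long article about data quality ... annotation, labels, rater agreement ...",
]
summaries = [
    "agents planning memory tool use",
    "human data quality annotation labels",
]

id_key = "doc_id"
doc_ids = [str(uuid.uuid4()) for _ in full_docs]

# "Vector store": one entry per summary, tagged with the id of its parent document
summary_index = [
    {"vector": embed(s), "metadata": {id_key: doc_ids[i]}, "page_content": s}
    for i, s in enumerate(summaries)
]

# "Doc store": parent id -> full document
docstore = dict(zip(doc_ids, full_docs))


def retrieve(query: str, k: int = 1) -> list:
    """Rank summaries by similarity to the query, then return the linked full documents."""
    q = embed(query)
    ranked = sorted(summary_index, key=lambda e: cosine(q, e["vector"]), reverse=True)
    return [docstore[e["metadata"][id_key]] for e in ranked[:k]]


result = retrieve("memory in agents")
```

The query is matched against the short summaries only, but `result` holds the full parent text, which is what gets passed to the LLM as context.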

## bRAG: RAPTOR

Flow:

![raptor](./image/raptor.png)

Deep dive video:

https://www.youtube.com/watch?v=jbGchdTL7d0

Paper:

https://arxiv.org/pdf/2401.18059.pdf

Full code:

https://github.com/langchain-ai/langchain/blob/master/cookbook/RAPTOR.ipynb
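
As a rough orientation before reading the linked paper and notebook: RAPTOR builds a tree bottom-up by grouping the texts at the current level, summarizing each group, and repeating on the summaries until a single root remains. Retrieval can then search over nodes from every level. The sketch below shows only that recursive structure. Two simplifications are assumptions of this example rather than the paper's method: `group_adjacent` batches neighbouring texts by position where the paper clusters embeddings, and `fake_summarize` truncates text where the real pipeline calls an LLM. All function names here are hypothetical.

```python
def group_adjacent(texts, size=2):
    """Stand-in for clustering: batch neighbouring texts into groups of `size`."""
    return [texts[i:i + size] for i in range(0, len(texts), size)]


def fake_summarize(group, max_words=6):
    """Stand-in for an LLM summary: keep the first few words of the joined group."""
    words = " ".join(group).split()
    return " ".join(words[:max_words])


def build_tree(leaves, size=2):
    """Return a list of levels: level 0 holds the leaves, the last level holds a single root."""
    if size < 2:
        raise ValueError("size must be at least 2, otherwise the levels never shrink")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        groups = group_adjacent(levels[-1], size)
        levels.append([fake_summarize(g) for g in groups])
    return levels


leaves = [
    "chunk one text",
    "chunk two text",
    "chunk three text",
    "chunk four text",
    "chunk five text",
]
levels = build_tree(leaves)

# "Collapsed tree": every node from every level becomes a retrieval candidate
all_nodes = [node for level in levels for node in level]
```

Indexing `all_nodes` in a single vector store lets a query match either a detailed leaf chunk or a higher-level summary, depending on how broad the question is.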

## bRAG: ColBERT

RAGatouille makes it simple to use ColBERT.

ColBERT generates a contextually influenced vector for each token in the passages. 

ColBERT similarly generates vectors for each token in the query.

Then, the score of each document is the sum, over all query embeddings, of the maximum similarity between that query embedding and any of the document embeddings.

See [here](https://hackernoon.com/how-colbert-helps-developers-overcome-the-limits-of-rag) and [here](https://python.langchain.com/docs/integrations/retrievers/ragatouille) and [here](https://til.simonwillison.net/llms/colbert-ragatouille).
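
The scoring rule described above (often called MaxSim) can be written out in a few lines. This is a minimal sketch under stated assumptions, not ColBERT's or RAGatouille's code: the tiny 2-dimensional vectors are invented for illustration, similarity is taken as a plain dot product, and `dot` and `maxsim_score` are hypothetical helper names.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query token vectors, of the best similarity to any document token vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical per-token embeddings (real ColBERT vectors are learned and much larger)
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a close match for each query token
doc_b = [[0.5, 0.5], [0.75, 0.0]]             # only partial matches

score_a = maxsim_score(query_vecs, doc_a)
score_b = maxsim_score(query_vecs, doc_b)

# Documents are ranked by descending score
ranking = sorted([("doc_a", score_a), ("doc_b", score_b)], key=lambda x: x[1], reverse=True)
```

Because each query token picks its own best-matching document token, a document scores highly only when it covers every part of the query, which is what distinguishes this late-interaction scoring from comparing a single vector per document.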

In [18]:
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

[Jan 18, 11:33:55] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




In [19]:
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.

    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"}

    response = requests.get(URL, params=params, headers=headers)
    data = response.json()

    # Extracting page content
    page = next(iter(data["query"]["pages"].values()))
    return page["extract"] if "extract" in page else None

full_document = get_wikipedia_page("Hayao_Miyazaki")

In [20]:
RAG.index(
    collection=[full_document],
    index_name="Miyazaki-123",
    max_document_length=180,
    split_documents=True,
)

This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[Jan 18, 11:34:14] #> Creating directory .ragatouille/colbert/indexes/Miyazaki-123 


[Jan 18, 11:34:14] [0] 		 #> Encoding 122 passages..


100%|██████████| 4/4 [00:03<00:00,  1.29it/s]

[Jan 18, 11:34:17] [0] 		 avg_doclen_est = 131.80328369140625 	 len(local_sample) = 122
[Jan 18, 11:34:17] [0] 		 Creating 1,024 partitions.
[Jan 18, 11:34:17] [0] 		 *Estimated* 16,080 embeddings.
[Jan 18, 11:34:17] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/Miyazaki-123/plan.json ..





used 20 iterations (0.3251s) to cluster 15276 items into 1024 clusters


0it [00:00, ?it/s]

[Jan 18, 11:34:18] [0] 		 #> Encoding 122 passages..


100%|██████████| 4/4 [00:02<00:00,  1.70it/s]
1it [00:02,  2.38s/it]
100%|██████████| 1/1 [00:00<00:00, 1258.04it/s]

[Jan 18, 11:34:20] #> Optimizing IVF to store map from centroids to list of pids..
[Jan 18, 11:34:20] #> Building the emb2pid mapping..
[Jan 18, 11:34:20] len(emb2pid) = 16080



100%|██████████| 1024/1024 [00:00<00:00, 199506.10it/s]

[Jan 18, 11:34:20] #> Saved optimized IVF to .ragatouille/colbert/indexes/Miyazaki-123/ivf.pid.pt
Done indexing!





'.ragatouille/colbert/indexes/Miyazaki-123'

In [21]:
results = RAG.search(query="What animation studio did Miyazaki found?", k=3)
results

Loading searcher for index Miyazaki-123 for the first time... This may take a few seconds
[Jan 18, 11:34:30] #> Loading codec...
[Jan 18, 11:34:30] #> Loading IVF...
[Jan 18, 11:34:30] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




[Jan 18, 11:34:35] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 1798.59it/s]

[Jan 18, 11:34:35] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 374.83it/s]

[Jan 18, 11:34:35] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...





[Jan 18, 11:34:40] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What animation studio did Miyazaki found?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  7284,  2996,  2106,  2771,  3148, 18637,  2179,
         1029,   102,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])





[{'content': '=== Studio Ghibli ===\n\n\n==== Early films (1985–1995) ====\nFollowing the success of Nausicaä of the Valley of the Wind, Miyazaki and Takahata founded the animation production company Studio Ghibli on June 15, 1985, as a subsidiary of Tokuma Shoten, with offices in Kichijōji designed by Miyazaki. Miyazaki named the studio after the Caproni Ca.309 and the Italian word meaning "a hot wind that blows in the desert"; the name had been registered a year earlier.',
  'score': 25.893789291381836,
  'rank': 1,
  'document_id': 'dc89f3d9-1989-4bf1-b1c0-5c245155c6d4',
  'passage_id': 42},
 {'content': 'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. He co-founded Studio Ghibli and serves as its honorary chairman. Over the course of his career, Miyazaki has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one

In [22]:
retriever = RAG.as_langchain_retriever(k=3)
retriever.invoke("What animation studio did Miyazaki found?")



[Document(metadata={}, page_content='=== Studio Ghibli ===\n\n\n==== Early films (1985–1995) ====\nFollowing the success of Nausicaä of the Valley of the Wind, Miyazaki and Takahata founded the animation production company Studio Ghibli on June 15, 1985, as a subsidiary of Tokuma Shoten, with offices in Kichijōji designed by Miyazaki. Miyazaki named the studio after the Caproni Ca.309 and the Italian word meaning "a hot wind that blows in the desert"; the name had been registered a year earlier.'),
 Document(metadata={}, page_content='Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. He co-founded Studio Ghibli and serves as its honorary chairman. Over the course of his career, Miyazaki has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one of the most accomplished filmmakers in the history of animation.\nBorn in 