# Context Enrichment Window for Document Retrieval with Llama Index

This notebook introduces a new RAG technique that leverages a context enrichment window to improve document retrieval. By incorporating surrounding context into text chunks, the system aims to return more coherent and contextually rich results when querying the vector index.

## System Introduction

The system integrates a context enrichment mechanism into the Llama Index pipeline. It ingests documents from a corpus, enriches each document chunk with a configurable window of adjacent sentences, and builds a vector index. At retrieval time, custom postprocessing reconstructs the most coherent piece of text around the query match. This retains the advantages of vector search while addressing its common issue of returning isolated text fragments.

## Underlying Concept

Traditional vector-based retrieval may return segments of text that lack context. This technique enriches document chunks by adding a "window" of surrounding sentences. The idea is to provide additional context to each chunk, ensuring that when a document is retrieved, it comes with enough contextual information to be coherent and complete. This approach also allows for flexible adjustment of the window size to suit different datasets and application needs.
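
The windowing idea can be illustrated without any library. The following is a minimal sketch in plain Python, not the project's implementation: the regex sentence splitter and the `build_windows` helper are assumptions made for illustration, and the `window` / `original_text` key names are chosen to mirror the metadata keys that `SentenceWindowNodeParser` uses by default.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_windows(sentences: list[str], window_size: int) -> list[dict]:
    # For each sentence, keep the sentence itself plus up to `window_size`
    # neighbors on each side, joined into one context string.
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "original_text": sentence,
                "window": " ".join(sentences[start:end]),
            }
        )
    return nodes


text = (
    "Vector search is fast. It can return fragments. "
    "Windows add context. Results read better."
)
nodes = build_windows(split_sentences(text), window_size=1)
for node in nodes:
    print(repr(node["original_text"]), "->", repr(node["window"]))
```

Only the single sentence would be embedded and matched against the query; the wider `window` string travels along as metadata and is used later to rebuild the passage.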

## System Components

1. **Document Retrieval Module:** Loads a persisted llama_index vector store and retrieves documents relevant to a query. 
   - *Example Prompt for Retrieval:* "Find the latest market trends using the llama_index tool."

2. **Corpus Ingestion Module:** Reads documents from a specified directory using the SimpleDirectoryReader, then creates or updates the llama_index vector store.
   - **Bulk Ingestion:** You can ingest a corpus by directly uploading files to the `/tools/rag/llama_index_context_window/corpus` folder. To start the batch processing, simply make an API call:
     ```bash
     curl -X POST \
       http://localhost:5000/llama-index-ingest-corpus \
       -H "Content-Type: application/json" \
       -d '{"isContextWindow": true}'
     ```
     The script processes the information in the files, transferring it to the vector and graph database. By default, the SimpleDirectoryReader attempts to read any files it encounters (treating them as text), and explicitly supports file types such as:
     - .csv (Comma-Separated Values)
     - .docx (Microsoft Word)
     - .epub (EPUB eBook format)
     - .hwp (Hangul Word Processor)
     - .ipynb (Jupyter Notebook)
     - .jpeg, .jpg (JPEG image)
     - .mbox (MBOX email archive)
     - .md (Markdown)
     - .mp3, .mp4 (Audio and video)
     - .pdf (Portable Document Format)
     - .png (Portable Network Graphics)
     - .ppt, .pptm, .pptx (Microsoft PowerPoint)
     
     For more details, refer to the [SimpleDirectoryReader Documentation](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/).
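
For scripted use, the same ingestion call can be issued from Python. This is a sketch using only the standard library; it reuses the URL, header, and payload from the `curl` command above, while the helper names `build_ingest_request` and `trigger_ingestion` are hypothetical and the shape of the server's response is not specified here, so it is returned as raw text.

```python
import json
import urllib.request

# Same endpoint as in the curl example above.
INGEST_URL = "http://localhost:5000/llama-index-ingest-corpus"


def build_ingest_request(is_context_window: bool = True) -> urllib.request.Request:
    # Build the POST request with a JSON body; nothing is sent yet.
    payload = json.dumps({"isContextWindow": is_context_window}).encode("utf-8")
    return urllib.request.Request(
        INGEST_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_ingestion() -> str:
    # Send the request and return the raw response body.
    # Requires the service to be running locally on port 5000.
    with urllib.request.urlopen(build_ingest_request()) as response:
        return response.read().decode("utf-8")
```

Calling `trigger_ingestion()` starts the batch run once the files are in the corpus folder.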

## How It Works

### 1. Document Ingestion

- The ingestion module reads documents from a corpus directory using the `SimpleDirectoryReader`.
- A `SentenceWindowNodeParser` is applied via an `IngestionPipeline` to enrich each document chunk with a window of adjacent sentences. This provides additional context around each text segment.
- The module then either creates a new index or updates an existing one by inserting the enriched nodes, and persists the index to disk.

### 2. Document Retrieval

- The retrieval module loads the persisted index from disk.
- A custom metadata replacement postprocessor scans the enriched context window to extract the segment of text surrounding the original sentence, based on a configurable maximum number of adjacent characters.
- If a similarity cutoff is specified, an additional postprocessor refines the results to ensure only highly relevant nodes are returned.
- The final output is a list of processed text fragments that maintain coherent context for downstream applications.
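
The retrieval steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code of `CustomMetadataReplacementPostProcessor`: the dictionary-based hit format and the helpers `trim_window` and `postprocess` are hypothetical, and the real postprocessor may cut at different boundaries (this sketch cuts at raw character offsets, so it can split a word).

```python
from typing import Optional


def trim_window(window: str, original: str, max_adjacent_chars: int) -> str:
    # Locate the matched sentence inside its enriched window and keep at most
    # `max_adjacent_chars` characters on each side of it.
    pos = window.find(original)
    if pos == -1:
        # Assumption: fall back to the bare sentence if it is not found.
        return original
    start = max(0, pos - max_adjacent_chars)
    end = min(len(window), pos + len(original) + max_adjacent_chars)
    return window[start:end].strip()


def postprocess(
    hits: list[dict],
    max_adjacent_chars: int,
    similarity_cutoff: Optional[float] = None,
) -> list[str]:
    # Drop hits below the optional cutoff, then trim each remaining window.
    results = []
    for hit in hits:
        if similarity_cutoff is not None and hit["score"] < similarity_cutoff:
            continue
        results.append(
            trim_window(hit["window"], hit["original_text"], max_adjacent_chars)
        )
    return results


hits = [
    {"score": 0.9, "original_text": "Beta two.",
     "window": "Alpha one. Beta two. Gamma three."},
    {"score": 0.2, "original_text": "Delta four.",
     "window": "Gamma three. Delta four."},
]
print(postprocess(hits, max_adjacent_chars=5, similarity_cutoff=0.5))
```

A larger `max_adjacent_chars` returns more of the stored window, up to the whole window; a value of `0` returns only the matched sentence.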

## Workflow Diagram

![Llama Index Context Window Workflow](./llama_index_context_window_workflow.png)

## System Advantages

- **Contextual Richness:** Enriching document chunks with a surrounding sentence window improves the coherence of the retrieved content.
- **Enhanced Retrieval Quality:** By providing extra context, the technique mitigates the problem of isolated text fragments, leading to more accurate responses.
- **Flexibility:** The context window size is configurable, allowing the system to be tuned for various types of datasets and use cases.
- **Seamless Integration:** It builds upon the existing Llama Index framework, maintaining the benefits of vector search while addressing its limitations.

## Practical Benefits

- **More Coherent Results:** Retrieval returns enriched, context-aware text fragments that are easier to understand and more useful for tasks like question answering.
- **Mitigated Fragmentation:** The approach overcomes the common issue of isolated text fragments in vector-based retrieval systems.
- **Adjustable Context:** Users can adjust the window size based on the complexity and structure of their documents, optimizing retrieval quality.

Overall, this technique enhances the performance of document retrieval systems by ensuring that the output retains necessary contextual information, leading to improved downstream processing and interpretation.

## Implementation Insights

- The **retrieve.py** module leverages llama_index's `StorageContext` and `load_index_from_storage` along with a custom postprocessor (`CustomMetadataReplacementPostProcessor`) to process and trim enriched text chunks during document retrieval.
- The **ingest_corpus.py** module uses `SimpleDirectoryReader` and `SentenceWindowNodeParser` via an `IngestionPipeline` to ingest documents with context window enrichment and update or create a vector index.

## Parameters

**LLAMA_INDEX_CONTEXT_WINDOW_TOP_K_RAG_RETRIEVE:**
This environment variable determines how many top-ranked enriched document chunks are retrieved per query in the context window retrieval system. It sets the `similarity_top_k` parameter for the query engine: higher values return more candidates (breadth), while lower values keep only the closest matches (precision).

**LLAMA_INDEX_CONTEXT_WINDOW_MAX_ADJACENT_CHARS_RAG_RETRIEVE:**
This variable specifies the maximum number of characters to include before and after the original sentence when reconstructing the context from the enriched window. It ensures that the returned text fragment maintains sufficient surrounding context for coherence.

**LLAMA_INDEX_CONTEXT_WINDOW_SIZE_INGEST:**
This parameter defines the number of adjacent sentences (on each side) to include during the ingestion process. It is used by the `SentenceWindowNodeParser` to enrich document chunks with additional context, thereby improving the quality of retrieval.
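
One way to read these three settings in Python is sketched below. The variable names come from this section; everything else is an assumption for illustration: the fallback values in `DEFAULTS` are hypothetical (the project's real defaults are not stated here), as are the helpers `read_int_env` and `load_settings`.

```python
import os

# Hypothetical fallback values; the project's actual defaults may differ.
DEFAULTS = {
    "LLAMA_INDEX_CONTEXT_WINDOW_TOP_K_RAG_RETRIEVE": 5,
    "LLAMA_INDEX_CONTEXT_WINDOW_MAX_ADJACENT_CHARS_RAG_RETRIEVE": 500,
    "LLAMA_INDEX_CONTEXT_WINDOW_SIZE_INGEST": 3,
}


def read_int_env(name: str) -> int:
    # Return the variable as a non-negative integer, or the fallback if unset/blank.
    raw = os.environ.get(name, "").strip()
    if not raw:
        return DEFAULTS[name]
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError(f"{name} must be non-negative, got {value}")
    return value


def load_settings() -> dict:
    # Map each environment variable to the role described above.
    return {
        "similarity_top_k": read_int_env(
            "LLAMA_INDEX_CONTEXT_WINDOW_TOP_K_RAG_RETRIEVE"
        ),
        "max_adjacent_chars": read_int_env(
            "LLAMA_INDEX_CONTEXT_WINDOW_MAX_ADJACENT_CHARS_RAG_RETRIEVE"
        ),
        "window_size": read_int_env("LLAMA_INDEX_CONTEXT_WINDOW_SIZE_INGEST"),
    }
```

Note that the window size applies at ingestion, so changing it only affects documents ingested afterwards, whereas the two retrieval settings take effect on the next query.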

## Conclusion

Adding a context enrichment window to the Llama Index workflow addresses the fragmentation common in conventional vector search. At ingestion, each sentence is stored together with its neighboring sentences; at retrieval, custom postprocessing rebuilds a coherent passage around each query match. The window size, the adjacent character limit, and the number of retrieved chunks are all configurable, so the balance between added context and precision can be tuned for each dataset.