#### Week 3: Building Advanced RAG Applications.  Authored by Chris Sanchez.

# Week 3 - Notebook 7 --> Context Enrichment

# Background  
---
The background material for this topic/notebook is covered in the [Week 3 Course Content](https://uplimit.com/course/rag-applications/session/session_clzlsa20a01di197e4tij7vgm/module/module_clzlsa9a702cy1dm671cg0xr0) section titled **Context Enrichment**.  

# Overview
---
The concept of Context Enrichment became popular with the advent of LLMs that could reason over ever-increasing context window sizes. Prior to 2022 most open-source Reader models were limited to a 512 token context window. From a vector search perspective, adding Context Enrichment to your toolbox gives you the best of both worlds: you can pick the chunk size that works best for your embedding model/reranker combo, and then expand each retrieved chunk with its surrounding text so that the Reader LLM has the additional context it needs to answer the user query. Though not the only reason, this technique is partly why setting the chunk overlap parameter to zero (when chunking your text into sentences) is a good call.  
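One way to read the point about chunk overlap: if neighboring chunks share text, joining them into an expanded window repeats that shared text. The snippet below is a minimal sketch of that effect. It chunks on single words rather than sentences purely to keep the example small, and `chunk_words` is a hypothetical helper, not part of the course code.

```python
def chunk_words(words: list[str], chunk_size: int, overlap: int) -> list[str]:
    """Hypothetical helper: split words into chunks that share `overlap` words."""
    step = chunk_size - overlap
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

words = "a b c d e f g h".split()

no_overlap = chunk_words(words, chunk_size=4, overlap=0)
with_overlap = chunk_words(words, chunk_size=4, overlap=2)

# Join the first two neighboring chunks, as an expanded window would
joined_clean = ' '.join(no_overlap[0:2])    # 'a b c d e f g h'
joined_dupes = ' '.join(with_overlap[0:2])  # 'a b c d c d e f' ("c d" appears twice)

print(joined_clean)
print(joined_dupes)
```

With zero overlap the joined window reads as continuous text; with overlap the Reader LLM would see the shared words twice.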

This notebook will walk you through the process of creating an `expanded_content` field that you can add to an existing dataset, which can then be indexed onto your Weaviate cluster. 
- No need to create a new dataset, simply use a preexisting dataset (i.e. `huberman_minilm_256.parquet`)
- Group dataset episodes together by `video_id`.  Performing this step ensures that the before and after text chunks all come from the same episode and there is no "bleed-over" into another episode.
- Loop over each set of episode chunks and join pre-, current, and post- chunks together as a single string.  The window size can be set as a parameter.
- Join each chunk to the original dataset as an additional `expanded_content` field.
- Either index the new dataset on a new collection, or update an existing collection.  The properties file used to create the index schema on Weaviate already includes an `expanded_content` property. 

In [1]:
import sys
sys.path.append('../')

In [2]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(), override=True)
import os

In [3]:
from src.database.properties_template import properties
from src.database.weaviate_interface_v4 import WeaviateWCS
from src.preprocessor.preprocessing import FileIO
from sentence_transformers import SentenceTransformer
from rich import print
from tqdm import tqdm
import tiktoken


### Load Data
No need to create a new dataset, simply use the data that you already have. 

In [5]:
data_path = "/workspaces/rag-applications/data/huberman-minilmL6-256.parquet" #'../data/huberman_minilm-256.parquet'
data = FileIO.load_parquet(data_path)

Shape of data: (23123, 13)
Memory Usage: 2.29+ MB


### Create Expanded Content

In [6]:
from itertools import groupby

def groupby_episode(data: list[dict], key_field: str='video_id') -> list[list[dict]]:
    '''
    Separates entire podcast corpus into individual 
    lists of discrete episodes.
    '''
    episodes = []
    for _, group in groupby(data, lambda x: x[key_field]):
        episode = [chunk for chunk in group]
        episodes.append(episode)
    return episodes
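Worth keeping in mind: `itertools.groupby` only groups *consecutive* items that share a key, so `groupby_episode` relies on all chunks of an episode sitting next to each other in the dataset. The toy rows below are made up for illustration and show what happens when that assumption does not hold.

```python
from itertools import groupby

# Toy rows (illustrative only): episode 'A' is split by a chunk from episode 'B'
rows = [
    {'video_id': 'A', 'content': 'a0'},
    {'video_id': 'B', 'content': 'b0'},
    {'video_id': 'A', 'content': 'a1'},
]

def group_contents(items: list[dict]) -> list[list[str]]:
    """Group consecutive rows by video_id and keep only the content field."""
    return [[r['content'] for r in g] for _, g in groupby(items, key=lambda r: r['video_id'])]

unsorted_groups = group_contents(rows)
# Python's sort is stable, so chunk order within each episode is preserved
sorted_groups = group_contents(sorted(rows, key=lambda r: r['video_id']))

print(unsorted_groups)  # [['a0'], ['b0'], ['a1']]  -> episode 'A' is split in two
print(sorted_groups)    # [['a0', 'a1'], ['b0']]
```

If you do sort the data first, join the expanded content back onto the *sorted* data so that the positions still line up.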

In [7]:
def create_expanded_content(data: list[dict] | list[list[str]],
                            window_size: int=1,
                            num_episodes: int=193,
                            key_field: str='video_id'
                            ) -> list[list[str]]:
    '''
    Creates expanded content from original chunks of text, for use with 
    expanded content retrieval.  Takes in raw data as a list of dictionaries 
    or accepts a list of chunked episodes already grouped. 
    
    Window size sets the number of chunks before and after the original chunk.  
    For example a window_size of 2 will return five joined chunks.  2 chunks 
    before original chunk, the original, and 2 chunks after the original.  
    
    Expanded content is grouped by podcast episode, and chunks are assumed 
    to be kept in order by which they will be joined as metadata in follow-on 
    processing.
    '''
    # check whether raw dictionaries or already-grouped episodes were passed in
    if isinstance(data[0], dict):
        # groupby data into episodes using video_id key
        episodes = groupby_episode(data, key_field)
        assert len(episodes) == num_episodes, f'Number of grouped episodes does not equal num_episodes ({len(episodes)} != {num_episodes})'

        # extract content field and ensure episodes maintain their grouping
        chunk_list = [[d['content'] for d in alist] for alist in episodes]
    else:
        # data is already a list of episodes, each one a list of text chunks
        chunk_list = data
        
    expanded_contents = []
    for episode in tqdm(chunk_list):
        episode_container = []
        for i, chunk in enumerate(episode):
            start = max(0, i-window_size)
            end = i+window_size+1
            expanded_content = ' '.join(episode[start:end])
            episode_container.append(expanded_content)
        expanded_contents.append(episode_container)
    return expanded_contents
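To see what the windowing does at the edges of an episode, here is a small standalone sketch that applies the same slicing logic to a made-up four-chunk episode (the chunk names are placeholders, not real data):

```python
# Toy episode of four chunks (placeholder strings)
episode = ['c0', 'c1', 'c2', 'c3']
window_size = 1

expanded = [
    ' '.join(episode[max(0, i - window_size): i + window_size + 1])
    for i in range(len(episode))
]

print(expanded)  # ['c0 c1', 'c0 c1 c2', 'c1 c2 c3', 'c2 c3']
```

The first and last chunks of an episode have no neighbor on one side, so their expanded content contains fewer than `2 * window_size + 1` chunks.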

# Assignment 3.1 - 
***
#### *Create Expanded Content chunks and join them to existing data*

#### INSTRUCTIONS
1. Execute the `create_expanded_content` function.  Depending on your chunk size, it is likely best to use the default window size of 1, meaning that 1 chunk of text will be added before and after the original text chunk, for a total of three chunks in each `expanded_content` field.
2. Assuming you are going to join the data back to the original dataset from which it came, you'll need to flatten the list of episodes into a single list of text chunks.
3. Write a function that updates your original dataset with the new expanded content by adding an `expanded_content` key to each entry. 

In [8]:
########################
# START YOUR CODE HERE #
########################

expanded_content = create_expanded_content(data=data,
                                           window_size=1,
                                           num_episodes=193,
                                           key_field="video_id")
flattened_content = [content for episode in expanded_content for content in episode]

# azimuth check to ensure you're heading in the right direction
flat_length = len(flattened_content)
data_length = len(data)
assert flat_length == data_length, 'Mismatch in lengths. Double check how you flattened your expanded_content'

def join_expanded_content(data: list[dict],
                          flattened_content: list[str],
                          new_key: str = 'expanded_content'
                          ) -> list[dict]:
    '''
    Updates data in place with a new_key field (default: expanded_content)
    and returns the updated data.
    '''
    # data and flattened_content are in the same order, so pair them positionally
    for d, expanded in zip(data, flattened_content):
        d[new_key] = expanded

    return data

########################
# END YOUR CODE HERE #
########################

data = join_expanded_content(data, flattened_content)

100%|██████████| 193/193 [00:00<00:00, 3563.92it/s]


#### After executing the above function, run the following cells as a post-check

In [9]:
#ensure all expanded content fields are present
for d in data:
    assert d.get('expanded_content', -1) != -1
    
#compare your initial result with the following cell
print(data[0]['expanded_content'])

<details> 
    <summary>
        Click to compare your results from the cell above with the following:
</summary>  
    
```
"Welcome to the Huberman Lab guest series, where I and an expert guest discuss science and science-based tools for everyday life. I'm Andrew Huberman, and I'm a professor of neurobiology and ophthalmology at Stanford School of Medicine. Today's episode marks the first in our six-episode series all about sleep. Our expert guest for this series is Dr. Matthew Walker, professor of neuroscience and psychology and the director of the Center for Sleep Science at the University of California, Berkeley. He is also the author of the bestselling book, Why We Sleep. During the course of the six-episode series, for which we release one episode per week, starting with this episode one, we cover essentially all aspects of sleep and provide numerous practical tools to improve your sleep. For instance, we discuss the biology of sleep, including the different sleep stages, as well as why sleep is so important for our mental and physical health. We also talk about how sleep regulates things like emotionality and learning and neuroplasticity, that is your brain's ability to change in response to experience. And we discuss the various things that you can do to improve your sleep. Everything from how to time lighting, temperature, exercise, eating, and the various things that can impact sleep both positively and negatively, such as alcohol, cannabis, and various supplements and drugs that have been shown to improve sleep. We also talk about naps, dreaming and the role of dreams, and lucid dreaming, which is when you dream and you are aware that you are dreaming. In today's episode one, we specifically focus on why sleep is so important and what happens when we do not get enough sleep or enough quality sleep. We also talk about the various sleep stages, and we also talk about a very specific formula that everyone should know for themselves called QQRT, which is an acronym that stands for quality, quantity, regularity, and timing of sleep. 
Four factors which today you'll learn how to identify specifically for you what your optimal QQRT is, and then to apply that in order to get the best possible night's sleep, which of course equates to the best possible level of focus and alertness throughout your days. Both Dr."
```
</details>

### Index the Data
---
You have two options here:
1. Easy way: Simply index the data on a new Collection.
2. Hard way: Read all existing uuids on current Collection and update each object by linking the doc_ids.

As mentioned earlier, the `expanded_content` property is already part of the `properties` index configuration.  See the last entry after printing the `properties` variable: 

In [10]:
print(properties)

## Conclusion
---
After you've indexed the data you will now have a way to retrieve content on a fine-grained level, and provide your LLM Reader with an expanded context.  You will be able to see this in action when you add `expanded_content` as a `return_property` in your Streamlit UI. 🎉  

However, the important question is: how does adding expanded content affect your RAG system's performance?  Assuming you completed Notebook 5, you now have the means to run an evaluation with the Reader LLM ingesting the `expanded_content` field as context instead of the `content` field. Setting up your own evaluation is a good exercise in understanding how all of these pieces fit together.  Note that `generate_prompt_series`, the function used to build the user message at run time, and the `create_context_blocks` function both have a `content_key` parameter.  You will want to pass in `expanded_content` as the argument there instead of `content`. 
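The course functions themselves are not reproduced in this notebook, so the following is only a hypothetical stand-in that illustrates what a `content_key` parameter typically does. `build_context_sketch` and the toy `results` are assumptions made for this example; they are not the course's `create_context_blocks` implementation.

```python
def build_context_sketch(results: list[dict], content_key: str = 'content') -> str:
    """Hypothetical stand-in: join the chosen field of each search result into one context string."""
    return '\n\n'.join(r[content_key] for r in results)

# Toy search results (made up for illustration)
results = [
    {'content': 'chunk 1', 'expanded_content': 'chunk 0 chunk 1 chunk 2'},
    {'content': 'chunk 7', 'expanded_content': 'chunk 6 chunk 7 chunk 8'},
]

narrow_context = build_context_sketch(results)                                  # uses 'content'
wide_context = build_context_sketch(results, content_key='expanded_content')    # uses the enriched field

print(wide_context)
```

Retrieval still ranks on the fine-grained chunk; only the text handed to the Reader LLM changes with the key.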

## OPTIONAL: Update an existing Collection
---
For those interested in doing things the hard way here is some starter code.  No guarantee that this code will work as written, but it gives you the idea of how you would accomplish this task; or just create a new Collection... 😀:

In [11]:
# get collection object
api_key = os.environ['WEAVIATE_API_KEY']
endpoint = os.environ['WEAVIATE_ENDPOINT']

client = WeaviateWCS(endpoint, api_key)
collection = client._client.collections.get('Huberman_minilm_256')

In [12]:
# This step will take a few minutes to read every object id on the Weaviate cluster
doc_id_cache = {item.properties['doc_id']:item.uuid for item in tqdm(collection.iterator())}

23123it [00:22, 1047.38it/s]


`doc_id_cache` example:
```
{'-OBCwiPPfEU_8': _WeaviateUUIDInt('018455e9-47ab-41cc-b592-c431fd8df75f'),
 '-OBCwiPPfEU_4': _WeaviateUUIDInt('03a18709-0334-4f18-9375-4a8a3f162cfb'),
 '-OBCwiPPfEU_0': _WeaviateUUIDInt('0da24442-3263-46d7-91af-ef2a34d27a9c'),
 '-OBCwiPPfEU_1': _WeaviateUUIDInt('219354b1-dd2e-46c0-94cb-cdd51f915175'),
 '-OBCwiPPfEU_6': _WeaviateUUIDInt('27f967f2-3e9c-453d-8d21-378e4e15ffac'),
 '-OBCwiPPfEU_7': _WeaviateUUIDInt('332a363f-afcf-4fb7-9370-bbded23b8803'),
 '-OBCwiPPfEU_3': _WeaviateUUIDInt('59c539ac-8f95-4b13-bb6c-1fb323d7e64a'),
 '-OBCwiPPfEU_2': _WeaviateUUIDInt('746401bd-98c2-4494-ba08-d1b21d9abfc5'),
 '-OBCwiPPfEU_9': _WeaviateUUIDInt('77a3a7ae-8d77-4ae5-a0cd-705fb4286198'),
 '-OBCwiPPfEU_5': _WeaviateUUIDInt('803d2756-e3bd-44ef-88e6-43bc700be480')}
```
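Before starting a long-running update, it may be worth confirming that every `doc_id` in your dataset has a matching entry in the cache; otherwise the update loop would stop with a `KeyError` partway through. A minimal sketch, using a hypothetical helper `find_missing_doc_ids` and toy inputs:

```python
def find_missing_doc_ids(data: list[dict], doc_id_cache: dict) -> list[str]:
    """Hypothetical helper: return doc_ids present in data but absent from the cache."""
    return [d['doc_id'] for d in data if d['doc_id'] not in doc_id_cache]

# Toy inputs (illustrative only)
toy_data = [{'doc_id': 'x_0'}, {'doc_id': 'x_1'}]
toy_cache = {'x_0': 'uuid-0'}

missing = find_missing_doc_ids(toy_data, toy_cache)
print(missing)  # ['x_1']
```

An empty list means every object in the dataset can be matched to a uuid on the cluster.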

##### Finally, loop through your dataset, grab each `doc_id` and `expanded_content` value, and update the corresponding object on the Weaviate cluster using the uuid stored in `doc_id_cache`.

In [13]:
#Because this is not batched, expect this process to run for several minutes
for d in tqdm(data):
    doc_id = d['doc_id']
    expanded_content = d['expanded_content']
    uuid = doc_id_cache[doc_id]
    collection.data.update(uuid=uuid, properties={'expanded_content': expanded_content})

100%|██████████| 23123/23123 [22:12<00:00, 17.36it/s]
