#### Week 3: Building Advanced RAG Applications.  Authored by Chris Sanchez.

# Week 3 - Notebook 7 --> Context Enrichment

# Overview
---
This notebook will walk you through the process of creating an `expanded_content` field that you can add to an existing dataset, which can then be indexed onto your Weaviate cluster. 
- No need to create a new dataset; simply use a pre-existing dataset (e.g. `huberman_minilm_256.parquet`).
- Group dataset episodes together by `video_id`.  Performing this step ensures that the before and after text chunks all come from the same episode, with no "bleed-over" into another episode.
- Loop over each set of episode chunks and join pre-, current, and post- chunks together as a single string.  The window size can be set as a parameter.
- Join each chunk to the original dataset as an additional `expanded_content` field.
- Either index the new dataset on a new collection, or update an existing collection.  The properties file already includes an `expanded_content` property. 

In [1]:
import sys
sys.path.append('../')

In [2]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(), override=True)

In [3]:
from src.database.properties_template import properties
from src.database.database_utils import get_weaviate_client
from src.preprocessor.preprocessing import FileIO
# from llama_index.text_splitter import SentenceSplitter
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import tiktoken

### Load Data
No need to create a new dataset, simply use the data that you already have. 

In [4]:
data_path = '../data/huberman_minilm-256.parquet'
data = FileIO.load_parquet(data_path)



Shape of data: (23905, 13)
Memory Usage: 2.37+ MB


### Create Expanded Content

In [5]:
from itertools import groupby

def groupby_episode(data: list[dict], key_field: str='video_id') -> list[list[dict]]:
    '''
    Separates the entire corpus into individual 
    lists of discrete episodes.
    '''
    episodes = []
    for _, group in groupby(data, lambda x: x[key_field]):
        episode = [chunk for chunk in group]
        episodes.append(episode)
    return episodes
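
One caveat worth noting: `itertools.groupby` only groups consecutive items that share a key, so `groupby_episode` relies on the chunks of each episode being stored contiguously. A minimal sketch on made-up records (the `video_id` values, `content` strings, and the `count_groups` helper are hypothetical, not part of the course code) illustrates the difference:

```python
from itertools import groupby

# Hypothetical toy records, not taken from the Huberman dataset
contiguous = [
    {'video_id': 'a', 'content': 'a0'},
    {'video_id': 'a', 'content': 'a1'},
    {'video_id': 'b', 'content': 'b0'},
]
# Same records, but episode 'a' is interrupted by episode 'b'
interleaved = [contiguous[0], contiguous[2], contiguous[1]]

def count_groups(records: list[dict], key_field: str = 'video_id') -> int:
    '''Counts the groups that itertools.groupby produces for the records.'''
    return sum(1 for _ in groupby(records, key=lambda x: x[key_field]))

contiguous_groups = count_groups(contiguous)     # one group per episode
interleaved_groups = count_groups(interleaved)   # episode 'a' is split in two
print(contiguous_groups, interleaved_groups)
```

If the records might be out of order, sorting on `video_id` first would restore the grouping; Python's `sorted` is stable, so the relative order of chunks within each episode is kept.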

In [6]:
def create_expanded_content(data: list[dict]=None, 
                            chunk_list: list[list[str]]=None, 
                            window_size: int=1,
                            num_episodes: int=193,
                            key_field: str='video_id'
                            ) -> list[list[str]]:
    '''
    Creates expanded content from original chunks of text, for use with 
    expanded content retrieval.  Takes in raw data in dict format or 
    accepts a list of chunked episodes already grouped. 
    
    Window size sets the number of chunks before and after the original chunk.  
    For example a window_size of 2 will return five joined chunks.  2 chunks 
    before original chunk, the original, and 2 chunks after the original.  
    
    Expanded content is grouped by podcast episode, and chunks are assumed 
    to be kept in order by which they will be joined as metadata in follow-on 
    processing.
    '''
    if not data and not chunk_list:
        raise ValueError("Either data or a chunk_list must be passed as an arg")
        
    if data:
        # groupby data into episodes using video_id key
        episodes = groupby_episode(data, key_field)
        assert len(episodes) == num_episodes, f'Number of grouped episodes does not equal num_episodes ({len(episodes)} != {num_episodes})'

        # extract content field and ensure episodes maintain their grouping
        chunk_list = [[d['content'] for d in alist] for alist in episodes]
        
    expanded_contents = []
    for episode in tqdm(chunk_list):
        episode_container = []
        for i, chunk in enumerate(episode):
            start = max(0, i-window_size)
            end = i+window_size+1
            expanded_content = ' '.join(episode[start:end])
            episode_container.append(expanded_content)
        expanded_contents.append(episode_container)
    return expanded_contents
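
To make the windowing concrete, here is a minimal sketch of the same slicing logic applied to one made-up episode (the chunk strings are hypothetical placeholders). It shows that chunks at the start and end of an episode get shorter windows, because there is nothing before the first chunk or after the last one:

```python
# Hypothetical single episode of five short chunks
episode = ['c0', 'c1', 'c2', 'c3', 'c4']
window_size = 1

windows = []
for i in range(len(episode)):
    start = max(0, i - window_size)   # clamp so a negative index never wraps around
    end = i + window_size + 1         # slicing past the end of a list is safe in Python
    windows.append(' '.join(episode[start:end]))

print(windows)
# ['c0 c1', 'c0 c1 c2', 'c1 c2 c3', 'c2 c3 c4', 'c3 c4']
```

Note that the output has exactly one window per input chunk, which is what allows the expanded text to be joined back to the original dataset later.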

# Assignment 3.1 - 
***
#### *Create Expanded Content chunks and join them to existing data*

#### INSTRUCTIONS
1. Execute the `create_expanded_content` function.  Depending on your chunk size, it is likely best to use the default window size of 1, meaning that 1 chunk of text will be added before and after the original text chunk, for a total of three chunks in each `expanded_content` field.
2. Assuming you are going to join the data back to the original dataset from which it came, you'll need to flatten the list of episodes into a single list of text chunks.
3. Write a function that combines your original dataset with the new expanded content by updating the dataset with an `expanded_content` key. 
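
As a rough illustration of steps 2 and 3, the sketch below flattens a nested list and attaches each string to its record, using made-up data (the `nested` and `records` values are hypothetical). It is one possible approach under the assumption that both lists share the same order, not the reference solution:

```python
from itertools import chain

# Hypothetical nested output: two episodes with two chunks and one chunk
nested = [['e0 c0', 'e0 c1'], ['e1 c0']]
records = [{'doc_id': 'x_0'}, {'doc_id': 'x_1'}, {'doc_id': 'y_0'}]

# Flatten the per-episode lists into a single list of strings
flat = list(chain.from_iterable(nested))
assert len(flat) == len(records)

# Pair each record with its expanded text; relies on both lists sharing the same order
for record, text in zip(records, flat):
    record['expanded_content'] = text

print(records[2])
```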

In [52]:
########################
# START YOUR CODE HERE #
########################

expanded_content = create_expanded_content(None)
flattened_content = None

# azimuth check to ensure you're heading in the right direction
flat_length = len(flattened_content)
data_length = len(data)
assert flat_length == data_length, 'Mismatch in lengths. Double check how you flattened your expanded_content'

def join_expanded_content(data: list[dict],
                          flattened_content: list[str]
                          ) -> list[dict]:
    '''
    Updates data with an expanded_content key.
    '''

    
########################
# END YOUR CODE HERE #
########################
    
    return data

data = join_expanded_content(None, None)

#### After executing the above function, run the following cell as a post-check

In [34]:
for d in data:
    assert d.get('expanded_content', -1) != -1

### Index the Data
---
You have two options here:
1. Easy way: Simply index the data on a new Collection.
2. Hard way: Read all existing uuids on current Collection and update each object by linking the doc_ids.

As mentioned earlier, the expanded_content property is already part of the index configuration of properties.  See the last property entry after printing the `properties` variable: 

In [49]:
from rich import print

# print(properties)

## Conclusion
---
After you've indexed the data you will now have a way to retrieve content on a fine-grained level, and provide your LLM Reader with an expanded context.  You will be able to see this in action when you add `expanded_content` as a `return_property` in your Streamlit UI. 🎉

## OPTIONAL: Update an existing Collection
---
For those interested in doing things the hard way, here is some starter code.  There is no guarantee that this code will work as written, but it gives you an idea of how you would accomplish this task; or just create a new Collection... 😀:

In [12]:
# get collection object
client = get_weaviate_client()
collection = client._client.collections.get('Huberman_minilm_256')

In [11]:
# This step will take a few minutes to read every object id on the Weaviate cluster
doc_id_cache = {item.properties['doc_id']:item.uuid for item in tqdm(collection.iterator())}

`doc_id_cache` example:
```
{'-OBCwiPPfEU_8': _WeaviateUUIDInt('018455e9-47ab-41cc-b592-c431fd8df75f'),
 '-OBCwiPPfEU_4': _WeaviateUUIDInt('03a18709-0334-4f18-9375-4a8a3f162cfb'),
 '-OBCwiPPfEU_0': _WeaviateUUIDInt('0da24442-3263-46d7-91af-ef2a34d27a9c'),
 '-OBCwiPPfEU_1': _WeaviateUUIDInt('219354b1-dd2e-46c0-94cb-cdd51f915175'),
 '-OBCwiPPfEU_6': _WeaviateUUIDInt('27f967f2-3e9c-453d-8d21-378e4e15ffac'),
 '-OBCwiPPfEU_7': _WeaviateUUIDInt('332a363f-afcf-4fb7-9370-bbded23b8803'),
 '-OBCwiPPfEU_3': _WeaviateUUIDInt('59c539ac-8f95-4b13-bb6c-1fb323d7e64a'),
 '-OBCwiPPfEU_2': _WeaviateUUIDInt('746401bd-98c2-4494-ba08-d1b21d9abfc5'),
 '-OBCwiPPfEU_9': _WeaviateUUIDInt('77a3a7ae-8d77-4ae5-a0cd-705fb4286198'),
 '-OBCwiPPfEU_5': _WeaviateUUIDInt('803d2756-e3bd-44ef-88e6-43bc700be480')}
```

##### Finally, you'll want to loop through your dataset, grab the `doc_id` and `expanded_content` values, and then update each object on the Weaviate cluster using the uuid found in `doc_id_cache`.

In [None]:
for d in data:
    doc_id = d['doc_id']
    expanded_content = d['expanded_content']
    uuid = doc_id_cache[doc_id]
    collection.data.update(uuid=uuid, properties={'expanded_content': expanded_content})
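
Before running the update loop, it may be worth checking that every `doc_id` in the dataset actually appears in `doc_id_cache`, since a missing key would raise a `KeyError` partway through the updates. A minimal sketch, using hypothetical stand-ins (`toy_data`, `toy_cache`) for the real dataset and cache:

```python
# Hypothetical stand-ins for the real dataset and doc_id_cache
toy_data = [
    {'doc_id': 'abc_0', 'expanded_content': 'first window'},
    {'doc_id': 'abc_1', 'expanded_content': 'second window'},
]
toy_cache = {'abc_0': 'uuid-0'}

# Collect every doc_id that has no matching uuid in the cache
missing = [d['doc_id'] for d in toy_data if d['doc_id'] not in toy_cache]
print(missing)
```

An empty `missing` list suggests the loop can look up every uuid; a non-empty one points to objects that were never indexed or whose `doc_id` values differ from the dataset.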