#### Week 1: Building Advanced RAG Applications.  Authored by Chris Sanchez.

# Week 1 - Notebook 3

# Overview
***
Welcome to the final notebook for Week 1! Take a look at all the ground we've covered so far:
- Chunking/splitting
- Text vectorization
- Combining metadata
- Collection Configuration
- Data Indexing
- Keyword search
- Vector search
- OPTIONAL: Searching with Filters

We are now prepared to move on to a very important topic, **Retrieval Evaluation**.  I hope you've noticed that the search results differ (sometimes slightly, sometimes by a lot) depending on which search method you use: `keyword_search` or `vector_search`.  As humans, it's fairly easy for us to determine whether the returned search results are relevant to the submitted query (though even here there will be differing opinions on result relevance).  But how do we systematically determine which search method is better in general?  And how do we measure the relative performance of our retrieval system if we change one of its parameters, for example the embedding model?  What about measuring system performance over time as more documents are added to our datastore?

We need a way to evaluate our retrieval system, and this notebook will show you "one way" of doing that.  I say "one way" because there are many ways to approach this problem, and the method I'm showing you is not perfect (if anything it's a bit too conservative).  Ultimately, measuring retrieval performance is hard because it requires a lot of time and effort, and absent any user [click-data](https://en.wikipedia.org/wiki/Click_tracking), requires some form of data labeling.  With the advent of powerful generative LLMs the process of measuring retrieval performance has become much easier. Let's take a look at how that works.

# Retrieval Evaluation - Process
***
Here's a high-level overview of how the Retrieval Evaluation process in this notebook works:

1. Generate a "golden dataset" of query-context pairs.  I've already pre-generated multiple golden datasets for our course. I randomly selected 100 document chunks (contexts) from the Huberman Lab corpus and those chunks were then submitted to the `gpt-3.5-turbo` model which generated queries that can be answered by the context.  The output was 100 query-context pairs along with their associated doc_ids. 
   - **Baseline Assumptions**:
     - The generated query-context pairs are, in fact, relevant to one another i.e. the query can be answered by the context that it's paired with
     - The generated queries are similar in style and length to the type of queries that end users would ask
2. The golden dataset consists of three primary keys: `corpus`, `relevant_docs`, and `queries`
     - The `corpus` is the original text context/chunk with its associated `doc_id`
     - The `queries` are the LLM generated queries, one (or more) for each entry in the `corpus`
     - The `relevant_docs` is a simple lookup table linking the `corpus` docs to the generated `queries`
3. We pass the golden dataset into a retrieval evaluation function which does the following:
   - Takes in a `retriever` arg (`WeaviateWCS`) and a few other configuration params
   - Iterates over all queries in the golden dataset and retrieves search results for each query from the Weaviate datastore
   - Extracts all `doc_id` values from the retrieved results
   - Extracts the `doc_id` from the associated `relevant_docs` for each query
   - Checks if the relevant doc_id is in the list of retrieved result doc_ids
   - After all queries are completed a `hit_rate` score and `mrr` score are calculated for the entire golden dataset
   - Writes results to an `eval_results` folder
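
To make the steps above concrete, here is a minimal sketch on a toy golden dataset. Everything in it is invented for illustration: the ids, the text, and the `toy_search` function, which is a hypothetical word-overlap stand-in for a real retriever and not how Weaviate ranks results. It also assumes `relevant_docs` maps each query id to a single `doc_id`, which is what the lookup in `retrieval_evaluation` suggests.

```python
# Toy golden dataset: all ids and text are invented for illustration.
golden = {
    "corpus": {
        "doc-1": "morning light sets the body clock",
        "doc-2": "caffeine delays the feeling of sleepiness",
        "doc-3": "cold exposure raises alertness",
    },
    "queries": {
        "q-1": "how does morning light affect the body clock",
        "q-2": "why does caffeine delay sleepiness",
        "q-3": "does morning light or the cold raise alertness",
    },
    # assumed shape: one relevant doc_id per query id
    "relevant_docs": {"q-1": "doc-1", "q-2": "doc-2", "q-3": "doc-3"},
}


def toy_search(query: str, limit: int = 5) -> list[dict]:
    """Hypothetical stand-in for a retriever: rank docs by shared words."""
    q_words = set(query.split())
    ranked = sorted(
        golden["corpus"].items(),
        key=lambda item: len(q_words & set(item[1].split())),
        reverse=True,
    )
    return [{"doc_id": doc_id, "content": text} for doc_id, text in ranked[:limit]]


def is_hit(query_id: str, limit: int = 5) -> bool:
    """True if the relevant doc_id appears among the retrieved doc_ids."""
    results = toy_search(golden["queries"][query_id], limit=limit)
    retrieved_ids = [r["doc_id"] for r in results]
    return golden["relevant_docs"][query_id] in retrieved_ids


hits_at_1 = {qid: is_hit(qid, limit=1) for qid in golden["queries"]}
print(hits_at_1)
```

In this toy setup `q-3` is a miss when only one result is retrieved, but becomes a hit with `limit=2`. That is the same effect the `retrieve_limit` docstring warns about: raising the limit makes hits easier to come by.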

#### In a Nutshell
Ultimately, given a golden dataset consisting of queries, relevant docs, and their associated doc_ids, the `retrieval_evaluation` function checks, for each query, whether the relevant doc_id is found in the list of retrieved result doc_ids.

#### Problems with this Approach
The problems with this approach are many, I'll cover a few here:
- The **Assumptions** (see section 1 above) about the golden dataset must hold true.  Given that the pairs are generated by `gpt-3.5-turbo`, I think the first assumption will generally be true.  When reviewing the dataset I did find a few questions that were not answerable given the context, but for the most part they were.  The 2nd assumption though, is going to be dependent on your particular search use case.  I think for the purposes of this course, the questions generated are a decent reflection of how someone would query this dataset, and therefore do the job of measuring retriever performance.  But I would always check a real-world query distribution before using an approach like the one presented here.
- This approach is conservative in that there is only "one" right answer.  Either the relevant `doc_id` is in the results list or it isn't.  In reality, there are going to be several documents that could potentially answer the generated query, but we have no way to account for these other relevant documents, unless of course, we want to manually add doc_ids to the golden dataset (and depending on your business case, you may actually want to do that).
- We aren't measuring recall or precision because we aren't classifying other documents as "negatives".  As was just mentioned, the other documents in the results list may or may not be good matches, we just don't know.  Because we don't know, we can't really classify the other documents as "negatives".  So for this approach, we are measuring the ["hit rate"](https://uplimit.com/course/rag-applications/v2/module/retreival-evaluation#corise_clp66zqui003i2a777aldseor) which is simply a count of the number of times that we found a relevant `doc_id` match in the results list and [Mean Reciprocal Rank (MRR)](https://uplimit.com/course/rag-applications/v2/module/retreival-evaluation#corise_clp66zqui003j2a77u8lnrk5b).  We're using MRR instead of other metrics such as Mean Average Precision (MAP) because we are only looking at a [single relevant answer](https://stats.stackexchange.com/questions/127041/mean-average-precision-vs-mean-reciprocal-rank).  Hit rate is a good enough metric for determining if our retriever is retrieving quality results, and MRR will become more important later on when we add a Reranker to the mix.  
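
Here is a small worked example of the two metrics. It is a sketch under one assumption: that both scores are normalized by the total number of queries. The course helpers `calc_hit_rate_scores` and `calc_mrr_scores` are not shown in this notebook, so treat the `hit_rate_and_mrr` function below as a hypothetical illustration, not their actual implementation.

```python
from typing import Optional


def hit_rate_and_mrr(ranks: list[Optional[int]]) -> tuple[float, float]:
    """
    ranks[i] is the 1-based position of the relevant doc_id in the results
    for query i, or None if it was not retrieved at all.
    """
    n = len(ranks)
    found = [r for r in ranks if r is not None]
    hit_rate = len(found) / n                  # share of queries with a hit
    mrr = sum(1 / r for r in found) / n        # misses contribute 0
    return hit_rate, mrr


# Four queries: relevant doc found at rank 1, rank 3, not found, rank 2
ranks = [1, 3, None, 2]
hit_rate, mrr = hit_rate_and_mrr(ranks)
print(f"hit rate: {hit_rate:.2f}, MRR: {mrr:.3f}")
```

Three of the four queries are hits, so the hit rate is 0.75. The MRR is (1 + 1/3 + 0 + 1/2) / 4, roughly 0.458. It is lower than the hit rate because it rewards hits more the closer they sit to the top of the list.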

In [1]:
#standard library imports
import sys
sys.path.append('../')

from typing import Any
import time
import os

# utilities
from tqdm import tqdm
from rich import print
from dotenv import load_dotenv, find_dotenv
env = load_dotenv(find_dotenv(), override=True)

# Assignment 1.3
***
#### Instructions:
* Import the `/data/golden_datasets/golden_256.json` dataset using the `load_json` method of the FileIO Class
* Instantiate a new WeaviateWCS client (Retriever) and set the `collection_name` of the Collection that you created in Notebook 2
* Evaluate your retriever results using the `retrieval_evaluation` function
* Submit your results in the form of a text file to Uplimit (the function autogenerates a report in the `dir_outpath` directory).

In [2]:
from src.evaluation.retrieval_evaluation import calc_hit_rate_scores, calc_mrr_scores, record_results
from src.database.weaviate_interface_v4 import WeaviateWCS
from src.database.database_utils import get_weaviate_client
from src.preprocessor.preprocessing import FileIO


data_path = '../data/golden_datasets/golden_256.json'

#################
##  START CODE ##
#################


### Load QA dataset
golden_dataset = FileIO.load_json(data_path)

### Instantiate Weaviate client and set Collection name
api_key = os.environ['WEAVIATE_API_KEY']
url = os.environ['WEAVIATE_ENDPOINT']
model_path = 'sentence-transformers/all-MiniLM-L6-v2'
retriever = WeaviateWCS(endpoint=url, api_key=api_key, model_name_or_path=model_path)
collection_name = 'Huberman_minilm_256'


#################
##  END CODE   ##
#################

# should see 100 queries
print(f'Num queries in Golden Dataset: {len(golden_dataset["queries"])}')


# Project 1: Retrieval Evaluation

In [4]:
def retrieval_evaluation(dataset: dict, 
                         collection_name: str, 
                         retriever: WeaviateWCS,
                         retrieve_limit: int=5,
                         chunk_size: int=256,
                         query_properties: list[str]=['content'],
                         return_properties: list[str]=['doc_id', 'content'],
                         dir_outpath: str='./eval_results',
                         include_miss_info: bool=False
                         ) -> dict[str, str|int|float] | tuple[dict[str, str|int|float], list[dict]]:
    '''
    Given a dataset and a retriever, evaluate the performance of the retriever. Returns a dict of kw and vector
    hit rates and mrr scores. If include_miss_info is True, returns a (results_dict, miss_info) tuple, where
    miss_info is a list of queries with their kw and vector responses that did not return a hit, for deeper
    analysis. A text file with the results is automatically saved in the dir_outpath directory.

    Args:
    -----
    dataset: dict
        Dataset to be used for evaluation
    collection_name: str
        Name of Collection on Weaviate host to be used for retrieval
    retriever: WeaviateWCS
        WeaviateWCS object to be used for retrieval 
    retrieve_limit: int=5
        Number of documents to retrieve from Weaviate host, increasing this value too high 
        will artificially inflate the hit rate score of your retriever.
    chunk_size: int=256
        Number of tokens used to chunk text. This value is purely for results 
        recording purposes and does not affect results. 
    query_properties: list[str]=['content']
        list of properties that the keyword search is run against
    return_properties: list[str]=['doc_id', 'content']
        list of properties to be returned from Weaviate host for display in response
    dir_outpath: str='./eval_results'
        Directory path for saving results.  Directory will be created if it does not
        already exist. 
    include_miss_info: bool=False
        Option to include queries and their associated kw and vector response values
        for queries that are "total misses"
    '''

    results_dict = {'n':retrieve_limit, 
                    'Retriever': retriever.model_name_or_path, 
                    'chunk_size': chunk_size,
                    'query_props': query_properties,
                    'kw_hit_rate': 0,
                    'kw_mrr': 0,
                    'vector_hit_rate': 0,
                    'vector_mrr': 0,
                    'total_misses': 0,
                    'total_questions':0
                    }
    
    start = time.perf_counter()
    miss_info = []
    for query_id, q in tqdm(dataset['queries'].items(), 'Queries'):
        results_dict['total_questions'] += 1
        hit = False
        
        try:
            kw_response = retriever.keyword_search(request=q, collection_name=collection_name, query_properties=query_properties,
                                                   limit=retrieve_limit, return_properties=return_properties)
            vector_response = retriever.vector_search(request=q, collection_name=collection_name, 
                                                   limit=retrieve_limit, return_properties=return_properties)
            
            #collect doc_ids and position of doc_ids to check for document matches
            kw_doc_ids = {result['doc_id']:i for i, result in enumerate(kw_response, 1)}
            vector_doc_ids = {result['doc_id']:i for i, result in enumerate(vector_response, 1)}
            
            #extract doc_id for scoring purposes
            doc_id = dataset['relevant_docs'][query_id]
 
            #increment hit_rate counters and mrr scores
            if doc_id in kw_doc_ids:
                results_dict['kw_hit_rate'] += 1
                results_dict['kw_mrr'] += 1/kw_doc_ids[doc_id]
                hit = True
            if doc_id in vector_doc_ids:
                results_dict['vector_hit_rate'] += 1
                results_dict['vector_mrr'] += 1/vector_doc_ids[doc_id]
                hit = True
                
            # if no hits, let's capture that
            if not hit:
                results_dict['total_misses'] += 1
                miss_info.append({'query': q, 'kw_response': kw_response, 'vector_response': vector_response})
        except Exception as e:
            print(e)
            continue
    

    #use raw counts to calculate final scores
    calc_hit_rate_scores(results_dict, search_type=['kw', 'vector'])
    calc_mrr_scores(results_dict, search_type=['kw', 'vector'])
    
    end = time.perf_counter() - start
    print(f'Total Processing Time: {round(end/60, 2)} minutes')
    record_results(results_dict, chunk_size, dir_outpath=dir_outpath, as_text=True)
    
    if include_miss_info:
        return results_dict, miss_info
    return results_dict

### Run evaluation over golden dataset

In [4]:
#################
##  START CODE ##
#################
eval_results = retrieval_evaluation(golden_dataset, collection_name, retriever)

Queries: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:21<00:00,  4.74it/s]


In [5]:
print(eval_results)

# Conclusion
***

We now have a way of measuring the performance of our retrieval system.  This will allow you to make tweaks/changes to the system and then be able to objectively tell whether or not the tweak/change improved or degraded its performance.  Here are a few things to consider going forward:  

- Keep in mind the ultimate goal of the system that you are building.  For this course, we are trying to retrieve the most relevant documents possible to address a user query, assuming the information is found within the corpus.  This means that we don't need pages and pages of relevant results, we actually only need the top 3-5, just enough to allow our Reader (the OpenAI LLM) to answer the user query.  This is an important point to be thinking about as you are making changes to the retrieval system.
- Feel free to set the `include_miss_info` param to `True`.  Doing so will return a list of both the keyword and vector responses that did not contain the relevant `doc_id` (a "total_miss" means the `doc_id` was not present in either the `kw_doc_ids` or the `vector_doc_ids`).  Take a look at the style of the queries being asked and compare them with the returned responses.  Why are those responses being returned?  Are they close to the intent of the query?
- Last but not least, you are now free to make changes to your system to improve the `hit_rate` and `mrr` scores.  If it were me, I'd start with switching out to a more performant [embedding model](https://huggingface.co/spaces/mteb/leaderboard).  There will be more opportunities to pick up some low hanging fruit, but we'll have to wait until the following week when hybrid search and Rerankers are introduced.  Whatever you do though, don't change params for the `SentenceSplitter` that you use for chunking the corpus.  Due to the way the golden dataset is derived, it's unfortunately dependent on those original `SentenceSplitter` settings remaining the same across evaluations. That is, of course, unless you want to build out your own golden dataset....
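
If you do turn on `include_miss_info`, a small helper makes the misses easier to scan. The sketch below assumes each entry has the shape built inside `retrieval_evaluation` (`query`, `kw_response`, `vector_response`) and that each response is a list of dicts carrying the default `return_properties` (`doc_id` and `content`). The `summarize_misses` name and the toy entry are hypothetical, made up purely for illustration.

```python
def summarize_misses(miss_info: list[dict], top_k: int = 2, width: int = 60) -> list[str]:
    """Build one line per query plus one line per top-k result of each search type."""
    lines = []
    for miss in miss_info:
        lines.append(f"QUERY: {miss['query']}")
        for label in ("kw_response", "vector_response"):
            for rank, result in enumerate(miss[label][:top_k], 1):
                snippet = result["content"][:width]  # truncate long chunks
                lines.append(f"  {label}[{rank}] {result['doc_id']}: {snippet}")
    return lines


# Toy entry standing in for real output. With a live retriever you would use:
# eval_results, miss_info = retrieval_evaluation(golden_dataset, collection_name,
#                                                retriever, include_miss_info=True)
toy_miss_info = [
    {
        "query": "placeholder query text",
        "kw_response": [
            {"doc_id": "doc-a", "content": "first keyword result"},
            {"doc_id": "doc-b", "content": "second keyword result"},
        ],
        "vector_response": [
            {"doc_id": "doc-c", "content": "first vector result"},
        ],
    }
]

summary = summarize_misses(toy_miss_info)
print("\n".join(summary))
```

Reading the query next to the top results from each search method is usually enough to tell whether a miss is a genuinely bad retrieval or just a case where another chunk answers the question equally well.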

# *** OPTIONAL: Chunk Size Evaluation ***

In our initial Notebook we created a dataset with a chunk size of 256.  To evaluate the impact that chunk size has on retrieval for both search methods, it's a useful exercise to execute the `retrieval_evaluation` function on datasets of multiple chunk sizes.  To accomplish that, follow these simple steps:
- Bust out the `create_dataset` function that you created in Assignment 1.4.  Create datasets of chunk lengths **128** and **512**.  **Ensure that you set the `chunk_overlap` param to zero for each run.**  Golden datasets of corresponding chunk lengths have already been created for you in the `data/golden_datasets` directory.
- Index those datasets on Weaviate, ensuring that you stick to the standard naming convention as discussed in Notebook 2, i.e. `"Huberman_minilm_<chunk_size>"`
- Evaluate results using the chunk sizes as a parameter, see example code below:

### Sample code for automated chunk_size evaluation

In [9]:
all_results = []
for size in [128, 256, 512]:

    #load golden datasets
    data_path = f'../data/golden_datasets/golden_{size}.json'
    golden_dataset = FileIO().load_json(data_path)
    
    #assign collection name
    collection_name = f'Huberman_minilm_{size}'
    print(f'Running test on chunk size {size} on {collection_name} Collection')

    #get results by chunk_size
    results = retrieval_evaluation(golden_dataset, collection_name, retriever, query_properties=['content'], chunk_size=size)
    all_results.append(results)
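
Once `all_results` is populated, it helps to line the runs up side by side. The sketch below assumes each entry carries the keys set in `results_dict` inside `retrieval_evaluation`. The `format_results_table` helper is hypothetical, and the numbers in `toy_results` are placeholders, not measured scores.

```python
def format_results_table(all_results: list[dict]) -> str:
    """Render one row per run, sorted by chunk size, as a plain-text table."""
    cols = ["chunk_size", "kw_hit_rate", "kw_mrr",
            "vector_hit_rate", "vector_mrr", "total_misses"]
    header = " | ".join(f"{c:>15}" for c in cols)
    rows = [header, "-" * len(header)]
    for res in sorted(all_results, key=lambda r: r["chunk_size"]):
        rows.append(" | ".join(f"{res[c]:>15}" for c in cols))
    return "\n".join(rows)


# Placeholder values only, standing in for the dicts returned by retrieval_evaluation
toy_results = [
    {"chunk_size": 256, "kw_hit_rate": 0.5, "kw_mrr": 0.4,
     "vector_hit_rate": 0.5, "vector_mrr": 0.4, "total_misses": 10},
    {"chunk_size": 128, "kw_hit_rate": 0.5, "kw_mrr": 0.4,
     "vector_hit_rate": 0.5, "vector_mrr": 0.4, "total_misses": 10},
]

table = format_results_table(toy_results)
print(table)
```

With the real runs you would pass `all_results` in place of `toy_results`.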


Queries:  26%|█████████████████████████████████████████████████▍                                                                                                                                            | 26/100 [00:02<00:07, 10.56it/s]

Queries:  28%|█████████████████████████████████████████████████████▏                                                                                                                                        | 28/100 [00:02<00:06, 10.55it/s]

Queries:  30%|█████████████████████████████████████████████████████████                                                                                                                                     | 30/100 [00:02<00:06, 10.55it/s]

Queries:  32%|████████████████████████████████████████████████████████████▊                                                                                                                                 | 32/100 [00:03<00:06, 10.55it/s]

Queries:  34%|████████████████████████████████████████████████████████████████▌                                                                                                                             | 34/100 [00:03<00:06, 10.56it/s]

Queries:  36%|████████████████████████████████████████████████████████████████████▍                                                                                                                         | 36/100 [00:03<00:06, 10.56it/s]

Queries:  38%|████████████████████████████████████████████████████████████████████████▏                                                                                                                     | 38/100 [00:03<00:05, 10.56it/s]

Queries:  40%|████████████████████████████████████████████████████████████████████████████                                                                                                                  | 40/100 [00:03<00:05, 10.56it/s]

Queries:  42%|███████████████████████████████████████████████████████████████████████████████▊                                                                                                              | 42/100 [00:03<00:05, 10.56it/s]

Queries:  44%|███████████████████████████████████████████████████████████████████████████████████▌                                                                                                          | 44/100 [00:04<00:05, 10.56it/s]

Queries:  46%|███████████████████████████████████████████████████████████████████████████████████████▍                                                                                                      | 46/100 [00:04<00:05, 10.55it/s]

Queries:  48%|███████████████████████████████████████████████████████████████████████████████████████████▏                                                                                                  | 48/100 [00:04<00:04, 10.55it/s]

Queries:  50%|███████████████████████████████████████████████████████████████████████████████████████████████                                                                                               | 50/100 [00:04<00:04, 10.56it/s]

Queries:  52%|██████████████████████████████████████████████████████████████████████████████████████████████████▊                                                                                           | 52/100 [00:04<00:04, 10.56it/s]

Queries:  54%|██████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                                                       | 54/100 [00:05<00:04, 10.56it/s]

Queries:  56%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                                                                   | 56/100 [00:05<00:04, 10.55it/s]

Queries:  58%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                                                               | 58/100 [00:05<00:03, 10.56it/s]

Queries:  60%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                            | 60/100 [00:05<00:03, 10.55it/s]

Queries:  62%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                                                        | 62/100 [00:05<00:03, 10.55it/s]

Queries:  64%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                                    | 64/100 [00:06<00:03, 10.55it/s]

Queries:  66%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                                                | 66/100 [00:06<00:03, 10.55it/s]

Queries:  68%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                                            | 68/100 [00:06<00:03, 10.55it/s]

Queries:  70%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                         | 70/100 [00:06<00:02, 10.56it/s]

Queries:  72%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                                     | 72/100 [00:06<00:02, 10.55it/s]

Queries:  74%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                 | 74/100 [00:07<00:02, 10.56it/s]

Queries:  76%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                             | 76/100 [00:07<00:02, 10.56it/s]

Queries:  78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                         | 78/100 [00:07<00:02, 10.55it/s]

Queries:  80%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                      | 80/100 [00:07<00:01, 10.55it/s]

Queries:  82%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                  | 82/100 [00:07<00:01, 10.54it/s]

Queries:  84%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                              | 84/100 [00:07<00:01, 10.55it/s]

Queries:  86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                          | 86/100 [00:08<00:01, 10.56it/s]

Queries:  88%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                      | 88/100 [00:08<00:01, 10.55it/s]

Queries:  90%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                   | 90/100 [00:08<00:00, 10.54it/s]

Queries:  92%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊               | 92/100 [00:08<00:00, 10.50it/s]

Queries:  94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌           | 94/100 [00:08<00:00, 10.48it/s]

Queries:  96%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍       | 96/100 [00:09<00:00, 10.48it/s]

Queries:  98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏   | 98/100 [00:09<00:00, 10.51it/s]

Queries: 100%|██████████| 100/100 [00:09<00:00, 10.55it/s]


In [21]:
print(all_results)

In [3]:
import os

from src.evaluation.retrieval_evaluation import calc_hit_rate_scores, calc_mrr_scores, record_results
from src.database.weaviate_interface_v4 import WeaviateWCS
from src.database.database_utils import get_weaviate_client
from src.preprocessor.preprocessing import FileIO


data_path = '../data/golden_datasets/golden_256.json'

#################
##  START CODE ##
#################


### Load QA dataset
golden_dataset = FileIO.load_json(data_path)

### Instantiate Weaviate client and set Collection name
api_key = os.environ['WEAVIATE_API_KEY']
url = os.environ['WEAVIATE_ENDPOINT']
model_path = 'Snowflake/snowflake-arctic-embed-l'
retriever = WeaviateWCS(endpoint=url, api_key=api_key, model_name_or_path=model_path)
collection_name = 'Huberman_arctic_256'


#################
##  END CODE   ##
#################

# should see 100 queries
print(f'Num queries in Golden Dataset: {len(golden_dataset["queries"])}')
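
The cell above reads `WEAVIATE_API_KEY` and `WEAVIATE_ENDPOINT` straight from `os.environ`, which raises a bare `KeyError` if either one is unset. A minimal sketch of a pre-check you could run first is shown below; `require_env` is a hypothetical helper written for this illustration and is not part of the course codebase.

```python
import os
from typing import Iterable, Mapping


def require_env(names: Iterable[str], env: Mapping[str, str] = os.environ) -> dict:
    """Return the requested variables, or raise one error listing all that are missing.

    Hypothetical helper for illustration only; not part of the course `src` package.
    """
    names = list(names)
    # Treat unset and empty-string values the same way: both are unusable credentials.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Example usage (commented out so the sketch does not fail when the variables are unset):
# creds = require_env(["WEAVIATE_API_KEY", "WEAVIATE_ENDPOINT"])
# api_key, url = creds["WEAVIATE_API_KEY"], creds["WEAVIATE_ENDPOINT"]
```

Collecting every missing name before raising means you fix your environment once rather than discovering the problems one `KeyError` at a time.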


In [5]:
eval_results = retrieval_evaluation(golden_dataset, collection_name, retriever)

Queries: 100%|██████████| 100/100 [00:47<00:00,  2.13it/s]


In [6]:
print(eval_results)

In [None]:
{
    'n': 5,
    'Retriever': 'sentence-transformers/all-MiniLM-L6-v2',
    'chunk_size': 256,
    'query_props': ['content'],
    'kw_hit_rate': 0.9,
    'kw_mrr': 0.82,
    'vector_hit_rate': 0.71,
    'vector_mrr': 0.59,
    'total_misses': 8,
    'total_questions': 100
}
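
The results dictionary reports a hit rate and an MRR (mean reciprocal rank) for each search method over the top `n` results. To make those two numbers concrete, here is a minimal sketch of how they can be computed from ranked lists of `doc_id`s. It assumes one expected `doc_id` per query, matching the query-context pairs in the golden dataset. The function names and toy data are illustrative assumptions; the course's own `calc_hit_rate_scores` and `calc_mrr_scores` may be implemented differently.

```python
from typing import Sequence


def hit_rate(retrieved: Sequence[Sequence[str]], expected: Sequence[str]) -> float:
    """Fraction of queries whose expected doc_id appears anywhere in its result list."""
    hits = sum(1 for ids, target in zip(retrieved, expected) if target in ids)
    return hits / len(expected)


def mean_reciprocal_rank(retrieved: Sequence[Sequence[str]], expected: Sequence[str]) -> float:
    """Average of 1/rank of the expected doc_id; a miss contributes 0."""
    total = 0.0
    for ids, target in zip(retrieved, expected):
        if target in ids:
            rank = list(ids).index(target) + 1  # ranks are 1-based
            total += 1.0 / rank
    return total / len(expected)


# Toy data (made up for illustration): four queries, top-3 results each.
retrieved = [
    ["a", "b", "c"],  # expected "a" at rank 1 -> reciprocal rank 1
    ["x", "y", "z"],  # expected "y" at rank 2 -> reciprocal rank 1/2
    ["m", "n", "o"],  # expected "o" at rank 3 -> reciprocal rank 1/3
    ["p", "q", "r"],  # expected doc_id not retrieved -> 0
]
expected = ["a", "y", "o", "missing"]

toy_hit_rate = hit_rate(retrieved, expected)       # 3 of 4 queries hit
toy_mrr = mean_reciprocal_rank(retrieved, expected)  # (1 + 1/2 + 1/3 + 0) / 4
print(f"hit_rate={toy_hit_rate:.2f}, mrr={toy_mrr:.3f}")
```

Hit rate only asks whether the expected chunk showed up at all, while MRR also rewards placing it near the top, which is why MRR can never exceed the hit rate for the same method.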