# Evaluating RAG with Alpha Tuning
Evaluation is a crucial part of building a RAG pipeline. Alpha tuning can also be time-consuming to build out, so what performance benefit do we get for all of this extra effort? Let's dig into that.

### Fixtures
- For our dataset: several research papers on AI (the same ones used in the blog post referenced below), stored in `/data`
- Our vector db: [Pinecone](https://www.pinecone.io/)
- For our embedding model: [ada-002](https://platform.openai.com/docs/models/embeddings) 
- For our LLM: [GPT-3.5 Turbo](https://platform.openai.com/docs/models/gpt-3-5-turbo)

These fixtures were chosen because they're well integrated with LlamaIndex, which makes this notebook easy to transfer for anyone looking to reproduce it.
Pinecone also supports hybrid search, which alpha tuning requires. Koda Retriever will also be used!

### Testing

Koda Retriever was largely inspired by the alpha tuning [blog post written by Ravi Theja](https://blog.llamaindex.ai/llamaindex-enhancing-retrieval-performance-with-alpha-tuning-in-hybrid-search-in-rag-135d0c9b8a00) from LlamaIndex. For that reason, we'll follow a similar pattern and evaluate with:
- MRR (Mean Reciprocal Rank)
- Hit Rate
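
To make the two metrics concrete, here is a minimal sketch of how they can be computed by hand. The helper names and the toy ids are made up for illustration; the notebook itself relies on `RetrieverEvaluator` rather than on this code.

```python
def reciprocal_rank(retrieved_ids, expected_id):
    """Return 1/rank of the expected id, or 0.0 if it was not retrieved."""
    for rank, node_id in enumerate(retrieved_ids, start=1):
        if node_id == expected_id:
            return 1.0 / rank
    return 0.0


def hit(retrieved_ids, expected_id):
    """Return 1.0 if the expected id appears anywhere in the results."""
    return 1.0 if expected_id in retrieved_ids else 0.0


# Hypothetical toy data: (retrieved ids in ranked order, expected id)
queries = [
    (["a", "b", "c"], "a"),  # found at rank 1 -> reciprocal rank 1.0
    (["d", "e", "f"], "e"),  # found at rank 2 -> reciprocal rank 0.5
    (["g", "h", "i"], "z"),  # not found       -> reciprocal rank 0.0
]

# Both metrics are plain averages over all queries.
mrr = sum(reciprocal_rank(r, e) for r, e in queries) / len(queries)
hit_rate = sum(hit(r, e) for r, e in queries) / len(queries)
print(f"MRR={mrr:.3f}, hit rate={hit_rate:.3f}")  # MRR=0.500, hit rate=0.667
```

Hit rate only asks whether the right node showed up in the top-k results, while MRR also rewards ranking it near the top.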

### Agenda:
- Fixture Setup
- Data Ingestion
- Synthetic Query Generation
- Alpha Mining & Evaluation
- Koda Retriever vs Vanilla Hybrid Retriever

In [None]:
# Import all the necessary modules
from llama_index.llms.openai import OpenAI
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.postprocessor import LLMRerank
from llama_index.vector_stores.pinecone import PineconeVectorStore
from llama_index.core import Settings
from llama_index.packs.koda_retriever import KodaRetriever
from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.core import SimpleDirectoryReader
import os
from pinecone import Pinecone
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.evaluation import generate_qa_embedding_pairs
import pandas as pd

## Fixture Setup

In [None]:
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))
index = pc.Index("llama2-paper")  # this was previously created in my pinecone account

Settings.llm = OpenAI()
Settings.embed_model = OpenAIEmbedding()

# if you recreate this using the default dataset in pinecone, you'll want to set `text_key = "summary"`
vector_store = PineconeVectorStore(pinecone_index=index)
vector_index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, embed_model=Settings.embed_model
)

reranker = LLMRerank(llm=Settings.llm)  # optional

koda_retriever = KodaRetriever(
    index=vector_index,
    llm=Settings.llm,
    reranker=reranker,  # optional
    verbose=True,
    similarity_top_k=10,
)

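# note: as_retriever() with no arguments uses the retriever defaults (query mode and top-k);
# pass vector_store_query_mode="hybrid" and a matching similarity_top_k for a like-for-like comparison
vanilla_retriever = vector_index.as_retriever()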

pipeline = IngestionPipeline(
    transformations=[Settings.embed_model], vector_store=vector_store
)

## Data Ingestion

Three research papers in `/data` are going to be ingested into our Pinecone instance.

Our chunking strategy will solely be [semantic](https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking.html), although this is not recommended for production. For production use cases, more analysis and other chunking strategies should be considered.
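
For intuition, the idea behind percentile-based semantic chunking can be sketched as below: measure the embedding distance between neighbouring sentences and start a new chunk wherever that distance exceeds a chosen percentile. This is a simplified illustration with made-up 2D embeddings and hypothetical helper names, not the actual `SemanticSplitterNodeParser` implementation.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Percentile of a non-empty list using linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(sentences, embeddings, threshold_pct=95):
    """Split where the distance to the next sentence exceeds the percentile cutoff."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    if not distances:
        return [list(sentences)]
    cutoff = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > cutoff:  # large semantic jump -> start a new chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


# Hypothetical sentences with toy 2D "embeddings" (two clear topics)
sentences = ["Cats purr.", "Cats nap.", "GPUs are fast.", "GPUs run hot."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_chunks(sentences, embeddings))
```

A higher percentile threshold produces fewer, larger chunks; a lower one splits more aggressively.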

In [None]:
# Simple function to load documents from a file - also taken from the source blog post


def load_documents(file_path, num_pages=None):
    if num_pages:
        documents = SimpleDirectoryReader(input_files=[file_path]).load_data()[
            :num_pages
        ]
    else:
        documents = SimpleDirectoryReader(input_files=[file_path]).load_data()
    return documents


# Load the documents
doc1 = load_documents(
    "/workspaces/llama_index/llama-index-packs/llama-index-packs-koda-retriever/examples/data/dense_x_retrieval.pdf",
    num_pages=9,
)
doc2 = load_documents(
    "/workspaces/llama_index/llama-index-packs/llama-index-packs-koda-retriever/examples/data/llama_beyond_english.pdf",
    num_pages=7,
)
doc3 = load_documents(
    "/workspaces/llama_index/llama-index-packs/llama-index-packs-koda-retriever/examples/data/llm_compiler.pdf",
    num_pages=12,
)
docs = [doc1, doc2, doc3]
nodes = list()

node_parser = SemanticSplitterNodeParser(
    embed_model=Settings.embed_model, breakpoint_percentile_threshold=95
)
# ingestion
for doc in docs:
    _nodes = node_parser.build_semantic_nodes_from_documents(
        documents=doc,
    )
    nodes.extend(_nodes)

    pipeline.run(nodes=_nodes)

Upserted vectors:   0%|          | 0/26 [00:00<?, ?it/s]

Upserted vectors:   0%|          | 0/23 [00:00<?, ?it/s]

Upserted vectors:   0%|          | 0/37 [00:00<?, ?it/s]

## Synthetic Query Generation

In [None]:
qa_dataset = generate_qa_embedding_pairs(nodes=nodes, llm=Settings.llm)

100%|██████████| 86/86 [03:06<00:00,  2.17s/it]


{'4593ea90-2088-4185-8db0-6aa65bdc106e': 'How does the choice of retrieval unit, such as document, passage, sentence, or proposition, impact the performance of dense retrieval and downstream tasks in open-domain NLP?',
 '861e2131-add7-47cd-afdf-52b6e27562e7': 'Can you explain the significance of using proposition-based retrieval over traditional passage or sentence-based methods in dense retrieval, as highlighted in the study?',
 'db511668-5d08-44d2-ad99-67f8a6f1495d': 'How can the corpus be used for inference in a research study?',
 '359a1385-6ae2-4cb7-8970-2564a4f746fd': "As a teacher, how would you explain the importance of using diverse questions in assessments to ensure a comprehensive evaluation of students' understanding?",
 'c8b0cefb-d6d0-4aca-9c0f-ce55d051fe91': 'How does the choice of retrieval unit impact the performance of dense retrieval models according to the information provided in the document?',
 '0a15445b-f7ae-457a-9800-0f8fc29a74f7': 'What is the significance of sel

## Alpha Mining & Evaluation

We'll update the alpha value of a vector index retriever right before each evaluation run.
We'll evaluate alpha values from 0 to 1 in increments of 0.1, where 0 is a pure keyword (sparse) search and 1 is a pure vector (dense) search.
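
As a rough sketch of what alpha does, one common convention for hybrid queries is a convex combination: the dense query vector is scaled by `alpha` and the sparse query values by `1 - alpha`. The function name and the toy vectors below are hypothetical, and whether your vector store applies alpha in exactly this way is an assumption worth checking against its documentation.

```python
def hybrid_scale(dense, sparse, alpha):
    """Weight a dense query vector by alpha and a sparse one by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


# Hypothetical toy query representations
dense_query = [0.2, 0.4]
sparse_query = {"indices": [3, 17], "values": [1.0, 0.5]}

for alpha in (0.0, 0.5, 1.0):
    print(alpha, hybrid_scale(dense_query, sparse_query, alpha))
```

At `alpha=0.0` the dense vector is zeroed out, so only keyword matches contribute; at `alpha=1.0` the sparse values are zeroed out, so only vector similarity contributes.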

In [None]:
def calculate_metrics(eval_results):
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()
    return hit_rate, mrr


async def alpha_mine(
    qa_dataset,
    vector_store_index,
    alpha_values=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
):
    retriever = VectorIndexRetriever(
        index=vector_store_index,
        vector_store_query_mode="hybrid",
        alpha=0.0,  # this will change
        similarity_top_k=10,
    )

    results = dict()

    for alpha in alpha_values:
        retriever._alpha = alpha
        retriever_evaluator = RetrieverEvaluator.from_metric_names(
            metric_names=["mrr", "hit_rate"], retriever=retriever
        )
        eval_results = await retriever_evaluator.aevaluate_dataset(dataset=qa_dataset)

        hit_rate, mrr = calculate_metrics(eval_results)

        results[alpha] = {"hit_rate": hit_rate, "mrr": mrr}

    return results


results = await alpha_mine(qa_dataset=qa_dataset, vector_store_index=vector_index)
results

{0.0: {'hit_rate': 0.11627906976744186, 'mrr': 0.034057770394979696},
 0.1: {'hit_rate': 0.9069767441860465, 'mrr': 0.7831879844961239},
 0.2: {'hit_rate': 0.9069767441860465, 'mrr': 0.7831879844961239},
 0.3: {'hit_rate': 0.9069767441860465, 'mrr': 0.7831879844961239},
 0.4: {'hit_rate': 0.9069767441860465, 'mrr': 0.7831879844961239},
 0.5: {'hit_rate': 0.9069767441860465, 'mrr': 0.7831879844961239},
 0.6: {'hit_rate': 0.9069767441860465, 'mrr': 0.7831879844961239},
 0.7: {'hit_rate': 0.9069767441860465, 'mrr': 0.7831879844961239},
 0.8: {'hit_rate': 0.9069767441860465, 'mrr': 0.7831879844961239},
 0.9: {'hit_rate': 0.9069767441860465, 'mrr': 0.7831879844961239},
 1.0: {'hit_rate': 0.9069767441860465, 'mrr': 0.7831879844961239}}

### Conclusions

As seen above, alpha values from 0.1 to 1.0 produced identical results (hit rate ≈ 0.907, MRR ≈ 0.783), while an alpha of 0.0 performed far worse (hit rate ≈ 0.116, MRR ≈ 0.034). This was a single dataset, and you'll most likely want to test on multiple datasets for multiple purposes. For our purposes, the default of 0.5 used by most hybrid retrievers probably works well, although as shown in the [original blog post](https://blog.llamaindex.ai/llamaindex-enhancing-retrieval-performance-with-alpha-tuning-in-hybrid-search-in-rag-135d0c9b8a00) that started this llama pack, alpha results can vary wildly between datasets.
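
When repeating this on other datasets, a small helper can pick out the winning alpha values, including ties. The sketch below is one possible approach; `best_alphas` and `example_results` are hypothetical names, and the numbers are a rounded subset of the results shown above.

```python
def best_alphas(results, metric="mrr", tol=1e-9):
    """Return every alpha whose score is within tol of the best score."""
    top = max(scores[metric] for scores in results.values())
    return sorted(
        alpha for alpha, scores in results.items() if top - scores[metric] <= tol
    )


# Rounded subset of the alpha mining output, in the same shape alpha_mine returns
example_results = {
    0.0: {"hit_rate": 0.116, "mrr": 0.034},
    0.1: {"hit_rate": 0.907, "mrr": 0.783},
    0.5: {"hit_rate": 0.907, "mrr": 0.783},
    1.0: {"hit_rate": 0.907, "mrr": 0.783},
}

print(best_alphas(example_results))  # [0.1, 0.5, 1.0]
print(best_alphas(example_results, metric="hit_rate"))  # [0.1, 0.5, 1.0]
```

Returning all tied values rather than a single maximum makes a plateau like the one above obvious instead of hiding it behind an arbitrary pick.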

## Bonus: Koda Retriever vs Vanilla Hybrid Retriever

Finally, we'll evaluate and compare a vanilla hybrid retriever against our Koda Retriever.
The Koda Retriever will use the default alpha values and categories provided in the llama pack.

In [None]:
async def compare_retrievers(retrievers, qa_dataset):
    results = dict()

    for name, retriever in retrievers.items():
        retriever_evaluator = RetrieverEvaluator.from_metric_names(
            metric_names=["mrr", "hit_rate"], retriever=retriever
        )

        eval_results = await retriever_evaluator.aevaluate_dataset(dataset=qa_dataset)

        hit_rate, mrr = calculate_metrics(eval_results)

        results[name] = {"hit_rate": hit_rate, "mrr": mrr}

    return results


retrievers = {"vanilla": vanilla_retriever, "koda": koda_retriever}

results = await compare_retrievers(retrievers=retrievers, qa_dataset=qa_dataset)
results