## Benchmarking Results - UpTrain Evals

**Overview**: In this notebook, we compare UpTrain's standard evals against human judgements. For every example, each eval produces a score, along with an explanation for every eval except Response Match. A human annotator has also assigned each example a score for every eval.

This notebook covers six evals:
- Context Relevance
- Response Conciseness
- Response Match
- Factual Accuracy
- Response Completeness with respect to Context
- Response Relevance

Each score has a value between 0 and 1. 

For our evaluations, we used the Financial QA (FiQA) dataset, which has roughly 6,000 questions and 57,000 answers. Financial QA is hard because the vocabulary is context-specific. In this experiment, we randomly picked 30 questions and ran our evaluations on them.
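
The sampling step itself is not shown in this notebook. As a minimal sketch of how a reproducible random subset could be drawn, assuming the questions are available as a plain Python list (the helper name `sample_questions`, the seed, and the placeholder question IDs are illustrative and not taken from the original pipeline):

```python
import random


def sample_questions(questions, k=30, seed=0):
    """Return k distinct questions chosen uniformly at random.

    A fixed seed makes the selection repeatable across runs.
    """
    rng = random.Random(seed)  # local RNG; leaves the global random state untouched
    return rng.sample(questions, k)


# Placeholder IDs standing in for the roughly 6,000 FiQA questions (assumption).
all_questions = [f"question_{i}" for i in range(6000)]
subset = sample_questions(all_questions)
print(len(subset))
```

Fixing the seed is a design choice that lets the same 30 questions be re-drawn later, which matters when human scores are tied to specific examples.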

In [46]:
import numpy as np 
import polars as pl 
import os 
import tempfile

In [None]:
url = "https://oodles-dev-training-data.s3.us-west-1.amazonaws.com/benchmark.jsonl"
TEMP_DIR = tempfile.gettempdir()
dataset_path = os.path.join(TEMP_DIR, "benchmark.jsonl")

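# Download the benchmark file once and cache it in the temp directory.
if not os.path.exists(dataset_path):
    import httpx
    r = httpx.get(url)
    r.raise_for_status()  # fail early instead of caching an error response
    with open(dataset_path, "wb") as f:
        f.write(r.content)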

In [47]:
dataset = pl.read_ndjson(dataset_path)

In [48]:
print("Number of test cases: ", len(dataset))
print("Couple of samples: ", dataset[0:2])

Number of test cases:  30
Couple of samples:  shape: (2, 21)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ question  ┆ context   ┆ response  ┆ ground_tr ┆ … ┆ human_sco ┆ human_sco ┆ human_sco ┆ human_sc │
│ ---       ┆ ---       ┆ ---       ┆ uth       ┆   ┆ re_respon ┆ re_respon ┆ re_factua ┆ ore_resp │
│ str       ┆ str       ┆ str       ┆ ---       ┆   ┆ se_releva ┆ se_match  ┆ l_accurac ┆ onse_com │
│           ┆           ┆           ┆ str       ┆   ┆ nce       ┆ ---       ┆ y         ┆ pletenes │
│           ┆           ┆           ┆           ┆   ┆ ---       ┆ f64       ┆ ---       ┆ …        │
│           ┆           ┆           ┆           ┆   ┆ f64       ┆           ┆ f64       ┆ ---      │
│           ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆ f64      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ Do the    ┆ It really ┆ Base

In [50]:
print(dataset.columns)

['question', 'context', 'response', 'ground_truth', 'score_context_relevance', 'explanation_context_relevance', 'score_response_completeness_wrt_context', 'explanation_response_completeness_wrt_context', 'score_response_relevance', 'explanation_response_relevance', 'score_response_conciseness', 'explanation_response_conciseness', 'score_response_match', 'score_factual_accuracy', 'explanation_factual_accuracy', 'human_score_context_relevance', 'human_score_response_conciseness', 'human_score_response_relevance', 'human_score_response_match', 'human_score_factual_accuracy', 'human_score_response_completeness_wrt_context']


## Context Relevance

In [60]:
human_context_relevance_score = []
llm_context_relevance_score = []

In [61]:
# score_* columns hold the UpTrain eval scores; human_score_* columns hold the human labels.
for i in range(len(dataset)):
    llm_context_relevance_score.append(dataset[i]['score_context_relevance'][0])

for i in range(len(dataset)):
    human_context_relevance_score.append(dataset[i]['human_score_context_relevance'][0])

In [62]:
np.mean(np.abs(np.array(human_context_relevance_score)- np.array(llm_context_relevance_score)))

0.03333333333333333

For the 'Context Relevance' eval, the MAE (Mean Absolute Error) against human judgements is 0.03.
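
For reference, MAE here is the average of |eval score - human score| over all examples, so it is symmetric in its two inputs and stays between 0 and 1 when both scores do. A small pure-Python sketch of the same computation as the NumPy expression above (the function name and the toy scores are illustrative, not values from the benchmark):

```python
def mean_absolute_error(eval_scores, human_scores):
    """Average absolute difference between two equally long score lists."""
    if len(eval_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if not eval_scores:
        raise ValueError("score lists must not be empty")
    total = sum(abs(s - h) for s, h in zip(eval_scores, human_scores))
    return total / len(eval_scores)


# Toy data: two of the four examples disagree by 0.5 each.
toy_eval = [1.0, 0.5, 0.0, 1.0]
toy_human = [1.0, 1.0, 0.0, 0.5]
mae = mean_absolute_error(toy_eval, toy_human)
print(mae)  # 0.25
```

Read this way, an MAE of about 0.033 over 30 examples corresponds to a total absolute difference of 1.0 across the whole set.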

## Response Conciseness

In [65]:
human_response_conciseness_score = []
llm_response_conciseness_score = []

In [66]:
# score_* columns hold the UpTrain eval scores; human_score_* columns hold the human labels.
for i in range(len(dataset)):
    llm_response_conciseness_score.append(dataset[i]['score_response_conciseness'][0])

for i in range(len(dataset)):
    human_response_conciseness_score.append(dataset[i]['human_score_response_conciseness'][0])

In [67]:
np.mean(np.abs(np.array(human_response_conciseness_score)- np.array(llm_response_conciseness_score)))

0.11666666666666667

For the 'Response Conciseness' eval, the MAE (Mean Absolute Error) against human judgements is 0.12.

## Response Match  

In [69]:
human_response_match_score = []
llm_response_match_score = []

In [70]:
# score_* columns hold the UpTrain eval scores; human_score_* columns hold the human labels.
for i in range(len(dataset)):
    llm_response_match_score.append(dataset[i]['score_response_match'][0])

for i in range(len(dataset)):
    human_response_match_score.append(dataset[i]['human_score_response_match'][0])

In [71]:
np.mean(np.abs(np.array(human_response_match_score)- np.array(llm_response_match_score)))

0.1592717652717653

For the 'Response Match' eval, the MAE (Mean Absolute Error) against human judgements is 0.16.

## Factual Accuracy

In [74]:
human_factual_accuracy_score = []
llm_factual_accuracy_score = []

In [75]:
# score_* columns hold the UpTrain eval scores; human_score_* columns hold the human labels.
for i in range(len(dataset)):
    llm_factual_accuracy_score.append(dataset[i]['score_factual_accuracy'][0])

for i in range(len(dataset)):
    human_factual_accuracy_score.append(dataset[i]['human_score_factual_accuracy'][0])

In [76]:
np.mean(np.abs(np.array(human_factual_accuracy_score)- np.array(llm_factual_accuracy_score)))

0.08333333333333333

For the 'Factual Accuracy' eval, the MAE (Mean Absolute Error) against human judgements is 0.08.

## Response Completeness with respect to Context


In [77]:
human_response_completeness_wrt_context_score = []
llm_response_completeness_wrt_context_score = []

In [78]:
# score_* columns hold the UpTrain eval scores; human_score_* columns hold the human labels.
for i in range(len(dataset)):
    llm_response_completeness_wrt_context_score.append(dataset[i]['score_response_completeness_wrt_context'][0])

for i in range(len(dataset)):
    human_response_completeness_wrt_context_score.append(dataset[i]['human_score_response_completeness_wrt_context'][0])

In [79]:
np.mean(np.abs(np.array(human_response_completeness_wrt_context_score)- np.array(llm_response_completeness_wrt_context_score)))

0.24000000000000002

For the 'Response Completeness with respect to Context' eval, the MAE (Mean Absolute Error) against human judgements is 0.24.

## Response Relevance

In [80]:
human_response_relevance_score = []
llm_response_relevance_score = []

In [81]:
# score_* columns hold the UpTrain eval scores; human_score_* columns hold the human labels.
for i in range(len(dataset)):
    llm_response_relevance_score.append(dataset[i]['score_response_relevance'][0])

for i in range(len(dataset)):
    human_response_relevance_score.append(dataset[i]['human_score_response_relevance'][0])

In [82]:
np.mean(np.abs(np.array(human_response_relevance_score)- np.array(llm_response_relevance_score)))

0.1813333333333333

For the 'Response Relevance' eval, the MAE (Mean Absolute Error) against human judgements is 0.18.
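
Since every eval follows the same `score_<eval>` / `human_score_<eval>` column naming, the six per-eval computations could also be run in one pass. A hedged sketch, assuming the rows are available as a list of dictionaries keyed by column name (the helper `mae_per_eval`, the `EVALS` list, and the toy rows are illustrative and not part of the original notebook):

```python
# Eval names as they appear in the dataset's column suffixes.
EVALS = [
    "context_relevance",
    "response_conciseness",
    "response_match",
    "factual_accuracy",
    "response_completeness_wrt_context",
    "response_relevance",
]


def mae_per_eval(rows, evals=EVALS):
    """Map each eval name to the MAE between score_<eval> and human_score_<eval>."""
    result = {}
    for name in evals:
        diffs = [
            abs(row[f"score_{name}"] - row[f"human_score_{name}"])
            for row in rows
        ]
        result[name] = sum(diffs) / len(diffs)
    return result


# Toy rows with a single eval; these are not benchmark values.
toy_rows = [
    {"score_context_relevance": 1.0, "human_score_context_relevance": 1.0},
    {"score_context_relevance": 0.5, "human_score_context_relevance": 1.0},
]
print(mae_per_eval(toy_rows, evals=["context_relevance"]))
```

With the DataFrame loaded earlier, `mae_per_eval(dataset.to_dicts())` would be the corresponding call, since polars' `to_dicts()` returns one dictionary per row. This sketch assumes no score is missing; null values would need to be filtered out first.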