## Benchmarking Results - UpTrain Evals

**Overview**: In this notebook, we compare the standard UpTrain evals against human judgements. For every example and every eval, we have an UpTrain score with an explanation, as well as a score assigned by a human.

This notebook covers six evals:
- Context Relevance
- Response Conciseness
- Response Match
- Factual Accuracy
- Response Completeness with respect to Context
- Response Relevance

Each score has a value between 0 and 1. 

For our evaluations, we used the Financial QA (FiQA) dataset, which has roughly 6,000 questions and 57,000 answers. Financial QA is hard because the vocabulary is context-specific. In this experiment, we randomly picked 30 questions and ran our evaluations on them.

In [1]:
import numpy as np 
import polars as pl 
import os
import tempfile
import json

In [2]:
url = "https://uptrain-assets.s3.ap-south-1.amazonaws.com/data/benchmark.jsonl"
dataset_path = os.path.join('./', "benchmark.jsonl")

if not os.path.exists(dataset_path):
    import httpx
    r = httpx.get(url)
    with open(dataset_path, "wb") as f:
        f.write(r.content)
        

In [4]:
dataset = pl.read_ndjson(dataset_path)

In [5]:
print("Number of test cases: ", len(dataset))
print("Couple of samples: ", dataset[0:2])

Number of test cases:  30
Couple of samples:  shape: (2, 10)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ question  ┆ response  ┆ ground_tr ┆ context   ┆ … ┆ human_sco ┆ human_sco ┆ human_sco ┆ human_sc │
│ ---       ┆ ---       ┆ uth       ┆ ---       ┆   ┆ re_respon ┆ re_respon ┆ re_factua ┆ ore_resp │
│ str       ┆ str       ┆ ---       ┆ str       ┆   ┆ se_releva ┆ se_match  ┆ l_accurac ┆ onse_com │
│           ┆           ┆ str       ┆           ┆   ┆ nce       ┆ ---       ┆ y         ┆ pletenes │
│           ┆           ┆           ┆           ┆   ┆ ---       ┆ f64       ┆ ---       ┆ …        │
│           ┆           ┆           ┆           ┆   ┆ f64       ┆           ┆ f64       ┆ ---      │
│           ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆ f64      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ Do the    ┆ Based on  ┆ No. 

In [6]:
print(dataset.columns)

['question', 'response', 'ground_truth', 'context', 'human_score_context_relevance', 'human_score_response_conciseness', 'human_score_response_relevance', 'human_score_response_match', 'human_score_factual_accuracy', 'human_score_response_completeness_wrt_context']
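
Since every score is expected to lie between 0 and 1, a quick range check on the human score columns listed above can catch labelling mistakes before any comparison. The sketch below is a minimal, library-free illustration: `out_of_range` is a hypothetical helper, and the toy values are made up rather than taken from the benchmark file.

```python
def out_of_range(columns, low=0.0, high=1.0):
    """Return {column_name: [row indices]} for values missing or outside [low, high]."""
    problems = {}
    for name, values in columns.items():
        bad = [i for i, v in enumerate(values) if v is None or not (low <= v <= high)]
        if bad:
            problems[name] = bad
    return problems


# Toy data for illustration only (not from the benchmark file).
toy_columns = {
    "human_score_context_relevance": [1.0, 0.5, 0.0],
    "human_score_factual_accuracy": [0.8, 1.2, -0.1],
}
print(out_of_range(toy_columns))
```

With the real data, the same helper could be fed from the polars frame, for example `out_of_range({c: dataset[c].to_list() for c in dataset.columns if c.startswith("human_score_")})`.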


## Running Evals

In [7]:
from uptrain import APIClient, Evals, ResponseMatching


In [13]:
UPTRAIN_API_KEY = "up-*****************" ## INSERT YOUR UPTRAIN KEY HERE

client = APIClient(uptrain_api_key=UPTRAIN_API_KEY)



results = client.log_and_evaluate(
    project_name="benchmark",
    data=dataset,
    checks=[
           Evals.CONTEXT_RELEVANCE,
           Evals.RESPONSE_CONCISENESS,
           ResponseMatching(method='llm'),
           Evals.RESPONSE_COMPLETENESS_WRT_CONTEXT,
           Evals.FACTUAL_ACCURACY,
           Evals.RESPONSE_RELEVANCE
           ]
)


2023-11-05 03:04:24.482 | INFO     | uptrain.framework.remote:log_and_evaluate:455 - Sending evaluation request for rows 0 to <50 to the Uptrain server


In [14]:
dataset = pl.DataFrame(results)

## Context Relevance

In [33]:
print('Question:', dataset[16]['question'][0])
print('Context:', dataset[16]['context'][0])

Question: Can I use my long position stocks as margin for my short sold stocks?
Context: Yes, you call the broker and tell him to use those shares to deliver to the short position.
Yes you can.  This is known as a short selling against the box.  In the old days, this was used to delay a taxable event.  You could lock in a gain without triggering a taxable event.  Any loss on one side of the box would be offset by a loss on the other side, and vice versa.  However, the IRS clamped down on this, and you will realize the gain on your long position as soon as you go short on the other side.  See http://www.investopedia.com/terms/s/sellagainstthebox.asp. As to how to initiate the short cover, just transfer the long position to the same account as your short position and make sure your broker covers the short.  Should be relatively easy.
When you short a stock, you can lose an unlimited amount of money if the trade goes against you. If the shorted stock gaps up overnight you can lose more mo

In [35]:
print(dataset[16]['score_context_relevance'][0])
print(dataset[16]['explanation_context_relevance'][0])

1.0
1. The question is asking whether it is possible to use long position stocks as margin for short sold stocks.
2. The extracted context mentions that it is possible to use long position stocks to deliver to the short position, which implies that they can be used as margin.
3. The context also mentions that this practice is known as short selling against the box and provides some historical background on its use.
4. It further explains that the IRS has clamped down on this practice and that the gain on the long position will be realized as soon as the short position is initiated.
5. The context provides instructions on how to initiate the short cover by transferring the long position to the same account as the short position.
6. The context also mentions the risks associated with short selling, such as the potential for unlimited losses if the trade goes against you.
7. It contrasts this with long positions, where the maximum loss is 100% and the gains are potentially unlimited.

Bas

In this example, the context is highly relevant to the question, which is why it receives a high context relevance score.

In [36]:
print('Question:', dataset[5]['question'][0])
print('Context:', dataset[5]['context'][0])

Question: Can I deduct personal loans or use them as tax “write offs?”
Context:  "Basically, the money you pay in student loan interest is tax deductible, which means as far as the IRS is concerned, you didn't make that money. However, what that saves you on your taxes is a percentage of a percentage; you save the amount of your current marginal rate on the money you paid as interest. Simple example with made-up numbers: Let's say you had a student loan outstanding, and you were making payments of $150 monthly on it. Total payments to said loan in one tax year would be $1800. Of that amount, let's for the sake of argument say that half, $900, was interest. You get your 1098-E with that number on it, and reduce your taxable income by that amount. You're currently doing well, not outstanding but OK, so you're in the 25% tax bracket that most single middle-classers are in. So, your reduction in taxable income of $900 saves you the 25% that those 900 simoleons would have been taxed at, whi

In [38]:
print(dataset[5]['score_context_relevance'][0])
print(dataset[5]['explanation_context_relevance'][0])

0.5
1. Read the question: "Can I deduct personal loans or use them as tax 'write offs?'"
2. Read the extracted context: The extracted context discusses the tax deductibility of student loans and provides an example of how the deduction works. It also mentions that the deduction reduces the cost of capital but does not credit the tax bill in the amount of interest owed.
3. Compare the question and the extracted context: The extracted context provides information about the tax deductibility of student loans, but it does not specifically address personal loans. Therefore, it does not directly answer the question.
4. Determine the answer: The extracted context can provide some relevant information about the tax deductibility of loans, but it does not answer the question completely. Therefore, the correct choice is (B) The extracted context can give some relevant answer for the given question but can't answer it completely.
5. Print the score: 0.5
0.5


In this example, the context is only partially relevant to the question (it covers student loans rather than personal loans), so it receives a lower context relevance score of 0.5.

### Comparative Analysis against human judgements

In [39]:
human_context_relevance_score = list(dataset['human_score_context_relevance'])
llm_context_relevance_score = list(dataset['score_context_relevance'])
np.mean(np.abs(np.array(human_context_relevance_score)- np.array(llm_context_relevance_score)))

0.15

This shows that the 'context relevance' metric is about 85% accurate in comparison to human evaluations (1 minus the mean absolute difference of 0.15).
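
The percentage quoted here is one minus the mean absolute difference between the human and UpTrain scores, which works because both are on a 0-to-1 scale. As a minimal sketch of that arithmetic, the hypothetical helpers below reproduce the calculation in plain Python; the toy score lists are invented for illustration and are not benchmark data.

```python
def mean_absolute_difference(human_scores, llm_scores):
    """Average of |human - llm| over paired scores."""
    if len(human_scores) != len(llm_scores):
        raise ValueError("score lists must have the same length")
    if not human_scores:
        raise ValueError("score lists must not be empty")
    total = sum(abs(h - l) for h, l in zip(human_scores, llm_scores))
    return total / len(human_scores)


def agreement(human_scores, llm_scores):
    """Agreement on a 0-to-1 scale: 1 minus the mean absolute difference."""
    return 1.0 - mean_absolute_difference(human_scores, llm_scores)


# Toy scores for illustration only.
toy_human = [1.0, 0.5, 0.0, 1.0]
toy_llm = [1.0, 1.0, 0.0, 0.5]
print(mean_absolute_difference(toy_human, toy_llm))  # 0.25
print(agreement(toy_human, toy_llm))                 # 0.75
```

This is the same quantity as the `np.mean(np.abs(...))` expression used in the notebook, just written without NumPy.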

## Response Conciseness

In [67]:
print('Question:', dataset[12]['question'][0])
print('Response:', dataset[12]['response'][0])

Question: Should I worry too much about saving my 20% down before buying my first house?
Response: Based on the information in the context, it seems that there are varying opinions on whether you should worry too much about saving your 20% down before buying your first house. Some factors to consider include the need to avoid paying private mortgage insurance (PMI) and having liquidity for emergency funds or other investments. Additionally, having funds set aside for repairs may be a good idea. It is also mentioned that the current low interest rates might make it more sensible to put money into a different investment vehicle and that closing costs should be taken into account. Ultimately, the decision should depend on your personal financial situation, including factors like your monthly payments, interest rates, retirement savings, other debt, and the possibility of changes in life circumstances.


In [68]:
print(dataset[12]['score_response_conciseness'][0])
print(dataset[12]['explanation_response_conciseness'][0])

1.0
1. Read the question: "Should I worry too much about saving my 20% down before buying my first house?"
2. Read the response: "Based on the information in the context, it seems that there are varying opinions on whether you should worry too much about saving your 20% down before buying your first house. Some factors to consider include the need to avoid paying private mortgage insurance (PMI) and having liquidity for emergency funds or other investments. Additionally, having funds set aside for repairs may be a good idea. It is also mentioned that the current low interest rates might make it more sensible to put money into a different investment vehicle and that closing costs should be taken into account. Ultimately, the decision should depend on your personal financial situation, including factors like your monthly payments, interest rates, retirement savings, other debt, and the possibility of changes in life circumstances."
3. Compare the response to the question and identify any

In this case, even though the response is long, it does not contain any irrelevant information, resulting in a high 'response conciseness' score.

In [55]:
print('Question:', dataset[16]['question'][0])
print('Response:', dataset[16]['response'][0])

Question: Can I use my long position stocks as margin for my short sold stocks?
Response: Yes, you can use your long position stocks as margin for your short sold stocks. You would need to transfer the long position to the same account as your short position and make sure your broker covers the short. It is important to note that short selling comes with the risk of losing an unlimited amount of money if the trade goes against you.


In [56]:
print(dataset[16]['score_response_conciseness'][0])
print(dataset[16]['explanation_response_conciseness'][0])

0.5
1. Read the question: "Can I use my long position stocks as margin for my short sold stocks?"
2. Read the response: "Yes, you can use your long position stocks as margin for your short sold stocks. You would need to transfer the long position to the same account as your short position and make sure your broker covers the short. It is important to note that short selling comes with the risk of losing an unlimited amount of money if the trade goes against you."
3. Compare the response to the question:
   - The response directly answers the question by stating that it is possible to use long position stocks as margin for short sold stocks.
4. Check for additional irrelevant information:
   - The response provides additional information about transferring the long position to the same account as the short position and the risk of losing an unlimited amount of money in short selling.
   - While this additional information may be useful, it is not directly relevant to the question of whe

In this example, the response contains additional information that is not directly relevant to the question, resulting in a lower 'response conciseness' score of 0.5.

### Comparative Analysis against human judgements 

In [86]:
human_response_conciseness_score = list(dataset['human_score_response_conciseness'])
llm_response_conciseness_score = list(dataset['score_response_conciseness'])

np.mean(np.abs(np.array(human_response_conciseness_score)- np.array(llm_response_conciseness_score)))

0.25

This shows that the 'response conciseness' metric is 75% accurate in comparison to human evaluations. 
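
An aggregate number does not say which examples drive the gap. As a rough sketch, the hypothetical helper below ranks rows by the absolute difference between the human and UpTrain scores so that the largest disagreements can be inspected by hand; the toy values are invented for illustration.

```python
def largest_disagreements(human_scores, llm_scores, top_k=3):
    """Return [(row_index, abs_difference)] for the top_k largest gaps."""
    diffs = [
        (abs(h - l), i)
        for i, (h, l) in enumerate(zip(human_scores, llm_scores))
    ]
    # Largest gap first; ties are broken by the lower row index.
    diffs.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, d) for d, i in diffs[:top_k]]


# Toy scores for illustration only.
toy_human = [1.0, 0.0, 0.5, 1.0]
toy_llm = [1.0, 1.0, 0.5, 0.5]
print(largest_disagreements(toy_human, toy_llm, top_k=2))  # [(1, 1.0), (3, 0.5)]
```

The returned row indices can then be used with the `dataset[i]['question'][0]` pattern from the cells above to read the corresponding question, response, and explanation.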

## Response Match  


In [72]:
print('Response:', dataset[19]['response'][0])
print('Ground Truth:', dataset[19]['ground_truth'][0])

Response: 
Based on the information provided in the context, the state tax issues for a New Jersey resident with DC tax withheld and likely refunded are as follows:

1. The individual may file an extension request with the NJ Division of Taxation until they receive an answer from DC regarding the refund. Alternatively, they can file without claiming the credit and amend their return later if DC refuses to refund.

2. It is unlikely that DC will consider the individual as a resident for staying a couple of months in hotels over the year.


Ground Truth: If you already filed the DC return, you can try and wait with filing the NJ return until you get the answer from DC.  You can file an extension request with the NJ division of taxation here. Or, you can file without claiming the credit, and worst case amend later and claim it if DC refuse to refund. I find it highly unlikely that DC will decide that a person staying for a couple of months over the year in hotels will count as a resident.

In [74]:
print(dataset[19]['score_response_match'][0])

1.0


In this example, the response has the same meaning as the ground truth. The two texts differ in wording but convey the same information.

### Comparative Analysis against human judgements 

In [85]:
human_response_match_score = list(dataset['human_score_response_match'])
llm_response_match_score = list(dataset['score_response_match'])

np.mean(np.abs(np.array(human_response_match_score)- np.array(llm_response_match_score)))

0.23558952039597197

This shows that the 'response matching' metric is about 76% accurate in comparison to human evaluations.

## Factual Accuracy


In [78]:
print('Question:',dataset[25]['question'][0])
print('Response:',dataset[25]['response'][0])
print('Context:',dataset[25]['context'][0])

Question: Lend money at a rate linked to the prime rate
Response: The prime rate is the interest rate at which banks lend to consumers and businesses with the highest credit ratings.
Context: The prime rate is the interest rate banks use amongst themselves to lend money to each other only. It is used as the basis (sometimes) for what interest rate banks charge you.  The prime rate is based loosely on the Fed rate.  There is a committee that meets regularly to set this and other industry interest rates. http://en.wikipedia.org/wiki/Prime_rate I am not 100% positive the following is totally accurate The banks keep our deposits and pay us interest for doing so.  They are paying us interest because they take yours, mine and everybody elses deposits as a large lump sum and invest that money.  Sometimes as business loans, sometimes as mortgages and sometimes as credit card. The banks have a book of business that will be EXACTLY how much credit they have extended to everybody.  But they do no

In [79]:
print(dataset[25]['score_factual_accuracy'][0])
print(dataset[25]['explanation_factual_accuracy'][0])

0.0
1. The prime rate is the interest rate at which banks lend to consumers and businesses.
Argument for yes: The context mentions that the prime rate is the interest rate banks charge consumers and businesses for loans.
Argument for no: The context states that the prime rate is the interest rate banks use amongst themselves to lend money to each other only, not to consumers and businesses.
Judgement: no. Argument for no looks stronger.
2. The prime rate is for borrowers with the highest credit ratings.
Argument for yes: The context mentions that if you have a high credit rating, the bank might give you a personal loan at the prime rate.
Argument for no: The context does not explicitly state that the prime rate is specifically for borrowers with the highest credit ratings.
Judgement: no. Argument for no looks stronger.



In this example, the response makes claims that the context does not support (hallucinations), which is why the 'factual accuracy' score is low.

In [80]:
print('Question:', dataset[12]['question'][0])
print('Response:', dataset[12]['response'][0])
print('Context:', dataset[12]['context'][0])


Question: Should I worry too much about saving my 20% down before buying my first house?
Response: Based on the information in the context, it seems that there are varying opinions on whether you should worry too much about saving your 20% down before buying your first house. Some factors to consider include the need to avoid paying private mortgage insurance (PMI) and having liquidity for emergency funds or other investments. Additionally, having funds set aside for repairs may be a good idea. It is also mentioned that the current low interest rates might make it more sensible to put money into a different investment vehicle and that closing costs should be taken into account. Ultimately, the decision should depend on your personal financial situation, including factors like your monthly payments, interest rates, retirement savings, other debt, and the possibility of changes in life circumstances.
Context: The primary reason to put 20% down on your home is to avoid paying PMI (private

In [81]:
print(dataset[12]['score_factual_accuracy'][0])
print(dataset[12]['explanation_factual_accuracy'][0])

1.0
1. There are varying opinions on whether you should worry too much about saving your 20% down before buying your first house.
Argument for yes: The context mentions that some people are saying that the traditional 20% down is the way to go, implying that there are varying opinions on this matter.
Argument for no: No arguments.
Judgement: yes. Argument for yes looks stronger.

2. Factors to consider include the need to avoid paying private mortgage insurance (PMI) and having liquidity for emergency funds or other investments.
Argument for yes: The context explicitly states that having liquidity for emergency funds or other investments is a factor to consider.
Argument for no: No arguments.
Judgement: yes. Argument for yes looks stronger.

3. Having funds set aside for repairs may be a good idea.
Argument for yes: The context mentions that depending on whether the house is move-in ready or a fixer-upper, having funds set aside for repairs is a good idea.
Argument for no: No arguments

All the facts contained in the response are true and can be verified from the given context.

### Comparative Analysis against human judgements 

In [83]:
human_factual_accuracy_score = list(dataset['human_score_factual_accuracy'])
llm_factual_accuracy_score = list(dataset['score_factual_accuracy'])

np.mean(np.abs(np.array(human_factual_accuracy_score)- np.array(llm_factual_accuracy_score)))

0.12333333333333334

This shows that the 'factual accuracy' metric is 88% accurate in comparison to human evaluations. 

## Response Completeness with respect to Context

In [49]:
print('Question:',dataset[16]['question'][0])
print('Response:',dataset[16]['response'][0])
print('Context:',dataset[16]['context'][0])


Question: Can I use my long position stocks as margin for my short sold stocks?
Response: Yes, you can use your long position stocks as margin for your short sold stocks. You would need to transfer the long position to the same account as your short position and make sure your broker covers the short. It is important to note that short selling comes with the risk of losing an unlimited amount of money if the trade goes against you.
Context: Yes, you call the broker and tell him to use those shares to deliver to the short position.
Yes you can.  This is known as a short selling against the box.  In the old days, this was used to delay a taxable event.  You could lock in a gain without triggering a taxable event.  Any loss on one side of the box would be offset by a loss on the other side, and vice versa.  However, the IRS clamped down on this, and you will realize the gain on your long position as soon as you go short on the other side.  See http://www.investopedia.com/terms/s/sellagain

In [50]:
print(dataset[16]['score_response_completeness_wrt_context'][0])
print(dataset[16]['explanation_response_completeness_wrt_context'][0])

1.0
1. The question asks if the user can use their long position stocks as margin for their short sold stocks.
2. The context provides relevant information about short selling against the box and how to initiate the short cover.
3. The generated response acknowledges that the user can use their long position stocks as margin for their short sold stocks.
4. The generated response also includes the information from the context about transferring the long position to the same account as the short position and ensuring the broker covers the short.
5. The generated response mentions the risk of losing an unlimited amount of money if the trade goes against the user, which is relevant information.

The generated answer answers the given question adequately by extracting all the relevant information from the context and providing a clear response.

1.0
1.0


In this example, the response includes all the relevant information from the given context that is needed to answer the question.

### Comparative Analysis against human judgements 

In [104]:
human_response_completeness_wrt_context_score = list(dataset['human_score_response_completeness_wrt_context'])
llm_response_completeness_wrt_context_score = list(dataset['score_response_completeness_wrt_context'])
np.mean(np.abs(np.array(human_response_completeness_wrt_context_score)- np.array(llm_response_completeness_wrt_context_score)))

0.27374


This shows that the 'response completeness wrt context' metric is 73% accurate in comparison to human evaluations. 

## Response Relevance

In [54]:
print('Question:', dataset[19]['question'][0])
print('Response:', dataset[19]['response'][0])

Question: State tax issues for NJ resident with DC tax withheld, and likely refunded
Response: 
Based on the information provided in the context, the state tax issues for a New Jersey resident with DC tax withheld and likely refunded are as follows:

1. The individual may file an extension request with the NJ Division of Taxation until they receive an answer from DC regarding the refund. Alternatively, they can file without claiming the credit and amend their return later if DC refuses to refund.

2. It is unlikely that DC will consider the individual as a resident for staying a couple of months in hotels over the year.




In [55]:
print(dataset[19]['score_response_relevance'][0])
print(dataset[19]['explanation_response_relevance'][0])

1.0
1. The question asks for the state tax issues for a New Jersey resident with DC tax withheld and likely refunded.
2. The response provides two points related to the state tax issues for the given scenario.
3. Both points in the response are directly relevant to the question and provide information about the actions the individual can take regarding their tax situation.
4. There is no additional irrelevant information in the response.
5. Therefore, the generated answer has no additional irrelevant information.

Score: 1.0
1.0
1. The question asks for the state tax issues for a New Jersey resident with DC tax withheld and likely refunded.
2. The response provides two state tax issues that are relevant to the situation described in the question.
3. The first issue mentioned is that the individual may file an extension request with the NJ Division of Taxation until they receive an answer from DC regarding the refund. This addresses the potential delay in receiving the refund and provid

In this example, the response contains the information required to answer the question and no irrelevant information.

### Comparative Analysis against human judgements 

In [15]:
human_response_relevance_score = list(dataset['human_score_response_relevance'])
llm_response_relevance_score = list(dataset['score_response_relevance'])
np.mean(np.abs(np.array(human_response_relevance_score)- np.array(llm_response_relevance_score)))

0.2622222222222222

This shows that the 'response relevancy' metric is 74% accurate in comparison to human evaluations. 
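
To compare the six evals side by side, the mean absolute differences printed in the cells above can be collected in one place. The sketch below simply copies those six numbers from the notebook outputs and converts each to a rounded agreement percentage (1 minus the difference); the dictionary keys are illustrative labels, not UpTrain identifiers.

```python
# Mean absolute differences copied from the notebook outputs above.
reported_mad = {
    "context_relevance": 0.15,
    "response_conciseness": 0.25,
    "response_match": 0.23558952039597197,
    "factual_accuracy": 0.12333333333333334,
    "response_completeness_wrt_context": 0.27374,
    "response_relevance": 0.2622222222222222,
}

# Agreement with human scores, as a whole-number percentage.
agreement_pct = {
    name: round((1.0 - mad) * 100) for name, mad in reported_mad.items()
}

# Print from highest to lowest agreement.
for name, pct in sorted(agreement_pct.items(), key=lambda item: -item[1]):
    print(f"{name:<36} {pct}%")
```

On these numbers, factual accuracy and context relevance track human judgement most closely, while response completeness with respect to context shows the largest gap.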