<a href="https://colab.research.google.com/github/uptrain-ai/uptrain/blob/main/examples/benchmarks/claude_3_vs_gpt_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center">
  <a href="https://uptrain.ai">
    <img width="300" src="https://user-images.githubusercontent.com/108270398/214240695-4f958b76-c993-4ddd-8de6-8668f4d0da84.png" alt="uptrain">
  </a>
</h1>

# Claude 3 vs GPT-4
Claude 3 was recently launched by Anthropic as a competitor to OpenAI's GPT-4. In this notebook, we will compare the two models to see if you should make the switch from GPT-4 to Claude 3.

To do this comparison, we will use UpTrain's Response Matching operator. This operator takes two values, response and ground_truth, and returns a score between 0 and 1. The score is 1 if the response is very similar to the ground_truth and 0 if the response is completely different from the ground_truth.
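
To build intuition for what a 0-to-1 matching score looks like, here is a minimal sketch that uses token-overlap F1 as a stand-in. This is not how UpTrain's Response Matching works (the notebook uses its LLM-based method); `token_f1` is a hypothetical helper introduced only for illustration.

```python
from collections import Counter


def token_f1(response: str, ground_truth: str) -> float:
    """Token-overlap F1 in [0, 1].

    A crude lexical stand-in for a matching score, NOT UpTrain's
    LLM-based Response Matching.
    """
    resp_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not resp_tokens or not truth_tokens:
        return 0.0

    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(resp_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(resp_tokens)   # share of the response found in the ground truth
    recall = overlap / len(truth_tokens)     # share of the ground truth covered by the response
    return 2 * precision * recall / (precision + recall)
```

Identical texts score 1 and texts with no shared tokens score 0, which mirrors the range described above. An LLM-based method can also credit paraphrases that share few tokens, which this lexical sketch cannot.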

We have curated a dataset of 25 question-context pairs. For each question, we will get responses from both GPT-4 and Claude-3-Opus. We will treat the GPT-4 response as the ground_truth and compare the Claude-3-Opus response against it using the Response Matching operator. We will then repeat the comparison with GPT-3.5-Turbo as the ground_truth and Claude-3-Sonnet as the response.

# Import the required libraries

In [1]:
from uptrain import Settings
from uptrain.operators import TextCompletion, JsonReader

import os
import polars as pl
import nest_asyncio
nest_asyncio.apply()



# Download the dataset

In [2]:
url = "https://uptrain-assets.s3.ap-south-1.amazonaws.com/data/uptrain_benchmark.jsonl"
dataset_path = os.path.join('./', "uptrain_benchmark.jsonl")

if not os.path.exists(dataset_path):
    import httpx
    r = httpx.get(url)
    r.raise_for_status()  # fail loudly instead of saving an error page as the dataset
    with open(dataset_path, "wb") as f:
        f.write(r.content)

dataset = pl.read_ndjson(dataset_path)
print(dataset)

shape: (25, 3)
┌───────────────────────────────────┬───────────────────────────────────┬─────┐
│ question                          ┆ context                           ┆ idx │
│ ---                               ┆ ---                               ┆ --- │
│ str                               ┆ str                               ┆ i64 │
╞═══════════════════════════════════╪═══════════════════════════════════╪═════╡
│ How to get a grip on finance?'    ┆ Try downloading a finance app li… ┆ 1   │
│ How do “held” amounts appear on … ┆ "The ""hold"" is just placeholde… ┆ 2   │
│ Does negative P/E ratio mean sto… ┆ P/E is the number of years it wo… ┆ 3   │
│ Should a retail trader choose a … ┆ "That\'s like a car dealer adver… ┆ 4   │
│ Possibility to buy index funds a… ┆ "As user quid states in his answ… ┆ 5   │
│ …                                 ┆ …                                 ┆ …   │
│ Discuss the role of inflation in… ┆ Inflation is a pervasive economi… ┆ 21  │
│ Explain the concept of 

# Get responses from Claude-3-Opus

In [3]:
dataset_path="./uptrain_benchmark.jsonl"
claude_settings = Settings(model="claude-3-opus-20240229", rpm_limit=4)
dataset = JsonReader(fpath=dataset_path).setup(settings=claude_settings).run()["output"]

dataset = dataset.with_columns([pl.lit("claude-3-opus-20240229").alias("model")])
dataset_with_claude_responses = TextCompletion(col_in_prompt="question", col_out_completion="claude_3_opus_response").setup(settings=claude_settings).run(dataset)["output"]
dataset_with_claude_responses

  0%|          | 0/25 [00:00<?, ?it/s]

100%|██████████| 25/25 [05:33<00:00, 13.32s/it]


question,context,idx,model,claude_3_opus_response
str,str,i64,str,str
"""How to get a g…","""Try downloadin…",1,"""claude-3-opus-…","""Getting a grip…"
"""How do “held” …","""""The """"hold"""" …",2,"""claude-3-opus-…","""When a credit …"
"""Does negative …","""P/E is the num…",3,"""claude-3-opus-…","""A negative P/E…"
"""Should a retai…","""""That\'s like …",4,"""claude-3-opus-…","""The decision t…"
"""Possibility to…","""""As user quid …",5,"""claude-3-opus-…","""Yes, it is pos…"
…,…,…,…,…
"""Discuss the ro…","""Inflation is a…",21,"""claude-3-opus-…","""Inflation is a…"
"""Explain the co…",""" The Earth's …",22,"""claude-3-opus-…","""Plate tectonic…"
"""How did the su…",""" The Surreal…",23,"""claude-3-opus-…","""The Surrealist…"
"""Discuss the im…",""" Globalizatio…",24,"""claude-3-opus-…","""Globalization …"



# Get Responses from GPT-4

In [5]:
gpt_settings = Settings(model="gpt-4", rpm_limit=100)
dataset = dataset_with_claude_responses.with_columns([pl.lit("gpt-4").alias("model")])
experiment_dataset = TextCompletion(col_in_prompt="question", col_out_completion="gpt_4_response").setup(settings=gpt_settings).run(dataset)["output"]
experiment_dataset

  0%|          | 0/25 [00:00<?, ?it/s]

100%|██████████| 25/25 [00:32<00:00,  1.32s/it]


question,context,idx,model,claude_3_opus_response,gpt_4_response
str,str,i64,str,str,str
"""How to get a g…","""Try downloadin…",1,"""gpt-4""","""Getting a grip…","""1. Take online…"
"""How do “held” …","""""The """"hold"""" …",2,"""gpt-4""","""When a credit …","""""Held"" amounts…"
"""Does negative …","""P/E is the num…",3,"""gpt-4""","""A negative P/E…","""A negative P/E…"
"""Should a retai…","""""That\'s like …",4,"""gpt-4""","""The decision t…","""Whether a reta…"
"""Possibility to…","""""As user quid …",5,"""gpt-4""","""Yes, it is pos…","""Yes, it is pos…"
…,…,…,…,…,…
"""Discuss the ro…","""Inflation is a…",21,"""gpt-4""","""Inflation is a…","""Inflation is a…"
"""Explain the co…",""" The Earth's …",22,"""gpt-4""","""Plate tectonic…","""Plate tectonic…"
"""How did the su…",""" The Surreal…",23,"""gpt-4""","""The Surrealist…","""Surrealism tre…"
"""Discuss the im…",""" Globalizatio…",24,"""gpt-4""","""Globalization …","""Globalization …"


# Use the Response Matching operator to get the scores

In [6]:
from uptrain import EvalLLM, ResponseMatching

settings = Settings(evaluate_locally=False)

# Drop the "context" and "model" columns as they are not needed for the evaluation
experiment_dataset = experiment_dataset.drop(["context", "model"])

eval_llm = EvalLLM(settings=settings)
results = eval_llm.evaluate(
    data=experiment_dataset,
    checks=[
        ResponseMatching(
            method="llm",
        )
    ],
    schema={
        "question": "question",
        "response": "claude_3_opus_response",
        "ground_truth": "gpt_4_response",
    }
)

2024-03-07 10:44:05.173 | INFO     | uptrain.framework.evalllm:evaluate_on_server:341 - Sending evaluation request for rows 0 to <50 to the Uptrain

2024-03-07 10:44:30.283 | INFO     | uptrain.framework.evalllm:evaluate:330 - Server is not running!


# Analysis

Let's look at an example and see how closely the two responses match.

In [7]:
row = results[0]
print("Question:", row["question"])

Question: How to get a grip on finance?'


In [8]:
print("GPT-4 Response:\n\n")
print(row["gpt_4_response"])

GPT-4 Response:


1. Take online courses and Workshops: Several online platforms such as Coursera, Udemy, and Khan Academy offer introductory courses on finance. 

2. Read Books: Reading books can give you a profound understanding of finance. Some popular books include: "The Intelligent Investor" by Benjamin Graham, "Common Stocks and Uncommon Profits" by Philip Fisher and "Thinking, Fast and Slow" by Daniel Kahneman.

3. Attend Seminars: Attending seminars and workshops can provide first-hand knowledge as well as an opportunity to interact with industry professionals.

4. Networking: Joining a local finance or investment club can provide opportunities for learning from others' experiences.

5. Use Finance Apps: Personal finance apps such as Mint and PocketGuard can help keep track of individual income and spending and create budgets.

6. Follow Finance Blogs and Websites: Websites such as Investopedia can provide helpful articles, glossaries, and tutorials on various finance topics.



In [9]:
print("Claude-3-Opus Response:\n\n")
print(row["claude_3_opus_response"])

Claude-3-Opus Response:


Getting a grip on your finances involves several steps:

1. Track your income and expenses: Use a budgeting app or spreadsheet to monitor your cash flow. This will help you understand where your money is going and identify areas where you can cut back.

2. Create a budget: Based on your income and expenses, create a realistic budget that allocates your money towards essential expenses, savings, and discretionary spending.

3. Set financial goals: Establish short-term and long-term financial goals, such as paying off debt, saving for emergencies, or planning for retirement.

4. Pay off high-interest debt: Prioritize paying off high-interest debt, like credit card balances, to reduce the amount of interest you pay over time.

5. Build an emergency fund: Aim to save enough money to cover 3-6 months' worth of expenses in case of unexpected events like job loss or medical emergencies.

6. Save and invest regularly: Automate your savings and investments to build wea

In [10]:
print("Response Matching Score:", row["score_response_match"])

Response Matching Score: 0.9729729729729729
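
A single row says little about how closely the two models agree overall. The sketch below summarizes the scores across all rows. It assumes `results` is a list of dicts with a `score_response_match` key, as the cells above suggest; `summarize_scores` and `sample_rows` are hypothetical names, and the sample values are made up for illustration.

```python
from statistics import mean, median


def summarize_scores(rows, score_key="score_response_match"):
    """Return count, mean, median and minimum of the available scores."""
    # Skip rows where the evaluation did not return a score.
    scores = [row[score_key] for row in rows if row.get(score_key) is not None]
    if not scores:
        return {"count": 0, "mean": None, "median": None, "min": None}
    return {
        "count": len(scores),
        "mean": mean(scores),
        "median": median(scores),
        "min": min(scores),
    }


# Hypothetical rows standing in for `results`; not real benchmark output.
sample_rows = [
    {"score_response_match": 1.0},
    {"score_response_match": 0.5},
    {"score_response_match": None},
]
summary = summarize_scores(sample_rows)
print(summary)
```

In the notebook, calling `summarize_scores(results)` would give the same summary for the actual evaluation output, provided the assumption about its shape holds.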


# Now let's do the same to compare Claude-3 Sonnet and GPT-3.5-Turbo

## Get responses from Claude-3-Sonnet

In [11]:
dataset_path="./uptrain_benchmark.jsonl"
claude_settings = Settings(model="claude-3-sonnet-20240229", rpm_limit=4)
dataset = JsonReader(fpath=dataset_path).setup(settings=claude_settings).run()["output"]

dataset = dataset.with_columns([pl.lit("claude-3-sonnet-20240229").alias("model")])
dataset_with_claude_responses = TextCompletion(col_in_prompt="question", col_out_completion="claude_3_sonnet_response").setup(settings=claude_settings).run(dataset)["output"]
dataset_with_claude_responses

  0%|          | 0/25 [00:00<?, ?it/s]

100%|██████████| 25/25 [05:21<00:00, 12.86s/it]


question,context,idx,model,claude_3_sonnet_response
str,str,i64,str,str
"""How to get a g…","""Try downloadin…",1,"""claude-3-sonne…","""Here are some …"
"""How do “held” …","""""The """"hold"""" …",2,"""claude-3-sonne…","""On traditional…"
"""Does negative …","""P/E is the num…",3,"""claude-3-sonne…","""A negative pri…"
"""Should a retai…","""""That\'s like …",4,"""claude-3-sonne…","""The decision t…"
"""Possibility to…","""""As user quid …",5,"""claude-3-sonne…","""In Canada, you…"
…,…,…,…,…
"""Discuss the ro…","""Inflation is a…",21,"""claude-3-sonne…","""Inflation play…"
"""Explain the co…",""" The Earth's …",22,"""claude-3-sonne…","""Plate tectonic…"
"""How did the su…",""" The Surreal…",23,"""claude-3-sonne…","""The surrealist…"
"""Discuss the im…",""" Globalizatio…",24,"""claude-3-sonne…","""Globalization …"



# Get Responses from GPT-3.5-Turbo

In [13]:
gpt_settings = Settings(model="gpt-3.5-turbo", rpm_limit=100)
dataset = dataset_with_claude_responses.with_columns([pl.lit("gpt-3.5-turbo").alias("model")])
experiment_dataset = TextCompletion(col_in_prompt="question", col_out_completion="gpt_35_turbo_response").setup(settings=gpt_settings).run(dataset)["output"]
experiment_dataset

  0%|          | 0/25 [00:00<?, ?it/s]

100%|██████████| 25/25 [00:06<00:00,  3.76it/s]


question,context,idx,model,claude_3_sonnet_response,gpt_35_turbo_response
str,str,i64,str,str,str
"""How to get a g…","""Try downloadin…",1,"""gpt-3.5-turbo""","""Here are some …","""1. Set financi…"
"""How do “held” …","""""The """"hold"""" …",2,"""gpt-3.5-turbo""","""On traditional…","""""Held"" amounts…"
"""Does negative …","""P/E is the num…",3,"""gpt-3.5-turbo""","""A negative pri…","""A negative P/E…"
"""Should a retai…","""""That\'s like …",4,"""gpt-3.5-turbo""","""The decision t…","""It ultimately …"
"""Possibility to…","""""As user quid …",5,"""gpt-3.5-turbo""","""In Canada, you…","""In Canada, it …"
…,…,…,…,…,…
"""Discuss the ro…","""Inflation is a…",21,"""gpt-3.5-turbo""","""Inflation play…","""Inflation is t…"
"""Explain the co…",""" The Earth's …",22,"""gpt-3.5-turbo""","""Plate tectonic…","""Plate tectonic…"
"""How did the su…",""" The Surreal…",23,"""gpt-3.5-turbo""","""The surrealist…","""The surrealist…"
"""Discuss the im…",""" Globalizatio…",24,"""gpt-3.5-turbo""","""Globalization …","""Globalization …"


# Use the Response Matching operator to get the scores

In [14]:
from uptrain import EvalLLM, ResponseMatching

settings = Settings(evaluate_locally=False)

# Drop the "context" and "model" columns as they are not needed for the evaluation
experiment_dataset = experiment_dataset.drop(["context", "model"])

eval_llm = EvalLLM(settings=settings)
results = eval_llm.evaluate(
    data=experiment_dataset,
    checks=[
        ResponseMatching(
            method="llm",
        )
    ],
    schema={
        "question": "question",
        "response": "claude_3_sonnet_response",
        "ground_truth": "gpt_35_turbo_response",
    }
)

2024-03-07 10:49:58.773 | INFO     | uptrain.framework.evalllm:evaluate_on_server:341 - Sending evaluation request for rows 0 to <50 to the Uptrain

2024-03-07 10:50:24.652 | INFO     | uptrain.framework.evalllm:evaluate:330 - Server is not running!


# Analysis

Let's look at an example and see how closely the two responses match.

In [19]:
row = results[1]
print("Question:", row["question"])

Question: How do “held” amounts appear on statements and affect balances of traditional credit cards?'


In [20]:
print("GPT-3.5-Turbo Response:\n\n")
print(row["gpt_35_turbo_response"])

GPT-3.5-Turbo Response:


"Held" amounts typically appear on credit card statements as pending charges or authorizations. These are temporary holds placed on the cardholder's account for a certain amount of money, such as when making a hotel reservation or renting a car. The held amount is not deducted from the available balance immediately but may affect the overall available credit on the card.

For traditional credit cards, these held amounts do not impact the current balance that is due for payment. However, they can affect the credit available to the cardholder if the held amount is close to or equal to the available credit limit. This can potentially limit the cardholder's ability to make additional purchases until the held amount is no longer pending.

It is important for cardholders to keep track of held amounts and understand how they can impact their available credit and spending ability. Held amounts will eventually be released and the actual charge will be posted to the acc

In [21]:
print("Claude-3-Sonnet Response:\n\n")
print(row["claude_3_sonnet_response"])

Claude-3-Sonnet Response:


On traditional credit card statements, any "held" amounts are typically shown separately from the current balance owed. Here's how they are displayed and affect balances:

1. Current Balance: This is the total amount you owe on your credit card as of the statement date. It includes all new charges, fees, interest charges, and any remaining balance from the previous statement that wasn't paid in full.

2. Held Amounts/Pending Transactions: Many credit card issuers will display a separate section or line item for "held" or "pending" amounts. These are transactions that have been authorized but not yet posted or settled to your account.

3. Available Credit: Your available credit is your total credit limit minus your current balance and any held amounts. The held amounts temporarily reduce your available credit even though they haven't been added to the current balance yet.

4. Impact on Balance: Held amounts do not directly affect your current balance on the s

In [22]:
print("Response Matching Score:", row["score_response_match"])

Response Matching Score: 0.9729729729729729
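
To decide which examples are worth reading in full, it can help to start with the rows where the two models disagree most. The sketch below picks the lowest-scoring rows. It assumes the same list-of-dicts shape for `results` as above; `lowest_scoring` and `sample_rows` are hypothetical, and the sample values are invented for illustration.

```python
def lowest_scoring(rows, k=3, score_key="score_response_match"):
    """Return the k rows with the lowest matching score, lowest first."""
    # Rows without a score cannot be ranked, so leave them out.
    scored = [row for row in rows if row.get(score_key) is not None]
    return sorted(scored, key=lambda row: row[score_key])[:k]


# Hypothetical rows standing in for `results`; not real benchmark output.
sample_rows = [
    {"question": "q1", "score_response_match": 0.9},
    {"question": "q2", "score_response_match": 0.4},
    {"question": "q3", "score_response_match": None},
    {"question": "q4", "score_response_match": 0.7},
]
worst = lowest_scoring(sample_rows, k=2)
for row in worst:
    print(row["question"], row["score_response_match"])
```

Reading the responses for these rows side by side shows where switching models would change the answers most.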
