# Benchmark GPT-3.5 and GPT-4-turbo on Kaggle LLM Science Exam

In [1]:
# !pip install -q openai kaggle pandas python-dotenv

In [40]:
import requests, json
import openai
import pandas as pd
import warnings

from IPython.display import display, HTML

In [3]:
from dotenv import load_dotenv
load_dotenv()

True

In [12]:
import os
import subprocess
from pathlib import Path
import zipfile

COMPETITION='kaggle-llm-science-exam'

# adjust the data directory as needed
data_dir = Path(os.getcwd()) / 'data'
data_dir.mkdir(exist_ok=True)

file_path = data_dir / f'{COMPETITION}.zip'

if not os.path.exists(file_path):
    # download dataset
    subprocess.run(['kaggle', 'competitions', 'download', '-p', data_dir, '-c', COMPETITION], check=True)
    # subprocess.run(['unzip', 'kaggle-llm-science-exam.zip'], check=True)
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        zip_ref.extractall(data_dir)


Downloading kaggle-llm-science-exam.zip to /home/dom/projects/ai/llms/kaggle-science-exam/data



100%|██████████| 72.5k/72.5k [00:00<00:00, 686kB/s]


In [14]:
train = pd.read_csv(data_dir / 'train.csv')
train.head()

Unnamed: 0,id,prompt,A,B,C,D,E,answer
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D


In [26]:
row = train.iloc[0]
row

id                                                        0
prompt    Which of the following statements accurately d...
A         MOND is a theory that reduces the observed mis...
B         MOND is a theory that increases the discrepanc...
C         MOND is a theory that explains the missing bar...
D         MOND is a theory that reduces the discrepancy ...
E         MOND is a theory that eliminates the observed ...
answer                                                    D
Name: 0, dtype: object

## Experiment with a row and GPT-3.5

In [32]:
row = train.iloc[0]

system_message = 'Answer the following multiple-choice question by providing the three most likely answers, in order of likelihood, specified by their letter followed by a space, e.g. A B C'
user_prompt = f'Question: {row.prompt}\n'
user_prompt += '\n'.join([f'{letter}) {row[letter]}' for letter in 'ABCDE'])
user_prompt += '\n\nProvide the answer in the form of the three letter choices only, without explanation, e.g. A D B\n'
print(system_message)
print(user_prompt)


Answer the following multiple-choice question by providing the three most likely answers, in order of likelihood, specified by their letter followed by a space, e.g. A B C
Question: Which of the following statements accurately describes the impact of Modified Newtonian Dynamics (MOND) on the observed "missing baryonic mass" discrepancy in galaxy clusters?
A) MOND is a theory that reduces the observed missing baryonic mass in galaxy clusters by postulating the existence of a new form of matter called "fuzzy dark matter."
B) MOND is a theory that increases the discrepancy between the observed missing baryonic mass in galaxy clusters and the measured velocity dispersions from a factor of around 10 to a factor of about 20.
C) MOND is a theory that explains the missing baryonic mass in galaxy clusters that was previously considered dark matter by demonstrating that the mass is in the form of neutrinos and axions.
D) MOND is a theory that reduces the discrepancy between the observed missing 
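
The prompt asks for three ranked letters, so a reply can be scored with average precision at 3, whose mean over all questions (MAP@3) is the metric the competition uses (worth confirming on the competition's evaluation page). The following is a minimal sketch, not part of the original notebook: `parse_choices` and `average_precision_at_k` are hypothetical helper names, and the parser assumes the model follows the instruction and replies with bare uppercase letters such as `A D B`.

```python
import re


def parse_choices(reply, valid="ABCDE", k=3):
    """Extract up to k distinct answer letters from a reply, keeping their order."""
    # Only standalone uppercase letters are matched, so the lowercase article "a" is ignored.
    letters = re.findall(rf"\b[{valid}]\b", reply)
    ranked = []
    for letter in letters:
        if letter not in ranked:
            ranked.append(letter)
    return ranked[:k]


def average_precision_at_k(ranked, correct, k=3):
    """With one correct label, AP@k is 1/rank if the label is in the top k, else 0."""
    for rank, letter in enumerate(ranked[:k], start=1):
        if letter == correct:
            return 1.0 / rank
    return 0.0


ranked = parse_choices("A D B")
print(ranked, average_precision_at_k(ranked, "D"))  # ['A', 'D', 'B'] 0.5
```

Averaging `average_precision_at_k` over every row of `train` would give the MAP@3 score for a model. Replies that contain explanations or sentences starting with "A" would need a stricter parser.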

In [41]:
# prepare the messages array
def create_messages_input(system_message, user_message, existing_messages=None):
    """
    Create a list of messages for the OpenAI Chat Completions API.

    :param system_message: A string containing the system message.
    :param user_message: A string containing the user's message.
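    :param existing_messages: An optional list of existing messages; it is copied, not modified in place.
    :return: A new list of messages including the system message, any existing messages, and the new user message.
    """
    # Copy the history so the caller's list is not mutated; an empty or missing
    # history gets a placeholder slot for the system message
    messages = list(existing_messages) if existing_messages else [None]
    # Always start with the system message (this overwrites whatever is in slot 0)
    messages[0] = {"role": "system", "content": system_message}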
    # Append the new user message
    messages.append({"role": "user", "content": user_message})

    return messages

In [42]:
# sanity-check create_messages_input (no API call is made here)
messages = create_messages_input('test', 'user test')
messages = create_messages_input('test', 'user test3', messages)
create_messages_input('test', 'user test4', messages)

[{'role': 'system', 'content': 'test'},
 {'role': 'user', 'content': 'user test'},
 {'role': 'user', 'content': 'user test3'},
 {'role': 'user', 'content': 'user test4'}]
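
The check above only appends user messages. For a real multi-turn exchange, the Chat Completions API expects the model's earlier replies to appear in the history with the role `assistant`. A minimal sketch of extending a history that way, assuming a hypothetical helper name `add_assistant_turn` that is not part of the original notebook:

```python
def add_assistant_turn(messages, assistant_reply, next_user_message):
    """Return a new history with the model's reply and a follow-up user message."""
    return [
        *messages,  # unpacking builds a new list, so the input history is untouched
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]


history = [
    {"role": "system", "content": "test"},
    {"role": "user", "content": "user test"},
]
extended = add_assistant_turn(history, "A D B", "Explain your first choice.")
print([m["role"] for m in extended])  # ['system', 'user', 'assistant', 'user']
```

Returning a new list rather than appending in place keeps earlier histories reusable when the same question is sent to several models.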

In [43]:
GPT_MODEL='gpt-3.5-turbo-1106'
# GPT_MODEL='gpt-4-1106-preview' # gpt-4 turbo preview

In [54]:
client = openai.OpenAI()
# client = openai.AsyncClient()

In [61]:
import html

def get_chat_response(system_message, user_message, existing_messages=None, seed=None):
    messages = create_messages_input(system_message, user_message, existing_messages)
    try:
        response = client.chat.completions.create(
            model=GPT_MODEL,
            messages=messages,
            seed=seed,
            max_tokens=200,
            temperature=0.7,
        )

        if response.choices[0].finish_reason == 'length':
            warnings.warn('Reached max tokens; the response is truncated')

        response_content = response.choices[0].message.content
        system_fingerprint = response.system_fingerprint
        prompt_tokens = response.usage.prompt_tokens
        completion_tokens = response.usage.completion_tokens

        # Escape the model output so stray '<' or '&' characters cannot break the table
        table = f"""
        <table>
        <tr><th>Response</th><td>{html.escape(str(response_content))}</td></tr>
        <tr><th>System Fingerprint</th><td>{system_fingerprint}</td></tr>
        <tr><th>Number of prompt tokens</th><td>{prompt_tokens}</td></tr>
        <tr><th>Number of completion tokens</th><td>{completion_tokens}</td></tr>
        </table>
        """
        display(HTML(table))

    except Exception as e:
        # `response` may be unbound here if the request itself failed,
        # so signal the failure with None instead of returning it
        print(f'An error occurred: {e}')
        return None
    return response
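
The helper above returns free text, which then has to be parsed. JSON mode (`response_format={"type": "json_object"}`, supported by the `-1106` models) is one way to get a machine-readable answer instead. The sketch below is an assumption-laden illustration, not the notebook's method: `request_ranked_answers`, the `"answers"` key, and the stub client are hypothetical, and the stub returns a canned reply so the example runs without an API key.

```python
import json
from types import SimpleNamespace


def request_ranked_answers(client, model, question, seed=None):
    """Ask for the three most likely letters as JSON and return them as a list."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            # JSON mode requires the word "JSON" to appear somewhere in the messages.
            {"role": "system",
             "content": 'Reply in JSON as {"answers": ["A", "B", "C"]}, most likely first.'},
            {"role": "user", "content": question},
        ],
        response_format={"type": "json_object"},
        seed=seed,
        temperature=0,  # assumption: low randomness is preferable for benchmarking
        max_tokens=50,
    )
    payload = json.loads(response.choices[0].message.content)
    return payload["answers"][:3]


class _FakeCompletions:
    """Offline stand-in for client.chat.completions; records the request kwargs."""

    def create(self, **kwargs):
        self.last_kwargs = kwargs
        message = SimpleNamespace(content='{"answers": ["D", "A", "B"]}')
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_completions = _FakeCompletions()
fake_client = SimpleNamespace(chat=SimpleNamespace(completions=fake_completions))
print(request_ranked_answers(fake_client, "gpt-3.5-turbo-1106", "Question: ..."))
# ['D', 'A', 'B']
```

With a real `openai.OpenAI()` client passed in place of `fake_client`, the same function would issue an actual request. The model may still return a different key or fewer than three letters, so the result should be validated before scoring.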

In [62]:
# TODO use JSON return in a useful format https://platform.openai.com/docs/guides/text-generation/json-mode
# a separate name keeps the exam `system_message` defined earlier intact
story_system_message = 'You are a helpful assistant that generates short stories about science fiction.'
user_message = 'Generate a story about a bladerunner'
response = get_chat_response(story_system_message, user_message)
response



Response,"In a futuristic city, where artificial intelligence and robotics have become integral parts of society, a skilled bladerunner named Alex is tasked with hunting down rogue androids known as ""replicants."" These replicants were originally designed to serve humanity, but some have become dangerously self-aware and rebellious. Alex is a veteran bladerunner, known for his relentless pursuit and unwavering dedication to his job. His latest assignment leads him to a group of renegade replicants hiding in the city's underground, led by a charismatic and enigmatic leader named Eve. As Alex delves deeper into the case, he begins to question the morality of his mission and the true nature of the replicants. Eve and her followers are not merely seeking freedom; they are fighting for their right to exist as sentient beings. As Alex confronts them, he is forced to confront his own beliefs and prejudices about artificial life. Through a series of intense encounters and thought-provoking conversations, Alex begins to see the"
System Fingerprint,fp_eeff13170a
Number of prompt tokens,32
Number of completion tokens,200


ChatCompletion(id='chatcmpl-8O8pGZgDfsDmxgeBDPSwWHQgG4mkZ', choices=[Choice(finish_reason='length', index=0, message=ChatCompletionMessage(content='In a futuristic city, where artificial intelligence and robotics have become integral parts of society, a skilled bladerunner named Alex is tasked with hunting down rogue androids known as "replicants." These replicants were originally designed to serve humanity, but some have become dangerously self-aware and rebellious.\n\nAlex is a veteran bladerunner, known for his relentless pursuit and unwavering dedication to his job. His latest assignment leads him to a group of renegade replicants hiding in the city\'s underground, led by a charismatic and enigmatic leader named Eve. As Alex delves deeper into the case, he begins to question the morality of his mission and the true nature of the replicants.\n\nEve and her followers are not merely seeking freedom; they are fighting for their right to exist as sentient beings. As Alex confronts them,