# Kaggle LLM Science Exam Dataset Exploration and GPT Tests

- Manually get a feel for what the dataset is like and how GPT performs
- Experiment with different ways of using GPT (see the benchmark notebook for much more on this)

Mostly sourced from [Jeremy Howard's notebook](https://www.kaggle.com/code/jhoward/getting-started-with-llms) introducing the competition with GPT

In [1]:
# !pip install -Uq llm 'urllib3>2' fastkaggle

In [2]:
import time
import numpy as np
import pandas as pd
from fastcore.utils import *
from fastkaggle import *

In [3]:
import dotenv
dotenv.load_dotenv()

True

In [4]:
import os
import subprocess
from pathlib import Path
import zipfile

COMPETITION='kaggle-llm-science-exam'

# fix as needed
data_dir = Path(os.getcwd()) / 'data'
data_dir.mkdir(exist_ok=True)

file_path = data_dir / f'{COMPETITION}.zip'

if not os.path.exists(file_path):
    # download dataset
    subprocess.run(['kaggle', 'competitions', 'download', '-p', data_dir, '-c', COMPETITION], check=True)
    # subprocess.run(['unzip', 'kaggle-llm-science-exam.zip'], check=True)
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        zip_ref.extractall(data_dir)


In [5]:
trn = pd.read_csv(data_dir / 'train.csv')
trn

Unnamed: 0,id,prompt,A,B,C,D,E,answer
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D
...,...,...,...,...,...,...,...,...
195,195,What is the relation between the three moment ...,The three moment theorem expresses the relatio...,The three moment theorem is used to calculate ...,The three moment theorem describes the relatio...,The three moment theorem is used to calculate ...,The three moment theorem is used to derive the...,C
196,196,"What is the throttling process, and why is it ...",The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,The throttling process is a steady adiabatic f...,The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,B
197,197,What happens to excess base metal as a solutio...,"The excess base metal will often solidify, bec...",The excess base metal will often crystallize-o...,"The excess base metal will often dissolve, bec...","The excess base metal will often liquefy, beco...","The excess base metal will often evaporate, be...",B
198,198,"What is the relationship between mass, force, ...",Mass is a property that determines the weight ...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is a property that determines the size of...,D


## Answering the questions ourselves

Become the human exam taker to explore the data

In [6]:
# n_questions = 200
# for index, row in trn.head(n_questions).iterrows():
#     print(f"Question: {row.prompt}")
#     print(f"Option A: {row.A}")
#     print(f"Option B: {row.B}")
#     print(f"Option C: {row.C}")
#     print(f"Option D: {row.D}")
#     print(f"Option E: {row.E}", flush=True)
#     user_answer = input("Answer [A/B/C/D/E]: ")
#     if user_answer == row.answer:
#         print("Correct!")
#     else:
#         print("Incorrect! The correct answer is: ", row.answer)
#     print(flush=True)

## Using the OpenAI API

Skip the usual EDA, and do some natural language data exploration instead

In [6]:
def prompt1(row):
    return f"""Question: {row.prompt}
A: {row.A}
B: {row.B}
C: {row.C}
D: {row.D}
E: {row.E}
Answer: """

In [7]:
r0 = trn.iloc[0]
print(prompt1(r0), r0.answer)

Question: Which of the following statements accurately describes the impact of Modified Newtonian Dynamics (MOND) on the observed "missing baryonic mass" discrepancy in galaxy clusters?
A: MOND is a theory that reduces the observed missing baryonic mass in galaxy clusters by postulating the existence of a new form of matter called "fuzzy dark matter."
B: MOND is a theory that increases the discrepancy between the observed missing baryonic mass in galaxy clusters and the measured velocity dispersions from a factor of around 10 to a factor of about 20.
C: MOND is a theory that explains the missing baryonic mass in galaxy clusters that was previously considered dark matter by demonstrating that the mass is in the form of neutrinos and axions.
D: MOND is a theory that reduces the discrepancy between the observed missing baryonic mass in galaxy clusters and the measured velocity dispersions from a factor of around 10 to a factor of about 2.
E: MOND is a theory that eliminates the observed m

Go with GPT-3.5 Turbo initially

In [8]:
# latest model as of 26/11/2023
GPT_MODEL='gpt-3.5-turbo-1106'
cost_prompt_tokens = 0.001 / 1000
cost_completion_tokens = 0.002 / 1000

This test uses Simon Willison's `llm` CLI and Python library, which provides the same interface to API models and locally hosted open source models.

However, the installed version doesn't know about the latest GPT-3.5.

We can either register it programmatically or add it to a YAML file of extra OpenAI models. I've done the latter so it's also available via the CLI, but either works.

See [Adding more OpenAI models](https://llm.datasette.io/en/stable/other-models.html#adding-more-openai-models)

```yaml
- model_name: gpt-3.5-turbo-1106
  model_id: gpt-3.5-turbo-1106
  aliases: ["1106"]
```

In [9]:
import llm
# Current version doesn't know about the latest GPT-3.5 so add to aliases
model = llm.get_model(GPT_MODEL)

In [15]:
response = model.prompt(prompt1(r0), temperature=0.0)
response.text()

'B: MOND is a theory that increases the discrepancy between the observed missing baryonic mass in galaxy clusters and the measured velocity dispersions from a factor of around 10 to a factor of about 20.'

In [16]:
model = llm.get_model('gpt-3.5-turbo')
response = model.prompt(prompt1(r0), temperature=0.0)
response.text()

'D: MOND is a theory that reduces the discrepancy between the observed missing baryonic mass in galaxy clusters and the measured velocity dispersions from a factor of around 10 to a factor of about 2.'

Quick check for incorrect answers

In [10]:
def summary(i):
    row = trn.iloc[i]
    print(i, prompt1(row), row.answer)

def check1_incorrect(i):
    row = trn.iloc[i]
    ans = model.prompt(prompt1(row), temperature=0.0).text()
    if ans[0] == row.answer:
        return
    return ans, row.answer

In [19]:
model = llm.get_model('gpt-3.5-turbo')
for i in range(10):
    resp = check1_incorrect(i)
    if resp:
        summary(i)
        print(resp)
        print('---')

2 Question: Which of the following statements accurately describes the origin and significance of the triskeles symbol?
A: The triskeles symbol was reconstructed as a feminine divine triad by the rulers of Syracuse, and later adopted as an emblem. Its usage may also be related to the Greek name of Sicily, Trinacria, which means "having three headlands." The head of Medusa at the center of the Sicilian triskeles represents the three headlands.
B: The triskeles symbol is a representation of three interlinked spirals, which was adopted as an emblem by the rulers of Syracuse. Its usage in modern flags of Sicily has its origins in the ancient Greek name for the island, Trinacria, which means "Sicily with three corners." The head of Medusa at the center is a representation of the island's rich cultural heritage.
C: The triskeles symbol is a representation of a triple goddess, reconstructed by the rulers of Syracuse, who adopted it as an emblem. Its significance lies in the fact that it repre

In [20]:
model = llm.get_model('gpt-3.5-turbo-1106')
for i in range(10):
    resp = check1_incorrect(i)
    if resp:
        summary(i)
        print(resp)
        print('---')

0 Question: Which of the following statements accurately describes the impact of Modified Newtonian Dynamics (MOND) on the observed "missing baryonic mass" discrepancy in galaxy clusters?
A: MOND is a theory that reduces the observed missing baryonic mass in galaxy clusters by postulating the existence of a new form of matter called "fuzzy dark matter."
B: MOND is a theory that increases the discrepancy between the observed missing baryonic mass in galaxy clusters and the measured velocity dispersions from a factor of around 10 to a factor of about 20.
C: MOND is a theory that explains the missing baryonic mass in galaxy clusters that was previously considered dark matter by demonstrating that the mass is in the form of neutrinos and axions.
D: MOND is a theory that reduces the discrepancy between the observed missing baryonic mass in galaxy clusters and the measured velocity dispersions from a factor of around 10 to a factor of about 2.
E: MOND is a theory that eliminates the observed

## Manually checking abilities with web browsing chat models

Let's check what's going on with question 2 about the triskeles symbol by using ChatGPT and getting it to use its browsing for the Wikipedia page. We can also use Bing for this and get it to use the currently loaded web page. Alternatively, just paste the information in manually.

Help it a little by giving it a chance to think about the information and answer at more length: the more tokens it generates, the more of its own output it can re-evaluate on each new generation step. This includes the chat history too.

**ChatGPT 4**
Asking "*Please accurately describe the origin and significance of the triskeles symbol.*"

<blockquote>
The triskelion, also known as triskeles, is an ancient symbol consisting of three interlocked spirals or three bent human legs. Its origins and significance are diverse, reflecting its use across various cultures and time periods.

1. **Ancient Use**: The triskelion has been used since the Neolithic era. It appears in many ancient cultures, from the Neolithic megalithic monuments of the British Isles to ancient Greek and Sicilian contexts. 

2. **Symbolic Meanings**:
   - **Movement and Progress**: In some cultures, the three arms of the triskelion are seen as representing motion; this could be the movement of the sun across the sky or a more abstract concept of life's movement and progression.
   - **Triplicities**: The symbol often embodies concepts of triplicity, like past, present, and future, or earth, water, and sky. In Greek and Roman contexts, it was sometimes associated with the trinities of gods or goddesses.
   - **Celtic Significance**: In Celtic context, the triskelion holds a particular significance. It is often associated with the concept of triplicity in unity, which resonates with many aspects of Celtic belief systems that involve three domains, elements, or realms.
   - **Sicilian Emblem**: The symbol is famously associated with Sicily, where it is known as the "Trinacria." The symbol here often includes the head of Medusa and three bent legs. It represents the triangular shape of the island and its three capes.

3. **Modern Usage**: Today, the triskelion is used in various contexts. It's found in corporate logos, national symbols, and popular culture. It is also a significant symbol in Neopaganism, especially in Celtic-inspired traditions, where it is used to represent a variety of triplicities in the cosmology and theology of the religions.

In summary, the triskelion is a symbol with deep historical roots and a variety of meanings, depending on the cultural context. It often symbolizes concepts of movement, progression, and the unity of three parts.
</blockquote>

Follow up with *"First go through each of the 5 options, explaining why it is or isn't a good description, and then finally say which you think is most accurate."*, and pasting in the multiple choice answers, results in the following response

<blockquote>
Let's evaluate each option:

A: This description correctly associates the triskeles with Syracuse and the Greek name of Sicily, Trinacria, which means "having three headlands." The reference to a "feminine divine triad" is intriguing but not widely supported in historical records. The Medusa's head representing three headlands is a symbolic interpretation but lacks direct historical evidence.

B: This option correctly identifies the triskeles as three interlinked spirals and its association with Syracuse. It links the symbol to the ancient Greek name Trinacria, meaning "Sicily with three corners," which is accurate. The Medusa's head representing Sicily's rich cultural heritage is a valid symbolic interpretation but not a direct historical explanation.

C: This description suggests the triskeles represents a triple goddess, which is a speculative interpretation not widely supported by historical evidence. It correctly associates the symbol with the Greek name for Sicily, Trinacria, and the element "tria" meaning three. The Medusa's head representing three headlands is symbolic but not directly substantiated.

D: This option correctly describes the triskeles as three interlocked spiral arms and its emblematic use by Syracuse rulers. The connection to Sicily's rich cultural heritage and Greek mythology is valid, and the representation of Medusa's head as a symbol of Greek influence is a plausible interpretation.

...

Based on historical accuracy and common interpretations, **Option D** seems the most accurate. It correctly identifies the symbol's design, its association with Syracuse and Sicily, and provides a plausible interpretation of Medusa's head representing the influence of Greek mythology on Sicilian culture.
</blockquote>

This agrees with the current default `gpt-3.5-turbo`, but not with the most recent model or with the answer in the dataset. So, what's right?

Ask ChatGPT to browse Wikipedia:

<blockquote>
Re-evaluating the options based on the Wikipedia article:

A: This description aligns well with the historical association of the triskeles with Syracuse and Sicily, specifically its Greek name Trinacria, meaning 'having three headlands'. The concept of a "feminine divine triad" and its association with the triskeles is a reconstruction by the Duc de Luynes, linked with Hecate, but this is not a widely accepted interpretation. The Medusa head represents Sicily's heritage but its direct connection to three headlands is more symbolic than historical.

...

D: This option is accurate in describing the triskeles as three interlocked spiral arms and its emblematic use by Syracuse rulers. The connection to Sicily's rich cultural heritage and Greek mythology is valid, and Medusa's head represents the Greek influence on Sicilian culture, though this is more interpretative.

...

Based on this evaluation, **Option D** remains the most accurate. It correctly identifies the symbol's design, its association with Syracuse and Sicily, and provides a plausible interpretation of Medusa's head in terms of Greek cultural influence.
</blockquote>

"Why is A not a good answer?"

<blockquote>
Option A is not the best answer because it heavily relies on the speculative reconstruction of the triskeles as a "feminine divine triad" by the Duc de Luynes, a theory that is not widely supported in historical records. While it accurately connects the triskeles to Syracuse and the Greek name of Sicily, Trinacria, the specific interpretation involving a divine triad and the Medusa head symbolizing three headlands is more of a symbolic interpretation rather than a broadly accepted historical fact. The historical use of the triskeles, as indicated in the Wikipedia article, does not strongly support this specific interpretation, making Option A less accurate compared to others.
</blockquote>

Having read the Wikipedia article, this does seem to be good reasoning: the Kaggle answer is not correct, or at least it is not the best option as the dataset suggests.

It's very useful to be able to use the models and Bing to quickly get good information, even when they themselves (and humans creating datasets!) can be wrong.

## GPT-3.5 baseline

In [1]:
from concurrent.futures import ThreadPoolExecutor

In [19]:
check1_incorrect(0)

('B: MOND is a theory that increases the discrepancy between the observed missing baryonic mass in galaxy clusters and the measured velocity dispersions from a factor of around 10 to a factor of about 20.',
 'D')

In [20]:
trn.iloc[0]

id                                                        0
prompt    Which of the following statements accurately d...
A         MOND is a theory that reduces the observed mis...
B         MOND is a theory that increases the discrepanc...
C         MOND is a theory that explains the missing bar...
D         MOND is a theory that reduces the discrepancy ...
E         MOND is a theory that eliminates the observed ...
answer                                                    D
Name: 0, dtype: object

Had to wrap the API call in a retry block as I was getting a lot of timeouts with OpenAI. Surprised Jeremy managed to get through it all without any. Maybe just a bad time, something in the `llm` lib, or just the concurrency

In [45]:
with ThreadPoolExecutor(4) as ex:
    def _check1_incorrect(i):
        # Retry a few times, as the API call can time out
        retries = 4
        while retries > 0:
            try:
                return check1_incorrect(i)
            except Exception as err:
                print(err)
                retries -= 1
                if retries == 0:
                    # Out of retries: return the error message in place of a result
                    return str(err)
                time.sleep(1)
        return None

    results = ex.map(_check1_incorrect, range(0, 200), timeout=40)

Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600)
Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600)
Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600)
Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600)
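
The retry loop above waits a fixed second between attempts. A minimal sketch of the same idea as a reusable helper with exponential backoff follows; `with_retries`, `base_delay` and the injectable `sleep` argument are hypothetical names for illustration, not part of `llm` or the original notebook.

```python
import time

def with_retries(fn, *args, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying on any exception with exponential backoff.

    Returns fn's result, or the last error message as a string once the
    retry budget is spent (mirroring the loop above).
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fn(*args)
        except Exception as err:  # broad on purpose: timeouts, rate limits, ...
            last_error = str(err)
            if attempt < retries - 1:
                # Wait 1s, 2s, 4s, ... before the next attempt
                sleep(base_delay * 2 ** attempt)
    return last_error
```

With this in place, the worker passed to the executor could be as small as `lambda i: with_retries(check1_incorrect, i)`.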


In [46]:
results = list(results)

In [48]:
sum(1 for r in results if r is not None and 'timed out' in r)

0

We got all the results, so let's see what the error rate is

In [51]:
sum(1 for r in results if r is not None) / len(results)

0.31
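
Each entry in `results` is one of three things: `None` for a correct answer, an `(answer_text, expected)` tuple for a wrong one, or an error string if the retries ran out. The one-liner above counts both of the last two as errors. A small sketch that keeps them apart, assuming exactly that result shape (`classify` and `summarize` are hypothetical helper names):

```python
from collections import Counter

def classify(result):
    """Label one entry from the results list.

    None  -> the model's first character matched the answer
    tuple -> (model_text, expected_letter), counted as wrong
    str   -> error message returned after the retries ran out
    """
    if result is None:
        return "correct"
    if isinstance(result, str):
        return "api_error"
    return "wrong"

def summarize(results):
    """Return per-label counts and the error rate over answered questions."""
    counts = Counter(classify(r) for r in results)
    answered = counts["correct"] + counts["wrong"]
    error_rate = counts["wrong"] / answered if answered else float("nan")
    return counts, error_rate
```

In the run above there were no API errors left after retrying, so both calculations would give the same rate.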

It's possible, though, that some of the responses were not formatted in a way that lets us extract the answer with this simple approach. We'll move on to more advanced prompting and investigate and correct this at that point.

A quick check shows just one badly formatted answer during this run, with a space before the letter:

`[(' A: The dimension of an object in a CW complex is the largest n for which the n-skeleton is nontrivial, where the empty set is considered to have dimension -1 and the boundary of a discrete set of points is the empty set.',
  'A')]`

In [52]:
[res for res in results if res is not None and res[0][0] not in 'ABCDE']

[(' A: The dimension of an object in a CW complex is the largest n for which the n-skeleton is nontrivial, where the empty set is considered to have dimension -1 and the boundary of a discrete set of points is the empty set.',
  'A')]
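
A leading space like this is enough to defeat the `ans[0] == row.answer` comparison. One possible, slightly more forgiving way to pull out the letter is sketched below; `first_letter` is a hypothetical helper, not something from the original notebook.

```python
import re

def first_letter(text):
    """Return the leading option letter A-E of a reply, or None if absent.

    Leading whitespace is skipped, and the letter must stand alone, so a
    reply starting with the word 'Answer' is not mistaken for option A.
    """
    m = re.match(r"\s*([A-E])\b", text)
    return m.group(1) if m else None
```

Comparing `first_letter(ans)` with `row.answer` would then treat `' A: ...'` as a correct answer.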

## Better prompting via more time to think and conversations

We can force the model to spend more tokens thinking about the problem, so that each newly generated token can draw on more of the model's own earlier reasoning during autoregressive generation. There are lots of ways to do this, the simplest being Chain of Thought (CoT).

`llm` provides a `conversation` wrapper that builds up the list of previous user and assistant messages

In [63]:
conversation = model.conversation()

In [64]:
response = conversation.prompt('What is the source of the amen break?')

In [65]:
response.json()

{'content': 'The Amen break is a 6-second drum break that comes from the song "Amen, Brother" by the 1960s funk and soul band The Winstons. It has been widely sampled and used in various genres of music, particularly in hip-hop and electronic music.',
 'role': 'assistant',
 'finish_reason': 'stop',
 'id': 'chatcmpl-8PV35QbwFYAANUKusY73iXrqyDrr5',
 'object': 'chat.completion.chunk',
 'model': 'gpt-3.5-turbo-1106',
 'created': 1701088243}

In [66]:
response.text()

'The Amen break is a 6-second drum break that comes from the song "Amen, Brother" by the 1960s funk and soul band The Winstons. It has been widely sampled and used in various genres of music, particularly in hip-hop and electronic music.'

In [67]:
conversation.prompt('How does relate to drum and bass?').text()

'The Amen break has had a significant impact on the development of drum and bass music. It is one of the most sampled drum breaks in history and has been used extensively in the creation of drum and bass tracks. The fast tempo and intricate rhythms of the Amen break lend themselves well to the high-energy and complex drum patterns that are characteristic of drum and bass music. As a result, the Amen break has become a staple in the genre and is closely associated with the sound of drum and bass. Many iconic drum and bass tracks and producers have utilized the Amen break, further solidifying its influence on the genre.'

Use this to improve our question prompting

In [68]:
summary(52)

52 Question: What is resistivity?
A: Resistivity is an extrinsic property of a material that describes how difficult it is to make electrical current flow through it. It is measured in ohms and is dependent on the material's shape and size.
B: Resistivity is a measure of the resistance of a material to electrical current flow. It is measured in ohm-meters and is dependent on the material's shape and size.
C: Resistivity is an intrinsic property of a material that describes how difficult it is to make electrical current flow through it. It is measured in ohm-meters and is independent of the material's shape and size.
D: Resistivity is a measure of the electrical current that can flow through a material. It is measured in ohms and is dependent on the material's shape and size.
E: Resistivity is a measure of the electrical current that can flow through a material. It is measured in ohm-meters and is independent of the material's shape and size.
Answer:  C


Following the original notebook's approach: answers B and C are close, but resistivity doesn't depend on the material's shape and size. Ohm-meters is the correct unit because resistivity is normalized for geometry, whereas resistance in ohms depends on the full size of the sample. Go with this question to experiment.

In [69]:
conversation = model.conversation()
question = "Precisely and fully describe the answer to the following question: what is resistivity?"
print(conversation.prompt(question).text())

Resistivity is a measure of a material's ability to resist the flow of electric current through it. It is a fundamental property of a material and is defined as the resistance of a specific length and cross-sectional area of the material. Resistivity is typically denoted by the symbol ρ and is measured in ohm-meters (Ω⋅m). It is an important parameter in determining the behavior of materials in electrical and electronic applications and is used in calculations for determining the resistance of a material and its conductivity.


In [121]:
def prompt_explanation(i):
    row = trn.iloc[i]
    return f"""
    I will ask a multiple choice question, with 5 answers A, B, C, D, E.
    First output 'Options: ' followed by the 5 options, one per line, going through each and describing why it is or isn't a good description of the answer.
    Then output 'Summary: ' followed by a summary description of the answer you think is most accurate and why.
    Finally, output 'Answers: ' followed by a sorted list of the most to least accurate answers, using just the letters to identify the answers in the list, e.g. 'Answers: B C E A D'.
    NOTE: it is VERY IMPORTANT that the final output line contains just 'Answers: ' followed by the sorted list of letters A-E for most to least accurate.

    Question: {row.prompt}
    A: {row.A}
    B: {row.B}
    C: {row.C}
    D: {row.D}
    E: {row.E}
    """

In [122]:
question = prompt_explanation(52)
print(conversation.prompt(question).text())

Options:
A: Resistivity is an extrinsic property of a material that describes how difficult it is to make electrical current flow through it. It is measured in ohms and is dependent on the material's shape and size. This is not a good description as resistivity is an intrinsic property, not extrinsic, and it inaccurately describes resistivity's independence from the material's shape and size.

B: Resistivity is a measure of the resistance of a material to electrical current flow. It is measured in ohm-meters and is dependent on the material's shape and size. This is a good description as it correctly defines resistivity as a measure of the material's resistance to current flow and mentions the correct units, although it inaccurately describes the dependence on shape and size.

C: Resistivity is an intrinsic property of a material that describes how difficult it is to make electrical current flow through it. It is measured in ohm-meters and is independent of the material's shape and siz

prod even more...a little excessive but...

In [123]:
print(conversation.prompt('Why is C better than B?'))

I believe there was a mistake in my previous response. Looking back at the options, both C and B accurately describe resistivity. Both options correctly identify resistivity as a measure of the resistance of a material to electrical current flow, measured in ohm-meters. Additionally, option C points out that resistivity is independent of the material's shape and size, providing a more comprehensive description. Therefore, it would be more accurate to say that option C provides a more complete and detailed explanation of resistivity, making it a slightly better option. Thank you for catching that, and I apologize for any confusion caused.


In [124]:
def response_explanation(i):
    conversation = model.conversation()
    row = trn.iloc[i]
    initial_question = f"Precisely and fully describe the answer to the following question: {row.prompt}"
    initial_answer = conversation.prompt(initial_question).text()
    return conversation.prompt(prompt_explanation(i)).text()

In [125]:
response = response_explanation(52)
print(response)
print('Length: ', len(response))

Options:
A: This option is incorrect because resistivity is an intrinsic property, not an extrinsic property, and it is not measured in ohms, but in ohm-meters.
B: This option accurately describes resistivity as a measure of the resistance of a material to electrical current flow, measured in ohm-meters and dependent on the material's shape and size.
C: This option accurately describes resistivity as an intrinsic property, measured in ohm-meters, and independent of the material's shape and size.
D: This option inaccurately describes resistivity as a measure of electrical current flow through a material and incorrectly states that it is measured in ohms.
E: This option accurately describes resistivity as a measure of electrical current flow through a material, measured in ohm-meters and independent of the material's shape and size.

Summary: The most accurate description of resistivity is option C, as it correctly identifies resistivity as an intrinsic property of a material, measured i

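Let's get a very approximate cost estimate for completion tokens, assuming a token is about 2/3 of a word. This is a slightly pessimistic version of the usual rule of thumb of about 3/4 of a word per token for English text.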

In [126]:
n_words = len(response.split(' '))
n_words *= len(trn)
print(f'Words: {n_words}, tokens: ~{n_words * 3 / 2}, cost: ~${n_words * 3 / 2 * cost_completion_tokens:.2f}')

Words: 32200, tokens: ~48300.0, cost: ~$0.10


This covers only the completion tokens. Being pessimistic, assume the prompt uses about as many tokens at half the per-token price, which gives roughly $0.15 in total, or up to something like $0.40 once longer questions and everything else are included.
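
The arithmetic behind that total can be written out as a small sketch. The two prices are the ones set earlier in the notebook; `estimate_cost`, `tokens_per_word` and `prompt_ratio` are hypothetical names, and the equal prompt/completion split is the same pessimistic assumption as in the text.

```python
# Prices per token for gpt-3.5-turbo-1106, as set earlier in the notebook
cost_prompt_tokens = 0.001 / 1000
cost_completion_tokens = 0.002 / 1000

def estimate_cost(n_words, tokens_per_word=1.5, prompt_ratio=1.0):
    """Rough cost in dollars for n_words of completion text.

    tokens_per_word=1.5 matches the 'token is about 2/3 of a word' guess;
    prompt_ratio=1.0 assumes the prompt is as long as the completion.
    """
    completion_tokens = n_words * tokens_per_word
    prompt_tokens = completion_tokens * prompt_ratio
    return (prompt_tokens * cost_prompt_tokens
            + completion_tokens * cost_completion_tokens)

total = estimate_cost(32200)
print(f"~${total:.3f}")  # about 0.145
```

Raising `prompt_ratio` is a quick way to see how the estimate moves when the prompts, which include the chat history, are longer than the completions.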

In [127]:
with ThreadPoolExecutor(10) as ex:
    def _response_explanation(i):
        # Retry a few times, as the API call can time out
        retries = 4
        while True:
            try:
                return response_explanation(i)
            except Exception as err:
                print(err)
                retries -= 1
                time.sleep(1)
                if retries > 0:
                    continue
                # Out of retries: return the error message in place of a response
                return str(err)

    results = ex.map(_response_explanation, range(trn.shape[0]))

In [128]:
results = list(results)

In [134]:
results[0], trn.iloc[0].answer

('Options:\nA: This is not a good description. MOND does not postulate the existence of "fuzzy dark matter," but rather proposes modifications to Newtonian gravity to explain the observed discrepancies in the mass of galaxy clusters.\nB: This is not a good description. MOND does not increase the discrepancy between observed missing baryonic mass and measured velocity dispersions; rather, it aims to address this discrepancy.\nC: This is not a good description. MOND does not explain the missing baryonic mass by attributing it to neutrinos and axions, but rather by modifying the laws of gravity.\nD: This is a good description. MOND aims to reduce the discrepancy between the observed missing baryonic mass and the measured velocity dispersions, and this statement accurately describes that aim.\nE: This is not a good description. While MOND proposes modifications to gravity, it does not completely eliminate the observed missing baryonic mass discrepancy.\n\nSummary: Answer D is the most accu

Store results in CSV to take a look and be able to come back to them later

In [136]:
import pandas as pd

df = pd.DataFrame(results, columns=['response'])
df.to_csv('kaggle_llm_science_exam_gpt-3.5-turbo-1106.csv', index=False)

In [137]:
# load back in
# df = pd.read_csv('kaggle_llm_science_exam_gpt-3.5-turbo-1106.csv')
# results = list(df.response)

['Options:\nA: This is not a good description. MOND does not postulate the existence of "fuzzy dark matter," but rather proposes modifications to Newtonian gravity to explain the observed discrepancies in the mass of galaxy clusters.\nB: This is not a good description. MOND does not increase the discrepancy between observed missing baryonic mass and measured velocity dispersions; rather, it aims to address this discrepancy.\nC: This is not a good description. MOND does not explain the missing baryonic mass by attributing it to neutrinos and axions, but rather by modifying the laws of gravity.\nD: This is a good description. MOND aims to reduce the discrepancy between the observed missing baryonic mass and the measured velocity dispersions, and this statement accurately describes that aim.\nE: This is not a good description. While MOND proposes modifications to gravity, it does not completely eliminate the observed missing baryonic mass discrepancy.\n\nSummary: Answer D is the most accu

It looks like there are answers in all of them, but in some cases there's just one letter, so we need to deal with that a little more carefully

In [158]:
import re

def extract_answers(result):
    m = re.search(r'Answers:\s*([A-E])\s+([A-E])\s+([A-E])\s+([A-E])\s+([A-E])', result)
    if m is not None:
        return m.groups()
    m = re.search(r'Answers:\s*([A-E])', result)
    if m is not None:
        m = m.groups()[0]
        return (m, m, m, m, m)
    return 'ABCDE'

pred_answers = [extract_answers(r) for r in results]
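
The regex above expects the five letters to be separated by whitespace only. A slightly more tolerant variant is sketched below as one possible alternative; `extract_ranking` is a hypothetical name. Note that it behaves differently from `extract_answers` when fewer than five letters are found: it pads with the unused letters in alphabetical order instead of repeating the first one.

```python
import re

def extract_ranking(result, letters="ABCDE"):
    """Parse the last 'Answers:' line of a response into a 5-tuple ranking.

    Accepts separators such as commas, ignores repeated letters, and pads
    short rankings with the remaining letters in alphabetical order.
    """
    lines = re.findall(r"Answers:\s*(.*)", result)
    if not lines:
        return tuple(letters)
    ranked = []
    for ch in re.findall(r"\b[A-E]\b", lines[-1]):
        if ch not in ranked:
            ranked.append(ch)
    # Pad with the unused letters so the result always has five entries
    ranked += [ch for ch in letters if ch not in ranked]
    return tuple(ranked)
```

Padding rather than repeating makes no difference when the first guess is right, but gives the second and third slots a chance to score when it is wrong.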

Score the results using the official Kaggle evaluation, mean average precision at 3 (MAP@3): the first, second, or third answer being correct scores 1, 1/2, or 1/3 respectively, and 0 otherwise

In [159]:
def score_answer(pred, actual):
    return 1 if pred[0] == actual else 1/2 if pred[1] == actual else 1/3 if pred[2] == actual else 0

np.mean([score_answer(pred, actual) for pred, actual in zip(pred_answers, trn.answer)])

0.775
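
To make the scoring rule concrete, here is a small worked sketch on made-up predictions (the rankings and answers below are illustrative only, not taken from the dataset). `map_at_3` is a hypothetical helper that should agree with `score_answer` above whenever each ranking contains no repeated letters.

```python
def map_at_3(ranked_preds, answers):
    """Mean average precision at 3 when each question has one correct label."""
    total = 0.0
    for preds, actual in zip(ranked_preds, answers):
        for rank, letter in enumerate(preds[:3], start=1):
            if letter == actual:
                total += 1 / rank
                break  # only the first hit counts
    return total / len(answers)

# Illustrative data: correct at rank 1, rank 2, rank 3, and not in the top 3
preds = [("D", "A", "B"), ("B", "A", "C"), ("C", "E", "A"), ("A", "B", "C")]
answers = ["D", "A", "A", "E"]
score = map_at_3(preds, answers)  # (1 + 1/2 + 1/3 + 0) / 4
```

A score of 0.775 therefore means the correct letter was usually, but not always, ranked first.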

Not bad, but not close to the top of the leaderboard, and we've just used GPT-3.5-turbo and a little prompt engineering, which is easily available to all. At the time of writing, the competition is closed and the 1st place solution scored XXX.

The top entry used, among other tricks, a new dataset generated from Wikipedia, RAG, and fine-tuning, and then tested against the evaluation set with ensembling. Most top entries followed similar approaches. Some of the hardest aspects of actually doing this in the competition were the limits on available compute and data, even though it was fine to train elsewhere and bring the models in.