## Introduction

In this notebook, we shall demonstrate how to design two GPT-4o agents that are able to play "20 questions" with each other efficiently (i.e., with a high success %).

**Note:** Success % denotes the % of games that the "guesser" agent wins.

The game play data thus generated can be used to clone GPT-4o's behavior into any open weights instruction tuned LLM (such as `"Meta-Llama-3.1-8B-Instruct"`) by performing supervised fine-tuning.

## Imports & Setup

We'll need the following two packages.

In [1]:
!pip install -q openai
!pip install -q unidecode

Let's import a file containing a previous round of GPT-4o game play data on the "things" provided in Apple's <a href="https://github.com/apple/ml-entity-deduction-arena/tree/main/data/things" target="_blank">ml-entity-deduction-arena</a> repository.

We'll just grab the keywords and "alts" from this file; we don't need anything else from it.

**Note:** "Alts" are alternative words that are accepted as valid guesses (instead of the secret keyword).
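As a minimal sketch of how alts take part in scoring a guess: the helpers below assume alts are stored as a single `" | "`-separated string (the separator the `Game` class later splits on) and that a missing value arrives as NaN. The names `parse_alts` and `is_correct` are hypothetical and are not part of this notebook's classes.

```python
import math

def parse_alts(alts):
    """Split a ' | '-separated alts string into a list of normalized words."""
    # Missing alts show up as NaN (a float) when the CSV is read with pandas.
    if alts is None or (isinstance(alts, float) and math.isnan(alts)):
        return []
    return [a.strip().lower() for a in alts.split("|") if a.strip()]

def is_correct(guess, keyword, alts):
    """A guess counts if it equals the keyword or any of its alts."""
    guess = guess.strip().lower()
    return guess == keyword.strip().lower() or guess in parse_alts(alts)

print(is_correct("Spacecraft", "spaceship", "spacecraft"))  # True
print(is_correct("rocket", "spaceship", "spacecraft"))      # False
print(is_correct("monkey", "monkey", float("nan")))         # True
```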

In [2]:
import pandas as pd

apple_things_game_play_data = pd.read_csv("/kaggle/input/gpt-4o-game-play-data-llm-20-questions/apple_things_game_play_data.csv")
apple_things_game_play_data

Unnamed: 0,game_idx,keyword,alts,questions,answers,guesses,round_of_correct_guess
0,1,monkey,,Is it a proper noun?,no,common noun,8
1,1,monkey,,Is it a man-made thing?,no,is it natural?,8
2,1,monkey,,Is it a living thing?,yes,animal,8
3,1,monkey,,Is it an animal?,yes,dog,8
4,1,monkey,,Is it a mammal?,yes,cat,8
...,...,...,...,...,...,...,...
7161,448,fries,,Is it flavored with a common seasoning like sa...,yes,lay's chips,-1
7162,448,fries,,Is it a potato chip flavored with cheese?,no,potato crisps,-1
7163,448,fries,,Is it flavored with salt?,yes,potato chips,-1
7164,448,fries,,Is it a plain salted potato chip?,no,bbq chips,-1


In [3]:
apple_things = apple_things_game_play_data['keyword'].unique().tolist()
len(apple_things)

406

In [4]:
apple_things[:15]

['monkey',
 'bulldog',
 'lobster',
 'feather duster',
 'minivan',
 'comb',
 'zebra',
 'helicopter',
 'sloth',
 'bookshelf',
 'shampoo',
 'chameleon',
 'spaceship',
 'kayak',
 'broom']

Let's grab the alts corresponding to each keyword. (For most keywords, there are no alts.)

In [5]:
apple_alts = []
for thing in apple_things:
    # Take the alts from the first row recorded for this keyword.
    subset = apple_things_game_play_data.loc[apple_things_game_play_data['keyword'] == thing].reset_index(drop=True)
    apple_alts.append(subset['alts'][0])
len(apple_alts)

406

In [6]:
apple_alts[:15]

[nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'spacecraft',
 nan,
 nan]

To use GPT-4o, we need to instantiate an OpenAI client.

In [7]:
from kaggle_secrets import UserSecretsClient
import openai

user_secrets = UserSecretsClient()
openai_client = openai.OpenAI(api_key=user_secrets.get_secret("OPENAI_API_KEY")) # Replace `api_key` with your own OpenAI API key.

`unidecode` will be used to replace accented Unicode characters with plain ASCII characters.

In [8]:
from unidecode import unidecode

# Exploratory:
unidecode("São Paulo")

'Sao Paulo'

## Main Classes

The "guesser" agent is encapsulated by the `Guesser` class. It has two turn types: "ask" and "guess".

For both turn types, a system prompt gives GPT-4o some context and an instruction. Next, a user prompt gives GPT-4o a hint, which is a succinct summary of the previous rounds of the game. The hint tells GPT-4o which facts are known to be true and which are known to be false.

The hint for the "guess" turn type also informs GPT-4o about its previous guesses. GPT-4o is instructed not to repeat any of these (in the system prompt).
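To make the hint format concrete, here is a standalone sketch of how such a hint can be assembled from the game history. The function name `build_hint` is hypothetical; it mirrors the string-building logic of the `Guesser` class but makes no API calls.

```python
def build_hint(questions, answers, guesses=None):
    """Summarize the game so far as the text placed between <hint> </hint>."""
    false_facts = [q for q, a in zip(questions, answers) if a == "no"]
    true_facts = [q for q, a in zip(questions, answers) if a == "yes"]

    hint = ""
    if false_facts:
        hint += "\nThe following are false:\n" + "\n".join(false_facts) + "\n"
    if true_facts:
        # "But" is used only when a list of false facts precedes this one.
        header = "But the following are true:" if false_facts else "The following are true:"
        hint += "\n" + header + "\n" + "\n".join(true_facts) + "\n"
    if guesses:  # Only the "guess" turn type passes previous guesses.
        hint += "\nPrevious guesses:\n" + "\n".join(guesses) + "\n"
    return hint

hint = build_hint(
    questions=["Is it a proper noun?", "Is it a living thing?"],
    answers=["no", "yes"],
    guesses=["common noun", "animal"],
)
print(f"<hint>{hint}</hint>")
```

Calling `build_hint` without `guesses` gives the shorter hint used for the "ask" turn type.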

In [9]:
class Guesser:
    def __init__(self, openai_client):
        self.openai_client = openai_client

    def get_gpt4o_completion(self, messages, max_tokens, temperature):
        response = self.openai_client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature
        )
        return response.choices[0].message.content

    def get_questioner_messages(self, questions, answers):
        df = pd.DataFrame({'question': questions, 'answer': answers})
        df_no = df.loc[df['answer'] == "no"]
        questions_no = df_no['question'].tolist()
        # If questions_no above is empty, the following method call will return an empty string.
        questions_no = "\n".join(questions_no)
        df_yes = df.loc[df['answer'] == "yes"]
        questions_yes = df_yes['question'].tolist()
        # If questions_yes above is empty, the following method call will return an empty string.
        questions_yes = "\n".join(questions_yes)

        hint = ""
        if len(questions_no) > 0:
            hint += f"""\nThe following are false:
{questions_no}\n"""
        if len(questions_yes) > 0:
            if len(questions_no) > 0:
                hint += f"""\nBut the following are true:
{questions_yes}\n"""
            else:
                hint += f"""\nThe following are true:
{questions_yes}\n"""

        messages = [{
                'role': "system",
                'content': f"""A human has thought of an entity. The human has kept it a secret.

But we do know that the secret entity is a thing. It can be a man-made thing, a natural thing, a living thing, a substance, a type of food, or a type of place. It is a common noun.

The user will provide you a hint delimited by <hint> </hint>. Your task is to ask a new question that will narrow down the set of possible entities.

Only ask a question that can be answered by 'yes' or 'no'. Do not ask for a hint. Make your question brief, and do not say anything else."""
        }]
        messages.append({
            'role': "user",
            'content': f"""Hint:
<hint>{hint}</hint>"""
        })
        return messages

    def ask(self, questions, answers):
        messages = self.get_questioner_messages(questions, answers)
        question = self.get_gpt4o_completion(messages, max_tokens=75, temperature=0.8)
        return question

    def get_guesser_messages(self, questions, answers, guesses):
        df = pd.DataFrame({'question': questions, 'answer': answers})
        df_no = df.loc[df['answer'] == "no"]
        questions_no = df_no['question'].tolist()
        # If questions_no above is empty, the following method call will return an empty string.
        questions_no = "\n".join(questions_no)
        df_yes = df.loc[df['answer'] == "yes"]
        questions_yes = df_yes['question'].tolist()
        # If questions_yes above is empty, the following method call will return an empty string.
        questions_yes = "\n".join(questions_yes)

        hint = ""
        if len(questions_no) > 0:
            hint += f"""\nThe following are false:
{questions_no}\n"""
        if len(questions_yes) > 0:
            if len(questions_no) > 0:
                hint += f"""\nBut the following are true:
{questions_yes}\n"""
            else:
                hint += f"""\nThe following are true:
{questions_yes}\n"""
        if len(guesses) > 0:
            hint += "\nPrevious guesses:\n" + "\n".join(guesses) + "\n"

        messages = [{
                'role': "system",
                'content': f"""A human has thought of an entity. The human has kept it a secret.

But we do know that the secret entity is a thing. It can be a man-made thing, a natural thing, a living thing, a substance, a type of food, or a type of place. It is a common noun.

Your task is to guess the secret entity, by utilizing a hint delimited by <hint> </hint>

Limit your response to a maximum of two words. Do not repeat any of the previous guesses."""
        }]
        messages.append({
            'role': "user",
            'content': f"""Hint:
<hint>{hint}</hint>"""
        })
        return messages

    def guess(self, questions, answers, guesses):
        messages = self.get_guesser_messages(questions, answers, guesses)
        guess = self.get_gpt4o_completion(messages, max_tokens=10, temperature=0.8)
        guess = unidecode(guess)
        guess = guess.lower()
        return guess

The "answerer" agent is encapsulated by the `Answerer` class. It has just one turn type: "answer".

A system prompt gives GPT-4o an instruction along with the secret keyword. Next, a user prompt provides GPT-4o the question.

**Note:** Unlike the `Guesser`, the `Answerer` is not provided information about the game history. It simply generates a 'yes' or 'no' answer that is factually correct, based on (i) its knowledge about the keyword and (ii) the question.

In [10]:
class Answerer:
    def __init__(self, openai_client):
        self.openai_client = openai_client

    def get_gpt4o_completion(self, messages, max_tokens, temperature):
        response = self.openai_client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature
        )
        return response.choices[0].message.content

    def get_messages(self, keyword, question):
        messages = [{'role': "system", 'content': f"""Your task is to reply with a single word answer - either 'yes' or 'no'.

I am providing you an entity delimited by <entity> </entity>.

Entity:
<entity>{keyword}</entity>

Next, the user will ask you a question delimited by <question> </question>.

Please use the entity and the question, and formulate a single word 'yes' or 'no' answer that is factually correct. You must use all your knowledge about the entity to formulate the answer.

Never mention the entity in your answer. Do not provide an explanation for your answer, and do not use any other words."""}]
        messages.append({
            'role': "user", 'content': f"""Question:
<question>{question}</question>"""
        })
        return messages

    def answer(self, keyword, question):
        messages = self.get_messages(keyword, question)
        answer = self.get_gpt4o_completion(messages, max_tokens=2, temperature=0.2)
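        # Normalize before validating, so that replies such as "Yes." or " no" are not coerced to 'no' below.
        answer = answer.strip().lower().rstrip(".")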
        if answer not in ['no', 'yes']:
            answer = 'no'
        return answer

Finally, the game is encapsulated by the `Game` class. An instance of this class will contain the secret keyword & alts (if any). Moreover, it will contain the `Guesser` & `Answerer` objects. It will also maintain the state of the game - the questions, the answers, the guesses, the current round, and the round of correct guess.

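Each round consists of a question, an answer and a guess. There is a maximum of 20 rounds. The game ends early if (i) there is an illegal answer or (ii) there is a correct guess. (Since `Answerer.answer` already coerces any reply other than 'yes' or 'no' to 'no', the illegal answer check acts only as a safeguard.)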
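The round structure can be sketched without any API calls. The function below is an illustrative simplification, not the `Game` class itself: it omits the hardcoded dummy round and the saving of data, and it takes three plain callables (`ask`, `answer`, `guess`, all hypothetical stand-ins for the GPT-4o calls).

```python
def play_game(ask, answer, guess, keyword, alts=(), max_rounds=20):
    """Run up to max_rounds rounds; return the round of the correct guess, or -1."""
    questions, answers, guesses = [], [], []
    for current_round in range(1, max_rounds + 1):
        questions.append(ask(questions, answers))
        answers.append(answer(keyword, questions[-1]))
        # The guesser sees only the guesses made in earlier rounds.
        guesses.append(guess(questions, answers, guesses))
        if answers[-1] not in ("yes", "no"):
            return -1  # Illegal answer: the game ends without a win.
        if guesses[-1] == keyword or guesses[-1] in alts:
            return current_round
    return -1

# Scripted stand-ins for the GPT-4o calls (purely illustrative).
scripted_guesses = iter(["dog", "cat", "monkey"])
result = play_game(
    ask=lambda qs, ans: f"Question {len(qs) + 1}?",
    answer=lambda keyword, question: "yes",
    guess=lambda qs, ans, gs: next(scripted_guesses),
    keyword="monkey",
)
print(result)  # 3
```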

In [11]:
import numpy as np
from pathlib import Path

class Game:
    def __init__(self, game_play_data_file_path, game_idx, keyword, alts, openai_client):
        self.game_play_data_file_path = game_play_data_file_path
        self.game_idx = game_idx
        self.keyword = keyword.strip().lower()
        if pd.isna(alts):  # Matches any missing value, not only the `np.nan` object itself.
            self.alts = []
        else:
            self.alts = [a.strip().lower() for a in alts.split(" | ")]
        self.guesser = Guesser(openai_client)
        self.answerer = Answerer(openai_client)
        self.questions = ["Is it a proper noun?"] # Historical relic; see note below.
        self.answers = ["no"] # Historical relic; see note below.
        self.guesses = ["common noun"] # Historical relic; see note below.
        self.current_round = 1
        self.round_of_correct_guess = -1 # This will remain -1 if there is no correct guess.

    def play_round(self):
        question = self.guesser.ask(self.questions, self.answers)
        self.questions.append(question)

        answer = self.answerer.answer(self.keyword, question)
        self.answers.append(answer)

        guess = self.guesser.guess(self.questions, self.answers, self.guesses)
        self.guesses.append(guess)

        return question, answer, guess

    def play_game(self):
        for i in range(20):
            print(f"Current round: {self.current_round}")
            question, answer, guess = self.play_round()
            print(f"Question: {question}")
            print(f"Answer: {answer}")
            print(f"Guess: {guess}")

            if answer not in ["no", "yes"]:
                print("Illegal answer!")
                print("---")
                break # The game ends.

            if guess == self.keyword or guess in self.alts:
                print("Correct guess!")
                self.round_of_correct_guess = self.current_round
                print("---")
                break # The game ends.
            print("---")
            self.current_round += 1
        return self.keyword, self.alts, self.round_of_correct_guess

    def save_game_data(self):
        if len(self.guesses) > 0:
            game_df = pd.DataFrame({
                'game_idx': [self.game_idx] * len(self.guesses),
                'keyword': [self.keyword] * len(self.guesses),
                'alts': [" | ".join(self.alts)] * len(self.guesses),
                'questions': self.questions,
                'answers': self.answers,
                'guesses': self.guesses,
                'round_of_correct_guess': [self.round_of_correct_guess] * len(self.guesses)
            })
            csv_file = Path(self.game_play_data_file_path)
            if csv_file.is_file(): # File exists.
                game_df.to_csv(self.game_play_data_file_path, mode="a", header=False, index=False) # Appending to existing file.
            else:
                game_df.to_csv(self.game_play_data_file_path, index=False) # Creating new file.
            print("Game data saved to disk.")
        else:
            print("Game data not found. Could not save to disk.")

**Historical note (not that important):** 

- At one point, the "LLM 20 Questions" competition had two categories of keywords: "locations" and "things". However, with just a few weeks left in the competition, <a href="https://www.kaggle.com/competitions/llm-20-questions/discussion/523198" target="_blank">the "locations" category was removed</a>.
- When "locations" were still part of the competition, we were hardcoding the first question to be `"Is it a proper noun?"` to distinguish "locations" (proper nouns) from "things" (common nouns). GPT-4o's role started from the first answer, not the first question (as a consequence of hardcoding the first question).
- In fact, the subsequent prompt used (from round 2 onwards) was different, depending on the answer to the first question - `"no"` (-> the keyword must be a "thing") / `"yes"` (-> the keyword must be a "location").
- Our team had already generated a large amount of data (and burnt through hundreds of dollars of OpenAI credits) when this change to the competition was announced. We didn't want to throw all this data away.
- So we came up with a hack to re-use all this data for the new scenario where there are "things" only (i.e., there are no "locations" any more). Our hack consisted of four steps: (i) delete all the game play data for "locations", (ii) get rid of the "locations" specific prompt, i.e., retain only the "things" specific prompt, (iii) continue the previous games for "things" for one more round (since the first question `"Is it a proper noun?"` was now pointless and redundant, making the previous games effectively "19 questions" instead of "20 questions"), and (iv) instantiate the `Game` class with the following state:

```
self.questions = ["Is it a proper noun?"] # This first question (for round 1) was hardcoded when "locations" were still part of the competition.
self.answers = ["no"] # The answer to the above question for "things" was always "no".
self.guesses = ["common noun"] # The most common first round guess for "things" was "common noun".
```

With this hack, the above initial state effectively became the zeroth round - a sort of redundant "dummy round". It meant that the "actual start" of the game was after this dummy round.

However, if you are starting from scratch (on "things" only), please **don't worry about the above hack**. Instead, feel free to use any question, answer & guess combination that you feel is appropriate for the zeroth (dummy) round. For example, you can use:

```
self.questions = ["Is it a thing?"] # Example question for the zeroth (dummy) round.
self.answers = ["yes"] # Example answer for the zeroth (dummy) round.
self.guesses = ["something"] # Example guess for the zeroth (dummy) round.
```

## Game Play

The game play data is saved in the following file.

In [12]:
game_play_data_file_path = "/kaggle/working/game_play_data.csv"

In [13]:
csv_file = Path(game_play_data_file_path)
if csv_file.is_file(): # File exists.
    all_games = pd.read_csv(game_play_data_file_path)
    next_game_idx = all_games['game_idx'].max() + 1
else:
    next_game_idx = 0 # Games are zero indexed.
next_game_idx

0

Let's play 10 games, and save the game play data to disk.

In [14]:
n_games_now = 10 # Set the number of games to be played now.
n_successful_games = 0
for game_idx in range(next_game_idx, next_game_idx + n_games_now):
    print(f"Game ID: {game_idx}")
    print("---")
    keyword = apple_things[game_idx] # Games are zero indexed.
    alts = apple_alts[game_idx] # Games are zero indexed.
    game = Game(game_play_data_file_path, game_idx, keyword, alts, openai_client)
    keyword, alts, round_of_correct_guess = game.play_game()
    print("Game completed!")
    print(f"Keyword: {keyword}")
    print(f"Alts: {alts}")
    print(f"Round of correct guess: {round_of_correct_guess}")
    game.save_game_data()
    if round_of_correct_guess != -1: # Successful game.
        n_successful_games += 1
    print("---\n---")
print(f"Done playing {n_games_now} games!")
print(f"Number of successful games: {n_successful_games}")
pct_successful_games = (n_successful_games / n_games_now) * 100
print(f"% of successful games: {round(pct_successful_games, 2)}")
print("---")

Game ID: 0
---
Current round: 1
Question: Is it man-made?
Answer: no
Guess: tree
---
Current round: 2
Question: Is it a living thing?
Answer: yes
Guess: animal
---
Current round: 3
Question: Is it an animal?
Answer: yes
Guess: dog
---
Current round: 4
Question: Is it a mammal?
Answer: yes
Guess: cat
---
Current round: 5
Question: Is it a domestic animal?
Answer: no
Guess: horse
---
Current round: 6
Question: Is it a wild animal?
Answer: yes
Guess: lion
---
Current round: 7
Question: Is it a carnivore?
Answer: no
Guess: elephant
---
Current round: 8
Question: Does it primarily live in the forest?
Answer: yes
Guess: tiger
---
Current round: 9
Question: Is it a primate?
Answer: yes
Guess: monkey
Correct guess!
---
Game completed!
Keyword: monkey
Alts: []
Round of correct guess: 9
Game data saved to disk.
---
---
Game ID: 1
---
Current round: 1
Question: Is it a living thing?
Answer: yes
Guess: animal
---
Current round: 2
Question: Is it an animal?
Answer: yes
Guess: dog
---
Current round:

## Conclusion

This notebook demonstrated how "20 questions" game play data can be generated by making a GPT-4o `Guesser` agent play with a GPT-4o `Answerer` agent.

The above process was used to generate game play data for various sets of keywords. This resulted in the following dataset: https://www.kaggle.com/datasets/sambitmukherjee/gpt-4o-game-play-data-llm-20-questions
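Success % can be recomputed from saved game play data. The sketch below assumes the one-row-per-round layout written by `save_game_data`, in which every row of a game carries the same `round_of_correct_guess` and -1 marks an unsuccessful game. The helper name `success_pct` is hypothetical.

```python
def success_pct(rows):
    """Percentage of games won, given one dict per round of game play."""
    won = {}
    for row in rows:
        # Every row of a game repeats the same outcome, so overwriting is harmless.
        won[row["game_idx"]] = int(row["round_of_correct_guess"]) != -1
    if not won:
        return 0.0
    return 100 * sum(won.values()) / len(won)

toy_rows = [
    {"game_idx": 0, "round_of_correct_guess": 2},
    {"game_idx": 0, "round_of_correct_guess": 2},
    {"game_idx": 1, "round_of_correct_guess": -1},
    {"game_idx": 1, "round_of_correct_guess": -1},
]
print(success_pct(toy_rows))  # 50.0
```

With pandas, the rows of a loaded file can be passed in as `df.to_dict("records")`.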

A dataset such as the above can be used to clone GPT-4o's behavior into any open weights instruction tuned LLM (such as `"Meta-Llama-3.1-8B-Instruct"`) by performing supervised fine-tuning.