# Adding a new game to the framework - FirstLast

### Disclaimer
This notebook closely follows the tutorial steps from [docs/howto_add_games_example.md](https://github.com/clp-research/clembench/blob/main/docs/howto_add_games_example.md)
 in the [**clp-research/clembench**](https://github.com/clp-research/clembench) repository. <br> 
While the core content is derived from the original tutorial, some additional detailed descriptions and explanations have been added to serve as a source of knowledge during the development of our project.

## Description
- The players should engage in a turn-based conversation about a predefined topic. 
- Player A starts with an utterance whose first and last tokens must start with a `predefined letter`, say d. 
- Player B must then reply with an utterance whose first and last tokens start with the `next letter in the alphabet` (here, an e). <br> <br>
And so on, for n turns **(where each turn consists of an utterance from A and an utterance from B)**.<br> <br>
- If an utterance does not conform to these rules (i.e. it is incorrect), the players lose the game. <br>
- `move rule:` If an utterance does not start with 'I SAY: ' (i.e., it is invalid), the game is immediately aborted. <br>
- If all utterances up to turn n are valid and correct, the game is successful.

For instance, if the topic is **birds**, the initial letter is `h` and the number of turns is 2, this would be a successful game:

- Hi! I love birds, but it's hard to identify them. I need help. `(h: hi / help)`
- I know what you mean. I can try to help, please describe it. `(i: I/ it)`
- Just a moment... Ok, it's blue but looks like a Eurasian jay. `(j: just / jay)`
- Kick in more details, otherwise I don't know. `(k: kick / know)`

Each turn, we need to check two aspects: 
- **MOVE_RULE**: Does the utterance start with 'I SAY'?
- **GAME_RULE**: Do the first/last tokens start with `predefined_letter`?
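
The two checks can be sketched as plain functions. This is only an illustration of the rules described above, not the framework's implementation: the names `is_valid` and `is_correct` are hypothetical, and tokens are assumed to be whitespace-separated words.

```python
PREFIX = 'I SAY: '  # tag required by the move rule


def is_valid(utterance: str) -> bool:
    """MOVE_RULE: the utterance must start with the tag."""
    return utterance.startswith(PREFIX)


def is_correct(utterance: str, letter: str) -> bool:
    """GAME_RULE: the first and last tokens must start with `letter`."""
    # drop the tag (if present) before looking at the tokens
    text = utterance[len(PREFIX):] if utterance.startswith(PREFIX) else utterance
    tokens = text.split()
    if not tokens:
        return False
    return all(token[0].lower() == letter.lower()
               for token in (tokens[0], tokens[-1]))


utterance = "I SAY: Hi! I love birds, but it's hard to identify them. I need help."
print(is_valid(utterance), is_correct(utterance, 'h'))  # True True
```

Note that this simple tokenisation would reject a token that begins with punctuation (for example an opening quote), so a real parser may need to strip such characters first.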

# Adding a new game:
We will need at least the following components:
- **Game resources:** all data, prompt templates and text files that are necessary to create instances of a game and to group these instances into experiments.
- **Instances:** JSON containing the configuration of each instance, grouped into experiments. 
    - This must be done by a script named `instancegenerator.py`, with a class that inherits from `GameInstanceGenerator`.
- **Game Master:** Controls and enforces MOVE/GAME Rules, inheriting from `GameMaster`. 
    - This must be implemented in a file `master.py`.
- **Players:** Defines the programmatic behaviour and any other attributes of a player, inheriting from `Player`.
    - This can be implemented in a file named `players.py`.
- **Game Benchmark:** a class that realises the game, inheriting from `GameBenchmark`. 
    - This can also live in the file `master.py`.

To define an **Episode**, we have to instantiate the initial prompts and define 3 more parameters:
- Topic
- Letter
- Number of Turns

# Defining prompts with game rules
- In the prompt template we can define variables that can be replaced later.
- Prompts have to be adjusted for Player A and B, based on their roles.

In [None]:
from pathlib import Path

# create the resource folders if they do not exist yet
resources = Path('resources')
path = resources / 'initial_prompts'
path.mkdir(parents=True, exist_ok=True)

with open(path / 'initial_prompt_a.template', 'w') as file:
    file.write(
        "Let's play a game. You must have a conversation about $topic with your partner. Your first turn must start and end with words that begin with the letter $letter. The reply of your partner must be similar, with the letter that comes after $letter in the alphabet. Then it's your turn again with the next letter, and so on. You'll do it for $nturns turns. Always start your utterance with I SAY: and then give your answer. If you break the rules, you lose."
    )

with open(path / 'initial_prompt_b.template', 'w') as file:
    file.write(
        "Let's play a game. You must have a conversation about $topic with your partner. Their first turn must start and end with words that begin with the letter $letter. Your reply must be similar, with the letter that comes after $letter in the alphabet. Then it's their turn again with the next letter, and so on. You'll do it for $nturns turns. Always start your utterance with I SAY: and then give your answer. If you break the rules, you lose."
    )

# Save topics to resources/topics.txt, where instancegenerator.py looks for them
topics = ['dogs', 'cats', 'birds', 'trees']
with open(resources / 'topics.txt', 'w') as file:
    for topic in topics:
        file.write(topic + '\n')
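
To see how the `$topic`, `$letter` and `$nturns` slots get replaced later, here is a minimal sketch using `string.Template` from the standard library. The template text is a shortened stand-in for illustration, not the real prompt.

```python
from string import Template

# shortened stand-in for the text stored in initial_prompt_a.template
template = Template(
    "Talk about $topic. Start and end with the letter $letter. "
    "Play for $nturns turns."
)

# every slot must receive a value, otherwise substitute() raises a KeyError
prompt = template.substitute(topic='birds', letter='h', nturns=2)
print(prompt)
# Talk about birds. Start and end with the letter h. Play for 2 turns.
```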


# Creating game instances
- `instancegenerator.py` will create `instances.json`.
- Create a class that inherits from `GameInstanceGenerator` and define `on_generate`.
- In main: instantiate the class and call its `.generate()` method.

- `on_generate:` Define experiments and instances, based on what dimensions we want to evaluate later.
- `Experiment:` Set of instances with the same topic
    - `Instance:`
        - Initial Letter
        - Initial Prompt
        - Number of turns

The instances.json file should contain everything that the game master needs to set up the configuration of a game play!
#### Example:

In [None]:
{
    "experiments": [
        {
            "name": "NAME_1",
            "game_instances": [
                {
                    "game_id": 0,
                    "first_letter": "LETTER",
                    "n_turns": "N",
                    "prompt_player_a": "PROMPT_A",
                    "prompt_player_b": "PROMPT_B"
                },
                {
                    "game_id": 1,
                    "first_letter": "LETTER",
                    "n_turns": "N",
                    "prompt_player_a": "PROMPT_A",
                    "prompt_player_b": "PROMPT_B"
                }
            ]
        },
        {
            "name": "NAME_2",
            "game_instances": [
                {
                    "game_id": 0,
                    "first_letter": "LETTER",
                    "n_turns": "N",
                    "prompt_player_a": "PROMPT_A",
                    "prompt_player_b": "PROMPT_B"
                },
                {
                    "game_id": 1,
                    "first_letter": "LETTER",
                    "n_turns": "N",
                    "prompt_player_a": "PROMPT_A",
                    "prompt_player_b": "PROMPT_B"
                }
            ]
        }
    ]
}
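
As a rough sketch of how such a file can be consumed, the snippet below parses a small hand-written stand-in for `instances.json` and walks over its experiments and instances. The helper `iter_instances` and the `REQUIRED_KEYS` set are hypothetical names used only for this illustration; the framework has its own loading code.

```python
import json

# hand-written stand-in for the content of instances.json
raw = '''
{"experiments": [{"name": "birds", "game_instances": [
    {"game_id": 0, "first_letter": "h", "n_turns": 2,
     "prompt_player_a": "PROMPT_A", "prompt_player_b": "PROMPT_B"}]}]}
'''

# keys that every instance of this game is expected to carry
REQUIRED_KEYS = {'game_id', 'first_letter', 'n_turns',
                 'prompt_player_a', 'prompt_player_b'}


def iter_instances(data):
    """Yield (experiment name, instance) pairs, checking the keys."""
    for experiment in data['experiments']:
        for instance in experiment['game_instances']:
            missing = REQUIRED_KEYS - set(instance)
            if missing:
                raise KeyError(f'instance lacks keys: {sorted(missing)}')
            yield experiment['name'], instance


data = json.loads(raw)
for name, instance in iter_instances(data):
    print(name, instance['game_id'], instance['first_letter'], instance['n_turns'])
```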

In [None]:
# save the contents of this cell as games/firstlast/instancegenerator.py
import random
import string

from clemgame.clemgame import GameInstanceGenerator

# set the name of the game in the script, as you named the directory
# this name will be used everywhere, including in the table of results
GAME_NAME = 'firstlast'
# we will create 10 instances for each experiment; vary this as you wish
N_INSTANCES = 10
# if the generation involves randomness, remember to set a random seed
SEED = 123

class FirstLastGameInstanceGenerator(GameInstanceGenerator):
    def __init__(self):
        # GameInstanceGenerator
        super().__init__(GAME_NAME)
    
    # define on_generate, a mandatory method
    def on_generate(self):
        # get the list of topics, which will be our experiments
        topics = self.load_file('resources/topics.txt').strip('\n').split('\n')
        # get the prompts for player a and player b
        # we'll keep the prompts fixed in all instances, replacing only the
        # necessary slots (but you can do it differently)
        prompt_a = self.load_template('resources/initial_prompts/initial_prompt_a')
        prompt_b = self.load_template('resources/initial_prompts/initial_prompt_b')

        # building the file, one experiment at a time
        for topic in topics:
            # create an experiment (for us, named after a topic)
            experiment = self.add_experiment(topic)
            # build N_INSTANCES instances for each experiment
            for game_id in range(N_INSTANCES):
                # set the parameters
                # here we do it randomly, but that can also be read from a file
                # one of the first 5 letters in the alphabet
                letter = random.choice(string.ascii_lowercase[:5])
                # up to 8 turns, so that we don't run out of letters
                n_turns = random.randint(3, 8)
                # create a game instance, using a game_id counter/index
                instance = self.add_game_instance(experiment, game_id)
                # populate the game instance with its parameters
                instance['first_letter'] = letter
                instance['n_turns'] = n_turns
                instance['prompt_player_a'] = self.create_prompt(
                    topic, prompt_a, letter, n_turns)
                instance['prompt_player_b'] = self.create_prompt(
                    topic, prompt_b, letter, n_turns)
    
    # an additional method, specific for our example
    def create_prompt(self,
                      topic: str,
                      prompt: str,
                      letter: str,
                      n_turns: int) -> str:
        """Replace a prompt template with slot values."""
        text = string.Template(prompt).substitute(topic=topic, letter=letter,
                                                  nturns=n_turns)
        return text


if __name__ == '__main__':
    random.seed(SEED)
    # always call this, which will actually generate and save the JSON file
    FirstLastGameInstanceGenerator().generate()
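
The comment about not running out of letters can be checked with a little arithmetic: each turn consists of two utterances, and each utterance advances by one letter. The helper `last_letter` below is a hypothetical name used only for this check.

```python
from string import ascii_lowercase as letters


def last_letter(first_letter: str, n_turns: int) -> str:
    """Return the letter required for the final utterance of an episode."""
    # 2 * n_turns utterances in total, the first one uses first_letter itself
    index = letters.index(first_letter) + 2 * n_turns - 1
    if index >= len(letters):
        raise ValueError('the episode would run past the letter z')
    return letters[index]


print(last_letter('h', 2))  # k, as in the birds example
print(last_letter('e', 8))  # t, the worst case the generator can sample
```

With the initial letter drawn from a to e and at most 8 turns, the final utterance never goes beyond t, so the alphabet is always long enough.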

#### Results

In [None]:
{
    "experiments": [
        {
            "name": "dogs",
            "game_instances": [
                {
                    "game_id": 0,
                    "first_letter": "a",
                    "n_turns": 5,
                    "prompt_player_a": "Test Prompt A",
                    "prompt_player_b": "Test Prompt B"
                },
                {
                    "game_id": 1,
                    "first_letter": "a",
                    "n_turns": 6,
                    "prompt_player_a": "Test Prompt A",
                    "prompt_player_b": "Test Prompt B"
                },
                {
                    "game_id": 2,
                    "first_letter": "c",
                    "n_turns": 3,
                    "prompt_player_a": "Test Prompt A",
                    "prompt_player_b": "Test Prompt B"
                },
                {
                    "game_id": 3,
                    "first_letter": "a",
                    "n_turns": 6,
                    "prompt_player_a": "Test Prompt A",
                    "prompt_player_b": "Test Prompt B"
                }
            ]
        }
    ]
}

# Creating the Game
- `master.py`: Implement a class that inherits from **GameMaster** to:
    - Define how the game is played
    - Enforce the rules
    - Log actions for later evaluation

### Player Class
- The roles of Player A and Player B are symmetric in this case, so we can instantiate both with the same class.
- `_custom_response()`: Useful in two cases: when the player really is a program, or when testing the game without making API calls. In both cases, set `model_name = "programmatic"`.
- The class includes a list that stores the dialogue history.

In [None]:
# save the contents of this cell as games/firstlast/players.py

import random
from string import ascii_lowercase as letters
from typing import List

from clemgame.clemgame import Player


class Speaker(Player):
    def __init__(self, model_name: str, player: str, letter: str):
        # always initialise the Player class with the model_name argument
        # if the player is a program and you don't want to make API calls to
        # LLMS, use model_name="programmatic"
        super().__init__(model_name)
        self.player: str = player
        self.initial_letter: str = letter

        # a list to keep the dialogue history
        self.history: List = []

    # implement this method as you prefer, with these same arguments
    def _custom_response(self, messages, turn_idx) -> str:
        """Return a mock message with the suitable letter and format."""
        # get the first letter of the content of the last message
        # messages is a list of dictionaries with messages in openai API format
        if turn_idx == 1 and self.player == 'A':
            letter = 'I SAY: ' + self.initial_letter
        else:
            previous_letter = messages[-1]['content'][7].lower()
            # introduce a small probability that the player fails
            letter = self._sample_letter(previous_letter)
        # return a string whose first and last tokens start with the next letter     
        return f"{letter}xxx from {self.player}, turn {turn_idx} {letter.replace('I SAY: ', '')}xxx."

    # an additional method specific for this game
    # for testing, we want the utterances to be invalid or incorrect sometimes
    def _sample_letter(self, letter: str) -> str:
        """Randomly decide which letter to use in the message."""
        prob = random.random() 
        index = letters.index(letter)
        if prob < 0.05:
            # correct but invalid (no tag)
            return letters[index + 1]
        if prob < 0.1:
            # valid tag but wrong letter
            return 'I SAY: ' + letter
        # valid and correct
        return 'I SAY: ' + letters[index + 1]

## GameMaster Class
- The class inherits from **GameMaster** and implements `setup()` and `play()` to create and run **episodes**, plus `compute_scores()` to calculate the metrics for the evaluation.
- Scores are computed after the game is done, using the separate **score** command of the CLI script.
- Metrics that every game must compute are defined in `clemgame/metrics.py`.

#### Initialisation
- First, define how to initialise the **GameMaster**.
- The `__init__` method gets the experiment object and a list of player names as strings.
    - Any needed attributes should be initialised here. 

In [None]:
import copy
from typing import List, Dict, Tuple
from string import ascii_lowercase as letters

import numpy as np

import clemgame.metrics as ms
from clemgame.clemgame import GameMaster, GameBenchmark
from clemgame import get_logger

from games.firstlast.players import Speaker
from games.firstlast.instancegenerator import GAME_NAME

class FirstLast(GameMaster):
    """Implement mechanisms for playing FirstLast."""
    def __init__(self, experiment: Dict, player_backends: List[str]):
        super().__init__(GAME_NAME, experiment, player_backends)

        # save experiment and player attributes that will be necessary later
        self.topic = experiment['name']
        self.model_a = player_backends[0]
        self.model_b = player_backends[1]

        # initialise attributes that will be used for the evaluation scores
        self.aborted: bool = False
        self.lose: bool = False
        self.complete_turns: int = 0

## Logging
The **GameRecorder** class has built-in methods for logging, some of the most important:
- At the beginning of every turn, call `log_next_turn()`.
- In the Game Setup, call `log_players()` to save the models.
- Use `log_event()` to log all types of actions with `from_` and `to`.
- The `action` Object passed to `log_event()` must contain a key `type` and a key `content`.
    - `type`: Type of message, like send/get msg, error, parse, metadata, etc.
    - `content`: The actual content of the message.
- Use only values `Player 1`, `Player 2`, `GM`, for the `from_` and `to` arguments.
    - `GM` in `from_` and `to`: Anything that GM emits.
- All events that involve making an API call should pass an additional `call` argument to `log_event()` containing API input/output.
- Besides events, the GM also has to compute and log scores, both at episode level and at turn level.
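
To make the shape of these calls concrete, here is a sketch with a hypothetical stand-in class, `SketchRecorder`. It only mimics the calls described above (`log_next_turn()` and `log_event()` with `from_`, `to`, `action` and an optional `call`); it is not the framework's **GameRecorder**, and the `type` strings are illustrative values.

```python
from typing import Any, Dict, List, Optional

ALLOWED = {'GM', 'Player 1', 'Player 2'}


class SketchRecorder:
    """Hypothetical stand-in that mimics the logging calls listed above."""

    def __init__(self) -> None:
        self.turns: List[List[Dict[str, Any]]] = []

    def log_next_turn(self) -> None:
        # open a new, empty list of events for the upcoming turn
        self.turns.append([])

    def log_event(self, from_: str, to: str, action: Dict[str, Any],
                  call: Optional[Any] = None) -> None:
        # enforce the conventions: known senders and a well-formed action
        assert from_ in ALLOWED and to in ALLOWED
        assert {'type', 'content'} <= set(action)
        event = {'from': from_, 'to': to, 'action': action}
        if call is not None:
            event['call'] = call  # API input/output, when an API was called
        self.turns[-1].append(event)


recorder = SketchRecorder()
recorder.log_next_turn()
recorder.log_event('GM', 'Player 1',
                   {'type': 'send message', 'content': 'PROMPT_A'})
recorder.log_event('Player 1', 'GM',
                   {'type': 'get message', 'content': 'I SAY: hi ... help'},
                   call=('api input', 'api output'))
recorder.log_event('GM', 'GM', {'type': 'parse', 'content': 'hi ... help'})
print(len(recorder.turns[0]))  # 3
```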

#### Setup
The `setup()` method receives every key of the instance dictionary defined above as a keyword argument (`key=value`). <br>
- In the example, it:
    - instantiates both players, each with an empty dialogue history
    - sets the initial turn index
    - sets the initial letter
    - initialises the request counters used for the common metrics

In [None]:
    def setup(self, first_letter: str, n_turns: int, prompt_player_a: str,
              prompt_player_b: str, game_id: int) -> None:
        """Setup the episode (mandatory)."""

        self.n_turns = n_turns

        # instantiate both players
        self.player_a = Speaker(self.model_a, 'A', first_letter)
        self.player_b = Speaker(self.model_b, 'B', first_letter)

        # initialise game variables
        self.current_turn: int = 0
        self.current_letter: str = first_letter

        # initialise common metrics
        self.request_counts = [0] * (n_turns + 1)
        self.parsed_request_counts = [0] * (n_turns + 1)
        self.violated_request_counts = [0] * (n_turns + 1)

        # add initial prompts to each player's messages
        self.initiate(prompt_player_a, prompt_player_b)

        # always log the details of the players in this format (see logdoc)
        self.log_players({
            'GM': 'Game master for FirstLast',
            'Player 1': f'Player A: {self.model_a}',
            'Player 2': f'Player B: {self.model_b}'
            })

        # log any additional keys that will be relevant for evaluation
        self.log_key('n_turns', n_turns)

#### Auxiliary Methods
The `play()` method is broken down into several helper methods:
- `proceed()`
- `update_letter()`
- `_append_utterance()`
- `parse()`
- `check_correctness()`
- `log_eval_assets()`

All of these are defined in `tutorial_first_last/master.py`.


#### Defining a turn
1. Send the initial prompt to player A
2. Get player A's response
3. Check if response can be parsed -> if not: abort or loop
4. Check `GAME_RULE`
5. Repeat for Player B
6. Log all events in-between all steps

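- The dialogue history has to be built from each player's own perspective:
    - Messages produced by the player: **assistant**
    - Messages received by the player: **user** 
    - The roles are swapped between Player A and Player B: an utterance stored as **assistant** for one player is stored as **user** for the other.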
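
The role swap can be sketched with two plain lists of messages in the OpenAI-style format (`role` / `content`). The function `record_utterance` is a hypothetical helper for this illustration; in the tutorial code, this job is done by `_append_utterance()`.

```python
from typing import Dict, List

Message = Dict[str, str]

# each player starts with their own initial prompt, received as a user message
history_a: List[Message] = [{'role': 'user', 'content': 'PROMPT_A'}]
history_b: List[Message] = [{'role': 'user', 'content': 'PROMPT_B'}]


def record_utterance(speaker: List[Message], listener: List[Message],
                     utterance: str) -> None:
    """Store one utterance in both histories with swapped roles."""
    # the speaker produced the message, the listener received it
    speaker.append({'role': 'assistant', 'content': utterance})
    listener.append({'role': 'user', 'content': utterance})


# one full turn: A speaks first, then B replies
record_utterance(history_a, history_b, 'I SAY: hi ... help')
record_utterance(history_b, history_a, 'I SAY: I ... it')

print([m['role'] for m in history_a])  # ['user', 'assistant', 'user']
print([m['role'] for m in history_b])  # ['user', 'user', 'assistant']
```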

## Computing scores for the evaluation
- During the game play, some attributes kept track of counters that are used for the evaluation, but we have not computed all evaluation scores yet. <br>
- This is done by the mandatory `compute_scores()` method. <br>
- It gets the full `interaction.json` file as input and must compute and log both turn-level and episode-level scores. <br>
- This is a separate step which does not occur in the same runtime as the game play.  <br>
- Therefore, all relevant information should get saved into `interaction.json` and accessed again by `compute_scores()` when scoring. <br>
- `Important:` If the game is aborted, all episode-level scores must be set to `numpy.nan`; turn-level scores can still be computed for the valid turns that were played before the game was aborted.
- All games must compute `METRIC_ABORTED`, the binary `METRIC_SUCCESS`, and its `BENCH_SCORE` (0=fail, 100=success)
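
A minimal sketch of the episode-level logic, under stated assumptions: the dictionary keys are placeholders rather than the constants from `clemgame/metrics.py`, the aborted flag itself is kept at 1 while the remaining scores become NaN, and `math.nan` is used so that the snippet runs without NumPy (it is the same float value as `numpy.nan`).

```python
import math
from typing import Dict


def episode_scores(aborted: bool, lose: bool) -> Dict[str, float]:
    """Compute placeholder episode-level scores from the final game state."""
    if aborted:
        # an aborted game has no meaningful success or main score
        return {'aborted': 1, 'success': math.nan, 'bench_score': math.nan}
    success = 0 if lose else 1
    # binary main score: 0 = fail, 100 = success
    return {'aborted': 0, 'success': success, 'bench_score': 100 * success}


print(episode_scores(aborted=False, lose=False))
# {'aborted': 0, 'success': 1, 'bench_score': 100}
```

In the real `compute_scores()`, these flags would be read back from `interaction.json`, since scoring runs separately from the game play.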

## GameBenchmark
- Final step that "informs" the framework about the game.
- You have to implement a child class of `GameBenchmark`, which defines whether the game is single-player or not.
- Implementation can be found in `games/tutorial_first_last/master.py`

# Results
After adding all these files together, we can run the game by `python3 scripts/cli.py run -g tutorial_first_last -m gpt-4o-2024-08-06`. <br>
We get the following results:
```
2024-09-02 19:43:50,427 - benchmark.run - INFO - Run experiment 1 of 4: dogs
Playing games: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:07<00:00,  1.28it/s]
2024-09-02 19:43:58,267 - benchmark.run - INFO - Run experiment 2 of 4: cats
Playing games: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:07<00:00,  1.40it/s]
2024-09-02 19:44:05,414 - benchmark.run - INFO - Run experiment 3 of 4: birds
Playing games: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:07<00:00,  1.33it/s]
2024-09-02 19:44:12,915 - benchmark.run - INFO - Run experiment 4 of 4: trees
Playing games: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:08<00:00,  1.24it/s]
```

All four experiments ran to completion, which means that the execution was successful. 

# Transcribing Results
Now that the interactions and all important metrics have been saved to the `results` folder, we can create a transcription for a better overview of the game.
- `python3 scripts/cli.py transcribe -g tutorial_first_last` <br>
This produces a readable version of each `episode`.

# Scores
Scoring also runs to completion, but the log reports 10 exceptions for each experiment:
```
2024-09-02 20:13:08,347 - benchmark.run - INFO - Score game 1 of 1: tutorial_first_last
2024-09-02 20:13:08,347 - benchmark.run - INFO - Scoring: trees
Scoring episodes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 3409.45it/s]
2024-09-02 20:13:08,352 - benchmark.run - ERROR - tutorial_first_last: '10' exceptions occurred: See clembench.log for details.
2024-09-02 20:13:08,352 - benchmark.run - INFO - Scoring: birds
Scoring episodes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 4146.21it/s]
2024-09-02 20:13:08,355 - benchmark.run - ERROR - tutorial_first_last: '10' exceptions occurred: See clembench.log for details.
2024-09-02 20:13:08,355 - benchmark.run - INFO - Scoring: cats
Scoring episodes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 4822.15it/s]
2024-09-02 20:13:08,357 - benchmark.run - ERROR - tutorial_first_last: '10' exceptions occurred: See clembench.log for details.
2024-09-02 20:13:08,357 - benchmark.run - INFO - Scoring: dogs
Scoring episodes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 4728.64it/s]
2024-09-02 20:13:08,360 - benchmark.run - ERROR - tutorial_first_last: '10' exceptions occurred: See clembench.log for details.

```

We still need to look into why the exceptions occurred, but overall the game runs, and this can be a good base for implementing our own. 