In [11]:
import sys
from string import Template

In [12]:
sys.path.append('/Users/das/work/local/Gits/2023/clembench')

In [13]:
sys.path.append('/Users/philippsadler/Opts/Git/clembench')

# Prototyping Dialogue Games with and for `clemgame`

This notebook demonstrates how the package can be used to prototype a game: play around with prompts, define the MOVE RULEs and the GAME RULEs (see below), and set up the game loop.

## Setting up the model

In [14]:
from backends import ModelSpec, Model, get_model_for, load_model_registry

In [15]:
load_model_registry() # load default model registry from backends folder

In [16]:
THIS_MODEL = 'gpt-3.5-turbo-1106' # fails if no registered entry

In [17]:
THIS_MODEL = 'gpt-4-1106-preview' # fails if no registered entry

In [18]:
THIS_MODEL = dict(model_id="gpt-3.5-turbo-1106", backend="openai") # works without registered entry when openai_api.py backend is available

In [19]:
THIS_MODEL = ModelSpec(model_name="gpt-3.5-turbo-1106", backend="openai") # works without registered entry when openai_api.py backend is available

In [20]:
lmm: Model = get_model_for(THIS_MODEL)

In [21]:
lmm.set_gen_args(temperature=0.0, max_tokens=100)

In [22]:
?lmm.generate_response

In [23]:
def add_message(context, msg, role='user', sysmsg = 'You are a helpful assistant'):
    if context == []:
        context = [
            {"role": "system", "content": sysmsg}
        ]
    context.append({"role": role, "content": msg})
    return context

In [24]:
add_message([], "hello, how are you?")

[{'role': 'system', 'content': 'You are a helpful assistant'},
 {'role': 'user', 'content': 'hello, how are you?'}]

Let us try a query:

In [25]:
prompt = add_message([], "What is benchmarking?")
_, resp, resp_text = lmm.generate_response(prompt)
resp_text

"Benchmarking is the process of comparing and measuring an organization's performance, processes, products, or services against those of other organizations that are considered to be leaders in the industry. The goal of benchmarking is to identify best practices, improve performance, and gain a competitive advantage by learning from the successes of others. Benchmarking can be applied to various aspects of an organization, such as operations, customer service, financial performance, and more. It is a valuable tool for identifying areas for improvement and setting performance"

## Prototyping a game

To set up a game, we need two things. First, we need what we shall call a MOVE RULE. This (kind of) rule defines what *form* a valid game move must take. (Having this form enables us to parse the move and even to understand, in the first place, what the move was.) For example, if we were to play chess and required that moves follow the [UCI](https://en.wikipedia.org/wiki/Universal_Chess_Interface) notation, then `e2e4` would validate (because it is in that format), and `Nf3` would not (and neither would the text `I will move my knight to F3` and similar).

This is independent of the question of whether `e2e4` is a legal move in the current game state. These kinds of things are defined by the GAME RULEs. These rules determine what effect valid (= well-formed) moves have on the current game state: whether they are legal, and whether they update the game state, and if so, whether they perhaps even update the game into a goal state. We might also want to define other criteria that determine whether the game goes on or not, like a limit on the number of turns the game is played.

What is important to keep in mind when setting up a clemgame is that both move and game rules need to be checkable *programmatically* (by the game master that you need to implement). That is, checking them shouldn't require the kind of "intelligence" that you want to test in the first place. Anything that is formally definable should be OK.
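
To make "checkable programmatically" concrete for the chess example, here is a minimal sketch of a form check for UCI-style moves. The name `is_uci_move` is made up for this illustration, and the pattern is an assumption that only covers ordinary moves with an optional promotion piece. It checks form only, not legality.

```python
import re

# Assumed pattern: from-square, to-square (file a-h, rank 1-8), optional promotion piece.
UCI_MOVE = re.compile(r'[a-h][1-8][a-h][1-8][qrbn]?')

def is_uci_move(text):
    """Return True if text has the form of a UCI move (form only, not legality)."""
    return UCI_MOVE.fullmatch(text) is not None

print(is_uci_move('e2e4'))                         # True
print(is_uci_move('Nf3'))                          # False
print(is_uci_move('I will move my knight to F3'))  # False
```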

Let's try a very simple game. A and B are supposed to have a conversation, but each turn must begin with a letter such that, taken together, the first letters of the turns follow the order of the alphabet from a given starting letter. So if that starting letter is 'h', the first utterance starts with h, the second with i, and so on. This is the GAME RULE. If any player violates that rule, the game ends as a failure. If, however, the rule has not been violated for $n$ turns, it's a success. (The MOVE RULE is "start your utterance with 'I SAY:'". If the output of the player does not have this form, it is rejected. This can lead to a re-prompt (explaining the move rule again), or it can lead to the game being abandoned.)

(This is not a terribly good game, as it does not require the integration of information across very many turns -- you just need to look at the previous turn to know which letter you are supposed to use. We might like to impose the constraint that the turns need to stick to a certain topic -- and indeed we will do so in the prompts below -- but that isn't something that we can test. Or at least not easily, and we won't attempt to here. But the game shall suffice to explain the basic concepts, and demonstrate the process of prototyping a game idea.)
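
As a small illustration of the GAME RULE, the letters that the turns are expected to start with can be listed up front. The helper name `expected_letters` is hypothetical, and the sketch assumes that the starting letter and the number of turns are chosen so that the sequence does not run past 'z'.

```python
def expected_letters(start, n_turns):
    """Letters that turns 1..n_turns must start with, given the starting letter."""
    return [chr(ord(start) + i) for i in range(n_turns)]

print(expected_letters('h', 4))  # ['h', 'i', 'j', 'k']
```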

#### The initial prompts

Let's first try to find a good initial prompt for player A, who kicks things off. (Here it took me a couple of iterations to find the following prompt. We'll use that to continue.)

In [26]:
init_prompt_A = Template('''Let us play a game. I will tell you a topic and you will give me a short sentence that fits to this topic.
But I will also tell you a letter. The sentence that you give me has to start with this letter.
After you have answered, I will give you my reply, which will start with the letter following your letter in the alphabet.
Then it is your turn again, to produce a reply starting with the letter following that one. And so on. Let's go.
Start your utterance with I SAY: and do no produce any other text.
The topic is: $topic
The letter is: $letter''')

In [27]:
topic = 'birds'
letter = 'h'
init_prompt_A.substitute(topic=topic, letter=letter)

"Let us play a game. I will tell you a topic and you will give me a short sentence that fits to this topic.\nBut I will also tell you a letter. The sentence that you give me has to start with this letter.\nAfter you have answered, I will give you my reply, which will start with the letter following your letter in the alphabet.\nThen it is your turn again, to produce a reply starting with the letter following that one. And so on. Let's go.\nStart your utterance with I SAY: and do no produce any other text.\nThe topic is: birds\nThe letter is: h"

Let us see what the model makes of this:

In [29]:
prompt_A, resp, resp_text = lmm.generate_response(add_message([], init_prompt_A.substitute(topic='birds', letter='h')))
resp_text

'I SAY: Hummingbirds are fascinating creatures with their rapid wing beats.'

Not too bad! Let's code the MOVE RULE (and use it to get at the text itself):

In [30]:
prefix = 'I SAY: '
def parse_reply(text, prefix=prefix):
    if not text.startswith(prefix):
        return False
    return text[len(prefix):]

In [31]:
parse_reply(resp_text)

'Hummingbirds are fascinating creatures with their rapid wing beats.'

Now we need the GAME RULE:

In [32]:
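def check_move(text, letter):
    # GAME RULE: the first word of the utterance must start with the expected letter.
    tokens = text.split()
    # Guard against empty text, which would otherwise raise an IndexError.
    if tokens and tokens[0][0].lower() == letter:
        return True
    return False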

In [33]:
check_move(parse_reply(resp_text), letter)

True

Alright! Now for player B, who needs a slightly different prompt (as they *continue* the conversation):

In [34]:
init_prompt_B = Template('''Let us play a game. I will give you a sentence.
The first word in my sentence starts with a certain letter.
I want you to give me a sentence as a reply, with the same topic as my sentence, but different from my sentence.
The first word of your sentence should start with the next letter in the alphabet from the letter my sentence started with.
Let us try to have a whole conversation like that.
Please start your reply with 'I SAY:' and do not produce any other text.
Let's go.
My sentence is: $sentence
What do you say?''')

In [35]:
sentence = parse_reply(resp_text)
prompt_B = init_prompt_B.substitute(sentence=sentence)
prompt_B

"Let us play a game. I will give you a sentence.\nThe first word in my sentence starts with a certain letter.\nI want you to give me a sentence as a reply, with the same topic as my sentence, but different from my sentence.\nThe first word of your sentence should start with the next letter in the alphabet from the letter my sentence started with.\nLet us try to have a whole conversation like that.\nPlease start your reply with 'I SAY:' and do not produce any other text.\nLet's go.\nMy sentence is: Hummingbirds are fascinating creatures with their rapid wing beats.\nWhat do you say?"

In [36]:
prompt_B, resp_B, resp_B_text = lmm.generate_response(add_message([], prompt_B))
resp_B_text

'I SAY: Insects play a crucial role in the ecosystem as pollinators and decomposers.'

In [37]:
parse_reply(resp_B_text)

'Insects play a crucial role in the ecosystem as pollinators and decomposers.'

Alright! Now we need to check this move, both for well-formedness and against the game rules.

Here, it would be helpful to already formalise the game state. (Actually, when you're just playing around with an idea, you can postpone this. I've only added this after iterating on this a couple of times to get a feel for what I need to keep track of.)

To determine whether a move met the game rule, we need to know what letter it was supposed to start with. If it met the rule, we advance the counter of successful moves, and also advance the expected letter. We also check whether we have reached the maximal number of turns. If so, the game is a win. If however the move did not meet the rule, we end the game as a failure. 

If the attempted move didn't even validate, we abort the game.

In [38]:
class InitialsGameState():
    def __init__(self, letter):
        self.letter = letter
        self.n_moves = 0
        self.success = False
        self.aborted = False

Before A's move, the game state was:

In [39]:
this_game = InitialsGameState('h')
this_game.letter, this_game.n_moves

('h', 0)

A's move was successful, so we should increment the letter that we are expecting now. We can define a method for that on the class:

In [40]:
class InitialsGameState():
    def __init__(self, letter):
        self.letter = letter
        self.n_moves = 0
        self.success = False
        self.aborted = False

    def increment_state(self):
        self.letter = chr(ord(self.letter) + 1)
        self.n_moves += 1
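
One detail to be aware of: `chr(ord('z') + 1)` is `'{'`, so this increment runs out of the alphabet after 'z'. The game description above does not say what should happen at that point. If wrapping around to 'a' were wanted (an assumption, not something decided here), a hypothetical helper could look like this:

```python
import string

def next_letter(letter):
    """Successor of a lowercase letter, wrapping from 'z' back to 'a' (assumed behaviour)."""
    alphabet = string.ascii_lowercase
    return alphabet[(alphabet.index(letter) + 1) % len(alphabet)]

print(next_letter('h'))  # i
print(next_letter('z'))  # a
```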

(Since we've redefined the state class, we need to recreate the state as it was before A's move and then apply A's successful move:)

In [41]:
this_game = InitialsGameState('h')
this_game.increment_state()
this_game.letter, this_game.n_moves

('i', 1)

In [42]:
check_move(parse_reply(resp_B_text), letter=this_game.letter)

True

(By now it should be clear that parsing and validating could be methods of the game state, but we'll skip that for now...)

In [43]:
this_game.increment_state()

In [44]:
this_game.letter

'j'

#### The Game Loop

Not too bad. From here on, the game consists of giving A the turn (with all previous history), parsing and validating the response, then giving B the turn, parsing and validating the response, and so on. The loop breaks if a) an unparseable move was attempted (more correctly: no understandable move was made), b) a losing move was made (wrong initial), or c) the maximum number of turns has been reached. We are not going to implement that here. We can simulate the loop by going back to A's cell below, and break whenever we run out of patience.
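
For orientation only, here is a rough sketch of what such a loop could look like. It is not the `clemgame` implementation: the names `play` and `scripted` are made up, the players are scripted stand-ins rather than model calls, and no message history is passed along. The three exits correspond to the cases a), b), and c) above.

```python
PREFIX = 'I SAY: '

def parse_reply(text, prefix=PREFIX):
    # MOVE RULE: the reply must start with the prefix (None if it does not).
    return text[len(prefix):] if text.startswith(prefix) else None

def check_move(text, letter):
    # GAME RULE: the first word must start with the expected letter.
    tokens = text.split()
    return bool(tokens) and tokens[0][0].lower() == letter

def play(players, start_letter, max_turns):
    """Alternate between players until the game is aborted, lost, or won.

    Each player is a callable taking the previous utterance (or None)
    and returning a raw reply; here they stand in for model calls.
    """
    letter, last = start_letter, None
    for turn in range(max_turns):
        raw = players[turn % len(players)](last)
        reply = parse_reply(raw)
        if reply is None:
            return 'ABORTED', turn      # a) no well-formed move
        if not check_move(reply, letter):
            return 'LOST', turn         # b) wrong initial
        letter, last = chr(ord(letter) + 1), reply
    return 'SUCCESS', max_turns         # c) turn limit reached without a violation

def scripted(replies):
    """Build a stand-in player that returns the given replies one by one."""
    it = iter(replies)
    return lambda last: next(it)

player_A = scripted(['I SAY: Hummingbirds hover.', 'I SAY: Jays are clever.'])
player_B = scripted(['I SAY: Ibises wade.', 'Kites soar.'])
print(play([player_A, player_B], 'h', max_turns=4))  # ('ABORTED', 3)
```

In the scripted run, B's second reply lacks the `I SAY: ` prefix, so the game is aborted on the fourth turn (index 3).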

In [45]:
next_prompt_A = add_message(prompt_A, resp_text, role='assistant')
next_prompt_A = add_message(next_prompt_A, resp_B_text, role='user')
next_prompt_A

[{'role': 'system', 'content': 'You are a helpful assistant'},
 {'role': 'user',
  'content': "Let us play a game. I will tell you a topic and you will give me a short sentence that fits to this topic.\nBut I will also tell you a letter. The sentence that you give me has to start with this letter.\nAfter you have answered, I will give you my reply, which will start with the letter following your letter in the alphabet.\nThen it is your turn again, to produce a reply starting with the letter following that one. And so on. Let's go.\nStart your utterance with I SAY: and do no produce any other text.\nThe topic is: birds\nThe letter is: h"},
 {'role': 'assistant',
  'content': 'I SAY: Hummingbirds are fascinating creatures with their rapid wing beats.'},
 {'role': 'user',
  'content': 'I SAY: Insects play a crucial role in the ecosystem as pollinators and decomposers.'}]

In [46]:
prompt_A, resp, resp_text = lmm.generate_response(next_prompt_A)
resp_text

'Now it\'s your turn to reply with a sentence starting with the letter "J" on the topic of birds.'

In [47]:
psd_reply = parse_reply(resp_text)
if psd_reply:
    if check_move(psd_reply, letter=this_game.letter):
        print('YAY')
    else:
        print('LOST')
else:
    print('NOT WELL-FORMED')

NOT WELL-FORMED


If we get a reply that is not well-formed, we could add a re-prompting loop (for which we'd need to design a prompt) and try for a couple of iterations to see whether we can prise a well-formed one out of the model... Or we could immediately break out here...
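
A minimal sketch of such a re-prompting loop, under stated assumptions: `ask` is a hypothetical callable that sends a message and returns the raw reply (standing in for a model call), and the wording of `REMINDER` is invented here and would need its own prompt iterations.

```python
PREFIX = 'I SAY: '
# Assumed wording for the reminder; not a tested prompt.
REMINDER = "Please start your reply with 'I SAY:' and do not produce any other text."

def parse_or_none(text, prefix=PREFIX):
    """MOVE RULE check that returns None (rather than False) on failure."""
    return text[len(prefix):] if text.startswith(prefix) else None

def ask_with_reprompt(ask, prompt, max_tries=3):
    """Call `ask` up to max_tries times; return (parsed reply or None, tries used)."""
    message = prompt
    for tries in range(1, max_tries + 1):
        parsed = parse_or_none(ask(message))
        if parsed is not None:
            return parsed, tries
        message = REMINDER  # explain the move rule again
    return None, max_tries

# Scripted stand-in for a model that needs one reminder
replies = iter(["Now it's your turn.", 'I SAY: Jays are clever.'])
print(ask_with_reprompt(lambda message: next(replies), 'first prompt'))
# ('Jays are clever.', 2)
```

If `None` comes back, the game master can abort the game, as described in the rules above.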

In [48]:
this_game.increment_state()

In [49]:
next_prompt_B = add_message(prompt_B, resp_B_text, role='assistant')
next_prompt_B = add_message(next_prompt_B, resp_text, role='user')
next_prompt_B

[{'role': 'system', 'content': 'You are a helpful assistant'},
 {'role': 'user',
  'content': "Let us play a game. I will give you a sentence.\nThe first word in my sentence starts with a certain letter.\nI want you to give me a sentence as a reply, with the same topic as my sentence, but different from my sentence.\nThe first word of your sentence should start with the next letter in the alphabet from the letter my sentence started with.\nLet us try to have a whole conversation like that.\nPlease start your reply with 'I SAY:' and do not produce any other text.\nLet's go.\nMy sentence is: Hummingbirds are fascinating creatures with their rapid wing beats.\nWhat do you say?"},
 {'role': 'assistant',
  'content': 'I SAY: Insects play a crucial role in the ecosystem as pollinators and decomposers.'},
 {'role': 'user',
  'content': 'Now it\'s your turn to reply with a sentence starting with the letter "J" on the topic of birds.'}]

In [50]:
prompt_B, resp_B, resp_B_text = lmm.generate_response(next_prompt_B)
resp_B_text

'I SAY: Jays are known for their striking blue feathers and their cleverness in finding food.'

In [51]:
psd_reply = parse_reply(resp_B_text)
if psd_reply:
    if check_move(psd_reply, letter=this_game.letter):
        print('YAY')
    else:
        print('LOST')
else:
    print('NOT WELL-FORMED')

LOST
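
The LOST verdict here follows from the order of operations: `this_game.increment_state()` was called after A's reply that was not well-formed, so the expected letter had already moved from 'j' to 'k' when B's "Jays ..." sentence was checked. If the intent is to advance the state only after an accepted move, a turn handler could be sketched as follows (the names `GameState` and `handle_turn` are hypothetical):

```python
class GameState:
    def __init__(self, letter):
        self.letter = letter
        self.n_moves = 0

def handle_turn(state, raw, prefix='I SAY: '):
    """Judge one raw reply; advance the state only if the move is accepted."""
    if not raw.startswith(prefix):
        return 'NOT WELL-FORMED'   # state unchanged
    words = raw[len(prefix):].split()
    if not words or words[0][0].lower() != state.letter:
        return 'LOST'              # state unchanged
    state.letter = chr(ord(state.letter) + 1)
    state.n_moves += 1
    return 'YAY'

state = GameState('j')
print(handle_turn(state, "Now it's your turn to reply."))  # NOT WELL-FORMED
print(handle_turn(state, 'I SAY: Jays are known for their striking blue feathers.'))  # YAY
print(state.letter, state.n_moves)  # k 1
```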


Ok. If we made it here, we can jump again to the beginning of the loop (see cell above), and just execute the cells again.

Good. Looks like this game has a chance of working with at least one model! (In fact, it looks like this game is too easy, and we'd need to think again about what it really was supposed to show...) 

To turn this into something that could be run as part of a proper benchmark, we'd now need to do a lot of work wrapping stuff around this, like running it repeatedly from different starting letters, logging everything properly, etc. But fear not! This is what the `clemgame` framework gives you, and what is described in the `howto_add_game_example.ipynb` notebook.

Remember: the purpose of this notebook was to show you what the *outcome* of a prototyping session might look like. The actual prototyping will involve trying out various prompts, coming to a better idea of what the game state is and how to represent it, etc. That is, it will look a lot messier. But it might not be a bad idea at the end of your session to clean up what you have, and prepare something that looks more like this notebook. This should form a good basis for then implementing the game properly in the framework.