# Table of Contents
The general Tic Tac Toe helper functions are kept in the accompanying file `TTT_functions.py` to keep this notebook shorter.
### 1. Q-learning implementation as for exercise sheet 3

### 2. Neural network

### 3. Tests

In [40]:
from TTT_functions import *

# Q-Learning

The opponent is always treated as part of the environment. That is, when an action $A$ is chosen in a state $S$, the successor state $S'$ is not yet uniquely determined. It is only fixed by the opponent's next action.

This also makes it possible to learn during the game, since the value of the state-action pair $(S,A)$ depends on the transition probability $S \to S'$.

Because this is done for both players at the same time, a single game updates the Q-table with information about all state-action pairs that occur in it, so the AI learns to play as the first and as the second player simultaneously.
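
As a minimal sketch of this bookkeeping (the helper name `transitions_from_moves` and the move sequence are made up for illustration), the following code replays a fixed list of moves and lists the transitions $(S, A, S')$ each player would learn from. $S'$ is always the board two plies after $S$, that is, after the opponent has replied.

```python
def transitions_from_moves(moves):
    """Replay a list of moves and return (S, A, S') triples,
    where S' is the board after the opponent's reply to A."""
    field = [0] * 9
    sign = 1                      # player 1 starts, player 2 follows
    states = [tuple(field)]
    for move in moves:
        field[move] = sign
        sign = sign % 2 + 1       # toggle between 1 and 2
        states.append(tuple(field))
    # the action taken in states[k] is moves[k]; its follow-up state is states[k + 2]
    return [(states[k], moves[k], states[k + 2]) for k in range(len(moves) - 1)]


for s, a, s_next in transitions_from_moves([4, 0, 8]):
    print(s, a, s_next)
```

The last move of a game has no follow-up state of this kind, which is why the training code handles it separately at the end of an episode.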

## Training algorithm

In [41]:
def train_q_learning(learning_rate, discount_factor, base_exploration_rate, num_episodes=10_000, reward_dict={"win":1, "loss":-1, "draw":0, "move":-0.05}):
    """
    play Tic Tac Toe [num_episodes] times to learn using Q-Learning with the given learning rate and discount_factor.
    inputs:
        learning_rate - (float) in range [0,1] - alpha
        discount_factor - (float) in range [0,1] - gamma
        base_exploration_rate - (float) - the starting exploration rate, also used as the per-episode decay factor
        num_episodes - (int) - number of episodes for training (must be an int, since it is passed to range())
        reward_dict - (dict) - a dictionary specifying the rewards for winning, losing, draw and non-terminal moves
            -> must have keys "win", "loss", "draw", "move"
    returns:
        (dict) - the Q-table after training with the given parameters
        (dict) - how often each state-action pair was updated
        (list) - the state history of every played episode
    """
    Q_table = dict()    # assign values to every visited state-action pair
    N_table = dict()    # counting how often each state-action pair was visited
    action_dict = dict()    # save the possible actions for each state

    games = []
    exploration_rate = base_exploration_rate
    for n in range(num_episodes):
        # play episode
        state_hist = play_episode(Q_table, action_dict, exploration_rate, discount_factor, learning_rate, reward_dict, N_table)
        exploration_rate *= base_exploration_rate
        games.append(state_hist)

    print("final exploration rate:", exploration_rate)
    
    return Q_table, N_table, games
        

## Simulieren einer Episode

In [42]:
def play_episode(Q_table, action_dict, exploration_rate, discount_factor=0.95, learning_rate=0.1, reward_dict={"win":1, "loss":-1, "draw":0, "move":-0.05}, N_table=None):
    """
    self-play an entire episode
    returns:
        (list) - state history
    
    Q_table, N_table and action_dict are changed in-place
    """
    if N_table is None:
        N_table = dict()    # a dict() default would be shared between calls
    field = [0 for _ in range(9)]
    sign = 1
    action_history = []
    state_history = []
    while True:
        state = tuple(field)
        state_history.append(state)
        # get possible actions
        try:
            actions = action_dict[state]
        except KeyError:
            actions = get_actions(field)
            action_dict[state] = actions

        if len(state_history) > 2: 
            # we know the state that resulted from the last action
            update_q_table(Q_table, state_history, action_history, actions, discount_factor, learning_rate, reward_dict, N_table)

        if len(actions) == 0:
            break # game has ended

        action = choose_Q_action(state, actions, Q_table, exploration_rate=exploration_rate)
        action_history.append(action)
        field[action] = sign
        sign = sign%2 + 1 # toggle sign between 1 and 2

    last_state = state_history[-2]
    last_action = action_history[-1]
    if (last_state, last_action) not in Q_table:
        Q_table[(last_state, last_action)] = 0
        N_table[(last_state, last_action)] = 0
    # the last move ended the game: it only earns the win reward if there is a winner,
    # a full board without a winner is a draw (game_ended returns 0 in that case)
    final_reward = reward_dict["draw"] if game_ended(field, get_winner=True) == 0 else reward_dict["win"]
    Q_table[(last_state, last_action)] += learning_rate*(final_reward - Q_table[(last_state, last_action)])
    N_table[(last_state, last_action)] += 1
    
    return state_history


def update_q_table(Q_table, state_history, action_history, actions, discount_factor, learning_rate, reward_dict, N_table):
    """
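    update the Q-value of the move made two plies ago (the previous move of the player
    whose turn it is now), using the state that resulted after the opponent's reply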
    returns:
        None
    """
    prev_state = state_history[-3] # S = state
    prev_action = action_history[-2] # A = action
    state = state_history[-1] # S' = next state after action A

    reward = get_reward(list(state), actions, reward_dict) # R = Reward
    next_rewards = [] # Q(S', a') for all actions a'
    for action in actions:
        try:
            next_rewards.append(Q_table[(state, action)])
        except KeyError:
            Q_table[(state, action)] = 0
            N_table[(state, action)] = 0
            next_rewards.append(0)

    if (prev_state, prev_action) not in Q_table:
        Q_table[(prev_state, prev_action)] = 0
        N_table[(prev_state, prev_action)] = 0
    # Q(S,A) += alpha*(R + gamma * max_a' Q(S', a') - Q(S,A))
    Q_table[(prev_state, prev_action)] += learning_rate*(reward + discount_factor * max(next_rewards, default=0) - Q_table[(prev_state, prev_action)])
    N_table[(prev_state, prev_action)] += 1


def get_reward(field, actions, reward_dict):
    """
    return the reward for the given field and possible actions
    """
    if len(actions) > 0:
        reward = reward_dict["move"]
    else:
        winner = game_ended(field, get_winner=True)
        if winner == 0: #draw
            reward = reward_dict["draw"]
        else:
            reward = reward_dict["loss"]
    return reward
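
To make the update rule in `update_q_table` concrete, here is a small self-contained sketch of a single update step. The function `q_update`, the state labels and the numbers are illustrative assumptions and not part of the notebook's code; unseen state-action pairs are treated as 0, as above.

```python
def q_update(q_table, state, action, reward, next_state, next_actions,
             learning_rate, discount_factor):
    """One tabular Q-learning step:
    Q(S,A) += alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))."""
    best_next = max((q_table.get((next_state, a), 0.0) for a in next_actions),
                    default=0.0)  # terminal states have no actions -> 0
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + learning_rate * (
        reward + discount_factor * best_next - old)
    return q_table[(state, action)]


q = {("s1", 3): 0.5}              # one known value in the follow-up state
new_value = q_update(q, "s0", 4, reward=-0.05, next_state="s1",
                     next_actions=[3, 5], learning_rate=0.1,
                     discount_factor=0.95)
print(new_value)                  # 0.1 * (-0.05 + 0.95 * 0.5), about 0.0425
```

With a small learning rate such as 0.1, each visit moves the stored value only a tenth of the way towards the new estimate, so many episodes are needed before the values settle.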

## Choosing an action
Actions are chosen with an $\varepsilon$-greedy strategy. Here $\varepsilon$ is the probability of exploration, i.e. of choosing a random action instead of the currently best one.

In [43]:
import random
def choose_Q_action(state, actions, Q_table, exploration_rate=0):
    """
    choose an action based on the possible actions, the current Q-table and the current exploration rate
    inputs:
    -------
        state - (tuple) or (list) - the state as a tuple or list
        actions (tuple) or (list) - all possible actions in the given state
        Q_table - (dict) - dictionary storing all known Q-values
        exploration_rate - (float) in [0,1] - probability of choosing exploration rather than exploitation
    """
    r = random.random()
    if r > exploration_rate:
        # print("exploit", r, exploration_rate)
        # exploit knowledge
        action_values = []
        for action in actions:
            try:
                action_values.append(Q_table[(state,action)])
            except KeyError:
                action_values.append(0)
        max_value = max(action_values)
        best_actions = []
        for action, value in zip(actions, action_values):
            if value == max_value:
                best_actions.append(action)
        # return random action with maximum expected reward
        return random.choice(best_actions)
    # explore environment through random move
    return random.choice(actions)

## Applying the Q-learning algorithm

In [11]:
learning_rate = 0.01
discount_factor = 0.95
num_episodes = int(1e5)
exploration_rate = 1-(5/num_episodes)   # starting rate and per-episode decay factor
reward_dict = {"win":1,      # reward for win
               "loss":-1,    # reward for loss
               "draw":0,     # reward for draw
               "move":-0.05} # reward per non-terminal move

%time Q_table, N_table, games = train_q_learning(learning_rate, discount_factor, exploration_rate, num_episodes=num_episodes, reward_dict=reward_dict)

final exploration rate: 0.00673676792504109
Wall time: 7.66 s
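
The final exploration rate printed above follows from the multiplicative decay in `train_q_learning`: the rate starts at the base value $b$ and is multiplied by $b$ once per episode, so after $N$ episodes it equals $b^{N+1}$. A short check with the values used here (the variable names are chosen for this sketch only):

```python
import math

num_episodes = int(1e5)
base = 1 - 5 / num_episodes       # initial rate and per-episode decay factor

rate = base
for _ in range(num_episodes):     # same update as in train_q_learning
    rate *= base

closed_form = base ** (num_episodes + 1)
print(rate, closed_form)          # both are close to exp(-5), roughly 0.0067
assert math.isclose(rate, closed_form, rel_tol=1e-9)
```

With $b = 1 - 5/N$ and large $N$ the rate ends near $e^{-5}$, so early episodes are almost purely random while late episodes are almost purely greedy.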


## Saving the Q-table

In [44]:
def export_Q_table(Q_table, filename="Q_table.txt"):
    """
    write the given Q-table into a file
    """
    with open(filename, "w") as file:
        file.write("Q_table = {\n")
        for key, value in Q_table.items():
            file.write(str(key) + ":" + str(value) + ",\n")
        file.write("}")

In [45]:
export_Q_table(Q_table)

# Neural network
### Idea:
We use the previously generated Q-table to train a neural network that approximates the Q-function.

To do this, the network receives a state-action pair as input and returns a value ($\in \mathbb{R}$).

## Preparing the training data

In [70]:
import numpy as np
def prepare_data(Q_table):
    """
    prepare the data given in a Q-table for training the neural network
    """
    training_inputs = []
    training_outputs = []
    for state_action, value in Q_table.items():
        # input_data   =      state_info       +    action_info
        training_inputs.append( list(state_action[0]) + [state_action[1]/4] )
        training_outputs.append( (value+1)/3 ) # maps Q-values from [-1,1] into [0,2/3], so the target lies inside [0,1]
    return np.array(training_inputs), np.array(training_outputs)
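
To illustrate the encoding used by `prepare_data`, the sketch below encodes one state-action pair by hand and maps a network output back to the Q-value scale. The helpers `encode`, `scale_target` and `unscale_output` are hypothetical names introduced here; only the formulas are taken from the function above.

```python
def encode(state, action):
    """10 inputs: the 9 board cells followed by the action index scaled by 1/4."""
    return list(state) + [action / 4]


def scale_target(q_value):
    """Training target as in prepare_data: (Q + 1) / 3."""
    return (q_value + 1) / 3


def unscale_output(y):
    """Inverse of scale_target: recover the Q-value from a network output."""
    return 3 * y - 1


x = encode((0, 0, 0, 0, 1, 0, 0, 0, 0), 2)
y = scale_target(0.3)
print(x)                          # nine board cells, then 2/4 = 0.5
print(y, unscale_output(y))       # scaled target and the recovered Q-value
```

Because the scaling is strictly increasing, the action with the largest network output is also the action with the largest Q-value, so the outputs can be compared directly without unscaling.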

## Initializing the network

In [71]:
from tensorflow.keras import layers
import tensorflow.keras as keras

def create_ttt_network(hidden_layers):
    model = keras.Sequential()
    model.add( keras.Input(shape=(10,)) ) # input layer - 10 Nodes
    for size in hidden_layers:
        # no activation is passed, so every Dense layer is linear (Keras default: activation=None)
        model.add( layers.Dense(size) )
    model.add( layers.Dense(1) )   # output layer - 1 Node

    model.compile(optimizer='adam', loss='mean_squared_error')
    return model
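
Note that `layers.Dense(size)` is called without an `activation` argument, which in Keras means a linear layer. A stack of linear layers can only represent a single affine map of the input. The plain-Python sketch below (weights and helper names are invented for illustration) shows this for two layers: applying them one after the other gives the same result as one layer with weights $W_2 W_1$ and bias $W_2 b_1 + b_2$.

```python
def dense(weights, bias, x):
    """Linear layer without activation: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def matmul(a, b):
    """Matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


W1, b1 = [[1, 2], [0, -1], [3, 1]], [1, 0, -2]   # 2 inputs -> 3 hidden units
W2, b2 = [[2, -1, 1]], [5]                       # 3 hidden units -> 1 output

x = [4, -3]
two_layers = dense(W2, b2, dense(W1, b1, x))

W = matmul(W2, W1)                               # combined weights W2 W1
b = [sum(w * v for w, v in zip(row, b1)) + c     # combined bias W2 b1 + b2
     for row, c in zip(W2, b2)]
one_layer = dense(W, b, x)

print(two_layers, one_layer)
assert two_layers == one_layer
```

Passing a non-linear activation such as `activation='relu'` to the hidden layers would be one way to let the network represent non-linear Q-functions; whether that improves play here would have to be tested.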

## Training the network

Here the network is trained on the data generated above. Instead of computing a new Q-table as above, a known one can also be loaded from a file.

In [72]:
def train_network(model, samples, labels, epochs=50, batch_size=32):
    # use_multiprocessing only applies to generator/Sequence inputs and is not accepted by newer Keras versions
    return model.fit(samples, labels, epochs=epochs, batch_size=batch_size)

def import_Q_table(filename="Q_table.txt"):
    """
    import Q_table as dictionary:
    keys are state-action pairs as a tuple of a tuple (9 integers: 0/1/2) and an integer (0-8)
    values are the corresponding Q-values

    Example:
        Q_table[((0,0,0,0,1,0,0,0,0),2)] -> 0.3
    """
    Q_table = dict()
    with open(filename, "r") as file:
        for line in file.readlines():
            if line == "Q_table = {\n" or line == "}":
                continue
            state_action, value = line[:-2].split(":")
            state = tuple([int(x.strip(" ")) for x in state_action[2:-5].split(",")])
            action = int(state_action[-2])
            Q_table[(state, action)] = float(value)
    return Q_table
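
The slicing in `import_Q_table` depends on the exact text layout written by `export_Q_table` (for instance on the action being a single digit). As an alternative sketch, a line can be parsed with `ast.literal_eval` from the standard library; `parse_q_line` is a hypothetical helper and not part of the notebook.

```python
import ast


def parse_q_line(line):
    """Parse one line such as '((0, 0, ..., 0), 2):0.3,' into ((state, action), value).
    Returns None for lines that do not hold an entry (header and closing brace)."""
    text = line.strip().rstrip(",")
    if not text.startswith("(("):
        return None
    key_text, _, value_text = text.rpartition(":")
    state, action = ast.literal_eval(key_text)   # safely evaluates the tuple literal
    return (tuple(state), int(action)), float(value_text)


print(parse_q_line("((0, 0, 0, 0, 1, 0, 0, 0, 0), 2):0.3,\n"))
print(parse_q_line("Q_table = {\n"))
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so a modified file cannot run arbitrary code while it is being loaded.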

#### Load Q-table from File

In [73]:
# optionally generate data
Q_table = import_Q_table(filename="Q_table.txt")

#### Train the network

In [78]:
# prepare data
inputs, outputs = prepare_data(Q_table)
print(f"using {inputs.shape[0]} test values")
# create model
model = create_ttt_network([10,5])
# train network with given data
history = train_network(model, inputs, outputs, epochs=15, batch_size=15)

[[0.   0.   0.   0.   0.   1.   2.   0.   0.   0.  ]
 [0.   0.   0.   0.   0.   1.   2.   0.   0.   0.25]
 [0.   0.   0.   0.   0.   1.   2.   0.   0.   0.5 ]
 [0.   0.   0.   0.   0.   1.   2.   0.   0.   0.75]
 [0.   0.   0.   0.   0.   1.   2.   0.   0.   1.  ]
 [0.   0.   0.   0.   0.   1.   2.   0.   0.   1.75]
 [0.   0.   0.   0.   0.   1.   2.   0.   0.   2.  ]
 [0.   0.   0.   0.   0.   0.   0.   0.   0.   1.25]
 [1.   0.   0.   0.   0.   1.   2.   0.   0.   0.25]
 [1.   0.   0.   0.   0.   1.   2.   0.   0.   0.5 ]]
[0.33014686 0.32935545 0.33005992 0.33016879 0.33011688 0.32884704
 0.33011004 0.31533826 0.33235608 0.33234883]
using 16164 test values
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


# Choosing an action

In [75]:
import random
def choose_NN_action(state, actions, model, exploration_rate=0):
    """
    choose an action based on the possible actions, the given neural network (model) and the current exploration rate
    """
    r = random.random()
    if r > exploration_rate:
        # exploit knowledge
        # evaluate all candidate actions in one batch (same encoding as in prepare_data)
        candidates = np.array([list(state) + [action/4] for action in actions])
        action_values = model.predict(candidates, verbose=0)[:, 0]
        print(list(zip(actions, action_values)))
        max_value = max(action_values)
        best_actions = []
        for action, value in zip(actions, action_values):
            if value == max_value:
                best_actions.append(action)
        # return random action with maximum expected reward
        return random.choice(best_actions)
    # explore environment through random move
    return random.choice(actions)

# Playing: human vs. AI
Before a game can be played, a few helper functions have to be defined:

In [76]:
def play_AI(Q_table, network):
    choose_action = get_ai_function(Q_table, network)
    start_player = ""
    while start_player not in ["me", "ai"]:
        # normalize the input so that "AI" or "Me" are handled like "ai" and "me" below
        start_player = input("Who starts? (me/ai)\n").strip().lower()
    
    field = [0 for _ in range(9)]
    sign = 1
    print_field(field)

    playing = True
    while playing:
        actions = get_actions(field)
        if len(actions) == 0:
            break
        if start_player == "ai":
            action = choose_action(tuple(field), actions, exploration_rate=0)
        else:
            action = get_human_action(actions)
            if action == "end":
                print("game interrupted")
                return
        field[action] = sign
        sign = sign%2 + 1
        print_field(field)
        start_player = "ai" if start_player == "me" else "me"
        playing = print_winner(field)


def print_winner(field):
    """
    if the given field is in a terminal state, print the winner.
    returns:
        (bool) - True if the game is still running, False if it has ended
    """
    winner = game_ended(field, get_winner=True)
    playing = True
    if winner is not None:
        playing = False
        if winner == 0:
            print("draw!")
        elif winner == 1:
            print("'o' won!")
        else:
            print("'x' won!")
    return playing


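def get_human_action(actions):
    """
    get user input for the next action
    returns the chosen action (int) or "end" if the user wants to stop
    """
    while True:
        user_input = input(f"Choose your action ({str(actions)[1:-1]})\n")
        if user_input.strip().lower() == "end":
            return "end"
        try:
            action = int(user_input)
        except ValueError:
            continue  # not a number, ask again
        if action in actions:
            return action  # only accept moves that are actually possible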


def get_ai_function(Q_table, network):
    """
    returns a function that chooses an AI-move based on a given field
    inputs:
        Q_table - (dict) - dictionary with Q_values
        network - (keras.model) - keras model
    returns:
        (function) - a function that chooses an action.
            arguments: (state, actions, exploration_rate=0)
    """
    user_input = ""
    while not user_input.upper() in ["NN", "Q"]:
        user_input = input("What AI shall be the opponent?\n(Neural Network -> NN or Q-Learning -> Q)\n")

    if user_input.upper() == "NN":
        def choose_ai_action(state, actions, exploration_rate=0):
            return choose_NN_action(state, actions, network, exploration_rate=exploration_rate)
    else:
        def choose_ai_action(state, actions, exploration_rate=0):
            return choose_Q_action(state, actions, Q_table, exploration_rate=exploration_rate)
    return choose_ai_action

In [77]:
play_AI(Q_table, model)

-------------
|   |   |   |
-------------
|   |   |   |
-------------
|   |   |   |
-------------
[(0, 0.30547655), (1, 0.3070523), (2, 0.30862808), (3, 0.31020388), (4, 0.31177965), (5, 0.31335545), (6, 0.3149312), (7, 0.31650695), (8, 0.31808275)]
-------------
|   |   |   |
-------------
|   |   |   |
-------------
|   |   | o |
-------------
-------------
|   |   |   |
-------------
|   | x |   |
-------------
|   |   | o |
-------------
[(0, 0.30680954), (1, 0.3083853), (2, 0.30996114), (3, 0.31153685), (5, 0.31468844), (6, 0.3162642), (7, 0.31784004)]
-------------
|   |   |   |
-------------
|   | x |   |
-------------
|   | o | o |
-------------
-------------
|   |   |   |
-------------
|   | x |   |
-------------
| x | o | o |
-------------
[(0, 0.33672395), (1, 0.33829987), (2, 0.33987558), (3, 0.34145138), (5, 0.34460282)]
-------------
|   |   |   |
-------------
|   | x | o |
-------------
| x | o | o |
-------------
-------------
|   |   | x |
-------------
|   | x | o |
