![Py4Eng](img/logo.png)

# Reinforcement learning
## Yoav Ram

Reinforcement learning is a semi-supervised learning method in which we don't have a target _per se_, but can still evaluate the value of the model's predictions, even if only after the fact.

In many examples of reinforcement learning the model is trained to play a game.
In this spirit, we'll train a model to play the simplest thinking game of all: tic-tac-toe, or X-O.
This is similar to [teaching a pigeon to play checkers](https://www.youtube.com/watch?v=TYY3A06cgaY).

![X-O](https://duckduckgo.com/i/60ac44b2.png)

There are many more sophisticated examples and uses of RL, for example (click on an image to follow its link):

[![pong](http://karpathy.github.io/assets/rl/pong.gif)](http://karpathy.github.io/2016/05/31/rl/)

[![catch](https://edersantana.github.io/articles/keras_rl/catch.gif)](https://edersantana.github.io/articles/keras_rl/)

[![jumping](https://www.cs.ubc.ca/~van/papers/2016-TOG-deepRL/dog_teaser.png)](https://www.cs.ubc.ca/~van/papers/2016-TOG-deepRL/index.html)

[![AlphaGo](https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/FloorGoban.JPG/300px-FloorGoban.JPG)](https://deepmind.com/research/alphago/)


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import itertools
import time
import pickle
from collections import defaultdict
from IPython.display import HTML
import tensorflow as tf
try:
    import keras
except ModuleNotFoundError:
    from tensorflow import keras
print('GPU:', tf.test.is_gpu_available())
import seaborn as sns
sns.set(
    style='ticks',
    context='talk',
    palette='Set1'
)

# Basic game elements

We will model a board as an array of 18 elements.
The first 9 are for player x and the last 9 are for player o.
Each block of 9 elements represents a 3x3 board for that player, with ones where the player put her mark (x or o) and zeros where she didn't.
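
To make the encoding concrete, here is a minimal sketch that collapses the 18-element array into a single 3x3 grid with 0 for empty, 1 for x, and 2 for o. The helper name `to_grid` is hypothetical and used only for illustration.

```python
import numpy as np

def to_grid(board):
    """Collapse the 18-element encoding into a 3x3 grid: 0 empty, 1 x, 2 o.

    `to_grid` is a hypothetical helper used only for illustration.
    """
    board = np.asarray(board, dtype=int)
    return (board[:9] + 2 * board[9:]).reshape(3, 3)

board = np.zeros(18, dtype=int)
board[4] = 1        # x takes the center square (index 4 of x's half)
board[9 + 0] = 1    # o takes the top-left square (index 0 of o's half)
grid = to_grid(board)
print(grid)
```

Keeping one block of nine per player means each input element is a simple 0/1 flag, which is convenient as input to a neural network.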

The first function is for displaying a board.

In [2]:
int_to_str = {0: ' ', 1: 'x', 2: 'o'}
css = """
    width:2em;
    height:2em;
    border: 1px solid #000; 
    background: white;
    text-align: center;
    
"""
def display(board):
    board = (board[:9] + 2 * board[9:]).reshape((3, 3))
    table = "<table>"
    for i in range(3):
        table += "<tr>"
        for j in range(3):
            table += "<td style='{}'>{}</td>".format(css, int_to_str[board[i,j]])
        table += "</tr>"
    table += "</table>"
    return HTML(table)
    
board = np.zeros(18)
board[4] = 1
board[9 + 5] = 1
board[8] = 1
display(board)

|   |   |   |
|---|---|---|
|   |   |   |
|   | x | o |
|   |   | x |


The next function checks if a move is legal, i.e., that it targets one of the nine squares and that the square does not already hold an x or o mark.

In [3]:
def is_legal_move(board, move):
    return 0 <= move < 9 and board[move] == 0 and board[9 + move] == 0

In [4]:
board = np.zeros(18)
board[0] = 1
print(is_legal_move(board, 0), is_legal_move(board, -1))
display(board)

False False


|   |   |   |
|---|---|---|
| x |   |   |
|   |   |   |
|   |   |   |


Now a function that draws a **random legal move**.

In [5]:
def random_move(board):
    occupied = board[:9] + board[9:]      # nonzero where either player has a mark
    empty = np.where(occupied == 0)[0]    # indices of the free squares
    return np.random.choice(empty, 1)[0]

In [6]:
board = np.zeros(18)
board[random_move(board)] = 1
board[9 + random_move(board)] = 1
board[random_move(board)] = 1
display(board)

|   |   |   |
|---|---|---|
|   | x |   |
| x |   |   |
|   |   | o |


The next functions check if the board is full or if a specific player has won.

In [7]:
def is_full(board):
    return (board[:9] + board[9:]).all()

def is_winner(board, player):
    board = board[9:] if player else board[:9]
    board = board.reshape((3, 3))
    if (board[0,:].all() or 
        board[1,:].all() or 
        board[2,:].all() or 
        board[:,0].all() or 
        board[:,1].all() or 
        board[:,2].all()):
        return True
    elif board[0,0] and board[1,1] and board[2,2]:
        return True
    elif board[0,2] and board[1,1] and board[2,0]:
        return True
    return False

In [8]:
board = np.zeros(18)
player = 1
while not is_winner(board, player) and not is_full(board):
    player = (player + 1) % 2
    move = random_move(board)
    board[9*player + move] = 1    
display(board)

|   |   |   |
|---|---|---|
| o | o | x |
| x | x | o |
| x | x | o |
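
The win check can also be written as a lookup over the eight winning lines. This is a minimal sketch, assuming the same 18-element encoding; `WIN_LINES` and `is_winner_lines` are hypothetical names, not part of the notebook.

```python
import numpy as np

# The eight winning lines of a 3x3 board, as indices into a flat 9-element array.
WIN_LINES = np.array([
    [0, 1, 2], [3, 4, 5], [6, 7, 8],   # rows
    [0, 3, 6], [1, 4, 7], [2, 5, 8],   # columns
    [0, 4, 8], [2, 4, 6],              # diagonals
])

def is_winner_lines(board, player):
    """Return True if `player` (0 for x, 1 for o) fills any winning line."""
    marks = np.asarray(board)[9 * player: 9 * player + 9]
    # marks[WIN_LINES] has shape (8, 3): the three squares of each line.
    return bool(marks[WIN_LINES].all(axis=1).any())

board = np.zeros(18, dtype=int)
board[[2, 4, 6]] = 1          # x holds the anti-diagonal
board[[9 + 0, 9 + 1]] = 1     # o holds two squares in the top row
print(is_winner_lines(board, 0), is_winner_lines(board, 1))
```

Both versions test the same eight lines; the table form just makes them explicit as data.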


# Model

We use Keras to build a simple model with one hidden dense layer (128 ReLU units) and a softmax readout layer.

The model input is the current board as an array of 18 elements; the model output is a probability vector for the 9 possible moves (0-8).

In [10]:
model = keras.models.Sequential()
model.add(keras.layers.Dense(128, input_shape=(18,), activation='relu'))
model.add(keras.layers.Dense(9))
model.add(keras.layers.Activation('softmax'))
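
What these three layers compute can be sketched in plain NumPy. The weights below are random placeholders (an assumption for illustration only; the real values are learned by Keras), and the function name `forward` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder random weights; in the notebook these are learned by Keras.
W1, b1 = rng.normal(scale=0.1, size=(18, 128)), np.zeros(128)
W2, b2 = rng.normal(scale=0.1, size=(128, 9)), np.zeros(9)

def forward(board):
    """Dense(128, relu) -> Dense(9) -> softmax, for one 18-element board."""
    hidden = np.maximum(0, board @ W1 + b1)      # ReLU hidden layer
    logits = hidden @ W2 + b2                    # one score per square
    exp = np.exp(logits - logits.max())          # subtract max for numerical stability
    return exp / exp.sum()                       # softmax: probabilities over 9 moves

probs = forward(np.zeros(18))
print(probs.shape, probs.sum())
```

Whatever the weights are, the softmax guarantees nine non-negative values that sum to 1, so the output can be used directly as a distribution over moves.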

We use the **mean squared error** as a loss function and a simple stochastic gradient descent optimizer. The choice of mean squared error will become clear when we see how the targets ($y$) are constructed.

In [11]:
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.2), loss="mse")

# Playing a game

The model will be player x, and player o will be implemented with naive random moves.

If the model wins, it gets a reward of 1. If it loses, the reward is -1. Ties give a reward of 0.
These rewards are used to construct the targets ($y$): every move the model chose in a game that led to a win/loss/tie is reinforced with a reward of 1/-1/0 through the mean squared error loss function.

Note that at the beginning, the random player o has an advantage because it never plays an illegal move, so at least it knows the rules.
The model (player x), on the other hand, can play illegal moves and is penalized for them: it loses (reward of -1) whenever it plays an illegal move. This allows our model to learn the rules.

During a game, sometimes (with probability $\epsilon$) we let the model choose a move uniformly at random instead of the move it would choose using its prediction function. This adds some noise and helps get the model out of bad strategies. This strategy is called _exploration_.
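
The exploration step can be sketched in isolation as follows. This is a minimal sketch, not the notebook's code: `choose_move` is a hypothetical helper, the prediction vector is made up, and it samples with NumPy's `Generator.choice` rather than `np.random.multinomial`.

```python
import numpy as np

def choose_move(pred, epsilon, rng):
    """Sample a move index (0-8) from `pred`, or from a uniform distribution
    with probability `epsilon`. `choose_move` is a hypothetical helper."""
    if rng.random() < epsilon:                 # exploration
        probs = np.ones(9) / 9
    else:                                      # exploitation
        # Cast to float64 and renormalize: a safeguard (an assumption, not in the
        # notebook) against float32 rounding pushing the sum slightly above 1.
        probs = np.asarray(pred, dtype=np.float64)
        probs = probs / probs.sum()
    return int(rng.choice(9, p=probs))

rng = np.random.default_rng(42)
pred = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # made-up prediction
greedy = [choose_move(pred, 0.0, rng) for _ in range(100)]
explore = [choose_move(pred, 1.0, rng) for _ in range(1000)]
print(set(greedy), len(set(explore)))
```

With $\epsilon = 0$ the move always follows the prediction, while with $\epsilon = 1$ every square is tried, including squares the model would never pick on its own.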

At each step of the game we save the board, the prediction, and the move that the model chose, and in the last step we also save the reward.
That game "memory" is the result of the `play_game` function.
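
The structure of that memory can be pictured as a list of per-turn dictionaries. The values below are made up for illustration; the keys (`board`, `pred`, `move`, `reward`) are the ones the function stores.

```python
import numpy as np

uniform = np.ones(9) / 9                      # placeholder prediction
first = np.zeros(18, dtype=int)               # empty board before x's first move
second = first.copy()
second[4] = 1                                 # x played the center
second[9 + 0] = 1                             # o answered in the top-left corner

memory = [
    {'board': first, 'pred': uniform, 'move': 4},
    # x then picks square 0, which o already holds: an illegal move, reward -1
    {'board': second, 'pred': uniform, 'move': 0, 'reward': -1},
]
final_reward = memory[-1]['reward']           # only the last turn carries a reward
moves = [turn['move'] for turn in memory]
print(moves, final_reward)
```

Each board is stored as it was before the model moved, so it can later be paired with the move chosen from that position.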

In [13]:
def play_game(verbose=False, ϵ= 0.1):
    board = np.zeros(18, dtype=int)
    memory = []

    for i in range(5): # max number of turns is 5
        turn = dict()
        memory.append(turn)
        
        turn['board'] = board.copy()

        if is_full(board):
            if verbose: print('board full')
            turn['reward'] = 0
            break
            
        # player x
        if np.random.rand() < ϵ: # exploration
            pred = np.ones(9) / 9
        else: # exploitation
            # predict expects a 2D input and returns a 2D output
            pred = model.predict(board.reshape(1,-1)).ravel() 
        move = np.random.multinomial(1, pred).argmax() # draw random move from move distribution
        turn['pred'] = pred
        turn['move'] = move

        if not is_legal_move(board, move):
            if verbose: print('illegal move by player x:', move)
            turn['reward'] = -1
            break

        board[move] = 1
        
        if is_winner(board, 0): 
            if verbose: print('player x wins')
            turn['reward'] = 1
            break
            
        # player o
        if is_full(board):
            if verbose: print('board full')
            turn['reward'] = 0
            break
            
        move = random_move(board)
        board[9 + move] = 1
        
        if is_winner(board, 1): 
            if verbose: print('player o wins')
            turn['reward'] = -1
            break
    return memory

# Training

Before training, the model (player x) is pretty bad: it gets a negative average score and wins only 7%-9% of games.

In [14]:
def score(num_games=1000):
    scores = np.array([play_game(ϵ=0)[-1]['reward'] for _ in range(num_games)])
    return (scores==1).mean(), (scores==0).mean(), (scores==-1).mean()

wins, ties, losses = score(1000)
print("X won {:.2%} and lost {:.2%} of games".format(wins, losses))

X won 9.10% and lost 90.80% of games
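
Because the score is estimated from a finite number of random games, it fluctuates from run to run. A rough sketch of the sampling error, assuming the games are independent and using the standard error of a proportion, $\sqrt{p(1-p)/n}$ (the helper name is hypothetical):

```python
import math

def win_rate_standard_error(p, n):
    """Standard error of a proportion p estimated from n independent games."""
    return math.sqrt(p * (1 - p) / n)

se = win_rate_standard_error(0.091, 1000)   # 9.10% wins out of 1000 games
print("{:.2%} +/- {:.2%}".format(0.091, se))
```

With 1000 games the estimate is good to roughly one percentage point, so differences of a point or two between runs are expected.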


Training on a single game is done by stacking the game boards (one for every x turn) as inputs ($X$).

The target for each turn (a row of $Y$) is an array of nine zeros with the game's final reward at the index of the move the model chose.

So the model should try to increase the probability of choosing moves with reward of 1 (wins) and decrease the probability of choosing moves with reward -1.
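
A small worked example of how such a target interacts with the mean squared error. The prediction here is a made-up uniform vector, and `make_target` is a hypothetical helper that mirrors the construction described above.

```python
import numpy as np

def make_target(move, reward):
    """Target vector: zeros everywhere, the game's reward at the chosen move."""
    y = np.zeros(9)
    y[move] = reward
    return y

pred = np.full(9, 1 / 9)           # made-up prediction: uniform over the 9 moves
losses = {}
for reward in (1, 0, -1):
    y = make_target(move=4, reward=reward)
    losses[reward] = np.mean((y - pred) ** 2)
    print(reward, round(losses[reward], 4))
```

With a reward of 1 the loss shrinks as the probability of the chosen move grows. With a reward of -1 the target cannot be reached by a softmax output, so the loss is smallest when that move's probability is pushed toward 0.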

In [15]:
def train_on_game(memory):
    X = [] # boards
    Y = [] # rewards
    reward = memory[-1]['reward']
    
    for turn in memory:
        board = turn['board']
        X.append(board)
        
        y = np.zeros(9)
        move = turn['move']
        y[move] = reward
        Y.append(y)

    X = np.array(X)
    Y = np.array(Y)
    return model.train_on_batch(X, Y)

Let's train the model.

In [16]:
def train(num_of_games):
    # we define an "epoch" as a tenth of the games, so that we have 10 epochs overall
    epoch = max(1, num_of_games // 10)  # at least 1, to avoid a modulo by zero below
    loss = 0
    tic = time.time()
    for i in range(1, num_of_games + 1):
        memory = play_game()
        loss += train_on_game(memory) # sum losses since epoch started
        if i % epoch == 0:
            toc = time.time()
            # print elapsed time and average epoch loss
            print("{} games: {:.4f} seconds, loss={:.4f}".format(i, toc-tic, loss/epoch))
            tic = toc
            loss = 0

In [43]:
train(100000)

10000 games: 17.1543 seconds, loss=0.1431
20000 games: 17.2152 seconds, loss=0.1410
30000 games: 18.1610 seconds, loss=0.1399
40000 games: 17.8315 seconds, loss=0.1393
50000 games: 18.7205 seconds, loss=0.1385
60000 games: 20.0695 seconds, loss=0.1375
70000 games: 19.1501 seconds, loss=0.1364
80000 games: 19.2436 seconds, loss=0.1346
90000 games: 19.6842 seconds, loss=0.1332
100000 games: 17.8237 seconds, loss=0.1309


In [44]:
wins, ties, losses = score(1000)
print("X won {:.2%} and lost {:.2%} of games".format(wins, losses))

X won 32.30% and lost 67.20% of games


After training on 100,000 games, the model still has a negative score, but wins ~32% of games rather than ~9%.

Let's keep training; note that one epoch is a tenth of the games in a training run, so each epoch now covers 40,000 games instead of 10,000 and takes about four times longer.
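
As a rough check of that expectation, the per-epoch times logged for the first run can be scaled to the new epoch size. This is only back-of-the-envelope arithmetic on the numbers printed above, assuming the time per game stays about constant.

```python
# Per-epoch times (seconds) from the first training run, copied from the log above.
first_run_seconds = [17.1543, 17.2152, 18.1610, 17.8315, 18.7205,
                     20.0695, 19.1501, 19.2436, 19.6842, 17.8237]

games_per_epoch_first = 100000 // 10   # 10,000 games per epoch in the first run
games_per_epoch_next = 400000 // 10    # 40,000 games per epoch in the next run

seconds_per_game = sum(first_run_seconds) / (len(first_run_seconds) * games_per_epoch_first)
expected = seconds_per_game * games_per_epoch_next
print("{} games per epoch, about {:.0f} seconds each".format(games_per_epoch_next, expected))
```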

In [45]:
train(400000)

40000 games: 74.1880 seconds, loss=0.1241
80000 games: 78.1307 seconds, loss=0.1134
120000 games: 79.7379 seconds, loss=0.1091
160000 games: 77.0359 seconds, loss=0.1071
200000 games: 75.8316 seconds, loss=0.1059
240000 games: 70.6603 seconds, loss=0.1059
280000 games: 73.6886 seconds, loss=0.1047
320000 games: 72.2727 seconds, loss=0.1051
360000 games: 72.4834 seconds, loss=0.1048
400000 games: 71.6227 seconds, loss=0.1046


After 500,000 games in total, it's already winning ~66% of games.

In [47]:
wins, ties, losses = score(1000)
print("X won {:.2%} and lost {:.2%} of games".format(wins, losses))

X won 65.60% and lost 34.30% of games


# Colophon
This notebook was written by [Yoav Ram](http://python.yoavram.com).

The notebook was written using [Python](http://python.org/) 3.7.

This work is licensed under a [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) International License.

![Python logo](https://www.python.org/static/community_logos/python-logo.png)