![Py4Eng](img/logo.png)

# Reinforcement learning
## Yoav Ram

Reinforcement learning is a semi-supervised learning method in which we don't have a target _per se_, but can still evaluate the value of the model's predictions, even if only after the fact.

In many examples of reinforcement learning, the model is trained to play a game.
In this spirit, we'll train a model to play the simplest thinking game of all: tic-tac-toe, or X-O.

![X-O](https://duckduckgo.com/i/60ac44b2.png)

There are many far more sophisticated uses of RL, for example (click an image to follow its link):

[![pong](http://karpathy.github.io/assets/rl/pong.gif)](http://karpathy.github.io/2016/05/31/rl/)

[![catch](https://edersantana.github.io/articles/keras_rl/catch.gif)](https://edersantana.github.io/articles/keras_rl/)

[![jumping](https://www.cs.ubc.ca/~van/papers/2016-TOG-deepRL/dog_teaser.png)](https://www.cs.ubc.ca/~van/papers/2016-TOG-deepRL/index.html)

[![AlphaGo](https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/FloorGoban.JPG/300px-FloorGoban.JPG)](https://deepmind.com/research/alphago/)


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import itertools
import time
import pickle
from collections import defaultdict
from IPython.display import HTML
import keras
print('Keras', keras.__version__)
from keras import backend as K
print('GPU:', K.tensorflow_backend._get_available_gpus())

Using TensorFlow backend.


Keras 2.2.4
GPU: []


# Basic game elements

We will model a board as an array of 18 elements.
The first 9 are for player x and the last 9 are for player o.
Each group of 9 elements represents a 3x3 board for that player in row-major order, with ones where the player put her mark (x or o) and zeros where she didn't.
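
To make the encoding concrete, here is a small sketch that converts a board written as three strings into the 18-element array and back. The helper names `encode` and `decode` are illustrative assumptions and are not part of the notebook.

```python
import numpy as np

def encode(grid):
    """Turn three 3-character strings into the 18-element board array."""
    cells = ''.join(grid)                 # row-major string of 9 characters
    board = np.zeros(18, dtype=int)
    for i, c in enumerate(cells):
        if c == 'x':
            board[i] = 1                  # first half: player x
        elif c == 'o':
            board[9 + i] = 1              # second half: player o
    return board

def decode(board):
    """Inverse of encode: return the board as three 3-character strings."""
    marks = board[:9] + 2 * board[9:]     # 0 = empty, 1 = x, 2 = o
    cells = ''.join(' xo'[m] for m in marks)
    return [cells[0:3], cells[3:6], cells[6:9]]

grid = ['   ',
        ' xo',
        '  x']
board = encode(grid)
print(board[:9])   # x plane: [0 0 0 0 1 0 0 0 1]
print(board[9:])   # o plane: [0 0 0 0 0 1 0 0 0]
```

Keeping one 9-element plane per player means the network sees x marks and o marks as separate binary features rather than as the values 1 and 2 in a single cell.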

The first function is for displaying a board.

In [2]:
int_to_str = {0: ' ', 1: 'x', 2: 'o'}
css = """
    width:2em;
    height:2em;
    border: 1px solid #000; 
    background: white;
    text-align: center;
    
"""
def display(board):
    board = (board[:9] + 2 * board[9:]).reshape((3, 3))
    table = "<table>"
    for i in range(3):
        table += "<tr>"
        for j in range(3):
            table += "<td style='{}'>{}</td>".format(css, int_to_str[board[i,j]])
        table += "</tr>"
    table += "</table>"
    return HTML(table)  # the table string is already fully formatted
    
board = np.zeros(18)
board[4] = 1
board[9 + 5] = 1
board[8] = 1
display(board)

| 0 | 1 | 2 |
|---|---|---|
|   |   |   |
|   | x | o |
|   |   | x |


The next function checks whether a move is legal: it must be an index from 0 to 8, and the target square must not already hold an x or an o mark.

In [3]:
def is_legal_move(board, move):
    return 0 <= move < 9 and board[move] == 0 and board[9 + move] == 0

In [4]:
board = np.zeros(18)
board[0] = 1
print(is_legal_move(board, 0), is_legal_move(board, -1))
display(board)

False False


| 0 | 1 | 2 |
|---|---|---|
| x |   |   |
|   |   |   |
|   |   |   |


Now a function that draws a random legal move.

In [5]:
def random_move(board):
    occupied = board[:9] + board[9:]    # nonzero where either player has a mark
    empty = np.where(occupied == 0)[0]  # indices of the free squares
    return np.random.choice(empty)

In [6]:
board = np.zeros(18)
board[random_move(board)] = 1
board[9 + random_move(board)] = 1
board[random_move(board)] = 1
display(board)

| 0 | 1 | 2 |
|---|---|---|
|   | x |   |
|   |   |   |
|   | o | x |


The next functions check if the board is full or if a specific player has won.

In [7]:
def is_full(board):
    return (board[:9] + board[9:]).all()

def is_winner(board, player):
    board = board[9:] if player else board[:9]
    board = board.reshape((3, 3))
    if (board[0,:].all() or 
        board[1,:].all() or 
        board[2,:].all() or 
        board[:,0].all() or 
        board[:,1].all() or 
        board[:,2].all()):
        return True
    elif board[0,0] and board[1,1] and board[2,2]:
        return True
    elif board[0,2] and board[1,1] and board[2,0]:
        return True
    return False

In [8]:
board = np.zeros(18)
player = 1
while not is_winner(board, player) and not is_full(board):
    player = (player + 1) % 2
    move = random_move(board)
    board[9*player + move] = 1    
display(board)

| 0 | 1 | 2 |
|---|---|---|
| o | x | x |
| o | o | x |
|   | x | o |


# Model

We use Keras to build a simple model with one hidden dense layer (128 ReLU units) and a softmax readout layer.

The model input is the current board as an array of 18 elements; the model output is a probability vector for the 9 possible moves (0-8).
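
To make the shapes concrete, here is a plain-NumPy sketch of the forward pass that such a network computes. The weights are random placeholders (an assumption for illustration), so the probabilities are meaningless; only the shapes and the softmax normalization matter here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random placeholder weights with the same shapes as the two dense layers.
W1, b1 = rng.normal(scale=0.1, size=(18, 128)), np.zeros(128)
W2, b2 = rng.normal(scale=0.1, size=(128, 9)), np.zeros(9)

def forward(boards):
    """Map a batch of boards, shape (n, 18), to move probabilities, shape (n, 9)."""
    hidden = np.maximum(0, boards @ W1 + b1)        # dense layer + ReLU
    logits = hidden @ W2 + b2                       # linear readout
    logits -= logits.max(axis=1, keepdims=True)     # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)     # softmax

board = np.zeros(18)
board[4] = 1                                        # x in the centre square
probs = forward(board.reshape(1, -1))
print(probs.shape)                                  # (1, 9)
print(probs.sum())                                  # sums to 1 up to rounding
```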

In [9]:
model = keras.models.Sequential()
model.add(keras.layers.Dense(128, input_shape=(18,), activation='relu'))
model.add(keras.layers.Dense(9))
model.add(keras.layers.Activation('softmax'))

We use the mean squared error as a loss function and a simple stochastic gradient descent optimizer with a learning rate of 0.2.

In [10]:
model.compile(keras.optimizers.SGD(lr=.2), "mse")

# Playing a game

The model will be player x, and player o will be implemented with naive random moves.

If the model wins, it gets a reward of 1. If it loses, the reward is -1. Ties give a reward of 0.

Note that although the model moves first, the naive player has an advantage: it never plays an illegal move, so at least it knows the rules.
The model, on the other hand, can play illegal moves and is penalized for them: it loses immediately (reward is -1) when it plays an illegal move.

During a game, sometimes (with probability $\epsilon$) we let the model choose a uniformly random move instead of sampling a move from its predicted probabilities. This adds some noise and helps the model get out of bad strategies. This strategy is called _exploration_.
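
A minimal sketch of this exploration step, assuming the model output is available as a length-9 probability vector. The function name `choose_move` and the example `pred` values are hypothetical. Note that even without exploration the move is sampled from the predicted distribution rather than taken as its argmax.

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_move(pred, epsilon):
    """Sample a move index (0-8) from pred, or uniformly with probability epsilon."""
    if rng.random() < epsilon:
        probs = np.ones(9) / 9             # explore: ignore the model
    else:
        probs = np.asarray(pred, dtype=float)
        probs = probs / probs.sum()        # guard against float32 rounding
    return int(rng.choice(9, p=probs))

# Hypothetical model output that strongly prefers the centre square (index 4).
pred = np.array([0.01, 0.01, 0.01, 0.01, 0.92, 0.01, 0.01, 0.01, 0.01])
moves = [choose_move(pred, epsilon=0.1) for _ in range(1000)]
print('centre chosen in {:.0%} of draws'.format(np.mean(np.array(moves) == 4)))
```

Renormalizing `probs` is a defensive choice: a softmax computed in 32-bit floats may not sum to exactly 1 once converted to 64-bit floats.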

During the game we save at each step the board, the prediction, and the move that the model chose, and in the last step we also save the reward.
That game "memory" is returned by the `play_game` function.
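
For orientation, this is roughly what such a memory looks like. The list below is written by hand for a hypothetical two-turn game that ends with an illegal move; it is not real output.

```python
import numpy as np

uniform = np.ones(9) / 9                 # stand-in for a model prediction

first = np.zeros(18, dtype=int)          # empty board before x's first move
second = first.copy()
second[4] = 1                            # x took the centre on turn 0
second[9 + 0] = 1                        # o replied in the top-left corner

# One dict per x turn; only the last turn carries the reward.
memory = [
    {'board': first, 'pred': uniform, 'move': 4},
    {'board': second, 'pred': uniform, 'move': 4, 'reward': -1},  # centre is taken: illegal
]

final_reward = memory[-1]['reward']
print(len(memory), final_reward)         # 2 -1
```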

In [11]:
def play_game(verbose=False, ϵ=0.1):
    board = np.zeros(18, dtype=int)
    memory = defaultdict(dict)

    for t in range(5):
        memory[t]['board'] = board.copy()

        if is_full(board):
            if verbose: print('board full')
            memory[t]['reward'] = 0
            break
            
        # player x
        if np.random.rand() < ϵ:
            pred = np.ones(9) / 9
        else:
            pred = model.predict(board.reshape(1,-1)).ravel() # predict gives (1,9)
        move = np.random.multinomial(1, pred).argmax()
        memory[t]['pred'] = pred
        memory[t]['move'] = move

        if not is_legal_move(board, move):
            if verbose: print('illegal move by player x:', move)
            memory[t]['reward'] = -1
            break

        board[move] = 1
        
        if is_winner(board, 0): 
            if verbose: print('player x wins')
            memory[t]['reward'] = 1
            break
            
        # player o
        if is_full(board):
            if verbose: print('board full')
            memory[t]['reward'] = 0
            break
            
        move = random_move(board)
        board[9 + move] = 1
        
        if is_winner(board, 1): 
            if verbose: print('player o wins')
            memory[t]['reward'] = -1
            break
    return [memory[k] for k in sorted(memory.keys())]

# Training

Before training, the model (player x) is pretty bad: it gets a negative average score and wins fewer than 10% of games.

In [12]:
scores = np.array([play_game(ϵ=0)[-1]['reward'] for _ in range(1000)])
print('Average score {:.2f}'.format(scores.mean()))
print('x Won: {:.2%}'.format((scores==1).mean()))

Average score -0.85
x Won: 7.30%


Training on a single game is done by stacking the game boards (one for every x turn) as inputs ($X$).

Each target row in $Y$ is an array of nine zeros, with the game's final reward at the index of the move the model chose on that turn.

So the model should try to increase the probability of moves it chose in games with a reward of 1 (wins) and decrease the probability of moves it chose in games with a reward of -1.
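
As a concrete illustration of these targets, the sketch below builds $Y$ for a hypothetical three-turn game and evaluates the mean squared error against a uniform prediction. The helper names `make_targets` and `mse` are assumptions for this example and are not used elsewhere in the notebook.

```python
import numpy as np

def make_targets(moves, reward):
    """One row per x turn: zeros, with the final reward at the chosen move."""
    Y = np.zeros((len(moves), 9))
    Y[np.arange(len(moves)), moves] = reward
    return Y

def mse(pred, target):
    """Mean squared error averaged over all entries."""
    return float(np.mean((pred - target) ** 2))

moves = [4, 0, 8]                        # hypothetical moves in a game won by x
Y = make_targets(moves, reward=1)
P = np.full((3, 9), 1 / 9)               # uniform prediction on every turn

print(Y[0])                              # [0. 0. 0. 0. 1. 0. 0. 0. 0.]
print(round(mse(P, Y), 4))               # 0.0988, i.e. 8/81
```

With the same uniform prediction, a lost game (reward -1) gives 4/27 ≈ 0.148, so under this loss a loss of the game is penalized more heavily than a win.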

In [13]:
def train_on_game(memory):
    X = []  # boards seen by player x, one per turn
    Y = []  # target vectors, one per turn
    reward = memory[-1]['reward']
    
    for turn in memory:
        board = turn['board']
        X.append(board)
        
        y = np.zeros(9)
        move = turn['move']
        y[move] = reward
        Y.append(y)

    X = np.array(X)
    Y = np.array(Y)
    return model.train_on_batch(X, Y)

Let's train the model.

In [14]:
num_of_games = 100000
tic = time.time()
for i in range(num_of_games):
    memory = play_game()
    loss = train_on_game(memory)
    if i % (num_of_games//10) == 0:
        toc = time.time()
        print("{}: {:.4f} seconds, loss={}".format(i, toc-tic, loss))
        tic = toc

0: 0.2767 seconds, loss=0.14926552772521973
10000: 17.0415 seconds, loss=0.14901472628116608
20000: 16.5169 seconds, loss=0.15238027274608612
30000: 16.6924 seconds, loss=0.15000814199447632
40000: 16.7493 seconds, loss=0.1500995010137558
50000: 16.8004 seconds, loss=0.14556358754634857
60000: 16.8227 seconds, loss=0.09441013634204865
70000: 16.7645 seconds, loss=0.09099067002534866
80000: 16.9373 seconds, loss=0.15189170837402344
90000: 16.9726 seconds, loss=0.09247449040412903


In [15]:
scores = np.array([play_game(ϵ=0)[-1]['reward'] for _ in range(1000)])
print('Average score {:.2f}'.format(scores.mean()))
print('x Won: {:.2%}'.format((scores==1).mean()))

Average score -0.35
x Won: 32.00%


After training on 100,000 games, the model still has a negative average score, but it wins ~32% of games rather than ~7%.

In [16]:
num_of_games = 400000
tic = time.time()
for i in range(num_of_games):
    memory = play_game()
    loss = train_on_game(memory)
    if i % (num_of_games//10) == 0:
        toc = time.time()
        print("{}: {:.4f} seconds, loss={}".format(i, toc-tic, loss))
        tic = toc

0: 0.0046 seconds, loss=0.15217572450637817
40000: 67.8320 seconds, loss=0.07331334054470062
80000: 68.0855 seconds, loss=0.1498451679944992
120000: 68.2618 seconds, loss=0.09146779030561447
160000: 67.2118 seconds, loss=0.06977465748786926
200000: 67.0176 seconds, loss=0.06808637827634811
240000: 68.8287 seconds, loss=0.06794822961091995
280000: 66.8273 seconds, loss=0.14753609895706177
320000: 66.7200 seconds, loss=0.1428232043981552
360000: 66.7117 seconds, loss=0.15468254685401917


After 500,000 games in total, it already wins about 67% of games and has a positive average score.

In [17]:
scores = np.array([play_game(ϵ=0)[-1]['reward'] for _ in range(1000)])
print('Average score {:.2f}'.format(scores.mean()))
print('x Won: {:.2%}'.format((scores==1).mean()))

Average score 0.34
x Won: 66.70%
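
The evaluation cells report only the average score and the win rate. A small sketch like the following also separates ties from losses; the function name `summarize` and the ten example rewards are made up to show the output format and are not results from the notebook.

```python
import numpy as np

def summarize(scores):
    """Fractions of wins (+1), ties (0) and losses (-1) in an array of rewards."""
    scores = np.asarray(scores)
    return {
        'win': float((scores == 1).mean()),
        'tie': float((scores == 0).mean()),
        'loss': float((scores == -1).mean()),
        'average': float(scores.mean()),
    }

# Made-up rewards for ten games, only to show the output format.
example = [1, 1, -1, 0, 1, -1, 1, 0, 1, -1]
summary = summarize(example)
print(summary)
```

Keep in mind that a reward of -1 covers both games won by player o and games ended by an illegal move, so the loss fraction alone does not tell the two apart.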


# Colophon
This notebook was written by [Yoav Ram](http://python.yoavram.com) and is part of the [_Deep Learning for Software Developers_](https://python.yoavram.com/Deep4Devs) course.

The notebook was written using [Python](http://python.org/) 3.6.3, [IPython](http://ipython.org/) 6.2.1, and [Jupyter](http://jupyter.org) 5.1.0.

This work is licensed under a CC BY-NC-SA 4.0 International License.

![Python logo](https://www.python.org/static/community_logos/python-logo.png)