## Reinforcement Learning Chess
Reinforcement Learning Chess is a series of notebooks where I implement Reinforcement Learning algorithms to develop a chess AI. I start off with simpler versions (environments) that can be tackled with simple methods and gradually expand on those concepts until I have a full-fledged chess AI.

[Notebook 1: Policy Iteration](https://www.kaggle.com/arjanso/reinforcement-learning-chess-1-policy-iteration)  
[Notebook 2: Model-free learning](https://www.kaggle.com/arjanso/reinforcement-learning-chess-2-model-free-methods)  
[Notebook 4: Policy Gradients](https://www.kaggle.com/arjanso/reinforcement-learning-chess-4-policy-gradients)

### Notebook III: Q-networks
In this notebook I implement a simplified version of chess named capture chess. In this environment the agent (playing white) is rewarded for capturing pieces (not for checkmate). After running this notebook, you end up with an agent that can capture pieces against a random opponent, as demonstrated in the gif below. The main difference between this notebook and the previous one is that I use Q-networks as an alternative to Q-tables. Q-tables are nice and straightforward, but can only contain a limited number of action values. Chess has a state space complexity of 10<sup>47</sup>. Needless to say, this is too much information to put in a Q-table. This is where supervised learning comes in. A Q-network can represent a generalized mapping from state to action values.

![](https://images.chesscomfiles.com/uploads/game-gifs/90px/green/neo/0/cc/0/0/aXFZUWpyN1Brc1BPbHQwS211WEhudkh6cXohMGFPMExPUTJNUTY4MDY1OTI1NFpSND8yOT85M1Y5MTA3MUxLQ3RDUkpDSjcwTE0wN293V0d6Rzc2cHhWTXJ6NlhzQVg0dUM0WGNNWDU,.gif)



#### Import and Install

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import inspect

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
!pip install chess

In [None]:
!pip install --upgrade git+https://github.com/arjangroen/RLC.git  # RLC is the Reinforcement Learning package

In [None]:
!pip list | grep keras

In [None]:
from tensorflow.keras.optimizers import SGD

In [None]:
%%writefile /opt/conda/lib/python3.7/site-packages/RLC/capture_chess/agent.py

from keras.models import Model, clone_model
from keras.layers import Input, Conv2D, Dense, Reshape, Dot, Activation, Multiply
from tensorflow.keras.optimizers import SGD
import numpy as np
import keras.backend as K


def policy_gradient_loss(Returns):
    def modified_crossentropy(action, action_probs):
        cost = (K.categorical_crossentropy(action, action_probs, from_logits=False, axis=1) * Returns)
        return K.mean(cost)

    return modified_crossentropy


class Agent(object):

    def __init__(self, gamma=0.5, network='linear', lr=0.01, verbose=0):
        """
        Agent that plays the white pieces in capture chess
        Args:
            gamma: float
                Temporal discount factor
            network: str
                'linear', 'conv' or 'conv_pg'
            lr: float
                Learning rate, ideally around 0.1
        """
        self.gamma = gamma
        self.network = network
        self.lr = lr
        self.verbose = verbose
        self.init_network()
        self.weight_memory = []
        self.long_term_mean = []

    def init_network(self):
        """
        Initialize the network
        Returns:

        """
        if self.network == 'linear':
            self.init_linear_network()
        elif self.network == 'conv':
            self.init_conv_network()
        elif self.network == 'conv_pg':
            self.init_conv_pg()

    def fix_model(self):
        """
        The fixed model is the model used for bootstrapping
        Returns:
        """
        optimizer = SGD(lr=self.lr, momentum=0.0, decay=0.0, nesterov=False)
        self.fixed_model = clone_model(self.model)
        self.fixed_model.compile(optimizer=optimizer, loss='mse', metrics=['mae'])
        self.fixed_model.set_weights(self.model.get_weights())

    def init_linear_network(self):
        """
        Initialize a linear neural network
        Returns:

        """
        optimizer = SGD(lr=self.lr, momentum=0.0, decay=0.0, nesterov=False)
        input_layer = Input(shape=(8, 8, 8), name='board_layer')
        reshape_input = Reshape((512,))(input_layer)
        output_layer = Dense(4096)(reshape_input)
        self.model = Model(inputs=[input_layer], outputs=[output_layer])
        self.model.compile(optimizer=optimizer, loss='mse', metrics=['mae'])

    def init_conv_network(self):
        """
        Initialize a convolutional neural network
        Returns:

        """
        optimizer = SGD(lr=self.lr, momentum=0.0, decay=0.0, nesterov=False)
        input_layer = Input(shape=(8, 8, 8), name='board_layer')
        inter_layer_1 = Conv2D(1, (1, 1), data_format="channels_first")(input_layer)  # 1,8,8
        inter_layer_2 = Conv2D(1, (1, 1), data_format="channels_first")(input_layer)  # 1,8,8
        flat_1 = Reshape(target_shape=(1, 64))(inter_layer_1)
        flat_2 = Reshape(target_shape=(1, 64))(inter_layer_2)
        output_dot_layer = Dot(axes=1)([flat_1, flat_2])
        output_layer = Reshape(target_shape=(4096,))(output_dot_layer)
        self.model = Model(inputs=[input_layer], outputs=[output_layer])
        self.model.compile(optimizer=optimizer, loss='mse', metrics=['mae'])

    def init_conv_pg(self):
        """
        Convnet net for policy gradients
        Returns:

        """
        optimizer = SGD(lr=self.lr, momentum=0.0, decay=0.0, nesterov=False)
        input_layer = Input(shape=(8, 8, 8), name='board_layer')
        R = Input(shape=(1,), name='Rewards')
        legal_moves = Input(shape=(4096,), name='legal_move_mask')
        inter_layer_1 = Conv2D(1, (1, 1), data_format="channels_first")(input_layer)  # 1,8,8
        inter_layer_2 = Conv2D(1, (1, 1), data_format="channels_first")(input_layer)  # 1,8,8
        flat_1 = Reshape(target_shape=(1, 64))(inter_layer_1)
        flat_2 = Reshape(target_shape=(1, 64))(inter_layer_2)
        output_dot_layer = Dot(axes=1)([flat_1, flat_2])
        output_layer = Reshape(target_shape=(4096,))(output_dot_layer)
        softmax_layer = Activation('softmax')(output_layer)
        legal_softmax_layer = Multiply()([legal_moves, softmax_layer])  # Select legal moves
        self.model = Model(inputs=[input_layer, R, legal_moves], outputs=[legal_softmax_layer])
        self.model.compile(optimizer=optimizer, loss=policy_gradient_loss(R))

    def network_update(self, minibatch):
        """
        Update the Q-network using samples from the minibatch
        Args:
            minibatch: list
                The minibatch contains the states, moves, rewards and new states.

        Returns:
            td_errors: np.array
                array of temporal difference errors

        """

        # Prepare separate lists
        states, moves, rewards, new_states = [], [], [], []
        td_errors = []
        episode_ends = []
        for sample in minibatch:
            states.append(sample[0])
            moves.append(sample[1])
            rewards.append(sample[2])
            new_states.append(sample[3])

            # Episode end detection
            if np.array_equal(sample[3], sample[3] * 0):
                episode_ends.append(0)
            else:
                episode_ends.append(1)

        # The Q target
        q_target = np.array(rewards) + np.array(episode_ends) * self.gamma * np.max(
            self.fixed_model.predict(np.stack(new_states, axis=0)), axis=1)

        # The Q value for the remaining actions
        q_state = self.model.predict(np.stack(states, axis=0))  # batch x 4096, reshaped to batch x 64 x 64 below

        # Combine the Q target with the other Q values.
        q_state = np.reshape(q_state, (len(minibatch), 64, 64))
        for idx, move in enumerate(moves):
            td_errors.append(q_state[idx, move[0], move[1]] - q_target[idx])
            q_state[idx, move[0], move[1]] = q_target[idx]
        q_state = np.reshape(q_state, (len(minibatch), 4096))

        # Perform a step of minibatch Gradient Descent.
        self.model.fit(x=np.stack(states, axis=0), y=q_state, epochs=1, verbose=0)

        return td_errors

    def get_action_values(self, state):
        """
        Get action values of a state
        Args:
            state: np.ndarray with shape (8,8,8)
                layer_board representation

        Returns:
            action values

        """
        return self.fixed_model.predict(state) + np.random.randn() * 1e-9

    def policy_gradient_update(self, states, actions, rewards, action_spaces, actor_critic=False):
        """
        Update parameters with Monte Carlo Policy Gradient algorithm
        Args:
            states: (list of tuples) state sequence in episode
            actions: action sequence in episode
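            rewards: rewards sequence in episode
            action_spaces: legal move masks for each step in the episode
            actor_critic: bool
                If True, rewards holds action values (n_steps x 4096) and the value
                of the chosen action replaces the Monte Carlo return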

        Returns:

        """
        n_steps = len(states)
        Returns = []
        targets = np.zeros((n_steps, 64, 64))
        for t in range(n_steps):
            action = actions[t]
            targets[t, action[0], action[1]] = 1
            if actor_critic:
                R = rewards[t, action[0] * 64 + action[1]]
            else:
                R = np.sum([r * self.gamma ** i for i, r in enumerate(rewards[t:])])
            Returns.append(R)

        if not actor_critic:
            mean_return = np.mean(Returns)
            self.long_term_mean.append(mean_return)
            train_returns = np.stack(Returns, axis=0) - np.mean(self.long_term_mean)
        else:
            train_returns = np.stack(Returns, axis=0)
        # print(train_returns.shape)
        targets = targets.reshape((n_steps, 4096))
        self.weight_memory.append(self.model.get_weights())
        self.model.fit(x=[np.stack(states, axis=0),
                          train_returns,
                          np.concatenate(action_spaces, axis=0)
                          ],
                       y=[np.stack(targets, axis=0)],
                       verbose=self.verbose
                       )

In [None]:
import chess
from chess.pgn import Game
import RLC

from RLC.capture_chess.environment import Board
from RLC.capture_chess.learn import Q_learning
from RLC.capture_chess.agent import Agent

### The environment: Capture Chess
In this notebook we'll upgrade our environment to one that behaves more like real chess. It is mostly based on the Board object from python-chess.
Some modifications are made to make it easier for the algorithm to converge:
* There is a maximum of 25 moves; after that the environment resets
* Our Agent only plays white
* The Black player is part of the environment and returns random moves
* The reward structure is not based on winning/losing/drawing but on capturing black pieces:
    - pawn capture: +1
    - knight capture: +3
    - bishop capture: +3
    - rook capture: +5
    - queen capture: +9
* Our state is represented by an 8x8x8 array
    - Plane 0 represents pawns
    - Plane 1 represents rooks
    - Plane 2 represents knights
    - Plane 3 represents bishops
    - Plane 4 represents queens
    - Plane 5 represents kings
    - Plane 6 represents 1/fullmove number (needed for the Markov property)
    - Plane 7 represents can-claim-draw
* White pieces have the value 1, black pieces have the value -1
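
The state encoding and reward table described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the RLC implementation, which builds on python-chess. The helper names `fen_to_layer_board` and `capture_reward` are hypothetical, the input is the piece-placement field of a FEN string, and I assume that row 0 of each plane corresponds to rank 1.

```python
import numpy as np

# Plane index per piece type, following the list above.
PIECE_PLANE = {"p": 0, "r": 1, "n": 2, "b": 3, "q": 4, "k": 5}
# Reward for capturing a black piece, following the list above.
CAPTURE_REWARD = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9}


def fen_to_layer_board(placement, fullmove_number=1, can_claim_draw=False):
    """Encode a FEN piece-placement string as an 8x8x8 array (hypothetical helper)."""
    layers = np.zeros((8, 8, 8))
    # FEN lists rank 8 first; assumption: row 0 of each plane is rank 1.
    for fen_row, rank in enumerate(placement.split("/")):
        row = 7 - fen_row
        col = 0
        for ch in rank:
            if ch.isdigit():
                col += int(ch)  # a digit means that many empty squares
            else:
                sign = 1 if ch.isupper() else -1  # white = 1, black = -1
                layers[PIECE_PLANE[ch.lower()], row, col] = sign
                col += 1
    layers[6, :, :] = 1 / fullmove_number
    layers[7, :, :] = 1 if can_claim_draw else 0
    return layers


def capture_reward(captured_symbol):
    """Reward for a move; None means that nothing was captured."""
    if captured_symbol is None:
        return 0
    return CAPTURE_REWARD.get(captured_symbol.lower(), 0)


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
layers = fen_to_layer_board(start)
print(layers[0, ::-1, :].astype(int))  # pawn plane, printed with rank 8 on top
```

Because every piece type has its own plane, the sign of an entry is enough to tell white from black.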
       


#### Board representation of python-chess:

In [None]:
board = Board()
board.board

#### Numerical representation of the pawns (layer 0)
Change the index of the first dimension to see the other pieces

In [None]:
board.layer_board[0,::-1,:].astype(int)

### The Agent
* The agent is no longer a single piece, it's a chess player
* Its action space consists of 64x64=4096 actions:
    * There are 8x8 = 64 squares from which a piece can be picked up
    * And another 64 squares on which a piece can be dropped.
* Of course, only certain actions are legal. Which actions are legal in a certain state is part of the environment (in RL, anything outside the control of the agent is considered part of the environment). We can use the python-chess package to select legal moves. (It seems that AlphaZero uses a similar approach https://ai.stackexchange.com/questions/7979/why-does-the-policy-network-in-alphazero-work)
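
To make the 64x64 action space concrete, here is a small sketch of how a (from square, to square) pair can be mapped to one of the 4096 outputs, and how illegal moves can be masked out before picking the greedy action. The function names are hypothetical and the masking with minus infinity is my own choice for illustration; the legal moves would normally come from python-chess, but here they are passed in as plain tuples so the example stays self-contained.

```python
import numpy as np


def move_to_index(from_square, to_square):
    """Flatten a (from, to) pair into an index in [0, 4096)."""
    return from_square * 64 + to_square


def index_to_move(index):
    """Inverse of move_to_index."""
    return divmod(index, 64)


def best_legal_move(action_values, legal_moves):
    """Pick the legal move with the highest action value."""
    av = np.asarray(action_values, dtype=float).reshape(64, 64)
    mask = np.zeros((64, 64), dtype=bool)
    for from_square, to_square in legal_moves:
        mask[from_square, to_square] = True
    masked = np.where(mask, av, -np.inf)  # illegal moves can never win the argmax
    return index_to_move(int(np.argmax(masked)))


values = np.zeros(4096)
values[move_to_index(12, 28)] = 5.0  # e2-e4, legal in this toy example
values[move_to_index(0, 63)] = 9.0   # a1-h8, high value but illegal
legal = [(12, 28), (12, 20)]
chosen = best_legal_move(values, legal)
```

The flat index `from_square * 64 + to_square` matches a row-major reshape between `(64, 64)` and `(4096,)`, which is the reshape the agent code uses.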

#### Implementation

In [None]:
board = Board()
agent = Agent(network='conv',gamma=0.1,lr=0.07)
R = Q_learning(agent,board)
R.agent.fix_model()
R.agent.model.summary()

In [None]:
print(inspect.getsource(agent.network_update))

#### Q learning with a Q-network
**Theory**
- The Q-network is usually either a linear regression or a (deep) neural network. 
- The input of the network is the state (S) and the output is the predicted action value of each Action (in our case, 4096 values). 
- The idea is similar to learning with Q-tables. We update our Q value in the direction of the discounted reward + the max successor state action value
- I used prioritized experience replay to de-correlate the updates. If you want to know more about it, check the link in the references
- I used fixed-Q targets to stabilize the learning process.
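
The update target and the replay sampling described above can be sketched with NumPy alone. This is a hedged illustration rather than the RLC code: `td_targets` mirrors the target used in `network_update` (reward plus discounted maximum action value from the fixed model, zeroed at episode end), while `sampling_probabilities` shows one common proportional prioritization scheme, where the small constant `eps` is an assumption of mine.

```python
import numpy as np


def td_targets(rewards, next_q_fixed, not_terminal, gamma):
    """Q-learning target: r + gamma * max_a Q_fixed(s', a), without bootstrap at episode end."""
    return rewards + not_terminal * gamma * next_q_fixed.max(axis=1)


def sampling_probabilities(td_errors, eps=1e-3):
    """Sample transitions in proportion to |TD error| (eps keeps every sample reachable)."""
    priorities = np.abs(td_errors) + eps
    return priorities / priorities.sum()


# Toy batch of three transitions with only two actions, to keep the numbers readable.
rewards = np.array([1.0, 0.0, 9.0])
next_q_fixed = np.array([[0.5, 2.0], [1.0, 3.0], [4.0, 4.0]])  # from the fixed model
not_terminal = np.array([1, 1, 0])  # the last transition ended the episode

targets = td_targets(rewards, next_q_fixed, not_terminal, gamma=0.1)

q_chosen = np.array([1.0, 0.3, 4.0])  # current predictions for the moves played
td_errors = q_chosen - targets
probs = sampling_probabilities(td_errors)

rng = np.random.default_rng(0)
minibatch_idx = rng.choice(len(probs), size=2, p=probs)
```

Transitions with a large TD error are replayed more often, and the targets stay stable because `next_q_fixed` comes from a copy of the network that is only synchronized occasionally.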

#### Implementation
- I built two networks, a linear one and a convolutional one
- The linear model maps the state (8,8,8) to the actions (64,64), resulting in roughly 2.1 million trainable parameters (512 x 4096 weights plus 4096 biases)! This is highly inefficient because there is no parameter sharing, but it will work.
- The convolutional model uses two 1x1 convolutions and takes the outer product of the resulting arrays. This results in only 18 trainable weights!
    - Advantage: More parameter sharing -> faster convergence
    - Disadvantage: Information gets lost -> lower performance
- For a real chess AI we need bigger neural networks. But now the neural network only has to learn to capture valuable pieces.
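
The parameter counts and the outer-product trick can be checked with a small NumPy sketch. It does not use Keras; the random weights stand in for the two 1x1 convolutions (8 weights and 1 bias each), and reading the first head as "origin square" scores and the second as "destination square" scores is my interpretation of the architecture.

```python
import numpy as np

# Linear model: 512 inputs fully connected to 4096 outputs, plus one bias per output.
linear_params = 8 * 8 * 8 * 4096 + 4096
# Convolutional model: two 1x1 convolutions over 8 channels, each with 1 filter and 1 bias.
conv_params = 2 * (8 * 1 + 1)

rng = np.random.default_rng(0)
state = rng.normal(size=(8, 8, 8))  # channels first: (plane, row, col)

# A 1x1 convolution with one filter is a weighted sum over the 8 planes per square.
w_1, b_1 = rng.normal(size=8), 0.1
w_2, b_2 = rng.normal(size=8), -0.2
head_1 = np.tensordot(w_1, state, axes=1) + b_1  # shape (8, 8)
head_2 = np.tensordot(w_2, state, axes=1) + b_2  # shape (8, 8)

# Outer product of the flattened heads gives one value per (from, to) pair.
action_values = np.outer(head_1.ravel(), head_2.ravel())  # shape (64, 64)

print(linear_params, conv_params, action_values.shape)
print(np.linalg.matrix_rank(action_values))
```

The outer product of two vectors has rank one, which is one way to see the information loss listed as a disadvantage: the value of a move factorizes into a score for the origin square times a score for the destination square.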

In [None]:
print(inspect.getsource(R.play_game))

#### Demo

In [None]:
pgn = R.learn(iters=750)

In [None]:
reward_smooth = pd.DataFrame(R.reward_trace)
reward_smooth.rolling(window=125,min_periods=0).mean().plot(figsize=(16,9),title='average performance over the last 125 steps')

The PGN file is exported to the output folder. You can analyse it by pasting it on the [chess.com analysis board](https://www.chess.com/analysis)

In [None]:
with open("final_game.pgn","w") as log:
    log.write(str(pgn))

## Learned action values analysis
So what has the network learned? The code below checks the action values of capturing every black piece for every white piece. 
- We expect the action values for capturing black pieces to be similar to the (Reinfeld) rewards we put in our environment. 
- Of course the action values also depend on the risk of re-capture by black and the opportunity for consecutive capture. 

In [None]:
board.reset()
bl = board.layer_board
bl[6,:,:] = 1/10  # Assume we are in move 10
av = R.agent.get_action_values(np.expand_dims(bl,axis=0))

av = av.reshape((64,64))

p = board.board.piece_at(20)#.symbol()


white_pieces = ['P','N','B','R','Q','K']
black_piece = ['_','p','n','b','r','q','k']

df = pd.DataFrame(np.zeros((6,7)))

df.index = white_pieces
df.columns = black_piece

for from_square in range(16):
    for to_square in range(30,64):
        from_piece = board.board.piece_at(from_square).symbol()
        to_piece = board.board.piece_at(to_square)
        if to_piece:
            to_piece = to_piece.symbol()
        else:
            to_piece = '_'
        df.loc[from_piece,to_piece] = av[from_square,to_square]
        
        

### Learned action values for capturing black (lower case) with white (upper case) pieces.
An underscore represents a move to an empty square

In [None]:
df[['_','p','n','b','r','q']]

## References
Reinforcement Learning: An Introduction  
> Richard S. Sutton and Andrew G. Barto  
> 1st Edition  
> MIT Press, March 1998  

RL Course by David Silver: Lecture playlist  
> https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ  

Experience Replay  
> https://datascience.stackexchange.com/questions/20535/what-is-experience-replay-and-what-are-its-benefits