An asynchronous implementation of the AlphaZero algorithm, based on the original AlphaZero paper.
AlphaZero trains a reinforcement learning agent through self-play. The training examples are game states, and the 'ground truth' labels for each state are its value and its policy (a probability distribution over actions).
AlphaZero uses a modified version of Monte Carlo Tree Search (MCTS): instead of performing random rollouts when a leaf node is reached, it uses the trained network to predict the value of the leaf state.
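As a rough illustration (not the repo's exact code), the standard AlphaZero objective combines a mean-squared error on the value head with a cross-entropy on the policy head; tensor names and shapes below are assumptions:

```python
import tensorflow as tf

def alphazero_loss(value_true, value_pred, policy_true, policy_pred):
    """Sketch of the combined AlphaZero loss.

    value_*: shape (batch, 1), game outcomes / value predictions in [-1, 1].
    policy_*: shape (batch, num_actions), MCTS visit-count targets / predicted probabilities.
    """
    # Value head: mean-squared error against the final game outcome.
    value_loss = tf.reduce_mean(tf.square(value_true - value_pred))
    # Policy head: cross-entropy against the MCTS visit-count distribution.
    policy_loss = tf.reduce_mean(
        tf.keras.losses.categorical_crossentropy(policy_true, policy_pred))
    return value_loss + policy_loss
```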
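A simplified sketch of that idea follows; the function names and the `network.predict` interface are assumptions for illustration, not the repo's API:

```python
import math

def puct_score(parent_visits, child_visits, child_value_sum, prior, c_puct=1.0):
    # PUCT selection rule: exploitation term Q plus a prior-weighted exploration bonus U.
    q = child_value_sum / child_visits if child_visits > 0 else 0.0
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u

def evaluate_leaf(network, state):
    # The network replaces the random rollout: a single forward pass yields
    # a value estimate in [-1, 1] and a prior distribution over actions,
    # which are then backed up along the visited path.
    policy, value = network.predict(state)  # assumed interface
    return policy, value
```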
Training was done with a multiprocessing, asynchronous approach demonstrated here.
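One possible shape of that setup (an assumption, not the repo's exact layout) is several self-play worker processes pushing finished games onto a shared queue while a single trainer process consumes them; `play_one_game` and `update_network` are placeholder stubs here:

```python
import multiprocessing as mp

def play_one_game():
    """Placeholder for MCTS-guided self-play; would return (state, policy, value) examples."""
    return []

def update_network(examples):
    """Placeholder for one training step on a batch of self-play examples."""
    pass

def self_play_worker(game_queue):
    # Each worker process generates games independently and pushes them to the trainer.
    while True:
        game_queue.put(play_one_game())

def trainer(game_queue, num_steps=10):
    # The trainer consumes finished games as they arrive and updates the network.
    for _ in range(num_steps):
        update_network(game_queue.get())

if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=self_play_worker, args=(queue,), daemon=True)
               for _ in range(4)]
    for w in workers:
        w.start()
    trainer(queue)
    for w in workers:
        w.terminate()
```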
The agent was trained for one week, and it quickly learned to defeat the one-step-look-ahead agent consistently (at around 3000 epochs).
I then played against the agent myself. While it was difficult to beat, it is not unbeatable; since Connect4 is a solved game, the agent should in theory be able to converge to an optimal policy. I have since increased the memory buffer size and resumed training, and future updates will be reported.
The AlphaZero folder contains all of the backend code for this implementation.
The training configuration, the ResNet (built with TensorFlow 2), the memory object, and the game object can be found here.
MCTS-related functions can be found here.
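As an illustration of the kind of network involved, here is a generic residual block in TensorFlow 2; the filter count and layer choices are assumptions, not the repo's actual architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # Assumes the input tensor already has `filters` channels so the skip connection adds cleanly.
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])  # skip connection
    return layers.ReLU()(y)
```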
The Pit object for evaluating the agent against a one-step-look-ahead agent can be found here.
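A rough sketch of what a one-step-look-ahead baseline typically does (an assumption about this repo's agent, with a hypothetical `is_winning_move` helper): take an immediately winning move if one exists, block an immediate opponent win, otherwise play randomly.

```python
import random

def one_step_lookahead(board, legal_moves, is_winning_move, player):
    # Win now if possible.
    for move in legal_moves:
        if is_winning_move(board, move, player):
            return move
    # Otherwise, block the opponent's immediate win.
    for move in legal_moves:
        if is_winning_move(board, move, -player):
            return move
    # Otherwise, pick a random legal move.
    return random.choice(legal_moves)
```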