AlphaZero implementation for Ray/RLlib
This code implements a one-player AlphaZero agent. It includes the "ranked rewards" (R2) strategy which simulates the self-play in the two-player AlphaZero in forcing the agent to be better than its previous self. R2 is also very helpful to normalize dynamically the rewards.
The code is Pytorch based. It assumes that the environment is a gym environment, has a discrete action space and returns an observation as a dictionary with two keys:
obsthat contains an observation under either the form of a state vector or an image
action_maskthat contains a mask over the legal actions
It should also implement a
The model used in AlphaZero trainer should extend
ActorCriticModel and implement the method
Example on Cartpole
Note that both mean and max rewards are obtained with the MCTS in exploration mode: dirichlet noise is added to priors and actions are sampled from the tree policy vectors. We will add later the display of the MCTS in exploitation mode: no dirichlet noise and actions are chosen as tree policy vectors argmax.