
MuZero

  • A tabular implementation of MuZero compatible with gym, written in Python for educational purposes.
    • To elaborate: the current implementation uses MuZero's policy iteration update rule (via MCTS), but instead of using function approximation (e.g., a neural network) for the learned model, it uses a simple lookup table (see the sketch below this list).
  • I plan to add a version with (i) a neural network model and (ii) distributed training in the future.
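As a rough illustration of what "a simple table" means here, the learned model can be a set of lookup tables updated toward MCTS and environment targets. This is a minimal sketch with illustrative names, not the classes used in this repo:

```python
# Minimal sketch of a "tabular" learned model: every quantity MuZero would
# predict with a neural network is instead stored in a lookup table.
# Names are illustrative only, not the classes in this repo.
from collections import defaultdict


class TabularModel:
    def __init__(self, num_actions, lr=0.1):
        self.num_actions = num_actions
        self.lr = lr
        # Dynamics: (state, action) -> predicted reward and next state.
        self.reward = defaultdict(float)
        self.next_state = dict()
        # Prediction: state -> value estimate and policy target.
        self.value = defaultdict(float)
        self.policy = defaultdict(lambda: [1.0 / num_actions] * num_actions)

    def update(self, state, action, reward, next_state, mcts_value, mcts_policy):
        """Move each table entry toward the targets produced by MCTS / the env."""
        key = (state, action)
        self.reward[key] += self.lr * (reward - self.reward[key])
        self.next_state[key] = next_state
        self.value[state] += self.lr * (mcts_value - self.value[state])
        self.policy[state] = list(mcts_policy)
```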

Getting Started

  • To be added.

Results

  • The test environment is a maze; see maze.py for details.
  • Here's a graph showing the average discounted return during training:
    • Averaged across 10 training runs
    • Standard deviation across runs shown in the lighter color
    • The algorithm doesn't quite reach optimal performance. This might be because the exploration rate doesn't decay to zero, because a lack of exploration leads to a suboptimal value / policy for some of the initial states, or because there's a bug.

Average Discounted Return
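For reference, the quantity plotted above can be computed roughly as follows (a sketch of the metric, not the repo's evaluation code): compute the discounted return of each episode, average within a run, then take the mean and standard deviation across the 10 runs.

```python
import numpy as np


def discounted_return(rewards, discount=0.99):
    """Sum of discount**t * r_t over one episode."""
    return sum(discount**t * r for t, r in enumerate(rewards))


def summarize_runs(runs, discount=0.99):
    """`runs` is a list of runs; each run is a list of per-episode reward sequences.

    Returns the mean and standard deviation across runs of the per-run average
    discounted return (the solid line and the lighter band in the plot).
    """
    per_run = [np.mean([discounted_return(ep, discount) for ep in run]) for run in runs]
    return np.mean(per_run), np.std(per_run)
```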

  • The performance depends a fair amount on the number of Monte Carlo simulations run when choosing the action at each timestep:

Monte Carlo Simulations
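The knob being varied above is the per-step simulation budget given to MCTS. Schematically, action selection might look like the sketch below; mcts.run, visit_counts, and the temperature parameter are assumed names for illustration, not the repo's planner API:

```python
import random


def select_action(mcts, state, num_simulations=50, temperature=1.0):
    """Run a fixed budget of simulations from the current state, then sample an
    action in proportion to (tempered) root visit counts, MuZero-style.

    `mcts` is assumed to expose `run(state, num_simulations)` returning a root
    node with `visit_counts` and `num_actions`; this is a sketch of the
    interface, not the planner implemented in this repo.
    """
    root = mcts.run(state, num_simulations=num_simulations)
    counts = [root.visit_counts[a] ** (1.0 / temperature) for a in range(root.num_actions)]
    total = sum(counts)
    probs = [c / total for c in counts]
    return random.choices(range(root.num_actions), weights=probs, k=1)[0]
```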

  • Here's a gif showing the state-value function of the learned model during training:
    • Dark red is the largest value and dark blue is the lowest value
    • The blue state in the upper right is a "pit" with -1 reward
    • The goal state is the cell one step up and to the left of the bottom-right corner
    • Some state values near the pit and the goal don't converge because the agent doesn't visit them often enough

State Value
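Frames like those in the gif can be produced by reshaping the state-value table onto the maze grid and rendering it as a heatmap. A rough matplotlib sketch (hypothetical helper, not the repo's rendering code):

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_state_values(values, height, width, filename):
    """`values` maps a state index (row-major cell id) to its value estimate.

    Unvisited states default to zero, which is why cells the agent rarely
    reaches stay close to their initial color.
    """
    grid = np.zeros((height, width))
    for state, value in values.items():
        grid[state // width, state % width] = value
    plt.imshow(grid, cmap="coolwarm")  # dark red = high value, dark blue = low
    plt.colorbar()
    plt.savefig(filename)
    plt.close()
```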

Implementation

  • The implementation of the algorithm is structured after implementations of other algorithms in rllab, which is a predecessor of baselines.
    • A simpler implementation is definitely possible for the tabular case.
      • The hope is that this structure will transfer to the function approximation case, where distributed / parallel sampling becomes necessary.
  • Here's a description of the main types of classes (a sketch of how they fit together follows this list):
    • Trainer: Orchestrates the training loop (environment interaction and learning)
    • Agents: The actual algorithms (e.g., TabularMuZero)
    • Samplers: Objects that sample rollouts / trajectories from an environment when passed an agent
    • Datasets: Convert trajectories into a form useful for learning
    • Planners: Online planning algorithms used by the agents (e.g., mcts), or offline planning algorithms used for comparison (e.g., value iteration)
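Roughly, the control flow the Trainer orchestrates looks like this (a sketch with illustrative names and signatures, not the exact API in this repo):

```python
# Sketch of the training loop the Trainer coordinates; names and signatures
# are illustrative, not the repo's actual API.
def train(env, agent, sampler, dataset, num_itrs=100):
    for itr in range(num_itrs):
        # 1. The sampler collects rollouts by repeatedly asking the agent
        #    (which internally runs MCTS with its learned model) for actions.
        trajectories = sampler.sample(env, agent)
        # 2. The dataset converts trajectories into (state, action, reward,
        #    MCTS policy / value target) training examples.
        batch = dataset.build(trajectories)
        # 3. The agent updates its tabular model toward those targets.
        agent.learn(batch)
```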

Additional Notes
