Minimal implementation of tabular TD control using SARSA, Expected SARSA, and Q-Learning in a MAZE environment.
The TD control algorithms are based on Bellman equations: SARSA and Expected SARSA use a sample-based form of the Bellman equation and learn q_pi, while Q-Learning uses the Bellman optimality equation and learns q*. The three algorithms are shown below, with their update targets highlighted.
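As a rough sketch of the difference (assuming a tabular Q array indexed by (state, action) and an epsilon-greedy behavior policy; the helper below is illustrative, not the repo's actual code), the three methods share the same update rule and differ only in how the next-state value inside the target is estimated:

```python
import numpy as np

def td_target(Q, reward, next_state, next_action, gamma, epsilon, method):
    """Compute the TD target; only the next-state value estimate differs per method."""
    n_actions = Q.shape[1]
    if method == "sarsa":
        # SARSA: bootstrap from the action actually taken next (on-policy sample)
        next_value = Q[next_state, next_action]
    elif method == "expected_sarsa":
        # Expected SARSA: expectation of Q under the epsilon-greedy policy
        probs = np.full(n_actions, epsilon / n_actions)
        probs[np.argmax(Q[next_state])] += 1.0 - epsilon
        next_value = np.dot(probs, Q[next_state])
    elif method == "q_learning":
        # Q-Learning: bootstrap from the greedy action (Bellman optimality)
        next_value = np.max(Q[next_state])
    else:
        raise ValueError(method)
    return reward + gamma * next_value

# The tabular update is then identical for all three methods:
# Q[state, action] += alpha * (td_target(...) - Q[state, action])
```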
The MAZE dimensions can be varied, as can its cell content: standard 'Blocks' in black, 'Holes' in red with a nominally large penalty, and 'Food' in green with a nominally large reward. As an example, a 10x15 MAZE with B/H/F fractions of (0.3, 0.1, 0.1) is shown below. The cell placement can be randomized while keeping the fractions fixed. The example in the notebook uses a smaller maze to shorten training.
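Such a grid could be generated roughly as follows (the cell labels and the `make_maze` helper are illustrative assumptions, not the repo's actual API):

```python
import numpy as np

def make_maze(n_rows=10, n_cols=15, fractions=(0.3, 0.1, 0.1), seed=None):
    """Randomly place Blocks ('B'), Holes ('H'), and Food ('F') on an empty grid
    according to the given fractions; remaining cells stay free ('.')."""
    rng = np.random.default_rng(seed)
    n_cells = n_rows * n_cols
    counts = [int(f * n_cells) for f in fractions]   # number of B, H, F cells
    grid = np.full(n_cells, '.', dtype='<U1')
    idx = rng.permutation(n_cells)                   # random cell placement
    start = 0
    for label, count in zip('BHF', counts):
        grid[idx[start:start + count]] = label
        start += count
    return grid.reshape(n_rows, n_cols)

print(make_maze(seed=0))
```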
The MAZE solver offers an opportunity to compare the three TD methods across a wide design space spanned by reward assignments, cell fractions, epsilon, gamma, and alpha. When the impact of 'Holes' and 'Food' is small, either because their fraction is low or their rewards are small in magnitude, the three methods behave comparably. Otherwise, training episodes become proportionally longer and the methods' performance tends to diverge more.
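A comparison across that design space might be scripted along these lines (a sketch assuming a hypothetical `MazeEnv`-style object exposing `n_states`, `n_actions`, `reset()`, and `step()`, plus the `td_target` sketch above; not the repo's actual interface):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def train(env, method, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Train one tabular TD agent and return per-episode scores for comparison."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))
    scores = []
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, epsilon, rng)
        done, total = False, 0.0
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            # No bootstrapping from terminal states
            target = reward if done else td_target(
                Q, reward, next_state, next_action, gamma, epsilon, method)
            Q[state, action] += alpha * (target - Q[state, action])
            state, action, total = next_state, next_action, total + reward
        scores.append(total)
    return scores

# Compare the three methods on the same maze and hyperparameters:
# scores = {m: train(env, m) for m in ("sarsa", "expected_sarsa", "q_learning")}
```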
To assist such investigations, the provided code offers several visualizations, e.g., average score vs. episodes:
Agent's trajectory for a specific episode before and after training:
Animation of Agent's moves for a specific episode before and after training (more GIF examples in 'images' folder):
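The average-score curve could be reproduced with a simple moving-average plot like the sketch below (assuming the `scores` dictionary from the training sketch above; this is not the repo's own plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_average_score(scores, window=20):
    """Plot a moving average of per-episode scores for each method."""
    kernel = np.ones(window) / window
    for method, values in scores.items():
        smoothed = np.convolve(values, kernel, mode='valid')
        plt.plot(smoothed, label=method)
    plt.xlabel('Episode')
    plt.ylabel(f'Average score (window={window})')
    plt.legend()
    plt.show()
```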
References: