Material accompanying the lecture series NAIL133 – Agent-based Learning at Charles University, Prague (2022+) by Dr. Adam Streck.
The project contains two Unity scenes (located in Assets/Scenes/):
| Scene | Description |
|---|---|
| Bellman | Iterative Bellman value-propagation demo — computes and displays V(s) for every tile |
| QLearn | Q-Learning demo with ε-greedy exploration — trains an agent to reach the trophy |
| Key | Action |
|---|---|
| Space | Advance one step |
| A | Toggle automatic stepping (continuous play) |
Open either scene in the Unity Editor and enter Play mode. Press Space to step through the algorithm one iteration at a time, or press A to let it run automatically.
- What is a Policy?
- The Bellman Equation
- Solving the Bellman Equation Iteratively
- Action Quality (Q-Values)
- The Q-Table
- Exploration vs. Exploitation (ε-Greedy)
- The Unity Implementation
- The Broader RL Ecosystem
A policy π is a function that defines the behaviour of an agent. It maps states to a distribution over actions.
Given:
- A set of states `S`
- A set of actions `A`
A deterministic policy is a direct mapping:
π: S → A
A stochastic policy provides the probability of taking each action in a given state:
π: S × A → [0, 1]
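As a small illustration of the difference (the types below are hypothetical and not part of this repository), the two kinds of policy might be represented like this:

```csharp
using System.Collections.Generic;

// Hypothetical state and action types, for illustration only.
public enum GridAction { Left, Right, Up, Down }
public record GridState(int X, int Y);

// Deterministic policy π: S → A — each state maps to exactly one action.
public class DeterministicPolicy
{
    private readonly Dictionary<GridState, GridAction> _action = new();
    public GridAction Act(GridState s) => _action[s];
}

// Stochastic policy π: S × A → [0, 1] — each state maps to a probability per action.
public class StochasticPolicy
{
    private readonly Dictionary<GridState, Dictionary<GridAction, double>> _prob = new();
    public double Probability(GridState s, GridAction a) => _prob[s][a];
}
```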
In the example grid environment used throughout this repository:
- Actions: `A = {Left, Right, Up, Down}`
- States: `S = [0…7] × [0…4]` (an 8 × 5 tile grid)
- Tile types:
  - Grass — traversable tiles with reward 0
  - Water — sink tiles with reward −1 (no escape)
  - Award — goal tile with reward +1
The agent starts on a grass tile and can move in four directions. The episode ends when the agent reaches water or the award tile.
The map encoded as a 2D array of rewards looks like this:
private readonly int[,] _map =
{
    { -1, -1, -1, -1, -1, -1, -1, -1 }, // all-water border
    { -1,  0,  0,  0, -1,  0,  1, -1 }, // 1 = Award (trophy)
    { -1,  0,  0,  0, -1,  0,  0, -1 },
    { -1,  0,  0,  0,  0,  0,  0, -1 }, // gap in the column-4 wall (choke point)
    { -1, -1, -1, -1, -1, -1, -1, -1 }, // all-water border
};
There is a narrow choke-point at column 4, forcing the agent to navigate around it to reach the trophy.
The optimal policy leads the agent from any grass tile to the trophy in the fewest possible steps.
The problem of finding an optimal policy can be solved iteratively using the Bellman Equation.
The Bellman Equation postulates that the long-term reward of an action equals the immediate reward for that action plus the expected reward from all future actions.
It can be computed iteratively for systems with:
- Discrete states
- Discrete state transitions via actions
Let:
- `s` — current state
- `a` — action taken
- `s'` — state reached by taking action `a` in state `s`
- `γ` — discount factor (how much future rewards are worth today, `0 < γ ≤ 1`)
- `R(s, a)` — immediate reward for taking action `a` in state `s`
The value V(s) of a state is:
V(s) = max_a [ R(s, a) + γ · V(s') ]
This is the necessary condition for finding the optimal policy in dynamic programming.
Computing the Bellman Equation is a dynamic programming problem. On each iteration n, we calculate the expected future reward reachable in n+1 steps for all tiles.
In the grid world:
- `a ∈ {Left, Right, Up, Down}`
- `s = (x, y)` — the agent's tile coordinates
- `R(s, a) = +1` if the target tile `s'` is the Award tile, `0` if it is Grass, and `−1` if it is Water
A universal sink state is added: entering water or the award tile transitions the agent to the sink (episode ends). Therefore, we only allow taking an action when on a grass tile.
// Bellman update for a single tile: the best, over all actions, of the
// target tile's reward plus the discounted value of that target tile.
private double GetNewValue(VTile tile)
{
    return Agent.Actions
        .Select(a => tileGrid.GetTargetTile(tile, a)) // tile reached by each action
        .Select(t => t.Reward + gamma * t.Value)      // R(s, a) + γ · V(s')
        .Max();
}
private void CalculateValues()
{
for (var y = 0; y < TileGrid.BOARD_HEIGHT; y++)
{
for (var x = 0; x < TileGrid.BOARD_WIDTH; x++)
{
var tile = tileGrid.GetTileByCoords<VTile>(x, y);
if (tile.TileType == TileEnum.Grass)
{
tile.NextValue = GetNewValue(tile);
}
}
}
}

On every step, the value V(s) of each tile is updated to the maximum over all actions of the immediate reward plus the discounted value of the resulting tile. The future reward propagates outward from the Award tile with a diminishing return controlled by γ = 0.9.
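For a concrete feel for this propagation: starting from all values at 0, the grass tile directly next to the trophy gets V = 1 + 0.9 · 0 = 1 on the first iteration; on the next iteration its neighbour gets 0 + 0.9 · 1 = 0.9, the tile after that 0.81, and so on until every reachable grass tile has been assigned a value.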
The solution converges after 10 iterations, at which point the optimal policy can be derived by selecting the action that leads to the tile with the highest value.
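A minimal sketch of that extraction step, reusing the helpers from GetNewValue above — the method name GetGreedyAction is hypothetical and not part of the repository:

```csharp
// Hypothetical helper: the greedy action is the one whose target tile
// maximises immediate reward plus discounted value, mirroring GetNewValue.
private ActionEnum GetGreedyAction(VTile tile)
    => Agent.Actions
        .OrderBy(a =>
        {
            var target = tileGrid.GetTargetTile(tile, a);
            return target.Reward + gamma * target.Value;
        })
        .Last();
```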
The Bellman Equation tells us the value of a state. In Q-Learning we focus on the quality of individual actions the agent can take. Each (state, action) pair is assigned a Q-value.
When an action is taken, its Q-value is updated using a temporal difference (TD):
Q(s, a) ← Q(s, a) + α · TD
Where the temporal difference is:
TD = R(s, a) + γ · max_a' Q(s', a') − Q(s, a)
| Symbol | Meaning |
|---|---|
| `α` | Learning rate — how quickly new information overrides old |
| `γ` | Discount factor — how much future rewards matter |
| `R(s, a)` | Immediate reward for taking action `a` in state `s` |
| `max_a' Q(s', a')` | Best Q-value available in the next state `s'` |
The TD term combines the immediate reward with the best possible future reward, making it a direct derivation of the Bellman Equation.
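As a worked example with the defaults used later (α = 0.1, γ = 0.9): suppose the agent stands on the grass tile directly left of the trophy and all Q-values are still 0. For the Right action, R = +1 and max_a' Q(s', a') = 0, so TD = 1 + 0.9 · 0 − 0 = 1 and the new Q-value is 0 + 0.1 · 1 = 0.1. Repeated visits keep nudging that estimate towards its fixed point of 1.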
var s = _agent.State;
// Update Q-values for ALL actions from current state
foreach (var a in Agent.Actions)
{
var q = s.GetQValue(a);
var sPrime = tileGrid.GetTargetTile(s, a);
var r = sPrime.Reward;
var qMax = Agent.Actions.Select(sPrime.GetQValue).Max();
var td = r + gamma * qMax - q;
s.SetQValue(a, q + alpha * td);
}
// Now pick action based on updated Q-values and move
var chosen = PickAction(s);
_agent.State = tileGrid.GetTargetTile(s, chosen);

On every step, the Q-values for all four actions from the current state are updated using the TD rule. Only after all updates are applied does the agent select and execute an action (via ε-greedy). This ensures the agent always acts on fully up-to-date Q-values.
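For comparison, the textbook single-action formulation updates only the Q-value of the action actually taken, after the transition has been observed. A sketch using the same helpers as above — this ordering is shown for illustration and is not the repository's implementation:

```csharp
// Textbook Q-Learning step (illustrative sketch, not the repository's code):
var s = _agent.State;
var a = PickAction(s);                          // ε-greedy choice first
var sPrime = tileGrid.GetTargetTile(s, a);      // observe the resulting state
var td = sPrime.Reward
         + gamma * Agent.Actions.Select(sPrime.GetQValue).Max()
         - s.GetQValue(a);
s.SetQValue(a, s.GetQValue(a) + alpha * td);    // update only the taken action
_agent.State = sPrime;                          // move the agent
```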
The Q-table is a data structure that stores a Q-value for every (state, action) pair. It is initialised to 0 and updated incrementally as the agent explores.
| (X, Y) | Left | Right | Up | Down |
|---|---|---|---|---|
| (0,1) | 0.0 | 0.0 | 0.0 | 0.0 |
| (1,1) | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... |
- Rows = states
- Columns = actions
After enough experience, the table converges so that argmax_a Q(s, a) gives the optimal action in every state.
public class QTile : BaseTile
{
private readonly double[] _qValues = new double[4]; // one per action
public void SetQValue(ActionEnum action, double value)
{
var index = (int)action;
_qValues[index] = value;
qValueTexts[index].text = value.ToString("F3");
qValueTexts[index].color = TextColor(value); // green = positive, red = negative
}
public double GetQValue(ActionEnum action)
=> _qValues[(int)action];
}

Each tile in the grid is a row of the Q-table. The four Q-values are displayed directly on the tile and colour-coded (green for positive, red for negative).
A key challenge in Q-Learning is the exploration–exploitation trade-off:
- Exploit — pick the action with the highest known Q-value (greedy).
- Explore — pick a random action to discover potentially better paths.
Given a random value r ∈ [0, 1] and parameter ε:
if r > ε → exploit: select a = argmax_a Q(s, a)
else → explore: select a at random
We typically want to explore more early on and exploit more later. This is achieved by decaying ε over time:
ε' = max(ε_min, ε − ε_decay)
| Parameter | Role |
|---|---|
| `ε_start` | Initial exploration rate (e.g., 1.0 — fully random) |
| `ε_min` | Minimum exploration rate (e.g., 0.05) |
| `ε_decay` | How much ε decreases per step (e.g., 0.001) |
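With the example values above, ε reaches its floor after (1.0 − 0.05) / 0.001 = 950 steps; from then on the agent still explores on roughly 5 % of steps while otherwise acting greedily.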
After enough steps, the agent's policy converges to always selecting the maximum-quality action.
private ActionEnum PickAction(QTile state)
=> Random.Range(0f, 1f) > _epsilon
? Agent.Actions.Shuffle().OrderBy(state.GetQValue).Last() // exploit
        : Agent.RndAction(); // explore

The Shuffle() before OrderBy breaks ties randomly when multiple actions share the same Q-value, since LINQ's OrderBy is a stable sort and preserves the shuffled order of equal elements.
Without ε-greedy (pure exploitation from the start), the agent quickly falls into suboptimal paths; with ε-greedy exploration and decay, the Q-values instead converge to the optimal policy.
Epsilon is decayed after every step:
_epsilon = Mathf.Max(epsilonEnd, _epsilon - epsilonDecay);

The simulation is built in Unity and consists of the following components:
| Class | Purpose |
|---|---|
| `TileEnum` | Enum for tile types: Water = -1, Grass = 0, Award = 1 |
| `ActionEnum` | Enum for actions: Left = 0, Right = 1, Up = 2, Down = 3 |
| `TilePos` | Lightweight record storing (X, Y) grid coordinates |
| `BaseTile` | Base MonoBehaviour for all tiles; holds Reward and CurrentPos |
| `VTile` | Extends BaseTile; displays a single state value V(s) (used by Bellman demo) |
| `QTile` | Extends BaseTile; stores and displays four Q-values (used by Q-Learning demo) |
| `TileGrid` | Manages the 8 × 5 grid; handles tile generation, coordinate lookup, and movement rules |
| `Agent` | MonoBehaviour representing the agent; holds the current QTile state and provides action helpers |
| `Bellman` | Runs the iterative Bellman value-propagation demo (scene Bellman) |
| `QLearn` | Runs the Q-Learning demo with ε-greedy exploration (scene QLearn) |
| `Extensions` | Utility class providing a Shuffle extension method for tie-breaking |
| `ScreenShot` | Debug utility — press S to save a timestamped screenshot |
// Bounded move on grass; otherwise stay
private TilePos GetTargetPos(TilePos source, ActionEnum action)
{
var tile = _tiles[source.Y, source.X];
if (tile.TileType == TileEnum.Grass)
{
return action switch
{
ActionEnum.Up when source.Y > 0 => source with { Y = source.Y - 1 },
ActionEnum.Down when source.Y < BOARD_HEIGHT - 1 => source with { Y = source.Y + 1 },
ActionEnum.Left when source.X > 0 => source with { X = source.X - 1 },
ActionEnum.Right when source.X < BOARD_WIDTH - 1 => source with { X = source.X + 1 },
_ => source
};
}
return source;
}

Movement is only possible from grass tiles: any action attempted from water or the award tile, or a move that would leave the grid, keeps the agent in place. When the agent steps onto a non-grass tile (water or award), the episode ends and it is reset to the starting position (2, 2).
| Parameter | Default | Meaning |
|---|---|---|
| `alpha` | 0.1 | Learning rate |
| `epsilonStart` | 1.0 | Initial exploration rate |
| `epsilonEnd` | 0.05 | Minimum exploration rate |
| `epsilonDecay` | 0.001 | Epsilon decrease per step |
| `gamma` | 0.9 | Discount factor |
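In a Unity MonoBehaviour these are typically exposed as serialised fields so they can be tuned in the Inspector. The declarations below are a sketch consistent with the table and the snippets above, not a copy of the repository's QLearn component; the actual field types may differ:

```csharp
// Hypothetical hyperparameter fields matching the defaults above.
[SerializeField] private double alpha = 0.1;           // learning rate α
[SerializeField] private float  epsilonStart = 1.0f;   // initial exploration rate
[SerializeField] private float  epsilonEnd = 0.05f;    // minimum exploration rate (floor)
[SerializeField] private float  epsilonDecay = 0.001f; // ε decrease per step
[SerializeField] private double gamma = 0.9;           // discount factor γ
```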
After ~100,000 steps with these defaults, the agent reliably navigates from start to the trophy.
Q-Learning is one algorithm within a larger family of Reinforcement Learning (RL) methods. Algorithms can be categorised along several axes:
| Dimension | Options |
|---|---|
| State space | Discrete (e.g., board games) · Continuous (e.g., FPS games) |
| Action space | Discrete (e.g., strategy games) · Continuous (e.g., driving) |
| Policy type | Off-policy (Q-Learning: a' is always maximised) · On-policy (SARSA: a' is selected by the agent's current policy) |
| Operator | Q-value · Advantage A(s, a) = Q(s, a) − V(s) |
Q-Learning is an off-policy, discrete-action method. Extending it to continuous state/action spaces leads to methods like Deep Q-Networks (DQN), which replace the Q-table with a neural network.
In the grid world example, the Q-table has |S| × |A| = 40 × 4 = 160 entries — perfectly manageable. But for a game like chess, the state space exceeds 10⁴⁴ positions, making an explicit table impossible to store or fill. A neural network compresses this vast mapping into a fixed set of learnable parameters (weights). Instead of storing one value per (s, a) pair, the network takes the state as input and outputs Q-values for all actions, generalising across similar states it has never seen before.
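As a purely illustrative sketch of that idea — a linear approximator standing in for the neural network, and not code from this repository:

```csharp
// Illustrative sketch only: a linear function approximator. Instead of storing
// one table entry per (state, action) pair, a fixed set of weights maps any
// state's feature vector to Q-values for all actions, so similar states
// produce similar outputs even if never visited before.
public class LinearQFunction
{
    private readonly double[,] _weights; // [action, feature]

    public LinearQFunction(int numActions, int numFeatures)
        => _weights = new double[numActions, numFeatures];

    // Q-values for every action, computed from the state's features.
    public double[] Predict(double[] stateFeatures)
    {
        var q = new double[_weights.GetLength(0)];
        for (var a = 0; a < q.Length; a++)
            for (var f = 0; f < stateFeatures.Length; f++)
                q[a] += _weights[a, f] * stateFeatures[f];
        return q;
    }
}
```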
For a comprehensive list of RL algorithms, see the Reinforcement Learning Wikipedia page. Additional methods such as behavioural cloning are not listed there but are also used in practice. Real-world solutions typically use extended variants or combinations of the above.
| Concept | Key Formula |
|---|---|
| Bellman value update | V(s) = max_a [R(s,a) + γ·V(s')] |
| Q-value update | Q(s,a) ← Q(s,a) + α·(R + γ·max_a' Q(s',a') − Q(s,a)) |
| ε-greedy action | exploit if rand > ε, else explore |
| Epsilon decay | ε' = max(ε_min, ε − ε_decay) |
Q-Learning is a model-free, off-policy reinforcement learning algorithm that learns the optimal action-selection policy by iteratively updating a Q-table through direct interaction with the environment — no prior knowledge of the environment's dynamics is required.