In [1]:
import numpy as np 

In this notebook, we simulate a simple reinforcement learning environment based on a traffic light scenario to demonstrate the Markov property and the basic idea of Q-learning.
The environment consists of two possible states — a red light and a green light — and two possible actions for the agent (the driver): stop or go. Each action produces a reward depending on the current state:

Going on a red light leads to a strong negative reward (a “penalty”),

Stopping at a green light gives a small negative reward (lost time),

Going on a green light gives a positive reward.

The goal of the agent is to learn, through repeated interaction with the environment, which action maximizes its total long-term reward.
The learning process uses the Q-learning algorithm, which updates a table of state–action values (Q-values) according to the rewards received and the expected future rewards.

Over time, the agent learns an optimal policy:

Stop when the light is red,

Go when the light is green.

This simple setup illustrates the key principle of Markov Decision Processes (MDPs):

The next state and reward depend only on the current state and action — not on the past history.

In [7]:
rewards = {0:{0:10,1:-5},1:{0:-5,1:10}}; 
alpha = 0.1; 
gamma = 0.9; 
eps = 0.2; 
states = [0,1]; 
actions = [0,1]; 

In [8]:
def step(state,action): 
    r = rewards[state][action]; 
    return r; 

In [11]:
Q = np.zeros((2,2)); 
state = np.random.choice(states)
for i in range(100000): 
    if np.random.uniform()<eps: 
        action = np.random.choice(actions); 
    else : action = np.argmax(Q[state]); 
    reward = step(state,action); 
    next_state = np.random.choice(states); 
    Q[state][action] = Q[state][action] + alpha * (reward + Q[next_state][action] * gamma-Q[state][action]); 
    state = next_state;

In [12]:
Q

array([[34.38190542, 17.45082546],
       [19.90838136, 34.78492062]])