# Simple Q-Learning Example with PyTorch
This notebook introduces Q-learning, a foundational reinforcement learning algorithm, and implements a minimal tabular Q-learning agent in a simple gridworld environment. PyTorch is used only to store the Q-table as a tensor; no neural network is involved. The example is aimed at beginners and explains each step.

## What is Q-Learning?
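Q-learning is a value-based reinforcement learning algorithm. It learns the optimal action-value function (Q-function), which gives, for each state-action pair, the expected cumulative discounted reward of taking that action and acting optimally afterward. The agent can then maximize its return by choosing the highest-valued action in each state. Unlike multi-armed bandit methods, which have no notion of state, Q-learning is designed for environments with states and transitions between them.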
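
For reference, the optimal Q-function is usually characterized by the Bellman optimality equation. The notation below is standard textbook notation rather than something defined elsewhere in this notebook: Q* is the optimal Q-function, γ is the discount factor, and the expectation is over the reward r and next state s' that follow from taking action a in state s.

```latex
Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \,\right]
```

Acting greedily with respect to Q*, that is, choosing the action with the largest Q*(s, a) in every state, yields an optimal policy.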

## Minimal Q-Learning Example: Gridworld
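We use a simple 1D gridworld with 5 states, numbered 0 to 4. The agent starts at the leftmost state (0) and aims to reach the rightmost state (4), which is the goal. At each step, the agent can move left or right; moving left from state 0 leaves it in place. The move that reaches the goal earns a reward of 1, every other move earns 0, and the episode ends when the agent reaches the goal.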

In [5]:
import torch
import random
import numpy as np

# Define the environment: 1D gridworld with 5 states (0 to 4)
class GridWorld:
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.start_state = 0
        self.goal_state = n_states - 1
        self.reset()
    def reset(self):
        self.state = self.start_state
        return self.state
    def step(self, action):
        # action: 0=left, 1=right
        if action == 0:
            next_state = max(0, self.state - 1)
        else:
            next_state = min(self.n_states - 1, self.state + 1)
        reward = 1 if next_state == self.goal_state else 0
        done = next_state == self.goal_state
        self.state = next_state
        return next_state, reward, done
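
As a quick sanity check of the dynamics described above, the sketch below re-implements the transition rule as a standalone function so that it runs on its own. The name `step_1d` is a hypothetical helper introduced for illustration; it is not part of the notebook's `GridWorld` class.

```python
def step_1d(state, action, n_states=5):
    """Standalone copy of the transition rule (hypothetical helper)."""
    goal = n_states - 1
    if action == 0:  # left, clipped at the left wall
        next_state = max(0, state - 1)
    else:            # right, clipped at the goal
        next_state = min(n_states - 1, state + 1)
    reward = 1 if next_state == goal else 0
    done = next_state == goal
    return next_state, reward, done

# Moving left from the start state hits the wall: no movement, no reward.
wall_bump = step_1d(0, 0)
print(wall_bump)

# Always moving right from state 0 reaches the goal in exactly 4 steps.
state, done, trajectory = 0, False, []
while not done:
    state, reward, done = step_1d(state, 1)
    trajectory.append((state, reward, done))
print(trajectory)
```

Four steps is therefore the shortest possible episode, which is a useful baseline when reading the training results later on.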

## Q-Learning Agent
The agent maintains a Q-table (state-action values) and updates it using the Q-learning update rule:

Q(s, a) ← Q(s, a) + α [r + γ * max_{a'} Q(s', a') - Q(s, a)]

where:
- s: current state
- a: action taken
- r: reward received
- s': next state
- a': an action available in s' (the max is taken over all such actions)
- α: learning rate
- γ: discount factor
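
To make the rule concrete, here is one update worked through in plain Python. The numbers are illustrative assumptions: the agent is in state 3 with all Q-values still at zero, moves right, and receives a reward of 1 for reaching the goal, with α = 0.1 and γ = 0.9. The function name `q_update` is hypothetical.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """Return the new Q(s, a) after one Q-learning update."""
    td_target = reward + gamma * max_q_next  # r + gamma * max_a' Q(s', a')
    td_error = td_target - q_sa              # gap between target and current estimate
    return q_sa + alpha * td_error

# First time the goal is reached: Q(3, right) moves 10% of the way from 0 to 1.
# max_q_next is 0 because the next state is terminal.
first = q_update(q_sa=0.0, reward=1, max_q_next=0.0)
print(round(first, 4))  # 0.1

# Repeating the same transition closes 10% of the remaining gap each time.
q = first
for _ in range(2):
    q = q_update(q_sa=q, reward=1, max_q_next=0.0)
print(round(q, 4))  # 0.271
```

Each update moves the estimate only a fraction α of the way toward the target, which is why many visits are needed before a value settles.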

In [6]:
class QLearningAgent:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q_table = torch.zeros(n_states, n_actions)
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_actions = n_actions
    def select_action(self, state):
        # Epsilon-greedy policy
        if random.random() < self.epsilon:
            return random.randint(0, self.n_actions - 1)
        else:
            return torch.argmax(self.q_table[state]).item()
    def update(self, state, action, reward, next_state, done):
        # Q-learning update rule. A terminal next state has no future value,
        # so the bootstrap term is dropped when done is True.
        max_q_next = 0 if done else torch.max(self.q_table[next_state]).item()
        td_target = reward + self.gamma * max_q_next
        td_error = td_target - self.q_table[state, action]
        self.q_table[state, action] += self.alpha * td_error
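
One detail worth knowing: `torch.argmax` is documented to return the first index when several values tie, so with the all-zero initial Q-table the greedy choice above is action 0 (left), and early progress toward the goal comes only from the ε-random moves. A common variation, which is not what this notebook does, is to break ties at random. The sketch below illustrates that idea in plain Python; `select_action_random_ties` is a hypothetical name.

```python
import random

def select_action_random_ties(q_values, epsilon=0.1, rng=random):
    """Epsilon-greedy selection that breaks ties among the best actions at random."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    best = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(best_actions)          # exploit: random choice among ties

rng = random.Random(0)  # fixed seed so the demo is repeatable

# With all-zero Q-values both actions are tied, so both get selected.
tied_picks = [select_action_random_ties([0.0, 0.0], epsilon=0.0, rng=rng)
              for _ in range(200)]
print(sorted(set(tied_picks)))

# Once one action is clearly better, the greedy branch always picks it.
clear_picks = [select_action_random_ties([0.6, 0.9], epsilon=0.0, rng=rng)
               for _ in range(200)]
print(set(clear_picks))
```

In this small gridworld the original first-index behavior still learns the task; random tie-breaking mainly shortens the first few episodes.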

## Training Loop
The agent interacts with the environment for 200 episodes, updating its Q-table after every step. Each episode runs until the goal is reached, and the number of steps it took is recorded.

In [7]:
env = GridWorld(n_states=5)
agent = QLearningAgent(n_states=5, n_actions=2, alpha=0.1, gamma=0.9, epsilon=0.2)
episodes = 200
steps_per_episode = []

for ep in range(episodes):
    state = env.reset()
    done = False
    step_count = 0
    while not done:
        action = agent.select_action(state)
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state, done)
        state = next_state
        step_count += 1
    steps_per_episode.append(step_count)

print("Learned Q-table:\n", agent.q_table)
# Mean episode length over the last 20 episodes (the shortest possible episode is 4 steps)
print("Average steps to goal:", np.mean(steps_per_episode[-20:]))

Learned Q-table:
 tensor([[0.5985, 0.7290],
        [0.6080, 0.8100],
        [0.6479, 0.9000],
        [0.6607, 1.0000],
        [0.0000, 0.0000]])
Average steps to goal: 4.35
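
The learned values can be compared against the exact optimum. Because this gridworld is tiny and deterministic, the optimal Q-values can be computed directly by repeatedly applying the Bellman backup (value iteration on Q). The sketch below does this in plain Python under the same assumptions as the notebook (5 states, reward 1 on entering state 4, γ = 0.9); the function name `optimal_q` and the sweep count are hypothetical choices.

```python
def optimal_q(n_states=5, gamma=0.9, sweeps=100):
    """Exact Q-values for the 1D gridworld via repeated Bellman backups."""
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(sweeps):
        for s in range(goal):          # the goal is terminal, so its row stays 0
            for a in (0, 1):           # 0 = left, 1 = right
                s2 = max(0, s - 1) if a == 0 else min(goal, s + 1)
                if s2 == goal:
                    q[s][a] = 1.0      # reward 1, no bootstrapping from terminal
                else:
                    q[s][a] = gamma * max(q[s2])  # reward 0 plus discounted best next value
    return q

q_star = optimal_q()
for s, row in enumerate(q_star):
    print(s, [round(v, 4) for v in row])
# 0 [0.6561, 0.729]
# 1 [0.6561, 0.81]
# 2 [0.729, 0.9]
# 3 [0.81, 1.0]
# 4 [0.0, 0.0]
```

The "right" column matches the learned table above (0.7290, 0.8100, 0.9000, 1.0000), and in every non-terminal state the learned table prefers right, so the greedy policy is already optimal. The learned "left" values (0.5985 to 0.6607) are still below the exact ones (0.6561 to 0.81), plausibly because an ε-greedy agent tries the left action far less often, so those entries have received fewer updates after 200 episodes.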


## Summary
This notebook introduced the basics of Q-learning and demonstrated a simple Q-learning agent solving a gridworld problem. For more advanced RL, consider exploring deep Q-networks (DQN), policy gradients, or actor-critic methods.