# Module 4 — Reinforcement Learning

**Created:** 2025-12-04 14:06:54 UTC

## Overview
Reinforcement learning (RL) studies agents that learn to act by interacting with an environment: at each step the agent observes a state, takes an action, and receives a reward.

## Learning objectives
- Grasp core RL concepts: agent, environment, policy, reward, episode.
- Beginner: understand tabular Q-learning and read its pseudocode.
- Intermediate: understand the policy gradient idea and a simple code sketch that uses a gym-like API.
- Advanced: survey deep RL and its stability tricks, such as replay buffers and target networks.


## Beginner — Tabular Q‑learning (concept + pseudocode)

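**Concept:** Maintain a table Q(s, a) of estimated returns, one entry per state-action pair, and after every transition move the visited entry toward the target given by the Bellman equation.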
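
Written out, the update is a single rule. This is the standard Q-learning form, where α is the learning rate, γ is the discount factor, r is the reward for the transition, and s' is the next state (the symbol s' is notation introduced here; the code in this section calls it `s2`):

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

The bracketed term is the temporal-difference error: the gap between the bootstrapped target and the current estimate.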

**Pseudo-code example:**


In [None]:
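# Tabular Q-learning on a toy chain environment (the env is an illustrative assumption)
import numpy as np

n_states = 10
n_actions = 2                      # 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))
alpha = 0.1                        # learning rate
gamma = 0.99                       # discount factor
epsilon = 0.1                      # exploration rate
rng = np.random.default_rng(0)

def step(s, a):
    # Toy dynamics: reward 1 on reaching the last state, which ends the episode
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s2 == n_states - 1
    return s2, float(done), done

for episode in range(500):
    s, done, t = 0, False, 0
    while not done and t < 200:    # cap episode length
        # Epsilon-greedy action, breaking ties between equal Q-values at random
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s2, r, done = step(s, a)
        # Bellman target; no bootstrapping from a terminal state
        target = r if done else r + gamma * np.max(Q[s2])
        Q[s, a] += alpha * (target - Q[s, a])
        s, t = s2, t + 1

print('Greedy action per state:', np.argmax(Q, axis=1))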


## Intermediate — Policy gradients (concept)

**What to learn:** Instead of learning action values, parameterize the policy directly and update its parameters by gradient ascent, in the direction that increases expected return.
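
One common way to make this concrete is the REINFORCE estimator, shown here as a sketch rather than the only formulation. The symbols are assumptions introduced for illustration: π_θ is the policy with parameters θ, J(θ) is the expected return, T is the episode length, and G_t is the discounted return from step t.

$$
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right], \qquad G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k
$$

Each sampled episode gives one noisy estimate of this expectation, and the parameters move by θ ← θ + α ∇_θ J(θ). Actions followed by high returns become more likely; actions followed by low returns become less likely.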

**Code sketch:** Run a gym-like interaction loop with a small neural network policy (PyTorch or TensorFlow) to collect episodes, then compute the policy gradient from the logged actions and their returns.
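
The loop described above can be sketched without a deep learning framework. The version below is a simplification, not the neural network setup the text refers to: it swaps the network for a tabular softmax policy in NumPy so the gradient can be written by hand. `CorridorEnv` is a hypothetical toy environment that only mimics the gym-style `reset()` / `step()` interface, and the hyperparameters are illustrative guesses.

```python
import numpy as np

class CorridorEnv:
    """Hypothetical gym-like toy env: walk right along a corridor to reach the goal."""
    def __init__(self, n_states=5, max_steps=50):
        self.n_states, self.max_steps = n_states, max_steps

    def reset(self):
        self.s, self.t = 0, 0
        return self.s

    def step(self, a):
        # Action 1 moves right, action 0 moves left (clipped at the left wall)
        self.s = min(self.s + 1, self.n_states - 1) if a == 1 else max(self.s - 1, 0)
        self.t += 1
        reached_goal = self.s == self.n_states - 1
        done = reached_goal or self.t >= self.max_steps
        return self.s, float(reached_goal), done, {}

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns, g = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def reinforce(env, n_actions=2, episodes=2000, lr=0.1, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((env.n_states, n_actions))  # policy logits, one row per state
    for _ in range(episodes):
        # 1) Collect one episode with the current policy
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = int(rng.choice(n_actions, p=softmax(theta[s])))
            s2, r, done, _ = env.step(a)
            states.append(s)
            actions.append(a)
            rewards.append(r)
            s = s2
        # 2) Accumulate sum_t grad log pi(a_t | s_t) * G_t, then take one ascent step
        grad = np.zeros_like(theta)
        for s, a, g in zip(states, actions, discounted_returns(rewards, gamma)):
            grad_log_pi = -softmax(theta[s])     # d log pi(a|s) / d logits = onehot(a) - pi
            grad_log_pi[a] += 1.0
            grad[s] += g * grad_log_pi
        theta += lr * grad
    return theta

theta = reinforce(CorridorEnv())
probs = np.array([softmax(row) for row in theta])
print(np.round(probs, 2))  # column 1 is the probability of moving right
```

The goal state's row stays uniform because no action is ever taken there. With a neural network policy the structure is the same: the hand-written `grad_log_pi` is replaced by automatic differentiation of the log-probabilities.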


## Advanced — Modern deep RL practical notes

**Topics:** DQN, PPO, A3C, on-policy vs off-policy, sample efficiency, hyperparameters, reproducibility.
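
Off-policy methods such as DQN usually learn from stored transitions rather than only the most recent one. A minimal sketch of such a replay buffer follows, using only the standard library. The class name, method names, and transition layout are assumptions made for illustration, not the API of any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s2, done) transitions."""
    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement reduces correlation between
        # consecutive transitions in a training batch
        batch = self.rng.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buf.push(t, 0, 1.0, t + 1, False)
print(len(buf))  # 3: only the three most recent transitions are kept
states, actions, rewards, next_states, dones = buf.sample(2)
```

In a DQN-style loop, each environment step would call `push`, and each training step would call `sample` once the buffer holds at least one batch.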

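**Advanced tips:** Use Stable-Baselines3 or another maintained RL library instead of writing algorithms from scratch, since small implementation details strongly affect stability. PPO is a common, reasonably robust default for on-policy learning.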
