# Reinforcement Learning: Overview

## Purpose
- Learn policies for sequential decision-making.
- Maximize long-term reward under uncertainty.
- Balance exploration and exploitation.

## Key questions this section answers
- How are states, actions, and rewards defined?
- What is the return and how is it optimized?
- When should value-based methods be preferred over policy-based ones?
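
To make the question about the return concrete, here is a minimal sketch of the discounted return for a finite episode, G_0 = r_0 + gamma * r_1 + gamma^2 * r_2 + ... The function name `discounted_return`, the reward list, and the discount factor are illustrative assumptions, not part of the notebook.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of a finite reward sequence (hypothetical helper)."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative episode: 1.0 + 0.5 * 0.0 + 0.25 * 2.0
episode_rewards = [1.0, 0.0, 2.0]
print(discounted_return(episode_rewards, gamma=0.5))
```

A smaller `gamma` weights immediate rewards more heavily; `gamma=1.0` gives the plain undiscounted sum.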

## Topics
- Markov Decision Processes (MDPs)
- Bandits, exploration strategies
- Value functions and Q-learning
- Policy gradients and actor-critic methods
- Offline vs online RL, safety, and evaluation
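
As a preview of the value-based topics, the sketch below runs tabular Q-learning on a tiny deterministic chain. The environment (`step`, five states, reward 1 at the right end), the uniform random behaviour policy, and all constants are assumptions chosen for illustration; Q-learning is off-policy, so it can still learn the greedy policy from random behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2   # chain 0..4; actions: 0 = left, 1 = right
goal = n_states - 1
alpha, gamma = 0.5, 0.9      # step size and discount (illustrative values)


def step(state, action):
    """Hypothetical deterministic chain: reward 1 only on reaching the goal."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if next_state == goal else 0.0
    return next_state, reward, next_state == goal


Q = np.zeros((n_states, n_actions))
for _ in range(200):
    state, done = 0, False
    while not done:
        # Behave uniformly at random; the update below is still greedy.
        action = int(rng.integers(n_actions))
        next_state, reward, done = step(state, action)
        # Bootstrap from the best next-state value; terminal states are worth 0.
        target = reward + (0.0 if done else gamma * Q[next_state].max())
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

greedy = Q[:goal].argmax(axis=1)  # greedy action in each non-terminal state
print(greedy)
```

The `max` over next-state values in the target is what distinguishes Q-learning from on-policy updates such as SARSA, which would use the value of the action actually taken next.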

## References
- Sutton & Barto, "Reinforcement Learning: An Introduction"
- Gymnasium (environment API)
- Stable-Baselines3 (reference algorithm implementations)


In [None]:
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

rng = np.random.default_rng(0)
true_means = np.array([0.10, 0.20, 0.05, 0.30, 0.15])
num_arms = len(true_means)

epsilon = 0.1
n_steps = 500
q = np.zeros(num_arms)
counts = np.zeros(num_arms)
rewards = []

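for _ in range(n_steps):
    # Explore with probability epsilon, otherwise exploit the current best estimate.
    if rng.random() < epsilon:
        action = int(rng.integers(num_arms))
    else:
        action = int(np.argmax(q))
    # Sample a noisy reward around the chosen arm's true mean.
    reward = rng.normal(true_means[action], 0.1)
    counts[action] += 1
    # Incremental sample-average update of the chosen arm's value estimate.
    q[action] += (reward - q[action]) / counts[action]
    rewards.append(reward)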

cum_reward = np.cumsum(rewards)
fig = px.line(
    x=np.arange(n_steps),
    y=cum_reward,
    title="Epsilon-greedy cumulative reward",
    labels={"x": "step", "y": "cumulative reward"},
)
fig.show()

fig = go.Figure()
labels = [f"arm {i}" for i in range(num_arms)]
fig.add_trace(go.Bar(x=labels, y=q, name="estimated"))
fig.add_trace(
    go.Scatter(
        x=labels,
        y=true_means,
        mode="markers",
        name="true mean",
        marker=dict(size=9, symbol="diamond"),
    )
)
fig.update_layout(title="Learned values vs true means", yaxis_title="reward")
fig.show()

## Takeaway
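Exploration is necessary early to identify the best arm, but exploitation should dominate once the value estimates are reliable. With a fixed epsilon of 0.1, the agent above still picks a random arm on about 10% of steps, no matter how accurate its estimates have become.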
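
One common way to let exploitation take over is to decay epsilon as steps accumulate. The sketch below uses an inverse-time schedule with a floor; the function name `epsilon_at` and its constants are assumptions for illustration, not values tuned for the bandit above.

```python
def epsilon_at(step, eps_start=1.0, eps_min=0.01, decay=0.01):
    """Hypothetical schedule: inverse-time decay from eps_start to a floor eps_min."""
    return max(eps_min, eps_start / (1.0 + decay * step))


# Exploration rate early, mid-run, and late in training.
for t in (0, 100, 10_000):
    print(t, epsilon_at(t))
```

In the bandit loop, this would replace the constant: iterate with `for t in range(n_steps)` and compare `rng.random()` against `epsilon_at(t)`. The floor keeps a small amount of exploration so that estimates for rarely chosen arms can still improve.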

