In [1]:
import numpy as np

# Optimal Policy and Optimal Value
## Optimal Policy Definition
Given an environment, if there exists a policy $\pi_*$ such that every policy $\pi$ satisfies $\pi \preccurlyeq \pi_*$, then $\pi_*$ is an optimal policy. An optimal policy satisfies, for every state $s$ and action $a$:
$$
\begin{align*}
v_{\pi_*}(s) &= \max_{\pi} v_\pi(s)\\
q_{\pi_*}(s,a) &= \max_{\pi} q_\pi(s,a)
\end{align*}
$$
### Optimal Values
Optimal State Value:
$$
v_*(s) = \sup_\pi v_\pi(s)
$$
Optimal Action Value:
$$
q_*(s,a) = \sup_\pi q_\pi(s,a)
$$
If there is an optimal policy $\pi_*$, its values equal the optimal values: $v_{\pi_*} = v_*$ and $q_{\pi_*} = q_*$.
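
As a rough illustration of these definitions, the sketch below enumerates every deterministic policy of a small, made-up MDP and checks which ones attain the optimal state values. The transition array `P`, the reward array `R`, and the discount factor `gamma` are hypothetical and are not taken from the text above. The search is also restricted to deterministic stationary policies, which is an assumption that suffices for finite discounted MDPs.

```python
import itertools
import numpy as np

gamma = 0.9  # assumed discount factor (hypothetical)

# Hypothetical toy MDP with 2 states and 2 actions.
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])
n_states, n_actions = R.shape

def state_values(policy):
    """Solve v = r_pi + gamma * P_pi v for a deterministic policy."""
    idx = np.arange(n_states)
    P_pi = P[idx, policy]  # shape (n_states, n_states)
    r_pi = R[idx, policy]  # shape (n_states,)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Evaluate all deterministic policies (one action per state).
policies = list(itertools.product(range(n_actions), repeat=n_states))
values = {pi: state_values(np.array(pi)) for pi in policies}

# Statewise maximum over the enumerated policies.
v_star = np.max(np.array(list(values.values())), axis=0)

# Policies whose values match v_star in every state.
optimal = [pi for pi, v in values.items() if np.allclose(v, v_star)]
print("v_* =", v_star, "attained by", optimal)
```

Because both the state space and the action space are finite here, at least one enumerated policy attains the statewise maximum, so the supremum in the definition of $v_*$ is a maximum.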

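Not all environments have optimal values. The supremum over policies may be infinite, and even a finite supremum need not be attained by any policy, which is the difference between a supremum and a maximum.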

## Existence and Uniqueness of Optimal Policy
Here we:
1. Examine an environment that has optimal values but no optimal policy.
2. Discuss conditions that ensure the existence of an optimal policy.
3. Show that there may be multiple different optimal policies.

### One-shot Environment
- $S = \{s\}$
- $A \subseteq \mathbb{R}$, $A$ is an open interval, such as $(0,1)$
- $R_0 = a$ where $a \in A$ is the action chosen.
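- The state value of any policy $\pi$ is $v_\pi(s) = \mathbb{E}_\pi[a] = \sum_{a \in A} a\cdot \pi(a\mid s)$ (written as a sum for policies with countable support; for a policy with a density, the sum becomes an integral).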
- $q_\pi(s,a) = a$

Now we find the optimal values for this environment. For any policy $\pi$,
$$
v_\pi(s) = \sum_{a \in A}a\cdot \pi(a\mid s) \leq \sum_{a \in A}(\sup A)\pi(a\mid s) = \sup A
$$
Taking the supremum over all policies gives the upper bound
$$
v_*(s) = \sup_\pi v_\pi(s) \leq \sup A
$$

On the other hand, for any action $a\in A$, we can construct a deterministic policy $\pi : s \mapsto a$, whose state value is $v_\pi(s) = a$. Consequently,
$$v_*(s) \geq v_\pi(s) = a$$
Since $a$ can be any action in $A$, we get $v_*(s) \geq \sup A$. Combined with the upper bound $v_*(s) \leq \sup A$, the optimal state value is $v_*(s) = \sup A$.

The optimal action value is $q_*(s,a) = a$, because the reward depends only on the chosen action.

If we further set $A = (0,1)$, then the optimal state value is $v_*(s) = \sup A = 1$. However, the state value of any policy $\pi$ (written here with a density $\pi(a\mid s)$) satisfies:
$$v_\pi(s) = \int_0^1a \cdot \pi(a\mid s)\,\mathrm{d}a < \int_0^1\pi(a\mid s)\,\mathrm{d}a = 1 = v_*(s)$$
So no policy satisfies $v_{\pi}(s) = v_*(s)$. Therefore the environment has optimal values but no optimal policy.
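
The argument above can be checked numerically with a minimal sketch. It assumes the family of deterministic policies $\pi_n : s \mapsto 1 - 1/n$, which is chosen here purely for illustration. The values approach $v_*(s) = 1$ but every gap stays strictly positive.

```python
import numpy as np

# One-shot environment with A = (0, 1): a deterministic policy that
# always picks action a has state value v_pi(s) = a.
n = np.array([2, 10, 100, 1000, 10**6])
actions = 1.0 - 1.0 / n  # hypothetical policy family pi_n : s -> 1 - 1/n
values = actions         # v_{pi_n}(s) = 1 - 1/n

v_star = 1.0             # sup A for A = (0, 1)
gaps = v_star - values   # strictly positive, shrinking toward 0

for k, v, g in zip(n, values, gaps):
    print(f"n={k:>7d}  v_pi(s)={v:.6f}  v_*(s) - v_pi(s)={g:.6f}")
```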

## When does an Optimal Policy Exist?
An optimal policy exists when one of the following conditions is met:
- $S$ is countable, and $A(s)$ is finite for all $s \in S$.
- $S$ is countable, and $A(s)$ is compact for all $s \in S$ (for subsets of $\mathbb{R}^n$, compact means closed and bounded).
- $S$ is Polish (e.g., $\mathbb{R}^n$ or $[0,1]$), and the $A(s)$ are all finite.
- $S$ is Polish, the $A(s)$ are all compact metric spaces, and $r(s,a)$ is bounded.
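
To see the first condition at work, the following sketch reuses the one-shot environment but replaces the open interval with a finite action set. The action grid and the randomly sampled stochastic policies are assumptions made for illustration only. With finitely many actions the supremum is attained, so the greedy deterministic policy is optimal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite action set for the one-shot environment.
actions = np.linspace(0.1, 0.9, 9)

# Sample random stochastic policies pi(a | s) from a flat Dirichlet.
policies = rng.dirichlet(np.ones(actions.size), size=1000)

# v_pi(s) = sum_a a * pi(a | s) for each sampled policy.
values = policies @ actions

# The deterministic policy that always picks the largest action.
greedy_value = actions.max()

print("best sampled value:", values.max())
print("greedy policy value:", greedy_value)
```

No sampled policy exceeds the greedy value, which is consistent with $v_*(s) = \max A$ being attained when $A$ is finite.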

For simplicity, we usually assume that an optimal policy exists, even though this is not guaranteed in general.

## $\epsilon$-Optimal Policies
For environments that have no optimal policies, we can at the very least design policies that are arbitrarily close to being "ideal".

Given $\epsilon > 0$, if a policy $\pi_*$ satisfies, for every state $s$ and action $a$,
$$
\begin{align*}
v_*(s) - v_{\pi_*}(s) &< \epsilon\\
q_*(s,a) - q_{\pi_*}(s,a) &< \epsilon
\end{align*}
$$
then $\pi_*$ is an $\epsilon$-optimal policy.
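
As a minimal sketch of this definition, consider again the one-shot environment with $A = (0,1)$, where $v_*(s) = 1$ and $q_*(s,a) = q_\pi(s,a) = a$ for every policy. The helper `epsilon_optimal_action` below is a hypothetical name, and the choice $a = 1 - \min(\epsilon, 1)/2$ is just one of many actions that would work.

```python
def epsilon_optimal_action(eps):
    """Return an action in (0, 1) whose deterministic policy is eps-optimal.

    Hypothetical helper: picks a = 1 - min(eps, 1) / 2, so the gap
    v_*(s) - v_pi(s) = min(eps, 1) / 2 is strictly below eps.
    For extremely small eps, floating-point rounding may push a to 1.0.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    return 1.0 - min(eps, 1.0) / 2.0

v_star = 1.0  # optimal state value for A = (0, 1)

results = []
for eps in [1.0, 0.1, 1e-3, 1e-9]:
    a = epsilon_optimal_action(eps)
    v_gap = v_star - a  # v_*(s) - v_pi(s)
    q_gap = 0.0         # q_*(s, a') - q_pi(s, a') = a' - a' = 0
    results.append((eps, a, v_gap, q_gap))
    print(f"eps={eps:g}  action={a!r}  value gap={v_gap:.3g}")
```

Every gap is below its $\epsilon$, so this environment has an $\epsilon$-optimal policy for each $\epsilon > 0$ even though it has no optimal policy.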

Sometimes, even when optimal values exist, there may not exist an $\epsilon$-optimal policy.