# MDP
* Stationary: $\mathbb{P}[(R_{t+1},S_{t+1})|(S_t,A_t)]$ is independent of $t$
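* State-Reward Transition Probability Function $\mathcal{P}_R:\mathcal{N} \times \mathcal{A} \times \mathcal{D} \times \mathcal{S}\rightarrow[0,1]$, where $\mathcal{N}$ is the set of non-terminal states, $\mathcal{A}$ the actions, $\mathcal{D}$ the possible rewards, and $\mathcal{S}$ all states (terminal included). In Python it can be conceptualized as the nested mapping $\mathcal{N}\rightarrow (\mathcal{A}\rightarrow (\mathcal{S} \times \mathcal{D}\rightarrow[0,1]))$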
\begin{equation}
\mathcal{P}_R(s,a,r,s') = \mathbb{P}[(R_{t+1}=r,S_{t+1}=s')|(S_t=s,A_t=a)]
\end{equation}
```
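from typing import Mapping, Optional, Tuple, TypeVar
from rl.distribution import FiniteDistribution  # assumption: the `rl` package these notes follow

S = TypeVar('S')  # state type
A = TypeVar('A')  # action type
StateReward = FiniteDistribution[Tuple[S, float]]  # distribution over (next state, reward)
ActionMapping = Mapping[A, StateReward[S]]  # action -> distribution
StateActionMapping = Mapping[S, Optional[ActionMapping[A, S]]]  # None marks a terminal state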
```
* State Transition Probability Function $\mathcal{P}:\mathcal{N}\times\mathcal{A}\times\mathcal{S}\rightarrow [0,1]$
\begin{equation}
\mathcal{P}(s,a,s') = \sum_{r\in \mathcal{D}} \mathcal{P}_R(s,a,r,s')
\end{equation}

* Reward Transition Function $\mathcal{R}_T:\mathcal{N}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$
\begin{equation}
\mathcal{R}_T(s,a,s') = \mathbb{E}[R_{t+1}|(S_{t+1}=s',S_t=s,A_t=a)] \\
= \sum_{r\in \mathcal{D}} \frac{\mathcal{P}_R(s,a,r,s')}{\mathcal{P}(s,a,s')} \cdot r = \frac{\sum_{r\in \mathcal{D}} \mathcal{P}_R(s,a,r,s') \cdot r}{\sum_{r\in \mathcal{D}} \mathcal{P}_R(s,a,r,s')}
\end{equation}

* Reward Function $\mathcal{R}:\mathcal{N}\times\mathcal{A}\rightarrow\mathbb{R} $
\begin{equation}
\mathcal{R}(s,a) = \mathbb{E}[R_{t+1}|(S_t=s,A_t=a)] \\
= \sum_{s'\in\mathcal{S}} \mathcal{P}(s,a,s')\cdot \mathcal{R}_T(s,a,s') =\sum_{s'\in\mathcal{S}}\sum_{r\in \mathcal{D}}\mathcal{P}_R(s,a,r,s') \cdot r
\end{equation}
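
The three derived quantities above are all marginalizations of $\mathcal{P}_R$. The following is a minimal sketch in plain Python, assuming a hypothetical dictionary `p_r` that stores $\mathcal{P}_R(s,a,\cdot,\cdot)$ for one fixed state-action pair. The names `p_r`, `transition_probs`, `reward_transition`, and `expected_reward` are illustrative and do not come from any library.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical P_R(s, a, r, s') for one fixed (s, a):
# keys are (next_state, reward), values are probabilities summing to 1.
p_r: Dict[Tuple[str, float], float] = {
    ("s1", 0.0): 0.25,
    ("s1", 2.0): 0.25,
    ("s2", 4.0): 0.5,
}

def transition_probs(p_r: Dict[Tuple[str, float], float]) -> Dict[str, float]:
    """P(s,a,s') = sum over r of P_R(s,a,r,s')."""
    p: Dict[str, float] = defaultdict(float)
    for (s_next, _reward), prob in p_r.items():
        p[s_next] += prob
    return dict(p)

def reward_transition(p_r: Dict[Tuple[str, float], float]) -> Dict[str, float]:
    """R_T(s,a,s') = (sum over r of P_R(s,a,r,s') * r) / P(s,a,s')."""
    p = transition_probs(p_r)
    weighted: Dict[str, float] = defaultdict(float)
    for (s_next, reward), prob in p_r.items():
        weighted[s_next] += prob * reward
    return {s_next: weighted[s_next] / p[s_next] for s_next in p}

def expected_reward(p_r: Dict[Tuple[str, float], float]) -> float:
    """R(s,a) = sum over s' and r of P_R(s,a,r,s') * r."""
    return sum(prob * reward for (_s_next, reward), prob in p_r.items())

p = transition_probs(p_r)
r_t = reward_transition(p_r)
r = expected_reward(p_r)
# Consistency check: R(s,a) also equals sum over s' of P(s,a,s') * R_T(s,a,s').
r_via_rt = sum(p[s_next] * r_t[s_next] for s_next in p)
```

Both routes to $\mathcal{R}(s,a)$ give the same number here, which mirrors the two expressions in the last equation.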

# Policy
A policy is an agent-controlled function $\pi:\mathcal{N}\times\mathcal{A}\rightarrow[0,1]$ giving the probability of choosing action $a$ in state $s$. For each fixed state $s$, $\pi(s,\cdot)$ is a probability distribution over actions.
\begin{equation}
\pi (s,a) =\mathbb{P}[A_t=a|S_t=s] \text{ for all time steps $t=0,1,2,\dots$}
\end{equation}

## Deterministic policy
A deterministic policy picks a single action in each state with probability 1, instead of a distribution over actions. It can be written as a function $\pi_D:\mathcal{N}\rightarrow\mathcal{A}$ such that
\begin{equation}
\pi(s,\pi_D(s))=1 \text{ and } \pi(s,a)=0 \text{ for all $a\in\mathcal{A}$ with $a\neq \pi_D(s)$}
\end{equation}
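
As a small illustration of this definition, the sketch below lifts a deterministic policy into the stochastic form $\pi(s,a)$. The helper name `as_stochastic` and the toy states and actions are assumptions made up for the example.

```python
from typing import Dict, List

def as_stochastic(pi_d: Dict[str, str], actions: List[str]) -> Dict[str, Dict[str, float]]:
    """Return pi with pi[s][pi_d[s]] = 1 and pi[s][a] = 0 for every other action."""
    return {
        s: {a: (1.0 if a == chosen else 0.0) for a in actions}
        for s, chosen in pi_d.items()
    }

# Hypothetical deterministic policy on two states.
pi_d = {"x": "go", "y": "stay"}
pi = as_stochastic(pi_d, ["stay", "go"])
```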

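## Probabilities and rewards implied by a policy
Fixing a policy $\pi$ reduces the MDP to an MRP with the following transition probabilities and rewards: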
\begin{equation}
\mathcal{P}_R^\pi(s,r,s')=\sum_{a\in\mathcal{A}} \pi(s,a)\cdot\mathcal{P}_R(s,a,r,s')\\
\mathcal{P}^\pi(s,s')=\sum_{a\in\mathcal{A}} \pi(s,a)\cdot\mathcal{P}(s,a,s')\\
\mathcal{R}_T^\pi(s,s')=\sum_{a\in\mathcal{A}} \pi(s,a)\cdot\mathcal{R}_T(s,a,s')\\
\mathcal{R}^\pi(s)=\sum_{a\in\mathcal{A}} \pi(s,a)\cdot\mathcal{R}(s,a)
\end{equation}
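
The $\pi$-weighted sums for $\mathcal{P}^\pi$ and $\mathcal{R}^\pi$ can be sketched as follows. The two-state MDP, the policy, and the function name `implied_mrp` are hypothetical, chosen only to make the averaging over actions concrete.

```python
from typing import Dict, Tuple

# Hypothetical toy MDP: P[s][a][s'] is a transition probability,
# R[s][a] is the expected reward R(s, a).
P = {
    "x": {"stay": {"x": 1.0}, "go": {"x": 0.5, "y": 0.5}},
    "y": {"stay": {"y": 1.0}, "go": {"x": 1.0}},
}
R = {"x": {"stay": 0.0, "go": 2.0}, "y": {"stay": 3.0, "go": 0.0}}
# Hypothetical stochastic policy pi[s][a].
pi = {"x": {"stay": 0.5, "go": 0.5}, "y": {"stay": 1.0, "go": 0.0}}

def implied_mrp(P, R, pi) -> Tuple[Dict[str, Dict[str, float]], Dict[str, float]]:
    """Compute P^pi(s, s') and R^pi(s) by averaging over actions with weights pi(s, a)."""
    P_pi: Dict[str, Dict[str, float]] = {}
    R_pi: Dict[str, float] = {}
    for s in P:
        R_pi[s] = sum(pi[s][a] * R[s][a] for a in P[s])
        row: Dict[str, float] = {}
        for a, dist in P[s].items():
            for s_next, prob in dist.items():
                row[s_next] = row.get(s_next, 0.0) + pi[s][a] * prob
        P_pi[s] = row
    return P_pi, R_pi

P_pi, R_pi = implied_mrp(P, R, pi)
```

Each row of `P_pi` is again a probability distribution over next states, as required for an MRP.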

# State-Value Function (implied by policy $\pi$)
* State-Value Function (for policy $\pi$) $V^\pi:\mathcal{N}\rightarrow\mathbb{R}$ is defined as:
\begin{equation}
V^\pi(s)=\mathbb{E}_{\pi,\mathcal{P}_R}[G_t|S_t=s]\text{ for all $s\in\mathcal{N}$, for all $t=0,1,2,\dots$}
\end{equation}
* MRP Bellman equation
\begin{equation}
V^\pi(s)= \mathcal{R}^\pi(s)+\gamma\cdot\sum_{s'\in\mathcal{N}}\mathcal{P}^\pi(s,s')\cdot V^\pi(s')
\end{equation}
* MRP Bellman equation expressed in terms of the MDP
\begin{equation}
V^\pi(s)= \sum_{a\in\mathcal{A}}\pi(s,a)\cdot\left(\mathcal{R}(s,a)+\gamma\cdot\sum_{s'\in\mathcal{N}}\mathcal{P}(s,a,s')\cdot V^\pi(s')\right)\text{ for $s\in\mathcal{N}$}
\end{equation}
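
One way to use this equation is to treat it as an update rule and iterate until the values stop changing (iterative policy evaluation). The sketch below assumes a hypothetical two-state MDP with no terminal states and $\gamma=0.5$; the function name `evaluate_policy` and the tolerance are illustrative choices, not a prescribed method.

```python
from typing import Dict

# Hypothetical toy MDP: P[s][a][s'] and R[s][a], plus a stochastic policy pi[s][a].
P = {
    "x": {"stay": {"x": 1.0}, "go": {"x": 0.5, "y": 0.5}},
    "y": {"stay": {"y": 1.0}, "go": {"x": 1.0}},
}
R = {"x": {"stay": 0.0, "go": 2.0}, "y": {"stay": 3.0, "go": 0.0}}
pi = {"x": {"stay": 0.5, "go": 0.5}, "y": {"stay": 1.0, "go": 0.0}}

def evaluate_policy(P, R, pi, gamma: float, tol: float = 1e-10) -> Dict[str, float]:
    """Repeatedly apply the Bellman equation for V^pi until successive sweeps agree."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {}
        for s in P:
            total = 0.0
            for a, dist in P[s].items():
                # States missing from V (terminal states) contribute a value of 0.
                backup = R[s][a] + gamma * sum(
                    prob * V.get(s_next, 0.0) for s_next, prob in dist.items()
                )
                total += pi[s][a] * backup
            new_V[s] = total
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V

V = evaluate_policy(P, R, pi, gamma=0.5)
```

For this toy example the equation can also be solved by hand: $V^\pi(y)=3+0.5\,V^\pi(y)$ gives $V^\pi(y)=6$, and substituting into the equation for $x$ gives $V^\pi(x)=2.8$, which the iteration approaches.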

# Action-Value Function (Q function) of an MDP for a fixed policy
* Action-Value Function (for policy $\pi$) $Q^\pi :\mathcal{N}\times\mathcal{A}\rightarrow\mathbb{R}$ is defined as:
\begin{equation}
Q^\pi(s,a)=\mathbb{E}_{\pi,\mathcal{P}_R}[G_t|(S_t=s,A_t=a)] \text{ for all $s\in\mathcal{N}, a\in \mathcal{A}$}\\
V^\pi(s)=\sum_{a\in\mathcal{A}} \pi(s,a)\cdot Q^\pi(s,a) \text{ for all $s\in\mathcal{N}$}
\end{equation}
Combining the last few equations, we get two further formulations of $Q^\pi$:
\begin{equation}
Q^\pi(s,a)=\mathcal{R}(s,a)+\gamma\cdot\sum_{s'\in\mathcal{N}}\mathcal{P}(s,a,s')\cdot V^\pi(s')\text{ for all $s\in\mathcal{N}, a\in \mathcal{A}$}\\
Q^\pi(s,a)=\mathcal{R}(s,a)+\gamma\cdot\sum_{s'\in\mathcal{N}}\mathcal{P}(s,a,s')\sum_{a'\in\mathcal{A}}\pi(s',a')\cdot Q^\pi(s',a')\text{ for all $s\in\mathcal{N}, a\in \mathcal{A}$}
\end{equation}
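
The relationship between $Q^\pi$ and $V^\pi$ can be checked numerically. This sketch assumes a hypothetical two-state MDP with $\gamma=0.5$ and takes $V^\pi$ as given (the values were obtained by solving the Bellman equation for this toy example by hand). It computes $Q^\pi$ from $V^\pi$ and then recovers $V^\pi$ as the $\pi$-weighted average of $Q^\pi$. The name `q_from_v` is illustrative.

```python
from typing import Dict

# Hypothetical toy MDP: P[s][a][s'] and R[s][a], plus a stochastic policy pi[s][a].
P = {
    "x": {"stay": {"x": 1.0}, "go": {"x": 0.5, "y": 0.5}},
    "y": {"stay": {"y": 1.0}, "go": {"x": 1.0}},
}
R = {"x": {"stay": 0.0, "go": 2.0}, "y": {"stay": 3.0, "go": 0.0}}
pi = {"x": {"stay": 0.5, "go": 0.5}, "y": {"stay": 1.0, "go": 0.0}}
gamma = 0.5
# V^pi for this example, solved by hand from the Bellman equation (an assumption of the sketch).
V = {"x": 2.8, "y": 6.0}

def q_from_v(P, R, V, gamma: float) -> Dict[str, Dict[str, float]]:
    """Q^pi(s,a) = R(s,a) + gamma * sum over s' of P(s,a,s') * V^pi(s')."""
    return {
        s: {
            a: R[s][a] + gamma * sum(prob * V.get(s_next, 0.0) for s_next, prob in dist.items())
            for a, dist in P[s].items()
        }
        for s in P
    }

Q = q_from_v(P, R, V, gamma)
# V^pi(s) = sum over a of pi(s,a) * Q^pi(s,a) should reproduce V.
V_check = {s: sum(pi[s][a] * Q[s][a] for a in Q[s]) for s in Q}
```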

# Optimal functions
1. Optimal State-Value Function, $V^*$
2. Optimal Action-Value Function, $Q^*$

* Optimal State-Value Function $V^* :\mathcal{N} \rightarrow\mathbb{R}$, with $\Pi$ the set of stationary policies, defined as:
\begin{equation}
V^*(s) = \max_{\pi\in\Pi} V^\pi(s)\text{ for all $s\in\mathcal{N}$}
\end{equation}
* Optimal Action-Value Function $Q^*:\mathcal{N}\times\mathcal{A}\rightarrow\mathbb{R}$ defined as:
\begin{equation}
Q^*(s,a)=\max_{\pi\in\Pi} Q^\pi(s,a) \text{ for all $s\in\mathcal{N},a\in\mathcal{A}$}
\end{equation}

Note that $V^*$ and $Q^*$ are related by:
\begin{equation}
V^*(s) = \max_{a\in\mathcal{A}} Q^*(s,a) \\
Q^*(s,a)=\mathcal{R}(s,a)+\gamma\cdot\sum_{s'\in\mathcal{N}}\mathcal{P}(s,a,s')\cdot V^*(s')
\end{equation}

# MDP State-Value Function Bellman Optimality Equation
\begin{equation}
V^*(s) = \max_{a\in\mathcal{A}}\{\mathcal{R}(s,a)+\gamma\cdot\sum_{s'\in\mathcal{N}}\mathcal{P}(s,a,s')\cdot V^*(s')\}
\end{equation}

# MDP Action-Value Function Bellman Optimality Equation
\begin{equation}
Q^*(s,a)=\mathcal{R}(s,a)+\gamma\cdot\sum_{s'\in\mathcal{N}}\mathcal{P}(s,a,s')\cdot \max_{a'\in\mathcal{A}} Q^*(s',a')
\end{equation}
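
The Bellman optimality equations suggest a simple fixed-point iteration, commonly called value iteration: compute $Q$ from the current $V$, then set $V(s)=\max_a Q(s,a)$, and repeat. The sketch below is a minimal version under stated assumptions: a hypothetical two-state MDP with no terminal states, $\gamma=0.5$, and illustrative names `value_iteration` and `greedy`.

```python
from typing import Dict, Tuple

# Hypothetical toy MDP: P[s][a][s'] is a transition probability, R[s][a] is R(s, a).
P = {
    "x": {"stay": {"x": 1.0}, "go": {"x": 0.5, "y": 0.5}},
    "y": {"stay": {"y": 1.0}, "go": {"x": 1.0}},
}
R = {"x": {"stay": 0.0, "go": 2.0}, "y": {"stay": 3.0, "go": 0.0}}

def value_iteration(
    P, R, gamma: float, tol: float = 1e-10
) -> Tuple[Dict[str, float], Dict[str, Dict[str, float]]]:
    """Iterate V(s) <- max_a [R(s,a) + gamma * sum_s' P(s,a,s') V(s')] to a fixed point."""
    V = {s: 0.0 for s in P}
    while True:
        # Q computed from the current V; terminal states (absent from V) count as 0.
        Q = {
            s: {
                a: R[s][a] + gamma * sum(prob * V.get(s_next, 0.0) for s_next, prob in dist.items())
                for a, dist in P[s].items()
            }
            for s in P
        }
        new_V = {s: max(Q[s].values()) for s in P}
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V, Q
        V = new_V

V_star, Q_star = value_iteration(P, R, gamma=0.5)
# A greedy (deterministic) policy with respect to the converged Q.
greedy = {s: max(Q_star[s], key=Q_star[s].get) for s in Q_star}
```

For this toy example the fixed point can be verified by hand: staying in `y` gives $V^*(y)=3+0.5\,V^*(y)=6$, and going from `x` gives $V^*(x)=2+0.5\,(0.5\,V^*(x)+0.5\cdot 6)$, so $V^*(x)=14/3$.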


## FiniteMarkovProcess
```
100 is a Terminal State
From State 0:
  To State 38 with Probability 0.167
  To State 2 with Probability 0.167
  To State 3 with Probability 0.167
  To State 14 with Probability 0.167
  To State 5 with Probability 0.167
  To State 6 with Probability 0.167
From State 2:
```

## FiniteMarkovRewardProcess
```
From State 0:
  To [State 38 and Reward 1.000] with Probability 0.167
  To [State 2 and Reward 1.000] with Probability 0.167
  To [State 3 and Reward 1.000] with Probability 0.167
  To [State 14 and Reward 1.000] with Probability 0.167
  To [State 5 and Reward 1.000] with Probability 0.167
  To [State 6 and Reward 1.000] with Probability 0.167
From State 2:
  To [State 3 and Reward 1.000] with Probability 0.167
  To [State 14 and Reward 1.000] with Probability 0.167
```

## FiniteMarkovDecisionProcess

```
From State 1:
  With Action A:
    To [State 0 and Reward -1.000] with Probability 0.333
    To [State 2 and Reward 0.000] with Probability 0.667
  With Action B:
    To [State 0 and Reward -1.000] with Probability 0.333
    To [State 2 and Reward 0.000] with Probability 0.333
    To [State 3 and Reward 1.000] with Probability 0.333
From State 2:
  With Action A:
    To [State 1 and Reward 0.000] with Probability 0.667
    To [State 3 and Reward 1.000] with Probability 0.333
  With Action B:
```