# Finite Markov Decision Process

## Agent-Environment interface

MDPs are meant to be a straightforward framing of the problem of learning from
interaction to achieve a goal.

1. Agent: the learner and decision maker.
2. Environment: the thing the agent interacts with, comprising everything outside the agent.
3. Reward: the special numerical value that the agent seeks to maximize over time through its choice of actions.

The agent selects actions, and the environment responds to these actions by presenting new situations to the agent.
The environment also gives rise to rewards. In general, actions can be any decisions we want to learn how to make, and
states can be anything we can know that might be useful in making them.

<img src="agent-environment-interaction.png">

At each discrete time step, $t = 0, 1, 2, \ldots$, the agent receives some representation of the environment's state, $S_t \in S$,
and on that basis selects an action, $A_t \in A(S_t)$. One time step later, in part as a consequence of its action $A_t$, the agent receives
a numerical reward, $R_{t+1} \in R$, and finds itself in a new state, $S_{t+1}$.

The MDP and the agent together give rise to a sequence, or trajectory, that begins like this:

$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots$
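The interaction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-state `DYNAMICS` table, the state and action names, and the uniformly random placeholder agent are all hypothetical and do not come from the text. Each step records the action, then the reward, then the next state.

```python
import random

# Hypothetical toy dynamics (not from the text):
# DYNAMICS[(state, action)] is a list of (next_state, reward, probability).
DYNAMICS = {
    ("low", "wait"): [("low", 0.0, 1.0)],
    ("low", "search"): [("high", 1.0, 0.7), ("low", -1.0, 0.3)],
    ("high", "wait"): [("high", 0.0, 1.0)],
    ("high", "search"): [("high", 1.0, 0.6), ("low", 1.0, 0.4)],
}
ACTIONS = ("wait", "search")


def env_step(state, action, rng):
    """Environment: sample (next_state, reward) for the chosen action."""
    outcomes = DYNAMICS[(state, action)]
    weights = [prob for _, _, prob in outcomes]
    next_state, reward, _ = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def rollout(start_state, num_steps, seed=0):
    """Agent-environment loop; returns [S_0, A_0, R_1, S_1, A_1, R_2, ...]."""
    rng = random.Random(seed)
    state = start_state
    trajectory = [state]
    for _ in range(num_steps):
        action = rng.choice(ACTIONS)  # placeholder agent: uniformly random
        state, reward = env_step(state, action, rng)
        trajectory += [action, reward, state]
    return trajectory


print(rollout("low", num_steps=3))
```

A learning agent would replace the random choice with a policy that depends on the current state; the loop itself stays the same.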

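In a finite MDP, the sets $S$, $A(s)$, and $R$ are all finite, so the random variables $S_t$ and $R_t$ have well-defined discrete probability distributions that depend only on the preceding state and action:

$p(s^{\prime}, r| s, a) = p(S_{t} = s^{\prime}, R_t = r | S_{t-1} = s, A_{t-1} = a)$

The function $p$ defines the dynamics of the MDP: it gives the probability of the current state $s^{\prime}$ and reward $r$ given the previous state $s$ and action $a$. Because $p$ is a conditional probability distribution for each choice of $s$ and $a$, it has to satisfy: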

$\sum_{s^{\prime} \in S} \sum_{r \in R} p(s^{\prime}, r| s, a) = 1$, for all $s \in S, a \in A(s)$.

This follows from the definition of conditional probability, where $p(s, a)$ denotes the joint probability of $S_{t-1} = s$ and $A_{t-1} = a$:

$\sum_{s^{\prime} \in S} \sum_{r \in R} p(s^{\prime}, r| s, a) = \sum_{s^{\prime} \in S} \sum_{r \in R} \frac{p(s^{\prime}, s, a, r)}{p(s, a)} = \frac{\sum_{s^{\prime} \in S} \sum_{r \in R} p(s^{\prime}, s, a, r)}{p(s, a)} = \frac{p(s, a)}{p(s, a)} = 1$
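As a quick sanity check, the normalization condition can be verified numerically for a tabular dynamics function. The sketch below assumes the dynamics are stored as a nested dictionary `P[(s, a)][(s_next, r)]`; that layout, the helper name `is_normalized`, and the example numbers are hypothetical choices made for illustration.

```python
import math

# Hypothetical dynamics table: P[(s, a)][(s_next, r)] = p(s_next, r | s, a).
P = {
    ("low", "wait"): {("low", 0.0): 1.0},
    ("low", "search"): {("high", 1.0): 0.7, ("low", -1.0): 0.3},
    ("high", "wait"): {("high", 0.0): 1.0},
    ("high", "search"): {("high", 1.0): 0.6, ("low", 1.0): 0.4},
}


def is_normalized(dynamics, tol=1e-9):
    """Return True if the probabilities sum to 1 for every (s, a) pair."""
    return all(
        math.isclose(sum(dist.values()), 1.0, abs_tol=tol)
        for dist in dynamics.values()
    )


print(is_normalized(P))
```

The tolerance guards against floating-point rounding when the probabilities are not exactly representable.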

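The dynamics are also assumed to have the Markov (memoryless) property: the probability of each possible value of $S_t$ and $R_t$ depends only on the immediately preceding state and action, $S_{t-1}$ and $A_{t-1}$, and not on earlier states and actions.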

From p, we can compute anything else one might need to know about the environment:

1. State-transition probabilities (the probability of the current state given the previous state and action): $p(s^{\prime} | s, a) = p(S_t = s^{\prime} | S_{t-1} = s, A_{t-1} = a) = \sum_{r \in R} p(s^{\prime}, r| s, a)$
2. Expected rewards for state-action pairs: $r(s, a) = E(R_t | S_{t-1}=s, A_{t-1} = a) = \sum_{r \in R} r \, p(R_t = r| S_{t-1}=s, A_{t-1} = a) = \sum_{r \in R} r \sum_{s^{\prime} \in S}p(S_t = s^{\prime}, R_t = r| S_{t-1}=s, A_{t-1} = a) = \sum_{r \in R} r \sum_{s^{\prime} \in S}p(s^{\prime}, r| s, a)$
3. Expected rewards for state-action-next-state triples: $r(s, a, s^{\prime}) = E(R_t | S_{t-1} = s, A_{t-1} = a, S_{t} = s^{\prime}) = \sum_{r \in R} r \, p(R_t = r| S_{t-1}=s, A_{t-1} = a, S_{t} = s^{\prime}) = \sum_{r \in R} r \, \frac{p(R_t = r, S_{t-1}=s, A_{t-1} = a, S_{t} = s^{\prime})}{p(S_{t-1}=s, A_{t-1} = a, S_{t} = s^{\prime})} = \sum_{r \in R} r \, \frac{p(S_{t} = s^{\prime}, R_t = r | S_{t-1}=s, A_{t-1} = a)}{p(S_{t} = s^{\prime} |S_{t-1}=s, A_{t-1} = a)} = \sum_{r \in R} r \, \frac{p(s^{\prime}, r | s, a)}{p(s^{\prime} |s, a)}$
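The three derived quantities above can be computed directly from a tabular dynamics function. The following is a minimal sketch, assuming the dynamics are stored as a nested dictionary `P[(s, a)][(s_next, r)]`; the storage layout, the function names, and the example numbers are hypothetical and chosen only for illustration.

```python
# Hypothetical dynamics table: P[(s, a)][(s_next, r)] = p(s_next, r | s, a).
P = {
    ("low", "search"): {("high", 1.0): 0.7, ("low", -1.0): 0.3},
    ("high", "search"): {("high", 2.0): 0.3, ("high", 1.0): 0.3, ("low", 1.0): 0.4},
}


def transition_prob(p, s, a, s_next):
    """p(s' | s, a): sum p(s', r | s, a) over all rewards r."""
    return sum(prob for (sn, _), prob in p[(s, a)].items() if sn == s_next)


def expected_reward(p, s, a):
    """r(s, a): sum of r * p(s', r | s, a) over all next states and rewards."""
    return sum(r * prob for (_, r), prob in p[(s, a)].items())


def expected_reward_given_next(p, s, a, s_next):
    """r(s, a, s'): sum over r of r * p(s', r | s, a) / p(s' | s, a)."""
    denom = transition_prob(p, s, a, s_next)
    if denom == 0:
        raise ValueError("s_next is unreachable from (s, a)")
    numer = sum(r * prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)
    return numer / denom


print(transition_prob(P, "high", "search", "high"))
print(expected_reward(P, "high", "search"))
print(expected_reward_given_next(P, "high", "search", "high"))
```

The third function is only defined when $p(s^{\prime} | s, a) > 0$, which is why the sketch raises an error for unreachable next states instead of dividing by zero.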

## Goals and Rewards

The agent's goal is to maximize the total amount of reward it receives. This informal idea can be stated as the reward hypothesis:
**That all of what we mean by goals and purposes can be well thought of as
the maximization of the expected value of the cumulative sum of a received
scalar signal (called reward).**

It is thus critical that the rewards we set up truly indicate what we want accomplished.
In particular, the reward signal is not the place to impart to the agent prior knowledge
about how to achieve what we want it to do. For example, a chess-playing agent should
be rewarded only for actually winning, not for achieving subgoals such as taking its
opponent’s pieces or gaining control of the center of the board. If achieving these sorts
of subgoals were rewarded, then the agent might find a way to achieve them without
achieving the real goal. Better places for imparting this kind of prior knowledge are the initial policy or initial value function.