# Finite Markov Decision Process

## Agent-Environment interface

MDPs are meant to be a straightforward framing of the problem of learning from
interaction to achieve a goal.

1. Agent: the learner and decision maker.
2. Environment: everything outside the agent, which the agent interacts with.
3. Reward: a special numerical value that the agent seeks to maximize over time through its choice of actions.

The agent selects actions, and the environment responds to these actions and presents new situations to the agent.
The environment also gives rise to rewards. In general, actions can be any decisions we want to learn how to make, and
states can be anything we can know that might be useful in making them.

<img src="agent-environment-interaction.png">

At each discrete time step $t = 0, 1, 2, ...$, the agent receives some representation of the environment's state, $S_t \in S$,
and on that basis selects an action, $A_t \in A$. One time step later, in part as a consequence of its action $A_t$, the agent receives
a numerical reward, $R_{t+1} \in R$, and finds itself in a new state, $S_{t+1}$.

The MDP and the agent together give rise to a sequence, or trajectory, that begins like this:

$S_0, A_0, R_{1}, S_{1}, A_{1}, R_{2}, S_{2}, A_{2}, R_{3}, ...$
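The interaction loop can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the two-state MDP in `DYNAMICS`, the state and action names, and the uniformly random `random_policy` are all hypothetical and chosen only to make the loop runnable. The trajectory is recorded in the order state, action, reward, next state.

```python
import random

# Hypothetical two-state MDP, made up for illustration only.
# DYNAMICS[(state, action)] lists (next_state, reward, probability) triples.
DYNAMICS = {
    ("low", "wait"): [("low", 0.0, 1.0)],
    ("low", "work"): [("high", 1.0, 0.7), ("low", -1.0, 0.3)],
    ("high", "wait"): [("high", 0.5, 0.9), ("low", 0.0, 0.1)],
    ("high", "work"): [("high", 2.0, 0.6), ("low", 0.0, 0.4)],
}
ACTIONS = ["wait", "work"]


def env_step(state, action, rng):
    """Environment: sample (next_state, reward) for the given state and action."""
    outcomes = DYNAMICS[(state, action)]
    weights = [prob for _, _, prob in outcomes]
    next_state, reward, _ = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def random_policy(state, rng):
    """Agent (placeholder): choose an action uniformly at random."""
    return rng.choice(ACTIONS)


def rollout(start_state, num_steps, seed=0):
    """Return the trajectory as a flat list [S0, A0, R1, S1, A1, R2, S2, ...]."""
    rng = random.Random(seed)
    state = start_state
    trajectory = [state]
    for _ in range(num_steps):
        action = random_policy(state, rng)              # A_t, chosen from S_t
        next_state, reward = env_step(state, action, rng)  # S_{t+1}, R_{t+1}
        trajectory += [action, reward, next_state]
        state = next_state
    return trajectory


if __name__ == "__main__":
    print(rollout("low", num_steps=3))
```

A learning agent would replace `random_policy` with something that improves from the rewards it observes; the loop itself stays the same.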

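In a finite MDP, the sets of states, actions, and rewards ($S$, $A(s)$, and $R$) all have a finite number of elements. In this case the random variables $S_t$ and $R_t$ have well-defined discrete probability distributions that depend only on the preceding state and action: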

$p(s^{\prime}, r| s, a) = p(S_{t} = s^{\prime}, R_t = r | S_{t-1} = s, A_{t-1} = a)$

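The function $p$ defines the dynamics of the MDP: it gives the probability of arriving in state $s^{\prime}$ and receiving reward $r$, given the previous state $s$ and action $a$. Because $p$ is a conditional probability distribution over pairs $(s^{\prime}, r)$ for each choice of $s$ and $a$, it has to satisfy: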

$\sum_{s^{\prime} \in S} \sum_{r \in R} p(s^{\prime}, r| s, a) = 1$, for all $s \in S, a \in A(s)$.

This follows from the definition of conditional probability:

$\sum_{s^{\prime} \in S} \sum_{r \in R} p(s^{\prime}, r| s, a) = \sum_{s^{\prime} \in S} \sum_{r \in R} \frac{p(s^{\prime}, s, a, r)}{p(s, a)} = \frac{\sum_{s^{\prime} \in S} \sum_{r \in R} p(s^{\prime}, s, a, r)}{p(s, a)} = \frac{p(s, a)}{p(s, a)} = 1$
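One way to make the normalization condition concrete is to store $p$ as a lookup table and check that each conditional distribution sums to one. The sketch below assumes a small made-up MDP; the table `P`, its state and action labels, and the helper `is_valid_dynamics` are hypothetical names introduced only for this illustration.

```python
import math

# Hypothetical dynamics table: P[(s, a)][(s_next, r)] = p(s_next, r | s, a).
P = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 1.0): 0.8, ("s1", 5.0): 0.2},
    ("s1", "a0"): {("s0", 0.0): 1.0},
}


def is_valid_dynamics(p, tol=1e-9):
    """Check that p(., . | s, a) is a probability distribution for every (s, a)."""
    for dist in p.values():
        if any(prob < 0.0 for prob in dist.values()):
            return False
        # Double sum over s' and r, stored here as a single sum over (s', r) keys.
        if not math.isclose(sum(dist.values()), 1.0, abs_tol=tol):
            return False
    return True


if __name__ == "__main__":
    print(is_valid_dynamics(P))
```

Storing the joint outcome $(s^{\prime}, r)$ as the inner key mirrors the four-argument form of $p$, so the double sum in the condition becomes one sum per state-action pair.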

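In addition, $p$ is assumed to satisfy the Markov (memoryless) property: the probability of each possible value of $S_t$ and $R_t$ depends only on the immediately preceding state and action, $S_{t-1}$ and $A_{t-1}$, and not on any earlier states or actions.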

From p, we can compute anything else one might need to know about the environment:

1. probability of current state given previous state and action: $p(s^{\prime} | s, a) = p(S_t = s^{\prime} | S_{t-1} = s, A_{t-1} = a) = \sum_{r \in R} p(s^{\prime}, r| s, a)$
2. expected reward for state-action pairs: $r(s, a) = E(R_t | S_{t-1}=s, A_{t-1} = a) = \sum_{r \in R} r \, p(R_t = r| S_{t-1}=s, A_{t-1} = a) = \sum_{r \in R} r \sum_{s^{\prime} \in S} p(S_t = s^{\prime}, R_t = r| S_{t-1}=s, A_{t-1} = a) = \sum_{r \in R} r \sum_{s^{\prime} \in S} p(s^{\prime}, r| s, a)$
3. expected reward for state-action-next-state triples: $r(s, a, s^{\prime}) = E(R_t | S_{t-1} = s, A_{t-1} = a, S_{t} = s^{\prime}) = \sum_{r \in R} r \, p(R_t = r| S_{t-1}=s, A_{t-1} = a, S_{t} = s^{\prime}) = \sum_{r \in R} r \, \frac{p(S_{t} = s^{\prime}, R_t = r, S_{t-1}=s, A_{t-1} = a)}{p(S_{t} = s^{\prime}, S_{t-1}=s, A_{t-1} = a)} = \sum_{r \in R} r \, \frac{p(S_{t} = s^{\prime}, R_t = r | S_{t-1}=s, A_{t-1} = a)}{p(S_{t} = s^{\prime} |S_{t-1}=s, A_{t-1} = a)} = \sum_{r \in R} r \, \frac{p(s^{\prime}, r | s, a)}{p(s^{\prime} |s, a)}$
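The three derived quantities can be computed directly from a tabular $p$. The following is a sketch under stated assumptions: the table `P` is a made-up example, and the function names are hypothetical helpers, not part of any library.

```python
# Hypothetical dynamics table: P[(s, a)][(s_next, r)] = p(s_next, r | s, a).
P = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 1.0): 0.8, ("s1", 5.0): 0.2},
}


def state_transition_prob(p, s, a, s_next):
    """p(s' | s, a): marginalize the reward out of p(s', r | s, a)."""
    return sum(prob for (sn, _), prob in p[(s, a)].items() if sn == s_next)


def expected_reward(p, s, a):
    """Expected reward for a state-action pair: sum over s' and r of r * p(s', r | s, a)."""
    return sum(r * prob for (_, r), prob in p[(s, a)].items())


def expected_reward_given_next(p, s, a, s_next):
    """Expected reward for a triple: sum over r of r * p(s', r | s, a) / p(s' | s, a)."""
    denom = state_transition_prob(p, s, a, s_next)
    if denom == 0.0:
        raise ValueError("transition has zero probability; expectation is undefined")
    numer = sum(r * prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)
    return numer / denom


if __name__ == "__main__":
    print(state_transition_prob(P, "s0", "a1", "s1"))
    print(expected_reward(P, "s0", "a1"))
    print(expected_reward_given_next(P, "s0", "a0", "s1"))
```

The explicit zero check reflects that the conditional expectation for a triple is only defined when the transition to $s^{\prime}$ has nonzero probability.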

## Goals and Rewards

The agent's goal is to maximize the total amount of reward it receives. This informal idea can be stated as the reward hypothesis:
**That all of what we mean by goals and purposes can be well thought of as
the maximization of the expected value of the cumulative sum of a received
scalar signal (called reward).**

It is thus critical that the rewards we set up truly indicate what we want accomplished.
In particular, the reward signal is not the place to impart to the agent prior knowledge
about how to achieve what we want it to do. For example, a chess-playing agent should
be rewarded only for actually winning, not for achieving subgoals such as taking its
opponent’s pieces or gaining control of the center of the board. If achieving these sorts
of subgoals were rewarded, then the agent might find a way to achieve them without
achieving the real goal. Better places for imparting this kind of prior knowledge are the initial policy or initial value function.

## Returns and Episodes

In general, we seek to maximize the expected return, where the return, denoted $G_t$, is defined as some specific function of the
reward sequence. In the simplest case, the return is the sum of the rewards:

$G_t = R_{t+1} + R_{t+2} + ... + R_{T}$

### Episodic Tasks
Here $T$ is the final time step. This formulation makes sense when there is a natural notion of a final time step, that is, when the agent-environment interaction breaks naturally into
subsequences, which are called *episodes*. Each episode ends in a special state called the *terminal state*, followed by a reset to a standard
starting state or to a sample from a standard distribution of starting states. Even if you think of episodes as
ending in different ways, such as winning and losing a game, the next episode begins
independently of how the previous one ended. Thus the episodes can all be considered to
end in the same terminal state, with different rewards for the different outcomes. Tasks with episodes of this kind are called
*episodic tasks*. In episodic tasks, we sometimes need to distinguish the set of all nonterminal states, denoted $S$, from the set of all states including the terminal state, denoted $S^+$.
The time of termination, $T$, is a random variable that normally varies from episode to episode.

Examples: plays of a game,
trips through a maze, or any sort of repeated interaction.

### Continuing Tasks
On the other hand, in many cases the agent–environment interaction does not break
naturally into identifiable episodes, but goes on continually without limit. We call these *continuing tasks*.
The return formulation above is problematic in this case because the final time step would be $T=\infty$, and the return, which is what we are trying to maximize, could itself be infinite.

Examples: the natural way to formulate an on-going process-control task, or an
application to a robot with a long life span.

#### Discounted rewards

To keep the return finite in continuing tasks, we use *discounting*. Under this approach,
the agent tries to select actions so that the sum of the discounted rewards it receives over
the future is maximized. In particular, it chooses $A_t$ to maximize the expected discounted return:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} +... = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$

where $\gamma \in [0, 1]$ is a parameter called the *discount rate*.

The *discount rate* determines the present value of future rewards: a reward received $k$ time steps in the future is worth only $\gamma^{k-1}$ times what it would be
worth if it were received immediately.

1. If $\gamma < 1$, $G_t$ has a finite value as long as the reward sequence $\{R_k\}$ is bounded.
2. If $\gamma = 0$, the agent is myopic: it is concerned only with maximizing immediate rewards, so its objective is to learn how to choose $A_t$ so as to maximize only $R_{t+1}$.
3. As $\gamma \rightarrow 1$, the return objective takes future rewards into account more strongly, and the agent becomes more farsighted.
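
As a quick worked case (an illustration that assumes every reward is a constant $+1$ and that $\gamma < 1$), the discounted return reduces to a geometric series:

$G_t = \sum_{k=0}^{\infty} \gamma^{k} = \frac{1}{1 - \gamma}$

For example, $\gamma = 0.9$ gives $G_t = 10$ and $\gamma = 0.5$ gives $G_t = 2$, so the return stays finite even though the sum has infinitely many terms, and a larger $\gamma$ places more weight on the distant future.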

Returns at successive time steps are related to each other in a way that is important for theory and algorithms of reinforcement learning:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + ...) = R_{t+1} + \gamma G_{t+1}$
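
This recursion suggests computing all returns of a finished episode in one backward pass. The sketch below assumes a finite list of rewards and the convention that the return after the final step is zero; the function name `returns_from_rewards` and the indexing convention (`rewards[t]` holds $R_{t+1}$) are choices made for this example only.

```python
def returns_from_rewards(rewards, gamma):
    """Compute G_t for every step of a finished episode.

    rewards[t] holds R_{t+1}, the reward received after acting at time t.
    Uses G_t = R_{t+1} + gamma * G_{t+1}, assuming the return after the
    final step is 0.
    """
    returns = [0.0] * len(rewards)
    g = 0.0  # return following the last step (assumed zero)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


if __name__ == "__main__":
    print(returns_from_rewards([1.0, 2.0, 3.0], gamma=0.5))  # [2.75, 3.5, 3.0]
```

Working backward reuses each $G_{t+1}$ instead of re-summing the tail of the reward sequence, so the whole episode is processed in a single pass.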
