# Finite Markov Decision Processes (MDPs)
`MDPs` are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and through those, future rewards. In `MDPs` we *estimate* the value $q_{*}(s, a)$ of each action $a$ in each state $s$, or we *estimate* the value $v_{*}(s)$ of each state given optimal action selections.

### The agent-environment interface
`MDPs` are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the **`Agent`**. The thing it *interacts with*, comprising everything outside the agent, is called the **`Environment`**.\
The two interact continually: the agent selects actions, and the environment responds to those actions and presents new situations (states) to the agent. The environment also gives rise to rewards, numerical values that the agent seeks to maximize over time through its choice of actions.
\
\
The *agent* and *environment* interact at each of a `sequence of discrete time steps`, $t = 0, 1, 2, \dots, N$. At each time step $t$, *the agent* receives some representation of the environment's state, $S_{t} \in S$, and on that basis selects an action, $A_{t} \in A(s)$. One time step later, in part as a consequence of its action $A_{t}$, *the agent* receives a numerical reward, $R_{t+1} \in R$, and finds itself in a new state, $S_{t+1}$.
\
The MDP and agent together thereby give rise to a sequence or `trajectory` that begins like this:
$$S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2},  S_{2}, A_{2}, R_{3}, ...$$ 
\
In *finite MDPs*, the sets of states, actions, and rewards ($S$, $A$, and $R$) all have a finite number of elements. In this case, the *random variables* $R_{t}$ and $S_{t}$ have well-defined discrete probability distributions that depend only on the preceding state and action. That is, for particular values of these *random variables*, $s' \in S$ and $r \in R$, there is a probability of those values occurring at time $t$, given particular values of the preceding state and action:
$$p(s', r \mid s, a) \doteq \Pr(S_{t} = s', R_{t} = r \mid S_{t-1} = s, A_{t-1} = a)$$
Note that $p$ specifies a probability distribution for each choice of state $s$ and action $a$; that is, the probabilities of all possible outcomes sum to one:
$$ \sum\limits_{s' \in S} \sum\limits_{r \in R} p(s', r \mid s, a) = 1, \quad \text{for all } s \in S, \; a \in A(s)$$
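
To make the dynamics function concrete, the following is a minimal Python sketch rather than a definitive implementation. The two-state MDP, its probabilities and rewards, and the names `dynamics`, `ACTIONS`, `p`, `is_normalized`, `step`, and `rollout` are all hypothetical, invented for illustration. The sketch stores $p(s', r \mid s, a)$ as a table, checks that every $(s, a)$ row sums to 1, and samples a trajectory $S_0, A_0, R_1, S_1, \dots$ under a uniformly random action choice.

```python
import random

# Hypothetical two-state MDP, made up for illustration only.
# dynamics[(s, a)] maps each outcome (s_next, r) to p(s_next, r | s, a).
dynamics = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "move"): {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 0.5): 1.0},
    ("s1", "move"): {("s0", -1.0): 0.5, ("s1", 0.0): 0.5},
}
ACTIONS = ("stay", "move")  # assumed to be available in every state


def p(s_next, r, s, a):
    """Return p(s', r | s, a); outcomes not listed have probability 0."""
    return dynamics[(s, a)].get((s_next, r), 0.0)


def is_normalized(tol=1e-9):
    """Check that the outcome probabilities sum to 1 for every (s, a)."""
    return all(
        abs(sum(outcomes.values()) - 1.0) <= tol
        for outcomes in dynamics.values()
    )


def step(s, a, rng):
    """Environment response: sample (s_next, r) from p(., . | s, a)."""
    outcomes = dynamics[(s, a)]
    pairs = list(outcomes.keys())
    weights = list(outcomes.values())
    return rng.choices(pairs, weights=weights, k=1)[0]


def rollout(s0, n_steps, rng):
    """Build the list [S_0, A_0, R_1, S_1, A_1, R_2, ...] for n_steps steps."""
    trajectory = [s0]
    s = s0
    for _ in range(n_steps):
        a = rng.choice(ACTIONS)      # agent: uniformly random action
        s_next, r = step(s, a, rng)  # environment: next state and reward
        trajectory += [a, r, s_next]
        s = s_next
    return trajectory


print(is_normalized())                    # True
print(p("s1", 1.0, "s0", "move"))         # 0.8
print(rollout("s0", 3, random.Random(0)))
```

The trajectory printed by the last line has 1 + 3·3 = 10 entries and starts with `"s0"`; the particular actions and rewards depend on the random seed.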

### Goals and Rewards
In RL, the purpose or goal of the agent is formalized in terms of a special signal, called the **`reward`**, passing from the environment to the agent. Informally, the agent's *goal* is `to maximize the total amount of reward` it receives. This means maximizing not *immediate reward*, but *cumulative reward* in the long run.
> That all of what we mean by goals and purposes can be well thought of as the maximization of the *expected value* of the cumulative sum of a received scalar signal (called reward).

`The use of a reward signal` to formalize the idea of a goal is one of the most distinctive features of reinforcement learning.\
The reward signal is not the place to impart to the agent prior knowledge about *how* to achieve what we want it to do. Rewards should mark the end goal, not subgoals: if achieving subgoals were rewarded, the agent might find a way to achieve them without achieving the end goal.

### Returns and Episodes
- In general, we seek to maximize the *expected return*, where the return, denoted $G_{t}$, is defined as some specific function of the reward sequence. In the simplest case it is the sum of the rewards received after time step $t$, where $T$ is the final time step:
$$ G_{t} \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_{T}$$
- This definition makes sense when the agent-environment interaction breaks naturally into subsequences, which we call **`episodes`**. Each episode ends in a special state called **`the terminal state`**, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states.
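
As a small illustration of the episodic return, the sketch below sums the rewards of one finished episode. The reward values and the function name `episodic_returns` are assumptions made up for this example; `rewards[i]` stands for $R_{i+1}$, and the return from the terminal step is taken to be 0.

```python
def episodic_returns(rewards):
    """Return [G_0, ..., G_T] with G_t = R_{t+1} + ... + R_T."""
    # rewards[t:] holds R_{t+1}, ..., R_T; the empty slice gives G_T = 0.
    return [sum(rewards[t:]) for t in range(len(rewards) + 1)]


# Hypothetical episode: three steps costing -1 each, then +10 at the end.
rewards = [-1, -1, -1, 10]
print(episodic_returns(rewards))  # [7, 8, 9, 10, 0]
```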

Thus the episodes can all be considered to end in the same terminal state, with different rewards for the different outcomes. Tasks with episodes of this kind are called **`episodic tasks`**. In contrast, if the agent-environment interaction does not break naturally into identifiable episodes but goes on continually without limit, we call these **`continuing tasks`**. \
The return formulation is problematic for continuing tasks because the final time step would be $T = \infty$, and the return, which is what we are trying to maximize, could itself easily be infinite. To solve this we introduce a *discount factor*:
$$ G_{t} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum\limits_{k=0}^\infty \gamma^k R_{t+k+1}$$
where $\gamma$ is a parameter, $0 \leq \gamma \leq 1$, called the **`discount factor`**. *This discount rate determines the present value of future rewards: a reward received $k$ time steps in the future is worth only $\gamma^{k-1}$ times what it would be worth if it were received immediately.* As $\gamma$ approaches 1, the return objective takes future rewards into account more strongly; the agent becomes more *farsighted*.\
\
Returns at successive time steps are related to each other in a way that is important for the theory and algorithms of reinforcement learning:
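$$G_{t} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma \left(R_{t+2} + \gamma R_{t+3} + \dots\right) = R_{t+1} + \gamma G_{t+1}$$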
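
The recursive relationship suggests computing all returns of an episode in a single backward pass. The sketch below assumes a finite episode, so that $G_T = 0$; the reward list, the value $\gamma = 0.5$, and the function names `discounted_return` and `returns_backward` are hypothetical choices for illustration. It also compares the result against the direct discounted sum.

```python
def discounted_return(rewards, gamma, t=0):
    """Direct sum: G_t = sum_k gamma^k * R_{t+k+1}, with rewards[i] = R_{i+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))


def returns_backward(rewards, gamma):
    """Compute [G_0, ..., G_T] using G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * (len(rewards) + 1)  # G_T = 0 after the final step
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * returns[t + 1]
    return returns


# Hypothetical episode: a reward of 1 at each of four steps.
rewards = [1.0, 1.0, 1.0, 1.0]
gamma = 0.5
print(returns_backward(rewards, gamma))   # [1.875, 1.75, 1.5, 1.0, 0.0]
print(discounted_return(rewards, gamma))  # 1.875
```

The backward pass needs one multiplication and one addition per time step, whereas recomputing the direct sum for every $t$ repeats most of the work.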