# What is reinforcement learning? 

In very simple terms, an agent learns to pick actions in an environment to get more reward over time. The point is to make the agent learn the process of 

## Markov Decision Processes

Markov Decision Processes (MDPs) provide a way to formalize sequential decision making. An MDP has the following components:

1. Environment: A surrounding or scenario which defines our world.
2. Agent: A model that interacts with the environment it is placed in.
3. State: A configuration of the environment that the agent faces at a given time step.
4. Action: A step taken by the agent to bring about a change of state in its environment.
5. Reward: A score given to the agent as a result of that action.

This interaction repeats at every time step, and the resulting sequence of states, actions, and rewards is referred to as a trajectory. The agent's aim is to maximize cumulative reward, i.e., the reward it collects across all time steps, not just at a single one.
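The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: the `Corridor` environment, its reward of 1 on reaching the last cell, and the `reset`/`step` method names are assumptions made for this example, not part of any particular library.

```python
import random


class Corridor:
    """Hypothetical toy environment: a row of cells; the episode ends at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to the corridor.
        self.state = max(0, min(self.length - 1, self.state + action))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # assumed reward scheme
        return self.state, reward, done


def run_episode(env, policy, max_steps=1000):
    """Roll out one trajectory; return the (state, action, reward) steps and the total reward."""
    state = env.reset()
    trajectory, total = [], 0.0
    for _ in range(max_steps):
        action = policy(state)                       # agent picks an action
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward))
        total += reward
        state = next_state
        if done:
            break
    return trajectory, total


def always_right(state):
    return 1


def random_policy(state):
    return random.choice([-1, 1])


trajectory, total = run_episode(Corridor(), always_right)
print(trajectory, total)
```

Swapping `always_right` for `random_policy` shows how the same loop produces a different (usually much longer) trajectory for a worse policy.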

### Formal Problem Notation

In an MDP, we have a set of states **S**, a set of actions **A**, and a set of rewards **R**. Each of these sets is assumed to have a finite number of elements.

1. At each time step $t = 0, 1, 2, \dots$, the agent is in a state $S_t \in S$ and takes an action $A_t \in A$. So, for each time step, we have a state-action pair $(S_t, A_t)$.
2. As a result, the environment moves to a new state $S_{t+1} \in S$ and the agent receives a numerical reward $R_{t+1} \in R$.
3. The reward can be expressed as a function of the state and the action.
   $$f(S_t,A_t)= R_{t+1}$$
4. The trajectory is the resulting sequence of states, actions, and rewards.
   $$ S_0, A_0, R_1, S_1, A_1, R_2, ...$$
5. Given that **S** and **R** are finite, the random variables $S_t$ and $R_t$ have well-defined probability distributions. The probability of moving to state $s'$ and receiving reward $r$, after taking action $a$ in state $s$, is defined as
   $$ p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\} $$
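To make the notation concrete, here is a minimal sketch of how $p(s', r \mid s, a)$ could be stored as a lookup table for a tiny MDP. The state names, action names, and probabilities are all invented for illustration; the one property that should hold for any such table is that the probabilities for each state-action pair sum to 1.

```python
import random

# Hypothetical two-state MDP. dynamics[(s, a)] maps each outcome (s', r)
# to its probability p(s', r | s, a). All numbers are made up.
dynamics = {
    ("low", "wait"):    {("low", 0.0): 1.0},
    ("low", "charge"):  {("high", 0.0): 1.0},
    ("high", "wait"):   {("high", 0.0): 1.0},
    ("high", "search"): {("high", 1.0): 0.7, ("low", 1.0): 0.3},
}


def p(s_next, r, s, a):
    """Look up p(s', r | s, a); outcomes not listed have probability 0."""
    return dynamics[(s, a)].get((s_next, r), 0.0)


def is_valid(table, tol=1e-9):
    """Check that, for every (s, a), the outcome probabilities sum to 1."""
    return all(abs(sum(outcomes.values()) - 1.0) < tol for outcomes in table.values())


def sample_step(s, a, rng):
    """Draw one (s', r) outcome according to p(. | s, a)."""
    outcomes = list(dynamics[(s, a)].items())
    pairs = [pair for pair, _ in outcomes]
    weights = [weight for _, weight in outcomes]
    return rng.choices(pairs, weights=weights, k=1)[0]


rng = random.Random(0)
print(is_valid(dynamics))
print(p("low", 1.0, "high", "search"))
print(sample_step("high", "search", rng))
```

Calling `sample_step` repeatedly, feeding each sampled state back in with a chosen action, generates a trajectory $S_0, A_0, R_1, S_1, \dots$ of the kind listed above.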

### Expected Return

In an MDP, the main driving force behind the agent is maximizing its cumulative reward, known as the *return*. Because rewards are random, what the agent actually maximizes is the *expected return*. We can define the return as

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$$

where *T* is the final time step, and the goal of the agent is to maximize $G_t$.
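As a small worked example, the return at every time step of a finished episode can be computed in one backward pass, using the fact that $G_t = R_{t+1} + G_{t+1}$. The function name `returns` and the sample rewards are assumptions for illustration.

```python
def returns(rewards):
    """Given rewards [R_1, ..., R_T], return [G_0, ..., G_{T-1}],
    where G_t = R_{t+1} + R_{t+2} + ... + R_T."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]  # rewards[t] holds R_{t+1}
        out[t] = running
    return out


rewards = [0.0, 0.0, 1.0, 2.0]  # made-up rewards R_1..R_4 from one episode
print(returns(rewards))  # [3.0, 3.0, 3.0, 2.0]
```

Walking backward avoids re-summing the tail of the episode for every $t$.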

However, there are two issues if we just add up rewards forever:

1. **Infinity problem**: If an episode never ends, the sum may diverge. For example, a robot that receives a reward of +1 at every step accumulates an infinite return.

2. **Unrealistic planning**: We usually care more about rewards that arrive soon than about rewards far in the future. For example, a robot that delivers coffee in 10 steps is better than one that takes 1000 steps, even if both eventually succeed, yet a plain sum of rewards can score them the same.
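Both issues are easy to see numerically. The sketch below assumes, for the coffee example, a reward of +1 on delivery and 0 on every other step; the text does not specify a reward scheme, so treat that as an illustrative choice.

```python
def undiscounted_return(rewards):
    """Plain sum of rewards, as in the definition of G_t above."""
    return sum(rewards)


# Issue 1: +1 at every step. The return equals the horizon, so it grows without bound.
for horizon in (10, 1_000, 100_000):
    print(horizon, undiscounted_return([1.0] * horizon))

# Issue 2: +1 only on delivery (assumed reward scheme), 0 otherwise.
fast = [0.0] * 9 + [1.0]    # delivers at step 10
slow = [0.0] * 999 + [1.0]  # delivers at step 1000
print(undiscounted_return(fast) == undiscounted_return(slow))  # True
```

Under this reward scheme the plain sum cannot tell the fast robot from the slow one, which is the gap the second issue points at.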