# Introduction to Markov Decision Processes

In [None]:
# Modules used in this Notebook
%matplotlib inline
from IPython.display import display, clear_output
import matplotlib.pyplot as plt
import numpy as np
import random
random.seed(6)
np.random.seed(5)

# user-defined functions
import utils

In the k-Armed Bandit problem, the agent faces the same situation at every step, and the same action is always optimal. In many problems, however, different situations call for different responses, and the actions we choose now may affect the amount of reward we can get in the future. The **Markov Decision Process** (MDP) formalism captures these aspects of real-world problems. 

Imagine a rabbit wandering around a field looking for food. It finds itself in a situation where there is a carrot to its right and broccoli to its left. The rabbit prefers carrots, so eating the carrot generates a reward of $+10$. 

<img src="images/reward_carrot.gif" width="40%" align="center"/>

Eating the broccoli, on the other hand, generates a reward of only $+3$. 

<img src="images/reward_broccolli.gif" width="40%" align="center"/>

But what if the rabbit later finds itself in another situation, where there's broccoli on the right and a carrot on the left? The rabbit would clearly prefer to go left instead of right. The k-Armed Bandit problem does not account for the fact that different situations call for different actions. It's also limited in another way. Let's say we do account for different actions in different situations. In the original situation, it looks like the rabbit would like to go right to get the carrot. However, going right will also affect the next situation the rabbit sees. Let's say that just to the right of the carrot there is a tiger. If the rabbit moves right, it gets to eat the carrot, but afterwards it may not be fast enough to escape the tiger. 

<img src="images/reward_tiger.gif" width="60%" align="center"/>

If we account for the long-term impact of our actions, the rabbit should go left and settle for broccoli to give itself a better chance to escape. A bandit rabbit would only be concerned about immediate reward and so it would go for the carrot. But a better decision can be made by considering the long-term impact of our decisions. 

Now, let's look at how the situation changes as the rabbit takes actions. We will call these situations **states**. In each state the rabbit selects an action. For instance, the rabbit can choose to move right. Based on this action, the world changes into a new state and produces a reward. In this case, the rabbit eats the carrot and receives a reward of $+10$. However, the rabbit is now next to the tiger. Let's say the rabbit then chooses the left action. The world changes into a new state in which the tiger eats the rabbit, and the rabbit receives a reward of $-100$. From the original state the rabbit could alternatively choose to move left. Then the world transitions into a different new state and the rabbit receives a reward of $+3$.

<img src="images/rabit_states.svg" width="60%" align="center"/>
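As a quick check of the reasoning above, the following minimal sketch adds up the rewards along each of the two sequences, using the reward values from the text. The dictionary layout and variable names are illustrative assumptions, not part of the original notebook.

```python
# Rewards received along each sequence, keyed by the rabbit's first action.
# Values come from the text: carrot +10, tiger -100, broccoli +3.
sequences = {
    "right": [10, -100],  # eat the carrot, then get caught by the tiger
    "left": [3],          # settle for the broccoli
}

# A bandit-style rabbit looks only at the first reward.
immediate = {action: rewards[0] for action, rewards in sequences.items()}
# A long-term view sums every reward along the sequence.
total = {action: sum(rewards) for action, rewards in sequences.items()}

greedy_choice = max(immediate, key=immediate.get)
long_term_choice = max(total, key=total.get)
print(greedy_choice, long_term_choice)  # right left
```

The immediate reward favors going right, while the summed reward ($-90$ versus $+3$) favors going left.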

The diagram now shows two potential sequences of states. The sequence that happens depends on the actions that the rabbit takes. We can formalize this interaction with a general framework in which the agent and environment interact at discrete time steps. At each time step $t$, the agent receives a state $S_t$ from the environment, drawn from a set of possible states $\mathcal{S}$. Based on this state, the agent selects an action $A_t$ from $\mathcal{A}(S_t)$, the set of valid actions in state $S_t$. Moving right is an example of an action. One time step later, based in part on the agent's action, the agent finds itself in a new state $S_{t+1}$, for example the state where the rabbit is next to the tiger. The environment also provides a scalar reward $R_{t+1}$ drawn from a set of possible rewards $\mathcal{R}$. In this case, the reward is $+10$ for eating the carrot. The diagram below summarizes the agent-environment interaction in the MDP framework. 

<img src="images/mdp_rabbit.gif" width="70%" align="center"/>

The agent-environment interaction generates a trajectory of experience consisting of states, actions, and rewards. Actions influence immediate rewards as well as future states and, through those, future rewards. As in bandits, the outcomes are stochastic, so we describe the dynamics of this interaction with probabilities. When the agent takes an action in a state, there are many possible next states and rewards. The transition dynamics function $p$ formalizes this notion. 

$$
p(s', r|s, a)
$$

Given a state $s$ and action $a$, $p$ tells us the joint probability of the next state $s'$ and reward $r$. Here, we assume that the sets of states, actions, and rewards are finite. Since $p$ is a probability distribution, it must be non-negative and its sum over all possible next states and rewards must equal one. 

$$
p : \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \ \ \ \ \rightarrow \ \ \ \ [0,1] \\
\sum\limits_{s' \in \mathcal{S}} \sum\limits_{r \in \mathcal{R}} p(s', r|s, a) = 1, \forall s \in \mathcal{S}, a \in \mathcal{A}(s)
$$
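Two quantities follow from $p$ by summing out one of its arguments. They are written here as a worked illustration; the symbols $p(s'|s,a)$ and $r(s,a)$ are standard notation but are an addition to this notebook rather than something defined above.

$$
p(s'|s, a) = \sum\limits_{r \in \mathcal{R}} p(s', r|s, a) \\
r(s, a) \doteq \mathbb{E}[R_{t+1} | S_t = s, A_t = a] = \sum\limits_{r \in \mathcal{R}} r \sum\limits_{s' \in \mathcal{S}} p(s', r|s, a)
$$

The first is the state-transition probability and the second is the expected immediate reward for taking action $a$ in state $s$.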

Note that the future state and reward depend only on the current state and action. This is called the **Markov property**. It means that:

> The present state is sufficient and remembering earlier states would not improve predictions about the future. 

In summary, MDPs provide a general framework for sequential decision making and the dynamics of an MDP are defined by a probability distribution.

## Examples of MDPs

Consider a recycling robot that collects empty soda cans in an office environment. It can detect soda cans, pick them up using its gripper, and drop them off in a recycling bin. The robot runs on a rechargeable battery. Its objective is to collect as many cans as possible. Let's formulate this problem as an MDP, starting with the states, actions, and rewards. Let's assume that the sensors can only distinguish two charge levels, low and high ($\mathcal{S}$ = \{low, high\}). These charge levels represent the robot's state. The robot has three possible actions. It can **search** for cans for a fixed amount of time, it can remain stationary and **wait** for someone to bring in a can, or it can go to the charging station to **recharge** its battery ($\mathcal{A}$ = \{search, wait, recharge\}). We only allow recharging from the low state because recharging is pointless when the energy level is high. 

<img src="images/robot_collector.svg" width="50%" align="center"/>

Now, we draw the transition dynamics, where states are the open circles. 

> *Searching for cans when the energy level is high might reduce the energy level to low*

That is, the **search** action in the state high might not change the state (with probability $\alpha$), or the energy level might drop to low (with probability $1-\alpha$). In both cases, the robot's **search** yields a reward of $r_{search}$. For instance, $r_{search}$ could be $+10$, indicating that the robot found 10 cans. The robot can also **wait**, in which case the state does not change.

> *Waiting for cans does not drain the battery*  

In either state, the **wait** action yields a reward of $r_{wait}$. For example, $r_{wait}$ could be $+1$. 

> *Searching when the energy level is low might deplete the battery, and the robot must be rescued*

If the robot is **rescued** (with probability $1-\beta$), then its battery is restored to high. However, needing rescue yields a negative reward of $r_{rescued}$. For example, $r_{rescued}$ could be $-20$ because we were annoyed with the robot. Alternatively, the battery might not run out (with probability $\beta$), in which case the robot stays in the low state and receives a reward of $r_{search}$. 

> *Recharge action restores the battery to the level high*

The action of recharging the battery yields a reward of zero. These transition dynamics can be represented as the diagram below.

<img src="images/transition_diagram.svg" width="50%" align="center"/>
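The transition dynamics described above can be written down as a small table. The sketch below is one possible encoding, not the notebook's own implementation: the values chosen for $\alpha$ and $\beta$ are assumptions picked only for illustration, the reward values are the examples from the text, and the names `P`, `is_normalized`, and `sample_step` are hypothetical.

```python
import random

ALPHA, BETA = 0.9, 0.6                     # assumed values, for illustration only
R_SEARCH, R_WAIT, R_RESCUED = 10, 1, -20   # example rewards from the text

# P[(state, action)] -> list of (next_state, reward, probability)
P = {
    ("high", "search"):  [("high", R_SEARCH, ALPHA), ("low", R_SEARCH, 1 - ALPHA)],
    ("high", "wait"):    [("high", R_WAIT, 1.0)],
    ("low", "search"):   [("low", R_SEARCH, BETA), ("high", R_RESCUED, 1 - BETA)],
    ("low", "wait"):     [("low", R_WAIT, 1.0)],
    ("low", "recharge"): [("high", 0, 1.0)],   # recharging is only valid in "low"
}

def is_normalized(dynamics, tol=1e-9):
    """Check that the probabilities sum to one for every (state, action) pair."""
    return all(
        abs(sum(prob for _, _, prob in outcomes) - 1.0) < tol
        for outcomes in dynamics.values()
    )

def sample_step(state, action, rng):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = P[(state, action)]
    weights = [prob for _, _, prob in outcomes]
    next_state, reward, _ = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

rng = random.Random(0)
print(is_normalized(P))                     # True
print(sample_step("low", "recharge", rng))  # ('high', 0), a deterministic transition
```

Leaving `("high", "recharge")` out of the table encodes the restriction that recharging is only allowed from the low state.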

The MDP formalism can be used in many different applications, in many different ways. States can be low-level sensory readings, such as the pixel values of the video frame. They can also be high-level such as object descriptions or bounding boxes. Similarly, actions can be low-level, such as the wheel speed of a robot, or high-level, such as go to the charging station.

There are many ways to formalize a given reinforcement learning task. For a robot that picks up cans, for example, the states could be the readings of the joint angles and velocities of the robot. The actions could be the voltages applied to each motor. The reward could be $+100$ for successfully placing the can into the trash. We might also want the robot to use as little energy as possible, so we would include a small negative reward corresponding to the energy used.

## The Goal of Reinforcement Learning

In reinforcement learning, the agent's objective is to maximize **future reward**. In bandits, we maximized the immediate reward, but this does not work in an MDP. An action that yields a large reward at the next time step may look good in the short term without being the best choice in the long term. For example, consider a robot learning to walk, where the reward is proportional to forward motion. Lurching forward would clearly maximize immediate reward. However, this action could cause the robot to fall over and hurt its long-term reward. If the robot wants to maximize total forward motion, it should walk quickly but carefully. 

We formalize maximizing total future reward using the **return**. The return at time step $t$, written $G_t$, is the sum of the rewards received after time step $t$. It is a random variable, since the dynamics of the MDP can be stochastic. 

$$
\text{return}\ \ \ \ G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots
$$

In general, many different trajectories from the same state are possible. This is why we **maximize the expected return**. 

$$
\mathbb{E}[G_t] = \mathbb{E}[R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T]
$$

For this to be well-defined, the sum of rewards must be finite. Specifically, let's say there is a final time step $T$ (the index in $R_T$) where the agent-environment interaction ends. In the simplest case, the interaction naturally breaks into chunks called **episodes**. 

Each episode begins independently of how the previous one ended. At termination, the agent is reset to a start state. Every episode has a final state which we call the terminal state. We call these tasks **episodic tasks**. To understand episodic tasks better, let's look at the game of chess. A game of chess always ends with a *checkmate*, *draw*, or *resignation*. An episode when playing chess is a complete single game. Each game starts from the same start state with all the pieces reset. 
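To make the expected return of an episodic task concrete, here is a minimal sketch that estimates $\mathbb{E}[G_0]$ by averaging the returns of many sampled episodes. The toy task is an assumption made up for illustration: every step yields a reward of $+1$, and after each step the episode terminates with probability `p_end`.

```python
import random

def sample_episode_return(rng, p_end=0.5):
    """Return G_0 for one episode of the toy task.

    Each step yields a reward of +1; after each step the episode
    terminates with probability p_end (an assumed, illustrative value).
    """
    g = 0
    while True:
        g += 1                    # reward for this step
        if rng.random() < p_end:  # reached the terminal state
            return g

rng = random.Random(0)
returns = [sample_episode_return(rng) for _ in range(10_000)]
estimate = sum(returns) / len(returns)
print(estimate)  # close to 1 / p_end = 2
```

Individual returns vary from episode to episode, which is why the objective is stated in terms of the expectation.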

## Michael Littman: The Reward Hypothesis

The basic idea of the reward hypothesis is illustrated in this saying:

> *Give a man a fish and he'll eat for a day.*
>
> *Teach a man to fish and he'll eat for a lifetime.*
>
> *Give a man a taste for fish and he'll figure out how to fish even if the details change.* 

As this saying suggests, there are three ways to think about creating intelligent behavior. The first, *give a man a fish*, is good old-fashioned AI. If we want a machine to be smart, we program it with the behavior we want it to have. However, as new problems arise, the machine won't be able to adapt to new circumstances. It requires us to always be there providing new programs. 

The second, *teach a man to fish*, is supervised learning. If we want a machine to be smart, we provide training examples, and the machine writes its own program to match those examples. It learns as long as we have a way to provide training examples. However, situations change: most of us don't have the opportunity to eat fish or to fish for our food every day. 

The third, *give a man a taste for fish*, is **reinforcement learning**. It's the idea that we don't have to specify the mechanism for achieving a goal. We can just encode the goal and the machine can design its own strategy for achieving it. Thus, you don't have to catch a salmon to eat a salmon: there are supermarkets and seafood chain restaurants. 

### Reward Hypothesis
 
A blog post by Muhammad Ashraf (and David Silver in his intro to RL course) says:

> *All goals can be described by the maximization of expected cumulative rewards.*

Rich Sutton has a blog post called the reward hypothesis, where he states:

> *What we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).* 

This version emphasizes that it goes beyond goals and the fact that reward is a scalar. In a slide from a talk in the early 2000s, Michael Littman says:

> *Intelligent behavior arises from the actions of an individual seeking to maximize its received reward signals in a complex and changing world.*

Overall, the spirit is very much the same in all versions, although Littman contrasts the simplicity of the idea of reward with the complexity of the real world. So if you buy into this hypothesis, it suggests that there are two main branches of research that need to be addressed: 

- Figuring out what rewards the agent should optimize
- Designing algorithms to maximize them

The second item has received a lot of attention, so the focus here is on the first: **How can we define rewards?** A reinforcement learning agent on the stock market can probably just be given monetary rewards to optimize. Buying actions cost dollars and selling actions generate dollars, so the trade-offs are easy. For a reinforcement-learning-based solar panel, moving the motors to reposition itself costs energy, and the sun shining on the panel brings in energy, so there's a common currency. However, if we're designing a reinforcement learning agent to control a thermostat, what's the reward? Turning on the heat or air conditioning costs energy, but not turning it on causes discomfort for the occupants, so there's no common currency. 

We can express the idea of a goal using rewards. One way is to define a state where the goal is achieved as having $+1$ reward, and all others are $0$ reward, that's sometimes called the **goal-reward representation**. Another is to penalize the agent with a $-1$ each step in which the goal has not been achieved. Once the goal is achieved, there's no more cost, that's the **action-penalty representation**. 

$$
\text{Goal-Reward} = \left\{\begin{matrix}
+1 & \text{if achieves goal}\\ 
0 & \text{Otherwise}\ \ \ \ \ \ \ \
\end{matrix}\right. \ \ \ \ \ \ \ \ \text{Action-Penalty} = \left\{\begin{matrix}
0 & \text{if achieves goal}\\ 
-1 & \text{Otherwise}\ \ \ \ \ \ \ \
\end{matrix}\right. 
$$

In both cases, optimal behavior is achieved by reaching the goal so that's good. But they result in subtle differences in terms of what the agent should do along the way. The first doesn't really encourage the agent to get to the goal with any sense of urgency. And the second runs into serious problems if there's some small probability of getting stuck and never reaching the goal. Both schemes can lead to big problems for goals with really long horizons. 
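The difference in urgency can be seen by computing the undiscounted return of a trajectory that reaches the goal after a given number of steps. This is a small sketch under the two reward definitions above; the function names and the `steps_to_goal` parameter are illustrative assumptions.

```python
def goal_reward(reached_goal):
    """Goal-reward representation: +1 at the goal, 0 otherwise."""
    return 1 if reached_goal else 0

def action_penalty(reached_goal):
    """Action-penalty representation: 0 at the goal, -1 otherwise."""
    return 0 if reached_goal else -1

def undiscounted_return(reward_fn, steps_to_goal):
    """Sum of rewards for a trajectory that reaches the goal on its last step."""
    reached = [False] * (steps_to_goal - 1) + [True]
    return sum(reward_fn(flag) for flag in reached)

for n in (3, 30):
    print(n, undiscounted_return(goal_reward, n), undiscounted_return(action_penalty, n))
# 3 1 -2
# 30 1 -29
```

Under goal-reward the short and long paths earn the same return, so nothing pushes the agent to hurry. Under action-penalty the return keeps falling with every extra step, and it falls without limit if the agent never reaches the goal.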

Imagine we want to encourage an agent to win a Nobel Prize. We'd give the agent a reward for being honored in Sweden and 0 otherwise - that's really, really rough. Some intermediate rewards like +0.0001 for doing well on a science test, or +0.001 for getting tenure, could make a big difference for helping to point the agent in the right direction. So even if we accept the reward hypothesis, there's still work to do to define the right rewards. 

**Programming** is the most common way of defining rewards for a learning agent. A person sits down and does the work of translating the goals of behavior into reward values. That can be done once and for all by writing a program that takes in states and outputs rewards. Some recent research looks at special languages for specifying tasks like temporal logic. These languages might be useful as intermediate formats that are somewhat easy for people to write, but also somewhat easy for machines to interpret. 

Rewards can also be **delivered on the fly by a person**. Recent research focuses on how reinforcement learning algorithms need to change when the source of rewards is a person. People act differently than reward functions, they tend to change the reward they give in response to how the agent is learning, for example. Standard reinforcement learning algorithms don't respond well to this kind of non-stationary reward. 

We can also specify **rewards by example**. That can mean an agent learning to copy the rewards that a person gives, but a very interesting version of this approach is *inverse reinforcement learning*.  

> *In inverse reinforcement learning, a trainer demonstrates an example of the desired behavior, and the learner figures out what rewards the trainer must have been maximizing that makes this behavior optimal.* 

So whereas reinforcement learning goes from rewards to behavior, inverse reinforcement learning is going from behavior to rewards. Once identified, these rewards can be maximized in other settings, resulting in powerful generalization between environments. 

Rewards can also be derived indirectly through an optimization process. If there's some high-level behavior we can create a score for, an optimization approach can search for rewards that encourage that behavior. Returning to the Nobel Prize example from earlier, imagine creating multiple agents pursuing this goal instead of a single one. That would allow us to evaluate not just the result of the behavior (was the prize won?) but also the rewards being used as an incentive for this behavior. Arguably, this is how living agents get their reward functions: reinforcement learning agents survive if they have good reward functions and a good algorithm for maximizing them. Those agents pass the reward functions along to their offspring. More generally, this is an example of **meta reinforcement learning**, learning at the evolutionary level that creates better ways of learning at the individual level. 

The reward hypothesis is very powerful and very useful for designing state-of-the-art agents. It's a great working hypothesis that has helped lead us to some excellent results. But I'd caution you not to take it too literally; we should be open to rejecting the hypothesis when it has outlived its usefulness. For one thing, there are examples of behavior that seem to be doing something other than maximizing reward. For example, it's not immediately apparent how to capture **risk-averse behavior** in this framework. 

> *Risk-averse behavior involves choosing actions that might not be best on average but for example, minimize the chance of a worst case outcome.* 

On the other hand, you can capture this kind of behavior by intervening on the reward stream to magnify negative outcomes. That will shift behavior in precisely the right way.
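As a minimal sketch of magnifying negative outcomes, the example below compares a safe and a risky action before and after scaling up negative rewards. The outcome values, probabilities, and the `negative_scale` parameter are all assumptions chosen for illustration.

```python
def expected_reward(outcomes, negative_scale=1.0):
    """Expected reward over (reward, probability) pairs.

    Negative rewards are multiplied by negative_scale before averaging,
    which is one simple way to magnify bad outcomes.
    """
    total = 0.0
    for reward, prob in outcomes:
        shaped = reward * negative_scale if reward < 0 else reward
        total += prob * shaped
    return total

safe = [(1, 1.0)]                 # always +1
risky = [(10, 0.9), (-50, 0.1)]   # usually +10, occasionally -50

# With the raw rewards, the risky action has the higher expected reward.
print(expected_reward(risky) > expected_reward(safe))            # True
# With negative rewards tripled, the safe action is preferred.
print(expected_reward(risky, 3.0) > expected_reward(safe, 3.0))  # False
```

A plain reward maximizer on the shaped rewards now behaves like a risk-averse agent on the original ones.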

What about when the desired behavior isn't to do the best thing all the time but to do a bunch of things in some balance? Imagine a pure reward-maximizing music recommendation system: it should figure out your favorite song and then play it for you all the time, and that's not what we want. Maybe there are ways to expand the state space so that the reward for playing a song is scaled back if that song has been played recently. It's kind of like the idea that an animal gets a lot of value from drinking, but only if it's thirsty. If it just had a drink, the reward for drinking again right away is low. So maybe rewards can handle these cases. 

Well, another observation that I think is worth considering is whether pursuing existing rewards is a good match for high-level human behavior. There are people who single-mindedly pursue their explicit goals, but it's not clear that we judge such people as being good to have around. As moral philosophers might point out, the goals we should be pursuing aren't immediately evident to us. As we age, we learn more about what it means to make good decisions, and generations of scholars have been working out what it means to be a good ethical person. Part of this is better understanding the impact of our actions on the environment, and the impacts on each other, and that's just reinforcement learning. But part of it is articulating a deeper sense of purpose.

## Continuing Tasks

In many reinforcement learning problems, the agent-environment interaction continues without end. Episodic tasks can be broken up into episodes, where each episode must end in a terminal state and the next episode begins independently of how the last one ended. Continuing tasks, in contrast, cannot be broken up into independent episodes, and the interaction goes on continually. 

For example, consider the smart thermostat in the image below, which regulates the temperature of a building. It can be formulated as a continuing task since the thermostat never stops interacting with the environment. The state could be the current temperature along with details of the situation like the time of day and the number of people in the building. There are just two actions: turn on the heater or turn it off. The reward could be $-1$ every time someone has to manually adjust the temperature and $0$ otherwise. To avoid negative reward, the thermostat would learn to anticipate the user's preferences. 

<img src="images/thermostat.svg" width="60%" align="center"/>

The return for continuing tasks cannot simply sum all future rewards as we did for episodic tasks, since we would be summing over an infinite sequence and the return might be infinite. To keep it finite, we discount future rewards by a factor $\gamma$ called the **discount rate**, where $0 \le \gamma < 1$. Starting from the previous return formulation

$$
G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_{t+k}
$$

we can modify it to include the discounting factor.

$$
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{k-1} R_{t+k}
$$

Using this discount factor, we can see that:

> The effect of discounting on the return is that immediate rewards contribute more to the sum. 

Thus, rewards far into the future contribute less because they are multiplied by $\gamma$ raised to successively larger powers. We can write this sum concisely as an expression that is guaranteed to be finite when the rewards are bounded and $\gamma < 1$. 

$$
G_t \doteq \sum\limits_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$
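The discounted sum translates directly into code. The sketch below computes it for a finite list of rewards, where `rewards[k]` plays the role of $R_{t+k+1}$; the function name and the example values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], with rewards[k] standing in for R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1, 1, 1, 1]
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 + 0.125 = 1.875
```

Each successive reward is weighted by one more factor of $\gamma$, so later rewards count for less.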

Assume $R_{max}$ is the maximum reward our agent can receive at any time step. We can now upper bound the return $G_t$ by replacing every reward with $R_{max}$. Since $R_{max}$ is just a constant we can pull it out of the summation. 

$$
G_t \doteq \sum\limits_{k=0}^{\infty} \gamma^k R_{t+k+1} \ \ \ \le \ \ \ \sum\limits_{k=0}^{\infty} \gamma^k R_{max} \ \ \ = \ \ \ R_{max} \sum\limits_{k=0}^{\infty} \gamma^k
$$

Note that the second factor is just a geometric series and the geometric series evaluates to 

$$
R_{max} \sum\limits_{k=0}^{\infty} \gamma^k = R_{max} \times \frac{1}{1 - \gamma}
$$

Since $R_{max} \times \frac{1}{1 - \gamma}$ is finite for $\gamma < 1$ and is an upper bound on $G_t$, we know $G_t$ is finite. We can look at the effect of the discount factor on the behavior of the agent at the two extreme cases: when $\gamma=0$ and when $\gamma \approx 1$. When $\gamma=0$, the return is just the reward at the next time step:

$$
G_t = R_{t+1} + 0 R_{t+2} + 0^2 R_{t+3} + \cdots + 0^{k-1} R_{t+k} \ \ \ = \ \ \ \color{blue}{R_{t+1}}
$$ 

Thus, the agent is shortsighted and only cares about immediate expected reward. On the other hand, when $\gamma \approx 1$, the immediate and future rewards are weighted nearly equally in the return. In this case, **the agent is more farsighted**. 
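Both the upper bound and the two extremes of $\gamma$ can be checked numerically. This is a rough sketch: the reward sequences, the value of `r_max`, and the truncation at 1000 steps are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], with rewards[k] standing in for R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Upper bound: receiving r_max at every step approaches r_max / (1 - gamma).
r_max, gamma = 1.0, 0.9
constant = [r_max] * 1000      # long truncation of the infinite sequence
bound = r_max / (1 - gamma)
print(discounted_return(constant, gamma), bound)  # both approximately 10

# Extremes: a small immediate reward followed by a large delayed one.
delayed = [1, 0, 0, 100]
shortsighted = discounted_return(delayed, 0.0)   # only R_{t+1} counts
farsighted = discounted_return(delayed, 0.99)    # the delayed reward dominates
print(shortsighted, farsighted)
```

With $\gamma = 0$ the delayed reward of 100 is ignored entirely, while with $\gamma = 0.99$ it makes up almost all of the return.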

Finally, let's discuss a simple but important property of the return: it can be written recursively. Let's factor out $\gamma$ starting from the second term in our sum. 

$$
G_t = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots)
$$

Amazingly, the sum in parentheses is the return at the next time step, so we can just replace it with $G_{t+1}$.

$$
G_t = R_{t+1} + \gamma G_{t+1}
$$

Now, we have a recursive equation with $G_t$ on the left and $G_{t+1}$ on the right. This simple equation gives a recursive definition of the return.
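The recursion suggests a simple way to compute the return at every time step of a finished episode: start from the end and work backward. The sketch below assumes the return after the final reward is zero; the function name and example rewards are illustrative.

```python
def returns_backward(rewards, gamma):
    """Compute G_t for every t using G_t = R_{t+1} + gamma * G_{t+1}.

    rewards[t] stands in for R_{t+1}; the return after the last reward
    is taken to be 0.
    """
    g = 0.0
    returns = []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

print(returns_backward([1, 2, 3], 0.5))  # [2.75, 3.5, 3.0]
```

The first entry matches the direct sum $1 + 0.5 \times 2 + 0.25 \times 3 = 2.75$, and each entry reuses the one after it instead of recomputing the whole sum.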

## Examples of Episodic and Continuing Tasks

**Episodic task**: Consider an agent learning to play the game in the image below. The player represented in blue gets points for collecting white treasure blocks. The game ends when the player touches a green enemy block. This game is naturally represented as an episodic MDP. The agent tries to get a high score, collecting as many points as possible before the game ends. 

<img src="images/game.gif" width="40%" align="center"/>

To model this game, the state is an array of pixel values corresponding to the current screen. There are four actions: up, down, left, and right. The agent gets a reward of `+1` whenever it collects a treasure block. An episode ends when the agent touches one of the green enemies. Regardless of how the episode ends, the next episode will begin with the agent in the center of the screen with no enemies present.

<img src="images/game_rewards.svg" width="40%" align="center"/>

**Continuing task**: The agent schedules jobs on a set of servers. Suppose we have three servers used by reinforcement learning researchers to run experiments. Researchers submit jobs with different priorities to a single queue. The state is the number of free servers and the priority of the job at the top of the queue. The actions are to reject the job at the top of the queue or, if a server is free, to accept it. Accepting the job runs it and yields a reward equal to the job's priority. Rejecting a job yields a negative reward proportional to the priority and sends the job to the back of the queue. The agent should be careful about scheduling low-priority jobs, since doing so could prevent high-priority jobs from being scheduled later. The servers become available as they finish their jobs. The researchers continually add jobs to the queue, and the agent accepts or rejects them. Since this process never stops, it's well-described as a continuing task. 

<img src="images/continuous.svg" width="70%" align="center"/>
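A single transition of the scheduling task could be sketched as follows. This is a simplified illustration rather than a full environment: job completion and new job arrivals are left out, the proportionality constant `REJECT_PENALTY` is an assumed value, and accepting when no server is free is treated as a rejection, which is also an assumption.

```python
from collections import deque

REJECT_PENALTY = 1.0  # assumed proportionality constant for rejected jobs

def step(free_servers, queue, action):
    """Apply 'accept' or 'reject' to the job at the top of the queue.

    queue holds job priorities. Returns (free_servers, reward).
    """
    priority = queue.popleft()
    if action == "accept" and free_servers > 0:
        return free_servers - 1, float(priority)  # the job runs on a free server
    queue.append(priority)                        # the job goes to the back
    return free_servers, -REJECT_PENALTY * priority

queue = deque([1, 8, 2])   # priorities of the waiting jobs
free = 1
free, r1 = step(free, queue, "reject")  # skip the low-priority job
free, r2 = step(free, queue, "accept")  # run the high-priority job
print(r1, r2, free, list(queue))  # -1.0 8.0 0 [2, 1]
```

Here, paying a small penalty to reject the priority-1 job keeps the only free server available for the priority-8 job behind it.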

In [2]:
# Center images
from IPython.display import HTML
def css_styling():
    # Read the custom stylesheet (edit the path to custom.css if needed)
    with open("../_styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()