# Lesson 08: Markov Decision Processes and Q Learning

## Part 01: Markov Decision Processes


### Summary

This notebook outlines the general concepts of Markov Decision Processes (MDPs) as discussed by Charles and Michael in the video lectures. Much of the content is scraped from the transcripts of those videos, so you may hear their voices coming through here.



#### Decision Making & Reinforcement Learning

There are differences between the three types of learning: supervised, unsupervised, and reinforcement.

- **Supervised learning** takes the form of _function approximation_ where you're given a bunch of _x_, _y_ pairs (features and labels), and your goal is to find a function **f** that will map some new _x_ to a proper _y_. If the learner is a good one, then the predicted _y_ will be the same as or close to (in some sense) the true _y_. These _x_'s and _y_'s are often vectors.

- **Unsupervised learning** takes a bunch of _x_'s, and your goal is to find some "function" **f** that gives you a compact description (this is the equivalent of _y_ now) of the set of _x_'s that you've seen. So we call this _clustering_, or _description_, as opposed to function approximation.

- **Reinforcement learning (RL)** takes a bunch of data pairs (_x_'s and _z_'s) and learns some function **f** that can generate the _y_'s. In this case, though, the _x_, _y_ and _z_ elements are very different from the _x_ and _y_ in the other two types of learning. So the first step in RL is to understand what the _x_, _y_ and _z_'s are.

We are going to motivate the definition of these _x_, _y_ and _z_'s with a simple game. The game is adapted from the video lecture (which in turn adapted it from one of the classic texts on AI, **Artificial Intelligence: A Modern Approach** by _Stuart Russell and Peter Norvig_).

### The "World" of Charles and Michael - Quiz

In the video lectures, they describe a simple $3 \times 4$ grid-like world as the premise of a game. There are four kinds of cells in this world:

1. cells that you can move to without anything spectacular happening
2. cells that are "walled" and you cannot get to them
3. cells that are spectacularly bad (you LOSE) and the game ends if you land in one of them
4. cells that are spectacularly good (you WIN) and the game ends if you land in one of them
    
The player starts off in one of the cells of type 1, selected randomly: the _start_ state. Play proceeds by the player selecting one of a set of actions (_up_, _down_, _left_, or _right_) that moves the player to a different cell. The object of the game is for the player to get to a cell of type 4 so the player can win the game. Also, a player can never fall off the grid; if they attempt to do so, they end up not moving.

Let's switch to using "you" instead of player (much easier for me to use!). In the grid displayed below we represent a $4 \times 4$ world. **X** represents "walled" cells; if you take an action attempting to move into a walled cell, you end up not moving. At every step, the game's controller responds to your actions deterministically, i.e., if you pick _L_ (to go _left_), the controller moves you to the cell immediately to the left of your current position, provided that cell is available (i.e., not a walled cell or beyond the boundary).


|   | A | B | C | D |
| --- | --- | --- | --- | --- |
| a |  | **X** |  | _WIN_ |
| b |  |  |  | _LOSE_ |
| c |  | **X** |  |  |
| d | _start_ |  |  |  |
  
**Question 1.** In this world, given a _start_ position of (d,A) (lower left corner), what are the two optimal (shortest) paths to winning the game? How many moves does each take? What is the probability that the player wins the game?

**Answer:**

> _Answer_

In the world described above, I expect you were able to find the path without too much trouble. Think about how you solved this problem. What were the steps you took? What were the decisions you made along the way? Please discuss with your partner and write them down. It may seem a bit silly, but there is a small point to it. What algorithm might you use to solve this for a grid of any dimension where the world followed the same rules (you don't have to write the algorithm, just think of a way you might be able to do it)? 

**Reflection 1.** How I solved the path problem

**Answer:**

 > _Answer_

In the next version of the game, we consider the case where the rules of movement have changed slightly. Now, the game's controller _doesn't_ always move you in the direction you wanted to move. Instead, 80% of the time the controller moves you in the direction you chose, but 20% of the time it moves you in a direction perpendicular to the one you wanted to take (10% to each side). For example, let's say from the _start_ position (d,A) you wanted to go up. 80% of the time you would go up to (c,A), 10% of the time you would move right to (d,B), and 10% of the time you would remain in (d,A), because the controller would try to move you to the left, but there is a boundary there, so you end up not moving at all.
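The 80/10/10 rule above can be sketched as a function that returns a probability for each cell you might land in. This is a minimal sketch, not code from the lectures: the names `step` and `transition_probs` are made up for illustration, states are written as `(row, column)` tuples such as `("d", "A")`, and the WIN and LOSE cells are not treated specially (the game ending there is ignored).

```python
ROWS, COLS = "abcd", "ABCD"            # rows top to bottom, columns left to right
WALLS = {("a", "B"), ("c", "B")}       # the X cells in the grid above
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
SIDEWAYS = {"U": "LR", "D": "LR", "L": "UD", "R": "UD"}   # perpendicular directions


def step(state, action):
    """Move one cell; stay put if the target is off the grid or walled."""
    r = ROWS.index(state[0]) + MOVES[action][0]
    c = COLS.index(state[1]) + MOVES[action][1]
    if not (0 <= r < len(ROWS) and 0 <= c < len(COLS)):
        return state
    target = (ROWS[r], COLS[c])
    return state if target in WALLS else target


def transition_probs(state, action):
    """Probability of each next state: 0.8 intended, 0.1 for each perpendicular slip."""
    probs = {}
    outcomes = [(action, 0.8), (SIDEWAYS[action][0], 0.1), (SIDEWAYS[action][1], 0.1)]
    for actual, p in outcomes:
        nxt = step(state, actual)
        probs[nxt] = probs.get(nxt, 0.0) + p   # blocked moves add to "stay where you are"
    return probs


print(transition_probs(("d", "A"), "U"))
```

For the example in the text, this gives 0.8 for (c,A) and 0.1 each for (d,B) and (d,A).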

**Question 2.** Pick one of the paths from your answer to Q1. What is the probability that you will win the game if you make the exact same sequence of moves? Remember the minor (low-probability) cases.

**Answer:**

    ## Code cell: you can write out all the steps and probabilities and use Python as a simple calculator

Here's some food for thought. How easily could you calculate the probability of winning the game? Are you guaranteed to win it every time? What kind of algorithm might you use to find the optimal path to the WIN state? Discuss these with your partner. Write down some of your thoughts (I am not looking for the correct answer).

**Reflection**

 > 

#### Making decisions in a stochastic world 

The exercises in the videos (and above) illustrate how it is possible to come up with a static plan when the world is deterministic, but the problem becomes more challenging when there is uncertainty (or stochasticity). We have two options:
 - plan out what we would do in a deterministic world and try to execute the steps of the plan; then, every once in a while, check whether we have drifted away from where we thought we should be, re-evaluate, and draw up a new plan (still assuming a deterministic world)
 - incorporate all of these uncertainties and probabilities up front, so that we never have to rethink what to do when something goes wrong. Some fraction of the time we would still fail to reach the WIN position, but we would have the best chance of winning.
 
The second approach is the one taken in single-agent reinforcement learning, and the framework used to describe it is the Markov Decision Process (MDP).

Let's look back at our reflections on question 1. You may have come up with a different way, but generally, we look at the grid and say there are specific locations and I need to move from one location to the next. We are going to abstract these words and call the location (a grid cell) a **_state_** and the moves we made an **_action_**. These two terms are essential concepts in the world of RL or MDPs. 

Continuing on with adding more words to our thought processes: in solving the grid-world problem, we were in a state (location), took an action, and ended up in another state. The grid was laid out for us so we could see the connections between the states and knew that we had to go _up_ from (d, A) to get to (c, A). This map can also be abstracted and is another important concept: the **_model_** or the **_transition model_**. It is a function of three variables: a state $s$, an action $a$, and another state $s'$ (where we end up). In a stochastic world its value is the probability of ending up in $s'$ after taking $a$ in $s$, i.e., $T(s, a, s') = \Pr(s' \mid s, a)$. In very broad terms, the collection of states defines our world or universe, the actions are the things we can do to explore the world, and the transition model encapsulates the rules of the game.
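For the deterministic world of Question 1, the transition model can be sketched as a small function. This is an illustrative sketch with made-up names (`next_state`, `T`), using `(row, column)` tuples for states; it is one possible encoding of the grid above, not the lectures' notation.

```python
ROWS, COLS = "abcd", "ABCD"            # rows top to bottom, columns left to right
WALLS = {("a", "B"), ("c", "B")}       # the X cells in the grid above
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def next_state(s, a):
    """Where action `a` takes you from state `s` in the deterministic world."""
    r = ROWS.index(s[0]) + MOVES[a][0]
    c = COLS.index(s[1]) + MOVES[a][1]
    if not (0 <= r < len(ROWS) and 0 <= c < len(COLS)):
        return s                       # off the grid: you do not move
    target = (ROWS[r], COLS[c])
    return s if target in WALLS else target


def T(s, a, s_next):
    """Deterministic transition model: 1.0 if `a` in `s` leads to `s_next`, else 0.0."""
    return 1.0 if next_state(s, a) == s_next else 0.0


# The four entries of the model for the start state.
for a in "UDLR":
    print(("d", "A"), a, "->", next_state(("d", "A"), a))
```

Writing the model as `T(s, a, s')` rather than just `next_state` keeps the same three-argument shape that the stochastic version needs.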

**Question 3:** Using notation like (d,A) to represent a state, one of the four letters U, D, L, R to denote the action, write down all possible entries in the transition model for the deterministic world (from question 1) starting in states (b,C) and (c, A). Each state should have four entries. If the move isn't permitted, then the "final" state is the same as the initial state.

**Answer:**

 >  

It turns out that in this simplistic model we left out a few key points. The first is the _Markovian property_, which essentially states that only the present state matters; the history of how we got there doesn't. The second is _stationarity_: the rules of the world (the transition model) do not change over time.

The three abstract concepts are all we need to 'live' in the world, but if we want to have a purpose, we need to add the notion of a **reward**. A **reward** is simply a scalar value that you get for being in a state. It encompasses our domain knowledge; the reward you get from the state tells you the usefulness of entering into that state. Why is this important? Without a reward, there is no reason to be in any state vs. the next and also no real need to make decisions. 
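As a concrete illustration, a reward function for the grid world could look like the sketch below. The numbers are assumptions chosen only for illustration (the text above does not assign values): +1 for the WIN cell, -1 for the LOSE cell, and 0 everywhere else. The names `REWARDS` and `reward` are also made up.

```python
WIN, LOSE = ("a", "D"), ("b", "D")     # the WIN and LOSE cells from the grid above

# Hypothetical reward values, chosen only for illustration.
REWARDS = {WIN: 1.0, LOSE: -1.0}


def reward(state):
    """R(s): the scalar you receive for being in `state` (0.0 unless listed above)."""
    return REWARDS.get(state, 0.0)


print(reward(WIN), reward(LOSE), reward(("d", "A")))
```

With every reward set to the same value, no state would be preferable to any other, which is the point made above about rewards giving the agent a purpose.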

These four things (states, actions, the transition model, and rewards), along with the Markov property and stationarity, define what's called a Markov Decision Process, or MDP. Got it?

#### Transcript: 10 - Markov Decision Processes - 4

Speaking of solutions, this is the last little thing that you need to know. This defines a problem. But what we also want to have, whenever we have a problem, is a solution. So the solution to the Markov Decision Process is something called a policy. And what a policy does is, it's a function that takes in a state and returns an action. In other words, for any given state that you're in, it tells you the action that you should take.

Like as a hint?

No, it just tells you, this is the action. Well, I mean, I suppose you don't have to do it, but the way we think about Markov Decision Processes is that this is the action that will be taken.

I see, so it's more of an order.

Yes, it's a command. Okay.

So that's all a policy is. A policy is a solution to a Markov Decision Process. And there is a special policy, which I'm writing here as policy star, or the optimal policy, and that is the policy that maximizes your long-term expected reward. So of all the policies you could take, of all the decisions you might take, this is the policy that optimizes the amount of reward that you're going to receive or expect to receive over your lifetime.
So, like, at the end?

Well, yeah, at the end, or at any given point in time, how much reward you're receiving. From the Markov Decision Process point of view, there doesn't have to be an end. Okay. Though in this example, you don't get anything, and then at the end, you get paid off.

Right, or unpaid off.

Right.

If you fall into the red square. So actually, your question points out something very important here. I mentioned earlier, when I talked about the three kinds of learning, that supervised learning and reinforcement learning were sort of similar, except that instead of getting y's and x's we were given x's and z's. And this is exactly what's happening here. Here, what we would like to have if we wanted to learn a policy is a bunch of (s, a) pairs as training examples: here's the state and the action you should've taken, here's another state and the action you should've taken, and so on and so forth. And then we would learn a function, the policy, that maps states to actions. But what we actually see in the reinforcement learning world, in the Markov Decision Process world, is states, actions, and then the rewards that we received. And so in fact, this problem of seeing a sequence of states, actions, and rewards is very different from the problem of being told, this is the correct action to take, now find a function that maps from state to action. Instead, we say, well, if you're in this state and you take this action, this is the reward that you would see. And then from that, we need to find the optimal action.
So pi star is being the _f_ from that previous slide?

Right.

And _R_ is being _z_?

Yes. And _y_ is being _a_.

And _s_ is being _x_, or _x_ is being _s_.

Got you.

Right.
 
So but, okay, I'm a little confused about this notion of a policy. So the thing we tried to do to get to the goal was up, up, right, right, right. Yes.

I don't see how to capture that as a policy.

It's actually fairly straightforward. What a policy would say is: what state are you in? Tell me what action you should take. So the policy basically is this: when you're in the start state, the action you should take is up. And it would have a mapping for every state that you might see, whether it's this state, this state, this state, this state, this state, this state, this state, or even these two states, and it will tell you what action you should take. And that's what a policy is. A policy, very simply, is nothing more than a function that tells you what action to take in any state you happen to come across.
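A policy like the one described here can be sketched as a lookup table from states to actions. The sketch below assumes the 3x4 layout used in the lectures (a wall in the middle row, WIN in the top-right corner, LOSE just below it, start in the bottom-left corner). The `(row, col)` encoding, the names, and the particular actions in the table are illustrative choices, not a claim about the optimal policy.

```python
# Assumed layout of the lectures' 3x4 world, as (row, col) with row 0 at the top.
N_ROWS, N_COLS = 3, 4
WALL, WIN, LOSE = (1, 1), (0, 3), (1, 3)

# Hand-written illustrative policy: one action for every state where a decision is needed.
POLICY = {
    (0, 0): "R", (0, 1): "R", (0, 2): "R",
    (1, 0): "U",              (1, 2): "U",
    (2, 0): "U", (2, 1): "L", (2, 2): "L", (2, 3): "L",
}


def policy(state):
    """pi(s): return the action to take in `state`."""
    return POLICY[state]


def render():
    """Draw the policy as rows of arrows (X = wall, W = WIN, L = LOSE)."""
    arrows = {"U": "^", "D": "v", "L": "<", "R": ">"}
    special = {WALL: "X", WIN: "W", LOSE: "L"}
    rows = []
    for r in range(N_ROWS):
        cells = [special.get((r, c)) or arrows[POLICY[(r, c)]] for c in range(N_COLS)]
        rows.append(" ".join(cells))
    return rows


print(policy((2, 0)))        # the action for the start state
print("\n".join(render()))
```

The table has an entry for every cell you could find yourself in, which is what distinguishes it from a fixed sequence of moves.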
 
Okay, but the question that you asked before was about up, up, right, right, right.

Mm hm.

And it seems like, because of the stochastic transitions, you might not be in the same state. Like, you don't know what state you're in when you take those actions.

No, so, one of the things for what we're talking about here, for the Markov Decision Process, is there are states, there are actions, there are rewards. You always know what state you're in, and you know what reward you receive.

So does that mean you can't do up, up, right, right, right?

Well, the way it would work in a Markov Decision Process, so what you're describing is what's often called a plan. You know, it's: tell me what sequence of actions I should take from here. What a Markov Decision Process does, and what a policy does, is it doesn't tell you what sequence of actions to take from a particular state. It tells you what action to take in a particular state. You will then end up in another state because of the transition model, the transition function. And then when you're in that state, you ask the policy: what action should I take now? Okay.
 
Right, so this is actually a key point. Although we talked about it in the language of planning, which is very common for people who did, for example, take an AI course, they think about this in terms of planning: what are the things that I can do to accomplish my goals? The Markov Decision Process way of thinking about it, the reinforcement way of thinking about it, or the typical reinforcement learning way of thinking about it, really doesn't talk about plans directly, but instead talks about policies, from which you can infer a plan. But this has the advantage that it tells you what to do everywhere. And it's robust to the underlying stochasticity of the word.

World.

So, is it clear that's all you need to be able to behave well?

Well, it's certainly the case that if you have a policy and that policy is optimal, it does tell you what to do, no matter what situation you're in.

'Kay.

And so, if you have that, then that's definitely all you need to behave well. But I mean, could it be that you wanted to do something like up, up, right, right, right, which you can't write down as a policy?

And why can't you write that down as a policy?

Because the policies are only telling you what action to do as a function of the state, not sort of like how far along you are in the sequence.

Right, unless, of course, you fold that into your state somehow. But that's exactly right. The way to think about this is: the idea of coming up with a concrete plan of what to do for the next 20 time steps is different from the problem of, whatever step I happen to be in, whatever state I happen to be in, what's the next best thing I can do? And just always asking that question.

Hm.

If you always ask that question, that will induce a sequence, but that sequence is actually dependent upon the set of states that you see. Whereas in the other case, where we wrote down a particular policy, you'll notice that was only dependent upon the state you started in, and it had to ignore the states that you saw along the way.

And the only way to fix that would be to say, well, after I've taken an action, let me look at the state I'm in and see if I should do something different with it. But if you're going to do that, then why are you trying to compute the complete set of states? Or, I'm sorry, the complete set of actions that you might take.
 Okay. 
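The difference between a plan and a policy can be seen in a small simulation. This is a sketch under assumptions: it uses the 3x4 layout from the lectures (assumed here as a wall in the middle row, WIN in the top-right corner, LOSE just below it, start in the bottom-left corner), the policy table is hand-written for illustration, and instead of random noise a made-up `slips` dictionary forces a chosen step to go in a different direction so the result is repeatable.

```python
N_ROWS, N_COLS = 3, 4                      # (row, col), row 0 at the top
WALL, WIN, LOSE, START = (1, 1), (0, 3), (1, 3), (2, 0)
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

PLAN = ["U", "U", "R", "R", "R"]           # fixed sequence, decided in advance
POLICY = {                                 # illustrative state -> action table
    (0, 0): "R", (0, 1): "R", (0, 2): "R",
    (1, 0): "U",              (1, 2): "U",
    (2, 0): "U", (2, 1): "L", (2, 2): "L", (2, 3): "L",
}


def step(state, action):
    """Move one cell; stay put if the target is off the grid or the wall."""
    r, c = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    if not (0 <= r < N_ROWS and 0 <= c < N_COLS) or (r, c) == WALL:
        return state
    return (r, c)


def run(choose, slips, max_steps=20):
    """Follow `choose(state, t)`; `slips[t]` overrides what actually happens at step t."""
    state = START
    for t in range(max_steps):
        if state in (WIN, LOSE):
            break
        intended = choose(state, t)
        if intended is None:               # the plan has run out of moves
            break
        state = step(state, slips.get(t, intended))
    return state


def follow_plan(state, t):
    """Ignores the state: only the step index matters."""
    return PLAN[t] if t < len(PLAN) else None


def follow_policy(state, t):
    """Ignores the step index: only the current state matters."""
    return POLICY[state]


slips = {0: "R"}                           # the first "up" slips to the right
print("plan:  ", run(follow_plan, slips))
print("policy:", run(follow_policy, slips))
```

With the first move slipping to the right, the fixed plan ends in the bottom-right corner, while the policy steps back and still reaches WIN. With no slips, both reach WIN.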
 
Okay, so there you go. Now, a lot of what we're going to be talking about next, Michael, is, given that we have an MDP, we have this Markov Decision Process defined like this, how do we go from this problem definition to finding a good policy, and in particular, finding the optimal policy? That makes sense.

Good. And there you go.