<h1 style="text-align: center;">What is Reinforcement Learning?</h1>

<br>

__Reinforcement Learning__ is a subfield of machine learning that addresses the problem of automatically learning optimal decisions over time. This is a general and common problem studied in many scientific and engineering fields. The approach also incorporates an extra dimension (usually time, but not necessarily) into the learning equations, which brings it much closer to the human perception of artificial intelligence.

<br>

# 1. Preface

---

This book covers the following topics:

* __Chapter 1, What is Reinforcement Learning?__ <br>
Contains an introduction to RL ideas and the main formal models.<br><br>

* __Chapter 2, OpenAI Gym__<br>
Introduces the reader to the practical aspect of RL, using the open-source library gym.<br><br>

* __Chapter 3, Deep Learning with PyTorch__<br>
Gives a quick overview of the PyTorch library.<br><br>

* __Chapter 4, The Cross-Entropy Method__<br>
 Introduces you to one of the simplest methods of RL to give you the feeling of RL methods and problems.<br><br>

* __Chapter 5, Tabular Learning and the Bellman Equation__<br>
 Gives an introduction to the Value-based family of RL methods.<br><br>

* __Chapter 6, Deep Q-Networks__<br>
 Describes DQN, an extension of the basic Value-based methods that makes it possible to solve complicated environments.<br><br>

* __Chapter 7, DQN Extensions__<br>
 Gives a detailed overview of modern extensions to the DQN method that improve its stability and convergence in complex environments.<br><br>

* __Chapter 8, Stocks Trading Using RL__<br>
 Is the first practical project, applying the DQN method to stock trading.<br><br>

* __Chapter 9, Policy Gradients–An Alternative__<br>
Introduces another family of RL methods, based on policy learning.<br><br>

* __Chapter 10, The Actor-Critic Method__<br>
 Describes one of the most widely used methods in RL.<br><br>

* __Chapter 11, Asynchronous Advantage Actor-Critic__<br>
 Extends Actor-Critic with parallel environment communication, to improve stability and convergence.<br><br>

* __Chapter 12, Chatbots Training with RL__<br>
 Is the second project, showing how to apply RL methods to NLP problems.<br><br>

* __Chapter 13, Web Navigation__<br>
 Is another long project, applying RL to web page navigation, using the MiniWoB set of tasks.<br><br>

* __Chapter 14, Continuous Action Space__<br>
 Describes the specifics of environments that use continuous action spaces, and various methods for them.<br><br>

* __Chapter 15, Trust Regions–TRPO, PPO, and ACKTR__<br>
 Is yet another chapter about continuous action spaces, describing the "trust region" set of methods.<br><br>

* __Chapter 16, Black-Box Optimization in RL__<br>
 Shows another set of methods that don't use gradients in explicit form.<br><br>

* __Chapter 17, Beyond Model-Free–Imagination__<br>
Introduces the model-based approach to RL, using recent research results about imagination in RL.<br><br>

* __Chapter 18, AlphaGo Zero__<br>
 Describes the AlphaGo Zero method applied to the game Connect Four.

<br>

### 1.1. Download the Code

Go to the following link to download all the example code:

www.github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On

<br>

# 2. Learning – Supervised, Unsupervised, and Reinforcement

---

### 2.1. Why Not ML

In our changing world, even problems that look like static input-output problems become dynamic when seen from a larger perspective. 
For example, suppose we're solving a supervised learning problem of pet image classification (with two target classes, dog and cat). We gather the training dataset and implement the classifier using a deep learning toolkit, and after a while the model
demonstrates excellent performance. After a vacation at some seaside resort, we discover that dog haircut fashions have changed and a significant portion of our queries are now misclassified, so we need to update our training images and repeat the process. This shows that the model is relevant only for the period in which its training data was collected.
The preceding example is intended to show that even simple Machine Learning (ML) problems have a hidden time dimension, which is frequently overlooked but can become an issue in a production system.

<br>

### 2.2. Supervised Learning

In supervised learning, we are given a set of example pairs and want to automatically build a function that maps an input to an output. Below you can find some examples of supervised learning:

* __Text classification:__ <br>
Is this email message spam or not?<br><br>

* __Image classification and object location:__<br>
Does this image contain a picture of a cat, dog, or something else?<br><br>

* __Regression problems:__<br>
Given the information from weather sensors, what will be the weather tomorrow?<br><br>

* __Sentiment analysis:__ <br>
What's the customer satisfaction level of this review?<br>

These questions can look different, but they share the same idea: we have many examples of the input and the desired output, and we want to learn how to generate the output for future, currently unseen inputs. The name supervised comes from the fact that we learn from known answers, which were obtained from some supervisor who has provided us with those labeled examples.

<br>

### 2.3. Unsupervised Learning

In unsupervised learning, we assume no supervision. In other words, there are no known labels assigned to our data. The main objective is to learn some hidden structure of the dataset at hand. Below you can find some examples of unsupervised learning:

* __Data clustering:__ <br>This happens when our algorithm tries to combine data items into a set of clusters, which can reveal relationships in data.<br><br>

* __Generative Adversarial Networks (GANs):__ <br> Here we have two competing neural networks: the first tries to generate fake data to fool the second, while the second tries to discriminate artificially generated data from data sampled from our dataset. Over time, both of them become more and more skilful in their tasks by capturing subtle, specific patterns of our dataset.

<br>

### 2.4. Reinforcement Learning

RL lies somewhere in between supervised learning and unsupervised learning. It uses many well-established methods of supervised learning, such as deep neural networks for function approximation, stochastic gradient descent, and back-propagation, to learn data representation. On the other hand, it usually applies them in a different way.

For example, imagine a robot mouse (the agent) in a maze (the environment) that contains some food (reward) and electricity (punishment). The robot mouse can take three actions: turn left, turn right, or move forward. It can observe the full state of the maze to decide which action to take. The goal of the agent is to find as much food as possible while avoiding electric shocks. In this particular example, the mouse may accept a small electric shock to reach a place with plenty of food, because the overall result is better for the mouse than standing still and gaining nothing.

We don't want to hard-code knowledge about the environment and the best actions to take in every specific situation into the robot: it would take too much effort and could become useless after even a slight change to the maze. What we want is some magic set of methods that allows our robot to learn on its own how to avoid electricity and gather as much food as possible.

RL is exactly this magic toolbox, and it works differently from supervised and unsupervised learning methods. It doesn't work with predefined labels as supervised learning does: nobody labels the images the robot sees as good or bad, or gives it the best direction to turn in. However, we're not completely blind as in an unsupervised learning setup, because we have a reward system. Rewards can be positive (from gathering food), negative (from electric shocks), or neutral (when nothing special happens). By observing such rewards and relating them to the actions it has taken, our agent learns how to act better, gather more food, and get fewer electric shocks.

<img width="250px" src="assets/img1.png">

<br>

### 2.5. Challenges in RL

Below you can read some challenges that RL faces:

1. __Observations depend on the agent's behavior:__
    * If the agent does inefficient things, then the observations will NOT tell it what it has done wrong or what it should do to improve the outcome. In other words, the agent will just get negative feedback all the time.<br>
    
    * On the other hand, if the agent is stubborn and keeps making mistakes, then the observations can give the false impression that there is no way to get a larger reward, which can be totally wrong. In ML terms, this can be rephrased as having non-i.i.d. data (i.i.d. stands for independent and identically distributed, which is a requirement for most supervised learning methods).
<br>

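2. __Exploration / Exploitation Dilemma__ <br>
We need to find a balance between exploring the environment (trying new actions that may turn out to be better) and exploiting the knowledge already gained (repeating actions known to work); too much of either can significantly decrease the reward. This exploration/exploitation dilemma is one of the open fundamental questions in RL.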
<br>

3. __Reward can be delayed from actions__ <br>
For example, in chess there can be a huge delay between taking an action and receiving the reward (at the end of the game), which makes it hard to tell which actions led to the outcome.

<br>

# 3. RL formalisms and relations

---

<br>

### 3.1. RL Entities

The following diagram shows 2 major RL entities (Agent and Environment) and their communication channels (Actions, Reward, and Observations).

<img width="400px" src="assets/img2.png">

<br>

### 3.2. Reward

In RL, the reward tells our agent how well it has behaved. It has the following properties:

* It's a scalar value that we obtain periodically from the environment. It can be positive or negative, large or small.
* The frequency of receiving a reward depends on the environment; it can be every second or once in a lifetime. However, we receive it at every fixed time step or at every interaction with the environment (when rewards are rare, the reward is simply zero at all other steps).
* Reward is local. In other words, it reflects the success of the agent's recent activity, not all the successes achieved by the agent so far. 

<br>

An agent's goal is to achieve the largest accumulated reward over its sequence of actions. In fact, the term reinforcement comes from the fact that a reward obtained by an agent should reinforce its behavior in a positive or negative way. 

<br>

Below you can find some examples that illustrate reward in different environments:
* __Financial trading:__ <br>An amount of profit is a reward for a trader buying and selling stocks.<br><br>

* __Chess:__ <br>Here, reward is obtained at the end of the game, as a win, loss, or draw. Of course, it's up to interpretation. For me, for example, having a draw in a match against a chess master would be a huge reward. In practice, we need to explicitly specify the exact reward value, but it could be a fairly complicated expression. For instance, in the case of chess, the reward could be proportional to the opponent's strength.<br><br>

* __Dopamine system in a brain:__ <br>There is a part in the brain (limbic system) that produces dopamine every time it needs to send a positive signal to the rest of the brain. Higher concentrations of dopamine lead to a sense of pleasure, which reinforces activities considered by this system as good. Unfortunately, the limbic system is ancient in terms of things it considers good: food, reproduction, and dominance, but this is a totally different story.<br><br>

* __Computer games:__ <br>They usually give obvious feedback to the player, which is either the number of enemies killed or a score gathered. Note in this example that reward is already accumulated, so the RL reward for arcade games should be the derivative of the score, that is, +1 every time a new enemy is killed and 0 at all other time steps.<br><br>

* __Web navigation:__ <br>There is a set of problems with high practical value that involve automatically extracting information present on the web. Search engines are trying to solve this task in general, but sometimes, to get to the data you're looking for, you need to fill in some forms, navigate through a series of links, or complete captchas, which can be difficult for search engines to do. There is an RL-based approach to those tasks, in which the reward is the information or the outcome you need to get.<br><br>

* __Neural network architecture search:__ <br>RL has been successfully applied to the domain of NN architecture optimization, where the aim is to get the best performance metric on some dataset by tweaking the number of layers or their parameters, adding extra bypass connections, or making other changes to the neural network architecture. The reward in this case is the performance (accuracy or another measure showing how accurate the NN predictions are).<br><br>

* __Dog training:__ <br>If you have ever tried to train a dog, you know that you need to give it something tasty (but not too much) every time it does the thing you've asked. It's also common to punish your pet a bit (negative reward) when it doesn't follow your orders, although recent studies have shown this isn't as effective as positive rewards.<br><br>

* __School marks:__ <br>We all have experience here! School marks are a reward system to give pupils feedback about their studying.

As you can see from the preceding examples, the notion of reward is a very general indication of the agent's performance, and it can be found or artificially injected into lots of practical problems around us.
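
The computer games example above notes that the RL reward is the change in the accumulated score rather than the score itself. A minimal sketch of that conversion, assuming the score is available as a list of per-step readings (the function name `score_to_rewards` and the sample values are hypothetical):

```python
def score_to_rewards(scores, initial_score=0):
    """Convert accumulated scores into per-step rewards (score differences)."""
    rewards = []
    previous = initial_score
    for score in scores:
        rewards.append(score - previous)
        previous = score
    return rewards


# Hypothetical score readings: an enemy is killed before the third and fifth readings.
scores = [0, 0, 1, 1, 2]
rewards = score_to_rewards(scores)
print(rewards)  # [0, 0, 1, 0, 1]
```

Summing the per-step rewards recovers the final score, which is a quick sanity check for this kind of conversion.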

<br>

### 3.3. Agent

An agent is somebody/something that interacts with the environment by executing certain actions, taking observations, and receiving rewards. In most practical RL scenarios, it's our piece of software that is supposed to solve some problem in a more-or-less efficient way. 

For the examples above, the agents are as follows:

* __Financial trading:__ <br>A trading system or a trader making decisions about order execution <br><br>

* __Chess:__ <br>A player or a computer program <br><br>

* __Dopamine system:__ <br>The brain itself, according to sensory data, decides if it was a good experience or bad<br><br>

* __Computer games:__ <br>The player who plays the game or the computer program.<br><br>

* __Web navigation:__ <br>The software that tells the browser which links to click on, where to move the mouse, or which text to enter<br><br>

* __Neural network architecture search:__ <br>The software that controls the concrete architecture of the neural network being evaluated<br><br>

* __Dog training:__ <br>Your beloved pet<br><br>

* __School:__ <br>Student/pupil 

<br>

### 3.4. Environment

The environment is everything outside of an agent. In the most general sense, it's the rest of the universe, but this goes slightly overboard and exceeds the capacity of even tomorrow's computers, so in practice we use common sense and limit the environment to what is relevant to the problem.

The environment is external to an agent, and the agent's communication with the environment is limited to:
* __Rewards__ (obtained from the environment)
* __Actions__ (executed by the agent and given to the environment)
* __Observations__ (some information besides the rewards that the agent receives from the environment). 
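
The three channels above can be sketched as a tiny interaction loop. This is only an illustration under stated assumptions: the class and method names (`Environment`, `Agent`, `get_observation`, `get_actions`, `action`, `step`) are hypothetical, the observation is a placeholder, and the reward is random.

```python
import random


class Environment:
    """Toy environment: the episode lasts a fixed number of steps."""

    def __init__(self, max_steps=10):
        self.steps_left = max_steps

    def get_observation(self):
        # Placeholder observation; a real environment would return sensor data.
        return [0.0, 0.0, 0.0]

    def get_actions(self):
        # Two discrete actions are available at every step.
        return [0, 1]

    def is_done(self):
        return self.steps_left == 0

    def action(self, action):
        # The action is ignored in this toy example; the reward is random.
        if self.is_done():
            raise RuntimeError("Episode is over")
        self.steps_left -= 1
        return random.random()


class Agent:
    """Toy agent: ignores observations and picks actions at random."""

    def __init__(self):
        self.total_reward = 0.0

    def step(self, env):
        observation = env.get_observation()          # observation channel (unused here)
        chosen = random.choice(env.get_actions())    # action channel
        reward = env.action(chosen)                  # reward channel
        self.total_reward += reward


env = Environment()
agent = Agent()
while not env.is_done():
    agent.step(env)
print(f"Total reward: {agent.total_reward:.4f}")
```

The agent never touches the internals of the environment; everything it learns has to come through the observation and reward values returned to it.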

<br>

### 3.5. Action

Actions are things that an agent can do in the environment. Actions are moves allowed by the rules of play. They can be simple such as move pawn one space forward (in chess), or complicated such as fill the tax form in for tomorrow morning. 

In RL, we distinguish between two types of actions: 

* __Discrete:__ <br>Discrete actions form a finite set of mutually exclusive things an agent could do, such as move left or right.<br><br>

* __Continuous:__ <br>
Continuous actions have some value attached to the action, such as a car's "steer the wheel" action, which has an angle and a direction of steering. Different angles could lead to a different scenario a second later, so just saying "steer the wheel" is definitely not enough.
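
One way to picture the difference in code is shown below. This is a sketch only: the names `MoveAction` and `SteerAction` and the sign convention for the angle are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class MoveAction(Enum):
    """Discrete action: exactly one member of a finite set is chosen per step."""
    LEFT = 0
    RIGHT = 1


@dataclass(frozen=True)
class SteerAction:
    """Continuous action: the action carries a real-valued parameter."""
    angle_degrees: float  # assumed convention: negative = left, positive = right


discrete_choice = MoveAction.LEFT
gentle_turn = SteerAction(angle_degrees=-5.0)
sharp_turn = SteerAction(angle_degrees=40.0)

print(len(MoveAction))            # size of the discrete action set
print(gentle_turn != sharp_turn)  # same kind of action, different values
```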

<br>

### 3.6. Observations

There are two channels of information for an agent:
* Reward
* Observations

Observations are pieces of information that the environment provides to the agent and that say what's going on around it. They may be relevant to the upcoming reward (such as seeing a bank notification saying you have been paid) or not. 

Observations can even include reward information in some vague or obfuscated form, such as score numbers on a computer game's screen. Score numbers are just pixels, but we can potentially convert them into reward values; that's not a big deal with modern deep learning at hand. On the other hand, the reward shouldn't be seen as a secondary or unimportant thing. The reward is the main force that drives the agent's learning process. If the reward is wrong, noisy, or just slightly off course from the primary objective, then there is a chance that training will go the wrong way.

<br>

### 3.7. State vs. Observation

It's important to distinguish between an environment's state and its observations. The state of an environment potentially includes every atom in the universe, which makes it impossible to measure everything about the environment; observations are the part of that state the agent actually gets to see. This is completely fine: RL was created to support such cases natively. 
Now let's support our intuition with a set of examples:

* __Financial trading:__<br>
    * Environment: The whole financial market and everything that influences it. This is a huge list of things such as the latest news, economic and political conditions, weather, food supplies, and Twitter trends. Even your decision to stay home today can potentially indirectly influence the world financial system.
    * Observations: This is limited to stock prices, news, and so on. We don't have access to most of the environment's state, which makes trading such a nontrivial thing.<br><br>

* __Chess:__ <br>
    * Environment: The board plus your opponent, which includes their chess skills, mood, brain state, chosen tactics, and so on.
    * Observation: This is what you see (your current chess position), but, I guess, at some levels of play mastery, the knowledge of psychology and ability to read an opponent's mood could increase your chances.<br><br>

* __Dopamine system:__ <br>
    * Environment: Your brain PLUS nervous system and organs' states PLUS the whole world you can perceive.
    * Observations: The inner brain state and signals coming from your senses.<br><br>

* __Computer game:__ <br>
    * Environment: Your computer's state, including all memory and disk data. For networked games, you need to include other computers PLUS all internet infrastructure between them and your machine.
    * Observations: Screen's pixels and sound, that's it. A screen's pixels are not a tiny amount of information, but the whole environment state is definitely larger.<br><br>

* __Web navigation:__ <br>
    * Environment: The internet, including all the network infrastructure between the computer on which our agent works and the web server, which is a really huge system that includes millions and millions of different components.
    * Observation: The web page that is loaded at the current navigation step.<br><br>

* __Neural network architecture search:__ <br>
    * Environment: This includes the NN toolkit that performs the particular neural network evaluation and the dataset that is used to obtain the performance metric. In comparison to the internet, this looks like a tiny toy environment.
    * Observations: This includes some information about the testing, such as loss convergence dynamics or other metrics obtained from the evaluation step. <br><br>

* __Dog training:__ <br>
    * Environment: Your dog (including its hardly observable inner reactions, mood, and life experiences) and everything around it, including other dogs and a cat hiding in a bush.
    * Observations: Signals from your senses and memory. <br><br>

* __School:__ <br>
    * Environment: The school itself, the education system of the country, society, and the cultural legacy.
    * Observations: The same as for dog training, that is, the student's senses and memory.

<br>

### 3.8. Domains in RL

Many other areas contribute or relate to RL. The most significant are shown in the diagram below, which includes six large domains that heavily overlap one another in the methods and specific topics related to decision making (shown inside the inner gray circle). At the intersection of all those related but still different scientific areas sits RL, which is so general and flexible that it can take the best from these varying domains:

* __Machine learning (ML):__ <br>
RL, being a subfield of ML, borrows lots of its machinery, tricks, and techniques from ML. Basically, the goal of RL is to learn how an agent should behave when it is given imperfect observational data.<br><br>

* __Engineering (especially optimal control):__ <br>
This helps in taking a sequence of optimal actions to get the best result.<br><br>

* __Neuroscience:__ <br>
We saw the dopamine system as our example, and it has been shown that the human brain acts in a way that is close to the RL model.<br><br>

* __Psychology:__ <br>
This studies behavior in various conditions, such as how people react and adapt, which is close to the RL topic. <br><br>

* __Economics:__ <br>
One of the important topics is how to maximize reward under imperfect knowledge and the changing conditions of the real world.<br><br>

* __Mathematics:__ <br>
This works with idealized systems, and also devotes significant attention to finding and reaching the optimal conditions in the field of operations research.

<img width="400px" src="assets/img3.png">

<br>

# 4. Markov Decision Processes

---

<br>

### 4.1. Markov Decision Process

The description of Markov decision processes is built like a Russian matryoshka doll: we start from the simplest case of a __Markov Process (MP)__ (also known as a __Markov Chain__), then extend it with rewards, which turns it into a __Markov Reward Process (MRP)__. Then we'll put this idea into one more envelope by adding actions, which leads us to __Markov Decision Processes (MDPs)__.

<br>

### 4.2. Markov Process (or Markov Chain)

Imagine we have a system in front of us that we can only observe. What we observe is called states, and the system can switch between states according to some laws of dynamics. We cannot influence the system, but we can watch the states changing. All possible states of a system form a set called the __state space__. In __Markov processes__, we require this set of states to be finite. Our observations form a sequence of states, or a chain (that's why Markov processes are also called __Markov Chains__). 
For example, in the simplest model of the weather in some city, we can observe the current day as sunny or rainy, which is our state space. A sequence of observations over time forms a chain of states, such as [sunny, sunny, rainy, sunny, ...], and it is called the __history__.

<br>

### 4.3. Markov Property

To be called a Markov process (MP), a system needs to fulfil the Markov property: the future system dynamics from any state have to depend on that state only. The main point of the Markov property is to make every observable state self-contained to describe the future of the system. In other words, the __Markov property__ requires the states of the system to be distinguishable from each other and unique. In this case, only one state is required to model the future dynamics of the system, not the whole history or, say, the last N states.

In the case of our toy weather example, the Markov property limits our model to represent only the cases in which a sunny day can be followed by a rainy one with the same probability, regardless of the number of sunny days we've seen in the past. It's not a very realistic model, as from common sense we know that the chance of rain tomorrow depends not only on the current condition, but on a large number of other factors, such as the season, our latitude, and the presence of mountains and sea nearby. Even solar activity has been reported to influence the weather. So, our example is really naïve, but it's important to understand the limitations and make conscious decisions about them.

If we want to make our model more complex, we can always do this by extending our state space, which allows us to capture more dependencies in the model at the cost of a larger state space. For example, if you want to capture separately the probability of rainy days during summer and winter, then you can include the season in your state. In this case, your state space will be [sunny+summer, sunny+winter, rainy+summer, rainy+winter].
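
Extending the state space this way amounts to taking the Cartesian product of the factors we want the state to capture. A minimal sketch (the variable names are hypothetical):

```python
from itertools import product

weather = ["sunny", "rainy"]
seasons = ["summer", "winter"]

# Each combined state is one (weather, season) pair.
state_space = [f"{w}+{s}" for w, s in product(weather, seasons)]
print(state_space)
# ['sunny+summer', 'sunny+winter', 'rainy+summer', 'rainy+winter']

# The number of possible state-to-state transitions grows quadratically
# with the number of states.
print(len(weather) ** 2, "->", len(state_space) ** 2)  # 4 -> 16
```

This is the cost mentioned above: every factor added to the state multiplies the number of states, and the number of transition probabilities to specify or estimate grows with its square.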

<br>

### 4.4. Transition Matrix

As the system model complies with the __Markov property__, we can capture __transition probabilities__ with a __transition matrix__, which is a square matrix of size N×N (N is the number of states in our model). Every cell in row i and column j of the matrix contains the probability of the system transitioning from state i to state j.

For example, in the sunny/rainy model, the transition matrix is as follows:

<img width="400px" src="assets/img4.png">

In this case: 
* If it's a sunny day then with 80% chance the next day will be sunny, and 20% chance the next day will be rainy.
* If it's a rainy day then there is a 10% probability that the weather will become sunny and a 90% probability of the next day being rainy.
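
The probabilities listed above can be written down as a nested list and used to propagate a distribution over states one day at a time. This is a minimal sketch using plain Python lists; the helper name `step_distribution` is hypothetical.

```python
states = ["sunny", "rainy"]

# Row i, column j: probability of moving from states[i] to states[j].
transition = [
    [0.8, 0.2],  # from sunny
    [0.1, 0.9],  # from rainy
]

# Every row is a probability distribution, so it must sum to 1.
for row in transition:
    assert abs(sum(row) - 1.0) < 1e-9


def step_distribution(dist, matrix):
    """Advance a distribution over states by one step (vector-matrix product)."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


today = [1.0, 0.0]                                   # it is sunny today
tomorrow = step_distribution(today, transition)      # the "sunny" row of the matrix
day_after = step_distribution(tomorrow, transition)  # roughly 0.66 sunny, 0.34 rainy
print(dict(zip(states, day_after)))
```

The two-day result follows from the definition: sunny after two days has probability 0.8 × 0.8 + 0.2 × 0.1 = 0.66.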

<br>

### 4.5. Markov Process (Formal Definition)

The formal definition of a Markov process is as follows:
* A set of states (S) that a system can be in
* A transition matrix (T), with transition probabilities, which defines the system dynamics

<br>

The visual representation of a Markov Process (MP) is a graph with:
* Nodes corresponding to system states
* Edges labeled with probabilities representing a possible transition from state to state. If the probability of transition is 0, we don't draw an edge.

<br>

For the sunny/rainy weather model the graph is as shown here:

<img width="500px" src="assets/img5.png">

Now let's look at another, more complicated example: a model of an office worker, which later sections refer to as the __Dilbert Process__. The state space is as follows:
* __Home:__ He's not at the office
* __Computer:__ He's working on his computer at the office
* __Coffee:__ He's drinking coffee at the office
* __Chatting:__ He's discussing something with colleagues at the office

The transition graph is as follows:

<img width="400px" src="assets/img6.png">

The transition matrix for the preceding diagram is as follows:

<img width="700px" src="assets/img7.png">

The transition probabilities could be placed directly on the state transition graph, as shown here:

<img width="400px" src="assets/img8.png">
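
Such a graph can be stored as a dictionary of outgoing edges and used to sample chains of states. The sketch below makes some assumptions: the diagram itself is not reproduced in code here, so the probabilities are the ones that appear in the worked calculations of section 4.13, and the 0.6/0.4 split for home → home and home → coffee is an assumption consistent with those calculations. The names `TRANSITIONS` and `sample_episode` are hypothetical.

```python
import random

# Outgoing edges of every state; zero-probability edges are simply omitted.
TRANSITIONS = {
    "home":     {"home": 0.6, "coffee": 0.4},  # assumed split, see lead-in
    "coffee":   {"coffee": 0.1, "chat": 0.7, "computer": 0.2},
    "chat":     {"chat": 0.5, "coffee": 0.2, "computer": 0.3},
    "computer": {"computer": 0.5, "chat": 0.1, "coffee": 0.2, "home": 0.2},
}


def sample_episode(start, length, rng):
    """Sample a chain of `length` states, starting from `start`."""
    episode = [start]
    while len(episode) < length:
        edges = TRANSITIONS[episode[-1]]
        # Pick the next state according to the outgoing probabilities.
        next_state = rng.choices(list(edges), weights=list(edges.values()))[0]
        episode.append(next_state)
    return episode


rng = random.Random(0)  # fixed seed so that a run is repeatable
episode = sample_episode("home", 8, rng)
print(" → ".join(episode))
```

Only the current state is consulted when choosing the next one, which is exactly what the Markov property requires.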

<br>

### 4.6. Episode

In practice, we rarely know the exact transition matrix. A much more real-world situation is when we have only observations of our system's states, which are also called __episodes__:

* home → coffee → coffee → chat → chat → coffee → computer → computer → home
* computer → computer → chat → chat → coffee → computer → computer → computer
* home → home → coffee → chat → computer → coffee → coffee

It's not complicated to estimate the transition matrix from our observations; we just count all the transitions from every state and normalize them to a sum of 1. The more observation data we have, the closer our estimation will be to the true underlying model.
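
The count-and-normalize procedure can be sketched with the three episodes listed above. The variable names are hypothetical, and with so little data the estimate is necessarily rough.

```python
from collections import Counter, defaultdict

episodes = [
    ["home", "coffee", "coffee", "chat", "chat", "coffee", "computer", "computer", "home"],
    ["computer", "computer", "chat", "chat", "coffee", "computer", "computer", "computer"],
    ["home", "home", "coffee", "chat", "computer", "coffee", "coffee"],
]

# Count every observed (state, next_state) pair.
counts = defaultdict(Counter)
for episode in episodes:
    for state, next_state in zip(episode, episode[1:]):
        counts[state][next_state] += 1

# Normalize each row so that the outgoing probabilities sum to 1.
estimate = {
    state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
    for state, row in counts.items()
}

for state, row in estimate.items():
    print(state, {nxt: round(p, 2) for nxt, p in row.items()})
```

For instance, home is left three times in these episodes, twice towards coffee, so the estimated probability of home → coffee is 2/3. Transitions that never occur in the data simply get no entry, which corresponds to an estimated probability of 0.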

<br>

### 4.7. Stationary vs. Non-Stationary

Let's start by defining stationary and non-stationary:
* __Stationary__ means that the underlying transition distribution for any state does not change over time. It's worth mentioning that the Markov property implies stationarity. 
* __Non-stationarity__ means that there is some hidden factor that influences our system dynamics, and this factor is not included in observations. However, this contradicts the Markov property, which requires the underlying probability distribution to be the same for the same state regardless of the transition history. 

It's important to understand the difference between the actual transitions observed in an episode and the underlying distribution given in the transition matrix. Concrete episodes that we observe are randomly sampled from the distribution of the model, so they can differ from episode to episode. However, the probability of each concrete transition being sampled remains the same. If this is not the case, the Markov chain formalism is not applicable.

<br>

### 4.8. Reward in Markov Process

To introduce rewards, we need to extend our Markov process model. 

* First, we need to add a value to each transition from state to state. We already have a probability, but it is used to capture the dynamics of the system, so the reward is an extra scalar number attached to the transition. Reward can be represented in various forms. The most general way is to have another square matrix, similar to the transition matrix, with the reward for transitioning from state i to state j residing in row i and column j. Rewards can be positive or negative, large or small. In some cases, this representation is redundant and can be simplified. For example, if the reward is given for reaching a state regardless of the previous state, we can keep only state → reward pairs, which is a more compact representation. However, this is applicable only if the reward value depends only on the target state, which is not always the case.

* Second, we need to add a discount factor γ (gamma) to the model: a single number from 0 to 1 (inclusive). Its meaning will be explained later.

<br>

### 4.9. Return 

As you remember, we observe a chain of state transitions in a Markov process. This is still the case for a Markov reward process, but for every transition, we have an extra quantity: the reward. So now, all our observations have a reward value attached to every transition of the system.

For every episode, we define the return at time t as the following quantity:

<img width="350px" src="assets/img9.png">

This means that for every time point, we calculate return as a sum of subsequent rewards, but more distant rewards are multiplied by the discount factor raised to the power of the number of steps we are away from the starting point at time t. 
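
In code, the description above corresponds to the sum of gamma<sup>k</sup> multiplied by the reward received k steps after the first one. A minimal sketch, where the function name `discounted_return` and the reward values are hypothetical:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], where rewards[0] is the first reward after time t."""
    total = 0.0
    # Iterate backwards so that every earlier step applies one more discount:
    # G = reward + gamma * (return of the rest of the episode).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1, 1, 2, -1]  # hypothetical rewards following time t
print(discounted_return(rewards, gamma=1.0))  # 3.0
print(discounted_return(rewards, gamma=0.0))  # 1.0
print(discounted_return(rewards, gamma=0.5))  # 1.875
```

For gamma = 0.5 the sum is 1 + 0.5 × 1 + 0.25 × 2 + 0.125 × (−1) = 1.875.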

<br>

### 4.10. Discount Factor or Gamma

The discount factor stands for the foresightedness of an agent. 
* If gamma=1, then the return G<sub>t</sub> equals the sum of all subsequent rewards.
* If gamma=0, then the return G<sub>t</sub> is just the immediate reward, without any subsequent rewards.
* These extreme values (0 and 1) are useful only in corner cases; usually gamma is set to something in between, such as 0.9 or 0.99. In this case, we look into future rewards, but not too far.

Think about the gamma parameter as a measure of how far into the future we look to estimate the future return: the closer to 1, the more steps ahead of us we take into account.
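
To get a feeling for how far each value of gamma "looks", one can print the weight that a reward k steps ahead receives. This is a small sketch; the function name `discount_weights` and the choice of 50 steps are arbitrary.

```python
def discount_weights(gamma, steps):
    """Weight gamma**k applied to the reward received k steps ahead."""
    return [gamma ** k for k in range(steps)]


for gamma in (0.0, 0.9, 0.99, 1.0):
    weights = discount_weights(gamma, steps=51)
    # Weight of a reward 50 steps ahead, and the return of a constant reward of 1
    # collected over these 51 steps.
    print(f"gamma={gamma}: weight at step 50 = {weights[-1]:.4f}, "
          f"sum of first 51 weights = {sum(weights):.2f}")
```

With gamma = 0.9 a reward 50 steps ahead is weighted by about 0.005 and is practically ignored, while with gamma = 0.99 it still keeps about 60% of its value.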

<br>

### 4.11. State Value

We can define the value of a state as follows. For every state s, the value V(s) is the average (or expected) return we get by following the Markov reward process from that state.

<img width="200px" src="assets/img10.png">

<br>

### 4.12. Dilbert Reward Process (DRP)

Let's extend the __Dilbert Process__ with rewards and turn it into a __Dilbert Reward Process (DRP)__. Our reward values will be as follows:
* home → home: 1 (as it's good to be home)
* home → coffee: 1
* computer → computer: 5 (working hard is a good thing) 
* computer → chat: -3 (it's not good to be distracted)
* chat → computer: 2 
* computer → coffee: 1
* computer → home: 2
* coffee → computer: 3
* coffee → coffee: 1
* coffee → chat: 2
* chat → coffee: 1
* chat → chat: -1 (long conversation becomes boring)

A diagram with rewards is shown here:

<img width="550px" src="assets/img11.png">

<br>

### 4.13. State Value Function (Gamma=0)

In this part we will calculate the state values when gamma = 0.
As an example, let's fix our state to Chat. The subsequent transition depends on chance. According to the transition matrix for the Dilbert Process, the probabilities are as follows:
* 50% chance that the next state is Chat
* 20% chance that it is Coffee
* 30% chance that we return to the Computer state

When gamma = 0, the return equals just the reward of the next immediate transition. So, to calculate the value of the Chat state, we sum the rewards of all outgoing transitions, each multiplied by its probability. Doing the same for the other states gives:

* V(chat) = -1 * 0.5 + 2 * 0.3 + 1 * 0.2 = 0.3
* V(coffee) = 2 * 0.7 + 1 * 0.1 + 3 * 0.2 = 2.1
* V(home) = 1 * 0.6 + 1 * 0.4 = 1.0
* V(computer) = 5 * 0.5 + (-3) * 0.1 + 1 * 0.2 + 2 * 0.2 = 2.8

So, Computer is the most valuable state to be in (if we care only about immediate reward). This is not surprising: Computer → Computer is frequent and has a large reward, and the ratio of interruptions is not too high.
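The four calculations above can be reproduced with a short sketch. The probabilities and rewards are read off the worked sums; the home → home / home → coffee split (0.6 and 0.4) follows the order of the terms in V(home), and the computer → home entry (probability 0.2, reward 2) is taken from the fourth term of V(computer). The dictionary layout and the name `immediate_value` are illustrative choices.

```python
# DRP[state][next_state] = (probability, reward), as used in the sums above.
DRP = {
    "home": {"home": (0.6, 1), "coffee": (0.4, 1)},
    "coffee": {"chat": (0.7, 2), "coffee": (0.1, 1), "computer": (0.2, 3)},
    "chat": {"chat": (0.5, -1), "computer": (0.3, 2), "coffee": (0.2, 1)},
    # The home entry is inferred from the fourth term of V(computer).
    "computer": {"computer": (0.5, 5), "chat": (0.1, -3),
                 "coffee": (0.2, 1), "home": (0.2, 2)},
}


def immediate_value(state):
    """State value for gamma = 0: the expected reward of the next transition."""
    return sum(p * r for p, r in DRP[state].values())


for state in DRP:
    print(f"V({state}) = {immediate_value(state):.1f}")
```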

<br>

### 4.14. State Value Function (Gamma=1)

In this part we will calculate the state values when gamma = 1.

In this case, the value is infinite for all states. Our diagram doesn't contain sink states (states without outgoing transitions), and when our discount equals 1, we care about a potentially infinite number of transitions in the future. As every state has a positive expected immediate reward (see the values for gamma = 0), the sum of rewards grows without bound.

This infinite result shows us one of the reasons to introduce gamma into a Markov Reward Process, instead of just summing all future rewards. In most cases, the process can have an infinite (or very large) number of transitions. As it is not very practical to deal with infinite values, we would like to limit the horizon we calculate values for. A gamma value less than 1 provides such a limitation, and we'll discuss this later in the chapters about the value iteration family of methods.

On the other hand, if we're dealing with finite-horizon environments (like the game of tic-tac-toe, which lasts at most 9 moves), then it is fine to use gamma=1. As another example, there is an important class of environments with only one step, called the Multi-Armed Bandit MDP. In these environments, on every step you select one of the alternative actions, which provides you with some reward, and the episode ends.
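To see the infinite-value argument for the Dilbert Reward Process numerically, the sketch below computes the expected return over the first n transitions for a growing n. The transition data is the same as in the gamma = 0 calculation (with the computer → home entry taken from the fourth term of V(computer)); the function name `expected_return` and the step counts are illustrative assumptions.

```python
# DRP[state][next_state] = (probability, reward), as in the gamma = 0 sums.
DRP = {
    "home": {"home": (0.6, 1), "coffee": (0.4, 1)},
    "coffee": {"chat": (0.7, 2), "coffee": (0.1, 1), "computer": (0.2, 3)},
    "chat": {"chat": (0.5, -1), "computer": (0.3, 2), "coffee": (0.2, 1)},
    "computer": {"computer": (0.5, 5), "chat": (0.1, -3),
                 "coffee": (0.2, 1), "home": (0.2, 2)},
}


def expected_return(gamma, steps):
    """Expected discounted return over the first `steps` transitions, per state."""
    values = {s: 0.0 for s in DRP}
    for _ in range(steps):
        # One more step of look-ahead: reward now plus discounted value later.
        values = {
            s: sum(p * (r + gamma * values[nxt]) for nxt, (p, r) in DRP[s].items())
            for s in DRP
        }
    return values


for steps in (1, 10, 100, 1000):
    undiscounted = expected_return(1.0, steps)["chat"]
    discounted = expected_return(0.9, steps)["chat"]
    print(f"{steps:>5} steps: gamma=1 -> {undiscounted:9.1f}, gamma=0.9 -> {discounted:6.2f}")
```

With gamma = 1 the number keeps growing as more steps are included, while with gamma = 0.9 it settles to a finite value.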

<br>

### 4.15. Best Value for Gamma

Gamma is usually set to a value between 0 and 1. Commonly used values for gamma are 0.9 and 0.99. 

<br>

### 4.16. Add Action to Markov Decision Process

The Markov Reward Process (MRP) can be extended to include actions. The steps are as follows:
* Add a finite set of actions (A). This is our agent's action space.
* Condition the transition matrix on the action, which means the matrix needs an extra action dimension and so turns into a cube. In the case of MPs and MRPs, the transition matrix has a square form, with the source state in rows and the target state in columns. So, every row i contains a list of probabilities of jumping to every state:

<img width="400px" src="assets/img12.png">

* Now the agent can actively choose an action to take at every step. So for every source state we have a matrix in which the depth dimension contains the actions that the agent can take, and the other dimension is the target state that the system will jump to after the agent performs the action. The following diagram shows our new transition table, which has become a cube with:
    * Source state as the height dimension (indexed by i)
    * Target state as the width (indexed by j)
    * The action that the agent can choose as the depth (indexed by k)

So, in general, by choosing an action, the agent can affect the probabilities of target states, which is a useful ability.

<img width="400px" src="assets/img13.png">

As an example, to give you an idea of why we need so many complications, let's imagine a small robot which:
* Lives in a 3×3 grid
* Can execute the actions turn left, turn right, and go forward
* The state of the world consists of:
    * The robot's position
    * The robot's orientation (up, down, left, or right), which gives us 3×3×4=36 states (the robot can be at any location in any orientation)
* Imagine that the robot has imperfect motors (as in the real world):
    * When it executes turn left or turn right, there is a 90% chance that the desired turn happens
    * With 10% probability, the wheel slips and the robot stays in the same state
    * The same happens with go forward: in 90% of cases it works, but in the remaining 10% the robot stays at the same position

In the following illustration, a small part of a transition diagram is shown:
* It displays the possible transitions from the state (1, 1, up), when the robot is in the center of the grid and facing up. 
* If it tries to move forward, there is a 90% chance that it will end up in the state (0, 1, up).
* However, there is a 10% probability that the wheels will slip and the robot will remain in the state (1, 1, up).

To properly capture all these details about the environment and the possible reactions to the agent's actions, the general MDP has a 3D transition matrix with the dimensions source state, action, and target state.

<img width="600px" src="assets/img14.png">
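A minimal sketch of the robot's transition table is given below, stored as a nested mapping rather than a dense cube. Several details are assumptions made for illustration: the state is written as (row, column, orientation) with "up" decreasing the row, which matches the (1, 1, up) → (0, 1, up) example; driving into a wall is assumed to leave the robot where it is; and the names `intended_state` and `transition_probs` are hypothetical.

```python
from itertools import product

GRID = 3
ORIENTATIONS = ["up", "right", "down", "left"]  # clockwise order
ACTIONS = ["turn_left", "turn_right", "forward"]
MOVES = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}
SUCCESS, SLIP = 0.9, 0.1

# 3 x 3 positions times 4 orientations = 36 states.
STATES = list(product(range(GRID), range(GRID), ORIENTATIONS))


def intended_state(state, action):
    """State the robot reaches when the motors work as commanded."""
    row, col, facing = state
    i = ORIENTATIONS.index(facing)
    if action == "turn_left":
        return (row, col, ORIENTATIONS[(i - 1) % 4])
    if action == "turn_right":
        return (row, col, ORIENTATIONS[(i + 1) % 4])
    d_row, d_col = MOVES[facing]
    new_row, new_col = row + d_row, col + d_col
    # Assumption: moving into a wall leaves the robot in place.
    if 0 <= new_row < GRID and 0 <= new_col < GRID:
        return (new_row, new_col, facing)
    return state


def transition_probs(state, action):
    """Distribution over target states for one (source state, action) pair."""
    probs = {state: SLIP}  # the wheels slip and nothing changes
    target = intended_state(state, action)
    probs[target] = probs.get(target, 0.0) + SUCCESS
    return probs


# The full table: source state -> action -> target state -> probability.
P = {s: {a: transition_probs(s, a) for a in ACTIONS} for s in STATES}

print(len(P))                         # 36
print(P[(1, 1, "up")]["forward"])     # {(1, 1, 'up'): 0.1, (0, 1, 'up'): 0.9}
```

A nested dictionary keeps only the non-zero entries; a dense 36×3×36 array would hold the same information in the cube form described above.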

Finally, to turn our MRP into an MDP, we need to add actions to our reward matrix in the same way we did with the transition matrix: our reward matrix will depend not only on the state but also on the action. In other words, the reward the agent obtains now depends not only on the state it ends up in but also on the action that leads to this state. It's similar to putting effort into something: you usually gain skills and knowledge even if the result of your efforts isn't too successful. So, the reward could be better if you're doing something rather than not doing something, even if the final result is the same.

<br>

### 4.17. Policy

Intuitively, a policy is a set of rules that controls the agent's behavior. Even for fairly simple environments, we can have a variety of policies.

For example, in the grid-world robot scenario, the agent can follow different policies, which will lead to different sets of visited states. For instance, the robot can:
* Blindly move forward regardless of anything
* Try to go around obstacles by checking whether the previous forward action failed
* Funnily spin around to entertain its creator
* Choose an action randomly, modelling a drunk robot in the grid world scenario, and so on

The main objective of the RL agent is to gather as much return as possible (defined as __discounted cumulative reward__). So, again, different policies can give us different returns, which makes it important to find a good policy. This is why the notion of policy is important, and it's the central thing we're looking for.

Formally, a policy is defined as the probability distribution over actions for every possible state:

<code>π(a|s) = P[A<sub>t</sub> = a | S<sub>t</sub> = s]</code>

A policy is defined as a probability distribution, not as a concrete action, in order to introduce randomness into an agent's behavior. A deterministic policy is a special case of a probabilistic one, in which the chosen action has probability 1. Another useful notion is that if our policy is fixed and not changing, then our MDP becomes an MRP, as we can reduce the transition and reward matrices with the policy's probabilities and get rid of the action dimension.
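The reduction from an MDP to an MRP under a fixed policy can be sketched as a weighted average over actions. Everything in the example below is hypothetical: the two-state MDP, its numbers, and the name `induced_mrp` are made up for illustration, and the reward is simplified to an expected value per (state, action) pair.

```python
# Hypothetical two-state, two-action MDP; all numbers are made up.
# P[s][a][s2] is the probability of landing in s2 after taking a in s.
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "go": {"s0": 0.8, "s1": 0.2}},
}
# R[s][a] is the expected reward for taking action a in state s.
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
# policy[s][a] = pi(a|s); the row for s1 is deterministic (probability 1).
policy = {"s0": {"stay": 0.5, "go": 0.5}, "s1": {"stay": 1.0, "go": 0.0}}


def induced_mrp(P, R, policy):
    """Average out the action dimension using the policy's probabilities."""
    P_pi, R_pi = {}, {}
    for s in P:
        P_pi[s] = {}
        R_pi[s] = 0.0
        for a, pi_a in policy[s].items():
            R_pi[s] += pi_a * R[s][a]
            for s2, p in P[s][a].items():
                P_pi[s][s2] = P_pi[s].get(s2, 0.0) + pi_a * p
    return P_pi, R_pi


P_pi, R_pi = induced_mrp(P, R, policy)
print(P_pi)  # a plain state-to-state transition table, no actions left
print(R_pi)  # one expected reward per state
```

The result has the same shape as the Markov reward processes discussed earlier: a square transition table plus one reward per state.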


___THE END___