# Fundamentals of Deep Learning
## Table of Contents
- Chapter 9. Deep Reinforcement Learning
    - Deep Reinforcement Learning Masters Atari Games
    - What Is Reinforcement Learning?
    - Markov Decision Processes (MDP)
        - Policy
        - Future Return
        - Discounted Future Return
    - Policy Versus Value Learning
    - Q-Learning and Deep Q-Networks

## Deep Reinforcement Learning Masters Atari Games
This network, termed a **Deep Q-Network (DQN)**, was the first large-scale successful application of reinforcement learning with deep neural networks. DQN was remarkable because the same architecture, without any changes, was capable of learning 49 different Atari games, despite each game having different rules, goals, and gameplay structure.

Later in this chapter we will implement DQN, as it is described in the Nature paper “Human-level control through deep reinforcement learning.”

## What Is Reinforcement Learning?
This learning process involves an **actor**, an **environment**, and a **reward signal**. The actor chooses to take an action in the environment, for which the actor is rewarded accordingly. The way in which an actor chooses actions is called a **policy**. The actor wants to increase the reward it receives, and so must learn an optimal policy for interacting with the environment (Figure 9-2).
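
The interaction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CoinFlipEnv` environment, the `always_heads` policy, and the `run_episode` helper are hypothetical names invented here, not part of the chapter or of any library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the actor guesses a biased coin flip."""

    def __init__(self, p_heads=0.8, seed=0):
        self.p_heads = p_heads
        self.rng = random.Random(seed)

    def step(self, action):
        # action: 0 = guess tails, 1 = guess heads
        outcome = 1 if self.rng.random() < self.p_heads else 0
        # Reward signal: 1.0 for a correct guess, 0.0 otherwise.
        return 1.0 if action == outcome else 0.0


def always_heads():
    """A fixed policy: always guess heads."""
    return 1


def run_episode(env, policy, n_steps=1000):
    """Let the actor act n_steps times and accumulate the reward."""
    total_reward = 0.0
    for _ in range(n_steps):
        action = policy()                 # the policy chooses an action
        total_reward += env.step(action)  # the environment returns a reward
    return total_reward


total = run_episode(CoinFlipEnv(), always_heads)
print(f"total reward over 1000 steps: {total}")
```

Because the coin in this toy setup lands heads 80% of the time, always guessing heads happens to be the best fixed policy here. A learning actor would have to discover that from the reward signal alone.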

![9-2](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0902.png)

Figure 9-2. Reinforcement learning setup

![9-3](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0903.png)

Figure 9-3. A simple reinforcement learning agent balancing a pole. This image is from our OpenAI Gym Policy Gradient agent that we build in this chapter.

## Markov Decision Processes (MDP)
Our pole-balancing example has a few important elements, which we formalize as a **Markov Decision Process (MDP)**. These elements are:

- State
    - The cart has a range of possible places on the x-plane where it can be. Similarly, the pole has a range of possible angles.
- Action
    - The agent can take action by moving the cart either left or right.
- State Transition
    - When the agent acts, the environment changes—the cart moves and the pole changes angle and velocity.
- Reward
    - If an agent balances the pole well, it receives a positive reward. If the pole falls, the agent receives a negative reward.

An MDP is defined as the following:

- S, a finite set of possible states
- A, a finite set of actions
- P(r,s'|s,a), a state transition function
- R, a reward function
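
One possible way to hold these four pieces in code is sketched below. The two-state MDP, its probabilities, and its rewards are all invented for illustration and only loosely echo the pole-balancing example. The reward is folded into the transition table so that each entry mirrors P(r,s'|s,a).

```python
import random

# Hypothetical MDP, invented for illustration.
STATES = ["upright", "fallen"]   # S, a finite set of states
ACTIONS = ["left", "right"]      # A, a finite set of actions

# P[(s, a)] lists (probability, next_state, reward) triples,
# i.e. a tabular form of P(r, s' | s, a).
P = {
    ("upright", "left"):  [(0.9, "upright", 1.0), (0.1, "fallen", -1.0)],
    ("upright", "right"): [(0.6, "upright", 1.0), (0.4, "fallen", -1.0)],
    ("fallen", "left"):   [(1.0, "fallen", 0.0)],
    ("fallen", "right"):  [(1.0, "fallen", 0.0)],
}


def sample_transition(state, action, rng):
    """Draw (next_state, reward) according to P(r, s' | s, a)."""
    outcomes = P[(state, action)]
    weights = [prob for prob, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


rng = random.Random(0)
print(sample_transition("upright", "left", rng))
```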

MDPs offer a mathematical framework for modeling decision-making in a given environment (Figure 9-4).

![9-4](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0904.png)

Figure 9-4. An example of an MDP. Blue circles represent the states of the environment. Red diamonds represent actions that can be taken. The edges from diamonds to circles represent the transition from one state to the next. The numbers along these edges represent the probability of that transition given the chosen action. The numbers at the end of the green arrows represent the reward given to the agent for making the given transition.

### Policy
The goal in an MDP is to find an optimal policy for our agent. A policy is the way in which our agent chooses an action based on its current state.
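
As a minimal sketch, a policy over a small discrete state space can be written as a lookup table, either deterministic (one action per state) or stochastic (a probability for each action). The state and action names and the `act` helper below are hypothetical, chosen only for illustration.

```python
import random

# Deterministic policy: a lookup table from state to action.
deterministic_policy = {"upright": "left", "fallen": "right"}

# Stochastic policy: for each state, a probability for each action.
stochastic_policy = {
    "upright": {"left": 0.7, "right": 0.3},
    "fallen": {"left": 0.5, "right": 0.5},
}


def act(policy, state, rng):
    """Choose an action for `state` under either kind of policy."""
    choice = policy[state]
    if isinstance(choice, str):
        return choice  # deterministic: the table entry is the action
    actions = list(choice)
    weights = [choice[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(act(deterministic_policy, "upright", rng))
print(act(stochastic_policy, "upright", rng))
```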

### Future Return
Future return is how we consider the rewards of the future. Choosing the best action requires consideration of not only the immediate effects of that action, but also the long-term consequences. Sometimes the best action actually has a negative immediate effect, but a better long-term result. 

### Discounted Future Return
To implement discounted future return, we scale the reward of a current state by the discount factor, $\gamma$, to the power of the current time step. In this way, we penalize agents that take many actions before receiving positive reward. Discounted rewards bias our agent to prefer receiving reward in the immediate future, which is advantageous to learning a good policy.
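
As a rough illustration of this scaling, the sketch below sums a list of rewards, weighting the reward at time step t by the discount factor raised to the power t. The function name `discounted_return` and the example reward lists are assumptions made for this example.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum rewards, scaling the reward at time step t by gamma ** t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# The same reward of 1.0, received immediately versus two steps later.
early = discounted_return([1.0, 0.0, 0.0], gamma=0.5)
late = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
print(early, late)
```

With a discount factor of 0.5, the reward received at step 0 keeps its full value of 1.0, while the same reward received at step 2 is worth only 0.25, so an agent maximizing this quantity prefers the earlier reward.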

## Policy Versus Value Learning
In typical supervised learning, we can use stochastic gradient descent to update our parameters to minimize the loss computed from our network’s output and the true label.

In reinforcement learning, we don’t have a true label, only reward signals. However, we can still use SGD to optimize our weights using something called **policy gradients**.

With our loss function defined, we can apply SGD to minimize our loss and learn a good policy.

## Q-Learning and Deep Q-Networks
Q-learning belongs to the category of reinforcement learning called value learning. Instead of directly learning a policy, we will be learning the value of states and actions. Q-learning involves learning a function, a **Q-function**, which represents the quality of a state-action pair. The Q-function, denoted Q(s, a), is a function that calculates the maximum discounted future return when action a is performed in state s.

The **Q-value** represents our expected long-term reward, given that we are in a state, take an action, and then take every subsequent action perfectly (to maximize expected future reward).
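
For a small discrete problem, a Q-function can be stored as a table keyed by state and action, and a policy can be read off it by picking the highest-valued action. The sketch below assumes such a table. The states, actions, and Q-values are made-up numbers, and `greedy_action` is a hypothetical helper name.

```python
# Hypothetical Q-table: Q[(state, action)] -> estimated long-term return.
Q = {
    ("s0", "left"): 1.5,
    ("s0", "right"): 2.0,
    ("s1", "left"): 0.3,
    ("s1", "right"): -0.1,
}


def greedy_action(q_table, state, actions):
    """Return the action with the highest Q-value in the given state."""
    return max(actions, key=lambda a: q_table[(state, a)])


for state in ("s0", "s1"):
    print(state, greedy_action(Q, state, ["left", "right"]))
```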

A question you may be asking is, how can we know Q-values? It is difficult, even for humans, to know how good an action is, because you need to know how you are going to act in the future. Our expected future returns depend on what our long-term strategy is going to be. This seems to be a bit of a chicken-and-egg problem. In order to value a state, action pair you need to know all the perfect subsequent actions. And in order to know the best actions, you need to have accurate values for a state and action.

### The Bellman Equation
We solve this dilemma by defining our Q-values as a function of future Q-values. This relation is called the Bellman equation, and it states that the maximum future reward for taking action $a_t$ in state $s_t$ is the current reward plus the next step’s maximum future reward, discounted by $\gamma$, from taking the next action $a'$:

$Q^{*}(s_t,a_t)=E[r_t+\gamma \max_{a'} Q^{*}(s_{t+1},a')]$

We can use the update rule, then, to propagate that Q-value to the previous time step:

$\hat{Q}_j \to \hat{Q}_{j+1} \to \hat{Q}_{j+2} \to \cdots \to Q^{*}$

This updating of the Q-value is known as **value iteration**.
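
A minimal sketch of value iteration on Q-values is shown below, assuming a tiny MDP whose transition probabilities and rewards are known in advance. The two states, two actions, rewards, and the choice of a 0.5 discount factor are all hypothetical and chosen so the result is easy to check by hand. Each sweep replaces every Q-value with the expected current reward plus the discounted maximum Q-value of the next state.

```python
# Hypothetical MDP: P[(s, a)] lists (probability, next_state, reward).
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]
P = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(1.0, "s1", 1.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}


def q_value_iteration(P, states, actions, gamma=0.5, n_iters=100):
    """Repeatedly apply the Bellman update to every (state, action) pair."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iters):
        # Build the new table from the previous one (a synchronous sweep).
        Q = {
            (s, a): sum(
                prob * (reward + gamma * max(Q[(s_next, b)] for b in actions))
                for prob, s_next, reward in P[(s, a)]
            )
            for s in states
            for a in actions
        }
    return Q


Q = q_value_iteration(P, STATES, ACTIONS)
for key in sorted(Q):
    print(key, round(Q[key], 4))
```

In this toy problem the estimates settle at Q(s1, stay) = 4, Q(s0, go) = 3, and 1.5 for the other two pairs, which can be verified by substituting them back into the Bellman equation. DQN replaces the table with a neural network because realistic state spaces are far too large to enumerate this way.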