<h1 style="text-align: center;">Deep Q-Networks</h1>

<br>

In this chapter, we'll talk about problems with the Value iteration method and introduce its variation, called __Q-learning__. In particular, we'll look at the application of Q-learning to so-called "grid world" environments, which is called __tabular Q-learning__, and then we'll discuss Q-learning in conjunction with neural networks. This combination has the name __DQN__. At the end of the chapter, we'll reimplement a DQN algorithm from the famous paper, *Playing Atari with Deep Reinforcement Learning by V. Mnih and others, published in 2013*, which started a new era in RL development.

<br>

<img width="700px" src="./assets/dqn.jpg">

<br>

# 01. Real-Life Value Iteration

---

Let's recall the value iteration method. In every step, we loop over all states, and for every state, we update its value with a Bellman approximation. The variation of this method for Q-values (values for actions) is almost the same, except that we approximate and store a value for every state and action pair.

Despite the improvement that the value iteration method offers over the Cross-Entropy method, it has its own limitations. For example:
1. __Count of states and ability to loop over them:__ <br> In the value iteration method, we assume that we know all states in advance, that we can iterate over them, and that we can store an approximate value for each of them. This is only possible with simple environments and not with complex ones. <br><br>

2. __Limited to discrete action space only:__ <br> The value iteration approach limits us to discrete action spaces, so we cannot use it for continuous control problems (where actions represent continuous variables), such as the angle of a steering wheel, the force on an actuator, or the temperature of a heater.

<br>

# 02. Tabular Q-Learning

---

There is no need to iterate over every state in the state space. We have an environment that can be used as a source of real-life samples of states. If some state in the state space is never shown to us by the environment, why should we care about its value? We can use only the states obtained from the environment to update values, which can save us lots of work. This modification of the value iteration method is known as __Q-learning__, and for cases with explicit state-to-value mappings, it has the following steps:

1. Start with an empty table, mapping states to values of actions.
2. By interacting with the environment, obtain the tuple s, a, r, s′ (state, action, reward, and the new state). In this step, we need to decide which action to take with consideration of exploration vs. exploitation.
3. Update the Q(s, a) value using the Bellman approximation:
<img width="300px" src="assets/equ_1.png">
4. Repeat from step 2.

As in Value iteration, the end condition could be some threshold of the update or we can perform test episodes to estimate the expected reward from the policy. 

<br><center>***</center><br>

We update the Q-values Q(s, a) using a __"blending"__ technique, which is a weighted average of the old and new values of Q, with the learning rate α (a value from 0 to 1) as the weight:

<img width="480px" src="assets/equ2.png">

This allows values of Q to converge smoothly, even if our environment is noisy.
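
To make the blending concrete, here is a minimal sketch of one update step. The function name `blended_update` and the numbers passed to it are assumptions chosen for illustration; only the formula itself comes from the text above.

```python
GAMMA = 0.9   # discount factor (illustrative value)
ALPHA = 0.2   # learning rate (illustrative value)


def blended_update(old_q, reward, best_next_q, alpha=ALPHA, gamma=GAMMA):
    """Blend the old Q(s, a) with the one-step Bellman target."""
    # One-step Bellman target: r + gamma * max_a' Q(s', a')
    target = reward + gamma * best_next_q
    # Keep (1 - alpha) of the old estimate and move alpha of the way to the target
    return (1 - alpha) * old_q + alpha * target


# Old estimate 0.5, reward 1.0, and a next state whose best Q-value is 0.0
new_q = blended_update(old_q=0.5, reward=1.0, best_next_q=0.0)
print(round(new_q, 3))
```

With α = 0.2, the estimate moves only a fifth of the way from 0.5 towards the target of 1.0, which is why a single noisy sample cannot overwrite what has been learned so far.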

<br><center>***</center><br>
So let's take a look at the final version of the algorithm:
1. Start with an empty table for Q(s, a).
2. Obtain (s, a, r, s′) from the environment.
3. Make a Bellman update:
<img width="480px" src="assets/equ3.png">
4. Check convergence conditions. If not met, repeat from step 2.

As mentioned earlier, this method is called __tabular Q-learning__, as we keep a table of states with their Q-values. Now let's try it on our FrozenLake environment.

In [1]:
# Import the libraries
import gym
import collections
from tensorboardX import SummaryWriter

In [2]:
# Hyperparameters
ENV_NAME = "FrozenLake-v0"
GAMMA = 0.9
ALPHA = 0.2         # Learning rate 
TEST_EPISODES = 20

In [3]:
# The agent class
class Agent:
    
    
    # The constructor
    def __init__(self):
        
        # The environment
        self.env = gym.make(ENV_NAME)
        
        # The initial state
        self.state = self.env.reset()
        
        # Initialize an empty table for Q(s, a)
        self.values = collections.defaultdict(float)

        
    # Function for obtaining the next transition from the environment
    def sample_env(self):

        # Take a random action
        action = self.env.action_space.sample()

        # Update the old state with current state
        old_state = self.state

        # Take the given action and get the observation (next state), reward, done, info
        new_state, reward, is_done, _ = self.env.step(action)

        # Update the current state
        self.state = self.env.reset() if is_done else new_state

        # Return S, A, R, S'
        return (old_state, action, reward, new_state)

    
    # Function for finding the best action to take from given state
    def best_value_and_action(self, state):

        # Initialize the best action value and best action
        best_action_value, best_action = None, None

        # Iterate through action space
        for action in range(self.env.action_space.n):

            # Fetch the action value
            action_value = self.values[(state, action)]

            # If the best action value is none OR it's less than action value
            if (best_action_value is None) or (best_action_value < action_value):

                # Update the best action value
                best_action_value = action_value

                # Update the best action
                best_action = action

        return best_action_value, best_action

    
    # Function for updating the action value table
    def value_update(self, s, a, r, next_s):

        # Find the best action value to take
        best_v, _ = self.best_value_and_action(next_s)

        # Calculate the new action value using Bellman approximation 
        new_val = r + GAMMA * best_v

        # Fetch the old action value
        old_val = self.values[(s, a)]

        # Blend the old and new action value (using learning rate) and update the values table
        self.values[(s, a)] = old_val * (1 - ALPHA) + new_val * ALPHA

        
    # Function for playing one full episode
    def play_episode(self, env):

        # Initialize the total reward
        total_reward = 0.0

        # Get the initial state
        state = env.reset()

        # Infinite loop
        while True:

            # Get the best action
            _, action = self.best_value_and_action(state)

            # Take the given action and get the observation (next state), reward, done, info
            new_state, reward, is_done, _ = env.step(action)

            # Add the reward to total reward
            total_reward += reward

            # If terminal state
            if is_done:

                # Break the loop
                break

            # Update the current state to next state
            state = new_state

        return total_reward

In [4]:
# Execute the program
if __name__ == "__main__":
    
    # Create the environment
    test_env = gym.make(ENV_NAME)
    
    # Initialize the agent
    agent = Agent()
    
    # Summary writer for tensorboard
    writer = SummaryWriter(comment="-q-learning")

    # Initialize the iteration number
    iter_no = 0
    
    # Initialize the best reward
    best_reward = 0.0
    
    # Infinite loop
    while True:
        
        # Increment the iteration number
        iter_no += 1
        
        # Obtain the next transition from the environment
        s, a, r, next_s = agent.sample_env()
        
        # Update the action value table
        agent.value_update(s, a, r, next_s)

        # Initialize the reward
        reward = 0.0
        
        # Iterate through TEST_EPISODES numbers
        for _ in range(TEST_EPISODES):
            
            # Update the reward by playing one full episode
            reward += agent.play_episode(test_env)
            
        # Divide reward by TEST_EPISODES
        reward /= TEST_EPISODES
        
        # Add the reward value into tensorboard
        writer.add_scalar("reward", reward, iter_no)
        
        # If reward is higher than best reward
        if reward > best_reward:
            
            # Print best reward and reward
            print("Best reward updated %.3f -> %.3f" % (best_reward, reward))
            
            # Update the best reward to reward
            best_reward = reward
            
        # If reward is higher than 0.8 then it is SOLVED 
        if reward > 0.80:
            
            # Print the iteration number which is solved
            print("Solved in %d iterations!" % iter_no)
            
            # break the loop
            break
            
    # Close the writer
    writer.close()

Best reward updated 0.000 -> 0.050
Best reward updated 0.050 -> 0.100
Best reward updated 0.100 -> 0.200
Best reward updated 0.200 -> 0.300
Best reward updated 0.300 -> 0.400
Best reward updated 0.400 -> 0.700
Best reward updated 0.700 -> 0.750
Best reward updated 0.750 -> 0.800
Best reward updated 0.800 -> 0.900
Solved in 7679 iterations!


You may have noticed that this version used more iterations to solve the problem compared to the value iteration method from the previous chapter. The reason is that we're no longer using the experience obtained during testing. (In Chapter05/02_frozenlake_q_iteration.py, periodic tests cause an update of Q-table statistics. Here we don't touch Q-values during the test, which causes more iterations before the environment gets solved.) Overall, the total number of samples required from the environment is almost the same. The reward chart in TensorBoard also shows good training dynamics, which are very similar to the value iteration method.

<img width="350px" src="./assets/dynfrozen.png">

<br>

# 03. Deep Q-learning

---

The Q-learning method solves the issue of iterating over all the states. However, it can still struggle when the set of observable states is very large. For example, in Atari games, if we decide to use raw pixels as individual states, then there are going to be too many states to track and approximate values for.

In some environments, the count of (different) observable states can be almost infinite. For example, in the CartPole environment, the state is represented by four floating-point numbers. The number of combinations of values is finite, but extremely large. We can create some bins to discretize those values. To do that, we need to decide which ranges of parameters are important to distinguish as different states and which ranges can be clustered together. In the case of Atari games, we could treat two images that differ by a single pixel as the same state. However, some states still need to be distinguished from one another.
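
As a rough illustration of the binning idea, the sketch below maps a four-number observation onto a tuple of bin indices that could serve as a key in a Q-table. The bin edges in `EDGES` and the helper names are hypothetical; choosing good edges for a real environment is exactly the hard part described above.

```python
import bisect

# Hypothetical bin edges, one list per observation dimension.
# These values are made up for illustration and are not tuned for CartPole.
EDGES = [
    [-1.0, 0.0, 1.0],   # cart position
    [-0.5, 0.0, 0.5],   # cart velocity
    [-0.1, 0.0, 0.1],   # pole angle
    [-1.0, 0.0, 1.0],   # pole angular velocity
]


def to_bin(value, edges):
    """Return the index of the bin that value falls into (0..len(edges))."""
    return bisect.bisect_right(edges, value)


def discretize(observation, edges_per_dim=EDGES):
    """Turn a continuous observation into a hashable tuple of bin indices."""
    return tuple(to_bin(v, e) for v, e in zip(observation, edges_per_dim))


# Two slightly different observations collapse into the same discrete state
state_a = discretize((0.02, -0.30, 0.07, 1.2))
state_b = discretize((0.03, -0.31, 0.06, 1.3))
print(state_a, state_b)
```

With four dimensions and four bins each, this table has at most 256 states, but whether those states separate the situations that matter depends entirely on the chosen edges.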

The following image shows the Pong game. The objective of the game is to get the bouncing ball past our opponent's paddle, while preventing it from getting past our paddle (our paddle is green and it's on the right). The situations below are just two of the 10<sup>70802</sup> possible situations, but we want our agent to act on them differently.

<img width="400px" src="./assets/pongenv.png">

As a solution to this problem, we can use a nonlinear representation that maps both state and action onto a value. Using a deep neural network is one of the most popular options, especially when dealing with observations represented as screen images. 

<br>

With this in mind, let's modify the Q-learning algorithm. The version below looks simple but, unfortunately, it won't work very well; the following sections discuss what can go wrong and how to resolve it. The steps are:
1. Initialize Q(s, a) with some initial approximation
2. By interacting with the environment, obtain the tuple (s, a, r, s′)
3. Calculate loss. 
    - If episode has ended then:
$$ L = (Q_{s,a} - r)^2 $$
    <br>
    - Or otherwise:

<img width="300px" src="assets/equ4.png"> 

4. Update Q(s, a) using the __stochastic gradient descent (SGD)__ algorithm, by minimizing the loss with respect to the model parameters
5. Repeat from step 2 until converged


<br>

### 03.1. Interaction With the Environment

--- 

We need to interact with the environment to receive data for training. In simple environments (such as FrozenLake), we can act randomly. However, in complex environments (such as Pong), acting randomly will not work. Instead, we can use our Q function approximation as a source of behavior (as in the value iteration method, when we remembered our experience during testing).

If our representation of Q is good, then the gained experience will be relevant for training. On the other hand, if our approximation is not perfect (which is usually the case at the beginning of training), then our agent might get stuck with bad actions for some states without ever trying to behave differently. This problem is known as the __exploration vs. exploitation dilemma__. Exploration (random behavior) is more useful at the beginning of training, when the Q approximation is still poor. As training progresses, we want to exploit more (fall back on our Q approximation to decide how to act).

__Epsilon-greedy__ is a method that mixes exploration and exploitation. In this method, we switch between the random policy and the Q policy using the probability hyperparameter ε. The usual practice is to start with `ε = 1.0` (100% random actions) and slowly decrease it to some small value such as `ε = 0.05` (5% random actions) or `ε = 0.02` (2% random actions). In this way, we explore more at the beginning and exploit more at the end. The exploration vs. exploitation dilemma itself is one of the fundamental open questions in RL and an active area of research, which is not even close to being resolved completely.
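
The following is a minimal sketch of epsilon-greedy action selection with a linearly decaying ε. The schedule length `EPS_DECAY_STEPS` and the function names are assumptions made for this example; the start and final values of ε are the ones mentioned above.

```python
import random

EPS_START = 1.0             # 100% random actions at the start
EPS_FINAL = 0.02            # 2% random actions at the end
EPS_DECAY_STEPS = 100_000   # hypothetical length of the decay schedule


def epsilon_at(step):
    """Linearly decrease epsilon from EPS_START to EPS_FINAL, then hold it."""
    decayed = EPS_START - step * (EPS_START - EPS_FINAL) / EPS_DECAY_STEPS
    return max(EPS_FINAL, decayed)


def select_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: uniformly random action index
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9, 0.3]
greedy_action = select_action(q_values, epsilon=0.0)
print(epsilon_at(0), epsilon_at(1_000_000), greedy_action)
```

A linear schedule is only one option; the important property is that ε starts high and ends at a small non-zero value, so the agent never stops exploring completely.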

<br>

### 03.2. SGD Optimization

---

In the Q-Learning procedure, we are trying to approximate a complex, nonlinear function Q(s, a) with a neural network. To do this, we calculate targets for this function using the Bellman equation and then pretend that we have a supervised learning problem at hand. 

One of the requirements for SGD optimization is that the training data is __independent and identically distributed__ (frequently abbreviated as __i.i.d.__). In our case, the data that we're going to use for the SGD update doesn't fulfill these criteria:

1. Our samples are not independent. Even if we accumulate a large batch of data samples, they all will be very close to each other, as they belong to the same episode.

2. Distribution of our training data won't be identical to samples provided by the optimal policy that we want to learn. Data that we have is a result of some other policy (our current policy, random, or both in the case of ε-greedy), but we don't want to learn how to play randomly: we want an optimal policy with the best reward.

To deal with this problem, we usually use a large buffer of our past experience and sample training data from it, instead of using only our latest experience. This technique is called a __replay buffer__. The simplest implementation is a buffer of fixed size, with new data added to the end of the buffer so that it pushes the oldest experience out of it. A replay buffer allows us to train on more-or-less independent data, while the data is still fresh enough to reflect samples generated by our recent policy.
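
A fixed-size replay buffer of this kind can be sketched with a `collections.deque`. The class name `ReplayBuffer`, the `Transition` tuple, and its field order are assumptions for this illustration rather than a fixed interface.

```python
import collections
import random

# One stored step of experience (field names are illustrative)
Transition = collections.namedtuple(
    "Transition", ["state", "action", "reward", "done", "next_state"])


class ReplayBuffer:
    """Fixed-size buffer: new items push the oldest ones out."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive steps
        return random.sample(list(self.buffer), batch_size)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.append(Transition(step, 0, 0.0, False, step + 1))

batch = buffer.sample(2)
print(len(buffer), [t.state for t in buffer.buffer])
```

After five appends into a buffer of capacity three, only the three most recent transitions remain, which is the "push the oldest experience out" behavior described above.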

<br>

### 03.3. Correlation Between Steps

---

Another practical issue with the default training procedure is also related to the lack of i.i.d. data, but in a slightly different manner. The Bellman equation provides us with the value of Q(s, a) via Q(s′, a′) (a technique known as __bootstrapping__). However, the states s and s′ have only one step between them. This makes them very similar, and it's really hard for a neural network to distinguish between them. When we update our network's parameters to make Q(s, a) closer to the desired result, we can indirectly alter the value produced for Q(s′, a′) and other nearby states. This can make our training really unstable, like chasing our own tail: when we update Q for state s, then on subsequent states we discover that Q(s′, a′) has become worse, but attempts to update it can spoil our Q(s, a) approximation, and so on.

To make training more stable, there is a trick called the __target network__, in which we keep a copy of our network and use it for the Q(s′, a′) value in the Bellman equation. This copy is synchronized with our main network only periodically, for example, once every N steps (where N is usually quite a large hyperparameter, such as 1k or 10k training iterations).
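
The periodic synchronization can be sketched without any deep learning library by treating the network parameters as a plain dictionary. Everything here is a stand-in: `online` and `target` play the role of the two networks, the `+= 1.0` line pretends to be an SGD step, and `SYNC_EVERY` is deliberately tiny. In PyTorch, the copy would typically be done with `target_net.load_state_dict(net.state_dict())`.

```python
import copy

SYNC_EVERY = 3   # tiny for illustration; the text suggests 1k to 10k steps

# Stand-ins for the parameters of the main (online) and target networks
online = {"w": [0.0, 0.0]}
target = copy.deepcopy(online)

history = []
for step in range(1, 8):
    # Pretend SGD update: only the online parameters change on every step
    online["w"][0] += 1.0

    # The target parameters are refreshed only once every SYNC_EVERY steps
    if step % SYNC_EVERY == 0:
        target = copy.deepcopy(online)

    history.append((step, online["w"][0], target["w"][0]))

for row in history:
    print(row)
```

Between synchronizations, the target parameters stay frozen, so the values used on the right-hand side of the Bellman equation do not shift while the online network is being updated.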


<br>

### 03.4. The Markov Property

---

Our RL methods use the MDP formalism as their basis, which assumes that the environment obeys the Markov property: the observation from the environment is all that we need to act optimally (in other words, our observations allow us to distinguish states from one another). As we saw in the preceding Pong screenshot, one single image from the Atari game is not enough to capture all the important information (using only one image, we have no idea about the speed and direction of objects, like the ball and our opponent's paddle). This obviously violates the Markov property and moves our single-frame Pong environment into the area of __partially observable MDPs (POMDPs)__. A POMDP is basically an MDP without the Markov property, and POMDPs are very important in practice. For example, most card games in which you don't see your opponents' cards are POMDPs, because the current observation (your cards and the cards on the table) could correspond to different cards in your opponents' hands.

We won't discuss POMDPs in detail in this book, so, for now, we'll use a small technique to push our environment back into the MDP domain. The solution is to maintain several observations from the past and use them as a state. In the case of Atari games, we usually stack k subsequent frames together and use them as the observation at every state. This allows our agent to deduce the dynamics of the current state, for instance, to get the speed of the ball and its direction. The usual "classical" value of k for Atari is four (`k=4`). Of course, it's just a hack, as there can be longer dependencies in the environment, but for most of the games it works well.
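
The stacking of k frames can be sketched with a small helper built on `collections.deque`. The class `FrameStack` is hypothetical, and frames are represented by short strings so that the example stays readable; in practice they would be image arrays. Filling the stack with copies of the first frame on reset is one possible choice; starting from all-zero frames is another.

```python
import collections


class FrameStack:
    """Keep the k most recent frames and expose them as one observation."""

    def __init__(self, k=4):
        self.k = k
        self.frames = collections.deque(maxlen=k)

    def reset(self, first_frame):
        # Fill the whole stack with the first frame so the shape is constant
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return tuple(self.frames)

    def push(self, frame):
        # The newest frame goes last; the oldest one is dropped automatically
        self.frames.append(frame)
        return tuple(self.frames)


stack = FrameStack(k=4)
first_obs = stack.reset("f0")
second_obs = stack.push("f1")
for name in ("f2", "f3", "f4"):
    latest_obs = stack.push(name)
print(first_obs, second_obs, latest_obs)
```

Because each observation now contains four consecutive frames, the difference between neighbouring frames carries the speed and direction information that a single image lacks.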

<br>

### 03.5. The Final Form of DQN Training

---

ε-greedy, the replay buffer, and the target network are the basis that allowed DeepMind to successfully train a DQN on a set of 49 Atari games and demonstrate the efficiency of this approach when applied to complicated environments.

The original paper (without target network) was published at the end of 2013 (___Playing Atari with Deep Reinforcement Learning, Mnih and others.___), and they used seven games for testing. Later, at the beginning of 2015, a revised version of the article, with 49 different games, was published in Nature (___Human-Level Control Through Deep Reinforcement Learning, Mnih and others.___)

<br>

The algorithm for DQN from the preceding papers has the following steps:
1. Initialize:
    - Parameters for <code>Q(s, a)</code> and <code>Q&#x0302;(s, a)</code> with random weights
    - `ε←1.0`
    - Empty replay buffer
    
2. With probability `ε`, select a random action `a`, otherwise <code>a = argmax<sub>a</sub>Q<sub>s,a</sub></code>

3. Execute action a in an emulator and observe reward `r` and the next state `s′`

4. Store transition `(s, a, r, s′)` in the replay buffer

5. Sample a random minibatch of transitions from the replay buffer

6. For every transition in the sampled minibatch, calculate the target `y=r` if the episode has ended at this step, or otherwise:
<img width="250px" src="assets/equ5.png"> 

7. Calculate loss:
<img width="145px" src="assets/equ6.png"> 

8. Update `Q(s, a)` using the SGD algorithm by minimizing the loss with respect to the model parameters

9. Every N steps copy weights from `Q` to <code>Q&#x0302;</code>

10. Repeat from step 2 until converged
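
Steps 6 and 7 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation from the papers: the two "networks" are replaced by lookup tables (`online_table` and `target_table`), `GAMMA = 0.99` is an assumed discount factor, and the per-transition squared errors are averaged over the minibatch.

```python
GAMMA = 0.99   # assumed discount factor


def dqn_targets(batch, target_q, gamma=GAMMA):
    """Compute y for each (state, action, reward, done, next_state) tuple."""
    targets = []
    for _, _, reward, done, next_state in batch:
        if done:
            # Episode ended at this step: no future reward to bootstrap from
            targets.append(reward)
        else:
            # Bootstrap from the target network's best next-state value
            targets.append(reward + gamma * max(target_q(next_state)))
    return targets


def mse_loss(batch, targets, online_q):
    """Mean of (Q(s, a) - y)^2 over the minibatch."""
    errors = [(online_q(s)[a] - y) ** 2
              for (s, a, _, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)


# Toy stand-ins for the two networks: state -> list of Q-values per action
online_table = {0: [0.0, 0.0], 1: [1.0, 2.0]}
target_table = {0: [0.0, 0.0], 1: [1.0, 2.0]}

batch = [
    (0, 1, 1.0, False, 1),   # non-terminal transition
    (1, 0, -1.0, True, 0),   # terminal transition
]

targets = dqn_targets(batch, target_table.__getitem__)
loss = mse_loss(batch, targets, online_table.__getitem__)
print(targets, round(loss, 4))
```

Note that the targets are computed only from the target network's values, while the loss compares them against the online network's values, which is what keeps the regression target from moving on every update.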

<br>

Let's implement it now and try to beat some of the Atari games!

<br>

# 04. DQN on Pong

---

This example has been split into three modules due to its length, logical structure, and reusability. The modules are as follows:

- __Chapter06/lib/wrappers.py:__ These are Atari environment wrappers mostly taken from the OpenAI Baselines project
- __Chapter06/lib/dqn_model.py:__ This is the DQN neural net layer, with the same architecture as the DeepMind DQN from the Nature paper
- __Chapter06/02_dqn_pong.py:__ This is the main module with the training loop, loss function calculation, and experience replay buffer

<br>

### 04.1. Wrappers

---

Tackling Atari games with RL is quite demanding from a resource perspective. To make things faster, several transformations are applied to the Atari platform interaction, which are described in DeepMind's paper. Transformations are usually implemented as OpenAI Gym wrappers of various kinds. 

The full list is quite lengthy and there are several implementations of the same wrappers in various sources. My personal favorite is in the OpenAI repository called __baselines__, which is a set of RL methods and algorithms implemented in TensorFlow and applied to popular benchmarks, to establish the common ground for comparing methods. The repository is available from https://github.com/openai/baselines, and wrappers are available in this file: https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py.

The full list of Atari transformations used by RL researchers includes:

1.  __Convert each life in the game into a separate episode:__ <br>In general, an episode contains all the steps from the beginning of the game until the "Game over" screen, which can last for thousands of game steps. Also, in arcade games, the player is usually given several lives. This transformation splits a full episode into individual small episodes, one per life. This usually helps to speed up convergence, since our episodes become shorter.


2. __Perform random (up to 30) no-op actions in the beginning:__ <br> This stabilizes the training. However, there is no proper explanation why it is the case.


3. __Make an action decision every k steps (usually `k=3` or `k=4`):__ <br> On the intermediate frames, the chosen action is simply repeated. This speeds up training, because the network is evaluated only once every k frames.


4. __Take the maximum of every pixel in the last two frames and use it as an observation:__ <br> Some Atari games have a flickering effect, which is due to the platform's limitation (Atari has a limited amount of sprites that can be shown on a single frame). For a human eye, such quick changes are not visible, but they can confuse neural networks.


5. __Press FIRE at the beginning of the game:__ <br> Some games (including Pong and Breakout) require the user to press the FIRE button to start the game. In theory, it's possible for a neural network to learn to press FIRE itself, but it would require many more episodes to be played. So, we press FIRE in the wrapper.


6. __Scale each frame down from `210×160` (three color frames), into `84×84` (single-color):__ <br> Different approaches are possible. For example, the DeepMind paper describes this transformation as taking the Y-color channel from the YCbCr color space and then rescaling the full image to an 84 × 84 resolution. Some other researchers do grayscale transformation, cropping non-relevant parts of the image and then scaling down. In the Baselines repository (and in the following example code), the latter approach is used.


7. __Stack subsequent frames (usually four) together:__ <br> This gives the network the information about the dynamics of the game's objects.


8. __Clip rewards to -1, 0, 1:__ <br> Among different games, scores can vary wildly. This spread in reward values makes our loss have completely different scales between the games, which makes it harder to find common hyperparameters for a set of games. To fix this, reward just gets clipped.


9. __Convert observations from unsigned bytes to float32 values and rescale:__ <br> The screen obtained from the emulator is encoded as a tensor of bytes with values from 0 to 255, which is not the best representation for a neural network. So, we need to convert the image into floats and rescale the values to the range from 0.0 to 1.0.
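
Reward clipping (item 8) is simple enough to show in a few lines. The function below is a hypothetical helper that keeps only the sign of the reward; in a Gym setup, the same logic could be placed in the `reward` method of a `gym.RewardWrapper` subclass.

```python
def clip_reward(reward):
    """Map any reward to -1, 0, or +1 by keeping only its sign."""
    # (reward > 0) and (reward < 0) are booleans, which behave as 1 and 0
    return (reward > 0) - (reward < 0)


# Scores of very different magnitude end up on the same scale
clipped = [clip_reward(r) for r in (250.0, 1.0, 0.0, -3.0)]
print(clipped)
```

The price of this transformation is that the agent can no longer tell a large reward from a small one, which is acceptable when the goal is to share hyperparameters across many games.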

<br>

In the Pong example, we don't need some of the above wrappers, such as converting lives into separate episodes and reward clipping, so those wrappers aren't included in the example code. However, you should be aware of them, just in case you decide to experiment with other games.

Sometimes, when the DQN is not converging, the problem is not in the code but in a wrongly wrapped environment. I've spent several days debugging convergence issues caused by missing the __FIRE__ button press at the beginning of a game!

Let's take a look at the implementation of individual wrappers from Chapter06/lib/wrappers.py:

In [5]:
# Import the libraries
import cv2
import gym
import gym.spaces
import numpy as np
import collections

In [6]:
# For reference only: list the meaning of each action in Pong
env = gym.make("PongNoFrameskip-v4")
env.unwrapped.get_action_meanings()

['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']

In [7]:
class FireResetEnv(gym.Wrapper):
    """
    This wrapper presses the FIRE button in the beginning of environments since it's required to start the game. 
    In addition, this wrapper checks for several corner cases that are present in some games.
    """
    
    # Constructor function
    def __init__(self, env = None):
        
        # Call the parent's constructor to initialize themselves
        super(FireResetEnv, self).__init__(env)
        
        # Make sure the second item in the action space is 'FIRE'
        assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
        
        # Make sure that there are at least three actions
        assert len(env.unwrapped.get_action_meanings()) >= 3

        
    # Step function
    def step(self, action):
        
        # Take the action and get the observation, reward, is_done, info
        return self.env.step(action)

    
    # Reset function
    def reset(self):
        
        # Reset the environment and get the initial observation
        self.env.reset()
        
        # Take action '1' and get the observation, reward, is_done, info
        obs, _, done, _ = self.env.step(1)
        
        # If terminal state then reset the environment
        if done:
            self.env.reset()
        
        # Take action '2' and get the observation, reward, is_done, info
        obs, _, done, _ = self.env.step(2)
        
        # If terminal state then reset the environment
        if done:
            self.env.reset()
            
        return obs

In [8]:
class MaxAndSkipEnv(gym.Wrapper):
    """
    This wrapper combines the repetition of actions during K frames and pixels from two consecutive frames.
    Return only every `skip`-th frame
    """
    
    # Constructor function
    def __init__(self, env = None, skip = 4):
        
        # Call the parent's constructor to initialize themselves
        super(MaxAndSkipEnv, self).__init__(env)
        
        # Most recent raw observations (for max pooling across time steps)
        self._obs_buffer = collections.deque(maxlen = 2)
        
        # The skipping number (usually 3 or 4)
        self._skip = skip

        
    # Step function
    def step(self, action):
        
        # Initialize the total reward
        total_reward = 0.0
        
        # Initialize 'done' with None
        done = None
        
        # Iterate through number of 'skip's
        for _ in range(self._skip):
            
            # Take the action and get the observation, reward, is_done, info
            obs, reward, done, info = self.env.step(action)
            
            # Append the observation into _obs_buffer
            self._obs_buffer.append(obs)
            
            # Add reward into total reward
            total_reward += reward
            
            # If terminal state
            if done:
                
                # Break the loop
                break
                
        # Get the maximum of every pixel in the last two frames (since 'maxlen=2') and use it as an observation         
        max_frame = np.max(np.stack(self._obs_buffer), axis=0)
        
        return max_frame, total_reward, done, info

    
    # Reset function
    def reset(self):
        
        # Clear past frame buffer
        self._obs_buffer.clear()
        
        # Reset the environment and get the initial observation
        obs = self.env.reset()
        
        # Append the initial observation into _obs_buffer
        self._obs_buffer.append(obs)
        
        return obs

In [9]:
class ProcessFrame84(gym.ObservationWrapper):
    """
    The goal of this wrapper is to convert input observations from the emulator, which normally has a resolution 
    of 210 × 160 pixels with RGB color channels, to a grayscale 84 × 84 image. It does this using a colorimetric 
    grayscale conversion (which is closer to human color perception than a simple averaging of color channels), 
    resizing the image and cropping the top and bottom parts of the result.
    """
    
    # Constructor function
    def __init__(self, env = None):
        
        # Call parent's constructor to initialize themselves
        super(ProcessFrame84, self).__init__(env)
        
        # Initialize the frame size
        self.observation_space = gym.spaces.Box(low=0, high=255, shape=(84, 84, 1), dtype=np.uint8)

    
    # Function for calling process method (for processing the observation)
    def observation(self, obs):
        return ProcessFrame84.process(obs)

    
    # A static method does not receive an implicit first argument (self or cls)
    # Process function
    @staticmethod
    def process(frame):
        
        # If the frame has 100,800 values, reshape it to 210x160x3 and convert to float32
        if frame.size == 210 * 160 * 3:
            img = np.reshape(frame, [210, 160, 3]).astype(np.float32)
            
        # If the frame has 120,000 values, reshape it to 250x160x3 and convert to float32
        elif frame.size == 250 * 160 * 3:
            img = np.reshape(frame, [250, 160, 3]).astype(np.float32)
            
        # For any other size, raise an error
        else:
            assert False, "Unknown resolution."
            
        # Weighted sum of the R, G, and B channels (colorimetric grayscale conversion)
        img = img[:, :, 0] * 0.299 + img[:, :, 1] * 0.587 + img[:, :, 2] * 0.114
        
        # Resize image into 84x110
        resized_screen = cv2.resize(img, (84, 110), interpolation=cv2.INTER_AREA)
        
        # Crop the screen
        x_t = resized_screen[18:102, :]
        
        # Reshape the image into 84x84x1
        x_t = np.reshape(x_t, [84, 84, 1])
        
        # Convert to uint8 data type and return it
        return x_t.astype(np.uint8)

In [10]:
class ImageToPyTorch(gym.ObservationWrapper):
    """
    This simple wrapper changes the shape of the observation from HWC to the CHW format required by PyTorch. The 
    input shape of the tensor has a color channel as the last dimension, but PyTorch's convolution layers assume 
    the color channel to be the first dimension.
    """
    
    # Constructor
    def __init__(self, env):
        
        # Call parent's constructor to initialize themselves
        super(ImageToPyTorch, self).__init__(env)
        
        # Get the old observation space
        old_shape = self.observation_space.shape
        
        # Initialize the observation space
        self.observation_space = gym.spaces.Box(low = 0.0, 
                                                high = 1.0,
                                                shape = (old_shape[-1], old_shape[0], old_shape[1]),
                                                dtype = np.float32)

    # Observation function
    def observation(self, observation):
        
        # Move axis of observation and return it
        return np.moveaxis(a = observation, source = 2, destination = 0)
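
As a quick check of what the `np.moveaxis` call does to the shape, the following sketch applies it to a zero-filled array with the shape `ProcessFrame84` returns. The array contents are made up; only the shapes matter here.

```python
import numpy as np

# A fake 84x84 single-channel frame in HWC layout, the shape ProcessFrame84 returns.
frame_hwc = np.zeros((84, 84, 1), dtype=np.uint8)

# Same operation as in ImageToPyTorch.observation(): move the channel axis to the front.
frame_chw = np.moveaxis(frame_hwc, 2, 0)

print(frame_hwc.shape, "->", frame_chw.shape)
```

`np.moveaxis` returns a view rather than a copy, so the wrapper adds almost no overhead per step.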

In [11]:
class ScaledFloatFrame(gym.ObservationWrapper):
    """
    This wrapper converts observation data from bytes to floats and scales every pixel's value to the 
    range from 0.0 to 1.0
    """
    
    # Observation function
    def observation(self, obs):
        
        # Convert to floats + scale pixel values to [0, 1]
        return np.array(obs).astype(np.float32) / 255.0

In [12]:
class BufferWrapper(gym.ObservationWrapper):
    """
    This class creates a stack of subsequent frames along the first dimension and returns them as an observation. 
    The purpose is to give the network an idea about the dynamics of the objects, such as the speed and direction 
    of the ball in Pong or how enemies are moving. This is very important information, which it is not possible 
    to obtain from a single image.
    """
    
    # Constructor
    def __init__(self, env, n_steps, dtype = np.float32):
        
        # Call the parent's constructor
        super(BufferWrapper, self).__init__(env)
        
        # Initialize the data type
        self.dtype = dtype
        
        # Initialize the old observation space
        old_space = env.observation_space
        
        # Initialize the observation space
        self.observation_space = gym.spaces.Box(low = old_space.low.repeat(n_steps, axis=0),
                                                high = old_space.high.repeat(n_steps, axis=0), 
                                                dtype = dtype)

    # Reset function
    def reset(self):
        
        # Initialize the buffer
        self.buffer = np.zeros_like(self.observation_space.low, dtype = self.dtype)
        
        # Put the initial observation into observation function and return it
        return self.observation(self.env.reset())

    
    # Observation function
    def observation(self, observation):
        
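        # Shift the stored frames one slot toward the start, dropping the oldest one
        self.buffer[:-1] = self.buffer[1:]
        
        # Store the newest observation in the last slot
        self.buffer[-1] = observation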
        
        return self.buffer
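
The rolling update in `observation()` can be tried in isolation. The sketch below uses tiny 2x2 "frames" instead of 84x84 images so the result is easy to read; the helper name `push` and the frame values are assumptions made for illustration.

```python
import numpy as np

n_steps = 4

# Stack of the last n_steps frames; tiny 2x2 frames keep the output readable.
buffer = np.zeros((n_steps, 2, 2), dtype=np.float32)


def push(buffer, frame):
    """Shift the stack one slot toward index 0 and store the newest frame last."""
    buffer[:-1] = buffer[1:]
    buffer[-1] = frame
    return buffer


# Feed six frames whose pixels all equal 1, 2, ..., 6.
for t in range(1, 7):
    push(buffer, np.full((2, 2), t, dtype=np.float32))

# One representative pixel per slot, ordered from oldest to newest.
print(buffer[:, 0, 0])
```

After six pushes, only frames 3 to 6 remain, with the newest in the last slot. Note that `observation()` returns the same array object on every call and mutates it in place; in `make_env` the final `ScaledFloatFrame` wrapper builds a fresh array for each observation, so transitions stored in the replay buffer do not alias it.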

In [13]:
# Function for creating an environment by its name and applying all the required wrappers to it
def make_env(env_name):
    
    # Create an environment
    env = gym.make(env_name)
    
    # Combine the repetition of actions during K frames (k=4) and pixels from two consecutive frames
    env = MaxAndSkipEnv(env)
    
    # Press the FIRE button in the beginning of environments
    env = FireResetEnv(env)
    
    # Process the frames
    env = ProcessFrame84(env)
    
    # Change the shape of the observation from HWC to the CHW format
    env = ImageToPyTorch(env)
    
    # Create a stack of subsequent frames along the first dimension and return them as an observation
    env = BufferWrapper(env, 4)
    
    # Convert to floats + scale pixel values to [0, 1] and then return it
    return ScaledFloatFrame(env)

<br>

### 04.2. DQN Model

---

The model published in Nature has three convolution layers followed by two fully connected layers. All layers are separated by ReLU nonlinearities. The output of the model is Q-values for every action available in the environment, without nonlinearity applied (as Q-values can have any value). The approach to have all Q-values calculated with one pass through the network helps us to increase speed significantly in comparison to treating Q(s, a) literally and feeding observations and actions to the network to obtain the value of the action.

The code of the model is in Chapter06/lib/dqn_model.py:

In [14]:
# Import the libraries
import numpy as np
import torch
import torch.nn as nn

In [15]:
# Deep Q-Network Class
class DQN(nn.Module):
    
    # Constructor
    def __init__(self, input_shape, n_actions):
        
        # Call the parent's constructor
        super(DQN, self).__init__()

        # Network
        self.conv = nn.Sequential(nn.Conv2d(in_channels = input_shape[0], out_channels = 32, kernel_size = 8, stride = 4),
                                  nn.ReLU(),
                                  nn.Conv2d(in_channels = 32, out_channels = 64, kernel_size = 4, stride = 2),
                                  nn.ReLU(),
                                  nn.Conv2d(in_channels = 64, out_channels = 64, kernel_size = 3, stride = 1),
                                  nn.ReLU())

        # Get the size of output
        conv_out_size = self._get_conv_out(input_shape)
        
        # Fully connected layer
        self.fc = nn.Sequential(nn.Linear(in_features = conv_out_size, out_features = 512),
                                nn.ReLU(),
                                nn.Linear(in_features = 512, out_features = n_actions))

    
    # Function for getting the flattened size of the convolution output for a given input shape
    def _get_conv_out(self, shape):
        """
        This function accepts the input shape and applies the convolution layers to a fake tensor of that shape. 
        The result is the number of values in the resulting output tensor. For example, for an 84 × 84 input, 
        the output of the convolution layers has 3136 values.
        """
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    
    # Function for forward pass the network
    def forward(self, x):
        """
        This function accepts a 4D input tensor (batch size, channels, height, width). 
        The transformation is applied in two steps: first the convolution layers, then the fully connected layers.
        """
        
        # Pass the input into conv layers
        conv_out = self.conv(x)
        
        # Flatten the 4D conv output (batch size, channels, height, width) into a 2D tensor of (batch size, features)
        conv_out = conv_out.view(x.size()[0], -1)
        
        # Pass the output into fcl to get the Q-Values and then return it
        return self.fc(conv_out)
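
The figure of 3136 values mentioned in `_get_conv_out` can also be derived by hand. The sketch below applies the standard size formula for an unpadded convolution, `(size - kernel) // stride + 1`, to the three layers defined above; the helper name `conv_out` is an assumption for illustration.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one spatial dimension."""
    return (size - kernel) // stride + 1


# (kernel_size, stride) of the three Conv2d layers in DQN.conv
layers = [(8, 4), (4, 2), (3, 1)]

side = 84
for kernel, stride in layers:
    side = conv_out(side, kernel, stride)

# The last convolution layer has 64 output channels.
flat_features = 64 * side * side

print(side, flat_features)
```

The spatial size shrinks from 84 to 20, then 9, then 7, so the flattened output has 64 * 7 * 7 = 3136 values. Running a dummy tensor through the layers, as `_get_conv_out` does, gives the same number without hard-coding the formula.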

<br>

### 04.3. Training

---

The third module contains the experience replay buffer, the agent, the loss function calculation, and the training loop itself. Before going into the code, something needs to be said about the training hyperparameters. DeepMind's Nature paper contained a table with all the details about hyperparameters used to train its model on all 49 Atari games used for evaluation. DeepMind kept all those parameters the same for all games (but trained individual models for every game), and it was the team's intention to show that the method is robust enough to solve lots of games with varying complexity, action space, reward structure, and other details using one single model architecture and hyperparameters. However, our goal here is much more modest: we want to solve just the Pong game.

Pong is quite simple and straightforward in comparison to other games in the Atari test set, so the hyperparameters in the paper are overkill for our task. For example, to get the best result on all 49 games, DeepMind used a million-observations replay buffer, which requires approximately 20 GB of RAM to keep and lots of samples from the environment to populate. The epsilon decay schedule that was used is also not the best for a single Pong game. In the training, DeepMind linearly decayed epsilon from 1.0 to 0.1 during the first million frames obtained from the environment. However, my own experiments have shown that for Pong, it's enough to decay epsilon over the first 100k frames and then keep it stable. The replay buffer can also be much smaller: 10k transitions will be enough. In the following example, I've used my parameters. These differ from the parameters in the paper but allow us to solve Pong about ten times faster. On a GeForce GTX 1080 Ti, the following version converges to a mean score of 19.5 in one to two hours, but with DeepMind's hyperparameters it will require at least a day.

This speedup, of course, comes from fine-tuning for one particular environment and can break convergence on other games. You're free to play with the options and other games from the Atari set.

In [16]:
# Import the libraries
import argparse
import time
import numpy as np
import collections
import torch
import torch.nn as nn
import torch.optim as optim
from tensorboardX import SummaryWriter
from dqn_lib import wrappers            # Local import of wrappers.py from the dqn_lib package
from dqn_lib import dqn_model           # Local import of dqn_model.py from the dqn_lib package

In [17]:
# Hyperparameters
DEFAULT_ENV_NAME = "PongNoFrameskip-v4"  # Environment name
DEVICE_TYPE = "cpu"                      # CPU or GPU
MEAN_REWARD_BOUND = 19.5                 # Reward boundary for the last 100 episodes to stop training

GAMMA = 0.99                 # Gamma value in Bellman approximation
BATCH_SIZE = 32              # Batch size sampled from the replay buffer
REPLAY_SIZE = 10000          # Maximum capacity of the buffer
REPLAY_START_SIZE = 10000    # Count of frames that we wait before starting training to populate the replay buffer 
LEARNING_RATE = 1e-4         # Learning rate used in the Adam optimizer
SYNC_TARGET_FRAMES = 1000    # How often (in frames) we sync weights from the training model to the target model, which is used to get the value of the next state in the Bellman approximation

EPSILON_START = 1.0                # Start epsilon
EPSILON_DECAY_LAST_FRAME = 10**5   # Total of 100'000 frames
EPSILON_FINAL = 0.02               # End epsilon 
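
To see what these three epsilon values imply, here is a small sketch of the linear schedule that the training loop applies. The constants are copied from the cell above; the function name `epsilon_at` and the sampled frame indices are assumptions for illustration.

```python
EPSILON_START = 1.0
EPSILON_FINAL = 0.02
EPSILON_DECAY_LAST_FRAME = 10**5


def epsilon_at(frame_idx):
    """Linear decay from EPSILON_START, clipped from below at EPSILON_FINAL."""
    return max(EPSILON_FINAL, EPSILON_START - frame_idx / EPSILON_DECAY_LAST_FRAME)


for frame_idx in (0, 10_000, 50_000, 98_000, 200_000):
    print(frame_idx, round(epsilon_at(frame_idx), 2))
```

Because the decay is clipped at `EPSILON_FINAL` rather than rescaled, epsilon reaches its floor of 0.02 around frame 98,000, slightly before the nominal 100k frames.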

In [18]:
# Experience replay buffer for keeping the transitions obtained from environment
Experience = collections.namedtuple('Experience', field_names = ['state', 'action', 'reward', 'done', 'new_state'])

In [19]:
# Experience replay buffer
class ExperienceBuffer:
    
    # Constructor
    def __init__(self, capacity):
        
        # Initialize the buffer to have 'capacity' entries only (limited amount of entries)
        self.buffer = collections.deque(maxlen = capacity)
       
    
    # Method for returning the length of buffer
    def __len__(self):
        return len(self.buffer)
    
    
    # Method for appending experience into buffer
    def append(self, experience):
        self.buffer.append(experience)
    
    
    # Method for sampling from the experience buffer
    def sample(self, batch_size):
        
        # Create a list of random indices
        indices = np.random.choice(len(self.buffer), batch_size, replace = False)
        
        # Get the indices from buffer and repack it
        states, actions, rewards, dones, next_states = zip(*[self.buffer[idx] for idx in indices])
        
        # Get the output in the right format
        output = np.array(states), \
                 np.array(actions), \
                 np.array(rewards, dtype = np.float32), \
                 np.array(dones, dtype = np.uint8), \
                 np.array(next_states)
        
        return output
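
The two ideas behind `ExperienceBuffer` are the bounded `deque` and the regrouping of sampled transitions with `zip`. The sketch below shows both with the standard library only; it uses `random.sample` as a stand-in for `np.random.choice(..., replace = False)`, and the tiny capacity and integer "states" are made up for readability.

```python
import collections
import random

Experience = collections.namedtuple(
    "Experience", ["state", "action", "reward", "done", "new_state"])

# A deque with maxlen silently drops the oldest entry once it is full.
buffer = collections.deque(maxlen=3)
for step in range(5):
    buffer.append(Experience(step, 0, 0.0, False, step + 1))

# Only the three most recent transitions survive.
print([exp.state for exp in buffer])

# Draw a batch without replacement and regroup it field by field.
batch = random.sample(list(buffer), 2)
states, actions, rewards, dones, new_states = zip(*batch)
print(states, new_states)
```

Sampling uniformly from a window of recent transitions is what breaks the correlation between consecutive steps of the same episode.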

In [20]:
# The agent
class Agent:
    
    # Constructor
    def __init__(self, env, exp_buffer):    
        
        # Initialize the environment
        self.env = env
        
        # Initialize the experience replay
        self.exp_buffer = exp_buffer
        
        # Initialize the reset method
        self._reset()

        
    # Method for resetting
    def _reset(self):
        
        # Reset the environment and get the initial observation
        self.state = self.env.reset()
        
        # Initialize the total reward with 0
        self.total_reward = 0.0

        
    # Function for performing a step and storing the result in the buffer
    def play_step(self, net, epsilon=0.0, device="cpu"):
        
        # Initialize the end total reward with None
        done_reward = None
        
        # If a random probability is LESS than epsilon
        if np.random.random() < epsilon:
            
            # Take a random action in action space
            action = self.env.action_space.sample()
            
        # If a random probability is MORE than epsilon
        else:
            
            # Wrap the state into a batch of size one (np.asarray avoids the copy=False error raised by NumPy 2)
            state_a = np.asarray([self.state])
            
            # Convert state array into tensor
            state_v = torch.tensor(state_a).to(device)
            
            # Pass forward the state into neural network
            q_vals_v = net(state_v)
            
            # Get the index of the maximum Q-value (which is the action to take)
            _, act_v = torch.max(q_vals_v, dim = 1)
            
            # Get the action in the right data type
            action = int(act_v.item())
        
        # Take action and get the new observation, reward, is_done, info
        new_state, reward, is_done, _ = self.env.step(action)
        
        # Add reward into the total reward
        self.total_reward += reward
        
        # Add the (S, A, R, is_done, S') into the experience namedtuple
        exp = Experience(self.state, action, reward, is_done, new_state)
        
        # Append the experience namedtuple into experience buffer
        self.exp_buffer.append(exp)
        
        # Update the current state into new state
        self.state = new_state
        
        # If terminal state
        if is_done:
            
            # Assign the end total reward into done_reward
            done_reward = self.total_reward
            
            # Reset the environment + set the total reward into 0
            self._reset()
            
        return done_reward

In [21]:
# Function for calculating the loss
def calc_loss(batch, net, tgt_net, device = "cpu"):
    """
    Calculate the loss for the sampled batch. 
    Note: Revisit the loss expression that has been written before.
    
    PARAMETERS
    =========================
        - batch: The batch to calculate the loss
        - net: The network which is used to calculate gradients.
        - tgt_net: Target network, which is periodically synced with the trained one. It is used to calculate values 
                   for the next states, and this calculation shouldn't affect gradients.
        - device: CPU or GPU
    """
    
    # Repack the batch
    states, actions, rewards, dones, next_states = batch

    # Convert states into tensors
    states_v = torch.tensor(states).to(device)
    
    # Convert the next states into tensors
    next_states_v = torch.tensor(next_states).to(device)
    
    # Convert actions into tensors
    actions_v = torch.tensor(actions).to(device)
    
    # Convert rewards into tensors
    rewards_v = torch.tensor(rewards).to(device)
    
    # Convert dones into a boolean mask (recent PyTorch versions deprecate uint8 masks for indexing)
    done_mask = torch.tensor(dones, dtype = torch.bool).to(device)

    """
    First, pass the observation to the first model. Then extract the Q-values for taken actions
    - Argument 1: Dimension index to perform gathering. Here, we use 1 since it's actions
    - Argument 2: A tensor of element indices to be chosen.

    Note 1: unsqueeze() and squeeze() are used to fulfill the requirements of the gather functions and to get rid of 
    extra dimensions that we created (index should have the same dimensions as the data we're processing). 

    Note 2: See below for checking out the exact illustration.
    """
    state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    
    # Pass forward the next state in the target network + Calculate the maximum Q-value along action dimension 1
    # + Get the first value (since max() returns max and argmax)
    next_state_values = tgt_net(next_states_v).max(1)[0]
    
    # If transition in the batch is from the last step in the episode, then our value of the action doesn't have a 
    # discounted reward of the next state, as there is no next state to gather reward from
    next_state_values[done_mask] = 0.0
    
    """
    Detach values in order to prevent gradients from flowing into the neural network (for Q approximation calculation
    in next states).

    Without this our backpropagation of the loss will affect the predictions for the current state and the next state. 
    However, we don't want to touch predictions for the next state, as they're used in the Bellman equation to calculate 
    reference Q-values. 

    To block gradients from flowing into this branch of the graph, we're using the detach() method of the tensor, which 
    returns the tensor without connection to its calculation history. 
    """
    next_state_values = next_state_values.detach()

    # Calculate the Bellman approximation value
    expected_state_action_values = next_state_values * GAMMA + rewards_v

    # Calculate the mean squared error loss
    loss = nn.MSELoss()(state_action_values, expected_state_action_values)

    return loss
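
The target computed in `calc_loss` can be worked through by hand for a tiny batch. The sketch below is plain Python rather than PyTorch, and all numbers are made up: `next_state_values` plays the role of the maximum Q-values returned by the target network.

```python
GAMMA = 0.99

# Hypothetical batch of three transitions; the last one ends its episode.
rewards = [0.0, 1.0, -1.0]
dones = [False, False, True]

# Made-up max Q-values of the next states, as the target network would return them.
next_state_values = [2.0, 0.5, 3.0]

targets = []
for reward, done, next_value in zip(rewards, dones, next_state_values):
    # Terminal transitions have no next state, so their bootstrap term is zero.
    bootstrap = 0.0 if done else next_value
    targets.append(reward + GAMMA * bootstrap)

print(targets)
```

For the terminal transition the target is just the reward, which is exactly what the `done_mask` assignment in `calc_loss` achieves for the whole batch at once.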

In the following image, you can see an illustration of what gather does on the example case, with a batch of six entries and four actions.

<img width="500px" src="./assets/fig3transform.png">

Keep in mind that gather() is a differentiable operation, so gradients of the final loss value flow back through it.
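
For readers who prefer code to pictures, the selection that `gather` performs along dimension 1 can be mimicked in plain Python. This is only a sketch of the indexing semantics, not of PyTorch itself: the helper `gather_rows` and the Q-values are assumptions for illustration.

```python
# Q-values for a batch of three states and four actions (made-up numbers).
q_values = [
    [0.1, 0.2, 0.3, 0.4],
    [1.0, 2.0, 3.0, 4.0],
    [9.0, 8.0, 7.0, 6.0],
]

# Index of the action that was actually taken in each state.
actions = [2, 0, 3]


def gather_rows(values, indices):
    """Pick values[i][indices[i]] for every row i, like gather along dimension 1."""
    return [row[idx] for row, idx in zip(values, indices)]


print(gather_rows(q_values, actions))
```

Each row contributes exactly one value, the Q-value of the action taken in that state, so the result has one entry per batch element.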

In [None]:
# Execute the main program
if __name__ == "__main__":
    
    # Set the device type (CPU or GPU)
    device = torch.device(DEVICE_TYPE)
    
    # Create the environment
    env = wrappers.make_env(DEFAULT_ENV_NAME)
    
    # Initialize the network
    net = dqn_model.DQN(env.observation_space.shape, env.action_space.n).to(device)
    
    # Initialize the target network
    tgt_net = dqn_model.DQN(env.observation_space.shape, env.action_space.n).to(device)
    
    # Initialize the writer for TensorBoard
    writer = SummaryWriter(comment="-" + DEFAULT_ENV_NAME)

    # Initialize the experience replay buffer of the required size
    buffer = ExperienceBuffer(REPLAY_SIZE)
    
    # Initialize the agent and pass the env and buffer
    agent = Agent(env, buffer)
    
    # Set the epsilon
    epsilon = EPSILON_START

    # Initialize the Adam optimizer
    optimizer = optim.Adam(net.parameters(), lr = LEARNING_RATE)
    
    # Initialize the total reward
    total_rewards = []
    
    # Initialize the frame index counter
    frame_idx = 0
    
    # Initialize ts_frame to track our speed
    ts_frame = 0
    
    # Start the time - t(s) or time(second)
    ts = time.time()
    
    # Initialize the best mean reward so whenever the mean reward beats the record, we'll save the model
    best_mean_reward = None
    
    # Infinite loop
    while True:
        
        # Increment the frame index
        frame_idx += 1
        
        # Decrease epsilon linearly over the given number of frames (EPSILON_DECAY_LAST_FRAME=100k), 
        # then keep it at the level of EPSILON_FINAL=0.02
        epsilon = max(EPSILON_FINAL, 
                      EPSILON_START - frame_idx / EPSILON_DECAY_LAST_FRAME)
        
        # Perform one step and get the final reward
        reward = agent.play_step(net, epsilon, device = device)
        
        # If there is reward (since this function returns a non-None result if this is the final step in the episode)
        if reward is not None:
            
            # Append the reward into total_rewards
            total_rewards.append(reward)
            
            # Calculate the speed (as the count of frames processed per second)
            speed = (frame_idx - ts_frame) / (time.time() - ts)
            
            # Update the ts_frame
            ts_frame = frame_idx
            
            # Update the current time
            ts = time.time()
            
            # Get the mean of rewards for the last 100 episodes
            mean_reward = np.mean(total_rewards[-100:])
            
            # Report the progress
            print("%d: done %d games, mean reward %.3f, eps %.2f, speed %.2f f/s" % (frame_idx, 
                                                                                     len(total_rewards), 
                                                                                     mean_reward, 
                                                                                     epsilon, 
                                                                                     speed))
            
            # Add values to TensorBoard
            writer.add_scalar("epsilon", epsilon, frame_idx)
            writer.add_scalar("speed", speed, frame_idx)
            writer.add_scalar("reward_100", mean_reward, frame_idx)
            writer.add_scalar("reward", reward, frame_idx)
            
            # If the best mean reward is none OR it's less than mean_reward
            if (best_mean_reward is None) or (best_mean_reward < mean_reward):
                
                # Save the network
                torch.save(net.state_dict(), DEFAULT_ENV_NAME + "-best.dat")
                
                # If the best mean reward is non-None
                if best_mean_reward is not None:
                    
                    # Report the update for best mean reward 
                    print("Best mean reward updated %.3f -> %.3f, model saved" % (best_mean_reward, mean_reward))
                    
                # Update the best mean reward to mean_reward
                best_mean_reward = mean_reward
                
            # If mean reward exceeds the boundary (For Pong, the boundary is 19.5, which means winning more than 19 games from 21 possible games.)
            if mean_reward > MEAN_REWARD_BOUND:
                
                # Print solved
                print("Solved in %d frames!" % frame_idx)
                
                # break the loop
                break
                

        # Check whether our buffer is large enough for training (in our case, it's 10k transitions)
        if len(buffer) < REPLAY_START_SIZE:
            
            # Go to the beginning of loop
            continue
            

        # Every SYNC_TARGET_FRAMES (which is 1k by default)
        if frame_idx % SYNC_TARGET_FRAMES == 0:
            
            # Sync parameters from the main network to the target net
            tgt_net.load_state_dict(net.state_dict())
            

        # Zero out the gradients of optimizer
        optimizer.zero_grad()
        
        # Sample data batches from the experience replay buffer
        batch = buffer.sample(BATCH_SIZE)
        
        # Calculate the loss
        loss_t = calc_loss(batch, net, tgt_net, device = device)
        
        # Do the backward propagation
        loss_t.backward()
        
        # Perform the optimization step to minimize the loss
        optimizer.step()
        
    # Close the writer
    writer.close()

<br>

### 04.4. Running and Performance

---

This example is demanding on resources. On Pong, it requires about 400k frames to reach a mean reward of 17 (which means winning more than 80% of games). A similar number of frames will be required to get from 17 to 19.5, as our learning progress saturates and it's hard for the model to improve the score. 

So, on average, a million frames are needed to train it fully. On the GTX 1080 Ti, I have a speed of about 150 frames per second, which is about two hours of training. On a CPU, the speed is much slower: about nine frames per second, which will take about a day and a half. Remember that this is for Pong, which is relatively easy to solve. Other games require hundreds of millions of frames and a 100 times larger experience replay buffer.
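
The training-time estimates above follow from simple arithmetic, sketched below. The frame budget and the two speeds are the figures quoted in this section; the function name `training_hours` is an assumption for illustration, and a constant speed is assumed throughout.

```python
def training_hours(frames, frames_per_second):
    """Wall-clock hours needed to process `frames` at a constant speed."""
    return frames / frames_per_second / 3600


FRAMES = 1_000_000  # rough frame budget quoted above

gpu_hours = training_hours(FRAMES, 150)  # GPU speed quoted above
cpu_hours = training_hours(FRAMES, 9)    # CPU speed quoted above

print(round(gpu_hours, 1), round(cpu_hours, 1))
```

At 150 frames per second a million frames take just under two hours, while at nine frames per second they take roughly 31 hours, which is the same order as the CPU estimate given above.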

In the next chapter, we'll look at various approaches, found by researchers since 2015, which can help to increase both training speed and data efficiency. Nevertheless, for Atari you'll need resources and patience. The following image shows a TensorBoard screenshot with training dynamics:

<img width="800px" src="assets/characteristic_x.png">

<br>

At the beginning of training:
                    
    1048: done 1 games, mean reward -19.000, eps 0.99, speed 83.45 f/s
    1894: done 2 games, mean reward -20.000, eps 0.98, speed 913.37 f/s
    2928: done 3 games, mean reward -20.000, eps 0.97, speed 932.16 f/s
    3810: done 4 games, mean reward -20.250, eps 0.96, speed 923.60 f/s
    4632: done 5 games, mean reward -20.400, eps 0.95, speed 921.52 f/s
    5454: done 6 games, mean reward -20.500, eps 0.95, speed 918.04 f/s
    6379: done 7 games, mean reward -20.429, eps 0.94, speed 906.64 f/s
    7409: done 8 games, mean reward -20.500, eps 0.93, speed 903.51 f/s
    8259: done 9 games, mean reward -20.556, eps 0.92, speed 905.94 f/s
    9395: done 10 games, mean reward -20.500, eps 0.91, speed 898.05 f/s
    10204: done 11 games, mean reward -20.545, eps 0.90, speed 374.76 f/s
    10995: done 12 games, mean reward -20.583, eps 0.89, speed 160.55 f/s
    11887: done 13 games, mean reward -20.538, eps 0.88, speed 160.44 f/s
    12949: done 14 games, mean reward -20.571, eps 0.87, speed 160.67 f/s


<br>

Around 100k frames later (80 to 90 games in this run), our DQN should start to figure out how to win one or two games out of 21. The speed has decreased due to the epsilon drop: we now need to use our model not only for training but also for the environment step.

    101032: done 83 games, mean reward -19.506, eps 0.02, speed 143.06 f/s
    103349: done 84 games, mean reward -19.488, eps 0.02, speed 142.99 f/s
    106444: done 85 games, mean reward -19.424, eps 0.02, speed 143.15 f/s
    108359: done 86 games, mean reward -19.395, eps 0.02, speed 143.18 f/s
    110499: done 87 games, mean reward -19.379, eps 0.02, speed 143.01 f/s
    113011: done 88 games, mean reward -19.352, eps 0.02, speed 142.98 f/s
    115404: done 89 games, mean reward -19.326, eps 0.02, speed 143.07 f/s
    117821: done 90 games, mean reward -19.300, eps 0.02, speed 143.03 f/s
    121060: done 91 games, mean reward -19.220, eps 0.02, speed 143.10 f/s

<br>

After many more games, it can finally dominate and beat the (not very sophisticated) built-in Pong AI opponent:

    982059: done 520 games, mean reward 19.500, eps 0.02, speed 145.14 f/s
    984268: done 521 games, mean reward 19.420, eps 0.02, speed 145.39 f/s
    986078: done 522 games, mean reward 19.440, eps 0.02, speed 145.24 f/s
    987717: done 523 games, mean reward 19.460, eps 0.02, speed 145.06 f/s
    989356: done 524 games, mean reward 19.470, eps 0.02, speed 145.07 f/s
    991063: done 525 games, mean reward 19.510, eps 0.02, speed 145.31 f/s
    Best mean reward updated 19.500 -> 19.510, model saved
    Solved in 991063 frames!
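
The last lines of the log reflect the stopping rule of the training loop: the mean over the most recent 100 episodes has to be strictly greater than the bound. A minimal sketch of that check follows; the function name `is_solved` and both reward histories are made up for illustration and are not taken from the run above.

```python
MEAN_REWARD_BOUND = 19.5


def is_solved(total_rewards, window=100):
    """True once the mean of the last `window` episode rewards exceeds the bound."""
    recent = total_rewards[-window:]
    return sum(recent) / len(recent) > MEAN_REWARD_BOUND


# Made-up reward histories of 100 episodes each.
exactly_at_bound = [19.0, 20.0] * 50                  # mean is exactly 19.5
just_above_bound = [19.0, 20.0] * 49 + [20.0, 20.0]   # mean is 19.51

print(is_solved(exactly_at_bound), is_solved(just_above_bound))
```

A mean of exactly 19.5 does not end training; one more good episode that pushes the window above the bound does.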

<br>

### 04.5. Your Model in Action

---

Just to make your waiting a bit more fun, our code saves the best model's weights. In the Chapter06/03_dqn_play.py file, we have a program which can load this model file and play one episode, displaying the model's dynamics.

In [22]:
# Import the libraries
import gym
import time
import argparse
import numpy as np
import torch
from dqn_lib import wrappers
from dqn_lib import dqn_model
import collections

In [None]:
# Hyperparameters
DEFAULT_ENV_NAME = "PongNoFrameskip-v4"
FPS = 25                # FPS (frame-per-second) parameter specifies the approximate speed of the shown frames.
record = "./record"     # Directory to store video recording
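visualize = True        # Set to False to disable visualization of the game play
model = DEFAULT_ENV_NAME + "-best.dat"   # Model file to load (the name under which the training loop saves the best weights)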

In [None]:
# Execute the program
if __name__ == "__main__":
    
    # Create the environment
    env = wrappers.make_env(DEFAULT_ENV_NAME)
    
    # If record
    if record:
        
        # Record the agent playing
        env = gym.wrappers.Monitor(env, record)
        
    # Initialize the network
    net = dqn_model.DQN(env.observation_space.shape, env.action_space.n)
    
    # Load the weights
    net.load_state_dict(torch.load(model, map_location = lambda storage, loc: storage))

    # Reset the state and get the initial observation
    state = env.reset()
    
    # Initialize the total reward
    total_reward = 0.0
    
    # Initialize the counter
    c = collections.Counter()

    # Infinite loop
    while True:
        
        # Start the timer
        start_ts = time.time()
        
        # If visualize
        if visualize:
            
            # Render the environment
            env.render()
            
        # Convert state into a tensor
        state_v = torch.tensor(np.asarray([state]))
        
        # Get the Q-Values
        # Pass forward the state into the network + Get the data + Convert it into an array + Get the first dimension
        q_vals = net(state_v).data.numpy()[0]
        
        # Get the index of the maximum Q-value, which is the action to take
        action = np.argmax(q_vals)
        
        # Increment the action count
        c[action] += 1
        
        # Take the given action and get the (state, reward, is_done, info)
        state, reward, done, _ = env.step(action)
        
        # Add reward into total_reward
        total_reward += reward
        
        # If terminal state
        if done:
            
            # Break the loop
            break
        
        # If visualize
        if visualize:
            
            # Calculate the delta
            delta = (1 / FPS) - (time.time() - start_ts)
            
            # If delta is positive
            if delta > 0:
                
                # Delay execution for delta seconds
                time.sleep(delta)
                
    # Print the total reward
    print("Total reward: %.2f" % total_reward)
    
    # Print the action counts
    print("Action counts:", c)
    
    # If record
    if record:
        
        # Close the environment wrapped by the Monitor to release its resources
        env.env.close()

<br>

# 05. Summary

---

In this chapter, we introduced lots of new and complex material. We became familiar with the limitations of value iteration in complex environments with large observation spaces and discussed how to overcome them with Q-learning. We checked the Q-learning algorithm on the FrozenLake environment and discussed the approximation of Q-values with neural networks and the extra complications that arise from this approximation. We covered several tricks for DQNs to improve their training stability and convergence, such as an experience replay buffer, target networks, and frame stacking. Finally, we combined those extensions into one single implementation of DQN that solves the Pong environment from the Atari games suite.

___THE END___