<!--
authors: Matthew Wilson, Daniele Reda
created: 2020/01/14
last_updated: 2023/02/08
-->


## CPSC 533V: Assignment 2 - Tabular Q Learning and DQN (TBD)

---

#  Part 1 [54 pts] Tabular Q-Learning 

Tabular Q-learning is an RL algorithm for problems with discrete states and discrete actions. The algorithm is described in the class notes, which borrow the summary description from [Section 6.5](http://incompleteideas.net/book/RLbook2018.pdf#page=153) of Richard Sutton's RL book. In the tabular approach, the Q-value is represented as a lookup table. As discussed in class, Q-learning can further be extended to continuous states and discrete actions, leading to the [Atari DQN](https://arxiv.org/abs/1312.5602) / Deep Q-learning algorithm.  However, it is important and informative to first fully understand tabular Q-learning.

Informally, Q-learning works as follows: The goal is to learn the optimal Q-function: 
`Q(s,a)`, which is the *value* of being at state `s` and taking action `a`.  Q tells you how well you expect to do, on average, from here on out, given that you act optimally.  Once the Q function is learned, choosing an optimal action is as simple as looping over all possible actions and choosing the one with the highest Q (optimal action $a^* = \text{argmax}_a Q(s,a)$).  To learn Q, we initialize it arbitrarily and then iteratively refine it using the Bellman backup equation for Q functions, namely: 
$Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s,a)]$.
Here, $r$ is the reward associated with the transition from state $s$ to $s'$, $\gamma$ is the discount factor, and $\alpha$ is a learning rate.
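
As a concrete illustration of the update rule, the sketch below applies a single backup to a toy table. The table size, the transition, and the values of `alpha` and `gamma` are made-up numbers chosen for illustration; they are not part of the assignment.

```python
import numpy as np

# Hypothetical toy problem: 3 states, 2 actions, Q initialized to zero.
Q = np.zeros((3, 2))
Q[1] = [0.5, 2.0]                 # pretend we already have estimates for state 1

alpha, gamma = 0.1, 0.99          # illustrative learning rate and discount factor
s, a, r, s_next = 0, 1, 1.0, 1    # one observed transition (s, a, r, s')

# TD target: reward plus the discounted value of the best next action.
td_target = r + gamma * np.max(Q[s_next])
# Move Q(s, a) a fraction alpha of the way toward the target.
Q[s, a] += alpha * (td_target - Q[s, a])

print(Q[s, a])
```

Here the target is $1 + 0.99 \cdot 2 = 2.98$, and the entry moves one tenth of the way from 0 toward it.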

In the first part of the assignment you will implement tabular Q-learning and apply it to CartPole -- an environment with a **continuous** state space.  To apply the tabular method, you will need to discretize the CartPole state space by dividing the state space into bins.


**Goals:**
- to become familiar with python/numpy, as well as using an OpenAI Gym environment
- to understand tabular Q-learning, by implementing tabular Q-Learning for 
  a discretized version of a continuous-state environment, and experimenting with the implementation
- (optional) to develop further intuition regarding possible variations of the algorithm

## Introduction
Deep reinforcement learning has generated impressive results for board games ([Go][go], [Chess/Shogi][chess]), video games ([Atari][atari], [DOTA2][dota], [StarCraft II][scii]), [and][baoding] [robotic][rubix] [control][anymal] ([of][cassie] [course][mimic] ;)).  RL is beginning to work for an increasing range of tasks and capabilities.  At the same time, there are many [gaping holes][irpan] and [difficulties][amid] in applying these methods. Understanding deep RL is important if you wish to have a good grasp of the modern landscape of control methods.

These next several assignments are designed to get you started with deep reinforcement learning, to give you a more close and personal understanding of the methods, and to provide you with a good starting point from which you can branch out into topics of interest. You will implement basic versions of some of the important fundamental algorithms in this space, including Q-learning and policy gradient/search methods.

We will only have time to cover a subset of methods and ideas in this space.
If you want to dig deeper, we suggest following the links given on the course webpage.  Additionally we draw special attention to the [Sutton book](http://incompleteideas.net/book/RLbook2018.pdf) for RL fundamentals and in depth coverage, and OpenAI's [Spinning Up resources](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html) for a concise intro to RL and deep RL concepts, as well as good comparisons and implementations of modern deep RL algorithms.


[atari]: https://arxiv.org/abs/1312.5602
[go]: https://deepmind.com/research/case-studies/alphago-the-story-so-far
[chess]:https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go 
[dota]: https://openai.com/blog/openai-five/
[scii]: https://deepmind.com/blog/article/AlphaStar-Grandmaster-level-in-StarCraft-II-using-multi-agent-reinforcement-learning
[baoding]: https://bair.berkeley.edu/blog/2019/09/30/deep-dynamics/
[rubix]: https://openai.com/blog/solving-rubiks-cube/
[cassie]: https://www.cs.ubc.ca/~van/papers/2019-CORL-cassie/index.html
[mimic]: https://www.cs.ubc.ca/~van/papers/2018-TOG-deepMimic/index.html
[anymal]: https://arxiv.org/abs/1901.08652


[irpan]: https://www.alexirpan.com/2018/02/14/rl-hard.html
[amid]: http://amid.fish/reproducing-deep-rl



In [1]:
# # uncomment if necessary
!pip install numpy
# !pip install gym
# # OR:
!pip install gymnasium
import time
import itertools
import numpy as np
# import gym
import gymnasium as gym


---

## [12 pts] Explore the CartPole environment 

Your first task is to familiarize yourself with the OpenAI gym interface and the [CartPole environment](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/cartpole.py)
by writing a simple hand-coded policy to try to solve it.  
To begin understanding OpenAI Gym environments, [read this first](https://gymnasium.farama.org/api/env/).
The gym interface is very popular and you will see many algorithm implementations and 
custom environments that support it.  You may even want to use the API in your course projects, 
to define a custom environment for a task you want to solve.

Note that several breaking changes have been introduced to the gym API in the past few years. Some reference algorithm implementations online might still use the old version:
- `obs = env.reset()` -> `obs, info = env.reset()`
- `obs, reward, done, info = env.step(action)` -> `obs, reward, terminated, truncated, info = env.step(action)`
- `env.render()` no longer accepts a render mode argument; instead, `render_mode` is passed to `gym.make` (e.g. `"human"`, where the environment is rendered in a pop-up window, or `"rgb_array"`, which allows headless conversion to images or videos)


Below is some example code that runs a simple random policy.  You are to:
- **run the code to see what it does**
- **write code that chooses an action based on the observation**.  You will need to learn about the gym API and to read the CartPole documentation to figure out what the `action` and `obs` vectors mean for this environment. 
Your hand-coded policy can be arbitrary, and it should ideally do better than the random policy.  There is no single correct answer. The goal is to become familiar with `env`s.
- **write code to print out the total reward gained by your policy in a single episode run**
- **answer the short-response questions below** (see the TODOs for all of this)

In [2]:
env = gym.make('CartPole-v1', render_mode="rgb_array")  # you can also try LunarLander-v2, but make sure to change it back
print('observation space:', env.observation_space)
print('action space:', env.action_space)

# To find out what the observations mean, read the CartPole documentation.
# Uncomment the lines below, or visit the source file: 
# https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/cartpole.py

#cartpole = env.unwrapped
#cartpole?

observation space: Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
action space: Discrete(2)


In [3]:
# 1.1 [10pts]

# runs a single episode and render it.  try running this before editing anything
obs, info = env.reset()  # get first obs/state
total_reward = 0
while True:
    # TODO: replace this `action` with something that depends on `obs` 
    # action = env.action_space.sample()  # random action
    action = 0 if obs[2] < 0 else 1
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    env.render()
    time.sleep(0.1)  # so it doesn't render too quickly
    if terminated or truncated: break
env.close()

# TODO: print out your total sum of rewards here
print(f"Total sum of rewards: {total_reward}")

Total sum of rewards: 47.0


To answer the questions below, look at the full [source code here](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/cartpole.py) if you haven't already.

**1.2. [2pts] Briefly describe your policy.  What observation information does it use?  What score did you achieve (rough maximum and average)?  And how does it compare to the random policy?**

**Answer**: My policy uses the pole angle: if the pole is tilting left, push the cart to the left, and vice versa. The total reward is approximately 31-42. This is slightly higher and more stable than the random policy, which has a total reward of approximately 9-46.

---

##  [12 pts] Discretize the env

Next, we need to discretize CartPole's continuous state space to work for tabular Q-learning.  While this is in part a contrived usage of tabular methods, given the existence of other approaches that are designed to cope with continuous state spaces, it is also interesting to consider whether tabular methods can be adapted more directly via discretization of the state into bins. Furthermore, tabular methods are simple, interpretable, and can be proved to converge, and thus they remain relevant.

Your task is to discretize the state/observation space so that it is compatible with tabular Q-learning.  To do this:
- **implement `obs_normalizer` to pass its test**
- **implement `get_bins` to pass its test**
- **then answer question 2.3**


In [4]:
env = gym.make('CartPole-v1')

In [5]:
env.observation_space.low

array([-4.8000002e+00, -3.4028235e+38, -4.1887903e-01, -3.4028235e+38],
      dtype=float32)

In [6]:
# 2.1 [5 pts for passing test_normed]
def obs_normalizer(obs):
    """Normalize the observations between 0 and 1
    
    If the observation has extremely large bounds, then clip to a reasonable range before normalizing; 
    (-2,2) should work.  (It is ok if the solution is specific to CartPole)
    
    Args:
        obs (np.ndarray): shape (4,) containing an observation from CartPole using the bound of the env
    Returns:
        normed (np.ndarray): shape (4,) where all elements are roughly uniformly mapped to the range [0, 1]
    
    """
    # HINT: check out env.observation_space.high, env.observation_space.low
    
    # TODO: implement this function
    # Copy the bounds so the env's observation_space arrays are not modified in place.
    high = env.observation_space.high.copy()
    low = env.observation_space.low.copy()

    # The velocity bounds are effectively infinite; replace them with a reasonable range.
    high[1] = 2
    high[3] = 2
    low[1] = -2
    low[3] = -2

    clipped_obs = np.clip(obs, low, high)
    normed = (clipped_obs - low) / (high - low)
    return normed

In [7]:
### TEST 2.1
def test_normed():
    obs, info = env.reset()
    while True:
        obs, _, terminated, truncated, _ =  env.step(env.action_space.sample())
        normed = obs_normalizer(obs) 
        assert np.all(normed >= 0.0) and np.all(normed <= 1.0), '{} are outside of (0,1)'.format(normed)
        if terminated or truncated: break
    env.close()
    print('Passed!')
test_normed()

Passed!


In [8]:
# 2.2 [5 pts for passing test_binned]
def get_bins(normed, num_bins):
    """Map normalized observations (0,1) to bin index values (0,num_bins-1)
    
    Args:
        normed (np.ndarray): shape (4,) output from obs_normalizer
        num_bins (int): how many bins to use
    Returns:
        binned (np.ndarray of type np.int32): shape (4,) where all elements are values in range [0,num_bins-1]
    
    """
    scaled = normed * (num_bins-1)
    binned = np.floor(scaled).astype(np.int32)    

    return binned  
    # TODO: implement this function
    # raise NotImplementedError('TODO')
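
A note on the binning choice: scaling by `num_bins - 1` and flooring, as above, places only the value exactly 1.0 in the last bin. An alternative, sketched here under the hypothetical name `get_bins_equal_width`, scales by `num_bins` and clips, so that each bin covers an equal slice of [0, 1]. It is not the function used to produce the results in this notebook.

```python
import numpy as np

def get_bins_equal_width(normed, num_bins):
    """Hypothetical variant: num_bins equal-width bins over [0, 1]."""
    # Scale to [0, num_bins] and floor to get a candidate bin index.
    idx = np.floor(np.asarray(normed) * num_bins).astype(np.int32)
    # Clip so that the value 1.0 falls into the last bin instead of overflowing.
    return np.clip(idx, 0, num_bins - 1).astype(np.int32)

print(get_bins_equal_width(np.array([0.0, 0.5, 1.0]), 10))
```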

In [9]:
### TEST 2.2
obs, info = env.reset()

def test_binned(num_bins):
    normed = np.array([0.0, 0.2, 0.8, 1.0])
    binned = get_bins(normed, num_bins)
    assert np.all(binned >= 0) and np.all(binned < num_bins), '{} supposed to be between (0, {})'.format(binned, num_bins-1)
    assert binned.dtype == np.int32, "You should also make sure to cast your answer to int using arr.astype(np.int32)" 
    
test_binned(5)
test_binned(10)
test_binned(50)
print('Passed!')

Passed!


**2.3. [2 pts] If your state has 4 values and each is binned into N possible bins, how many bins are needed to represent all unique possible states?**

**Answer**: $N^4$ bins are needed. 
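
To make the growth concrete, the short sketch below counts the discrete states and Q-table entries for the setting used in this notebook (10 bins per dimension, 4 state dimensions, 2 actions). The variable names are chosen here for illustration.

```python
import numpy as np

num_bins, state_dims, num_actions = 10, 4, 2

num_states = num_bins ** state_dims   # N^4 discrete states
q_table = np.zeros([num_bins] * state_dims + [num_actions])

print(num_states)     # number of discrete states
print(q_table.size)   # number of Q-table entries (states x actions)
```

With $N = 10$ this gives $10^4 = 10{,}000$ states and $20{,}000$ table entries; doubling $N$ multiplies both by 16.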

---

## [20 pts] Solve the env 

Using the pseudocode below and the functions you implemented above, implement tabular Q-learning and use it to solve CartPole.

We provide setup code to initialize the Q-table and give examples of interfacing with it. Write the inner and outer loops to train your algorithm.  These training loops will be similar to those of deep RL approaches, so get used to writing them!

The algorithm (excerpted from Section 6.5 of [Sutton's book](http://incompleteideas.net/book/RLbook2018.pdf)) is given below:

![Sutton RL](https://i.imgur.com/mdcWVRL.png)

In summary:
- **implement Q-learning using this pseudocode and the helper code**
- **answer the questions below**
- **run the suggested experiments and otherwise experiment with whatever interests you**

In [10]:
# setup (see last few lines for how to use the Q-table)

# hyper parameters. feel free to change these as desired and experiment with different values
num_bins = 10
alpha = 0.1
gamma = 0.99
log_n = 1000
# epsilon greedy
eps = 0.05  #usage: action = optimal if np.random.rand() > eps else random

obs, info = env.reset()

# Q-table initialized to zeros.  first 4 dims are state, last dim is for action (0,1) for left,right.
Q = np.zeros([num_bins]*len(obs)+[env.action_space.n])

# helper function to convert observation into a binned state so we can index into our Q-table
obs2bin = lambda obs: tuple(get_bins(obs_normalizer(obs), num_bins=num_bins))

s = obs2bin(obs)

print('Shape of Q Table: ', Q.shape) # you can imagine why tabular learning does not scale very well
print('Original obs {} --> binned {}'.format(obs, s))
print('Value of Q Table at that obs/state value', Q[s])

Shape of Q Table:  (10, 10, 10, 10, 2)
Original obs [ 0.0222021  -0.03588101  0.00441333  0.04630226] --> binned (np.int32(4), np.int32(4), np.int32(4), np.int32(4))
Value of Q Table at that obs/state value [0. 0.]


In [11]:
# 3.1 [20 pts]

# TODO: implement Q learning, following the pseudo-code above. 
#     - you can follow it almost exactly, but translating things for the gym api and our code used above
#     - make sure to use e-greedy, where a random action is taken with probability e = 0.05 (about 5 percent of the time)
#     - make sure to do the S <-- S' step because it can be easy to forget
#     - every log_n steps, you should render your environment and
#       print out the average total episode rewards of the past log_n runs to monitor how your agent trains
#      (your implementation should be able to break at least +150 average reward value, and you can use that 
#       as a breaking condition.  It may take several minutes to run depending on your computer.)

In [42]:
def Q_learning(env, Q, num_episode, log_n, num_bins, alpha, gamma, eps):
    # helper to convert a raw observation into a tuple of bin indices
    obs2bin = lambda obs: tuple(get_bins(obs_normalizer(obs), num_bins=num_bins))
    rewards_per_episode = []
    for episode in range(num_episode):
        obs, info = env.reset()
        total_reward = 0
        done = False

        s = obs2bin(obs) # discretize

        while not done:
            # Epsilon-greedy
            if np.random.rand() > eps:
                a = np.argmax(Q[s])
            else:
                a = env.action_space.sample()

            # take action, observe R and S'
            obs_new, reward, terminated, truncated, info = env.step(a)
            done = terminated or truncated

            s_new = obs2bin(obs_new)

            # Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            max_future_q = np.max(Q[s_new])
            current_q = Q[s + (a,)]
            new_q = current_q + alpha * (reward + gamma * max_future_q - current_q)
            Q[s + (a,)] = new_q

            # update state and reward
            s = s_new
            total_reward += reward

        rewards_per_episode.append(total_reward)

        # Logging every log_n episodes
        if (episode+1) % log_n == 0:
            avg_reward_log = np.mean(rewards_per_episode[-log_n:])
            print(f"Episode {episode + 1}/{num_episode}, Average Reward: {avg_reward_log}")
            # Break if average reward >= 150
            if avg_reward_log >= 150:
                print(f"Breaking Criterion Reached: Average reward reached {avg_reward_log} after {episode + 1} episodes.")
                break


    avg_reward = np.mean(rewards_per_episode)
    # Report the number of episodes actually run (training may stop early).
    print(f"Average Reward after {len(rewards_per_episode)} episodes: {avg_reward}")

    env.close()

In [41]:
Q_learning(env, Q = np.zeros([num_bins]*len(obs)+[env.action_space.n]), num_episode=30000, log_n=1000, num_bins = 10, alpha=0.1, gamma=0.99, eps=0.05)

Episode 1000/30000, Average Reward: 12.123
Episode 2000/30000, Average Reward: 31.068
Episode 3000/30000, Average Reward: 59.122
Episode 4000/30000, Average Reward: 99.834
Episode 5000/30000, Average Reward: 136.916
Episode 6000/30000, Average Reward: 174.617
Breaking Criterion Reached: Average reward reached 174.617 after 6000 episodes.
Average Reward after 6000 episodes: 85.61333333333333
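
One detail of the Sutton pseudocode is worth highlighting: the value of a terminal state is defined to be zero, so the bootstrap term $\gamma \max_{a'} Q(s', a')$ is usually dropped when the episode terminates. The loop above bootstraps on every transition. The sketch below shows a masked target as a standalone helper; the name `td_target` and the toy Q-table are assumptions for illustration, and this variant was not used for the results reported here.

```python
import numpy as np

def td_target(Q, s_next, reward, gamma, terminated):
    """Hypothetical helper: Q-learning target that does not bootstrap past a terminal state."""
    if terminated:
        return reward                          # Q(terminal, .) is defined as 0
    return reward + gamma * np.max(Q[s_next])  # usual bootstrapped target

# Toy table with 2 states and 2 actions, for illustration only.
Q = np.array([[0.0, 0.0], [3.0, 5.0]])
print(td_target(Q, s_next=1, reward=1.0, gamma=0.99, terminated=False))
print(td_target(Q, s_next=1, reward=1.0, gamma=0.99, terminated=True))
```

Only `terminated` is masked in this sketch; an episode that is merely `truncated` by the time limit has not reached a true terminal state, so its target would still bootstrap.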


## [10 pts] Experiments

Given a working algorithm, you will run a few experiments.  Either make a copy of your code above to modify, or make the modifications in a way that they can be commented out or switched between (with boolean flag if statements).

**4.2. [5 pts] $\epsilon$-greedy.**  How sensitive are the results to the value of $\epsilon$?   First, write down your prediction of what would happen if $\epsilon$ is set to various values, including for example [0, 0.05, 0.25, 0.5].

**Answer**: The results should be sensitive to $\epsilon$. With $\epsilon = 0$ the agent only exploits and should achieve a low total reward due to the lack of exploration. $\epsilon = 0.05$ should reach a higher maximum reward than $\epsilon = 0$, and $\epsilon = 0.25$ should reach higher rewards than $\epsilon = 0.05$. However, $\epsilon = 0.5$ may over-explore and be unable to exploit even a near-optimal Q-table, and should therefore take longer to reach high rewards than $\epsilon = 0.05$ and $\epsilon = 0.25$.

Now run the experiment and observe the impact on the algorithm.  Report the results below.

In [13]:
# explore epsilons
Q_learning(env, Q = np.zeros([num_bins]*len(obs)+[env.action_space.n]), num_episode=30000, log_n=1000, num_bins = 10, alpha=0.1, gamma=0.99, eps=0)
Q_learning(env, Q = np.zeros([num_bins]*len(obs)+[env.action_space.n]), num_episode=30000, log_n=1000, num_bins = 10, alpha=0.1, gamma=0.99, eps=0.25)
Q_learning(env, Q = np.zeros([num_bins]*len(obs)+[env.action_space.n]), num_episode=30000, log_n=1000, num_bins = 10, alpha=0.1, gamma=0.99, eps=0.5)

Episode 1000/30000, Average Reward: 9.364
Episode 2000/30000, Average Reward: 9.353
Episode 3000/30000, Average Reward: 9.362
Episode 4000/30000, Average Reward: 9.354
Episode 5000/30000, Average Reward: 9.375
Episode 6000/30000, Average Reward: 9.352
Episode 7000/30000, Average Reward: 9.377
Episode 8000/30000, Average Reward: 9.386
Episode 9000/30000, Average Reward: 9.361
Episode 10000/30000, Average Reward: 9.357
Episode 11000/30000, Average Reward: 9.372
Episode 12000/30000, Average Reward: 9.346
Episode 13000/30000, Average Reward: 9.386
Episode 14000/30000, Average Reward: 9.376
Episode 15000/30000, Average Reward: 9.395
Episode 16000/30000, Average Reward: 9.333
Episode 17000/30000, Average Reward: 9.38
Episode 18000/30000, Average Reward: 9.377
Episode 19000/30000, Average Reward: 9.303
Episode 20000/30000, Average Reward: 9.394
Episode 21000/30000, Average Reward: 9.373
Episode 22000/30000, Average Reward: 9.339
Episode 23000/30000, Average Reward: 9.337
Episode 24000/30000, 

**Answer**: The average reward after 10000 episodes is highest with $\epsilon = 0.25$, all else being equal, matching the prediction above. $\epsilon = 0$ fails to train, and $\epsilon = 0.5$ improves slowest, consistent with the predicted over-exploration.

**4.3. [5 pts] Design your own experiment.** Design a modification that you think would either increase or reduce performance.  A simple example (which you can use) is initializing the Q-table differently, and thinking about how this might alter performance. Write down your idea, what you think might happen, and why.

**Answer**: I plan to initialize the Q-table to a value higher than 0. This should improve exploration at the beginning of training, since unexplored state-action pairs start with optimistic value estimates and therefore look more attractive to the greedy policy.

Run the experiment and report the results.

In [18]:
# explore Q table initialization
Q_learning(env, Q = np.ones([num_bins]*len(obs)+[env.action_space.n]), num_bins = 10, num_episode=30000, log_n=1000, alpha=0.1, gamma=0.99, eps=0.25)
Q_learning(env, Q = np.full([num_bins]*len(obs)+[env.action_space.n], 2), num_bins = 10, num_episode=30000, log_n=1000, alpha=0.1, gamma=0.99, eps=0.25)
Q_learning(env, Q = np.full([num_bins]*len(obs)+[env.action_space.n], 5), num_bins = 10, num_episode=30000, log_n=1000, alpha=0.1, gamma=0.99, eps=0.25)

Episode 1000/30000, Average Reward: 13.898
Episode 2000/30000, Average Reward: 55.903
Episode 3000/30000, Average Reward: 92.274
Episode 4000/30000, Average Reward: 100.075
Episode 5000/30000, Average Reward: 101.244
Episode 6000/30000, Average Reward: 109.923
Episode 7000/30000, Average Reward: 107.458
Episode 8000/30000, Average Reward: 106.927
Episode 9000/30000, Average Reward: 111.902
Episode 10000/30000, Average Reward: 110.294
Episode 11000/30000, Average Reward: 115.405
Episode 12000/30000, Average Reward: 128.007
Episode 13000/30000, Average Reward: 120.954
Episode 14000/30000, Average Reward: 136.476
Episode 15000/30000, Average Reward: 126.011
Episode 16000/30000, Average Reward: 125.913
Episode 17000/30000, Average Reward: 124.253
Episode 18000/30000, Average Reward: 123.835
Episode 19000/30000, Average Reward: 137.963
Episode 20000/30000, Average Reward: 141.518
Episode 21000/30000, Average Reward: 128.065
Episode 22000/30000, Average Reward: 121.603
Episode 23000/30000, A

**Answer**: Initializing the Q-table to ones has a similar effect to initializing it to zeros, and learns slightly faster in some runs. However, higher initial values (2 and 5) potentially overestimate rewards and fail to learn.
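
When interpreting the runs initialized with `np.full(..., 2)` and `np.full(..., 5)`, it may be worth checking the array dtype. `np.full` with an integer fill value produces an integer array, and assigning a float into an integer array truncates it, so small Q-updates can be silently discarded. The sketch below illustrates this on a toy array; whether it accounts for the behaviour observed above has not been verified here.

```python
import numpy as np

alpha, gamma, reward = 0.1, 0.99, 1.0

q_int = np.full((2,), 2)      # integer fill value -> integer dtype
q_float = np.full((2,), 2.0)  # float fill value -> float64

for q in (q_int, q_float):
    # Apply the same Q-learning update to entry 0 of each array.
    q[0] = q[0] + alpha * (reward + gamma * np.max(q) - q[0])

print(q_int.dtype, q_int[0])      # the update is truncated back to an integer
print(q_float.dtype, q_float[0])  # the update is kept
```

Passing a float fill value (for example `2.0`) or an explicit `dtype=float` keeps the table in floating point.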

---

## A. Extensions (fully optional, will not be graded, if you have time after Part 2)

- plot your learning curve, using e.g., matplotlib 
- visualize the Q-table to see which values are being updated and not
- design a better binning strategy that uses fewer bins for a better-performing policy
- extend this approach to work on different environments (e.g., LunarLander-v2)
- extend this approach to work on environments with continuous actions, by using a fixed set of discrete samples of the action space.  e.g., for Pendulum-v0
- implement a simple deep learning version of this.  We will see in the next part that DQN uses some tricks to make the neural network training more stable.  Experiment directly with simply replacing the Q-table with a Q-Network and train the Q-Network using gradient descent with `loss = (targets - Q(s,a))**2`, where `targets = stop_grad(R + gamma * max_a'(Q(s',a')))`.
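
For the continuous-action extension, one simple option is a fixed grid of action values, with the Q-table's last axis indexing into that grid. The sketch below assumes a one-dimensional action bounded in [-2, 2] (the torque range documented for Gymnasium's Pendulum) and five samples; the grid size and the helper name `index_to_action` are illustrative choices.

```python
import numpy as np

# Assumed bounds of a 1-D continuous action (e.g. Pendulum torque) and grid size.
action_low, action_high, num_actions = -2.0, 2.0, 5
action_grid = np.linspace(action_low, action_high, num_actions)

def index_to_action(a_idx):
    """Map a discrete action index (the Q-table's last axis) to a continuous action vector."""
    return np.array([action_grid[a_idx]], dtype=np.float32)

# In training, a_idx would come from np.argmax(Q[s]); here we just enumerate the grid.
for a_idx in range(num_actions):
    print(a_idx, index_to_action(a_idx))
```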

# Part 2 [60 pts] Behavioral Cloning and Deep Q Learning

---
The second part of the assignment will help you transition from tabular approaches to deep neural network approaches. You will implement the [Atari DQN / Deep Q-Learning](https://arxiv.org/abs/1312.5602) algorithm, which arguably kicked off the modern Deep Reinforcement Learning craze.

In this part we will use PyTorch as our deep learning framework.  To familiarize yourself with PyTorch, your first task is to use a behavior cloning (BC) approach to learn a policy.  Behavior cloning is a supervised learning method in which there exists a dataset of expert demonstrations (state-action pairs) and the goal is to learn a policy $\pi$ that mimics this expert.  At any given state, your policy should choose the same action the expert would.

Since BC avoids the need to collect data from the policy you are trying to learn, it is relatively simple. 
This makes it a nice stepping stone for implementing DQN. Furthermore, BC is relevant to modern approaches: for example, it is used as an initialization for systems like [AlphaGo][go] and [AlphaStar][star], which then use RL to further adapt the BC result.  

<!--

I feel like this might be better suited to going lower in the document:

Unfortunately, in many tasks it is impossible to collect good expert demonstrations, making

it's not always possible to have good expert demonstrations for a task in an environemnt and this is where reinforcement learning comes handy. Through the reward signal retrieved by interacting with the environment, the agent learns by itself what is a good policy and can learn to outperform the experts.

-->

Goals:
- Familiarize yourself with PyTorch and its API including models, datasets, dataloaders
- Implement a supervised learning approach (behavioral cloning) to learn a policy.
- Implement the DQN objective and learn a policy through environment interaction.

[go]:  https://deepmind.com/research/case-studies/alphago-the-story-so-far
[star]: https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii

## Submission information

- Complete by editing and executing the associated Python files.
- Copy and paste the code and the terminal output requested in the predefined cells on this Jupyter notebook.
- When done, upload the completed Jupyter notebook (ipynb file) on Canvas.

## Preliminaries

### PyTorch

If you have never used PyTorch before, we recommend you follow the [60 Minute Blitz][blitz] tutorial from the official website. It should give you enough context to be able to complete the assignment.


**If you have issues, post questions to Piazza**

### Installation

To install all required python packages:

```
python3 -m pip install -r requirements.txt
```

### Debugging


You can include:  `import ipdb; ipdb.set_trace()` in your code and it will drop you to that point in the code, where you can interact with variables and test out expressions.  We recommend this as an effective method to debug the algorithms.


[blitz]: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html

## 1. [36 pts] Behavioral Cloning

Behavioral Cloning is a type of supervised learning in which you are given a dataset of expert demonstration tuples $(s, a)$ and the goal is to learn a policy function $\hat a = \pi(s)$, such that $\hat a = a$.

The optimization objective is $\min_\theta D(\pi(s), a)$ where $\theta$ are the parameters of the policy $\pi$, in our case the weights of a neural network, and where $D$ represents some difference between the actions.
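
For a discrete action space such as CartPole's, one common choice for $D$ is the cross-entropy between the policy's output distribution and the expert's action. The NumPy sketch below computes it for a small batch; the logits and expert actions are made-up numbers, and in the assignment itself a PyTorch loss function would play this role.

```python
import numpy as np

def cross_entropy(logits, actions):
    """Mean cross-entropy between softmax(logits) and integer expert actions."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to the expert's action in each row.
    return -log_probs[np.arange(len(actions)), actions].mean()

# Hypothetical policy outputs for 3 states (2 actions each) and the expert's choices.
logits = np.array([[2.0, -1.0], [0.0, 0.0], [-3.0, 1.0]])
expert_actions = np.array([0, 1, 1])

print(cross_entropy(logits, expert_actions))
```

The loss shrinks as the policy puts more probability on the expert's action, which is the sense in which minimizing $D$ makes $\hat a$ match $a$.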

---

Before starting, we suggest reading through the provided files.

For Behavioral Cloning, the important files to understand are: `model.py`, `dataset.py` and `bc.py`.

- The file `model.py` has the skeleton for the model (which you will have to complete in the following questions),

- The file `dataset.py` has the skeleton for the dataset the model is being trained with,

- and, `bc.py` will have all the structure for training the model with the dataset.


### [10 pts] 1.1 Dataset

We provide a pickle file with pre-collected expert demonstrations on CartPole from which to learn the policy $\pi$. The data has been collected from an expert policy on the environment, with the addition of a small amount of Gaussian noise to the actions.

The pickle file contains a list of tuples of states and actions in `numpy` in the following way:

```
[(state s, action a), (state s, action a), (state s, action a), ...]
```

In the `dataset.py` file, we provide skeleton code for creating a custom dataset. The provided code shows how to load the file.

Your goal is to override the `__getitem__` function in order to return a dictionary of tensors of the correct type.

Hint: Look in the `bc.py` file to understand how the dataset is used.

Answer the following questions:

**[6 pts]** Insert your code in the placeholder below.

In [57]:
import torch
import pickle
from dataset import Dataset

In [58]:
# PLACEHOLDER TO INSERT YOUR __getitem__ method here

# def __getitem__(self, index):
#     item = self.data[index]
#     # TODO YOUR CODE HERE
#     raise NotImplementedError()

### Answer ###
# def __getitem__(self, index):
#     item = self.data[index]
#     state_np, action_np = item
#     state = torch.tensor(state_np, dtype=torch.float32)
#     action = torch.tensor(action_np, dtype=torch.long)
#     return {'state': state, 'action': action} 

In [59]:
dataset = Dataset(data_path="CartPole-v1_dataset.pkl")
dataset_size = len(dataset)
print(f"Dataset size: {dataset_size}")

Dataset size: 99660


In [60]:
sample = dataset[0]
state = sample['state']
state_dim = state.shape[0]
print(f"The state has {state_dim} dimensions.")

The state has 4 dimensions.


In [61]:
all_states = torch.stack([dataset[i]['state'] for i in range(len(dataset))])
state_min = torch.min(all_states, dim=0)[0]
state_max = torch.max(all_states, dim=0)[0]
print(f"max = {state_max}")
print(f"min = {state_min}")

max = tensor([2.3995, 1.8470, 0.1464, 0.4714])
min = tensor([-0.7227, -0.4330, -0.0501, -0.3812])


In [62]:
all_actions = torch.tensor([dataset[i]['action'] for i in range(len(dataset))])
actions_unique = torch.unique(all_actions)
print(f"Action has dimension {all_actions.shape}.")
print(f"Unique actions in dataset: {actions_unique.tolist()}")

Action has dimension torch.Size([99660]).
Unique actions in dataset: [0, 1]


**[2 pts]** How big is the dataset provided?

**Answer:** The dataset contains 99660 samples.

**[2 pts]** What is the dimensionality of $s$ and what range does each dimension of $s$ span?  I.e., how much of the state space does the expert data cover? What are the dimensionalities and ranges of the action $a$ in the dataset (how much of the action space does the expert data cover)?

**Answer:** $s$ has 4 dimensions: dim0 ranges from -0.7227 to 2.3995, dim1 ranges from -0.4330 to 1.8470, dim2 ranges from -0.0501 to 0.1464, and dim3 ranges from -0.3812 to 0.4714. $a$ is a scalar and is either 0 or 1. 


### [5 pts] 1.2 Environment

Recall the state and action space of CartPole, from the previous assignment.

Considering the full state and action spaces, do you think the provided expert dataset has good coverage?  Why or why not? How might this impact the performance of our cloned policy?

**Answer:** No. The provided data covers only a narrow portion (around the center) of the full range. In cases where the state is outside of this range, the cloned policy may not perform well because it lacks corresponding data. 
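
To put rough numbers on this, the sketch below compares the span of the expert data (the min/max values printed earlier) with CartPole's observation limits. Only cart position and pole angle have finite bounds, so the two velocity dimensions are skipped; the bounds of ±4.8 and ±0.418 rad are rounded from the observation space printed earlier in this notebook.

```python
import numpy as np

# Expert data extremes for (cart position, pole angle), copied from the output above.
data_min = np.array([-0.7227, -0.0501])
data_max = np.array([2.3995, 0.1464])

# Observation-space bounds for the same two dimensions.
env_low = np.array([-4.8, -0.418])
env_high = np.array([4.8, 0.418])

# Fraction of each bounded range that the expert data spans.
coverage = (data_max - data_min) / (env_high - env_low)
for name, frac in zip(["cart position", "pole angle"], coverage):
    print(f"{name}: {frac:.1%} of the bounded range")
```

By this measure the data spans roughly a third of the position range and under a quarter of the angle range.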

### [14 pts] 1.3 Model

The file `model.py` provides skeleton code for the model. Your goal is to create the architecture of the network by adding layers that map the input to output.

You will need to update the `__init__` method and the `forward` method.

The `select_action` method has already been written for you.  This should be used when running the policy in the environment, while the `forward` function should be used at training time.

- **[10 pts]** Insert your code in the placeholder below.

In [63]:
# PLACEHOLDER TO INSERT YOUR MyModel class here

# class MyModel(nn.Module):
#     def __init__(self, state_size, action_size):
#         super(MyModel, self).__init__()
#         # TODO YOUR CODE HERE FOR INITIALIZING THE MODEL

#     def forward(self, x):
#         # TODO YOUR CODE HERE FOR THE FORWARD PASS
#         raise NotImplementedError()

#     def select_action(self, state):
#         self.eval()
#         x = self.forward(state)
#         self.train()
#         return x.max(1)[1].view(1, 1).to(torch.long)


# ###Answer###
# class MyModel(nn.Module):
#     def __init__(self, state_size, action_size):
#         super().__init__()
#         self.fc1 = nn.Linear(state_size, 32) 
#         self.fc2 = nn.Linear(32, 32)
#         self.fc3 = nn.Linear(32, action_size)

#     def forward(self, x):
#         x = F.relu(self.fc1(x))
#         x = F.relu(self.fc2(x))
#         x = self.fc3(x)
#         return x

#     def select_action(self, state):
#         self.eval()
#         x = self.forward(state) 
#         self.train()
#         return x.max(1)[1].view(1, 1).to(torch.long)

Answer the following questions:

- **[2 pts]** What is the dimension and meaning of the input of the network?

**Answer:** The input has 4 dimensions, one per state component: cart position, cart velocity, pole angle, and pole angular velocity.

- **[2 pts]** Similarly, describe the output.

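**Answer:** The output has 2 dimensions, one per action (0 for pushing the cart to the left, 1 for pushing it to the right). Each entry is an unnormalized score (logit) for that action; applying a softmax turns the scores into the probability of taking each action, which `nn.CrossEntropyLoss` does internally during training.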
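
The relationship between the two network outputs and action probabilities can be sketched in plain Python. This is an illustration under assumptions, not the code used in `model.py` or `bc.py`: the helper names `softmax` and `cross_entropy` and the example scores are hypothetical, and the snippet only mirrors what a softmax followed by a negative log-likelihood computes for a single state.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, expert_action):
    """Negative log-probability assigned to the expert's action."""
    return -math.log(softmax(logits)[expert_action])


logits = [0.3, 1.7]  # hypothetical network output for one state
probs = softmax(logits)
greedy_action = max(range(len(logits)), key=lambda a: logits[a])

print(probs)          # two probabilities, one per action
print(greedy_action)  # index of the largest score
print(cross_entropy(logits, expert_action=1))
```

Because the softmax preserves the ordering of the scores, taking the argmax of the raw outputs (as `select_action` does) picks the same action as taking the argmax of the probabilities.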


### [7 pts] 1.4 Training

The file `bc.py` is the entry point for training your behavioral cloning model. The skeleton and the main components are already there.

The missing parts for you to do are:

- Initializing the model
- Choosing a loss function
- Choosing an optimizer
- Playing with hyperparameters to train your model.

- **[5 pts]** Insert your code in the placeholder below.

In [64]:
# PLACEHOLDER FOR YOUR CODE HERE
# HOW DID YOU INITIALIZE YOUR MODEL, OPTIMIZER AND LOSS FUNCTIONS? PASTE HERE YOUR FINAL CODE
# NOTE: YOU CAN KEEP THE FOLLOWING LINES COMMENTED OUT, AS RUNNING THIS CELL WILL PROBABLY RESULT IN ERRORS

# model = None
# optimizer = None
# loss_function = None

###Answer###
# state_size = env.observation_space.shape[0]  
# action_size = env.action_space.n 
# model = MyModel(state_size, action_size).to(device)

# optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)  
# loss_function = nn.CrossEntropyLoss()  

You can run your code by doing:

```
python3 bc.py
```

**During all of this assignment, the code in `eval_policy.py` will be your best friend.** At any time, you can test your model by passing the path to the model weights and the environment name as arguments, using the following command:

```
python3 eval_policy.py --model-path /path/to/model/weights --env ENV_NAME
```

In [65]:
## PASTE YOUR TERMINAL OUTPUT HERE
# NOTE: TO HAVE FEWER LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 or 10

# [epoch    1/100] [iter       0] [loss 0.70570]
# [Test on environment] [epoch 2/100] [score 306.90]
# [Test on environment] [epoch 4/100] [score 248.80]
# [Test on environment] [epoch 6/100] [score 260.50]
# [epoch    7/100] [iter   10000] [loss 0.01669]
# [Test on environment] [epoch 8/100] [score 262.70]
# [Test on environment] [epoch 10/100] [score 248.90]
# [Test on environment] [epoch 12/100] [score 222.70]
# [epoch   13/100] [iter   20000] [loss 0.00198]
# [Test on environment] [epoch 14/100] [score 271.40]
# [Test on environment] [epoch 16/100] [score 240.80]
# [Test on environment] [epoch 18/100] [score 230.40]
# [epoch   20/100] [iter   30000] [loss 0.00440]
# [Test on environment] [epoch 20/100] [score 283.10]
# [Test on environment] [epoch 22/100] [score 249.60]
# [Test on environment] [epoch 24/100] [score 259.50]
# [epoch   26/100] [iter   40000] [loss 0.00377]
# [Test on environment] [epoch 26/100] [score 251.80]
# [Test on environment] [epoch 28/100] [score 235.70]
# [Test on environment] [epoch 30/100] [score 225.20]
# [Test on environment] [epoch 32/100] [score 287.80]
# [epoch   33/100] [iter   50000] [loss 0.01454]
# [Test on environment] [epoch 34/100] [score 257.30]
# [Test on environment] [epoch 36/100] [score 253.90]
# [Test on environment] [epoch 38/100] [score 242.10]
# [epoch   39/100] [iter   60000] [loss 0.00070]
# [Test on environment] [epoch 40/100] [score 234.90]
# [Test on environment] [epoch 42/100] [score 237.00]
# [Test on environment] [epoch 44/100] [score 253.80]
# [epoch   45/100] [iter   70000] [loss 0.00645]
# [Test on environment] [epoch 46/100] [score 290.80]
# [Test on environment] [epoch 48/100] [score 247.60]
# [Test on environment] [epoch 50/100] [score 257.00]
# [epoch   52/100] [iter   80000] [loss 0.00000]
# [Test on environment] [epoch 52/100] [score 262.00]
# [Test on environment] [epoch 54/100] [score 276.90]
# [Test on environment] [epoch 56/100] [score 250.90]
# [epoch   58/100] [iter   90000] [loss 0.00122]
# [Test on environment] [epoch 58/100] [score 243.00]
# [Test on environment] [epoch 60/100] [score 235.90]
# [Test on environment] [epoch 62/100] [score 266.10]
# [Test on environment] [epoch 64/100] [score 249.90]
# [epoch   65/100] [iter  100000] [loss 0.00000]
# [Test on environment] [epoch 66/100] [score 253.70]
# [Test on environment] [epoch 68/100] [score 268.20]
# [Test on environment] [epoch 70/100] [score 269.80]
# [epoch   71/100] [iter  110000] [loss 0.00079]
# [Test on environment] [epoch 72/100] [score 276.80]
# [Test on environment] [epoch 74/100] [score 232.60]
# [Test on environment] [epoch 76/100] [score 258.60]
# [epoch   78/100] [iter  120000] [loss 0.00002]
# [Test on environment] [epoch 78/100] [score 259.10]
# [Test on environment] [epoch 80/100] [score 243.60]
# [Test on environment] [epoch 82/100] [score 265.50]
# [epoch   84/100] [iter  130000] [loss 0.01828]
# [Test on environment] [epoch 84/100] [score 238.30]
# [Test on environment] [epoch 86/100] [score 246.30]
# [Test on environment] [epoch 88/100] [score 250.30]
# [epoch   90/100] [iter  140000] [loss 0.11484]
# [Test on environment] [epoch 90/100] [score 268.30]
# [Test on environment] [epoch 92/100] [score 248.90]
# [Test on environment] [epoch 94/100] [score 226.10]
# [Test on environment] [epoch 96/100] [score 274.00]
# [epoch   97/100] [iter  150000] [loss 0.00001]
# [Test on environment] [epoch 98/100] [score 279.90]
# [Test on environment] [epoch 100/100] [score 271.10]
# Saving model as behavioral_cloning_CartPole-v1.pt

**[2 pts]** Did you manage to learn a good policy? How consistent is the reward you are getting?

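**Answer:** Yes. The policy is reasonably good and fairly consistent: the test score stays between roughly 220 and 310 throughout training, averaging around 250. This is still well below the maximum CartPole-v1 return of 500.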
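
As a rough check of this consistency claim, the sketch below summarizes the first ten `[Test on environment]` scores from the log above (epochs 2 to 20). Only those ten values are used, so the numbers describe that subset rather than the full run; the variable names are chosen for illustration.

```python
import statistics

# First ten "[Test on environment]" scores from the log above (epochs 2 to 20).
scores = [306.90, 248.80, 260.50, 262.70, 248.90,
          222.70, 271.40, 240.80, 230.40, 283.10]

mean = statistics.mean(scores)
spread = statistics.stdev(scores)  # sample standard deviation
print(f"mean {mean:.1f}, std {spread:.1f}, min {min(scores):.1f}, max {max(scores):.1f}")
```

A standard deviation that is small relative to the mean supports calling the policy consistent; repeating the computation over all logged scores would give a more reliable picture.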

## 2. [24 pts] Deep Q Learning

There are two main issues with the behavior cloning approach.

- First, we are not always lucky enough to have access to a dataset of expert demonstrations.
- Second, replicating an expert policy suffers from compounding error. The policy $\pi$ only sees these "perfect" examples and has no knowledge of how to recover from states not visited by the expert. For this reason, as soon as it is presented with a state that is off the expert trajectory, it will perform poorly and will continue to deviate from a good trajectory without the possibility of recovering from errors.
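
The compounding-error argument can be illustrated with a deliberately simplified calculation. It assumes that the cloned policy makes a mistake independently with a fixed probability at every step and never recovers once it has left the expert trajectory; both the 1% error rate and the function name `prob_on_expert_path` are hypothetical choices for this sketch.

```python
def prob_on_expert_path(per_step_error, horizon):
    """Chance of making no mistake for `horizon` consecutive steps."""
    return (1.0 - per_step_error) ** horizon


for horizon in (10, 100, 500):
    p = prob_on_expert_path(0.01, horizon)
    print(f"{horizon:3d} steps: stays on the expert trajectory with probability {p:.3f}")
```

Even a small per-step error rate makes it unlikely that a long episode stays entirely within the states the expert demonstrated, which is why the ability to recover matters.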

---
The second task consists of solving the environment from scratch using RL, and more specifically the DQN algorithm, to learn a policy $\pi$.

For this task, familiarize yourself with the file `dqn.py`. We are going to re-use the file `model.py` for the model you created in the previous task.

Your task is very similar to the one in the previous assignment: implementing the Q-learning algorithm. In this version, however, the Q-function is approximated with a neural network.

The algorithm (excerpted from the [Atari DQN paper](https://arxiv.org/abs/1312.5602)) is given below:

![DQN algorithm](https://i.imgur.com/Mh4Uxta.png)

### 2.0 [2 pts] Think about your model...



In DQN, we are using the same model as in task 1 for behavioral cloning. In both tasks the model receives as input the state and in both tasks the model outputs something that has the same dimensionality as the number of actions. These two outputs, though, represent very different things. What is each one representing?

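**Answer:** The behavioral cloning model outputs one score (logit) per action, which after a softmax gives the probability of taking that action, as the expert would. The DQN model outputs one Q-value per action: the expected cumulative discounted reward of taking that action in the current state and acting greedily afterwards.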

### 2.1 [10 pts] Update your Q-function

Complete the `optimize_model` function. This function receives as input a `state`, an `action`, the `next_state`, the `reward`, and `done`, representing the tuple $(s_t, a_t, s_{t+1}, r_t, done_t)$. Your task is to update your Q-function as shown in the [Atari DQN paper](https://arxiv.org/abs/1312.5602). For now, don't be concerned with the experience replay buffer; we'll get to that later.

![Loss function](https://i.imgur.com/tpTsV8m.png)

Insert your code in the placeholder below.

In [66]:
## PLACEHOLDER TO INSERT YOUR optimize_model function here:

# def optimize_model(state, action, next_state, reward, done):
#     # TODO given a tuple (s_t, a_t, s_{t+1}, r_t, done_t) update your model weights

#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()

###Answer###
# def optimize_model(state, action, next_state, reward, done):
#     # tensor conversion
#     state_tensor = torch.tensor([state], dtype=torch.float32).to(device)
#     action_tensor = torch.tensor([[action]], dtype=torch.long).to(device)
#     reward_tensor = torch.tensor([reward], dtype=torch.float32).to(device)
#     next_state_tensor = torch.tensor([next_state], dtype=torch.float32).to(device)
#     done_tensor = torch.tensor([done], dtype=torch.float32).to(device)

#     # get q
#     q_values = model(state_tensor)
#     state_action_value = q_values.gather(1, action_tensor).squeeze(1)  # current Q(s, a), shape (1,) to match the target

#     with torch.no_grad(): #target q
#         next_q_values = target(next_state_tensor)
#         next_state_value = next_q_values.max(1)[0]
#         next_state_value = next_state_value * (1 - done_tensor)
#         expected_state_action_value = reward_tensor + GAMMA * next_state_value

#     loss = F.mse_loss(state_action_value, expected_state_action_value)

#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
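
The target used in this update can be traced by hand for a single transition. The sketch below is plain Python with hypothetical numbers; `GAMMA = 0.99` is an assumed discount factor (the value in `dqn.py` may differ), and `td_target` and `squared_td_error` are illustrative names rather than functions from the assignment code.

```python
GAMMA = 0.99  # assumed discount factor; the value in dqn.py may differ


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Target y = r for terminal transitions, else r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(q_sa, target):
    """Squared difference that the network is trained to minimize."""
    return (target - q_sa) ** 2


# Hypothetical numbers for a single transition.
q_sa = 9.5                    # current estimate Q(s, a)
next_q_values = [10.0, 12.0]  # Q(s', left), Q(s', right)

y_running = td_target(1.0, next_q_values, done=False)
y_terminal = td_target(1.0, next_q_values, done=True)
print(y_running, squared_td_error(q_sa, y_running))
print(y_terminal, squared_td_error(q_sa, y_terminal))
```

The terminal case shows why the `done` flag matters: without it, the target would keep bootstrapping from a next state that is never actually visited.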

### 2.2 [5 pts] $\epsilon$-greedy strategy

You will need a strategy to explore your environment. The standard strategy is to use $\epsilon$-greedy. Implement it in the `choose_action` function template.

Insert your code in the placeholder below.

In [67]:
## PLACEHOLDER TO INSERT YOUR choose_action function here:

# def choose_action(state, test_mode=False):
#     # TODO implement an epsilon-greedy strategy
#     raise NotImplementedError()

###Answer###
# def choose_action(state, test_mode=False):
#     epsilon = EPS_EXPLORATION if not test_mode else 0
#     state_tensor = torch.tensor([state], dtype=torch.float32).to(device)
#     if random.random() < epsilon:
#         action = torch.tensor([[random.randrange(n_actions)]], dtype=torch.long).to(device)
#     else:
#         with torch.no_grad():
#             q_values = model(state_tensor)
#             action = torch.argmax(q_values, dim=1).unsqueeze(0)
#     return action
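
The behaviour of an $\epsilon$-greedy rule can be checked in isolation with a small standard-library sketch. The function `epsilon_greedy`, the example Q-values, and the seed are assumptions made for this illustration; it operates on a plain list of Q-values instead of calling the network.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)  # seeded generator so the run is repeatable
q_values = [0.2, 1.5]   # hypothetical Q-values for one state

greedy = [epsilon_greedy(q_values, 0.0, rng) for _ in range(1000)]
explore = [epsilon_greedy(q_values, 0.5, rng) for _ in range(1000)]
print(sum(a == 1 for a in greedy) / 1000)   # share of greedy picks with epsilon = 0
print(sum(a == 1 for a in explore) / 1000)  # share with epsilon = 0.5
```

With $\epsilon = 0$ every choice is greedy, which matches the `test_mode` branch in the answer above. With $\epsilon = 0.5$ and two actions, the greedy action is expected to be chosen about 75% of the time, since half of the random draws also land on it.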

### 2.3 [2 pts] Train your model

Try to train a model in this way.

You can run your code by doing:

```
python3 dqn.py
```

How many episodes does it take to learn (i.e., reach a good reward)?

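**Answer:** It took about 1650 episodes to first reach a good reward (test average of 151.4). Training remains unstable afterwards: the test average drops back below 50 several times before reaching 500.0 for the first time at episode 2625.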
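
One way to make "reaching a good reward" precise is to pick a threshold and look for the first test evaluation that exceeds it. The sketch below does this for a few `[TEST Episode ...]` entries copied from the training log in this section; the thresholds of 150 and 400 and the name `first_episode_above` are assumptions chosen for illustration.

```python
# (episode, average test reward) pairs copied from the training log in this section.
test_log = [
    (1600, 66.4), (1625, 94.7), (1650, 151.4), (1675, 127.8),
    (1700, 43.8), (2000, 448.8), (2625, 500.0),
]


def first_episode_above(log, threshold):
    """Return the first episode whose test reward exceeds threshold, or None."""
    for episode, reward in log:
        if reward > threshold:
            return episode
    return None


print(first_episode_above(test_log, 150))  # threshold assumed for illustration
print(first_episode_above(test_log, 400))
```

The answer depends on the chosen threshold, so it is worth stating the threshold alongside the episode count when reporting how long learning took.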

In [68]:
# ## PASTE YOUR TERMINAL OUTPUT HERE
# # NOTE: TO HAVE FEWER LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 or 10
# [Episode   10/4000] [Steps   10] [reward 11.0]
# [Episode   20/4000] [Steps   12] [reward 13.0]
# ----------
# saving model.
# [TEST Episode 25] [Average Reward 9.5]
# ----------
# [Episode   30/4000] [Steps   10] [reward 11.0]
# [Episode   40/4000] [Steps    8] [reward 9.0]
# [Episode   50/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 50] [Average Reward 9.4]
# ----------
# [Episode   60/4000] [Steps   11] [reward 12.0]
# [Episode   70/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 75] [Average Reward 9.5]
# ----------
# [Episode   80/4000] [Steps    9] [reward 10.0]
# [Episode   90/4000] [Steps    9] [reward 10.0]
# [Episode  100/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 100] [Average Reward 9.4]
# ----------
# [Episode  110/4000] [Steps    9] [reward 10.0]
# [Episode  120/4000] [Steps   11] [reward 12.0]
# ----------
# saving model.
# [TEST Episode 125] [Average Reward 9.9]
# ----------
# [Episode  130/4000] [Steps   10] [reward 11.0]
# [Episode  140/4000] [Steps   11] [reward 12.0]
# [Episode  150/4000] [Steps    9] [reward 10.0]
# ----------
# saving model.
# [TEST Episode 150] [Average Reward 10.9]
# ----------
# [Episode  160/4000] [Steps    8] [reward 9.0]
# [Episode  170/4000] [Steps   13] [reward 14.0]
# ----------
# saving model.
# [TEST Episode 175] [Average Reward 15.0]
# ----------
# [Episode  180/4000] [Steps   32] [reward 33.0]
# [Episode  190/4000] [Steps   21] [reward 22.0]
# [Episode  200/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 200] [Average Reward 11.1]
# ----------
# [Episode  210/4000] [Steps   10] [reward 11.0]
# [Episode  220/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 225] [Average Reward 14.7]
# ----------
# [Episode  230/4000] [Steps    7] [reward 8.0]
# [Episode  240/4000] [Steps    9] [reward 10.0]
# [Episode  250/4000] [Steps   10] [reward 11.0]
# ----------
# [TEST Episode 250] [Average Reward 9.9]
# ----------
# [Episode  260/4000] [Steps    9] [reward 10.0]
# [Episode  270/4000] [Steps    7] [reward 8.0]
# ----------
# [TEST Episode 275] [Average Reward 10.1]
# ----------
# [Episode  280/4000] [Steps   10] [reward 11.0]
# [Episode  290/4000] [Steps    9] [reward 10.0]
# [Episode  300/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 300] [Average Reward 10.2]
# ----------
# [Episode  310/4000] [Steps   10] [reward 11.0]
# [Episode  320/4000] [Steps   14] [reward 15.0]
# ----------
# saving model.
# [TEST Episode 325] [Average Reward 15.3]
# ----------
# [Episode  330/4000] [Steps   42] [reward 43.0]
# [Episode  340/4000] [Steps   27] [reward 28.0]
# [Episode  350/4000] [Steps   11] [reward 12.0]
# ----------
# saving model.
# [TEST Episode 350] [Average Reward 33.2]
# ----------
# [Episode  360/4000] [Steps   13] [reward 14.0]
# [Episode  370/4000] [Steps   40] [reward 41.0]
# ----------
# [TEST Episode 375] [Average Reward 15.2]
# ----------
# [Episode  380/4000] [Steps   11] [reward 12.0]
# [Episode  390/4000] [Steps   15] [reward 16.0]
# [Episode  400/4000] [Steps   18] [reward 19.0]
# ----------
# [TEST Episode 400] [Average Reward 16.5]
# ----------
# [Episode  410/4000] [Steps   16] [reward 17.0]
# [Episode  420/4000] [Steps   17] [reward 18.0]
# ----------
# [TEST Episode 425] [Average Reward 27.8]
# ----------
# [Episode  430/4000] [Steps   13] [reward 14.0]
# [Episode  440/4000] [Steps   20] [reward 21.0]
# [Episode  450/4000] [Steps   80] [reward 81.0]
# ----------
# [TEST Episode 450] [Average Reward 18.4]
# ----------
# [Episode  460/4000] [Steps   18] [reward 19.0]
# [Episode  470/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 475] [Average Reward 25.4]
# ----------
# [Episode  480/4000] [Steps   17] [reward 18.0]
# [Episode  490/4000] [Steps   33] [reward 34.0]
# [Episode  500/4000] [Steps   19] [reward 20.0]
# ----------
# [TEST Episode 500] [Average Reward 14.6]
# ----------
# [Episode  510/4000] [Steps    7] [reward 8.0]
# [Episode  520/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 525] [Average Reward 9.5]
# ----------
# [Episode  530/4000] [Steps   11] [reward 12.0]
# [Episode  540/4000] [Steps   11] [reward 12.0]
# [Episode  550/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 550] [Average Reward 10.2]
# ----------
# [Episode  560/4000] [Steps   12] [reward 13.0]
# [Episode  570/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 575] [Average Reward 9.3]
# ----------
# [Episode  580/4000] [Steps   13] [reward 14.0]
# [Episode  590/4000] [Steps    8] [reward 9.0]
# [Episode  600/4000] [Steps   34] [reward 35.0]
# ----------
# [TEST Episode 600] [Average Reward 28.1]
# ----------
# [Episode  610/4000] [Steps    8] [reward 9.0]
# [Episode  620/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 625] [Average Reward 31.8]
# ----------
# [Episode  630/4000] [Steps   21] [reward 22.0]
# [Episode  640/4000] [Steps   31] [reward 32.0]
# [Episode  650/4000] [Steps   24] [reward 25.0]
# ----------
# [TEST Episode 650] [Average Reward 17.2]
# ----------
# [Episode  660/4000] [Steps    7] [reward 8.0]
# [Episode  670/4000] [Steps   17] [reward 18.0]
# ----------
# [TEST Episode 675] [Average Reward 9.2]
# ----------
# [Episode  680/4000] [Steps    8] [reward 9.0]
# [Episode  690/4000] [Steps   10] [reward 11.0]
# [Episode  700/4000] [Steps   12] [reward 13.0]
# ----------
# [TEST Episode 700] [Average Reward 9.4]
# ----------
# [Episode  710/4000] [Steps    9] [reward 10.0]
# [Episode  720/4000] [Steps   16] [reward 17.0]
# ----------
# [TEST Episode 725] [Average Reward 16.5]
# ----------
# [Episode  730/4000] [Steps   10] [reward 11.0]
# [Episode  740/4000] [Steps    9] [reward 10.0]
# [Episode  750/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 750] [Average Reward 15.0]
# ----------
# [Episode  760/4000] [Steps   10] [reward 11.0]
# [Episode  770/4000] [Steps   16] [reward 17.0]
# ----------
# [TEST Episode 775] [Average Reward 9.3]
# ----------
# [Episode  780/4000] [Steps    9] [reward 10.0]
# [Episode  790/4000] [Steps    9] [reward 10.0]
# [Episode  800/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 800] [Average Reward 8.8]
# ----------
# [Episode  810/4000] [Steps   20] [reward 21.0]
# [Episode  820/4000] [Steps   12] [reward 13.0]
# ----------
# [TEST Episode 825] [Average Reward 25.1]
# ----------
# [Episode  830/4000] [Steps   29] [reward 30.0]
# [Episode  840/4000] [Steps   11] [reward 12.0]
# [Episode  850/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 850] [Average Reward 10.0]
# ----------
# [Episode  860/4000] [Steps   18] [reward 19.0]
# [Episode  870/4000] [Steps   20] [reward 21.0]
# ----------
# [TEST Episode 875] [Average Reward 14.4]
# ----------
# [Episode  880/4000] [Steps   83] [reward 84.0]
# [Episode  890/4000] [Steps    8] [reward 9.0]
# [Episode  900/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 900] [Average Reward 9.9]
# ----------
# [Episode  910/4000] [Steps   11] [reward 12.0]
# [Episode  920/4000] [Steps   85] [reward 86.0]
# ----------
# [TEST Episode 925] [Average Reward 15.2]
# ----------
# [Episode  930/4000] [Steps   18] [reward 19.0]
# [Episode  940/4000] [Steps   10] [reward 11.0]
# [Episode  950/4000] [Steps   62] [reward 63.0]
# ----------
# [TEST Episode 950] [Average Reward 19.7]
# ----------
# [Episode  960/4000] [Steps   14] [reward 15.0]
# [Episode  970/4000] [Steps   79] [reward 80.0]
# ----------
# [TEST Episode 975] [Average Reward 14.4]
# ----------
# [Episode  980/4000] [Steps   15] [reward 16.0]
# [Episode  990/4000] [Steps   10] [reward 11.0]
# [Episode 1000/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 1000] [Average Reward 9.3]
# ----------
# [Episode 1010/4000] [Steps   16] [reward 17.0]
# [Episode 1020/4000] [Steps   18] [reward 19.0]
# ----------
# saving model.
# [TEST Episode 1025] [Average Reward 41.6]
# ----------
# [Episode 1030/4000] [Steps  111] [reward 112.0]
# [Episode 1040/4000] [Steps   16] [reward 17.0]
# [Episode 1050/4000] [Steps   29] [reward 30.0]
# ----------
# [TEST Episode 1050] [Average Reward 20.9]
# ----------
# [Episode 1060/4000] [Steps    8] [reward 9.0]
# [Episode 1070/4000] [Steps   10] [reward 11.0]
# ----------
# [TEST Episode 1075] [Average Reward 9.5]
# ----------
# [Episode 1080/4000] [Steps    9] [reward 10.0]
# [Episode 1090/4000] [Steps   15] [reward 16.0]
# [Episode 1100/4000] [Steps   60] [reward 61.0]
# ----------
# [TEST Episode 1100] [Average Reward 17.2]
# ----------
# [Episode 1110/4000] [Steps   18] [reward 19.0]
# [Episode 1120/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 1125] [Average Reward 17.9]
# ----------
# [Episode 1130/4000] [Steps   15] [reward 16.0]
# [Episode 1140/4000] [Steps   18] [reward 19.0]
# [Episode 1150/4000] [Steps   21] [reward 22.0]
# ----------
# [TEST Episode 1150] [Average Reward 15.7]
# ----------
# [Episode 1160/4000] [Steps   36] [reward 37.0]
# [Episode 1170/4000] [Steps  499] [reward 500.0]
# ----------
# saving model.
# [TEST Episode 1175] [Average Reward 75.9]
# ----------
# [Episode 1180/4000] [Steps   12] [reward 13.0]
# [Episode 1190/4000] [Steps    8] [reward 9.0]
# [Episode 1200/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 1200] [Average Reward 9.7]
# ----------
# [Episode 1210/4000] [Steps   14] [reward 15.0]
# [Episode 1220/4000] [Steps   14] [reward 15.0]
# ----------
# [TEST Episode 1225] [Average Reward 19.0]
# ----------
# [Episode 1230/4000] [Steps   21] [reward 22.0]
# [Episode 1240/4000] [Steps   88] [reward 89.0]
# [Episode 1250/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 1250] [Average Reward 15.4]
# ----------
# [Episode 1260/4000] [Steps   10] [reward 11.0]
# [Episode 1270/4000] [Steps   16] [reward 17.0]
# ----------
# [TEST Episode 1275] [Average Reward 14.1]
# ----------
# [Episode 1280/4000] [Steps   12] [reward 13.0]
# [Episode 1290/4000] [Steps   15] [reward 16.0]
# [Episode 1300/4000] [Steps   18] [reward 19.0]
# ----------
# [TEST Episode 1300] [Average Reward 14.8]
# ----------
# [Episode 1310/4000] [Steps   15] [reward 16.0]
# [Episode 1320/4000] [Steps   22] [reward 23.0]
# ----------
# [TEST Episode 1325] [Average Reward 16.2]
# ----------
# [Episode 1330/4000] [Steps   95] [reward 96.0]
# [Episode 1340/4000] [Steps   12] [reward 13.0]
# [Episode 1350/4000] [Steps  190] [reward 191.0]
# ----------
# [TEST Episode 1350] [Average Reward 15.7]
# ----------
# [Episode 1360/4000] [Steps   21] [reward 22.0]
# [Episode 1370/4000] [Steps   12] [reward 13.0]
# ----------
# [TEST Episode 1375] [Average Reward 9.9]
# ----------
# [Episode 1380/4000] [Steps   17] [reward 18.0]
# [Episode 1390/4000] [Steps   65] [reward 66.0]
# [Episode 1400/4000] [Steps   12] [reward 13.0]
# ----------
# [TEST Episode 1400] [Average Reward 15.3]
# ----------
# [Episode 1410/4000] [Steps   15] [reward 16.0]
# [Episode 1420/4000] [Steps   15] [reward 16.0]
# ----------
# [TEST Episode 1425] [Average Reward 15.4]
# ----------
# [Episode 1430/4000] [Steps  160] [reward 161.0]
# [Episode 1440/4000] [Steps   12] [reward 13.0]
# [Episode 1450/4000] [Steps   18] [reward 19.0]
# ----------
# saving model.
# [TEST Episode 1450] [Average Reward 76.9]
# ----------
# [Episode 1460/4000] [Steps   40] [reward 41.0]
# [Episode 1470/4000] [Steps   60] [reward 61.0]
# ----------
# [TEST Episode 1475] [Average Reward 70.7]
# ----------
# [Episode 1480/4000] [Steps    9] [reward 10.0]
# [Episode 1490/4000] [Steps   65] [reward 66.0]
# [Episode 1500/4000] [Steps  111] [reward 112.0]
# ----------
# [TEST Episode 1500] [Average Reward 17.7]
# ----------
# [Episode 1510/4000] [Steps   79] [reward 80.0]
# [Episode 1520/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 1525] [Average Reward 51.9]
# ----------
# [Episode 1530/4000] [Steps   14] [reward 15.0]
# [Episode 1540/4000] [Steps    8] [reward 9.0]
# [Episode 1550/4000] [Steps   15] [reward 16.0]
# ----------
# [TEST Episode 1550] [Average Reward 28.1]
# ----------
# [Episode 1560/4000] [Steps   11] [reward 12.0]
# [Episode 1570/4000] [Steps   13] [reward 14.0]
# ----------
# [TEST Episode 1575] [Average Reward 15.7]
# ----------
# [Episode 1580/4000] [Steps   13] [reward 14.0]
# [Episode 1590/4000] [Steps   63] [reward 64.0]
# [Episode 1600/4000] [Steps   16] [reward 17.0]
# ----------
# [TEST Episode 1600] [Average Reward 66.4]
# ----------
# [Episode 1610/4000] [Steps   44] [reward 45.0]
# [Episode 1620/4000] [Steps   11] [reward 12.0]
# ----------
# saving model.
# [TEST Episode 1625] [Average Reward 94.7]
# ----------
# [Episode 1630/4000] [Steps  108] [reward 109.0]
# [Episode 1640/4000] [Steps   15] [reward 16.0]
# [Episode 1650/4000] [Steps   11] [reward 12.0]
# ----------
# saving model.
# [TEST Episode 1650] [Average Reward 151.4]
# ----------
# [Episode 1660/4000] [Steps   15] [reward 16.0]
# [Episode 1670/4000] [Steps   25] [reward 26.0]
# ----------
# [TEST Episode 1675] [Average Reward 127.8]
# ----------
# [Episode 1680/4000] [Steps  278] [reward 279.0]
# [Episode 1690/4000] [Steps  167] [reward 168.0]
# [Episode 1700/4000] [Steps   86] [reward 87.0]
# ----------
# [TEST Episode 1700] [Average Reward 43.8]
# ----------
# [Episode 1710/4000] [Steps    9] [reward 10.0]
# [Episode 1720/4000] [Steps   25] [reward 26.0]
# ----------
# [TEST Episode 1725] [Average Reward 14.5]
# ----------
# [Episode 1730/4000] [Steps   14] [reward 15.0]
# [Episode 1740/4000] [Steps   15] [reward 16.0]
# [Episode 1750/4000] [Steps  151] [reward 152.0]
# ----------
# [TEST Episode 1750] [Average Reward 33.8]
# ----------
# [Episode 1760/4000] [Steps   59] [reward 60.0]
# [Episode 1770/4000] [Steps   10] [reward 11.0]
# ----------
# [TEST Episode 1775] [Average Reward 99.4]
# ----------
# [Episode 1780/4000] [Steps  321] [reward 322.0]
# [Episode 1790/4000] [Steps   39] [reward 40.0]
# [Episode 1800/4000] [Steps  147] [reward 148.0]
# ----------
# [TEST Episode 1800] [Average Reward 120.1]
# ----------
# [Episode 1810/4000] [Steps   95] [reward 96.0]
# [Episode 1820/4000] [Steps  123] [reward 124.0]
# ----------
# [TEST Episode 1825] [Average Reward 83.4]
# ----------
# [Episode 1830/4000] [Steps  108] [reward 109.0]
# [Episode 1840/4000] [Steps    7] [reward 8.0]
# [Episode 1850/4000] [Steps   12] [reward 13.0]
# ----------
# [TEST Episode 1850] [Average Reward 11.7]
# ----------
# [Episode 1860/4000] [Steps  128] [reward 129.0]
# [Episode 1870/4000] [Steps  323] [reward 324.0]
# ----------
# [TEST Episode 1875] [Average Reward 11.5]
# ----------
# [Episode 1880/4000] [Steps   16] [reward 17.0]
# [Episode 1890/4000] [Steps  106] [reward 107.0]
# [Episode 1900/4000] [Steps   10] [reward 11.0]
# ----------
# [TEST Episode 1900] [Average Reward 9.4]
# ----------
# [Episode 1910/4000] [Steps   42] [reward 43.0]
# [Episode 1920/4000] [Steps   31] [reward 32.0]
# ----------
# [TEST Episode 1925] [Average Reward 101.2]
# ----------
# [Episode 1930/4000] [Steps    9] [reward 10.0]
# [Episode 1940/4000] [Steps  218] [reward 219.0]
# [Episode 1950/4000] [Steps  131] [reward 132.0]
# ----------
# [TEST Episode 1950] [Average Reward 91.4]
# ----------
# [Episode 1960/4000] [Steps   13] [reward 14.0]
# [Episode 1970/4000] [Steps   12] [reward 13.0]
# ----------
# [TEST Episode 1975] [Average Reward 78.5]
# ----------
# [Episode 1980/4000] [Steps  127] [reward 128.0]
# [Episode 1990/4000] [Steps   61] [reward 62.0]
# [Episode 2000/4000] [Steps   33] [reward 34.0]
# ----------
# saving model.
# [TEST Episode 2000] [Average Reward 448.8]
# ----------
# [Episode 2010/4000] [Steps   52] [reward 53.0]
# [Episode 2020/4000] [Steps  168] [reward 169.0]
# ----------
# [TEST Episode 2025] [Average Reward 130.8]
# ----------
# [Episode 2030/4000] [Steps  140] [reward 141.0]
# [Episode 2040/4000] [Steps   36] [reward 37.0]
# [Episode 2050/4000] [Steps  220] [reward 221.0]
# ----------
# [TEST Episode 2050] [Average Reward 135.2]
# ----------
# [Episode 2060/4000] [Steps  191] [reward 192.0]
# [Episode 2070/4000] [Steps  241] [reward 242.0]
# ----------
# [TEST Episode 2075] [Average Reward 239.2]
# ----------
# [Episode 2080/4000] [Steps  373] [reward 374.0]
# [Episode 2090/4000] [Steps   13] [reward 14.0]
# [Episode 2100/4000] [Steps  139] [reward 140.0]
# ----------
# [TEST Episode 2100] [Average Reward 147.0]
# ----------
# [Episode 2110/4000] [Steps  499] [reward 500.0]
# [Episode 2120/4000] [Steps   42] [reward 43.0]
# ----------
# [TEST Episode 2125] [Average Reward 147.1]
# ----------
# [Episode 2130/4000] [Steps  184] [reward 185.0]
# [Episode 2140/4000] [Steps  170] [reward 171.0]
# [Episode 2150/4000] [Steps  106] [reward 107.0]
# ----------
# [TEST Episode 2150] [Average Reward 226.2]
# ----------
# [Episode 2160/4000] [Steps  133] [reward 134.0]
# [Episode 2170/4000] [Steps  144] [reward 145.0]
# ----------
# [TEST Episode 2175] [Average Reward 9.0]
# ----------
# [Episode 2180/4000] [Steps    9] [reward 10.0]
# [Episode 2190/4000] [Steps    8] [reward 9.0]
# [Episode 2200/4000] [Steps   15] [reward 16.0]
# ----------
# [TEST Episode 2200] [Average Reward 205.6]
# ----------
# [Episode 2210/4000] [Steps   18] [reward 19.0]
# [Episode 2220/4000] [Steps   10] [reward 11.0]
# ----------
# [TEST Episode 2225] [Average Reward 141.1]
# ----------
# [Episode 2230/4000] [Steps  208] [reward 209.0]
# [Episode 2240/4000] [Steps  194] [reward 195.0]
# [Episode 2250/4000] [Steps  251] [reward 252.0]
# ----------
# [TEST Episode 2250] [Average Reward 147.1]
# ----------
# [Episode 2260/4000] [Steps  152] [reward 153.0]
# [Episode 2270/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 2275] [Average Reward 118.5]
# ----------
# [Episode 2280/4000] [Steps  134] [reward 135.0]
# [Episode 2290/4000] [Steps  140] [reward 141.0]
# [Episode 2300/4000] [Steps   12] [reward 13.0]
# ----------
# [TEST Episode 2300] [Average Reward 183.5]
# ----------
# [Episode 2310/4000] [Steps  190] [reward 191.0]
# [Episode 2320/4000] [Steps   56] [reward 57.0]
# ----------
# [TEST Episode 2325] [Average Reward 130.3]
# ----------
# [Episode 2330/4000] [Steps   41] [reward 42.0]
# [Episode 2340/4000] [Steps   74] [reward 75.0]
# [Episode 2350/4000] [Steps  162] [reward 163.0]
# ----------
# [TEST Episode 2350] [Average Reward 177.2]
# ----------
# [Episode 2360/4000] [Steps   32] [reward 33.0]
# [Episode 2370/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 2375] [Average Reward 200.0]
# ----------
# [Episode 2380/4000] [Steps   12] [reward 13.0]
# [Episode 2390/4000] [Steps   16] [reward 17.0]
# [Episode 2400/4000] [Steps   85] [reward 86.0]
# ----------
# [TEST Episode 2400] [Average Reward 86.3]
# ----------
# [Episode 2410/4000] [Steps   97] [reward 98.0]
# [Episode 2420/4000] [Steps   87] [reward 88.0]
# ----------
# [TEST Episode 2425] [Average Reward 270.1]
# ----------
# [Episode 2430/4000] [Steps   54] [reward 55.0]
# [Episode 2440/4000] [Steps  171] [reward 172.0]
# [Episode 2450/4000] [Steps  179] [reward 180.0]
# ----------
# [TEST Episode 2450] [Average Reward 138.6]
# ----------
# [Episode 2460/4000] [Steps  143] [reward 144.0]
# [Episode 2470/4000] [Steps   34] [reward 35.0]
# ----------
# [TEST Episode 2475] [Average Reward 272.9]
# ----------
# [Episode 2480/4000] [Steps   69] [reward 70.0]
# [Episode 2490/4000] [Steps  446] [reward 447.0]
# [Episode 2500/4000] [Steps   54] [reward 55.0]
# ----------
# [TEST Episode 2500] [Average Reward 229.6]
# ----------
# [Episode 2510/4000] [Steps  118] [reward 119.0]
# [Episode 2520/4000] [Steps   92] [reward 93.0]
# ----------
# [TEST Episode 2525] [Average Reward 232.4]
# ----------
# [Episode 2530/4000] [Steps   13] [reward 14.0]
# [Episode 2540/4000] [Steps    9] [reward 10.0]
# [Episode 2550/4000] [Steps  141] [reward 142.0]
# ----------
# [TEST Episode 2550] [Average Reward 224.0]
# ----------
# [Episode 2560/4000] [Steps   52] [reward 53.0]
# [Episode 2570/4000] [Steps  118] [reward 119.0]
# ----------
# [TEST Episode 2575] [Average Reward 112.3]
# ----------
# [Episode 2580/4000] [Steps  155] [reward 156.0]
# [Episode 2590/4000] [Steps    9] [reward 10.0]
# [Episode 2600/4000] [Steps  238] [reward 239.0]
# ----------
# [TEST Episode 2600] [Average Reward 241.0]
# ----------
# [Episode 2610/4000] [Steps  223] [reward 224.0]
# [Episode 2620/4000] [Steps  297] [reward 298.0]
# ----------
# saving model.
# [TEST Episode 2625] [Average Reward 500.0]
# ----------
# [Episode 2630/4000] [Steps  499] [reward 500.0]
# [Episode 2640/4000] [Steps  129] [reward 130.0]
# [Episode 2650/4000] [Steps   91] [reward 92.0]
# ----------
# [TEST Episode 2650] [Average Reward 99.1]
# ----------
# [Episode 2660/4000] [Steps  176] [reward 177.0]
# [Episode 2670/4000] [Steps   14] [reward 15.0]
# ----------
# [TEST Episode 2675] [Average Reward 380.2]
# ----------
# [Episode 2680/4000] [Steps  377] [reward 378.0]
# [Episode 2690/4000] [Steps  366] [reward 367.0]
# [Episode 2700/4000] [Steps  293] [reward 294.0]
# ----------
# [TEST Episode 2700] [Average Reward 213.1]
# ----------
# [Episode 2710/4000] [Steps  385] [reward 386.0]
# [Episode 2720/4000] [Steps  242] [reward 243.0]
# ----------
# [TEST Episode 2725] [Average Reward 205.9]
# ----------
# [Episode 2730/4000] [Steps  205] [reward 206.0]
# [Episode 2740/4000] [Steps    9] [reward 10.0]
# [Episode 2750/4000] [Steps  202] [reward 203.0]
# ----------
# [TEST Episode 2750] [Average Reward 496.1]
# ----------
# [Episode 2760/4000] [Steps  192] [reward 193.0]
# [Episode 2770/4000] [Steps  187] [reward 188.0]
# ----------
# [TEST Episode 2775] [Average Reward 500.0]
# ----------
# [Episode 2780/4000] [Steps   69] [reward 70.0]
# [Episode 2790/4000] [Steps   10] [reward 11.0]
# [Episode 2800/4000] [Steps  202] [reward 203.0]
# ----------
# [TEST Episode 2800] [Average Reward 264.8]
# ----------
# [Episode 2810/4000] [Steps  499] [reward 500.0]
# [Episode 2820/4000] [Steps  182] [reward 183.0]
# ----------
# [TEST Episode 2825] [Average Reward 139.9]
# ----------
# [Episode 2830/4000] [Steps  109] [reward 110.0]
# [Episode 2840/4000] [Steps   50] [reward 51.0]
# [Episode 2850/4000] [Steps   33] [reward 34.0]
# ----------
# [TEST Episode 2850] [Average Reward 164.2]
# ----------
# [Episode 2860/4000] [Steps   24] [reward 25.0]
# [Episode 2870/4000] [Steps  350] [reward 351.0]
# ----------
# [TEST Episode 2875] [Average Reward 144.1]
# ----------
# [Episode 2880/4000] [Steps   85] [reward 86.0]
# [Episode 2890/4000] [Steps  129] [reward 130.0]
# [Episode 2900/4000] [Steps   61] [reward 62.0]
# ----------
# [TEST Episode 2900] [Average Reward 106.4]
# ----------
# [Episode 2910/4000] [Steps  128] [reward 129.0]
# [Episode 2920/4000] [Steps  397] [reward 398.0]
# ----------
# [TEST Episode 2925] [Average Reward 111.0]
# ----------
# [Episode 2930/4000] [Steps   14] [reward 15.0]
# [Episode 2940/4000] [Steps  116] [reward 117.0]
# [Episode 2950/4000] [Steps   72] [reward 73.0]
# ----------
# [TEST Episode 2950] [Average Reward 132.8]
# ----------
# [Episode 2960/4000] [Steps  106] [reward 107.0]
# [Episode 2970/4000] [Steps  132] [reward 133.0]
# ----------
# [TEST Episode 2975] [Average Reward 107.8]
# ----------
# [Episode 2980/4000] [Steps   91] [reward 92.0]
# [Episode 2990/4000] [Steps  126] [reward 127.0]
# [Episode 3000/4000] [Steps  402] [reward 403.0]
# ----------
# [TEST Episode 3000] [Average Reward 101.2]
# ----------
# [Episode 3010/4000] [Steps  106] [reward 107.0]
# [Episode 3020/4000] [Steps   16] [reward 17.0]
# ----------
# [TEST Episode 3025] [Average Reward 79.1]
# ----------
# [Episode 3030/4000] [Steps   96] [reward 97.0]
# [Episode 3040/4000] [Steps    9] [reward 10.0]
# [Episode 3050/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 3050] [Average Reward 11.4]
# ----------
# [Episode 3060/4000] [Steps  112] [reward 113.0]
# [Episode 3070/4000] [Steps   73] [reward 74.0]
# ----------
# [TEST Episode 3075] [Average Reward 114.7]
# ----------
# [Episode 3080/4000] [Steps  166] [reward 167.0]
# [Episode 3090/4000] [Steps    9] [reward 10.0]
# [Episode 3100/4000] [Steps   31] [reward 32.0]
# ----------
# [TEST Episode 3100] [Average Reward 75.8]
# ----------
# [Episode 3110/4000] [Steps  121] [reward 122.0]
# [Episode 3120/4000] [Steps  139] [reward 140.0]
# ----------
# [TEST Episode 3125] [Average Reward 101.7]
# ----------
# [Episode 3130/4000] [Steps  239] [reward 240.0]
# [Episode 3140/4000] [Steps  179] [reward 180.0]
# [Episode 3150/4000] [Steps   34] [reward 35.0]
# ----------
# [TEST Episode 3150] [Average Reward 74.3]
# ----------
# [Episode 3160/4000] [Steps   66] [reward 67.0]
# [Episode 3170/4000] [Steps   53] [reward 54.0]
# ----------
# [TEST Episode 3175] [Average Reward 71.9]
# ----------
# [Episode 3180/4000] [Steps   89] [reward 90.0]
# [Episode 3190/4000] [Steps   48] [reward 49.0]
# [Episode 3200/4000] [Steps   53] [reward 54.0]
# ----------
# [TEST Episode 3200] [Average Reward 71.0]
# ----------
# [Episode 3210/4000] [Steps    9] [reward 10.0]
# [Episode 3220/4000] [Steps   99] [reward 100.0]
# ----------
# [TEST Episode 3225] [Average Reward 84.6]
# ----------
# [Episode 3230/4000] [Steps  114] [reward 115.0]
# [Episode 3240/4000] [Steps  226] [reward 227.0]
# [Episode 3250/4000] [Steps   58] [reward 59.0]
# ----------
# [TEST Episode 3250] [Average Reward 66.9]
# ----------
# [Episode 3260/4000] [Steps   45] [reward 46.0]
# [Episode 3270/4000] [Steps   80] [reward 81.0]
# ----------
# [TEST Episode 3275] [Average Reward 363.4]
# ----------
# [Episode 3280/4000] [Steps   90] [reward 91.0]
# [Episode 3290/4000] [Steps   90] [reward 91.0]
# [Episode 3300/4000] [Steps   82] [reward 83.0]
# ----------
# [TEST Episode 3300] [Average Reward 78.8]
# ----------
# [Episode 3310/4000] [Steps   72] [reward 73.0]
# [Episode 3320/4000] [Steps   15] [reward 16.0]
# ----------
# [TEST Episode 3325] [Average Reward 81.9]
# ----------
# [Episode 3330/4000] [Steps  109] [reward 110.0]
# [Episode 3340/4000] [Steps   55] [reward 56.0]
# [Episode 3350/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 3350] [Average Reward 500.0]
# ----------
# [Episode 3360/4000] [Steps  119] [reward 120.0]
# [Episode 3370/4000] [Steps   93] [reward 94.0]
# ----------
# [TEST Episode 3375] [Average Reward 61.1]
# ----------
# [Episode 3380/4000] [Steps   78] [reward 79.0]
# [Episode 3390/4000] [Steps   83] [reward 84.0]
# [Episode 3400/4000] [Steps   25] [reward 26.0]
# ----------
# [TEST Episode 3400] [Average Reward 156.8]
# ----------
# [Episode 3410/4000] [Steps   39] [reward 40.0]
# [Episode 3420/4000] [Steps   14] [reward 15.0]
# ----------
# [TEST Episode 3425] [Average Reward 85.9]
# ----------
# [Episode 3430/4000] [Steps   31] [reward 32.0]
# [Episode 3440/4000] [Steps   41] [reward 42.0]
# [Episode 3450/4000] [Steps   69] [reward 70.0]
# ----------
# [TEST Episode 3450] [Average Reward 192.1]
# ----------
# [Episode 3460/4000] [Steps   26] [reward 27.0]
# [Episode 3470/4000] [Steps   43] [reward 44.0]
# ----------
# [TEST Episode 3475] [Average Reward 86.8]
# ----------
# [Episode 3480/4000] [Steps   13] [reward 14.0]
# [Episode 3490/4000] [Steps   40] [reward 41.0]
# [Episode 3500/4000] [Steps   53] [reward 54.0]
# ----------
# [TEST Episode 3500] [Average Reward 498.9]
# ----------
# [Episode 3510/4000] [Steps    9] [reward 10.0]
# [Episode 3520/4000] [Steps   62] [reward 63.0]
# ----------
# [TEST Episode 3525] [Average Reward 113.0]
# ----------
# [Episode 3530/4000] [Steps   27] [reward 28.0]
# [Episode 3540/4000] [Steps   67] [reward 68.0]
# [Episode 3550/4000] [Steps  131] [reward 132.0]
# ----------
# [TEST Episode 3550] [Average Reward 52.0]
# ----------
# [Episode 3560/4000] [Steps   17] [reward 18.0]
# [Episode 3570/4000] [Steps   25] [reward 26.0]
# ----------
# [TEST Episode 3575] [Average Reward 362.9]
# ----------
# [Episode 3580/4000] [Steps   24] [reward 25.0]
# [Episode 3590/4000] [Steps   32] [reward 33.0]
# [Episode 3600/4000] [Steps   53] [reward 54.0]
# ----------
# [TEST Episode 3600] [Average Reward 76.8]
# ----------
# [Episode 3610/4000] [Steps   15] [reward 16.0]
# [Episode 3620/4000] [Steps   23] [reward 24.0]
# ----------
# [TEST Episode 3625] [Average Reward 46.4]
# ----------
# [Episode 3630/4000] [Steps   53] [reward 54.0]
# [Episode 3640/4000] [Steps   57] [reward 58.0]
# [Episode 3650/4000] [Steps   68] [reward 69.0]
# ----------
# [TEST Episode 3650] [Average Reward 69.3]
# ----------
# [Episode 3660/4000] [Steps   64] [reward 65.0]
# [Episode 3670/4000] [Steps   92] [reward 93.0]
# ----------
# [TEST Episode 3675] [Average Reward 91.1]
# ----------
# [Episode 3680/4000] [Steps   29] [reward 30.0]
# [Episode 3690/4000] [Steps   78] [reward 79.0]
# [Episode 3700/4000] [Steps  100] [reward 101.0]
# ----------
# [TEST Episode 3700] [Average Reward 108.8]
# ----------
# [Episode 3710/4000] [Steps   34] [reward 35.0]
# [Episode 3720/4000] [Steps  136] [reward 137.0]
# ----------
# [TEST Episode 3725] [Average Reward 43.8]
# ----------
# [Episode 3730/4000] [Steps   60] [reward 61.0]
# [Episode 3740/4000] [Steps   27] [reward 28.0]
# [Episode 3750/4000] [Steps   60] [reward 61.0]
# ----------
# [TEST Episode 3750] [Average Reward 80.3]
# ----------
# [Episode 3760/4000] [Steps   58] [reward 59.0]
# [Episode 3770/4000] [Steps  127] [reward 128.0]
# ----------
# [TEST Episode 3775] [Average Reward 134.3]
# ----------
# [Episode 3780/4000] [Steps   19] [reward 20.0]
# [Episode 3790/4000] [Steps   62] [reward 63.0]
# [Episode 3800/4000] [Steps  279] [reward 280.0]
# ----------
# [TEST Episode 3800] [Average Reward 140.1]
# ----------
# [Episode 3810/4000] [Steps   66] [reward 67.0]
# [Episode 3820/4000] [Steps   71] [reward 72.0]
# ----------
# [TEST Episode 3825] [Average Reward 86.4]
# ----------
# [Episode 3830/4000] [Steps   61] [reward 62.0]
# [Episode 3840/4000] [Steps  229] [reward 230.0]
# [Episode 3850/4000] [Steps   78] [reward 79.0]
# ----------
# [TEST Episode 3850] [Average Reward 268.0]
# ----------
# [Episode 3860/4000] [Steps   30] [reward 31.0]
# [Episode 3870/4000] [Steps   76] [reward 77.0]
# ----------
# [TEST Episode 3875] [Average Reward 151.6]
# ----------
# [Episode 3880/4000] [Steps   28] [reward 29.0]
# [Episode 3890/4000] [Steps  229] [reward 230.0]
# [Episode 3900/4000] [Steps   27] [reward 28.0]
# ----------
# [TEST Episode 3900] [Average Reward 106.0]
# ----------
# [Episode 3910/4000] [Steps   56] [reward 57.0]
# [Episode 3920/4000] [Steps   77] [reward 78.0]
# ----------
# [TEST Episode 3925] [Average Reward 192.3]
# ----------
# [Episode 3930/4000] [Steps   29] [reward 30.0]
# [Episode 3940/4000] [Steps  112] [reward 113.0]
# [Episode 3950/4000] [Steps   12] [reward 13.0]
# ----------
# [TEST Episode 3950] [Average Reward 88.4]
# ----------
# [Episode 3960/4000] [Steps   40] [reward 41.0]
# [Episode 3970/4000] [Steps   46] [reward 47.0]
# ----------
# [TEST Episode 3975] [Average Reward 153.7]
# ----------
# [Episode 3980/4000] [Steps   64] [reward 65.0]
# [Episode 3990/4000] [Steps  117] [reward 118.0]
# [Episode 4000/4000] [Steps  101] [reward 102.0]
# ----------
# [TEST Episode 4000] [Average Reward 111.7]
# ----------

### 2.4 [5 pts] Add the Experience Replay Buffer

As described in the DQN paper (and shown in the algorithm picture above), the authors use an experience replay buffer to learn faster. We provide an implementation in the file `replay_buffer.py`. Update the `train_reinforcement_learning` code to push each transition tuple to the replay buffer and to sample a batch from it for the `optimize_model` function.
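To make the push/sample pattern concrete, here is a minimal sketch of a replay buffer. It is not the contents of the provided `replay_buffer.py`: the class name `SketchReplayBuffer`, the `Transition` field layout, and the `push`/`sample` method names are assumptions chosen for illustration, and the provided file may use a different interface.

```python
import random
from collections import deque, namedtuple

# Hypothetical transition layout; the provided replay_buffer.py may differ.
Transition = namedtuple(
    "Transition", ["state", "action", "next_state", "reward", "done"]
)


class SketchReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        """Store one transition (state, action, next_state, reward, done)."""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        """Return batch_size distinct transitions chosen uniformly at random."""
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Toy usage: integers stand in for environment states.
buffer = SketchReplayBuffer(capacity=5)
for t in range(8):
    buffer.push(t, 0, t + 1, 1.0, False)

batch = buffer.sample(3)
# Transpose a list of transitions into per-field tuples for batched updates.
states, actions, next_states, rewards, dones = zip(*batch)
```

In a training loop, the push would typically happen once per environment step, and sampling would start only once the buffer holds at least one batch worth of transitions. Sampling uniformly from past experience breaks the correlation between consecutive transitions, which is the motivation given in the DQN paper.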

In [69]:
# ## PASTE YOUR TERMINAL OUTPUT HERE
# # NOTE: TO HAVE FEWER LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 OR 10

# [Episode   10/4000] [Steps   10] [reward 11.0]
# [Episode   20/4000] [Steps   15] [reward 16.0]
# ----------
# saving model.
# [TEST Episode 25] [Average Reward 8.9]
# ----------
# [Episode   30/4000] [Steps   12] [reward 13.0]
# [Episode   40/4000] [Steps   11] [reward 12.0]
# [Episode   50/4000] [Steps    9] [reward 10.0]
# ----------
# saving model.
# [TEST Episode 50] [Average Reward 9.6]
# ----------
# [Episode   60/4000] [Steps    9] [reward 10.0]
# [Episode   70/4000] [Steps   10] [reward 11.0]
# ----------
# saving model.
# [TEST Episode 75] [Average Reward 9.8]
# ----------
# [Episode   80/4000] [Steps   10] [reward 11.0]
# [Episode   90/4000] [Steps   10] [reward 11.0]
# [Episode  100/4000] [Steps   10] [reward 11.0]
# ----------
# [TEST Episode 100] [Average Reward 9.6]
# ----------
# [Episode  110/4000] [Steps    8] [reward 9.0]
# [Episode  120/4000] [Steps   12] [reward 13.0]
# ----------
# [TEST Episode 125] [Average Reward 9.1]
# ----------
# [Episode  130/4000] [Steps   11] [reward 12.0]
# [Episode  140/4000] [Steps    8] [reward 9.0]
# [Episode  150/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 150] [Average Reward 9.6]
# ----------
# [Episode  160/4000] [Steps   10] [reward 11.0]
# [Episode  170/4000] [Steps    9] [reward 10.0]
# ----------
# saving model.
# [TEST Episode 175] [Average Reward 9.9]
# ----------
# [Episode  180/4000] [Steps   11] [reward 12.0]
# [Episode  190/4000] [Steps   13] [reward 14.0]
# [Episode  200/4000] [Steps   13] [reward 14.0]
# ----------
# saving model.
# [TEST Episode 200] [Average Reward 55.6]
# ----------
# [Episode  210/4000] [Steps   10] [reward 11.0]
# [Episode  220/4000] [Steps   10] [reward 11.0]
# ----------
# [TEST Episode 225] [Average Reward 10.3]
# ----------
# [Episode  230/4000] [Steps    8] [reward 9.0]
# [Episode  240/4000] [Steps   11] [reward 12.0]
# [Episode  250/4000] [Steps   41] [reward 42.0]
# ----------
# [TEST Episode 250] [Average Reward 21.4]
# ----------
# [Episode  260/4000] [Steps   19] [reward 20.0]
# [Episode  270/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 275] [Average Reward 38.2]
# ----------
# [Episode  280/4000] [Steps   23] [reward 24.0]
# [Episode  290/4000] [Steps   14] [reward 15.0]
# [Episode  300/4000] [Steps   10] [reward 11.0]
# ----------
# [TEST Episode 300] [Average Reward 10.9]
# ----------
# [Episode  310/4000] [Steps   12] [reward 13.0]
# [Episode  320/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 325] [Average Reward 9.8]
# ----------
# [Episode  330/4000] [Steps   11] [reward 12.0]
# [Episode  340/4000] [Steps   22] [reward 23.0]
# [Episode  350/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 350] [Average Reward 12.1]
# ----------
# [Episode  360/4000] [Steps   14] [reward 15.0]
# [Episode  370/4000] [Steps   26] [reward 27.0]
# ----------
# [TEST Episode 375] [Average Reward 12.6]
# ----------
# [Episode  380/4000] [Steps    9] [reward 10.0]
# [Episode  390/4000] [Steps   91] [reward 92.0]
# [Episode  400/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 400] [Average Reward 11.6]
# ----------
# [Episode  410/4000] [Steps   10] [reward 11.0]
# [Episode  420/4000] [Steps   15] [reward 16.0]
# ----------
# [TEST Episode 425] [Average Reward 14.0]
# ----------
# [Episode  430/4000] [Steps   76] [reward 77.0]
# [Episode  440/4000] [Steps   13] [reward 14.0]
# [Episode  450/4000] [Steps   15] [reward 16.0]
# ----------
# [TEST Episode 450] [Average Reward 14.2]
# ----------
# [Episode  460/4000] [Steps   10] [reward 11.0]
# [Episode  470/4000] [Steps   13] [reward 14.0]
# ----------
# [TEST Episode 475] [Average Reward 45.6]
# ----------
# [Episode  480/4000] [Steps   22] [reward 23.0]
# [Episode  490/4000] [Steps   20] [reward 21.0]
# [Episode  500/4000] [Steps  119] [reward 120.0]
# ----------
# [TEST Episode 500] [Average Reward 13.6]
# ----------
# [Episode  510/4000] [Steps   69] [reward 70.0]
# [Episode  520/4000] [Steps   23] [reward 24.0]
# ----------
# [TEST Episode 525] [Average Reward 24.8]
# ----------
# [Episode  530/4000] [Steps   18] [reward 19.0]
# [Episode  540/4000] [Steps  321] [reward 322.0]
# [Episode  550/4000] [Steps   34] [reward 35.0]
# ----------
# [TEST Episode 550] [Average Reward 10.9]
# ----------
# [Episode  560/4000] [Steps    9] [reward 10.0]
# [Episode  570/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 575] [Average Reward 9.4]
# ----------
# [Episode  580/4000] [Steps  180] [reward 181.0]
# [Episode  590/4000] [Steps    8] [reward 9.0]
# [Episode  600/4000] [Steps   12] [reward 13.0]
# ----------
# [TEST Episode 600] [Average Reward 12.4]
# ----------
# [Episode  610/4000] [Steps  321] [reward 322.0]
# [Episode  620/4000] [Steps  147] [reward 148.0]
# ----------
# saving model.
# [TEST Episode 625] [Average Reward 95.6]
# ----------
# [Episode  630/4000] [Steps   13] [reward 14.0]
# [Episode  640/4000] [Steps  252] [reward 253.0]
# [Episode  650/4000] [Steps   23] [reward 24.0]
# ----------
# [TEST Episode 650] [Average Reward 26.6]
# ----------
# [Episode  660/4000] [Steps   17] [reward 18.0]
# [Episode  670/4000] [Steps   15] [reward 16.0]
# ----------
# [TEST Episode 675] [Average Reward 20.9]
# ----------
# [Episode  680/4000] [Steps   64] [reward 65.0]
# [Episode  690/4000] [Steps   40] [reward 41.0]
# [Episode  700/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 700] [Average Reward 59.4]
# ----------
# [Episode  710/4000] [Steps  117] [reward 118.0]
# [Episode  720/4000] [Steps  221] [reward 222.0]
# ----------
# saving model.
# [TEST Episode 725] [Average Reward 295.2]
# ----------
# [Episode  730/4000] [Steps  499] [reward 500.0]
# [Episode  740/4000] [Steps   12] [reward 13.0]
# [Episode  750/4000] [Steps  139] [reward 140.0]
# ----------
# [TEST Episode 750] [Average Reward 288.5]
# ----------
# [Episode  760/4000] [Steps   22] [reward 23.0]
# [Episode  770/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 775] [Average Reward 100.2]
# ----------
# [Episode  780/4000] [Steps   47] [reward 48.0]
# [Episode  790/4000] [Steps   62] [reward 63.0]
# [Episode  800/4000] [Steps   13] [reward 14.0]
# ----------
# [TEST Episode 800] [Average Reward 29.6]
# ----------
# [Episode  810/4000] [Steps  164] [reward 165.0]
# [Episode  820/4000] [Steps  134] [reward 135.0]
# ----------
# [TEST Episode 825] [Average Reward 283.5]
# ----------
# [Episode  830/4000] [Steps  228] [reward 229.0]
# [Episode  840/4000] [Steps  153] [reward 154.0]
# [Episode  850/4000] [Steps  127] [reward 128.0]
# ----------
# [TEST Episode 850] [Average Reward 144.9]
# ----------
# [Episode  860/4000] [Steps   95] [reward 96.0]
# [Episode  870/4000] [Steps  110] [reward 111.0]
# ----------
# [TEST Episode 875] [Average Reward 169.3]
# ----------
# [Episode  880/4000] [Steps   15] [reward 16.0]
# [Episode  890/4000] [Steps  204] [reward 205.0]
# [Episode  900/4000] [Steps   10] [reward 11.0]
# ----------
# [TEST Episode 900] [Average Reward 120.7]
# ----------
# [Episode  910/4000] [Steps   61] [reward 62.0]
# [Episode  920/4000] [Steps  152] [reward 153.0]
# ----------
# saving model.
# [TEST Episode 925] [Average Reward 452.2]
# ----------
# [Episode  930/4000] [Steps   10] [reward 11.0]
# [Episode  940/4000] [Steps  347] [reward 348.0]
# [Episode  950/4000] [Steps  499] [reward 500.0]
# ----------
# [TEST Episode 950] [Average Reward 228.4]
# ----------
# [Episode  960/4000] [Steps  499] [reward 500.0]
# [Episode  970/4000] [Steps  335] [reward 336.0]
# ----------
# [TEST Episode 975] [Average Reward 352.2]
# ----------
# [Episode  980/4000] [Steps  220] [reward 221.0]
# [Episode  990/4000] [Steps  499] [reward 500.0]
# [Episode 1000/4000] [Steps  499] [reward 500.0]
# ----------
# saving model.
# [TEST Episode 1000] [Average Reward 500.0]
# ----------
# [Episode 1010/4000] [Steps  124] [reward 125.0]
# [Episode 1020/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 1025] [Average Reward 37.1]
# ----------
# [Episode 1030/4000] [Steps   94] [reward 95.0]
# [Episode 1040/4000] [Steps    8] [reward 9.0]
# [Episode 1050/4000] [Steps  157] [reward 158.0]
# ----------
# [TEST Episode 1050] [Average Reward 412.7]
# ----------
# [Episode 1060/4000] [Steps   14] [reward 15.0]
# [Episode 1070/4000] [Steps   82] [reward 83.0]
# ----------
# [TEST Episode 1075] [Average Reward 131.0]
# ----------
# [Episode 1080/4000] [Steps   90] [reward 91.0]
# [Episode 1090/4000] [Steps  126] [reward 127.0]
# [Episode 1100/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 1100] [Average Reward 9.6]
# ----------
# [Episode 1110/4000] [Steps   18] [reward 19.0]
# [Episode 1120/4000] [Steps  176] [reward 177.0]
# ----------
# [TEST Episode 1125] [Average Reward 315.3]
# ----------
# [Episode 1130/4000] [Steps  131] [reward 132.0]
# [Episode 1140/4000] [Steps  128] [reward 129.0]
# [Episode 1150/4000] [Steps  214] [reward 215.0]
# ----------
# [TEST Episode 1150] [Average Reward 349.6]
# ----------
# [Episode 1160/4000] [Steps   11] [reward 12.0]
# [Episode 1170/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 1175] [Average Reward 316.4]
# ----------
# [Episode 1180/4000] [Steps  252] [reward 253.0]
# [Episode 1190/4000] [Steps  270] [reward 271.0]
# [Episode 1200/4000] [Steps  135] [reward 136.0]
# ----------
# [TEST Episode 1200] [Average Reward 174.3]
# ----------
# [Episode 1210/4000] [Steps  185] [reward 186.0]
# [Episode 1220/4000] [Steps   12] [reward 13.0]
# ----------
# [TEST Episode 1225] [Average Reward 110.6]
# ----------
# [Episode 1230/4000] [Steps  213] [reward 214.0]
# [Episode 1240/4000] [Steps  240] [reward 241.0]
# [Episode 1250/4000] [Steps    7] [reward 8.0]
# ----------
# [TEST Episode 1250] [Average Reward 10.7]
# ----------
# [Episode 1260/4000] [Steps   11] [reward 12.0]
# [Episode 1270/4000] [Steps   61] [reward 62.0]
# ----------
# [TEST Episode 1275] [Average Reward 11.5]
# ----------
# [Episode 1280/4000] [Steps   14] [reward 15.0]
# [Episode 1290/4000] [Steps  127] [reward 128.0]
# [Episode 1300/4000] [Steps  389] [reward 390.0]
# ----------
# [TEST Episode 1300] [Average Reward 71.3]
# ----------
# [Episode 1310/4000] [Steps  175] [reward 176.0]
# [Episode 1320/4000] [Steps  499] [reward 500.0]
# ----------
# [TEST Episode 1325] [Average Reward 200.5]
# ----------
# [Episode 1330/4000] [Steps  188] [reward 189.0]
# [Episode 1340/4000] [Steps  209] [reward 210.0]
# [Episode 1350/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 1350] [Average Reward 15.2]
# ----------
# [Episode 1360/4000] [Steps  293] [reward 294.0]
# [Episode 1370/4000] [Steps   18] [reward 19.0]
# ----------
# [TEST Episode 1375] [Average Reward 293.9]
# ----------
# [Episode 1380/4000] [Steps  186] [reward 187.0]
# [Episode 1390/4000] [Steps   42] [reward 43.0]
# [Episode 1400/4000] [Steps   71] [reward 72.0]
# ----------
# [TEST Episode 1400] [Average Reward 61.6]
# ----------
# [Episode 1410/4000] [Steps  179] [reward 180.0]
# [Episode 1420/4000] [Steps  142] [reward 143.0]
# ----------
# [TEST Episode 1425] [Average Reward 191.0]
# ----------
# [Episode 1430/4000] [Steps  159] [reward 160.0]
# [Episode 1440/4000] [Steps    9] [reward 10.0]
# [Episode 1450/4000] [Steps  373] [reward 374.0]
# ----------
# [TEST Episode 1450] [Average Reward 197.7]
# ----------
# [Episode 1460/4000] [Steps   12] [reward 13.0]
# [Episode 1470/4000] [Steps  219] [reward 220.0]
# ----------
# [TEST Episode 1475] [Average Reward 190.3]
# ----------
# [Episode 1480/4000] [Steps  173] [reward 174.0]
# [Episode 1490/4000] [Steps  302] [reward 303.0]
# [Episode 1500/4000] [Steps  168] [reward 169.0]
# ----------
# [TEST Episode 1500] [Average Reward 334.1]
# ----------
# [Episode 1510/4000] [Steps   11] [reward 12.0]
# [Episode 1520/4000] [Steps   24] [reward 25.0]
# ----------
# [TEST Episode 1525] [Average Reward 23.3]
# ----------
# [Episode 1530/4000] [Steps  139] [reward 140.0]
# [Episode 1540/4000] [Steps   19] [reward 20.0]
# [Episode 1550/4000] [Steps   10] [reward 11.0]
# ----------
# [TEST Episode 1550] [Average Reward 13.1]
# ----------
# [Episode 1560/4000] [Steps   11] [reward 12.0]
# [Episode 1570/4000] [Steps  121] [reward 122.0]
# ----------
# [TEST Episode 1575] [Average Reward 172.2]
# ----------
# [Episode 1580/4000] [Steps   19] [reward 20.0]
# [Episode 1590/4000] [Steps   17] [reward 18.0]
# [Episode 1600/4000] [Steps   87] [reward 88.0]
# ----------
# [TEST Episode 1600] [Average Reward 209.3]
# ----------
# [Episode 1610/4000] [Steps   28] [reward 29.0]
# [Episode 1620/4000] [Steps  127] [reward 128.0]
# ----------
# [TEST Episode 1625] [Average Reward 459.5]
# ----------
# [Episode 1630/4000] [Steps  270] [reward 271.0]
# [Episode 1640/4000] [Steps    8] [reward 9.0]
# [Episode 1650/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 1650] [Average Reward 344.9]
# ----------
# [Episode 1660/4000] [Steps   94] [reward 95.0]
# [Episode 1670/4000] [Steps  203] [reward 204.0]
# ----------
# [TEST Episode 1675] [Average Reward 112.1]
# ----------
# [Episode 1680/4000] [Steps  129] [reward 130.0]
# [Episode 1690/4000] [Steps    8] [reward 9.0]
# [Episode 1700/4000] [Steps  211] [reward 212.0]
# ----------
# [TEST Episode 1700] [Average Reward 210.4]
# ----------
# [Episode 1710/4000] [Steps  116] [reward 117.0]
# [Episode 1720/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 1725] [Average Reward 220.0]
# ----------
# [Episode 1730/4000] [Steps  193] [reward 194.0]
# [Episode 1740/4000] [Steps    8] [reward 9.0]
# [Episode 1750/4000] [Steps   10] [reward 11.0]
# ----------
# [TEST Episode 1750] [Average Reward 15.1]
# ----------
# [Episode 1760/4000] [Steps  166] [reward 167.0]
# [Episode 1770/4000] [Steps   55] [reward 56.0]
# ----------
# [TEST Episode 1775] [Average Reward 69.3]
# ----------
# [Episode 1780/4000] [Steps   14] [reward 15.0]
# [Episode 1790/4000] [Steps  184] [reward 185.0]
# [Episode 1800/4000] [Steps  200] [reward 201.0]
# ----------
# [TEST Episode 1800] [Average Reward 104.2]
# ----------
# [Episode 1810/4000] [Steps   44] [reward 45.0]
# [Episode 1820/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 1825] [Average Reward 256.7]
# ----------
# [Episode 1830/4000] [Steps  115] [reward 116.0]
# [Episode 1840/4000] [Steps   96] [reward 97.0]
# [Episode 1850/4000] [Steps   60] [reward 61.0]
# ----------
# [TEST Episode 1850] [Average Reward 176.4]
# ----------
# [Episode 1860/4000] [Steps  217] [reward 218.0]
# [Episode 1870/4000] [Steps  128] [reward 129.0]
# ----------
# [TEST Episode 1875] [Average Reward 9.3]
# ----------
# [Episode 1880/4000] [Steps   20] [reward 21.0]
# [Episode 1890/4000] [Steps   62] [reward 63.0]
# [Episode 1900/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 1900] [Average Reward 10.3]
# ----------
# [Episode 1910/4000] [Steps  193] [reward 194.0]
# [Episode 1920/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 1925] [Average Reward 112.2]
# ----------
# [Episode 1930/4000] [Steps  119] [reward 120.0]
# [Episode 1940/4000] [Steps   71] [reward 72.0]
# [Episode 1950/4000] [Steps  204] [reward 205.0]
# ----------
# [TEST Episode 1950] [Average Reward 215.0]
# ----------
# [Episode 1960/4000] [Steps   13] [reward 14.0]
# [Episode 1970/4000] [Steps  184] [reward 185.0]
# ----------
# [TEST Episode 1975] [Average Reward 235.1]
# ----------
# [Episode 1980/4000] [Steps  231] [reward 232.0]
# [Episode 1990/4000] [Steps   10] [reward 11.0]
# [Episode 2000/4000] [Steps    7] [reward 8.0]
# ----------
# [TEST Episode 2000] [Average Reward 258.7]
# ----------
# [Episode 2010/4000] [Steps   11] [reward 12.0]
# [Episode 2020/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 2025] [Average Reward 500.0]
# ----------
# [Episode 2030/4000] [Steps   41] [reward 42.0]
# [Episode 2040/4000] [Steps    8] [reward 9.0]
# [Episode 2050/4000] [Steps   42] [reward 43.0]
# ----------
# [TEST Episode 2050] [Average Reward 383.3]
# ----------
# [Episode 2060/4000] [Steps   29] [reward 30.0]
# [Episode 2070/4000] [Steps  117] [reward 118.0]
# ----------
# [TEST Episode 2075] [Average Reward 17.9]
# ----------
# [Episode 2080/4000] [Steps   88] [reward 89.0]
# [Episode 2090/4000] [Steps  178] [reward 179.0]
# [Episode 2100/4000] [Steps   13] [reward 14.0]
# ----------
# [TEST Episode 2100] [Average Reward 68.0]
# ----------
# [Episode 2110/4000] [Steps    7] [reward 8.0]
# [Episode 2120/4000] [Steps  195] [reward 196.0]
# ----------
# [TEST Episode 2125] [Average Reward 117.7]
# ----------
# [Episode 2130/4000] [Steps  124] [reward 125.0]
# [Episode 2140/4000] [Steps  245] [reward 246.0]
# [Episode 2150/4000] [Steps  499] [reward 500.0]
# ----------
# [TEST Episode 2150] [Average Reward 485.8]
# ----------
# [Episode 2160/4000] [Steps   11] [reward 12.0]
# [Episode 2170/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 2175] [Average Reward 166.7]
# ----------
# [Episode 2180/4000] [Steps   87] [reward 88.0]
# [Episode 2190/4000] [Steps   10] [reward 11.0]
# [Episode 2200/4000] [Steps  140] [reward 141.0]
# ----------
# [TEST Episode 2200] [Average Reward 245.4]
# ----------
# [Episode 2210/4000] [Steps   91] [reward 92.0]
# [Episode 2220/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 2225] [Average Reward 85.1]
# ----------
# [Episode 2230/4000] [Steps  196] [reward 197.0]
# [Episode 2240/4000] [Steps   11] [reward 12.0]
# [Episode 2250/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 2250] [Average Reward 9.2]
# ----------
# [Episode 2260/4000] [Steps   69] [reward 70.0]
# [Episode 2270/4000] [Steps   13] [reward 14.0]
# ----------
# [TEST Episode 2275] [Average Reward 62.1]
# ----------
# [Episode 2280/4000] [Steps   52] [reward 53.0]
# [Episode 2290/4000] [Steps   60] [reward 61.0]
# [Episode 2300/4000] [Steps   26] [reward 27.0]
# ----------
# [TEST Episode 2300] [Average Reward 53.8]
# ----------
# [Episode 2310/4000] [Steps   33] [reward 34.0]
# [Episode 2320/4000] [Steps  211] [reward 212.0]
# ----------
# [TEST Episode 2325] [Average Reward 179.9]
# ----------
# [Episode 2330/4000] [Steps   50] [reward 51.0]
# [Episode 2340/4000] [Steps  191] [reward 192.0]
# [Episode 2350/4000] [Steps   44] [reward 45.0]
# ----------
# [TEST Episode 2350] [Average Reward 127.3]
# ----------
# [Episode 2360/4000] [Steps    8] [reward 9.0]
# [Episode 2370/4000] [Steps  221] [reward 222.0]
# ----------
# [TEST Episode 2375] [Average Reward 91.0]
# ----------
# [Episode 2380/4000] [Steps  187] [reward 188.0]
# [Episode 2390/4000] [Steps  221] [reward 222.0]
# [Episode 2400/4000] [Steps  261] [reward 262.0]
# ----------
# [TEST Episode 2400] [Average Reward 55.9]
# ----------
# [Episode 2410/4000] [Steps  499] [reward 500.0]
# [Episode 2420/4000] [Steps   13] [reward 14.0]
# ----------
# [TEST Episode 2425] [Average Reward 11.0]
# ----------
# [Episode 2430/4000] [Steps   36] [reward 37.0]
# [Episode 2440/4000] [Steps   25] [reward 26.0]
# [Episode 2450/4000] [Steps  107] [reward 108.0]
# ----------
# [TEST Episode 2450] [Average Reward 500.0]
# ----------
# [Episode 2460/4000] [Steps  142] [reward 143.0]
# [Episode 2470/4000] [Steps  499] [reward 500.0]
# ----------
# [TEST Episode 2475] [Average Reward 500.0]
# ----------
# [Episode 2480/4000] [Steps   72] [reward 73.0]
# [Episode 2490/4000] [Steps  499] [reward 500.0]
# [Episode 2500/4000] [Steps  383] [reward 384.0]
# ----------
# [TEST Episode 2500] [Average Reward 319.4]
# ----------
# [Episode 2510/4000] [Steps  126] [reward 127.0]
# [Episode 2520/4000] [Steps  411] [reward 412.0]
# ----------
# [TEST Episode 2525] [Average Reward 85.8]
# ----------
# [Episode 2530/4000] [Steps   12] [reward 13.0]
# [Episode 2540/4000] [Steps    9] [reward 10.0]
# [Episode 2550/4000] [Steps   25] [reward 26.0]
# ----------
# [TEST Episode 2550] [Average Reward 314.8]
# ----------
# [Episode 2560/4000] [Steps  147] [reward 148.0]
# [Episode 2570/4000] [Steps  166] [reward 167.0]
# ----------
# [TEST Episode 2575] [Average Reward 76.4]
# ----------
# [Episode 2580/4000] [Steps  111] [reward 112.0]
# [Episode 2590/4000] [Steps   31] [reward 32.0]
# [Episode 2600/4000] [Steps  103] [reward 104.0]
# ----------
# [TEST Episode 2600] [Average Reward 96.8]
# ----------
# [Episode 2610/4000] [Steps  499] [reward 500.0]
# [Episode 2620/4000] [Steps  213] [reward 214.0]
# ----------
# [TEST Episode 2625] [Average Reward 87.1]
# ----------
# [Episode 2630/4000] [Steps  472] [reward 473.0]
# [Episode 2640/4000] [Steps  499] [reward 500.0]
# [Episode 2650/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 2650] [Average Reward 10.5]
# ----------
# [Episode 2660/4000] [Steps   10] [reward 11.0]
# [Episode 2670/4000] [Steps  150] [reward 151.0]
# ----------
# [TEST Episode 2675] [Average Reward 94.9]
# ----------
# [Episode 2680/4000] [Steps  171] [reward 172.0]
# [Episode 2690/4000] [Steps    8] [reward 9.0]
# [Episode 2700/4000] [Steps   91] [reward 92.0]
# ----------
# [TEST Episode 2700] [Average Reward 77.7]
# ----------
# [Episode 2710/4000] [Steps   19] [reward 20.0]
# [Episode 2720/4000] [Steps   88] [reward 89.0]
# ----------
# [TEST Episode 2725] [Average Reward 500.0]
# ----------
# [Episode 2730/4000] [Steps   29] [reward 30.0]
# [Episode 2740/4000] [Steps   83] [reward 84.0]
# [Episode 2750/4000] [Steps    8] [reward 9.0]
# ----------
# [TEST Episode 2750] [Average Reward 10.0]
# ----------
# [Episode 2760/4000] [Steps   70] [reward 71.0]
# [Episode 2770/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 2775] [Average Reward 37.4]
# ----------
# [Episode 2780/4000] [Steps   49] [reward 50.0]
# [Episode 2790/4000] [Steps   76] [reward 77.0]
# [Episode 2800/4000] [Steps  147] [reward 148.0]
# ----------
# [TEST Episode 2800] [Average Reward 94.7]
# ----------
# [Episode 2810/4000] [Steps   46] [reward 47.0]
# [Episode 2820/4000] [Steps   59] [reward 60.0]
# ----------
# [TEST Episode 2825] [Average Reward 121.7]
# ----------
# [Episode 2830/4000] [Steps  130] [reward 131.0]
# [Episode 2840/4000] [Steps  140] [reward 141.0]
# [Episode 2850/4000] [Steps   19] [reward 20.0]
# ----------
# [TEST Episode 2850] [Average Reward 213.2]
# ----------
# [Episode 2860/4000] [Steps  256] [reward 257.0]
# [Episode 2870/4000] [Steps  358] [reward 359.0]
# ----------
# [TEST Episode 2875] [Average Reward 251.2]
# ----------
# [Episode 2880/4000] [Steps  154] [reward 155.0]
# [Episode 2890/4000] [Steps    9] [reward 10.0]
# [Episode 2900/4000] [Steps   35] [reward 36.0]
# ----------
# [TEST Episode 2900] [Average Reward 94.7]
# ----------
# [Episode 2910/4000] [Steps  368] [reward 369.0]
# [Episode 2920/4000] [Steps  166] [reward 167.0]
# ----------
# [TEST Episode 2925] [Average Reward 286.8]
# ----------
# [Episode 2930/4000] [Steps   16] [reward 17.0]
# [Episode 2940/4000] [Steps  130] [reward 131.0]
# [Episode 2950/4000] [Steps  102] [reward 103.0]
# ----------
# [TEST Episode 2950] [Average Reward 95.1]
# ----------
# [Episode 2960/4000] [Steps   11] [reward 12.0]
# [Episode 2970/4000] [Steps  136] [reward 137.0]
# ----------
# [TEST Episode 2975] [Average Reward 111.7]
# ----------
# [Episode 2980/4000] [Steps  105] [reward 106.0]
# [Episode 2990/4000] [Steps   13] [reward 14.0]
# [Episode 3000/4000] [Steps  128] [reward 129.0]
# ----------
# [TEST Episode 3000] [Average Reward 135.3]
# ----------
# [Episode 3010/4000] [Steps  112] [reward 113.0]
# [Episode 3020/4000] [Steps   33] [reward 34.0]
# ----------
# [TEST Episode 3025] [Average Reward 162.2]
# ----------
# [Episode 3030/4000] [Steps  499] [reward 500.0]
# [Episode 3040/4000] [Steps  175] [reward 176.0]
# [Episode 3050/4000] [Steps  374] [reward 375.0]
# ----------
# [TEST Episode 3050] [Average Reward 177.3]
# ----------
# [Episode 3060/4000] [Steps  229] [reward 230.0]
# [Episode 3070/4000] [Steps  150] [reward 151.0]
# ----------
# [TEST Episode 3075] [Average Reward 9.7]
# ----------
# [Episode 3080/4000] [Steps   16] [reward 17.0]
# [Episode 3090/4000] [Steps   34] [reward 35.0]
# [Episode 3100/4000] [Steps   49] [reward 50.0]
# ----------
# [TEST Episode 3100] [Average Reward 74.0]
# ----------
# [Episode 3110/4000] [Steps  225] [reward 226.0]
# [Episode 3120/4000] [Steps  115] [reward 116.0]
# ----------
# [TEST Episode 3125] [Average Reward 60.3]
# ----------
# [Episode 3130/4000] [Steps   57] [reward 58.0]
# [Episode 3140/4000] [Steps   99] [reward 100.0]
# [Episode 3150/4000] [Steps  101] [reward 102.0]
# ----------
# [TEST Episode 3150] [Average Reward 130.3]
# ----------
# [Episode 3160/4000] [Steps   12] [reward 13.0]
# [Episode 3170/4000] [Steps  173] [reward 174.0]
# ----------
# [TEST Episode 3175] [Average Reward 97.3]
# ----------
# [Episode 3180/4000] [Steps   81] [reward 82.0]
# [Episode 3190/4000] [Steps  249] [reward 250.0]
# [Episode 3200/4000] [Steps   92] [reward 93.0]
# ----------
# [TEST Episode 3200] [Average Reward 180.0]
# ----------
# [Episode 3210/4000] [Steps   81] [reward 82.0]
# [Episode 3220/4000] [Steps   40] [reward 41.0]
# ----------
# [TEST Episode 3225] [Average Reward 122.9]
# ----------
# [Episode 3230/4000] [Steps  127] [reward 128.0]
# [Episode 3240/4000] [Steps  127] [reward 128.0]
# [Episode 3250/4000] [Steps   52] [reward 53.0]
# ----------
# [TEST Episode 3250] [Average Reward 141.7]
# ----------
# [Episode 3260/4000] [Steps  251] [reward 252.0]
# [Episode 3270/4000] [Steps  182] [reward 183.0]
# ----------
# [TEST Episode 3275] [Average Reward 159.5]
# ----------
# [Episode 3280/4000] [Steps  171] [reward 172.0]
# [Episode 3290/4000] [Steps  478] [reward 479.0]
# [Episode 3300/4000] [Steps  499] [reward 500.0]
# ----------
# [TEST Episode 3300] [Average Reward 500.0]
# ----------
# [Episode 3310/4000] [Steps  179] [reward 180.0]
# [Episode 3320/4000] [Steps   11] [reward 12.0]
# ----------
# [TEST Episode 3325] [Average Reward 207.2]
# ----------
# [Episode 3330/4000] [Steps  499] [reward 500.0]
# [Episode 3340/4000] [Steps   92] [reward 93.0]
# [Episode 3350/4000] [Steps   38] [reward 39.0]
# ----------
# [TEST Episode 3350] [Average Reward 500.0]
# ----------
# [Episode 3360/4000] [Steps   41] [reward 42.0]
# [Episode 3370/4000] [Steps  218] [reward 219.0]
# ----------
# [TEST Episode 3375] [Average Reward 160.9]
# ----------
# [Episode 3380/4000] [Steps  202] [reward 203.0]
# [Episode 3390/4000] [Steps  499] [reward 500.0]
# [Episode 3400/4000] [Steps  197] [reward 198.0]
# ----------
# [TEST Episode 3400] [Average Reward 98.8]
# ----------
# [Episode 3410/4000] [Steps   55] [reward 56.0]
# [Episode 3420/4000] [Steps  499] [reward 500.0]
# ----------
# [TEST Episode 3425] [Average Reward 124.9]
# ----------
# [Episode 3430/4000] [Steps  117] [reward 118.0]
# [Episode 3440/4000] [Steps   10] [reward 11.0]
# [Episode 3450/4000] [Steps   91] [reward 92.0]
# ----------
# [TEST Episode 3450] [Average Reward 115.5]
# ----------
# [Episode 3460/4000] [Steps   79] [reward 80.0]
# [Episode 3470/4000] [Steps   28] [reward 29.0]
# ----------
# [TEST Episode 3475] [Average Reward 119.9]
# ----------
# [Episode 3480/4000] [Steps  129] [reward 130.0]
# [Episode 3490/4000] [Steps  100] [reward 101.0]
# [Episode 3500/4000] [Steps  139] [reward 140.0]
# ----------
# [TEST Episode 3500] [Average Reward 115.8]
# ----------
# [Episode 3510/4000] [Steps   76] [reward 77.0]
# [Episode 3520/4000] [Steps    9] [reward 10.0]
# ----------
# [TEST Episode 3525] [Average Reward 152.6]
# ----------
# [Episode 3530/4000] [Steps  226] [reward 227.0]
# [Episode 3540/4000] [Steps   32] [reward 33.0]
# [Episode 3550/4000] [Steps   87] [reward 88.0]
# ----------
# [TEST Episode 3550] [Average Reward 75.5]
# ----------
# [Episode 3560/4000] [Steps   17] [reward 18.0]
# [Episode 3570/4000] [Steps  156] [reward 157.0]
# ----------
# [TEST Episode 3575] [Average Reward 377.8]
# ----------
# [Episode 3580/4000] [Steps  241] [reward 242.0]
# [Episode 3590/4000] [Steps  142] [reward 143.0]
# [Episode 3600/4000] [Steps   45] [reward 46.0]
# ----------
# [TEST Episode 3600] [Average Reward 163.3]
# ----------
# [Episode 3610/4000] [Steps  140] [reward 141.0]
# [Episode 3620/4000] [Steps  102] [reward 103.0]
# ----------
# [TEST Episode 3625] [Average Reward 112.0]
# ----------
# [Episode 3630/4000] [Steps   86] [reward 87.0]
# [Episode 3640/4000] [Steps   82] [reward 83.0]
# [Episode 3650/4000] [Steps   12] [reward 13.0]
# ----------
# [TEST Episode 3650] [Average Reward 89.6]
# ----------
# [Episode 3660/4000] [Steps   89] [reward 90.0]
# [Episode 3670/4000] [Steps   20] [reward 21.0]
# ----------
# [TEST Episode 3675] [Average Reward 500.0]
# ----------
# [Episode 3680/4000] [Steps   10] [reward 11.0]
# [Episode 3690/4000] [Steps  118] [reward 119.0]
# [Episode 3700/4000] [Steps   96] [reward 97.0]
# ----------
# [TEST Episode 3700] [Average Reward 104.3]
# ----------
# [Episode 3710/4000] [Steps  128] [reward 129.0]
# [Episode 3720/4000] [Steps  108] [reward 109.0]
# ----------
# [TEST Episode 3725] [Average Reward 125.4]
# ----------
# [Episode 3730/4000] [Steps  116] [reward 117.0]
# [Episode 3740/4000] [Steps   10] [reward 11.0]
# [Episode 3750/4000] [Steps   99] [reward 100.0]
# ----------
# [TEST Episode 3750] [Average Reward 92.6]
# ----------
# [Episode 3760/4000] [Steps   74] [reward 75.0]
# [Episode 3770/4000] [Steps  120] [reward 121.0]
# ----------
# [TEST Episode 3775] [Average Reward 18.2]
# ----------
# [Episode 3780/4000] [Steps  499] [reward 500.0]
# [Episode 3790/4000] [Steps   86] [reward 87.0]
# [Episode 3800/4000] [Steps   16] [reward 17.0]
# ----------
# [TEST Episode 3800] [Average Reward 51.4]
# ----------
# [Episode 3810/4000] [Steps   87] [reward 88.0]
# [Episode 3820/4000] [Steps   62] [reward 63.0]
# ----------
# [TEST Episode 3825] [Average Reward 63.4]
# ----------
# [Episode 3830/4000] [Steps   12] [reward 13.0]
# [Episode 3840/4000] [Steps  126] [reward 127.0]
# [Episode 3850/4000] [Steps   42] [reward 43.0]
# ----------
# [TEST Episode 3850] [Average Reward 55.5]
# ----------
# [Episode 3860/4000] [Steps  154] [reward 155.0]
# [Episode 3870/4000] [Steps   63] [reward 64.0]
# ----------
# [TEST Episode 3875] [Average Reward 128.0]
# ----------
# [Episode 3880/4000] [Steps  114] [reward 115.0]
# [Episode 3890/4000] [Steps   90] [reward 91.0]
# [Episode 3900/4000] [Steps  163] [reward 164.0]
# ----------
# [TEST Episode 3900] [Average Reward 142.5]
# ----------
# [Episode 3910/4000] [Steps   18] [reward 19.0]
# [Episode 3920/4000] [Steps  111] [reward 112.0]
# ----------
# [TEST Episode 3925] [Average Reward 10.9]
# ----------
# [Episode 3930/4000] [Steps    7] [reward 8.0]
# [Episode 3940/4000] [Steps   91] [reward 92.0]
# [Episode 3950/4000] [Steps   68] [reward 69.0]
# ----------
# [TEST Episode 3950] [Average Reward 81.2]
# ----------
# [Episode 3960/4000] [Steps    7] [reward 8.0]
# [Episode 3970/4000] [Steps   77] [reward 78.0]
# ----------
# [TEST Episode 3975] [Average Reward 127.1]
# ----------
# [Episode 3980/4000] [Steps  248] [reward 249.0]
# [Episode 3990/4000] [Steps   14] [reward 15.0]
# [Episode 4000/4000] [Steps   61] [reward 62.0]
# ----------
# [TEST Episode 4000] [Average Reward 41.9]
# ----------

How does the replay buffer improve performance?

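**Answer:** The replay buffer improved learning efficiency: it took only ~725 episodes to reach a good reward. Sampling minibatches from stored transitions lets each experience be reused for several updates and reduces the correlation between consecutive training samples.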
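
For reference, a replay buffer of the kind discussed here can be sketched in a few lines. This is a minimal illustration, not the assignment's own implementation: the names `Transition`, `ReplayBuffer`, `push`, and `sample` are hypothetical, and the capacity and batch size are arbitrary.

```python
import random
from collections import deque, namedtuple

# Hypothetical container for one environment step (names are illustrative).
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        """Store one transition."""
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw batch_size distinct transitions uniformly at random."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


# Usage: fill the buffer past capacity, then draw a minibatch.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(step, 0, 1.0, step + 1, False)
minibatch = buffer.sample(2)
```

Because sampling is uniform over the stored transitions, a minibatch mixes experience from many different episodes, which is what decorrelates the updates.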

## 3. Extra (fully optional)

Ideas to experiment with:

- Is the $\epsilon$-greedy strategy the best exploration strategy available? Experiment with other strategies.
- Make use of the model you have trained in the behavioral cloning part and fine-tune it with RL. How does that affect performance?
- You are perhaps bored with `CartPole-v1` by now. Another environment we suggest trying is `LunarLander-v2`. It will be harder to learn, but with experimentation you will find the right optimizations for success. Piazza is also your friend :)
- What about learning from images? This requires more work because you have to extract the image from the environment. How much more challenging might you expect the learning to be in this case?
- An improvement over DQN is Double DQN. Experiment with it to see how much of an impact it makes.
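
As a starting point for two of the ideas above, here is a hedged NumPy sketch of (a) Boltzmann (softmax) exploration as one alternative to $\epsilon$-greedy and (b) the Double DQN target next to the standard DQN target. The function names, the `temperature` parameter, and the array layout (`[batch, num_actions]` Q-value arrays from an online and a target network) are assumptions made for illustration; they are not taken from the assignment code.

```python
import numpy as np


def boltzmann_action(q_values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))


def dqn_targets(rewards, dones, next_q_target, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    max_next = next_q_target.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next


def double_dqn_targets(rewards, dones, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    best_actions = next_q_online.argmax(axis=1)
    batch_index = np.arange(len(best_actions))
    evaluated = next_q_target[batch_index, best_actions]
    return rewards + gamma * (1.0 - dones) * evaluated


# Usage with made-up numbers: two transitions, two actions each.
rewards = np.array([1.0, 1.0])
dones = np.array([0.0, 1.0])  # the second transition is terminal
next_q_online = np.array([[1.0, 2.0], [3.0, 0.0]])
next_q_target = np.array([[5.0, 0.5], [7.0, 8.0]])

y_dqn = dqn_targets(rewards, dones, next_q_target, gamma=0.9)
y_double = double_dqn_targets(rewards, dones, next_q_online, next_q_target, gamma=0.9)
action = boltzmann_action([0.0, 10.0], temperature=0.01)
```

Decoupling action selection from action evaluation is what lets Double DQN reduce the overestimation caused by taking a max over noisy Q-value estimates. For Boltzmann exploration, a high temperature approaches uniform random actions and a low temperature approaches greedy selection.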



In [70]:
# YOU CAN USE THIS CODEBLOCK AND ADD ANY BLOCK BELOW AS YOU NEED
# TO SHOW US THE IDEAS AND EXTRA EXPERIMENTS YOU RUN.
# HAVE FUN!