# DQN with `dm_control`

Here are some experiments using deep Q-Learning to solve simple continuous control tasks. I implemented [the original DQN](https://www.nature.com/articles/nature14236) and used it to solve several tasks in the [DeepMind Control Suite](https://arxiv.org/abs/1801.00690). The key innovations from DQN are:

1. Maintain a replay buffer of experiences from which minibatches are randomly drawn during training. This decreases correlations in the training data, thereby reducing variance in the updates.
2. Keep an additional Q network for calculating targets that is an 'outdated' version of the main Q network. Every `q_update_interval` updates, the weights are copied from the main Q network to the target Q network. Training is more stable because the targets change less frequently.
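
The two ideas above can be sketched in a few lines. This is a minimal illustration, not the code from the repository: `ReplayBuffer` and `maybe_sync_target` are hypothetical names, and weights are assumed to be stored as a list of NumPy arrays.

```python
import numpy as np

class ReplayBuffer:
    """Fixed-size ring buffer of transitions (hypothetical helper)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.storage = []
        self.position = 0  # index of the next slot to write
        self.rng = np.random.default_rng(seed)

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.position] = transition  # overwrite the oldest entry
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # uniform sampling with replacement breaks up temporal correlations
        idx = self.rng.integers(len(self.storage), size=batch_size)
        return [self.storage[i] for i in idx]


def maybe_sync_target(main_weights, target_weights, n_updates, q_update_interval):
    """Return copies of the main weights every `q_update_interval` updates,
    otherwise leave the target weights unchanged."""
    if n_updates % q_update_interval == 0:
        return [w.copy() for w in main_weights]
    return target_weights
```

Sampling with replacement is one simple choice here; sampling without replacement within a minibatch would work equally well for this sketch.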

Q-Learning takes the max over actions, which cannot be computed by enumeration when the action space is continuous. In this implementation the action space is therefore discretized: each action dimension can take a value in `linspace(action_min, action_max, action_grid)`, where `action_grid=2` for this demo (so each dimension takes only its minimum or maximum value). The full action space is the Cartesian product of the per-dimension vectors.
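
As a rough sketch of this discretization (the function name `discretize_actions` is hypothetical, and the bounds are assumed to be given per dimension), the discrete action set can be built with `np.linspace` and `itertools.product`:

```python
import itertools
import numpy as np

def discretize_actions(action_min, action_max, action_grid=2):
    """Return an array of shape (action_grid ** n_dims, n_dims) holding every
    combination of the per-dimension grid values."""
    action_min = np.atleast_1d(action_min)
    action_max = np.atleast_1d(action_max)
    # one evenly spaced vector per action dimension
    per_dim = [np.linspace(lo, hi, action_grid)
               for lo, hi in zip(action_min, action_max)]
    # Cartesian product across dimensions
    return np.array(list(itertools.product(*per_dim)))

# two action dimensions bounded in [-1, 1], two grid points each
actions = discretize_actions([-1.0, -1.0], [1.0, 1.0], action_grid=2)
```

The number of discrete actions grows as `action_grid ** n_dims`, so this approach is only practical for low-dimensional action spaces or coarse grids.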

The demo is organized as follows:
1. **setup**
2. **solving tasks:** cartpole (balance+swingup), ball in cup, pendulum
3. **double DQN:** I implement [Double Q-Learning](https://arxiv.org/abs/1509.06461) and test whether it increases the accuracy of action-value estimates.
4. **encouraging exploration:** I use a simple trick to encourage optimism in the face of uncertainty: I pretrain the network to output optimistic action-values across the state space, which promotes exploration in the early phases of learning.
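
For reference, the difference between the standard DQN target and the Double Q-Learning target mentioned in item 3 can be sketched as follows. This is an illustration under assumptions, not the repository's code: the function names are hypothetical, and the Q-values for the next states are assumed to be precomputed arrays of shape `(batch, n_actions)`.

```python
import numpy as np

def dqn_targets(rewards, dones, q_next_target, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates
    the next action."""
    return rewards + gamma * (1 - dones) * q_next_target.max(axis=1)

def double_dqn_targets(rewards, dones, q_next_main, q_next_target, gamma=0.99):
    """Double DQN: the main network selects the next action and the
    target network evaluates it."""
    best = q_next_main.argmax(axis=1)
    evaluated = q_next_target[np.arange(len(best)), best]
    return rewards + gamma * (1 - dones) * evaluated
```

Because the evaluated value can never exceed the row-wise maximum of `q_next_target`, the Double DQN target is never larger than the standard target for the same inputs, which is the mechanism intended to reduce overestimation.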

# setup


The heart of the algorithm can be found in:
- [`train_utils.train`](https://github.com/richard-warren/rl_sandbox/blob/e56c44d74ddd47cbd6c2dc37753ba95896f9b81d/dm_control_tests/train_utils.py#L87), which trains an [`Agent`](https://github.com/richard-warren/rl_sandbox/blob/e56c44d74ddd47cbd6c2dc37753ba95896f9b81d/dm_control_tests/agents.py#L18) in a given `dm_control` environment.
- [`Agent.update`](https://github.com/richard-warren/rl_sandbox/blob/e56c44d74ddd47cbd6c2dc37753ba95896f9b81d/dm_control_tests/agents.py#L67), which selects minibatches and performs network updates.

To increase training speed I found it helpful to:
- *Train on the CPU rather than GPU.* The Q network is very small. My CPU was faster than the GPU unless batch sizes were really large.
- *Perform forward passes in NumPy.* Network forward passes ended up being much faster with NumPy than with TensorFlow (again, unless batch sizes were really large).
- *Train multiple agents in parallel.* Training results can be somewhat idiosyncratic even with the same hyperparameters, so I train 12 agents in parallel to make sure the results are robust.
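
To illustrate the NumPy forward pass idea, here is a minimal sketch. It assumes a fully connected network with ReLU hidden layers and a linear output, with weights stored as a flat list `[W0, b0, W1, b1, ...]`; the repository's actual architecture and function names may differ, and `numpy_forward` is a hypothetical name.

```python
import numpy as np

def numpy_forward(weights, states):
    """Forward pass of a small MLP using only NumPy.

    weights: list [W0, b0, W1, b1, ...] of kernel and bias arrays
    states:  array of shape (batch, n_inputs)
    """
    x = np.asarray(states, dtype=np.float32)
    n_layers = len(weights) // 2
    for i in range(n_layers):
        W, b = weights[2 * i], weights[2 * i + 1]
        x = x @ W + b
        if i < n_layers - 1:
            x = np.maximum(x, 0)  # ReLU on hidden layers only
    return x
```

For a single state or a small batch, a pass like this avoids the per-call overhead of a deep learning framework, which is consistent with the speedup described above for small batch sizes.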

Below are some utility functions for plotting performance and showing rollouts for trained agents.


In [None]:
%load_ext autoreload
%autoreload 2

from dm_control_tests import train_utils, plot_utils
from dm_control_tests.agents import Agent
import matplotlib.pyplot as plt
from dm_control import suite
from tqdm.auto import tqdm
import tensorflow as tf
import seaborn as sns
import numpy as np
import pickle
import os


tf.get_logger().setLevel('ERROR')
train_utils.disable_gpu()

# plot return over training; for 2D data (agents x episodes) show each
# agent's curve plus the mean +/- std across agents
def plot_performance(x, data):
    ax = plt.axes(xlabel='episode', ylabel='return',
                  xlim=(x[0],x[-1]), ylim=(0,1000))
    data = np.array(data)
    if data.ndim==2:
        mean = data.mean(0)
        std = data.std(0)
        ax.plot(x, data.T, color=(0,0,0), alpha=.15)
        ax.plot(x, mean)
        ax.fill_between(x, mean+std, mean-std, alpha=.15)
    else:
        ax.plot(x, data)

# show a rollout for the agent with the best performance at the end of training
def show_best_agent_rollout(agents_dir, framerate=30, epsilon=.05):
    with open(os.path.join(agents_dir, 'training_data'), 'rb') as file:
        training_data = pickle.load(file)
    best_agent = np.argmax(training_data['avg_returns'][:,-1])
    agent, metadata = train_utils.load_agent(
        os.path.join(agents_dir, 'agent{:03d}'.format(best_agent)))
    env = suite.load(*metadata['domain_and_task'])
    return plot_utils.show_rollout_jupyter(agent, env, epsilon=epsilon, framerate=framerate)
