# dm_control experiments

We are going to try out DQN on dm_control. This is not an ideal fit because dm_control tasks have continuous actions, but we can discretize the action space and restrict ourselves to tasks with action dimensionality < 2. Variants to try:
- vanilla DQN
- double DQN, to reduce Q-value overestimation bias
- 'optimistic' DQN, to encourage exploration
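
To make the difference between the first two variants concrete, here is a minimal numpy sketch of the two bootstrap targets. It is not taken from `Agents`; the function names and the toy numbers are made up for illustration, and it assumes the Q values for the next state are already available as arrays of shape (batch, num_actions).

```python
import numpy as np

def dqn_target(reward, done, q_target_next, gamma=0.99):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    return reward + gamma * (1.0 - done) * q_target_next.max(axis=1)

def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    best = q_online_next.argmax(axis=1)
    chosen = q_target_next[np.arange(len(best)), best]
    return reward + gamma * (1.0 - done) * chosen

# toy batch of two transitions (hypothetical values), the second one is terminal
reward = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])
q_online_next = np.array([[0.2, 0.5], [0.3, 0.1]])
q_target_next = np.array([[1.0, 0.4], [0.0, 2.0]])

y_vanilla = dqn_target(reward, done, q_target_next)
y_double = double_dqn_target(reward, done, q_online_next, q_target_next)
print(y_vanilla, y_double)
```

Because the double DQN target evaluates an action chosen by a different network, it can never exceed the vanilla target computed from the same `q_target_next`, which is where the bias reduction comes from.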

# dqn

DQN uses a replay buffer to reduce correlations between training samples and a target network to stabilize updates.
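
A rough sketch of those two pieces, independent of the `Agents` module used below. `ReplayBuffer` and `sync_target` are hypothetical names, and weights are represented as plain lists of numpy arrays rather than a real network.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size buffer; the oldest transitions are dropped once it is full."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling without replacement breaks up temporal correlations
        idx = random.sample(range(len(self.storage)), batch_size)
        batch = [self.storage[i] for i in idx]
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.storage)

def sync_target(online_weights, target_weights, update_count, interval=100):
    """Copy online weights into the target only every `interval` updates."""
    if update_count % interval == 0:
        return [w.copy() for w in online_weights]
    return target_weights

random.seed(0)
buffer = ReplayBuffer(capacity=5)
for t in range(10):
    buffer.add(state=[t, t + 1], action=t % 2, reward=1.0,
               next_state=[t + 1, t + 2], done=False)
states, actions, rewards, next_states, dones = buffer.sample(batch_size=3)

online = [np.ones((2, 2))]
target = [np.zeros((2, 2))]
target = sync_target(online, target, update_count=100, interval=100)
```

Holding the target weights fixed between syncs keeps the regression target from moving on every gradient step.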

In [82]:
#@title setup

%load_ext autoreload
%autoreload 2

import numpy as np
from dm_control import suite
from tqdm import tqdm
import Agents, utils
import tensorflow as tf
import matplotlib.pyplot as plt
import random

# disable GPUs for tensorflow (CPU is faster for small networks/batches on my machine)
tf.config.set_visible_devices([], 'GPU')
visible_devices = tf.config.get_visible_devices()
for device in visible_devices:
    assert device.device_type != 'GPU'

# plot mean with error bars
def plot_perforamce(ax, x, data):
    mean = np.array(data).T.mean(0)
    std = np.array(data).T.std(0)
    ax.plot(x, mean)
    ax.fill_between(x, mean+std, mean-std, alpha=.15)

# reset random seeds
def rand_seed_reset(env, i):
    random.seed(i)
    np.random.seed(i)
    tf.random.set_seed(i)
    env.task.random.seed(i)




details:
- actions are discretized like this...
- the Q network is an MLP
- the agent is trained for `episodes` episodes, adding experience to a replay buffer of size `buffer_length` at each step
- the time resolution is high, so each action is repeated 4 times (`action_repeats`) (cite) and Q is updated every `steps_per_update` steps
- `q_target` is only updated every 100 Q updates (`q_update_interval`)
- epsilon is linearly annealed from `epsilon_start` to `epsilon_final` over `epsilon_final_episode` episodes
- the replay buffer is initialized before training starts
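
The discretization itself is not spelled out above, so here is one plausible version, not necessarily what `Agents.Agent` does. It assumes an evenly spaced grid of `action_grid` values per action dimension between the action bounds (with dm_control these would come from `env.action_spec().minimum` and `.maximum`); `make_action_table` is a hypothetical helper.

```python
import itertools

import numpy as np

def make_action_table(minimum, maximum, action_grid):
    """Return an array of shape (action_grid ** dims, dims) of continuous actions."""
    minimum = np.atleast_1d(minimum)
    maximum = np.atleast_1d(maximum)
    # evenly spaced values per dimension, then every combination across dimensions
    per_dim = [np.linspace(lo, hi, action_grid) for lo, hi in zip(minimum, maximum)]
    return np.array(list(itertools.product(*per_dim)))

# one action dimension bounded in [-1, 1] with action_grid = 2
table = make_action_table([-1.0], [1.0], action_grid=2)

# the Q network then outputs one value per row, and the chosen index maps back to a continuous action
index = 1
action = table[index]
print(table, action)
```

The table grows as `action_grid ** dims`, which is why this only makes sense for tasks with very few action dimensions.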

In [114]:
# @title hyperparameters 

# agent
agent_args = dict(
    action_grid = 2,            # number of discrete actions per action dimension
    units_per_layer = (12,24),  # hidden units per layer
    buffer_length = 50000,
    q_update_interval = 100,    # q updates per q_target update
    learning_rate = .001,       # learning rate (adam optimizer)       
)

# training
episodes = 200                  # defined outside dict() so it can be reused below
train_args = dict(
    episodes = episodes,
    batch_size = 64,
    action_repeats = 4,         # repeat each action this number of times during training
    steps_per_update = 8,       # environment steps before updating q
    gamma = .99,
    epsilon_start = 1,
    epsilon_final = .1,
    epsilon_final_episode = episodes*.5,   # episode at which epsilon_final is reached
)
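
For reference, the three epsilon settings describe a linear schedule. A minimal sketch, assuming epsilon is updated once per episode and held at `epsilon_final` afterwards (`linear_epsilon` is a hypothetical helper, not the function in `utils`):

```python
def linear_epsilon(episode, epsilon_start=1.0, epsilon_final=0.1,
                   epsilon_final_episode=100):
    """Linearly interpolate epsilon, then hold it at epsilon_final."""
    frac = min(episode / epsilon_final_episode, 1.0)
    return epsilon_start + frac * (epsilon_final - epsilon_start)

# epsilon at the start, halfway, at the final episode, and beyond it
schedule = [linear_epsilon(e) for e in (0, 50, 100, 150)]
print(schedule)
```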

## pendulum

- the state space is only R^2 and the action space is R^1, so try this one first to make sure everything is in order
- explain evaluation metrics... evaluated every n episodes, average of m episodes with epsilon z


In [None]:
# make environment and agent
env = suite.load('cartpole', 'swingup')
rand_seed_reset(env, 0)
agent = Agents.Agent(env.observation_spec(), env.action_spec(), **agent_args)
utils.initialize_buffer(agent, env)

# train
episode_num, returns = utils.train(agent, env, **train_args,
                                   eval_interval=10, verbose=True)

# plot performance over training
ax = plt.axes(xlabel='episode', ylabel='return', ylim=(0,1000))
plot_perforamce(ax, episode_num, returns)


initializing replay buffer...




training agent...



iteration 10, avg return 140.79, epsilon 0.92, returns: [148, 133, 157, 141, 139, 143, 137, 139, 134, 132]


In [None]:
utils.show_rollout_jupyter(agent, env, epsilon=.05)