## Advantage actor-critic in AgentNet (5 pts)

Once we're done with REINFORCE, it's time to proceed with something more sophisticated.
Next in line is advantage actor-critic, in which the agent learns both a policy and a value function and uses the latter to speed up learning.


In [1]:
%env THEANO_FLAGS='floatX=float32'
import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY"))==0:
    !bash ../xvfb start
    %env DISPLAY=:1
        
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


env: THEANO_FLAGS='floatX=float32'




In [2]:
import gym

env = gym.make("MountainCar-v0")
obs = env.reset()
state_size = len(obs)
n_actions = env.action_space.n
print(obs)

[2017-03-15 03:02:34,564] Making new env: MountainCar-v0


[-0.40807951  0.        ]


# Basic agent setup
Here we define a simple agent that maps observations to action probabilities and a state value using a shallow neural network.


In [3]:
import lasagne
from lasagne.layers import InputLayer,DenseLayer,NonlinearityLayer,batch_norm,dropout
#observation at current tick goes here, shape = (batch_i, state_size)
observation_layer = InputLayer((None,state_size))

nn = DenseLayer(observation_layer,256)#<your architecture>
nn = DenseLayer(nn,256)#<your architecture>

Couldn't import dot_parser, loading of dot files will not be possible.


In [4]:
#layers that predict action probabilities (policy) and state values (V)

policy_layer = DenseLayer(nn,n_actions,nonlinearity=lasagne.nonlinearities.softmax)#<estimate probabilities of actions given prev layer. Mind the nonlinearity!>


V_layer = DenseLayer(nn,1,nonlinearity=None)#<estimate state values (1 unit layer). Mind nonlinearity too.>
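
The softmax nonlinearity on `policy_layer` is what turns raw network outputs into a valid probability distribution, while `V_layer` stays linear because state values are unbounded (in MountainCar they are negative). A minimal pure-Python sketch of the softmax step follows; the function name and the sample scores are illustrative and not part of lasagne.

```python
import math

def softmax(logits):
    """Map raw scores to probabilities that are positive and sum to 1."""
    # Subtracting the max does not change the result but avoids overflow in exp().
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs for the 3 MountainCar actions.
probs = softmax([2.0, 0.5, -1.0])
print(probs, sum(probs))
```

Larger scores get larger probabilities, and the ordering of the scores is preserved.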

In [5]:
#To pick actions, we sample them from the policy probabilities with a probabilistic resolver
from agentnet.resolver import ProbabilisticResolver
action_layer = ProbabilisticResolver(policy_layer,
                                     name="probabilistic action picker",
                                     assume_normalized=True)
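
`ProbabilisticResolver` draws each action at random according to the probabilities from `policy_layer`, so exploration comes from the policy itself. A rough standard-library sketch of that sampling step is given below; the helper name and the example probabilities are assumptions for illustration, not AgentNet code.

```python
import random

def sample_action(probs, rng):
    """Draw one action index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)          # fixed seed so the sketch is reproducible
probs = [0.7, 0.2, 0.1]         # hypothetical policy output for one state
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_action(probs, rng)] += 1

# Empirical frequencies should be close to probs.
print([c / 10000.0 for c in counts])
```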

##### Finally, agent
We declare that this network is an MDP agent with such and such inputs, states and outputs

In [6]:
from agentnet.agent import Agent
#all together
agent = Agent(observation_layers=observation_layer,
              policy_estimators=(policy_layer,V_layer),
              action_layers=action_layer)


In [7]:
#Since it's a single lasagne network, one can get its weights, output, etc
weights = lasagne.layers.get_all_params((action_layer,V_layer),trainable=True)
weights

[W, b, W, b, W, b, W, b]

# Create and manage a pool of game sessions to play with

* To make training more stable, we run an entire batch of game sessions, each independent of the others
* Why several parallel agents help training: http://arxiv.org/pdf/1602.01783v1.pdf
* Alternative approach: store more sessions: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

In [8]:
from agentnet.experiments.openai_gym.pool import EnvPool

#create a small pool with 10 parallel agents
pool = EnvPool(agent,"MountainCar-v0", n_games=10,max_size=1000) 

#we assume that pool size 1000 is small enough to learn "almost on policy" :)

[2017-03-15 03:02:35,582] Making new env: MountainCar-v0
[2017-03-15 03:02:35,585] Making new env: MountainCar-v0
[2017-03-15 03:02:35,587] Making new env: MountainCar-v0
[2017-03-15 03:02:35,589] Making new env: MountainCar-v0
[2017-03-15 03:02:35,591] Making new env: MountainCar-v0
[2017-03-15 03:02:35,593] Making new env: MountainCar-v0
[2017-03-15 03:02:35,595] Making new env: MountainCar-v0
[2017-03-15 03:02:35,598] Making new env: MountainCar-v0
[2017-03-15 03:02:35,600] Making new env: MountainCar-v0
[2017-03-15 03:02:35,602] Making new env: MountainCar-v0


In [9]:
%%time
#interact for 7 ticks
_,action_log,reward_log,_,_,_  = pool.interact(7)


print(action_log[:3])
print(reward_log[:3])

[[2 0 1 2 1 0 0]
 [0 2 0 2 0 2 1]
 [0 1 2 1 0 1 0]]
[[-1. -1. -1. -1. -1. -1.  0.]
 [-1. -1. -1. -1. -1. -1.  0.]
 [-1. -1. -1. -1. -1. -1.  0.]]
CPU times: user 7.16 ms, sys: 0 ns, total: 7.16 ms
Wall time: 6.97 ms


In [10]:
SEQ_LENGTH = 10
#load first sessions (this function calls interact and remembers sessions)
pool.update(SEQ_LENGTH)

# Actor-critic loss

Here we define the objective function for actor-critic (one-step) RL.

* We regularize the policy with the expected inverse action probabilities (discouraging very small probabilities) to keep the objective numerically stable
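
As a rough illustration of what a one-step actor-critic objective looks like for a single transition: the critic is regressed onto the target $r + \gamma V(s')$, and the actor's log-probability of the chosen action is weighted by the advantage $r + \gamma V(s') - V(s)$, which is treated as a constant. The sketch below is a plain-Python rendering of that textbook form under stated assumptions. It does not reproduce the internals of `agentnet.learning.a2c`, and all names and numbers in it are hypothetical.

```python
import math

def one_step_a2c_terms(prob_action, v_s, v_next, reward, is_alive, gamma=0.99):
    """Return (advantage, actor_loss, critic_loss) for one transition.

    prob_action: pi(a|s) for the action actually taken
    v_s, v_next: critic estimates V(s) and V(s')
    is_alive:    1 if the next state s' is non-terminal, else 0 (assumed convention)
    """
    # Bootstrap from V(s') only if the episode continues.
    target = reward + gamma * v_next * is_alive
    advantage = target - v_s
    # Policy-gradient term: advantage acts as a fixed weight, not a variable.
    actor_loss = -math.log(prob_action) * advantage
    # Critic regresses V(s) onto the one-step target.
    critic_loss = (v_s - target) ** 2
    return advantage, actor_loss, critic_loss

adv, actor, critic = one_step_a2c_terms(
    prob_action=0.5, v_s=-10.0, v_next=-8.0, reward=-1.0, is_alive=1)
print(adv, actor, critic)
```

A positive advantage means the action turned out better than the critic expected, so minimizing the actor term raises its probability.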


In [11]:
#get agent's policy and state values obtained via experience replay
replay = pool.experience_replay

_,_,_,_,(policy_seq,V_seq) = agent.get_sessions(
    replay,
    session_length=SEQ_LENGTH,
    experience_replay=True,
)



In [12]:
from agentnet.learning import a2c                                                   


elwise_mse_loss = a2c.get_elementwise_objective(policy_seq,
                                                V_seq[:,:,0],
                                                replay.actions[0],
                                                replay.rewards,
                                                replay.is_alive,
                                                gamma_or_gammas=0.99,
                                                n_steps=1)

#compute mean over "alive" fragments
loss = elwise_mse_loss.sum() / replay.is_alive.sum()

In [13]:
from theano import tensor as T
reg_entropy = (1./(policy_seq)).sum(-1).mean()#<regularize agent with 1/pi. Higher entropy = smaller loss. >

loss += 0.01*reg_entropy
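
To see why the $1/\pi$ penalty pushes the policy away from near-deterministic outputs, compare it on a uniform and a peaked distribution. For a fixed number of actions, the sum of inverse probabilities is smallest for the uniform policy and blows up as any probability approaches zero. The snippet is a standalone sketch with made-up distributions; Shannon entropy is printed alongside only for comparison.

```python
import math

def inverse_prob_penalty(probs):
    """Sum of 1/pi(a) over actions: the per-state quantity averaged above."""
    return sum(1.0 / p for p in probs)

def entropy(probs):
    """Shannon entropy in nats, shown for comparison."""
    return -sum(p * math.log(p) for p in probs)

uniform = [1.0 / 3] * 3
peaked = [0.98, 0.01, 0.01]

for name, probs in [("uniform", uniform), ("peaked", peaked)]:
    print(name, inverse_prob_penalty(probs), entropy(probs))
```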

In [14]:
# Compute weight updates
updates = lasagne.updates.rmsprop(loss,weights,learning_rate=0.001)

In [15]:
import theano
train_step = theano.function([],loss,updates=updates)

# Demo run

In [16]:
#for MountainCar-v0 evaluation session is cropped to 200 ticks
untrained_reward = pool.evaluate(save_path="./records",record_video=True)

#video is in the ./records folder

[2017-03-15 03:02:44,437] Making new env: MountainCar-v0
[2017-03-15 03:02:44,441] Clearing 3 monitor files from previous run (because force=True was provided)
[2017-03-15 03:02:44,443] Starting new video recorder writing to /root/anet/Practical_RL/week6/records/openaigym.video.0.3558.video000000.mp4
[2017-03-15 03:03:11,264] Tried to pass invalid video frame, marking as broken: Your frame has shape (1, 1, 3), but the VideoRecorder is configured for shape (400, 600, 3).
[2017-03-15 03:03:11,362] Cleaning up paths for broken video recorder: path=/root/anet/Practical_RL/week6/records/openaigym.video.0.3558.video000000.mp4 metadata_path=/root/anet/Practical_RL/week6/records/openaigym.video.0.3558.video000000.meta.json
[2017-03-15 03:03:11,384] Finished writing results. You can upload them to the scoreboard via gym.upload('/root/anet/Practical_RL/week6/records')


Episode finished after 200 timesteps with reward=-200.0


# Training loop


In [None]:
#starting epoch
epoch_counter = 1

#full game rewards
rewards = {}

In [None]:
from tqdm import tqdm
#the loop may take eons to finish.
#consider interrupting early.

#moving average of the training loss (named loss_ma so it does not overwrite the symbolic `loss`)
loss_ma = 0
for i in tqdm(range(10000)):

    #train
    pool.update(SEQ_LENGTH,append=True)

    loss_ma = loss_ma*0.99 + train_step()*0.01

    if epoch_counter%100==0:
        #average reward per game tick in current experience replay pool
        pool_mean_reward = np.average(pool.experience_replay.rewards.get_value()[:,:-1],
                                      weights=1+pool.experience_replay.is_alive.get_value()[:,:-1])
        print("iter=%i\treward/step=%.5f\tloss ma=%.5f"%(epoch_counter,
                                                        pool_mean_reward,
                                                        loss_ma))

    ##record current learning progress and show learning curves
    if epoch_counter%500 ==0:
        n_games = 10
        rewards[epoch_counter] = pool.evaluate( record_video=False,n_games=n_games,
                                               verbose=False)
        print("Current score(mean over %i) = %.3f"%(n_games,np.mean(rewards[epoch_counter])))

    epoch_counter  +=1

    
# Time to drink some coffee!

  1%|          | 101/10000 [00:06<18:53,  8.73it/s]

iter=100	reward/step=-1.00000	loss ma=-0.12804


  2%|▏         | 201/10000 [00:19<19:20,  8.45it/s]

iter=200	reward/step=-1.00000	loss ma=0.50704


  3%|▎         | 301/10000 [00:31<19:14,  8.40it/s]

iter=300	reward/step=-0.99989	loss ma=0.92701


  4%|▍         | 401/10000 [00:44<19:55,  8.03it/s]

iter=400	reward/step=-0.99989	loss ma=1.17698


  5%|▍         | 499/10000 [00:56<19:06,  8.29it/s][2017-03-15 03:04:08,617] Making new env: MountainCar-v0
[2017-03-15 03:04:08,622] Clearing 3 monitor files from previous run (because force=True was provided)


iter=500	reward/step=-1.00000	loss ma=0.66480


[2017-03-15 03:04:09,115] Finished writing results. You can upload them to the scoreboard via gym.upload('/root/anet/Practical_RL/week6/records')
  5%|▌         | 501/10000 [00:57<41:01,  3.86it/s]

Current score(mean over 10) = -200.000


  6%|▌         | 601/10000 [01:10<20:56,  7.48it/s]

iter=600	reward/step=-1.00000	loss ma=0.31583


  7%|▋         | 701/10000 [01:23<20:24,  7.59it/s]

iter=700	reward/step=-1.00000	loss ma=0.17722


  8%|▊         | 801/10000 [01:36<20:48,  7.37it/s]

iter=800	reward/step=-1.00000	loss ma=0.12628


  9%|▉         | 901/10000 [01:48<19:57,  7.60it/s]

iter=900	reward/step=-1.00000	loss ma=0.10716


 10%|▉         | 999/10000 [02:00<21:48,  6.88it/s][2017-03-15 03:05:12,619] Making new env: MountainCar-v0
[2017-03-15 03:05:12,624] Clearing 2 monitor files from previous run (because force=True was provided)


iter=1000	reward/step=-1.00000	loss ma=0.09998


[2017-03-15 03:05:13,124] Finished writing results. You can upload them to the scoreboard via gym.upload('/root/anet/Practical_RL/week6/records')
 10%|█         | 1002/10000 [02:01<30:21,  4.94it/s]

Current score(mean over 10) = -200.000


 11%|█         | 1101/10000 [02:14<17:30,  8.47it/s]

iter=1100	reward/step=-1.00000	loss ma=0.09730


 12%|█▏        | 1201/10000 [02:26<18:16,  8.02it/s]

iter=1200	reward/step=-1.00000	loss ma=0.09622


 13%|█▎        | 1301/10000 [02:39<18:06,  8.01it/s]

iter=1300	reward/step=-0.99994	loss ma=0.18702


 14%|█▍        | 1401/10000 [02:51<17:13,  8.32it/s]

iter=1400	reward/step=-1.00000	loss ma=0.28617


 15%|█▍        | 1499/10000 [03:03<17:11,  8.24it/s][2017-03-15 03:06:15,442] Making new env: MountainCar-v0
[2017-03-15 03:06:15,448] Clearing 2 monitor files from previous run (because force=True was provided)


iter=1500	reward/step=-0.99994	loss ma=0.22260


[2017-03-15 03:06:15,972] Finished writing results. You can upload them to the scoreboard via gym.upload('/root/anet/Practical_RL/week6/records')
 15%|█▌        | 1501/10000 [03:04<35:01,  4.04it/s]

Current score(mean over 10) = -200.000


 16%|█▌        | 1601/10000 [03:16<16:18,  8.58it/s]

iter=1600	reward/step=-0.99983	loss ma=0.38983


 17%|█▋        | 1701/10000 [03:28<16:40,  8.29it/s]

iter=1700	reward/step=-0.99994	loss ma=0.32245


 18%|█▊        | 1801/10000 [03:41<16:37,  8.22it/s]

iter=1800	reward/step=-0.99989	loss ma=0.32253


 19%|█▉        | 1902/10000 [03:53<16:19,  8.27it/s]

iter=1900	reward/step=-0.99978	loss ma=0.35401


 20%|█▉        | 1999/10000 [04:06<15:58,  8.35it/s][2017-03-15 03:07:17,970] Making new env: MountainCar-v0
[2017-03-15 03:07:17,975] Clearing 2 monitor files from previous run (because force=True was provided)


iter=2000	reward/step=-0.99956	loss ma=0.43938


[2017-03-15 03:07:18,458] Finished writing results. You can upload them to the scoreboard via gym.upload('/root/anet/Practical_RL/week6/records')
 20%|██        | 2001/10000 [04:07<32:16,  4.13it/s]

Current score(mean over 10) = -200.000


 21%|██        | 2101/10000 [04:19<16:07,  8.16it/s]

iter=2100	reward/step=-0.99911	loss ma=1.23256


 22%|██▏       | 2201/10000 [04:32<16:38,  7.81it/s]

iter=2200	reward/step=-0.99878	loss ma=1.24214


 23%|██▎       | 2301/10000 [04:45<15:50,  8.10it/s]

iter=2300	reward/step=-0.99805	loss ma=1.15351


 24%|██▍       | 2401/10000 [04:57<14:40,  8.63it/s]

iter=2400	reward/step=-0.99749	loss ma=1.05558


 25%|██▍       | 2499/10000 [05:10<15:24,  8.12it/s][2017-03-15 03:08:21,756] Making new env: MountainCar-v0
[2017-03-15 03:08:21,760] Clearing 2 monitor files from previous run (because force=True was provided)


iter=2500	reward/step=-0.99738	loss ma=0.89084


[2017-03-15 03:08:22,280] Finished writing results. You can upload them to the scoreboard via gym.upload('/root/anet/Practical_RL/week6/records')
 25%|██▌       | 2501/10000 [05:10<30:13,  4.13it/s]

Current score(mean over 10) = -180.900


 26%|██▌       | 2601/10000 [05:23<15:54,  7.75it/s]

iter=2600	reward/step=-0.99727	loss ma=0.85026


 27%|██▋       | 2701/10000 [05:35<14:59,  8.11it/s]

iter=2700	reward/step=-0.99727	loss ma=0.83069


 28%|██▊       | 2801/10000 [05:48<16:18,  7.36it/s]

iter=2800	reward/step=-0.99733	loss ma=0.78665


 29%|██▉       | 2901/10000 [06:00<16:39,  7.10it/s]

iter=2900	reward/step=-0.99727	loss ma=0.77308


 30%|██▉       | 2999/10000 [06:13<14:52,  7.85it/s][2017-03-15 03:09:25,329] Making new env: MountainCar-v0
[2017-03-15 03:09:25,333] Clearing 2 monitor files from previous run (because force=True was provided)


iter=3000	reward/step=-0.99705	loss ma=0.80363


[2017-03-15 03:09:25,802] Finished writing results. You can upload them to the scoreboard via gym.upload('/root/anet/Practical_RL/week6/records')
 30%|███       | 3001/10000 [06:14<27:44,  4.21it/s]

Current score(mean over 10) = -178.500


 31%|███       | 3101/10000 [06:27<14:15,  8.06it/s]

iter=3100	reward/step=-0.99710	loss ma=0.76121


 32%|███▏      | 3201/10000 [06:39<15:28,  7.32it/s]

iter=3200	reward/step=-0.99727	loss ma=0.75326


 33%|███▎      | 3301/10000 [06:52<13:42,  8.14it/s]

iter=3300	reward/step=-0.99727	loss ma=0.74640


 34%|███▍      | 3401/10000 [07:05<13:22,  8.23it/s]

iter=3400	reward/step=-0.99705	loss ma=0.73848


 35%|███▍      | 3499/10000 [07:17<12:55,  8.38it/s][2017-03-15 03:10:29,243] Making new env: MountainCar-v0
[2017-03-15 03:10:29,247] Clearing 2 monitor files from previous run (because force=True was provided)


iter=3500	reward/step=-0.99716	loss ma=0.69179


[2017-03-15 03:10:29,637] Finished writing results. You can upload them to the scoreboard via gym.upload('/root/anet/Practical_RL/week6/records')
 35%|███▌      | 3501/10000 [07:18<21:49,  4.96it/s]

Current score(mean over 10) = -161.500


 36%|███▌      | 3601/10000 [07:31<14:17,  7.47it/s]

iter=3600	reward/step=-0.99710	loss ma=0.70496


 37%|███▋      | 3701/10000 [07:43<13:02,  8.05it/s]

iter=3700	reward/step=-0.99710	loss ma=0.62804


 38%|███▊      | 3801/10000 [07:55<12:53,  8.02it/s]

iter=3800	reward/step=-0.99716	loss ma=0.65986


 39%|███▉      | 3901/10000 [08:08<12:39,  8.03it/s]

iter=3900	reward/step=-0.99710	loss ma=0.60895


 40%|███▉      | 3999/10000 [08:21<12:14,  8.17it/s][2017-03-15 03:11:32,947] Making new env: MountainCar-v0
[2017-03-15 03:11:32,951] Clearing 2 monitor files from previous run (because force=True was provided)


iter=4000	reward/step=-0.99716	loss ma=0.58934


[2017-03-15 03:11:33,356] Finished writing results. You can upload them to the scoreboard via gym.upload('/root/anet/Practical_RL/week6/records')
 40%|████      | 4002/10000 [08:22<18:18,  5.46it/s]

Current score(mean over 10) = -157.500


 41%|████      | 4101/10000 [08:34<12:30,  7.86it/s]

iter=4100	reward/step=-0.99694	loss ma=0.59245


 42%|████▏     | 4201/10000 [08:46<13:02,  7.41it/s]

iter=4200	reward/step=-0.99716	loss ma=0.57948


 43%|████▎     | 4301/10000 [08:59<12:21,  7.69it/s]

iter=4300	reward/step=-0.99699	loss ma=0.58098


 44%|████▍     | 4402/10000 [09:12<11:01,  8.46it/s]

iter=4400	reward/step=-0.99716	loss ma=0.54602


 45%|████▍     | 4499/10000 [09:24<10:43,  8.55it/s][2017-03-15 03:12:35,964] Making new env: MountainCar-v0
[2017-03-15 03:12:35,967] Clearing 2 monitor files from previous run (because force=True was provided)


iter=4500	reward/step=-0.99733	loss ma=0.55025


[2017-03-15 03:12:36,376] Finished writing results. You can upload them to the scoreboard via gym.upload('/root/anet/Practical_RL/week6/records')
 45%|████▌     | 4501/10000 [09:24<18:38,  4.91it/s]

Current score(mean over 10) = -165.700


 46%|████▌     | 4601/10000 [09:37<12:21,  7.28it/s]

iter=4600	reward/step=-0.99710	loss ma=0.54896


 47%|████▋     | 4701/10000 [09:49<10:29,  8.42it/s]

iter=4700	reward/step=-0.99705	loss ma=0.55388


 48%|████▊     | 4801/10000 [10:01<10:36,  8.17it/s]

iter=4800	reward/step=-0.99716	loss ma=0.51218


 49%|████▉     | 4901/10000 [10:14<11:03,  7.69it/s]

iter=4900	reward/step=-0.99716	loss ma=0.49164


 50%|████▉     | 4999/10000 [10:26<10:26,  7.98it/s][2017-03-15 03:13:38,324] Making new env: MountainCar-v0
[2017-03-15 03:13:38,329] Clearing 2 monitor files from previous run (because force=True was provided)


iter=5000	reward/step=-0.99721	loss ma=0.48065


[2017-03-15 03:13:38,777] Finished writing results. You can upload them to the scoreboard via gym.upload('/root/anet/Practical_RL/week6/records')
 50%|█████     | 5002/10000 [10:27<15:57,  5.22it/s]

Current score(mean over 10) = -166.700


 51%|█████     | 5101/10000 [10:39<10:09,  8.04it/s]

iter=5100	reward/step=-0.99705	loss ma=0.52175


 52%|█████▏    | 5201/10000 [10:52<09:28,  8.45it/s]

iter=5200	reward/step=-0.99716	loss ma=0.49839


 53%|█████▎    | 5301/10000 [11:05<09:11,  8.52it/s]

iter=5300	reward/step=-0.99721	loss ma=0.45238


 54%|█████▍    | 5401/10000 [11:17<09:01,  8.50it/s]

iter=5400	reward/step=-0.99716	loss ma=0.45623


 54%|█████▍    | 5415/10000 [11:19<08:55,  8.57it/s]
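
The `loss ma` column in the log is an exponential moving average of the `train_step()` outputs with decay 0.99. Because the average starts at 0, early values are biased toward zero. A small standalone sketch (the constant loss stream is made up for illustration) shows the effect, together with a common bias correction that is an optional addition and not something the notebook does.

```python
def ema_update(avg, value, decay=0.99):
    """One step of an exponential moving average."""
    return avg * decay + value * (1.0 - decay)

avg = 0.0
true_loss = 2.0                    # hypothetical constant per-step loss
for step in range(1, 101):
    avg = ema_update(avg, true_loss)

# After 100 steps the raw average still underestimates the true value...
print(avg)
# ...while dividing by (1 - decay**step) removes the zero-initialisation bias.
corrected = avg / (1.0 - 0.99 ** 100)
print(corrected)
```

This is worth keeping in mind when reading the first few hundred iterations of the log.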

In [None]:
iters,session_rewards=zip(*sorted(rewards.items(),key=lambda kv:kv[0]))

In [None]:
plt.plot(iters,list(map(np.mean,session_rewards)))

### Visualizing the $V(s)$ and  $\pi(a|s)$

Since the observation space is just 2-dimensional, we can plot it on a 2d scatter plot to gain insight into what the agent has learned.

In [None]:
_,_,_,_,(pool_policy,pool_V) = agent.get_sessions(
    pool.experience_replay,
    session_length=SEQ_LENGTH,
    experience_replay=True,)

plt.scatter(
    *pool.experience_replay.observations[0].get_value().reshape([-1,2]).T,
    c = pool_V.ravel().eval(),
    alpha = 0.1)
plt.title("predicted state values")
plt.xlabel("position")
plt.ylabel("speed")

In [None]:
obs_x,obs_y = pool.experience_replay.observations[0].get_value().reshape([-1,2]).T
optimal_actid = pool_policy.argmax(-1).ravel().eval()
action_names=["left","stop","right"]
for i in range(3):
    sel = optimal_actid==i
    plt.scatter(obs_x[sel],obs_y[sel],
                c=['red','blue','green'][i],
                alpha = 0.1,label=action_names[i])
    
plt.title("most likely action id")
plt.xlabel("position")
plt.ylabel("speed")
plt.legend(loc='best')

### Variations in the algorithm (2 pts)

Try different values of the `n_steps` parameter to see if they improve learning performance.

Your objective is to compare learning curves for 1, 3, 10 and 25-step updates (or any grid you think is appropriate).

For 25-step updates, please also increase SEQ_LENGTH to 25.

_(Bonus)_ Also evaluate how performance changes with different values of the entropy regularizer coefficient.
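
For reference, the `n_steps` setting controls how many real rewards are summed before bootstrapping from the critic. Here is a plain-Python sketch of n-step targets for one session; the function and the toy numbers are illustrative assumptions and do not mirror AgentNet's implementation.

```python
def n_step_targets(rewards, values, n_steps, gamma=0.99):
    """Compute n-step bootstrapped targets for every tick of one session.

    rewards[t] is the reward received after acting at tick t;
    values[t] is the critic's V(s_t), with values[len(rewards)] = V of the
    state after the last reward (use 0.0 if that state is terminal).
    """
    horizon = len(rewards)
    targets = []
    for t in range(horizon):
        end = min(t + n_steps, horizon)   # truncate at the end of the fragment
        g = 0.0
        for k in range(t, end):
            g += gamma ** (k - t) * rewards[k]
        g += gamma ** (end - t) * values[end]   # bootstrap from the critic
        targets.append(g)
    return targets

rewards = [-1.0, -1.0, -1.0, -1.0]
values = [-4.0, -3.0, -2.0, -1.0, 0.0]    # hypothetical critic outputs
print(n_step_targets(rewards, values, n_steps=1, gamma=1.0))
print(n_step_targets(rewards, values, n_steps=3, gamma=1.0))
```

With a perfectly consistent critic, as in this toy case, every choice of `n_steps` yields the same targets. The differences you observe in practice come from critic error (which favours longer horizons) and return variance (which favours shorter ones).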

In [None]:
#<a lot of your code here>

### Bonus section (5+ pts)

Beat [`LunarLanderContinuous-v2`](https://gym.openai.com/envs/LunarLanderContinuous-v2) with a continuous version of advantage actor-critic.

You will need a multidimensional Gaussian (or similar) policy for your agent.

You can implement that by feeding `a2c.get_elementwise_objective` the probabilities of the agent's chosen actions (it will be 2-dimensional) instead of the probabilities of all actions.
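
A diagonal Gaussian policy outputs a mean and a standard deviation per action dimension, and the density of the chosen action plays the role that $\pi(a|s)$ plays in the discrete case. Below is a minimal sketch of the log-density, assuming independent dimensions. The function name and the numbers are hypothetical, and how to wire this into `a2c.get_elementwise_objective` is left to you.

```python
import math

def diag_gaussian_log_prob(action, mean, std):
    """Log-density of `action` under independent Gaussians per dimension."""
    log_p = 0.0
    for a, mu, sigma in zip(action, mean, std):
        z = (a - mu) / sigma
        log_p += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    return log_p

# The action here is 2-dimensional; the values below are made up.
mean, std = [0.1, -0.3], [0.5, 0.5]
print(diag_gaussian_log_prob([0.1, -0.3], mean, std))   # density peak
print(diag_gaussian_log_prob([1.0, 1.0], mean, std))    # far from the mean
```

Working with log-densities rather than densities avoids underflow when the standard deviation becomes small.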

Contact us if you have any questions.