## Advantage actor-critic in AgentNet (5 pts)

Once we're done with REINFORCE, it's time to proceed with something more sophisticated.
The next one in line is advantage actor-critic, in which the agent learns both a policy and a value function, using the latter to speed up learning.
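
To make the role of the value function concrete, here is a minimal NumPy sketch of the one-step advantage estimate, A(s, a) ≈ r + γ·V(s') − V(s). It is an illustration only, not AgentNet code: the function name `one_step_advantage` and the toy numbers are assumptions.

```python
import numpy as np

def one_step_advantage(rewards, values, next_values, is_done, gamma=0.99):
    """Estimate A(s, a) ~ r + gamma * V(s') - V(s) for a batch of transitions.

    All names here are hypothetical and not part of AgentNet.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    next_values = np.asarray(next_values, dtype=np.float64)
    not_done = 1.0 - np.asarray(is_done, dtype=np.float64)
    # terminal transitions have no future value to bootstrap from
    targets = rewards + gamma * next_values * not_done
    return targets - values

adv = one_step_advantage(rewards=[1.0, 0.0], values=[0.5, 2.0],
                         next_values=[1.0, 3.0], is_done=[False, True])
print(adv)
```

A positive advantage means the action turned out better than the critic expected, so the actor is pushed to make it more likely; a negative one pushes the other way.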

We'll start as usual by setting up the LunarLander environment. If you didn't manage to get it installed by now, ~~you deserve whatever happens to you~~ you can try `MountainCar-v0` or `Acrobot-v0`.

In [1]:
%env THEANO_FLAGS='floatX=float32'
import os
#start a virtual display if none is available (needed to render on headless machines)
if not os.environ.get("DISPLAY"):
    !bash ../xvfb start
    %env DISPLAY=:1
        
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


env: THEANO_FLAGS='floatX=float32'


In [2]:
import gym

env = gym.make("LunarLander-v2")
obs = env.reset()
state_size = len(obs)
n_actions = env.action_space.n
print(obs)

[2017-03-15 01:19:05,109] Making new env: LunarLander-v2


[ 0.00328159  0.94092541  0.33237436  0.02080304 -0.00379576 -0.07528779
  0.          0.        ]


# Basic agent setup
Here we define a simple agent that maps environment observations into action probabilities and state values using a shallow neural network.


In [3]:
import lasagne
from lasagne.layers import InputLayer,DenseLayer,NonlinearityLayer,batch_norm,dropout
#observation vector at current tick goes here, shape = (sample_i,state_size)
observation_layer = InputLayer((None,state_size))

nn = DenseLayer(observation_layer,100)#<your architecture>
nn = DenseLayer(nn,200)#<your architecture>

In [4]:
#layers that predict action probabilities (actor) and state values (critic)

policy_layer = DenseLayer(nn,n_actions,nonlinearity=lasagne.nonlinearities.softmax)#<estimate probabilities of actions given prev layer. Mind the nonlinearity!>


V_layer = DenseLayer(nn,1,nonlinearity=None)#<estimate state values (1 unit layer). Mind nonlinearity too.>

In [5]:
#To pick actions, we sample them from the predicted policy probabilities
from agentnet.resolver import ProbabilisticResolver
action_layer = ProbabilisticResolver(policy_layer,
                                     name="probabilistic action picker",
                                     assume_normalized=True)
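
As a rough illustration of what sampling actions from policy probabilities looks like, the sketch below draws one action per session from a batch of probability rows with NumPy. It is not the `ProbabilisticResolver` implementation; `sample_actions` is a hypothetical helper, and each row is assumed to sum to 1.

```python
import numpy as np

def sample_actions(probs, rng):
    """Draw one action index per row of `probs` (shape [batch, n_actions]).

    Hypothetical helper for illustration; rows are assumed to sum to 1.
    """
    probs = np.asarray(probs, dtype=np.float64)
    # inverse-CDF sampling: count how many cumulative sums lie below u
    cdf = np.cumsum(probs, axis=1)
    u = rng.uniform(size=(probs.shape[0], 1))
    actions = (u > cdf).sum(axis=1)
    # guard against rounding when the last cumulative sum is slightly below 1
    return np.minimum(actions, probs.shape[1] - 1)

rng = np.random.RandomState(0)
probs = np.array([[0.0, 1.0, 0.0, 0.0],
                  [0.25, 0.25, 0.25, 0.25]])
actions = sample_actions(probs, rng)
print(actions)
```

Unlike a greedy choice, this keeps exploring for as long as the policy assigns non-zero probability to more than one action.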

##### Finally, the agent
We declare that this network is an MDP agent with the given inputs, states and outputs.

In [6]:
from agentnet.agent import Agent
#all together
agent = Agent(observation_layers=observation_layer,
              policy_estimators=(policy_layer,V_layer),
              action_layers=action_layer)


In [7]:
#Since it's a single lasagne network, one can get its weights, output, etc
weights = lasagne.layers.get_all_params((action_layer,V_layer),trainable=True)
weights

[W, b, W, b, W, b, W, b]

# Create and manage a pool of game sessions to play with

* To make training more stable, we run an entire batch of game sessions, each independent of the others
* Why several parallel agents help training: http://arxiv.org/pdf/1602.01783v1.pdf
* Alternative approach: store more sessions: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

In [8]:
from agentnet.experiments.openai_gym.pool import EnvPool

#create a small pool with 10 parallel agents
pool = EnvPool(agent,"LunarLander-v2", n_games=10,max_size=1000) 

#we assume that pool size 1000 is small enough to learn "almost on policy" :)

[2017-03-15 01:19:06,697] Making new env: LunarLander-v2
[2017-03-15 01:19:06,709] Making new env: LunarLander-v2
[2017-03-15 01:19:06,720] Making new env: LunarLander-v2

In [9]:
%%time
#interact for 7 ticks
_,action_log,reward_log,_,_,_  = pool.interact(7)


print(action_log[:3])
print(reward_log[:3])

[[3 0 1 3 2 0 0]
 [0 2 0 3 1 3 3]
 [0 1 3 1 0 1 1]]
[[-0.02555107  0.73051758  1.67219392 -0.05185902 -1.55174084  0.68232006
   0.        ]
 [ 0.89301952 -5.05226646  0.95141698  1.94756751  0.51880214  1.94206123
   0.        ]
 [-0.54613835  0.41520672 -1.60931248  0.50348755 -0.65353357  0.35359821
   0.        ]]
CPU times: user 24 ms, sys: 32 ms, total: 56 ms
Wall time: 56.3 ms


In [10]:
SEQ_LENGTH = 10
#load first sessions (this function calls interact and remembers sessions)
pool.update(SEQ_LENGTH)

# Actor-critic loss

Here we define the objective function for one-step actor-critic RL.

* We regularize the policy with its negative entropy, i.e. the expected log-probability of actions, which discourages very small probabilities and keeps the objective numerically stable
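
A minimal NumPy sketch of a one-step actor-critic objective for a batch of transitions, assuming the common formulation: a policy-gradient term weighted by the advantage plus a squared-error critic term. This is not the code behind `a2c.get_elementwise_objective`, whose exact weighting may differ; `a2c_elementwise_loss` and its arguments are hypothetical.

```python
import numpy as np

def a2c_elementwise_loss(policy, values, next_values, actions, rewards,
                         is_alive, gamma=0.99):
    """One-step advantage actor-critic loss per transition (illustrative).

    policy:      [batch, n_actions] action probabilities
    values:      [batch] critic estimates V(s)
    next_values: [batch] critic estimates V(s'), 0 for terminal transitions
    actions:     [batch] indices of the actions actually taken
    is_alive:    [batch] 1 while the session is running, 0 afterwards
    """
    policy = np.asarray(policy, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    target = (np.asarray(rewards, dtype=np.float64)
              + gamma * np.asarray(next_values, dtype=np.float64))
    # in a real implementation the advantage is a constant for the actor term
    advantage = target - values
    chosen_prob = policy[np.arange(len(actions)), actions]
    actor_loss = -np.log(chosen_prob) * advantage
    critic_loss = (values - target) ** 2
    return (actor_loss + critic_loss) * np.asarray(is_alive, dtype=np.float64)

toy_loss = a2c_elementwise_loss(policy=[[0.5, 0.5], [0.9, 0.1]],
                                values=[1.0, 0.0],
                                next_values=[0.0, 0.0],
                                actions=[0, 1],
                                rewards=[2.0, 1.0],
                                is_alive=[1, 0])
print(toy_loss)
```

The second transition is masked out by `is_alive`, so it contributes nothing to the total.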


In [11]:
#get agent's action probabilities and state values obtained via experience replay
replay = pool.experience_replay

_,_,_,_,(policy_seq,V_seq) = agent.get_sessions(
    replay,
    session_length=SEQ_LENGTH,
    experience_replay=True,
)



In [12]:
from agentnet.learning import a2c                                                   


elwise_mse_loss = a2c.get_elementwise_objective(policy_seq,
                                                V_seq[:,:,0],
                                                replay.actions[0],
                                                replay.rewards*0.01,
                                                replay.is_alive,
                                                gamma_or_gammas=0.99,
                                                n_steps=1)

#compute mean over "alive" fragments
loss = elwise_mse_loss.sum() / replay.is_alive.sum()
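
Dividing by `replay.is_alive.sum()` averages the loss over live ticks only, rather than over every cell of the [batch, tick] array. A small NumPy sketch of the difference, assuming the elementwise loss is already zero wherever `is_alive` is 0 (the toy arrays are made up):

```python
import numpy as np

# toy [batch, tick] arrays; values are made up for illustration
elwise_loss = np.array([[0.4, 0.2, 0.0],
                        [1.0, 0.0, 0.0]])
is_alive = np.array([[1, 1, 0],
                     [1, 0, 0]])

masked_mean = elwise_loss.sum() / is_alive.sum()  # averages over the 3 live ticks
naive_mean = elwise_loss.mean()                   # averages over all 6 cells

print(masked_mean, naive_mean)
```

The naive mean shrinks as sessions end early, which would make the effective learning rate depend on episode length.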

In [13]:
from theano import tensor as T
reg_entropy = (policy_seq*T.log(policy_seq)).sum(-1).mean()#<regularize agent with __negative__ entropy. Higher entropy = smaller loss. >

loss += reg_entropy
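
To see why adding negative entropy to the loss favors exploration, the NumPy sketch below evaluates the same expression on two hand-made policies. The helper `negative_entropy` and the small `eps` inside the logarithm are assumptions of this sketch, not part of the cell above.

```python
import numpy as np

def negative_entropy(policy, eps=1e-8):
    """Mean over the batch of sum_a p(a) * log p(a).

    `eps` guards against log(0); it is an assumption of this sketch.
    """
    policy = np.asarray(policy, dtype=np.float64)
    return (policy * np.log(policy + eps)).sum(-1).mean()

uniform = [[0.25, 0.25, 0.25, 0.25]]
peaked = [[0.97, 0.01, 0.01, 0.01]]

print(negative_entropy(uniform))  # close to -log(4), the minimum for 4 actions
print(negative_entropy(peaked))   # closer to 0, i.e. a larger penalty
```

With four actions the term is bounded below by -log(4) ≈ -1.386, which is reached by the uniform policy.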

In [14]:
# Compute weight updates
updates = lasagne.updates.adam(loss,weights,learning_rate=1e-4)

In [15]:
import theano
train_step = theano.function([],loss,updates=updates)

[2017-03-15 01:19:24,257] We did not found a dynamic library into the library_dir of the library we use for blas. If you use ATLAS, make sure to compile it with dynamics library.


# Demo run

In [16]:
#for MountainCar-v0 evaluation session is cropped to 200 ticks
untrained_reward = pool.evaluate(save_path="./records",record_video=True)

#video is in the ./records folder

[2017-03-15 01:19:27,106] Making new env: LunarLander-v2
[2017-03-15 01:19:27,120] Clearing 4 monitor files from previous run (because force=True was provided)

Episode finished after 94 timesteps with reward=-250.488916479


[2017-03-15 01:19:32,627] Finished writing results. You can upload them to the scoreboard via gym.upload('/home/jheuristic/Downloads/Practical_RL/week6/records')


# Training loop


In [None]:
#starting epoch
epoch_counter = 1

#full game rewards
rewards = {}

In [None]:
from tqdm import tqdm
#the loop may take eons to finish.
#consider interrupting early.
loss_ma = 0  #moving average of the loss; a separate name keeps the symbolic `loss` intact
for i in tqdm(range(10000)):

    #train
    pool.update(SEQ_LENGTH,append=True)

    loss_ma = loss_ma*0.99 + train_step()*0.01

    if epoch_counter%100==0:
        #average reward per game tick in current experience replay pool
        pool_mean_reward = np.average(pool.experience_replay.rewards.get_value()[:,:-1],
                                      weights=1+pool.experience_replay.is_alive.get_value()[:,:-1])
        print("iter=%i\treward/step=%.5f\tloss ma=%.5f"%(epoch_counter,
                                                         pool_mean_reward,
                                                         loss_ma))

    ##record current learning progress and show learning curves
    if epoch_counter%500 ==0:
        n_games = 10
        rewards[epoch_counter] = pool.evaluate(record_video=False,n_games=n_games,
                                               verbose=False)
        print("Current score(mean over %i) = %.3f"%(n_games,np.mean(rewards[epoch_counter])))

    epoch_counter += 1

    
# Time to drink some coffee!

  1%|          | 100/10000 [00:36<1:23:52,  1.97it/s]

iter=100	reward/step=-2.22627	pool_size=1000	loss ma=-0.88251


  2%|▏         | 200/10000 [01:24<1:06:05,  2.47it/s]

iter=200	reward/step=-2.35537	pool_size=1000	loss ma=-1.19674


  3%|▎         | 300/10000 [02:09<1:17:07,  2.10it/s]

iter=300	reward/step=-2.43342	pool_size=1000	loss ma=-1.31218


  4%|▍         | 400/10000 [03:03<1:51:59,  1.43it/s]

iter=400	reward/step=-2.56407	pool_size=1000	loss ma=-1.35477


  5%|▍         | 499/10000 [03:53<1:16:19,  2.07it/s]
[2017-03-15 01:23:26,824] Making new env: LunarLander-v2
[2017-03-15 01:23:26,862] Clearing 4 monitor files from previous run (because force=True was provided)


iter=500	reward/step=-2.32853	pool_size=1000	loss ma=-1.36896


[2017-03-15 01:23:27,893] Finished writing results. You can upload them to the scoreboard via gym.upload('/home/jheuristic/Downloads/Practical_RL/week6/records')
  5%|▌         | 500/10000 [03:55<2:05:57,  1.26it/s]

Current score(mean over 10) = -173.312


  6%|▌         | 600/10000 [04:41<1:29:45,  1.75it/s]

iter=600	reward/step=-2.11058	pool_size=1000	loss ma=-1.37464


  7%|▋         | 700/10000 [05:40<1:57:35,  1.32it/s]

iter=700	reward/step=-2.36339	pool_size=1000	loss ma=-1.37662


  8%|▊         | 800/10000 [06:41<1:22:14,  1.86it/s]

iter=800	reward/step=-1.94985	pool_size=1000	loss ma=-1.37838


  9%|▉         | 900/10000 [07:34<1:30:23,  1.68it/s]

iter=900	reward/step=-1.96288	pool_size=1000	loss ma=-1.38105


 10%|▉         | 999/10000 [08:52<1:51:46,  1.34it/s]
[2017-03-15 01:28:25,739] Making new env: LunarLander-v2
[2017-03-15 01:28:25,771] Clearing 2 monitor files from previous run (because force=True was provided)


iter=1000	reward/step=-2.13598	pool_size=1000	loss ma=-1.38227


[2017-03-15 01:28:26,859] Finished writing results. You can upload them to the scoreboard via gym.upload('/home/jheuristic/Downloads/Practical_RL/week6/records')
 10%|█         | 1000/10000 [08:54<2:42:23,  1.08s/it]

Current score(mean over 10) = -219.689


 11%|█         | 1100/10000 [10:22<2:16:51,  1.08it/s]

iter=1100	reward/step=-1.96493	pool_size=1000	loss ma=-1.38154


 12%|█▏        | 1200/10000 [12:04<2:14:45,  1.09it/s]

iter=1200	reward/step=-1.83011	pool_size=1000	loss ma=-1.38223


 13%|█▎        | 1300/10000 [14:00<2:54:59,  1.21s/it]

iter=1300	reward/step=-1.61950	pool_size=1000	loss ma=-1.38228


 14%|█▍        | 1400/10000 [16:15<4:16:25,  1.79s/it]

iter=1400	reward/step=-1.65189	pool_size=1000	loss ma=-1.38259


 15%|█▍        | 1499/10000 [18:45<3:28:51,  1.47s/it]
[2017-03-15 01:38:19,500] Making new env: LunarLander-v2
[2017-03-15 01:38:19,511] Clearing 2 monitor files from previous run (because force=True was provided)


iter=1500	reward/step=-1.83839	pool_size=1000	loss ma=-1.38299


[2017-03-15 01:38:20,461] Finished writing results. You can upload them to the scoreboard via gym.upload('/home/jheuristic/Downloads/Practical_RL/week6/records')
 15%|█▌        | 1500/10000 [18:47<4:07:45,  1.75s/it]

Current score(mean over 10) = -228.532


 16%|█▌        | 1600/10000 [21:42<3:53:34,  1.67s/it]

iter=1600	reward/step=-1.67259	pool_size=1000	loss ma=-1.38294


 17%|█▋        | 1700/10000 [26:03<5:09:27,  2.24s/it]

iter=1700	reward/step=-1.76617	pool_size=1000	loss ma=-1.38275


 18%|█▊        | 1800/10000 [29:11<3:43:59,  1.64s/it]

iter=1800	reward/step=-1.82253	pool_size=1000	loss ma=-1.38253


 19%|█▉        | 1892/10000 [31:45<3:53:20,  1.73s/it]

In [None]:
#sort by iteration number; tuple-unpacking lambdas are a syntax error in Python 3
iters,session_rewards=zip(*sorted(rewards.items()))

In [None]:
#wrap map in list so that this also works where map returns an iterator
plt.plot(iters,list(map(np.mean,session_rewards)))

### Variations in the algorithm (2 pts)

Try different values of the `n_steps` parameter to see whether they improve learning performance.

Your objective is to compare learning curves for 1, 3, 10 and 25-step updates (or any grid you think is appropriate).

For 25-step updates, please also increase SEQ_LENGTH to 25.
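
For reference, here is a minimal NumPy sketch of how n-step value targets can be formed for a single session, assuming the standard definition: the sum of up to n discounted rewards plus the discounted critic estimate n steps ahead. It is not AgentNet's implementation; `n_step_targets` is a hypothetical helper, and the session is assumed to end after the last reward.

```python
import numpy as np

def n_step_targets(rewards, values, n_steps, gamma=0.99):
    """Compute n-step targets for one session (illustrative).

    rewards: [T] rewards r_0..r_{T-1}
    values:  [T] critic estimates V(s_0)..V(s_{T-1})
    The session is assumed to terminate after the last reward, so nothing
    is bootstrapped beyond tick T-1.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = len(rewards)
    targets = np.zeros(T)
    for t in range(T):
        horizon = min(t + n_steps, T)
        discounts = gamma ** np.arange(horizon - t)
        targets[t] = (discounts * rewards[t:horizon]).sum()
        if horizon < T:
            # bootstrap from the critic n steps ahead
            targets[t] += gamma ** n_steps * values[horizon]
    return targets

rewards_toy = [1.0, 1.0, 1.0]
values_toy = [10.0, 10.0, 10.0]
targets_1 = n_step_targets(rewards_toy, values_toy, n_steps=1, gamma=0.5)
targets_3 = n_step_targets(rewards_toy, values_toy, n_steps=3, gamma=0.5)
print(targets_1)
print(targets_3)
```

Larger `n_steps` relies more on observed rewards and less on the (possibly inaccurate) critic, at the cost of higher variance.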

_(Bonus)_ Also evaluate how performance changes with different entropy regularizer coefficients.

In [None]:
#<a lot of your code here>

### Bonus section (5+ pts)

Beat [`LunarLanderContinuous-v2`](https://gym.openai.com/envs/LunarLanderContinuous-v2) with a continuous version of advantage actor-critic.

Your agent will need a multidimensional Gaussian (or similar) policy.

You can implement that by feeding `a2c.get_elementwise_objective` the probabilities of the agent's chosen actions (each action is 2-dimensional) instead of the probabilities of all actions.
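
One possible way to obtain that probability is to evaluate the density of a diagonal Gaussian at the chosen action. The NumPy sketch below assumes the network outputs a mean and a log standard deviation per action dimension; `gaussian_action_density` is a hypothetical helper, not an AgentNet function.

```python
import numpy as np

def gaussian_action_density(actions, means, log_stds):
    """Density of a diagonal Gaussian policy at the chosen actions.

    actions, means, log_stds: [batch, action_dim]
    Returns [batch] densities (product over action dimensions).
    """
    actions = np.asarray(actions, dtype=np.float64)
    means = np.asarray(means, dtype=np.float64)
    log_stds = np.asarray(log_stds, dtype=np.float64)
    # per-dimension log-density; summing logs is more stable than multiplying
    log_prob = (-0.5 * ((actions - means) / np.exp(log_stds)) ** 2
                - log_stds - 0.5 * np.log(2 * np.pi))
    return np.exp(log_prob.sum(axis=-1))

density = gaussian_action_density(actions=[[0.0, 0.0]],
                                  means=[[0.0, 0.0]],
                                  log_stds=[[0.0, 0.0]])
print(density)
```

Note that a density, unlike a discrete probability, can exceed 1 when the standard deviation is small, so working with log-densities directly may be the safer choice.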

Contact us if you have any questions.