# Whatami

I am a simple experiment in using an actor-critic agent setup on the MountainCar problem.
Being a policy-based method, actor-critic has much better convergence properties than the Q-learning from the other notebook.

## About OpenAI Gym

* It's a recently published platform that basically lets you train agents in a wide variety of environments through a near-identical interface.
* This is twice as awesome since now we don't need to write a new wrapper for every game.
* Go check it out!
  * Blog post - https://openai.com/blog/openai-gym-beta/
  * Github - https://github.com/openai/gym
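
To make the "near-identical interface" point concrete, here is a minimal sketch of the `reset()` / `step()` contract that the rest of this notebook relies on. `ToyMountainCar` and `run_episode` are hypothetical names made up for this illustration. This is not Gym itself, and the physics constants are an assumption based on the classic MountainCar formulation.

```python
import math
import random


class ToyMountainCar:
    """Hypothetical stand-in that mimics the classic reset()/step() interface."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.position, self.velocity = -0.5, 0.0

    def reset(self):
        # Start at rest somewhere near the bottom of the valley.
        self.position = self.rng.uniform(-0.6, -0.4)
        self.velocity = 0.0
        return [self.position, self.velocity]

    def step(self, action):
        # action: 0 = push left, 1 = no push, 2 = push right
        self.velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * self.position)
        self.velocity = max(-0.07, min(0.07, self.velocity))
        self.position = max(-1.2, min(0.6, self.position + self.velocity))
        if self.position <= -1.2 and self.velocity < 0:
            self.velocity = 0.0  # the car stops when it hits the left wall
        done = self.position >= 0.5  # the goal is on the right hill
        return [self.position, self.velocity], -1.0, done, {}


def run_episode(env, policy, max_steps=200):
    """Play one episode with the given policy and return the total reward."""
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
        if done:
            break
    return total


toy_env = ToyMountainCar(seed=0)
first_obs = toy_env.reset()
idle_return = run_episode(toy_env, policy=lambda obs: 1)  # never push
print(first_obs, idle_return)
```

Any agent written against `reset()` and `step()` can be pointed at another environment without a new wrapper, which is the whole appeal.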


## New to Lasagne and AgentNet?
* We only require surface-level knowledge of Theano and Lasagne, so you can just learn them as you go.
* Alternatively, you can find Lasagne tutorials here:
  * Official MNIST example: http://lasagne.readthedocs.io/en/latest/user/tutorial.html
  * From scratch: https://github.com/ddtm/dl-course/tree/master/Seminar4
  * From Theano: https://github.com/craffel/Lasagne-tutorial/blob/master/examples/tutorial.ipynb
* This is pretty much the basic tutorial for AgentNet, so it's okay not to know it.


In [1]:
%load_ext autoreload
%autoreload 2

# Experiment setup
* Here we basically just load the game and check that it works

In [2]:
from __future__ import print_function 
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
%env THEANO_FLAGS="floatX=float32"

env: THEANO_FLAGS="floatX=float32"


In [3]:
#global params.
GAME = "MountainCar-v0"

#number of parallel agents and batch sequence length (frames)
N_AGENTS = 1
SEQ_LENGTH = 10

In [4]:
import gym
env = gym.make(GAME)
env.reset() #an environment must be reset before the first step
obs = env.step(0)[0]
action_names = np.array(["left",'stop',"right"]) #0: push left, 1: no push, 2: push right
state_size = len(obs)
print(obs)

[2017-01-06 17:30:37,680] Making new env: MountainCar-v0


[-0.57241447 -0.00063994]


# Basic agent setup
Here we define a simple agent that maps game observations (position and velocity) into action probabilities and state values using a shallow neural network.


In [5]:
import lasagne
from lasagne.layers import InputLayer,DenseLayer,NonlinearityLayer,batch_norm,dropout
#observation at current tick goes here, shape = (sample_i,state_size)
observation_layer = InputLayer((None,state_size))

dense0 = DenseLayer(observation_layer,100,name='dense1')
dense1 = DenseLayer(dense0,256,name='dense2')


In [6]:
#a layer that predicts action probabilities (the policy)

policy_layer = DenseLayer(dense1,
                   num_units = env.action_space.n,
                   nonlinearity=lasagne.nonlinearities.softmax,
                   name="q-evaluator layer")


V_layer = DenseLayer(dense1, 1, nonlinearity=None,name="state values")
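
The two heads above share one hidden representation: a softmax head for action probabilities and a linear head for the state value. The plain-Python sketch below shows what each head computes. All weights and activations are made-up numbers, and `softmax` and `linear` are hypothetical helpers written for this illustration, not Lasagne code.

```python
import math


def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to one."""
    shift = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(inputs, weights, bias):
    """A single linear unit: dot product plus bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


# Made-up numbers for illustration only.
hidden = [0.2, -0.1, 0.4]                    # shared hidden activations
policy_params = [([0.5, 0.1, -0.3], 0.0),    # one (weights, bias) pair per action
                 ([-0.2, 0.4, 0.1], 0.1),
                 ([0.3, -0.5, 0.2], -0.1)]
value_params = ([1.0, -2.0, 0.5], -0.3)      # a single unit, no nonlinearity

action_probs = softmax([linear(hidden, w, b) for w, b in policy_params])
state_value = linear(hidden, *value_params)
print(action_probs, state_value)
```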

In [7]:


import theano
epsilon = theano.shared(np.float32(0),allow_downcast=True)
policy_smooth_layer = NonlinearityLayer(policy_layer,
                                        lambda p: (1.-epsilon)*p + epsilon/env.action_space.n)

#To pick actions, we sample from the epsilon-smoothed policy (epsilon is a shared variable)
from agentnet.resolver import ProbabilisticResolver
action_layer = ProbabilisticResolver(policy_smooth_layer,
                                     name="e-greedy action picker",
                                     assume_normalized=True)
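
The smoothing step mixes the policy with a uniform distribution, so every action keeps at least `epsilon / n_actions` probability before an action is sampled. Here is a small sketch of the same arithmetic in plain Python. `smooth_policy` and `sample_action` are hypothetical helper names for this illustration, not part of AgentNet.

```python
import random


def smooth_policy(probs, epsilon):
    """Mix the policy with a uniform distribution over actions."""
    n_actions = len(probs)
    return [(1.0 - epsilon) * p + epsilon / n_actions for p in probs]


def sample_action(probs, rng):
    """Draw an action index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


policy = [0.9, 0.1, 0.0]                     # made-up policy output
smoothed = smooth_policy(policy, epsilon=0.05)
unchanged = smooth_policy(policy, epsilon=0.0)

rng = random.Random(0)
actions = [sample_action(smoothed, rng) for _ in range(1000)]
print(smoothed)
```

With `epsilon = 0` the policy passes through untouched, which is what the notebook uses during evaluation.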



##### Finally, agent
We declare that this network is an MDP agent with such and such inputs, states and outputs

In [8]:
from agentnet.agent import Agent
#all together
agent = Agent(observation_layers=observation_layer,
              policy_estimators=(policy_layer,V_layer),
              action_layers=action_layer)


In [9]:
#Since it's a single lasagne network, one can get its weights, output, etc
weights = lasagne.layers.get_all_params((action_layer,V_layer),trainable=True)
weights

[dense1.W,
 dense1.b,
 dense2.W,
 dense2.b,
 q-evaluator layer.W,
 q-evaluator layer.b,
 state values.W,
 state values.b]

# Create and manage a pool of game sessions to play with

* To make training more stable, we run an entire batch of game sessions, each independent of the others
* Why several parallel agents help training: http://arxiv.org/pdf/1602.01783v1.pdf
* Alternative approach: store more sessions: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
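
The "store more sessions" idea can be sketched as a bounded buffer that is sampled with replacement. `SessionPool` below is a hypothetical toy, not AgentNet's `EnvPool`. That the oldest fragments are dropped first once `max_size` is reached is an assumption made for this illustration.

```python
import random
from collections import deque


class SessionPool:
    """Hypothetical fixed-size store of recorded session fragments."""

    def __init__(self, max_size):
        # Assumption: once the pool is full, the oldest fragments fall out first.
        self.sessions = deque(maxlen=max_size)

    def append(self, fragment):
        self.sessions.append(fragment)

    def sample_batch(self, batch_size, rng):
        # Sampling with replacement, so batch_size may exceed the pool size.
        stored = list(self.sessions)
        return [rng.choice(stored) for _ in range(batch_size)]


toy_pool = SessionPool(max_size=5)
for fragment_id in range(8):
    toy_pool.append(("fragment", fragment_id))

batch = toy_pool.sample_batch(10, random.Random(0))
print(len(toy_pool.sessions), len(batch))
```

Sampling from stored fragments decorrelates the training batch from whatever the agent happened to do most recently.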

In [10]:
from agentnet.experiments.openai_gym.pool import EnvPool

pool = EnvPool(agent,GAME, N_AGENTS,max_size=10000)


[2017-01-06 17:30:49,646] Making new env: MountainCar-v0


In [11]:
%%time
#interact for 7 ticks
_,action_log,reward_log,_,_,_  = pool.interact(7)


print(action_names[action_log])
print(reward_log)

[['right' 'left' 'left' 'right' 'left' 'left' 'left']]
[[-1. -1. -1. -1. -1. -1.  0.]]
CPU times: user 6.69 ms, sys: 5.29 ms, total: 12 ms
Wall time: 8.59 ms


In [12]:
#load first sessions (this function calls interact and remembers sessions)
pool.update(SEQ_LENGTH)

# a2c loss

Here we define the objective function for one-step actor-critic RL.

* We regularize the policy with the expected inverse action probabilities (discouraging very small probabilities) to keep the objective numerically stable
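
For orientation, here is the textbook one-step advantage actor-critic computation for a single transition. This is a sketch of the general technique. The exact terms and weighting inside `a2c_n_step.get_elementwise_objective` are not shown in this notebook and may differ. `one_step_a2c_terms` is a hypothetical helper and the numbers are made up.

```python
import math


def one_step_a2c_terms(action_prob, value, next_value, reward,
                       gamma=0.99, terminal=False):
    """Return (advantage, actor_loss, critic_loss) for one transition."""
    # One-step bootstrapped target: r + gamma * V(s'), with no bootstrap at the end.
    target = reward + (0.0 if terminal else gamma * next_value)
    advantage = target - value
    # Actor: raise log-probability of actions with positive advantage.
    # (In a real implementation the advantage is treated as a constant here.)
    actor_loss = -math.log(action_prob) * advantage
    # Critic: squared error between V(s) and the bootstrapped target.
    critic_loss = advantage ** 2
    return advantage, actor_loss, critic_loss


advantage, actor_loss, critic_loss = one_step_a2c_terms(
    action_prob=0.25, value=-20.0, next_value=-19.5, reward=-1.0)
print(advantage, actor_loss, critic_loss)
```

Here the action turned out slightly worse than the critic expected, so the advantage is negative and minimizing the actor term lowers the probability of that action.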


In [13]:
#get agent's policy and state values on sessions sampled from experience replay
replay = pool.experience_replay.sample_session_batch(100,replace=True)

_,_,_,_,(policy_seq,V_seq) = agent.get_sessions(
    replay,
    session_length=SEQ_LENGTH,
    experience_replay=True,
)



In [14]:
#get the elementwise actor-critic objective
from agentnet.learning import a2c_n_step

#crop rewards to [-1,+1] to avoid explosion.
#import theano.tensor as T
#rewards = T.maximum(-1,T.minimum(rewards,1))

#one-step advantage used by actor-critic: A(s,a) = r + gamma*V(s') - V(s)

elwise_mse_loss = a2c_n_step.get_elementwise_objective(policy_seq,V_seq[:,:,0],
                                                       replay.actions[0],
                                                       replay.rewards,
                                                       replay.is_alive,
                                                       gamma_or_gammas=0.99,
                                                       n_steps=1)

#compute mean over "alive" fragments
loss = elwise_mse_loss.sum() / replay.is_alive.sum()
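
Dividing by the number of alive ticks rather than by the full batch size keeps padding after an episode has ended from diluting the loss. Here is a small sketch with made-up numbers. `masked_mean` is a hypothetical helper, and it applies the mask explicitly, whereas the cell above relies on the elementwise objective to handle dead ticks itself (an assumption about the library).

```python
def masked_mean(values, is_alive):
    """Average only over entries whose mask is 1."""
    masked_sum = sum(v * m for v, m in zip(values, is_alive))
    return masked_sum / sum(is_alive)


tick_losses = [0.5, 1.5, 4.0, 9.0]   # made-up per-tick losses
alive_mask = [1, 1, 1, 0]            # the last tick is past the end of the episode

mean_loss = masked_mean(tick_losses, alive_mask)
naive_mean = sum(tick_losses) / len(tick_losses)
print(mean_loss, naive_mean)
```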

In [15]:
from theano import tensor as T
#despite the name, this is not an entropy: it is the mean inverse action probability
reg_entropy = T.mean((1./policy_seq))
loss += 0.01*reg_entropy
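
To see why this term discourages tiny probabilities, compare it on a uniform policy and on a nearly deterministic one. The helper name `mean_inverse_prob` and the example policies are made up for this sketch.

```python
def mean_inverse_prob(probs):
    """Mean of 1/p over actions; it blows up as any probability nears zero."""
    return sum(1.0 / p for p in probs) / len(probs)


uniform_policy = [1.0 / 3, 1.0 / 3, 1.0 / 3]
peaked_policy = [0.98, 0.01, 0.01]

uniform_penalty = mean_inverse_prob(uniform_policy)
peaked_penalty = mean_inverse_prob(peaked_policy)
print(uniform_penalty, peaked_penalty)
```

For three actions the penalty is smallest at the uniform policy, where it equals the number of actions, and it grows quickly as the policy collapses onto a single action.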

In [16]:
# Compute weight updates
updates = lasagne.updates.rmsprop(loss,weights,learning_rate=0.001)

In [17]:
#compile train function
import theano
train_step = theano.function([],loss,updates=updates)


# Demo run

In [18]:
#for MountainCar-v0 evaluation session is cropped to 200 ticks
untrained_reward = pool.evaluate(save_path="./records",record_video=False)

[2017-01-06 17:31:46,480] Making new env: MountainCar-v0
[2017-01-06 17:31:46,503] Attempted to wrap env <MountainCarEnv instance> after .configure() was called. All wrappers must be applied before calling .configure()
[2017-01-06 17:31:46,506] Clearing 8 monitor files from previous run (because force=True was provided)
[2017-01-06 17:31:46,608] Finished writing results. You can upload them to the scoreboard via gym.upload('/Users/alexajax/Downloads/goto/records')


Episode finished after 200 timesteps with reward=-200.0


In [19]:
from IPython.display import HTML

#video_path="./records/openaigym.video.0.7346.video000000.mp4"

#HTML("""
#<video width="640" height="480" controls>
#  <source src="{}" type="video/mp4">
#</video>
#""".format(video_path))


# Training loop

In [20]:
pool.envs[0].reset()

array([-0.42284873,  0.        ])

In [21]:
#starting epoch
epoch_counter = 1

#full game rewards
rewards = {epoch_counter:untrained_reward}

In [None]:
#pre-fill pool
from tqdm import tqdm
for i in tqdm(range(1000)):
    pool.update(SEQ_LENGTH,append=True,)

100%|██████████| 1000/1000 [00:06<00:00, 151.44it/s]


In [None]:

#the loop may take eons to finish.
#consider interrupting early.
loss_ma = 0 #moving average of the training loss (kept separate from the symbolic `loss`)
for i in tqdm(range(10000)):

    #play: add new session fragments to the pool
    for _ in range(10):
        pool.update(SEQ_LENGTH,append=True)
    #train
    for _ in range(10):
        loss_ma = loss_ma*0.99 + train_step()*0.01

    ##record current learning progress and show learning curves
    if epoch_counter%100 ==0:
        n_games = 10
        epsilon.set_value(0)
        rewards[epoch_counter] = pool.evaluate( record_video=True,n_games=n_games,verbose=True)
        print("Current score(mean over %i) = %.3f"%(n_games,np.mean(rewards[epoch_counter])))
        epsilon.set_value(0.05)

    epoch_counter  +=1

    
# Time to drink some coffee!

  1%|          | 99/10000 [00:41<1:20:53,  2.04it/s][2017-01-06 17:32:35,322] Making new env: MountainCar-v0
[2017-01-06 17:32:35,347] Attempted to wrap env <MountainCarEnv instance> after .configure() was called. All wrappers must be applied before calling .configure()
[2017-01-06 17:32:35,350] Clearing 8 monitor files from previous run (because force=True was provided)
[2017-01-06 17:32:35,355] Starting new video recorder writing to /Users/alexajax/Downloads/goto/records/openaigym.video.1.4458.video000000.mp4
[2017-01-06 17:32:44,146] Starting new video recorder writing to /Users/alexajax/Downloads/goto/records/openaigym.video.1.4458.video000001.mp4


Episode finished after 200 timesteps with reward=-200.0
Episode finished after 200 timesteps with reward=-200.0
Episode finished after 200 timesteps with reward=-200.0
Episode finished after 200 timesteps with reward=-200.0
Episode finished after 200 timesteps with reward=-200.0
Episode finished after 200 timesteps with reward=-200.0
Episode finished after 200 timesteps with reward=-200.0


[2017-01-06 17:32:51,106] Starting new video recorder writing to /Users/alexajax/Downloads/goto/records/openaigym.video.1.4458.video000008.mp4


Episode finished after 200 timesteps with reward=-200.0
Episode finished after 200 timesteps with reward=-200.0


[2017-01-06 17:32:58,606] Finished writing results. You can upload them to the scoreboard via gym.upload('/Users/alexajax/Downloads/goto/records')
  1%|          | 100/10000 [01:04<20:36:27,  7.49s/it]

Episode finished after 200 timesteps with reward=-200.0
Current score(mean over 10) = -200.000


  2%|▏         | 199/10000 [01:51<1:16:42,  2.13it/s][2017-01-06 17:33:45,971] Making new env: MountainCar-v0
[2017-01-06 17:33:45,996] Attempted to wrap env <MountainCarEnv instance> after .configure() was called. All wrappers must be applied before calling .configure()
[2017-01-06 17:33:45,999] Clearing 13 monitor files from previous run (because force=True was provided)
[2017-01-06 17:33:46,005] Starting new video recorder writing to /Users/alexajax/Downloads/goto/records/openaigym.video.2.4458.video000000.mp4
[2017-01-06 17:33:53,730] Starting new video recorder writing to /Users/alexajax/Downloads/goto/records/openaigym.video.2.4458.video000001.mp4


Episode finished after 200 timesteps with reward=-200.0
Episode finished after 200 timesteps with reward=-200.0
Episode finished after 200 timesteps with reward=-200.0
Episode finished after 200 timesteps with reward=-200.0
Episode finished after 200 timesteps with reward=-200.0
Episode finished after 200 timesteps with reward=-200.0
Episode finished after 200 timesteps with reward=-200.0


[2017-01-06 17:34:01,980] Starting new video recorder writing to /Users/alexajax/Downloads/goto/records/openaigym.video.2.4458.video000008.mp4


Episode finished after 200 timesteps with reward=-200.0
Episode finished after 200 timesteps with reward=-200.0


[2017-01-06 17:34:09,487] Finished writing results. You can upload them to the scoreboard via gym.upload('/Users/alexajax/Downloads/goto/records')
  2%|▏         | 200/10000 [02:15<20:28:37,  7.52s/it]

Episode finished after 200 timesteps with reward=-200.0
Current score(mean over 10) = -200.000


  2%|▏         | 210/10000 [02:20<1:51:23,  1.46it/s]
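
The training cell smooths the reported loss with the rule `0.99 * old + 0.01 * new`. The sketch below shows how such a running average behaves. `update_moving_average` is a hypothetical helper and the constant loss of 2.0 is made up.

```python
def update_moving_average(average, new_value, decay=0.99):
    """Exponential moving average: keep most of the old value, add a bit of the new."""
    return average * decay + new_value * (1.0 - decay)


smoothed_loss = 0.0
history = []
for step_loss in [2.0] * 500:        # pretend every training step returns 2.0
    smoothed_loss = update_moving_average(smoothed_loss, step_loss)
    history.append(smoothed_loss)

print(history[0], history[-1])
```

Because the average starts at zero, it underestimates the loss for the first few hundred steps and only gradually approaches the true level.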

In [None]:
#sort by epoch; avoids Python 2-only tuple unpacking in the lambda
iters,session_rewards=zip(*sorted(rewards.items(),key=lambda kv:kv[0]))

In [None]:
plt.plot(iters,list(map(np.mean,session_rewards)))

In [None]:

_,_,_,_,(pool_policy,pool_V) = agent.get_sessions(
    pool.experience_replay,
    session_length=SEQ_LENGTH,
    experience_replay=True,)

plt.scatter(
    *pool.experience_replay.observations[0].get_value().reshape([-1,2]).T,
    c = pool_V.ravel().eval(),
    alpha = 0.1)
plt.title("predicted state values")
plt.xlabel("position")
plt.ylabel("speed")

In [None]:
obs_x,obs_y = pool.experience_replay.observations[0].get_value().reshape([-1,2]).T
optimal_actid = pool_policy.argmax(-1).ravel().eval()

for i in range(3):
    sel = optimal_actid==i
    plt.scatter(obs_x[sel],obs_y[sel],
                c=['red','blue','green'][i],
                alpha = 0.1,label=action_names[i])
    
plt.title("most likely action id")
plt.xlabel("position")
plt.ylabel("speed")
plt.legend(loc='best')