# Approximate Q-learning

This notebook will teach you to solve reinforcement learning problems with approximate Q-learning, where a neural network estimates the Q-values.

In [None]:
#XVFB will be launched if you run on a server
import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY"))==0:
    !bash ../xvfb start
    %env DISPLAY=:1

In [None]:
import gym
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
env = gym.make("CartPole-v0")
env.reset()
n_actions = env.action_space.n
state_dim = env.observation_space.shape

plt.imshow(env.render("rgb_array"))

# Approximate (deep) Q-learning: building the network

In this section we will build and train a naive Q-learning agent with theano/lasagne.

The first step is to initialize the input variables.

In [None]:
import theano
import theano.tensor as T

#create input variables. We'll support multiple states at once


current_states = T.matrix("states[batch,units]")
actions = T.ivector("action_ids[batch]")
rewards = T.vector("rewards[batch]")
next_states = T.matrix("next states[batch,units]")
is_end = T.ivector("vector[batch] where 1 means that session just ended")

In [None]:
import lasagne
from lasagne.layers import *

l_states = InputLayer((None,)+state_dim)


<Your architecture. Please start with a single-layer network>




l_qvalues = DenseLayer(<previous_layer>,num_units=n_actions,nonlinearity=None)
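
To make the shapes concrete, here is a minimal NumPy sketch of what such a network computes: one hidden layer followed by a linear output with one Q-value per action. It is not the lasagne solution. The weights are random, and `hidden_size`, `q_network` and the other names are hypothetical choices made for illustration only.

```python
import numpy as np

# CartPole has 4 state variables and 2 actions; hidden_size is an arbitrary choice.
state_size, hidden_size, n_actions = 4, 32, 2

rng = np.random.RandomState(0)
w_hidden = rng.normal(scale=0.1, size=(state_size, hidden_size))
b_hidden = np.zeros(hidden_size)
w_out = rng.normal(scale=0.1, size=(hidden_size, n_actions))
b_out = np.zeros(n_actions)

def q_network(states):
    """Map states[batch, state_size] to q_values[batch, n_actions]."""
    hidden = np.maximum(0.0, np.dot(states, w_hidden) + b_hidden)  # ReLU hidden layer
    return np.dot(hidden, w_out) + b_out  # linear output: Q-values are unbounded

batch = rng.normal(size=(5, state_size))
q_batch = q_network(batch)
print(q_batch.shape)
```

The output layer has no nonlinearity because Q-values are estimates of returns and can take any real value.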

The code below computes the actual Q-value predictions.

In [None]:
predicted_qvalues = get_output(l_qvalues,{l_states:current_states})
predicted_qvalues_for_actions = predicted_qvalues[T.arange(actions.shape[0]),actions]
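
The indexing `predicted_qvalues[T.arange(actions.shape[0]),actions]` picks, for every row of the batch, the Q-value of the action that was actually taken. The same pattern works in NumPy, which makes it easy to check by hand. The arrays below are made-up illustration data.

```python
import numpy as np

# Hypothetical batch of Q-values: 3 states, 2 actions each.
q_values = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])
# Action taken in each of the 3 states.
action_ids = np.array([1, 0, 1])

# Pair row i with column action_ids[i], giving one Q-value per state.
q_for_actions = q_values[np.arange(len(action_ids)), action_ids]
print(q_for_actions)  # [2. 3. 6.]
```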

In [None]:
get_qvalues = <compile a function that takes current_states and returns predicted_qvalues>

The code below contains utilities for learning.

In [None]:
#predict q-values for next states
predicted_next_qvalues = get_output(l_qvalues,{l_states:<input with next states>})

gamma = 0.99
target_qvalues_for_actions = <target Q-values using rewards and predicted_next_qvalues>

#zero-out q-values at the end
target_qvalues_for_actions = (1-is_end)*target_qvalues_for_actions
target_qvalues_for_actions = theano.gradient.disconnected_grad(target_qvalues_for_actions)
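
For reference, the textbook one-step Q-learning target is r + gamma * max over a' of Q(s', a'). The NumPy sketch below computes it for a small batch. It is an illustration and not the theano solution. The helper name `td_targets` is hypothetical, and this sketch masks only the bootstrap term at the end of a session, which is one common convention, whereas the cell above multiplies the whole target by `(1-is_end)`.

```python
import numpy as np

def td_targets(rewards, next_q_values, is_done, gamma=0.99):
    """One-step Q-learning targets: r + gamma * max_a' Q(s', a')."""
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    is_done = np.asarray(is_done, dtype=float)
    # Value of the best action in each next state.
    best_next_q = next_q_values.max(axis=1)
    # Do not bootstrap from a state that ended the session.
    return rewards + gamma * (1.0 - is_done) * best_next_q

targets = td_targets(rewards=[1.0, 1.0],
                     next_q_values=[[0.5, 2.0], [3.0, 1.0]],
                     is_done=[0, 1])
print(targets)
```

In the notebook the targets are also wrapped in `disconnected_grad`, so the gradient flows only through the predicted Q-values and not through the targets.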

In [None]:
loss = <find a distance between target_qvalues_for_actions and predicted_qvalues_for_actions>

all_weights = get_all_params(l_qvalues,trainable=True)
updates = <your favorite optimizer>
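
One common choice of distance is the mean squared error between predicted and target Q-values. The NumPy sketch below shows the arithmetic only. `mse_loss` is a hypothetical helper, and in the notebook the same expression would be written with theano tensors and passed to an optimizer from `lasagne.updates`.

```python
import numpy as np

def mse_loss(predicted, target):
    """Mean squared difference between predicted and target Q-values."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.mean((predicted - target) ** 2)

# Errors are -1 and -2, so the loss is (1 + 4) / 2.
loss_value = mse_loss([1.0, 2.0], [2.0, 4.0])
print(loss_value)  # 2.5
```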

In [None]:
train_step = theano.function([current_states,actions,rewards,next_states,is_end],updates=updates)

In [None]:
epsilon = 0.25

def generate_session(t_max=1000):
    """play env with q-learning agent"""
    total_reward = 0
    
    s = env.reset()
    
    for t in range(t_max):
        
        #get action q-values from the network
        q_values = get_qvalues([s])[0] 
        
            
        a = <sample action with epsilon-greedy strategy>
        
        new_s,r,done,info = env.step(a)
        
        #train agent one step. Note that we use one-element arrays instead of scalars 
        #because that's what the function accepts.
        train_step([s],[a],[r],[new_s],[done])
        
        total_reward+=r
        
        s = new_s
        if done: break
            
    return total_reward
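
An epsilon-greedy strategy takes a uniformly random action with probability `epsilon` and the action with the highest Q-value otherwise. A minimal sketch, assuming the Q-values arrive as a 1-D array; the helper name `epsilon_greedy` is hypothetical and not part of the notebook.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Return a random action with probability epsilon, else the greedy one."""
    q_values = np.asarray(q_values)
    if np.random.rand() < epsilon:
        # Explore: every action is equally likely.
        return int(np.random.randint(len(q_values)))
    # Exploit: action with the highest estimated Q-value.
    return int(np.argmax(q_values))

greedy_action = epsilon_greedy([0.1, 0.9], epsilon=0.0)
print(greedy_action)  # 1
```

With `epsilon=0` the agent never explores, which is what the video section relies on to record purely greedy play.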
        

In [None]:
for i in range(100):
    
    #generate new sessions; a separate name avoids shadowing the theano `rewards` input
    session_rewards = [generate_session() for _ in range(100)]
    
    epsilon*=0.95
    
    print ("mean reward:%.3f\tepsilon:%.5f"%(np.mean(session_rewards),epsilon))

    if np.mean(session_rewards) > 250:
        print ("You Win!")
        break
        
    assert epsilon!=0, "Please explore environment"
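
The loop multiplies `epsilon` by 0.95 once per epoch, so after n epochs it equals 0.25 * 0.95^n. The sketch below prints a few values of that schedule, using the starting value and decay factor from the cells above. `epsilon_after` is a hypothetical helper.

```python
initial_epsilon = 0.25  # starting value used in the notebook
decay = 0.95            # per-epoch multiplier used in the training loop

def epsilon_after(n_epochs):
    """Exploration rate after n_epochs multiplicative decay steps."""
    return initial_epsilon * decay ** n_epochs

for n in (0, 10, 50, 100):
    print("epoch %3d: epsilon = %.5f" % (n, epsilon_after(n)))
```

Multiplicative decay shrinks `epsilon` but never makes it exactly zero, so the assertion in the loop only fires if `epsilon` was set to 0 by hand, for example after running the video cell.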

### Video

In [None]:
epsilon=0 #Don't forget to set epsilon back to a non-zero value (it started at 0.25) before training again

In [None]:
#record sessions
import gym.wrappers
env = gym.wrappers.Monitor(env,directory="videos",force=True)
sessions = [generate_session() for _ in range(100)]
env.close()
#unwrap 
env = env.env.env
#upload to gym
#gym.upload("./videos/",api_key="<your_api_key>") #you'll need me later

#Warning! If you keep seeing an error that reads something like "DoubleWrapError",
#run env=gym.make("CartPole-v0");env.reset();

In [None]:
#show video
from IPython.display import HTML
import os

video_names = list(filter(lambda s:s.endswith(".mp4"),os.listdir("./videos/")))

HTML("""
<video width="640" height="480" controls>
  <source src="{}" type="video/mp4">
</video>
""".format("./videos/"+video_names[-1])) #this may or may not be _last_ video. Try other indices

### Homework

Two paths lie ahead of you, and which one to take is your rightful choice.

* [recommended] Go deeper. Return to seminar1 and 
* [alternative] Pick ```<your favourite env>``` and solve it, using NN.
 * LunarLander, MountainCar or Breakout (from week1 bonus)
 * LunarLander should get at least +100
 * MountainCar should get at least -200
 * Breakout should be better than random 
   * +5 points if it gets average score of >= +10
   * +5 more if it gets average score of >= +20
   * more if more points
   
* Bonus - try approximate expected-value SARSA and other algorithms and compare them with Q-learning (+2 per algorithm)