# Reinforcement learning (RL)
Goals of the lab:
- What is RL
    - Motivation
    - Distinction between RL and supervised learning
    - Distinction between RL and online learning
- Several important examples
    - Multi-armed bandit problem
    - A control game
- An example using a neural net to play CartPole-v0

References: 

[Reinforcement learning tutorial Part 2](https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-2-ded33892c724)

[Reinforcement learning tutorial Part 0](https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0)

[Theory about Q-learning](http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/)

[CS294 course from Berkeley 2017](http://rll.berkeley.edu/deeprlcourse/) and all the course videos [here](https://www.youtube.com/playlist?list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3)

### Motivation
- Agent/environment model
- Key idea: **sequential** decision making, as in the game of Go.
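
The agent/environment model can be sketched in a few lines of plain Python. This is only an illustration: `CorridorEnv`, `run_episode`, and `always_right` are hypothetical names made up for this sketch, not part of OpenAI Gym, and the toy environment (walk right along a corridor to reach the goal) is an assumption chosen to keep the loop short.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: reach the right end of a corridor."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        # Start every episode at the left end and return the first state.
        self.pos = 0
        return self.pos

    def step(self, action):
        # action 0 = left, 1 = right; the position is clipped to [0, length - 1].
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode; return (total reward, number of steps taken)."""
    state = env.reset()
    total = 0.0
    for t in range(max_steps):
        action = policy(state)                  # the agent decides
        state, reward, done = env.step(action)  # the environment responds
        total += reward
        if done:
            return total, t + 1
    return total, max_steps


def random_policy(state):
    return random.choice([0, 1])


def always_right(state):
    return 1


print(run_episode(CorridorEnv(5), always_right))
print(run_episode(CorridorEnv(5), random_policy))
```

The decisions are sequential because each action changes the state the agent sees next, and the reward only arrives at the end of the corridor.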

Compared to supervised learning:
- Supervised learning usually has a clear objective, so models can be compared directly (which model is better?).
- Supervised learning needs labels.
- Reinforcement learning has no labels, only a reward signal.

Compared to unsupervised learning:
- Both have ambiguous objective functions.
- Once an objective is chosen, unsupervised learning is usually easier to optimize (for example, clustering).
- RL remains hard even then because of the size of the search space.

Compared to online learning:
- RL is online in nature, but online algorithms also appear elsewhere, in supervised and unsupervised learning.

Frozen lake problem 

![frozen lake](https://cdn-images-1.medium.com/max/960/1*MCjDzR-wfMMkS0rPqXSmKw.png)

**Available to play in OpenAI Gym (a Python package).**

Terms:

- policy
- state
- observation
- reward
- Markov decision process (MDP)

Two ways to learn:
- Learn the policy directly (policy gradient).
- Learn the Q-function iteratively (Q-learning).

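$$ Q'(s, a) = r + \gamma \max_{a'} Q(s', a')$$

Here $s'$ is the state reached after taking action $a$ in state $s$, $r$ is the reward received, and $\gamma$ is the discount factor.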
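
As a rough illustration of the iterative Q-function update, here is a minimal tabular sketch on a made-up 5-state chain. The environment, the `step` helper, the value `gamma = 0.9`, and the uniform random exploration are all assumptions for this example. Because the chain is deterministic, the update overwrites the table entry directly instead of blending with a learning rate.

```python
import numpy as np

n_states, n_actions = 5, 2   # hypothetical chain: states 0..4, actions 0 = left, 1 = right
gamma = 0.9                  # discount factor (assumed value)


def step(s, a):
    """Deterministic transition; reward 1 only when the last state is reached."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done


rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

for episode in range(200):
    s = 0
    for _ in range(100):
        a = int(rng.integers(n_actions))   # explore uniformly at random
        s_next, r, done = step(s, a)
        # Bootstrap from the best action in the next state;
        # a terminal state contributes no future value.
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] = target
        s = s_next
        if done:
            break

# Greedy policy for the non-terminal states 0..3.
greedy_policy = Q[:-1].argmax(axis=1)
print(Q)
print(greedy_policy)
```

On this chain the learned values for moving right should be powers of `gamma` (1 next to the goal, then `gamma`, `gamma**2`, and so on), and the greedy policy should always move right.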

### Example: CartPole-v0

[Details at the OpenAI website](https://gym.openai.com/envs/CartPole-v0)

![demo](https://cdn-images-1.medium.com/max/960/1*G_whtIrY9fGlw3It6HFfhA.gif)

In [75]:
import gym
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf  # requires TensorFlow 1.x: tf.contrib was removed in 2.0
import tensorflow.contrib.slim as slim
from sklearn.ensemble import RandomForestClassifier as rf
%matplotlib inline

In [12]:
env = gym.make('CartPole-v0')
env.reset()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


array([ 0.02925718, -0.00325514,  0.01952045,  0.00902916])

In [13]:
gamma = .99
def discount_rewards(r):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r, dtype=float)
    running_add = 0
    for t in reversed(range(0, len(r))):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r
discount_rewards(np.array([0, 0, 1]))

array([0.9801, 0.99  , 1.    ])
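
In symbols (the notation $G_t$ is introduced here and is not used elsewhere in the notebook), `discount_rewards` computes, for each time step $t$ of an episode with rewards $r_0, \dots, r_{T-1}$, the discounted return

$$ G_t = \sum_{k=0}^{T-1-t} \gamma^k r_{t+k} = r_t + \gamma G_{t+1}, \qquad G_T = 0 $$

The loop runs backwards so that each $G_t$ reuses $G_{t+1}$. For the rewards $(0, 0, 1)$ and $\gamma = 0.99$ this gives

$$ G_2 = 1, \qquad G_1 = 0 + 0.99 \cdot 1 = 0.99, \qquad G_0 = 0 + 0.99 \cdot 0.99 = 0.9801 $$

which matches the output above.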

In [14]:
class agent():
    def __init__(self, lr, s_size,a_size,h_size):
        #These lines establish the feed-forward part of the network. The agent takes a state and outputs a probability for each action.
        self.state_in= tf.placeholder(shape=[None,s_size],dtype=tf.float32)
        hidden = slim.fully_connected(self.state_in,h_size,biases_initializer=None,activation_fn=tf.nn.relu)
        self.output = slim.fully_connected(hidden,a_size,activation_fn=tf.nn.softmax,biases_initializer=None)
        self.chosen_action = tf.argmax(self.output,1)

        #The following lines establish the training procedure. We feed the discounted rewards and chosen actions
        #into the network to compute the loss, and use it to update the network.
        self.reward_holder = tf.placeholder(shape=[None],dtype=tf.float32)
        self.action_holder = tf.placeholder(shape=[None],dtype=tf.int32)
        
        self.indexes = tf.range(0, tf.shape(self.output)[0]) * tf.shape(self.output)[1] + self.action_holder
        self.responsible_outputs = tf.gather(tf.reshape(self.output, [-1]), self.indexes)

        self.loss = -tf.reduce_mean(tf.log(self.responsible_outputs)*self.reward_holder)
        
        tvars = tf.trainable_variables()
        self.gradient_holders = []
        for idx,var in enumerate(tvars):
            placeholder = tf.placeholder(tf.float32,name=str(idx)+'_holder')
            self.gradient_holders.append(placeholder)
        
        self.gradients = tf.gradients(self.loss,tvars)
        
        optimizer = tf.train.AdamOptimizer(learning_rate=lr)
        self.update_batch = optimizer.apply_gradients(zip(self.gradient_holders,tvars))
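
To make the loss in the class above easier to follow, here is a small NumPy sketch of the same computation outside TensorFlow. The probabilities, actions, and returns are made-up numbers for illustration only; the variable names `probs`, `actions`, `returns`, and `responsible` are assumptions of this sketch.

```python
import numpy as np

# Hypothetical batch of 3 time steps with 2 actions each (rows sum to 1).
probs = np.array([[0.7, 0.3],
                  [0.4, 0.6],
                  [0.5, 0.5]])
actions = np.array([0, 1, 1])             # actions actually taken
returns = np.array([0.9801, 0.99, 1.0])   # discounted returns for those steps

# Same flat-index trick as self.indexes: row * n_actions + action.
flat_idx = np.arange(probs.shape[0]) * probs.shape[1] + actions
responsible = probs.reshape(-1)[flat_idx]

# The more direct NumPy indexing picks out the same entries.
direct = probs[np.arange(len(actions)), actions]

# Negative mean of log-probability weighted by the discounted return.
loss = -np.mean(np.log(responsible) * returns)
print(responsible, loss)
```

Minimizing this loss raises the probability of actions that were followed by large returns, which is the policy-gradient idea listed earlier.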

In [64]:
tf.reset_default_graph() #Clear the Tensorflow graph.

myAgent = agent(lr=1e-2,s_size=4,a_size=2,h_size=8) #Load the agent.

total_episodes = 2000 #Set total number of episodes to train agent on.
max_ep = 999
update_frequency = 5

init = tf.global_variables_initializer()

# Launch the tensorflow graph
with tf.Session() as sess:
    sess.run(init)
    i = 0
    total_reward = []
    total_lenght = []
    agreed = []
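    #Fetch the variables only to get arrays of the right shapes, then zero them
    #so the buffer accumulates gradients rather than starting from the weights.
    gradBuffer = sess.run(tf.trainable_variables())
    for ix,grad in enumerate(gradBuffer):
        gradBuffer[ix] = grad * 0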
        
    while i < total_episodes:
        s = env.reset()
        running_reward = 0
        ep_history = []
        for j in range(max_ep):
            #Probabilistically pick an action given our network outputs.
            a_dist = sess.run(myAgent.output,feed_dict={myAgent.state_in:[s]})
            #a = np.random.choice(a_dist[0],p=a_dist[0])
            #a = np.argmax(a_dist == a)
            a = np.random.choice(np.arange(len(a_dist[0])), p = a_dist[0])

            s1,r,d,_ = env.step(a) #Take the sampled action; get the next state, reward, and done flag.
            ep_history.append([s,a,r,s1])
            s = s1
            running_reward += r
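            if d:
                #Update the network.
                #dtype=object keeps the mixed (state, action, reward, next state) rows intact on recent NumPy versions.
                ep_history = np.array(ep_history, dtype=object)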
                ep_history[:,2] = discount_rewards(ep_history[:,2])
                feed_dict={myAgent.reward_holder:ep_history[:,2],
                        myAgent.action_holder:ep_history[:,1],myAgent.state_in:np.vstack(ep_history[:,0])}
                grads = sess.run(myAgent.gradients, feed_dict=feed_dict)
                for idx,grad in enumerate(grads):
                    gradBuffer[idx] += grad

                if i % update_frequency == 0 and i != 0:
                    feed_dict = dict(zip(myAgent.gradient_holders, gradBuffer))
                    _ = sess.run(myAgent.update_batch, feed_dict=feed_dict)
                    for ix,grad in enumerate(gradBuffer):
                        gradBuffer[ix] = grad * 0
                
                total_reward.append(running_reward)
                total_lenght.append(j)
                #Fraction of steps where the greedy (argmax) action matches the action that was sampled.
                out = sess.run(myAgent.output, feed_dict={myAgent.state_in:np.vstack(ep_history[:,0])})
                pred_actions = [np.argmax(tt) for tt in out]
                agreed.append(np.mean(pred_actions == ep_history[:,1]))
                break

        
        #Report the mean reward over the last 100 episodes.
        if i % 100 == 0:
            print(np.mean(total_reward[-100:]))
            print('agreed with the past: %f'%(np.mean(agreed[-100:])))
        i += 1

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


26.0
agreed with the past: 0.615385
25.22
agreed with the past: 0.504390
31.73
agreed with the past: 0.519228
41.86
agreed with the past: 0.587411
54.05
agreed with the past: 0.629945
59.39
agreed with the past: 0.643152


KeyboardInterrupt: 