# TensorFlow Assignment: Reinforcement Learning (RL)

**[Duke Community Standard](http://integrity.duke.edu/standard.html): By typing your name below, you are certifying that you have adhered to the Duke Community Standard in completing this assignment.**

Name: Rachel Kositsky

### Questions:

- Why don't we multiply by alpha when updating the action Q value? 
- Answer: because we're using the continuous neural network, and it updates via backpropagation

### Short answer

1\. One of the fundamental challenges of reinforcement learning is balancing *exploration* versus *exploitation*. What do these two terms mean, and why do they present a challenge?

The network has the task of maximizing its reward on a task. It does this by taking the best action at any given state according to its policy. However, this way it can become fixed in a certain policy which may not be the optimal policy. _Exploration_ is the idea of taking potentially non-optimal actions in order to improve the policy. _Exploitation_ is the idea of using the learned policy to maximize the reward. 

No exploration means that the learned policy may converge on a non-optimal solution. No exploitation means that the rewards will be non-optimal because the rewards per state will never have been chosen by the policy. It is a challenge to balance these two in a way that allows learning and accurate reward propagation.

2\. Another fundamental reinforcement learning challenge is what is known as the *credit assignment problem*, especially when rewards are sparse. 
What do we mean by the phrase, and why does this make learning especially difficult?
How does this interact with reward function design, where we have to be careful that our reward captures the true objective?

The _credit assigning problem_ is that rewards for achieving the overall goal may only occur after many steps, decreasing the reward and making it difficult to update beneficial steps. This often occurs when the "win" outcome is only one state in a continuous or large discrete state space. However, reward functions still have to be designed to capture the true objective and not intermediate goals, as the model may learn to achieve the intermediate goals as the expense of the true objective.

### Deep SARSA Cart Pole

[SARSA (state-action-reward-state-action)](https://en.wikipedia.org/wiki/State–action–reward–state–action) is another Q value algorithm that resembles Q-learning quite closely:

Q-learning update rule:
\begin{equation}
Q_\pi (s_t, a_t) \leftarrow (1 - \alpha) \cdot Q_\pi(s_t, a_t) + \alpha \cdot \big(r_t + \gamma \max_a Q_\pi(s_{t+1}, a)\big)
\end{equation}

SARSA update rule:
\begin{equation}
Q_\pi (s_t, a_t) \leftarrow (1 - \alpha) \cdot Q_\pi(s_t, a_t) + \alpha \cdot \big(r_t + \gamma Q_\pi(s_{t+1}, a_{t+1})\big)
\end{equation}

Unlike Q-learning, which is considered an *off-policy* network, SARSA is an *on-policy* algorithm. 
When Q-learning calculates the estimated future reward, it must "guess" the future, starting with the next action the agent will take. In Q-learning, we assume the agent will take the best possible action: $\max_a Q_\pi(s_{t+1}, a)$. SARSA, on the other hand, uses the action that was actually taken next in the episode we are learning from: $Q_\pi(s_{t+1}, a_{t+1})$. In other words, SARSA learns from the next action he actually took (on policy), as opposed to what the max possible Q value for the next state was (off policy).

Build an RL agent that uses SARSA to solve the Cart Pole problem. 

*Hint: You can and should reuse the Q-Learning agent we went over earlier. In fact, if you know what you're doing, it's possible to finish this assignment in about 30 seconds.*

In [1]:
### YOUR CODE HERE ###
# Based on: https://gym.openai.com/evaluations/eval_EIcM1ZBnQW2LBaFN6FY65g/

import random
import gym
import math
import numpy as np
import tensorflow as tf
from collections import deque

class DQNCartPoleSolver():
    def __init__(self, n_episodes=1000, n_win_ticks=195, max_env_steps=None, gamma=1.0, epsilon=1.0, epsilon_min=0.01, epsilon_log_decay=0.995, alpha=0.01, alpha_decay=0.01, batch_size=64, monitor=False, quiet=False):
        self.memory = deque(maxlen=100000)
        self.env = gym.make('CartPole-v0')
        if monitor: self.env = gym.wrappers.Monitor(self.env, '../data/cartpole-1', force=True)
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_log_decay
        self.alpha = alpha
        self.alpha_decay = alpha_decay
        self.n_episodes = n_episodes
        self.n_win_ticks = n_win_ticks
        self.batch_size = batch_size
        self.quiet = quiet
        if max_env_steps is not None: self.env._max_episode_steps = max_env_steps

        # Init model
        self.state_ = tf.placeholder(tf.float32, shape=[None, 4])
        h = tf.layers.dense(self.state_, units=24, activation=tf.nn.tanh)
        h = tf.layers.dense(h, units=48, activation=tf.nn.tanh)
        self.Q = tf.layers.dense(h, units=2)
        
        self.Q_ = tf.placeholder(tf.float32, shape=[None, 2])
        loss = tf.losses.mean_squared_error(self.Q_, self.Q)
        self.global_step = tf.Variable(0, name='global_step', trainable=False)
        lr = tf.train.exponential_decay(0.01, self.global_step, 0.995, 1)
        self.train_step = tf.train.AdamOptimizer(lr).minimize(loss, global_step=self.global_step)
        
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())

    def remember(self, state, action, reward, next_state, next_action, done):
        self.memory.append((state, action, reward, next_state, next_action, done))

    def choose_action(self, state, epsilon):
        if (np.random.random() <= epsilon):
            action = self.env.action_space.sample()
        else:
            action = np.argmax(self.sess.run(self.Q, feed_dict={self.state_: state}))
        return action
        
    def get_epsilon(self, t):
        return max(self.epsilon_min, min(self.epsilon, 1.0 - math.log10((t + 1) * self.epsilon_decay)))

    def preprocess_state(self, state):
        return np.reshape(state, [1, 4])

    def replay(self, batch_size):
        # TODO: change the update here. gets called after each episode ends.
        x_batch, y_batch = [], []
        minibatch = random.sample(
            self.memory, min(len(self.memory), batch_size))
        
        # can you see it in minibatch?
        # y_target: initialized as .Q
        # y_target: is the avg reward?
        for state, action, reward, next_state, next_action, done in minibatch:
            y_target = self.sess.run(self.Q, feed_dict={self.state_: state})
            
            # Change the update for the action to SARSA update
            y_target[0][action] = reward if done else reward + self.gamma * self.sess.run(self.Q, feed_dict={self.state_: next_state})[0][next_action]
            x_batch.append(state[0])
            y_batch.append(y_target[0]) 
        self.sess.run(self.train_step, feed_dict={self.state_: np.array(x_batch), self.Q_: np.array(y_batch)})

        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def run(self):
        scores = deque(maxlen=100)

        for e in range(self.n_episodes):
            state = self.preprocess_state(self.env.reset())
            done = False
            i = 0
            next_action = self.choose_action(state, self.get_epsilon(e))
            
            while not done:
                if e % 100 == 0 and not self.quiet:
                    self.env.render()
                action = next_action
                #action = self.choose_action(state, self.get_epsilon(e))
                next_state, reward, done, _ = self.env.step(action)
                next_state = self.preprocess_state(next_state)
                next_action = self.choose_action(state, self.get_epsilon(e))
                self.remember(state, action, reward, next_state, next_action, done)
                state = next_state
                i += 1

            scores.append(i)
            mean_score = np.mean(scores)
            if mean_score >= self.n_win_ticks and e >= 100:
                if not self.quiet: print('Ran {} episodes. Solved after {} trials ✔'.format(e, e - 100))
                return e - 100
            if e % 100 == 0 and not self.quiet:
                print('[Episode {}] - Mean survival time over last 100 episodes was {} ticks.'.format(e, mean_score))

            self.replay(self.batch_size)
        
        if not self.quiet: print('Did not solve after {} episodes 😞'.format(e))
        return e

if __name__ == '__main__':
    agent = DQNCartPoleSolver()
    agent.run()

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[Episode 0] - Mean survival time over last 100 episodes was 13.0 ticks.
[Episode 100] - Mean survival time over last 100 episodes was 27.3 ticks.
[Episode 200] - Mean survival time over last 100 episodes was 58.06 ticks.
[Episode 300] - Mean survival time over last 100 episodes was 178.13 ticks.
[Episode 400] - Mean survival time over last 100 episodes was 97.01 ticks.
[Episode 500] - Mean survival time over last 100 episodes was 140.76 ticks.
[Episode 600] - Mean survival time over last 100 episodes was 169.74 ticks.
[Episode 700] - Mean survival time over last 100 episodes was 183.58 ticks.
[Episode 800] - Mean survival time over last 100 episodes was 164.75 ticks.
[Episode 900] - Mean survival time over last 100 episodes was 158.28 ticks.
Did not solve after 999 episodes 😞


In [2]:
agent.env.close()

Note: you should be able to find that SARSA works much better for the demo we went over during lecture.
This is not necessarily a general result.
Q-learning and SARSA tend to do better on different kinds of problems.