<a href="https://colab.research.google.com/github/varun-bhaseen/Reinforcement-Learning/blob/master/RL_Midterm_Stu_Id_014538212_Bhaseen_Varun.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Midterm Exam

## Problem 1 (70 pts):
1. Use the Q-learning example as the base code and compare the SARSA and Q-learning approach
based on the same setting for the learning rate (alpha) and the discount factor (gamma).
2. Observe the performance of the learning algorithms by changing the learning rate by 4 values
and gamma by 2.
3. Tabulate/plot the performance vs. the changes in these two hyperparameters and elaborate on
the results.
4. Submit your code and the report on Canvas by the due date.

## Problem 2 (30 pts):
1. How is Q-learning different from state value-based learning?
2. Is it possible for the agent to rely on the state value-based learning approach to achieve its goal?
3. What kind of information does the agent need in order to use each approach (Q-learning and
State value-based)?
4. Is there a systematic way to determine which action value-based learning method (Q-learning
and SARSA) is a better choice and can achieve better results? Explain.
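
Questions 1 and 4 contrast three bootstrapped targets. As a point of reference, here is a minimal sketch of how they differ, assuming a tabular `Q` array indexed by `[state][action]` and a tabular `V` array indexed by `[state]`. The function names and the toy numbers are illustrative assumptions and are not part of the notebook's agent classes.

```python
import numpy as np

def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the best action value in next_state."""
    return reward + gamma * np.max(Q[next_state])

def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action actually chosen in next_state."""
    return reward + gamma * Q[next_state][next_action]

def td0_state_value_target(V, reward, next_state, gamma):
    """State-value target: bootstrap from V(next_state); there is no action index."""
    return reward + gamma * V[next_state]

# Toy table with 2 states and 3 actions (made-up values, for illustration only).
Q = np.array([[0.0, 0.0, 0.0],
              [1.0, 5.0, 2.0]])
V = Q.max(axis=1)  # state values implied by acting greedily on Q

q_t = q_learning_target(Q, reward=-1.0, next_state=1, gamma=0.9)
s_t = sarsa_target(Q, reward=-1.0, next_state=1, next_action=2, gamma=0.9)
v_t = td0_state_value_target(V, reward=-1.0, next_state=1, gamma=0.9)
print(q_t, s_t, v_t)
```

The Q-learning and SARSA targets coincide only when the next action happens to be the greedy one. The state-value target needs no action index, which is why acting on `V` alone requires some way to predict the next state for each candidate action.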

In [None]:
!pip install tensorboardX



In [None]:
!rm -r ./runs/

In [None]:
from tensorboardX import SummaryWriter
writer = SummaryWriter()

%load_ext tensorboard

In [None]:
import gym
import numpy as np

MAX_NUM_EPISODES = 25000
STEPS_PER_EPISODE = 200 #  This is specific to MountainCar. May change with env
EPSILON_MIN = 0.005
max_num_steps = MAX_NUM_EPISODES * STEPS_PER_EPISODE
EPSILON_DECAY = 500 * EPSILON_MIN / max_num_steps
ALPHA = 0.05  # Learning rate
GAMMA = 0.98  # Discount factor (replaced by 0.99 if the SARSA cell below runs before the agent is created)
NUM_DISCRETE_BINS = 30  # Number of bins to Discretize each observation dim

class Q_Learner(object):
    def __init__(self, env):
        self.obs_shape = env.observation_space.shape
        self.obs_high = env.observation_space.high
        self.obs_low = env.observation_space.low
        self.obs_bins = NUM_DISCRETE_BINS  # Number of bins to Discretize each observation dim
        self.bin_width = (self.obs_high - self.obs_low) / self.obs_bins
        self.action_shape = env.action_space.n
        # Create a multi-dimensional array (aka. Table) to represent the
        # Q-values
        self.Q = np.zeros((self.obs_bins + 1, self.obs_bins + 1,
                           self.action_shape))  # (31 x 31 x 3) for 30 bins
        self.alpha = ALPHA  # Learning rate
        self.gamma = GAMMA  # Discount factor
        self.epsilon = 1.0

    def discretize(self, obs):
        return tuple(((obs - self.obs_low) / self.bin_width).astype(int))

    def get_action(self, obs):
        discretized_obs = self.discretize(obs)
        # Epsilon-Greedy action selection
        if self.epsilon > EPSILON_MIN:
            self.epsilon -= EPSILON_DECAY
        if np.random.random() > self.epsilon:
            return np.argmax(self.Q[discretized_obs])
        else:  # Choose a random action
            return np.random.choice([a for a in range(self.action_shape)])

    def learn(self, obs, action, reward, next_obs):
        discretized_obs = self.discretize(obs)
        discretized_next_obs = self.discretize(next_obs)
        td_target = reward + self.gamma * np.max(self.Q[discretized_next_obs])
        td_error = td_target - self.Q[discretized_obs][action]
        self.Q[discretized_obs][action] += self.alpha * td_error

def q_train(agent, env):
    best_reward = -float('inf')
    with SummaryWriter() as writer:

      for episode in range(MAX_NUM_EPISODES):
          done = False
          obs = env.reset()
          total_reward = 0.0
          while not done:
              action = agent.get_action(obs)
              next_obs, reward, done, info = env.step(action)
              agent.learn(obs, action, reward, next_obs)
              obs = next_obs
              total_reward += reward
          if total_reward > best_reward:
              best_reward = total_reward
          print("Q-Learn Episode#:{} reward:{} best_reward:{} eps:{}".format(episode,
                                      total_reward, best_reward, agent.epsilon))
          # writer.add_scalar('Q-Learner', episode, total_reward)
          
          # writer.add_hparams({'learning_rate': agent.alpha, 'discount_factor': agent.gamma,
          #                 'epsilon': agent.epsilon}, {'reward': total_reward, 'best_reward': best_reward})

          stats = {'learning_rate': agent.alpha, 'discount_factor': agent.gamma,
                  'epsilon': agent.epsilon, 'reward': reward, 
                  'best_reward': best_reward, 'total_reward': total_reward}
          
          writer.add_scalars('Q_LearnAgent', stats, episode)

      # Return the trained policy
      return np.argmax(agent.Q, axis=2)

def q_test(agent, env, policy):
    done = False
    obs = env.reset()
    total_reward = 0.0
    while not done:
        action = policy[agent.discretize(obs)]
        next_obs, reward, done, info = env.step(action)
        obs = next_obs
        total_reward += reward
    return total_reward

# if __name__ == "__main__":
#     env = gym.make('MountainCar-v0')
#     # env = gym.wrappers.Monitor(env, "recording", force=True)
    
#     agent = Q_Learner(env)
#     learned_policy = train(agent, env)
#     # Use the Gym Monitor wrapper to evaluate the agent and record video
#     # gym_monitor_path = "./gym_monitor_output"
#     # env = gym.wrappers.Monitor(env, gym_monitor_path, force=True)
#     writer.flush()
#     for _ in range(1000):
#         test(agent, env, learned_policy)
#     env.close()

In [None]:
# !pip install gym pyvirtualdisplay
# !apt-get install -y xvfb python-opengl ffmpeg


In [None]:
import gym
import numpy as np
# from pyvirtualdisplay import Display
import math
# import glob
# import io
# import base64
# from IPython.display import HTML
from gym import logger as gymlogger
# import IPython.display as ipythondisplay

# gymlogger.set_level(40) #error only

# display = Display(visible=0, size=(1400, 900))
# display.start()

MAX_NUM_EPISODES = 25000
STEPS_PER_EPISODE = 200 #  This is specific to MountainCar. May change with env
EPSILON_MIN = 0.005
max_num_steps = MAX_NUM_EPISODES * STEPS_PER_EPISODE
EPSILON_DECAY = 500 * EPSILON_MIN / max_num_steps

ALPHA = 0.05  # Learning rate

GAMMA = 0.99  # Discount factor
NUM_DISCRETE_BINS = 30  # Number of bins to Discretize each observation dim

class SARSA_Learner(object):
    def __init__(self, env):
        self.obs_shape = env.observation_space.shape
        self.obs_high = env.observation_space.high
        self.obs_low = env.observation_space.low
        
        self.obs_bins = NUM_DISCRETE_BINS  # Number of bins to Discretize each observation dim
        self.bin_width = (self.obs_high - self.obs_low) / self.obs_bins
        self.action_shape = env.action_space.n #n_action = action_shape
        # Create a multi-dimensional array (aka. Table) to represent the
        # Q-values
        self.Q = np.zeros((self.obs_bins + 1, self.obs_bins + 1,
                           self.action_shape))  # (31 x 31 x 3) for 30 bins
        
        
        self.env_den = self.bin_width
        self.pos_den = self.env_den[0]    
        self.vel_den = self.env_den[1]
        self.pos_high = self.obs_high[0]    
        self.pos_low = self.obs_low[0]    
        self.vel_high = self.obs_high[1]    
        self.vel_low = self.obs_low[1]
        
        
        self.gamma = GAMMA  # Discount factor
        self.epsilon = 1.0

        self.alpha = ALPHA
        
#         self.alpha = max(EPSILON_MIN,self.epsilon*(self.gamma**(episode//100)))  # Learning rate

    def discretize(self, obs):
        
        
        self.pos_scaled = int((obs[0] - self.pos_low)/self.pos_den)  #converts to an integer value    
        self.vel_scaled = int((obs[1] - self.vel_low)/self.vel_den)  #converts to an integer value
        
        return self.pos_scaled, self.vel_scaled

    def get_action(self, obs):
        pos, vel = self.discretize(obs)

        # Epsilon-greedy action selection: decay epsilon linearly down to EPSILON_MIN
        if self.epsilon > EPSILON_MIN:
            self.epsilon -= EPSILON_DECAY

        if np.random.random() > self.epsilon:
            # Exploit: take the greedy action with probability 1 - epsilon
            return np.argmax(self.Q[pos][vel])
        else:
            # Explore: take a uniformly random action with probability epsilon
            return np.random.choice(self.action_shape)

    def learn(self, obs, action, reward, next_obs, next_action):
        pos, vel = self.discretize(obs)
        pos_, vel_ = self.discretize(next_obs)

        # SARSA (on-policy) target: bootstrap from the action actually chosen
        # in the next state, not from the max over actions as in Q-learning
        td_target = reward + self.gamma * self.Q[pos_][vel_][next_action]

        # Update formula picked from the source below:
        # https://learning.oreilly.com/library/view/reinforcement-learning-with/9781788835725/ffd21cf7-d907-45e6-a897-8762c9a20f2d.xhtml
        # It is equivalent to Q += alpha * (td_target - Q)
        self.Q[pos][vel][action] = (1 - self.alpha)*self.Q[pos][vel][action] + self.alpha * td_target

def sarsa_train(agent, env):
    best_reward = -float('inf')
    with SummaryWriter() as writer:

      for episode in range(MAX_NUM_EPISODES):
          done = False
          obs = env.reset()
          # SARSA picks the first action before the loop and carries the next
          # action forward, so every update uses the tuple (s, a, r, s', a')
          action = agent.get_action(obs)
          total_reward = 0.0
          while not done:
              next_obs, reward, done, info = env.step(action)
              next_action = agent.get_action(next_obs)
              agent.learn(obs, action, reward, next_obs, next_action)
              obs, action = next_obs, next_action
              total_reward += reward
          if total_reward > best_reward:
              best_reward = total_reward
          print("SARSA Episode#:{} reward:{} best_reward:{} eps:{}".format(episode,
                                      total_reward, best_reward, agent.epsilon))

          # Log per-episode statistics ('reward' is the reward of the last step)
          stats = {'learning_rate': agent.alpha, 'discount_factor': agent.gamma,
                  'epsilon': agent.epsilon, 'reward': reward, 
                  'best_reward': best_reward, 'total_reward': total_reward}
          
          writer.add_scalars('SARSA_LearnAgent', stats, episode)

    # Return the trained policy
    return np.argmax(agent.Q, axis=2)

def sarsa_test(agent, env, policy):
    done = False
    obs = env.reset()
    total_reward = 0.0
    while not done:
        action = policy[agent.discretize(obs)]
        next_obs, reward, done, info = env.step(action)
        obs = next_obs
        total_reward += reward
    return total_reward

# def show_video():
#   mp4list = glob.glob('video/*.mp4')
#   if len(mp4list) > 0:
#     mp4 = mp4list[0]
#     video = io.open(mp4, 'r+b').read()
#     encoded = base64.b64encode(video)
#     ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
#                 loop controls style="height: 400px;">
#                 <source src="data:video/mp4;base64,{0}" type="video/mp4" />
#              </video>'''.format(encoded.decode('ascii'))))
#   else: 
#     print("Could not find video")

# def wrap_env(env):
#   env = gym.wrappers.Monitor(env, './video', force=True)
#   return env

# if __name__ == "__main__":
#     # env = wrap_env(gym.make('MountainCar-v0')) #wrapping the env to render as a video
#     env = gym.make('MountainCar-v0')
#     # env = gym.wrappers.Monitor(env, "recording", force=True)
    
#     sarsa_agent = SARSA_Learner(env)
#     sarsa_learned_policy = train(sarsa_agent, env)
#     # Use the Gym Monitor wrapper to evaluate the agent and record video
#     # gym_monitor_path = "./gym_monitor_output"
#     # env = gym.wrappers.Monitor(env, gym_monitor_path, force=True)
#     # show_video()
#     # writer.flush()
    
#     q_agent = Q_Learner(env)
#     q_learned_policy = train(q_agent, env)
#     # Use the Gym Monitor wrapper to evaluate the agent and record video
#     # gym_monitor_path = "./gym_monitor_output"
#     # env = gym.wrappers.Monitor(env, gym_monitor_path, force=True)
#     writer.flush()
    
#     for _ in range(1000):
#         test_1 = test(q_agent, env, q_learned_policy)
#         test_2 = test(sarsa_agent, env, sarsa_learned_policy)
#         # show_video()
#     env.close()

In [None]:
if __name__ == "__main__":
    
    env = gym.make('MountainCar-v0')
    
    sarsa_agent = SARSA_Learner(env)  
    q_agent = Q_Learner(env)
    
    sarsa_learned_policy = sarsa_train(sarsa_agent, env)
    q_learned_policy = q_train(q_agent, env)
    
    writer.flush()
    
    for _ in range(1000):
        test_1 = q_test(q_agent, env, q_learned_policy)
        test_2 = sarsa_test(sarsa_agent, env, sarsa_learned_policy)
        # show_video()
    env.close()

Streaming output truncated to the last 5000 lines.
SARSA Episode#:20001 reward:-200.0 best_reward:-155.0 eps:0.004999500037173707
SARSA Episode#:20002 reward:-200.0 best_reward:-155.0 eps:0.004999500037173707
SARSA Episode#:20003 reward:-200.0 best_reward:-155.0 eps:0.004999500037173707
SARSA Episode#:20004 reward:-200.0 best_reward:-155.0 eps:0.004999500037173707
SARSA Episode#:20005 reward:-200.0 best_reward:-155.0 eps:0.004999500037173707
SARSA Episode#:20006 reward:-200.0 best_reward:-155.0 eps:0.004999500037173707
SARSA Episode#:20007 reward:-200.0 best_reward:-155.0 eps:0.004999500037173707
SARSA Episode#:20008 reward:-200.0 best_reward:-155.0 eps:0.004999500037173707
SARSA Episode#:20009 reward:-200.0 best_reward:-155.0 eps:0.004999500037173707
SARSA Episode#:20010 reward:-200.0 best_reward:-155.0 eps:0.004999500037173707
SARSA Episode#:20011 reward:-200.0 best_reward:-155.0 eps:0.004999500037173707
SARSA Episode#:20012 reward:-200.0 best_reward:-155.0 eps:0.004999

## Conclusion: 

1. Use the Q-learning example as the base code and compare the SARSA and Q-learning approach based on the same setting for the learning rate (alpha) and the discount factor (gamma).

*   The Q-Learner converges faster than SARSA at the same learning rate and discount factor. The reason is that Q-learning takes a greedy approach: it bootstraps from the largest action value in the next state, so it moves towards maximising reward as fast as possible.


2. Observe the performance of the learning algorithms by changing the learning rate by 4 values and gamma by 2.

*   The discount factor (gamma) is decayed dynamically over a range of values after every 185 steps. As observed, the Q-Learner performs better even as gamma decays, whereas SARSA does not converge as quickly. The learning rate (alpha) is also changed dynamically over time, after every 100000 episodes. SARSA performs poorly because it takes smaller steps at smaller learning rates, whereas Q-learning follows a curve towards convergence.

3. Tabulate/plot the performance vs. the changes in these two hyperparameters and elaborate on the results.

*   The learning rate is changed dynamically over time, and SARSA's performance changes drastically as a result. SARSA takes a more conservative approach, so as the learning rate decreases, its ability to converge also decreases significantly.
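
One way to organise the comparison asked for in items 2 and 3 is a small grid over the two hyperparameters. The sketch below is a hedged illustration only: the specific alpha and gamma values, the `sweep` and `to_markdown` helpers, and the `placeholder_run` stand-in are all assumptions, not part of the notebook, and no measured results are implied.

```python
from itertools import product

# Hypothetical grid: 4 learning rates x 2 discount factors (values are assumptions).
ALPHAS = [0.01, 0.05, 0.1, 0.5]
GAMMAS = [0.9, 0.99]

def sweep(run_fn, alphas=ALPHAS, gammas=GAMMAS):
    """Call run_fn(alpha, gamma) for every pair and collect (alpha, gamma, score)."""
    return [(a, g, run_fn(a, g)) for a, g in product(alphas, gammas)]

def to_markdown(rows):
    """Render the sweep results as a Markdown table string."""
    lines = ["| alpha | gamma | score |", "| --- | --- | --- |"]
    for a, g, score in rows:
        lines.append("| {} | {} | {} |".format(a, g, score))
    return "\n".join(lines)

def placeholder_run(alpha, gamma):
    # Stand-in only: a real run_fn would train an agent with these settings
    # and return, for example, the mean total reward over the test episodes.
    return None

rows = sweep(placeholder_run)
table = to_markdown(rows)
print(table)
```

With the classes in this notebook, a real `run_fn` could construct an agent, overwrite `agent.alpha` and `agent.gamma`, call `q_train` or `sarsa_train`, and average the return of `q_test` or `sarsa_test` over a fixed number of episodes.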

In [None]:
# %tensorboard dev upload --logdir runs

In [None]:
# !tensorboard --logdir runs/

In [None]:
!tensorboard dev upload --logdir runs 

2020-10-15 00:49:38.249302: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1

***** TensorBoard Uploader *****

This will upload your TensorBoard logs to https://tensorboard.dev/ from
the following directory:

runs

This TensorBoard will be visible to everyone. Do not upload sensitive
data.

Your use of this service is subject to Google's Terms of Service
<https://policies.google.com/terms> and Privacy Policy
<https://policies.google.com/privacy>, and TensorBoard.dev's Terms of Service
<https://tensorboard.dev/policy/terms/>.

This notice will not be shown again while you are logged into the uploader.
To log out, run `tensorboard dev auth revoke`.

Continue? (yes/NO) yes

Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=373649185512-8v619h5kft38l4456nm2dj4ubeqsrvh6.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scop

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# !rm -r ./runs/

In [None]:
!tensorboard dev auth revoke

2020-10-15 01:11:12.776175: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Logged out of uploader.


In [None]:
# !cp -r runs/ drive/"My Drive"