# ASSIGNMENT 3
# Submission Deadline: 20/03/2024 at 10 AM
# Submission Link: https://forms.gle/b8s6xYHUYqTJtSNeA 

# Table of Contents

1. [Provide Information](#Provide-Information)
2. [Instructions](#Instructions)
3. [Environment](#Environment)
4. [Hyperparameters](#Hyperparameters)
5. [Helper Functions](#helper)
6. [Deep Value Based RL Agents](#deep-value-based)
7. [Deep Policy Based RL Agents](#deep-policy-based)
8. [Experiments to Run](#experiments)

# Provide Information
<a id="Provide-Information"></a>

Name: **Kislay Aditya Oj**

Roll No.: **210524**

IITK EMail: **kislay21@iitk.ac.in**

# Instructions
<a id="Instructions"></a>


**Read all the instructions below carefully before you start working on the assignment.**
- The purpose of this course is that you learn RL and the best way to do that is by implementation and experimentation.
- The assignment requires you to implement some algorithms, and you are required to report your findings after experimenting with those algorithms.
- **You are required to submit a ZIP file containing a Jupyter notebook (.ipynb) and an image folder. The notebook would include the code, graphs/plots of the experiments you run and your findings/observations. The image folder is the folder having plots, images, etc.**
- In case you use any maths in your explanations, render it using latex in the Jupyter notebook.
- You are expected to implement the algorithms on your own and not copy them from other sources/classmates. Of course, you can refer to lecture slides.
- If you use any reference or material (including code), please cite the source, else it will be considered plagiarism. But referring to other sources that directly solve the problems given in the assignment is not allowed. There is a limit to which you can refer to outside material.
- This is an individual assignment.
- In case your solution is found to overlap with a solution by someone else (including external sources), all the parties involved will get zero in this and all future assignments, plus further penalties in the overall grade. We will check not just for lexical but also semantic overlap. The same applies to the code as well. Even an iota of cheating will NOT be tolerated. Whether you cheat on one line or one page, the penalty will be the same.
- Be a smart agent, think long term, if you cheat we will discover it somehow, the price you would be paying is not worth it.
- In case you are struggling with the assignment, seek help from TAs. Cheating is not an option! I respect honesty and would be lenient if you are not able to solve some questions due to difficulty in understanding. Remember we are there to help you out, seek help if something is difficult to understand.
- The deadline for the submission is given above. Submit at least 30 minutes before the deadline; a lot can happen at the last moment: your internet can fail, there can be a power failure, you can be abducted by aliens, etc.
- You have to submit your assignment via the Google Form (link above).
- The form will close after the deadline and we will not accept any solution after that. No reason whatsoever will be accepted for not being able to submit before the deadline.
- Since the assignment involves experimentation, reporting your results and observations, there is a lot of scope for creativity and innovation and presenting new perspectives. Such efforts would be highly appreciated and accordingly well rewarded. Be an exploratory agent!
- Your code should be very well documented, there are marks for that.
- In your plots, have a clear legend, clear lines, etc. You will of course be generating the plots in your code, but you must also put these plots in your notebook. Generate high-resolution PDF/SVG versions of the plots so that they do not pixelate on zooming.
- For all experiments, report the seed used in the code documentation.
- In your notebook write about all things that are not obvious from the code e.g., if you have made any assumptions, references/sources, running time, etc.
-  **DO NOT Forget to write name, roll no and email details above**
- **In addition to checking your code, we will be conducting one-on-one viva for the evaluation. So please make sure that you do not cheat!**
- **Use of LLM-based tools or AI-based code tools is strictly prohibited! Use of ChatGPT, VS Code, Gemini, Copilot, etc. is not allowed. NOTE: VS Code is also not allowed. Even in Colab, disable the AI assistant. If you use it, we will know it very easily. Use of any of these tools will be counted as cheating and will be given a ZERO, with no questions asked.**
- For each sub-part of a question, create a new cell below the question and put your answer in there. This includes the plots as well.

# OpenAI Gym Environments
<a id="Environment"></a>

In [None]:
# all imports go in here
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import time 
import copy 
import random
import itertools 

In this assignment we will be exploring Deep RL algorithms, and for this we will be using environments provided by OpenAI Gym. In particular, we will be exploring the "CartPole-v0" and "MountainCar-v0" environments (https://gymnasium.farama.org/environments/classic_control/ ). The code to instantiate the environments is given in the cells below. Run these cells and play with the environments to learn more about them.

In [None]:
# Create CartPole environment
#https://gymnasium.farama.org/environments/classic_control/cart_pole/

env = gym.make('CartPole-v1')
# Gymnasium seeds through reset(), e.g. env.reset(seed=34), instead of env.seed(34)
print("Observation Space = ")
print(env.observation_space)
print("Action Space = ")
print(env.action_space)
for episode in range(20):
    print("In episode {}".format(episode))
    # reset() returns (observation, info); reset at the start of every episode
    s, info = env.reset()
    for i in range(100):
        env.render()
        print(s)
        a = env.action_space.sample()
        s, r, terminated, truncated , info = env.step(a)
        if terminated or truncated:
            print("Finished after {} timestep".format(i+1))
            break  # the episode is over, do not step a finished environment
env.close()

In [None]:
# Create MountainCar environment: 
# https://gymnasium.farama.org/environments/classic_control/mountain_car/

env = gym.make('MountainCar-v0')
# Gymnasium seeds through reset(), e.g. env.reset(seed=45), instead of env.seed(45)
print("Observation Space = ")
print(env.observation_space)
print("Action Space = ")
print(env.action_space)
for episode in range(20):
    print("In episode {}".format(episode))
    # reset() returns (observation, info); reset at the start of every episode
    s, info = env.reset()
    for i in range(100):
        env.render()
        print(s)
        a = env.action_space.sample()
        s, r, terminated, truncated , info = env.step(a)
        if terminated or truncated :
            print("Finished after {} timestep".format(i+1))
            break  # the episode is over, do not step a finished environment
env.close()

# Hyperparameters
<a id="Hyperparameters"></a>

All your hyperparameters should be stated here. We will change their value here and your code should work  accordingly. 

In [None]:
# mention the values of all the hyperparameters (you can add more hyperparameters as well) to be used in 
# the entire notebook; put the values that gave the best performance and were finally used for the agent

gamma = 0.99
epsilon = 0.5        # epsilon-greedy strategy
temp = 0.4           # softmax strategy
delta = 1            # Huber loss
tau = 0.1            # D3QN
alpha = 0.1          # D3QN-PER
beta = 0.9           # D3QN-PER
beta_rate = 0.99992  # D3QN-PER
MAX_TRAIN_EPISODES = 1000
MAX_EVAL_EPISODES = 1

# Helper Functions
<a id="helper"></a>

Write all the helper functions that will be used for value-based and policy based algorithms below. In case you want to add more helper functions, please feel free to add.

In [None]:
def selectGreedyAction(net, state):
    #this function gets q-values via the network and selects greedy action from q-values and returns it
    #Your code goes in here
    
    q_values = net(state)
    greedyaction = torch.argmax(q_values, dim=1)[0]

    return greedyaction.detach().item()
    
    

In [None]:
def selectEpsilonGreedyAction(net,  state, epsilon = 0.9 ):
    #this function gets q-values via the network and selects an action from q-values using epsilon greedy strategy
    #and returns it
    #note this function can be used for decaying epsilon greedy strategy, 
    #you would need to create a wrapper function that will handle decaying epsilon
    #you can create this wrapper in this helper function section
    #for the agents you would be implementing it would be nice to play with decaying parameter to get optimal 
    #results
    
    #Your code goes in here
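    random_value = random.random()
    if random_value <= epsilon:
        # explore: uniform random action; the number of actions is read from the
        # network output so the function does not depend on a global `env`
        with torch.no_grad():
            num_actions = net(state).shape[-1]
        action = random.randrange(num_actions)
    else :
        # exploit: greedy action with respect to the current q-values
        action = selectGreedyAction(net, state)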
        
    return action
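
The comments above mention a wrapper that handles decaying epsilon. One possible sketch is a small callable class that keeps the schedule state and exposes the `(net, state)` call signature used for exploration strategies elsewhere in this notebook. The class name `DecayingEpsilon`, the exponential schedule, and the default values are assumptions made for illustration and are not part of the assignment template. A stand-in selector (`dummy_select`) is used so that the block runs on its own.

```python
import random


class DecayingEpsilon:
    """Hypothetical wrapper: decays epsilon exponentially on every call."""

    def __init__(self, select_fn, eps_start=1.0, eps_min=0.05, decay=0.995):
        self.select_fn = select_fn   # e.g. selectEpsilonGreedyAction
        self.epsilon = eps_start
        self.eps_min = eps_min
        self.decay = decay

    def __call__(self, net, state):
        action = self.select_fn(net, state, self.epsilon)
        # decay after each action, never dropping below eps_min
        self.epsilon = max(self.eps_min, self.epsilon * self.decay)
        return action


# Stand-in selector so the sketch is self-contained (two actions, greedy = 0)
def dummy_select(net, state, epsilon):
    return random.randrange(2) if random.random() <= epsilon else 0


strategy = DecayingEpsilon(dummy_select, eps_start=1.0, eps_min=0.1, decay=0.5)
actions = [strategy(None, None) for _ in range(5)]
print(actions, strategy.epsilon)
```

In the notebook, `select_fn` would be `selectEpsilonGreedyAction`, and the resulting object could be passed wherever an exploration strategy function is expected. A linear schedule would work equally well; the exponential form is only one option.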

In [None]:
def selectSoftMaxAction(net, state, temp):
    #this function gets q-values via the network and selects an action from q-values using softmax strategy
    #and returns it
    #note this function can be used for decaying temperature softmax strategy, 
    #you would need to create a wrapper function that will handle decaying temperature
    #you can create this wrapper in this helper function section
    #for the agents you would be implementing it would be nice to play with decaying parameter to get optimal 
    #results
    
    #Your code goes in here
    q_values = net(state)
    
    softmax_probs = F.softmax(q_values / temp, dim=1)
    
    soft_action = torch.multinomial(softmax_probs, 1).item()
    
    return soft_action
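
To build intuition for the temperature parameter, the following standalone NumPy sketch prints the softmax action probabilities for a few temperatures. The Q-values and the helper name `softmax_probs` are made up for illustration; the notebook itself uses `F.softmax` on the network output.

```python
import numpy as np


def softmax_probs(q_values, temp):
    # subtract the max before exponentiating for numerical stability
    z = (q_values - np.max(q_values)) / temp
    e = np.exp(z)
    return e / e.sum()


q = np.array([1.0, 2.0, 3.0])   # made-up Q-values for three actions
for temp in (0.1, 1.0, 10.0):
    print(temp, np.round(softmax_probs(q, temp), 3))
```

A low temperature concentrates almost all probability on the greedy action, while a high temperature moves the distribution towards uniform. This is why decaying the temperature over training shifts the agent from exploration to exploitation.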

In [None]:
class Network(nn.Module):
    def __init__(self, inDim, outDim, hDim=[32, 32], activation=nn.ReLU()):
        super(Network, self).__init__()
        layers = []
        layers.append(nn.Linear(inDim, hDim[0]))
        layers.append(activation)
        
        for i in range(len(hDim) - 1):
            layers.append(nn.Linear(hDim[i], hDim[i + 1]))
            layers.append(activation)
            
        layers.append(nn.Linear(hDim[-1], outDim))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
    

In [None]:
class DuelNetwork(nn.Module):
    def __init__(self, inDim, outDim, hDim=[32, 32], activation=nn.ReLU()):
        super(DuelNetwork, self).__init__()

        self.fc1 = nn.Linear(inDim, hDim[0])
        self.act = activation
        self.fc_value = nn.Linear(hDim[0], hDim[1])
        self.fc_adv = nn.Linear(hDim[0], hDim[1])

        self.value = nn.Linear(hDim[1], 1)
        self.adv = nn.Linear(hDim[1], outDim)

    def forward(self, state):
        y = self.act(self.fc1(state))
        value = self.act(self.fc_value(y))
        adv = self.act(self.fc_adv(y))

        value = self.value(value)
        adv = self.adv(adv)

        advAverage = torch.mean(adv, dim=1, keepdim=True)
        Q = value + adv - advAverage

        return Q
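
The aggregation used in `DuelNetwork.forward`, Q = V + A - mean(A), can be checked on a tiny hand-made example. The tensors below are made-up numbers, not outputs of a trained network. The sketch illustrates that, after centering the advantages, the mean of Q over actions equals V and the greedy action is unchanged.

```python
import torch

value = torch.tensor([[2.0], [-1.0]])                    # V(s), shape (batch, 1)
adv = torch.tensor([[1.0, 3.0, 5.0], [0.0, 0.0, 6.0]])   # A(s, a), shape (batch, actions)

# Same aggregation as in DuelNetwork.forward: Q = V + A - mean_a A
q = value + adv - adv.mean(dim=1, keepdim=True)

print(q)
print(q.mean(dim=1, keepdim=True))   # matches `value` up to floating-point error
print(q.argmax(dim=1), adv.argmax(dim=1))
```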

In [None]:
#Value Network
def createValueNetwork(inDim, outDim, hDim = [32,32], activation = nn.ReLU()):
    #this creates a Feed Forward Neural Network class and instantiates it and returns the class
    #the class should be derived from torch nn.Module and it should have init and forward method at the very least
    #the forward function should return q-value for each possible action
    
    #Your code goes in here
    return Network(inDim, outDim, hDim, activation)

In [None]:
#Dueling Network
def createDuelingNetwork(inDim, outDim, hDim = [32,32], activation = nn.ReLU()):
    #this creates a Feed Forward Neural Network class and instantiates it and returns the class
    #the class should be derived from torch nn.Module and it should have init and forward method at the very least
    #the forward function should return q-value which is derived 
    #internally from action-advantage function and v-function, 
    #Note we center the advantage values, basically we subtract the mean from each state-action value
    
    #Your code goes in here
    
    return DuelNetwork(inDim, outDim, hDim, activation)

In [None]:
#Policy Network
def createPolicyNetwork(inDim, outDim, hDim = [32,32], activation = nn.ReLU()):
    #this creates a Feed Forward Neural Network class and instantiates it and returns the class
    #the class should be derived from torch nn.Module and it should have init and forward method at the very least
    #the forward function should return action logit vector 
    
    #Your code goes in here
    return Network(inDim, outDim, hDim, activation)

In [None]:
def plotQuantity(quantityListDict, totalEpisodeCount, descriptionList,window_size=10):
    # This function takes in the quantityListDict and plots quantity vs episodes.
    # quantityListDict = {envInstanceCount: quantityList}
    # quantityList is list of the quantity per episode,
    # for example, it could be mean reward per episode, train time per episode, etc.
    #
    # NOTE: len(quantityList) == totalEpisodeCount
    #
    # Since we run multiple instances of the environment, there will be variance across environments
    # so in the plot, you will plot per episode maximum, minimum, and average value across all env instances
    # Basically, you need to envelop (e.g., via color) the quantity between max and min with mean value in between
    #
    # use the descriptionList parameter to put legends, title, etc.
    # For each of the plot, create the legend on the left/right side so that it doesn't overlay on the
    # plot lines/envelop.
    #
    # this is a generic function and can be used to plot any of the quantity of interest
    # In particular we will be using this function to plot:
    #        mean train rewards vs episodes
    #        mean evaluation rewards vs episodes
    #        total steps vs episode
    #        train time vs episode
    #        wall clock time vs episode
    #
    # this function doesn't return anything
    
    
    # Stack all runs into one (numRuns, totalEpisodeCount) array,
    # truncating long runs and padding short runs with NaN
    runs = []
    for quantityList in quantityListDict.values():
        quantityList = np.array(quantityList, dtype=float)[:totalEpisodeCount]
        quantityList = np.pad(quantityList, (0, totalEpisodeCount - len(quantityList)), mode='constant', constant_values=np.nan)
        runs.append(quantityList)
    runs = np.stack(runs)
    
    # NaN-aware statistics, so that padded entries do not turn the
    # per-episode mean/max/min into NaN
    mean_values = np.nanmean(runs, axis=0)
    max_values = np.nanmax(runs, axis=0)
    min_values = np.nanmin(runs, axis=0)
    
    episode_numbers = np.arange(1, totalEpisodeCount + 1)
    
    # Applying simple moving average 
    kernel = np.ones(window_size) / window_size
    mean_values_smooth = np.convolve(mean_values, kernel, mode='valid')
    
    # Plot 
    plt.fill_between(episode_numbers[window_size-1:], min_values[window_size-1:], max_values[window_size-1:], alpha=0.3)
    plt.plot(episode_numbers[window_size-1:], mean_values_smooth, label=descriptionList[2])
    
    plt.xlabel('Episodes')
    plt.ylabel(descriptionList[0])
    plt.title(descriptionList[1])
    plt.legend(loc='upper left', bbox_to_anchor=(1, 1))

    

In [None]:
def huberLoss(error, delta):
    #this function calculates the huber loss for the error using the delta parameter
    
    #Your code goes in here
    abs_error = torch.abs(error)
    # torch.clamp accepts a plain Python number for delta (torch.minimum needs two tensors)
    quadratic = torch.clamp(abs_error, max=delta)
    linear = abs_error - quadratic
    hLoss = 0.5 * quadratic**2 + delta * linear
    return hLoss.mean()
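
A quick sanity check for a hand-written Huber loss is to compare it with PyTorch's built-in `torch.nn.functional.huber_loss`. The sketch below re-implements the quadratic/linear split in a standalone function (named `huber` here, using `torch.clamp` so that `delta` can be a plain Python number) and evaluates both on a few made-up error values.

```python
import torch
import torch.nn.functional as F


def huber(error, delta):
    # quadratic for |error| <= delta, linear beyond it
    abs_error = torch.abs(error)
    quadratic = torch.clamp(abs_error, max=delta)
    linear = abs_error - quadratic
    return (0.5 * quadratic ** 2 + delta * linear).mean()


error = torch.tensor([-3.0, -0.5, 0.0, 0.25, 2.0])   # made-up TD errors
delta = 1.0
ours = huber(error, delta)
# the built-in takes (input, target); a zero target makes input - target == error
reference = F.huber_loss(error, torch.zeros_like(error), delta=delta)
print(ours.item(), reference.item())
```

If the two numbers disagree, the usual culprits are a missing factor of 0.5 in the quadratic branch or a missing `delta` in the linear branch.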

In [None]:
#in case you want to create any other helper function, the code goes in here


In [None]:
#in case you want to create any other helper function, the code goes in here

# Deep Value Based RL agents.
<a id="deep-value-based"></a>

### The purpose of this part is to learn about different Deep Value Based RL agents.

In this part of the assignment you will be implementing the Deep RL algorithms we learnt in the lectures. Namely, we will be implementing NFQ, DQN, Double DQN (DDQN), Duelling Double DQN (D3QN), and Duelling Double DQN with Prioritized Experience Replay (D3QN-PER). For all the algorithms below, this time we will not be specifying the hyper-parameters; please play with the hyper-params to come up with the best values. This way you will learn to tune the model. Some of the values were specified in the lecture, and those would be a good starting point. Your aim is to develop the best NFQ/DQN/DDQN/D3QN/D3QN-PER agent for each setting.  

For those of you who follow TEDEd, here is an interesting video by TED on DQN and Atari Games: https://www.youtube.com/watch?v=PP8Zc778B8s 

Also, since these environments are available in Gymnasium, there are public leaderboards (https://github.com/openai/gym/wiki/Leaderboard) for each of these environments. Compare where your agent stands on the leaderboard for each of these environments, and try to tune your agents so that they are at the top of the leaderboard. In fact, if your agent performs well on these environments, you can also make your entry on the leaderboard.  

## <font color='green'> Do not change any Class/Methods definition. We have split the class methods across cells for code readability purposes. This requires inheriting from the same class, please do not change it. </font>

## ReplayBuffer 

In the next few cells, you will implement the ReplayBuffer class. 

This class creates a buffer for storing and retrieving experiences. This is a generic class and can be used
for different agents like NFQ, DQN, DDQN, PER_DDQN, etc. 
Following are the methods for this class which are implemented in subsequent cells

```
class ReplayBuffer():
    def __init__(self, bufferSize, **kwargs)
    def store(self, experience)
    def update(self, indices, priorities) 
    def collectExperiences(env, state, explorationStrategy, net = None)
    def sample(self, batchSize, **kwargs)
    def splitExperiences(self, experiences)
    def length(self)
```   

In [None]:
class ReplayBuffer():
    def __init__(self, bufferSize, bufferType = 'DQN', **kwargs):
        # this function creates the relevant data-structures, and initializes all relevant variables
        # it can take variable number of parameters like alpha, beta, beta_rate (required for PER)
        # here the bufferType variable can be used to maintain one class for all types of agents
        # using the bufferType parameter in the methods below, you can implement all possible functionalities 
        # that could be used for different types of agents
        
        # permissible values for bufferType = NFQ, DQN, DDQN, D3QN and PER-D3QN
        
        #Your code goes in here
        self.bufferSize = bufferSize
        self.bufferType = bufferType
        self.buffer     = deque(maxlen = bufferSize)
        
        self.is_priority = False
        if bufferType == 'PER-D3QN':
            self.is_priority = True
            self.priority_buffer = deque(maxlen = bufferSize)
            self.priority_alpha = kwargs['priority_alpha']
            self.priority_beta  = kwargs['priority_beta']
            self.priority_beta_rate = kwargs['priority_beta_rate']
        

In [None]:
class ReplayBuffer(ReplayBuffer):
    def store(self, experience):
        #stores the experiences, based on parameters in init it can assign priorities, etc.  
        #
        #this function does not return anything
        #
        #Your code goes in here
        
        # for normal cases 
        
        #print("I am here .......... in self store fun")
        self.buffer.append(experience)
        
        # for PER-D3QN
        if self.is_priority:
            if self.priority_buffer:
                max_priority = max(self.priority_buffer)
            else:
                max_priority = 1.0  # Set max_priority to a default value if priority_buffer is empty
            self.priority_buffer.append(max_priority)

In [None]:
class ReplayBuffer(ReplayBuffer):
    def update(self, indices, priorities):
        # This is mainly used for PER-DDQN
        # Otherwise just have a pass in this method
        #
        # This function does not return anything
        #
        # Your code goes in here
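        if self.is_priority:
            priority_epsilon = 0.001
            # one priority per sampled transition: |TD error| + small constant
            td_errors = torch.as_tensor(priorities).detach().abs().flatten()
            if td_errors.numel() == 1:
                # a single value was passed: share it across all indices
                td_errors = td_errors.expand(len(indices))
            for index, td_error in zip(indices, td_errors.tolist()):
                self.priority_buffer[index] = td_error + priority_epsilon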



In [None]:
class ReplayBuffer(ReplayBuffer):
    def collectExperiences(self,env, state, explorationStrategy, countExperiences, net = None):
        #this method allows the agent to interact with the environment starting from a state and it collects
        #experiences during the interaction; it uses the network to get the value function and uses the exploration 
        #strategy to select actions. It collects countExperiences experiences; if the environment terminates  
        #before that, it returns early, and the function calling this method needs to handle early termination.
        #
        #this function does not return anything
        #
        #Your code goes in here
        
        #print("I am here .......... in collect exp fun")
        
        experience_buffer = []
        reward_count = 0
        steps = 0
        flag = False 
        state , info = env.reset()
        
        while not flag :
            
           
            action = explorationStrategy(net, torch.tensor([state], dtype=torch.float32))
            new_state , reward , terminated , truncated , info = env.step(action)
            
            if countExperiences == -1 or steps <= countExperiences :
                experience_buffer.append([state , action , reward , terminated , truncated , new_state])
                
            reward_count += reward 
            
            if terminated or truncated :
                state , info = env.reset()
                break
            
            state = new_state
            
            steps+=1
            
            if countExperiences != -1 and len(experience_buffer) >= countExperiences:
                state , info = env.reset()
                flag = True
            
        self.episode_reward = reward_count 
        self.episode_total_steps = steps 
        
        # for PER-D3QN 
        if self.is_priority == True :
            # multiply beta by beta_annealing_rate typically 0.99992
            self.priority_beta =  self.priority_beta * self.priority_beta_rate
            
        if countExperiences != -1 and len(experience_buffer) < countExperiences : 
            return
        for exp in experience_buffer :
            self.store(exp)

In [None]:
class ReplayBuffer(ReplayBuffer):
    def sample(self, batchSize, **kwargs):
        # this method returns batchSize number of experiences
        # based on extra arguments, it could do sampling or it could return the latest batchSize experiences or
        # via some other strategy
        #
        # in the case of Prioritized Experience Replay (PER) the sampling needs to take into account the priorities
        #
        # this function returns experiences samples
        #
        #Your code goes in here
        
        
        # types of sampling -> latest , random and prioritized 
        
        if self.is_priority == True :
            
            # prioritized sampling
            
            temp = (np.array(self.priority_buffer) ** self.priority_alpha)
            probability = temp / np.sum(temp)
            probability_index = random.choices(range(len(self.buffer)), k=batchSize, weights=probability)
        
            exp_List = [self.buffer[i] for i in probability_index]

            weights = (len(self.buffer)*probability[probability_index])
            weights = weights ** (-1*self.priority_beta)
            weights = weights/np.max(weights)

            exp_List = [*zip(exp_List, weights , probability_index)]
            return exp_List
        
        else :
            
            # latest sampling
            
            if 'sampling_type' in kwargs and kwargs['sampling_type']=='latest':
                # a deque does not support slicing, so convert it to a list first
                exp_List = list(self.buffer)[-batchSize:]
                return exp_List
            
            # random sampling ( by default )
            
            else :
                exp_List = random.sample(self.buffer, batchSize )
                return exp_List
        
        
        
        return
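
The prioritized branch above follows the usual PER recipe: sampling probabilities P(i) = p_i^alpha / sum_k p_k^alpha and importance-sampling weights w_i = (N * P(i))^(-beta), rescaled so that the largest weight is 1. The NumPy sketch below works through these two formulas on a made-up buffer of four priorities. The values of `alpha` and `beta` are illustrative only, and the weights are normalised over the whole toy buffer (the cell above normalises within the sampled batch) to keep the printed numbers easy to inspect.

```python
import numpy as np

priorities = np.array([0.1, 0.5, 1.0, 2.0])   # made-up priorities p_i
alpha, beta = 0.6, 0.4                         # illustrative values only

# P(i) = p_i^alpha / sum_k p_k^alpha
scaled = priorities ** alpha
probs = scaled / scaled.sum()

# w_i = (N * P(i))^(-beta), rescaled so the largest weight equals 1
weights = (len(priorities) * probs) ** (-beta)
weights = weights / weights.max()

# draw a batch of indices according to the sampling probabilities
rng = np.random.default_rng(0)
batch_idx = rng.choice(len(priorities), size=3, p=probs)
print(np.round(probs, 3), np.round(weights, 3), batch_idx)
```

Transitions with low priority are sampled rarely but receive the largest weights, which compensates for the bias introduced by non-uniform sampling.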

In [None]:
class ReplayBuffer(ReplayBuffer):
    def splitExperiences(self, experiences):
        #it takes in experiences and gives the following:
        #states, actions, rewards, nextStates, dones
        #
        #Your code goes in here
        
        if self.is_priority == True :
            experiences , weights , index = [*zip(*experiences)]
        
        
        curr_states  = np.asarray([e[0] for e in experiences])
        actions      = np.asarray([e[1] for e in experiences])
        rewards      = np.asarray([e[2] for e in experiences])
        terminated_s = np.asarray([e[3] for e in experiences])
        truncated_s  = np.asarray([e[4] for e in experiences])
        new_states   = np.asarray([e[5] for e in experiences])
        
        curr_states_t   = torch.as_tensor(curr_states, dtype=torch.float32)
        actions_t       = torch.as_tensor(actions,dtype=torch.int64).unsqueeze(-1)
        rewards_t       = torch.as_tensor(rewards,dtype=torch.float32).unsqueeze(-1)
        terminated_s_t  = torch.as_tensor(terminated_s,dtype=torch.float32).unsqueeze(-1)
        truncated_s_t   = torch.as_tensor(truncated_s,dtype=torch.float32).unsqueeze(-1)
        new_states_t    = torch.as_tensor(new_states,dtype=torch.float32)
        
        if self.is_priority==True:
            weights = torch.tensor([weights])
            return curr_states_t, actions_t, rewards_t, new_states_t, terminated_s_t, truncated_s_t,weights, index

        
        return curr_states_t, actions_t, rewards_t, new_states_t, terminated_s_t, truncated_s_t

In [None]:
class ReplayBuffer(ReplayBuffer):
    def length(self):
        #tells the number of experiences stored in the internal buffer
        #
        #Your code goes in here
        return len(self.buffer)

## Neural Fitted Q (NFQ)

Implement the Neural Fitted Q algorithm. We have studied the NFQ algorithm in the lectures. Use the function definitions (given below).

This class implements the NFQ Agent, you are required to implement the various methods of this class
as outlined below. Note this class is generic and should work with any permissible Gym environment. 
Also, please feel free to play with different exploration strategies with decaying parameters (epsilon/temperature).

```
class NFQ():
    def __init__(env, seed, gamma, epochs,
                 bufferSize,
                 batchSize,
                 optimizerFn,
                 optimizerLR,
                 MAX_TRAIN_EPISODES, MAX_EVAL_EPISODES,
                 explorationStrategyTrainFn, 
                 explorationStrategyEvalFn)
    def initBookKeeping(self)
    def performBookKeeping(self, train = True)
    def runNFQ(self)
    def trainAgent(self)
    def trainNetwork(self, experiences, epochs)
    def evaluateAgent(self)
```
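
As a reminder of the regression target that fitted Q-learning methods use, the sketch below computes y = r + gamma * max_a' Q(s', a') for non-terminal transitions and y = r for terminal ones, and then takes the mean squared error against Q(s, a). The tiny linear network, the random batch, and all variable names are made up for illustration. This is a generic sketch of the update under those assumptions, not the reference solution for this assignment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gamma = 0.99
q_net = nn.Linear(4, 2)   # toy stand-in for a Q-network (4-dim state, 2 actions)

states = torch.randn(3, 4)
actions = torch.tensor([[0], [1], [0]])            # shape (batch, 1), int64
rewards = torch.tensor([[1.0], [1.0], [1.0]])
next_states = torch.randn(3, 4)
terminated = torch.tensor([[0.0], [0.0], [1.0]])   # 1.0 where the episode really ended

# Targets are treated as constants, so no gradient flows through them
with torch.no_grad():
    max_next_q = q_net(next_states).max(dim=1, keepdim=True)[0]
    targets = rewards + gamma * max_next_q * (1.0 - terminated)

# Q(s, a) for the actions actually taken
q_sa = q_net(states).gather(1, actions)
loss = ((targets - q_sa) ** 2).mean()
loss.backward()
print(targets.squeeze(1), loss.item())
```

Only the `terminated` flag masks the bootstrap term here; whether truncated transitions should also bootstrap is a design choice worth stating in the notebook.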

In [None]:
class NFQ():
    def __init__(self, env, seed, gamma,  
                 bufferSize,
                 batchSize,
                 optimizerFn,
                 optimizerLR,
                 MAX_TRAIN_EPISODES, MAX_EVAL_EPISODES,
                 explorationStrategyTrainFn, 
                 explorationStrategyEvalFn):
        #this NFQ method 
        # 1. creates and initializes (with seed) the environment, train/eval episodes, gamma, etc. 
        # 2. creates and initializes all the variables required for bookkeeping values via the initBookKeeping method
        # 3. creates Q-network using the createValueNetwork above
        # 4. creates and initializes (with network params) the optimizer function
        # 5. sets the explorationStrategy variables/functions for train and evaluation
        # 6. sets the batchSize for the number of experiences 
        # 7. Creates the replayBuffer
    
        #Your code goes in here
        # 1 , 6
        self.gamma = gamma 
        self.buffer_size = bufferSize
        self.batch_size = batchSize 
        self.max_train_ep = MAX_TRAIN_EPISODES
        self.max_test_ep = MAX_EVAL_EPISODES
        
        
        # 2
        self.train_reward = 0
        self.test_reward = 0
        self.total_steps = 0 
        self.cpu_time = 0
        self.wall_time = 0
        self.episode = 0
        self.initBookKeeping()
        
        
        # 5
        self.exploration_train = explorationStrategyTrainFn
        self.exploration_eval  = explorationStrategyEvalFn
        
        # reset() returns (observation, info); keep only the observation
        self.state, _ = env.reset(seed = seed)
        self.seed = seed 
        self.env = env 
        
        # 3
        inDim = int(np.prod(env.observation_space.shape))
        outDim = env.action_space.n
        # the Q-network is built with createValueNetwork, as stated in step 3 above
        self.policy_network = createValueNetwork(inDim, outDim, hDim = [512,128], activation = nn.ReLU())
        
        # 4
        self.optimizer = optimizerFn(self.policy_network.parameters(), lr=optimizerLR)
        
        # 7
        self.replay_buffer = ReplayBuffer(self.buffer_size, bufferType = 'NFQ')
    

In [None]:
class NFQ(NFQ):
    def initBookKeeping(self):
        #this method creates and initializes all the variables required for book-keeping values and it is called
        #by the init method

        #Your code goes in here
        self.train_reward_array = []
        self.test_reward_array = []
        self.total_steps_array = []
        self.total_cpu_array = []
        self.total_wall_array = []
        return
        

In [None]:
class NFQ(NFQ):
    def performBookKeeping(self, train = True):
        #this method updates relevant variables for the bookKeeping, this can be called 
        #multiple times during training
        #if you want you can print information using this, so it may help to monitor progress and also help to debug
        
    #Your code goes in here
        if train == False :
            self.test_reward_array.append(self.test_reward)
        else :
            self.train_reward_array.append(self.train_reward)
            self.total_steps_array.append(self.total_steps)
            self.total_cpu_array.append(self.cpu_time)
            self.total_wall_array.append(self.wall_time)
        
        return

In [None]:
class NFQ(NFQ):
    def runNFQ(self):
        #this is the main method, it trains the agent, performs bookkeeping while training and finally evaluates
        #the agent and returns the following quantities:
        #1. episode wise mean train rewards
        #2. episode wise mean eval rewards 
        #3. episode wise trainTime (in seconds): time elapsed during training since the start of the first episode 
        #4. episode wise wallClockTime (in seconds): actual time elapsed since the start of training, 
        #                               note this will include time for BookKeeping and evaluation 
        # Note both trainTime and wallClockTime get accumulated as episodes proceed. 
        
        
        #Your code goes in here
        
        
        self.trainAgent()
        
        return self.train_reward_array, self.total_cpu_array, self.total_steps_array, self.total_wall_array, self.test_reward_array
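
The difference between trainTime and wallClockTime described in the comments above can be made concrete with a small timing sketch: trainTime accumulates only the time spent in the training part of each episode, while wallClockTime is measured from the start of training and therefore also includes evaluation and bookkeeping. The functions `train_step` and `evaluate` are hypothetical placeholders that just sleep; they are not part of the notebook.

```python
import time


def train_step():
    time.sleep(0.01)   # placeholder for collecting experiences + optimising


def evaluate():
    time.sleep(0.01)   # placeholder for evaluation + bookkeeping


train_times, wall_times = [], []
train_elapsed = 0.0
wall_start = time.perf_counter()

for episode in range(3):
    t0 = time.perf_counter()
    train_step()
    train_elapsed += time.perf_counter() - t0             # training only
    evaluate()
    train_times.append(train_elapsed)                      # accumulated per episode
    wall_times.append(time.perf_counter() - wall_start)    # everything so far

print(train_times, wall_times)
```

Both lists are cumulative, so each should be non-decreasing, and the train time at any episode should not exceed the wall-clock time at that episode.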
    

In [None]:
class NFQ(NFQ):
    def trainAgent(self):
        #this method collects experiences and trains the NFQ agent and does BookKeeping while training. 
        #this calls the trainNetwork() method internally, it also evaluates the agent per episode
        #it trains the agent for MAX_TRAIN_EPISODES
        
        #Your code goes in here
        
        cpu_start = time.process_time()   # CPU time; time.time() would measure wall-clock time instead
        wall_start = time.perf_counter()
        
        for episode in range(self.max_train_ep):
            
            state , info = self.env.reset(seed = self.seed)
            
            self.replay_buffer.collectExperiences(self.env, state, self.exploration_train, countExperiences = -1, net = self.policy_network)
            
            if self.replay_buffer.length() < self.batch_size :
                continue 
                
            experience_buffer = self.replay_buffer.sample(self.batch_size)
            self.trainNetwork(experience_buffer , 5 )
            
            self.train_reward = self.replay_buffer.episode_reward 
            self.total_steps += self.replay_buffer.episode_total_steps
            self.cpu_time = time.process_time() - cpu_start
            self.wall_time = time.perf_counter() - wall_start
            
            self.episode = episode
            
            self.performBookKeeping(train=True)
            self.evaluateAgent()
            self.performBookKeeping(train=False)
            
        
        
        return self.train_reward_array, self.total_cpu_array, self.total_steps_array, self.total_wall_array
        

In [None]:
class NFQ(NFQ):
    def trainNetwork(self, experiences, epochs):
        # this method trains the value network epoch number of times and is called by the trainAgent function
        # it essentially uses the experiences to calculate target, using the targets it calculates the error, which
        # is further used for calculating the loss. It then uses the optimizer over the loss 
        # to update the params of the network by backpropagating through the network
        # this function does not return anything
        # you can try out other loss functions other than MSE like Huber loss, MAE, etc. 
        
        #Your code goes in here
        curr_states, actions, rewards, new_states, terminated_s, truncated_s = self.replay_buffer.splitExperiences(experiences)
        
        for _ in range(epochs):
            target_q_values      = self.policy_network(new_states).detach()
            max_target_q_values  = target_q_values.max(dim=1, keepdim=True)[0]
            
            q_values  = self.policy_network(curr_states)
            temp = torch.logical_or(terminated_s,truncated_s)
            targets = rewards + self.gamma * (torch.logical_not(temp)) * max_target_q_values
            action_q_values = torch.gather(input=q_values , dim=1 , index=actions)
            
            error = F.smooth_l1_loss(action_q_values , targets)  # Huber loss
            
            # Gradient Descent 
            self.optimizer.zero_grad()
            error.backward()
            self.optimizer.step()
        
        return
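
The comment in `trainNetwork` mentions MSE, Huber loss and MAE as candidate losses. As a hedged illustration of how they differ, the plain-Python sketch below evaluates the three per-sample losses on a small and a large TD error. The function names are made up for this example, and `delta = 1.0` is chosen to match the default behaviour of `F.smooth_l1_loss`.

```python
def mse(e):
    """Squared error: grows quadratically, so large TD errors dominate the gradient."""
    return e * e

def mae(e):
    """Absolute error: grows linearly, but has a kink at zero."""
    return abs(e)

def huber(e, delta=1.0):
    """Quadratic for |e| <= delta, linear beyond it."""
    a = abs(e)
    if a <= delta:
        return 0.5 * e * e
    return delta * (a - 0.5 * delta)

for e in (0.5, 3.0):
    print(f"error={e}: mse={mse(e)}, mae={mae(e)}, huber={huber(e)}")
```

For the large error the Huber loss stays close to the MAE value, which is why it is often preferred when targets are noisy.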
        

In [None]:
class NFQ(NFQ):
    def evaluateAgent(self):
        #this function evaluates the agent using the value network, it evaluates agent for MAX_EVAL_EPISODES
        #typically MAX_EVAL_EPISODES = 1
        
        #Your code goes in here
        reward_arr = []
        for e in range(self.max_test_ep):
            rs = 0
            
            st, info = self.env.reset()
            terminated = False
            truncated = False
            
            while not (terminated or truncated):
                act = self.exploration_eval(self.policy_network, torch.as_tensor(st, dtype=torch.float32).unsqueeze(0))
                st, rew, terminated, truncated, info = self.env.step(act)
                rs += rew
                
            reward_arr.append(rs)
            
        self.test_reward = sum(reward_arr) / len(reward_arr)

## Deep Q-Network (DQN) 

Implement the Deep Q-Network (DQN) algorithm covered in the lecture. Use the function definitions given below.

This class implements the DQN agent. You are required to implement the various methods of this class
as outlined below. Note that this class is generic and should work with any permissible Gym environment.

```
class DQN():
    def __init__(env, seed, gamma, 
                 bufferSize,
                 batchSize,
                 optimizerFn,
                 optimizerLR,
                 MAX_TRAIN_EPISODES, MAX_EVAL_EPISODES,
                 explorationStrategyTrainFn, 
                 explorationStrategyEvalFn,
                 updateFrequency)
    def initBookKeeping(self)
    def performBookKeeping(self, train = True)
    def runDQN(self)
    def trainAgent(self)
    def trainNetwork(self, experiences)
    def updateNetwork(self, onlineNet, targetNet)
    def evaluateAgent(self)
```
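
As a rough illustration of what `trainNetwork` has to compute, the sketch below builds the DQN regression targets $y = r + \gamma (1 - d) \max_{a'} Q_{\text{target}}(s', a')$ in plain Python. The function name `dqn_targets` and the toy numbers are assumptions made for this example; an actual implementation would work on batched tensors.

```python
def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """Return one TD target per transition.

    next_q_target[i] holds the target-network Q-values of the next state of
    transition i; dones[i] is True when that next state ended the episode.
    """
    targets = []
    for r, q_next, done in zip(rewards, next_q_target, dones):
        bootstrap = 0.0 if done else max(q_next)  # no bootstrapping past the end of an episode
        targets.append(r + gamma * bootstrap)
    return targets

rewards = [1.0, 1.0]
next_q_target = [[0.5, 2.0], [3.0, 1.0]]
dones = [False, True]
targets = dqn_targets(rewards, next_q_target, dones, gamma=0.9)
print(targets)
```

The second transition ends its episode, so its target is just the reward even though the target network predicts a large value for the next state.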

In [None]:
class DQN():
    def __init__(self , env, seed, gamma, 
                 bufferSize,
                 batchSize,
                 optimizerFn,
                 optimizerLR,
                 MAX_TRAIN_EPISODES, MAX_EVAL_EPISODES,
                 explorationStrategyTrainFn, 
                 explorationStrategyEvalFn,
                 updateFrequency):
        #this DQN method 
        # 1. creates and initializes (with seed) the environment, train/eval episodes, gamma, etc. 
        # 2. creates and initializes all the variables required for book-keeping values via the initBookKeeping method
        # 3. creates target and online Q-networks using the createValueNetwork above
        # 4. creates and initializes (with network params) the optimizer function
        # 5. sets the explorationStrategy variables/functions for train and evaluation
        # 6. sets the batchSize for the number of experiences 
        # 7. Creates the replayBuffer
        
        #Your code goes in here
        
        # 1 , 6
        self.gamma = gamma 
        self.buffer_size = bufferSize
        self.batch_size = batchSize 
        self.max_train_ep = MAX_TRAIN_EPISODES
        self.max_test_ep = MAX_EVAL_EPISODES
        self.update_frequency = updateFrequency
        
        # 2
        self.train_reward = 0
        self.test_reward = 0
        self.total_steps = 0 
        self.cpu_time = 0
        self.wall_time = 0
        self.episode = 0
        self.initBookKeeping()
        
        
        # 5
        self.exploration_train = explorationStrategyTrainFn
        self.exploration_eval  = explorationStrategyEvalFn
        
        self.state, _ = env.reset(seed = seed)  # reset() returns (observation, info)
        self.seed = seed 
        self.env = env 
        
        # 3
        inDim = int(np.prod(env.observation_space.shape))
        outDim = env.action_space.n
        self.policy_network = createPolicyNetwork(inDim, outDim, hDim = [512,128], activation = nn.ReLU())
        self.target_network = copy.deepcopy(self.policy_network)
        
        # 4
        self.optimizer = optimizerFn(self.policy_network.parameters(), lr=optimizerLR)
        
        # 7
        self.replay_buffer = ReplayBuffer(self.buffer_size, bufferType = 'DQN')

In [None]:
class DQN(DQN):
    def initBookKeeping(self):
        #this method creates and initializes all the variables required for book-keeping values and it is called
        #by the init method
        #
        # Your code goes in here
        
        self.train_reward_array = []
        self.test_reward_array = []
        self.total_steps_array = []
        self.total_cpu_array = []
        self.total_wall_array = []
        return

In [None]:
class DQN(DQN):
    def performBookKeeping(self, train = True):
        #this method updates relevant variables for the bookKeeping, this can be called 
        #multiple times during training
        #if you want you can print information using this, so it may help to monitor progress and also help to debug
        #
        # Your code goes in here
        
        if train == False :
            self.test_reward_array.append(self.test_reward)
        else :
            self.train_reward_array.append(self.train_reward)
            self.total_steps_array.append(self.total_steps)
            self.total_cpu_array.append(self.cpu_time)
            self.total_wall_array.append(self.wall_time)
        
        return

In [None]:
class DQN(DQN):
    def runDQN(self):
        #this is the main method, it trains the agent, performs bookkeeping while training and finally evaluates
        #the agent and returns the following quantities:
        #1. episode wise mean train rewards
        #2. episode wise mean eval rewards
        #3. episode wise trainTime (in seconds): time elapsed during training since the start of the first episode
        #4. episode wise wallClockTime (in seconds): actual time elapsed since the start of training,
        #                               note this will include time for BookKeeping and evaluation 
        # Note both trainTime and wallClockTime get accumulated as episodes proceed. 
        #
        #Your code goes in here
        
        self.trainAgent()
        
        return self.train_reward_array, self.total_cpu_array, self.total_steps_array, self.total_wall_array, self.test_reward_array
    
    
    
    
    

In [None]:
class DQN(DQN):
    def trainAgent(self):
        #this method collects experiences and trains the agent and does BookKeeping while training. 
        #this calls the trainNetwork() method internally, it also evaluates the agent per episode
        #it trains the agent for MAX_TRAIN_EPISODES
        #
        #Your code goes in here
        
        
        self.updateNetwork(self.policy_network, self.target_network)
        cpu_start = time.process_time()   # CPU time; time.time() would measure wall-clock time instead
        wall_start = time.perf_counter()
        
        for episode in range(self.max_train_ep):
            
            state , info = self.env.reset(seed = self.seed)
            
            self.replay_buffer.collectExperiences(self.env, state, self.exploration_train, countExperiences = -1, net = self.policy_network)
            
            if self.replay_buffer.length() < self.batch_size :
                continue 
                
            experience_buffer = self.replay_buffer.sample(self.batch_size)
            self.trainNetwork(experience_buffer , 5 )
            
            self.train_reward = self.replay_buffer.episode_reward 
            self.total_steps += self.replay_buffer.episode_total_steps
            self.cpu_time = time.process_time() - cpu_start
            self.wall_time = time.perf_counter() - wall_start
            
            self.episode = episode
            
            self.performBookKeeping(train=True)
            self.evaluateAgent()
            self.performBookKeeping(train=False)
            
            if episode % self.update_frequency == 0 :
                self.updateNetwork(self.policy_network, self.target_network)
        
        
        return self.train_reward_array, self.total_cpu_array, self.total_steps_array, self.total_wall_array

In [None]:
class DQN(DQN):
    def trainNetwork(self, experiences, epochs):
        # this method trains the value network epoch number of times and is called by the trainAgent function
        # it essentially uses the experiences to calculate target, using the targets it calculates the error, which
        # is further used for calculating the loss. It then uses the optimizer over the loss 
        # to update the params of the network by backpropagating through the network
        # this function does not return anything
        # you can try out other loss functions other than MSE like Huber loss, MAE, etc. 
        #
        #Your code goes in here
        
        curr_states, actions, rewards, new_states, terminated_s, truncated_s = self.replay_buffer.splitExperiences(experiences)
        
        for _ in range(epochs):
            target_q_values      = self.target_network(new_states).detach()
            max_target_q_values  = target_q_values.max(dim=1, keepdim=True)[0]
            
            q_values  = self.policy_network(curr_states)
            temp = torch.logical_or(terminated_s,truncated_s)
            targets = rewards + self.gamma * (torch.logical_not(temp)) * max_target_q_values
            action_q_values = torch.gather(input=q_values , dim=1 , index=actions)
            
            error = F.smooth_l1_loss(action_q_values , targets)  # Huber loss
            
            # Gradient Descent 
            self.optimizer.zero_grad()
            error.backward()
            self.optimizer.step()
        
        return

In [None]:
class DQN(DQN):
    def updateNetwork(self, onlineNet, targetNet):
        #this function updates the target network with the weights of the online network
        #
        # Your code goes in here
        targetNet.load_state_dict(onlineNet.state_dict())

        
        return

In [None]:
class DQN(DQN):
    def evaluateAgent(self):
        #this function evaluates the agent using the value network, it evaluates agent for MAX_EVAL_EPISODES
        #typically MAX_EVAL_EPISODES = 1
        #
        #Your code goes in here
        reward_arr = []
        for e in range(self.max_test_ep):
            rs = 0
            
            st, info = self.env.reset()
            terminated = False
            truncated = False
            
            while not (terminated or truncated):
                act = self.exploration_eval(self.policy_network, torch.as_tensor(st, dtype=torch.float32).unsqueeze(0))
                st, rew, terminated, truncated, info = self.env.step(act)
                rs += rew
                
            reward_arr.append(rs)
            
        self.test_reward = sum(reward_arr) / len(reward_arr)

        
        
        

In [None]:
DQN_trainRewardsList=dict()
DQN_trainTimeList = dict()
DQN_evalRewardsList = dict()
DQN_wallClockTimeList=dict()
DQN_finalEvalReward=dict()

env = gym.make('CartPole-v1')
env.reset()

#DQN
agent_DQN = DQN(env, seed=i, gamma=0.99, bufferSize=1000, batchSize=32, 
        optimizerFn=optim.Adam, optimizerLR=0.00075, MAX_TRAIN_EPISODES=500, MAX_EVAL_EPISODES=1, 
        explorationStrategyTrainFn= selectEpsilonGreedyAction, explorationStrategyEvalFn= selectGreedyAction, 
        updateFrequency=25)
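# runDQN() returns (train rewards, CPU time, total steps, wall-clock time, eval rewards), in that order
DQN_totalStepsList = dict()
(DQN_trainRewardsList[1], DQN_trainTimeList[1], DQN_totalStepsList[1],
 DQN_wallClockTimeList[1], DQN_evalRewardsList[1]) = agent_DQN.runDQN()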
#print(DQN_trainRewardsList[1])

plotQuantity(DQN_trainRewardsList, 500, ['Training Reward' , 'Training reward vs episodes' , 'DQN'])

## Double DQN (DDQN)

Implement the Double DQN agent covered in the lecture. Use the function definitions given below.

This class implements the Double DQN agent. You are required to implement the various methods of this class
as outlined below. Note that this class is generic and should work with any permissible Gym environment.

```
class DDQN():
    def __init__(env, seed, gamma, 
                 bufferSize,
                 batchSize,
                 optimizerFn,
                 optimizerLR,
                 MAX_TRAIN_EPISODES, MAX_EVAL_EPISODES,
                 explorationStrategyTrainFn, 
                 explorationStrategyEvalFn,
                 updateFrequency)
    def initBookKeeping(self)
    def performBookKeeping(self, train = True)
    def runDDQN(self)
    def trainAgent(self)
    def trainNetwork(self, experiences)
    def updateNetwork(self, onlineNet, targetNet)
    def evaluateAgent(self)
```
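
The difference from DQN lies only in how the bootstrap value is picked: the online network selects the next action and the target network evaluates it. The plain-Python sketch below contrasts the two targets on one toy transition. `ddqn_targets`, `argmax` and the numbers are assumptions for illustration, not part of the assignment template.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)

def ddqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Double DQN: action chosen by the online net, valued by the target net."""
    targets = []
    for r, q_on, q_tg, done in zip(rewards, next_q_online, next_q_target, dones):
        if done:
            targets.append(r)
            continue
        a_star = argmax(q_on)                     # selection with the online network
        targets.append(r + gamma * q_tg[a_star])  # evaluation with the target network
    return targets

rewards = [1.0]
next_q_online = [[0.2, 0.9]]  # online net prefers action 1
next_q_target = [[2.0, 1.0]]  # target net over-values action 0
dones = [False]

ddqn = ddqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.5)
dqn = [r + 0.5 * max(q) for r, q in zip(rewards, next_q_target)]  # plain DQN target for comparison
print(ddqn, dqn)
```

Here the Double DQN target is smaller than the DQN target because the maximum over the target network's own estimates is never taken, which is the mechanism that reduces overestimation.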

In [None]:
class DDQN():
    def __init__(self,env, seed, gamma, 
                 bufferSize,
                 batchSize,
                 optimizerFn,
                 optimizerLR,
                 MAX_TRAIN_EPISODES, MAX_EVAL_EPISODES,
                 explorationStrategyTrainFn, 
                 explorationStrategyEvalFn,
                 updateFrequency):
        #this DDQN method 
        # 1. creates and initializes (with seed) the environment, train/eval episodes, gamma, etc. 
        # 2. creates and initializes all the variables required for book-keeping values via the initBookKeeping method
        # 3. creates target and online Q-networks using the createValueNetwork above
        # 4. creates and initializes (with network params) the optimizer function
        # 5. sets the explorationStrategy variables/functions for train and evaluation
        # 6. sets the batchSize for the number of experiences 
        # 7. Creates the replayBuffer
        
        #Your code goes in here
        
        # 1 , 6
        self.gamma = gamma 
        self.buffer_size = bufferSize
        self.batch_size = batchSize 
        self.max_train_ep = MAX_TRAIN_EPISODES
        self.max_test_ep = MAX_EVAL_EPISODES
        self.update_frequency = updateFrequency
        
        # 2
        self.train_reward = 0
        self.test_reward = 0
        self.total_steps = 0 
        self.cpu_time = 0
        self.wall_time = 0
        self.episode = 0
        self.initBookKeeping()
        
        
        # 5
        self.exploration_train = explorationStrategyTrainFn
        self.exploration_eval  = explorationStrategyEvalFn
        
        self.state, _ = env.reset(seed = seed)  # reset() returns (observation, info)
        self.seed = seed 
        self.env = env 
        
        # 3
        inDim = int(np.prod(env.observation_space.shape))
        outDim = env.action_space.n
        self.policy_network = createPolicyNetwork(inDim, outDim, hDim = [512,128], activation = nn.ReLU())
        self.target_network = copy.deepcopy(self.policy_network)
        
        # 4
        self.optimizer = optimizerFn(self.policy_network.parameters(), lr=optimizerLR)
        
        # 7
        self.replay_buffer = ReplayBuffer(self.buffer_size, bufferType = 'DDQN')
        

In [None]:
class DDQN(DDQN):
    def initBookKeeping(self):
        #this method creates and initializes all the variables required for book-keeping values and it is called
        #by the init method
        #
        # Your code goes in here
        
        self.train_reward_array = []
        self.test_reward_array = []
        self.total_steps_array = []
        self.total_cpu_array = []
        self.total_wall_array = []
        return

In [None]:
class DDQN(DDQN):
    def performBookKeeping(self, train = True):
        #this method updates relevant variables for the bookKeeping, this can be called 
        #multiple times during training
        #if you want you can print information using this, so it may help to monitor progress and also help to debug
        #
        # Your code goes in here
        
        if train == False :
            self.test_reward_array.append(self.test_reward)
        else :
            self.train_reward_array.append(self.train_reward)
            self.total_steps_array.append(self.total_steps)
            self.total_cpu_array.append(self.cpu_time)
            self.total_wall_array.append(self.wall_time)
        
        return

In [None]:
class DDQN(DDQN):
    def runDDQN(self):
        #this is the main method, it trains the agent, performs bookkeeping while training and finally evaluates
        #the agent and returns the following quantities:
        #1. episode wise mean train rewards
        #2. episode wise mean eval rewards
        #3. episode wise trainTime (in seconds): time elapsed during training since the start of the first episode
        #4. episode wise wallClockTime (in seconds): actual time elapsed since the start of training,
        #                               note this will include time for BookKeeping and evaluation 
        # Note both trainTime and wallClockTime get accumulated as episodes proceed. 
        
        #Your code goes in here
        
        self.trainAgent()
        
        return self.train_reward_array, self.total_cpu_array, self.total_steps_array, self.total_wall_array, self.test_reward_array
    
    
    
    
    

In [None]:
class DDQN(DDQN):
    def trainAgent(self):
        #this method collects experiences and trains the agent and does BookKeeping while training. 
        #this calls the trainNetwork() method internally, it also evaluates the agent per episode
        #it trains the agent for MAX_TRAIN_EPISODES
        #
        #Your code goes in here
        self.updateNetwork(self.policy_network, self.target_network)
        cpu_start = time.process_time()   # CPU time; time.time() would measure wall-clock time instead
        wall_start = time.perf_counter()
        
        for episode in range(self.max_train_ep):
            
            state , info = self.env.reset(seed = self.seed)
            
            self.replay_buffer.collectExperiences(self.env, state, self.exploration_train, countExperiences = -1, net = self.policy_network)
            
            if self.replay_buffer.length() < self.batch_size :
                continue 
                
            experience_buffer = self.replay_buffer.sample(self.batch_size)
            self.trainNetwork(experience_buffer , 5 )
            
            self.train_reward = self.replay_buffer.episode_reward 
            self.total_steps += self.replay_buffer.episode_total_steps
            self.cpu_time = time.process_time() - cpu_start
            self.wall_time = time.perf_counter() - wall_start
            
            self.episode = episode
            
            self.performBookKeeping(train=True)
            self.evaluateAgent()
            self.performBookKeeping(train=False)
            
            if episode % self.update_frequency == 0 :
                self.updateNetwork(self.policy_network, self.target_network)
        
        
        return self.train_reward_array, self.total_cpu_array, self.total_steps_array, self.total_wall_array

In [None]:
class DDQN(DDQN):
    def trainNetwork(self, experiences, epochs):
        # This method trains the value network epoch number of times and is called by the trainAgent function.
        # It uses experiences to calculate targets, then calculates the error for updating the network parameters.
        # It does not return anything.

        curr_states, actions, rewards, new_states, terminated_s, truncated_s = self.replay_buffer.splitExperiences(experiences)

        for _ in range(epochs):

            new_q_values_policy = self.policy_network(new_states)
            new_actions = new_q_values_policy.argmax(dim=1, keepdim=True)


            new_q_values_target = self.target_network(new_states).detach()
            max_new_q_values_target = new_q_values_target.gather(1, new_actions)

            temp = torch.logical_or(terminated_s, truncated_s)
            targets = rewards + self.gamma * (torch.logical_not(temp)) * max_new_q_values_target

            q_values = self.policy_network(curr_states)
            action_q_values = q_values.gather(1, actions)

            error = F.smooth_l1_loss(action_q_values, targets)

            # gradient descent
            self.optimizer.zero_grad()
            error.backward()
            self.optimizer.step()


        return

In [None]:
class DDQN(DDQN):
    def updateNetwork(self, onlineNet, targetNet):
        #this function updates the target network with the weights of the online network
        #
        # Your code goes in here
        targetNet.load_state_dict(onlineNet.state_dict())

        
        return
        
       

In [None]:
class DDQN(DDQN):
    def evaluateAgent(self):
        #this function evaluates the agent using the value network, it evaluates agent for MAX_EVAL_EPISODES
        #typically MAX_EVAL_EPISODES = 1
        
        #Your code goes in here
        reward_arr = []
        for e in range(self.max_test_ep):
            rs = 0
            
            st, info = self.env.reset()
            terminated = False
            truncated = False
            
            while not (terminated or truncated):
                act = self.exploration_eval(self.policy_network, torch.as_tensor(st, dtype=torch.float32).unsqueeze(0))
                st, rew, terminated, truncated, info = self.env.step(act)
                rs += rew
                
            reward_arr.append(rs)
            
        self.test_reward = sum(reward_arr) / len(reward_arr)

        
        
        

## Dueling DDQN

Implement the Dueling Double DQN agent covered in the lecture. Use the function definitions given below.

This class implements the Dueling Double DQN agent. You are required to implement the various methods of this class
as outlined below. Note that this class is generic and should work with any permissible Gym environment.

```
class D3QN():
    def __init__(env, seed, gamma, tau, 
                 bufferSize,
                 batchSize,
                 optimizerFn,
                 optimizerLR,
                 MAX_TRAIN_EPISODES, MAX_EVAL_EPISODES,
                 explorationStrategyTrainFn, 
                 explorationStrategyEvalFn,
                 updateFrequency)
    def initBookKeeping(self)
    def performBookKeeping(self, train = True)
    def runD3QN(self)
    def trainAgent(self)
    def trainNetwork(self, experiences)
    def updateNetwork(self, onlineNet, targetNet)
    def evaluateAgent(self)
```
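
A dueling network splits its head into a state value $V(s)$ and per-action advantages $A(s, a)$, and a common way to recombine them is $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$. The sketch below shows that aggregation in plain Python. It assumes this mean-subtracted form; whether the `createDuelingNetwork` helper used in this notebook aggregates the same way is not visible in this section, and `dueling_q` is a name invented for the example.

```python
def dueling_q(value, advantages):
    """Combine a scalar state value with per-action advantages into Q-values."""
    mean_adv = sum(advantages) / len(advantages)
    # Subtracting the mean advantage makes the split between V and A identifiable.
    return [value + a - mean_adv for a in advantages]

q_values = dueling_q(2.0, [1.0, -1.0, 3.0])
print(q_values)
```

With this aggregation the mean of the resulting Q-values equals the state value, so the value stream alone carries the overall level of the estimates.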

In [None]:
class D3QN():
    def __init__(self , env, seed, gamma, tau, 
                 bufferSize,
                 batchSize,
                 optimizerFn,
                 optimizerLR,
                 MAX_TRAIN_EPISODES, MAX_EVAL_EPISODES,
                 explorationStrategyTrainFn, 
                 explorationStrategyEvalFn,
                 updateFrequency):
        #this D3QN method 
        # 1. creates and initializes (with seed) the environment, train/eval episodes, gamma, etc. 
        # 2. creates and initializes all the variables required for book-keeping values via the initBookKeeping method
        # 3. creates target and online Q-networks using the createValueNetwork above
        # 4. creates and initializes (with network params) the optimizer function
        # 5. sets the explorationStrategy variables/functions for train and evaluation
        # 6. sets the batchSize for the number of experiences 
        # 7. Creates the replayBuffer
        
        #Your code goes in here
        # 1 , 6
        self.gamma = gamma 
        self.buffer_size = bufferSize
        self.batch_size = batchSize 
        self.max_train_ep = MAX_TRAIN_EPISODES
        self.max_test_ep = MAX_EVAL_EPISODES
        self.update_frequency = updateFrequency
        self.tau = tau
        
        # 2
        self.train_reward = 0
        self.test_reward = 0
        self.total_steps = 0 
        self.cpu_time = 0
        self.wall_time = 0
        self.episode = 0
        self.initBookKeeping()
        
        
        # 5
        self.exploration_train = explorationStrategyTrainFn
        self.exploration_eval  = explorationStrategyEvalFn
        
        self.state, _ = env.reset(seed = seed)  # reset() returns (observation, info)
        self.seed = seed 
        self.env = env 
        
        # 3
        inDim = int(np.prod(env.observation_space.shape))
        outDim = env.action_space.n
        self.policy_network = createDuelingNetwork(inDim, outDim, hDim = [512,128], activation = nn.ReLU())
        self.target_network = copy.deepcopy(self.policy_network)
        
        # 4
        self.optimizer = optimizerFn(self.policy_network.parameters(), lr=optimizerLR)
        
        # 7
        self.replay_buffer = ReplayBuffer(self.buffer_size, bufferType = 'D3QN')
        

In [None]:
class D3QN(D3QN):
    def initBookKeeping(self):
        #this method creates and initializes all the variables required for book-keeping values and it is called
        #by the init method
        #
        # Your code goes in here
        
        self.train_reward_array = []
        self.test_reward_array = []
        self.total_steps_array = []
        self.total_cpu_array = []
        self.total_wall_array = []
        return
        


In [None]:
class D3QN(D3QN):
    def performBookKeeping(self, train = True):
        #this method updates relevant variables for the bookKeeping, this can be called 
        #multiple times during training
        #if you want you can print information using this, so it may help to monitor progress and also help to debug
        #
        # Your code goes in here
        
        if train == False :
            self.test_reward_array.append(self.test_reward)
        else :
            self.train_reward_array.append(self.train_reward)
            self.total_steps_array.append(self.total_steps)
            self.total_cpu_array.append(self.cpu_time)
            self.total_wall_array.append(self.wall_time)
        
        return

In [None]:
class D3QN(D3QN):
    def runD3QN(self):
        #this is the main method, it trains the agent, performs bookkeeping while training and finally evaluates
        #the agent and returns the following quantities:
        #1. episode wise mean train rewards
        #2. episode wise mean eval rewards
        #3. episode wise trainTime (in seconds): time elapsed during training since the start of the first episode
        #4. episode wise wallClockTime (in seconds): actual time elapsed since the start of training,
        #                               note this will include time for BookKeeping and evaluation 
        # Note both trainTime and wallClockTime get accumulated as episodes proceed. 
        
        #Your code goes in here
        
        self.trainAgent()
        
        return self.train_reward_array, self.total_cpu_array, self.total_steps_array, self.total_wall_array, self.test_reward_array
    

In [None]:
class D3QN(D3QN):
    def trainAgent(self):
        #this method collects experiences and trains the agent and does BookKeeping while training. 
        #this calls the trainNetwork() method internally, it also evaluates the agent per episode
        #it trains the agent for MAX_TRAIN_EPISODES
        
        #Your code goes in here
        self.updateNetwork(self.policy_network, self.target_network)
        cpu_start = time.process_time()   # CPU time; time.time() would measure wall-clock time instead
        wall_start = time.perf_counter()
        
        for episode in range(self.max_train_ep):
            
            state , info = self.env.reset(seed = self.seed)
            
            self.replay_buffer.collectExperiences(self.env, state, self.exploration_train, countExperiences = -1, net = self.policy_network)
            
            if self.replay_buffer.length() < self.batch_size :
                continue 
                
            experience_buffer = self.replay_buffer.sample(self.batch_size)
            self.trainNetwork(experience_buffer , 5 )
            
            self.train_reward = self.replay_buffer.episode_reward 
            self.total_steps += self.replay_buffer.episode_total_steps
            self.cpu_time = time.process_time() - cpu_start
            self.wall_time = time.perf_counter() - wall_start
            
            self.episode = episode
            
            self.performBookKeeping(train=True)
            self.evaluateAgent()
            self.performBookKeeping(train=False)
            
            if episode % self.update_frequency == 0 :
                self.updateNetwork(self.policy_network, self.target_network)
        
        
        return self.train_reward_array, self.total_cpu_array, self.total_steps_array, self.total_wall_array

In [None]:
class D3QN(D3QN):
    def trainNetwork(self, experiences, epochs):
        # this method trains the value network epoch number of times and is called by the trainAgent function
        # it essentially uses the experiences to calculate target, using the targets it calculates the error, which
        # is further used for calculating the loss. It then uses the optimizer over the loss 
        # to update the params of the network by backpropagating through the network
        # this function does not return anything
        # you can try out other loss functions other than MSE like Huber loss, MAE, etc. 
        
        #Your code goes in here

        curr_states, actions, rewards, new_states, terminated_s, truncated_s = self.replay_buffer.splitExperiences(experiences)

        for _ in range(epochs):

            new_q_values_policy = self.policy_network(new_states)
            new_actions = new_q_values_policy.argmax(dim=1, keepdim=True)


            new_q_values_target = self.target_network(new_states).detach()
            max_new_q_values_target = new_q_values_target.gather(1, new_actions)

            temp = torch.logical_or(terminated_s, truncated_s)
            targets = rewards + self.gamma * (torch.logical_not(temp)) * max_new_q_values_target

            q_values = self.policy_network(curr_states)
            action_q_values = q_values.gather(1, actions)

            error = F.smooth_l1_loss(action_q_values, targets)

            # gradient descent
            self.optimizer.zero_grad()
            error.backward()
            self.optimizer.step()


        return
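
The target used in `trainNetwork` above follows the Double DQN idea: the online network picks the next action and the target network scores it. A minimal single-sample sketch in plain Python may make that split easier to see. The function name `double_dqn_target` and the example Q-values are hypothetical and only serve as illustration; this is not the notebook's tensor implementation.

```python
def double_dqn_target(reward, gamma, done, q_online_next, q_target_next):
    """One-sample Double DQN target.

    q_online_next / q_target_next are lists of Q-values for the next state,
    one entry per action, from the online and the target network.
    """
    if done:
        return reward  # no bootstrapping once the episode has ended
    # The online network selects the action ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... and the target network evaluates that action.
    return reward + gamma * q_target_next[best_action]


# Hypothetical Q-values for a 3-action problem.
q_online_next = [1.0, 2.5, 0.3]   # online net prefers action 1
q_target_next = [4.0, 1.5, 9.0]   # target net would have preferred action 2

print(double_dqn_target(1.0, 0.99, False, q_online_next, q_target_next))
print(double_dqn_target(1.0, 0.99, True, q_online_next, q_target_next))
```

Note that the target uses the target network's value for action 1, not its own maximum (action 2); decoupling selection from evaluation is what reduces the maximization bias.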

In [None]:
class D3QN(D3QN):
    def updateNetwork(self, onlineNet, targetNet):
        #this function updates the target network towards the online network using Polyak averaging
        #
        # Your code goes in here
        with torch.no_grad():
            for paramOnline, paramTarget in zip(onlineNet.parameters(), targetNet.parameters()):
                paramTarget.data = self.tau * paramOnline.data + (1 - self.tau) * paramTarget.data
        return
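
To illustrate what the Polyak update in `updateNetwork` does over repeated calls, here is a small sketch on plain Python lists. The helper `polyak_update` and the weight values are hypothetical stand-ins for network parameters, not part of the assignment code.

```python
def polyak_update(online_params, target_params, tau):
    """Return new target parameters: tau * online + (1 - tau) * target."""
    return [tau * w_online + (1.0 - tau) * w_target
            for w_online, w_target in zip(online_params, target_params)]


online = [1.0, 2.0, 3.0]   # stand-in for online-network weights
target = [0.0, 0.0, 0.0]   # stand-in for target-network weights

# With tau = 0.1 the target moves 10% of the remaining distance per update.
for step in range(3):
    target = polyak_update(online, target, tau=0.1)
    print(step, [round(w, 4) for w in target])
```

With `tau = 1` the update becomes a hard copy of the online weights; smaller values make the target network change more slowly.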

In [None]:
class D3QN(D3QN):
    def evaluateAgent(self):
        #this function evaluates the agent using the value network, it evaluates agent for MAX_EVAL_EPISODES
        #typically MAX_EVAL_EPISODES = 1
        #
        #Your code goes in here
        reward_arr = []
        for e in range(self.max_test_ep):
            rs = 0
            
            st, info = self.env.reset()
            terminated = False
            truncated = False
            
            while not (terminated or truncated):
                act = self.exploration_eval(self.policy_network, torch.tensor([st], dtype=torch.float32))
                st, rew, terminated, truncated, info = self.env.step(act)
                rs += rew
                
            reward_arr.append(rs)
            
        self.test_reward = sum(reward_arr) / len(reward_arr)
 

## Dueling Double Deep Q Network with Prioritized Experience Replay (D3QN-PER)

Implement the Dueling Double DQN with Prioritized Experience Replay (D3QN-PER) agent. We studied the D3QN-PER agent in the lecture. Use the function definitions given below.

This class implements the D3QN-PER agent. You are required to implement the various methods of this class
as outlined below. Note that this class is generic and should work with any permissible Gym environment.

```
class D3QN_PER():
    def __init__(env, seed, gamma, tau, alpha, beta, beta_rate, 
                 bufferSize,
                 batchSize,
                 optimizerFn,
                 optimizerLR,
                 MAX_TRAIN_EPISODES, MAX_EVAL_EPISODES,
                 explorationStrategyTrainFn, 
                 explorationStrategyEvalFn,
                 updateFrequency)
    def initBookKeeping(self)
    def performBookKeeping(self, train = True)
    def runD3QN_PER(self)
    def trainAgent(self)
    def trainNetwork(self, experiences)
    def updateNetwork(self, onlineNet, targetNet)
    def evaluateAgent(self)
``` 
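
The `alpha` and `beta` parameters in the signature above control prioritization and the importance-sampling correction. As a rough sketch of the standard proportional variant of prioritized replay (the internals of the `ReplayBuffer` helper are not shown in this section, so the function names, the `eps` constant and the example TD errors are assumptions):

```python
def per_probabilities(td_errors, alpha, eps=1e-6):
    """Proportional prioritization: P(i) = p_i^alpha / sum_k p_k^alpha."""
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta):
    """IS weights w_i = (N * P(i))^(-beta), scaled so the largest weight is 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    largest = max(raw)
    return [w / largest for w in raw]


# Hypothetical TD errors of four stored transitions.
td_errors = [0.1, 0.5, 2.0, 0.05]
probs = per_probabilities(td_errors, alpha=0.6)
weights = importance_weights(probs, beta=0.1)
for e, p, w in zip(td_errors, probs, weights):
    print(f"|td|={e:4.2f}  P={p:.3f}  w={w:.3f}")
```

Transitions with a large TD error are sampled more often but receive a smaller weight in the loss. Setting `alpha = 0` recovers uniform sampling, and `beta` is usually annealed towards 1 during training.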

In [None]:
class D3QN_PER():
    def __init__(self,env, seed, gamma, tau, alpha, beta, beta_rate, 
                 bufferSize,
                 batchSize,
                 optimizerFn,
                 optimizerLR,
                 MAX_TRAIN_EPISODES, MAX_EVAL_EPISODES,
                 explorationStrategyTrainFn, 
                 explorationStrategyEvalFn,
                 updateFrequency):
        #this D3QN-PER method 
        # 1. creates and initializes (with seed) the environment, train/eval episodes, gamma, etc. 
        # 2. creates and initializes all the variables required for book-keeping values via the initBookKeeping method
        # 3. creates target and online Q-networks using the createValueNetwork above
        # 4. creates and initializes (with network params) the optimizer function
        # 5. sets the explorationStrategy variables/functions for train and evaluation
        # 6. sets the batchSize for the number of experiences 
        # 7. Creates the replayBuffer, 
        #    the replayBuffer takes the parameters bufferSize, alpha, beta and beta_rate
        #
        # Your code goes in here
        # 1 , 6
        self.gamma = gamma 
        self.buffer_size = bufferSize
        self.batch_size = batchSize 
        self.max_train_ep = MAX_TRAIN_EPISODES
        self.max_test_ep = MAX_EVAL_EPISODES
        self.update_frequency = updateFrequency
        self.tau = tau
        self.alpha = alpha 
        self.beta = beta 
        self.beta_rate  = beta_rate
        # 2
        self.train_reward = 0
        self.test_reward = 0
        self.total_steps = 0 
        self.cpu_time = 0
        self.wall_time = 0
        self.episode = 0
        self.initBookKeeping()
        
        
        # 5
        self.exploration_train = explorationStrategyTrainFn
        self.exploration_eval  = explorationStrategyEvalFn
        
        self.state = env.reset(seed = seed)
        self.seed = seed 
        self.env = env 
        
        # 3
        inDim = int(np.prod(env.observation_space.shape))
        outDim = env.action_space.n
        self.policy_network = createDuelingNetwork(inDim, outDim, hDim = [512,128], activation = nn.ReLU())
        self.target_network = copy.deepcopy(self.policy_network)
        
        # 4
        self.optimizer = optimizerFn(self.policy_network.parameters(), lr=optimizerLR)
        
        # 7
        self.replay_buffer = ReplayBuffer(bufferSize, bufferType = 'PER-D3QN', priority_alpha=alpha,
                                          priority_beta=beta, priority_beta_rate=beta_rate)

In [None]:
class D3QN_PER(D3QN_PER):
    def initBookKeeping(self):
        #this method creates and initializes all the variables required for book-keeping values and it is called
        #by the init method
        #
        #Your code goes in here
        self.train_reward_array = []
        self.test_reward_array = []
        self.total_steps_array = []
        self.total_cpu_array = []
        self.total_wall_array = []
        return

In [None]:
class D3QN_PER(D3QN_PER):
    def performBookKeeping(self, train = True):
        #this method updates relevant variables for the bookKeeping, this can be called 
        #multiple times during training
        #if you want you can print information using this, so it may help to monitor progress and also help to debug
        #
        #Your code goes in here
        
        if train == False :
            self.test_reward_array.append(self.test_reward)
        else :
            self.train_reward_array.append(self.train_reward)
            self.total_steps_array.append(self.total_steps)
            self.total_cpu_array.append(self.cpu_time)
            self.total_wall_array.append(self.wall_time)
        
        return

In [None]:
class D3QN_PER(D3QN_PER):
    def runD3QN_PER(self):
        #this is the main method, it trains the agent, performs bookkeeping while training and finally evaluates
        #the agent and returns the following quantities:
        #1. episode wise mean train rewards
        #2. episode wise mean eval rewards 
        #3. episode wise trainTime (in seconds): time elapsed during training since the start of the first episode 
        #4. episode wise wallClockTime (in seconds): actual time elapsed since the start of training, 
        #                               note this will include time for BookKeeping and evaluation 
        # Note both trainTime and wallClockTime get accumulated as episodes proceed. 
        #
        # Your code goes in here
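        self.trainAgent()
        
        # NOTE: the order actually returned here is train rewards, train time, total steps, wall-clock time, eval rewards
        return self.train_reward_array, self.total_cpu_array, self.total_steps_array, self.total_wall_array, self.test_reward_array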
    

In [None]:
class D3QN_PER(D3QN_PER):
    def trainAgent(self):
        #this method collects experiences and trains the agent and does BookKeeping while training. 
        #this calls the trainNetwork() method internally, it also evaluates the agent per episode
        #it trains the agent for MAX_TRAIN_EPISODES
        #
        #Your code goes in here
        self.updateNetwork(self.policy_network, self.target_network)
        cpu_start = time.time()
        wall_start = time.perf_counter()
        
        for episode in range(self.max_train_ep):
            
            state , info = self.env.reset(seed = self.seed)
            
            self.replay_buffer.collectExperiences(self.env, state, self.exploration_train, countExperiences = -1, net = self.policy_network)
            
            if self.replay_buffer.length() < self.batch_size :
                continue 
                
            experience_buffer = self.replay_buffer.sample(self.batch_size)
            self.trainNetwork(experience_buffer , 5 )
            
            self.train_reward = self.replay_buffer.episode_reward 
            self.total_steps += self.replay_buffer.episode_total_steps
            self.cpu_time = time.time()-cpu_start
            self.wall_time = time.perf_counter() - wall_start
            
            self.episode = episode
            
            self.performBookKeeping(train=True)
            self.evaluateAgent()
            self.performBookKeeping(train=False)
            
            if episode % self.update_frequency == 0 :
                self.updateNetwork(self.policy_network, self.target_network)
        
        
        return self.train_reward_array, self.total_cpu_array, self.total_steps_array, self.total_wall_array

In [None]:
class D3QN_PER(D3QN_PER):
    def trainNetwork(self, experiences , epochs):
        # this method trains the value network epoch number of times and is called by the trainAgent function
        # it essentially uses the experiences to calculate target, using the targets it calculates the error, which
        # is further used for calculating the loss. It then uses the optimizer over the loss 
        # to update the params of the network by backpropagating through the network
        # this function does not return anything
        # you can try out other loss functions other than MSE like Huber loss, MAE, etc. 
        #
        #Your code goes in here

        curr_states, actions, rewards, new_states, terminated_s, truncated_s , weights , index= self.replay_buffer.splitExperiences(experiences)

        for _ in range(epochs):

            new_q_values_policy = self.policy_network(new_states)
            new_actions = new_q_values_policy.argmax(dim=1, keepdim=True)


            new_q_values_target = self.target_network(new_states).detach()
            max_new_q_values_target = new_q_values_target.gather(1, new_actions)

            temp = torch.logical_or(terminated_s, truncated_s)
            targets = rewards + self.gamma * (torch.logical_not(temp)) * max_new_q_values_target

            q_values = self.policy_network(curr_states)
            action_q_values = q_values.gather(1, actions)

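            # per-sample Huber loss, so that each sample can be scaled by its importance-sampling weight
            # (weights is assumed to have the same shape as action_q_values, i.e. (batch, 1))
            error = F.smooth_l1_loss(action_q_values, targets, reduction='none')
            error = (error*weights).mean()
            
            # gradient descent
            self.optimizer.zero_grad()
            error.backward()
            self.optimizer.step()
            
            # refresh the priorities with the TD errors of the actions that were actually taken
            self.replay_buffer.update(index, (targets - action_q_values).detach())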


        return

In [None]:
class D3QN_PER(D3QN_PER):
    def updateNetwork(self, onlineNet, targetNet):
        #this function updates the target network towards the online network using Polyak averaging
        #
        # Your code goes in here
        #
        with torch.no_grad():
            for paramOnline, paramTarget in zip(onlineNet.parameters(), targetNet.parameters()):
                paramTarget.data = self.tau * paramOnline.data + (1 - self.tau) * paramTarget.data
        return

In [None]:
class D3QN_PER(D3QN_PER):
    def evaluateAgent(self):
        #this function evaluates the agent using the value network, it evaluates agent for MAX_EVAL_EPISODES
        #typically MAX_EVAL_EPISODES = 1
        #
        #Your code goes in here
        reward_arr = []
        for e in range(self.max_test_ep):
            rs = 0
            
            st, info = self.env.reset()
            terminated = False
            truncated = False
            
            while not (terminated or truncated):
                act = self.exploration_eval(self.policy_network, torch.tensor([st], dtype=torch.float32))
                st, rew, terminated, truncated, info = self.env.step(act)
                rs += rew
                
            reward_arr.append(rs)
            
        self.test_reward = sum(reward_arr) / len(reward_arr)
 

In [None]:


# env = gym.make('CartPole-v1')
# env.reset()

# #D3QN
# agent_D3QN_PER = D3QN_PER(env, seed=i,alpha=0.6 , beta=0.1,beta_rate=0.9992 ,tau =0.1 , gamma=0.99, bufferSize=1000, batchSize=64, 
#         optimizerFn=optim.Adam, optimizerLR=0.0005, MAX_TRAIN_EPISODES=500, MAX_EVAL_EPISODES=1, 
#         explorationStrategyTrainFn= selectEpsilonGreedyAction, explorationStrategyEvalFn= selectGreedyAction, 
#         updateFrequency=15)
# D3QN_PER_trainRewardsList, D3QN_PER_trainTimeList, D3QN_PER_evalRewardsList, D3QN_PER_wallClockTimeList, D3QN_PER_finalEvalReward = agent_D3QN_PER.runD3QN_PER()

        

# Deep Policy Based RL agents.
<a id="deep-policy-based"></a>

### The purpose of this part is to learn about different Deep Policy Based RL agents.

In this part of the assignment you will be implementing Deep Policy based RL algorithms we learnt in Lectures. Namely, we will be implementing REINFORCE and VPG. 

For all the algorithms below, this time we will not be specifying the hyper-parameters, please play with the hyper-params to come up with the best values. This way you will learn to tune the model. Some of the values were specified in the lecture, that would be a good starting point. 

## REINFORCE

Implement the REINFORCE algorithm. We studied the REINFORCE algorithm in the lecture. Use the function definitions given below.

This class implements the REINFORCE agent. You are required to implement the various methods of this class
as outlined below. Note that this class is generic and should work with any permissible Gym environment.

```
class REINFORCE():
    def __init__(env, seed, gamma, 
                 optimizerFn,
                 optimizerLR,
                 MAX_TRAIN_EPISODES, MAX_EVAL_EPISODES,
                 explorationStrategyTrainFn, 
                 explorationStrategyEvalFn)
    def initBookKeeping(self)
    def performBookKeeping(self, train = True)
    def runREINFORCE(self)
    def trainAgent(self)
    def trainPolicyNetwork(self, experiences)
    def evaluateAgent(self)
```
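
REINFORCE weights the log-probability of each action by the discounted return that followed it, so `trainPolicyNetwork` needs the return at every step of an episode. A minimal sketch of that computation, using a hypothetical helper name and a toy reward list:

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

The policy loss is then the negative sum over time steps of the return times the log-probability of the chosen action.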

### Implement the methods for the REINFORCE class below

In [None]:
# def __init__

In [None]:
# def initBookKeeping

In [None]:
# def performBookKeeping

In [None]:
# def runREINFORCE

In [None]:
# def trainAgent

In [None]:
# def trainPolicyNetwork

In [None]:
# def evaluateAgent

## Vanilla Policy Gradient (VPG)

Implement the VPG algorithm. We studied the VPG algorithm in the lecture. Use the function definitions given below.

This class implements the VPG agent. You are required to implement the various methods of this class
as outlined below. Note that this class is generic and should work with any permissible Gym environment.

```
class VPG():
    def __init__(env, seed, gamma, beta, 
                 optimizerFn,
                 optimizerLR,
                 MAX_TRAIN_EPISODES, MAX_EVAL_EPISODES,
                 explorationStrategyTrainFn, 
                 explorationStrategyEvalFn)
    def initBookKeeping(self)
    def performBookKeeping(self, train = True)
    def runVPG(self)
    def trainAgent(self)
    def trainPolicyNetwork(self, experiences)
    def evaluateAgent(self)
```
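
One common formulation of VPG extends REINFORCE with a learned value baseline and an entropy bonus. The sketch below assumes that `beta` in the signature above is the weight of the entropy term, which this section does not confirm; all function names and numbers are hypothetical.

```python
import math


def advantages(returns, values):
    """A_t = G_t - V(s_t): returns with a learned baseline subtracted."""
    return [g - v for g, v in zip(returns, values)]


def entropy(action_probs):
    """Entropy of a categorical action distribution."""
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)


def vpg_policy_loss(log_probs, advs, entropies, beta):
    """Mean of -(log pi * advantage), minus an entropy bonus weighted by beta."""
    n = len(log_probs)
    pg_term = -sum(lp * a for lp, a in zip(log_probs, advs)) / n
    entropy_term = sum(entropies) / n
    return pg_term - beta * entropy_term


# Toy two-step episode with two actions (assumed values).
returns = [2.0, 1.0]
values = [1.5, 1.5]
policy = [[0.5, 0.5], [0.9, 0.1]]          # action probabilities per step
taken = [0, 0]                             # actions that were chosen
log_probs = [math.log(policy[t][a]) for t, a in enumerate(taken)]
advs = advantages(returns, values)
ents = [entropy(p) for p in policy]
print(vpg_policy_loss(log_probs, advs, ents, beta=0.01))
```

Subtracting the baseline lowers the variance of the gradient estimate, and the entropy bonus discourages the policy from becoming deterministic too early.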

In [None]:
# def __init__

In [None]:
# def initBookKeeping

In [None]:
# def performBookKeeping

In [None]:
# def runVPG

In [None]:
# def trainAgent

In [None]:
# def trainPolicyNetwork

In [None]:
# def evaluateAgent

# Experiments and Plots
<a id="experiments"></a>

Run the NFQ, DQN, Double DQN, Dueling Double DQN, Dueling Double Deep Q Network with Prioritized Experience Replay, REINFORCE and VPG agents on the CartPole and MountainCar environments.

Plot the following for each environment separately. Note that, based on the different hyper-parameters and strategies you use, you can have multiple plots for each of the below. 

As you know from past experience, a single run of an agent over the environment results in plots that have a lot of variance and look very noisy. One way to overcome this is to create several different instances of the environment using different seeds, average the results across them, and plot the averages. For all the plots below, use this strategy. You need to run 5 different instances of the environment for each agent. As you have seen in the lecture slides, we plot the maximum and minimum values around the mean, which gives a shaded plot with the mean curve in between. In this assignment, you are required to do the same. Generate plots with an envelope between the maximum and minimum values (check the plotQuantity() function in the helper functions).
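
As a rough illustration of the averaging described above, the per-episode mean, minimum and maximum across seeds could be computed as follows. This is a sketch, not the `plotQuantity()` helper itself; the function name `envelope` and the reward values are made up.

```python
def envelope(runs):
    """Per-episode mean, min and max across runs of equal length."""
    per_episode = list(zip(*runs))  # group the values by episode index
    mean = [sum(vals) / len(vals) for vals in per_episode]
    low = [min(vals) for vals in per_episode]
    high = [max(vals) for vals in per_episode]
    return mean, low, high


# Hypothetical rewards from 5 seeds over 3 episodes.
runs = [
    [10.0, 20.0, 30.0],
    [12.0, 18.0, 35.0],
    [8.0, 25.0, 28.0],
    [11.0, 22.0, 31.0],
    [9.0, 15.0, 26.0],
]
mean, low, high = envelope(runs)
print(mean)  # [10.0, 20.0, 30.0]
print(low)   # [8.0, 15.0, 26.0]
print(high)  # [12.0, 25.0, 35.0]
```

The `low` and `high` lists bound the shaded region and `mean` is the curve drawn inside it, for example with matplotlib's `fill_between` and `plot`.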

For each quantity of interest, plot all the agents within the same plot using different colors for the envelope. Choose colors such that there is clear contrast between the plots corresponding to different agents.

1. Plot mean train rewards vs episodes for Cartpole environment.
2. Plot mean train rewards vs episodes for MountainCar environment.
3. Plot mean evaluation rewards vs episodes for Cartpole environment.
4. Plot mean evaluation rewards vs episodes for MountainCar environment.
5. Plot total steps vs episode for Cartpole environment.
6. Plot total steps vs episode for MountainCar environment.
7. Plot train time vs episode for Cartpole environment.
8. Plot train time vs episode for MountainCar environment.
9. Plot wall clock time vs episode for Cartpole environment.
10. Plot wall clock time vs episode for MountainCar environment.
11. Based on plots for CartPole environment, what are your observations about different agents? Compare different agents.  
12. Based on plots for MountainCar environment, what are your observations about different agents? Compare different agents. Do these observations concur with the ones for CartPole environment? 
13. Based on both the environments, can you generalize some of the findings for the value-based agents? If yes what are those findings?

In [None]:
################################### Cartpole env #####################################################
def runDeepValueBasedAgents():
    # this function will initialize 5 different instances of the env (using different seeds), run all the agents
    # over these different instances. Collects results and generates the plots stated above.
    # generate your plots in the cells below
    # write the answers to part 11, 12 and 13 in the cells below the plot-cells.
    
    DQN_trainRewardsList=dict()
    DQN_trainTimeList = dict()
    DQN_evalRewardsList = dict()
    DQN_wallClockTimeList=dict()
    DQN_finalEvalReward=dict()
    
    NFQ_trainRewardsList=dict()
    NFQ_trainTimeList = dict()
    NFQ_evalRewardsList = dict()
    NFQ_wallClockTimeList=dict()
    NFQ_finalEvalReward=dict()
    
    DDQN_trainRewardsList=dict()
    DDQN_trainTimeList = dict()
    DDQN_evalRewardsList = dict()
    DDQN_wallClockTimeList=dict()
    DDQN_finalEvalReward=dict()
    
    D3QN_trainRewardsList=dict()
    D3QN_trainTimeList = dict()
    D3QN_evalRewardsList = dict()
    D3QN_wallClockTimeList=dict()
    D3QN_finalEvalReward=dict()
    
    D3QN_PER_trainRewardsList=dict()
    D3QN_PER_trainTimeList = dict()
    D3QN_PER_evalRewardsList = dict()
    D3QN_PER_wallClockTimeList=dict()
    D3QN_PER_finalEvalReward=dict()
    
    for i in range(1,6):
        env = gym.make('CartPole-v1')
        env.reset(seed=i)
        
        #DQN
        agent_DQN = DQN(env, seed=i, gamma=0.99, bufferSize=1000, batchSize=64, 
                optimizerFn=optim.Adam, optimizerLR=0.0005, MAX_TRAIN_EPISODES=1000, MAX_EVAL_EPISODES=1, 
                explorationStrategyTrainFn= selectEpsilonGreedyAction, explorationStrategyEvalFn= selectGreedyAction, 
                updateFrequency=15)
        DQN_trainRewardsList[i], DQN_trainTimeList[i], DQN_evalRewardsList[i], DQN_wallClockTimeList[i], DQN_finalEvalReward[i] = agent_DQN.runDQN()
        #print(agent_DQN.runDQN())
        
        #NFQ
        agent_NFQ = NFQ(env, seed=i, gamma=0.99, bufferSize=1000, batchSize=64, 
                optimizerFn=optim.Adam, optimizerLR=0.0005, MAX_TRAIN_EPISODES=1000, MAX_EVAL_EPISODES=1, 
                explorationStrategyTrainFn= selectEpsilonGreedyAction, explorationStrategyEvalFn= selectGreedyAction)
        NFQ_trainRewardsList[i], NFQ_trainTimeList[i], NFQ_evalRewardsList[i], NFQ_wallClockTimeList[i], NFQ_finalEvalReward[i] = agent_NFQ.runNFQ()
        
        
        #DDQN
        agent_DDQN = DDQN(env, seed=i, gamma=0.99, bufferSize=1000, batchSize=64, 
                optimizerFn=optim.Adam, optimizerLR=0.0005, MAX_TRAIN_EPISODES=1000, MAX_EVAL_EPISODES=1, 
                explorationStrategyTrainFn= selectEpsilonGreedyAction, explorationStrategyEvalFn= selectGreedyAction, 
                updateFrequency=15)
        DDQN_trainRewardsList[i], DDQN_trainTimeList[i], DDQN_evalRewardsList[i], DDQN_wallClockTimeList[i], DDQN_finalEvalReward[i] = agent_DDQN.runDDQN()
        
        
        #D3QN
        agent_D3QN = D3QN(env, seed=i, tau =0.1 , gamma=0.99, bufferSize=1000, batchSize=64, 
                optimizerFn=optim.Adam, optimizerLR=0.0005, MAX_TRAIN_EPISODES=1000, MAX_EVAL_EPISODES=1, 
                explorationStrategyTrainFn= selectEpsilonGreedyAction, explorationStrategyEvalFn= selectGreedyAction, 
                updateFrequency=15)
        D3QN_trainRewardsList[i], D3QN_trainTimeList[i], D3QN_evalRewardsList[i], D3QN_wallClockTimeList[i], D3QN_finalEvalReward[i] = agent_D3QN.runD3QN()
        
        #PER-D3QN
        agent_D3QN_PER = D3QN_PER(env, seed=i,alpha=0.6 , beta=0.1,beta_rate=0.9992 ,tau =0.1 , gamma=0.99, bufferSize=1000, batchSize=64, 
        optimizerFn=optim.Adam, optimizerLR=0.0005, MAX_TRAIN_EPISODES=1000, MAX_EVAL_EPISODES=1, 
        explorationStrategyTrainFn= selectEpsilonGreedyAction, explorationStrategyEvalFn= selectGreedyAction, 
        updateFrequency=15)
        D3QN_PER_trainRewardsList[i], D3QN_PER_trainTimeList[i], D3QN_PER_evalRewardsList[i], D3QN_PER_wallClockTimeList[i], D3QN_PER_finalEvalReward[i] = agent_D3QN_PER.runD3QN_PER()

        
        
        
        
    
    #print(finalEvalReward)
    plotQuantity(DQN_trainRewardsList, 1000, ['Training Reward' , 'Training reward vs episodes' , 'DQN'])
    plotQuantity(DDQN_trainRewardsList, 1000, ['Training Reward' , 'Training reward vs episodes' , 'DDQN'])
    plotQuantity(D3QN_trainRewardsList, 1000, ['Training Reward' , 'Training reward vs episodes' , 'D3QN'])
    plotQuantity(D3QN_PER_trainRewardsList, 1000, ['Training Reward' , 'Training reward vs episodes' , 'D3QN_PER'])
    plotQuantity(NFQ_trainRewardsList, 1000, ['Training Reward' , 'Training reward vs episodes' , 'NFQ'])
    plt.show()
    plotQuantity(DQN_trainTimeList, 1000, ['Training time' , 'Training time vs episodes' , 'DQN'])
    plotQuantity(DDQN_trainTimeList, 1000, ['Training time' , 'Training time vs episodes' , 'DDQN'])
    plotQuantity(D3QN_trainTimeList, 1000, ['Training time' , 'Training time vs episodes' , 'D3QN'])
    plotQuantity(D3QN_PER_trainTimeList, 1000, ['Training time' , 'Training time vs episodes' , 'D3QN_PER'])
    plotQuantity(NFQ_trainTimeList, 1000, ['Training time' , 'Training time vs episodes' , 'NFQ'])
    plt.show()
    plotQuantity(DQN_evalRewardsList, 1000, ['Total steps' , 'Total steps vs episodes', 'DQN'])
    plotQuantity(DDQN_evalRewardsList, 1000, ['Total steps' , 'Total steps vs episodes', 'DDQN'])
    plotQuantity(D3QN_evalRewardsList, 1000, ['Total steps' , 'Total steps vs episodes', 'D3QN'])
    plotQuantity(D3QN_PER_evalRewardsList, 1000, ['Total steps' , 'Total steps vs episodes', 'D3QN_PER'])
    plotQuantity(NFQ_evalRewardsList, 1000, ['Total steps' , 'Total steps vs episodes', 'NFQ'])
    plt.show()
    plotQuantity(DQN_wallClockTimeList, 1000, ['Wall clock' , 'Wall clock vs episodes', 'DQN'])
    plotQuantity(DDQN_wallClockTimeList, 1000, ['Wall clock' , 'Wall clock vs episodes', 'DDQN'])
    plotQuantity(D3QN_wallClockTimeList, 1000, ['Wall clock' , 'Wall clock vs episodes', 'D3QN'])
    plotQuantity(D3QN_PER_wallClockTimeList, 1000, ['Wall clock' , 'Wall clock vs episodes', 'D3QN_PER'])
    plotQuantity(NFQ_wallClockTimeList, 1000, ['Wall clock' , 'Wall clock vs episodes', 'NFQ'])
    plt.show()
    plotQuantity(DQN_finalEvalReward,1000, ['Eval Reward' , 'Eval Reward vs episodes', 'DQN'])
    plotQuantity(DDQN_finalEvalReward, 1000, ['Eval Reward' , 'Eval Reward vs episodes', 'DDQN'])
    plotQuantity(D3QN_finalEvalReward, 1000, ['Eval Reward' , 'Eval Reward vs episodes', 'D3QN'])
    plotQuantity(D3QN_PER_finalEvalReward, 1000, ['Eval Reward' , 'Eval Reward vs episodes', 'D3QN_PER'])
    plotQuantity(NFQ_finalEvalReward, 1000, ['Eval Reward' , 'Eval Reward vs episodes', 'NFQ'])
    plt.show()

runDeepValueBasedAgents()


## Question 11

Based on the plots above, D3QN with prioritized experience replay performs best in the long run. In the training reward vs episodes plot it is clearly ahead early in training, although the other strong agents eventually catch up. NFQ is much slower than the other algorithms, which makes it inefficient for practical use. Double DQN removes the maximization bias of DQN, is more stable, and converges fairly quickly. D3QN and D3QN-PER also learn quickly and perform well; their dueling network architecture leads to better learning and performance. 

In [None]:
################################### Mountain Car env #####################################################
def runDeepValueBasedAgents_MC():
    # this function will initialize 5 different instances of the env (using different seeds), run all the agents
    # over these different instances. Collects results and generates the plots stated above.
    # generate your plots in the cells below
    # write the answers to part 11, 12 and 13 in the cells below the plot-cells.
    
    DQN_trainRewardsList=dict()
    DQN_trainTimeList = dict()
    DQN_evalRewardsList = dict()
    DQN_wallClockTimeList=dict()
    DQN_finalEvalReward=dict()
    
    NFQ_trainRewardsList=dict()
    NFQ_trainTimeList = dict()
    NFQ_evalRewardsList = dict()
    NFQ_wallClockTimeList=dict()
    NFQ_finalEvalReward=dict()
    
    DDQN_trainRewardsList=dict()
    DDQN_trainTimeList = dict()
    DDQN_evalRewardsList = dict()
    DDQN_wallClockTimeList=dict()
    DDQN_finalEvalReward=dict()
    
    D3QN_trainRewardsList=dict()
    D3QN_trainTimeList = dict()
    D3QN_evalRewardsList = dict()
    D3QN_wallClockTimeList=dict()
    D3QN_finalEvalReward=dict()
    
    D3QN_PER_trainRewardsList=dict()
    D3QN_PER_trainTimeList = dict()
    D3QN_PER_evalRewardsList = dict()
    D3QN_PER_wallClockTimeList=dict()
    D3QN_PER_finalEvalReward=dict()
    
    for i in range(1,6):
        print(i)
        env = gym.make('MountainCar-v0')
        env.reset(seed=i)
        
        #DQN
        agent_DQN = DQN(env, seed=i, gamma=0.99, bufferSize=1000, batchSize=64, 
                optimizerFn=optim.Adam, optimizerLR=0.0005, MAX_TRAIN_EPISODES=1000, MAX_EVAL_EPISODES=1, 
                explorationStrategyTrainFn= selectEpsilonGreedyAction, explorationStrategyEvalFn= selectGreedyAction, 
                updateFrequency=15)
        DQN_trainRewardsList[i], DQN_trainTimeList[i], DQN_evalRewardsList[i], DQN_wallClockTimeList[i], DQN_finalEvalReward[i] = agent_DQN.runDQN()
        #print(agent_DQN.runDQN())
        
        #NFQ
        agent_NFQ = NFQ(env, seed=i, gamma=0.99, bufferSize=1000, batchSize=64, 
                optimizerFn=optim.Adam, optimizerLR=0.0005, MAX_TRAIN_EPISODES=1000, MAX_EVAL_EPISODES=1, 
                explorationStrategyTrainFn= selectEpsilonGreedyAction, explorationStrategyEvalFn= selectGreedyAction)
        NFQ_trainRewardsList[i], NFQ_trainTimeList[i], NFQ_evalRewardsList[i], NFQ_wallClockTimeList[i], NFQ_finalEvalReward[i] = agent_NFQ.runNFQ()
        
        
        #DDQN
        agent_DDQN = DDQN(env, seed=i, gamma=0.99, bufferSize=1000, batchSize=64, 
                optimizerFn=optim.Adam, optimizerLR=0.0005, MAX_TRAIN_EPISODES=1000, MAX_EVAL_EPISODES=1, 
                explorationStrategyTrainFn= selectEpsilonGreedyAction, explorationStrategyEvalFn= selectGreedyAction, 
                updateFrequency=15)
        DDQN_trainRewardsList[i], DDQN_trainTimeList[i], DDQN_evalRewardsList[i], DDQN_wallClockTimeList[i], DDQN_finalEvalReward[i] = agent_DDQN.runDDQN()
        
        
        #D3QN
        agent_D3QN = D3QN(env, seed=i, tau =0.1 , gamma=0.99, bufferSize=1000, batchSize=64, 
                optimizerFn=optim.Adam, optimizerLR=0.0005, MAX_TRAIN_EPISODES=1000, MAX_EVAL_EPISODES=1, 
                explorationStrategyTrainFn= selectEpsilonGreedyAction, explorationStrategyEvalFn= selectGreedyAction, 
                updateFrequency=15)
        D3QN_trainRewardsList[i], D3QN_trainTimeList[i], D3QN_evalRewardsList[i], D3QN_wallClockTimeList[i], D3QN_finalEvalReward[i] = agent_D3QN.runD3QN()
        
        #PER-D3QN
        agent_D3QN_PER = D3QN_PER(env, seed=i,alpha=0.6 , beta=0.1,beta_rate=0.9992 ,tau =0.1 , gamma=0.99, bufferSize=1000, batchSize=64, 
        optimizerFn=optim.Adam, optimizerLR=0.0005, MAX_TRAIN_EPISODES=1000, MAX_EVAL_EPISODES=1, 
        explorationStrategyTrainFn= selectEpsilonGreedyAction, explorationStrategyEvalFn= selectGreedyAction, 
        updateFrequency=15)
        D3QN_PER_trainRewardsList[i], D3QN_PER_trainTimeList[i], D3QN_PER_evalRewardsList[i], D3QN_PER_wallClockTimeList[i], D3QN_PER_finalEvalReward[i] = agent_D3QN_PER.runD3QN_PER()

        
        
        
        
    
    #print(finalEvalReward)
    plotQuantity(DQN_trainRewardsList, 1000, ['Training Reward' , 'Training reward vs episodes' , 'DQN'])
    plotQuantity(DDQN_trainRewardsList, 1000, ['Training Reward' , 'Training reward vs episodes' , 'DDQN'])
    plotQuantity(D3QN_trainRewardsList, 1000, ['Training Reward' , 'Training reward vs episodes' , 'D3QN'])
    plotQuantity(D3QN_PER_trainRewardsList, 1000, ['Training Reward' , 'Training reward vs episodes' , 'D3QN_PER'])
    plotQuantity(NFQ_trainRewardsList, 1000, ['Training Reward' , 'Training reward vs episodes' , 'NFQ'])
    plt.show()
    plotQuantity(DQN_trainTimeList, 1000, ['Training time' , 'Training time vs episodes' , 'DQN'])
    plotQuantity(DDQN_trainTimeList, 1000, ['Training time' , 'Training time vs episodes' , 'DDQN'])
    plotQuantity(D3QN_trainTimeList, 1000, ['Training time' , 'Training time vs episodes' , 'D3QN'])
    plotQuantity(D3QN_PER_trainTimeList, 1000, ['Training time' , 'Training time vs episodes' , 'D3QN_PER'])
    plotQuantity(NFQ_trainTimeList, 1000, ['Training time' , 'Training time vs episodes' , 'NFQ'])
    plt.show()
    plotQuantity(DQN_evalRewardsList, 1000, ['Total steps' , 'Total steps vs episodes', 'DQN'])
    plotQuantity(DDQN_evalRewardsList, 1000, ['Total steps' , 'Total steps vs episodes', 'DDQN'])
    plotQuantity(D3QN_evalRewardsList, 1000, ['Total steps' , 'Total steps vs episodes', 'D3QN'])
    plotQuantity(D3QN_PER_evalRewardsList, 1000, ['Total steps' , 'Total steps vs episodes', 'D3QN_PER'])
    plotQuantity(NFQ_evalRewardsList, 1000, ['Total steps' , 'Total steps vs episodes', 'NFQ'])
    plt.show()
    plotQuantity(DQN_wallClockTimeList, 1000, ['Wall clock' , 'Wall clock vs episodes', 'DQN'])
    plotQuantity(DDQN_wallClockTimeList, 1000, ['Wall clock' , 'Wall clock vs episodes', 'DDQN'])
    plotQuantity(D3QN_wallClockTimeList, 1000, ['Wall clock' , 'Wall clock vs episodes', 'D3QN'])
    plotQuantity(D3QN_PER_wallClockTimeList, 1000, ['Wall clock' , 'Wall clock vs episodes', 'D3QN_PER'])
    plotQuantity(NFQ_wallClockTimeList, 1000, ['Wall clock' , 'Wall clock vs episodes', 'NFQ'])
    plt.show()
    plotQuantity(DQN_finalEvalReward,1000, ['Eval Reward' , 'Eval Reward vs episodes', 'DQN'])
    plotQuantity(DDQN_finalEvalReward, 1000, ['Eval Reward' , 'Eval Reward vs episodes', 'DDQN'])
    plotQuantity(D3QN_finalEvalReward, 1000, ['Eval Reward' , 'Eval Reward vs episodes', 'D3QN'])
    plotQuantity(D3QN_PER_finalEvalReward, 1000, ['Eval Reward' , 'Eval Reward vs episodes', 'D3QN_PER'])
    plotQuantity(NFQ_finalEvalReward, 1000, ['Eval Reward' , 'Eval Reward vs episodes', 'NFQ'])
    plt.show()

runDeepValueBasedAgents_MC()

## Question 12 

MountainCar is a special kind of environment in which the reward signal is very sparse: every step yields the same reward of -1, so the agent gets no informative signal until the car first climbs the hill. This is why every algorithm fails to train the agent in a mere 1000 episodes. There are two options:

1) Increase the number of episodes, or
2) Modify the reward based on position and velocity.

Both options help the agent learn the environment more efficiently. However, the second option is preferred because it is computationally cheaper. 
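
As a sketch of the second option, one way to modify the reward based on position and velocity is potential-based shaping, which adds `gamma * phi(s') - phi(s)` to the environment reward and is known to leave the optimal policy unchanged. The potential below (hill height plus a small speed bonus, with a scale factor of 10) is a hypothetical choice and was not used in the experiments above.

```python
import math


def potential(position, velocity):
    """Hypothetical potential: height of the car plus a small speed bonus."""
    height = math.sin(3.0 * position)  # the MountainCar hill follows sin(3 * position)
    return height + 10.0 * abs(velocity)


def shaped_reward(reward, state, next_state, gamma):
    """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return reward + gamma * potential(*next_state) - potential(*state)


# States are (position, velocity) pairs; the valley floor is near position -0.5.
stay = shaped_reward(-1.0, (-0.5, 0.0), (-0.5, 0.0), gamma=0.99)
climb = shaped_reward(-1.0, (-0.5, 0.0), (-0.4, 0.02), gamma=0.99)
print(stay, climb)
```

A step that gains height or speed now earns more than a step that stays put, so the agent receives a learning signal long before it reaches the goal.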

## Question 13 

From both environments we can say that value-based methods do a good job of learning the policy when the reward is dense. When the reward is sparse, as in the MountainCar environment, the value-based methods fail, and a policy-based method is a better choice because it learns the policy directly without estimating the values.

In [None]:
def runDeepPolicyBasedAgents():
    # this function will initialize 5 different instances of the env (using different seeds), run all the agents
    # over these different instances. Collects results and generates the plots stated above.
    # generate your plots in the cells below
    # write the answers to part 11, 12 and 13 in the cells below the plot-cells. 