<a href="https://colab.research.google.com/github/theindianwriter/RL_agents/blob/main/RL_FINAL_PROJECT_Double_Q_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IITM RL FINAL PROJECT

## Problem - bsuite 

This notebook uses an open source reinforcement learning benchmark known as bsuite. https://github.com/deepmind/bsuite

bsuite is a collection of carefully-designed experiments that investigate core capabilities of a reinforcement learning agent.

Your task is to use any reinforcement learning techniques at your disposal to get high scores on the environments specified.

**Note**: Since the course is on Reinforcement Learning, please limit yourself to traditional reinforcement learning algorithms.

**Do not use deep reinforcement learning.**

# How to use this notebook? 📝

- This is a shared template and any edits you make here will not be saved. **You
should make a copy in your own drive**. Click the "File" menu (top-left), then "Save a Copy in Drive". You can then work in your copy however you like.

<p style="text-align: center"><img src="https://gitlab.aicrowd.com/aicrowd/assets/-/raw/master/notebook/aicrowd_notebook_submission_flow.png?inline=false" alt="notebook overview" style="width: 650px;"/></p>

- **Update the config parameters**. You can define the common variables here

Variable | Description
--- | ---
`AICROWD_RESULTS_DIR` | Path to write the output to.
`AICROWD_ASSETS_DIR` | In case your notebook needs additional files (such as model weights), add them to a directory and specify the relative path to that directory here. The contents of this directory will be sent to AIcrowd for evaluation.
`AICROWD_API_KEY` | In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me

- **Installing packages**. Please use the [Install packages 🗃](#install-packages-) section to install the packages

In [None]:
!pip install -q aicrowd-cli

ERROR: google-colab 1.0.0 has requirement requests~=2.23.0, but you'll have requests 2.25.1 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.

# AIcrowd Runtime Configuration 🧷

Get your login API key from https://www.aicrowd.com/participants/me


In [None]:
import os

AICROWD_RESULTS_DIR = os.getenv("OUTPUTS_DIR", "results")
os.environ["RESULTS_DIR"] = AICROWD_RESULTS_DIR
API_KEY = os.getenv("AICROWD_API_KEY", "")  # assumes the key is exported as AICROWD_API_KEY; avoid hardcoding it in the notebook

In [None]:
!aicrowd login --api-key $API_KEY

API Key valid
Saved API Key successfully!


# Install packages 🗃

Please add all package installations in this section.

In [None]:
!pip install git+http://gitlab.aicrowd.com/nimishsantosh107/bsuite.git
!pip install tabulate
!pip install tqdm

## Add any other installations you need here

Collecting git+http://gitlab.aicrowd.com/nimishsantosh107/bsuite.git
  Cloning http://gitlab.aicrowd.com/nimishsantosh107/bsuite.git to /tmp/pip-req-build-zd8mjibt
  Running command git clone -q http://gitlab.aicrowd.com/nimishsantosh107/bsuite.git /tmp/pip-req-build-zd8mjibt
Collecting dm_env
  Downloading https://files.pythonhosted.org/packages/fa/84/c96b6544b8a2cfefc663b7dbd7fc0c2f2c3b6cbf68b0171775693bda2a66/dm_env-1.4-py3-none-any.whl
Collecting frozendict
  Downloading https://files.pythonhosted.org/packages/6d/29/edb363cf898269cb322d0186baa0bd02874a69691d9dec8728b644fbcedc/frozendict-2.0.2-py3-none-any.whl
Building wheels for collected packages: bsuite
  Building wheel for bsuite (setup.py) ... done
  Created wheel for bsuite: filename=bsuite-0.3.5-cp37-none-any.whl size=252042 sha256=b0f7e9f972ad432b8bf501db39ced60485e482227ecf6f9977a7f4e603809d3f
  Stored in directory: /tmp/pip-ephem-wheel-cache-95pn4q80/wheels/61/ea/06/77c82c07765fb8608e50e6c66bc566fa6d113c725bc69

# Import packages

In [None]:
import gym
import warnings

import numpy as np
import pandas as pd
import plotnine as gg
from tqdm.notebook import tqdm

import bsuite
from bsuite.aicrowd import environments
from bsuite.aicrowd.runner import Runner
from bsuite.aicrowd.analysis import Analyzer

pd.options.mode.chained_assignment = None
gg.theme_set(gg.theme_bw(base_size=16, base_family='serif'))
gg.theme_update(figure_size=(3, 1), panel_spacing_x=0.5, panel_spacing_y=0.5)
warnings.filterwarnings('ignore')

# **Agent Class**

You can modify the AGENT TEMPLATE below and implement the logic of your agent. Your agent must implement a few methods that will be called by the `Runner` class.
* `__init__` - put any initialization code here.
* `get_action` - takes in a `state` and returns an `action`.
* `learn` - takes in `(state, action, reward, next_state, done)`, implements the learning logic.
* `get_state` - takes in a raw `observation` directly from the env, discretizes it and returns a `state`.

In addition to these, you may implement other methods which can be called by the above methods.
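As a rough illustration of what `get_state` might do for a continuous observation, the sketch below maps each feature to a bin index and returns the indices as a tuple, which can be used directly as a dictionary key. The helper names (`interior_edges`, `discretize`) and the example limits are assumptions made for this illustration and are not part of the template. `bisect.bisect_right` is used here as a standard-library stand-in for `np.digitize` on increasing bin edges.

```python
from bisect import bisect_right

def interior_edges(low, high, n_bins):
    """Return the n_bins - 1 equally spaced edges strictly inside [low, high]."""
    width = (high - low) / n_bins
    return [low + i * width for i in range(1, n_bins)]

def discretize(observation, edges_per_feature):
    """Map each feature to its bin index and return the indices as a hashable tuple."""
    return tuple(bisect_right(edges, x)
                 for x, edges in zip(observation, edges_per_feature))

# Hypothetical two-feature observation (position, velocity) with 15 bins each.
edges = [interior_edges(-1.2, 0.6, 15), interior_edges(-0.07, 0.07, 15)]
state = discretize([-0.5, 0.0], edges)
print(state)
```

A tuple keeps index pairs such as `(1, 12)` and `(11, 2)` distinct, whereas joining the digits of the indices into a single integer would turn both into `112`. This matters whenever a feature has ten or more bins.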

Since there are multiple environments, you may need unique hyperparameters for each environment. Instantiate the agent by passing the hyperparameters in a dictionary via the `agent_config` parameter, so that each environment can use different hyperparameters while sharing a single `Agent` class. You can use any names for the keys in the config dictionary.

An example `RandomAgent` is given below.

In [None]:
# *** YOU CAN EDIT THIS CELL ***
# AGENT TEMPLATE
class Agent:
    def __init__(self, agent_config=None):

        self.config = agent_config
        self.env_name = self.config['env_name']
        self.valid_actions = self.config['valid_actions']
        self.lr = self.config["lr"]    
        self.gamma = self.config["gamma"] 
        self.epsilon_decay_rate = self.config["epsilon_decay_rate"] 
        self.lr_decay_rate = self.config["lr_decay_rate"]
        self.lr_min = self.config["lr_min"]
        self.epsilon = self.config["epsilon"]
        self.episode = 0           
        self.QA_table = dict()
        self.QB_table = dict()

        # Interior bin edges used by get_state to discretize continuous observations.
        if self.env_name in ('cartpole', 'cartpole_noise'):
            # The cosine and time-elapsed features are not binned.
            self.x_bins = pd.cut([-1, 1], bins=4, retbins=True)[1][1:-1]
            self.x_dot_bins = pd.cut([-5, 5], bins=4, retbins=True)[1][1:-1]
            self.sin_bins = pd.cut([-1, 1], bins=4, retbins=True)[1][1:-1]
            self.theta_dot_bins = pd.cut([-5, 5], bins=4, retbins=True)[1][1:-1]

        elif self.env_name in ('mountaincar', 'mountaincar_noise'):
            self.x_bins = pd.cut([-1.2, 0.6], bins=15, retbins=True)[1][1:-1]
            self.x_dot_bins = pd.cut([-0.07, 0.07], bins=15, retbins=True)[1][1:-1]

    def _create_Q(self, state):
        ''' Add zero-initialized entries for `state` to both Q tables if missing.
        Args:
            state: Discretized state used as the dictionary key.
        '''
        if state not in self.QA_table:
            self.QA_table[state] = {action: 0.0 for action in self.valid_actions}

        if state not in self.QB_table:
            self.QB_table[state] = {action: 0.0 for action in self.valid_actions}

    def _get_maxQA(self, state):
        ''' Find the maximum Q value for a state in table A.
        Args:
            state: Discretized state used as the dictionary key.
        Returns:
            maxQ: Maximum Q value over all actions in the given state.
        '''

        maxQ = max(self.QA_table[state].values())
        return maxQ
    
    def _get_maxQB(self, state):
        ''' Find the maximum Q value for a state in table B.
        Args:
            state: Discretized state used as the dictionary key.
        Returns:
            maxQ: Maximum Q value over all actions in the given state.
        '''

        maxQ = max(self.QB_table[state].values())
        return maxQ
    

    def _build_state(self, features):
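        ''' Build a state key by concatenating the integer features (bin indices) into one int.
        Note: indices of 10 or more take two digits, so different index combinations can share a key. '''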
        return int("".join(map(lambda feature: str(int(feature)), features)))

    def get_action(self, state):
        '''
        PARAMETERS  : 
            - state - discretized 'state'
        RETURNS     : 
            - action - 'action' to be taken
        '''
        self._create_Q(state)

        if np.random.uniform(0, 1) < self.epsilon:
            # Explore: pick a uniformly random action.
            action = np.random.choice(self.valid_actions)
        else:
            # Exploit: act greedily with respect to QA + QB, breaking ties at random.
            q_sum = {a: self.QA_table[state][a] + self.QB_table[state][a]
                     for a in self.valid_actions}
            max_Q = max(q_sum.values())
            best_actions = [a for a, q in q_sum.items() if q == max_Q]
            action = np.random.choice(best_actions)
        
        return action
    
    def learn(self, state, action, reward, next_state, done):
        '''
        PARAMETERS  : 
            - state - discretized 'state'
            - action - 'action' performed in 'state'
            - reward - 'reward' received due to action taken
            - next_state - discretized 'next_state'
            - done - status flag to represent if an episode is done or not
        RETURNS     : 
            - NIL
        '''
        self._create_Q(next_state)
        
        # Update one table chosen at random; its target bootstraps from the
        # maximum Q value of the other table in next_state.
        if np.random.rand() < 0.5:
            self.QA_table[state][action] = (1 - self.lr) * self.QA_table[state][action] + self.lr *\
                (reward + (self.gamma * self._get_maxQB(next_state)))
        else:
            self.QB_table[state][action] = (1 - self.lr) * self.QB_table[state][action] + self.lr *\
                (reward + (self.gamma * self._get_maxQA(next_state)))

        if done:
            # End of episode: decay the exploration rate and the learning rate.
            self.episode = self.episode + 1
            self.epsilon *= (1 - self.epsilon_decay_rate)
            self.lr = max(self.lr_min, self.lr * (1 - self.lr_decay_rate))
            
    def get_state(self, observation):
        '''
        PARAMETERS  : 
            - observation - raw 'observation' from environment
        RETURNS     : 
            - state - discretized 'state' from raw 'observation'
        '''
        if self.env_name in ('catch', 'catch_noise'):
            state = self._build_state(observation.flatten())
        elif self.env_name in ('cartpole', 'cartpole_noise'):
            # The cosine feature, observation[0, 3], is not used.
            state = self._build_state([np.digitize(x=[observation[0,0]], bins=self.x_bins)[0],
                                 np.digitize(x=[observation[0,1]], bins=self.x_dot_bins)[0],
                                 np.digitize(x=[observation[0,2]], bins=self.sin_bins)[0],
                                 np.digitize(x=[observation[0,4]], bins=self.theta_dot_bins)[0]])
        elif self.env_name in ('mountaincar', 'mountaincar_noise'):
            state = self._build_state([np.digitize(x=[observation[0,0]], bins=self.x_bins)[0],
                                 np.digitize(x=[observation[0,1]], bins=self.x_dot_bins)[0]])
        else:
            raise NotImplementedError

        return state
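
For comparison with the `learn` method above, here is a minimal sketch of the tabular Double Q-learning update as it is usually stated: the table being updated selects the greedy next action, and the other table evaluates that action. The function name `double_q_update`, the nested-dict table layout, the `rng` parameter, and the handling of `done` (no bootstrapping from terminal transitions) are assumptions of this sketch, not the notebook's implementation.

```python
import random

def double_q_update(q_a, q_b, state, action, reward, next_state, done,
                    lr=0.3, gamma=0.98, rng=random):
    """Apply one Double Q-learning update in place to q_a or q_b (chosen at random)."""
    # Pick which table learns; the other one only evaluates.
    learner, evaluator = (q_a, q_b) if rng.random() < 0.5 else (q_b, q_a)

    if done:
        target = reward  # assumption: terminal transitions do not bootstrap
    else:
        # The learner selects the greedy next action ...
        best_next = max(learner[next_state], key=learner[next_state].get)
        # ... and the evaluator supplies the value of that action.
        target = reward + gamma * evaluator[next_state][best_next]

    learner[state][action] += lr * (target - learner[state][action])

# Tiny hand-made tables: two states, two actions.
q_a = {"s0": {0: 0.0, 1: 0.0}, "s1": {0: 1.0, 1: 5.0}}
q_b = {"s0": {0: 0.0, 1: 0.0}, "s1": {0: 4.0, 1: 3.0}}

class AlwaysA:
    """Stand-in RNG that forces the update onto q_a, for a reproducible demo."""
    def random(self):
        return 0.0

double_q_update(q_a, q_b, "s0", 0, reward=1.0, next_state="s1", done=False, rng=AlwaysA())
print(q_a["s0"][0])
```

In this example `q_a` prefers action 1 in `s1`, so the target uses `q_b["s1"][1] = 3.0` rather than `q_b`'s own maximum of 4.0. Decoupling action selection from action evaluation in this way is what counters the overestimation of plain Q-learning.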

In [None]:
# *** YOU CAN EDIT THIS CELL ***
# DO NOT rename the config dictionaries as the evaluator references them. However, you may use any names for the keys in them.
catch_config = {"env_name": "catch",'lr': 0.3,'gamma': 0.98, "valid_actions" : [0,1,2],"epsilon_decay_rate" : 0.01,"lr_decay_rate" : 5e-4,"lr_min": 1e-5,"epsilon": 0.9}
catch_noise_config = {"env_name": "catch_noise",'lr': 0.3,'gamma': 0.98, "valid_actions" : [0,1,2],"epsilon_decay_rate" : 0.01,"lr_decay_rate" : 5e-4,"lr_min": 1e-5,"epsilon": 0.9}
cartpole_config = {"env_name": "cartpole",'lr': 0.3,'gamma': 0.995, "valid_actions" : [0,1,2],"decay_factor" : 0.01,"epsilon_decay_rate" : 0.01,"lr_decay_rate" : 5e-4,"lr_min": 1e-5,"epsilon": 0.9}
cartpole_noise_config = {"env_name": "cartpole_noise",'lr': 0.3,'gamma': 0.995, "valid_actions" : [0,1,2],"epsilon_decay_rate" : 0.01,"lr_decay_rate" : 5e-4,"lr_min": 1e-5,"epsilon": 0.9}
mountaincar_config = {"env_name": "mountaincar",'lr': 0.28,'gamma': 0.98, "valid_actions" : [0,1,2],"epsilon_decay_rate" : 5e-3,"lr_decay_rate" : 5e-4,"lr_min": 1e-5,"epsilon": 0.9}
mountaincar_noise_config = {"env_name": "mountaincar_noise",'lr': 0.3,'gamma': 0.98, "valid_actions" : [0,1,2],"epsilon_decay_rate" : 0.01,"lr_decay_rate" : 5e-4,"lr_min": 1e-5,"epsilon": 0.9}
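
To get a feel for these settings, the arithmetic below traces the schedules that `learn` applies at the end of every episode: `epsilon *= (1 - epsilon_decay_rate)` and `lr = max(lr_min, lr * (1 - lr_decay_rate))`. The helper `schedule` is a hypothetical name used only for this illustration; the numbers plugged in are those of `catch_config`.

```python
def schedule(epsilon, lr, episodes, epsilon_decay_rate, lr_decay_rate, lr_min):
    """Return (epsilon, lr) after the given number of completed episodes."""
    for _ in range(episodes):
        epsilon *= (1 - epsilon_decay_rate)
        lr = max(lr_min, lr * (1 - lr_decay_rate))
    return epsilon, lr

for n in (0, 100, 500, 1000):
    eps, lr = schedule(0.9, 0.3, n,
                       epsilon_decay_rate=0.01, lr_decay_rate=5e-4, lr_min=1e-5)
    print(f"after {n:4d} episodes: epsilon={eps:.4f}, lr={lr:.4f}")
```

With a decay rate of 0.01, epsilon falls to roughly 37% of its initial value after 100 episodes, so most exploration happens early, while the learning rate shrinks far more slowly.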

In [None]:
# *** YOU CAN EDIT THIS CELL ***
# EXAMPLE
class RandomAgent:
    def __init__(self, agent_config={}):
        self.config = agent_config
        self.env_name = self.config['env_name']

    def get_action(self, state):
        action = np.random.choice(2)
        return action
    
    def learn(self, state, action, reward, next_state, done):
        if ('BAR' in self.config):
            if (self.config['BAR']):
                self.config['FOO'] += 1

    def get_state(self, observation):
        # In this function you're allowed to use 
        # the environment name for observation preprocessing
        # Do not use it anywhere else
        if self.env_name == 'catch':
          state = observation
        elif self.env_name == 'catch_noise':
          state = observation
        elif self.env_name == 'cartpole':
          state = observation
        elif self.env_name == 'cartpole_noise':
          state = observation
        elif self.env_name == 'mountaincar':
          state = observation
        elif self.env_name == 'mountaincar_noise':
          state = observation
        else:
          raise NotImplementedError

        return state

env1_config = {
    "env_name": 'cartpole',
    'FOO': 0.1,
    'BAR': True
}

env2_config = {
    "env_name": 'cartpole',
    'FOO': 0.2,
    'BAR': False
}

randomAgent1 = RandomAgent(agent_config=env1_config)
randomAgent2 = RandomAgent(agent_config=env2_config)

# **Playing with the Environment**

#### **Instantiating the environment** :
You can create an environment by calling the following function:  
`environments.load_env(ENV_ID)` - RETURNS: `env`  
where, ENV_ID can be ONE of the following:
* `environments.CATCH`
* `environments.CATCH_NOISE`
* `environments.CARTPOLE`
* `environments.CARTPOLE_NOISE`
* `environments.MOUNTAINCAR`
* `environments.MOUNTAINCAR_NOISE`

The `NOISE` environments add scaled random noise to the `reward`.
<br/>
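
One practical consequence for tabular methods is that a single noisy reward is a poor estimate of an action's value, while an incremental update with a small learning rate averages the noise out. The sketch below illustrates this with Gaussian noise around a true reward of 1.0. The noise distribution, its scale, and the function name `running_estimate` are assumptions for illustration and are not taken from the bsuite implementation.

```python
import random

def running_estimate(rewards, lr):
    """Incrementally track the mean reward with a constant learning rate."""
    q = 0.0
    for r in rewards:
        # Same form as the tabular update, without the bootstrap term.
        q = (1 - lr) * q + lr * r
    return q

rng = random.Random(0)  # fixed seed so the run is repeatable
rewards = [1.0 + rng.gauss(0.0, 1.0) for _ in range(5000)]

for lr in (0.5, 0.05, 0.005):
    print(f"lr={lr}: estimate={running_estimate(rewards, lr):.3f}")
```

Smaller learning rates weight many past rewards instead of the most recent few, which is one reason a decaying learning rate can help in the noisy variants.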

#### **Running the environment** :
There are certain methods required to run the environments. The interface is very similar to OpenAI Gym's interface. For more information, read the OpenAI documentation [here](https://gym.openai.com/docs/).

`env.reset()` - RETURNS: `observation`  
`env.step(action)`  - RETURNS: `(next_observation, reward, done, info[NOT USED])`

There are also a few useful properties within the environments:

* `env.action_space.n` - total number of possible actions, e.g. if `n` is 3, the possible actions are `[0, 1, 2]`.
* `env.observation_space.shape` -  the shape of the observation.
* `env.bsuite_num_episodes` -  the pre-specified number of episodes which will be run during evaluation (unique for each environment).

##### *ONLY IN CATCH / CATCH_NOISE*
* `env.observation_space.high` -  the upper limit for every index in the observation.
* `env.observation_space.low` -  the lower limit for every index of the observation.
<br/>


## **Environment Observation Space Limits:**

The limits for the observation space (minimum and maximum) for all the environments are given in the table below:

| Environments                        | Limits                                                                      |
|-------------------------------------|-----------------------------------------------------------------------------|
| CATCH <br/>  CATCH_NOISE            | MIN: use `env.observation_space.low` <br/> MAX: use `env.observation_space.high` |
| CARTPOLE <br/> CARTPOLE_NOISE       | MIN: `[-1., -5., -1., -1., -5., 0.]` <br/> MAX: `[ 1.,  5.,  1.,  1.,  5., 1.]` |
| MOUNTAINCAR <br/> MOUNTAINCAR_NOISE | MIN: `[-1.2, -0.07, 0.]` <br/> MAX: `[ 0.6,  0.07,  1.]`                                 |

[NOTE] Use this code cell to play around and get used to the environments. However, the `Runner` class below will be used to evaluate your agent.

In [None]:
# *** YOU CAN EDIT THIS CELL ***
# TEST AREA
env = environments.load_env(environments.CATCH)  # replace 'environments.CATCH' with other environments
agent = Agent(agent_config=catch_config)    # replace with another agent class (e.g. 'RandomAgent') and the matching config

NUM_EPISODES = 10                                   # replace with 'env.bsuite_num_episodes' to run for pre-specified number of episodes
for episode_n in tqdm(range(NUM_EPISODES)):
    done = False
    episode_reward = 0
    episode_moves = 0
 
    observation = env.reset()
    state = agent.get_state(observation)

    while not done:
        action = agent.get_action(state)

        next_observation, reward, done, _ = env.step(action)
        next_state = agent.get_state(next_observation)

        agent.learn(state, action, reward, next_state, done)

        state = next_state

        episode_reward += reward
        episode_moves += 1

    if (((episode_n+1) % 2) == 0): 
        print("EPISODE: ",episode_n+1,"\tREWARD: ",episode_reward,"\tEPISODE_LENGTH: ",episode_moves)

[1m[37mLoaded bsuite_id: catch/0.[0m
(10, 5)


  0%|          | 0/10 [00:00<?, ?it/s]

EPISODE:  2 	REWARD:  -1.0 	EPISODE_LENGTH:  9
EPISODE:  4 	REWARD:  1.0 	EPISODE_LENGTH:  9
EPISODE:  6 	REWARD:  -1.0 	EPISODE_LENGTH:  9
EPISODE:  8 	REWARD:  -1.0 	EPISODE_LENGTH:  9
EPISODE:  10 	REWARD:  1.0 	EPISODE_LENGTH:  9


## Point to the Agent Class you'll use for the final score

In [None]:
RLAgent = Agent

# **Evaluating the Agent on all the Environments**

* The following cells will take care of running your agent on each environment and aggregating the results in csv files. In each of the following cells, the `agent_config` parameter is already set to use the corresponding config dictionary for that environment. DO NOT EDIT THIS.
* Feel free to modify the `LOG_INTERVAL` parameter to change the interval between episodes for logging.  
* Please do not modify any other contents in each of the cells.  

In [None]:
LOG_INTERVAL = 100

In [None]:
runner = Runner(
    agent = RLAgent(agent_config=catch_config),
    env_id = environments.CATCH,
    log_interval = LOG_INTERVAL,
)
runner.play_episodes()

[1m[37mLoaded bsuite_id: catch/0.[0m
[1m[33mLogging results to CSV file for each bsuite_id in results.[0m


  0%|          | 0/10000 [00:00<?, ?it/s]

EPISODE:  100 	REWARD:  1.0 	MEAN_REWARD:  -0.34 	EPISODE_LENGTH:  9
EPISODE:  200 	REWARD:  1.0 	MEAN_REWARD:  0.34 	EPISODE_LENGTH:  9
EPISODE:  300 	REWARD:  1.0 	MEAN_REWARD:  0.86 	EPISODE_LENGTH:  9
EPISODE:  400 	REWARD:  -1.0 	MEAN_REWARD:  0.9 	EPISODE_LENGTH:  9
EPISODE:  500 	REWARD:  1.0 	MEAN_REWARD:  0.98 	EPISODE_LENGTH:  9
EPISODE:  600 	REWARD:  1.0 	MEAN_REWARD:  1.0 	EPISODE_LENGTH:  9
EPISODE:  700 	REWARD:  1.0 	MEAN_REWARD:  1.0 	EPISODE_LENGTH:  9
EPISODE:  800 	REWARD:  1.0 	MEAN_REWARD:  1.0 	EPISODE_LENGTH:  9
EPISODE:  900 	REWARD:  1.0 	MEAN_REWARD:  1.0 	EPISODE_LENGTH:  9
EPISODE:  1000 	REWARD:  1.0 	MEAN_REWARD:  1.0 	EPISODE_LENGTH:  9
EPISODE:  1100 	REWARD:  1.0 	MEAN_REWARD:  1.0 	EPISODE_LENGTH:  9
EPISODE:  1200 	REWARD:  1.0 	MEAN_REWARD:  1.0 	EPISODE_LENGTH:  9
EPISODE:  1300 	REWARD:  1.0 	MEAN_REWARD:  1.0 	EPISODE_LENGTH:  9
EPISODE:  1400 	REWARD:  1.0 	MEAN_REWARD:  1.0 	EPISODE_LENGTH:  9
EPISODE:  1500 	REWARD:  1.0 	MEAN_REWARD:  1.0 	EP

In [None]:
runner = Runner(
    agent = RLAgent(agent_config=catch_noise_config),
    env_id = environments.CATCH_NOISE,
    log_interval = LOG_INTERVAL
)
runner.play_episodes()

[1m[37mLoaded bsuite_id: catch_noise/1.[0m
[1m[33mLogging results to CSV file for each bsuite_id in results.[0m


  0%|          | 0/10000 [00:00<?, ?it/s]

EPISODE:  100 	REWARD:  -2.4319180485080194 	MEAN_REWARD:  -0.37 	EPISODE_LENGTH:  9
EPISODE:  200 	REWARD:  0.8881949106608489 	MEAN_REWARD:  0.28 	EPISODE_LENGTH:  9
EPISODE:  300 	REWARD:  0.7058772924254098 	MEAN_REWARD:  0.71 	EPISODE_LENGTH:  9
EPISODE:  400 	REWARD:  0.5749511984405982 	MEAN_REWARD:  0.82 	EPISODE_LENGTH:  9
EPISODE:  500 	REWARD:  1.8165310035839968 	MEAN_REWARD:  0.99 	EPISODE_LENGTH:  9
EPISODE:  600 	REWARD:  0.39936420173707576 	MEAN_REWARD:  1.09 	EPISODE_LENGTH:  9
EPISODE:  700 	REWARD:  1.3884776851441103 	MEAN_REWARD:  0.91 	EPISODE_LENGTH:  9
EPISODE:  800 	REWARD:  1.8585504709523746 	MEAN_REWARD:  0.94 	EPISODE_LENGTH:  9
EPISODE:  900 	REWARD:  1.362278430457311 	MEAN_REWARD:  1.01 	EPISODE_LENGTH:  9
EPISODE:  1000 	REWARD:  1.829857544425355 	MEAN_REWARD:  1.05 	EPISODE_LENGTH:  9
EPISODE:  1100 	REWARD:  -0.37396752528475785 	MEAN_REWARD:  1.03 	EPISODE_LENGTH:  9
EPISODE:  1200 	REWARD:  0.9410664532715053 	MEAN_REWARD:  1.04 	EPISODE_LENGTH:  

In [None]:
runner = Runner(
    agent = RLAgent(agent_config=cartpole_config),
    env_id = environments.CARTPOLE,
    log_interval = LOG_INTERVAL
)
runner.play_episodes()

[1m[37mLoaded bsuite_id: cartpole/0.[0m
[1m[33mLogging results to CSV file for each bsuite_id in results.[0m


  0%|          | 0/1000 [00:00<?, ?it/s]

EPISODE:  100 	REWARD:  72.0 	MEAN_REWARD:  66.46 	EPISODE_LENGTH:  73
EPISODE:  200 	REWARD:  36.0 	MEAN_REWARD:  77.27 	EPISODE_LENGTH:  37
EPISODE:  300 	REWARD:  247.0 	MEAN_REWARD:  217.99 	EPISODE_LENGTH:  248
EPISODE:  400 	REWARD:  121.0 	MEAN_REWARD:  277.94 	EPISODE_LENGTH:  122
EPISODE:  500 	REWARD:  220.0 	MEAN_REWARD:  456.72 	EPISODE_LENGTH:  221
EPISODE:  600 	REWARD:  118.0 	MEAN_REWARD:  260.89 	EPISODE_LENGTH:  119
EPISODE:  700 	REWARD:  1001.0 	MEAN_REWARD:  681.0 	EPISODE_LENGTH:  1001
EPISODE:  800 	REWARD:  1001.0 	MEAN_REWARD:  1001.0 	EPISODE_LENGTH:  1001
EPISODE:  900 	REWARD:  1001.0 	MEAN_REWARD:  1001.0 	EPISODE_LENGTH:  1001
EPISODE:  1000 	REWARD:  1001.0 	MEAN_REWARD:  1001.0 	EPISODE_LENGTH:  1001


In [None]:
runner = Runner(
    agent = RLAgent(agent_config=cartpole_noise_config),
    env_id = environments.CARTPOLE_NOISE,
    log_interval = LOG_INTERVAL
)
runner.play_episodes()

[1m[37mLoaded bsuite_id: cartpole_noise/1.[0m
[1m[33mLogging results to CSV file for each bsuite_id in results.[0m


  0%|          | 0/1000 [00:00<?, ?it/s]

EPISODE:  100 	REWARD:  36.715083148951734 	MEAN_REWARD:  62.16 	EPISODE_LENGTH:  37
EPISODE:  200 	REWARD:  39.29689606435359 	MEAN_REWARD:  101.99 	EPISODE_LENGTH:  40
EPISODE:  300 	REWARD:  35.29209259584555 	MEAN_REWARD:  378.13 	EPISODE_LENGTH:  38
EPISODE:  400 	REWARD:  999.2396956459727 	MEAN_REWARD:  492.25 	EPISODE_LENGTH:  1001
EPISODE:  500 	REWARD:  1004.1082750813258 	MEAN_REWARD:  831.53 	EPISODE_LENGTH:  1001
EPISODE:  600 	REWARD:  1001.1586884055355 	MEAN_REWARD:  885.24 	EPISODE_LENGTH:  1001
EPISODE:  700 	REWARD:  1002.4590846465019 	MEAN_REWARD:  961.55 	EPISODE_LENGTH:  1001
EPISODE:  800 	REWARD:  994.4915962806824 	MEAN_REWARD:  969.89 	EPISODE_LENGTH:  1001
EPISODE:  900 	REWARD:  806.7554926865807 	MEAN_REWARD:  999.93 	EPISODE_LENGTH:  806
EPISODE:  1000 	REWARD:  115.27126164260049 	MEAN_REWARD:  778.02 	EPISODE_LENGTH:  117


In [None]:
runner = Runner(
    agent = RLAgent(agent_config=mountaincar_config),
    env_id = environments.MOUNTAINCAR,
    log_interval = LOG_INTERVAL
)
runner.play_episodes()

[1m[37mLoaded bsuite_id: mountain_car/0.[0m
[1m[33mLogging results to CSV file for each bsuite_id in results.[0m


  0%|          | 0/1000 [00:00<?, ?it/s]

EPISODE:  100 	REWARD:  -514.0 	MEAN_REWARD:  -870.81 	EPISODE_LENGTH:  514
EPISODE:  200 	REWARD:  -970.0 	MEAN_REWARD:  -484.62 	EPISODE_LENGTH:  970
EPISODE:  300 	REWARD:  -238.0 	MEAN_REWARD:  -390.61 	EPISODE_LENGTH:  238
EPISODE:  400 	REWARD:  -356.0 	MEAN_REWARD:  -312.91 	EPISODE_LENGTH:  356
EPISODE:  500 	REWARD:  -323.0 	MEAN_REWARD:  -241.27 	EPISODE_LENGTH:  323
EPISODE:  600 	REWARD:  -162.0 	MEAN_REWARD:  -201.45 	EPISODE_LENGTH:  162
EPISODE:  700 	REWARD:  -233.0 	MEAN_REWARD:  -206.9 	EPISODE_LENGTH:  233
EPISODE:  800 	REWARD:  -179.0 	MEAN_REWARD:  -199.01 	EPISODE_LENGTH:  179
EPISODE:  900 	REWARD:  -140.0 	MEAN_REWARD:  -196.24 	EPISODE_LENGTH:  140
EPISODE:  1000 	REWARD:  -154.0 	MEAN_REWARD:  -175.69 	EPISODE_LENGTH:  154


In [None]:
runner = Runner(
    agent = RLAgent(agent_config=mountaincar_noise_config),
    env_id = environments.MOUNTAINCAR_NOISE,
    log_interval = LOG_INTERVAL
)
runner.play_episodes()

[1m[37mLoaded bsuite_id: mountain_car_noise/1.[0m
[1m[33mLogging results to CSV file for each bsuite_id in results.[0m


  0%|          | 0/1000 [00:00<?, ?it/s]

EPISODE:  100 	REWARD:  -333.3603997115172 	MEAN_REWARD:  -763.56 	EPISODE_LENGTH:  337
EPISODE:  200 	REWARD:  -326.8620601575833 	MEAN_REWARD:  -396.58 	EPISODE_LENGTH:  324
EPISODE:  300 	REWARD:  -273.62591473780697 	MEAN_REWARD:  -242.07 	EPISODE_LENGTH:  271
EPISODE:  400 	REWARD:  -428.1501313802517 	MEAN_REWARD:  -268.54 	EPISODE_LENGTH:  427
EPISODE:  500 	REWARD:  -236.468116796463 	MEAN_REWARD:  -200.78 	EPISODE_LENGTH:  233
EPISODE:  600 	REWARD:  -178.60923101857816 	MEAN_REWARD:  -216.38 	EPISODE_LENGTH:  178
EPISODE:  700 	REWARD:  -186.82514676363155 	MEAN_REWARD:  -190.71 	EPISODE_LENGTH:  190
EPISODE:  800 	REWARD:  -161.49667794480445 	MEAN_REWARD:  -203.09 	EPISODE_LENGTH:  165
EPISODE:  900 	REWARD:  -152.03311973492632 	MEAN_REWARD:  -176.17 	EPISODE_LENGTH:  152
EPISODE:  1000 	REWARD:  -153.5394892329278 	MEAN_REWARD:  -214.11 	EPISODE_LENGTH:  153


# **Analysis & Result**

The following cells will show the score of the agent on each environment. The same scoring method will be used to evaluate your agent on a set of test environments.

In [None]:
# *** PLEASE DONT EDIT THE CONTENTS OF THIS CELL ***
analyzer = Analyzer(os.environ.get('RESULTS_DIR'))
analyzer.print_scores()

╒════════════════════╤══════════╕
│ ENVIRONMENT        │    SCORE │
╞════════════════════╪══════════╡
│ catch              │ 0.985875 │
├────────────────────┼──────────┤
│ catch_noise        │ 0.983    │
├────────────────────┼──────────┤
│ cartpole           │ 0.752063 │
├────────────────────┼──────────┤
│ cartpole_noise     │ 0.822941 │
├────────────────────┼──────────┤
│ mountain_car       │ 0.772049 │
├────────────────────┼──────────┤
│ mountain_car_noise │ 0.812676 │
╘════════════════════╧══════════╛


In [None]:
# If you want the scores as an object
analyzer.get_scores()

{'cartpole': 0.019666499999999986,
 'cartpole_noise': 0.01956450000000001,
 'catch': 0.0006250000000000699,
 'catch_noise': 0.0025000000000000022,
 'mountain_car': 0.1,
 'mountain_car_noise': 0.1}

## What is the score function

The score function was developed by the bsuite team at DeepMind. It is open source and available at https://github.com/deepmind/bsuite

The score measures only behavioral aspects of the agent and does not take the agent's internal state into account. For more details, read Section 2 of the [BSuite paper](https://openreview.net/forum?id=rygf-kSYwH). In this case we use only the "Basic" aspect of the agent's scoring system.

**It is not necessary to understand the score in order to improve your agent's performance.**

# **Backend Evaluation**

THIS CODE WILL EVALUATE THE AGENT USING THE SPECIFIED CONFIGS FOR THE CORRESPONDING ENVIRONMENTS. DO NOT EDIT THE CONTENTS OF THIS CELL.

In [None]:
## Do not edit this cell
if (os.environ.get('BACKEND_EVALUATOR') is not None):
    
    import backend_evaluator

    runs = {
        'catch': (
            backend_evaluator.CATCH, 
            catch_config),
        'catch_noise': (
            backend_evaluator.CATCH_NOISE, 
            catch_noise_config),
        'cartpole': (
            backend_evaluator.CARTPOLE, 
            cartpole_config),
        'cartpole_noise': (
            backend_evaluator.CARTPOLE_NOISE, 
            cartpole_noise_config),
        'mountaincar': (
            backend_evaluator.MOUNTAINCAR, 
            mountaincar_config),
        'mountaincar_noise': (
            backend_evaluator.MOUNTAINCAR_NOISE, 
            mountaincar_noise_config)
    }

    for run_name, run in runs.items():
        env_ids, config = run
        for env_id in env_ids:
            runner = Runner(env_id=env_id,
                            agent=RLAgent(agent_config=config),
                            verbose=False,
                            eval=True)
            runner.play_episodes()

# Submit to AIcrowd 🚀

**NOTE: PLEASE SAVE THE NOTEBOOK BEFORE SUBMITTING IT (Ctrl + S)**

In [None]:
! aicrowd notebook submit --no-verify -c iitm-rl-final-project -a assets

Mounting Google Drive 💾
Your Google Drive will be mounted to access the colab notebook
Mounted at /content/drive
Using notebook: /content/drive/MyDrive/Colab Notebooks/RL_FINAL_PROJECT_ARJUN version 3.ipynb for submission...
Scrubbing 