
# Deep Learning Labs: Grid World
Let's train, evaluate, and visualize the results of Q-Learning in the GridWorld environment.







### Environment: GridWorld

<a target='_blank'><img src='https://i.postimg.cc/28Jp6kb5/Immagine2.png' border='0' alt='Immagine2'/></a>

# Weights and Biases (WandB)

Register for an account at [WandB](https://wandb.ai/site)

<a href='https://postimg.cc/7fbkJh1H' target='_blank'><img src='https://i.postimg.cc/pXJX7npF/wandb-demo-experiments-gif.gif' border='0' alt='wandb-demo-experiments-gif'/></a>

In [None]:
!pip install wandb
!wandb login

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wandb
  Downloading wandb-0.12.17-py2.py3-none-any.whl (1.8 MB)
Collecting docker-pycreds>=0.4.0
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting pathtools
  Downloading pathtools-0.1.2.tar.gz (11 kB)
Collecting sentry-sdk>=1.0.0
  Downloading sentry_sdk-1.5.12-py2.py3-none-any.whl (145 kB)
Collecting setproctitle
  Downloading setproctitle-1.2.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29 kB)
Collecting shortuuid>=0.5.0
  Downloading shortuuid-1.0.9-py3-none-any.whl (9.4 kB)
Collecting GitPython>=1.0.0
  Downloading GitPython-3.1.27-py3-none-any.whl (181 kB)
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.9

# GridWorld

## Environment

In [None]:
import numpy as np  # needed below for np.intc, np.zeros and np.random
from gym.spaces import Discrete, Box


class GridWorld:
    # Env initialization
    def __init__(self):

        # State and action spaces
        self.action_space = Discrete(4)  # [0, 1, 2, 3] <-> [UP, DOWN, LEFT, RIGHT]
        self.observation_space = Box(low=0, high=9, shape=(2,), dtype=np.intc)  # [x, y], with x and y each in [0 -> 9]
        # -> Number of states of this MDP = 10 x 10 = 100

        # Set information about the gridworld
        self.height = 10
        self.width = 10

        # Reward initialization (in this env, they depend solely on the position of the agent in the grid)
        self.grid_rewards = np.zeros((self.height, self.width)) - 1
        """
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        """

        self.wall = [[1,2],[2,2],[3,2],[4,2], [5,2], [6,2], [7,2], [8,2],
                    [1,4],[2,4],[3,4],[4,4], [5,4], [6,4], [7,4], [8,4],
                    [1,6],[2,6],[3,6],[4,6], [5,6], [6,6], [7,6], [8,6],
                    [1,8],[2,8],[3,8],[4,8], [5,8], [6,8], [7,8], [8,8],
                    ]

        # Let's set the goal and game over location
        self.bomb_location = (3,5)
        self.gold_location = (3,3)
        self.terminal_states = [self.bomb_location, self.gold_location]

        # Let's set the goal and game over rewards to 10 and -10
        self.grid_rewards[self.bomb_location[0], self.bomb_location[1]] = -10
        self.grid_rewards[self.gold_location[0], self.gold_location[1]] = 10

        """
        grid_rewards is indexed as [x, y], so the gold (3,3) and the bomb (3,5)
        both sit in row 3:

        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 10 -1 -10 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
        """

    # Reset the env
    def reset(self):

        # Reset agent starting position
        self.current_location = (9, np.random.randint(0,10))

        return self.current_location

    # called at the end of env.step
    def get_reward(self, new_location):
        """Return the reward for the agent's position"""
        return self.grid_rewards[new_location[0], new_location[1]]


    def step(self, action):
        """
        Move the agent. If the move would leave the grid or enter a wall, the
        agent stays where it is. Returns, in this order: the reward (-1 for every
        step, except +10 for reaching the gold and -10 for hitting the bomb),
        the next state (the agent's new position), and done, which is True when
        the episode is finished (the agent reached the gold or the bomb).
        """


        last_location = self.current_location

        # action execution (environment "evolves")

        # UP
        if action == 0:  # action == 'UP':
            # If the agent can't go up, it stays put
            if last_location[1] == self.width - 1 or [self.current_location[0], self.current_location[1] + 1] in self.wall:
                reward = self.get_reward(last_location)
            else:
                self.current_location = ( self.current_location[0], self.current_location[1] + 1)
                reward = self.get_reward(self.current_location)

        # DOWN
        elif action == 1: # action == 'DOWN':
            # If it can't go down...
            if last_location[1] == 0 or [self.current_location[0], self.current_location[1] - 1] in self.wall:
                reward = self.get_reward(last_location)
            else:
                self.current_location = ( self.current_location[0], self.current_location[1] - 1)
                reward = self.get_reward(self.current_location)

        # LEFT
        elif action == 2: # action == 'LEFT':
            # If it can't go left...
            if last_location[0] == 0 or [self.current_location[0] - 1, self.current_location[1]] in self.wall:
                reward = self.get_reward(last_location)
            else:
                self.current_location = ( self.current_location[0] - 1, self.current_location[1])
                reward = self.get_reward(self.current_location)

        # RIGHT
        elif action == 3: # action == 'RIGHT':
            # If it can't go right...
            if last_location[0] == self.height - 1 or [self.current_location[0] + 1, self.current_location[1]] in self.wall:
                reward = self.get_reward(last_location)
            else:
                self.current_location = ( self.current_location[0] + 1, self.current_location[1])
                reward = self.get_reward(self.current_location)

        state = self.current_location
        done = self.check_state() == 'TERMINAL'
        return reward, state, done

    def check_state(self):
        """Return 'TERMINAL' if the agent is on the gold or the bomb cell,
        otherwise None (step() uses this to set done)"""
        if self.current_location in self.terminal_states:
            return 'TERMINAL'
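
The four branches of `step` all apply the same rule: compute the target cell, and stay put if it is off the grid or a wall. As a minimal sketch of that rule in isolation (the names `MOVES` and `next_location` are hypothetical helpers, not part of the notebook, and only three of the wall cells are reproduced):

```python
# Action encoding taken from the notebook: 0=UP, 1=DOWN, 2=LEFT, 3=RIGHT.
# Each action maps to a (dx, dy) offset; UP increases y, RIGHT increases x.
MOVES = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}


def next_location(location, action, walls, size=10):
    """Return the cell reached from `location`, or `location` itself if blocked."""
    dx, dy = MOVES[action]
    x, y = location[0] + dx, location[1] + dy
    off_grid = not (0 <= x < size and 0 <= y < size)
    if off_grid or (x, y) in walls:
        return location  # blocked: the agent stays put
    return (x, y)


# A few wall cells from the notebook's layout, stored as a set of tuples.
walls = {(1, 2), (2, 2), (3, 2)}

print(next_location((9, 0), 0, walls))  # UP into a free cell
print(next_location((9, 0), 1, walls))  # DOWN against the grid edge
print(next_location((1, 1), 0, walls))  # UP into the wall at (1, 2)
```

Storing the walls as a set of tuples makes the membership test constant-time; the notebook's list of lists behaves the same way on a grid this small.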

## Q-Learning Agent

In [1]:
import numpy as np
import operator

class Q_Agent():
    # Initialise
    def __init__(self, environment, epsilon=0.05, alpha=0.1, gamma=0.95):
        self.environment = environment
        self.q_table = dict() # Store the action-values (Q-values) in a dictionary keyed by state
        for x in range(environment.height): # Loop over all possible states, and for each state...
            for y in range(environment.width):
                self.q_table[(x,y)] = {0:0, 1:0, 2:0, 3:0}  # ... initialize 4 Q-values to zero, one for each possible action
                # {'UP' :0, 'DOWN':0, 'LEFT':0, 'RIGHT':0}

        """
        (state) -> {action0: q-value0, action1: q-value1, action2: q-value2, action3: q-value3} for every state!

        (0,0) -> {0:0, 1:0, 2:0, 3:0}, (0,1) -> {0:0, 1:0, 2:0, 3:0}, ...
        (1,0) -> {0:0, 1:0, 2:0, 3:0}, (1,1) -> {0:0, 1:0, 2:0, 3:0}, ...
        (2,0) -> {0:0, 1:0, 2:0, 3:0}, (2,1) -> {0:0, 1:0, 2:0, 3:0}, ...
        ...

        """
        self.epsilon = epsilon  # exploration param
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor

    def choose_action(self, state, train=True):
        """Return the action with the highest current Q-value. If several
        actions share the maximum value, pick one of them at random. During
        training, a random action is taken with probability epsilon."""
        # state = environment.current_location
        if train and np.random.uniform(0,1) < self.epsilon:
            # action = available_actions[np.random.randint(0, 4)]
            action = self.environment.action_space.sample()
        else:
            q_values_of_state = self.q_table[state]
            # example: {0:6.5, 1:3.2, 2:10, 3:9}
            # {'UP':6.5, 'DOWN':3.2, 'LEFT':10, 'RIGHT':9}
            maxValue = max(q_values_of_state.values())  # a single number, but several actions may share it
            # example: maxValue = 10 (the largest value in {6.5, 3.2, 10, 9})
            action = np.random.choice([k for k, v in q_values_of_state.items() if v == maxValue])  # pick one of the actions whose Q-value equals maxValue
            # example: 2 (the action (key) whose Q-value equals maxValue = 10)

        return action

    def learn(self, old_state, reward, new_state, action):
        """Update Q-values using the Q-Learning formula (see slides)"""
        q_values_of_state = self.q_table[new_state]
        max_q_value_in_new_state = max(q_values_of_state.values())
        current_q_value = self.q_table[old_state][action]

        self.q_table[old_state][action] = (1 - self.alpha) * current_q_value + self.alpha * (reward + self.gamma * max_q_value_in_new_state)
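
To make the update in `learn` concrete, here is a small standalone sketch of the same arithmetic. The function names `q_update` and `q_update_td` are hypothetical; the first mirrors the line above, the second writes the same rule in the temporal-difference form Q + alpha * (r + gamma * max Q' - Q), which is algebraically identical.

```python
def q_update(current_q, reward, max_next_q, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update, written as in Q_Agent.learn."""
    target = reward + gamma * max_next_q
    return (1 - alpha) * current_q + alpha * target


def q_update_td(current_q, reward, max_next_q, alpha=0.1, gamma=0.95):
    """The same update in temporal-difference form."""
    td_error = reward + gamma * max_next_q - current_q
    return current_q + alpha * td_error


# Stepping onto the gold cell from a neighbour whose Q-value is still 0:
# the reward is 10 and the gold cell's own Q-values are still 0.
first_visit = q_update(0.0, 10, 0.0)
print(first_visit)

# An ordinary step (reward -1) towards a state whose best Q-value is 1.0.
ordinary_step = q_update(0.0, -1, 1.0)
print(ordinary_step, q_update_td(0.0, -1, 1.0))
```

With the default `alpha=0.1`, a single visit moves the estimate only a tenth of the way towards the target, which is why the value of the gold has to propagate backwards over many episodes.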

## Train & Test

In [None]:
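import argparse
import wandb

def main():
    # `args` is the module-level namespace returned by argumentParser().parse_args()
    wandb.init(project='new-rl-example', config=args)
    environment = GridWorld()
    # Pass the parsed hyperparameters to the agent; otherwise it always uses its defaults
    agentQ = Q_Agent(environment, epsilon=args.epsilon, alpha=args.alpha, gamma=args.gamma)

    train(environment, agentQ, episodes=500)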
    mean_reward = test(environment, agentQ, episodes=1000)

    wandb.log({'mean_reward_test': mean_reward})
    log_wandb(environment, agentQ)

# note: this function could be implemented inside the agent
def train(environment, agent, episodes=500, max_steps_per_episode=1000):
    """Train the agent with Q-learning and log each episode's cumulative reward to wandb."""
    reward_per_episode = [] # Initialise performance log

    for episode in range(episodes): # Run episode
        #rewards = []
        cumulative_reward = 0
        step = 0
        done = False
        state = environment.reset()
        while step < max_steps_per_episode and not done: # Run until max steps or until episode is finished
            # state = environment.current_location
            action = agent.choose_action(state, True)
            reward, new_state, done = environment.step(action)
            # new_state = environment.current_location

            # Update Q-values
            agent.learn(state, reward, new_state, action)
            state = new_state

            cumulative_reward += reward
            step += 1
            #rewards.append(reward)

        #wandb.log({"rewards": wandb.Histogram(rewards)})
        reward_per_episode.append(cumulative_reward) # Append reward for current episode to performance log
        wandb.log({'reward cumulativo': cumulative_reward, 'episodio': episode})

# note: this function could be implemented inside the agent
def test(environment, agent, episodes=500, max_steps_per_episode=1000, learn=False):
    """Run the greedy policy (no exploration, no Q-value updates) and return the mean cumulative reward."""
    reward_per_episode = [] # Initialise performance log

    for episode in range(episodes): # Run episodes
        cumulative_reward = 0
        step = 0
        done = False
        state = environment.reset()
        while step < max_steps_per_episode and not done: # Run until max steps or until game is finished
            # state = environment.current_location
            action = agent.choose_action(state, False)
            reward, new_state, done = environment.step(action)
            state = new_state  # advance; otherwise every action is chosen for the start state

            cumulative_reward += reward
            step += 1


        reward_per_episode.append(cumulative_reward) # Append reward for current episode to performance log

    return sum(reward_per_episode)/len(reward_per_episode) # Return the mean cumulative reward per episode

def log_wandb(environment, agentQ):
    value_map = np.zeros((environment.height, environment.width))
    best_action_map = np.zeros((environment.height, environment.width))
    for x in range(environment.height):
        for y in range(environment.width):
          q_values_of_state = agentQ.q_table[(x,y)]
          maxValue = max(q_values_of_state.values())
          value_map[y,x] = maxValue  # wandb appears to draw the heatmap with x and y swapped, so index as [y, x]
          # maxAction = max(q_values_of_state, key=q_values_of_state.get)  # to obtain the policy in each state

    wandb.log({'State Value Function': wandb.plots.HeatMap(list(range(environment.height)), list(range(environment.width)), value_map, show_text=True)})


def argumentParser():
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--epsilon', default=0.05, type=float, help='Probability of choosing random action')
    parser.add_argument('--alpha', default=0.1, type=float, help='Learning Rate')
    parser.add_argument('--gamma', default=0.95, type=float, help='Discounting Factor')

    return parser
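
A quick way to check the parser is to pass it an explicit argument list. This is a sketch that rebuilds the same three options under the hypothetical name `build_parser`; the explicit list matters in a notebook, because `parse_args()` with no arguments reads `sys.argv`, which contains the kernel's own flags there.

```python
import argparse


def build_parser():
    # Mirrors argumentParser() from the notebook.
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--epsilon', default=0.05, type=float, help='Probability of choosing a random action')
    parser.add_argument('--alpha', default=0.1, type=float, help='Learning rate')
    parser.add_argument('--gamma', default=0.95, type=float, help='Discount factor')
    return parser


# No flags: every hyperparameter takes its default value.
defaults = build_parser().parse_args([])

# Flags in the `--name=value` form override individual defaults.
overridden = build_parser().parse_args(['--alpha=0.3', '--gamma=0.5'])

print(vars(defaults))
print(vars(overridden))
```

The resulting namespace can be handed to `wandb.init(config=...)`, as `main()` does, so that each run records the hyperparameters it was launched with.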

## Wandb Sweeps Setup
Let's put everything into a single main.py file, so that wandb can launch it when running **sweeps**

In [None]:
%%writefile main.py


import numpy as np
import wandb
import argparse
from gym.spaces import Box, Discrete

class Q_Agent():
    # Initialise
    def __init__(self, environment, epsilon=0.05, alpha=0.1, gamma=0.95):
        self.environment = environment
        self.q_table = dict()
        for x in range(environment.height):
            for y in range(environment.width):
                self.q_table[(x,y)] = {0:0, 1:0, 2:0, 3:0}

        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma

    def choose_action(self, state, train=True):
        # state = environment.current_location
        if train and np.random.uniform(0,1) < self.epsilon:

            action = self.environment.action_space.sample()
        else:
            q_values_of_state = self.q_table[state]
            maxValue = max(q_values_of_state.values())
            action = np.random.choice([k for k, v in q_values_of_state.items() if v == maxValue])

        return action

    def learn(self, old_state, reward, new_state, action):
        q_values_of_state = self.q_table[new_state]
        max_q_value_in_new_state = max(q_values_of_state.values())
        current_q_value = self.q_table[old_state][action]

        self.q_table[old_state][action] = (1 - self.alpha) * current_q_value + self.alpha * (reward + self.gamma * max_q_value_in_new_state)



class GridWorld:
    ## Initialise starting data
    def __init__(self):


        self.action_space = Discrete(4)  # [0, 1, 2, 3] <-> [UP, DOWN, LEFT, RIGHT]
        self.observation_space = Box(low=0, high=9, shape=(2,), dtype=np.intc)  # [x, y] (x in [0 -> 9], y in [0 -> 9])


        # Set information about the gridworld
        self.height = 10
        self.width = 10


        self.grid_rewards = np.zeros((self.height, self.width)) - 1
        self.wall = [[1,2],[2,2],[3,2],[4,2], [5,2], [6,2], [7,2], [8,2],
                    [1,4],[2,4],[3,4],[4,4], [5,4], [6,4], [7,4], [8,4],
                    [1,6],[2,6],[3,6],[4,6], [5,6], [6,6], [7,6], [8,6],
                    [1,8],[2,8],[3,8],[4,8], [5,8], [6,8], [7,8], [8,8],
                    ]


        self.bomb_location = (3,5)
        self.gold_location = (3,3)
        self.terminal_states = [self.bomb_location, self.gold_location]


        self.grid_rewards[self.bomb_location[0], self.bomb_location[1]] = -10
        self.grid_rewards[self.gold_location[0], self.gold_location[1]] = 10


    def reset(self):


        self.current_location = (9, np.random.randint(0,10))
        # self.current_location = (np.random.randint(0,10), 9)
        return self.current_location


    def get_reward(self, new_location):

        return self.grid_rewards[new_location[0], new_location[1]]


    def step(self, action):

        last_location = self.current_location

        # LEFT
        if action == 2: # action == 'LEFT':

            if last_location[0] == 0 or [self.current_location[0] - 1, self.current_location[1]] in self.wall:
                reward = self.get_reward(last_location)
            else:
                self.current_location = ( self.current_location[0] - 1, self.current_location[1])
                reward = self.get_reward(self.current_location)

        # RIGHT
        elif action == 3: # action == 'RIGHT':

            if last_location[0] == self.height - 1 or [self.current_location[0] + 1, self.current_location[1]] in self.wall:
                reward = self.get_reward(last_location)
            else:
                self.current_location = ( self.current_location[0] + 1, self.current_location[1])
                reward = self.get_reward(self.current_location)

        # DOWN
        elif action == 1: # action == 'DOWN':

            if last_location[1] == 0 or [self.current_location[0], self.current_location[1] - 1] in self.wall:
                reward = self.get_reward(last_location)
            else:
                self.current_location = ( self.current_location[0], self.current_location[1] - 1)
                reward = self.get_reward(self.current_location)

        # UP
        elif action == 0: # action == 'UP':

            if last_location[1] == self.width - 1 or [self.current_location[0], self.current_location[1] + 1] in self.wall:
                reward = self.get_reward(last_location)
            else:
                self.current_location = ( self.current_location[0], self.current_location[1] + 1)
                reward = self.get_reward(self.current_location)
        state = self.current_location
        if self.check_state() == 'TERMINAL':
          done = True
        else:
          done = False
        return reward, state, done

    def check_state(self):

        if self.current_location in self.terminal_states:
            return 'TERMINAL'

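def main():
    # `args` is the module-level namespace parsed in the __main__ block
    wandb.init(project='new-rl-example', config=args)
    environment = GridWorld()
    # Pass the parsed hyperparameters to the agent; otherwise the sweep values are ignored
    agentQ = Q_Agent(environment, epsilon=args.epsilon, alpha=args.alpha, gamma=args.gamma)

    train(environment, agentQ, episodes=500)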
    mean_reward = test(environment, agentQ, episodes=1000)

    wandb.log({'mean_reward_test': mean_reward})
    log_wandb(environment, agentQ)


def train(environment, agent, episodes=500, max_steps_per_episode=1000):
    """Train the agent with Q-learning and log each episode's cumulative reward to wandb."""
    reward_per_episode = [] # Initialise performance log

    for episode in range(episodes): # Run episode
        rewards = []
        cumulative_reward = 0
        step = 0
        done = False
        state = environment.reset()
        while step < max_steps_per_episode and not done: # Run until max steps or until episode is finished
            # state = environment.current_location
            action = agent.choose_action(state, True)
            reward, new_state, done = environment.step(action)
            # new_state = environment.current_location

            # Update Q-values
            agent.learn(state, reward, new_state, action)
            state = new_state
            cumulative_reward += reward
            step += 1
            rewards.append(reward)



        #wandb.log({"rewards": wandb.Histogram(rewards)})
        reward_per_episode.append(cumulative_reward) # Append reward for current episode to performance log
        wandb.log({'reward cumulativo': cumulative_reward, 'episodio': episode})

def test(environment, agent, episodes=500, max_steps_per_episode=1000, learn=False):
    """Run the greedy policy (no exploration, no Q-value updates) and return the mean cumulative reward."""
    reward_per_episode = [] # Initialise performance log
    for episode in range(episodes): # Run episodes
        cumulative_reward = 0
        step = 0
        done = False
        state = environment.reset()
        while step < max_steps_per_episode and not done: # Run until max steps or until episode is finished
            # old_state = environment.current_location
            action = agent.choose_action(state, False)
            reward, new_state, done = environment.step(action)
            # new_state = environment.current_location
            state = new_state

            cumulative_reward += reward
            step += 1


        reward_per_episode.append(cumulative_reward) # Append reward for current episode to performance log

    return sum(reward_per_episode)/len(reward_per_episode) # Return the mean cumulative reward per episode

def log_wandb(environment, agentQ):
    value_map = np.zeros((environment.height, environment.width))
    xmap = np.zeros((environment.height, environment.width))
    for x in range(environment.height):
        for y in range(environment.width):
          q_values_of_state = agentQ.q_table[(x,y)]
          maxValue = max(q_values_of_state.values())
          # maxAction = max(q_values_of_state, key=q_values_of_state.get)  # to obtain the policy in each state

          # wandb appears to draw the heatmap with x and y swapped, so index as [y, x]
          value_map[y,x] = maxValue

    for x in range(environment.height):
        for y in range(environment.width):
          if x == 9:
            xmap[y,x] = 1
          elif x == environment.bomb_location[0] and y == environment.bomb_location[1]:
            xmap[y,x] = -2
          elif x == environment.gold_location[0] and y == environment.gold_location[1]:
            xmap[y,x] = 2
          elif [x,y] in environment.wall:
            xmap[y,x] = 0
          else:
            xmap[y,x] = -1



    wandb.log({'State Value Function': wandb.plots.HeatMap(list(range(environment.height)), list(range(environment.width)), value_map, show_text=True)})
    wandb.log({'Position': wandb.plots.HeatMap(list(range(environment.height)), list(range(environment.width)), xmap, show_text=True)})


def argumentParser():
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--epsilon', default=0.05, type=float, help='Probability of choosing random action')
    parser.add_argument('--alpha', default=0.1, type=float, help='Learning Rate')
    parser.add_argument('--gamma', default=0.95, type=float, help='Discounting Factor')

    return parser

if __name__ == '__main__':
  # Assigned at module level, so main() can read `args` as a global
  args = argumentParser().parse_args()
  main()
  wandb.save('main.py')

Writing main.py


In [None]:
!python3 main.py

[34m[1mwandb[0m: Currently logged in as: [33mxraulz[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.12.17
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/content/wandb/run-20220607_160837-3tyf9975[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mbumbling-aardvark-225[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/xraulz/new-rl-example[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/xraulz/new-rl-example/runs/3tyf9975[0m
[34m[1mwandb[0m: Visualizing heatmap.
[34m[1mwandb[0m: Visualizing heatmap.
[34m[1mwandb[0m: Waiting for W&B process to finish... [32m(success).[0m
[34m[1mwandb[0m:                                                                                
[34m[1mwandb[0m: 
[34m[1mwandb[0m: Run history:
[34m[1mwandb[0m:          episodio ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
[34m

## Sweep for Hyperparameter tuning

In [None]:
%%writefile sweep.yaml
project: "new-rl-example"
program: main.py
method: bayes
metric:
  name: mean_reward_test
  goal: maximize
parameters:
  alpha:
    distribution: 'uniform'
    min: 0.1
    max: 1
  gamma:
    distribution: 'uniform'
    min: 0.1
    max: 1
  epsilon:
    distribution: 'uniform'
    min: 0.01
    max: 1

Writing sweep.yaml
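
The same search space can also be written as a Python dictionary and registered from code rather than from the CLI. The sketch below assumes the `wandb.sweep` and `wandb.agent` entry points of the wandb Python client; `launch_sweep` and `run_fn` are hypothetical names, and `program: main.py` is omitted because in this mode the agent calls a function instead of a script.

```python
# Dictionary form of sweep.yaml (same method, metric and parameter ranges).
sweep_config = {
    "method": "bayes",
    "metric": {"name": "mean_reward_test", "goal": "maximize"},
    "parameters": {
        "alpha": {"distribution": "uniform", "min": 0.1, "max": 1},
        "gamma": {"distribution": "uniform", "min": 0.1, "max": 1},
        "epsilon": {"distribution": "uniform", "min": 0.01, "max": 1},
    },
}


def launch_sweep(run_fn, count=5):
    """Register the sweep and run `count` trials in this process.

    `run_fn` is a hypothetical zero-argument function that calls wandb.init()
    and reads its hyperparameters from wandb.config instead of argparse.
    """
    import wandb  # imported lazily so the config can be inspected offline

    sweep_id = wandb.sweep(sweep_config, project="new-rl-example")
    wandb.agent(sweep_id, function=run_fn, count=count)


# Inspect the search space without contacting the wandb servers.
for name, spec in sweep_config["parameters"].items():
    print(name, spec["min"], spec["max"])
```

Calling `launch_sweep` requires a logged-in wandb session, so it is left uncalled here. The YAML route used in this notebook has the advantage that the agent runs `main.py` as a separate process and passes the sampled values as command-line flags.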


In [None]:
!wandb sweep sweep.yaml

[34m[1mwandb[0m: Creating sweep from: sweep.yaml
[34m[1mwandb[0m: Created sweep with ID: [33m29qhifs0[0m
[34m[1mwandb[0m: View sweep at: [34m[4mhttps://wandb.ai/xraulz/new-rl-example/sweeps/29qhifs0[0m
[34m[1mwandb[0m: Run sweep agent with: [33mwandb agent xraulz/new-rl-example/29qhifs0[0m


Replace `user_name/project_name/sweep_name` in the next cell with the sweep path printed on the last line of the previous command's output:

`wandb: Run sweep agent with: wandb agent user_name/project_name/sweep_name`

In [None]:
!wandb agent user_name/project_name/sweep_name

[34m[1mwandb[0m: Starting wandb agent 🕵️
[34m[1mwandb[0m: [32m[41mERROR[0m Find detailed error logs at: /content/wandb/debug-cli.log
Error: [31mSweep nome_utente/nome_progetto/nome_sweep not found[0m


In [None]:
!wandb agent xraulz/new-rl-example/3wmu40ee

[34m[1mwandb[0m: Starting wandb agent 🕵️
2022-06-07 16:08:58,042 - wandb.wandb_agent - INFO - Running runs: []
2022-06-07 16:08:58,700 - wandb.wandb_agent - INFO - Agent received command: run
2022-06-07 16:08:58,700 - wandb.wandb_agent - INFO - Agent starting run with config:
	alpha: 0.2759481289942661
	epsilon: 0.1380226024163234
	gamma: 0.5525242375112407
2022-06-07 16:08:58,701 - wandb.wandb_agent - INFO - About to run command: /usr/bin/env python main.py --alpha=0.2759481289942661 --epsilon=0.1380226024163234 --gamma=0.5525242375112407
[34m[1mwandb[0m: Currently logged in as: [33mxraulz[0m. Use [1m`wandb login --relogin`[0m to force relogin
2022-06-07 16:09:03,711 - wandb.wandb_agent - INFO - Running runs: ['oy4vtw5q']
[34m[1mwandb[0m: Tracking run with wandb version 0.12.17
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/content/wandb/run-20220607_160859-oy4vtw5q[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Sy