# Q-Learning on LunarLander-v2 Environment

Author: Gleb Tcivie

In this project, we are training a Q-Learning agent to solve the `LunarLander-v2` environment from OpenAI's Gym.

The objective of the LunarLander-v2 environment is to optimize the trajectory of a spacecraft landing on the moon. The environment is modeled after the classic rocket trajectory optimization problem, with the actions being discrete in nature - either fire the engine at full throttle or keep it off.

If you would like to skip straight ahead and view this experiment's results, you can see them here.

## Training Process

For the LunarLander problem, I adopted a similar approach as with the MountainCar environment and implemented a Q-Learning algorithm. However, this time I expanded upon the exploration strategy by introducing an Upper Confidence Bound (UCB) policy. My objective was to compare the performance and characteristics of the UCB policy against the ε-greedy policy, a commonly used method in Q-Learning. In addition, I aimed to evaluate the effects of applying an optimistic initialization strategy in combination with the ε-greedy policy.

The UCB policy is designed to balance exploration and exploitation in reinforcement learning. It leverages uncertainty and variance in the reward estimates to guide the exploration process. This differs from the ε-greedy policy, which explores the action space purely randomly with a certain probability ε.

On the other hand, optimistic initialization is a simple yet powerful method to encourage exploration in the early stages of training. By initializing the Q-values optimistically (i.e., with higher than achievable values), the agent is incentivized to explore all actions to learn their actual rewards.
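To make the effect of optimistic initialization concrete, here is a minimal sketch on a toy three-armed bandit with deterministic rewards. The bandit, the function name `greedy_bandit`, and all the numbers are assumptions made for illustration; they are not part of the LunarLander experiment.

```python
import numpy as np


def greedy_bandit(true_rewards, init_q, steps, lr=0.5):
    """Purely greedy agent on a deterministic bandit; returns (Q-values, pull counts)."""
    q = np.full(len(true_rewards), float(init_q))
    counts = np.zeros(len(true_rewards), dtype=int)
    for _ in range(steps):
        action = int(np.argmax(q))  # always exploit, never explore at random
        q[action] += lr * (true_rewards[action] - q[action])
        counts[action] += 1
    return q, counts


rewards = [1.0, 2.0, 3.0]
_, optimistic_counts = greedy_bandit(rewards, init_q=10.0, steps=30)
_, pessimistic_counts = greedy_bandit(rewards, init_q=0.0, steps=30)
print(optimistic_counts)   # every arm is pulled at least once
print(pessimistic_counts)  # only the first arm is ever pulled
```

With the optimistic start, each pull lowers the estimate of the pulled arm towards its true reward, so the untried arms keep looking better until they have been sampled as well.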

## Hyperparameter Tuning

We use Weights & Biases Sweeps for hyperparameter tuning. The sweep explores the learning rate, the discount factor, the number of discretization bins per state dimension, and the exploration settings (strategy, ε, the UCB constant, and the initial Q-values). The agent's performance is measured by the average reward over the last 100 episodes.
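As a small illustration of that metric, the sketch below averages the most recent 100 episode rewards. The helper name `rolling_average` and the sample rewards are assumptions made for this example.

```python
import numpy as np


def rolling_average(rewards, window=100):
    """Mean of the last `window` rewards (or of all rewards if there are fewer)."""
    return float(np.mean(rewards[-window:]))


episode_rewards = list(range(200))  # hypothetical rewards: 0, 1, ..., 199
print(rolling_average(episode_rewards))  # averages only rewards 100..199
```

Early in training, when fewer than 100 episodes exist, the slice simply covers everything seen so far.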

# Code and Running

## Imports and Installs

In [2]:
!pip install swig
!pip install wandb
!pip install gym[all]
!pip install imageio


In [3]:
import math
import gym
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
import wandb
import time
import imageio

## Sweeps configurations

At first I thought of discretizing each state dimension into one of [10, 20, 30, 40] bins, but I quickly ran into the "curse of dimensionality". A quick calculation shows how problematic this gets in terms of RAM:

40^8*4 * 4 bytes in GB = 104,857.6 GB
20^8*4 * 4 bytes in GB = 409.6 GB
10^8*4 * 4 bytes in GB = 1.6 GB

To cope with this, I had to lower the number of bins per dimension to something more feasible. We can rearrange the formula to solve for the largest bin count that fits in a given amount of RAM:

![Ram usage calculation](image-20230705-104029.png)
where <code>RAM</code> is our target RAM in GB, <code>4 bytes</code> is the approximate size of each cell, and the last <code>4</code> is the number of actions.

I have 5 GB available on this machine, so we can use this calculation to approximate the allowed number of bins per dimension, rounding down to stay on the safe side:

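$$\left\lfloor \sqrt[8]{\frac{5\ \text{GB}}{4\ \text{bytes} \cdot 4}} \right\rfloor = 11 \qquad \left\lfloor \sqrt[8]{\frac{1\ \text{GB}}{4\ \text{bytes} \cdot 4}} \right\rfloor = 9$$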

So, ideally, our upper bound is 11 bins, which is not much. The machine also needs some RAM for other parameters and calculations, so we settle for a maximum of 10 bins and a minimum of 4.
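The arithmetic above can be reproduced with a short sketch. The function names `q_table_gb` and `max_bins` are assumptions made for illustration; the defaults mirror the estimate in the text (8 state dimensions with the same number of bins each, 4 actions, 4 bytes per cell, decimal gigabytes).

```python
def q_table_gb(bins, dims=8, actions=4, bytes_per_cell=4):
    """Approximate Q-table size in GB for `bins` buckets per state dimension."""
    return bins ** dims * actions * bytes_per_cell / 1e9


def max_bins(ram_gb, dims=8, actions=4, bytes_per_cell=4):
    """Largest bin count whose Q-table still fits into `ram_gb` gigabytes."""
    bins = 1
    while q_table_gb(bins + 1, dims, actions, bytes_per_cell) <= ram_gb:
        bins += 1
    return bins


for bins in (10, 20, 40):
    print(bins, q_table_gb(bins))  # 1.6 GB, 409.6 GB, 104857.6 GB

print(max_bins(5), max_bins(1))  # → 11 9
```

Note that NumPy's default `float64` cells take 8 bytes each, so passing `bytes_per_cell=8` gives a more conservative estimate.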

Also, compared with the previous notebook, I learned that we can use ranges in the sweep configuration, as you can see below:

In [4]:
sweep_config = {
    'method': 'bayes',
    'metric': {
        'name': 'avg_reward',
        'goal': 'maximize'
    },
    'parameters': {
        'learning_rate': {
            'min': 0.001, 
            'max': 0.1, 
            'distribution': 'uniform'
        },
        'discount': {
            'min': 0.8, 
            'max': 1.0, 
            'distribution': 'uniform'
        },
        'epsilon': {
            'min': 0.1, 
            'max': 1.0, 
            'distribution': 'uniform'
        },
        'num_states': {
            'min': 4,
            'max': 10,
            'distribution': 'int_uniform'
        },
        'exploration_strategy': {
            'values': ['epsilon_greedy', 'ucb']
        },
        'ucb_c': {
            'min': 0.1, 
            'max': 2, 
            'distribution': 'uniform'
        },
        'init_q_value': {
            'min': 0, 
            'max': 200, 
            'distribution': 'int_uniform'
        },
        'init_q_state': {
            'values': ['random']
        },
        'end_epsilon_decay': {
            'values': [0.1, 0.25, 0.5, 1, 2]
        }
    },
    'count': 20  # limit sweep to 20 runs
}

Number of episodes to run. At first I ran the tests with 25K episodes, but then I noticed that some training runs were "cut off" before converging, suggesting that we might benefit from more episodes.

In [5]:
EPISODES = 50000

## Helper Functions

choose_action_greedy - This function selects an action for a given state based on the Q-table values. It implements the ε-greedy exploration strategy, which makes a trade-off between exploration and exploitation. If a random number is greater than ε (epsilon), it chooses the action with the highest Q-value for the given state (exploitation). Otherwise, it selects a random action (exploration).

choose_action_UCB - This function also selects an action for a given state, but uses the Upper Confidence Bound (UCB) algorithm. This algorithm balances exploitation and exploration by choosing the action with the highest combined Q-value and exploration function value. The exploration function increases as the action is chosen less frequently, leading to a more balanced exploration of the action space.

UCB is based on the principle of optimism in the face of uncertainty, meaning it tends to prefer actions that have not been tried often. The core idea of UCB is to choose the action that has the maximum upper confidence bound, which is the sum of the current estimated value of the action and an uncertainty term. The uncertainty term grows the fewer times an action has been tried, making rarely tried actions more attractive. The UCB action selection formula is as follows:

<img src="/work/SCR-20230626-rqjk.png"></img>

Here <code>t</code> is the total number of actions taken so far, <code>Nt(a)</code> is the number of times action <code>a</code> has been taken, and the constant <code>c > 0</code> controls the degree of exploration. One important thing to take into account is that we cannot have negative numbers under the root, so I had to restrict the values in the <code>q_table</code> to be <code>>= 0</code>.
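For comparison, here is a minimal sketch of the textbook formula with explicit visit counts. It is not the notebook's `choose_action_UCB`, which derives its exploration bonus from the Q-values themselves; the names `ucb_action`, `counts`, and `t` are assumptions made for illustration.

```python
import numpy as np


def ucb_action(q_values, counts, t, c):
    """Pick argmax of Q(a) + c * sqrt(ln(t) / N(a)); untried actions go first."""
    q_values = np.asarray(q_values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    untried = counts == 0
    if untried.any():
        # An action that was never taken has an unbounded uncertainty term
        return int(np.argmax(untried))
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_values + bonus))


print(ucb_action([1.0, 2.0, 3.0], counts=[0, 5, 5], t=10, c=1.0))  # → 0 (untried)
print(ucb_action([1.0, 1.0], counts=[1, 100], t=101, c=2.0))       # → 0 (rarely tried)
print(ucb_action([1.0, 2.0, 3.0], counts=[1, 1, 1], t=3, c=0.0))   # → 2 (pure greedy)
```

In a tabular setting this variant would need one counter per state-action pair, that is, a second array with the same shape as the Q-table.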




update_q_table - This function updates the Q-table using the Q-Learning update rule. It calculates the new Q-value for the state-action pair as a weighted average of the old value and the learned value, where the learned value is the sum of the reward and the discounted estimate of the optimal future Q-value.
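A worked example of this weighted average may help. The standalone function `q_update` and the numbers below are assumptions made for illustration; the arithmetic is the same as in `update_q_table`.

```python
def q_update(current_q, reward, max_future_q, lr, discount):
    """One Q-Learning update: blend the old value with the learned value."""
    learned_value = reward + discount * max_future_q
    return (1 - lr) * current_q + lr * learned_value


# Non-terminal step: 0.9 * 10 + 0.1 * (1 + 0.9 * 20) = 10.9
print(q_update(current_q=10.0, reward=1.0, max_future_q=20.0, lr=0.1, discount=0.9))

# Terminal step: there is no future value, so max_future_q is 0 -> 9.1
print(q_update(current_q=10.0, reward=1.0, max_future_q=0.0, lr=0.1, discount=0.9))
```

A small learning rate moves the estimate only slightly towards the learned value, which smooths out noisy rewards at the cost of slower learning.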

An important change in this function, made because of the UCB algorithm, is that I had to clamp the Q-values to a lower bound of 0 (because of the ln() function and the square root, which cannot take negative inputs).

log_metrics - This function logs various metrics after each episode, including the reward of the episode, the number of steps, the maximum and minimum rewards obtained so far, the average reward over the last 100 episodes, and the episode throughput (`eps`, episodes per second).

get_q_table_shape - A helper function for the initialization (I separated it out to make the code more readable).

In [6]:
def discretize_state(state):
    digitized_state = []
    for i in range(len(RANGES)):  # Adjust this if the number of states changes
        if i < 6:  # This is a continuous variable
            digitized_state.append(np.digitize(state[i], RANGES[i]))
        else:  # This is a boolean variable
            digitized_state.append(int(state[i]))
    return tuple(digitized_state)


def reset_environment():
    observation, info = ENV.reset()
    discrete_state = discretize_state(observation)
    return observation, discrete_state


def choose_action_greedy(discrete_state):
    if np.random.random() > EPSILON:
        action = np.argmax(Q_TABLE[discrete_state])
    else:
        action = np.random.randint(0, ENV.action_space.n)
    return action


def choose_action_UCB(discrete_state):
    c = UCB_C
    ucb_values = Q_TABLE[discrete_state] + c * np.sqrt(
        np.log(np.sum(Q_TABLE[discrete_state]) + 1) / (Q_TABLE[discrete_state] + 1e-5))
    action = np.argmax(ucb_values)
    return action


def update_q_table(discrete_state, action, reward, new_discrete_state):
    # A discretized state is always a tuple; callers pass a non-tuple placeholder (e.g. 0)
    # for terminal transitions, where the future value must be 0.
    if isinstance(new_discrete_state, tuple):
        max_future_q = np.max(Q_TABLE[new_discrete_state])  # estimate of optimal future value
    else:
        max_future_q = 0.0  # no future value after a terminal state
    current_q = Q_TABLE[discrete_state + (action,)]  # current Q-value
    new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
    if CONFIG.exploration_strategy == 'ucb':  # apply lower bound only if UCB strategy
        new_q = max(new_q, 0.0)  # clamp at 0 to avoid negative numbers under the root in the UCB formula
    Q_TABLE[discrete_state + (action,)] = new_q  # update Q-table with new Q-value


def log_metrics(run, reward_list, max_reward_list, min_reward_list, episode_reward, episode_steps, duration):
    reward_list.append(episode_reward)
    max_reward_list.append(max(episode_reward, max_reward_list[-1]) if max_reward_list else episode_reward)
    min_reward_list.append(min(episode_reward, min_reward_list[-1]) if min_reward_list else episode_reward)
    avg_reward = np.mean(reward_list[-100:])  # average over last 100 episodes
    metrics = {'eps': 1 / duration, 'reward': episode_reward, 'steps': episode_steps,
               'avg_reward': avg_reward, 'max_reward': max_reward_list[-1],
               'min_reward': min_reward_list[-1]}
    run.log(metrics)


def get_q_table_shape():
    shapes = []
    for i in range(8):  # Adjust this if the number of states changes
        if i < 6:  # This is a continuous variable
            shapes.append(len(RANGES[i]) + 1)
        else:  # This is a boolean variable
            shapes.append(2)  # This accounts for the two states (0 and 1)
    shapes.append(ENV.action_space.n)  # Add size of action space at the end
    return tuple(shapes)


## Running the episodes

As the title and the function name state, this part is responsible for running the actual episodes (the training process).

We start by initializing some helper lists for logging, which hold the information on the max/min rewards. From there we continue to the main for loop that runs all the episodes. I used tqdm here, which helps me visualise the progress so far and the approximate time left for the run. You can find the documentation here.

Inside the loop, I first initialize an empty list for the frames, which are captured during the last 3 episodes (to later create a GIF from them and nicely visualise the results).

After that we record the start time of the episode (for additional metrics) and reset the episode reward and the number of steps made in this episode.

### The calculations

We first reset the environment to its initial state with `observation, discrete_state = reset_environment()` and enter the main while loop, which ends only once the episode is terminated or truncated (the spaceship crashed, landed successfully, or flew for too long). This is important, as we don't want to run indefinitely or stop before reaching the end goal.

Next we run the action-selection function that matches the strategy provided by the W&B configuration.
```python
if STRATEGY == 'ucb':
    action = choose_action_UCB(discrete_state)
else:
    action = choose_action_greedy(discrete_state)
```

From there we perform a step and receive all the necessary information from it: `observation, reward, terminated, truncated, info`. Right after that we discretize the returned observation to lower the dimensionality. This is another subject, which will be explored in upcoming notes.
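To show what the discretization step does to a single continuous value, here is a minimal sketch using `np.digitize` with the same kind of `np.linspace` edges as in `RANGES`. The five edges and the sample values are assumptions chosen for illustration.

```python
import numpy as np

edges = np.linspace(-5, 5, 5)  # bin edges: [-5.0, -2.5, 0.0, 2.5, 5.0]

samples = (-7.0, -5.0, 0.3, 6.0)
bins = [int(np.digitize(value, edges)) for value in samples]
print(bins)  # → [0, 1, 3, 5]

# Values below the first edge map to 0 and values at or above the last edge
# map to len(edges), so there are len(edges) + 1 possible bin indices.
print(len(edges) + 1)  # → 6
```

This is why `get_q_table_shape` reserves `len(RANGES[i]) + 1` entries for each continuous dimension.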

Finally, if the step succeeded, we update the Q-table with the new values using the following formula:<image src="https://wikimedia.org/api/rest_v1/media/math/render/svg/d247db9eaad4bd343e7882ec546bf3847ebd36d8"></image><ul>
<li>Q(s,a) is the current estimate of the Q-value for the state-action pair (s, a)</li><li>α is the learning rate</li><li>r is the immediate reward obtained after taking action a in state s</li><li>γ is the discount factor</li><li>max Q(s',a') is the maximum Q-value over all actions a' in the next state s'</li></ul>
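In case the image does not load, the same update can be written out with the symbols from the list above. This is the standard Q-Learning rule in the weighted-average form that `update_q_table` computes:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left(r + \gamma \max_{a'} Q(s',a')\right)$$

For a terminal transition there is no next state, so the term $\max_{a'} Q(s',a')$ is taken to be 0.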

Another important part of the code is the following section:
```python
if (END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING) and STRATEGY == 'epsilon_greedy':
    EPSILON -= EPSILON_DECAY_VALUE
```
Without this section there would be no epsilon decay, which lowers the epsilon value over the episodes (resulting in less exploration and more stable optimization towards the end).
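The shape of this linear schedule can be sketched on a toy run. The function name `linear_epsilon_decay` and the small numbers are assumptions made for illustration, and the `max(0.0, ...)` clamp is an addition of this sketch that the notebook code does not have.

```python
def linear_epsilon_decay(epsilon, episodes, start, end):
    """Return the epsilon used in each episode under a linear decay schedule."""
    decay_value = epsilon / (end - start)
    history = []
    for episode in range(episodes):
        history.append(epsilon)
        if start <= episode <= end:
            # Clamp at 0 so epsilon never becomes negative (assumption of this sketch)
            epsilon = max(0.0, epsilon - decay_value)
    return history


history = linear_epsilon_decay(epsilon=1.0, episodes=10, start=1, end=5)
print(history)  # → [1.0, 1.0, 0.75, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Once epsilon reaches 0 the agent acts purely greedily, which is what stabilises the rewards at the end of training.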

Finally, we perform some additional GIF saving for nice visualisations.

In [7]:
def run_episodes(run):
    global EPSILON
    # Additional data lists
    reward_list = []
    max_reward_list = []
    min_reward_list = []
    print("Starting to run episodes")

    for episode in tqdm(range(EPISODES), desc="Training", unit="episode"):
        frames = [] # for saving the simulations
        is_saving_video = (episode + 3 >= EPISODES) # How many runs to log
        start_time = time.time()
        episode_reward = 0  # initialize the reward for this episode
        episode_steps = 0  # initialize the number of steps for this episode

        observation, discrete_state = reset_environment()

        terminated = False
        truncated = False
        reward = 0
        while not terminated and not truncated:
            if is_saving_video:
                frame = ENV.render()
                frames.append(frame)

            if STRATEGY == 'ucb':
                action = choose_action_UCB(discrete_state)
            else:
                action = choose_action_greedy(discrete_state)

            # ENV.step returns (observation, reward, terminated, truncated, info)
            observation, reward, terminated, truncated, info = ENV.step(action)
            new_discrete_state = discretize_state(observation)

            if not terminated and not truncated:
                update_q_table(discrete_state, action, reward, new_discrete_state)
            else:
                update_q_table(discrete_state, action, reward, 0) # max future q is 0 for terminal state 
                

            discrete_state = new_discrete_state
            episode_reward += reward
            episode_steps += 1

        end_time = time.time()
        duration = end_time - start_time

        log_metrics(run, reward_list, max_reward_list, min_reward_list, episode_reward, episode_steps, duration)

        if (END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING) and STRATEGY == 'epsilon_greedy':
            EPSILON -= EPSILON_DECAY_VALUE
        
        if is_saving_video:
            # Save the frames as a GIF file
            imageio.mimsave(f'run_{episode}.gif', frames, format='GIF', duration=(1000 * 1/30))
            frames.clear()  # Clear the frame list
            run.log({"simulations": wandb.Video(f'run_{episode}.gif',format="gif")})

## Training function

In this function we perform all of the "janitorial" work: setting up the global variables (for some reason Jupyter notebooks don't work well without them, and passing the values through function arguments is too messy) and unpacking the configuration we received from wandb, the Weights & Biases library that gives us access to the configuration of the current sweep run.

In [8]:
def train():
    global LEARNING_RATE
    global DISCOUNT
    global EPSILON
    global START_EPSILON_DECAYING
    global END_EPSILON_DECAYING
    global EPSILON_DECAY_VALUE
    global UCB_C
    global STRATEGY
    global RANGES
    global Q_TABLE
    global CONFIG
    global ENV

    # Initialize a new wandb run
    run = wandb.init(config=wandb.config)

    # Config is a variable that holds and saves hyperparameters and inputs
    CONFIG = wandb.config

    LEARNING_RATE = CONFIG.learning_rate
    DISCOUNT = CONFIG.discount
    EPSILON = CONFIG.epsilon if CONFIG.exploration_strategy == 'epsilon_greedy' else None
    START_EPSILON_DECAYING = 1
    END_EPSILON_DECAYING = EPISODES // CONFIG.end_epsilon_decay
    EPSILON_DECAY_VALUE = EPSILON / (
                END_EPSILON_DECAYING - START_EPSILON_DECAYING) if CONFIG.exploration_strategy == 'epsilon_greedy' else None
    UCB_C = CONFIG.ucb_c if CONFIG.exploration_strategy == 'ucb' else None
    STRATEGY = CONFIG.exploration_strategy

    ENV = gym.make('LunarLander-v2', render_mode='rgb_array')

    num_states_continuous = CONFIG.num_states  # The number of bins for continuous variables
    num_states_boolean = 2  # The number of bins for boolean variables

    # Define the range of each dimension
    RANGES = [
        np.linspace(-90, 90, num_states_continuous),  # X coordinate
        np.linspace(-90, 90, num_states_continuous),  # Y coordinate
        np.linspace(-5, 5, num_states_continuous),  # X velocity
        np.linspace(-5, 5, num_states_continuous),  # Y velocity
        np.linspace(-np.pi, np.pi, num_states_continuous),  # Angle
        np.linspace(-5, 5, num_states_continuous),  # Angular velocity
        np.linspace(0, 1, num_states_boolean),  # Left leg contact
        np.linspace(0, 1, num_states_boolean)  # Right leg contact
    ]
    print(RANGES)
    print(CONFIG)
    # Initialize Q-table with the defined initial Q value
    
    if CONFIG.init_q_state == 'static':
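        # Force a float dtype so that later Q-value updates are not truncated to integers
        Q_TABLE = np.full(shape=get_q_table_shape(), fill_value=CONFIG.init_q_value, dtype=np.float64)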
    else:
        Q_TABLE = np.random.uniform(low=0, high=CONFIG.init_q_value, size=get_q_table_shape())

    run_episodes(run)

    # Save the Q-table as an Artifact
    artifact = wandb.Artifact('q_table', type='model')
    np.save('q_table.npy', Q_TABLE)
    artifact.add_file('q_table.npy')
    run.log_artifact(artifact)

    ENV.close()
    run.finish()  # End the Run

## Sweep initialization

This part simply runs the sweep with the configuration defined in the "Sweeps configurations" section above.

In [9]:
sweep_id = wandb.sweep(sweep_config, project="lunar-lander-v2")
wandb.agent(sweep_id, train, count=20)  # cap this agent at 20 runs

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
Create sweep with ID: 1itf7z14
Sweep URL: https://wandb.ai/got-tree/lunar-lander-v2/sweeps/1itf7z14
wandb: Agent Starting Run: 08m5q6nf with config:
wandb: 	discount: 0.8980840047672926
wandb: 	end_epsilon_decay: 0.25
wandb: 	epsilon: 0.26847381425975003
wandb: 	exploration_strategy: ucb
wandb: 	init_q_state: random
wandb: 	init_q_value: 93
wandb: 	learning_rate:

[array([-90., -70., -50., -30., -10.,  10.,  30.,  50.,  70.,  90.]), array([-90., -70., -50., -30., -10.,  10.,  30.,  50.,  70.,  90.]), array([-5.        , -3.88888889, -2.77777778, -1.66666667, -0.55555556,
        0.55555556,  1.66666667,  2.77777778,  3.88888889,  5.        ]), array([-5.        , -3.88888889, -2.77777778, -1.66666667, -0.55555556,
        0.55555556,  1.66666667,  2.77777778,  3.88888889,  5.        ]), array([-3.14159265, -2.44346095, -1.74532925, -1.04719755, -0.34906585,
        0.34906585,  1.04719755,  1.74532925,  2.44346095,  3.14159265]), array([-5.        , -3.88888889, -2.77777778, -1.66666667, -0.55555556,
        0.55555556,  1.66666667,  2.77777778,  3.88888889,  5.        ]), array([0., 1.]), array([0., 1.])]
{'discount': 0.8980840047672926, 'end_epsilon_decay': 0.25, 'epsilon': 0.26847381425975003, 'exploration_strategy': 'ucb', 'init_q_state': 'random', 'init_q_value': 93, 'learning_rate': 0.08778374101568641, 'num_states': 10, 'ucb_c': 1.9155583

metric,history
avg_reward,█▂▇▂▃▂▃▂▂▂▁▇▆▄▆▆▆▅▆▆▆▂▂▁▁▂▂▂▂▁▂▂▂▁▁▂▂▁▁▁
eps,▃▃▂▂▄▇▂▆▃▃▃▂▅█▇▅▂▃▇▅▃▂▂▂▄▁▂▃▂▅▂▁▁▃▁▂▂▂▂▄
max_reward,▁▁██████████████████████████████████████
min_reward,█▅▅▅▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▁▁▁▁▁▁▁▁▁▁▁
reward,█▇█▅▇█▄▅▅▅▆█▅▆▅██▇▅█▆▇▅▁▆▁▅▇▃▆▄▅▄▅▃▆▅▅▅▅
steps,▃▂▃▄▁▂▃▂▂▂▂▃▂▁▂▃▃▂▂▁▂▂▃▆▁█▂▂▅▁▃▃▃▃▅▂▂▃▃▂

metric,summary
avg_reward,-734.87708
eps,0.94618
max_reward,269.84585
min_reward,-3357.77714
reward,-627.10189
steps,80.0


wandb: Agent Starting Run: olp90i7h with config:
wandb: 	discount: 0.9735989520808894
wandb: 	end_epsilon_decay: 2
wandb: 	epsilon: 0.3076643475463193
wandb: 	exploration_strategy: epsilon_greedy
wandb: 	init_q_state: random
wandb: 	init_q_value: 194
wandb: 	learning_rate: 0.08271768521347939
wandb: 	num_states: 6
wandb: 	ucb_c: 0.1860104764541042


[array([-90., -54., -18.,  18.,  54.,  90.]), array([-90., -54., -18.,  18.,  54.,  90.]), array([-5., -3., -1.,  1.,  3.,  5.]), array([-5., -3., -1.,  1.,  3.,  5.]), array([-3.14159265, -1.88495559, -0.62831853,  0.62831853,  1.88495559,
        3.14159265]), array([-5., -3., -1.,  1.,  3.,  5.]), array([0., 1.]), array([0., 1.])]
{'discount': 0.9735989520808894, 'end_epsilon_decay': 2, 'epsilon': 0.3076643475463193, 'exploration_strategy': 'epsilon_greedy', 'init_q_state': 'random', 'init_q_value': 194, 'learning_rate': 0.08271768521347939, 'num_states': 6, 'ucb_c': 0.1860104764541042}
Starting to run episodes
Training: 100%|██████████| 50000/50000 [1:07:34<00:00, 12.33episode/s]


metric,history
avg_reward,▄▁▁▁▂▂▃▁▂▃▃▆▄▂▅▅▅▇█▇▇▇████▇▇██▇▇████▇▇▇█
eps,▂▂▂▂▂▅▁▃▂▂▂▁▄▂▃▂▂▂▃▂▂▅▃▅▃▃▆▃▄▄▄▃▅█▄▄▃▄▃▅
max_reward,▁▁▁▁▁▁▁▁▆███████████████████████████████
min_reward,█▆▆▆▄▄▄▄▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
reward,▁▅▅▂▃▄█▇▅▄▁▆▆▃▃▅▂▄▅▅▆▄▅▅█▅▅▅▅▄▄▅▆▅▄▅▅▅▄▅
steps,▃▅▅▆▅▁█▄▃▁█▄▁▅▂▃▅▃▄▂▆▁▃▃▅▂▃▁▂▂▃▃▁▂▁▃▁▂▁▂

metric,summary
avg_reward,-118.51238
eps,0.54105
max_reward,302.63775
min_reward,-950.45124
reward,20.7183
steps,93.0


wandb: Agent Starting Run: ewfj65rx with config:
wandb: 	discount: 0.896239956837221
wandb: 	end_epsilon_decay: 2
wandb: 	epsilon: 0.4704609407197182
wandb: 	exploration_strategy: ucb
wandb: 	init_q_state: random
wandb: 	init_q_value: 19
wandb: 	learning_rate: 0.09511197386588024
wandb: 	num_states: 5
wandb: 	ucb_c: 0.971857377395093


[array([-90., -45.,   0.,  45.,  90.]), array([-90., -45.,   0.,  45.,  90.]), array([-5. , -2.5,  0. ,  2.5,  5. ]), array([-5. , -2.5,  0. ,  2.5,  5. ]), array([-3.14159265, -1.57079633,  0.        ,  1.57079633,  3.14159265]), array([-5. , -2.5,  0. ,  2.5,  5. ]), array([0., 1.]), array([0., 1.])]
{'discount': 0.896239956837221, 'end_epsilon_decay': 2, 'epsilon': 0.4704609407197182, 'exploration_strategy': 'ucb', 'init_q_state': 'random', 'init_q_value': 19, 'learning_rate': 0.09511197386588024, 'num_states': 5, 'ucb_c': 0.971857377395093}
Starting to run episodes
Training: 100%|██████████| 50000/50000 [1:11:19<00:00, 11.68episode/s]


metric,history
avg_reward,▇▇█▆█▃█▇▇▆▇▅▂▁▅▄▇▃▇█▆▆▅▆▅▆▅▆▇▇▇▆▆▇▆▇▆▅▅▆
eps,▂▂▂██▃▂▂▁▂▂▂▃▂▁▂▁▂▄▂▁▁▂▃▁▃▃▃▂▁▂▁▇▃▂▃▁▃▂▁
max_reward,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
min_reward,██▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
reward,▇▆▅▇█▆▆▆▅▅▆▁▆▆▅▇▇▄▄▃▂▆▅▆▅▆▄▇▆▅▆▅▅▇▅▆▆▆▆▇
steps,▁▂▂▁▁▂▁▁▂▂▂█▁▂▂▁▂▂▂▃▇▂▂▃▃▂▂▁▂▃▃▂▂▁▃▂▄▁▂▁

metric,summary
avg_reward,-582.83334
eps,3.40944
max_reward,-6.2026
min_reward,-1687.01615
reward,-445.50621
steps,51.0


wandb: Agent Starting Run: x0jnsxyd with config:
wandb: 	discount: 0.8653799683198891
wandb: 	end_epsilon_decay: 1
wandb: 	epsilon: 0.1375435431723935
wandb: 	exploration_strategy: epsilon_greedy
wandb: 	init_q_state: random
wandb: 	init_q_value: 24
wandb: 	learning_rate: 0.03992200319270856
wandb: 	num_states: 8
wandb: 	ucb_c: 1.0755231557921436


[array([-90.        , -64.28571429, -38.57142857, -12.85714286,
        12.85714286,  38.57142857,  64.28571429,  90.        ]), array([-90.        , -64.28571429, -38.57142857, -12.85714286,
        12.85714286,  38.57142857,  64.28571429,  90.        ]), array([-5.        , -3.57142857, -2.14285714, -0.71428571,  0.71428571,
        2.14285714,  3.57142857,  5.        ]), array([-5.        , -3.57142857, -2.14285714, -0.71428571,  0.71428571,
        2.14285714,  3.57142857,  5.        ]), array([-3.14159265, -2.24399475, -1.34639685, -0.44879895,  0.44879895,
        1.34639685,  2.24399475,  3.14159265]), array([-5.        , -3.57142857, -2.14285714, -0.71428571,  0.71428571,
        2.14285714,  3.57142857,  5.        ]), array([0., 1.]), array([0., 1.])]
{'discount': 0.8653799683198891, 'end_epsilon_decay': 1, 'epsilon': 0.1375435431723935, 'exploration_strategy': 'epsilon_greedy', 'init_q_state': 'random', 'init_q_value': 24, 'learning_rate': 0.03992200319270856, 'num_states': 8

# Final results

The final results for this run can be seen in this W&B interactive report, where I have provided a quick overview of the results from this experiment. I hope you like it and learn something new from it 🙂
