# Q-Learning on LunarLander-v2 Environment

In this project, we are training a Q-Learning agent to solve the `LunarLander-v2` environment from OpenAI's Gym.

The goal in the LunarLander-v2 environment is to land a spacecraft on the moon, a classic rocket trajectory optimization problem. The action space is discrete, with four actions: do nothing, fire the left orientation engine, fire the main engine, or fire the right orientation engine. Each engine is either fully on or off.

## Training Process

For the LunarLander problem, I took the same approach as for the MountainCar environment and implemented a Q-Learning algorithm. This time, however, I extended the exploration strategy with an Upper Confidence Bound (UCB) policy. My objective was to compare the performance and characteristics of the UCB policy against the ε-greedy policy, a commonly used method in Q-Learning. I also wanted to evaluate the effect of combining an optimistic initialization strategy with the ε-greedy policy.

The UCB policy is designed to balance exploration and exploitation in reinforcement learning. It leverages uncertainty and variance in the reward estimates to guide the exploration process. This differs from the ε-greedy policy, which explores the action space purely randomly with a certain probability ε.

On the other hand, optimistic initialization is a simple yet powerful method to encourage exploration in the early stages of training. By initializing the Q-values optimistically (i.e., with higher than achievable values), the agent is incentivized to explore all actions to learn their actual rewards.
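
To see why optimistic starting values drive exploration, here is a minimal sketch on a toy three-armed bandit with deterministic rewards. The bandit, the helper `greedy_bandit`, and all numbers are illustrative assumptions and not part of the LunarLander code. A purely greedy agent that starts at zero locks onto the first arm it tries, while one that starts above every achievable reward is "disappointed" by each arm in turn and ends up sampling all of them.

```python
import numpy as np

def greedy_bandit(true_rewards, init_q, steps=50, lr=0.5):
    """Run a purely greedy agent (no random exploration) on a deterministic bandit."""
    q = np.full(len(true_rewards), float(init_q))
    pulls = np.zeros(len(true_rewards), dtype=int)
    for _ in range(steps):
        a = int(np.argmax(q))                 # always exploit the current estimate
        q[a] += lr * (true_rewards[a] - q[a])  # move the estimate toward the observed reward
        pulls[a] += 1
    return q, pulls

true_rewards = np.array([1.0, 2.0, 3.0])  # hypothetical arm payoffs

_, pulls_zero = greedy_bandit(true_rewards, init_q=0.0)   # neutral start
_, pulls_opt = greedy_bandit(true_rewards, init_q=10.0)   # optimistic start

print("pulls with init 0: ", pulls_zero)
print("pulls with init 10:", pulls_opt)
```

With the neutral start, only the first arm is ever pulled; with the optimistic start, every arm is tried and the best arm ends up with the most pulls.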

## Hyperparameter Tuning

We use Weights & Biases Sweeps for hyperparameter tuning. The sweep varies the learning rate, discount factor, number of discretized states per observation dimension, initial ε, exploration strategy, UCB constant, and initial Q-value. The agent's performance is measured by the average reward over the last 100 episodes.
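
The number of discretized states sets how finely each continuous observation is binned. The sketch below shows the idea for a single observation dimension. The helper `to_bin` and the example range of -5 to 5 are assumptions made for illustration, and values outside the range are clipped into the first or last bin.

```python
import numpy as np

def to_bin(value, low, high, num_bins):
    """Map a continuous value to a bin index in [0, num_bins - 1]."""
    edges = np.linspace(low, high, num_bins + 1)  # n bins need n + 1 edges
    index = np.digitize(value, edges) - 1          # digitize counts from 1 inside the range
    return int(np.clip(index, 0, num_bins - 1))    # clip out-of-range values to the edge bins

coarse = to_bin(0.3, -5.0, 5.0, num_bins=10)  # bin width 1.0
fine = to_bin(0.3, -5.0, 5.0, num_bins=40)    # bin width 0.25
print(coarse, fine)
```

More bins give the agent a sharper view of the state, but the Q-table grows with the product of the bin counts across all dimensions, so the table size rises very quickly.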

# Code and Running

Imports and Installs:

In [1]:
!pip install wandb
!pip install gym
!pip install pygame
!pip install box2d-py


In [2]:
import math
import time  # used to time each episode in run_episodes
import gym
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
import wandb

## Configure Sweeps & Login to Weights and Biases

In [3]:
sweep_config = {
    'method': 'random',
    'metric': {
        'name': 'avg_reward',
        'goal': 'maximize'   
    },
    'parameters': {
        'learning_rate': {
            'values': [0.1, 0.01, 0.001]
        },
        'discount': {
            'values': [0.9, 0.95, 0.99]
        },
        'epsilon': {
            'values': [0.5, 0.8, 0.9]
        },
        'num_states': {
            'values': [10, 20, 30, 40]
        },
        'exploration_strategy': {
            'values': ['epsilon_greedy', 'ucb']
        },
        'ucb_c': {
            'values': [0.5, 1, 2]
        },
        'init_q_value': {
            'values': [0, 1, 10]
        }
    },
    'count': 20  # limit sweep to 20 runs
}
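
As a rough sense of scale, the sketch below counts how many distinct configurations this search space contains. The `parameters` dictionary repeats the values from the sweep config, and the 20-run budget is taken from the comment there. The "effective" count assumes that `ucb_c` has no influence on ε-greedy runs and `epsilon` has none on UCB runs.

```python
from math import prod

# Candidate values per hyperparameter, repeated from the sweep config
parameters = {
    'learning_rate': [0.1, 0.01, 0.001],
    'discount': [0.9, 0.95, 0.99],
    'epsilon': [0.5, 0.8, 0.9],
    'num_states': [10, 20, 30, 40],
    'exploration_strategy': ['epsilon_greedy', 'ucb'],
    'ucb_c': [0.5, 1, 2],
    'init_q_value': [0, 1, 10],
}

# Full Cartesian grid over every parameter
grid_size = prod(len(values) for values in parameters.values())

# Configurations that actually differ in behaviour: each strategy ignores the
# other strategy's parameter (assumption stated in the text above)
shared = prod(len(parameters[k]) for k in ('learning_rate', 'discount', 'num_states', 'init_q_value'))
effective = shared * len(parameters['epsilon']) + shared * len(parameters['ucb_c'])

runs = 20
print(grid_size, effective, f"{runs / effective:.1%} of effective configurations sampled")
```

A random search over 20 runs therefore samples only a small fraction of the space, so the results are best read as a coarse survey rather than an exhaustive comparison.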

Number of episodes to iterate

In [4]:
EPISODES = 25000

## Helper Functions

`choose_action_greedy` - This function selects an action for a given state based on the Q-table values. It implements the ε-greedy exploration strategy, which trades off exploration against exploitation. If a random number is greater than ε (epsilon), it chooses the action with the highest Q-value for the given state (exploitation). Otherwise, it selects a random action (exploration).

`choose_action_UCB` - This function also selects an action for a given state, but uses the Upper Confidence Bound (UCB) algorithm. This algorithm balances exploitation and exploration by choosing the action with the highest sum of Q-value and exploration bonus. The bonus is larger for actions that have been chosen less often, which leads to a more balanced exploration of the action space.

UCB is based on the principle of optimism in the face of uncertainty, meaning it tends to prefer actions that have not been tried often. The core idea of UCB is to choose the action that has the maximum upper confidence bound, which is the sum of the current estimated value of the action and an uncertainty term. The uncertainty term is larger for actions with fewer trials, which makes less-tried actions more attractive. The UCB action selection formula is as follows:

$$A_t = \underset{a}{\arg\max}\left[\, Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$

where $t$ is the total number of actions taken so far, $N_t(a)$ is the number of times action $a$ has been taken, and the number $c > 0$ controls the degree of exploration.
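
The selection rule can be tried out on a small hand-made example. This is a minimal sketch that assumes explicit visit counts are tracked per action; the helper `ucb_action` and the numbers are illustrative only. Untried actions are selected first, since their bonus would otherwise involve a division by zero.

```python
import numpy as np

def ucb_action(q_values, counts, t, c):
    """Pick the action with the largest upper confidence bound."""
    q_values = np.asarray(q_values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    untried = counts == 0
    if untried.any():
        return int(np.argmax(untried))        # first action that was never taken
    bonus = c * np.sqrt(np.log(t) / counts)   # uncertainty term: shrinks as N_t(a) grows
    return int(np.argmax(q_values + bonus))

# Action 0 looks slightly better but has been tried 50 times; action 1 only twice
q_values = [1.0, 0.8]
counts = [50, 2]

explore = ucb_action(q_values, counts, t=52, c=1.0)  # bonus favours the rarely tried action
exploit = ucb_action(q_values, counts, t=52, c=0.0)  # with c = 0 the rule is purely greedy
print(explore, exploit)
```

With `c = 1.0` the rarely tried action wins despite its lower estimate, while `c = 0.0` reduces the rule to plain greedy selection.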




`update_q_table` - This function updates the Q-table using the Q-Learning update rule. It calculates the new Q-value for the state-action pair as a weighted average of the old value and the learned value, where the learned value is the sum of the reward and the discounted estimate of the optimal future Q-value.
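
A single update can be worked through by hand. The sketch below uses made-up numbers and a standalone helper `q_update` (an assumption for illustration, separate from the notebook's function) to show that the weighted-average form and the more common "old value plus learning rate times error" form give the same result.

```python
def q_update(current_q, reward, max_future_q, learning_rate, discount):
    """One Q-Learning update, written as a weighted average of old and learned value."""
    learned_value = reward + discount * max_future_q
    return (1 - learning_rate) * current_q + learning_rate * learned_value

# Hypothetical transition: small negative reward, promising next state
new_q = q_update(current_q=2.0, reward=-1.0, max_future_q=5.0,
                 learning_rate=0.1, discount=0.99)

# Equivalent incremental form: Q + lr * (target - Q)
target = -1.0 + 0.99 * 5.0
incremental = 2.0 + 0.1 * (target - 2.0)

print(new_q, incremental)  # both are approximately 2.195
```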

`log_metrics` - This function logs metrics after each episode: the episode reward and step count, the maximum and minimum rewards obtained so far, the average reward over the last 100 episodes, and the throughput in episodes per second. If the ε-greedy strategy is used, it also logs the current value of ε.

In [5]:
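def discretize_state(state, ranges):
    # Map each observation to a bin index in [0, len(edges) - 2], so that the
    # indices match the Q-table shape returned by get_q_table_shape
    discrete_state = np.zeros(len(state), dtype=int)
    for i in range(len(state)):
        bin_index = np.digitize(state[i], bins=ranges[i]) - 1
        # Values outside the covered range fall into the first or last bin
        discrete_state[i] = np.clip(bin_index, 0, len(ranges[i]) - 2)
    return tuple(discrete_state)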

def reset_environment(env, ranges):
    observation = env.reset()
    discrete_state = discretize_state(observation, ranges)
    return observation, discrete_state

def choose_action_greedy(discrete_state, q_table, epsilon, env):
    if np.random.random() > epsilon:
        action = np.argmax(q_table[discrete_state])
    else:
        action = np.random.randint(0, env.action_space.n)
    return action

def choose_action_UCB(discrete_state, q_table, visit_counts, c):
    counts = visit_counts[discrete_state]  # N_t(a) for every action in this state
    # Try each action at least once before applying the UCB bonus
    if np.any(counts == 0):
        return int(np.argmin(counts))
    total = np.sum(counts)  # t: number of actions taken in this state so far
    ucb_values = q_table[discrete_state] + c * np.sqrt(np.log(total) / counts)
    action = np.argmax(ucb_values)
    return action

def update_q_table(q_table, discrete_state, action, reward, new_discrete_state, LEARNING_RATE, DISCOUNT):
    max_future_q = np.max(q_table[new_discrete_state])  # estimate of optimal future value
    current_q = q_table[discrete_state + (action,)]  # current Q-value
    new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
    q_table[discrete_state + (action,)] = new_q # update Q-table with new Q-value

def log_metrics(run, reward_list, max_reward_list, min_reward_list, episode_reward, episode_steps, duration, epsilon):
    reward_list.append(episode_reward)
    max_reward_list.append(max(episode_reward, max_reward_list[-1]) if max_reward_list else episode_reward)
    min_reward_list.append(min(episode_reward, min_reward_list[-1]) if min_reward_list else episode_reward)
    avg_reward = np.mean(reward_list[-100:])  # average over last 100 episodes
    # 'eps' is the throughput in episodes per second
    metrics = {'eps': 1/duration, 'reward': episode_reward, 'steps': episode_steps,
               'avg_reward': avg_reward, 'max_reward': max_reward_list[-1],
               'min_reward': min_reward_list[-1]}
    if epsilon is not None:  # epsilon is only set for the epsilon-greedy strategy
        metrics['epsilon'] = epsilon
    run.log(metrics)

In [6]:
def run_episodes(run, env, q_table, ranges, epsilon, LEARNING_RATE, epsilon_decay_value, DISCOUNT, END_EPSILON_DECAYING, START_EPSILON_DECAYING, UCB_C, exploration_strategy):
    # Additional data lists
    reward_list = []
    max_reward_list = []
    min_reward_list = []
    # Per state-action visit counts, needed only by the UCB strategy
    visit_counts = np.zeros(q_table.shape, dtype=np.int64) if exploration_strategy == 'ucb' else None

    for episode in tqdm(range(EPISODES), desc="Training", unit="episode"):
        start_time = time.time()
        episode_reward = 0  # initialize the reward for this episode
        episode_steps = 0  # initialize the number of steps for this episode

        observation, discrete_state = reset_environment(env, ranges)

        done = False
        while not done:
            if exploration_strategy == 'ucb':
                action = choose_action_UCB(discrete_state, q_table, visit_counts, UCB_C)
                visit_counts[discrete_state + (action,)] += 1
            else:
                action = choose_action_greedy(discrete_state, q_table, epsilon, env)
            observation, reward, done, info = env.step(action)
            new_discrete_state = discretize_state(observation, ranges)

            if not done:
                update_q_table(q_table, discrete_state, action, reward, new_discrete_state, LEARNING_RATE, DISCOUNT)
            else:
                # Terminal step: there is no next state to bootstrap from,
                # so the learned value is just the final reward
                current_q = q_table[discrete_state + (action,)]
                q_table[discrete_state + (action,)] = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * reward

            discrete_state = new_discrete_state
            episode_reward += reward
            episode_steps += 1

        end_time = time.time()
        duration = end_time - start_time

        log_metrics(run, reward_list, max_reward_list, min_reward_list, episode_reward, episode_steps, duration, epsilon)

        # Decay epsilon only for epsilon-greedy runs, and never below zero
        if epsilon is not None and END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
            epsilon = max(0.0, epsilon - epsilon_decay_value)
In [7]:
def get_q_table_shape(ranges, env):
    # Return the shape of the Q-table
    return [len(r) - 1 for r in ranges] + [env.action_space.n]
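
Because the table has one axis per observation dimension, its size grows very quickly with the number of bins. The sketch below estimates the entry count and memory footprint for the bin counts in the sweep. It assumes six continuous dimensions with exactly the stated number of bins, two boolean dimensions with two bins each, four actions, and 8-byte floats; the helper `q_table_entries` is hypothetical.

```python
def q_table_entries(num_bins_continuous, num_bins_boolean=2, num_actions=4):
    """Number of Q-values for 6 continuous and 2 boolean observation dimensions."""
    return num_bins_continuous ** 6 * num_bins_boolean ** 2 * num_actions

for bins in (10, 20, 30, 40):
    entries = q_table_entries(bins)
    gib = entries * 8 / 2 ** 30  # 8 bytes per float64 entry
    print(f"{bins} bins: {entries:,} entries, about {gib:.2f} GiB")
```

Under these assumptions the larger bin counts need far more memory than a typical notebook environment provides, and most entries would rarely be visited, so it is worth checking the table shape before starting a long sweep.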

In [8]:
def train():
    # Initialize a new wandb run
    run = wandb.init(config=wandb.config, mode="dryrun")

    # Config is a variable that holds and saves hyperparameters and inputs
    config = wandb.config

    LEARNING_RATE = config.learning_rate
    DISCOUNT = config.discount
    epsilon = config.epsilon if config.exploration_strategy == 'epsilon_greedy' else None
    START_EPSILON_DECAYING = 1
    END_EPSILON_DECAYING = EPISODES // 2
    epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING) if config.exploration_strategy == 'epsilon_greedy' else None
    # Key and value must match the sweep config ('ucb_c' and 'ucb')
    UCB_C = config.ucb_c if config.exploration_strategy == 'ucb' else None
    strategy = config.exploration_strategy

    env = gym.make('LunarLander-v2')

    num_states_continuous = config.num_states  # The number of bins for continuous variables
    num_states_boolean = 2  # The number of bins for boolean variables

    # Define the bin edges of each dimension (n bins need n + 1 edges).
    # The X and Y coordinates are normalized by the environment; the range of
    # -1.5 to 1.5 is an assumption based on the bounds of recent Gym releases.
    ranges = [
        np.linspace(-1.5, 1.5, num_states_continuous + 1),  # X coordinate
        np.linspace(-1.5, 1.5, num_states_continuous + 1),  # Y coordinate
        np.linspace(-5, 5, num_states_continuous + 1),  # X velocity
        np.linspace(-5, 5, num_states_continuous + 1),  # Y velocity
        np.linspace(-np.pi, np.pi, num_states_continuous + 1),  # Angle
        np.linspace(-5, 5, num_states_continuous + 1),  # Angular velocity
        np.linspace(0, 1, num_states_boolean + 1),  # Left leg contact
        np.linspace(0, 1, num_states_boolean + 1)  # Right leg contact
    ]

    # Initialize Q-table with the defined initial Q value
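    # Use a float dtype: an integer fill value would otherwise create an integer table and truncate every update
    q_table = np.full(shape=get_q_table_shape(ranges, env), fill_value=config.init_q_value, dtype=float)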
    run_episodes(run, env, q_table, ranges, epsilon, LEARNING_RATE, epsilon_decay_value, DISCOUNT, END_EPSILON_DECAYING, START_EPSILON_DECAYING, UCB_C, strategy)

    # Save the Q-table as an Artifact
    artifact = wandb.Artifact('q_table', type='model')
    np.save('q_table.npy', q_table)
    artifact.add_file('q_table.npy')
    run.log_artifact(artifact)

    env.close()
    run.finish() # End the Run

In [1]:
sweep_id = wandb.sweep(sweep_config, project="lunar-lander-v2")
# The agent's count argument caps the number of runs it executes
wandb.agent(sweep_id, train, count=20)
