# Q-Learning on MountainCar-v0 Environment

In this project, we are training a Q-Learning agent to solve the `MountainCar-v0` environment from OpenAI's Gym. 

The objective of the MountainCar environment is to get an underpowered car to the top of a hill. The car is on a one-dimensional track, and the position and velocity of the car are observable at each time step.

## Training Process

We implement a Q-Learning algorithm with an ε-greedy policy for action selection. We use a simple table to represent the Q-values of state-action pairs. To handle the continuous state space of the environment, we discretize the states by splitting the position and velocity into bins.
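
The discretization step can be illustrated in isolation. The following is a minimal sketch, not the notebook's own helper: the function name `discretize` and the hard-coded bounds are assumptions made for illustration (the bounds are the documented MountainCar-v0 observation limits), and the clipping step is an extra safeguard for observations that sit exactly on the upper bound.

```python
import numpy as np

# MountainCar-v0 observation bounds as (position, velocity); hard-coded here
# so that the sketch runs without creating a Gym environment
LOW = np.array([-1.2, -0.07])
HIGH = np.array([0.6, 0.07])

def discretize(state, num_bins):
    """Map a continuous (position, velocity) pair to a tuple of bin indices."""
    bins = np.array([num_bins, num_bins])
    bin_size = (HIGH - LOW) / bins
    idx = ((np.asarray(state) - LOW) / bin_size).astype(int)
    # Keep observations that sit on the upper bound inside the last bin
    return tuple(int(i) for i in np.clip(idx, 0, bins - 1))

print(discretize([-0.5, 0.01], num_bins=20))  # (7, 11)
```

With 20 bins per dimension, the Q-table for this sketch would have 20 × 20 states and one entry per action in each state.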

The agent's goal is to maximize the total reward it receives in an episode. The reward at each time step is -1, and an episode ends when the car reaches the goal (position 0.5) or after 200 time steps.

## Hyperparameter Tuning

We use Weights & Biases Sweeps for hyperparameter tuning. The sweep explores different values of the learning rate, the discount factor, the initial ε, and the number of bins per state dimension. The agent's performance is measured by the average episode reward over the last 100 episodes (logged as `avg_reward`), which the sweep tries to maximize.
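
To get a feel for the size of the search space, the sketch below enumerates every combination of the values listed in the sweep configuration of this notebook. The variable names are illustrative only; the values are copied from the configuration cell.

```python
from itertools import product

# Search space copied from the sweep configuration used in this notebook
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "discount": [0.9, 0.95, 0.99],
    "epsilon": [0.5, 0.8, 0.9],
    "num_states": [10, 20, 30, 40],
}

# Every possible combination of the four hyperparameters
combinations = list(product(*search_space.values()))
print(len(combinations))  # 108
```

A random sweep with a budget of 20 runs therefore samples fewer than a fifth of the 108 possible configurations, which is one reason to prefer `random` or `bayes` over `grid` when each run is expensive.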

## Analysis and Visualization

We use Weights & Biases for experiment tracking and visualization. We log the following metrics during training:

- Episode Reward: The total reward obtained in an episode.
- Steps: The number of steps taken in an episode.
- Epsilon: The current value of ε in the ε-greedy policy.
- Average Reward: The average episode reward over the last 100 episodes.
- Max/Min Reward: The maximum/minimum episode reward obtained so far.

The results are visualized on a Weights & Biases dashboard, which shows how the agent's performance evolves over time as it learns from its interactions with the environment.
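
The bookkeeping behind these metrics can be sketched without any logging backend. `RewardTracker` is a hypothetical helper written for this illustration and is not part of the notebook; it assumes a fixed-size window for the running average and returns a dictionary shaped like the one that would be logged.

```python
from collections import deque

class RewardTracker:
    """Hypothetical helper: running average, maximum, and minimum of episode rewards."""

    def __init__(self, window=100):
        self.recent = deque(maxlen=window)  # only the last `window` rewards are kept
        self.max_reward = float("-inf")
        self.min_reward = float("inf")

    def update(self, episode_reward):
        self.recent.append(episode_reward)
        self.max_reward = max(self.max_reward, episode_reward)
        self.min_reward = min(self.min_reward, episode_reward)
        return {
            "reward": episode_reward,
            "avg_reward": sum(self.recent) / len(self.recent),
            "max_reward": self.max_reward,
            "min_reward": self.min_reward,
        }

tracker = RewardTracker(window=3)
for r in [-200, -200, -150, -120]:
    metrics = tracker.update(r)

print(metrics["max_reward"], metrics["min_reward"])  # -120 -200
```

Because the window holds only the most recent rewards, the average responds to improvements in the policy while the maximum and minimum summarize the whole run.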

## Results

After training for 25,000 episodes (the `EPISODES` setting used below), the agent consistently reaches the goal within the 200 time step limit. The hyperparameters found by the sweep lead to more efficient learning compared to a manually chosen baseline.

This project demonstrates the effectiveness of Q-Learning and the importance of hyperparameter tuning for reinforcement learning tasks. The next steps could include experimenting with other RL algorithms or environments.


# Code and running

In [1]:
!pip install --upgrade wandb
!pip install gym==0.26.2


Import necessary libraries

In [2]:
import gym
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
import wandb
import time
import os

## Configure Sweeps & Login to Weights and Biases

In [3]:
wandb.login()
sweep_config = {
    'method': 'random',  # or 'grid' or 'bayes'
    'metric': {
        'name': 'avg_reward',
        'goal': 'maximize'   
    },
    'parameters': {
        'learning_rate': {
            'values': [0.1, 0.01, 0.001]
        },
        'discount': {
            'values': [0.9, 0.95, 0.99]
        },
        'epsilon': {
            'values': [0.5, 0.8, 0.9]
        },
        'num_states': {
            'values': [10, 20, 30, 40]
        },
    },
    'count': 20  # limit sweep to 20 runs
}

sweep_id = wandb.sweep(sweep_config, project="mountain-car-v0")

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: tcivie (got-tree). Use `wandb login --relogin` to force relogin
Create sweep with ID: lf1ym46s
Sweep URL: https://wandb.ai/got-tree/mountain-car-v0/sweeps/lf1ym46s


Number of episodes to iterate

In [4]:
EPISODES = 25000

This function takes in a continuous state and returns a discrete state

In [5]:
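def get_discrete_state(state, env, bin_size):
    # Map a continuous (position, velocity) observation to Q-table bin indices
    low, high = env.observation_space.low, env.observation_space.high
    num_bins = np.round((high - low) / bin_size).astype(int)
    discrete_state = ((state - low) / bin_size).astype(int)
    # Clip so an observation exactly at the upper bound stays inside the table
    return tuple(np.clip(discrete_state, 0, num_bins - 1))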

## Defining the main training loop

The main loop for the Q-learning algorithm is in the section where we loop over each episode. Inside this loop:

- We reset the environment and initialize the reward and steps for this episode to zero.

- We use the epsilon-greedy method to select actions and execute them in the environment.

- The Q-value for the executed action is updated using the Q-learning update rule. We also log episodes per second, the episode reward, the number of steps, the current ε, and the running average, maximum, and minimum reward to wandb.

### Creating helper functions

reset_environment(env, bin_size) - This function resets the environment to its initial state at the beginning of each episode. It also converts the initial state into a discrete format, as our Q-table is based on discrete states and actions. It takes the environment and bin size as input and returns the initial observation and the discrete state.

choose_action(discrete_state, q_table, epsilon, env) - This function implements the ε-greedy policy for action selection. With a probability of ε, it selects a random action, and with a probability of (1-ε), it selects the action with the highest Q-value in the current state. It takes the current discrete state, the Q-table, the epsilon value, and the environment as input and returns the chosen action.

update_q_table(q_table, discrete_state, action, reward, new_discrete_state, LEARNING_RATE, DISCOUNT) - This function updates the Q-value for the current state-action pair based on the Q-Learning update rule. It takes the Q-table, the current discrete state, the chosen action, the reward obtained, the new discrete state, learning rate, and discount factor as input. It doesn't return anything as the Q-table is updated in-place.

This is the Q-Learning formula used in the `update_q_table` function:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\bigl(r + \gamma \max_{a'} Q(s',a')\bigr)$$

- Q(s,a) is the current estimate of the Q-value for the state-action pair (s, a)
- α is the learning rate
- r is the immediate reward obtained after taking action a in state s
- γ is the discount factor
- max Q(s',a') is the maximum Q-value over all actions a' in the next state s'
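
A single update can be worked through by hand. The numbers below are made up for illustration (a tiny 2 × 2 state grid with 3 actions, α = 0.1, γ = 0.95); only the arithmetic follows the update rule described above.

```python
import numpy as np

alpha, gamma = 0.1, 0.95          # illustrative learning rate and discount factor

q_table = np.zeros((2, 2, 3))     # toy table: 2 x 2 discrete states, 3 actions
q_table[0, 0, 1] = -2.0           # current estimate Q(s, a)
q_table[0, 1] = [-3.0, -1.5, -2.0]  # Q-values of the next state s'

state, action, next_state, reward = (0, 0), 1, (0, 1), -1.0

# Target: immediate reward plus the discounted best value of the next state
target = reward + gamma * np.max(q_table[next_state])   # -1 + 0.95 * (-1.5) = -2.425

# Blend the old estimate with the target
q_table[state + (action,)] = (1 - alpha) * q_table[state + (action,)] + alpha * target

print(round(float(q_table[state + (action,)]), 4))  # -2.0425
```

The estimate moves only a tenth of the way from -2.0 toward the target -2.425, which is the role of the learning rate.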

log_metrics(run, reward_list, max_reward_list, min_reward_list, episode_reward, episode_steps, duration, epsilon) - This function logs various metrics of interest during the training process. These metrics include the average reward over the last 100 episodes, the total reward in the current episode, the number of steps in the current episode, the current ε value, and the minimum and maximum rewards obtained so far. It takes the current run, lists to store total, maximum and minimum rewards per episode, reward for the current episode, number of steps in the current episode, duration of the current episode, and the current ε value as input. The metrics are logged to the current Weights & Biases run for visualizing the training progress.

In [6]:
def reset_environment(env, bin_size):
    observation, info = env.reset()
    discrete_state = get_discrete_state(observation, env, bin_size)
    return observation, discrete_state

def choose_action(discrete_state, q_table, epsilon, env):
    if np.random.random() > epsilon:
        action = np.argmax(q_table[discrete_state])
    else:
        action = np.random.randint(0, env.action_space.n)
    return action

def update_q_table(q_table, discrete_state, action, reward, new_discrete_state, LEARNING_RATE, DISCOUNT):
    max_future_q = np.max(q_table[new_discrete_state])  # estimate of optimal future value
    current_q = q_table[discrete_state + (action,)]  # current Q-value
    new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
    q_table[discrete_state + (action,)] = new_q # update Q-table with new Q-value

def log_metrics(run, reward_list, max_reward_list, min_reward_list, episode_reward, episode_steps, duration, epsilon):
    reward_list.append(episode_reward)
    max_reward_list.append(max(episode_reward, max_reward_list[-1]) if max_reward_list else episode_reward)
    min_reward_list.append(min(episode_reward, min_reward_list[-1]) if min_reward_list else episode_reward)
    avg_reward = np.mean(reward_list[-100:])  # average over last 100 episodes
    metrics = {'eps': 1/duration, 'reward': episode_reward, 'steps': episode_steps,
               'epsilon': epsilon, 'avg_reward': avg_reward, 'max_reward': max_reward_list[-1], 
               'min_reward': min_reward_list[-1]}
    run.log(metrics)


### The main loop

run_episodes(run, env, q_table, bin_size, epsilon, LEARNING_RATE, epsilon_decay_value, DISCOUNT, END_EPSILON_DECAYING, START_EPSILON_DECAYING) - This function contains the main training loop. In each episode, it resets the environment, selects actions according to the ε-greedy policy, takes the actions in the environment, updates the Q-table, and logs the training metrics. It takes the current run, the environment, the Q-table, the bin size for discretizing states, the initial epsilon value, the learning rate, the epsilon decay value, the discount factor, and the start and end episodes for epsilon decay as input. The Q-table gets updated in-place during the training process, and the training metrics are logged to the current Weights & Biases run.

In [7]:
def run_episodes(run, env, q_table, bin_size, epsilon, LEARNING_RATE, epsilon_decay_value, DISCOUNT, END_EPSILON_DECAYING, START_EPSILON_DECAYING):
    # Additional data lists
    reward_list = []
    max_reward_list = []
    min_reward_list = []
    
    for episode in tqdm(range(EPISODES), desc="Training", unit="episode"):
        start_time = time.time()
        episode_reward = 0  # initialize the reward for this episode
        episode_steps = 0  # initialize the number of steps for this episode

        observation, discrete_state = reset_environment(env, bin_size)
        
        done = False
        while not done:
            action = choose_action(discrete_state, q_table, epsilon, env)
            observation, reward, terminated, truncated, info = env.step(action)
            new_discrete_state = get_discrete_state(observation, env, bin_size)
            # End the episode at the goal (terminated) or at the 200-step time limit (truncated)
            done = terminated or truncated

            if terminated:
                # Goal reached: no further -1 rewards follow, so set this Q-value to 0
                q_table[discrete_state + (action,)] = 0
            else:
                update_q_table(q_table, discrete_state, action, reward, new_discrete_state, LEARNING_RATE, DISCOUNT)

            discrete_state = new_discrete_state
            episode_reward += reward
            episode_steps += 1

        end_time = time.time()
        duration = end_time - start_time

        log_metrics(run, reward_list, max_reward_list, min_reward_list, episode_reward, episode_steps, duration, epsilon)
        
        if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
            # Clamp at zero so the last decrement cannot make epsilon negative
            epsilon = max(0.0, epsilon - epsilon_decay_value)
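
The linear ε decay used in the loop can be inspected on its own. This is a minimal sketch with a hypothetical function name (`epsilon_schedule`) and small illustrative numbers; it mirrors the decay window and step size computed in this notebook, and it clamps ε at zero so that the final decrement cannot push it below zero.

```python
def epsilon_schedule(start_epsilon, episodes, decay_start, decay_end):
    """Return the epsilon value used in each episode under linear decay."""
    step = start_epsilon / (decay_end - decay_start)
    epsilon = start_epsilon
    values = []
    for episode in range(episodes):
        values.append(epsilon)  # epsilon in effect during this episode
        if decay_start <= episode <= decay_end:
            epsilon = max(0.0, epsilon - step)
    return values

values = epsilon_schedule(0.5, episodes=10, decay_start=1, decay_end=5)
print(values[:6])  # [0.5, 0.5, 0.375, 0.25, 0.125, 0.0]
```

Exploration is therefore front-loaded: the agent acts mostly at random early on and becomes fully greedy once the decay window has passed.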

train() is a wrapper around run_episodes() that lets us use the Weights & Biases Sweeps functionality: each sweep run calls train(), which reads the sampled hyperparameters from the run's config.

In [8]:
def train():

    # Initialize a new wandb run; under a sweep agent, the sampled hyperparameters populate its config
    run = wandb.init()

    # Config holds the hyperparameters chosen by the sweep for this run
    config = wandb.config

    LEARNING_RATE = config.learning_rate
    DISCOUNT = config.discount
    epsilon = config.epsilon
    START_EPSILON_DECAYING = 1
    END_EPSILON_DECAYING = EPISODES // 2
    num_states = np.array([config.num_states, config.num_states])
    epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

    env = gym.make('MountainCar-v0')
    bin_size = (env.observation_space.high - env.observation_space.low) / num_states

    # Initialize Q-table with zeros
    q_table = np.zeros(shape=(num_states[0], num_states[1], env.action_space.n))
    run_episodes(run, env, q_table, bin_size, epsilon, LEARNING_RATE, epsilon_decay_value, DISCOUNT, END_EPSILON_DECAYING, START_EPSILON_DECAYING)
    env.close()
    run.finish()  # End the run

Run the sweeps

In [None]:
sweep_id = wandb.sweep(sweep_config, project="mountain-car-v0")
wandb.agent(sweep_id, function=train, count=20)  # count caps this agent at 20 runs


Create sweep with ID: v31n9m5m
Sweep URL: https://wandb.ai/got-tree/mountain-car-v0/sweeps/v31n9m5m
wandb: Agent Starting Run: n7gmrlk2 with config:
wandb: 	discount: 0.95
wandb: 	epsilon: 0.9
wandb: 	learning_rate: 0.1
wandb: 	num_states: 10
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.

