# Reinforcement Learning with the MountainCar Environment

In this tutorial, we train an agent on the MountainCar environment from the Gymnasium library (the maintained successor to Gym). This classic problem involves an underpowered car that must drive up a steep hill. We use the `mercury2` library to train the agent from pre-recorded trajectories and then test it in the live environment.

## Steps

1. **Import Libraries**: Import necessary standard and third-party libraries and custom modules from `mercury2`.

2. **Set Up Paths**: Determine the root path of the project and append it to the system path.

3. **Load Dataset**: Read the MountainCar dataset into a pandas DataFrame.

4. **Initialize Agent**: Set batch size and initialize the imitation agent with a learning rate of 0.001.

5. **Train Agent**: Train the agent.

6. **Save Model**: Save the trained policy model to a specified path.

7. **Initialize Environment**: Set up the MountainCar environment in Gym with rendering.

8. **Load Model**: Initialize a new imitation agent and load the saved model.

9. **Run Agent**: Execute the agent in the environment and calculate the total reward.

10. **Close Environment**: Close the environment to release resources.

## Setup and Libraries

In this section, we will import and set up the necessary libraries. We will use `mercury2` for reinforcement learning agents and environment configuration, along with other essential libraries.

In [1]:
# Import necessary standard libraries
import sys  # Provides access to system-specific parameters and functions
import os  # Provides a way of using operating system-dependent functionality
from pathlib import Path  # Provides an object-oriented interface for filesystem paths

# Determine the root path of the project, assuming the current working directory is within the project structure
# `Path(os.getcwd())` creates a Path object for the current working directory
# `.parents[3]` is the fourth ancestor of that directory (`.parents[0]` is the immediate parent); adjust the index to match your project structure
root_path = str(Path(os.getcwd()).parents[3])

# Append the root path to the system path, allowing the interpreter to locate project modules regardless of the current working directory
sys.path.append(root_path)
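
The cell above hard-codes the ancestor index, which ties the notebook to one directory layout. As a hedged alternative, the sketch below walks upward from a starting directory until it finds a folder that contains a marker subdirectory. The helper name `find_project_root` and the choice of `data` as the marker are assumptions made for illustration; neither is part of `mercury2`.

```python
from pathlib import Path


def find_project_root(start, marker="data"):
    """Return the closest ancestor of `start` (or `start` itself) that contains `marker`.

    Raises FileNotFoundError when no ancestor contains the marker.
    """
    start = Path(start).resolve()
    # Check the starting directory first, then each ancestor from nearest to farthest
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains '{marker}'")
```

With this helper, the root could be obtained with `root_path = str(find_project_root(Path.cwd()))`, provided the project root is the nearest ancestor that holds a `data` folder.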

In [2]:
# Import third-party libraries
import gymnasium as gym # Import the Gymnasium library for creating and managing reinforcement learning environments
import pandas as pd # Import Pandas for data manipulation and analysis

# Import TensorFlow for building and training ML models
import tensorflow as tf

# Import custom modules from the mercury2 library
from mercury2.rl.agents import ImitationAgent # Import the ImitationAgent class, which will be used to create an agent for reinforcement learning



## Download the Dataset

The **MountainCar** dataset, available on [Kaggle](https://www.kaggle.com/datasets/gibrano/offline-mountaincar?select=MountainCar.csv), is a benchmark dataset used for reinforcement learning tasks. This dataset is specifically designed for the MountainCar environment, a classic problem in reinforcement learning where an underpowered car must drive up a steep hill.

## Dataset Description

The dataset contains data collected from an agent interacting with the MountainCar environment. Each row in the dataset represents a single step taken by the agent, including the state of the environment before and after the action, the action taken, and the reward received.

### Columns

- `state_0`: The position of the car at the start of the step.
- `state_1`: The velocity of the car at the start of the step.
- `action`: The action taken by the agent (0 = push left, 1 = no push, 2 = push right).
- `reward`: The reward received after taking the action.
- `next_state_0`: The position of the car after the action.
- `next_state_1`: The velocity of the car after the action.
- `done`: A boolean indicating whether the episode has ended (True or False).
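
To make the row layout concrete, the sketch below models one row as a small dataclass. The field names and action codes come from the list above; the names `Action`, `Transition`, and `transition_from_row`, and the numbers in the example row, are made up for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Action(IntEnum):
    """Action codes as described in the column list."""
    PUSH_LEFT = 0
    NO_PUSH = 1
    PUSH_RIGHT = 2


@dataclass(frozen=True)
class Transition:
    state_0: float       # position at the start of the step
    state_1: float       # velocity at the start of the step
    action: Action       # action taken by the agent
    reward: float        # reward received after the action
    next_state_0: float  # position after the action
    next_state_1: float  # velocity after the action
    done: bool           # whether the episode ended on this step


def transition_from_row(row):
    """Build a Transition from a mapping keyed by the column names listed above."""
    return Transition(
        state_0=float(row["state_0"]),
        state_1=float(row["state_1"]),
        action=Action(int(row["action"])),
        reward=float(row["reward"]),
        next_state_0=float(row["next_state_0"]),
        next_state_1=float(row["next_state_1"]),
        # Accept both real booleans and the strings "True"/"False" found in CSV text
        done=str(row["done"]).strip().lower() == "true",
    )


# Hypothetical row: the values are invented, only the keys match the dataset description
example = transition_from_row({
    "state_0": -0.5, "state_1": 0.0, "action": 2, "reward": -1.0,
    "next_state_0": -0.499, "next_state_1": 0.001, "done": "False",
})
print(example.action.name, example.done)
```

Wrapping the integer code in an enum makes an out-of-range action fail immediately, since `Action(3)` raises a `ValueError`.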

### Usage

This dataset can be used to train and evaluate reinforcement learning algorithms, particularly those that rely on offline data. It provides a fixed set of experiences from which an agent can learn without requiring interaction with a live environment. This can be useful for debugging, testing new algorithms, and comparing performance against established benchmarks.

## Loading the dataset

In [3]:
# Define the path to the dataset
data_path = root_path+"/data/MountainCarExpert.csv"

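# Define the column names in the dataset for states, actions, rewards, episode IDs, and sequence/order
# Note: this file names its state columns 'x' and 'vel' and also has 'episode_id' and 'seq' columns, unlike the column list above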
states_cols = ['x', 'vel'] # Columns representing the state of the environment (position 'x' and velocity 'vel')
action_col = 'action' # Column representing the action taken by the agent
reward_col = 'reward' # Column representing the reward received after the action
episode_col_id = 'episode_id' # Column representing the unique identifier for each episode
order_col = 'seq' # Column representing the sequence/order of the steps within an episode

df = pd.read_csv(data_path)
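
Because the training code depends on specific column names, it can help to check for them right after loading. The sketch below is a minimal, library-free check; the helper name `missing_columns` is an assumption for illustration, and the header list stands in for `df.columns`.

```python
def missing_columns(available, required):
    """Return the required column names that are absent from `available`, in order."""
    available = set(available)
    return [name for name in required if name not in available]


# Column names used by the training code in this tutorial
required = ["x", "vel", "action", "reward", "episode_id", "seq"]

# Stand-in for `df.columns`; here the 'seq' column is deliberately left out
header = ["x", "vel", "action", "reward", "episode_id"]

missing = missing_columns(header, required)
if missing:
    print("Missing columns:", missing)
```

In the notebook, the same check could be run as `missing_columns(df.columns, states_cols + [action_col, reward_col, episode_col_id, order_col])`, raising an error if the result is not empty.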

In [17]:
len(df.episode_id.unique())

1213

## Training

In this section, we train the agent on the offline dataset loaded earlier. The agent learns from pre-recorded trajectories by iterating over every episode in the dataset for a fixed number of epochs.


In [4]:
# Set the batch size for processing episodes (not used below: the loop trains on one episode per `learn()` call)
batch_size = 1

# Initialize the imitation agent with a specified learning rate
agent = ImitationAgent(learning_rate=0.001)

# Define the number of epochs for training
epochs = 30

# Loop over each epoch
for epoch in range(epochs):
    
    # Iterate over the unique episode IDs in the dataframe, in order of first appearance
    for episode_id in df.episode_id.unique():

        # Filter the dataframe for the current episode and sort its steps by sequence number
        df_batch = df[df[episode_col_id] == episode_id].sort_values(by=order_col, ascending=True)

        # Extract the current states, actions, and rewards from the dataframe
        curr_states = df_batch[states_cols].values
        actions = df_batch[action_col].values
        rewards = df_batch[reward_col].values

        # Store each transition in the agent's memory
        for j in range(df_batch.shape[0]):
            agent.store_transition(curr_states[j], actions[j], rewards[j])

        # Train the agent with the stored transitions
        agent.learn()  

    # Print the current epoch and the loss value after training
    print("Epoch:", epoch, "Loss:", agent.loss.numpy())


Epoch: 0 Loss: 0.6830408
Epoch: 1 Loss: 0.659641
Epoch: 2 Loss: 0.64032567
Epoch: 3 Loss: 0.62700754
Epoch: 4 Loss: 0.61185896
Epoch: 5 Loss: 0.6022146
Epoch: 6 Loss: 0.59757
Epoch: 7 Loss: 0.5945625
Epoch: 8 Loss: 0.58951503
Epoch: 9 Loss: 0.5828993
Epoch: 10 Loss: 0.57632357
Epoch: 11 Loss: 0.571684
Epoch: 12 Loss: 0.56850845
Epoch: 13 Loss: 0.56353176
Epoch: 14 Loss: 0.5603784
Epoch: 15 Loss: 0.5587101
Epoch: 16 Loss: 0.5558426
Epoch: 17 Loss: 0.55460846
Epoch: 18 Loss: 0.55363977
Epoch: 19 Loss: 0.55416286
Epoch: 20 Loss: 0.55410564
Epoch: 21 Loss: 0.555156
Epoch: 22 Loss: 0.55579066
Epoch: 23 Loss: 0.55592906
Epoch: 24 Loss: 0.55600363
Epoch: 25 Loss: 0.55653787
Epoch: 26 Loss: 0.55680287
Epoch: 27 Loss: 0.5563757
Epoch: 28 Loss: 0.554058
Epoch: 29 Loss: 0.554322
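
In the log above, the printed loss reaches its lowest value at epoch 18 and then fluctuates around 0.555. One common response is patience-based early stopping. The sketch below applies that rule to the logged values; the helper name `first_stop_epoch` and the `patience` and `min_delta` settings are assumptions for illustration, not features of `mercury2`.

```python
def first_stop_epoch(losses, patience=5, min_delta=0.0):
    """Return the epoch index at which early stopping would trigger, or None.

    Stopping triggers after `patience` consecutive epochs without an
    improvement of more than `min_delta` over the best loss seen so far.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Loss values copied from the training log above
logged_losses = [
    0.6830408, 0.659641, 0.64032567, 0.62700754, 0.61185896, 0.6022146,
    0.59757, 0.5945625, 0.58951503, 0.5828993, 0.57632357, 0.571684,
    0.56850845, 0.56353176, 0.5603784, 0.5587101, 0.5558426, 0.55460846,
    0.55363977, 0.55416286, 0.55410564, 0.555156, 0.55579066, 0.55592906,
    0.55600363, 0.55653787, 0.55680287, 0.5563757, 0.554058, 0.554322,
]

print(first_stop_epoch(logged_losses))  # 23
```

With a patience of 5, training on this run would have stopped at epoch 23, five epochs after the best logged value. The printed loss is a single value per epoch, so it is a noisy signal; a larger patience may be safer.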


## Saving the Trained Model

After training the reinforcement learning agent, it is important to save the trained model for future use. This allows us to load the model later for further training, evaluation, or deployment.

In [5]:
# Save the trained policy model to a specified file path
agent.policy.save(root_path+'/models/offline_mountain_car_imitation_model.h5')
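
The save call above assumes that a `models` directory already exists under `root_path`. If that is not guaranteed, the directory can be created first. The sketch below shows one way to do it; the helper name `prepare_model_path` is an assumption for illustration.

```python
from pathlib import Path


def prepare_model_path(root, filename, subdir="models"):
    """Create `<root>/<subdir>` if needed and return the full path to `filename` as a string."""
    directory = Path(root) / subdir
    # parents=True creates missing intermediate folders; exist_ok=True makes the call repeatable
    directory.mkdir(parents=True, exist_ok=True)
    return str(directory / filename)
```

The returned string could then be passed to the save call, for example `agent.policy.save(prepare_model_path(root_path, 'offline_mountain_car_imitation_model.h5'))`.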



## Initializing the Gym Environment

To test the performance of our trained agent, we need to initialize the environment. The Gymnasium library (imported above as `gym`) provides a standard API for interacting with a wide variety of reinforcement learning environments, including the MountainCar environment.

In [18]:
# Initialize the MountainCar environment; render_mode="human" opens a window that displays each step
env = gym.make('MountainCar-v0', render_mode="human")

## Loading the Trained Model into a New Agent

To utilize the trained policy in a new instance of the agent, we need to load the saved model and assign it to the new agent's policy. This allows the new agent to leverage the learned policy without retraining from scratch.

In [19]:
# Initialize a new imitation agent
agent2 = ImitationAgent()

# Load the trained model from the specified file path
agent2.policy = tf.keras.models.load_model(root_path+'/models/offline_mountain_car_imitation_model.h5')



In [20]:
# Reset the environment to get the initial state and information
curr_state, info = env.reset()

# Initialize the total reward counter
total_R = 0

# Loop to interact with the environment until termination
while True:
    
    # Agent chooses an action based on the current state
    action = agent2.choose_action(curr_state)

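    # Take the action in the environment; `step` returns (observation, reward, terminated, truncated, info)
    # The `truncated` flag is discarded here, so the loop runs past the 200-step time limit of MountainCar-v0 until the car reaches the goal
    next_state, reward, terminated, _, _ = env.step(action)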
    
    # Update the current state to the next state
    curr_state = next_state.copy()
    # Accumulate the reward
    total_R += reward

    # Break the loop if the episode is terminated
    if terminated:
        break

# Print the total reward obtained in this episode
print("Total reward:", total_R)

Total reward: -203.0
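
The loop above stops only on `terminated`. Gymnasium's `step` also returns a `truncated` flag, which signals that a time limit cut the episode short. The sketch below shows an evaluation loop that respects both flags. To stay self-contained it uses a stand-in environment, `CountdownEnv`, instead of the real one; that class and the `run_episode` helper are assumptions for illustration.

```python
class CountdownEnv:
    """Stand-in environment that follows the Gymnasium reset/step return convention.

    It gives a reward of -1.0 per step, never terminates on its own,
    and reports truncation once `limit` steps have been taken.
    """

    def __init__(self, limit=200):
        self.limit = limit
        self.steps = 0

    def reset(self):
        self.steps = 0
        return (0.0, 0.0), {}

    def step(self, action):
        self.steps += 1
        truncated = self.steps >= self.limit
        return (0.0, 0.0), -1.0, False, truncated, {}


def run_episode(env, choose_action):
    """Run one episode and return (total_reward, steps), stopping on termination or truncation."""
    state, _info = env.reset()
    total_reward, steps = 0.0, 0
    while True:
        action = choose_action(state)
        state, reward, terminated, truncated, _info = env.step(action)
        total_reward += reward
        steps += 1
        if terminated or truncated:
            return total_reward, steps


# A fixed "no push" policy stands in for the trained agent
total, steps = run_episode(CountdownEnv(limit=200), lambda state: 1)
print(total, steps)  # -200.0 200
```

With the real environment, the call would be `run_episode(env, agent2.choose_action)`. Under this loop, a total reward of -200 would mean the time limit was reached before the goal.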


## Closing the Environment

After running the agent in the environment and evaluating its performance, it is important to properly close the environment. This ensures that all resources are released and the environment is cleanly shut down.

In [21]:
# Close the environment to release resources
env.close()
