# Reinforcement Learning with the MountainCar Environment

In this tutorial, we train a reinforcement learning agent on the MountainCar environment from the Gymnasium library (imported as `gym` in the code). In this classic problem, an underpowered car must drive up a steep hill. We use the `mercury-rl` library to configure the environment and train the agent.

## Steps

1. **Download the Dataset**: Obtain and load the MountainCar dataset from Kaggle.
2. **Creating the Offline Environment**: Configure the offline environment using the dataset and `mercury-rl`.
3. **Training the Agent**: Train the imitation agent from `mercury-rl` using pre-recorded trajectories.
4. **Saving the Model**: Save the trained policy model to disk.
5. **Initializing the Environment**: Set up the MountainCar environment in Gymnasium.
6. **Loading the Model**: Load the saved model into a new agent.
7. **Running the Agent**: Execute the agent in the environment and calculate the total reward.
8. **Closing the Environment**: Ensure the environment is properly closed to release resources.


## Setup and Libraries

In this section, we will import and set up the necessary libraries. We will use `mercury-rl` for reinforcement learning agents and environment configuration, along with other essential libraries.

In [1]:
# Import necessary standard libraries
import sys  # Provides access to system-specific parameters and functions
import os  # Provides a way of using operating system-dependent functionality
from pathlib import Path  # Provides an object-oriented interface for filesystem paths

# Determine the root path of the project, assuming the current working directory is within the project structure
# `Path(os.getcwd())` creates a Path object for the current working directory
# `.parents[3]` is the fourth ancestor of the current directory (`parents[0]` is the immediate parent); adjust the index to match your project structure
root_path = str(Path(os.getcwd()).parents[3])

# Append the root path to the system path, allowing the interpreter to locate project modules regardless of the current working directory
sys.path.append(root_path)
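
As a quick check of how `Path.parents` is indexed, the sketch below uses a hypothetical directory layout (the path `/home/user/project/docs/tutorials/rl/notebooks` is an assumption for illustration, not this project's real layout). It shows that `parents[0]` is the immediate parent, so `parents[3]` sits four levels above the working directory.

```python
from pathlib import PurePosixPath

# Hypothetical working directory, used only to illustrate the indexing
cwd = PurePosixPath("/home/user/project/docs/tutorials/rl/notebooks")

# Print the first four ancestors, starting with the immediate parent
for i in range(4):
    print(i, cwd.parents[i])

# With this layout, the fourth ancestor is the project root
root_path = str(cwd.parents[3])
print(root_path)
```

If your notebook lives at a different depth, change the index until `root_path` points at the directory that contains the `mercury` package.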

In [5]:
# Import third-party libraries
import gymnasium as gym # Import the Gymnasium library for creating and managing reinforcement learning environments
import numpy as np # Import NumPy for numerical operations and array handling
import pandas as pd # Import Pandas for data manipulation and analysis

# Import TensorFlow for building and training ML models
import tensorflow as tf

# Import custom modules from the mercury-rl library
from mercury.rl.agents import ImitationAgent # Import the ImitationAgent class, which will be used to create an agent for reinforcement learning
from mercury.rl.environment import ENV # Import the ENV environment configuration from the mercury-rl library

## Download the Dataset

The **MountainCar** dataset, available on [Kaggle](https://www.kaggle.com/datasets/gibrano/offline-mountaincar?select=MountainCar.csv), is a benchmark dataset used for reinforcement learning tasks. This dataset is specifically designed for the MountainCar environment, a classic problem in reinforcement learning where an underpowered car must drive up a steep hill.

## Dataset Description

The dataset contains data collected from an agent interacting with the MountainCar environment. Each row in the dataset represents a single step taken by the agent, including the state of the environment before and after the action, the action taken, and the reward received.

### Columns

- `state_0`: The position of the car at the start of the step.
- `state_1`: The velocity of the car at the start of the step.
- `action`: The action taken by the agent (0 = push left, 1 = no push, 2 = push right).
- `reward`: The reward received after taking the action.
- `next_state_0`: The position of the car after the action.
- `next_state_1`: The velocity of the car after the action.
- `done`: A boolean indicating whether the episode has ended (True or False).
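
To make the column layout concrete, here is a minimal sketch that builds two hand-written transitions with the columns listed above. The values are invented for illustration and are not rows from the Kaggle file. It also checks that, within an episode, the next state of one row matches the state of the following row.

```python
import pandas as pd

# Two hand-written transitions using the column layout described above.
# The numbers are illustrative only; they are not taken from the dataset.
sample = pd.DataFrame(
    {
        "state_0": [-0.50, -0.51],
        "state_1": [0.00, -0.01],
        "action": [0, 0],
        "reward": [-1.0, -1.0],
        "next_state_0": [-0.51, -0.53],
        "next_state_1": [-0.01, -0.02],
        "done": [False, False],
    }
)

# Within an episode, next_state_* of row i should equal state_* of row i + 1
chained = (
    sample["next_state_0"].iloc[:-1].to_numpy()
    == sample["state_0"].iloc[1:].to_numpy()
).all()

print(sample.dtypes)
print("Rows chain correctly:", chained)
```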

### Usage

This dataset can be used to train and evaluate reinforcement learning algorithms, particularly those that rely on offline data. It provides a fixed set of experiences from which an agent can learn without requiring interaction with a live environment. This can be useful for debugging, testing new algorithms, and comparing performance against established benchmarks.


## Creating the Offline Environment

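To create an offline environment for training our reinforcement learning agent, we need to specify the path to our dataset and define the relevant columns that our environment will use. The column names must match the headers in the CSV file being loaded. The code below uses `x`, `vel`, `episode_id`, and `seq`, which differ from the column names listed in the dataset description above, so adjust them if your file uses other names.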

In [6]:
# Define the path to the dataset
data_path = os.path.join(root_path, "data", "MountainCarExpert.csv")

# Define the column names in the dataset for states, actions, rewards, episode IDs, and sequence/order
states_cols = ['x', 'vel'] # Columns representing the state of the environment (position 'x' and velocity 'vel')
action_col = 'action' # Column representing the action taken by the agent
reward_col = 'reward' # Column representing the reward received after the action
episode_col_id = 'episode_id' # Column representing the unique identifier for each episode
order_col = 'seq' # Column representing the sequence/order of the steps within an episode

offline_env = ENV(data_path, states_cols, action_col, reward_col, episode_col_id, order_col, batch_size=64, shuffle=True)
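
The internals of `ENV` are not shown in this tutorial. As a rough mental model only, the sketch below uses plain pandas to group a small made-up table into per-episode trajectories ordered by step, using the same column names configured above. This is an assumption about what an offline environment typically does, not `mercury-rl`'s implementation.

```python
import pandas as pd

# Made-up rows, deliberately out of order, using the column names configured above
df = pd.DataFrame(
    {
        "episode_id": [1, 0, 0, 1, 0],
        "seq": [1, 2, 0, 0, 1],
        "x": [-0.42, -0.53, -0.50, -0.40, -0.51],
        "vel": [0.02, -0.02, 0.00, 0.01, -0.01],
        "action": [2, 0, 0, 2, 0],
        "reward": [-1.0, -1.0, -1.0, -1.0, -1.0],
    }
)

# Sort by episode and step, then collect one trajectory per episode
trajectories = {}
ordered = df.sort_values(["episode_id", "seq"])
for episode_id, group in ordered.groupby("episode_id"):
    trajectories[int(episode_id)] = {
        "states": group[["x", "vel"]].to_numpy(),
        "actions": group["action"].to_numpy(),
        "rewards": group["reward"].to_numpy(),
    }

for episode_id, traj in trajectories.items():
    print(episode_id, traj["states"].shape, traj["actions"].tolist())
```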

## Training

In this section, we will train our reinforcement learning agent using the offline environment created earlier. The agent will learn from pre-recorded trajectories by iterating over the dataset for a specified number of epochs.


In [None]:
# Initialize the imitation agent with a specified learning rate
agent = ImitationAgent(learning_rate=0.001)

# Define the number of epochs for training
epochs = 30

# Training loop
for epoch in range(epochs):  # Loop over each epoch

    offline_env.reset()
    
    # Loop over each batch in the offline environment
    for batch_id in range(offline_env.env.episodes):
        
        # Retrieve trajectories (episodes) for the current batch
        episode_ids, sequence, states, actions, rewards = offline_env.get_replay(batch_id)

        # Loop over each trajectory in the batch
        for j in range(len(episode_ids)):
            # Store the transition (state, action, reward) in the agent's memory
            agent.store_transition(states[j], actions[j], rewards[j])

        # Train the agent using the stored transitions
        agent.learn()

    # Print the epoch number and the current loss of the agent
    print("Epoch:", epoch, "Loss:", agent.loss.numpy())
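
What `agent.learn()` does internally is not shown here. Imitation learning from recorded trajectories is commonly implemented as behavior cloning, that is, supervised classification of the expert's action given the state. The sketch below illustrates that idea with a linear softmax policy in NumPy, trained on synthetic data. The "expert" rule (push in the direction of the current velocity) and all hyperparameters are assumptions for illustration, not `mercury-rl`'s code or the dataset's policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "expert" data: states are (position, velocity). The stand-in expert
# pushes right (action 2) when velocity is positive, otherwise left (action 0).
n = 500
states = np.column_stack(
    [rng.uniform(-1.2, 0.6, n), rng.uniform(-0.07, 0.07, n)]
)
actions = np.where(states[:, 1] > 0, 2, 0)

# Standardize features so plain gradient descent behaves well
features = (states - states.mean(axis=0)) / states.std(axis=0)

n_actions = 3
weights = np.zeros((2, n_actions))
bias = np.zeros(n_actions)
one_hot = np.eye(n_actions)[actions]
learning_rate = 0.5


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


for epoch in range(200):
    probs = softmax(features @ weights + bias)
    # Cross-entropy between the policy's action probabilities and the expert's actions
    loss = -np.mean(np.log(probs[np.arange(n), actions] + 1e-12))
    # Gradient of the mean cross-entropy with respect to the logits
    grad_logits = (probs - one_hot) / n
    weights -= learning_rate * features.T @ grad_logits
    bias -= learning_rate * grad_logits.sum(axis=0)

predicted = np.argmax(features @ weights + bias, axis=1)
accuracy = np.mean(predicted == actions)
print(f"final loss: {loss:.3f}, accuracy: {accuracy:.2%}")
```

The loss printed per epoch in the training loop above plays the same role as `loss` here: it should decrease as the policy's predictions move closer to the recorded actions.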

## Saving the Trained Model

After training the reinforcement learning agent, it is important to save the trained model for future use. This allows us to load the model later for further training, evaluation, or deployment.

In [None]:
# Save the trained policy model to a specified file path
agent.policy.save('models/mountain_car_imitation_model_env.h5')
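
The save call above writes into a `models/` directory relative to the current working directory. Depending on the library version, saving may fail if that directory does not exist, so one cautious option is to create it first. A minimal sketch using only the standard library:

```python
from pathlib import Path

# Same relative path as in the save call above
model_path = Path("models") / "mountain_car_imitation_model_env.h5"

# Create the parent directory if it is missing; exist_ok avoids an error on reruns
model_path.parent.mkdir(parents=True, exist_ok=True)

print(model_path.parent.is_dir())
```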

## Initializing the Gym Environment

To test the performance of our trained reinforcement learning agent, we need to initialize the environment. The Gymnasium library (imported as `gym`) provides a standard API for interacting with a wide variety of reinforcement learning environments, including MountainCar.

In [None]:
# Create the MountainCar environment with Gymnasium; render_mode="human" opens a window that displays the episode
env = gym.make('MountainCar-v0', render_mode="human")

## Loading the Trained Model into a New Agent

To utilize the trained policy in a new instance of the agent, we need to load the saved model and assign it to the new agent's policy. This allows the new agent to leverage the learned policy without retraining from scratch.


In [None]:
# Initialize a new imitation agent
agent2 = ImitationAgent()

# Load the trained model from the specified file path
agent2.policy = tf.keras.models.load_model('models/mountain_car_imitation_model_env.h5')

## Running the Agent in the Environment

After loading the trained policy into a new agent, we can run the agent in the MountainCar environment to observe its performance and calculate the total reward obtained during an episode.

In [None]:
# Reset the environment to start a new episode and retrieve the initial state
curr_state, info = env.reset()

# Initialize the total reward accumulator
total_R = 0

# Run the agent in the environment
while True:
    # Choose an action based on the current state using the trained agent
    action = agent2.choose_action(curr_state)
    
    # Take the action in the environment and receive the next state and reward
    next_state, reward, terminated, truncated, _ = env.step(action)
    
    # Update the current state
    curr_state = next_state.copy()
    
    # Accumulate the total reward
    total_R += reward

    # Stop when the goal is reached (terminated) or the step limit is hit (truncated)
    if terminated or truncated:
        break

# Print the total reward obtained in the episode
print("Total reward:", total_R)

By running the agent in the MountainCar environment, we can evaluate its performance based on the total reward obtained. In the recorded run, the total reward was **-84.0**. Because MountainCar-v0 gives a reward of -1 for every step, this means the car reached the goal in 84 steps. The loop resets the environment, chooses actions based on the agent's policy, updates the state, and accumulates rewards until the episode ends.

Running this loop allows us to observe how well the agent performs and make adjustments if necessary to improve its behavior.
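
A single episode gives a noisy estimate, since the starting position varies between resets. The sketch below averages the return over several episodes and stops on either `terminated` or `truncated`. To keep it self-contained, it uses a tiny stand-in class, `CountdownEnv`, which is hypothetical and only mimics the return values of Gymnasium's `reset` and `step`. The helper names `run_episode` and `evaluate` are also illustrative. In this tutorial you would pass `env` and `agent2.choose_action` instead.

```python
import numpy as np


class CountdownEnv:
    """Hypothetical stand-in that mimics Gymnasium's reset/step return values."""

    def __init__(self, goal_step=5, max_steps=10):
        self.goal_step = goal_step
        self.max_steps = max_steps
        self.steps = 0

    def reset(self, seed=None):
        self.steps = 0
        return np.zeros(2, dtype=np.float32), {}

    def step(self, action):
        self.steps += 1
        # Only "push right" (action 2) can reach the goal in this toy setup
        terminated = action == 2 and self.steps >= self.goal_step
        # Otherwise the episode is cut off once the step limit is reached
        truncated = not terminated and self.steps >= self.max_steps
        obs = np.array([self.steps, 0.0], dtype=np.float32)
        return obs, -1.0, terminated, truncated, {}


def run_episode(env, choose_action):
    """Run one episode and return the undiscounted sum of rewards."""
    state, _ = env.reset()
    total_reward = 0.0
    while True:
        state, reward, terminated, truncated, _ = env.step(choose_action(state))
        total_reward += reward
        if terminated or truncated:
            return total_reward


def evaluate(env, choose_action, n_episodes=5):
    """Return the mean and standard deviation of the return over several episodes."""
    returns = [run_episode(env, choose_action) for _ in range(n_episodes)]
    return float(np.mean(returns)), float(np.std(returns))


mean_right, std_right = evaluate(CountdownEnv(), lambda state: 2)
mean_idle, std_idle = evaluate(CountdownEnv(), lambda state: 1)
print("always push right:", mean_right, std_right)
print("never push:", mean_idle, std_idle)
```

Reporting the mean together with the spread makes it easier to compare two trained policies than a single episode's total.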

## Closing the Environment

After running the agent in the environment and evaluating its performance, it is important to properly close the environment. This ensures that all resources are released and the environment is cleanly shut down.


In [None]:
# Close the environment to release resources
env.close()