### MountainCar-v0 with Gymnasium + Stable-Baselines3 (Step-by-Step)

In this notebook you'll:

- Inspect the MountainCar-v0 environment (state, actions, rewards)

- Run a random policy to build intuition

- Train a DQN agent with Stable-Baselines3

- Evaluate, record a short video, and plot learning curves

In [None]:
%pip install "gymnasium[other,classic_control]" stable-baselines3 tensorboard

In [None]:
import gymnasium as gym
import numpy as np
import os 
import matplotlib.pyplot as plt

import stable_baselines3 as sb3
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.results_plotter import load_results, ts2xy

print("Gymnasium:", gym.__version__)
print("Stable-Baselines3:", sb3.__version__)

### Meet MountainCar-v0

State (observation): [position, velocity] (shape: 2)

Actions: {0: push left, 1: no push, 2: push right}

Goal: Drive the underpowered car up the right hill (position ≥ 0.5).

Reward: −1 per step until the goal is reached (shorter episodes = better).

Episode ends: when the goal is reached (`terminated`) or after 200 steps (`truncated`).
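
To make the description above concrete, here is a minimal pure-Python sketch of one environment step. The constants and update rule reflect my understanding of Gymnasium's classic-control implementation and should be treated as assumptions; `mountain_car_step` and `rollout` are hypothetical helper names, not part of Gymnasium.

```python
import math

# Assumed constants (check the installed Gymnasium source before relying on them).
MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POSITION = 0.5
FORCE = 0.001      # engine strength
GRAVITY = 0.0025   # pull along the slope


def mountain_car_step(position, velocity, action):
    """Advance one step; returns (position, velocity, reward, terminated)."""
    # action 0/1/2 maps to a push of -1/0/+1; gravity follows the slope cos(3x).
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POSITION, min(MAX_POSITION, position))
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0  # the car stops dead at the left wall
    terminated = position >= GOAL_POSITION and velocity >= 0.0
    return position, velocity, -1.0, terminated


def rollout(action, start=-0.5, max_steps=200):
    """Apply one fixed action every step; returns (steps taken, goal reached)."""
    position, velocity = start, 0.0
    for step in range(1, max_steps + 1):
        position, velocity, _, terminated = mountain_car_step(position, velocity, action)
        if terminated:
            return step, True
    return max_steps, False


steps, reached = rollout(action=2)  # always push right
print("Always push right -> steps:", steps, "| reached goal:", reached)
```

Under these assumed constants the engine force is weaker than gravity on the steep part of the right hill, so pushing right from a standstill is not enough: the car has to rock back and forth to build momentum.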

In [None]:
env = gym.make("MountainCar-v0")
print("Observation space: ", env.observation_space)
print("Action space: ", env.action_space)

obs, info = env.reset(seed=42)
print("Initial observation:", obs, " | info:", info)
print("Sample action:", env.action_space.sample())
env.close()

**Roll out one random episode**

This helps you see how the state evolves and why naive actions fail.
We'll track the car's position over time and plot it.

In [None]:
env = gym.make("MountainCar-v0")
positions = []
rewards = []
obs, info = env.reset(seed=123)

done = False
total_r = 0
while not done:
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    positions.append(obs[0])
    rewards.append(reward)
    total_r += reward

env.close()

print("Episode length: ", len(rewards), " | Total reward:", total_r)

# Plot position over steps
plt.figure()
plt.plot(positions)
plt.title("Random driving")
plt.xlabel("Step")
plt.ylabel("Position")
plt.show()
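
For contrast with the random rollout, here is a sketch of a hand-written heuristic that pushes in whichever direction the car is already moving. `momentum_policy` and `run_episode` are hypothetical helpers introduced for illustration; `run_episode` only assumes the Gymnasium `reset`/`step` signatures already used in this notebook.

```python
def momentum_policy(obs):
    """Heuristic (not learned): push in the direction of the current velocity."""
    position, velocity = obs
    return 2 if velocity >= 0 else 0  # 2 = push right, 0 = push left


def run_episode(env, policy, seed=None):
    """Run one episode with `policy`; returns (total reward, episode length)."""
    obs, info = env.reset(seed=seed)
    total_reward, steps, done = 0.0, 0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total_reward += reward
        steps += 1
        done = terminated or truncated
    return total_reward, steps
```

With a real environment this would be called as `run_episode(gym.make("MountainCar-v0"), momentum_policy, seed=123)`. This heuristic is commonly reported to reach the goal within the 200-step limit, which makes it a useful reference point when judging the trained agent.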

In [8]:
from gymnasium.wrappers import RecordVideo

log_dir = "./mountaincar_logs"
video_dir = "./mountaincar_videos"
os.makedirs(log_dir, exist_ok=True)
os.makedirs(video_dir, exist_ok=True)

# Monitored env for training stats
def make_env(seed=0):
    e = gym.make("MountainCar-v0", render_mode=None)
    e.reset(seed=seed)  # seed before wrapping: Monitor rejects a second early reset
    e = Monitor(e, filename=os.path.join(log_dir, "monitor.csv"))
    return e

# A separate env only for video to avoid slowing down training
def make_video_env(seed=123):
    e = gym.make("MountainCar-v0", render_mode="rgb_array")
    e.reset(seed=seed)  # the seed argument was previously unused
    e = RecordVideo(e, video_folder=video_dir, episode_trigger=lambda ep: True)
    return e

print("Log dir:", log_dir)
print("Video dir:", video_dir)

Log dir: ./mountaincar_logs
Video dir: ./mountaincar_videos


In [None]:
env_video = gym.make("MountainCar-v0", render_mode="rgb_array")
env_video = RecordVideo(env_video, video_folder=video_dir, 
                        episode_trigger=lambda ep: True, name_prefix="test_run")

obs, info = env_video.reset(seed=2)
done = False
while not done:
    action = env_video.action_space.sample()
    obs, reward, terminated, truncated, info = env_video.step(action)
    done = terminated or truncated

env_video.close()
print("Sample video saved to: ", os.path.abspath(video_dir))

**A tiny callback to save the best model**

We'll compute a simple moving mean of rewards using Monitor logs and save the best-performing checkpoint.
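
Independent of the callback itself, the "moving mean over Monitor logs" idea can be sketched with the standard library alone. This assumes the Monitor CSV layout as I understand it: a first line starting with `#` that holds JSON metadata, followed by columns `r` (episode reward), `l` (length), and `t` (elapsed time). `read_monitor_rewards` and `moving_mean` are hypothetical helpers, and the sample rows are made up for illustration.

```python
import csv
import json


def read_monitor_rewards(text):
    """Parse Monitor-style CSV text; returns (metadata dict, list of episode rewards)."""
    lines = text.splitlines()
    metadata = json.loads(lines[0][1:])  # drop the leading '#'
    rows = csv.DictReader(lines[1:])     # remaining lines: header row 'r,l,t' plus data
    return metadata, [float(row["r"]) for row in rows]


def moving_mean(values, window):
    """Mean of the last `window` values (window > 0), or None if there are no values."""
    tail = values[-window:]
    return sum(tail) / len(tail) if tail else None


# Made-up sample in the assumed format, not real training output.
sample = (
    '#{"t_start": 0.0, "env_id": "MountainCar-v0"}\n'
    "r,l,t\n"
    "-200.0,200,1.5\n"
    "-200.0,200,3.1\n"
    "-150.0,150,4.2\n"
)
metadata, rewards = read_monitor_rewards(sample)
print(metadata["env_id"], moving_mean(rewards, window=2))
```

In the notebook itself, `load_results` and `ts2xy` (imported earlier) read the same logs from `log_dir`, so this sketch is only meant to show what the numbers represent.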

In [None]:
class SaveBest(BaseCallback):
    def __init__(self, check_freq, log_dir, verbose=1):
        super().__init__(verbose)