## Learning the Mountain Car Environment with DDPG

### Mountain Car Environment
![MountainCar](mountain_car_continuous.gif) 

Mountain Car is a continuous-control environment provided by Gymnasium. It has a single action, a directional force applied to the car (think of a gas pedal that also works in reverse), and two observations: the car's position along the x axis and its velocity. The model sees neither the car's height nor any pixels.

#### Action/Obs Space 
The observation is an `ndarray` with shape `(2,)` whose elements correspond to the following:

|Num|Observation                          |Min   |Max  |Unit           |
|---|-------------------------------------|------|-----|---------------|
|0  |position of the car along the x axis |-1.2  |0.6  |position (m)   |
|1  |velocity of the car                  |-0.07 |0.07 |velocity (m/s) |

The action is also an `ndarray`, but with shape `(1,)`, giving the force. A value outside $[-1,1]$ is clipped to that range.

#### Dynamics
The car's dynamics follow a discrete-time state-space model. The position update uses the already-updated velocity (a semi-implicit Euler step):
$$v_{t+1} = v_{t} + \mathrm{power} \cdot f - 0.0025\cos(3x_{t})$$
$$x_{t+1} = x_{t} + v_{t+1}$$
where $f$ is the force, i.e. the model's output clipped to $[-1,1]$, and $\mathrm{power}$ is always $0.0015$. If the car collides with a wall, its velocity is reset to 0. The position $x$ always stays within $[-1.2, 0.6]$, and the velocity is clipped to $[-0.07, 0.07]$.
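
The update above can be sketched in a few lines of plain Python. This is a minimal illustration built only from the constants quoted in this section, not the environment's actual source; the function name `step_dynamics` and the treatment of both walls as inelastic are assumptions made for the example.

```python
import math

POWER = 0.0015                          # force scaling from the text
MIN_POSITION, MAX_POSITION = -1.2, 0.6  # position bounds from the text
MAX_SPEED = 0.07                        # velocity is clipped to [-0.07, 0.07]


def step_dynamics(position, velocity, action):
    """Apply one step of the update equations; returns (position, velocity).

    Hypothetical helper for illustration only.
    """
    force = min(max(action, -1.0), 1.0)  # clip the action to [-1, 1]
    velocity += POWER * force - 0.0025 * math.cos(3 * position)
    velocity = min(max(velocity, -MAX_SPEED), MAX_SPEED)
    position += velocity                 # uses the already-updated velocity
    position = min(max(position, MIN_POSITION), MAX_POSITION)
    # Assumption: hitting either wall while moving into it zeroes the velocity.
    if (position == MIN_POSITION and velocity < 0) or (
        position == MAX_POSITION and velocity > 0
    ):
        velocity = 0.0
    return position, velocity
```

Calling `step_dynamics(-0.5, 0.0, 1.0)` repeatedly, feeding each result back in, traces out a trajectory under constant full throttle.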

#### Goal and Rewards
The car's objective is to reach the goal. At every step it is penalized by $0.1 \cdot \mathrm{action}^2$, which keeps actions reasonable and avoids explosions, and if the car reaches the goal it receives a large reward of $100$.
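
As a rough sketch of this reward rule (the helper name `step_reward` is hypothetical, and the goal threshold of 0.45 is taken from the Episode End section):

```python
GOAL_POSITION = 0.45  # goal threshold quoted in the Episode End section


def step_reward(action, position):
    """Per-step reward: quadratic action penalty plus a bonus at the goal.

    Hypothetical helper for illustration, not the environment's code.
    """
    reward = -0.1 * action ** 2
    if position >= GOAL_POSITION:
        reward += 100.0
    return reward
```

Under this rule, an episode that runs for 999 steps at full throttle without reaching the goal accumulates $999 \times (-0.1) = -99.9$, which is consistent with the evaluation return printed at the end of this notebook.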

#### Initial State
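The car begins at a random position in $[-0.6,-0.4]$, close to the bottom of the valley (the minimum is at $x = -\pi/6 \approx -0.52$), with zero velocity. The random offset from the exact minimum increases the likelihood of seeing rewards.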

#### Episode End
The episode ends when the car reaches the goal ($x \ge 0.45$, termination) or when the episode length reaches 999 steps (truncation).
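
The two ending conditions map onto the `terminated` and `truncated` flags used in the evaluation loop later in this notebook. A minimal sketch, assuming a hypothetical helper `episode_status` and a step counter maintained by the caller:

```python
GOAL_POSITION = 0.45     # goal threshold from the text
MAX_EPISODE_STEPS = 999  # step limit from the text


def episode_status(position, steps_taken):
    """Return (terminated, truncated) for the current state.

    Hypothetical helper: `steps_taken` counts the steps completed so far.
    """
    terminated = position >= GOAL_POSITION      # the car reached the goal
    truncated = steps_taken >= MAX_EPISODE_STEPS  # the step limit was hit
    return terminated, truncated
```

Keeping the two flags separate matters for learning: a truncated episode did not fail or succeed, it just ran out of time.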



## Code

### Setup

#### Imports

In [12]:
import gymnasium as gym
from stable_baselines3 import DDPG
from stable_baselines3.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise

import numpy as np

#### Making the environment

In [13]:
env = gym.make('MountainCarContinuous-v0')

#### Making the Agent

In [None]:
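n_actions = env.action_space.shape[-1]
# Temporally correlated exploration noise added to the actor's output
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=0.65 * np.ones(n_actions), theta=0.3)
# Requesting "cuda" falls back to the CPU when no GPU is available (see the log below)
model = DDPG("MlpPolicy", env, action_noise=action_noise, verbose=1, device="cuda")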

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
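
To give a feel for what the Ornstein-Uhlenbeck noise configured above looks like, here is a minimal standalone sketch of a discretized OU process using only the standard library. It is not Stable-Baselines3's implementation: the function name `ou_noise_path`, the step size `dt=1e-2`, and the zero initial value are assumptions made for illustration, while `sigma=0.65` and `theta=0.3` mirror the values in the cell above.

```python
import math
import random


def ou_noise_path(n_steps, mean=0.0, sigma=0.65, theta=0.3, dt=1e-2, seed=0):
    """Simulate a 1-D Ornstein-Uhlenbeck path (illustrative sketch only).

    Each sample is pulled back toward `mean` at rate `theta` and perturbed
    by Gaussian noise scaled by `sigma * sqrt(dt)`.
    """
    rng = random.Random(seed)
    noise = 0.0  # assumed initial value
    path = []
    for _ in range(n_steps):
        noise += theta * (mean - noise) * dt
        noise += sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(noise)
    return path
```

Because each sample depends on the previous one, consecutive noise values are correlated, so the car tends to be pushed in one direction for several steps in a row rather than jittering around zero.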


#### Training and Saving

In [None]:
model.learn(total_timesteps=60000, log_interval=100)
model.save("ddpg_mc")

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 923      |
|    ep_rew_mean     | -40.6    |
| time/              |          |
|    episodes        | 100      |
|    fps             | 69       |
|    time_elapsed    | 1321     |
|    total_timesteps | 92271    |
| train/             |          |
|    actor_loss      | 0.997    |
|    critic_loss     | 0.38     |
|    learning_rate   | 0.001    |
|    n_updates       | 92170    |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 914      |
|    ep_rew_mean     | -43.7    |
| time/              |          |
|    episodes        | 200      |
|    fps             | 69       |
|    time_elapsed    | 2627     |
|    total_timesteps | 183626   |
| train/             |          |
|    actor_loss      | -0.242   |
|    critic_loss     | 0.732    |
|    learning_rate   | 0.001    |
|    n_updates       | 183525   |
--------------

KeyboardInterrupt: 

#### Display Learned Strategy

In [16]:
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo
num_eval_episodes = 1

env = gym.make("MountainCarContinuous-v0", render_mode="rgb_array")  # rgb_array frames are required for video recording
env = RecordVideo(env, video_folder="folder", name_prefix="eval",
                  episode_trigger=lambda x: True)
env = RecordEpisodeStatistics(env, buffer_length=num_eval_episodes)

for episode_num in range(num_eval_episodes):
    obs, info = env.reset()

    episode_over = False
    while not episode_over:
        action, _states = model.predict(obs, deterministic=True)  # noise-free action from the trained DDPG policy
        obs, reward, terminated, truncated, info = env.step(action)

        episode_over = terminated or truncated
env.close()

print(f'Episode time taken: {env.time_queue}')
print(f'Episode total rewards: {env.return_queue}')
print(f'Episode lengths: {env.length_queue}')

Episode time taken: deque([2.998103], maxlen=1)
Episode total rewards: deque([-99.8999999999986], maxlen=1)
Episode lengths: deque([999], maxlen=1)
