# Exercise 13) Deep Deterministic Policy Gradients and Proximal Policy Optimization

Congratulations! You survived the course until the last exercise. 

![](https://media3.giphy.com/media/fwi2qY9VmH33uukTXr/giphy.gif)

Source: https://giphy.com/gifs/Rambo-rambo-fwi2qY9VmH33uukTXr

After this exercise, no semi-informed human resources department will be able to reject you. Showtime!

In this exercise we will investigate two state-of-the-art algorithms: deep deterministic policy gradient (DDPG) and proximal policy optimization (PPO).

We will examine their performance on [Goddard's rocket problem](https://github.com/osannolik/gym-goddard).
This environment comes prepackaged in this notebook's folder, so it can simply be imported.

```
First formulated by R. H. Goddard around 1910, this is a classical problem within dynamic optimization and optimal control. The task is simply to find the optimal thrust profile for a vertically ascending rocket in order for it to reach the maximum possible altitude, given that its mass decreases as the fuel is spent and that it is subject to varying drag and gravity.

The state, and the gym's observation space, of the rocket is its vertical position, velocity and mass.

The rocket engine is assumed to be throttled such that the thrust can be continuously controlled between 0 and some maximum limit.
```
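
As a rough illustration of the dynamics described in the quote above, the following sketch advances the state (position, velocity, mass) by one explicit Euler step. It is not the implementation used by `GoddardEnv`: the function name `rocket_step`, the exponential drag model, the inverse-square gravity, and all constants are assumptions chosen only for illustration.

```python
import math


def rocket_step(h, v, m, thrust, dt=0.1,
                g0=9.81, radius=6.371e6,          # assumed surface gravity and planet radius
                drag_coeff=0.01, scale_height=8000.0,  # assumed drag parameters
                exhaust_velocity=2000.0):         # assumed effective exhaust velocity
    """One explicit Euler step of a vertically ascending rocket (illustrative only)."""
    # Gravity weakens with altitude (inverse-square law).
    gravity = g0 * (radius / (radius + h)) ** 2
    # Drag opposes the motion and fades as the air gets thinner.
    drag = drag_coeff * v * abs(v) * math.exp(-h / scale_height)
    # Newton's second law for the velocity; burning fuel reduces the mass.
    dv = (thrust - drag) / m - gravity
    dm = -thrust / exhaust_velocity
    return h + dt * v, v + dt * dv, m + dt * dm


# Starting at rest on the ground with the engine on:
h, v, m = rocket_step(h=0.0, v=0.0, m=100.0, thrust=2000.0)
```

The sketch does not handle fuel exhaustion or a thrust limit; in the actual environment those constraints are part of the problem definition.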

![](https://github.com/osannolik/gym-goddard/blob/master/animation.gif)

In [1]:
from rocket_env import GoddardEnv
from tqdm import tqdm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## 1) DDPG

## 2) PPO

The [original paper from 2017](https://arxiv.org/abs/1707.06347) introduced PPO as a way to combine ideas from A2C (multiple parallel workers) and TRPO (a trust region that limits how far each actor update can move).
PPO achieves this not by clipping gradients, but by clipping the probability ratio between the new and the old policy in its surrogate objective, which removes the incentive for new policies to move too far away from old ones.
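
To make the clipping idea concrete, here is a minimal NumPy sketch of the clipped surrogate objective from the paper. The function name `ppo_clip_objective` and the toy numbers are assumptions for illustration; `stable_baselines3` computes an equivalent quantity internally on PyTorch tensors, and its `clip_range` parameter defaults to 0.2.

```python
import numpy as np


def ppo_clip_objective(new_log_prob, old_log_prob, advantage, clip_range=0.2):
    """Clipped surrogate objective (to be maximized), averaged over samples."""
    # Probability ratio between the new and the old policy for each sampled action.
    ratio = np.exp(new_log_prob - old_log_prob)
    unclipped = ratio * advantage
    # Restrict the ratio to [1 - clip_range, 1 + clip_range].
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage
    # Taking the minimum yields a pessimistic bound on the unclipped objective.
    return np.mean(np.minimum(unclipped, clipped))


# Toy batch: action probabilities under the old and the new policy.
old_log_prob = np.log(np.array([0.5, 0.5, 0.5]))
new_log_prob = np.log(np.array([0.9, 0.5, 0.1]))
advantage = np.array([1.0, 1.0, -1.0])

objective = ppo_clip_objective(new_log_prob, old_log_prob, advantage)
```

In the first sample the ratio is 1.8 with a positive advantage, so the objective is capped at 1.2 times the advantage: pushing the probability of that action even higher yields no further gain, which is what keeps the update close to the old policy.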

In [11]:

from stable_baselines3 import PPO

env = GoddardEnv()

# Train a PPO agent with an MLP policy for one million environment steps
model = PPO('MlpPolicy', env, verbose=0)
model.learn(total_timesteps=int(1e6))

<stable_baselines3.ppo.ppo.PPO at 0x7f5dd2b98c90>

In [17]:
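# Roll out the trained policy and render it; interrupt the cell to stop early
obs = env.reset()
for i in range(10000):
    # deterministic=True selects the policy's mean action instead of sampling
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        # Start a new episode once the current one has ended
        obs = env.reset()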


In [None]:
env.close()

In [None]:
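# Note: train_log is not defined in this section; it is assumed to be a dict
# whose 'cum_rew' entry holds one cumulative reward per training episode.
fig = plt.figure()

plt.plot(np.array(train_log['cum_rew']), label='Cum. Reward')
# Exponentially weighted moving average to smooth the noisy episode returns
plt.plot(pd.Series(train_log['cum_rew']).ewm(span=30).mean(), label='EWMA')
plt.xlabel('episode')
plt.ylabel('cumulative reward')
plt.title('Cumulative reward over episodes during training')
plt.legend()
plt.show()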