# Proximal Policy Optimization (PPO)
---
In this notebook, we train PPO on a plain pixel-wise perturbation environment.
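
For reference, the clipped surrogate objective that PPO is usually described as maximizing is sketched below. This is the commonly stated form; whether the `ppo_step` function imported in this notebook implements exactly this variant is an assumption, since its source is not shown here. The symbols $\epsilon$ (clip range) and $\hat{A}_t$ (advantage estimate) do not appear elsewhere in the notebook.

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
                  \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

The clip keeps the probability ratio $r_t(\theta)$ close to 1, so a single update cannot move the policy far from the one that collected the data.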

### 1. Import the Necessary Packages

In [4]:
import importlib

import torch
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline


import dynamics
importlib.reload(dynamics)
from dynamics import Dynamics

import sys
sys.path.append('./PPO/')

from utils import *
from mlp_policy import Policy
from mlp_critic import Value
from core.ppo import ppo_step
from core.common import estimate_advantages
from core.agent import Agent

dtype = torch.float64
torch.set_default_dtype(dtype)
gpu_index = 0  # assumption: `args` is never defined in this notebook, so default to GPU 0
device = torch.device('cuda', index=gpu_index) if torch.cuda.is_available() else torch.device('cpu')
if torch.cuda.is_available():
    torch.cuda.set_device(gpu_index)

ModuleNotFoundError: No module named 'utils.math'

### 2. Instantiate the Environment and Agent

In [23]:
env = Dynamics(dataset = 'mnist', vae = 'VAE_mnist', cls = 'CLS_mnist', target = 9)
env.reset()
state_size = env.state_size()
action_size = env.action_size()
agent = Agent(state_size=state_size, action_size=action_size, random_seed=2)
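
The cells in this notebook call `reset()`, `step(action)`, `state_size()` and `action_size()` on `env`, with `step` returning a 4-tuple. A minimal sketch of an object with that shape may help when testing a training loop without the MNIST models. `ToyEnv` and `rollout` are hypothetical names with made-up dynamics; they are not the `Dynamics` class.

```python
class ToyEnv:
    """Hypothetical stand-in exposing the methods the notebook calls on `env`."""

    def __init__(self, n_state=4, n_action=2, horizon=5):
        self._n_state = n_state
        self._n_action = n_action
        self._horizon = horizon
        self._t = 0
        self._state = [0.0] * n_state

    def state_size(self):
        return self._n_state

    def action_size(self):
        return self._n_action

    def reset(self):
        # Start every episode from the all-zero state.
        self._t = 0
        self._state = [0.0] * self._n_state
        return list(self._state)

    def step(self, action):
        # Toy dynamics (an assumption): shift every component by the mean action.
        self._t += 1
        shift = sum(action) / len(action)
        self._state = [x + shift for x in self._state]
        reward = -sum(abs(x) for x in self._state)  # penalize distance from zero
        done = self._t >= self._horizon
        return list(self._state), reward, done, {}


def rollout(env, policy, max_t=300):
    """Run one episode and return the summed reward."""
    state = env.reset()
    score = 0.0
    for _ in range(max_t):
        state, reward, done, _ = env.step(policy(state))
        score += reward
        if done:
            break
    return score


toy_env = ToyEnv()
score = rollout(toy_env, lambda s: [0.0] * toy_env.action_size())
```

Swapping a stub like this in for `env` is one way to check that the loop itself runs before debugging the real environment.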

### 3. Train the Agent with DDPG

In [24]:
def ddpg(n_episodes=10000, max_t=300, print_every=100):
    scores_deque = deque(maxlen=print_every)
    scores = []
    for i_episode in range(1, n_episodes+1):
        state = env.reset()
        agent.reset()
        score = 0
        for t in range(max_t):
            action = agent.act(state)
            next_state, reward, done, _ = env.step(action)
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break 
        scores_deque.append(score)
        scores.append(score)
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)), end="")
        torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
        torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
        if i_episode % print_every == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
            
    return scores

scores = ddpg()

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(scores)+1), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

Episode 100	Average Score: -948.35
Episode 200	Average Score: -948.68
Episode 300	Average Score: -948.68
Episode 400	Average Score: -948.68
Episode 500	Average Score: -948.68
Episode 600	Average Score: -948.68
Episode 700	Average Score: -948.68
Episode 800	Average Score: -948.68
Episode 900	Average Score: -948.68
Episode 1000	Average Score: -948.68
Episode 1100	Average Score: -948.68
Episode 1200	Average Score: -948.68
Episode 1300	Average Score: -948.68
Episode 1400	Average Score: -948.68
Episode 1500	Average Score: -948.68
Episode 1600	Average Score: -948.68
Episode 1700	Average Score: -948.68
Episode 1800	Average Score: -948.68
Episode 1900	Average Score: -948.68
Episode 2000	Average Score: -948.68
Episode 2100	Average Score: -948.68
Episode 2200	Average Score: -948.68
Episode 2300	Average Score: -948.68
Episode 2362	Average Score: -948.68

KeyboardInterrupt: 
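
The average printed during training is the mean of the most recent `print_every` episode scores, held in a bounded `deque`. The sketch below reproduces that computation on made-up scores; `running_average` is a hypothetical helper, not part of the notebook's code. Plotting such a smoothed series next to the raw `scores` can make a flat curve, like the -948.68 plateau in the log above, easier to spot.

```python
from collections import deque


def running_average(scores, window=100):
    """Mean of the last `window` scores after each episode."""
    recent = deque(maxlen=window)  # oldest score is dropped automatically
    averages = []
    for s in scores:
        recent.append(s)
        averages.append(sum(recent) / len(recent))
    return averages


# Illustrative values only, not results from this notebook.
averages = running_average([-10.0, -20.0, -30.0, -40.0], window=2)
```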

### 4. Watch a Smart Agent!

In [12]:
agent.actor_local.load_state_dict(torch.load('checkpoint_actor.pth'))
agent.critic_local.load_state_dict(torch.load('checkpoint_critic.pth'))

state = env.reset()
for t in range(200):
    action = agent.act(state, add_noise=False)
    env.render()
    state, reward, done, _ = env.step(action)
    print(reward, done)
    if done:
        break 
img = env.render()
img.show()
#env.close()

-20.030639093362726 False
-3872.7560537558 False
-3814.1891055153455 False
-3847.071071732393 False
-3847.235048856188 False
-3883.7205412791177 False
-3856.5942801737847 False
-3938.4364384628357 False
-3848.2640285111243 False
-3864.712574613516 False
-3890.0591526860585 False
-3887.154923866444 False
-3815.3349221751105 False
-3919.779576944705 False
-3814.6427619502856 False
-3899.882242669465 False
-3849.6326178992063 False
-3915.5297254360485 False
-3832.3922267187795 False
-3883.1383394974214 False
-3841.4541112790334 False
-3871.5215972598353 False
-3793.070107809585 False
-3894.8703980375253 False
-3817.175592979543 False
-3921.784381949799 False
-3886.4961287202254 False
-3865.5335189619336 False
-3872.723037845125 False
-3879.5586626183763 False
-3859.685907582471 False
-3868.208791225044 False
-3922.323519392859 False
-3814.273924015916 False
-3864.169002066616 False
-3867.0410888726265 False
-3859.4003732343144 False
-3940.2572250151243 False
-3846.5198443218883 False
-385

### 5. Explore

In this exercise, we have provided a sample DDPG agent and demonstrated how to use it to solve an OpenAI Gym environment.  To continue your learning, you are encouraged to complete any (or all!) of the following tasks:
- Amend the various hyperparameters and network architecture to see if you can get your agent to solve the environment faster than this benchmark implementation.  Once you build intuition for the hyperparameters that work well with this environment, try solving a different OpenAI Gym task!
- Write your own DDPG implementation.  Use this code as reference only when needed -- try as much as you can to write your own algorithm from scratch.
- You may also like to implement prioritized experience replay, to see if it speeds learning.  
- The current implementation adds Ornstein-Uhlenbeck noise to the action space. However, it has [been shown](https://blog.openai.com/better-exploration-with-parameter-noise/) that adding noise to the parameters of the neural network policy can improve performance. Make this change to the code, to verify it for yourself!
- Write a blog post explaining the intuition behind the DDPG algorithm and demonstrating how to use it to solve an RL environment of your choosing.  
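
As a starting point for the noise experiment in the list above, here is a minimal sketch of a discretized Ornstein-Uhlenbeck process. The class name `OUNoise`, the Gaussian increments, and the defaults `theta=0.15` and `sigma=0.2` are assumptions for illustration; the noise actually used by this notebook's `Agent` is not shown and may differ.

```python
import random


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process:
    x <- x + theta * (mu - x) + sigma * N(0, 1), applied per component."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.size = size
        self.mu = mu
        self.theta = theta      # strength of the pull back toward mu
        self.sigma = sigma      # scale of the random increments
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Restart the process at its long-run mean.
        self.state = [self.mu] * self.size

    def sample(self):
        # Mean-reverting drift plus Gaussian noise; successive samples are correlated.
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


noise = OUNoise(size=3)
samples = [noise.sample() for _ in range(5)]
```

Action-space noise of this kind is added to the actor's output at each step. Parameter-space noise would instead perturb the actor's weights before a rollout, which is the change the list item suggests trying.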