# Welcome!
Below, we will learn to implement and train a policy to play atari-pong, using only the pixels as input. We will use convolutional neural nets, multiprocessing, and pytorch to implement and train our policy. Let's get started!

(I strongly recommend you to try this notebook on the Udacity workspace first before running it locally on your desktop/laptop, as performance might suffer in different environments)

In [2]:
# install package for displaying animation
!pip install JSAnimation

# custom utilies for displaying animation, collecting rollouts and more
import pong_utils

%matplotlib inline

# check which device is being used. 
# I recommend disabling gpu until you've made sure that the code runs
device = pong_utils.device
print("using device: ",device)

Collecting JSAnimation
  Downloading https://files.pythonhosted.org/packages/3c/e6/a93a578400c38a43af8b4271334ed2444b42d65580f1d6721c9fe32e9fd8/JSAnimation-0.1.tar.gz
Building wheels for collected packages: JSAnimation
  Running setup.py bdist_wheel for JSAnimation ... [?25ldone
[?25h  Stored in directory: /root/.cache/pip/wheels/3c/c2/b2/b444dffc3eed9c78139288d301c4009a42c0dd061d3b62cead
Successfully built JSAnimation
Installing collected packages: JSAnimation
Successfully installed JSAnimation-0.1
using device:  cuda:0


In [3]:
# render ai gym environment
import gym
import time

# PongDeterministic does not contain random frameskip
# so is faster to train than the vanilla Pong-v4 environment
env = gym.make('PongDeterministic-v4')

print("List of available actions: ", env.unwrapped.get_action_meanings())

# we will only use the actions 'RIGHTFIRE' = 4 and 'LEFTFIRE" = 5
# the 'FIRE' part ensures that the game starts again after losing a life
# the actions are hard-coded in pong_utils.py

List of available actions:  ['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']


# Preprocessing
To speed up training, we can simplify the input by cropping the images and use every other pixel



In [None]:
import matplotlib
import matplotlib.pyplot as plt

# show what a preprocessed image looks like
env.reset()
_, _, _, _ = env.step(0)
# get a frame after 20 steps
for _ in range(20):
    frame, _, _, _ = env.step(1)

print(frame.shape)
plt.subplot(1,2,1)
plt.imshow(frame)
plt.title('original image')

plt.subplot(1,2,2)
plt.title('preprocessed image')

f = pong_utils.preprocess_single(frame)
print(f.shape)
# 80 x 80 black and white image
plt.imshow(pong_utils.preprocess_single(frame), cmap='Greys')
plt.show()



# Policy

## Exercise 1: Implement your policy
 
Here, we define our policy. The input is the stack of two different frames (which captures the movement), and the output is a number $P_{\rm right}$, the probability of moving left. Note that $P_{\rm left}= 1-P_{\rm right}$

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F



# set up a convolutional neural net
# the output is the probability of moving right
# P(left) = 1-P(right)
class Policy(nn.Module):

    def __init__(self):
        super(Policy, self).__init__()
        
        
    ########
    ## 
    ## Modify your neural network
    ##
    ########
        
        # 80x80 to outputsize x outputsize
        # outputsize = (inputsize - kernel_size + stride)/stride 
        # (round up if not an integer)

        # output = 20x20 here
        self.conv1 = nn.Conv2d(2, 4, kernel_size=4, stride=4)
        #size becomes 20*20*8
        self.conv2 = nn.Conv2d(4, 8, kernel_size=4, stride=4)
        self.size=8*5*5
        
        # 2 fully connected layer
        self.fc1 = nn.Linear(self.size, 32)
        self.fc2 = nn.Linear(32, 1)
        self.sig = nn.Sigmoid()
        
    def forward(self, x):
        
    ########
    ## 
    ## Modify your neural network
    ##
    ########
    
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        # flatten the tensor
        x = x.view(-1,self.size)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.sig(x)


# run your own policy!
#policy=Policy().to(device)
policy=pong_utils.Policy().to(device)

# we use the adam optimizer with learning rate 2e-4
# optim.SGD is also possible
import torch.optim as optim
optimizer = optim.Adam(policy.parameters(), lr=1e-4)

# Game visualization
pong_utils contain a play function given the environment and a policy. An optional preprocess function can be supplied. Here we define a function that plays a game and shows learning progress

In [None]:
pong_utils.play(env, policy, time=100) 
# try to add the option "preprocess=pong_utils.preprocess_single"
# to see what the agent sees

# Rollout
Before we start the training, we need to collect samples. To make things efficient we use parallelized environments to collect multiple examples at once

In [18]:
envs = pong_utils.parallelEnv('PongDeterministic-v4', n=4, seed=12345)
prob, state, action, reward = pong_utils.collect_trajectories(envs, policy, tmax=100)

In [21]:
print(np.asarray(rewards))

[[ 0.  0.  0. ...  0.  0.  0.]
 [ 0.  0.  0. ...  0.  0.  0.]
 [ 0.  0.  0. ...  0.  0.  0.]
 ...
 [ 0.  0. -1. ...  0.  0.  0.]
 [ 0.  0.  0. ...  0.  0.  0.]
 [ 0.  0.  0. ...  0.  0.  0.]]


# Function Definitions
Here you will define key functions for training. 

## Exercise 2: write your own function for training
(this is the same as policy_loss except the negative sign)

### REINFORCE
you have two choices (usually it's useful to divide by the time since we've normalized our rewards and the time of each trajectory is fixed)

1. $\frac{1}{T}\sum^T_t R_{t}^{\rm future}\log(\pi_{\theta'}(a_t|s_t))$
2. $\frac{1}{T}\sum^T_t R_{t}^{\rm future}\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}$ where $\theta'=\theta$ and make sure that the no_grad is enabled when performing the division

In [5]:
import numpy as np

def surrogate(policy, old_probs, states, actions, rewards,
              discount = 0.995, beta=0.01):

    ########
    ## 
    ## WRITE YOUR OWN CODE HERE
    ##
    ########
    
    seq_len = len(rewards)
    
    #calculate the future rewards
    discounts = discount**np.arange(seq_len)
    """
    the following codes is equal to:
        discount_rewards = np.asarray(rewards)
        for row in range(seq_len):
            discount_rewards[row, :] = discount_rewards[row, :]*discounts[n]
    
    it is used to calculate the point multiply by row elements 
    """
    discount_rewards = np.asarray(rewards)*discounts[:, np.newaxis]
    future_rewards = discount_rewards[::-1].cumsum(axis=0)[::-1]
    """batch normalization of future_rewards"""
    means = np.mean(future_rewards, axis=1)
    stds = np.std(future_rewards, axis=1)+ 1e-10
    normalized_future_rewards = (future_rewards - means[:, np.newaxis]) / stds[:, np.newaxis]
    
#     """
#     convert all to torch.tensor to calculate in gpu
#     """
#     #loss = torch.tensor(normalized_future_rewards*log_pro)    
#     actions = torch.tensor(actions, dtype=torch.int8, device=device)
#     normalized_future_rewards = torch.tensor(normalized_future_rewards, dtype=torch.float, device=device)
#     '''
#     the following old_probs is generated by collect_trajectories, it is a 
#     numpoy array detached from gpu which has no gradient info, so can not used to calculate the gradient 
#     '''
#     #old_probs = torch.tensor(old_probs, dtype=torch.float, device=device)
#     #log_pro = torch.tensor(log_pro, dtype=torch.float, device=device)
    
#     # convert states to policy (or probability)
#     new_probs = pong_utils.states_to_prob(policy, states)
#     new_probs = torch.where(actions == pong_utils.RIGHT, new_probs, 1.0-new_probs)

#     # include a regularization term
#     # this steers new_policy towards 0.5
#     # which prevents policy to become exactly 0 or 1
#     # this helps with exploration
#     # add in 1.e-10 to avoid log(0) which gives nan
#     entropy = -(new_probs*torch.log(old_probs+1.e-10)+ \
#         (1.0-new_probs)*torch.log(1.0-old_probs+1.e-10))

#     '''
#     `normalized_future_rewards*torch.log(new_probs)` is useful to caculate the policy gradient, 
#     but seems not good as 
#     `normalized_future_rewards*(new_probs/old_probs)`, where old_probs is generated by collect_trajectories and it is not tensor in GPU
#     maybe it is caused by using torch.log(new_probs) to simulate the original prob (eg. old_probs), but it is not the same as original.     
#     '''
#     return torch.mean(normalized_future_rewards*torch.log(new_probs) + beta*entropy)
#     #return torch.mean(normalized_future_rewards*(new_probs/old_probs) + beta*entropy)
  
    
    return torch.mean(normalized_future_rewards*torch.log(old_probs+1.0e-10))

#Lsur= surrogate(policy, prob, state, action, reward)

#print(Lsur)

# Training
We are now ready to train our policy!
WARNING: make sure to turn on GPU, which also enables multicore processing. It may take up to 45 minutes even with GPU enabled, otherwise it will take much longer!

In [None]:
from parallelEnv import parallelEnv
import numpy as np
# WARNING: running through all 800 episodes will take 30-45 minutes

# training loop max iterations
# episode = 500
episode = 800


# widget bar to display progress
!pip install progressbar
import progressbar as pb
widget = ['training loop: ', pb.Percentage(), ' ', 
          pb.Bar(), ' ', pb.ETA() ]
timer = pb.ProgressBar(widgets=widget, maxval=episode).start()

# initialize environment
envs = parallelEnv('PongDeterministic-v4', n=8, seed=1234)

discount_rate = .99
beta = .01
tmax = 320

# episode = 100
# tmax = 10

# keep track of progress
mean_rewards = []

for e in range(episode):

    # collect trajectories
    old_probs, states, actions, rewards = \
        pong_utils.collect_trajectories2(envs, policy, tmax=tmax)
        
    total_rewards = np.sum(rewards, axis=0)

    # this is the SOLUTION!
    # use your own surrogate function
#     L = -surrogate(policy, old_probs, states, actions, rewards, beta=beta)
    
    L = -pong_utils.surrogate(policy, old_probs, states, actions, rewards, beta=beta)
    optimizer.zero_grad()
    L.backward()
    optimizer.step()
    del L
        
    # the regulation term also reduces
    # this reduces exploration in later runs
    beta*=.995
    
    # get the average reward of the parallel environments
    mean_rewards.append(np.mean(total_rewards))
    
    # display some progress every 20 iterations
    if (e+1)%20 ==0 :
        print("Episode: {0:d}, score: {1:f}".format(e+1,np.mean(total_rewards)))
        print(total_rewards)
        
    # update progress widget bar
    timer.update(e+1)
    
timer.finish()
    



training loop:   2% |#                                          | ETA:  1:16:23

Episode: 20, score: -14.625000
[-14. -16. -17. -15.  -8. -14. -16. -17.]


training loop:   5% |##                                         | ETA:  1:13:22

Episode: 40, score: -14.875000
[-14. -16. -13. -17. -17.  -9. -17. -16.]


training loop:   7% |###                                        | ETA:  1:11:00

Episode: 60, score: -13.500000
[-14. -17. -10. -17. -15. -15.  -8. -12.]


training loop:  10% |####                                       | ETA:  1:08:57

Episode: 80, score: -14.625000
[-14. -13. -13. -14. -16. -16. -17. -14.]


training loop:  12% |#####                                      | ETA:  1:06:54

Episode: 100, score: -12.125000
[-16. -10.  -8. -12. -15. -12. -10. -14.]


training loop:  15% |######                                     | ETA:  1:04:52

Episode: 120, score: -14.750000
[-16. -17. -13. -16. -16. -16.  -9. -15.]


training loop:  17% |#######                                    | ETA:  1:02:50

Episode: 140, score: -14.875000
[-11. -14. -16. -15. -14. -16. -17. -16.]


training loop:  20% |########                                   | ETA:  1:00:52

Episode: 160, score: -13.875000
[-11.  -9. -14. -13. -17. -17. -13. -17.]


training loop:  22% |#########                                  | ETA:  0:58:55

Episode: 180, score: -13.250000
[-14. -15. -14. -16. -12. -12.  -8. -15.]


training loop:  25% |##########                                 | ETA:  0:57:02

Episode: 200, score: -12.000000
[-15.  -6. -11. -15. -10. -13. -14. -12.]


training loop:  27% |###########                                | ETA:  0:55:05

Episode: 220, score: -14.500000
[-16. -11. -16. -16. -16. -14. -14. -13.]


training loop:  30% |############                               | ETA:  0:53:08

Episode: 240, score: -13.500000
[-16. -16. -16. -11. -16. -14.  -9. -10.]


training loop:  32% |#############                              | ETA:  0:51:15

Episode: 260, score: -14.125000
[-10. -14. -17. -15. -12. -15. -16. -14.]


training loop:  35% |###############                            | ETA:  0:49:20

Episode: 280, score: -14.375000
[-15. -15. -15. -16. -16. -11. -14. -13.]


training loop:  37% |################                           | ETA:  0:47:26

Episode: 300, score: -15.500000
[-17. -16. -15. -14. -15. -14. -17. -16.]


training loop:  40% |#################                          | ETA:  0:45:35

Episode: 320, score: -13.875000
[-14. -12. -14. -16. -15. -10. -17. -13.]


training loop:  40% |#################                          | ETA:  0:45:17

In [None]:
# play game after training!
pong_utils.play(env, policy, time=2000) 

In [None]:
plt.plot(mean_rewards)

In [None]:
# save your policy!
torch.save(policy, 'REINFORCE.policy')

# load your policy if needed
# policy = torch.load('REINFORCE.policy')

# try and test out the solution!
# policy = torch.load('PPO_solution.policy')

In [None]:
import numpy as np
a = np.array([[1,2,3,4], [2,3,4,5], [3,4,5,6]])
print(a)
sum_a = a[::-1].cumsum(axis=0)[::-1]
print(sum_a)
mean = np.mean(sum_a, axis=1) 
print(mean)

In [None]:
np.ones((8))*4