# Assignment 2
1. This assignment is due in two weeks, at 23:59 Feb 11th 2022.
2. There are two files to submit. Please name your .py and .ipynb file using your student number as Axxxxxx.py, Axxxxxxx.ipynb and submit it to Luminus->assignments->submissions->assignment2

# Part 1: Policy Gradients

You will implement the vanilla policy gradients algorithm, also referred to as
REINFORCE.

## Review

In policy gradients, the objective is to learn a parameter $\theta^*$ that
maximizes the following objective:

\begin{equation}
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[R(\tau)]
\end{equation}

where $\tau = (s_1,a_1,s_2,\ldots,s_{T-1},a_{T-1},s_T)$ is a *trajectory*
(also referred to as an *episode*), and factorizes as

\begin{equation}
\pi_\theta(\tau) = p(s_1)\pi_\theta(a_1|s_1)\prod_{t=2}^{T} p(s_t|s_{t-1},a_{t-1})\pi_\theta(a_t|s_t)
\end{equation}

and $R(\tau)$ denotes the full trajectory reward $R(\tau) = \sum_{t=1}^{T}
r(s_t,a_t)$ with $r(s_t,a_t)$ the rewards at the individual time steps.

In policy gradients, we directly apply the gradient $\nabla_\theta$ to
$J(\theta)$. In order to do so, we require samples of trajectories, meaning that
we now denote them as $\tau_i$ for the $i$th trajectory, and have $\tau_i =
(s_{i1},a_{i1},s_{i2},\ldots,s_{iT})$. When we approximate the gradient with
samples, we get:

\begin{align*}
\nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(\tau_i) R(\tau_i) \\
&= \frac{1}{N}\sum_{i=1}^N \left( \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it}|s_{it}) \right)  \left( \sum_{t=1}^{T} r(s_{it},a_{it}) \right)
\end{align*}

Multiplying a discount factor $\gamma$ to the rewards can be interpreted as
encouraging the agent to focus on rewards closer in the future, which can also
be thought of as a means for reducing variance (because there are more
possible futures further into the future). The discount factor can be
incorporated in two ways, from the full trajectory:

\begin{equation}
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N
\left( \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it}|s_{it}) \right) 
\left( \sum_{t=1}^T \gamma^{t-1} r(s_{it},a_{it}) \right)
\end{equation}

and from the reward to go:

\begin{equation}
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N
\left( \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it}|s_{it}) \right) 
\left( \sum_{t'=t}^T \gamma^{t'-t} r(s_{it},a_{it}) \right)
\end{equation}

**In this assignment, we only focus on the first version: full tragectory.**



# Policy Gradients Implementation


**You will need to write code in `PolicyGradient.ipynb`. The places where you need to write code are
 clearly indicated with the comments `START OF YOUR CODE` and
`END OF YOUR CODE`. 
You do not need to change any other files for this part of the assignment.**

The dataflow of the code is structured like this: 

- Set Up Hyperparameters and environment.
- Build a MLP model for policy learning.
- Initialize the agent, such as define the policy network and optimizer.
- Forward Computation: Sample trajectories by conducting an action given an observation from the environment, and calculate sum of rewards in each trajectory, That includes `sample_action`, `sample_trajectory`, `sample_trajectories` and `sum_of_rewards`.
- Backward Computation: Optimize the policy network based on the update rule. That contains `compute_advantage`, `estimate_return`, `get_log_prob` , `update_parameters`.

## Problem 1: data sampling

You need to implement any parts with a "Problem 1" header in the code. Here's what you need to do:

- 1. Implement `sample_action`, which samples an action from $\pi_\theta(a|s)$. This operation will be called in `sample_trajectories`.
- 2. Implement `sample_trajectory`, you need to call `sample_action` to obtain current action.
- 3. Implement `sum_of_rewards`, which is the Monte Carlo estimation of the Q function. You need to estimate the q-value of each path and return a single vector for the estimated q values whose length is the sum of the lengths of the paths.

## Problem 2: apply policy gradient
You only need to implement the parts with the "Problem 2" header.

- **Estimate return**: in `estimate_return`, normalize the advantages to have a mean of zero and a standard deviation of one.  This is a trick for reducing variance.
- Implement `get_log_prob` to obtain $\log \pi_\theta(a_{it}|s_{it})$: Given an action that the agent took in the environment, this computes the log probability of that action under $\pi_\theta(a|s)$. This will be used in the parameters update: 

\begin{equation}
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N
\left( \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it}|s_{it}) \right) 
\left( \sum_{t=1}^T \gamma^{t-1} r(s_{it},a_{it}) \right)
\end{equation}

- **Update parameters**: In `update_parameters`, using the update operation `optimizer.step()` to update the parameters of the policy. You firstly need to create loss value with the inputs.



# Environment Introduction: 


[##CartPole-v0](https://gym.openai.com/envs/CartPole-v0/): 

This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson in ["Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problem"](https://ieeexplore.ieee.org/document/6313077). A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity.

### Observation Space
The observation is a `ndarray` with shape `(4,)` where the elements correspond to the following:

| Num | Observation           | Min                  | Max                |
|-----|-----------------------|----------------------|--------------------|
| 0   | Cart Position         | -4.8*                |  4.8*                |
| 1   | Cart Velocity         | -Inf                 | Inf                |
| 2   | Pole Angle            | ~ -0.418 rad (-24°)** | ~ 0.418 rad (24°)** |
| 3   | Pole Angular Velocity | -Inf                 | Inf                 |

- `*`: the cart x-position can be observed between `(-4.8, 4.8)`, but an episode terminates if the cart leaves the
    `(-2.4, 2.4)` range.
- `**`: Similarly, the pole angle can be observed between  `(-.418, .418)` radians or precisely **±24°**, but an episode is
    terminated if the pole angle is outside the `(-.2095, .2095)` range or precisely **±12°**

### Action Space
The agent take a 1-element vector for actions.
The action space is `(action)` in `[0, 1]`, where `action` is used to push
the cart with a fixed amount of force:

| Num | Action                 |
|-----|------------------------|
| 0   | Push cart to the left  |
| 1   | Push cart to the right |

Note: The amount the velocity is reduced or increased is not fixed as it depends on the angle the pole is pointing.
This is because the center of gravity of the pole increases the amount of energy needed to move the cart underneath it

### Rewards
Reward is 1 for every step taken, including the termination step.
### Starting State
All observations are assigned a uniform random value between (-0.05, 0.05).
### Episode Termination
The episode terminates of one of the following occurs:
1. Pole Angle is more than ±12°
2. Cart Position is more than ±2.4 (center of the cart reaches the edge of the display)
3. Episode length is greater than 200. 


In [1]:
# !pip install gym==0.10.5

In [2]:
import numpy as np
import gym
import os
import time
import inspect
import sys
from multiprocessing import Process
import torch
from torch import nn as nn
import torch.nn.functional as F

## Set Up Hyperparameters

In [3]:
env_name = 'CartPole-v0'
# exp_name = 'vpg'
render = False
animate = render
discount = 1.0
n_iter = 101
batch_size = 1000
ep_len = -1.
learning_rate = 5e-3
reward_to_go = False
dont_normalize_advantages = False
seed = 1
n_experiments = 1
max_path_length = ep_len if ep_len > 0 else None
min_timesteps_per_batch = batch_size #shouldn't this be "max_timesteps_per_batch"?
gamma = discount
normalize_advantages = not(dont_normalize_advantages)

## Set Up Environment

In [4]:
#========================================================================================#
# Set Up Env
#========================================================================================#

# Make the gym environment
env = gym.make(env_name)

# Set random seeds
torch.manual_seed(seed)
np.random.seed(seed)
env.seed(seed)

# Maximum length for episodes
max_path_length = max_path_length or env.spec.max_episode_steps

# Is this env continuous, or self.discrete? In this assignment, we only consider discrete action space.
discrete = isinstance(env.action_space, gym.spaces.Discrete)

# Observation and action sizes
ob_dim = env.observation_space.shape[0] # should be 4
ac_dim = env.action_space.n if discrete else env.action_space.shape[0] # should be 2

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m


  result = entry_point.load(False)


## Build a MLP model for policy learning.

In [5]:
class MLP(nn.Module):

    def __init__(self, input_size, num_actions):
        super(MLP, self).__init__()
        self.dense1 = nn.Linear(input_size, 32)
        self.dense2 = nn.Linear(32, 32)
        self.dense3 = nn.Linear(32, num_actions)

    def forward(self, x):
        x = F.tanh(self.dense1(x))
        x = F.tanh(self.dense2(x))
        out = F.softmax(self.dense3(x))
        return out


## Initialize Agent
    

In [6]:
policy_net = MLP(input_size=ob_dim, num_actions=ac_dim)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=learning_rate)

## Data Sampling

In [7]:
def sample_action(policy_parameters):
    """
    -for every epoch sample a batch from all trajectories
    -predict which action to choose using probability
    -get reward at each timestep assuming pole "passes"
    -estimate reward using gamma
    -input and outputs are torch arrays
    
    Stochastically sampling from the policy distribution

    arguments:
        policy_parameters: logits of a categorical distribution over actions
                sy_logits_na: (batch_size, self.ac_dim)

    returns:
        sy_sampled_ac: (batch_size,)
    """

    sy_logits_na = policy_parameters
    #========================================================================================#
    #                           ----------PROBLEM 1----------
    #========================================================================================#
    # Stochastically sampling an action from the policy distribution $\pi_\theta(a|s)$.
    # ------------------------------------------------------------------
    # START OF YOUR CODE
    # ------------------------------------------------------------------

#     sy_sampled_ac = []
    
#     for row_index in range(sy_logits_na.shape[1]):
#         sy_sampled_ac.append(np.random.choice(sy_logits_na[:,row_index],1))

#     sy_sampled_ac = np.array(sy_sampled_ac)

    sy_sampled_ac = torch.argmax(sy_logits_na, dim=1)

#     action_probs = F.softmax(sy_logits_na, dim=-1)
#     dist = torch.distributions.Categorical(action_probs)
#     sy_sampled_ac = dist.sample()

    #                    time 0               time 1
    # sample_input = [[0.1,0.2,0.3,0.4], [0.4,0.3,0.2,0.1]]
    # sample_output = [3,0]
    # expected probability = 0.4*0.4 = 0.16
    # log(0.4) + log(0.4)
    # use torch implemented 
    
#     sy_sampled_ac = torch.squeeze(torch.multinomial(sy_logits_na, 1), axis=1)
    
    # ------------------------------------------------------------------
    # END OF YOUR CODE
    # ------------------------------------------------------------------

    return sy_sampled_ac

In [8]:
sample_input = torch.FloatTensor([[0.1,0.2,0.3,0.4], [0.4,0.3,0.2,0.1]])
print(sample_action(sample_input))

tensor([3, 0])


In [9]:
def sample_trajectory(env):
    """
    Given an observation at the starting point (env.reset()), decide first action, get reward, 
    """
    ob = env.reset()
    obs, acs, rewards = [], [], []
    steps = 0
    while True:

        obs.append(ob)
        #====================================================================================#
        #                           ----------PROBLEM 1----------
        #====================================================================================#
        # obtain the action 'ac' for current observation 'ob'
        # ------------------------------------------------------------------
        # START OF YOUR CODE
        # ------------------------------------------------------------------
        
        policy_parameters = policy_net(torch.FloatTensor([ob]))
        
#         print('policy_parameters', policy_parameters)
        ac = sample_action(policy_parameters)

        # ------------------------------------------------------------------
        # END OF YOUR CODE
        # ------------------------------------------------------------------
        ac = ac.numpy()[0]
        acs.append(ac)
        ob, rew, done, _ = env.step(ac)
        rewards.append(rew)
        steps += 1
        if done or steps > max_path_length:
            break
    path = {"observation" : np.array(obs, dtype=np.float32),
            "reward" : np.array(rewards, dtype=np.float32),
            "action" : np.array(acs, dtype=np.float32)}
    return path


In [10]:
def sample_trajectories(itr, env):
    """Collect paths until we have enough timesteps, as determined by the
    length of all paths collected in this batch.
    """
    timesteps_this_batch = 0
    paths = []
    while True:
        path = sample_trajectory(env)
        paths.append(path)
        timesteps_this_batch += len(path["reward"])
        if timesteps_this_batch > min_timesteps_per_batch:
            break
    return paths, timesteps_this_batch

For sum of rewards, we use the total discounted reward summed over entire trajectory (regardless of which time step the Q-value should be for).

In [11]:
def sum_of_rewards(re_n):
    """ Monte Carlo estimation of the Q function.

    let sum_of_path_lengths be the sum of the lengths of the paths sampled from
        the function sample_trajectories
    let num_paths be the number of paths sampled from sample_trajectories

    arguments:
        re_n: length: num_paths. Each element in re_n is a numpy array
            containing the rewards for the particular path

    returns:
        q_n: shape: (sum_of_path_lengths). A single vector for the estimated q values
            whose length is the sum of the lengths of the paths
    ----------------------------------------------------------------------------------

    Your code should construct numpy arrays for Q-values which will be used to compute
    advantages.


    You will write code for trajectory-based PG: 

          We use the total discounted reward summed over
          entire trajectory (regardless of which time step the Q-value should be for).

          For this case, the policy gradient estimator is

              E_{tau} [sum_{t=0}^T grad log pi(a_t|s_t) * Ret(tau)]

          where

              tau=(s_0, a_0, ...) is a trajectory,
              Ret(tau) = sum_{t'=0}^T gamma^t' r_{t'}.

          Thus, you should compute

              Q_t = Ret(tau)

    Store the Q-values for all timesteps and all trajectories in a variable 'q_n',
    like the 'ob_no' and 'ac_na' above.
    """
    #====================================================================================#
    #                           ----------PROBLEM 1----------
    #====================================================================================#
    # q_n: A single vector for the estimated q values whose length is the sum of the lengths of the paths.
    # Q-values: Q_t = Ret(tau) = sum_{t'=0}^T gamma^t' r_{t'}. 
    # Store the Q-values for all timesteps and all trajectories in a variable 'q_n'.
    # ------------------------------------------------------------------
    # START OF YOUR CODE
    # ------------------------------------------------------------------
    
    # this is a confusing part
    # input re_n = [[1,1,1], [1,1,1,1]]
    # notice re_n is of two samples, one of length 3, other of lenght 4
    # output q_n = [3,3,3,4,4,4,4]
    # alt output q_n = [3,2,1,4,3,2,1]
    
    q_n = []
    
    for path_index in range(len(re_n)):
        path = re_n[path_index]
        Q_t = sum(np.array([gamma**(t-1) for t in range(1,len(path)+1)])*path)
        q_n.extend([Q_t]*len(path))
        
    q_n = np.array(q_n)

    # # ------------------------------------------------------------------
    # END OF YOUR CODE
    # ------------------------------------------------------------------
    return q_n

sum_of_rewards([[1,1,1], [1,1,1,1]])

array([3., 3., 3., 4., 4., 4., 4.])

## Apply Policy Gradient

We firstly need to estimate return `estimate_return` and calculate log probability of actions `get_log_prob`. Then we can update parameters based on the rule:

\begin{equation}
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N
\left( \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it}|s_{it}) \right) 
\left( \sum_{t=1}^T \gamma^{t-1} r(s_{it},a_{it}) \right)
\end{equation}

In [12]:
def compute_advantage(ob_no, q_n):
    # don't have to modify this
    # sometimes we would use: adv_n = q_n - baseline
    adv_n = q_n.copy()
    return adv_n

In [13]:
def estimate_return(ob_no, re_n):
    """ Estimates the returns over a set of trajectories.

    let sum_of_path_lengths be the sum of the lengths of the paths sampled from
        sample_trajectories
    let num_paths be the number of paths sampled from sample_trajectories

    arguments:
        ob_no: shape: (sum_of_path_lengths, ob_dim)
        re_n: length: num_paths. Each element in re_n is a numpy array
            containing the rewards for the particular path

    returns:
        q_n: shape: (sum_of_path_lengths). A single vector for the estimated q values
            whose length is the sum of the lengths of the paths
        adv_n: shape: (sum_of_path_lengths). A single vector for the estimated
            advantages whose length is the sum of the lengths of the paths
    """
    
#     print(f'ob_no.shape:{ob_no.shape}')
#     print(f'len(re_n):{len(re_n)}')
    
    q_n = sum_of_rewards(re_n)
    
#     print(f'q_n.shape:{q_n.shape}')
#     print('q_n', q_n)
    
    adv_n = compute_advantage(ob_no, q_n)
    #====================================================================================#
    #                           ----------PROBLEM 2----------
    # Advantage Normalization
    #====================================================================================#
    if normalize_advantages:
        # On the next line, implement a trick which is known empirically to reduce variance
        # in policy gradient methods: normalize adv_n to have mean zero and std=1.
        # ------------------------------------------------------------------
        # START OF YOUR CODE
        # ------------------------------------------------------------------
        
#         print('np.mean(adv_n)', np.mean(adv_n))
#         print('np.std(adv_n)', np.std(adv_n))

        # without normalization, every value could be a positive value, making it harder to converge
        
        adv_n = (adv_n - np.mean(adv_n)) / (np.std(adv_n) + 1e-8)

        # ------------------------------------------------------------------
        # END OF YOUR CODE
        # ------------------------------------------------------------------
    return q_n, adv_n

In [1]:
def get_log_prob(policy_parameters, sy_ac_na):
    """
    Computing the log probability of a set of actions that were actually taken according to the policy

    arguments:
        policy_parameters: logits of a categorical distribution over actions
                sy_logits_na: (batch_size, self.ac_dim)

        sy_ac_na: (batch_size,) # i think this is the actual action

    returns:
        sy_logprob_n: (batch_size)

    Hint:
        For the discrete case, use the log probability under a categorical distribution.
    """

    sy_logits_na = policy_parameters
    #========================================================================================#
    #                           ----------PROBLEM 2----------
    #========================================================================================#
    # sy_logprob_n = \sum_{t=1}^T \log \pi_\theta(a_{it}|s_{it})
    # ------------------------------------------------------------------
    # START OF YOUR CODE
    # ------------------------------------------------------------------    
    
#     print('sy_logits_na:', sy_logits_na)
#     print('sy_ac_na:', sy_ac_na)
    
#     action_probs = F.softmax(torch.FloatTensor(sy_logits_na), dim=-1)
#     dist = torch.distributions.Categorical(action_probs)
#     sy_logprob_n = dist.log_prob(torch.LongTensor(sy_ac_na))
    
#     print('policy_parameters', policy_parameters)
#     sy_logprob_n = np.take_along_axis(policy_parameters, np.array([[int(i)] for i in sy_ac_na]), axis=1)

#     state_action_values = policy_parameters.gather(1, torch.LongTensor([[i] for i in sy_ac_na]))
#     sy_logprob_n = torch.log(state_action_values)

    one_hot = F.one_hot(sy_ac_na, sy_logits_na.shape[1])
    sy_logprob_n = torch.sum(torch.mul(one_hot, sy_logits_na), dim=1)

    print('sy_logprob_n:', sy_logprob_n)
#     print(f'sy_logprob_n.shape: {sy_logprob_n.shape}')
    
    # ------------------------------------------------------------------
    # END OF YOUR CODE
    # ------------------------------------------------------------------
    return sy_logprob_n

sample_input = torch.FloatTensor([[0.1,0.2,0.3,0.4], [0.4,0.3,0.2,0.1]])
sample_output = sample_action(sample_input)
get_log_prob(sample_input, sample_output)

NameError: name 'torch' is not defined

In [15]:
def update_parameters(ob_no, ac_na, q_n, adv_n):
    """
    Update the parameters of the policy and (possibly) the neural network baseline,
    which is trained to approximate the value function.

    arguments:
        ob_no: shape: (sum_of_path_lengths, ob_dim)
        ac_na: shape: (sum_of_path_lengths).
        q_n: shape: (sum_of_path_lengths). A single vector for the estimated q values
            whose length is the sum of the lengths of the paths
        adv_n: shape: (sum_of_path_lengths). A single vector for the estimated
            advantages whose length is the sum of the lengths of the paths

    returns:
        nothing
    """
    #====================================================================================#
    #                           ----------PROBLEM 2----------
    #====================================================================================#
    # Performing the Policy Update based on the current batch of rollouts.
    # 
    # ------------------------------------------------------------------
    # START OF YOUR CODE
    # ------------------------------------------------------------------
    
#     policy_parameters = policy_net(torch.FloatTensor(ob_no))
#     acs = sample_action(policy_parameters)
    
#     obs, rewards = [], []
#     for ac in acs:
#         ob, rew, done, _ = env.step(ac.numpy())
#         rewards.append(rew)
#         obs.append(ob)
    
#     # update
#     ob_no = torch.FloatTensor(obs)
#     ac_na = torch.LongTensor(acs)
#     re_n = torch.FloatTensor(rewards)
    
    # get actions
#     print('ob_no.shape', ob_no.shape) # batch_size, 4
    state_action_values = policy_net(torch.FloatTensor(ob_no))
#     print('state_action_values.shape', state_action_values.shape) # batch_size, 2 [LEFT,RIGHT]
    
#     print(f'state_action_values.shape:', state_action_values.shape)
    
    # compute expected Q values
    sy_logprob_n = get_log_prob(state_action_values, torch.LongTensor(ac_na))
    print('sy_logprob_n', sy_logprob_n)
    print('adv_n', adv_n)
    
#     print(f'sy_logprob_n.shape:', sy_logprob_n.shape)
    
    #     q_n, adv_n = estimate_return(ob_no, re_n)
#     expected_state_action_values =  sy_logprob_n.squeeze() * torch.FloatTensor(adv_n).squeeze()
    
#     print(f'expected_state_action_values.shape:', expected_state_action_values.shape)
    
#     # Compute Huber loss
#     criterion = nn.SmoothL1Loss()
#     state_action_values = state_action_values.gather(1, torch.LongTensor([[i] for i in ac_na]))
#     loss = criterion(state_action_values.float(), expected_state_action_values.unsqueeze(1).float())
    
    loss = torch.sum(sy_logprob_n.squeeze() * torch.FloatTensor(adv_n).squeeze())
    
#     print('sy_logprob_n', sy_logprob_n)
#     print('adv_n', adv_n)
    print('loss', loss)
    
    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()

    # ------------------------------------------------------------------
    # END OF YOUR CODE
    # ------------------------------------------------------------------

## Training Loop.

In [16]:
print('Running experiment with seed %d'%seed)

start = time.time()

total_timesteps = 0

return_data = []

for itr in range(n_iter):

    paths, timesteps_this_batch = sample_trajectories(itr, env)
    total_timesteps += timesteps_this_batch

    # Build arrays for observation, action for the policy gradient update by
    # concatenating across paths
    ob_no = np.concatenate([path["observation"] for path in paths])
    ac_na = np.concatenate([path["action"] for path in paths])

    re_n = [path["reward"] for path in paths]

    q_n, adv_n = estimate_return(ob_no, re_n)

    update_parameters(ob_no, ac_na, q_n, adv_n)

    # Log diagnostics
    returns = [path["reward"].sum() for path in paths]

    if itr%10 == 0:
        print("********** Iteration %i ************"%itr)
        ep_lengths = [len(path["reward"]) for path in paths]
        print("Time: ", time.time() - start)
        print("Iteration: ", itr)
        print("AverageReturn: ", np.mean(returns))
        print("StdReturn: ", np.std(returns))
        print("MaxReturn: ", np.max(returns))
        print("MinReturn", np.min(returns))
        print("EpLenMean: ", np.mean(ep_lengths))
        print("EpLenStd: ", np.std(ep_lengths))
        print("TimestepsThisBatch: ", timesteps_this_batch)
        print("TimestepsSoFar: ", total_timesteps)
    return_data.append(np.mean(returns))

Running experiment with seed 1


  if sys.path[0] == '':


sy_logprob_n: tensor([0.5002, 0.5120, 0.5005,  ..., 0.5087, 0.5031, 0.5084],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5002, 0.5120, 0.5005,  ..., 0.5087, 0.5031, 0.5084],
       grad_fn=<SumBackward1>)
adv_n [0.72720184 0.72720184 0.72720184 ... 0.82363949 0.82363949 0.82363949]
loss tensor(0.0630, grad_fn=<SumBackward0>)
********** Iteration 0 ************
Time:  0.5634307861328125
Iteration:  0
AverageReturn:  39.185184
StdReturn:  19.529724
MaxReturn:  80.0
MinReturn 20.0
EpLenMean:  39.18518518518518
EpLenStd:  19.529724803274426
TimestepsThisBatch:  1058
TimestepsSoFar:  1058
sy_logprob_n: tensor([0.5158, 0.5060, 0.5164,  ..., 0.5132, 0.5057, 0.5178],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5158, 0.5060, 0.5164,  ..., 0.5132, 0.5057, 0.5178],
       grad_fn=<SumBackward1>)
adv_n [0.36122219 0.36122219 0.36122219 ... 0.73723847 0.73723847 0.73723847]
loss tensor(-0.0399, grad_fn=<SumBackward0>)
sy_logprob_n: tensor([0.5228, 0.5049, 0.5221,  ..., 0.5204, 0

sy_logprob_n: tensor([0.5187, 0.5395, 0.5226,  ..., 0.5274, 0.5066, 0.5443],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5187, 0.5395, 0.5226,  ..., 0.5274, 0.5066, 0.5443],
       grad_fn=<SumBackward1>)
adv_n [1.21696744 1.21696744 1.21696744 ... 1.5355532  1.5355532  1.5355532 ]
loss tensor(0.1693, grad_fn=<SumBackward0>)
sy_logprob_n: tensor([0.5207, 0.5388, 0.5200,  ..., 0.5215, 0.5139, 0.5420],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5207, 0.5388, 0.5200,  ..., 0.5215, 0.5139, 0.5420],
       grad_fn=<SumBackward1>)
adv_n [-0.29218302 -0.29218302 -0.29218302 ...  0.50204603  0.50204603
  0.50204603]
loss tensor(-0.1646, grad_fn=<SumBackward0>)
sy_logprob_n: tensor([0.5058, 0.5509, 0.5106,  ..., 0.5329, 0.5267, 0.5328],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5058, 0.5509, 0.5106,  ..., 0.5329, 0.5267, 0.5328],
       grad_fn=<SumBackward1>)
adv_n [0. 0. 0. ... 0. 0. 0.]
loss tensor(0., grad_fn=<SumBackward0>)
sy_logprob_n: tensor([0.5063, 0.5

sy_logprob_n: tensor([0.5112, 0.5657, 0.5050,  ..., 0.5154, 0.5336, 0.5455],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5112, 0.5657, 0.5050,  ..., 0.5154, 0.5336, 0.5455],
       grad_fn=<SumBackward1>)
adv_n [-1.3857551  -1.3857551  -1.3857551  ... -1.66526152 -1.66526152
 -1.66526152]
loss tensor(0.0438, grad_fn=<SumBackward0>)
sy_logprob_n: tensor([0.5312, 0.5483, 0.5260,  ..., 0.5512, 0.5025, 0.5794],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5312, 0.5483, 0.5260,  ..., 0.5512, 0.5025, 0.5794],
       grad_fn=<SumBackward1>)
adv_n [-0.65799956 -0.65799956 -0.65799956 ... -0.11968271 -0.11968271
 -0.11968271]
loss tensor(0.0043, grad_fn=<SumBackward0>)
sy_logprob_n: tensor([0.5526, 0.5238, 0.5583,  ..., 0.5367, 0.5420, 0.5095],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5526, 0.5238, 0.5583,  ..., 0.5367, 0.5420, 0.5095],
       grad_fn=<SumBackward1>)
adv_n [0.62186828 0.62186828 0.62186828 ... 0.56713864 0.56713864 0.56713864]
loss tensor(0.0487,

sy_logprob_n: tensor([0.5115, 0.5981, 0.5141,  ..., 0.5475, 0.5379, 0.5940],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5115, 0.5981, 0.5141,  ..., 0.5475, 0.5379, 0.5940],
       grad_fn=<SumBackward1>)
adv_n [-0.24046963 -0.24046963 -0.24046963 ... -0.0932172  -0.0932172
 -0.0932172 ]
loss tensor(-0.5168, grad_fn=<SumBackward0>)
sy_logprob_n: tensor([0.5318, 0.5810, 0.5320,  ..., 0.5802, 0.5040, 0.6270],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5318, 0.5810, 0.5320,  ..., 0.5802, 0.5040, 0.6270],
       grad_fn=<SumBackward1>)
adv_n [ 0.268032   0.268032   0.268032  ... -0.0242904 -0.0242904 -0.0242904]
loss tensor(-0.0806, grad_fn=<SumBackward0>)
sy_logprob_n: tensor([0.5362, 0.5840, 0.5241,  ..., 0.6276, 0.5272, 0.5747],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5362, 0.5840, 0.5241,  ..., 0.6276, 0.5272, 0.5747],
       grad_fn=<SumBackward1>)
adv_n [-1.21376859 -1.21376859 -1.21376859 ... -0.46014749 -0.46014749
 -0.46014749]
loss tensor(-0.383

sy_logprob_n: tensor([0.5306, 0.6009, 0.5179,  ..., 0.5669, 0.5704, 0.5495],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5306, 0.6009, 0.5179,  ..., 0.5669, 0.5704, 0.5495],
       grad_fn=<SumBackward1>)
adv_n [0. 0. 0. ... 0. 0. 0.]
loss tensor(0., grad_fn=<SumBackward0>)
sy_logprob_n: tensor([0.5566, 0.5661, 0.5661,  ..., 0.5525, 0.5534, 0.5917],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5566, 0.5661, 0.5661,  ..., 0.5525, 0.5534, 0.5917],
       grad_fn=<SumBackward1>)
adv_n [0. 0. 0. ... 0. 0. 0.]
loss tensor(0., grad_fn=<SumBackward0>)
sy_logprob_n: tensor([0.5237, 0.6111, 0.5044,  ..., 0.5814, 0.5232, 0.6274],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5237, 0.6111, 0.5044,  ..., 0.5814, 0.5232, 0.6274],
       grad_fn=<SumBackward1>)
adv_n [0. 0. 0. ... 0. 0. 0.]
loss tensor(0., grad_fn=<SumBackward0>)
sy_logprob_n: tensor([0.5619, 0.5692, 0.5553,  ..., 0.5227, 0.5681, 0.5804],
       grad_fn=<SumBackward1>)
sy_logprob_n tensor([0.5619, 0.5692, 

## Plot Average-Return curve.





In [18]:
len(return_data)

101

In [19]:
return_data

[39.185184,
 49.38095,
 37.814816,
 34.517242,
 45.590908,
 116.111115,
 134.25,
 98.27273,
 162.71428,
 196.83333,
 199.0,
 193.83333,
 147.71428,
 147.28572,
 137.125,
 135.125,
 128.875,
 125.375,
 120.0,
 155.14285,
 149.71428,
 141.75,
 150.71428,
 160.57143,
 200.0,
 177.66667,
 149.42857,
 135.25,
 122.44444,
 124.666664,
 128.125,
 148.42857,
 168.5,
 185.66667,
 141.0,
 82.30769,
 74.5,
 81.76923,
 97.0,
 116.111115,
 126.875,
 170.5,
 181.66667,
 200.0,
 171.66667,
 152.57143,
 150.57143,
 143.42857,
 135.5,
 127.5,
 140.25,
 132.5,
 137.125,
 140.25,
 149.57143,
 149.28572,
 129.5,
 129.25,
 125.125,
 134.25,
 185.33333,
 184.83333,
 159.0,
 163.14285,
 194.33333,
 191.66667,
 198.66667,
 182.83333,
 164.28572,
 149.57143,
 141.75,
 133.5,
 121.111115,
 119.666664,
 121.22222,
 125.666664,
 134.75,
 139.25,
 137.25,
 199.5,
 188.0,
 200.0,
 200.0,
 200.0,
 200.0,
 200.0,
 200.0,
 200.0,
 200.0,
 200.0,
 200.0,
 200.0,
 200.0,
 200.0,
 200.0,
 200.0,
 200.0,
 200.0,
 200.0,
 

In [None]:
import matplotlib.pyplot as plt
plt.plot(return_data)
plt.xlabel("Iterations")
plt.ylabel("Average Return")