<a href="https://colab.research.google.com/github/tzs930/mlbootcamp-2022-rl-practice/blob/main/2_ActorCritic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Learning Practice 2 : REINFORCE, Actor-Critic

- In this assignment, we will implement two basic policy gradient methods, REINFORCE and (state-value) actor-critic.

In [None]:
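# The cells below use the pre-0.26 gym API (env.seed, 4-tuple env.step), so pin gym accordingly.
!pip install numpy torch matplotlib "gym<0.26"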

In [None]:
# import packages
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
import numpy as np    
import gym
import matplotlib.pyplot as plt

np.random.seed(123)
torch.manual_seed(123)

This assignment uses the CartPole domain, in which the agent must balance a pole attached to a movable cart. The agent has two discrete actions, which apply force to the cart (push left or push right). Each episode gives a reward of +1 for every step in which the pole has not fallen over, up to a maximum of 200 steps. See https://gym.openai.com/envs/CartPole-v0/ for more details.

Additionally, we'll save the size of the state and action spaces, and define hyperparameters such as the number of hidden units in our network. These parameters don't need to be changed, but you can try varying hyperparameters and see how learning is affected.
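
To get a feel for the reward scale, the sketch below computes the return of a full-length episode with and without discounting. It assumes the values stated in this notebook (+1 reward per step, a 200-step cap, and `gamma = 0.999` from the hyperparameter cell); the helper name `episode_return` is made up for illustration.

```python
def episode_return(num_steps, gamma=1.0, reward_per_step=1.0):
    """Return sum_{k=0}^{num_steps-1} gamma**k * reward_per_step."""
    total = 0.0
    for k in range(num_steps):
        total += (gamma ** k) * reward_per_step
    return total

max_steps = 200   # CartPole-v0 step cap quoted above
gamma = 0.999     # discount rate used in this notebook

undiscounted = episode_return(max_steps)       # best possible episode score
discounted = episode_return(max_steps, gamma)  # same episode, discounted from the first step
print(undiscounted, round(discounted, 2))
```

The training curves plotted later report the undiscounted episode score, so 200 is their ceiling; the discount only affects the returns used inside the loss.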

In [None]:
env = gym.make('CartPole-v0')
env.seed(123)

state_dim = env.observation_space.shape[0] # Dimension of state space
action_count = env.action_space.n          # Number of actions

# Hyperparameters
hidden_size = 128             # Number of hidden units
max_number_of_episodes = 500  # Number of training episodes
log_frequency = 20            # Frequency of logging
gamma = 0.999                 # Discount rate
policy_lr = 1e-2              # Learning rate of policy (actor) network
critic_lr = 1e-2              # Learning rate of critic network

## 1. Implementing the REINFORCE algorithm
- In this section, we will implement the REINFORCE algorithm.
- We use a softmax policy since the CartPole domain has a discrete action space.
- Recall that the REINFORCE policy gradient estimate is: 
$$\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t) R_t$$ where $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$ is the discounted return from step $t$.
- Since the optimizer minimizes its objective, the desired loss function for REINFORCE is the negative of the return-weighted log-likelihood: 
$$L_{\pi}(\theta) = -\sum_{t=1}^T \log \pi_\theta(a_t|s_t) R_t$$
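
The returns can be computed in a single backward pass over the rewards, because $R_t = r_t + \gamma R_{t+1}$ with $R_{T+1} = 0$. The sketch below is plain Python and assumes that convention (each return is discounted relative to its own time step, as in the TODO 1 comment in the training loop); the function name `discounted_returns` is hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return [R_1, ..., R_T] with R_t = r_t + gamma * R_{t+1} and R_{T+1} = 0."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running  # fold in the return of the following step
        returns.append(running)
    returns.reverse()                  # restore chronological order
    return returns

rewards = [1.0, 1.0, 1.0]
print(discounted_returns(rewards, gamma=0.5))  # [1.75, 1.5, 1.0]
```

Appending and reversing once gives the same list as calling `returns.insert(0, R)` at every step; for 200-step episodes the difference in cost is negligible, so either form works.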


In [None]:
class Policy(nn.Module):
    def __init__(self, state_dim, action_count, hidden_size):
        super(Policy, self).__init__()
        self.W1 = nn.Linear(state_dim, hidden_size)
        self.W2 = nn.Linear(hidden_size, action_count)

    def forward(self, state):
        out = self.W1(state)
        out = F.relu(out)
        out = self.W2(out)
        return F.softmax(out, dim=-1)

Now, we have two things to do here:
- TODO 1 : define **the list of discounted returns** for the loss function 

 (Hint: we need $[R_1, R_2, ..., R_T]$ where $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$, which satisfies $R_t = r_t + \gamma R_{t+1}$)
- TODO 2 : define the loss for the **REINFORCE** algorithm

 (Hint: the loss is the negated inner product of the log-probabilities and the discounted returns, i.e.  $-[\log \pi(a_1|s_1), \log \pi(a_2|s_2), ..., \log \pi(a_T|s_T)]^\top [R_1, R_2, ..., R_T]= -\sum_{t=1}^T \log \pi(a_t|s_t) R_t$ )
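
The hint can be checked numerically without any network. The sketch below uses made-up log-probabilities and returns, and puts a minus sign in front of the inner product on the assumption that the loss is handed to a minimizing optimizer such as Adam (minimizing the negative is the same as maximizing the return-weighted log-likelihood). `reinforce_loss` is an illustrative name, not part of the notebook.

```python
import math

def reinforce_loss(log_probs, returns):
    """Negative inner product of per-step log-probabilities and returns."""
    assert len(log_probs) == len(returns)
    return -sum(lp * R for lp, R in zip(log_probs, returns))

# Toy 3-step episode: probabilities the policy assigned to the actions it took.
chosen_action_probs = [0.5, 0.8, 0.25]
log_probs = [math.log(p) for p in chosen_action_probs]
returns = [1.75, 1.5, 1.0]  # e.g. rewards of 1 with gamma = 0.5

loss = reinforce_loss(log_probs, returns)
print(loss)
```

When every return is positive, as in CartPole, this loss is positive and shrinks as the chosen actions become more probable; steps with larger returns contribute more strongly.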

In [None]:
policy = Policy(state_dim, action_count, hidden_size)
policy_optimizer = optim.Adam(policy.parameters(), lr=policy_lr)

reward_sum = 0
episode_return_list = []
episode_length_list = []

num_episode_list = []
reinforce_score_avg = []
reinforce_score_std = []

states, rewards, actions, logprobs = [],[],[],[]
    
for episode_number in range(max_number_of_episodes):
    episode_return = 0
    episode_length = 0
    done = False
    observation = env.reset()
    t = 1
    while not done:
        state = np.reshape(observation, [1, state_dim]).astype(np.float32)
        states.append(state)

        # Run the policy network and get an action to take.
        state = torch.Tensor(state)
        probs = policy(state)[0]
        dist = Categorical(probs)
        action = dist.sample()
        logprob = dist.log_prob(action)
        action = action.detach().numpy()

        logprobs.append(logprob)
        
        # step the environment and get new measurements
        observation, reward, done, _ = env.step(action)
        reward_sum += float(reward)

        # Record reward (has to be done after we call step() to get reward for previous action)
        rewards.append(float(reward))
        
        episode_return += reward
        episode_length = t
        t += 1

    episode_return_list.append(episode_return)
    episode_length_list.append(episode_length)

    # Finish Episode
    # Compute the discounted reward backwards through time.
    R = 0
    returns = []
    ##########################################################################################
    # TODO 1:
    # - define a list of discounted sums
    # i.e. [r_1 + \gamma * r_2 + ... + \gamma^{T-1} * r_T, ... ,  r_{T-1} + \gamma * r_T, r_T]
    for r in rewards[::-1]:
      R = r + gamma * R   # R_t = r_t + gamma * R_{t+1}, accumulated backwards in time
      returns.insert(0, R)
    ##########################################################################################

    returns = torch.tensor(returns)
    
    policy_loss = 0
    for log_prob, R in zip(logprobs, returns):
      ########################################################################################
      # TODO 2:
      # - define REINFORCE loss
      policy_loss += -log_prob * R   # negated because the optimizer minimizes the loss
      ########################################################################################
    
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()
    
    states, rewards, actions, logprobs = [],[],[],[]

    if episode_number % log_frequency == 0:
        print('Episode: %d. Average reward for episode %f. Variance %f' % (episode_number, np.mean(episode_return_list), np.std(episode_return_list)**2 ))
        num_episode_list.append(episode_number)
        reinforce_score_avg.append(np.mean(episode_return_list))
        reinforce_score_std.append(np.std(episode_return_list))
        episode_return_list = []


- After training, plot the training curve using the code below.

In [None]:
num_episode_list = np.arange(0,max_number_of_episodes,log_frequency)
plt.plot(num_episode_list, reinforce_score_avg, label='REINFORCE')
plt.fill_between(x=num_episode_list,
                 y1=np.array(reinforce_score_avg)-np.array(reinforce_score_std),
                 y2=np.array(reinforce_score_avg)+np.array(reinforce_score_std),
                 alpha=0.3)
plt.legend()
plt.xlabel('The number of episodes')
plt.ylabel('Episode score')

## 2. Implementing the Actor-Critic algorithm
- In this section, we will implement the Actor-Critic algorithm, specifically the variant that uses the state-value function as a baseline to reduce variance.
- We will train $V_\phi(s)$ to approximate the expected discounted return of the remainder of the episode starting from $s$, i.e., 
$$
V_\phi(s) \approx V^\pi(s) = \mathbb{E}_\pi\left[\sum_{i=t}^T \gamma^{i-t} r_i | s_t=s \right]
$$
- Recall that 
 - the desired loss function for the critic is the L2 loss between the predicted value and the observed return, averaged over the $m$ steps of the episode:
 $L_{V}(\phi) = \frac{1}{m}\sum_{t=1}^m  (V_\phi(s_t) - R_t)^2$.

 - the desired loss function for the actor, negated because the optimizer minimizes it, is: 
 $L_\pi(\theta) = -\frac{1}{m}\sum_{t=1}^m \log \pi_\theta(a_t|s_t) (R_t - V_\phi(s_t))$.
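
For a concrete picture of the two losses, the sketch below evaluates them on a toy three-step episode in plain Python. The value predictions are invented numbers standing in for $V_\phi(s_t)$, the advantage $R_t - V_\phi(s_t)$ is treated as a constant weight in the actor loss, and the actor loss carries a minus sign on the assumption that it is minimized by a gradient-descent optimizer. The function name `actor_critic_losses` is hypothetical.

```python
import math

def actor_critic_losses(log_probs, values, returns):
    """Mean actor and critic losses for one episode.

    actor:  -(1/m) * sum_t log_prob_t * (R_t - V_t)
    critic:  (1/m) * sum_t (V_t - R_t) ** 2
    """
    m = len(returns)
    advantages = [R - v for R, v in zip(returns, values)]
    actor_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / m
    critic_loss = sum((v - R) ** 2 for v, R in zip(values, returns)) / m
    return actor_loss, critic_loss, advantages

log_probs = [math.log(0.5), math.log(0.8), math.log(0.25)]
returns = [1.75, 1.5, 1.0]
values = [1.5, 1.5, 1.5]   # hypothetical critic predictions

actor_loss, critic_loss, advantages = actor_critic_losses(log_probs, values, returns)
print(advantages)          # [0.25, 0.0, -0.5]
print(actor_loss, critic_loss)
```

The advantages sit closer to zero than the raw returns, which is the source of the variance reduction: an action is reinforced only when its return exceeds the critic's prediction and is discouraged when it falls short.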
 


In [None]:
class Critic(nn.Module):
    def __init__(self, state_dim, hidden_size):
        super(Critic, self).__init__()
        self.W1 = nn.Linear(state_dim, hidden_size)
        self.W2 = nn.Linear(hidden_size, 1)

    def forward(self, state) :
        out = self.W1(state)
        out = F.relu(out)
        out = self.W2(out)
        
        return out

We have two things to do here:
- TODO 1: define the list of discounted returns (reuse the result of TODO 1)
- TODO 3: define **the critic loss** and **the actor loss**.

In [None]:
actor = Policy(state_dim, action_count, hidden_size)
actor_optimizer = optim.Adam(actor.parameters(), lr=policy_lr)
critic = Critic(state_dim, hidden_size)
critic_optimizer = optim.Adam(critic.parameters(), lr=critic_lr)

reward_sum = 0

max_number_of_episodes = 500
episode_return_list = []
episode_length_list = []

num_episode_list = []
ac_score_avg = []
ac_score_std = []

states, rewards, actions, logprobs, state_values = [],[],[],[],[]
    
for episode_number in range(max_number_of_episodes):
    episode_return = 0
    episode_length = 0
    done = False
    observation = env.reset()
    t = 1
    while not done:
        state = np.reshape(observation, [1, state_dim]).astype(np.float32)
        states.append(state)

        # Run the policy network and get an action to take.
        state = torch.Tensor(state)
        probs = actor(state)[0]
        dist = Categorical(probs)
        action = dist.sample()
        logprob = dist.log_prob(action)
        action = action.detach().numpy()
        state_value = critic(state)

        logprobs.append(logprob)
        state_values.append(state_value)
        
        # step the environment and get new measurements
        observation, reward, done, _ = env.step(action)
        reward_sum += float(reward)

        # Record reward (has to be done after we call step() to get reward for previous action)
        rewards.append(float(reward))
        
        episode_return += reward
        episode_length = t
        t += 1

    episode_return_list.append(episode_return)
    episode_length_list.append(episode_length)

    # Finish Episode
    # Compute the discounted reward backwards through time.
    R = 0
    returns = []
    ##########################################################################################
    # TODO 1: Reuse result of TODO 1 for discounted return
    for r in rewards[::-1]:
      R = r + gamma * R   # R_t = r_t + gamma * R_{t+1}, accumulated backwards in time
      returns.insert(0, R)
    ##########################################################################################
    
    returns = torch.tensor(returns)
    actor_loss = 0
    critic_loss = 0

    for log_prob, state_value, R in zip(logprobs, state_values, returns):        
        ##########################################################################################
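        # TODO 3:
        # - define actor_loss and critic_loss here
        value = state_value.squeeze()                        # critic output has shape [1, 1]
        advantage = R - value.detach()                       # detached: a constant weight for the actor
        actor_loss += -log_prob * advantage / len(returns)   # negated because the optimizer minimizes
        critic_loss += (value - R) ** 2 / len(returns)       # squared error against the observed return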
        ##########################################################################################

    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()
    
    states, rewards, actions, logprobs, state_values = [],[],[],[],[]

    if episode_number % log_frequency == 0:
        print('Episode: %d. Average reward for episode %f. Variance %f' % (episode_number, np.mean(episode_return_list), np.std(episode_return_list)**2 ))
        ac_score_avg.append(np.mean(episode_return_list))
        ac_score_std.append(np.std(episode_return_list))
        episode_return_list = []

- After training, plot and compare the training curves of the REINFORCE and Actor-Critic algorithms using the code below.

In [None]:
num_episode_list = np.arange(0,max_number_of_episodes,log_frequency)
plt.plot(num_episode_list, reinforce_score_avg, label='REINFORCE')
plt.fill_between(x=num_episode_list,
                 y1=np.array(reinforce_score_avg)-np.array(reinforce_score_std),
                 y2=np.array(reinforce_score_avg)+np.array(reinforce_score_std),
                 alpha=0.3)
plt.plot(num_episode_list, ac_score_avg, label='Actor-Critic')
plt.fill_between(x=num_episode_list,
                 y1=np.array(ac_score_avg)-np.array(ac_score_std),
                 y2=np.array(ac_score_avg)+np.array(ac_score_std),
                 alpha=0.3)
plt.legend()
plt.xlabel('The number of episodes')
plt.ylabel('Episode score')