# CE-40719: Deep Learning
## HW6 - Deep Reinforcement Learning
(20 points)

#### Name: Sadroddin Barikbin
#### Student No.: 98208824

In this assignment we train a simple actor-critic model to solve classical control problems. We use a batched version of the standard [gym](https://gym.openai.com/) library, provided in `multi_env.py`. The only difference between the two is that `multi_env.py` runs a batch of environments instead of a single one, so observations have shape `(batch_size, observation_size)`. We focus on the `CartPole-v1` problem, but the same approach applies to other problems as well.

## Algorithm

The vanilla actor-critic algorithm is as follows:

1.   Sample a batch $\{(s_i, a_i, r_i, s_{i + 1})\}_i$ under policy $\pi_\theta$.
2.   Fit $V_\phi^{\pi_\theta}(s_i)$ to $r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})$ by minimizing squared error $\|r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})- V_\phi^{\pi_\theta}(s_i)\|^2$.
3.   Improve the policy: $\max_{\theta}~ \sum_{i} \log \pi_\theta(a_i|s_i) \left[ r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})- V^{\pi_\theta}_\phi(s_i) \right]$
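
As a concrete illustration of steps 2 and 3, the following minimal sketch computes the one-step target and the bracketed advantage for a hand-made batch. The numbers and the helper name `td_targets` are made up for illustration, and the `dones` mask is an assumption not stated in the algorithm above: it drops the bootstrap term when a transition ends an episode.

```python
def td_targets(rewards, next_values, dones, gamma=0.99):
    """One-step targets r_i + gamma * V(s_{i+1}); no bootstrap after a terminal step."""
    return [r + gamma * v_next * (0.0 if d else 1.0)
            for r, v_next, d in zip(rewards, next_values, dones)]

# Hypothetical batch of three transitions.
rewards = [1.0, 1.0, 1.0]
values = [2.0, 1.5, 0.5]        # V(s_i)
next_values = [1.5, 0.5, 3.0]   # V(s_{i+1})
dones = [False, False, True]    # the last transition ends its episode

targets = td_targets(rewards, next_values, dones)
# The advantage is the bracketed term in step 3: target minus current value.
advantages = [t - v for t, v in zip(targets, values)]
# Step 2 minimizes the summed squared error between V(s_i) and the target.
value_loss = sum(a ** 2 for a in advantages)
```

A positive advantage raises the log-probability of the action that was taken, and a negative one lowers it.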

We need two parametrized models, one for the value function $V^{\pi_\theta}_\phi$ and one for the stochastic policy $\pi_\theta$. Since both $\pi_\theta$ and $V^{\pi_\theta}_\phi$ are functions of the state $s$, instead of modeling each with a separate neural network, we can model both with a single network with shared parameters. In other words, we train a single network that outputs both $\pi_\theta(a|s)$ and $V^{\pi_\theta}_\phi(s)$. To train this network we combine steps 2 and 3 of the main algorithm and optimize the following objective:
$$\min_{\theta, \phi}~ \sum_{i} \left( -\log \pi_\theta(a_i|s_i) \left[ r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})- V^{\pi_\theta}_\phi(s_i) \right] + \|r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})- V_\phi^{\pi_\theta}(s_i)\|^2 \right)$$

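Note that the gradient must be backpropagated only through $\log \pi_\theta(a_i|s_i)$ in the first term and through $V_\phi^{\pi_\theta}(s_i)$ in the squared error; the bracketed advantage and the target $r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})$ are treated as constants. A negative entropy term $-\mathcal{H} (\pi_\theta(a_i|s_i))$ can also be added to the above objective to encourage exploration.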
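
One way to restrict the gradient flow as described is `detach()` in PyTorch. The sketch below is only an illustration on random tensors: the tensor names, the batch of three transitions and the weight `entropy_coef` are hypothetical and do not come from the assignment.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical network outputs for a batch of 3 transitions and 2 actions.
logits = torch.randn(3, 2, requires_grad=True)
value = torch.randn(3, requires_grad=True)       # V(s_i)
next_value = torch.randn(3, requires_grad=True)  # V(s_{i+1})
reward = torch.ones(3)
action = torch.tensor([0, 1, 0])
gamma, entropy_coef = 0.99, 0.01                 # entropy_coef is an assumed weight

log_probs = F.log_softmax(logits, dim=-1)
chosen_log_prob = log_probs.gather(1, action.unsqueeze(1)).squeeze(1)

# detach() stops the gradient: the target and the advantage act as constants.
target = (reward + gamma * next_value).detach()
advantage = target - value.detach()

policy_loss = -(chosen_log_prob * advantage).sum()
value_loss = (target - value).pow(2).sum()
entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()

# Subtracting the entropy is the same as adding the negative entropy term.
loss = policy_loss + value_loss - entropy_coef * entropy
loss.backward()
```

After `backward()`, `next_value.grad` is still `None`, while `value` receives gradient only from the squared error and `logits` only from the policy and entropy terms.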

## Setup

In [0]:
import gym
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.distributions as dist

from multi_env import SubprocVecEnv

In [0]:
env_name = 'CartPole-v1'
num_envs = 16

def make_env():
    def _thunk():
        env = gym.make(env_name)
        return env

    return _thunk

envs = [make_env() for i in range(num_envs)]
envs = SubprocVecEnv(envs)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 1. Model (8 Points)

To define a stochastic policy we use the [`torch.distributions`](https://pytorch.org/docs/stable/distributions.html) module. The network's shared parameters are defined in a simple MLP. The network has two heads: one for $V$ that takes in the MLP's output and outputs a scalar, and one for $\pi$ that takes in the MLP's output and outputs a categorical distribution over the actions.
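
For readers unfamiliar with `torch.distributions`, here is a minimal sketch of how a categorical policy can be sampled and scored. The probability table is hypothetical; in the model below it would come from the policy head.

```python
import torch
import torch.distributions as dist

torch.manual_seed(0)

# Hypothetical policy-head output: action probabilities for 3 states, 2 actions.
probs = torch.tensor([[0.9, 0.1],
                      [0.5, 0.5],
                      [0.2, 0.8]])

policy = dist.Categorical(probs=probs)
action = policy.sample()            # one action index per state, shape (3,)
log_prob = policy.log_prob(action)  # log pi(a_i | s_i), shape (3,)
entropy = policy.entropy()          # per-state entropy, shape (3,)
```

`log_prob` is the quantity that appears in the policy term of the objective, and `entropy` is the quantity used by the optional exploration bonus.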

In [0]:
class ActorCritic(nn.Module):
    def __init__(self, state_size, hidden_size, num_actions):
        super(ActorCritic, self).__init__()
        #################################################################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
        # state_size: size of the input state
        # hidden_size: a list containing size of each mlp hidden layer in order
        # num_actions: number of actions
        # do not use batch norm for any layer in this network
        #################################################################################
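        # Shared MLP trunk: one linear layer per entry of hidden_size.
        layers = []
        in_features = state_size
        for hidden in hidden_size:
            layers.append(nn.Linear(in_features, hidden))
            in_features = hidden
        self.fcs = nn.ModuleList(layers)
        # Two heads on top of the trunk: state value and action scores.
        self.fcv = nn.Linear(in_features, 1)
        self.fca = nn.Linear(in_features, num_actions)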
        #################################################################################
        #                                   THE END                                     #
        #################################################################################


    def forward(self, x):
        #################################################################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
        for linear in self.fcs:
            x = F.relu(linear(x))
        policy = F.softmax(self.fca(x), dim=-1)  # action probabilities
        value = self.fcv(x)                      # state value, last dimension 1
        #################################################################################
        #                                   THE END                                     #
        #################################################################################
        return policy, value

In [0]:
def test_model(model):
    env = gym.make(env_name)
    total_reward = 0
    #################################################################################
    #                          COMPLETE THE FOLLOWING SECTION                       #
    #################################################################################
    # run given model for a single episode and compute total reward.
    #################################################################################
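    done = False
    # The state is the concatenation of the last num_state_obs observations;
    # at the start of the episode the first observation is repeated.
    obs = [torch.FloatTensor(env.reset())] * num_state_obs
    while not done:
        state = torch.cat(obs, dim=-1).to(device)
        with torch.no_grad():  # no gradients are needed for evaluation
            probs, _ = model(state)
        action = dist.Categorical(probs).sample().item()
        ob, reward, done, _ = env.step(action)
        obs = obs[1:] + [torch.FloatTensor(ob)]
        total_reward += reward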
    #################################################################################
    #                                   THE END                                     #
    #################################################################################
    return total_reward

## 2. Objective and Training (12 Points)

A single observation is not always enough to determine the state of an environment, so we use the previous `num_state_obs` observations at time t as the state of the environment at time t. Initialize the model and train it with the Adam optimizer. You should be able to reach an average reward of 500 in fewer than 20000 iterations.
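
The idea of stacking observations can be sketched with a fixed-length queue. This is only an illustration with two-dimensional toy observations; the helper names `make_history` and `to_state` are hypothetical and are not part of the assignment code, which keeps a plain Python list instead.

```python
from collections import deque

def make_history(first_obs, num_state_obs):
    """Start an episode by repeating the first observation num_state_obs times."""
    return deque([first_obs] * num_state_obs, maxlen=num_state_obs)

def to_state(history):
    """Concatenate the stored observations, oldest first, into one flat state."""
    return [x for obs in history for x in obs]

history = make_history([0.0, 0.1], num_state_obs=3)
history.append([1.0, 1.1])   # the oldest observation is dropped automatically
history.append([2.0, 2.1])
state = to_state(history)
```

With `num_state_obs` observations of size `obs_size` each, the resulting state has `num_state_obs * obs_size` entries, which is the input size the network must accept.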

In [0]:
num_state_obs = 10
gamma = 0.99
#################################################################################
#                          COMPLETE THE FOLLOWING SECTION                       #
#################################################################################
# experiment with different parameters and models to get the best result
#################################################################################
num_iterations = 20000

obs_size = 4
state_size = num_state_obs*obs_size
num_actions = 2

model = ActorCritic(state_size, [30, 20], num_actions)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
#################################################################################
#                                   THE END                                     #
#################################################################################

In [48]:
obs = [torch.FloatTensor(envs.reset())] * num_state_obs
for t in range(num_iterations):
    model.train()
    #################################################################################
    #                          COMPLETE THE FOLLOWING SECTION                       #
    #################################################################################
    # implement the algorithm
    #################################################################################
    # Act in all environments with the current policy.
    state = torch.cat(obs, dim=-1).to(device)
    probs, v = model(state)
    action = dist.Categorical(probs).sample()
    ob, reward, done, _ = envs.step(action.cpu().numpy())
    obs = obs[1:] + [torch.FloatTensor(ob)]

    # Bootstrapped target r + gamma * V(s'); no gradient flows through V(s'),
    # and the bootstrap term is dropped for transitions that ended an episode.
    next_state = torch.cat(obs, dim=-1).to(device)
    with torch.no_grad():
        _, next_v = model(next_state)
    reward = torch.tensor(reward, device=device, dtype=torch.float32).unsqueeze(dim=-1)
    not_done = (~torch.tensor(done, device=device)).float().unsqueeze(dim=-1)
    target = reward + gamma * next_v * not_done
    advantage = (target - v.detach()).squeeze(dim=-1)

    # Value loss + policy-gradient loss + negative entropy (exploration bonus).
    log_prob = torch.log(probs[torch.arange(action.size(0), device=device), action])
    loss = F.mse_loss(v, target, reduction='sum') - log_prob @ advantage
    loss += torch.mean(torch.sum(probs * torch.log(probs), dim=-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()

    #################################################################################
    #                                   THE END                                     #
    #################################################################################
    if t % 1000 == 999:
        print('iteration {:5d}: average reward = {:5f}'.format(t + 1, np.mean([test_model(model) for _ in range(10)])))

iteration  1000: average reward = 27.800000
iteration  2000: average reward = 14.300000
iteration  3000: average reward = 18.900000
iteration  4000: average reward = 191.700000
iteration  5000: average reward = 331.600000
iteration  6000: average reward = 492.200000
iteration  7000: average reward = 500.000000
iteration  8000: average reward = 32.600000
iteration  9000: average reward = 10.500000
iteration 10000: average reward = 500.000000
iteration 11000: average reward = 500.000000
iteration 12000: average reward = 500.000000
iteration 13000: average reward = 500.000000
iteration 14000: average reward = 500.000000
iteration 15000: average reward = 314.400000
iteration 16000: average reward = 221.000000
iteration 17000: average reward = 136.800000
iteration 18000: average reward = 500.000000
iteration 19000: average reward = 500.000000
iteration 20000: average reward = 124.800000
