# Declaration 


# Objective 
In this project I train an agent (a double-jointed arm) in simulation to move its hand to a target location. The agent's goal is to keep its hand at the target location for as many time steps as possible.

## Environment
* **Set-up**: Double-jointed arm which can move to target locations.
* **Goal**: The agent must move its hand to the goal location and keep it there.
* **Agents**: The environment contains 20 agents linked to a single brain.
* **Agent Reward Function** (independent):
  * +0.1 for each step the agent's hand is in the goal location.
* **Brains**: One brain with the following observation/action space.
* **Vector Observation space**: 33 variables corresponding to position, rotation, velocity, and angular velocities of the two arm Rigidbodies.
* **Vector Action space** (Continuous): size of 4, corresponding to the torque applicable to the two joints.
* **Benchmark Mean Reward**: 30

## Benchmark
The environment is considered solved when the average score over 100 episodes is at least +30. To calculate the score for an episode, we add up the rewards that each agent received (without discounting) to get a score for each agent. This yields 20 (potentially different) scores, and we then take the average of these 20 scores.
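
The scoring rule above can be sketched in a few lines. This is only an illustration, assuming the per-agent rewards of one episode are available as a list of lists; the helper names `episode_score` and `is_solved` are hypothetical and do not appear in the project code.

```python
from collections import deque
from statistics import mean


def episode_score(rewards_per_agent):
    """Sum each agent's undiscounted rewards, then average over the agents."""
    agent_scores = [sum(rewards) for rewards in rewards_per_agent]
    return mean(agent_scores)


def is_solved(episode_scores, window=100, target=30.0):
    """True once the mean of the last `window` episode scores reaches `target`."""
    recent = deque(episode_scores, maxlen=window)  # keeps only the newest scores
    return len(recent) == window and mean(recent) >= target


# Toy episode with 2 agents and 3 time steps (the real environment has 20 agents).
toy_score = episode_score([[0.1, 0.0, 0.1], [0.1, 0.1, 0.1]])
```

Requiring a full window of 100 episodes before reporting "solved" is a choice made for this sketch; the training loop further down simply prints the running mean.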

<img src = a.png>

### Algorithm used for training the Agent -- DDPG
* To train the agent I used [DDPG](https://arxiv.org/pdf/1509.02971.pdf).
* DDPG is classified as an Actor-Critic method.
* It chooses its action deterministically.
* DDPG can be thought of as a DQN-style method for continuous action spaces.
* In **DDPG** we want the network to output its best-believed action every time we query it (that is, a deterministic policy).
    * By adding __noise__ to the action, we can force the agent to explore more.

<img src= b.png>

#### Actor
* The actor is used to approximate the **optimal policy deterministically**.
* The actor is basically learning to output `argmax_a Q(s,a)`, which is the best action.
* This is similar to the policy gradient approach, where we directly estimate the policy with a neural network.
* One important difference: a policy gradient method collects trajectories, computes the cumulative future reward, and uses this noisy (high-variance) Monte Carlo estimate to calculate the gradient and optimize the network.
* In DDPG we instead have a critic that estimates the optimal action-value function; with its help we compute the gradient for the actor and optimize its weights.

#### Critic
* Learns to evaluate the optimal action-value function by using the **actor's** best-believed action.


## Pipeline for DDPG
* We have four networks:
    1. `actor_local`
    2. `actor_target`
    3. `critic_local`
    4. `critic_target`
* The first step is to collect experience tuples, which consist of `(state, action, reward, next_state, done)`.
* We randomly initialize (Xavier-style initialization) our networks. One key component that I used and found very helpful in this problem was initializing the local and target networks with the same set of random weights.
* We get a state from the environment and pass it through `actor_local` to get an action; by taking this action we get `next_state` and `reward`. Through this process we collect experience tuples and push them into the replay buffer.
* Once the replay buffer holds more than `batch_size` experiences, we randomly sample from it to break sequential correlation.
* We have four networks because training a neural network needs targets, so that we can compute a loss and its gradient. In RL the target itself is a function of the weights, and we want to break this dependence. That is why we keep target networks that are separate from the local networks. Over time we take a weighted sum of the local and target network weights and assign it to the target network; this is known as a **soft update**.
    * The soft update consists of slowly blending our regular network weights with our target weights.
        * Every step, mix a small fraction of the regular network weights into the target network weights (0.01 in this description; the code below sets `TAU = 1e-3`).
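
To make the soft update concrete, here is a minimal sketch on plain Python lists instead of PyTorch parameters. The function mirrors the rule `θ_target = τ*θ_local + (1 - τ)*θ_target`; the weight values and the choice `tau=0.01` are illustrative assumptions, not values taken from a trained model.

```python
def soft_update(local_weights, target_weights, tau):
    """Return tau * local + (1 - tau) * target, element by element."""
    return [tau * w_local + (1.0 - tau) * w_target
            for w_local, w_target in zip(local_weights, target_weights)]


local = [1.0, -2.0]      # illustrative "local" network weights
target = [0.0, 0.0]      # illustrative "target" network weights

# A single update moves the target only 1% of the way towards the local weights.
one_step = soft_update(local, target, tau=0.01)

# Repeating the blend (with the local weights held fixed) lets the target
# drift slowly towards the local weights.
for _ in range(1000):
    target = soft_update(local, target, tau=0.01)
```

Because each step moves the target only slightly, the TD targets computed from the target networks change slowly, which is the point of keeping them separate from the local networks.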


### Update Critic
* In the DDPG paper the critic estimates the **Q-value**; the critic network takes both the state vector and the action vector as input.
* We want the critic network to output the optimal **action value**, and from the **temporal difference algorithm** we know that `Q(s,a) = r + γ * max_a Q(next_state,a)`.
* So in the critic network:
    * Predicted value: `Q_expected = critic_local(state,action)`
    * Target value: `r + γ * critic_target(next_state,actor_target(next_state))`
* Training the critic this way pushes the actor network to output the optimal action, in other words the action that maximizes the action value.
* One important thing to notice is that we accept a somewhat biased critic in exchange for low **variance**; that is the reason we use the temporal difference algorithm to compute our target.
* Because the actor on its own has high variance, we train it with the critic's output; the main idea is to reduce the variance in the actor's updates so that the network can be trained.
* More advanced algorithms such as A3C and PPO also aim to reduce the variance problem in RL agents.
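
The critic's target and loss can be worked through by hand on a tiny batch. The sketch below uses plain floats rather than tensors; the numbers stand in for the outputs of `critic_target` and `critic_local` and are made up for illustration. The `(1 - done)` mask follows the `learn` method shown later, where no bootstrapped value is added once an episode has ended.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """One-step TD target: r + gamma * Q_target(s', a'), masked when the episode ended."""
    return reward + gamma * next_q * (1.0 - done)


def mse(predictions, targets):
    """Mean squared error between predicted and target Q-values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# (reward, Q_target(next_state, actor_target(next_state)), done) for two transitions.
batch = [(0.1, 2.0, 0.0),   # episode continues: target = 0.1 + 0.99 * 2.0
         (0.0, 5.0, 1.0)]   # episode ended: target is just the reward
q_targets = [td_target(r, q_next, d) for r, q_next, d in batch]

q_expected = [2.0, 0.5]     # stand-in for critic_local(state, action)
critic_loss = mse(q_expected, q_targets)
```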


### Update Actor
* The actor network takes the state as input and outputs the action vector (it estimates the policy).
* The main idea behind actor training is that we want the actor network to output the action that maximizes the action-value function estimated by the critic.
* So in the actor network:
    * Loss: `-critic_local(state,actor_local(state))`; the gradient is calculated with respect to this loss.
    * The minus sign is there because we want to do gradient ascent on the Q-value.
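
Written out, the actor loss and its gradient take the following form. The notation is introduced here for illustration and is not used elsewhere in this README: $\mu_\theta$ stands for `actor_local`, $Q_\phi$ for `critic_local`, and $N$ for the minibatch size. The second expression is the chain rule applied to the first, which matches the deterministic policy gradient used in the DDPG paper.

```latex
L_{\text{actor}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} Q_\phi\big(s_i,\ \mu_\theta(s_i)\big),
\qquad
\nabla_\theta L_{\text{actor}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \nabla_a Q_\phi(s_i, a)\Big|_{a=\mu_\theta(s_i)}\,
    \nabla_\theta \mu_\theta(s_i)
```

Only the actor's parameters $\theta$ are updated with this gradient; the critic serves as a differentiable estimate of how good the chosen action is.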

## Python Implementation
#### model.py
```python
import numpy as np

import torch 
import torch.nn as nn
import torch.nn.functional as F

## code is inspired from https://github.com/udacity/deep-reinforcement-learning/tree/master/ddpg-pendulum


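def hidden_init(layer):
    """
        Uniform initialization range (-1/sqrt(n), 1/sqrt(n)) for a hidden layer.
        Note: n is read from weight.size()[0], which for nn.Linear is out_features.
    """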
    fan_in = layer.weight.data.size()[0]
    lim = 1./np.sqrt(fan_in)
    return (-lim, lim)

class Actor(nn.Module):
    """Actor (Policy) Model."""
    
    def __init__(self, state_size, action_size, random_seed, fc1_units =200, fc2_units=150):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            random_seed (int): Random seed
            fc1_units (int): Unused; the first hidden layer is disabled in this version
            fc2_units (int): Number of nodes in the hidden layer
        """
        super().__init__() ## initialize the nn.Module class
        self.seed = torch.manual_seed(random_seed)
        self.fc2 = nn.Linear(state_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units,action_size)
        self.reset_parameters()
        
    def reset_parameters(self):
        #self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3,3e-3)
        
    def forward(self, state):
        """Build an actor (policy) network that maps state -> actions"""
        x = F.leaky_relu(self.fc2(state))
        return torch.tanh(self.fc3(x))  # torch.tanh replaces the deprecated F.tanh
    

class Critic(nn.Module):
    """ Critic (value) model."""
    
    def __init__(self, state_size, action_size, random_seed, fcs1_units=200, fc2_units=150):
        """Initialize parameters and build model.
        Params
        ======
            state_size(int): Dimension of each state
            action_size(int): Dimension of each action
            random_seed (int): Random seed
            fcs1_units (int): Number of nodes in the first hidden layer
            fc2_units (int): Number of nodes in the second hidden layer
        """
        super().__init__() ##initialize the nn.Module class
        self.seed = torch.manual_seed(random_seed)
        self.fcs1 = nn.Linear(state_size,fcs1_units)
        self.fc2 = nn.Linear(fcs1_units+action_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units,1)
        self.reset_parameters()
        
    def reset_parameters(self):
        self.fcs1.weight.data.uniform_(*hidden_init(self.fcs1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3,3e-3)
        
    def forward(self,state,action):
        """Build a critic (value) network that maps (state, action) pairs -> Q-values."""
        xs = F.leaky_relu(self.fcs1(state))
        x = torch.cat((xs,action), dim=1)
        x = F.leaky_relu(self.fc2(x))
        return self.fc3(x)


```
#### ddpg_agent.py
```python
import numpy as np
import random
import copy
from collections import namedtuple, deque

from model import Actor, Critic

import torch 
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

## code is inspired from https://github.com/udacity/deep-reinforcement-learning/tree/master/ddpg-pendulum

##========== HYPERPARAMETER ============##
BUFFER_SIZE = int(1e5)    # replay buffer size
BATCH_SIZE = 128          # minibatch size
GAMMA = 0.99              # discount factor
TAU = 1e-3                # soft update of target parameters
LR_ACTOR = 1e-3           # learning rate for actor
LR_CRITIC = 1e-3          # learning rate for critic
WEIGHT_DECAY = 0.         # L2 weight decay

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class Agent():
    """Interacts with and learns from the environment"""
    
    
    def __init__(self, state_size, action_size, random_seed):
        """Initialize an Agent object
        Params
        ======
            state_size (int): dimension of each state
            action_size (int): dimension of each action
            random_seed (int): random seed
        """
        self.state_size = state_size 
        self.action_size = action_size
        self.seed = random.seed(random_seed)
        
        # Actor Network (w/ Target Network)
        self.actor_local = Actor(state_size, action_size, random_seed).to(device)
        self.actor_target = Actor(state_size, action_size, random_seed).to(device)
        self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr = LR_ACTOR)
        
        
        # Critic Network (w/ Target Network)
        self.critic_local = Critic(state_size,action_size,random_seed).to(device)
        self.critic_target = Critic(state_size,action_size,random_seed).to(device)
        self.critic_optimizer = optim.Adam(self.critic_local.parameters(), lr = LR_CRITIC, weight_decay = WEIGHT_DECAY)
        
        
        # Noise process
        self.noise = OUNoise(action_size,random_seed)
        
        # Replay Buffer
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, random_seed)
        
        self.counter = 0
        
        # Make sure the target networks start with the same weights as the local networks (tip found on Slack)
        self.hard_update(self.actor_target, self.actor_local)
        self.hard_update(self.critic_target, self.critic_local)
        
    def step(self, state, action, reward, next_state, done):
        """Save experience in replay memory, and use random sample from buffer to learn."""
        # Save experience / reward 
        for state,action,reward,next_state,done in zip(state, action, reward, next_state, done):
            self.memory.add(state, action, reward, next_state, done)
            self.counter+=1
        
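        # Learn if enough samples are available in memory. `counter` advances once per
        # stored experience, so with 20 agents the `% 10` check passes on every call.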
        if len(self.memory) > BATCH_SIZE and self.counter%10==0: 
            experience = self.memory.sample()
            self.learn(experience, GAMMA)
            
    def act(self, state, add_noise=True):
        """Return actions for given state as per current policy."""
        # Convert the state to a tensor and query the actor in evaluation mode
        state = torch.from_numpy(state).float().to(device)
        self.actor_local.eval()
        with torch.no_grad():
            action = self.actor_local(state).cpu().data.numpy()
        self.actor_local.train()
        if add_noise:
            action += self.noise.sample()
        return np.clip(action, -1, 1)
    
    def reset(self):
        self.noise.reset()
        
    def learn(self, experience, gamma):
        """Update policy and value parameters using given batch of experience tuples
        
        Q_targets = r + γ * critic_target(next_state, actor_target(next_state))
        
        where:
            actor_target(state) -> action
            critic_target(state,action) -> Q-value
            
        Params
        ======
            experience (Tuple[torch.Tensor]): batched tensors (s, a, r, s', done)
            gamma (float): discount factor
        """
        state, action, reward, next_state, done = experience
        
        # ============================== Update Critic =================================#
        # Get predicted next-state actions and Q values from target models
        
        self.actor_target.eval() ## inference mode for the target networks (eval() does not disable gradient tracking)
        self.critic_target.eval()
        
        actions_next = self.actor_target(next_state)
        Q_target_next = self.critic_target(next_state,actions_next)
        
        # Compute Q targets for current states (y_i)
        Q_targets = reward + (gamma*Q_target_next*(1-done))
        
        ## Compute Critic Loss
        Q_expected = self.critic_local(state,action)
        critic_loss = F.mse_loss(Q_expected, Q_targets)
        
        ## Minimize the loss
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
        
        
        
        # ============================== Update Actor =================================#
        ## Compute actor loss
        action_pred  = self.actor_local(state)
        actor_loss = -(self.critic_local(state,action_pred).mean())
        ## The reason we can calculate the loss this way, without collecting
        ## trajectories (noisy Monte Carlo estimate; cum_reward/reward_future),
        ## is that the action space is continuous and differentiable, so we can
        ## take the gradient of the Q-value estimated by the CRITIC w.r.t. the action.
        # Minimize loss
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()
        
        
        # ========================== Update target network =================================#

        self.soft_update(self.critic_local,self.critic_target,TAU)
        self.soft_update(self.actor_local,self.actor_target,TAU)
        
    def soft_update(self, local_model, target_model, tau):
        """Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target
        Params
        ======
            local_model: PyTorch model (weights will be copied from)
            target_model: PyTorch model (weights will be copied to)
            tau (float): interpolation parameter 
        """
        for target_param,local_param in zip(target_model.parameters(),
                                           local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1-tau)*target_param.data)

    
    def hard_update(self, target, source):
        for target_param, param in zip(target.parameters(), source.parameters()):
            target_param.data.copy_(param.data)
    
    
class OUNoise:
    """Ornstein-Uhlenbeck process."""
    
    def __init__(self, size, seed, mu=0.01, theta=0.15, sigma=0.2):
        """Initialize parameters and noise process."""
        self.mu = mu*np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.seed = random.seed(seed)
        self.reset()
        
    def reset(self):
        """Reset the internal state (= noise) to mean (mu)."""
        self.state = copy.copy(self.mu)
        
    def sample(self):
        """Update internal state and return it as a noise sample"""
        x = self.state
        dx = self.theta*(self.mu-x) + self.sigma*np.array([random.gauss(0., 1.) for i in range(len(x))])
        self.state = x + dx
        return self.state
    

class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, action_size, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.
        Params
        ======
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)  # internal memory (deque)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = random.seed(seed)
    
    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)
    
    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).float().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)
        
```

#### Training
```python
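import numpy as np
import torch
from collections import deque

from ddpg_agent import Agent
from logger import Logger

# `env`, `brain_name`, `num_agents`, `state_size` and `action_size` are assumed to be
# defined earlier, when the Unity environment is loaded.
random_seed = 8
agent = Agent(state_size, action_size, random_seed)
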
def ddpg(n_episode=300, max_t=320, print_every=100):
    scores_deque = deque(maxlen=print_every)
    scoress= []
    logger = Logger('./logs')
    for i_episodes in range(1, n_episode+1):
        env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
        states = env_info.vector_observations                  # get the current state (for each agent)
        scores = np.zeros(num_agents)                          # initialize the score (for each agent)
        for ii in range(1,1001):                               # note: `max_t` is not used; up to 1000 steps per episode
            actions=[agent.act(states[no_agent,:]) for no_agent in range(num_agents)]
            actions = np.array(actions).reshape(num_agents,action_size)
            actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
            env_info = env.step(actions)[brain_name]           # send all actions to the environment
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished
            agent.step(states,actions,rewards,next_states,dones)
            scores += env_info.rewards                         # update the score (for each agent)
            states = next_states                               # roll over states to next time step
            if np.any(dones):                                  # exit loop if episode finished
                break
        mean_reward = np.mean(scores)
        scores_deque.append(mean_reward)
        scoress.append(mean_reward)
        print('\rEpisode {}\tAverage Score: {:.2f}  Best_score: {} current score: {}'.format(i_episodes,np.mean(scores_deque),
                                                                                             max(scores_deque),scores_deque[-1]),end="")
        #torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
        #torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
        if i_episodes%print_every ==0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episodes,np.mean(scores_deque)))
            
            
        # =============================================================================== #
        #                           Tensorboard Logging                                   #
        # =============================================================================== #
        
        # 1. Log scalar values (scalar summary)
        info = {'Average_Score_episode(over-100)': np.mean(scores_deque),
                'Average_Score_acrossAgent': mean_reward}
        for tag, value in info.items():
            logger.scalar_summary(tag, value, i_episodes)
            
        # 2. Log value and gradient of the parameters (histogram summary)
        ## Actor_local 
        for tag, value in agent.actor_local.named_parameters():
            tag = tag.replace('.','/')
            logger.histo_summary('ActorLocal/' +tag, value.data.cpu().numpy(), i_episodes)
            logger.histo_summary('ActorLocal/'+tag+'/grad', value.grad.data.cpu().numpy(),i_episodes)
        ## Actor_target
        for tag, value in agent.actor_target.named_parameters():
            tag = tag.replace('.','/')
            logger.histo_summary('ActorTarget/'+tag, value.data.cpu().numpy(), i_episodes)
            logger.histo_summary('ActorTarget/'+tag+'/grad', value.grad.data.cpu().numpy(),i_episodes)
            
        ## Critic_local
        for tag, value in agent.critic_local.named_parameters():
            tag = tag.replace('.','/')
            logger.histo_summary('CriticLocal/'+tag, value.data.cpu().numpy(), i_episodes)
            logger.histo_summary('CriticLocal/'+tag+'/grad', value.grad.data.cpu().numpy(),i_episodes)
        ## Critic_target
        for tag, value in agent.critic_target.named_parameters():
            tag = tag.replace('.','/')
            logger.histo_summary('CriticTarget/'+tag, value.data.cpu().numpy(), i_episodes)
            logger.histo_summary('CriticTarget/'+tag+'/grad', value.grad.data.cpu().numpy(),i_episodes)
            
            
    return scoress
scores = ddpg()
```
