## Project: Navigation


In this project we train an agent to collect yellow bananas in a large grid space.
- The state space is a vector with 37 dimensions.
- The action space has 4 possible actions in any state. Because the action space is discrete, we can use deep Q-learning.
- My implementation uses the **Double DQN** strategy.
- A plain fully connected neural network approximates the __action values__. We pass states through the network in batches and get the action values for each state.
- A replay buffer lets the agent reuse experience tuples many times and breaks the correlation between consecutive experience tuples.

# HyperParameters
```python
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 64         # minibatch size (32 was also tried)
GAMMA = 0.999           # discount factor
TAU = 4e-3              # for soft update of target parameters (5e-3 was also tried)
LR = 3.9e-4             # learning rate (3e-4 was also tried)
UPDATE_EVERY = 5        # how often to update the network
```

### Architecture of NeuralNetwork
* `Dense(state_size, 64) -> ReLU -> Dense(64, 32) -> ReLU -> Dense(32, 8) -> ReLU -> Dense(8, action_size)`
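To get a feel for how small this network is, the sketch below counts its trainable parameters. It assumes each `Dense` layer has a bias term and uses `state_size = 37` and `action_size = 4` from the environment description above; the helper name `count_parameters` is made up for illustration.

```python
# Layer widths from the architecture above; 37 and 4 come from the
# environment description at the top of this document.
state_size, action_size = 37, 4
layer_sizes = [state_size, 64, 32, 8, action_size]


def count_parameters(sizes):
    """Weights (fan_in * fan_out) plus biases (fan_out) for each dense layer."""
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))


total_params = count_parameters(layer_sizes)
print(total_params)  # 4812
```

Under these assumptions the four layers contribute 2432, 2080, 264 and 36 parameters respectively.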

## Deep Q-Learning PipeLine

1. __Qnetwork__ -> Actor (Policy) model.
    * It maps states to action values: it is a neural network that plays the role of the Q-table. Its input dimension equals the dimension of the state space and its output dimension equals the number of actions.
    * We keep two neural networks because, during training, both the labels and the predicted values would otherwise be functions of the same network weights. To decouple the labels from those weights we keep a second network (__fixed Q-targets__).
2. __dqn_agent__ -> A class whose methods let the agent **interact** with the environment and learn from it.
3. __Replay Buffer__ -> Fixed-size buffer that stores experience tuples.
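A minimal sketch of such a fixed-size buffer, using only the standard library, might look like the following. The class and field names here are hypothetical and chosen for illustration; the project's own buffer may differ (for example, it would normally return batched tensors rather than a list of tuples).

```python
import random
from collections import deque, namedtuple

# Hypothetical names for illustration only.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples uniformly."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop out first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement breaks the temporal ordering.
        return self.rng.sample(list(self.memory), k=self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(buffer_size=5, batch_size=3)
for t in range(8):  # 8 insertions into a buffer of capacity 5
    buffer.add(state=t, action=t % 4, reward=0.0, next_state=t + 1, done=False)
batch = buffer.sample()
```

After the loop only the five most recent experiences (states 3 to 7) remain, and `batch` holds three of them in random order.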

## Different methods of dqn_agent
* **\_\_init\_\_** method: We initialize the state size, the action size and the random seed.
    * Then we initialize two Q-networks (qnetwork_local and qnetwork_target): one produces the predictions and the other produces the targets.
    * Then we declare an optimizer, defined only over the parameters of qnetwork_local. Later a soft update copies a fraction of the qnetwork_local parameters into qnetwork_target.
    * Then we initialize the replay buffer.
    * Then we initialize t_Step, the counter that decides after how many steps the agent learns from its experiences.
    
* __Step__(self, state, action, reward, next_state, done)
    * This method always adds the experience to the replay buffer and then decides whether to also train (learn) the local network (qnetwork_local).
    * We learn from the experiences only if the replay buffer holds more than batch_size entries __and__ t_Step is a multiple of a number of our choice (`UPDATE_EVERY`, set to 5 in the hyperparameters above), i.e. the agent learns once every that many steps.
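The gating condition described above can be sketched as follows. This is an illustration under the assumption that the step counter wraps around modulo `UPDATE_EVERY`; the helper `should_learn` is a hypothetical name, not a method of the agent.

```python
UPDATE_EVERY = 5   # from the hyperparameters above
BATCH_SIZE = 64    # from the hyperparameters above


def should_learn(t_step, buffer_len,
                 update_every=UPDATE_EVERY, batch_size=BATCH_SIZE):
    """True when the counter hits a multiple of update_every and the
    buffer holds more than one batch worth of experiences."""
    return t_step % update_every == 0 and buffer_len > batch_size


# Simulate 100 environment steps, adding one experience per step.
learn_steps = []
buffer_len = 0
t_step = 0
for _ in range(100):
    buffer_len += 1
    t_step = (t_step + 1) % UPDATE_EVERY  # counter wraps around
    if should_learn(t_step, buffer_len):
        learn_steps.append(buffer_len)
```

In this simulation the first learning step happens once 65 experiences are stored, and then every 5 steps after that.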
    
* __learn__(self, experiences, gamma)
    * This step is the counterpart of the Q-learning step that updates the Q-table (**state-action value**) for a state $S$ after taking action $A$: $Q[S,A] \leftarrow Q[S,A] + \alpha \left(R + \gamma \max_{a} Q[S_{next},a] - Q[S,A]\right)$
    * Here the state space is continuous, so instead of a table we use a non-linear function approximator, and instead of the tabular update we backpropagate through the neural network weights.
    * Our __label__ is $R + \gamma \max_a Q(S_{next},a,w^-)$, where $Q(S_{next},a,w^-)$ is the output of **qnetwork_target**.
    * The **shape** of $Q(S_{next},\cdot,w^-)$ is \[batch_size, number of actions\], because that is how we defined our **qnetwork**. According to the formula the target needs $\max_a Q(S_{next},a,w^-)$, which we compute with the following operation in __PyTorch__.
    ```python 
    labels_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
    ```
    * Taking the max along dimension 1 (over the actions) gives a tensor of shape \[batch_size\]; `unsqueeze(1)` turns it into shape (batch_size, 1) so that it lines up with the other tensors.
    * The states have shape (batch_size, state_dimension). Note that along the batch dimension the states come in random order because of the replay buffer (we have broken the **correlation of the sequence order**).
    * The factor `(1 - dones) * labels_next` makes sure that no next-state value is added for a **terminal state**.
    * After passing the states through `qnetwork_local` the output has shape (batch_size, number of actions). To pick the value of the action the agent actually took in each experience tuple we use this command:
    * `self.qnetwork_local(state).gather(1,actions)`
    * The result has shape (batch_size, 1) and is the predicted value.
    * Now we can compute the loss and use backpropagation to update the weights, which is the equivalent of updating the **state-action values** (the Q-table).
    * Then we do a **soft update** of the weights of **qnetwork_target**. Remember that we train only one **set of weights**, those of **qnetwork_local**, so we need a separate way to update **qnetwork_target**. By updating its weights we hope the targets improve over time along with the predictions. The main reason for using two networks is to decouple the target from the predicted value, since both would otherwise be functions of the same weights. With fixed Q-targets they are functions of different sets of weights, which makes the network less likely to oscillate.
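The tensor operations in `learn` can be traced by hand on a tiny batch. The sketch below uses plain Python lists instead of PyTorch tensors, and all Q-values are made-up numbers; `td_targets` and `gather` are illustrative helpers that mimic `.max(1)[0]` with the `(1 - dones)` mask and `.gather(1, actions)` respectively.

```python
GAMMA = 0.999  # from the hyperparameters above


def td_targets(rewards, next_q_values, dones, gamma=GAMMA):
    """r + gamma * max_a Q(s', a; w^-), with the bootstrap term zeroed
    for terminal transitions."""
    return [r + gamma * max(q_next) * (1 - d)
            for r, q_next, d in zip(rewards, next_q_values, dones)]


def gather(q_values, actions):
    """Pick Q(s, a) for the action actually taken in each transition."""
    return [q[a] for q, a in zip(q_values, actions)]


# Toy batch of two transitions with 4 actions each (made-up numbers).
q_local = [[0.1, 0.5, 0.2, 0.0], [1.0, 0.3, 0.4, 0.2]]    # Q(s, .; w)
q_target = [[0.0, 1.0, 2.0, 0.5], [3.0, 1.0, 0.0, 0.0]]   # Q(s', .; w^-)
actions, rewards, dones = [1, 0], [1.0, 0.0], [0, 1]

predicted = gather(q_local, actions)            # values of the actions taken
labels = td_targets(rewards, q_target, dones)   # second transition is terminal
mse = sum((p - y) ** 2 for p, y in zip(predicted, labels)) / len(labels)
```

The second transition is terminal, so its label is just the reward even though the target network assigns a large value to the next state.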
    
* __soft_update__(local_model,target_model, tau)
    * One important thing to note: when we pass the next states to `qnetwork_target` we do **not** compute **gradients**, because the forward pass is wrapped in `with torch.no_grad()`. The target network is never trained directly, so no gradient is needed.
    * **tau** decides how much weight the **qnetwork_local** weights get relative to the current **qnetwork_target** weights in the update.
    ```python
    for target_param,local_param in zip(target_model.parameters(),local_model.parameters()):
        target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)
    ```
    better than tensorflow :)
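To see how slowly the target network follows the local one with this value of `TAU`, here is a small numeric sketch on plain Python lists. It keeps the local weights fixed, which is a simplification: in real training they change between updates.

```python
TAU = 4e-3  # from the hyperparameters above


def soft_update(local_weights, target_weights, tau=TAU):
    """Return tau * local + (1 - tau) * target, element by element."""
    return [tau * w_l + (1.0 - tau) * w_t
            for w_l, w_t in zip(local_weights, target_weights)]


local = [1.0, -2.0]
target = [0.0, 0.0]

one_step = soft_update(local, target)  # each weight moves 0.4% of the way
for _ in range(1000):
    target = soft_update(local, target)

# Toward a fixed local network, the remaining gap after n updates
# shrinks by a factor of (1 - tau) ** n.
remaining_gap = 1.0 - target[0]
```

With `TAU = 4e-3` a single update moves each target weight only 0.4% of the way, and after 1000 updates roughly 2% of the original gap is left.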

* __act__(state,eps=0.)
    * Returns the action for the given state according to the current policy.
    * First we switch the model to evaluation mode.
    * Then we convert the state from a NumPy array to a `torch.tensor`. The `.unsqueeze(0)` call adds a leading batch dimension, because a PyTorch model expects its input to have a batch dimension.
    * Then we pass the state through the network and get the corresponding action values. Note that we use `qnetwork_local` here.
    * Action selection is epsilon-greedy, because we also want to explore random actions so that the agent gathers more varied experience. The `eps` hyperparameter controls this.
    * We decrease `eps` gradually: as the agent becomes smarter we want less **exploration** and more **exploitation**. Sounds fancy!
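Epsilon-greedy selection with a decaying `eps` can be sketched as below. The function names and the schedule values (`eps_start`, `eps_end`, `eps_decay`) are illustrative assumptions, not the settings used in this project.

```python
import random


def epsilon_greedy(action_values, eps, rng=random):
    """Greedy action with probability 1 - eps, uniform random otherwise."""
    if rng.random() > eps:
        return max(range(len(action_values)), key=lambda a: action_values[a])
    return rng.randrange(len(action_values))


def decayed_eps(episode, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Multiplicative decay with a floor; all defaults are illustrative."""
    return max(eps_end, eps_start * eps_decay ** episode)


q = [0.1, 0.7, 0.3, 0.2]                     # made-up action values
greedy_action = epsilon_greedy(q, eps=0.0)   # no exploration: picks index 1
```

With `eps=1.0` the same function ignores the action values entirely and behaves like a uniformly random policy.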
    

# Double DQN
* The basic idea: in vanilla DQN the target uses the action that maximizes the Q-value of the next state. Early in training, when the agent is still naive, these Q-values are noisy approximations, so taking the max over them tends to overestimate the Q-value.
## Implementation
* We select the best action using one set of parameters $w$ (qnetwork_local), but **evaluate** it with a different set of parameters $w^-$ (qnetwork_target).
<br> $R + \gamma \, q\big(S', \arg\max_a q(S', a, w), w^-\big)$
* It is like having two separate function approximators that must agree on the best action.
* If $w$ picks an action that is not the best according to $w^-$, then the Q-value returned is not as high.

```python
def learn(self, experiences, gamma):
        """Update value parameters using given batch of experience tuples.

        Params
        =======

            experiences (Tuple[torch.Variable]): tuple of (s, a, r, s', done) tuples

            gamma (float): discount factor
        """
        states, actions, rewards, next_states, dones = experiences
        # compute and minimize the loss
        criterion = torch.nn.MSELoss()
        self.qnetwork_local.train()
        self.qnetwork_target.eval()
        #shape of output from the model (batch_size,action_dim) = (64,4)
        predicted_targets = self.qnetwork_local(states).gather(1,actions)
        
        #################Updates for Double DQN learning###########################
        self.qnetwork_local.eval()
        with torch.no_grad():
            actions_q_local = self.qnetwork_local(next_states).detach().max(1)[1].unsqueeze(1).long()
            labels_next = self.qnetwork_target(next_states).gather(1,actions_q_local)
        self.qnetwork_local.train()
        ############################################################################

        # .detach() ->  Returns a new Tensor, detached from the current graph.
        labels = rewards + (gamma* labels_next*(1-dones))
        
        loss = criterion(predicted_targets,labels).to(device)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # ------------------- update target network ------------------- #
        self.soft_update(self.qnetwork_local,self.qnetwork_target,TAU)
```
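The difference between the vanilla DQN target and the Double DQN target can also be seen on a single transition with made-up Q-values. The two helper functions below are illustrative and work on plain Python lists rather than tensors.

```python
def dqn_target(reward, q_target_next, gamma, done):
    """Vanilla DQN: the target network both selects and evaluates the action."""
    return reward + gamma * max(q_target_next) * (1 - done)


def double_dqn_target(reward, q_local_next, q_target_next, gamma, done):
    """Double DQN: the local network selects, the target network evaluates."""
    best_action = max(range(len(q_local_next)), key=lambda a: q_local_next[a])
    return reward + gamma * q_target_next[best_action] * (1 - done)


# Made-up Q-values for one next state with 4 actions.
q_local_next = [0.2, 0.9, 0.4, 0.1]    # w prefers action 1
q_target_next = [0.3, 0.5, 1.5, 0.2]   # w^- has a (possibly noisy) peak at action 2

vanilla = dqn_target(1.0, q_target_next, gamma=0.999, done=0)
double = double_dqn_target(1.0, q_local_next, q_target_next, gamma=0.999, done=0)
```

Here the two networks disagree, so the Double DQN target (about 1.50) is lower than the vanilla target (about 2.50). For the same Q-values the Double DQN target can never exceed the vanilla one, since it evaluates one particular action instead of taking the maximum.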

The DDQN agent solved this navigation problem in **166** episodes. One thing to note about my training is that **epsilon** stayed quite high throughout: I set its lower bound to **0.51**, which I feel is high. In theory this means the agent was taking a large number of random actions under the **epsilon-greedy policy**, yet it worked fine. I am still a bit confused about why.
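One possible piece of the explanation is that a random draw can still land on the greedy action. The sketch below assumes the random branch of the policy samples uniformly from all 4 actions (an assumption about the implementation, not something verified here) and computes how often the greedy action is taken at the epsilon floor.

```python
def greedy_probability(eps, n_actions):
    """Chance of taking the greedy action when the random branch is
    uniform over all actions (including the greedy one)."""
    return (1.0 - eps) + eps / n_actions


p_greedy = greedy_probability(eps=0.51, n_actions=4)  # 0.49 + 0.1275
```

Under that assumption the agent still follows its greedy action about 62% of the time at `eps = 0.51`, which may partly explain why learning did not break down.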


<img src="./reward.png">

## Ideas for future work

* When the trained agent is actually run, it sometimes gets "confused" if there are no yellow bananas in its field of view and starts to jitter. I don't know how to solve this yet (it may be a problem with the Unity environment).
* Try other variations of the DDQN agent, ultimately finishing with the Rainbow algorithm, to see how fast it would solve the task :-) Perhaps the training plot would also be smoother than with plain vanilla DQN.
