# Navigation Project Report

In this project, I train a DQN agent to navigate in a large, square world and to collect bananas.  


## Problem Overview

**State Space**: The state space has 37 dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  

**Action Space**: Four discrete actions are available, corresponding to:
- **`0`** - move forward.
- **`1`** - move backward.
- **`2`** - turn left.
- **`3`** - turn right.

**Reward**: A reward of +1 is provided for collecting a yellow banana, and a reward of -1 is provided for collecting a blue banana. Thus, the goal of the agent is to collect as many yellow bananas as possible while avoiding blue bananas.  

**Problem Solved**: The task is episodic. The environment is considered solved when the agent gets an average score of at least +13 over 100 consecutive episodes.
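The solved criterion above can be checked with a rolling window over the episode scores. The following is a minimal sketch; the function name `first_solved_episode` and its parameters are illustrative and not taken from the project code.

```python
from collections import deque


def first_solved_episode(scores, target=13.0, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    scores first reaches `target`, or None if that never happens."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

Whether the reported episode count includes the 100-episode window itself is a convention choice; this sketch returns the episode at which the window ends.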


## Method

DQN uses a deep neural network to approximate the Q-value function. For each state, it estimates the Q-value of each action and performs gradient descent on the MSE loss between the target Q-value and the current Q-value estimate. To stabilize and improve the DQN training procedure, two techniques are employed:

**Experience Replay**: When we feed the experience tuples $(s,a,r,s')$ sequentially to train the neural network, consecutive tuples are strongly correlated. To avoid this, we store experience tuples in a replay buffer and randomly sample a batch from it to compute the target values.
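A replay buffer of this kind can be sketched with the standard library alone. The class and field names below are illustrative rather than the project's own, and the `done` flag is an assumption added on top of the $(s,a,r,s')$ tuple so that episode ends can be recorded.

```python
import random
from collections import deque, namedtuple

# `done` is an assumed extra field marking the end of an episode.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that returns uniformly sampled batches."""

    def __init__(self, capacity, batch_size, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest entries are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement breaks the temporal ordering.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)
```

Learning would typically start only once `len(buffer)` is at least the batch size, since `sample` cannot draw more items than are stored.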

**Fixed Q-targets**: We use two networks: local and target. The local network is updated at every gradient step, while the target network is updated with the current weights of the local network at regular intervals.
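The interplay between the two networks can be sketched on plain lists of numbers. The targets come from the target network's estimates for the next state, the loss compares them with the local network's estimates, and the target weights are then moved toward the local weights. The sketch assumes a soft (interpolating) update controlled by TAU, which the hyperparameter list suggests but the text does not spell out; all function names are illustrative.

```python
def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'), with no bootstrap when done."""
    return [
        r + gamma * max(q_next) * (0.0 if done else 1.0)
        for r, q_next, done in zip(rewards, next_q_target, dones)
    ]


def mse_loss(targets, q_local_taken):
    """Mean squared error between targets and local Q(s, a) of the taken actions."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_local_taken)) / len(targets)


def soft_update(local_weights, target_weights, tau=1e-3):
    """Assumed update rule: theta_target <- tau*theta_local + (1 - tau)*theta_target."""
    return [
        tau * w_local + (1.0 - tau) * w_target
        for w_local, w_target in zip(local_weights, target_weights)
    ]
```

With a small `tau`, the target network changes slowly, so the targets used in the loss stay nearly fixed between updates.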

### DQN architecture

The DQN network has 3 fully connected layers, each with 256 neurons. Each layer is followed by a ReLU activation function. To avoid overfitting, dropout with a probability of $50\%$ is used after each fully connected layer except the last one.

The network accepts a 37-dimensional tensor, matching the dimension of each state, and returns a 4-dimensional tensor, one value for each action the agent can perform in that state.
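To make the input and output shapes concrete, here is a framework-free sketch of a forward pass. It rests on several assumptions that the description leaves open: the layer sizes are taken to be 37 → 256 → 256 → 4, the final layer is left linear so that Q-values can be negative, and the weights are random placeholders rather than trained values. All names are illustrative.

```python
import random


def make_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    bound = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def dropout(x, p, rng, training):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training:
        return x
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]


def q_values(state, layers, rng, p=0.5, training=False):
    """Map a state vector to one Q-value per action."""
    x = state
    for layer in layers[:-1]:  # hidden layers: linear -> ReLU -> dropout
        x = dropout(relu(linear(x, layer)), p, rng, training)
    return linear(x, layers[-1])  # output layer kept linear (assumption)


rng = random.Random(0)
sizes = [37, 256, 256, 4]  # assumed layer sizes
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
```

Calling `q_values(state, layers, rng)` on a list of 37 floats returns 4 numbers, and the greedy action is the index of the largest one. Dropout is only active when `training=True`.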

### DQN hyperparameters

- Replay buffer size: $10^5$

- Batch size: 64

- Discount factor (gamma): 0.99

- Soft update parameter (TAU): $10^{-3}$

- Learning rate (alpha): $10^{-3}$

- Frequency of network update: 50


## Results

The DQN agent solves the environment in 2866 episodes. After __ episodes, the score did not improve much until about episode __, when it started increasing again.

![Average scores](average_socres.png)

## Next Steps
The performance of the current DQN model is quite sensitive to its hyperparameters. It would be valuable to have a systematic way to tune the hyperparameters and to reduce the model's sensitivity to them.

Moreover, the current DQN model samples experiences uniformly from the replay buffer, assuming that each experience has the same priority. In reality, an experience with a larger TD-error is one from which the neural network can learn more. Thus we can associate each experience with a priority score that is a monotonically increasing function of its TD-error and sample experiences based on that score. This helps to select important experiences that may be rare and are easily ignored by uniform sampling.

- Prioritized Experience Replay: https://arxiv.org/abs/1511.05952
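As a rough illustration of the idea, the sketch below turns TD-errors into sampling probabilities using the proportional scheme from the paper above, where each priority is $|\delta| + \epsilon$ raised to a power $\alpha$. The function names and the default values of `alpha` and `eps` are illustrative, and the importance-sampling correction that the paper applies to the loss is left out.

```python
import random


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-5):
    """P(i) = p_i**alpha / sum_k p_k**alpha, with p_i = |td_error_i| + eps."""
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors, batch_size, rng, alpha=0.6):
    """Draw buffer indices (with replacement) according to their priorities."""
    probs = sampling_probabilities(td_errors, alpha)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)
```

Setting `alpha=0` recovers uniform sampling, so `alpha` controls how strongly the TD-error influences which experiences are replayed.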

Besides Prioritized Experience Replay, there are other extensions of DQN, such as Double DQN and Dueling DQN. I will look into this research to see what advantages they offer over vanilla DQN.

- Dueling DQN: https://arxiv.org/abs/1511.06581

- Double-Q Learning: https://arxiv.org/abs/1509.06461
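For orientation, the core change each extension makes can be sketched in a few lines. Double DQN selects the next action with the local network but evaluates it with the target network, and Dueling DQN combines a state value with mean-centered advantages. These are simplified sketches on plain lists with illustrative names, not code from this project.

```python
def double_dqn_target(reward, next_q_local, next_q_target, done, gamma=0.99):
    """Double DQN: choose the action with the local net, evaluate it with the target net."""
    if done:
        return reward
    best_action = max(range(len(next_q_local)), key=lambda a: next_q_local[a])
    return reward + gamma * next_q_target[best_action]


def dueling_q_values(state_value, advantages):
    """Dueling DQN aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]
```

In the first function, vanilla DQN would instead use `max(next_q_target)`, which tends to overestimate values when the target network's estimates are noisy.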

