## Introduction
In this project, I have trained an agent to navigate (and collect bananas!) in a large, square world.

A reward of +1 is provided for collecting a yellow banana, and a reward of -1 is provided for collecting a blue banana. Thus, the goal of the agent is to collect as many yellow bananas as possible while avoiding blue bananas.

The state space has 37 dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction. Given this information, the agent has to learn how to best select actions. Four discrete actions are available, corresponding to:

- `0`: move forward.
- `1`: move backward.
- `2`: turn left.
- `3`: turn right.

## Learning algorithm

### Background
- The goal of the agent is to maximize the expected sum of future rewards, where future rewards are discounted by a factor gamma (see details below).
- We consider sequences of actions and observations, `s_t = x_1, a_1, x_2, ..., a_{t-1}, x_t`, and learn strategies that depend on these sequences. All sequences are assumed to terminate in a finite number of time steps, which gives a finite Markov decision process.
- We define the optimal action-value function `Q∗(s, a)` as the maximum expected return achievable by following any strategy after seeing some sequence `s` and then taking some action `a`: `Q∗(s, a) = max_π E[R_t | s_t = s, a_t = a, π]`, where π is a policy mapping sequences to actions (or distributions over actions), as described in the Playing Atari with Deep Reinforcement Learning paper by Mnih et al. of DeepMind (https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf).
- The optimal action-value function obeys the Bellman equation, as described in the same paper, which allows the action-value function to be estimated by using the Bellman equation as an iterative update.
- In this specific case, the action-value function is approximated with a nonlinear function approximator, a neural network (Q-network), whose hyperparameters are detailed in the section below.
- I used stochastic gradient descent to train the network, with the same loss function as equation (2) in the paper mentioned above.
- To avoid strong correlations between consecutive samples, I use experience replay, which stores past experiences in a replay memory and samples from it during training, as described in the Human-level control through deep reinforcement learning article (https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf).
- Similarly, I use fixed Q-targets, as described in the same paper: the targets are computed with a separate target network, which avoids updating a guess with a guess and makes the algorithm more stable.
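The experience replay and fixed Q-target ideas above can be sketched in a few lines of plain Python. This is only an illustration, not the project's actual code: the `Experience` tuple, the `ReplayBuffer` class, the `td_targets` helper and the `target_q` callable (standing in for the frozen target network) are all hypothetical names chosen for this example.

```python
import random
from collections import deque, namedtuple

# Hypothetical container for one transition; field names are illustrative.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size memory of past transitions, sampled uniformly at random."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest transitions drop out first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling breaks the correlation between consecutive steps.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


def td_targets(batch, target_q, gamma):
    """Compute y = r + gamma * max_a' Q_target(s', a') for each transition.

    `target_q` is any callable mapping a state to a list of action values.
    Terminal transitions (done=True) do not bootstrap from the next state.
    """
    targets = []
    for exp in batch:
        bootstrap = 0.0 if exp.done else gamma * max(target_q(exp.next_state))
        targets.append(exp.reward + bootstrap)
    return targets
```

In a full agent, the local Q-network would then be trained to move `Q(s, a)` towards these targets, while the network behind `target_q` is held fixed or updated only slowly.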




### Details of the implementation
- I used Deep Q-Learning to solve this task, with the following exploration (epsilon) hyperparameters:
    - eps_start=1.0, eps_end=0.01, eps_decay=0.995
- The DQN agent has the following hyperparameters:
    - BUFFER_SIZE = int(1e5)  # replay buffer size
    - BATCH_SIZE = 64         # minibatch size
    - GAMMA = 0.99            # discount factor
    - TAU = 1e-3              # for soft update of target parameters
    - LR = 5e-4               # learning rate 
    - UPDATE_EVERY = 4        # how often to update the network
- The neural network has the following architecture:
    - Fully connected network with 3 layers of 128, 64 and 4 (action size) units
    - The first two layers use ReLU activations
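To make the role of the epsilon and `TAU` values above more concrete, here is a minimal sketch in plain Python. It assumes that epsilon is decayed multiplicatively once per episode and that the soft update blends local and target parameters as `tau * local + (1 - tau) * target`. Both are common choices for DQN, but the exact form used in this project is not shown here, and the function names are hypothetical.

```python
import random

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.01, 0.995  # values listed above
TAU = 1e-3                                         # value listed above


def epsilon_at(episode):
    """Epsilon after `episode` multiplicative decays, floored at EPS_END.

    Assumes one decay step per episode (an assumption, not confirmed above).
    """
    return max(EPS_END, EPS_START * EPS_DECAY ** episode)


def epsilon_greedy(action_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])


def soft_update(local_params, target_params, tau=TAU):
    """Return tau * local + (1 - tau) * target, element by element."""
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_params, target_params)
    ]
```

Under these assumptions, epsilon reaches its floor of 0.01 after roughly 920 episodes, and with `TAU = 1e-3` the target parameters move only 0.1% of the way towards the local parameters at each update.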

### Results

- The agent reached an average score of 15.02 over 100 episodes

<img src="files/image.png">
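For reference, an average over 100 episodes like the one reported above is often tracked as a moving average over the most recent episodes. The sketch below assumes that convention; `rolling_averages` is a hypothetical helper, not necessarily how the score in this project was computed.

```python
from collections import deque


def rolling_averages(scores, window=100):
    """Return the mean of the last `window` scores after each episode."""
    recent = deque(maxlen=window)  # automatically discards older scores
    averages = []
    for score in scores:
        recent.append(score)
        averages.append(sum(recent) / len(recent))
    return averages
```

Before `window` episodes have been played, the average is taken over the episodes seen so far.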

## Areas for improvement

- To improve the performance of the agent, I could use:
    1. Double DQN, as described in this paper (https://arxiv.org/abs/1509.06461)
    2. Prioritized experience replay, to learn from the most important yet infrequent experiences, as described in this research paper (https://arxiv.org/abs/1511.05952)
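As a rough illustration of the first idea, the sketch below contrasts the standard DQN target with the Double DQN target for a single transition. The function names and the plain-list action values are assumptions made for this example; in a real agent the values would come from the local and target networks.

```python
def dqn_target(reward, done, next_q_target, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, done, next_q_local, next_q_target, gamma=0.99):
    """Double DQN: the local network selects the action, the target network
    evaluates it, which reduces the overestimation of action values."""
    if done:
        return reward
    best_action = max(range(len(next_q_local)), key=lambda a: next_q_local[a])
    return reward + gamma * next_q_target[best_action]
```

When the two networks disagree about the best next action, the Double DQN target is never larger than the standard one, because it does not take the maximum over the target network's own estimates.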
