# Banana Collector Agent Training Report

The implementation for this project is almost line-for-line the same as the one used for the Lunar Lander coding project from earlier in the course. The only significant change is swapping the OpenAI Gym environment for the Unity Banana environment provided for this project.

## Learning Algorithm

### Model - Best Performing

The best-performing model is a three-layer fully connected network. The first layer has 64 nodes and takes 37 inputs, one per dimension of the state space. Its output passes through a ReLU nonlinearity to the second layer, also with 64 nodes. That, in turn, passes through another ReLU to the third layer, which has four nodes, one for each action in the action space. The output of that layer is used directly as the action-value output, with no further nonlinearity.
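To make the layer shapes concrete, here is a minimal framework-free sketch of the forward pass in plain Python. It is not the project's model code: the helper names (`make_layer`, `linear`, `relu`, `q_values`), the random weight range and the random input state are all assumptions made for illustration.

```python
import random

STATE_SIZE, HIDDEN, ACTION_SIZE = 37, 64, 4

def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for one fully connected layer.
    The small uniform initialization is an illustrative assumption."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def linear(layer, x):
    """Compute W x + b for one layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def q_values(layers, state):
    """37 inputs -> 64 -> ReLU -> 64 -> ReLU -> 4 raw outputs (no final activation)."""
    h = relu(linear(layers[0], state))
    h = relu(linear(layers[1], h))
    return linear(layers[2], h)

rng = random.Random(0)
layers = [
    make_layer(STATE_SIZE, HIDDEN, rng),
    make_layer(HIDDEN, HIDDEN, rng),
    make_layer(HIDDEN, ACTION_SIZE, rng),
]
state = [rng.random() for _ in range(STATE_SIZE)]  # stand-in for an environment state
q = q_values(layers, state)

# The greedy action is the index of the largest of the four outputs.
greedy_action = max(range(ACTION_SIZE), key=lambda a: q[a])
```

Because the last layer has no activation, its four outputs can take any real value, which is what a Q-value estimate needs.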

### Model - Alternate (Deep)

I also set up an alternate version of the model, `model_deeper.py`. This model has five fully connected layers with 64, 128, 64, 32 and 4 nodes respectively. Every layer except the final output layer is followed by a ReLU.

Surprisingly, this model generally performed more poorly on the task than the initial, shallower model. Results from this model have been included in the report along with the better-performing shallow model.
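One way to put the two architectures side by side is to count their trainable parameters. The sketch below assumes every layer has a bias term, which this report does not state, and `count_parameters` is a hypothetical helper rather than part of the project.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers.
    Assumes each layer has a bias vector."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

standard = count_parameters([37, 64, 64, 4])         # 6852
deeper = count_parameters([37, 64, 128, 64, 32, 4])  # 21220
print(standard, deeper)
```

Under that assumption the deeper model has roughly three times as many parameters to fit from the same stream of experience. That may be one factor in its weaker results, though nothing in these experiments isolates the cause.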

### Process

The learning algorithm is a basic DQN learning algorithm. There are two networks, the Local Network and the Target Network. Both start off initialized identically.

The agent chooses each action epsilon-greedily (Epsilon Start: 1.0, Epsilon Decay: 0.995, Epsilon End: 0.01). After the environment updates, the state, action, reward, next state and whether the episode is complete are stored in the replay buffer. Once the replay buffer holds enough tuples to fill a batch (default batch size: 64), the agent draws a random batch-sized sample from the buffer every 4 steps. For each sampled transition, the target is the actual reward plus the target network's discounted maximum predicted Q value over the next state's actions (the Bellman equation). The local network's predicted Q value for the sampled state and action is compared to that target, and the MSE between the two is the loss. The local network is optimized against this loss using the Adam optimizer with a learning rate of 5e-4.
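The target and loss computation described above can be sketched in plain Python on a tiny hand-made batch. The discount factor `GAMMA = 0.99` is an assumption, since this report does not state the value used, and the function names and batch values are illustrative only. `decayed_epsilon` is likewise agnostic about whether the decay is applied per step or per episode, which the report does not specify.

```python
GAMMA = 0.99  # assumed discount factor; not stated in this report

def td_targets(rewards, next_q_target, dones, gamma=GAMMA):
    """Bellman targets: r + gamma * max_a' Q_target(s', a').
    Terminal transitions do not bootstrap from the next state."""
    return [r + gamma * max(q_next) * (0.0 if done else 1.0)
            for r, q_next, done in zip(rewards, next_q_target, dones)]

def mse(predicted, targets):
    """Mean squared error between local-network predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(targets)

def decayed_epsilon(n_decays, start=1.0, decay=0.995, end=0.01):
    """Epsilon after `n_decays` multiplicative decays, floored at `end`."""
    return max(end, start * decay ** n_decays)

# Two illustrative transitions; the second one ends its episode.
rewards = [1.0, 0.0]
dones = [False, True]
next_q_target = [[0.5, 2.0, 1.0, 0.0],   # target network output for each next state
                 [3.0, 1.0, 0.0, 2.0]]
q_local_taken = [2.5, 0.4]               # local network Q(s, a) for the actions taken

targets = td_targets(rewards, next_q_target, dones)
loss = mse(q_local_taken, targets)
```

In the real agent the loss would be backpropagated through the local network only; the target network's outputs are treated as constants.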

After each optimization step on the local network, the target network is then updated toward the local network with a soft-update factor of Tau = 1e-3.
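Assuming Tau is the interpolation factor of a conventional soft update, each target parameter moves a small fraction of the way toward its local counterpart. The sketch below uses flat lists of floats as stand-ins for network parameters, and `soft_update` is a hypothetical helper name.

```python
TAU = 1e-3

def soft_update(local_params, target_params, tau=TAU):
    """Blend in place: target <- tau * local + (1 - tau) * target."""
    for i, (local, target) in enumerate(zip(local_params, target_params)):
        target_params[i] = tau * local + (1.0 - tau) * target

local_params = [1.0, -2.0, 0.5]   # illustrative values only
target_params = [0.0, 0.0, 0.0]
soft_update(local_params, target_params)
```

With Tau this small, the target network changes slowly, which keeps the Bellman targets relatively stable between updates.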

## Results

In addition to the raw scores per episode, I also plotted the average score over the last 100 episodes (black line) to get a clearer picture of the learning trajectory. That plot also includes the 100-episode rolling averages of yellow bananas collected (yellow) and blue bananas collected (blue), to show the raw numbers of each type of banana collected and give an idea of how much further improvement was possible.
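A rolling average like the one plotted can be computed with a fixed-length `collections.deque`. This is a sketch under the assumption that early episodes are averaged over however many scores exist so far; the plotting code itself is not shown in this report, and the sample scores are made up.

```python
from collections import deque

def rolling_averages(scores, window=100):
    """Mean of the most recent `window` scores after each episode."""
    recent = deque(maxlen=window)  # old scores fall off automatically
    averages = []
    for score in scores:
        recent.append(score)
        averages.append(sum(recent) / len(recent))
    return averages

scores = [0, 2, 4, 6]  # made-up episode scores
avg = rolling_averages(scores, window=2)  # [0.0, 1.0, 3.0, 5.0]
```

The same function applies unchanged to per-episode counts of yellow or blue bananas.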

### Default - Batch size: 64, Standard Model

Episode 100	Average Score: 0.31  
Episode 200	Average Score: 3.35  
Episode 300	Average Score: 8.00  
Episode 400	Average Score: 10.55  
Solved in 484 episodes  

![b64standard.png](Plots/b64standard.png)
![b64standard-average.png](Plots/b64standard-average.png)

### Batch Size: 64, Deep Model

Episode 100	Average Score: 0.73  
Episode 200	Average Score: 3.26  
Episode 300	Average Score: 7.01  
Episode 400	Average Score: 10.32  
Solved in 482 episodes  

![b64deeper.png](Plots/b64deeper.png)
![b64deeper-average.png](Plots/b64deeper-average.png)

### Batch size: 256, Standard Model

Episode 100	Average Score: 0.58  
Episode 200	Average Score: 4.28  
Episode 300	Average Score: 6.50  
Episode 400	Average Score: 10.46  
Solved in 475 episodes  

![b256standard.png](Plots/b256standard.png)
![b256standard-average.png](Plots/b256standard-average.png)

### Batch size: 256, Deep Model

Episode 100	Average Score: 0.88  
Episode 200	Average Score: 3.27  
Episode 300	Average Score: 7.24  
Episode 400	Average Score: 9.66  
Solved in 489 episodes  

![b256deeper.png](Plots/b256deeper.png)
![b256deeper-average.png](Plots/b256deeper-average.png)

## Possible Improvements

My first idea for improvement, which I tried without success, was the deeper network described above. It was mostly an attempt to reach the +13 solution faster. In fact, that model performed more poorly on the whole (see above).

Consider this, though: using the default model and parameters from the Lunar Lander assignment, the agent already learned to solve the environment in fewer than 500 episodes, far below the suggested 1800. Given that, I did not feel a strong incentive to make the agent solve the environment any faster. However, I was interested to see whether any changes might raise the upper limit of the 100-episode average reward.

I tried to see whether any of the model/parameter combinations might perform better against a higher bar for the solution. Specifically, a 100-episode average reward of approximately +17 seemed to be around the limit of what this particular model and set of parameters could achieve.

### Batch size: 256, Standard Model

In an alternate implementation of `navigation.ipynb` and `navigation.py` (not submitted), I tracked how many episodes it took to reach +13 (the initial assignment), +16, +17, +18, +19 and +20.

Episode 100	Average Score: 0.58  
Episode 200	Average Score: 4.28  
Episode 300	Average Score: 6.50  
Episode 400	Average Score: 10.46  
Solved for +13 in 475 episodes  
Episode 500	Average Score: 13.42  
Episode 600	Average Score: 14.67  
Episode 700	Average Score: 15.02  
Episode 800	Average Score: 14.96  
Solved for +16 in 859 episodes  
Episode 900	Average Score: 16.15  
Episode 1000	Average Score: 15.26  
Episode 1100	Average Score: 15.86  
Episode 1200	Average Score: 15.74  
Episode 1300	Average Score: 16.49  
Solved for +17 in 1335 episodes  
Episode 1400	Average Score: 16.44  
Episode 1500	Average Score: 16.40  
Episode 1600	Average Score: 15.66  
Episode 1700	Average Score: 16.41  
Episode 1800	Average Score: 15.62  
Episode 1900	Average Score: 14.98  
Episode 2000	Average Score: 15.21  
Episode 2100	Average Score: 15.72  
Episode 2200	Average Score: 15.22  
Episode 2300	Average Score: 15.13  
Episode 2400	Average Score: 14.74  
Episode 2500	Average Score: 15.32  
Episode 2600	Average Score: 15.00  
Episode 2700	Average Score: 15.68  
Episode 2800	Average Score: 16.13  
Episode 2900	Average Score: 15.38  
Episode 3000	Average Score: 15.62  
Episode 3100	Average Score: 15.50  
Episode 3200	Average Score: 14.91  
Episode 3300	Average Score: 16.10  
Episode 3400	Average Score: 16.41  
Episode 3500	Average Score: 15.94  
Episode 3600	Average Score: 15.09  
Episode 3700	Average Score: 16.29  
Episode 3800	Average Score: 16.15  
Episode 3900	Average Score: 15.93  
Episode 4000	Average Score: 15.85  
Episode 4100	Average Score: 14.48  
Episode 4200	Average Score: 15.53  
Episode 4300	Average Score: 14.72  
Episode 4400	Average Score: 15.25  
Episode 4500	Average Score: 15.42  
Episode 4600	Average Score: 15.07  
Episode 4700	Average Score: 14.75  
Episode 4800	Average Score: 15.53  
Episode 4900	Average Score: 14.83  
Episode 5000	Average Score: 14.77  

![b256standard-17-18-19-20.png](Plots/b256standard-17-18-19-20.png)
![b256standard-average-17-18-19-20.png](Plots/b256standard-average-17-18-19-20.png)
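The milestone tracking behind the "Solved for +N" lines in the log above could look roughly like the sketch below. The unsubmitted implementation is not shown, so the function name, the requirement of a full window before checking, and the choice to report the 1-indexed episode at which the average is first reached are all assumptions.

```python
from collections import deque

def first_episode_reaching(scores, thresholds, window=100):
    """Map each threshold to the first episode (1-indexed) whose rolling
    average over the last `window` episodes reaches it, or None if never."""
    recent = deque(maxlen=window)
    reached = {t: None for t in thresholds}
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) < window:
            continue  # wait for a full window before judging
        average = sum(recent) / window
        for t in thresholds:
            if reached[t] is None and average >= t:
                reached[t] = episode
    return reached

scores = [10, 12, 14, 16, 18, 20]  # made-up scores
milestones = first_episode_reaching(scores, thresholds=[13, 16, 20], window=2)
# milestones == {13: 3, 16: 5, 20: None}
```

A threshold left at `None` corresponds to a bar the run never cleared, as with +18 and above in the log.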

With a batch size of 256, the standard model reached a +16 average on the 859th episode and +17 on the 1335th. This configuration, the fastest to reach +13, never reached +18 in 5000 episodes. None of the parameter changes I tried broke +18, including changes to:

* Tau
* Learning Rate
* Gamma

Admittedly, I had little time to devote to experimentation. Furthermore, without more detail about exactly what the 37 state values comprise, it's hard to know which parameters would be best to adjust. It's mostly guesswork.