# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [None]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```

In [None]:
env = UnityEnvironment(file_name="Banana_Windows_x86_64/Banana.exe")

Environments contain **_brains_**, which are responsible for deciding the actions of their associated agents. Here we check for the first brain available and set it as the default brain we will be controlling from Python.

In [None]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.

Run the code cell below to print some information about the environment.

In [None]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)

### 3. Determine if GPU can be used

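If CUDA is available, the networks are trained on the GPU, which speeds up learning; otherwise the CPU is used. The cell also fixes the random seed used to initialize the networks.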

In [None]:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

seed = 42

### 4. Define the neural network used to approximate the Q-table

The neural network defined below consists of a GRU layer followed by two fully connected layers. It uses ReLU activation and dropout.


In [None]:
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    def __init__(self, state_size, action_size, seed):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
        """
        super(QNetwork, self).__init__()
        self.seed = torch.manual_seed(seed)
        
        self.gru_hidden_dim = 100
        self.hidden_dim = 150
        
        self.gru = nn.GRU(input_size=state_size, hidden_size=self.gru_hidden_dim, batch_first=True)
        self.lin1 = nn.Linear(self.gru_hidden_dim, self.hidden_dim)
        self.dropout = nn.Dropout(p=0.2)
        self.lin2 = nn.Linear(self.hidden_dim, action_size)

    def forward(self, state):
        """Build a network that maps state -> action values."""
        x = F.relu(self.gru(input=state)[0])
        x = x[:, -1, :] # keep only last entry of sequence dim
        x = self.dropout(x)
        x = F.relu(self.lin1(x))
        x = self.dropout(x)
        return self.lin2(x)
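
As a quick sanity check of the tensor shapes involved, the sketch below feeds a random batch through a network with the same layer layout. `TinyQNetwork` is a hypothetical stand-in defined here only so the example is self-contained; the sizes 37 and 4 come from the state and action spaces described above, while the batch size and sequence length are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyQNetwork(nn.Module):
    """Hypothetical stand-in mirroring the GRU -> linear -> linear layout."""
    def __init__(self, state_size, action_size, gru_hidden_dim=100, hidden_dim=150):
        super().__init__()
        self.gru = nn.GRU(input_size=state_size, hidden_size=gru_hidden_dim, batch_first=True)
        self.lin1 = nn.Linear(gru_hidden_dim, hidden_dim)
        self.dropout = nn.Dropout(p=0.2)
        self.lin2 = nn.Linear(hidden_dim, action_size)

    def forward(self, state):
        out, _ = self.gru(state)       # out: (batch, seq_len, gru_hidden_dim)
        x = F.relu(out)[:, -1, :]      # keep the last time step: (batch, gru_hidden_dim)
        x = self.dropout(x)
        x = F.relu(self.lin1(x))
        x = self.dropout(x)
        return self.lin2(x)            # (batch, action_size)

net = TinyQNetwork(state_size=37, action_size=4)
net.eval()                             # disable dropout for a deterministic check
batch = torch.randn(8, 30, 37)         # (batch, seq_len, state_size), illustrative sizes
with torch.no_grad():
    q_values = net(batch)
print(q_values.shape)
```

Because the GRU is created with `batch_first=True`, the input must be three-dimensional, with the sequence of frames on the second axis; the network returns one Q-value per action for each sequence in the batch.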

### 5. Define some hyperparameters

In the following cell, we define the hyperparameters and constants for the learning algorithm.

In [None]:
n_episodes = 2000
max_timesteps = 1000
replay_buffer_size = 10000
batch_size = 64
gamma = 0.995
update_every = 5
learning_rate = 0.001
tau = 0.001
rnn_seq_length = 30 # one state consists of 30 consecutive time frames
epsilon_end = 0.01
epsilon_decay = 0.995

checkpoint_file = 'banana_dqn_checkpoint.pth'
target_score = 13.0
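
To get a feel for the exploration schedule these values imply, the sketch below counts how many multiplicative decay steps it takes for epsilon to fall to `epsilon_end`. It assumes exploration starts at `epsilon = 1.0` and that epsilon is decayed once per episode; the variable `episodes` is introduced only for this illustration.

```python
epsilon = 1.0            # assumed starting value
epsilon_end = 0.01
epsilon_decay = 0.995

episodes = 0
while epsilon > epsilon_end:
    epsilon *= epsilon_decay   # one multiplicative decay step per episode
    episodes += 1

print(episodes)
```

Under these assumptions the floor is reached after a little over 900 episodes, so the agent keeps exploring above the minimum rate for almost half of the `n_episodes = 2000` budget.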

### 6. The algorithm

In [None]:
local_network = QNetwork(state_size, action_size, seed).to(device)
target_network = QNetwork(state_size, action_size, seed).to(device)
optimizer = torch.optim.Adam(local_network.parameters(), lr=learning_rate)

from collections import deque
import random
memory = deque(maxlen=replay_buffer_size) # stores (s,a,r,s',d) tuples   

scores = []
scores_window = deque(maxlen=100)
epsilon = 1.0
solved = False

for episode in range(1, 1 + n_episodes):
    env_info = env.reset(train_mode=True)[brain_name]
    
    state = np.zeros((rnn_seq_length, state_size))
    state[-1] = env_info.vector_observations[0]
    
    score = 0
    for t in range(1, 1 + max_timesteps):
        
        # choose action via epsilon greedy
        if random.random() > epsilon:
            torch_state = torch.from_numpy(state).float().unsqueeze(0).to(device)
            torch_state.requires_grad = False
            local_network.eval()
            with torch.no_grad():
                action_values = local_network(torch_state)
            local_network.train()
            action = np.argmax(action_values.cpu().data.numpy())
        else:
            action = random.choice(np.arange(action_size))
        
        # apply action in environment
        env_info = env.step(action)[brain_name]
        
        next_state = np.zeros_like(state)
        next_state[:-1] = state[1:]
        next_state[-1] = env_info.vector_observations[0]   
        reward = env_info.rewards[0]  
        done = env_info.local_done[0]
        
        # store the experience in the replay buffer
        memory.append((state, action, reward, next_state, done))
        
        if t % update_every == 0 and len(memory) >= batch_size: # do not always learn
            samples = random.sample(memory, k=batch_size) # sample from memory          
            states, actions, rewards, next_states, dones = zip(*samples)
            
            states = torch.from_numpy(np.array(states)).float().to(device)
            actions = torch.from_numpy(np.array(actions)).long().to(device)
            rewards = torch.from_numpy(np.array(rewards)).float().to(device)
            next_states = torch.from_numpy(np.array(next_states)).float().to(device)
            dones = torch.from_numpy(np.array(dones, dtype='uint8')).float().to(device)
            
            next_targets = target_network(next_states).detach().max(1)[0] # max Q values for next state
            targets = rewards + (gamma * next_targets * (1 - dones)) # Q values for current state
            
            expected = local_network(states)
            expected = expected.gather(1, actions.unsqueeze(1)).squeeze(1) # expected Q values from local model
            
            loss = F.mse_loss(expected, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            # update target network with tau interpolation
            for t_param, l_param in zip(target_network.parameters(), local_network.parameters()):
                t_param.data.copy_(tau * l_param.data + (1.0 - tau) * t_param.data)
        
        score += reward
        state = next_state
        if done:                                       
            break

    scores.append(score)
    scores_window.append(score)
    epsilon = max(epsilon_end, epsilon * epsilon_decay)
    
    avg_score = np.mean(scores_window)
    if avg_score > target_score:
        torch.save(local_network.state_dict(), checkpoint_file)
        np.save('scores13.npy', np.array(scores))
        if episode > 100 and not solved:
            solved = True
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(episode, np.mean(scores_window)))

    print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_window)), end="")
    if episode % 100 == 0:
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_window)))
        
env.close()
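
Three steps of the loop above can be looked at in isolation: rolling the observation window, computing the one-step TD target, and blending the target network toward the local one. The sketch below reproduces that arithmetic in plain NumPy with made-up numbers. The helper names `roll_window`, `td_targets`, and `soft_update` are hypothetical and not part of the notebook, and NumPy arrays stand in for the PyTorch tensors used during training.

```python
import numpy as np

def roll_window(state, observation):
    """Drop the oldest frame and append the newest observation at the end."""
    next_state = np.zeros_like(state)
    next_state[:-1] = state[1:]
    next_state[-1] = observation
    return next_state

def td_targets(rewards, next_q_values, dones, gamma):
    """One-step TD target: r + gamma * max_a' Q_target(s', a'), zeroed for terminal s'."""
    return rewards + gamma * next_q_values.max(axis=1) * (1.0 - dones)

def soft_update(target_weights, local_weights, tau):
    """Move the target weights a fraction tau toward the local weights."""
    return tau * local_weights + (1.0 - tau) * target_weights

# Rolling window over 3 frames of a 2-dimensional observation (illustrative sizes)
window = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
window = roll_window(window, np.array([4.0, 4.0]))

# Two transitions; the second one ends the episode
rewards = np.array([1.0, -1.0])
next_q = np.array([[0.5, 2.0], [3.0, 1.0]])
dones = np.array([0.0, 1.0])
targets = td_targets(rewards, next_q, dones, gamma=0.995)

# Blend a target weight vector toward a local one
blended = soft_update(np.array([0.0, 0.0]), np.array([1.0, 2.0]), tau=0.001)
```

With `tau = 0.001`, each update moves the target weights only 0.1% of the way toward the local weights, so the targets change slowly while the local network is being trained.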

### 7. Plot the score history

In [None]:
import matplotlib.pyplot as plt

means = []
for i in range(100, len(scores)):
    means.append([i, np.mean(scores[i-100:i])])
means = np.array(means)
plt.figure(figsize=(10, 6))
plt.plot(scores, '.')
plt.plot(means[:, 0], means[:, 1], 'r')
plt.grid()
plt.xlabel('Episode')
plt.ylabel('Score')
plt.title('basic banana scores')
plt.legend(['Episode Score', '100 Episode Average Score'])
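
The running mean can also be computed without an explicit loop. The sketch below is one possible alternative, using `np.convolve` with a uniform kernel. The helper `moving_average` and the demo data are hypothetical, and a short window is used so the result is easy to check by hand.

```python
import numpy as np

def moving_average(values, window=100):
    """Mean over each full window of consecutive values."""
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        return np.array([])                # not enough data for a single window
    kernel = np.ones(window) / window      # uniform weights that sum to 1
    return np.convolve(values, kernel, mode='valid')

demo_scores = np.arange(10)                # made-up scores 0..9
avg = moving_average(demo_scores, window=4)
print(avg)
```

Entry `k` of the result is the mean of `values[k:k + window]`, so the output has `len(values) - window + 1` entries. Unlike the loop above, it includes the window that ends at the final episode.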