# Navigation

---

You are welcome to use this coding environment to train your agent for the project.  Follow the instructions below to get started!

### 1. Start the Environment

Run the next code cell to install a few packages.  This cell may take a few minutes to run!

In [1]:
!pip -q install ./python

tensorflow 1.7.1 has requirement numpy>=1.13.3, but you'll have numpy 1.12.1 which is incompatible.
ipython 6.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.15, but you'll have prompt-toolkit 2.0.10 which is incompatible.


The environment is already saved in the Workspace and can be accessed at the file path provided below.  Please run the next code cell without making any changes.

In [2]:
from unityagents import UnityEnvironment
import numpy as np

# please do not modify the line below
env = UnityEnvironment(file_name="/data/Banana_Linux_NoVis/Banana.x86_64")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
print(brain_name)

BananaBrain


### 2. Examine the State and Action Spaces

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)

Number of agents: 1
Number of actions: 4
States look like: [ 1.          0.          0.          0.          0.84408134  0.          0.
  1.          0.          0.0748472   0.          1.          0.          0.
  0.25755     1.          0.          0.          0.          0.74177343
  0.          1.          0.          0.          0.25854847  0.          0.
  1.          0.          0.09355672  0.          1.          0.          0.
  0.31969345  0.          0.        ]
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Note that **in this coding environment, you will not be able to watch the agent while it is training**, so set `train_mode=True` whenever you reset the environment.

In [5]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))

Score: 0.0


When finished, you can close the environment.

In [6]:
#env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  A few **important notes**:
- When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
- To structure your work, you're welcome to work directly in this Jupyter notebook, or you might like to start over with a new file!  You can see the list of files in the workspace by clicking on **_Jupyter_** in the top left corner of the notebook.
- In this coding environment, you will not be able to watch the agent while it is training.  However, **_after training the agent_**, you can download the saved model weights to watch the agent on your own machine! 

In [7]:
# necessary imports
import gym
import random
import torch
import numpy as np
from collections import namedtuple, deque
import matplotlib.pyplot as plt
%matplotlib inline
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim

In [8]:
#use the GPU if available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


In [9]:
#build the DQN model architecture
class DQNmodel(nn.Module):
    def __init__(self,state_size,action_size,seed):
        super(DQNmodel, self).__init__()
        self.seed = torch.manual_seed(seed)
        #two hidden layers (32 and 16 units) mapping a state to one Q-value per action
        self.model=nn.Sequential(nn.Linear(state_size,32),
                                 nn.ReLU(),
                                 nn.Linear(32,16),
                                 nn.ReLU(),
                                 nn.Linear(16,action_size))
    def forward(self,state):
        return self.model(state)
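
To get a sense of how small this network is, the sketch below counts its trainable parameters from the layer sizes alone. It assumes `state_size=37` and `action_size=4`, as printed earlier in the notebook; `count_parameters` is a helper name introduced here for illustration and is not part of the notebook.

```python
# Layer sizes of the fully connected network: 37 inputs, 32 and 16 hidden units, 4 outputs.
layer_sizes = [37, 32, 16, 4]

def count_parameters(sizes):
    """Count the weights and biases of a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

n_params = count_parameters(layer_sizes)
print(n_params)  # 1812
```

With under two thousand parameters, the model should train comfortably on a CPU if no GPU is available.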

In [10]:
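#hyperparameters
buffer_size=int(1e5)  #replay buffer capacity
batch_size=128        #minibatch size
gamma=0.95            #discount factor
update_every=4        #learn once every update_every environment steps
TAU=1e-3              #interpolation factor for the soft target update
LR=0.0005             #learning rate for the Adam optimizer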
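
The role of `TAU` is easiest to see on a toy example. The agent defined later in this notebook blends the target weights toward the local weights with θ_target = τ·θ_local + (1 − τ)·θ_target. The sketch below applies that rule to a single scalar standing in for a weight; holding the local value fixed is a simplifying assumption made only for illustration, and `soft_update` is a hypothetical helper.

```python
import math

TAU = 1e-3  # same value as in the parameters cell

def soft_update(local, target, tau):
    """Return tau*local + (1 - tau)*target for scalar stand-ins of the weights."""
    return tau * local + (1.0 - tau) * target

# Toy setting: the local weight stays at 1.0 and the target starts at 0.0.
local_w, target_w = 1.0, 0.0
for _ in range(1000):
    target_w = soft_update(local_w, target_w, TAU)

# After n updates the remaining gap is (1 - tau)**n of the initial gap.
remaining_gap = local_w - target_w
half_life = math.log(0.5) / math.log(1.0 - TAU)  # updates needed to halve the gap
print(remaining_gap, half_life)
```

Under these assumptions the target needs several hundred learning steps to close half of the gap, which is why a small `TAU` keeps the Q-targets slowly moving.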

In [11]:
#define replay buffer
class Replay_buffer():
    def __init__(self, action_size, buffer_size, batch_size, seed):
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)  
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = seed
        random.seed(seed)  #random.seed returns None, so keep the seed value separately
        
    def add(self, state, action, reward, next_state, done):
        e=self.experience(state, action, reward, next_state, done)
        #add new experience to the memory
        self.memory.append(e)
    def sample(self):
        #randomly generate a batch of experiences from the memory
        experiences = random.sample(self.memory, k=self.batch_size)
        
        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)
        return (states, actions, rewards, next_states, dones)
    def __len__(self):
        #get the memory length
        return len(self.memory)
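
The buffer relies on two standard-library behaviours: a `deque` with `maxlen` silently drops its oldest entry once full, and `random.sample` draws uniformly without replacement. A minimal sketch with a deliberately tiny capacity and dummy integer states (both are assumptions for illustration, not the notebook's settings):

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

random.seed(0)
memory = deque(maxlen=5)  # tiny capacity, for illustration only

# Add 8 dummy transitions; states are plain integers here, not real observations.
for step in range(8):
    memory.append(Experience(step, 0, 0.0, step + 1, False))

# Only the 5 most recent transitions survive.
stored_states = [e.state for e in memory]
print(stored_states)  # [3, 4, 5, 6, 7]

# Uniform sampling without replacement, the same call Replay_buffer.sample() makes.
batch = random.sample(list(memory), k=3)
print(len(batch))  # 3
```

Because sampling is without replacement, `random.sample` raises `ValueError` when `k` exceeds the number of stored experiences, so a buffer like this should only be sampled once it holds at least `batch_size` entries.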

In [12]:
#define the dqn agent

class DQNAgent():
    
    def __init__(self, state_size, action_size, seed):
        self.state_size=state_size
        self.action_size=action_size
        self.seed=seed
        random.seed(seed)  #seed Python's RNG with the value passed in
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.memory= Replay_buffer(self.action_size, buffer_size, batch_size, seed)
        
        #instantiate dqn models for the local and target model 
          
        self.model = DQNmodel(state_size, action_size, seed).to(device)
        self.target_model=DQNmodel(state_size, action_size, seed).to(device)
        self.optimizer = optim.Adam(self.model.parameters(),LR)
        
        # Initialize time step (for learning every update_every steps)
        self.t_step = 0
   
    def step(self, state, action, reward, next_state, done):
        #save the experience replay in the memory
        self.memory.add(state, action, reward, next_state, done)
        # Learn every update_every time steps.
        self.t_step = (self.t_step + 1) % update_every
        if self.t_step == 0:
            # If enough samples are available in memory, get random subset and learn
            if len(self.memory)>batch_size:
                experiences=self.memory.sample()
                self.learn(experiences)   
    def act(self,state,eps):
        
        #convert the state to a float tensor, add a batch dimension and move it to the selected device
        state=torch.from_numpy(state).float().unsqueeze(0).to(device)
        #evaluation mode
        self.model.eval()
        with torch.no_grad():
            action_values=self.model(state)
        self.model.train()    
        #select the action epsilon-greedily: exploit with probability 1-eps, explore otherwise
        if random.random()>eps:
            chosen_action=np.argmax(action_values.cpu().data.numpy())
        else:
            chosen_action=random.choice(np.arange(self.action_size))

        return chosen_action
    def learn(self,experiences):
        
        s,a,r,n_s,done=experiences
        # Get max predicted Q values (for next states) from target model   
        future_Q_target=self.target_model(n_s).detach().max(1)[0].unsqueeze(1)
        # Compute Q targets for current states
        Q_target=r + (gamma *  future_Q_target* (1 - done))

        #get expected q values in the local model
        Q_expected=self.model(s).gather(1, a)
        #define loss function
        loss=F.mse_loss(Q_expected, Q_target) 
        #optimize loss
        #clear gradients
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        #soft-update the target model toward the local model weights
        self.update_weights()
    def update_weights(self):
        parameters=zip(self.model.parameters(), self.target_model.parameters())
        for local_param,target_param in parameters:
            #θ_target = τ*θ_local + (1 - τ)*θ_target
            update_param=TAU*local_param.data+(1.0-TAU)*target_param.data
            target_param.data.copy_(update_param)       
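
The target computed in `learn` can be checked by hand on a single transition. The sketch below evaluates r + γ · max Q_target(s', a') · (1 − done) in plain Python; the Q-values are made-up numbers and `td_target` is a hypothetical helper, so this only illustrates the arithmetic and is not the agent's code.

```python
gamma = 0.95  # discount factor from the parameters cell

def td_target(reward, next_q_values, done, gamma):
    """Compute r + gamma * max_a Q_target(s', a) * (1 - done) for one transition."""
    return reward + gamma * max(next_q_values) * (1.0 - done)

# Hypothetical target-network outputs for the next state (one value per action).
next_q = [0.2, 1.0, -0.3, 0.5]

mid_episode = td_target(reward=1.0, next_q_values=next_q, done=0.0, gamma=gamma)
terminal = td_target(reward=1.0, next_q_values=next_q, done=1.0, gamma=gamma)
print(mid_episode, terminal)
```

For the terminal transition the bootstrap term is multiplied by zero, so the target collapses to the immediate reward.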
    

        

In [13]:
#get the state size and action_space size
state_space_size=len(env_info.vector_observations[0])
action_space_size=brain.vector_action_space_size
print(state_space_size)
print(action_space_size)

37
4


In [14]:
#instantiate dqnagent
agent= DQNAgent(state_size=state_space_size,action_size=action_space_size,seed=0)

In [15]:
#train  agent
n_episodes=1000
max_timesteps=2000
#define the initial epsilon, its decay rate and the minimum epsilon
eps_start=1.0
eps_min=0.01
eps_decay=0.99
#list containing the score obtained in each episode
rewards_list=[]
scores_window=deque(maxlen=100) 
for episode in range(1,n_episodes+1):
    env_info = env.reset(train_mode=True)[brain_name] # reset the environment
    state = env_info.vector_observations[0]     #get current state
    score=0          #instantiate the score
    
    for t in range(max_timesteps):
        action=agent.act(state,eps=eps_start)
        env_info = env.step(action)[brain_name]        # send the action to the environment
        next_state = env_info.vector_observations[0]   # get the next state
        reward = env_info.rewards[0]                   # get the reward
        done = env_info.local_done[0]                  #check if episode is finished 
        agent.step(state, action, reward, next_state, done)
        state = next_state                             # roll over the state to next time step
        score+=reward  # update the score
        if done:  #exit loop when the episode is finished
            break
    
    #append the episode score to the rewards list and the 100-episode window
    rewards_list.append(score)
    scores_window.append(score) 
    #epsilon decay
    eps_start= max(eps_min,eps_decay*eps_start)
    print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_window)), end="")
    if episode % 100 == 0:
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_window)))
    if np.mean(scores_window)>=13.0:
        print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(episode-100, np.mean(scores_window)))
        torch.save(agent.model.state_dict(), 'checkpoint.pth')
        break

Episode 100	Average Score: 1.44
Episode 200	Average Score: 6.60
Episode 300	Average Score: 10.73
Episode 372	Average Score: 13.02
Environment solved in 272 episodes!	Average Score: 13.02
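
The exploration schedule in the training cell multiplies epsilon by `eps_decay` after every episode and clips it at `eps_min`. A minimal sketch of the closed form of that schedule follows; `epsilon_at` and `decays_to_floor` are names introduced here for illustration, and the closed form may differ from the loop's repeated multiplication by floating-point rounding.

```python
import math

eps_start, eps_min, eps_decay = 1.0, 0.01, 0.99  # values from the training cell

def epsilon_at(episode):
    """Epsilon used during a 1-indexed episode under multiplicative decay with a floor."""
    return max(eps_min, eps_start * eps_decay ** (episode - 1))

# Number of decays after which the floor eps_min is reached.
decays_to_floor = math.ceil(math.log(eps_min / eps_start) / math.log(eps_decay))

for ep in (1, 100, 372):
    print(ep, round(epsilon_at(ep), 3))
print(decays_to_floor)
```

With these settings the floor is only reached after several hundred episodes, so the run above stopped while epsilon was still slightly above `eps_min`.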


In [None]:
# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(rewards_list)), rewards_list)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()
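
Per-episode scores are noisy, so a trailing average can make the trend easier to read and mirrors the 100-episode window used as the solving criterion. A minimal sketch in plain Python; `rolling_mean` is a hypothetical helper and the demo scores are made up purely to show the call.

```python
def rolling_mean(values, window=100):
    """Mean of the last `window` values at each position (fewer at the start)."""
    means = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            running_sum -= values[i - window]  # drop the value leaving the window
        means.append(running_sum / min(i + 1, window))
    return means

# Made-up scores, only to show the call; in the notebook pass rewards_list instead.
demo_scores = [0.0, 2.0, 4.0, 6.0]
print(rolling_mean(demo_scores, window=2))  # [0.0, 1.0, 3.0, 5.0]
```

Plotting `rolling_mean(rewards_list)` on the same axes as the raw scores would show when the average crosses the 13.0 threshold.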