# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```

In [2]:
env = UnityEnvironment(file_name="Banana_Windows_x86_64/Banana.exe")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)

Number of agents: 1
Number of actions: 4
States look like: [1.         0.         0.         0.         0.84408134 0.
 0.         1.         0.         0.0748472  0.         1.
 0.         0.         0.25755    1.         0.         0.
 0.         0.74177343 0.         1.         0.         0.
 0.25854847 0.         0.         1.         0.         0.09355672
 0.         1.         0.         0.         0.31969345 0.
 0.        ]
States have length: 37
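
The four action indices listed above are easy to mix up in logs. As a small convenience, they can be given readable names. This is only a sketch: the `Action` enum and the `describe` helper are hypothetical names introduced here and are not part of the Unity ML-Agents API.

```python
from enum import IntEnum


class Action(IntEnum):
    """Readable names for the four discrete action indices (hypothetical helper)."""
    FORWARD = 0
    BACKWARD = 1
    TURN_LEFT = 2
    TURN_RIGHT = 3


def describe(action_index):
    """Return a readable label for an action index, e.g. for logging."""
    return Action(action_index).name.lower().replace("_", " ")


print(describe(2))  # turn left
```

Because `IntEnum` members compare equal to plain integers, a value such as `Action.FORWARD` should be usable wherever the notebook passes an integer action.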


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

When you run this cell, the agent selects an action uniformly at random at each time step.  A window should pop up that lets you observe the agent as it moves through the environment.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
# run 10 episodes with random actions
for i in range(10):
    env_info = env.reset(train_mode=True)[brain_name] # reset the environment
    state = env_info.vector_observations[0]            # get the current state
    score = 0                                          # initialize the score
    while True:
        action = np.random.randint(action_size)        # select an action
        env_info = env.step(action)[brain_name]        # send the action to the environment
        next_state = env_info.vector_observations[0]   # get the next state
        reward = env_info.rewards[0]                   # get the reward
        done = env_info.local_done[0]                  # see if episode has finished
        score += reward                                # update the score
        state = next_state                             # roll over the state to next time step
        if done:                                       # exit loop if episode finished
            break

    print("episode:{}, score: {}".format(i, score))

episode:0, score: 0.0
episode:1, score: 1.0
episode:2, score: -1.0
episode:3, score: 0.0
episode:4, score: 0.0
episode:5, score: 0.0
episode:6, score: -1.0
episode:7, score: 0.0
episode:8, score: -2.0
episode:9, score: 0.0
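
The random policy above ignores the state entirely, which is why the scores hover around zero. One common step toward using experience is an epsilon-greedy rule: act randomly with probability epsilon, otherwise pick the action with the highest estimated value, and shrink epsilon over time. The sketch below is illustrative only. The function names, the linear schedule, and its default values are assumptions, and the Q-values are made up.

```python
import random


def linear_epsilon(step, start=1.0, end=0.01, decay_steps=10_000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Index of the largest Q-value (ties resolve to the lowest index).
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.5, -0.2, 0.3]  # made-up Q-values for the four actions
print(epsilon_greedy(q_values, epsilon=0.0))  # greedy choice: 1
```

Keeping the schedule separate from the action choice makes it easy to swap in a different decay (for example, exponential) without touching the selection logic.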


When finished, you can close the environment.  The call is left commented out here because the cells below reuse the same environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

### Define the DQN

In [7]:
from collections import deque
import random
import numpy as np
import tensorflow as tf
from tensorflow.keras import models, layers, optimizers
import matplotlib.pyplot as plt

import os
import gc
import objgraph


class DQN(object):
    def __init__(self):
        self.step = 0
        self.update_freq = 100  # sync the target network every 100 training steps
        self.replay_size = 2000  # replay buffer capacity
        self.replay_queue = deque(maxlen=self.replay_size)
        self.model = self.create_model()
        self.target_model = self.create_model()

    def create_model(self):
        """Build a neural network with one hidden layer of 100 units."""
        STATE_DIM, ACTION_DIM = 37, 4
        model = models.Sequential([
            layers.Dense(100, input_dim=STATE_DIM, activation='relu'),
            layers.Dense(ACTION_DIM, activation="linear")
        ])
        model.compile(loss='mean_squared_error', optimizer=optimizers.Adam(0.001))
        return model

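    def act(self, s, epsilon=0.1):
        """Choose an action with an epsilon-greedy policy."""
        # Early in training, add some randomness to visit more states; with the
        # default epsilon=0.1 the random branch stops firing after 500 training steps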
        if np.random.uniform() < epsilon - self.step * 0.0002:
            return np.random.choice([0, 1, 2, 3])
        return int(np.argmax(self.model.predict(np.array([s]))[0]))

    def save_model(self, file_path='p1_navigation-dqn.h5'):
        print('model saved')
        self.model.save(file_path)

    def remember(self, s, a, next_s, reward):
        self.replay_queue.append((s, a, next_s, reward))

    def train(self, batch_size=64, lr=1, factor=0.95):
        if len(self.replay_queue) < self.replay_size:
            return
        self.step += 1
        # Every update_freq steps, copy the weights of model to target_model
        if self.step % self.update_freq == 0:
            self.target_model.set_weights(self.model.get_weights())

        replay_batch = random.sample(self.replay_queue, batch_size)
        s_batch = np.array([replay[0] for replay in replay_batch])
        next_s_batch = np.array([replay[2] for replay in replay_batch])

        Q = self.model.predict(s_batch)
        Q_next = self.target_model.predict(next_s_batch)

        # Update the Q-values of the sampled batch with the Q-learning rule
        for i, replay in enumerate(replay_batch):
            _, a, _, reward = replay
            Q[i][a] = (1 - lr) * Q[i][a] + lr * (reward + factor * np.amax(Q_next[i]))

        # Fit the network on the updated targets
        self.model.fit(s_batch, Q, verbose=0)
        
        del Q
        del Q_next
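
The `train` method above uses the target `reward + factor * max(Q_next)` for every transition, because `remember` does not store an end-of-episode flag. With `lr=1`, the update formula reduces to exactly this target. A common refinement is to drop the bootstrap term on terminal transitions, where no next state exists. The sketch below shows that target computation in isolation. It assumes a `done` flag is stored with each transition, which the class above does not do, and `q_targets` is a hypothetical helper. It uses plain Python lists to stay dependency-free.

```python
def q_targets(rewards, next_q_values, dones, gamma=0.95):
    """Compute r + gamma * max_a' Q_target(s', a') per transition.

    The bootstrap term is dropped when the transition ended the episode.
    """
    targets = []
    for reward, next_q, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(next_q)
        targets.append(reward + bootstrap)
    return targets


# Made-up batch of three transitions; the last one is terminal.
rewards = [1.0, 0.0, -1.0]
next_q = [[0.2, 0.4, 0.1, 0.0], [1.0, 0.5, 0.0, 0.2], [0.3, 0.9, 0.1, 0.2]]
dones = [False, False, True]
print(q_targets(rewards, next_q, dones))
```

For the terminal transition the target is just the reward, so the `0.9` in its next-state estimates has no effect.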
    

In [8]:
# Enable GPU memory growth to avoid running out of GPU memory
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

1 Physical GPUs, 1 Logical GPUs


### Training

In [9]:
episodes = 100 # number of training episodes
score_list = [] # scores of all training episodes
agent = DQN()

max_avg_score = 5
for i in range(episodes):
    gc.collect() # free unreferenced objects to limit memory growth
    
    env_info = env.reset(train_mode=True)[brain_name]
    state = env_info.vector_observations[0]
    score = 0
    while True:
        # print(state)
        action = agent.act(state)   
        env_info = env.step(action)[brain_name]
        next_state = env_info.vector_observations[0]
        reward = env_info.rewards[0]
        done = env_info.local_done[0]
             
        agent.remember(state, action, next_state, reward)
        agent.train()
        score += reward
        state = next_state
        if done:
            score_list.append(score)
            print('episode:', i, 'score:', score, 'max', max(score_list))
            objgraph.show_growth()
            break
            
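    # Save the model whenever the best score of the last 5 episodes improves
    # (note: despite its name, avg_score holds a maximum, not a mean)
    avg_score = np.max(score_list[-5:])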
    if avg_score > max_avg_score:
        max_avg_score = avg_score
        agent.save_model()
        # break

# Close the environment after training
# env.close()

plt.plot(score_list, color='green')
plt.show()

episode: 0 score: 0.0 max 0.0
tuple                        473158   +473158
dict                          69542    +69542
list                          67278    +67278
function                      54690    +54690
cell                          14271    +14271
weakref                       12214    +12214
builtin_function_or_method     8642     +8642
Operation                      7528     +7528
_InputList                     7528     +7528
Tensor                         7524     +7524
episode: 1 score: 1.0 max 1.0
tuple            903298   +430140
list             120074    +52796
dict             106258    +36716
Tensor            14760     +7236
TF_Output         14752     +7236
Operation         14764     +7236
_InputList        14764     +7236
set               10021     +4020
Dimension          7102     +3484
TraceableStack     6529     +3216
episode: 2 score: 2.0 max 2.0
tuple           1330229   +426931
list             172477    +52403
dict             142700    +36442
Tensor  

episode: 22 score: 0.0 max 2.0
tuple          31400641  +1744800
list            3980944   +221400
dict            2814830   +155400
Tensor           528537    +29400
TF_Output        528497    +29400
Operation        528557    +29400
_InputList       528557    +29400
set              322535    +18000
Dimension        292710    +16500
TraceableStack   256521    +14400
episode: 23 score: 2.0 max 2.0
tuple          33145443  +1744802
list            4202344   +221400
dict            2970230   +155400
Tensor           557937    +29400
TF_Output        557897    +29400
Operation        557957    +29400
_InputList       557957    +29400
set              340535    +18000
Dimension        309210    +16500
TraceableStack   270921    +14400
episode: 24 score: 0.0 max 2.0
tuple          34890243  +1744800
list            4423744   +221400
dict            3125630   +155400
Tensor           587337    +29400
TF_Output        587297    +29400
Operation        587357    +29400
_InputList       587357

tuple          69786256  +1744813
list            8851743   +221399
dict            6233630   +155400
Tensor          1175337    +29400
TF_Output       1175297    +29400
Operation       1175357    +29400
_InputList      1175357    +29400
set              718535    +18000
Dimension        655710    +16500
TraceableStack   573321    +14400
model saved
episode: 45 score: 5.0 max 8.0
tuple          71531068  +1744812
list            9073143   +221400
dict            6389040   +155410
Tensor          1204737    +29400
TF_Output       1204697    +29400
Operation       1204757    +29400
_InputList      1204757    +29400
set              736536    +18001
Dimension        672210    +16500
TraceableStack   587721    +14400


KeyboardInterrupt: 
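
The checkpoint rule in the training cell compares `np.max` of the last five scores, stored under the name `avg_score`. A single lucky episode can therefore trigger a save. A rolling mean is a common, smoother alternative. The sketch below is an assumption about how one might compute it, not what the notebook does. `rolling_means` is a hypothetical helper and the scores are illustrative values only.

```python
from collections import deque


def rolling_means(scores, window=5):
    """Return the mean of the most recent `window` scores after each episode."""
    recent = deque(maxlen=window)  # old scores fall out automatically
    means = []
    for score in scores:
        recent.append(score)
        means.append(sum(recent) / len(recent))
    return means


scores = [0.0, 1.0, 2.0, 0.0, 2.0, 0.0, 5.0]  # illustrative values only
print(rolling_means(scores, window=5)[-1])  # 1.8
```

During the first few episodes the mean is taken over fewer than `window` scores, so early values are noisier than later ones.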

### Validation: Watch the Trained Agent

In [10]:
model = models.load_model('p1_navigation-dqn.h5')

episodes = 10
score_list = []
for i in range(episodes):
    env_info = env.reset(train_mode=False)[brain_name] # reset the environment
    state = env_info.vector_observations[0]            # get the current state
    score = 0                                          # initialize the score
    while True:
        action = int(np.argmax(model.predict(np.array([state]))[0]))        # select an action
        env_info = env.step(action)[brain_name]        # send the action to the environment
        next_state = env_info.vector_observations[0]   # get the next state
        reward = env_info.rewards[0]                   # get the reward
        done = env_info.local_done[0]                  # see if episode has finished
        score += reward                                # update the score
        state = next_state                             # roll over the state to next time step
        if done:                                       # exit loop if episode finished
            score_list.append(score)
            print("episode:", i ,"score:", score)
            break
print("{} episode, avg score:{}".format(episodes, np.average(score_list)))

episode: 0 score: 1.0
episode: 1 score: 7.0
episode: 2 score: 0.0
episode: 3 score: 2.0
episode: 4 score: 5.0
episode: 5 score: -1.0
episode: 6 score: 1.0
episode: 7 score: 4.0
episode: 8 score: 0.0
episode: 9 score: 1.0
10 episode, avg score:2.0


### Close the Environment

In [11]:
env.close()