update online docs

snowkylin · Oct 13, 2019 · c02e3c3
1 parent 53577e5
commit c02e3c3

Showing 17 changed files with 151 additions and 131 deletions.
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -2,6 +2,6 @@
     "restructuredtext.builtDocumentationPath": "build/html",
     "restructuredtext.updateOnTextChanged"    : "true",
     "restructuredtext.sphinxBuildPath": "sphinx-build",
-    "python.pythonPath": "C:\\Users\\xihan\\Anaconda3\\envs\\tf2.0-beta\\python.exe",
+    "python.pythonPath": "C:\\Users\\xihan\\AppData\\Local\\Continuum\\anaconda3\\envs\\tf2.0-py37\\python.exe",
     "restructuredtext.confPath": "c:\\Users\\xihan\\Desktop\\TensorFlow-cn\\source"
 }
diff --git a/docs/.doctrees/en/basic/models.doctree b/docs/.doctrees/en/basic/models.doctree
diff --git a/docs/.doctrees/environment.pickle b/docs/.doctrees/environment.pickle
diff --git a/docs/.doctrees/index.doctree b/docs/.doctrees/index.doctree
diff --git a/docs/.doctrees/zh/basic/models.doctree b/docs/.doctrees/zh/basic/models.doctree
diff --git a/docs/_sources/index.rst.txt b/docs/_sources/index.rst.txt
@@ -183,5 +183,5 @@ GitHub: https://github.com/snowkylin/tensorflow-handbook
 
     .. raw:: html
 
-        <a href="https://info.flagcounter.com/Hyjs"><img src="https://s05.flagcounter.com/count2/Hyjs/bg_FFFFFF/txt_000000/border_CCCCCC/columns_2/maxflags_16/viewers_0/labels_1/pageviews_1/flags_0/percent_0/" alt="Flag Counter" border="0"></a>
+        <a href="https://info.flagcounter.com/Hyjs"><img src="https://s05.flagcounter.com/count2/Hyjs/bg_FFFFFF/txt_000000/border_CCCCCC/columns_2/maxflags_6/viewers_0/labels_1/pageviews_1/flags_0/percent_0/" alt="Flag Counter" border="0"></a>
 
diff --git a/docs/_sources/zh/basic/models.rst.txt b/docs/_sources/zh/basic/models.rst.txt
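@@ -499,7 +499,7 @@ Keras models are presented as classes; by inheriting from ``tf.keras.Model``
     - `Demystifying Deep Reinforcement Learning <https://ai.intel.com/demystifying-deep-reinforcement-learning/>`_ (`Chinese translation <https://snowkylin.github.io/rl/2017/01/04/Reinforcement-Learning.html>`_)
     - [Mnih2013]_
 
-Here we use deep reinforcement learning to play the CartPole (balancing pole) game. Put simply, the model must control the left-right motion of the pole so that it stays balanced upright.
+Here we use deep reinforcement learning to play the CartPole (inverted pendulum) game. The inverted pendulum is a classic problem in control theory. In this game, the bottom of a pole is attached to a cart by a pivot, and the pole's center of gravity lies above the pivot, so the system is unstable: under gravity, the pole falls over easily. We must move the cart left and right along a horizontal track so that the pole stays balanced upright.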
 
 .. only:: html
 
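@@ -517,7 +517,7 @@ Keras models are presented as classes; by inheriting from ``tf.keras.Model``
 
         The CartPole game
 
-We use the CartPole environment from `the Gym environment library released by OpenAI <https://gym.openai.com/>`_. For detailed installation steps and tutorials, see the `official documentation <https://gym.openai.com/docs/>`_ and `here <https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/4-4-gym/>`_. The basic Gym calls are as follows:
+We use the CartPole environment from `the Gym environment library released by OpenAI <https://gym.openai.com/>`_, which can be installed with ``pip install gym``. For detailed installation steps and tutorials, see the `official documentation <https://gym.openai.com/docs/>`_ and `here <https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/4-4-gym/>`_. Interacting with Gym is much like playing a turn-based game. We first obtain the initial state of the game (for example, the initial angle of the pole and the position of the cart). Then, at each step t, we choose one of the currently available actions and hand it to Gym to execute (for example, pushing the cart left or right; only one of the two can be chosen per step). After executing the action, Gym returns the next state and the reward earned in the current step (for example, after we push the cart left, the cart sits further left and the pole tilts further right, and Gym returns the new angle and position to us; if the pole has still not fallen in this step, Gym also returns a small positive reward). This process iterates until the game ends (for example, when the pole falls). In Python, the basic Gym calls are as follows: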
 
 .. code-block:: python
 
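@@ -532,11 +532,22 @@ Keras models are presented as classes; by inheriting from ``tf.keras.Model``
         if done:                        # exit the loop if the game is over
             break
 
-So our task is to train a model that, given the current state, predicts a good action to take. Roughly speaking, a good action should maximize the sum of rewards obtained over the whole game, which is also the goal of reinforcement learning.
+So our task is to train a model that, given the current state, predicts a good action to take. Roughly speaking, a good action should maximize the sum of rewards obtained over the whole game, which is also the goal of reinforcement learning. Taking CartPole as an example, our goal is to choose suitable actions so that the pole never falls, that is, so that the game lasts as many steps as possible. Each step yields a small positive reward, so the more steps there are, the higher the cumulative reward. Maximizing the sum of rewards during the game is therefore consistent with our ultimate goal.
 
-The following code shows how to train the model with Deep Q-Learning, a deep reinforcement learning method.
+The following code shows how to train the model with Deep Q-Learning, a deep reinforcement learning method. First, we import TensorFlow, Gym, and some common libraries, and define some model hyperparameters: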
 
 .. literalinclude:: /_static/code/zh/model/rl/rl.py
+    :lines: 1-14
+
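+Next, we use ``tf.keras.Model`` to build a Q-function network (Q-network) that approximates the Q function in Q-learning. Here we use a relatively simple multilayer fully connected neural network. The network takes the current state as input and outputs the Q-value of each action (2-dimensional for CartPole: pushing the cart left and pushing it right).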
+
+.. literalinclude:: /_static/code/zh/model/rl/rl.py
+    :lines: 16-31
+
+Finally, we implement the Q-learning algorithm in the main program.
+
+.. literalinclude:: /_static/code/zh/model/rl/rl.py
+    :lines: 34-82
 
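 For different tasks (or, put differently, environments), we need to design different states according to the characteristics of the task and choose a suitable network to approximate the Q function. For example, in the classic brick-breaking game (`Breakout-v0 <https://gym.openai.com/envs/Breakout-v0/>`_ in the Gym environment library), every executed action (moving the paddle left, moving it right, or leaving it still) returns a ``210 * 160 * 3`` RGB image representing the current screen. To design a suitable state representation for this task, we make the following analysis:
 
@@ -546,9 +557,9 @@ Keras models are presented as classes; by inheriting from ``tf.keras.Model``
 
 Since we need to extract features from image information, a CNN is a better fit as the network approximating the Q function. Thus, by replacing the ``QNetwork`` above with a CNN and making some modifications to the state, the code can be used to play some simple video games.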
 
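-.. admonition:: A first look at the principles of reinforcement learning
+.. admonition:: A first look at the principles of deep reinforcement learning
 
-    Reinforcement learning
+    Unlike the convolutional and recurrent neural networks introduced earlier, reinforcement learning is a type of learning algorithm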
 
 Keras Pipeline *
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
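
Note on the models.rst.txt change above: the turn-based interaction it describes (get an initial state, pick an action each step, receive the next state and a reward, stop when the game ends) can be tried without installing Gym. The sketch below is not part of the commit. `ToyCartEnv`, `run_episode` and `push_toward_center` are hypothetical names, and the dynamics are invented for illustration rather than CartPole physics; only the `reset()` / `step(action)` call shape mirrors what rl.py uses.

```python
import random


class ToyCartEnv:
    """Hypothetical stand-in for a Gym environment (not the real CartPole)."""

    def __init__(self, limit=3, seed=0):
        self.limit = limit                  # episode ends once |position| exceeds this
        self.rng = random.Random(seed)
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position                # initial state

    def step(self, action):
        # action 0 pushes left, action 1 pushes right;
        # a random drift of -1, 0 or +1 stands in for the instability of the pole
        push = 1 if action == 1 else -1
        self.position += push + self.rng.choice([-1, 0, 1])
        done = abs(self.position) > self.limit
        reward = 1.0                        # small positive reward for every step played
        return self.position, reward, done, {}


def run_episode(env, policy, max_steps=1000):
    """Play one episode and return the sum of rewards collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done, info = env.step(action)
        total_reward += reward
        if done:                            # game over: leave the loop
            break
    return total_reward


def push_toward_center(state):
    """Push left when right of center, otherwise push right."""
    return 0 if state > 0 else 1


total = run_episode(ToyCartEnv(), push_toward_center)
```

In this toy, `push_toward_center` can never leave the allowed range, so it collects the full `max_steps` reward, whereas a policy that always pushes right will typically end the episode after a few steps. That is the same link between "more steps survived" and "higher total reward" that the added paragraph spells out.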

diff --git a/docs/_static/basic.css b/docs/_static/basic.css
@@ -520,24 +520,23 @@ dl.citation > dd:after {
 }
 
 dl.field-list {
-    display: grid;
-    grid-template-columns: fit-content(30%) auto;
+    display: flex;
+    flex-wrap: wrap;
 }
 
 dl.field-list > dt {
+    flex-basis: 20%;
     font-weight: bold;
     word-break: break-word;
-    padding-left: 0.5em;
-    padding-right: 5px;
 }
 
 dl.field-list > dt:after {
     content: ":";
 }
 
 dl.field-list > dd {
-    padding-left: 0.5em;
-    margin-top: 0em;
+    flex-basis: 70%;
+    padding-left: 1em;
     margin-left: 0em;
     margin-bottom: 0em;
 }

diff --git a/docs/_static/code/zh/model/rl/rl.py b/docs/_static/code/zh/model/rl/rl.py
@@ -4,17 +4,15 @@
 import random
 from collections import deque
 
-num_episodes = 500
-num_exploration_episodes = 100
-max_len_episode = 1000
-batch_size = 32
-learning_rate = 1e-3
-gamma = 1.
-initial_epsilon = 1.
-final_epsilon = 0.01
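+num_episodes = 500              # total number of training episodes
+num_exploration_episodes = 100  # number of episodes spent on exploration
+max_len_episode = 1000          # maximum number of steps per episode
+batch_size = 32                 # batch size
+learning_rate = 1e-3            # learning rate
+gamma = 1.                      # discount factor
+initial_epsilon = 1.            # exploration rate at the start of exploration
+final_epsilon = 0.01            # exploration rate at the end of exploration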
 
-
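-# The Q-network approximates the Q function and is similar to the multilayer perceptron in the previous section. It takes the state as input and outputs the Q-value of each action (2-dimensional for CartPole).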
 class QNetwork(tf.keras.Model):
     def __init__(self):
         super().__init__()
@@ -37,49 +35,48 @@ def predict(self, inputs):
     env = gym.make('CartPole-v1')       # instantiate a game environment; the argument is the game name
     model = QNetwork()
     optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
-    replay_buffer = deque(maxlen=10000)
+    replay_buffer = deque(maxlen=10000) # use a deque as the experience replay buffer for Q-learning
     epsilon = initial_epsilon
     for episode_id in range(num_episodes):
         state = env.reset()             # reset the environment and get the initial state
-        epsilon = max(
+        epsilon = max(                  # compute the current exploration rate
             initial_epsilon * (num_exploration_episodes - episode_id) / num_exploration_episodes,
             final_epsilon)
         for t in range(max_len_episode):
             env.render()                                # render the current frame and draw it to the screen
-            if random.random() < epsilon:               # epsilon-greedy exploration strategy
-                action = env.action_space.sample()      # choose a random action with probability epsilon
+            if random.random() < epsilon:               # epsilon-greedy exploration strategy: choose a random action with probability epsilon
+                action = env.action_space.sample()      # choose a random action (exploration)
             else:
-                action = model.predict(
-                    tf.constant(np.expand_dims(state, axis=0), dtype=tf.float32)).numpy()
+                action = model.predict(np.expand_dims(state, axis=0)).numpy()   # choose the action with the largest Q-value computed by the model
                 action = action[0]
 
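             # have the environment execute the action; get the next state, the reward, whether the game has ended, and extra info
             next_state, reward, done, info = env.step(action)
             # if the game is over, give a large negative reward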
             reward = -10. if done else reward
-            # 将(state, action, reward, next_state)的四元组（外加done标签表示是否结束）放入经验重放池
+            # 将(state, action, reward, next_state)的四元组（外加 done 标签表示是否结束）放入经验回放池
             replay_buffer.append((state, action, reward, next_state, 1 if done else 0))
-            # 更新当前state
+            # 更新当前 state
             state = next_state
 
-            if done:                                    # 游戏结束则退出本轮循环，进行下一个episode
+            if done:                                    # 游戏结束则退出本轮循环，进行下一个 episode
                 print("episode %d, epsilon %f, score %d" % (episode_id, epsilon, t))
                 break
 
             if len(replay_buffer) >= batch_size:
-                # 从经验回放池中随机取一个批次的四元组，并分别转换为NumPy数组
+                # 从经验回放池中随机取一个批次的四元组，并分别转换为 NumPy 数组
                 batch_state, batch_action, batch_reward, batch_next_state, batch_done = zip(
                     *random.sample(replay_buffer, batch_size))
                 batch_state, batch_reward, batch_next_state, batch_done = \
                     [np.array(a, dtype=np.float32) for a in [batch_state, batch_reward, batch_next_state, batch_done]]
                 batch_action = np.array(batch_action, dtype=np.int32)
 
-                q_value = model(tf.constant(batch_next_state, dtype=tf.float32))
-                y = batch_reward + (gamma * tf.reduce_max(q_value, axis=1)) * (1 - batch_done)  # compute y as in the paper
+                q_value = model(batch_next_state)
+                y = batch_reward + (gamma * tf.reduce_max(q_value, axis=1)) * (1 - batch_done)  # compute y
                 with tf.GradientTape() as tape:
-                    loss = tf.keras.losses.mean_squared_error(  # 最小化y和Q-value的距离
+                    loss = tf.keras.losses.mean_squared_error(  # 最小化 y 和 Q-value 的距离
                         y_true=y,
-                        y_pred=tf.reduce_sum(model(tf.constant(batch_state)) * tf.one_hot(batch_action, depth=2), axis=1)
+                        y_pred=tf.reduce_sum(model(batch_state) * tf.one_hot(batch_action, depth=2), axis=1)
                     )
                 grads = tape.gradient(loss, model.variables)
                 optimizer.apply_gradients(grads_and_vars=zip(grads, model.variables))       # compute gradients and update parameters
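
Note on the rl.py change above: the two pieces of arithmetic it annotates, the linearly decaying exploration rate and the Q-learning target y, can be checked in isolation. The sketch below is not part of the commit. It reuses the hyperparameter values from the diff but swaps NumPy/TensorFlow for plain Python lists so that it runs without dependencies; the helper names `epsilon_at`, `td_target` and `chosen_q` are hypothetical.

```python
NUM_EXPLORATION_EPISODES = 100      # same values as in the diff above
INITIAL_EPSILON = 1.0
FINAL_EPSILON = 0.01
GAMMA = 1.0


def epsilon_at(episode_id):
    """Exploration rate: linear decay from INITIAL_EPSILON, floored at FINAL_EPSILON."""
    decayed = (INITIAL_EPSILON
               * (NUM_EXPLORATION_EPISODES - episode_id)
               / NUM_EXPLORATION_EPISODES)
    return max(decayed, FINAL_EPSILON)


def td_target(rewards, next_q_values, dones, gamma=GAMMA):
    """y = r + gamma * max_a Q(s', a) for non-terminal steps, and y = r when done is 1."""
    return [r + gamma * max(q) * (1 - d)
            for r, q, d in zip(rewards, next_q_values, dones)]


def chosen_q(q_values, actions):
    """Q(s, a) of the action actually taken (what the one-hot mask selects in rl.py)."""
    return [q[a] for q, a in zip(q_values, actions)]


# Two transitions: the first continues, the second ends the game (reward -10, done = 1).
y = td_target(rewards=[1.0, -10.0],
              next_q_values=[[0.5, 2.0], [3.0, 1.0]],
              dones=[0, 1])
```

For the terminal transition the `(1 - d)` factor removes the bootstrapped term, so the target is just the reward. With the values from the diff, the exploration rate reaches its floor of 0.01 at episode 100 and stays there for the remaining episodes.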
diff --git a/docs/_static/jquery.js b/docs/_static/jquery.js