### 来源

https://www.cnblogs.com/pinard/p/10272023.html

Actor负责生成动作并和环境交互, Critic使用价值函数来评估Actor的表现并指导Actor下一阶段的动作.
之前策略参数更新公式为:
$$
\theta=\theta+\alpha \nabla_{\theta} \log \pi_{\theta}\left(s_{t}, a_{t}\right) v_{t}
$$
梯度更新部分中，$\nabla_{\theta} \log \pi_{\theta}\left(s_{t}, a_{t}\right)$是我们的分值函数，不用动，要变成Actor的话改动的是$v_t$，这块不能再使用蒙特卡罗法来得到，而应该从Critic得到。

而对于Critic来说，这块是新的，不过我们完全可以参考之前DQN的做法，即用一个Q网络来做为Critic, 这个Q网络的输入可以是状态，而输出是每个动作的价值或者最优动作的价值。
现在我们汇总来说，就是Critic通过Q网络计算状态的最优价值$v_t$, 而Actor利用$v_t$这个最优价值迭代更新策略函数的参数$\theta$,进而选择动作，并得到反馈和新的状态，Critic使用反馈和新的状态更新Q网络参数, 在后面Critic会使用新的网络参数来帮Actor计算状态的最优价值$v_t$.

我们对于Critic评估的点选择是和上一篇策略梯度一样的状态价值$v_t$,实际上，我们还可以选择很多其他的指标来做为Critic的评估点。而目前可以使用的Actor-Critic评估点主要有：

1. 基于状态价值:
   $$
   \theta=\theta+\alpha \nabla_{\theta} \log \pi_{\theta}\left(s_{t}, a_{t}\right) V(s, w)
   $$

2. 基于动作价值:
   $$
   \theta=\theta+\alpha \nabla_{\theta} \log \pi_{\theta}\left(s_{t}, a_{t}\right) Q(s, a, w)
   $$
   
3. 基于TD误差, 它的表达式为$\delta(t)=R_{t+1}+\gamma V\left(S_{t+1}\right)-V\left(S_{t}\right)$或$\delta(t)=R_{t+1}+\gamma Q\left(S_{t+1}, A_{t+1}\right)-Q\left(S_{t}, A_{t}\right)$. 这样Actor的参数更新公式为:
   $$
   \theta=\theta+\alpha \nabla_{\theta} \log \pi_{\theta}\left(s_{t}, a_{t}\right) \delta(t)
   $$

4. 基于优势函数, 优势函数定义为$A(S, A, w, \beta)=Q(S, A, w, \alpha, \beta)-V(S, w, \alpha)$, 即动作价值函数和状态价值函数的差值. 这样Actor的参数更新公式为:
   $$
   \theta=\theta+\alpha \nabla_{\theta} \log \pi_{\theta}\left(s_{t}, a_{t}\right) A(S, A, w, \beta)
   $$

5. 基于$TD(\lambda)$误差, 一般都是基于后向$TD(\lambda)$误差, 是TD误差和效用迹E的乘积。这样Actor的策略函数参数更新的法公式为:
   $$
   \theta=\theta+\alpha \nabla_{\theta} \log \pi_{\theta}\left(s_{t}, a_{t}\right) \delta(t) E_{( } t )
   $$

对于Critic本身的模型参数$w$，一般都是使用均方误差损失函数来做做迭代更新，类似DQN中所讲的迭代方法. 如果我们使用的是最简单的线性Q函数，比如$Q(s, a, w)=\phi(s, a)^{T} w$,则Critic本身的模型参数$w$的更新公式可以表示为：
$$
\begin{array}{c}{\delta=R_{t+1}+\gamma Q\left(S_{t+1}, A_{t+1}\right)-Q\left(S_{t}, A_{t}\right)} \\ {w=w+\beta \delta \phi(s, a)}\end{array}
$$
通过对均方误差损失函数求导可以很容易的得到上式。当然实际应用中，我们一般不使用线性Q函数，而使用神经网络表示状态和Q值的关系。

这里给一个Actor-Critic算法的流程总结，评估点基于TD误差，Critic使用神经网络来计算TD误差并更新网络参数，Actor也使用神经网络来更新网络参数.

---

* 算法输入：迭代轮数$T$，状态特征维度$n$, 动作集$A$, 步长$\alpha,\beta$，衰减因子$\gamma$, 探索率$\epsilon$, Critic网络结构和Actor网络结构.

* 输出：Actor 网络参数$\theta$, Critic网络参数$w$.

1.  随机初始化所有的状态和动作对应的价值Q.  随机初始化Critic网络的所有参数$w$。随机初始化Actor网络的所有参数$\theta$。
2. 从$i$到$T$进行迭代:

　　a) 初始化$S$为当前状态序列的第一个状态, 拿到其特征向量$\phi(S)$

　　b) 在Actor网络中使用$\phi(S)$作为输入，输出动作$A$,基于动作$A$得到新的状态$S'$,反馈$R$.

　　c) 在Critic网络中分别使用$\phi (S)$，$\phi (S')$作为输入，得到Q值输出$V(S),V(S')$.

　　d) 计算TD误差$\delta=R+\gamma V\left(S^{\prime}\right)-V(S)$

　　e) 使用均方差损失函数$\sum\left(R+\gamma V\left(S^{\prime}\right)-V(S, w)\right)^{2}$作Critic网络参数$w$的梯度更新

　　f)  更新Actor网络参数$\theta $:
$$
\theta=\theta+\alpha \nabla_{\theta} \log \pi_{\theta}\left(S_{t}, A\right) \delta
$$
对于Actor的分值函数$\nabla_{\theta} \log \pi_{\theta}\left(S_{t}, A\right)$,可以选择softmax或者高斯分值函数。

---

In [23]:
import gym
import tensorflow as tf
import numpy as np
import random
from collections import deque
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "5"
config = tf.ConfigProto() 
config.gpu_options.per_process_gpu_memory_fraction = 0.2
session = tf.Session(config=config)
# Hyper Parameters
GAMMA = 0.95 # discount factor
LEARNING_RATE=0.01
tf.__version__

'1.12.0'

In [24]:
class Actor:
    def __init__(self, env, sess):
        # init some parameters
        self.time_step = 0
        self.state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.n
        self.create_softmax_network()

        # Init session
        self.session = sess
        self.session.run(tf.global_variables_initializer())

    def create_softmax_network(self):
        # network weights
        W1 = self.weight_variable([self.state_dim, 20])
        b1 = self.bias_variable([20])
        W2 = self.weight_variable([20, self.action_dim])
        b2 = self.bias_variable([self.action_dim])
        # input layer
        self.state_input = tf.placeholder("float", [None, self.state_dim])
        self.tf_acts = tf.placeholder(tf.int32, [None, 2], name="actions_num")
        self.td_error = tf.placeholder(tf.float32, None, "td_error")  # TD_error
        # hidden layers
        h_layer = tf.nn.relu(tf.matmul(self.state_input, W1) + b1)
        # softmax layer
        self.softmax_input = tf.matmul(h_layer, W2) + b2
        # softmax output
        self.all_act_prob = tf.nn.softmax(self.softmax_input, name='act_prob')

        self.neg_log_prob = tf.nn.softmax_cross_entropy_with_logits(logits=self.softmax_input,
                                                                    labels=self.tf_acts)
        self.exp = tf.reduce_mean(self.neg_log_prob * self.td_error)

        # 这里需要最大化当前策略的价值，因此需要最大化self.exp,即最小化-self.exp
        self.train_op = tf.train.AdamOptimizer(LEARNING_RATE).minimize(-self.exp)

    def weight_variable(self, shape):
        initial = tf.truncated_normal(shape)
        return tf.Variable(initial)

    def bias_variable(self, shape):
        initial = tf.constant(0.01, shape=shape)
        return tf.Variable(initial)

    def choose_action(self, observation):
        prob_weights = self.session.run(self.all_act_prob, feed_dict={self.state_input: observation[np.newaxis, :]})
        action = np.random.choice(range(prob_weights.shape[1]),
                                  p=prob_weights.ravel())  # select action w.r.t the actions prob
        return action

    def learn(self, state, action, td_error):
        s = state[np.newaxis, :]
        one_hot_action = np.zeros(self.action_dim)
        one_hot_action[action] = 1
        a = one_hot_action[np.newaxis, :]
        # train on episode
        self.session.run(self.train_op, feed_dict={
            self.state_input: s,
            self.tf_acts: a,
            self.td_error: td_error,
        })

In [25]:
EPSILON = 0.01  # final value of epsilon
REPLAY_SIZE = 10000  # experience replay buffer size
BATCH_SIZE = 32  # size of minibatch
REPLACE_TARGET_FREQ = 10  # frequency to update target Q network

In [26]:
class Critic:
    def __init__(self, env, sess):
        # init some parameters
        self.time_step = 0
        self.epsilon = EPSILON
        self.state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.n

        self.create_Q_network()
        self.create_training_method()

        # Init session
        self.session = sess
        self.session.run(tf.global_variables_initializer())

    def create_Q_network(self):
        # network weights
        W1q = self.weight_variable([self.state_dim, 20])
        b1q = self.bias_variable([20])
        W2q = self.weight_variable([20, 1])
        b2q = self.bias_variable([1])
        self.state_input = tf.placeholder(tf.float32, [1, self.state_dim], "state")
        # hidden layers
        h_layerq = tf.nn.relu(tf.matmul(self.state_input, W1q) + b1q)
        # Q Value layer
        self.Q_value = tf.matmul(h_layerq, W2q) + b2q

    def create_training_method(self):
        self.next_value = tf.placeholder(tf.float32, [1, 1], "v_next")
        self.reward = tf.placeholder(tf.float32, None, 'reward')

        with tf.variable_scope('squared_TD_error'):
            self.td_error = self.reward + GAMMA * self.next_value - self.Q_value
            self.loss = tf.square(self.td_error)
        with tf.variable_scope('train'):
            self.train_op = tf.train.AdamOptimizer(self.epsilon).minimize(self.loss)

    def train_Q_network(self, state, reward, next_state):
        s, s_ = state[np.newaxis, :], next_state[np.newaxis, :]
        v_ = self.session.run(self.Q_value, {self.state_input: s_})
        td_error, _ = self.session.run([self.td_error, self.train_op],
                                       {self.state_input: s, self.next_value: v_, self.reward: reward})
        return td_error

    def weight_variable(self, shape):
        initial = tf.truncated_normal(shape)
        return tf.Variable(initial)

    def bias_variable(self, shape):
        initial = tf.constant(0.01, shape=shape)
        return tf.Variable(initial)

In [27]:
# Hyper Parameters
ENV_NAME = 'CartPole-v0'
EPISODE = 3000  # Episode limitation
STEP = 3000  # Step limitation in an episode
TEST = 10  # The number of experiment test every 100 episode

In [28]:
# initialize OpenAI Gym env and dqn agent
sess = tf.InteractiveSession()
env = gym.make(ENV_NAME)
actor = Actor(env, sess)
critic = Critic(env, sess)

for episode in range(EPISODE):
    # initialize task
    state = env.reset()
    # Train
    for step in range(STEP):
        action = actor.choose_action(state)  # e-greedy action for train
        next_state, reward, done, _ = env.step(action)
        td_error = critic.train_Q_network(state, reward, next_state)  # gradient = grad[r + gamma * V(s_) - V(s)]
        actor.learn(state, action, td_error)  # true_gradient = grad[logPi(s,a) * td_error]
        state = next_state
        if done:
            break

    # Test every 100 episodes
    if episode % 100 == 0:
        total_reward = 0
        for i in range(TEST):
            state = env.reset()
            for j in range(STEP):
#                 env.render()
                action = actor.choose_action(state)  # direct action for test
                state, reward, done, _ = env.step(action)
                total_reward += reward
                if done:
                    break
        ave_reward = total_reward / TEST
        print('episode: ', episode, 'Evaluation Average Reward:', ave_reward)



episode:  0 Evaluation Average Reward: 23.0
episode:  100 Evaluation Average Reward: 10.3
episode:  200 Evaluation Average Reward: 25.1
episode:  300 Evaluation Average Reward: 20.0
episode:  400 Evaluation Average Reward: 15.3
episode:  500 Evaluation Average Reward: 15.5
episode:  600 Evaluation Average Reward: 12.5
episode:  700 Evaluation Average Reward: 12.4
episode:  800 Evaluation Average Reward: 11.9
episode:  900 Evaluation Average Reward: 9.6
episode:  1000 Evaluation Average Reward: 9.4
episode:  1100 Evaluation Average Reward: 9.4
episode:  1200 Evaluation Average Reward: 9.4
episode:  1300 Evaluation Average Reward: 10.1
episode:  1400 Evaluation Average Reward: 9.3
episode:  1500 Evaluation Average Reward: 9.6
episode:  1600 Evaluation Average Reward: 9.2
episode:  1700 Evaluation Average Reward: 9.3
episode:  1800 Evaluation Average Reward: 9.4
episode:  1900 Evaluation Average Reward: 9.4
episode:  2000 Evaluation Average Reward: 9.4
episode:  2100 Evaluation Average Re