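# **A Simple Implementation of an Actor-Critic Algorithm**
## **Algorithm Overview**
- A simple implementation of a fairly typical Actor-Critic (AC) algorithm.
- This family of algorithms combines value-based and policy-based methods. The basic pattern is a model with two kinds of output heads. One outputs a value (state value, action value, TD error, advantage, or another value signal; the code below uses the TD error). The other outputs the action distribution directly (softmax or log_softmax). The main motivation is to widen the range of problems that policy-based methods can handle: policy-based methods usually approximate the value with Monte Carlo (MC) rollouts, which is not possible when an episode can be infinitely long, whereas a value network can still provide an estimate.
- The network directly outputs the action (a softmax over discrete actions, or the continuous action itself) together with a value.
- AC algorithms come in both on-policy and off-policy variants.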
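
To make the two-head idea above concrete, here is a minimal plain-Python sketch of the quantities involved in a single update step: the one-step TD error, the critic loss (squared TD error), and the actor loss (negative log-probability of the chosen action weighted by the TD error). The function names and the numbers are illustrative assumptions, not taken from the notebook; only the discount factor 0.9 matches the `GAMMA` used in the code below.

```python
import math

GAMMA = 0.9  # same discount factor as the notebook's self.GAMMA


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def td_error(reward, value, next_value, gamma=GAMMA):
    """One-step TD error: delta = r + gamma * V(s') - V(s)."""
    return reward + gamma * next_value - value


def one_step_losses(logits, action, reward, value, next_value):
    """Return (delta, critic_loss, actor_loss) for a single transition.

    Hypothetical helper for illustration: the critic minimises delta^2,
    the actor minimises -log pi(a|s) * delta with delta held constant.
    """
    delta = td_error(reward, value, next_value)
    log_prob = math.log(softmax(logits)[action])
    critic_loss = delta ** 2
    actor_loss = -log_prob * delta
    return delta, critic_loss, actor_loss


# Made-up numbers: two equally likely actions, V(s) = 0.5, V(s') = 0.6.
delta, critic_loss, actor_loss = one_step_losses(
    logits=[0.0, 0.0], action=1, reward=0.01, value=0.5, next_value=0.6
)
```

A positive TD error means the transition turned out better than the critic expected, so minimising the actor loss raises the probability of the chosen action; a negative TD error lowers it.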

Paper link: *https://docs.popo.netease.com/docs/502c4f53c35444afbce8ef384a2474ed* (theoretical basis of the AC algorithm)

In [1]:
import time
import platform

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [2]:
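import os

if platform.system() == "Darwin" and torch.backends.mps.is_available():
    # Let ops that MPS does not support fall back to the CPU. This is an
    # environment variable, not a Python variable; exporting it in the shell
    # before launching the notebook is the most reliable way to set it.
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
    device = "mps"
else:
    device = "cuda" if torch.cuda.is_available() else "cpu"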

In [3]:
class ActorCriticTrainer(nn.Module):
    def __init__(self, env):
        super(ActorCriticTrainer, self).__init__()
        self.state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.n
        self.create_training_network()
        self.create_training_method()
        self.GAMMA = 0.9
        self.to(device)

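    def create_training_network(self):
        # Two heads on one shared first layer: `self.fc` is the same module in
        # both Sequentials, so actor and critic gradients both update it.
        self.fc = nn.Linear(self.state_dim, 20)
        self.critic = nn.Sequential(self.fc, nn.ReLU(), nn.Linear(20, 1))
        self.actor = nn.Sequential(self.fc, nn.ReLU(), nn.Linear(20, self.action_dim))

    def create_training_method(self):
        # A single optimizer updates the shared layer and both heads.
        self.optim = optim.Adam(self.parameters(), lr=0.001)
        # Not used by train_loop, which computes both loss terms by hand.
        self.value_loss = nn.MSELoss()
        self.actor_loss = nn.LogSoftmax(dim=-1)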
    
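    def choose_action(self, state):
        # Sample an action from the current policy; no gradients are needed here.
        with torch.no_grad():
            # float32 keeps the input compatible with the network weights (and with MPS).
            state = torch.as_tensor(state, dtype=torch.float32, device=device)
            action_probs = F.softmax(self.actor(state), dim=-1)
            action = torch.multinomial(action_probs, 1).item()
            return action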

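    def calculate_td_error(self, state, reward, next_state):
        # One-step TD error: delta = r + GAMMA * V(s') - V(s).
        # next_value is not detached, so the squared TD error in train_loop
        # also backpropagates through V(s').
        next_value = self.critic(next_state)
        value = self.critic(state)
        td_error = reward + self.GAMMA * next_value - value
        return td_error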

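    def calculate_policy_loss(self, state, action, td_error):
        # Returns log pi(a|s) * td_error, the quantity to maximise;
        # train_loop subtracts it from the loss.
        action_logits = self.actor(state)
        # log_softmax is numerically more stable than log(softmax(...)).
        log_probs = F.log_softmax(action_logits, dim=-1)
        # Pick the log-probability of the action taken (index of shape [1]).
        action_log_probs = torch.gather(log_probs, -1, action.reshape(-1))
        return action_log_probs * td_error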

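    def calculate_entropy_loss(self, state, action):
        # Entropy of the policy at `state`. `action` is unused and kept only
        # so that existing calls stay valid.
        action_logits = self.actor(state)
        # log_softmax avoids log(0) when a probability underflows.
        log_probs = F.log_softmax(action_logits, dim=-1)
        probs = log_probs.exp()
        entropy = -torch.sum(probs * log_probs)
        return entropy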
    
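    def train_loop(self, state, action, reward, state_next):
        # One gradient step on a single transition (s, a, r, s').
        state = torch.as_tensor(state, dtype=torch.float32, device=device)
        action = torch.as_tensor(action, dtype=torch.long, device=device)
        reward = torch.as_tensor(reward, dtype=torch.float32, device=device)
        next_state = torch.as_tensor(state_next, dtype=torch.float32, device=device)
        td_error = self.calculate_td_error(state, reward, next_state)
        value_loss = torch.square(td_error)
        # .item() turns the TD error into a constant, so the actor term sends
        # no gradient through the critic's value estimate.
        policy_loss = self.calculate_policy_loss(state, action, td_error.item())
        # Computed but not added to `loss`.
        entropy_loss = self.calculate_entropy_loss(state, action)
        # Minimise the squared TD error and maximise log pi(a|s) * td_error.
        loss = (value_loss - policy_loss).sum()
        self.optim.zero_grad()
        loss.backward()
        self.optim.step()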


In [4]:
import gym
env_name = "CartPole-v1"
env = gym.make(env_name)
agent = ActorCriticTrainer(env)

In [5]:
def main():
    start_time = time.time()
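    for episode in range(3000):
        state, _ = env.reset()
        for step in range(300):  # cap each training episode at 300 steps
            action = agent.choose_action(state)
            # gym >= 0.26 returns (obs, reward, terminated, truncated, info).
            next_state, reward, terminated, truncated, _ = env.step(action)
            # Shaped training reward: -1 when the episode terminates, 0.01 otherwise.
            reward = -1 if terminated else 0.01
            agent.train_loop(state, action, reward, next_state)
            state = next_state
            if terminated or truncated:
                break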
        
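        # Every 100 episodes, run 10 evaluation episodes with the environment's
        # own reward and report the average return (printed as "total reward").
        if episode % 100 == 0 and episode != 0:
            total_reward = 0
            for i in range(10):
                state, _ = env.reset()
                for step in range(300):
                    action = agent.choose_action(state)
                    next_state, reward, terminated, truncated, _ = env.step(action)
                    total_reward += reward
                    state = next_state
                    if terminated or truncated:
                        break
            print(f"episode {episode} total reward is {total_reward/10}")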

    end_time = time.time()
    print(f"total time is {end_time - start_time}")

In [6]:
if __name__ == "__main__":
    main()



episode 100 total reward is 40.5
episode 200 total reward is 35.5
episode 300 total reward is 77.3
episode 400 total reward is 119.6
episode 500 total reward is 203.4


KeyboardInterrupt: 

## Experiment Log
2023-3-6
1. Added entropy loss
Result: simply adding entropy_loss slows down the model's convergence.
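
The log entry above does not say how the entropy term was combined with the other losses. One common arrangement, sketched here in plain Python purely as an assumption, is to subtract the policy entropy scaled by a small coefficient, so that the strength of the exploration bonus can be tuned separately. The coefficient `beta`, its default of 0.01, and the helper names are hypothetical and do not come from the notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; terms with p == 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def loss_with_entropy_bonus(value_loss, policy_term, probs, beta=0.01):
    """Hypothetical combined loss: value_loss - policy_term - beta * H(pi).

    `policy_term` plays the role of log pi(a|s) * td_error, which the
    notebook subtracts from the loss. `beta` is an assumed coefficient.
    """
    return value_loss - policy_term - beta * entropy(probs)


# A uniform policy has higher entropy than a peaked one, so with the same
# value and policy terms it receives the lower loss.
uniform = softmax([0.0, 0.0])
peaked = softmax([5.0, 0.0])
loss_uniform = loss_with_entropy_bonus(0.0025, 0.0, uniform)
loss_peaked = loss_with_entropy_bonus(0.0025, 0.0, peaked)
```

With `beta = 0` the expression reduces to the loss used in `train_loop`; a larger `beta` pushes the policy toward uniform, which is one plausible reason an unscaled entropy term could slow convergence.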
