## Counterfactual Regret Minimization

The original paper on counterfactual regret minimization:

- [Regret Minimization in Games with Incomplete Information](https://proceedings.neurips.cc/paper/2007/file/08d98638c6fcd194a4b1e6992063e944-Paper.pdf)

A later paper describes Libratus, a system that beat top human professionals at heads-up no-limit poker:

- [Superhuman AI for heads-up no-limit poker: Libratus beats top professionals](https://www.science.org/doi/10.1126/science.aao1733)

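The name `Counterfactual Regret Minimization` (CFR) can be understood one word at a time:

1. Counterfactual: something that did not actually happen but would have happened under other conditions. For example, suppose you and I play rock-paper-scissors and I play rock this round. I could just as well have played scissors or paper, so scissors and paper are the counterfactual actions for this round.

2. Regret: something you did but wish you had not done. For example, if I play rock and you play paper, I lose, so I wish I had not played rock. That wish is the regret.

3. Minimization: the goal of the algorithm. We want the regret to be as small as possible.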

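## Worked Example

Consider a game of rock-paper-scissors.

1. Round 1: I play rock and my opponent also plays rock, so we both receive a reward of 0.

To compute the regret, consider what the other actions would have earned. Had I played scissors, my reward would have been -1 and my opponent's +1. Had I played paper, my reward would have been +1 and my opponent's -1.

For me, the counterfactual rewards are:

| Paper  | Scissors |
| ------ | -------- |
| 1      | -1       |

The regret of each action is its counterfactual reward minus the actual reward (`counterfactual reward - real reward`):

| Rock   | Paper  | Scissors |
| ------ | ------ | -------- |
| 0      | 1      | -1       |


That is one case.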

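2. Round 2: I play rock and my opponent plays paper, so my reward is -1.

For me, the counterfactual rewards are now:

| Paper  | Scissors |
| ------ | -------- |
| 0      | 1        |

The regret of each action is again its counterfactual reward minus the actual reward (`counterfactual reward - real reward`):

| Rock   | Paper  | Scissors |
| ------ | ------ | -------- |
| 0      | 1      | 2        |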
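
The two regret tables above can be reproduced with a few lines of Python. This is a minimal sketch: the names `PAYOFF` and `regrets` are illustrative and do not come from the original text, and the payoff table assumes the action order rock, paper, scissors.

```python
# Payoff to "me": PAYOFF[my_action][opp_action], in the order rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]
ROCK, PAPER, SCISSORS = 0, 1, 2


def regrets(my_action, opp_action):
    """Regret of each action = counterfactual reward - actual reward."""
    actual = PAYOFF[my_action][opp_action]
    return [PAYOFF[a][opp_action] - actual for a in range(3)]


round_1 = regrets(ROCK, ROCK)    # I play rock, opponent plays rock
round_2 = regrets(ROCK, PAPER)   # I play rock, opponent plays paper
print(round_1)  # [0, 1, -1]
print(round_2)  # [0, 1, 2]
```

The regret of the action actually played is always 0, since its counterfactual reward equals the actual reward.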


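## Nash Equilibrium

A `Nash Equilibrium` is a solution concept for non-cooperative games: a set of strategies in which no player can gain by changing their own strategy unilaterally. In rock-paper-scissors the equilibrium is to play each action with probability 1/3, and the average strategy produced by regret minimization needs on the order of 10,000 iterations to get close to it.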
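
As a quick check of what "no player can gain by deviating" means here, the sketch below computes the expected payoff of each pure action against a mixed opponent strategy. The helper name `action_values` and the `rock_heavy` strategy are made up for illustration.

```python
# Payoff to "me": PAYOFF[my_action][opp_action], in the order rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def action_values(opp_strategy):
    """Expected payoff of each of my pure actions against a mixed opponent strategy."""
    return [
        sum(PAYOFF[a][b] * opp_strategy[b] for b in range(3))
        for a in range(3)
    ]


uniform = [1 / 3, 1 / 3, 1 / 3]
rock_heavy = [0.5, 0.25, 0.25]  # hypothetical opponent who favors rock

print(action_values(uniform))     # every action earns 0: no deviation is profitable
print(action_values(rock_heavy))  # paper earns 0.25, so this opponent can be exploited
```

Against the uniform strategy every action has the same expected payoff, so no response does better than any other. Any other strategy leaves at least one action with a positive payoff.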



- [An Introduction to Counterfactual Regret Minimization](http://modelai.gettysburg.edu/2013/cfr/cfr.pdf)

## Implementing CFR for Rock-Paper-Scissors

In [1]:
import numpy as np

In [2]:
class RPSTrainer(object):
    def __init__(self):
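        self.numActions = 3  # number of actions
        self.possibleActions = np.arange(self.numActions)  # action indices: 0 = rock, 1 = paper, 2 = scissors
        self.actionUtility = np.array([
            [0, -1, 1],
            [1, 0, -1],
            [-1, 1, 0]
        ])  # payoff to me: rows are my action, columns are the opponent's action (rock, paper, scissors)
        
        self.regretSum = np.zeros(self.numActions)  # my cumulative regret per action
        self.strategySum = np.zeros(self.numActions)  # sum of my strategies over all iterations
        
        self.oppregretSum = np.zeros(self.numActions)  # opponent's cumulative regret per action
        self.oppstrategySum = np.zeros(self.numActions)  # sum of the opponent's strategies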
    
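    def getStrategy(self, regret_sum):
        """Regret matching: turn cumulative regrets into a strategy."""
        # Work on a clipped copy so the caller's cumulative regrets are not overwritten.
        strategy = np.maximum(regret_sum, 0)
        normalizing_sum = strategy.sum()
        if normalizing_sum > 0:
            strategy = strategy / normalizing_sum
        else:
            # No positive regret yet: play uniformly at random.
            strategy = np.full(self.numActions, 1.0 / self.numActions)
        return strategy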
    
    def getAverageStrategy(self, strategySum):
        average_strategy = [0, 0, 0]
        normalizing_sum = sum(strategySum)
        for a in range(self.numActions):
            if normalizing_sum > 0:
                average_strategy[a] = strategySum[a] / normalizing_sum
            else:
                average_strategy[a] = 1.0 / self.numActions
        return average_strategy
    
    def getAction(self, strategy):
        """Sample an action according to the strategy's probabilities."""
        return np.random.choice(self.possibleActions, p=strategy)
    
    def getReward(self, myAction, opponentAction):
        """Return my reward given my action and the opponent's action."""
        return self.actionUtility[myAction, opponentAction]
    
    def train(self, iterations):
        
        for i in range(iterations):
            strategy = self.getStrategy(self.regretSum)  # my current strategy
            oppStrategy = self.getStrategy(self.oppregretSum)  # opponent's current strategy
            
            self.strategySum += strategy  # accumulate, to compute the average strategy later
            self.oppstrategySum += oppStrategy
            
            opponentAction = self.getAction(oppStrategy)
            myAction = self.getAction(strategy)
            
            myReward = self.getReward(myAction, opponentAction)
            oppReward = self.getReward(opponentAction, myAction)
            
            for a in range(self.numActions):
                # Regret = reward action a would have earned - reward actually earned.
                myRegret = self.getReward(a, opponentAction) - myReward
                oppRegret = self.getReward(a, myAction) - oppReward
                
                self.regretSum[a] += myRegret
                self.oppregretSum[a] += oppRegret

In [3]:
def main():
    trainer = RPSTrainer()
    trainer.train(1000)
    targetPolicy = trainer.getAverageStrategy(trainer.strategySum)
    opp_target_policy = trainer.getAverageStrategy(trainer.oppstrategySum)
    print("player 1 policy {}".format(targetPolicy))
    print("player 2 policy {}".format(opp_target_policy))

In [4]:
if __name__ == "__main__":
    main()

player 1 policy [0.3235306458519336, 0.3438130831862383, 0.3326562709618281]
player 2 policy [0.33752920158710675, 0.3330774095362037, 0.32939338887668956]


## Applying CFR to Sequential Games

### Kuhn Poker

- [Vanilla CFR](https://justinsermeno.com/posts/cfr/)



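Kuhn poker is a two-player, zero-sum game of imperfect information. "Imperfect" means that neither player knows which card the opponent holds.

The deck has only three cards, valued 1, 2 and 3. In each round every player is dealt one card and decides, based on their own judgment, whether to pass or to bet.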


<img src="../images/11-kuhnPoker.png" width="60%">


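Kuhn poker has two players. If player 1 passes and player 2 also passes, the cards are compared and the higher card wins. The decision tree for the whole game is shown in the figure above.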
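
To make the decision tree concrete, the sketch below enumerates every way a hand can end and the payoff to player 1. It assumes the usual Kuhn poker stakes (1 chip won on a showdown after two passes or when the opponent folds, 2 chips on a showdown after a called bet), which match the rewards used in the implementation further down. The function names are illustrative only.

```python
def is_terminal(history):
    """A hand ends after pass-pass, bet-pass (a fold) or bet-bet (a call)."""
    return history[-2:] in ("pp", "bp", "bb")


def terminal_histories(history=""):
    """Enumerate all terminal action sequences by walking the decision tree."""
    if is_terminal(history):
        return [history]
    result = []
    for action in "pb":  # 'p' = pass, 'b' = bet
        result += terminal_histories(history + action)
    return result


def payoff_player1(history, card1, card2):
    """Payoff to player 1 at a terminal history (player 2 receives the negative)."""
    if history[-2:] == "bp":
        # The last player to act folded, so the bettor wins 1 chip.
        bettor_is_player1 = len(history) % 2 == 0
        return 1 if bettor_is_player1 else -1
    stake = 2 if history[-2:] == "bb" else 1  # a called bet doubles the stake
    return stake if card1 > card2 else -stake


# Player 1 holds the 3, player 2 holds the 1.
for h in terminal_histories():
    print(h, payoff_player1(h, 3, 1))
# pp 1
# pbp -1
# pbb 2
# bp 1
# bb 2
```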

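### CFR Recap

The counterfactual regret minimization algorithm can be broken into the following steps:

1. Choose the action you want to take

2. Compute the reward you receive

3. Compute the counterfactual rewards, i.e. the rewards the other actions would have earned

4. Subtract the actual reward from each counterfactual reward to get the regrets

5. Store all the regrets

6. Sum the regrets and normalize the positive sums to get the strategy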
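
The last step, often called regret matching, can be sketched as follows. The function name `regret_matching` is illustrative, and the first input reuses the two rock-paper-scissors rounds from the worked example, whose regrets `[0, 1, -1]` and `[0, 1, 2]` sum to `[0, 2, 1]`.

```python
def regret_matching(regret_sum):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regret_sum]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    # No action has positive regret: fall back to the uniform strategy.
    return [1.0 / len(regret_sum)] * len(regret_sum)


# Cumulative regrets for (rock, paper, scissors) after the two example rounds.
print(regret_matching([0, 2, 1]))    # paper with probability 2/3, scissors with 1/3
print(regret_matching([-1, -2, 0]))  # no positive regret, so each action gets 1/3
```

Negative regrets are clipped to zero only when the strategy is computed. The stored sums keep their sign so that later rounds can still offset them.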

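### Program Outline

1. Traverse the game tree recursively

2. At a terminal node, compute the reward

3. At each decision node, compute the total regret

4. Compute the action probabilities at each node

5. Update the strategy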
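
Because a player cannot see the opponent's card, decision nodes that differ only in the opponent's card must share one strategy. A minimal sketch of that grouping, assuming cards are numbered 0, 1, 2 and nodes are keyed by the acting player's card plus the action history (the helper names are illustrative):

```python
from itertools import permutations


def is_terminal(history):
    """A hand ends after pass-pass, bet-pass or bet-bet."""
    return history[-2:] in ("pp", "bp", "bb")


def decision_nodes(cards, history=""):
    """Yield the information-set key of every decision node reached with this deal."""
    if is_terminal(history):
        return
    actor = len(history) % 2  # 0 = player 1, 1 = player 2
    yield "{} {}".format(cards[actor], history)
    for action in "pb":
        yield from decision_nodes(cards, history + action)


# All ordered deals of two distinct cards from a three-card deck.
keys = [key for deal in permutations([0, 1, 2], 2) for key in decision_nodes(deal)]
print(len(keys))       # 24 decision nodes across the 6 possible deals
print(len(set(keys)))  # 12 distinct information sets
```

Each information set is reached from two different deals, which is why regrets and strategies are stored per key and not per tree node.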

### Implementation

In [5]:
class Node(object):
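    def __init__(self, key, actionDict, nActions=2):
        """
        An information set: what the acting player knows (own card plus action history).
        """
        self.key = key
        self.nActions = nActions  # number of available actions (2 by default)
        self.regretSum = np.zeros(self.nActions)  # cumulative regret per action
        self.strategySum = np.zeros(self.nActions)  # reach-weighted sum of strategies
        self.actionDict = actionDict  # maps an action index to its history symbol
        self.strategy = np.repeat(1 / self.nActions, self.nActions)  # start from the uniform strategy
        
        self.reachPr = 0  # reach probability accumulated during the current iteration
        self.reachPrSum = 0  # reach probability accumulated over all iterations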
        
    def updateStrategy(self):
        """Accumulate the current strategy, then compute the next one from the regrets."""
        self.strategySum += self.reachPr * self.strategy
        self.reachPrSum += self.reachPr
        
        self.strategy = self.getStrategy()  # regret matching on the updated regrets
        self.reachPr = 0
    
    def getStrategy(self):
        # Clip a copy so the cumulative regrets themselves are left unchanged.
        regrets = np.maximum(self.regretSum, 0)
        normalizingSum = regrets.sum()
        if normalizingSum > 0:
            return regrets / normalizingSum
        else:
            return np.repeat(1 / self.nActions, self.nActions)
    
    def getAverageStrategy(self):
        strategy = self.strategySum / self.reachPrSum
        # Re-normalize
        total = sum(strategy)
        strategy /= total
        return strategy
    
    def __str__(self):
        strategies = ['{:03.2f}'.format(x)
                      for x in self.getAverageStrategy()]
        return '{} {}'.format(self.key.ljust(6), strategies)

In [7]:
from random import shuffle

class Kunh(object):
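    def __init__(self):
        self.nodeMap = {}  # maps an information-set key to its Node
        self.deck = np.array([0, 1, 2])
        self.nAction = 2
    
    def train(self, nIterations=50000):
        expectedGameValue = 0
        for _ in range(nIterations):
            shuffle(self.deck)  # deal by shuffling the deck
            expectedGameValue += self.cfr('', 1, 1)
            for _, v in self.nodeMap.items():
                v.updateStrategy()
        # Average value of the game for player 1 over all iterations.
        return expectedGameValue / nIterations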
    
    def cfr(self, history, pr_1, pr_2):
        """
        Walk the game tree from `history` and return the expected utility
        for the player to act. pr_1 and pr_2 are the probabilities that
        player 1 and player 2 play to reach this history.
        """
        n = len(history)
        isPlayer1 = n % 2 == 0  # player 1 acts after an even number of actions
        
        playerCard = self.deck[0] if isPlayer1 else self.deck[1]
        
        if self.isTerminal(history):
            opponentCard = self.deck[1] if isPlayer1 else self.deck[0]
            return self.getReward(history, playerCard, opponentCard)
        
        node = self.getNode(playerCard, history)
        strategy = node.strategy
        
        # Counterfactual utility of each action
        actionUtils = np.zeros(self.nAction)
        
        for act in range(self.nAction):
            nextHistory = history + node.actionDict[act]  # append the action to the history
            # The recursive call returns the opponent's utility, so flip the sign.
            if isPlayer1:
                actionUtils[act] = -1 * self.cfr(nextHistory, pr_1 * strategy[act], pr_2)
            else:
                actionUtils[act] = -1 * self.cfr(nextHistory, pr_1, pr_2 * strategy[act])
        
        # Utility of information set
        util = sum(actionUtils * strategy)
        regrets = actionUtils - util
        
        # Regrets are weighted by the opponent's reach probability.
        if isPlayer1:
            node.reachPr += pr_1
            node.regretSum += pr_2 * regrets
        else:
            node.reachPr += pr_2
            node.regretSum += pr_1 * regrets
        
        return util
        
    
    @staticmethod
    def isTerminal(history):
        """'p' is pass and 'b' is bet; a hand ends on pass-pass, bet-bet or bet-pass."""
        return history[-2:] in ('pp', 'bb', 'bp')
    
    @staticmethod
    def getReward(history, playerCard, opponentCard):
        """Reward at a terminal history for the player who would act next."""
        terminalPass = history[-1] == 'p'
        doubleBet = history[-2:] == 'bb'
        if terminalPass:  # the last action was a pass
            if history[-2:] == 'pp':  # both passed: the higher card wins 1
                return 1 if playerCard > opponentCard else -1
            else:  # bet then pass: the opponent folded, so this player wins 1
                return 1
        elif doubleBet:  # bet then call: the higher card wins 2
            return 2 if playerCard > opponentCard else -2
    
    def getNode(self, card, history):
        key = str(card) + " " + history
        if key not in self.nodeMap:
            actionDict = {0: 'p', 1: 'b'}
            infoSet = Node(key, actionDict)
            self.nodeMap[key] = infoSet
            return infoSet
        return self.nodeMap[key]
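
One way to sanity-check the trained average strategies is to compare them with a known equilibrium. The sketch below does not use the classes above. It evaluates a fixed strategy profile exactly by walking the tree for all six deals. The `equilibrium` table is an assumption taken from the standard analysis of Kuhn poker (the member of the equilibrium family in which player 1 never opens with a bet), not from this document. It is keyed the same way as `getNode`, and each value is the probability of betting.

```python
from itertools import permutations


def is_terminal(history):
    return history[-2:] in ("pp", "bp", "bb")


def payoff_to_actor(history, actor_card, other_card):
    """Payoff at a terminal history for the player who would act next."""
    if history[-2:] == "bp":
        return 1  # the opponent folded to this player's bet
    stake = 2 if history[-2:] == "bb" else 1
    return stake if actor_card > other_card else -stake


def node_value(bet_prob, cards, history=""):
    """Expected payoff for the player about to act, under fixed strategies."""
    actor = len(history) % 2
    actor_card, other_card = cards[actor], cards[1 - actor]
    if is_terminal(history):
        return payoff_to_actor(history, actor_card, other_card)
    p_bet = bet_prob["{} {}".format(actor_card, history)]
    # A child's value belongs to the opponent, hence the sign flips.
    return (-(1 - p_bet) * node_value(bet_prob, cards, history + "p")
            - p_bet * node_value(bet_prob, cards, history + "b"))


def game_value(bet_prob):
    """Average value for player 1 over all equally likely deals."""
    deals = list(permutations([0, 1, 2], 2))
    return sum(node_value(bet_prob, deal) for deal in deals) / len(deals)


equilibrium = {
    "0 ": 0.0, "1 ": 0.0, "2 ": 0.0,         # player 1, opening action
    "0 pb": 0.0, "1 pb": 1 / 3, "2 pb": 1.0,  # player 1, facing a bet after passing
    "0 p": 1 / 3, "1 p": 0.0, "2 p": 1.0,     # player 2, after a pass
    "0 b": 0.0, "1 b": 1 / 3, "2 b": 1.0,     # player 2, facing a bet
}
print(game_value(equilibrium))  # about -0.0556, i.e. -1/18
```

If training has converged, the value returned by `train` should be close to this number, meaning that player 1 loses a little on average.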