# Fictitious Play
This tutorial demonstrates fictitious play (FP) on a zero-sum normal-form (strategic, non-extensive) game.

In the payoff matrix, the row player is the maximizer and the column player is the minimizer.

Reference: https://towardsdatascience.com/introduction-to-fictitious-play-12a8bc4ed1bb
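
To make the payoff convention concrete, here is a minimal sketch of the expected payoff `x @ A @ y` and of each player's pure best response against a fixed mixed strategy of the opponent. The names `A`, `x` and `y` and the uniform strategies are illustrative assumptions, not part of the tutorial code.

```python
import numpy as np

# Same payoff matrix as the tutorial: entries are payoffs to the row player.
A = np.array([[0, 2, -1], [-1, 0, 1], [1, -1, 0]])

x = np.ones(3) / 3  # row player's mixed strategy (uniform, for illustration)
y = np.ones(3) / 3  # column player's mixed strategy (uniform, for illustration)

# Expected payoff to the row player (the maximizer).
expected_payoff = x @ A @ y

# Row player picks the row with the highest expected payoff against y.
row_best_response = int(np.argmax(A @ y))

# Column player picks the column with the lowest expected payoff against x.
col_best_response = int(np.argmin(x @ A))

print(expected_payoff, row_best_response, col_best_response)
```

Fictitious play repeats exactly these two best-response steps, with `x` and `y` replaced by the opponents' empirical action frequencies.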

In [1]:
import numpy as np
! pip install pandas
import pandas as pd
# GameMatrix = np.array([[-2,3], [3,-4]])
GameMatrix = np.array([[0,2,-1], [-1,0,1], [1,-1,0]])

Itr = 10000
print('Game matrix: ')
pd.DataFrame(GameMatrix)


Game matrix: 


|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0 | 2 | -1 |
| 1 | -1 | 0 | 1 |
| 2 | 1 | -1 | 0 |


In [37]:
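# Cumulative payoffs start at zero, so the first choices are ties that argmin/argmax break at index 0.
# row_value[j]: total payoff of column j against the row player's past actions.
# col_value[i]: total payoff of row i against the column player's past actions.
row_value = np.zeros(GameMatrix.shape[1])
col_value = np.zeros(GameMatrix.shape[0])
min_id_list, max_id_list = [], []

for i in range(Itr):
    # Column player (minimizer) best-responds to the row player's empirical play.
    min_id = np.argmin(row_value)
    min_id_list.append(min_id)
    col_value += GameMatrix[:, min_id]  # payoff of each row against the chosen column

    # Row player (maximizer) best-responds to the column player's empirical play.
    max_id = np.argmax(col_value)
    max_id_list.append(max_id)
    row_value += GameMatrix[max_id]  # payoff of each column against the chosen row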
    
# The empirical frequency of each action over the history is its probability mass in the final policy,
# i.e. the average of the best responses played so far.
# np.bincount counts each action index directly, even if an action was never played.
max_policy = np.bincount(max_id_list, minlength=GameMatrix.shape[0]) / len(max_id_list)
min_policy = np.bincount(min_id_list, minlength=GameMatrix.shape[1]) / len(min_id_list)

# Average payoff per action: min(row_value) is a lower bound and max(col_value) an upper bound on the game value.
row_value = row_value / Itr
col_value = col_value / Itr
print(f'For row player, strategy is {max_policy}, game value [lower bound, upper bound]: {row_value}')
print(f'For column player, strategy is {min_policy}, game value [lower bound, upper bound]: {col_value}')


For row player, strategy is [0.25   0.3332 0.4168], game value [lower bound, upper bound]: [0.0836 0.0832 0.0832]
For column player, strategy is [0.3333 0.2499 0.4168], game value [lower bound, upper bound]: [0.083  0.0835 0.0834]
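
The two printed value vectors can be reduced to a single lower and upper bound on the game value. The sketch below recomputes them from the strategies printed above; the names `A`, `x`, `y`, `lower_bound` and `upper_bound` are illustrative and do not appear in the notebook.

```python
import numpy as np

A = np.array([[0, 2, -1], [-1, 0, 1], [1, -1, 0]])

# Empirical strategies copied from the printed output above.
x = np.array([0.25, 0.3332, 0.4168])    # row player (maximizer)
y = np.array([0.3333, 0.2499, 0.4168])  # column player (minimizer)

# Payoff the row player is guaranteed if the column player best-responds to x.
lower_bound = np.min(x @ A)

# Payoff the column player concedes at most if the row player best-responds to y.
upper_bound = np.max(A @ y)

print(f"lower bound: {lower_bound:.4f}, upper bound: {upper_bound:.4f}")
print(f"gap: {upper_bound - lower_bound:.4f}")
```

For these strategies the bounds are about 0.0832 and 0.0835. They bracket the exact game value of 1/12, which is attained by the equilibrium strategies (1/4, 1/3, 5/12) for the row player and (1/3, 1/4, 5/12) for the column player.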


# 2 Fictitious Self-Play
Here we demonstrate the fictitious self-play (FSP) algorithm, or more precisely a non-neural version of Neural Fictitious Self-Play (NFSP) [2], on a zero-sum normal-form (strategic, non-extensive) game.

Note that the original FSP paper [1] describes the algorithm in the language of supervised learning (SL) and reinforcement learning (RL), which makes it quite close to NFSP: SL is used to obtain the average policy and RL to approximate the best-response policy. Here we have a simple normal-form game whose utilities are given by a matrix, so neither learner is needed. As the FP example above shows, the average policy can be estimated directly from each player's historical action frequencies, and the best response is easy to compute because a deterministic (pure) best response always exists: it is the argmax (or argmin) over the expected payoffs.
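
The two ingredients described above can be sketched as follows, assuming the usual reading of the anticipatory parameter: with probability `eta` a player plays its best response, otherwise it samples from its average policy. The helper names `select_action` and `average_policy_from_history` are hypothetical and are not taken from [1] or [2].

```python
import numpy as np

def average_policy_from_history(best_response_history, n_actions):
    """Empirical frequency of each action among the recorded best responses.

    Assumes the history is non-empty.
    """
    counts = np.bincount(best_response_history, minlength=n_actions)
    return counts / counts.sum()

def select_action(rng, eta, best_response_action, average_policy):
    """Return (action, used_best_response) for an eta-mixture of the two policies."""
    if rng.random() < eta:
        return best_response_action, True
    action = int(rng.choice(len(average_policy), p=average_policy))
    return action, False

rng = np.random.default_rng(0)
history = [1, 0, 0, 1]                       # illustrative best-response history
avg = average_policy_from_history(history, n_actions=3)
action, used_br = select_action(rng, eta=0.2, best_response_action=2, average_policy=avg)
print(avg, action, used_br)
```

Only actions chosen through the best-response branch would be appended to the history, so the average policy stays an average of best responses.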

Also note that this is not the XFP (full-width extensive-form fictitious play) algorithm for extensive-form games.

References:

[1] FSP http://proceedings.mlr.press/v37/heinrich15.html

[2] NFSP https://arxiv.org/abs/1603.01121

In [2]:
import numpy as np
! pip install pandas
import pandas as pd
# GameMatrix = np.array([[-2,3], [3,-4]])
GameMatrix = np.array([[0,2,-1], [-1,0,1], [1,-1,0]])

print('Game matrix: ')
pd.DataFrame(GameMatrix)


Game matrix: 


|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0 | 2 | -1 |
| 1 | -1 | 0 | 1 |
| 2 | 1 | -1 | 0 |


In [47]:
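# hyperparameters for FSP
Eta = 0.2  # anticipatory parameter: eta=1 plays only the best response (as in FP); eta=0 plays only the average policy; see [1] or [2]
Avg_warmup = 0.2  # intended as the fraction of iterations before the average policy is used; NOTE: the loop compares it with i directly, so only i=0 counts as warm-up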
Itr = 10000

def sample_from_categorical(dist):
    """
    sample once from a categorical distribution, return the entry index.
    dist: should be a list or array of probabilities for a categorical distribution
    """
    return np.argmax(np.random.multinomial(1, dist))

# cumulative payoffs start at zero (row_value is indexed by column, col_value by row)
row_value = np.zeros(GameMatrix.shape[1])
col_value = np.zeros(GameMatrix.shape[0])
min_id_list, max_id_list = [], []

# initialize the average policy as uniform
max_policy = 1./GameMatrix.shape[0]*np.ones(GameMatrix.shape[0])
min_policy = 1./GameMatrix.shape[1]*np.ones(GameMatrix.shape[1])

for i in range(Itr):
    # column player (minimizer): best response with probability Eta, otherwise sample from the average policy
    if i < Avg_warmup or np.isnan(min_policy).any() or sample_from_categorical([1-Eta, Eta]):
        min_id = np.argmin(row_value)
        min_id_list.append(min_id)  # the average policy is computed only from best-response actions
    else:
        min_id = sample_from_categorical(min_policy)
    col_value += GameMatrix[:, min_id]  # payoff of each row against the chosen column

    # row player (maximizer): same rule
    if i < Avg_warmup or np.isnan(max_policy).any() or sample_from_categorical([1-Eta, Eta]):
        max_id = np.argmax(col_value)
        max_id_list.append(max_id)
    else:
        max_id = sample_from_categorical(max_policy)
    row_value += GameMatrix[max_id]  # payoff of each column against the chosen row
    
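    # update each player's average policy from the best responses recorded so far
    # NOTE: with an integer `bins`, np.histogram takes its range from the data, so the counts land in the
    # wrong bins until every action has appeared at least once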
    hist, _ = np.histogram(max_id_list, bins=GameMatrix.shape[0])
    max_policy = hist/np.sum(hist)
    hist, _ = np.histogram(min_id_list, bins=GameMatrix.shape[1])
    min_policy = hist/np.sum(hist)
    

# The empirical frequency of each recorded best response is its probability mass in the final policy,
# i.e. the average best response over the history.
hist, _ = np.histogram(max_id_list, bins=GameMatrix.shape[0])
max_policy = hist/np.sum(hist)
hist, _ = np.histogram(min_id_list, bins=GameMatrix.shape[1])
min_policy = hist/np.sum(hist)

row_value = row_value / (i+1)  # historical average
col_value = col_value / (i+1)
print(f'For row player, strategy is {max_policy}, game value [lower bound, upper bound]: {row_value}')
print(f'For column player, strategy is {min_policy}, game value [lower bound, upper bound]: {col_value}')


For row player, strategy is [0.49797776 0.42163802 0.08038423], game value [lower bound, upper bound]: [ 0.0155  0.948  -0.3812]
For column player, strategy is [0.04058116 0.13777555 0.82164329], game value [lower bound, upper bound]: [-0.0518  0.4711 -0.1458]


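It seems FSP does not work here: the strategies are far from the FP result, and the per-action values no longer agree on a single game value. Two details of the code above may contribute. First, `Avg_warmup = 0.2` is compared with `i` directly, so only the first iteration counts as warm-up. Second, `np.histogram` with an integer `bins` takes its range from the data, so the average policy is assigned to the wrong actions until every action has been played at least once.

A modified version follows, in which a best response is never computed against the opponent's best response: both players always act from their average policies, and best responses are only recorded to update those averages.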

In [56]:
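# hyperparameters for FSP
Eta = 0.2  # anticipatory parameter: eta=1 plays only the best response (as in FP); eta=0 plays only the average policy; see [1] or [2]
Avg_warmup = 0.2  # intended as the fraction of iterations before the average policy is used; NOTE: the loop compares it with i directly, so only i=0 counts as warm-up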
Itr = 20

def sample_from_categorical(dist):
    """
    sample once from a categorical distribution, return the entry index.
    dist: should be a list or array of probabilities for a categorical distribution
    """
    return np.argmax(np.random.multinomial(1, dist))

# payoff vectors start at zero (row_value is indexed by column, col_value by row)
row_value = np.zeros(GameMatrix.shape[1])
col_value = np.zeros(GameMatrix.shape[0])
min_id_list, max_id_list = [], []

# initialize the average policy as uniform
max_policy = 1./GameMatrix.shape[0]*np.ones(GameMatrix.shape[0])
min_policy = 1./GameMatrix.shape[1]*np.ones(GameMatrix.shape[1])

for i in range(Itr):
    # column player (minimizer): with probability Eta, record a best response; it only updates the average policy
    if i < Avg_warmup or np.isnan(min_policy).any() or sample_from_categorical([1-Eta, Eta]):
        min_id = np.argmin(row_value)
        min_id_list.append(min_id)
    # the action actually played is always sampled from the average policy
#     print(min_id_list, min_policy)
    min_id = sample_from_categorical(min_policy)
    col_value = GameMatrix[:, min_id].copy()  # payoffs against the latest sampled column only (not cumulative)

    # row player (maximizer): same rule
    if i < Avg_warmup or np.isnan(max_policy).any() or sample_from_categorical([1-Eta, Eta]):
        max_id = np.argmax(col_value)
        max_id_list.append(max_id)
    print(max_id_list, max_policy)  # debug: recorded best responses and the current average policy
    max_id = sample_from_categorical(max_policy)
    row_value = GameMatrix[max_id].copy()  # payoffs against the latest sampled row only (not cumulative)
    
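    # update each player's average policy from the best responses recorded so far
    # NOTE: with an integer `bins`, np.histogram takes its range from the data, so the counts land in the
    # wrong bins until every action has appeared at least once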
    hist, _ = np.histogram(max_id_list, bins=GameMatrix.shape[0])
    max_policy = hist/np.sum(hist)
    hist, _ = np.histogram(min_id_list, bins=GameMatrix.shape[1])
    min_policy = hist/np.sum(hist)
    

# The empirical frequency of each recorded best response is its probability mass in the final policy,
# i.e. the average best response over the history.
hist, _ = np.histogram(max_id_list, bins=GameMatrix.shape[0])
max_policy = hist/np.sum(hist)
hist, _ = np.histogram(min_id_list, bins=GameMatrix.shape[1])
min_policy = hist/np.sum(hist)

# NOTE: in this version row_value and col_value hold only the last step's payoffs,
# so dividing by (i+1) does not give a historical average
row_value = row_value / (i+1)
col_value = col_value / (i+1)
print(f'For row player, strategy is {max_policy}, game value [lower bound, upper bound]: {row_value}')
print(f'For column player, strategy is {min_policy}, game value [lower bound, upper bound]: {col_value}')


[1] [0.33333333 0.33333333 0.33333333]
[1] [0. 1. 0.]
[1, 0] [0. 1. 0.]
[1, 0] [0.5 0.  0.5]
[1, 0] [0.5 0.  0.5]
[1, 0, 0] [0.5 0.  0.5]
[1, 0, 0] [0.66666667 0.         0.33333333]
[1, 0, 0] [0.66666667 0.         0.33333333]
[1, 0, 0] [0.66666667 0.         0.33333333]
[1, 0, 0] [0.66666667 0.         0.33333333]
[1, 0, 0] [0.66666667 0.         0.33333333]
[1, 0, 0, 1] [0.66666667 0.         0.33333333]
[1, 0, 0, 1] [0.5 0.  0.5]
[1, 0, 0, 1] [0.5 0.  0.5]
[1, 0, 0, 1] [0.5 0.  0.5]
[1, 0, 0, 1] [0.5 0.  0.5]
[1, 0, 0, 1] [0.5 0.  0.5]
[1, 0, 0, 1] [0.5 0.  0.5]
[1, 0, 0, 1] [0.5 0.  0.5]
[1, 0, 0, 1] [0.5 0.  0.5]
For row player, strategy is [0.5 0.  0.5], game value [lower bound, upper bound]: [ 0.05 -0.05  0.  ]
For column player, strategy is [0.5 0.  0.5], game value [lower bound, upper bound]: [-0.05  0.05  0.  ]
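
The log above shows the history `[1, 0]` being turned into the policy `[0.5 0. 0.5]`, although action 2 was never played. This comes from `np.histogram` inferring its bin range from the data when `bins` is an integer. The sketch below reproduces the effect and compares it with `np.bincount`, which counts integer action indices directly; the variable names are illustrative.

```python
import numpy as np

n_actions = 3
history = [1, 0]  # action 2 has not been played yet

# Integer bins: the range is taken from the data, here [0, 1], and split into three bins.
hist_counts, edges = np.histogram(history, bins=n_actions)

# bincount counts occurrences of each index 0..n_actions-1.
bincount_counts = np.bincount(history, minlength=n_actions)

print(edges)
print(hist_counts / hist_counts.sum())          # mass lands on actions 0 and 2
print(bincount_counts / bincount_counts.sum())  # mass lands on actions 0 and 1
```

Passing `range=(0, n_actions)` to `np.histogram` would also fix the bin edges, but `np.bincount` states the intent more directly for integer action indices.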


In [None]:
# TODO: test a version with an increasing eta