# RL solutions for combinatorial games: Monte Carlo Control

## The Game of Nim (Mathematical jargon):
- The Game of Nim is a simple two-player, impartial, perfect-information game
- Every impartial combinatorial game under normal play is equivalent to a one-heap game of Nim $\rightarrow$ [Sprague Grundy Theorem](https://en.wikipedia.org/wiki/Sprague%E2%80%93Grundy_theorem)
- Refer to the [MIT lecture notes](https://web.mit.edu/sp.268/www/nim.pdf) or [Wikipedia](https://en.wikipedia.org/wiki/Nim) for more

## The Game of Nim (In simpler terms):
- The game of Nim is given by the following setting:
- There is a set of **heaps** arranged in front of you, each containing some number of **pebbles**
- On your turn you must pick a non-empty heap and take any non-zero number of pebbles from it; you cannot take more pebbles than the heap contains
- There are 2 ways to play Nim, which are "equivalent" (in quotes because the notion of equivalence is well defined but omitted here)
    1. Normal Play: The player who takes the last pebble (emptying the last heap) wins
    2. Misère Play: The player who is forced to take the last pebble loses

### The winning strategy:
- It can be proved that, under normal play, the player to move has a winning strategy whenever the **Nim Sum** of the position is non-zero: always end your turn on a Nim Sum of zero
- For our simple Nim game, the Nim Sum is simply the bitwise XOR of the number of pebbles in each heap
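
As a small illustration of the XOR rule above, the following sketch lists every move that leaves a nim sum of zero. The helper names `xor_of_heaps` and `zeroing_moves` are made up for this example and are not part of the notebook's code.

```python
from functools import reduce
from operator import xor


def xor_of_heaps(heaps):
    """Nim sum: bitwise XOR of all heap sizes."""
    return reduce(xor, heaps, 0)


def zeroing_moves(heaps):
    """All (heap_index, pebbles_to_remove) moves that leave a nim sum of zero."""
    total = xor_of_heaps(heaps)
    moves = []
    for index, heap in enumerate(heaps):
        target = heap ^ total  # size this heap must shrink to
        if target < heap:      # only legal if it really removes pebbles
            moves.append((index, heap - target))
    return moves


print(xor_of_heaps((3, 4, 5)))   # 2
print(zeroing_moves((3, 4, 5)))  # [(0, 2)]
print(zeroing_moves((1, 2, 3)))  # [] since the nim sum is already zero
```

Note that only heaps with `heap ^ total < heap` admit such a move, so the largest heap is not always the one to play from: in `(2, 2, 1)` the only zeroing move is `(2, 1)`.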

Note: Under certain conditions, **Go** endgames can be modeled as a combinatorial game

In [45]:
import numpy as np
from tqdm import tqdm

In [46]:
np.random.seed(42)

### Defining the Game Simulator
`SimpleNim` is a class that lets the agent interact with the game and perform actions via `SimpleNim.make_move()`

In [47]:
# A Simple Nim Game
class SimpleNim:
    def __init__(self, state):
        self.heaps = state
        self.turn = 0
        self.num_heaps = len(self.heaps)

    def get_state(self):
        if isinstance(self.heaps, list):
            return tuple(self.heaps)
        else:
            return tuple(self.heaps.tolist())

    def get_num_heaps(self):
        return self.num_heaps

    def get_heap_size(self, heap_index):
        if 0 <= heap_index < self.num_heaps:
            return self.heaps[heap_index]
        return None

    def is_game_over(self):
        return all(pebbles == 0 for pebbles in self.heaps)

    def make_move(self, heap_index, num_pebbles):
        if self.is_game_over():
            return False

        # Validate the move
        if 0 <= heap_index < self.num_heaps and 0 < num_pebbles <= self.heaps[heap_index]:
            self.heaps[heap_index] -= num_pebbles
            self.turn = 1 - self.turn  # Switch turns
            return True
        return False

    def winner(self):
        if self.is_game_over():
            return 1-self.turn
        return None

### Optimal Play
- The player to move has a winning strategy exactly on positions with a non-zero nim sum
- Here we define an opponent who plays the winning move on positions with a non-zero nim sum; on positions with nim sum 0 it removes the least possible number of pebbles (one pebble), hoping that the agent fumbles on its turn and hands back a state with a non-zero nim sum

In [48]:
def nim_sum(s):
    xor = 0
    for heap in s:
        xor ^= heap
    return xor

In [49]:
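def play_optimal(game):
    if game.is_game_over():
        return
    heaps = game.get_state()

    nimber = nim_sum(heaps)

    if nimber != 0:
        # optimal move: shrink a heap h to h ^ nimber, which leaves a nim sum of zero;
        # this is only legal for heaps with (h ^ nimber) < h, and the largest heap need not qualify
        for index, heap in enumerate(heaps):
            if (heap ^ nimber) < heap:
                game.make_move(index, heap - (heap ^ nimber))
                return
    else:
        # no winning move exists: take one pebble from the largest heap and hope for a blunder
        largest_index, _ = max(enumerate(heaps), key=lambda x: x[1])
        game.make_move(largest_index, 1)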

In [50]:
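def play_random(game):
    if game.is_game_over():
        return
    heaps = game.get_state()

    # pick a random non-empty heap
    index = np.random.randint(0, len(heaps))
    while heaps[index] == 0:
        index = np.random.randint(0, len(heaps))
    # remove between 1 and heaps[index] pebbles (the upper bound of randint is exclusive)
    game.make_move(index, np.random.randint(1, heaps[index]+1))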

In [51]:
# simple simulation of a nim game with both players using the optimal strategy
heaps = [10] * 3
game = SimpleNim(heaps)
while not game.is_game_over():
    play_optimal(game)
    print(game.get_state())
print("The winner is Player ", game.winner())

(0, 10, 10)
(0, 9, 10)
(0, 9, 9)
(0, 8, 9)
(0, 8, 8)
(0, 7, 8)
(0, 7, 7)
(0, 6, 7)
(0, 6, 6)
(0, 5, 6)
(0, 5, 5)
(0, 4, 5)
(0, 4, 4)
(0, 3, 4)
(0, 3, 3)
(0, 2, 3)
(0, 2, 2)
(0, 1, 2)
(0, 1, 1)
(0, 0, 1)
(0, 0, 0)
The winner is Player  0


## Defining State and Action Spaces:
**The State** is given by a configuration of heaps ($\mathbb{N}_0$ denotes the natural numbers including zero):
$$ \mathcal S = \mathbb{N}_0^h $$
where $h$ is the number of heaps.

**Actions** are given by 2-tuples of heap index and number of pebbles to be taken out:
$$ \mathcal A(s) = \{(i, k) : i \in H(s),\ k \in \mathbb{N},\ k \le s_i\} $$
where $\mathbb{N}$ does not include zero, $s_i$ is the number of pebbles in heap $i$, and $H(s)$ is the set of heaps with a non-zero number of pebbles in the state $s$.
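
To make the action set concrete, here is a minimal sketch that enumerates $\mathcal A(s)$ for a given state. The function name `legal_actions` is an assumption made for this example; the notebook itself never materializes the full action set.

```python
def legal_actions(state):
    """Enumerate A(s): every (heap_index, pebbles) pair with 1 <= pebbles <= heap size."""
    return [
        (index, pebbles)
        for index, heap in enumerate(state)
        for pebbles in range(1, heap + 1)
    ]


state = (2, 0, 3)
print(legal_actions(state))
# [(0, 1), (0, 2), (2, 1), (2, 2), (2, 3)]
print(len(legal_actions(state)) == sum(state))  # True: one action per pebble
```

Empty heaps contribute no actions, which is exactly the role of $H(s)$ in the definition.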

In [None]:
# for simplicity every heap shares the same upper bound: training episodes draw each heap size uniformly from 0..initial_pebbles
num_heaps = 3
initial_pebbles = 8
max_episodes = 100000

### Epsilon Greedy Policy:
- With probability epsilon the policy takes an exploratory action; otherwise it acts greedily with respect to the current action-value function Q
- This keeps the agent exploring different actions throughout the run
- The method update() sets epsilon, so that a decaying schedule can implement **GLIE** (Greedy in the Limit with Infinite Exploration)
- Refer to [David Silver's Slides](https://davidstarsilver.wordpress.com/wp-content/uploads/2025/04/lecture-5-model-free-control-.pdf) for a better explanation of GLIE and epsilon greedy policies
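
The idea can be sketched in a few lines. This is an illustrative variant rather than the class defined in the next cell: it explores uniformly over all legal actions, and the names `epsilon_greedy_action` and `glie_epsilon` are assumptions made for this example.

```python
import random


def epsilon_greedy_action(state, q_values, epsilon, rng=random):
    """Pick a uniformly random legal action with probability epsilon, else the greedy one.

    q_values maps action -> estimated value for this state (may be empty).
    """
    legal = [(i, k) for i, heap in enumerate(state) for k in range(1, heap + 1)]
    if not q_values or rng.random() < epsilon:
        return rng.choice(legal)
    return max(q_values, key=q_values.get)


def glie_epsilon(episode):
    """epsilon_k = 1/k: decays to zero while every action keeps non-zero probability."""
    return 1.0 / episode


q = {(0, 1): 0.2, (2, 3): 0.9}
print(epsilon_greedy_action((2, 0, 3), q, epsilon=0.0))  # (2, 3), purely greedy
print([glie_epsilon(k) for k in (1, 2, 4)])               # [1.0, 0.5, 0.25]
```

The schedule $\epsilon_k = 1/k$ is the one the training loops in this notebook pass to `update()`.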

In [53]:
class EpsilonGreedyPolicy:
    def __init__(self):
        self.epsilon = 1

    def update(self, epsilon):
        self.epsilon = epsilon

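    def policy(self, s, Q):
        if s not in Q or np.random.random() < self.epsilon:
            # exploratory action: a random non-empty heap, always removing a single pebble
            index = np.random.randint(0, len(s))
            while s[index] == 0:
                index = np.random.randint(0, len(s))
            return (index, 1)
        else:
            # greedy action with respect to the current Q estimates
            return max(Q[s], key=Q[s].get)

    def final_policy(self, s, Q):
        if s not in Q:
            # sentinel for states never seen in training; is_optimal() treats it as non-optimal
            return (0,0)
        return max(Q[s], key=Q[s].get)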

In [54]:
class SimpleGreedyPolicy:
    def update(self, epsilon): pass

    def policy(self, s, Q):
        if s not in Q:
            index = np.random.randint(0, len(s))
            while s[index] == 0:
                index = np.random.randint(0, len(s))
            return (index, 1)
        return max(Q[s], key=Q[s].get)

    def final_policy(self, s, Q):
        if s not in Q:
            return (0,0)
        return max(Q[s], key=Q[s].get)

In [55]:
def simulate_agent_move(game, states, actions, policy, Q):
    curr_state = game.get_state()
    states.append(curr_state)
    curr_action = policy.policy(curr_state, Q)
    actions.append(curr_action)
    game.make_move(curr_action[0], curr_action[1])

    return curr_state, curr_action

#### Rewards Model
$$R_T = \begin{cases} 
            1 & \text{agent wins} \\
            0 & \text{agent loses}
        \end{cases}$$
$$R_t = 0 \hspace{3mm}\forall t<T$$
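
With this reward model the return from any step is just the (discounted) terminal reward. The sketch below works through that arithmetic; `discounted_returns` is a hypothetical helper that mirrors the backward sweep `G = gamma * G + R[t]` used in the training loops.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return [G_0, ..., G_{T-1}] where G_t = R_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in range(len(rewards) - 1, -1, -1):  # sweep backwards from the final move
        g = gamma * g + rewards[t]
        returns[t] = g
    return returns


won_episode = [0, 0, 1]   # three agent moves, the last one wins
print(discounted_returns(won_episode))             # [1.0, 1.0, 1.0]
print(discounted_returns(won_episode, gamma=0.5))  # [0.25, 0.5, 1.0]
print(discounted_returns([0, 0, 0]))               # [0.0, 0.0, 0.0] for a loss
```

With `gamma=1`, every state-action pair visited in a won episode receives a return of 1 and every pair in a lost episode receives 0.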

In [56]:
def generate_episode(policy, s0, a0, Q, negative_on_loss=False, move_opponent=play_optimal):

    # init
    game = SimpleNim(s0)
    states = []
    actions = []
    rewards = []

    first_state = tuple(s0.tolist())
    states.append(first_state)
    actions.append(a0)
    game.make_move(a0[0], a0[1])

    first_visit = {(first_state,a0): 0}
    index = 1

    # loop until end of episode
    while not game.is_game_over():
        # opponent move
        move_opponent(game)

        # agent move
        if game.is_game_over():
            break
        rewards.append(0)

        curr_state, curr_action = simulate_agent_move(game, states, actions, policy, Q)

        if (tuple(curr_state), curr_action) not in first_visit:
            first_visit[(tuple(curr_state), curr_action)] = index
        index+=1

    if game.winner() == 0:
        rewards.append(1)
    else:
        if negative_on_loss: rewards.append(-1)
        else: rewards.append(0)
    return states, actions, rewards, first_visit

In [57]:
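# Monte Carlo Control with exploring starts
def monte_carlo_es(gamma=1, policy=None, negative_on_loss=False, move_opponent=play_optimal):
    # build a fresh policy per call; a default instance in the signature would be shared across calls
    if policy is None:
        policy = EpsilonGreedyPolicy()
    Q = {} # take default to be zero
    N = {} # visit counts per (state, action)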

    for i in tqdm(range(max_episodes)):
        s0 = np.random.randint(0, initial_pebbles+1, num_heaps)
        while all(s0 == 0):
            s0 = np.random.randint(0, initial_pebbles+1, num_heaps)
        heap_num = np.random.randint(0,num_heaps)
        while s0[heap_num] == 0:
            heap_num = np.random.randint(0,num_heaps)
        a0 = (heap_num, np.random.randint(1, s0[heap_num]+1))

        S, A, R, first_visit = generate_episode(policy=policy, s0=s0, a0=a0, Q=Q, negative_on_loss=negative_on_loss, move_opponent=move_opponent)

        if len(S) == 0:
            continue

        G = 0
        for t in range(len(R)-1, -1, -1):
            G = gamma * G + R[t]
            if first_visit[(S[t], A[t])] == t:

                if (S[t], A[t]) not in N:
                    N[(S[t], A[t])] = 0
                N[(S[t], A[t])] += 1

                if S[t] not in Q:
                    Q[S[t]] = {}
                if A[t] not in Q[S[t]]:
                    Q[S[t]][A[t]] = 0
                Q[S[t]][A[t]] = Q[S[t]][A[t]] + float((G-Q[S[t]][A[t]]))/float(N[(S[t], A[t])])


        policy.update(epsilon=1.0/float(i+1))

    return policy, Q

In [58]:
def is_optimal(s, a):
    if a[1] == 0: return False
    s = list(s)
    s[a[0]] -= a[1]
    return nim_sum(s) == 0

In [59]:
# Epsilon Greedy and exploring starts
policy, Q = monte_carlo_es()

print(Q)

100%|██████████| 100000/100000 [00:02<00:00, 36286.93it/s]

{(1, 0, 1): {(0, 1): 0.38381492429269554, (2, 1): 0.0}, (2, 3, 1): {(0, 1): 0.5000893655049156, (2, 1): 0.0, (0, 2): 0.0, (1, 1): 0.0, (1, 3): 0.0, (1, 2): 0.0}, (6, 3, 4): {(2, 3): 0.0, (2, 1): 0.0, (0, 6): 0.0, (0, 5): 0.6521739130434783, (1, 1): 0.6607142857142857, (2, 2): 0.6451612903225805, (2, 4): 0.0, (0, 3): 0.0, (0, 4): 0.0, (1, 3): 0.0, (1, 2): 0.6477272727272725, (0, 1): 0.6410256410256409, (0, 2): 0.0}, (0, 1, 1): {(2, 1): 0.02395551119349773, (1, 1): 0.0}, (6, 1, 2): {(2, 1): 0.0, (1, 1): 0.0, (2, 2): 0.0, (0, 4): 0.0, (0, 5): 0.0, (0, 3): 1.0, (0, 2): 0.0, (0, 1): 0.0, (0, 6): 0.0}, (4, 4, 1): {(1, 1): 0.8117203365245139, (0, 2): 0.0, (1, 3): 0.0, (1, 4): 0.0, (2, 1): 0.8068181818181818, (0, 1): 0.0, (1, 2): 0.0, (0, 4): 0.0, (0, 3): 0.0}, (5, 4, 1): {(0, 1): 0.9256265157639455, (0, 4): 0.0, (2, 1): 0.0, (1, 3): 0.0, (1, 4): 0.0, (1, 2): 0.0, (0, 2): 0.0, (0, 3): 0.0, (0, 5): 0.0, (1, 1): 0.0}, (7, 4, 3): {(2, 2): 1.0, (1, 3): 0.0, (2, 3): 0.0, (0, 5): 0.0, (0, 1): 0.0, (




## How Good is our Policy?
- For this game there is an intuitive way to judge how good our policy is
- Rather than looking at cumulative reward functions (value functions) and trying to figure out what they actually mean, we can compare the policy to the optimal strategy, which we know:
    - If the agent ends its turn on a nim sum of zero, the opponent cannot win as long as the agent keeps following the optimal strategy on later turns (end every turn on a nim sum of zero)
    - If not, an opponent playing optimally will always win

### Fumbles:
- Here what I mean by a fumble is: 
    - given a state from which the agent can win by playing optimally,
    - the agent chooses an action after which an opponent playing optimally always wins

So, the percentage of the agent's winning positions that it fumbles is a measure of how good our policy is
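
The definition translates directly into a check on a single state-action pair. The following sketch is self-contained; `is_fumble` is a name chosen for this example and is not one of the notebook's functions.

```python
from functools import reduce
from operator import xor


def is_fumble(state, action):
    """True if `state` is winning for the mover but `action` leaves a non-zero nim sum."""
    if reduce(xor, state, 0) == 0:
        return False  # not a winning position, so nothing to fumble
    heap_index, pebbles = action
    after = list(state)
    after[heap_index] -= pebbles
    return reduce(xor, after, 0) != 0


print(is_fumble((3, 4, 5), (0, 2)))  # False: (1, 4, 5) has nim sum zero
print(is_fumble((3, 4, 5), (1, 1)))  # True: (3, 3, 5) has nim sum 5
print(is_fumble((1, 2, 3), (0, 1)))  # False: (1, 2, 3) is already lost against optimal play
```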

In [60]:
import itertools
# evaluate the policy based on number of fumbles
def evaluate_on_winning_positions(policy, Q):
    fumbles = 0
    num_winning_pos = 0
    # enumerates heap sizes 0..initial_pebbles-1 for every heap
    for s in itertools.product(range(initial_pebbles), repeat=num_heaps):
        if s == (0,)*num_heaps: continue  # terminal state
        if nim_sum(s) == 0: continue  # not a winning position for the player to move
        action = policy.final_policy(s, Q)
        if not is_optimal(s, action):
            fumbles+=1
        num_winning_pos+=1
    return fumbles, num_winning_pos

In [61]:
def evaluate_using_fumbles(policy, Q):
    fumbles, num_winning_pos = evaluate_on_winning_positions(policy, Q)
    print('number of fumbles:', fumbles)
    print('total number of winning positions:', num_winning_pos)
    print('percentage of winning positions fumbled: {:.2f} %'.format(fumbles/num_winning_pos*100))

In [62]:
evaluate_using_fumbles(policy, Q)

number of fumbles: 108
total number of winning positions: 300
percentage of winning positions fumbled: 36.00 %


In [63]:
policy, Q = monte_carlo_es(policy=SimpleGreedyPolicy())

print(Q)

100%|██████████| 100000/100000 [00:02<00:00, 38420.10it/s]

{(2, 7, 0): {(1, 7): 0.0, (0, 2): 0.0, (1, 4): 0.0, (1, 2): 0.0, (0, 1): 0.0, (1, 1): 0.0, (1, 6): 0.0, (1, 3): 0.0, (1, 5): 1.0}, (0, 1, 1): {(1, 1): 0.34498188791817835, (2, 1): 0.0}, (0, 2, 2): {(1, 1): 0.0006872177498527431, (1, 2): 0.0, (2, 2): 0.0, (2, 1): 0.0}, (0, 5, 3): {(2, 1): 0.0, (2, 3): 0.0, (2, 2): 0.0, (1, 5): 0.0, (1, 2): 0.8666666666666666, (1, 4): 0.0, (1, 1): 0.0, (1, 3): 0.0}, (1, 1, 0): {(0, 1): 0.01083688298672011, (1, 1): 0.0}, (1, 7, 0): {(1, 4): 0.0, (0, 1): 0.0, (1, 3): 0.0, (1, 6): 1.0, (1, 5): 0.0, (1, 7): 0.0, (1, 1): 0.0, (1, 2): 0.0}, (2, 2, 0): {(0, 1): 0.01629814180421461, (1, 2): 0.0, (0, 2): 0.0, (1, 1): 0.0}, (3, 3, 0): {(1, 1): 0.025462155563306627, (1, 2): 0.0, (0, 1): 0.0, (0, 3): 0.0, (1, 3): 0.0, (0, 2): 0.0}, (4, 4, 0): {(0, 1): 0.0, (1, 3): 0.0, (0, 4): 0.0, (1, 2): 0.0, (0, 3): 0.0, (1, 4): 0.0, (1, 1): 0.0, (0, 2): 0.0}, (7, 4, 7): {(0, 3): 0.0, (2, 1): 1.0, (2, 2): 0.9285714285714285, (1, 2): 1.0, (2, 7): 0.0, (0, 7): 0.0, (1, 4): 0.0, (0,




In [64]:
evaluate_using_fumbles(policy, Q)

number of fumbles: 152
total number of winning positions: 300
percentage of winning positions fumbled: 50.67 %


In [65]:
def generate_episode_no_es(policy, s0, Q, negative_on_loss=False, move_opponent=play_optimal):

    # init
    game = SimpleNim(s0)
    states = []
    actions = []
    rewards = []

    first_visit = {}
    index = 0

    # loop until end of episode
    while not game.is_game_over():
        # agent move
        curr_state = game.get_state()
        states.append(curr_state)
        curr_action = policy.policy(curr_state, Q)
        actions.append(curr_action)
        game.make_move(curr_action[0], curr_action[1])

        if (tuple(curr_state), curr_action) not in first_visit:
            first_visit[(tuple(curr_state), curr_action)] = index
        index+=1

        # opponent move
        move_opponent(game)

        if not game.is_game_over(): rewards.append(0)

    if game.winner() == 0:
        rewards.append(1)
    else:
        if negative_on_loss: rewards.append(-1)
        else: rewards.append(0)
    return states, actions, rewards, first_visit

In [66]:
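# Monte Carlo Control without exploring starts
def monte_carlo_no_es(gamma=1, policy=None, negative_on_loss=False, move_opponent=play_optimal):
    # build a fresh policy per call; a default instance in the signature would be shared across calls
    if policy is None:
        policy = EpsilonGreedyPolicy()
    Q = {} # take default to be zero
    N = {} # visit counts per (state, action)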


    for i in tqdm(range(max_episodes)):
        policy.update(epsilon=1.0/float(i+1))
        s0 = np.random.randint(0, initial_pebbles+1, num_heaps)
        while all(s0 == 0):
            s0 = np.random.randint(0, initial_pebbles+1, num_heaps)

        S, A, R, first_visit = generate_episode_no_es(policy=policy, s0=s0, Q=Q, negative_on_loss=negative_on_loss, move_opponent=move_opponent)

        if len(S) == 0:
            continue

        G = 0
        for t in range(len(R)-1, -1, -1):
            G = gamma * G + R[t]
            if first_visit[(S[t], A[t])] == t:

                if (S[t], A[t]) not in N:
                    N[(S[t], A[t])] = 0
                N[(S[t], A[t])] += 1

                if S[t] not in Q:
                    Q[S[t]] = {}
                if A[t] not in Q[S[t]]:
                    Q[S[t]][A[t]] = 0
                Q[S[t]][A[t]] = Q[S[t]][A[t]] + float((G-Q[S[t]][A[t]]))/float(N[(S[t], A[t])])

    return policy, Q

In [67]:
policy, Q = monte_carlo_no_es()

print(Q)

100%|██████████| 100000/100000 [00:02<00:00, 37512.00it/s]

{(1, 1, 0): {(0, 1): 0.2535282651072129, (1, 1): 0.0}, (2, 2, 0): {(1, 1): 0.12497576736672013, (0, 1): 0.0}, (4, 3, 0): {(1, 1): 0.0}, (0, 1, 1): {(1, 1): 0.1398551500231893, (2, 1): 0.0}, (3, 2, 1): {(1, 1): 0.349964446551315}, (4, 4, 1): {(0, 1): 0.5976522971058496}, (4, 5, 2): {(2, 1): 0.7115662650602383, (1, 1): 0.5}, (4, 6, 2): {(1, 1): 1.0}, (1, 5, 0): {(1, 1): 0.0}, (6, 2, 1): {(1, 1): 0.0}, (3, 3, 0): {(0, 1): 0.16433001954286652, (1, 1): 0.0}, (4, 3, 4): {(0, 1): 0.4952153110047856}, (4, 4, 4): {(1, 1): 1.0}, (3, 4, 6): {(1, 1): 0.3152709359605912}, (4, 4, 0): {(1, 1): 0.07838906868033096}, (5, 4, 1): {(2, 1): 0.0}, (7, 4, 2): {(2, 1): 0.0}, (2, 1, 2): {(0, 1): 0.7058550903754055}, (2, 1, 3): {(2, 1): 0.7644163150492221, (1, 1): 0.0}, (5, 2, 3): {(1, 1): 1.0}, (2, 1, 5): {(2, 1): 1.0}, (5, 5, 0): {(1, 1): 0.0}, (7, 5, 0): {(0, 1): 0.0}, (0, 2, 2): {(2, 1): 0.06442371752165205}, (0, 3, 3): {(2, 1): 0.09084077031470163}, (0, 4, 7): {(1, 1): 0.0}, (1, 0, 1): {(0, 1): 0.265353814




In [68]:
evaluate_using_fumbles(policy, Q)

number of fumbles: 250
total number of winning positions: 300
percentage of winning positions fumbled: 83.33 %


In [69]:
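# Monte Carlo Control without exploring starts, with every-visit updates
def monte_carlo_no_es_every_visit(gamma=1, policy=None, negative_on_loss=False, move_opponent=play_optimal):
    # build a fresh policy per call; a default instance in the signature would be shared across calls
    if policy is None:
        policy = EpsilonGreedyPolicy()
    Q = {} # take default to be zero
    N = {} # visit counts per (state, action)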


    for i in tqdm(range(max_episodes)):
        policy.update(1/float(i+1))
        s0 = np.random.randint(0, initial_pebbles+1, num_heaps)
        while all(s0 == 0):
            s0 = np.random.randint(0, initial_pebbles+1, num_heaps)

        S, A, R, first_visit = generate_episode_no_es(policy=policy, s0=s0, Q=Q, negative_on_loss=negative_on_loss, move_opponent=move_opponent)

        if len(S) == 0:
            continue

        G = 0
        for t in range(len(R)-1, -1, -1):
            G = gamma * G + R[t]

            if (S[t], A[t]) not in N:
                N[(S[t], A[t])] = 0
            N[(S[t], A[t])] += 1

            if S[t] not in Q:
                Q[S[t]] = {}
            if A[t] not in Q[S[t]]:
                Q[S[t]][A[t]] = 0
            Q[S[t]][A[t]] = Q[S[t]][A[t]] + float((G-Q[S[t]][A[t]]))/float(N[(S[t], A[t])])

    return policy, Q

In [70]:
policy, Q = monte_carlo_no_es_every_visit()

print(Q)

100%|██████████| 100000/100000 [00:02<00:00, 34711.94it/s]

{(1, 0, 1): {(2, 1): 0.11802215617400713}, (2, 0, 2): {(0, 1): 0.03704382908490748, (2, 1): 0.0}, (2, 4, 2): {(1, 1): 0.0}, (3, 0, 3): {(2, 1): 0.05492115280043498}, (4, 0, 4): {(0, 1): 0.07425105679103106, (2, 1): 0.0}, (5, 0, 6): {(0, 1): 0.2997032640949558}, (6, 1, 6): {(1, 1): 0.418652849740933}, (7, 1, 7): {(2, 1): 0.0}, (2, 0, 0): {(0, 1): 0.0}, (1, 1, 0): {(1, 1): 0.023339890477962512, (0, 1): 0.0}, (2, 2, 0): {(0, 1): 0.038880524311398455, (1, 1): 0.0}, (3, 3, 0): {(1, 1): 0.05109404096834257}, (4, 4, 0): {(1, 1): 0.06939614290230806}, (5, 5, 0): {(0, 1): 0.0, (1, 1): 0.0}, (6, 6, 0): {(0, 1): 0.0}, (7, 7, 0): {(1, 1): 0.0}, (1, 3, 1): {(2, 1): 0.0}, (0, 7, 0): {(1, 1): 0.0}, (0, 1, 1): {(1, 1): 0.22813182400040152, (2, 1): 0.09090909090909093}, (0, 2, 2): {(2, 1): 0.03571428571428571, (1, 1): 0.13316937467241963}, (0, 3, 3): {(2, 1): 0.18086271075103968}, (0, 4, 4): {(2, 1): 0.09132591701743896, (1, 1): 0.0}, (1, 5, 4): {(0, 1): 0.10312180143295789, (2, 1): 0.0}, (2, 7, 4): {(




In [71]:
evaluate_using_fumbles(policy, Q)

number of fumbles: 260
total number of winning positions: 300
percentage of winning positions fumbled: 86.67 %


In [72]:
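# Monte Carlo Control with exploring starts, with every-visit updates
def monte_carlo_es_every_visit(gamma=1, policy=None, negative_on_loss=False, move_opponent=play_optimal):
    # build a fresh policy per call; a default instance in the signature would be shared across calls
    if policy is None:
        policy = EpsilonGreedyPolicy()
    Q = {} # take default to be zero
    N = {} # visit counts per (state, action)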

    for i in tqdm(range(max_episodes)):
        policy.update(1/float(i+1))
        s0 = np.random.randint(0, initial_pebbles+1, num_heaps)
        while all(s0 == 0):
            s0 = np.random.randint(0, initial_pebbles+1, num_heaps)
        heap_num = np.random.randint(0,num_heaps)
        while s0[heap_num] == 0:
            heap_num = np.random.randint(0,num_heaps)
        a0 = (heap_num, np.random.randint(1, s0[heap_num]+1))

        S, A, R, first_visit = generate_episode(policy=policy, s0=s0, a0=a0, Q=Q, negative_on_loss=negative_on_loss, move_opponent=move_opponent)

        if len(S) == 0:
            continue

        G = 0
        for t in range(len(R)-1, -1, -1):
            G = gamma * G + R[t]

            if (S[t], A[t]) not in N:
                N[(S[t], A[t])] = 0
            N[(S[t], A[t])] += 1

            if S[t] not in Q:
                Q[S[t]] = {}
            if A[t] not in Q[S[t]]:
                Q[S[t]][A[t]] = 0
            Q[S[t]][A[t]] = Q[S[t]][A[t]] + float((G-Q[S[t]][A[t]]))/float(N[(S[t], A[t])])

    return policy, Q

In [73]:
policy, Q = monte_carlo_es_every_visit()

print(Q)

100%|██████████| 100000/100000 [00:02<00:00, 37451.22it/s]

{(0, 1, 1): {(1, 1): 0.011317365269461038, (2, 1): 0.0}, (0, 2, 2): {(1, 1): 0.01799657208150835, (2, 2): 0.0, (2, 1): 0.0, (1, 2): 0.0}, (1, 4, 2): {(0, 1): 0.0, (1, 1): 0.0, (1, 3): 0.0, (2, 2): 0.0, (1, 4): 0.0, (2, 1): 0.0, (1, 2): 1.0}, (1, 1, 0): {(0, 1): 0.020903883941636274, (1, 1): 0.0}, (2, 2, 0): {(1, 1): 0.0, (0, 1): 0.0316116103174091, (1, 2): 0.0, (0, 2): 0.0}, (2, 3, 1): {(2, 1): 0.006622516556291395, (1, 2): 0.0, (0, 1): 0.0, (0, 2): 0.0, (1, 3): 0.0, (1, 1): 0.0}, (4, 3, 7): {(0, 2): 0.0, (2, 3): 0.8999999999999999, (1, 1): 0.9473684210526314, (2, 5): 0.8181818181818182, (1, 3): 0.0, (2, 2): 0.9611650485436894, (2, 1): 0.0, (0, 1): 0.0, (0, 3): 0.9473684210526314, (1, 2): 0.9333333333333332, (2, 4): 0.0, (0, 4): 0.0, (2, 6): 0.0, (2, 7): 0.0}, (1, 0, 1): {(2, 1): 0.4079517992518296, (0, 1): 0.009615384615384616}, (2, 2, 1): {(0, 1): 0.1555555555555556, (2, 1): 0.6666666666666659, (1, 1): 0.0, (0, 2): 0.0, (1, 2): 0.0}, (2, 3, 3): {(2, 2): 0.9499999999999998, (2, 3): 0.




In [74]:
evaluate_using_fumbles(policy, Q)

number of fumbles: 94
total number of winning positions: 300
percentage of winning positions fumbled: 31.33 %


In [75]:
policy, Q = monte_carlo_es_every_visit(policy=SimpleGreedyPolicy())

print(Q)

100%|██████████| 100000/100000 [00:02<00:00, 38754.94it/s]

{(0, 1, 1): {(1, 1): 0.009473570890300365, (2, 1): 0.0}, (0, 2, 2): {(1, 1): 0.0008902077151335307, (2, 2): 0.0, (2, 1): 0.0, (1, 2): 0.0}, (0, 7, 3): {(1, 5): 0.0, (2, 2): 0.0, (1, 7): 0.0, (1, 4): 0.8333333333333334, (2, 1): 0.0, (1, 6): 0.0, (1, 1): 0.0, (2, 3): 0.0, (1, 3): 0.0, (1, 2): 0.0}, (1, 1, 0): {(1, 1): 0.3127020785219415, (0, 1): 0.0}, (2, 1, 2): {(0, 1): 0.8191680030185808, (2, 2): 0.0, (2, 1): 0.0, (0, 2): 0.0, (1, 1): 0.8173076923076922}, (2, 1, 3): {(2, 1): 0.8594634873323403, (1, 1): 0.0, (0, 1): 0.0, (0, 2): 0.0, (2, 2): 0.0, (2, 3): 0.0}, (2, 7, 4): {(1, 6): 1.0, (2, 1): 1.0, (0, 1): 1.0, (2, 4): 0.0, (0, 2): 0.0, (1, 2): 0.13333333333333336, (2, 2): 0.0, (1, 7): 0.0, (1, 4): 0.0, (2, 3): 0.0, (1, 1): 1.0, (1, 5): 0.0, (1, 3): 0.0}, (0, 3, 3): {(1, 1): 0.0, (2, 2): 0.0, (2, 3): 0.0, (1, 2): 0.0, (1, 3): 0.0, (2, 1): 0.0}, (0, 4, 4): {(1, 1): 0.0, (2, 1): 0.0, (1, 4): 0.0, (1, 3): 0.0, (2, 4): 0.0, (2, 2): 0.0, (2, 3): 0.0, (1, 2): 0.0}, (0, 5, 5): {(1, 1): 0.0, (1,




In [76]:
evaluate_using_fumbles(policy, Q)

number of fumbles: 111
total number of winning positions: 300
percentage of winning positions fumbled: 37.00 %


### Negative reward on losing
- Something to notice here is that, because the agent rarely wins against an optimal opponent, very few episodes end with a non-zero reward (the only signal that prompts improvement)
- This may slow down the learning process
- So, what if we use a negative reward for a loss to force improvement?

#### Changes to rewards model
$$R_T = \begin{cases} 
            1 & \text{agent wins} \\
            -1 & \text{agent loses}
        \end{cases}$$
$$R_t = 0 \hspace{3mm}\forall t<T$$
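
To see what this change does to the learned estimates, the sketch below averages the same hypothetical sequence of outcomes under both reward models, using the incremental update $Q \leftarrow Q + (G - Q)/N$. The outcome list and the helper name `running_mean` are assumptions made purely for illustration.

```python
def running_mean(returns):
    """Incremental average Q <- Q + (G - Q) / N, as used in Monte Carlo evaluation."""
    q, n = 0.0, 0
    for g in returns:
        n += 1
        q += (g - q) / n
    return q


outcomes = [True, False, False, False]  # hypothetical: one win in four episodes

q_zero_one = running_mean([1 if win else 0 for win in outcomes])
q_plus_minus = running_mean([1 if win else -1 for win in outcomes])

print(round(q_zero_one, 6))    # 0.25, the empirical win rate p
print(round(q_plus_minus, 6))  # -0.5, which equals 2 * p - 1
```

For a fixed set of outcomes (with `gamma=1`) the two estimates are related by $Q_{\pm 1} = 2\,Q_{0/1} - 1$, so an action that only ever lost is valued at $-1$ rather than $0$.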

In [77]:
# Epsilon Greedy and exploring starts first visit with negative rewards
policy, Q = monte_carlo_es(negative_on_loss=True)

print(Q)

100%|██████████| 100000/100000 [00:02<00:00, 37078.10it/s]

{(1, 0, 1): {(2, 1): -1.0, (0, 1): -0.030118832484530015}, (2, 2, 1): {(0, 1): 0.6771496437054627, (2, 1): 0.6666666666666664, (0, 2): -1.0, (1, 1): -1.0, (1, 2): -1.0}, (2, 3, 1): {(1, 1): 0.7598519888991682, (0, 2): -1.0, (2, 1): -1.0, (1, 3): -1.0, (1, 2): -1.0, (0, 1): -1.0}, (4, 4, 1): {(1, 1): 0.3052631578947368, (0, 3): -1.0, (2, 1): 0.43576826196473545, (0, 4): -1.0, (1, 2): 0.3870967741935484, (1, 4): -1.0, (1, 3): -1.0, (0, 1): 0.38095238095238093, (0, 2): 0.3689839572192516}, (4, 5, 1): {(1, 1): -1.0, (1, 2): 0.813953488372093, (2, 1): -1.0, (1, 4): -1.0, (0, 2): 0.8253382533825329, (0, 3): -1.0, (0, 4): -1.0, (1, 5): -1.0, (0, 1): 0.8125000000000001, (1, 3): 0.8181818181818181}, (5, 5, 6): {(0, 1): 0.894736842105263, (2, 5): 0.8571428571428571, (1, 3): -1.0, (2, 4): 0.27272727272727276, (1, 5): -1.0, (1, 4): 0.8181818181818182, (0, 2): 1.0, (1, 1): 0.5384615384615384, (2, 1): -1.0, (2, 6): 0.4666666666666666, (2, 2): 1.0, (1, 2): 1.0, (0, 4): 1.0, (2, 3): -0.846153846153846




In [78]:
evaluate_using_fumbles(policy, Q)

number of fumbles: 130
total number of winning positions: 300
percentage of winning positions fumbled: 43.33 %


In [79]:
# Epsilon Greedy with exploring starts every visit with negative rewards
policy, Q = monte_carlo_es_every_visit(negative_on_loss=True)

print(Q)

100%|██████████| 100000/100000 [00:02<00:00, 37745.60it/s]

{(1, 0, 1): {(0, 1): -0.9854758567970441, (2, 1): -1.0}, (2, 0, 2): {(0, 1): -0.9734398871782819, (0, 2): -1.0, (2, 1): -1.0, (2, 2): -1.0}, (3, 0, 3): {(2, 1): -0.9493441881501599, (0, 1): -1.0, (0, 2): -1.0, (0, 3): -1.0, (2, 2): -1.0, (2, 3): -1.0}, (5, 2, 3): {(1, 2): -1.0, (2, 2): 1.0, (0, 1): 1.0, (1, 1): 0.9583333333333334, (2, 3): -1.0, (2, 1): -1.0, (0, 5): -1.0, (0, 2): 1.0, (0, 3): -1.0, (0, 4): -1.0}, (3, 7, 1): {(0, 2): -1.0, (1, 7): -1.0, (2, 1): -1.0, (0, 3): -1.0, (1, 5): -1.0, (1, 1): 1.0, (1, 3): 1.0, (0, 1): 1.0, (1, 2): 1.0, (1, 6): -1.0, (1, 4): 1.0}, (1, 0, 6): {(2, 3): -1.0, (2, 5): 1.0, (0, 1): -1.0, (2, 2): -1.0, (2, 1): -1.0, (2, 6): -1.0, (2, 4): -1.0}, (2, 0, 3): {(0, 1): -1.0, (2, 3): -1.0, (2, 1): 0.7610619469026546, (2, 2): -1.0, (0, 2): -1.0}, (0, 1, 4): {(2, 4): -1.0, (2, 1): -1.0, (2, 2): -1.0, (1, 1): -1.0, (2, 3): 1.0}, (0, 6, 3): {(2, 3): -1.0, (2, 1): -1.0, (2, 2): -1.0, (1, 4): -1.0, (1, 6): -1.0, (1, 1): -1.0, (1, 3): -1.0, (1, 5): -1.0, (1, 2): 




In [80]:
evaluate_using_fumbles(policy, Q)

number of fumbles: 165
total number of winning positions: 300
percentage of winning positions fumbled: 55.00 %


In [81]:
# providing a way to check performance more reliably
def evaluate_using_fumbles_averaged(train=monte_carlo_es_every_visit, num_iters=5):
    num_winning_pos = 0
    total_fumbles = 0
    for i in range(num_iters):
        print(f'iteration {i+1}')
        policy, Q = train(negative_on_loss=True)

        fumbles, num_winning_pos = evaluate_on_winning_positions(policy, Q)
        total_fumbles += fumbles
    return total_fumbles, num_winning_pos

In [82]:
# num_iters = 5
# total_fumbles, num_winning_pos = evaluate_using_fumbles_averaged(num_iters=num_iters)
# print('average number of fumbles: ', total_fumbles/num_iters)
# print('fumble rate: {:.2f}%'.format(total_fumbles/float(num_winning_pos*num_iters)*100))

In [83]:
policy, Q = monte_carlo_es_every_visit(negative_on_loss=True, move_opponent=play_random)

print(Q)

100%|██████████| 100000/100000 [00:04<00:00, 22012.71it/s]

{(0, 4, 0): {(1, 4): 0.054621848739495854, (1, 1): -0.846153846153846, (1, 3): -1.0, (1, 2): 0.05497076023391796}, (1, 0, 0): {(0, 1): -0.07406424210246876}, (1, 0, 1): {(2, 1): -0.006555297102981604, (0, 1): -0.9629629629629628}, (2, 0, 1): {(0, 1): 0.0997624703087887, (2, 1): -0.006923076923076928, (0, 2): -1.0}, (2, 0, 2): {(2, 1): 0.016000000000000045, (2, 2): -0.09127789046653144, (0, 1): -0.945945945945946, (0, 2): 0.08614232209737833}, (2, 0, 3): {(2, 1): -0.015457788347205672, (2, 3): -0.7837837837837837, (0, 1): -0.0019267822736031004, (2, 2): -0.10204081632653061, (0, 2): -0.007874015748031503}, (3, 0, 3): {(0, 1): -0.025316455696202528, (0, 2): 0.11320754716981138, (2, 2): -0.4375, (2, 1): -0.3714285714285715, (2, 3): 0.1291390728476822, (0, 3): 0.11900532859680277}, (4, 0, 3): {(0, 1): 0.1443569553805774, (0, 4): 0.13513513513513514, (0, 2): -0.44827586206896547, (2, 2): -3.8163916471489756e-17, (2, 1): -0.019607843137254888, (2, 3): -0.5294117647058822, (0, 3): 0.173553719




In [84]:
evaluate_using_fumbles(policy, Q)

number of fumbles: 242
total number of winning positions: 300
percentage of winning positions fumbled: 80.67 %


In [85]:
# Aggregating Value Function V(s) = max_a Q(s, a) for plotting
def aggregate_V(Q):
    V = np.zeros((initial_pebbles,)*num_heaps)
    for state in itertools.product(range(initial_pebbles), repeat=num_heaps):
        if state not in Q or len(Q[state]) == 0: V[state] = 0
        else: V[state] = max(Q[state].values())  # value of the greedy action, not the action itself

    return V

In [86]:
V = aggregate_V(Q)

### The following visualization works only for 3 heaps

In [87]:
import plotly.graph_objects as go

In [None]:
def plot_value_function(V):
    s1_vals = np.arange(initial_pebbles)
    s2_vals = np.arange(initial_pebbles)
    s3_vals = np.arange(initial_pebbles)  # slider dimension

    # Prepare frames for slider
    frames = []

    for s3 in s3_vals:
        s1_grid, s2_grid = np.meshgrid(s1_vals, s2_vals, indexing='ij')
        V_slice = V[:, :, s3]

        mesh = go.Surface(
            x=s1_grid,
            y=s2_grid,
            z=V_slice,
            colorbar_title='V',
            showscale=True
        )

        frame = go.Frame(
            data=[mesh],
            name=str(round(s3, 2)),
        )
        frames.append(frame)

    # Initial mesh (first slice)
    initial_surface = go.Surface(
        x=s1_grid,
        y=s2_grid,
        z=V[:, :, 0],
        colorbar_title='V',
        showscale=True
    )

    # Create figure
    fig = go.Figure(
        data=[initial_surface],
        frames=frames
    )

    # Slider steps
    slider_steps = [
        {
            "args": [[f.name], {"frame": {"redraw": True}, "mode": "immediate"}],
            "label": f.name,
            "method": "animate",
        }
        for f in frames
    ]

    # Layout with slider
    fig.update_layout(
        title="Value Function V(s) — Varying s3",
        scene=dict(
            xaxis=dict(range=[s1_vals.min(), s1_vals.max()]),
            yaxis=dict(range=[s2_vals.min(), s2_vals.max()]),
            zaxis=dict(range=[V.min(), V.max()]),
        ),
        sliders=[{
            "steps": slider_steps,
            "currentvalue": {"prefix": "s3 = "}
        }],
    )

    fig.show()


In [None]:
plot_value_function(V)