# Agent 2 - MC Method

## Preliminaries

In [529]:
#####
## Agent 2: Monte Carlo Methods
#####


#####
##### Step 1: Load Environment

!pip install kaggle-environments

###### Import necessary libraries (run this twice per instructions)
from kaggle_environments import make, evaluate

# Create the game environment
# Set debug=True to see the errors if your agent refuses to run
env = make("connectx", debug=True)

# List of available default agents
print(list(env.agents))

from kaggle_environments import make, evaluate

# Create the game environment
# Set debug=True to see the errors if your agent refuses to run
env = make("connectx", debug=True)

# List of available default agents
print(list(env.agents))

##### Other libraries and functions
import numpy as np
from tqdm.notebook import tqdm # to keep track of the progress
from collections import defaultdict  # we will see why this is something useful for our case



['random', 'negamax']
['random', 'negamax']


In [530]:
#####
##### Define environment parameters, make environment
rows = 4
columns = 5
in_a_row = 3
debug_mode = True


env = make("connectx", {"rows": rows, "columns": columns, "inarow": in_a_row}, steps=[], debug=debug_mode)

## Initial Code from Project Assignment - Practice with Environment

In [531]:
#####
##### Create a Training Agent. Run Agent1 against random player

# Define Agent 1 - play in left most column that is legal
def agent1(obs):
  return [cols for cols in range(len(obs.board)) if obs.board[cols] == 0][0]   
  #this agent tries to place a chip in the furthest left column, provided that there's still space to do it

# Load a default agent called "random".
agent2 = "random"

# Training agent in first position (player 1) against the default random agent.
trainer = env.train([None, "random"])

obs = trainer.reset()
for _ in range(30):

    #action = 0 # Action for the agent being trained.
    action = agent1(obs)   # agent1 is defined above; it places a piece in the left-most column that still has space.
    obs, reward, done, info = trainer.step(action)

    if done:
        print(reward)
        env.render(mode='ipython')
        obs = trainer.reset()

1


1


1


1


-1


1


1


1


In [532]:
#####
##### Now run Agent1 against negamax - it doesn't win

trainer = env.train([None, "negamax"])

obs = trainer.reset()
for _ in range(30):
    action = agent1(obs)   # agent1 is defined above; it places a piece in the left-most column that still has space.
    obs, reward, done, info = trainer.step(action)
    
    if done:
        print(reward)
        env.render(mode='ipython')
        obs = trainer.reset()

-1


-1


-1


-1


-1


-1


-1


In [7]:
#####
##### Now try to play manually against the random agent

env.play([None, "random"])

## Getting Ready to Build My MC Agent

I start with the agent format provided in the assignment's Python code:

In [533]:
class RandomAgent(object):

    def __init__(self):
        from collections import defaultdict
        self.epsilon = 0.1
        self.policy = defaultdict(lambda: 2) # According to lab 4, defaultdict gives default value for any key
                                            # So here, self.policy[x] gives 2 for any otherwise undefined key 'x'. 

    def act(self, obs):  
        if np.random.rand() < self.epsilon: # np.random.rand samples from standard uniform dist. 
            return columns//2 # so with probability epsilon, it returns (floor rounded) columns/2.
        else: # with probability 1 - epsilon:
            valid_moves = [col for col in range(columns) if obs.board[col] == 0] # 
            chosen_action = np.random.choice(valid_moves)
            return int(chosen_action)

    def learn(self, obs, action, reward):
        self.epsilon *= 0.9995
        if reward is not None:
          if reward>=0:
            self.policy[tuple(obs.board)] = action


We see that the agent is composed of the following parts:

* epsilon
* policy

And the built-in functions for that agent are:
* act(self, obs): returns the chosen action as an integer. Note that action must be in [0,1,2,3,4], which corresponds to the column to drop the chip into. 
* learn(self, obs, action, reward): takes the board, an action, and a reward. Epsilon is shrunk slightly, and the policy is updated at the specific obs.board value when the reward is non-negative.


Some other things to note are: 

* 'obs' appears to be the observed state. obs.board is the important piece. According to game documentation, obs.board is a "Serialized grid (rows x columns). 0 = Empty, 1 = P1, 2 = P2".
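To make the serialized layout concrete, here is a minimal sketch that splits the flat board into rows and lists the playable columns. The helper names `board_to_grid` and `valid_moves` are my own (they are not part of the Kaggle environment), and the sketch assumes the list is stored row by row with the top row first, which is what the `obs.board[col] == 0` legality check used throughout this notebook relies on.

```python
# Board dimensions used in this notebook (rows=4, columns=5).
ROWS, COLUMNS = 4, 5

def board_to_grid(board, rows=ROWS, columns=COLUMNS):
    """Split the flat board list into a list of rows (top row first)."""
    return [list(board[r * columns:(r + 1) * columns]) for r in range(rows)]

def valid_moves(board, columns=COLUMNS):
    """A column is playable while its top cell (flat index 0..columns-1) is empty."""
    return [col for col in range(columns) if board[col] == 0]

# One of the states printed later in this notebook: column 2 is full.
board = [0, 0, 2, 0, 0,
         0, 0, 1, 0, 0,
         0, 1, 2, 2, 0,
         0, 1, 2, 1, 0]

for row in board_to_grid(board):
    print(row)
print("playable columns:", valid_moves(board))
```

Column 2 is missing from the playable list because its top cell is occupied, which is exactly the condition the agents below test before choosing an action.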

Now that I understand the basics of the agent, I can apply the MC method:

Some notes about my MC method:

- In this game, each state can only be visited once per episode, so first-visit methods are fine. 
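- I am going to try an off-policy method with weighted importance sampling (roughly Algorithm 4 from Lecture 4). Note that the function built below, `mc_on_policy_control`, is the on-policy epsilon-soft variant.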
- Generally, I will be closely following the process we used to develop a blackjack strategy in Lab 5.

## Building My MC Agent

### Helper function 1: Convert 'Struct' Type Defining Current State to Manageable Dict Key

I keep running into an issue where using the observation directly (the output of `trainer.reset()` or the first output of `trainer.step()`) as a dictionary key gives an error: 'Unhashable type 'Struct''. This appears to be because the Kaggle environment stores the state information in its own data type called a 'Struct', which is not hashable and cannot be used as a dictionary key. I need to define dictionary keys according to the state, so I need a way of converting this 'Struct' to something hashable (such as a tuple) while still preserving all the information that makes the state unique.
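The hashing problem can be reproduced without the environment. The sketch below uses `FakeStruct`, a hypothetical stand-in that I am assuming behaves like the real 'Struct' in the one respect that matters here: it is a dict-like object, and dict-like objects cannot be hashed. Converting the board to a tuple and pairing it with the mark gives a key that works.

```python
class FakeStruct(dict):
    """Hypothetical stand-in for the environment's observation type (an assumption)."""
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

obs = FakeStruct(board=[0] * 20, mark=1)

# Using the dict-like object itself as a key fails.
try:
    {obs: 0}
    hashable = True
except TypeError as err:
    hashable = False
    print("cannot use the observation as a key:", err)

# A tuple of (board, mark) is hashable and still identifies the state.
key = (tuple(obs["board"]), obs["mark"])
visits = {key: 1}
print(visits[key])
```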

Below, I define the function `state_id`. This is named because it extracts the identifying information about the state from the 'Struct'.

In [534]:
def state_id(Struct):
    return (tuple(Struct['board']), Struct.mark) # Combine the two features we need to know: the state of the board and which mark we are
    # into a tuple
    
    # Note, Struct.board worked when the trainer was set as [None, other], but not when it was [other, None].
    # No idea why, but I get around this by using Struct['board']

### Helper function 2: Epsilon soft policy

The inputs are (based on Lab 5):
- `Q`: A dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.
- `epsilon`: A value from 0 to 1 specifying the probability of taking a non-greedy action.

Note that in my version below, I have to ensure that a legal move is played, so illegal moves are assigned probability 0 and the `epsilon` probability is divided equally among all *legal* moves; the greedy move receives that share plus the remaining `1 - epsilon`.
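As a worked example of that distribution, here is a small sketch in plain Python lists. The function name `epsilon_soft_probs` is made up for illustration and is not the `epsilon_soft` function of this notebook; it follows the convention used in the notebook's code, where every legal move gets `epsilon / n_legal` and the greedy legal move additionally gets `1 - epsilon`.

```python
def epsilon_soft_probs(q_values, legal_moves, epsilon=0.25):
    """Return action probabilities that are zero for illegal moves."""
    n_legal = len(legal_moves)
    probs = [0.0] * len(q_values)
    for a in legal_moves:
        probs[a] = epsilon / n_legal          # exploration share for each legal move
    best = max(legal_moves, key=lambda a: q_values[a])  # greedy among legal moves only
    probs[best] += 1.0 - epsilon              # remaining mass goes to the greedy move
    return probs

# Column 1 has the highest Q value but is full, so it must never be chosen.
q = [0.1, 0.9, 0.3, 0.0, -0.2]
probs = epsilon_soft_probs(q, legal_moves=[0, 2, 3, 4], epsilon=0.25)
print(probs)
print(sum(probs))
```

The probabilities sum to 1, the full column has probability 0, and the greedy choice is the best *legal* column rather than the overall argmax.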

In [650]:
# Allowing illegal moves:
#def epsilon_soft(env, Q, epsilon=0.25): # set 0.25 as default here
#  nA = env.configuration.columns
#  policy = defaultdict(lambda: np.ones(nA)*epsilon/nA) # default policy for non-greedy action is random 
#  for keys, values in Q.items():
#    best = np.argmax(values) # find the best action
#    policy[keys][best] = 1-epsilon + epsilon/nA # and the greedy action will be take with this prob
#  return policy 

def epsilon_soft(env, Q, epsilon=0.25): # set 0.25 as default here
    nA = env.configuration.columns
    policy = defaultdict(lambda: np.ones(nA)/nA) # default for unseen states is uniform, so it is a valid distribution
    for keys, values in Q.items():
        # Now find best action *out of valid moves*
        valid_moves = [col for col in range(nA) if keys[0][col] == 0] # Find valid moves. Remember keys[0] is the board
        n_valid_moves = len(valid_moves) # number of valid moves
            
        for i in range(nA): # Need to define policy so that illegal moves can't be taken
            if i in valid_moves:
                policy[keys][i] = epsilon/n_valid_moves
            else:
                policy[keys][i] = 0
        
        # Work on a copy: Q[keys][:] would be a NumPy view, and masking it would overwrite Q itself
        valid_Q_state = np.array(values, dtype=float)
        for i in range(len(valid_Q_state)):
            if i not in valid_moves:
                valid_Q_state[i] = -1000000 # Set Q for illegal moves extremely low so it won't be max
        best_action = np.argmax(valid_Q_state) # Find best action
        policy[keys][best_action] = 1 - epsilon + epsilon/n_valid_moves # the greedy action is taken with this probability
    return policy 

### Helper function 3: Generate episodes

My second step will be to create a function for generating episodes given a policy as an input.

The inputs are (roughly based on Lab 5):
- `trainer`: a trainer from `env.train()`, from which the `trainer.step()` function can be accessed to run a turn. 
- `Q`: A dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.
- `policy`: This is a dictionary where `policy[s]` returns an array of action probabilities for state `s`; the action is sampled from this array.

And the output is a list of tuples `[(state1, action1, reward1), (state2, action2, reward2),..., (stateN, actionN, rewardN)]` holding the state, action and reward for each step of the episode.



In [536]:
# Note that this uses the trainer, not just the environment
def generate_episode(trainer, Q, policy):
    nA = env.configuration.columns # number of actions is the number of columns on the board
    episode = [] # initialize the episode as an empty list. We will add to it later
    state = state_id(trainer.reset()) # empty board
    while True:
        if state in Q:
            global mystate # testing with this
            mystate = state
            
            action = np.random.choice(np.arange(nA), p = policy[state]) 
        else:
            # if not already in dict, take random action. But make sure it is a legal action
            valid_moves = [col for col in range(nA) if state[0][col] == 0]
            n_valid_moves = len(valid_moves) # number of valid moves
            action = np.random.choice(valid_moves, p = np.ones(n_valid_moves)/n_valid_moves) 
        action = int(action) # Make it an integer, not a numpy integer
        next_state, reward, done, info = trainer.step(action)
        next_state = state_id(next_state) # convert state to tuple for using as dictionary
        episode.append((state, action, reward))
        state = next_state
        if done:
            break # finish episode when environment says game is done. No need to set a turn limit,
            # because max number of turns per game is 20.
    return episode

### Helper Function 4: Plot rewards
This function comes from lab 8 and can be used to plot the cumulative reward

In [668]:
# Plot rewards function from Lab 8
import matplotlib.pyplot as plt  # needed for the plt.* calls below

def plot_rewards(cum_rew, method=None):
    plt.plot(cum_rew)
    plt.xlabel("episode")
    plt.ylabel(r"$\sum R$")
    plt.title(method)
    plt.show()

### Helper Function 5: Generate an episode from a manually played game
This function is similar to the generate episode function above, but it allows the game to be played manually. With this function, I should be able to train my agent by playing against it myself.

Note, I still can't figure out how to save an episode from a game played through env.play()

### Real function: On-Policy MC Method

This function takes the following arguments:
- `env`: The environment
- `trainer_mark1`: A trainer from `env.train()` where this agent plays as mark1.
- `trainer_mark2`: A trainer from `env.train()` where this agent plays as mark2.
- `num_episodes`: The number of episodes to run
- `gamma`: The discount rate
- `epsilon`:  A value from 0 to 1 specifying the probability of taking a non-greedy action.

It also takes the following arguments, which I initialize outside of the function so that the MC policy control can be run even if there is a previous policy other than the random one:

- `N`: a (default) dictionary storing the number of visits to a given state
- `Q`: a (default) dictionary storing the Q values for each state. Default is 0
- `policy`: The current policy (a default dict)
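The core of the update is the backward pass over an episode: the return `G` is accumulated from the last step to the first, and each `Q[s][a]` is moved toward `G` by an incremental mean. The sketch below isolates just that part on a toy episode. The state labels (`"s0"`, `"s1"`, `"s2"`) and the helper name `update_q` are placeholders of mine, and plain lists stand in for the NumPy arrays used in the notebook.

```python
from collections import defaultdict

N_ACTIONS = 5

def update_q(episode, Q, N, gamma=1.0):
    """Every-visit MC update: walk the episode backwards and average the returns."""
    G = 0.0
    for state, action, reward in reversed(episode):
        G = gamma * G + reward                      # discounted return from this step
        N[state][action] += 1                       # visit count for this state-action pair
        Q[state][action] += (G - Q[state][action]) / N[state][action]  # running mean
    return Q, N

Q = defaultdict(lambda: [0.0] * N_ACTIONS)
N = defaultdict(lambda: [0] * N_ACTIONS)

# A won game: the only non-zero reward arrives on the final move.
update_q([("s0", 2, 0), ("s1", 1, 0), ("s2", 3, 1)], Q, N, gamma=0.5)
# A lost game that starts with the same first move.
update_q([("s0", 2, 0), ("s1", 4, -1)], Q, N, gamma=0.5)

print(Q["s0"][2], N["s0"][2])
```

Because each board position occurs at most once per game here, every-visit and first-visit updates coincide, so no first-visit check is needed.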

In [542]:
def mc_on_policy_control(env, trainer_mark1, trainer_mark2,  
                         current_N, current_Q, current_policy, num_episodes=10000,
                         gamma=1.0, epsilon=0.25):
    nA = env.configuration.columns # number of actions is the number of columns on the board
    
    # Pull in initial N, Q, policy
    N = current_N
    Q = current_Q
    policy = current_policy
    
    # loop over episodes
    for i in tqdm(range(num_episodes)): # tqdm gives progress bar
        if i % 2 == 0:
            episode = generate_episode(trainer_mark1, Q, policy) # generate an episode from previous function
        else:
            episode = generate_episode(trainer_mark2, Q, policy) # rotate between being player 1, player 2
        T = len(episode) # number of turns that this agent took (1-10)
        G = 0.0 # Initialize the return as 0, it will be updated at each step of the episode
        # Now obtain the states, actions, and rewards
        for t in reversed(range(T)):
            state, action, rewards = episode[t]
            G = gamma * G + rewards # update return
            N[state][action] += 1 # Add one to counter for this s-a pair
            Q[state][action] += (G - Q[state][action])/N[state][action] # Update Q for this s-a pair.
            
            # Now update policy for this state, but remember to only allow legal moves
            valid_moves = [col for col in range(nA) if state[0][col] == 0] # Find valid moves. Remember state[0] is the board
            n_valid_moves = len(valid_moves) # number of valid moves
            
            for j in range(len(policy[state])): # Need to define policy so that illegal moves can't be taken
                if j in valid_moves:
                    policy[state][j] = epsilon/n_valid_moves
                else:
                    policy[state][j] = 0
                    
            # Now find best action *out of valid moves*
            valid_Q_state = Q[state].copy()   # copy: Q[state][:] is a NumPy view, so masking it would overwrite Q itself
            for j in range(len(valid_Q_state)):
                if j not in valid_moves:
                    valid_Q_state[j] = -1000000 # Set Q for illegal moves extremely low so it won't be max
            best_action = np.argmax(valid_Q_state) # Find best action
            policy[state][best_action] = 1- epsilon + epsilon/n_valid_moves # and update probability for the best move
            
    on_policy = dict((key,np.argmax(value)) for key, value in policy.items()) # In the end, take best policy with prob 1
    V_on = dict((key,np.max(value)) for key, value in Q.items()) # Take best Q approximation as well
    return on_policy, V_on, Q

## Training my MC Agent: Method 1 (Without creating an agent class)

I will first train my MC Agent against the default `negamax` agent. 

In [543]:
trainer1 = env.train([None, "negamax"])
trainer2 = env.train(["negamax", None])

# Initialize Q, N, policy. No prior info or prior policy (other than random epsilon_soft) here:
nA = env.configuration.columns # number of actions; otherwise nA is only defined inside the functions above
N = defaultdict(lambda: np.zeros(nA)) # initialize empty count for number of visits to each state-action
Q = defaultdict(lambda: np.zeros(nA)) # initialize Q as 0 for each state-action
policy = epsilon_soft(env, Q) # initial policy


training_run1 = mc_on_policy_control(env, trainer1, trainer2,
                                     current_N = N, current_Q = Q,
                                     current_policy = policy, num_episodes=1000, gamma=0.9, epsilon=0.25)

  0%|          | 0/1000 [00:00<?, ?it/s]

KeyboardInterrupt: 

Now I have in `training_run1` a policy, V, and Q values. Let's take the policy and play against the agent myself now. The function `trained_policy` below creates an agent that plays according to the policy found during the first training run. 

In [544]:
mypolicy = training_run1[0] # the greedy policy (first element returned by mc_on_policy_control)

def trained_policy(obs):
    board_id = (tuple(obs.board), obs.mark,)
    if board_id in mypolicy.keys():
        return(int(mypolicy[board_id]))
    else:
        valid_moves = [col for col in range(5) if obs.board[col] == 0]
        n_valid_moves = len(valid_moves) # number of valid moves
        action = np.random.choice(valid_moves, p = np.ones(n_valid_moves)/n_valid_moves) 
        action = int(action) # Make it an integer, not a numpy integer
        #print('playing random')
        return action
    
    

In [545]:
env.play([trained_policy, None]) # Initiate my game against it.

The policy found by training run 1, in which I trained the agent's policy against the `negamax` agent using the MC method, does decently well. I can still beat it, however. 

I want to improve the MC method policy even more. I will do this by training it further, this time against another version of itself (one that isn't learning though). I will call this run2.
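One thing to watch when the opponent is meant to be a version "that isn't learning": if the opponent's policy dictionary shares its probability arrays with the policy being trained, in-place updates will leak into the opponent. A shallow `dict.copy()` does not prevent this, because it copies the dictionary but not the arrays inside it. The sketch below shows the difference with plain lists; `freeze_policy` is a hypothetical helper of mine, and with NumPy arrays `probs.copy()` would play the role of `list(probs)`.

```python
def freeze_policy(policy):
    """Copy every probability vector so later in-place updates do not leak through."""
    return {state: list(probs) for state, probs in policy.items()}

live_policy = {"s0": [0.5, 0.5]}

shallow = live_policy.copy()          # new dict, but the same inner list objects
frozen = freeze_policy(live_policy)   # new dict with independent inner lists

# An in-place update, as done to policy[state][j] during learning.
live_policy["s0"][0] = 0.9

print(shallow["s0"])   # follows the live policy
print(frozen["s0"])    # keeps the snapshot
```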

In [None]:
trainer1 = env.train([None, trained_policy])
trainer2 = env.train([trained_policy, None])

# Initialize Q, N, policy. No prior info or prior policy (other than random epsilon_soft) here:
N = defaultdict(lambda: np.zeros(nA)) # initialize empty count for number of visits to each state-action
Q = defaultdict(lambda: np.zeros(nA)) # initialize Q as 0 for each state-action
policy = epsilon_soft(env, Q) # initial policy


training_run2 = mc_on_policy_control(env, trainer1, trainer2,
                                     current_N = N, current_Q = Q,
                                     current_policy = policy, num_episodes=1000, gamma=0.9, epsilon=0.25)

## Training my MC Agent: Method 2 (Creating an agent class)

The above method works, but the assignment recommends streamlining the learning code by creating an agent that can play and learn. In this section, I perform training using the MC method on an agent:

In [648]:
class MCMethodAgent(object):

    def __init__(self):
        from collections import defaultdict
        self.epsilon = 0.25
        self.env = make("connectx", {"rows": 4, "columns": 5, "inarow": 3}, steps=[], debug=debug_mode)
        nA = self.env.configuration.columns # number of actions, captured by the lambdas below
        self.N = defaultdict(lambda: np.zeros(nA)) # initialize empty count for number of visits to each state-action
        self.Q = defaultdict(lambda: np.zeros(nA)) # initialize empty Q 
        self.policy = epsilon_soft(self.env, self.Q) # initial policy
        self.greedy_policy = self.policy # placeholder; replaced by the greedy (argmax) policy after learning
        self.cumulative_reward = [] # initialize empty list to hold cumulative rewards from learning

    def act(self, obs):  # play (greedily) according to the current policy. Note that it only plays one turn. 
        board_id = (tuple(obs.board), obs.mark,)
        if board_id in self.policy.keys():
            return int(np.argmax(self.policy[board_id]))   
        else:
            valid_moves = [col for col in range(5) if obs.board[col] == 0]
            n_valid_moves = len(valid_moves) # number of valid moves
            action = np.random.choice(valid_moves, p = np.ones(n_valid_moves)/n_valid_moves) 
            action = int(action) # Make it an integer, not a numpy integer
            #print('playing random')
            return action
        
    def act_nongreedily(self, obs):
        board_id = (tuple(obs.board), obs.mark,)
        if board_id in self.policy.keys():
            action = np.random.choice(range(5), p = self.policy[board_id]) 
            return int(action) # Make it an integer, not a numpy integer
        else:
            valid_moves = [col for col in range(5) if obs.board[col] == 0]
            n_valid_moves = len(valid_moves) # number of valid moves
            action = np.random.choice(valid_moves, p = np.ones(n_valid_moves)/n_valid_moves) 
            action = int(action) # Make it an integer, not a numpy integer
            #print('playing random')
            return action
        
            

    def learn_MConpolicy(self, trainer_mark1, trainer_mark2, num_episodes=10000, 
                         gamma=1.0, epsilon = 0.25): # Essentially the MC_on_policy_control() function from above
                
        # Pull in initial env, N, Q, policy
        env = self.env
        N = self.N.copy()
        Q = self.Q.copy()
        policy = self.policy.copy()
        cumulative_reward = self.cumulative_reward.copy()
        
        nA = env.configuration.columns # number of actions is the number of columns on the board
    
        # loop over episodes
        for i in tqdm(range(num_episodes)): # tqdm gives progress bar
            if i % 2 == 0:
                episode = generate_episode(trainer_mark1, Q, policy) # generate an episode from previous function
            else:
                episode = generate_episode(trainer_mark2, Q, policy) # rotate between being player 1, player 2
            T = len(episode) # number of turns that this agent took (1-10)
            cumulative_reward.append(episode[T-1][2]) # final reward 
            G = 0.0 # Initialize the return as 0, it will be updated at each step of the episode
            # Now obtain the states, actions, and rewards
            for t in reversed(range(T)):
                state, action, rewards = episode[t]
                G = gamma * G + rewards # update return
                N[state][action] += 1 # Add one to counter for this s-a pair
                Q[state][action] += (G - Q[state][action])/N[state][action] # Update Q for this s-a pair.
            
                # Now update policy for this state, but remember to only allow legal moves
                valid_moves = [col for col in range(nA) if state[0][col] == 0] # Find valid moves. Remember state[0] is the board
                n_valid_moves = len(valid_moves) # number of valid moves
            
                for j in range(len(policy[state])): # Need to define policy so that illegal moves can't be taken
                    if j in valid_moves:
                        policy[state][j] = epsilon/n_valid_moves
                    else:
                        policy[state][j] = 0
                    
                # Now find best action *out of valid moves*
                valid_Q_state = Q[state].copy()   # copy: Q[state][:] is a NumPy view, so masking it would overwrite Q itself
                for j in range(len(valid_Q_state)):
                    if j not in valid_moves:
                        valid_Q_state[j] = -1000000 # Set Q for illegal moves extremely low so it won't be max
                best_action = np.argmax(valid_Q_state) # Find best action
                policy[state][best_action] = 1- epsilon + epsilon/n_valid_moves # and update probability for the best move
        
        self.policy = policy.copy() # update the agent's policy
        self.greedy_policy = dict((key,np.argmax(value)) for key, value in policy.items()) # In the end, take best policy with prob 1
        self.Q = Q.copy()
        self.N = N.copy()
        self.cumulative_reward = cumulative_reward.copy()
        #V_on = dict((key,np.max(value)) for key, value in Q.items()) # Take best Q approximation as well
        # return on_policy, V_on, Q. # Don't need to return anything for now. Just update the agent

Now initialize this agent and train it against `random`:

In [635]:
Agent2 = MCMethodAgent() # comment this line out when re-running the cell to keep what the agent has already learned

trainer1 = env.train([None, "random"])
trainer2 = env.train(["random", None])

Agent2.learn_MConpolicy(trainer1, trainer2, num_episodes = 25000)

  0%|          | 0/5000 [00:00<?, ?it/s]

KeyboardInterrupt: 

Now we can access the learned policy from the agent directly:

In [548]:
# Agent2.policy
# Agent2.greedy_policy

{((0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 1, 2, 2, 0, 0, 1, 2, 1, 0), 1): 0,
 ((0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 2, 0, 0, 1, 2, 1, 0), 1): 0,
 ((0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 2, 1, 0), 1): 0,
 ((0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0), 1): 0,
 ((0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 1): 0,
 ((0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0, 0), 2): 1,
 ((0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0), 2): 0,
 ((0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 0, 1, 0, 2, 0, 2), 1): 0,
 ((0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 2, 0, 0), 1): 3,
 ((0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0), 1): 0,
 ((0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0), 2): 0,
 ((0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0), 2): 2,
 ((0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 2, 1, 0), 1): 0,
 ((0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 2, 0, 0), 1): 0,
 ((0, 

Let's train it even further, this time against a negamax agent, since training against `random` is unlikely to expose it to all the plays that negamax makes.

In [588]:
trainer1 = env.train([None, "negamax"])
trainer2 = env.train(["negamax", None])

Agent2.learn_MConpolicy(trainer1, trainer2, num_episodes = 25000)

  0%|          | 0/10000 [00:00<?, ?it/s]

Let's train it even further, this time against itself:

In [591]:
mypolicy = Agent2.policy

def trained_policy(obs): # now defined according to the initial agent policy
    board_id = (tuple(obs.board), obs.mark,)
    if board_id in mypolicy.keys():
        action = int(np.random.choice(range(5), p = mypolicy[board_id]))
        return action
    else:
        valid_moves = [col for col in range(5) if obs.board[col] == 0]
        n_valid_moves = len(valid_moves) # number of valid moves
        action = np.random.choice(valid_moves, p = np.ones(n_valid_moves)/n_valid_moves) 
        action = int(action) # Make it an integer, not a numpy integer
        #print('playing random')
        return action



trainer1 = env.train([None, trained_policy])
trainer2 = env.train([trained_policy, None])

Agent2.learn_MConpolicy(trainer1, trainer2, num_episodes = 10000)

  0%|          | 0/10000 [00:00<?, ?it/s]

### Plotting the cumulative reward

In [None]:
cum_reward = np.cumsum(Agent2.cumulative_reward)
plot_rewards(cum_reward, method = "Agent 2")

### Testing Agent 2: 
Now I create an agent/function that plays according to this learned policy and simulate games/play it

In [636]:
mypolicy = Agent2.greedy_policy

def trained_policy(obs): # now defined according to the agent's greedy policy
    board_id = (tuple(obs.board), obs.mark,)
    if board_id in mypolicy.keys():
        action = int(mypolicy[board_id])
        return action
    else:
        valid_moves = [col for col in range(5) if obs.board[col] == 0]
        n_valid_moves = len(valid_moves) # number of valid moves
        action = np.random.choice(valid_moves, p = np.ones(n_valid_moves)/n_valid_moves) 
        action = int(action) # Make it an integer, not a numpy integer
        #print('playing random')
        return action


## Test winning percentage against negamax

The agent trained with the MC method beats me only some of the time. Let's see how it performs against negamax, compared with a random player against negamax.

In [666]:
# Here, function sim_games() runs n games between player 1 and player 2 (rotating who goes first)
def sim_games(player1, player2, n):
    p1_wins = 0
    for i in tqdm(range(n)):
        if i % 2 == 0:
            game = env.run([player1, player2])
            final_state = game[-1][0] # player1 sits at index 0 in this game
            if final_state.reward == 1: # a final reward of 1 means player1 won
                p1_wins += 1
                
        else:
            game = env.run([player2, player1])
            final_state = game[-1][1] # player1 sits at index 1 in this game
            if final_state.reward == 1: # a final reward of 1 means player1 won
                p1_wins += 1
    return p1_wins, p1_wins/n

import random
random.seed(123)
random_vs_negamax = sim_games('random', 'negamax', 100) # 100 games, matching the output shown below
random_vs_negamax

  0%|          | 0/100 [00:00<?, ?it/s]

(6, 0.06)

The random agent wins only 6% of the time against negamax.

In [667]:
random.seed(234)
Agent2_vs_negamax = sim_games(trained_policy, 'negamax', 100) # 100 games, matching the output shown below
Agent2_vs_negamax 

  0%|          | 0/100 [00:00<?, ?it/s]

(59, 0.59)

My agent wins 59% of the time, so it is much better than the random one. Let's test my agent against a random player.
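Since these win rates come from a limited number of games, it helps to attach a rough uncertainty to them. The sketch below uses the counts from the printed outputs (59 wins and 6 wins, each out of 100 games) and a normal-approximation 95% interval. The helper `win_rate_interval` is my own, and the approximation is crude for rates near 0 or 1, so treat the numbers as indicative only.

```python
import math

def win_rate_interval(wins, games, z=1.96):
    """Win rate with an approximate 95% interval (normal approximation to the binomial)."""
    p = wins / games
    se = math.sqrt(p * (1 - p) / games)   # standard error of a proportion
    return p, p - z * se, p + z * se

p_agent, low_agent, high_agent = win_rate_interval(59, 100)
p_rand, low_rand, high_rand = win_rate_interval(6, 100)

print(f"agent : {p_agent:.2f} ({low_agent:.3f} to {high_agent:.3f})")
print(f"random: {p_rand:.2f} ({low_rand:.3f} to {high_rand:.3f})")
```

The two intervals are far apart, so the gap between my agent and the random player against negamax is unlikely to be a sampling accident, even though the 59% figure itself is only known to within roughly ten percentage points.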

In [644]:
random.seed(234)
Agent2_vs_random = sim_games(trained_policy, 'random', 100) # 100 games, matching the output shown below
Agent2_vs_random

  0%|          | 0/100 [00:00<?, ?it/s]

(92, 0.92)

That is good to see at least: it wins 92% of the time. Now I test it against myself.

In [None]:
env.play([None, trained_policy])

In [None]:
env.play([trained_policy, None])

### Questions:

- This really isn't that good. Maybe it will improve if I train it a lot more.
- I think that I do have to rotate between player 1 and player 2 during training, right?
- I'm struggling to get the `self.act()` method to work as a player/trainer. I keep getting an error about three positional arguments. So instead I am putting the policy from the agent into an outside function and playing games according to that.
- I can't figure out how to save episodes from a manually played game. I want to do this so I can train the agent by playing against it myself.
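On the third question, one workaround that may help is to wrap the bound method in a plain function and hand that function to the environment. A plausible explanation for the error, which I have not verified against the library source, is that the environment passes both the observation and the configuration, so a method declared as `act(self, obs)` ends up receiving three positional arguments once `self` is counted. The sketch below is self-contained: `LeftmostAgent` is a hypothetical stand-in with the same `act(self, obs)` shape as `MCMethodAgent`, and `as_agent_function` is a helper name of my own.

```python
from types import SimpleNamespace

def as_agent_function(agent):
    """Wrap agent.act in a plain function that accepts (obs, config)."""
    def agent_fn(obs, config):
        # config is accepted but ignored; the wrapped method only needs obs.
        return int(agent.act(obs))
    return agent_fn

class LeftmostAgent:
    """Hypothetical stand-in: plays the left-most column that still has space."""
    def act(self, obs):
        return [col for col in range(5) if obs.board[col] == 0][0]

agent_fn = as_agent_function(LeftmostAgent())

# Stand-in observation: column 0 is already full at the top row.
obs = SimpleNamespace(board=[1, 0, 0, 0, 0] + [0] * 15, mark=1)
action = agent_fn(obs, None)
print(action)

# Assumed usage with the real environment (not run here):
# env.play([None, as_agent_function(Agent2)])
```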