**This notebook is a practice lab in the [Intro to Game AI and Reinforcement Learning](https://www.kaggle.com/learn/intro-to-game-ai-and-reinforcement-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/deep-reinforcement-learning).**

---


# Introduction

In this practice lab, I train an agent against several different opponent agents. My aims are:
- to understand how the performance changes after each training run against a specific opponent
- to understand how the overall performance changes across all training runs against all opponents
- to benchmark the agent against a very strong opponent. See the [collection](connectX_agents_collection) gathered from [vyacheslavbolotin](https://www.kaggle.com/vyacheslavbolotin)'s [overview](https://www.kaggle.com/code/vyacheslavbolotin/agents-connect-x)
- to potentially test various models
- to potentially test various reward schemes (I liked the idea of [Pascal Pons](http://blog.gamesolver.org/solving-connect-four/01-introduction/) to take the number of moves required to win the game into account)
- to apply the framework of my [practice lab cart pole](practice_lab_stable_baselines3_cartpole_v1.ipynb)
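
To make the reward idea from the list above more concrete, here is a minimal sketch of a win reward that depends on how quickly the game was won. It follows the spirit of Pons's scoring as I understand it (a win with the last stone scores 1, with the second to last stone 2, and so on). The function name `move_aware_win_reward` and the normalisation to the range (0, 1] are my own assumptions and are not used anywhere else in this notebook.

```python
def move_aware_win_reward(stones_played_by_winner, rows=6, columns=7, inarow=4):
    """Hypothetical reward in (0, 1]: the fewer stones the winner needed, the higher."""
    stones_per_player = (rows * columns) // 2  # 21 on the default board
    # 1 for a win with the last stone, 2 for a win with the second to last stone, ...
    score = stones_per_player + 1 - stones_played_by_winner
    # The earliest possible win needs `inarow` stones, which gives the highest score
    max_score = stones_per_player + 1 - inarow
    return score / max_score


print(move_aware_win_reward(4))   # fastest possible win
print(move_aware_win_reward(21))  # win with the very last stone
```

A reward like this could replace the constant `1` returned for a win, so that quick wins are preferred over slow ones.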

# Basic considerations

In [1]:
from learntools.core import binder
binder.bind(globals())
from learntools.game_ai.ex4 import *

### 1) Set the architecture

In the tutorial, you learned one way to design a neural network that can select moves in Connect Four.  The neural network had an output layer with seven nodes: one for each column in the game board.

Say now you wanted to create a neural network that can play chess.  How many nodes should you put in the output layer?

- Option A: 2 nodes (number of game players)
- Option B: 16 nodes (number of game pieces that each player starts with)
- Option C: 4672 nodes (number of possible moves)
- Option D: 64 nodes (number of squares on the game board)

Use your answer to set the value of the `best_option` variable below.  Your answer should be one of `'A'`, `'B'`, `'C'`, or `'D'`.

In [4]:
# Fill in the blank
best_option = 'C'

# Check your answer
q_1.check()

<span style="color:#33cc33">Correct:</span> If we use a similar network as in the tutorial, the network should output a probability for each possible move.
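
The exercise does not say where the number 4672 comes from. As far as I know, it matches the move encoding used by AlphaZero-style chess networks: for each of the 64 squares a piece can be picked up from, there are 73 move types. The short calculation below is only a sketch of that counting argument; the variable names are my own.

```python
squares = 8 * 8                 # squares a piece can be picked up from

queen_like_moves = 8 * 7        # 8 directions x up to 7 squares
knight_moves = 8                # 8 possible knight jumps
underpromotions = 3 * 3         # knight, bishop or rook x 3 pawn move directions

move_types = queen_like_moves + knight_moves + underpromotions
n_outputs = squares * move_types
print(move_types, n_outputs)    # 73 4672
```

Many of these outputs correspond to illegal moves in any given position, so the probabilities of illegal moves have to be masked out before a move is selected.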

In [None]:
# Lines below will give you solution code
q_1.solution()

### 2) Decide reward

In the tutorial, you learned how to give your agent a reward that encourages it to win games of Connect Four.  Consider now training an agent to win at the game [Minesweeper](https://bit.ly/2T5xEY8).  The goal of the game is to clear the board without detonating any bombs.

To play this game in Google Search, click on the **[Play]** button at [this link](https://www.google.com/search?q=minesweeper).  

<center>
<img src="https://storage.googleapis.com/kaggle-media/learn/images/WzoEfKY.png" width=50%><br/>
</center>

With each move, one of the following is true:
- The agent selected an invalid move (in other words, it tried to uncover a square that was uncovered as part of a previous move).  Let's assume this ends the game, and the agent loses.
- The agent clears a square that did not contain a hidden mine.  The agent wins the game, because all squares without mines are revealed.
- The agent clears a square that did not contain a hidden mine, but has not yet won or lost the game.
- The agent detonates a mine and loses the game.

How might you specify the reward for each of these four cases, so that by maximizing the cumulative reward, the agent will try to win the game?

After you have decided on your answer, run the code cell below to get credit for completing this question.

In [5]:
# Check your answer (Run this code cell to receive credit!)
# My proposed rewards:
#   invalid move                  -> -10
#   clears a square and wins      -> +1000
#   clears a square and continues -> +1/288
#   detonates a mine              -> -1000
q_2.solution()

<span style="color:#33cc99">Solution:</span> Here's a possible solution - after each move, we give the agent a reward that tells it how well it did:
- If the agent wins the game in that move, it gets a reward of `+1`.
- Else if the agent selects an invalid move, it gets a reward of `-10`.
- Else if it detonates a mine, it gets a reward of `-1`.
- Else if the agent clears a square with no hidden mine, it gets a reward of `+1/100`.

To check the validity of your answer, note that the reward for selecting an invalid move and for detonating a mine should both be negative.  The reward for winning the game should be positive.  And, the reward for clearing a square with no hidden mine should be either zero or slightly positive.
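
The solution above can be written down as a small function. This is only a sketch: the function name and the three boolean flags are hypothetical, and a real Minesweeper environment would have to provide this information after each move.

```python
def minesweeper_reward(won, invalid_move, hit_mine):
    """Reward for a single move, following the solution quoted above."""
    if won:             # all squares without mines are revealed
        return 1.0
    if invalid_move:    # tried to uncover an already uncovered square
        return -10.0
    if hit_mine:        # detonated a mine
        return -1.0
    return 1 / 100      # cleared a safe square, game continues


print(minesweeper_reward(won=False, invalid_move=False, hit_mine=False))  # 0.01
```

The order of the checks mirrors the "if / else if" structure of the solution text.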

# Custom environment and neural network

In [1]:
import random
import numpy as np
import pandas as pd
import gym
import matplotlib.pyplot as plt
%matplotlib inline

from kaggle_environments import make, evaluate
from gym import spaces

class ConnectFourGym(gym.Env):
    def __init__(self, agent2="random"):
        ks_env = make("connectx", debug=True)
        self.env = ks_env.train([None, agent2])
        self.rows = ks_env.configuration.rows
        self.columns = ks_env.configuration.columns
        # Learn about spaces here: http://gym.openai.com/docs/#spaces
        self.action_space = spaces.Discrete(self.columns)
        self.observation_space = spaces.Box(low=0, high=2, 
                                            shape=(1,self.rows,self.columns), dtype=int)
        # Tuple corresponding to the min and max possible rewards
        self.reward_range = (-10, 1)
        # StableBaselines throws error if these are not defined
        self.spec = None
        self.metadata = None
    def reset(self):
        self.obs = self.env.reset()
        return np.array(self.obs['board']).reshape(1,self.rows,self.columns)
    def change_reward(self, old_reward, done):
        if old_reward == 1: # The agent won the game
            return 1
        elif done: # The opponent won the game
            return -1
        else: # The game continues: small reward of 1/(rows*columns), i.e. 1/42 on the default board
            return 1/(self.rows*self.columns)
    def step(self, action):
        # Check if agent's move is valid
        is_valid = (self.obs['board'][int(action)] == 0)
        if is_valid: # Play the move
            self.obs, old_reward, done, _ = self.env.step(int(action))
            reward = self.change_reward(old_reward, done)
        else: # End the game and penalize agent
            reward, done, _ = -10, True, {}
        return np.array(self.obs['board']).reshape(1,self.rows,self.columns), reward, done, _
    
# Create ConnectFour environment 
env = ConnectFourGym(agent2="random")

import torch as th
import torch.nn as nn

!pip install "stable-baselines3"
from stable_baselines3 import PPO 
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

# Custom CNN feature extractor used by the PPO policy network
class CustomCNN(BaseFeaturesExtractor):
    
    def __init__(self, observation_space: gym.spaces.Box, features_dim: int=128):
        super(CustomCNN, self).__init__(observation_space, features_dim)
        # CxHxW images (channels first)
        n_input_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=3, stride=1, padding=0),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]

        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.linear(self.cnn(observations))

policy_kwargs = dict(
    features_extractor_class=CustomCNN,
)
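
The feature extractor above determines `n_flatten` with a forward pass. The same number can be derived by hand from the standard output-size formula for a convolution. The sketch below assumes the default 6x7 board and the two 3x3 convolutions without padding used above; `conv_out` is a helper name of my own.

```python
def conv_out(size, kernel_size=3, stride=1, padding=0):
    """Output length of one convolution along a single dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


rows, columns = 6, 7
h, w = rows, columns
for _ in range(2):                 # two Conv2d layers with kernel_size=3, padding=0
    h, w = conv_out(h), conv_out(w)

n_flatten = 64 * h * w             # 64 output channels of the second convolution
print(h, w, n_flatten)             # 2 3 384
```

Each unpadded 3x3 convolution shrinks the board by two cells per dimension, so only a 2x3 feature map is left after the second layer.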



The agent is trained with PPO in the Training section further below. That code is identical to the code from the tutorial.

# Opponents for training and evaluation

## Agent with n-step lookahead and alpha-beta pruning

In [2]:
# Agent with n-step lookahead (minimax) and alpha-beta pruning
def my_nstep_lookahead_ab_pruning_agent(obs, config):
    # Imports are kept inside the function so that the agent is self-contained

    import random
    import numpy as np
    
    # Gets board at next step if agent drops piece in selected column
    def drop_piece(grid, col, mark, config):
        next_grid = grid.copy()
        for row in range(config.rows-1, -1, -1):
            if next_grid[row][col] == 0:
                break
        next_grid[row][col] = mark
        return next_grid

    # Helper function for get_heuristic: checks if window satisfies heuristic conditions
    def check_window(window, num_discs, piece, config):
        return (window.count(piece) == num_discs and window.count(0) == config.inarow-num_discs)
        
    # Helper function for get_heuristic: counts number of windows satisfying specified heuristic conditions
    def count_windows(grid, num_discs, piece, config):
        num_windows = 0
        # horizontal
        for row in range(config.rows):
            for col in range(config.columns-(config.inarow-1)):
                window = list(grid[row, col:col+config.inarow])
                if check_window(window, num_discs, piece, config):
                    num_windows += 1
        # vertical
        for row in range(config.rows-(config.inarow-1)):
            for col in range(config.columns):
                window = list(grid[row:row+config.inarow, col])
                if check_window(window, num_discs, piece, config):
                    num_windows += 1
        # positive diagonal
        for row in range(config.rows-(config.inarow-1)):
            for col in range(config.columns-(config.inarow-1)):
                window = list(grid[range(row, row+config.inarow), range(col, col+config.inarow)])
                if check_window(window, num_discs, piece, config):
                    num_windows += 1
        # negative diagonal
        for row in range(config.inarow-1, config.rows):
            for col in range(config.columns-(config.inarow-1)):
                window = list(grid[range(row, row-config.inarow, -1), range(col, col+config.inarow)])
                if check_window(window, num_discs, piece, config):
                    num_windows += 1
        return num_windows

    # Helper function for get_heuristic: calculates value of heuristic for grid
    def get_heuristic(grid, mark, config):
        num_threes = count_windows(grid, 3, mark, config)
        num_fours = count_windows(grid, 4, mark, config)
        num_threes_opp = count_windows(grid, 3, mark%2+1, config)
        num_fours_opp = count_windows(grid, 4, mark%2+1, config)
        score = num_threes - 1e2*num_threes_opp - 1e4*num_fours_opp + 1e6*num_fours
        return score

    # Uses minimax with alpha-beta pruning to calculate value of dropping piece in selected column
    def score_move(grid, col, mark, config, nsteps, alpha=-float('inf'), beta=float('inf')):
        next_grid = drop_piece(grid, col, mark, config)
        score = minimax(next_grid, nsteps-1, False, mark, config, alpha, beta)
        return score
    
    # Helper function for minimax: checks if agent or opponent has four in a row in the window
    def is_terminal_window(window, config):
        return window.count(1) == config.inarow or window.count(2) == config.inarow
    
    # Helper function for minimax: checks if game has ended
    def is_terminal_node(grid, config):
        # Check for draw 
        if list(grid[0, :]).count(0) == 0:
            return True
        # Check for win: horizontal, vertical, or diagonal
        # horizontal 
        for row in range(config.rows):
            for col in range(config.columns-(config.inarow-1)):
                window = list(grid[row, col:col+config.inarow])
                if is_terminal_window(window, config):
                    return True
        # vertical
        for row in range(config.rows-(config.inarow-1)):
            for col in range(config.columns):
                window = list(grid[row:row+config.inarow, col])
                if is_terminal_window(window, config):
                    return True
        # positive diagonal
        for row in range(config.rows-(config.inarow-1)):
            for col in range(config.columns-(config.inarow-1)):
                window = list(grid[range(row, row+config.inarow), range(col, col+config.inarow)])
                if is_terminal_window(window, config):
                    return True
        # negative diagonal
        for row in range(config.inarow-1, config.rows):
            for col in range(config.columns-(config.inarow-1)):
                window = list(grid[range(row, row-config.inarow, -1), range(col, col+config.inarow)])
                if is_terminal_window(window, config):
                    return True
        return False
    
    # Minimax implementation with alpha-beta pruning
    def minimax(node, depth, maximizingPlayer, mark, config, alpha, beta):
        is_terminal = is_terminal_node(node, config)
        valid_moves = [c for c in range(config.columns) if node[0][c] == 0]
        if depth == 0 or is_terminal:
            return get_heuristic(node, mark, config)
        if maximizingPlayer:
            value = -np.inf
            for col in valid_moves:
                child = drop_piece(node, col, mark, config)
                value = max(value, minimax(child, depth-1, False, mark, config, alpha, beta))
                alpha = max(alpha, value)
                if alpha >= beta:
                    break
            return value
        else:
            value = np.inf
            for col in valid_moves:
                child = drop_piece(node, col, mark%2+1, config)
                value = min(value, minimax(child, depth-1, True, mark, config, alpha, beta))
                beta = min(beta, value)
                if alpha >= beta:
                    break
            return value

    # Agent driver
    # How deep to make the game tree: higher values take longer to run!
    # ConnectX limits the time per move and also the total time across all moves!
    N_STEPS = 5
    # Get list of valid moves
    valid_moves = [c for c in range(config.columns) if obs.board[c] == 0]
    # Convert the board to a 2D grid
    grid = np.asarray(obs.board).reshape(config.rows, config.columns)
    # Use the heuristic to assign a score to each possible board in the next step
    scores = dict(zip(valid_moves, [score_move(grid, col, obs.mark, config, N_STEPS) for col in valid_moves]))
    # Get a list of columns (moves) that maximize the heuristic
    max_cols = [key for key in scores.keys() if scores[key] == max(scores.values())]
    # Select at random from the maximizing columns
    return random.choice(max_cols)
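
To see what the pruning in `minimax` buys, here is a toy sketch of alpha-beta pruning on a hand-made game tree. It is independent of Connect Four: the tree is a nested list whose leaves are heuristic values, and the function `alphabeta` as well as the `visited` list are my own illustration, not part of the agent above.

```python
import math


def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf, visited=None):
    """Minimax with alpha-beta pruning on a nested-list tree; records visited leaves."""
    if visited is None:
        visited = []
    if not isinstance(node, list):      # leaf: return its heuristic value
        visited.append(node)
        return node, visited
    value = -math.inf if maximizing else math.inf
    for child in node:
        child_value, _ = alphabeta(child, not maximizing, alpha, beta, visited)
        if maximizing:
            value = max(value, child_value)
            alpha = max(alpha, value)
        else:
            value = min(value, child_value)
            beta = min(beta, value)
        if alpha >= beta:               # remaining siblings cannot change the result
            break
    return value, visited


# Root is a maximizing node with two minimizing children
value, visited = alphabeta([[3, 5], [2, 9]], maximizing=True)
print(value, visited)                   # 3 [3, 5, 2]
```

The leaf `9` is never evaluated: once the second branch is known to be worth at most 2, it cannot beat the 3 already guaranteed by the first branch.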

# Training

In [2]:
# Initialize agent
model = PPO("CnnPolicy", env, policy_kwargs=policy_kwargs, verbose=0)

# Train agent
model.learn(total_timesteps=50000)


<stable_baselines3.ppo.ppo.PPO at 0x7ab7b10b7430>

# Trained agent

In [3]:
def agent1(obs, config):
    # Use the trained model to select a column
    col, _ = model.predict(np.array(obs['board']).reshape(1, config.rows, config.columns))
    # Check if the selected column is valid (its top cell is still empty)
    is_valid = (obs['board'][int(col)] == 0)
    if is_valid:
        return int(col)
    # If not valid, select a random valid column instead
    return random.choice([c for c in range(config.columns) if obs['board'][c] == 0])
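
The validity check works because the board is stored as a flat list in row-major order, so the first `columns` entries are the top row. The small sketch below illustrates this with a hand-made board; the board contents are made up for the example.

```python
rows, columns = 6, 7
board = [0] * (rows * columns)          # empty board, flat and row-major

# Fill column 3 completely, alternating the two players' marks
for row in range(rows):
    board[row * columns + 3] = 1 + row % 2

# A column is playable as long as its top cell (index = column) is empty
valid_moves = [c for c in range(columns) if board[c] == 0]
print(valid_moves)                      # [0, 1, 2, 4, 5, 6]
```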

# Evaluation

In [4]:
# Import here as well so that this cell also runs after a kernel restart
from kaggle_environments import make

# Create the game environment
env = make("connectx")

# The trained agent plays one game against the lookahead agent
env.run([agent1, my_nstep_lookahead_ab_pruning_agent])

# Show the game
env.render(mode="ipython")

In [8]:
def get_win_percentages(agent1, agent2, n_rounds=100):
    # Use default Connect Four setup
    config = {'rows': 6, 'columns': 7, 'inarow': 4}
    # Agent 1 goes first (roughly) half the time          
    outcomes = evaluate("connectx", [agent1, agent2], config, [], n_rounds//2)
    # Agent 2 goes first (roughly) half the time      
    outcomes += [[b,a] for [a,b] in evaluate("connectx", [agent2, agent1], config, [], n_rounds-n_rounds//2)]
    print("Agent 1 Win Percentage:", np.round(outcomes.count([1,-1])/len(outcomes), 2))
    print("Agent 2 Win Percentage:", np.round(outcomes.count([-1,1])/len(outcomes), 2))
    print("Number of Invalid Plays by Agent 1:", outcomes.count([None, 0]))
    print("Number of Invalid Plays by Agent 2:", outcomes.count([0, None]))
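
The list comprehension `[[b,a] for [a,b] in ...]` in the function above is easy to overlook. In the second half of the games agent 2 moves first, so its reward comes first in each outcome and has to be swapped back before counting. A minimal sketch with made-up outcomes:

```python
# Outcomes are [reward of first mover, reward of second mover]
first_half = [[1, -1], [-1, 1]]         # agent 1 moved first
second_half = [[1, -1], [1, -1]]        # agent 2 moved first and won both games

# Swap the second half so that every entry reads [agent 1 reward, agent 2 reward]
outcomes = first_half + [[b, a] for [a, b] in second_half]

agent1_rate = outcomes.count([1, -1]) / len(outcomes)
agent2_rate = outcomes.count([-1, 1]) / len(outcomes)
print(agent1_rate, agent2_rate)         # 0.25 0.75
```

Without the swap, the two wins of agent 2 in the second half would be counted as wins for agent 1.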

In [11]:
get_win_percentages(agent1=agent1, agent2="negamax")

Agent 1 Win Percentage: 0.06
Agent 2 Win Percentage: 0.93
Number of Invalid Plays by Agent 1: 0
Number of Invalid Plays by Agent 2: 0


In [None]:
import numpy as np
from multiprocessing import Pool

def get_win_percentages_mc(agent1, agent2, n_rounds=100, num_cores=6):
    """
    Calculates win percentages for two agents in a Connect Four game.

    Args:
        agent1: The first agent.
        agent2: The second agent.
        n_rounds: The total number of games to play.
        num_cores: The number of CPU cores to use for parallel processing.

    Returns:
        None
    """

    config = {'rows': 6, 'columns': 7, 'inarow': 4}

    # Split the games into 2 * num_cores tasks so that the total is exactly n_rounds
    num_tasks = 2 * num_cores
    games_per_task = n_rounds // num_tasks
    remaining_games = n_rounds % num_tasks

    # Create a list of arguments for each task
    args_list = []
    for i in range(num_tasks):
        # The first `remaining_games` tasks play one extra game
        rounds = games_per_task + (1 if i < remaining_games else 0)
        # Alternate the agent order: even tasks start with agent1, odd tasks with agent2
        agents = [agent1, agent2] if i % 2 == 0 else [agent2, agent1]
        args_list.append(("connectx", agents, config, [], rounds))

    # Create a pool of worker processes
    with Pool(processes=num_cores) as pool:
        # Run the evaluation function in parallel
        results = pool.starmap(evaluate, args_list)

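    # Combine the results from all tasks
    outcomes = []
    for i, task_results in enumerate(results):
        # Odd-indexed tasks were played with agent2 moving first,
        # so swap each outcome back to [agent1 reward, agent2 reward]
        if i % 2 == 1:
            task_results = [[b, a] for [a, b] in task_results]
        outcomes.extend(task_results)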

    # Calculate and print win percentages and invalid play counts
    print("Agent 1 Win Percentage:", np.round(outcomes.count([1, -1]) / len(outcomes), 2))
    print("Agent 2 Win Percentage:", np.round(outcomes.count([-1, 1]) / len(outcomes), 2))
    print("Number of Invalid Plays by Agent 1:", outcomes.count([None, 0]))
    print("Number of Invalid Plays by Agent 2:", outcomes.count([0, None]))

In [None]:
import numpy as np
get_win_percentages_mc(agent1=agent1, agent2=my_nstep_lookahead_ab_pruning_agent)

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intro-to-game-ai-and-reinforcement-learning/discussion) to chat with other learners.*