# ConnectX: Deep Q Learning + Theory
In this notebook I'll create a deep Q-learning agent trained with experience replay and explain the math behind the algorithm. The notebook draws on several other notebooks and resources, linked below. Do refer to them for a more in-depth treatment of RL:

- [Intro to Game AI and Reinforcement Learning](https://www.kaggle.com/alexisbcook/deep-reinforcement-learning)
- [Reinforcement course by David Silver](https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZBiG_XpjnPrSNw-1XQaM_gB)
- [DQN tutorial from PyTorch docs](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html)
- [Introduction to RL by Sutton and Barto](http://incompleteideas.net/book/RLbook2020.pdf)

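Q-learning is an off-policy temporal-difference (TD) learning method. The deep variant used here (DQN) differs from plain TD learning in two ways: it trains on transitions sampled from a replay memory, and it estimates the target from a target network that is kept fixed between periodic updates. Both help stabilize the neural network.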

In [None]:
import random
from collections import namedtuple

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as f
import torch.optim as optim
import tqdm
from gym import spaces
from kaggle_environments import evaluate, make

# Why replay?
In sequential training methods like SARSA or SARSA($\lambda$), we train the function approximator by bootstrapping from the next state, and the next state is highly correlated with the current one. Training a neural network on such correlated samples can make it unstable or even cause it to diverge, which hurts performance. Experience replay aims to break this correlation. Instead of training on consecutive steps, an experience replay randomly selects stored steps from different points in time and optimizes the network on those. This stabilizes training.

The replay will store the state, action, reward and the next_state ($s_t, a_t, r_t, s_{t+1}$).

The memory has a fixed capacity: as new experiences come in, the oldest experiences are removed from the memory.

In [None]:
transition = namedtuple('transitions', ('state', 'action', 'reward', 'next_state'))

class ReplayMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        
    def push(self, state, action, reward, next_state):
        # Once the memory is full, drop the oldest transition (index 0)
        if len(self.memory) >= self.capacity:
            self.memory.pop(0)
        self.memory.append(transition(state, action, reward, next_state))
        
    def sample_batch(self, batch_size):
        if len(self) > batch_size:
            return random.sample(self.memory, batch_size)
        return []
    
    def __len__(self):
        return len(self.memory)
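
The fixed-capacity behaviour described above can also be sketched with `collections.deque`, which evicts the oldest item automatically once `maxlen` is reached. This is only an alternative sketch, not the class used in the rest of the notebook; the names `Transition` and `DequeReplayMemory` are made up for the example.

```python
import random
from collections import deque, namedtuple

# Hypothetical names for this sketch; the fields mirror the notebook's transition tuple.
Transition = namedtuple("Transition", ("state", "action", "reward", "next_state"))


class DequeReplayMemory:
    """Fixed-capacity buffer that evicts the oldest transition first."""

    def __init__(self, capacity):
        # A deque with maxlen drops items from the left once it is full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append(Transition(state, action, reward, next_state))

    def sample_batch(self, batch_size):
        # Uniform sampling without replacement mixes steps from different episodes.
        if len(self.memory) < batch_size:
            return []
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Push five transitions into a buffer that only holds three.
buffer = DequeReplayMemory(capacity=3)
for step in range(5):
    buffer.push(step, 0, 0.0, step + 1)

stored_states = [t.state for t in buffer.memory]  # the two oldest states were evicted
```

Integer states are used here only to make the eviction order easy to see; in the notebook the states are board tensors.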

# Environment

We'll make a custom environment for the game and modify the reward system. This code is taken from [here](https://www.kaggle.com/alexisbcook/deep-reinforcement-learning) and modified.

## States
States are converted into PyTorch tensors (for convenience) with the shape of the ConnectX board (6x7), plus a leading channel dimension. A 0 marks an empty cell, 1 a piece of player one and 2 a piece of player two.
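
To make the board layout concrete, here is a small sketch in plain Python. It assumes the observation's `board` is a flat list of 42 cells in row-major order with the top row first, which is how the environment code below indexes it. The helper names `to_grid` and `valid_columns` are made up for this example; the notebook itself reshapes with `torch` instead.

```python
def to_grid(board, rows=6, columns=7):
    """Split the flat, row-major board list into a list of rows (top row first)."""
    return [board[r * columns:(r + 1) * columns] for r in range(rows)]


def valid_columns(board, columns=7):
    """A column can still be played while its top cell (in the first row) is empty."""
    return [c for c in range(columns) if board[c] == 0]


# Example position: column 0 is completely filled with alternating pieces,
# with player one's piece at the bottom.
board = [0] * 42
for r in range(6):
    board[r * 7] = 1 if r % 2 else 2

grid = to_grid(board)
playable = valid_columns(board)
```

Here `grid[0]` is the top row and `grid[5]` the bottom row, and column 0 no longer appears in `playable` because its top cell is occupied.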

## Rewards
The reward system yields:
- 1 if the game is won
- -1 if the game is lost
- 1/42 if the game is continuing
- -10 if the agent makes an invalid move (trying to play in an already full column), which also ends the game

In [None]:
class ConnectFourGym:
    def __init__(self, agent2="random"):
        ks_env = make("connectx", debug=True)
        self.env = ks_env.train([None, agent2])
        self.rows = ks_env.configuration.rows
        self.columns = ks_env.configuration.columns
        self.action_space = spaces.Discrete(self.columns)
        self.observation_space = spaces.Box(
            low=0, high=2, 
            shape=(self.rows, self.columns, 1), 
            dtype=int  # np.int was removed in recent NumPy releases
        )
        # Tuple corresponding to the min and max possible rewards
        self.reward_range = (-10, 1)
        # StableBaselines throws error if these are not defined
        self.spec = None
        self.metadata = None
        
    def random_action(self):
        return torch.tensor([self.action_space.sample()])

    def reset(self):
        self.obs = self.env.reset()
        return torch.tensor(self.obs['board'], 
                            dtype=torch.float).reshape(1, self.rows, self.columns)

    def change_reward(self, old_reward, done):
        if old_reward == 1: # The agent won the game
            return torch.tensor([1.])
        elif done: # The game ended without a win for the agent
            return torch.tensor([-1.])
        else: # Reward 1/42
            return torch.tensor([1/(self.rows*self.columns)])

    def step(self, action):
        # Check if agent's move is valid
        action = action[0].item()
        is_valid = (self.obs['board'][int(action)] == 0)
        if is_valid: # Play the move
            self.obs, old_reward, done, _ = self.env.step(int(action))
            reward = self.change_reward(old_reward, done)
        else: # End the game and penalize agent
            reward, done, _ = torch.tensor([-10.]), True, {}
        return torch.tensor(self.obs['board'], dtype=torch.float).reshape(1, self.rows, self.columns), reward, done, _

# Function Approximator
The policy network will be a simple network with two convolutional layers followed by a fully connected layer and an output layer. ReLU is used as the activation function. The approximator outputs the state action value of each action (the column to be played).

In [None]:
class DQNCNNPolicy(nn.Module):
    def __init__(self, op):
        super(DQNCNNPolicy, self).__init__()
        self.inp = nn.Conv2d(1, 16, (2, 2))
        self.conv_1 = nn.Conv2d(16, 32, (2, 2))
        self.linear_1 = nn.Linear(32 * 4 * 5, 64)
        self.linear_2 = nn.Linear(64, op)
        
    def forward(self, x):
        x = f.relu(self.inp(x))
        x = f.relu(self.conv_1(x))
        x = x.flatten(1)
        x = f.relu(self.linear_1(x))
        # No activation on the output: Q-values must be able to go negative
        # (the rewards include -1 and -10)
        return self.linear_2(x)
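
The `32 * 4 * 5` input size of the first linear layer follows from the two 2x2 convolutions, each of which (with the default stride 1 and no padding) shrinks both board dimensions by one. A quick check of that arithmetic, using a helper name (`conv_output_size`) invented for this sketch:

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


rows, columns = 6, 7
for _ in range(2):  # two 2x2 convolutions, stride 1, no padding
    rows = conv_output_size(rows, kernel=2)
    columns = conv_output_size(columns, kernel=2)

# The second convolution has 32 output channels.
flattened_features = 32 * rows * columns
```

The board goes from 6x7 to 5x6 to 4x5, so the flattened feature vector has 640 entries.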

# Action selection
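Actions are chosen with $\epsilon$-greedy selection: draw a uniform random number in [0, 1); if it is greater than $\epsilon$, take the greedy action (the one with the highest predicted state-action value), otherwise take a random action.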

In [None]:
def get_greedy_action(policy, state):
    assert state.shape == (1, 6, 7)
    state = state.unsqueeze(0)
    return policy(state)

def get_epsilon_greedy_action(policy, state, env, eps):
    prob = np.random.uniform()
    
    if prob > eps:
        with torch.no_grad():
            return get_greedy_action(policy, state).argmax().unsqueeze(0)
    return env.random_action()
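
The same rule can be sketched without a network, on a plain list of Q-values. The function name `epsilon_greedy` and the example values are made up for illustration; the seed is fixed only to make the sketch reproducible.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def epsilon_greedy(q_values, eps):
    """Pick the argmax with probability 1 - eps, otherwise a uniformly random index."""
    if random.random() > eps:
        return max(range(len(q_values)), key=lambda a: q_values[a])
    return random.randrange(len(q_values))


# One Q-value per column (illustrative numbers).
q_values = [0.1, -0.3, 0.8, 0.2, 0.0, 0.5, -1.0]

greedy_choice = epsilon_greedy(q_values, eps=0.0)                         # exploit only
random_choices = [epsilon_greedy(q_values, eps=1.0) for _ in range(100)]  # explore only
```

With `eps=0.0` the highest-valued column (index 2) is chosen, and with `eps=1.0` every choice is a random column. Note that neither this sketch nor the notebook's version masks out full columns, so a random or greedy pick can still be an invalid move.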

In [None]:
def get_win_percentages(agent1, agent2, n_rounds=100):
    # Use default Connect Four setup
    config = {'rows': 6, 'columns': 7, 'inarow': 4}
    # Agent 1 goes first (roughly) half the time          
    outcomes = evaluate("connectx", [agent1, agent2], config, [], n_rounds//2)
    # Agent 2 goes first (roughly) half the time      
    outcomes += [[b,a] for [a,b] in evaluate("connectx", [agent2, agent1], config, [], n_rounds-n_rounds//2)]
    print("Agent 1 Win Percentage:", np.round(outcomes.count([1,-1])/len(outcomes), 2))
    print("Agent 2 Win Percentage:", np.round(outcomes.count([-1,1])/len(outcomes), 2))
    print("Number of Invalid Plays by Agent 1:", outcomes.count([None, 0]))
    print("Number of Invalid Plays by Agent 2:", outcomes.count([0, None]))

# Learning
In traditional TD methods, the state-action value of the current state is estimated from the value of the next state-action pair. Following the Bellman equation, the target is $r + \gamma Q(s', a')$, or $r + \gamma Q(s', a', w)$ once function approximation with weights $w$ is taken into account. Q-learning is a TD(0) method in which the target uses the maximum state-action value of the next state rather than the value of the action actually taken, which is what makes it off-policy. Here that maximum is computed by a separate target network with weights $w'$, an older copy of the policy network that is kept fixed and only updated at specific intervals. This is again done to stabilize the network. The network is trained with MSE loss on the temporal-difference error $\delta = r + \gamma \max_{a'} Q(s', a', w') - Q(s, a, w)$. For a transition that ends the game there is no next state, so the target reduces to $r$.
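
As a worked example of the target, here is a minimal sketch with made-up numbers. The function name `td_target`, the next-state Q-values and the current estimate of 0.3 are all assumptions for illustration.

```python
def td_target(reward, next_q_values, gamma=0.9, done=False):
    """Q-learning target: r for a terminal step, else r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


step_reward = 1 / 42                            # reward for a continuing game
next_q = [0.1, 0.5, -0.2, 0.0, 0.3, 0.4, 0.2]   # target network output, one per column

continuing_target = td_target(step_reward, next_q)
terminal_target = td_target(1.0, next_q, done=True)

current_q = 0.3                                 # assumed Q(s, a, w) for the action taken
td_error = continuing_target - current_q
```

For the continuing step the target is $1/42 + 0.9 \cdot 0.5 \approx 0.474$, so the TD error against a current estimate of 0.3 is about 0.174. For the winning step the target is simply the reward of 1.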

In [None]:
BATCH_SIZE = 32
GAMMA = 0.9

def optimize_model(policy, target, criterion, optimizer, memory):
    sample = memory.sample_batch(BATCH_SIZE)

    # True for transitions whose next state is not terminal
    non_final_mask = torch.tensor([item.next_state is not None for item in sample],
                                  dtype=torch.bool)

    states = torch.stack([item.state for item in sample])
    actions = torch.stack([item.action for item in sample])
    rewards = torch.stack([item.reward for item in sample])

    # Q(s, a, w) of the actions that were actually played
    state_action = policy(states).gather(1, actions)

    # max_a' Q(s', a', w') from the target network; stays 0 for terminal states
    next_state_action = torch.zeros_like(rewards)
    if non_final_mask.any():
        non_final_states = torch.stack([item.next_state for item in sample
                                        if item.next_state is not None])
        with torch.no_grad():
            next_state_action[non_final_mask] = target(non_final_states).max(1, keepdim=True)[0]

    # Use the sampled batch of rewards, not a single reward from the training loop
    expected = rewards + GAMMA * next_state_action
    optimizer.zero_grad()
    loss = criterion(state_action, expected)
    loss.backward()
    optimizer.step()
    return loss.item()

In [None]:
ALPHA = 0.01
n_episodes = 10000
EPSILON = 0.2
POLICY_UPDATE = 15

policy = DQNCNNPolicy(7)
target_policy = DQNCNNPolicy(7)
target_policy.load_state_dict(policy.state_dict())
target_policy.eval()

env = ConnectFourGym('random')
replay_memory = ReplayMemory(15000)
loss_criterion = nn.MSELoss()
optimizer_func = optim.SGD(policy.parameters(), lr=ALPHA)


with tqdm.tqdm(range(n_episodes), unit='episode') as tq_ep:
    for ep in tq_ep:
        tq_ep.set_description(f"Episode: {ep + 1}")
        done = False
        state = env.reset()
        ep_loss, ep_reward = 0, 0
        while not done:
            action = get_epsilon_greedy_action(policy, state, env, EPSILON)
            next_state, reward, done, _ = env.step(action)
            ep_reward += reward[0].item()
            if done:
                # Terminal transitions are stored without a next state
                next_state = None
            replay_memory.push(state, action, reward, next_state)

            if len(replay_memory) > BATCH_SIZE:
                ep_loss += optimize_model(policy, target_policy, loss_criterion, 
                                          optimizer_func, replay_memory)
            state = next_state
        tq_ep.set_postfix(ep_reward=ep_reward)
        
        if ep % POLICY_UPDATE == 0:
            target_policy.load_state_dict(policy.state_dict())

In [None]:
def agent1(obs, config):
    with torch.no_grad():
        state = torch.tensor(obs['board'], dtype=torch.float).reshape(1, 1, 6, 7)
        col = policy(state).argmax().item()
    is_valid = (obs['board'][int(col)] == 0)
    if is_valid:
        return int(col)
    else:
        return random.choice(
            [col for col in range(config.columns) if obs.board[int(col)] == 0]
        )

# Testing
Let's try running the trained agent against the random and negamax agents. But first, let's run random vs random to set a baseline.

In [None]:
get_win_percentages('random', 'random')
get_win_percentages(agent1, 'random')
get_win_percentages(agent1, 'negamax')

- Random vs random has an almost 1:1 win/loss ratio.
- Trained agent vs random: the trained agent wins more often, so it fares better than the random agent.
- Trained agent vs negamax goes very badly, with the negamax agent beating the trained agent in 99% of the games.

# Possible improvements
- Use better exploration strategies.
- Come up with a better reward system, if one exists.
- Tinker with the hyperparameters for better performance.
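
As one possible exploration tweak, $\epsilon$ could be decayed over training instead of staying fixed at 0.2, so the agent explores a lot early and exploits more later. The sketch below is not part of the trained agent above; the constants `EPS_START`, `EPS_END` and `EPS_DECAY` and the function name are hypothetical, and the values are untuned.

```python
import math

# Hypothetical, untuned values chosen only for illustration.
EPS_START = 0.9    # exploration rate at the first episode
EPS_END = 0.05     # floor the rate decays towards
EPS_DECAY = 2000   # larger values decay more slowly


def decayed_epsilon(episode):
    """Exponentially decay epsilon from EPS_START towards EPS_END."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-episode / EPS_DECAY)


schedule = [decayed_epsilon(ep) for ep in (0, 1000, 5000, 10000)]
```

In the training loop, the fixed `EPSILON` passed to `get_epsilon_greedy_action` would then be replaced by `decayed_epsilon(ep)`. Whether this actually improves the results here has not been tested.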