# **Deep Q-Network (DQN) Reinforcement Learning Agent for Blackjack**
In this notebook, we evaluate the final model trained with the ImprovedDQNAgent class from our training notebook. The agent was trained for 500,000 episodes using off-policy learning with a replay memory.

## Imports and Installs

In [None]:
%%capture
# capture line is to hide the output
# Install required packages
!pip install gymnasium torch numpy matplotlib

In [None]:
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
from collections import deque
import random

## **DQN Agent with Replay Memory Implementation**
The ImprovedDQNAgent and ReplayMemory classes are defined here so that the agent can be instantiated and the trained model weights loaded into it. As a further improvement, both classes could be moved into a .py module and imported by the training and evaluation notebooks.

In [None]:
import random
from collections import namedtuple, deque

# Define a named tuple to store experiences
Experience = namedtuple('Experience', ('state', 'action', 'reward', 'next_state', 'done'))

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)
        self.capacity = capacity

    def push(self, state, action, reward, next_state, done):
        """Save an experience to memory"""
        experience = Experience(state, action, reward, next_state, done)
        self.memory.append(experience)

    def sample(self, batch_size):
        """Randomly sample a batch of experiences from memory"""
        if batch_size > len(self.memory):
            batch_size = len(self.memory)
        experiences = random.sample(self.memory, batch_size)

        # Convert to separate arrays
        states = torch.FloatTensor([exp.state for exp in experiences])
        actions = torch.LongTensor([exp.action for exp in experiences])
        rewards = torch.FloatTensor([exp.reward for exp in experiences])
        next_states = torch.FloatTensor([exp.next_state for exp in experiences])
        dones = torch.FloatTensor([exp.done for exp in experiences])

        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)

class ImprovedDQNAgent:
    def __init__(self, input_dim=3, learning_rate=5e-4, gamma=0.99, epsilon=1.0):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.fitness = [] # used to store a series of 'avg reward' during evaluation

        # Larger network
        self.policy_net = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 2)
        ).to(self.device)

        self.target_net = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 2)
        ).to(self.device)

        self.target_net.load_state_dict(self.policy_net.state_dict())

        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=learning_rate)
        self.memory = ReplayMemory(capacity=20000)

        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_min = 0.05
        self.epsilon_decay = 0.99997
        self.batch_size = 128
        self.target_update = 10
        self.episode_count = 0

    def select_action(self, state):
        """Select action using epsilon-greedy policy"""
        if random.random() < self.epsilon:
            return random.randint(0, 1)

        with torch.no_grad():
            state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
            q_values = self.policy_net(state)
            return q_values.argmax().item()

    def update_epsilon(self):
        """Decay epsilon value"""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

    def store_transition(self, state, action, reward, next_state, done):
        """Store transition in replay memory"""
        self.memory.push(state, action, reward, next_state, done)

    def train_step(self):
        """Perform one training step"""
        if len(self.memory) < self.batch_size:
            return

        states, actions, rewards, next_states, dones = self.memory.sample(self.batch_size)
        states = states.to(self.device)
        actions = actions.to(self.device)
        rewards = rewards.to(self.device)
        next_states = next_states.to(self.device)
        dones = dones.to(self.device)

        # Double DQN implementation
        with torch.no_grad():
            next_actions = self.policy_net(next_states).argmax(1)
            next_q_values = self.target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
            target_q_values = rewards + (1 - dones.float()) * self.gamma * next_q_values

        current_q_values = self.policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

        # Huber loss for better stability
        loss = nn.SmoothL1Loss()(current_q_values, target_q_values)

        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy_net.parameters(), 1.0)
        self.optimizer.step()

        return loss.item()

    def update_target_network(self):
        """Update target network parameters"""
        if self.episode_count % self.target_update == 0:
            self.target_net.load_state_dict(self.policy_net.state_dict())
        self.episode_count += 1
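
The exploration schedule in the constructor (epsilon starting at 1.0, multiplied by 0.99997 per update, floored at 0.05) can be checked with a little arithmetic. The sketch below is a stand-alone illustration, not part of the agent; the helper names steps_to_floor and epsilon_after are hypothetical, and how often update_epsilon is called during training is not shown in this notebook.

```python
import math

def steps_to_floor(epsilon=1.0, epsilon_min=0.05, decay=0.99997):
    """Number of multiplicative decay updates until epsilon reaches its floor."""
    return math.ceil(math.log(epsilon_min / epsilon) / math.log(decay))

def epsilon_after(n_updates, epsilon=1.0, epsilon_min=0.05, decay=0.99997):
    """Epsilon after n_updates calls of a max(floor, epsilon * decay) rule."""
    return max(epsilon_min, epsilon * decay ** n_updates)

print(steps_to_floor())         # roughly 100,000 updates
print(epsilon_after(50_000))    # still well above the floor
print(epsilon_after(500_000))   # clamped at epsilon_min
```

If update_epsilon were called once per episode (an assumption), exploration would reach its floor after about a fifth of a 500,000-episode run.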

The policy network's state dictionary is the key artifact of training: it holds the weights used to choose actions at inference time. The target network is a periodically synchronized copy of the policy network that only serves to compute stable bootstrap targets during training, so it is not needed for evaluation.
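
To make the roles of the two networks concrete, here is a minimal scalar sketch of the Double DQN target and the Huber loss that train_step computes on batches. It uses plain Python lists in place of network outputs; the function names and the example Q-values are illustrative assumptions, not values from the trained model.

```python
def double_dqn_target(reward, done, gamma, q_policy_next, q_target_next):
    """Double DQN target for a single transition.

    q_policy_next and q_target_next are the next-state Q-values
    from the policy network and the target network respectively.
    """
    # The policy network chooses the next action...
    best_action = max(range(len(q_policy_next)), key=lambda a: q_policy_next[a])
    # ...and the target network supplies the value of that action.
    bootstrap = q_target_next[best_action]
    return reward + (1.0 - float(done)) * gamma * bootstrap

def huber(prediction, target, beta=1.0):
    """Smooth L1 (Huber) loss for one value, with beta=1.0 as in nn.SmoothL1Loss's default."""
    diff = abs(prediction - target)
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta

# Non-terminal step: the policy net prefers action 1, the target net values it at 0.1
target = double_dqn_target(0.0, False, 0.99, [0.2, 0.5], [0.3, 0.1])
# Terminal step: the bootstrap term is masked out and only the reward remains
terminal_target = double_dqn_target(-1.0, True, 0.99, [0.2, 0.5], [0.3, 0.1])
print(target, terminal_target, huber(0.4, target))
```

Selecting the action with one network and evaluating it with the other is what reduces the overestimation that a single max over noisy Q-values would introduce.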

In [None]:
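# Initialize the improved agent, then load the state dict from the .pth file
DQN_Agent = ImprovedDQNAgent()
# map_location lets a checkpoint saved on GPU load on a CPU-only machine
checkpoint = torch.load('/content/Blackjack_DQN_500000_episodes.pth', map_location=DQN_Agent.device)
DQN_Agent.policy_net.load_state_dict(checkpoint['policy_net_state_dict'])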


<All keys matched successfully>

## Evaluation

### Evaluation Metrics of the Trained Model on a New Environment
We create a new test environment with Gymnasium's rgb_array render mode so that the agent's gameplay can be rendered as frames and recorded.

In [None]:
test_env = gym.make("Blackjack-v1", render_mode='rgb_array')

Our evaluation function drops the training-specific components (exploration, replay memory, and gradient updates) and only records outcomes: the reward of each episode and the resulting numbers of wins, draws, and losses.

In [None]:
# Evaluate the trained agent
def evaluate_agent_detailed(test_env, trained_agent, n_episodes=200):
    wins = 0
    draws = 0
    losses = 0
    total_rewards = []
    player_sums = []

    for episode in range(n_episodes):
        state, _ = test_env.reset()
        done = False
        episode_reward = 0

        while not done:
            # Use greedy policy (no exploration)
            with torch.no_grad():
                # Keep the input on the same device as the network
                state_tensor = torch.FloatTensor(state).unsqueeze(0).to(trained_agent.device)
                action = trained_agent.policy_net(state_tensor).argmax().item()

            state, reward, terminated, truncated, _ = test_env.step(action)
            done = terminated or truncated
            episode_reward += reward

        total_rewards.append(episode_reward)
        if reward > 0:
            wins += 1
        elif reward == 0:
            draws += 1
        else:
            losses += 1

        player_sums.append(state[0])  # Final player sum

    print("\nDetailed Evaluation Results:")
    print(f"Number of Episodes: {n_episodes}")
    print(f"Win Rate: {wins/n_episodes*100:.1f}%  ({wins}/{n_episodes})")
    print(f"Draw Rate: {draws/n_episodes*100:.1f}%  ({draws}/{n_episodes})")
    print(f"Loss Rate: {losses/n_episodes*100:.1f}%  ({losses}/{n_episodes})")
    print(f"Average Reward: {np.mean(total_rewards):.3f}")
    print(f"Average Final Player Sum: {np.mean(player_sums):.1f}")

    return total_rewards, player_sums

# Evaluate the trained agent
print("\nEvaluating trained agent...")
eval_rewards, player_sums = evaluate_agent_detailed(test_env, DQN_Agent, n_episodes=200)


Evaluating trained agent...

Detailed Evaluation Results:
Number of Episodes: 200
Win Rate: 38.0%  (76/200)
Draw Rate: 8.5%  (17/200)
Loss Rate: 53.5%  (107/200)
Average Reward: -0.155
Average Final Player Sum: 19.7
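
With only 200 episodes, the reported win rate carries noticeable sampling noise. As a rough guide, the sketch below computes a normal-approximation 95% interval for the 76/200 result above; the helper name win_rate_interval is hypothetical and the approximation is an assumption that is reasonable at this sample size.

```python
import math

def win_rate_interval(wins, n_episodes, z=1.96):
    """Normal-approximation confidence interval for a win proportion."""
    p = wins / n_episodes
    standard_error = math.sqrt(p * (1 - p) / n_episodes)
    return p - z * standard_error, p + z * standard_error

low, high = win_rate_interval(76, 200)
print(f"95% interval: {low:.3f} to {high:.3f}")  # 95% interval: 0.313 to 0.447
```

An interval this wide means that differences of a few percentage points between two 200-episode runs should not be read as a change in the agent's quality; increasing n_episodes narrows it.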


### Visualizing the Agent's Learning with Gymnasium's RGB Environment
This visualization is less useful for evaluation and is mostly useful for interpreting the gameplay. The evaluation output is more informative, since it reports how the agent fares statistically in the evaluation environment. We found that our agent usually wins roughly 40-50% of hands, although the two 200-episode runs shown in this notebook came in slightly lower at 38.0% and 39.0%. A negative average reward is expected: blackjack is largely a game of chance, and even the optimal basic strategy leaves the player at a disadvantage of roughly 0.5% against the dealer under typical casino rules.
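
For context on what a negative average reward looks like without any learning, here is a small stand-alone simulation of a fixed policy that stands on 17 or more. It is a simplified re-implementation written for illustration, not Gymnasium's code: it assumes cards are drawn with replacement, no bonus for naturals, and a dealer who hits below 17. All names in it are hypothetical.

```python
import random

# Card values drawn with replacement: ace = 1, face cards = 10
DECK = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]

def hand_total(cards):
    """Best total for a hand, counting one ace as 11 when that does not bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10
    return total

def play_hand(rng, stand_at=17):
    """Play one hand with a fixed threshold policy; return +1, 0, or -1."""
    player = [rng.choice(DECK), rng.choice(DECK)]
    dealer = [rng.choice(DECK), rng.choice(DECK)]
    while hand_total(player) < stand_at:
        player.append(rng.choice(DECK))
    if hand_total(player) > 21:
        return -1  # player busts before the dealer plays
    while hand_total(dealer) < 17:
        dealer.append(rng.choice(DECK))
    if hand_total(dealer) > 21:
        return 1
    p, d = hand_total(player), hand_total(dealer)
    return (p > d) - (p < d)

def simulate(n_hands=20000, seed=0):
    """Return (win rate, average reward) over n_hands simulated hands."""
    rng = random.Random(seed)
    rewards = [play_hand(rng) for _ in range(n_hands)]
    win_rate = sum(r == 1 for r in rewards) / n_hands
    return win_rate, sum(rewards) / n_hands

win_rate, avg_reward = simulate()
print(f"Win rate: {win_rate:.3f}, average reward: {avg_reward:.3f}")
```

Comparing the agent's average reward against a baseline like this, over the same number of hands, gives a more meaningful reference point than the win rate alone.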

In [None]:
# Uses Gymnasium's RecordVideo wrapper to evaluate the agent and record video
# only one video will be saved:
# the episode trigger records the first episode (index 0) only
test_env = gym.wrappers.RecordVideo(
    test_env, "./gym_monitor_output", episode_trigger=lambda x: x == 0)

evaluate_agent_detailed(test_env, DQN_Agent)

test_env.close()


Detailed Evaluation Results:
Number of Episodes: 200
Win Rate: 39.0%  (78/200)
Draw Rate: 11.0%  (22/200)
Loss Rate: 50.0%  (100/200)
Average Reward: -0.110
Average Final Player Sum: 19.6


In [None]:
# play a video using a path to the video
from IPython.display import Video, display

def show_video(video_path):
    video_file = Video(video_path, embed=True)
    display(video_file)


In [None]:
# visualizing the rl-video-episode-0.mp4 in the gym_monitor_output
show_video("./gym_monitor_output/rl-video-episode-0.mp4")

## Strategy Analysis and Interactive Human Feedback Loop with AI Recommendation
Finally, we implemented a strategy analysis to understand the kinds of decisions the agent learned, by querying the policy network's greedy action for a few common states. We also implemented a human feedback loop that shows the agent's action recommendation before each human decision. After each game, the human's actions and the game's outcome are pushed into the replay memory, and the model is retrained on sampled batches once the memory holds at least one full batch.

### Strategy Analysis

In [None]:
def analyze_strategy(agent):
    # Common situations in blackjack
    test_states = [
        (16, 10, 0),  # Hard 16 vs dealer 10
        (12, 6, 0),   # Hard 12 vs dealer 6
        (18, 9, 0),   # Hard 18 vs dealer 9
        (11, 10, 0),  # Hard 11 vs dealer 10
        (15, 7, 0),   # Hard 15 vs dealer 7
    ]

    print("\nStrategy Analysis:")
    print("Player Sum | Dealer Card | Action")
    print("-" * 35)

    for player_sum, dealer_card, usable_ace in test_states:
        state = np.array([player_sum, dealer_card, usable_ace])
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(agent.device)
            action = agent.policy_net(state_tensor).argmax().item()
            action_name = "Hit" if action == 1 else "Stand"
            print(f"{player_sum:^10} | {dealer_card:^11} | {action_name:^6}")

# Run strategy analysis
analyze_strategy(DQN_Agent)


Strategy Analysis:
Player Sum | Dealer Card | Action
-----------------------------------
    16     |     10      |  Hit  
    12     |      6      | Stand 
    18     |      9      | Stand 
    11     |     10      |  Hit  
    15     |      7      |  Hit  
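
The five spot checks above can be extended to the whole grid of player sums and dealer cards. The sketch below shows the sweep with a stand-in Q-function so that it runs without the trained weights; toy_q_values is a hypothetical placeholder rule, not the agent's learned values, and action 0 is stand while action 1 is hit, as in the analysis above.

```python
def toy_q_values(player_sum, dealer_card, usable_ace):
    """Hypothetical stand-in for the policy network: returns [Q(stand), Q(hit)]."""
    # Placeholder rule, not learned: prefer hitting below 17
    return [0.0, 1.0] if player_sum < 17 else [1.0, 0.0]

def strategy_table(q_fn, usable_ace=0):
    """Map every (player_sum, dealer_card) pair to 'H' or 'S' by greedy action."""
    table = {}
    for player_sum in range(12, 22):
        for dealer_card in range(1, 11):
            q_stand, q_hit = q_fn(player_sum, dealer_card, usable_ace)
            # Ties go to stand, matching argmax picking the first index
            table[(player_sum, dealer_card)] = "H" if q_hit > q_stand else "S"
    return table

def print_table(table):
    """Print one row per player sum and one column per dealer card."""
    print("sum | " + " ".join(f"{d:>2}" for d in range(1, 11)))
    for player_sum in range(12, 22):
        row = " ".join(f"{table[(player_sum, d)]:>2}" for d in range(1, 11))
        print(f"{player_sum:>3} | {row}")

print_table(strategy_table(toy_q_values))
```

To inspect the real agent, q_fn would be replaced by a small wrapper that feeds the three state values through the policy network and returns its two outputs.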


### Interactive Human Feedback Loop
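
Because the interactive game deals from its own deck of ranks 1 to 13, each hand has to be translated into the observation format the network was trained on. In Blackjack-v1 that is the player's sum, the dealer's showing card with the ace as 1 and face cards as 10, and a flag for a usable ace (an ace that can count as 11 without busting). The sketch below shows one way to do this translation; encode_state is a hypothetical helper, not a function from this notebook.

```python
def encode_state(player_cards, dealer_upcard):
    """Return (player_sum, dealer_card, usable_ace) for ranks 1..13 (1 = ace, 11-13 = J/Q/K)."""
    values = [min(card, 10) for card in player_cards]  # face cards count as 10, ace as 1
    hard_total = sum(values)
    # An ace is usable only if promoting it from 1 to 11 keeps the hand at 21 or below
    usable_ace = 1 in values and hard_total + 10 <= 21
    player_sum = hard_total + 10 if usable_ace else hard_total
    return player_sum, min(dealer_upcard, 10), int(usable_ace)

print(encode_state([1, 6], 13))     # soft 17 against a king
print(encode_state([1, 6, 9], 1))   # the ace now counts as 1, dealer shows an ace
```

Simply checking whether the hand contains an ace is not the same as the usable-ace flag: in the second example the hand holds an ace, but it can no longer count as 11.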

In [None]:
def card_name(card):
    """Convert card number to readable format"""
    if card == 1:
        return 'A'
    elif card == 11:
        return 'J'
    elif card == 12:
        return 'Q'
    elif card == 13:
        return 'K'
    else:
        return str(card)

def print_cards(cards, hidden=False):
    """Display cards in readable format"""
    if hidden:
        return f"[{card_name(cards[0])}, ?]"
    return f"[{', '.join(card_name(c) for c in cards)}]"

def get_card_value(card):
    if card == 1:  # Ace
        return 11
    return min(card, 10)

def calculate_hand_value(cards):
    value = sum(get_card_value(card) for card in cards)
    num_aces = cards.count(1)

    # Adjust for aces
    while value > 21 and num_aces:
        value -= 10
        num_aces -= 1

    return value

def play_interactive_blackjack_with_learning(agent):
    print("\nWelcome to Interactive Learning Blackjack!")
    print("The AI will learn from your games.")

    game_memory = []  # Store game experiences
    stats = {'games': 0, 'wins': 0, 'losses': 0, 'draws': 0}

    while True:
        player_cards = []
        dealer_cards = []
        deck = list(range(1, 14)) * 4
        random.shuffle(deck)

        # Initial deal
        player_cards.extend([deck.pop(), deck.pop()])
        dealer_cards.extend([deck.pop(), deck.pop()])

        game_states = []  # Store states for this game

        while True:
            print("\nDealer shows:", print_cards(dealer_cards, hidden=True))
            print("Your cards:", print_cards(player_cards))
            player_value = calculate_hand_value(player_cards)
            print(f"Your total: {player_value}")

            if player_value > 21:
                print("Bust! You lose.")
                stats['losses'] += 1
                break

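            # Current state, encoded like Blackjack-v1 observations:
            # (player sum, dealer card with ace = 1 and faces = 10, usable ace flag)
            hard_total = sum(min(c, 10) for c in player_cards)
            usable_ace = int(1 in player_cards and hard_total + 10 <= 21)
            current_state = np.array([player_value, min(dealer_cards[0], 10), usable_ace], dtype=np.float32)

            # Get AI recommendation
            with torch.no_grad():
                state_tensor = torch.FloatTensor(current_state).unsqueeze(0).to(agent.device)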
                q_values = agent.policy_net(state_tensor)
                ai_action = q_values.argmax().item()
                confidence = torch.softmax(q_values, dim=1)[0]
                ai_recommendation = "Hit" if ai_action == 1 else "Stand"
                print(f"\nAI Recommends: {ai_recommendation} (Confidence: {confidence[ai_action]:.2f})")

            action = input("\nYour action (H/S): ").upper()
            while action not in ['H', 'S']:
                action = input("Invalid input. Please enter H or S: ").upper()

            # Store state and action
            game_states.append((
                current_state,
                1 if action == 'H' else 0,
                player_value
            ))

            if action == 'H':
                player_cards.append(deck.pop())
            else:
                break

        # Game ended, calculate final reward
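        final_player_value = calculate_hand_value(player_cards)
        # The dealer only plays out the hand if the player has not busted
        dealer_value = play_dealer_hand(dealer_cards, deck) if final_player_value <= 21 else None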

        if final_player_value <= 21:
            print("\nFinal hands:")
            print(f"Dealer: {print_cards(dealer_cards)} (Total: {dealer_value})")
            print(f"Player: {print_cards(player_cards)} (Total: {final_player_value})")

            if dealer_value > 21:
                print("Dealer busts! You win!")
                reward = 1.0
                stats['wins'] += 1
            elif dealer_value > final_player_value:
                print("Dealer wins!")
                reward = -1.0
                stats['losses'] += 1
            elif dealer_value < final_player_value:
                print("You win!")
                reward = 1.0
                stats['wins'] += 1
            else:
                print("Push (tie)!")
                reward = 0.0
                stats['draws'] += 1
        else:
            reward = -1.0

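        # Store experiences for learning: only the last decision is terminal and
        # carries the game's reward; earlier hits get reward 0 and lead to the next state
        for i, (state, action, value) in enumerate(game_states):
            is_last = i == len(game_states) - 1
            next_state = state if is_last else game_states[i + 1][0]
            agent.memory.push(state, action, reward if is_last else 0.0, next_state, is_last)
            # Train on a batch
            if len(agent.memory) >= agent.batch_size:
                loss = agent.train_step()
                if loss is not None:
                    print(f"Training loss: {loss:.4f}")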

        stats['games'] += 1
        print("\nCurrent Stats:")
        print(f"Games Played: {stats['games']}")
        print(f"Win Rate: {stats['wins']/stats['games']*100:.1f}%")

        play_again = input("\nPlay again? (Y/N): ").upper()
        if play_again != 'Y':
            break

    print("\nFinal Stats:")
    print(f"Games Played: {stats['games']}")
    print(f"Wins: {stats['wins']}")
    print(f"Losses: {stats['losses']}")
    print(f"Draws: {stats['draws']}")
    print(f"Win Rate: {stats['wins']/stats['games']*100:.1f}%")

    # Save the improved model
    torch.save({
        'policy_net_state_dict': agent.policy_net.state_dict(),
        'optimizer_state_dict': agent.optimizer.state_dict(),
        'epsilon': agent.epsilon,
    }, 'blackjack_dqn_improved.pth')
    print("\nImproved model saved!")

# Helper function for dealer's turn
def play_dealer_hand(dealer_cards, deck):
    dealer_value = calculate_hand_value(dealer_cards)
    while dealer_value < 17:
        dealer_cards.append(deck.pop())
        dealer_value = calculate_hand_value(dealer_cards)
        print(f"Dealer hits: {print_cards(dealer_cards)} (Total: {dealer_value})")
    return dealer_value

# Start the learning interactive game
print("\nStarting interactive learning blackjack game...")
play_interactive_blackjack_with_learning(DQN_Agent)


Starting interactive learning blackjack game...

Welcome to Interactive Learning Blackjack!
The AI will learn from your games.

Dealer shows: [K, ?]
Your cards: [8, 6]
Your total: 14

AI Recommends: Hit (Confidence: 0.56)

Dealer shows: [K, ?]
Your cards: [8, 6, Q]
Your total: 24
Bust! You lose.

Current Stats:
Games Played: 1
Win Rate: 0.0%

Dealer shows: [4, ?]
Your cards: [7, J]
Your total: 17

AI Recommends: Stand (Confidence: 0.62)
Dealer hits: [4, A, 3] (Total: 18)

Final hands:
Dealer: [4, A, 3] (Total: 18)
Player: [7, J] (Total: 17)
Dealer wins!

Current Stats:
Games Played: 2
Win Rate: 0.0%

Dealer shows: [4, ?]
Your cards: [J, 10]
Your total: 20

AI Recommends: Stand (Confidence: 0.86)
Dealer hits: [4, 10, A] (Total: 15)
Dealer hits: [4, 10, A, 6] (Total: 21)

Final hands:
Dealer: [4, 10, A, 6] (Total: 21)
Player: [J, 10] (Total: 20)
Dealer wins!

Current Stats:
Games Played: 3
Win Rate: 0.0%

Dealer shows: [Q, ?]
Your cards: [K, 8]
Your total: 18

AI Recommends: Stand (Conf