# Learning to Play Blackjack with Reinforcement Learning

**Problem Statement:** Reinforcement Learning (RL) trains an agent to make decisions by interacting with an environment so as to maximize cumulative reward. In this project, we apply RL to Blackjack using the Gymnasium `Blackjack-v1` environment (Gymnasium is the maintained successor to OpenAI Gym).

The following code is provided to help you get started. For neural network approaches, we recommend using GPUs to train the models.

## 1. Install and Import Dependencies

In [None]:
# Install dependencies if needed (tqdm is imported below as well)
# !pip install gymnasium matplotlib numpy seaborn tqdm

import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
from tqdm import tqdm

print(f"Gymnasium version: {gym.__version__}")

## 2. Set Up and Explore the Blackjack Environment

The Blackjack-v1 environment simulates the card game with the following:
- **State**: A tuple of (player_sum, dealer_showing_card, usable_ace)
  - `player_sum` (int): Current sum of player's hand (4–21 while the hand is live; it can exceed 21 in the terminal state after a bust)
  - `dealer_showing_card` (int): Dealer's face-up card (1–10, where 1 = Ace)
  - `usable_ace` (bool, or a 0/1 int depending on the library version): Whether the player holds an ace that can count as 11 without busting
- **Actions**: 0 = Stand (stop), 1 = Hit (draw another card)
- **Rewards**: +1 (win), 0 (draw), -1 (lose)
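
To make the `player_sum` and `usable_ace` fields concrete, here is a minimal sketch of how a hand maps to those two values under the convention described above (1 = Ace, 10 = ten or face card). The helper name `hand_value` is hypothetical and is not part of Gymnasium; it only illustrates the rule that an ace counts as 11 when doing so does not bust the hand.

```python
def hand_value(cards):
    """Return (player_sum, usable_ace) for a list of card values.

    Assumed card encoding: 1 = Ace, 2-9 = pip value, 10 = ten or face card.
    An ace is "usable" when counting it as 11 keeps the total at 21 or less.
    """
    total = sum(cards)                      # every ace counted as 1 first
    usable_ace = 1 in cards and total + 10 <= 21
    if usable_ace:
        total += 10                         # promote one ace from 1 to 11
    return total, usable_ace


print(hand_value([1, 6]))      # (17, True): ace counted as 11
print(hand_value([1, 6, 10]))  # (17, False): ace forced back to 1
```

A hand with a usable ace is often called "soft": hitting on it cannot bust immediately, which is why the state tracks this flag separately from the sum.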

In [None]:
# Create the environment
env = gym.make('Blackjack-v1', sab=True)  # sab=True uses Sutton & Barto rules

print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")
print("Actions: 0=Stand, 1=Hit")

In [None]:
# Play a few sample episodes manually to understand the environment
print("=== Sample Episodes ===")

for episode in range(5):
    state, info = env.reset()
    print(f"\nEpisode {episode + 1}:")
    print(f"  Initial state: player_sum={state[0]}, dealer_card={state[1]}, usable_ace={state[2]}")
    
    done = False
    step = 0
    while not done:
        # Random action for demonstration
        action = env.action_space.sample()
        action_name = 'Hit' if action == 1 else 'Stand'
        
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        step += 1
        
        print(f"  Step {step}: Action={action_name}, "
              f"New state={next_state}, Reward={reward}, Done={done}")
        
        state = next_state
    
    result = 'Win' if reward > 0 else ('Draw' if reward == 0 else 'Lose')
    print(f"  Result: {result}")

## 3. Baseline Policies

We've provided two simple baselines:

1. Random policy agent: This agent chooses to hit or stand uniformly at random.

2. Stand on 17+: This agent hits while its sum is below 17 and stands once it reaches 17 or more.

**Note:** Your job is to implement better RL algorithms to beat these baselines.

In [None]:
def evaluate_policy(env, policy_fn, n_episodes=100000):
    """
    Evaluate a policy over many episodes.
    
    Args:
        env: Gymnasium environment
        policy_fn: Function that takes a state and returns an action
        n_episodes: Number of episodes to simulate
    
    Returns:
        win_rate, draw_rate, lose_rate
    """
    wins, draws, losses = 0, 0, 0
    
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        
        while not done:
            action = policy_fn(state)
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
        
        if reward > 0:
            wins += 1
        elif reward == 0:
            draws += 1
        else:
            losses += 1
    
    return wins / n_episodes, draws / n_episodes, losses / n_episodes


# Random policy
random_policy = lambda state: env.action_space.sample()
win_rate, draw_rate, lose_rate = evaluate_policy(env, random_policy)
print(f"Random Policy: Win={win_rate:.3%}, Draw={draw_rate:.3%}, Lose={lose_rate:.3%}")

# Simple threshold policy (stand on 17+)
threshold_policy = lambda state: 1 if state[0] < 17 else 0
win_rate, draw_rate, lose_rate = evaluate_policy(env, threshold_policy)
print(f"Threshold (17) Policy: Win={win_rate:.3%}, Draw={draw_rate:.3%}, Lose={lose_rate:.3%}")

In [None]:
# TODO: Your code goes here
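
One possible starting point for the TODO cell is tabular Q-learning with an epsilon-greedy behavior policy. The sketch below is only an illustration, not the intended solution: the helper names (`make_q_table`, `epsilon_greedy`, `q_learning_update`, `train_q_learning`) are hypothetical, and the hyperparameters are untuned placeholders. It assumes states are the hashable tuples returned by the environment and that `env` follows the `reset()`/`step()` interface used earlier in this notebook.

```python
import random
from collections import defaultdict

N_ACTIONS = 2  # 0 = Stand, 1 = Hit


def make_q_table():
    """Q-table mapping state -> list of action values, initialized to 0."""
    return defaultdict(lambda: [0.0] * N_ACTIONS)


def epsilon_greedy(q_table, state, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    values = q_table[state]
    # Ties resolve to the lowest action index (Stand).
    return max(range(N_ACTIONS), key=lambda a: values[a])


def q_learning_update(q_table, state, action, reward, next_state,
                      terminated, alpha=0.05, gamma=1.0):
    """One Q-learning step: move Q(s, a) toward the bootstrapped target."""
    if terminated:
        target = reward  # no future value after a terminal state
    else:
        target = reward + gamma * max(q_table[next_state])
    q_table[state][action] += alpha * (target - q_table[state][action])


def train_q_learning(env, n_episodes=50_000, epsilon=0.1,
                     alpha=0.05, gamma=1.0):
    """Train a Q-table by interacting with a Gymnasium-style environment."""
    q_table = make_q_table()
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(q_table, state, epsilon)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            q_learning_update(q_table, state, action, reward,
                              next_state, terminated, alpha, gamma)
            state = next_state
    return q_table
```

To compare against the baselines, the learned table can be wrapped as a greedy policy, for example `q = train_q_learning(env)` followed by `evaluate_policy(env, lambda s: epsilon_greedy(q, s, epsilon=0.0))`. Monte Carlo control or a decaying epsilon schedule are equally reasonable alternatives to try.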