
# Assignment 9 – The Blackjack Learner: Understanding Monte Carlo Reinforcement Learning
### Fundamentals of Reinforcement Learning and Monte Carlo Estimation

**Topics Covered:** Reinforcement Learning Foundations, Markov Decision Processes (MDPs), Monte Carlo Prediction and Policy Evaluation  
**Environment:** A simulated Blackjack game designed to model agent–environment interaction and sequential decision-making under uncertainty.  

## Theoretical Concept:

Reinforcement Learning (RL) is a computational framework in which an agent learns to make decisions through interaction with an environment.  
At each time step \( t \), the agent observes a state \( S_t \), chooses an action \( A_t \), receives a reward \( R_{t+1} \), and transitions to a new state \( S_{t+1} \).  
The goal is to learn a policy \( \pi(a|s) \) that maximizes the expected return

\[
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
\]

where \( \gamma \in [0, 1] \) is the discount factor: values near 0 emphasize immediate rewards, while values near 1 weight future rewards almost as heavily as immediate ones.
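
As a minimal illustration of the return defined above, the sketch below folds an episode's rewards backward using the recursion \( G_t = R_{t+1} + \gamma G_{t+1} \). The function name `discounted_return` and the sample reward list are assumptions made for this example and are not part of the assignment code.

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards [R_{t+1}, R_{t+2}, ...] under discount gamma."""
    g = 0.0
    # Work backward so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [0, 0, 1]   # hypothetical episode: the only reward arrives at the end
print(discounted_return(rewards, gamma=1.0))   # 1.0
print(discounted_return(rewards, gamma=0.5))   # 0.25
```

With \( \gamma = 1 \) the final reward counts in full; with \( \gamma = 0.5 \) it is discounted twice before it reaches the first time step.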

### Markov Decision Process (MDP):
An MDP defines this interaction through the tuple \( (S, A, P, R, \gamma) \), representing states, actions, transition probabilities, rewards, and the discount factor.  
The Markov property assumes that the next state depends only on the current state and action, enabling tractable modeling of sequential decisions.
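
To make the tuple concrete, here is a small sketch of a two-state MDP. Everything in it (the state names, actions, probabilities, rewards, and the `step` helper) is hypothetical and chosen only for illustration; it is not the Blackjack environment used later. Note that `step` needs only the current state and action to sample what happens next, which is the Markov property in code form.

```python
import random

# Hypothetical two-state MDP illustrating the tuple (S, A, P, R, gamma).
STATES = ["low", "high"]
ACTIONS = ["wait", "play"]
GAMMA = 0.9

# P[(state, action)] -> list of (probability, next_state, reward)
P = {
    ("low", "wait"):  [(1.0, "low", 0.0)],
    ("low", "play"):  [(0.5, "high", 1.0), (0.5, "low", -1.0)],
    ("high", "wait"): [(1.0, "high", 0.0)],
    ("high", "play"): [(0.8, "high", 1.0), (0.2, "low", -1.0)],
}


def step(state, action, rng):
    """Sample (next_state, reward) using only the current state and action."""
    outcomes = P[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


rng = random.Random(0)   # seeded generator so runs are repeatable
print(step("low", "play", rng))
```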

### Monte Carlo Prediction:
Monte Carlo methods estimate value functions by averaging returns from complete episodes of experience.  
For a state \( s \), the value estimate \( V(s) \) is updated as  

\[
V(s) \leftarrow V(s) + \alpha [G_t - V(s)]
\]

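where \( G_t \) is the return observed from time \( t \), at which state \( s \) was visited, and \( \alpha \) is the step size. Setting \( \alpha = 1/N(s) \), with \( N(s) \) the number of visits to \( s \) so far, makes \( V(s) \) the running average of the observed returns.  
These methods do not require a model of the environment, and the estimates converge to the true value under the evaluated policy as the number of visits to each state grows.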
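
The following sketch checks, on made-up numbers, that the update rule with step size \( \alpha = 1/N(s) \) reproduces the plain arithmetic mean of the returns seen so far. The helper `incremental_update` and the list of returns are assumptions for this example only.

```python
def incremental_update(v, g, alpha):
    """One Monte Carlo update: V <- V + alpha * (G - V)."""
    return v + alpha * (g - v)


returns = [1, -1, -1, 1, 0]   # hypothetical returns observed from one state
v, n = 0.0, 0
for g in returns:
    n += 1
    v = incremental_update(v, g, alpha=1.0 / n)   # alpha = 1/N(s)

batch_mean = sum(returns) / len(returns)
print("incremental estimate:", v)
print("batch mean:          ", batch_mean)
```

The two printed numbers agree up to floating-point rounding. A constant \( \alpha \) would instead weight recent returns more heavily than old ones.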

### Assignment Context:
In this assignment, a simplified Blackjack environment is used to demonstrate how an agent evaluates a playing policy through experience. By repeatedly playing simulated games under a fixed policy, the agent estimates the expected return of each game state using Monte Carlo averaging. You will visualize learned value patterns, interpret how the estimates improve with experience, and discuss the properties of RL algorithms.


# Step 1 – Import libraries

In [11]:
import numpy as np
import random
from collections import defaultdict
import matplotlib.pyplot as plt

# Step 2 – Define simple helper functions

In [None]:
def draw_card():
    """Draws a random card (1–10)."""
    card = random.randint(1, ____)     ### FILL IN BLANK: max card number
    return min(card, ____)             ### FILL IN BLANK: face cards adjustment


def hand_value(hand):
    """Calculates total hand value considering usable ace."""
    value = ____ (hand)                ### FILL IN BLANK: compute sum
    if ____ in hand and value + 10 <= 21:   ### FILL IN BLANK: ace check
        return value + 10
    return ____                        ### FILL IN BLANK: return total


def is_bust(hand):
    """Returns True if hand value > 21."""
    return ____ (hand) > 21            ### FILL IN BLANK: helper call


# Quick test
sample_hand = [____(), ____()]         ### FILL IN BLANK: draw two cards
print("Sample Hand:", sample_hand)
print("Hand Value:", hand_value(sample_hand))
print("Is it a bust?", is_bust(sample_hand))


**Interpretation Question:**  
What is the purpose of the helper functions defined above?

# Step 3 – Simulate one Blackjack episode

In [None]:
def play_blackjack_debug():
    player = [____(), ____()]     ### FILL IN BLANK: draw two cards
    dealer = [____(), ____()]     ### FILL IN BLANK: draw two cards
    episode = []

    print("\n--- New Game ---")
    print(f"Initial Player Hand: {player} (Value = {____(player)})")   ### FILL IN BLANK
    print(f"Dealer Shows: {dealer[0]}")

    # Define initial state/action
    state = (____(player), dealer[0],
             1 in player and ____(player) + 10 <= 21)   ### FILL IN BLANKS: compute state
    action = "stick"

    # Player’s turn
    while ____(player) < 20:      ### FILL IN BLANK: stopping condition
        state = (____(player), dealer[0],
                 1 in player and ____(player) + 10 <= 21)

        # Simple policy
        if ____(player) < 17:     ### FILL IN BLANK: condition for hit
            action = "hit"
        else:
            action = "stick"

        print(f"\nPlayer decides to {action.upper()} at value {____(player)}")

        if action == "hit":
            player.append(____())  ### FILL IN BLANK: draw a card
            print(f"New Player Hand: {player} (Value = {____(player)})")
            if ____(player):       ### FILL IN BLANK: check for bust
                print("Player busts! Hand value:", ____(player))
                episode.append((state, action, -1))
                print("Reward: -1 (Loss)")
                return episode
        else:
            print("Player sticks at:", ____(player))
            break

    # Dealer’s turn
    print("\nDealer's turn begins.")
    while ____(dealer) < 17:      ### FILL IN BLANK
        dealer.append(____())     ### FILL IN BLANK
        print(f"Dealer draws. Dealer Hand: {dealer} (Value = {____(dealer)})")

    player_score = ____(player)
    dealer_score = ____(dealer)
    # A dealer bust is a player win; otherwise the higher total wins
    reward = 1 if dealer_score > 21 else int(np.sign(player_score - dealer_score))

    print("\nFinal Player Hand:", player, "Value:", player_score)
    print("Final Dealer Hand:", dealer, "Value:", dealer_score)

    if reward == 1:
        print("Result: Player Wins (+1)")
    elif reward == 0:
        print("Result: Draw (0)")
    else:
        print("Result: Dealer Wins (-1)")

    episode.append((state, action, reward))
    return episode


for i in range(3):
    result = play_blackjack_debug()
    print("Episode data:", result)


**Interpretation Question:**  
Looking at the game outputs above, what do these episode traces show about how the agent interacts with the environment, and how can this information be used for learning?

# Step 4 – Monte Carlo Value Estimation

In [None]:
def mc_value_estimation(num_games=____):     ### FILL IN BLANK: choose total episodes
    V = defaultdict(____)   ### FILL IN BLANK: store average returns
    N = defaultdict(____)   ### FILL IN BLANK: count visits

    print("Starting Monte Carlo estimation...")

    for game in range(num_games):
        episode = ____()    ### FILL IN BLANK: simulate a game
        G = 0
        seen = set()

        # Loop backward through episode, accumulating the undiscounted return (gamma = 1)
        for state, action, reward in ____ (episode):   ### FILL IN BLANK: iterate in reverse
            G += reward
            if state not in seen:
                seen.add(state)
                N[state] += 1
                V[state] += (G - V[state]) / N[state]

        # Minimal progress output
        if game == num_games // ____:   ### FILL IN BLANK: midpoint check
            print("Halfway through... learning in progress.")

    print("Monte Carlo estimation completed.")
    return V


# Run estimation
V = mc_value_estimation(____)   ### FILL IN BLANK: specify number of games
print("Number of unique states learned:", len(V))


**Interpretation Question:**  
Why does Monte Carlo estimation use averaging across many episodes?
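
A small experiment may help when thinking about this question. The sketch below does not use the Blackjack code. It assumes a hypothetical game that pays +1 with probability 0.4 and -1 otherwise, so the true expected return is known to be -0.2, and it prints how far the sample average lands from that value for different numbers of episodes. The function name and the win probability are assumptions for illustration.

```python
import random


def sample_mean_return(num_episodes, win_prob, rng):
    """Average the +1/-1 returns of a hypothetical win/lose game."""
    total = 0
    for _ in range(num_episodes):
        total += 1 if rng.random() < win_prob else -1
    return total / num_episodes


rng = random.Random(42)          # seeded so the run is repeatable
win_prob = 0.4                   # assumed value, not measured from Blackjack
true_value = 2 * win_prob - 1    # expected return of the toy game

for n in (10, 1_000, 100_000):
    estimate = sample_mean_return(n, win_prob, rng)
    error = abs(estimate - true_value)
    print(f"n={n:>7}: estimate={estimate:+.3f}, error={error:.3f}")
```

Any single return is either +1 or -1, so it is a poor estimate on its own; only the average over many episodes is expected to settle near the true value.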



In [None]:
sample_states = list(____.items())[:____]    ### FILL IN BLANK: access learned states, choose sample size
print("\n--- Sample of Learned State Values ---")

for s, val in ____:                          ### FILL IN BLANK: iterate through sampled states
    print(f"State: {s},  Value: {val:.3f}")


**Interpretation Question:**  
What does a positive or negative value indicate for a given state?

# Step 5 – Visualize Value Function (with and without usable ace)

In [None]:
# Step 5 – Visualize learned state values

usable_ace = np.zeros((____, ____))         ### FILL IN BLANK: set grid dimensions
no_usable_ace = np.zeros((____, ____))

for (player, dealer, ace) in ____:          ### FILL IN BLANK: iterate through learned states
    if 12 <= player <= 21 and 1 <= dealer <= 10:
        if ace:
            usable_ace[player-12][dealer-1] = ____.get((player, dealer, ace), 0)
        else:
            no_usable_ace[player-12][dealer-1] = ____.get((player, dealer, ace), 0)

fig, axes = plt.subplots(1, 2, figsize=(____, ____))   ### FILL IN BLANK: figure size

# With usable ace
# extent maps array indices to dealer cards 1–10 (x) and player sums 12–21 (y)
im1 = axes[0].imshow(usable_ace, origin='lower', cmap='____', vmin=-1, vmax=1,
                     extent=[0.5, 10.5, 11.5, 21.5])   ### FILL IN BLANK: color map
axes[0].set_title('Value with Usable Ace')
axes[0].set_xlabel('Dealer Showing')
axes[0].set_ylabel('Player Sum')
plt.colorbar(im1, ax=axes[0])

# Without usable ace
im2 = axes[1].imshow(no_usable_ace, origin='lower', cmap='____', vmin=-1, vmax=1,
                     extent=[0.5, 10.5, 11.5, 21.5])
axes[1].set_title('Value without Usable Ace')
axes[1].set_xlabel('Dealer Showing')
axes[1].set_ylabel('Player Sum')
plt.colorbar(im2, ax=axes[1])

plt.show()


**Interpretation Question:**  
How does having a usable ace influence the expected return?


# **Reflection and Discussion**

**Question 1:**  
During training, you observed the progress messages as games were completed. What does the increasing number of games tell you about the stability or convergence of the estimated state values? How would you know if the value function has converged?

**Question 2:**  
Examine the value heatmaps for usable and non-usable aces. What insights can you draw about the role of flexibility in decision-making, and how does the presence of a usable ace change optimal behavior?

**Question 3:**  
Review the printed sample of learned state values. Some states have slightly negative values even after many episodes. What factors could cause persistent underestimation or overestimation in Monte Carlo value estimation?

**Question 4:**  
Consider the fixed threshold policy used during game simulation (hit below 17, stick otherwise). If the agent instead followed a random policy, or a policy that also took the dealer's showing card into account, how might that affect both the learned values and the learning efficiency?

**Question 5:**  
Reflect on the overall learning pattern seen in the heatmaps and sample outputs. What does this tell you about the relationship between experiential learning and decision quality in reinforcement learning systems?