
# Analysis of the Final Learned Policy for a Q‑Learning Agent
**AI Studio: Project Analysis — Penalty Shootout**

This notebook analyzes, both qualitatively and quantitatively, the final policy learned by the **Q‑learning** agent for the **Penalty Shootout** game. Because episodes are short and rewards are discrete (goal or save), the analysis focuses on inspecting the **Q‑table** to understand the agent's strategy, strengths, and limitations.



## 1. Setup and Data Loading
First, import the necessary libraries and load the final **Q‑table** and training logs from the saved file `training_data.npz`.


In [1]:

# Requirements
# pip install numpy matplotlib  # if running in a fresh environment

import numpy as np
import matplotlib.pyplot as plt
import os

TRAINING_FILE = "training_data.npz"

# Load the Q-table and training logs
try:
    data = np.load(TRAINING_FILE, allow_pickle=True)
    Q = data["Q"]
    returns = data["returns"]
    moving_avg = data["moving_avg"]
    epsilons = data["epsilons"]
    ACTIONS = list(data["actions"])
    print(f"Loaded: {TRAINING_FILE}")
    print("Q-table shape:", Q.shape)
except FileNotFoundError:
    # Fallback mock so the notebook is still runnable for formatting preview
    print(f"Could not find '{TRAINING_FILE}'. Creating a small mock Q-table for preview...")
    ACTIONS = ["L", "LC", "C", "RC", "R"]
    # (kick_idx 0..5) × (score_diff -5..+5) = (6 * 11) states; 5 actions
    Q = np.random.rand(6*11, 5).astype(np.float32)
    returns = np.random.rand(500).astype(np.float32) * 5.0
    moving_avg = np.convolve(returns, np.ones(50)/50, mode="same")
    epsilons = np.linspace(1.0, 0.05, len(returns)).astype(np.float32)

print("Episodes:", len(returns))
print("Actions:", ACTIONS)
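
For reference, the loader above expects five arrays under the keys `Q`, `returns`, `moving_avg`, `epsilons`, and `actions`. The sketch below shows one way a training script could write such a file with `np.savez`. The file name `training_data_example.npz`, the `*_out` variable names, and the zero-filled placeholder arrays are assumptions for illustration; the actual training script is not part of this notebook.

```python
import os
import tempfile

import numpy as np

# Placeholder training outputs (assumed shapes, mirroring the mock fallback above)
n_states, n_actions, n_episodes = 6 * 11, 5, 500
Q_out = np.zeros((n_states, n_actions), dtype=np.float32)
returns_out = np.zeros(n_episodes, dtype=np.float32)
moving_avg_out = np.zeros(n_episodes, dtype=np.float32)
epsilons_out = np.linspace(1.0, 0.05, n_episodes).astype(np.float32)
# A fixed-width string array can be loaded without pickling
actions_out = np.array(["L", "LC", "C", "RC", "R"])

# Write to a temporary directory so the real training file is never overwritten
path = os.path.join(tempfile.mkdtemp(), "training_data_example.npz")
np.savez(path, Q=Q_out, returns=returns_out, moving_avg=moving_avg_out,
         epsilons=epsilons_out, actions=actions_out)

# Read it back the same way the notebook does
with np.load(path) as data:
    loaded_keys = sorted(data.files)
    loaded_actions = [str(a) for a in data["actions"]]
    loaded_Q_shape = data["Q"].shape

print(loaded_keys, loaded_actions, loaded_Q_shape)
```

If the action labels are stored as a plain string array like this, `allow_pickle=True` should not be needed when loading; it only matters when an array has `dtype=object`.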



In [None]:

# State index mapping used in training:
# idx = (kick_idx * cols) + (diff + max_diff), where cols = 2*max_diff + 1, and max_diff = kicks
KICKS = 5
MAX_DIFF = KICKS
COLS = 2 * MAX_DIFF + 1
ROWS = KICKS + 1  # kick_idx 0..5

def idx_from_state(kick_idx, score_diff):
    return (kick_idx * COLS) + (score_diff + MAX_DIFF)

def state_from_idx(idx):
    kick_idx = idx // COLS
    diff = (idx % COLS) - MAX_DIFF
    return kick_idx, diff

score_labels = [str(d) for d in range(-MAX_DIFF, MAX_DIFF+1)]
kick_labels = [str(k) for k in range(ROWS)]
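
A quick round-trip check helps confirm that the two mapping helpers are inverses of each other and that together they cover every row of a `ROWS × COLS` Q‑table exactly once. The sketch below repeats the definitions from the cell above so that it runs on its own.

```python
KICKS = 5
MAX_DIFF = KICKS
COLS = 2 * MAX_DIFF + 1
ROWS = KICKS + 1

def idx_from_state(kick_idx, score_diff):
    return (kick_idx * COLS) + (score_diff + MAX_DIFF)

def state_from_idx(idx):
    return idx // COLS, (idx % COLS) - MAX_DIFF

# Enumerate states in row-major order: kick index first, then score difference
all_states = [(k, d) for k in range(ROWS) for d in range(-MAX_DIFF, MAX_DIFF + 1)]
indices = [idx_from_state(k, d) for k, d in all_states]

# Every index 0..ROWS*COLS-1 is produced exactly once, in order
covers_table = indices == list(range(ROWS * COLS))
# Decoding each index gives back the original state
round_trips = all(state_from_idx(i) == s for i, s in zip(indices, all_states))

print("covers table:", covers_table, "| round trip ok:", round_trips)
```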



## 2. Learning Curves
We start with **episode goals** and a **moving average** to verify that the agent improved, followed by the **ε‑greedy schedule** used during training.


In [None]:

plt.figure()
plt.plot(returns, label="Episode goals")
plt.plot(moving_avg, label="Moving average (50)")
plt.xlabel("Episode")
plt.ylabel("Goals per episode")
plt.title("Learning Curve — Goals")
plt.legend()
plt.tight_layout()
plt.show()

plt.figure()
plt.plot(epsilons)
plt.xlabel("Episode")
plt.ylabel("Epsilon (ε)")
plt.title("Exploration Schedule")
plt.tight_layout()
plt.show()
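
When reading the moving-average curve, the way the window handles the ends of the series matters. The mock fallback in Section 1 uses `np.convolve(..., mode="same")`, which pads with zeros, so the first and last points are pulled toward zero. How the saved `moving_avg` array was computed is not known from this notebook. The sketch below shows an alternative under that caveat: `mode="valid"` keeps only full windows. The helper name `moving_average` and the constant example series are illustrative assumptions.

```python
import numpy as np

def moving_average(values, window=50):
    """Mean over full windows only; returns len(values) - window + 1 points."""
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        return np.empty(0)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Constant series: a correct moving average should stay at 3.0 everywhere
example_returns = np.full(200, 3.0)
smoothed_valid = moving_average(example_returns, window=50)
smoothed_same = np.convolve(example_returns, np.ones(50) / 50, mode="same")

# x positions that align each point with the episode where its window ends
x_valid = np.arange(50 - 1, len(example_returns))

print("valid: first point =", smoothed_valid[0], "| points =", len(smoothed_valid))
print("same:  first point =", smoothed_same[0], "| points =", len(smoothed_same))
```

Plotting `smoothed_valid` against `x_valid` avoids the artificial dip at both ends, at the cost of a curve that starts `window - 1` episodes late.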



## 3. Qualitative Analysis of the Final Learned Policy
To understand the agent’s preferences, we visualize its **greedy action** (argmax over Q) across relevant state slices.  
Because the state is `(kick_idx, score_diff)`, we examine:
- **Greedy action vs. score difference** for a *fixed kick index*; and
- **Greedy action vs. kick index** for a *fixed score difference*.


In [None]:

# Build a matrix: rows = kick_idx (0..5), cols = score_diff (-5..+5), values = greedy action index
greedy = np.argmax(Q, axis=1)
grid = np.zeros((ROWS, COLS), dtype=int)
for k in range(ROWS):
    for d in range(-MAX_DIFF, MAX_DIFF+1):
        grid[k, d+MAX_DIFF] = greedy[idx_from_state(k, d)]

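plt.figure()
# Discrete colormap: one color per action, with the colorbar labeled by action name
plt.imshow(grid, aspect="auto", cmap=plt.get_cmap("viridis", len(ACTIONS)),
           vmin=-0.5, vmax=len(ACTIONS) - 0.5)
plt.xticks(ticks=range(COLS), labels=score_labels, rotation=0)
plt.yticks(ticks=range(ROWS), labels=kick_labels)
plt.xlabel("Score Difference (goals - saves)")
plt.ylabel("Kick Index (0: first ... 5: last)")
plt.title("Greedy Action Map across State Space")
cbar = plt.colorbar(ticks=range(len(ACTIONS)))
cbar.ax.set_yticklabels(ACTIONS)
plt.tight_layout()
plt.show()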

print("Legend (action index → label):", {i:a for i,a in enumerate(ACTIONS)})
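
A text version of the greedy map can be easier to paste into a report than a heatmap. The sketch below builds a grid of action labels and marks rows whose Q‑values are all near zero with `-`, since `np.argmax` would otherwise report action index 0 for them. The helper name `greedy_label_grid`, the tolerance, and the tiny demo table are assumptions for illustration. The reshape relies on the row-major index mapping `idx = kick_idx * cols + column` defined earlier.

```python
import numpy as np

def greedy_label_grid(Q, actions, rows, cols, tol=1e-6):
    """Return a rows x cols array of greedy action labels ('-' for near-zero rows)."""
    Q3 = np.asarray(Q, dtype=float).reshape(rows, cols, len(actions))
    labels = np.array(actions, dtype=object)[np.argmax(Q3, axis=2)]
    untrained = np.all(np.abs(Q3) < tol, axis=2)
    labels[untrained] = "-"
    return labels

# Tiny demo: 2 kick rows x 3 score columns, 2 actions; only two states were updated
demo_Q = np.zeros((6, 2))
demo_Q[0] = [0.1, 0.9]   # state index 0 -> row 0, col 0, prefers "R"
demo_Q[4] = [0.5, 0.2]   # state index 4 -> row 1, col 1, prefers "L"

demo_grid = greedy_label_grid(demo_Q, ["L", "R"], rows=2, cols=3)
for row in demo_grid:
    print(" ".join(f"{label:>2}" for label in row))
```

For the real table, the call would be `greedy_label_grid(Q, ACTIONS, ROWS, COLS)`, assuming `Q` has exactly `ROWS * COLS` rows.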



### 3.1 Preferred Actions by Score Context (Fixed Kick)
Here we pick a specific kick (e.g., **kick 2**) and show the **Q‑values** for all actions across score differences. This highlights how the policy changes when trailing vs. leading.


In [None]:

fixed_k = 2  # choose any 0..5
M = np.zeros((len(ACTIONS), COLS), dtype=float)
for a in range(len(ACTIONS)):
    for d in range(-MAX_DIFF, MAX_DIFF+1):
        M[a, d+MAX_DIFF] = Q[idx_from_state(fixed_k, d), a]

plt.figure()
plt.imshow(M, aspect="auto")
plt.yticks(ticks=range(len(ACTIONS)), labels=ACTIONS)
plt.xticks(ticks=range(COLS), labels=score_labels, rotation=0)
plt.xlabel("Score Difference")
plt.ylabel("Action")
plt.title(f"Q-values by Score Difference at Kick {fixed_k}")
plt.colorbar()
plt.tight_layout()
plt.show()
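
Raw Q‑values mix two effects: how good a state is overall and how strongly one action is preferred within it. One way to isolate the second effect is the gap between the best and second-best Q‑value in each state. A small gap suggests the greedy choice there is close to arbitrary. The sketch below is a minimal illustration; the helper name `action_gap` and the demo values are assumptions.

```python
import numpy as np

def action_gap(Q):
    """Per-state difference between the largest and second-largest Q-value."""
    Q = np.asarray(Q, dtype=float)
    top_two = np.sort(Q, axis=1)[:, -2:]   # two largest values, in ascending order
    return top_two[:, 1] - top_two[:, 0]

demo_Q = np.array([
    [1.0, 3.0, 2.0],   # clear preference for action 1
    [0.0, 0.0, 0.0],   # untrained state: no preference at all
    [5.0, 1.0, 4.5],   # weak preference for action 0
])
demo_gaps = action_gap(demo_Q)
print(demo_gaps)
```

Applied to the real table, `action_gap(Q).reshape(ROWS, COLS)` could be shown with `plt.imshow` next to the greedy map, again assuming `Q` has `ROWS * COLS` rows.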



### 3.2 Preferred Actions over the Shootout Timeline (Fixed Score)
Now we fix a **score difference** (e.g., **0**) and plot **Q‑values** across **kick index**. This shows whether the agent becomes more/less aggressive as the final kick approaches.


In [None]:

fixed_diff = 0  # choose from -5..+5
M2 = np.zeros((len(ACTIONS), ROWS), dtype=float)
for a in range(len(ACTIONS)):
    for k in range(ROWS):
        M2[a, k] = Q[idx_from_state(k, fixed_diff), a]

plt.figure()
plt.imshow(M2, aspect="auto")
plt.yticks(ticks=range(len(ACTIONS)), labels=ACTIONS)
plt.xticks(ticks=range(ROWS), labels=kick_labels)
plt.xlabel("Kick Index")
plt.ylabel("Action")
plt.title(f"Q-values by Kick Index at Score Diff {fixed_diff}")
plt.colorbar()
plt.tight_layout()
plt.show()



### 3.3 Unvisited or Undertrained States
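We flag any **state rows** in the Q‑table that remained near zero, which is common when the agent rarely or never reaches those states during training. Note that `np.argmax` returns index 0 for an all‑zero row, so such states show up as the first action in the greedy maps above even though nothing was learned there.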


In [None]:

row_norms = np.linalg.norm(Q, axis=1)
low_rows = np.where(row_norms < 1e-6)[0]
print(f"Total states: {Q.shape[0]} | Near-zero rows: {len(low_rows)}")

if len(low_rows) > 0:
    example = low_rows[:10]
    decoded = [state_from_idx(i) for i in example]
        print("Example near-zero states (kick_idx, score_diff):", decoded)
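
Some near-zero rows may not be undertrained at all: they may simply be impossible to reach. The sketch below separates the two cases under explicit assumptions that are not confirmed by this notebook: `kick_idx` counts kicks already taken, every kick ends in either a goal or a save, and `score_diff = goals - saves`. Under those assumptions, `goals + saves = kick_idx`, so `|score_diff| <= kick_idx` and `score_diff` has the same parity as `kick_idx`. The helper name `is_reachable` is hypothetical.

```python
KICKS = 5
MAX_DIFF = KICKS

def is_reachable(kick_idx, score_diff):
    # Assumption: goals + saves == kick_idx and score_diff == goals - saves
    within_bounds = abs(score_diff) <= kick_idx
    same_parity = (kick_idx - score_diff) % 2 == 0
    return within_bounds and same_parity

all_states = [(k, d) for k in range(KICKS + 1) for d in range(-MAX_DIFF, MAX_DIFF + 1)]
reachable = [s for s in all_states if is_reachable(*s)]

print(f"Reachable under these assumptions: {len(reachable)} of {len(all_states)} states")
```

If the environment works this way, most of the table can never be visited, and a near-zero row is only a training concern when its state is in `reachable`. In addition, if the episode ends after the fifth kick, the rows for `kick_idx = 5` are terminal states from which no action is taken, so standard tabular Q‑learning would leave them at their initial values.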



## 4. High-Level Policy: Overall Action Preference
We compute the **overall greedy action distribution** across the entire state space. This provides a high-level view of preferred shot directions.


In [None]:

vals, counts = np.unique(greedy, return_counts=True)
labels = [ACTIONS[v] for v in vals]

plt.figure()
# Donut chart using a pie with a white circle
plt.pie(counts, labels=labels, autopct="%1.1f%%", startangle=90)
centre_circle = plt.Circle((0,0), 0.65, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title("Overall Greedy Action Distribution (All States)")
plt.tight_layout()
plt.show()
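
Because `np.argmax` returns index 0 for an all-zero row, counting greedy actions over every state can inflate the share of the first action. The sketch below counts greedy actions only over rows that pass the same near-zero threshold used in Section 3.3. The helper name `greedy_counts` and the demo table are assumptions for illustration.

```python
import numpy as np

def greedy_counts(Q, actions, tol=1e-6):
    """Count greedy actions over rows whose norm is at least tol."""
    Q = np.asarray(Q, dtype=float)
    trained = np.linalg.norm(Q, axis=1) >= tol
    greedy_trained = np.argmax(Q[trained], axis=1)
    counts = np.bincount(greedy_trained, minlength=len(actions))
    return dict(zip(actions, counts.tolist())), int(trained.sum())

demo_Q = np.array([
    [0.0, 0.0, 0.0],   # never updated
    [0.1, 0.5, 0.2],   # greedy: "C"
    [0.0, 0.0, 0.0],   # never updated
    [0.9, 0.1, 0.3],   # greedy: "L"
])
demo_actions = ["L", "C", "R"]

filtered_counts, n_trained = greedy_counts(demo_Q, demo_actions)
naive_counts = np.bincount(np.argmax(demo_Q, axis=1), minlength=3).tolist()

print("trained states only:", filtered_counts, "from", n_trained, "states")
print("all states (naive): ", dict(zip(demo_actions, naive_counts)))
```

In the demo, the naive count credits "L" with three states, while only one of them reflects a learned preference.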



## 5. Optional Quantitative Evaluation
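Run multiple evaluation episodes using the learned **greedy policy** (no exploration) and report the mean ± standard deviation of goals per episode. The cell skips the evaluation if the `penalty_shootout_ql` module cannot be imported.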


In [None]:

def run_optional_eval():
    try:
        from penalty_shootout_ql import PenaltyShootoutEnv, to_state_index
    except Exception as e:
        print("Env not available, skipping evaluation:", e)
        return None

    env = PenaltyShootoutEnv()
    max_diff = env.kicks

    def policy(state):
        idx = to_state_index(state, env.kicks, max_diff)
        return int(np.argmax(Q[idx]))

    def evaluate(n_episodes=100):
        scores = []
        for _ in range(n_episodes):
            s = env.reset()
            done = False
            total = 0.0
            while not done:
                a = policy(s)
                s, r, done, _ = env.step(a)
                total += r
            scores.append(total)
        return float(np.mean(scores)), float(np.std(scores))

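    n_eval = 200
    mean_score, std_score = evaluate(n_eval)
    print(f"Evaluation over {n_eval} episodes: {mean_score:.2f} ± {std_score:.2f} goals")
    # Return the summary so it can be reused (None is returned when the env is missing)
    return mean_score, std_score


results = run_optional_eval()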
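
The standard deviation describes how much single episodes vary, not how precisely the mean is known. When comparing against another agent, a standard error and an approximate confidence interval for the mean can be more informative. The sketch below uses a normal approximation with the 1.96 multiplier; the helper name `summarize_scores` and the demo scores are assumptions, and the sample standard deviation (`ddof=1`) is used here, whereas the cell above uses NumPy's default (`ddof=0`).

```python
import numpy as np

def summarize_scores(scores, z=1.96):
    """Mean, sample std, standard error, and approximate 95% CI of the mean."""
    scores = np.asarray(scores, dtype=float)
    mean = float(scores.mean())
    std = float(scores.std(ddof=1))
    sem = std / np.sqrt(len(scores))
    return mean, std, sem, (mean - z * sem, mean + z * sem)

demo_scores = [2, 3, 3, 4]   # goals per episode (made-up values)
demo_mean, demo_std, demo_sem, demo_ci = summarize_scores(demo_scores)
print(f"mean={demo_mean:.2f}  std={demo_std:.3f}  sem={demo_sem:.3f}  "
      f"CI=({demo_ci[0]:.2f}, {demo_ci[1]:.2f})")
```

With only a handful of episodes the normal approximation is rough; it becomes more reasonable at the 200 episodes used above.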



## 6. Conclusions
- The **learning curves** show whether the agent improved: the moving average of goals should rise as ε decays.

- The **greedy action maps** reveal how the agent adapts to **score pressure** and **late kicks**.

- Any **near‑zero Q rows** highlight states that were rarely or never encountered; these are candidates for curriculum tweaks or reward shaping.

- The **overall action distribution** summarizes the agent’s style (e.g., favoring safer corners or central shots), which you can discuss in the report.



**Next steps:** compare with **SARSA**, modify the goalkeeper distribution, extend the action set (e.g., shot power), or switch to **DQN** with function approximation.
