# **Customising Gymnasium Environments and Implementing Reinforcement Learning Agents with Stable-Baselines3**

*Authors*: Inês Castro (202304060), Soraia Costa (202305078), Simão Gomes (202304752)


2.3: figure out how to do citations properly <br>
This is here just so it does not get lost in the middle <br>
<br>
The metrics to be extracted, compared and analysed are the following:
- MEAN TIME TO GOAL SCORED - assesses the agents' offensive capability
- MEAN TIME TO GOAL CONCEDED - assesses the agents' defensive capability
- MEAN GOAL DIFFERENCE (<0, opponent wins; >0, agents win) - comparison with the opposing agents
- PUCK POSSESSION TIME - too little means the agent is not touching the puck; too much means it is not shooting; evaluated together with the opponent's possession time
- OPPONENT PUCK POSSESSION TIME
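
The first three metrics above can be derived from the per-step reward stream alone. The sketch below is one possible way to do that, not the final evaluation code. It assumes that "time" is measured in environment steps and that the clock restarts after every goal by either side; both choices, and the function name `goal_metrics`, are illustrative. The two possession metrics need information that the reward does not carry, so they are left out.

```python
from statistics import mean


def goal_metrics(episodes):
    """Summarise goal timing from per-step rewards.

    `episodes` is a list of episodes; each episode is the list of rewards
    returned by env.step (+1 goal scored, -1 goal conceded, 0 otherwise).
    Assumption: time is counted in steps since the previous goal
    (by either team) or since the start of the episode.
    """
    scored_gaps, conceded_gaps, goal_diffs = [], [], []
    for rewards in episodes:
        last_goal_step = 0  # step index of the previous goal, 0 at episode start
        for step, r in enumerate(rewards, start=1):
            if r > 0:
                scored_gaps.append(step - last_goal_step)
                last_goal_step = step
            elif r < 0:
                conceded_gaps.append(step - last_goal_step)
                last_goal_step = step
        # With +1/-1 rewards the episode return equals the goal difference
        goal_diffs.append(sum(rewards))
    return {
        "mean_steps_to_goal_scored": mean(scored_gaps) if scored_gaps else None,
        "mean_steps_to_goal_conceded": mean(conceded_gaps) if conceded_gaps else None,
        "mean_goal_difference": mean(goal_diffs),
    }


# Two toy episodes, written by hand for illustration only
print(goal_metrics([[0, 0, 1, 0, -1], [0, -1, 0, 0, 0, 1]]))
```

To use it with a real agent, the evaluation loop would collect the `reward` values of each episode in a list instead of only summing them.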

***
# **Table of Contents**

### 1. Introduction
   - 1.1. Objective
   - 1.2. Environment Selection: IceHockey-v5
   - 1.3. Motivation and Expected Outcomes

### 2. Environment Analysis
   - 2.1. Original Environment Overview
        - Observation Space
        - Action Space
        - Reward Structure
        - Game Mechanics
   - 2.2. Baseline Performance Analysis
        - Random Agent Behavior
   - 2.3. Identified Modification Opportunities

### 3. Environment Customization (Week 1)
   - 3.1. Reward Engineering Design
        - Proposed Modifications
        - Theoretical Justification
   - 3.2. Implementation
        - Custom Wrapper Code
        - Testing and Validation
   - 3.3. Modified Environment Behavior
        - Expected vs Actual Changes
        - Edge Cases Handling

### 4. Baseline Agent (Week 2)
   - 4.1. Algorithm Selection
        - Why DQN/PPO?
        - Alternative Considerations
   - 4.2. Baseline Configuration
        - Hyperparameters (Default)
        - Training Setup
   - 4.3. Baseline Training
        - Training Process
        - Convergence Analysis
   - 4.4. Baseline Results
        - Performance Metrics
        - Behavior Analysis

### 5. Custom Environment Agent (Week 2)
   - 5.1. Training with Custom Rewards
        - Same Algorithm, New Environment
        - Training Process
   - 5.2. Hyperparameter Tuning
        - Configurations Tested
        - Tuning Methodology
        - Rationale for Each Change
   - 5.3. Results per Configuration
        - Configuration 1: [name]
        - Configuration 2: [name]
        - Configuration 3: [name]
   - 5.4. Best Model Selection

### 6. Evaluation and Comparison (Week 3)
   - 6.1. Evaluation Methodology
        - Metrics Definition
        - Evaluation Protocol
   - 6.2. Quantitative Results
        - Performance Comparison Table
        - Statistical Significance Tests
   - 6.3. Qualitative Analysis
        - Behavioral Differences
        - Play Style Comparison
   - 6.4. Learning Curves Analysis
        - Training Progression
        - Convergence Comparison
   - 6.5. Visualizations
        - Reward Evolution
        - Action Distributions
        - Game Statistics

### 7. Discussion
   - 7.1. What Worked
        - Successful Modifications
        - Positive Impacts
   - 7.2. What Didn't Work
        - Failed Approaches
        - Unexpected Behaviors
   - 7.3. Insights and Learnings
        - Reward Shaping Implications
        - Trade-offs Discovered
   - 7.4. Limitations
        - Technical Constraints
        - Time/Resource Limitations

### 8. Conclusions
   - 8.1. Summary of Findings
   - 8.2. Impact of Customizations
   - 8.3. Future Work

### 9. References


***
## **1. Introduction**

### 1.1 Objective
[insert objective]

### 1.2 Environment Selection: IceHockey-v5
[Explain why we chose IceHockey]
- Two-player ice hockey game
- Discrete action space (18 actions)
- Sparse reward structure
- Strategic gameplay requirements

### 1.3 Motivation and Expected Outcomes
[What we expect to achieve]

***
## **2. Environment Analysis**
[Environment information](https://ale.farama.org/environments/ice_hockey/)<br>


In this environment we control a team of 2 players. To mirror how the original Ice Hockey game was played, we control only one of the players at a time.<br>
The controlled player switches automatically according to proximity to the

### 2.1 Original Environment Overview

In [10]:
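import gymnasium as gym
import ale_py  # provides the ALE/* Atari environments
import numpy as np

# Create the environment (RGB frames are returned by env.render())
env = gym.make("ALE/IceHockey-v5", render_mode="rgb_array")

# Inspect the observation and action spaces
print("=== Environment Information ===")
print(f"Observation Space: {env.observation_space}")
print(f"Action Space: {env.action_space}")
print(f"Action Meanings: {env.unwrapped.get_action_meanings()}")

# Release emulator resources; later cells create their own environment
env.close()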

=== Environment Information ===
Observation Space: Box(0, 255, (210, 160, 3), uint8)
Action Space: Discrete(18)
Action Meanings: ['NOOP', 'FIRE', 'UP', 'RIGHT', 'LEFT', 'DOWN', 'UPRIGHT', 'UPLEFT', 'DOWNRIGHT', 'DOWNLEFT', 'UPFIRE', 'RIGHTFIRE', 'LEFTFIRE', 'DOWNFIRE', 'UPRIGHTFIRE', 'UPLEFTFIRE', 'DOWNRIGHTFIRE', 'DOWNLEFTFIRE']


The environment's original reward is +1 per goal scored and -1 per goal conceded.<br>
A negative score indicates a win for the opponent and a positive score indicates a win for our team.

### 2.2 Baseline Performance Analysis

In [11]:
def test_random_agent(env, n_episodes=50):
    """Run a uniformly random policy and report the mean episode return and length."""
    episode_rewards = []
    episode_lengths = []
    
    for ep in range(n_episodes):
        obs, info = env.reset()
        total_reward = 0
        steps = 0
        done = False
        
        while not done:
            action = env.action_space.sample()
            obs, reward, terminated, truncated, info = env.step(action)
            total_reward += reward
            steps += 1
            done = terminated or truncated
        
        episode_rewards.append(total_reward)
        episode_lengths.append(steps)
    
    print("Random Agent Performance:")
    print(f"  Mean Reward: {np.mean(episode_rewards):.2f} ± {np.std(episode_rewards):.2f}")
    print(f"  Mean Length: {np.mean(episode_lengths):.1f}")
    
    return episode_rewards, episode_lengths

env = gym.make('ALE/IceHockey-v5')
random_rewards, random_lengths = test_random_agent(env)
env.close()

Random Agent Performance:
  Mean Reward: -9.46 ± 3.75
  Mean Length: 3305.5
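
To judge how precisely 50 episodes pin down the random baseline, the reported mean and standard deviation can be turned into an approximate confidence interval. This is a rough sketch under a normal approximation. The helper name `mean_confidence_interval` is illustrative, and since `np.std` computes the population standard deviation by default, the interval is slightly narrower than one based on the sample standard deviation.

```python
import math


def mean_confidence_interval(mean_value, std, n, z=1.96):
    """Normal-approximation confidence interval for a sample mean."""
    standard_error = std / math.sqrt(n)
    half_width = z * standard_error
    return mean_value - half_width, mean_value + half_width


# Values reported above for the random agent (50 episodes)
low, high = mean_confidence_interval(-9.46, 3.75, 50)
print(f"95% CI for the mean return: [{low:.2f}, {high:.2f}]")
```

With these numbers the interval is roughly -10.50 to -8.42, so a trained agent whose mean return falls clearly above that range is unlikely to be indistinguishable from random play.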


### 2.3 Identified Modification Opportunities

The reward function of the Ice Hockey environment is purely sparse and objective-based, providing:

- +1 for scoring a goal

- -1 for conceding a goal

While this formulation captures the final objective of the game, it has several limitations that hinder the learning performance of reinforcement learning agents, particularly in long-horizon, multi-agent domains. Drawing on recent multi-agent reinforcement learning (MARL) studies in robot football and simulated sports environments, we identify the following modification opportunities.


1. **Sparse Rewards Are Ineffective in Long-Horizon, Multi-Agent Tasks**

Sparse scoring events provide too little information for the agent to learn meaningful behaviours.  
Multiple studies highlight the limitations of using goals as the only training signal.

- *“Goals are infrequent events that provide little learning signal for credit assignment.”*  
  *Embedding Contextual Information Through Reward Shaping* [2]

- *“The reward associated with scoring has low impact due to its rarity.”*  
  *The Role of a Reward in Shaping Multiple Football Agents’ Behavior* [1]

***Opportunity:*** 
Introduce dense shaping rewards to provide feedback at every timestep, improving sample efficiency and stabilizing learning.


2. **Lack of Intermediate Feedback Prevents Effective Credit Assignment**

With only ±1 delivered when a goal is scored or conceded, agents cannot distinguish which of the preceding actions contributed to success or failure.

- rSoccer results demonstrate that sparse rewards cause agents to *“fail to learn coordinated behaviours”* and that rewards become *“inadequate when observation dimensionality increases”* [3].

- The Google Research Football (GRF) benchmark shows that adding checkpoint rewards improves learning stability:  
  *“Dense reward shows better and more stable performance than sparse reward.”* [4]

***Opportunity:*** 
Add intermediate tactical rewards (e.g., puck recovery, successful passes, shots on target).
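
A small worked example helps show why a goal reward says so little about early actions. Under discounting, a reward received `n` steps later is weighted by `gamma ** n` in the return. The sketch below uses `gamma = 0.99`, which is the default discount factor of Stable-Baselines3's DQN; the delays are arbitrary illustrative values and the function name is hypothetical.

```python
def discounted_weight(gamma: float, steps_until_reward: int) -> float:
    """Factor by which a reward received `steps_until_reward` steps later is scaled."""
    return gamma ** steps_until_reward


# How much of a +1 goal reward reaches an action taken `delay` steps earlier
for delay in (10, 100, 500):
    print(f"{delay:>4} steps before the goal: weight {discounted_weight(0.99, delay):.4f}")
```

An action taken 500 steps before a goal sees less than 1% of the goal reward, which is one reason intermediate rewards closer to the useful action can make credit assignment easier.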


3. **No Directional Guidance Leads to Unstructured or Random Policies**

The current reward provides no gradient for progress toward the opponent’s goal, leading to exploratory behaviours that do not contribute to scoring.

In GRF, directional shaping (CHECKPOINT reward) was introduced for exactly this reason:

- *“CHECKPOINTS reward substantially reduces exploratory actions by guiding the player toward the opponent’s goal.”*  
  GRF MARL Benchmark [4]

***Opportunity:***  
Include a **progress reward** based on puck displacement toward the opponent’s goal.
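
One way to make "puck displacement toward the opponent's goal" precise is potential-based reward shaping. The symbols below are assumptions introduced for illustration: $d(s)$ is the distance from the puck to the opponent's goal in state $s$, $k_1 > 0$ is a scaling constant, and $\gamma$ is the discount factor.

```latex
\Phi(s) = -k_1\, d(s), \qquad
F(s, s') = \gamma\,\Phi(s') - \Phi(s) = k_1 \bigl( d(s) - \gamma\, d(s') \bigr)
```

For $\gamma = 1$ this reduces to $k_1\,(d(s) - d(s'))$, that is, the old distance minus the new distance, scaled by $k_1$. Shaping terms of the form $\gamma\,\Phi(s') - \Phi(s)$ are known to leave the optimal policy unchanged (Ng, Harada and Russell, 1999), which makes this a relatively safe starting point compared with ad hoc bonuses.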


4. **Lack of Coordination Incentives Produces Degenerate Multi-Agent Behavior**

Sparse rewards do not encourage:

- spacing  
- defensive support  
- coordinated pressure  
- role differentiation  

This problem is documented in rSoccer, where:

- *“Agents block each other and fail to collaborate when no shaping is provided.”*  
  rSoccer Framework [3]

***Opportunity:***  
Add shaping terms encouraging cooperative behaviour such as spacing, puck recovery, and avoidance of collisions.


5. **Sparse Rewards Lead to Overfitting and Non-Generalizable Strategies**

Agents trained solely on goal rewards may learn brittle strategies that exploit specific opponent weaknesses rather than robust tactical principles.

The GRF benchmark warns that:

- *“Training solely against fixed opponents leads to overfitting; policies remain far from robust.”*  
  GRF Full-Game Analysis [4]

***Opportunity:***  
Reward behaviours that generalize across opponents (e.g., maintaining possession, structured defensive pressure).

***
## 3. Environment Customization

### 3.1 Reward Engineering Design

**Proposed Modifications:**<br>
initial reward = 0

1. ***Goals:***
   - `if scored_goal: reward += 1.0`
   - `if conceded_goal: reward -= 1.0`

2. ***Direction (inspired by CHECKPOINT):***
   - `reward += k1 * (old_dist_to_goal - new_dist_to_goal)`

3. ***Puck possession:***
   - `if has_puck: reward += 0.01`
   - `else: reward -= 0.01`

4. ***Pressure and defence:***
   - `reward += k2 * (old_dist_to_puck - new_dist_to_puck)`

5. ***Useful actions:***
   - `if completed_pass: reward += 0.1`
   - `if shot_on_goal: reward += 0.05`
   - `if steal_puck: reward += 0.2`

6. ***Penalties:***
   - `if collision: reward -= 0.02`
   - `if agent_idle_too_long: reward -= 0.02`
   - `if teammates_too_close: reward -= 0.02`
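
The six terms above can be combined into a single function, sketched below. This is only an illustration of the arithmetic, not the final wrapper. The `StepEvents` container and its field names are hypothetical, the way each signal would be detected from the game is not addressed, and the default values of `k1` and `k2` are placeholders because the proposal does not fix them.

```python
from dataclasses import dataclass


@dataclass
class StepEvents:
    """Hypothetical per-step signals; how each one is detected is left open."""
    scored_goal: bool = False
    conceded_goal: bool = False
    has_puck: bool = False
    completed_pass: bool = False
    shot_on_goal: bool = False
    steal_puck: bool = False
    collision: bool = False
    agent_idle_too_long: bool = False
    teammates_too_close: bool = False
    old_dist_to_goal: float = 0.0
    new_dist_to_goal: float = 0.0
    old_dist_to_puck: float = 0.0
    new_dist_to_puck: float = 0.0


def shaped_reward(ev: StepEvents, k1: float = 0.01, k2: float = 0.01) -> float:
    """Sum the six proposed terms for one step (k1 and k2 are placeholder scales)."""
    reward = 0.0
    # 1. Goals: the environment's original signal
    if ev.scored_goal:
        reward += 1.0
    if ev.conceded_goal:
        reward -= 1.0
    # 2. Direction: positive when the puck moved closer to the opponent's goal
    reward += k1 * (ev.old_dist_to_goal - ev.new_dist_to_goal)
    # 3. Puck possession
    reward += 0.01 if ev.has_puck else -0.01
    # 4. Pressure and defence: positive when the agent moved closer to the puck
    reward += k2 * (ev.old_dist_to_puck - ev.new_dist_to_puck)
    # 5. Useful actions
    if ev.completed_pass:
        reward += 0.1
    if ev.shot_on_goal:
        reward += 0.05
    if ev.steal_puck:
        reward += 0.2
    # 6. Penalties
    if ev.collision:
        reward -= 0.02
    if ev.agent_idle_too_long:
        reward -= 0.02
    if ev.teammates_too_close:
        reward -= 0.02
    return reward


# Example: stealing the puck while colliding with another player
print(shaped_reward(StepEvents(has_puck=True, steal_puck=True, collision=True)))
```

One point to keep in mind when choosing the constants: over an episode of roughly 3,300 steps (the random agent's mean length in Section 2.2), the possession term alone can accumulate to about ±33, far more than a single goal is worth, so the per-step terms may need to be scaled down.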

[Explain the underlying theory]

### 3.2 Implementation

In [12]:
# Insert code here

### 3.3 Testing Custom Environment

In [13]:
# Check that it works

***
## 4. Baseline Agent

### 4.1 Algorithm Selection
It is the same as everyone else's <br>
**Why DQN?**

We chose Deep Q-Network (DQN) for the following reasons:

1. **Image-based observations**: DQN with CNN policy is standard for Atari
2. **Discrete actions**: DQN is designed for discrete action spaces
3. **Proven track record**: DQN has strong performance on Atari games
4. **Stable-Baselines3 support**: Well-implemented and documented

**Alternative Considerations**
- PPO: More stable but potentially slower convergence
- A2C: Faster training but less sample efficient
- Rainbow DQN: Better performance but more complex, and not among the algorithms shipped with Stable-Baselines3

### 4.2 Baseline Configuration

### 4.3 Baseline Training

### 4.4 Baseline Results

***
## 5. Custom Environment Agent

### 5.1 Training with custom rewards

### 5.2 (???) Hyperparameter tuning
We will test several configurations, each focused on a different aspect, e.g. learning rate, replay buffer, exploration
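
One way to organise such configurations is to start from a shared base and change one aspect at a time, so that any difference in results can be traced back to a single setting. The sketch below only builds the configuration dictionaries. The keys `learning_rate`, `buffer_size` and `exploration_fraction` are real keyword arguments of Stable-Baselines3's `DQN`, but every value and every configuration name here is a placeholder, not a tuned or recommended setting.

```python
# Shared starting point (placeholder values)
BASE = {
    "learning_rate": 1e-4,
    "buffer_size": 100_000,
    "exploration_fraction": 0.1,
}

# Each variant overrides exactly one aspect of the base (hypothetical names)
VARIANTS = {
    "lower_learning_rate": {"learning_rate": 5e-5},
    "larger_buffer": {"buffer_size": 200_000},
    "longer_exploration": {"exploration_fraction": 0.3},
}


def build_configs(base, variants):
    """Return {name: kwargs} where each entry is the base plus one override."""
    return {name: {**base, **overrides} for name, overrides in variants.items()}


configs = build_configs(BASE, VARIANTS)
for name, kwargs in configs.items():
    print(name, kwargs)
    # Assumed usage, with `env` being the customised environment:
    # model = DQN("CnnPolicy", env, **kwargs)
```

Keeping the base fixed also gives Section 5.3 a natural structure: one subsection per variant, each compared against the same reference run.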

### 5.3 Results per configuration

### 5.4 Best model selection

***
## 6. Evaluation and Comparison

### 6.1 Methodology

### 6.2 Quantitative Results

### 6.3 Qualitative Analysis

### 6.4 Learning curve analysis

### 6.5 Visualizations

***
## 7. Discussion

***
## 8. Conclusion

***
## 9. References

[1] Song, Y. et al. (2023). The Role of a Reward in Shaping Multiple Football Agents’ Behavior. <br>
[2] Zhang, H. et al. (2023). Embedding Contextual Information through Reward Shaping in Multi-Agent Learning. <br>
[3] Martins, F. B. et al. (2021). rSoccer: A Framework for Studying Reinforcement Learning. <br>
[4] Song, Y. et al. (2024). Boosting Studies of Multi-Agent RL on Google Research Football.