**Toan Lam**  
**GitHub Repo:** https://github.com/tobeyesong/ppo-breakout/  

# Reinforcement Learning for Atari Breakout using PPO

## Project Overview

This project implements and evaluates a Proximal Policy Optimization (PPO) agent that plays the Atari game Breakout. The goal was to train an agent that learns the game's mechanics and develops a strategy to maximize its score. The implementation uses PyTorch, Stable-Baselines3, and Gymnasium, and collects experience from several parallel environments to speed up training.

## Environment Description

### Breakout Game Rules

Breakout is a classic Atari game where the player controls a paddle at the bottom of the screen to bounce a ball upward. The objective is to break all the bricks arranged in rows at the top of the screen by hitting them with the ball.

Key game elements:
- **Paddle**: Player-controlled horizontal bar at the bottom that can move left and right
- **Ball**: Bounces off walls, bricks, and the paddle
- **Bricks**: Arranged in rows at the top; disappear when hit by the ball
- **Scoring**: Player earns points for each brick destroyed
- **Lives**: Player loses a life when the ball falls below the paddle

### Technical Implementation

For this project, I used the `BreakoutNoFrameskip-v4` environment from Gymnasium's Atari suite. This environment:
- Provides raw pixel observations (210×160×3 RGB images)
- Exposes a discrete action space with 4 actions: NOOP, FIRE (launches the ball), RIGHT, LEFT
- Terminates an episode once all lives are lost
- Awards more points for bricks in higher rows

I applied standard preprocessing techniques, including:
- Grayscale conversion and downscaling to 84×84 pixels
- Frame stacking (4 consecutive frames) to capture motion
- Frame skipping to improve training efficiency
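
Frame stacking can be illustrated with a small buffer. The sketch below is a minimal stand-alone version for illustration only; the `FrameStacker` class is a hypothetical name, and the project itself relies on Gymnasium's wrapper rather than this code.

```python
from collections import deque


class FrameStacker:
    """Hypothetical helper: keep the last `n_frames` observations together."""

    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def reset(self, first_frame):
        # Fill the buffer with copies of the first frame so the stack
        # is already full at the first step of an episode.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque silently drops the oldest frame.
        self.frames.append(new_frame)
        return list(self.frames)


stacker = FrameStacker(n_frames=4)
initial_stack = stacker.reset("frame0")
next_stack = stacker.step("frame1")
```

Because the stack always holds the most recent frames in order, the policy can infer the ball's direction and speed from differences between them, which a single frame cannot convey.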

## Model Architecture & Training Approach

### Why PPO?

I chose Proximal Policy Optimization (PPO) for this task because:

1. **Sample Efficiency**: PPO reuses each batch of collected experience for several epochs of minibatch updates, so it needs fewer environment interactions than simpler on-policy methods such as vanilla policy gradient
2. **Stability**: The clipped ("proximal") objective discourages large policy updates that could destabilize training
3. **Actor-Critic Structure**: The actor handles Breakout's discrete action space directly, while the critic supplies value estimates that reduce the variance of policy updates
4. **Parallelization**: Easy to implement with multiple parallel environments for faster training
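
The stability point can be made concrete with the clipped surrogate objective from the PPO paper. The function below is a minimal per-sample sketch in plain Python, not the Stable-Baselines3 implementation; the name `clipped_surrogate` is illustrative, and the default `clip_range=0.1` mirrors the value used in this project's configuration.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.1):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    `ratio` is pi_new(a|s) / pi_old(a|s); `advantage` is the estimated advantage.
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (A > 0) whose probability already rose by 50%:
# the objective is capped, so there is no incentive to push further.
capped = clipped_surrogate(ratio=1.5, advantage=1.0)

# A bad action (A < 0) whose probability already fell by 50%:
# the clipped term is the smaller one, so the gradient vanishes here too.
floored = clipped_surrogate(ratio=0.5, advantage=-1.0)
```

Once the probability ratio leaves the interval [0.9, 1.1] in the direction that would improve the objective, the clipped term takes over and the sample stops contributing gradient, which is what keeps each update close to the previous policy.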

### Model Configuration

The PPO implementation uses Stable-Baselines3 with the following key hyperparameters:

```python
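from stable_baselines3 import PPO

# `train_env` (the vectorized Atari environment) and `device`
# are created earlier in the training script.
model = PPO(
    "CnnPolicy",  # CNN-based policy for image processing
    train_env,
    learning_rate=2.5e-4,  # Lower = stable but slower learning
    n_steps=128,  # Steps per env before update
    batch_size=256,  # Minibatch size for gradient updates
    n_epochs=4,  # Optimization passes over each collected rollout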
    gamma=0.99,  # Discount factor for future rewards
    gae_lambda=0.95,  # Balances bias/variance in advantage estimation
    clip_range=0.1,  # Limits policy update size for stability
    ent_coef=0.01,  # Encourages exploration
    device=device,
)
```
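
To show what `gamma` and `gae_lambda` control, here is a minimal sketch of Generalized Advantage Estimation over a single short trajectory. It is written in plain Python for readability and is not the library's code; the function name and the convention that `dones[t]` marks an episode ending after step `t` are assumptions made for this example.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, gae_lambda=0.95):
    """Return GAE advantages for one trajectory (illustrative sketch).

    rewards[t], values[t]: reward received and value estimate at step t.
    last_value: value estimate of the state after the final step.
    dones[t]: True if the episode ended after step t (assumed convention).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error: how much better the outcome was than predicted.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors.
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


# A reward that arrives one step late is still credited to the earlier step,
# discounted by gamma * gae_lambda = 0.9405.
advantages = compute_gae(
    rewards=[0.0, 1.0], values=[0.0, 0.0], last_value=0.0, dones=[False, False]
)
```

Lowering `gae_lambda` makes the estimate lean more on the critic (lower variance, more bias), while values near 1 lean more on observed returns.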

The CNN policy network processes frame-stacked images through convolutional layers followed by fully connected layers that output both action probabilities (policy) and state value estimates (critic).
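
As a rough illustration of how the convolutional stack shrinks an 84×84 input, the sketch below works through the output-size arithmetic. The layer specification (8×8 stride 4, 4×4 stride 2, 3×3 stride 1, with 32, 64, and 64 filters) is an assumption based on the Nature DQN architecture, which is not stated in this report; treat the numbers as an example rather than a description of the trained model.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded convolution along one spatial dimension."""
    return (size - kernel) // stride + 1


def flattened_features(input_size=84, layers=((32, 8, 4), (64, 4, 2), (64, 3, 1))):
    """Number of features after the conv stack, for square inputs.

    Each layer is (out_channels, kernel, stride); this spec is an assumed
    Nature-DQN-style stack, not read from the project code.
    """
    size = input_size
    channels = None
    for channels, kernel, stride in layers:
        size = conv_out(size, kernel, stride)
    return size * size * channels


n_features = flattened_features()  # 84 -> 20 -> 9 -> 7, then 7 * 7 * 64
```

Under these assumptions the spatial size goes 84 → 20 → 9 → 7, leaving 3,136 features to feed the fully connected layers shared by the policy and value heads.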

## Experimental Setup

### Training Process

The agent was trained using 8 parallel environments to accelerate the data collection process. Training ran for a total of 10 million timesteps with the following setup:

- **Hardware**: CUDA-enabled GPU for neural network training
- **Parallel Environments**: 8 copies of Breakout running simultaneously
- **Training Duration**: 10M timesteps (approximately 9 hours of training)
- **Checkpointing**: Models saved periodically for evaluation and resumption
- **Monitoring**: TensorBoard logs for tracking metrics
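
The setup above, combined with the hyperparameters from the model configuration, fixes how much data each policy update sees. The following back-of-the-envelope sketch works through that arithmetic; the helper `training_budget` is a hypothetical name, and the throughput figure simply divides the reported timesteps by the approximate 9-hour duration.

```python
import math


def training_budget(total_timesteps, n_envs, n_steps, batch_size, n_epochs):
    """Derive rollout and update counts from the PPO hyperparameters."""
    rollout_size = n_envs * n_steps  # transitions collected per update
    minibatches = rollout_size // batch_size
    return {
        "rollout_size": rollout_size,
        "minibatches_per_epoch": minibatches,
        "gradient_steps_per_update": minibatches * n_epochs,
        "policy_updates": math.ceil(total_timesteps / rollout_size),
    }


budget = training_budget(
    total_timesteps=10_000_000, n_envs=8, n_steps=128, batch_size=256, n_epochs=4
)
# Approximate throughput, assuming the full 9 hours were spent training.
steps_per_second = 10_000_000 / (9 * 3600)
```

With these values each update uses 1,024 transitions split into 4 minibatches, giving 16 gradient steps per update and roughly 9,766 updates over the run, at about 309 environment steps per second.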

### Challenges & Solutions

#### Challenge 1: Training Instability

Initially, the agent showed unstable learning patterns with high variance in performance between evaluation runs. 

**Solution**: 
- Reduced the learning rate from 5e-4 to 2.5e-4 to make updates more conservative
- Increased batch size from 128 to 256 to improve gradient estimates
- Added entropy coefficient (0.01) to encourage exploration
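
The entropy term rewards policies that keep some probability on every action. The sketch below computes the entropy of a categorical policy over Breakout's 4 actions; it is an illustration in plain Python with made-up probabilities, not output from the trained agent.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# A policy that has not learned anything yet: all 4 actions equally likely.
uniform_entropy = categorical_entropy([0.25, 0.25, 0.25, 0.25])

# A nearly deterministic policy (illustrative probabilities).
peaked_entropy = categorical_entropy([0.97, 0.01, 0.01, 0.01])

# With ent_coef = 0.01, the bonus added to the objective is small but
# consistently favors the more exploratory policy.
ent_coef = 0.01
bonus_gap = ent_coef * (uniform_entropy - peaked_entropy)
```

The uniform policy reaches the maximum possible entropy for 4 actions, ln 4 ≈ 1.386, so the bonus shrinks as the policy commits to specific actions.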

#### Challenge 2: Slow Learning Progress

The agent struggled to consistently hit bricks in the early stages of training.

**Solution**:
- Increased parallel environments from 4 to 8 for more diverse experience collection
- Adjusted gamma to 0.99 to better account for delayed rewards
- Implemented proper frame stacking to help the agent understand ball dynamics

#### Challenge 3: Environment Compatibility

Recent Gymnasium API changes broke the frame-stacking code.

**Solution**:
- Updated from `FrameStack` to `FrameStackObservation` to match the current API
- Made evaluation scripts compatible with both training and visualization needs
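
One way to keep scripts working across both wrapper names is a small compatibility helper. The sketch below is an assumption about how such a shim could look, not code from the repository; `make_frame_stack` is a hypothetical name, and the stand-in namespaces exist only so the example runs without Gymnasium installed.

```python
from types import SimpleNamespace


def make_frame_stack(wrappers, env, n_frames=4):
    """Wrap `env` with whichever frame-stacking wrapper `wrappers` provides."""
    if hasattr(wrappers, "FrameStackObservation"):
        # Newer Gymnasium releases: the keyword is `stack_size`.
        return wrappers.FrameStackObservation(env, stack_size=n_frames)
    # Older releases: the wrapper is `FrameStack` and the keyword is `num_stack`.
    return wrappers.FrameStack(env, num_stack=n_frames)


# Stand-ins for the two API generations, used here instead of real modules.
new_api = SimpleNamespace(
    FrameStackObservation=lambda env, stack_size: ("new", env, stack_size)
)
old_api = SimpleNamespace(FrameStack=lambda env, num_stack: ("old", env, num_stack))

wrapped_new = make_frame_stack(new_api, "env")
wrapped_old = make_frame_stack(old_api, "env")
```

In a real script the first argument would be the `gymnasium.wrappers` module, so the same call works regardless of which Gymnasium version is installed.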

## Results & Analysis

### Training Progress

The charts below show the agent's learning progress over training:

![Episode Mean Rewards](rollout-ep-rew-mean-1.png)
*Figure 1: Episode mean rewards over training time showing steady improvement*

![Entropy Loss](train-entropy-loss.png)
*Figure 2: Entropy loss demonstrating exploration-exploitation balance*

![Policy Gradient Loss](train-policy-gradient-loss.png)
*Figure 3: Policy gradient loss showing training optimization*

### Performance Metrics

Based on the TensorBoard data from the training session, the agent showed clear learning progress:

- **Final Training Reward**: 352.28 points
- **Maximum Training Reward**: 360.08 points
- **Initial Rewards (first few episodes)**: 1.29 to 3.65 points
- **Final Rewards (last few episodes)**: 343.45 to 352.28 points
- **Evaluation Reward**: 387.60 points at 80,000 steps

For a comprehensive evaluation, the agent was tested over 100 episodes using the headless evaluation script, achieving these results:

- **Average Score**: 341.99 ± 100.02
- **Median Score**: 387.00
- **Min/Max Score**: 25.00/437.00
- **Average Episode Length**: 2812.71 steps

The average score of 341.99 confirms the agent's strong performance, and the maximum of 437.00 shows what it can achieve in a favorable episode. The median (387.00) sits well above the mean, which indicates that most episodes finish near the top of the range while a minority of poor episodes, as low as 25.00, pull the average down and account for the large standard deviation (±100.02). Occasional poor episodes like these are common in reinforcement learning because of the randomness in game dynamics and action sampling.
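
The summary statistics above can be reproduced from a list of per-episode scores with the standard library. The sketch below uses a handful of made-up scores purely for illustration (they are not the 100 evaluation episodes), and it assumes the reported ± value is a population standard deviation.

```python
import statistics


def summarize(scores):
    """Summary statistics for a list of per-episode scores."""
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.pstdev(scores),  # population std (assumed convention)
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical scores: mostly high, with one poor episode.
example_scores = [25, 300, 387, 400, 437]
summary = summarize(example_scores)
```

Even in this toy sample, a single poor episode drags the mean well below the median, the same pattern seen in the evaluation results.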

## Learning Progression

The training data shows a clear progression in the agent's abilities:

1. **Initial Phase (0-50k steps)**: The agent started with rewards around 1-3 points, mostly hitting random bricks through trial and error
2. **Developing Phase (50k-300k steps)**: Steady improvement as the agent learned to track and return the ball consistently
3. **Advanced Phase (300k+ steps)**: Rewards stabilizing above 340 points, indicating the agent developed strategic gameplay

This progression can be seen in the reward chart, while the policy gradient and entropy losses show how the learning algorithm gradually converged toward a stable policy.

### Visual Demonstration

A video showing the agent's performance can be viewed here:
[PPO Breakout Agent Gameplay](https://youtu.be/v8D32hMqH4U)

The visualization shows how the agent learned to position the paddle strategically to keep the ball in play and target remaining bricks.

## Discussion & Future Work

### What Worked Well

- The PPO algorithm proved effective at learning Breakout, developing strategies beyond simple ball tracking
- Parallelization significantly improved training efficiency
- The CNN architecture successfully processed visual inputs to understand game state

### Limitations & Challenges

- The agent occasionally struggles with edge cases (very fast ball movement)
- Performance plateaued after ~7M steps, suggesting possible diminishing returns
- The fixed hyperparameter set may not be optimal for all phases of learning

### Future Improvements

1. **Curriculum Learning**: Implement progressive difficulty increase during training
2. **Hyperparameter Scheduling**: Dynamically adjust learning rate and exploration parameters
3. **Alternative Architectures**: Compare performance with Rainbow DQN or A2C implementations
4. **Human Demonstrations**: Incorporate imitation learning from expert human demonstrations
5. **Transfer Learning**: Test if skills transfer to similar games like Pong or other Atari titles

This project demonstrates that PPO can effectively learn complex visual-based control tasks with delayed rewards. The techniques used here could be extended to more challenging environments or real-world robotics applications.

## References

1. Project Repository: [github.com/tobeyesong/ppo-breakout](https://github.com/tobeyesong/ppo-breakout)
2. Gameplay Demo: [YouTube - PPO Breakout Agent](https://youtu.be/v8D32hMqH4U)
3. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. [arXiv:1707.06347](https://arxiv.org/abs/1707.06347)
4. Stable-Baselines3 Documentation: [stable-baselines3.readthedocs.io](https://stable-baselines3.readthedocs.io/)
5. Gymnasium Atari Environments: [gymnasium.farama.org/environments/atari/](https://gymnasium.farama.org/environments/atari/)
6. Visualization Files (PNG):
   - `rollout-ep-rew-mean-1.png`
   - `train-entropy-loss.png`
   - `train-policy-gradient-loss.png`