# Training a PPO Agent on Breakout

Author: Christa Sparks

Github: sparkyyc

Class: CSPB 3202

## Introduction

### The Game Overview - Breakout

Breakout is a classic arcade game where the player controls a paddle to bounce a ball, aiming to destroy a wall of bricks at the top of the screen. The main objective is to clear all the bricks by striking them with the ball, which bounces off surfaces at varying angles based on its impact point. Players earn points for each brick destroyed.

The game ends when the player loses all lives, which occurs if the ball falls below the paddle and out of the play area. Success is defined by the player’s ability to clear all bricks and achieve a high score. This makes Breakout an ideal environment for testing reinforcement learning agents, as it requires precise control, strategic planning, and the ability to learn from interaction with the game environment.

### Project Goal

The goal of this project was to train an agent to play the Atari game Breakout, comparing the performance of the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms from the Stable Baselines3 library. Breakout is a challenging environment due to its dynamic gameplay and the need for precise control, making it a useful testbed for comparing reinforcement learning approaches.

Initially, the project focused on using the DQN algorithm, which has been a benchmark in deep reinforcement learning due to its ability to handle discrete action spaces. However, after evaluating the results and considering the need for more stable and efficient training, the project shifted to using the PPO algorithm. PPO is well-suited for tasks like Breakout because of its balance between complexity and performance, making it a good choice for training agents across various environments. The transition from DQN to PPO was driven by the desire to optimize the agent's performance and leverage PPO's robustness in handling complex environments.

## Methodology

### Final Model Architecture

The final model was built using the PPO implementation from the Stable Baselines3 library. Key components of the architecture included:

- Policy Network: A Convolutional Neural Network (CNN) was used to process the image-based input from the game environment. This allowed the agent to understand the spatial structure of the game, identifying the positions of the paddle, ball, and bricks.

- Learning Rate Scheduler: The learning rate was initially set to a moderate value, allowing for more significant policy updates during the early stages of training. As training progressed, the learning rate decreased, enabling finer adjustments to the policy as the agent approached optimal performance.

- Environment Setup: The Breakout environment was created from Gymnasium's `ALE/Breakout-v5` environment. It was wrapped with `Monitor` to record episode statistics and with `VecTransposeImage` to put the image observations in the channel-first layout expected by the CNN.
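
The components listed above can be sketched roughly as follows. This is a minimal sketch, not the project's actual training script: the `linear_schedule` helper, the initial learning rate of `2.5e-4`, and the `build_model` function are illustrative assumptions, and the shape of the schedule actually used is not stated here. Stable Baselines3 accepts a callable that maps the remaining training progress (1.0 at the start, 0.0 at the end) to a learning rate.

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Return a schedule that decays linearly from initial_value to 0."""

    def schedule(progress_remaining: float) -> float:
        # progress_remaining goes from 1.0 (start) to 0.0 (end of training).
        return progress_remaining * initial_value

    return schedule


def build_model(env_id: str = "ALE/Breakout-v5", initial_lr: float = 2.5e-4):
    """Create the wrapped environment and a PPO model with a CNN policy."""
    # Imported lazily so the schedule helper works without the RL stack.
    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.monitor import Monitor
    from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

    # Assumption: depending on the Gymnasium version, `import ale_py` and
    # `gym.register_envs(ale_py)` may be needed before Atari ids resolve.
    # Monitor records per-episode reward and length.
    env = DummyVecEnv([lambda: Monitor(gym.make(env_id))])
    # Convert HxWxC images to the CxHxW layout expected by the CNN.
    env = VecTransposeImage(env)

    model = PPO(
        "CnnPolicy",
        env,
        learning_rate=linear_schedule(initial_lr),
        verbose=1,
    )
    return model, env
```

Training would then be started with something like `model, env = build_model()` followed by `model.learn(total_timesteps=1_000_000)`, where the step count is a placeholder rather than the value used in this project.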



### Algorithm Selection

During the initial stages of this project, the Deep Q-Network (DQN) algorithm was employed to train an agent to play Breakout. DQN, known for its effectiveness in environments with discrete action spaces, has been a popular choice for Atari games and other similar tasks. However, several challenges and observations during the training process led to the decision to switch from DQN to Proximal Policy Optimization (PPO). This section discusses the reasoning behind this shift, the comparative advantages of PPO, and the outcomes observed after making this change.

#### Challenges with DQN

While DQN is a powerful algorithm, it presented several challenges when applied to the Breakout environment:

- Unstable Learning: During training with DQN, the agent exhibited significant instability in learning. The reward curve fluctuated greatly, with the agent often failing to maintain consistent performance.

- Memory and Computational Requirements: DQN requires a replay buffer to store past experiences, which can be memory-intensive, especially in environments with high-dimensional inputs like images. Additionally, the computational overhead of sampling from the replay buffer and updating the Q-network led to slower training times.

- Hyperparameter Sensitivity: The performance of DQN was highly sensitive to hyperparameter choices such as the learning rate, target network update frequency, and replay buffer size. Tuning these parameters to achieve stable learning was challenging and time-consuming.
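
To give a rough sense of the memory cost mentioned above, the sketch below estimates the size of a replay buffer that stores raw Breakout frames. The 210x160x3 `uint8` frame shape is the default observation of `ALE/Breakout-v5`. The buffer sizes are illustrative assumptions, not the values used in this project, and the estimate ignores actions, rewards, frame preprocessing, and any memory optimizations.

```python
def replay_buffer_bytes(buffer_size, obs_shape=(210, 160, 3),
                        bytes_per_value=1, store_next_obs=True):
    """Estimate the bytes needed to store observations in a replay buffer."""
    values_per_obs = 1
    for dim in obs_shape:
        values_per_obs *= dim
    # A naive buffer keeps both the observation and the next observation.
    copies = 2 if store_next_obs else 1
    return buffer_size * values_per_obs * bytes_per_value * copies


# Hypothetical buffer sizes, for comparison only.
for size in (10_000, 100_000, 1_000_000):
    gib = replay_buffer_bytes(size) / 1024**3
    print(f"{size:>9,} transitions: about {gib:.1f} GiB")
```

An on-policy algorithm only needs to hold one rollout at a time, which is why the buffer size dominates this comparison.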

#### Rationale for Switching to PPO

Given the challenges faced with DQN, PPO was considered as an alternative for training the Breakout agent. PPO offers several advantages that address the issues encountered with DQN:

- Stability and Robustness: PPO is designed to maintain stability during training. This helps prevent large updates that can destabilize the learning process, leading to more consistent performance over time. The stability offered by PPO was a key factor in its selection, especially given the unstable learning observed with DQN.

- Continuous Action Space Compatibility: While Breakout uses a discrete action space, PPO's ability to handle both discrete and continuous action spaces made it a versatile choice. The algorithm's policy gradient approach allowed for more direct optimization of the policy, which can be more efficient than the value-based approach of DQN in certain environments.

- Improved Exploration: PPO inherently balances exploration and exploitation through its policy gradient updates, which are less reliant on ε-greedy strategies. This allowed the agent to explore more effectively, particularly in the early stages of training, leading to the discovery of better strategies.

- Reduced Hyperparameter Sensitivity: PPO is generally less sensitive to its hyperparameters than DQN, making it easier to tune and more likely to converge to a good solution with default or slightly adjusted settings. This reduced the time and effort required for hyperparameter tuning.

- Efficiency in Training: PPO's on-policy nature, where the policy is updated using trajectories generated by the current policy, simplifies the training process. This approach avoids the complexity of maintaining and sampling from a replay buffer, leading to faster iterations and potentially quicker convergence.
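
The stability argument above rests on PPO's clipped surrogate objective, which limits how much a change in the probability ratio between the new and old policy can contribute to an update. The following is a minimal pure-Python sketch of that objective for a batch of samples. The function name and the clip range of 0.2 are illustrative assumptions, and a full implementation such as the one in Stable Baselines3 computes this on tensors and combines it with value and entropy terms.

```python
import math


def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_range=0.2):
    """Mean PPO clipped surrogate objective (a quantity to be maximized)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip_range, 1 + clip_range].
        clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
        # Taking the minimum removes the incentive for overly large steps.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

With a positive advantage, a ratio above `1 + clip_range` earns no extra objective, so the gradient stops pushing the policy further in that direction.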

#### Outcomes and Observations

After switching to PPO, the training process for the Breakout agent showed marked improvements:

- Stabilized Learning Curve: The reward curve became more stable, with the agent showing consistent improvement over time. The fluctuations observed with DQN were significantly reduced, leading to a more reliable learning process.

- Better Exploration: The agent demonstrated better exploration of the game space, discovering more effective strategies and avoiding the local optima in which the DQN agent often got stuck.

- Efficient Use of Computational Resources: Training times were more manageable, and the absence of a large replay buffer reduced memory usage, making the training process more efficient overall.

The switch from DQN to PPO was a strategic decision that resulted in a more effective training process for the Breakout environment. The advantages of PPO, particularly in terms of stability, exploration, and efficiency, made it a better fit for this task, ultimately leading to improved agent performance.


## Training Process

The training process was conducted over a series of episodes, where the agent interacted with the Breakout environment, receiving rewards based on its performance. The PPO algorithm updated the agent’s policy iteratively, with each episode contributing to the refinement of the agent's strategy.

### Training Metrics

- Rewards: The primary metric used to evaluate the agent’s performance was the reward received per episode. As training progressed, the agent’s ability to break more bricks and keep the ball in play for longer periods was reflected in higher rewards.

- Policy Updates: The agent’s policy was updated at regular intervals, with the PPO algorithm ensuring that updates were constrained to maintain stability.

- Learning Rate: The learning rate was adjusted throughout training, decreasing as the agent’s performance began to plateau, allowing for more precise policy adjustments.


## Iteration and Model Improvement

The development of the Breakout agent was an iterative process involving multiple experiments, evaluations, and adjustments to the reinforcement learning models used. This section documents the key stages of this process, highlighting the challenges encountered, the strategies employed to overcome them, and how these iterations contributed to the overall performance improvement of the agent.

#### Initial Experiments with DQN

The project began with the implementation of the Deep Q-Network (DQN) algorithm, a well-known method in reinforcement learning, particularly effective for environments with discrete action spaces like Breakout. The initial DQN model was trained with standard hyperparameters, and while the agent managed to learn some basic strategies for playing Breakout, it quickly became apparent that the performance plateaued at a relatively low level. The model struggled with the complex dynamics of the game, particularly in terms of generalizing to different scenarios that required precise timing and decision-making.

#### Transition to PPO

Given the limitations observed with DQN, the decision was made to experiment with the Proximal Policy Optimization (PPO) algorithm. PPO is known for its robustness and ability to maintain stable performance across a variety of environments, making it an attractive alternative to DQN. The initial PPO implementation used default settings from the Stable Baselines3 library. Early results showed promise, with the agent demonstrating more consistent learning and better handling of the game's complexity compared to the DQN-based agent.

#### Hyperparameter Tuning

As the project progressed, hyperparameter tuning became a critical part of the iterative process. Key parameters such as the learning rate, the number of steps per update (n_steps), and the batch size were adjusted to optimize the PPO model's performance. For instance, the learning rate was reduced to prevent the model from converging too quickly to suboptimal policies, and the n_steps parameter was increased to allow the model to collect more experience per update, leading to better policy refinement.

Through these adjustments, the agent's performance steadily improved, as evidenced by the increasing mean rewards over time. However, tuning these parameters was not without challenges. Some configurations led to overfitting, where the agent performed well in training but poorly during evaluation. This required a careful balance, ensuring that the model remained generalizable while still capitalizing on the learning opportunities provided by the environment.
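
The interaction between `n_steps` and the batch size can be made concrete with a little arithmetic. In Stable Baselines3's PPO, each update collects `n_steps * n_envs` transitions, splits them into minibatches of `batch_size`, and passes over them `n_epochs` times. The sketch below uses the library's default values (`n_steps=2048`, `batch_size=64`, `n_epochs=10`) purely as an illustration. The tuned values used in this project are not stated here, and `rollout_stats` is a hypothetical helper.

```python
def rollout_stats(total_timesteps, n_steps=2048, n_envs=1,
                  batch_size=64, n_epochs=10):
    """Summarize how much data and how many gradient steps each update uses."""
    rollout_size = n_steps * n_envs
    # This sketch is strict about divisibility to keep the arithmetic exact.
    if rollout_size % batch_size != 0:
        raise ValueError("batch_size should divide n_steps * n_envs evenly")
    minibatches = rollout_size // batch_size
    return {
        "rollout_size": rollout_size,
        "minibatches_per_epoch": minibatches,
        "gradient_steps_per_update": minibatches * n_epochs,
        # Rollouts are collected until total_timesteps is reached (ceiling).
        "policy_updates": -(-total_timesteps // rollout_size),
    }
```

Increasing `n_steps` gives each update more experience to average over, at the cost of fewer policy updates for the same number of environment steps.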

#### Addressing Overfitting and Early Stopping

During the training process, overfitting emerged as a significant issue. To address this, techniques such as early stopping were considered, where training would halt if the model's performance failed to improve over a set number of evaluations. However, initial implementations of early stopping were too aggressive, leading to premature termination of training before the model had fully optimized. This was addressed by refining the early stopping criteria and extending the warm-up period before monitoring performance for potential stopping. Additionally, to further mitigate overfitting, regular evaluations were conducted to ensure the agent's performance was consistent across different episodes and not just during training.
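
The early-stopping logic described above can be sketched as a small patience counter with a warm-up period. This is an illustrative, framework-independent sketch: the class name and the `patience` and `warmup_evals` values are assumptions rather than the settings used in the project. Stable Baselines3 provides a comparable callback, `StopTrainingOnNoModelImprovement`, which can be attached to an `EvalCallback`.

```python
class EarlyStopper:
    """Stop when the evaluation reward has not improved for `patience` evals."""

    def __init__(self, patience=5, warmup_evals=10):
        self.patience = patience
        self.warmup_evals = warmup_evals
        self.best_reward = float("-inf")
        self.evals_seen = 0
        self.evals_without_improvement = 0

    def should_stop(self, mean_reward):
        self.evals_seen += 1
        if mean_reward > self.best_reward:
            # Any improvement resets the patience counter.
            self.best_reward = mean_reward
            self.evals_without_improvement = 0
        elif self.evals_seen > self.warmup_evals:
            # Stalled evaluations only count once the warm-up is over.
            self.evals_without_improvement += 1
        return self.evals_without_improvement >= self.patience
```

A longer warm-up or a larger patience makes the criterion less aggressive, which is the kind of adjustment described above for avoiding premature termination.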

#### Final Model and Performance

The final PPO model incorporated all the insights gained from previous iterations. With refined hyperparameters and a more robust training strategy, the agent achieved significantly higher rewards compared to the initial DQN model. The final model demonstrated improved gameplay strategies, such as better ball control and efficient breaking of blocks, leading to a more competitive performance in Breakout.

This iterative process underscored the importance of flexibility and continuous testing in reinforcement learning. Each experiment provided valuable feedback, guiding the subsequent adjustments that ultimately resulted in a more capable and efficient Breakout agent.

## Evaluation and Results
### Training Rewards Over Time

#### DQN model results:

![breakout dqn rewards](breakout_charts/breakout_training_rewards.png)

The DQN model exhibited significant fluctuations in rewards across episodes. The rewards ranged broadly from low to moderate, with some higher peaks around 12 to 14. The graph shows instability and inconsistency in the learning process, which is characteristic of DQN models when applied to environments with complex dynamics like Breakout.

#### Previous PPO model results:

![breakout previous rewards](breakout_charts/ppo_breakout_training_rewards.png)

This PPO model also showed fluctuations in rewards, but it generally reached rewards around 10, with some episodes dipping lower. The model showed some stability but was still oscillating between episodes, indicating that it might not have fully converged.

#### Final model:

![breakout rewards](breakout_charts/ppo_breakout_training_rewards_extended.png)

The final PPO model exhibited a more refined learning process with rewards more consistently reaching higher values (around 10 to 14) compared to the other models. Although there were still fluctuations, the overall trend showed a more stable and gradual improvement over time.
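
Because per-episode rewards fluctuate, the trend in charts like these is easier to read after smoothing. The sketch below computes a trailing moving average over a list of episode rewards. The window size and the sample values are made up for illustration and are not taken from the project's training logs.

```python
def moving_average(rewards, window=3):
    """Trailing moving average; early entries average over fewer episodes."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_sum = 0.0
    for i, reward in enumerate(rewards):
        running_sum += reward
        if i >= window:
            # Drop the reward that just left the window.
            running_sum -= rewards[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Made-up episode rewards, for illustration only.
sample_rewards = [2, 4, 6, 8, 10]
print(moving_average(sample_rewards))
```

Plotting the smoothed series alongside the raw rewards makes it easier to compare the overall trend of the three models.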



### Visual Evaluation

![breakout gif](breakout_videos/breakout_agent_extended.gif)

(This GIF and the video (below) show the old DQN agent. I could not get GIFs and videos to work with the PPO agent; this is discussed in the Challenges with PPO Model section.)

To complement the numerical evaluation, a GIF was recorded of the agent playing the game. This recording visually demonstrates the agent’s strategy, showing how it manages to break bricks.


### Final Performance

While the agent did not reach the level of a human expert, it showed significant learning and improvement over the course of training. The use of PPO allowed for a more stable and robust training process compared to the earlier attempts with DQN, resulting in an agent that could competently play Breakout.

## Challenges with PPO Model

The training rewards chart generated during the PPO model's training shows clear evidence that the agent was making progress. Over time, the rewards obtained by the agent generally increased, with episodes reaching as high as 14 reward points. This trend suggests that the PPO model was indeed learning to play the game more effectively, navigating the complexities of Breakout's dynamic environment.

Despite the promising results observed during training, technical difficulties arose when attempting to load the PPO model for observation or recording purposes. While the DQN model could be loaded without issues, the PPO model encountered errors, particularly related to the environment setup and model predictions. These issues prevented the successful playback and visualization of the agent's gameplay.

The inability to visualize or record the agent's performance post-training poses a significant limitation in fully validating and demonstrating the model's capabilities. Without being able to observe the agent's actions, it's challenging to qualitatively assess its strategies and behaviors during gameplay. Additionally, the recording and visualization processes are essential for documenting the agent's performance in a more tangible and interpretable way.

### Reflection and Next Steps

This experience underscores the importance of not only training the agent but also ensuring that the entire pipeline—from training to evaluation and recording—functions seamlessly. The issues encountered suggest that further investigation is needed into the environment setup and model compatibility, especially when switching between different reinforcement learning algorithms like PPO and DQN.

Future work should focus on resolving these technical challenges to enable proper evaluation and documentation of the agent's performance. This could involve revisiting the environment configuration, ensuring that the model's training and evaluation environments are consistent, and possibly exploring alternative methods for saving and loading models to avoid compatibility issues.

Additionally, it would be beneficial to conduct more rigorous testing of the environment and model interaction before and after training to ensure that any changes or updates to the environment or model architecture do not inadvertently introduce errors.
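
One way to approach the checks suggested above is to rebuild the evaluation environment with the same wrappers used in training and compare observation shapes before running the agent. The sketch below is an assumption about how that could look, not a confirmed fix: the cause of the loading errors is not identified in this report, and `build_eval_env`, `shapes_compatible`, `run_episode`, and the step limit are hypothetical names and values.

```python
def shapes_compatible(model_obs_shape, env_obs_shape):
    """True if shapes match directly or after an HxWxC to CxHxW transpose."""
    model_shape = tuple(model_obs_shape)
    env_shape = tuple(env_obs_shape)
    if model_shape == env_shape:
        return True
    if len(env_shape) == 3:
        height, width, channels = env_shape
        return model_shape == (channels, height, width)
    return False


def build_eval_env(env_id="ALE/Breakout-v5"):
    """Recreate the training wrappers so evaluation sees the same inputs."""
    import gymnasium as gym
    from stable_baselines3.common.monitor import Monitor
    from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

    # render_mode="rgb_array" allows frames to be captured for a GIF later.
    env = DummyVecEnv([lambda: Monitor(gym.make(env_id, render_mode="rgb_array"))])
    return VecTransposeImage(env)


def run_episode(model_path, max_steps=10_000):
    """Load a saved PPO model and play one episode, returning its reward."""
    from stable_baselines3 import PPO

    env = build_eval_env()
    model = PPO.load(model_path)
    if not shapes_compatible(model.observation_space.shape,
                             env.observation_space.shape):
        raise ValueError("Training and evaluation observation shapes differ")

    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action, _ = model.predict(obs, deterministic=True)
        # Vectorized environments return batched rewards and done flags.
        obs, rewards, dones, _ = env.step(action)
        total_reward += float(rewards[0])
        if dones[0]:
            break
    env.close()
    return total_reward
```

Running a check like `shapes_compatible` right after training and again after loading would catch mismatched wrappers before any recording is attempted.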

In conclusion, while the PPO model showed significant promise during training, the inability to fully observe and record the agent's performance highlights the complexities involved in reinforcement learning projects. Addressing these challenges will be crucial for future iterations and successful deployment of trained models in real-world or simulated environments.

## Conclusion

Switching from DQN to PPO was a critical decision in this project. PPO provided the stability and efficiency needed to train the Breakout agent effectively. The final model, using PPO with a carefully managed learning rate and robust policy updates, demonstrated the agent’s ability to learn and improve over time. Because the PPO agent could not be loaded for playback, the recorded GIF shows only the earlier DQN agent, and the evidence for the PPO agent’s progress comes from the training reward charts.

The results of this project highlight the importance of choosing the right RL algorithm for the task at hand. PPO’s advantages in stability and exploration made it the ideal choice for training a Breakout agent, resulting in a successful implementation that can be further refined and extended in future work.

### Learnings/References/Tutorials:

https://github.com/DLR-RM/stable-baselines3

https://gymnasium.farama.org/environments/atari/breakout/

https://www.nature.com/articles/nature14236

https://arxiv.org/abs/1509.06461

https://arxiv.org/abs/1707.06347

https://arxiv.org/abs/1709.06560

https://arxiv.org/abs/2007.06700

https://jerrickliu.com/2020-07-13-FourthPost/

https://huggingface.co/blog/deep-rl-ppo

https://www.sciencedirect.com/topics/computer-science/learning-rate
