## Proximal Policy Optimization (PPO)
PPO is a policy-based reinforcement learning algorithm that we use to learn a control strategy for environments with continuous or complex dynamics, such as our custom Spinning Acrobot. In reinforcement learning (RL), an agent interacts with an environment and learns to make decisions that maximize its cumulative reward over time. The agent’s policy is the strategy it follows to choose an action given the current state of the environment. PPO improves upon standard policy gradient methods by using a clipped objective function to prevent large, destabilizing updates during training. The clipping keeps each update close to the policy that collected the data, which keeps training stable.

PPO uses a value function to estimate the expected future reward obtainable from a given state, which helps guide the agent’s decisions. It also employs an advantage function, which quantifies how much better (or worse) a particular action is compared to the expected value of the state. The advantage tells PPO which actions to make more likely (exploitation), while the policy remains stochastic and keeps sampling new actions (exploration), so the agent balances the two. The clipping mechanism ensures that the policy is not updated too drastically, which improves learning efficiency and makes convergence more stable. As a result, PPO can steadily improve its policy to maximize the cumulative reward while maintaining stability, making it suitable for complex, continuous-state environments like the Spinning Acrobot. (Author: Tim)
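For reference, the quantities described above are usually written as follows. The notation ($\pi_\theta$ for the policy, $\hat{A}_t$ for the advantage estimate, $\epsilon$ for the clip range) follows the standard PPO formulation and is not taken from our code.

$$A(s, a) = Q(s, a) - V(s)$$

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$$

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

Here $r_t(\theta)$ is the probability ratio between the updated and the old policy, and the $\min$ makes the objective a pessimistic bound: once the ratio leaves $[1-\epsilon,\ 1+\epsilon]$ in the direction the advantage favors, the sample stops contributing to the gradient.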

PPO has several advantages over other reinforcement learning algorithms, which has led to its wide use today, including at companies such as OpenAI. The first, as mentioned above, is that it is a policy-based method rather than a value-based method: as it learns, it directly optimizes the policy by updating the policy's parameters, rather than deriving a policy from a value function. The next advantage is clipping, which greatly increases the stability of learning. Without clipping, a single update can easily overcorrect and derail training. If the policy changes too much at once, the agent can become stuck with a policy that fails to achieve any reward and may not improve in any reasonable amount of time. The clipping is implemented using a probability ratio: the likelihood of the agent taking an action in a given state under the updated policy, divided by the likelihood of the same action under the old policy. If the ratio moves too far above or below 1, meaning the new policy is much more or less likely to take that action, the objective is clipped so that the sample provides no further incentive to push the policy in that direction. PPO is also faster than older RL algorithms that wait for complete episode returns, because it uses bootstrapping in advantage estimation. Instead of running an entire episode before learning, it collects partial episodes and uses the value function to estimate the remaining rewards. The agent can therefore learn before the full episode has finished, and the resulting estimates have lower variance, at the cost of some bias. Finally, PPO improves upon its predecessor, Trust Region Policy Optimization (TRPO), by eliminating the need for second-order derivatives. TRPO limits the size of each update by enforcing a hard constraint on the KL divergence (a statistical measure of how much one probability distribution differs from another) between the old and new policies, which is computationally intensive, so PPO is better suited to large-scale problems. Overall, PPO is a strong algorithm that uses several approximations to work efficiently and well on a variety of RL problems. (Author: Hayden)
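The ratio clipping and the bootstrapped advantage estimation described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code used by our training library: the function names are made up for this example, the clip range `clip_eps=0.2` and the discount settings are common defaults rather than values from our runs, and generalized advantage estimation (GAE) is assumed as the bootstrapped estimator. The rollout is assumed to contain no terminal state.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to maximize).

    new_logp / old_logp are log-probabilities of the taken actions under
    the updated and the old policy; clip_eps is an assumed default.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic choice: take the smaller of the two candidate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Advantage estimates for a partial rollout.

    last_value is V(s_T), the value estimate of the state reached after
    the final step; it stands in for the rewards that were never observed.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: bootstraps from the next state's value.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With a positive advantage, a ratio of 1.5 is cut back to 1.2, so the sample stops rewarding further movement in that direction. With a negative advantage and the same ratio, the unclipped (more negative) term is kept, so a harmful move is still penalized in full.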

In [2]:
from IPython.display import Video

# Display the video (path should be relative or absolute)
Video("acrobotv1.mp4", width=600, height=400, embed=True)

## Changes made to code
Our goal was to make the Acrobot spin multiple times in the same direction. To achieve this, we created a custom Gymnasium environment that inherits from the Acrobot environment. We introduced several new instance variables in the `__init__` method: `prev_tip_angle`, `total_spin_angle`, `spin_direction`, `target_spins`, and `spin_complete`. These variables allowed us to set a new goal for the Acrobot and track whether the goal was met. We also updated the `reset` method so that these variables are re-initialized at the start of each episode.
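The bookkeeping described above might look roughly like the sketch below. To keep it runnable without Gymnasium, it uses a standalone `SpinTracker` class (a hypothetical name) instead of the actual environment subclass. The initial values and the default `target_spins=3` are assumptions, since the text does not state them.

```python
class SpinTracker:
    """Stand-in for the fields added to the custom Acrobot environment."""

    def __init__(self, target_spins=3):
        # The goal is configured once; the default of 3 is an assumption.
        self.target_spins = target_spins
        self._reset_spin_state()

    def _reset_spin_state(self):
        # Per-episode state; these initial values are assumed.
        self.prev_tip_angle = None    # no previous step yet
        self.total_spin_angle = 0.0   # accumulated rotation in radians
        self.spin_direction = 0       # +1, -1, or 0 while unknown
        self.spin_complete = False

    def reset(self):
        # In the real environment this would run alongside the parent
        # class's reset so every episode starts with clean counters.
        self._reset_spin_state()
```

Keeping the per-episode fields in one helper means `__init__` and `reset` cannot drift apart when a new field is added.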

The most significant changes were made in the `step` method. We calculated the x and y coordinates of the Acrobot’s tip from the joint angles using trigonometric functions (sine and cosine). The `np.arctan2` function was then used to determine the tip angle in the range $[-\pi, \pi]$. To track the spins, we computed the change in tip angle since the last step. We checked the direction of spin and, if it matched the previous direction, added the change in angle to `total_spin_angle`. We then stored the current tip angle for comparison in the next step. Once the agent completed the target number of spins, it received a large reward and the episode terminated.
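The spin tracking described above can be sketched as follows. Several details are assumptions that the text does not specify: unit link lengths, the first joint angle measured from the downward vertical, wrapping the angle change into $[-\pi, \pi)$ so that crossing the $\pm\pi$ boundary is not counted as a jump, and restarting the count when the direction reverses. The helper names are hypothetical, `math.atan2` is used in place of `np.arctan2` to avoid the NumPy dependency, and the reward logic is left out.

```python
import math
from types import SimpleNamespace

TWO_PI = 2.0 * math.pi


def new_spin_state(target_spins=3):
    # Hypothetical container mirroring the environment's instance variables.
    return SimpleNamespace(prev_tip_angle=None, total_spin_angle=0.0,
                           spin_direction=0, target_spins=target_spins,
                           spin_complete=False)


def tip_angle(theta1, theta2, l1=1.0, l2=1.0):
    # Tip coordinates relative to the pivot (assumed angle convention).
    x = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    y = -l1 * math.cos(theta1) - l2 * math.cos(theta1 + theta2)
    return math.atan2(y, x)  # in [-pi, pi]


def wrap(delta):
    # Map an angle difference into [-pi, pi).
    return (delta + math.pi) % TWO_PI - math.pi


def update_spin(state, theta1, theta2):
    angle = tip_angle(theta1, theta2)
    if state.prev_tip_angle is not None:
        delta = wrap(angle - state.prev_tip_angle)
        direction = (delta > 0) - (delta < 0)  # +1, -1, or 0
        if direction != 0:
            if direction == state.spin_direction:
                state.total_spin_angle += delta
            else:
                # Assumption: a reversal restarts the count.
                state.spin_direction = direction
                state.total_spin_angle = delta
        if abs(state.total_spin_angle) >= state.target_spins * TWO_PI:
            state.spin_complete = True
    state.prev_tip_angle = angle  # stored for the next step
    return state
```

In the environment, `update_spin` would be called once per `step` with the current joint angles, and `spin_complete` would trigger the bonus reward and termination.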

We trained the agent with the PPO implementation from the `stable_baselines3` Python package, running over 400,000 timesteps.
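A minimal sketch of what such a training call might look like is shown below. Only the algorithm (PPO), the library, and the 400,000 figure come from the text above. The `train_ppo` helper, the `"MlpPolicy"` choice, and the `save_path` name are assumptions for illustration. The import sits inside the function so the snippet can be loaded even where the library is not installed.

```python
def train_ppo(env, total_timesteps=400_000, save_path="ppo_spinning_acrobot"):
    """Train a PPO agent on env and save it (hypothetical helper)."""
    # Lazy import: only needed when training actually runs.
    from stable_baselines3 import PPO

    # "MlpPolicy" (a small fully connected network) is an assumed choice.
    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=total_timesteps)
    model.save(save_path)  # hypothetical file name
    return model
```

It would be called as `model = train_ppo(env)`, where `env` is an instance of the custom environment.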

(Author: Tim)

In [5]:
from IPython.display import Video

# Display the video (path should be relative or absolute)
Video("spinningacrobot.mp4", width=600, height=400, embed=True)


In [1]:
from IPython.display import Video

# Display the video (path should be relative or absolute)
Video("Acrobot_Straight.mp4", width=600, height=400, embed=True)