## Proximal Policy Optimization (PPO)
PPO is a policy-gradient reinforcement learning algorithm that we use to learn a good control strategy for environments with continuous or complex dynamics, such as our custom Spinning Acrobot. In reinforcement learning, a policy is the agent's strategy for choosing an action given its current state, and training aims to improve that policy so that it maximizes the cumulative reward. Traditional policy gradient methods can change the policy by a large amount in a single update, which makes training unstable. PPO instead optimizes a modified, clipped objective: the ratio between the new and old action probabilities is clipped to a narrow interval around 1, so an update gains nothing from moving the policy far from the one that collected the data. This keeps each update within a stable range and lets the policy improve without sudden, destabilizing shifts. Because the policy changes gradually, the agent can keep trying new actions while still building on what it has already learned. (Author: Tim)
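
To make the clipping concrete, the standard PPO surrogate for a single sample is $\min\big(r\,\hat{A},\ \mathrm{clip}(r,\,1-\epsilon,\,1+\epsilon)\,\hat{A}\big)$, where $r$ is the probability ratio between the new and old policy and $\hat{A}$ is the advantage estimate. The sketch below evaluates that expression in plain Python. The function name `clipped_surrogate` and the default `clip_range=0.2` are assumptions chosen for illustration (0.2 is a commonly used value); this is not the code used in this project.

```python
import math


def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip_range=0.2):
    """Per-sample PPO clipped surrogate (hypothetical helper for illustration).

    clip_range=0.2 is an assumed default, not a value stated in this notebook.
    """
    # Probability ratio r = pi_new(a|s) / pi_old(a|s), computed from log-probs.
    ratio = math.exp(log_prob_new - log_prob_old)
    # Restrict the ratio to [1 - clip_range, 1 + clip_range].
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# A ratio of 1.5 with a positive advantage is capped at the clipped term,
# so pushing the policy further in that direction earns no extra objective.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
# With a negative advantage the unclipped (worse) term is kept instead.
uncapped = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)
```

In training, the objective is the average of this quantity over a batch, and the optimizer maximizes it.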

In [6]:
from IPython.display import Video

# Embed the video in the notebook (the path is resolved relative to the working directory)
Video("acrobotv1.mp4", width=600, height=400, embed=True)

## Changes made to code
Our goal was to make the Acrobot spin multiple times in the same direction. To achieve this, we created a custom Gymnasium environment that inherits from the Acrobot environment. In the `__init__` method we introduced several new instance variables: `prev_tip_angle`, `total_spin_angle`, `spin_direction`, `target_spins`, and `spin_complete`. These let us define the new goal for the Acrobot and track whether it has been met. We also updated the `reset` method so that these variables are re-initialized at the start of each episode.

The most significant changes were made in the `step` method. We calculated the x and y position of the Acrobot’s tip from the joint angles using sine and cosine, then used `np.arctan2` to obtain the tip angle in the range $[-\pi, \pi]$. To track the spins, we computed the change in tip angle since the previous step and checked its direction. If the direction matched the previous direction, we added the change in angle to `total_spin_angle`. We then stored the current tip angle for comparison in the next step. Once the agent completed the target number of spins, it received a large reward and the episode terminated.
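
The bookkeeping described above can be sketched as a small standalone class. This is an illustration under stated assumptions, not the project's environment code: `SpinTracker` is a hypothetical helper, the link lengths of 1.0 and the angle convention (first joint measured from the downward vertical) are assumed, `target_spins=3` is a placeholder, and `math.atan2` stands in for `np.arctan2` to keep the example dependency-free. The text does not say how the jump at $\pm\pi$ or a change of direction was handled, so the wrap-around correction and the restart of the count on reversal are also assumptions.

```python
import math


class SpinTracker:
    """Hypothetical helper that mirrors the spin-tracking variables in the text."""

    def __init__(self, target_spins=3, link_length_1=1.0, link_length_2=1.0):
        self.target_spins = target_spins    # assumed placeholder value
        self.link_length_1 = link_length_1  # assumed link length
        self.link_length_2 = link_length_2  # assumed link length
        self.reset()

    def reset(self):
        """Re-initialize the tracking variables at the start of an episode."""
        self.prev_tip_angle = None
        self.total_spin_angle = 0.0
        self.spin_direction = 0  # +1, -1, or 0 while still unknown
        self.spin_complete = False

    def tip_angle(self, theta1, theta2):
        """Angle of the tip in [-pi, pi], from the joint angles."""
        # Assumed convention: theta1 is measured from the downward vertical,
        # theta2 is measured relative to the first link.
        x = (self.link_length_1 * math.sin(theta1)
             + self.link_length_2 * math.sin(theta1 + theta2))
        y = (-self.link_length_1 * math.cos(theta1)
             - self.link_length_2 * math.cos(theta1 + theta2))
        return math.atan2(y, x)

    def update(self, theta1, theta2):
        """Accumulate rotation for one step; return True once the target is met."""
        angle = self.tip_angle(theta1, theta2)
        if self.prev_tip_angle is not None:
            # Wrap the difference into [-pi, pi) so crossing +/-pi is not
            # mistaken for a jump of almost a full turn (assumption).
            delta = (angle - self.prev_tip_angle + math.pi) % (2 * math.pi) - math.pi
            direction = (delta > 0) - (delta < 0)
            if direction != 0:
                if direction == self.spin_direction:
                    self.total_spin_angle += abs(delta)
                else:
                    # Assumption: a reversal restarts the count in the new direction.
                    self.spin_direction = direction
                    self.total_spin_angle = abs(delta)
        self.prev_tip_angle = angle
        if self.total_spin_angle >= self.target_spins * 2 * math.pi:
            self.spin_complete = True
        return self.spin_complete
```

Inside an environment, `update` would be called from `step` with the current joint angles, and its return value would decide whether to grant the completion reward and terminate the episode.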

We trained the agent with the PPO implementation from the `stable_baselines3` Python package over 400,000 timesteps.

(Author: Tim)
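
A training call of the kind described above might look like the following sketch. Only the use of PPO from `stable_baselines3` and the 400,000-timestep budget come from the text; the `"MlpPolicy"` policy type, the `verbose` setting, the function name `train_ppo`, and the environment class name in the usage comment are assumptions, and all other hyperparameters are left at the library defaults.

```python
TOTAL_TIMESTEPS = 400_000  # training budget stated in the text


def train_ppo(env, total_timesteps=TOTAL_TIMESTEPS):
    """Train a PPO agent on `env` (hypothetical wrapper for illustration)."""
    # Imported inside the function so the module loads without the dependency.
    from stable_baselines3 import PPO

    # "MlpPolicy" is an assumed choice; the notebook does not name the policy.
    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=total_timesteps)
    return model


# Usage (SpinningAcrobotEnv is a hypothetical name for the custom environment):
# model = train_ppo(SpinningAcrobotEnv())
```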

In [5]:
from IPython.display import Video

# Embed the video in the notebook (the path is resolved relative to the working directory)
Video("spinningacrobot.mp4", width=600, height=400, embed=True)
