# Reinforcement Learning for Cartpole Balancing

Name: *Yunlong Pan*

Email:*yunlong.pan@stonybrook.edu*

- Implemented a reinforcement learning agent to control the Cartpole environment using the OpenAI Gym toolkit. Successfully trained the agent to balance the pole on the cart.

- Employed the Proximal Policy Optimization (PPO) algorithm to train the agent and fine-tuned hyperparameters for optimal performance. Utilized deep neural networks as function approximators for policy and value functions.

- Visualized training progress and analyzed learning curves to monitor agent's training process.


In [1]:
import gymnasium as gym
import numpy as np
import ray
from ray import tune
from ray.rllib.algorithms.ppo.ppo import PPO
from gym.wrappers import RecordVideo

ray.init()
# tune.run("PPO",
#          config={"env": "CartPole-v1",
#                  "evaluation_interval": 2,    # num of training iter between evaluations
#                  "evaluation_num_episodes": 20
#                  },
#          local_dir='/Users/ylpan/Desktop/yl_AMS691_RL_Project/yl_gymnasium',
#          checkpoint_freq= 5,
#          )

agent = PPO(config={"env": "CartPole-v1",
                    "evaluation_interval": 2,
                    "evaluation_num_episodes": 20
                    }
            )
agent.restore("/Users/ylpan/Desktop/yl_AMS691_RL_Project/yl_gymnasium/PPO_2023-10-09_13-15-49/PPO_CartPole-v1_7e4a4_00000_0_2023-10-09_13-15-49/checkpoint_000180")

env = RecordVideo(gym.make('CartPole-v1',render_mode="rgb_array"), 'yl_cartpole_result')
# obs = env.reset()
# print(obs[0])
# print(agent.compute_single_action(obs[0]))
obs = env.reset()
action = agent.compute_single_action(obs[0])
# for _ in range(30):
#     # print(f"Pole angle at step start: {np.degrees(obs[0])}", end=" ")
#
#     # print(agent.compute_single_action(obs[0]))
#     obs, rewards, done, _c, _d = env.step(action)
#     action = agent.compute_single_action(obs)
#     # print(f"Reward in this step: {rewards}")
#     env.render()

for ep in range(1):
    print(f"Episode number is {ep+1}")
    obs = env.reset()
    for _ in range(100):
        print(f"Pole angle at step start: {np.degrees(obs[0])}", end=" ")
        obs, reward, done, _, _ = env.step(action)
        action = agent.compute_single_action(obs)
        print(f"Pole angle at step end: {np.degrees(obs[0])}", end=" ")
        print(f"Reward in step: {reward}, done: {done}")
        if done:
            break
ray.shutdown()
# env.close()


2023-10-10 12:19:29,213	INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at [1m[32m127.0.0.1:8265 [39m[22m
`UnifiedLogger` will be removed in Ray 2.7.
  return UnifiedLogger(config, logdir, loggers=None)
The `JsonLogger interface is deprecated in favor of the `ray.tune.json.JsonLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
The `CSVLogger interface is deprecated in favor of the `ray.tune.csv.CSVLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
The `TBXLogger interface is deprecated in favor of the `ray.tune.tensorboardx.TBXLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-10-10 12:19:34,357	INFO trainable.py:984 -- Restored on 127.0.0.1 from checkpoint: Checkpoint(filesystem=local, path=/Users/ylpan/Desktop/yl_AMS691_RL_Project/yl_gy

Episode number is 1
Pole angle at step start: [-2.3076787  -1.1653538   1.6054516   0.46877775] Pole angle at step end: -2.3309860229492188 Reward in step: 1.0, done: False
Pole angle at step start: -2.3309860229492188 Pole angle at step end: -2.1311728954315186 Reward in step: 1.0, done: False
Pole angle at step start: -2.1311728954315186 Pole angle at step end: -2.155400276184082 Reward in step: 1.0, done: False
Pole angle at step start: -2.155400276184082 Pole angle at step end: -1.9564155340194702 Reward in step: 1.0, done: False
Pole angle at step start: -1.9564155340194702 Pole angle at step end: -1.981394648551941 Reward in step: 1.0, done: False
Pole angle at step start: -1.981394648551941 Pole angle at step end: -1.7830814123153687 Reward in step: 1.0, done: False
Pole angle at step start: -1.7830814123153687 Pole angle at step end: -1.808663249015808 Reward in step: 1.0, done: False
Pole angle at step start: -1.808663249015808 Pole angle at step end: -1.610883116722107 Reward

