<a href="https://colab.research.google.com/github/supsi-dacd-isaac/TeachDecisionMakingUncertainty/blob/main/Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cart-pole control

In this second task you will design controllers for the version of the cart-pole problem described by Barto, Sutton, and Anderson in [“Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems”](https://ieeexplore.ieee.org/document/6313077).

<div>
<img src="https://gymnasium.farama.org/_static/videos/mujoco/inverted_pendulum.gif" width="200"/>
</div>



# Gym env
This model is implemented in the **Inverted Pendulum** environment.

A detailed description of the environment is available [here](https://gymnasium.farama.org/environments/mujoco/inverted_pendulum/), covering:
* action space
* observation space
* reward
* model
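
To make the observation layout concrete, here is a minimal sketch that gives names to the four observation entries. The field order (cart position, pole angle, cart velocity, pole angular velocity) is taken from the linked Gymnasium page and is worth double-checking there; `PendulumObs` and `unpack_obs` are hypothetical names introduced for illustration and are not part of Gymnasium.

```python
from typing import NamedTuple, Sequence


class PendulumObs(NamedTuple):
    """Named view of the 4-element observation (hypothetical helper)."""
    cart_position: float          # position of the cart along the rail
    pole_angle: float             # angle of the pole from vertical, in radians
    cart_velocity: float          # linear velocity of the cart
    pole_angular_velocity: float  # angular velocity of the pole


def unpack_obs(obs: Sequence[float]) -> PendulumObs:
    """Convert a raw observation (list or array of 4 numbers) to named fields."""
    if len(obs) != 4:
        raise ValueError(f"expected 4 observation entries, got {len(obs)}")
    return PendulumObs(*(float(x) for x in obs))


# Example with a made-up observation (not produced by the simulator)
state = unpack_obs([0.01, -0.03, 0.0, 0.2])
print(state.pole_angle)
```

A controller can then refer to `state.pole_angle` instead of `obs[1]`, which makes hand-written control rules easier to read.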




In [12]:
#@title Installing required libraries, setup

%%capture
!pip install gym pyvirtualdisplay
!apt-get install -y xvfb python-opengl ffmpeg
!pip install "gymnasium[mujoco]"
import os
os.environ["MUJOCO_GL"] = "egl"
os.environ["PYOPENGL_PLATFORM"] = "egl"

# Env and benchmark policies
The following cell defines the environment, `InvertedPendulum-v5`, and two baseline policies:
* `zero_policy`: always returns 0, that is, no force is applied to the cart
* `random_policy`: a random force is sampled from the admissible intervals of the environment's action space

The effect of these two policies is shown in the following cells via animations of control scenarios.
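
Both baselines share one interface: a function that takes an observation and returns a one-element action. The sketch below illustrates that interface with a constant-force policy and a wrapper that keeps any policy's output inside the action bounds. The names `constant_policy` and `clipped` are hypothetical, and the default bounds of -3 and 3 are an assumption taken from the environment documentation (the authoritative values are in `env.action_space.low` and `env.action_space.high`).

```python
from typing import Callable, List, Sequence

# A policy maps an observation to a one-element action (the force on the cart)
Policy = Callable[[Sequence[float]], List[float]]


def constant_policy(force: float) -> Policy:
    """Return a policy that applies the same force at every step."""
    def policy(obs: Sequence[float]) -> List[float]:
        return [force]
    return policy


def clipped(policy: Policy, low: float = -3.0, high: float = 3.0) -> Policy:
    """Wrap a policy so every action component stays inside [low, high].

    The default bounds are an assumption; check the environment's action space.
    """
    def wrapped(obs: Sequence[float]) -> List[float]:
        return [min(max(a, low), high) for a in policy(obs)]
    return wrapped


# The requested force exceeds the upper bound, so the wrapper saturates it
push_right = clipped(constant_policy(10.0))
print(push_right([0.0, 0.0, 0.0, 0.0]))
```

Wrapping a hand-crafted controller this way keeps the clipping logic in one place instead of repeating it in every policy.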

In [13]:
import gymnasium as gym
import numpy as np
import random
import math
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Create the InvertedPendulum environment (Mujoco must be installed)
env = gym.make("InvertedPendulum-v5", render_mode="rgb_array")

# Zero policy - always choose [0] (no force on the cart)
def zero_policy(obs):
    return [0]

# Random policy
def random_policy(obs):
    return env.action_space.sample()

# Benchmark solutions
The cells below first animate one episode of each baseline, then run 1000 episodes of `zero_policy` and `random_policy` and report the average reward, expressed as the number of steps before termination. Your solutions should beat both benchmarks (the better of the two lasts about 24 steps in expectation).

In [14]:
#@title Defining animating function
from gymnasium.wrappers import RecordVideo
import glob
from IPython.display import Video


def animate_policy(policy, env=None):
    # Build the environment per call; a default created in the signature
    # would be instantiated once at definition time and shared across calls
    if env is None:
        env = gym.make("InvertedPendulum-v5", render_mode="rgb_array")
    recorded = RecordVideo(
        env,                          # environment to record (needs render_mode="rgb_array")
        video_folder="",              # same location the glob below searches
        episode_trigger=lambda ep: True,
        name_prefix="inverted_pendulum_eval"
    )

    # Run exactly one episode, stepping only through the recording wrapper
    obs, info = recorded.reset()
    done = False
    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, info = recorded.step(action)
        done = terminated or truncated
    recorded.close()

    # Find and embed the just-written MP4
    mp4 = sorted(glob.glob("inverted_pendulum_eval-episode-*.mp4"))[-1]
    return Video(mp4, embed=True)

In [15]:
animate_policy(random_policy, env=gym.make("InvertedPendulum-v5", render_mode="rgb_array"))

In [16]:
animate_policy(zero_policy, env=gym.make("InvertedPendulum-v5", render_mode="rgb_array"))


In [17]:
def compute_average_reward(policy, env, num_episodes=1000):
    total_reward = 0
    for _ in range(num_episodes):
        obs, _ = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = policy(obs)
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            done = terminated or truncated
        total_reward += episode_reward
    return total_reward / num_episodes

# Create the environment
env = gym.make("InvertedPendulum-v5")

# Compute average reward for zero policy
avg_reward_zero = compute_average_reward(zero_policy, env)
print(f"Average reward for zero policy over 1000 episodes: {avg_reward_zero}")

# Compute average reward for random policy
avg_reward_random = compute_average_reward(random_policy, env)
print(f"Average reward for random policy over 1000 episodes: {avg_reward_random}")

env.close()

Average reward for zero policy over 1000 episodes: 24.042
Average reward for random policy over 1000 episodes: 4.785
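
An average alone does not show how much episode lengths vary, which matters when comparing a new controller against these benchmarks. As a minimal sketch, the variant below keeps the per-episode returns and reports the mean together with its standard error. The helpers `episode_returns` and `mean_and_stderr` are hypothetical names, and `env` is assumed to follow the same `reset`/`step` interface used in the cell above.

```python
import statistics
from typing import Callable, List, Tuple


def episode_returns(policy: Callable, env, num_episodes: int = 100) -> List[float]:
    """Run `num_episodes` episodes and return the total reward of each one."""
    returns = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        done = False
        total = 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    return returns


def mean_and_stderr(returns: List[float]) -> Tuple[float, float]:
    """Sample mean and standard error of the mean (needs at least 2 values)."""
    mean = statistics.fmean(returns)
    stderr = statistics.stdev(returns) / len(returns) ** 0.5
    return mean, stderr
```

With a freshly created environment, `mean_and_stderr(episode_returns(zero_policy, env))` would give the benchmark value together with an indication of how precisely it has been estimated.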
