# Create a Simple Reflex-Based Lunar Lander Agent

In this example, we use Gymnasium, a library of environments for training agents with reinforcement learning (RL). We will not use RL here; we only use the environment together with a custom simple reflex-based agent.

## Install Gymnasium

The documentation for Gymnasium is available at https://gymnasium.farama.org/ 

Steps:
1. Create a new folder and open it with VS Code and install all needed Python Extensions in VS Code.
2. Create a new virtual environment (`Ctrl+Shift+P`, then `Python: Create Environment...`)
3. I needed to install swig and the Python C++ headers on WSL2 via the terminal
    * `sudo apt install swig`
    * `sudo apt-get install python3-dev` 
4. Install gymnasium with the needed extras

In [None]:
%pip install -q swig
%pip install -q gymnasium[box2d,classic_control]

## The Lunar Lander Environment 

The documentation of the environment is available at: https://gymnasium.farama.org/environments/box2d/lunar_lander/

* Performance Measure: A reward of -100 or +100 points for crashing or landing safely respectively. We do not use 
  intermediate rewards here.

* Environment: This environment is a classic rocket trajectory optimization problem. A ship needs to land safely. The space is **continuous** with
  x and y coordinates in the range [-2.5, 2.5]. The landing pad is at coordinate (0,0).

* Actuators: According to Pontryagin's maximum principle, it is optimal to fire the engine at full throttle or turn it off. This is the reason why this environment has discrete actions: engine on or off. There are four discrete actions available:

    - 0: do nothing
    - 1: fire left orientation engine
    - 2: fire main engine
    - 3: fire right orientation engine

* Sensors: Each observation is an 8-dimensional vector: the coordinates of the lander in x & y, its linear velocities in x & y, its angle, its angular velocity, and two booleans that represent whether each leg is in contact with the ground or not.
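
As a small illustration of the observation layout listed above, the sketch below unpacks one 8-dimensional observation into named fields. The `LanderObservation` class is a hypothetical helper, not part of Gymnasium; the sample values are the first observation printed in the random-agent output later in this notebook.

```python
from typing import NamedTuple

class LanderObservation(NamedTuple):
    """Hypothetical helper: names for the 8 observation entries, in order."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: float
    right_leg_contact: float

# First observation printed in the random-agent output of this notebook.
raw = [-0.00768595, 1.3999902, -0.77851564, -0.48579895,
       0.00891286, 0.17634556, 0.0, 0.0]

obs = LanderObservation(*raw)
print(f"position=({obs.x:.3f}, {obs.y:.3f}), falling={obs.vy < 0}")
print(f"legs on ground: {bool(obs.left_leg_contact)}, {bool(obs.right_leg_contact)}")
```

The leg-contact entries arrive as floats (0.0 or 1.0), so converting them with `bool` makes the intent explicit.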

Gymnasium environments are created with `gym.make` and implemented as classes with a `reset` method to start an episode and a `step` method to execute an action.
To use an environment with an agent function that expects percepts and returns an action, we need to write glue code that connects the environment with the agent function.

In [10]:
import gymnasium as gym

def run_episode(agent_function, max_steps=1000):
    """Run one episode in the LunarLander-v3 environment using the provided agent."""

    # Initialize the environment
    env = gym.make("LunarLander-v3", render_mode=None)

    # Reset the environment to generate the first observation (use seed=42 in reset to get reproducible results)
    observation, info = env.reset()

    # run one episode
    for _ in range(max_steps):
        # call the agent function to select an action
        action = agent_function(observation)

        print (f"Obs: {observation} -> Action: {action}")

        # step: execute an action in the environment
        observation, reward, terminated, truncated, info = env.step(action)
    
        env.render()

        if terminated or truncated:
            print(f"Final Reward: {reward}")
            break
    
    env.close()
    return reward

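Note: `env.render()` shows the environment only when it is created with `render_mode="human"` and the notebook is run locally (e.g., in VS Code). With `render_mode=None`, as used above, nothing is drawn. On Colab, you cannot see the environment because the code is run on a headless server (i.e., a server without a display). One workaround is to record a video with `render_mode="rgb_array"`.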
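
The reset/step protocol used by `run_episode` can also be tried without a display or Box2D. The sketch below uses a stand-in environment that only mimics the return shapes of `reset` and `step`; `ToyEnv` and `run_toy_episode` are hypothetical names invented for this illustration and are not part of Gymnasium.

```python
import random

class ToyEnv:
    """Hypothetical stand-in that mimics the reset/step return shapes."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.rng.random(), {}             # (observation, info)

    def step(self, action):
        self.t += 1
        observation = self.rng.random()
        reward = 1.0 if action == 1 else 0.0
        terminated = self.t >= self.horizon      # ended by the task's own rules
        truncated = False                        # no external time limit in this toy
        return observation, reward, terminated, truncated, {}

def run_toy_episode(agent_function, env, max_steps=100):
    """Same loop structure as run_episode, without rendering or printing."""
    observation, info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent_function(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward

def always_one(observation):
    """Agent that ignores the observation and always returns action 1."""
    return 1

print(run_toy_episode(always_one, ToyEnv(horizon=5)))
```

The agent function never touches the environment object; it only maps an observation to an action, which is what lets the same agent be reused across different runner functions.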

## Example: A Random Agent

We randomly return one of the actions. The environment accepts the integers 0-3.


In [9]:
import numpy as np

def random_agent_function(observation): 
    """A random agent that selects actions uniformly at random. It ignores the observation."""
    return np.random.choice([0, 1, 2, 3], p=[0.25, 0.25, 0.25, 0.25])

run_episode(random_agent_function)

Obs: [-0.00768595  1.3999902  -0.77851564 -0.48579895  0.00891286  0.17634556
  0.          0.        ] -> Action: 1
Obs: [-0.01546936  1.3884788  -0.7896111  -0.5117229   0.02006378  0.22304049
  0.          0.        ] -> Action: 0
Obs: [-0.02325306  1.3763682  -0.78964573 -0.53843474  0.03120808  0.22290668
  0.          0.        ] -> Action: 2
Obs: [-0.03112469  1.3643225  -0.79806393 -0.5356246   0.04199808  0.21582015
  0.          0.        ] -> Action: 0
Obs: [-0.0389966   1.3516784  -0.79809546 -0.562306    0.05278661  0.21579078
  0.          0.        ] -> Action: 3
Obs: [-0.04678421  1.338435   -0.78750765 -0.58892643  0.06144578  0.17319937
  0.          0.        ] -> Action: 3
Obs: [-0.05447731  1.3245988  -0.77564585 -0.6152183   0.0677167   0.12543003
  0.          0.        ] -> Action: 2
Obs: [-0.06228695  1.3110851  -0.78693426 -0.6008979   0.07363201  0.11831703
  0.          0.        ] -> Action: 0
Obs: [-0.07009687  1.2969716  -0.7869502  -0.62757677  0.0795461

-100

## A Simple Reflex-Based Agent

To make the code easier to read, we use enumerations for actions (integers) and observations (index in the observation vector).

In [7]:
from enum import Enum

class Act(Enum):
    LEFT = 1
    RIGHT = 3
    MAIN = 2
    NO_OP = 0

class Obs(Enum):
    X = 0
    Y = 1
    VX = 2
    VY = 3
    ANGLE = 4
    ANGULAR_VELOCITY = 5
    LEFT_LEG_CONTACT = 6
    RIGHT_LEG_CONTACT = 7
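
A possible variation, shown only as a sketch: with `IntEnum` instead of `Enum`, members behave like integers, so they can index a sequence without `.value`. The names `ObsIdx` and `ActIdx` are hypothetical and chosen so they do not clash with the classes above; the observation values are made up.

```python
from enum import IntEnum

class ObsIdx(IntEnum):
    """Hypothetical IntEnum variant of Obs; members behave like ints."""
    X = 0
    Y = 1
    VX = 2
    VY = 3
    ANGLE = 4
    ANGULAR_VELOCITY = 5
    LEFT_LEG_CONTACT = 6
    RIGHT_LEG_CONTACT = 7

class ActIdx(IntEnum):
    """Hypothetical IntEnum variant of Act."""
    NO_OP = 0
    LEFT = 1
    MAIN = 2
    RIGHT = 3

observation = [0.0, 1.4, 0.5, -0.26, -0.006, -0.12, 0.0, 0.0]  # made-up values

vy = observation[ObsIdx.VY]          # no `.value` needed for indexing
action = ActIdx.MAIN if vy < -0.2 else ActIdx.NO_OP
print(vy, int(action))
```

Converting with `int(action)` before handing the action to the environment keeps the value a plain integer, which is a cautious choice rather than a documented requirement.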



## Implement A Better Reflex-Based Agent

Build a better agent that uses its right and left thrusters to land the craft (more) safely. Test your agent function using 100 problems.

In [8]:
def rocket_agent_function(observation):
    """Rule-based agent for lunar lander."""

    rules = [
        # (condition, action)
        # Brake the descent
        (lambda obs: obs[Obs.VY.value] < -0.2, Act.MAIN.value),
        # Correct the tilt angle
        (lambda obs: obs[Obs.ANGLE.value] > 0.15, Act.RIGHT.value),  # angle above the threshold: fire the right engine to rotate back
        (lambda obs: obs[Obs.ANGLE.value] < -0.15, Act.LEFT.value),  # angle below the threshold: fire the left engine to rotate back
        # Steer the lander back toward the designated landing pad
        (lambda obs: obs[Obs.X.value] > 0.1, Act.LEFT.value),
        (lambda obs: obs[Obs.X.value] < -0.1, Act.RIGHT.value)
    ]

    # go through the rules in order; the first matching rule decides the action
    for condition, action in rules:
        if condition(observation):
            return action

    return Act.NO_OP.value

run_episode(rocket_agent_function)


Obs: [ 0.00534973  1.4050221   0.5418712  -0.26213983 -0.00619238 -0.12274197
  0.          0.        ] -> Action: 2
Obs: [ 0.01085701  1.399317    0.55611044 -0.2535941  -0.01151624 -0.10648684
  0.          0.        ] -> Action: 2
Obs: [ 0.01637573  1.3937044   0.5572207  -0.24949823 -0.01680839 -0.10585276
  0.          0.        ] -> Action: 2
Obs: [ 0.0218297   1.3882722   0.55110806 -0.24150516 -0.02245212 -0.11288514
  0.          0.        ] -> Action: 2
Obs: [ 0.02733202  1.3835702   0.55580264 -0.20907079 -0.0279589  -0.11014573
  0.          0.        ] -> Action: 2
Obs: [ 0.03301163  1.3795296   0.5727747  -0.17968592 -0.03273122 -0.09545504
  0.          0.        ] -> Action: 0
Obs: [ 0.03869123  1.3748889   0.5727879  -0.20636104 -0.0375033  -0.09545038
  0.          0.        ] -> Action: 2
Obs: [ 0.0443901   1.3702224   0.57468474 -0.20753294 -0.04225169 -0.09497631
  0.          0.        ] -> Action: 2
Obs: [ 0.05022135  1.3661904   0.5874549  -0.17933106 -0.0465355

-100
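
Because the rule list is scanned in order and the first match wins, it can help to probe it with hand-made observations before running full episodes. The sketch below copies the thresholds of `rocket_agent_function` into a self-contained form; the rule names, the `explain` helper, and the probe observations are hypothetical additions for illustration.

```python
# Index constants mirror the Obs and Act enumerations defined earlier.
X, VY, ANGLE = 0, 3, 4
NO_OP, LEFT, MAIN, RIGHT = 0, 1, 2, 3

RULES = [
    ("brake descent", lambda obs: obs[VY] < -0.2, MAIN),
    ("angle too positive", lambda obs: obs[ANGLE] > 0.15, RIGHT),
    ("angle too negative", lambda obs: obs[ANGLE] < -0.15, LEFT),
    ("right of pad", lambda obs: obs[X] > 0.1, LEFT),
    ("left of pad", lambda obs: obs[X] < -0.1, RIGHT),
]

def explain(obs):
    """Return (rule name, action) for the first rule whose condition holds."""
    for name, condition, action in RULES:
        if condition(obs):
            return name, action
    return "no rule matched", NO_OP

# Hand-made observations: x, y, vx, vy, angle, angular velocity, leg, leg.
falling_and_tilted = [0.0, 1.0, 0.0, -0.5, 0.3, 0.0, 0.0, 0.0]
tilted_only = [0.0, 1.0, 0.0, -0.1, 0.3, 0.0, 0.0, 0.0]
off_center = [0.4, 1.0, 0.0, -0.1, 0.0, 0.0, 0.0, 0.0]

for probe in (falling_and_tilted, tilted_only, off_center):
    print(explain(probe))
```

The first probe shows the effect of priority: a lander that is both falling fast and tilted only fires the main engine, and the angle rule is not reached until the descent has slowed.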

## Evaluating the Agent

Run the agent on 100 problems and report the average reward.

In [11]:
#TEST 1: VY < -0.3
import numpy as np

def run_episode_test(agent_function):
    """Run one episode in the LunarLander-v3 environment using the provided agent."""

    # Initialise the environment
    env = gym.make("LunarLander-v3", render_mode=None)

    # Reset the environment to generate the first observation
    observation, info = env.reset()

    # run one episode (max. 1000 steps)
    for _ in range(1000):
        # call the agent to select an action
        action = agent_function(observation)

        # step (transition) through the environment with the action
        observation, reward, terminated, truncated, info = env.step(action)

        if terminated:
            break

    env.close()
    return reward

def run_episodes(agent_function, n=100):
    """Run multiple episodes with the given agent and return the rewards for each episode."""
    return [run_episode_test(agent_function) for _ in range(n)]

rewards = run_episodes(rocket_agent_function)
print(rewards)

print(f"Average reward: {np.average(rewards)}")
print(f"Success rate: {np.sum(np.array(rewards) == 100)}/{len(rewards)}")

[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 100, 100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 100, -100, 100, -100, -100, 100, -100, -100, -100, 100, -100, 100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 100, 100, -100, -100, -100, -100, -100, -100, -100, -100]
Average reward: -82.0
Success rate: 9/100


In [26]:
#TEST 2: VY < -0.2
import numpy as np

def run_episode_test(agent_function):
    """Run one episode in the LunarLander-v3 environment using the provided agent."""

    # Initialise the environment
    env = gym.make("LunarLander-v3", render_mode=None)

    # Reset the environment to generate the first observation
    observation, info = env.reset()

    # run one episode (max. 1000 steps)
    for _ in range(1000):
        # call the agent to select an action
        action = agent_function(observation)

        # step (transition) through the environment with the action
        observation, reward, terminated, truncated, info = env.step(action)

        if terminated:
            break

    env.close()
    return reward

def run_episodes(agent_function, n=100):
    """Run multiple episodes with the given agent and return the rewards for each episode."""
    return [run_episode_test(agent_function) for _ in range(n)]

rewards = run_episodes(rocket_agent_function)
print(rewards)

print(f"Average reward: {np.average(rewards)}")
print(f"Success rate: {np.sum(np.array(rewards) == 100)}/{len(rewards)}")

[100, 100, -100, -100, -100, -100, 100, -100, -100, np.float64(-0.0806241681046842), 100, 100, -100, 100, -100, -100, 100, -100, 100, -100, 100, 100, -100, -100, 100, -100, -100, 100, -100, -100, -100, -100, 100, -100, 100, 100, -100, -100, -100, -100, 100, -100, -100, 100, -100, -100, 100, -100, -100, -100, -100, -100, -100, -100, 100, -100, 100, -100, 100, -100, -100, -100, -100, -100, 100, 100, -100, -100, -100, -100, -100, -100, -100, 100, -100, 100, -100, 100, -100, -100, 100, -100, 100, 100, -100, -100, 100, -100, 100, -100, -100, -100, -100, -100, 100, np.float64(1.234205915060316), -100, -100, 100, 100]
Average reward: -29.988464182530443
Success rate: 34/100
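
The reward lists contain a few values that are neither +100 nor -100. These are presumably episodes in which the loop ran out of steps before the environment reported termination, so the last step reward is returned instead. A minimal sketch for counting the three kinds of outcome follows; `tally_outcomes` is a hypothetical helper and the sample list is made up.

```python
from collections import Counter

def tally_outcomes(rewards):
    """Count landings (+100), crashes (-100), and everything else."""
    def label(reward):
        if reward == 100:
            return "landed"
        if reward == -100:
            return "crashed"
        return "other"
    return Counter(label(r) for r in rewards)

# Made-up sample in the same format as the printed reward lists.
sample = [100, -100, -100, -0.08, 100, 1.23, -100]
counts = tally_outcomes(sample)
print(counts["landed"], counts["crashed"], counts["other"])
```

Reporting the "other" count next to the success rate makes it visible when the average reward is pulled toward zero by unfinished episodes rather than by landings.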


In [25]:
#TEST 3: VY < -0.15
import numpy as np

def run_episode_test(agent_function):
    """Run one episode in the LunarLander-v3 environment using the provided agent."""

    # Initialise the environment
    env = gym.make("LunarLander-v3", render_mode=None)

    # Reset the environment to generate the first observation
    observation, info = env.reset()

    # run one episode (max. 1000 steps)
    for _ in range(1000):
        # call the agent to select an action
        action = agent_function(observation)

        # step (transition) through the environment with the action
        observation, reward, terminated, truncated, info = env.step(action)

        if terminated:
            break

    env.close()
    return reward

def run_episodes(agent_function, n=100):
    """Run multiple episodes with the given agent and return the rewards for each episode."""
    return [run_episode_test(agent_function) for _ in range(n)]

rewards = run_episodes(rocket_agent_function)
print(rewards)

print(f"Average reward: {np.average(rewards)}")
print(f"Success rate: {np.sum(np.array(rewards) == 100)}/{len(rewards)}")

[-100, 100, -100, -100, -100, -100, 100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 100, 100, 100, -100, 100, 100, 100, -100, -100, -100, -100, -100, 100, -100, -100, -100, 100, -100, -100, 100, -100, -100, -100, -100, -100, np.float64(-0.22023092004587738), -100, 100, -100, np.float64(-0.021794024895214648), 100, 100, -100, -100, 100, -100, -100, -100, -100, -100, -100, -100, 100, -100, -100, -100, -100, -100, -100, -100, 100, 100, -100, -100, 100, 100, -100, -100, 100, -100, -100, -100, -100, -100, 100, -100, 100, -100, -100, -100, -100, -100, -100, -100, -100, 100]
Average reward: -50.00242024944941
Success rate: 24/100
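
The three tests above differ only in the `VY` threshold. Instead of editing the agent function by hand between runs, the thresholds could be passed in as parameters. The sketch below is one way to do this; `make_rule_agent` and its parameter names are hypothetical, and the rule order copies `rocket_agent_function`.

```python
def make_rule_agent(vy_threshold=-0.2, angle_limit=0.15, x_limit=0.1):
    """Build an agent function with the given thresholds (hypothetical helper)."""
    X, VY, ANGLE = 0, 3, 4                    # same indices as the Obs enumeration
    NO_OP, LEFT, MAIN, RIGHT = 0, 1, 2, 3     # same codes as the Act enumeration

    def agent(obs):
        if obs[VY] < vy_threshold:
            return MAIN
        if obs[ANGLE] > angle_limit:
            return RIGHT
        if obs[ANGLE] < -angle_limit:
            return LEFT
        if obs[X] > x_limit:
            return LEFT
        if obs[X] < -x_limit:
            return RIGHT
        return NO_OP

    return agent

# One agent per threshold tried in the tests above.
agents = {t: make_rule_agent(vy_threshold=t) for t in (-0.3, -0.2, -0.15)}

probe = [0.0, 1.0, 0.0, -0.25, 0.0, 0.0, 0.0, 0.0]  # made-up observation, vy = -0.25
print({t: agent(probe) for t, agent in agents.items()})
```

Each entry of `agents` can then be passed to `run_episodes`, for example `run_episodes(agents[-0.2])`, so that every threshold is evaluated by the same code.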


In [None]:
# TEST 3: VY < -0.2 | X outside [-0.1, 0.1] -> fire left/right
import numpy as np

def run_episode_test(agent_function):
    """Run one episode in the LunarLander-v3 environment using the provided agent."""

    # Initialise the environment
    env = gym.make("LunarLander-v3", render_mode="human")

    # Reset the environment to generate the first observation
    observation, info = env.reset()

    # run one episode (max. 1000 steps)
    for _ in range(1000):
        # call the agent to select an action
        action = agent_function(observation)

        # step (transition) through the environment with the action
        observation, reward, terminated, truncated, info = env.step(action)

        if terminated:
            break

    env.close()
    return reward

def run_episodes(agent_function, n=100):
    """Run multiple episodes with the given agent and return the rewards for each episode."""
    return [run_episode_test(agent_function) for _ in range(n)]

rewards = run_episodes(rocket_agent_function)
print(rewards)

print(f"Average reward: {np.average(rewards)}")
print(f"Success rate: {np.sum(np.array(rewards) == 100)}/{len(rewards)}")

The lander now touches down better and finds its way to the target better. I should add a rule that keeps the main engine on while X is not yet 0.

In [None]:
# TEST 3: VY < -0.2 | X outside [-0.1, 0.1] -> fire left/right | abs(X) > 0.1 -> fire main
import numpy as np

def run_episode_test(agent_function):
    """Run one episode in the LunarLander-v3 environment using the provided agent."""

    # Initialise the environment
    env = gym.make("LunarLander-v3", render_mode="human")

    # Reset the environment to generate the first observation
    observation, info = env.reset()

    # run one episode (max. 1000 steps)
    for _ in range(1000):
        # call the agent to select an action
        action = agent_function(observation)

        # step (transition) through the environment with the action
        observation, reward, terminated, truncated, info = env.step(action)

        if terminated:
            break

    env.close()
    return reward

def run_episodes(agent_function, n=100):
    """Run multiple episodes with the given agent and return the rewards for each episode."""
    return [run_episode_test(agent_function) for _ in range(n)]

rewards = run_episodes(rocket_agent_function)
print(rewards)

print(f"Average reward: {np.average(rewards)}")
print(f"Success rate: {np.sum(np.array(rewards) == 100)}/{len(rewards)}")

It cannot touch down, haha: whenever it gets close to the ground it bounces back up.
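
One idea that might reduce the bouncing, offered as an untested sketch and not evaluated in this notebook: stop firing as soon as either leg-contact flag in the observation is set. The wrapper name `cut_engines_on_contact`, the stand-in agent, and the observations below are all hypothetical.

```python
def cut_engines_on_contact(agent_function):
    """Hypothetical wrapper: return NO_OP as soon as either leg reports contact."""
    LEFT_LEG_CONTACT, RIGHT_LEG_CONTACT = 6, 7   # same indices as the Obs enumeration
    NO_OP = 0

    def wrapped(observation):
        if observation[LEFT_LEG_CONTACT] or observation[RIGHT_LEG_CONTACT]:
            return NO_OP
        return agent_function(observation)

    return wrapped

def always_main(observation):
    """Stand-in agent that always fires the main engine (action 2)."""
    return 2

agent = cut_engines_on_contact(always_main)
in_flight = [0.0, 0.5, 0.0, -0.1, 0.0, 0.0, 0.0, 0.0]   # made-up observation
touching = [0.0, 0.0, 0.0, -0.1, 0.0, 0.0, 1.0, 0.0]    # left leg on the ground
print(agent(in_flight), agent(touching))
```

Writing this as a wrapper keeps the rule list unchanged, so the same check could be applied to any of the agent variants tried here.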

In [None]:
# TEST 3: change the rule order to: steer to the pad (X = 0) -> correct the angle -> use main
import numpy as np

def run_episode_test(agent_function):
    """Run one episode in the LunarLander-v3 environment using the provided agent."""

    # Initialise the environment
    env = gym.make("LunarLander-v3", render_mode="human")

    # Reset the environment to generate the first observation
    observation, info = env.reset()

    # run one episode (max. 1000 steps)
    for _ in range(1000):
        # call the agent to select an action
        action = agent_function(observation)

        # step (transition) through the environment with the action
        observation, reward, terminated, truncated, info = env.step(action)

        if terminated:
            break

    env.close()
    return reward

def run_episodes(agent_function, n=100):
    """Run multiple episodes with the given agent and return the rewards for each episode."""
    return [run_episode_test(agent_function) for _ in range(n)]

rewards = run_episodes(rocket_agent_function)
print(rewards)

print(f"Average reward: {np.average(rewards)}")
print(f"Success rate: {np.sum(np.array(rewards) == 100)}/{len(rewards)}")

The lander crashes again and again =))))

In [21]:
# TEST 3: change the rule order to: use main -> steer to the pad (X = 0) -> correct the angle
import numpy as np

def run_episode_test(agent_function):
    """Run one episode in the LunarLander-v3 environment using the provided agent."""

    # Initialise the environment
    env = gym.make("LunarLander-v3", render_mode=None)

    # Reset the environment to generate the first observation
    observation, info = env.reset()

    # run one episode (max. 1000 steps)
    for _ in range(1000):
        # call the agent to select an action
        action = agent_function(observation)

        # step (transition) through the environment with the action
        observation, reward, terminated, truncated, info = env.step(action)

        if terminated:
            break

    env.close()
    return reward

def run_episodes(agent_function, n=100):
    """Run multiple episodes with the given agent and return the rewards for each episode."""
    return [run_episode_test(agent_function) for _ in range(n)]

rewards = run_episodes(rocket_agent_function)
print(rewards)

print(f"Average reward: {np.average(rewards)}")
print(f"Success rate: {np.sum(np.array(rewards) == 100)}/{len(rewards)}")

[-100, -100, -100, -100, -100, -100, -100, -100, -100, 100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
Average reward: -94.0
Success rate: 3/100


Terrible.

In [53]:
# TEST 3: turn off the angular-velocity rule (the lander fell faster because the main engine fired less often); the best result rose to 41/100 even though the lander loses its balance more often
# I then tried setting the angle threshold to 0.15, which gave much better results
import numpy as np

def run_episode_test(agent_function):
    """Run one episode in the LunarLander-v3 environment using the provided agent."""

    # Initialise the environment
    env = gym.make("LunarLander-v3", render_mode=None)

    # Reset the environment to generate the first observation
    observation, info = env.reset()

    # run one episode (max. 1000 steps)
    for _ in range(1000):
        # call the agent to select an action
        action = agent_function(observation)

        # step (transition) through the environment with the action
        observation, reward, terminated, truncated, info = env.step(action)

        if terminated:
            break

    env.close()
    return reward

def run_episodes(agent_function, n=100):
    """Run multiple episodes with the given agent and return the rewards for each episode."""
    return [run_episode_test(agent_function) for _ in range(n)]

rewards = run_episodes(rocket_agent_function)
print(rewards)

print(f"Average reward: {np.average(rewards)}")
print(f"Success rate: {np.sum(np.array(rewards) == 100)}/{len(rewards)}")

[100, -100, -100, 100, 100, -100, 100, 100, 100, 100, 100, 100, 100, 100, -100, 100, 100, -100, -100, -100, 100, 100, 100, 100, 100, -100, 100, -100, 100, -100, 100, 100, 100, 100, -100, 100, -100, -100, 100, 100, -100, -100, 100, 100, -100, 100, -100, 100, 100, -100, 100, -100, -100, 100, 100, 100, 100, 100, -100, -100, 100, 100, 100, 100, -100, 100, 100, 100, -100, 100, 100, 100, -100, 100, -100, 100, 100, -100, -100, -100, -100, -100, -100, -100, -100, 100, -100, 100, 100, -100, 100, 100, 100, -100, 100, -100, -100, -100, 100, 100]
Average reward: 20.0
Success rate: 60/100


In [None]:
# TEST 3: turn off the angular-velocity rule (the lander fell faster because the main engine fired less often); the best result rose to 41/100 even though the lander loses its balance more often
# I then tried setting the angle threshold to 0.15, which gave much better results
# I added the angular-velocity condition back, but placed it after the angle condition
# I added a condition Y < -0.1 -> fire main; the result was not very stable
import numpy as np

def run_episode_test(agent_function):
    """Run one episode in the LunarLander-v3 environment using the provided agent."""

    # Initialise the environment
    env = gym.make("LunarLander-v3", render_mode="human")

    # Reset the environment to generate the first observation
    observation, info = env.reset()

    # run one episode (max. 1000 steps)
    for _ in range(1000):
        # call the agent to select an action
        action = agent_function(observation)

        # step (transition) through the environment with the action
        observation, reward, terminated, truncated, info = env.step(action)

        if terminated:
            break

    env.close()
    return reward

def run_episodes(agent_function, n=100):
    """Run multiple episodes with the given agent and return the rewards for each episode."""
    return [run_episode_test(agent_function) for _ in range(n)]

rewards = run_episodes(rocket_agent_function)
print(rewards)

print(f"Average reward: {np.average(rewards)}")
print(f"Success rate: {np.sum(np.array(rewards) == 100)}/{len(rewards)}")

In [122]:
import numpy as np

def run_episode_test(agent_function):
    """Run one episode in the LunarLander-v3 environment using the provided agent."""

    # Initialise the environment
    env = gym.make("LunarLander-v3", render_mode=None)

    # Reset the environment to generate the first observation
    observation, info = env.reset()

    # run one episode (max. 1000 steps)
    for _ in range(1000):
        # call the agent to select an action
        action = agent_function(observation)

        # step (transition) through the environment with the action
        observation, reward, terminated, truncated, info = env.step(action)

        if terminated:
            break

    env.close()
    return reward

def run_episodes(agent_function, n=100):
    """Run multiple episodes with the given agent and return the rewards for each episode."""
    return [run_episode_test(agent_function) for _ in range(n)]

rewards = run_episodes(rocket_agent_function)
print(rewards)

print(f"Average reward: {np.average(rewards)}")
print(f"Success rate: {np.sum(np.array(rewards) == 100)}/{len(rewards)}")

[100, -100, -100, -100, 100, -100, -100, -100, 100, -100, 100, 100, 100, 100, -100, -100, 100, -100, 100, -100, -100, 100, -100, 100, -100, 100, 100, -100, -100, 100, 100, 100, -100, 100, -100, 100, np.float64(0.21965917964532025), 100, -100, -100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, -100, -100, 100, -100, 100, 100, 100, -100, 100, -100, 100, 100, 100, 100, -100, 100, 100, 100, 100, -100, 100, -100, 100, -100, 100, -100, 100, 100, 100, np.float64(0.10289941503260039), 100, -100, 100, 100, 100, -100, 100, 100, 100, -100, 100, 100, 100, 100, -100, -100, 100]
Average reward: 28.00322558594678
Success rate: 63/100
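
Each batch is 100 episodes started without a fixed seed, so a difference of a few landings between two batches (for example 60/100 versus 63/100) may simply be noise. A rough check, assuming the episodes are independent, is the binomial standard error of the success rate; `success_rate_with_error` is a hypothetical helper.

```python
import math

def success_rate_with_error(successes, n):
    """Return (rate, standard error) for a binomial success count."""
    rate = successes / n
    std_err = math.sqrt(rate * (1 - rate) / n)
    return rate, std_err

for label, successes in [("60 landings", 60), ("63 landings", 63)]:
    rate, err = success_rate_with_error(successes, 100)
    print(f"{label}: {rate:.2f} +/- {err:.3f}")
```

With a standard error of roughly 0.05 per batch, changes of a few percentage points are hard to attribute to a rule change; more episodes per batch, or reusing the same seeds for every variant, would make the comparison sharper.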


In [None]:
import numpy as np

def run_episode_test(agent_function):
    """Run one episode in the LunarLander-v3 environment using the provided agent."""

    # Initialise the environment
    env = gym.make("LunarLander-v3", render_mode="human")

    # Reset the environment to generate the first observation
    observation, info = env.reset()

    # run one episode (max. 1000 steps)
    for _ in range(1000):
        # call the agent to select an action
        action = agent_function(observation)

        # step (transition) through the environment with the action
        observation, reward, terminated, truncated, info = env.step(action)

        if terminated:
            break

    env.close()
    return reward

def run_episodes(agent_function, n=100):
    """Run multiple episodes with the given agent and return the rewards for each episode."""
    return [run_episode_test(agent_function) for _ in range(n)]

rewards = run_episodes(rocket_agent_function)
print(rewards)

print(f"Average reward: {np.average(rewards)}")
print(f"Success rate: {np.sum(np.array(rewards) == 100)}/{len(rewards)}")

How to save a run as a video

In [14]:
import gymnasium as gym

def run_episode(agent_function, max_steps=1000, video_folder="result"):
    """Run one episode in the LunarLander-v3 environment using the provided agent."""

    # Initialize environment with rgb_array for video recording
    env = gym.make("LunarLander-v3", render_mode="rgb_array")
    env = gym.wrappers.RecordVideo(
        env,
        video_folder,
        episode_trigger=lambda ep: True,  # record every episode
        name_prefix="lunarlander"
    )

    # Reset environment
    observation, info = env.reset()

    total_reward = 0
    for _ in range(max_steps):
        # select an action using the agent
        action = agent_function(observation)

        # step: execute the action
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward

        if terminated or truncated:
            print(f"Final Reward: {total_reward}")
            break

    env.close()
    return total_reward

reward = run_episode(rocket_agent_function)
print("Reward:", reward)


Final Reward: -14.39198234813665
Reward: -14.39198234813665


Select the highest-scoring landing

In [15]:
import gymnasium as gym
import os
import shutil

def run_episode(agent_function, max_steps=1000, video_folder="tmp_videos"):
    """Run one episode and return total reward + path to video."""

    env = gym.make("LunarLander-v3", render_mode="rgb_array")
    env = gym.wrappers.RecordVideo(
        env,
        video_folder,
        episode_trigger=lambda ep: True,
        name_prefix="lunarlander"
    )

    obs, info = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = agent_function(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break

    env.close()
    return total_reward

def run_best(agent_function, n_episodes=10):
    best_reward = float("-inf")
    best_video = None

    # clean up old folder
    if os.path.exists("tmp_videos"):
        shutil.rmtree("tmp_videos")

    os.makedirs("tmp_videos", exist_ok=True)

    for ep in range(n_episodes):
        reward = run_episode(agent_function, video_folder="tmp_videos")
        print(f"Episode {ep}: reward={reward}")

        # find latest video file created
        video_files = [f for f in os.listdir("tmp_videos") if f.endswith(".mp4")]
        video_files.sort(key=lambda f: os.path.getmtime(os.path.join("tmp_videos", f)))
        latest_video = os.path.join("tmp_videos", video_files[-1])

        if reward > best_reward:
            best_reward = reward
            best_video = "best_episode.mp4"
            shutil.copy(latest_video, best_video)
            print(f"🔥 New best! Saved {best_video}")

    print(f"Best reward = {best_reward}, video saved at {best_video}")
    return best_reward, best_video

best_reward, best_video = run_best(rocket_agent_function, n_episodes=20)




Episode 0: reward=256.4432880229661
🔥 New best! Saved best_episode.mp4
Episode 1: reward=5.141186139868651
Episode 2: reward=-27.01626478986438
Episode 3: reward=-12.023107924915209
Episode 4: reward=203.6237965495681
Episode 5: reward=176.32706090857894
Episode 6: reward=-74.81329816388359
Episode 7: reward=184.38826445630522
Episode 8: reward=15.407102413766083
Episode 9: reward=159.06687583427413
Episode 10: reward=-10.729076552091684
Episode 11: reward=255.33831146298547
Episode 12: reward=188.82489018593623
Episode 13: reward=229.99244680978526
Episode 14: reward=96.71992716426885
Episode 15: reward=6.908580215106568
Episode 16: reward=-23.464949567347404
Episode 17: reward=-488.626030470921
Episode 18: reward=-168.71692311824646
Episode 19: reward=4.404359005830301
Best reward = 256.4432880229661, video saved at best_episode.mp4
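
The rewards in this log are totals summed over all steps of an episode, so they are not limited to the -100 or +100 final-step reward used earlier. The Gymnasium documentation for Lunar Lander describes an episode scoring at least 200 points as a solution. The sketch below summarizes the 20 totals printed above (rounded to two decimals); the variable names are illustrative only.

```python
# Total rewards copied from the 20-episode log above, rounded to 2 decimals.
totals = [256.44, 5.14, -27.02, -12.02, 203.62, 176.33, -74.81, 184.39,
          15.41, 159.07, -10.73, 255.34, 188.82, 229.99, 96.72, 6.91,
          -23.46, -488.63, -168.72, 4.40]

best_episode = max(range(len(totals)), key=totals.__getitem__)
solved = [i for i, total in enumerate(totals) if total >= 200]
mean_total = sum(totals) / len(totals)

print(f"best episode: {best_episode} ({totals[best_episode]})")
print(f"episodes with at least 200 points: {solved}")
print(f"mean total reward: {mean_total:.2f}")
```

Keeping the totals in a list like this also gives a second way to pick the best episode, independent of file modification times.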


Take the 10 highest-scoring and the 10 lowest-scoring episodes and collect the videos for each case

In [19]:
import gymnasium as gym
import os
import shutil

def run_episode(agent_function, env, max_steps=1000):
    obs, info = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = agent_function(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward

def record_multiple(agent_function, n_total=100, max_steps=1000):
    results = []
    for i in range(n_total):
        video_dir = f"videos/episode_{i}"
        env = gym.make("LunarLander-v3", render_mode="rgb_array")
        env = gym.wrappers.RecordVideo(env, video_dir, episode_trigger=lambda x: True)

        reward = run_episode(agent_function, env, max_steps)
        env.close()
        results.append((reward, video_dir))
        print(f"Episode {i} finished with reward {reward}")
    return results

def split_best_worst(results, top_k=10):
    sorted_results = sorted(results, key=lambda x: x[0])

    worst = sorted_results[:top_k]
    best = sorted_results[-top_k:]

    # Create separate output folders
    os.makedirs("videos_best", exist_ok=True)
    os.makedirs("videos_worst", exist_ok=True)

    for reward, folder in best:
        for f in os.listdir(folder):
            if f.endswith(".mp4"):
                shutil.copy(os.path.join(folder, f), f"videos_best/best_{reward:.2f}.mp4")

    for reward, folder in worst:
        for f in os.listdir(folder):
            if f.endswith(".mp4"):
                shutil.copy(os.path.join(folder, f), f"videos_worst/worst_{reward:.2f}.mp4")

    print(f"Saved top {top_k} best episodes in videos_best/, worst {top_k} in videos_worst/")

results = record_multiple(random_agent_function, n_total=100)  # run 100 episodes
split_best_worst(results, top_k=10)


Episode 0 finished with reward -201.8955687383574
Episode 1 finished with reward -380.5658767263654
Episode 2 finished with reward -318.09908811841126
Episode 3 finished with reward -130.78199365014902
Episode 4 finished with reward -240.1892674541233
Episode 5 finished with reward -139.3406049690797
Episode 6 finished with reward -79.69014950795855
Episode 7 finished with reward -114.35685263095512
Episode 8 finished with reward -367.50153686176634
Episode 9 finished with reward -76.79052059388039
Episode 10 finished with reward -143.075744181998
Episode 11 finished with reward -134.70659441460032
Episode 12 finished with reward -161.81856541199943
Episode 13 finished with reward -84.30771795632651
Episode 14 finished with reward -273.3719064499886
Episode 15 finished with reward -452.658347368328
Episode 16 finished with reward -122.8493251738608
Episode 17 finished with reward -103.10257401718539
Episode 18 finished with reward -301.33504099324176
Episode 19 finished with reward -22
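
One detail of `split_best_worst` worth noting: the target file name is built only from the reward rounded to two decimals, so two episodes with the same rounded reward would overwrite each other. A possible adjustment, sketched here without touching the file system, is to include the episode index in the name. `plan_copies` is a hypothetical helper and the results below are made up.

```python
def plan_copies(results, top_k=10):
    """Map each selected episode to a unique target file name.

    `results` is a list of (reward, folder) pairs as built by record_multiple.
    Returns (best, worst), each a list of (source folder, target name) pairs.
    """
    # Sort by reward while remembering each episode's original index.
    ranked = sorted(enumerate(results), key=lambda item: item[1][0])

    def targets(items, kind):
        return [(folder, f"{kind}_ep{idx:03d}_{reward:.2f}.mp4")
                for idx, (reward, folder) in items]

    return targets(ranked[-top_k:], "best"), targets(ranked[:top_k], "worst")

# Made-up results; two episodes share the same rounded reward.
results = [(-201.9, "videos/episode_0"), (-76.79, "videos/episode_1"),
           (-76.79, "videos/episode_2"), (-380.57, "videos/episode_3")]

best, worst = plan_copies(results, top_k=2)
print(best)
print(worst)
```

The returned pairs could then drive the same `shutil.copy` calls as before, with the index keeping the two tied episodes apart.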