# **A First Simple Gymnasium and Stable-Baselines3 Example**

This notebook implements a simple custom Gymnasium environment for a "Golf" game, wraps a custom actor-critic network in a Stable-Baselines3 policy, and trains it with PPO. The steps are:
- Define custom env
- Define model
- Train

### **Key Features**
1. **Environment Basics**:
   - The environment models a "golf ball" that can move between positions on a one-dimensional grid with values ranging from `0` to `10`.
   - The starting position of the ball is `9`.

2. **Spaces**:
   - **Action Space**: Discrete with two possible actions:
     - `0`: Move the ball one step closer to `0` (decrement position).
     - `1`: Move the ball one step away from `0` (increment position).
   - **Observation Space**: A single integer (wrapped in a NumPy array) representing the ball's position, constrained between `0` and `10`.

3. **Core Methods**:
   - **`reset()`**: 
     - Resets the environment, placing the ball at position `9`.
     - Returns the initial observation and an empty dictionary.
   - **`step(action)`**:
     - Updates the position based on the action taken.
     - Clips the position to ensure it remains between `0` and `10`.
     - Marks the episode as complete (`done` is `True`) once the position reaches `0`.
     - Rewards the agent with `1` on the step that ends the episode, otherwise `0`.
     - Returns the updated observation, reward, `done` flag (the value Gymnasium calls `terminated`), truncated flag (always `False` here), and an empty info dictionary.
   - **`render()`**: Prints the current position of the ball to the console.
   - **`close()`**: A placeholder method for cleanup.
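
The transition rule described above can be sketched without any RL library. This is only an illustration of the rule: `golf_step` is a hypothetical helper name, not part of the notebook's code, and it mirrors the move, clip, terminate, reward order of `step(action)`.

```python
def golf_step(pos: int, action: int) -> tuple[int, int, bool]:
    """Apply one action and return (new_pos, reward, terminated)."""
    if action == 0:
        pos -= 1  # move one step closer to 0
    elif action == 1:
        pos += 1  # move one step away from 0

    # Clip to the grid, as the environment does with np.clip.
    pos = max(0, min(10, pos))

    terminated = pos == 0
    reward = 1 if terminated else 0
    return pos, reward, terminated


# Nine consecutive "left" moves take the ball from 9 to the hole at 0.
pos = 9
for _ in range(9):
    pos, reward, terminated = golf_step(pos, 0)
print(pos, reward, terminated)  # 0 1 True
```

Note that moving right from position `10` leaves the ball at `10` because of the clipping step.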

### **Purpose**
This code defines a simple reinforcement learning environment where the goal is to reach position `0` from an initial position of `9`. It could be used to test or train RL agents in a straightforward setting.
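
To get a feel for how hard the task is, it helps to compare the best possible episode length with what a purely random agent would need. The sketch below is a worked calculation rather than part of the notebook: it assumes an agent that picks each action with probability 0.5, and the function names are hypothetical.

```python
def optimal_steps(start: int = 9) -> int:
    """Always moving toward 0 needs exactly `start` steps."""
    return start


def expected_steps_uniform_random(start: int = 9, top: int = 10) -> int:
    """Expected number of steps to reach 0 when actions are chosen uniformly."""
    # Let E[i] be the expected number of steps from position i, with E[0] = 0.
    #   Interior positions: E[i]   = 1 + (E[i-1] + E[i+1]) / 2
    #   Clipped top:        E[top] = 1 + (E[top-1] + E[top]) / 2
    # The top equation gives E[top] - E[top-1] = 2, and each interior equation
    # gives E[i] - E[i-1] = (E[i+1] - E[i]) + 2, so
    #   E[i] - E[i-1] = 2 * (top - i + 1).
    # Summing these differences from 1 to `start` yields E[start].
    return sum(2 * (top - i + 1) for i in range(1, start + 1))


print(optimal_steps())                  # 9
print(expected_steps_uniform_random())  # 108
```

Under these assumptions a random agent needs about 108 steps on average, which is in the same range as the `ep_len_mean` of 111 reported in the training log further down, as one might expect from a nearly untrained policy.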

In [5]:
from typing import Any, SupportsFloat

import numpy as np
import gymnasium as gym
from gymnasium.core import ActType
from gymnasium.core import ObsType


class GolfEnv(gym.Env):
    # Gymnasium reads the supported modes from the 'render_modes' key.
    metadata = {'render_modes': ['console']}

    def __init__(self):
        super().__init__()
        self.pos = 9
        self.action_space = gym.spaces.Discrete(2)
        self.observation_space = gym.spaces.Box(
            low=0,
            high=10,
            shape=(1,),
            dtype=np.int32,
        )

    def reset(
            self,
            *,
            seed: int | None = None,
            options: dict[str, Any] | None = None,
    ) -> tuple[ObsType, dict[str, Any]]:  # type: ignore

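        # Seed the environment's RNG, as the Gymnasium API expects.
        super().reset(seed=seed)
        self.pos = 9
        # Match the dtype declared in observation_space.
        return np.array([self.pos], dtype=np.int32), {}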

    def step(
            self, action: ActType
    ) -> tuple[ObsType, SupportsFloat, bool, bool, dict[str, Any]]:

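        # Action 0 moves toward the hole, action 1 moves away from it.
        if action == 0:
            self.pos -= 1
        elif action == 1:
            self.pos += 1

        # Keep the ball on the grid and store a plain Python int.
        self.pos = int(np.clip(self.pos, 0, 10))

        # The episode terminates once the ball reaches the hole at 0.
        done = self.pos == 0
        reward = 1 if done else 0

        # The observation dtype matches the one declared in observation_space.
        return np.array([self.pos], dtype=np.int32), reward, done, False, {}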

    def render(self):
        print(f'pos: {self.pos}')

    def close(self):
        pass


### **Custom Policy: Key Features**
1. **Purpose**:
   - The code defines a custom neural network for an Actor-Critic policy used in reinforcement learning. It extends the `ActorCriticPolicy` class from the `Stable-Baselines3` library and introduces a user-defined `MyACModel` for the Actor and Critic networks.

2. **Classes**:
   - **`MyACModel`**:
     - Implements separate feedforward networks for the Actor (policy) and Critic (value function).
     - Provides methods to compute forward passes for the Actor, Critic, or both combined.
   - **`MyACPolicy`**:
     - Extends the `ActorCriticPolicy` class from `Stable-Baselines3`.
     - Overrides the `_build_mlp_extractor` method to use the custom `MyACModel` as the policy's neural network.

3. **Customizations**:
   - The Actor and Critic networks have independent architectures, each with a configurable output dimension (`last_layer_dim_pi` and `last_layer_dim_vf`).
   - Orthogonal initialization is disabled in this policy implementation.

4. **Use Case**:
   - This implementation can be used to define a custom Actor-Critic policy for reinforcement learning algorithms, such as PPO, A2C, or other variants supported by `Stable-Baselines3`.
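
As a rough size check, the number of trainable parameters can be worked out by hand. The sketch below rests on assumptions not stated in the notebook: that the default flatten features extractor turns the `(1,)` observation into `features_dim = 1`, and that Stable-Baselines3 adds a linear action head (64 to 2 logits) and a linear value head (64 to 1) on top of `MyACModel`. `linear_params` is a hypothetical helper.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


features_dim = 1   # assumption: flattened Box observation of shape (1,)
latent_dim = 64    # default last_layer_dim_pi / last_layer_dim_vf
n_actions = 2      # Discrete(2)

counts = {
    "mlp_extractor.actor": linear_params(features_dim, latent_dim),
    "mlp_extractor.critic": linear_params(features_dim, latent_dim),
    "action_net (assumed head)": linear_params(latent_dim, n_actions),
    "value_net (assumed head)": linear_params(latent_dim, 1),
}

for name, n in counts.items():
    print(f"{name}: {n}")
print("total:", sum(counts.values()))  # total: 451
```

With a model this small, training time is dominated by environment stepping rather than by the network itself.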

In [6]:
import torch
from stable_baselines3.common.policies import ActorCriticPolicy


class MyACModel(torch.nn.Module):
    def __init__(
        self,
        features_dim,
        last_layer_dim_pi=64,
        last_layer_dim_vf=64
    ):
        super(MyACModel, self).__init__()

        # Store the output dimensions for the Actor and Critic
        self.latent_dim_pi = last_layer_dim_pi  # Output dimension of the policy network (Actor)
        self.latent_dim_vf = last_layer_dim_vf  # Output dimension of the value function network (Critic)

        # Define the Actor network
        self.actor = torch.nn.Sequential(
            torch.nn.Linear(features_dim, last_layer_dim_pi),
            torch.nn.ReLU(),
        )

        # Define the Critic network
        self.critic = torch.nn.Sequential(
            torch.nn.Linear(features_dim, last_layer_dim_vf),
            torch.nn.ReLU(),
        )

    def forward(self, x):
        """
        Forward pass for both Actor and Critic networks, used when the policy
        needs action and value latents together (e.g. while collecting rollouts).
        """
        actor_output = self.actor(x)
        critic_output = self.critic(x)
        return actor_output, critic_output

    def forward_actor(self, x):
        """
        Forward pass for the Actor network, used for policy prediction.
        """
        return self.actor(x)

    def forward_critic(self, x):
        """
        Forward pass for the Critic network, used for value function computation.
        """
        return self.critic(x)


class MyACPolicy(ActorCriticPolicy):
    def __init__(
        self,
        obs_space,
        action_space,
        lr_schedule,
        *args,
        **kwargs
    ):
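        # ortho_init must be set before super().__init__(), because the base
        # class builds and initializes the networks inside its constructor.
        kwargs["ortho_init"] = False  # Disable orthogonal initialization
        super().__init__(obs_space, action_space, lr_schedule, *args, **kwargs)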

    def _build_mlp_extractor(self) -> None:
        # Initialize the custom MLP extractor
        self.mlp_extractor = MyACModel(self.features_dim)
        # Set the output dimensions for the Actor and Critic
        self.latent_dim_pi = self.mlp_extractor.latent_dim_pi
        self.latent_dim_vf = self.mlp_extractor.latent_dim_vf

    def _train(self):
        pass


### **Key Components**

#### **1. Training Function (`train`)**
- **Inputs**: 
  - `NewEnv`: The environment class to be trained on.
- **Process**:
  - Creates a vectorized environment using `make_vec_env` with 10 instances of the environment. By default `make_vec_env` uses `DummyVecEnv`, which steps the instances one after another in a single process, so the benefit is batched rollout collection rather than true parallelism.
  - Initializes a PPO model with:
    - `MyACPolicy`: The custom policy class defined above.
    - `train_env`: The vectorized training environment.
  - Trains the model for `20,000` timesteps.
- **Output**: The trained PPO model.
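
The training log reports `20480` total timesteps rather than the requested `20,000`. A likely explanation is that PPO collects whole rollouts of `n_steps` steps per environment and only checks the budget between rollouts. The sketch below reproduces that arithmetic, assuming PPO's default `n_steps=2048`; `ppo_actual_timesteps` is a hypothetical helper, not a Stable-Baselines3 function.

```python
import math


def ppo_actual_timesteps(
    total_timesteps: int, n_envs: int, n_steps: int = 2048
) -> tuple[int, int]:
    """Return (iterations, timesteps actually collected)."""
    # One rollout gathers n_steps transitions from each of the n_envs copies.
    rollout_size = n_envs * n_steps
    # Training stops after the first rollout that meets or exceeds the budget.
    iterations = math.ceil(total_timesteps / rollout_size)
    return iterations, iterations * rollout_size


print(ppo_actual_timesteps(20_000, n_envs=10))  # (1, 20480)
```

With these settings the whole budget fits in a single rollout, which matches the `iterations` value of `1` in the log.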

#### **2. Testing Function (`test`)**
- **Inputs**:
  - `model`: The trained RL model.
  - `env`: A single instance of the environment for testing.
- **Process**:
  - Resets the environment to obtain the initial observation.
  - Runs a loop for up to `100` steps:
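    - Uses the trained model to predict an action from the current observation. `model.predict` samples from the policy by default (`deterministic=False`), so the printed trajectory can wander back and forth.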
    - Executes the action in the environment and receives:
      - The next observation.
      - Reward for the action.
      - Done flag (indicating the end of the episode).
    - Prints the current state (`obs`), action taken, reward received, and whether the episode is done.
    - Breaks out of the loop if the episode is complete (`done` is `True`).
- **Output**: None. Results are printed to the console.
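
A variant of this loop that returns a summary instead of printing each step might look like the sketch below. To keep it runnable on its own, it uses two hypothetical stand-ins: `AlwaysLeftModel` imitates the `(action, state)` return shape of `model.predict`, and `TinyGolf` imitates the environment's `reset`/`step` contract. Neither is part of the notebook. The loop also stops on `truncated`, which the environment here never sets but other environments may.

```python
class AlwaysLeftModel:
    """Hypothetical stand-in for a trained model: always chooses action 0."""

    def predict(self, obs, deterministic=True):
        return 0, None


class TinyGolf:
    """Hypothetical stand-in that follows the Gymnasium reset/step contract."""

    def reset(self):
        self.pos = 9
        return [self.pos], {}

    def step(self, action):
        self.pos += 1 if action == 1 else -1
        self.pos = max(0, min(10, self.pos))
        terminated = self.pos == 0
        return [self.pos], int(terminated), terminated, False, {}


def run_episode(model, env, max_steps=100):
    """Run one episode and return (steps taken, total reward)."""
    obs, _info = env.reset()
    total_reward, steps = 0, 0
    for _ in range(max_steps):
        # deterministic=True asks for the most likely action instead of a sample.
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _info = env.step(action)
        total_reward += reward
        steps += 1
        if terminated or truncated:
            break
    return steps, total_reward


print(run_episode(AlwaysLeftModel(), TinyGolf()))  # (9, 1)
```

Returning the step count makes it easy to compare a trained policy against the 9-step optimum.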

---

### **Execution Flow (`__main__`)**
1. **Training**:
   - Calls `train(GolfEnv)` to train the PPO model using the custom `GolfEnv` environment.
2. **Testing**:
   - Calls `test(model, GolfEnv())` to evaluate the trained model on a single instance of `GolfEnv`.

---

### **Use Case**
- This script demonstrates a full RL pipeline:
  - **Environment Setup**: Defines and uses a custom Gymnasium environment (`GolfEnv`).
  - **Policy Customization**: Implements a custom Actor-Critic policy (`MyACPolicy`).
  - **Training**: Trains the policy using PPO.
  - **Testing**: Evaluates the trained policy in the environment, displaying key results like actions, rewards, and terminal conditions.

This setup is a small, self-contained starting point for experimenting with custom RL environments and policy networks.

In [7]:
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.env_util import make_vec_env


def train(NewEnv):
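    # Wrap 10 copies of the environment in a vectorized environment.
    train_env = make_vec_env(NewEnv, n_envs=10)
    # verbose=1 prints the training log shown in the output below.
    model = PPO(MyACPolicy, env=train_env, verbose=1)
    model.learn(total_timesteps=20_000)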
    return model


def test(model, env):
    obs, info = env.reset()

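    for _ in range(100):
        # predict() samples from the policy by default (deterministic=False).
        action, _states = model.predict(obs)
        obs, reward, done, truncated, _ = env.step(action)
        print(f'obs: {obs}, action: {action}, reward: {reward}, done: {done}')
        # Stop when the episode ends for either reason.
        if done or truncated:
            break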


model = train(GolfEnv)
test(model, GolfEnv())


Using cpu device
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 111      |
|    ep_rew_mean     | 1        |
| time/              |          |
|    fps             | 12051    |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 20480    |
---------------------------------
obs: [8], action: 0, reward: 0, done: False
obs: [7], action: 0, reward: 0, done: False
obs: [8], action: 1, reward: 0, done: False
obs: [7], action: 0, reward: 0, done: False
obs: [6], action: 0, reward: 0, done: False
obs: [7], action: 1, reward: 0, done: False
obs: [8], action: 1, reward: 0, done: False
obs: [7], action: 0, reward: 0, done: False
obs: [6], action: 0, reward: 0, done: False
obs: [7], action: 1, reward: 0, done: False
obs: [8], action: 1, reward: 0, done: False
obs: [9], action: 1, reward: 0, done: False
obs: [8], action: 0, reward: 0, done: False
obs: [7], action: 0, reward: 0, done: False
obs: [8], action: 1, reward