
# Discovering a Reinforcement Learning API from a Black‑Box Environment (MuJoCo)

**Goal:** Without reading the environment’s source code, you will *infer* a standard RL API—
similar to Gym/Gymnasium—by probing a MuJoCo environment treated as a **black box**.

By the end, you should be able to:
- Identify the **state (observation) space** and **action space**.
- Use `reset()` to obtain an **initial state** (and `info`) and discuss reproducibility.
- Use `step(action)` to obtain **next_state, reward, terminated, truncated, info**.
- Understand and check **`terminated` vs `truncated`** (and why both exist).
- Log and visualize a **trajectory** and differentiate **state vs next_state**.
- Build the canonical training loop with `done = terminated or truncated`.

> You’re encouraged to treat the environment as a sealed box. Your tools are prints, shapes,
> bounds, and the API surface. Infer what’s going on from *observations*, not internals.


In [None]:

# --- Setup (run locally as needed) ---
# If you don't have these installed, uncomment and run:
# !pip install -U "gymnasium[mujoco]" mujoco matplotlib pandas

import sys, importlib, math, random, numpy as np, pandas as pd
import matplotlib.pyplot as plt

# Prefer Gymnasium (it exposes terminated/truncated cleanly).
try:
    gym = importlib.import_module("gymnasium")
except ImportError:
    # Fallback to classic gym (>=0.26 also has terminated/truncated)
    gym = importlib.import_module("gym")

print("Python:", sys.version.split()[0])
print("Numpy:", np.__version__)
print("Pandas:", pd.__version__)
print("Matplotlib:", plt.matplotlib.__version__)
print("Gym-like module:", gym.__name__, getattr(gym, "__version__", ""))



## 1) Make a **black‑box** MuJoCo environment

We’ll try a few common MuJoCo IDs until one works. Don’t worry about which one—
your task is to *infer* its API purely from interactions.


In [None]:

# Try a list of common MuJoCo env IDs and keep the first one that gym.make can create.
candidate_ids = [
    "Ant-v5", "Ant-v4",
    "HalfCheetah-v5", "HalfCheetah-v4",
    "Hopper-v5", "Hopper-v4",
    "Walker2d-v5", "Walker2d-v4"
]

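env_id = None
env = None
last_error = None  # keep the most recent failure so the final message is informative
for cid in candidate_ids:
    try:
        env = gym.make(cid)  # uses default TimeLimit if provided by the environment
        env_id = cid
        break
    except Exception as e:  # e.g., unknown or deprecated ID, missing MuJoCo bindings
        last_error = e

if env is None:
    raise RuntimeError(
        "Could not create a MuJoCo env. Install gymnasium[mujoco] (or gym[mujoco]) and MuJoCo. "
        f"Last error: {last_error!r}"
    )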

print("Using environment:", env_id)



## 2) Element A — **State (Observation) Space**

**Question:** What is the shape, dtype, and typical range of the **state** returned by the env?  
**Method:** Inspect `env.observation_space`, then actually call `reset()` to see a sample.

> Treat the state as a vector (or array). Don’t assume semantics—just record what you observe.


In [None]:

# Peek at the observation (state) space
obs_space = env.observation_space
print("Observation space type:", type(obs_space).__name__)
print("Observation space:", obs_space)

# Reset to get an initial observation without fixing the seed (yet)
state, info = env.reset()
print("\nInitial state shape:", np.shape(state), "dtype:", getattr(state, "dtype", type(state)))
print("Initial info keys:", list(info.keys()))
# A quick numeric summary (if it's a numeric Box)
if hasattr(obs_space, "low") and hasattr(obs_space, "high"):
    print("Obs low (first 5):", np.array(obs_space.low).ravel()[:5])
    print("Obs high(first 5):", np.array(obs_space.high).ravel()[:5])
print("Sample state preview (first 8):", np.array(state).ravel()[:8])
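
A declared observation space is often unbounded (`-inf` to `inf`), in which case `low` and `high` say nothing about the *typical* range. One way to estimate it is empirically, from observations gathered over a rollout. The sketch below assumes each observation can be flattened to a sequence of floats; `empirical_range` is a hypothetical helper (not part of Gym/Gymnasium), and the sample data is invented for illustration.

```python
import math

def empirical_range(observations):
    """Return per-dimension (min, max) over a list of equal-length vectors."""
    if not observations:
        raise ValueError("need at least one observation")
    dim = len(observations[0])
    lows = [math.inf] * dim
    highs = [-math.inf] * dim
    for obs in observations:
        if len(obs) != dim:
            raise ValueError("observations must all have the same length")
        for i, x in enumerate(obs):
            x = float(x)
            lows[i] = min(lows[i], x)
            highs[i] = max(highs[i], x)
    return list(zip(lows, highs))

# Stand-in data (assumption). In the notebook you would instead append
# np.asarray(state).ravel() after each reset()/step() call.
observed = [[0.1, -1.0, 3.0], [0.4, -0.5, 2.0], [-0.2, -2.0, 2.5]]
ranges = empirical_range(observed)
for i, (lo, hi) in enumerate(ranges):
    print(f"dim {i}: min={lo:+.2f} max={hi:+.2f}")
```

The estimate only covers states the policy actually visited, so a random policy may understate the range a trained policy would reach.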



## 3) Element B — **Action Space**

**Question:** What actions does the env accept? Continuous or discrete? What are the bounds?  
**Method:** Inspect `env.action_space`. Then sample a random action and attempt one step.


In [None]:

act_space = env.action_space
print("Action space type:", type(act_space).__name__)
print("Action space:", act_space)
if hasattr(act_space, "low") and hasattr(act_space, "high"):
    print("Action low (first 5):", np.array(act_space.low).ravel()[:5])
    print("Action high(first 5):", np.array(act_space.high).ravel()[:5])

# Try a single random step to *observe* the API
action = act_space.sample()
next_state, reward, terminated, truncated, info = env.step(action)

print("\n--- Single step probe ---")
print("Action shape:", np.shape(action))
print("Next state shape:", np.shape(next_state))
print("Reward (scalar):", reward, "| type:", type(reward).__name__)
print("terminated:", terminated, "| truncated:", truncated)
print("info keys:", list(info.keys()))
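
Actions drawn with `sample()` lie inside the declared bounds, but actions produced by a policy may not. How an environment treats out-of-range actions is not something the API surface tells you, so a common defensive step is to check and clip before calling `step()`. The sketch below uses plain Python lists; `in_bounds` and `clip_action` are hypothetical helpers, and the bounds of ±1 are an assumption for illustration rather than values read from a real env.

```python
def in_bounds(action, low, high):
    """True if every component lies within its [low, high] interval."""
    return all(lo <= a <= hi for a, lo, hi in zip(action, low, high))

def clip_action(action, low, high):
    """Clamp each component of the action to its [low, high] interval."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]

low, high = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]  # assumed bounds
raw = [0.3, -1.7, 2.5]                            # e.g., an unclipped policy output
safe = clip_action(raw, low, high)

print("raw in bounds: ", in_bounds(raw, low, high))
print("safe in bounds:", in_bounds(safe, low, high), safe)
```

With NumPy arrays, `np.clip(action, act_space.low, act_space.high)` does the same clamping in one call.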



## 4) Element C — **Reset semantics, seeding, and (maybe) initial state control**

Most black‑box envs offer:
- `state, info = env.reset()`
- Optional `seed=` for reproducibility.
- Sometimes `options={...}` to control initial conditions (not guaranteed).

**Task:** Show that seeding produces reproducible initial states. Then *attempt* to set a custom
initial state via `options` to see if the env supports it. If not, note the limitation.


In [None]:

# Reproducibility via seeds
s1, _ = env.reset(seed=123)
s2, _ = env.reset(seed=123)
s3, _ = env.reset(seed=456)

print("Reset with seed=123 identical?", np.allclose(np.asarray(s1), np.asarray(s2)))
print("Reset seed=123 vs 456 identical?", np.allclose(np.asarray(s1), np.asarray(s3)))

# Attempt: pass custom options. Accepting the argument is not the same as honoring it:
# an env may take `options` and silently ignore it, so we also compare the result to the request.
requested = np.zeros_like(np.asarray(s1))
accepts_options = True
supports_options = False
try:
    # This is a *probe*: the "state" key is arbitrary, chosen only to see how the env reacts.
    s_custom, info_custom = env.reset(options={"state": requested})
    supports_options = bool(np.allclose(np.asarray(s_custom), requested))
    print("reset(options=...) ran without error. New state (first 8):", np.array(s_custom).ravel()[:8])
except TypeError as e:
    accepts_options = False
    print("This env does not accept an 'options' argument in reset():", e)
except Exception as e:
    accepts_options = False
    print("Tried options in reset(), but encountered:", repr(e))

print("Accepts an options argument in reset()?", accepts_options)
print("Returned state matches the requested one?", supports_options)



## 5) `terminated` vs `truncated` and the canonical control loop

- **`terminated`:** The episode ended for *task-defined* reasons (e.g., the agent fell).
- **`truncated`:** The episode ended due to an *external limit* (e.g., time limit reached).
- **`done = terminated or truncated`** is the standard loop condition.

**Task:** Run a short episode with random actions. Detect and report whether it ended by
termination or truncation.


In [None]:

def run_one_episode_random(env, max_env_steps=None, verbose=True):
    state, info = env.reset()
    total_reward = 0.0
    steps = 0
    while True:
        action = env.action_space.sample()
        next_state, reward, terminated, truncated, info = env.step(action)
        total_reward += float(reward)
        steps += 1
        if verbose:
            print(f"t={steps:<3} reward={reward: .3f} terminated={terminated} truncated={truncated}")
        if terminated or truncated:
            reason = "terminated" if terminated else "truncated"
            if verbose:
                print(f"Episode ended by **{reason}** after {steps} steps; total_reward={total_reward:.2f}")
            break
        if max_env_steps is not None and steps >= max_env_steps:
            if verbose:
                print("Stopping early (max_env_steps reached).")
            break
        state = next_state
    return steps, total_reward

_ = run_one_episode_random(env, verbose=False)
print("Ran a silent random episode to check everything is wired correctly.")
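
One reason both flags exist is value bootstrapping. After a `terminated` step there is no future, so the value of the next state should be treated as zero; after a `truncated` step the episode was merely cut off, so the next state still has value. The sketch below shows a one-step TD target under that convention. `td_target` is a hypothetical helper, and the numbers (reward 1.0, next-state value estimate 10.0, discount 0.9) are made up for illustration.

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step bootstrapped target; only `terminated` zeroes the future value."""
    bootstrap = 0.0 if terminated else next_value
    return reward + gamma * bootstrap

# Same reward and same estimate of V(next_state), but different endings.
target_terminated = td_target(1.0, 10.0, terminated=True, gamma=0.9)
target_truncated = td_target(1.0, 10.0, terminated=False, gamma=0.9)

print("terminated:", target_terminated)
print("truncated: ", target_truncated)
```

Collapsing the two flags into a single `done` before computing targets would treat a time-limit cutoff as if the task had truly ended, which biases value estimates downward near the limit.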



## 6) Demonstrate **truncation** explicitly (TimeLimit wrapper)

To make a `truncated=True` ending easy to observe, wrap the env with a small `max_episode_steps`.  
With random actions the time limit is usually reached before any task-defined termination, although an early `terminated=True` is still possible.


In [None]:

# Build a fresh environment instance for a clean demo.
# gym.make already applies the env's registered time limit (if any); the shorter
# outer TimeLimit below is reached first, so truncation shows up after 25 steps.
env_trunc = gym.make(env_id)
env_trunc = gym.wrappers.TimeLimit(env_trunc, max_episode_steps=25)

print("Running with an explicit TimeLimit (25 steps) to trigger truncation...")
steps, total_reward = run_one_episode_random(env_trunc, verbose=True)
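
To see why truncation is *external* to the task, it can help to build a time limit by hand. The sketch below is a toy illustration in pure Python, not the real `TimeLimit` implementation: `ToyEnv` and `StepLimit` are hypothetical classes that only mimic the five-value `step()` signature used in this notebook.

```python
class ToyEnv:
    """Hypothetical stand-in: a counter that terminates once it reaches `goal`."""
    def __init__(self, goal=1000):
        self.goal = goal
        self.x = 0

    def reset(self, seed=None, options=None):
        self.x = 0
        return [float(self.x)], {}

    def step(self, action):
        self.x += int(action)
        terminated = self.x >= self.goal  # task-defined ending
        return [float(self.x)], 1.0, terminated, False, {}


class StepLimit:
    """Minimal time-limit wrapper: sets truncated=True once max_steps is reached."""
    def __init__(self, env, max_steps):
        self.env, self.max_steps, self.elapsed = env, max_steps, 0

    def reset(self, **kwargs):
        self.elapsed = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_steps:
            truncated = True  # external limit; the inner env knows nothing about it
        return obs, reward, terminated, truncated, info


def run(env):
    """Step with a constant action until either flag is set."""
    env.reset()
    steps = 0
    while True:
        _, _, terminated, truncated, _ = env.step(1)
        steps += 1
        if terminated or truncated:
            return steps, terminated, truncated


limit_first = run(StepLimit(ToyEnv(goal=1000), max_steps=25))
goal_first = run(StepLimit(ToyEnv(goal=10), max_steps=25))
print("limit reached first:", limit_first)
print("goal reached first: ", goal_first)
```

The inner env never sets `truncated` itself; only the wrapper does, which mirrors the division of labor between a task and its time limit.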



## 7) Logging and visualizing a **trajectory**

We’ll collect the sequence
$$(s_0, a_0, s_1, a_1, s_2, \dots)$$
along with rewards and flags, store it, and plot a few signals.

> For high‑dimensional states/actions, we’ll show just the first few components for readability.


In [None]:

def rollout_random(env, max_steps=200):
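    s, info = env.reset(seed=42)
    env.action_space.seed(42)  # the action sampler has its own RNG; seed it so the rollout is repeatable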
    traj_alt = []  # alternating list: s0, a0, s1, a1, ...
    rows = []      # tabular view

    terminated = truncated = False
    t = 0
    while not (terminated or truncated) and t < max_steps:
        a = env.action_space.sample()
        s_next, r, terminated, truncated, info = env.step(a)

        traj_alt.append(np.asarray(s))
        traj_alt.append(np.asarray(a))

        rows.append({
            "t": t,
            "reward": float(r),
            "terminated": bool(terminated),
            "truncated": bool(truncated),
            # show only first few dims for compactness
            "s0": float(np.asarray(s).ravel()[0]) if np.asarray(s).size > 0 else np.nan,
            "s1": float(np.asarray(s).ravel()[1]) if np.asarray(s).size > 1 else np.nan,
            "a0": float(np.asarray(a).ravel()[0]) if np.asarray(a).size > 0 else np.nan,
            "a1": float(np.asarray(a).ravel()[1]) if np.asarray(a).size > 1 else np.nan,
        })
        s = s_next
        t += 1

    # Append the final state at the end to complete (..., s_T)
    traj_alt.append(np.asarray(s))
    return traj_alt, pd.DataFrame(rows)

traj_alt, traj_df = rollout_random(env_trunc, max_steps=100)
print(f"Alternating list length = {len(traj_alt)} (should be 2*T + 1)")
print("First 3 entries types:", type(traj_alt[0]).__name__, type(traj_alt[1]).__name__, type(traj_alt[2]).__name__)
print(traj_df.head())


In [None]:

# Display the DataFrame as a table (rich rendering in Jupyter, plain text otherwise)
try:
    from IPython.display import display
    display(traj_df)
except ImportError:
    # Fallback outside IPython/Jupyter: print the table as text
    print(traj_df.to_string())



### Plot a couple of simple signals

- Reward per step  
- One state component over time (if available)

> Note: We use plain Matplotlib (no seaborn), single plot per figure, and default colors.


In [None]:

# Reward per step
plt.figure()
plt.plot(traj_df["t"], traj_df["reward"])
plt.xlabel("t")
plt.ylabel("reward")
plt.title("Reward per step")
plt.show()


In [None]:

# First state component (if present)
if "s0" in traj_df:
    plt.figure()
    plt.plot(traj_df["t"], traj_df["s0"])
    plt.xlabel("t")
    plt.ylabel("state[0]")
    plt.title("First state component over time")
    plt.show()



## 8) Reconstructing the **Gym‑style API** (what you should have *inferred*)

- **Reset:**  
  `state, info = env.reset(seed=..., options=...)`  
  Returns initial **state** and an **info** dict. `seed` controls reproducibility.  
  `options` may allow setting initial conditions (if implemented).

- **Step:**  
  `next_state, reward, terminated, truncated, info = env.step(action)`  
  - `next_state`: same space as `state` (shape/dtype inferred from `observation_space`).  
  - `reward`: scalar float.  
  - `terminated`: task success/failure or natural terminal condition.  
  - `truncated`: episode cut short by external limit (e.g., `TimeLimit`).  
  - **Check `done = terminated or truncated`.**

- **Spaces:**  
  - `env.observation_space` (e.g., `Box(low, high, shape, dtype)`)  
  - `env.action_space` (e.g., `Box` for continuous torques).  
  You can `sample()` from spaces to probe shapes and valid ranges.

- **Trajectory:**  
  Log tuples `(s_t, a_t, r_t, s_{t+1}, terminated, truncated, info)` or the alternating list  
  `(s0, a0, s1, a1, ..., s_T)`. Visualize to build intuition.
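
The two trajectory layouts carry the same information. As a minimal sketch, the alternating list can be regrouped into `(state, action, next_state)` triples, which makes the state vs. next_state distinction explicit: the `next_state` of step *t* is the `state` of step *t+1*. `Transition` and `to_transitions` are hypothetical names, and the toy trajectory uses one-element lists in place of real observation and action arrays.

```python
from typing import List, NamedTuple, Sequence

class Transition(NamedTuple):
    state: Sequence[float]
    action: Sequence[float]
    next_state: Sequence[float]

def to_transitions(traj_alt: list) -> List[Transition]:
    """Regroup [s0, a0, s1, a1, ..., s_T] into T (s_t, a_t, s_{t+1}) triples."""
    if len(traj_alt) % 2 != 1:
        raise ValueError("expected an odd length of 2*T + 1")
    return [
        Transition(traj_alt[i], traj_alt[i + 1], traj_alt[i + 2])
        for i in range(0, len(traj_alt) - 1, 2)
    ]

# Toy trajectory with T = 2: s0, a0, s1, a1, s2
traj = [[0.0], [1.0], [0.5], [-1.0], [0.2]]
transitions = to_transitions(traj)

for t, tr in enumerate(transitions):
    print(t, tr)
# The next_state of step 0 is the very same object as the state of step 1.
print(transitions[0].next_state is transitions[1].state)
```

Rewards and the two flags are left out here because the alternating list does not store them; the tuple form is the one to log if you need them later.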



## 9) Exercises (for you to explore)

1. **Bounds sanity check:** Sample 1,000 random actions and verify they always lie within `action_space` bounds.
2. **Seed reproducibility:** Show that with a fixed `seed`, both the initial state and the first *k* steps (under a fixed action sequence) are reproducible.
3. **Terminated vs truncated:** Increase the `TimeLimit` to a large number and try to cause a `terminated=True` naturally (e.g., make the agent fall). What qualitative differences do you observe?
4. **Custom reset options:** If `options` are not supported, wrap the env in your own `ResettableWrapper` that saves the last state and restores it on reset (advanced; requires env-specific state setters). Discuss why true black‑box envs may *not* permit arbitrary state setting.
5. **Policy stub:** Replace the random policy with a zero action (or small PD controller) and compare trajectories.
6. **Replay:** Store the trajectory and replay it (e.g., with a video wrapper) to visually connect `terminated` vs `truncated` with behavior. (Requires ffmpeg; optional.)
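
As a starting point for Exercise 2, the sketch below shows the shape of a reproducibility check: roll out twice with the same seed and the same fixed action sequence, then compare the visited states. `NoisyToyEnv` is a hypothetical stand-in so the example runs without MuJoCo; it assumes, as the exercise does, that `reset(seed=...)` seeds all randomness inside the env. With a real env, observations are arrays, so you would compare them with `np.allclose` instead of `==`.

```python
import random

class NoisyToyEnv:
    """Hypothetical env: random initial state and noisy dynamics, both seeded by reset()."""
    def __init__(self):
        self.rng = random.Random()
        self.x = 0.0

    def reset(self, seed=None, options=None):
        if seed is not None:
            self.rng = random.Random(seed)
        self.x = self.rng.uniform(-0.1, 0.1)
        return [self.x], {}

    def step(self, action):
        self.x += float(action) + self.rng.gauss(0.0, 0.01)
        return [self.x], -abs(self.x), False, False, {}


def rollout(env, seed, actions):
    """Return the list of states visited under a fixed action sequence."""
    obs, _ = env.reset(seed=seed)
    visited = [list(obs)]
    for a in actions:
        obs, _, terminated, truncated, _ = env.step(a)
        visited.append(list(obs))
        if terminated or truncated:
            break
    return visited


actions = [0.1, -0.2, 0.05, 0.0]  # fixed action sequence (k = 4)
same_seed_matches = rollout(NoisyToyEnv(), 123, actions) == rollout(NoisyToyEnv(), 123, actions)
other_seed_differs = rollout(NoisyToyEnv(), 123, actions) != rollout(NoisyToyEnv(), 456, actions)
print("same seed reproduces trajectory:", same_seed_matches)
print("different seed changes it:      ", other_seed_differs)
```

The actions must be fixed in advance (or drawn from a separately seeded sampler); otherwise a mismatch could come from the policy rather than from the environment.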


In [None]:

# Optional: close envs when done (each in its own try so one failure doesn't skip the other)
for e in (env, env_trunc):
    try:
        e.close()
    except Exception:
        pass
