# EECS 598 Lab 7: Simplified Rapid Motor Adaptation (Phase 1)

![rma](https://www.therobotreport.com/wp-content/uploads/2021/07/legged-robot-adapts.jpg)

This notebook is worth **80 points**. Your score will be calculated as `score = min(score, 80)`.
Questions and implementations are marked with relevant `#TODO(student)` markers.

Before starting the assignment, please put your name and UMID in the following format:

Firstname LASTNAME, #00000000 (ex. Drew SCHEFFER #31415926)

**YOUR ANSWER**

SHIVAM UDESHI, #87841376

## Setup

In [1]:
import sys, types, importlib

# Create a tiny fake 'imp' module exposing only 'reload'
_imp = types.ModuleType("imp")
_imp.reload = importlib.reload
sys.modules["imp"] = _imp

# load autoreload
%load_ext autoreload
%autoreload 2

In [2]:
print('Setting environment variable to use GPU rendering:')
%env MUJOCO_GL=egl
%env XLA_PYTHON_CLIENT_PREALLOCATE=false

Setting environment variable to use GPU rendering:
env: MUJOCO_GL=egl
env: XLA_PYTHON_CLIENT_PREALLOCATE=false


In [3]:
#@title Import packages for plotting and creating graphics
import time
import itertools
import numpy as np
from typing import Callable, NamedTuple, Optional, Union, List

# Graphics and plotting.
print('Installing mediapy:')
!command -v ffmpeg >/dev/null || (apt update && apt install -y ffmpeg)
!pip install -q mediapy
import mediapy as media
import matplotlib.pyplot as plt

# More legible printing from numpy.
np.set_printoptions(precision=3, suppress=True, linewidth=100)

Installing mediapy:


### Google Colab Setup

Next, we'll run a few commands to set up the environment on Google Colab. If you are running this notebook locally, you can skip this section.

Run the following to mount this notebook to your Google Drive. Follow the link and sign into the Google account following the prompts. Use the same Google account that you used to store this notebook.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Now update the path below to point to the folder in your Google Drive where you uploaded this notebook. If everything worked correctly, you should see at least the following filenames: [`07_lab_student.ipynb`, `EECS598RSLRLBraxWrapper.py`, `rma_go1_locomote.py`, `rma_rsl_rl/`]

In [5]:
import os

# TODO: Fill in the Google Drive path where you uploaded this lab
# Example: If you create a 2025FA folder and put all the files under Lab7, then '2025FA/Lab7'
# GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = '2025FA/Lab7'

GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = '/content/drive/MyDrive/CSE_598/lab7-rma-phase1'
GOOGLE_DRIVE_PATH_LAB7 = os.path.join('drive', 'My Drive', GOOGLE_DRIVE_PATH_AFTER_MYDRIVE)

print(os.listdir(GOOGLE_DRIVE_PATH_LAB7))

# Add to path and change directory for good measure
sys.path.append(GOOGLE_DRIVE_PATH_LAB7)
os.chdir(GOOGLE_DRIVE_PATH_LAB7)

['EECS598RSLRLBraxWrapper.py', '.DS_Store', 'rma_go1_locomote.py', 'media', 'rma_rsl_rl', 'logs', '__pycache__', '07_lab_student.ipynb']


In [6]:
from google.colab import files

import distutils.util
import os
import subprocess
if subprocess.run('nvidia-smi').returncode:
  raise RuntimeError(
      'Cannot communicate with GPU. '
      'Make sure you are using a GPU Colab runtime. '
      'Go to the Runtime menu and select Choose runtime type.')

# Add an ICD config so that glvnd can pick up the Nvidia EGL driver.
# This is usually installed as part of an Nvidia driver package, but the Colab
# kernel doesn't install its driver via APT, and as a result the ICD is missing.
# (https://github.com/NVIDIA/libglvnd/blob/master/src/EGL/icd_enumeration.md)
NVIDIA_ICD_CONFIG_PATH = '/usr/share/glvnd/egl_vendor.d/10_nvidia.json'
if not os.path.exists(NVIDIA_ICD_CONFIG_PATH):
  with open(NVIDIA_ICD_CONFIG_PATH, 'w') as f:
    f.write("""{
    "file_format_version" : "1.0.0",
    "ICD" : {
        "library_path" : "libEGL_nvidia.so.0"
    }
}
""")

### Install customized rsl_rl library

In [7]:
%cd rma_rsl_rl

/content/drive/MyDrive/CSE_598/lab7-rma-phase1/rma_rsl_rl


In [8]:
!pip install -e .

Obtaining file:///content/drive/MyDrive/CSE_598/lab7-rma-phase1/rma_rsl_rl
  Preparing metadata (setup.py) ... done
Installing collected packages: rsl_rl
  Attempting uninstall: rsl_rl
    Found existing installation: rsl_rl 1.0.2
    Uninstalling rsl_rl-1.0.2:
      Successfully uninstalled rsl_rl-1.0.2
  Running setup.py develop for rsl_rl
Successfully installed rsl_rl-1.0.2


In [9]:
%cd ..

/content/drive/MyDrive/CSE_598/lab7-rma-phase1


If the cell below throws an error, you likely need to restart your runtime.

In [10]:
import rsl_rl
from rsl_rl.runners import OnPolicyRunnerRMA

## Mujoco, JAX, MJX, BRAX, and Playground Setup & Imports

In [11]:
!pip install mujoco
!pip install mujoco_mjx
!pip install brax
!pip install noise

# TODO(student): If you're running this locally, make sure to install cuda enabled jax via something like:
# !pip install "jax[cuda12]"

Collecting mujoco
  Downloading mujoco-3.3.7-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (41 kB)
Collecting glfw (from mujoco)
  Downloading glfw-2.10.0-py2.py27.py3.py30.py31.py32.py33.py34.py35.py36.py37.py38.p39.p310.p311.p312.p313-none-manylinux_2_28_x86_64.whl.metadata (5.4 kB)
Downloading mujoco-3.3.7-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.9 MB)
Downloading glfw-2.10.0-py2.py27.py3.py30.py31.py32.py33.py34.py35.py36.py37.py38.p39.p310.p311.p312.p313-none-manylinux_2_28_x86_64.whl (243 kB)

In [12]:
!pip install playground
!pip install wandb
!pip install tensorboard

Collecting playground
  Downloading playground-0.0.5-py3-none-any.whl.metadata (8.7 kB)
Downloading playground-0.0.5-py3-none-any.whl (7.4 MB)
Installing collected packages: playground
Successfully installed playground-0.0.5


In [13]:
import os

try:
  print('Checking that the installation succeeded:')
  import mujoco
  mujoco.MjModel.from_xml_string('<mujoco/>')
except Exception as e:
  raise e from RuntimeError(
      'Something went wrong during installation. Check the shell output above '
      'for more information.\n'
      'If using a hosted Colab runtime, make sure you enable GPU acceleration '
      'by going to the Runtime menu and selecting "Choose runtime type".')

print('Installation successful.')

# Tell XLA to use Triton GEMM, this improves steps/sec by ~30% on some GPUs
xla_flags = os.environ.get('XLA_FLAGS', '')
xla_flags += ' --xla_gpu_triton_gemm_any=True'
os.environ['XLA_FLAGS'] = xla_flags

Checking that the installation succeeded:
Installation successful.


Ensure that the output of the following cell is `[CudaDevice(id=0)]`

In [14]:
import jax
print(jax.devices())

[CudaDevice(id=0)]


In [15]:
#@title Import MuJoCo, MJX, and Brax
from datetime import datetime
from etils import epath
import functools
from IPython.display import HTML
from typing import Any, Dict, Sequence, Tuple, Union
import os
from ml_collections import config_dict


import jax
from jax import numpy as jp
import numpy as np
from flax.training import orbax_utils
from flax import struct
from matplotlib import pyplot as plt
import mediapy as media
from orbax import checkpoint as ocp

import mujoco
from mujoco import mjx

from brax import base
from brax import envs
from brax import math
from brax.base import Base, Motion, Transform
from brax.base import State as PipelineState
from brax.envs.base import Env, PipelineEnv, State
from brax.mjx.base import State as MjxState
from brax.training.agents.ppo import train as ppo
from brax.training.agents.ppo import networks as ppo_networks
from brax.io import html, mjcf, model

Failed to import warp: No module named 'warp'
Failed to import mujoco_warp: No module named 'warp'


In [16]:

import os

xla_flags = os.environ.get("XLA_FLAGS", "")
xla_flags += " --xla_gpu_triton_gemm_any=True"
os.environ["XLA_FLAGS"] = xla_flags
os.environ["MUJOCO_GL"] = "egl"

from datetime import datetime
import json

from absl import app
from absl import flags
from absl import logging
import jax
import mediapy as media
from ml_collections import config_dict
import mujoco
import torch

import mujoco_playground
from mujoco_playground import registry
from mujoco_playground import wrapper_torch
from mujoco_playground import wrapper
from mujoco_playground.config import locomotion_params
from mujoco_playground.config import manipulation_params
from mujoco_playground.config import dm_control_suite_params

# Suppress logs if you want
logging.set_verbosity(logging.WARNING)

mujoco_menagerie not found. Downloading...


Cloning mujoco_menagerie: ██████████| 100/100 [00:38<00:00]


Checking out commit 14ceccf557cc47240202f2354d684eca58ff8de4
Successfully downloaded mujoco_menagerie


# Understanding Rapid Motor Adaptation (RMA)

![here](https://ar5iv.labs.arxiv.org/html/2107.04034/assets/x1.png)

Today, we'll begin to investigate one popular Sim2Real reinforcement learning paradigm: learning (and eventually distilling) from privileged information available in simulation. One popular implementation of this idea is [Rapid Motor Adaptation for Legged Robots (RMA)](https://ashish-kmr.github.io/rma-legged-robots/). In this formulation, training a robotic locomotion policy is split into two main phases, as shown in the figure above.

In **phase 1**, a motor policy, $\pi$, is trained that takes as input the state $x_t$, the previous action $a_{t-1}$, and a learned environment latent vector $z_t$. The latent encodes features of the environment that may be relevant for adapting the robot's policy but may not be directly observable when the robot is deployed in the real world (such as friction, center of mass position, etc.). In this sense, you can think of $z_t$ as conditioning the policy. At this stage, however, the policy cannot be applied directly on a real robot, because it requires access to privileged state to generate $z_t$.

In **phase 2**, the motor policy is frozen (no longer updated), and an adaptation module is trained to *approximate* the environment feature $z_t$ using *only* the robot's extended state history $x_{1...T}, a_{1...T}$. Crucially, since we're training both phases in simulation, we can roll out the phase 1 privileged information encoder to get "ground truth" $z_t$ labels for each timestep. So, we can do simple supervised learning to estimate $\hat{z}_t$ from state history!
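The supervised step in phase 2 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data is synthetic, the sizes are hypothetical, and a plain linear map stands in for whatever network the adaptation module actually uses. The point is just that $\hat{z}_t$ is fit to the encoder's $z_t$ labels with a mean squared error loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 256 samples, a flattened history of 20 features, 8-dim latent.
num_samples, history_dim, latent_dim = 256, 20, 8

# Synthetic stand-ins: `history` plays the role of the flattened state/action
# history, `z_true` plays the role of latents from the frozen phase 1 encoder.
history = rng.normal(size=(num_samples, history_dim))
z_true = history @ rng.normal(size=(history_dim, latent_dim))

# Linear "adaptation module": z_hat = history @ W.
W = np.zeros((history_dim, latent_dim))


def mse(weights):
    """Mean squared error between predicted and target latents."""
    return float(np.mean((history @ weights - z_true) ** 2))


initial_loss = mse(W)
learning_rate = 1.0
for _ in range(200):
    residual = history @ W - z_true
    # Gradient of the mean squared error with respect to W.
    grad = 2.0 * history.T @ residual / (num_samples * latent_dim)
    W -= learning_rate * grad
final_loss = mse(W)

print(f"loss before: {initial_loss:.4f}, loss after: {final_loss:.2e}")
```

In the real setting, the labels would come from rolling out the trained phase 1 policy in simulation rather than from a random linear map.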

In today's lab, we'll just focus on getting used to the structure of privileged information and implement some key parts of **phase 1** training. I suggest that you briefly look through [the paper](https://ashish-kmr.github.io/rma-legged-robots/rma-locomotion-final.pdf).

`TODO(student):` Based on the paper, how does this learning strategy compare to simple domain randomization? In other words, what's one negative effect that naive domain randomization can have on a trained policy? **(10 points)**.

**[Answer Here]**

In the RMA paper, the authors explain that simple domain randomization, where environment parameters such as friction, mass, or slopes are varied randomly during training, can make the policy too broad and conservative. Because the robot must handle every possible variation it sees, it often learns a “safe” but less optimal behavior that doesn't adapt well to specific real-world conditions. In contrast, RMA lets the policy adapt quickly to new environments by estimating hidden physical properties (like terrain friction) from its recent motion history. This makes it more efficient and robust, allowing the robot to adjust on the fly instead of relying on overly generalized behavior learned from randomization alone.



# Set Up the Environment for RMA
The first step of training Phase 1 of RMA is setting up the structure of the task. This includes modifying the MujocoPlayground environment to define the relevant reward functions, the observation space, and notably the *privileged observation space*. Most of this setup has already been done, but fill in the TODOs in `rma_go1_locomote.py` to finish defining the `obs` and `privileged_obs`.

`TODO(student):` Implement the TODOs in `rma_go1_locomote.py` **(30 points)**



In [17]:
from mujoco_playground._src.locomotion.go1 import base as go1_base
import inspect

# List all methods in Go1Env
methods = inspect.getmembers(go1_base.Go1Env, predicate=inspect.isfunction)
for name, func in methods:
    print(name)


__init__
get_accelerometer
get_feet_pos
get_global_angvel
get_global_linvel
get_gravity
get_gyro
get_local_linvel
get_upvector
render
reset
step


In [18]:
# Show source code
print(inspect.getsource(go1_base.Go1Env.get_local_linvel))


  def get_local_linvel(self, data: mjx.Data) -> jax.Array:
    return mjx_env.get_sensor_data(
        self.mj_model, data, consts.LOCAL_LINVEL_SENSOR
    )



In [None]:
# Test out to see what the privileged information is...
from rma_go1_locomote import LocomotionRMAEnv, go1_rma_default_config

env_cfg = go1_rma_default_config()
env = LocomotionRMAEnv(task="flat_terrain", config=env_cfg)

jit_reset = jax.jit(env.reset)
jit_step = jax.jit(env.step)

state = jit_reset(jax.random.PRNGKey(0))

policy_obs = state.obs["state"]
print(f"State (dim={policy_obs.shape}):     ", policy_obs)

env_obs = state.obs["privileged_state"]
print(f"Privileged State (dim={env_obs.shape}): ", env_obs)



State (dim=(45,)):      [ 0.142 -0.47   0.402  0.467 -0.306 -0.29   0.     0.    -1.     0.     0.     0.     0.     0.
  0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.
  0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.
  0.     0.     0.   ]
Privileged State (dim=(6,)):  [ 0.6   12.743  0.     0.     0.     0.   ]


`TODO(student):` Take a look at the reward functions used to train the quadruped in the paper. In contrast to the locomotion RL setups that we have seen in our previous labs, the original RMA implementation doesn't include gait-based reward terms encouraging things like foot height, or foot contact time. How do the authors propose to have natural gaits (walking patterns) emerge without this reward engineering?  **(10 points)**

**[Answer Here]**

In the original RMA paper, the authors avoid manually engineering gait-based rewards like foot height or contact time. Instead, they rely on natural constraints: the reward is motivated by bioenergetics, penalizing mechanical work and ground impact, and the policy is trained on uneven terrain. The policy is trained to satisfy task-level objectives (like forward velocity tracking and stability penalties) under these constraints, and it discovers efficient, periodic walking patterns without needing explicit foot placement rewards. Essentially, the gaits emerge implicitly as a consequence of optimizing for overall task performance, rather than being directly enforced.


Note: we actually won't be doing this in this lab because it significantly slows down learning. But feel free to implement and run it on your own if you're curious!
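The paper's reward design favors energy- and impact-style penalties over gait-specific terms. As a rough illustration, the sketch below implements two such penalties in NumPy. The function names, argument names, and exact functional forms are assumptions made for illustration; take the actual terms and their weights from the paper.

```python
import numpy as np


def work_penalty(torque, q, q_prev):
    """Negative absolute mechanical work done by the joints over one step."""
    return -abs(float(np.dot(torque, q - q_prev)))


def ground_impact_penalty(foot_force, foot_force_prev):
    """Negative squared change in foot contact forces between two steps."""
    return -float(np.sum((foot_force - foot_force_prev) ** 2))


# Toy values for a 3-joint, 4-foot example (illustrative only).
torque = np.array([1.0, -2.0, 0.5])
q_prev = np.array([0.0, 0.1, 0.2])
q = np.array([0.1, 0.0, 0.2])

foot_force_prev = np.array([8.0, 0.0, 5.0, 1.0])
foot_force = np.array([10.0, 0.0, 5.0, 0.0])

work = work_penalty(torque, q, q_prev)
impact = ground_impact_penalty(foot_force, foot_force_prev)
print(work, impact)
```

Both terms are non-positive, so a policy that moves with less joint work and softer foot strikes collects more reward without any explicit description of a gait.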

# Modifying RSL_RL To Train RMA Phase 1
To train Phase 1 of our simplified RMA policy, we’ll need additional components beyond the standard reinforcement learning setup. In particular, this phase requires mechanisms for handling privileged information and training the environment encoder, which are features not present in vanilla RL pipelines.

To do this, we’ll build off of the familiar [RSL_RL](https://github.com/leggedrobotics/rsl_rl) framework discussed in previous labs, extending its training loop to incorporate these new elements.

Our adapted RSL_RL library resides in the `rma_rsl_rl` folder. There are a few important files to look at:
- `./rsl_rl/runners/on_policy_runner_rma.py`. This file contains the main training and logging loop and is responsible for rolling out the policy in simulation.
- `./rsl_rl/algorithms/ppo_priv.py`. This is a slightly altered version of Proximal Policy Optimization (PPO) that is adjusted to properly handle the privileged environmental data.
- `./rsl_rl/modules/actor_critic_latent.py`. This is the main change for Phase 1 where we define and use the "environment encoder" as a submodule of the actor critic policy.


In essence, there are two main things that we had to do to adapt rsl_rl to work in our case:

1. (Already Done) We had to change how the Runner and PPO scripts accept and handle privileged information. This is mostly boilerplate code. It is necessary because the original rsl_rl implementation uses a different meaning of "privileged information" focused on [Asynchronous Advantage Actor-Critic (A3C)](https://arxiv.org/pdf/1602.01783).

2. (Your Job) Instead of using the vanilla ActorCritic policy that would map (obs, priv_obs) -> actions, follow the structure of RMA to map (obs, priv_obs) -> (obs, z_t) -> actions. This can be done by creating a new policy, defined in `./rsl_rl/modules/actor_critic_latent.py`.
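The mapping (obs, priv_obs) -> (obs, z_t) -> actions can be illustrated with a framework-agnostic NumPy sketch. This is not the lab's `ActorCriticLatent` implementation and does not use the rsl_rl API: the helper names, layer sizes, `latent_dim`, and `num_actions` are all hypothetical, and the weights are random. Only the observation sizes 45 and 6 are taken from the shapes printed earlier in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_mlp(sizes):
    """Random weights for a small MLP; sizes = [in, hidden..., out]."""
    return [
        (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def apply_mlp(params, x):
    """Forward pass with tanh on hidden layers and a linear output layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x


# 45 and 6 match the printed observation shapes; the rest is hypothetical.
obs_dim, priv_dim, latent_dim, num_actions = 45, 6, 8, 12

env_encoder = make_mlp([priv_dim, 32, latent_dim])
actor = make_mlp([obs_dim + latent_dim, 64, num_actions])


def encode_latent(priv_obs):
    """priv_obs -> z_t: the environment encoder on its own."""
    return apply_mlp(env_encoder, priv_obs)


def act(obs, priv_obs):
    """(obs, priv_obs) -> (obs, z_t) -> action means."""
    z = encode_latent(priv_obs)
    return apply_mlp(actor, np.concatenate([obs, z], axis=-1))


batch = 4
obs = rng.normal(size=(batch, obs_dim))
priv_obs = rng.normal(size=(batch, priv_dim))

latents = encode_latent(priv_obs)
actions = act(obs, priv_obs)
print(latents.shape, actions.shape)
```

The key design point is that the actor never sees `priv_obs` directly; it only sees the compressed latent, which is what makes it possible to swap in an estimated latent later.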

`TODO(student):` Fill in the TODOs in `./rsl_rl/modules/actor_critic_latent.py` to follow the structure of RMA. Specifically, implement the `__init__`, `forward`, and `only_latent` functions. **(30 points)**

# Training Using Privileged Information (optional)
In this section, you can combine the above steps to train a Phase-1 RMA policy that learns to encode and use privileged environment features $z_t$.

Training your policy can be a bit finicky at this point, especially using Colab resources. Don't expect the final policy to be of the same quality as in the original paper, as a few shortcuts were made:
1. We do not use bumpy terrain.
2. We train for significantly fewer epochs.
3. The reward curriculum as defined in the paper has yet to be implemented.
4. The reward hyperparameters have not been optimized for our particular simulation platform.
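A reward curriculum of the kind mentioned in item 3 can be sketched as a penalty scale that starts small and grows toward 1 as training progresses. The update rule and the constants `k0 = 0.03` and `exponent = 0.997` below are illustrative assumptions, not values taken from this lab's code; consult the paper for the schedule it actually uses.

```python
def penalty_scale(iteration, k0=0.03, exponent=0.997):
    """Scale applied to penalty terms at a given training iteration.

    Follows the recurrence k_{t+1} = k_t ** exponent with k_0 = k0,
    which has the closed form k_t = k0 ** (exponent ** t).
    """
    return k0 ** (exponent ** iteration)


def total_reward(task_reward, penalty, iteration):
    """Task reward plus a (non-positive) penalty weighted by the curriculum."""
    return task_reward + penalty_scale(iteration) * penalty


scales = [penalty_scale(t) for t in range(0, 2000, 100)]
for t, k in zip(range(0, 2000, 100), scales):
    print(f"iteration {t:4d}: penalty scale = {k:.3f}")
```

Starting with a small scale lets the policy first learn to move at all; the penalties then tighten gradually instead of discouraging motion from the first iteration.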

To see some results of the limited training feel free to look in the `media/` folder.

**The section below is optional,** although I'd recommend you use the training code to at least test that your implementations above don't crash or error out.

Load all the necessary imports...

In [19]:
import os

print("Setting environment variable to use GPU rendering:")
os.environ["MUJOCO_GL"] = "egl"
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

from datetime import datetime
import json

from mujoco_playground import registry
from mujoco_playground.config import locomotion_params

from brax import envs

import jax
from jax import numpy as jp
import numpy as np
from flax.training import orbax_utils
from flax import struct
from matplotlib import pyplot as plt
import mediapy as media
from orbax import checkpoint as ocp

import mujoco
from mujoco import mjx

import torch

import mujoco_playground
from mujoco_playground import wrapper_torch
from mujoco_playground.config import manipulation_params
from mujoco_playground.config import dm_control_suite_params

from EECS598RSLRLBraxWrapper import EECS598RSLRLBraxWrapper

from rsl_rl.runners import OnPolicyRunnerRMA

from rma_go1_locomote import go1_rma_default_config

import functools
from tqdm import tqdm


Setting environment variable to use GPU rendering:


Set up the environment, wrap it using the wrapper, and set up the training configuration file.

In [20]:
from rma_go1_locomote import go1_rma_default_config, LocomotionRMAEnv

_LOAD_RUN_NAME = None
_CHECKPOINT_NUM = -1
_SUFFIX = None
_SEED = 1
_NUM_ENVS = 8192
_DEVICE = "cuda:0"
_EXP_NAME = "RMA"

device=_DEVICE

def setup_env_rsl(num_envs):
    '''
    Create the experiment name, logs/chkpt dirs, load the environment,
    and wrap it in a EECS598RSLRLBraxWrapper to make it compatible with rsl_rl
    '''

    # Create the environment
    env_cfg = go1_rma_default_config()
    raw_env = LocomotionRMAEnv(task="flat_terrain", config=env_cfg)

    # Experiment name
    now = datetime.now()
    timestamp = now.strftime("%Y%m%d-%H%M%S")
    exp_name = f"{_EXP_NAME}-{timestamp}"
    exp_name += f"-{_SUFFIX}"
    print(f"Experiment name: {exp_name}")

    # Logging directory
    logdir = os.path.abspath(os.path.join("logs", exp_name))
    os.makedirs(logdir, exist_ok=True)
    print(f"Logs are being stored in: {logdir}")

    # Checkpoint directory
    ckpt_path = os.path.join(logdir, "checkpoints")
    os.makedirs(ckpt_path, exist_ok=True)
    print(f"Checkpoint path: {ckpt_path}")

    # Save environment config to JSON
    with open(
        os.path.join(ckpt_path, "config.json"), "w", encoding="utf-8"
    ) as fp:
        json.dump(env_cfg.to_dict(), fp, indent=4)

    # Domain randomization
    randomizer = registry.get_domain_randomizer("Go1JoystickFlatTerrain")

    brax_env = EECS598RSLRLBraxWrapper(
        raw_env,
        num_envs,
        _SEED,
        env_cfg.episode_length,
        1,
        randomization_fn=randomizer,
    )

    # Build RSL-RL config
    train_cfg = locomotion_params.rsl_rl_config("Go1JoystickFlatTerrain")

    # Overwrite default config for RMA
    train_cfg.seed = _SEED
    train_cfg.run_name = exp_name
    train_cfg.resume = _LOAD_RUN_NAME is not None
    train_cfg.load_run = _LOAD_RUN_NAME if _LOAD_RUN_NAME else "-1"
    train_cfg.checkpoint = _CHECKPOINT_NUM
    train_cfg.runner_class_name = "OnPolicyRunnerRMA"

    train_cfg.algorithm.num_mini_batches = 32
    train_cfg.algorithm.gamma = 0.97

    train_cfg_dict = train_cfg.to_dict()
    train_cfg_dict["policy_class_name"] = "ActorCriticLatent"
    train_cfg_dict["algorithm_class_name"] = "PPO_priv"

    return exp_name, logdir, brax_env, train_cfg_dict

In [21]:
exp_name, logdir, brax_env, train_cfg_dict = setup_env_rsl(num_envs=_NUM_ENVS)

print("Running RMA With Privileged Information ...")
print(train_cfg_dict)

# train the locomotion policy
runner = OnPolicyRunnerRMA(brax_env, train_cfg_dict, logdir, device=device)

Experiment name: RMA-20251016-183430-None
Logs are being stored in: /content/drive/MyDrive/CSE_598/lab7-rma-phase1/logs/RMA-20251016-183430-None
Checkpoint path: /content/drive/MyDrive/CSE_598/lab7-rma-phase1/logs/RMA-20251016-183430-None/checkpoints
obs_shape: {'privileged_state': (6,), 'state': (45,)}
Asymmetric observation space
JITing reset and step
Done JITing reset and step
Running RMA With Privileged Information ...
{'algorithm': {'class_name': 'PPO', 'clip_param': 0.2, 'desired_kl': 0.01, 'entropy_coef': 0.001, 'gamma': 0.97, 'lam': 0.95, 'learning_rate': 0.0003, 'max_grad_norm': 1.0, 'num_learning_epochs': 5, 'num_mini_batches': 32, 'schedule': 'fixed', 'use_clipped_value_loss': True, 'value_loss_coef': 1.0}, 'checkpoint': -1, 'empirical_normalization': True, 'experiment_name': 'test', 'load_run': '-1', 'max_iterations': 1000, 'num_steps_per_env': 24, 'policy': {'activation': 'elu', 'actor_hidden_dims': [512, 256, 128], 'class_name': 'ActorCritic', 'critic_hidden_dims': [512, 
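The numbers in this configuration fix the size of the training run, and it is worth checking them by hand before starting. The sketch below does the arithmetic using `_NUM_ENVS = 8192` from the setup cell, `num_steps_per_env = 24` and `max_iterations = 1000` from the printed config, and `num_mini_batches = 32` as set in `setup_env_rsl`. It assumes each rollout is split evenly into mini-batches.

```python
num_envs = 8192           # _NUM_ENVS in the setup cell
num_steps_per_env = 24    # from the printed training config
num_mini_batches = 32     # set in setup_env_rsl
num_iterations = 1000     # max_iterations in the printed config

# Transitions collected across all environments in one learning iteration.
steps_per_iteration = num_envs * num_steps_per_env          # 196608

# Transitions per mini-batch, assuming an even split of each rollout.
mini_batch_size = steps_per_iteration // num_mini_batches   # 6144

# Transitions collected over the full run.
total_steps = steps_per_iteration * num_iterations          # 196608000

# Iterations are zero-indexed, so "iteration 667" is the 668th rollout.
steps_after_iteration_667 = steps_per_iteration * 668       # 131334144

print(steps_per_iteration, mini_batch_size, total_steps, steps_after_iteration_667)
```

The last value matches the `Total timesteps: 131334144` line that the training log reports for learning iteration 667, which is a quick way to confirm the run is using the configuration you expect.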

Train the policy:

In [22]:
runner.learn(
    num_learning_iterations=1000,
    init_at_random_ep_len=False,
)
print("Done training.")
print(logdir)

Streaming output truncated to the last 5000 lines.
                        Total time: 2232.45s
                               ETA: 1117.9s

################################################################################
                      Learning iteration 667/1000

                       Computation: 59944 steps/s (collection: 1.835s, learning 1.444s)
               Value function loss: 0.0021
                    Surrogate loss: 0.0180
             Mean action noise std: 0.05
                       Mean reward: 97.75
               Mean episode length: 838.47
--------------------------------------------------------------------------------
                   Total timesteps: 131334144
                    Iteration time: 3.28s
                        Total time: 2235.73s
                               ETA: 1114.5s

################################################################################
                      Learning iterati

# Analyze the Learned Environment Latent $z_t$ (optional)
One thing you might be curious about is what the environment latent $z_t$ actually captures. One technique for investigating this would be to grab the ActorCriticLatent policy that you trained and pass in hand-made environment observation vectors to see how they affect $z_t$. Alternatively, you could collect $z_t$ values while rolling out the policy in randomized environments. Since $z_t$ likely has more than 2 dimensions, you may find it useful to visualize it using latent space projection techniques such as t-SNE. (Check out the `./media` folder.)

I encourage you to explore this if you're curious and have the time!
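As a starting point, the sketch below projects a batch of latents to 2D. It uses PCA (via NumPy's SVD) as a simpler stand-in for t-SNE, and the latents are synthetic: two clusters standing in for rollouts under two different environment settings. The 8-dimensional latent size and the cluster construction are assumptions for illustration; with a trained policy you would replace `latents` with collected $z_t$ values and `labels` with the randomized parameter you want to color by.

```python
import numpy as np


def pca_project(latents, num_components=2):
    """Project rows of `latents` onto their top principal components."""
    centered = latents - latents.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:num_components].T


rng = np.random.default_rng(0)

# Synthetic stand-in latents for two environment settings (hypothetical).
low_setting = rng.normal(loc=-1.0, scale=0.1, size=(100, 8))
high_setting = rng.normal(loc=1.0, scale=0.1, size=(100, 8))
latents = np.concatenate([low_setting, high_setting])
labels = np.array([0] * 100 + [1] * 100)

projected = pca_project(latents)
print(projected.shape)

# To plot: plt.scatter(projected[:, 0], projected[:, 1], c=labels)
```

If the encoder has learned something meaningful, points collected under different environment settings should separate in the projection, as the two synthetic clusters do here.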

In [23]:
# Your Code Here

## What to Turn In

`#TODO(student):` Please zip the following files and turn them in to the assignment on Gradescope:
1. this `07_lab_student.ipynb` file. Please make sure to fill out your name and UMID in the first cell
2. the `actor_critic_latent.py` file

Please ensure all cell outputs (videos, plots, etc.) are intact when you download the .ipynb file.