# EECS 598 Lab 8: Rapid Motor Adaptation (Phase 2)

![rma](https://www.therobotreport.com/wp-content/uploads/2021/07/legged-robot-adapts.jpg)

This notebook is worth **80 points**. Your score will be calculated as `score = min(score, 80)`.
Questions and implementations are marked with relevant `#TODO(student)` markers.

Before starting the assignment, please put your name and UMID in the following format:

Firstname LASTNAME, #00000000 (ex. Drew SCHEFFER #31415926)

**YOUR ANSWER**

SHIVAM UDESHI, #87841376

## Setup

In [1]:
import sys, types, importlib

# Create a tiny fake 'imp' module exposing only 'reload'
_imp = types.ModuleType("imp")
_imp.reload = importlib.reload
sys.modules["imp"] = _imp

# load autoreload
%load_ext autoreload
%autoreload 2

In [2]:
print('Setting environment variable to use GPU rendering:')
%env MUJOCO_GL=egl
%env XLA_PYTHON_CLIENT_PREALLOCATE=false

Setting environment variable to use GPU rendering:
env: MUJOCO_GL=egl
env: XLA_PYTHON_CLIENT_PREALLOCATE=false


In [3]:
#@title Import packages for plotting and creating graphics
import time
import itertools
import numpy as np
from typing import Callable, NamedTuple, Optional, Union, List

# Graphics and plotting.
print('Installing mediapy:')
!command -v ffmpeg >/dev/null || (apt update && apt install -y ffmpeg)
!pip install -q mediapy
import mediapy as media
import matplotlib.pyplot as plt

# More legible printing from numpy.
np.set_printoptions(precision=3, suppress=True, linewidth=100)

Installing mediapy:


### Google Colab Setup

Next, we'll run a few commands to set up the environment on Google Colab. If you are running this notebook locally, you can skip this section.

Run the following to mount this notebook to your Google Drive. Follow the link and sign into the Google account following the prompts. Use the same Google account that you used to store this notebook.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Now update the path below to point to the folder in your Google Drive where you uploaded this notebook. If everything worked correctly you should see the following filenames at least: [`08_lab_student.ipynb`, `EECS598RSLRLBraxWrapper.py`, `rma_go1_locomote.py`, `rma_rsl_rl/`]

In [5]:
import os

# TODO: Fill in the Google Drive path where you uploaded Lab 8
# Example: If you create a 2025FA folder and put all the files under Lab8, then '2025FA/Lab8'
# GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = '2025FA/Lab8'

# Path relative to MyDrive; do not start it with a slash.
GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = 'CSE598/lab8-rma-phase2/lab8-rma-phase2'
GOOGLE_DRIVE_PATH_LAB8 = os.path.join('/content', 'drive', 'MyDrive', GOOGLE_DRIVE_PATH_AFTER_MYDRIVE)

print(os.listdir(GOOGLE_DRIVE_PATH_LAB8))

# Add to path and change directory for good measure
sys.path.append(GOOGLE_DRIVE_PATH_LAB8)
os.chdir(GOOGLE_DRIVE_PATH_LAB8)

['.DS_Store', 'scene_mjx_feetonly_rough_terrain.xml', 'rma_rsl_rl', 'teacher_checkpoints', '__pycache__', 'rma_go1_locomote.py', 'EECS598RSLRLBraxWrapper.py', 'teacher_rollout.mp4', '08_lab_student.ipynb']
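
One pitfall worth knowing when editing the path above: `os.path.join` discards every earlier component as soon as it meets an absolute one, so a folder value that starts with `/` silently replaces whatever prefix came before it. A minimal sketch of that behavior, using `posixpath` (the POSIX flavor of `os.path`, which is what a Linux Colab runtime uses) and made-up folder names:

```python
import posixpath  # POSIX flavor of os.path; same behavior as os.path on Linux

# Relative tail: components are joined in order.
relative = posixpath.join("/content", "drive", "MyDrive", "2025FA/Lab8")

# Absolute tail: every component before it is discarded.
absolute = posixpath.join("drive", "My Drive", "/content/drive/MyDrive/2025FA/Lab8")

print(relative)
print(absolute)
```

Both calls produce the same string, which is why an absolute value can appear to work even though the leading components are ignored.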


In [6]:
from google.colab import files

import os
import subprocess
if subprocess.run('nvidia-smi').returncode:
  raise RuntimeError(
      'Cannot communicate with GPU. '
      'Make sure you are using a GPU Colab runtime. '
      'Go to the Runtime menu and select Choose runtime type.')

# Add an ICD config so that glvnd can pick up the Nvidia EGL driver.
# This is usually installed as part of an Nvidia driver package, but the Colab
# kernel doesn't install its driver via APT, and as a result the ICD is missing.
# (https://github.com/NVIDIA/libglvnd/blob/master/src/EGL/icd_enumeration.md)
NVIDIA_ICD_CONFIG_PATH = '/usr/share/glvnd/egl_vendor.d/10_nvidia.json'
if not os.path.exists(NVIDIA_ICD_CONFIG_PATH):
  with open(NVIDIA_ICD_CONFIG_PATH, 'w') as f:
    f.write("""{
    "file_format_version" : "1.0.0",
    "ICD" : {
        "library_path" : "libEGL_nvidia.so.0"
    }
}
""")

### Install customized rsl_rl library

In [7]:
%cd rma_rsl_rl

/content/drive/MyDrive/CSE598/lab8-rma-phase2/lab8-rma-phase2/rma_rsl_rl


In [8]:
!pip install -e .

Obtaining file:///content/drive/MyDrive/CSE598/lab8-rma-phase2/lab8-rma-phase2/rma_rsl_rl
  Preparing metadata (setup.py) ... done
Installing collected packages: rsl_rl
  Attempting uninstall: rsl_rl
    Found existing installation: rsl_rl 1.0.2
    Uninstalling rsl_rl-1.0.2:
      Successfully uninstalled rsl_rl-1.0.2
  Running setup.py develop for rsl_rl
Successfully installed rsl_rl-1.0.2


In [9]:
%cd ..

/content/drive/MyDrive/CSE598/lab8-rma-phase2/lab8-rma-phase2


If the cell below throws an error, you likely need to restart your runtime.

In [10]:
import rsl_rl
from rsl_rl.runners import OnPolicyRunnerRMA
from rsl_rl.runners import OnPolicyRunnerDagger

## MuJoCo, JAX, MJX, Brax, and Playground Setup & Imports

In [11]:
!pip install mujoco
!pip install mujoco_mjx
!pip install brax
!pip install noise

# TODO(student): If you're running this locally, make sure to install cuda enabled jax via something like:
# !pip install "jax[cuda12]"



In [12]:
!pip install playground
!pip install wandb
!pip install tensorboard



In [13]:
import os

try:
  print('Checking that the installation succeeded:')
  import mujoco
  mujoco.MjModel.from_xml_string('<mujoco/>')
except Exception as e:
  # Raise the explanatory error and keep the original exception as its cause.
  raise RuntimeError(
      'Something went wrong during installation. Check the shell output above '
      'for more information.\n'
      'If using a hosted Colab runtime, make sure you enable GPU acceleration '
      'by going to the Runtime menu and selecting "Choose runtime type".') from e

print('Installation successful.')

# Tell XLA to use Triton GEMM, this improves steps/sec by ~30% on some GPUs
xla_flags = os.environ.get('XLA_FLAGS', '')
xla_flags += ' --xla_gpu_triton_gemm_any=True'
os.environ['XLA_FLAGS'] = xla_flags

Checking that the installation succeeded:
Installation successful.
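
Because `XLA_FLAGS` is extended with `+=`, re-running the cell (or setting the same flag again in a later cell) appends a duplicate. A small sketch of an idempotent alternative follows. `add_xla_flag` is a hypothetical helper, not part of JAX or this lab's code, and it operates on a plain dict here so it can be tried without touching the real environment.

```python
def add_xla_flag(env: dict, flag: str) -> str:
    """Append `flag` to env['XLA_FLAGS'] only if it is not already present."""
    current = env.get("XLA_FLAGS", "").split()
    if flag not in current:
        current.append(flag)
    env["XLA_FLAGS"] = " ".join(current)
    return env["XLA_FLAGS"]


fake_env = {}  # stand-in for os.environ
add_xla_flag(fake_env, "--xla_gpu_triton_gemm_any=True")
add_xla_flag(fake_env, "--xla_gpu_triton_gemm_any=True")  # second call is a no-op
print(fake_env["XLA_FLAGS"])
```

Passing `os.environ` instead of `fake_env` would apply the change for real. Either way, the flag should be set before JAX initializes its backend.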


Ensure that the output of the following cell is `[CudaDevice(id=0)]`

In [14]:
import jax
print(jax.devices())

[CudaDevice(id=0)]


In [15]:
#@title Import MuJoCo, MJX, and Brax
from datetime import datetime
from etils import epath
import functools
from IPython.display import HTML
from typing import Any, Dict, Sequence, Tuple, Union
import os
from ml_collections import config_dict


import jax
from jax import numpy as jp
import numpy as np
from flax.training import orbax_utils
from flax import struct
from matplotlib import pyplot as plt
import mediapy as media
from orbax import checkpoint as ocp

import mujoco
from mujoco import mjx

from brax import base
from brax import envs
from brax import math
from brax.base import Base, Motion, Transform
from brax.base import State as PipelineState
from brax.envs.base import Env, PipelineEnv, State
from brax.mjx.base import State as MjxState
from brax.training.agents.ppo import train as ppo
from brax.training.agents.ppo import networks as ppo_networks
from brax.io import html, mjcf, model

Failed to import warp: No module named 'warp'
Failed to import mujoco_warp: No module named 'warp'


In [16]:

import os

xla_flags = os.environ.get("XLA_FLAGS", "")
xla_flags += " --xla_gpu_triton_gemm_any=True"
os.environ["XLA_FLAGS"] = xla_flags
os.environ["MUJOCO_GL"] = "egl"

from datetime import datetime
import json

from absl import app
from absl import flags
from absl import logging
import jax
import mediapy as media
from ml_collections import config_dict
import mujoco
import torch

import mujoco_playground
from mujoco_playground import registry
from mujoco_playground import wrapper_torch
from mujoco_playground import wrapper
from mujoco_playground.config import locomotion_params
from mujoco_playground.config import manipulation_params
from mujoco_playground.config import dm_control_suite_params

# Suppress logs if you want
logging.set_verbosity(logging.WARNING)

# Recap of Rapid Motor Adaptation (RMA)

![here](https://ar5iv.labs.arxiv.org/html/2107.04034/assets/x1.png)

Today, we'll continue to investigate one popular Sim2Real reinforcement learning paradigm: learning (and distilling) from privileged information available in simulation. One popular implementation of this idea is [Rapid Motor Adaptation for Legged Robots (RMA)](https://ashish-kmr.github.io/rma-legged-robots/). In this formulation, training a robotic locomotion policy is split into two main phases, as shown in the figure above.

In **phase 1**, a motor policy, $\pi$, is trained that takes as input the state $x_t$, the previous action $a_{t-1}$, and a learned environment latent vector $z_t$. The latent encodes features of the environment that may be relevant for adapting the robot's policy but may not be directly observable when the robot is deployed in the real world (such as friction, center of mass position, etc.). In this sense, you can think of $z_t$ as conditioning the policy. At this stage, however, the policy cannot be applied directly on a real robot, because it requires access to privileged state to generate $z_t$.
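
To make the conditioning concrete, here is a toy NumPy sketch of the Phase 1 data flow. The sizes (51-d state, 17-d privileged vector, 8-d latent, 12 actions) are taken from the network printout later in this notebook. The single random linear layers are placeholders for the real MLPs, and treating the previous action as part of the 51-d state is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, PRIV_DIM, LATENT_DIM, ACTION_DIM = 51, 17, 8, 12

# Placeholder weights; the real encoder and policy are multi-layer MLPs.
W_enc = rng.normal(size=(PRIV_DIM, LATENT_DIM))
W_pi = rng.normal(size=(STATE_DIM + LATENT_DIM, ACTION_DIM))

def encode(privileged):
    """Privileged environment info -> latent z_t."""
    return np.tanh(privileged @ W_enc)

def policy(state, z):
    """Action computed from the state concatenated with the latent."""
    return np.tanh(np.concatenate([state, z], axis=-1) @ W_pi)

state = rng.normal(size=(4, STATE_DIM))      # batch of 4 environments
privileged = rng.normal(size=(4, PRIV_DIM))  # only available in simulation
z = encode(privileged)
action = policy(state, z)
print(z.shape, action.shape)
```

The policy never sees the privileged vector directly, only the 8-d latent, which is what lets Phase 2 swap in an estimate of that latent later.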

In **phase 2**, the motor policy is frozen (no longer updated), and an adaptation module is trained to *approximate* the environment feature $z_t$ using *only* the robot's extended state history $x_{1...T}, a_{1...T}$. Crucially, since we're training both phases in simulation, we can roll out the phase 1 privileged information encoder to get "ground truth" $z_t$ labels for each timestep. So, we can do simple supervised learning to estimate $\hat{z}_t$ from state history!
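
Written out, the Phase 2 objective is a plain regression loss. The notation below follows the naming in the RMA paper ($\mu$ for the privileged encoder, $\phi$ for the adaptation module, $e_t$ for the privileged environment vector); the history length $k$ is left symbolic.

$$
z_t = \mu(e_t), \qquad
\hat{z}_t = \phi\left(x_{t-k:t-1},\, a_{t-k:t-1}\right), \qquad
\mathcal{L}(\phi) = \mathbb{E}_t\left[\lVert \hat{z}_t - z_t \rVert_2^2\right]
$$

Only the parameters of $\phi$ are updated; $\mu$ and $\pi$ stay frozen.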

In today's lab, we'll focus on implementing the knowledge distillation in **phase 2** of RMA. If you need a refresher, I suggest that you look through [the paper](https://ashish-kmr.github.io/rma-legged-robots/rma-locomotion-final.pdf).

`TODO(student):` Explain in your own words why it's important that both Phase 1 and Phase 2 are trained in simulation. **(10 points)**.

**[Answer Here]**

Both phases of RMA are trained in simulation because the simulator provides access to privileged information—like friction, mass, or terrain properties—that the real robot cannot directly measure. In Phase 1, this allows the motor policy to learn how to adapt its behavior using these hidden environment details. Then, in Phase 2, we can use simulation to generate ground truth labels for the latent environment vector, letting the adaptation module learn to predict it from past observations. Training in simulation is also safer and lets the robot experience many diverse environments before being deployed in the real world.

# Loading the Privileged "Teacher" Policy (Trained using Phase 1)
First, we'll load and visualize our "teacher policy" that was trained using Phase 1 of RMA. Compared to the last lab, this policy was trained on rough terrain with a penalty reward curriculum (as described in the paper) and more privileged information, and it was trained for more iterations. Feel free to check out the `rma_go1_locomote.py` environment to see the updated implementation.

Note: getting a better teacher policy would likely require much more tedious reward tuning, but this one should work for this lab!

### Load policy from checkpoint

In [17]:
from glob import glob
from rma_go1_locomote import go1_rma_default_config, rma_domain_randomize, LocomotionRMAEnv

_SEED=10
device="cuda:0"

# Get the model checkpoint
ckpt_num = 6000
model_path = f"./teacher_checkpoints/model_{ckpt_num}.pt"

teacher_render_trajectory = []
def teacher_render_callback(_, state):
    teacher_render_trajectory.append(state)

# Create the parent env
randomizer = rma_domain_randomize
env_cfg = go1_rma_default_config()
raw_eval_env = LocomotionRMAEnv(config=env_cfg, task="rough_terrain")
from EECS598RSLRLBraxWrapper import EECS598RSLRLBraxWrapper
brax_eval_env = EECS598RSLRLBraxWrapper(
    raw_eval_env,
    num_actors=1,
    seed=_SEED,
    episode_length=env_cfg.episode_length,
    action_repeat=1,
    randomization_fn=randomizer,
    render_callback=teacher_render_callback,
)

# Load the trained, privileged RMA policy
train_cfg = locomotion_params.rsl_rl_config("")
train_cfg.runner_class_name = "OnPolicyRunnerRMA"
train_cfg_dict = train_cfg.to_dict()
train_cfg_dict["policy_class_name"] = "ActorCriticLatent"
train_cfg_dict["algorithm_class_name"] = "PPO_priv"

# Get the Parent runner and policy
runner = OnPolicyRunnerRMA(brax_eval_env, train_cfg_dict, device=device)
runner.load(path=model_path)
policy = runner.get_inference_policy(device=device)

obs_shape: {'privileged_state': (17,), 'state': (51,)}
Asymmetric observation space
JITing reset and step
Done JITing reset and step
ActorCriticLatent.__init__ got unexpected arguments, which will be ignored: ['class_name']
Actor MLP: MLPEncode(
  (activation_fn): ELU(alpha=1.0)
  (output_activation_fn): Tanh()
  (prop_encoder): Sequential(
    (0): Linear(in_features=17, out_features=256, bias=True)
    (1): ELU(alpha=1.0)
    (2): Linear(in_features=256, out_features=128, bias=True)
    (3): ELU(alpha=1.0)
    (4): Linear(in_features=128, out_features=8, bias=True)
    (5): ELU(alpha=1.0)
  )
  (action_mlp): Sequential(
    (0): Linear(in_features=59, out_features=512, bias=True)
    (1): ELU(alpha=1.0)
    (2): Linear(in_features=512, out_features=256, bias=True)
    (3): ELU(alpha=1.0)
    (4): Linear(in_features=256, out_features=128, bias=True)
    (5): ELU(alpha=1.0)
    (6): Linear(in_features=128, out_features=12, bias=True)
    (7): Tanh()
  )
)
Critic MLP: MLPEncode(
  (acti

### Perform a rollout
`TODO(student):` Fill in the TODO below to generate the teacher policy rollout. **(20 points)**

In [18]:
obs = brax_eval_env.get_observations()
privileged_obs = brax_eval_env.get_privileged_observations()

actor_critic = runner.alg.actor_critic

from tqdm import tqdm
with torch.inference_mode():
    for _ in tqdm(range(env_cfg.episode_length)):
        # as_tensor avoids the copy-construct warning if the env already returns tensors
        obs_tensor = torch.as_tensor(obs, dtype=torch.float32, device=device)
        priv_tensor = torch.as_tensor(privileged_obs, dtype=torch.float32, device=device)

        # The teacher network expects [state, privileged_state] concatenated on the last axis
        full_obs = torch.cat([obs_tensor, priv_tensor], dim=-1)

        # Forward pass through the actor, plus the privileged latent z_t for inspection
        actions = actor_critic.actor.architecture(full_obs)
        z_t = actor_critic.actor.architecture.only_latent(full_obs)
        obs, privileged_obs, rewards, dones, infos = brax_eval_env.step(actions)

        if dones.any():
            break

        brax_eval_env.render()

print("The final z_t is ", z_t)

100%|█████████▉| 999/1000 [01:03<00:00, 15.72it/s] 


The final z_t is  tensor([[-0.9978, -0.3001, -1.0000, -1.0000, -0.1475, -1.0000, -0.9947, -1.0000]],
       device='cuda:0')


### Render the video

In [19]:
# Render the video
print("Rendering a video")
scene_option = mujoco.MjvOption()
scene_option.geomgroup[2] = True
scene_option.geomgroup[3] = False
scene_option.flags[mujoco.mjtVisFlag.mjVIS_CONTACTPOINT] = True
scene_option.flags[mujoco.mjtVisFlag.mjVIS_PERTFORCE] = True
scene_option.flags[mujoco.mjtVisFlag.mjVIS_CONTACTFORCE] = False


render_every = 2


# If your environment is wrapped multiple times, adjust as needed:
base_env = brax_eval_env.env.env.env  # or brax_env.env.env.env
fps = 1.0 / base_env.dt / render_every
traj = teacher_render_trajectory[::render_every]
frames = raw_eval_env.render(
    traj,
    camera="track",
    height=480,
    width=640,
    scene_option=scene_option,
)

file_name = f"teacher_rollout.mp4"
media.write_video(file_name, frames, fps=fps)
print(f"Rollout video saved as '{file_name}'.")
media.show_video(frames, fps=fps)



Rendering a video


100%|██████████| 500/500 [00:02<00:00, 203.27it/s]


Rollout video saved as 'teacher_rollout.mp4'.
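
The frame count and playback rate follow directly from `render_every`. The sketch below reproduces the 500 frames reported above, assuming one state was recorded for each of the 999 completed steps shown by the progress bar. The control timestep `dt = 0.02` is an assumed placeholder, so read the real value from `base_env.dt`.

```python
import math

num_states = 999     # assumed: one recorded state per completed rollout step
render_every = 2
dt = 0.02            # ASSUMED control timestep in seconds; use base_env.dt in the notebook

num_frames = math.ceil(num_states / render_every)  # same as len(traj[::render_every])
fps = 1.0 / dt / render_every                      # keeps playback at real-time speed
duration_s = num_frames / fps

print(num_frames, fps, duration_s)
```

Dividing the frame rate by `render_every` is what keeps the video at real-time speed even though only every second state is rendered.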




# RMA Phase 2: Implement Teacher-Student Distillation
Awesome! Now that the privileged teacher policy is trained, we can use its privileged environment encoder to obtain a "ground-truth" $z_t$ label for each timestep in the simulation. In Phase 2 of RMA, all we'll be doing is building an encoder that tries to regress to the $z_t$ label using information that the robot will have access to in the real world (i.e., its proprioceptive history).

To do this, we'll again be building off of the familiar [RSL_RL](https://github.com/leggedrobotics/rsl_rl) framework discussed in previous labs, extending its training loop to incorporate these new elements. Crucially, though, we're not doing any RL in phase 2, just supervised regression.

Our adapted RSL_RL library resides in the `rma_rsl_rl` folder. There are a few important files to look at:
- `./rsl_rl/runners/on_policy_runner_dagger.py`: This file contains the main training and logging loop and is responsible for rolling out and training the policy in simulation. I recommend you take a moment to read through this file.
- `./rsl_rl/algorithms/dagger.py`: This file connects the *expert* policy and the *student* policy, stores the computed ($z_t$, proprioceptive history) pairs, and performs regression steps.
- `./rsl_rl/modules/adaptation.py`: This file defines the "adaptation module" that produces estimates $\hat{z}_t$ given the history $(x_{t-50:t-1}, a_{t-50:t-1})$.

Please briefly take a look at all of these files to get a sense of the structure.

In this lab, the only file you need to modify is `./rsl_rl/algorithms/dagger.py`. You'll train the adaptation encoder by computing $z_t$ and $\hat{z}_t$ and performing regression steps.

`TODO(student):` Fill in the TODOs in `./rsl_rl/algorithms/dagger.py` to train the adaptation encoder.

1. Implement the TODOs in `DaggerAgent.get_student_latent()` and `DaggerAgent.get_expert_latent()`. **(20 points)**

2. Implement the TODOs in `DaggerTrainer.observe()`, `DaggerTrainer.step()`, and `DaggerTrainer.update()`. **(20 points)**
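
As a mental model for what the regression step does (not the solution to the TODOs, and not the `rma_rsl_rl` API), here is a self-contained NumPy toy. A fixed "expert" maps hidden environment parameters to a latent, and a linear "student" learns to predict that latent from noisy observable features by gradient descent on the mean squared error. All names, sizes, and the linear model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, ENV_DIM, FEAT_DIM, LATENT_DIM = 512, 4, 16, 8

# "Expert": frozen map from hidden environment parameters to the latent z.
env_params = rng.normal(size=(N, ENV_DIM))
W_expert = rng.normal(size=(ENV_DIM, LATENT_DIM))
z_target = env_params @ W_expert

# Observable features: a noisy linear mix of the hidden parameters,
# standing in for what a proprioceptive history reveals about them.
mix = rng.normal(size=(ENV_DIM, FEAT_DIM))
features = env_params @ mix + 0.01 * rng.normal(size=(N, FEAT_DIM))

# "Student": linear regressor trained with gradient descent on the MSE.
W_student = np.zeros((FEAT_DIM, LATENT_DIM))
lr = 0.05
losses = []
for _ in range(500):
    z_hat = features @ W_student
    err = z_hat - z_target
    losses.append(float(np.mean(err ** 2)))
    grad = 2.0 * features.T @ err / err.size  # gradient of the mean squared error
    W_student -= lr * grad

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The expert weights are never touched inside the loop, mirroring the frozen teacher in Phase 2; only the student's parameters receive gradient updates.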

# Run Phase 2 Training
Train the Phase 2 RMA policy! It shouldn't take too long to train.

### Get the Configs

In [30]:
_EXP_NAME = "RMA_DAGGER"
_SUFFIX = None
_USE_WANDB = False
_PLAY_ONLY = False
_NUM_ENVS = 4096
_LOAD_RUN_NAME = None
_CHECKPOINT_NUM = -1

def _init_logging_and_exp():
    # Experiment name
    now = datetime.now()
    timestamp = now.strftime("%Y%m%d-%H%M%S")
    exp_name = f"{_EXP_NAME}-{timestamp}"
    exp_name += f"-{_SUFFIX}"
    print(f"Experiment name: {exp_name}")

    # Logging directory
    logdir = os.path.abspath(os.path.join("logs", exp_name))
    os.makedirs(logdir, exist_ok=True)
    print(f"Logs are being stored in: {logdir}")

    # Checkpoint directory
    ckpt_path = os.path.join(logdir, "checkpoints")
    os.makedirs(ckpt_path, exist_ok=True)
    print(f"Checkpoint path: {ckpt_path}")

    # Save environment config to JSON
    with open(
        os.path.join(ckpt_path, "config.json"), "w", encoding="utf-8"
    ) as fp:
        json.dump(env_cfg.to_dict(), fp, indent=4)

    return exp_name, logdir, ckpt_path

def _get_training_config_dict(exp_name):
    train_cfg = locomotion_params.rsl_rl_config("") # get the default
    train_cfg.seed = _SEED
    train_cfg.run_name = exp_name
    train_cfg.resume = _LOAD_RUN_NAME is not None
    train_cfg.load_run = _LOAD_RUN_NAME if _LOAD_RUN_NAME else "-1"
    train_cfg.checkpoint = _CHECKPOINT_NUM
    train_cfg.runner_class_name = "OnPolicyRunnerDagger"

    train_cfg_dict = train_cfg.to_dict()


    train_cfg_dict["expert_policy_name"] = "DaggerExpert"
    train_cfg_dict["student_policy_class"] = "DaggerAgent"
    train_cfg_dict["history_len"] = 50

    return train_cfg_dict



### Setup the Student Runner

In [31]:
from EECS598RSLRLBraxWrapper import EECS598RSLRLBraxWrapper
from rsl_rl.runners import OnPolicyRunnerDagger

# Initialize the base env
randomizer = rma_domain_randomize
env_cfg = go1_rma_default_config()
raw_env = LocomotionRMAEnv(task="flat_terrain", config=env_cfg) #task in {flat_terrain, rough_terrain}
brax_env = EECS598RSLRLBraxWrapper(
    raw_env,
    _NUM_ENVS,
    _SEED,
    env_cfg.episode_length,
    1,
    randomization_fn=randomizer,
)


# Init the logging dirs
exp_name, logdir, ckpt_path = _init_logging_and_exp()
print(f"Experiment {exp_name}:  storing checkpoints in {ckpt_path}")

# Setup the Dagger Runner
print("\n\nSetting up the Dagger Runner")
train_cfg_dict = _get_training_config_dict(exp_name)
expert_policy = runner.alg.actor_critic.actor.architecture
dagger_runner = OnPolicyRunnerDagger(expert_policy, env=brax_env, train_cfg=train_cfg_dict, log_dir=logdir, device=device)

obs_shape: {'privileged_state': (17,), 'state': (51,)}
Asymmetric observation space
JITing reset and step
Done JITing reset and step
Experiment name: RMA_DAGGER-20251023-223916-None
Logs are being stored in: /content/drive/MyDrive/CSE598/lab8-rma-phase2/lab8-rma-phase2/logs/RMA_DAGGER-20251023-223916-None
Checkpoint path: /content/drive/MyDrive/CSE598/lab8-rma-phase2/lab8-rma-phase2/logs/RMA_DAGGER-20251023-223916-None/checkpoints
Experiment RMA_DAGGER-20251023-223916-None:  storing checkpoints in /content/drive/MyDrive/CSE598/lab8-rma-phase2/lab8-rma-phase2/logs/RMA_DAGGER-20251023-223916-None/checkpoints


Setting up the Dagger Runner
Dagger Runner Loaded


### Train the Student Policy

In [32]:
dagger_runner.learn(
    num_learning_iterations=100,
    init_at_random_ep_len=False,
)
print("training finished")

################################################################################
                        Learning iteration 0/100                        

                       Computation: 1017 steps/s (collection: 93.958s, learning 2.685s)
                  Prop latent loss: 0.6121
                       Mean reward: -2.11
               Mean episode length: 14.77
--------------------------------------------------------------------------------
                   Total timesteps: 98304
                    Iteration time: 96.64s
                        Total time: 96.64s
                               ETA: 9664.3s

################################################################################
                        Learning iteration 1/100                        

                       Computation: 34769 steps/s (collection: 0.987s, learning 1.840s)
                  Prop latent loss: 0.1622
                       Mean reward: -2.27
               Mean episode leng
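
The numbers in the log are easy to cross-check. The sketch below recomputes the throughput and ETA for iteration 0 from the printed values. The 24 steps per environment per iteration is inferred from `98304 / 4096`, not read from the training config. The much faster iteration 1 suggests that the first iteration's collection time is dominated by one-time overhead such as JIT compilation, though the log itself does not say so.

```python
num_envs = 4096            # _NUM_ENVS from the config cell
total_timesteps = 98304    # "Total timesteps" after iteration 0
collection_s, learning_s = 93.958, 2.685
num_iterations = 100

steps_per_env = total_timesteps // num_envs     # inferred rollout length per iteration
iteration_s = collection_s + learning_s
steps_per_s = total_timesteps / iteration_s
eta_s = iteration_s * num_iterations            # time for 100 iterations at this pace

print(steps_per_env, round(steps_per_s), round(eta_s, 1))
```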

# Evaluate/Visualize the Student Policy

In [33]:
student_policy = dagger_runner.get_inference_policy(device=device)
raw_eval_env = LocomotionRMAEnv(task="rough_terrain", config=env_cfg) #task in {flat_terrain, rough_terrain}


render_trajectory = []
def render_callback(_, state):
    render_trajectory.append(state)

num_eval_envs = 1
from EECS598RSLRLBraxWrapper import EECS598RSLRLBraxWrapper
brax_eval_env = EECS598RSLRLBraxWrapper(
    raw_eval_env,
    1,
    _SEED+2,
    env_cfg.episode_length,
    1,
    render_callback=render_callback,
    randomization_fn=randomizer,
)

obs_shape: {'privileged_state': (17,), 'state': (51,)}
Asymmetric observation space
JITing reset and step
Done JITing reset and step


### Perform a rollout
`TODO(student):` Perform a rollout using the student policy. Make sure you correctly initialize and update the proprioceptive history. **(20 points)**

In [34]:
render_trajectory = []

# TODO(student): Initialize the history buffer
history = torch.zeros((brax_eval_env.num_envs, dagger_runner.history_shape), dtype=torch.float, device=device)

obs = brax_eval_env.get_observations()


from tqdm import tqdm
with torch.inference_mode():
    for _ in tqdm(range(env_cfg.episode_length)):
        actions = dagger_runner.actor.get_student_action(obs, history) #TODO(student): use the student policy

        obs, privileged_obs, rewards, dones, infos = brax_eval_env.step(actions)

        brax_eval_env.render()

        # TODO(student): Update the rolling history
        # Drop the oldest observation from the front and append the newest at the end.
        # obs.to(history) matches the buffer's dtype and device before concatenating.
        obs_dim = brax_eval_env.num_obs
        history = torch.cat([history[:, obs_dim:], obs.to(history)], dim=-1)



        if dones.any():
            break

100%|█████████▉| 999/1000 [01:03<00:00, 15.69it/s]
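
The rolling update is easy to get subtly wrong, so it can help to test the indexing on a tiny example first. This NumPy sketch uses a 2-d observation and a history of 3 steps (both made up for illustration) with the same flat layout as the cell above: oldest observation first, newest last.

```python
import numpy as np

def push(history: np.ndarray, obs: np.ndarray) -> np.ndarray:
    """Drop the oldest observation and append the newest along the last axis."""
    obs_dim = obs.shape[-1]
    return np.concatenate([history[:, obs_dim:], obs], axis=-1)

num_envs, obs_dim, history_len = 1, 2, 3
history = np.zeros((num_envs, obs_dim * history_len))

for t in range(1, 5):  # observations [1, 1], [2, 2], [3, 3], [4, 4]
    history = push(history, np.full((num_envs, obs_dim), float(t)))

print(history)
```

After four pushes into a three-step buffer, the first observation has been evicted and the buffer holds steps 2 through 4 in order. Whether the buffer should also be zeroed when an episode resets is worth checking against `on_policy_runner_dagger.py`.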


### Render a Video

In [35]:
from IPython.display import Video
_CAMERA = None

# Render
scene_option = mujoco.MjvOption()
scene_option.geomgroup[2] = True
scene_option.geomgroup[3] = False
scene_option.flags[mujoco.mjtVisFlag.mjVIS_CONTACTPOINT] = True
scene_option.flags[mujoco.mjtVisFlag.mjVIS_TRANSPARENT] = False
scene_option.flags[mujoco.mjtVisFlag.mjVIS_CONTACTFORCE] = False

render_every = 2

# If your environment is wrapped multiple times, adjust as needed:
base_env = brax_eval_env.env.env.env  # or brax_env.env.env.env
fps = 1.0 / base_env.dt / render_every
traj = render_trajectory[::render_every]
frames = raw_eval_env.render(
    traj,
    camera="track",
    height=480,
    width=640,
    scene_option=scene_option,
)

video_name = f"student_rollout.mp4"
media.write_video(f"{video_name}", frames, fps=fps)
print(f"Rollout video saved as '{video_name}'")
media.show_video(frames, fps=fps)

100%|██████████| 500/500 [00:02<00:00, 185.65it/s]


Rollout video saved as 'student_rollout.mp4'




# Analyze the Learned Environment Latent $\hat{z}_t$ (optional)
One thing you might be curious about is what the environment latent $z_t$ actually captures. One way to find out is to take the ActorCriticLatent policy that you trained and pass in hand-made environment observation vectors to see how they affect $z_t$. Alternatively, you could collect $z_t$ values while rolling out the policy in randomized environments. Since $z_t$ likely has more than two dimensions, you may find it useful to visualize it with a latent space projection technique such as t-SNE.
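
As a starting point, the sketch below projects a batch of latents to 2-D with PCA, swapped in here for t-SNE because it needs only NumPy. The latents are synthetic stand-ins (two clusters in 8 dimensions); in the notebook you would replace `latents` with $z_t$ values collected during rollouts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for collected z_t: two clusters in an 8-d latent space.
centers = np.array([[1.0] * 8, [-1.0] * 8])
labels = rng.integers(0, 2, size=200)
latents = centers[labels] + 0.1 * rng.normal(size=(200, 8))

# PCA via SVD of the mean-centered data.
centered = latents - latents.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:2].T  # (200, 2) coordinates for plotting
explained = singular_values**2 / np.sum(singular_values**2)

print(projected.shape, explained[:2].round(3))
```

A call such as `plt.scatter(projected[:, 0], projected[:, 1], c=labels)` would then show the two clusters. With real latents, coloring by a randomized parameter such as friction is one way to see what the latent has picked up.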

In [None]:
# Your Code Here

## What to Turn In

`#TODO(student):` Please zip the following files and submit them to the assignment on Gradescope:
1. This `08_lab_student.ipynb` file. Please make sure to fill out your name and UMID in the first cell.
2. The `dagger.py` file.

Please ensure all cell outputs (videos, plots, etc.) are intact when you download the .ipynb file.