# **Assignment 4: Model Based Reinforcement Learning**

### **Due Date**: 03/28/2025 at 11:59 PM

### **Late Due Date**: 03/31/2025 at 11:59 PM

#### **Writeup**: https://docs.google.com/document/d/1Ut7LWu_KagSjOGMFJwxMSRoVPtdlcUlSMQBOO7WkjPM/edit?usp=sharing

# **Introduction**

Welcome to Assignment 4 of CS 4756/5756. In this assignment, you will train Policy Gradient methods under various World Models. This assignment consists of the following components:

- **[PROVIDED] Setup**: Dependency installation and initialization.
- **[PROVIDED] Helper Functions**: Provided functions for visualization and evaluation.
- **Part 1**: Train an expert PPO agent with StableBaselines3.
- **Part 2**: Train a world model using environment transitions.
- **Part 3**: Train a learner PPO agent on the world model.
- **[GRAD] Part 4**: Aggregate new data from the learner policy.

You will use the **FetchReach-v4** environment for this assignment. Refer to the Gymnasium-Robotics website for more details about this [environment](https://robotics.farama.org/envs/fetch/reach/).

Please read through the following paragraphs carefully.

**Getting Started**: You should complete this assignment on [Google Colab](https://colab.research.google.com).

**Evaluation**: Your code will be tested for correctness and, for certain assignments, speed. For this particular assignment, performance results will not be harshly graded (although we provide approximate expected reward numbers, you are not expected to replicate them exactly). Please remember that all assignments should be completed individually.

**Academic Integrity**: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else’s code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don’t try. We trust you all to submit your own work only; please don’t let us down. If you do, we will pursue the strongest consequences available to us.

**Getting Help**: The [Resources](https://www.cs.cornell.edu/courses/cs4756/2025sp/#resources) section on the course website is your friend! If you ever feel stuck in these projects, please feel free to avail yourself of office hours and Edstem! If you are unable to make any of the office hours listed, please let the TAs know and we will be happy to assist. If you need a refresher for PyTorch, please see this [60 minute blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)! For NumPy, please see the quickstart [here](https://numpy.org/doc/stable/user/quickstart.html) and the full API [here](https://numpy.org/doc/stable/reference/).

# **[PROVIDED] Setup**

Please run the cells below to install the necessary packages.

In [None]:
import sys
USING_COLAB = 'google.colab' in sys.modules

if USING_COLAB:
    !apt-get -qq update
    !apt-get -qq install -y libosmesa6-dev libgl1-mesa-glx libglfw3 libgl1-mesa-dev libglew-dev patchelf
    !apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
else:
    !pip install torch torchvision torchaudio
    !pip install numpy
    !pip install tqdm
    !pip install opencv-python

!pip install matplotlib
!pip install -U mediapy
!pip install -U renderlab
!pip install -U "imageio<3.0"
!pip install stable_baselines3

!git clone https://github.com/Farama-Foundation/Gymnasium-Robotics.git
!pip install -e Gymnasium-Robotics
sys.path.append('/content/Gymnasium-Robotics')

In [None]:
import os
# Mujoco GLEW Setup
try:
    if _mujoco_run_once:  pass
except NameError:
    _mujoco_run_once = False

if not _mujoco_run_once:
    try:
        os.environ['LD_PRELOAD']=os.environ['LD_PRELOAD'] + ':/usr/lib/x86_64-linux-gnu/libGLEW.so'
    except KeyError:
        os.environ['LD_PRELOAD']='/usr/lib/x86_64-linux-gnu/libGLEW.so'

    # Presetup so we don't see output on first env initialization
    _mujoco_run_once = True
    if USING_COLAB:
        NVIDIA_ICD_CONFIG_PATH = '/usr/share/glvnd/egl_vendor.d/10_nvidia.json'
        if not os.path.exists(NVIDIA_ICD_CONFIG_PATH):
            with open(NVIDIA_ICD_CONFIG_PATH, 'w') as f:
                f.write("""{
                    "file_format_version" : "1.0.0",
                    "ICD" : {
                        "library_path" : "libEGL_nvidia.so.0"
                    }
                }""")

    # Set environment variable to support EGL (off-screen) rendering
    %env MUJOCO_GL=egl

Please run the cells below to import necessary packages and set the initial seeding.

In [None]:
from torch.utils.data import DataLoader
import gymnasium.wrappers as wrappers
import matplotlib.pyplot as plt
import torch.distributions as D
from tqdm import tqdm, trange
import torch.optim as optim
import gymnasium_robotics
import gymnasium as gym
import torch.nn as nn
import numpy as np
import random
import torch

In [None]:
seed = 695

# Setting the seed to ensure reproducibility
def reseed(seed, env=None):
    torch.manual_seed(seed)
    random.seed(seed)
    np.random.seed(seed)

    if env is not None:
        env.unwrapped._np_random = gym.utils.seeding.np_random(seed)[0]

reseed(seed)

In [None]:
# In this block we define wrappers necessary to simplify the environment MDP
def wrap_reach_fixed_goal(env):
    g = np.array([1.486, 0.73, 0.681], dtype=np.float32)
    env.unwrapped._sample_goal = lambda: g
    return env

class FetchRewardWrapper(gym.Wrapper):
    def reset(self, *args, **kwargs):
        obs, info = self.env.reset(*args, **kwargs)
        self.prev_dist = np.linalg.norm(obs['achieved_goal'] - obs['desired_goal'])
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action) # Terminated is never set to true
        current_dist = np.linalg.norm(obs['achieved_goal'] - obs['desired_goal'])
        reward = (self.prev_dist - current_dist) * 10
        self.prev_dist = current_dist
        return obs, reward, info['is_success'], truncated, info

# **[PROVIDED] Helper Functions**

### **Visualize Helper Function**

Below, we provide the helper function `visualize` for your use. This function will create a visualization of the environment passed in the parameter `env`. If you are using Colab, calling this function will render the visualization within the notebook. If you are using your local machine, this function will instead save a video of the visualization to your current directory (rendering videos in Jupyter Notebooks is not widely supported outside of Colab).

**Note:** In this code, a choice is provided on whether to vectorize the environment. The difference across vectorized and not vectorized gymnasium environments will be explained in the StableBaselines Introduction section.

In [None]:
def visualize(env: gym.Env, algorithm=None, video_name="test"):
    """
        Visualize a policy network for a given algorithm on a single episode

        Args:
            - env (gym.Env): Instantiated gymnasium environment to roll out
                `algorithm` in (not a VecEnv).
            - algorithm (PPOActor): Algorithm whose policy network will be rolled
                out for the episode. If no algorithm is passed in, a random policy
                will be visualized.
            - video_name (str): Name for the mp4 file of the episode that will be
                saved (omit .mp4). Only used when running on local machine.
    """

    def get_action(obs):
        if not algorithm:
            return env.action_space.sample()
        else:
            return algorithm.select_action(obs)

    if USING_COLAB:
        import renderlab as rl

        directory = './video'
        env = rl.RenderFrame(env, "output/")
        obs, info = env.reset()

        for i in range(500):
            action = get_action(obs)
            obs, reward, terminated, truncated, info = env.step(action)
            if terminated or truncated: break
        env.play()

    else:
        import cv2

        video = cv2.VideoWriter(f"{video_name}.mp4", cv2.VideoWriter_fourcc(*'mp4v'), 24, (600,400))
        obs, info = env.reset()  # gymnasium reset returns (observation, info)

        for i in range(500):
            action = get_action(obs)
            obs, reward, terminated, truncated, info = env.step(action)
            if terminated or truncated: break

            im = env.render()  # render mode is fixed when the env is created
            im = im[:,:,::-1]
            video.write(im)

        video.release()
        env.close()
        print(f"Video saved as {video_name}.mp4")

### **Policy Evaluation Functions**

The `evaluate_policy` function takes an agent actor and an environment whose observations can be passed to the actor, and evaluates the policy by doing the following:

- Roll out the actor for a default of 100 trajectories, and record the total reward.
- Return the average trajectory reward over these episodes.

**Note:** Since the actor defined in this assignment wraps a StableBaselines3 PPO agent, the environment passed to `evaluate_policy` must be an instance of `VecEnv`. More information is given in Part 1.

The `success_rate` function is similar to `evaluate_policy`, except that it takes a regular gymnasium environment instead of a vectorized environment. It returns the fraction of successful episodes (a value between 0 and 1) instead of the mean reward.

In [None]:
def evaluate_policy(actor, environment, num_episodes=100, progress=True):
    """
        Returns the mean trajectory reward of rolling out `actor` on `environment`.

        Parameters
        - actor: PPOActor instance, defined in Part 1.
        - environment: stable_baselines3.common.vec_env.VecEnv instance.
        - num_episodes: Total number of trajectories to collect and average over.
    """

    total_rew = 0
    iterate = (trange(num_episodes) if progress else range(num_episodes))

    for _ in iterate:
        obs = environment.reset()
        done = False

        while not done:
            action = actor.select_action(obs)
            next_obs, reward, done, info = environment.step(action)
            total_rew += reward
            obs = next_obs

    return (total_rew / num_episodes).item()


def success_rate(actor, environment, num_episodes=100, progress=True):
    """
        Returns the fraction (between 0 and 1) of successful trajectories of `actor` on `environment`.

        Parameters
        - actor: PPOActor instance, defined in Part 1.
        - environment: Gymnasium environment.
        - num_episodes: Total number of trajectories to collect and average over.
    """

    total_success = 0
    iterate = (trange(num_episodes) if progress else range(num_episodes))

    for _ in iterate:
        obs, info = environment.reset()
        done = False

        while not done:
            action = actor.select_action(obs)
            next_obs, reward, done, truncated, info = environment.step(action)
            obs = next_obs

            if done: total_success += 1
            if truncated: break

    return (total_success / num_episodes)

### **Notes About Fetch Reach Environment**

The environment uses a Fetch Robot, which is a 7-DoF Mobile Manipulator.

The task is a _goal-reaching task_: The observation space contains `observation` which includes the state of the robot in the environment, and `desired_goal` which specifies the xyz coordinate that the robot's gripper aims to reach.

See https://robotics.farama.org/envs/fetch/reach/ for more details.

If the goal is reached, `info['is_success']` will be set to 1, and this is an indication that we should terminate the rollout.

The reward is normally -1 per timestep spent in the environment without completing the task, with 50 steps being the limit (so -50 is the worst episode return).

> Note: For this assignment, we've modified the environment so that it only has a fixed goal to reach, and has better reward shaping. This is to make training easier and quicker later on.
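
To make the reward shaping concrete, here is a minimal NumPy sketch of the progress-based reward computed by `FetchRewardWrapper`. The gripper positions are made-up values for illustration; only the goal coordinates and the factor of 10 are taken from the wrapper code above.

```python
import numpy as np

# Fixed goal used by wrap_reach_fixed_goal above.
goal = np.array([1.486, 0.73, 0.681])

# Hypothetical gripper positions over four timesteps (illustrative values only).
positions = np.array([
    [1.30, 0.75, 0.55],
    [1.35, 0.74, 0.60],
    [1.42, 0.73, 0.65],
    [1.48, 0.73, 0.68],
])

# Distance to the goal at every timestep.
dists = np.linalg.norm(positions - goal, axis=1)

# Shaped reward: 10 * (previous distance - current distance).
rewards = 10.0 * (dists[:-1] - dists[1:])

# The per-step rewards telescope, so the episode return depends only on
# the first and the last distance.
episode_return = rewards.sum()
print(rewards, episode_return)
```

Moving toward the goal gives a positive reward and moving away gives a negative one, so the return of an episode is 10 times the total distance closed.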

**Run the cells below to create and visualize the environment:**

In [None]:
# Let's initialize the environment first
reseed(seed)

def make_fetch_env():
    env = gym.make("FetchReach-v4", render_mode="rgb_array")
    env = wrap_reach_fixed_goal(env)
    env = FetchRewardWrapper(env)
    env = wrappers.FilterObservation(env, ["desired_goal", "observation"])
    env = wrappers.FlattenObservation(env)
    return env

real_env = make_fetch_env()

In [None]:
visualize(real_env)

# **Part 1: Train Expert Using StableBaselines3**



### **1.1: [PROVIDED] Introduction To Stable Baselines 3**

StableBaselines3 is a popular off-the-shelf set of reliable implementations of reinforcement learning algorithms in PyTorch. In this assignment, we will be using its PPO (Proximal Policy Optimization) implementation as our agent.

Each algorithm implementation is a subclass of the `stable_baselines3.common.base_class.BaseAlgorithm` class, which provides us with the following functions:

- `learn(total_timesteps, callback=None, log_interval=100, tb_log_name='run', reset_num_timesteps=True, progress_bar=False)`
  - This is the training loop of any of the RL algorithm implementations. Training is done by calling this function with an appropriate amount of `total_timesteps`.
- `predict(observation)`
  - Returns a tuple `(predicted_action, next_hidden_state)` based on input `observation`. If we are not using an RNN, the next hidden state can be neglected.
- `save(path)`
  - Saves the current policy parameters into a `.zip` file at the given `path`. Note that the `path` does not need to include the `.zip` suffix.
- `load(path, env=None)`
  - Loads a saved `.zip` checkpoint into this RL implementation model.

### **1.2: [PROVIDED] Hyperparameters**

The implementation has a set of hyperparameters that can be tuned towards better performance. For the sake of simplicity, we will provide the hyperparameters for the StableBaselines3 PPO implementation. The main ones we specify include the following:

- `n_steps`: the number of steps to run with the environment for each update to the policy network.
- `net_arch`: The network architecture of the policy network and the critic network:
  - `pi`: a list that specifies the hidden dimensions of the policy network. The input and output dimensions are determined by the environment associated with this policy.
  - `vf`: a list that specifies the hidden dimensions of the critic network.
- `activation_fn`: The nonlinearity (a `torch.nn` module class) applied between the MLP layers. It is a `policy_kwargs` entry alongside `net_arch`, not a key inside it.

For a more comprehensive list and description of these hyperparameters, see the official [documentation page](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#parameters).


In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.vec_env.base_vec_env import VecEnv

hyperparameters = {
    "n_steps": 512,
    "policy_kwargs": {
        "net_arch": {
            "pi": [128],
            "vf": [128],
        },
        # Passed as a module class; extra keys inside `net_arch` are ignored,
        # and nn.Tanh is also the default for the MLP policy.
        "activation_fn": nn.Tanh,
    },
}

### **1.3: Vectorized Environment**

For any StableBaselines3 algorithm implementation, the gymnasium environment used needs to be converted into a [vectorized environment](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#) of `VecEnv` type.

A vectorized environment stacks `n` independent environments into one and steps all of them on each call. If we set the `n_envs` parameter to 3, then 3 environments will be stepped each time the VecEnv is stepped.

**For the rest of this assignment, all vectorized environments with `n_envs=n` will be described as n-vectorized.**

With a vectorized environment that steps multiple environments at the same time, learning can be made more efficient by parallelizing trajectory collection across these independent environments. The `n_envs` parameter can be tailored to the specific machine.

**Important Differences:**
- The vectorized environments require the input action to have shape `(n_envs, act_dim)`. The output observation from `step` and `reset` also has shape `(n_envs, obs_dim)`.
- The VecEnv `reset()` function returns only the observation, while the gymnasium.Env `reset()` function returns a tuple `(observation, info_dict)`.
- The `vec_env.step(action)` function returns a 4-tuple of `(obs, reward, done, info)`, while `gym_env.step(action)` returns a 5-tuple of `(obs, reward, terminated, truncated, info)`. The `done` value from VecEnv is equivalent to the gymnasium environment's `terminated or truncated`.
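
The shape and return-value conventions listed above can be illustrated with a small NumPy-only stand-in. `ToyVecEnv` is a hypothetical class written for this sketch; it is not the StableBaselines3 `VecEnv`, and it only mimics the batched shapes and the 4-tuple returned by `step`. The sizes 13 and 4 are assumed to match the flattened observation and action sizes used later in this assignment.

```python
import numpy as np

class ToyVecEnv:
    """Hypothetical stand-in that mimics the VecEnv calling convention.

    This is NOT the StableBaselines3 class; it only reproduces the batched
    shapes and the 4-tuple returned by `step`.
    """

    def __init__(self, n_envs, obs_dim, act_dim, max_steps=5):
        self.n_envs, self.obs_dim, self.act_dim = n_envs, obs_dim, act_dim
        self.max_steps = max_steps
        self.steps = np.zeros(n_envs, dtype=int)

    def reset(self):
        # Only the batched observation is returned (no info dict).
        self.steps[:] = 0
        return np.zeros((self.n_envs, self.obs_dim), dtype=np.float32)

    def step(self, actions):
        # One action row per sub-environment.
        assert actions.shape == (self.n_envs, self.act_dim)
        self.steps += 1
        obs = np.zeros((self.n_envs, self.obs_dim), dtype=np.float32)
        rewards = np.zeros(self.n_envs, dtype=np.float32)
        # `dones` plays the role of `terminated or truncated`.
        dones = self.steps >= self.max_steps
        self.steps[dones] = 0  # finished sub-environments start over
        infos = [{} for _ in range(self.n_envs)]
        return obs, rewards, dones, infos

vec_env = ToyVecEnv(n_envs=3, obs_dim=13, act_dim=4)
obs = vec_env.reset()
obs, rewards, dones, infos = vec_env.step(np.zeros((3, 4), dtype=np.float32))
print(obs.shape, rewards.shape, dones.shape, len(infos))
```

Every returned quantity carries a leading `n_envs` dimension, which is why code written for a single gymnasium environment usually needs small changes before it works with a VecEnv.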


A VecEnv instance can be created using the `make_vec_env` function, which takes the id of the desired gymnasium environment, as well as the number of environments needed. This function has the following key parameters:
- `env_id`: the id of the gymnasium environment, or instantiated gym environment, or a callable that returns an env.
- `n_envs`: The number of environments to have in parallel.
- `seed`: The initial seed for the random number generator.
- `env_kwargs`: An optional parameter to pass into the environment constructor.

More detailed function documentation can be found in this [page](https://stable-baselines3.readthedocs.io/en/master/common/env_util.html#stable_baselines3.common.env_util.make_vec_env).

**Instructions** For this part, please create two vectorized versions of `FetchReach-v4`, with 3 and 1 environments stacked. Note that because of our wrappers, you need to pass a callable; we have one called `make_fetch_env` defined above.

In [None]:
from stable_baselines3.common.env_util import make_vec_env

# TODO: Instantiate
real_vec_env_1 = None
real_vec_env_3 = None
# END TODO

### **1.4: Actor Definition**

**Instruction**: You will need to implement the following PPOActor class, which serves as a wrapper to provide PPO model predictions.
- `__init__`: Takes a path to the checkpoint and the corresponding environment, and loads an instance of this PPO checkpoint. If a PPO model is given instead, that model is stored directly as the internal model. This is used by the Callback class, and since we provide that implementation for you, you only need to implement the model-loading portion of the constructor.
- `select_action`: Takes an observation and produces the corresponding action prediction from the checkpoint PPO model. While implementing, take note of the output of the `predict` function.

In [None]:
class PPOActor():
    def __init__(self, ckpt: str=None, environment: VecEnv=None, model=None):
        '''
          Requires environment to be a 1-vectorized environment

          The `ckpt` is a .zip file path that leads to the checkpoint you want
          to use for this particular actor.

          If the `model` variable is provided, then this constructor will store
          that as the internal representing model instead of loading one from the
          checkpoint path
        '''
        assert ckpt is not None or model is not None

        if model is not None:
            self.model = model
            return

        # TODO: Load checkpoint
        self.model = None
        # END TODO

    def select_action(self, obs):
        '''Gives the action prediction of this particular actor'''

        # TODO: Select action
        return None
        # END TODO

### **1.5: [PROVIDED] Callbacks**

Since training can take a significant amount of time, StableBaselines3 provides a means to monitor training progress through a `BaseCallback` instance, which can optionally be passed as a parameter of the `learn` function. The callback is customized by defining a subclass of `BaseCallback`.

For this part, we provide you with a customized callback that evaluates the model under training whenever the total timestep count reaches a multiple of 1024. The evaluation runs on an evaluation environment, which will be the 1-vectorized environment you instantiated in the previous portion. Based on the evaluation result, the callback saves a checkpoint of the model if it is the best performing model so far. At the end of training, a plot of all evaluation results with respect to the number of steps is generated.
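
As a rough check of how often the callback evaluates, the sketch below replays the `num_timesteps % save_freq == 0` test from `_on_step`. It assumes that the timestep counter advances by `n_envs` on every vectorized step, which is our understanding of how StableBaselines3 counts timesteps, and it uses the step budget of 40960 from Part 1.6. `eval_points` is a helper name introduced only for this illustration.

```python
def eval_points(n_envs, total_timesteps, save_freq=1024):
    """Timesteps at which `num_timesteps % save_freq == 0` would fire.

    Assumption: the counter grows by `n_envs` per vectorized step.
    """
    points = []
    num_timesteps = 0
    while num_timesteps < total_timesteps:
        num_timesteps += n_envs
        if num_timesteps % save_freq == 0:
            points.append(num_timesteps)
    return points

one_env = eval_points(n_envs=1, total_timesteps=40960)
three_envs = eval_points(n_envs=3, total_timesteps=40960)
print(len(one_env), one_env[:3])
print(len(three_envs), three_envs[:3])
```

Under this assumption, a single environment triggers an evaluation every 1024 timesteps, while three environments only land on a multiple of 1024 every 3072 timesteps, so the final plot contains fewer evaluation points than the budget divided by 1024 would suggest.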

You are free to modify this callback class to help you visualize training in whatever way is most convenient for you, but this is **NOT REQUIRED**.

In [None]:
class PPOCallback(BaseCallback):
    def __init__(self, verbose=0, save_path='default', eval_env=None):
        super(PPOCallback, self).__init__(verbose)
        self.rewards = []

        self.save_freq = 1024
        self.min_reward = -np.inf
        self.actor = None
        self.eval_env = eval_env

        self.save_path = save_path
        self.eval_steps = []
        self.eval_rewards = []

    def _init_callback(self) -> None:
        pass

    def _on_training_start(self) -> None:
        """
        This method is called before the first rollout starts.
        """
        self.actor = PPOActor(model=self.model)

    def _on_rollout_start(self) -> None:
        """
        A rollout is the collection of environment interaction
        using the current policy.
        This event is triggered before collecting new samples.
        """
        pass

    def _on_rollout_end(self) -> None:
        """
        This event is triggered before updating the policy.
        """
        episode_info = self.model.ep_info_buffer
        rewards = [ep_info['r'] for ep_info in episode_info]
        mean_rewards = np.mean(rewards)
        self.rewards.append(mean_rewards)

    def _on_step(self) -> bool:
        """
        This method will be called by the model after each call to `env.step()`.

        For child callback (of an `EventCallback`), this will be called
        when the event is triggered.

        :return: If the callback returns False, training is aborted early.
        """
        if self.eval_env is None:
            return True

        if self.num_timesteps % self.save_freq == 0 and self.num_timesteps != 0:
            mean_reward = evaluate_policy(self.actor, environment=self.eval_env, num_episodes=20)
            print(f'evaluating {self.num_timesteps=}, {mean_reward=}=======')

            self.eval_steps.append(self.num_timesteps)
            self.eval_rewards.append(mean_reward)
            if mean_reward > self.min_reward:
                self.min_reward = mean_reward
                self.model.save(self.save_path)
                print(f'model saved on eval reward: {self.min_reward}')

        return True

    def _on_training_end(self) -> None:
        """
        This event is triggered before exiting the `learn()` method.
        """
        print(f'model saved on eval reward: {self.min_reward}')

        plt.plot(self.eval_steps, self.eval_rewards, c='red')
        plt.xlabel('Timesteps')  # x-axis holds self.eval_steps, not episodes
        plt.ylabel('Rewards')
        plt.title('Evaluation Rewards over Timesteps')

        plt.show()
        plt.close()

### **1.6: PPOActor Initialization And Training**

The `stable_baselines3.ppo.PPO` class inherits from the `BaseAlgorithm` class described at the beginning of this section, and is specifically implemented for the PPO algorithm. To initialize an instance, the following parameters are especially important:
- `policy: str`: The policy type we use to train the agent; common ones include MlpPolicy and CnnPolicy. In our case, we will be using the MlpPolicy.
- `env: VecEnv`: The environment that the agent is rolled out on for training. It must be vectorized, or it will be vectorized by the PPO implementation.
- `n_steps`: The number of steps to run in each environment before each update to the policy network.
- `device`: The device to put the model on (For this assignment, if you're not able to reach the performance bounds, try setting this parameter to cpu)
- Other hyperparameters specified in the `hyperparameters` dictionary we provided, can be directly applied using the `**` operator.

**Instructions**
- Initialize a PPO MLP policy as expert, using the 3-env VecEnv initialized in the previous part and pass in the given hyperparameters.
- Train the expert with an instance of the `PPOCallback` defined before. No need to save the resulting model into checkpoint since that is done for you in the Callback class
  - (HINT): Look at the beginning of Part 1 for useful functions for training.


**Estimated Training Time**:
- 2 - 4 minutes on Google Colab CPU

In [None]:
reseed(seed)
ckpt_path = 'expert'
total_steps = 40960
expert_callback = PPOCallback(save_path=ckpt_path, eval_env=real_vec_env_1)

# TODO: Instantiate and train
expert = None
# END TODO

### **1.7: Evaluate Expert**

**Instructions** Initialize an expert PPOActor instance from the checkpoint and evaluate the expert agent using the `evaluate_policy` and `success_rate` functions on the real environment.

**Expected Reward**: Around 1.6 - 1.7 on `real_vec_env_1`

**Expected Success**: Around 0.95 on `real_env`

In [None]:
expert = PPOActor(ckpt_path, real_vec_env_1)

# TODO: Evaluate

# END TODO

### **1.8: [PROVIDED] Visualize Expert**

In [None]:
visualize(real_env, algorithm=expert, video_name='expert')

# **Part 2: Collect Data And Train World Model**

### **2.1: [PROVIDED] Overview**
In the real world, unlike in simulation, we can rarely access the full transition function. We emulate that limitation in this assignment.

Assume that we do not have access to the underlying dynamics of `FetchReach-v4`, but that we do have an expert agent that solves the task. We then take the following model-based reinforcement learning approach to learn an RL agent that can be applied to the real scenario.

In real life, we might not have such a trained expert; a human operating the robot remotely could be one source of expert data.

1. Rollout a series of expert trajectories in the true environment (analogous to collecting a set of human demonstrations on the robot)
2. Define and train a world model with the trajectory transitions as input data
3. Define a new environment that applies the trained world model
4. Learn an RL agent under the learned environment
5. Evaluate this agent using the real environment

You will need to implement the following functions and classes
- `data_collect`: a helper function that rolls out a policy on an environment and returns a tuple of lists representing the transitions
- `WorldModel`: a `torch.nn` module defining the architecture of the world model.
- `train_world_model` and `eval_world_model`: Training and evaluation loop of the world model

Follow the instructions below to implement each of these components

### **2.2: Collect Data**

**Instructions**

The `data_collect` function should roll out a policy actor on the environment for a total of `num_steps` steps, with a maximum trajectory length of `traj_max_length`, and then return 3 lists: `observations`, `actions`, and `next_observations`, such that for any transition index $0 \leq i <$ `num_steps`:

`data_env` in state `observations[i]`, when stepped with `actions[i]`, yields the new state `next_observations[i]`.
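
The invariant above can be checked on a toy problem. The following is a minimal sketch, not the required implementation: `ToyEnv`, `collect_transitions`, and the random `policy` are hypothetical names introduced here, and the toy dynamics (next state = state + action) are chosen only so that the invariant is easy to verify.

```python
import numpy as np

class ToyEnv:
    """Hypothetical 1-D environment following the gymnasium reset/step signatures."""

    def reset(self):
        self.x = np.zeros(1)
        return self.x.copy(), {}

    def step(self, action):
        self.x = self.x + action
        terminated = bool(abs(self.x[0]) > 3.0)
        return self.x.copy(), 0.0, terminated, False, {}

def collect_transitions(env, policy, num_steps, traj_max_length):
    observations, actions, next_observations = [], [], []
    obs, _ = env.reset()
    traj_len = 0
    while len(observations) < num_steps:
        action = policy(obs)
        next_obs, _, terminated, truncated, _ = env.step(action)
        observations.append(obs)
        actions.append(action)
        next_observations.append(next_obs)
        traj_len += 1
        if terminated or truncated or traj_len >= traj_max_length:
            # Start a new trajectory; the reset observation is never
            # stored as a `next_observation`.
            obs, _ = env.reset()
            traj_len = 0
        else:
            obs = next_obs
    return observations, actions, next_observations

rng = np.random.default_rng(0)
policy = lambda obs: rng.uniform(-1.0, 1.0, size=1)  # random stand-in policy
obs_list, act_list, next_list = collect_transitions(
    ToyEnv(), policy, num_steps=20, traj_max_length=5
)
print(len(obs_list), len(act_list), len(next_list))
```

The detail worth noticing is the trajectory boundary: after a reset, the new initial observation becomes the next `observations[i]`, so no stored transition ever spans two trajectories.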

In [None]:
def data_collect(num_steps: int, traj_max_length: int, data_env: gym.Env, actor: PPOActor):
    '''
    Collects observation, action, next_observation triplet data for `num_steps` steps,
    where each trajectory has a maximum step count of `traj_max_length`

    - num_steps: Number of total steps to collect data over, should also be the sum of trajectory lengths
    - traj_max_length: Maximum length of each trajectory
    - data_env: The environment to collect data under, NOT A VecEnv

    - actor: A PPOActor whose `select_action` takes a `data_env` observation as input and outputs an action admissible to `data_env`

    Returns: (observations, actions, next_observations), each being a list
    '''

    observations, actions, next_obs = [], [], []

    # TODO: Step and collect data

    # END TODO

    return observations, actions, next_obs

**Instructions**
Run data collection function on the real environment with the expert policy trained in part 1.

**Note**: The `data_collect` function requires the environment provided to be a regular gymnasium environment instead of a vectorized environment. Please make sure not to confuse it with `real_vec_env_1` defined in Part 1.3.

**Note**: Here is a list of currently created environments:
- `real_env`
- `real_vec_env_1`
- `real_vec_env_3`

Refer to function documentation for selecting which one to use when doing function calls.

**Estimated Collection Time**:
- 2 - 4 minutes on Google Colab CPU

In [None]:
total_steps = 50000
traj_max_length = 500
reseed(seed, env=real_env)

# TODO: Collect data
observations, actions, next_obs = None, None, None
# END TODO

### **2.3: [PROVIDED] Visualize And Create Dataset**

**Note** The below visualization is showing multiple coordinates of the observation at the same time, so it looks a bit weird. You should see 3 distinct curves, where each curve is made of overlapping red and blue components.

In [None]:
def visualize_collected_data(observations, next_obs):
    '''
        Takes the first 300 data points and generates a plot of the observations and next_obs.
    '''
    print(f'Dataset Size: {len(observations)}')
    print(f'Observation Size: {observations[0].shape}')
    plt.close()
    plt.plot(np.arange(300), [obs[3:6] for obs in observations[:300]], c='blue')
    plt.plot(np.arange(300), [obs[3:6] for obs in next_obs[:300]], c='red')
    plt.show()

visualize_collected_data(observations, next_obs)

In [None]:
from torch.utils.data import Dataset

class WorldDataset(Dataset):
    def __init__(self, obs, actions, next_obs):
        self.obs = obs
        self.actions = actions
        self.next_obs = next_obs

    def __len__(self):
        return len(self.obs)

    def __getitem__(self, idx):
        return {
            'orig_obs': self.obs[idx],
            'action': self.actions[idx],
            'next_obs': self.next_obs[idx]
        }

split = len(observations) // 5
val_data = WorldDataset(observations[:split], actions[:split], next_obs[:split])
train_data = WorldDataset(observations[split:], actions[split:], next_obs[split:])

train_dataloader = DataLoader(train_data, batch_size=128)
val_dataloader = DataLoader(val_data, batch_size=128)

### **2.4: Define World Model**

The `WorldModel` class should define a neural network that takes a state-action pair and outputs a state in the state space. The network should have the following architecture (the dtype of these layers should be `torch.float64`).

- Layer 1: a fully-connected layer with `input_dim` input nodes and `hidden_dim_1` output nodes, followed by a ReLU activation function.
- Layer 2: a fully-connected layer with `hidden_dim_1` input nodes and `hidden_dim_2` output nodes, followed by another ReLU.
- Output layer: a fully-connected layer with `hidden_dim_2` input nodes and `output_dim` output nodes.

The `forward` function should take two inputs: `state` and `action`, concatenate them along the last dimension, and then pass it through the model architecture. For instance, if the state has shape `n * s`, and the action has shape `n * a`, then the input to the model should be `n * (s + a)`
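
The shape bookkeeping for the concatenation can be checked with a few lines of NumPy; `torch.cat([state, action], dim=-1)` follows the same rule for tensors. The sizes 13 and 4 are inferred from the `input_dim=17` and `output_dim=13` values used when the model is instantiated in Part 2.6, and the batch size 128 matches the dataloaders.

```python
import numpy as np

n, s_dim, a_dim = 128, 13, 4  # batch size, flattened observation size, action size

states = np.zeros((n, s_dim))
actions = np.ones((n, a_dim))

# Join along the last axis: each row becomes [state_i, action_i].
model_input = np.concatenate([states, actions], axis=-1)
print(model_input.shape)  # (128, 17)
```

The first `s_dim` columns of each row hold the state and the remaining `a_dim` columns hold the action, so row `i` of the input always pairs `state[i]` with `action[i]`.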


In [None]:
class WorldModel(nn.Module):
    def __init__(self, input_dim, hidden_dim_1, hidden_dim_2, output_dim):
        super(WorldModel, self).__init__()
        self.input_dim = input_dim

        # TODO: Define architecture
        self.fc1 = None
        self.fc2 = None
        self.fc3 = None
        # END TODO

    def forward(self, state, action):
        '''
            Expected `state` to have shape n * s_dim
            Expected `action` to have shape n * a_dim
        '''
        n, s_dim = state.shape
        n_a, a_dim = action.shape
        assert n == n_a
        assert s_dim + a_dim == self.input_dim

        # TODO: Calculate next state
        return None
        # END TODO

### **2.5: Training And Validation Function For World Model**

The `train_world_model` function should train the provided model for one epoch, using the optimizer and criterion provided on the given train_dataloader. This function should iterate through each batch of the `train_dataloader` once, update the world model based on the loss calculated by criterion, then step the optimizer.

In [None]:
def train_world_model(model, optimizer, criterion, train_dataloader):
    '''
        This function should train the torch model `model` using the
        optim `optimizer` and `criterion` as loss function, on one pass
        of the `train_dataloader`

        This should train the model for one epoch, i.e. one pass through
        the training data.

        Returns: the mean criterion loss across each batch of the dataset.
    '''
    total_loss, cnt = 0, 0
    model.train()

    # TODO: Update the model for one epoch

    # END TODO

    return total_loss / cnt


The `eval_world_model` function is similar to `train_world_model`: it iterates through the batches of the `eval_dataloader` once and computes the loss using the given criterion. Note that the model must not be updated, and gradients should not be calculated during the forward pass.
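The "no update, no gradients" requirement is commonly met by wrapping the loop in `torch.no_grad()`. The sketch below uses a toy model and random data, all hypothetical, and again assumes `(inputs, targets)` batches. It only demonstrates that nothing is tracked or changed during evaluation.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy evaluation data (hypothetical).
x = torch.randn(32, 3, dtype=torch.float64)
y = torch.randn(32, 2, dtype=torch.float64)
toy_eval_loader = DataLoader(TensorDataset(x, y), batch_size=8)

toy_model = nn.Linear(3, 2, dtype=torch.float64)
toy_criterion = nn.MSELoss()

# Keep a copy of the weights so we can confirm they are untouched.
weights_before = toy_model.weight.detach().clone()

toy_model.eval()
total_loss, cnt = 0.0, 0
with torch.no_grad():  # disables gradient tracking inside the block
    for inputs, targets in toy_eval_loader:
        loss = toy_criterion(toy_model(inputs), targets)
        total_loss += loss.item()
        cnt += 1

mean_eval_loss = total_loss / cnt
print(mean_eval_loss, loss.requires_grad)
```

Since no optimizer is involved and no `backward()` call is made, the parameters stay exactly as they were before the loop.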

In [None]:
def eval_world_model(model, criterion, eval_dataloader):
    '''
        This function should evaluate the torch model `model` using
        `criterion` as loss function, on one pass of the `eval_dataloader`

        This should evaluate the model on the validation dataset.

        Take note that during evaluation, the model should not be updated
        in any way and gradients should not be calculated.

        Returns: the mean criterion loss across each batch of the dataset.

    '''
    total_loss, cnt = 0, 0
    model.eval()

    # TODO: Evaluate the model across the whole dataset

    # END TODO

    return total_loss / cnt

### **2.6: Train The World Model**

Train an instance of `WorldModel` for `50` epochs with the dataloaders built in the previous section, using the Adam optimizer and MSE loss with an `lr` of `0.0001`. Provide a plot of the training and evaluation losses with respect to training epochs, and also print out the final evaluation loss.

**Estimated Training Time**:
- 2 - 4 minutes on Google Colab CPU

In [None]:
num_epochs = 50
reseed(seed)
lr = 0.0001

world_model = WorldModel(input_dim=17, hidden_dim_1=32, hidden_dim_2=64, output_dim=13)
optimizer = torch.optim.Adam(world_model.parameters(), lr=lr)
criterion = nn.MSELoss()

# TODO: Train and evaluate world model
train_losses, eval_losses = [], []
# END TODO

### **2.7: [PROVIDED] Build Gym Environment With World Model**

The following `WorldModelEnv` class is largely defined for you so that you can train your next PPO agent as the reinforcement learning component of MBRL. In this environment, the reward is calculated the same way as in `real_env`.

To initialize a `WorldModelEnv` environment, a `world_model` (an instance of WorldModel in this case) should be passed in as argument, which will be used as the transition function in the `step()` function.

This environment is registered under the id **WorldModelFetch** and can be created either with `gym.make` or by instantiating `WorldModelEnv` directly.


**Run the following cell to define and register this environment**

In [None]:
class WorldModelEnv(gym.Env):
    def __init__(self, world_model: WorldModel, render_mode: str='rgb_array'):
        super(WorldModelEnv, self).__init__()
        self.metadata = { 'render_modes': ['human', 'rgb_array'], 'render_fps': 30 }
        self.render_mode = render_mode  # respect the constructor argument (defaults to 'rgb_array')
        self.world_model = world_model
        self.corr_env = real_env

        self.observation_space = self.corr_env.observation_space
        self.action_space = self.corr_env.action_space
        self.obs_min = self.corr_env.observation_space.low
        self.obs_max = self.corr_env.observation_space.high

        self.state = self.corr_env.reset()[0]
        self.goal_position = self.state[:3]
        self.prev_dist = np.linalg.norm(self.state[3:6] - self.goal_position)
        self.step_count = 0

    def seed(self, seed=None):
        pass

    def reset(self, seed=None, options=None):
        self.state = self.corr_env.reset()[0]
        self.goal_position = self.state[:3]
        self.prev_dist = np.linalg.norm(self.state[3:6] - self.goal_position)
        self.step_count = 0
        return self.state, {}

    def step(self, action):
        with torch.no_grad():
            state = torch.from_numpy(self.state).unsqueeze(0)
            action = torch.from_numpy(action).unsqueeze(0)
            next_state = self.world_model(state, action).squeeze(0).numpy()
            self.state = np.clip(next_state, a_min=self.obs_min, a_max=self.obs_max)

        current_dist = np.linalg.norm(self.state[3:6] - self.goal_position)
        reward = (self.prev_dist - current_dist) * 10
        self.prev_dist = current_dist
        self.step_count += 1

        terminated = self.corr_env.unwrapped._is_success(self.state[3:6], self.goal_position)
        truncated = self.step_count >= 500
        return self.state, reward, bool(terminated), truncated, {}

gym.register(id='WorldModelFetch', entry_point=WorldModelEnv)

# **Part 3: Train Agent On World Model**

Now that we have learned a model that simulates the transition function of the real environment, it's time to train an agent on this model. In this part, you will train, visualize, and evaluate a PPO policy on the learned world model, as in Part 1, using functions defined in the Helper Functions section, Part 1, and Part 2.

**Follow instructions to complete each component**

### **3.1: Train New PPO On World Model**

**Instruction 1**: Initialize a 3-vectorized and a 1-vectorized `WorldModelFetch` environment.

**Instruction 2**: For this part, we will train a separate PPO policy using the learned model environment. This policy should be trained on the world model learned in the previous part for 40960 steps, under a 3-vectorized environment, using the same hyperparameters provided in Part 1.

**Note**: Here is a list of created environments after running the following cell:
- `real_env`
- `real_vec_env_1`
- `real_vec_env_3`
- `world_vec_env_1`
- `world_vec_env_3`

Refer to each function's documentation to decide which environment to pass in when calling it.

**Estimated Training Time**:
- 2 - 4 minutes on Google Colab CPU

In [None]:
learner_ckpt_path = 'learner'
total_steps = 40960
reseed(seed)

# TODO 1: Create vectorized world environments (HINT: use env_kwargs)
env_kwargs = None
world_vec_env_1 = None
world_vec_env_3 = None
# END TODO

learner_callback = PPOCallback(save_path=learner_ckpt_path, eval_env=world_vec_env_1)

# TODO 2: Initiate training
learner = None
# END TODO

### **3.2: [PROVIDED] Visualize Learned Policy On Real Environment**

In [None]:
learner_actor = PPOActor(ckpt=f'{learner_ckpt_path}.zip', environment=world_vec_env_1)
visualize(real_env, algorithm=learner_actor, video_name="learner_eval")

### **3.3: Evaluate Learned Policy**

Evaluate the learner agent on both the learned world model and the real environment. Also print out the success rate on the real environment.

**Expected Rewards**:
- 1.6 - 1.7 on `world_vec_env_1`
- 0.7 - 0.9 on `real_vec_env_1`

**Expected Success**: Around 0.5 on `real_env`

In [None]:
learner = PPOActor(ckpt=f'{learner_ckpt_path}.zip', environment=world_vec_env_1)

# TODO: Evaluate on both environments

# END TODO

# **[GRAD] Part 4: World Model With Aggregated Data**

### **4.1: Collect Data With Learner Policy**

In this section, we collect another 50000 steps of data on the real environment with the learned agent from the previous part, simulating rollouts of the learned agent in the real world.

**Estimated Collection Time**:
- 2 - 4 minutes on Google Colab CPU

In [None]:
total_steps = 50000
traj_max_length = 500

# TODO: Collect data using learner
policy_obs, policy_acts, policy_next_obs = None, None, None
# END TODO

visualize_collected_data(policy_obs, policy_next_obs)

### **4.2: Aggregate Data**

**Instruction**: Aggregate the expert data collected in Part 2 and the agent data collected in the previous section, shuffle their order, and form new train/val instances of `WorldDataset` and the corresponding dataloaders. The train / val split should be **80% / 20%**.
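The aggregate-shuffle-split step can be sketched with plain NumPy arrays. The array names, sizes, and random contents below are hypothetical stand-ins for the expert and learner transitions. Wrapping the result in `WorldDataset` and dataloaders is left out, since that depends on how those were defined earlier in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two sources of (obs, action, next_obs) data.
expert_obs = rng.normal(size=(60, 13))
expert_acts = rng.normal(size=(60, 4))
expert_next_obs = rng.normal(size=(60, 13))

learner_obs = rng.normal(size=(40, 13))
learner_acts = rng.normal(size=(40, 4))
learner_next_obs = rng.normal(size=(40, 13))

# 1. Aggregate: stack both sources along the sample axis.
all_obs = np.concatenate([expert_obs, learner_obs], axis=0)
all_acts = np.concatenate([expert_acts, learner_acts], axis=0)
all_next_obs = np.concatenate([expert_next_obs, learner_next_obs], axis=0)

# 2. Shuffle: one shared permutation keeps each transition's
#    (obs, action, next_obs) triple aligned.
perm = rng.permutation(len(all_obs))

# 3. Split 80% / 20%.
split = int(0.8 * len(all_obs))
train_idx, val_idx = perm[:split], perm[split:]

train_obs, val_obs = all_obs[train_idx], all_obs[val_idx]
train_acts, val_acts = all_acts[train_idx], all_acts[val_idx]
train_next_obs, val_next_obs = all_next_obs[train_idx], all_next_obs[val_idx]

print(train_obs.shape, val_obs.shape)
```

Using a single index permutation for all three arrays, rather than shuffling each array separately, is what keeps the transitions consistent after shuffling.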

In [None]:
# TODO: Aggregate data and split into training and validation
agg_train_dataloader = None
agg_val_dataloader = None
# END TODO

### **4.3: Train Second World Model Using Aggregate Data**

- Train for 50 epochs with `lr=0.0001`, Adam optimizer, and MSELoss criterion.
- Save the train and validation losses across epochs and plot them

**Estimated Training Time**:
- 3 - 5 minutes on Google Colab CPU

In [None]:
num_epochs = 50
reseed(seed)
lr = 0.0001

agg_world_model = WorldModel(input_dim=17, hidden_dim_1=32, hidden_dim_2=64, output_dim=13)
agg_optimizer = torch.optim.Adam(agg_world_model.parameters(), lr=lr)
agg_criterion = nn.MSELoss()

# TODO: Train and evaluate 2nd world model
agg_train_losses, agg_eval_losses = [], []
# END TODO

### **4.4: Train PPO Using Second World Model**

**Instruction 1**: Initialize a 3-vectorized and a 1-vectorized `WorldModelFetch` environment with the world model trained on the aggregated dataset.

**Instruction 2**: Train a separate PPO policy using the world model learned on the aggregated data. As before, this policy should be trained for 40960 steps, under a 3-vectorized environment, using the same hyperparameters provided in Part 1.

**Note**: Here is a list of created environments after running the following cell:
- `real_env`
- `real_vec_env_1`
- `real_vec_env_3`
- `world_vec_env_1`
- `world_vec_env_3`
- `agg_world_vec_env_1`
- `agg_world_vec_env_3`

Refer to each function's documentation to decide which environment to pass in when calling it.

**Estimated Training Time**:
- 3 - 5 minutes on Google Colab CPU

In [None]:
agg_ckpt_path = 'agg_learner'
total_steps = 40960
reseed(seed)

# TODO 1: Create vectorized world environments (HINT: use env_kwargs)
env_kwargs = None
agg_world_vec_env_1 = None
agg_world_vec_env_3 = None
# END TODO

agg_callback = PPOCallback(save_path=agg_ckpt_path, eval_env=agg_world_vec_env_1)

# TODO 2: Initiate training
agg_learner = None
# END TODO

### **4.5: [PROVIDED] Visualize Learned Policy On Real Environment**

In [None]:
agg_actor = PPOActor(ckpt=f'{agg_ckpt_path}.zip', environment=agg_world_vec_env_1)
visualize(real_env, algorithm=agg_actor, video_name="agg_eval")

### **4.6: Evaluate PPO On Real And Second World Environment**

**Expected Rewards:**
- About 1.6 to 1.7 on `agg_world_vec_env_1`
- About 1.5 to 1.7 on `real_vec_env_1`

**Expected Success:** Around 0.85 on `real_env`

In [None]:
agg_learner = PPOActor(ckpt=f'{agg_ckpt_path}.zip', environment=agg_world_vec_env_1)

# TODO: Evaluate on both environments.

# END TODO