# 16-831 HW2 — Policy Gradient Master Notebook

Run every experiment from **hw2_new.pdf** by executing this notebook top to bottom in Google Colab. Each section handles the training commands, produces the requested plots, and prints the numeric analyses needed for the written deliverables.


## Before you start
1. Open **File → Save a copy in Drive** so you can edit and re-run the notebook later.
2. Execute cells sequentially with **Shift+Enter**. Avoid running expensive experiments twice unless you intend to re-train from scratch.
3. Every section caches results in Google Drive under `hw_16831/hw2/data`. If you rerun a cell, the helpers reuse the latest logs instead of retraining; pass `force=True` to `run_pg_command` or `run_experiment_batch` when you do want to retrain.


In [None]:
#@title 1️⃣ Mount Google Drive
#@markdown Your work will be saved inside `hw_16831` so Colab restarts do not wipe it.
from google.colab import drive

drive.mount('/content/gdrive', force_remount=True)


In [None]:
#@title 2️⃣ Configure homework workspace paths
#@markdown This creates `/content/hw_16831` as a shortcut to your Drive directory and records the homework root.
import os
from pathlib import Path

DRIVE_WORKSPACE = Path('/content/gdrive/My Drive/hw_16831')
DRIVE_WORKSPACE.mkdir(parents=True, exist_ok=True)

COLAB_WORKSPACE = Path('/content/hw_16831')
if COLAB_WORKSPACE.exists() and not COLAB_WORKSPACE.is_symlink():
    raise RuntimeError('`/content/hw_16831` already exists and is not a symlink. Please rename or remove it before continuing.')
if not COLAB_WORKSPACE.exists():
    COLAB_WORKSPACE.symlink_to(DRIVE_WORKSPACE)

HW2_ROOT = COLAB_WORKSPACE / 'hw2'
DATA_ROOT = HW2_ROOT / 'data'

os.environ['HW2_REPO_ROOT'] = str(HW2_ROOT)
os.environ['HW2_DATA_ROOT'] = str(DATA_ROOT)
print(f"Homework repo directory: {HW2_ROOT}")


In [None]:
#@title 3️⃣ Install system dependencies
#@markdown Installs the system packages MuJoCo and Gym rely on, plus Xvfb, which the virtual display in step 9 needs.
!apt-get update -qq
!apt-get install -y -qq libosmesa6-dev libgl1-mesa-glx libglfw3 patchelf swig ffmpeg xvfb


In [None]:
#@title 4️⃣ Clone or update the homework starter code
#@markdown Pulls the official 16-831 homework repository into your Drive workspace.
import os
import subprocess
from pathlib import Path

repo_url = "https://github.com/LeCAR-Lab/16831-S25-HW.git"
repo_root = Path(os.environ['HW2_REPO_ROOT'])
if repo_root.exists() and (repo_root / '.git').exists():
    print('Repository already present — pulling latest changes...')
    subprocess.run(['git', 'pull'], check=True, cwd=repo_root)
else:
    if repo_root.exists():
        print('Removing stale directory at', repo_root)
        subprocess.run(['rm', '-rf', str(repo_root)], check=True)
    subprocess.run(['git', 'clone', repo_url, str(repo_root)], check=True)
print('Repository ready at', repo_root)


In [None]:
#@title 5️⃣ Install Python requirements
#@markdown Installs the Python dependencies declared for HW2. Re-run after each factory reset.
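import os
import subprocess
import sys
from pathlib import Path

repo_root = Path(os.environ['HW2_REPO_ROOT'])
requirements_file = repo_root / 'requirements.txt'
# Call pip through the notebook's own interpreter so packages land in the active environment.
subprocess.run([sys.executable, '-m', 'pip', 'install', '-r', str(requirements_file), '--progress-bar', 'off'], check=True)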


In [None]:
#@title 6️⃣ Download MuJoCo 2.1.0 and configure the simulator
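#@markdown MuJoCo is unpacked to `~/.mujoco/mujoco210` on the Colab VM (not in Drive), so re-run this cell after every runtime reset. The download is skipped automatically when the directory already exists.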
import os
from pathlib import Path

home = Path('~').expanduser()
mujoco_dir = home / '.mujoco'
mujoco_dir.mkdir(parents=True, exist_ok=True)

if not (mujoco_dir / 'mujoco210').exists():
    !wget -q https://mujoco.org/download/mujoco210-linux-x86_64.tar.gz -O /content/mujoco210-linux-x86_64.tar.gz
    !tar -xzf /content/mujoco210-linux-x86_64.tar.gz -C {mujoco_dir}
else:
    print('MuJoCo 2.1.0 already present — skipping download.')


In [None]:
#@title 7️⃣ Export MuJoCo environment variables
import os
from pathlib import Path

home = Path('~').expanduser()
mujoco_path = home / '.mujoco/mujoco210'
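mujoco_bin = str(mujoco_path / 'bin')
# Rebuild the path list so an empty variable leaves no leading ':' and re-runs add no duplicates.
ld_paths = [p for p in os.environ.get('LD_LIBRARY_PATH', '').split(':') if p]
if mujoco_bin not in ld_paths:
    ld_paths.append(mujoco_bin)
os.environ['LD_LIBRARY_PATH'] = ':'.join(ld_paths)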
os.environ['MUJOCO_PY_MUJOCO_PATH'] = str(mujoco_path)
os.environ['MUJOCO_GL'] = 'egl'
print('LD_LIBRARY_PATH ->', os.environ['LD_LIBRARY_PATH'])


In [None]:
#@title 8️⃣ Install reinforcement-learning helper packages
#@markdown These packages support logging (`tensorboardX`), Box2D environments such as LunarLander, and pygame-based rendering.
!pip install -q tensorboardX box2d box2d-py pygame==2.1.3


In [None]:
#@title 9️⃣ Start a virtual display (required for MuJoCo)
from pyvirtualdisplay import Display

display = Display(visible=0, size=(1400, 900))
display.start()
print('Virtual display started.')


In [None]:
#@title 🔟 Sanity check rendering setup
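#@markdown You should see a short animated GIF of a MuJoCo Ant if everything is configured correctly.
import imageio
import numpy as np
import gym
from IPython.display import Image

# The five-value `step` result below implies the gym>=0.26 API. Under that API, `reset`
# always returns (obs, info) and `render` only returns frames if `render_mode` is set here.
env = gym.make('Ant-v4', render_mode='rgb_array')
obs, _ = env.reset(seed=0)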
frames = []
for _ in range(30):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    frame = env.render()
    frames.append(frame)
    if terminated or truncated:
        break
env.close()
imageio.mimsave('/content/ant_preview.gif', frames, fps=10)
Image(filename='/content/ant_preview.gif')


## Utilities for running experiments and aggregating results
The helpers below take care of launching `run_hw2.py`, collecting TensorBoard scalars, and formatting the plots required in the PDF. They also remember which commands you already ran so repeated executions reuse cached results.


In [None]:
#@title Helper functions for experiment orchestration
import json
import math
import os
import shlex
import subprocess
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import Markdown, display
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

plt.style.use('seaborn-v0_8-darkgrid')
pd.set_option('display.max_rows', 200)
pd.set_option('display.precision', 2)

PYTHON_BIN = sys.executable
REPO_ROOT = Path(os.environ['HW2_REPO_ROOT'])
DATA_ROOT = Path(os.environ.get('HW2_DATA_ROOT', REPO_ROOT / 'data'))
DATA_ROOT.mkdir(parents=True, exist_ok=True)

RUN_REGISTRY = {}


def _normalize_command(command: str):
    tokens = shlex.split(command)
    if not tokens:
        raise ValueError('Empty command string provided.')
    if tokens[0] in {'python', 'python3'} or tokens[0].endswith('python3') or tokens[0].endswith('python'):
        tokens[0] = PYTHON_BIN
    else:
        tokens = [PYTHON_BIN] + tokens
    return tokens


def _detect_exp_name(tokens, explicit=None):
    if explicit is not None:
        return explicit
    for idx, token in enumerate(tokens):
        if token == '--exp_name' and idx + 1 < len(tokens):
            return tokens[idx + 1]
    raise ValueError('Every command must include --exp_name to disambiguate log directories.')


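def list_logdirs(exp_name: str):
    pattern = f"{exp_name}_"
    if not DATA_ROOT.exists():
        return []
    matches = [
        path for path in DATA_ROOT.iterdir()
        if path.is_dir() and path.name.startswith(pattern)
    ]
    # A longer experiment name that shares this prefix (e.g. `<exp_name>_rtg`) also matches.
    # Assumption: run_hw2.py appends a fixed-width suffix, so those directories have longer
    # names and only the shortest matches belong to `exp_name` itself.
    if matches:
        shortest = min(len(path.name) for path in matches)
        matches = [path for path in matches if len(path.name) == shortest]
    return sorted(matches, key=lambda p: p.stat().st_mtime)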


def latest_logdir(exp_name: str):
    matches = list_logdirs(exp_name)
    return matches[-1] if matches else None


def run_pg_command(command: str, exp_name: str = None, force: bool = False):
    tokens = _normalize_command(command)
    exp_name = _detect_exp_name(tokens, explicit=exp_name)
    existing = latest_logdir(exp_name)
    if existing is not None and not force:
        print(f"Skipping {exp_name}: reusing cached logs at {existing.name}")
        RUN_REGISTRY.setdefault(exp_name, {})['logdir'] = str(existing)
        RUN_REGISTRY[exp_name]['command'] = command
        return existing
    print(f"Launching {exp_name}...")
    subprocess.run(tokens, cwd=REPO_ROOT, check=True)
    logdir = latest_logdir(exp_name)
    if logdir is None:
        raise FileNotFoundError(f"Failed to locate log directory for {exp_name}.")
    RUN_REGISTRY[exp_name] = {'command': command, 'logdir': str(logdir)}
    print(f"Finished {exp_name}, logs saved to {logdir.name}")
    return logdir


def run_experiment_batch(run_configs, force: bool = False):
    completed = []
    for cfg in run_configs:
        logdir = run_pg_command(cfg['command'], exp_name=cfg['exp_name'], force=force)
        cfg_record = dict(cfg)
        cfg_record['logdir'] = str(logdir)
        completed.append(cfg_record)
    return completed


def load_scalar_curve(exp_name: str, tag: str):
    logdir = latest_logdir(exp_name)
    if logdir is None:
        raise FileNotFoundError(f"No log directory found for {exp_name}")
    accumulator = EventAccumulator(str(logdir), size_guidance={'scalars': 0})
    accumulator.Reload()
    scalar_tags = accumulator.Tags().get('scalars', [])
    if tag not in scalar_tags:
        json_path = logdir / 'scalar_data.json'
        if json_path.exists():
            raw = json.loads(json_path.read_text())
            series = raw.get(tag, [])
            steps = np.array([entry['step'] for entry in series], dtype=np.int64)
            values = np.array([entry['value'] for entry in series], dtype=np.float64)
            return steps, values, logdir
        raise KeyError(f"Tag {tag} not logged for experiment {exp_name}")
    events = accumulator.Scalars(tag)
    steps = np.array([event.step for event in events], dtype=np.int64)
    values = np.array([event.value for event in events], dtype=np.float64)
    return steps, values, logdir


def compile_curves(label_exp_pairs, metric: str = 'Eval_AverageReturn'):
    frames = []
    for label, exp_name in label_exp_pairs:
        steps, values, logdir = load_scalar_curve(exp_name, metric)
        try:
            env_steps, env_vals, _ = load_scalar_curve(exp_name, 'Train_EnvstepsSoFar')
            env_map = dict(zip(env_steps.tolist(), env_vals.tolist()))
        except Exception:
            env_map = {}
        frame = pd.DataFrame({
            'iteration': steps,
            'value': values,
            'envsteps': [env_map.get(int(step), np.nan) for step in steps],
            'label': label,
            'exp_name': exp_name,
            'logdir': str(logdir),
        })
        frames.append(frame)
    if not frames:
        return pd.DataFrame(columns=['iteration', 'value', 'envsteps', 'label', 'exp_name', 'logdir'])
    return pd.concat(frames, ignore_index=True)


def plot_learning_curves(df, title: str, ylabel: str = 'Eval Average Return', xlabel: str = 'Training iteration', target: float = None):
    if df.empty:
        raise ValueError('No scalar data available to plot.')
    plt.figure(figsize=(8, 5))
    for label, group in df.groupby('label'):
        ordered = group.sort_values('iteration')
        plt.plot(ordered['iteration'], ordered['value'], label=label, linewidth=2)
    if target is not None:
        plt.axhline(target, linestyle='--', linewidth=1.5, color='tab:gray', label=f'Target = {target}')
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.legend()
    plt.grid(True, linestyle='--', linewidth=0.6, alpha=0.4)
    plt.tight_layout()
    plt.show()
    return df


def summarize_experiments(label_exp_pairs, metric: str = 'Eval_AverageReturn', threshold: float = None):
    rows = []
    for label, exp_name in label_exp_pairs:
        steps, values, logdir = load_scalar_curve(exp_name, metric)
        record = {
            'label': label,
            'exp_name': exp_name,
            'logdir': str(logdir),
            'final_iteration': int(steps[-1]),
            'final_return': float(values[-1]),
            'best_return': float(values.max()),
            'command': RUN_REGISTRY.get(exp_name, {}).get('command', '')
        }
        if threshold is not None:
            hits = np.where(values >= threshold)[0]
            record['iteration_reaching_threshold'] = int(steps[hits[0]]) if hits.size else None
        rows.append(record)
    return pd.DataFrame(rows)


def describe_threshold(summary_df, column='iteration_reaching_threshold'):
    valid = summary_df.dropna(subset=[column])
    if valid.empty:
        return 'None of the runs reached the requested threshold.'
    best_row = valid.sort_values(column).iloc[0]
    return f"Fastest run: {best_row['label']} reached the threshold at iteration {int(best_row[column])}."


## Experiment 1 – CartPole variance study (Section 5.1)
Run the six CartPole configurations exactly as listed in the PDF. The two plots compare (a) the three small-batch runs and (b) the three large-batch runs. The analysis cell then reports the final returns needed to answer the written questions about value estimators, advantage standardization, and batch size.
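
To make the three variants concrete, the sketch below computes both return estimators for one toy trajectory and then standardizes the result. It is a minimal illustration rather than the homework's implementation: the function names and the toy rewards are assumptions, and the flag mapping in the comments follows the run labels used in this notebook (`-rtg` for reward-to-go, `-dsa` for no standardization).

```python
import numpy as np


def full_trajectory_returns(rewards, gamma):
    """Without -rtg: every timestep is weighted by the same discounted sum from t=0."""
    rewards = np.asarray(rewards, dtype=np.float64)
    total = np.sum(gamma ** np.arange(len(rewards)) * rewards)
    return np.full(len(rewards), total)


def reward_to_go(rewards, gamma):
    """With -rtg: timestep t is weighted only by the discounted rewards from t onward."""
    rewards = np.asarray(rewards, dtype=np.float64)
    out = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def standardize(values, eps=1e-8):
    """Without -dsa: rescale to zero mean and unit standard deviation."""
    values = np.asarray(values, dtype=np.float64)
    return (values - values.mean()) / (values.std() + eps)


toy_rewards = [1.0, 1.0, 1.0]            # hypothetical three-step episode
traj = full_trajectory_returns(toy_rewards, gamma=0.5)   # 1.75 at every step
rtg = reward_to_go(toy_rewards, gamma=0.5)               # 1.75, 1.5, 1.0
rtg_std = standardize(rtg)
```

Reward-to-go drops rewards that were collected before the action was taken, which is why it is expected to reduce variance; standardization only rescales the weights within a batch.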


In [None]:
cartpole_runs = [
    {
        'label': 'b=1500, trajectory returns (no RTG, no std)',
        'group': 'small',
        'exp_name': 'q1_sb_no_rtg_dsa',
        'command': 'python rob831/scripts/run_hw2.py --env_name CartPole-v0 -n 150 -b 1500 -dsa --exp_name q1_sb_no_rtg_dsa'
    },
    {
        'label': 'b=1500, reward-to-go (no std)',
        'group': 'small',
        'exp_name': 'q1_sb_rtg_dsa',
        'command': 'python rob831/scripts/run_hw2.py --env_name CartPole-v0 -n 150 -b 1500 -rtg -dsa --exp_name q1_sb_rtg_dsa'
    },
    {
        'label': 'b=1500, reward-to-go + std advantages',
        'group': 'small',
        'exp_name': 'q1_sb_rtg_na',
        'command': 'python rob831/scripts/run_hw2.py --env_name CartPole-v0 -n 150 -b 1500 -rtg --exp_name q1_sb_rtg_na'
    },
    {
        'label': 'b=6000, trajectory returns (no RTG, no std)',
        'group': 'large',
        'exp_name': 'q1_lb_no_rtg_dsa',
        'command': 'python rob831/scripts/run_hw2.py --env_name CartPole-v0 -n 150 -b 6000 -dsa --exp_name q1_lb_no_rtg_dsa'
    },
    {
        'label': 'b=6000, reward-to-go (no std)',
        'group': 'large',
        'exp_name': 'q1_lb_rtg_dsa',
        'command': 'python rob831/scripts/run_hw2.py --env_name CartPole-v0 -n 150 -b 6000 -rtg -dsa --exp_name q1_lb_rtg_dsa'
    },
    {
        'label': 'b=6000, reward-to-go + std advantages',
        'group': 'large',
        'exp_name': 'q1_lb_rtg_na',
        'command': 'python rob831/scripts/run_hw2.py --env_name CartPole-v0 -n 150 -b 6000 -rtg --exp_name q1_lb_rtg_na'
    },
]

cartpole_runs = run_experiment_batch(cartpole_runs)


In [None]:
small_entries = [(cfg['label'], cfg['exp_name']) for cfg in cartpole_runs if cfg['group'] == 'small']
large_entries = [(cfg['label'], cfg['exp_name']) for cfg in cartpole_runs if cfg['group'] == 'large']

cartpole_small_df = compile_curves(small_entries)
plot_learning_curves(cartpole_small_df, 'CartPole-v0 (batch size 1500)', target=200)

cartpole_large_df = compile_curves(large_entries)
plot_learning_curves(cartpole_large_df, 'CartPole-v0 (batch size 6000)', target=200)


In [None]:
cartpole_small_summary = summarize_experiments(small_entries)
cartpole_large_summary = summarize_experiments(large_entries)

display(Markdown('**Small batch summary (b = 1500)**'))
display(cartpole_small_summary)

display(Markdown('**Large batch summary (b = 6000)**'))
display(cartpole_large_summary)

traj_small = cartpole_small_summary.loc[cartpole_small_summary['exp_name'] == 'q1_sb_no_rtg_dsa', 'final_return'].item()
rtg_small = cartpole_small_summary.loc[cartpole_small_summary['exp_name'] == 'q1_sb_rtg_dsa', 'final_return'].item()
std_small = cartpole_small_summary.loc[cartpole_small_summary['exp_name'] == 'q1_sb_rtg_na', 'final_return'].item()

traj_large = cartpole_large_summary.loc[cartpole_large_summary['exp_name'] == 'q1_lb_no_rtg_dsa', 'final_return'].item()
rtg_large = cartpole_large_summary.loc[cartpole_large_summary['exp_name'] == 'q1_lb_rtg_dsa', 'final_return'].item()
std_large = cartpole_large_summary.loc[cartpole_large_summary['exp_name'] == 'q1_lb_rtg_na', 'final_return'].item()

small_best = cartpole_small_summary['final_return'].max()
large_best = cartpole_large_summary['final_return'].max()

def _compare(new, old):
    # Pick the wording from the measured numbers instead of assuming the outcome.
    return 'higher than' if new > old else 'lower than' if new < old else 'equal to'

analysis_lines = [
    f"- **Value estimator:** Without standardization (b=1500), reward-to-go finishes at {rtg_small:.1f}, {_compare(rtg_small, traj_small)} the {traj_small:.1f} reached with trajectory returns.",
    f"- **Advantage standardization:** With reward-to-go, standardized advantages finish at {std_small:.1f} ({_compare(std_small, rtg_small)} the unstandardized {rtg_small:.1f}) at b=1500 and at {std_large:.1f} ({_compare(std_large, rtg_large)} {rtg_large:.1f}) at b=6000.",
    f"- **Batch size:** The best large-batch run ends at {large_best:.1f} average return, {_compare(large_best, small_best)} the {small_best:.1f} of the best small-batch run."
]

display(Markdown('### Written answers for Section 5.1\n' + '\n'.join(analysis_lines)))


## Experiment 2 – InvertedPendulum hyper-parameter search (Section 5.2)
We search for the smallest batch size `b*` and the largest learning rate `r*` that reach the optimal return of 1000 in fewer than 100 iterations. The cell below sweeps over the provided candidates until a qualifying configuration is found, then the subsequent cells plot the successful run and record the exact command.
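
The search order matters for what "smallest `b*`, largest `r*`" means here. The sketch below isolates that order with a stand-in success rule instead of real training runs. The helper name `ordered_search` and the rule `toy_rule` are hypothetical and exist only for this illustration.

```python
def ordered_search(batch_sizes, learning_rates, reaches_optimum):
    """Return the first (batch size, learning rate) pair that succeeds.

    Batch sizes are tried from smallest to largest; for each batch size,
    learning rates are tried from largest to smallest.
    """
    for batch_size in sorted(batch_sizes):
        for lr in sorted(learning_rates, reverse=True):
            if reaches_optimum(batch_size, lr):
                return batch_size, lr
    return None


def toy_rule(batch_size, lr):
    # Hypothetical stand-in for "the run reached a return of 1000".
    return batch_size >= 4000 and lr <= 0.02


found = ordered_search([1000, 2000, 4000], [0.03, 0.02, 0.01], toy_rule)
print(found)  # (4000, 0.02)
```

Because the loop stops at the first success, it returns the smallest working batch size and then the largest learning rate that works for that batch size. It does not check whether a larger learning rate would work with a larger batch.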


In [None]:
invpend_batch_sizes = [1000, 2000, 4000, 8000, 16000]
invpend_learning_rates = [0.03, 0.02, 0.015, 0.01, 0.005]

invpend_attempts = []
best_invpend = None

for batch_size in invpend_batch_sizes:
    for lr in invpend_learning_rates:
        exp_name = f"q2_b{batch_size}_lr{lr}"
        label = f"b={batch_size}, lr={lr}"
        command = (
            f"python rob831/scripts/run_hw2.py --env_name InvertedPendulum-v4 "
            f"--ep_len 1000 --discount 0.92 -n 100 -l 2 -s 64 -b {batch_size} -lr {lr} "
            f"-rtg --exp_name {exp_name}"
        )
        run_pg_command(command, exp_name=exp_name)
        summary = summarize_experiments([(label, exp_name)], threshold=1000.0)
        record = summary.iloc[0].to_dict()
        record['batch_size'] = batch_size
        record['learning_rate'] = lr
        invpend_attempts.append(record)
        if record['best_return'] >= 1000 and best_invpend is None:
            best_invpend = record
            break
    if best_invpend is not None:
        break

invpend_attempts_df = pd.DataFrame(invpend_attempts)
display(Markdown('**All attempted configurations (ordered search)**'))
display(invpend_attempts_df)

if best_invpend is None:
    raise RuntimeError('Search did not find a configuration that reaches the optimal return. Expand the candidate sets and rerun this cell.')
else:
    display(Markdown(
        f"✅ Found b* = {best_invpend['batch_size']} and r* = {best_invpend['learning_rate']} reaching 1000 in "
        f"iteration {int(best_invpend.get('iteration_reaching_threshold', 0))}."
    ))


In [None]:
best_invpend_label = f"b={best_invpend['batch_size']}, lr={best_invpend['learning_rate']}"
best_invpend_exp = f"q2_b{best_invpend['batch_size']}_lr{best_invpend['learning_rate']}"

invpend_curve = compile_curves([(best_invpend_label, best_invpend_exp)])
plot_learning_curves(invpend_curve, 'InvertedPendulum-v4 best configuration', target=1000)

best_summary = summarize_experiments([(best_invpend_label, best_invpend_exp)], threshold=1000.0)
best_command = best_summary.loc[0, 'command']

summary_lines = [
    f"- Command: `{best_command}`",
    f"- Iteration reaching 1000: {int(best_summary.loc[0, 'iteration_reaching_threshold'])}",
    f"- Final average return: {best_summary.loc[0, 'final_return']:.1f}",
]

display(Markdown('### Deliverables for Section 5.2\n' + '\n'.join(summary_lines)))


## Experiment 3 – LunarLander continuous control (Section 7.1)
This section validates the policy-gradient implementation with a neural-network baseline on a task of moderate difficulty. Make sure you have applied the PDF’s code edits to `lunar_lander.py` before running the command below.


In [None]:
lunar_run = [
    {
        'label': 'LunarLanderContinuous-v4 baseline',
        'exp_name': 'q3_b10000_r0.005',
        'command': 'python rob831/scripts/run_hw2.py --env_name LunarLanderContinuous-v4 --ep_len 1000 --discount 0.99 -n 100 -l 2 -s 64 -b 10000 -lr 0.005 --reward_to_go --nn_baseline --exp_name q3_b10000_r0.005'
    }
]

lunar_run = run_experiment_batch(lunar_run)


In [None]:
lunar_entries = [(cfg['label'], cfg['exp_name']) for cfg in lunar_run]
lunar_curve = compile_curves(lunar_entries)
plot_learning_curves(lunar_curve, 'LunarLanderContinuous-v4 policy gradient', target=120)

lunar_summary = summarize_experiments(lunar_entries)
display(Markdown('**Section 7.1 deliverable:**'))
display(lunar_summary)


## Experiment 4 – HalfCheetah hyper-parameter study (Section 7.2)
We first sweep over the prescribed batch sizes and learning rates using reward-to-go with a neural baseline to identify $b^*$ and $r^*$. Afterwards we run the four ablations (no baseline/no RTG, RTG only, baseline only, both) with those hyper-parameters, produce the requested plots, and summarize how $b$ and $r$ affect performance.
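
The four ablations differ only in how the per-timestep weight on the log-probability gradient is formed. As a rough sketch (the function name and the toy numbers are assumptions, not the repository's code), the baseline variant subtracts a predicted state value from the return estimate:

```python
import numpy as np


def advantages_with_baseline(q_values, baseline_values):
    """Advantage estimate: return estimate minus the baseline's value prediction."""
    q_values = np.asarray(q_values, dtype=np.float64)
    baseline_values = np.asarray(baseline_values, dtype=np.float64)
    return q_values - baseline_values


# Hypothetical numbers: q_values would be reward-to-go (or full-trajectory) returns,
# baseline_values the neural baseline's predictions for the same states.
q_values = [3.0, 2.0, 1.0]
baseline_values = [2.5, 2.0, 1.5]
adv = advantages_with_baseline(q_values, baseline_values)  # 0.5, 0.0, -0.5
```

Without `--nn_baseline` the weight is the return estimate itself; with it, actions are credited only for doing better or worse than the baseline predicted.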


In [None]:
cheetah_batch_sizes = [15000, 35000, 55000]
cheetah_learning_rates = [0.005, 0.01, 0.02]

cheetah_search_configs = []
for batch_size in cheetah_batch_sizes:
    for lr in cheetah_learning_rates:
        exp_name = f"q4_search_b{batch_size}_lr{lr}"
        label = f"b={batch_size}, lr={lr}"
        command = (
            f"python rob831/scripts/run_hw2.py --env_name HalfCheetah-v4 --ep_len 150 "
            f"--discount 0.95 -n 100 -l 2 -s 32 -b {batch_size} -lr {lr} -rtg --nn_baseline "
            f"--exp_name {exp_name}"
        )
        cheetah_search_configs.append({'label': label, 'exp_name': exp_name, 'command': command, 'batch_size': batch_size, 'learning_rate': lr})

cheetah_search_runs = run_experiment_batch(cheetah_search_configs)

search_entries = [(cfg['label'], cfg['exp_name']) for cfg in cheetah_search_runs]
cheetah_search_summary = summarize_experiments(search_entries)

cheetah_search_summary['batch_size'] = [cfg['batch_size'] for cfg in cheetah_search_runs]
cheetah_search_summary['learning_rate'] = [cfg['learning_rate'] for cfg in cheetah_search_runs]

display(Markdown('**Hyper-parameter sweep results**'))
display(cheetah_search_summary.sort_values('final_return', ascending=False))


In [None]:
cheetah_search_df = compile_curves(search_entries)
plot_learning_curves(cheetah_search_df, 'HalfCheetah-v4 search runs (RTG + baseline)', target=200)


In [None]:
cheetah_best_row = cheetah_search_summary.sort_values('final_return', ascending=False).iloc[0]
cheetah_b_star = int(cheetah_best_row['batch_size'])
cheetah_r_star = cheetah_best_row['learning_rate']
print(f"Chosen b* = {cheetah_b_star}, r* = {cheetah_r_star}")

cheetah_final_runs = [
    {
        'label': 'No RTG, no baseline',
        'exp_name': f'q4_b{cheetah_b_star}_r{cheetah_r_star}',
        'command': f'python rob831/scripts/run_hw2.py --env_name HalfCheetah-v4 --ep_len 150 --discount 0.95 -n 100 -l 2 -s 32 -b {cheetah_b_star} -lr {cheetah_r_star} --exp_name q4_b{cheetah_b_star}_r{cheetah_r_star}'
    },
    {
        'label': 'Reward-to-go only',
        'exp_name': f'q4_b{cheetah_b_star}_r{cheetah_r_star}_rtg',
        'command': f'python rob831/scripts/run_hw2.py --env_name HalfCheetah-v4 --ep_len 150 --discount 0.95 -n 100 -l 2 -s 32 -b {cheetah_b_star} -lr {cheetah_r_star} -rtg --exp_name q4_b{cheetah_b_star}_r{cheetah_r_star}_rtg'
    },
    {
        'label': 'NN baseline only',
        'exp_name': f'q4_b{cheetah_b_star}_r{cheetah_r_star}_nnbaseline',
        'command': f'python rob831/scripts/run_hw2.py --env_name HalfCheetah-v4 --ep_len 150 --discount 0.95 -n 100 -l 2 -s 32 -b {cheetah_b_star} -lr {cheetah_r_star} --nn_baseline --exp_name q4_b{cheetah_b_star}_r{cheetah_r_star}_nnbaseline'
    },
    {
        'label': 'Reward-to-go + baseline',
        'exp_name': f'q4_b{cheetah_b_star}_r{cheetah_r_star}_rtg_nnbaseline',
        'command': f'python rob831/scripts/run_hw2.py --env_name HalfCheetah-v4 --ep_len 150 --discount 0.95 -n 100 -l 2 -s 32 -b {cheetah_b_star} -lr {cheetah_r_star} -rtg --nn_baseline --exp_name q4_b{cheetah_b_star}_r{cheetah_r_star}_rtg_nnbaseline'
    }
]

cheetah_final_runs = run_experiment_batch(cheetah_final_runs)


In [None]:
cheetah_final_entries = [(cfg['label'], cfg['exp_name']) for cfg in cheetah_final_runs]
cheetah_final_df = compile_curves(cheetah_final_entries)
plot_learning_curves(cheetah_final_df, f'HalfCheetah-v4 ablations (b={cheetah_b_star}, lr={cheetah_r_star})', target=200)

cheetah_final_summary = summarize_experiments(cheetah_final_entries)
display(Markdown('**Section 7.2 ablation summary**'))
display(cheetah_final_summary)

batch_effects = cheetah_search_summary.groupby('batch_size')['final_return'].mean().sort_index()
lr_effects = cheetah_search_summary.groupby('learning_rate')['final_return'].mean().sort_index()

analysis_text = [
    '- **Batch size effect:** ' + ', '.join([f"b={int(bs)} → avg return {val:.1f}" for bs, val in batch_effects.items()]),
    '- **Learning rate effect:** ' + ', '.join([f"lr={lr} → avg return {val:.1f}" for lr, val in lr_effects.items()]),
    f"- **Best configuration:** b* = {cheetah_b_star}, r* = {cheetah_r_star} with final return {cheetah_best_row['final_return']:.1f}."
]

display(Markdown('### Written discussion for Section 7.2\n' + '\n'.join(analysis_text)))


## Experiment 5 – Hopper with generalized advantage estimation (Section 8)
Run the noisy Hopper environment with reward-to-go, a neural baseline, and $\lambda \in \{0, 0.95, 0.99, 1.0\}$. The plots compare the four curves and the analysis cell highlights how $\lambda$ influences learning.
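
For reference, here is a minimal sketch of the generalized advantage estimator for a single trajectory. The function name, the argument layout (one extra bootstrap value at the end of `values`), and the toy numbers are assumptions for illustration and may differ from the homework code.

```python
import numpy as np


def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimation for one trajectory.

    `values` holds len(rewards) + 1 entries; the last one is the value of the
    state after the final step (0.0 if the episode terminated there).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


toy_rewards = [1.0, 1.0, 1.0]
toy_values = [0.5, 0.5, 0.5, 0.0]   # hypothetical baseline predictions
adv_lam0 = gae_advantages(toy_rewards, toy_values, gamma=0.9, lam=0.0)  # 0.95, 0.95, 0.5
adv_lam1 = gae_advantages(toy_rewards, toy_values, gamma=0.9, lam=1.0)  # 2.21, 1.4, 0.5
```

With `lam=0` each advantage is the one-step TD error, so it depends heavily on the baseline. With `lam=1` it equals the discounted reward-to-go minus the baseline's prediction.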


In [None]:
hopper_lambdas = [0.0, 0.95, 0.99, 1.0]

hopper_runs = []
for lam in hopper_lambdas:
    lambda_tag = str(lam).replace('.', 'p')
    exp_name = f"q5_b2000_r0.001_lambda{lambda_tag}"
    label = f"lambda={lam}"
    command = (
        f"python rob831/scripts/run_hw2.py --env_name Hopper-v4 --ep_len 1000 --discount 0.99 -n 300 -l 2 -s 32 -b 2000 -lr 0.001 "
        f"--reward_to_go --nn_baseline --action_noise_std 0.5 --gae_lambda {lam} --exp_name {exp_name}"
    )
    hopper_runs.append({'label': label, 'exp_name': exp_name, 'command': command, 'lambda': lam})

hopper_runs = run_experiment_batch(hopper_runs)


In [None]:
hopper_entries = [(cfg['label'], cfg['exp_name']) for cfg in hopper_runs]
hopper_df = compile_curves(hopper_entries)
plot_learning_curves(hopper_df, 'Hopper-v4 with GAE sweep', target=400)

hopper_summary = summarize_experiments(hopper_entries)
hopper_summary['lambda'] = hopper_lambdas

display(Markdown('**Section 8 summary**'))
display(hopper_summary)

best_idx = hopper_summary['final_return'].idxmax()
best_lambda = hopper_summary.loc[best_idx, 'lambda']
best_return = hopper_summary.loc[best_idx, 'final_return']

discussion = [
    '- Detailed returns: ' + ', '.join([f"lambda={row['lambda']} → final {row['final_return']:.1f}" for _, row in hopper_summary.iterrows()]),
    f"- The best-performing setting is lambda={best_lambda} with final return {best_return:.1f}. Lower lambda leans more on the learned baseline (lower variance, more bias), while lambda=1 uses the full Monte Carlo return minus the baseline (no bootstrapping bias, higher variance).",
    f"- Command for best run: `{hopper_summary.loc[best_idx, 'command']}`"
]

display(Markdown('### Written discussion for Section 8\n' + '\n'.join(discussion)))


## Appendix – Full experiment execution order
1. **CartPole variance study** (`cartpole_runs` → plots → analysis).
2. **InvertedPendulum search** (search loop → best-run plot and deliverables).
3. **LunarLander baseline** (`lunar_run` → plot and summary).
4. **HalfCheetah sweep and ablations** (`cheetah_search_configs` → `cheetah_final_summary`).
5. **Hopper with GAE** (`hopper_runs` → `hopper_summary` and discussion).

Once these cells have been executed in sequence, the notebook contains every plot, table, and written answer requested in *hw2_new.pdf* and is ready for export.
