# MicroDreamerOptimized on Colab

1. **System deps**: apt-get build tools, Nsight Systems
2. **Clone & install**: clone the repo, then pip-install the Python dependencies and CUDA extensions
3. **Build & test**: compile the custom kernels, verify the imports
4. **Run profiling**: sample runs plus `nsys profile` calls



In [1]:
%%bash
# Install MicroDreamer & custom CUDA extensions.
# This cell takes approximately 7 minutes on a T4.
# Note: %%bash has to be the first line of the cell for the cell magic to apply.
set -e  # fail fast

# 1) Install system build-tools + wget/gnupg (for Nsight Systems)
apt-get update -qq
apt-get install -y --no-install-recommends \
    build-essential cmake python3-dev wget ca-certificates gnupg2

# 2) Clone MicroDreamerOptimized and install Python deps
git clone https://github.com/russelb22/MicroDreamerOptimized.git
cd MicroDreamerOptimized
pip install -q -r requirements.txt

# 3) Install diff-gaussian-rasterization
git clone --recursive https://github.com/ashawkey/diff-gaussian-rasterization
pip install --no-build-isolation -q ./diff-gaussian-rasterization

# 4) Install simple-knn (editable so you can tweak it in-place)
pip install --no-build-isolation -q -e ./simple-knn

# 5) Other Git-based Python deps
pip install -q git+https://github.com/NVlabs/nvdiffrast/
pip install -q git+https://github.com/ashawkey/kiuikit/
pip install -q git+https://github.com/bytedance/ImageDream/#subdirectory=extern/ImageDream

# 6) Build your own CUDA kernels in-place
python setup_cuda_kernels.py build_ext --inplace

# 7) Install Nsight Systems
wget -qO - https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/nvidia.pub \
  | apt-key add -
echo "deb https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/ /" \
  > /etc/apt/sources.list.d/nsight-systems.list
apt-get update -qq
apt-get install -y --no-install-recommends nsight-systems

echo "✅ Setup complete!"

Reading package lists...
Building dependency tree...
Reading state information...
build-essential is already the newest version (12.9ubuntu3).
ca-certificates is already the newest version (20240203~22.04.1).
cmake is already the newest version (3.22.1-1ubuntu1.22.04.2).
python3-dev is already the newest version (3.10.6-1~22.04.1).
python3-dev set to manually installed.
wget is already the newest version (1.21.2-2ubuntu1.1).
gnupg2 is already the newest version (2.2.27-3ubuntu2.3).
0 upgraded, 0 newly installed, 0 to remove and 36 not upgraded.

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Cloning into 'MicroDreamerOptimized'...
Cloning into 'diff-gaussian-rasterization'...
Submodule 'third_party/glm' (https://github.com/g-truc/glm.git) registered for path 'third_party/glm'
Cloning into '/content/MicroDreamerOptimized/diff-gaussian-rasterization/third_party/glm'...
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
Emitting ninja build file /content/MicroDreamerOptimized/build/temp.linux-x86_64-cpython-311/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
W: https://developer.downlo
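
The overview lists an import check under "Build & test", but the setup cell only builds and installs. The sketch below shows one way such a check might look. The helper name `check_setup` is hypothetical, and the module names are assumptions derived from the package names installed above; adjust them if the packages expose different top-level modules.

```python
import importlib.util
import shutil

# Assumed top-level module names for the packages installed above;
# they are not confirmed by this notebook.
MODULES = ["torch", "diff_gaussian_rasterization", "simple_knn", "nvdiffrast", "kiui"]
TOOLS = ["nsys"]

def check_setup(modules=MODULES, tools=TOOLS):
    """Return {name: bool} telling whether each module or CLI tool can be found."""
    found = {}
    for name in modules:
        # find_spec locates a top-level module without importing it
        found[name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        found[tool] = shutil.which(tool) is not None
    return found

if __name__ == "__main__":
    for name, ok in check_setup().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

This only confirms that the modules and the `nsys` binary are discoverable. It does not load the compiled CUDA extensions, so a real import of each module is still the stronger test.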

In [10]:
# 1. Verify that the notebook is running on a GPU.
# 2. Run main_profile.py once as a warm-up: the first run takes a long time while it populates GPU caches.
# 3. This cell takes approximately 6 minutes on a T4.

import torch
assert torch.cuda.is_available(), "No GPU detected. Select Runtime → Change runtime type → GPU"
print("CUDA version:", torch.version.cuda, "GPU:", torch.cuda.get_device_name(0))

%cd /content/MicroDreamerOptimized
!python main_profile.py --config=configs/image_sai.yaml --input=test_data/05_objaverse_backpack_rgba.png --save_path=05_objaverse_backpack_rgba --profiling.enabled=false

CUDA version: 12.4 GPU: Tesla T4


In [6]:
# Smoke test: run main_profile.py once under the profiler to confirm that all of its code paths execute properly.

import os

# Create the report directory; exist_ok avoids an error when the cell is re-run.
os.makedirs("/content/MicroDreamerOptimized/logdir/nsys", exist_ok=True)

os.environ["USE_CUDA_GAUSS"]    = "1"
os.environ["USE_CUDA_EXTRACT"]  = "0"

!nsys profile --trace=cuda,nvtx --sample=none --output=logdir/nsys/profile_test_code --force-overwrite=true python main_profile.py --config=configs/image_sai.yaml --input=test_data/05_objaverse_backpack_rgba.png --save_path=05_objaverse_backpack_rgba --profiling.enabled=true --profiling.mode=nvtx --profiling.scope=function

Collecting data...
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
  @torch.cuda.amp.custom_bwd
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  @custom_fwd
  @custom_bwd
  @torch.cuda.amp.autocast(enabled=False)
[DEBUG] USE_CUDA_GAUSS: 1
[DEBUG] USE_CUDA_EXTRACT: 0
[DEBUG] RUN_LABEL: None
Number of points at initialisation :  5000
100%|██████████| 21/21 [00:02<00:00,  7.22it/s]0 guassians have been cleaned
0 guassians have been cleaned
0 guassians have been cleaned
0 guassians have been cleaned
0 guassians have been cleaned
0 guassians have been cleaned
[DEBUG] calling save_model which should call gaussian_3D_coeff, which has been put onto CUDA
calling extract_mesh from within mode == geo+tex block in save_model
calling extract_fields from within extract_mesh
ENTER extract_fields
ENTER USE_CUDA_KERNEL ELSE block in extract_fields
RETURN FROM USE_CUDA_KERNEL ELSE block in extract_fields
[INFO] mesh cleaning: (72096, 3) --> (18063, 3), (144188, 3) --> (36

In [7]:
# This cell defines two helpers that parse the NVTX summary table printed by `nsys stats` for a .nsys-rep file.

import subprocess, re

def get_nvtx_avgs(nsys_rep_path):
    """Return a list of (range_name, avg_ms) tuples, with avg_ms kept as a string."""
    cmd = [
        "nsys", "stats",
        "--timeunit", "msec",
        "-r", "nvtx_sum",
        nsys_rep_path
    ]
    # run nsys stats and capture its output
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(f"nsys stats failed:\n{proc.stderr}")

    lines = proc.stdout.splitlines()
    # find the table separator (the dashed line)
    sep_idx = next(
        (i for i, L in enumerate(lines) if re.match(r"^[- ]{2,}-+", L)),
        None
    )
    if sep_idx is None:
        raise ValueError("Could not locate NVTX table in nsys output")

    results = []
    # each data row starts with a number (the Time (%) column)
    for row in lines[sep_idx+1:]:
        if not row.strip() or not re.match(r"^\s*\d", row):
            continue
        cols = row.split()
        # cols layout: [Time%, TotalTime, Instances, Avg, Med, Min, Max, StdDev, Style, RangeName]
        avg_ms     = cols[3].replace(",", "")  # 4th token, thousands separators stripped
        # last token; this assumes the range name contains no spaces
        range_name = cols[-1]
        results.append((range_name, avg_ms))

    return results

def get_nvtx_stats(nsys_rep_path):
    """Return one dict per NVTX range with its name, total ms, call count, and average ms."""
    cmd = [
        "nsys", "stats",
        "--timeunit", "msec",
        "-r", "nvtx_sum",
        nsys_rep_path
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(f"nsys stats failed:\n{proc.stderr}")

    lines = proc.stdout.splitlines()
    # find the table separator (the dashed line)
    sep_idx = next(
        (i for i, L in enumerate(lines) if re.match(r"^[- ]{2,}-+", L)),
        None
    )
    if sep_idx is None:
        raise ValueError("Could not locate NVTX table in nsys output")

    stats = []
    for row in lines[sep_idx+1:]:
        if not row.strip() or not re.match(r"^\s*\d", row):
            continue
        cols = row.split()
        # cols: [Time%, TotalTime, Instances, Avg, Med, Min, Max, StdDev, Style, RangeName]
        # strip commas before conversion:
        total_ms = float(cols[1].replace(",", ""))
        calls    = int(cols[2].replace(",", ""))
        avg_ms   = float(cols[3].replace(",", ""))
        range_name = cols[-1]
        stats.append({
            "range":     range_name,
            "total_ms":  total_ms,
            "calls":     calls,
            "avg_ms":    avg_ms
        })
    return stats

# example usage:
#nsys_rep = "/content/MicroDreamerOptimized/logdir/nsys/profile_20250623_021437.nsys-rep"
#for name, avg in get_nvtx_avgs(nsys_rep):
#    print(f"{name}: {avg} ms")
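
Both helpers above split the column-formatted table on whitespace, which would truncate any range name that contains a space. The `nsys stats` help text further down lists a `csv` output format, so a CSV-based parser is one possible alternative. The sketch below is hypothetical: the function name `parse_nvtx_csv` is made up, and the header names (`Time (%)`, `Total Time`, `Instances`, `Avg`, `Range`) are assumed to mirror the column layout noted in the comments above. Columns are matched by prefix so that a unit suffix in the header does not matter.

```python
import csv
import io

def _col(row, prefix):
    """Return the value of the first column whose header starts with prefix."""
    for key, value in row.items():
        if key and key.startswith(prefix):
            return value.replace(",", "")
    raise KeyError(f"no column starting with {prefix!r}")

def parse_nvtx_csv(text):
    """Parse CSV text (assumed to come from `nsys stats --format csv -r nvtx_sum`)."""
    lines = text.splitlines()
    # Skip any status lines printed before the CSV header row
    start = next(
        (i for i, line in enumerate(lines) if line.startswith("Time (%)")),
        None,
    )
    if start is None:
        raise ValueError("Could not locate the CSV header in nsys output")

    reader = csv.DictReader(io.StringIO("\n".join(lines[start:])))
    stats = []
    for row in reader:
        if not row.get("Range"):
            continue  # ignore incomplete rows
        stats.append({
            "range": row["Range"],
            "total": float(_col(row, "Total Time")),
            "calls": int(_col(row, "Instances")),
            "avg":   float(_col(row, "Avg")),
        })
    return stats
```

To use it, capture the output of `nsys stats` with `--format csv` added to the command list used in the helpers above and pass `proc.stdout` to `parse_nvtx_csv`. The values carry whatever time unit was requested with `--timeunit`.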

In [8]:
# This cell iterates over combinations of the USE_CUDA_* flags.
# For each combination it builds and runs an `nsys profile` command, then prints
# the `nsys stats` command for the NVTX range summary (nvtx_sum) of the resulting
# .nsys-rep file. Kernel stats are not covered.
# Each report filename records the GPU type and which CUDA extensions were enabled.

import os
import time
import torch

# Timestamp for all runs
ts = time.strftime("%Y%m%d_%H%M%S")
logdir = "logdir/nsys"
os.makedirs(logdir, exist_ok=True)

# Query and sanitize GPU name for use in filenames
raw_gpu = torch.cuda.get_device_name(0)
gpu_tag = raw_gpu.replace(" ", "_").replace("/", "_")

# This list will collect all .nsys-rep file paths we generate
report_files = []

# All relevant combinations of (USE_CUDA_GAUSS, USE_CUDA_EXTRACT)
for g, e in [(0, 0), (1, 0), (1, 1)]:
#for g, e in [(0, 1)]:
    out = f"{logdir}/profile_{gpu_tag}_g{g}_e{e}_{ts}"
    report = f"{out}.nsys-rep"
    report_files.append(report)

    # Export into this process’s env
    os.environ["USE_CUDA_GAUSS"]   = str(g)
    os.environ["USE_CUDA_EXTRACT"] = str(e)

    profile_cmd = (
        f"nsys profile "
        f"--trace=cuda,nvtx --sample=none "
        f"--output={out} --force-overwrite=true "
        f"python main_profile.py "
        f"--config=configs/image_sai.yaml "
        f"--input=test_data/05_objaverse_backpack_rgba.png "
        f"--save_path=05_objaverse_backpack_rgba "
        f"--profiling.enabled=true "
        f"--profiling.mode=nvtx "
        f"--profiling.scope=function"
    )
    stats_nvtx_cmd = f"nsys stats --timeunit msec -r nvtx_sum {report}"

    print(f">>> Running profile for GAUSS={g}, EXTRACT={e} on GPU '{raw_gpu}'")
    print("PROFILE CMD:", profile_cmd)
    ret = os.system(profile_cmd)
    if ret != 0:
        print(f"!! nsys profile failed for GAUSS={g}, EXTRACT={e}")
    else:
        print(f"✔ Generated NVTX Summary report: {report}")
        print("   ", stats_nvtx_cmd)
    print()

# Show all the reports generated
print("All generated .nsys-rep files:")
for rf in report_files:
    print(" -", rf)

>>> Running profile for GAUSS=0, EXTRACT=0 on GPU 'Tesla T4'
PROFILE CMD: nsys profile --trace=cuda,nvtx --sample=none --output=logdir/nsys/profile_Tesla_T4_g0_e0_20250624_160529 --force-overwrite=true python main_profile.py --config=configs/image_sai.yaml --input=test_data/05_objaverse_backpack_rgba.png --save_path=05_objaverse_backpack_rgba --profiling.enabled=true --profiling.mode=nvtx --profiling.scope=function
✔ Generated NVTX Summary report: logdir/nsys/profile_Tesla_T4_g0_e0_20250624_160529.nsys-rep
    nsys stats --timeunit msec -r nvtx_sum logdir/nsys/profile_Tesla_T4_g0_e0_20250624_160529.nsys-rep

>>> Running profile for GAUSS=1, EXTRACT=0 on GPU 'Tesla T4'
PROFILE CMD: nsys profile --trace=cuda,nvtx --sample=none --output=logdir/nsys/profile_Tesla_T4_g1_e0_20250624_160529 --force-overwrite=true python main_profile.py --config=configs/image_sai.yaml --input=test_data/05_objaverse_backpack_rgba.png --save_path=05_objaverse_backpack_rgba --profiling.enabled=true --profiling.mo
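
The loop above sets `USE_CUDA_GAUSS` and `USE_CUDA_EXTRACT` in the notebook's own environment and launches the profiler through `os.system`, so the flags from the last iteration stay set afterwards. A possible alternative, sketched below, passes a per-run copy of the environment to `subprocess.run`. The helper names `build_profile_run` and `run_profile` are hypothetical; the command-line flags are copied from the cell above.

```python
import os
import subprocess

def build_profile_run(out, gauss, extract, base_env=None):
    """Return (argv, env) for one profiled run; nothing is executed here."""
    argv = [
        "nsys", "profile",
        "--trace=cuda,nvtx", "--sample=none",
        f"--output={out}", "--force-overwrite=true",
        "python", "main_profile.py",
        "--config=configs/image_sai.yaml",
        "--input=test_data/05_objaverse_backpack_rgba.png",
        "--save_path=05_objaverse_backpack_rgba",
        "--profiling.enabled=true",
        "--profiling.mode=nvtx",
        "--profiling.scope=function",
    ]
    # Copy the environment so the flags apply to this run only
    env = dict(os.environ if base_env is None else base_env)
    env["USE_CUDA_GAUSS"] = str(gauss)
    env["USE_CUDA_EXTRACT"] = str(extract)
    return argv, env

def run_profile(out, gauss, extract):
    """Run one profile and return True when nsys exits with status 0."""
    argv, env = build_profile_run(out, gauss, extract)
    return subprocess.run(argv, env=env).returncode == 0
```

Passing the command as a list also avoids shell quoting issues if a path ever contains spaces.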

In [9]:
#print just averages in ms per range
#for rf in report_files:
#  for name, avg in get_nvtx_avgs(rf):
#    print(f" -", rf, f"{name}: {avg} ms")

# print averages, num calls, and total ms
for rf in report_files:
    stats = get_nvtx_stats(rf)
    print(f"Report: {rf}")
    for s in stats:
        print(f"  - {s['range']}: avg {s['avg_ms']:.4f} ms over {s['calls']} calls (total {s['total_ms']:.4f} ms)")
    print()

Report: logdir/nsys/profile_Tesla_T4_g0_e0_20250624_160529.nsys-rep
  - :OUTER_RANGE: avg 33993.1038 ms over 1 calls (total 33993.1038 ms)
  - :EXTRACT_FIELDS_CPU: avg 8401.9477 ms over 1 calls (total 8401.9477 ms)
  - :GAUSSIAN_3D_COEFF_CPU: avg 1.2858 ms over 3369 calls (total 4331.7521 ms)

Report: logdir/nsys/profile_Tesla_T4_g1_e0_20250624_160529.nsys-rep
  - :OUTER_RANGE: avg 30640.0699 ms over 1 calls (total 30640.0699 ms)
  - :EXTRACT_FIELDS_CPU: avg 5068.9264 ms over 1 calls (total 5068.9264 ms)
  - :GAUSSIAN_3D_COEFF_GPU: avg 0.1384 ms over 3368 calls (total 466.1144 ms)

Report: logdir/nsys/profile_Tesla_T4_g1_e1_20250624_160529.nsys-rep
  - :OUTER_RANGE: avg 26399.4672 ms over 1 calls (total 26399.4672 ms)
  - :EXTRACT_FIELDS_GPU: avg 46.8878 ms over 1 calls (total 46.8878 ms)
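
The three reports are easier to compare as speedup factors. The sketch below does the arithmetic on the numbers printed above; the dictionary keys and the `speedups` helper are illustrative names. The baseline is the g0_e0 run. The optimized `OUTER_RANGE` and `EXTRACT_FIELDS` values come from the g1_e1 run, and the optimized per-call `GAUSSIAN_3D_COEFF` average comes from the g1_e0 run, since the g1_e1 report does not list that range.

```python
# Times in ms, copied from the report output above (Tesla T4)
baseline = {
    "OUTER_RANGE": 33993.1038,           # g0_e0, whole run
    "EXTRACT_FIELDS": 8401.9477,         # g0_e0, EXTRACT_FIELDS_CPU
    "GAUSSIAN_3D_COEFF_avg": 1.2858,     # g0_e0, per-call average (CPU)
}
optimized = {
    "OUTER_RANGE": 26399.4672,           # g1_e1, whole run
    "EXTRACT_FIELDS": 46.8878,           # g1_e1, EXTRACT_FIELDS_GPU
    "GAUSSIAN_3D_COEFF_avg": 0.1384,     # g1_e0, per-call average (GPU)
}

def speedups(before, after):
    """Return before/after ratios for every range present in both dicts."""
    return {name: before[name] / after[name] for name in before if name in after}

for name, factor in speedups(baseline, optimized).items():
    print(f"{name}: {factor:.2f}x")
```

On these single-run figures the whole run is roughly 1.29x faster, `extract_fields` roughly 179x faster, and each `gaussian_3D_coeff` call roughly 9.3x faster. Each configuration was profiled once, so the ratios are indicative rather than statistically robust.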



In [None]:
!nsys stats --help


usage: nsys stats [<args>] <input-file>

<input-file> : Read data from a .nsys-rep or exported .sqlite file.

	-f, --format <name[:args...][,name[:args...]...]>

           Specify the output format. The special name "." indicates the
           default format for the given output.

           The default format for console is:    column
           The default format for files is:      csv
           The default format for processes is:  csv

           Available formats (and file extensions):

             column     Human readable columns (.txt)
             table      Human readable table (.txt)
             csv        Comma Separated Values (.csv)
             tsv        Tab Separated Values (.tsv)
             json       JavaScript Object Notation (.json)
             hdoc       HTML5 document with <table> (.html)
             htable     Raw HTML <table> (.html)

           This option may be used multiple times. Multiple formats may also
           be specified using a comma-sep