gpt-oss model enablement #1754
base: main
Conversation
torchtitan/models/attention.py (outdated)
        block_mask = FlexAttention.block_masks[self.mask_key]
        return FlexAttention.flex_attn(q, k, v, block_mask=block_mask, scale=scale)

    def _forward_with_sink(
LGTM.
I'm curious how expensive it is to always return lse. If it is actually no cost, we can merge the FlexAttention call into the original forward.
cc @drisspg
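For context, here is a minimal sketch of what a sink-aware forward built on the returned lse could look like. It is hand-written for illustration, not the PR's code: `forward_with_sink` and `sink_logits` are made-up names, while `return_lse=True` is the stock `flex_attention` flag.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def forward_with_sink(q, k, v, sink_logits, block_mask=None, scale=None):
    # Ask flex_attention for the log-sum-exp of the attention scores in
    # addition to the output so the softmax can be re-normalized afterwards.
    out, lse = flex_attention(
        q, k, v, block_mask=block_mask, scale=scale, return_lse=True
    )
    # sink_logits: one learnable logit per head, shape [num_heads] (assumed).
    # Adding exp(sink) to the softmax denominator is equivalent to scaling
    # the output by exp(lse) / (exp(lse) + exp(sink)) = sigmoid(lse - sink).
    gate = torch.sigmoid(lse - sink_logits.view(1, -1, 1)).unsqueeze(-1)
    return out * gate.to(out.dtype)
```

If returning lse turns out to be essentially free, the same call could back both the plain and the sink path, with the gate applied only when sink logits are present.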
Branch updated from 48b2a11 to 07c0ff4.
Need to rebase onto #1776.
Looks great in general. Left some comments. It may need some rebasing onto recent and near-future development.
# TODO(jianiw): This need to be merged with expert_parallel
def expert_parallel(func: Callable) -> Callable:
Sorry, I'll merge my refactor, and then please rebase.
Are you referring to #1569?
        if use_sliding_attention:
            self.attn = build_attention(
                use_flex_attn=True,
                attn_mask_type="sliding_window",
I think `sliding_window` should be orthogonal to causal vs. block-causal. Namely, with document masking, the sliding window should only attend within single documents. This should be much easier after #1776.
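As an illustration of that composition, here is a hedged sketch using the public flex_attention mask utilities; it is not the PR's implementation, and `doc_ids`, the sizes, and the window length are made up.

```python
import torch
from torch.nn.attention.flex_attention import and_masks, create_block_mask

B, S, WINDOW = 2, 1024, 128
device = "cuda" if torch.cuda.is_available() else "cpu"
# Hypothetical per-token document ids for a packed batch: two documents per sample.
doc_ids = torch.cat([
    torch.zeros(S // 2, dtype=torch.int32, device=device),
    torch.ones(S - S // 2, dtype=torch.int32, device=device),
]).expand(B, S)

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

def sliding_window(b, h, q_idx, kv_idx):
    return q_idx - kv_idx < WINDOW

def same_document(b, h, q_idx, kv_idx):
    return doc_ids[b, q_idx] == doc_ids[b, kv_idx]

# Intersect the predicates so the window never crosses a document boundary.
mask_mod = and_masks(causal, sliding_window, same_document)
block_mask = create_block_mask(mask_mod, B=B, H=None, Q_LEN=S, KV_LEN=S, device=device)
```

The resulting block_mask can then be passed to flex_attention just like the plain sliding-window mask.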
…ks but reduces mfu for 20b
Summary of current status: there are some prerequisite PRs:
Once these PRs land, I will refactor:
Keeps developing on top of #1559. Thanks @KhoomeiK for the initial contribution!
All runs are initialized from the same seed checkpoint, with seed=0 and deterministic=True (a sketch of this setup follows the run plots below).

Run 1: dp_shard = 2

Run 2: dp_shard = 2, TP degree = 2 (NGPU=4)

Run 3: dp_shard = 2, TP degree = 2, EP degree = 2 (NGPU=4)

Run 4: dp_shard = 2, TP degree = 2, EP degree = 2, ETP degree = 2 (NGPU=4)

Run 5: dp_shard = 2, EP degree = 2 (NGPU=2)
