Conversation


@wwwjn wwwjn commented Sep 24, 2025

Continuing development on top of #1559. Thanks @KhoomeiK for the initial contribution!

All runs were initialized from the same seed checkpoint, with seed=0 and deterministic=True.
[Screenshot: 2025-09-30 at 3:39:25 PM]

Run 1: dp_shard=2
[Screenshot: 2025-09-30 at 3:39:40 PM]

Run 2: dp_shard=2, TP degree=2 (NGPU=4)
[Screenshot: 2025-09-30 at 3:36:49 PM]

Run 3: dp_shard=2, TP degree=2, EP degree=2 (NGPU=4)
[Screenshot: 2025-09-30 at 3:35:33 PM]

Run 4: dp_shard=2, TP degree=2, EP degree=2, ETP degree=2 (NGPU=4)
[Screenshot: 2025-09-30 at 3:32:34 PM]

Run 5: dp_shard=2, EP degree=2 (NGPU=2)
[Screenshot: 2025-09-30 at 3:26:43 PM]
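
For reference, a minimal sketch of what seed=0 and deterministic=True correspond to in plain PyTorch; torchtitan wires these up through its job config, so the calls below are illustrative rather than the PR's code:

```python
import torch

# seed=0: seed the default RNG streams identically across runs.
torch.manual_seed(0)

# deterministic=True: force deterministic kernels and error out on ops
# that have no deterministic implementation.
torch.use_deterministic_algorithms(True)
```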

@meta-cla meta-cla bot added the CLA Signed label on Sep 24, 2025
        block_mask = FlexAttention.block_masks[self.mask_key]
        return FlexAttention.flex_attn(q, k, v, block_mask=block_mask, scale=scale)

    def _forward_with_sink(
Contributor Author

Want some early comments / suggestions @fegin @tianyu-l

Contributor

LGTM.

I'm curious how expensive it is to always return lse. If it actually costs nothing, we can merge the FlexAttention call into the original forward.

cc @drisspg
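
A hedged sketch of the lse question above, using the public torch.nn.attention.flex_attention API; the shapes and the sink formulation are illustrative assumptions, not the PR's exact code:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

q = k = v = torch.randn(1, 8, 128, 64, device="cuda")  # (B, H, S, D)

# return_lse=True makes the kernel also return the logsumexp of the attention
# scores, which a sink path needs in order to renormalize the output.
out, lse = flex_attention(q, k, v, return_lse=True)    # lse: (B, H, S)

# One common sink formulation (an assumption here, not necessarily the PR's):
# fold a learnable per-head sink logit into the softmax denominator via lse,
# i.e. scale the output by exp(lse) / (exp(lse) + exp(sink)).
sink = torch.zeros(8, device="cuda")                   # hypothetical sink logits
out = out * torch.sigmoid(lse - sink[None, :, None]).unsqueeze(-1)
```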

@wwwjn wwwjn force-pushed the gpt-oss branch 2 times, most recently from 48b2a11 to 07c0ff4 on September 30, 2025 04:34

wwwjn commented Sep 30, 2025

Need to rebase onto #1776

@wwwjn wwwjn marked this pull request as ready for review September 30, 2025 23:01
@wwwjn wwwjn changed the title from [WIP] gpt-oss model enablement to gpt-oss model enablement on Sep 30, 2025
@tianyu-l tianyu-l left a comment

Looks great in general. Left some comments. May need a rebase on recent & near-future development.



# TODO(jianiw): This needs to be merged with expert_parallel
def expert_parallel(func: Callable) -> Callable:
Contributor

Sorry, I'll merge my refactor first, and then please rebase.

Contributor Author

Are you referring to #1569?
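
For context, a hedged sketch of the wrapper pattern a decorator like expert_parallel typically implements, dispatching tokens across EP ranks around the expert computation; the equal-split all-to-all layout here is an assumption, not the refactored code:

```python
from typing import Callable
import torch
import torch.distributed as dist

def expert_parallel(func: Callable) -> Callable:
    def wrapper(x: torch.Tensor, *args, **kwargs) -> torch.Tensor:
        # Dispatch: send each token shard to the rank owning its experts.
        dispatched = torch.empty_like(x)
        dist.all_to_all_single(dispatched, x)
        # Local expert computation on the received tokens.
        out = func(dispatched, *args, **kwargs)
        # Combine: return expert outputs to the originating ranks.
        combined = torch.empty_like(out)
        dist.all_to_all_single(combined, out)
        return combined
    return wrapper
```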

        if use_sliding_attention:
            self.attn = build_attention(
                use_flex_attn=True,
                attn_mask_type="sliding_window",
Contributor

I think sliding_window should be orthogonal to causal vs. block-causal.
Namely, with document masking, the sliding window should only attend within a single document.

This should be much easier after #1776

Contributor Author

In my current setup, I follow the original design in #1559, which is simply sliding_window (equal to causal + sliding_window). The problem arises when document masking is applied. Let me refactor after #1776.
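
A hedged sketch of the combined mask being discussed, assuming FlexAttention's mask_mod interface and a hypothetical document_ids tensor; not the PR's code:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask

SEQ_LEN, WINDOW = 4096, 128                              # hypothetical sizes
document_ids = torch.zeros(SEQ_LEN, dtype=torch.int32)   # one doc id per position

def sliding_window_doc_mask(b, h, q_idx, kv_idx):
    causal = q_idx >= kv_idx                                # causal
    in_window = q_idx - kv_idx <= WINDOW                    # sliding window
    same_doc = document_ids[q_idx] == document_ids[kv_idx]  # document masking
    return causal & in_window & same_doc

block_mask = create_block_mask(
    sliding_window_doc_mask, B=None, H=None,
    Q_LEN=SEQ_LEN, KV_LEN=SEQ_LEN, device="cpu",
)
```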


wwwjn commented Oct 6, 2025

Summary of current status:

There are some prerequisite PRs:

  1. FlexAttention refactor: [RFC] Refactor attention and make attention mask an argument to the model #1776
  2. EP refactor: [EP] add initial support for NVSHMEM-based all-to-all #1569
  3. freqs_cis refactor, lifting freqs_cis to an input of model.forward(): [RFC] Lift freqs_cis as an input of models #1797

Once these PRs land, I will refactor:

  1. FlexAttention: add a sliding_window attention mask and make it orthogonal to the block_causal mask (see the sketch after this list).
  2. The ExpertParallel() and ExpertTensorParallel() classes: reuse as much as possible and keep aligned with the main EP implementation.
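
A hedged sketch of what the orthogonal composition in item 1 could look like, using FlexAttention's and_masks combinator; the actual post-#1776 API may differ:

```python
from torch.nn.attention.flex_attention import and_masks

WINDOW = 128  # hypothetical window size

def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

def sliding_window_mask(b, h, q_idx, kv_idx):
    return q_idx - kv_idx <= WINDOW

# sliding_window stays orthogonal: it composes with plain causal here, and
# could equally compose with a block-causal (document) mask_mod instead.
sliding_causal = and_masks(causal_mask, sliding_window_mask)
```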
