On the Mixtral MoE implementation, I saw it mentioned that it is equivalent to a standard MoE with full capacity (no dropped tokens). I'm just wondering where the token-dropless logic is actually implemented?

Code reference: https://github.com/huggingface/transformers/blob/2c2495cc7b0e3e2942a9310f61548f40a2bc8425/src/transformers/models/mixtral/modeling_mixtral.py#L89C28-L90C20

CC @ArthurZucker if you have any insights. Thank you!
Hey! The doc is probably a bit outdated; I think it was copy-pasted from the original code.
But as you can see in the code, there is no capacity limit and no token dropping, so every token gets assigned to an expert.
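For anyone landing here later, here is a minimal sketch of that routing pattern in plain PyTorch (not the exact `modeling_mixtral.py` code; `gate_weight` and the `experts` list are placeholders for illustration): the router keeps the top-k experts per token and each expert processes every token routed to it, so there is no capacity factor and nothing is ever dropped.

```python
import torch
import torch.nn.functional as F


def route_tokens(hidden_states, gate_weight, top_k=2):
    # hidden_states: (num_tokens, hidden_dim); gate_weight: (num_experts, hidden_dim)
    router_logits = hidden_states @ gate_weight.t()   # (num_tokens, num_experts)
    routing_weights = F.softmax(router_logits, dim=-1)
    routing_weights, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
    routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)
    # No capacity factor anywhere: every token appears in `selected_experts`,
    # so no token is ever dropped.
    return routing_weights, selected_experts


def moe_forward(hidden_states, gate_weight, experts, top_k=2):
    routing_weights, selected_experts = route_tokens(hidden_states, gate_weight, top_k)
    out = torch.zeros_like(hidden_states)
    for expert_idx, expert in enumerate(experts):
        # Each expert processes *all* tokens routed to it, however many that is.
        token_idx, k_idx = torch.where(selected_experts == expert_idx)
        if token_idx.numel() == 0:
            continue
        expert_out = expert(hidden_states[token_idx])
        out[token_idx] += routing_weights[token_idx, k_idx, None] * expert_out
    return out


# Toy usage: 8 "experts" as plain linear layers, 4 tokens of width 16.
experts = [torch.nn.Linear(16, 16) for _ in range(8)]
gate_weight = torch.randn(8, 16)
tokens = torch.randn(4, 16)
print(moe_forward(tokens, gate_weight, experts).shape)  # torch.Size([4, 16])
```

If I'm reading the linked code right, the real block follows the same shape (softmax over router logits, top-k, renormalize, per-expert gather/scatter); the point is simply that there is no capacity check that could filter tokens out.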