On the Mixtral MoE implementation, I saw it mentioned that it is equivalent to a standard MoE with full capacity (no dropped tokens). I'm just wondering where the token-dropless logic is actually implemented?

Code reference: https://github.com/huggingface/transformers/blob/2c2495cc7b0e3e2942a9310f61548f40a2bc8425/src/transformers/models/mixtral/modeling_mixtral.py#L89C28-L90C20

CC @ArthurZucker if you have any insights. Thank you!
Hey! The doc is probably a bit outdated; I think it was copy-pasted from the original code.
But as you can see in the code, there is no capacity limit and no token dropping, so every token gets assigned to an expert.
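For anyone landing here later, here is a minimal sketch of that routing pattern in plain PyTorch (not the exact `modeling_mixtral.py` code; `gate_weight` and the `experts` list are placeholders for illustration): the router keeps the top-k experts per token and each expert processes every token routed to it, so there is no capacity factor and nothing is ever dropped.

```python
import torch
import torch.nn.functional as F


def route_tokens(hidden_states, gate_weight, top_k=2):
    # hidden_states: (num_tokens, hidden_dim); gate_weight: (num_experts, hidden_dim)
    router_logits = hidden_states @ gate_weight.t()   # (num_tokens, num_experts)
    routing_weights = F.softmax(router_logits, dim=-1)
    routing_weights, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
    routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)
    # No capacity factor anywhere: every token appears in `selected_experts`,
    # so no token is ever dropped.
    return routing_weights, selected_experts


def moe_forward(hidden_states, gate_weight, experts, top_k=2):
    routing_weights, selected_experts = route_tokens(hidden_states, gate_weight, top_k)
    out = torch.zeros_like(hidden_states)
    for expert_idx, expert in enumerate(experts):
        # Each expert processes *all* tokens routed to it, however many that is.
        token_idx, k_idx = torch.where(selected_experts == expert_idx)
        if token_idx.numel() == 0:
            continue
        expert_out = expert(hidden_states[token_idx])
        out[token_idx] += routing_weights[token_idx, k_idx, None] * expert_out
    return out


# Toy usage: 8 "experts" as plain linear layers, 4 tokens of width 16.
experts = [torch.nn.Linear(16, 16) for _ in range(8)]
gate_weight = torch.randn(8, 16)
tokens = torch.randn(4, 16)
print(moe_forward(tokens, gate_weight, experts).shape)  # torch.Size([4, 16])
```

If I'm reading the linked code right, the real block follows the same shape (softmax over router logits, top-k, renormalize, per-expert gather/scatter); the point is simply that there is no capacity check that could filter tokens out.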