Mixture of Experts (MoE) implementation for minGPT by Val Krigan
My contribution here is the MoE implementation plus the changes needed to hook it up. It supports training in two modes: soft MoE and sparse MoE; the soft mode is used to pretrain the routers.
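As a rough illustration of the two modes (this is only a sketch, not the code in this repo; names like `n_experts` and `top_k` are assumptions), a minimal MoE layer could look like this: in soft mode every expert contributes, weighted by the router's softmax, while in sparse mode only the top-k experts per token are kept.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Illustrative MoE block with a soft and a sparse forward path (not the repo's implementation)."""
    def __init__(self, n_embd, n_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(n_embd, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x, soft=True):
        # x: (B, T, C); the router produces a distribution over experts per token
        gates = F.softmax(self.router(x), dim=-1)              # (B, T, E)
        outs = torch.stack([e(x) for e in self.experts], -1)   # (B, T, C, E), all experts computed for clarity
        if soft:
            # soft mode: weight every expert by the router output,
            # fully differentiable, convenient for pretraining the router
            return (outs * gates.unsqueeze(-2)).sum(-1)
        # sparse mode: keep only the top-k experts per token and renormalize their weights
        topv, topi = gates.topk(self.top_k, dim=-1)            # (B, T, k)
        topv = topv / topv.sum(-1, keepdim=True)
        mask = torch.zeros_like(gates).scatter(-1, topi, topv)
        return (outs * mask.unsqueeze(-2)).sum(-1)
```

(For clarity the sketch evaluates all experts even in sparse mode; a real sparse implementation would dispatch tokens only to their selected experts.)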
The rest of the code was borrowed from here: