8x model ensemble with chain knowledge distillation #31
akshayvegesna merged 2 commits into qlabs-eng:main
Conversation
Updated the number of models from 8 to 6 in the training script and added support for knowledge distillation with teacher models.
Updated the number of models from 6 to 8 for training. Changed the distillation method to use only the immediately preceding model as the teacher.
Again, great work -- this takes us from 7x to 8x data efficiency. Let me run some tests, but it looks good. I wonder how much further headroom there is just by extending this. What promising directions do you see from here after this lands? One direction might be to use the model from the tiny track, which takes just 15 minutes; then we could train 32-64 models in the same time and maybe improve on this.
Curious -- any hypotheses on why this is better than using the full teacher with all N-1 models, even at 6 models?
I think an interesting direction would be to train different models with different architectures/hyperparameters/training algorithms; that gives more diversity to the pool of models than training every model with just different random seeds. Regarding why this is better than using the full N-1 models as the teacher: I'm not even sure it actually is better. Right now the difference feels hard to distinguish from noise. But with more experimentation on this, we could come up with some sort of theory.
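For context on the baseline being debated here, a minimal sketch of the "full N-1 teacher" alternative: the teacher distribution is the average of the softened output distributions of all previously trained models, rather than just the most recent one. This is not the repo's actual training code; the function names and the temperature value are illustrative.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def full_ensemble_teacher(all_prev_logits, T=2.0):
    # Baseline discussed above: the teacher distribution is the mean of the
    # softened distributions of all N-1 previously trained models.
    dists = [softmax(logits, T) for logits in all_prev_logits]
    k = len(dists)
    return [sum(d[i] for d in dists) / k for i in range(len(dists[0]))]
```

The chain variant in this PR replaces `all_prev_logits` with just the single preceding model's logits, which is what makes it cheap to extend to more models.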
All makes sense, going to merge this using your numbers for the leaderboard. Thanks. |
For future reference, this idea is similar to Born-Again Neural Networks: https://arxiv.org/pdf/1805.04770 |
As discussed in the previous PR, chain knowledge distillation lets us fit more models in the ensemble and is comparatively quick to train compared with the previous technique.
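The chain setup described above can be sketched as follows. This is a hypothetical illustration, not the PR's training script: `train_one` stands in for a full training run, and the temperature-scaled KL distillation loss is the standard knowledge-distillation term, not necessarily the exact loss used here.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax.
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) at temperature T -- the usual distillation term.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def chain_distill(num_models, train_one):
    # Chain KD: model k distills only from model k-1, never the full ensemble.
    models = [train_one(teacher=None)]  # first model trained without a teacher
    for _ in range(1, num_models):
        models.append(train_one(teacher=models[-1]))
    return models
```

Because each student needs only one teacher's logits, the extra cost per added model is a single forward pass through the previous model, which is what makes fitting 8 models in the budget feasible.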
Final Val Loss: 3.126098
Time: 16h 1m
Model training progress: [training-curve plot]
Thanks!