
8x model ensemble with chain knowledge distillation#31

Merged
akshayvegesna merged 2 commits into qlabs-eng:main from not-nonymous:chain-distillation
Mar 8, 2026

Conversation

@not-nonymous
Contributor

As discussed in the previous PR, chain knowledge distillation lets us fit more models into the ensemble and is comparatively quick to train compared with the previous technique.

Final Val Loss: 3.126098
Time: 16h 1m

Model training progress:

  • Model 1 Val Loss: 3.308297
  • Model 2 Val Loss: 3.223070 (Ensemble loss: 3.205810)
  • Model 3 Val Loss: 3.210214 (Ensemble loss: 3.171368)
  • Model 4 Val Loss: 3.202388 (Ensemble loss: 3.153127)
  • Model 5 Val Loss: 3.202067 (Ensemble loss: 3.142256)
  • Model 6 Val Loss: 3.200601 (Ensemble loss: 3.135083)
  • Model 7 Val Loss: 3.199623 (Ensemble loss: 3.129861)
  • Model 8 Val Loss: 3.200401 (Ensemble loss: 3.126098)
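The per-model "Ensemble loss" figures above track the validation loss of the first k models combined. One common way to compute this is to average the members' predicted probabilities and take the negative log-likelihood; the PR doesn't say whether the script averages probabilities or logits, so the sketch below (all names hypothetical) assumes probability averaging:

```python
import torch
import torch.nn.functional as F

def ensemble_val_loss(models, tokens, targets):
    """Validation loss of the probability-averaged ensemble.

    Averaging probabilities (not logits) is one common convention;
    the actual training script's choice is not stated in the PR.
    """
    probs = None
    with torch.no_grad():
        for model in models:
            p = F.softmax(model(tokens), dim=-1)
            probs = p if probs is None else probs + p
    probs = probs / len(models)
    # NLL of the averaged distribution against the hard labels.
    return F.nll_loss(torch.log(probs), targets)
```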

Thanks!

Updated the number of models from 8 to 6 in the training script and added support for knowledge distillation with teacher models.
Updated the number of models from 6 to 8 for training. Changed the distillation method to use only the immediately preceding model as the teacher.
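The second commit's change, distilling each new model only from its immediately preceding sibling, could look roughly like the loss below. This is a sketch under assumed names; the actual script's loss weighting and distillation temperature are not given in the PR:

```python
import torch
import torch.nn.functional as F

def chain_distill_loss(student_logits, teacher_logits, targets,
                       alpha=0.5, temperature=1.0):
    """Blend hard-label cross-entropy with KL to the previous model
    in the chain. alpha and temperature are illustrative defaults,
    not values taken from this PR.
    """
    ce = F.cross_entropy(student_logits, targets)
    t = temperature
    # Standard distillation KL, scaled by t^2 to keep gradient
    # magnitudes comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    return alpha * ce + (1.0 - alpha) * kl
```

During training of model k, `teacher_logits` would come from a frozen forward pass of model k-1, which is what distinguishes the chain variant from a full-ensemble teacher.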
@akshayvegesna
Contributor

Again, great work -- this takes us from 7x to 8x data efficiency. Let me run some tests, but it looks good. I wonder how much more headroom there is in extending this. What promising directions do you see from here after this lands?

Perhaps one direction is to use the model from the tiny track, which takes just 15 minutes; then we could train 32-64 models in the same time and maybe improve on this.

@akshayvegesna
Contributor

Curious, any hypotheses on why this is better than the full teacher with all N-1 models, even at 6 models?

@not-nonymous
Contributor Author

I think an interesting direction would be to train different models with different architectures/hyperparameters/training algorithms. That gives more diversity to the pool of models than training every model with just different random seeds.

Regarding why this is better than all N-1 models as the teacher: I'm not even sure it actually is better. Currently the difference is hard to distinguish from noise versus a meaningful signal, but I think with more experimentation on this we can come up with some sort of theory.
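For reference, the "full teacher" baseline being compared here would build the teacher distribution from all previously trained models rather than just the latest one. A minimal sketch, assuming probability averaging over the earlier models (function and variable names are hypothetical):

```python
import torch
import torch.nn.functional as F

def full_ensemble_teacher_probs(prev_models, tokens):
    """Teacher distribution from all N-1 earlier models (the
    baseline discussed above). The chain variant instead uses only
    prev_models[-1] as the teacher.
    """
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(m(tokens), dim=-1) for m in prev_models]
        ).mean(dim=0)
    return probs
```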

@akshayvegesna
Contributor

All makes sense, going to merge this using your numbers for the leaderboard. Thanks.

@akshayvegesna akshayvegesna merged commit 4eb2cce into qlabs-eng:main Mar 8, 2026
@akshayvegesna
Contributor

For future reference, this idea is similar to Born-Again Neural Networks: https://arxiv.org/pdf/1805.04770

