
8x model ensemble with chain knowledge distillation#31

Merged
akshayvegesna merged 2 commits into qlabs-eng:main from not-nonymous:chain-distillation
Mar 8, 2026

Conversation

@not-nonymous
Contributor

As discussed in the previous PR, chain knowledge distillation lets us fit more models into the ensemble and is comparatively quick to train compared with the previous technique.

Final Val Loss: 3.126098
Time: 16h 1m

Model training progress:

  • Model 1 Val Loss: 3.308297
  • Model 2 Val Loss: 3.223070 (Ensemble loss: 3.205810)
  • Model 3 Val Loss: 3.210214 (Ensemble loss: 3.171368)
  • Model 4 Val Loss: 3.202388 (Ensemble loss: 3.153127)
  • Model 5 Val Loss: 3.202067 (Ensemble loss: 3.142256)
  • Model 6 Val Loss: 3.200601 (Ensemble loss: 3.135083)
  • Model 7 Val Loss: 3.199623 (Ensemble loss: 3.129861)
  • Model 8 Val Loss: 3.200401 (Ensemble loss: 3.126098)
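The per-model "Ensemble loss" figures above track the validation loss of the first k models combined. One common way to compute this is to average the members' predicted probabilities and take the negative log-likelihood; the PR doesn't say whether the script averages probabilities or logits, so the sketch below (all names hypothetical) assumes probability averaging:

```python
import torch
import torch.nn.functional as F

def ensemble_val_loss(models, tokens, targets):
    """Validation loss of the probability-averaged ensemble.

    Averaging probabilities (not logits) is one common convention;
    the actual training script's choice is not stated in the PR.
    """
    probs = None
    with torch.no_grad():
        for model in models:
            p = F.softmax(model(tokens), dim=-1)
            probs = p if probs is None else probs + p
    probs = probs / len(models)
    # NLL of the averaged distribution against the hard labels.
    return F.nll_loss(torch.log(probs), targets)
```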

Thanks!

Updated the number of models from 8 to 6 in the training script and added support for knowledge distillation with teacher models.
Updated the number of models from 6 to 8 for training. Changed the distillation method to use only the immediately preceding model as the teacher.
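The second commit's change, distilling each new model only from its immediately preceding sibling, could look roughly like the loss below. This is a sketch under assumed names; the actual script's loss weighting and distillation temperature are not given in the PR:

```python
import torch
import torch.nn.functional as F

def chain_distill_loss(student_logits, teacher_logits, targets,
                       alpha=0.5, temperature=1.0):
    """Blend hard-label cross-entropy with KL to the previous model
    in the chain. alpha and temperature are illustrative defaults,
    not values taken from this PR.
    """
    ce = F.cross_entropy(student_logits, targets)
    t = temperature
    # Standard distillation KL, scaled by t^2 to keep gradient
    # magnitudes comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    return alpha * ce + (1.0 - alpha) * kl
```

During training of model k, `teacher_logits` would come from a frozen forward pass of model k-1, which is what distinguishes the chain variant from a full-ensemble teacher.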
@akshayvegesna
Contributor

Again, great work -- this takes us from 7x to 8x data efficiency. Let me run some tests, but it looks good. I wonder how much more headroom there is in extending this. What promising directions do you see from here after this lands?

Perhaps one direction is to use the model from the tiny track, which takes just 15 minutes; then we could train 32-64 models in the same time and maybe improve on this.

@akshayvegesna
Contributor

Curious, any hypotheses on why this is better than the full teacher with all N-1 models, even at 6 models?

@not-nonymous
Contributor Author

I think an interesting direction would be to train different models with different architectures/hyperparameters/training algorithms. That gives more diversity to the pool of models than training every model with just different random seeds.

Regarding why this is better than all N-1 models as the teacher: I'm not even sure it actually is better. Currently the difference is hard to distinguish from noise versus a meaningful signal, but I think with more experimentation on this we can come up with some sort of theory.
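For reference, the "full teacher" baseline being compared here would build the teacher distribution from all previously trained models rather than just the latest one. A minimal sketch, assuming probability averaging over the earlier models (function and variable names are hypothetical):

```python
import torch
import torch.nn.functional as F

def full_ensemble_teacher_probs(prev_models, tokens):
    """Teacher distribution from all N-1 earlier models (the
    baseline discussed above). The chain variant instead uses only
    prev_models[-1] as the teacher.
    """
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(m(tokens), dim=-1) for m in prev_models]
        ).mean(dim=0)
    return probs
```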

@akshayvegesna
Contributor

All makes sense, going to merge this using your numbers for the leaderboard. Thanks.

@akshayvegesna akshayvegesna merged commit 4eb2cce into qlabs-eng:main Mar 8, 2026
@akshayvegesna
Contributor

For future reference, this idea is similar to Born-Again Neural Networks: https://arxiv.org/pdf/1805.04770

