I'm trying to reproduce the training process of DeepCoder-1.5B, but I found that the base model at https://wandb.ai/mluo/deepcoder/runs/s3lpnxwa/overview is "checkpoints/deepscaler/deepscaler-code-32k-easy/global_step_320/actor/checkpoint", which already has a relatively high val score (~23 on test_lcb).
How can I improve from r1-distilled-qwen-1.5b (~16-17) to the performance of this checkpoint? Thanks for your advice~