Description
Is there an existing issue for this bug?
- I have searched the existing issues
The bug has not been fixed in the latest main branch
- I have checked the latest main branch
Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
Yes, I will share a minimal reproducible script.
🐛 Describe the bug
Following the README tutorial, I hit this bug while LoRA fine-tuning DeepSeek R1. It looks like a problem in the colossalai library itself.
```
Traceback (most recent call last):
[rank5]:   File "/home/conda/envs/colo/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 197, in _replace_sub_module
[rank5]:     replace_layer = target_module.from_native_module(
[rank5]:   File "/home/conda/envs/colo/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py", line 333, in from_native_module
[rank5]:     rmsnorm = FusedRMSNormWithHook(
[rank5]: TypeError: 'NoneType' object is not callable

[rank5]: During handling of the above exception, another exception occurred:

[rank5]: Traceback (most recent call last):
[rank5]:   File "/home/ds-r1/ColossalAI/applications/ColossalChat/examples/training_scripts/lora_finetune.py", line 464, in <module>
[rank5]:     train(args)
[rank5]:   File "/home/ds-r1/ColossalAI/applications/ColossalChat/examples/training_scripts/lora_finetune.py", line 261, in train
[rank5]:     model, optimizer, _, dataloader, lr_scheduler = booster.boost(
[rank5]:   File "/home/conda/envs/colo/lib/python3.10/site-packages/colossalai/booster/booster.py", line 154, in boost
[rank5]:     model, optimizer, criterion, dataloader, lr_scheduler = self.plugin.configure(
[rank5]:   File "/home/conda/envs/colo/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 457, in configure
[rank5]:     model = HybridParallelModule(
[rank5]:   File "/home/conda/envs/colo/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 87, in __init__
[rank5]:     module, self.shared_params = shardformer.optimize(module, policy=custom_policy)
[rank5]:   File "/home/conda/envs/colo/lib/python3.10/site-packages/colossalai/shardformer/shard/shardformer.py", line 55, in optimize
[rank5]:     shared_params = sharder.shard()
[rank5]:   File "/home/conda/envs/colo/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 43, in shard
[rank5]:     self._replace_module(include=held_layers)
[rank5]:   File "/home/conda/envs/colo/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 67, in _replace_module
[rank5]:     self._recursive_replace_layer(
[rank5]:   File "/home/conda/envs/colo/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
[rank5]:     self._recursive_replace_layer(
[rank5]:   File "/home/conda/envs/colo/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
[rank5]:     self._recursive_replace_layer(
[rank5]:   File "/home/conda/envs/colo/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
[rank5]:     self._recursive_replace_layer(
[rank5]:   [Previous line repeated 2 more times]
[rank5]:   File "/home/conda/envs/colo/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 112, in _recursive_replace_layer
[rank5]:     self._replace_sub_module(module, sub_module_replacement, include)
[rank5]:   File "/home/conda/envs/colo/lib/python3.10/site-packages/colossalai/shardformer/shard/sharder.py", line 201, in _replace_sub_module
[rank5]:     raise RuntimeError(
[rank5]: RuntimeError: Failed to replace input_layernorm of type DeepseekV3RMSNorm with FusedRMSNorm with the exception: 'NoneType' object is not callable. Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.
```
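The inner `TypeError: 'NoneType' object is not callable` is the pattern you get when a library binds a class name to `None` after an optional import fails (for ColossalAI's fused RMSNorm this would typically mean the fused-kernel dependency, e.g. apex, is missing or failed to build) and later tries to instantiate it. This is a minimal sketch of that failure mode, not the library's actual code; the module name in the import is hypothetical:

```python
# Sketch of the suspected failure mode: an optional fused-kernel import
# fails, the class name falls back to None, and instantiating it later
# raises TypeError: 'NoneType' object is not callable.
try:
    from hypothetical_fused_kernels import FusedRMSNormWithHook  # hypothetical module
except ImportError:
    FusedRMSNormWithHook = None  # fallback when the optional dependency is absent


def from_native_module(module):
    # Mirrors the call site in the traceback: the "class" is actually None here.
    return FusedRMSNormWithHook(module)


try:
    from_native_module(object())
except TypeError as exc:
    print(exc)  # 'NoneType' object is not callable
```

If this is the cause, checking whether the fused-kernel dependency imports cleanly in the same environment (e.g. `python -c "import apex"`) would confirm it.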
Environment
No response