Qwen LoRA fine-tuning implementation code
The following is an overview of the LoRA implementation in this workspace, based on an examination of the codebase:

# Qwen LoRA Implementation Details

## 1. LoRA Configuration

The LoRA implementation in this codebase uses the PEFT (Parameter-Efficient Fine-Tuning) library. The core LoRA parameters are defined in `finetune.py`:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoraArguments:
    lora_r: int = 64                # Rank of the LoRA update matrices
    lora_alpha: int = 16            # Scaling factor for the LoRA update
    lora_dropout: float = 0.05      # Dropout probability for LoRA layers
    lora_target_modules: List[str] = field(
        default_factory=lambda: ["c_attn", "c_proj", "w1", "w2"]  # Target modules to apply LoRA
    )
    lora_weight_path: str = ""      # Path to pre-trained LoRA weights
    lora_bias: str = "none"         # Bias configuration
    q_lora: bool = False            # Whether to use Quantized LoRA (Q-LoRA)
```
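
To make the effect of `lora_r` and `lora_alpha` concrete, the sketch below computes the LoRA scaling factor (`lora_alpha / lora_r`, the standard LoRA convention that PEFT follows by default) and the number of trainable parameters one adapter adds to a linear layer. The helper names and the 4096-wide layer are illustrative assumptions, not values taken from the codebase.

```python
def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """Scale applied to the low-rank update B @ A (standard LoRA: alpha / r)."""
    return lora_alpha / lora_r


def lora_param_count(in_features: int, out_features: int, lora_r: int) -> int:
    """Parameters in one adapter: A is (r x in_features), B is (out_features x r)."""
    return lora_r * (in_features + out_features)


# Defaults from LoraArguments; the 4096 x 4096 layer shape is hypothetical.
scaling = lora_scaling(lora_alpha=16, lora_r=64)
adapter_params = lora_param_count(4096, 4096, lora_r=64)
full_params = 4096 * 4096
print(scaling)
print(adapter_params, full_params)
```

With these defaults the update is scaled by 0.25, and for a square layer of this size the adapter holds 1/32 of the parameters of the full weight matrix.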

## 2. LoRA Setup in Training

The key LoRA setup happens in the `train()` function in `finetune.py`:

```python
if training_args.use_lora:
    # Decide which modules to save completely (not just the LoRA adapters)
    if lora_args.q_lora or is_chat_model:
        modules_to_save = None
    else:
        modules_to_save = ["wte", "lm_head"]  # For base models, save embedding and output layers
    
    # Configure LoRA
    lora_config = LoraConfig(
        r=lora_args.lora_r,
        lora_alpha=lora_args.lora_alpha,
        target_modules=lora_args.lora_target_modules,
        lora_dropout=lora_args.lora_dropout,
        bias=lora_args.lora_bias,
        task_type="CAUSAL_LM",
        modules_to_save=modules_to_save  # Special handling for token embeddings and output layer
    )
    
    # For Q-LoRA, prepare the model for k-bit training
    if lora_args.q_lora:
        model = prepare_model_for_kbit_training(
            model, use_gradient_checkpointing=training_args.gradient_checkpointing
        )

    # Convert the model to a PEFT model with LoRA adapters
    model = get_peft_model(model, lora_config)

    # Display trainable parameter information
    model.print_trainable_parameters()

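    # With gradient checkpointing, make the embedding outputs require grad so that
    # gradients can flow back to the adapters through the frozen base model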
    if training_args.gradient_checkpointing:
        model.enable_input_require_grads()
```
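
Conceptually, each module wrapped by `get_peft_model` keeps its frozen weight `W` and adds a trainable low-rank update, so the output becomes `W x + (lora_alpha / lora_r) * B (A x)`. The following is a minimal pure-Python sketch of that computation on tiny hand-written matrices. It is not PEFT's implementation, it omits dropout, and `matvec` and `lora_forward` are hypothetical names.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector,
                 lora_alpha: float, lora_r: int) -> Vector:
    """Frozen path W @ x plus the scaled low-rank path B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_alpha / lora_r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
A = [[1.0, 1.0]]               # rank-1 down-projection (r x in)
B_zero = [[0.0], [0.0]]        # B starts at zero, so the adapter is a no-op
B_trained = [[2.0], [0.0]]     # made-up values standing in for a trained B
x = [3.0, 4.0]

out_initial = lora_forward(W, A, B_zero, x, lora_alpha=1, lora_r=1)
out_trained = lora_forward(W, A, B_trained, x, lora_alpha=1, lora_r=1)
print(out_initial)
print(out_trained)
```

Because `B` is initialized to zero, the wrapped model initially reproduces the base model's outputs exactly; only training moves it away from that starting point.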

## 3. Model Loading for LoRA

Before applying LoRA, the model is loaded with appropriate configurations:

```python
# Load model and tokenizer
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=training_args.cache_dir,
    device_map=device_map,
    trust_remote_code=True,
    quantization_config=GPTQConfig(
        bits=4, disable_exllama=True
    )
    if training_args.use_lora and lora_args.q_lora
    else None,  # Quantization config for Q-LoRA
    **model_load_kwargs,
)
```

## 4. Saving LoRA Weights

The code includes a special function to handle saving LoRA weights, particularly with DeepSpeed:

```python
def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str, bias="none"):
    """Collects the state dict and dump to disk."""
    # Check if zero3 mode enabled
    if deepspeed.is_deepspeed_zero3_enabled():
        state_dict = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
    else:
        if trainer.args.use_lora:
            # Special handling for LoRA weights
            state_dict = get_peft_state_maybe_zero_3(
                trainer.model.named_parameters(), bias
            )
        else:
            state_dict = trainer.model.state_dict()
    if trainer.args.should_save and trainer.args.local_rank == 0:
        trainer._save(output_dir, state_dict=state_dict)
```

The function `get_peft_state_maybe_zero_3` selects the LoRA parameters (and, depending on `bias`, the bias terms) and gathers any that are partitioned under DeepSpeed ZeRO-3:

```python
def get_peft_state_maybe_zero_3(named_params, bias):
    if bias == "none":
        to_return = {k: t for k, t in named_params if "lora_" in k}
    elif bias == "all":
        to_return = {k: t for k, t in named_params if "lora_" in k or "bias" in k}
    elif bias == "lora_only":
        # Keep the LoRA weights plus the biases of modules that carry an adapter
        to_return = {}
        maybe_lora_bias = {}
        lora_bias_names = set()
        for k, t in named_params:
            if "lora_" in k:
                to_return[k] = t
                bias_name = k.split("lora_")[0] + "bias"
                lora_bias_names.add(bias_name)
            elif "bias" in k:
                maybe_lora_bias[k] = t
        # Iterate over (name, tensor) pairs and keep only biases of adapted modules
        for k, t in maybe_lora_bias.items():
            if k in lora_bias_names:
                to_return[k] = t
    else:
        raise NotImplementedError
    
    # Gather parameters that are partitioned under DeepSpeed ZeRO-3
    to_return = {k: maybe_zero_3(v) for k, v in to_return.items()}
    return to_return
```
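
To see what each `bias` mode selects, here is a small stand-alone sketch that applies the same kind of name-based filtering to a hand-made parameter dictionary. It is a simplified stand-in under stated assumptions, not the codebase's function: `select_adapter_keys` is a hypothetical name, the parameter names are invented, and the values are strings rather than tensors, so no DeepSpeed handling is involved.

```python
from typing import Dict, List


def select_adapter_keys(params: Dict[str, str], bias: str) -> List[str]:
    """Return the parameter names that would be saved for a given bias mode."""
    if bias == "none":
        keep = [k for k in params if "lora_" in k]
    elif bias == "all":
        keep = [k for k in params if "lora_" in k or "bias" in k]
    elif bias == "lora_only":
        # Keep LoRA weights plus the bias of each module that has an adapter.
        lora_keys = [k for k in params if "lora_" in k]
        wanted_biases = {k.split("lora_")[0] + "bias" for k in lora_keys}
        keep = [k for k in params if k in lora_keys or k in wanted_biases]
    else:
        raise NotImplementedError(bias)
    return sorted(keep)


# Invented parameter names for illustration only.
params = {
    "h.0.attn.c_attn.weight": "frozen",
    "h.0.attn.c_attn.bias": "bias of an adapted module",
    "h.0.attn.c_attn.lora_A.default.weight": "adapter",
    "h.0.attn.c_attn.lora_B.default.weight": "adapter",
    "h.0.ln_1.bias": "bias of a module without an adapter",
}

none_keys = select_adapter_keys(params, "none")
all_keys = select_adapter_keys(params, "all")
lora_only_keys = select_adapter_keys(params, "lora_only")
print(none_keys)
print(all_keys)
print(lora_only_keys)
```

In this toy example, `"none"` keeps only the two adapter weights, `"all"` additionally keeps both biases, and `"lora_only"` keeps the `c_attn` bias but drops the `ln_1` bias because that module has no adapter.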

## 5. Running LoRA Fine-tuning

### Single GPU LoRA

The script `finetune/finetune_lora_single_gpu.sh` contains the configuration for running LoRA fine-tuning on a single GPU:

```bash
python finetune.py \
  --model_name_or_path $MODEL \
  --data_path $DATA \
  --bf16 True \
  --output_dir output_qwen \
  --num_train_epochs 5 \
  --per_device_train_batch_size 2 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 1000 \
  --save_total_limit 10 \
  --learning_rate 3e-4 \
  --weight_decay 0.1 \
  --adam_beta2 0.95 \
  --warmup_ratio 0.01 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --report_to "none" \
  --model_max_length 512 \
  --lazy_preprocess True \
  --gradient_checkpointing \
  --use_lora
```
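
The batch-related flags interact: with gradient accumulation, the number of samples contributing to each optimizer step is the per-device batch size times the accumulation steps times the number of devices. A quick check of the values in the script above (the helper name and the 4-GPU scenario are hypothetical):

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Samples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


# Values from the single-GPU script above: batch size 2, 8 accumulation steps.
single_gpu = effective_batch_size(per_device=2, grad_accum=8)

# The same flags on a hypothetical 4-GPU data-parallel run.
four_gpus = effective_batch_size(per_device=2, grad_accum=8, num_devices=4)
print(single_gpu, four_gpus)
```

Keeping this product constant when changing the number of GPUs is one way to keep the learning-rate setting comparable across runs.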

### Distributed LoRA Training

For multi-GPU training, the script `finetune/finetune_lora_ds.sh` uses DeepSpeed with ZeRO-2 optimization:

```bash
# The remaining hyperparameters are the same as in the single-GPU script and are elided here.
torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --bf16 True \
    --output_dir output_qwen \
    --use_lora \
    --gradient_checkpointing \
    --deepspeed ${DS_CONFIG_PATH}
```

## 6. Quantized LoRA (Q-LoRA)

The codebase also supports Q-LoRA, which uses a 4-bit quantized model with LoRA for more memory-efficient fine-tuning:

```python
# In finetune.py
quantization_config=GPTQConfig(
    bits=4, disable_exllama=True
)
if training_args.use_lora and lora_args.q_lora
else None,
```
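
The memory saving comes mainly from weight storage: a 4-bit weight takes half a byte, versus two bytes in bf16 or fp16. A rough back-of-the-envelope sketch, assuming a hypothetical 7-billion-parameter model and ignoring quantization metadata (scales and zero points), activations, adapter weights, and optimizer state:

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 7e9  # hypothetical model size, not measured from this codebase
half_precision = weight_memory_gib(num_params, 16)
four_bit = weight_memory_gib(num_params, 4)
print(round(half_precision, 2), round(four_bit, 2))
```

Under these assumptions the quantized weights need about a quarter of the memory of the 16-bit weights; real usage will be higher once the ignored terms are included.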

The Q-LoRA script `finetune/finetune_qlora_single_gpu.sh` uses:

```bash
# Q-LoRA uses fp16 instead of bf16 for the quantized model; --q_lora enables Q-LoRA.
# The remaining hyperparameters are the same as for regular LoRA and are elided here.
python finetune.py \
  --model_name_or_path $MODEL \
  --data_path $DATA \
  --fp16 True \
  --use_lora \
  --q_lora \
  --deepspeed finetune/ds_config_zero2.json
```

## 7. Special Handling for Base vs. Chat Models

The code has special handling for base language models vs. chat models:

```python
is_chat_model = 'chat' in model_args.model_name_or_path.lower()
if training_args.use_lora:
    if lora_args.q_lora or is_chat_model:
        modules_to_save = None
    else:
        modules_to_save = ["wte", "lm_head"]  # Save token embedding and output layers for base models
```

This is because base models need to learn the special tokens used in the chat format, so the embedding and output layers need to be trained as well.
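
The decision can be isolated into a small pure function, which makes the name-based heuristic easy to check. This is a sketch that mirrors the snippet above; `choose_modules_to_save` is a hypothetical helper and the model paths are examples only.

```python
from typing import List, Optional


def choose_modules_to_save(model_name_or_path: str, q_lora: bool) -> Optional[List[str]]:
    """Return modules to train fully alongside the LoRA adapters, if any."""
    is_chat_model = "chat" in model_name_or_path.lower()
    if q_lora or is_chat_model:
        return None
    return ["wte", "lm_head"]


base_choice = choose_modules_to_save("Qwen/Qwen-7B", q_lora=False)
chat_choice = choose_modules_to_save("Qwen/Qwen-7B-Chat", q_lora=False)
qlora_choice = choose_modules_to_save("Qwen/Qwen-7B", q_lora=True)
print(base_choice, chat_choice, qlora_choice)
```

Note that the chat/base distinction depends entirely on the substring `chat` appearing in the path, so a locally renamed checkpoint directory would be classified by its new name.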

## 8. DeepSpeed Configuration

LoRA fine-tuning uses ZeRO-2 optimization as defined in `finetune/ds_config_zero2.json`:

```json
"zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
        "device": "none",
        "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
}
```

This optimizes memory usage while allowing efficient distributed training of the LoRA adapters.
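
Since the excerpt above is a fragment of the full config, a quick way to sanity-check it is to wrap it in braces and parse it. The sketch below does that with Python's standard `json` module; the inlined string is a shortened copy of the excerpt, and the check itself is an illustrative assumption rather than anything the codebase performs.

```python
import json

# Shortened copy of the "zero_optimization" fragment shown above.
fragment = """
"zero_optimization": {
    "stage": 2,
    "offload_optimizer": {"device": "none", "pin_memory": true},
    "allgather_bucket_size": 2e8,
    "reduce_bucket_size": 2e8
}
"""

config = json.loads("{" + fragment + "}")
zero = config["zero_optimization"]
print(zero["stage"], zero["offload_optimizer"]["device"])
# Values written in exponent form such as 2e8 are parsed as floats by json.
print(type(zero["reduce_bucket_size"]).__name__)
```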

Measured in practice: Kimi-Audio-7B-Instruct with load_detokenizer=False
GPU memory usage: 23420 MB
ALM: 9766.33 M
Whisper: 636.97 M