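# PEFT Library LoRA in Practice - OpenAI Whisper-large-v2

This tutorial uses LoRA to fine-tune the `OpenAI Whisper-large-v2` model on an automatic speech recognition (ASR) task.

We also combine LoRA with `int8` quantization to further reduce the resource cost of training while leaving accuracy almost unaffected.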

## Global parameter settings

In [1]:
model_name_or_path = "openai/whisper-large-v2"
language = "Chinese (China)"
language_abbr = "zh-CN"
task = "transcribe"
dataset_name = "mozilla-foundation/common_voice_11_0"

batch_size=64

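## Download the Common Voice dataset

The Common Voice 11.0 dataset contains recordings in many different languages, adding up to many hours of audio.

This tutorial uses the Chinese data as an example to show how to fine-tune Whisper-large-v2 with LoRA.

First, initialize a `DatasetDict` and load the Chinese (`zh-CN`) configuration into memory. The train and validation splits are merged into the training set, and the test split is kept as the test set: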

In [2]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset(dataset_name, language_abbr, split="train+validation")
common_voice["test"] = load_dataset(dataset_name, language_abbr, split="test")
common_voice["train"][0]

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.




{'client_id': '95368aab163e0387e4fd4991b4f2d8ccfbd4364bf656c860230501fd27dcedf087773e4695a6cf5de9c4f1d406d582283190d065cdfa36b0e2b060cffaca977e',
 'path': '/home/yuxiang/.cache/huggingface/datasets/downloads/extracted/68855bbd878ec603e04e5baebb8217d4b60e177bbe23157096f27ede1a5cff9c/zh-CN_train_0/common_voice_zh-CN_33211332.mp3',
 'audio': {'path': '/home/yuxiang/.cache/huggingface/datasets/downloads/extracted/68855bbd878ec603e04e5baebb8217d4b60e177bbe23157096f27ede1a5cff9c/zh-CN_train_0/common_voice_zh-CN_33211332.mp3',
  'array': array([-6.82121026e-13, -2.27373675e-12, -2.27373675e-12, ...,
          1.21667399e-05,  3.23003678e-06, -2.43066324e-07]),
  'sampling_rate': 48000},
 'sentence': '性喜温暖润湿气候且耐寒。',
 'up_votes': 2,
 'down_votes': 0,
 'age': '',
 'gender': '',
 'accent': '',
 'locale': 'zh-CN',
 'segment': ''}

## Preprocess the training dataset


In [3]:
from transformers import AutoFeatureExtractor, AutoTokenizer, AutoProcessor

feature_extractor = AutoFeatureExtractor.from_pretrained(model_name_or_path)

tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path, language=language, task=task)

processor = AutoProcessor.from_pretrained(
    model_name_or_path, language=language, task=task)


#### Remove unnecessary fields from the dataset

In [4]:
common_voice = common_voice.remove_columns(
    ["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"]
)

In [5]:
common_voice["train"][0]

{'audio': {'path': '/home/yuxiang/.cache/huggingface/datasets/downloads/extracted/68855bbd878ec603e04e5baebb8217d4b60e177bbe23157096f27ede1a5cff9c/zh-CN_train_0/common_voice_zh-CN_33211332.mp3',
  'array': array([-6.82121026e-13, -2.27373675e-12, -2.27373675e-12, ...,
          1.21667399e-05,  3.23003678e-06, -2.43066324e-07]),
  'sampling_rate': 48000},
 'sentence': '性喜温暖润湿气候且耐寒。'}

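#### Downsample the audio data

The `common_voice` dataset card states that its audio is sampled at 48 kHz.

The `Whisper` model, however, was pre-trained on 16 kHz audio, so the input must be downsampled to match the sampling rate used during pre-training.

Calling `cast_column` on the audio column with `sampling_rate` set to 16 kHz takes care of this.

The cast itself is lazy: the audio is resampled on the fly the next time a sample is accessed: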

In [6]:
from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

In [7]:
common_voice["train"][0]

{'audio': {'path': '/home/yuxiang/.cache/huggingface/datasets/downloads/extracted/68855bbd878ec603e04e5baebb8217d4b60e177bbe23157096f27ede1a5cff9c/zh-CN_train_0/common_voice_zh-CN_33211332.mp3',
  'array': array([ 5.09317033e-11, -7.27595761e-12, -6.54836185e-11, ...,
         -5.96661994e-06,  2.71382887e-05,  1.29687978e-05]),
  'sampling_rate': 16000},
 'sentence': '性喜温暖润湿气候且耐寒。'}
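The effect of the sampling rate on the model input can be checked with a little arithmetic. The sketch below assumes the usual Whisper feature-extractor settings (30-second chunks, a hop of 160 samples, 80 Mel bins); these constants are not stated in this tutorial, and `resampled_length` is a hypothetical helper, not part of `datasets`.

```python
# Assumed Whisper feature-extractor settings (not stated in this tutorial).
SAMPLING_RATE = 16_000   # Hz, the rate Whisper was pre-trained on
CHUNK_SECONDS = 30       # every clip is padded or truncated to 30 s
HOP_LENGTH = 160         # samples between consecutive log-Mel frames
N_MELS = 80              # Mel bins per frame


def resampled_length(num_samples: int, src_rate: int, dst_rate: int) -> int:
    """Number of samples a clip has after resampling from src_rate to dst_rate."""
    return round(num_samples * dst_rate / src_rate)


# One second of 48 kHz audio becomes 16,000 samples at 16 kHz.
one_second_16k = resampled_length(48_000, 48_000, SAMPLING_RATE)

# A padded 30 s chunk yields a fixed-size log-Mel matrix.
samples_per_chunk = SAMPLING_RATE * CHUNK_SECONDS
frames_per_chunk = samples_per_chunk // HOP_LENGTH
input_feature_shape = (N_MELS, frames_per_chunk)

print(one_second_16k)
print(input_feature_shape)
```

Under these assumptions each example becomes an 80 x 3000 matrix regardless of the clip length, which is why the sampling rate must match the one the feature extractor expects.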

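### Combine the processing steps into one function

The preprocessing function should:
- Load the audio column, which resamples the audio to 16 kHz.
- Use the feature extractor to compute input features from the audio array.
- Tokenize the sentence column into label ids.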

In [8]:
def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

In [9]:
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=28)

Map:   0%|          | 0/39637 [00:00<?, ? examples/s]

Map:   0%|          | 0/10581 [00:00<?, ? examples/s]

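Create a data collator class that pads the input features and the label sequences of each batch to the longest length in that batch, and replaces the label padding with `-100` so that those positions are ignored by the loss function.

Then create an instance of the data collator: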

In [10]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch


data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
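The label handling inside the collator can be sketched without any tensors. This is a simplified illustration, not the collator itself: it pads directly with `-100` instead of padding with the tokenizer and then masking through the attention mask, and the token ids (including `bos_id=1`) are made up for the example.

```python
from typing import List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_and_mask_labels(label_seqs: List[List[int]], bos_id: int) -> List[List[int]]:
    """Pad label sequences to the longest one, then drop a shared leading BOS token."""
    max_len = max(len(seq) for seq in label_seqs)
    # Padding positions get IGNORE_INDEX so they do not contribute to the loss.
    padded = [seq + [IGNORE_INDEX] * (max_len - len(seq)) for seq in label_seqs]
    # If every sequence starts with BOS, cut it, because the model adds it again later.
    if all(seq[0] == bos_id for seq in padded):
        padded = [seq[1:] for seq in padded]
    return padded


# Hypothetical token ids, for illustration only.
batch_labels = pad_and_mask_labels([[1, 7, 8, 9], [1, 5]], bos_id=1)
print(batch_labels)  # [[7, 8, 9], [5, -100, -100]]
```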

## Train the model

In [11]:
from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name_or_path, load_in_8bit=True, device_map="auto")


In [12]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

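To prepare the int8-quantized model for training, pass it through `prepare_model_for_kbit_training` (called `prepare_model_for_int8_training` in older PEFT releases), which:
- Casts all non-int8 modules to full precision (fp32) for stability
- Adds a forward hook to the input embedding layer so that gradients are computed for the input hidden states
- Enables gradient checkpointing for more memory-efficient training

In [13]:
# Older PEFT releases expose this helper as prepare_model_for_int8_training
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)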
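To give a feel for what int8 quantization does to a weight, here is a minimal absmax sketch in plain Python. It is an assumption-laden simplification: the 8-bit scheme used when loading the model is more elaborate (for example, it keeps separate scales per vector and treats outliers specially), and `quantize_absmax` and `dequantize` are hypothetical helpers, not library functions.

```python
from typing import List, Tuple


def quantize_absmax(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to int8 codes in [-127, 127] using one absmax scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.02, 1.0]  # toy values, not real model weights
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error <= scale / 2 + 1e-12)
```

Each weight is stored in one byte instead of two or four, at the cost of a small rounding error, which is the trade-off that makes the frozen base model cheap to hold in memory during fine-tuning.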



In [14]:
from peft import LoraConfig, PeftModel, LoraModel, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none")
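What the `r` and `lora_alpha` settings mean for a single projection can be sketched in plain Python. The example assumes the standard LoRA formulation, where the frozen weight `W` is augmented with a low-rank update `B A` scaled by `lora_alpha / r`; dropout is left out, the matrices are toy values, and `lora_forward` is a hypothetical helper rather than a PEFT function.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, x: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: List[float],
                 r: int, lora_alpha: int) -> List[float]:
    """Frozen base projection plus the scaled low-rank update B(Ax)."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


# Toy example: hidden size 3, rank 2, same alpha/r ratio (8) as r=8, lora_alpha=64.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weight
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]                   # r x d, trainable
B_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]            # d x r, all zeros
B = [[0.5, 0.0], [0.0, 0.0], [0.0, 0.25]]                # d x r after some training
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_zero, x, r=2, lora_alpha=16))  # [1.0, 2.0, 3.0]
print(lora_forward(W, A, B, x, r=2, lora_alpha=16))       # [5.0, 2.0, 7.0]
```

With `B` set to zero the adapter leaves the base output unchanged, so training can start from the pre-trained behavior and only the small `A` and `B` matrices need gradients.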

In [15]:
model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 3,932,160 || all params: 1,547,237,120 || trainable%: 0.25414074863974306
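The trainable parameter count printed above can be reproduced by hand. The sketch assumes that Whisper-large-v2 has 32 encoder and 32 decoder layers with hidden size 1280, and that each decoder layer contains two attention blocks (self-attention and cross-attention); the layer counts are assumptions about the architecture, not values set in this notebook.

```python
D_MODEL = 1280            # hidden size of every q_proj / v_proj
ENCODER_LAYERS = 32       # assumed encoder depth (one attention block per layer)
DECODER_LAYERS = 32       # assumed decoder depth (self- and cross-attention per layer)
R = 8                     # LoRA rank from LoraConfig
TARGETS_PER_BLOCK = 2     # q_proj and v_proj

attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
# Each adapted Linear gets A (r x d_in) and B (d_out x r).
params_per_target = R * D_MODEL + D_MODEL * R
trainable_params = attention_blocks * TARGETS_PER_BLOCK * params_per_target

all_params = 1_547_237_120  # total reported by print_trainable_parameters()
trainable_percent = 100 * trainable_params / all_params

print(f"{trainable_params:,}")  # 3,932,160
print(trainable_percent)
```

Only about a quarter of a percent of the weights receive gradients, which is where most of the memory savings of LoRA come from.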


### For demonstration purposes only 100 steps are trained. To train a complete Chinese speech recognition model, switch to the default 3 epochs.

In [16]:
from transformers import Seq2SeqTrainingArguments

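# Training arguments for the sequence-to-sequence model
training_args = Seq2SeqTrainingArguments(
    output_dir="models/whisper-large-v2-asr-int8",  # directory for model outputs and checkpoints
    per_device_train_batch_size=batch_size,  # training batch size per device
    gradient_accumulation_steps=1,  # number of batches accumulated before each optimizer step
    learning_rate=1e-3,  # peak learning rate
    warmup_steps=50,  # steps over which the learning rate ramps up, which helps stabilize early training
    max_steps=100,  # total number of training steps
    # num_train_epochs=3,  # total number of training epochs
    # evaluation_strategy="epoch",  # evaluate at the end of every epoch
    fp16=True,  # mixed-precision training: faster and uses less memory
    per_device_eval_batch_size=batch_size,  # evaluation batch size per device
    predict_with_generate=True,  # call generate() during evaluation
    generation_max_length=225,  # maximum length of generated sequences
    logging_steps=25,  # log training metrics every 25 steps
    remove_unused_columns=False,  # keep all columns: the Trainer cannot infer the inputs from the PEFT wrapper's forward signature
    label_names=["labels"],  # name of the label field in each batch
)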
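How `learning_rate`, `warmup_steps`, and `max_steps` interact can be sketched as a small function. It assumes the Trainer's default linear schedule (linear warmup to the peak, then linear decay to zero); `linear_warmup_decay` is a hypothetical helper written for this illustration, not a `transformers` API.

```python
def linear_warmup_decay(step: int, peak_lr: float = 1e-3,
                        warmup_steps: int = 50, total_steps: int = 100) -> float:
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 25, 50, 75, 100):
    print(step, linear_warmup_decay(step))
```

With the values used here, half of the 100 demo steps are spent warming up, so the model only briefly trains near the peak learning rate of `1e-3`.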

#### Callback that saves the adapter at each checkpoint (recommended for long training runs)

In [17]:
import os
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
from transformers import Seq2SeqTrainer, TrainerCallback, Seq2SeqTrainingArguments, TrainerState, TrainerControl

class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: Seq2SeqTrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control

In [18]:
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False



In [19]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 25 | 2.2888 |
| 50 | 0.546 |
| 75 | 0.4073 |
| 100 | 0.3916 |


TrainOutput(global_step=100, training_loss=0.9084253120422363, metrics={'train_runtime': 2456.9063, 'train_samples_per_second': 2.605, 'train_steps_per_second': 0.041, 'total_flos': 1.362453331968e+19, 'train_loss': 0.9084253120422363, 'epoch': 0.16})

### Save the LoRA model

In [20]:
model.save_pretrained("models/whisper-large-v2-asr-int8")

### Use a Pipeline to load the LoRA model and run automatic speech recognition

In [21]:
test_audio = "data/audio/test_zh.flac"

In [22]:
from transformers import AutomaticSpeechRecognitionPipeline

pipeline = AutomaticSpeechRecognitionPipeline(model=model, tokenizer=tokenizer, feature_extractor=feature_extractor)

forced_decoder_ids = processor.get_decoder_prompt_ids(language="chinese", task=task)

The model 'PeftModel' is not supported for . Supported models are ['Pop2PianoForConditionalGeneration', 'SeamlessM4TForSpeechToText', 'SeamlessM4Tv2ForSpeechToText', 'SpeechEncoderDecoderModel', 'Speech2TextForConditionalGeneration', 'SpeechT5ForSpeechToText', 'WhisperForConditionalGeneration', 'Data2VecAudioForCTC', 'HubertForCTC', 'MCTCTForCTC', 'SEWForCTC', 'SEWDForCTC', 'UniSpeechForCTC', 'UniSpeechSatForCTC', 'Wav2Vec2ForCTC', 'Wav2Vec2ConformerForCTC', 'WavLMForCTC'].


In [23]:
with torch.cuda.amp.autocast():
    text = pipeline(test_audio, generate_kwargs={"forced_decoder_ids": forced_decoder_ids}, max_new_tokens=255)["text"]

`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


In [24]:
text

'这是一段测试用于WhisperLarge V2模型的自动语音识别测试。'

In [1]:
# Load the fine-tuned model

from transformers import AutoModelForSpeechSeq2Seq

model_path = "models/whisper-large-v2-asr-int8"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path, load_in_8bit=True, device_map="auto")

In [2]:
model_name_or_path = "openai/whisper-large-v2"
language = "Chinese (China)"
language_abbr = "zh-CN"
task = "transcribe"
dataset_name = "mozilla-foundation/common_voice_11_0"

# Load the test set
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()
# common_voice["test"] = load_dataset(dataset_name, language_abbr, split="test")
common_voice["test"] = load_dataset(dataset_name, language_abbr, split="test").select(range(1000))

# Prepare the feature extractor and tokenizer
from transformers import AutoFeatureExtractor, AutoTokenizer, AutoProcessor

feature_extractor = AutoFeatureExtractor.from_pretrained(model_name_or_path)

tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path, language=language, task=task)

processor = AutoProcessor.from_pretrained(
    model_name_or_path, language=language, task=task)

# Remove unnecessary fields from the dataset
common_voice = common_voice.remove_columns(
    ["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"]
)

from datasets import Audio

# Downsample the audio data
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

# Preprocessing function
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]
    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

# Run the preprocessing
common_voice = common_voice.map(prepare_dataset, num_proc=28)




Map (num_proc=28):   0%|          | 0/1000 [00:00<?, ? examples/s]

In [3]:
# Instantiate the data collator
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was added in the previous tokenization step,
        # cut it here as it is appended later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch


data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

## Evaluate the model

In [4]:
import evaluate

# Word error rate (WER) is a common metric for evaluating ASR models. Load the WER metric from Evaluate.
metric = evaluate.load("wer")
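For intuition about what the metric measures, WER is the number of word-level substitutions, deletions, and insertions divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the implementation behind `evaluate.load("wer")`; `edit_distance` and `word_error_rate` are hypothetical helper names.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum number of substitutions, deletions, and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    return errors / words


english = word_error_rate(["the cat sat on the mat"], ["the cat sit on mat"])
chinese = word_error_rate(["性喜温暖润湿气候且耐寒。"], ["性喜温暖湿润气候且耐寒。"])
print(english, chinese)
```

Note that this sketch splits on whitespace, so an unsegmented Chinese sentence counts as a single word and any mismatch makes the whole sentence an error. Keep this in mind when reading word-level error rates for Chinese text.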

In [5]:
from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np
import gc

eval_dataloader = DataLoader(common_voice["test"], batch_size=12, collate_fn=data_collator)

model.eval()

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 1280)
      (layers): ModuleList(
        (0-31): 32 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear8bitLt(in_features=1280, out_features=1280, bias=False)
            (v_proj): lora.Linear8bitLt(
              (base_layer): Linear8bitLt(in_features=1280, out_features=1280, bias=True)
              (lora_dropout): ModuleDict(
                (default): Dropout(p=0.05, inplace=False)
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=1280, out_features=8, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=8, out_features=1280, bias=False)
              )

In [6]:
for step, batch in enumerate(tqdm(eval_dataloader)):
    with torch.cuda.amp.autocast():
        with torch.no_grad():
            generated_tokens = (
                model.generate(
                    input_features=batch["input_features"].to("cuda"),
                    decoder_input_ids=batch["labels"][:, :4].to("cuda"),
                    max_new_tokens=255,
                )
                .cpu()
                .numpy()
            )
            labels = batch["labels"].cpu().numpy()
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
            metric.add_batch(
                predictions=decoded_preds,
                references=decoded_labels,
            )
    del generated_tokens, labels, batch
    gc.collect()

100%|██████████| 84/84 [26:12<00:00, 18.71s/it] 


In [7]:
wer = 100 * metric.compute()
print(f"{wer=}")

wer=66.3


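#### Homework 1: Add in-training evaluation to the Chinese training run and observe how the training loss and validation loss change
#### Homework 2: After LoRA training is complete, evaluate the model on the test set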

In [1]:
# Global parameter settings

model_id = "openai/whisper-large-v2"
language = "Chinese (China)"
language_abbr = "zh-CN"
task = "transcribe"
dataset_name = "mozilla-foundation/common_voice_11_0"

batch_size=80

In [2]:
# Load the dataset (train + validation as the training set, plus the test set)

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset(dataset_name, language_abbr, split="train+validation")
common_voice["test"] = load_dataset(dataset_name, language_abbr, split="test").select(range(1000))
common_voice["train"][0]



{'client_id': '95368aab163e0387e4fd4991b4f2d8ccfbd4364bf656c860230501fd27dcedf087773e4695a6cf5de9c4f1d406d582283190d065cdfa36b0e2b060cffaca977e',
 'path': '/home/yuxiang/.cache/huggingface/datasets/downloads/extracted/68855bbd878ec603e04e5baebb8217d4b60e177bbe23157096f27ede1a5cff9c/zh-CN_train_0/common_voice_zh-CN_33211332.mp3',
 'audio': {'path': '/home/yuxiang/.cache/huggingface/datasets/downloads/extracted/68855bbd878ec603e04e5baebb8217d4b60e177bbe23157096f27ede1a5cff9c/zh-CN_train_0/common_voice_zh-CN_33211332.mp3',
  'array': array([-6.82121026e-13, -2.27373675e-12, -2.27373675e-12, ...,
          1.21667399e-05,  3.23003678e-06, -2.43066324e-07]),
  'sampling_rate': 48000},
 'sentence': '性喜温暖润湿气候且耐寒。',
 'up_votes': 2,
 'down_votes': 0,
 'age': '',
 'gender': '',
 'accent': '',
 'locale': 'zh-CN',
 'segment': ''}

In [3]:
# Prepare the feature extractor and tokenizer

from transformers import AutoFeatureExtractor, AutoTokenizer, AutoProcessor

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id, language=language, task=task)

processor = AutoProcessor.from_pretrained(model_id, language=language, task=task)

In [4]:
# Remove unnecessary fields from the dataset
common_voice = common_voice.remove_columns(
    ["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"]
)

from datasets import Audio

# Downsample the audio data
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

In [5]:
common_voice["train"][0]

{'audio': {'path': '/home/yuxiang/.cache/huggingface/datasets/downloads/extracted/68855bbd878ec603e04e5baebb8217d4b60e177bbe23157096f27ede1a5cff9c/zh-CN_train_0/common_voice_zh-CN_33211332.mp3',
  'array': array([ 5.09317033e-11, -7.27595761e-12, -6.54836185e-11, ...,
         -5.96661994e-06,  2.71382887e-05,  1.29687978e-05]),
  'sampling_rate': 16000},
 'sentence': '性喜温暖润湿气候且耐寒。'}

In [6]:
# Preprocessing function
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]
    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

# Run the dataset preprocessing
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=28)

In [7]:
# Instantiate the data collator
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was added in the previous tokenization step,
        # cut it here as it is appended later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch


data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

In [8]:
# Load the model

from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, load_in_8bit=True, device_map="auto")

In [9]:
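# The original Whisper model forces several prefix token ids (forced_decoder_ids) before autoregressive generation starts. They mainly identify the language and the task in zero-shot ASR.
# Because we are fine-tuning on a known language (Chinese) and task (transcription), we set forced_decoder_ids to None.
# The model also suppresses some tokens (suppress_tokens): their log-probabilities are forced to -inf so that they are never sampled. We overwrite suppress_tokens with an empty list, i.e. no token is suppressed.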

model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
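The effect of token suppression can be shown on a toy logit vector. This is a minimal sketch of the idea (set the logit to `-inf`, then apply softmax), with made-up logits and a hypothetical helper name; it is not the generation code used by `transformers`.

```python
import math
from typing import List, Sequence


def suppress_and_softmax(logits: Sequence[float], suppress_ids: Sequence[int]) -> List[float]:
    """Set suppressed logits to -inf, then convert the logits to probabilities."""
    masked = [(-math.inf if i in suppress_ids else x) for i, x in enumerate(logits)]
    top = max(masked)                          # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for a 3-token vocabulary
with_suppression = suppress_and_softmax(toy_logits, suppress_ids=[0])
without_suppression = suppress_and_softmax(toy_logits, suppress_ids=[])

print(with_suppression)     # token 0 has probability 0.0
print(without_suppression)  # every token keeps a non-zero probability
```

Passing an empty list, as the cell above does for `suppress_tokens`, leaves every token available during fine-tuning.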

In [10]:
print(f"Memory footprint: {model.get_memory_footprint() / (1024 ** 2):.2f} MB")

Memory footprint: 1543.62 MB


In [11]:
# Prepare the int8-quantized model for training with `prepare_model_for_kbit_training`
# (older PEFT releases expose this helper as `prepare_model_for_int8_training`). The function:
# 1) casts all the non int8 modules to full precision (fp32) for stability
# 2) adds a forward hook to the input embedding layer to calculate the gradients of the input hidden states
# 3) enables gradient checkpointing for more memory-efficient training

from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)



In [12]:
print(f"Memory footprint: {model.get_memory_footprint() / (1024 ** 2):.2f} MB")

Memory footprint: 1687.24 MB


In [13]:
# Configure the LoRA parameters

from peft import LoraConfig, PeftModel, LoraModel, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none")

In [14]:
# Wrap the model with PEFT

model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 3,932,160 || all params: 1,547,237,120 || trainable%: 0.25414074863974306


In [15]:
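# Configure the training hyperparameters (TrainingArguments)

from transformers import Seq2SeqTrainingArguments

# Training arguments for the sequence-to-sequence model
training_args = Seq2SeqTrainingArguments(
    output_dir="models/whisper-large-v2-asr-int8",  # directory for model outputs and checkpoints
    per_device_train_batch_size=batch_size,  # training batch size per device
    gradient_accumulation_steps=1,  # number of batches accumulated before each optimizer step
    learning_rate=1e-3,  # peak learning rate
    warmup_steps=50,  # steps over which the learning rate ramps up, which helps stabilize early training
    # max_steps=5000,  # total number of training steps
    num_train_epochs=3,  # total number of training epochs
    evaluation_strategy="epoch",  # one of: no, steps, epoch; here we evaluate at the end of every epoch
    # eval_steps=500,
    fp16=True,  # mixed-precision training: faster and uses less memory
    per_device_eval_batch_size=12,  # evaluation batch size per device
    predict_with_generate=True,  # call generate() during evaluation
    generation_max_length=225,  # maximum length of generated sequences
    logging_steps=100,  # log training metrics every 100 steps
    remove_unused_columns=False,  # keep all columns: the Trainer cannot infer the inputs from the PEFT wrapper's forward signature
    label_names=["labels"],  # name of the label field in each batch
    dataloader_num_workers=24,  # number of subprocesses to use for data loading
)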

In [16]:
# Callback that saves the adapter at each checkpoint

import os
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
from transformers import Seq2SeqTrainer, TrainerCallback, Seq2SeqTrainingArguments, TrainerState, TrainerControl

class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: Seq2SeqTrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control

In [17]:
# Evaluate the model

import evaluate
import numpy as np

# Word error rate (WER) is a common metric for evaluating ASR models. Load the WER metric from Evaluate.
metric = evaluate.load("wer")

def compute_metrics(eval_pred):
    preds = eval_pred.predictions
    labels = eval_pred.label_ids
    
    # replace -100 with the pad_token_id
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # we do not want to group tokens when computing the metrics
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    wer = 100 * metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"wer": wer}

In [18]:
# Instantiate the Trainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
    compute_metrics=compute_metrics,
)
model.config.use_cache = False



In [19]:
# Train the model

trainer.train()



| Epoch | Training Loss | Validation Loss | WER |
|-------|---------------|-----------------|------|
| 1 | 0.3524 | 0.42602 | 67.2 |
| 2 | 0.2872 | 0.407857 | 65.8 |
| 3 | 0.2274 | 0.410588 | 65.5 |




TrainOutput(global_step=1488, training_loss=0.3278871031217678, metrics={'train_runtime': 27621.6572, 'train_samples_per_second': 4.305, 'train_steps_per_second': 0.054, 'total_flos': 2.531417002463232e+20, 'train_loss': 0.3278871031217678, 'epoch': 3.0})
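The step counts reported by the two training runs follow from the dataset size and the batch size. The sketch assumes a single device (so the per-device batch size is the global batch size), no gradient accumulation, and a final partial batch that is kept; the 39,637 training examples come from the `map` progress output earlier in this notebook.

```python
import math

TRAIN_EXAMPLES = 39_637  # size of the merged train + validation split


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


# Full run: 3 epochs with batch_size=80.
full_run_steps = 3 * steps_per_epoch(TRAIN_EXAMPLES, 80)

# Demo run: 100 steps with batch_size=64 cover only part of one epoch.
demo_run_epochs = 100 * 64 / TRAIN_EXAMPLES

print(full_run_steps)              # 1488
print(round(demo_run_epochs, 2))   # 0.16
```

Both numbers agree with the `global_step=1488` and `'epoch': 0.16` values in the two `TrainOutput` results, so the 100-step demo sees roughly one sixth of the training data once.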

In [20]:
# Save the LoRA model

model.save_pretrained("models/whisper-large-v2-asr-int8")

In [21]:
# Evaluate the final model accuracy (method 1)

trainer.evaluate(common_voice["test"])

{'eval_loss': 0.4105878472328186,
 'eval_wer': 65.5,
 'eval_runtime': 1177.3944,
 'eval_samples_per_second': 0.849,
 'eval_steps_per_second': 0.071,
 'epoch': 3.0}

In [22]:
# Evaluate the final model accuracy (method 2)

import evaluate

# Word error rate (WER) is a common metric for evaluating ASR models. Load the WER metric from Evaluate.
metric = evaluate.load("wer")

In [23]:
from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np
import gc

eval_dataloader = DataLoader(common_voice["test"], batch_size=12, collate_fn=data_collator)

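# model.train()
# When building a neural network in PyTorch, model.train() is called before training. It switches Batch Normalization (BN) and Dropout layers to training behavior.
# If the model contains BN or Dropout layers, call model.train() before training.
# In training mode, BN normalizes with the mean and variance of each batch, and Dropout randomly drops a fraction of the connections while the parameters are updated.

# model.eval()
# model.eval() switches BN and Dropout layers to inference behavior.
# If the model contains BN or Dropout layers, call model.eval() before testing.
# In evaluation mode, BN uses the running mean and variance estimated over the training data, so its statistics stay fixed, and Dropout keeps every connection instead of randomly dropping units.

# Why call model.eval() for testing?
# After training, the model is applied to test samples. Call model.eval() before model(test); otherwise a forward pass still updates the BN running statistics and applies random Dropout, even though no training step is taken.
# In evaluation mode PyTorch holds BN and Dropout fixed: BN uses the learned statistics instead of per-batch averages.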

model.eval()

PeftModel(
  (base_model): LoraModel(
    (model): WhisperForConditionalGeneration(
      (model): WhisperModel(
        (encoder): WhisperEncoder(
          (conv1): Conv1d(80, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
          (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
          (embed_positions): Embedding(1500, 1280)
          (layers): ModuleList(
            (0-31): 32 x WhisperEncoderLayer(
              (self_attn): WhisperSdpaAttention(
                (k_proj): Linear8bitLt(in_features=1280, out_features=1280, bias=False)
                (v_proj): lora.Linear8bitLt(
                  (base_layer): Linear8bitLt(in_features=1280, out_features=1280, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.05, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=1280, out_features=8, bias=False)
                  )
            

In [24]:
for step, batch in enumerate(tqdm(eval_dataloader)):
    with torch.cuda.amp.autocast():
        with torch.no_grad():
            generated_tokens = (
                model.generate(
                    input_features=batch["input_features"].to("cuda"),
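                    # Prompt the decoder with the first four label tokens, which are expected to be Whisper's
                    # special prefix (start of transcript, language, task, no timestamps).
                    decoder_input_ids=batch["labels"][:, :4].to("cuda"),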
                    max_new_tokens=255,
                )
                .cpu()
                .numpy()
            )
            labels = batch["labels"].cpu().numpy()
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
            metric.add_batch(
                predictions=decoded_preds,
                references=decoded_labels,
            )
    del generated_tokens, labels, batch
    gc.collect()

100%|██████████| 84/84 [17:52<00:00, 12.77s/it]


In [25]:
wer = 100 * metric.compute()
print(f"{wer=}")

wer=59.4


In [26]:
# Test the model
test_audio = "data/audio/test_zh.flac"

from transformers import AutomaticSpeechRecognitionPipeline

pipeline = AutomaticSpeechRecognitionPipeline(model=model, tokenizer=tokenizer, feature_extractor=feature_extractor)

forced_decoder_ids = processor.get_decoder_prompt_ids(language="chinese", task=task)

with torch.cuda.amp.autocast():
    text = pipeline(test_audio, generate_kwargs={"forced_decoder_ids": forced_decoder_ids}, max_new_tokens=255)["text"]

text

The model 'PeftModel' is not supported for . Supported models are ['Pop2PianoForConditionalGeneration', 'SeamlessM4TForSpeechToText', 'SeamlessM4Tv2ForSpeechToText', 'SpeechEncoderDecoderModel', 'Speech2TextForConditionalGeneration', 'SpeechT5ForSpeechToText', 'WhisperForConditionalGeneration', 'Data2VecAudioForCTC', 'HubertForCTC', 'MCTCTForCTC', 'SEWForCTC', 'SEWDForCTC', 'UniSpeechForCTC', 'UniSpeechSatForCTC', 'Wav2Vec2ForCTC', 'Wav2Vec2ConformerForCTC', 'WavLMForCTC'].


'这是一段测试用于Whisperler Large V2模型的自动语音识别测试。'

#### Reload the saved model and continue training

In [1]:
# Global parameters

# Path where the trained model (LoRA adapter) was saved
model_path = "models/whisper-large-v2-asr-int8"
# model_id of the original base model
model_id = "openai/whisper-large-v2"
language = "Chinese (China)"
language_abbr = "zh-CN"
task = "transcribe"
dataset_name = "mozilla-foundation/common_voice_11_0"

batch_size=80

In [2]:
# Load the dataset (train + validation as the training set, plus the test set)

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset(dataset_name, language_abbr, split="train+validation")
common_voice["test"] = load_dataset(dataset_name, language_abbr, split="test").select(range(1000))

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [3]:
# Prepare the feature extractor and tokenizer

from transformers import AutoFeatureExtractor, AutoTokenizer, AutoProcessor

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id, language=language, task=task)

processor = AutoProcessor.from_pretrained(model_id, language=language, task=task)

In [4]:
# Remove dataset columns that are not needed
common_voice = common_voice.remove_columns(
    ["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"]
)

from datasets import Audio

# Resample the audio to 16 kHz
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

In [5]:
# Preprocessing function
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]
    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

# Run the preprocessing over the whole dataset
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=28)

In [6]:
# Define and instantiate the data collator
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was added in the previous tokenization step,
        # cut it here because it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch


data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
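
The label handling in the collator can be illustrated without any dependencies. The sketch below uses plain Python lists instead of tensors and assumes a hypothetical pad token ID of 0; the helper names are also hypothetical. It pads label sequences to a common length, masks the padded positions with -100 (the index that PyTorch's cross-entropy loss ignores by default), and then restores the pad ID, which is the step the evaluation code performs with `np.where` before decoding.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss
PAD_ID = 0           # hypothetical pad token ID, for illustration only


def pad_and_mask(label_seqs, pad_id=PAD_ID):
    """Pad label sequences to equal length, then replace padding with IGNORE_INDEX."""
    max_len = max(len(seq) for seq in label_seqs)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in label_seqs]
    attention = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in label_seqs]
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(row, mask_row)]
        for row, mask_row in zip(padded, attention)
    ]


def restore_padding(labels, pad_id=PAD_ID):
    """Undo the masking so the sequences can be passed to a tokenizer's decode."""
    return [[pad_id if tok == IGNORE_INDEX else tok for tok in row] for row in labels]


masked = pad_and_mask([[5, 6, 7], [8]])
restored = restore_padding(masked)
print(masked)
print(restored)
```

Masking with the attention mask rather than by comparing against the pad ID keeps genuine label tokens intact even if one of them happens to equal the pad ID.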

In [7]:
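# Load the trained model

from transformers import AutoModelForSpeechSeq2Seq

# Load the base model in int8; the LoRA adapter saved under model_path is attached in a later cell
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, load_in_8bit=True, device_map="auto")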

In [8]:
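# The original Whisper model forces several prefix token IDs (forced_decoder_ids) before autoregressive generation starts. These token IDs mainly identify the language and the task in zero-shot ASR.
# Because we are fine-tuning for a known language (Chinese) and task (transcription), we set forced_decoder_ids to None.
# The model also suppresses some tokens (suppress_tokens): their log-probabilities are forced to -inf so that they are never sampled. We override suppress_tokens with an empty list, i.e. no tokens are suppressed.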

model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

In [9]:
print(f"Memory footprint: {model.get_memory_footprint() / (1024 ** 2):.2f} MB")

Memory footprint: 1558.62 MB


In [10]:
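# To prepare the int8-quantized model for training, pass it through `prepare_model_for_int8_training`.
# Note: later PEFT releases replace this helper with `prepare_model_for_kbit_training`; use whichever your installed version provides.
# prepare_model_for_int8_training() function: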
# 1) casts all the non int8 modules to full precision (fp32) for stability
# 2) adds a forward hook to the input embedding layer to calculate the gradients of the input hidden states
# 3) enables gradient checkpointing for more memory-efficient training

from peft import prepare_model_for_int8_training

model = prepare_model_for_int8_training(model)



In [11]:
print(f"Memory footprint: {model.get_memory_footprint() / (1024 ** 2):.2f} MB")

Memory footprint: 1702.24 MB


In [12]:
from peft import PeftModel

# Attach the previously fine-tuned PEFT (LoRA) adapter and keep it trainable
model = PeftModel.from_pretrained(model, model_path, is_trainable=True, device_map="auto")

model.train()

PeftModel(
  (base_model): LoraModel(
    (model): WhisperForConditionalGeneration(
      (model): WhisperModel(
        (encoder): WhisperEncoder(
          (conv1): Conv1d(80, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
          (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
          (embed_positions): Embedding(1500, 1280)
          (layers): ModuleList(
            (0-31): 32 x WhisperEncoderLayer(
              (self_attn): WhisperSdpaAttention(
                (k_proj): Linear8bitLt(in_features=1280, out_features=1280, bias=False)
                (v_proj): lora.Linear8bitLt(
                  (base_layer): Linear8bitLt(in_features=1280, out_features=1280, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.05, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=1280, out_features=8, bias=False)
                  )
            

In [13]:
model.print_trainable_parameters()

trainable params: 3,932,160 || all params: 1,547,237,120 || trainable%: 0.25414074863974306
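
The trainable-parameter count can be reproduced with a little arithmetic. This is a sketch under stated assumptions: the hidden size (1280), the rank (8), and the 32 encoder layers are read off the printed model above, while the 32 decoder layers (each with self-attention and cross-attention) and `q_proj`/`v_proj` being the LoRA targets in every attention block are assumptions, since the printout is truncated before showing them.

```python
hidden_size = 1280          # in/out features of the attention projections (printed model)
rank = 8                    # out_features of lora_A (printed model)
encoder_layers = 32         # printed model: "(0-31): 32 x WhisperEncoderLayer"
decoder_layers = 32         # assumption: decoder depth of Whisper-large
targets_per_attention = 2   # assumption: LoRA applied to q_proj and v_proj

# One attention block per encoder layer; self- plus cross-attention per decoder layer.
attention_blocks = encoder_layers * 1 + decoder_layers * 2

# Each adapted projection adds lora_A (rank x hidden) and lora_B (hidden x rank).
params_per_module = rank * hidden_size + hidden_size * rank

lora_params = attention_blocks * targets_per_attention * params_per_module

all_params = 1_547_237_120  # "all params" as reported above
trainable_pct = 100 * lora_params / all_params
print(f"{lora_params:,} trainable parameters ({trainable_pct:.4f}% of all parameters)")
```

Under these assumptions the result matches the reported 3,932,160 trainable parameters, which supports (but does not prove) the assumed configuration.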


In [14]:
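# Configure the training hyperparameters (TrainingArguments)

from transformers import Seq2SeqTrainingArguments

# Arguments for training a sequence-to-sequence model
training_args = Seq2SeqTrainingArguments(
    output_dir="models/whisper-large-v2-asr-int8",  # directory for checkpoints and outputs
    per_device_train_batch_size=batch_size,  # training batch size per device
    gradient_accumulation_steps=1,  # number of batches whose gradients are accumulated before each optimizer step
    learning_rate=1e-3,  # learning rate
    warmup_steps=50,  # number of initial steps over which the learning rate ramps up, which helps stabilize training
    # max_steps=5000, # total number of training steps
    num_train_epochs=3,  # total number of training epochs
    evaluation_strategy="epoch",  # when to evaluate: "no", "steps", or "epoch"; here, at the end of every epoch
    # eval_steps=500,
    fp16=True,  # mixed-precision training: faster and uses less GPU memory
    per_device_eval_batch_size=12,  # evaluation batch size per device
    predict_with_generate=True,  # call generate() during evaluation so that WER is computed on decoded text
    generation_max_length=225,  # maximum length of sequences generated during evaluation
    logging_steps=100,  # log training metrics every 100 steps
    remove_unused_columns=False,  # required because the PeftModel forward does not expose the wrapped model's forward signature
    label_names=["labels"],  # name of the label column, set explicitly for the same reason
    dataloader_num_workers=24,  # number of subprocesses used for data loading
)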

In [15]:
# Callback that saves only the PEFT adapter at each checkpoint

import os
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
from transformers import Seq2SeqTrainer, TrainerCallback, Seq2SeqTrainingArguments, TrainerState, TrainerControl

class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: Seq2SeqTrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control

In [16]:
# Evaluate the model

import evaluate
import numpy as np

# Word error rate (WER) is a common metric for evaluating ASR models. Load the WER metric from Evaluate
metric = evaluate.load("wer")

def compute_metrics(eval_pred):
    preds = eval_pred.predictions
    labels = eval_pred.label_ids
    
    # replace -100 with the pad_token_id
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # we do not want to group tokens when computing the metrics
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    wer = 100 * metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"wer": wer}

In [17]:
# Instantiate the trainer (Seq2SeqTrainer)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
    compute_metrics=compute_metrics,
)
model.config.use_cache = False

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [18]:
# Train the model

# To resume from the last checkpoint instead:
# trainer.train(resume_from_checkpoint=True)
trainer.train()



Epoch,Training Loss,Validation Loss,Wer
1,0.2628,0.437628,67.4
2,0.2041,0.43386,66.9
3,0.1505,0.4438,66.5




TrainOutput(global_step=1488, training_loss=0.20217879741422592, metrics={'train_runtime': 27822.1897, 'train_samples_per_second': 4.274, 'train_steps_per_second': 0.053, 'total_flos': 2.531417002463232e+20, 'train_loss': 0.20217879741422592, 'epoch': 3.0})
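
As a sanity check, the reported `global_step` can be related to the batch size and the number of epochs. The sketch below assumes training on a single device (so the effective batch size equals `batch_size`) and estimates the number of training samples from the reported throughput rather than from the dataset itself; both are assumptions, not values stated in the output.

```python
import math

# Values copied from the TrainOutput above and from the training arguments.
train_runtime = 27822.1897       # seconds
samples_per_second = 4.274
epochs = 3
batch_size = 80
gradient_accumulation_steps = 1

# Estimated samples per epoch (assumption: derived from throughput, not counted directly).
samples_per_epoch = round(train_runtime * samples_per_second / epochs)

# Optimizer steps per epoch; the last batch may be partial, hence the ceiling.
steps_per_epoch = math.ceil(samples_per_epoch / (batch_size * gradient_accumulation_steps))
total_steps = steps_per_epoch * epochs

print(f"~{samples_per_epoch} samples/epoch, {steps_per_epoch} steps/epoch, {total_steps} steps in total")
print(f"runtime: {train_runtime / 3600:.2f} hours")
```

The estimate reproduces the 1488 steps reported by the trainer, which is consistent with a batch size of 80 on one device.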

In [19]:
# Save the LoRA model

model.save_pretrained("models/whisper-large-v2-asr-int8")

In [20]:
# Evaluate the model's final accuracy (method 1)

trainer.evaluate(common_voice["test"])

{'eval_loss': 0.4437997043132782,
 'eval_wer': 66.5,
 'eval_runtime': 1254.1623,
 'eval_samples_per_second': 0.797,
 'eval_steps_per_second': 0.067,
 'epoch': 3.0}

In [21]:
# Evaluate the model's final accuracy (method 2)

import evaluate

# Word error rate (WER) is a common metric for evaluating ASR models. Load the WER metric from Evaluate
metric = evaluate.load("wer")

In [22]:
from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np
import gc

eval_dataloader = DataLoader(common_voice["test"], batch_size=12, collate_fn=data_collator)

model.eval()

PeftModel(
  (base_model): LoraModel(
    (model): WhisperForConditionalGeneration(
      (model): WhisperModel(
        (encoder): WhisperEncoder(
          (conv1): Conv1d(80, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
          (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
          (embed_positions): Embedding(1500, 1280)
          (layers): ModuleList(
            (0-31): 32 x WhisperEncoderLayer(
              (self_attn): WhisperSdpaAttention(
                (k_proj): Linear8bitLt(in_features=1280, out_features=1280, bias=False)
                (v_proj): lora.Linear8bitLt(
                  (base_layer): Linear8bitLt(in_features=1280, out_features=1280, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.05, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=1280, out_features=8, bias=False)
                  )
            

In [23]:
for step, batch in enumerate(tqdm(eval_dataloader)):
    with torch.cuda.amp.autocast():
        with torch.no_grad():
            generated_tokens = (
                model.generate(
                    input_features=batch["input_features"].to("cuda"),
                    decoder_input_ids=batch["labels"][:, :4].to("cuda"),
                    max_new_tokens=255,
                )
                .cpu()
                .numpy()
            )
            labels = batch["labels"].cpu().numpy()
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
            metric.add_batch(
                predictions=decoded_preds,
                references=decoded_labels,
            )
    del generated_tokens, labels, batch
    gc.collect()

100%|██████████| 84/84 [16:57<00:00, 12.11s/it]


In [24]:
wer = 100 * metric.compute()
print(f"{wer=}")

wer=61.4


In [25]:
# Test the model
test_audio = "data/audio/test_zh.flac"

from transformers import AutomaticSpeechRecognitionPipeline

pipeline = AutomaticSpeechRecognitionPipeline(model=model, tokenizer=tokenizer, feature_extractor=feature_extractor)

forced_decoder_ids = processor.get_decoder_prompt_ids(language="chinese", task=task)

with torch.cuda.amp.autocast():
    text = pipeline(test_audio, generate_kwargs={"forced_decoder_ids": forced_decoder_ids}, max_new_tokens=255)["text"]

text

The model 'PeftModel' is not supported for . Supported models are ['Pop2PianoForConditionalGeneration', 'SeamlessM4TForSpeechToText', 'SeamlessM4Tv2ForSpeechToText', 'SpeechEncoderDecoderModel', 'Speech2TextForConditionalGeneration', 'SpeechT5ForSpeechToText', 'WhisperForConditionalGeneration', 'Data2VecAudioForCTC', 'HubertForCTC', 'MCTCTForCTC', 'SEWForCTC', 'SEWDForCTC', 'UniSpeechForCTC', 'UniSpeechSatForCTC', 'Wav2Vec2ForCTC', 'Wav2Vec2ConformerForCTC', 'WavLMForCTC'].


'这是一段测试用于Whisperer Large V2模型的自动语音识别测试。'

## Using PEFT adapters after training
<https://huggingface.co/docs/trl/main/en/use_model>

from peft import PeftConfig, PeftModel  
from transformers import AutoModelForCausalLM, AutoTokenizer  

#path/to/your/model/or/name/on/hub  
base_model_name = "kashif/stack-llama-2"  
adapter_model_name = "path/to/my/adapter"  

model = AutoModelForCausalLM.from_pretrained(base_model_name)  
model = PeftModel.from_pretrained(model, adapter_model_name)  

tokenizer = AutoTokenizer.from_pretrained(base_model_name)  

You can also merge the adapters into the base model so that it can be used like a regular transformers model; the resulting checkpoint, however, will be significantly larger:  

model = AutoModelForCausalLM.from_pretrained(base_model_name)  
model = PeftModel.from_pretrained(model, adapter_model_name)  

model = model.merge_and_unload()  
model.save_pretrained("merged_adapters")
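
As a rough illustration of what merging means for a LoRA adapter, the sketch below uses the standard LoRA formulation, in which the adapted layer computes `W x + (alpha / r) * B A x`, so folding the update into the weight gives `W' = W + (alpha / r) * B A`. The toy matrices and helper names are hypothetical, and this is not PEFT's code, only the arithmetic that merging is generally understood to perform.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


# Hypothetical toy values: a 2x2 base weight and a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]  # base weight (out x in)
A = [[1.0, 2.0]]              # lora_A (r x in)
B = [[3.0], [4.0]]            # lora_B (out x r)
lora_alpha, r = 2, 1
scaling = lora_alpha / r

# Unmerged forward pass: base path plus scaled adapter path.
x = [1.0, 1.0]
base_out = matvec(W, x)
adapter_out = matvec(B, matvec(A, x))
unmerged = [b + scaling * a for b, a in zip(base_out, adapter_out)]

# Merged weight: fold the scaled low-rank update into W once.
delta = matmul(B, A)
W_merged = [[w + scaling * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
merged = matvec(W_merged, x)

print(unmerged, merged)
```

Both paths give the same output, which is why a merged model can be used without the PEFT wrapper; the cost is that the saved checkpoint then contains the full weight matrices instead of only the small `A` and `B` factors.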