# Hands-On LoRA with the PEFT Library - OpenAI Whisper-large-v2

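This tutorial uses LoRA to fine-tune the `OpenAI Whisper-large-v2` model for the `automatic speech recognition (ASR)` task.

We also apply `int8` quantization to further reduce resource usage during training, with almost no loss in accuracy.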

## Global Parameters

In [1]:
model_name_or_path = "openai/whisper-large-v2"
language = "Chinese (China)"
language_abbr = "zh-CN"
task = "transcribe"
dataset_name = "mozilla-foundation/common_voice_11_0"

batch_size=64

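## Downloading the Common Voice Dataset

The Common Voice 11.0 dataset contains recordings in many different languages, totaling thousands of hours of audio.

This tutorial uses the Chinese data as an example to show how to fine-tune Whisper-large-v2 with LoRA.

First, initialize a `DatasetDict` and load the Chinese configuration into memory: the train and validation splits are merged into the training set, and the test split is kept as the test set: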

In [2]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset(dataset_name, language_abbr, split="train+validation")
common_voice["test"] = load_dataset(dataset_name, language_abbr, split="test")
common_voice["train"][0]

{'client_id': '95368aab163e0387e4fd4991b4f2d8ccfbd4364bf656c860230501fd27dcedf087773e4695a6cf5de9c4f1d406d582283190d065cdfa36b0e2b060cffaca977e',
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/dcc5967c754d4c815fc005d6e297d84537028996cbcf6b34190517630cbc40b4/zh-CN_train_0/common_voice_zh-CN_33211332.mp3',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/dcc5967c754d4c815fc005d6e297d84537028996cbcf6b34190517630cbc40b4/zh-CN_train_0/common_voice_zh-CN_33211332.mp3',
  'array': array([-9.09494702e-13, -2.50111043e-12, -2.04636308e-12, ...,
          1.21667417e-05,  3.23003815e-06, -2.43064278e-07]),
  'sampling_rate': 48000},
 'sentence': '性喜温暖润湿气候且耐寒。',
 'up_votes': 2,
 'down_votes': 0,
 'age': '',
 'gender': '',
 'accent': '',
 'locale': 'zh-CN',
 'segment': ''}

## Preprocessing the Training Dataset


In [3]:
from transformers import AutoFeatureExtractor, AutoTokenizer, AutoProcessor

feature_extractor = AutoFeatureExtractor.from_pretrained(model_name_or_path)

tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path, language=language, task=task)

processor = AutoProcessor.from_pretrained(
    model_name_or_path, language=language, task=task)

#### Remove unneeded columns from the dataset

In [4]:
common_voice = common_voice.remove_columns(
    ["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"]
)

In [5]:
common_voice["train"][0]

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/dcc5967c754d4c815fc005d6e297d84537028996cbcf6b34190517630cbc40b4/zh-CN_train_0/common_voice_zh-CN_33211332.mp3',
  'array': array([-9.09494702e-13, -2.50111043e-12, -2.04636308e-12, ...,
          1.21667417e-05,  3.23003815e-06, -2.43064278e-07]),
  'sampling_rate': 48000},
 'sentence': '性喜温暖润湿气候且耐寒。'}

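#### Downsample the audio data

According to the `common_voice` dataset card, its audio is sampled at 48 kHz.

The `Whisper` model, however, was pretrained on 16 kHz audio, so we need to downsample the audio input to match the sampling rate used during pretraining.

To do this, call `cast_column` on the audio column and set `sampling_rate` to 16 kHz.

The audio is then resampled on the fly the next time it is accessed: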

In [6]:
from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

In [7]:
common_voice["train"][0]

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/dcc5967c754d4c815fc005d6e297d84537028996cbcf6b34190517630cbc40b4/zh-CN_train_0/common_voice_zh-CN_33211332.mp3',
  'array': array([ 6.54836185e-11, -2.91038305e-11, -5.82076609e-11, ...,
         -5.96660539e-06,  2.71383760e-05,  1.29687833e-05]),
  'sampling_rate': 16000},
 'sentence': '性喜温暖润湿气候且耐寒。'}
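As a rough illustration of what the cast above implies, the sketch below estimates how many samples a clip has after going from 48 kHz to 16 kHz. The helper `resampled_length` and the 5-second clip are hypothetical and only for illustration; the actual resampling is performed by `datasets` when the audio column is decoded.

```python
def resampled_length(num_samples: int, orig_sr: int, target_sr: int) -> int:
    """Approximate sample count after resampling; the clip duration is preserved."""
    duration_s = num_samples / orig_sr
    return round(duration_s * target_sr)


# A hypothetical 5-second clip recorded at 48 kHz.
n_48k = 5 * 48_000
n_16k = resampled_length(n_48k, orig_sr=48_000, target_sr=16_000)
print(n_48k, "->", n_16k)  # 240000 -> 80000
```

The array becomes three times shorter while covering the same duration, which is why the values printed for `array` differ before and after the cast.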

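### Combine the steps above into a single function

The preprocessing function should:
- Resample the audio to 16 kHz by loading the audio column.
- Compute the input features from the audio array with the feature extractor.
- Tokenize the sentence column into label ids.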

In [8]:
def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

In [9]:
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"])

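Create a `DataCollator` class that pads the input features and labels of each batch to a common length, then uses the tokenizer's `attention_mask` to replace the label padding with `-100` so that it is ignored by the loss function.

Then initialize an instance of the data collator: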

In [10]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch


data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
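The label handling in the collator can be sketched in plain Python, without tensors. This is a minimal sketch: the function `pad_and_mask_labels` and the token ids are made up for illustration, and the real collator works on PyTorch tensors through `tokenizer.pad` and `masked_fill`.

```python
from typing import List

IGNORE_INDEX = -100  # PyTorch's cross-entropy loss ignores this target value by default


def pad_and_mask_labels(label_ids: List[List[int]], pad_id: int, bos_id: int) -> List[List[int]]:
    """Pad label sequences to the longest one, mask the padding, drop a shared BOS."""
    max_len = max(len(ids) for ids in label_ids)
    padded, mask = [], []
    for ids in label_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        mask.append([1] * len(ids) + [0] * n_pad)

    # Mask by attention mask rather than by comparing with pad_id,
    # because pad_id may coincide with a real token (here it equals the end token 2).
    labels = [
        [tok if m == 1 else IGNORE_INDEX for tok, m in zip(row, mask_row)]
        for row, mask_row in zip(padded, mask)
    ]

    # If every sequence starts with BOS, cut it here; it is prepended again later.
    if all(row[0] == bos_id for row in labels):
        labels = [row[1:] for row in labels]
    return labels


batch = pad_and_mask_labels([[1, 7, 8, 2], [1, 9, 2]], pad_id=2, bos_id=1)
print(batch)  # [[7, 8, 2], [9, 2, -100]]
```

Only the padded position of the shorter sequence becomes `-100`; the genuine end token `2` is kept even though it has the same id as the padding token.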

## Training the Model

In [11]:
from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name_or_path, load_in_8bit=True, device_map="auto")

In [12]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

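To prepare the int8-quantized model for training, process it with the `prepare_model_for_int8_training` function, which:
- casts all non-int8 modules to full precision (fp32) for stability
- adds a forward hook on the input embedding layer so that gradients are computed for the input hidden states
- enables gradient checkpointing for more memory-efficient training

Note that recent PEFT releases replace this function with `prepare_model_for_kbit_training`.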

In [13]:
from peft import prepare_model_for_int8_training

model = prepare_model_for_int8_training(model)



In [14]:
from peft import LoraConfig, PeftModel, LoraModel, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none")

In [15]:
model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 3,932,160 || all params: 1,547,237,120 || trainable%: 0.25414074863974306
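The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes the Whisper-large-v2 architecture has 32 encoder layers and 32 decoder layers with hidden size 1280, and that each decoder layer has both a self-attention and a cross-attention block; these figures are assumptions about the model rather than something the notebook prints in full.

```python
d_model = 1280          # hidden size of the attention projections (assumed)
r = 8                   # LoRA rank from LoraConfig
encoder_layers = 32     # assumed for Whisper-large-v2
decoder_layers = 32     # assumed for Whisper-large-v2
targets_per_attention = 2  # q_proj and v_proj

# One attention block per encoder layer; self- and cross-attention per decoder layer.
attention_blocks = encoder_layers * 1 + decoder_layers * 2
adapted_modules = attention_blocks * targets_per_attention

# Each adapted Linear gets A (d_model x r) and B (r x d_model), without bias.
params_per_module = d_model * r + r * d_model
trainable = adapted_modules * params_per_module

all_params = 1_547_237_120  # total reported by print_trainable_parameters()
print(f"{trainable:,}")                           # 3,932,160
print(f"{100 * trainable / all_params:.4f}%")     # 0.2541%
```

Under these assumptions the arithmetic matches the printed `trainable params` exactly: 192 adapted projections with 20,480 LoRA weights each.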


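### For demonstration purposes, only 100 steps are trained here. We recommend switching to the default 3 epochs to fully train a Chinese speech recognition model.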

In [16]:
from transformers import Seq2SeqTrainingArguments

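# Training arguments for the sequence-to-sequence model
training_args = Seq2SeqTrainingArguments(
    output_dir="models/whisper-large-v2-asr-int8",  # Directory for model outputs and checkpoints
    per_device_train_batch_size=batch_size,  # Training batch size per device
    gradient_accumulation_steps=1,  # Number of batches to accumulate gradients over before each optimizer step
    learning_rate=1e-3,  # Learning rate
    warmup_steps=50,  # Steps over which the learning rate ramps up at the start, which helps stabilize training
    max_steps=100,  # Total number of training steps
    # num_train_epochs=3,  # Total number of training epochs
    # evaluation_strategy="epoch",  # Evaluation strategy: evaluate at the end of each epoch
    fp16=True,  # Mixed-precision training: faster and uses less memory
    per_device_eval_batch_size=batch_size,  # Evaluation batch size per device
    generation_max_length=128,  # Maximum length for generation
    logging_steps=25,  # Log training progress every 25 steps
    remove_unused_columns=False,  # Keep all columns: the PeftModel forward does not expose the wrapped model's signature
    label_names=["labels"],  # Name of the label column, stated explicitly for the same reason
)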
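To make the effect of `warmup_steps=50` with `max_steps=100` concrete, here is a minimal sketch of a linear warmup followed by linear decay. It assumes the Trainer's default scheduler type ("linear") applies, since the cell above does not set `lr_scheduler_type`; the function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step: int, base_lr: float = 1e-3,
                       warmup_steps: int = 50, total_steps: int = 100) -> float:
    """Learning rate at a given optimizer step: linear ramp-up, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


for step in (0, 25, 50, 75, 100):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

With these settings half of the 100-step demo run is spent warming up, and the peak learning rate of `1e-3` is reached only at step 50.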

#### Callback that saves the adapter during training; recommended for long runs

In [17]:
import os
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
from transformers import Seq2SeqTrainer, TrainerCallback, Seq2SeqTrainingArguments, TrainerState, TrainerControl

class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: Seq2SeqTrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control

In [18]:
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False

Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [19]:
trainer.train()



Step,Training Loss


KeyboardInterrupt: 

### Save the LoRA Model

In [47]:
model.save_pretrained("models/whisper-large-v2-asr-int8")

### Load the LoRA Model with a Pipeline for Automatic Speech Recognition

In [50]:
test_audio = "data/audio/test_zh.flac"

In [55]:
from transformers import AutomaticSpeechRecognitionPipeline

pipeline = AutomaticSpeechRecognitionPipeline(model=model, tokenizer=tokenizer, feature_extractor=feature_extractor)

forced_decoder_ids = processor.get_decoder_prompt_ids(language="chinese", task=task)

The model 'PeftModel' is not supported for . Supported models are ['Pop2PianoForConditionalGeneration', 'SeamlessM4TForSpeechToText', 'SeamlessM4Tv2ForSpeechToText', 'SpeechEncoderDecoderModel', 'Speech2TextForConditionalGeneration', 'SpeechT5ForSpeechToText', 'WhisperForConditionalGeneration', 'Data2VecAudioForCTC', 'HubertForCTC', 'MCTCTForCTC', 'SEWForCTC', 'SEWDForCTC', 'UniSpeechForCTC', 'UniSpeechSatForCTC', 'Wav2Vec2ForCTC', 'Wav2Vec2ConformerForCTC', 'WavLMForCTC'].


In [56]:
with torch.cuda.amp.autocast():
    text = pipeline(test_audio, generate_kwargs={"forced_decoder_ids": forced_decoder_ids}, max_new_tokens=255)["text"]



In [57]:
text

'这是一段测试，用于WhisperLarge V2模型的自动语音识别测试。'

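#### Homework 1: Add in-training evaluation to the Chinese training run and observe how the train loss and validation loss change;
#### Homework 2: After the LoRA model has been trained, run a full evaluation on the test set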

## Homework1

### Environment Setup

In [11]:
# Install the required dependencies; only needs to be run once
!pip install -r ../requirements.txt

Looking in indexes: http://mirrors.aliyun.com/pypi/simple

In [22]:
# Configure the cache locations; only needs to be run once at the start
import os
# Customize where transformers and datasets download and cache files
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"
os.environ["HF_DATASETS_CACHE"] = "../../autodl-tmp/datasets_cache/"
os.environ["HF_HOME"] = "../../autodl-tmp/cache/"
os.environ["HUGGINGFACE_HUB_CACHE"] = "../../autodl-tmp/hub_cache/"

In [28]:
# Configure the network proxy; only needs to be run once at the start

import subprocess
import os

# result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True, encoding='utf-8')
output = result.stdout
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value

In [24]:
# Verify that the environment settings were applied
print("http_proxy",os.environ.get("http_proxy"))
print("https_proxy",os.environ.get("https_proxy"))
print("HF_HOME",os.environ.get("HF_HOME"))
print("HF_DATASETS_CACHE",os.environ.get("HF_DATASETS_CACHE"))
print("HUGGINGFACE_HUB_CACHE",os.environ.get("HUGGINGFACE_HUB_CACHE"))

http_proxy None
https_proxy None
HF_HOME ../../autodl-tmp/cache/
HF_DATASETS_CACHE ../../autodl-tmp/datasets_cache/
HUGGINGFACE_HUB_CACHE ../../autodl-tmp/hub_cache/


### Global Parameters

In [4]:
model_name_or_path = "openai/whisper-large-v2"
language = "Chinese (China)"
language_abbr = "zh-CN"
task = "transcribe"
dataset_name = "mozilla-foundation/common_voice_11_0"

batch_size=32  # 140 is the maximum

### Data Loading and Preprocessing

In [9]:
# Load the data; no need to rerun in later sessions

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

# common_voice["train"] = load_dataset(dataset_name, language_abbr, split="train+validation")

common_voice["train"] = load_dataset(dataset_name, language_abbr, split="train")
common_voice["validation"] = load_dataset(dataset_name, language_abbr, split="validation")
common_voice["test"] = load_dataset(dataset_name, language_abbr, split="test")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [6]:
common_voice

DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 29056
    })
    validation: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 10581
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 10581
    })
})

In [7]:
common_voice["train"][0]

{'client_id': '95368aab163e0387e4fd4991b4f2d8ccfbd4364bf656c860230501fd27dcedf087773e4695a6cf5de9c4f1d406d582283190d065cdfa36b0e2b060cffaca977e',
 'path': '/root/autodl-tmp/datasets_cache/downloads/extracted/da7287605eb1b400e138aa8591c50011825a9947f038ad80ae4f552d376995e7/zh-CN_train_0/common_voice_zh-CN_33211332.mp3',
 'audio': {'path': '/root/autodl-tmp/datasets_cache/downloads/extracted/da7287605eb1b400e138aa8591c50011825a9947f038ad80ae4f552d376995e7/zh-CN_train_0/common_voice_zh-CN_33211332.mp3',
  'array': array([-6.82121026e-13, -2.27373675e-12, -2.27373675e-12, ...,
          1.21667399e-05,  3.23003678e-06, -2.43066324e-07]),
  'sampling_rate': 48000},
 'sentence': '性喜温暖润湿气候且耐寒。',
 'up_votes': 2,
 'down_votes': 0,
 'age': '',
 'gender': '',
 'accent': '',
 'locale': 'zh-CN',
 'segment': ''}

In [5]:
# Needs to be run in later sessions as well
from transformers import AutoFeatureExtractor, AutoTokenizer, AutoProcessor

model_path = "../../autodl-tmp/model/openai/whisper-large-v2"

feature_extractor = AutoFeatureExtractor.from_pretrained(model_path)

tokenizer = AutoTokenizer.from_pretrained(
    model_path, language=language, task=task)

processor = AutoProcessor.from_pretrained(
    model_path, language=language, task=task)

In [None]:
# # Remove unneeded columns
# common_voice = common_voice.remove_columns(
#     ["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"]
# )
# common_voice["train"][0]

In [None]:
# No need to rerun in later sessions
# Downsample the audio data
from datasets import Audio

# Cast the "audio" column of common_voice to an Audio feature with the target sampling rate.
# With Audio(sampling_rate=16000), .cast_column() makes the column decode at 16000 Hz.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
common_voice["train"][0]
common_voice["validation"][0]

In [10]:
# No need to rerun from the second session onward
def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

# remove_columns drops the columns that are no longer needed, namely those listed in common_voice.column_names["train"].
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"])

Map:   0%|          | 0/29056 [00:00<?, ? examples/s]

Map:   0%|          | 0/10581 [00:00<?, ? examples/s]

Map:   0%|          | 0/10581 [00:00<?, ? examples/s]

In [12]:
common_voice

DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 29056
    })
    validation: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 10581
    })
    test: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 10581
    })
})

In [15]:
# No need to rerun in later sessions
# Save the processed dataset to local disk
common_voice.save_to_disk("../../autodl-tmp/datasets/processed_common_voice")

Saving the dataset (0/56 shards):   0%|          | 0/29056 [00:00<?, ? examples/s]

Saving the dataset (0/21 shards):   0%|          | 0/10581 [00:00<?, ? examples/s]

Saving the dataset (0/21 shards):   0%|          | 0/10581 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 29056
    })
    validation: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 10581
    })
    test: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 10581
    })
})

In [6]:
# From the second session onward, start running here
import datasets
# Load the dataset from local disk
common_voice = datasets.load_from_disk("../../autodl-tmp/datasets/processed_common_voice")

common_voice

DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 29056
    })
    validation: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 10581
    })
    test: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 10581
    })
})

In [7]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

# Data collator for speech Seq2Seq tasks: it pads and prepares the input features and labels for training and evaluation

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch


data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

### Training the Model

In [8]:
from transformers import AutoModelForSpeechSeq2Seq

model_path = "../../autodl-tmp/model/openai/whisper-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path, load_in_8bit=True, device_map="auto")

In [9]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

In [10]:
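# Why is this step needed when from_pretrained above already used load_in_8bit?
# load_in_8bit only quantizes the weights. prepare_model_for_int8_training makes the quantized
# model trainable: fp32 for non-int8 modules, an input-embedding hook, and gradient checkpointing.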

from peft import prepare_model_for_int8_training

model = prepare_model_for_int8_training(model)



In [11]:
from peft import LoraConfig, PeftModel, LoraModel, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none")

In [12]:
model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 3,932,160 || all params: 1,547,237,120 || trainable%: 0.25414074863974306


In [13]:
from transformers import Seq2SeqTrainingArguments

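# Training arguments for the sequence-to-sequence model
training_args = Seq2SeqTrainingArguments(
    output_dir="../../autodl-tmp/model/whisper-large-v2-asr-int8",  # Directory for model outputs and checkpoints
    per_device_train_batch_size=batch_size,  # Training batch size per device
    gradient_accumulation_steps=1,  # Number of batches to accumulate gradients over before each optimizer step
    learning_rate=1e-3,  # Learning rate
    warmup_steps=50,  # Steps over which the learning rate ramps up at the start, which helps stabilize training
    # max_steps=100,  # Total number of training steps
    num_train_epochs=3,  # Total number of training epochs
    evaluation_strategy="epoch",  # Evaluate at the end of each epoch
    fp16=True,  # Mixed-precision training: faster and uses less memory
    per_device_eval_batch_size=batch_size,  # Evaluation batch size per device
    generation_max_length=128,  # Maximum length for generation
    logging_steps=25,  # Log training progress every 25 steps
    remove_unused_columns=False,  # Keep all columns: the PeftModel forward does not expose the wrapped model's signature
    label_names=["labels"],  # Name of the label column, stated explicitly for the same reason
    save_steps=100,  # Save a checkpoint every 100 steps
    save_total_limit=5,  # Keep at most 5 checkpoints on disk
)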

In [14]:
import os
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
from transformers import Seq2SeqTrainer, TrainerCallback, Seq2SeqTrainingArguments, TrainerState, TrainerControl

class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: Seq2SeqTrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control

In [15]:
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["validation"],
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False

In [16]:
trainer.train(resume_from_checkpoint=True)



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.344 | 0.38765 |
| 2 | 0.2819 | 0.373452 |
| 3 | 0.2236 | 0.378031 |




TrainOutput(global_step=624, training_loss=0.17545655216926184, metrics={'train_runtime': 15730.1758, 'train_samples_per_second': 5.541, 'train_steps_per_second': 0.04, 'total_flos': 1.855661438140416e+20, 'train_loss': 0.17545655216926184, 'epoch': 3.0})
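As a sanity check on the `TrainOutput` above, the sketch below relates `global_step` to the dataset size and batch size. It assumes a single device and `gradient_accumulation_steps=1` as configured; the helper name is hypothetical. Read this way, 624 steps over 3 epochs is consistent with a per-device batch size of 140 (the maximum noted in the parameter cell) rather than the 32 shown there, which suggests the run may have used a different value from the one displayed.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a single device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


train_rows = 29056  # size of the train split shown earlier

print(total_optimizer_steps(train_rows, 32, 3))   # 2724
print(total_optimizer_steps(train_rows, 140, 3))  # 624

# Throughput implied by the reported runtime of 15730.1758 seconds.
print(round(train_rows * 3 / 15730.1758, 3))      # 5.541
```

The last line reproduces the reported `train_samples_per_second`, so the row count and epoch count used here agree with the logged metrics.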

In [None]:
model.save_pretrained("../../autodl-tmp/peft-models/whisper-large-v2-asr-int8")

### Load the LoRA Model with a Pipeline for Automatic Speech Recognition

In [16]:
from transformers import AutoModelForSpeechSeq2Seq, AutoTokenizer, AutoProcessor
from peft import PeftConfig, PeftModel

model_dir = "../../autodl-tmp/peft-models/whisper-large-v2-asr-int8"

language = "Chinese (China)"
language_abbr = "zh-CN"
language_decode = "chinese"
task = "transcribe"

peft_config = PeftConfig.from_pretrained(model_dir)

base_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)

peft_model = PeftModel.from_pretrained(base_model, model_dir)

tokenizer = AutoTokenizer.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)
processor = AutoProcessor.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)
feature_extractor = processor.feature_extractor


test_audio = "data/audio/test_zh.flac"

from transformers import AutomaticSpeechRecognitionPipeline

pipeline = AutomaticSpeechRecognitionPipeline(model=peft_model, tokenizer=tokenizer, feature_extractor=feature_extractor)

forced_decoder_ids = processor.get_decoder_prompt_ids(language=language_decode, task=task)

import torch

with torch.cuda.amp.autocast():
    text = pipeline(test_audio, max_new_tokens=255)["text"]



In [17]:
text

'这是一段测试用于WhisperLarge V2模型的自动语音识别测试。'

## Homework2

### Environment Setup and Global Variables

In [1]:
# Configure the cache locations; only needs to be run once at the start
import os
# Customize where transformers and datasets download and cache files
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"
os.environ["HF_DATASETS_CACHE"] = "../../autodl-tmp/datasets_cache/"
os.environ["HF_HOME"] = "../../autodl-tmp/cache/"
os.environ["HUGGINGFACE_HUB_CACHE"] = "../../autodl-tmp/hub_cache/"

# Configure the network proxy; only needs to be run once at the start

import subprocess
import os

# result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True, encoding='utf-8')

output = result.stdout
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value

# Verify that the environment settings were applied
print("http_proxy",os.environ.get("http_proxy"))
print("https_proxy",os.environ.get("https_proxy"))
print("HF_HOME",os.environ.get("HF_HOME"))
print("HF_DATASETS_CACHE",os.environ.get("HF_DATASETS_CACHE"))
print("HUGGINGFACE_HUB_CACHE",os.environ.get("HUGGINGFACE_HUB_CACHE"))

http_proxy http://172.20.0.113:12798
https_proxy http://172.20.0.113:12798
HF_HOME ../../autodl-tmp/cache/
HF_DATASETS_CACHE ../../autodl-tmp/datasets_cache/
HUGGINGFACE_HUB_CACHE ../../autodl-tmp/hub_cache/


In [2]:
model_name_or_path = "openai/whisper-large-v2"
language = "Chinese (China)"
language_abbr = "zh-CN"
task = "transcribe"
dataset_name = "mozilla-foundation/common_voice_11_0"

### Loading and Processing the Test Set

In [3]:
import evaluate

# Word error rate (WER) is a common metric for evaluating ASR models. Load the WER metric from Evaluate.
metric = evaluate.load("wer")

Downloading builder script:   0%|          | 0.00/4.49k [00:00<?, ?B/s]

In [15]:
# Load the test set

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

# common_voice["train"] = load_dataset(dataset_name, language_abbr, split="train")
common_voice["test"] = load_dataset(dataset_name, language_abbr, split="test")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [16]:
from transformers import AutoModelForSpeechSeq2Seq, AutoTokenizer, AutoProcessor
from peft import PeftConfig, PeftModel

model_dir = "../../autodl-tmp/peft-models/whisper-large-v2-asr-int8"

peft_config = PeftConfig.from_pretrained(model_dir)

base_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)

peft_model = PeftModel.from_pretrained(base_model, model_dir)

tokenizer = AutoTokenizer.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)
processor = AutoProcessor.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)
feature_extractor = processor.feature_extractor

In [17]:
# No need to rerun in later sessions
# Downsample the audio data
from datasets import Audio

# Cast the "audio" column of common_voice to an Audio feature with the target sampling rate.
# With Audio(sampling_rate=16000), .cast_column() makes the column decode at 16000 Hz.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
common_voice["test"][0]

{'client_id': '02bf7ccb5f078eb0294cc22f6d725720e4079d0190fa81e71db04ff4cbdf4f22126bdf528b8591f4fe1069fad05f1911977cf960ef8e09a922b3b10d6a6926f0',
 'path': '/root/autodl-tmp/datasets_cache/downloads/extracted/82be27a05c566ef43dc388e9c6d8f792b9fdec3c1a40f497c737e810a27b7bbf/zh-CN_test_0/common_voice_zh-CN_32269533.mp3',
 'audio': {'path': '/root/autodl-tmp/datasets_cache/downloads/extracted/82be27a05c566ef43dc388e9c6d8f792b9fdec3c1a40f497c737e810a27b7bbf/zh-CN_test_0/common_voice_zh-CN_32269533.mp3',
  'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
          1.03953280e-05, -4.73184627e-07,  1.83706070e-05]),
  'sampling_rate': 16000},
 'sentence': '否',
 'up_votes': 2,
 'down_votes': 1,
 'age': '',
 'gender': '',
 'accent': '',
 'locale': 'zh-CN',
 'segment': 'Benchmark'}

In [18]:
# No need to rerun from the second session onward
def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

# remove_columns drops the columns that are no longer needed, namely those listed in common_voice.column_names["test"].
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["test"])
common_voice

DatasetDict({
    test: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 10581
    })
})

In [19]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

# Data collator for speech Seq2Seq tasks: it pads and prepares the input features and labels for training and evaluation

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch


data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

### Evaluating the Model

In [20]:
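# DataLoader from PyTorch loads the data in batches.
from torch.utils.data import DataLoader
# tqdm displays a progress bar while the loop runs.
from tqdm import tqdm
# NumPy is used for array operations.
import numpy as np
# Python's garbage-collection module, used to free memory manually.
import gc

# Create the data loader eval_dataloader for the test set, common_voice["test"].
# batch_size=8 puts 8 samples in each batch.
# collate_fn=data_collator assembles the individual samples into one batch.
# Collation is still needed here because the label sequences differ in length:
# the collator pads them into tensors and masks the padding with -100.
eval_dataloader = DataLoader(common_voice["test"], batch_size=8, collate_fn=data_collator)

# Switch the model to evaluation mode, which disables training-only behavior such as dropout.
peft_model.eval()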

PeftModel(
  (base_model): LoraModel(
    (model): WhisperForConditionalGeneration(
      (model): WhisperModel(
        (encoder): WhisperEncoder(
          (conv1): Conv1d(80, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
          (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
          (embed_positions): Embedding(1500, 1280)
          (layers): ModuleList(
            (0-31): 32 x WhisperEncoderLayer(
              (self_attn): WhisperSdpaAttention(
                (k_proj): Linear8bitLt(in_features=1280, out_features=1280, bias=False)
                (v_proj): lora.Linear8bitLt(
                  (base_layer): Linear8bitLt(in_features=1280, out_features=1280, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.05, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=1280, out_features=8, bias=False)
                  )
            

In [21]:
# Iterate over every batch of the test data loader, with tqdm showing progress.
for step, batch in enumerate(tqdm(eval_dataloader)):
    # Enable automatic mixed precision, which speeds up inference on NVIDIA GPUs.
    with torch.cuda.amp.autocast():
        # Disable gradient computation during evaluation to save memory and time.
        with torch.no_grad():
            # Generate the output token sequence from the input features,
            # then move the generated tokens to the CPU and convert them to a NumPy array.
            generated_tokens = (
                peft_model.generate(
                    input_features=batch["input_features"].to("cuda"),
                    decoder_input_ids=batch["labels"][:, :4].to("cuda"),
                    max_new_tokens=255,
                )
                .cpu()
                .numpy()
            )
            # Convert the reference labels to a NumPy array on the CPU.
            labels = batch["labels"].cpu().numpy()
            # np.where picks elements from two arrays according to a condition.
            # The condition labels != -100 marks the real label positions
            # (-100 marks the padded positions). Real labels are kept and the
            # padded positions are replaced with tokenizer.pad_token_id,
            # so that the padding can be skipped when decoding.
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            # batch_decode turns the generated token ids into text.
            # skip_special_tokens=True drops special tokens such as the start,
            # end, and padding tokens, leaving only the text content.
            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            # Decode the reference labels in the same way.
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
            # Pass this batch's predictions and references to the metric object.
            metric.add_batch(
                predictions=decoded_preds,
                references=decoded_labels,
            )
    # Delete variables that are no longer needed.
    del generated_tokens, labels, batch
    # Trigger Python's garbage collection to free more memory.
    gc.collect()

100%|██████████| 1323/1323 [2:01:16<00:00,  5.50s/it] 
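The label restoration step in the loop above (replacing `-100` with the padding id before decoding) can be sketched without NumPy or a real tokenizer. The vocabulary `toy_vocab` and both helper functions are hypothetical stand-ins for `tokenizer.pad_token_id` and `tokenizer.batch_decode(..., skip_special_tokens=True)`.

```python
IGNORE_INDEX = -100
PAD_ID = 0
toy_vocab = {0: "<pad>", 1: "你", 2: "好", 3: "吗"}  # hypothetical vocabulary


def restore_padding(labels, pad_id):
    """Replace the loss-masking value with a real token id so rows can be decoded."""
    return [[tok if tok != IGNORE_INDEX else pad_id for tok in row] for row in labels]


def toy_decode(ids, skip_ids=(PAD_ID,)):
    """Join tokens into text, skipping special tokens such as padding."""
    return "".join(toy_vocab[i] for i in ids if i not in skip_ids)


labels = [[1, 2, 3], [1, 2, IGNORE_INDEX]]
restored = restore_padding(labels, PAD_ID)
print(restored)                           # [[1, 2, 3], [1, 2, 0]]
print([toy_decode(r) for r in restored])  # ['你好吗', '你好']
```

The replacement is needed because `-100` is not a valid token id, so a tokenizer cannot decode it; once it is mapped back to the padding id, the padding is dropped as a special token.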


In [22]:
# Compute the word error rate (WER) and express it as a percentage.
wer = 100 * metric.compute()
print(f"{wer=}")

wer=55.24475524475524
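When reading the WER above, it may help to recall how the metric counts errors. The sketch below is a plain-Python edit-distance implementation, not the `evaluate` implementation. It assumes that WER splits text on whitespace, which I believe is how the `wer` metric behaves; under that assumption an unsegmented Chinese sentence counts as a single word. The hypothesis string is a made-up model output for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Total edit errors divided by total reference tokens."""
    errors = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in zip(refs, hyps))
    total = sum(len(tokenize(r)) for r in refs)
    return errors / total


refs = ["性喜温暖润湿气候且耐寒"]
hyps = ["性喜温暖湿润气候且耐寒"]  # hypothetical output with two characters swapped

word_level = error_rate(refs, hyps, str.split)  # whitespace "words"
char_level = error_rate(refs, hyps, list)       # individual characters
print(word_level)            # 1.0
print(round(char_level, 4))  # 0.1818
```

A sentence with two wrong characters out of eleven scores 100% at the word level here, so a character-level error rate is often reported alongside WER for Chinese. The 55.24% figure above may therefore overstate how far the transcriptions are from the references.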


## Small-Dataset Experiment Workflow (Please Ignore)

In [27]:
# Configure the cache locations; only needs to be run once at the start
import os
# Customize where transformers and datasets download and cache files
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"
os.environ["HF_DATASETS_CACHE"] = "../../autodl-tmp/datasets_cache/"
os.environ["HF_HOME"] = "../../autodl-tmp/cache/"
os.environ["HUGGINGFACE_HUB_CACHE"] = "../../autodl-tmp/hub_cache/"

# Configure the network proxy; only needs to be run once at the start

import subprocess
import os

# result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True, encoding='utf-8')

output = result.stdout
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value

# Verify that the environment variables were set correctly.
print("http_proxy",os.environ.get("http_proxy"))
print("https_proxy",os.environ.get("https_proxy"))
print("HF_HOME",os.environ.get("HF_HOME"))
print("HF_DATASETS_CACHE",os.environ.get("HF_DATASETS_CACHE"))
print("HUGGINGFACE_HUB_CACHE",os.environ.get("HUGGINGFACE_HUB_CACHE"))

http_proxy http://172.20.0.113:12798
https_proxy http://172.20.0.113:12798
HF_HOME ../../autodl-tmp/cache/
HF_DATASETS_CACHE ../../autodl-tmp/datasets_cache/
HUGGINGFACE_HUB_CACHE ../../autodl-tmp/hub_cache/
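
The loop that copies the proxy variables into `os.environ` relies on splitting each `KEY=value` line at the first `=` only. The following sketch isolates that parsing step; the helper name `parse_env_lines` and the sample text are made up for illustration and do not come from `/etc/network_turbo`.

```python
def parse_env_lines(text):
    """Parse KEY=value lines (as printed by `env`) into a dict."""
    parsed = {}
    for line in text.splitlines():
        if "=" in line:
            # Split on the first '=' only, because values may contain '=' themselves.
            key, value = line.split("=", 1)
            parsed[key] = value
    return parsed


# Made-up sample output; the real values come from the sourced proxy script.
sample_output = (
    "http_proxy=http://proxy.example:8080\n"
    "https_proxy=http://proxy.example:8080/?token=a=b\n"
    "this line has no assignment\n"
)
proxies = parse_env_lines(sample_output)
print(proxies)
```

Using `split("=", 1)` rather than `split("=")` keeps values such as URLs with query strings intact.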


In [2]:
model_name_or_path = "openai/whisper-large-v2"
language = "Chinese (China)"
language_abbr = "zh-CN"
task = "transcribe"
dataset_name = "mozilla-foundation/common_voice_11_0"

batch_size = 32  # 140 at most

In [None]:
# Load the data; this cell does not need to be rerun in later sessions.

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

# common_voice["train"] = load_dataset(dataset_name, language_abbr, split="train+validation")

common_voice["train"] = load_dataset(dataset_name, language_abbr, split="train")
common_voice["validation"] = load_dataset(dataset_name, language_abbr, split="validation")
common_voice["test"] = load_dataset(dataset_name, language_abbr, split="test")

In [None]:
# Build small subsets for quick experiments; only needs to run once.
small_common_voice = DatasetDict()

small_common_voice["train"] = common_voice["train"].shuffle(seed=16).select(range(640))
small_common_voice["validation"] = common_voice["validation"].shuffle(seed=16).select(range(320))

# Preprocess the experimental subsets.
# Downsample the audio.
from datasets import Audio

# Cast the "audio" column of small_common_voice to a 16000 Hz sampling rate.
# cast_column() re-declares the column as Audio(sampling_rate=16000), so the audio is resampled when it is accessed.
small_common_voice = small_common_voice.cast_column("audio", Audio(sampling_rate=16000))

In [3]:
# Run this cell in every session, including later ones.
from transformers import AutoFeatureExtractor, AutoTokenizer, AutoProcessor

model_path = "../../autodl-tmp/model/openai/whisper-large-v2"

feature_extractor = AutoFeatureExtractor.from_pretrained(model_path)

tokenizer = AutoTokenizer.from_pretrained(
    model_path, language=language, task=task)

processor = AutoProcessor.from_pretrained(
    model_path, language=language, task=task)

In [None]:
# Preprocess the small experimental dataset; only needs to run once.
def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

# remove_columns drops the original columns (those listed in common_voice.column_names["train"]), leaving only input_features and labels.
small_common_voice = small_common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"])

small_common_voice.save_to_disk("../../autodl-tmp/datasets/processed_small_common_voice")

In [4]:
# Load the small experimental dataset.
import datasets
# Load the preprocessed dataset from local disk.
small_common_voice = datasets.load_from_disk("../../autodl-tmp/datasets/processed_small_common_voice")

small_common_voice

DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 640
    })
    validation: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 320
    })
})

In [5]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

# Data collator for speech Seq2Seq tasks: pads the input features and labels so each batch is well-formed for training and evaluation.

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Pad the audio input features into a batch of tensors.
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad the tokenized label sequences to a common length.
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Replace padded positions with -100 so the loss ignores them.
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # If every sequence starts with the BOS token, drop that first position.
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch


data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
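
The label handling in the collator can be illustrated without any tensors. The sketch below is a simplified pure-Python stand-in, not the collator itself: it pads directly with -100 instead of padding with the tokenizer and then applying `masked_fill`, and the helper name `pad_labels` and all token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def pad_labels(label_lists, bos_token_id):
    """Pad label id lists to equal length with IGNORE_INDEX, then drop a shared leading BOS."""
    max_len = max(len(ids) for ids in label_lists)
    padded = [ids + [IGNORE_INDEX] * (max_len - len(ids)) for ids in label_lists]
    # Mirror the check in __call__: strip the first position only if every row starts with BOS.
    if all(row[0] == bos_token_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded


# Hypothetical ids: 1 plays the role of the BOS token.
with_bos = pad_labels([[1, 5, 6], [1, 7]], bos_token_id=1)
without_bos = pad_labels([[5, 6], [1, 7, 8]], bos_token_id=1)
print(with_bos)
print(without_bos)
```

In the first call both rows start with the BOS id, so it is removed; in the second call only one row does, so the batch is left unchanged apart from padding.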

In [6]:
from transformers import AutoModelForSpeechSeq2Seq

model_path = "../../autodl-tmp/model/openai/whisper-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path, load_in_8bit=True, device_map="auto")

model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

In [7]:
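# from_pretrained above already used load_in_8bit=True, so why is further preparation needed?
# load_in_8bit only quantizes the weights. prepare_model_for_int8_training then makes the
# quantized model suitable for training: it freezes the base weights, casts layer norm
# parameters to fp32 for stability, and enables gradient checkpointing.
# Note: newer PEFT releases provide this helper as prepare_model_for_kbit_training.

from peft import prepare_model_for_int8_training

model = prepare_model_for_int8_training(model)

from peft import LoraConfig, PeftModel, LoraModel, get_peft_model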

config = LoraConfig(
    r=8,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none")

model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 3,932,160 || all params: 1,547,237,120 || trainable%: 0.25414074863974306
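
The trainable parameter count printed above can be reproduced by hand. This sketch assumes the Whisper-large-v2 dimensions `d_model = 1280`, 32 encoder layers, and 32 decoder layers, which are not printed anywhere in this notebook; the rank, alpha, and target modules are taken from the `LoraConfig` above.

```python
# Assumed Whisper-large-v2 dimensions (not shown in this notebook).
d_model = 1280
encoder_layers = 32
decoder_layers = 32

r = 8                      # LoRA rank from LoraConfig
lora_alpha = 64            # LoRA alpha from LoraConfig
targets_per_attention = 2  # q_proj and v_proj

# Each encoder layer has one attention block (self-attention);
# each decoder layer has two (self-attention and cross-attention).
attention_blocks = encoder_layers * 1 + decoder_layers * 2
adapted_modules = attention_blocks * targets_per_attention

# Each adapted d_model x d_model projection gains A (r x d_model) and B (d_model x r).
params_per_module = r * d_model + d_model * r
trainable = adapted_modules * params_per_module

all_params = 1_547_237_120  # total reported by print_trainable_parameters()
trainable_percent = 100 * trainable / all_params
scaling = lora_alpha / r    # factor applied to the low-rank update

print(trainable)            # 3932160
print(trainable_percent)
print(scaling)              # 8.0
```

Under these assumptions, 192 adapted projections with 20,480 parameters each give 3,932,160 trainable parameters, matching the printed value.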




In [8]:
from transformers import Seq2SeqTrainingArguments

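# Training arguments for the sequence-to-sequence model
training_args = Seq2SeqTrainingArguments(
    output_dir="../../autodl-tmp/model/small-whisper-large-v2-asr-int8",  # directory for model outputs and checkpoints
    per_device_train_batch_size=64,
    # batch_size,  # training batch size per device
    gradient_accumulation_steps=1,  # number of batches whose gradients are accumulated before each optimizer step
    learning_rate=1e-3,  # learning rate
    warmup_steps=50,  # steps over which the learning rate ramps up, which helps stabilize early training
    # max_steps=100, # total number of training steps
    num_train_epochs=10,  # total number of training epochs
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
    fp16=True,  # mixed-precision training: faster and uses less memory
    per_device_eval_batch_size=batch_size,  # evaluation batch size per device
    generation_max_length=128,  # maximum length of generated sequences
    logging_steps=25,  # log training metrics every 25 steps
    remove_unused_columns=False,  # keep all dataset columns; needed because the PeftModel wrapper does not expose the base model's forward signature
    label_names=["labels"],  # name of the label column
    save_steps=100,  # save a checkpoint every 100 steps
    save_total_limit=5,  # keep at most 5 checkpoints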
)
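
It helps to work out how these settings interact with the 640-sample training subset. The sketch below is plain arithmetic under one assumption that the notebook does not state: training runs on a single device. The variable names are illustrative.

```python
import math

train_samples = 640         # rows in small_common_voice["train"]
per_device_batch_size = 64  # per_device_train_batch_size
grad_accum = 1              # gradient_accumulation_steps
num_devices = 1             # assumption: a single GPU
num_epochs = 10             # num_train_epochs
warmup_steps = 50
logging_steps = 25
save_steps = 100

effective_batch = per_device_batch_size * grad_accum * num_devices
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs

warmup_fraction = warmup_steps / total_steps
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
checkpoints_saved = total_steps // save_steps

print(steps_per_epoch, total_steps)  # 10 100
print(warmup_fraction)               # 0.5
print(first_logged_epoch)            # 3
print(checkpoints_saved)             # 1
```

Under this assumption, half of the run is spent in warmup, the training loss is first logged during epoch 3 (so evaluations after epochs 1 and 2 have no training loss to report), and a checkpoint is written only at the final step.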

In [9]:
import os
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
from transformers import Seq2SeqTrainer, TrainerCallback, Seq2SeqTrainingArguments, TrainerState, TrainerControl

class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: Seq2SeqTrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control
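
The callback saves the adapter into a subfolder of the checkpoint directory for the current step. The sketch below reproduces only the path construction; it hard-codes `PREFIX_CHECKPOINT_DIR = "checkpoint"` (the value of the constant imported from `transformers.trainer_utils`), uses `posixpath` so the result is the same on every platform, and the helper name `adapter_dir` and the sample output directory are hypothetical.

```python
import posixpath

# Value of transformers.trainer_utils.PREFIX_CHECKPOINT_DIR, hard-coded so the sketch has no dependencies.
PREFIX_CHECKPOINT_DIR = "checkpoint"


def adapter_dir(output_dir, global_step):
    """Return the folder where the callback would store the adapter for a given step."""
    checkpoint_folder = posixpath.join(output_dir, f"{PREFIX_CHECKPOINT_DIR}-{global_step}")
    return posixpath.join(checkpoint_folder, "adapter_model")


# Hypothetical output directory, for illustration only.
path = adapter_dir("outputs/small-whisper", 100)
print(path)  # outputs/small-whisper/checkpoint-100/adapter_model
```

Storing only the adapter and deleting `pytorch_model.bin`, as the callback does, keeps each checkpoint small because the base model weights are not duplicated.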

In [10]:
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=small_common_voice["train"],
    eval_dataset=small_common_voice["validation"],
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False

In [11]:
trainer.train()
    # resume_from_checkpoint=True)



Epoch,Training Loss,Validation Loss
1,No log,2.019805
2,No log,1.325154



KeyboardInterrupt

