# Loading the data

Load the MInDS-14 dataset.

In [2]:
from datasets import load_dataset, Audio
minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

Split the dataset into training and test sets, then inspect the result.

In [3]:
minds = minds.train_test_split(test_size=0.2)

In [4]:
minds

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})
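
The split sizes shown above follow from the test_size fraction. The sketch below reproduces the arithmetic, assuming scikit-learn-style rounding (the test split is rounded up and the remainder goes to train) and a total of 563 examples, which is simply 450 + 113 from the output. split_sizes is a hypothetical helper, not part of 🤗 Datasets.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test split is rounded up and the train split
    receives whatever is left, as in scikit-learn's train_test_split.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# 563 = 450 + 113, taken from the DatasetDict output above.
n_train, n_test = split_sizes(563, 0.2)
print(n_train, n_test)
```

Because the split is random, the rows that land in each split change from run to run; passing a fixed value for the seed argument of train_test_split should make the split repeatable.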

Remove the columns that are not needed for this task.

In [5]:
minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])
minds

DatasetDict({
    train: Dataset({
        features: ['audio', 'intent_class'],
        num_rows: 450
    })
    test: Dataset({
        features: ['audio', 'intent_class'],
        num_rows: 113
    })
})

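To make it easier for the model to get the label name from the label id, create one dictionary that maps label names to integers and another for the reverse mapping: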

In [6]:
labels = minds["train"].features["intent_class"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

Now you can convert a label id to a label name, or a label name to a label id.

In [27]:
labels

['abroad',
 'address',
 'app_error',
 'atm_limit',
 'balance',
 'business_loan',
 'card_issues',
 'cash_deposit',
 'direct_debit',
 'freeze',
 'high_value_payment',
 'joint_account',
 'latest_transactions',
 'pay_bill']

In [7]:
id2label[str(2)]

'app_error'
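
Note that both dictionaries store ids as strings, so an integer class index has to be converted before the lookup. A minimal sketch of the round trip, using only the first three intent names from the list above and dict comprehensions in place of the loop:

```python
# First three intent names from the list above; the full list has 14 entries.
labels = ["abroad", "address", "app_error"]

# Same mappings as in the loop, written as dict comprehensions.
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

# Ids are stored as strings, so an integer index is converted with str() first.
class_index = 2
name = id2label[str(class_index)]

# Going back from the name gives the string id, which int() turns into a number.
round_trip = int(label2id[name])
print(name, round_trip)
```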

# Preprocessing

Load the Wav2Vec2 feature extractor to process the audio signal.

In [8]:
from transformers import AutoFeatureExtractor
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")



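The dataset has a sampling rate of 8 kHz (8,000 Hz; this is listed in its dataset card), so it needs to be resampled to 16 kHz (16,000 Hz) before it can be used with the pretrained Wav2Vec2 model.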

In [9]:
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds["train"][0]

{'audio': {'path': '/Users/songchangwei/.cache/huggingface/datasets/downloads/extracted/c894aeb32b2a85cb19ba4c03720f1d9ea0733781be16bc26502f0931a30b3010/en-US~ADDRESS/602baba9bb1e6d0fbce921a9.wav',
  'array': array([2.29874277e-04, 1.51120068e-04, 1.46790771e-05, ...,
         1.25871622e-04, 2.26960750e-04, 1.76190690e-04]),
  'sampling_rate': 16000},
 'intent_class': 1}
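
To make the effect of the cast concrete: resampling from 8 kHz to 16 kHz doubles the number of samples while the clip duration stays the same. The sketch below uses naive linear interpolation purely for illustration. resample_linear is a hypothetical helper, and the resampler that 🤗 Datasets actually applies is more sophisticated than this.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler (illustration only)."""
    n_out = round(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate           # position measured in source samples
        lo = min(int(pos), len(samples) - 1)    # left neighbour
        hi = min(lo + 1, len(samples) - 1)      # right neighbour (clamped at the end)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# One second of silence at 8 kHz becomes one second at 16 kHz.
one_second_8k = [0.0] * 8_000
one_second_16k = resample_linear(one_second_8k, 8_000, 16_000)
print(len(one_second_8k), len(one_second_16k))

# A tiny ramp shows the interpolated in-between value.
print(resample_linear([0.0, 1.0], 8_000, 16_000))
```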

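Now create a preprocessing function that:
1. Calls the audio column to load the audio file and, if necessary, resample it.
2. Checks that the sampling rate of the audio file matches the sampling rate of the audio data the model was pretrained on. This is listed in the Wav2Vec2 model card.
3. Sets a maximum input length (max_length=16000 samples, which is one second at 16 kHz) and truncates longer inputs to that length so that the examples can be batched.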

In [10]:
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
    )
    return inputs
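
With max_length=16000 and truncation=True, each clip is cut to at most 16,000 samples, which is one second at 16 kHz. A minimal sketch of that behaviour on plain Python lists follows. truncate_batch is a hypothetical stand-in for what the feature extractor does internally, and it ignores normalization and padding.

```python
def truncate_batch(arrays, max_length):
    """Cut every array to at most max_length samples; shorter ones are unchanged."""
    return [a[:max_length] for a in arrays]

sampling_rate = 16_000
max_length = 16_000  # one second of audio at 16 kHz

# Two dummy clips: 0.5 s and 3 s of silence.
batch = [[0.0] * 8_000, [0.0] * 48_000]
truncated = truncate_batch(batch, max_length)

# Duration in seconds of each clip after truncation.
durations = [len(a) / sampling_rate for a in truncated]
print(durations)
```

If the clips in a task carry important information after the first second, a larger max_length would keep more of each recording at the cost of more memory per batch.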

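To apply the preprocessing function to the whole dataset, use the 🤗 Datasets map function. Setting batched=True speeds up map by processing multiple elements of the dataset at once. Remove the columns you don't need, then rename intent_class to label, because that is the name the model expects: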

In [11]:
minds

DatasetDict({
    train: Dataset({
        features: ['audio', 'intent_class'],
        num_rows: 450
    })
    test: Dataset({
        features: ['audio', 'intent_class'],
        num_rows: 113
    })
})

In [12]:
encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
encoded_minds

DatasetDict({
    train: Dataset({
        features: ['intent_class', 'input_values'],
        num_rows: 450
    })
    test: Dataset({
        features: ['intent_class', 'input_values'],
        num_rows: 113
    })
})

In [13]:
encoded_minds = encoded_minds.rename_column("intent_class", "label")
encoded_minds

DatasetDict({
    train: Dataset({
        features: ['label', 'input_values'],
        num_rows: 450
    })
    test: Dataset({
        features: ['label', 'input_values'],
        num_rows: 113
    })
})

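# Evaluation metric

Including a metric during training is often helpful for evaluating the model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric.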

In [14]:
import evaluate
accuracy = evaluate.load("accuracy")

Then create a function that takes the predictions and labels and computes the metric:


In [15]:
import numpy as np
def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)
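
To see what compute_metrics is calculating, the same quantity can be worked out by hand: take the highest-scoring class in each row of logits and count how often it equals the reference label. The sketch below does this on plain Python lists with made-up logits. argmax and accuracy_score are hypothetical helpers, not the 🤗 Evaluate implementation.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_score(logits, label_ids):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, label_ids))
    return correct / len(label_ids)

# Toy logits for 4 examples and 3 classes (made-up numbers).
logits = [
    [2.0, 0.1, -1.0],   # predicted class 0
    [0.3, 1.5, 0.2],    # predicted class 1
    [0.0, 0.2, 3.0],    # predicted class 2
    [1.0, 0.9, 0.8],    # predicted class 0
]
label_ids = [0, 1, 2, 1]  # the last example is misclassified

print(accuracy_score(logits, label_ids))
```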

# Training the model

Load Wav2Vec2 with AutoModelForAudioClassification, along with the number of expected labels and the label mappings.

In [16]:
from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer
num_labels = len(id2label)
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
)

Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


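At this point, only three steps remain:
1. Define the training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save the model. At the end of each epoch, the Trainer evaluates the accuracy and saves a training checkpoint.
2. Pass the training arguments to Trainer along with the model, the datasets, the feature extractor (as processing_class), and the compute_metrics function.
3. Call train() to fine-tune the model.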

In [18]:
training_args = TrainingArguments(
    output_dir="output_dir",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_minds["train"],
    eval_dataset=encoded_minds["test"],
    compute_metrics=compute_metrics,
    processing_class=feature_extractor,
)

trainer.train()

{'eval_loss': 2.6361842155456543, 'eval_accuracy': 0.07964601769911504, 'eval_runtime': 3.5699, 'eval_samples_per_second': 31.654, 'eval_steps_per_second': 1.12, 'epoch': 0.8}


{'eval_loss': 2.642812967300415, 'eval_accuracy': 0.07964601769911504, 'eval_runtime': 4.7722, 'eval_samples_per_second': 23.679, 'eval_steps_per_second': 0.838, 'epoch': 1.8}


{'eval_loss': 2.6504013538360596, 'eval_accuracy': 0.07079646017699115, 'eval_runtime': 3.6513, 'eval_samples_per_second': 30.948, 'eval_steps_per_second': 1.096, 'epoch': 2.8}
{'loss': 12.1235, 'grad_norm': 11090.04296875, 'learning_rate': 2.222222222222222e-05, 'epoch': 3.27}


{'eval_loss': 2.6556034088134766, 'eval_accuracy': 0.061946902654867256, 'eval_runtime': 3.8449, 'eval_samples_per_second': 29.39, 'eval_steps_per_second': 1.04, 'epoch': 3.8}


{'eval_loss': 2.658604621887207, 'eval_accuracy': 0.061946902654867256, 'eval_runtime': 3.7462, 'eval_samples_per_second': 30.164, 'eval_steps_per_second': 1.068, 'epoch': 4.8}


{'eval_loss': 2.659576177597046, 'eval_accuracy': 0.061946902654867256, 'eval_runtime': 4.787, 'eval_samples_per_second': 23.606, 'eval_steps_per_second': 0.836, 'epoch': 5.8}
{'loss': 12.0699, 'grad_norm': 77954.1796875, 'learning_rate': 1.111111111111111e-05, 'epoch': 6.53}


{'eval_loss': 2.6594295501708984, 'eval_accuracy': 0.05309734513274336, 'eval_runtime': 9.0648, 'eval_samples_per_second': 12.466, 'eval_steps_per_second': 0.441, 'epoch': 6.8}


{'eval_loss': 2.659940719604492, 'eval_accuracy': 0.07079646017699115, 'eval_runtime': 3.4631, 'eval_samples_per_second': 32.629, 'eval_steps_per_second': 1.155, 'epoch': 7.8}


{'eval_loss': 2.6602489948272705, 'eval_accuracy': 0.07079646017699115, 'eval_runtime': 6.7003, 'eval_samples_per_second': 16.865, 'eval_steps_per_second': 0.597, 'epoch': 8.8}
{'loss': 12.0532, 'grad_norm': 104368.3828125, 'learning_rate': 0.0, 'epoch': 9.8}


{'eval_loss': 2.660250425338745, 'eval_accuracy': 0.07079646017699115, 'eval_runtime': 5.1241, 'eval_samples_per_second': 22.053, 'eval_steps_per_second': 0.781, 'epoch': 9.8}
{'train_runtime': 757.8764, 'train_samples_per_second': 5.938, 'train_steps_per_second': 0.04, 'train_loss': 12.082204182942709, 'epoch': 9.8}


TrainOutput(global_step=30, training_loss=12.082204182942709, metrics={'train_runtime': 757.8764, 'train_samples_per_second': 5.938, 'train_steps_per_second': 0.04, 'total_flos': 4.0092549156864e+16, 'train_loss': 12.082204182942709, 'epoch': 9.8})
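
The step count and the logged learning rates can be reproduced by hand. The sketch below assumes a single device, that the number of optimizer steps per epoch is the number of batches divided by gradient_accumulation_steps and rounded down (this matches global_step=30 above, though other Trainer versions may round differently), and the linear warmup and decay schedule that TrainingArguments uses by default. linear_lr is a hypothetical helper.

```python
import math

# Values taken from the dataset output and the TrainingArguments above.
n_train = 450
batch_size = 32
grad_accum = 4
epochs = 10
base_lr = 3e-5
warmup_ratio = 0.1

effective_batch = batch_size * grad_accum             # examples per optimizer step
batches_per_epoch = math.ceil(n_train / batch_size)    # 15 batches
steps_per_epoch = batches_per_epoch // grad_accum      # 3 steps (assumed rounding)
total_steps = steps_per_epoch * epochs                 # 30 steps
warmup_steps = math.ceil(total_steps * warmup_ratio)   # 3 steps

def linear_lr(step):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

print(effective_batch, total_steps, warmup_steps)
for step in (10, 20, 30):   # logging_steps=10
    print(step, linear_lr(step))
```

Under these assumptions the values at steps 10, 20, and 30 agree with the learning_rate entries in the log (about 2.22e-05, 1.11e-05, and 0.0).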

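# Inference

Now that the model has been fine-tuned, you can use it for inference. First load the data to run inference on (for demonstration purposes, the same data is used here).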

In [19]:
from datasets import load_dataset, Audio

dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
sampling_rate = dataset.features["audio"].sampling_rate
audio_file = dataset[0]["audio"]["path"]

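The simplest way to try out the fine-tuned model for inference is to use it in a pipeline(). Instantiate a pipeline for audio classification with the model, and pass the audio file to it.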

In [22]:
from transformers import pipeline

classifier = pipeline("audio-classification", model="output_dir/checkpoint-30")
classifier(audio_file)

Device set to use mps:0


[{'score': 0.08308025449514389, 'label': 'cash_deposit'},
 {'score': 0.08229406177997589, 'label': 'card_issues'},
 {'score': 0.07532459497451782, 'label': 'atm_limit'},
 {'score': 0.07338222116231918, 'label': 'high_value_payment'},
 {'score': 0.07311287522315979, 'label': 'app_error'}]
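
The scores returned by the pipeline are class probabilities, and only the highest-ranked few are shown (five here). A minimal sketch of that post-processing follows, assuming a softmax over the logits and then a top-k selection. The logits are made-up numbers, and softmax and top_k are hypothetical helpers rather than the pipeline's own code.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, id2label, k=5):
    """Return the k most probable classes as score/label dicts, best first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [{"score": probs[i], "label": id2label[i]} for i in ranked[:k]]

# Made-up logits for a 4-class toy problem.
id2label = {0: "abroad", 1: "address", 2: "app_error", 3: "atm_limit"}
logits = [0.1, 1.2, -0.3, 0.8]

results = top_k(softmax(logits), id2label, k=2)
print(results)

# Score each class would get if all 14 intents were equally likely.
uniform_score = 1 / 14
print(round(uniform_score, 4))
```

For reference, a uniform guess over the 14 intent classes gives each class a score of about 0.071, which is close to the scores in the output above.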

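If you prefer not to use the pipeline and want to run the steps manually, the code is as follows:

Load the feature extractor to preprocess the audio file and return the input as PyTorch tensors: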

In [23]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("output_dir/checkpoint-30")
inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

Pass the inputs to the model and return the logits:

In [25]:
from transformers import AutoModelForAudioClassification
import torch

model = AutoModelForAudioClassification.from_pretrained("output_dir/checkpoint-30")
with torch.no_grad():
    logits = model(**inputs).logits

Get the class with the highest probability, and use the model's id2label mapping to convert it to a label:

In [26]:
predicted_class_ids = torch.argmax(logits).item()
predicted_label = model.config.id2label[predicted_class_ids]
predicted_label

'cash_deposit'
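
Note that torch.argmax(logits) without a dim argument returns an index into the flattened tensor, which equals the class id only because the batch here holds a single clip. For several clips at once, the argmax would be taken per row. The sketch below shows that idea on plain Python lists with made-up logits. predict_labels is a hypothetical helper.

```python
def predict_labels(batch_logits, id2label):
    """Map each row of logits to the label of its highest-scoring class."""
    labels = []
    for row in batch_logits:
        class_id = max(range(len(row)), key=row.__getitem__)  # per-row argmax
        labels.append(id2label[class_id])
    return labels

# Toy mapping and logits for a batch of two clips (made-up numbers).
id2label = {0: "abroad", 1: "address", 2: "app_error"}
batch_logits = [
    [0.2, 1.7, -0.5],   # highest score at index 1
    [0.9, 0.1, 0.3],    # highest score at index 0
]
print(predict_labels(batch_logits, id2label))
```

With PyTorch tensors, the equivalent would be torch.argmax(logits, dim=-1), which yields one class id per clip.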