Fine-Tune the Audio Spectrogram Transformer with Hugging Face Transformers
https://towardsdatascience.com/fine-tune-the-audio-spectrogram-transformer-with-transformers-73333c9ef717

In [23]:
from datasets import Dataset, Audio, ClassLabel, Features, load_dataset

In [24]:
esc50 = load_dataset("ashraq/esc50",split="train")


Repo card metadata block was not found. Setting CardData to empty.


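Loading local audio files and labels: We can load audio files and their associated labels into a Dataset object from a dictionary or a pandas DataFrame that contains the file paths and labels. If we have a mapping from class names (strings) to label indices (integers), we can include it when constructing the dataset.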

Preprocess the audio data
If our dataset comes from the Hugging Face Hub, we cast the audio and labels columns to the correct feature types:

In [25]:
import numpy as np
from datasets import Audio, ClassLabel
from transformers import ASTFeatureExtractor
df = esc50.select_columns(["target", "category"]).to_pandas()
class_names = df.iloc[np.unique(df["target"], return_index=True)[1]]["category"].to_list()

# cast target and audio column
esc50 = esc50.cast_column("target", ClassLabel(names=class_names))
esc50 = esc50.cast_column("audio", Audio(sampling_rate=16000))

# rename the target feature
esc50 = esc50.rename_column("target", "labels")
num_labels = len(np.unique(esc50["labels"]))

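Audio casting: The Audio feature handles loading and processing of the audio files and resamples them to the desired sampling rate (16 kHz in this case, the sampling rate used by the ASTFeatureExtractor).
ClassLabel casting: The ClassLabel feature maps integers to labels and vice versa.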
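The class-name lookup in the cell above relies on `np.unique(..., return_index=True)`, which can be hard to read at first. The sketch below reproduces the idea on toy arrays that stand in for the target and category columns; the values are made up for illustration.

```python
import numpy as np

# Toy stand-ins for the "target" and "category" columns (hypothetical values).
targets = np.array([2, 0, 1, 0, 2, 1])
categories = np.array(["rain", "dog", "rooster", "dog", "rain", "rooster"])

# np.unique sorts the distinct targets and, with return_index=True, also
# returns the position of the first row in which each target occurs.
unique_targets, first_rows = np.unique(targets, return_index=True)

# Picking the category at those rows yields class names ordered by target id,
# which is the order ClassLabel(names=...) expects.
class_names = categories[first_rows].tolist()

print(unique_targets.tolist())  # [0, 1, 2]
print(class_names)              # ['dog', 'rooster', 'rain']
```

Because the names are ordered by target id, position i in the list is the name of label i, so the existing integer targets remain valid after the cast.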

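Preparing the AST model inputs: The AST model expects spectrogram inputs, so we need to encode the waveforms into a format the model can process. This is done with the ASTFeatureExtractor, which is instantiated from the configuration of the pretrained model we intend to fine-tune on our dataset.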

In [26]:
# Define the pretrained model and instantiate the feature extractor
pretrained_model = "MIT/ast-finetuned-audioset-10-10-0.4593"
feature_extractor = ASTFeatureExtractor.from_pretrained(pretrained_model)
model_input_name = feature_extractor.model_input_names[0]
SAMPLING_RATE = feature_extractor.sampling_rate

In [27]:
# Preprocessing function
def preprocess_audio(batch):
    wavs = [audio["array"] for audio in batch["input_values"]]
    inputs = feature_extractor(wavs, sampling_rate=SAMPLING_RATE, return_tensors="pt")
    return {model_input_name: inputs.get(model_input_name), "labels": list(batch["labels"])}

In [28]:
dataset = esc50
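# build the name -> index mapping from the public ClassLabel.names list
label2id = {name: idx for idx, name in enumerate(dataset.features["labels"].names)}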

In [29]:
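from datasets import DatasetDict
# a single Dataset has no predefined splits, so we create a stratified held-out test split
if not isinstance(dataset, DatasetDict):
    dataset = dataset.train_test_split(
        test_size=0.2, shuffle=True, seed=0, stratify_by_column="labels")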

In [30]:
import torch
from audiomentations import Compose, AddGaussianSNR, GainTransition, Gain, ClippingDistortion, TimeStretch, PitchShift

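Setting up audio augmentations: To create a set of audio augmentations, we use the Compose class from the Audiomentations library, which lets us chain multiple augmentations.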

In [31]:
audio_augmentations = Compose([
    AddGaussianSNR(min_snr_db=10, max_snr_db=20),
    Gain(min_gain_db=-6, max_gain_db=6),
    GainTransition(min_gain_db=-6, max_gain_db=6, min_duration=0.01, max_duration=0.3, duration_unit="fraction"),
    ClippingDistortion(min_percentile_threshold=0, max_percentile_threshold=30, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.2),
    PitchShift(min_semitones=-4, max_semitones=4),
], p=0.8, shuffle=True)

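The p=0.8 argument of Compose means that the augmentation pipeline as a whole is applied to any given audio sample with 80% probability; the remaining samples pass through unchanged. When the pipeline runs, each transform is applied according to its own p, which defaults to 0.5 in Audiomentations unless set explicitly. This probabilistic approach keeps the training data varied, prevents the model from relying on any specific augmentation pattern, and improves its ability to generalize.
The shuffle=True argument randomizes the order in which the augmentations are applied, adding another layer of variability.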
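A small simulation can make the interaction of the two probability levels concrete. The sketch below is a simplified, hypothetical model of a Compose-style pipeline, not the Audiomentations source code. It assumes a pipeline-level p of 0.8 and a per-transform p of 0.5 (the Audiomentations default when p is not passed), and the transform names are placeholders.

```python
import random

def run_pipeline(transform_probs, pipeline_p, shuffle, rng):
    """Return the names of the transforms that fire for one sample.

    Simplified stand-in for a Compose-style pipeline (hypothetical helper).
    """
    # Pipeline-level gate: with probability 1 - pipeline_p nothing is applied.
    if rng.random() >= pipeline_p:
        return []
    names = list(transform_probs)
    if shuffle:
        rng.shuffle(names)  # randomize the application order
    # Transform-level gate: each transform draws against its own probability.
    return [name for name in names if rng.random() < transform_probs[name]]

rng = random.Random(0)
transform_probs = {"gain": 0.5, "time_stretch": 0.5, "pitch_shift": 0.5}
trials = 20_000
runs = [run_pipeline(transform_probs, 0.8, True, rng) for _ in range(trials)]

# Fraction of samples that no transform touched, and fraction that received gain.
untouched = sum(1 for applied in runs if not applied) / trials
gain_rate = sum(1 for applied in runs if "gain" in applied) / trials
print(f"samples left unchanged: {untouched:.2f}")
print(f"samples with gain applied: {gain_rate:.2f}")
```

Under these assumed settings, a given transform reaches roughly 0.8 × 0.5 = 40% of samples, and roughly 0.2 + 0.8 × 0.5³ = 30% of samples are left untouched, so the model still sees a fair share of clean audio.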

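Integrating the augmentations into the training pipeline: We apply these augmentations in a variant of the preprocess_audio transform, which also encodes the audio data into spectrograms.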

In [32]:
def preprocess_audio_with_transforms(batch):
    wavs = [audio_augmentations(audio["array"], sample_rate=SAMPLING_RATE) for audio in batch["input_values"]]
    inputs = feature_extractor(wavs, sampling_rate=SAMPLING_RATE, return_tensors="pt")
    return {model_input_name: inputs.get(model_input_name), "labels": list(batch["labels"])}

In [33]:
dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
dataset = dataset.rename_column("audio", "input_values")

In [34]:
feature_extractor.do_normalize = False  # we set normalization to False in order to calculate the mean + std of the dataset
mean = []
std = []

# we use the transformation w/o augmentation on the training dataset to calculate the mean + std
dataset["train"].set_transform(preprocess_audio, output_all_columns=False)
for sample in dataset["train"]:
    # each sample is a dict; read the spectrogram once instead of re-indexing the dataset
    spectrogram = sample[model_input_name]
    mean.append(torch.mean(spectrogram))
    std.append(torch.std(spectrogram))

feature_extractor.mean = np.mean(mean)
feature_extractor.std = np.mean(std)
feature_extractor.do_normalize = True

print("Calculated mean and std:", feature_extractor.mean, feature_extractor.std)

Calculated mean and std: -3.3504603 4.387065
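The statistics computed above feed the feature extractor's normalization step. The NumPy sketch below mimics that flow on random arrays standing in for log-mel spectrograms. The array shapes and the distribution parameters are assumptions for illustration, and the formula (x - mean) / (2 * std) reflects my understanding of how ASTFeatureExtractor normalizes its inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for unnormalized spectrograms: (time frames, mel bins).
spectrograms = [
    rng.normal(loc=-4.0, scale=4.5, size=(256, 128)).astype(np.float32)
    for _ in range(8)
]

# Average the per-sample statistics, as in the loop over the training split.
dataset_mean = float(np.mean([s.mean() for s in spectrograms]))
dataset_std = float(np.mean([s.std() for s in spectrograms]))

# AST-style normalization: centre on the mean and divide by twice the std.
normalized = np.stack([(s - dataset_mean) / (2 * dataset_std) for s in spectrograms])

print(f"dataset mean/std: {dataset_mean:.3f} / {dataset_std:.3f}")
print(f"normalized mean/std: {normalized.mean():.3f} / {normalized.std():.3f}")
```

After normalization the values are centred near 0 with a standard deviation near 0.5, which is why the statistics should come from the training split without augmentation rather than from the defaults of the original pretraining data.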


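Setting the transforms for the training and validation splits: Finally, we set these transforms so that they are applied during the training and evaluation phases: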

In [35]:
dataset["train"].set_transform(preprocess_audio_with_transforms,output_all_columns=False) 
dataset["test"].set_transform(preprocess_audio,output_all_columns=False)

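Configure and initialize the AST for fine-tuning
To adapt the AST model to our specific audio classification task, we need to adjust the model's configuration. Our dataset has a different number of classes than the pretrained model, and those classes correspond to different categories, so the existing classifier head has to be replaced with a new one for our multi-class problem.

The weights of the new classifier head are randomly initialized, while the rest of the model's weights are loaded from the pretrained version. This way we benefit from the features learned during pretraining and fine-tune them on our data.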
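To see how the label count shapes the classifier head, the sketch below instantiates a deliberately tiny, randomly initialized AST from a configuration, without downloading any checkpoint. All sizes except num_labels=50 are hypothetical values chosen to keep the example fast; it only illustrates that the final linear layer follows config.num_labels and is not the fine-tuning setup itself.

```python
import torch
from transformers import ASTConfig, ASTForAudioClassification

# Tiny configuration with hypothetical sizes; weights are random, nothing is downloaded.
config = ASTConfig(
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    max_length=64,    # time frames per spectrogram
    num_mel_bins=32,  # frequency bins per frame
    num_labels=50,    # one output per class
)
model = ASTForAudioClassification(config)
model.eval()

# The classification head ends in a linear layer sized by num_labels.
head = model.classifier.dense
print(head.in_features, head.out_features)

# A batch of two dummy spectrograms: (batch, time frames, mel bins).
dummy = torch.zeros(2, config.max_length, config.num_mel_bins)
with torch.no_grad():
    logits = model(input_values=dummy).logits
print(tuple(logits.shape))
```

When a pretrained checkpoint with a different label count is loaded, it is this final layer whose shape no longer matches the stored weights, so it is the part that has to be initialized from scratch.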

In [36]:
import evaluate
from transformers import ASTConfig, ASTForAudioClassification, TrainingArguments, Trainer

In [37]:
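# Load the pretrained configuration and adapt it to our label set
config = ASTConfig.from_pretrained(pretrained_model)
config.num_labels = num_labels
config.label2id = label2id
config.id2label = {v: k for k, v in label2id.items()}
# ignore_mismatched_sizes=True allows the pretrained classifier head to be swapped for a newly initialized one
model = ASTForAudioClassification.from_pretrained(pretrained_model, config=config, ignore_mismatched_sizes=True)
model.init_weights()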

Some weights of ASTForAudioClassification were not initialized from the model checkpoint at MIT/ast-finetuned-audioset-10-10-0.4593 and are newly initialized because the shapes did not match:
- classifier.dense.bias: found shape torch.Size([527]) in the checkpoint and torch.Size([50]) in the model instantiated
- classifier.dense.weight: found shape torch.Size([527, 768]) in the checkpoint and torch.Size([50, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


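Set up the metrics and start training
In the final step, we configure the training process with the 🤗 Transformers library and define the evaluation metrics with the 🤗 Evaluate library to assess the model's performance.

Configure the training arguments: The TrainingArguments class sets the parameters of the training run, such as the learning rate, batch size, and number of epochs.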

In [38]:
from transformers import TrainingArguments

# Configure training run with TrainingArguments class
training_args = TrainingArguments(
    output_dir="./runs/ast_classifier",
    logging_dir="./logs/ast_classifier",
    report_to="tensorboard",
    learning_rate=5e-5,  # Learning rate
    push_to_hub=False,
    num_train_epochs=10,  # Number of epochs
    per_device_train_batch_size=8,  # Batch size per device
    eval_strategy="epoch",  # Evaluation strategy
    save_strategy="epoch",
    eval_steps=1,
    save_steps=1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    logging_strategy="steps",
    logging_steps=20,
)

In [39]:
import evaluate

In [40]:
accuracy = evaluate.load("accuracy") 
recall = evaluate.load("recall") 
precision = evaluate.load("precision") 
f1 = evaluate.load("f1") 
AVERAGE = "macro" if config.num_labels > 2 else "binary" 

def compute_metrics(eval_pred):
    logits = eval_pred.predictions
    predictions = np.argmax(logits, axis=1)
    metrics = accuracy.compute(predictions=predictions, references=eval_pred.label_ids)
    metrics.update(precision.compute(predictions=predictions, references=eval_pred.label_ids, average=AVERAGE))
    metrics.update(recall.compute(predictions=predictions, references=eval_pred.label_ids, average=AVERAGE))
    metrics.update(f1.compute(predictions=predictions, references=eval_pred.label_ids, average=AVERAGE))
    return metrics
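For a multi-class problem the metrics above use macro averaging: precision and recall are computed per class and then averaged with equal weight. The NumPy sketch below works through this by hand on a toy example; the logits and labels are made up, and every class is assumed to be predicted at least once so that no division by zero occurs.

```python
import numpy as np

# Toy references and one-hot "logits" for 6 clips and 3 classes (hypothetical values).
references = np.array([0, 0, 1, 1, 2, 2])
logits = np.eye(3)[[0, 1, 1, 1, 2, 0]]
predictions = np.argmax(logits, axis=1)  # same reduction as in compute_metrics

num_classes = logits.shape[1]
precisions, recalls = [], []
for cls in range(num_classes):
    true_pos = np.sum((predictions == cls) & (references == cls))
    predicted = np.sum(predictions == cls)  # clips predicted as this class
    actual = np.sum(references == cls)      # clips that truly belong to it
    precisions.append(true_pos / predicted)
    recalls.append(true_pos / actual)

# Macro averaging gives every class the same weight, regardless of its size.
accuracy_value = float(np.mean(predictions == references))
macro_precision = float(np.mean(precisions))
macro_recall = float(np.mean(recalls))

print(f"accuracy={accuracy_value:.3f}")
print(f"macro precision={macro_precision:.3f}, macro recall={macro_recall:.3f}")
```

Equal class weighting suits a balanced dataset and also keeps rare classes visible in the score when a dataset is imbalanced.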

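Set up the Trainer: We use Hugging Face's Trainer class to handle the training process. This class brings together the model, the training arguments, the datasets, and the metrics.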

In [41]:
trainer = Trainer(
    model=model,
    args=training_args,  # we use our configured training arguments
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,  # we use the metrics function from above
)

In [42]:
trainer.train()

FailedPreconditionError: . is not a directory