# Intelligent Speech Processing, Lab 2: Wav2Vec2

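**Wav2Vec 2.0** is a pretrained model for automatic speech recognition (ASR), released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-struct-of-Speech-from-raw-audio/) by Alexei Baevski, Michael Auli, and Alex Conneau. Soon after Wav2Vec2 demonstrated strong performance on the English ASR dataset LibriSpeech, *Facebook AI* presented XLSR-Wav2Vec2 (see [here](https://arxiv.org/abs/2006.13979)). XLSR stands for "cross-lingual speech representations" and refers to the model's ability to learn speech representations that are useful across multiple languages.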

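Like Wav2Vec2, XLSR-Wav2Vec2 learns powerful speech representations from hundreds of thousands of hours of unlabeled speech in more than 50 languages. Similar to [BERT's masked language modeling](http://jalammar.github.io/illusterated-bert/), the model learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network.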
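To make the masking idea concrete, here is a minimal NumPy sketch. It is a simplification and not the model's actual implementation: the function name `mask_spans`, the start probability, the span length, and the all-zero mask vector are illustrative assumptions (the real model replaces masked positions with a learned embedding).

```python
import numpy as np

def mask_spans(features, start_prob=0.065, span=10, seed=0):
    """Replace random spans of time steps with a mask vector.

    features: array of shape (time, hidden).
    Returns the masked copy and a boolean mask of shape (time,).
    """
    rng = np.random.default_rng(seed)
    num_steps = features.shape[0]
    # Each time step starts a masked span with probability start_prob.
    starts = np.flatnonzero(rng.random(num_steps) < start_prob)
    masked = np.zeros(num_steps, dtype=bool)
    for s in starts:
        masked[s:s + span] = True  # spans may overlap or be cut off at the end
    out = features.copy()
    out[masked] = 0.0  # stand-in for the learned mask embedding
    return out, masked

features = np.random.default_rng(1).normal(size=(200, 8))
masked_features, mask = mask_spans(features)
print(f"masked {mask.sum()} of {len(mask)} time steps")
```

During pretraining the transformer only sees the masked sequence and has to identify the correct content at the masked positions, which is what forces it to use context.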

![wav2vec2_structure](./img/xlsr_wav2vec2.png)

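The authors showed for the first time that pretraining an ASR model at scale on cross-lingual unlabeled speech, followed by language-specific fine-tuning on very little labeled data, achieves state-of-the-art results. See Tables 1-5 of the official [paper](https://arxiv.org/pdf/2006.13979.pdf).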

In this notebook, we use this model to classify audio clips.

## Load Data

We use the [Eating Sound Collection](https://www.kaggle.com/mashijie/eating-sound-collection) dataset by [Jeannette Shijie Ma](https://www.kaggle.com/mashijie).


First, import the required Python libraries.

In [1]:
import numpy as np
import pandas as pd

from pathlib import Path
from tqdm import tqdm

import torchaudio
from sklearn.model_selection import train_test_split

import os
import sys

In [2]:
data = []

for path in tqdm(Path("./content/data/clips_rd/").glob("**/*.wav")):
    label = path.parent.name  # the parent directory name is the class label
    # Skip corrupted files: keep a clip only if torchaudio can load it
    try:
        torchaudio.load(str(path))
        data.append({
            "path": path,
            "label": label
        })
    except Exception as e:
        print(str(path), e)

11140it [00:40, 271.83it/s]


Inspect the WAV files.

In [3]:
df = pd.DataFrame(data)
df.head()

| | path | label |
|---|---|---|
| 0 | content/data/clips_rd/pizza/pizza_7_63.wav | pizza |
| 1 | content/data/clips_rd/pizza/pizza_11_02.wav | pizza |
| 2 | content/data/clips_rd/pizza/pizza_8_31.wav | pizza |
| 3 | content/data/clips_rd/pizza/pizza_6_04.wav | pizza |
| 4 | content/data/clips_rd/pizza/pizza_7_25.wav | pizza |


In [4]:
# Drop rows whose file no longer exists
print(f"before delete: {len(df)}")
df = df[df["path"].apply(os.path.exists)]
print(f"after delete: {len(df)}")

# Shuffle the rows
df = df.sample(frac=1).reset_index(drop=True)
df.head()

before delete: 11140


after delete: 11140


| | path | label |
|---|---|---|
| 0 | content/data/clips_rd/grapes/grapes_3_22.wav | grapes |
| 1 | content/data/clips_rd/chips/chips_9_29.wav | chips |
| 2 | content/data/clips_rd/candied_fruits/candied_f... | candied_fruits |
| 3 | content/data/clips_rd/cabbage/cabbage_3_01.wav | cabbage |
| 4 | content/data/clips_rd/gummies/gummies_12_50.wav | gummies |


List the labels and the number of samples per label.

In [56]:
labels = df["label"].unique()
i = 0
for label in labels:
    i += 1
    print(f"{label}: {len(df[df['label'] == label])}", end=", ")
    if i%5 == 0:
        print()
# df_count = df.groupby("label").count()

grapes: 579, chips: 719, candied_fruits: 806, cabbage: 499, gummies: 678, 
ice-cream: 727, pizza: 609, fries: 644, jelly: 442, carrots: 660, 
aloe: 546, pickles: 872, drinks: 292, chocolate: 290, soup: 278, 
burger: 595, wings: 504, ribs: 488, noodles: 411, salmon: 501, 


Build the training and test sets from the dataset; the test set takes 20% of the data.

In [6]:
save_path = "./content/data"

train_df, test_df = train_test_split(df, test_size=0.2, random_state=101, stratify=df["label"])
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

train_df.to_csv(f"{save_path}/train.csv", sep="\t", encoding="utf-8", index=False)
test_df.to_csv(f"{save_path}/test.csv", sep="\t", encoding="utf-8", index=False)

print(train_df.shape)
print(test_df.shape)

(8912, 2)
(2228, 2)
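As a quick sanity check on the stratified split, the per-class counts printed earlier can be scaled by the 20% test ratio. This sketch assumes simple rounding per class, which only approximates what `stratify` does; the exact per-class allocation is decided by scikit-learn.

```python
# Per-class clip counts, copied from the label listing above.
class_counts = {
    "grapes": 579, "chips": 719, "candied_fruits": 806, "cabbage": 499,
    "gummies": 678, "ice-cream": 727, "pizza": 609, "fries": 644,
    "jelly": 442, "carrots": 660, "aloe": 546, "pickles": 872,
    "drinks": 292, "chocolate": 290, "soup": 278, "burger": 595,
    "wings": 504, "ribs": 488, "noodles": 411, "salmon": 501,
}
test_ratio = 0.2

# Approximate number of test clips per class under a stratified split.
expected_test = {label: round(n * test_ratio) for label, n in class_counts.items()}

total = sum(class_counts.values())
total_test = sum(expected_test.values())
print(total, total_test, total - total_test)  # 11140 2228 8912
```

The totals agree with the shapes printed above, and every class keeps roughly the same share in both splits.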


## Prepare Data for Training

### Load the Dataset

In [7]:
# Load the CSV files created above with the datasets library
# (load_metric is not used in this notebook and is absent from recent datasets releases)
from datasets import load_dataset

data_files = {
    "train": "./content/data/train.csv", 
    "validation": "./content/data/test.csv",
}

dataset = load_dataset("csv", data_files=data_files, delimiter="\t", )
train_dataset = dataset["train"]
eval_dataset = dataset["validation"]

print(train_dataset)
print(eval_dataset)

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['path', 'label'],
    num_rows: 8912
})
Dataset({
    features: ['path', 'label'],
    num_rows: 2228
})


In [8]:
input_column = "path"
output_column = "label"
label_list = train_dataset.unique(output_column)
label_list.sort()  
num_labels = len(label_list)
print(f"A classification problem with {num_labels} classes: {label_list}")

A classification problem with 20 classes: ['aloe', 'burger', 'cabbage', 'candied_fruits', 'carrots', 'chips', 'chocolate', 'drinks', 'fries', 'grapes', 'gummies', 'ice-cream', 'jelly', 'noodles', 'pickles', 'pizza', 'ribs', 'salmon', 'soup', 'wings']


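To preprocess the audio we first have to choose the Wav2Vec2 checkpoint; here we use `wav2vec2-base-100k-voxpopuli`.
The model produces one context vector per time frame, so audio of any length yields a 3D tensor (batch, time, hidden). We use a pooling strategy (`pooling_mode`) to collapse the time axis and obtain a 2D tensor (batch, hidden).

There are three pooling strategies: "mean", "sum", and "max". In this example, mean pooling gave the better results. Next, we load the configuration and the feature extractor for this checkpoint.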
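The three pooling modes can be sketched in a few lines. This is a NumPy stand-in for illustration; the model defined later in this notebook applies the same reductions with `torch`, and the tensor sizes below are made-up values.

```python
import numpy as np

def pool(hidden_states, mode="mean"):
    """Collapse the time axis of a (batch, time, hidden) array."""
    if mode == "mean":
        return hidden_states.mean(axis=1)
    if mode == "sum":
        return hidden_states.sum(axis=1)
    if mode == "max":
        return hidden_states.max(axis=1)
    raise ValueError(f"unknown pooling mode: {mode!r}")

# Toy batch: 2 clips, 237 frames each, hidden size 768 (illustrative sizes).
hidden_states = np.random.default_rng(0).normal(size=(2, 237, 768))
for mode in ("mean", "sum", "max"):
    print(mode, pool(hidden_states, mode).shape)
```

Whatever the number of frames, each clip ends up as a single vector of the hidden size, which is what the classification head expects.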

In [9]:
from transformers import AutoConfig, Wav2Vec2Processor, Wav2Vec2FeatureExtractor

In [10]:
model_name_or_path = "/home/xukeyan/.cache/huggingface/transformers/facebook/wav2vec2-base-100k-voxpopuli"
pooling_mode = "mean"

In [11]:
config = AutoConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    label2id={label: i for i, label in enumerate(label_list)},
    id2label={i: label for i, label in enumerate(label_list)},
    finetuning_task="wav2vec2_clf",
)
# Add the pooling_mode attribute to the config
setattr(config, 'pooling_mode', pooling_mode)

In [12]:
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path,)
target_sampling_rate = feature_extractor.sampling_rate
print(f"The target sampling rate: {target_sampling_rate}")

The target sampling rate: 16000


### Preprocess Data

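Now we need to load the audio at each path, extract its features, and feed them to our classification model.

Audio files usually store both the sample values and the sampling rate at which the speech signal was digitized, so we write a function for **map(...)** that resamples the audio to 16 kHz and converts each label to its integer id.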

In [13]:
def speech_file_to_array_fn(path):
    speech_array, sampling_rate = torchaudio.load(path)
    resampler = torchaudio.transforms.Resample(sampling_rate, target_sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech

def label_to_id(label, label_list):
    if len(label_list) > 0:
        return label_list.index(label) if label in label_list else -1
    return label

def preprocess_function(examples):
    speech_list = [speech_file_to_array_fn(path) for path in examples[input_column]]
    target_list = [label_to_id(label, label_list) for label in examples[output_column]]

    result = feature_extractor(speech_list, sampling_rate=target_sampling_rate)
    result["labels"] = list(target_list)

    return result
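To illustrate what resampling to the target rate does to the array length, here is a NumPy sketch based on linear interpolation. It is a deliberately simple stand-in and not the algorithm that `torchaudio.transforms.Resample` uses; `resample_linear` is a hypothetical helper introduced only for this example.

```python
import numpy as np

def resample_linear(signal, orig_sr, target_sr):
    """Resample a 1-D signal by linear interpolation (illustration only)."""
    duration = len(signal) / orig_sr
    num_out = int(round(duration * target_sr))
    # Time stamps (in seconds) of the input and output samples.
    t_in = np.arange(len(signal)) / orig_sr
    t_out = np.arange(num_out) / target_sr
    return np.interp(t_out, t_in, signal)

# One second of a 440 Hz tone recorded at 44.1 kHz.
orig_sr, target_sr = 44100, 16000
t = np.arange(orig_sr) / orig_sr
tone = np.sin(2 * np.pi * 440 * t)

resampled = resample_linear(tone, orig_sr, target_sr)
print(len(tone), "->", len(resampled))
```

The duration stays the same while the number of samples scales with the ratio of the two rates, so a one-second clip always becomes 16000 values at 16 kHz.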

In [14]:
# Subsample for a quick run; remove this cell to train on the full dataset
max_train_samples = min(1000, len(train_dataset))
max_eval_samples = min(200, len(eval_dataset))
train_dataset = train_dataset.select(range(max_train_samples))
eval_dataset = eval_dataset.select(range(max_eval_samples))

In [15]:
# print(len(train_dataset["input_values"][0]))

In [16]:
train_dataset = train_dataset.map(preprocess_function,batch_size=10,
                                    batched=True,num_proc=16)
eval_dataset = eval_dataset.map(preprocess_function,batch_size=10,
                                    batched=True,num_proc=16)

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/200 [00:00<?, ? examples/s]

In [17]:
print(train_dataset)
print(len(train_dataset["input_values"][0]))

Dataset({
    features: ['path', 'label', 'input_values', 'labels'],
    num_rows: 1000
})
76006


In [18]:
idx = 0
# print(f"Training input_values: {train_dataset[idx]['input_values']}")
# print(f"Training attention_mask: {train_dataset[idx]['attention_mask']}")
print(f"Training labels: {train_dataset[idx]['labels']} - {train_dataset[idx]['label']}")

Training labels: 4 - carrots


## Model

Before diving into training, we need to build the classification model on top of the pooling strategy.

In [19]:
from dataclasses import dataclass
from typing import Optional, Tuple
import torch
from transformers.file_utils import ModelOutput


@dataclass
class SpeechClassifierOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    logits: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None


In [20]:
import torch
import torch.nn as nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2PreTrainedModel,
    Wav2Vec2Model
)


class Wav2Vec2ClassificationHead(nn.Module):
    """Head for wav2vec classification task."""
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x


class Wav2Vec2ForSpeechClassification(Wav2Vec2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.pooling_mode = config.pooling_mode
        self.config = config

        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = Wav2Vec2ClassificationHead(config)

        self.init_weights()

    def freeze_feature_extractor(self):
        self.wav2vec2.feature_extractor._freeze_parameters()

    def merged_strategy(
            self,
            hidden_states,
            mode="mean"
    ):
        if mode == "mean":
            outputs = torch.mean(hidden_states, dim=1)
        elif mode == "sum":
            outputs = torch.sum(hidden_states, dim=1)
        elif mode == "max":
            outputs = torch.max(hidden_states, dim=1)[0]
        else:
            raise Exception(
                "The pooling method hasn't been defined! "
                "Your pooling mode must be one of these ['mean', 'sum', 'max']")

        return outputs

    def forward(
            self,
            input_values,
            attention_mask=None,
            output_attentions=None,
            output_hidden_states=None,
            return_dict=None,
            labels=None,
    ):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        outputs = self.wav2vec2(
            input_values,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = outputs[0]
        hidden_states = self.merged_strategy(hidden_states, mode=self.pooling_mode)
        logits = self.classifier(hidden_states)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SpeechClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )


## Training

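Now that the data is processed, we can set up the training pipeline. We use the `Trainer` for training.

This requires the following:

- Define a data collator. Compared with most NLP models, XLSR-Wav2Vec2 has an input length far larger than its output length. *For example*, an input of 50000 samples yields only about 150 output frames, because the convolutional feature encoder downsamples by a factor of roughly 320. Given the large input sizes, it is much more efficient to pad training batches dynamically, meaning that each sample is padded only to the longest sample in its batch, not to the longest sample overall. Fine-tuning XLSR-Wav2Vec2 therefore requires a special padding data collator, defined below.

- Evaluation metric. During training, the model should be evaluated on a task metric (accuracy here). We define a `compute_metrics` function accordingly.

- Load a pretrained checkpoint. We need to load the pretrained checkpoint and configure it correctly for training.

- Define the training configuration.

After fine-tuning the model, we evaluate it on the test data.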
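The gap between input and output length follows from the strided convolutions in the feature encoder. The sketch below assumes the kernel sizes and strides of the standard Wav2Vec2 base configuration (`conv_kernel` and `conv_stride` in the Hugging Face config); check the config of your checkpoint if it differs.

```python
# Assumed values: the default conv_kernel / conv_stride of the Wav2Vec2 base config.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)

def num_output_frames(num_samples):
    """Number of frames the convolutional encoder emits for num_samples inputs."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length

for n in (16000, 50000, 76006):
    print(n, "samples ->", num_output_frames(n), "frames")
```

Under these assumptions one second of 16 kHz audio becomes 49 frames, and the 76006-sample clip shown earlier becomes 237 frames, so padding every clip to the longest one in the whole dataset would waste a lot of computation.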

### Set-up Trainer

#### data collator

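Without going into too many details: in contrast to the common data collators, this one treats `input_values` and `labels` differently. The `input_values` are padded with the feature extractor's `pad` method, because in speech the input and output are of different modalities and should not be handled by the same padding function. As in common data collators, label padding would be marked with `-100` so that it is **not** taken into account when computing the loss; in this notebook each label is a single class id, so the collator simply stacks the labels into a tensor.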
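Dynamic padding itself is simple. The following NumPy sketch shows the idea on toy data; `pad_batch` is a hypothetical helper, not part of `transformers`, and the real collator below delegates this work to `feature_extractor.pad`.

```python
import numpy as np

def pad_batch(sequences, pad_value=0.0):
    """Pad 1-D arrays to the longest length in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad_value, dtype=np.float32)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = seq
        attention_mask[i, :len(seq)] = 1  # 1 marks real samples, 0 marks padding
    return batch, attention_mask

# Three clips of different lengths (toy sizes) with their class ids.
clips = [np.ones(5), np.ones(3), np.ones(8)]
labels = [4, 0, 15]

batch, attention_mask = pad_batch(clips)
label_tensor = np.asarray(labels, dtype=np.int64)  # labels need no padding
print(batch.shape, attention_mask.sum(axis=1), label_tensor)
```

The batch is only as wide as its longest clip, so a batch of short clips stays small even when the dataset contains much longer recordings.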

In [21]:
from dataclasses import dataclass
from typing import Dict, List, Optional, Union
import torch

import transformers
from transformers import Wav2Vec2Processor, Wav2Vec2FeatureExtractor


@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        feature_extractor (:class:`~transformers.Wav2Vec2FeatureExtractor`)
            The feature_extractor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a
              single sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'`: No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.0 (Volta).
    """
    feature_extractor: Wav2Vec2FeatureExtractor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [feature["labels"] for feature in features]
        d_type = torch.long if isinstance(label_features[0], int) else torch.float
        batch = self.feature_extractor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        batch["labels"] = torch.tensor(label_features, dtype=d_type)
        return batch


In [22]:
data_collator = DataCollatorCTCWithPadding(feature_extractor=feature_extractor, padding=True)

#### evaluation metric

We use **accuracy** for classification and **MSE** for regression.

In [23]:
is_regression = False

In [24]:
import numpy as np
from transformers import EvalPrediction


def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)

    if is_regression:
        return {"mse": ((preds - p.label_ids) ** 2).mean().item()}
    else:
        return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}
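To see what the classification branch of `compute_metrics` computes, here is the same argmax-and-compare logic on a handful of made-up logits (the numbers are illustrative, not model output).

```python
import numpy as np

# Toy logits for 4 clips and 3 classes, plus the true class ids.
logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.3, 0.2, 0.9],
    [1.1, 0.4, 0.2],
])
label_ids = np.array([0, 1, 2, 1])

preds = np.argmax(logits, axis=1)  # predicted class id per clip
accuracy = (preds == label_ids).astype(np.float32).mean().item()
print(preds, accuracy)  # [0 1 2 0] 0.75
```

Three of the four predictions match the labels, so the reported accuracy is 0.75.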

#### load pretraining model

In [25]:
model = Wav2Vec2ForSpeechClassification.from_pretrained(
    model_name_or_path,
    config=config,
)

Some weights of Wav2Vec2ForSpeechClassification were not initialized from the model checkpoint at /home/xukeyan/.cache/huggingface/transformers/facebook/wav2vec2-base-100k-voxpopuli and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
model.freeze_feature_extractor()

### define parameters for training

In [28]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./content/wav2vec2-base-100k-eating-sound-collection",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=1.0,
    # fp16=True,
    save_steps=10,
    eval_steps=10,
    logging_steps=10,
    learning_rate=1e-4,
    save_total_limit=2,
)

  return torch._C._cuda_getDeviceCount() > 0


In [29]:
from typing import Any, Dict, Union

import torch
from packaging import version
from torch import nn

from transformers import (
    Trainer,
    is_apex_available,
)

if is_apex_available():
    from apex import amp

if version.parse(torch.__version__) >= version.parse("1.6"):
    _is_native_amp_available = True
    from torch.cuda.amp import autocast


class CTCTrainer(Trainer):
    def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor:
        """
        Perform a training step on a batch of inputs.

        Subclass and override to inject custom behavior.

        Args:
            model (:obj:`nn.Module`):
                The model to train.
            inputs (:obj:`Dict[str, Union[torch.Tensor, Any]]`):
                The inputs and targets of the model.

                The dictionary will be unpacked before being fed to the model. Most models expect the targets under the
                argument :obj:`labels`. Check your model's documentation for all accepted arguments.

        Return:
            :obj:`torch.Tensor`: The tensor with training loss on this batch.
        """
        model.train()
        inputs = self._prepare_inputs(inputs)
        loss = self.compute_loss(model, inputs)

        if self.args.gradient_accumulation_steps > 1:
            loss = loss / self.args.gradient_accumulation_steps

        loss.backward()
        return loss.detach()


### trainer

In [30]:
trainer = CTCTrainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=feature_extractor,
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


### Training

In [31]:
trainer.train()

| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 10 | 2.9915 | 2.982868 | 0.05 |
| 20 | 2.9548 | 2.98294 | 0.065 |
| 30 | 2.9997 | 2.9828 | 0.065 |
| 40 | 3.0005 | 2.979535 | 0.085 |
| 50 | 2.9741 | 2.978915 | 0.085 |
| 60 | 2.9412 | 2.97833 | 0.08 |


TrainOutput(global_step=62, training_loss=2.9760935844913607, metrics={'train_runtime': 357.8558, 'train_samples_per_second': 2.794, 'train_steps_per_second': 0.173, 'total_flos': 6.102377894503526e+16, 'train_loss': 2.9760935844913607, 'epoch': 0.99})
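A useful reference point when reading this log is the loss of a classifier that guesses uniformly. With 20 classes, the cross-entropy of a uniform prediction is ln 20, so a loss near 2.98 and an accuracy of 5-9% suggest the model is still close to chance after this short run on the 1000-clip subset. The sketch only computes that baseline; it makes no claim about why the loss has not dropped yet.

```python
import math

num_classes = 20

# Cross-entropy of a classifier that assigns probability 1/20 to every class.
chance_loss = -math.log(1.0 / num_classes)
chance_accuracy = 1.0 / num_classes

print(f"chance loss: {chance_loss:.4f}, chance accuracy: {chance_accuracy:.2f}")
```

Comparing a training curve against this baseline is a quick way to tell whether more epochs or more data are needed before evaluating.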

## Evaluation

In [32]:
import librosa
from sklearn.metrics import classification_report

In [33]:
test_dataset = load_dataset("csv", data_files={"test": "./content/data/test.csv"}, delimiter="\t")["test"]
test_dataset

Generating test split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['path', 'label'],
    num_rows: 2228
})

In [34]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

Device: cpu


In [35]:
model_name_or_path = "/home/xukeyan/.cache/huggingface/transformers/m3hrdadfi/wav2vec2-base-100k-eating-sound-collection"
config = AutoConfig.from_pretrained(model_name_or_path)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)

Some weights of the model checkpoint at /home/xukeyan/.cache/huggingface/transformers/m3hrdadfi/wav2vec2-base-100k-eating-sound-collection were not used when initializing Wav2Vec2ForSpeechClassification: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForSpeechClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSpeechClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSpeechClassification were not initialized from the model checkpoint at /home/xukeyan/.cache/huggingface/transformers/m3hrdadfi/wav2vec2-base-100k-eating-sound-collection and

In [36]:
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array = speech_array.squeeze().numpy()
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=feature_extractor.sampling_rate)

    batch["speech"] = speech_array
    return batch


def predict(batch):
    features = feature_extractor(batch["speech"], sampling_rate=feature_extractor.sampling_rate, return_tensors="pt", padding=True)

    input_values = features.input_values.to(device)

    with torch.no_grad():
        logits = model(input_values).logits 

    pred_ids = torch.argmax(logits, dim=-1).detach().cpu().numpy()
    batch["predicted"] = pred_ids
    return batch

In [37]:
test_dataset = test_dataset.map(speech_file_to_array_fn)

Map:   0%|          | 0/2228 [00:00<?, ? examples/s]

In [38]:
result = test_dataset.map(predict, batched=True, batch_size=8)



Map:   0%|          | 0/2228 [00:00<?, ? examples/s]

In [39]:
label_names = [config.id2label[i] for i in range(config.num_labels)]
label_names

['aloe',
 'burger',
 'cabbage',
 'candied_fruits',
 'carrots',
 'chips',
 'chocolate',
 'drinks',
 'fries',
 'grapes',
 'gummies',
 'ice-cream',
 'jelly',
 'noodles',
 'pickles',
 'pizza',
 'ribs',
 'salmon',
 'soup',
 'wings']

In [40]:
y_true = [config.label2id[name] for name in result["label"]]
y_pred = result["predicted"]

print(y_true[:5])
print(y_pred[:5])

[2, 15, 0, 3, 18]
[2, 15, 0, 3, 18]


In [41]:
print(classification_report(y_true, y_pred, target_names=label_names))

                precision    recall  f1-score   support

          aloe       0.99      0.87      0.93       109
        burger       1.00      0.48      0.65       119
       cabbage       0.91      0.95      0.93       100
candied_fruits       0.96      0.99      0.98       161
       carrots       0.98      0.98      0.98       132
         chips       1.00      0.97      0.98       144
     chocolate       0.85      0.98      0.91        58
        drinks       1.00      0.98      0.99        58
         fries       0.99      0.88      0.93       129
        grapes       0.98      0.97      0.98       116
       gummies       0.93      0.95      0.94       136
     ice-cream       0.97      0.99      0.98       145
         jelly       0.91      0.95      0.93        88
       noodles       0.88      0.91      0.90        82
       pickles       0.98      1.00      0.99       174
         pizza       0.75      0.99      0.85       122
          ribs       0.89      0.89      0.89  
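The f1-score column is the harmonic mean of precision and recall. As a small worked check, the sketch below recomputes it from the rounded precision and recall printed in the report, so the results can differ slightly from values computed on unrounded numbers.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as printed (already rounded) in the report above.
print(f"burger: {f1_score(1.00, 0.48):.2f}")
print(f"pizza:  {f1_score(0.75, 0.99):.2f}")
```

The burger row shows why both numbers matter: precision is perfect, but fewer than half of the burger clips are found, which pulls the f1-score down.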

## Prediction

In [42]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import AutoConfig, Wav2Vec2FeatureExtractor

import librosa
import IPython.display as ipd
import numpy as np
import pandas as pd

In [43]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name_or_path = "/home/xukeyan/.cache/huggingface/transformers/m3hrdadfi/wav2vec2-base-100k-eating-sound-collection"
config = AutoConfig.from_pretrained(model_name_or_path)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
sampling_rate = feature_extractor.sampling_rate
model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)

Some weights of the model checkpoint at /home/xukeyan/.cache/huggingface/transformers/m3hrdadfi/wav2vec2-base-100k-eating-sound-collection were not used when initializing Wav2Vec2ForSpeechClassification: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForSpeechClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSpeechClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSpeechClassification were not initialized from the model checkpoint at /home/xukeyan/.cache/huggingface/transformers/m3hrdadfi/wav2vec2-base-100k-eating-sound-collection and

In [44]:
def speech_file_to_array_fn(path, sampling_rate):
    speech_array, _sampling_rate = torchaudio.load(path)
    # Pass the target rate explicitly instead of relying on Resample's default
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech


def predict(path, sampling_rate):
    speech = speech_file_to_array_fn(path, sampling_rate)
    features = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)

    input_values = features.input_values.to(device)

    with torch.no_grad():
        logits = model(input_values).logits

    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]
    outputs = [{"Label": config.id2label[i], "Score": f"{round(score * 100, 3):.1f}%"} for i, score in enumerate(scores)]
    return outputs


STYLES = """
<style>
div.display_data {
    margin: 0 auto;
    max-width: 500px;
}
table.xxx {
    margin: 50px !important;
    float: right !important;
    clear: both !important;
}
table.xxx td {
    min-width: 300px !important;
    text-align: center !important;
}
</style>
""".strip()

def prediction(df_row):
    path, label = df_row["path"], df_row["label"]
    df = pd.DataFrame([{"Label": label, "Sentence": "    "}])
    setup = {
        'border': 2,
        'show_dimensions': True,
        'justify': 'center',
        'classes': 'xxx',
        'escape': False,
    }
    ipd.display(ipd.HTML(STYLES + df.to_html(**setup) + "<br />"))
    speech, sr = torchaudio.load(path)
    speech = speech[0].numpy().squeeze()
    speech = librosa.resample(np.asarray(speech), orig_sr=sr, target_sr=sampling_rate)
    ipd.display(ipd.Audio(data=np.asarray(speech), autoplay=True, rate=sampling_rate))

    outputs = predict(path, sampling_rate)
    r = pd.DataFrame(outputs)
    ipd.display(ipd.HTML(STYLES + r.to_html(**setup) + "<br />"))
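The `predict` function above turns logits into percentages with a softmax. The NumPy sketch below shows that step in isolation; the logits and the three-class `id2label` mapping are hypothetical values chosen for the example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for one clip over three classes.
id2label = {0: "aloe", 1: "burger", 2: "cabbage"}
logits = np.array([[0.5, -1.0, 9.0]])

scores = softmax(logits)[0]
outputs = [{"Label": id2label[i], "Score": f"{s * 100:.1f}%"} for i, s in enumerate(scores)]
print(outputs)
```

A large gap between the top logit and the rest is what produces the near-100% scores seen in the outputs below.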

In [45]:
test = pd.read_csv("./content/data/test.csv", sep="\t")
test.head()

| | path | label |
|---|---|---|
| 0 | content/data/clips_rd/cabbage/cabbage_11_04.wav | cabbage |
| 1 | content/data/clips_rd/pizza/pizza_5_64.wav | pizza |
| 2 | content/data/clips_rd/aloe/aloe_7_08.wav | aloe |
| 3 | content/data/clips_rd/candied_fruits/candied_f... | candied_fruits |
| 4 | content/data/clips_rd/soup/soup_8_22.wav | soup |


In [46]:
prediction(test.iloc[0])

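| | Label | Sentence |
|---|---|---|
| 0 | cabbage | |

| | Label | Score |
|---|---|---|
| 0 | aloe | 0.0% |
| 1 | burger | 0.0% |
| 2 | cabbage | 100.0% |
| 3 | candied_fruits | 0.0% |
| 4 | carrots | 0.0% |
| 5 | chips | 0.0% |
| 6 | chocolate | 0.0% |
| 7 | drinks | 0.0% |
| 8 | fries | 0.0% |
| 9 | grapes | 0.0% |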


In [47]:
prediction(test.iloc[1])

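| | Label | Sentence |
|---|---|---|
| 0 | pizza | |

| | Label | Score |
|---|---|---|
| 0 | aloe | 0.0% |
| 1 | burger | 0.0% |
| 2 | cabbage | 0.0% |
| 3 | candied_fruits | 0.0% |
| 4 | carrots | 0.0% |
| 5 | chips | 0.0% |
| 6 | chocolate | 0.0% |
| 7 | drinks | 0.0% |
| 8 | fries | 0.0% |
| 9 | grapes | 0.0% |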


In [48]:
prediction(test.iloc[2])

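| | Label | Sentence |
|---|---|---|
| 0 | aloe | |

| | Label | Score |
|---|---|---|
| 0 | aloe | 99.9% |
| 1 | burger | 0.0% |
| 2 | cabbage | 0.0% |
| 3 | candied_fruits | 0.0% |
| 4 | carrots | 0.0% |
| 5 | chips | 0.0% |
| 6 | chocolate | 0.0% |
| 7 | drinks | 0.0% |
| 8 | fries | 0.0% |
| 9 | grapes | 0.0% |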
