# 1 prepare

## 1.1 basic prepare

### 1.1.1 login HuggingFace

In [10]:
from huggingface_hub import notebook_login

notebook_login()


### 1.1.2 prepare torch

In [2]:
import torch
torch.cuda.is_available()



True

## 1.2 prepare dataset from HuggingFace hub

### 1.2.1 load dataset

In [2]:
from datasets import load_dataset, DatasetDict

# prepare an empty dict
common_voice = DatasetDict()

# fill the empty dict
common_voice["train"] = load_dataset(
    "mozilla-foundation/common_voice_17_0", "dv", split="train+validation"
)
common_voice["test"] = load_dataset(
    "mozilla-foundation/common_voice_17_0", "dv", split="test"
)

print(common_voice)

DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        num_rows: 4902
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        num_rows: 2215
    })
})


### 1.2.2 process columns

In [3]:
# only preserve necessary columns
common_voice = common_voice.select_columns(["audio", "sentence"])

print(common_voice)

DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 4902
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 2215
    })
})


## 1.3 prepare processor

In 🤗 Transformers, the Whisper model has an associated feature extractor and tokenizer, called [WhisperFeatureExtractor](https://huggingface.co/docs/transformers/main/model_doc/whisper#transformers.WhisperFeatureExtractor) and [WhisperTokenizer](https://huggingface.co/docs/transformers/main/model_doc/whisper#transformers.WhisperTokenizer) respectively. To make our lives simple, these two objects are wrapped under a single class, called the [WhisperProcessor](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperProcessor).

- **FeatureExtractor**: for every supported audio model, 🤗 Transformers offer a feature extractor class that can convert raw audio data into the input features the model expects.
- **Tokenizer**: 🤗 Transformers also offer model-specific tokenizers to process the text inputs.

### 1.3.1 check if language supported

In [5]:
from transformers.models.whisper.tokenization_whisper import TO_LANGUAGE_CODE

print(TO_LANGUAGE_CODE)
# keys are lowercase language names, values are language codes
print('dhivehi' in TO_LANGUAGE_CODE.keys())
print('dv' in TO_LANGUAGE_CODE.values())

{'english': 'en', 'chinese': 'zh', 'german': 'de', 'spanish': 'es', 'russian': 'ru', 'korean': 'ko', 'french': 'fr', 'japanese': 'ja', 'portuguese': 'pt', 'turkish': 'tr', 'polish': 'pl', 'catalan': 'ca', 'dutch': 'nl', 'arabic': 'ar', 'swedish': 'sv', 'italian': 'it', 'indonesian': 'id', 'hindi': 'hi', 'finnish': 'fi', 'vietnamese': 'vi', 'hebrew': 'he', 'ukrainian': 'uk', 'greek': 'el', 'malay': 'ms', 'czech': 'cs', 'romanian': 'ro', 'danish': 'da', 'hungarian': 'hu', 'tamil': 'ta', 'norwegian': 'no', 'thai': 'th', 'urdu': 'ur', 'croatian': 'hr', 'bulgarian': 'bg', 'lithuanian': 'lt', 'latin': 'la', 'maori': 'mi', 'malayalam': 'ml', 'welsh': 'cy', 'slovak': 'sk', 'telugu': 'te', 'persian': 'fa', 'latvian': 'lv', 'bengali': 'bn', 'serbian': 'sr', 'azerbaijani': 'az', 'slovenian': 'sl', 'kannada': 'kn', 'estonian': 'et', 'macedonian': 'mk', 'breton': 'br', 'basque': 'eu', 'icelandic': 'is', 'armenian': 'hy', 'nepali': 'ne', 'mongolian': 'mn', 'bosnian': 'bs', 'kazakh': 'kk', 'albanian'

Dhivehi is not in the list of supported languages.

If you scroll through this list, you'll notice that many languages are present, but Dhivehi is one of the few that is not. This means that Whisper was not pre-trained on Dhivehi. However, that doesn't mean we can't fine-tune Whisper on it.

In doing so, we'll be teaching Whisper a new language, one that the pre-trained checkpoint does not support. That's pretty cool, right?
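
One pitfall with this kind of check: the keys of `TO_LANGUAGE_CODE` are lowercase language names, so a capitalized lookup returns `False` even for a supported language. Here is a minimal sketch of a case-insensitive check. The helper name `is_supported` and the three-entry subset are assumptions made for illustration; they are not part of 🤗 Transformers.

```python
# Hand-copied subset of the language table printed above, for illustration only.
LANGUAGE_SUBSET = {"english": "en", "chinese": "zh", "german": "de"}


def is_supported(language: str, table: dict) -> bool:
    """Return True if `language` matches a name or a code in `table`, ignoring case."""
    needle = language.strip().lower()
    return needle in table or needle in table.values()


print(is_supported("English", LANGUAGE_SUBSET))  # True
print(is_supported("Dhivehi", LANGUAGE_SUBSET))  # False
print(is_supported("dv", LANGUAGE_SUBSET))       # False
```

With the real table, the same helper could be called as `is_supported("Dhivehi", TO_LANGUAGE_CODE)`.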

### 1.3.2 find target language

To fine-tune Whisper on a new language, we need to find the most similar language that Whisper was pre-trained on. The Wikipedia article for Dhivehi states that Dhivehi is closely related to the Sinhalese language of Sri Lanka. If we check the language codes again, we can see that Sinhalese is present in the Whisper language set, so we can set our language argument to "sinhalese".

We'll load our processor from the pre-trained checkpoint, setting the language to "sinhalese" and the task to "transcribe".

In [5]:
from transformers import WhisperProcessor

device = "cuda:7" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="sinhalese", task="transcribe",
    # device=device
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## 1.4 preprocess the dataset

### 1.4.1 resample the audio rate

In [1]:
common_voice["train"].features


In [8]:
from datasets import Audio

sampling_rate = processor.feature_extractor.sampling_rate
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=sampling_rate))

common_voice["train"].features

{'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None),
 'sentence': Value(dtype='string', id=None)}
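
Casting the column does not rewrite the audio files; the audio is resampled when an example is accessed. Resampling changes the number of samples in a clip but not its duration. The sketch below illustrates that arithmetic only. The function name, the 5-second clip, and the 48 kHz source rate are assumptions for illustration, and a real resampler may differ by a sample or so at the edges.

```python
def resampled_num_samples(num_samples: int, source_rate: int, target_rate: int) -> int:
    """Approximate sample count after resampling; the duration stays the same."""
    return round(num_samples * target_rate / source_rate)


# Hypothetical 5-second clip recorded at 48 kHz (the source rate is an assumption).
source_rate, target_rate = 48_000, 16_000
num_samples = 5 * source_rate

new_num_samples = resampled_num_samples(num_samples, source_rate, target_rate)
print(new_num_samples)                # 80000
print(new_num_samples / target_rate)  # 5.0
```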

### 1.4.2 ignore too long examples

#### 1.4.2.1 tokenization and add key: input_length

Tokenization converts the text into numbers the model can make sense of.

The processor takes these fields as input:
- audio
  - array
  - sampling_rate
- sentence

and outputs these fields:
- input_features
- labels

In [9]:
def prepare_dataset(example):
    audio = example["audio"]

    example = processor(
        audio=audio["array"],
        sampling_rate=audio["sampling_rate"],
        text=example["sentence"],
    )

    # compute the length of the input audio sample in seconds
    example["input_length"] = len(audio["array"]) / audio["sampling_rate"]

    return example

common_voice = common_voice.map(
    prepare_dataset,
    remove_columns=common_voice.column_names["train"],
    # batched=True,
    # batch_size=50,  # Example batch size, adjust based on your requirement
    num_proc=32,  # number of worker processes; lower this on machines with fewer CPU cores
)

Map (num_proc=32):   0%|          | 0/4902 [00:00<?, ? examples/s]


Map (num_proc=32):   0%|          | 0/2215 [00:00<?, ? examples/s]


#### 1.4.2.2 filter audio length

In [10]:
max_input_length = 30.0

def is_audio_in_length_range(length):
    return length < max_input_length

print('>>> before filter')
print(common_voice)

common_voice["train"] = common_voice["train"].filter(
    is_audio_in_length_range,
    input_columns=["input_length"],
)

print('>>> after filter')
print(common_voice)

>>> before filter
DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels', 'input_length'],
        num_rows: 4902
    })
    test: Dataset({
        features: ['input_features', 'labels', 'input_length'],
        num_rows: 2215
    })
})


Filter:   0%|          | 0/4902 [00:00<?, ? examples/s]

>>> after filter
DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels', 'input_length'],
        num_rows: 4902
    })
    test: Dataset({
        features: ['input_features', 'labels', 'input_length'],
        num_rows: 2215
    })
})
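
The row counts are identical before and after, which suggests that no training clip reaches the 30-second limit. Whisper's feature extractor works on 30-second windows, which is why 30.0 is used as the threshold. Note that the comparison is strict, so a clip of exactly 30.0 s would be dropped, and that only the train split is filtered. The toy sketch below applies the same predicate to made-up durations:

```python
max_input_length = 30.0  # seconds, same threshold as in the cell above


def is_audio_in_length_range(length: float) -> bool:
    return length < max_input_length


# Made-up durations in seconds, for illustration only.
durations = [4.2, 12.0, 29.99, 30.0, 41.5]
kept = [d for d in durations if is_audio_in_length_range(d)]
print(kept)  # [4.2, 12.0, 29.99]
```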


### 1.4.3 >>> custom checkpoint

#### 1.4.3.1 save dataset to disk

In [11]:
common_voice.save_to_disk('~/huggingface/hub/datasets/common_voice_17_0_with_preprocessed')

Saving the dataset (0/10 shards):   0%|          | 0/4902 [00:00<?, ? examples/s]

Saving the dataset (0/5 shards):   0%|          | 0/2215 [00:00<?, ? examples/s]

#### 1.4.3.2 load dataset from disk

In [3]:
from datasets import load_from_disk

common_voice = load_from_disk('~/huggingface/hub/datasets/common_voice_17_0_with_preprocessed')

print(common_voice)



DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels', 'input_length'],
        num_rows: 4902
    })
    test: Dataset({
        features: ['input_features', 'labels', 'input_length'],
        num_rows: 2215
    })
})


## 1.5 prepare pipeline

### 1.5.1 prepare data collator

- Collate function: the function responsible for putting samples together into a batch.
- Default collator: converts the samples to torch.Tensor and concatenates them (recursively if the elements are lists, tuples, or dictionaries). This won't work in our case because our examples are not all the same size (the label sequences differ in length).
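
To make the padding idea concrete, here is a plain-Python sketch of what happens to the labels: every sequence is right-padded to the longest one in the batch, and the padded positions are set to -100 so the loss skips them. The function name `pad_labels` and the token ids are made up for illustration; the collator in the next cell does the equivalent with the tokenizer's `pad` method and `masked_fill`.

```python
from typing import List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def pad_labels(sequences: List[List[int]], ignore_index: int = IGNORE_INDEX) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in sequences]


# Made-up token ids, for illustration only.
batch = pad_labels([[7, 8, 9], [4], [5, 6]])
print(batch)  # [[7, 8, 9], [4, -100, -100], [5, 6, -100]]
```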

In [6]:
import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [
            {"input_features": feature["input_features"][0]} for feature in features
        ]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # if a bos token was added in the previous tokenization step,
        # cut it here since it is appended again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

### 1.5.2 prepare function: evaluate metrics

In [7]:
import evaluate

metric = evaluate.load("wer")

from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)

    # compute the orthographic WER
    wer_ortho = 100 * metric.compute(predictions=pred_str, references=label_str)

    # compute the normalized WER
    pred_str_norm = [normalizer(pred) for pred in pred_str]
    label_str_norm = [normalizer(label) for label in label_str]
    # filter so that only samples with a non-empty reference are evaluated
    pred_str_norm = [
        pred_str_norm[i] for i in range(len(pred_str_norm)) if len(label_str_norm[i]) > 0
    ]
    label_str_norm = [
        label_str_norm[i]
        for i in range(len(label_str_norm))
        if len(label_str_norm[i]) > 0
    ]

    wer = 100 * metric.compute(predictions=pred_str_norm, references=label_str_norm)

    return {"wer_ortho": wer_ortho, "wer": wer}
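
For intuition about what the metric measures, here is a plain-Python sketch of word error rate for a single reference/prediction pair: the word-level edit distance divided by the number of reference words. This is an illustrative re-implementation, not the code behind `evaluate.load("wer")`, which also aggregates over the whole evaluation set. The function name `word_error_rate` is an assumption. The sketch rejects empty references, which is the case the filtering step above guards against.

```python
from typing import List


def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of four reference words.
print(100 * word_error_rate("the cat sat down", "the cat sit down"))  # 25.0
```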


### 1.5.3 build Transformer model

In [8]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")



### 1.5.4 prepare train params

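Some concepts:
- batch: the set of examples processed together in one forward/backward pass.
- per_device_train_batch_size and gradient_accumulation_steps: the effective batch size per optimizer update is `per_device_train_batch_size` × `gradient_accumulation_steps` × the number of devices.

If you hit a CUDA "out-of-memory" error:
- reduce `per_device_train_batch_size` incrementally by factors of 2 and increase `gradient_accumulation_steps` by the same factor to compensate.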
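
The trade-off can be checked with a little arithmetic. In the sketch below, the helper names `effective_batch_size` and `halve_batch_size` are made up for illustration; the starting values match the ones used in the next cell.

```python
def effective_batch_size(per_device_batch_size: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch_size * accumulation_steps * num_devices


def halve_batch_size(per_device_batch_size: int, accumulation_steps: int):
    """Halve the per-device batch and double accumulation so the product is unchanged."""
    if per_device_batch_size % 2 != 0:
        raise ValueError("per-device batch size must be even to halve it")
    return per_device_batch_size // 2, accumulation_steps * 2


batch, accum = 16, 1
for _ in range(3):
    print(batch, accum, effective_batch_size(batch, accum))
    batch, accum = halve_batch_size(batch, accum)
# prints:
# 16 1 16
# 8 2 16
# 4 4 16
```

Each halving lowers peak memory per step while keeping the effective batch size at 16, at the cost of more forward/backward passes per update.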

In [9]:
# some setting

from functools import partial

# train args: disable cache during training since it's incompatible with gradient checkpointing
model.config.use_cache = False

# generate args: set the language and task for generation, and re-enable the cache
model.generate = partial(
    model.generate, language="sinhalese", task="transcribe", use_cache=True
)

# train args
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    # The output directory where the model predictions and checkpoints will be written.
    output_dir="./whisper-small-dv",  
    # (int, optional, defaults to 8) 
    # — The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for training.
    per_device_train_batch_size=16,
    #  (int, optional, defaults to 1) 
    # — Number of updates steps to accumulate the gradients for, before performing a backward/update pass.
    gradient_accumulation_steps=1,  # double this each time you halve the batch size
    learning_rate=1e-5,
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=50,
    max_steps=500,  # increase to 4000 if you have your own GPU or a paid Colab plan
    gradient_checkpointing=True,
    fp16=True,
    fp16_full_eval=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=16,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500,
    eval_steps=500,
    logging_dir='./logs',            # directory for storing logs
    logging_steps=1,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)
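
The `constant_with_warmup` scheduler with `warmup_steps=50` ramps the learning rate up linearly and then holds it constant. The sketch below reproduces that shape in plain Python as I understand it; the function name `constant_with_warmup_lr` is an assumption, and the exact step indexing inside the Trainer may differ slightly.

```python
def constant_with_warmup_lr(step: int, base_lr: float = 1e-5, warmup_steps: int = 50) -> float:
    """Linear ramp from 0 to base_lr over warmup_steps, then constant."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr


# Learning rate at the start, mid-warmup, end of warmup, and the final step.
for step in (0, 25, 50, 500):
    print(step, constant_with_warmup_lr(step))
```

With `max_steps=500`, the first 10% of training is spent warming up and the remaining 450 steps run at the full `learning_rate`.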



### 1.5.5 build trainer

We can forward the training arguments to the 🤗 Trainer along with our model, dataset, data collator and compute_metrics function:

In [10]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor,
)

max_steps is given, it will override any value given in num_train_epochs


# 2 train

In [1]:
import logging
logging.basicConfig(level=logging.DEBUG)

trainer.train()
