# Whipser Inference üéµüëÇ‚û°Ô∏èüáßüá©üìù

This uses 2x T4 GPUs for distributed inference. The trained model came from this notebook: https://www.kaggle.com/code/nbroad/whisper-training-starter-kit


Inference Notebook Version | Training Notebook Version | Model | CV WER | LB WER
- | - | - | - | -
 | 1 | openai/whisper-base | 0.69 | 
9 | 2 | bangla-speech-processing/BanglaASR | 0.529 | 0.664
11 | - | bangla-speech-processing/BanglaASR | 0.244 | 0.644
12 | - | bangla-speech-processing/BanglaASR | 0.386 | 0.638
14 | - | bangla-speech-processing/BanglaASR | 0.419 | 0.629
18 | - | IndicWhisper-bn | ? | 0.542

Notebook Version 11 used a model trained on 84k samples from the given dataset. This was overfitting.  
Notebook Version 12 used a model trained on a different set of 90k samples. It also includes postprocessing code. Still overfitting so better CV is needed.  
Notebook Version 14 trained on all samples.  
Notebook Version 18 used the [Bengali version of IndicWhisper](https://github.com/AI4Bharat/vistaar) without any training on the given dataset. Tried adding `do_normalize=True` and CV was worse.

It takes about 2 hours to submit and score for whisper small, 2.5 hours for whisper medium

A couple versions failed because it was doing preprocessing in chunks. Now it does it in each batch.
Versions 15-17 failed because it was predicting empty string. See here for details on this scoring error.

I will update this file with speed optimizations in the near future.
- [Update] BetterTransformer does not seem to help.

In [9]:
!pip install datasets --no-index --find-links=file:///kaggle/input/hf-ds -U -q
    
!cp -r /kaggle/input/python-packages2 /tmp
!tar xvfz /tmp/python-packages2/normalizer.tgz
!pip install ./normalizer/bnunicodenormalizer-0.0.24.tar.gz -f ./ --no-index


!tar xvfz /tmp/python-packages2/jiwer.tgz
!pip install ./jiwer/python-Levenshtein-0.12.2.tar.gz -f ./ --no-index
!pip install ./jiwer/jiwer-2.3.0-py3-none-any.whl -f ./ --no-index

jiwer/
jiwer/jiwer-2.3.0-py3-none-any.whl
jiwer/python-Levenshtein-0.12.2.tar.gz
jiwer/setuptools-65.3.0-py3-none-any.whl
Looking in links: ./
Processing ./jiwer/jiwer-2.3.0-py3-none-any.whl
INFO: pip is looking at multiple versions of jiwer to determine which version is compatible with other requirements. This could take a while.
[31mERROR: Could not find a version that satisfies the requirement python-Levenshtein==0.12.2 (from jiwer) (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for python-Levenshtein==0.12.2[0m[31m
[0m

In [22]:
class CFG:
    
    model_path = "/kaggle/input/indicwhisper-bn/bengali_models/whisper-medium-bn_alldata_multigpu"
    batch_size = 8
    do_eval = False
    num_eval = 500
    do_predict = True
    do_normalize = False

In [23]:
%%writefile infer.py

import os
import sys
import random
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Union

import datasets
import torch
import numpy as np
from datasets import Dataset

from transformers import (
    AutoConfig,
    AutoFeatureExtractor,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    AutoTokenizer,
    HfArgumentParser,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    set_seed,
)

@dataclass
class Config:
    model_name_or_path: str = field(
        metadata={
            "help": "Path to pretrained model or model identifier from huggingface.co/models"
        }
    )
    audio_column_name: str = field(
        default="audio",
        metadata={
            "help": "The name of the dataset column containing the audio data. Defaults to 'audio'"
        },
    )
    num_workers: int = field(
        default=2,
        metadata={
            "help": "The number of workers for preprocessing"
        },
    )
    use_bettertransformer: bool = field(default=False, metadata={
            "help": "Use BetterTransformer (https://huggingface.co/docs/optimum/bettertransformer/overview)"
        }
    num_eval: int = field(default=1000, metadata={
            "help": "The number of samples to run for CV"
        })
    do_normalize: bool = field(default=False, metadata={
            "help": "Normalize in the feature extractor"
        })


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor ([`WhisperProcessor`])
            The processor used for processing the data.
        decoder_start_token_id (`int`)
            The begin-of-sentence of the decoder.
        forward_attention_mask (`bool`)
            Whether to return attention_mask.
    """

    processor: Any
    decoder_start_token_id: int
    forward_attention_mask: bool
    audio_column_name: str
    do_normalize: bool

    def __call__(
        self, features
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        
        model_input_name = self.processor.model_input_names[0]
        
        features = [
            prepare_dataset(
                feature, 
                audio_column_name=self.audio_column_name, 
                model_input_name=model_input_name,
                feature_extractor=self.processor.feature_extractor,
                do_normalize=self.do_normalize
            ) for feature in features
        ]
        
        input_features = [
            {model_input_name: feature[model_input_name]} for feature in features
        ]

        batch = self.processor.feature_extractor.pad(
            input_features, return_tensors="pt"
        )

        return batch

def prepare_dataset(batch, audio_column_name, model_input_name, feature_extractor, do_normalize):
    # process audio
    sample = batch[audio_column_name]

    # if longer than 30 seconds, truncate.
    # for best score, break long files up
    if len(sample["array"]) > (16000 * 30):
        sample["array"] = sample["array"][:16000 * 30]

    inputs = feature_extractor(
        sample["array"],
        sampling_rate=sample["sampling_rate"],
        do_normalize=do_normalize,
    )
    # process audio length
    batch[model_input_name] = inputs.get(model_input_name)[0]

    return batch



def main():

    parser = HfArgumentParser((Config, Seq2SeqTrainingArguments))

    cfg, training_args = parser.parse_args_into_dataclasses()

    # Set seed before initializing model.
    set_seed(training_args.seed)

    config = AutoConfig.from_pretrained(
        cfg.model_name_or_path,
    )

    feature_extractor = AutoFeatureExtractor.from_pretrained(
        cfg.model_name_or_path,
    )
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        cfg.model_name_or_path,
        config=config,
    )
    
    # BetterTransformer does not provide any noticeable speed up in fp32
    if cfg.use_bettertransformer:
        sys.path.append("/kaggle/input/hugging-face-optimum")
        from optimum.bettertransformer import BetterTransformer
        
        model = BetterTransformer.transform(model)
        
        if training_args.local_process_index == 0:
            print("Converted to BetterTransformer")

    processor = AutoProcessor.from_pretrained(cfg.model_name_or_path)
    
    if training_args.do_eval:
        data_dir = "/kaggle/input/bengaliai-speech/train_mp3s"
    else:
        data_dir = "/kaggle/input/bengaliai-speech/test_mp3s"
    
    audio_files = list(map(str, Path(data_dir).glob("*.mp3")))
    
    if training_args.do_eval:
        audio_files = random.sample(audio_files, cfg.num_eval)
    
    ds = Dataset.from_dict({"audio": audio_files})
    
    ds = ds.map(lambda x: {"id": Path(x["audio"]).stem, "filesize": os.path.getsize(x["audio"])}, num_proc=cfg.num_workers)
    
    ds = ds.cast_column(
        cfg.audio_column_name,
        datasets.features.Audio(sampling_rate=feature_extractor.sampling_rate),
    )
    
    # sort by filesize to minimize padding
    ds = ds.sort("filesize")
    ds = ds.add_column("idx", range(len(ds)))
    
    # save ids
    ds.remove_columns([x for x in ds.column_names if x != "id"]).to_json("ids.json")
    
    model_input_name = feature_extractor.model_input_names[0]
    
    
    data_collator = DataCollatorSpeechSeq2SeqWithPadding(
        processor=processor,
        decoder_start_token_id=model.config.decoder_start_token_id,
        forward_attention_mask=False,
        audio_column_name=cfg.audio_column_name,
        do_normalize=cfg.do_normalize,
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        tokenizer=feature_extractor,
        data_collator=data_collator,
    )
    
    # Probably not necessary to do in chunks, but keeping for the time being
    chunk_size = 250
    
    for num, i in enumerate(range(0, len(ds), chunk_size)):
        ii = min(i+chunk_size, len(ds))
        temp = ds.select(range(i, ii))
        
        predictions = trainer.predict(temp).predictions
    
        Dataset.from_dict({"idx": temp["idx"]}).to_json(f"vectorized_idxs_{num}.json")
        np.save(f"preds_{num}.npy", predictions)
    

if __name__ == "__main__":
    main()

Overwriting infer.py


In [24]:
if CFG.do_eval:
    !torchrun --nproc_per_node 2 infer.py \
      --model_name_or_path $CFG.model_path \
      --report_to "none" \
      --dataloader_num_workers 1 \
      --per_device_eval_batch_size $CFG.batch_size \
      --predict_with_generate \
      --output_dir "./" \
      --remove_unused_columns False \
      --do_eval True \
      --num_eval $CFG.num_eval \
      --do_normalize $CFG.do_normalize

if CFG.do_predict:
    !torchrun --nproc_per_node 2 infer.py \
      --model_name_or_path $CFG.model_path \
      --report_to "none" \
      --dataloader_num_workers 1 \
      --per_device_eval_batch_size $CFG.batch_size \
      --predict_with_generate \
      --output_dir "./" \
      --remove_unused_columns False \
      --do_normalize $CFG.do_normalize

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']
Creating json from Arrow format: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:00<00:00, 34.01ba/s]

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñ

## Create submission.csv file

In [25]:
import json
import numpy as np
import pandas as pd
from pathlib import Path
from itertools import chain

from transformers import AutoTokenizer
from bnunicodenormalizer import Normalizer


all_ids = [x.stem for x in Path("/kaggle/input/bengaliai-speech/test_mp3s").glob("*.mp3")]

tokenizer = AutoTokenizer.from_pretrained(CFG.model_path)


text_preds = []


for f in sorted(Path("./").glob("preds*.npy")):
    preds = np.load(f)
    text_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True))

# ids.json has the original order
ids = []
with open("ids.json") as fp:
    for line in fp:
        ids.append(json.loads(line)["id"])

pred_idxs = []

for fname in sorted(Path("./").glob("vectorized_*.json")):
    # vectorized_idxs.json has the sorted version
    with open(fname) as fp:
        for line in fp:
            pred_idxs.append(json.loads(line)["idx"])

pred_df = pd.DataFrame({"idx": pred_idxs, "sentence": text_preds})
id_df = pd.DataFrame({"id": ids})
id_df["idx"] = range(len(id_df))

# merge dataframes to match ids and sentences properly
pred_df = pred_df.merge(id_df, on="idx").drop(columns=["idx"])

pred_ids = pred_df.id

if CFG.do_predict:
    missing_ids = set(all_ids) - set(pred_ids)
    if len(missing_ids) > 0:
        temp = pd.DataFrame({"id": list(missing_ids)})
        temp["sentence"] = "‡•§"

        pred_df = pd.concat([
            pred_df, 
            temp,
        ],
            axis=0
        )
    
pred_df["sentence"].fillna("‡•§", inplace=True)

# Post-processing
# from: https://www.kaggle.com/code/reasat/yellowking-dlsprint-inference?scriptVersionId=137162907&cellId=19
bnorm = Normalizer()

def postprocess(sentence):
    _words = [bnorm(word)['normalized']  for word in sentence.split()]
    sentence = " ".join([word for word in _words if word is not None])
    try:
        if sentence[-1]!="‡•§":
            sentence+="‡•§"
    except:
        print(sentence)
    return sentence

pred_df["sentence"] = [postprocess(s) for s in pred_df["sentence"]]
pred_df["sentence"] = [x if len(x) > 0 else "‡•§" for x in pred_df["sentence"]]

pred_df.to_csv("submission.csv", index=False)






In [26]:
pred_df.head()

Unnamed: 0,sentence,id
0,‡¶Ü‡¶õ‡¶æ‡ßú‡¶æ‡¶õ‡¶æ‡ßú‡¶æ‡¶õ‡¶æ‡ßú‡¶æ‡•§,964c0abe4385
1,‡¶≠‡¶ø‡¶§‡ßç‡¶§‡¶ø ‡¶π‡¶® ‡¶ì ‡¶§‡¶ø‡¶®‡¶ø‡•§,0b2954c40b8f
2,‡¶è‡¶ü‡¶ø‡•§,63e0b415bfab
3,‡¶§‡ßã‡¶Æ‡¶∏‡ßá ‡¶ú‡ßá‡¶ñ‡¶ø‡¶≤‡ßá ‡¶™‡ßç‡¶∞‡¶æ‡¶™‡ßç‡¶§ ‡¶Æ‡¶ø‡¶ü‡¶ø‡¶Ç‡•§,fb2774c3c552
4,‡¶è‡¶á ‡¶∏‡¶Æ‡ßç‡¶™‡¶∞‡ßç‡¶ï ‡¶Ü‡¶õ‡ßá‡•§,8a5875d54a61


In [27]:
if CFG.do_eval:
    
    train_df = pd.read_csv("/kaggle/input/bengaliai-speech/train.csv")
    
    train_df["true"] = train_df["sentence"]
    
    pred_df = pred_df.merge(train_df[["true", "id"]], on="id", how="left")
    
    from jiwer import wer
    
    pred_df["wer"] = [wer(pred, gt) for pred, gt in pred_df[["sentence", "true"]].values]
    
    print(round(pred_df["wer"].mean(), 4))

0.736
