# Bengali.AI Speech Recognition
## Recognize Bengali speech from out-of-distribution audio recordings
https://www.kaggle.com/competitions/bengaliai-speech/overview

The goal of this competition is to recognize Bengali speech from out-of-distribution audio recordings. You will build a model trained on the first Massively Crowdsourced (MaCro) Bengali speech dataset with 1,200 hours of data from ~24,000 people from India and Bangladesh. The test set contains samples from 17 different domains that are not present in training.

Your efforts could improve Bengali speech recognition using the first Bengali out-of-distribution speech recognition dataset. In addition, your submission will be among the first open-source speech recognition methods for Bengali.

The full test set contains about 20 hours of speech in almost 8,000 MP3 audio files. All test files are single-channel and encoded at a 32 kHz sample rate and a 48 kbps bit rate.
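As a rough sanity check on these figures, the sketch below works out the average clip length and the number of samples per clip, both at the source rate and after resampling to 16 kHz (the rate the preprocessing cells in this notebook use). The totals are the approximate values quoted above, so the results are estimates only.

```python
# Approximate figures quoted in the competition description.
TOTAL_HOURS = 20          # "about 20 hours" of test audio
NUM_FILES = 8000          # "almost 8000" MP3 files
SOURCE_RATE_HZ = 32_000   # sample rate of the test-set files
TARGET_RATE_HZ = 16_000   # rate used when loading audio in this notebook

avg_seconds = TOTAL_HOURS * 3600 / NUM_FILES
samples_at_source = int(avg_seconds * SOURCE_RATE_HZ)
samples_at_target = int(avg_seconds * TARGET_RATE_HZ)

print(f"Average clip length: {avg_seconds:.1f} s")
print(f"Samples per clip at 32 kHz: {samples_at_source}")
print(f"Samples per clip at 16 kHz: {samples_at_target}")
```

By this estimate a typical clip runs about 9 seconds, so resampling halves the number of samples per clip without changing its duration.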

Details on the dataset are available in the dataset paper: https://arxiv.org/abs/2305.09688

## Log in to Hugging Face

In [1]:
from huggingface_hub import notebook_login
notebook_login()


In [2]:
TRAINING_CSV_PATH = "bengaliai-speech/train.csv"
BASE_MODEL = "openai/whisper-tiny"
NUM_TRAINING_SAMPLES = 10000
NUM_VALIDATION_SAMPLES = 100


In [3]:
import pandas as pd

data = pd.read_csv(TRAINING_CSV_PATH)
print(f"Number of training samples: {len(data)}")
data.head()

Number of training samples: 963636


|   | id | sentence | split |
|---|---|---|---|
| 0 | 000005f3362c | ও বলেছে আপনার ঠিকানা! | train |
| 1 | 00001dddd002 | কোন মহান রাষ্ট্রের নাগরিক হতে চাও? | train |
| 2 | 00001e0bc131 | আমি তোমার কষ্টটা বুঝছি, কিন্তু এটা সঠিক পথ না। | train |
| 3 | 000024b3d810 | নাচ শেষ হওয়ার পর সকলে শরীর ধুয়ে একসঙ্গে ভোজন... | train |
| 4 | 000028220ab3 | হুমম, ওহ হেই, দেখো। | train |


## Pre-process the data

In [4]:
import librosa
from datasets import Audio

# Load an audio file with librosa, resampled to 16 kHz (the rate Whisper expects)
def load_audio(file_path):
    y, _ = librosa.load(file_path, sr=16000)
    return y

def populate_files(frame, limit=10):
    # Generate file paths based on the 'id' column
    frame['file_path'] = frame['id'].apply(lambda x: f"./bengaliai-speech/train_mp3s/{x}.mp3")
    
    # Load audio from each file and store it in the 'audio' column, but limit to 'limit' number of files
    frame['audio'] = frame['file_path'].head(limit).apply(load_audio)
    
    return frame


data = populate_files(data, limit=10)

# Split the data into training and validation sets based on the 'split' column
print(data['split'].value_counts())
train_df = data[data['split'] == 'train']
train_df = train_df[['id', 'file_path', 'audio', 'sentence']]
print(f"Number of training samples: {len(train_df)}")
validation_df = data[data['split'] == 'valid']
validation_df = validation_df[['id', 'file_path', 'audio', 'sentence']]
print(f"Number of validation samples: {len(validation_df)}")

# Limit the number of training and validation samples for a quick test run
train_df = train_df[:NUM_TRAINING_SAMPLES]
validation_df = validation_df[:NUM_VALIDATION_SAMPLES]

# Create the required Dataset objects
from datasets import Dataset
train_ds = Dataset.from_dict({"audio": train_df["file_path"], "sentence": train_df['sentence']}).cast_column("audio", Audio(sampling_rate=16000))
print(train_ds[0])
validation_ds = Dataset.from_dict({"audio": validation_df["file_path"], "sentence":validation_df['sentence']}).cast_column("audio", Audio(sampling_rate=16000))

split
train    934048
valid     29588
Name: count, dtype: int64
Number of training samples: 934048
Number of validation samples: 29588
{'audio': {'path': './bengaliai-speech/train_mp3s/000005f3362c.mp3', 'array': array([ 2.32830644e-10, -8.14907253e-10, -6.40284270e-10, ...,
       -1.16623938e-04, -2.85811722e-04,  2.30960548e-04]), 'sampling_rate': 16000}, 'sentence': 'ও বলেছে আপনার ঠিকানা!'}


## Load WhisperFeatureExtractor

In [5]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained(BASE_MODEL)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.



## Load WhisperTokenizer

In [6]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(BASE_MODEL, language="Bengali", task="transcribe")


In [7]:
input_str = train_ds[0]["sentence"]
print(input_str)
labels = tokenizer(input_str).input_ids
decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)
decoded_str = tokenizer.decode(labels, skip_special_tokens=True)

print(f"Input:                 {input_str}")
print(f"Decoded w/ special:    {decoded_with_special}")
print(f"Decoded w/out special: {decoded_str}")
print(f"Are equal:             {input_str == decoded_str}")

ও বলেছে আপনার ঠিকানা!
Input:                 ও বলেছে আপনার ঠিকানা!
Decoded w/ special:    <|startoftranscript|><|bn|><|transcribe|><|notimestamps|>ও বলেছে আপনার ঠিকানা!<|endoftext|>
Decoded w/out special: ও বলেছে আপনার ঠিকানা!
Are equal:             True
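The decoded output above shows what the tokenizer wraps around each label: a start-of-transcript token, the language token `<|bn|>`, the task token, a no-timestamps token, and a closing `<|endoftext|>`. As a rough illustration of what `skip_special_tokens=True` removes, the sketch below strips those markers with a regular expression. The helper `strip_special_tokens` is hypothetical; the real tokenizer filters special token ids rather than matching strings.

```python
import re

# Whisper-style special tokens look like <|name|>. This pattern is an
# illustrative stand-in, not the tokenizer's own implementation.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")

def strip_special_tokens(text: str) -> str:
    """Remove every <|...|> marker from a decoded string."""
    return SPECIAL_TOKEN.sub("", text)

decoded = (
    "<|startoftranscript|><|bn|><|transcribe|><|notimestamps|>"
    "ও বলেছে আপনার ঠিকানা!<|endoftext|>"
)
print(strip_special_tokens(decoded))
```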


## Prepare Data

In [8]:
def prepare_dataset(batched_row):
    batched_output = {"input_features": [], "labels": []}

    for idx in range(len(batched_row["audio"])):
        audio_sample = batched_row["audio"][idx]
        sentence_sample = batched_row["sentence"][idx]

        # compute log-Mel input features from input audio array 
        input_features = feature_extractor(audio_sample["array"], sampling_rate=audio_sample["sampling_rate"]).input_features[0]
        batched_output["input_features"].append(input_features)
        
        # encode target text to label ids 
        labels = tokenizer(sentence_sample).input_ids
        batched_output["labels"].append(labels)

    return batched_output

train_ds = train_ds.map(prepare_dataset, remove_columns=train_ds.column_names, batched=True, batch_size=1000, num_proc=4)
print(train_ds)
validation_ds = validation_ds.map(prepare_dataset, remove_columns=validation_ds.column_names, batched=True, batch_size=1000, num_proc=4)


Map (num_proc=4):   0%|          | 0/934048 [00:00<?, ? examples/s]

TimeoutError: 

In [None]:
from dataclasses import dataclass
from typing import Any, Dict, List, Union

import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was prepended in the previous tokenization step,
        # cut it here because it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch
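The label handling in the collator above can be illustrated without PyTorch. The sketch below is a simplified stand-in, not the collator itself: it pads label sequences directly with -100 (the collator pads with the tokenizer's pad token and then masks by attention mask), and it drops the first column only when every sequence starts with the BOS token. The function name `pad_and_mask` and the token ids are hypothetical.

```python
from typing import List

IGNORE_INDEX = -100  # label value that the loss ignores

def pad_and_mask(label_seqs: List[List[int]], bos_token_id: int) -> List[List[int]]:
    """Pad labels to equal length with IGNORE_INDEX, then drop a shared leading BOS."""
    max_len = max(len(seq) for seq in label_seqs)
    padded = [seq + [IGNORE_INDEX] * (max_len - len(seq)) for seq in label_seqs]
    # Drop the first column only if every sequence starts with the BOS token.
    if all(row[0] == bos_token_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded

# Hypothetical token ids: 1 stands in for the BOS token.
batch = pad_and_mask([[1, 7, 8, 9], [1, 5]], bos_token_id=1)
print(batch)  # [[7, 8, 9], [5, -100, -100]]
```

Masking padded positions with -100 keeps them out of the loss, so short transcripts are not penalized for the padding added to match longer ones in the same batch.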


## Create A WhisperProcessor

In [None]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(BASE_MODEL, language="Bengali", task="transcribe")
print(processor)
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
print(data_collator)

## Load pre-trained checkpoint


In [None]:
from transformers import WhisperForConditionalGeneration

# Load the same checkpoint used for the feature extractor, tokenizer, and processor
model = WhisperForConditionalGeneration.from_pretrained(BASE_MODEL)
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

## Define Training Args 

In [None]:
from transformers import Seq2SeqTrainingArguments
import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-bn",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=train_ds,
    eval_dataset=validation_ds,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
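The `compute_metrics` function above reports word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below computes it for a single sentence pair with a standard edit-distance recurrence. It is an illustration only, not the `evaluate` library's implementation; it assumes a non-empty reference, and the hypothesis string is a made-up example.

```python
from typing import List

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)

reference = "ও বলেছে আপনার ঠিকানা!"
hypothesis = "ও বলেছে তোমার ঠিকানা!"  # hypothetical model output with one wrong word
print(f"WER: {100 * word_error_rate(reference, hypothesis):.1f}%")
```

One substituted word out of four reference words gives a WER of 25%. Because insertions count as errors, WER can exceed 100% when the prediction is much longer than the reference.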

## Train

In [None]:
print(trainer.accelerator.device)

In [None]:
trainer.train()
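For a sense of scale, the sketch below works out how much data this run would process, using the values from the training arguments and constants defined earlier in the notebook. The device count is an assumption; with more than one device the effective batch size scales accordingly.

```python
# Values copied from the training arguments and constants in this notebook.
PER_DEVICE_TRAIN_BATCH_SIZE = 16
GRADIENT_ACCUMULATION_STEPS = 1
MAX_STEPS = 4000
EVAL_STEPS = 1000
NUM_TRAINING_SAMPLES = 10000
NUM_DEVICES = 1  # assumption: training on a single device

effective_batch = PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_DEVICES
samples_seen = effective_batch * MAX_STEPS
approx_epochs = samples_seen / NUM_TRAINING_SAMPLES
num_evaluations = MAX_STEPS // EVAL_STEPS

print(f"Effective batch size: {effective_batch}")
print(f"Samples processed: {samples_seen}")
print(f"Approximate epochs over the training subset: {approx_epochs:.1f}")
print(f"Evaluation runs: {num_evaluations}")
```

Under these assumptions the run passes over the 10,000-sample subset roughly 6.4 times and evaluates on the validation subset four times.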