# Bengali.AI Speech Recognition
## Recognize Bengali speech from out-of-distribution audio recordings
https://www.kaggle.com/competitions/bengaliai-speech/overview

The goal of this competition is to recognize Bengali speech from out-of-distribution audio recordings. You will build a model trained on the first Massively Crowdsourced (MaCro) Bengali speech dataset with 1,200 hours of data from ~24,000 people from India and Bangladesh. The test set contains samples from 17 different domains that are not present in training.

Your efforts could improve Bengali speech recognition using the first Bengali out-of-distribution speech recognition dataset. In addition, your submission will be among the first open-source speech recognition methods for Bengali.

The full test set contains about 20 hours of speech in almost 8000 MP3 audio files. All of the files in the test set are encoded at a sample rate of 32 kHz and a bit rate of 48 kbps, in a single (mono) channel.
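
The figures above imply a few back-of-envelope numbers that help when planning inference. The sketch below treats "about 20 hours" and "almost 8000" as exact and assumes a constant bit rate, so every result is a rough estimate. The 16 kHz figure is included because that is the rate used when loading audio later in this notebook.

```python
# Rough estimates from the stated test-set figures (all inputs are approximations).
TOTAL_HOURS = 20          # "about 20 hours"
NUM_FILES = 8000          # "almost 8000" files
SAMPLE_RATE_HZ = 32_000   # stated MP3 sample rate
BIT_RATE_BPS = 48_000     # stated bit rate, assumed constant


def average_clip_seconds(total_hours, num_files):
    """Mean clip duration in seconds."""
    return total_hours * 3600 / num_files


def compressed_bytes(duration_seconds, bit_rate_bps):
    """Approximate MP3 payload size for a constant-bit-rate stream."""
    return duration_seconds * bit_rate_bps / 8


def decoded_samples(duration_seconds, sample_rate_hz):
    """Number of mono PCM samples after decoding at the given rate."""
    return int(duration_seconds * sample_rate_hz)


avg_seconds = average_clip_seconds(TOTAL_HOURS, NUM_FILES)
print(f"Average clip length: {avg_seconds:.1f} s")
print(f"Approx. MP3 size per clip: {compressed_bytes(avg_seconds, BIT_RATE_BPS) / 1000:.0f} kB")
print(f"Samples per clip at 32 kHz: {decoded_samples(avg_seconds, SAMPLE_RATE_HZ)}")
print(f"Samples per clip at 16 kHz: {decoded_samples(avg_seconds, 16_000)}")
```

Under these assumptions the average clip is roughly 9 seconds long, well inside a 30-second model input window.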

Details on the dataset are available in the dataset paper: https://arxiv.org/abs/2305.09688

## Log in to Hugging Face

In [1]:
from huggingface_hub import notebook_login
notebook_login()


In [2]:
TRAINING_CSV_PATH = "bengaliai-speech/train.csv"

In [11]:
import pandas as pd

data = pd.read_csv(TRAINING_CSV_PATH)
print(f"Number of training samples: {len(data)}")
data.head()

Number of training samples: 963636


Unnamed: 0,id,sentence,split
0,000005f3362c,ও বলেছে আপনার ঠিকানা!,train
1,00001dddd002,কোন মহান রাষ্ট্রের নাগরিক হতে চাও?,train
2,00001e0bc131,"আমি তোমার কষ্টটা বুঝছি, কিন্তু এটা সঠিক পথ না।",train
3,000024b3d810,নাচ শেষ হওয়ার পর সকলে শরীর ধুয়ে একসঙ্গে ভোজন...,train
4,000028220ab3,"হুমম, ওহ হেই, দেখো।",train


## Pre-process the data

In [23]:
import librosa

# Load one audio file with librosa, resampled to 16 kHz mono
# (the sample rate Whisper's feature extractor expects)
def load_audio(file_path):
    y, _ = librosa.load(file_path, sr=16000)
    return y

def populate_files(frame, limit=10):
    # Generate file paths based on the 'id' column
    frame['file_path'] = frame['id'].apply(lambda x: f"./bengaliai-speech/train_mp3s/{x}.mp3")
    
    # Load audio for the first 'limit' rows only; the remaining rows get NaN in the 'audio' column
    frame['audio'] = frame['file_path'].head(limit).apply(load_audio)
    
    return frame


data = populate_files(data, limit=10)

# Split the data into training and validation sets based on the 'split' column
print(data['split'].value_counts())
train_df = data[data['split'] == 'train']
print(f"Number of training samples: {len(train_df)}")
validation_df = data[data['split'] == 'valid']
print(f"Number of validation samples: {len(validation_df)}")

train_df.head()

split
train    934048
valid     29588
Name: count, dtype: int64
Number of training samples: 934048
Number of validation samples: 29588


Unnamed: 0,id,sentence,split,file_path,audio
0,000005f3362c,ও বলেছে আপনার ঠিকানা!,train,./bengaliai-speech/train_mp3s/000005f3362c.mp3,"[0.0, 1.2770395e-13, 9.996642e-15, -1.6432678e..."
1,00001dddd002,কোন মহান রাষ্ট্রের নাগরিক হতে চাও?,train,./bengaliai-speech/train_mp3s/00001dddd002.mp3,"[0.0, 1.4704359e-13, 3.3958074e-13, -3.4793018..."
2,00001e0bc131,"আমি তোমার কষ্টটা বুঝছি, কিন্তু এটা সঠিক পথ না।",train,./bengaliai-speech/train_mp3s/00001e0bc131.mp3,"[0.0, 1.446659e-14, -5.119921e-15, -2.5697824e..."
3,000024b3d810,নাচ শেষ হওয়ার পর সকলে শরীর ধুয়ে একসঙ্গে ভোজন...,train,./bengaliai-speech/train_mp3s/000024b3d810.mp3,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,000028220ab3,"হুমম, ওহ হেই, দেখো।",train,./bengaliai-speech/train_mp3s/000028220ab3.mp3,"[0.0, -8.549696e-14, -3.1352218e-13, -4.597286..."
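
Because the file paths are derived from the `id` column rather than read from disk, it can be worth checking that the files exist before decoding them. The following is a minimal sketch using only the standard library; `audio_path` and `missing_audio_ids` are hypothetical helper names, and the directory mirrors the path pattern used in `populate_files`.

```python
from pathlib import Path

# Assumed layout, mirroring the path pattern used in populate_files.
TRAIN_AUDIO_DIR = Path("./bengaliai-speech/train_mp3s")


def audio_path(audio_id, audio_dir=TRAIN_AUDIO_DIR):
    """Build the MP3 path for one 'id' value."""
    return Path(audio_dir) / f"{audio_id}.mp3"


def missing_audio_ids(audio_ids, audio_dir=TRAIN_AUDIO_DIR):
    """Return the ids whose MP3 file is not present on disk."""
    return [i for i in audio_ids if not audio_path(i, audio_dir).is_file()]
```

For example, `missing_audio_ids(data['id'].head(10))` would list any of the first ten ids that have no matching MP3 file.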


## Load WhisperFeatureExtractor
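
A minimal sketch of this step, assuming the `openai/whisper-small` checkpoint (the notebook does not name one). `WhisperFeatureExtractor` pads or truncates each clip to 30 seconds and converts it to a log-Mel spectrogram. `load_feature_extractor` and `expected_feature_shape` are hypothetical helpers; the latter works out the resulting array shape from the extractor's default settings.

```python
WHISPER_CHECKPOINT = "openai/whisper-small"  # assumption: checkpoint is not named in the notebook


def load_feature_extractor(checkpoint=WHISPER_CHECKPOINT):
    """Download (or load from cache) the feature extractor for a checkpoint."""
    # Imported here so the helper can be defined before transformers is needed.
    from transformers import WhisperFeatureExtractor
    return WhisperFeatureExtractor.from_pretrained(checkpoint)


def expected_feature_shape(n_mels=80, chunk_seconds=30, sampling_rate=16000, hop_length=160):
    """Shape of one log-Mel input, (mel bins, frames), for a clip padded to 30 s."""
    n_frames = chunk_seconds * sampling_rate // hop_length
    return (n_mels, n_frames)


print(expected_feature_shape())  # (80, 3000)
```

With an extractor loaded, `load_feature_extractor()(y, sampling_rate=16000).input_features[0]` would give the features for a waveform `y` returned by `load_audio`.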

## Load WhisperTokenizer
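
A minimal sketch of this step, again assuming the `openai/whisper-small` checkpoint. Passing `language` and `task` makes the tokenizer prepend the matching Whisper prefix tokens to each encoded transcript. `load_tokenizer` is a hypothetical helper name.

```python
WHISPER_CHECKPOINT = "openai/whisper-small"  # assumption: checkpoint is not named in the notebook


def load_tokenizer(checkpoint=WHISPER_CHECKPOINT, language="Bengali", task="transcribe"):
    """Load the Whisper tokenizer configured for Bengali transcription."""
    # Imported here so the helper can be defined before transformers is needed.
    from transformers import WhisperTokenizer
    return WhisperTokenizer.from_pretrained(checkpoint, language=language, task=task)
```

A quick round-trip check is to encode a sentence from `train.csv` with `tokenizer(sentence).input_ids` and decode it again with `tokenizer.decode(ids, skip_special_tokens=True)`.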

## Create a WhisperProcessor
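
`WhisperProcessor` bundles a feature extractor and a tokenizer in one object, exposed as `processor.feature_extractor` and `processor.tokenizer`. The sketch below assumes the `openai/whisper-small` checkpoint, which the notebook does not name; `load_processor` is a hypothetical helper.

```python
WHISPER_CHECKPOINT = "openai/whisper-small"  # assumption: checkpoint is not named in the notebook


def load_processor(checkpoint=WHISPER_CHECKPOINT, language="Bengali", task="transcribe"):
    """Load a processor that wraps both the feature extractor and the tokenizer."""
    # Imported here so the helper can be defined before transformers is needed.
    from transformers import WhisperProcessor
    return WhisperProcessor.from_pretrained(checkpoint, language=language, task=task)
```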

## Prepare Data
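
One way this step might look, as a sketch rather than the notebook's actual code: each row's waveform becomes log-Mel input features and each sentence becomes label token ids. `prepare_example` is a hypothetical helper; it assumes `processor` is a `WhisperProcessor` and that the audio was loaded at 16 kHz, as `load_audio` does.

```python
def prepare_example(audio, sentence, processor, sampling_rate=16000):
    """Convert one (waveform, transcript) pair into model inputs and label ids."""
    # The feature extractor returns a batch; take the single item back out.
    features = processor.feature_extractor(audio, sampling_rate=sampling_rate)
    return {
        "input_features": features.input_features[0],
        "labels": processor.tokenizer(sentence).input_ids,
    }
```

It could be applied to the rows that actually have audio loaded, for example `prepare_example(row['audio'], row['sentence'], processor)` for each of the first ten rows of `train_df`.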