# <b> <span style='color:#F1A424'>|</span> TABLE OF CONTENTS</b>



<a id="0"></a>
# <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 0px solid #000000">Step 0 - Introduction</p>
## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">WELCOME TO SPEECH RECOGNITION CONTEST</p>

**AUTHOR - SUJAY KAPADNIS**

**DATE - 20 July 2023 - It's a nice day and sun's out - Let's go**

## About Competition
- We are asked to recognize Bengali speech from out-of-distribution audio recordings.
- Dataset has **1,200 hours** of data from ~24,000 people from India and Bangladesh.
- The test set contains samples from **17 different domains** that are not present in training.
- Size of the data - **`26GB`**
- This is a Code Competition in which the **actual test set is hidden**. The public version provides some sample data in the correct format to help with authoring solutions. The full test set contains about **20 hours of speech** in almost **8,000 MP3** audio files. All of the files in the test set are **encoded at a sample rate of 32 kHz and a bit rate of 48 kbps, in one channel**.
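
As a quick back-of-the-envelope check, the figures quoted above imply an average test clip of only a few seconds. This is a rough sketch that uses only the approximate numbers from the description (about 20 hours, almost 8,000 files), so the result is an estimate, not a measured value.

```python
# Approximate figures quoted in the competition description above.
total_hours = 20      # about 20 hours of speech in the hidden test set
num_files = 8000      # almost 8,000 MP3 files

# Convert hours to seconds and divide by the number of files.
avg_seconds = total_hours * 3600 / num_files
print(f"Estimated average clip length: {avg_seconds:.1f} s")
```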

### Out-of-Source Distribution
- "Out-of" means apart from, or outside of.
- "Source" refers to the training data.
- So it simply means that the data our model will be tested on is a little different from the data it was trained on. This is challenging because models tend to perform poorly in such cases.
- Out-of-distribution recordings can include variations in accent, background noise, recording conditions, speaking styles, or dialects that were not sufficiently represented in the training data.

### Files
**`train/`** -  The training set, comprising several thousand recordings in **MP3 format**.

**`test/`** -  The test set, comprising spontaneous speech recordings from **18 domains**, **17 of which are out-of-distribution with respect to the training set**. There may be domains in the private test set that are **not in the public test set**.

**`examples/`** An **example recording** for **each test set domain**. You may find these example recordings **helpful for creating models robust to domain variation**. These are representative recordings and **none of them are present in the test set**.

**`train.csv`** -  Sentence labels for the training set.
 - `id` -  A unique identifier for this instance. **Corresponds to the file {id}.mp3 in train/**.
 - `sentence` -  A **plain-text transcription of the recording**. Your goal is to **predict these sentences for each recording in the test set**.
 - `split` -  Whether **train or valid**. The annotations in the **valid** split have been **manually reviewed and corrected**, while the annotations in the **train** split have only been **algorithmically cleaned**. The **valid** samples will generally have **higher quality annotations** than the train samples, but are otherwise **drawn from the same distribution**.
 
**`sample_submission.csv`** -  A sample submission file in the correct format. See the Evaluation page for more details.
 
### Evaluation metric
#### WER - Word Error Rate
- The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level
##### WER = (S + I + D) / N
Where
- `S` is the number of word substitutions (words incorrectly recognized by the system).
- `I` is the number of word insertions (words present in the system's output but not in the reference).
- `D` is the number of word deletions (words present in the reference but not recognized by the system).
- `N` is the total number of words in the reference transcription.

#### Example - 

- **Reference (ground truth)** - Say my name.
- **Prediction (model output)** - Say your name.
 - S -  There is one substitution (your instead of my).
 - I -  There are no insertions.
 - D -  There are no deletions.
 - N -  The reference transcription has 3 words.
      - WER = (S + I + D) / N
      - WER = (1 + 0 + 0) / 3
      - WER = 1 / 3 ≈ 0.333 (approximately 33.3%)

**The lower the WER, the better.**
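
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration that computes the word-level Levenshtein distance with dynamic programming and divides it by the number of reference words. The function name `word_error_rate` is made up for this example, and it is not necessarily how the competition scorer is implemented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Same example as above: one substitution out of three reference words.
print(word_error_rate("Say my name.", "Say your name."))
```

Note that the WER can exceed 1.0 when the prediction contains many more words than the reference, because insertions are counted against the reference length.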


### Sample Submission

>
id,sentence

0f3dac00655e,এছাড়াও নিউজিল্যান্ড এ ক্রিকেট দলের হয়েও খেলছেন তিনি।

a9395e01ad21,এছাড়াও নিউজিল্যান্ড এ ক্রিকেট দলের হয়েও খেলছেন তিনি।

bf36ea8b718d,এছাড়াও নিউজিল্যান্ড এ ক্রিকেট দলের হয়েও খেলছেন তিনি।
...
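
Transcriptions can contain commas, so it is safer to let a CSV writer handle quoting than to join the two fields by hand. Below is a minimal sketch using Python's built-in `csv` module. The first row is taken from the sample above, while the second sentence is a made-up example chosen only because it contains a comma.

```python
import csv
import io

rows = [
    ("0f3dac00655e", "এছাড়াও নিউজিল্যান্ড এ ক্রিকেট দলের হয়েও খেলছেন তিনি।"),
    ("a9395e01ad21", "হ্যাঁ, ঠিক আছে।"),  # hypothetical sentence containing a comma
]

# Write to an in-memory buffer; a real submission would write to submission.csv.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "sentence"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back shows that the comma inside the sentence survives.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(parsed[1]["sentence"])
```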


### This is a work in progress. If you find it helpful, kindly upvote.

<a id="1"></a>
# <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 1 - LOAD YOUR DEPENDENCIES</p>

Links I encourage you to open in a new tab:
1. https://www.kaggle.com/code/samuelcortinhas/catalogue-of-my-kaggle-notebooks (see the NLP Series notebook)

In [None]:
import os
import pandas as pd
import librosa
import numpy as np
import IPython.display as ipd
import torchaudio
import librosa.display
import matplotlib.pyplot as plt
import random
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from gensim import models

<a id="2"></a>
# <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 2 - EDA</p>
<a id="2"></a>
## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 2.1 - Load the Data</p>

In [None]:
# Define the paths to the data directories
BASE_DIR = '/kaggle/input/bengaliai-speech'
train_data_dir = f"{BASE_DIR}/train_mp3s/"  
test_data_dir = f"{BASE_DIR}/test_mp3s/" 
train_csv_path = f"{BASE_DIR}/train.csv" 
domains = f"{BASE_DIR}/examples/" 

path_template = "/kaggle/input/bengaliai-speech/train_mp3s/{}.mp3"

# Load the train.csv file using pandas
train_df = pd.read_csv(train_csv_path)

# Preview the first few rows of the DataFrame
display(train_df.head())

DOMAINS = os.listdir(f'{BASE_DIR}/examples')

In [None]:
# # domains
# DOMAINS = !ls /kaggle/input/bengaliai-speech/examples/
# DOMAINS

<a id="2"></a>
## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 2.2 - Summary</p>

In [None]:
# Load audio files and corresponding transcriptions
audio_data = []  # List to store audio data
transcriptions = []  # List to store corresponding transcriptions

for idx, row in train_df.head(5).iterrows():
    audio_file_path = path_template.format(row['id'])

    # Load the audio file using librosa
    audio, sr = librosa.load(audio_file_path, sr=None)

    # Append audio data and transcription to lists
    audio_data.append(audio)
    transcriptions.append(row['sentence'])
    
audio_data = np.array(audio_data,dtype = 'object')
transcriptions = np.array(transcriptions,dtype = 'object')

# Check the shapes of the loaded data
print("Audio data shape:", audio_data.shape)
print("Transcriptions shape:", transcriptions.shape)


In [None]:
# Get the total number of audio files in the training and test directories
train_audio_files = os.listdir(train_data_dir)
test_audio_files = os.listdir(test_data_dir)

# Get the total duration of audio data in the training set (in seconds)
train_total_duration = 0
SAMPLES_TAKEN = 10000
for idx, row in train_df.head(SAMPLES_TAKEN).iterrows():
    audio_file_path = path_template.format(row['id'])
#     audio_info = torchaudio.info(audio_file_path)
#     duration = audio_info.num_frames / audio_info.sample_rate
#     train_total_duration += duration

test_duration = 0
for test_file in test_audio_files:
    audio_file_path = os.path.join(test_data_dir, str(test_file))
#     audio_info = torchaudio.info(audio_file_path)
#     duration = audio_info.num_frames / audio_info.sample_rate
#     test_duration += duration
    
# Get the total number of samples in the training data
total_samples = train_df.shape[0]

# Print the data summary
print("Data Summary:")
print(f"Total number of audio files in the training directory: {len(train_audio_files)}")
print(f"Total number of audio files in the test directory: {len(test_audio_files)}")
# print(f"Total duration of audio data in the training set (in seconds): {train_total_duration:.2f}")
# print(f"Average duration of audio file in training set (in seconds): {train_total_duration/SAMPLES_TAKEN:.2f}")
# print(f"Total duration of audio data in the test set (in seconds): {test_duration:.2f}")
print(f"Total number of samples in the training data: {total_samples}")

<a id="2"></a>
## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 2.3 - Listen</p>

In [None]:
# !pip install mutagen --quiet
# import mutagen
# from mutagen.mp3 import MP3
# train_df['duration'] = train_df['id'].apply(lambda x: MP3(os.path.join(train_data_dir,(x+'.mp3'))).info.length)
# train_df.head()

In [None]:
# Hello Hello , mic check
# Choose some random indices for checking
random_indices = [0, 10, 20, 30, 40]

for idx in random_indices:
    row = train_df.iloc[idx]
    audio_file_path = path_template.format(row['id'])
#     audio = MP3(audio_file_path)
#     print(audio.info.length)

    # Load the audio file using librosa
    audio, sr = librosa.load(audio_file_path, sr=None)

    # Print the transcription and play the audio
    print("Transcription:", row['sentence'])
    ipd.display(ipd.Audio(audio, rate=sr))

In [None]:
for test_file in test_audio_files:
    audio_file_path = os.path.join(test_data_dir, str(test_file))
    # Load the audio file using librosa
    audio, sr = librosa.load(audio_file_path, sr=None)
    ipd.display(ipd.Audio(audio, rate=sr))

<a id="2"></a>
## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 2.4 - Visualization</p>

In [None]:
# Count recordings per split (train vs. valid); train.csv has no domain column
split_counts = train_df['split'].value_counts()

# Plot the data distribution
plt.figure(figsize=(10, 6))
split_counts.plot(kind='bar', color='skyblue')
plt.title("Data Distribution Across Splits")
plt.xlabel("Split")
plt.ylabel("Number of Recordings")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
for idx in random_indices:
    row = train_df.iloc[idx]
    audio_file_path = os.path.join(train_data_dir, f"{row['id']}.mp3")

    # Load the audio file using librosa
    audio, sr = librosa.load(audio_file_path, sr=None)

    # Plot the waveform
    plt.figure(figsize=(10, 4))
    librosa.display.waveshow(audio, sr=sr)
    plt.title(f"Waveform - Audio File ID: {row['id']}")
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.tight_layout()
    plt.show()

    # Plot the log Mel spectrogram
    plt.figure(figsize=(10, 4))
    mel_spec = librosa.feature.melspectrogram(y=audio, sr=sr)
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
    librosa.display.specshow(mel_spec_db, sr=sr, x_axis='time', y_axis='mel')
    plt.colorbar(format='%+2.0f dB')
    plt.title(f"Log Mel Spectrogram - Audio File ID: {row['id']}")
    plt.xlabel("Time (s)")
    plt.ylabel("Mel Frequency")
    plt.tight_layout()
    plt.show()

Learn more about [Mel Spectrograms](https://youtu.be/9GHCiiDLHQ4)
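
To give a feel for what the "Mel Frequency" axis means, here is a small sketch of one widely used Hz-to-mel conversion (the HTK-style formula). This is an illustration only: as far as I know, librosa's default mel scale uses a slightly different (Slaney-style) variant, so the numbers will not match the plots above exactly. The function names are made up for this example.

```python
import math


def hz_to_mel_htk(freq_hz: float) -> float:
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel: float) -> float:
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in Hz become smaller and smaller steps in mel at high frequencies.
for freq in (250, 500, 1000, 2000, 4000, 8000):
    print(f"{freq:>5} Hz -> {hz_to_mel_htk(freq):7.1f} mel")
```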

<a id="2"></a>
## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 2.5 - Sentence Analysis</p>


<a id="2"></a>
### <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 2.5.1 - Text Preprocessing</p>

In [None]:
# Select the first 5 transcriptions
transcriptions = train_df['sentence'][:5].tolist()

# Lowercasing only affects embedded Latin text; the Bengali script has no letter case
transcriptions_lower = [transcription.lower() for transcription in transcriptions]

# Remove punctuation; string.punctuation is ASCII-only, so the Bengali danda is added explicitly
translator = str.maketrans("", "", string.punctuation + "।")
transcriptions_no_punct = [transcription.translate(translator) for transcription in transcriptions_lower]

# Tokenization
nltk.download('punkt')  
transcriptions_tokens = [word_tokenize(transcription) for transcription in transcriptions_no_punct]

# Remove Bengali stopwords
nltk.download('stopwords')  
stop_words = set(stopwords.words('bengali'))
transcriptions_no_stopwords = [
    [word for word in tokens if word not in stop_words]
    for tokens in transcriptions_tokens
]

# Note: PorterStemmer implements English suffix rules, so it has little to no
# effect on Bengali tokens; it is kept here only to show the usual pipeline shape
stemmer = PorterStemmer()
transcriptions_stemmed = [
    [stemmer.stem(word) for word in tokens]
    for tokens in transcriptions_no_stopwords
]

# Print the preprocessed transcriptions
for i, transcription in enumerate(transcriptions_stemmed):
    print(f"Preprocessed transcription {i+1}: {' '.join(transcription)}")


<a id="2"></a>
### <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 2.5.2 - Tokenization</p>

In [None]:
# Use all transcriptions in the training set
transcriptions = train_df['sentence'].tolist()

# Tokenization
nltk.download('punkt')  # Download the Punkt tokenizer
transcriptions_tokens = [word_tokenize(transcription) for transcription in transcriptions]

# Print the tokenized transcriptions
for i, transcription_tokens in enumerate(transcriptions_tokens):
    if i%100000 == 0:
        print(f"Tokenized transcription {i+1}: {transcription_tokens}")

<a id="2"></a>
### <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 2.5.3 - Vocab Analysis</p>

In [None]:
# Build vocabulary
vocabulary = set()
for transcription_tokens in transcriptions_tokens:
    vocabulary.update(transcription_tokens)

# print("Vocabulary:")
# print(vocabulary)
print(f"Vocabulary Size: {len(vocabulary)}")

<a id="2"></a>
### <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 2.5.4 - Sentence Length analysis</p>

In [None]:
# Sentence length in characters (the next cell measures length in tokens)
lens = train_df.sentence.apply(len)
plt.hist(lens)

In [None]:
# Compute descriptive statistics
sentence_lengths = [len(tokens) for tokens in transcriptions_tokens]
min_length = min(sentence_lengths)
max_length = max(sentence_lengths)
mean_length = sum(sentence_lengths) / len(sentence_lengths)
median_length = np.median(sentence_lengths)  # true median, also for an even number of sentences

# Plot the distribution of sentence lengths
plt.figure(figsize=(10, 6))
plt.hist(sentence_lengths, bins=50, color='skyblue', edgecolor='black')
plt.axvline(mean_length, color='red', linestyle='dashed', linewidth=2, label='Mean')
plt.axvline(median_length, color='green', linestyle='dashed', linewidth=2, label='Median')
plt.xlabel('Sentence Length')
plt.ylabel('Frequency')
plt.title('Distribution of Sentence Lengths')
plt.legend()
plt.show()

# Print the descriptive statistics
print("Descriptive Statistics:")
print(f"Minimum Sentence Length: {min_length}")
print(f"Maximum Sentence Length: {max_length}")
print(f"Mean Sentence Length: {mean_length:.2f}")
print(f"Median Sentence Length: {median_length}")
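
A small aside on the median: for an even number of values it is the mean of the two middle values, which differs from the common shortcut of indexing the sorted list at `len // 2`. The toy list below is made up purely to show the difference; on a dataset of this size the gap is usually negligible.

```python
import statistics

# Hypothetical sentence lengths (an even number of values).
lengths = [3, 5, 8, 13]

# Shortcut: picks the upper of the two middle values.
upper_middle = sorted(lengths)[len(lengths) // 2]

# True median: mean of the two middle values.
true_median = statistics.median(lengths)

print(f"sorted[len // 2]: {upper_middle}")
print(f"statistics.median: {true_median}")
```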

<a id="2"></a>
### <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 2.5.5 - Word Frequency Analysis</p>

In [None]:
# Flatten the list of tokens
all_tokens = [token for tokens in transcriptions_tokens for token in tokens]

# Count word frequency
word_frequency = Counter(all_tokens)

# Plot the count of each distinct word, in the order the words were first seen
print("Word Frequency:")
plt.plot(list(word_frequency.values()))
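
A `Counter` can also report the most frequent words directly, and sorting the counts gives a rank-frequency curve that is often easier to read than counts in first-seen order. The sketch below uses a made-up English token list so that it runs on its own; in the notebook the same calls would apply to the token list built above.

```python
from collections import Counter

# Hypothetical tokens standing in for the flattened transcription tokens.
tokens = "the cat saw the dog and the bird".split()
freq = Counter(tokens)

# The n most frequent words, most frequent first.
top_words = freq.most_common(2)
print(top_words)

# Counts sorted from most to least frequent (rank-frequency view).
ranked_counts = sorted(freq.values(), reverse=True)
print(ranked_counts)
```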

In [None]:
# Calculate the number of unique words in each sentence
sentences = train_df['sentence'].tolist()
unique_word_counts = [len(set(sentence.split())) for sentence in sentences]

# Compute descriptive statistics
min_unique_words = min(unique_word_counts)
max_unique_words = max(unique_word_counts)
mean_unique_words = sum(unique_word_counts) / len(unique_word_counts)
median_unique_words = np.median(unique_word_counts)  # true median, also for an even number of sentences

# Plot the distribution of unique word counts
plt.figure(figsize=(10, 6))
plt.hist(unique_word_counts, bins=50, color='lightcoral', edgecolor='black')
plt.axvline(mean_unique_words, color='red', linestyle='dashed', linewidth=2, label='Mean')
plt.axvline(median_unique_words, color='green', linestyle='dashed', linewidth=2, label='Median')
plt.xlabel('Number of Unique Words')
plt.ylabel('Frequency')
plt.title('Distribution of Unique Words in Transcriptions')
plt.legend()
plt.show()

# Print the descriptive statistics
print("Descriptive Statistics:")
print(f"Minimum Number of Unique Words: {min_unique_words}")
print(f"Maximum Number of Unique Words: {max_unique_words}")
print(f"Mean Number of Unique Words: {mean_unique_words:.2f}")
print(f"Median Number of Unique Words: {median_unique_words}")

In [None]:
# Select the first 10,000 sentences
sentences = train_df['sentence'][:10000].tolist()

# Tokenization using NLTK
nltk.download('punkt')  # Download the Punkt tokenizer
sentences_tokens = [word_tokenize(sentence) for sentence in sentences]


# Remove Bengali stopwords
sentences_no_stopwords = [
    [word for word in tokens if word not in stop_words]
    for tokens in sentences_tokens
]

# Convert tokenized sentences back to strings
sentences_processed = [' '.join(tokens) for tokens in sentences_no_stopwords]

# Create a CountVectorizer to convert text data to a bag-of-words representation
vectorizer = CountVectorizer(max_features=1000)
X = vectorizer.fit_transform(sentences_processed)

# Perform LDA topic modeling
n_topics = 10  # Number of topics to discover
lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda_model.fit(X)

# Get the top words for each topic
feature_names = vectorizer.get_feature_names_out()
top_words_per_topic = []
for topic_idx, topic in enumerate(lda_model.components_):
    top_words = [feature_names[i] for i in topic.argsort()[:-4:-1]]
    top_words_per_topic.append(top_words)

# Print the top words for each topic
for i, top_words in enumerate(top_words_per_topic):
    print(f"Topic {i + 1}: {' '.join(top_words)}")
    

In [None]:
lda_model.components_

#### The transcriptions are short, single-sentence utterances, so topic modelling is not very informative here.

<a id="3"></a>
### <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 3 - Domain Analysis</p>

#### About the domain distribution
- The number of domains in the training data is not known.
- The number of domains in the test data is 18.
- Audio samples for the 17 test domains that are not present in the training data are given in `examples/`.
- The remaining domain should be present in the training data.

In [None]:
for idx in np.arange(5):
    audio_file_path = f'{domains}/{DOMAINS[idx]}'

    # Load the audio file using librosa
    audio, sr = librosa.load(audio_file_path, sr=None)

    # Print DOMAIN
    print(DOMAINS[idx])
    ipd.display(ipd.Audio(audio, rate=sr))

In [None]:
# Define the list of domains and their corresponding audio files
DOMAINS = [
    'Audiobook.wav', 'Parliament Session.wav', 'Bangladeshi TV Drama.wav',
    'Poem Recital.wav', 'Bengali Advertisement.wav', 'Puthi Literature.wav',
    'Cartoon.wav', 'Slang Profanity.mp3', 'Debate.wav', 'Stage Drama Jatra.wav',
    'Indian TV Drama.wav', 'Talk Show Interview.wav', 'Movie.wav', 'Telemedicine.mp3',
    'News Presentation.wav', 'Waz Islamic Sermon.wav', 'Online Class.wav'
]

# Visualize the audio files and play them
for idx in np.arange(5):
    audio_file_path = os.path.join(domains, DOMAINS[idx])

    # Load the audio file using librosa
    audio, sr = librosa.load(audio_file_path, sr=None)

    # Plot the waveform
    plt.figure(figsize=(10, 4))
    librosa.display.waveshow(audio, sr=sr)
    plt.title(f'Waveform - {DOMAINS[idx]}')
    plt.xlabel('Time (s)')
    plt.ylabel('Amplitude')
    plt.show()

    # Plot the spectrogram
    plt.figure(figsize=(10, 4))
    D = librosa.amplitude_to_db(np.abs(librosa.stft(audio)), ref=np.max)
    librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='linear')
    plt.colorbar(format='%+2.0f dB')
    plt.title(f'Spectrogram - {DOMAINS[idx]}')
    plt.xlabel('Time (s)')
    plt.ylabel('Frequency (Hz)')
    plt.show()

    # Play the audio
    print(f"Audio: {DOMAINS[idx]}")
    ipd.display(ipd.Audio(audio, rate=sr))

<a id="4"></a>
### <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 4 - Model Building</p>
1. Public Wav2Vec2 model - no fine-tuning (FT), inference only
    - This notebook uses this [baseline model](https://www.kaggle.com/code/ttahara/bengali-sr-public-wav2vec2-0-w-lm-baseline) to understand the leaderboard. After that, we will fine-tune it or add some new models.



<a id="4"></a>
### <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 2px; color:#000000; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #000000">Step 4.1- Requirements</p>

In [None]:
# Copying packages and extracting them for installation
!cp -r ../input/python-packages2 ./

# Install jiwer (Word Error Rate calculation)
!tar xvfz ./python-packages2/jiwer.tgz
!pip install ./jiwer/jiwer-2.3.0-py3-none-any.whl -f ./ --no-index

# Install bnunicodenormalizer (Bengali Unicode normalization)
!tar xvfz ./python-packages2/normalizer.tgz
!pip install ./normalizer/bnunicodenormalizer-0.0.24.tar.gz -f ./ --no-index

# Install pyctcdecode (CTC decoding for Wav2Vec2)
!tar xvfz ./python-packages2/pyctcdecode.tgz
!pip install ./pyctcdecode/attrs-22.1.0-py2.py3-none-any.whl -f ./ --no-index --no-deps
!pip install ./pyctcdecode/exceptiongroup-1.0.0rc9-py3-none-any.whl -f ./ --no-index --no-deps
!pip install ./pyctcdecode/hypothesis-6.54.4-py3-none-any.whl -f ./ --no-index --no-deps
!pip install ./pyctcdecode/numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl -f ./ --no-index --no-deps
!pip install ./pyctcdecode/pygtrie-2.5.0.tar.gz -f ./ --no-index --no-deps
!pip install ./pyctcdecode/sortedcontainers-2.4.0-py2.py3-none-any.whl -f ./ --no-index --no-deps
!pip install ./pyctcdecode/pyctcdecode-0.4.0-py2.py3-none-any.whl -f ./ --no-index --no-deps

# Install pypi-kenlm (KenLM language model)
!tar xvfz ./python-packages2/pypikenlm.tgz
!pip install ./pypikenlm/pypi-kenlm-0.1.20220713.tar.gz -f ./ --no-index --no-deps




In [None]:
# Remove the extracted package folders now that everything is installed
!rm -r python-packages2 jiwer normalizer pyctcdecode pypikenlm

In [None]:
import typing as tp  # Typing module for type hints
from pathlib import Path  # For working with file paths
from functools import partial  # To create partial functions
from dataclasses import dataclass, field  # For creating data classes

import pandas as pd 
import numpy as np  
from tqdm.notebook import tqdm  # For creating progress bars

import librosa  # For audio processing

import pyctcdecode  # For CTC decoding
import kenlm  # For working with KenLM language model
import torch  # PyTorch library for machine learning
from transformers import Wav2Vec2Processor, Wav2Vec2ProcessorWithLM, Wav2Vec2ForCTC  # For loading and processing Wav2Vec2 models
from bnunicodenormalizer import Normalizer  # For Bengali Unicode normalization

import cloudpickle as cpkl  # For pickling objects

In [None]:
# Define paths and parameters for the project
ROOT = Path.cwd().parent
INPUT = ROOT / "input"
DATA = INPUT / "bengaliai-speech"
TRAIN = DATA / "train_mp3s"
TEST = DATA / "test_mp3s"
SAMPLING_RATE = 16_000
MODEL_PATH = INPUT / "bengali-sr-download-public-trained-models/indicwav2vec_v1_bengali/"
LM_PATH = INPUT / "bengali-sr-download-public-trained-models/wav2vec2-xls-r-300m-bengali/language_model/"


In [None]:
# Load Wav2Vec2 model and processor
model = Wav2Vec2ForCTC.from_pretrained(MODEL_PATH)  # CTC instance
# The processor handles the audio data (feature extraction) and the tokenizer
processor = Wav2Vec2Processor.from_pretrained(MODEL_PATH)

In [None]:
# build the vocabulary and a decoder

# Get the vocabulary from the model's tokenizer
vocab_dict = processor.tokenizer.get_vocab()
print('Length of the vocabulary: ', len(vocab_dict))
vocab_dict


## Joiners and non-joiners
In the vocabulary we can see some joiners and non-joiners, so let's get to know them: <br>
Joiners and non-joiners are invisible control characters used in text processing to manage whether adjacent characters combine or remain separate when displayed or processed. They are particularly relevant in scripts, such as Bengali, where characters are connected or joined together to form complex glyphs, ligatures, or conjuncts.
1. Joiners:
Joiners are characters that request that the surrounding characters be rendered in a joined form, such as a single glyph or ligature. They are used to ensure proper rendering of characters that are typically connected in certain scripts. Example:

    `ZERO WIDTH JOINER ('\u200d')`: Used to join characters without adding any visible space.

2. Non-Joiners:
Non-joiners are characters that indicate that two adjacent characters should remain separate and not be combined into a single glyph. They are used to prevent the automatic formation of ligatures. Example:

    `ZERO WIDTH NON-JOINER ('\u200c')`: Used to prevent characters from combining into ligatures.

A related invisible character that may also appear in text is `LEFT-TO-RIGHT MARK ('\u200e')`. It is a directional mark rather than a joiner: it enforces a left-to-right writing direction, even within right-to-left text, and does not affect how characters join.
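
These characters are invisible when printed, so a quick way to identify them is to ask Python's built-in `unicodedata` module for their official names. A minimal sketch:

```python
import unicodedata

# Invisible formatting characters that can show up in a tokenizer vocabulary.
control_chars = ["\u200c", "\u200d", "\u200e"]

for ch in control_chars:
    code_point = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch)
    category = unicodedata.category(ch)  # "Cf" means "format" character
    print(code_point, name, category)
```

All three belong to the Unicode "format" category (`Cf`), which is why they take up no visible space.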

In [None]:
# Sort the vocabulary based on token IDs
sorted_vocab_dict = {k: v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}

# Build a CTC decoder using the sorted vocabulary and a language model
decoder = pyctcdecode.build_ctcdecoder(
    list(sorted_vocab_dict.keys()),  # Vocabulary keys
    str(LM_PATH / "5gram.bin"),  # Path to the language model file
)
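
To build some intuition for what a CTC decoder does, here is a greatly simplified greedy version: take the most likely token per frame, collapse consecutive repeats, and drop the blank token. This is only a sketch. The decoder built above performs beam search with a language model, which this example does not attempt. The toy vocabulary, the use of `<pad>` as the blank token, and `|` as the word delimiter are assumptions made for illustration.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame predictions and remove CTC blanks."""
    tokens = []
    previous = None
    for idx in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != previous and idx != blank_id:
            tokens.append(id_to_token[idx])
        previous = idx
    # Turn the word delimiter into a normal space.
    return "".join(tokens).replace(word_delimiter, " ")


# Hypothetical toy vocabulary and per-frame argmax ids.
toy_vocab = {0: "<pad>", 1: "a", 2: "b", 3: "|"}
toy_frames = [1, 1, 0, 1, 2, 2, 0, 3, 3, 1]

decoded = ctc_greedy_decode(toy_frames, toy_vocab)
print(repr(decoded))
```

The blank between the first two runs of `a` is what allows a genuinely doubled letter to survive the collapsing step.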


In [None]:
# Create a combined processor for Wav2Vec2 model input and language model decoding
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,  # Feature extractor for audio data
    tokenizer=processor.tokenizer,  # Tokenizer for text data
    decoder=decoder  # Decoder for converting model output to text
)

In [None]:
import torch
from torch.utils.data import Dataset

class BengaliSRTestDataset(Dataset):
    # A custom dataset class for handling Bengali speech test data
    
    def __init__(self, audio_paths: list[str], sampling_rate: int):
        # Constructor to initialize the dataset
        
        # Store the list of audio file paths
        self.audio_paths = audio_paths
        
        # Store the sampling rate used for audio processing
        self.sampling_rate = sampling_rate
    
    def __len__(self):
        # Return the total number of samples in the dataset
        return len(self.audio_paths)
    
    def __getitem__(self, index: int):
        # Get a sample from the dataset given an index
        
        # Get the audio file path corresponding to the index
        audio_path = self.audio_paths[index]
        
        # Get the sampling rate from the dataset settings
        sr = self.sampling_rate
        
        # Load the audio file using librosa, resampling to the desired sampling rate
        # 'mono=False' keeps the channel layout as stored; the test files are
        # single-channel, so the result is still a 1-D array
        # [0] at the end gets the audio signal (the first element of the returned tuple)
        audio_signal = librosa.load(audio_path, sr=sr, mono=False)[0]
        
        # Return the loaded audio signal as the sample
        return audio_signal


In [None]:
test = pd.read_csv(DATA / "sample_submission.csv", dtype={"id": str})
print(test.head())

In [None]:
test_audio_paths = [str(TEST / f"{aid}.mp3") for aid in test["id"].values]

In [None]:
# Create a dataset for testing using the list of test audio paths and specified sampling rate
test_dataset = BengaliSRTestDataset(
    test_audio_paths, SAMPLING_RATE
)

# Define a partial function for collating samples into batches
collate_func = partial(
    processor_with_lm.feature_extractor,
    return_tensors="pt", sampling_rate=SAMPLING_RATE,
    padding=True,
)

# Create a data loader for testing
test_loader = torch.utils.data.DataLoader(
    test_dataset, batch_size=4, shuffle=False,
    num_workers=4, collate_fn=collate_func, drop_last=False,
    pin_memory=True,
)
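
Clips in a batch have different lengths, so the collate step has to pad them to a common length before they can be stacked into one tensor. The sketch below shows the idea in plain Python with made-up values. The real feature extractor does more than this (for example, it returns tensors and may normalize the signal), so treat it as an illustration of padding only. The function name `pad_batch` is hypothetical.

```python
def pad_batch(signals, pad_value=0.0):
    """Right-pad every signal to the length of the longest one in the batch."""
    max_len = max(len(signal) for signal in signals)
    padded = [
        list(signal) + [pad_value] * (max_len - len(signal))
        for signal in signals
    ]
    # 1 marks real samples, 0 marks padding.
    mask = [
        [1] * len(signal) + [0] * (max_len - len(signal))
        for signal in signals
    ]
    return padded, mask


# Two hypothetical "clips" of different lengths.
padded, mask = pad_batch([[0.1, 0.2, 0.3], [0.5]])
print(padded)
print(mask)
```

Padding to the longest clip in each batch, rather than to a global maximum, keeps the amount of wasted computation small.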


In [None]:
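if not torch.cuda.is_available():
    device = torch.device("cpu")
else:
    device = torch.device("cuda")
print(device)

# Move the model to the selected device and switch to inference mode
model = model.to(device)
model = model.eval()
if device.type == "cuda":
    # Use half precision only on GPU; many CPU operations lack fp16 support
    model = model.half()
    torch.cuda.empty_cache()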

In [None]:
pred_sentence_list = []  # Initialize an empty list to store predicted sentences

# Perform inference without gradient computation because we are not fine tuning so we don't want to change the weights
with torch.no_grad():
    for batch in tqdm(test_loader):  # Iterate through batches of test data
        x = batch["input_values"]  # Extract the input audio features from the batch
        x = x.to(device, non_blocking=True)  # Move the input data to the device (GPU)
        
        # Use automatic mixed precision for faster and more memory-efficient inference
        with torch.cuda.amp.autocast(True):
            y = model(x).logits  # Get the model's output logits
        del x
        y = y.detach().cpu().numpy()  # Move the logits to the CPU and convert to a numpy array
        
        for l in y:  # Iterate through the logits of the batch
            # Decode the logits into a sentence using the LM with beam search decoding
            sentence = processor_with_lm.decode(l, beam_width=64).text
            pred_sentence_list.append(sentence)  # Append the predicted sentence to the list


In [None]:
bnorm = Normalizer()  # Create a Normalizer object for text normalization

def postprocess(sentence):
    # Define a postprocessing function to clean up and format predicted sentences
    
    period_set = set([".", "?", "!", "।"])  # Set of sentence-ending punctuation
    
    # Split the sentence into words and apply normalization using the Normalizer
    _words = [bnorm(word)['normalized'] for word in sentence.split()]
    
    # The normalizer can return None for a word it cannot repair, so drop those
    sentence = " ".join(word for word in _words if word)
    
    # Make sure the sentence ends with sentence-ending punctuation
    if not sentence.endswith(tuple(period_set)):
        sentence += "।"
    return sentence
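
The final-punctuation step matters because WER compares whitespace-separated words, so a last word with and without the danda counts as two different words (assuming the scorer does not strip punctuation, which is not stated here). Below is a standalone sketch of just that step, without the Bengali normalizer. The function name `ensure_final_punctuation` is made up for this example.

```python
SENTENCE_ENDINGS = (".", "?", "!", "।")


def ensure_final_punctuation(sentence: str) -> str:
    """Collapse extra whitespace and append a danda if no ending mark is present."""
    sentence = " ".join(sentence.split())
    if not sentence.endswith(SENTENCE_ENDINGS):
        sentence += "।"
    return sentence


print(ensure_final_punctuation("তিনি  খেলছেন"))
print(ensure_final_punctuation("তিনি খেলছেন?"))
```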


In [None]:
pp_pred_sentence_list = [
    postprocess(s) for s in tqdm(pred_sentence_list)]

In [None]:
test["sentence"] = pp_pred_sentence_list

test.to_csv("submission.csv", index=False)

print(test.head())

# Author's Note


> Hello there, this is my first speech recognition competition and a step towards Kaggle featured competitions. Although I know the concepts of NLP, implementing it all at once at such a huge scale is quite overwhelming, but I think it can be managed, since this is just the start. 

> I hope you have enjoyed this work so far. I will update the kernel periodically. 

> If you liked it, consider upvoting. Suggestions of any kind are very much appreciated.

> Adios
**Have a great day ahead**

- Sorry for this delayed update, I was exploring some LLMs. From now on there will be frequent updates. Thank you for your patience.