# Automatic Speech Recognition (ASR) Model for Transcribing Swahili Audio to Text

![image-2.png](attachment:image-2.png)

## Business Understanding

### Overview

Nyika Analytika, a prominent data analysis company, is undertaking a project to advance speech recognition technology for Swahili, one of the most widely spoken languages in Africa. With a linguistic history spanning centuries and millions of speakers across several countries, Swahili is a symbol of cultural heritage and identity for communities throughout East Africa and beyond.

In recent years, the proliferation of digital audio content in Swahili has presented both opportunities and challenges. Nyika Analytika recognizes the pressing need to develop automated systems capable of accurately classifying and transcribing Swahili audio recordings. This project represents a significant endeavor to leverage cutting-edge machine learning techniques to bridge the gap between spoken Swahili and written text.

The intricate phonetics and dialects of Swahili pose a compelling challenge for Automatic Speech Recognition (ASR) systems. However, the potential applications of such technology are vast, ranging from transcription and translation services to content recommendation and language learning tools. By harnessing the power of AI and data-driven approaches, Nyika Analytika aims to unlock new possibilities for preserving and promoting Swahili's linguistic and cultural heritage while facilitating cross-cultural communication on a global scale.

This project will not only contribute to the advancement of ASR technology but also serve as a testament to the dynamic nature of language evolution, shaped by historical forces such as trade, migration, colonization, and cultural exchange. Through rigorous experimentation and collaboration with a team of experts, Nyika Analytika is poised to develop a robust and adaptable model capable of accurately transcribing Swahili audio with a high level of precision and efficiency.

In the following sections, we will delve into the specifics of our approach, outlining our objectives, experimental design, data collection and preprocessing methods, exploratory data analysis techniques, feature engineering strategies, modeling frameworks, and evaluation metrics. By sharing our insights and findings, we hope to contribute to the broader discourse surrounding ASR technology and its implications for linguistic diversity, cultural preservation, and technological innovation.

### Statement of the Problem

The linguistic landscape of East Africa is enriched by Swahili, a language with a storied history dating back centuries and serving as a cornerstone of cultural identity for millions of people across the region. In today's digital age, the proliferation of audio content in Swahili presents a pressing challenge: how can we effectively harness technology to transcribe and analyze spoken Swahili with accuracy and efficiency?

The complexity of Swahili phonetics and its dialectal variation pose a significant obstacle for conventional speech recognition systems. Existing ASR models often struggle to capture the nuances of Swahili pronunciation, which leads to transcription errors. This not only hinders the accessibility of Swahili content but also impedes the development of applications such as language learning tools, content recommendation systems, and automated translation services.

Moreover, the lack of tailored ASR solutions for Swahili perpetuates disparities in linguistic representation and access to digital resources. As Swahili continues to assert its prominence as a global language of communication, commerce, and culture, there is an urgent need to develop specialized tools and methodologies that cater to its unique linguistic characteristics.

The absence of robust ASR systems for Swahili undermines efforts to preserve and promote the language's rich heritage. Without accurate transcription and analysis capabilities, valuable insights from Swahili audio content—ranging from oral histories and cultural narratives to educational materials and political discourse—remain largely untapped. This not only limits the dissemination of Swahili knowledge and perspectives but also hampers efforts to bridge linguistic divides and foster cross-cultural understanding.

In light of these challenges, Nyika Analytika recognizes the need for a comprehensive and innovative approach to Swahili ASR. By leveraging advances in machine learning, natural language processing, and audio signal processing, we aim to develop a state-of-the-art ASR model capable of accurately transcribing spoken Swahili with high fidelity. Through this endeavor, we seek to empower Swahili speakers to fully engage with digital content in their native language while also advancing the broader field of speech recognition technology.

### Objectives

#### Main Objective:
To develop an automated system for converting basic Swahili audio into written text using state-of-the-art speech recognition technology. This overarching goal encompasses a series of specific objectives aimed at achieving robust and accurate transcription capabilities for spoken Swahili.

#### Specific Objectives:
1. Develop a machine learning model capable of accurately transcribing Swahili audio recordings:
   - This involves training and fine-tuning a neural network-based ASR model specifically tailored to the phonetics and dialects of Swahili. The model should demonstrate proficiency in recognizing and interpreting spoken Swahili with a high level of accuracy and reliability.
2. Deploy a model that transcribes the recorded audio files:
   - Once the ASR model has been trained and validated, it will be deployed as a functional system capable of transcribing Swahili audio recordings in real time or in batch mode. The deployment process will involve optimizing the model's performance for efficiency and scalability, ensuring seamless integration with existing workflows and applications.
3. Provide recommendations for further enhancements and applications:
   - In addition to developing the core ASR system, this project aims to identify opportunities for further enhancements and applications of Swahili speech recognition technology. This may include exploring additional features or algorithms to improve transcription accuracy, investigating potential use cases in fields such as education, healthcare, and media, and evaluating the impact of the ASR system on Swahili language preservation and accessibility.

### Experimental Design

![image.png](attachment:image.png)

### Metrics of Success:

1. Accuracy: 
   - The primary metric of success for the Swahili ASR system will be accuracy, measured as the ratio of correctly recognized words to the total number of words in the reference transcription. A high accuracy rate indicates the system's ability to understand and convert spoken language into text with a high level of correctness. The ASR model should strive to achieve a high overall accuracy rate across a diverse range of Swahili audio recordings, including various accents, dialects, and speaking styles.
2. Word Error Rate (WER):
   - The Word Error Rate (WER) compares the transcribed text to the reference text. It counts the word-level substitutions, deletions, and insertions needed to turn the reference into the system's output, then divides that count by the number of words in the reference. A lower WER indicates a closer match between the system's output and the expected transcript; because insertions are counted, WER can exceed 1. Minimizing the WER is essential for ensuring the fidelity and reliability of the transcribed Swahili text, particularly in applications where precision and comprehension are paramount.
3. Speed and Efficiency:
   - In addition to accuracy, the speed and efficiency of the ASR system are crucial metrics of success. The system should be capable of transcribing Swahili audio recordings in a timely manner, with minimal latency and processing overhead. Achieving high throughput and low latency is essential for real-time applications such as live transcription, voice search, and interactive voice response systems. By optimizing computational resources and algorithmic efficiency, the ASR system can deliver fast and responsive performance without sacrificing accuracy or reliability.
4. Robustness and Generalization:
   - The ASR system should demonstrate robustness and generalization across diverse linguistic contexts and environmental conditions. It should be capable of accurately transcribing Swahili audio recordings from speakers with varying accents, speech impediments, and background noise levels. Robustness to environmental factors such as ambient noise, reverberation, and channel distortion is essential for ensuring consistent performance in real-world settings. Additionally, the ASR system should generalize well to unseen data and adapt gracefully to new speakers and speaking styles, reflecting the richness and variability of natural language usage.
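
The first two metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's evaluation code: the notebook imports `accuracy_score` from scikit-learn and `wer` from `jiwer` for this purpose, while the function names `accuracy` and `word_error_rate` below are hypothetical, and the comparison is assumed to be case-sensitive on whitespace-separated words.

```python
def accuracy(references, hypotheses):
    """Fraction of items whose predicted label equals the reference label."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        raise ValueError("at least one reference is required")
    correct = sum(ref == hyp for ref, hyp in zip(references, hypotheses))
    return correct / len(references)


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from the hypothesis
            insertion = current[j - 1] + 1  # extra word in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)
```

For example, `word_error_rate("moja mbili tatu", "moja tatu")` counts one deletion against three reference words. When every reference and every prediction is exactly one word, as in a single-word classification setting, the corpus-level WER reduces to one minus the accuracy.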

## Data Understanding

The dataset for this project consists of audio recordings collected by 300 taskers from Kenya. Each audio recording corresponds to one of twelve different Swahili words, forming the basis of a multi-class classification problem.

The dataset is organized into three main files:
- `Swahili_words.zip`: contains all audio files, for both training and testing. Each audio file is a single utterance of one of the twelve Swahili words.
- `train.csv`: the training dataset. Each row gives the filename of an audio recording along with its corresponding label.
- `test.csv`: resembles `train.csv` but lacks the target-related columns. It contains only the filenames of the audio recordings, without labels, and is used to evaluate the trained model on unseen data.

Additionally, a validation dataset is provided as `valid.csv`. This file demonstrates the submission format for the competition: its `Audio_ID` column mirrors that of `test.csv`, and its `label` column is empty, to be filled with the predictions. The order of the rows in `valid.csv` does not matter, but the filenames in the `Audio_ID` column must be correct for accurate evaluation.
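
As a rough sketch of the submission format described above, the snippet below writes one `Audio_ID,label` row per test recording. It uses only the standard library to stay dependency-free; the helper name `build_submission`, the example file names, and the lowercase label spelling are assumptions for illustration, not taken from the competition files.

```python
import csv
import io


def build_submission(audio_ids, predicted_labels):
    """Return CSV text with a header and one `Audio_ID,label` row per recording."""
    if len(audio_ids) != len(predicted_labels):
        raise ValueError("need exactly one prediction per Audio_ID")
    if len(set(audio_ids)) != len(audio_ids):
        raise ValueError("Audio_ID values must be unique")

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Audio_ID", "label"])
    writer.writerows(zip(audio_ids, predicted_labels))
    return buffer.getvalue()


# Hypothetical file names and predictions, for illustration only.
example = build_submission(["id_001.wav", "id_002.wav"], ["ndio", "hapana"])
```

Because row order does not matter, the checks focus on what does: every `Audio_ID` appears exactly once and has a prediction. The same checks could be expressed with pandas, which the notebook already imports.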

Each audio recording corresponds to one of the following twelve Swahili words, along with their English translations:
- Ndio (Yes)
- Hapana (No)
- Mbili (Two)
- Tatu (Three)
- Nne (Four)
- Tano (Five)
- Sita (Six)
- Saba (Seven)
- Nane (Eight)
- Tisa (Nine)
- Kumi (Ten)
- Moja (One)
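
Since a classifier works with integer class indices rather than strings, the twelve words need a fixed label-to-index mapping. The sketch below is one possible encoding, assuming the order of the list above; the names `SWAHILI_WORDS`, `LABEL_TO_INDEX`, `INDEX_TO_LABEL`, and `encode_label` are hypothetical, and the capitalisation of labels in `train.csv` is not known here, so labels are lowercased before lookup.

```python
# The twelve target words and their English glosses, in the order listed above.
SWAHILI_WORDS = {
    "Ndio": "Yes",
    "Hapana": "No",
    "Mbili": "Two",
    "Tatu": "Three",
    "Nne": "Four",
    "Tano": "Five",
    "Sita": "Six",
    "Saba": "Seven",
    "Nane": "Eight",
    "Tisa": "Nine",
    "Kumi": "Ten",
    "Moja": "One",
}

# Assumption: label casing in train.csv may differ, so keys are lowercased.
LABEL_TO_INDEX = {word.lower(): index for index, word in enumerate(SWAHILI_WORDS)}
INDEX_TO_LABEL = {index: word for word, index in LABEL_TO_INDEX.items()}


def encode_label(label):
    """Map a label string to its class index, ignoring case and outer whitespace."""
    return LABEL_TO_INDEX[label.strip().lower()]
```

Keeping both directions of the mapping makes it straightforward to turn model outputs back into words when filling the `label` column of a submission.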

### Importing Libraries

```python
# Basic data manipulation and analysis
import numpy as np
import pandas as pd

# Data visualization libraries
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

# Audio preprocessing
import os
import torch
import shutil
import random
import librosa
import librosa.display
import noisereduce as nr
import IPython.display as ipd
from scipy.ndimage import minimum_filter1d
from tqdm.notebook import tqdm

from torch.utils.data import Dataset
from torch.utils.data.sampler import SubsetRandomSampler
from torch.utils.data import DataLoader

# Machine learning models
import torchaudio
from torch import nn
from torchsummary import summary
from torchvision.models import resnet18, ResNet18_Weights

# Model evaluation metrics
import sklearn.preprocessing
from sklearn.preprocessing import minmax_scale
from sklearn.metrics import accuracy_score
from jiwer import wer

# Import custom functions and trained models
# import Functions
import pickle

# Suppress warnings to keep the notebook output tidy
import warnings
warnings.filterwarnings('ignore')

# Set seeds for reproducibility (NumPy, Python's random module, and PyTorch)
np.random.seed(2022)
random.seed(2022)
torch.manual_seed(2022)
```

### Preliminary Analysis
A preliminary examination of the datasets will aid in understanding the nature, structure, and quality of the data. This involves evaluating the variables, identifying any missing or anomalous values, and ensuring the data is suitable for modeling.

We begin by loading and previewing the datasets:

```python
# Load the training and test metadata
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

# Unpack the audio archive into data/Swahili_words
shutil.unpack_archive('data/Swahili_words.zip', 'data/Swahili_words')
```
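
One early check for missing values is whether every listed recording actually exists in the unpacked archive. The helper below is a minimal sketch under stated assumptions: `find_missing_audio` is a hypothetical name, the IDs are assumed to be full file names including the extension, and the search is recursive because the folder layout inside the archive is not known here.

```python
from pathlib import Path


def find_missing_audio(audio_ids, audio_dir):
    """Return the IDs that have no matching file anywhere under audio_dir."""
    # Collect every file name below audio_dir, whatever the nesting depth.
    available = {path.name for path in Path(audio_dir).rglob("*") if path.is_file()}
    return [audio_id for audio_id in audio_ids if audio_id not in available]
```

If the CSV files name their filename column `Audio_ID`, as `valid.csv` does, a call such as `find_missing_audio(test["Audio_ID"], "data/Swahili_words")` would list any test recordings absent from the archive; an empty list means every row has a file.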