## Module #4 Minimum Viable Product

### Mark Streer (DS/ML)

### Motivation

Speech processing is essential to many ML/AI applications, including voice recognition, speech transcription, and virtual assistants. However, technology watchdogs and ethicists alike have raised concerns that the underlying algorithms and language models perform poorly on sociolinguistic groups underrepresented in training data. This project aims to classify a corpus of voice recordings of UK speakers by dialect/accent, based on power-spectral characteristics (i.e., MFCCs) as well as phonetic properties (i.e., formants). Sociopolitical issues with dialect profiling and discrimination are inherent in voice recognition technology, so the project is framed around ensuring equal access to convenient, state-of-the-art technology: once a speaker's accent is identified, the most appropriate of a variety of pre-trained models can be selected.

### Methodology

The dataset analyzed is the [Open-source Multi-speaker Corpora of the English Accents in the British Isles](https://aclanthology.org/2020.lrec-1.804/), a collection of over 15,000 high-quality audio recordings of English sentences read by volunteers with different regional accents of Great Britain and Ireland. The recording scripts were curated specifically for accent elicitation (including personal and location names within the regions in question) and provide high phoneme coverage. The authors state, "the dataset is intended for linguistic analysis as well as use for speech technologies." Six dialect classes are provided, each consisting of recordings from 3 to 57 volunteers. Because the Irish class has only three volunteers and no female recordings, only the first five classes were analyzed in the present study.

1. Northern (England)  
    19 speakers (5 female, 14 male); 2,847 entries
2. Southern (England)  
    57 speakers (28 female, 29 male); 8,492 entries
3. Midlands (England)  
    5 speakers (2 female, 3 male); 696 entries
4. Welsh  
    19 speakers (8 female, 11 male); 2,849 entries
5. Scottish  
    17 speakers (6 female, 11 male); 2,543 entries
6. Irish  
    3 speakers (0 female, 3 male); 450 entries

Librosa is used to calculate mel-frequency cepstral coefficients (MFCCs) with default parameters, including the number of coefficients (20), window length, and sampling rate. Since the MFCCs are returned as a \[20, t\] array, where the number of frames t grows with recording length, some form of aggregation is needed to obtain a fixed-length feature vector. For now, each MFCC row is averaged over time, and the mean is used as the corresponding feature.

Delta and delta-delta MFCCs are also calculated using Librosa with default parameters, for 60 features in total. The results below are reported for the 20 MFCC features only. (Formant extraction is in progress as of 2021-10-25.)

### Results
#### Classification performance

* Random forest models gave the best performance of the four model types tested, followed by K nearest neighbors (k=10), decision tree, and logistic regression.
* A strong bias towards the majority class (Southern) is apparent in the logistic regression and decision tree models. In one-versus-all (OVA) terms, Southern-versus-all classification consistently earns the highest F1-score across all models.

![](mvp_fig1.png)

![](mvp_fig2.png)
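A comparison of this kind could be set up roughly as below. This is a sketch under stated assumptions, not the code behind the figures: the data are synthetic (generated with class proportions loosely matching the five analyzed classes), only k=10 comes from the text, and every other hyperparameter is a scikit-learn default or an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for the 20 mean-MFCC features, with
# imbalanced classes ordered Northern, Southern, Midlands, Welsh, Scottish
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_classes=5,
    weights=[0.16, 0.49, 0.04, 0.16, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=10),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Per-class F1 treats each class as "this class versus all others"
per_class_f1 = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    per_class_f1[name] = f1_score(
        y_test, y_pred, labels=list(range(5)), average=None, zero_division=0
    )

for name, scores in per_class_f1.items():
    print(name, scores.round(2))
```

Reporting F1 per class rather than overall accuracy makes a majority-class bias visible, since a model that predicts mostly the largest class scores well on that class and poorly on the rest.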

Further work will:
1. Re-run the analysis using 16 MFCCs (the conventional size in speech processing algorithms) instead of the Librosa default of 20.
2. Split the train/test datasets by speaker, so that no speaker appears in both sets and the model is not merely learning speaker similarity.
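One way to implement the speaker-level split in item 2 is scikit-learn's `GroupShuffleSplit`, sketched below. The speaker IDs, features, and labels here are synthetic placeholders; in the real pipeline the group array would hold each recording's speaker identifier from the corpus metadata.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Assumption: 20 hypothetical speakers with 10 recordings each
n_speakers, per_speaker = 20, 10
speaker_ids = np.repeat(np.arange(n_speakers), per_speaker)
X = rng.normal(size=(n_speakers * per_speaker, 20))
# Each speaker has a single accent label shared by all their recordings
y = np.repeat(rng.integers(0, 5, size=n_speakers), per_speaker)

# Hold out 25% of the speakers (not 25% of the recordings)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=speaker_ids))

train_speakers = set(speaker_ids[train_idx])
test_speakers = set(speaker_ids[test_idx])
print(len(train_speakers), len(test_speakers))
```

With a small class such as Midlands (5 speakers), a random group split can leave a class with very few test speakers, so `StratifiedGroupKFold` may be worth considering as an alternative.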