# Investigating Language Model Bias in Whisper ASR Model

Reeka Estacio

LIGN 214: Computational Phonetics

## Introduction

OpenAI's neural automatic speech recognition (ASR) model, Whisper, combines ASR technology for transcribing speech with, crucially, an integrated Transformer-based language model that improves context-based accuracy. This layered architecture has enabled the model to achieve high transcription accuracy. However, to what extent does Whisper rely on its language model component over its ASR component? To investigate this, I measure perplexity in Whisper to explore whether the model relies more on semantic relatedness than on the phonological information in the acoustic signal when performing ASR tasks.

To evaluate whether large language models (LLMs) generate language in a way that parallels human processing, and to further explore the role of statistical learning in human language acquisition, previous studies have sought to align LLM behavior with physiological responses to language. For example, EEG research has shown that when individuals encounter unexpected word endings in highly constrained sentence contexts, they exhibit the N400 effect, a negative-going event-related potential (ERP) component linked to semantic processing of written and spoken linguistic stimuli **[cite Kutas paper]**.

Since human predictability judgments are reflected in the N400 effect, measuring perplexity in an LLM is a reasonable way to operationalize this phenomenon. Perplexity measures the language model component’s overall uncertainty when predicting linguistic input, allowing us to assess how strongly the model favors expectation over purely acoustic transcription. Prior research suggests that LLMs exhibit behavior analogous to the N400 effect when encountering unexpected linguistic input, as reflected in increased perplexity **[cite a paper]**. By analyzing perplexity in Whisper’s transcriptions, the current study aims to determine whether its language model component systematically biases transcription toward more statistically probable sentences, potentially overriding the acoustic information of the speech input.
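For reference, a common definition of perplexity is sketched below. It assumes that Whisper's decoder is scored token by token on a transcript $w_1, \dots, w_N$ while conditioned on the audio $a$. These symbols are introduced here for illustration only and are not taken from a specific implementation.

$$
\mathrm{PPL}(w_1, \dots, w_N \mid a) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{<i}, a)\right)
$$

Under this definition, a lower value means the decoder found the token sequence more predictable given the audio and the preceding tokens.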

Overall, I compare perplexity scores across five auditory conditions: (1) expected, (2) phonologically-related, (3) semantically-related, (4) both phonologically- and semantically-related, and (5) neither. Below is an example set of stimuli.

1. The farmer milked the **cow**. (expected, most probable)

2. The farmer milked the **couch**. (phonologically-related, improbable)

3. The farmer milked the **goat**. (semantically-related, probable)

4. The farmer milked the **calf**. (phonologically- and semantically-related, less probable)

5. The farmer milked the **rock**. (neither, improbable)

I hypothesize that Whisper over-relies on its language model component. If so, lower perplexity should be assigned to the semantically-related conditions. More specifically, the perplexity scores for the conditions should rank as follows:

condition 1 (expected, lowest perplexity) < condition 3 (semantically-related) < condition 4 (both semantically- and phonologically-related) < condition 2 (phonologically-related) < condition 5 (neither, highest perplexity).
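
The predicted ranking can be written as an explicit check. This is a minimal sketch: the names `PREDICTED_ORDER` and `matches_prediction` are hypothetical, and it assumes each condition is summarized by a single mean perplexity.

```python
from typing import Dict, List

# Predicted order from lowest to highest perplexity, using the condition
# numbers defined above (1 = expected, ..., 5 = neither).
PREDICTED_ORDER: List[int] = [1, 3, 4, 2, 5]


def matches_prediction(mean_ppl: Dict[int, float]) -> bool:
    """Return True if mean perplexity strictly increases along PREDICTED_ORDER."""
    values = [mean_ppl[condition] for condition in PREDICTED_ORDER]
    return all(lower < higher for lower, higher in zip(values, values[1:]))
```

A strict ordering is a strong criterion. A statistical comparison between conditions may be more appropriate once real perplexity scores are available.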


## Methods

### Stimuli

The stimuli consist of 8 sets, each containing five sentences in which the final word has been manipulated according to the experimental conditions: (1) expected, (2) phonologically-related, (3) semantically-related, (4) both phonologically- and semantically-related, and (5) neither. All 40 sentences were generated by ChatGPT.

The sentences were then fed into the [ElevenLabs](https://elevenlabs.io) text-to-speech voice generator in an effort to standardize the speaker and minimize noise in the speech input. The auditory stimuli were generated with the Eleven Multilingual v2 model and the "Rachel" voice. Speaker speed was fixed at 1x. Stability, similarity, and style exaggeration were fixed at 50%. "Speaker boost" was toggled on. The generated audio files were saved as mp3 files.

The full set of stimuli, including the sentence transcripts and audio files, can be found in the `data` folder on [GitHub](https://github.com/rdestaci/Whisper_LLM_Bias/tree/14f636478534dab9ac30b36e56917b0916cd841f/data).
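
The design described above (sets of five sentences that share a frame and differ only in the final word) could be represented roughly as follows. This is an illustrative sketch: the `Stimulus` class, the `build_set` helper, and the field names are hypothetical and do not reflect how the files in the `data` folder are organized. Only the example set from the Introduction is filled in.

```python
from dataclasses import dataclass
from typing import Dict, List

# Condition numbers as used throughout this write-up.
CONDITIONS: Dict[int, str] = {
    1: "expected",
    2: "phonologically-related",
    3: "semantically-related",
    4: "phonologically- and semantically-related",
    5: "neither",
}


@dataclass(frozen=True)
class Stimulus:
    set_id: int      # which of the 8 sets the sentence belongs to
    condition: int   # key into CONDITIONS
    frame: str       # sentence frame shared within a set
    final_word: str  # the manipulated final word

    @property
    def sentence(self) -> str:
        return f"{self.frame} {self.final_word}."


def build_set(set_id: int, frame: str, final_words: Dict[int, str]) -> List[Stimulus]:
    """Build one stimulus set, requiring exactly one final word per condition."""
    if set(final_words) != set(CONDITIONS):
        raise ValueError("each set needs exactly one final word per condition")
    return [Stimulus(set_id, c, frame, final_words[c]) for c in sorted(CONDITIONS)]


# The example set from the Introduction; set_id=1 is an arbitrary label.
example_set = build_set(
    1,
    "The farmer milked the",
    {1: "cow", 2: "couch", 3: "goat", 4: "calf", 5: "rock"},
)
```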

### Obtaining perplexity

Explanation of code. What size of Whisper model used

```python
# Code for running the model
# Obtain perplexity
```
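
One possible way to obtain perplexity is sketched below. It is not necessarily the procedure used in this study. It assumes the open-source `openai-whisper` package and PyTorch are installed. It scores the intended sentence by feeding it to the decoder alongside the audio (teacher forcing) and reading off the log-probability of each text token. The function names are hypothetical, and `"base"` is only a placeholder because the model size is not yet specified above.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def whisper_token_logprobs(audio_path: str, text: str,
                           model_name: str = "base") -> List[float]:
    """Log-probabilities Whisper's decoder assigns to `text`, given the audio."""
    # Heavy imports stay local so perplexity() can be used without them.
    import torch
    import whisper
    from whisper.tokenizer import get_tokenizer

    model = whisper.load_model(model_name)  # placeholder size (assumption)
    audio = whisper.pad_or_trim(whisper.load_audio(audio_path))
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

    tokenizer = get_tokenizer(model.is_multilingual, language="en", task="transcribe")
    prefix = list(tokenizer.sot_sequence_including_notimestamps)
    target = tokenizer.encode(" " + text.strip()) + [tokenizer.eot]
    tokens = torch.tensor([prefix + target], device=model.device)

    with torch.no_grad():
        logits = model(mel.unsqueeze(0), tokens)  # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits.float(), dim=-1)

    # The logits at position i predict the token at position i + 1, so the
    # first target token is predicted at the last prefix position.
    first = len(prefix) - 1
    return [log_probs[0, first + i, tok].item() for i, tok in enumerate(target)]
```

With these helpers, the perplexity of one stimulus would be `perplexity(whisper_token_logprobs(path, sentence))`. Only the text tokens and the end-of-text token are scored here, and the special prefix tokens are excluded. That is a design choice, not a requirement.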

## Results

```python
# boxplots showing difference of all 5 conditions
```
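
A minimal sketch of how such boxplots could be produced, assuming the perplexity scores are available as `(condition, perplexity)` pairs and that Matplotlib is installed. The function names and the output path argument are hypothetical, and no real results are included.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_condition(rows: Iterable[Tuple[int, float]]) -> Dict[int, List[float]]:
    """Collect (condition, perplexity) pairs into lists, sorted by condition."""
    groups: Dict[int, List[float]] = defaultdict(list)
    for condition, ppl in rows:
        groups[condition].append(ppl)
    return dict(sorted(groups.items()))


def plot_boxplots(groups: Dict[int, List[float]], out_path: str) -> None:
    """Draw one box per condition and save the figure to out_path."""
    # Imported here so the grouping helper works without Matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.boxplot(list(groups.values()))
    ax.set_xticks(range(1, len(groups) + 1))
    ax.set_xticklabels([str(condition) for condition in groups])
    ax.set_xlabel("Condition")
    ax.set_ylabel("Perplexity")
    fig.savefig(out_path)
    plt.close(fig)
```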

The results suggest that...

## Discussion

This study aims to examine the extent to which Whisper favors its language model component over its ASR component when transcribing linguistic input.

### Further research

It would be worthwhile to explore how these effects manifest across Whisper models of different sizes. **[cite papers]**

## Conclusion

Overall...

## References