# Investigating Language Model Bias in Whisper (small) ASR Model

Reeka Estacio

LIGN 214: Computational Phonetics

## Introduction

OpenAI's neural automatic speech recognition (ASR) model, Whisper, combines acoustic speech recognition with, crucially, an integrated Transformer-based language model that uses context to improve accuracy. This layered architecture has enabled the model to achieve very high transcription accuracy. However, to what extent does Whisper rely on its language model component over its ASR component? To investigate this, I measure the perplexity of Whisper's transcriptions to explore whether the model relies more on semantic relatedness when performing ASR tasks than on the phonological information in the acoustic signal.

To evaluate whether large language models (LLMs) generate language in a way that parallels human processing, and to further the conversation regarding the role of statistical learning in human language acquisition, previous studies have sought to align LLM behavior with physiological responses to language. For example, EEG research has shown that when individuals encounter unexpected word endings in highly constrained sentence contexts, they exhibit the N400 effect, a negative-going event-related potential (ERP) component associated with semantic processing in response to written and spoken linguistic stimuli **[cite Kutas paper]**.

Since human predictability judgements are reflected in the N400 effect, perplexity in an LLM offers a reasonable way to operationalize this phenomenon. Perplexity measures a model's uncertainty when predicting linguistic input: the less expected a sentence is, the higher its perplexity. This makes it possible to assess how strongly Whisper favors predictable wording over the acoustic evidence. Prior research suggests that LLMs exhibit behavior analogous to the N400 effect when encountering unexpected linguistic input, as reflected in increased perplexity **[cite a paper]**. By analyzing the perplexity of Whisper's transcriptions, the current study aims to determine whether its language model component exhibits a systematic bias toward more statistically probable sentences, potentially overriding the acoustic information of the speech input.
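
As a minimal illustration of the measure (not part of the study's pipeline), perplexity can be computed as the exponential of the average negative log-probability a model assigns to each token. The per-token probabilities below are invented for illustration only; they are not outputs of GPT-2 or Whisper.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities for the tokens of "The farmer milked the ___"
expected = [0.2, 0.1, 0.05, 0.5, 0.4]      # predictable final word (e.g. "cow")
unexpected = [0.2, 0.1, 0.05, 0.5, 0.001]  # surprising final word (e.g. "rock")

print(perplexity(expected) < perplexity(unexpected))  # prints True
```

Only the final probability differs between the two lists, so the gap in perplexity comes entirely from the sentence-final word, which is the manipulation used in the stimuli.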

I compare perplexity scores for Whisper-small transcriptions across five auditory conditions: (1) expected, (2) phonologically-related, (3) semantically-related, (4) both phonologically- and semantically-related, and (5) neither. Below is an example set of stimuli.

1. The farmer milked the **cow**. (expected, most probable)

2. The farmer milked the **couch**. (phonologically-related, improbable)

3. The farmer milked the **goat**. (semantically-related, probable)

4. The farmer milked the **calf**. (phonologically- and semantically-related, less probable)

5. The farmer milked the **rock**. (neither, improbable)

If Whisper-small is biased toward its language model component, lower perplexity should be associated with the semantically-related conditions. More specifically, I predict that the conditions will rank as follows, from lowest to highest perplexity: condition 1 (expected) < condition 3 (semantically-related) < condition 4 (both semantically- and phonologically-related) < condition 2 (phonologically-related) < condition 5 (neither).
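
The predicted ranking can be written as a simple check on one stimulus set. This is only a sketch: the function name `follows_prediction` and the example values are hypothetical and are not taken from the study's data.

```python
# Predicted order of conditions, from lowest to highest perplexity
PREDICTED_ORDER = [1, 3, 4, 2, 5]

def follows_prediction(perplexity_by_condition):
    """True if perplexity strictly increases along the predicted order."""
    values = [perplexity_by_condition[c] for c in PREDICTED_ORDER]
    return all(a < b for a, b in zip(values, values[1:]))

# Hypothetical perplexities keyed by condition number, for illustration only
example = {1: 100.0, 2: 400.0, 3: 150.0, 4: 200.0, 5: 900.0}
print(follows_prediction(example))  # prints True
```

A strict all-or-nothing check like this is harsh for noisy data; a rank correlation per set would be a softer alternative.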


## Methods

### Stimuli

The stimuli consist of eight sets of sentences. Each set contains five sentences in which the final word has been manipulated according to the experimental conditions: (1) expected, (2) phonologically-related, (3) semantically-related, (4) both phonologically- and semantically-related, and (5) neither. In total, 40 sentences were generated using ChatGPT.

The sentences were then fed into the [ElevenLabs](elevenlabs.io) text-to-speech voice generator in an effort to standardize the speaker and minimize noise. The auditory stimuli were generated with the Eleven Multilingual v2 model and the "Rachel" voice. Speaker speed was fixed at 1x. Stability, similarity, and style exaggeration were each fixed at 50%. The generated audio was saved as mp3 files.

The full set of stimuli, including the sentence transcripts and audio files, can be located in the `data` folder on [GitHub](https://github.com/rdestaci/Whisper_LLM_Bias/tree/14f636478534dab9ac30b36e56917b0916cd841f/data).

### Obtaining perplexity

In [32]:
%%capture
# Import necessary libraries
import os
import torch
import whisper
import pandas as pd
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import warnings

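To compute perplexity, I first load the Whisper small model, which produces the transcriptions, along with GPT-2, which scores each transcription. Perplexity is therefore computed by GPT-2 over Whisper's text output.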

In [34]:
%%capture
# Load Whisper model (small)
model = whisper.load_model("small", device="cpu")

# Load GPT-2 tokenizer and model for calculating perplexity
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_model.eval()

In [35]:
# Define function for computing perplexity
def compute_perplexity(sentence):
    """Return GPT-2's perplexity for a sentence (lower = more predictable)."""
    tokens = tokenizer(sentence, return_tensors="pt")   # tokenize sentence
    input_ids = tokens.input_ids

    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy loss over the predicted tokens
        outputs = gpt2_model(input_ids, labels=input_ids)
        loss = outputs.loss
        perplexity = torch.exp(loss).item()   # exponentiate loss to get perplexity

    return perplexity
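
For reference, when `labels=input_ids` is passed, the Hugging Face GPT-2 language-model head shifts the labels by one position internally, so the first token serves only as context and is not itself predicted. Under that reading, for a tokenized sentence of N tokens the function above computes the following quantity (the notation is mine, not the notebook's):

```latex
\mathrm{PPL}(w_1, \dots, w_N)
  = \exp\!\left( -\frac{1}{N-1} \sum_{t=2}^{N}
      \log p_{\text{GPT-2}}\!\left( w_t \mid w_1, \dots, w_{t-1} \right) \right)
```

Because the average runs over only a handful of tokens in these short sentences, a single unexpected final word can move the score substantially.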

In [37]:
%%capture
# Suppress CPU-related warnings
warnings.simplefilter("ignore")

# Load stimuli and path to folder containing audio files
df_stimuli = pd.read_csv("data/stimuli.csv")
stimuli_folder = os.path.join(os.getcwd(), "data/auditory_stimuli")


# Loop through df_stimuli and process audio files
for index, row in df_stimuli.iterrows():
    audio_filename = os.path.join(stimuli_folder, f"{row['id']}{row['condition_id']}.mp3")
    
    # Check if the audio file exists
    if os.path.exists(audio_filename):
        print(f"Processing: {audio_filename}")

        # Transcribe the audio file with Whisper
        result = model.transcribe(audio_filename)
        transcribed_text = result["text"]

        # Compute perplexity
        perplexity = compute_perplexity(transcribed_text)

    else:
        transcribed_text = None  # Mark missing files
        perplexity = None  # No perplexity for missing audio

    # Store the results in df_stimuli
    df_stimuli.at[index, "transcription"] = transcribed_text
    df_stimuli.at[index, "perplexity"] = perplexity

# Save dataframe transcriptions & perplexity to results.csv
df_stimuli.to_csv("results.csv", index=False)
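
Perplexity alone does not show whether Whisper actually changed a word. One possible complementary check, sketched here as an assumption rather than as part of the original pipeline, is to compare the final word of each transcription with the final word of the intended sentence. The helper names `final_word` and `final_word_preserved` are hypothetical, and the second call uses an invented transcription.

```python
import string

def final_word(sentence):
    """Return the last word, lowercased and stripped of punctuation."""
    words = sentence.strip().split()
    if not words:
        return ""
    return words[-1].strip(string.punctuation).lower()

def final_word_preserved(sentence, transcription):
    """True if the transcription ends in the same word as the stimulus."""
    return final_word(sentence) == final_word(transcription)

# Transcription matches the stimulus (leading whitespace is ignored)
print(final_word_preserved("The farmer milked the couch.",
                           " The farmer milked the couch."))  # prints True

# Invented case where the final word was replaced by a more probable one
print(final_word_preserved("The farmer milked the couch.",
                           "The farmer milked the cow."))     # prints False
```

Applied to the `sentence` and `transcription` columns, a check like this would separate cases where Whisper kept the improbable word from cases where it substituted a likelier one.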

In [39]:
results = pd.read_csv("results.csv")
results

| Unnamed: 0 | id | condition_id | condition_name | sentence | transcription | perplexity |
|---|---|---|---|---|---|---|
| 0 | 1 | A | expected | The farmer milked the cow. | The farmer milked the cow. | 218.471649 |
| 1 | 1 | B | phonologically related | The farmer milked the couch. | The farmer milked the couch. | 971.731323 |
| 2 | 1 | C | semantically related | The farmer milked the goat. | The farmer milked the goat. | 332.905121 |
| 3 | 1 | D | both | The farmer milked the calf. | The farmer milked the calf. | 347.72348 |
| 4 | 1 | E | neither | The farmer milked the rock. | The farmer milked the rock. | 631.884216 |
| 5 | 2 | A | expected | The mechanic fixed the car. | The mechanic fixed the car. | 922.882385 |
| 6 | 2 | B | phonologically related | The mechanic fixed the bar. | The mechanic fixed the bar. | 961.292297 |
| 7 | 2 | C | semantically related | The mechanic fixed the engine. | The mechanic fixed the engine. | 552.510864 |
| 8 | 2 | D | both | The mechanic fixed the cart. | The mechanic fixed the cart. | 1425.908813 |
| 9 | 2 | E | neither | The mechanic fixed the apple. | The mechanic fixed the apple. | 1996.952637 |
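
As a rough sketch of how the conditions could be summarized, the snippet below averages perplexity per condition and lists the conditions from lowest to highest mean. It uses only the ten rows displayed above (two of the eight stimulus sets), so its output should not be read as the study's result.

```python
from collections import defaultdict
from statistics import mean

# (condition_id, perplexity) pairs copied from the ten rows displayed above
rows = [
    ("A", 218.471649), ("B", 971.731323), ("C", 332.905121),
    ("D", 347.72348), ("E", 631.884216),
    ("A", 922.882385), ("B", 961.292297), ("C", 552.510864),
    ("D", 1425.908813), ("E", 1996.952637),
]

# Group perplexity values by condition
by_condition = defaultdict(list)
for condition_id, ppl in rows:
    by_condition[condition_id].append(ppl)

mean_ppl = {c: mean(v) for c, v in by_condition.items()}

# Print conditions sorted from lowest to highest mean perplexity
for condition_id in sorted(mean_ppl, key=mean_ppl.get):
    print(condition_id, round(mean_ppl[condition_id], 2))
```

With the full `results.csv`, the same grouping could be done with `results.groupby("condition_name")["perplexity"].mean()` in pandas.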


## Results

In [None]:
# boxplots showing difference of all 5 conditions

The results suggest that...

## Discussion

This study examines the extent to which Whisper favors its language model component over its ASR component when transcribing linguistic input.

### Further research

It would be worthwhile to explore how these effects manifest in Whisper models of different sizes. **[cite papers]**

## Conclusion

Overall...

## References