# Investigating Language Model Bias in Whisper (small) ASR Model

Reeka Estacio

LIGN 214: Computational Phonetics


## Introduction

OpenAI's neural automatic speech recognition (ASR) model, Whisper, pairs acoustic speech recognition with, crucially, an integrated Transformer-based language model that uses context to improve transcription accuracy. This layered architecture has enabled the model to achieve very high transcription accuracy. However, to what extent does Whisper rely on its language model component over its ASR component? To investigate this, I measure perplexity in Whisper to explore whether the model relies more on semantic predictability than on the phonological information in the acoustic signal when performing ASR tasks.

To evaluate whether large language models (LLMs) generate language in a way that parallels human processing, and to further the conversation regarding the role of statistical learning in human language acquisition, previous studies have sought to align LLM behavior with physiological responses to language. For example, EEG research has shown that when individuals encounter unexpected word endings in highly constrained sentence contexts, they exhibit the N400 effect, a negative-going event-related potential (ERP) component associated with semantic processing of written and spoken linguistic stimuli **[cite Kutas paper]**.

Because the N400 effect reflects how predictable a word is to a human comprehender, perplexity offers a reasonable way to operationalize the analogous quantity in an LLM. Perplexity measures the model's uncertainty when predicting linguistic input, allowing us to assess how strongly the model favors predictability over purely acoustic analysis. Prior research suggests that LLMs exhibit behavior analogous to the N400 effect when encountering unexpected linguistic input, as reflected in increased perplexity **[cite a paper]**. By analyzing perplexity in Whisper’s transcriptions, the current study aims to determine whether its language model component exhibits a systematic bias toward more statistically probable sentences, potentially overriding the acoustic information of the speech input.
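For reference, the standard definition of perplexity for a sequence of $N$ tokens $w_1, \dots, w_N$ under a model $p$ is the exponentiated average negative log-likelihood. The notation below is introduced here for illustration and is not taken from a particular source:

$$
\mathrm{PPL}(w_1, \dots, w_N) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_{<i})\right)
$$

For an ASR model such as Whisper, each token probability is additionally conditioned on the audio input, so a lower value means the model assigned higher probability to the token sequence given what it heard.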

### The current study 

I aim to measure perplexity computed from Whisper (small) across five different auditory conditions, manipulating both phonological and semantic relationships between words. The conditions are:

1. **Expected**: The final word is highly predictable based on prior context.

2. **Phonologically-related**: The final word sounds similar to the expected word, but is not predictable given prior context.

3. **Semantically-related**: The final word is semantically-related to the expected word but is less predictable given prior context. It does not sound similar to the expected word.

4. **Both phonologically- and semantically-related**: The final word is semantically-related to the expected word but is less predictable given prior context. It also sounds similar to the expected word.

5. **Neither**: The final word is neither phonologically nor semantically related to the expected word, making it highly improbable.

Below is an example set of stimuli:

1. The farmer milked the **cow**. (expected, most probable)

2. The farmer milked the **couch**. (phonologically-related)

3. The farmer milked the **goat**. (semantically-related)

4. The farmer milked the **calf**. (phonologically- and semantically-related)

5. The farmer milked the **rock**. (neither, improbable)


To assess the extent to which Whisper's language model component influences its transcription, I will use GPT-2 perplexity as a pure language model baseline for comparison. I chose GPT-2 because it is a text-only Transformer language model and is therefore likely to resemble Whisper's embedded language model with the ASR component removed. If Whisper indeed exhibits a bias towards its language model component over its ASR component, its perplexity rankings across conditions should pattern similarly to GPT-2's, such that words that are highly predictable in context are favored over similar-sounding words. This should result in lower perplexity for semantically-related completions, regardless of the acoustic information in the input.

More specifically, I predict that perplexity will rank as follows, from lowest to highest: condition 1 (expected) < condition 3 (semantically-related) < condition 4 (both semantically- and phonologically-related) < condition 2 (phonologically-related) < condition 5 (neither).

## Methods

### Stimuli

The sentences were generated using ChatGPT 4o. I prompted the language model to generate multiple sets of sentences where the sentence frame stays the same, but the final word varies based on a provided description of the five experimental conditions. In my prompt, I also ensured that the sentence contexts are highly constrained to the expected word. The final set of stimuli consists of 8 sets of five sentences (one sentence corresponding to each condition), resulting in a total of 40 sentences.

To generate the speech input, the sentences were fed into the [ElevenLabs](https://elevenlabs.io) text-to-speech voice generator. This method was chosen in lieu of collecting speech recordings because it standardizes the speaker and minimizes outside noise that could potentially influence Whisper's transcriptions. The stimuli were generated with the Eleven Multilingual v2 model and the "Rachel" voice. Speaker speed was fixed at 1x. Stability, similarity, and style exaggeration were fixed at 50%. The generated audio files were then downloaded as mp3 files.

The full set of stimuli, including the sentence transcripts and audio files, can be located in the `data` folder on [GitHub](https://github.com/rdestaci/Whisper_LLM_Bias/tree/14f636478534dab9ac30b36e56917b0916cd841f/data).
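The design described above (8 sets, each with one sentence per condition) can be checked with a short script before any audio is processed. The following is a minimal sketch rather than part of the original pipeline: the function names are hypothetical, the set ids 1 to 8 are an assumption, and the condition ids `A` to `E` and the `<set id><condition id>.mp3` file-name pattern follow the conventions used elsewhere in this notebook.

```python
from collections import defaultdict


def check_stimulus_design(rows, n_sets=8, condition_ids=("A", "B", "C", "D", "E")):
    """Return a list of problems; an empty list means the design is balanced.

    rows: iterable of (set_id, condition_id) pairs, one per sentence.
    """
    conditions_by_set = defaultdict(list)
    for set_id, condition_id in rows:
        conditions_by_set[set_id].append(condition_id)

    problems = []
    if len(conditions_by_set) != n_sets:
        problems.append(f"expected {n_sets} sets, found {len(conditions_by_set)}")
    for set_id, found in sorted(conditions_by_set.items()):
        # Each set should contain every condition exactly once.
        if sorted(found) != sorted(condition_ids):
            problems.append(f"set {set_id}: conditions {sorted(found)}")
    return problems


def expected_audio_filenames(rows):
    """Build file names using the <set id><condition id>.mp3 pattern."""
    return [f"{set_id}{condition_id}.mp3" for set_id, condition_id in rows]


# Illustrative design only: set ids 1-8 are assumed, not read from the real file.
design = [(s, c) for s in range(1, 9) for c in "ABCDE"]
print(check_stimulus_design(design))          # []
print(len(expected_audio_filenames(design)))  # 40
```

In practice the `(set_id, condition_id)` pairs would come from the `id` and `condition_id` columns of the stimuli file.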

### Computing perplexity

To begin computing perplexity for each sentence in the stimuli set, I first loaded all necessary libraries and models (GPT-2 and Whisper (small)).

I then loaded the stimuli, which include:

- **`stimuli.csv`:** a file containing the true transcriptions of all sentences and their labeled conditions
- **path to the `auditory_stimuli` folder:** the folder containing the speech input as .mp3 files

In [78]:
%%capture
# Import necessary libraries
import os
import torch
import whisper
import pandas as pd
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import warnings

In [72]:
%%capture
# Load Whisper model (small)
model = whisper.load_model("small")

# Load GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_model.eval()

# Load stimuli and path to folder containing audio files
df_stimuli = pd.read_csv("data/stimuli.csv")
stimuli_folder = os.path.join(os.getcwd(), "data/auditory_stimuli")

Because perplexity cannot be computed in exactly the same way for GPT-2 and Whisper, I defined two functions:

- `compute_gpt2_perplexity`: Computes GPT-2 perplexity by running a forward pass on the input sentence and exponentiating the mean cross-entropy loss.

- `compute_whisper_perplexity`: Transcribes the audio file and extracts the average log probability that Whisper reports for each segment. It *estimates* perplexity by exponentiating the negative mean of these values.

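Although these methods differ, they are comparable in that both exponentiate an average negative log probability and therefore measure average uncertainty across tokens or segments. They are not strictly equivalent: Whisper's probabilities are conditioned on the audio, whereas GPT-2's are conditioned on text alone. Critically, both preserve the positive relationship between perplexity and uncertainty (higher perplexity = more uncertainty).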
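One caveat of the segment-level estimate can be illustrated with a small worked example. Averaging per-segment means weights every segment equally, so it only matches a token-level perplexity when there is a single segment or when all segments have the same length. The sketch below uses hypothetical log probabilities and hypothetical function names. For one short sentence per audio file, Whisper will probably return a single segment, in which case the two quantities coincide, but that is an assumption worth verifying on the actual output.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """Exponentiate the negative mean of token-level log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_from_segment_means(segments):
    """Average the per-segment means first, then exponentiate."""
    segment_means = [sum(seg) / len(seg) for seg in segments]
    return math.exp(-sum(segment_means) / len(segment_means))


# Hypothetical natural-log token probabilities for two segments of unequal length.
segments = [[-0.1, -0.1, -0.1, -0.1], [-2.0]]
all_tokens = [lp for seg in segments for lp in seg]

token_level = perplexity_from_logprobs(all_tokens)
segment_level = perplexity_from_segment_means(segments)

# The short, uncertain segment counts as much as the long, confident one,
# so the segment-level estimate comes out higher here.
print(round(token_level, 3), round(segment_level, 3))
```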

In [73]:
## Define function for computing GPT-2 perplexity
def compute_gpt2_perplexity(sentence):
    """
    Compute perplexity of given sentence (text) using GPT-2.
    """
    tokens = tokenizer(sentence, return_tensors="pt")   # tokenize sentence
    input_ids = tokens.input_ids
    
    with torch.no_grad():
        outputs = gpt2_model(input_ids, labels=input_ids)   # get model outputs
        loss = outputs.loss   # extract cross-entropy loss
        perplexity = torch.exp(loss).item()   # exponentiate loss to compute perplexity
        
    return perplexity

## Define function for computing Whisper perplexity
def compute_whisper_perplexity(audio_filename, model):
    """
    Computes Whisper's perplexity based on average log probability of transcription.
    """
    # Transcribe audio
    result = model.transcribe(audio_filename)
    transcribed_text = result["text"]

    # Extract log probabilities from Whisper
    log_probs = []
    if "segments" in result:
        for segment in result["segments"]:
            if "avg_logprob" in segment:   # Whisper provides segment-level avg log probability
                log_probs.append(segment["avg_logprob"])

    # Estimate perplexity
    whisper_perplexity = torch.exp(-torch.tensor(log_probs).mean()).item()

    return transcribed_text, whisper_perplexity

Using these functions, the speech input is transcribed with Whisper, and perplexity is calculated for both models. The resulting file, `results.csv`, contains Whisper's transcription, GPT-2 perplexity, and Whisper perplexity for each sentence.

In [None]:
# %%capture
# Suppress CPU-related warnings
warnings.simplefilter("ignore")

for index, row in df_stimuli.iterrows():
    audio_filename = os.path.join(stimuli_folder, f"{row['id']}{row['condition_id']}.mp3")
    
    # Compute Whisper perplexity and transcription
    transcribed_text, whisper_perplexity = compute_whisper_perplexity(audio_filename, model)

    # Compute GPT-2 perplexity of Whisper's transcription
    gpt2_perplexity = compute_gpt2_perplexity(transcribed_text)

    # Store results
    df_stimuli.at[index, "transcription"] = transcribed_text
    df_stimuli.at[index, "whisper_perplexity"] = whisper_perplexity
    df_stimuli.at[index, "gpt2_perplexity"] = gpt2_perplexity

# Save the results
df_stimuli.to_csv("results.csv", index=False)

In [76]:
results = pd.read_csv("results.csv")
results.head()

| Unnamed: 0 | id | condition_id | condition_name | sentence | transcription | whisper_perplexity | gpt2_perplexity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | A | expected | The farmer milked the cow. | The farmer milked the cow. | 1.415638 | 218.471649 |
| 1 | 1 | B | phonologically related | The farmer milked the couch. | The farmer milked the couch. | 1.340787 | 971.731323 |
| 2 | 1 | C | semantically related | The farmer milked the goat. | The farmer milked the goat. | 1.350668 | 332.905121 |
| 3 | 1 | D | both | The farmer milked the calf. | The farmer milked the calf. | 1.343826 | 347.72348 |
| 4 | 1 | E | neither | The farmer milked the rock. | The farmer milked the rock. | 1.402848 | 631.884216 |
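Because the research question is whether the language model component can override the acoustic input, it may also be useful to check whether Whisper's transcription actually preserves the final word of each target sentence. The following is a minimal sketch with hypothetical function names. It compares only the last word after lowercasing and stripping punctuation, which is a simplifying assumption.

```python
import string


def final_word(text):
    """Return the last word, lowercased and stripped of surrounding punctuation."""
    words = text.strip().split()
    return words[-1].strip(string.punctuation).lower() if words else ""


def final_word_preserved(sentence, transcription):
    """True if the transcription ends in the same word as the target sentence."""
    return final_word(sentence) == final_word(transcription)


# (sentence, transcription) pairs copied from the preview rows above.
preview = [
    ("The farmer milked the cow.", "The farmer milked the cow."),
    ("The farmer milked the couch.", "The farmer milked the couch."),
    ("The farmer milked the goat.", "The farmer milked the goat."),
    ("The farmer milked the calf.", "The farmer milked the calf."),
    ("The farmer milked the rock.", "The farmer milked the rock."),
]
print(all(final_word_preserved(s, t) for s, t in preview))  # True
```

A mismatch in a phonologically-related item (for example, "couch" transcribed as "cow") would be direct evidence of the bias under investigation, independent of the perplexity values.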


## Results

In [None]:
# boxplots showing difference of all 5 conditions
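Alongside a plot, the hypothesized ranking can be compared with the observed ordering of mean perplexity per condition. The sketch below is one possible way to do this, not the final analysis: the function names are hypothetical, and the demonstration uses only the five preview rows for stimulus set 1, so its output says nothing about the full data set.

```python
from collections import defaultdict
from statistics import mean

# Hypothesized order from lowest to highest perplexity, written with the
# condition labels that appear in the results preview.
PREDICTED_ORDER = [
    "expected",
    "semantically related",
    "both",
    "phonologically related",
    "neither",
]


def mean_perplexity_by_condition(rows, column):
    """Average one perplexity column within each condition."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["condition_name"]].append(float(row[column]))
    return {condition: mean(values) for condition, values in grouped.items()}


def observed_order(means):
    """Condition names sorted from lowest to highest mean perplexity."""
    return sorted(means, key=means.get)


# Preview rows only (stimulus set 1), copied from the results table.
preview_rows = [
    {"condition_name": "expected", "whisper_perplexity": 1.415638, "gpt2_perplexity": 218.471649},
    {"condition_name": "phonologically related", "whisper_perplexity": 1.340787, "gpt2_perplexity": 971.731323},
    {"condition_name": "semantically related", "whisper_perplexity": 1.350668, "gpt2_perplexity": 332.905121},
    {"condition_name": "both", "whisper_perplexity": 1.343826, "gpt2_perplexity": 347.72348},
    {"condition_name": "neither", "whisper_perplexity": 1.402848, "gpt2_perplexity": 631.884216},
]

for column in ("whisper_perplexity", "gpt2_perplexity"):
    order = observed_order(mean_perplexity_by_condition(preview_rows, column))
    print(column, order, order == PREDICTED_ORDER)
```

For the full data set, the same functions could be given the rows produced by `csv.DictReader` on `results.csv`, since the values are converted with `float` before averaging.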

The results suggest that...

## Discussion

This study aims to examine the extent to which Whisper biases its language model component over its ASR component when transcribing linguistic input.

### Further research

It would be worthwhile to explore how these effects manifest across Whisper models of different sizes. **[cite papers]**

## Conclusion

Overall...

## References