# Speech to Text

Converting speech to text is a powerful tool in higher education, making spoken content accessible and analyzable. This technology enables automatic transcription of lecture recordings, creates real-time captions for classroom discussions, and helps international students follow along with spoken English. It also supports research by transcribing interviews, focus groups, and oral histories. By converting speech to text, institutions can improve accessibility, support language learners, and create searchable archives of educational content.

Using models from [Hugging Face](https://huggingface.co/docs), we compare two speech recognition approaches:
- [Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2), using the `facebook/wav2vec2-base-960h` checkpoint
- [Whisper](https://huggingface.co/openai/whisper-tiny.en), using the `openai/whisper-tiny.en` checkpoint

In [35]:
from typing import List, Tuple
import os
import glob
import pandas as pd
import numpy as np
import librosa
import torch
from transformers import (
    Wav2Vec2ForCTC, 
    Wav2Vec2Processor,
    WhisperProcessor, 
    WhisperForConditionalGeneration
)

def validate_speech_samples_path(path: str) -> str:
    """
    Validate that the speech samples directory exists and contains .wav files.
    
    Parameters:
    -----------
    path : str
        Path to the speech samples directory
        
    Returns:
    --------
    str
        Validated path to speech samples directory
        
    Raises:
    -------
    RuntimeError
        If the directory doesn't exist or contains no .wav files
    """
    if not os.path.exists(path):
        raise RuntimeError(
            f"Speech samples directory not found at: {path}\n"
            "Please make sure the 'speech_samples' directory exists in the same "
            "directory as this notebook."
        )
    
    wav_files = glob.glob(os.path.join(path, "*.wav"))
    if not wav_files:
        raise RuntimeError(
            f"No .wav files found in: {path}\n"
            "Please make sure the directory contains .wav files."
        )
    
    return path

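# Define and validate path to speech samples relative to the notebook location.
# `__file__` is not defined inside a notebook, so the current working directory
# is used; the earlier abspath('__file__') expression resolved to the same place.
SCRIPT_DIR = os.getcwd()
SPEECH_SAMPLES_PATH: str = validate_speech_samples_path(
    os.path.join(SCRIPT_DIR, 'speech_samples')
)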

## Using Wav2Vec2

In [36]:
def load_audio_file(file_path: str, sample_rate: int = 16000) -> Tuple[np.ndarray, int]:
    """
    Load and preprocess an audio file for speech-to-text conversion.
    
    Parameters:
    -----------
    file_path : str
        Path to the audio file to be loaded
    sample_rate : int, optional
        Target sampling rate for the audio, defaults to 16000 Hz
        
    Returns:
    --------
    Tuple[np.ndarray, int]
        A tuple containing:
        - The loaded audio data as a numpy array
        - The sampling rate used
    """
    try:
        audio, rate = librosa.load(file_path, sr=sample_rate)
        return audio, rate
    except Exception as e:
        # Chain the original exception so the full traceback is preserved
        raise RuntimeError(f"Error loading audio file {file_path}: {e}") from e

def get_audio_files(directory: str, extension: str = "*.wav") -> List[str]:
    """
    Get a sorted list of audio files from a directory.
    
    Parameters:
    -----------
    directory : str
        Path to the directory containing audio files
    extension : str, optional
        File extension pattern to match, defaults to "*.wav"
        
    Returns:
    --------
    List[str]
        Sorted list of file paths matching the extension pattern
    """
    try:
        return sorted(glob.glob(os.path.join(directory, extension)))
    except Exception as e:
        raise RuntimeError(f"Error accessing directory {directory}: {str(e)}")

def create_results_dataframe(wav_files: List[str], transcriptions: List[str]) -> pd.DataFrame:
    """
    Create a DataFrame to store speech-to-text results.
    
    Parameters:
    -----------
    wav_files : List[str]
        List of audio file paths
    transcriptions : List[str]
        List of transcribed text corresponding to the audio files
        
    Returns:
    --------
    pd.DataFrame
        DataFrame containing wav_input and txt_output columns
    """
    return pd.DataFrame({
        'wav_input': wav_files,
        'txt_output': transcriptions
    })

In [37]:
def transcribe_with_wav2vec2(audio_files: List[str]) -> pd.DataFrame:
    """
    Transcribe audio files using the Wav2Vec2 model.
    
    Parameters:
    -----------
    audio_files : List[str]
        List of paths to audio files to transcribe
        
    Returns:
    --------
    pd.DataFrame
        DataFrame containing original audio files and their transcriptions
    """
    try:
        # Initialize Wav2Vec2 model and processor
        processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
        model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
        
        wav_files = []
        transcriptions = []
        
        for filename in audio_files:
            try:
                # Load and process audio
                audio, rate = load_audio_file(filename)
                input_values = processor(
                    audio, 
                    return_tensors="pt", 
                    sampling_rate=rate
                ).input_values
                
                # Generate predictions
                with torch.no_grad():
                    logits = model(input_values).logits
                prediction = torch.argmax(logits, dim=-1)
                
                # Decode prediction to text
                transcription = processor.batch_decode(prediction)[0]
                
                # Store results
                wav_files.append(filename)
                transcriptions.append(transcription)
                
            except Exception as e:
                print(f"Error processing file {filename}: {str(e)}")
                continue
                
        return create_results_dataframe(wav_files, transcriptions)
        
    except Exception as e:
        # Chain the original exception so the full traceback is preserved
        raise RuntimeError(f"Error initializing Wav2Vec2 model: {e}") from e

# Process audio files with Wav2Vec2
audio_files = get_audio_files(SPEECH_SAMPLES_PATH)
df_wav2vec2 = transcribe_with_wav2vec2(audio_files)

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
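You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Note: this warning is expected for this checkpoint. `masked_spec_embed` is only used for SpecAugment-style masking during training, so it does not affect the inference results below.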
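
The `torch.argmax` and `processor.batch_decode` steps in the cell above amount to greedy CTC decoding: take the most likely token for each audio frame, merge consecutive repeats, then drop the blank token. The following is a minimal pure-Python sketch of that collapse rule, not the library's implementation. The helper name `ctc_greedy_collapse` and the toy vocabulary are assumptions made for illustration; the real vocabulary comes from the processor's tokenizer.

```python
from typing import Dict, List, Optional


def ctc_greedy_collapse(
    frame_ids: List[int],
    id_to_token: Dict[int, str],
    blank_id: int = 0,
) -> str:
    """Collapse per-frame argmax IDs into text using the CTC rule."""
    tokens: List[str] = []
    previous: Optional[int] = None
    for token_id in frame_ids:
        # Keep a token only if it differs from the previous frame
        # (merging repeats) and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            tokens.append(id_to_token[token_id])
        previous = token_id
    return "".join(tokens)


# Toy vocabulary (hypothetical, for illustration only).
TOY_VOCAB = {0: "<blank>", 1: "A", 2: "P", 3: "L", 4: "E"}

# Frames: A A _ P P _ P L E E   (where _ is the blank)
frames = [1, 1, 0, 2, 2, 0, 2, 3, 4, 4]
print(ctc_greedy_collapse(frames, TOY_VOCAB))  # APPLE
```

The blank between the two runs of `P` is what allows a doubled letter to survive the merge step. Without it, `P P P` would collapse to a single `P`.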


## Using OpenAI Whisper

In [38]:
def transcribe_with_whisper(audio_files: List[str]) -> pd.DataFrame:
    """
    Transcribe audio files using the OpenAI Whisper model.
    
    Parameters:
    -----------
    audio_files : List[str]
        List of paths to audio files to transcribe
        
    Returns:
    --------
    pd.DataFrame
        DataFrame containing original audio files and their transcriptions
    """
    try:
        # Initialize Whisper model and processor
        processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
        model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
        
        wav_files = []
        transcriptions = []
        
        for filename in audio_files:
            try:
                # Load and process audio
                audio, rate = load_audio_file(filename)
                
                # Process audio with attention mask
                inputs = processor(
                    audio, 
                    return_tensors="pt", 
                    sampling_rate=rate,
                    return_attention_mask=True  # Explicitly request attention mask
                )
                
                # Generate predictions with attention mask
                with torch.no_grad():
                    predicted_ids = model.generate(
                        inputs.input_features,
                        attention_mask=inputs.attention_mask
                    )
                
                # Decode prediction to text
                transcription = processor.batch_decode(
                    predicted_ids, 
                    skip_special_tokens=True
                )[0]
                
                # Store results
                wav_files.append(filename)
                transcriptions.append(transcription)
                
            except Exception as e:
                print(f"Error processing file {filename}: {str(e)}")
                continue
                
        return create_results_dataframe(wav_files, transcriptions)
        
    except Exception as e:
        # Chain the original exception so the full traceback is preserved
        raise RuntimeError(f"Error initializing Whisper model: {e}") from e

# Process audio files with Whisper
audio_files = get_audio_files(SPEECH_SAMPLES_PATH)
df_whisper = transcribe_with_whisper(audio_files)
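
By default, Whisper's feature extractor pads or truncates each input to a 30-second window, so the per-file call above is suited to short clips like these samples. For longer recordings such as lectures, one simple option is to split the signal into consecutive windows and transcribe each one. The sketch below shows only the splitting step; the function name `split_into_chunks` is hypothetical and the approach is an assumption, not part of the original notebook.

```python
from typing import List, Sequence


def split_into_chunks(
    samples: Sequence[float],
    sample_rate: int = 16000,
    chunk_seconds: float = 30.0,
) -> List[Sequence[float]]:
    """Split a 1-D audio signal into consecutive chunks of at most `chunk_seconds`."""
    chunk_size = int(sample_rate * chunk_seconds)
    if chunk_size <= 0:
        raise ValueError("sample_rate * chunk_seconds must be positive")
    # Slicing works the same way for Python lists and NumPy arrays.
    return [
        samples[start:start + chunk_size]
        for start in range(0, len(samples), chunk_size)
    ]


# 70 seconds of silence at 16 kHz -> chunks of 30 s, 30 s and 10 s.
signal = [0.0] * (70 * 16000)
chunks = split_into_chunks(signal)
print([len(chunk) / 16000 for chunk in chunks])  # [30.0, 30.0, 10.0]
```

Each chunk could then be passed through the processor and `model.generate` as in `transcribe_with_whisper`, with the resulting texts joined. Fixed boundaries can cut a word in half, so overlapping windows would be a natural refinement.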

## Comparing Results

In [39]:
def display_results(df: pd.DataFrame, model_name: str) -> None:
    """
    Display transcription results in a formatted way.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame containing transcription results
    model_name : str
        Name of the model used for transcription
    """
    print(f"\n{model_name} Results:")
    print("-" * 80)
    pd.set_option('display.max_colwidth', None)
    print(df.to_string(index=False))

In [40]:
# Display results for Wav2Vec2
display_results(df_wav2vec2, "Wav2Vec2")


Wav2Vec2 Results:
--------------------------------------------------------------------------------
                                                                       wav_input                    txt_output
  /Users/tereuter/Desktop/github/NLP-speech-to-text/speech_samples/the.apple.wav   LOOK AT THE WONDERFUL APPLE
 /Users/tereuter/Desktop/github/NLP-speech-to-text/speech_samples/the.apples.wav   LOOK AT THE WONDERFUL APPLE
   /Users/tereuter/Desktop/github/NLP-speech-to-text/speech_samples/the.baby.wav    LOOK AT THE BEAUTIFUL BABY
  /Users/tereuter/Desktop/github/NLP-speech-to-text/speech_samples/the.babys.wav  LOOK AT THE BEAUTIFUL BABIES
   /Users/tereuter/Desktop/github/NLP-speech-to-text/speech_samples/the.bike.wav    LOOK AT THE WONDERFUL BIKE
  /Users/tereuter/Desktop/github/NLP-speech-to-text/speech_samples/the.bikes.wav   LOOK AT THE WONDERFUL BIKES
 /Users/tereuter/Desktop/github/NLP-speech-to-text/speech_samples/the.cookie.wav  LOOK AT THE WONDERFUL COOKIE
/Users/tereu

In [41]:
# Display results for Whisper
display_results(df_whisper, "Whisper")


Whisper Results:
--------------------------------------------------------------------------------
                                                                       wav_input                      txt_output
  /Users/tereuter/Desktop/github/NLP-speech-to-text/speech_samples/the.apple.wav    Look at the wonderful apple.
 /Users/tereuter/Desktop/github/NLP-speech-to-text/speech_samples/the.apples.wav   Look at the wonderful apples.
   /Users/tereuter/Desktop/github/NLP-speech-to-text/speech_samples/the.baby.wav     Look at the beautiful baby.
  /Users/tereuter/Desktop/github/NLP-speech-to-text/speech_samples/the.babys.wav   Look at the beautiful babies.
   /Users/tereuter/Desktop/github/NLP-speech-to-text/speech_samples/the.bike.wav     Look at the wonderful bike.
  /Users/tereuter/Desktop/github/NLP-speech-to-text/speech_samples/the.bikes.wav    Look at the wonderful bikes.
 /Users/tereuter/Desktop/github/NLP-speech-to-text/speech_samples/the.cookie.wav   Look at the wonderful cooki
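
Because Wav2Vec2 emits uppercase text without punctuation while Whisper emits cased, punctuated text, a direct string comparison flags every row as different. One way to surface only the real disagreements is to normalize both outputs first. The sketch below is a minimal example under that assumption; `normalize` and `find_disagreements` are hypothetical helpers, and the sample rows are copied from the outputs above.

```python
import string
from typing import Dict, List, Tuple


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())


def find_disagreements(
    a: Dict[str, str], b: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (file, text_a, text_b) for files whose normalized transcripts differ."""
    shared = sorted(set(a) & set(b))
    return [
        (name, a[name], b[name])
        for name in shared
        if normalize(a[name]) != normalize(b[name])
    ]


# Two rows taken from the outputs above.
wav2vec2_out = {
    "the.apple.wav": "LOOK AT THE WONDERFUL APPLE",
    "the.apples.wav": "LOOK AT THE WONDERFUL APPLE",
}
whisper_out = {
    "the.apple.wav": " Look at the wonderful apple.",
    "the.apples.wav": " Look at the wonderful apples.",
}

for name, first, second in find_disagreements(wav2vec2_out, whisper_out):
    print(name, "|", first, "|", second.strip())
# the.apples.wav | LOOK AT THE WONDERFUL APPLE | Look at the wonderful apples.
```

With the DataFrames from this notebook, the input dictionaries could be built with `dict(zip(df_wav2vec2["wav_input"], df_wav2vec2["txt_output"]))` and the equivalent for `df_whisper`.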

## Conclusion

The comparison between Whisper and Wav2Vec2 reveals several key advantages of the Whisper model:

1. Performance
   - Approximately 20% faster transcription on these samples (the timing measurement is not included in the cells above)
   - Potential for further performance optimization

2. Accuracy
   - Better handling of singular/plural forms (e.g., Wav2Vec2 transcribes `the.apples.wav` as "APPLE", while Whisper outputs "apples")
   - More accurate spelling (e.g., Whisper's "doggies" vs. Wav2Vec2's "DOGGIYS")

3. Nuanced Output
   - Mixed-case, punctuated output, including emphatic marks, whereas this Wav2Vec2 checkpoint emits uppercase text without punctuation
   - Better preservation of emotional context through punctuation
   - Improved potential for downstream tasks like sentiment analysis
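
A speed comparison like the one in point 1 can be reproduced with a small timing helper. The sketch below is one possible approach, not the measurement behind the figure quoted above. `time_call`, `relative_speedup`, and the stand-in `fake_transcribe` workload are hypothetical names, and "20% faster" is assumed here to mean 20% less wall-clock time.

```python
import time
from typing import Any, Callable, List, Tuple


def time_call(func: Callable[..., Any], *args: Any, repeats: int = 3) -> Tuple[Any, float]:
    """Run `func(*args)` `repeats` times; return the last result and the best time in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


def relative_speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Fraction by which the candidate is faster (0.2 means 20% less time)."""
    return (baseline_seconds - candidate_seconds) / baseline_seconds


# Stand-in workload so the sketch runs without downloading any model.
def fake_transcribe(files: List[str]) -> List[str]:
    return [name.upper() for name in files]


output, seconds = time_call(fake_transcribe, ["a.wav", "b.wav"])
print(output, f"{seconds:.6f}s")
print(f"{relative_speedup(10.0, 8.0):.0%}")  # 20%
```

In this notebook the call would look like `time_call(transcribe_with_whisper, audio_files)`. Both transcription functions load their model inside the function body, so a timing taken this way includes model loading as well as inference.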

These differences make Whisper particularly suitable for applications requiring:
- High transcription accuracy
- Preservation of emotional context
- Integration with sentiment analysis pipelines
- Real-time or near-real-time processing
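
To put a number on transcription accuracy, a common metric is word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal standard-library version. The reference sentence is hypothetical, since the notebook does not include ground-truth transcripts.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical reference transcript (assumed, not part of the notebook).
reference = "look at the wonderful apples"
print(word_error_rate(reference, "LOOK AT THE WONDERFUL APPLE"))   # 0.2
print(word_error_rate(reference, "look at the wonderful apples"))  # 0.0
```

Punctuation should be stripped from both strings before scoring. Otherwise Whisper's trailing period would make "apples." count as a substitution for "apples".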