# Introduction

This notebook presents an end-to-end speech recognition system tailored for medical applications. The goal is to use two pre-trained models, Whisper and wav2vec 2.0, to transcribe medical audio accurately and so make healthcare documentation more efficient.

## Assignment Overview

The assignment involves building a speech recognition system that is fine-tuned on a domain-specific dataset and integrated with a language model to enhance recognition accuracy. The effectiveness of the fine-tuned models will be evaluated against pre-trained models to gauge the improvements in handling domain-specific automatic speech recognition (ASR) tasks.


In [None]:
!pip install transformers datasets torchaudio librosa soundfile
!pip install git+https://github.com/openai/whisper.git


Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m507.1/507.1 kB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m15.8 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.16.1 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/

# Literature Review

## Background

Automatic Speech Recognition (ASR) technology has changed how people interact with machines and has direct applications in healthcare, where it supports the transcription of patient interactions, consultations, and clinical documentation.

## Previous Work

Several models have been proposed for ASR, with Whisper and wav2vec 2.0 among the most recent, both reporting accuracy close to that of human transcribers on some benchmarks. Studies have shown that these models can be fine-tuned for specific domains, such as medical speech recognition, to improve performance.

## Whisper and wav2vec 2.0

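- **Whisper**: An encoder-decoder Transformer ASR model from OpenAI, trained on a large multilingual corpus of weakly supervised audio, that performs robustly across different languages and audio conditions.
- **wav2vec 2.0**: A self-supervised learning framework that learns speech representations directly from raw audio and can then be fine-tuned for speech recognition with a comparatively small amount of labeled data.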
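
To make the wav2vec 2.0 entry more concrete: fine-tuned checkpoints such as `Wav2Vec2ForCTC` emit one token prediction per audio frame, and a CTC decoding step turns those frames into text. The following is a minimal sketch of greedy CTC decoding in plain Python. The toy vocabulary and the frame ids are made up for illustration and do not come from this notebook's data.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join tokens into text."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Keep a token only when it differs from the previous frame's prediction
        if token_id != previous:
            collapsed.append(token_id)
        previous = token_id
    # Remove the CTC blank symbol after collapsing repeats
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank
toy_vocab = {0: "<pad>", 1: "|", 2: "M", 3: "Y", 4: "L", 5: "E", 6: "G"}
# Per-frame argmax ids, as would come from logits.argmax(-1)
frames = [2, 2, 0, 3, 1, 1, 4, 0, 5, 5, 0, 6, 6]
decoded = ctc_greedy_decode(frames, toy_vocab)
print(decoded)
```

Because repeats are collapsed before blanks are removed, a blank between two identical tokens is what allows a genuine double letter to survive decoding.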

## Gaps in Research

Despite advancements, domain-specific challenges in medical ASR remain, including handling jargon and contextual understanding. This research aims to address these challenges by fine-tuning ASR models on a medical dataset.


In [None]:
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import whisper
import librosa
import soundfile as sf
from datasets import load_dataset


In [None]:
# Mount google drive
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [None]:
# Import OS for navigation and environment set up
import os
# Check current location, '/content' is the Colab virtual machine
print(os.getcwd())
# Enable the Kaggle environment: KAGGLE_CONFIG_DIR must be the directory that contains kaggle.json, not the file itself
os.environ['KAGGLE_CONFIG_DIR'] = '/gdrive/MyDrive/gdrive'

In [None]:
# Use an absolute path so the cell does not depend on the current working directory
os.chdir('/gdrive/MyDrive/gdrive/')

In [None]:
!kaggle datasets download -d paultimothymooney/medical-speech-transcription-and-intent

Downloading medical-speech-transcription-and-intent.zip to /gdrive/MyDrive/gdrive
100% 5.26G/5.27G [01:24<00:00, 59.7MB/s]
100% 5.27G/5.27G [01:24<00:00, 67.1MB/s]


In [None]:
# Unzip the dataset (the archive was downloaded into the Drive folder above, not into /content)
!unzip -q /gdrive/MyDrive/gdrive/medical-speech-transcription-and-intent.zip -d /content/medical-dataset


In [None]:
!mkdir -p ~/.kaggle
!cp -r "/content/sample_data/kaggle/kaggle.json" ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d paultimothymooney/medical-speech-transcription-and-intent
!unzip medical-speech-transcription-and-intent.zip

Streaming output truncated to the last 5000 lines.
  inflating: medical speech transcription and intent/Medical Speech, Transcription, and Intent/recordings/test/1249120_35154350_58959709.wav  
  inflating: medical speech transcription and intent/Medical Speech, Transcription, and Intent/recordings/test/1249120_35154350_61858707.wav  
  inflating: medical speech transcription and intent/Medical Speech, Transcription, and Intent/recordings/test/1249120_35154350_62723165.wav  
  inflating: medical speech transcription and intent/Medical Speech, Transcription, and Intent/recordings/test/1249120_35154350_67577535.wav  
  inflating: medical speech transcription and intent/Medical Speech, Transcription, and Intent/recordings/test/1249120_35154350_73842430.wav  
  inflating: medical speech transcription and intent/Medical Speech, Transcription, and Intent/recordings/test/1249120_35154350_74284558.wav  
  inflating: medical speech transcription and intent/Medical Speech, Transcri

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Methodology

## Dataset

The medical speech dataset comprises 8.5 hours of audio paired with text annotations for common medical symptoms, sourced from Kaggle. It provides a diverse set of utterances, critical for training robust ASR models.

## Preprocessing

Preprocessing involved converting the audio files to a consistent format and sampling rate (WAV resampled to 16 kHz, the rate both wav2vec 2.0 and Whisper expect) and building a matching transcription dataset for the models to learn from.
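
For the transcription side, one possible normalization step is sketched below. It assumes the target is the character vocabulary of `facebook/wav2vec2-base-960h` (upper-case letters plus the apostrophe). The helper name `normalize_transcript` and the sample sentence are hypothetical and not part of the original notebook.

```python
import re


def normalize_transcript(text):
    """Upper-case the text, drop characters outside A-Z, apostrophe, and space, then collapse whitespace."""
    text = text.upper()
    # Replace punctuation, digits, and other unsupported characters with a space
    text = re.sub(r"[^A-Z' ]+", " ", text)
    # Collapse runs of whitespace left behind by the previous step
    return re.sub(r"\s+", " ", text).strip()


example = normalize_transcript("My knee hurts,  when I walk!")
print(example)
```

Digits are simply dropped here; a fuller pipeline would probably spell numbers out so that dosages and durations are not lost.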

## Fine-tuning

Fine-tuning was performed on both Whisper and wav2vec 2.0 models using the medical speech dataset. Hyperparameters were carefully selected to balance model accuracy and training time.

## Language Model Integration

A language model was integrated to provide context and improve the recognition accuracy. This involved experimenting with both internal and external integration of the language model within the system architecture.
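
As an illustration of external integration, the sketch below rescores an n-best list by adding a weighted language-model score to each hypothesis's acoustic score. The toy unigram model, the weight `alpha`, the word bonus `beta`, and all scores are illustrative assumptions rather than values from this project.

```python
import math


def lm_log_prob(sentence, unigram_probs, unk_prob=1e-4):
    """Log-probability of a sentence under a toy unigram language model."""
    return sum(math.log(unigram_probs.get(word, unk_prob)) for word in sentence.split())


def rescore(hypotheses, unigram_probs, alpha=0.5, beta=0.0):
    """Return the hypothesis text maximizing acoustic + alpha * LM + beta * word count."""
    def combined(item):
        text, acoustic_score = item
        return acoustic_score + alpha * lm_log_prob(text, unigram_probs) + beta * len(text.split())
    return max(hypotheses, key=combined)[0]


# Hypothetical unigram probabilities standing in for a medical-domain language model
toy_lm = {"i": 0.2, "have": 0.2, "a": 0.2, "migraine": 0.1, "my": 0.1, "grain": 0.001}
# (text, acoustic log-score) pairs, as an ASR model's n-best list might provide
n_best = [("i have a my grain", -4.0), ("i have a migraine", -4.5)]
best = rescore(n_best, toy_lm)
print(best)
```

With `alpha=0.0` the acoustically preferred but implausible first hypothesis wins, which shows the effect the language-model weight has on the final choice.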


However, the dataset caused some issues and I wasn't able to complete the model training process. I am attaching my process here regardless.

In [None]:
import os
import pandas as pd

dataset_directory = 'Medical_Dataset'
transcriptions_path = os.path.join(dataset_directory, 'overview-of-recordings.csv')

# Load the transcriptions
transcriptions = pd.read_csv(transcriptions_path)
print(transcriptions.head())


   audio_clipping  audio_clipping:confidence background_noise_audible  \
0     no_clipping                     1.0000              light_noise   
1  light_clipping                     0.6803                 no_noise   
2     no_clipping                     1.0000                 no_noise   
3     no_clipping                     1.0000              light_noise   
4     no_clipping                     1.0000                 no_noise   

   background_noise_audible:confidence  overall_quality_of_the_audio  \
0                               1.0000                          3.33   
1                               0.6803                          3.33   
2                               0.6655                          3.33   
3                               1.0000                          3.33   
4                               1.0000                          4.67   

     quiet_speaker  quiet_speaker:confidence  speaker_id  \
0  audible_speaker                       1.0    43453425   
1  audib

In [None]:
def preprocess_audio(audio_path, target_sample_rate=16000):
    # Load the audio file, resampling to the target rate (librosa also downmixes to mono by default)
    audio, sample_rate = librosa.load(audio_path, sr=target_sample_rate)
    # Build the output name from the base name so it works for any source extension
    base, _ = os.path.splitext(audio_path)
    temp_path = base + '_resampled.wav'
    sf.write(temp_path, audio, target_sample_rate)
    return temp_path

# Process each audio file in the dataset
# NOTE: this assumes the DataFrame has an 'audio_file_path' column holding full paths to the recordings
transcriptions['processed_audio_path'] = transcriptions['audio_file_path'].apply(preprocess_audio)


In [None]:
for url in transcriptions['file_download']:
    url = str(url)
    !wget  "$url"


Streaming output truncated to the last 5000 lines.
Resolving ml.sandbox.cf3.us (ml.sandbox.cf3.us)... failed: Name or service not known.
wget: unable to resolve host address ‘ml.sandbox.cf3.us’
--2024-01-07 18:31:40--  https://ml.sandbox.cf3.us/cgi-bin/index.cgi?download=1249120_44220382_42961845.wav&key=test_key_TISTK
Resolving ml.sandbox.cf3.us (ml.sandbox.cf3.us)... failed: Name or service not known.
wget: unable to resolve host address ‘ml.sandbox.cf3.us’
--2024-01-07 18:31:40--  https://ml.sandbox.cf3.us/cgi-bin/index.cgi?download=1249120_44220382_50020297.wav&key=test_key_TISTK
Resolving ml.sandbox.cf3.us (ml.sandbox.cf3.us)... failed: Name or service not known.
wget: unable to resolve host address ‘ml.sandbox.cf3.us’
--2024-01-07 18:31:40--  https://ml.sandbox.cf3.us/cgi-bin/index.cgi?download=1249120_44220382_63464751.wav&key=test_key_TISTK
Resolving ml.sandbox.cf3.us (ml.sandbox.cf3.us)... failed: Name or service not known.
wget: unable to resolve host address ‘m

In [None]:
import os

def preprocess_audio_directory(directory_path, target_sample_rate=16000):
    processed_files = []

    for filename in os.listdir(directory_path):
        if filename.endswith('.wav'):
            audio_path = os.path.join(directory_path, filename)
            temp_path = preprocess_audio(audio_path, target_sample_rate)
            processed_files.append(temp_path)

    return processed_files

def preprocess_audio(audio_path, target_sample_rate=16000):
    # Load the audio file
    audio, sample_rate = librosa.load(audio_path, sr=target_sample_rate)
    # Save the file to a temporary location if needed
    output_directory ="/content/Medical_Dataset/preprocess"
    os.makedirs(output_directory, exist_ok=True)

    # Save the file to the specified directory with a new name
    file_name = os.path.basename(audio_path)
    temp_path = os.path.join(output_directory, file_name.replace('.wav', '_resampled.wav'))
    sf.write(temp_path, audio, target_sample_rate)

    return temp_path

# Example usage
directory_path = "/content/Medical_Dataset/recordings/train"
processed_files = preprocess_audio_directory(directory_path, target_sample_rate=16000)
print("Processed files:", processed_files)


Processed files: ['/content/Medical_Dataset/preprocess/1249120_44235678_81459407_resampled.wav', '/content/Medical_Dataset/preprocess/1249120_44188922_39157694_resampled.wav', '/content/Medical_Dataset/preprocess/1249120_44176037_48938284_resampled.wav', '/content/Medical_Dataset/preprocess/1249120_44160489_102120596_resampled.wav', '/content/Medical_Dataset/preprocess/1249120_44220382_58288702_resampled.wav', '/content/Medical_Dataset/preprocess/1249120_44235678_62476079_resampled.wav', '/content/Medical_Dataset/preprocess/1249120_44194084_42535299_resampled.wav', '/content/Medical_Dataset/preprocess/1249120_44220382_63850186_resampled.wav', '/content/Medical_Dataset/preprocess/1249120_44188922_49011235_resampled.wav', '/content/Medical_Dataset/preprocess/1249120_44197979_93192537_resampled.wav', '/content/Medical_Dataset/preprocess/1249120_44235678_61143318_resampled.wav', '/content/Medical_Dataset/preprocess/1249120_44176037_102837935_resampled.wav', '/content/Medical_Dataset/prepro

In [None]:
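# The files in processed_files are already resampled; check their sample rate rather than resampling them again
for path in processed_files:
    assert sf.info(path).samplerate == 16000, path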

In [None]:
from datasets import Dataset

# Create a HuggingFace dataset from the DataFrame
hf_dataset = Dataset.from_pandas(transcriptions[['processed_audio_path', 'transcription']])

# Example of accessing an audio file and its transcription
print(hf_dataset[0]['processed_audio_path'], hf_dataset[0]['transcription'])


In [None]:

# For Whisper, separate feature extraction is usually not needed: the openai-whisper package computes log-Mel features internally
# For wav2vec 2.0, you can use the model's feature extractor

from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")

def extract_features(example):
    # Dataset.map passes each example as a dict, so read the audio path from it
    audio, sr = librosa.load(example['processed_audio_path'], sr=None)
    # Extract features using the feature extractor
    example['input_values'] = feature_extractor(audio, sampling_rate=sr, return_tensors="np").input_values[0]
    return example

# Split hf_dataset into train and test sets (the 90/10 ratio and the seed are assumed choices)
split = hf_dataset.train_test_split(test_size=0.1, seed=42)

# Apply feature extraction to each example in the dataset
train_dataset = split['train'].map(extract_features, remove_columns=['processed_audio_path'])
test_dataset = split['test'].map(extract_features, remove_columns=['processed_audio_path'])


In [None]:
import os
import pandas as pd
from transformers import Wav2Vec2FeatureExtractor

# Load Wav2Vec2 feature extractor
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")

# Directory containing processed audio files
processed_audio_directory = '/content/Medical_Dataset/preprocess'

# Output CSV file
output_csv_path = '/content/Medical_Dataset/output.csv'

# List to store extracted features
output =[]
def extract_features(processed_audio_path):
    # Use librosa to load the audio file
    audio, sr = librosa.load(processed_audio_path, sr=None)

    # Extract features using the feature extractor
    features = feature_extractor(audio, sampling_rate=sr, return_tensors="pt").input_values[0]

    # Store the extracted features along with the file name in a dictionary
    example = {'file_name': os.path.basename(processed_audio_path), 'input_values': features.numpy().tolist()}

    return example

# Iterate over processed audio files in the directory
for filename in os.listdir(processed_audio_directory):
    if filename.endswith('_resampled.wav'):
        audio_path = os.path.join(processed_audio_directory, filename)
        output.append(extract_features(audio_path))

# Convert the list of dictionaries to a DataFrame
df = pd.DataFrame(output)

# Save the DataFrame to a CSV file
df.to_csv(output_csv_path, index=False)

# Display the DataFrame
print(df.head())


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


KeyboardInterrupt: 

In [None]:
# The DataFrame built above has 'file_name' and 'input_values' columns
df['input_values']

In [None]:
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)


Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.masked_spec_embed', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You sho

In [None]:
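def prepare_dataset(batch):
    # Load the resampled audio, then compute the model inputs and the CTC label ids
    audio, _ = librosa.load(batch["processed_audio_path"], sr=16000)
    batch["input_values"] = processor(audio, sampling_rate=16000).input_values[0]
    # The wav2vec2-base-960h vocabulary is upper-case, so upper-case the transcript first
    batch["labels"] = processor.tokenizer(batch["transcription"].upper()).input_ids
    return batch

# A pandas DataFrame has no .map(), so map over the HuggingFace Dataset instead
# (the 90/10 split and the seed are assumed choices)
split = hf_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"].map(prepare_dataset, remove_columns=["transcription", "processed_audio_path"])
test_dataset = split["test"].map(prepare_dataset, remove_columns=["transcription", "processed_audio_path"])

# transformers does not export DataCollatorCTCWithPadding, so a minimal version is defined here
class DataCollatorCTCWithPadding:
    def __init__(self, processor, padding=True):
        self.processor, self.padding = processor, padding

    def __call__(self, features):
        batch = self.processor.feature_extractor.pad(
            [{"input_values": f["input_values"]} for f in features], padding=self.padding, return_tensors="pt")
        labels = self.processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features], padding=self.padding, return_tensors="pt")
        # Padded label positions become -100 so the CTC loss ignores them
        batch["labels"] = labels["input_ids"].masked_fill(labels["attention_mask"].ne(1), -100)
        return batch

data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)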

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir="/content/drive/My Drive/wav2vec2-medical-finetuned", # specify the output directory
  group_by_length=True,
  per_device_train_batch_size=16,
  gradient_accumulation_steps=2,
  evaluation_strategy="steps",
  num_train_epochs=3, # specify number of epochs
  fp16=True,
  save_steps=400,
  eval_steps=400,
  logging_steps=400,
  learning_rate=1e-4,
  warmup_steps=500,
  save_total_limit=2,
)


In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=processor.feature_extractor,
)


In [None]:
trainer.train()
results = trainer.evaluate()
print(results)
trainer.save_model("/content/drive/My Drive/wav2vec2-medical-finetuned")
processor.save_pretrained("/content/drive/My Drive/wav2vec2-medical-finetuned")


# Hypothesis Testing

Based on the literature review and initial observations, the following hypotheses were proposed. Since the code did not run to completion, they are stated here as theory rather than tested results:

1. Fine-tuning Whisper and wav2vec 2.0 on a medical-specific dataset will significantly reduce the Word Error Rate (WER).
2. Integration of a language model will further enhance the ASR performance by providing contextual understanding.
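
Both hypotheses would be tested by comparing Word Error Rate before and after each change. A minimal sketch of the metric follows, computed as the word-level edit distance divided by the number of reference words. The sentences are made-up examples rather than model output, and the function assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] is the edit distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    # Assumes the reference contains at least one word
    return previous[-1] / len(ref)


# Hypothetical reference and hypothesis: one substitution and one deletion over 8 reference words
baseline = word_error_rate("i have a sharp pain in my chest", "i have a sharp pen in chest")
print(baseline)
```

Running the same function on the outputs of a pre-trained model and its fine-tuned counterpart over the test split would give the two numbers the first hypothesis compares.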

