# Pre-trained Models for Detection and Severity Level Classification of Dysarthria from Speech

## Paper Implementation

**Authors:** Farhad Javanmardi, Sudarsana Reddy Kadiri, Paavo Alku  
**Published in:** Speech Communication, Volume 158, 2024  
**DOI:** https://doi.org/10.1016/j.specom.2024.103047

---

## Abstract

Automatic detection and severity level classification of dysarthria from speech enables non-invasive and effective diagnosis, which supports clinical decisions about patients' medication and therapy. In this work, three pre-trained models (**wav2vec2-BASE**, **wav2vec2-LARGE**, and **HuBERT**) are studied as feature extractors for building automatic detection and severity level classification systems for dysarthric speech.

The experiments were conducted using two publicly available databases:
- **UA-Speech**: 15 dysarthric speakers + 13 control speakers
- **TORGO**: 8 dysarthric speakers + 7 control speakers

**Classifiers used:**
1. Support Vector Machine (SVM)
2. Convolutional Neural Network (CNN)

**Baseline features compared:**
- MFCCs (Mel-Frequency Cepstral Coefficients)
- openSMILE (ComParE 2016)
- eGeMAPS (extended Geneva Minimalistic Acoustic Parameter Set)

---

## Table of Contents

1. [Setup and Imports](#1-setup-and-imports)
2. [Configuration](#2-configuration)
3. [Data Loading](#3-data-loading)
4. [Feature Extraction](#4-feature-extraction)
   - 4.1 [MFCC Features](#41-mfcc-features)
   - 4.2 [openSMILE Features](#42-opensmile-features)
   - 4.3 [eGeMAPS Features](#43-egemaps-features)
   - 4.4 [wav2vec2-BASE Features](#44-wav2vec2-base-features)
   - 4.5 [wav2vec2-LARGE Features](#45-wav2vec2-large-features)
   - 4.6 [HuBERT Features](#46-hubert-features)
5. [Classification Models](#5-classification-models)
   - 5.1 [SVM Classifier](#51-svm-classifier)
   - 5.2 [CNN Classifier](#52-cnn-classifier)
6. [Experiments](#6-experiments)
   - 6.1 [Detection Task](#61-detection-task)
   - 6.2 [Severity Classification Task](#62-severity-classification-task)
7. [Results Analysis](#7-results-analysis)
8. [Comparison with Paper Results](#8-comparison-with-paper-results)
9. [Conclusions](#9-conclusions)

## 1. Setup and Imports

In [None]:
# Install required packages (uncomment if needed)
# !pip install torch torchaudio transformers librosa opensmile scikit-learn pandas numpy matplotlib seaborn tqdm

In [None]:
# Standard library imports
import os
import sys
import warnings
from pathlib import Path

# Data handling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Audio processing
import librosa
import librosa.display

# Deep learning
import torch

# Progress bar
from tqdm.notebook import tqdm

# Suppress warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

In [None]:
# Import local modules
from config import (
    AudioConfig,
    MFCCConfig,
    OpenSMILEConfig,
    EGeMAPSConfig,
    Wav2Vec2Config,
    HuBERTConfig,
    SVMConfig,
    CNNConfig,
    UASpeechConfig,
    TORGOConfig,
    ExperimentConfig,
    TaskType,
    ClassifierType,
    FeatureType,
    PRETRAINED_MODELS,
    PAPER_RESULTS,
)

from utils import (
    # Audio processing
    load_audio,
    preprocess_audio_batch,
    
    # Feature extraction
    extract_mfcc,
    extract_mfcc_stats,
    extract_opensmile_features,
    extract_egemaps_features,
    PretrainedFeatureExtractor,
    extract_wav2vec2_base_features,
    extract_wav2vec2_large_features,
    extract_hubert_features,
    extract_features,
    extract_features_batch,
    
    # Dataset classes
    UASpeechDataset,
    TORGODataset,
    create_synthetic_dataset,
    
    # Classifiers
    SVMClassifier,
    CNNClassifier,
    
    # Evaluation
    evaluate_classifier,
    get_confusion_matrix,
    print_classification_report,
    leave_one_speaker_out_cv,
    run_experiment,
    run_all_experiments,
    
    # Visualization
    plot_confusion_matrix,
    plot_results_comparison,
    plot_training_history,
    
    # Utilities
    save_results,
    load_results,
    set_seed,
)

print("All modules imported successfully!")

## 2. Configuration

In [None]:
# Set reproducibility
set_seed(42)

# Audio configuration
audio_config = AudioConfig(
    sample_rate=16000,      # Standard for speech processing
    normalize=True,         # Normalize audio amplitude
    remove_silence=False,   # Keep original timing
    min_duration=0.5,       # Minimum 0.5 seconds
    max_duration=10.0,      # Maximum 10 seconds
)

print("Audio Configuration:")
print(f"  Sample Rate: {audio_config.sample_rate} Hz")
print(f"  Normalize: {audio_config.normalize}")
print(f"  Min Duration: {audio_config.min_duration}s")
print(f"  Max Duration: {audio_config.max_duration}s")

In [None]:
# Feature extraction configurations
mfcc_config = MFCCConfig()
opensmile_config = OpenSMILEConfig()
egemaps_config = EGeMAPSConfig()
wav2vec2_base_config = Wav2Vec2Config(model_name="facebook/wav2vec2-base")
wav2vec2_large_config = Wav2Vec2Config(model_name="facebook/wav2vec2-large")
hubert_config = HuBERTConfig()

print("Feature Dimensions:")
print(f"  MFCC: {mfcc_config.feature_dim} (13 MFCCs + deltas + delta-deltas)")
print(f"  openSMILE ComParE: {opensmile_config.feature_dim}")
print(f"  eGeMAPS: {egemaps_config.feature_dim}")
print(f"  wav2vec2-BASE: {wav2vec2_base_config.hidden_size}")
print(f"  wav2vec2-LARGE: {wav2vec2_large_config.hidden_size}")
print(f"  HuBERT: {hubert_config.hidden_size}")

In [None]:
# Classifier configurations
svm_config = SVMConfig()
cnn_config = CNNConfig()

print("SVM Configuration:")
print(f"  Kernel: {svm_config.kernel}")
print(f"  C range: {svm_config.C_range}")
print(f"  Class weight: {svm_config.class_weight}")

print("\nCNN Configuration:")
print(f"  Batch size: {cnn_config.batch_size}")
print(f"  Epochs: {cnn_config.epochs}")
print(f"  Learning rate: {cnn_config.learning_rate}")
print(f"  Dropout: {cnn_config.dropout_rate}")

In [None]:
# Dataset configurations
uaspeech_config = UASpeechConfig()
torgo_config = TORGOConfig()

print("UA-Speech Dataset:")
print(f"  Dysarthric speakers: {uaspeech_config.dysarthric_speakers}")
print(f"  Control speakers: {uaspeech_config.control_speakers}")
print(f"  Total speakers: {uaspeech_config.total_speakers}")
print(f"  Severity levels: {list(uaspeech_config.severity_levels.keys())}")

print("\nTORGO Dataset:")
print(f"  Dysarthric speakers: {torgo_config.dysarthric_speakers}")
print(f"  Control speakers: {torgo_config.control_speakers}")
print(f"  Total speakers: {torgo_config.total_speakers}")

## 3. Data Loading

### Dataset Structure

**UA-Speech Database:**
- 15 dysarthric speakers (4 female, 11 male) with Cerebral Palsy
- 13 control speakers (4 female, 9 male)
- Severity levels based on intelligibility (higher intelligibility means lower severity):
  - Very Low severity (76-100% intelligible): 3 speakers
  - Low severity (51-75% intelligible): 4 speakers
  - Medium severity (26-50% intelligible): 6 speakers
  - High severity (0-25% intelligible): 2 speakers

**TORGO Database:**
- 8 dysarthric speakers with various severity levels
- 7 control speakers
- Includes articulatory data (not used in this paper)
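
The UA-Speech intelligibility bands listed above map directly onto the four severity classes. A minimal sketch of that mapping follows; the function name `severity_from_intelligibility` and the integer label order (0 = Very Low ... 3 = High) are assumptions made for illustration and are not taken from the paper or from `utils.py`.

```python
# Hypothetical helper: map a speaker's intelligibility score (in percent)
# to one of the four severity classes listed above.
SEVERITY_NAMES = ["Very Low", "Low", "Medium", "High"]  # assumed label order 0..3


def severity_from_intelligibility(intelligibility: float) -> int:
    """Return the severity class index for an intelligibility score in [0, 100]."""
    if not 0 <= intelligibility <= 100:
        raise ValueError("intelligibility must be a percentage in [0, 100]")
    if intelligibility > 75:
        return 0  # Very Low severity (76-100% intelligible)
    if intelligibility > 50:
        return 1  # Low severity (51-75% intelligible)
    if intelligibility > 25:
        return 2  # Medium severity (26-50% intelligible)
    return 3      # High severity (0-25% intelligible)


for score in (93, 62, 29, 7):
    label = severity_from_intelligibility(score)
    print(f"{score:>3}% intelligible -> {SEVERITY_NAMES[label]} severity")
```

Because severity is defined per speaker, every utterance from a speaker would receive the same label under this scheme.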

In [None]:
# Dataset paths - UPDATE THESE TO YOUR LOCAL PATHS
UASPEECH_PATH = "./data/UA-Speech"  # Path to UA-Speech dataset
TORGO_PATH = "./data/TORGO"         # Path to TORGO dataset

# Check if datasets exist
uaspeech_exists = os.path.exists(UASPEECH_PATH)
torgo_exists = os.path.exists(TORGO_PATH)

print(f"UA-Speech dataset found: {uaspeech_exists}")
print(f"TORGO dataset found: {torgo_exists}")

if not uaspeech_exists and not torgo_exists:
    print("\n" + "="*60)
    print("NOTE: Datasets not found. Using synthetic data for demonstration.")
    print("")
    print("To use real data, download the datasets from:")
    print("- UA-Speech: http://www.isle.illinois.edu/sst/data/UASpeech/")
    print("- TORGO: http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html")
    print("="*60)

In [None]:
# Load or create synthetic data
USE_SYNTHETIC_DATA = not (uaspeech_exists or torgo_exists)

if USE_SYNTHETIC_DATA:
    print("Creating synthetic dataset for demonstration...")
    
    # Create synthetic data for detection task
    detection_features, detection_labels, detection_speakers = create_synthetic_dataset(
        n_samples=500,
        n_features=768,
        n_speakers=28,  # Similar to UA-Speech
        task=TaskType.DETECTION,
        random_state=42,
    )
    
    # Create synthetic data for severity classification
    severity_features, severity_labels, severity_speakers = create_synthetic_dataset(
        n_samples=300,
        n_features=768,
        n_speakers=15,  # Only dysarthric speakers
        task=TaskType.SEVERITY,
        random_state=42,
    )
    
    print(f"Detection data: {detection_features.shape[0]} samples, {detection_features.shape[1]} features")
    print(f"Severity data: {severity_features.shape[0]} samples, {severity_features.shape[1]} features")
    print(f"Detection labels distribution: {np.bincount(detection_labels)}")
    print(f"Severity labels distribution: {np.bincount(severity_labels)}")
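else:
    print("Loading real dataset...")
    if uaspeech_exists:
        uaspeech_dataset = UASpeechDataset(UASPEECH_PATH)
        audio_files, detection_labels, detection_speakers = uaspeech_dataset.get_data(TaskType.DETECTION)
        _, severity_labels, severity_speakers = uaspeech_dataset.get_data(TaskType.SEVERITY)
        print(f"UA-Speech: {len(audio_files)} audio files loaded")
    # NOTE: this branch loads file paths and labels only. `detection_features` and
    # `severity_features` (used in Section 6) are not defined here and must be
    # extracted from the audio files before running the experiments.
    # TORGO is not loaded in this cell, even when TORGO_PATH exists.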

## 4. Feature Extraction

This section demonstrates all feature extraction methods used in the paper:

1. **Baseline Features:**
   - MFCC (Mel-Frequency Cepstral Coefficients)
   - openSMILE ComParE 2016
   - eGeMAPS (extended Geneva Minimalistic Acoustic Parameter Set)

2. **Pre-trained Model Features:**
   - wav2vec2-BASE (768-dim, 12 transformer layers)
   - wav2vec2-LARGE (1024-dim, 24 transformer layers)
   - HuBERT-BASE (768-dim, 12 transformer layers)

### Key Findings from the Paper
- **First layer** features are best for **detection** task
- **Last layer** features are best for **severity classification** task
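
To make the layer choice concrete, the sketch below selects one layer from a stack of per-layer hidden states and mean-pools it over time to obtain a single utterance-level vector. It uses random numbers in place of real model outputs. It also assumes the convention used by Hugging Face `transformers` when `output_hidden_states=True`: index 0 holds the input to the first transformer layer, and indices 1 to 12 hold the outputs of the 12 transformer layers of a BASE model. The helper name `pool_layer` is hypothetical and is not part of `utils.py`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for model outputs with shape (layers + 1, frames, hidden_size).
# A 2-second utterance at 16 kHz gives roughly 99 frames (20 ms stride).
num_layers, num_frames, hidden_size = 12, 99, 768
hidden_states = rng.standard_normal((num_layers + 1, num_frames, hidden_size))


def pool_layer(hidden_states: np.ndarray, layer: int) -> np.ndarray:
    """Mean-pool one layer's frame-level features into a single vector."""
    return hidden_states[layer].mean(axis=0)


first_layer_vec = pool_layer(hidden_states, layer=1)   # candidate for detection
last_layer_vec = pool_layer(hidden_states, layer=-1)   # candidate for severity

print(first_layer_vec.shape, last_layer_vec.shape)     # (768,) (768,)
```

Mean pooling discards timing information but yields a fixed-length vector regardless of utterance duration, which is what an SVM needs as input.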

### 4.1 MFCC Features

In [None]:
def demonstrate_mfcc_extraction(audio_signal, sr):
    """
    Demonstrate MFCC feature extraction.
    """
    config = MFCCConfig()
    
    # Extract frame-level MFCCs
    mfcc_frames = extract_mfcc(audio_signal, sr, config)
    
    # Extract utterance-level statistics
    mfcc_stats = extract_mfcc_stats(audio_signal, sr, config)
    
    print("MFCC Extraction:")
    print(f"  Frame-level shape: {mfcc_frames.shape} (frames x features)")
    print(f"  Utterance-level shape: {mfcc_stats.shape} (mean + std)")
    
    return mfcc_frames, mfcc_stats

# Generate example audio for demonstration
example_audio = np.random.randn(16000 * 2)  # 2 seconds of noise
mfcc_frames, mfcc_stats = demonstrate_mfcc_extraction(example_audio, 16000)

In [None]:
# Visualize MFCCs
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Plot MFCCs
im = axes[0].imshow(mfcc_frames.T, aspect='auto', origin='lower', cmap='viridis')
axes[0].set_xlabel('Frame')
axes[0].set_ylabel('MFCC Coefficient')
axes[0].set_title('MFCC Features Over Time')
plt.colorbar(im, ax=axes[0])

# Plot statistics as grouped bars (means can be negative, so stacking std on top would mislead)
n_features = len(mfcc_stats) // 2
x = np.arange(n_features)
width = 0.4
axes[1].bar(x - width / 2, mfcc_stats[:n_features], width, alpha=0.7, label='Mean')
axes[1].bar(x + width / 2, mfcc_stats[n_features:], width, alpha=0.7, label='Std')
axes[1].set_xlabel('Feature Index')
axes[1].set_ylabel('Value')
axes[1].set_title('MFCC Statistics (Mean + Std)')
axes[1].legend()

plt.tight_layout()
plt.show()

### 4.2 openSMILE Features

In [None]:
def demonstrate_opensmile_extraction(audio_signal, sr):
    """
    Demonstrate openSMILE feature extraction.
    """
    try:
        import opensmile
        
        config = OpenSMILEConfig()
        features = extract_opensmile_features(audio_signal, sr, config)
        
        print("openSMILE ComParE 2016:")
        print(f"  Feature dimension: {len(features)}")
        print(f"  Feature range: [{features.min():.4f}, {features.max():.4f}]")
        
        return features
        
    except ImportError:
        print("openSMILE not installed. Install with: pip install opensmile")
        return np.random.randn(6373)  # Synthetic features

opensmile_features = demonstrate_opensmile_extraction(example_audio, 16000)

### 4.3 eGeMAPS Features

In [None]:
def demonstrate_egemaps_extraction(audio_signal, sr):
    """
    Demonstrate eGeMAPS feature extraction.
    """
    try:
        import opensmile
        
        config = EGeMAPSConfig()
        features = extract_egemaps_features(audio_signal, sr, config)
        
        print("eGeMAPS v02:")
        print(f"  Feature dimension: {len(features)}")
        print(f"  Feature range: [{features.min():.4f}, {features.max():.4f}]")
        
        return features
        
    except ImportError:
        print("openSMILE not installed. Install with: pip install opensmile")
        return np.random.randn(88)  # Synthetic features

egemaps_features = demonstrate_egemaps_extraction(example_audio, 16000)

### 4.4 wav2vec2-BASE Features

In [None]:
def demonstrate_wav2vec2_extraction(audio_signal, sr, model_name="facebook/wav2vec2-base"):
    """
    Demonstrate wav2vec2 feature extraction.
    """
    try:
        extractor = PretrainedFeatureExtractor(model_name)
        
        # Extract from different layers
        first_layer = extractor.extract_features(audio_signal, sr, layer=1, pooling='mean')
        last_layer = extractor.extract_features(audio_signal, sr, layer=-1, pooling='mean')
        all_layers = extractor.extract_features(audio_signal, sr, return_all_layers=True)
        
        print(f"\n{model_name}:")
        print(f"  First layer features: {first_layer.shape}")
        print(f"  Last layer features: {last_layer.shape}")
        print(f"  Number of layers: {len(all_layers)}")
        
        return extractor, first_layer, last_layer
        
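    except Exception as e:
        print(f"Error loading {model_name}: {e}")
        print("Using synthetic features for demonstration")
        # Match the fallback dimension to the model: 1024 for LARGE, 768 for BASE
        dim = 1024 if "large" in model_name else 768
        return None, np.random.randn(dim), np.random.randn(dim)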

# Note: This requires downloading the model (may take a while on first run)
print("Loading wav2vec2-BASE model...")
wav2vec2_extractor, wav2vec2_first, wav2vec2_last = demonstrate_wav2vec2_extraction(
    example_audio, 16000, "facebook/wav2vec2-base"
)

### 4.5 wav2vec2-LARGE Features

In [None]:
# wav2vec2-LARGE (uncomment to run - requires more memory)
# print("Loading wav2vec2-LARGE model...")
# wav2vec2_large_extractor, wav2vec2_large_first, wav2vec2_large_last = demonstrate_wav2vec2_extraction(
#     example_audio, 16000, "facebook/wav2vec2-large"
# )

print("wav2vec2-LARGE:")
print("  Hidden size: 1024")
print("  Transformer layers: 24")
print("  (Skipped loading to save memory - uncomment above to run)")

### 4.6 HuBERT Features

In [None]:
def demonstrate_hubert_extraction(audio_signal, sr):
    """
    Demonstrate HuBERT feature extraction.
    """
    try:
        extractor = PretrainedFeatureExtractor("facebook/hubert-base-ls960")
        
        # Extract from different layers
        first_layer = extractor.extract_features(audio_signal, sr, layer=1, pooling='mean')
        last_layer = extractor.extract_features(audio_signal, sr, layer=-1, pooling='mean')
        
        print("\nHuBERT-BASE:")
        print(f"  First layer features: {first_layer.shape}")
        print(f"  Last layer features: {last_layer.shape}")
        
        return extractor, first_layer, last_layer
        
    except Exception as e:
        print(f"Error loading HuBERT: {e}")
        print("Using synthetic features for demonstration")
        return None, np.random.randn(768), np.random.randn(768)

print("Loading HuBERT model...")
hubert_extractor, hubert_first, hubert_last = demonstrate_hubert_extraction(example_audio, 16000)

## 5. Classification Models

The paper uses two classifiers:
1. **SVM (Support Vector Machine)** with RBF kernel and grid search for hyperparameter tuning
2. **CNN (Convolutional Neural Network)** with 3 convolutional layers

### 5.1 SVM Classifier

In [None]:
def demonstrate_svm_classifier():
    """
    Demonstrate SVM classifier with grid search.
    """
    # Create synthetic data
    X_train = np.random.randn(200, 768)
    y_train = np.random.randint(0, 2, 200)
    X_test = np.random.randn(50, 768)
    y_test = np.random.randint(0, 2, 50)
    
    # Initialize and train
    svm_config = SVMConfig()
    svm = SVMClassifier(svm_config)
    
    print("Training SVM with grid search...")
    svm.fit(X_train, y_train, tune_hyperparameters=True)
    
    print(f"\nBest parameters: {svm.best_params}")
    
    # Evaluate
    train_acc = svm.score(X_train, y_train)
    test_acc = svm.score(X_test, y_test)
    
    print(f"Train accuracy: {train_acc:.4f}")
    print(f"Test accuracy: {test_acc:.4f}")
    
    return svm

svm_demo = demonstrate_svm_classifier()

### 5.2 CNN Classifier

In [None]:
def demonstrate_cnn_classifier():
    """
    Demonstrate CNN classifier.
    """
    # Create synthetic data
    X_train = np.random.randn(200, 768)
    y_train = np.random.randint(0, 2, 200)
    X_val = np.random.randn(50, 768)
    y_val = np.random.randint(0, 2, 50)
    X_test = np.random.randn(50, 768)
    y_test = np.random.randint(0, 2, 50)
    
    # Configure CNN
    cnn_config = CNNConfig(
        epochs=20,  # Reduced for demo
        batch_size=32,
        learning_rate=0.001,
    )
    
    # Initialize and train
    cnn = CNNClassifier(
        input_dim=768,
        num_classes=2,
        config=cnn_config,
    )
    
    print("Training CNN...")
    cnn.fit(X_train, y_train, X_val, y_val, verbose=True)
    
    # Evaluate
    train_acc = cnn.score(X_train, y_train)
    test_acc = cnn.score(X_test, y_test)
    
    print(f"\nTrain accuracy: {train_acc:.4f}")
    print(f"Test accuracy: {test_acc:.4f}")
    
    return cnn

cnn_demo = demonstrate_cnn_classifier()

In [None]:
# Plot training history
fig = plot_training_history(cnn_demo.history, title="CNN Training History (Demo)")
plt.show()

## 6. Experiments

### Evaluation Protocol

Following the paper, we use **Leave-One-Speaker-Out (LOSO)** cross-validation:
- In each fold, one speaker is held out for testing
- All remaining speakers are used for training
- This ensures speaker-independent evaluation
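
The protocol above can be sketched in a few lines of plain Python. This is an illustration of the splitting logic only, not the implementation of `leave_one_speaker_out_cv` in `utils.py`; the generator name `leave_one_speaker_out_splits` and the speaker IDs are hypothetical.

```python
def leave_one_speaker_out_splits(speakers):
    """Yield (held_out_speaker, train_indices, test_indices) for each speaker."""
    for held_out in sorted(set(speakers)):
        test_idx = [i for i, s in enumerate(speakers) if s == held_out]
        train_idx = [i for i, s in enumerate(speakers) if s != held_out]
        yield held_out, train_idx, test_idx


# Toy example: six utterances from three hypothetical speakers.
speakers = ["S1", "S1", "S2", "S2", "S3", "S3"]

for held_out, train_idx, test_idx in leave_one_speaker_out_splits(speakers):
    print(f"test speaker {held_out}: train={train_idx} test={test_idx}")
```

The number of folds equals the number of speakers, and no utterance from the held-out speaker ever appears in the training indices.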

### Tasks
1. **Detection Task**: Binary classification (Dysarthric vs. Control)
2. **Severity Classification Task**: 4-class classification (Very Low, Low, Medium, High severity)

### 6.1 Detection Task

In [None]:
print("=" * 60)
print("DETECTION TASK: Dysarthric vs. Control")
print("=" * 60)

# Run detection experiments
detection_results = {}

# SVM with detection features
print("\n--- SVM Classifier ---")
svm_detection_results = leave_one_speaker_out_cv(
    detection_features,
    detection_labels,
    detection_speakers,
    classifier_type=ClassifierType.SVM,
    verbose=True,
)
detection_results['SVM'] = svm_detection_results

In [None]:
# CNN with detection features
print("\n--- CNN Classifier ---")
cnn_detection_results = leave_one_speaker_out_cv(
    detection_features,
    detection_labels,
    detection_speakers,
    classifier_type=ClassifierType.CNN,
    classifier_config=CNNConfig(epochs=30),
    verbose=True,
)
detection_results['CNN'] = cnn_detection_results

In [None]:
# Visualize detection results
print("\nDetection Task Results:")
print("-" * 40)

for classifier_name, results in detection_results.items():
    metrics = results['metrics']
    print(f"\n{classifier_name}:")
    print(f"  Accuracy: {metrics['accuracy']*100:.2f}%")
    print(f"  Precision: {metrics['precision']*100:.2f}%")
    print(f"  Recall: {metrics['recall']*100:.2f}%")
    print(f"  F1 Score: {metrics['f1_score']*100:.2f}%")

In [None]:
# Plot confusion matrices for detection
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for idx, (name, results) in enumerate(detection_results.items()):
    cm = results['confusion_matrix']
    sns.heatmap(
        cm, 
        annot=True, 
        fmt='d', 
        cmap='Blues',
        xticklabels=['Control', 'Dysarthric'],
        yticklabels=['Control', 'Dysarthric'],
        ax=axes[idx]
    )
    axes[idx].set_xlabel('Predicted')
    axes[idx].set_ylabel('True')
    axes[idx].set_title(f'{name} - Detection Confusion Matrix')

plt.tight_layout()
plt.show()

### 6.2 Severity Classification Task

In [None]:
print("=" * 60)
print("SEVERITY CLASSIFICATION TASK")
print("Classes: Very Low, Low, Medium, High severity")
print("=" * 60)

# Run severity classification experiments
severity_results = {}

# SVM with severity features
print("\n--- SVM Classifier ---")
svm_severity_results = leave_one_speaker_out_cv(
    severity_features,
    severity_labels,
    severity_speakers,
    classifier_type=ClassifierType.SVM,
    verbose=True,
)
severity_results['SVM'] = svm_severity_results

In [None]:
# CNN with severity features
print("\n--- CNN Classifier ---")
cnn_severity_results = leave_one_speaker_out_cv(
    severity_features,
    severity_labels,
    severity_speakers,
    classifier_type=ClassifierType.CNN,
    classifier_config=CNNConfig(epochs=50),
    verbose=True,
)
severity_results['CNN'] = cnn_severity_results

In [None]:
# Visualize severity results
print("\nSeverity Classification Results:")
print("-" * 40)

for classifier_name, results in severity_results.items():
    metrics = results['metrics']
    print(f"\n{classifier_name}:")
    print(f"  Accuracy: {metrics['accuracy']*100:.2f}%")
    print(f"  Precision: {metrics['precision']*100:.2f}%")
    print(f"  Recall: {metrics['recall']*100:.2f}%")
    print(f"  F1 Score: {metrics['f1_score']*100:.2f}%")

In [None]:
# Plot confusion matrices for severity
severity_labels_names = ['Very Low', 'Low', 'Medium', 'High']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for idx, (name, results) in enumerate(severity_results.items()):
    cm = results['confusion_matrix']
    sns.heatmap(
        cm, 
        annot=True, 
        fmt='d', 
        cmap='Blues',
        xticklabels=severity_labels_names,
        yticklabels=severity_labels_names,
        ax=axes[idx]
    )
    axes[idx].set_xlabel('Predicted')
    axes[idx].set_ylabel('True')
    axes[idx].set_title(f'{name} - Severity Confusion Matrix')

plt.tight_layout()
plt.show()

## 7. Results Analysis

In [None]:
# Compile all results
def compile_results(detection_results, severity_results):
    """
    Compile all results into a DataFrame.
    """
    rows = []
    
    for classifier, results in detection_results.items():
        rows.append({
            'Task': 'Detection',
            'Classifier': classifier,
            'Accuracy': results['metrics']['accuracy'] * 100,
            'Precision': results['metrics']['precision'] * 100,
            'Recall': results['metrics']['recall'] * 100,
            'F1 Score': results['metrics']['f1_score'] * 100,
        })
    
    for classifier, results in severity_results.items():
        rows.append({
            'Task': 'Severity',
            'Classifier': classifier,
            'Accuracy': results['metrics']['accuracy'] * 100,
            'Precision': results['metrics']['precision'] * 100,
            'Recall': results['metrics']['recall'] * 100,
            'F1 Score': results['metrics']['f1_score'] * 100,
        })
    
    return pd.DataFrame(rows)

results_df = compile_results(detection_results, severity_results)
print("\nResults Summary:")
print(results_df.to_string(index=False))

In [None]:
# Visualize results comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Detection results
detection_df = results_df[results_df['Task'] == 'Detection']
x = np.arange(len(detection_df))
axes[0].bar(x, detection_df['Accuracy'], color='steelblue', alpha=0.8)
axes[0].set_xticks(x)
axes[0].set_xticklabels(detection_df['Classifier'])
axes[0].set_ylabel('Accuracy (%)')
axes[0].set_title('Detection Task')
axes[0].set_ylim(0, 100)
for i, v in enumerate(detection_df['Accuracy']):
    axes[0].text(i, v + 1, f'{v:.1f}%', ha='center')

# Severity results
severity_df = results_df[results_df['Task'] == 'Severity']
x = np.arange(len(severity_df))
axes[1].bar(x, severity_df['Accuracy'], color='coral', alpha=0.8)
axes[1].set_xticks(x)
axes[1].set_xticklabels(severity_df['Classifier'])
axes[1].set_ylabel('Accuracy (%)')
axes[1].set_title('Severity Classification Task')
axes[1].set_ylim(0, 100)
for i, v in enumerate(severity_df['Accuracy']):
    axes[1].text(i, v + 1, f'{v:.1f}%', ha='center')

plt.tight_layout()
plt.show()

## 8. Comparison with Paper Results

The paper reports the following key findings:

### Detection Task (UA-Speech, SVM)
| Feature | Accuracy |
|---------|----------|
| MFCC | 95.24% |
| openSMILE | 97.14% |
| eGeMAPS | 96.19% |
| wav2vec2-BASE | 98.10% |
| wav2vec2-LARGE | 98.57% |
| **HuBERT** | **100.00%** |

### Severity Classification (UA-Speech, SVM)
| Feature | Accuracy |
|---------|----------|
| MFCC | 52.38% |
| openSMILE | 58.10% |
| eGeMAPS | 59.05% |
| wav2vec2-BASE | 64.76% |
| wav2vec2-LARGE | 66.67% |
| **HuBERT** | **69.51%** |

In [None]:
# Display paper results
print("Paper Results (from PAPER_RESULTS in config.py):")
print("=" * 60)

print("\nDetection Task - UA-Speech:")
print("-" * 40)
for classifier in ['SVM', 'CNN']:
    print(f"\n{classifier}:")
    for feature, acc in PAPER_RESULTS['detection']['UA-Speech'][classifier].items():
        print(f"  {feature}: {acc}%")

print("\n" + "=" * 60)
print("\nSeverity Classification - UA-Speech:")
print("-" * 40)
for classifier in ['SVM', 'CNN']:
    print(f"\n{classifier}:")
    for feature, acc in PAPER_RESULTS['severity']['UA-Speech'][classifier].items():
        print(f"  {feature}: {acc}%")

In [None]:
# Create comparison visualization
def plot_paper_results():
    """
    Visualize paper results.
    """
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    datasets = ['UA-Speech', 'TORGO']
    tasks = ['detection', 'severity']
    
    for row, task in enumerate(tasks):
        for col, dataset in enumerate(datasets):
            ax = axes[row, col]
            
            features = list(PAPER_RESULTS[task][dataset]['SVM'].keys())
            svm_accs = list(PAPER_RESULTS[task][dataset]['SVM'].values())
            cnn_accs = list(PAPER_RESULTS[task][dataset]['CNN'].values())
            
            x = np.arange(len(features))
            width = 0.35
            
            bars1 = ax.bar(x - width/2, svm_accs, width, label='SVM', color='steelblue')
            bars2 = ax.bar(x + width/2, cnn_accs, width, label='CNN', color='coral')
            
            ax.set_ylabel('Accuracy (%)')
            ax.set_title(f'{task.capitalize()} - {dataset}')
            ax.set_xticks(x)
            ax.set_xticklabels(features, rotation=45, ha='right')
            ax.legend()
            ax.set_ylim(0, 105)
            
            # Add value labels to both the SVM and CNN bars
            for bar in [*bars1, *bars2]:
                height = bar.get_height()
                ax.annotate(f'{height:.1f}',
                           xy=(bar.get_x() + bar.get_width() / 2, height),
                           xytext=(0, 3),
                           textcoords="offset points",
                           ha='center', va='bottom', fontsize=8)
    
    plt.tight_layout()
    return fig

fig = plot_paper_results()
plt.show()

## 9. Conclusions

### Key Findings from the Paper

1. **Pre-trained models outperform baseline features**: Features extracted from wav2vec2 and HuBERT consistently outperform traditional acoustic features (MFCCs, openSMILE, eGeMAPS).

2. **HuBERT performs best**: Among the three pre-trained models tested, HuBERT features achieve the highest accuracy for both detection and severity classification tasks.

3. **Layer selection matters**:
   - **First transformer layer** features are optimal for **detection** (binary classification)
   - **Last transformer layer** features are optimal for **severity classification** (multi-class)

4. **SVM vs CNN**: SVM classifier slightly outperforms CNN in most experiments, possibly due to the limited size of the datasets.

### Improvements over Baselines

**Detection Task (HuBERT vs best baseline):**
- UA-Speech: +2.86% absolute improvement over openSMILE
- TORGO: +1.33% absolute improvement over openSMILE

**Severity Classification (HuBERT vs best baseline):**
- UA-Speech: +10.46% absolute improvement over eGeMAPS
- TORGO: +6.54% absolute improvement over eGeMAPS
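
As a sanity check, the UA-Speech improvements can be recomputed from the SVM accuracies tabulated in Section 8. The sketch below copies those values by hand; the function name `gain_over_best_baseline` is hypothetical. The TORGO figures cannot be checked this way because the corresponding tables are not reproduced in this notebook.

```python
# Accuracies (%) copied from the UA-Speech SVM tables in Section 8.
detection = {"MFCC": 95.24, "openSMILE": 97.14, "eGeMAPS": 96.19, "HuBERT": 100.00}
severity = {"MFCC": 52.38, "openSMILE": 58.10, "eGeMAPS": 59.05, "HuBERT": 69.51}
BASELINES = ("MFCC", "openSMILE", "eGeMAPS")


def gain_over_best_baseline(accuracies, model="HuBERT"):
    """Return (best_baseline_name, gain in percentage points) for `model`."""
    best = max(BASELINES, key=lambda name: accuracies[name])
    return best, round(accuracies[model] - accuracies[best], 2)


print(gain_over_best_baseline(detection))  # ('openSMILE', 2.86)
print(gain_over_best_baseline(severity))   # ('eGeMAPS', 10.46)
```

The gains are differences in percentage points, not relative improvements.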

### Practical Implications

- Pre-trained speech models provide powerful features for clinical speech assessment
- No fine-tuning is required - features can be extracted directly from frozen models
- The approach enables non-invasive diagnosis support for dysarthria patients

In [None]:
# Save results
output_dir = Path("./results")
output_dir.mkdir(exist_ok=True)

# Save as CSV
results_df.to_csv(output_dir / "experiment_results.csv", index=False)

# Save detailed results as JSON
all_results = {
    'detection': {k: v['metrics'] for k, v in detection_results.items()},
    'severity': {k: v['metrics'] for k, v in severity_results.items()},
}
save_results(all_results, output_dir / "detailed_results.json")

print(f"Results saved to {output_dir}")

---

## References

1. Javanmardi, F., Kadiri, S. R., & Alku, P. (2024). Pre-trained models for detection and severity level classification of dysarthria from speech. *Speech Communication*, 158, 103047.

2. Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. *NeurIPS*.

3. Hsu, W. N., Bolte, B., Tsai, Y. H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. *IEEE/ACM TASLP*.

4. Kim, H., Hasegawa-Johnson, M., Perlman, A., Gunderson, J., Huang, T. S., Watkin, K., & Frame, S. (2008). Dysarthric speech database for universal access research. *Interspeech*.

5. Rudzicz, F., Namasivayam, A. K., & Wolff, T. (2012). The TORGO database of acoustic and articulatory speech from speakers with dysarthria. *Language Resources and Evaluation*.

---

*This notebook implements the paper: "Pre-trained Models for Detection and Severity Level Classification of Dysarthria from Speech" by Javanmardi et al. (2024)*