MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase Reconstruction
Mohammed Salah Al-Radhi1, Géza Németh1, Branislav Gerazov2
1Department of Telecommunications and Artificial Intelligence, Budapest University of Technology and Economics, Budapest, Hungary
2Faculty of Electrical Engineering and Information Technologies, University of Ss. Cyril and Methodius, Skopje, Macedonia
🧠 [2025.05.29] We are releasing the full code to support research on MiSTR multi-modal models.
🥳 [2025.05.19, 11:54 AM CET] Our paper has been accepted to Interspeech 2025. Looking forward to seeing you in Rotterdam, The Netherlands at the conference!
- Multi-Modal iEEG Feature Encoding: MiSTR introduces a wavelet-based encoder combined with prosody-aware features (pitch, energy, shimmer, duration, phase variability) to model the neural dynamics of speech production.
- Transformer-Based Prosody Decoder: A novel Transformer architecture captures long-range dependencies in brain activity to predict expressive and fluent Mel spectrograms aligned with speech prosody.
- Neural Phase Vocoder (IHPR): MiSTR proposes Iterative Harmonic Phase Reconstruction (IHPR), ensuring phase continuity and harmonic consistency for high-fidelity audio synthesis without vocoder artifacts.
- State-of-the-Art Performance: Achieves a Pearson correlation of 0.91, STOI of 0.73, and MOSA score of 3.38, outperforming all existing baselines in iEEG-to-speech synthesis.
- Clinically Inspired Design: Designed with speech neuroprosthetics in mind, MiSTR offers a scalable, robust pipeline for restoring natural speech in individuals with severe communication impairments.
- Code and Samples Available: Full implementation, pretrained models, and inference samples are provided in this GitHub repository to support reproducibility and further research.
git clone https://github.com/malradhi/mistr.git
cd mistr
# Recommended: Create a clean environment
conda create -n mistr python=3.10
conda activate mistr
# Install required Python packages
pip install -r requirements.txt
💡 Note: Requires PyTorch ≥ 2.0 and CUDA-compatible GPU for best performance.
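You can quickly verify that PyTorch can see your GPU before running the pipeline. This is an optional sanity check, not part of the repository:

```python
# Optional sanity check: confirm PyTorch is installed and a CUDA device is visible.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```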
python neural_signal_encoder.py
This will generate the following files for each participant (e.g., sub-XX):
- `*_feat.npy`: Wavelet + prosody features extracted from iEEG
- `*_spec.npy`: Ground-truth Mel spectrogram from the original audio
- `*_prosody.npy`: Extracted prosody features (pitch, energy, shimmer, duration, phase variability)
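If you want to inspect the extracted features before training, a minimal sketch is shown below. The `./features/` location follows the repository layout; the participant ID `sub-01` is a hypothetical example and the array shapes depend on the recording and feature settings:

```python
# Minimal sketch: load and inspect the files produced by neural_signal_encoder.py.
# The participant ID "sub-01" is a hypothetical example; shapes depend on the recording
# and the feature-extraction settings.
import numpy as np

participant = "sub-01"
feat = np.load(f"./features/{participant}_feat.npy")        # wavelet + prosody features
spec = np.load(f"./features/{participant}_spec.npy")        # ground-truth Mel spectrogram
prosody = np.load(f"./features/{participant}_prosody.npy")  # pitch, energy, shimmer, duration, phase variability

print("iEEG features:   ", feat.shape)
print("Mel spectrogram: ", spec.shape)
print("Prosody features:", prosody.shape)
```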
python spectrogram_mapper_transformer.py
This script will:
- Train a neural autoencoder to compress iEEG features into a compact latent space
- Use a Transformer to predict Mel spectrograms from the latent iEEG representations
- Generate audio waveforms from predicted and ground-truth spectrograms
- Save predicted spectrograms as `*_predicted_spec.npy`
- Save synthesized audio files:
  - `*_orig_synthesized.wav` (from the original spectrogram)
  - `*_predicted.wav` (from the predicted spectrogram)
- Save evaluation results in `temporal_attention_results.npy`
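For intuition about the mapping stage, here is a minimal, illustrative sketch of a Transformer encoder that maps latent iEEG feature frames to Mel spectrogram frames. It is not the repository's architecture; the latent dimension, the 80-bin Mel output, the model width, and the depth are all assumptions, and positional encoding is omitted for brevity:

```python
# Illustrative sketch only: a small Transformer that maps latent iEEG frames to Mel frames.
# All dimensions below are assumptions, not the repository's settings.
import torch
import torch.nn as nn

class LatentToMelTransformer(nn.Module):
    def __init__(self, latent_dim=128, n_mels=80, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.input_proj = nn.Linear(latent_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.output_proj = nn.Linear(d_model, n_mels)

    def forward(self, latents):            # latents: (batch, time, latent_dim)
        x = self.input_proj(latents)       # (batch, time, d_model)
        x = self.encoder(x)                # self-attention over the time axis
        return self.output_proj(x)         # (batch, time, n_mels)

# Example: predict Mel frames for a batch of latent iEEG sequences.
model = LatentToMelTransformer()
latents = torch.randn(2, 400, 128)         # 2 sequences, 400 frames, 128-dim latents
mel_pred = model(latents)                  # shape: (2, 400, 80)
```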
python harmonic_phase_reconstructor.py
This script applies Iterative Harmonic Phase Reconstruction (IHPR) to refine phase and improve audio quality.
It will:
- Load predicted spectrograms (`*_predicted_spec.npy`)
- Apply harmonic-consistent phase reconstruction
- Save high-fidelity audio waveforms for each participant:
  - `*_predicted.wav` (updated)
  - `*_orig_synthesized.wav` (if regenerated)
- Output `.wav` files in the `/results/` directory
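To convey the idea behind iterative phase refinement, the sketch below runs a plain Griffin-Lim-style loop on a linear magnitude spectrogram recovered from a predicted Mel spectrogram. It is not IHPR itself (IHPR additionally enforces harmonic consistency), and the sampling rate, STFT parameters, file paths, and Mel scaling are assumptions:

```python
# Illustrative sketch: Griffin-Lim-style iterative phase refinement.
# NOT the IHPR algorithm (which adds harmonic-consistency constraints).
# sr, n_fft, hop_length, and the file paths are assumed values.
import numpy as np
import librosa
import soundfile as sf

def iterative_phase_refinement(mag, n_fft=1024, hop_length=256, n_iter=60):
    """Alternate ISTFT/STFT while keeping the target magnitude each round."""
    rng = np.random.default_rng(0)
    stft = mag * np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        audio = librosa.istft(stft, hop_length=hop_length)
        rebuilt = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)
        stft = mag * np.exp(1j * np.angle(rebuilt))
    return librosa.istft(stft, hop_length=hop_length)

sr = 16000                                             # assumed sampling rate
mel = np.load("./results/sub-01_predicted_spec.npy")   # hypothetical path and ID
# If the spectrogram is stored as (frames, n_mels) or in log scale, transpose /
# exponentiate it first; mel_to_stft expects an (n_mels, frames) power spectrogram.
mag = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=1024)
audio = iterative_phase_refinement(mag)
sf.write("./results/sub-01_predicted.wav", audio, sr)
```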
python neural_output_visualizer.py
This script generates high-resolution plots and visualizations:
- `results.png`: Participant-wise correlation scores
- `spec_example.png` and `.pdf`: Ground-truth vs. predicted spectrograms
- `wav_example.png` and `.pdf`: Waveform comparison of original vs. reconstructed audio
- `*_prosody_visualization.png`: Plots of extracted prosody features (if available)

All visual outputs are saved in the `/results/` directory.
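As a minimal example of the kind of figure this step produces, the sketch below plots a ground-truth and a predicted Mel spectrogram side by side. The file paths and participant ID follow the patterns above but are assumptions:

```python
# Minimal sketch: ground-truth vs. predicted Mel spectrogram comparison figure.
# Paths and the participant ID are assumptions following the file patterns above.
import numpy as np
import matplotlib.pyplot as plt

participant = "sub-01"  # hypothetical example ID
true_spec = np.load(f"./features/{participant}_spec.npy")
pred_spec = np.load(f"./results/{participant}_predicted_spec.npy")

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, spec, title in zip(axes, (true_spec, pred_spec), ("Ground truth", "Predicted")):
    ax.imshow(spec, aspect="auto", origin="lower")  # transpose if stored as (frames, n_mels)
    ax.set_title(title)
    ax.set_xlabel("Frame")
axes[0].set_ylabel("Mel bin")
fig.tight_layout()
fig.savefig("./results/spec_example.png", dpi=300)
```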
./features/                          # Extracted features and prosody files
./results/                           # Output spectrograms, waveforms, and plots
./neural_signal_encoder.py           # Step 1: iEEG feature and prosody extraction
./spectrogram_mapper_transformer.py  # Step 2: Transformer-based spectrogram prediction
./harmonic_phase_reconstructor.py    # Step 3: IHPR phase refinement
./neural_output_visualizer.py        # Step 4: Visualization of results
This work is supported by the European Union’s HORIZON Research and Innovation Programme under grant agreement No. 101120657, project ENFIELD (European Lighthouse to Manifest Trustworthy and Green AI), and by the Ministry of Innovation and Culture and the National Research, Development and Innovation Office of Hungary within the framework of the National Laboratory of Artificial Intelligence.
M.S. Al-Radhi’s research was supported by the project EKÖP-24-4-II-BME-197, through the National Research, Development and Innovation (NKFI) Fund.
We’ve released our code under the MIT License to support open research. If you use it in your work, please consider citing us:
@inproceedings{alradhi2025mistr,
title = {MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase Reconstruction},
author = {Mohammed Salah Al-Radhi and G{\'e}za N{\'e}meth and Branislav Gerazov},
booktitle = {Proceedings of Interspeech 2025},
year = {2025},
address = {Rotterdam, The Netherlands}
}