This repository contains code for the paper:
"STE-GAN: Speech-to-Electromyography Signal Conversion using Generative Adversarial Networks" Accepted at INTERSPEECH 2023
The Speech-to-Electromyography GAN (STE-GAN) is a neural network which predicts Electromyography (EMG) signals of articulatory muscles given acoustic speech inputs. We base STE-GAN on the CAR-GAN and HiFi-GAN vocoders. As such, this repository uses the cargan repository as basis.
The STE-GAN generator takes acoustic speech features as input and outputs EMG signals at 800 Hz. In particular, the proposed model uses Soft Speech Units (Soft SUs) of Van Niekerk et al. for the acoustic features. This improves the robustness of the model to speech inputs of unseen speakers. Additionally, the generator networks uses learnable session embeddings to account for differences in EMG recording sessions.
Install the required packages. We use conda to setup a Python 3.10 environment named ste-gan
.
conda create -n ste-gan python=3.10
conda activate ste-gan
pip install -r requirements.txt
pip install -e .
Our experiments use the data set of David Gaddy and Dan Klein, and an internally recorded EMG session. Because we cannot make the second data set public, we describe the setup for the Gaddy and Klein data set in this section.
Download the original data set of and the forced alignments.
We download the files to the raw_data/
directory per default.
# Create the raw_data directory
mkdir raw_data; cd raw_data
# Download and extract the original data set of Gaddy and Klein
wget -O emg_data.tar.gz https://zenodo.org/record/4064409/files/emg_data.tar.gz?download=
tar -xvf emg_data.tar.gz
# Download and extract the forced alignment files
wget https://github.com/dgaddy/silent_speech_alignments/raw/main/text_alignments.tar.gz
tar -xvf text_alignments.tar.gz
# Download the data set split file
wget https://raw.githubusercontent.com/dgaddy/silent_speech/main/testset_largedev.json
We now pre-process the raw data in the format of the STE-GAN repository. We first clean the audio of the corpus using MetricGANPlus. In the project root, run the following script:
python scripts/clean_audio.py raw_data/emg_data/nonparallel_data/* raw_data/emg_data/silent_parallel_data/* raw_data/emg_data/voiced_parallel_data/*
We now extract all features and setup the overall data directory for the STE-GAN project.
The default directory for the post-processed data is data/
.
python scripts/prep_data_gaddy_and_klein.py
The resulting data directory should have the following structure.
data/gaddy_complete
+- train
+- audio
- {sess_id}__{utt_idx}__{speaking_mode}.wav
...
+- emg
- {sess_id}__{utt_idx}__{speaking_mode}.pt
+- emg_feats
- {sess_id}__{utt_idx}__{speaking_mode}.pt
+- mfccs
- {sess_id}__{utt_idx}__{speaking_mode}.pt
+- units
- {sess_id}__{utt_idx}__{speaking_mode}.pt
+- phonemes
- {sess_id}__{utt_idx}__{speaking_mode}.pt
+- transcriptions
- {sess_id}__{utt_idx}__{speaking_mode}.txt
+- valid
+- test
where sess_id
is an identifier for the EMG recording session, utt_idx
is the utterance index within that session,
and speaking_mode
is either normal
for audible speaking, otherwise silent
.
For STE-GAN and the EMG encoder training, we only use normal
utterances and ignore silent
EMG, as the audio is not synchronous with EMG.
The silent
EMG samples are filtered in the EMG data set class used for model training.
We now train the EMG encoder which predicts Soft Speech Units from EMG signals. We will use the trained EMG encoder for training the STE-GAN models.
To train the EMG encoder, run the following script:
python ste_gan/emg_encoder/train.py --exp_dir=exp/emg_encoder
This will train the EMG encoder on the vocalized splits of the Gaddy and Klein corpus.
You can find the best validation loss model in exp/emg_encoder/EMGEncoderTransformer_voiced_only__seq_len__200__data_gaddy_complete/best_val_loss_model.pt
.
We now train the STE-GAN model using the pre-trained EMG encoder.
Simply run the following command
python ste_gan/train.py
This will train the model using default settings using the pre-trained emg_encoder in exp/emg_encoder
.
You can change the default training configuration, used EMG encoder, and other hyperparameters via flags.
You can also override certain hyper-parameters in the base configuration file by setting the command-line arguments.
python ste_gan/train.py --help
The list of command line arguments is:
usage: train.py [-h] [--config CONFIG] [--data DATA] [--emg_enc_cfg EMG_ENC_CFG] [--emg_enc_ckpt EMG_ENC_CKPT] [--checkpoint CHECKPOINT] [--continue_run] [--debug] [--weight_su WEIGHT_SU] [--weight_phoneme WEIGHT_PHONEME]
[--weight_td WEIGHT_TD] [--weight_feat_match WEIGHT_FEAT_MATCH] [--speech_feature_type SPEECH_FEATURE_TYPE] [--chunk_size CHUNK_SIZE] [--batch_size BATCH_SIZE] [--max_steps MAX_STEPS]
options:
-h, --help show this help message and exit
--config CONFIG The main training configuration for this run.
--data DATA A path to a data configuration file.
--emg_enc_cfg EMG_ENC_CFG
A path to an EMG encoder configuration file.
--emg_enc_ckpt EMG_ENC_CKPT
A path to a checkpooint of a pre-trained EMG encoder. Must correspond to the EMG encoder configuration in 'emg_enc_cfg'.
--checkpoint CHECKPOINT
Optional checkpoint to start training from
--continue_run Whether to continue training
--debug Whether to run the training script in debug mode.
--weight_su WEIGHT_SU
Weight of the speech unit loss of the EMG encoder (a value smaller 0.0 means this setting is ignored)
--weight_phoneme WEIGHT_PHONEME
Weight of the phoneme loss of the EMG encoder. (a value smaller than 0.0 means this argument is ignored)
--weight_td WEIGHT_TD
Weight of the the Multi-Time-Domain loss. (a value smaller than 0.0 means this argument is ignored)
--weight_feat_match WEIGHT_FEAT_MATCH
Weight of the Feature matching loss. (a value smaller than 0.0 means this argument is ignored)
--speech_feature_type SPEECH_FEATURE_TYPE
A DataType which denotes the speech feature used as EMG generator input. Leave blank to use the speech feature in the training configuration.
--chunk_size CHUNK_SIZE
The number of EMG samples used for training. (a value smaller than 0.0 means this argument is ignored)
--batch_size BATCH_SIZE
The batch size used for training. (a value smaller than 0.0 means this argument is ignored)
--max_steps MAX_STEPS
The maximum number of training steps. (a value smaller than 0.0 means this argument is ignored)
If you found this repository useful for your research, please consider citing our paper:
@inproceedings{STE-GAN,
author={Kevin Scheck and Tanja Schultz},
title={{STE-GAN: Speech-to-Electromyography Signal Conversion using Generative Adversarial Networks}},
year=2023,
booktitle={Accepted at Interspeech 2023},
pages={1--5}
}
- We use the CarGAN repository as basis for this project.
- We use modified code of the silent_speech repository to train the EMG encoder.
- We use the models of the Soft-VC repository to extract Soft Speech units from acoustic speech and to synthesize speech from predicted soft speech units
- We use the MetricGAN-plus model to clean the audio recordings of the EMG data sets
- This project was also inspired by the Articulation GAN model of Beguš et al. (2022) and the EMG-GAN of Atzori et al.