Skip to content

STE-GAN: Speech-to-Electromyography Signal Conversion using Generative Adversarial Networks

Notifications You must be signed in to change notification settings

scheck-k/ste-gan

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

STE-GAN: Speech-to-EMG Generative Adverserial Network

This repository contains code for the paper:

"STE-GAN: Speech-to-Electromyography Signal Conversion using Generative Adversarial Networks" Accepted at INTERSPEECH 2023

The Speech-to-Electromyography GAN (STE-GAN) is a neural network which predicts Electromyography (EMG) signals of articulatory muscles given acoustic speech inputs. We base STE-GAN on the CAR-GAN and HiFi-GAN vocoders. As such, this repository uses the cargan repository as basis.

The STE-GAN generator takes acoustic speech features as input and outputs EMG signals at 800 Hz. In particular, the proposed model uses Soft Speech Units (Soft SUs) of Van Niekerk et al. for the acoustic features. This improves the robustness of the model to speech inputs of unseen speakers. Additionally, the generator networks uses learnable session embeddings to account for differences in EMG recording sessions.

Setup

Install the required packages. We use conda to setup a Python 3.10 environment named ste-gan.

conda create -n ste-gan python=3.10
conda activate ste-gan
pip install -r requirements.txt
pip install -e .

Data Setup

Our experiments use the data set of David Gaddy and Dan Klein, and an internally recorded EMG session. Because we cannot make the second data set public, we describe the setup for the Gaddy and Klein data set in this section.

1. Download the original data set

Download the original data set of and the forced alignments. We download the files to the raw_data/ directory per default.

# Create the raw_data directory
mkdir raw_data; cd raw_data

# Download and extract the original data set of Gaddy and Klein
wget -O emg_data.tar.gz https://zenodo.org/record/4064409/files/emg_data.tar.gz?download=
tar -xvf emg_data.tar.gz

# Download and extract the forced alignment files
wget https://github.com/dgaddy/silent_speech_alignments/raw/main/text_alignments.tar.gz
tar -xvf text_alignments.tar.gz

# Download the data set split file
wget https://raw.githubusercontent.com/dgaddy/silent_speech/main/testset_largedev.json

2. Run the data preprocessing scripts

We now pre-process the raw data in the format of the STE-GAN repository. We first clean the audio of the corpus using MetricGANPlus. In the project root, run the following script:

python scripts/clean_audio.py raw_data/emg_data/nonparallel_data/* raw_data/emg_data/silent_parallel_data/* raw_data/emg_data/voiced_parallel_data/*

We now extract all features and setup the overall data directory for the STE-GAN project. The default directory for the post-processed data is data/.

python scripts/prep_data_gaddy_and_klein.py

The resulting data directory should have the following structure.

data/gaddy_complete
 +- train
   +- audio
    - {sess_id}__{utt_idx}__{speaking_mode}.wav
    ...
   +- emg
    - {sess_id}__{utt_idx}__{speaking_mode}.pt
   +- emg_feats
    - {sess_id}__{utt_idx}__{speaking_mode}.pt
  +- mfccs
    - {sess_id}__{utt_idx}__{speaking_mode}.pt
   +- units
    - {sess_id}__{utt_idx}__{speaking_mode}.pt
   +- phonemes
    - {sess_id}__{utt_idx}__{speaking_mode}.pt
   +- transcriptions
    - {sess_id}__{utt_idx}__{speaking_mode}.txt
 +- valid
 +- test

where sess_id is an identifier for the EMG recording session, utt_idx is the utterance index within that session, and speaking_mode is either normal for audible speaking, otherwise silent. For STE-GAN and the EMG encoder training, we only use normal utterances and ignore silent EMG, as the audio is not synchronous with EMG. The silent EMG samples are filtered in the EMG data set class used for model training.

3. EMG encoder training

We now train the EMG encoder which predicts Soft Speech Units from EMG signals. We will use the trained EMG encoder for training the STE-GAN models.

To train the EMG encoder, run the following script:

python ste_gan/emg_encoder/train.py --exp_dir=exp/emg_encoder

This will train the EMG encoder on the vocalized splits of the Gaddy and Klein corpus. You can find the best validation loss model in exp/emg_encoder/EMGEncoderTransformer_voiced_only__seq_len__200__data_gaddy_complete/best_val_loss_model.pt.

4. STE-GAN training

We now train the STE-GAN model using the pre-trained EMG encoder.

Simply run the following command

python ste_gan/train.py

This will train the model using default settings using the pre-trained emg_encoder in exp/emg_encoder. You can change the default training configuration, used EMG encoder, and other hyperparameters via flags. You can also override certain hyper-parameters in the base configuration file by setting the command-line arguments.

python ste_gan/train.py --help

The list of command line arguments is:

usage: train.py [-h] [--config CONFIG] [--data DATA] [--emg_enc_cfg EMG_ENC_CFG] [--emg_enc_ckpt EMG_ENC_CKPT] [--checkpoint CHECKPOINT] [--continue_run] [--debug] [--weight_su WEIGHT_SU] [--weight_phoneme WEIGHT_PHONEME]
                [--weight_td WEIGHT_TD] [--weight_feat_match WEIGHT_FEAT_MATCH] [--speech_feature_type SPEECH_FEATURE_TYPE] [--chunk_size CHUNK_SIZE] [--batch_size BATCH_SIZE] [--max_steps MAX_STEPS]

options:
  -h, --help            show this help message and exit
  --config CONFIG       The main training configuration for this run.
  --data DATA           A path to a data configuration file.
  --emg_enc_cfg EMG_ENC_CFG
                        A path to an EMG encoder configuration file.
  --emg_enc_ckpt EMG_ENC_CKPT
                        A path to a checkpooint of a pre-trained EMG encoder. Must correspond to the EMG encoder configuration in 'emg_enc_cfg'.
  --checkpoint CHECKPOINT
                        Optional checkpoint to start training from
  --continue_run        Whether to continue training
  --debug               Whether to run the training script in debug mode.
  --weight_su WEIGHT_SU
                        Weight of the speech unit loss of the EMG encoder (a value smaller 0.0 means this setting is ignored)
  --weight_phoneme WEIGHT_PHONEME
                        Weight of the phoneme loss of the EMG encoder. (a value smaller than 0.0 means this argument is ignored)
  --weight_td WEIGHT_TD
                        Weight of the the Multi-Time-Domain loss. (a value smaller than 0.0 means this argument is ignored)
  --weight_feat_match WEIGHT_FEAT_MATCH
                        Weight of the Feature matching loss. (a value smaller than 0.0 means this argument is ignored)
  --speech_feature_type SPEECH_FEATURE_TYPE
                        A DataType which denotes the speech feature used as EMG generator input. Leave blank to use the speech feature in the training configuration.
  --chunk_size CHUNK_SIZE
                        The number of EMG samples used for training. (a value smaller than 0.0 means this argument is ignored)
  --batch_size BATCH_SIZE
                        The batch size used for training. (a value smaller than 0.0 means this argument is ignored)
  --max_steps MAX_STEPS
                        The maximum number of training steps. (a value smaller than 0.0 means this argument is ignored)

Paper

If you found this repository useful for your research, please consider citing our paper:

@inproceedings{STE-GAN,
  author={Kevin Scheck and Tanja Schultz},
  title={{STE-GAN: Speech-to-Electromyography Signal Conversion using Generative Adversarial Networks}},
  year=2023,
  booktitle={Accepted at Interspeech 2023},
  pages={1--5}
}

Acknowledgements

  • We use the CarGAN repository as basis for this project.
  • We use modified code of the silent_speech repository to train the EMG encoder.
  • We use the models of the Soft-VC repository to extract Soft Speech units from acoustic speech and to synthesize speech from predicted soft speech units
  • We use the MetricGAN-plus model to clean the audio recordings of the EMG data sets
  • This project was also inspired by the Articulation GAN model of Beguš et al. (2022) and the EMG-GAN of Atzori et al.

About

STE-GAN: Speech-to-Electromyography Signal Conversion using Generative Adversarial Networks

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages