ECAPA-TDNN-for-Depression

Code for the paper "ECAPA-TDNN Based Depression Detection from Clinical Speech".

Environments

The experimental environment is listed below. Other package versions may cause compatibility problems.

python==3.7
torch==1.9.0
torchaudio==0.9.0
pandas==1.3.0
numpy==1.20.3
matplotlib==3.4.2
librosa==0.8.1
scikit-learn==1.0.2

DataSets

Our corpus was recorded during HAMD (Hamilton Depression Rating Scale) interviews and contains audio only. It is described in detail in our article.

The corpus used in this paper is not publicly available at the moment. This may change in the future as we continue to collect data and expand the existing dataset.

Features

MFCCs are used as the input features for the neural network. Before feature extraction, the raw speech data is processed in the following steps:

First, the sound channels are separated. During data collection, the speakers were recorded on different channels: the doctor's voice in the left channel and the subject's voice in the right channel.
Next, noise is separated from speech using Voice Activity Detection (VAD). The simplest double-threshold method is used.
After these steps, we obtain the subjects' speech in recordings of varying length. The speech is cut into three-second segments with 50% overlap, and MFCC features are then extracted from each segment.
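
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in proprecess.py: the function names, the 25 ms non-overlapping frames, the 16 kHz sample rate and the relative thresholds are all assumptions made for the example, and the VAD is an energy-only variant of the double-threshold idea (no zero-crossing stage).

```python
import numpy as np


def split_channels(stereo):
    """Split a (n_samples, 2) array into (doctor, subject): left = doctor, right = subject."""
    return stereo[:, 0], stereo[:, 1]


def frame_energy(signal, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(frames ** 2, axis=1)


def double_threshold_vad(energy, low, high):
    """Boolean speech mask per frame.

    A run of consecutive frames above `low` is kept only if at least
    one frame in that run also exceeds `high`.
    """
    above_low = energy > low
    mask = np.zeros(len(energy), dtype=bool)
    start = None
    # A trailing False closes a run that reaches the end of the signal.
    for i, flag in enumerate(np.append(above_low, False)):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if np.any(energy[start:i] > high):
                mask[start:i] = True
            start = None
    return mask


def keep_voiced(signal, mask, frame_len):
    """Concatenate the samples of all frames marked as speech."""
    n_frames = len(mask)
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return frames[mask].reshape(-1)


def segment(signal, sr, win_s=3.0, overlap=0.5):
    """Cut a 1-D signal into fixed-length windows with fractional overlap."""
    win = int(round(win_s * sr))
    hop = int(round(win * (1.0 - overlap)))
    if len(signal) < win:
        return np.empty((0, win))
    starts = range(0, len(signal) - win + 1, hop)
    return np.stack([signal[s:s + win] for s in starts])


# Synthetic demo: 12 s of stereo audio, subject "speaks" from 2 s to 10 s.
sr = 16000                      # assumed sample rate
frame_len = 400                 # 25 ms frames (assumed)
t = np.arange(12 * sr) / sr
rng = np.random.default_rng(0)
subject = 0.001 * rng.standard_normal(len(t))
subject[2 * sr:10 * sr] += 0.5 * np.sin(2 * np.pi * 220 * t[2 * sr:10 * sr])
stereo = np.stack([np.zeros(len(t)), subject], axis=1)

_, subject_channel = split_channels(stereo)
energy = frame_energy(subject_channel, frame_len)
# Thresholds relative to the peak energy are illustrative values only.
mask = double_threshold_vad(energy, low=0.02 * energy.max(), high=0.1 * energy.max())
voiced = keep_voiced(subject_channel, mask, frame_len)
segments = segment(voiced, sr)  # one row per 3 s window
```

Each row of `segments` could then be passed to an MFCC extractor such as `librosa.feature.mfcc(y=row, sr=sr, n_mfcc=...)`; the number of coefficients is not stated here, so it is left open.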

Models

We use ECAPA-TDNN for depression classification in this paper because of its excellent performance in various tasks.

ECAPA-TDNN was adapted from speaker recognition to speech classification; in a sense, it is a simplified version.
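
This README does not spell out what was simplified. As a rough illustration only, the sketch below assumes that the adaptation keeps ECAPA-TDNN's attentive statistics pooling and feeds the pooled vector to a plain softmax classifier instead of a speaker-verification objective. That assumption is not confirmed by this repository. The sketch is written in NumPy rather than PyTorch to keep it framework-free, and the attention scores, which a small learned layer would produce in the real model, are passed in as an argument.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def attentive_stats_pooling(frames, scores):
    """Pool (channels, time) features into a fixed-length vector.

    `scores` holds unnormalized attention values of the same shape.
    They are normalized over time per channel, then used to compute a
    weighted mean and weighted standard deviation for each channel.
    """
    w = softmax(scores, axis=1)
    mean = np.sum(w * frames, axis=1)
    var = np.sum(w * frames ** 2, axis=1) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-9, None))
    return np.concatenate([mean, std])  # shape: (2 * channels,)


def classify(pooled, weight, bias):
    """Hypothetical linear + softmax head over the pooled vector."""
    return softmax(weight @ pooled + bias, axis=0)


# Inputs of different lengths map to vectors of the same size.
rng = np.random.default_rng(0)
short = attentive_stats_pooling(rng.standard_normal((4, 50)), np.zeros((4, 50)))
long = attentive_stats_pooling(rng.standard_normal((4, 300)), np.zeros((4, 300)))
probs = classify(short, rng.standard_normal((2, 8)), np.zeros(2))  # 2 classes assumed
```

With all-zero scores the attention weights are uniform, so the pooling reduces to the ordinary per-channel mean and standard deviation.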

For more details about the model, please refer to "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification".

ECAPA-TDNN has been shown to perform well on classification tasks, and we use it here simply as a classifier. However, there are also some interesting findings. In our experiments, we also extracted embeddings from the model and used t-SNE to visualize the relationship between embeddings and speakers. We found that the speech embeddings of a single speaker always formed a cluster, and that those from depressed speech were more dispersed. This remains to be verified further.
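
Because t-SNE does not preserve distances or cluster sizes faithfully, one possible way to check the dispersion observation is to measure spread in the original embedding space. The sketch below is a suggestion, not part of this repository: `per_speaker_dispersion` is a hypothetical helper that reports, for each speaker, the mean Euclidean distance of that speaker's embeddings to their centroid.

```python
import numpy as np


def per_speaker_dispersion(embeddings, speaker_ids):
    """Mean distance of each speaker's embeddings to that speaker's centroid."""
    embeddings = np.asarray(embeddings, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    result = {}
    for spk in np.unique(speaker_ids).tolist():
        x = embeddings[speaker_ids == spk]
        centroid = x.mean(axis=0)
        result[spk] = float(np.mean(np.linalg.norm(x - centroid, axis=1)))
    return result


# Toy example with two speakers and 2-D embeddings.
toy = per_speaker_dispersion(
    [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0], [10.0, 0.0]],
    ["a", "a", "b", "b"],
)
```

Comparing these values between depressed and non-depressed speakers would give a number to set beside the plot. The 2-D projection itself could be produced with `sklearn.manifold.TSNE`.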

Cited

Waiting.
