Skip to content

qinxiaoyi/Cross-Age_Speaker_Verification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 

Repository files navigation

Cross-Age Speaker Verification

image This repository contains the trial files and VoxCeleb1&2 age information of the paper "Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings". Following is the download source. We provide Google Drive Link and Baidu Pan resource.

Files Download Link
Trial files Google Drive
Vox1 Age info Google Drive
Vox2dev Age info Google Drive

or

Download Link: https://pan.baidu.com/s/1m3lDeAEt0lJj-CPHqM4e1Q

Passwd: qv90

0.Introduction

In this paper, we consider the age impact for speaker verification systems. We adopt the face age estimation method to predict the speaker age value from the associated visual data, then label the audio recording with the estimated age. We construct multiple Cross-Age test sets on VoxCeleb (Vox-CA), which deliberately select the positive trials with large age-gap. Finally, we propose an age-decoupling adversarial learning (ADAL) to alleviate the negative effect of the age gap and reduce intra-class variance. The main contribution of this paper is:

  1. Propose cross-age speaker verification task
  2. Proposed an age-decoupling adversarial learning mothod

Paper arxiv |

1.Cross-age speaker verification task

1.1 Construction Detail

The construction pipelines adopt the following steps:

  1. Gathering the face image from meta-data of VoxCeleb1 and VoxCeleb2
  2. Estimating the age of each face image.
  3. Labeling the estimated age value for each audio utterance.
  4. Selecting large age-gap audios as positive pairs and the pairs of same nationality and gender as negative pair

Faces from video -> Face Age Estimation module -> faces age -> age average -> audio age

For the sake of clarity, the key stages are described in the following paragraphs: Estimating and labeling age for audio: We use the Dex to estimate the age for each face image, and the average age value of faces is used as the estimated age for the segment. Forming positive/negative pairs: First, the positive pairs must be the cross-age case. Second, all negative pairs are constructed within the same nationality and gender. Following the rules mentioned above, there are four VoxCA sets are constructed according to different age-gap categories:

  1. Vox-CA5. The age gap of the positive pair is 5 years at least. The candidate speakers must possess more than 7 years of max age-gap data.
  2. Vox-CA10. The age gap of the positive pair is 10 years at least. The candidate speakers must possess more than 12 years of max age-gap data.
  3. Vox-CA15. The age gap of the positive pair is 15 years at least. The candidate speakers must possess more than 17 years of max age-gap data.
  4. Vox-CA20. The age gap of the positive pair is 20 years at least. The candidate speakers must possess more than 22 years of max age-gap data.

1.2 Comparsion with Vox-E, Vox-H and Vox-CA

In this part, we consider three cases of trial pairs. Case 1. Positive pair within intra-segment. The pair audios are selected from the same video segment. Case 2. Positive pair within the cross-age. Case 3. Negative pair within the same nationality and gender.

The following table is the statistics of VoxCeleb test trials and our proposed trials. image

The following is the results on different test set based on the ResNet-GSP-ArcFace model. The our-E and our-H is our implemented test set following the VoxCeleb rules.

Test set Construct EER[%] mDCF0.01
Vox official
Vox-O random 0.962% 0.100
Vox-E random 1.094% 0.122
Vox-H nation & gender 1.939% 0.200
our proposed
our-E random 1.202% 0.123
our-H nation & gender 2.044% 0.192
only-N nation 1.568% 0.164
only-H gender 1.534% 0.146
only-I intra-segment 0.227% 0.015
only-CA5 age 1.953% 0.177
only-CA10 age 3.437% 0.272
only-CA15 age 5.927% 0.352
only-CA20 age 8.185% 0.464
Vox-CA5 age & nation & gender 3.407% 0.300
Vox-CA10 age & nation & gender 4.974% 0.370
Vox-CA15 age & nation & gender 8.028% 0.481
Vox-CA20 age & nation & gender 10.419% 0.646

The Vox-CA provides new benchmarks for crossage matching scenarios and hard tasks.

2. Learning Age-invariant Speaker Embeddings

The following is the structure ADAL module. image

Model Vox-E Vox-H Only-CA5 Only-CA10 Only-CA15 Only-CA20 Vox-CA5 Vox-CA10 Vox-CA15 Vox-CA20
ResNet34-Softmax 2.798% 4.806% 4.310% 6.004% 8.019% 9.308% 7.366% 9.125% 12.405% 14.888%
ResNet34-Arcface 1.094% 1.939% 1.935% 3.437% 5.927% 8.185% 3.407% 4.974% 8.028% 10.419%
+ GRL 1.122% 1.934% 2.021% 3.579% 6.036% 8.566% 3.405% 4.949% 8.017% 10.610%
+ Age Residual 1.121% 1.960% 2.040% 3.536% 5.871% 7.864% 3.499% 5.078% 8.039% 10.229%
+ ARE (ours) 1.108% 1.951% 1.980% 3.345% 5.719% 7.803% 3.431% 4.814% 7.786% 9.911%
+ ADAL (ours) 1.121% 1.974% 1.991% 3.330% 5.540% 7.442% 3.441% 4.822% 7.515% 9.519%

The following are the other comparison system implemented.

image

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published