[examples] Add SRE16 recipe. #177

Merged on Jul 14, 2023 (16 commits)
7 changes: 5 additions & 2 deletions README.md
@@ -35,6 +35,7 @@ pip3 install wespeakerruntime
```

## 🔥 News
* 2023.07.14: Support the [NIST SRE16 recipe](https://www.nist.gov/itl/iad/mig/speaker-recognition-evaluation-2016), see [#177](https://github.com/wenet-e2e/wespeaker/pull/177).
* 2023.07.10: Support the [Self-Supervised Learning recipe](https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxceleb/v3) on Voxceleb, including [DINO](https://openaccess.thecvf.com/content/ICCV2021/papers/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper.pdf), [MoCo](https://openaccess.thecvf.com/content_CVPR_2020/papers/He_Momentum_Contrast_for_Unsupervised_Visual_Representation_Learning_CVPR_2020_paper.pdf) and [SimCLR](http://proceedings.mlr.press/v119/chen20j/chen20j.pdf), see [#180](https://github.com/wenet-e2e/wespeaker/pull/180).

* 2023.06.30: Support the [SphereFace2](https://ieeexplore.ieee.org/abstract/document/10094954) loss function, with better performance and noise robustness compared with ArcMargin Softmax, see [#173](https://github.com/wenet-e2e/wespeaker/pull/173).
@@ -44,14 +45,16 @@ pip3 install wespeakerruntime
## Recipes

* [VoxCeleb](https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxceleb): Speaker Verification recipe on the [VoxCeleb dataset](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/)
  * 🔥 UPDATE 2023.07.10: We support a self-supervised learning recipe on VoxCeleb! Achieving **2.627%** EER (ECAPA_TDNN_GLOB_c1024) on the vox1-O-clean test set without any labels.
  * 🔥 UPDATE 2022.10.31: We support deep r-vector up to the 293-layer version! Achieving **0.447%/0.043** EER/minDCF on the vox1-O-clean test set
  * 🔥 UPDATE 2022.07.19: We apply the same setups as the CNCeleb recipe, and obtain SOTA performance among open-source systems
    - EER/minDCF on the vox1-O-clean test set are **0.723%/0.069** (ResNet34) and **0.728%/0.099** (ECAPA_TDNN_GLOB_c1024), after LM fine-tuning and AS-Norm
* [CNCeleb](https://github.com/wenet-e2e/wespeaker/tree/master/examples/cnceleb/v2): Speaker Verification recipe on the [CnCeleb dataset](http://cnceleb.org/)
* 🔥 UPDATE 2022.10.31: 221-layer ResNet achieves **5.655%/0.330** EER/minDCF
* 🔥 UPDATE 2022.07.12: We migrate the winner system of CNSRC 2022 [report](https://aishell-cnsrc.oss-cn-hangzhou.aliyuncs.com/T082.pdf) [slides](https://aishell-cnsrc.oss-cn-hangzhou.aliyuncs.com/T082-ZhengyangChen.pdf)
- EER/minDCF reduction from 8.426%/0.487 to **6.492%/0.354** after large margin fine-tuning and AS-Norm
* [NIST SRE16](https://github.com/wenet-e2e/wespeaker/tree/master/examples/sre/v2): Speaker Verification recipe for the [2016 NIST Speaker Recognition Evaluation Plan](https://www.nist.gov/itl/iad/mig/speaker-recognition-evaluation-2016). A similar recipe can be found in [Kaldi](https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16).
  * 🔥 UPDATE 2023.07.14: We support the NIST SRE16 recipe. After PLDA adaptation, we achieve **6.608%**, **10.01%**, and **2.974%** EER on the Pooled, Tagalog, and Cantonese trials, respectively.
* [VoxConverse](https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxconverse): Diarization recipe on the [VoxConverse dataset](https://www.robots.ox.ac.uk/~vgg/data/voxconverse/)

## Support List:
21 changes: 21 additions & 0 deletions examples/sre/v2/README.md
@@ -0,0 +1,21 @@
## Results for SRE16

* Setup: fbank40, num_frms200, epoch150, Softmax, aug_prob0.6
* Scoring: cosine & PLDA & PLDA Adaptation
* Metric: EER(%)

Without PLDA training data augmentation:
| Model | Params | Backend | Pooled | Tagalog | Cantonese |
|:------|:------:|:------------:|:------------:|:------------:|:------------:|
| ResNet34-TSTP-emb256 | 6.63M | Cosine | 15.4 | 19.82 | 10.39 |
| | | PLDA | 9.36 | 14.26 | 4.513 |
| | | Adapt PLDA | 6.608 | 10.01 | 2.974 |

With PLDA training data augmentation:
| Model | Params | Backend | Pooled | Tagalog | Cantonese |
|:------|:------:|:------------:|:------------:|:------------:|:------------:|
| ResNet34-TSTP-emb256 | 6.63M | Cosine | 15.4 | 19.82 | 10.39 |
| | | PLDA | 8.944 | 13.54 | 4.462 |
| | | Adapt PLDA | 6.543 | 9.666 | 3.254 |

* 🔥 UPDATE 2023.07.14: Support the [NIST SRE16 recipe](https://www.nist.gov/itl/iad/mig/speaker-recognition-evaluation-2016), see [#177](https://github.com/wenet-e2e/wespeaker/pull/177).
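
For reference, the Cosine backend in the tables above scores each trial as the cosine similarity between the enrollment and test embeddings, while the PLDA rows replace it with a backend trained (and optionally adapted) on the SRE data. A minimal NumPy sketch of the cosine rule (the actual WeSpeaker scoring tools additionally handle trial lists and score normalization):

```python
import numpy as np

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity between an enrollment and a test embedding."""
    return float(np.dot(enroll_emb, test_emb)
                 / (np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb)))
```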
81 changes: 81 additions & 0 deletions examples/sre/v2/conf/resnet.yaml
@@ -0,0 +1,81 @@
### train configuration

exp_dir: exp/ResNet34-TSTP-emb256-fbank40-num_frms200-aug0.6-spFalse-saFalse-Softmax-SGD-epoch150
gpus: "[0,1]"
num_avg: 10
enable_amp: False # whether to enable automatic mixed precision training

seed: 42
num_epochs: 150
save_epoch_interval: 5 # save the model every 5 epochs
log_batch_interval: 100 # log every 100 batches

dataloader_args:
  batch_size: 256
  num_workers: 16
  pin_memory: False
  prefetch_factor: 8
  drop_last: True

dataset_args:
  # the number of samples traversed within one epoch; if the value equals 0,
  # the utterance number in the dataset is used as sample_num_per_epoch.
  sample_num_per_epoch: 780000
  shuffle: True
  shuffle_args:
    shuffle_size: 1500
  filter: True
  filter_args:
    min_num_frames: 100
    max_num_frames: 300
  resample_rate: 8000
  speed_perturb: False
  num_frms: 200
  aug_prob: 0.6 # prob to add reverb & noise aug per sample
  fbank_args:
    num_mel_bins: 40
    frame_shift: 10
    frame_length: 25
    dither: 1.0
  spec_aug: False
  spec_aug_args:
    num_t_mask: 1
    num_f_mask: 1
    max_t: 10
    max_f: 8
    prob: 0.6

model: ResNet34 # ResNet18, ResNet34, ResNet50, ResNet101, ResNet152
model_init: null
model_args:
  feat_dim: 40
  embed_dim: 256
  pooling_func: "TSTP" # TSTP, ASTP, MQMHASTP
  two_emb_layer: False
projection_args:
  project_type: "softmax" # add_margin, arc_margin, sphere, softmax, arc_margin_intertopk_subcenter

margin_scheduler: MarginScheduler
margin_update:
  initial_margin: 0.0
  final_margin: 0.2
  increase_start_epoch: 20
  fix_start_epoch: 40
  update_margin: True
  increase_type: "exp" # exp, linear

loss: CrossEntropyLoss
loss_args: {}

optimizer: SGD
optimizer_args:
  momentum: 0.9
  nesterov: True
  weight_decay: 0.0001

scheduler: ExponentialDecrease
scheduler_args:
  initial_lr: 0.1
  final_lr: 0.00005
  warm_up_epoch: 6
  warm_from_zero: True
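
The file above is plain YAML, so its hyperparameters can be inspected directly with PyYAML; a minimal sketch, with the path as added in this PR and key names taken from the config above:

```python
import yaml

with open('examples/sre/v2/conf/resnet.yaml') as f:
    config = yaml.safe_load(f)

print(config['exp_dir'])                        # experiment output directory
print(config['dataset_args']['fbank_args'])     # {'num_mel_bins': 40, ...}
print(config['margin_update']['final_margin'])  # 0.2
```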
95 changes: 95 additions & 0 deletions examples/sre/v2/local/extract_sre.sh
@@ -0,0 +1,95 @@
#!/bin/bash

# Copyright (c) 2022 Hongji Wang (jijijiang77@gmail.com)
# 2023 Zhengyang Chen (chenzhengyang117@gmail.com)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

exp_dir=''
model_path=''
nj=4
gpus="[0,1]"
data_type="shard" # shard/raw/feat
data=data
reverb_data=data/rirs/lmdb
noise_data=data/musan/lmdb
aug_plda_data=0

. tools/parse_options.sh
set -e

if [ $aug_plda_data = 0 ]; then
  sre_plda_data=sre
else
  sre_plda_data=sre_aug
fi

data_name_array=(
  "${sre_plda_data}"
  "sre16_major"
  "sre16_eval_enroll"
  "sre16_eval_test"
)
data_list_path_array=(
  "${data}/${sre_plda_data}/${data_type}.list"
  "${data}/sre16_major/${data_type}.list"
  "${data}/sre16_eval_enroll/${data_type}.list"
  "${data}/sre16_eval_test/${data_type}.list"
)
data_scp_path_array=(
  "${data}/${sre_plda_data}/wav.scp"
  "${data}/sre16_major/wav.scp"
  "${data}/sre16_eval_enroll/wav.scp"
  "${data}/sre16_eval_test/wav.scp"
) # to count the number of wavs
nj_array=($nj $nj $nj $nj)
batch_size_array=(1 1 1 1) # batch_size of test set must be 1 !!!
num_workers_array=(1 1 1 1)
if [ $aug_plda_data = 0 ]; then
  aug_prob_array=(0.0 0.0 0.0 0.0)
else
  aug_prob_array=(0.67 0.0 0.0 0.0)
fi
count=${#data_name_array[@]}

for i in $(seq 0 $(($count - 1))); do
  wavs_num=$(wc -l ${data_scp_path_array[$i]} | awk '{print $1}')
  bash tools/extract_embedding.sh --exp_dir ${exp_dir} \
    --model_path $model_path \
    --data_type ${data_type} \
    --data_list ${data_list_path_array[$i]} \
    --wavs_num ${wavs_num} \
    --store_dir ${data_name_array[$i]} \
    --batch_size ${batch_size_array[$i]} \
    --num_workers ${num_workers_array[$i]} \
    --aug_prob ${aug_prob_array[$i]} \
    --reverb_data ${reverb_data} \
    --noise_data ${noise_data} \
    --nj ${nj_array[$i]} \
    --gpus $gpus
done

wait

echo "mean vector of enroll"
python tools/vector_mean.py \
  --spk2utt ${data}/sre16_eval_enroll/spk2utt \
  --xvector_scp $exp_dir/embeddings/sre16_eval_enroll/xvector.scp \
  --spk_xvector_ark $exp_dir/embeddings/sre16_eval_enroll/enroll_spk_xvector.ark

mkdir -p ${exp_dir}/embeddings/eval
cat ${exp_dir}/embeddings/sre16_eval_enroll/enroll_spk_xvector.scp \
  ${exp_dir}/embeddings/sre16_eval_test/xvector.scp \
  > ${exp_dir}/embeddings/eval/xvector.scp

echo "Embedding dir is (${exp_dir}/embeddings)."
36 changes: 36 additions & 0 deletions examples/sre/v2/local/filter_utt_accd_dur.py
@@ -0,0 +1,36 @@
# Copyright (c) 2023 Zhengyang Chen
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import fire


def main(wav_scp, utt2voice_dur, filter_wav_scp, dur_thres=5.0):
    # load the voiced duration (assumed to be in seconds) for each utterance
    utt2voice_dur_dict = {}
    with open(utt2voice_dur, "r") as f:
        for line in f:
            utt, dur = line.strip().split()
            utt2voice_dur_dict[utt] = float(dur)

    # keep only the wav.scp entries whose voiced duration exceeds dur_thres
    with open(wav_scp, "r") as f, open(filter_wav_scp, "w") as fw:
        for line in f:
            utt = line.strip().split()[0]
            if utt in utt2voice_dur_dict:
                if utt2voice_dur_dict[utt] > dur_thres:
                    fw.write(line)


if __name__ == "__main__":
    fire.Fire(main)
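
Because the script hands main to fire.Fire, its parameters map directly to command-line flags. A hypothetical invocation (the paths are illustrative) that keeps only utterances whose voiced duration, as listed in utt2voice_dur, exceeds 5 seconds:

```bash
python local/filter_utt_accd_dur.py \
    --wav_scp data/sre/wav.scp \
    --utt2voice_dur data/sre/utt2voice_dur \
    --filter_wav_scp data/sre/wav_filtered.scp \
    --dur_thres 5.0
```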
57 changes: 57 additions & 0 deletions examples/sre/v2/local/generate_sre_aug.py
@@ -0,0 +1,57 @@
# Copyright (c) 2023 Zhengyang Chen
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import os
import fire


def main(ori_dir, aug_dir, aug_copy_num=2):
    if not os.path.exists(aug_dir):
        os.makedirs(aug_dir)

    read_wav_scp = os.path.join(ori_dir, 'wav.scp')
    aug_wav_scp = os.path.join(aug_dir, 'wav.scp')
    read_utt2spk = os.path.join(ori_dir, 'utt2spk')
    aug_utt2spk = os.path.join(aug_dir, 'utt2spk')
    read_vad = os.path.join(ori_dir, 'vad')
    store_vad = os.path.join(aug_dir, 'vad')

    # duplicate each wav.scp entry (aug_copy_num + 1) times with a _copy-<i> suffix
    with open(read_wav_scp, 'r') as f, open(aug_wav_scp, 'w') as wf:
        for line in f:
            line = line.strip().split()
            utt, other_info = line[0], ' '.join(line[1:])
            for i in range(aug_copy_num + 1):
                wf.write(utt + '_copy-' + str(i) + ' ' + other_info + '\n')

    # replicate the speaker mapping for every copy
    with open(read_utt2spk, 'r') as f, open(aug_utt2spk, 'w') as wf:
        for line in f:
            line = line.strip().split()
            utt, spk = line[0], line[1]
            for i in range(aug_copy_num + 1):
                wf.write(utt + '_copy-' + str(i) + ' ' + spk + '\n')

    # replicate the VAD segments, renaming both the segment and utterance ids
    with open(read_vad, 'r') as f, open(store_vad, 'w') as wf:
        for line in f:
            line = line.strip().split()
            seg, utt, vad = line[0], line[1], ' '.join(line[2:])
            for i in range(aug_copy_num + 1):
                new_seg = seg + '_copy-' + str(i)
                new_utt = utt + '_copy-' + str(i)
                wf.write(new_seg + ' ' + new_utt + ' ' + vad + '\n')


if __name__ == "__main__":
    fire.Fire(main)
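
In effect, every entry in wav.scp, utt2spk, and vad is duplicated aug_copy_num + 1 times with a _copy-<i> suffix, so that downstream augmentation can be applied to each copy independently. For example, with the default aug_copy_num=2, a wav.scp line such as `utt1 /path/to/utt1.wav` (path illustrative) becomes:

```
utt1_copy-0 /path/to/utt1.wav
utt1_copy-1 /path/to/utt1.wav
utt1_copy-2 /path/to/utt1.wav
```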