[examples] Add SRE16 recipe. #177

Merged on Jul 14, 2023 (16 commits)
7 changes: 5 additions & 2 deletions README.md
@@ -35,6 +35,7 @@ pip3 install wespeakerruntime
```

## 🔥 News
* 2023.07.14: Support the [NIST SRE16 recipe](https://www.nist.gov/itl/iad/mig/speaker-recognition-evaluation-2016), see [#177](https://github.com/wenet-e2e/wespeaker/pull/177).
* 2023.07.10: Support the [Self-Supervised Learning recipe](https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxceleb/v3) on Voxceleb, including [DINO](https://openaccess.thecvf.com/content/ICCV2021/papers/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper.pdf), [MoCo](https://openaccess.thecvf.com/content_CVPR_2020/papers/He_Momentum_Contrast_for_Unsupervised_Visual_Representation_Learning_CVPR_2020_paper.pdf) and [SimCLR](http://proceedings.mlr.press/v119/chen20j/chen20j.pdf), see [#180](https://github.com/wenet-e2e/wespeaker/pull/180).

* 2023.06.30: Support the [SphereFace2](https://ieeexplore.ieee.org/abstract/document/10094954) loss function, with better performance and noise robustness compared with ArcMargin Softmax, see [#173](https://github.com/wenet-e2e/wespeaker/pull/173).
@@ -44,14 +45,16 @@ pip3 install wespeakerruntime
## Recipes

* [VoxCeleb](https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxceleb): Speaker Verification recipe on the [VoxCeleb dataset](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/)
  * 🔥 UPDATE 2023.07.10: We support a self-supervised learning recipe on VoxCeleb! Achieving **2.627%** EER (ECAPA_TDNN_GLOB_c1024) on the vox1-O-clean test set without any labels.
  * 🔥 UPDATE 2022.10.31: We support deep r-vector up to the 293-layer version! Achieving **0.447%/0.043** EER/minDCF on the vox1-O-clean test set
  * 🔥 UPDATE 2022.07.19: We apply the same setups as the CNCeleb recipe, and obtain SOTA performance among open-source systems
    - EER/minDCF on the vox1-O-clean test set are **0.723%/0.069** (ResNet34) and **0.728%/0.099** (ECAPA_TDNN_GLOB_c1024), after LM fine-tuning and AS-Norm
* [CNCeleb](https://github.com/wenet-e2e/wespeaker/tree/master/examples/cnceleb/v2): Speaker Verification recipe on the [CnCeleb dataset](http://cnceleb.org/)
* 🔥 UPDATE 2022.10.31: 221-layer ResNet achieves **5.655%/0.330** EER/minDCF
* 🔥 UPDATE 2022.07.12: We migrate the winner system of CNSRC 2022 [report](https://aishell-cnsrc.oss-cn-hangzhou.aliyuncs.com/T082.pdf) [slides](https://aishell-cnsrc.oss-cn-hangzhou.aliyuncs.com/T082-ZhengyangChen.pdf)
- EER/minDCF reduction from 8.426%/0.487 to **6.492%/0.354** after large margin fine-tuning and AS-Norm
* [NIST SRE16](https://github.com/wenet-e2e/wespeaker/tree/master/examples/sre/v2): Speaker Verification recipe for the [2016 NIST Speaker Recognition Evaluation Plan](https://www.nist.gov/itl/iad/mig/speaker-recognition-evaluation-2016). A similar recipe can be found in [Kaldi](https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16).
  * 🔥 UPDATE 2023.07.14: We support the NIST SRE16 recipe. After PLDA adaptation, we achieve **6.608%**, **10.01%**, and **2.974%** EER on the Pooled, Tagalog, and Cantonese trials, respectively.
* [VoxConverse](https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxconverse): Diarization recipe on the [VoxConverse dataset](https://www.robots.ox.ac.uk/~vgg/data/voxconverse/)

## Support List:
21 changes: 21 additions & 0 deletions examples/sre/v2/README.md
@@ -0,0 +1,21 @@
## Results for SRE16

* Setup: fbank40, num_frms200, epoch150, Softmax, aug_prob0.6
* Scoring: cosine & PLDA & PLDA Adaptation
* Metric: EER(%)

Without PLDA training data augmentation:
| Model | Params | Backend | Pooled | Tagalog | Cantonese |
|:------|:------:|:------------:|:------------:|:------------:|:------------:|
| ResNet34-TSTP-emb256 | 6.63M | Cosine | 15.4 | 19.82 | 10.39 |
| | | PLDA | 9.36 | 14.26 | 4.513 |
| | | Adapt PLDA | 6.608 | 10.01 | 2.974 |

With PLDA training data augmentation:
| Model | Params | Backend | Pooled | Tagalog | Cantonese |
|:------|:------:|:------------:|:------------:|:------------:|:------------:|
| ResNet34-TSTP-emb256 | 6.63M | Cosine | 15.4 | 19.82 | 10.39 |
| | | PLDA | 8.944 | 13.54 | 4.462 |
| | | Adapt PLDA | 6.543 | 9.666 | 3.254 |

* 🔥 UPDATE 2023.07.14: Support the [NIST SRE16 recipe](https://www.nist.gov/itl/iad/mig/speaker-recognition-evaluation-2016), see [#177](https://github.com/wenet-e2e/wespeaker/pull/177).
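
For reference, the Cosine backend in the tables above scores each trial as the cosine similarity between the enrollment and test embeddings, while the PLDA rows replace it with a backend trained (and optionally adapted) on the SRE data. A minimal NumPy sketch of the cosine rule (the actual WeSpeaker scoring tools additionally handle trial lists and score normalization):

```python
import numpy as np

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity between an enrollment and a test embedding."""
    return float(np.dot(enroll_emb, test_emb)
                 / (np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb)))
```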
81 changes: 81 additions & 0 deletions examples/sre/v2/conf/resnet.yaml
@@ -0,0 +1,81 @@
### train configuration

exp_dir: exp/ResNet34-TSTP-emb256-fbank40-num_frms200-aug0.6-spFalse-saFalse-Softmax-SGD-epoch150
gpus: "[0,1]"
num_avg: 10
enable_amp: False # whether to enable automatic mixed precision training

seed: 42
num_epochs: 150
save_epoch_interval: 5 # save the model every 5 epochs
log_batch_interval: 100 # log every 100 batches

dataloader_args:
  batch_size: 256
  num_workers: 16
  pin_memory: False
  prefetch_factor: 8
  drop_last: True

dataset_args:
  # the number of samples traversed within one epoch; if the value equals 0,
  # the utterance number in the dataset is used as sample_num_per_epoch.
  sample_num_per_epoch: 780000
  shuffle: True
  shuffle_args:
    shuffle_size: 1500
  filter: True
  filter_args:
    min_num_frames: 100
    max_num_frames: 300
  resample_rate: 8000
  speed_perturb: False
  num_frms: 200
  aug_prob: 0.6 # prob to add reverb & noise aug per sample
  fbank_args:
    num_mel_bins: 40
    frame_shift: 10
    frame_length: 25
    dither: 1.0
  spec_aug: False
  spec_aug_args:
    num_t_mask: 1
    num_f_mask: 1
    max_t: 10
    max_f: 8
    prob: 0.6

model: ResNet34 # ResNet18, ResNet34, ResNet50, ResNet101, ResNet152
model_init: null
model_args:
  feat_dim: 40
  embed_dim: 256
  pooling_func: "TSTP" # TSTP, ASTP, MQMHASTP
  two_emb_layer: False
projection_args:
  project_type: "softmax" # add_margin, arc_margin, sphere, softmax, arc_margin_intertopk_subcenter

margin_scheduler: MarginScheduler
margin_update:
  initial_margin: 0.0
  final_margin: 0.2
  increase_start_epoch: 20
  fix_start_epoch: 40
  update_margin: True
  increase_type: "exp" # exp, linear

loss: CrossEntropyLoss
loss_args: {}

optimizer: SGD
optimizer_args:
  momentum: 0.9
  nesterov: True
  weight_decay: 0.0001

scheduler: ExponentialDecrease
scheduler_args:
  initial_lr: 0.1
  final_lr: 0.00005
  warm_up_epoch: 6
  warm_from_zero: True
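
The file above is plain YAML, so its hyperparameters can be inspected directly with PyYAML; a minimal sketch, with the path as added in this PR and key names taken from the config above:

```python
import yaml

with open('examples/sre/v2/conf/resnet.yaml') as f:
    config = yaml.safe_load(f)

print(config['exp_dir'])                        # experiment output directory
print(config['dataset_args']['fbank_args'])     # {'num_mel_bins': 40, ...}
print(config['margin_update']['final_margin'])  # 0.2
```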
95 changes: 95 additions & 0 deletions examples/sre/v2/local/extract_sre.sh
@@ -0,0 +1,95 @@
#!/bin/bash

# Copyright (c) 2022 Hongji Wang (jijijiang77@gmail.com)
# 2023 Zhengyang Chen (chenzhengyang117@gmail.com)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

exp_dir=''
model_path=''
nj=4
gpus="[0,1]"
data_type="shard" # shard/raw/feat
data=data
reverb_data=data/rirs/lmdb
noise_data=data/musan/lmdb
aug_plda_data=0

. tools/parse_options.sh
set -e

if [ $aug_plda_data = 0 ]; then
  sre_plda_data=sre
else
  sre_plda_data=sre_aug
fi

data_name_array=(
  "${sre_plda_data}"
  "sre16_major"
  "sre16_eval_enroll"
  "sre16_eval_test"
)
data_list_path_array=(
  "${data}/${sre_plda_data}/${data_type}.list"
  "${data}/sre16_major/${data_type}.list"
  "${data}/sre16_eval_enroll/${data_type}.list"
  "${data}/sre16_eval_test/${data_type}.list"
)
data_scp_path_array=(
  "${data}/${sre_plda_data}/wav.scp"
  "${data}/sre16_major/wav.scp"
  "${data}/sre16_eval_enroll/wav.scp"
  "${data}/sre16_eval_test/wav.scp"
) # to count the number of wavs
nj_array=($nj $nj $nj $nj)
batch_size_array=(1 1 1 1) # batch_size of test set must be 1 !!!
num_workers_array=(1 1 1 1)
if [ $aug_plda_data = 0 ]; then
  aug_prob_array=(0.0 0.0 0.0 0.0)
else
  aug_prob_array=(0.67 0.0 0.0 0.0)
fi
count=${#data_name_array[@]}

for i in $(seq 0 $(($count - 1))); do
  wavs_num=$(wc -l ${data_scp_path_array[$i]} | awk '{print $1}')
  bash tools/extract_embedding.sh --exp_dir ${exp_dir} \
    --model_path $model_path \
    --data_type ${data_type} \
    --data_list ${data_list_path_array[$i]} \
    --wavs_num ${wavs_num} \
    --store_dir ${data_name_array[$i]} \
    --batch_size ${batch_size_array[$i]} \
    --num_workers ${num_workers_array[$i]} \
    --aug_prob ${aug_prob_array[$i]} \
    --reverb_data ${reverb_data} \
    --noise_data ${noise_data} \
    --nj ${nj_array[$i]} \
    --gpus $gpus
done

wait

echo "mean vector of enroll"
python tools/vector_mean.py \
  --spk2utt ${data}/sre16_eval_enroll/spk2utt \
  --xvector_scp $exp_dir/embeddings/sre16_eval_enroll/xvector.scp \
  --spk_xvector_ark $exp_dir/embeddings/sre16_eval_enroll/enroll_spk_xvector.ark

mkdir -p ${exp_dir}/embeddings/eval
cat ${exp_dir}/embeddings/sre16_eval_enroll/enroll_spk_xvector.scp \
  ${exp_dir}/embeddings/sre16_eval_test/xvector.scp \
  > ${exp_dir}/embeddings/eval/xvector.scp

echo "Embedding dir is (${exp_dir}/embeddings)."
36 changes: 36 additions & 0 deletions examples/sre/v2/local/filter_utt_accd_dur.py
@@ -0,0 +1,36 @@
# Copyright (c) 2023 Zhengyang Chen
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import fire


def main(wav_scp, utt2voice_dur, filter_wav_scp, dur_thres=5.0):
    # load the voiced duration (assumed to be in seconds) for each utterance
    utt2voice_dur_dict = {}
    with open(utt2voice_dur, "r") as f:
        for line in f:
            utt, dur = line.strip().split()
            utt2voice_dur_dict[utt] = float(dur)

    # keep only the wav.scp entries whose voiced duration exceeds dur_thres
    with open(wav_scp, "r") as f, open(filter_wav_scp, "w") as fw:
        for line in f:
            utt = line.strip().split()[0]
            if utt in utt2voice_dur_dict:
                if utt2voice_dur_dict[utt] > dur_thres:
                    fw.write(line)


if __name__ == "__main__":
    fire.Fire(main)
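
Because the script hands main to fire.Fire, its parameters map directly to command-line flags. A hypothetical invocation (the paths are illustrative) that keeps only utterances whose voiced duration, as listed in utt2voice_dur, exceeds 5 seconds:

```bash
python local/filter_utt_accd_dur.py \
    --wav_scp data/sre/wav.scp \
    --utt2voice_dur data/sre/utt2voice_dur \
    --filter_wav_scp data/sre/wav_filtered.scp \
    --dur_thres 5.0
```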
57 changes: 57 additions & 0 deletions examples/sre/v2/local/generate_sre_aug.py
@@ -0,0 +1,57 @@
# Copyright (c) 2023 Zhengyang Chen
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import os
import fire


def main(ori_dir, aug_dir, aug_copy_num=2):
    if not os.path.exists(aug_dir):
        os.makedirs(aug_dir)

    read_wav_scp = os.path.join(ori_dir, 'wav.scp')
    aug_wav_scp = os.path.join(aug_dir, 'wav.scp')
    read_utt2spk = os.path.join(ori_dir, 'utt2spk')
    aug_utt2spk = os.path.join(aug_dir, 'utt2spk')
    read_vad = os.path.join(ori_dir, 'vad')
    store_vad = os.path.join(aug_dir, 'vad')

    # duplicate each wav.scp entry (aug_copy_num + 1) times with a _copy-<i> suffix
    with open(read_wav_scp, 'r') as f, open(aug_wav_scp, 'w') as wf:
        for line in f:
            line = line.strip().split()
            utt, other_info = line[0], ' '.join(line[1:])
            for i in range(aug_copy_num + 1):
                wf.write(utt + '_copy-' + str(i) + ' ' + other_info + '\n')

    # replicate the speaker mapping for every copy
    with open(read_utt2spk, 'r') as f, open(aug_utt2spk, 'w') as wf:
        for line in f:
            line = line.strip().split()
            utt, spk = line[0], line[1]
            for i in range(aug_copy_num + 1):
                wf.write(utt + '_copy-' + str(i) + ' ' + spk + '\n')

    # replicate the VAD segments, renaming both the segment and utterance ids
    with open(read_vad, 'r') as f, open(store_vad, 'w') as wf:
        for line in f:
            line = line.strip().split()
            seg, utt, vad = line[0], line[1], ' '.join(line[2:])
            for i in range(aug_copy_num + 1):
                new_seg = seg + '_copy-' + str(i)
                new_utt = utt + '_copy-' + str(i)
                wf.write(new_seg + ' ' + new_utt + ' ' + vad + '\n')


if __name__ == "__main__":
    fire.Fire(main)
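
In effect, every entry in wav.scp, utt2spk, and vad is duplicated aug_copy_num + 1 times with a _copy-<i> suffix, so that downstream augmentation can be applied to each copy independently. For example, with the default aug_copy_num=2, a wav.scp line such as `utt1 /path/to/utt1.wav` (path illustrative) becomes:

```
utt1_copy-0 /path/to/utt1.wav
utt1_copy-1 /path/to/utt1.wav
utt1_copy-2 /path/to/utt1.wav
```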