thanhpv2102/Vietnam-Celeb.Interspeech
Vietnam-Celeb: a large-scale dataset for Vietnamese speaker recognition

Contact email: thanh.pv.ds@gmail.com

Citation:

@inproceedings{pham23b_interspeech,
  author={Viet Thanh Pham and Xuan Thai Hoa Nguyen and Vu Hoang and Thi Thu Trang Nguyen},
  title={{Vietnam-Celeb: a large-scale dataset for Vietnamese speaker recognition}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={1918--1922},
  doi={10.21437/Interspeech.2023-1989}
}

To extract the dataset from its 4 parts, run the following two commands:

zip -F vietnam-celeb-part.zip --out full-dataset.zip
unzip full-dataset.zip

  • The data folder contains the utterances of every speaker in the dataset. Each speaker has its own folder, named after that speaker's ID.

  • There are three text files corresponding to the dataset splits described in our Interspeech 2023 paper:

    • vietnam-celeb-t.txt: List of utterances in the Vietnam-Celeb training set.
    • vietnam-celeb-e.txt: Pairs of utterances in the Vietnam-Celeb-E test set.
    • vietnam-celeb-h.txt: Pairs of utterances in the Vietnam-Celeb-H test set.
  • We also include a TSV file containing information about every speaker in the dataset, with the following attributes:

    • speaker_id: ID of the speaker
    • gender: gender of the speaker
    • dialect: Vietnamese dialect of the speaker
    • source: the crawling source of the utterances of a speaker
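
As a minimal sketch of how the layout described above might be used from Python: the snippet below reads the speaker TSV and lists the utterance files of one speaker. It assumes the TSV starts with a header row naming the four columns listed above (speaker_id, gender, dialect, source). The function names are hypothetical helpers written for this illustration and are not shipped with the dataset. Pass in the actual TSV path from the extracted archive.

```python
import csv
from collections import Counter
from pathlib import Path


def load_speakers(tsv_path):
    """Read the speaker metadata TSV into a list of dicts, one per speaker.

    Assumption: the first row is a header with the columns
    speaker_id, gender, dialect, source.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))


def count_by(speakers, attribute):
    """Count speakers per value of one attribute, e.g. 'gender' or 'dialect'."""
    return Counter(row[attribute] for row in speakers)


def speaker_utterances(data_dir, speaker_id):
    """List the files in data/<speaker_id>/, sorted by file name."""
    speaker_dir = Path(data_dir) / speaker_id
    return sorted(p for p in speaker_dir.iterdir() if p.is_file())
```

For example, `count_by(load_speakers(path_to_tsv), "dialect")` gives the number of speakers per dialect, and `speaker_utterances("data", some_speaker_id)` returns the utterance files in that speaker's folder.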

About

Official repo for the Vietnam-Celeb dataset
