xingbow/TED_dataset

VoiceCoach

This is the TED Talk dataset for VoiceCoach (CHI 2020): https://arxiv.org/abs/2001.07876.

TED_dataset

The dataset contains metadata for 2,623 TED Talks published on the official TED.com website up to June 7, 2019.

The metadata includes the following fields: 'author', 'datefilmed', 'totalviews', 'comments', 'language', 'downloadlink', 'vidlen', 'aws-transcripts', 'datecrawled', 'datepublished', 'title', 'id', 'url', 'keywords', 'videoname', 'ratings'. The complete information for each talk is stored in the field 'alldata_JSON'.

Fields

  • url: original video link
  • aws-transcripts: each video in the dataset was transcribed by AWS. This field has two sub-fields:
    • transcript: the full text of all words spoken in the video
    • words: an array with detailed information about each word, e.g.,
      • "start_time": "12.94",
      • "end_time": "13.25",
      • "alternatives": [{"confidence": "0.9097", "content": "we"}],
      • "type": "pronunciation"
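The word-level structure described above can be read with a few lines of Python. The following is a minimal sketch, not code from this repository: the helper name `word_timings` and the inline `sample` record are hypothetical, and it assumes that numeric values are stored as strings (as in the example above) and that any item without timing information can be skipped.

```python
from typing import Dict, List, Tuple


def word_timings(aws_transcripts: Dict) -> List[Tuple[str, float, float, float]]:
    """Return (word, start, end, confidence) for each timed item in 'words'."""
    rows = []
    for item in aws_transcripts.get("words", []):
        # Assumption: items lacking timing fields (e.g. punctuation) are skipped.
        if "start_time" not in item or "end_time" not in item:
            continue
        # Use the first alternative as the recognized word.
        best = item["alternatives"][0]
        rows.append(
            (
                best["content"],
                float(item["start_time"]),
                float(item["end_time"]),
                float(best["confidence"]),
            )
        )
    return rows


# Hypothetical record built from the example values shown in this README.
sample = {
    "transcript": "we",
    "words": [
        {
            "start_time": "12.94",
            "end_time": "13.25",
            "alternatives": [{"confidence": "0.9097", "content": "we"}],
            "type": "pronunciation",
        }
    ],
}

for word, start, end, conf in word_timings(sample):
    print(f"{word}: {start:.2f}-{end:.2f}s (confidence {conf:.4f})")
```

Converting the string values to floats up front makes it easier to compute word durations or pauses between consecutive words later on.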

Video Downloading

tedvideo_download.py contains the code for downloading TED videos from TED.com.

Video2mp3/wav

You can use ffmpeg to convert the downloaded .mp4 files to audio formats such as mp3 or wav.
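As a rough illustration of the conversion step, the sketch below calls ffmpeg from Python. It is not part of this repository: the function names and the file name `talk.mp4` are hypothetical, and it assumes ffmpeg is installed and available on the PATH. The `-vn` flag tells ffmpeg to drop the video stream, and the output format is inferred from the file extension.

```python
import subprocess
from pathlib import Path
from typing import List


def ffmpeg_audio_command(video_path: str, audio_ext: str = "wav") -> List[str]:
    """Build an ffmpeg command that extracts the audio track of a video."""
    src = Path(video_path)
    # Write the audio next to the video, with the requested extension.
    dst = src.with_suffix("." + audio_ext.lstrip("."))
    # -i: input file, -vn: ignore the video stream.
    return ["ffmpeg", "-i", str(src), "-vn", str(dst)]


def convert_to_audio(video_path: str, audio_ext: str = "wav") -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_audio_command(video_path, audio_ext), check=True)


# Show the command that would be run for a hypothetical file.
print(" ".join(ffmpeg_audio_command("talk.mp4", "mp3")))
```

Calling `convert_to_audio("talk.mp4", "wav")` would then produce `talk.wav` in the same folder as the video.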

(updating)

Notice

If you use this dataset, please cite our paper:

VoiceCoach: Interactive Evidence-based Training for Voice Modulation Skills in Public Speaking

Preprint: https://arxiv.org/abs/2001.07876

Authors: Xingbo Wang, Haipeng Zeng, Yong Wang, Aoyu Wu, Zhida Sun, Xiaojuan Ma, Huamin Qu

Acknowledgements

The dataset is shared under the Creative Commons license.
