Dataset statistics

The following table shows the data size, number of lines, and description for each data source we used in transformer-based Thai language model pre-training.


| Dataset name | Data size | Number of lines | Description |
|---|---|---|---|
| wisesight-large | 51.44GB | 314M | a large dataset of social media posts provided by the social listening platform Wisesight for this study. The dataset contains posts from Twitter, Facebook, Pantip, Instagram, YouTube, and other websites sampled from 2019. |
| pantip-large | 22.35GB | 95M | a collection of posts and answers from Thailand's largest online bulletin board, Pantip.com, from 2015 to 2019, provided by the audience analytics platform Chaos Theory. |
| Thairath-222k | 1.48GB | 5M | a collection of articles published on the newspaper website Thairath.com up to December 2019. (GitHub) |
| prachathai-67k | 903.1MB | 2.7M | a collection of articles published on the newspaper website Prachathai.com from August 24, 2004 to November 15, 2018. (GitHub) |
| Thai Wikipedia | 515MB | 843k | the Wikipedia articles extracted using Giuseppe Attardi’s WikiExtractor in September 2020. All HTML tags, bullet points, and tables are removed. (GitHub) |
| OpenSubtitles | 468.8MB | 5M | a collection of movie subtitles translated by crowdsourcing from OpenSubtitles.org [Lison and Tiedemann, 2016]. We use only the portions containing Thai texts. |
| ThaiPBS-111k | 372.3MB | 858k | a collection of articles published on the newspaper website ThaiPBS.or.th up to December 2019. (GitHub) |
| Thai National Corpus (TNC) | 366MB | 797k | a 14-million-word corpus of Thai texts containing 75% non-fiction and 25% fiction works. Media source breakdown is 60% books, 25% magazines, and the rest from other publications and writings. Most of the texts are curated from 1998 to 2007 [Aroonmanakun et al., 2009]. |
| scb-mt-en-th-2020 | 290.4MB | 947k | a parallel corpus of English-Thai sentence pairs curated from news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data, government documents, and machine-generated text [Lowphansirikul et al., 2021]. (GitHub) |
| JW300 | 182.8MB | 727k | a parallel corpus of religious texts from jw.org that includes Thai texts. |
| wongnai-corpus | 64MB | 101k | a collection of restaurant reviews and ratings (1 to 5 stars) published on Wongnai.com. (GitHub) |
| QED | 42MB | 407k | a collection of transcripts for educational videos and lectures collaboratively created on the AMARA web-based platform [Abdelali et al., 2014]. |
| bibleuedin | 2.18MB | 62k | a multilingual corpus of the Bible created by Christos Christodoulopoulos and Mark Steedman. |
| wisesight-sentiment | 5.3MB | 22k | a collection of Twitter posts about consumer products and services from 2016 to early 2019, labeled as positive, negative, neutral, or question. (GitHub) |
| tanzil | 2.4MB | 6k | a collection of Quran translations compiled by the Tanzil project [Tiedemann, 2012]. |
| tatoeba | 1MB | 2k | a collection of translated sentences from the crowdsourced multilingual dataset Tatoeba [Tiedemann, 2012]. |
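
For readers who want to produce comparable figures for their own corpora, the following is a minimal sketch of how a data size and line count could be measured for a plain-text file. It is not the procedure used to build the table above. The function names (`count_lines`, `human_size`, `dataset_stats`) are hypothetical, and the use of decimal units (1 MB = 10^6 bytes) is an assumption, since the source does not say which convention its sizes follow.

```python
import os


def count_lines(path, chunk_size=1 << 20):
    """Count newline characters by reading the file in binary chunks."""
    lines = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines += chunk.count(b"\n")
    return lines


def human_size(num_bytes):
    """Format a byte count with decimal units (assumed: 1 MB = 10**6 bytes)."""
    for unit, factor in (("GB", 10**9), ("MB", 10**6), ("kB", 10**3)):
        if num_bytes >= factor:
            return f"{num_bytes / factor:.2f}{unit}"
    return f"{num_bytes}B"


def dataset_stats(path):
    """Return the on-disk size and newline count for one text file."""
    return {
        "size": human_size(os.path.getsize(path)),
        "lines": count_lines(path),
    }
```

Calling `dataset_stats("corpus.txt")` on a hypothetical file returns a dictionary with a formatted size and a line count. Counting newline bytes avoids decoding the text, which keeps the pass cheap on multi-gigabyte files, but a final line without a trailing newline is not counted.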