On a whim, I'm centralizing some test datasets for topic modeling here. Please contribute by pull request!
Datasets are organized in the Data subfolder by Data Format > Dataset Parent Folder > Data Files. Please keep this organizational structure (or propose a better one!) when making pull requests.
- Data
    - lda-c (Data in Blei's lda-c format)
        - blei-ap
            - Description: 2246 documents from the Associated Press
            - Author: David Blei
            - Source: http://www.cs.princeton.edu/~blei/lda-c/
    - raw (Unprocessed datasets)
        - Nematode biology abstracts
            - Source: https://web.archive.org/web/20040328153507/http://elegans.swmed.edu/wli/cgcbib
            - Note: Test data used by Teh et al. in Hierarchical Dirichlet Processes.
        - 20news-bydate
            - Description: "The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder."
            - Source: http://qwone.com/~jason/20Newsgroups/
            - Author: Ken Lang
        - econtalk-show-notes
            - Description: Econtalk podcast show notes and transcripts in Markdown format. Includes all episodes through May 11, 2015.
            - Source: http://www.econtalk.org
            - Author: Tim Hopper
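
For anyone loading the lda-c files, here is a minimal Python sketch. It assumes the layout described in Blei's lda-c documentation: one document per line, starting with the number of unique terms, followed by space-separated `term_index:count` pairs. The function names and the example path in the comment are hypothetical and not part of this repository.

```python
from typing import Dict, List


def parse_lda_c_line(line: str) -> Dict[int, int]:
    """Parse one lda-c document line into a {term_index: count} mapping."""
    fields = line.split()
    n_unique = int(fields[0])  # leading field: number of unique terms
    counts: Dict[int, int] = {}
    for pair in fields[1:]:
        term, count = pair.split(":")
        counts[int(term)] = int(count)
    # Sanity check: the declared number of terms should match the pairs found.
    if len(counts) != n_unique:
        raise ValueError(
            f"expected {n_unique} unique terms, found {len(counts)}"
        )
    return counts


def read_lda_c(path: str) -> List[Dict[int, int]]:
    """Read a whole lda-c file, one document per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [parse_lda_c_line(line) for line in handle if line.strip()]


# Hypothetical usage; the file name below is an assumption:
# documents = read_lda_c("Data/lda-c/blei-ap/ap.dat")
```

Each document comes back as a sparse bag of words. Mapping term indices to actual words requires the vocabulary file that accompanies the dataset, which this sketch does not handle.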