Sebastian Raschka
Last updated: 01/16/2015

A collection of links to various free and open-source datasets.

I am looking forward to extending this little collection! If you don't find your favorite datasets listed here, just let me know (via email or Twitter) and I will add them in no time!

## Sections

Dataset Repositories

[back to top]

  • Kaggle - Kaggle, the leading platform for predictive modeling competitions.

  • UCI MLR - UC Irvine Machine Learning Repository.

  • - Public data maintained by Google.

  • Freebase - A community-curated database of well-known people, places, and things.

  • - Machine learning data set repository for uploading and finding data sets.

  • Infochimps - A huge collection of large-sized data sets.

  • Amazon Web Services - Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.

  • Databib - A searchable catalog / registry / directory / bibliography of research data repositories.

  • figshare - An online digital repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos.

  • reddit r/datasets - Datasets shared on reddit.

  • datahub - The free, powerful data management platform from the Open Knowledge Foundation.

  • Quandl - A search engine for numerical data.

  • enigma - A search engine for public records published by governments, companies and organizations.

Datasets by Format

[back to top]



  • Mobio - Bi-modal (audio and video) data taken from 152 people.

  • Million Song Dataset - The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

  • Music Data Mining - A collection of research done on music analysis and links to various datasets.

  • CMU Audio Databases - A collection of databases for speech recognition.

  • CMU_ARCTIC speech synthesis databases - Phonetically balanced, US English single speaker databases designed for unit selection speech synthesis research.

  • VoxForge - GPL speech audio corpora.


[back to top]

  • TechTC - Technion Repository of Text Categorization Datasets containing 300 labeled datasets with categorization difficulties indicated by baseline SVM accuracies.

  • SMS Spam Collection - A public dataset of 5572 SMS messages that are labeled as either "spam" or "ham" (not spam).

  • musiXmatch - A dataset of lyrics for the songs in the Million Song Dataset. The lyrics are pre-processed and available as "bag of words" after stemming.

  • Google Books Ngram Viewer - The Google Books corpus as n-grams, available for quick online queries or download.

  • Jeb Bush's email archive - Jeb Bush's emails during his days as the governor of Florida.

  • Amazon Google Books Ngrams - A data set containing Google Books n-gram corpora.

  • The Wayback Machine - 80 terabytes of archived web crawl data available for research.

  • SMS Spam Collection - A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site.

  • Yahoo News Feed dataset - A 1.5 TB dataset for building machine learning recommendation systems.

  • The full Reddit Submission Corpus 2006-2015 - All publicly available Reddit submissions from January 2006 through August 31, 2015.
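
Labeled text collections such as the SMS Spam Collection listed above are often distributed as plain text with one message per line. The following is only a minimal sketch of reading such a file in Python. It assumes a tab-separated `label<TAB>message` layout, which may not match the actual download; the sample messages and the function name `load_labeled_messages` are made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in an ASSUMED "label<TAB>message" layout.
# These lines are invented for illustration and are not taken from any dataset.
SAMPLE = (
    "ham\tSee you at lunch?\n"
    "spam\tWIN a free prize now!\n"
    "ham\tRunning late, sorry.\n"
)


def load_labeled_messages(handle):
    """Return a list of (label, text) pairs from a tab-separated stream."""
    # QUOTE_NONE keeps quote characters inside messages as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip any line that does not split into exactly two fields.
    return [(row[0], row[1]) for row in reader if len(row) == 2]


messages = load_labeled_messages(io.StringIO(SAMPLE))
label_counts = Counter(label for label, _ in messages)
print(label_counts)
```

To try this on a downloaded file, replace `io.StringIO(SAMPLE)` with an opened file handle, after checking that the file really uses this layout.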

Time Series

[back to top]

  • NGAFID - National General Aviation Flight Information Database. Time series data from various flight data recorders for flights that are approximately an hour long each.

Datasets by Topic

[back to top]

Natural Sciences

[back to top]

Web, Technology, and Social Networks

[back to top]

Historical Data and Human Resources

[back to top]

Finance and Companies

[back to top]

Government Data and Politics

[back to top]