Awesome CC0

A list of Twitter datasets and related resources, released under CC0. If you have a resource to add to the list, feel free to open a pull request, or email me at

The license, when known, is given in {curly brackets}. Dataset size is given in [square brackets] when available.

  • Sentiment140 - Automatically labelled; the authors assume that any tweet with a positive emoticon, like :), is positive, and any tweet with a negative emoticon, like :(, is negative.
  • Weather-sentiment
  • Crowdflower Gender Classifier Data [20k] - Contributors were asked to simply view a Twitter profile and judge whether the user was a male, a female, or a brand (non-individual). The dataset contains 20,000 rows, each with a user name, a random tweet, account profile and image, location, and even link and sidebar color.
  • Sanders Analytics {?} [5k] - The dataset consists of 5513 hand-classified tweets, each classified with respect to one of four different topics. Use the Internet Archive's Wayback Machine to get the data.
  • Geoparse Benchmark Open Dataset {BSD-4_Clause} [?] - The geoparsing benchmark dataset contains thousands of tweets recorded during four different natural disasters: Hurricane Sandy 2012, Milan Blackouts 2013, Turkish Earthquake 2012 and the Christchurch Earthquake 2012. Each tweet in the dataset has been manually labelled with location entries at the building, street and region levels to provide a gold standard for evaluation work. The data consists of the full JSON-serialized tweet metadata (i.e. including text) with an additional ‘entities’ field of type ‘mentions’ for the ground truth location annotations.
  • Learning Multiview Embeddings of Twitter Users
  • Developing Age and Gender Predictive Lexica over Social Media, 2014 - We derive predictive lexica (words and weights) for age and gender using regression and classification models from word usage in Facebook, blog, and Twitter data with associated demographic labels. The lexica, made publicly available, achieved state-of-the-art accuracy in language-based age and gender prediction over Facebook and Twitter, and were evaluated for generalization across social media genres as well as in limited message situations.
  • Predicting the Demographics of Twitter Users from Website Traffic Data
  • Inferring Perceived Demographics from User Emotional Tone and User-Environment Emotional Contrast
  • Mining User Interests to Predict Perceived Psycho-Demographic Traits on Twitter
  • Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment
  • Who tweets? deriving the demographic characteristics of age, occupation and social class from twitter user meta-data
  • Please check for duplicates first.
  • Keep descriptions short, simple and unbiased.
  • Please make an individual commit for each suggestion.
  • Add a new category if needed.
  • For datasets, please keep the format when possible: The license, when known, is given in {curly brackets}. Dataset size is given in [square brackets] when available.

Thank you for your suggestions!


To the extent possible under law, Shay Palachy has waived all copyright and related or neighboring rights to this work.