awesome-twitter-data

A list of Twitter datasets and related resources, released under CC0. If you have a resource to add to the list, feel free to open a pull request, or email me at shay.palachy@gmail.com.

The license, when known, is given in {curly brackets}. Dataset size is given in [square brackets] when available.

1   Twitter Datasets

1.1   Tweet datasets

1.1.1   Tweet ID datasets

1.2   Tweet datasets (labelled)

  • Sentiment140 - Automatically labelled; the authors assume that any tweet with positive emoticons, like :), is positive, and that any tweet with negative emoticons, like :(, is negative.
  • Weather-sentiment
  • Crowdflower Gender Classifier Data [20k] - Contributors were asked to simply view a Twitter profile and judge whether the user was a male, a female, or a brand (non-individual). The dataset contains 20,000 rows, each with a user name, a random tweet, account profile and image, location, and even link and sidebar color.
  • Sanders Analytics {?} [5k] - Use the Internet Archive's Wayback Machine to get the data. The dataset consists of 5,513 hand-classified tweets, each classified with respect to one of four topics.
  • Geoparse Benchmark Open Dataset {BSD-4_Clause} [?] - The geoparsing benchmark dataset contains thousands of tweets recorded during four natural disasters: Hurricane Sandy 2012, Milan Blackouts 2013, Turkish Earthquake 2012 and the Christchurch Earthquake 2012. Each tweet in the dataset has been manually labelled with location entries at the building, street and region levels to provide a gold standard for evaluation work. The data consists of the full JSON-serialized tweet metadata (including the text) with an additional ‘entities’ field of type ‘mentions’ for the ground truth location annotations.
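
The emoticon-based labelling described for Sentiment140 above can be illustrated with a minimal sketch. The emoticon lists and the function name `emoticon_label` are assumptions made for illustration only; the dataset's authors may have used different emoticons and rules.

```python
# Hypothetical emoticon lists; the lists actually used to build
# Sentiment140 are not given here and may differ.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(")


def emoticon_label(text):
    """Return 'positive', 'negative', or None for a tweet's text.

    A tweet is labelled only when it contains emoticons of exactly
    one polarity; tweets with none or with both stay unlabelled.
    """
    has_positive = any(e in text for e in POSITIVE_EMOTICONS)
    has_negative = any(e in text for e in NEGATIVE_EMOTICONS)
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    return None


if __name__ == "__main__":
    for tweet in ("Great day :)", "Missed the bus :(", "Just a tweet"):
        print(tweet, "->", emoticon_label(tweet))
```

Labels produced this way are noisy (sarcasm, for instance, is not handled), which is worth keeping in mind when using automatically labelled data.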

1.3   User datasets

1.4   Lost Datasets

2   Other Lists

3   Tools

3.1   Data Collection

3.2   Analysis

4   Academic Papers

  • Learning Multiview Embeddings of Twitter Users

4.1   Demographics Prediction

  • Developing Age and Gender Predictive Lexica over Social Media, 2014 - We derive predictive lexica (words and weights) for age and gender using regression and classification models from word usage in Facebook, blog, and Twitter data with associated demographic labels. The lexica, made publicly available, achieved state-of-the-art accuracy in language-based age and gender prediction over Facebook and Twitter, and were evaluated for generalization across social media genres as well as in limited message situations.
  • Predicting the Demographics of Twitter Users from Website Traffic Data
  • Inferring Perceived Demographics from User Emotional Tone and User-Environment Emotional Contrast
  • Mining User Interests to Predict Perceived Psycho-Demographic Traits on Twitter
  • Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment
  • Who tweets? deriving the demographic characteristics of age, occupation and social class from twitter user meta-data

5   Articles & blog posts

6   Contributing

  • Please check for duplicates first.
  • Keep descriptions short, simple and unbiased.
  • Please make an individual commit for each suggestion.
  • Add a new category if needed.
  • For datasets, please keep the format when possible: The license, when known, is given in {curly brackets}. Dataset size is given in [square brackets] when available.
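
As an informal check of the entry format above, the following sketch splits a dataset entry into name, license, size and description. It assumes the name and description are separated by " - " with a space on each side; the function name `parse_entry` is hypothetical and not part of this repository.

```python
import re


def parse_entry(line):
    """Split 'Name {license} [size] - description' into its parts.

    Assumes ' - ' (with surrounding spaces) separates the heading
    from the description. Missing parts are returned as None.
    """
    head, sep, description = line.partition(" - ")
    license_match = re.search(r"\{([^}]*)\}", head)
    size_match = re.search(r"\[([^\]]*)\]", head)
    # Strip the {license} and [size] markers to recover the bare name.
    name = re.sub(r"\s*(\{[^}]*\}|\[[^\]]*\])", "", head).strip()
    return {
        "name": name,
        "license": license_match.group(1) if license_match else None,
        "size": size_match.group(1) if size_match else None,
        "description": description.strip() if sep else None,
    }


if __name__ == "__main__":
    print(parse_entry("Sanders Analytics {?} [5k] - Hand-classified tweets."))
```

A "?" inside the brackets is returned as-is, matching the list's use of "?" for an unknown license or size.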

Thank you for your suggestions!

7   License

CC0

To the extent possible under law, Shay Palachy has waived all copyright and related or neighboring rights to this work.
