Skip to content

sjgiorgi/blm_twitter_corpus

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 

Black Lives Matter Twitter Corpus

DOI

A data set of 63.9 million tweets from 13.0 million users from over 100 countries which contain one of the following keywords: BlackLivesMatter, AllLivesMatter and BlueLivesMatter.

Warning Full tweets (and associated meta-data) are not sharable. This repo only contains Tweet IDs which you must rehydrate yourself.

Data

Data is available at Zenodo.

Tweet Counter Per Day

Daily tweet counts are available in the tweet_counts_per_day.csv file.

Downloading Tweet Content

Due to Twitter's Terms of Service we are only able to distribute the numeric tweet id. Here we give brief instructions on how to populate the full tweet data from the list of ids. To do this we will use the Python command line tool Twarc. The following steps assume you have Twarc installed as well as a Twitter Developer account. To install Twarc, please run

pip3 install twarc

Next, you must configure Twarc with your Twitter API tokens.

twarc configure

Next, download the data from Zenodo:

wget https://zenodo.org/record/4897616/files/twitter.tar.gz
tar -xf twitter.tar.gz

This file contains separate folders for each year. Since the volume of tweets in 2020 was significantly larger than all previous years, all years except 2020 have a single file, while 2020 has separate files for each month. Each file contains the following fields: message_id, blacklivesmatter, alllivesmatter, bluelivesmatter. Next, we create a file containing only tweet ids. We extract data from June 2020 as an example:

cd twitter/2020
gunzip 2020-06.csv.gz
cut -d, -f1 2020-06.csv > 2020-06.txt

Using twarc from the command line

This command will produce a file where each line is a separate json file for each tweet ids. Note that only tweets which are publicly available at the time of your pull will be downloaded. Thus, our numbers might not match the numbers you see.

twarc hydrate 2020-06.txt > blm.jsonl

Using hydrate Python script

To run this script you must install the Python package tqdm:

pip3 install tqdm

Before running hydrate.py you must have the above file (blm_tweet_ids.txt) in the same directory as the Python script. Then run

python3 hydrate.py

This will produce the file blm_tweet_ids.jsonl.

Other options

Besides twarc, there are many other tools available for downloading Twitter data, such as TwitterMySQL and hydrator.

Using Excel

Please be careful when opening these files in Excel. Excel might automatically convert the Tweet ids from an integer format to a decimal.

Citation

If you use this data in your work please cite the following paper:

@misc{giorgi2022twitter,
  author       = {Salvatore Giorgi and
                  Sharath Chandra Guntuku and
                  McKenzie Himelein-Wachowiak and
                  Amy Kwarteng and
                  Sy Hwang and
                  Muhammad Rahman and
                  Brenda Curtis},
  title        = {Twitter Data of the \#BlackLivesMatter Movement And 
                   Counter Protests: 2013 to 2021},
  year         = {2022},
  journal      = {Proceedings of the International AAAI Conference on Web and Social Media}, 
}

Contact

If you have any questions please contact Salvatore Giorgi at sgiorgi[at]sas[dot]upenn[dot]edu.

License

Licensed under a GNU General Public License v3 (GPLv3).

About

Corpus of Black Lives Matters and counter protests tweets

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages