GitHub - thies1006/hatespeech-data-HT-2017: Release of hatespeech dataset from twitter and Whisper. We analyzed this data in our 2017 HyperText paper.

HATESPEECH DATA FROM TWITTER AND WHISPER

This repository contains data from Mainack Mondal, Leandro Araujo Silva and Fabricio Benevenuto. 2017. "A Measurement Study of Hate Speech in Social Media." HT'17. You can read the paper here
If you have any questions about this data, feel free to contact Dr. Mainack Mondal (email address in the paper PDF).
This is hate speech data collected from the social media sites Twitter and Whisper in 2014-2015.
The dataset contains a total of 20,705 hate posts from Twitter and 7,604 hate posts from Whisper.
Please cite our paper in any published work that uses any of these resources.

@inproceedings{mondal2017,
 author = {Mainack Mondal and Leandro A. A. Silva and Fabricio Benevenuto},
 title = {A Measurement Study of Hate Speech in Social Media},
 booktitle = {Proceedings of the 28th ACM Conference on Hypertext and Social Media},
 series = {HT '17},
 year = {2017},
 location = {Prague, Czech Republic},
 publisher = {ACM}
}

WARNING

This dataset contains swear words and hateful language that users posted while expressing hate.
The CSV files contain Unicode characters.
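
Because the files contain Unicode characters, it may help to decode them explicitly rather than rely on a platform default. The sketch below assumes the files are UTF-8 encoded, which this README does not state; the helper name `read_csv_bytes` is hypothetical.

```python
import csv
import io


def read_csv_bytes(raw: bytes, encoding: str = "utf-8") -> list[list[str]]:
    """Decode raw CSV bytes and return the rows as lists of strings."""
    # ASSUMPTION: UTF-8 is a guess; pass a different encoding if decoding fails.
    text = raw.decode(encoding)
    # newline="" leaves line endings untouched so the csv module can handle
    # line breaks that appear inside quoted fields.
    return list(csv.reader(io.StringIO(text, newline="")))
```

To use it on one of the released files, read the file in binary mode and pass the bytes to the helper, for example `read_csv_bytes(open(path, "rb").read())`.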

Other publications based on this data

This data is used in two of our other published works.

Characterizing Usage of Explicit Hate Expressions in Social Media 
Mainack Mondal, Leandro Araujo Silva, Denzil Correa and Fabricio Benevenuto.
New Review of Hypermedia and Multimedia (THAM), vol. 24, no. 2, pp. 110-130, June 2018.

Read the THAM paper here

Analyzing the Targets of Hate in Online Social Media.
Leandro Silva, Mainack Mondal, Denzil Correa, Fabricio Benevenuto, and Ingmar Weber.
In Proceedings of the Int'l AAAI Conference on Weblogs and Social Media (ICWSM’16). Cologne, Germany. May 2016.

Read the ICWSM paper here

Description of Twitter data

Our Twitter dataset, comprising 20,705 tweets, is in the CSV file hatespeech_twitter_released_hypertext_2017.csv
There are three columns:
1. Tweet id: Unfortunately Twitter does not allow developers to share the full tweet json object, but only the tweet id. You need to refetch the tweet object using these tweet ids. Note that some tweets might be unavailable due to account deletion, tweet deletion or account suspension between 2015 (when the data was collected) and now.
2. Hate targets extracted from the tweet text: Groups of people extracted from the tweet text who are the subject of hate in the tweet. E.g., for a hypothetical tweet "I strongly hate/dislike black people", "black people" is the hate target.
3. Hate categories assigned by us: We manually labeled each hate_target into one of our hate categories and put the manual label in this column. For a comprehensive list of the hate categories, check our paper.
Full tweet objects corresponding to these tweet ids are available upon request for academic research purposes. Please follow the instructions provided in agreement-tweetobject.txt.
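
The three-column layout described above can be read with a minimal sketch like the following. It assumes the columns appear in the listed order and that tweet ids are purely numeric. Whether the file has a header row is not stated here, so the sketch skips any row whose first field is not numeric. The names `TweetRecord` and `parse_twitter_rows`, and the header and values in the sample, are hypothetical placeholders rather than real data.

```python
import csv
import io
from typing import Iterable, NamedTuple


class TweetRecord(NamedTuple):
    tweet_id: str       # kept as a string so large ids are never rounded
    hate_target: str
    hate_category: str


def parse_twitter_rows(lines: Iterable[str]) -> list[TweetRecord]:
    """Parse CSV lines into records, skipping header-like or short rows."""
    records = []
    for row in csv.reader(lines):
        # ASSUMPTION: a valid data row has at least 3 fields and a numeric id.
        if len(row) < 3 or not row[0].strip().isdigit():
            continue
        records.append(
            TweetRecord(row[0].strip(), row[1].strip(), row[2].strip())
        )
    return records


# Placeholder input; the header names and values are invented for illustration.
sample = io.StringIO(
    "tweet_id,hate_target,hate_category\n"
    "123456789012345678,example group,example category\n"
)
records = parse_twitter_rows(sample)
```

Keeping the id as a string avoids the precision loss that can occur when tools load large tweet ids as floating-point numbers. The resulting ids can then be passed to whichever tweet lookup method you have access to.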

Description of Whisper data

Whisper is an anonymous social media site.
Our Whisper dataset, comprising 7,604 posts, is in the CSV file hatespeech_whisper_released_hypertext_2017.csv
There are eight columns:
1. Serial: An incremental serial assigned to each post by us.
2. text: The user-generated text of the post.
3. unix timestamp (milliseconds): Time when the post was uploaded.
4. whisper_assigned_categories: Whisper automatically assigns categories to the text. This column contains a comma separated list of those categories for each post. A blank value means no category was assigned.
5. country: Country from which the post was made.
6. region: Region (e.g., US state or city) from which the post was made.
7. Hate targets extracted from the whisper text: Groups of people extracted from the whisper text who are the subject of hate in the whisper. E.g., for a hypothetical whisper "I strongly hate/dislike black people", "black people" is the hate target.
8. Hate categories assigned by us: We manually labeled each hate_target into one of our hate categories and put the manual label in this column. For a comprehensive list of the hate categories, check our paper.
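
Two of the columns above usually need a small conversion before analysis: the millisecond Unix timestamp and the comma-separated list of Whisper-assigned categories. The sketch below shows one way to handle both; the function names are hypothetical, and interpreting the timestamp in UTC is an assumption.

```python
from datetime import datetime, timezone


def parse_timestamp_ms(value: str) -> datetime:
    """Convert a Unix timestamp in milliseconds to a timezone-aware datetime."""
    # ASSUMPTION: the result is expressed in UTC.
    return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)


def split_categories(value: str) -> list[str]:
    """Split the comma-separated category list; a blank value gives []."""
    return [part.strip() for part in value.split(",") if part.strip()]


# 1420070400000 ms is 2015-01-01 00:00:00 UTC.
print(parse_timestamp_ms("1420070400000").isoformat())
print(split_categories(""))
```

Returning an empty list for a blank value matches the note above that a blank means no category was assigned.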

Release date: 7th March, 2019

Authors: Mainack Mondal, Leandro A. A. Silva and Fabricio Benevenuto
