# Replication of Hemker (2018)

This notebook replicates the results of Hemker (2018) by following the methodology described in that work. The original source code is not available, which makes the task harder.

### Data Retrieval

In [167]:
# Source: Davidson et al. (2017)

import pandas as pd

df = pd.read_csv("./data/labeled_data.csv", index_col=0)
raw_tweets = df.tweet
raw_labels = df["class"].values

In [3]:
df.head()

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
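
The rows above suggest that `class` is the majority vote among the three annotation columns, with 0 = hate speech, 1 = offensive language and 2 = neither. The dependency-free sketch below checks that reading against the five rows shown; `CLASS_NAMES` and `majority_class` are names introduced here for illustration, and tie-breaking is not covered since none of these rows is tied.

```python
# Rows copied from the df.head() output above:
# (count, hate_speech, offensive_language, neither, class)
rows = [
    (3, 0, 0, 3, 2),
    (3, 0, 3, 0, 1),
    (3, 0, 3, 0, 1),
    (3, 0, 2, 1, 1),
    (6, 0, 6, 0, 1),
]

# Assumed meaning of the class codes, inferred from the column order.
CLASS_NAMES = {0: "hate_speech", 1: "offensive_language", 2: "neither"}

def majority_class(hate_speech, offensive_language, neither):
    """Return the index of the column with the most annotator votes."""
    votes = [hate_speech, offensive_language, neither]
    return votes.index(max(votes))

for count, hs, off, nei, cls in rows:
    # The three vote columns should add up to the number of annotators.
    assert hs + off + nei == count
    # The stored class should match the majority vote.
    assert majority_class(hs, off, nei) == cls
    print(cls, CLASS_NAMES[cls])
```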


## Data Preprocessing
---

### Noise Removal

In [4]:
# Source: Davidson et al. (2017)

import re
import html

def preprocess(text_string):
    
    # Regex
    # Raw strings avoid invalid-escape warnings for \s, \w and \(
    space_pattern = r'\s+'
    giant_url_regex = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
        r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = r'@[\w\-]+'
    hashtag_regex = r'#[\w\-]+'
    
    # If we wish to find hashtags, we have to unescape HTML entities
    parsed_text = html.unescape(text_string)
    parsed_text = re.sub(space_pattern, ' ', parsed_text)
    parsed_text = re.sub(hashtag_regex, 'HASHTAGHERE', parsed_text)
    parsed_text = re.sub(giant_url_regex, 'URLHERE', parsed_text)
    parsed_text = re.sub(mention_regex, 'MENTIONHERE', parsed_text)
    
    #parsed_text = parsed_text.code("utf-8", errors='ignore')
    
    return parsed_text

def _test_preprocess():
    
    assert "HASHTAGHERE" == preprocess("#iam1hashtag")
    assert "URLHERE" == preprocess("https://seminar.minerva.kgi.edu")
    assert "MENTIONHERE" == preprocess("@vinimiranda")
    assert ' ' == preprocess("        ")
    assert "&MENTIONHERE URLHERE HASHTAGHERE " == \
        preprocess("&amp;@vinimiranda    https://seminar.minerva.kgi.edu     #minerva    ")
    
_test_preprocess()

print("Example of a raw tweet:\n{}".format(raw_tweets[68]))
print("\nIts cleaned version is:\n{}".format(preprocess(raw_tweets[68])))

Example of a raw tweet:
"@Almightywayne__: @JetsAndASwisher @Gook____ bitch fuck u http://t.co/pXmGA68NC1" maybe you'll get better. Just http://t.co/TPreVwfq0S

Its cleaned version is:
"MENTIONHERE: MENTIONHERE MENTIONHERE bitch fuck u URLHERE" maybe you'll get better. Just URLHERE


In [5]:
tweets = raw_tweets.map(preprocess)
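
One detail of the URL pattern is worth keeping in mind when reading the cleaned tweets: inside a character class, `$-_` is a range from `$` to `_`, so the pattern also accepts characters such as `,`, `:` and `/`. The sketch below only uses the standard library and the same pattern as `preprocess`; `replace_urls` and `range_chars` are illustrative names, not part of the original code.

```python
import re

# Same URL pattern as in preprocess() above.
URL_REGEX = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
             r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

# Inside a character class, `$-_` is a range from '$' (36) to '_' (95).
range_chars = ''.join(chr(c) for c in range(ord('$'), ord('_') + 1))

def replace_urls(text):
    """Replace every URL match with the URLHERE placeholder."""
    return re.sub(URL_REGEX, 'URLHERE', text)

print(range_chars)
# A comma glued to the URL is part of the match and disappears.
print(replace_urls('see http://t.co/abc, then'))
# A double quote is outside the accepted characters and is kept.
print(replace_urls('"hi http://t.co/abc" bye'))
```

In practice this means punctuation directly after a link can be dropped from the cleaned text, which is harmless for the steps in this notebook but may matter for punctuation-based features.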

### Sentiment Analysis

In [6]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer as VS

sentiment_analyzer = VS()

# Example
sentiment_analyzer.polarity_scores(tweets[68])

{'neg': 0.38, 'neu': 0.469, 'pos': 0.151, 'compound': -0.6597}
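
To help read these numbers: `neg`, `neu` and `pos` are proportions of the text, while `compound` is a single score squashed into the interval (-1, 1). To my understanding of VADER's implementation, the squashing is x / sqrt(x² + α) with α = 15; treat that formula and the `normalize` helper below as assumptions rather than a copy of the library code. The raw sums in the loop are hypothetical values chosen for illustration.

```python
import math

def normalize(raw_score, alpha=15):
    """Squash an unbounded valence sum into (-1, 1) (assumed VADER-style)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

# Scores copied from the output above.
scores = {'neg': 0.38, 'neu': 0.469, 'pos': 0.151, 'compound': -0.6597}

# neg, neu and pos are proportions, so they should sum to (about) 1.
proportions_total = scores['neg'] + scores['neu'] + scores['pos']
print(round(proportions_total, 3))

# Hypothetical raw valence sums, for illustration only.
for raw in (-3.4, 0.0, 3.4):
    print(raw, round(normalize(raw), 4))
```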

### Hate Subclass Extraction

In [109]:
# Partly from https://stackoverflow.com/questions/31836058/nltk-named-entity-recognition-to-a-python-list
# I do not implement co-reference resolution since a single NE is sufficient for directed hate speech.
# Requires the NLTK tokenizer, POS tagger and NE chunker resources (see nltk.download).
import nltk
from string import punctuation

def hate_classification(hate_tweet):
    '''Receives a hateful tweet.
       Returns 3 for directed hate speech and 4 otherwise.'''

    # A user mention is enough to consider the tweet directed.
    if "MENTIONHERE" in hate_tweet:
        return 3

    # URLHERE is considered a proper noun by the POS tagger,
    # so remove it before looking for named entities.
    no_punct_hate = ''.join(char for char in hate_tweet if char not in punctuation)
    no_URL_hate = ' '.join(token for token in no_punct_hate.split() if token != "URLHERE")
    for sent in nltk.sent_tokenize(no_URL_hate):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            # Named-entity chunks are subtrees with a label; plain tokens are tuples.
            if hasattr(chunk, 'label'):
                return 3  # Named entity found

    return 4
        
def _test_hate_classification():
    assert hate_classification("MENTIONHERE") == 3
    assert hate_classification("Karen is absolutely crazy") == 3
    assert hate_classification("Karen is his sister. She's absolutely crazy") == 3
    assert hate_classification("They should all be sent to Mexico") == 3
    assert hate_classification("They should all leave the country") == 4
    assert hate_classification("some hate speech stuff") == 4
    assert hate_classification("") == 4

_test_hate_classification()

In [139]:
hate_tweets = tweets[df["class"] == 0].values
_hate_prnt = lambda x : "Generalized" if hate_classification(x) == 4 else "Directed"

print("Example of a hateful tweet: \n{}".format(hate_tweets[20]))
print("Its type of hate speech is: {}\n".format(_hate_prnt(hate_tweets[20])))

print("Example of a hateful tweet:\n{}".format(hate_tweets[10]))
print("Its type of hate speech is: {}\n".format(_hate_prnt(hate_tweets[10])))

Example of a hateful tweet: 
"We're out here, and we're queer!" " 2, 4, 6, hut! We like it in our butt!"
Its type of hate speech is: Generalized

Example of a hateful tweet:
"MENTIONHERE: Jackies a retard HASHTAGHERE" At least I can make a grilled cheese!
Its type of hate speech is: Directed
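
The rule order used above (mention first, then named entities, otherwise generalized) can be sketched without NLTK by making the entity detector pluggable. The detector below is a deliberately naive capitalization heuristic, swapped in for `nltk.ne_chunk` purely for illustration; `classify_hate` and `naive_has_named_entity` are hypothetical names. It will miss sentence-initial names such as "Karen" and will flag other capitalized tokens, so it is not a replacement for the NLTK-based function.

```python
def naive_has_named_entity(text):
    """Crude stand-in for NER: any capitalized token that is not sentence-initial."""
    tokens = text.split()
    return any(tok[0].isupper() for tok in tokens[1:])

def classify_hate(tweet, has_named_entity=naive_has_named_entity):
    """Return 3 (directed) or 4 (generalized), mirroring the rule order above."""
    # Rule 1: a user mention makes the tweet directed.
    if "MENTIONHERE" in tweet:
        return 3
    # Drop the URL placeholder so it cannot be mistaken for a name.
    cleaned = ' '.join(tok for tok in tweet.split() if tok != "URLHERE")
    # Rule 2: any detected named entity makes the tweet directed.
    return 3 if has_named_entity(cleaned) else 4

for example in ["MENTIONHERE",
                "They should all be sent to Mexico",
                "They should all leave the country",
                "look at URLHERE"]:
    print(repr(example), classify_hate(example))
```

Passing a different callable as `has_named_entity` keeps the rule order fixed while allowing the entity detection to be swapped or mocked in tests.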



In [195]:
# Change hate speech labels (0) to directed (3) / generalized labels (4) 
labels = raw_labels.copy()
for i, (tweet, label) in enumerate(zip(tweets, raw_labels)):
    
    if label == 0:  # If hate speech
        labels[i] = hate_classification(tweet)

def _test_labels():
    # Every hate speech label (0) must have been replaced by 3 or 4.
    assert 0 not in pd.Series(labels).value_counts().index
    assert 3 in pd.Series(labels).value_counts().index
    assert 4 in pd.Series(labels).value_counts().index

_test_labels()

In [196]:
pd.Series(labels).value_counts()

1    19190
2     4163
3     1183
4      247
dtype: int64
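
The counts above show a strong class imbalance, and most hateful tweets end up in the directed subclass. The short sketch below turns the printed counts into shares using only the standard library; `LABEL_NAMES` is an illustrative mapping based on the label codes used in this notebook.

```python
# Counts copied from the value_counts() output above.
label_counts = {1: 19190, 2: 4163, 3: 1183, 4: 247}
LABEL_NAMES = {1: "offensive", 2: "neither",
               3: "directed hate", 4: "generalized hate"}

total = sum(label_counts.values())
shares = {LABEL_NAMES[k]: v / total for k, v in label_counts.items()}

# Share of directed tweets among all hateful tweets (labels 3 and 4).
hate_total = label_counts[3] + label_counts[4]
directed_share_of_hate = label_counts[3] / hate_total

for name, share in shares.items():
    print(f"{name:>16}: {share:.1%}")
print(f"directed share of hate tweets: {directed_share_of_hate:.1%}")
```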