#  Hate Speech Detector

In today's session, we learned that in order to detect sentiments we can simply compare freqeuencies of positive and negative words. To this end, we downloaded a dictionary of such terms from the web and then determined their respective frequency. If there are more positive terms in a document than negative ones, we considered it to have a positive sentiment and otherwise a negative one.

There are many such dictinonaries produced by linguists but also other communities such as journalists. We can use these with the same approach we used for detecting sentiments to understand texts in different contexts. Journalists, for instance, have developed https://www.hatebase.org/, the world's largest online repository of structured, multilingual, usage-based hate speech. 

Here, we will use hatebase to develop a hate speech detector for tweets by counting the number of hate words in tweets. We will concentrate on the English language. You can go to https://www.hatebase.org/ and explore the search functions to take a look at the English terms in hatebase. 


Next, we need to download the hatebase dictionary, which is unfortunately not that easy. You need to register for an API key and then work relatively hard to get the API to return all English hate speech terms. 

I have commented out the hate_vocabulary(api_key) function that speaks to https://www.hatebase.org/ and instead provided you with a direct import from a local CSV file. If you want to, for instance, download the dictionary for another language than English, you need to un-commnent those lines.

In [1]:
import pandas as pd

hate_df = pd.read_csv('https://raw.githubusercontent.com/goto4711/social-cultural-analytics/master/hate-vocab-eng.csv')

In [2]:
hate_df.head()

Unnamed: 0.1,Unnamed: 0,word,meaning,offensiveness,number_of_sightings
0,1,abbo,"Australian Aboriginal person. Originally, this...",0.0,0
1,2,ABC,[1] American-born Chinese [2] Australian-born ...,0.0,0
2,3,ABCD,"American-Born Confused Desi, Indian Americans,...",0.0,0
3,4,abo,"Australian Aboriginal person. Originally, this...",0.0,37
4,5,af,"An African, used by white Rhodesians.",0.0,461


Next we access Twitter the way we learned today. The code is set to a query Twitter about 'Trump' below but is not active. In order to activate it, you need to add your Twitter API details. You can of course also change the search_term.


In [3]:
import tweepy
import requests
from ipynb.fs.full.keys import *

consumer_key = twit_key
consumer_secret = twit_secr
access_token = twit_token

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)

search_term = 'Trump'
# tweets = tweepy.Cursor(api.search, q = search_term)
tweets = api.search(q=search_term)

In [4]:
for tweet in tweets:
    print(tweet.text)

RT @EricHaftelLive: Remember that time when Susan Collins promised nothing would happen to #RoeVWade when she confirmed Kavanaugh to SCOTUS…
RT @MMFlint: The Trump-dominated Supreme Court has refused to block Texas from banning abortion, effectively killing Roe v Wade. To be clea…
RT @OccupyDemocrats: BREAKING: The Texas State House and Senate BOTH pass the Texas “New Jim Crow” bill that was inspired by Trump’s Big Li…
RT @CREWcrew: Kevin McCarthy would not be doing this if there wasn't something making him *very* nervous in those phone records https://t.c…
RT @hdemauleon: Ahí nomás, modestamente.
👇
AMLO llega a tercer informe con 61 mil afirmaciones falsas, el doble de Trump en todo su mandato…
RT @ElectionWiz: LISTEN: Trump calls for removing GOP Senate Minority Leader Mitch McConnell.

"Mitch McConnell should not be the leader, h…
@RichardCurren7 @chipfranklin So the Obama voters who flipped to Trump get a pass? Come on. Dig deeper.
RT @JenniferJJacobs: Biden today meets for 1st tim

Rather than performing a live Twitter search, I have saved the a search on 'Trump' from the day of his 2017 inauguration. That's the read.csv command further down. Please, note that the file was created with the old twitteR library, which means that some of the column names are different from what you are used to. But the one we are interested in is still called 'text'.

In [5]:
tweets = pd.read_csv("https://raw.githubusercontent.com/goto4711/social-cultural-analytics/master/trump-tweets-20-1.csv", encoding='latin-1')
                

Let's take a look at tweets. You will see all texts as well as a lot of other information.

In [6]:
tweets.head()

Unnamed: 0.1,Unnamed: 0,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted,longitude,latitude
0,1,RT @ChrisTran1997: Wanting Trump to fail makes...,False,0,,2017-01-20 19:58:31,False,,822533789251489794,,"<a href=""http://twitter.com/download/iphone"" r...",brrriiieee,7,True,False,,
1,2,"RT @TuhafAmaGercek: Donald Trump, ABD BaÅkanl...",False,0,,2017-01-20 19:58:31,False,,822533789234659328,,"<a href=""http://twitter.com/#!/download/ipad"" ...",sailorreihino,282,True,False,,
2,3,And you losers on the left continue to wonder ...,False,0,,2017-01-20 19:58:31,True,,822533789230518274,,"<a href=""http://twitter.com/download/iphone"" r...",DavidYDG,0,False,False,,
3,4,RT @rogerwilko: #Trump speech is like the nati...,False,0,,2017-01-20 19:58:31,False,,822533789209554946,,"<a href=""http://twitter.com/download/android"" ...",samella_donavan,384,True,False,,
4,5,KPHO Phoenix Devotes 24 Hours to Trump's Impac...,False,0,,2017-01-20 19:58:31,False,,822533789196881925,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",bcbeat,0,False,False,,


Next we start with our standard text mining workflow 

Unfortunately, tweets can be difficult to process, as people use very different types of language, of formatting, etc. I have therefore provided you with a clean_tweets function, which applies to all the texts in the tweets and save the results in a tweets_text list.

In [7]:
def clean_tweets(df):
    text = df['text']
    tweet_list = []
    for tweet in text:
#         print(tweet)
        tweet = tweet.split()
        for word in tweet:
            if len(word) < 3:
                word.replace(word, "")
        tweet_list.append(tweet)
    return tweet_list
    
tweet_list = clean_tweets(tweets)
tweet_list_str = str([tweet for tweets in tweet_list for tweet in tweets])

Next, our ususal steps to prepare a TM corpus.

In [8]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stopwords = list(stopwords.words('english'))

def Corpuser(corpus):
    corpus = word_tokenize(corpus)
    corpus = [word.replace(" ", "") for word in corpus]
    corpus = [word.lower() for word in corpus if word.isalpha()]

    corpus = [word for word in corpus if word not in stopwords]
    
    return corpus

Corpuser(tweet_list_str)

['tuhafamagercek',
 'rogerwilko',
 'trump',
 'rorschach',
 'sniffles',
 'trump',
 'trump',
 'chelseahandler',
 'freetrade',
 'inauguration',
 'benshapiro',
 'j',
 'ottoemezzo',
 'danjgross',
 'inauguration',
 'dimaio',
 'trump',
 'huffingtonpost',
 'pzfeed',
 'daisyrdley',
 'hewillnotdivideus',
 'sophiabush',
 'resist',
 'pattymo',
 'trump',
 'christylezz',
 'got',
 'ta',
 'ca',
 'obama',
 'trump',
 'mattgertz',
 'could',
 'jemelehill',
 'newshour',
 'inauguration',
 'dayone',
 'newsstrump',
 'samswey',
 'dreadchapo',
 'potus',
 'amp',
 'potus',
 'nomandate',
 'realdonaldtrump',
 'ca',
 'kill',
 'buzzfeednews',
 'trump',
 'prisonplanet',
 'cornellbarnard',
 'cofc',
 'cluedont',
 'potus',
 'j',
 'melaniatrump',
 'amp',
 'kimbeex',
 'donald',
 'pizzagatefeed',
 'breitbartnews',
 'stranahan',
 'pookiedaslave',
 'n',
 'u',
 'n',
 'cgbposts',
 'house',
 'brendannyhan',
 'trump',
 'amp',
 'í',
 'charleshurt',
 'trump',
 'jezebel',
 'aidenmarceron',
 'engadget',
 'trump',
 'luscas',
 'e',
 'm

Our next TM workflow step will be to create a term-document-matrix to count the terms in the documents. You might have noticed above that the hatebase vocabulary contains not just single words but also phrases of more than one word such as 'African catfish'. As we also learned today, these are so-called bigrams (2 word phrases). So, we create two term-document-matrices one for the single words (also called unigrams) and one for the bigrams.



In [9]:
# first we create a frequency table

def frequencytable(corpus):
    words = Corpuser(corpus)
    freq_table = {}
    for word in words:
        if word in freq_table:
            freq_table[word] += 1
        else:
            freq_table[word] = 1
    return freq_table

In [10]:
table = frequencytable(tweet_list_str)

In [11]:
# then create the actual dataframe where each tweet is a column
# possible for a few tweets but might be too complex for 1000?
# is it needed to distinguish every tweet or is it about the entire corpus? --> Yes you want to know 
# how many tweets contain hatespeech

dfs = []
for i in range(1000):
    table = frequencytable(str(tweet_list[i]))
    i = pd.DataFrame.from_dict(table, orient='index', columns={i})
    dfs.append(i)

In [12]:
merged_df = pd.concat(dfs, axis=1)
merged_df = merged_df.fillna(0)

In [13]:
merged_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
tuhafamagercek,0,1.0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
rogerwilko,0,0.0,0,1.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
trump,0,0.0,0,1.0,1.0,1.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
rorschach,0,0.0,0,1.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
sniffles,0,0.0,0,1.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
pkcapitol,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,...,1.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
radikal,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,1.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
rubenaguilar,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,0
z,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,0.0,1.0,0.0,0.0,0.0,0


In [14]:
# now we check which of the tweets contain hate speech thus therefore match withe the hate speech dictionary
# --> but where are the hate speech words? is it about the offensivenes?
# what exactly is being done with the bigram?

# hate_dff = hate_df.loc[hate_df['offensiveness'] > 0]

In [15]:
hate_dict = hate_df[['word', 'offensiveness']]

In [16]:
hate_dict

Unnamed: 0,word,offensiveness
0,abbo,0.0
1,ABC,0.0
2,ABCD,0.0
3,abo,0.0
4,af,0.0
...,...,...
572,zip,0.0
573,zipperhead,0.0
574,zippohead,0.0
575,ZOG,0.0


All we have to do now is find out which rownames (terms) of tdm correspond to terms in our hate speech dictionary. 

In [17]:
# we drop the rownames that are not in the hate_dict

hate_voc = hate_dict['word'].values.tolist()
hate_voc = [word.lower() for word in hate_voc if word.isalpha()]

hate_speech = merged_df[merged_df.index.isin(hate_voc)]

The columns (docs) of tdm that are larger than 0 are then the tweets which contain hate speech words.



In [18]:
# hate = hate_speech[hate_speech > 0]

In [19]:
hate_voc

['abbo',
 'abc',
 'abcd',
 'abo',
 'af',
 'africoon',
 'albino',
 'americoon',
 'amo',
 'angie',
 'anglo',
 'ann',
 'ape',
 'apple',
 'argie',
 'armo',
 'azn',
 'banana',
 'beaner',
 'beaney',
 'bengali',
 'bhrempti',
 'bint',
 'bird',
 'bitch',
 'blaxican',
 'blockhead',
 'bludger',
 'bluegum',
 'bogan',
 'bong',
 'boo',
 'boojie',
 'boon',
 'booner',
 'boong',
 'boonga',
 'boonie',
 'boxhead',
 'brownie',
 'bubble',
 'buck',
 'buckethead',
 'buckra',
 'buckwheat',
 'buddhahead',
 'buffie',
 'bumblebee',
 'bung',
 'bunga',
 'burrhead',
 'butterhead',
 'caublasian',
 'celestial',
 'charlie',
 'charva',
 'charver',
 'chav',
 'chigger',
 'chinaman',
 'chinig',
 'chink',
 'chonky',
 'chug',
 'chunky',
 'clam',
 'clamhead',
 'cocoa',
 'coconut',
 'colored',
 'coloured',
 'coolie',
 'coon',
 'cracker',
 'cripple',
 'crow',
 'cunt',
 'cushi',
 'cushite',
 'dago',
 'darkey',
 'darkie',
 'darky',
 'dego',
 'dhimmi',
 'dinge',
 'dink',
 'div',
 'divvy',
 'dogan',
 'dogun',
 'domes',
 'dyke',
 '

In [20]:
# what is the hate speech vocab??
hate_speech

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
abc,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0


In [21]:
merged_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
tuhafamagercek,0,1.0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
rogerwilko,0,0.0,0,1.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
trump,0,0.0,0,1.0,1.0,1.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
rorschach,0,0.0,0,1.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
sniffles,0,0.0,0,1.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
pkcapitol,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,...,1.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
radikal,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,1.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
rubenaguilar,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,0
z,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,...,0.0,0.0,0.0,0,0.0,1.0,0.0,0.0,0.0,0


In [22]:
exists = 'trump' in tweets.text
print(exists)

False


In [23]:
merged_df.index.isin(hate_voc)

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

In [24]:
merged_df.index

Index(['tuhafamagercek', 'rogerwilko', 'trump', 'rorschach', 'sniffles',
       'chelseahandler', 'freetrade', 'inauguration', 'benshapiro', 'j',
       ...
       'dordogne', 'france', 'un', 'maisonblanche', 'ericgarland', 'pkcapitol',
       'radikal', 'rubenaguilar', 'z', 'varneyco'],
      dtype='object', length=722)

In [25]:
words = word_tokenize(tweet_list_str)
words = [word.replace(" ", "") for word in words]
words = [word.lower() for word in words if word.isalpha()]

In [26]:
exists = 'trump' in words
print(exists)

True


In [27]:
words

['a',
 'we',
 'tuhafamagercek',
 'rogerwilko',
 'trump',
 'a',
 'rorschach',
 'sniffles',
 'trump',
 'trump',
 'chelseahandler',
 'freetrade',
 'inauguration',
 'benshapiro',
 'j',
 'ottoemezzo',
 'danjgross',
 'inauguration',
 'dimaio',
 'trump',
 'huffingtonpost',
 'pzfeed',
 'daisyrdley',
 'hewillnotdivideus',
 'sophiabush',
 'resist',
 'pattymo',
 'trump',
 'christylezz',
 'got',
 'ta',
 'ca',
 'obama',
 'trump',
 'mattgertz',
 'could',
 'had',
 'jemelehill',
 'newshour',
 'inauguration',
 'dayone',
 'newsstrump',
 'samswey',
 'dreadchapo',
 'potus',
 'amp',
 'a',
 'a',
 'potus',
 'nomandate',
 'realdonaldtrump',
 'i',
 'ca',
 'kill',
 'no',
 'buzzfeednews',
 'a',
 'i',
 'trump',
 'prisonplanet',
 'i',
 'cornellbarnard',
 'cofc',
 'a',
 'a',
 'cluedont',
 'a',
 'i',
 'a',
 'a',
 'potus',
 'j',
 'melaniatrump',
 'amp',
 'kimbeex',
 'donald',
 'pizzagatefeed',
 'breitbartnews',
 'stranahan',
 'they',
 'a',
 'pookiedaslave',
 'n',
 'u',
 'n',
 'cgbposts',
 'house',
 'brendannyhan',
 '

In [28]:
# Why doesn't it find the words that are there? such as bitch, bubble, etc....