# Hate Speech Detector

In today's session, we learned that in order to detect sentiments we can simply compare the frequencies of positive and negative words. To this end, we downloaded a dictionary of such terms from the web and then determined their respective frequencies. If a document contained more positive terms than negative ones, we considered it to have a positive sentiment, and otherwise a negative one.
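
As a reminder of how this counting works, here is a minimal sketch in plain Python. The two word sets are tiny, made-up placeholders rather than a real sentiment lexicon, and the function name `dictionary_sentiment` is only illustrative.

```python
# Hypothetical mini-dictionaries; a real analysis would load a published lexicon.
positive_words = {"good", "great", "happy", "love"}
negative_words = {"bad", "awful", "sad", "hate"}

def dictionary_sentiment(text):
    """Label a text by comparing counts of positive and negative words."""
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    n_pos = sum(token in positive_words for token in tokens)
    n_neg = sum(token in negative_words for token in tokens)
    # Rule from the text: more positive than negative terms means positive,
    # anything else (including ties) counts as negative.
    return "positive" if n_pos > n_neg else "negative"

print(dictionary_sentiment("What a great day, I love it!"))
print(dictionary_sentiment("This is a bad and sad story."))
```

Note that with this rule a text without any dictionary words is also labelled negative, which is one reason such counts should be treated as a rough signal only.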

There are many such dictionaries, produced by linguists but also by other communities such as journalists. We can use them with the same approach we used for detecting sentiments to understand texts in different contexts. Journalists, for instance, have developed https://www.hatebase.org/, the world's largest online repository of structured, multilingual, usage-based hate speech.

Here, we will use hatebase to develop a hate speech detector for tweets by counting the number of hate words in tweets. We will concentrate on the English language. You can go to https://www.hatebase.org/ and explore the search functions to take a look at the English terms in hatebase. 



Next, we need to download the hatebase dictionary, which is unfortunately not that easy. You need to register for an API key and then work relatively hard to get the API to return all English hate speech terms. 

I have commented out the hate_vocabulary(api_key) function that speaks to https://www.hatebase.org/ and instead provided you with a direct import from a CSV file. If you want to, for instance, download the dictionary for a language other than English, you need to un-comment those lines.

In [1]:
import pandas as pd

hate_df = pd.read_csv('https://raw.githubusercontent.com/goto4711/social-cultural-analytics/master/hate-vocab-eng.csv')

In [2]:
hate_df.head()

Unnamed: 0.1,Unnamed: 0,word,meaning,offensiveness,number_of_sightings
0,1,abbo,"Australian Aboriginal person. Originally, this...",0.0,0
1,2,ABC,[1] American-born Chinese [2] Australian-born ...,0.0,0
2,3,ABCD,"American-Born Confused Desi, Indian Americans,...",0.0,0
3,4,abo,"Australian Aboriginal person. Originally, this...",0.0,37
4,5,af,"An African, used by white Rhodesians.",0.0,461


Next, we access Twitter the way we learned today. The code below is set to query Twitter about 'Trump' but is not active. In order to activate it, you need to add your Twitter API details. You can of course also change the search_term.

In [3]:
import tweepy
import requests
from ipynb.fs.full.keys import *

consumer_key = twit_key
consumer_secret = twit_secr
access_token = twit_token

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)

search_term = 'Trump'
tweets = api.search(q=search_term)

ModuleNotFoundError: No module named 'ipynb.fs.full.keys'

In [4]:
for tweet in tweets:
    print(tweet.text)

RT @EamonJavers: The CEO of Donald Trump’s new SPAC is a man named Patrick Orlando. According to his bio, he has recently “been serving as…
@Potomacbeat So, in my state the vote total was 

Biden 424,921
Trump 365,654

Let's assume for a moment that I cas… https://t.co/ihfFtAW3dU
RT @glennkirschner2: If DOJ refuses to prosecute Donald Trump for the many crimes he inarguably committed, it will NOT be a decision made o…
RT @ScottAdamsSays: As of today, no one in a leadership position has explained to the public what is being done to solve the supply chain p…
Nine months after being expelled from social media for his role in inciting the Jan. 6 Capitol insurrection, former… https://t.co/tgpZ5gKf2B
RT @ScottAdamsSays: As of today, no one in a leadership position has explained to the public what is being done to solve the supply chain p…
@MUDDLAW Stiglitz no es el mismo economista DESPRESTIGIADO democRATA que dijo que la legislación de Trump de reduci… https://t.co/CmYFUDKGme
#LFC trump PSG

Rather than performing a live Twitter search, I have saved a search on 'Trump' from the day of his 2017 inauguration. That's the pd.read_csv command further down. Please note that the file was created with the old twitteR library, which means that some of the column names are different from what you are used to. But the one we are interested in is still called 'text'.

In [5]:
tweets = pd.read_csv("https://raw.githubusercontent.com/goto4711/social-cultural-analytics/master/trump-tweets-20-1.csv", encoding='latin-1')

Let's take a look at tweets. You will see all texts as well as a lot of other information.

In [6]:
tweets.head()

Unnamed: 0.1,Unnamed: 0,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted,longitude,latitude
0,1,RT @ChrisTran1997: Wanting Trump to fail makes...,False,0,,2017-01-20 19:58:31,False,,822533789251489794,,"<a href=""http://twitter.com/download/iphone"" r...",brrriiieee,7,True,False,,
1,2,"RT @TuhafAmaGercek: Donald Trump, ABD BaÅkanl...",False,0,,2017-01-20 19:58:31,False,,822533789234659328,,"<a href=""http://twitter.com/#!/download/ipad"" ...",sailorreihino,282,True,False,,
2,3,And you losers on the left continue to wonder ...,False,0,,2017-01-20 19:58:31,True,,822533789230518274,,"<a href=""http://twitter.com/download/iphone"" r...",DavidYDG,0,False,False,,
3,4,RT @rogerwilko: #Trump speech is like the nati...,False,0,,2017-01-20 19:58:31,False,,822533789209554946,,"<a href=""http://twitter.com/download/android"" ...",samella_donavan,384,True,False,,
4,5,KPHO Phoenix Devotes 24 Hours to Trump's Impac...,False,0,,2017-01-20 19:58:31,False,,822533789196881925,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",bcbeat,0,False,False,,


Next, we start with our standard text mining workflow.

Unfortunately, tweets can be difficult to process, as people use very different types of language, formatting, etc. I have therefore provided you with a clean_tweets function, which is applied to all the texts in tweets and saves the results in tweet_list. It blanks out all words with fewer than three characters; the loop after it joins the remaining words of each tweet back into a single string.

In [51]:
def clean_tweets(df):
    text = df['text']
    tweet_list = []
    for tweet in text:
        tweet = tweet.split()
        tweet = ["" if len(word) < 3 else word for word in tweet]
        tweet_list.append(tweet)
    return tweet_list
    
tweet_list = clean_tweets(tweets)
tweet_list_new = []
for tweet in tweet_list:
    tweet_str = " ".join(tweet)
    tweet_list_new.append(tweet_str)


Next, our usual steps to prepare a TM corpus.

In [95]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

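# Use a separate name so the imported stopwords module is not overwritten
stop_words = set(stopwords.words('english'))

def Corpuser(corpus):
    # Split the text into tokens
    corpus = word_tokenize(corpus)
    corpus = [word.replace(" ", "") for word in corpus]
    # Keep alphabetic tokens only and lowercase them
    corpus = [word.lower() for word in corpus if word.isalpha()]
    # Remove English stop words
    corpus = [word for word in corpus if word not in stop_words]

    return corpus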

# tweet_corp = Corpuser(tweet_list_new)
# print(tweet_corp)

docs = []
for tweet in tweet_list_new:
    doc = Corpuser(tweet)
    # Join the tokens into one string per tweet, as CountVectorizer expects
    docs.append(" ".join(doc))


Our next TM workflow step will be to create a term-document-matrix to count the terms in the documents.

In [83]:
from nltk import FreqDist

tf = FreqDist(docs)
print(tf)

<FreqDist with 806 samples and 1000 outcomes>


In [71]:
from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
# get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later
dtm = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
dtm = dtm.T
dtm

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
ab,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abbiamo,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abc,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abcpolitics,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abcworldnews,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
youtube,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
yup,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
zacharynalepa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
zero,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
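
To make the layout of such a matrix easier to see than in the mostly empty table above, here is a minimal sketch in plain Python. The three toy documents are made up for illustration, and the sketch uses `collections.Counter` instead of `CountVectorizer`, so it only mirrors the idea, not the exact tokenisation.

```python
from collections import Counter

# Three toy documents standing in for cleaned tweets (made up for illustration).
toy_docs = ["trump speech today", "speech speech crowd", "crowd today"]

# Count the terms of each document separately.
counts_per_doc = [Counter(doc.split()) for doc in toy_docs]

# The vocabulary becomes the rows, the documents become the columns.
vocabulary = sorted(set(term for counts in counts_per_doc for term in counts))
toy_tdm = {term: [counts[term] for counts in counts_per_doc] for term in vocabulary}

for term, row in toy_tdm.items():
    print(f"{term:8s}", row)
```

Each row is a term and each position in the row is a document, so a value larger than 0 means that the term occurs in that document.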


All we have to do now is find out which row names (terms) of dtm correspond to terms in our hate speech dictionary. The columns (docs) that have a count larger than 0 in one of these rows are then the tweets which contain hate speech words.

The pandas method isin answers the question: 'Which of the terms in the dataframe's index appear in the hate vocabulary?'
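
Before applying it to the real data, a small sketch may help to see what isin returns. The toy matrix and the two vocabulary entries below are made-up placeholders.

```python
import pandas as pd

# Toy term-document matrix: terms as the index, two documents as columns.
toy_dtm = pd.DataFrame({0: [1, 0, 2], 1: [0, 1, 0]}, index=["hello", "idiot", "world"])
toy_vocab = ["idiot", "trash"]  # hypothetical dictionary entries

mask = toy_dtm.index.isin(toy_vocab)  # one True/False value per row
print(mask)
print(toy_dtm[mask])  # only the rows whose term is in the dictionary
```

The result of isin is a boolean array with one entry per term, which we can use directly to select the matching rows.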

In [81]:
hate_voc = hate_df['word'].values.tolist()
hate_voc = [word.lower() for word in hate_voc if word.isalpha()]

hate_speech = dtm[dtm.index.isin(hate_voc)]

In [82]:
hate_speech

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
abc,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
bitch,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
boo,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
bubble,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
clam,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
idiot,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
nigga,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
property,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
tan,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
trash,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we only need to find the indexes of these words to see which tweets they belong to. After transposing, each row is a tweet, so the rows with a count larger than 0 for a word are the tweets that contain it.

In [94]:
hate_speecht = hate_speech.T
bitch = hate_speecht.index[hate_speecht['bitch'] > 0].to_list()
idiot = hate_speecht.index[hate_speecht['idiot'] > 0].to_list()
print(bitch)
print(idiot)

[141, 216, 519]
[342]
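
Instead of looking up one word at a time, we could also collect every tweet that contains any dictionary term. The following is a minimal sketch on a toy stand-in for hate_speech; the name `toy_hate_speech` and its values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for hate_speech: dictionary terms as rows, four tweets as columns.
toy_hate_speech = pd.DataFrame(
    [[0, 1, 0, 0], [0, 0, 0, 2]],
    index=["idiot", "trash"],
    columns=[0, 1, 2, 3],
)

# A tweet is flagged if any dictionary term occurs in it at least once.
has_hit = (toy_hate_speech > 0).any(axis=0)
flagged = has_hit[has_hit].index.to_list()

# Total number of dictionary hits per flagged tweet.
hits = toy_hate_speech.sum(axis=0).loc[flagged]
print(flagged)
print(hits.to_dict())
```

The same two lines should work on the real hate_speech dataframe, since it has the same layout of terms as rows and tweets as columns.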


Let's check out the tweets that contain 'bitch'.

In [100]:
tweets_bitch = tweets.iloc[bitch]['text']
for tweet in tweets_bitch:
    print(tweet)

A Trump bitch stopped the fire pit ugh í ½í¹
RT @TomiLahren: They will march and protest and whine and bitch and then a magical thing will happen..nothing. President Trump will just coâ¦
Fuck trump bitch


Some of these are very angry about Trump, but probably still not really hate speech. This shows the limitations of an approach that only matches single words and phrases without looking at their context.

But this approach can still be useful to filter tweets for manual review by editors. Twitter and others actually have engines like this. Similar keyword matching is frequently used in apps like the one described at http://www.huffingtonpost.com/entry/donald-trump-stock-alert_us_586e67dce4b0c4be0af325fc, which sends alerts when Donald Trump tweets about your stocks.
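
Such a filter for manual review could be sketched as follows. The vocabulary, the sample tweets and the function name `count_hits` are hypothetical placeholders, and this is not how any particular platform implements it.

```python
# Hypothetical placeholder vocabulary and tweets, used only for illustration.
review_vocab = {"idiot", "trash"}
sample_tweets = [
    "What a lovely morning",
    "You idiot, this is trash",
    "Taking out the trash tonight",
]

def count_hits(text, vocab):
    """Count how many tokens of a text appear in the vocabulary."""
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    return sum(token in vocab for token in tokens)

# Keep only tweets with at least one hit, most hits first, for an editor to review.
scored = [(count_hits(tweet, review_vocab), tweet) for tweet in sample_tweets]
review_queue = sorted((item for item in scored if item[0] > 0), reverse=True)

for hits, tweet in review_queue:
    print(hits, tweet)
```

The last sample tweet ends up in the queue although it is harmless, which illustrates why the final decision is best left to a human reviewer.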