# Project #15: Twitter hate speech detection 2

## 1. Introduction



The goal of our project is to find efficient methods for identifying hate speech on Twitter. In particular, we aim to find a set of features that can be used to distinguish hate speech content from other tweets.



For our analysis, we gathered two data sets. The first was collected by searching for tweets containing specific hashtags (topics). The second was collected from active Twitter users who frequently posted hate speech content. Both data sets were obtained using the Twitter API and the search-tweets Python library.

In [None]:
# Standard libraries
import sys; sys.path.insert(0, '..') # add parent folder to path

# Third-party libraries
import pandas as pd
import matplotlib.pyplot as plt

# Custom scripts
import liwc_empath
import util

# Dictionary keys
CATEGORY_HATE = "hate_speech_topics"
CATEGORY_NON_HATE = "non_hate_speech_topics"

## 2. Data sets

### Data set 1: Hate speech hashtags
The first data set was collected by searching for tweets containing the specific hashtags that were given to us in the project assignment: #terrorist, #radicalist, #islamophobia, #extremist, and #bombing.
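
The sketch below shows one way the hashtag search rules could be assembled before being passed to a search client. It is a minimal illustration, not the collection script we used: the helper `build_hashtag_rules` and the combined rule are hypothetical, and the only assumption about query syntax is that the search endpoint accepts `#tag` terms joined with `OR`.

```python
# The five hashtags named in the project assignment.
HASHTAGS = ["terrorist", "radicalist", "islamophobia", "extremist", "bombing"]


def build_hashtag_rules(hashtags):
    """Return one search rule per hashtag, keyed by the bare tag name.

    Hypothetical helper: normalizes each tag (drops a leading '#',
    lowercases) so that the key can also be reused in a file name.
    """
    rules = {}
    for tag in hashtags:
        name = tag.lstrip("#").lower()
        rules[name] = f"#{name}"
    return rules


rules = build_hashtag_rules(HASHTAGS)

# A single rule matching any of the hashtags (assumes OR is supported).
combined_rule = " OR ".join(rules[name] for name in HASHTAGS)

for name, rule in rules.items():
    print(f"{name}: {rule}")
print("combined:", combined_rule)
```

Keeping one rule per hashtag mirrors the per-hashtag files loaded in the next cell, while the combined rule would suit a single query over all topics.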


In [None]:
# Read labeled tweets with specific hash tags
tweets_hashtag = {}
tweets_hashtag["bombing"] = util.read_tweets(["../Data/tweets_bombing_labeled.json"])
tweets_hashtag["extremist"] = util.read_tweets(["../Data/tweets_extremist_labeled.json"])
tweets_hashtag["islamophobia"] = util.read_tweets(["../Data/tweets_islamophobia_labeled.json"])
tweets_hashtag["radicalist"] = util.read_tweets(["../Data/tweets_radicalist_labeled.json"])
tweets_hashtag["terrorist"] = util.read_tweets(["../Data/tweets_terrorist_labeled.json"])

print('Hashtag summaries: ')
for hashtag, tweets in tweets_hashtag.items():
    util.print_hashtag_summary(hashtag, tweets)

# Read the combined labeled data set
labeled_tweets = util.read_tweets(["../Data/tweets_labeled_combined.json"])
print('All labeled tweets:')
util.print_hashtag_summary('ALL', labeled_tweets)
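
The output of `util.print_hashtag_summary` is not reproduced here. As a rough illustration of the kind of summary that is useful at this stage, the sketch below counts tweets per label and computes each label's share. It assumes each tweet is a dictionary with a `label` field; that field name, the label values, and the helper `summarize_labels` are all hypothetical and may not match our JSON files or `util`.

```python
from collections import Counter


def summarize_labels(tweets, label_key="label"):
    """Count tweets per label value and compute each label's share.

    `label_key` is an assumed field name, not taken from the data files.
    Tweets without that field are counted as 'unlabeled'.
    """
    counts = Counter(tweet.get(label_key, "unlabeled") for tweet in tweets)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()} if total else {}
    return counts, shares


# Toy records standing in for labeled tweets (not real data).
sample = [
    {"text": "example a", "label": "hate"},
    {"text": "example b", "label": "non_hate"},
    {"text": "example c", "label": "non_hate"},
    {"text": "example d"},
]

counts, shares = summarize_labels(sample)
for label, count in counts.most_common():
    print(f"{label}: {count} ({shares[label]:.0%})")
```

Checking the class balance this way before extracting features helps to judge whether later differences between hate and non-hate tweets rest on enough examples.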

## 3. Characterization of the labeled data set

### 3.1 Sentiment analysis
TEXT HERE

In [None]:
# CODE HERE

### 3.2 LIWC Features


In [None]:
from liwc_empath import analyze_tweets_liwc

liwc_categories = analyze_tweets_liwc(labeled_tweets)

# Slicing (rather than indexing 0..19) avoids an IndexError
# when fewer than 20 categories are returned.
print("Top 20 LIWC categories in hate tweets:")
for category, value in liwc_categories[CATEGORY_HATE][:20]:
    print(f"{category}: {round(value, 4)}")

print("\nTop 20 LIWC categories in non-hate tweets:")
for category, value in liwc_categories[CATEGORY_NON_HATE][:20]:
    print(f"{category}: {round(value, 4)}")

# Draw top 20 categories
categories_hate = [x for (x, y) in liwc_categories[CATEGORY_HATE][:20]]
values_hate = [y for (x, y) in liwc_categories[CATEGORY_HATE][:20]]

categories_non_hate = [x for (x, y) in liwc_categories[CATEGORY_NON_HATE][:20]]
values_non_hate = [y for (x, y) in liwc_categories[CATEGORY_NON_HATE][:20]]

fig, axs = plt.subplots(2,1,figsize=(12,10), sharey=False)

axs[0].bar(categories_hate, values_hate)
axs[0].set_title("Top 20 LIWC Categories: Hate speech tweets")
axs[0].tick_params(labelrotation=45)

axs[1].bar(categories_non_hate, values_non_hate)
axs[1].set_title("Top 20 LIWC Categories: Non-hate speech tweets")
axs[1].tick_params(labelrotation=45)

plt.tight_layout()
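
The cell above treats each entry of `liwc_categories` as a list of `(category, value)` pairs, apparently ordered by descending value. The internals of `analyze_tweets_liwc` are not shown in this notebook, so the following is only a sketch of how such a ranking can be produced with a dictionary-based approach: count the words that fall into each category and divide by the total word count. The toy lexicon, the whitespace tokenizer, and the function `category_scores` are assumptions made for illustration and are not the LIWC dictionary or our implementation.

```python
from collections import Counter

# Toy lexicon: category -> words. Purely illustrative; it is not
# the LIWC dictionary, which is far larger.
TOY_LEXICON = {
    "anger": {"hate", "angry", "rage"},
    "social": {"friend", "people", "they"},
}


def category_scores(texts, lexicon):
    """Return (category, score) pairs sorted by descending score.

    score = number of words matching the category / total number of words.
    """
    hits = Counter()
    total_words = 0
    for text in texts:
        words = text.lower().split()  # naive whitespace tokenizer (assumption)
        total_words += len(words)
        for word in words:
            for category, vocabulary in lexicon.items():
                if word in vocabulary:
                    hits[category] += 1
    if total_words == 0:
        return []
    scores = {category: hits[category] / total_words for category in lexicon}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


scores = category_scores(["They hate people", "angry rage"], TOY_LEXICON)
for category, score in scores:
    print(f"{category}: {round(score, 4)}")
```

Normalizing by the total word count keeps the scores comparable between the hate and non-hate groups even when the groups differ in size.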



### 3.3 Emoticon usage
TEXT HERE

In [None]:
# CODE HERE

### 3.4 Named entities
TEXT HERE

In [None]:
# CODE HERE

### 3.5 Named phrases
TEXT HERE

## 4. Radicalization of active hate speakers

In [None]:
from analyse_user import analyse_users

# Forward slashes keep the paths portable across operating systems
# and consistent with the earlier cells.
first_user = analyse_users('../Data/tweets_user_ViidarUkonpoika.json')
second_user = analyse_users('../Data/tweets_user_UKInfidel.json')
third_user = analyse_users('../Data/tweets_user_DrDavidDuke.json')


#print('file: ' + first_user["source_file"])
#print('mean sentiment percentile: ' + str(mean_sentiment_perc))
#print('number of negative posts: ' + str(no_neg_posts))
#print('volume of negative posts: ' + str(vol_neg_posts))
#print('number of very negative posts:' + str(no_very_neg_posts))
#print('volume of very negative posts:' + str(vol_very_neg_posts))
#print('number of days active: '+ str(time_active.days))
#print('radicalization score: '+ str(radicalization_score))
#print('very negative post and their sentiments:')
#[print(str(item[0]),item[1]) for item in very_neg_tweets_and_sentiments]
#df = pd.DataFrame(sentiments, columns = ['sentiment'])
#df.hist(bins=50)