<a href="https://colab.research.google.com/github/radonys/Reddit-HateSpeech-Application/blob/master/Reddit_HateSpeech_Modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reddit Hate-Speech Modelling

This notebook models hate speech in Reddit comments using data from [A Benchmark Dataset for Learning to Intervene in Online Hate Speech](https://github.com/jing-qian/A-Benchmark-Dataset-for-Learning-to-Intervene-in-Online-Hate-Speech).

## Libraries

### Install

In [1]:
!pip install praw

Collecting praw
  Downloading https://files.pythonhosted.org/packages/5c/39/17251486951815d4514e4a3f179d4f3e7af5f7b1ce8eaba5a3ea61bc91f2/praw-7.0.0-py3-none-any.whl (143kB)
Collecting websocket-client>=0.54.0
  Downloading https://files.pythonhosted.org/packages/4c/5f/f61b420143ed1c8dc69f9eaec5ff1ac36109d52c80de49d66e0c36c3dfdf/websocket_client-0.57.0-py2.py3-none-any.whl (200kB)
Collecting update-checker>=0.16
  Downloading https://files.pythonhosted.org/packages/d6/c3/aaf8a162df8e8f9d321237c7c0e63aff95b42d19f1758f96606e3cabb245/update_checker-0.17-py2.py3-none-any.whl
Collecting prawcore<2.0,>=1.3.0
  Downloading https://files.pythonhosted.org/packages/c9/8e/d076cb8f26523f91eef3e75d6cf9143b2f16d67ce7d681a61d0bbc783f49/prawcore-1.3.0-py3-none-any.whl
Installing 

### Import

In [2]:
import os
import praw
import pandas as pd
import datetime as dt
import logging
import numpy as np
from numpy import random
import gensim
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup
from collections import defaultdict
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report
import pickle

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yashsrivastava/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Variable Declarations

In [3]:
reddit = praw.Reddit(client_id='#', client_secret='#', user_agent='#', username='#', password='#')

# Raw strings keep the regex escapes (\[ \] \|) from being read as string escapes.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))
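
To make the effect of these two patterns concrete, here is a minimal sketch that applies them to a made-up sentence. The sample text and the four-word `STOPWORDS` set are assumptions for illustration only (the notebook uses the full NLTK English list), and the HTML and digit stripping done later in `clean_text` is left out.

```python
import re

# Same patterns as in the notebook, written as raw strings.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')

# Hypothetical stand-in for set(stopwords.words('english')).
STOPWORDS = {'the', 'is', 'a', 'to'}

# Made-up sample comment, not taken from the dataset.
sample = "The fix is here: r/python (see @mods), C# > C++!"

text = sample.lower()
text = REPLACE_BY_SPACE_RE.sub(' ', text)   # separators such as / ( ) @ , become spaces
text = BAD_SYMBOLS_RE.sub('', text)         # anything outside 0-9, a-z, space, #, +, _ is dropped
cleaned = ' '.join(w for w in text.split() if w not in STOPWORDS)

print(cleaned)  # fix here r python see mods c# c++
```

Note that `#` and `+` survive the second pattern, so tokens such as `c#` and `c++` stay intact.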

## Utility Functions

In [4]:
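def clean_text(text):
    """Lowercase a comment and strip HTML, digits, symbols, and stopwords."""
    text = BeautifulSoup(text, "lxml").text  # drop HTML markup
    text = text.lower()
    text = ''.join([i for i in text if not i.isdigit()])  # remove digits
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # separators become spaces
    text = BAD_SYMBOLS_RE.sub('', text)  # drop the remaining symbols
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)

    return text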

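def process_info(row):
  """Split one conversation row into cleaned comments with 0/1 hate labels."""

  info = defaultdict(list)

  # One comment per line; the final split element is dropped.
  texts = row['text'].split("\n")[:-1]

  # hate_speech_idx looks like "[2, 3]" (1-based); convert to 0-based ints.
  hate_idx = row['hate_speech_idx'][1:-1].split(',')
  hate_idx = [int(i) - 1 for i in hate_idx]

  for txt in texts:
    info['text'].append(clean_text(txt))

  hate = np.zeros(len(texts))

  try:
    for idx in hate_idx:
      hate[idx] = 1
  except IndexError as error:
    # An annotated index points past the available comments; skip this row.
    print(error)
    return {}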

  info['hate'] = hate

  return info
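
The labelling step in `process_info` can be illustrated on its own. The sketch below is not the notebook's function: `label_lines` is a hypothetical helper, the sample text is made up, and it parses the index string with `ast.literal_eval` and an explicit bounds check instead of string slicing and NumPy indexing. It assumes, as the `[:-1]` slice above suggests, that each comment sits on its own line and the text ends with a newline.

```python
import ast

def label_lines(text, hate_speech_idx):
    """Pair each comment line with 1 if its 1-based position is annotated, else 0."""
    lines = text.split("\n")[:-1]          # drop the empty piece after the last newline
    labels = [0] * len(lines)
    for idx in ast.literal_eval(hate_speech_idx):   # "[2, 3]" -> [2, 3]
        position = idx - 1                           # annotations are 1-based
        if not 0 <= position < len(lines):
            raise IndexError(f"index {position} is out of range for {len(lines)} lines")
        labels[position] = 1
    return list(zip(lines, labels))

# Made-up conversation with three comments; the 2nd and 3rd are annotated.
row_text = "1. first comment\n2. second comment\n3. third comment\n"
pairs = label_lines(row_text, "[2, 3]")
print(pairs)

# An annotation that points past the last comment cannot be labelled.
try:
    label_lines("1. a\n2. b\n", "[3]")
    out_of_range = False
except IndexError:
    out_of_range = True
```

The second call mirrors the situation where an annotated index has no matching line, which is the case `process_info` reports and skips.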

## Download Annotated Data

In [5]:
!wget https://github.com/jing-qian/A-Benchmark-Dataset-for-Learning-to-Intervene-in-Online-Hate-Speech/raw/master/data/reddit.csv
data = pd.read_csv("reddit.csv")
data.dropna(subset = ['hate_speech_idx'], inplace = True)
data.drop(['response', 'id'], inplace = True, axis = 1)
data.reset_index(inplace = True, drop = True)
data.head(5)

--2020-05-17 01:59:02--  https://github.com/jing-qian/A-Benchmark-Dataset-for-Learning-to-Intervene-in-Online-Hate-Speech/raw/master/data/reddit.csv
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/jing-qian/A-Benchmark-Dataset-for-Learning-to-Intervene-in-Online-Hate-Speech/master/data/reddit.csv [following]
--2020-05-17 01:59:04--  https://raw.githubusercontent.com/jing-qian/A-Benchmark-Dataset-for-Learning-to-Intervene-in-Online-Hate-Speech/master/data/reddit.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.248.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.248.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7384220 (7.0M) [text/plain]
Saving to: ‘reddit.csv’


2020-05-17 01:59:12 (911 KB/s) - ‘reddit.csv’ saved [7384220/7384220]

Unnamed: 0,text,hate_speech_idx
0,1. A subsection of retarded Hungarians? Ohh bo...,[1]
1,"1. > ""y'all hear sumn?"" by all means I live i...",[3]
2,1. Because the Japanese aren't retarded and kn...,[1]
3,1. That might be true if we didn't have an exa...,"[2, 3]"
4,"1. Why, what is the point of making all of tha...",[8]


## Clean Downloaded Data

In [6]:
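# DataFrame.append was removed in pandas 2.0, so collect the per-row frames
# and concatenate them once at the end.
frames = []

for _, row in data.iterrows():
  info = process_info(row)
  if info:  # process_info returns {} for rows it could not label
    frames.append(pd.DataFrame(info))

cleaned_data = pd.concat(frames, ignore_index=True)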

index 2 is out of bounds for axis 0 with size 2
index 19 is out of bounds for axis 0 with size 14


## Hate/Non-Hate Classifier

### ML Algorithms

#### Logistic Regression

In [7]:
def logisticreg(X_train, X_test, y_train, y_test):

  from sklearn.linear_model import LogisticRegression

  logreg = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf', LogisticRegression(n_jobs=1, C=1e5)),
                 ])
  logreg.fit(X_train, y_train)

  y_pred = logreg.predict(X_test)

  print('accuracy %s' % accuracy_score(y_test, y_pred))
  print(classification_report(y_test, y_pred))

#### Random Forest

In [8]:
def randomforest(X_train, X_test, y_train, y_test):
  
  from sklearn.ensemble import RandomForestClassifier
  
  ranfor = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf', RandomForestClassifier(n_estimators = 1000, random_state = 42)),
                 ])
  ranfor.fit(X_train, y_train)

  y_pred = ranfor.predict(X_test)

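  # Persist the fitted pipeline; the context manager closes the file handle.
  filename = 'finalized_model.sav'
  with open(filename, 'wb') as model_file:
    pickle.dump(ranfor, model_file)

  print('accuracy %s' % accuracy_score(y_test, y_pred))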
  print(classification_report(y_test, y_pred))

#### Linear Support Vector Machine

In [9]:
def linear_svm(X_train, X_test, y_train, y_test):
  
  from sklearn.linear_model import SGDClassifier

  sgd = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42, max_iter=5, tol=None)),
                 ])
  sgd.fit(X_train, y_train)

  y_pred = sgd.predict(X_test)

  print('accuracy %s' % accuracy_score(y_test, y_pred))
  print(classification_report(y_test, y_pred))
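
All three pipelines share the same first two steps, `CountVectorizer` followed by `TfidfTransformer`. With default settings this pair produces the same features as a single `TfidfVectorizer`, which the notebook imports but does not use. The sketch below checks that on three made-up sentences; the toy documents are an assumption for illustration, not dataset rows.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import Pipeline

# Made-up toy documents, not taken from the dataset.
docs = ["you are awful", "have a nice day", "awful awful day"]

# Two-step feature extraction, as used in the notebook's pipelines.
two_step = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer())])

# Single-step equivalent with default parameters.
one_step = TfidfVectorizer()

a = two_step.fit_transform(docs).toarray()
b = one_step.fit_transform(docs).toarray()

same = np.allclose(a, b)
print(a.shape, same)
```

The default tokenizer ignores single-character tokens, so "a" does not appear in the vocabulary and the matrix has six columns.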

### Train and Test the ML Models

In [10]:
def train_test(X,y):
 
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)

  print("Results of Linear Support Vector Machine")
  linear_svm(X_train, X_test, y_train, y_test)
  print("Results of Logistic Regression")
  logisticreg(X_train, X_test, y_train, y_test)
  print("Results of Random Forest")
  randomforest(X_train, X_test, y_train, y_test)

In [11]:
train_test(cleaned_data.text, cleaned_data.hate)

Results of Linear Support Vector Machine
accuracy 0.8224666142969363
              precision    recall  f1-score   support

         0.0       0.80      0.99      0.89      3542
         1.0       0.96      0.43      0.60      1550

    accuracy                           0.82      5092
   macro avg       0.88      0.71      0.74      5092
weighted avg       0.85      0.82      0.80      5092

Results of Logistic Regression
accuracy 0.7360565593087196
              precision    recall  f1-score   support

         0.0       0.83      0.79      0.81      3542
         1.0       0.56      0.62      0.59      1550

    accuracy                           0.74      5092
   macro avg       0.69      0.70      0.70      5092
weighted avg       0.74      0.74      0.74      5092

Results of Random Forest
accuracy 0.9183032207384132
              precision    recall  f1-score   support

         0.0       0.91      0.97      0.94      3542
         1.0       0.93      0.79      0.85      1550

    accuracy                           0.92      5092
   macro avg       0.92      0.88      0.90      5092
weighted avg       0.92      0.92      0.92      5092
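
When reading these reports, two quick checks can help: the F1 score is the harmonic mean of precision and recall, and the accuracy of always predicting the larger class gives a baseline to compare against. The sketch below recomputes both from the figures printed above; `f1_from` is a hypothetical helper name. Because the report rounds to two decimals, recomputing other rows this way may differ by 0.01 in the last digit.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Random Forest, class 1.0 (hate): precision 0.93, recall 0.79 in the report.
rf_hate_f1 = f1_from(0.93, 0.79)

# Test-set supports from the report: 3542 non-hate and 1550 hate comments.
support_non_hate, support_hate = 3542, 1550
majority_baseline = support_non_hate / (support_non_hate + support_hate)

print(round(rf_hate_f1, 2))         # 0.85
print(round(majority_baseline, 3))  # 0.696
```

Against that baseline of roughly 0.70, the logistic regression accuracy of about 0.74 is a small gain, while the random forest's 0.92 is a much larger one.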



## References

1) https://arxiv.org/abs/1909.04251

2) https://github.com/jing-qian/A-Benchmark-Dataset-for-Learning-to-Intervene-in-Online-Hate-Speech

3) http://www.storybench.org/how-to-scrape-reddit-with-python/

4) https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568