## Introduction

This notebook walks through using machine learning algorithms and Natural Language Processing (NLP) techniques to detect hate speech in text.

Sources for the data include:

- Hatebase.org API

## Data Extraction from Hatebase API

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import requests
from config import hatebase_api_key
import time

Setting the variables needed to access the API and authenticate with the API key.

In [2]:
auth_url = "https://api.hatebase.org/4-4/authenticate"
auth_key = "api_key={}".format(hatebase_api_key)
headers = {
    'Content-Type': "application/x-www-form-urlencoded",
    'cache-control': "no-cache"
    }
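
The authentication payload above is assembled with string formatting, which works as long as the key contains no characters that need escaping. As a minimal sketch of an alternative, the standard library's `urllib.parse.urlencode` can build the same form-encoded body and escape reserved characters. `build_payload` and the sample key are hypothetical names used only for illustration.

```python
from urllib.parse import urlencode

def build_payload(**fields):
    """Return a form-encoded body for an x-www-form-urlencoded POST."""
    # urlencode percent-escapes reserved characters such as '&' and '='
    return urlencode(fields)

# Placeholder key, not a real credential
example_body = build_payload(api_key="abc&123")
print(example_body)  # api_key=abc%26123
```

The same helper could build the vocabulary and sightings payloads used later, for example `build_payload(token=token, format="json", language="eng")`.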

Accessing the API and retrieving the token used to authorize later requests.

In [3]:
response = requests.request("POST", auth_url, data=auth_key, headers=headers)

In [4]:
token = response.json()["result"]["token"]
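
The lookup above assumes that authentication succeeded; if it did not, the failure surfaces as a `KeyError` with little context. The sketch below shows one way to fail earlier with a readable message. `extract_token` is a hypothetical helper, and the only response shape it assumes is the `result` → `token` nesting already used in this notebook.

```python
def extract_token(auth_json):
    """Return the token from a parsed authentication response."""
    result = auth_json.get("result") or {}
    token = result.get("token")
    if not token:
        # Fail early with a readable message instead of a KeyError later on
        raise ValueError("No token in authentication response: {}".format(auth_json))
    return token

# Hypothetical parsed response, mirroring the ["result"]["token"] lookup above
sample = {"result": {"token": "example-token"}}
print(extract_token(sample))  # example-token
```

In the notebook, calling `response.raise_for_status()` before `response.json()` would additionally surface HTTP-level errors from `requests`.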

Querying the API using the vocabulary endpoint. This query extracts the first page of English vocabulary classified as hate speech, along with the total number of result pages.

In [5]:
vocab_url = "https://api.hatebase.org/4-4/get_vocabulary"
lang = "eng"
resp_format = "json"
vocab_payload = "token=" + token + "&format=" + resp_format + "&language=" + lang

In [8]:
vocab_response = requests.request("POST", vocab_url, data=vocab_payload, headers=headers)

In [9]:
vocab_json = vocab_response.json()

In [10]:
vocab_pages = vocab_json["number_of_pages"]

Querying the API using the sightings endpoint. This query extracts the first page of "sightings" of hateful terms, restricted to the year 2019.

In [18]:
sighting_url = "https://api.hatebase.org/4-4/get_sightings"
sighting_year = "2019"
sighting_payload = "token=" + token + "&year=" + sighting_year + "&format=" + resp_format + "&language=" + lang

In [19]:
sighting_response = requests.request("POST", sighting_url, data=sighting_payload, headers=headers)

In [20]:
sighting_json = sighting_response.json()

In [26]:
sighting_pages2019 = sighting_json["number_of_pages"]

Defining a function that queries every page of an endpoint and stores the results in a DataFrame.

In [34]:
def get_hatebase_vocab(url, vocab_pages):
    """Query pages 1..vocab_pages of `url` and return all results as one DataFrame."""
    frames = []
    for page in range(1, vocab_pages + 1):
        print(page)  # progress indicator
        # Relies on the token, resp_format, lang and headers defined in earlier cells
        payload = "token=" + token + "&format=" + resp_format + "&page=" + str(page) + "&language=" + lang
        response = requests.request("POST", url, data=payload, headers=headers)
        frames.append(pd.DataFrame(response.json()["result"]))
    # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
    return pd.concat(frames) if frames else pd.DataFrame()
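
The paging function above mixes the HTTP request with the loop logic, so it can only be exercised against the live API. As a rough sketch, the loop can be separated from the request by passing in a function that fetches one page. `fetch_all_pages`, `fake_fetch`, and the `delay` parameter are hypothetical names; the optional pause uses the `time` module imported at the top of the notebook, and whether the API needs one is an assumption to check against its usage limits.

```python
import time

def fetch_all_pages(fetch_page, n_pages, delay=0.0):
    """Call fetch_page(page) for pages 1..n_pages and return the combined rows."""
    rows = []
    for page in range(1, n_pages + 1):
        rows.extend(fetch_page(page))
        if delay and page < n_pages:
            # Pause between requests; the right interval depends on the API's limits
            time.sleep(delay)
    return rows

# Stand-in for a real request: each fake page holds two records
def fake_fetch(page):
    return [{"page": page, "item": i} for i in range(2)]

rows = fetch_all_pages(fake_fetch, n_pages=3)
print(len(rows))  # 6
```

In this notebook, `fetch_page` could wrap the POST request and return `response.json()["result"]`, assuming that value is a list of records, and `pd.DataFrame(rows)` would then build the final frame.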

In [35]:
vocab_df = get_hatebase_vocab(vocab_url, vocab_pages)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16


In [38]:
vocab_df.reset_index(drop=True, inplace=True)
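
The index reset matters because each page is converted to its own DataFrame with a zero-based index, so the combined frame carries repeated index labels. A small illustration with placeholder data (the `term` column and its values are made up for the example):

```python
import pandas as pd

# Two stand-in "pages", each with its own 0-based index
page_1 = pd.DataFrame({"term": ["a", "b"]})
page_2 = pd.DataFrame({"term": ["c", "d"]})

combined = pd.concat([page_1, page_2])
print(combined.index.tolist())  # [0, 1, 0, 1]

# drop=True discards the old labels instead of keeping them as a column
combined.reset_index(drop=True, inplace=True)
print(combined.index.tolist())  # [0, 1, 2, 3]
```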