# 02 - Data Wrangling
In this section we extend some of the initial data cleaning steps taken during exploratory data analysis to more thoroughly process the corpus of tweets.

In [19]:
import os
import pandas as pd
import numpy as np
import re

pd.options.display.max_colwidth = 140

In [12]:
# import data
data_directory = os.path.join('..','data','')
csv_path = os.path.join(data_directory, 'twitter_hate_speech.csv')
df_orig = pd.read_csv(csv_path, engine = 'python', delimiter = ',')

# columns re-named for convenience
df_orig.columns = ['id', 'golden','state', 'trusted_judgements','last_judgment','rating','confidence',
                   'created','orig_golden','orig_last_judgement','orig_trusted_judgements','orig_id','orig_state',
                   'updated','orig_is_hate_speech','is_hate_speech_gold','reason','confidence2','id2','text']
df_orig.head(3)

,id,golden,state,trusted_judgements,last_judgment,rating,confidence,created,orig_golden,orig_last_judgement,orig_trusted_judgements,orig_id,orig_state,updated,orig_is_hate_speech,is_hate_speech_gold,reason,confidence2,id2,text
0,853718217,True,golden,86,,The tweet uses offensive language but not hate speech,0.6013,,True,,0.0,615561535.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses offensive language but not hate speech,,1.0,1666196000.0,Warning: penny boards will make you a faggot
1,853718218,True,golden,92,,The tweet contains hate speech,0.7227,,True,,0.0,615561723.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses offensive language but not hate speech,,1.0,429512100.0,Fuck dykes
2,853718219,True,golden,86,,The tweet contains hate speech,0.5229,,True,,0.0,615562039.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses offensive language but not hate speech,,1.0,395623800.0,@sizzurp__ @ILIKECATS74 @yoPapi_chulo @brandonernandez @bootyacid at least i dont look like jefree starr faggot


# Dropping columns
The only columns that may be of use for classification and later analysis are:
- id: Each tweet is identified by a unique 9-digit number. This will be useful as an index.
- text: The actual content of the tweet, including user handles and hashtags.
- rating: The label classifying a tweet as non-offensive, offensive, or hateful.
- confidence: The degree of agreement amongst different 'judges' regarding a tweet's classification.

In [13]:
df = df_orig[['id','text','rating','confidence']]
df = df.set_index('id')
df.head(3)

id,text,rating,confidence
853718217,Warning: penny boards will make you a faggot,The tweet uses offensive language but not hate speech,0.6013
853718218,Fuck dykes,The tweet contains hate speech,0.7227
853718219,@sizzurp__ @ILIKECATS74 @yoPapi_chulo @brandonernandez @bootyacid at least i dont look like jefree starr faggot,The tweet contains hate speech,0.5229


### Recode hate speech classifications
Next, we re-encode the hate speech labels to something more manageable:
- 0: Does not contain offensive language
- 1: Contains offensive language but not hate speech
- 2: Contains hate speech

Only the 'text', 'rating', and 'confidence' columns are kept.

In [17]:
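# keep only the columns needed downstream; .copy() avoids a SettingWithCopyWarning on assignment
df = df[['text','rating','confidence']].copy()
categories = df.rating.unique()
print(categories)
# NOTE: unique() returns labels in order of first appearance, so this positional
# mapping assumes the order: offensive, hateful, non-offensive
df['rating'] = df['rating'].replace(categories, [1,2,0])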
df.head(3)

[1 2 0]


id,text,rating,confidence
853718217,Warning: penny boards will make you a faggot,1,0.6013
853718218,Fuck dykes,2,0.7227
853718219,@sizzurp__ @ILIKECATS74 @yoPapi_chulo @brandonernandez @bootyacid at least i dont look like jefree starr faggot,2,0.5229
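
The positional recoding above depends on the order in which labels first appear in the data. A minimal sketch of an alternative that maps each label explicitly is shown below. The two offensive/hateful label strings are taken from the output above; the wording of the non-offensive label is an assumption for illustration, as are the names `label_map` and `toy`.

```python
import pandas as pd

# Explicit label -> code mapping. The first key is a hypothetical wording;
# check df.rating.unique() on the raw data for the real string.
label_map = {
    'The tweet is not offensive': 0,
    'The tweet uses offensive language but not hate speech': 1,
    'The tweet contains hate speech': 2,
}

# Toy frame standing in for the real corpus
toy = pd.DataFrame({'rating': [
    'The tweet uses offensive language but not hate speech',
    'The tweet contains hate speech',
    'The tweet is not offensive',
]})

# map() leaves NaN for any label missing from the dictionary,
# which makes unexpected labels easy to detect
toy['rating_code'] = toy['rating'].map(label_map)
unmapped = int(toy['rating_code'].isna().sum())
print(toy['rating_code'].tolist(), unmapped)
```

With a dictionary, the result no longer changes if the rows are reordered, and an unexpected label shows up as a missing value rather than being silently miscoded.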


# Cleaning up the text
The tweets in this dataset are raw and require heavy cleaning before any form of analysis. As a form of electronic communication, tweets contain user handles, hashtags, URLs, and emoticons, which need to be addressed in addition to standard NLP processing techniques such as lemmatization. Furthermore, because Twitter is a casual messaging and posting medium, special care must also be paid to misspellings, spacing errors, abbreviations, slang, and other anomalies arising from looser notions of grammar and syntax (all of which are more extreme in offensive and racist tweets).

In [18]:
pd.options.display.max_colwidth = 140
df.head(5)

id,text,rating,confidence
853718217,Warning: penny boards will make you a faggot,1,0.6013
853718218,Fuck dykes,2,0.7227
853718219,@sizzurp__ @ILIKECATS74 @yoPapi_chulo @brandonernandez @bootyacid at least i dont look like jefree starr faggot,2,0.5229
853718220,"""@jayswaggkillah: ""@JacklynAnnn: @jayswaggkillah Is a fag"" jackie jealous"" Neeeee",2,0.5184
853718221,@Zhugstubble You heard me bitch but any way I'm back th texas so wtf u talking about bitch ass nigga,1,0.5185


### Capitalization
If the tone of a text can vary depending on how it's capitalized, then the rated offensiveness of a tweet might similarly be affected by capitalization. However, given the challenges of working with a small corpus, the gains from simplifying the text to avoid overfitting outweigh the cost of losing clues about tone.

In [25]:
df['text'] = df.text.str.lower()
df.head(3)

id,text,rating,confidence
853718217,warning: penny boards will make you a faggot,1,0.6013
853718218,fuck dykes,2,0.7227
853718219,@sizzurp__ @ilikecats74 @yopapi_chulo @brandonernandez @bootyacid at least i dont look like jefree starr faggot,2,0.5229


### Special tokens and characters
Below we define some simple functions to clean the tweets:
- replace_user(): replaces all user handles with a placeholder
- replace_url(): replaces all URLs with a placeholder
- erase_special(): removes punctuation and undefined characters. The one exception is the apostrophe.
- erase_numbers(): removes meaningless patterns of digits resulting from encoding errors

Like capitalization, punctuation could be useful data for an NLP task, but we discard it as well on account of the corpus size (with the bonus effect of removing many other anomalies).

In [99]:
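def replace_user(tweet):
    # replace each @handle (and any trailing whitespace) with a placeholder
    return re.sub(r'(@\w+\s*)', r'<user> ', tweet)

def replace_url(tweet):
    # replace each http/https link with a placeholder
    return re.sub(r'(https?://\S*)', r'<url> ', tweet)

def erase_special(tweet):
    # replace punctuation and undefined characters with a space;
    # apostrophes inside words are kept
    # NOTE: `$` and `[|]` are unescaped here, so they match end-of-string and a
    # literal pipe rather than dollar signs and square brackets
    regex = r'#|&|\(|\)|\"|\.|\?|!|,|;|:|(�\S*\d*)|(_*UNDEF)|\\n|\s\'|\'\s|-|/|$|%|\n|{|}|[|]|~'
    return re.sub(regex, ' ', tweet)

def erase_numbers(tweet):
    # remove digit patterns left behind by encoding errors (128xxx and 82xx)
    regex = r'(128\d{3})|(82\d{2})'
    return re.sub(regex, ' ', tweet)

def normalize(tweet):
    # lowercase every token and collapse runs of whitespace into single spaces
    return ' '.join(token.lower() for token in tweet.split())

def clean_tweet(tweet):
    # apply the cleaning steps in order; normalize() is not called here
    tweet = replace_user(tweet)
    tweet = replace_url(tweet)
    tweet = erase_numbers(tweet)
    tweet = erase_special(tweet)
    return tweet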
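
Some characters in the `erase_special()` pattern are regex metacharacters, and whether they are escaped changes what gets removed. The following is a small illustrative check on a made-up string (the `sample` text and variable names are assumptions, not part of the dataset):

```python
import re

sample = "cost $5 [approx]"

# Unescaped `$` is an end-of-string anchor: the substitution appends a
# space at the end and leaves the dollar sign in place.
anchored = re.sub(r'$', ' ', sample)

# `[|]` is a character class containing only the pipe character,
# so square brackets in the text are untouched.
pipe_class = re.sub(r'[|]', ' ', sample)

# Escaped forms match the literal dollar sign and brackets.
escaped = re.sub(r'\$|\[|\]', ' ', sample)

print(repr(anchored))    # 'cost $5 [approx] '
print(repr(pipe_class))  # 'cost $5 [approx]'
print(repr(escaped))     # 'cost  5  approx '
```

This is consistent with square brackets surviving in some of the cleaned tweets shown further down.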

In [109]:
start = 400
offset = 100

df_clean = df.copy()
df_clean['text'] = df_clean['text'].apply(clean_tweet)
df_clean.iloc[start:start+offset,:]

id,text,rating,confidence
853719770,<user> <user> <user> how about blacks parents don't want white teachers with guns around their kids racist rapist cops are enough,1,0.3454
853719771,what's spooky blacks instagram _,0,1.0000
853719772,<user> <user> <user> <user> hope you say that to blacks too depend girl don't have a chance sorry i just took a crap,0,0.6759
853719773,<user> <user> <user> yes white america is a lot better than muslim s arabia blacks are way more racist than whites,2,0.6672
853719774,<user> ha ha enjoy your time off blackys pressie all under tree waiting on santa_,0,1.0000
853719775,this white bitch prolly doesn't have one black female friend and if she does it is some weird emo black trying to tell me about blacks,2,0.6578
853719776,blacks can't be racist we welcomed racist murderous land thieves to our mother land and what did they do they invented racism and races,0,1.0000
853719777,if the chinese can go to africa and profit then there's nothing wrong with blacks going back to the motherland for profit reawakening,0,0.6719
853719778,<user> no you think i'm prettier and better that's a fucked up mentality i think i struggle just like every one of us blacks,1,1.0000
853719779,white south africans should write us blacks an open letter expressing their real feelings and sign it with hate your privileged,0,0.3420


### Duplicate tweets
A significant number of tweets are duplicated within this dataset, for reasons that are unclear. This is particularly problematic for word vector training, where the associations between pairs of words would be grossly overstated due to the duplicated data.

In [110]:
pd.DataFrame(df_clean.text.value_counts()).head(10)

,text
<user> shut up nigger,34
amid economic recovery school districts desperate for bus drivers when unemployment is high school district <url>,27
1 2 3 1 2 3 4 how many niggers are in my store,21
[drum and bass] btsm x lektrique religion muzzy remix <url> <url>,21
123 123 4 how many niggers are in my store,14
1223 4 how many niggers are in my store vine by funny vines <user> 1223 4 how many niggers are in my <url>,14
governors have no right to reduce n18 000 minimum wage ngige <url> via <user>,14
1 2 3 1 2 3 4 how many niggers are in my store,14
imagine when a cpl holder shoots a towel head with an ak to stop a major attack what will the left say then <url>,13
<user> wtf steve haters calling him a nigga i got your back <url>,12


Removing duplicate tweets (and keeping the original) removes almost 1,500 tweets!

In [113]:
df_clean = df_clean.drop_duplicates(subset='text')
print(df_clean.shape)
df_clean.head()

(13086, 3)


id,text,rating,confidence
853718217,warning penny boards will make you a faggot,1,0.6013
853718218,fuck dykes,2,0.7227
853718219,<user> <user> <user> <user> <user> at least i dont look like jefree starr faggot,2,0.5229
853718220,<user> <user> <user> is a fag jackie jealous neeeee,2,0.5184
853718221,<user> you heard me bitch but any way i'm back th texas so wtf u talking about bitch ass nigga,1,0.5185
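
Two of the entries in the duplicate counts above look identical when printed, which suggests that some tweets may differ only in whitespace and therefore survive `drop_duplicates()`. A minimal sketch, on made-up data, of collapsing whitespace before removing duplicates (the `toy` frame and its contents are assumptions for illustration):

```python
import pandas as pd

# Toy frame: the first two rows differ only in spacing
toy = pd.DataFrame({
    'text': ['how  many apples', 'how many apples ', 'fresh pears'],
    'rating': [0, 0, 1],
})

# Split on any run of whitespace and re-join with single spaces,
# which also strips leading and trailing whitespace
toy['text'] = toy['text'].str.split().str.join(' ')

# Keep the first occurrence of each distinct tweet
deduped = toy.drop_duplicates(subset='text', keep='first')
print(deduped)
```

Whether this changes the number of rows removed from the real corpus would need to be checked against the data.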


# Saving Data
At this point, the tweets are sufficiently clean for classification algorithms relying on word vector embeddings. We don't apply lemmatization for these models simply because we expect a neural network to be able to detect whether two words sharing the same base have the same semantic meaning. If two such words are judged to be essentially the same, then their embedding vectors will reflect the similarity, and overfitting can be avoided.

In [116]:
import pickle as pkl

destination = os.path.join('..','data','dataframe_clean')
with open(destination, 'wb') as file_out:
    pkl.dump(df_clean, file_out)
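
As a sanity check, a pickled dataframe can be loaded back and compared with the original. The sketch below uses a toy frame and a temporary directory rather than the project's data folder; the file name `dataframe_toy` is hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {'text': ['a b', 'c d'], 'rating': [0, 1]},
    index=pd.Index([1, 2], name='id'),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'dataframe_toy')
    # write the frame exactly as in the cell above
    with open(path, 'wb') as file_out:
        pickle.dump(toy, file_out)
    # read it back in a later notebook
    with open(path, 'rb') as file_in:
        restored = pickle.load(file_in)

print(restored.equals(toy))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.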

However, for classification algorithms like Multinomial Naive Bayes, where tweets are represented as sparse word count vectors, an un-lemmatized corpus would explode the dimensionality of these tweet vectors and pose a serious overfitting problem. Thus, we output a dataframe with lemmatized tweets as well.
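
To make the dimensionality argument concrete, the sketch below counts the vocabulary of a tiny made-up corpus before and after lemmatization. The `toy_lemmas` dictionary is a hand-written stand-in for a real lemmatizer and is purely illustrative.

```python
# Hypothetical lemma lookup standing in for a real lemmatizer
toy_lemmas = {'boards': 'board', 'made': 'make', 'making': 'make'}

docs = [
    'penny boards make noise',
    'a penny board made noise',
    'making noise with boards',
]

# Each distinct token becomes one dimension of a word count vector
vocab_raw = {tok for doc in docs for tok in doc.split()}
vocab_lemma = {toy_lemmas.get(tok, tok) for doc in docs for tok in doc.split()}

print(len(vocab_raw), len(vocab_lemma))  # 9 6
```

Inflected forms that collapse onto one lemma share a single dimension, so the count vectors become shorter and less sparse.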

In [120]:
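import spacy

# 'en' is a spaCy 2.x shortcut; newer releases require a full model name
# such as 'en_core_web_sm'
nlp = spacy.load('en')

def lemmatize(tweet):
    # join the lemma of every token with single spaces
    return ' '.join(token.lemma_ for token in nlp(tweet))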

In [121]:
df_lemma = df_clean.copy()
df_lemma['text'] = df_lemma['text'].apply(lemmatize)
df_lemma.head(3)

id,text,rating,confidence
853718217,warning penny board will make -PRON- a faggot,1,0.6013
853718218,fuck dyke,2,0.7227
853718219,< user > < user > < user > < user > < user > at least i do not look like jefree starr faggot,2,0.5229
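
In the output above, the tokenizer has split each placeholder into three tokens (`< user >`), so a count-based model would see `<`, `user`, and `>` as separate words. One possible remedy, sketched here with a hypothetical helper `rejoin_placeholders()`, is to stitch the placeholders back together after lemmatization:

```python
import re

def rejoin_placeholders(text):
    # collapse '< user >' and '< url >' back into single tokens
    return re.sub(r'<\s*(user|url)\s*>', r'<\1>', text)

restored = rejoin_placeholders('< user > < url > at least i do not')
print(restored)  # <user> <url> at least i do not
```

Alternatively, the placeholders could be chosen so that the tokenizer keeps them intact, for example a plain alphabetic token; which option is preferable depends on the downstream vectorizer.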


In [123]:
destination2 = os.path.join('..','data','dataframe_lemma')
with open(destination2, 'wb') as file_out:
    pkl.dump(df_lemma, file_out)