# 02 - Data Wrangling
In this section we extend some of the initial data cleaning steps taken during exploratory data analysis to more thoroughly process the corpus of tweets.

In [49]:
import os
import pandas as pd
import numpy as np
import re
import pickle as pkl

pd.options.display.max_colwidth = 140
pd.options.display.max_info_columns = 140

In [2]:
# import data
data_directory = os.path.join('..','data','')
csv_path = os.path.join(data_directory, 'twitter_hate_speech.csv')
df_orig = pd.read_csv(csv_path, engine = 'python', delimiter = ',')

# columns re-named for convenience
df_orig.columns = ['id', 'golden','state', 'trusted_judgements','last_judgment','rating','confidence',
                   'created','orig_golden','orig_last_judgement','orig_trusted_judgements','orig_id','orig_state',
                   'updated','orig_is_hate_speech','is_hate_speech_gold','reason','confidence2','id2','text']
df_orig.head(3)

Unnamed: 0,id,golden,state,trusted_judgements,last_judgment,rating,confidence,created,orig_golden,orig_last_judgement,orig_trusted_judgements,orig_id,orig_state,updated,orig_is_hate_speech,is_hate_speech_gold,reason,confidence2,id2,text
0,853718217,True,golden,86,,The tweet uses offensive language but not hate speech,0.6013,,True,,0.0,615561535.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses offensive language but not hate speech,,1.0,1666196000.0,Warning: penny boards will make you a faggot
1,853718218,True,golden,92,,The tweet contains hate speech,0.7227,,True,,0.0,615561723.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses offensive language but not hate speech,,1.0,429512100.0,Fuck dykes
2,853718219,True,golden,86,,The tweet contains hate speech,0.5229,,True,,0.0,615562039.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses offensive language but not hate speech,,1.0,395623800.0,@sizzurp__ @ILIKECATS74 @yoPapi_chulo @brandonernandez @bootyacid at least i dont look like jefree starr faggot


# Dropping columns
The only columns that may be of use for classification and later analysis are:
- id: Each tweet is identified by a unique 9-digit number. This will be useful as an index.
- text: The actual content of the tweet, including user handles and hashtags.
- rating: The label classifying a tweet as non-offensive, offensive, or hateful.
- confidence: The degree of agreement amongst different 'judges' regarding a tweet's classification.

In [3]:
df = df_orig[['id','text','rating','confidence']]
df = df.set_index('id')
df.head(3)

id,text,rating,confidence
853718217,Warning: penny boards will make you a faggot,The tweet uses offensive language but not hate speech,0.6013
853718218,Fuck dykes,The tweet contains hate speech,0.7227
853718219,@sizzurp__ @ILIKECATS74 @yoPapi_chulo @brandonernandez @bootyacid at least i dont look like jefree starr faggot,The tweet contains hate speech,0.5229


### Recode hate speech classifications
Next, we re-encode the hate speech labels to something more manageable:
- 0: Does not contain offensive language
- 1: Contains offensive language but not hate speech
- 2: Contains hate speech

Only the 'text', 'rating', and 'confidence' columns are kept.

In [4]:
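df = df[['text','rating','confidence']].copy()  # copy to avoid modifying a slice of df_orig
categories = df.rating.unique()
print(categories)
# map each label explicitly instead of relying on the order returned by unique()
label_map = {'The tweet is not offensive': 0,
             'The tweet uses offensive language but not hate speech': 1,
             'The tweet contains hate speech': 2}
df['rating'] = df['rating'].map(label_map)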
df.head(3)

['The tweet uses offensive language but not hate speech'
 'The tweet contains hate speech' 'The tweet is not offensive']


id,text,rating,confidence
853718217,Warning: penny boards will make you a faggot,1,0.6013
853718218,Fuck dykes,2,0.7227
853718219,@sizzurp__ @ILIKECATS74 @yoPapi_chulo @brandonernandez @bootyacid at least i dont look like jefree starr faggot,2,0.5229


# Cleaning up the text
The tweets in this dataset are raw and require heavy cleaning before any analysis. As a form of electronic communication, tweets contain user handles, hashtags, URLs, and emoticons, which need to be addressed in addition to standard NLP preprocessing such as lemmatization. Furthermore, because Twitter is a casual messaging and posting medium, anomalies frequently arise from looser notions of grammar and syntax (all of which are more extreme in offensive and racist tweets).

In [5]:
pd.options.display.max_colwidth = 140
df.head(5)

id,text,rating,confidence
853718217,Warning: penny boards will make you a faggot,1,0.6013
853718218,Fuck dykes,2,0.7227
853718219,@sizzurp__ @ILIKECATS74 @yoPapi_chulo @brandonernandez @bootyacid at least i dont look like jefree starr faggot,2,0.5229
853718220,"""@jayswaggkillah: ""@JacklynAnnn: @jayswaggkillah Is a fag"" jackie jealous"" Neeeee",2,0.5184
853718221,@Zhugstubble You heard me bitch but any way I'm back th texas so wtf u talking about bitch ass nigga,1,0.5185


### Special tokens and characters
Below we define some simple functions to clean the tweets:
- replace_user(): Replaces all user handles with a placeholder
- clean_tweet(): Replaces URLs and certain special characters with spaces
- lemmatize(): Replaces all tokens with their lemma. For classification algorithms where tweets are represented as sparse term frequency vectors, an un-lemmatized corpus would explode the dimensionality of these vectors and pose a serious overfitting problem.

In [113]:
import re
import spacy

# 'en' is a spaCy 2.x shortcut link; spaCy 3 requires the full model name (e.g. 'en_core_web_sm')
nlp = spacy.load('en')

replace_user = lambda tweet: re.sub(r'(@\w+\s*)', r'TWITTER_HANDLE ', tweet)

regex = r'#|&|\(|\)|\"|(https?://\S*)|(�\S*\d*)|(128\d{3})|(_*UNDEF)|x\d+\.?\d*|X\d+'
clean_tweet = lambda tweet: re.sub(regex, ' ', tweet)

def lemmatize(tweet):
    # join the lemma of every token with single spaces
    return ' '.join(token.lemma_ for token in nlp(tweet))
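
The regex-based helpers can be tried in isolation on a made-up tweet. The sketch below is only an illustration: it reuses the handle pattern and the first few alternatives of the cleaning pattern, omits the dataset-specific encoding-debris alternatives, and skips lemmatization so that it runs without spaCy. The names `strip_noise` and `mask_handles`, the sample tweet, and the final whitespace collapse are assumptions added here, not part of the notebook.

```python
import re

# Same handle pattern as replace_user; the noise pattern keeps only the
# generic alternatives of the notebook's cleaning regex.
HANDLE_RE = re.compile(r'(@\w+\s*)')
NOISE_RE = re.compile(r'#|&|\(|\)|\"|(https?://\S*)')

def strip_noise(tweet):
    """Replace hashtag marks, ampersands, parentheses, quotes and URLs with spaces."""
    return NOISE_RE.sub(' ', tweet)

def mask_handles(tweet):
    """Replace every @handle (and any trailing whitespace) with a placeholder."""
    return HANDLE_RE.sub('TWITTER_HANDLE ', tweet)

# Made-up tweet used purely for illustration.
sample = 'RT @some_user: "so true" (lol) #mood https://example.com/abc'

# Same order as the notebook: clean first, then mask handles.
cleaned = mask_handles(strip_noise(sample))

# Collapse the runs of spaces left behind by the substitutions
# (an extra step added here for readability).
cleaned = re.sub(r'\s+', ' ', cleaned).strip()
print(cleaned)
```

Cleaning before masking handles matters little for this sample, but it mirrors the order used in the next cell.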

In [114]:
df_clean = df.copy()
df_clean['text'] = df_clean['text'].apply(clean_tweet)
df_clean['text'] = df_clean['text'].apply(clean_tweet) # applied twice, as some anomalies remain after the first pass
df_clean['text'] = df_clean['text'].apply(replace_user)
df_clean['text'] = df_clean['text'].apply(lemmatize)

In [115]:
start = 40
offset = 10
df_clean.iloc[start:start+offset,:]

id,text,rating,confidence
853718257,"sit alone watch white chicks , no pant , fuzzy blanket , tea || turn up",0,1.0
853718258,-PRON- want to go to a haunt house maybe get possesed y'know just to see if ghost be real ; ;,0,1.0
853718259,$ 10 buck the browns get johnny ! ; ; ;,0,1.0
853718260,-PRON- be with a bitch with a mustash for a year and a half ? wtf be wrong itch -PRON-,1,1.0
853718261,"rt twitter_handle : huge ass , small waist amp ; okay face amp ; bitch really think -PRON- famous",1,0.962
853718262,bobby flay in this bitch,1,0.9613
853718263,-PRON- be never gon na be ok with -PRON- nigga around alot of bitch while with -PRON- boy . cause -PRON- be once that female -PRON- boy ...,1,0.8018
853718264,where the bad bitch at ? lol twitter_handle,1,0.9618
853718265,rt twitter_handle : -PRON- just can not help but to hate -PRON- . even though -PRON- never intentionally do anything to -PRON- -PRON- be...,1,0.6837
853718266,"rt twitter_handle : -PRON- female overlook -PRON- geek . when -PRON- take these glass off , -PRON- be no longer clark kent . -PRON- go s...",1,0.9608


### Duplicate tweets
A significant number of tweets appear more than once in this dataset. This is particularly problematic for word vector training, where repeated tweets would grossly overstate the associations between the words they contain.
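
To make the concern concrete, the following sketch counts how often pairs of words share a tweet, before and after removing exact duplicates. It is a simplified stand-in for the co-occurrence statistics that word vector training relies on, not the training procedure itself; the corpus and the `pair_counts` helper are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_counts(tweets):
    """Count how often each unordered pair of distinct words shares a tweet."""
    counts = Counter()
    for tweet in tweets:
        words = sorted(set(tweet.split()))
        counts.update(combinations(words, 2))
    return counts

# Made-up corpus: the first tweet appears three times.
corpus = ['bus driver shortage', 'bus driver shortage', 'bus driver shortage',
          'driver wanted']

# dict.fromkeys keeps the first occurrence of each tweet, in order.
deduped = list(dict.fromkeys(corpus))

before = pair_counts(corpus)
after = pair_counts(deduped)
print(before[('bus', 'driver')], after[('bus', 'driver')])
```

In the duplicated corpus the pair ("bus", "driver") is counted once per copy, so a single repeated tweet dominates the statistics for those words.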

In [116]:
pd.DataFrame(df_clean.text.value_counts()).head(30)

Unnamed: 0,text
twitter_handle shut up nigger,33
"amid economic recovery , school district desperate for bus drivers : when unemployment be high , school district ...",27
"1 , 2 , 3 , 1 , 2 , 3 ... 4 how many nigger be in -PRON- store",21
[ drum and bass ] btsm x lektrique - religion muzzy remix -,21
"-PRON- be truly amiable twitter_handle . \n 2 day until christmas , may -PRON- please \n follow twitter_handle to make -PRON- exquisite ? stay well pal ..",17
"governor have no right to reduce n18,000 minimum wage ngige via twitter_handle",14
"1 , 2 , 3 , 1 , 2 , 3 .... 4 how many nigger be in -PRON- store",14
123 123 4 how many nigger be in -PRON- store,14
1223 4 how many nigger be in -PRON- store vine by funny vines : twitter_handle : 1223 4 how many nigger be in -PRON- ...,14
imagine when a cpl holder shoot a towel head with an ak to stop a major attack . what will the left say then ?,13


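Dropping exact duplicates of the cleaned text (keeping the first occurrence of each) removes about 1,500 tweets! Near-duplicates that differ by even a single character are not affected.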

In [117]:
df_clean = df_clean.drop_duplicates(subset='text')
print(df_clean.shape)
df_clean.head()

(13086, 3)


id,text,rating,confidence
853718217,warning : penny board will make -PRON- a faggot,1,0.6013
853718218,fuck dyke,2,0.7227
853718219,twitter_handle twitter_handle twitter_handle twitter_handle twitter_handle at least i do not look like jefree starr faggot,2,0.5229
853718220,twitter_handle : twitter_handle : twitter_handle be a fag jackie jealous neeeee,2,0.5184
853718221,twitter_handle -PRON- hear -PRON- bitch but any way -PRON- be back th texas so wtf u talk about bitch ass nigga,1,0.5185


# Create train and test sets and save data
Of the roughly 13,000 tweets, we hold out 30% for the test set and ensure that the proportions of non-offensive, offensive, and hateful tweets are the same in the training and test sets.
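
As a rough illustration of what a stratified split does, the sketch below splits each class separately so that both subsets keep the same label mix. It is a hand-rolled toy, not scikit-learn's implementation; the `stratified_split` helper and the label counts are assumptions made up for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=1):
    """Return (train_idx, test_idx) with roughly the same label mix in each."""
    rng = random.Random(seed)

    # Group row positions by their class label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    # Shuffle within each class and hold out the same fraction of each.
    train_idx, test_idx = [], []
    for label, indices in by_label.items():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up, imbalanced labels: 60% class 0, 30% class 1, 10% class 2.
labels = [0] * 60 + [1] * 30 + [2] * 10
train_idx, test_idx = stratified_split(labels)

train_mix = Counter(labels[i] for i in train_idx)
test_mix = Counter(labels[i] for i in test_idx)
print(train_mix, test_mix)
```

Without stratification, a rare class could by chance end up over- or under-represented in the test set, which would distort per-class evaluation.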

In [119]:
from sklearn.model_selection import train_test_split

X = df_clean.text
y = df_clean.rating

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, stratify = y, random_state = 1)

print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

data_all = dict(df_orig = df, df_clean = df_clean, X_train = X_train, X_test = X_test, y_train = y_train,
               y_test = y_test)

X_train shape: (9160,)
X_test shape: (3926,)
y_train shape: (9160,)
y_test shape: (3926,)


In [120]:
destination = os.path.join('..','data','data_all')
with open(destination, 'wb') as file_out:
    pkl.dump(data_all, file_out)
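
A later notebook could reload the saved dictionary along the following lines. This is a minimal round-trip sketch under stated assumptions: it writes to a temporary directory rather than the project's data folder, and the dictionary contents are made up stand-ins for `data_all`.

```python
import os
import pickle as pkl
import tempfile

# Stand-in for the notebook's data_all dictionary (contents are made up).
data_all = {'X_train': ['first tweet', 'second tweet'], 'y_train': [0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    destination = os.path.join(tmp_dir, 'data_all')

    # Save in binary mode, as in the cell above.
    with open(destination, 'wb') as file_out:
        pkl.dump(data_all, file_out)

    # Reload in binary mode; only unpickle files you created or trust.
    with open(destination, 'rb') as file_in:
        restored = pkl.load(file_in)

print(restored == data_all)
```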