## 1. Data Preprocessing

 ##### Deciding which package to use:
 - spaCy

 spaCy outperforms NLTK in **word tokenization** and **part-of-speech tagging**. NLTK is faster at **sentence tokenization** because it only makes simple attempts at splitting text into sentences, whereas spaCy constructs a syntactic tree for each sentence, a more robust method that yields more information about the text.

 Here we assume that the out-of-bag samples are all in English, so spaCy's English language class can be used.


 ![https://www.thedataincubator.com/wp-content/uploads/timing.png](https://www.thedataincubator.com/wp-content/uploads/timing.png)

In [1]:
import pandas as pd

### Mounting Drive and importing dataset
from google.colab import drive
drive.mount('/content/drive')

path = "/content/drive/My Drive/BT4222 Group Project/Final Project/Codes/labeled_data.csv"
twitter_hate = pd.read_csv(path)

Mounted at /content/drive


In [2]:
### Extract and review dataset
twitter_hate.head(20)

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
5,5,3,1,2,0,1,"!!!!!!!!!!!!!!!!!!""@T_Madison_x: The shit just..."
6,6,3,0,3,0,1,"!!!!!!""@__BrighterDays: I can not just sit up ..."
7,7,3,0,3,0,1,!!!!&#8220;@selfiequeenbri: cause I'm tired of...
8,8,3,0,3,0,1,""" &amp; you might not get ya bitch back &amp; ..."
9,9,3,1,2,0,1,""" @rhythmixx_ :hobbies include: fighting Maria..."


### Cleaning Tweets using spaCy and scikit-learn

https://www.kaggle.com/code/thebrownviking20/topic-modelling-with-spacy-and-scikit-learn/notebook

In [3]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

In [4]:
import string
from tqdm import tqdm
import re 

# add 'rt' to the stopwords: it only marks a retweet and adds no meaning to the tweet
stopwords = list(STOP_WORDS) + ['rt']
punctuations = list(string.punctuation)
punctuations.remove('#') # keep '#' out of the punctuation list, as hashtags add meaning to a tweet
parser = English() # tokenizer-only English pipeline (no trained model is loaded)

def spacy_tokenizer(sentence):
    # input: raw tweet text
    # output: lowercased tweet, with stopwords and punctuation tokens removed
    mytokens = parser(sentence)
    # "-PRON-" is the placeholder lemma that spaCy v2 assigns to pronouns;
    # for those tokens keep the lowercased original text instead
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    return " ".join(mytokens)

tqdm.pandas()
twitter_hate["tweets_cleaned_v1"] = twitter_hate["tweet"].progress_apply(spacy_tokenizer)

100%|██████████| 24783/24783 [00:07<00:00, 3126.98it/s]
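
The filtering step inside `spacy_tokenizer` can be tried in isolation. The sketch below is only an illustration under stated assumptions, not the notebook's pipeline: it works on a hand-split token list instead of spaCy's tokenizer, it uses a tiny stand-in stopword set in place of `STOP_WORDS`, and `filter_tokens` is a name introduced here.

```python
import string

# Assumption: a tiny stand-in stopword set; the notebook uses spaCy's STOP_WORDS plus 'rt'
stopwords = {"a", "the", "is", "rt"}

# All ASCII punctuation except '#', mirroring the notebook's punctuation list
punctuations = set(string.punctuation) - {"#"}

def filter_tokens(tokens):
    """Lowercase the tokens, then drop stopwords and punctuation tokens."""
    lowered = [t.lower().strip() for t in tokens]
    return " ".join(t for t in lowered if t not in stopwords and t not in punctuations)

# Assumption: hand-split tokens, roughly how a tokenizer might split the text
tokens = ["RT", "!", "The", "#", "weather", "is", "cold", "."]
print(filter_tokens(tokens))
```

Because `'#'` is excluded from the punctuation set, it survives this step while `!` and `.` are dropped.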


In [5]:
tqdm.pandas()

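### removes the following from each tweet:
    ### @user_names
    ### URL links
    ### Numbers
    ### underscores, together with the characters that follow them
    ### any remaining punctuation and symbols (including '#')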

twitter_hate["tweets_cleaned"] = twitter_hate["tweets_cleaned_v1"].progress_apply(lambda x: re.sub(r"(_[A-Za-z0-9-_]+)|(@[A-Za-z0-9]+)|[^\w\s]|http\S+|[0-9]", "",x))

twitter_hate.head(1000) ### visual check


100%|██████████| 24783/24783 [00:00<00:00, 165471.33it/s]


Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,tweets_cleaned_v1,tweets_cleaned
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,@mayasolovely woman complain cleaning house am...,woman complain cleaning house amp man trash
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,@mleew17 boy dats cold ... tyga dwn bad cuffin...,boy dats cold tyga dwn bad cuffin dat hoe st...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,@urkindofbrand dawg @80sbaby4life fuck bitch s...,dawg fuck bitch start cry confused shit
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,@c_g_anderson @viva_based look like tranny,look like tranny
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,@shenikaroberts shit hear true faker bitch tol...,shit hear true faker bitch told ya
...,...,...,...,...,...,...,...,...,...
995,1017,3,0,3,0,1,&#128514;&#128514;&#128514;&#128514; RT @SMASH...,# 128514;&#128514;&#128514;&#128514 @smashavel...,murda sucking bitches howdhow
996,1018,3,0,3,0,1,&#128514;&#128514;&#128514;&#128514; bitch if ...,# 128514;&#128514;&#128514;&#128514 bitch hobb...,bitch hobbit need let know right
997,1019,3,0,2,1,1,&#128514;&#128514;&#128514;&#128514; these fol...,# 128514;&#128514;&#128514;&#128514 folks bad ...,folks bad talk trash
998,1020,6,0,6,0,1,&#128514;&#128514;&#128514;&#128514;&#128514; ...,# 128514;&#128514;&#128514;&#128514;&#128514 b...,brittany bitch u dog man
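
The `amp` token in row 0 of the cleaned output is what remains of the HTML entity `&amp;` after its punctuation is stripped. The sketch below applies the same regular expression to a made-up sample string and shows one possible remedy that is not part of the original notebook: decoding entities with `html.unescape` first. The helper names `strip_noise` and `strip_noise_unescaped` are hypothetical, and the whitespace collapsing is an extra step added here only to make the output easier to read.

```python
import html
import re

# Same pattern as in the cleaning cell above
PATTERN = re.compile(r"(_[A-Za-z0-9-_]+)|(@[A-Za-z0-9]+)|[^\w\s]|http\S+|[0-9]")

def strip_noise(text):
    """Apply the notebook's regex, then collapse leftover whitespace (extra step)."""
    return re.sub(r"\s+", " ", PATTERN.sub("", text)).strip()

def strip_noise_unescaped(text):
    """Hypothetical variant: decode HTML entities such as &amp; before cleaning."""
    return strip_noise(html.unescape(text))

# Made-up sample containing a user name, an entity, a URL and digits
sample = "@c_g_anderson look &amp; see http://t.co/abc 123"
print(strip_noise(sample))            # 'amp' is left behind
print(strip_noise_unescaped(sample))  # the entity is decoded to '&' and removed
```

In the notebook itself, the decoding would have to run on the raw `tweet` column before `spacy_tokenizer`, since the tokenizer already strips the `&` and `;` around `amp`.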


In [6]:
def bucket(x):
  # input: class label (0 = hate speech, 1 = offensive language, 2 = neither)
  # output: 1 for hate speech or offensive language, 0 for neutral
  if x == 2:
    return 0
  return 1

twitter_hate["class"] = twitter_hate['class'].progress_apply(bucket)

100%|██████████| 24783/24783 [00:00<00:00, 665263.59it/s]


In [7]:
### Extract the tweet (tweets_cleaned) and bucketed class (class) into a new dataframe
twitter_cleaned = twitter_hate[["tweets_cleaned","class"]]

twitter_cleaned.head(100)

Unnamed: 0,tweets_cleaned,class
0,woman complain cleaning house amp man trash,0
1,boy dats cold tyga dwn bad cuffin dat hoe st...,1
2,dawg fuck bitch start cry confused shit,1
3,look like tranny,1
4,shit hear true faker bitch told ya,1
...,...,...
95,going school sucks dick hoes attend,1
96,way fuck yo bitch year old,1
97,come bring food car retard,1
98,richnow hella tinder hoes friend anymore chil...,1


In [8]:
### Alternative way of representing class to allow running Neural Network:
### one indicator column per bucket ("neutral" and "hate")
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that pandas raises when columns are written into a slice of twitter_hate
twitter_cleaned = twitter_cleaned.assign(
    hate=(twitter_cleaned["class"] == 1).astype(int),
    neutral=(twitter_cleaned["class"] == 0).astype(int),
)

twitter_cleaned = twitter_cleaned[["tweets_cleaned","neutral","hate"]]

twitter_cleaned.head()


Unnamed: 0,tweets_cleaned,neutral,hate
0,woman complain cleaning house amp man trash,1,0
1,boy dats cold tyga dwn bad cuffin dat hoe st...,0,1
2,dawg fuck bitch start cry confused shit,0,1
3,look like tranny,0,1
4,shit hear true faker bitch told ya,0,1
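
The two label transformations (three classes to two buckets, then two buckets to indicator columns) can be traced on a handful of values. This is a plain-Python sketch for illustration only; `one_hot` is a name introduced here and the input list is made up.

```python
def bucket(label):
    """Collapse the three-way label: 2 (neither) -> 0, anything else -> 1."""
    return 0 if label == 2 else 1

def one_hot(binary_label):
    """Return the (neutral, hate) indicator pair for a bucketed label."""
    return (1, 0) if binary_label == 0 else (0, 1)

# Made-up labels using the dataset's coding (0, 1, 2)
original = [2, 1, 1, 0]
bucketed = [bucket(x) for x in original]
pairs = [one_hot(b) for b in bucketed]

print(bucketed)
print(pairs)
```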


In [9]:
# save the cleaned tweets and their neutral/hate labels (twitter_cleaned) for later steps
import pickle
fp = "/content/drive/My Drive/BT4222 Group Project/Final Project/Codes/Jar of Pickles/twitter_cleaned.pkl"
with open(fp,mode="wb") as f:
    pickle.dump(obj=(twitter_cleaned),
                file=f)
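
To read the saved object back in a later notebook, `pickle.load` mirrors the `pickle.dump` call. The sketch below shows the round trip under stated assumptions: it writes to a temporary directory rather than the Drive path, and it uses a small list of dictionaries as a stand-in for the `twitter_cleaned` DataFrame.

```python
import os
import pickle
import tempfile

# Stand-in for twitter_cleaned (assumption: a small picklable object for illustration)
records = [
    {"tweets_cleaned": "woman complain cleaning house amp man trash", "neutral": 1, "hate": 0},
    {"tweets_cleaned": "folks bad talk trash", "neutral": 0, "hate": 1},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file location inside the temporary directory
    fp = os.path.join(tmp_dir, "twitter_cleaned.pkl")

    # Write the object in binary mode
    with open(fp, mode="wb") as f:
        pickle.dump(records, f)

    # Read it back in binary mode
    with open(fp, mode="rb") as f:
        restored = pickle.load(f)

print(restored == records)
```

Pickle files can execute arbitrary code when loaded, so only unpickle files that you created yourself or otherwise trust.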