# Introduction

#### If you choose to view this file, please know that some may find the content in it to be very offensive.

This Jupyter notebook preprocesses the data so that it can be used by the model. Two data sources are merged to produce a dataset split roughly evenly between positive and negative tweets: 21,421 positive tweets and 20,610 negative tweets.

The first dataset examined, loaded as dataframe `df1`, supplies the negative tweets. Its tweets are classified as hate speech, offensive language, or neutral. However, only the first two classes are selected because the goal is to classify a tweet as negative or positive, so the neutral tweets can be dropped. Moreover, there are not enough neutral tweets within `df1` to accurately differentiate them from the tweets classified as hate speech or offensive language.

The second dataset examined, loaded as dataframe `df2`, consists primarily of positive tweets. Of these, 23,000 are joined with the negative tweet data to create a balanced dataset for the model.

The following modifications are made to the tweets:

* Converting all alphabetical characters to lowercase.
* Removing duplicate tweets.
* Removing retweets.
* Removing handles (@).
* Removing special characters.
* Tagging the parts of speech present in the tweets and lemmatizing the words.
* Reformatting spaces and hashtags to get rid of excess whitespace.
* Removing English stopwords.

After modifying the tweets, the columns used to do the exploratory analysis and modification are dropped.
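The steps listed above can be sketched on a single string. This is a simplified illustration, not the notebook's implementation: `clean_tweet` and the tiny `STOPWORDS` set are hypothetical names, lemmatization is omitted, and duplicate removal is left out because it operates on the whole dataset rather than on one tweet.

```python
import re

# Tiny illustrative stopword list (an assumption); the notebook uses NLTK's English stopwords.
STOPWORDS = {"a", "the", "is", "to", "and"}

def clean_tweet(text):
    """Apply a simplified version of the cleaning steps to one tweet."""
    text = text.lower()                          # lowercase
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"@\w+", " ", text)            # drop handles
    text = re.sub(r"[^a-z#']", " ", text)        # drop special characters and digits
    # Splitting and re-joining collapses excess whitespace; lone '#' tokens are dropped.
    tokens = [t for t in text.split() if t not in STOPWORDS and t != "#"]
    return " ".join(tokens)

print(clean_tweet("@user Thanks for the #Lyft credit! http://t.co/abc123"))
```

Doing everything in one function keeps the example short; the notebook instead stores each intermediate result in its own column so that every step can be inspected.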

In [143]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
import warnings
warnings.filterwarnings('ignore')

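import ssl
# Workaround: disable HTTPS certificate verification, presumably so that resource
# downloads (such as NLTK data) succeed. Not recommended outside a local notebook.
ssl._create_default_https_context = ssl._create_unverified_context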

# Import custom functions
from custom import removeRegex
from custom import getPatternCount
from sentenceprocess import getLemma
from sentenceprocess import posTag
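The source of the custom helpers is not included in this notebook. Judging only from how they are called later (a text argument followed by a pattern argument), `removeRegex` and `getPatternCount` might look roughly like the sketch below. The names `remove_regex` and `get_pattern_count` and their bodies are assumptions for illustration.

```python
import re

def remove_regex(text, pattern):
    """Hypothetical stand-in for removeRegex: delete every match of pattern."""
    return re.sub(pattern, "", text)

def get_pattern_count(text, pattern):
    """Hypothetical stand-in for getPatternCount: count non-overlapping matches."""
    return len(re.findall(pattern, text))

sample = "rt @someone: look at this http://t.co/abc123"
print(get_pattern_count(sample, r"@\w*"))
print(remove_regex(sample, r"https?://[A-Za-z0-9./]*"))
```

Because both functions work on one string at a time, wrapping them in `np.vectorize` is what lets them be applied to a whole column.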

# Merging Datasets Together

There are 6 columns present in the first dataset. 

* `count`: number of CrowdFlower users who classified each tweet in its current category.
* `hate_speech`: number of CF users who judged the tweet to be hate speech.
* `offensive_language`: number of CF users who judged the tweet to be offensive language.
* `neither`: number of CF users who judged the tweet to be neutral (non-offensive).
* `class`: class label given by the majority of the CF users (`0` is `hate_speech`, `1` is `offensive_language`, `2` is `neither`).
* `tweet`: actual tweet.
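To make the relationship between the vote columns and `class` concrete, the following sketch derives a label from the three counts. The function name `majority_class` is hypothetical, and the tie-breaking rule (first maximum wins) is an assumption; the dataset's own handling of ties is not described here.

```python
def majority_class(hate_speech, offensive_language, neither):
    """Return 0, 1, or 2 for the category with the most votes (first maximum on ties)."""
    votes = [hate_speech, offensive_language, neither]
    return votes.index(max(votes))

# Vote counts taken from rows shown in this notebook
print(majority_class(0, 0, 3))  # all three users judged the tweet neutral
print(majority_class(0, 2, 1))  # two of three users judged it offensive
```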

In [169]:
# Import the initial dataset
df1 = pd.read_csv('../data/raw/labeled_data_raw.csv')
# Rename first column as ID
df1.rename(columns = {'Unnamed: 0': 'ID'}, inplace = True)
# Preview the first 5 rows with ID as the index (the result is not assigned, so df1 keeps ID as a column)
df1.set_index('ID').head(5)

ID,count,hate_speech,offensive_language,neither,class,tweet
0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't complain about cleaning up your house. &amp; as a man you should always take the trash out...
1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!
2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby4life: You ever fuck a bitch and she start to cry? You be confused as shit
3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she look like a tranny
4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you hear about me might be true or it might be faker than the bitch who told it to ya &#57361;


In [170]:
# Get a count of how many of each class are present
df1['class'].value_counts()

1    19190
2    4163 
0    1430 
Name: class, dtype: int64
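The class imbalance is easier to see as shares of the total. The sketch below only redoes the arithmetic on the counts printed above; the dictionary keys follow the class descriptions given earlier.

```python
# Counts copied from the value_counts() output above
counts = {"hate_speech": 1430, "offensive_language": 19190, "neither": 4163}

total = sum(counts.values())
print("total:", total)
for name, n in counts.items():
    # Share of each class, formatted as a percentage
    print(f"{name}: {n} ({n / total:.1%})")
```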

In [173]:
# Select the tweets classified as hate speech. Look at the first 5.
hateSpeech = df1.loc[df1['class'] == 0]
hateSpeech.head(5)

Unnamed: 0,ID,count,hate_speech,offensive_language,neither,class,tweet
85,85,3,2,1,0,0,"""@Blackman38Tide: @WhaleLookyHere @HowdyDowdy11 queer"" gaywad"
89,90,3,3,0,0,0,"""@CB_Baby24: @white_thunduh alsarabsss"" hes a beaner smh you can tell hes a mexican"
110,111,3,3,0,0,0,"""@DevilGrimz: @VigxRArts you're fucking gay, blacklisted hoe"" Holding out for #TehGodClan anyway http://t.co/xUCcwoetmn"
184,186,3,3,0,0,0,"""@MarkRoundtreeJr: LMFAOOOO I HATE BLACK PEOPLE https://t.co/RNvD2nLCDR"" This is why there's black people and niggers"
202,204,3,2,1,0,0,"""@NoChillPaz: ""At least I'm not a nigger"" http://t.co/RGJa7CfoiT""\n\nLmfao"


In [177]:
# Offensive but not hate speech. Look at the first 5.
offensiveLanguage = df1.loc[df1['class'] == 1]
offensiveLanguage.head(5)

Unnamed: 0,ID,count,hate_speech,offensive_language,neither,class,tweet
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby4life: You ever fuck a bitch and she start to cry? You be confused as shit
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she look like a tranny
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you hear about me might be true or it might be faker than the bitch who told it to ya &#57361;
5,5,3,1,2,0,1,"!!!!!!!!!!!!!!!!!!""@T_Madison_x: The shit just blows me..claim you so faithful and down for somebody but still fucking with hoes! &#128514;&#128514;&#128514;"""


In [175]:
# Tweets judged as neutral. Look at the first 5.
neutral = df1.loc[df1['class'] == 2]
neutral.head(5)

Unnamed: 0,ID,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't complain about cleaning up your house. &amp; as a man you should always take the trash out...
40,40,3,0,1,2,2,""" momma said no pussy cats inside my doghouse """
63,63,3,0,0,3,2,"""@Addicted2Guys: -SimplyAddictedToGuys http://t.co/1jL4hi8ZMF"" woof woof hot scally lad"
66,66,3,0,1,2,2,"""@AllAboutManFeet: http://t.co/3gzUpfuMev"" woof woof and hot soles"
67,67,3,0,1,2,2,"""@Allyhaaaaa: Lemmie eat a Oreo &amp; do these dishes."" One oreo? Lol"


Next, `df1` is prepared so that the positive tweets can be appended to it.

In [178]:
# Insert new column in our dataframe to account for positive tweets. This column will be present after the 'neither' column.
df1.insert(5, 'positive', 0)

The second dataset, which contains the positive tweets, has only 3 columns.

* `id`: unique ID for each tweet.
* `label`: categorized as offensive or not (`0` indicates that it is not offensive, `1` indicates that it is offensive).
* `tweet`: actual tweet.

In [188]:
# Import second dataset.
df2 = pd.read_csv('../data/raw/train_E6oV3lV.csv')

In [189]:
dfPos = df2.loc[df2['label'] == 0]
dfPos.head(5)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0,factsguide: society now #motivation


In [190]:
# Given that there are 23,353 offensive tweets in the other dataset, select 23,000 positive tweets from df2
# Copy the slice so that the column inserts below do not operate on a view of dfPos
dfPos23 = dfPos[:23000].copy()

In [191]:
# Preview with id as the index (the result is not assigned, so dfPos23 is unchanged)
dfPos23.set_index('id')

Unnamed: 0,ID,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0,factsguide: society now #motivation
...,...,...,...
24724,24725,0,"yes, leeds!! #lesbiunion #girlsweekend"
24725,24726,0,in other non tattoo related news my boy will be #crystalpalace mascot at the home game against liverpool next season ! ð´ðµ
24726,24727,0,finally wo agaya :):)
24727,24728,0,ðð ð #love #instagood #photooftheday top.tags #tbt #cute #me #beautiful #followme #followâ¦


In [192]:
# Recreate the same structure of the other df
dfPos23.insert(1, 'count', 3)
dfPos23.insert(2, 'hate_speech', 0)
dfPos23.insert(3, 'offensive_language', 0)
dfPos23.insert(4, 'neither', 0)
dfPos23.insert(5, 'positive', 3)
dfPos23.head(5)

Unnamed: 0,id,count,hate_speech,offensive_language,neither,positive,label,tweet
0,1,3,0,0,0,3,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,3,0,0,0,3,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,3,0,0,0,3,0,bihday your majesty
3,4,3,0,0,0,3,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,3,0,0,0,3,0,factsguide: society now #motivation


In [193]:
# Remove the label column.
dfPos23.drop(['label'], axis = 1, inplace = True)

In [194]:
# Insert class label 3 for positive tweets
dfPos23.insert(6, 'class', 3)

In [195]:
dfPos23.head(10)

Unnamed: 0,id,count,hate_speech,offensive_language,neither,positive,class,tweet
0,1,3,0,0,0,3,3,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,3,0,0,0,3,3,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,3,0,0,0,3,3,bihday your majesty
3,4,3,0,0,0,3,3,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,3,0,0,0,3,3,factsguide: society now #motivation
5,6,3,0,0,0,3,3,[2/2] huge fan fare and big talking before they leave. chaos and pay disputes when they get there. #allshowandnogo
6,7,3,0,0,0,3,3,@user camping tomorrow @user @user @user @user @user @user @user dannyâ¦
7,8,3,0,0,0,3,3,the next school year is the year for exams.ð¯ can't think about that ð­ #school #exams #hate #imagine #actorslife #revolutionschool #girl
8,9,3,0,0,0,3,3,we won!!! love the land!!! #allin #cavs #champions #cleveland #clevelandcavaliers â¦
9,10,3,0,0,0,3,3,@user @user welcome here ! i'm it's so #gr8 !


In [196]:
# Offset the IDs by 25926 so that they should not collide with the IDs in df1
dfPos23['id'] = dfPos23['id'] + 25926
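Shifting the IDs does not by itself verify that they are unique. A minimal check, shown here with plain Python lists and made-up values, compares the number of IDs with the number of distinct IDs. The helper `offset_ids` and all the sample numbers are hypothetical.

```python
def offset_ids(ids, offset):
    """Shift every ID by a constant offset."""
    return [i + offset for i in ids]

existing_ids = [0, 1, 2, 5]              # hypothetical IDs from the first dataset
new_ids = offset_ids([1, 2, 3], 100)     # hypothetical IDs and offset for the second

all_ids = existing_ids + new_ids
# IDs are unique exactly when no value is lost by converting to a set
ids_are_unique = len(all_ids) == len(set(all_ids))
print(ids_are_unique)
```

In pandas, the same check is available on a column through `Series.is_unique`.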

In [205]:
# Final df has 23000 positive tweets
dfPos23.head(10)

Unnamed: 0,id,count,hate_speech,offensive_language,neither,positive,class,tweet
0,25927,3,0,0,0,3,3,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,25928,3,0,0,0,3,3,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,25929,3,0,0,0,3,3,bihday your majesty
3,25930,3,0,0,0,3,3,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,25931,3,0,0,0,3,3,factsguide: society now #motivation
5,25932,3,0,0,0,3,3,[2/2] huge fan fare and big talking before they leave. chaos and pay disputes when they get there. #allshowandnogo
6,25933,3,0,0,0,3,3,@user camping tomorrow @user @user @user @user @user @user @user dannyâ¦
7,25934,3,0,0,0,3,3,the next school year is the year for exams.ð¯ can't think about that ð­ #school #exams #hate #imagine #actorslife #revolutionschool #girl
8,25935,3,0,0,0,3,3,we won!!! love the land!!! #allin #cavs #champions #cleveland #clevelandcavaliers â¦
9,25936,3,0,0,0,3,3,@user @user welcome here ! i'm it's so #gr8 !


In [206]:
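# join = 'inner' keeps only the columns present in both dataframes.
# 'ID' (df1) and 'id' (dfPos23) differ in case, so neither ID column is carried over.
df = pd.concat([df1, dfPos23], join = 'inner')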
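With `join = 'inner'`, `pd.concat` keeps only the columns that both frames share, which matters when column names differ only in case. The toy frames below (`left`, `right`, and their values are made up) illustrate the effect and one possible way to keep the ID column, offered as an option rather than as what this notebook does.

```python
import pandas as pd

left = pd.DataFrame({"ID": [0, 1], "class": [1, 0], "tweet": ["a", "b"]})
right = pd.DataFrame({"id": [10, 11], "class": [3, 3], "tweet": ["c", "d"]})

# 'ID' and 'id' do not match, so the inner join drops both
combined = pd.concat([left, right], join="inner")
print(sorted(combined.columns))

# Renaming first makes the ID column common to both frames
aligned = pd.concat([left, right.rename(columns={"id": "ID"})], join="inner")
print(sorted(aligned.columns))
```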

In [207]:
# Final dataset has a good balance between positive and negative tweets
df['positive'].value_counts()

0    24783
3    23000
Name: positive, dtype: int64

In [208]:
# Reshuffle the rows
df = df.sample(frac = 1, random_state = 43).reset_index(drop = True)

In [209]:
len(df)

47783

In [210]:
df['positive'].value_counts()

0    24783
3    23000
Name: positive, dtype: int64

# Data Cleaning

We can check the length of the dataframe after each modification to ensure we are not adding or removing any tweets unintentionally.

`tweet_lowercase` contains every tweet with all alphabetic characters converted to lowercase.

In [211]:
df['tweet_lowercase'] = df['tweet'].apply(lambda x: x.lower() if isinstance(x, str) else x)
df

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase
0,3,0,0,0,3,3,13 days to go #gettingthere,13 days to go #gettingthere
1,3,1,2,0,0,1,@anggxo get off my twitter fag,@anggxo get off my twitter fag
2,3,1,2,0,0,1,These hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.
3,3,0,0,0,3,3,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.
4,3,0,0,3,0,2,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.
...,...,...,...,...,...,...,...,...
47778,3,0,0,0,3,3,when quay collab with @user says sold out!!!ð©ð«ð©ð«ð©ð«ð©ð«ð©ð«ð©ð«ð© #noooo #whyyyy #loveyoudesi #,when quay collab with @user says sold out!!!ð©ð«ð©ð«ð©ð«ð©ð«ð©ð«ð©ð«ð© #noooo #whyyyy #loveyoudesi #
47779,3,0,3,0,0,1,RT @_groovymovie: &#8220;@Shane_A1: Hmu talmbout match but when I pull up its 30 other niccas wit ya &#128530;&#8221; smfh shit like that kill me,rt @_groovymovie: &#8220;@shane_a1: hmu talmbout match but when i pull up its 30 other niccas wit ya &#128530;&#8221; smfh shit like that kill me
47780,3,0,0,0,3,3,angry squeaking frog video: #frog #nature #animals #cute #adorable,angry squeaking frog video: #frog #nature #animals #cute #adorable
47781,3,0,3,0,0,1,RT @obey_jrock__: This is a true ride or die bitch &#128175; http://t.co/y1t8CTQn4U,rt @obey_jrock__: this is a true ride or die bitch &#128175; http://t.co/y1t8ctqn4u


In [213]:
len(df)

47783

We check for duplicates next. Ultimately, the duplicates should be removed because repetitive data can lead to bias.

In [214]:
dup = df[df.duplicated('tweet_lowercase', keep = 'first')]
dup.head(5)

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase
303,3,0,0,0,3,3,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
544,3,0,0,0,3,3,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
645,3,0,0,0,3,3,i finally found a way how to delete old tweets! you might find it useful as well: #deletetweets,i finally found a way how to delete old tweets! you might find it useful as well: #deletetweets
1027,3,0,0,0,3,3,i finally found a way how to delete old tweets! you might find it useful as well: #deletetweets,i finally found a way how to delete old tweets! you might find it useful as well: #deletetweets
1085,3,0,0,0,3,3,can #lighttherapy help with or #depression? #altwaystoheal #healthy is #happy !!,can #lighttherapy help with or #depression? #altwaystoheal #healthy is #happy !!


In [215]:
# Number of duplicates
len(dup)

1589

In [216]:
# Drop duplicates
df = df.drop_duplicates(subset = 'tweet_lowercase', keep = 'first')

In [217]:
len(df)

46194
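The lengths are consistent: 47,783 rows minus 1,589 duplicates leaves 46,194. The same bookkeeping can be reproduced on a toy frame, where `duplicated` and `drop_duplicates` are called with the same `keep = 'first'` setting so that they agree on which rows count as duplicates. The sample tweets are made up.

```python
import pandas as pd

tweets = pd.DataFrame({"tweet": ["Good Morning", "good morning", "hello", "GOOD MORNING"]})
tweets["tweet_lowercase"] = tweets["tweet"].str.lower()

# Rows flagged True repeat an earlier lowercase tweet
dup_mask = tweets.duplicated("tweet_lowercase", keep="first")
deduped = tweets.drop_duplicates(subset="tweet_lowercase", keep="first")

print(len(tweets), int(dup_mask.sum()), len(deduped))
```

Comparing on the lowercase column means tweets that differ only in capitalization are treated as duplicates.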

In [218]:
# Ensuring duplicates are removed
dup = df[df.duplicated('tweet_lowercase', keep = 'first')]
dup

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase


Retweets are handled next. A hyperlink in a tweet can also indicate a retweet, so tweets containing links are inspected here and the links themselves are removed in a later step.

In [219]:
retweet = df[df['tweet_lowercase'].str.contains(r'http://t(?!$)')] # Tweets containing an http://t... link (e.g. shortened t.co URLs)
retweet

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase
11,3,1,2,0,0,1,"@HopOnTheBeast i found her right here, good job unfollowing me btw faggot http://t.co/kD3iSeIoLd","@hoponthebeast i found her right here, good job unfollowing me btw faggot http://t.co/kd3iseiold"
45,3,0,3,0,0,1,"RT @TooSexist: Women complain that chivalry is dead. Yes congratulations bitches, you killed it with feminism http://t.co/Mrc82ZUhOn","rt @toosexist: women complain that chivalry is dead. yes congratulations bitches, you killed it with feminism http://t.co/mrc82zuhon"
56,3,0,0,3,0,2,Yankees great Joe DiMaggio reportedly used to beat wife Marilyn Monroe. Here she is in 1954 announcing divorce http://t.co/blS7dalMiF,yankees great joe dimaggio reportedly used to beat wife marilyn monroe. here she is in 1954 announcing divorce http://t.co/bls7dalmif
76,3,0,3,0,0,1,""" pussy is a powerful drug "" &#128517; #HappyHumpDay http://t.co/R8jsymiB5b",""" pussy is a powerful drug "" &#128517; #happyhumpday http://t.co/r8jsymib5b"
83,3,0,0,3,0,2,&#8220;@CaptainYankee2: Two of the best Yankees Derek Jeter and Joe Torre #JoeTorreDay http://t.co/XMyxfDBKOX&#8221; @jordan_luree,&#8220;@captainyankee2: two of the best yankees derek jeter and joe torre #joetorreday http://t.co/xmyxfdbkox&#8221; @jordan_luree
...,...,...,...,...,...,...,...,...
47743,3,0,3,0,0,1,RT @MyDickNeedsCPR: What lonely hoe made this? http://t.co/eEFThf0tvb,rt @mydickneedscpr: what lonely hoe made this? http://t.co/eefthf0tvb
47750,3,0,3,0,0,1,Bored then a hoe! Listening to these fuck ass adults lecturing us with @__vercetti http://t.co/NyBO16RMsh,bored then a hoe! listening to these fuck ass adults lecturing us with @__vercetti http://t.co/nybo16rmsh
47756,4,1,3,0,0,1,trash both ways lol RT @AgdaCoroner: Bitch Killed Herself....Look Like Bill Maher With Makeup on http://t.co/IWLAG2J5Sl,trash both ways lol rt @agdacoroner: bitch killed herself....look like bill maher with makeup on http://t.co/iwlag2j5sl
47781,3,0,3,0,0,1,RT @obey_jrock__: This is a true ride or die bitch &#128175; http://t.co/y1t8CTQn4U,rt @obey_jrock__: this is a true ride or die bitch &#128175; http://t.co/y1t8ctqn4u
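The pattern `http://t(?!$)` is worth unpacking: the negative lookahead `(?!$)` rejects a match only when `http://t` is the very end of the string. The sketch below uses the standard `re` module to show what the pattern accepts and rejects; the helper name `matches_link_pattern` and the sample strings are hypothetical.

```python
import re

LINK_PATTERN = re.compile(r"http://t(?!$)")

def matches_link_pattern(text):
    """True when text contains 'http://t' followed by at least one more character."""
    return LINK_PATTERN.search(text) is not None

print(matches_link_pattern("look at this http://t.co/abc123"))   # shortened link
print(matches_link_pattern("truncated link http://t"))           # ends right after the 't'
print(matches_link_pattern("secure link https://t.co/abc123"))   # https is not matched
```

Note that the pattern is broader than `t.co`: any `http://` address whose host starts with `t` also matches, and `https://` links do not match at all.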


Emojis are represented as HTML numeric character references (for example `&#128514;`) and should be removed as well.

In [221]:
emoji = df[df['tweet'].str.contains(r'#[0-9]')]
emoji['class'].value_counts()

1    4852
2    994 
3    406 
0    213 
Name: class, dtype: int64

Within the subset of positive tweets, most matches are not emojis: `#[0-9]` also matches hashtags that begin with a digit, which can carry linguistic meaning, represent dates, etc.

In [61]:
posWEmoji = emoji.loc[emoji['class'] == 3]
posWEmoji.head(10)

Unnamed: 0,id,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_low
51,48434,3,0,0,0,3,3,@user just done my first #2minutebeachclean in ages. also got one wet foot putting a starfish back in the water â­ï¸ð #southwold ð #dâ¦,@user just done my first #2minutebeachclean in ages. also got one wet foot putting a starfish back in the water â­ï¸ð #southwold ð #dâ¦
192,26678,3,0,0,0,3,3,checked in #holiday #croatia #13daystogo,checked in #holiday #croatia #13daystogo
430,39963,3,0,0,0,3,3,"""no one is better than dad."" fathersday!!! #fathersday2016 #fatherslove #disney #disneyfana #101dalmatians","""no one is better than dad."" fathersday!!! #fathersday2016 #fatherslove #disney #disneyfana #101dalmatians"
538,32023,3,0,0,0,3,3,welcome to candjdays! :) #first #video #vlog #checkitout #youtube #couple #9videos #phoenix #az #florida,welcome to candjdays! :) #first #video #vlog #checkitout #youtube #couple #9videos #phoenix #az #florida
594,38174,3,0,0,0,3,3,@user d-7 opening soon #miami #restaurant #saltandsugarcafe #20flaglerstreet,@user d-7 opening soon #miami #restaurant #saltandsugarcafe #20flaglerstreet
610,46640,3,0,0,0,3,3,@user pheonix u10s new look team to play noh walkden tomorrow #1stgame,@user pheonix u10s new look team to play noh walkden tomorrow #1stgame
613,25939,3,0,0,0,3,3,i get to see my daddy today!! #80days #gettingfed,i get to see my daddy today!! #80days #gettingfed
954,35212,3,0,0,0,3,3,popsy &amp; little all ready for @user #10minutestogo,popsy &amp; little all ready for @user #10minutestogo
1200,30155,3,0,0,0,3,3,i'm off to #florida #usa in #july for #3 #weeks woop woop #holiday,i'm off to #florida #usa in #july for #3 #weeks woop woop #holiday
1226,32523,3,0,0,0,3,3,the new baby is on her way! xx #700d #cannon,the new baby is on her way! xx #700d #cannon
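Since `#[0-9]` matches both emoji character references and hashtags that start with a digit, the two cases can be told apart with stricter patterns. The following is a sketch of one possible approach, not the notebook's method; the names `ENTITY` and `DIGIT_HASHTAG` and the sample strings are assumptions.

```python
import html
import re

# An HTML numeric character reference such as &#128514;
ENTITY = re.compile(r"&#[0-9]+;")
# A hashtag that starts with a digit and is not part of a character reference
DIGIT_HASHTAG = re.compile(r"(?<!&)#[0-9]\w*")

samples = ["so good &#128514;", "checked in #holiday #13daystogo"]
for s in samples:
    print(bool(ENTITY.search(s)), bool(DIGIT_HASHTAG.search(s)))

# html.unescape converts a character reference back into the character it encodes
decoded = html.unescape("&#128514;")
```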


In [222]:
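# Delete URLs. The character class is A-Za-z; the range A-z would also match the punctuation between 'Z' and 'a'.
df['no_url'] = np.vectorize(removeRegex)(df['tweet_lowercase'], r"https?://[A-Za-z0-9./]*")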

In [223]:
df.head(15)

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase,no_url
0,3,0,0,0,3,3,13 days to go #gettingthere,13 days to go #gettingthere,13 days to go #gettingthere
1,3,1,2,0,0,1,@anggxo get off my twitter fag,@anggxo get off my twitter fag,@anggxo get off my twitter fag
2,3,1,2,0,0,1,These hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.
3,3,0,0,0,3,3,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.
4,3,0,0,3,0,2,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.
5,3,0,0,0,3,3,i've noticed a lot of #icontf16 presentations mention happiness. wonder if profession has above average happiness? @user,i've noticed a lot of #icontf16 presentations mention happiness. wonder if profession has above average happiness? @user,i've noticed a lot of #icontf16 presentations mention happiness. wonder if profession has above average happiness? @user
6,3,0,0,0,3,3,getting for this weekends shows! #country #music #lylepierce,getting for this weekends shows! #country #music #lylepierce,getting for this weekends shows! #country #music #lylepierce
7,3,0,0,0,3,3,@user my final legislative session day has officially begun! @user @user #albany,@user my final legislative session day has officially begun! @user @user #albany,@user my final legislative session day has officially begun! @user @user #albany
8,3,2,1,0,0,0,@lucas_wright955 @MichaelGT03 faggots,@lucas_wright955 @michaelgt03 faggots,@lucas_wright955 @michaelgt03 faggots
9,3,0,0,0,3,3,a #bikini kind of life ð´ summer #palmtrees #breeze #place #cali #california #swimwearâ¦,a #bikini kind of life ð´ summer #palmtrees #breeze #place #cali #california #swimwearâ¦,a #bikini kind of life ð´ summer #palmtrees #breeze #place #cali #california #swimwearâ¦


In [225]:
# Count the Twitter handles in each tweet
df['handle_count'] = np.vectorize(getPatternCount)(df['tweet_lowercase'], r"@[\w]*")

In [226]:
df['handle_count'].value_counts()

0     23968
1     15967
2     4176 
3     1344 
4     416  
5     158  
6     89   
8     31   
7     26   
9     12   
10    6    
11    1    
Name: handle_count, dtype: int64

In [227]:
df.head(5)

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase,no_url,handle_count
0,3,0,0,0,3,3,13 days to go #gettingthere,13 days to go #gettingthere,13 days to go #gettingthere,0
1,3,1,2,0,0,1,@anggxo get off my twitter fag,@anggxo get off my twitter fag,@anggxo get off my twitter fag,1
2,3,1,2,0,0,1,These hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,0
3,3,0,0,0,3,3,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,1
4,3,0,0,3,0,2,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,1


In [228]:
# Remove Twitter handles
df['no_handle'] = np.vectorize(removeRegex)(df['no_url'], r"@[\w]*")

In [229]:
df.tail(10)

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase,no_url,handle_count,no_handle
47771,3,0,0,0,3,3,@user '' x'mas &amp; my bihday disney ! '' #love #thanks #karen #â¡,@user '' x'mas &amp; my bihday disney ! '' #love #thanks #karen #â¡,@user '' x'mas &amp; my bihday disney ! '' #love #thanks #karen #â¡,1,'' x'mas &amp; my bihday disney ! '' #love #thanks #karen #â¡
47773,3,0,0,0,3,3,my soul is happiest on the water! #soul #happier #happiest #water #ocean #beach #caliâ¦,my soul is happiest on the water! #soul #happier #happiest #water #ocean #beach #caliâ¦,my soul is happiest on the water! #soul #happier #happiest #water #ocean #beach #caliâ¦,0,my soul is happiest on the water! #soul #happier #happiest #water #ocean #beach #caliâ¦
47774,3,0,0,0,3,3,"â #nzd/usd post-rbnz rally almost reversed, 0.7000 closer #blog #silver #gold #forex","â #nzd/usd post-rbnz rally almost reversed, 0.7000 closer #blog #silver #gold #forex","â #nzd/usd post-rbnz rally almost reversed, 0.7000 closer #blog #silver #gold #forex",0,"â #nzd/usd post-rbnz rally almost reversed, 0.7000 closer #blog #silver #gold #forex"
47775,3,0,3,0,0,1,"RT @MAKEUP_SEX: trash talked by many . hated by some . & guess how many fucks i give , its less than one .","rt @makeup_sex: trash talked by many . hated by some . & guess how many fucks i give , its less than one .","rt @makeup_sex: trash talked by many . hated by some . & guess how many fucks i give , its less than one .",1,"rt : trash talked by many . hated by some . & guess how many fucks i give , its less than one ."
47776,3,0,0,0,3,3,"#bihday to leo's mom, #celia ..","#bihday to leo's mom, #celia ..","#bihday to leo's mom, #celia ..",0,"#bihday to leo's mom, #celia .."
47777,3,0,3,0,0,1,@_ElenaRaquel_ its swag bitch aha,@_elenaraquel_ its swag bitch aha,@_elenaraquel_ its swag bitch aha,1,its swag bitch aha
47778,3,0,0,0,3,3,when quay collab with @user says sold out!!!ð©ð«ð©ð«ð©ð«ð©ð«ð©ð«ð©ð«ð© #noooo #whyyyy #loveyoudesi #,when quay collab with @user says sold out!!!ð©ð«ð©ð«ð©ð«ð©ð«ð©ð«ð©ð«ð© #noooo #whyyyy #loveyoudesi #,when quay collab with @user says sold out!!!ð©ð«ð©ð«ð©ð«ð©ð«ð©ð«ð©ð«ð© #noooo #whyyyy #loveyoudesi #,1,when quay collab with says sold out!!!ð©ð«ð©ð«ð©ð«ð©ð«ð©ð«ð©ð«ð© #noooo #whyyyy #loveyoudesi #
47779,3,0,3,0,0,1,RT @_groovymovie: &#8220;@Shane_A1: Hmu talmbout match but when I pull up its 30 other niccas wit ya &#128530;&#8221; smfh shit like that kill me,rt @_groovymovie: &#8220;@shane_a1: hmu talmbout match but when i pull up its 30 other niccas wit ya &#128530;&#8221; smfh shit like that kill me,rt @_groovymovie: &#8220;@shane_a1: hmu talmbout match but when i pull up its 30 other niccas wit ya &#128530;&#8221; smfh shit like that kill me,2,rt : &#8220;: hmu talmbout match but when i pull up its 30 other niccas wit ya &#128530;&#8221; smfh shit like that kill me
47781,3,0,3,0,0,1,RT @obey_jrock__: This is a true ride or die bitch &#128175; http://t.co/y1t8CTQn4U,rt @obey_jrock__: this is a true ride or die bitch &#128175; http://t.co/y1t8ctqn4u,rt @obey_jrock__: this is a true ride or die bitch &#128175;,1,rt : this is a true ride or die bitch &#128175;
47782,3,0,3,0,0,1,RT @AllHailTaron_: I got the deals for the low. I know you hoes lonely so fuck with these cuffing season specials. &#128184;&#128175; http://t.co/YURpX99Hdb,rt @allhailtaron_: i got the deals for the low. i know you hoes lonely so fuck with these cuffing season specials. &#128184;&#128175; http://t.co/yurpx99hdb,rt @allhailtaron_: i got the deals for the low. i know you hoes lonely so fuck with these cuffing season specials. &#128184;&#128175;,1,rt : i got the deals for the low. i know you hoes lonely so fuck with these cuffing season specials. &#128184;&#128175;


In [230]:
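# Replace every character that is not a letter, hashtag, or apostrophe with a whitespace (digits are removed too).
# regex = True is required for pattern matching in pandas 2.0 and later.
df['no_special'] = df['no_handle'].str.replace("[^a-zA-Z#']", " ", regex = True)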

In [232]:
df.head(5)

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase,no_url,handle_count,no_handle,no_special
0,3,0,0,0,3,3,13 days to go #gettingthere,13 days to go #gettingthere,13 days to go #gettingthere,0,13 days to go #gettingthere,days to go #gettingthere
1,3,1,2,0,0,1,@anggxo get off my twitter fag,@anggxo get off my twitter fag,@anggxo get off my twitter fag,1,get off my twitter fag,get off my twitter fag
2,3,1,2,0,0,1,These hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,0,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery # # #
3,3,0,0,0,3,3,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,1,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the racists cuz she's not white.,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white
4,3,0,0,3,0,2,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,1,dubya hates todays teabagger movement.,dubya hates todays teabagger movement
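The effect of the character class `[^a-zA-Z#']` on digits and on character references can be checked with the standard `re` module. In this sketch, `strip_special` is a hypothetical helper; the inputs are fragments of tweets shown in the output above.

```python
import re

NON_LETTERS = re.compile(r"[^a-zA-Z#']")

def strip_special(text):
    """Replace every character that is not a letter, '#', or apostrophe with a space."""
    return NON_LETTERS.sub(" ", text)

# Digits disappear, '&amp;' leaves the word 'amp', and '&#128056;' leaves a lone '#'
print(strip_special("13 days to go #gettingthere").split())
print(strip_special("rally &amp; be attacked").split())
print(strip_special("cemetery&#128056;.").split())
```

Each removed character becomes a space, so the string length is unchanged and the leftover whitespace and lone `#` symbols have to be cleaned up separately.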


In [233]:
# Remove single hashtags with nothing following them
df['remove_empty_hashtag'] = np.vectorize(removeRegex)(df['no_special'], " # ")

In [234]:
# Count the length of each tweet after URLs, handles, special characters, and empty hashtags are removed.
# Use this to see if there is a correlation between the length of a tweet and the sentiment.
df['tweet_length'] = df['remove_empty_hashtag'].apply(len)
df.head(15)

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase,no_url,handle_count,no_handle,no_special,remove_empty_hashtag,tweet_length
0,3,0,0,0,3,3,13 days to go #gettingthere,13 days to go #gettingthere,13 days to go #gettingthere,0,13 days to go #gettingthere,days to go #gettingthere,days to go #gettingthere,30
1,3,1,2,0,0,1,@anggxo get off my twitter fag,@anggxo get off my twitter fag,@anggxo get off my twitter fag,1,get off my twitter fag,get off my twitter fag,get off my twitter fag,23
2,3,1,2,0,0,1,These hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,0,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery # # #,these hoes got more bodies than a cemetery,58
3,3,0,0,0,3,3,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,1,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the racists cuz she's not white.,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,124
4,3,0,0,3,0,2,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,1,dubya hates todays teabagger movement.,dubya hates todays teabagger movement,dubya hates todays teabagger movement,39
5,3,0,0,0,3,3,i've noticed a lot of #icontf16 presentations mention happiness. wonder if profession has above average happiness? @user,i've noticed a lot of #icontf16 presentations mention happiness. wonder if profession has above average happiness? @user,i've noticed a lot of #icontf16 presentations mention happiness. wonder if profession has above average happiness? @user,1,i've noticed a lot of #icontf16 presentations mention happiness. wonder if profession has above average happiness?,i've noticed a lot of #icontf presentations mention happiness wonder if profession has above average happiness,i've noticed a lot of #icontf presentations mention happiness wonder if profession has above average happiness,117
6,3,0,0,0,3,3,getting for this weekends shows! #country #music #lylepierce,getting for this weekends shows! #country #music #lylepierce,getting for this weekends shows! #country #music #lylepierce,0,getting for this weekends shows! #country #music #lylepierce,getting for this weekends shows #country #music #lylepierce,getting for this weekends shows #country #music #lylepierce,62
7,3,0,0,0,3,3,@user my final legislative session day has officially begun! @user @user #albany,@user my final legislative session day has officially begun! @user @user #albany,@user my final legislative session day has officially begun! @user @user #albany,3,my final legislative session day has officially begun! #albany,my final legislative session day has officially begun #albany,my final legislative session day has officially begun #albany,69
8,3,2,1,0,0,0,@lucas_wright955 @MichaelGT03 faggots,@lucas_wright955 @michaelgt03 faggots,@lucas_wright955 @michaelgt03 faggots,2,faggots,faggots,faggots,9
9,3,0,0,0,3,3,a #bikini kind of life ð´ summer #palmtrees #breeze #place #cali #california #swimwearâ¦,a #bikini kind of life ð´ summer #palmtrees #breeze #place #cali #california #swimwearâ¦,a #bikini kind of life ð´ summer #palmtrees #breeze #place #cali #california #swimwearâ¦,0,a #bikini kind of life ð´ summer #palmtrees #breeze #place #cali #california #swimwearâ¦,a #bikini kind of life summer #palmtrees #breeze #place #cali #california #swimwear,a #bikini kind of life summer #palmtrees #breeze #place #cali #california #swimwear,96
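
One way to act on the comment above, sketched under assumptions: summarize `tweet_length` per `class` to see whether lengths differ between labels. The frame `toy` and its values are made up for illustration; on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only.
toy = pd.DataFrame({
    "class": [1, 1, 3, 3],
    "tweet_length": [20, 30, 60, 80],
})

# Count, mean, and median tweet length for each class label.
length_by_class = toy.groupby("class")["tweet_length"].agg(["count", "mean", "median"])
print(length_by_class)
```

A large gap between the per-class means or medians would suggest that length carries some signal; a proper correlation test would still be needed to confirm it.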


In [235]:
# Check whether any tweets are longer than 280 characters
dfLen = df.loc[df['tweet_length'] > 280]

In [236]:
dfLen.sort_values(by=['tweet_length'], ascending = False)

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase,no_url,handle_count,no_handle,no_special,remove_empty_hashtag,tweet_length
29982,3,0,3,0,0,1,RT @TrxllLegend: One good girl is worth a thousand bitches\n\n&#128112; = &#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#8230;,rt @trxlllegend: one good girl is worth a thousand bitches\n\n&#128112; = &#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#8230;,rt @trxlllegend: one good girl is worth a thousand bitches\n\n&#128112; = 
&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#8230;,1,rt : one good girl is worth a thousand bitches\n\n&#128112; = &#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#128109;&#8230;,rt one good girl is worth a thousand bitches # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #,rt one good girl is worth a thousand bitches,511
28044,3,0,3,0,0,1,No summer school? &#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515; eat a dick school. Im done with your bitch ass !!!!!!,no summer school? &#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515; eat a dick school. im done with your bitch ass !!!!!!,no summer school? &#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515; eat a dick school. 
im done with your bitch ass !!!!!!,0,no summer school? &#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515;&#128515; eat a dick school. im done with your bitch ass !!!!!!,no summer school # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # eat a dick school im done with your bitch ass,no summer school eat a dick school im done with your bitch ass,462
28352,3,0,3,0,0,1,&#8220;@Untouchable_T: Never seen so many perfect bitches til I made a Twitter &#128564; but &#128056;&#9749;&#65039;&#8221;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;,&#8220;@untouchable_t: never seen so many perfect bitches til i made a twitter &#128564; but &#128056;&#9749;&#65039;&#8221;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;,&#8220;@untouchable_t: never seen so many perfect bitches til i made a twitter &#128564; but &#128056;&#9749;&#65039;&#8221;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;,1,&#8220;: never seen so many perfect bitches til i made a twitter 
&#128564; but &#128056;&#9749;&#65039;&#8221;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;&#128175;,# never seen so many perfect bitches til i made a twitter # but # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #,never seen so many perfect bitches til i made a twitter but,434
23454,3,0,2,1,0,1,A guy on True Blood is getting his penis inspected and the doctor told him it look like an eggplant\n\n&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;,a guy on true blood is getting his penis inspected and the doctor told him it look like an eggplant\n\n&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;,a guy on true blood is getting his penis inspected and the doctor told him it look like an eggplant\n\n&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;,0,a guy on true blood is getting his penis inspected and the doctor told him it look like an eggplant\n\n&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;&#127814;,a guy on true blood is getting his penis inspected and the doctor told him it look like an eggplant # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
# # # # # # # # #,a guy on true blood is getting his penis inspected and the doctor told him it look like an eggplant,335
6865,3,0,3,0,0,1,RT @digiflorals: bitch do it look like I care \n\n&#12288; N\n&#12288;&#12288; O\n&#12288;&#12288;&#12288; O\n&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288; .\n&#12288;&#12288;&#12288;&#12288;.,rt @digiflorals: bitch do it look like i care \n\n&#12288; n\n&#12288;&#12288; o\n&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288; .\n&#12288;&#12288;&#12288;&#12288;.,rt @digiflorals: bitch do it look like i care \n\n&#12288; n\n&#12288;&#12288; o\n&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288; .\n&#12288;&#12288;&#12288;&#12288;.,1,rt : bitch do it look like i care \n\n&#12288; n\n&#12288;&#12288; o\n&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288; .\n&#12288;&#12288;&#12288;&#12288;.,rt bitch do it look like i care # n # # o # # # o # # # # o # # # # # o # # # # # o # # # # # o # # # # # # # # 
# # # # # # # # # # # # # #,rt bitch do it look like i care n o o o o o o,302
39283,3,0,3,0,0,1,"RT @cotydankh: ""are these hoes loyal?""\n\n&#12288; N\n&#12288;&#12288; O\n&#12288;&#12288;&#12288; O\n&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288; .\n&#12288;&#12288;&#12288;&#12288;.","rt @cotydankh: ""are these hoes loyal?""\n\n&#12288; n\n&#12288;&#12288; o\n&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288; .\n&#12288;&#12288;&#12288;&#12288;.","rt @cotydankh: ""are these hoes loyal?""\n\n&#12288; n\n&#12288;&#12288; o\n&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288; .\n&#12288;&#12288;&#12288;&#12288;.",1,"rt : ""are these hoes loyal?""\n\n&#12288; n\n&#12288;&#12288; o\n&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288;&#12288; o\n&#12288;&#12288;&#12288;&#12288;&#12288;o\n&#12288;&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288; &#12290;\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288;.\n&#12288;&#12288;&#12288; .\n&#12288;&#12288;&#12288;&#12288;.",rt are these hoes loyal # n # # o # # # o # # # # o # # # # # o # # # # # o # # # # # o # # # # # # # # # # # # # # # # # # # 
# # #,rt are these hoes loyal n o o o o o o,296
24023,3,0,0,3,0,2,RT @Mr_MshkL: &#1589;&#1608;&#1585;&#1577; &#1604;&#1591;&#1575;&#1574;&#1585; &#1575;&#1604;&#1603;&#1575;&#1585;&#1583;&#1610;&#1606;&#1575;&#1604; &#1575;&#1604;&#1571;&#1581;&#1605;&#1585; &#1575;&#1604;&#1605;&#1605;&#1610;&#1586;&#1548; &#1571;&#1581;&#1583; &#1588;&#1582;&#1589;&#1610;&#1575;&#1578; &#1604;&#1593;&#1576;&#1577; &#1575;&#1604;&#1591;&#1610;&#1608;&#1585; &#1575;&#1604;&#1594;&#1575;&#1590;&#1576;&#1577; angry birds &#1575;&#1604;&#1588;&#1607;&#1610;&#1585;&#1577; !\n&#8226; http://t.co/0lowkClb,rt @mr_mshkl: &#1589;&#1608;&#1585;&#1577; &#1604;&#1591;&#1575;&#1574;&#1585; &#1575;&#1604;&#1603;&#1575;&#1585;&#1583;&#1610;&#1606;&#1575;&#1604; &#1575;&#1604;&#1571;&#1581;&#1605;&#1585; &#1575;&#1604;&#1605;&#1605;&#1610;&#1586;&#1548; &#1571;&#1581;&#1583; &#1588;&#1582;&#1589;&#1610;&#1575;&#1578; &#1604;&#1593;&#1576;&#1577; &#1575;&#1604;&#1591;&#1610;&#1608;&#1585; &#1575;&#1604;&#1594;&#1575;&#1590;&#1576;&#1577; angry birds &#1575;&#1604;&#1588;&#1607;&#1610;&#1585;&#1577; !\n&#8226; http://t.co/0lowkclb,rt @mr_mshkl: &#1589;&#1608;&#1585;&#1577; &#1604;&#1591;&#1575;&#1574;&#1585; &#1575;&#1604;&#1603;&#1575;&#1585;&#1583;&#1610;&#1606;&#1575;&#1604; &#1575;&#1604;&#1571;&#1581;&#1605;&#1585; &#1575;&#1604;&#1605;&#1605;&#1610;&#1586;&#1548; &#1571;&#1581;&#1583; &#1588;&#1582;&#1589;&#1610;&#1575;&#1578; &#1604;&#1593;&#1576;&#1577; &#1575;&#1604;&#1591;&#1610;&#1608;&#1585; &#1575;&#1604;&#1594;&#1575;&#1590;&#1576;&#1577; angry birds &#1575;&#1604;&#1588;&#1607;&#1610;&#1585;&#1577; !\n&#8226;,1,rt : &#1589;&#1608;&#1585;&#1577; &#1604;&#1591;&#1575;&#1574;&#1585; &#1575;&#1604;&#1603;&#1575;&#1585;&#1583;&#1610;&#1606;&#1575;&#1604; &#1575;&#1604;&#1571;&#1581;&#1605;&#1585; &#1575;&#1604;&#1605;&#1605;&#1610;&#1586;&#1548; &#1571;&#1581;&#1583; &#1588;&#1582;&#1589;&#1610;&#1575;&#1578; &#1604;&#1593;&#1576;&#1577; &#1575;&#1604;&#1591;&#1610;&#1608;&#1585; &#1575;&#1604;&#1594;&#1575;&#1590;&#1576;&#1577; angry 
birds &#1575;&#1604;&#1588;&#1607;&#1610;&#1585;&#1577; !\n&#8226;,rt # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # angry birds # # # # # # # #,rt angry birds,295
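
The lengths reported above appear to be inflated by numeric character references such as `&#128515;`, each of which occupies several characters in the raw text but stands for a single symbol. A minimal sketch of measuring length after decoding them, using the standard-library `html.unescape`; the helper name `decoded_length` is hypothetical and not part of the notebook's `custom` module.

```python
import html

def decoded_length(text: str) -> int:
    """Return the length of a tweet after HTML character references are decoded."""
    decoded = html.unescape(text)  # e.g. "&#128515;" becomes a single character
    return len(decoded)

raw = "no summer school? &#128515;&#128515;"
print(len(raw), decoded_length(raw))
```

Applied to a column, this could be written as `df['tweet'].apply(decoded_length)`, which gives a length closer to what a reader of the tweet would see.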


In [238]:
# Remove spaces in tweets to count only characters
df['nospaces'] = df['remove_empty_hashtag'].str.replace(" ", "")

In [239]:
df['character_count'] = df['nospaces'].apply(lambda x: len(x))
df.head(5)

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase,no_url,handle_count,no_handle,no_special,remove_empty_hashtag,tweet_length,nospaces,character_count
0,3,0,0,0,3,3,13 days to go #gettingthere,13 days to go #gettingthere,13 days to go #gettingthere,0,13 days to go #gettingthere,days to go #gettingthere,days to go #gettingthere,30,daystogo#gettingthere,21
1,3,1,2,0,0,1,@anggxo get off my twitter fag,@anggxo get off my twitter fag,@anggxo get off my twitter fag,1,get off my twitter fag,get off my twitter fag,get off my twitter fag,23,getoffmytwitterfag,18
2,3,1,2,0,0,1,These hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,0,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery # # #,these hoes got more bodies than a cemetery,58,thesehoesgotmorebodiesthanacemetery,35
3,3,0,0,0,3,3,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,1,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the racists cuz she's not white.,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,124,afriendjusttoldmeshe'safraidtogotodcrallyampbeattackedby#berniebrosortheracistscuzshe'snotwhite,95
4,3,0,0,3,0,2,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,1,dubya hates todays teabagger movement.,dubya hates todays teabagger movement,dubya hates todays teabagger movement,39,dubyahatestodaysteabaggermovement,33
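
Replacing `" "` removes ordinary spaces only. If tabs or newlines can survive the earlier cleaning steps (an assumption, since that code is not shown here), a variant that ignores all whitespace may be safer. The helper name `count_non_whitespace` is hypothetical.

```python
def count_non_whitespace(text: str) -> int:
    """Count characters that are not whitespace (spaces, tabs, newlines)."""
    # str.split() with no arguments splits on any run of whitespace.
    return len("".join(text.split()))

print(count_non_whitespace("days to go #gettingthere"))
```

It could be applied in the same way as the cell above, for example `df['remove_empty_hashtag'].apply(count_non_whitespace)`.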


In [240]:
# Ensure no missing values
df.isna().sum()

count                   0
hate_speech             0
offensive_language      0
neither                 0
positive                0
class                   0
tweet                   0
tweet_lowercase         0
no_url                  0
handle_count            0
no_handle               0
no_special              0
remove_empty_hashtag    0
tweet_length            0
nospaces                0
character_count         0
dtype: int64
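
`isna()` reports missing values only. A tweet that consisted solely of handles, URLs, or symbols would be reduced to an empty string, which is not counted as missing. A minimal sketch of an additional check, using a made-up Series `cleaned` in place of a column such as `df['remove_empty_hashtag']`:

```python
import pandas as pd

# Toy values: one normal tweet, one empty string, one whitespace-only string.
cleaned = pd.Series(["day to go #gettingthere", "", "   "])

missing = cleaned.isna().sum()               # NaN/None entries
blank = cleaned.str.strip().eq("").sum()     # empty after trimming whitespace
print(missing, blank)
```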

# Lemmatization with Parts of Speech

Lemmatization reduces each word to its base form (its lemma), and the correct lemma depends on whether the word is used as an adjective, adverb, noun, or verb. First, each word in a sentence is treated as a token and given a part-of-speech tag based on the lexical database WordNet. To learn more about WordNet, please visit https://wordnet.princeton.edu/. Tuples of tokens and WordNet tags are then created, and we look for a match. If a match is present, the word is lemmatized as that part of speech. One exception exists ('ass'), which is handled separately within the `getLemma` function.
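
As a rough illustration of the tag-matching step, the sketch below maps part-of-speech tags to WordNet's four categories and lemmatizes only the tokens that match. It is not the notebook's `getLemma`: the names `to_wordnet_pos` and `lemmatize_tagged` are hypothetical, the use of Penn Treebank tags is an assumption, and the special handling of 'ass' is not reproduced.

```python
# Assumed mapping from the first letter of a Penn Treebank tag to a WordNet
# POS code. The codes match nltk.corpus.wordnet.ADJ, VERB, NOUN, and ADV.
PENN_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(penn_tag):
    """Return the WordNet POS code for a tag, or None when there is no match."""
    return PENN_TO_WORDNET.get(penn_tag[:1])

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (token, tag) tuples; tokens without a matching POS pass through.

    `lemmatize` is any callable taking (word, pos) and returning the lemma.
    """
    lemmas = []
    for token, tag in tagged_tokens:
        pos = to_wordnet_pos(tag)
        lemmas.append(lemmatize(token, pos) if pos else token)
    return " ".join(lemmas)
```

With NLTK and its data installed, `tagged_tokens` could come from `nltk.pos_tag(nltk.word_tokenize(text))` and `lemmatize` could be `nltk.stem.WordNetLemmatizer().lemmatize`, which accepts a word and a `pos` code.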

In [241]:
df['lemmatized'] = df['no_special'].apply(getLemma)

In [242]:
df.head(5)

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase,no_url,handle_count,no_handle,no_special,remove_empty_hashtag,tweet_length,nospaces,character_count,lemmatized
0,3,0,0,0,3,3,13 days to go #gettingthere,13 days to go #gettingthere,13 days to go #gettingthere,0,13 days to go #gettingthere,days to go #gettingthere,days to go #gettingthere,30,daystogo#gettingthere,21,day to go # gettingthere
1,3,1,2,0,0,1,@anggxo get off my twitter fag,@anggxo get off my twitter fag,@anggxo get off my twitter fag,1,get off my twitter fag,get off my twitter fag,get off my twitter fag,23,getoffmytwitterfag,18,get off my twitter fag
2,3,1,2,0,0,1,These hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,0,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery # # #,these hoes got more bodies than a cemetery,58,thesehoesgotmorebodiesthanacemetery,35,these hoe get more body than a cemetery # # #
3,3,0,0,0,3,3,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,1,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the racists cuz she's not white.,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,124,afriendjusttoldmeshe'safraidtogotodcrallyampbeattackedby#berniebrosortheracistscuzshe'snotwhite,95,a friend just tell me she 's afraid to go to dc rally amp be attack by # berniebros or the racist cuz she 's not white
4,3,0,0,3,0,2,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,1,dubya hates todays teabagger movement.,dubya hates todays teabagger movement,dubya hates todays teabagger movement,39,dubyahatestodaysteabaggermovement,33,dubya hat today teabagger movement


In [244]:
# Removing spaces after hashtags
df['lemma_no_space_after_hashtag'] = df['lemmatized'].str.replace('# ', '#')
df.head(5)

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase,no_url,handle_count,no_handle,no_special,remove_empty_hashtag,tweet_length,nospaces,character_count,lemmatized,lemma_no_space_after_hashtag
0,3,0,0,0,3,3,13 days to go #gettingthere,13 days to go #gettingthere,13 days to go #gettingthere,0,13 days to go #gettingthere,days to go #gettingthere,days to go #gettingthere,30,daystogo#gettingthere,21,day to go # gettingthere,day to go #gettingthere
1,3,1,2,0,0,1,@anggxo get off my twitter fag,@anggxo get off my twitter fag,@anggxo get off my twitter fag,1,get off my twitter fag,get off my twitter fag,get off my twitter fag,23,getoffmytwitterfag,18,get off my twitter fag,get off my twitter fag
2,3,1,2,0,0,1,These hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,0,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery # # #,these hoes got more bodies than a cemetery,58,thesehoesgotmorebodiesthanacemetery,35,these hoe get more body than a cemetery # # #,these hoe get more body than a cemetery ###
3,3,0,0,0,3,3,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,1,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the racists cuz she's not white.,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,124,afriendjusttoldmeshe'safraidtogotodcrallyampbeattackedby#berniebrosortheracistscuzshe'snotwhite,95,a friend just tell me she 's afraid to go to dc rally amp be attack by # berniebros or the racist cuz she 's not white,a friend just tell me she 's afraid to go to dc rally amp be attack by #berniebros or the racist cuz she 's not white
4,3,0,0,3,0,2,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,1,dubya hates todays teabagger movement.,dubya hates todays teabagger movement,dubya hates todays teabagger movement,39,dubyahatestodaysteabaggermovement,33,dubya hat today teabagger movement,dubya hat today teabagger movement


In [245]:
# Remove the space left before apostrophes by lemmatization (e.g. "she 's" -> "she's")
df['lemma_final'] = df['lemma_no_space_after_hashtag'].str.replace(" '", "'", regex=False)
df.head(5)

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase,no_url,handle_count,no_handle,no_special,remove_empty_hashtag,tweet_length,nospaces,character_count,lemmatized,lemma_no_space_after_hashtag,lemma_final
0,3,0,0,0,3,3,13 days to go #gettingthere,13 days to go #gettingthere,13 days to go #gettingthere,0,13 days to go #gettingthere,days to go #gettingthere,days to go #gettingthere,30,daystogo#gettingthere,21,day to go # gettingthere,day to go #gettingthere,day to go #gettingthere
1,3,1,2,0,0,1,@anggxo get off my twitter fag,@anggxo get off my twitter fag,@anggxo get off my twitter fag,1,get off my twitter fag,get off my twitter fag,get off my twitter fag,23,getoffmytwitterfag,18,get off my twitter fag,get off my twitter fag,get off my twitter fag
2,3,1,2,0,0,1,These hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,0,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery # # #,these hoes got more bodies than a cemetery,58,thesehoesgotmorebodiesthanacemetery,35,these hoe get more body than a cemetery # # #,these hoe get more body than a cemetery ###,these hoe get more body than a cemetery ###
3,3,0,0,0,3,3,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,1,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the racists cuz she's not white.,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,124,afriendjusttoldmeshe'safraidtogotodcrallyampbeattackedby#berniebrosortheracistscuzshe'snotwhite,95,a friend just tell me she 's afraid to go to dc rally amp be attack by # berniebros or the racist cuz she 's not white,a friend just tell me she 's afraid to go to dc rally amp be attack by #berniebros or the racist cuz she 's not white,a friend just tell me she's afraid to go to dc rally amp be attack by #berniebros or the racist cuz she's not white
4,3,0,0,3,0,2,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,1,dubya hates todays teabagger movement.,dubya hates todays teabagger movement,dubya hates todays teabagger movement,39,dubyahatestodaysteabaggermovement,33,dubya hat today teabagger movement,dubya hat today teabagger movement,dubya hat today teabagger movement
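
The apostrophe fix above replaces every occurrence of a space followed by an apostrophe, which would also glue an opening quotation mark to the preceding word. A minimal sketch of a narrower variant follows, assuming only a handful of common contraction suffixes need rejoining. The function name `rejoin_contractions` and the suffix list are illustrative and are not part of this notebook.

```python
import re

# Suffixes that tokenizers commonly split off (illustrative, not exhaustive)
_CONTRACTION = re.compile(r"\s+'(s|ve|re|ll|d|m)\b")

def rejoin_contractions(text: str) -> str:
    """Remove whitespace before an apostrophe only when a contraction suffix follows."""
    return _CONTRACTION.sub(r"'\1", text)

print(rejoin_contractions("i 've notice a lot"))  # i've notice a lot
print(rejoin_contractions("say 'hello' again"))   # unchanged: not a contraction
```

If this variant were preferred, it could be applied with `df['lemma_no_space_after_hashtag'].apply(rejoin_contractions)`.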


Now we can remove stop words and very short words (two characters or fewer), since these rarely carry much meaning. In English, examples of stop words are “the”, “a”, “an”, and “in”. Generally speaking, they are articles, pronouns, and prepositions. You can find more information about stop words at https://www.geeksforgeeks.org/removing-stop-words-nltk-python/.

In [246]:
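# Load the English stop word list (a set makes membership checks fast)
stop = set(stopwords.words('english'))
# Note: this reads from 'lemmatized', so the spacing fixes in 'lemma_final' are not carried over
df['tweet_no_stopwords'] = df['lemmatized'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop))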

In [247]:
df.head(5)

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase,no_url,handle_count,no_handle,no_special,remove_empty_hashtag,tweet_length,nospaces,character_count,lemmatized,lemma_no_space_after_hashtag,lemma_final,tweet_no_stopwords
0,3,0,0,0,3,3,13 days to go #gettingthere,13 days to go #gettingthere,13 days to go #gettingthere,0,13 days to go #gettingthere,days to go #gettingthere,days to go #gettingthere,30,daystogo#gettingthere,21,day to go # gettingthere,day to go #gettingthere,day to go #gettingthere,day go # gettingthere
1,3,1,2,0,0,1,@anggxo get off my twitter fag,@anggxo get off my twitter fag,@anggxo get off my twitter fag,1,get off my twitter fag,get off my twitter fag,get off my twitter fag,23,getoffmytwitterfag,18,get off my twitter fag,get off my twitter fag,get off my twitter fag,get twitter fag
2,3,1,2,0,0,1,These hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,0,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery # # #,these hoes got more bodies than a cemetery,58,thesehoesgotmorebodiesthanacemetery,35,these hoe get more body than a cemetery # # #,these hoe get more body than a cemetery ###,these hoe get more body than a cemetery ###,hoe get body cemetery # # #
3,3,0,0,0,3,3,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,1,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the racists cuz she's not white.,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,124,afriendjusttoldmeshe'safraidtogotodcrallyampbeattackedby#berniebrosortheracistscuzshe'snotwhite,95,a friend just tell me she 's afraid to go to dc rally amp be attack by # berniebros or the racist cuz she 's not white,a friend just tell me she 's afraid to go to dc rally amp be attack by #berniebros or the racist cuz she 's not white,a friend just tell me she's afraid to go to dc rally amp be attack by #berniebros or the racist cuz she's not white,friend tell 's afraid go dc rally amp attack # berniebros racist cuz 's white
4,3,0,0,3,0,2,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,1,dubya hates todays teabagger movement.,dubya hates todays teabagger movement,dubya hates todays teabagger movement,39,dubyahatestodaysteabaggermovement,33,dubya hat today teabagger movement,dubya hat today teabagger movement,dubya hat today teabagger movement,dubya hat today teabagger movement


In [248]:
# Remove words of two characters or fewer because they will likely not be relevant
df['tweet_no_stopwords_no_short'] = df['tweet_no_stopwords'].apply(
    lambda x: ' '.join([word for word in x.split() if len(word) > 2]))
df.head(5)

Unnamed: 0,count,hate_speech,offensive_language,neither,positive,class,tweet,tweet_lowercase,no_url,handle_count,no_handle,no_special,remove_empty_hashtag,tweet_length,nospaces,character_count,lemmatized,lemma_no_space_after_hashtag,lemma_final,tweet_no_stopwords,tweet_no_stopwords_no_short
0,3,0,0,0,3,3,13 days to go #gettingthere,13 days to go #gettingthere,13 days to go #gettingthere,0,13 days to go #gettingthere,days to go #gettingthere,days to go #gettingthere,30,daystogo#gettingthere,21,day to go # gettingthere,day to go #gettingthere,day to go #gettingthere,day go # gettingthere,day gettingthere
1,3,1,2,0,0,1,@anggxo get off my twitter fag,@anggxo get off my twitter fag,@anggxo get off my twitter fag,1,get off my twitter fag,get off my twitter fag,get off my twitter fag,23,getoffmytwitterfag,18,get off my twitter fag,get off my twitter fag,get off my twitter fag,get twitter fag,get twitter fag
2,3,1,2,0,0,1,These hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,0,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,these hoes got more bodies than a cemetery # # #,these hoes got more bodies than a cemetery,58,thesehoesgotmorebodiesthanacemetery,35,these hoe get more body than a cemetery # # #,these hoe get more body than a cemetery ###,these hoe get more body than a cemetery ###,hoe get body cemetery # # #,hoe get body cemetery
3,3,0,0,0,3,3,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,1,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the racists cuz she's not white.,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,124,afriendjusttoldmeshe'safraidtogotodcrallyampbeattackedby#berniebrosortheracistscuzshe'snotwhite,95,a friend just tell me she 's afraid to go to dc rally amp be attack by # berniebros or the racist cuz she 's not white,a friend just tell me she 's afraid to go to dc rally amp be attack by #berniebros or the racist cuz she 's not white,a friend just tell me she's afraid to go to dc rally amp be attack by #berniebros or the racist cuz she's not white,friend tell 's afraid go dc rally amp attack # berniebros racist cuz 's white,friend tell afraid rally amp attack berniebros racist cuz white
4,3,0,0,3,0,2,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,@ramaxe1965 dubya hates todays teabagger movement.,1,dubya hates todays teabagger movement.,dubya hates todays teabagger movement,dubya hates todays teabagger movement,39,dubyahatestodaysteabaggermovement,33,dubya hat today teabagger movement,dubya hat today teabagger movement,dubya hat today teabagger movement,dubya hat today teabagger movement,dubya hat today teabagger movement
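
The two filtering steps above (stop words, then short words) can also be expressed as a single pass over the tokens. The sketch below is only an illustration: `filter_tokens` and `STOP_WORDS` are hypothetical names, and the tiny stop word set stands in for NLTK's `stopwords.words('english')` so that the example runs without downloading any corpus.

```python
# Illustrative subset only; the notebook itself uses NLTK's English stop word list
STOP_WORDS = {"a", "an", "the", "in", "to", "my", "off", "than", "more"}

def filter_tokens(text: str, stop_words=STOP_WORDS, min_len: int = 3) -> str:
    """Keep whitespace-separated tokens that are not stop words and have at least min_len characters."""
    kept = [w for w in text.split() if w not in stop_words and len(w) >= min_len]
    return " ".join(kept)

print(filter_tokens("day to go # gettingthere"))  # day gettingthere
```

Here "to" is removed as a stop word, while "go" and the stray "#" are removed by the length filter, which matches the first row of the table above.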


# Labels for Binary Classification

We should re-label the tweets in our dataset at this point. Since we use a binary classifier, we drop the neutral tweets for the simplicity of the model (and also because, comparatively, they do not form a large enough subset). Positive tweets will be labeled `0` and negative tweets will be labeled `1`.

In [249]:
df['positive'].value_counts()

0    24773
3    21421
Name: positive, dtype: int64

In [250]:
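# Exclude neutral tweets (class 2) because we are building a binary model right now.
# .copy() gives an independent dataframe, so adding columns later does not act on a slice.
df = df.loc[df['class'] != 2].copy()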

In [251]:
len(df)

42031

In [252]:
df['neg_label'] = df['class'].apply(lambda x: 0 if x == 3 else 1)

In [253]:
df['neg_label'].value_counts()

0    21421
1    20610
Name: neg_label, dtype: int64
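
The relabeling above sends every class other than `3` to `1`, so an unexpected class value would silently count as negative. As a hedged sketch, a stricter mapping could raise an error instead. The helper name `to_neg_label` is hypothetical, and the meaning of classes `0` and `1` (hate speech and offensive language) is an assumption based on the column names rather than something stated in this section.

```python
POSITIVE_CLASS = 3
NEGATIVE_CLASSES = {0, 1}  # assumed: 0 = hate speech, 1 = offensive language

def to_neg_label(cls: int) -> int:
    """Map a class code to the binary label: 0 for positive, 1 for negative."""
    if cls == POSITIVE_CLASS:
        return 0
    if cls in NEGATIVE_CLASSES:
        return 1
    # Neutral (2) or unknown codes should have been filtered out beforehand
    raise ValueError(f"unexpected class value: {cls!r}")

print([to_neg_label(c) for c in (3, 1, 0)])  # [0, 1, 1]
```

It could be used in place of the lambda, for example `df['class'].apply(to_neg_label)`.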

# Restructure Final Dataframe

As mentioned in the introduction, we drop several of the columns that were used only in intermediate steps.

In [259]:
# Drop the columns used only in intermediate steps.
# errors='ignore' lets this cell be re-run after the columns are already gone.
df = df.drop(columns=['count', 'hate_speech', 'offensive_language', 'neither', 'positive', 'class',
                      'tweet', 'no_url', 'no_handle', 'no_special', 'nospaces'], errors='ignore')

In [260]:
df.head(5)

Unnamed: 0,tweet_lowercase,handle_count,remove_empty_hashtag,tweet_length,character_count,lemmatized,lemma_no_space_after_hashtag,lemma_final,tweet_no_stopwords,tweet_no_stopwords_no_short,neg_label
0,13 days to go #gettingthere,0,days to go #gettingthere,30,21,day to go # gettingthere,day to go #gettingthere,day to go #gettingthere,day go # gettingthere,day gettingthere,0
1,@anggxo get off my twitter fag,1,get off my twitter fag,23,18,get off my twitter fag,get off my twitter fag,get off my twitter fag,get twitter fag,get twitter fag,1
2,these hoes got more bodies than a cemetery&#128056;&#9749;&#65039;.,0,these hoes got more bodies than a cemetery,58,35,these hoe get more body than a cemetery # # #,these hoe get more body than a cemetery ###,these hoe get more body than a cemetery ###,hoe get body cemetery # # #,hoe get body cemetery,1
3,a friend just told me she's afraid to go to dc rally &amp; be attacked by #berniebros or the @user racists cuz she's not white.,1,a friend just told me she's afraid to go to dc rally amp be attacked by #berniebros or the racists cuz she's not white,124,95,a friend just tell me she 's afraid to go to dc rally amp be attack by # berniebros or the racist cuz she 's not white,a friend just tell me she 's afraid to go to dc rally amp be attack by #berniebros or the racist cuz she 's not white,a friend just tell me she's afraid to go to dc rally amp be attack by #berniebros or the racist cuz she's not white,friend tell 's afraid go dc rally amp attack # berniebros racist cuz 's white,friend tell afraid rally amp attack berniebros racist cuz white,0
5,i've noticed a lot of #icontf16 presentations mention happiness. wonder if profession has above average happiness? @user,1,i've noticed a lot of #icontf presentations mention happiness wonder if profession has above average happiness,117,95,i 've notice a lot of # icontf presentation mention happiness wonder if profession have above average happiness,i 've notice a lot of #icontf presentation mention happiness wonder if profession have above average happiness,i've notice a lot of #icontf presentation mention happiness wonder if profession have above average happiness,'ve notice lot # icontf presentation mention happiness wonder profession average happiness,'ve notice lot icontf presentation mention happiness wonder profession average happiness,0


In [261]:
df.isna().sum()

tweet_lowercase                 0
handle_count                    0
remove_empty_hashtag            0
tweet_length                    0
character_count                 0
lemmatized                      0
lemma_no_space_after_hashtag    0
lemma_final                     0
tweet_no_stopwords              0
tweet_no_stopwords_no_short     0
neg_label                       0
dtype: int64

In [262]:
len(df) # Final length of the dataset for EDA

42031

In [263]:
# Save to file
df.to_csv(r'../data/processed/clean_data.csv', index = False)