# PREPROCESSING

In [2]:
import pandas as pd
import html
import numpy as np
import re

In [3]:
df = pd.read_csv('metoo_tweets_dec2017.csv', encoding="ISO-8859-1")

# This encoding is called 'Latin-1' which covers Latin scripts like German, English, Spanish.
# During trial and error, this encoding transformed the problem characters into something representable and removable.

##### Many rows contained special characters, such as apostrophes and quotation marks, that were improperly encoded.
##### This went unnoticed during the first round of preprocessing; we only realized there was a problem when the preprocessed text made no sense and was full of arbitrary, half-complete words.

##### Later in the preprocessing, we use a library called 'contractions' to expand all contractions in the tweets, because contractions must be expanded before lemmatization can effectively reduce words to their root form.

##### Here, we print some of the problem rows to illustrate exactly what was going wrong.


# Problem Rows:

In [4]:
df[df['Unnamed: 0']==174147]['text'].values[0]

'RT @TomArnold: Iâ\x80\x99m #MeToo &amp; donâ\x80\x99t doubt any other woman but the Leeann Tweeden-John Phillips-Roger Stone lies &amp; set up of Al Franken were pâ\x80¦'

In [5]:
df[df['Unnamed: 0']==49306]['text'].values[0]

'RT @LeeannTweeden: I\x89Ûªve decided it\x89Ûªs time to tell my story. #MeToo\rhttps://t.co/TqTgfvzkZg'

In [6]:
df[df['Unnamed: 0']==32505]['text'].values[0]

'@JustinTrudeau I have been waiting over a month for a response to a letter from @ChandraNepean regarding sexual abu\x89Û_ https://t.co/P7yo6ZteqW'

In [7]:
df[df['Unnamed: 0']==31564]['text'].values[0]

'When you speak the truth they\x89Ûªll always abandon you. #MeToo @SBBookJBraxton https://t.co/AitbenL4E1'

In [8]:
df[df['Unnamed: 0']==27379]['text'].values[0]

'@StockMonsterVIP @3stherfr33bird #metoo he gets it. And he isn\x89Ûªt prejudice! Practical with #EYESWIDEOPEN'

In [9]:
df[df['Unnamed: 0']==25833]['text'].values[0]

'@yashar Franken should resign, in my opinion.  I\x89Ûªm not on Team Blue, I\x89Ûªm not on Team Red, I\x89Ûªm on Team Pink Pussy. V\x89Û_ https://t.co/e01EVf93ck'

In [10]:
df[df['Unnamed: 0']==65379]['text'].values[0]

"RT @apbenven: Reminder that if a woman didn't post #MeToo, it doesn't mean she wasn't sexually assaulted or harassed. Survivors don't owe y\x89Û_"

In [11]:
df[df['Unnamed: 0']==127539]['text'].values[0]

'RT @TomArnold: Iâ\x80\x99m #MeToo &amp; donâ\x80\x99t doubt any other woman but the Leeann Tweeden-John Phillips-Roger Stone lies &amp; set up of Al Franken were pâ\x80¦'
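The `â\x80\x99` sequence in the rows above is consistent with the UTF-8 bytes of a curly apostrophe being read as Latin-1. The following is a minimal sketch of that effect on a made-up string (the sample strings are assumptions, not rows from the dataset); it is an illustration only and not the correction applied in this notebook.

```python
# Hypothetical illustration: how a curly apostrophe can turn into 'â\x80\x99'.
original = 'I\u2019m'                      # "I'm" written with U+2019 (right single quotation mark)

# Encode as UTF-8 (three bytes for U+2019), then read those bytes back as Latin-1.
garbled = original.encode('utf-8').decode('latin-1')
print(repr(garbled))                       # 'Iâ\x80\x99m'

# Reversing the two steps restores the original character.
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired == original)                # True

# The second pattern seen above (\x89 followed by two more characters) is not
# valid UTF-8, so the same reversal fails for it.
other = 'I\x89\xdb\xaave'
try:
    other.encode('latin-1').decode('utf-8')
    reversible = True
except UnicodeDecodeError:
    reversible = False
print(reversible)                          # False
```

Because only some of the garbled sequences can be reversed this way, removing the offending characters is a simpler uniform treatment, at the cost of losing the apostrophes.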

# Function to Correct Encoding Issue

In [12]:
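def decoding_correction(df, column_name):
    """Unescape HTML entities and drop non-ASCII / non-printable characters in place."""
    for index, row in df.iterrows():
        try:
            text = row[column_name]
            text = html.unescape(text)  # Unescape HTML entities first, e.g. '&amp;' -> '&'
            # Note: a UTF-8 encode/decode round trip returns the same string;
            # it only raises for text that cannot be encoded.
            decoded_text = text.encode('utf-8').decode('utf-8')
        except UnicodeEncodeError:
            # Skip rows whose text cannot be encoded; they are left unchanged
            continue

        # Keep only printable ASCII characters (drops garbled characters and control characters such as '\r')
        cleaned_text = ''.join(char for char in decoded_text if ord(char) < 128 and char.isprintable())
        df.at[index, column_name] = cleaned_text

    return df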

How this function works:

- It takes the dataframe and the name of the column whose encoding issue we want to correct.
- For each row, we extract the text from that column.
- We then unescape the HTML entities, so `&lt;` becomes `<` and `&amp;` becomes `&`, because special characters were also present as HTML entities in our dataset.
- We then encode the text to `utf-8` and decode it back. We arrived at this step through trial and error, but for an ordinary Python string this round trip returns the text unchanged, so it does not by itself restore quotes or apostrophes; it only raises an error if the text cannot be encoded.
- If the encode/decode line raises a `UnicodeEncodeError`, the row is skipped and left as it was.
- For every other row, we keep a character only if it is ASCII (code point below 128) and printable. This drops the garbled characters and also control characters such as `\r`, newlines and tabs (ordinary spaces are printable, so they are kept).
- We update the row with the cleaned text.
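The per-character filter can be tried on a single string. The sketch below assumes a hypothetical helper name (`keep_printable_ascii`) and a made-up sample string; it mirrors the filtering logic of the function above without the dataframe loop.

```python
import html

def keep_printable_ascii(text):
    """Unescape HTML entities, then keep only printable ASCII characters."""
    text = html.unescape(text)
    return ''.join(ch for ch in text if ord(ch) < 128 and ch.isprintable())

# Made-up sample containing garbled characters, an HTML entity and a carriage return.
sample = 'it\x89\xdb\xaas time &amp; #MeToo\rhttps://t.co/x'

# The UTF-8 round trip is a no-op for an ordinary string.
print(sample.encode('utf-8').decode('utf-8') == sample)   # True

# The filter removes the garbled characters and the carriage return,
# but keeps ordinary spaces.
print(keep_printable_ascii(sample))            # its time & #MeToohttps://t.co/x
print(' '.isprintable(), '\r'.isprintable())   # True False
```

Dropping `\r` outright glues the hashtag to the URL that followed it; replacing control characters with a space instead would be one way to avoid that, if it matters for later steps.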

In [13]:
# printing the df
decoding_correction(df, 'text')

Unnamed: 0.1,Unnamed: 0,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted,longitude,latitude
0,1,American Harem.. #MeToo https://t.co/HjExLJdGuF,False,0,,11/29/17 23:59,False,,9.360000e+17,,"<a href=""http://instagram.com"" rel=""nofollow"">...",ahmediaTV,0,False,False,,
1,2,@johnconyersjr @alfranken why have you guys ...,False,0,johnconyersjr,11/29/17 23:59,False,,9.360000e+17,266149840.0,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",JesusPrepper74,0,False,False,,
2,3,Watched Megan Kelly ask Joe Keery this A.M. if...,False,0,,11/29/17 23:59,True,,9.360000e+17,,"<a href=""http://twitter.com/download/android"" ...",DemerisePotvin,0,False,False,,
3,4,Women have been talking about this crap the en...,False,0,,11/29/17 23:59,False,,9.360000e+17,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",TheDawnStott,0,False,False,,
4,5,.@BetteMidler please speak to this sexual assa...,False,15,,11/29/17 23:59,False,,9.360000e+17,,"<a href=""http://twitter.com/#!/download/ipad"" ...",scottygirl2014,11,False,False,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
393130,393131,RT @Suffragentleman: You can only choose one.....,False,0,,12/25/17 0:00,False,,9.450820e+17,,"<a href=""http://twitter.com/download/android"" ...",boaomega22,616,True,False,,
393131,393132,"#MeToo, say victims of sexual harassment in Ja...",False,0,,12/25/17 0:00,False,,9.450820e+17,,"<a href=""http://bufferapp.com"" rel=""nofollow"">...",April_Magazine,0,False,False,,
393132,393133,Susan Collins tries to #MeToo her way out of h...,False,0,,12/25/17 0:00,False,,9.450820e+17,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Noofer55,0,False,False,,
393133,393134,RT @OneMillionVjj: Punish those who choose not...,False,0,,12/25/17 0:00,False,,9.450820e+17,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",ZBezzt,5,True,False,,


#### Printing the earlier problem rows to check if the non-ASCII characters were effectively removed/replaced with the correct character.

In [14]:
df[df['Unnamed: 0']==174147]['text'].values[0]

'RT @TomArnold: Im #MeToo & dont doubt any other woman but the Leeann Tweeden-John Phillips-Roger Stone lies & set up of Al Franken were p'

In [15]:
df[df['Unnamed: 0']==49306]['text'].values[0]

'RT @LeeannTweeden: Ive decided its time to tell my story. #MeToohttps://t.co/TqTgfvzkZg'

In [16]:
df[df['Unnamed: 0']==32505]['text'].values[0]

'@JustinTrudeau I have been waiting over a month for a response to a letter from @ChandraNepean regarding sexual abu_ https://t.co/P7yo6ZteqW'

In [17]:
df[df['Unnamed: 0']==31564]['text'].values[0]

'When you speak the truth theyll always abandon you. #MeToo @SBBookJBraxton https://t.co/AitbenL4E1'

In [18]:
df[df['Unnamed: 0']==27379]['text'].values[0]

'@StockMonsterVIP @3stherfr33bird #metoo he gets it. And he isnt prejudice! Practical with #EYESWIDEOPEN'

In [19]:
df[df['Unnamed: 0']==25833]['text'].values[0]

'@yashar Franken should resign, in my opinion.  Im not on Team Blue, Im not on Team Red, Im on Team Pink Pussy. V_ https://t.co/e01EVf93ck'

In [20]:
df[df['Unnamed: 0']==65379]['text'].values[0]

"RT @apbenven: Reminder that if a woman didn't post #MeToo, it doesn't mean she wasn't sexually assaulted or harassed. Survivors don't owe y_"

In [21]:
df[df['Unnamed: 0']==127539]['text'].values[0]

'RT @TomArnold: Im #MeToo & dont doubt any other woman but the Leeann Tweeden-John Phillips-Roger Stone lies & set up of Al Franken were p'

In [22]:
df

Unnamed: 0.1,Unnamed: 0,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted,longitude,latitude
0,1,American Harem.. #MeToo https://t.co/HjExLJdGuF,False,0,,11/29/17 23:59,False,,9.360000e+17,,"<a href=""http://instagram.com"" rel=""nofollow"">...",ahmediaTV,0,False,False,,
1,2,@johnconyersjr @alfranken why have you guys ...,False,0,johnconyersjr,11/29/17 23:59,False,,9.360000e+17,266149840.0,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",JesusPrepper74,0,False,False,,
2,3,Watched Megan Kelly ask Joe Keery this A.M. if...,False,0,,11/29/17 23:59,True,,9.360000e+17,,"<a href=""http://twitter.com/download/android"" ...",DemerisePotvin,0,False,False,,
3,4,Women have been talking about this crap the en...,False,0,,11/29/17 23:59,False,,9.360000e+17,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",TheDawnStott,0,False,False,,
4,5,.@BetteMidler please speak to this sexual assa...,False,15,,11/29/17 23:59,False,,9.360000e+17,,"<a href=""http://twitter.com/#!/download/ipad"" ...",scottygirl2014,11,False,False,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
393130,393131,RT @Suffragentleman: You can only choose one.....,False,0,,12/25/17 0:00,False,,9.450820e+17,,"<a href=""http://twitter.com/download/android"" ...",boaomega22,616,True,False,,
393131,393132,"#MeToo, say victims of sexual harassment in Ja...",False,0,,12/25/17 0:00,False,,9.450820e+17,,"<a href=""http://bufferapp.com"" rel=""nofollow"">...",April_Magazine,0,False,False,,
393132,393133,Susan Collins tries to #MeToo her way out of h...,False,0,,12/25/17 0:00,False,,9.450820e+17,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Noofer55,0,False,False,,
393133,393134,RT @OneMillionVjj: Punish those who choose not...,False,0,,12/25/17 0:00,False,,9.450820e+17,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",ZBezzt,5,True,False,,


#### After going through these (and more), we concluded that the encoding issue was resolved: words like `Iâ\x80\x99m` were converted to either `I'm` or `Im`, both of which are acceptable to the contractions library function, as it gives `I am` in both cases.

# Remove all URLs

#### From the previously printed tweets, we can see that many tweets contained a link to the tweet on Twitter, which was irrelevant to us, so we decided to remove it.

In [23]:
pattern = r'https?://\S+'  # Regular expression for https?://
df['text'] = df['text'].apply(lambda x: re.sub(pattern, '', x))

The pattern is defined as:
- `http` matches the beginning of all URLs that were recorded in the tweets.
- `s?` indicates that the `s` after `http` is optional and may occur 0 or 1 times.
- `://` matches the symbols that follow `http` or `https` in a URL.
- `\S+` matches the domain/path of the URL, since `\S` matches any non-whitespace character and `+` requires one or more of them.
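A quick check of the URL pattern on single strings (the example strings are made up for illustration) shows what it removes and two edge cases worth knowing about.

```python
import re

pattern = r'https?://\S+'

print(repr(re.sub(pattern, '', 'speak the truth. #MeToo https://t.co/AitbenL4E1')))
# 'speak the truth. #MeToo '

# \S+ runs to the next whitespace, so punctuation glued to the URL is removed too.
print(repr(re.sub(pattern, '', 'see http://example.com/a, thanks')))
# 'see  thanks'

# A link written without a scheme is not matched.
print(repr(re.sub(pattern, '', 'see www.example.com')))
# 'see www.example.com'
```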

In [24]:
df[df['Unnamed: 0']==49306]['text'].values[0]

'RT @LeeannTweeden: Ive decided its time to tell my story. #MeToo'

In [25]:
df[df['Unnamed: 0']==32505]['text'].values[0]

'@JustinTrudeau I have been waiting over a month for a response to a letter from @ChandraNepean regarding sexual abu_ '

In [26]:
df[df['Unnamed: 0']==31564]['text'].values[0]

'When you speak the truth theyll always abandon you. #MeToo @SBBookJBraxton '

In [27]:
df[df['Unnamed: 0']==25833]['text'].values[0]

'@yashar Franken should resign, in my opinion.  Im not on Team Blue, Im not on Team Red, Im on Team Pink Pussy. V_ '

#### The URLs were effectively removed from the tweets.

# Remove all usernames @

In [28]:
pattern = r'@\w+'  # Regular expression for Twitter handles
df['text'] = df['text'].apply(lambda x: re.sub(pattern, '', x))
# apply function to replace the pattern in every row with nothing.

the pattern is defined as:
- the `@` tells the function to look out for actual @ since usernames start with @.
- `\w+` matches any word characters namely, uppercase/lowercase letters, digits and underscores.
- the `+` sign indicates that there can be one or more such characters.
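The handle pattern can be checked the same way. The second string below is a hypothetical input, included only to show that anything of the form `@word` is matched, not just Twitter handles.

```python
import re

pattern = r'@\w+'

print(repr(re.sub(pattern, '', 'RT @TomArnold: Im #MeToo')))
# 'RT : Im #MeToo'

# Side effect (hypothetical input): the domain part of an email address also matches.
print(repr(re.sub(pattern, '', 'mail me@example.com')))
# 'mail me.com'
```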

In [29]:
df[df['Unnamed: 0']==49306]['text'].values[0]

'RT : Ive decided its time to tell my story. #MeToo'

In [30]:
df[df['Unnamed: 0']==32505]['text'].values[0]

' I have been waiting over a month for a response to a letter from  regarding sexual abu_ '

In [31]:
df[df['Unnamed: 0']==31564]['text'].values[0]

'When you speak the truth theyll always abandon you. #MeToo  '

In [32]:
df[df['Unnamed: 0']==25833]['text'].values[0]

' Franken should resign, in my opinion.  Im not on Team Blue, Im not on Team Red, Im on Team Pink Pussy. V_ '

In [33]:
df[df['Unnamed: 0']==127539]['text'].values[0]

'RT : Im #MeToo & dont doubt any other woman but the Leeann Tweeden-John Phillips-Roger Stone lies & set up of Al Franken were p'

#### All the usernames were effectively removed.

# Removing leading and trailing spaces

`strip()` is applied to all rows to remove leading and trailing spaces that may have crept in during the preprocessing steps above.


In [34]:
df['text'] = df['text'].apply(lambda x: x.strip())

In [35]:
df[df['Unnamed: 0']==32505]['text'].values[0]

'I have been waiting over a month for a response to a letter from  regarding sexual abu_'

# Remove all hashtags '#'

In [36]:
df.to_csv("preprocessed_1_with_hashtags.csv") 
#saving all the tweets that still contain hashtags because they will be needed later.

In [37]:
# Regular expression pattern for hashtags
pattern = r'#\w+' # identify words starting with '#'
df['text_without_hashtag'] = df['text'].apply(lambda x: re.sub(pattern, '', x))

#### Creating a new column for text that does not contain the hashtags.
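Since the hashtags are kept for later analysis, the same pattern can also be used to list them. This is a small sketch on single strings; `re.findall` is an addition for illustration and is not used in the cells above.

```python
import re

pattern = r'#\w+'
tweet = 'punish those who choose not to.#consent #metoo'

# Collect the hashtags before dropping them (useful for a later hashtag analysis).
print(re.findall(pattern, tweet))                  # ['#consent', '#metoo']

# Remove them and trim the space left behind at the end.
print(repr(re.sub(pattern, '', tweet).strip()))    # 'punish those who choose not to.'

# A hashtag in the middle of a sentence leaves a double space behind.
print(repr(re.sub(pattern, '', 'Im #MeToo & dont')))   # 'Im  & dont'
```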

In [38]:
df[df['Unnamed: 0']==127539]['text_without_hashtag'].values[0]

'RT : Im  & dont doubt any other woman but the Leeann Tweeden-John Phillips-Roger Stone lies & set up of Al Franken were p'

In [39]:
df[df['Unnamed: 0']==31564]['text_without_hashtag'].values[0]

'When you speak the truth theyll always abandon you. '

In [40]:
df[df['Unnamed: 0']==49306]['text_without_hashtag'].values[0]

'RT : Ive decided its time to tell my story. '

In [41]:
df['text_without_hashtag'] = df['text_without_hashtag'].apply(lambda x: x.strip())
# removing leading and trailing spaces that may have crept in after removal of hashtags.

In [42]:
# checking if the trailing/leading spaces are gone

In [43]:
df[df['Unnamed: 0']==49306]['text_without_hashtag'].values[0]

'RT : Ive decided its time to tell my story.'

In [44]:
df[df['Unnamed: 0']==31564]['text_without_hashtag'].values[0]

'When you speak the truth theyll always abandon you.'

# Removing " RT : "

These markers appeared in retweets and did not provide any meaningful information to us, so we removed them.

In [45]:
df['text'] = df['text'].str.replace('RT : ', '')
df['text_without_hashtag'] = df['text_without_hashtag'].str.replace('RT : ', '')
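The plain replacement removes `RT : ` wherever it occurs in a tweet. The sketch below compares it with an anchored regular expression that only removes a leading marker; the anchored pattern and the second sample string are assumptions for illustration, not part of the original pipeline.

```python
import re

retweet = 'RT : Ive decided its time to tell my story.'
quoted = 'so true RT : yes'        # hypothetical tweet with the marker mid-text

# Plain replacement, as used above: removes the marker wherever it occurs.
print(retweet.replace('RT : ', ''))          # Ive decided its time to tell my story.
print(quoted.replace('RT : ', ''))           # so true yes

# Assumed alternative: anchor the pattern so only a leading marker is removed.
leading_rt = re.compile(r'^RT\s*:\s*')
print(leading_rt.sub('', retweet))           # Ive decided its time to tell my story.
print(leading_rt.sub('', quoted))            # so true RT : yes
```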


In [46]:
df[df['Unnamed: 0']==49306]['text_without_hashtag'].values[0]

'Ive decided its time to tell my story.'

In [47]:
df[df['Unnamed: 0']==49306]['text'].values[0]

'Ive decided its time to tell my story. #MeToo'

In [48]:
df[df['Unnamed: 0']==32505]['text'].values[0]

'I have been waiting over a month for a response to a letter from  regarding sexual abu_'

In [49]:
df[df['Unnamed: 0']==25833]['text'].values[0]

'Franken should resign, in my opinion.  Im not on Team Blue, Im not on Team Red, Im on Team Pink Pussy. V_'

# Removing trailing underscores

At this point, we have two columns:
- 'text' --- which has all the tweets, without URLs, without @usernames, without "RT : ", and with leading and trailing spaces removed.

- 'text_without_hashtag' --- same as 'text', but without the hashtags.

- The tweets with hashtags were preserved because we want to analyse the hashtags.

- Now, some tweets have trailing underscores left over from the encoding cleanup. `rstrip('_')` will be applied to both 'text' and 'text_without_hashtag'.
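As a quick illustration of what `rstrip('_')` does on plain strings (the second and third strings are made up), it removes every underscore at the end of the string and nothing else.

```python
print('regarding sexual abu_'.rstrip('_'))   # regarding sexual abu
print('V__'.rstrip('_'))                     # V   (every trailing underscore is removed)
print('_snake_case'.rstrip('_'))             # _snake_case   (leading and inner ones stay)
```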

In [50]:
df['text'] = df['text'].str.rstrip('_') # removing trailing underscore from 'text'

In [51]:
df['text_without_hashtag'] = df['text_without_hashtag'].str.rstrip('_') # removing trailing underscore from 'text_without_hashtag'

In [52]:
df[df['Unnamed: 0']==25833]['text'].values[0]

'Franken should resign, in my opinion.  Im not on Team Blue, Im not on Team Red, Im on Team Pink Pussy. V'

In [53]:
df[df['Unnamed: 0']==32505]['text'].values[0]

'I have been waiting over a month for a response to a letter from  regarding sexual abu'

In [54]:
df[df['Unnamed: 0']==32505]['text_without_hashtag'].values[0]

'I have been waiting over a month for a response to a letter from  regarding sexual abu'

# Remove trailing and leading double quotes "

#### Some tweets were wrapped in double quotes, so we decided to eliminate the double quotes too.

In [55]:
df[df['Unnamed: 0']==293277]['text_without_hashtag'].values[0]

'"Breaking our silence is a must. But without a dialogue,  might just seem not accessible for many Muslim women."'

In [56]:
df[df['Unnamed: 0']==293277]['text'].values[0]

'"Breaking our silence is a must. But without a dialogue, #MeToo might just seem not accessible for many Muslim women." #TheFe'

In [57]:
df['text'] = df['text'].str.strip('"')
df['text_without_hashtag'] = df['text_without_hashtag'].str.strip('"')

In [58]:
df[df['Unnamed: 0']==293277]['text'].values[0]

'Breaking our silence is a must. But without a dialogue, #MeToo might just seem not accessible for many Muslim women." #TheFe'

In [59]:
df[df['Unnamed: 0']==293277]['text_without_hashtag'].values[0]

'Breaking our silence is a must. But without a dialogue,  might just seem not accessible for many Muslim women.'
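The `text` output above still contains a double quote because `strip('"')` only looks at the two ends of the string, and that tweet ends with a hashtag rather than the closing quote. A minimal sketch with shortened strings; the `replace` line is an assumed alternative, not what the notebook does.

```python
wrapped = '"Breaking our silence is a must."'
with_tag = '"Breaking our silence is a must." #TheFe'

print(wrapped.strip('"'))    # Breaking our silence is a must.
print(with_tag.strip('"'))   # Breaking our silence is a must." #TheFe

# Assumed alternative: drop every double quote, wherever it appears.
print(with_tag.replace('"', ''))   # Breaking our silence is a must. #TheFe
```

In this notebook the leftover quote is removed later anyway, during punctuation removal, but only in the columns derived from `text_without_hashtag`.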

# Removal of trailing and leading spaces (again)

In [60]:
df['text'] = df['text'].apply(lambda x: x.strip())
df['text_without_hashtag'] = df['text_without_hashtag'].apply(lambda x: x.strip())

# Converting all the words to lowercase

In [61]:
df['text'] = df['text'].str.lower()
df['text_without_hashtag'] = df['text_without_hashtag'].str.lower()

# Expanding contractions

In [62]:
import contractions
def expand_contractions(text):
   expanded_text = contractions.fix(text)
   return expanded_text

how the function works:
- we used the .fix function from the contractions library to expand the contractions and returned the expanded text.

In [63]:
df['expanded_tweet'] = df['text_without_hashtag'].apply(lambda x: expand_contractions(x))

The tweets with expanded contractions were stored in a newly created column called `expanded_tweet`.

# Removing punctuation


In [64]:
import string
exclude = string.punctuation
remove_punctuation = str.maketrans('','',exclude)

Code explanation:
- The `string` module is imported; it contains various string constants such as letters, digits and punctuation.
- `exclude` is defined as a string that contains all ASCII punctuation characters.
- The `str.maketrans` function is used to make a translation table for the replacement/removal.
- The first two arguments say which characters are to be replaced with which.
- Since there is no replacement, only removal, both are empty strings. We have no mappings.
- The third argument contains a string of all the punctuation characters to be removed.
- So the line `str.maketrans('','',exclude)` says "remove all the characters present in the `exclude` string from the original string".


In [65]:
df['punctuation_removed'] = df['expanded_tweet'].apply(lambda x:x.translate(remove_punctuation))

- The `translate` method applies the translation table built by `str.maketrans` to each tweet.
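The translation table can be tried on single strings. The first two inputs are taken from the dataframe preview; the last one is a made-up string that shows a limitation of `string.punctuation`.

```python
import string

remove_punctuation = str.maketrans('', '', string.punctuation)

print('american harem..'.translate(remove_punctuation))      # american harem
print('this a.m. if'.translate(remove_punctuation))          # this am if

# string.punctuation is ASCII-only, so a curly apostrophe (U+2019) would survive.
print('don\u2019t'.translate(remove_punctuation) == 'don\u2019t')   # True
```

In this notebook non-ASCII characters have already been removed by the encoding step, so the ASCII-only table is sufficient here.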

In [66]:
df

Unnamed: 0.1,Unnamed: 0,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted,longitude,latitude,text_without_hashtag,expanded_tweet,punctuation_removed
0,1,american harem.. #metoo,False,0,,11/29/17 23:59,False,,9.360000e+17,,"<a href=""http://instagram.com"" rel=""nofollow"">...",ahmediaTV,0,False,False,,,american harem..,american harem..,american harem
1,2,why have you guys not resigned yet? liberal hy...,False,0,johnconyersjr,11/29/17 23:59,False,,9.360000e+17,266149840.0,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",JesusPrepper74,0,False,False,,,why have you guys not resigned yet? liberal hy...,why have you guys not resigned yet? liberal hy...,why have you guys not resigned yet liberal hyp...
2,3,watched megan kelly ask joe keery this a.m. if...,False,0,,11/29/17 23:59,True,,9.360000e+17,,"<a href=""http://twitter.com/download/android"" ...",DemerisePotvin,0,False,False,,,watched megan kelly ask joe keery this a.m. if...,watched megan kelly ask joe keery this a.m. if...,watched megan kelly ask joe keery this am if s...
3,4,women have been talking about this crap the en...,False,0,,11/29/17 23:59,False,,9.360000e+17,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",TheDawnStott,0,False,False,,,women have been talking about this crap the en...,women have been talking about this crap the en...,women have been talking about this crap the en...
4,5,. please speak to this sexual assault by duri...,False,15,,11/29/17 23:59,False,,9.360000e+17,,"<a href=""http://twitter.com/#!/download/ipad"" ...",scottygirl2014,11,False,False,,,. please speak to this sexual assault by duri...,. please speak to this sexual assault by duri...,please speak to this sexual assault by durin...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
393130,393131,you can only choose one...#metoo,False,0,,12/25/17 0:00,False,,9.450820e+17,,"<a href=""http://twitter.com/download/android"" ...",boaomega22,616,True,False,,,you can only choose one...,you can only choose one...,you can only choose one
393131,393132,"#metoo, say victims of sexual harassment in ja...",False,0,,12/25/17 0:00,False,,9.450820e+17,,"<a href=""http://bufferapp.com"" rel=""nofollow"">...",April_Magazine,0,False,False,,,", say victims of sexual harassment in japan (v...",", say victims of sexual harassment in japan (v...",say victims of sexual harassment in japan via...
393132,393133,susan collins tries to #metoo her way out of h...,False,0,,12/25/17 0:00,False,,9.450820e+17,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",Noofer55,0,False,False,,,susan collins tries to her way out of her tax...,susan collins tries to her way out of her tax...,susan collins tries to her way out of her tax...
393133,393134,punish those who choose not to.#consent #metoo,False,0,,12/25/17 0:00,False,,9.450820e+17,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",ZBezzt,5,True,False,,,punish those who choose not to.,punish those who choose not to.,punish those who choose not to


# Removing stopwords

Removal of stopwords is important because:
- they carry little to no meaning on their own
- they create unnecessary noise
- they dilute the importance of the meaningful words during sentiment analysis / topic modelling
- examples: this, that, and, in, is, at

In [67]:
from nltk.corpus import stopwords
# importing the list of stopwords from the nltk library
# it contains a corpus of stopwords for different languages

In [68]:
stop_words = set(stopwords.words('english'))
# getting the list of stop words in the english language.

Before stopword removal, each tweet must be tokenized into a list of words so that individual words can be checked against the stopword list.

In [69]:
from nltk.tokenize import word_tokenize, sent_tokenize
df['tokenized_tweets'] = df['punctuation_removed'].apply(lambda x: word_tokenize(x))
# storing the tokenized_tweets in a different column.

In [70]:
df['tokenized_tweets']

0                                         [american, harem]
1         [why, have, you, guys, not, resigned, yet, lib...
2         [watched, megan, kelly, ask, joe, keery, this,...
3         [women, have, been, talking, about, this, crap...
4         [please, speak, to, this, sexual, assault, by,...
                                ...                        
393130                        [you, can, only, choose, one]
393131    [say, victims, of, sexual, harassment, in, jap...
393132    [susan, collins, tries, to, her, way, out, of,...
393133                [punish, those, who, choose, not, to]
393134    [chief, justice, john, roberts, orders, miscon...
Name: tokenized_tweets, Length: 393135, dtype: object

In [71]:
def remove_stopwords(tokens):
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
    return filtered_tokens

How the function works:
- it takes in the list of tokens for one tweet
- then creates a new list that only contains tokens that are not in the `stop_words` set
- returns that new list
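The filter can be sketched without NLTK by using a tiny stand-in stopword set. The set below is hypothetical and only covers the words needed for the two example tweets from the dataframe preview.

```python
# Hypothetical, tiny stand-in for NLTK's English stopword list.
stop_words = {'why', 'have', 'you', 'not', 'those', 'who', 'to'}

def remove_stopwords(tokens):
    """Return the tokens that are not in the stop_words set."""
    return [token for token in tokens if token.lower() not in stop_words]

print(remove_stopwords(['punish', 'those', 'who', 'choose', 'not', 'to']))
# ['punish', 'choose']
print(remove_stopwords(['why', 'have', 'you', 'guys', 'not', 'resigned', 'yet']))
# ['guys', 'resigned', 'yet']
```

Both examples lose the word "not", as they do in the outputs of this notebook. If negation matters for a later sentiment analysis, keeping such words out of the stopword set may be worth considering.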

In [72]:
df['tokenized_stopword_removed_tweets'] = df['tokenized_tweets'].apply(remove_stopwords) # results in a list of words

In [73]:
df['almost_clean_tweets'] = df['tokenized_stopword_removed_tweets'].apply(lambda x: ' '.join(x)) 
# join them back together after stopword removal, helps eliminate any unnecessary space

In [74]:
df

Unnamed: 0.1,Unnamed: 0,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,...,isRetweet,retweeted,longitude,latitude,text_without_hashtag,expanded_tweet,punctuation_removed,tokenized_tweets,tokenized_stopword_removed_tweets,almost_clean_tweets
0,1,american harem.. #metoo,False,0,,11/29/17 23:59,False,,9.360000e+17,,...,False,False,,,american harem..,american harem..,american harem,"[american, harem]","[american, harem]",american harem
1,2,why have you guys not resigned yet? liberal hy...,False,0,johnconyersjr,11/29/17 23:59,False,,9.360000e+17,266149840.0,...,False,False,,,why have you guys not resigned yet? liberal hy...,why have you guys not resigned yet? liberal hy...,why have you guys not resigned yet liberal hyp...,"[why, have, you, guys, not, resigned, yet, lib...","[guys, resigned, yet, liberal, hypocrisy]",guys resigned yet liberal hypocrisy
2,3,watched megan kelly ask joe keery this a.m. if...,False,0,,11/29/17 23:59,True,,9.360000e+17,,...,False,False,,,watched megan kelly ask joe keery this a.m. if...,watched megan kelly ask joe keery this a.m. if...,watched megan kelly ask joe keery this am if s...,"[watched, megan, kelly, ask, joe, keery, this,...","[watched, megan, kelly, ask, joe, keery, rub, ...",watched megan kelly ask joe keery rub fingers ...
3,4,women have been talking about this crap the en...,False,0,,11/29/17 23:59,False,,9.360000e+17,,...,False,False,,,women have been talking about this crap the en...,women have been talking about this crap the en...,women have been talking about this crap the en...,"[women, have, been, talking, about, this, crap...","[women, talking, crap, entire, time, finally, ...",women talking crap entire time finally someone...
4,5,. please speak to this sexual assault by duri...,False,15,,11/29/17 23:59,False,,9.360000e+17,,...,False,False,,,. please speak to this sexual assault by duri...,. please speak to this sexual assault by duri...,please speak to this sexual assault by durin...,"[please, speak, to, this, sexual, assault, by,...","[please, speak, sexual, assault, interview]",please speak sexual assault interview
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
393130,393131,you can only choose one...#metoo,False,0,,12/25/17 0:00,False,,9.450820e+17,,...,True,False,,,you can only choose one...,you can only choose one...,you can only choose one,"[you, can, only, choose, one]","[choose, one]",choose one
393131,393132,"#metoo, say victims of sexual harassment in ja...",False,0,,12/25/17 0:00,False,,9.450820e+17,,...,False,False,,,", say victims of sexual harassment in japan (v...",", say victims of sexual harassment in japan (v...",say victims of sexual harassment in japan via...,"[say, victims, of, sexual, harassment, in, jap...","[say, victims, sexual, harassment, japan, via,...",say victims sexual harassment japan via asahi ...
393132,393133,susan collins tries to #metoo her way out of h...,False,0,,12/25/17 0:00,False,,9.450820e+17,,...,False,False,,,susan collins tries to her way out of her tax...,susan collins tries to her way out of her tax...,susan collins tries to her way out of her tax...,"[susan, collins, tries, to, her, way, out, of,...","[susan, collins, tries, way, tax, bill, debacle]",susan collins tries way tax bill debacle
393133,393134,punish those who choose not to.#consent #metoo,False,0,,12/25/17 0:00,False,,9.450820e+17,,...,True,False,,,punish those who choose not to.,punish those who choose not to.,punish those who choose not to,"[punish, those, who, choose, not, to]","[punish, choose]",punish choose


At this point, the `almost_clean_tweets` column contains tweets on which the following operations have been performed:
- removed links
- removed usernames
- removed hashtags
- removed 'RT : '
- removed trailing/leading spaces
- removed trailing underscores
- removed trailing/leading quotation marks
- converted to lowercase
- contractions expanded
- punctuation removed
- tokenized & then stopwords removed
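The string-level steps in this list (everything before contraction expansion) can be gathered into one function. This is a minimal sketch, assuming a hypothetical helper name (`basic_clean`) and flag (`drop_hashtags`); it follows the order of the cells above and is meant for checking individual tweets, not as a replacement for the notebook cells.

```python
import re

URL = re.compile(r'https?://\S+')
HANDLE = re.compile(r'@\w+')
HASHTAG = re.compile(r'#\w+')

def basic_clean(text, drop_hashtags=True):
    """Apply the regex and strip steps to one tweet, in the order used above."""
    text = URL.sub('', text)                    # remove links
    text = HANDLE.sub('', text).strip()         # remove usernames
    if drop_hashtags:
        text = HASHTAG.sub('', text).strip()    # remove hashtags
    text = text.replace('RT : ', '')            # remove retweet marker
    text = text.rstrip('_').strip('"').strip()  # underscores, quotes, spaces
    return text.lower()

raw = 'RT @LeeannTweeden: Ive decided its time to tell my story. #MeToohttps://t.co/TqTgfvzkZg'
print(basic_clean(raw))                        # ive decided its time to tell my story.
print(basic_clean(raw, drop_hashtags=False))   # ive decided its time to tell my story. #metoo
```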

# Lemmatization (final step)

In [75]:
import spacy
nlp = spacy.load("en_core_web_md")

# loading the pre-trained medium english model from spacy

In [76]:
lemmatized_tweets = []

for text in df['almost_clean_tweets']:
    doc = nlp(text)
    lemmatized_text = " ".join([token.lemma_ for token in doc])
    lemmatized_tweets.append(lemmatized_text)
    
    
df['lemmatized_tweets'] = lemmatized_tweets

Code explanation:
- `nlp(text)` tokenizes the tweet implicitly (it returns a `Doc` object rather than a list of strings).
- Additionally, it performs linguistic and syntactic analysis on the text.
- A list comprehension collects the `.lemma_` attribute of each token, i.e. its root form.
- The list is joined into a single string, `lemmatized_text`, which is appended to `lemmatized_tweets`.
- Finally, a new column is created for the lemmatized tweets.
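With several hundred thousand tweets, calling `nlp(text)` once per tweet can be slow. spaCy pipelines also offer `nlp.pipe`, which processes texts in batches. The sketch below assumes a hypothetical helper name (`lemmatize_texts`) and an arbitrary batch size; it takes the loaded pipeline as an argument, so the definition itself does not require spaCy to be imported.

```python
def lemmatize_texts(nlp, texts, batch_size=1000):
    """Lemmatize an iterable of strings with a loaded spaCy pipeline.

    nlp.pipe processes the texts in batches instead of one call per tweet.
    """
    lemmatized = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        # Join the lemma of every token back into one string per tweet.
        lemmatized.append(" ".join(token.lemma_ for token in doc))
    return lemmatized

# Usage sketch (requires spaCy and the en_core_web_md model to be installed):
# import spacy
# nlp = spacy.load("en_core_web_md")
# df['lemmatized_tweets'] = lemmatize_texts(nlp, df['almost_clean_tweets'])
```

The result should match the loop above, one lemmatized string per tweet in the same order.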

##### Lemmatization is the final step; the tweets now contain only the words that are relevant to us, in their root form.

# Exporting the preprocessed dataframe

In [77]:
df.to_csv("Preprocessed_Tweets.csv")

----------------------------