# Introduction to NLP preprocessing and encoding

## Part 1: Preprocessing

For this first section, we will be running through some common preprocessing steps. Remember, preprocessing should be based on your data and use case, not a one size fits all approach.

We will be using a subsample of Twitter data from this dataset: https://www.kaggle.com/datasets/monogenea/game-of-thrones-twitter?resource=download

In [1]:
import pandas as pd
import numpy as np
# read in the gotTwitter dataset we will be working with
got_data = pd.read_csv('gotTwitter.csv', dtype='str')

In [2]:
# set_option to view full column in notebook
pd.set_option('display.max_colwidth', None)

# make sure it loaded correctly, there should be 3 columns, an id number, a created_at date, and the text of the tweets
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,"Over 370,000 'Game of Thrones' fans sign petition for remake of season 8 https://t.co/jqxWu4E5k3"
1,x1129144306206298112,5/16/2019 21:59,"With both Game of Thrones and The Big Bang Theory ending this week, I<U+2019>m wondering if it would be fun to take hugely-popular, long-running, juggernaut shows I<U+2019>ve never seen a single episode of, and watch JUST the finale, and see what I think?"
2,x1129144249205895171,5/16/2019 21:59,"Suddenly, last episode, Daenerys embraced the thrill of genocide, specifically targeting civilians with dragon fire. Personality changes happen in fiction, but not with such lack of subtlety -- not to characters @GameOfThrones respect and understand.\n\n https://t.co/FR9HzLcGB0"
3,x1129144246869663745,5/16/2019 21:59,"Sprinkles causes a stampede by releasing a limited-time-only <U+2018>Game of Thrones<U+2019> dragon fruit cupcake sold Friday, May 17-Sunday, May 19 https://t.co/9zNhC00Sj4"
4,x1129141956095954956,5/16/2019 21:49,"<U+2018>Game of Thrones<U+2019> is airing its final episode, and here<U+2019>s what we<U+2019>ll miss when it ends https://t.co/Adb12iWRqb"


#### Remove things like special characters, symbols, punctuation, URLs, etc. from the data that contains little information for a model to learn and are often primarily noise.

In [3]:
import re, string #import packages for regex replacement

def clean_text_round(row):
    row = re.sub(r'http\S+', '', row) #remove urls
    row = re.sub(r"(?<![@\w])@(\w{1,25})", '', row) #remove mentions
    return row

clean = lambda x: clean_text_round(x)

In [4]:
# apply the function above across each row of the text column
got_data.loc[:, 'text'] = got_data['text'].apply(clean)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,"Over 370,000 'Game of Thrones' fans sign petition for remake of season 8"
1,x1129144306206298112,5/16/2019 21:59,"With both Game of Thrones and The Big Bang Theory ending this week, I<U+2019>m wondering if it would be fun to take hugely-popular, long-running, juggernaut shows I<U+2019>ve never seen a single episode of, and watch JUST the finale, and see what I think?"
2,x1129144249205895171,5/16/2019 21:59,"Suddenly, last episode, Daenerys embraced the thrill of genocide, specifically targeting civilians with dragon fire. Personality changes happen in fiction, but not with such lack of subtlety -- not to characters respect and understand.\n\n"
3,x1129144246869663745,5/16/2019 21:59,"Sprinkles causes a stampede by releasing a limited-time-only <U+2018>Game of Thrones<U+2019> dragon fruit cupcake sold Friday, May 17-Sunday, May 19"
4,x1129141956095954956,5/16/2019 21:49,"<U+2018>Game of Thrones<U+2019> is airing its final episode, and here<U+2019>s what we<U+2019>ll miss when it ends"


We can see there are still elements that appear noisy, let's add a line to our function:

In [5]:
# added a new line
def clean_text_round(row):
    row = re.sub(r'http\S+', '', row) #remove urls
    row = re.sub(r"(?<![@\w])@(\w{1,25})", '', row) #remove mentions
    row = re.sub(r"<[^>]+>", '', row) # remove carrot inserts from collection <---- new operation
    return row

clean = lambda x: clean_text_round(x)

In [7]:
# apply the function above across each row of the text column
got_data.loc[:, 'text'] = got_data['text'].apply(clean)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,"Over 370,000 'Game of Thrones' fans sign petition for remake of season 8"
1,x1129144306206298112,5/16/2019 21:59,"With both Game of Thrones and The Big Bang Theory ending this week, Im wondering if it would be fun to take hugely-popular, long-running, juggernaut shows Ive never seen a single episode of, and watch JUST the finale, and see what I think?"
2,x1129144249205895171,5/16/2019 21:59,"Suddenly, last episode, Daenerys embraced the thrill of genocide, specifically targeting civilians with dragon fire. Personality changes happen in fiction, but not with such lack of subtlety -- not to characters respect and understand.\n\n"
3,x1129144246869663745,5/16/2019 21:59,"Sprinkles causes a stampede by releasing a limited-time-only Game of Thrones dragon fruit cupcake sold Friday, May 17-Sunday, May 19"
4,x1129141956095954956,5/16/2019 21:49,"Game of Thrones is airing its final episode, and heres what well miss when it ends"


Much better! What other cleaning steps might you take here? Add a line to the function for additional cleaning steps.

In [8]:
# your code here
def clean_new_line(row):
    row = row.replace("\n","")
    row = re.sub(r'[^\w\s]', '', row)
    return row

new_line_clean = lambda x: clean_new_line(x)

In [9]:
got_data.loc[:, 'text'] = got_data['text'].apply(new_line_clean)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,Over 370000 Game of Thrones fans sign petition for remake of season 8
1,x1129144306206298112,5/16/2019 21:59,With both Game of Thrones and The Big Bang Theory ending this week Im wondering if it would be fun to take hugelypopular longrunning juggernaut shows Ive never seen a single episode of and watch JUST the finale and see what I think
2,x1129144249205895171,5/16/2019 21:59,Suddenly last episode Daenerys embraced the thrill of genocide specifically targeting civilians with dragon fire Personality changes happen in fiction but not with such lack of subtlety not to characters respect and understand
3,x1129144246869663745,5/16/2019 21:59,Sprinkles causes a stampede by releasing a limitedtimeonly Game of Thrones dragon fruit cupcake sold Friday May 17Sunday May 19
4,x1129141956095954956,5/16/2019 21:49,Game of Thrones is airing its final episode and heres what well miss when it ends


#### Lowercase

There are many ways to lowercase your data, here we use re, feel free to drop in your new lines as we redefine the function

In [10]:
# added a new line
def clean_text_round(row):
    row = row.lower() # make lowercase  <---- new operation
    row = re.sub(r'http\S+', '', row) #remove urls
    row = re.sub(r"(?<![@\w])@(\w{1,25})", '', row) #remove mentions
    row = re.sub(r"<[^>]+>", '', row) # remove carrot inserts from collection
    return row

clean = lambda x: clean_text_round(x)

In [11]:
got_data.loc[:, 'text'] = got_data['text'].apply(clean)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,over 370000 game of thrones fans sign petition for remake of season 8
1,x1129144306206298112,5/16/2019 21:59,with both game of thrones and the big bang theory ending this week im wondering if it would be fun to take hugelypopular longrunning juggernaut shows ive never seen a single episode of and watch just the finale and see what i think
2,x1129144249205895171,5/16/2019 21:59,suddenly last episode daenerys embraced the thrill of genocide specifically targeting civilians with dragon fire personality changes happen in fiction but not with such lack of subtlety not to characters respect and understand
3,x1129144246869663745,5/16/2019 21:59,sprinkles causes a stampede by releasing a limitedtimeonly game of thrones dragon fruit cupcake sold friday may 17sunday may 19
4,x1129141956095954956,5/16/2019 21:49,game of thrones is airing its final episode and heres what well miss when it ends


#### Tokenization

Again, there are many ways to tokenize. The package NLTK has built in functions to help us tokenize in different ways, including by word and by sentence. Here, we tokenize using a special tweet tokenizer that is able to take things like emojis into account. Read the documentation here: https://www.nltk.org/api/nltk.tokenize.html

In [12]:
# from nltk import tokenizer 
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer() 

# define a function that we can apply over our data
def tweet_tokenize(row):
    row = tweet_tokenizer.tokenize(row)
    return row

tokenized = lambda x: tweet_tokenize(x)

In [13]:
got_data.loc[:, 'text'] = got_data['text'].apply(tokenized)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,"[over, 370000, game, of, thrones, fans, sign, petition, for, remake, of, season, 8]"
1,x1129144306206298112,5/16/2019 21:59,"[with, both, game, of, thrones, and, the, big, bang, theory, ending, this, week, im, wondering, if, it, would, be, fun, to, take, hugelypopular, longrunning, juggernaut, shows, ive, never, seen, a, single, episode, of, and, watch, just, the, finale, and, see, what, i, think]"
2,x1129144249205895171,5/16/2019 21:59,"[suddenly, last, episode, daenerys, embraced, the, thrill, of, genocide, specifically, targeting, civilians, with, dragon, fire, personality, changes, happen, in, fiction, but, not, with, such, lack, of, subtlety, not, to, characters, respect, and, understand]"
3,x1129144246869663745,5/16/2019 21:59,"[sprinkles, causes, a, stampede, by, releasing, a, limitedtimeonly, game, of, thrones, dragon, fruit, cupcake, sold, friday, may, 17sunday, may, 19]"
4,x1129141956095954956,5/16/2019 21:49,"[game, of, thrones, is, airing, its, final, episode, and, heres, what, well, miss, when, it, ends]"


#### Remove stop words

In [14]:
import nltk
nltk.download('stopwords') # download the list of stopwords from nltk if you have not done this before
from nltk.corpus import stopwords # import stopwords

stopeng = set(stopwords.words('english')) #set language

#define a function to remove stopwords
def remove_stopwords(row):
    row = [w for w in row if w not in stopeng]
    return row

no_stopwords = lambda x: remove_stopwords(x)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\saahi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
got_data.loc[:, 'text'] = got_data['text'].apply(no_stopwords)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,"[370000, game, thrones, fans, sign, petition, remake, season, 8]"
1,x1129144306206298112,5/16/2019 21:59,"[game, thrones, big, bang, theory, ending, week, im, wondering, would, fun, take, hugelypopular, longrunning, juggernaut, shows, ive, never, seen, single, episode, watch, finale, see, think]"
2,x1129144249205895171,5/16/2019 21:59,"[suddenly, last, episode, daenerys, embraced, thrill, genocide, specifically, targeting, civilians, dragon, fire, personality, changes, happen, fiction, lack, subtlety, characters, respect, understand]"
3,x1129144246869663745,5/16/2019 21:59,"[sprinkles, causes, stampede, releasing, limitedtimeonly, game, thrones, dragon, fruit, cupcake, sold, friday, may, 17sunday, may, 19]"
4,x1129141956095954956,5/16/2019 21:49,"[game, thrones, airing, final, episode, heres, well, miss, ends]"


What do you notice about our data as we add additional preprocessing? How do you think this will impact our analysis?

Human interpretation is lost, however the data becomes more machine-readable to make better predictions.

#### Lemmatization/Stemming

We'll start with lemmatization:

In [16]:
nltk.download('wordnet') # you may need to run these depending on your setup
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()

# define function to lemmatize
def lemmatize(row):
    row = [lmtzr.lemmatize(token) for token in row]
    row = ' '.join(row) # this is the final step of our guided walkthrough, so I have re-joined the tweets into single documents instead of lists
    return row

lemmatized = lambda x: lemmatize(x)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\saahi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\saahi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [17]:
got_data.loc[:, 'text'] = got_data['text'].apply(lemmatized)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,370000 game throne fan sign petition remake season 8
1,x1129144306206298112,5/16/2019 21:59,game throne big bang theory ending week im wondering would fun take hugelypopular longrunning juggernaut show ive never seen single episode watch finale see think
2,x1129144249205895171,5/16/2019 21:59,suddenly last episode daenerys embraced thrill genocide specifically targeting civilian dragon fire personality change happen fiction lack subtlety character respect understand
3,x1129144246869663745,5/16/2019 21:59,sprinkle cause stampede releasing limitedtimeonly game throne dragon fruit cupcake sold friday may 17sunday may 19
4,x1129141956095954956,5/16/2019 21:49,game throne airing final episode here well miss end


How would you apply stemming? Hint: https://www.nltk.org/howto/stem.html

In [17]:
# your code here
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stemming(row):
    row = [stemmer.stem(token) for token in row]
    row = ' '.join(row)
    return row

stemmed = lambda x: stemming(x)

In [18]:
got_data.loc[:, 'text'] = got_data['text'].apply(stemmed)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,3 7 0 0 0 0 g a m e t h r o n e f a n s i g n p e t i t i o n r e m a k e s e a s o n 8
1,x1129144306206298112,5/16/2019 21:59,g a m e t h r o n e b i g b a n g t h e o r y e n d i n g w e e k i m w o n d e r i n g w o u l d f u n t a k e h u g e l y p o p u l a r l o n g r u n n i n g j u g g e r n a u t s h o w i v e n e v e r s e e n s i n g l e e p i s o d e w a t c h f i n a l e s e e t h i n k
2,x1129144249205895171,5/16/2019 21:59,s u d d e n l y l a s t e p i s o d e d a e n e r y s e m b r a c e d t h r i l l g e n o c i d e s p e c i f i c a l l y t a r g e t i n g c i v i l i a n d r a g o n f i r e p e r s o n a l i t y c h a n g e h a p p e n f i c t i o n l a c k s u b t l e t y c h a r a c t e r r e s p e c t u n d e r s t a n d
3,x1129144246869663745,5/16/2019 21:59,s p r i n k l e c a u s e s t a m p e d e r e l e a s i n g l i m i t e d t i m e o n l y g a m e t h r o n e d r a g o n f r u i t c u p c a k e s o l d f r i d a y m a y 1 7 s u n d a y m a y 1 9
4,x1129141956095954956,5/16/2019 21:49,g a m e t h r o n e a i r i n g f i n a l e p i s o d e h e r e w e l l m i s s e n d


If you have extra time: how might you split our data into ngrams? Hint: https://www.nltk.org/api/nltk.util.html#nltk.util.ngrams

In [19]:
# your code here

## Part 2: Encoding Examples

#### BOW (bag of words)

Countvectorizer does all of the work for us, pay attention to the lecture to know what it does :)

In [18]:
# first we'll select a subset of data so we're not looking at massive representations
data_list = got_data.iloc[:50]['text'].to_list()
# data_list

In [19]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

# create variable for vectorizer
bow = CountVectorizer()
 
# apply function on data subsample
bow_result = bow.fit_transform(data_list)

In [20]:
import numpy as np
# feature names corresponds to our vocabulary in this case
feature_array = np.array(bow.get_feature_names_out()) # get_feature_names_out is dependent on sklearn version, can also be get_feature_names_out
bow_sorting = np.argsort(bow_result.toarray()).flatten()[::-1]

# sorting by the top n features based on bow
n = 10
bow_top_n = feature_array[bow_sorting][:n]
print(bow_top_n)

['throne' 'warfare' 'history' 'parallel' 'penultimate' 'classical' 'see'
 'theory' 'dany' 'destroy']


#### TF-IDF

Sklearn also has a module for tfidf, implemented below. 

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

# create variable for vectorizer
tfidf = TfidfVectorizer()
 
# apply function on data subsample
tfidf_result = tfidf.fit_transform(data_list)

In [22]:
feature_array = np.array(tfidf.get_feature_names_out()) # get_feature_names is dependent on sklearn version
tfidf_sorting = np.argsort(tfidf_result.toarray()).flatten()[::-1]

n = 10
tfidf_top_n = feature_array[tfidf_sorting][:n]
print(tfidf_top_n)

['queen' 'rain' 'warfare' 'parallel' 'penultimate' 'classical' 'dany'
 'destroy' 'history' 'realworld']


Why do you think the top N lists are different between BOW and TF-IDF?

The top N lists generated by Bag-of-Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF) can be different because of the way they represent the text.

BOW represents text as a collection of words and their frequencies in the document, without considering the order of the words. The frequency of each word in the document is used to generate a feature vector that represents the document. When generating the top N list using BOW, the most frequent words in the document will be ranked highest.

TF-IDF also considers the importance of words in the document relative to their frequency in the corpus. It takes into account how often a word appears in the document and how rare it is in the corpus. Words that are common in the corpus but rare in the document are given a higher weight, as they are considered more important for characterizing the document. When generating the top N list using TF-IDF, the words that are most characteristic of the document will be ranked highest.

Therefore, the top N lists generated by BOW and TF-IDF can be different because BOW only considers word frequency in the document, while TF-IDF considers the importance of words relative to their frequency in the corpus. This can result in different words being ranked highest in the two representations.

## Part 3: Word Embeddings

For our word embedding exploration we will be using the gensim library. Gensim has pretrained embeddings for glove, word2vec, and fasttext, although we will only be playing around with word2vec.

Here is some documentation: https://radimrehurek.com/gensim/models/word2vec.html, https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.html


In [23]:
# pip install gensim
import gensim.downloader as api

In [37]:
# this will load pretrained embeddings from word2vec based on google news data, it will take a little while to process so be patient
w2v_google_news = api.load('word2vec-google-news-300')

In [38]:
# you can run code for analogies, I have included the king - man, + woman = queen example to start
w2v_google_news.most_similar_cosmul(positive=['king','woman'], negative=['man'])

[('queen', 0.9314123392105103),
 ('monarch', 0.858533501625061),
 ('princess', 0.8476566076278687),
 ('Queen_Consort', 0.8150269985198975),
 ('queens', 0.8099815249443054),
 ('crown_prince', 0.8089976906776428),
 ('royal_palace', 0.8027306795120239),
 ('monarchy', 0.8019613027572632),
 ('prince', 0.800979733467102),
 ('empress', 0.7958389520645142)]

In [32]:
# comparing words similar to a single word
w2v_google_news.most_similar('patriot')

[('patriots', 0.6744824051856995),
 ('patriotic', 0.5867904424667358),
 ('statesman', 0.5711328387260437),
 ('constitutionalist', 0.5575802326202393),
 ('traitor', 0.5424764156341553),
 ('revolutionist', 0.5388073921203613),
 ('hero', 0.5304943323135376),
 ('ardent_patriot', 0.5239595174789429),
 ('patriotism', 0.5227564573287964),
 ('rabble_rouser', 0.5194998979568481)]

In [33]:
w2v_google_news.distance('president', 'patriot')

0.7869569361209869

Explore different word relationships using the tools above or commands from the linked documentation, what relationships seem surprising? Why might the model have embedded certain words in similar ways when linguistically we wouldn't expect them to be similar? Use at least one new example per method.

In [34]:
# your code here
w2v_google_news.most_similar('computer')

[('computers', 0.7979379892349243),
 ('laptop', 0.6640493273735046),
 ('laptop_computer', 0.6548868417739868),
 ('Computer', 0.647333562374115),
 ('com_puter', 0.6082080006599426),
 ('technician_Leonard_Luchko', 0.5662748217582703),
 ('mainframes_minicomputers', 0.5617720484733582),
 ('laptop_computers', 0.5585449934005737),
 ('PC', 0.5539618730545044),
 ('maker_Dell_DELL.O', 0.5519254207611084)]

In [35]:
#your text answer here, in markdown
w2v_google_news.distance('ball', 'sport')

0.8321875184774399

Ball and Sport are similar, as expected.

In [36]:
w2v_google_news.distance('student', 'dumb')

0.8995845317840576

In this example we would not expect student and dumb to be highly similar which is actually the case