## Assignment 13: Using Python Packages
Tierney O'Sullivan
November 27, 2022

### Library: nltk & spacy

nltk [documentation](https://www.nltk.org/)  

nltk, or the "Natural Language Toolkit", is a library for working with text data.  
Common uses include:

- text classification
- tokenization
- stemming
- tagging
- parsing 
- semantic reasoning

Here, we will use it as a tool to preprocess text data from Twitter, which can then be used for further NLP analysis such as clustering or sentiment analysis.  

spacy [documentation](https://spacy.io/)  

spaCy is a newer NLP library. It is implemented in Cython, which makes some of its algorithms faster.

## Twitter data
The data set we are using today is from Kaggle. It contains tweets related to COVID-19 vaccines and can be downloaded [here](https://www.kaggle.com/datasets/gpreda/all-covid19-vaccines-tweets).


### Import libraries
First we will import libraries needed for this notebook. 

In [99]:
# Import packages
import numpy as np
import pandas as pd
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from IPython.display import display, HTML
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import spacy #lemmatization

import matplotlib.pyplot as plt

In [100]:
df = pd.read_csv('vaccination_all_tweets.csv')

In [101]:
df.head()

Unnamed: 0,id,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,retweets,favorites,is_retweet
0,1340539111971516416,Rachel Roh,"La Crescenta-Montrose, CA",Aggregator of Asian American news; scanning di...,2009-04-08 17:52:46,405,1692,3247,False,2020-12-20 06:06:44,Same folks said daikon paste could treat a cyt...,['PfizerBioNTech'],Twitter for Android,0,0,False
1,1338158543359250433,Albert Fong,"San Francisco, CA","Marketing dude, tech geek, heavy metal & '80s ...",2009-09-21 15:27:30,834,666,178,False,2020-12-13 16:27:13,While the world has been on the wrong side of ...,,Twitter Web App,1,1,False
2,1337858199140118533,eli🇱🇹🇪🇺👌,Your Bed,"heil, hydra 🖐☺",2020-06-25 23:30:28,10,88,155,False,2020-12-12 20:33:45,#coronavirus #SputnikV #AstraZeneca #PfizerBio...,"['coronavirus', 'SputnikV', 'AstraZeneca', 'Pf...",Twitter for Android,0,0,False
3,1337855739918835717,Charles Adler,"Vancouver, BC - Canada","Hosting ""CharlesAdlerTonight"" Global News Radi...",2008-09-10 11:28:53,49165,3933,21853,True,2020-12-12 20:23:59,"Facts are immutable, Senator, even when you're...",,Twitter Web App,446,2129,False
4,1337854064604966912,Citizen News Channel,,Citizen News Channel bringing you an alternati...,2020-04-23 17:58:42,152,580,1473,False,2020-12-12 20:17:19,Explain to me again why we need a vaccine @Bor...,"['whereareallthesickpeople', 'PfizerBioNTech']",Twitter for iPhone,0,0,False


Let's use only the first 1000 tweets for an example to reduce processing time.

In [102]:
df = df.loc[0:999,:].copy() # copy so that adding columns later does not raise SettingWithCopyWarning

### Data cleaning

In [103]:
# write a function to create date column of month_year

def date_create(created_at):
    ''' Getting YYYY_MM from created_at '''
    year = created_at.split("-")[0]
    month = created_at.split("-")[1]
    year_month = year + "_" + month
    return year_month
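
The same `YYYY_MM` label can also be produced without a custom function. This is only a sketch of an alternative, assuming the column holds timestamps that pandas can parse; the two sample values are copied from the `user_created` column shown above.

```python
import pandas as pd

# two sample user_created values taken from the df.head() output
created = pd.Series(["2009-04-08 17:52:46", "2020-06-25 23:30:28"])

# parse the strings as datetimes, then format them as YYYY_MM
year_month = pd.to_datetime(created).dt.strftime("%Y_%m")
print(year_month.tolist())
```

Parsing the dates first has the side benefit of failing loudly on malformed values instead of silently splitting them.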

In [104]:
# function for removing emojis and other pictographic symbols from tweets
def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, geometric shapes, misc symbols, arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad range: also strips CJK and most other text above U+24C2
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"  # everything outside the Basic Multilingual Plane
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector
        u"\u3030"
                      "]+", re.UNICODE)
    return re.sub(emoj, '', data)

In [105]:
# very basic cleaning of the tweet text:
#1 lowercase
#2 remove emojis
#3 use regex to remove urls
#4 use regex to remove punctuation (apostrophes are kept)
#5 remove user handles, which come after the @ sign
#6 strip leading/trailing whitespace and collapse repeated spaces
def clean_tweet(text):
    '''Text Preprocessing '''
    text = text.lower() #1
    text = remove_emojis(text) #2
    text = re.sub(r'http\S+', '', text) #3
    text = re.sub(r'[^\w\s\']+', '', text) #4
    # note: step 4 has already stripped the "@", so this pattern never matches
    # and handle names stay in the text (e.g. "borisjohnson" in the output below)
    text = re.sub(r'@\S+', '', text) #5

    text = re.sub(r"^\s+|\s+$", "", text, flags=re.UNICODE) #6
    text = " ".join(re.split(r"\s+", text, flags=re.UNICODE)) #6

    return text

df.loc[:, "clean_text"] = df.loc[:, "text"].apply(clean_tweet) #run function for each tweet
df.loc[:, "year_month"] = df.loc[:, "user_created"].apply(date_create) # YYYY_MM the account was created (the tweet's own timestamp is in "date")
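
In `clean_tweet`, punctuation is removed (step 4) before user handles (step 5), so the `@` is already gone by the time the handle pattern runs and names such as `borisjohnson` survive in the cleaned text. If dropping handles is the goal, one option is to reorder the steps. The sketch below is an assumption about the intended behavior, not the notebook's method; `clean_tweet_handles_first` is a hypothetical name, and emoji removal is left out to keep the example self-contained.

```python
import re

def clean_tweet_handles_first(text):
    """Hypothetical variant of clean_tweet: drop @handles before punctuation."""
    text = text.lower()
    text = re.sub(r'http\S+', '', text)      # remove urls
    text = re.sub(r'@\w+', '', text)         # remove handles while the "@" is still present
    text = re.sub(r'[^\w\s\']+', '', text)   # remove remaining punctuation, keep apostrophes
    return " ".join(text.split())            # trim and collapse whitespace

example = "Explain to me again why we need a vaccine @BorisJohnson https://t.co/xyz"
cleaned = clean_tweet_handles_first(example)
print(cleaned)
```

Whether handles should be dropped at all depends on the analysis: for some questions, who is being addressed is useful signal.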

### Download stopwords
nltk has a predefined list of stop words for the English language.

Stop words are the most commonly used words in a language. They rarely help in discerning the meaning of the text, so they are removed to reduce dimensionality.


In [106]:
# use the predefined list of english language stopwords from nltk
nltk.download('stopwords')
stopwords_eng = stopwords.words('english')

# for tweets, we might want to add some more words to customize the list
append_words = ["via", "also", "amp",
               "do", "will", "did", "does", "should",
               "are", "could", "had", "has", "have",
               "is", "might", "must", "need", "shall",
               "was", "were", "would",
               "bc", "yo", "etc"] 
stopwords_eng.extend(append_words)
print(stopwords_eng)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tierney/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Lemmatization
One way to reduce the dimensionality of the text data is lemmatization, which reduces words to their dictionary form (lemma). For example, this changes plural nouns to singular, or past-tense verbs to the present tense.

nltk has a built-in lemmatizer based on WordNet. It looks each word up using its part of speech, which makes it a bit more sophisticated than the common but simpler approach of stemming.

In [107]:
from nltk.stem import WordNetLemmatizer 
wnl = WordNetLemmatizer()


In [108]:
wnl.lemmatize("arches")

'arch'

The downside is that you have to supply the part of speech for it to lemmatize verbs correctly. The default part of speech is noun, so when the lemmatize function is called with only the word, as below, "listening" remains unchanged.

In [109]:
wnl.lemmatize('listening')

'listening'

In [110]:
wnl.lemmatize('listening', 'v')

'listen'
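
Supplying the tag by hand does not scale to whole tweets. With nltk, the usual workaround is to run a part-of-speech tagger first and translate each tag into the single-letter code that `lemmatize` expects, which adds an extra step. A minimal sketch of that translation, where `penn_to_wordnet` is a hypothetical helper name and the input is assumed to be a Penn Treebank tag such as those returned by `nltk.pos_tag`:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's own default

print(penn_to_wordnet("VBG"))  # VBG is the tag for a gerund such as "listening"
```

The returned code would be passed as the second argument, as in `wnl.lemmatize('listening', penn_to_wordnet('VBG'))`.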

So, instead of using nltk's lemmatization function, we'll try spaCy's, which infers each token's part of speech from its context.


In [111]:
# text lemmatization using spacy
nlp = spacy.load('en_core_web_sm') # small English pipeline

def lemm(text):
    doc = nlp(text)
    # join the lemma of every token back into a single string
    return " ".join([token.lemma_ for token in doc])

In [112]:
df.loc[:, 'lemm_Text'] = df.loc[:, 'clean_text'].apply(lemm) #run function for each tweet


In [113]:
df.head()

Unnamed: 0,id,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,retweets,favorites,is_retweet,clean_text,year_month,lemm_Text
0,1340539111971516416,Rachel Roh,"La Crescenta-Montrose, CA",Aggregator of Asian American news; scanning di...,2009-04-08 17:52:46,405,1692,3247,False,2020-12-20 06:06:44,Same folks said daikon paste could treat a cyt...,['PfizerBioNTech'],Twitter for Android,0,0,False,same folks said daikon paste could treat a cyt...,2009_04,same folk say daikon paste could treat a cytok...
1,1338158543359250433,Albert Fong,"San Francisco, CA","Marketing dude, tech geek, heavy metal & '80s ...",2009-09-21 15:27:30,834,666,178,False,2020-12-13 16:27:13,While the world has been on the wrong side of ...,,Twitter Web App,1,1,False,while the world has been on the wrong side of ...,2009_09,while the world have be on the wrong side of h...
2,1337858199140118533,eli🇱🇹🇪🇺👌,Your Bed,"heil, hydra 🖐☺",2020-06-25 23:30:28,10,88,155,False,2020-12-12 20:33:45,#coronavirus #SputnikV #AstraZeneca #PfizerBio...,"['coronavirus', 'SputnikV', 'AstraZeneca', 'Pf...",Twitter for Android,0,0,False,coronavirus sputnikv astrazeneca pfizerbiontec...,2020_06,coronavirus sputnikv astrazeneca pfizerbiontec...
3,1337855739918835717,Charles Adler,"Vancouver, BC - Canada","Hosting ""CharlesAdlerTonight"" Global News Radi...",2008-09-10 11:28:53,49165,3933,21853,True,2020-12-12 20:23:59,"Facts are immutable, Senator, even when you're...",,Twitter Web App,446,2129,False,facts are immutable senator even when you're n...,2008_09,fact be immutable senator even when you be not...
4,1337854064604966912,Citizen News Channel,,Citizen News Channel bringing you an alternati...,2020-04-23 17:58:42,152,580,1473,False,2020-12-12 20:17:19,Explain to me again why we need a vaccine @Bor...,"['whereareallthesickpeople', 'PfizerBioNTech']",Twitter for iPhone,0,0,False,explain to me again why we need a vaccine bori...,2020_04,explain to I again why we need a vaccine boris...


### Remove stopwords from tweet text

In [114]:
# preprocess by removing stop words manually
# using list comprehension
# output is a list of words
def preprocess(tweet):
    return [w for w in tweet.lower().split() if w not in stopwords_eng]


In [115]:
df['clean_text_nostops'] = df['lemm_Text'].apply(preprocess)

df.head()

Unnamed: 0,id,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,retweets,favorites,is_retweet,clean_text,year_month,lemm_Text,clean_text_nostops
0,1340539111971516416,Rachel Roh,"La Crescenta-Montrose, CA",Aggregator of Asian American news; scanning di...,2009-04-08 17:52:46,405,1692,3247,False,2020-12-20 06:06:44,Same folks said daikon paste could treat a cyt...,['PfizerBioNTech'],Twitter for Android,0,0,False,same folks said daikon paste could treat a cyt...,2009_04,same folk say daikon paste could treat a cytok...,"[folk, say, daikon, paste, treat, cytokine, st..."
1,1338158543359250433,Albert Fong,"San Francisco, CA","Marketing dude, tech geek, heavy metal & '80s ...",2009-09-21 15:27:30,834,666,178,False,2020-12-13 16:27:13,While the world has been on the wrong side of ...,,Twitter Web App,1,1,False,while the world has been on the wrong side of ...,2009_09,while the world have be on the wrong side of h...,"[world, wrong, side, history, year, hopefully,..."
2,1337858199140118533,eli🇱🇹🇪🇺👌,Your Bed,"heil, hydra 🖐☺",2020-06-25 23:30:28,10,88,155,False,2020-12-12 20:33:45,#coronavirus #SputnikV #AstraZeneca #PfizerBio...,"['coronavirus', 'SputnikV', 'AstraZeneca', 'Pf...",Twitter for Android,0,0,False,coronavirus sputnikv astrazeneca pfizerbiontec...,2020_06,coronavirus sputnikv astrazeneca pfizerbiontec...,"[coronavirus, sputnikv, astrazeneca, pfizerbio..."
3,1337855739918835717,Charles Adler,"Vancouver, BC - Canada","Hosting ""CharlesAdlerTonight"" Global News Radi...",2008-09-10 11:28:53,49165,3933,21853,True,2020-12-12 20:23:59,"Facts are immutable, Senator, even when you're...",,Twitter Web App,446,2129,False,facts are immutable senator even when you're n...,2008_09,fact be immutable senator even when you be not...,"[fact, immutable, senator, even, ethically, st..."
4,1337854064604966912,Citizen News Channel,,Citizen News Channel bringing you an alternati...,2020-04-23 17:58:42,152,580,1473,False,2020-12-12 20:17:19,Explain to me again why we need a vaccine @Bor...,"['whereareallthesickpeople', 'PfizerBioNTech']",Twitter for iPhone,0,0,False,explain to me again why we need a vaccine bori...,2020_04,explain to I again why we need a vaccine boris...,"[explain, vaccine, borisjohnson, matthancock, ..."


### Vectorize text and create a tfidf matrix

tf-idf stands for term frequency-inverse document frequency.

It is a common way to measure the importance of a word in a document by indicating how common it is in that document, compared to all other documents in the data.

Here, documents are tweets. Term frequency is how many times a word is used in a given tweet, and document frequency is the number of tweets that contain the word. Thus, if a word is used a lot in a single tweet but is rare across the other tweets, it gets a high value in the tf-idf matrix and is assumed to carry a lot of meaning in that tweet.
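
To make the weighting concrete, here is a small hand calculation on made-up tweets. It is a sketch under assumptions: the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` follows what I understand to be scikit-learn's default, and the row normalization that `TfidfVectorizer` applies afterwards is left out. The function name `tf_idf` and the toy data are hypothetical.

```python
import math

# three toy "tweets", already cleaned and tokenized (made-up data)
docs = [
    ["vaccine", "nurse"],
    ["vaccine", "news"],
    ["vaccine", "trial", "news"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                             # times the term appears in this tweet
    df = sum(1 for d in docs if term in d)           # number of tweets containing the term
    idf = math.log((1 + len(docs)) / (1 + df)) + 1   # smoothed inverse document frequency
    return tf * idf

common = tf_idf("vaccine", docs[0], docs)  # appears in every tweet
rare = tf_idf("nurse", docs[0], docs)      # appears in only one tweet
print(common, rare)
```

Both words occur once in the first tweet, but "nurse" receives the larger weight because it appears in fewer tweets.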

Because nltk doesn't have a tf-idf function, let's use scikit-learn's built-in one.

It creates a matrix with one row for each tweet and one column for each token (word) that appears in any of the tweets.

In [118]:
vect = TfidfVectorizer(min_df = 0.00017, stop_words=stopwords_eng, ngram_range = (1,1))
x = vect.fit_transform(df.lemm_Text) # learn the vocabulary and build the tf-idf matrix in one step
x_df = pd.DataFrame(x.toarray(), columns = vect.get_feature_names_out())
x_df.shape

(1000, 2911)

### Cosine similarity
One use case for the tf-idf matrix is finding tweets that are similar to one another based on the words they contain. Cosine similarity is a useful similarity metric for text data because it compares the direction of two vectors rather than their magnitude, so the length of the documents doesn't matter.

Tweets that are exactly alike will score 1, and ones that don't have any matching words will score 0. 
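
The metric itself is short enough to write out by hand. The sketch below uses made-up three-term vectors rather than the notebook's data, and `cosine` is a hypothetical helper name; in practice the scikit-learn function does the same calculation for all pairs at once.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# the second vector is twice the first: same words, longer document
same_direction = cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
# the vectors share no non-zero terms
no_overlap = cosine([1.0, 0.0, 0.0], [0.0, 3.0, 0.0])
print(same_direction, no_overlap)
```

Because the calculation involves square roots, a perfect match can come out a hair above or below 1 in floating point.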

Let's see if we can find tweets that are similar to one another using this metric among the first 1000 tweets about COVID-19 vaccines.

In [128]:
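from sklearn.metrics.pairwise import cosine_similarity
cosine_mat = cosine_similarity(x_df)
# replace perfect self-matches on the diagonal (score of 1) with zero
# note: floating-point rounding leaves some self-matches slightly above or below 1,
# so this exact comparison misses them (see the 1.000000 entries in the output below)
cosine_mat[cosine_mat==1]=0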
cosine_df = pd.DataFrame(cosine_mat)
cosine_df.index = x_df.index
cosine_df.columns = x_df.index
cosine_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
0,0.000000,0.000000,0.009378,0.0,0.011529,0.000000,0.000000,0.000000,0.019472,0.056724,...,0.009712,0.009064,0.000000,0.009938,0.000000,0.017503,0.050329,0.009050,0.008457,0.008199
1,0.000000,1.000000,0.079026,0.0,0.000000,0.000000,0.037222,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.009378,0.079026,1.000000,0.0,0.025060,0.009571,0.000000,0.000000,0.181720,0.000000,...,0.021111,0.019703,0.089136,0.021602,0.000000,0.018622,0.008913,0.019671,0.008997,0.017822
3,0.000000,0.000000,0.000000,1.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,0.011529,0.000000,0.025060,0.0,0.000000,0.011766,0.000000,0.000000,0.052036,0.000000,...,0.025954,0.024223,0.025935,0.026558,0.000000,0.022894,0.010957,0.024184,0.011061,0.021911
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.017503,0.000000,0.018622,0.0,0.022894,0.000000,0.000000,0.158297,0.119215,0.000000,...,0.059462,0.055495,0.039316,0.019735,0.000000,0.000000,0.000000,0.017971,0.051775,0.050198
996,0.050329,0.000000,0.008913,0.0,0.010957,0.008197,0.000000,0.000000,0.018507,0.051686,...,0.009231,0.008615,0.054492,0.009446,0.000000,0.000000,0.000000,0.089847,0.000000,0.007793
997,0.009050,0.000000,0.019671,0.0,0.024184,0.009236,0.000000,0.000000,0.040846,0.000000,...,0.020373,0.019014,0.020358,0.020847,0.106127,0.017971,0.089847,0.000000,0.008683,0.017199
998,0.008457,0.000000,0.008997,0.0,0.011061,0.000000,0.000000,0.000000,0.057599,0.000000,...,0.028729,0.026813,0.018996,0.009535,0.035192,0.051775,0.000000,0.008683,1.000000,0.024253
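
The output above still shows 1.000000 in several diagonal cells, because comparing with `== 1` only catches self-matches that are exactly 1 after floating-point rounding. A more robust option, sketched here on a made-up 3×3 matrix rather than the notebook's data, is to zero the diagonal by position with `np.fill_diagonal`.

```python
import numpy as np

# toy similarity matrix whose diagonal is 1 only up to rounding (made-up values)
sim = np.array([
    [1.0000000000000002, 0.25, 0.0],
    [0.25, 0.9999999999999999, 0.5],
    [0.0, 0.5, 1.0],
])

np.fill_diagonal(sim, 0)          # zero the self-matches in place, whatever their exact value
best_match = sim.argmax(axis=1)   # index of the most similar other row, per row
print(best_match)
```

This also leaves genuine duplicates intact: two different tweets with a similarity of exactly 1 keep their score instead of being reset to 0.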


In [147]:
cosine_df.max().where(cosine_df.max()<1).loc[100:120]

100         NaN
101    0.375933
102         NaN
103         NaN
104         NaN
105         NaN
106    0.316697
107    0.247538
108         NaN
109    0.836042
110    0.296032
111         NaN
112         NaN
113    0.176355
114         NaN
115    0.232486
116    0.381110
117    1.000000
118         NaN
119    0.307711
120    0.119351
dtype: float64

Let's find two tweets that are similar:
Here we compare tweet number 109 to its highest cosine similarity match: tweet number 108, with a score of 0.836.

In [153]:
cosine_df.loc[109,].max()

0.8360415108175331

In [155]:
cosine_df.loc[109,].idxmax()

108

In [156]:
df.loc[108,'text']

'#BreakingNews A nurse in New York City on Monday became the first person in the United States to receive the corona… https://t.co/02Mu5HKYs5'

In [157]:
df.loc[109,'text']

'#UPDATE A nurse in New York City on Monday became the first person in the United States to receive the coronavirus… https://t.co/N4j8xorUzO'

This is a very close match!

This could be a useful input for a clustering algorithm, or for stratified sampling of tweets for annotation, to ensure that your annotators are given a diverse sample of tweets to label.

An alternative to cosine similarity is "soft cosine similarity", which takes word meanings into account rather than looking only for exact matches. This can be done using the gensim library, but that's for another notebook.


Overall, I think that nltk and spaCy were both helpful for preprocessing the tweet text. spaCy seems to have better documentation, especially its free web course on using it for NLP, which includes code. However, for researchers it may be problematic that the algorithms are chosen for you to optimize processing speed, which may limit scientific reproducibility, documentation, and the user's ability to adjust presets. It seems that spaCy is more popular with developers and nltk with researchers, especially linguists. It's good to be able to integrate tools from both in a text preprocessing pipeline.