## Introduction to Computational Social Science methods with Python

### Natural Language Processing - Text Preprocessing

<div class='alert alert-block alert-success'>
<b>In this Python notebook</b>, 

we will explore how to perform text preprocessing on a corpus of tweets. Text preprocessing is a critical task in natural language processing (NLP): it involves cleaning and transforming raw text data into a format that can be analyzed by NLP algorithms. In this notebook, we will focus on several specific techniques for text preprocessing, including tokenization, lemmatization, stemming, part-of-speech tagging, stop word removal, and n-gram extraction.

By the end of this notebook, you will have a basic understanding of how to preprocess text data. Let's get started!
</div>

In [1]:
import pandas as pd

# import the data 
tweets_df = pd.read_csv('../data/top_500_retweeted_tweets_clean.csv', encoding = "utf-8")
tweets_df.head() 

Unnamed: 0,tweet_id,text,retweets,urls,mentions,hashtags,emoji,emoji_text,cleaned_text
0,1265465820995411973,"This was me, and I want to make one thing clea...",257467,['https://t.co/349TZijtD8'],[],[],[],[],"This was me, and I want to make one thing clea..."
1,1266553959973445639,Mike Pence caught on hot mic delivering empty ...,135818,['https://t.co/IduvGhiPwj'],[],[],[],[],Mike Pence caught on hot mic delivering empty ...
2,1258750892448387074,THE PANDEMIC IS STILL HAPPENING. THE PANDEMIC ...,88667,[],[],[],[],[],THE PANDEMIC IS STILL HAPPENING. THE PANDEMIC ...
3,1263579286201446400,"This just happened on live tv. Wow, what a dou...",82495,['https://t.co/dQKheEcCvb'],[],[],[],[],"This just happened on live tv. Wow, what a dou..."
4,1266546753182056453,Mask on,66604,[],[],[],[],[],Mask on


## A. Tokenization

The set of all the unique terms in our data is called the vocabulary. Each element in this set is called a type. Each occurrence of a type in the data is called a token. 

Let's practice! Our sentence is

>“Today is a great day to learn NLP, such a powerful tool!”

This sentence has 14 tokens ('Today', 'is', 'a', 'great', 'day', 'to', 'learn', 'NLP', ',', 'such', 'a', 'powerful', 'tool', '!') but only 13 types, because 'a' occurs twice. Note that types can also include punctuation marks and multiword expressions.

In other words, tokens are the individual units, such as words and punctuation marks, into which a text document is split.
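The token and type counts for the example sentence can be checked in a few lines. The sketch below is only an illustration: it assumes a simple regular-expression tokenizer (a run of word characters or a single punctuation mark), which is a stand-in and not the tokenizer spaCy uses.

```python
import re

sentence = "Today is a great day to learn NLP, such a powerful tool!"

# Assumption: a token is a run of word characters or a single punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Types are the distinct tokens; together they form the vocabulary.
types = set(tokens)

print("Tokens:", len(tokens))
print("Types:", len(types))
```

The two numbers differ by one because 'a' is counted twice as a token but only once as a type.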

#### What is Tokenization?
The process of extracting tokens from a text document is referred to as tokenization. Let's see an example of tokenization using spaCy:

In [2]:
import spacy

# Load the small English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy
doc = nlp(text)

# Print the original and tokenized text
print('Original text:', text)
print('\nTokens in the text:')

for token in doc:
    print('\t', token.text)

print('\nTotal tokens:', len(doc))


Original text: Mike Pence caught on hot mic delivering empty boxes of PPE for a PR stunt. 

Tokens in the text:
	 Mike
	 Pence
	 caught
	 on
	 hot
	 mic
	 delivering
	 empty
	 boxes
	 of
	 PPE
	 for
	 a
	 PR
	 stunt
	 .

Total tokens: 16


We can also take the analysis further and extract the vocabulary from the whole corpus of tweets. Since the vocabulary of a corpus is the set of unique tokens it contains, we only need to tokenize every tweet and keep one entry per distinct token:

In [3]:
from collections import Counter

# Process all tweets with spaCy and extract all tokens
tokens = []

# iterate through all tweets and extract all tokens
###

# Count the occurrences of each token and create a vocabulary of unique tokens
vocabulary = Counter(tokens)

# Print the extracted vocabulary
print("Size of extracted vocabulary: {0}".format(len(vocabulary)))
print("Total number of tokens: {0}".format(len(tokens)))

print(vocabulary)

Size of extracted vocabulary: 0
Total number of tokens: 0
Counter()
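The cell above leaves the loop as an exercise, which is why its output is empty. One possible way to fill it in is sketched below on a tiny made-up corpus. Here `toy_tweets` is hypothetical, and a regular expression stands in for spaCy's tokenizer so that the snippet runs without a language model. With the real data, the same pattern would loop over `tweets_df.cleaned_text` and collect `token.text` for every token in `nlp(text)`.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for the tweets.
toy_tweets = [
    "Mask on",
    "Wear a mask. Wear it on the bus.",
    "The pandemic is still happening.",
]

tokens = []
for text in toy_tweets:
    # Assumption: regex tokenization (words or single punctuation marks).
    tokens.extend(re.findall(r"\w+|[^\w\s]", text))

# Counter maps each type to its number of occurrences.
vocabulary = Counter(tokens)

print("Size of extracted vocabulary:", len(vocabulary))
print("Total number of tokens:", len(tokens))
print(vocabulary.most_common(3))
```

Note that "Mask" and "mask" are counted as two different types here. Whether to lowercase tokens before counting is a design choice that depends on the analysis.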


### Lemmatization
When we look up a word in a dictionary, we usually just look for the base form. This dictionary base form is called the **lemma**.
For instance, we might encounter forms like “go”, “goes”, “went”, “gone”, or “going”, but in a dictionary we would look them all up under the lemma "go" (Hovy, 2020). Although these forms differ in meaning, in some contexts it is not essential to distinguish them, and it is more convenient to trace them back to their lemma. This can simplify the analysis and make it easier to extract relevant information from the text. Let's see an example of lemmatization applied to a tweet from the corpus using spaCy:

In [4]:
# Load the small English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy and perform lemmatization
doc = nlp(text)

# Print each word with its extracted lemma
for token in doc:
    print("{0} -> {1}".format(token.text, token.lemma_))

# Finally we can recover the text of the tweet after lemmatization
print('\n\nOriginal text:', text)
lemmatized_text = " ".join([token.lemma_ for token in doc])
print('Lemmatized text:', lemmatized_text)

Mike -> Mike
Pence -> Pence
caught -> catch
on -> on
hot -> hot
mic -> mic
delivering -> deliver
empty -> empty
boxes -> box
of -> of
PPE -> PPE
for -> for
a -> a
PR -> pr
stunt -> stunt
. -> .


Original text: Mike Pence caught on hot mic delivering empty boxes of PPE for a PR stunt. 
Lemmatized text: Mike Pence catch on hot mic deliver empty box of PPE for a pr stunt .
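Conceptually, lemmatization maps an inflected form to its dictionary entry. The sketch below makes this concrete with a tiny hand-written table. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names that cover only the "go" example from the text; real lemmatizers such as spaCy's are considerably more elaborate.

```python
# Hypothetical lookup table covering only the forms of "go" mentioned above.
LEMMA_TABLE = {
    "go": "go",
    "goes": "go",
    "went": "go",
    "gone": "go",
    "going": "go",
}

def toy_lemmatize(word):
    # Fall back to the lowercased word itself when the form is unknown.
    return LEMMA_TABLE.get(word.lower(), word.lower())

forms = ["go", "goes", "went", "gone", "going", "Went"]
lemmas = {toy_lemmatize(form) for form in forms}
print(lemmas)
```

Irregular forms such as "went" show why a dictionary is needed: no simple rule on the word's spelling maps it to "go".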


### Stemming 

Another strategy to reduce different forms of a word to a common base or root form is stemming. Stemming removes the suffixes of words to create a simplified form. For example, the stem of the words "running," "runs," and "run" is "run." This can be achieved with several algorithms, such as the one developed by Porter (1980). This algorithm defines a number of suffixes and the order in which they should be removed or replaced. These actions are then applied iteratively until a word is reduced to its stem.
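To illustrate the idea of ordered suffix rules, here is a deliberately simplified sketch. The rule list `TOY_RULES` and the function `toy_stem` are hypothetical and apply a single pass only; this is not the Porter algorithm, which uses several rule groups and conditions on the remaining stem.

```python
# Ordered (suffix, replacement) rules; the first applicable rule wins.
# Toy rule set for illustration only, NOT the Porter algorithm.
TOY_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("es", ""),
    ("s", ""),
]

def toy_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix, replacement in TOY_RULES:
        stem_len = len(word) - len(suffix)
        # Only strip a suffix if enough of the word is left over.
        if word.endswith(suffix) and stem_len >= min_stem_len:
            return word[:stem_len] + replacement
    return word

examples = ["boxes", "delivering", "caresses", "ponies", "stunt", "is"]
stems = [toy_stem(word) for word in examples]
for word, stem in zip(examples, stems):
    print(word, "->", stem)
```

The order of the rules matters: "caresses" must be matched by the "sses" rule before the shorter "es" or "s" rules get a chance to apply.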

Note that, although similar, stemming and lemmatization are different and give different results. Generally speaking, lemmatization tends to produce more accurate and meaningful results than stemming. Nonetheless, stemming is often faster and simpler to implement, which makes it useful for tasks that require real-time processing or have limited computational resources.

An implementation of the Porter stemmer is available in the Python library NLTK. Let's see an example:

In [5]:
# run this to install NLTK
# !pip install nltk

In [6]:
# download popular NLTK data
!python -m nltk.downloader popular

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/nicolo/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /Users/nicolo/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     /Users/nicolo/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /Users/nicolo/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     /Users/nicolo/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /Users/nicolo/nltk_data...
[nltk_data]    |   Package movie_reviews is already

In [7]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# This performs tokenization on the text (NLTK equivalent of what we did with spaCy)
tokens = word_tokenize(text)

# Create a PorterStemmer object
stemmer = PorterStemmer()

# Apply stemming to each word in the text
stemmed_tokens = [stemmer.stem(token) for token in tokens]

# Let's see results 
for token, stem in zip(tokens, stemmed_tokens):
    print("{0} -> {1}".format(token, stem))

# Finally we can recover the text of the tweet after stemming
print('\n\nOriginal text:', text)
stemmed_text = " ".join(stemmed_tokens)
print('Stemmed text:', stemmed_text)

Mike -> mike
Pence -> penc
caught -> caught
on -> on
hot -> hot
mic -> mic
delivering -> deliv
empty -> empti
boxes -> box
of -> of
PPE -> ppe
for -> for
a -> a
PR -> pr
stunt -> stunt
. -> .


Original text: Mike Pence caught on hot mic delivering empty boxes of PPE for a PR stunt. 
Stemmed text: mike penc caught on hot mic deliv empti box of ppe for a pr stunt .


### N-grams

In natural language processing (NLP), **n-grams** are contiguous sequences of n elements from a given text sample, where an element can be a word, a character, or a part-of-speech tag. In most cases, n-grams are created by sliding a window of size n over the text and extracting the sequence of n elements that falls within the window at each position.

N-grams are used in a variety of NLP tasks such as language modeling, machine translation, and text classification. By extracting n-grams from a text, it is possible to capture the local context of a word or word sequence, which can help improve the accuracy of many NLP tasks.

For example, a bigram (n=2) is "natural language", a trigram (n=3) is "natural language processing", and a 4-gram (n=4) is "natural language processing task". By examining the frequency of different n-grams in a text or corpus, it is possible to gain insight into the distribution of words and their relationships.
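Counting n-gram frequencies takes only a few lines. The sketch below is a minimal illustration that assumes a made-up example text split on whitespace; the helper name `ngrams` is hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example text, tokenized by whitespace.
tokens = "natural language processing is fun and natural language is everywhere".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(1))
```

Storing each n-gram as a tuple rather than a joined string keeps the individual tokens accessible and avoids ambiguity when a token itself contains a space or separator.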

N-grams can also be used to generate new texts through techniques such as n-gram language modeling. In this approach, the probabilities of different N-grams in a text are used to generate a new text that is similar in style and content to the original text.
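As a rough sketch of this idea, the snippet below builds a bigram model from a tiny made-up text and samples a new word sequence from it. The corpus, the function name `generate`, and the fixed random seed are assumptions made for illustration; practical language models use far more data and smoothing.

```python
import random
from collections import defaultdict

# Hypothetical training text, tokenized by whitespace.
corpus = "the pandemic is still happening and the pandemic is not over".split()

# For each word, record every word that follows it in the corpus.
# Repeated followers appear several times, so sampling respects frequencies.
followers = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current].append(nxt)

def generate(start, max_len, seed=0):
    rng = random.Random(seed)  # fixed seed so repeated runs give the same text
    words = [start]
    # Stop at max_len or when the last word was never followed by anything.
    while len(words) < max_len and words[-1] in followers:
        words.append(rng.choice(followers[words[-1]]))
    return words

generated = generate("the", max_len=8)
print(" ".join(generated))
```

Every adjacent pair of words in the generated sequence is a bigram that occurs in the training text, which is why the output resembles the original in style.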

However, it should be noted that n-grams can be constrained by the sparsity problem, especially for larger values of n. That is, as the value of n increases, the number of unique n-grams in a text can increase rapidly, making it difficult to capture meaningful patterns or relationships. Therefore, choosing an appropriate value of n is an important consideration in many NLP tasks.
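The effect can be observed even on a very small example. The sketch below uses a made-up token sequence and reports, for each n, how many distinct n-grams exist and what share of them occurs only once; the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical token sequence; any tokenized text could be used instead.
tokens = ("the pandemic is still happening the pandemic is still here "
          "the pandemic is not over").split()

distinct = {}
singleton_share = {}
for n in (1, 2, 3, 4):
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    singletons = sum(1 for c in counts.values() if c == 1)
    distinct[n] = len(counts)
    singleton_share[n] = singletons / len(counts)
    print(f"n={n}: {distinct[n]} distinct n-grams, "
          f"{singleton_share[n]:.0%} seen only once")
```

As n grows, a larger share of the n-grams is observed a single time, so frequency estimates become less reliable.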

Let's see an example of n-gram extraction applied to a tweet from the corpus using spaCy:

In [8]:
import spacy

# Load the small English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy
doc = nlp(text)

# Define the function to extract n-grams
def extract_ngrams(doc, n):
    ngrams = []
    for i in range(len(doc) - n + 1):
        ngram = " ".join([doc[j].text for j in range(i, i + n)])
        ngrams.append(ngram)
    return ngrams

# Extract unigrams, bigrams, and trigrams from the text
unigrams = extract_ngrams(doc, 1)
bigrams = extract_ngrams(doc, 2)
trigrams = extract_ngrams(doc, 3)

# Print the extracted n-grams
print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)


Unigrams: ['Mike', 'Pence', 'caught', 'on', 'hot', 'mic', 'delivering', 'empty', 'boxes', 'of', 'PPE', 'for', 'a', 'PR', 'stunt', '.']
Bigrams: ['Mike Pence', 'Pence caught', 'caught on', 'on hot', 'hot mic', 'mic delivering', 'delivering empty', 'empty boxes', 'boxes of', 'of PPE', 'PPE for', 'for a', 'a PR', 'PR stunt', 'stunt .']
Trigrams: ['Mike Pence caught', 'Pence caught on', 'caught on hot', 'on hot mic', 'hot mic delivering', 'mic delivering empty', 'delivering empty boxes', 'empty boxes of', 'boxes of PPE', 'of PPE for', 'PPE for a', 'for a PR', 'a PR stunt', 'PR stunt .']


Alternatively, we can use Gensim, another popular library for NLP, to automatically extract the most common n-grams:

In [9]:
import gensim

# gensim expects tokenized texts as input
texts = []
for text in tweets_df.cleaned_text.values:
    texts.append(word_tokenize(text))

# extract bigrams
bigrams = gensim.models.Phrases(texts, min_count=5, threshold=100)
texts_bigrams = [bigrams[text] for text in texts]

# visualize the extracted bigrams
extracted_bigrams = []
for text in texts_bigrams:
    for el in text:
        if "_" in el:
            extracted_bigrams.append(el)

extracted_bigrams = set(extracted_bigrams)
print(extracted_bigrams)


{'tear_gas', 'United_States', 'BREAKING_:', 'White_House', '_Candy', 'George_Floyd', 'IS_STILL', '_Truth_Or_Dare', 'Dr._Fauci', '큥이_에리_기가막힌_케미스트리', 'social_distancing', '&_amp', 'THE_PANDEMIC'}


## 6.2.2. Stopwords

In natural language processing (NLP), stop words refer to words that are frequently used in a language but usually do not have much meaning or semantic value when used in context. Examples of stop words in English are "the", "a", "an", "and", "in", "on", "is", "are", "for", "with", and so on.

Stop words are usually removed from text during preprocessing in NLP tasks such as text classification, sentiment analysis, and information retrieval. The reason is that they do not contribute much to the overall meaning or topic of a text and can potentially degrade algorithm performance by adding noise to the data. Removing stop words can also help reduce the size of vocabulary and improve the efficiency of text processing algorithms.

However, there are certain cases where the inclusion of stop words in the analysis may be useful or even necessary. For example, stop words can be useful in tasks such as authorship attribution or the identification of common themes and writing styles. In such cases, it is important to carefully consider the use of stop words and their potential impact on the analysis.

We will now see a simple example of how to remove stop words from a text using spaCy:

In [10]:
from spacy.lang.en.stop_words import STOP_WORDS

# Load the small English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy
doc = nlp(text)

# Define the list of stop words
stop_words = list(STOP_WORDS)

# Remove stop words from the text
filtered_text = [token.text for token in doc if token.text not in stop_words]
stop_words_removed = [token.text for token in doc if token.text in stop_words]

# Print the original and filtered text, and the stop words removed
print("Original tokens: ", [token.text for token in doc])
print("Filtered tokens:", filtered_text)
print("\nStop words removed: ", stop_words_removed)


Original tokens:  ['Mike', 'Pence', 'caught', 'on', 'hot', 'mic', 'delivering', 'empty', 'boxes', 'of', 'PPE', 'for', 'a', 'PR', 'stunt', '.']
Filtered tokens: ['Mike', 'Pence', 'caught', 'hot', 'mic', 'delivering', 'boxes', 'PPE', 'PR', 'stunt', '.']

Stop words removed:  ['on', 'empty', 'of', 'for', 'a']
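One detail of the cell above is worth noting: the test `token.text in stop_words` is case-sensitive, while the entries of spaCy's `STOP_WORDS` are stored in lowercase, so a capitalized "The" at the start of a tweet would not be filtered. The sketch below illustrates the difference on a hypothetical, deliberately tiny stop list and a hand-tokenized sentence.

```python
# Hypothetical, deliberately tiny stop list; spaCy's STOP_WORDS is much larger.
toy_stop_words = {"the", "a", "on", "of", "for", "is"}

tokens = ["The", "pandemic", "is", "still", "happening", "."]

# Case-sensitive check: "The" survives because only "the" is in the set.
case_sensitive = [t for t in tokens if t not in toy_stop_words]

# Case-insensitive check: lowercase each token before the lookup.
case_insensitive = [t for t in tokens if t.lower() not in toy_stop_words]

print(case_sensitive)
print(case_insensitive)
```

The stop list is kept in a set here because membership tests on a set are faster than on a list. spaCy also exposes a `token.is_stop` attribute that can be used instead of a manual lookup.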


In [7]:
# exercise: find the most common stop word

# Define a dictionary to store the count of each stop word
stop_words_count = {}

# iterate over tweets

    # find stop words in this text
    
    # iterate over stop words found and update counts
    

# Find the stop word with the highest count    
most_frequent_stop_word = max(stop_words_count, key=stop_words_count.get)
print("The most common stop word is:", most_frequent_stop_word)

The most common stop word is: the
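The counting logic of this exercise can be sketched on toy data as follows. Both `toy_stop_words` and `toy_tweets` are made up for illustration, and whitespace splitting stands in for spaCy's tokenizer; with the real corpus, the loop would run over `tweets_df.cleaned_text` and the tokens produced by `nlp(text)`.

```python
from collections import Counter

# Hypothetical stop list and mini-corpus.
toy_stop_words = {"the", "a", "on", "of", "for", "is", "to"}
toy_tweets = [
    "The pandemic is still happening",
    "Wear a mask on the bus",
    "Thanks to the nurses for the care",
]

stop_words_count = Counter()
for text in toy_tweets:
    # Assumption: whitespace tokenization is enough for this toy corpus.
    found = [w.lower() for w in text.split() if w.lower() in toy_stop_words]
    # Add the stop words found in this tweet to the running totals.
    stop_words_count.update(found)

most_frequent_stop_word, count = stop_words_count.most_common(1)[0]
print(most_frequent_stop_word, count)
```

A `Counter` replaces the plain dictionary of the exercise skeleton; the result is the same, but the update step needs no explicit check for missing keys.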


Depending on the task, one can also add custom stop words. This can be easily done by appending additional words to the stop words list:

In [8]:
print(len(stop_words))
stop_words.append("place")
print(len(stop_words))

326
327


## 6.2.3. Parts of Speech

**Part of speech tagging** (POS) is the process of assigning a part of speech to each word in a sentence, such as noun, verb, adjective, or adverb. POS Tagging is an important step in many NLP applications, such as named entity recognition, sentiment analysis, and machine translation.

The goal of POS tagging is to identify the grammatical structure of a sentence by labelling each word with its corresponding part of speech. This information can then be used to extract meaning and context from the text. For example, knowing whether a word is a noun or a verb can help determine the subject and predicate of a sentence.

POS tagging is typically performed using machine learning algorithms, such as hidden Markov models, conditional random fields, or neural networks. These algorithms are trained on annotated text corpora in which each word is labelled with its part of speech. After training, the model can predict the part of speech of each word in new, unseen text.

POS tagging is not always an easy task, as some words can belong to several parts of speech depending on the context. For example, "run" can be a verb ("I run every morning") or a noun ("I went for a run"). In these cases, the algorithm must use contextual clues to determine the most likely part of speech for the word.
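To see why context matters, consider the simplest possible learned tagger: one that assigns each word the tag it received most often in the training data. The sketch below is such a baseline, trained on a hypothetical, hand-annotated mini-corpus. It is a simplification for illustration and not a hidden Markov model, conditional random field, or neural tagger.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-annotated training data: (word, tag) pairs per sentence.
train = [
    [("I", "PRON"), ("run", "VERB"), ("every", "DET"), ("morning", "NOUN")],
    [("They", "PRON"), ("run", "VERB"), ("fast", "ADV")],
    [("I", "PRON"), ("went", "VERB"), ("for", "ADP"), ("a", "DET"), ("run", "NOUN")],
]

# "Training": count how often each word carries each tag.
tag_counts = defaultdict(Counter)
for sentence in train:
    for word, tag in sentence:
        tag_counts[word.lower()][tag] += 1

def baseline_tag(words, default="NOUN"):
    tagged = []
    for word in words:
        counts = tag_counts.get(word.lower())
        # Pick the most frequent tag, or a default for unseen words.
        tag = counts.most_common(1)[0][0] if counts else default
        tagged.append((word, tag))
    return tagged

prediction = baseline_tag(["I", "went", "for", "a", "run"])
print(prediction)
```

Because this baseline ignores the surrounding words, it tags "run" as a verb even in "I went for a run", where it is a noun. Context-aware models exist precisely to resolve such cases.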

Overall, POS tagging is an important technique in NLP that helps extract meaning and context from texts by identifying the grammatical structure of sentences.

English has nine main parts of speech:

- verb: expresses an action or a state of being. E.g. jump, is, write, become
- noun: identifies a person, a place, or a thing, or names a particular one of these (proper noun). E.g. man, house, happiness
- pronoun: replaces a noun or noun phrase. E.g. she, we, they, it
- determiner: is placed in front of a noun to express a quantity or clarify what the noun refers to (in short, a noun introducer). E.g. my, that, the, many
- adjective: modifies a noun or a pronoun. E.g. pretty, old, blue, smart
- adverb: modifies a verb, an adjective, or another adverb. E.g. gently, extremely, carefully, well
- preposition: connects a noun or pronoun to other parts of the sentence. E.g. by, with, about, until
- conjunction: glues words, clauses, and sentences together. E.g. and, but, or, while, because
- interjection: expresses emotion in a sudden or exclamatory way. E.g. oh!, wow!, oops!


<img src='../data/POS.png' style='height: 200px; float: center'>




We will now see a simple example of how to perform POS tagging on a text using spaCy:

In [9]:
import spacy

# load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# As a text example we will use a tweet from the previous dataset
text = tweets_df.cleaned_text[1]

# Process the text with spaCy
doc = nlp(text)

# iterate over each token in the doc and print its text and POS tag
for token in doc:
    print(token.text, token.pos_)


Mike PROPN
Pence PROPN
caught VERB
on ADP
hot ADJ
mic NOUN
delivering VERB
empty ADJ
boxes NOUN
of ADP
PPE PROPN
for ADP
a DET
PR NOUN
stunt NOUN
. PUNCT
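Once tokens are tagged, a common next step is to summarize or filter them by tag. The sketch below works on the (token, tag) pairs copied from the output above, so it runs without spaCy. Which tags count as content words is an assumption made here for illustration.

```python
from collections import Counter

# (token, tag) pairs copied from the spaCy output above.
tagged = [
    ("Mike", "PROPN"), ("Pence", "PROPN"), ("caught", "VERB"), ("on", "ADP"),
    ("hot", "ADJ"), ("mic", "NOUN"), ("delivering", "VERB"), ("empty", "ADJ"),
    ("boxes", "NOUN"), ("of", "ADP"), ("PPE", "PROPN"), ("for", "ADP"),
    ("a", "DET"), ("PR", "NOUN"), ("stunt", "NOUN"), (".", "PUNCT"),
]

# How often does each tag occur in the tweet?
tag_counts = Counter(tag for _, tag in tagged)

# Assumption: nouns, proper nouns, verbs, and adjectives are content words.
content_tags = {"NOUN", "PROPN", "VERB", "ADJ"}
content_words = [token for token, tag in tagged if tag in content_tags]

print(tag_counts.most_common())
print(content_words)
```

Filtering by tag is an alternative to stop word lists: it keeps words because of their grammatical role rather than removing them because they appear in a fixed list.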


If the meaning of a POS tag is not clear to us, we can ask spaCy to explain it:

In [10]:
spacy.explain("PROPN")

'proper noun'

Finally, let's see how spaCy POS tagging works on more tricky examples:

In [11]:
# here the word run is used as verb
text1 = "I run every morning"

# here the word run is used as a noun
text2 = "I went for a run"

# POS tagging of sentence 1
for token in nlp(text1):
    print(token.text, token.pos_)

print("\n")
# POS tagging of sentence 2
for token in nlp(text2):
    print(token.text, token.pos_)


I PRON
run VERB
every DET
morning NOUN


I PRON
went VERB
for ADP
a DET
run NOUN


We can see that spaCy correctly tags the word "run" differently in these two examples.