## Introduction to Computational Social Science methods with Python

### Natural Language Processing - Text Representation

<div class='alert alert-block alert-success'>
<b>In this Python notebook</b>, 

we will explore how to represent text data using bag of words and TF-IDF techniques. Text representation is a critical task in natural language processing (NLP), which involves converting raw text data into a numerical format that can be processed by machine learning algorithms. In this notebook, we will focus on two specific techniques for text representation: bag of words and TF-IDF.

We will use a dataset of news articles obtained from a publicly available source, which consists of a collection of articles on different topics. We will preprocess the text data by cleaning the text, and then use the bag of words and TF-IDF techniques to represent the text data in a numerical format.

By the end of this notebook, you will have a basic understanding of how to represent text data using bag of words and TF-IDF techniques. Let's get started!

</div>

## A. Bag-of-words

So far we have introduced tokenization, stemming, lemmatization, stop words, n-grams, and part-of-speech tagging. As we have seen, these are all preprocessing techniques that aim at cleaning the text, removing unnecessary information, and extracting structure from it. As a last step of this section, we will introduce two techniques: the Bag-of-Words (BoW) approach and its natural extension, TF-IDF (Term Frequency-Inverse Document Frequency). These approaches combine the techniques described so far and prepare the texts for the actual analysis, such as topic modeling, text classification, and sentiment analysis.


In natural language processing (NLP), a **"bag of words"** is a representation of a text document that describes the occurrence of words in it. It is a simple and commonly used approach to convert text data into a numerical format that can be used for analysis and machine learning.
The bag-of-words model ignores the order and structure of the text and only considers the frequency of occurrence of each word in the document. The resulting representation is a "bag" of words in which each word is represented as a separate feature, and the value of each feature is the count of the corresponding word in the document.
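
As a minimal sketch of this idea, the snippet below counts words in two toy sentences using only the standard library. The sentences are made up for illustration and are not part of the news dataset.

```python
from collections import Counter

# Two toy sentences with the same words in a different order (illustrative only)
doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

# A bag of words keeps only how often each word occurs
bow_a = Counter(doc_a.split())
bow_b = Counter(doc_b.split())

print(bow_a)
print(bow_a == bow_b)
```

Both sentences map to the same bag, because word order is discarded even though the two sentences mean different things.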

The bag-of-words representation of a corpus is usually stored in a matrix called the **document-term matrix**, where each row represents a document and each column represents a term (i.e., a word). The value in each cell is the number of occurrences of the corresponding term in the corresponding document. The document-term matrix is generally a sparse matrix, meaning that it contains a large number of zero elements and few non-zero ones. Indeed, most documents only contain a small fraction of the possible words in a language, and most words occur in only a subset of the documents. This means that the vast majority of the entries in the document-term matrix are zero.
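
The following sketch builds a small dense document-term matrix by hand and measures how many of its cells are zero. The toy documents are assumptions made for illustration; a real corpus would be far sparser.

```python
# Toy corpus of already-tokenized documents (illustrative only)
docs = [
    ["iran", "attack", "base", "attack"],
    ["olympian", "burns", "delegation"],
    ["base", "troop", "iran"],
]

# Vocabulary: one column per distinct term, in sorted order
vocab = sorted({token for doc in docs for token in doc})
col = {term: j for j, term in enumerate(vocab)}

# Dense document-term matrix: rows are documents, columns are terms
dtm = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(docs):
    for token in doc:
        dtm[i][col[token]] += 1

# Share of cells that are zero
n_cells = len(docs) * len(vocab)
n_zero = sum(cell == 0 for row in dtm for cell in row)
sparsity = n_zero / n_cells

print(vocab)
for row in dtm:
    print(row)
print(f"sparsity: {sparsity:.2f}")
```

Even with three short documents, more than half of the cells are zero, which is why libraries usually store only the non-zero entries.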

We will now see how to derive the document-term matrix using NLTK, spaCy, and Gensim. For this task, we will use another dataset that contains news articles. Let's import the data:



In [2]:
import pandas as pd
import numpy as np 

# import the data
news = pd.read_csv("../data/news_subset.csv")
news.head()

| Unnamed: 0 | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/eleven-american... | 11 American Troops Injured In Iran Attack On I... | WORLD NEWS | The United States military originally said no ... | Eric Beech, Reuters | 2020-01-17 |
| 1 | https://www.huffingtonpost.com/entry/gus-kenwo... | Olympian Gus Kenworthy Burns Ivanka Trump: 'TF... | SPORTS | The first daughter led the U.S. delegation dur... | Alana Horowitz Satlin | 2018-02-25 |
| 2 | https://www.huffingtonpost.com/entry/watch-ins... | WATCH: Inspiring Woman Living with Spinal Musc... | WELLNESS | When Alyssa was just 5 months old, she was dia... | HooplaHa - Only Good News, Contributor\nHoopla... | 2013-05-20 |
| 3 | https://www.huffingtonpost.com/entry/dad-deliv... | Brent Farrell, Dad, Knocked Down Locked Door T... | PARENTING | A week before Henry's quick delivery, Katherin... | Jessica Samakow | 2012-04-09 |
| 4 | https://www.huffingtonpost.com/entry/how-polit... | How Politically Correct Culture Influences My ... | PARENTING | I may not abandon my child in the wilderness, ... | Toni Nagy, Contributor\nwriter, podcaster, ton... | 2014-01-24 |


Before deriving the document-term matrix, we preprocess the articles using the techniques described earlier: tokenization, stemming (or lemmatization), and stop-word and punctuation removal:

In [3]:
import gensim
import spacy
import re
from nltk.tokenize import word_tokenize
from spacy.lang.en.stop_words import STOP_WORDS
from nltk.stem import PorterStemmer
nlp = spacy.load("en_core_web_sm")

def clean_text(text):

    # remove punctuation and special characters
    pattern = r"[^\w\s]"
    text_clean = re.sub(pattern, "", text)

    # remove numbers
    pattern = r"\d+"
    text_clean = re.sub(pattern, "", text_clean)

    # remove all non-ASCII characters
    pattern = r"[^\x00-\x7F]+"
    text_clean = re.sub(pattern, "", text_clean)

    # replace new line characters with a space (str.replace returns a new string)
    text_clean = text_clean.replace("\n", " ")

    # remove empty spaces left by regex
    text_clean = ' '.join(text_clean.split())
    
    return text_clean


def tokenization(texts):
    return [word_tokenize(text) for text in texts]


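def remove_stop_words(texts, stop_words=None):
    # fall back to spaCy's English stop-word list when none is given
    if stop_words is None:
        stop_words = STOP_WORDS
    stop_words = set(stop_words)  # set membership tests are fast
    return [[word for word in doc if word.lower() not in stop_words] for doc in texts]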


def add_bigrams(texts):
    bigrams = gensim.models.Phrases(texts, min_count=5, threshold=100)
    return [bigrams[text] for text in texts]


def stemming(texts):
    stemmer = PorterStemmer()
    return [[stemmer.stem(word) for word in doc] for doc in texts]


def lemmatization(texts):
    texts_lemma = []
    for text in texts:
        doc = nlp(" ".join(text)) 
        texts_lemma.append([token.lemma_ for token in doc])
    return texts_lemma


def pipeline(corpus):
    print("Cleaning text...")
    corpus = [clean_text(text) for text in corpus]

    print("Tokenization...")
    corpus = tokenization(corpus)

    print("Lowercasing...")
    corpus = [[el.lower() for el in text] for text in corpus]

    print("Stop Words removal...")
    corpus = remove_stop_words(corpus)

    print("Extract bigrams...")
    corpus = add_bigrams(corpus)

    print("Stemming...")
    corpus = stemming(corpus)

    print("Stop Words removal after stemming...")
    corpus = remove_stop_words(corpus)

    print("Removing tokens that are too short...")
    corpus = [[c for c in text if len(c) > 2] for text in corpus]

    return corpus
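
To make the cleaning step concrete, here is a minimal sketch that applies the same regular expressions as `clean_text` to a single string, one step at a time. The input string is hypothetical and not taken from the dataset.

```python
import re

# Hypothetical raw text, not taken from the dataset
raw = "Breaking: 11 troops hurt\nin caf\u00e9 attack!"

no_punct = re.sub(r"[^\w\s]", "", raw)                # drop punctuation
no_digits = re.sub(r"\d+", "", no_punct)              # drop digit runs
ascii_only = re.sub(r"[^\x00-\x7F]+", "", no_digits)  # drop non-ASCII characters
cleaned = " ".join(ascii_only.split())                # collapse whitespace, including newlines

print(cleaned)
```

Note that removing non-ASCII characters turns "café" into "caf", so this step can damage words in text that contains accented characters.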

Run the preprocessing pipeline:

In [4]:
# our corpus is title + description of the article
corpus = []
for index, row in news.iterrows():
    corpus.append(row.headline + ". " + row.short_description)
corpus = np.array(corpus)

# run the preprocessing pipeline
corpus = pipeline(corpus)


Cleaning text...
Tokenization...
Lowercasing...
Stop Words removal...
Extract bigrams...
Stemming...
Stop Words removal after stemming...
Removing tokens that are too short...


We are now ready to extract the document-term matrix from the preprocessed corpus. We will use an implementation of the BoW approach in Gensim:

In [11]:
from gensim.corpora import Dictionary

# we create a dictionary
dictionary = Dictionary(corpus)

# optionally filter very common and very rare words (disabled here)
#dictionary.filter_extremes(no_below=10, no_above=0.5)

# convert the corpus to bag-of-words format
document_term_matrix = [dictionary.doc2bow(text) for text in corpus]

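In the previous cell we have created a dictionary (i.e., a mapping between integer ids and all the words appearing in the corpus) and the document-term matrix. Gensim stores the matrix in sparse form: each document is a list of (word id, count) pairs that contains only the words occurring in that document. Let's see these outputs: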

In [13]:
print("Number of words in the dictionary: {0}".format(len(dictionary)))
print("Dictionary first 5 elements (id, token):", list(dictionary.items())[:5])

print("\nFirst document in bag-of-words format (raw):", document_term_matrix[0])
print("First document in bag-of-words format (word, frequency):", [[dictionary[id], freq] for id, freq in document_term_matrix[0]])

Number of words in the dictionary: 40358
Dictionary first 5 elements (id, token): [(0, 'alasad'), (1, 'american'), (2, 'attack'), (3, 'base'), (4, 'erbil')]

First document in bag-of-words format (raw): [(0, 1), (1, 1), (2, 2), (3, 2), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 3), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)]
First document in bag-of-words format (word, frequency): [['alasad', 1], ['american', 1], ['attack', 2], ['base', 2], ['erbil', 1], ['hurt', 1], ['injur', 1], ['iran', 2], ['iraq', 1], ['iraqi', 1], ['jan', 1], ['militari', 3], ['missil', 1], ['origin', 1], ['said', 1], ['service_memb', 1], ['troop', 1], ['united_st', 1]]
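
The (id, count) pairs above can be mimicked in a few lines of plain Python. This is only a sketch of the idea, not Gensim's implementation: the helpers `build_token2id` and `to_bow` are hypothetical names, and the id assignment (alphabetical within each document, in order of appearance across documents) is an assumption chosen to resemble the output above.

```python
from collections import Counter

# Toy tokenized corpus (illustrative only)
toy_corpus = [
    ["iran", "attack", "base", "attack"],
    ["base", "troop", "iran"],
]

def build_token2id(texts):
    # Assign an integer id to each distinct token (hypothetical helper)
    token2id = {}
    for text in texts:
        for token in sorted(set(text)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(text, token2id):
    # Convert one document to sorted (token_id, count) pairs, skipping unknown tokens
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

token2id = build_token2id(toy_corpus)
bows = [to_bow(text, token2id) for text in toy_corpus]

print(token2id)
print(bows)
```

Only words that occur in a document appear in its list, so the zeros of the document-term matrix are never stored.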


We also save the output for future analysis:

In [14]:
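import os
import pickle as pkl

# make sure the output folder exists before writing to it
os.makedirs("./output", exist_ok=True)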

with open("./output/dict_gensim.pkl", "wb") as file:
    pkl.dump(dictionary, file)

with open("./output/corpus.pkl", "wb") as file:
    pkl.dump(corpus, file)

with open("./output/document_term_matrix.pkl", "wb") as file:
    pkl.dump(document_term_matrix, file)
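
For completeness, a saved object can be read back with `pkl.load`. The sketch below runs a full round trip on a stand-in object inside a temporary directory, so it works without the notebook's `./output` folder. The variable `toy_matrix` is a hypothetical placeholder for `document_term_matrix`.

```python
import os
import pickle as pkl
import tempfile

# Stand-in object; in the notebook this would be document_term_matrix
toy_matrix = [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]

# Temporary directory used only so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "document_term_matrix.pkl")
    with open(path, "wb") as file:
        pkl.dump(toy_matrix, file)
    with open(path, "rb") as file:
        restored = pkl.load(file)

print(restored == toy_matrix)
```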

## B. TF-IDF

Despite its simplicity, the BoW approach is a powerful technique for turning collections of text into a numerical format that can then be fed to a variety of models for several applications, including topic modeling. Nonetheless, it has some limitations, such as:

- Importance of rare words: the bag-of-words model weights each word only by its raw count in a document, regardless of how rare the word is in the corpus. Words that are rare, and therefore often informative, receive no extra weight.
- Dominance of common words: words that are frequent across the whole corpus receive high counts in most documents, although they are not very informative and do little to distinguish one document from another.

These limitations can be addressed by using the **Term Frequency-Inverse Document Frequency** (TF-IDF) matrix instead of the simple document-term matrix. The idea behind TF-IDF is to assign a weight to each word in a document based on how frequently it occurs in the document and how rare it is in the overall corpus. This weight is calculated by multiplying two factors:

- **Term Frequency (TF)**: this is a measure of how often a word occurs in a document. It is calculated by dividing the number of occurrences of a word in a document by the total number of words in the document. The TF value for a word is high if it occurs very often in a document, and low if it occurs only a few times. In mathematical terms, writing $f_{t,d}$ for the number of times word $t$ occurs in document $d$, the TF value for $t$ in $d$ is:

$$
tf(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
$$

- **Inverse Document Frequency (IDF)**: this is a measure of how informative a word is across the corpus. It is based on the ratio between the total number of documents in the corpus and the number of documents containing the word. The IDF value for a word is high if it occurs in few documents and low if it occurs in many. In practice, the logarithm of this ratio is used. If a word appears in only a very small number of documents, the raw ratio can become very large, so that the TF-IDF weight of the word is dominated by its IDF value even when its term frequency (TF) is relatively low. Taking the logarithm compresses the range of possible IDF values and reduces the impact of very large ones. In mathematical terms, the IDF value for word $t$ in a corpus $D$ of $N$ documents is:
$$
idf(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}
$$


The TF-IDF weighting $w_{t, d, D}$ for a word $t$ is then calculated by multiplying the TF value and the IDF value for that word:
$$
w_{t, d, D} = tf(t, d) \times idf(t, D)
$$
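
The three formulas can be checked on a toy corpus with a few lines of plain Python. This is a minimal sketch under stated assumptions: the documents are made up, the natural logarithm is used because the formula does not fix a base, and the function names `tf`, `idf`, and `tf_idf` simply mirror the notation above.

```python
import math

# Toy tokenized corpus (illustrative only)
docs = [
    ["iran", "attack", "base", "attack"],
    ["iran", "troop"],
    ["iran", "olympian", "delegation"],
]

def tf(term, doc):
    # Occurrences of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of N over the number of documents containing the term
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

w_attack = tf_idf("attack", docs[0], docs)
w_iran = tf_idf("iran", docs[0], docs)

print(round(w_attack, 3), round(w_iran, 3))
```

Here "attack" occurs twice in the first document and nowhere else, so it gets a positive weight, while "iran" occurs in every document and its weight drops to zero.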

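The higher the TF-IDF weight, the more characteristic the word is of the document relative to the rest of the corpus. Note that Gensim's `TfidfModel`, with its default settings, uses the raw term count as TF, a base-2 logarithm for IDF, and rescales each document vector to unit length, so the values it returns differ from the formula above by these scaling choices. We can derive the TF-IDF matrix using Gensim, starting from the bag-of-words representation of the preprocessed corpus obtained previously: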




In [15]:
from gensim.models import TfidfModel

# fit TF-IDF model
model = TfidfModel(document_term_matrix)
tf_idf = model[document_term_matrix]

Let's see the output:

In [16]:
print("\nFirst document in TF-IDF format (raw):", tf_idf[0])
print("First document in TF-IDF format (word, frequency):", [[dictionary[id], freq] for id, freq in tf_idf[0]])


First document in TF-IDF format (raw): [(0, 0.3048256577720754), (1, 0.10267055678042152), (2, 0.249209864967932), (3, 0.29483215702461396), (4, 0.3048256577720754), (5, 0.15325778767051162), (6, 0.17868562942168534), (7, 0.33670071042346267), (8, 0.175367327477358), (9, 0.2023439280457968), (10, 0.18208408221404682), (11, 0.44882891302624855), (12, 0.20161212056132805), (13, 0.16219714724153844), (14, 0.0900998951919583), (15, 0.22500552666505164), (16, 0.17774119448204392), (17, 0.14904502951897228)]
First document in TF-IDF format (word, frequency): [['alasad', 0.3048256577720754], ['american', 0.10267055678042152], ['attack', 0.249209864967932], ['base', 0.29483215702461396], ['erbil', 0.3048256577720754], ['hurt', 0.15325778767051162], ['injur', 0.17868562942168534], ['iran', 0.33670071042346267], ['iraq', 0.175367327477358], ['iraqi', 0.2023439280457968], ['jan', 0.18208408221404682], ['militari', 0.44882891302624855], ['missil', 0.20161212056132805], ['origin', 0.16219714724153
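
The weights printed above do not follow the plain formula from the previous section exactly. As far as the Gensim documentation describes, `TfidfModel` by default multiplies the raw count by a base-2 logarithm of $N$ over the document frequency, and then rescales each document to unit Euclidean length. The following is a minimal sketch of that weighting on a toy corpus, not Gensim's own code; the corpus and the helper `weight_document` are hypothetical.

```python
import math

# Toy bag-of-words corpus as (token_id, count) pairs (illustrative only)
bow_corpus = [
    [(0, 2), (1, 1), (2, 1)],
    [(2, 1), (3, 1)],
    [(2, 1), (4, 1), (5, 1)],
]
n_docs = len(bow_corpus)

# Document frequency of each token id
doc_freq = {}
for bow in bow_corpus:
    for token_id, _ in bow:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def weight_document(bow):
    # Raw count times log2(N / df); zero-weight terms are dropped
    raw = [(i, c * math.log2(n_docs / doc_freq[i])) for i, c in bow]
    raw = [(i, w) for i, w in raw if w > 0]
    # Scale the vector to unit Euclidean length
    norm = math.sqrt(sum(w * w for _, w in raw))
    if norm == 0:
        return []
    return [(i, w / norm) for i, w in raw]

weighted = weight_document(bow_corpus[0])

print(weighted)
print(sum(w * w for _, w in weighted))
```

The squared weights of each document sum to one (up to floating-point error), which also holds for the first document printed above.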

Finally, we save the TF-IDF object for further analysis:

In [17]:
with open("./output/tf_idf_gensim.pkl", "wb") as file:
    pkl.dump(tf_idf, file)