# INFO 4271 - Exercise 2 - Text Representation

Issued: April 23, 2024

Due: April 29, 2024

Please submit this filled sheet via Ilias by the due date.

---

# 1. Bag-of-Words Models
In class we discussed BOW vectorization models under which documents are represented via term frequency counts.

a) Construct term frequency BOW representations for the following sentences:

- "The government is open."
- "The government is closed."
- "Long live Mickey Mouse, emperor of all!"
- "Darn! This will break."

In [1]:
import re
from collections import Counter
import numpy as np

def bag_of_words(text):
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize the text
    return text.lower().split()

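def vocabulary(corpus):
    # Each document is a one-element list holding the raw text, e.g. ['Some text.']
    vocab = set()
    for doc in corpus:
        # Tokenize each document and collect its distinct terms
        words = bag_of_words(doc[0])
        vocab.update(words)
    # Sorted so that every term maps to a stable column index
    return sorted(vocab)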

text = "Hello, world! This is a test. Hello again."
print(bag_of_words(text))
print(vocabulary([[text], [text], ["Hi!, this is fun :)"]]))

['hello', 'world', 'this', 'is', 'a', 'test', 'hello', 'again']
['a', 'again', 'fun', 'hello', 'hi', 'is', 'test', 'this', 'world']


In [2]:
corpus = [['The the government is open.'], ['The government is closed.'], ['Long live Mickey Mouse, emperor of all!'], ['Darn! This will break.']]

#Turn a corpus of arbitrary texts into term-frequency weighted BOW vectors.
def TF(corpus):
    vecs = []
    
    entire_vocab = vocabulary(corpus)
    for doc in corpus:
        words = bag_of_words(doc[0])
        counts = Counter(words)
        vec = [counts[word] for word in entire_vocab]
        vecs.append(vec)    
    
    return np.array(vecs)

TF(corpus)

array([[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]])
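
The columns of the matrix above follow the alphabetically sorted vocabulary, which is hard to read off the raw array. As a minimal sketch (the helper names `tokenize` and `labeled_tf` are hypothetical and not part of the original solution), the same term frequencies can be printed as one labeled dictionary per document, here for the four sentences exactly as given in part a):

```python
import re
from collections import Counter

def tokenize(text):
    # Same idea as bag_of_words above: drop punctuation, lowercase, split on whitespace
    return re.sub(r"[^\w\s]", "", text).lower().split()

def labeled_tf(texts):
    # One dict per document, holding only the terms that actually occur in it
    return [dict(Counter(tokenize(text))) for text in texts]

sentences = [
    "The government is open.",
    "The government is closed.",
    "Long live Mickey Mouse, emperor of all!",
    "Darn! This will break.",
]

for row in labeled_tf(sentences):
    print(row)
```

Each dictionary is the sparse form of one matrix row: terms that are missing from a dictionary correspond to zeros in the dense vector.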

b) Extend the term frequency model with an inverse document frequency (IDF) component. Estimate IDFs based on the Reuters-21578 collection.
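
For reference, the usual textbook form of the weighting is given below. The symbols $N$, $\mathrm{df}(t)$ and $\mathrm{tf}(t, d)$ are notation assumed here, not taken from the sheet: $N$ is the number of documents in the reference collection, $\mathrm{df}(t)$ the number of those documents that contain term $t$, and $\mathrm{tf}(t, d)$ the count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

A term that occurs in every document gets $\mathrm{idf}(t) = \log 1 = 0$, while a term that occurs in a single document gets the maximum value $\log N$.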

In [21]:
import math
import nltk
from nltk.corpus import reuters

# Download the documents
nltk.download("reuters")
documents = reuters.fileids()

# Keep only the training split (file ids starting with "train")
docs = [doc for doc in documents if doc.startswith("train")]
print(f"{len(docs)} total train documents")

# Estimate inverse document frequencies based on a corpus of documents.
def IDF_dict(corpus):
    doc_freqs = {}
    N = len(corpus)

    # corpus is a list of per-document term Counters
    for doc in corpus:
        for word in doc:
            # Count each document at most once per term (document frequency)
            doc_freqs[word] = doc_freqs.get(word, 0) + 1

    # idf(t) = log(N / df(t))
    return {word: math.log(N / df) for word, df in doc_freqs.items()}

def IDF_reuters(corpus):
    idfs_reuters = {}
    corpus_reuters = [Counter(reuters.words(doc)) for doc in docs]
    
    # corpus_reuters holds one term Counter per Reuters training document.
    # Note: this sums term occurrences, not the number of documents per term.
    for doc in corpus_reuters:
        for word, doc_count in doc.items():
            total_count = idfs_reuters.get(word, 0) # default to 0 if not found yet
            idfs_reuters[word] = total_count + doc_count
            
    vecs = []
    entire_vocab = vocabulary(corpus)
    N = len(corpus_reuters)
    
    for doc in corpus:
        # Raises KeyError for any vocabulary term that never occurs in Reuters
        vec = [math.log(N/float(idfs_reuters[word])) for word in entire_vocab]
        vecs.append(vec) 
        
    return vecs   

def IDF(corpus):
    counts = {}
    corpus_counts = [Counter(bag_of_words(doc[0])) for doc in corpus]
    
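    # corpus_counts holds one term Counter per document.
    # Note: this sums term occurrences over the whole corpus (collection
    # frequency); the textbook IDF would count each document only once.
    for doc in corpus_counts:
        for word, doc_count in doc.items():
            total_count = counts.get(word, 0) # default to 0 if not found yet
            counts[word] = total_count + doc_count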
            
    entire_vocab = vocabulary(corpus)
    N = len(corpus)
    idfs = np.zeros( (N, len(entire_vocab)) )
    
    for i, doc in enumerate(corpus):
        words = bag_of_words(doc[0])
        for j, word in enumerate(entire_vocab):
            if word not in words:
                continue
            idfs[i, j] = math.log(N/float(counts[word]))
        
    return idfs

#Turn a corpus of arbitrary texts into TF-IDF weighted BOW vectors.
def TFIDF(corpus):
    return TF(corpus) * IDF(corpus)

7769 total train documents


[nltk_data] Downloading package reuters to
[nltk_data]     /home/tluebbing/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


In [22]:
np.set_printoptions(precision=3)

corpus = [['The government is open because it is the goverment.'], ['The government is closed.'], ['Long live Mickey Mouse, emperor of all!'], ['Darn! This will break.']]
corpus_reuters = [Counter(reuters.words(doc)) for doc in docs]


idf_dict = IDF_dict(corpus_reuters)
print(idf_dict, end="\n\n\n")

# This does not work because the vocabularies differ: "emperor" is not part of Reuters.
# What is the Reuters corpus needed for, then?
# idf_reuters = IDF_reuters(corpus)
# print(idf_reuters)

idf = IDF(corpus)
print(idf, end="\n\n\n")

# With the original exercise sentences, TF-IDF would equal IDF because every term occurs at most once per document.
# The first document was extended here, so its repeated terms ("is", "the") have a TF of 2 and a larger TF-IDF value.
tfidf = TFIDF(corpus) 
print(tfidf)



[[0.    1.386 0.    0.    0.    0.    1.386 0.693 0.288 1.386 0.    0.
  0.    0.    0.    1.386 0.288 0.    0.   ]
 [0.    0.    0.    1.386 0.    0.    0.    0.693 0.288 0.    0.    0.
  0.    0.    0.    0.    0.288 0.    0.   ]
 [1.386 0.    0.    0.    0.    1.386 0.    0.    0.    0.    1.386 1.386
  1.386 1.386 1.386 0.    0.    0.    0.   ]
 [0.    0.    1.386 0.    1.386 0.    0.    0.    0.    0.    0.    0.
  0.    0.    0.    0.    0.    1.386 1.386]]


[[0.    1.386 0.    0.    0.    0.    1.386 0.693 0.575 1.386 0.    0.
  0.    0.    0.    1.386 0.575 0.    0.   ]
 [0.    0.    0.    1.386 0.    0.    0.    0.693 0.288 0.    0.    0.
  0.    0.    0.    0.    0.288 0.    0.   ]
 [1.386 0.    0.    0.    0.    1.386 0.    0.    0.    0.    1.386 1.386
  1.386 1.386 1.386 0.    0.    0.    0.   ]
 [0.    0.    1.386 0.    1.386 0.    0.    0.    0.    0.    0.    0.
  0.    0.    0.    0.    0.    1.386 1.386]]
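
The cell above notes that terms such as "emperor" do not occur in Reuters, so a plain lookup of Reuters-based IDFs fails. One common workaround, sketched here as an assumption and not as part of the original solution, is to estimate document frequencies on the reference collection and apply add-one smoothing so that unseen terms still get a finite IDF. The function name `smoothed_idf` and the toy reference documents are hypothetical.

```python
import math
from collections import Counter

def smoothed_idf(reference_docs, vocab):
    # reference_docs: list of token lists; vocab: terms that need an IDF value
    n_docs = len(reference_docs)
    doc_freq = Counter()
    for tokens in reference_docs:
        # set() ensures each document contributes at most 1 per term
        doc_freq.update(set(tokens))
    # Add-one smoothing: unseen terms have df = 0 and still get a finite IDF
    return {t: math.log((n_docs + 1) / (doc_freq[t] + 1)) for t in vocab}

# Hypothetical stand-in for the reference collection
reference = [
    ["the", "government", "is", "open"],
    ["the", "market", "is", "closed"],
    ["the", "bank", "raised", "rates"],
]

idf = smoothed_idf(reference, ["the", "government", "emperor"])
print(idf)
```

With NLTK, `reference_docs` could be built, for instance, as `[[w.lower() for w in reuters.words(d)] for d in docs]`, so that casing matches the lowercased exercise vocabulary.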


c) Bag-of-words models are order invariant. They do not retain the ordering in which terms occur in the document. Is there any way to include term order information in these models? Justify your answer below.

- No, a plain bag-of-words model cannot capture term order. Every term in the bag is weighted by the same scheme, such as term frequency and inverse document frequency, regardless of where it occurs in the document.

- However, other techniques do take context into account. Word embeddings such as word2vec are learned from the words surrounding each term, and positional encodings in transformer models inject word order information directly. Such models typically achieve much better results than bag-of-words models.
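
One way to keep some local order while staying within a count-based representation is to count n-grams instead of single terms. This is offered as an addition to the answer above, not as part of it, and the helper name `ngram_counts` is hypothetical. A minimal sketch:

```python
import re
from collections import Counter

def ngram_counts(text, n=2):
    # Tokenize as in the unigram model: strip punctuation, lowercase, split
    tokens = re.sub(r"[^\w\s]", "", text).lower().split()
    # Slide a window of n tokens over the text and count each window
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

a = ngram_counts("The dog bites the man.")
b = ngram_counts("The man bites the dog.")

print(a)
print(b)
print(a == b)
```

Both sentences have identical unigram counts, but their bigram counts differ because "dog bites" occurs only in the first and "man bites" only in the second. The price is a much larger and sparser vocabulary, and order beyond the window of n terms is still lost.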

# 2. Topic Models
Topic models represent textual documents in terms of their distribution of latent topics. Imagine you have trained a 10-topic LDA model. Each topic is a frequency distribution over thousands of terms. Is there a good way of illustrating the meaning of the learned topics to a human? Discuss the advantages and disadvantages of some of the possible options below.

1. Listing the top 10 or 20 most important terms for each topic. This is a simple way to inspect what the model has learned and to eliminate problematic topics. However, a plain list carries no information on term frequencies, relative importance, or relationships between terms.
2. Word clouds can visualize the most important terms of each of the 10 topics, with the size of each word representing its frequency or importance in the topic. Unfortunately, word clouds do not show relationships between words and can be misleading due to the lack of context.
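
The top-terms listing from option 1 can be sketched as follows. Printing the weight next to each term recovers some of the importance information that a bare list lacks. The topic names, terms and weights are hypothetical toy data, and `top_terms` is an illustrative helper, not the API of any particular LDA library.

```python
def top_terms(topic_weights, k=10):
    # topic_weights: dict mapping term -> weight within one topic
    ranked = sorted(topic_weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical toy topics; a real LDA topic would cover thousands of terms
topics = {
    "topic_0": {"bank": 0.12, "rate": 0.09, "loan": 0.07, "river": 0.01},
    "topic_1": {"oil": 0.15, "barrel": 0.08, "price": 0.06, "bank": 0.02},
}

for name, weights in topics.items():
    summary = ", ".join(f"{term} ({w:.2f})" for term, w in top_terms(weights, k=3))
    print(f"{name}: {summary}")
```

In this toy data "bank" appears in both topics with different weights, which a list without weights would not reveal.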