# Exercises

In these exercises, you will first solve tasks related to the new concepts.

# Stemming and Lemmatization

#### 1. Given the list of pluralized words below, define your own simple word stemmer function or class, limited to simple rules and regex. No libraries! It should strip basic endings.

In [1]:
import re

plurals = [
    "flies",
    "denied",
    "itemization",
    "sensational",
    "reference",
    "colonizer",
]

class MyStemmer:
    def __init__(self):
        pass

    def stem_word(self, word):
        # Strip a trailing "s", "es", or "ed"; other words are returned unchanged.
        return re.sub(r"(s|es|ed)$", "", word)

    def stem(self, words):
        return [self.stem_word(word) for word in words]

stemmer = MyStemmer()
stemmer.stem(plurals)

['fli', 'deni', 'itemization', 'sensational', 'reference', 'colonizer']

#### 2. After your initial implementation, run it on the following words:

In [2]:
new_words = [
    "friendly",
    "puzzling",
    "helpful",
]
stemmer.stem(new_words)

['friendly', 'puzzling', 'helpful']
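
None of the three new words ends in -s, -es, or -ed, so the simple stemmer returns them unchanged. One way to extend it is an ordered list of suffix rules where the first match wins. The sketch below is only illustrative: `SUFFIX_RULES` and this standalone `stem_word` are hypothetical names, and the rule set is hand-picked for these examples rather than a complete algorithm.

```python
import re

# Illustrative suffix rules, tried in order; the first matching rule wins.
SUFFIX_RULES = [
    (r"ies$", "i"),
    (r"ied$", "i"),
    (r"ing$", ""),
    (r"ful$", ""),
    (r"ly$", ""),
    (r"es$", ""),
    (r"ed$", ""),
    (r"s$", ""),
]


def stem_word(word):
    # Apply the first rule whose pattern matches the end of the word.
    for pattern, replacement in SUFFIX_RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word


print([stem_word(w) for w in ["flies", "denied", "friendly", "puzzling", "helpful"]])
# → ['fli', 'deni', 'friend', 'puzzl', 'help']
```

Rules like these overstem short words (for example, "ring" becomes "r"), which illustrates why maintaining such rules by hand quickly becomes problematic.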

#### 3. Realizing that fixing future words manually can be problematic, use a desired NLTK stemmer and run it on all the words:

In [3]:
import nltk

all_words = plurals + new_words

def stem_wordlist(stemmer, wordlist):
    assert isinstance(stemmer, nltk.stem.StemmerI)
    return [stemmer.stem(word) for word in wordlist]

stemmer = nltk.stem.PorterStemmer()
stem_wordlist(stemmer, all_words)

['fli',
 'deni',
 'item',
 'sensat',
 'refer',
 'colon',
 'friendli',
 'puzzl',
 'help']

#### 4. There are likely a few words in the outputs above that would cause issues in real-world applications. Pick some examples, and show how they are solved with a lemmatizer. Use either spaCy or nltk.

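Stems such as *fli* and *colon* are problematic: the first is not a dictionary word, and the second collides with an unrelated word. A lemmatizer returns dictionary forms instead, as the code below shows.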

In [4]:
nltk.download('wordnet')
lemmatizer = nltk.stem.WordNetLemmatizer()
print(lemmatizer.lemmatize("flies"))
print(lemmatizer.lemmatize("colonizer"))

[nltk_data] Downloading package wordnet to /Users/tollef/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


fly
colonizer


# Stemming/Lemmatization - Practical Example
Using the news corpus (subset/category of the Brown corpus), perform common text normalization techniques such as stopword filtering and stemming/lemmatization. Compare the top 10 most common **words** before and after these normalization techniques.

In [51]:
# import nltk; nltk.download('brown')  # ensure we have the data
from nltk.corpus import brown
news = brown.words(categories='news')

news_words = [word.lower() for word in news if word.isalpha()]
top10 = nltk.FreqDist(news_words).most_common(10)
top10 = [word for word, _ in top10]
top10

['the', 'of', 'and', 'to', 'a', 'in', 'for', 'that', 'is', 'was']

In [8]:
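# import nltk; nltk.download('stopwords')  # ensure we have the data
# a set makes the membership test below much faster than scanning a list
stopwords = set(nltk.corpus.stopwords.words('english'))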
words = [w for w in news_words if w not in stopwords]
words = [lemmatizer.lemmatize(w) for w in words]

top10 = nltk.FreqDist(words).most_common(10)
top10 = [word for word, _ in top10]
top10

['said',
 'would',
 'year',
 'new',
 'one',
 'state',
 'last',
 'two',
 'first',
 'president']

# TF-IDF
TF-IDF (term frequency-inverse document frequency) measures how important a word is to a document, relative to a collection of documents (the corpus).

$$
\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)
$$

Where:
- $t$ is the term (word)
- $d$ is the document
- $D$ is the corpus
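
The formula leaves $\text{tf}$ and $\text{idf}$ open. One common choice, assumed here for illustration, is a raw count for the term frequency and a base-10 logarithm of the inverse document fraction:

$$
\text{tf}(t, d) = \text{number of occurrences of } t \text{ in } d, \qquad
\text{idf}(t, D) = \log_{10} \frac{|D|}{|\{d' \in D : t \in d'\}|}
$$

For example, under these assumptions a term that occurs 3 times in $d$ and appears in 10 of 100 documents gets $3 \times \log_{10}(100 / 10) = 3$.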



#### 1. Implement TF-IDF using NLTK's FreqDist (no use of e.g. scikit-learn and other high-level libraries).

In [None]:
import nltk
from math import log10


def tf(document, term):
    freq = nltk.FreqDist(document)
    return freq[term]


def idf(documents, term):
    num_docs_with_term = 0
    for d in documents:
        if term in d:
            num_docs_with_term += 1

    if num_docs_with_term == 0:
        return 0

    return log10(len(documents) / num_docs_with_term)

def tf_idf(all_documents, document, term):
    _tf = tf(document, term)
    _idf = idf(all_documents, term)
    # print(f"TF: {_tf}, IDF: {_idf}")
    return _tf * _idf

#### 2. With your TF-IDF function in place, calculate the TF-IDF for the following words in the first document of the news articles found in the Brown corpus: 

- *the*
- *nevertheless*
- *highway*
- *election*

Perform any preprocessing steps you deem necessary. Comment on your findings.

In [10]:
import re

# get fileids from news:
fileids = brown.fileids(categories='news')
# convert to plain lists, since brown.words() returns a lazy NLTK corpus view
first_doc = list(brown.words(fileids[0]))
all_docs = [list(brown.words(fileid)) for fileid in fileids]

def preprocess(document):
    # TODO: implement some preprocessing steps.
    pass

# Suggested solution (replaces the stub above)
def preprocess(document):
    # lowercase each token and drop every character that is not a-z
    words = [re.sub(r'[^a-z]', '', w.lower()) for w in document]
    # require length > 1
    words = [w for w in words if len(w) > 1]
    return words

first_doc = preprocess(first_doc)
all_docs = [preprocess(doc) for doc in all_docs]

terms = ["the", "nevertheless", "highway", "election"]

for term in terms:
    score = tf_idf(all_docs, first_doc, term)
    print(f"TF-IDF of '{term}': {score:.2f}")

TF-IDF of 'the': 0.00
TF-IDF of 'nevertheless': 0.94
TF-IDF of 'highway': 7.29
TF-IDF of 'election': 8.43
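
The zero score for *the* follows directly from the IDF term: a word that occurs in every document gets $\log_{10}(N/N) = 0$, no matter how often it appears. The sketch below illustrates this with hypothetical document counts; the numbers are made up and not taken from the Brown corpus, and `idf_from_counts` is an illustrative helper.

```python
from math import log10


def idf_from_counts(n_docs, n_docs_with_term):
    # Same formula as the idf() function above, expressed on plain counts.
    return log10(n_docs / n_docs_with_term)


n_docs = 40  # hypothetical collection size
for label, doc_freq in [("every document", 40), ("half of them", 20), ("one document", 1)]:
    print(f"term in {label}: idf = {idf_from_counts(n_docs, doc_freq):.3f}")
# → term in every document: idf = 0.000
# → term in half of them: idf = 0.301
# → term in one document: idf = 1.602
```

By contrast, high scores arise when a term is frequent in the document but absent from most of the collection.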


#### 3. While TF-IDF is primarily used for information retrieval and text mining, reflect on how TF-IDF could be used in a language modeling context.

Expect an answer that covers the importance of weighting words by both their frequency and their inverse document frequency, since we do not want to overrepresent common words when predicting the next word. If we considered frequency alone, words such as "a", "the", and "in" would come out as the most "important", and thus the most frequently predicted.

Similarly to the n-gram question in lab2, we could use these weights as a kind of filter on the next-word predictions.
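
As a rough sketch of that idea, the candidate next words for some context could be re-ranked by multiplying their observed counts with an IDF weight. All counts below are hypothetical, and `rerank` is an illustrative helper rather than part of any library.

```python
from math import log10

# Hypothetical counts of words observed after some context (made-up numbers).
candidate_counts = {"the": 50, "of": 40, "highway": 5, "election": 8}

# Hypothetical document frequencies in a collection of 100 documents.
n_docs = 100
doc_freq = {"the": 100, "of": 100, "highway": 4, "election": 10}


def idf(term):
    return log10(n_docs / doc_freq[term])


def rerank(counts):
    # Score each candidate by count * IDF and return the words, highest score first.
    scored = {word: count * idf(word) for word, count in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)


print(rerank(candidate_counts))
```

With these numbers, *election* and *highway* move ahead of *the* and *of*. Since function words are still needed for grammatical text, such a weight is probably better used as a soft re-ranking signal than as a hard filter.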

#### 4. You were previously introduced to word representations. TF-IDF can be considered one. What are some differences between the TF-IDF output and one that is computed once from a vocabulary (e.g. one-hot encoding)?

TF-IDF assigns each word a real-valued weight that depends on both the document and the entire collection, so the same word can receive different weights in different documents. A one-hot encoding, by contrast, is fixed once the vocabulary is known: each word maps to a binary vector determined only by its position in the vocabulary.

Put simply, TF-IDF is a measure of importance, whereas one-hot encoding only indicates identity or presence.
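
The contrast can be made concrete on a toy collection. This is a minimal sketch with made-up documents; `one_hot` and `tf_idf_vector` are illustrative helpers, and raw counts with a base-10 IDF are assumed.

```python
from collections import Counter
from math import log10

# Toy collection of tokenized documents (made up for illustration).
docs = [
    ["ship", "port", "ship"],
    ["port", "strike"],
    ["ship", "strike", "strike"],
]
vocab = sorted({word for doc in docs for word in doc})  # ['port', 'ship', 'strike']


def one_hot(word):
    # Fixed once the vocabulary is known; independent of any document.
    return [1 if word == entry else 0 for entry in vocab]


def tf_idf_vector(doc):
    # One weight per vocabulary entry; depends on the document and the collection.
    counts = Counter(doc)
    weights = []
    for term in vocab:
        doc_freq = sum(1 for d in docs if term in d)
        weights.append(counts[term] * log10(len(docs) / doc_freq))
    return weights


print(one_hot("ship"))                                 # → [0, 1, 0]
print([round(w, 3) for w in tf_idf_vector(docs[0])])   # → [0.176, 0.352, 0.0]
print([round(w, 3) for w in tf_idf_vector(docs[2])])   # → [0.0, 0.176, 0.352]
```

The one-hot vector for *ship* never changes, while its TF-IDF weight differs between the first and the third document.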

# TF-IDF - Practical Example
You will again be looking at specific words for a document, but this time weighted by their TF-IDF scores. Ideally, the scoring should retrieve words that are representative of this document in the context of its document collection or category.

You will do the following:
- Select a category from the Reuters (news) corpus
- Perform preprocessing
- Calculate TF-IDF scores
- Find the top 5 words for a subset of documents in your collection (e.g. 5, 10, ..)
- Inspect whether these words make sense for a given document, and comment on your findings.

In [56]:
# import nltk; nltk.download("reuters")
from nltk.corpus import reuters

# TODO: select a category of the Reuters corpus
# (hint: check with reuters.categories())
#
# Apply preprocessing whenever it suits you :-)
files = reuters.fileids(categories=["ship"])
docs = [reuters.words(fileid) for fileid in files]
docs = [preprocess(doc) for doc in docs]

# TODO: calculate the TF-IDF scores for the selected documents
TOP_N = 5
SUBSET_SIZE = 5
for doc_id, doc in list(zip(files, docs))[:SUBSET_SIZE]:
    print(" ".join(doc))
    scores = {term: tf_idf(docs, doc, term) for term in doc}
    # sort terms by descending score
    scores = {
        k: v for k, v in sorted(scores.items(), key=lambda item: item[1], reverse=True)
    }
    for term, score in list(scores.items())[:TOP_N]:
        print(f"{term}: {score:.2f}")
    print()

australian foreign ship ban ends but nsw ports hit tug crews in new south wales nsw victoria and western australia yesterday lifted their ban on foreign flag ships carrying containers but nsw ports are still being disrupted by separate dispute shipping sources said the ban imposed week ago over pay claim had prevented the movement in or out of port of nearly vessels they said the pay dispute went before hearing of the arbitration commission today meanwhile disruption began today to cargo handling in the ports of sydney newcastle and port kembla they said the industrial action at the nsw ports is part of the week of action called by the nsw trades and labour council to protest changes to the state workers compensation laws the shipping sources said the various port unions appear to be taking it in turn to work for short time at the start of each shift and then to walk off cargo handling in the ports has been disrupted with container movements most affected but has not stopped altogether
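
The loop above calls `tf_idf` once per term, which rebuilds a frequency distribution and rescans every document each time. For larger categories, one alternative is to count document frequencies once up front. The following is a minimal sketch on toy data using `collections.Counter`; `top_terms` is a hypothetical helper, not part of NLTK.

```python
from collections import Counter
from math import log10


def top_terms(docs, n=5):
    # Document frequency: number of documents each term occurs in, counted once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    results = []
    for doc in docs:
        counts = Counter(doc)
        scores = {term: count * log10(n_docs / doc_freq[term]) for term, count in counts.items()}
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        results.append(ranked[:n])
    return results


# Toy documents (made up); in the exercise these would be the preprocessed Reuters documents.
toy_docs = [
    ["port", "strike", "port", "ship"],
    ["ship", "cargo", "ship"],
    ["ship", "tanker", "oil", "oil"],
]
for ranked in top_terms(toy_docs, n=2):
    print([(term, round(score, 2)) for term, score in ranked])
```

In this toy collection, *ship* occurs in every document and therefore never ranks above a term that is specific to one document.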

# Part-of-speech tagging

#### 1. Briefly describe your understanding of POS tagging and its possible use cases in the context of text generation applications/language modeling.

POS tagging assigns each token a grammatical category such as noun, verb, or adjective. Much like TF-IDF as a weighting scheme, POS tags could be used in language modeling to enforce plausible sentence structure (such as Subject-Verb-Object, depending on the language) and in this way filter out unlikely word sequences.
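
A toy version of such a filter might reject candidate sequences that contain implausible tag bigrams. In the sketch below, the lexicon, the disallowed tag pairs, and the helper `is_plausible` are all hypothetical and hand-picked; a real system would obtain tags from a trained tagger.

```python
# Hypothetical mini-lexicon mapping words to coarse POS tags.
LEXICON = {"the": "DET", "dog": "NOUN", "cat": "NOUN", "chased": "VERB"}

# Tag bigrams we choose to reject (an illustrative, hand-picked rule set).
DISALLOWED = {("DET", "DET"), ("DET", "VERB"), ("VERB", "VERB")}


def is_plausible(words):
    # Look up each word's tag and check every adjacent tag pair.
    tags = [LEXICON.get(word) for word in words]
    return all(pair not in DISALLOWED for pair in zip(tags, tags[1:]))


candidates = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "chased", "dog", "the", "the"],
]
print([" ".join(c) for c in candidates if is_plausible(c)])
# → ['the dog chased the cat']
```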

#### 2. Train a UnigramTagger (NLTK) using the Brown corpus. 
Hint: the taggers in nltk require a list of sentences containing tagged words.

In [46]:
import nltk

brown = nltk.corpus.brown
corpus = brown.tagged_sents()

tagger = nltk.UnigramTagger(train=corpus)

#### 3. Use this tagger to tag the text given below. Print out the POS tags for all variants of "justify".

In [47]:
text = """
Imagine a situation where you have to explain why you did something – that's when you justify your actions. So, let's say you made a decision; you, as the justifier, need to give good reasons (justifications) for your choice. You might use justifying words to make your point clear and reasonable. Justifying can be a bit like saying, "Here's why I did what I did." When you justify things, you're basically providing the why behind your actions. So, being a good justifier involves carefully explaining, giving reasons, and making sure others understand your choices
"""

words = nltk.word_tokenize(text)
tagged = tagger.tag(words)
for word, tag in tagged:
    if "just" in word:
        print(f"{word} ({tag})")

justify (VB)
justifier (None)
justifications (NNS)
justifying (VBG)
justify (VB)
justifier (None)
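
A unigram tagger has no entry for words it never saw during training, which is why *justifier* comes back as `None`. NLTK taggers accept a `backoff` tagger that is consulted in such cases. The sketch below uses a tiny hand-made training set (an assumption for illustration, not the Brown corpus) so that it runs without downloading any data.

```python
import nltk

# Tiny hand-made training set with Brown-style tags, only for illustration.
train_sents = [
    [("you", "PPSS"), ("justify", "VB"), ("your", "PP$"), ("actions", "NNS")],
]

# Words unknown to the unigram tagger fall through to the DefaultTagger,
# which tags everything as "NN".
fallback = nltk.DefaultTagger("NN")
backoff_tagger = nltk.UnigramTagger(train=train_sents, backoff=fallback)

print(backoff_tagger.tag(["you", "justify", "justifier"]))
# → [('you', 'PPSS'), ('justify', 'VB'), ('justifier', 'NN')]
```

Guessing "NN" for every unknown word is a crude heuristic; it merely replaces `None` with the most likely open-class tag.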


#### 4. Your results may be disappointing. Repeat the same task as above using both the default NLTK POS tagger and spaCy. Compare the results.

In [48]:
import nltk

tagged = nltk.pos_tag(nltk.word_tokenize(text))
for word, tag in tagged:
    if "just" in word:
        print(f"{word} ({tag})")

justify (VBP)
justifier (NN)
justifications (NNS)
justifying (VBG)
justify (VBP)
justifier (NN)


In [40]:
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp(text)
for token in doc:
    if "just" in token.text:
        print(f"{token.text} ({token.pos_})")

justify (VERB)
justifier (NOUN)
justifications (NOUN)
justifying (VERB)
justify (VERB)
justifier (NOUN)

Comparison: the UnigramTagger returns `None` for *justifier*, whereas both the default NLTK tagger and spaCy label it as a noun. The tag sets also differ: the UnigramTagger emits Brown tags (`VB`), `nltk.pos_tag` emits Penn Treebank tags (`VBP`), and spaCy's `pos_` attribute emits coarse Universal POS tags (`VERB`, `NOUN`).


#### 5. Finally, explore more of the features that the spaCy *document* object includes.

Expect the student to explore attributes such as lemmatization, perhaps named entity recognition, and other easily accessible features of the document object.