# 1. Tokenization

**1. Kochmar mentions several steps required in a typical NLP pipeline, one of them being *Split into words*. Why is this step necessary? Why can we not just feed the text as it is into a model?**

- A raw text carries little usable meaning for a model until it is split into discrete units of information
    - e.g., words, multi-word expressions, ...
- Predictability in modeling schemes: we cannot know the structure (such as the length) of a raw input text in advance
    - By tokenizing, we can impose rules such as "maximum length 512 tokens", and shorter inputs can, for example, be "padded" so that all inputs have the same length
- More advanced approaches like the attention mechanism rely on identifying relationships between tokens

The student is expected to at least reflect on some of these factors.
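
The padding rule mentioned above can be illustrated with a minimal sketch. The function name `pad_or_truncate` and the `[PAD]` placeholder are assumptions made for this example, not part of the lab material or of any specific library.

```python
def pad_or_truncate(tokens, max_length, pad_token="[PAD]"):
    """Return a copy of `tokens` that is exactly `max_length` items long."""
    # Cut off anything beyond the maximum length.
    clipped = tokens[:max_length]
    # Fill the remaining positions (if any) with the padding token.
    return clipped + [pad_token] * (max_length - len(clipped))


short = pad_or_truncate(["That", "poster", "costs"], max_length=5)
long = pad_or_truncate(["a", "b", "c", "d", "e", "f"], max_length=4)
print(short)  # ['That', 'poster', 'costs', '[PAD]', '[PAD]']
print(long)   # ['a', 'b', 'c', 'd']
```

Both results have a fixed, known length, which is what makes batching inputs for a model straightforward.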

**2. Simply splitting on "words" (i.e. whitespace) is rarely enough. Consider the sentence below ("That U.S.A. poster-print costs $12.40...") and name some problems that arise from splitting on whitespace.**

The student is expected to know about issues that arise when tokenizing symbols, abbreviations, capitalization, hyphenation, ...

In [2]:
# If you wish, experiment with implementing different rules for tokenization. You will see that the "ruleset" quickly grows if you want to account for all types of edge cases...
sentence = "That U.S.A. poster-print costs $12.40..."

def your_rulebased_tokenizer(sentence):
    tokens = []
    return tokens

your_rulebased_tokenizer(sentence)

# if implemented, just make sure that the output is a list of tokens.

[]
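
One possible starting point for such a rule set is a single regular expression with ordered alternatives. This is only a sketch: the pattern, the name `regex_tokenize` and the choice of rules are assumptions for illustration, and they cover just the edge cases in the example sentence.

```python
import re

# Alternatives are tried left to right, so the more specific rules come first.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}      # dotted abbreviations such as U.S.A.
    | \$?\d+(?:\.\d+)?      # numbers and dollar amounts such as $12.40
    | \w+(?:-\w+)*          # words, including hyphenated compounds
    | \.\.\.                # an ellipsis kept as one token
    | [^\w\s]               # any other single non-space symbol
    """,
    re.VERBOSE,
)


def regex_tokenize(sentence):
    """Return the list of tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(sentence)


example_tokens = regex_tokenize("That U.S.A. poster-print costs $12.40...")
print(example_tokens)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

Each new edge case (contractions, emoticons, dates, ...) needs another alternative, which is exactly how the rule set grows.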

NLTK has several tokenizers implemented, such as a specific one for Twitter data. Below, indicated by the `TODO`-tag, you should find and import various tokenizers and add them to the list of tokenizers:

`tokenizers = [tokenizer1, tokenizer2, ..., tokenizerN]`

Tokenize the sentence with at least three different tokenizers supplied by NLTK and comment on your findings. You will find the documentation for NLTK's tokenizers [here](https://www.nltk.org/_modules/nltk/tokenize.html) useful.

In [3]:
from typing import List

from nltk.tokenize.api import TokenizerI
from nltk.tokenize import WhitespaceTokenizer, TreebankWordTokenizer, WordPunctTokenizer, TweetTokenizer

tokenizers = [
    WhitespaceTokenizer(),
    TreebankWordTokenizer(),  # tokenize according to the Penn Treebank
    WordPunctTokenizer(),
    TweetTokenizer()
]
# ************************************************************

# Leave this function as-is
def tokenize(tokenizers: List[TokenizerI], sentence: str) -> None:
    for tokenizer in tokenizers:
        assert isinstance(tokenizer, TokenizerI)
        tokenized = tokenizer.tokenize(sentence)
        print(f"{tokenizer.__class__.__name__} ({len(tokenized)} tokens)\n{tokenized}\n")


tokenize(tokenizers, sentence)

WhitespaceTokenizer (5 tokens)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']

TreebankWordTokenizer (7 tokens)
['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', '...']

WordPunctTokenizer (16 tokens)
['That', 'U', '.', 'S', '.', 'A', '.', 'poster', '-', 'print', 'costs', '$', '12', '.', '40', '...']

TweetTokenizer (12 tokens)
['That', 'U', '.', 'S', '.', 'A', '.', 'poster-print', 'costs', '$', '12.40', '...']



Your comments on the outputs above here!

# 2. Language modeling
We have now studied larger models such as BERT and GPT-based language models. A simpler language model, however, can be implemented using n-grams.

**1. What is an n-gram?**

A sequence of n contiguous tokens (words, characters, ...) that occur within a text, extracted in a sliding-window fashion.
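
The sliding-window idea can be sketched without any library. The helper name `sliding_ngrams` is an assumption for this illustration.

```python
def sliding_ngrams(tokens, n):
    """Return all windows of n contiguous tokens as a list of tuples."""
    # zip over the token list shifted by 0, 1, ..., n-1 positions;
    # zip stops at the shortest slice, so incomplete windows are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))


words = ["That", "poster", "costs", "money"]
print(sliding_ngrams(words, 2))
# [('That', 'poster'), ('poster', 'costs'), ('costs', 'money')]
print(sliding_ngrams(words, 3))
# [('That', 'poster', 'costs'), ('poster', 'costs', 'money')]
```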

**2. Use NLTK to print out bigrams and trigrams for the given sentence below. Your function should support any value of n.**

In [4]:
sentence = "That U.S.A. poster-print costs $12.40... I'd pay $5.00 for it."

# ************************************
# TODO: your implementation of n-grams
# ************************************

import nltk

def get_ngram(sentence: str, n: int):
    """Tokenize `sentence` (Penn Treebank rules) and return its n-grams as a list of tuples."""
    tokens = TreebankWordTokenizer().tokenize(sentence)
    return list(nltk.ngrams(tokens, n))

**3. Based on your intuition for language modeling, how can n-grams be used for word predictions?**

An n-gram model can serve as a simplistic language model: count how often each word follows a given context of n-1 words in the training data, then predict the most frequent continuation. Such a model adheres specifically to the data it is trained on. If you train it on your own chats, for example, it would probably be alright for that kind of text. Combining e.g. 2-grams, 3-grams and 4-grams could yield better results if we weight the probability of the next word across the different context lengths.
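
As a rough illustration of word prediction, a bigram model can be built from follower counts. The function names and the toy corpus below are assumptions for this sketch; a realistic model would also need smoothing and a far larger corpus.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for previous, following in zip(tokens, tokens[1:]):
        model[previous][following] += 1
    return model


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


corpus = "the cat sat on the mat and the cat slept".split()
bigram_model = train_bigram_model(corpus)
print(predict_next(bigram_model, "the"))  # cat
print(predict_next(bigram_model, "dog"))  # None
```

In this toy corpus "the" is followed by "cat" twice and by "mat" once, so "cat" is predicted.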

**4. NLTK includes the `FreqDist` class, which produces the frequency distribution of words in a sentence. Use it to print out the two most common words in the text below.**

In [5]:
from nltk import FreqDist
from nltk.tokenize import word_tokenize

text = "That that is is that that is not. Is that it? It is. You sure? Surely it is!"
tokens = word_tokenize(text)
FreqDist(tokens).most_common(2)

[('is', 5), ('that', 4)]

**5. Use your n-gram function from question 2.2 to print out the most common trigram of the text in question 2.4**

In [6]:
text = text.lower()
ngram = get_ngram(text, 3)
ngram_freqs = FreqDist(ngram)
ngram_freqs.most_common(1)

[(('that', 'that', 'is'), 2)]

**6. You may have discovered that you would need to implement some form of preprocessing to get the correct answer to the previous tasks. Preprocessing/cleaning/normalization is often necessary for the desired results. If you were to process the text of a news site or blog post, can you think of some preprocessing steps that would be useful?**

The student should answer something about:
- the trade-offs of lowercasing (it removes information from proper nouns, named entities and more); the same applies to stopword removal
- hyphenation (often occurring in dialogues and quotes)
- symbols in general
- emails, dates, URLs
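
A few of these steps could be sketched as follows. The regular expressions and the `<URL>` / `<EMAIL>` placeholder strings are deliberately simple assumptions for illustration; real-world URLs and e-mail addresses need more careful handling.

```python
import re

# Simplified patterns: a URL followed directly by punctuation would
# swallow that punctuation, and unusual e-mail formats are not covered.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def normalize(text):
    """Replace URLs and e-mail addresses with placeholders and tidy whitespace."""
    text = URL_RE.sub("<URL>", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()


cleaned = normalize("Contact  me at jane.doe@example.com or see https://example.com/post today")
print(cleaned)
# Contact me at <EMAIL> or see <URL> today
```

Replacing such items with placeholders before tokenization keeps them from being split into many meaningless tokens. Lowercasing is left out here on purpose, given the trade-offs listed above.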

# 3. Word Representations
For more information on word representations, consult the lab description file and course material.

**1. Describe the main differences between bag-of-words and one-hot encoding through examples.**

Bag-of-words maps each word in the vocabulary to the number of times it occurs in a document.

Given an initial sentence like "the cat and the dog"

The bag-of-words representation would be: 
```
{
    "the": 2,
    "cat": 1,
    "and": 1,
    "dog": 1
}
```
Using the vocabulary order (the, cat, and, dog), we can thus represent another sentence like "the cat and the cat" as a vector: [2, 2, 1, 0] and "the cat" as [1, 1, 0, 0].


One-hot encoding is a binary representation of words in a vocabulary: each word gets a vector with a single 1 at its own index.
With the same data as above, "the cat and the dog", the vocabulary (the, cat, and, dog) would be represented as:
```
[
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1]
]
```
The complete sentence is then `[1, 1, 1, 1]` and "the cat" would be `[1, 1, 0, 0]`. "The cat and the cat" would be `[1, 1, 1, 0]`.

The main difference is that one-hot encoding only records, in binary (either/or) form, whether a word is present. Bag-of-words also considers the frequency of each word within the document, whereas one-hot encoding does not.
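
The vectors from the examples above can be reproduced with a short sketch. The helper names are assumptions for this illustration; the vocabulary order follows first occurrence in "the cat and the dog".

```python
from collections import Counter


def build_vocabulary(tokens):
    """Return the unique tokens in order of first occurrence."""
    return list(dict.fromkeys(tokens))


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in `tokens`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def binary_vector(tokens, vocabulary):
    """Mark with 1 each vocabulary word that occurs at least once."""
    present = set(tokens)
    return [int(word in present) for word in vocabulary]


vocabulary = build_vocabulary("the cat and the dog".split())
sentence_tokens = "the cat and the cat".split()

print(vocabulary)                                  # ['the', 'cat', 'and', 'dog']
print(bag_of_words(sentence_tokens, vocabulary))   # [2, 2, 1, 0]
print(binary_vector(sentence_tokens, vocabulary))  # [1, 1, 1, 0]
```

The two vectors differ only where a word occurs more than once, which is the frequency information the binary form discards.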

**2. What are the limitations of the above representations?**

Neither BoW nor one-hot encoding represents word order (positional information), and neither captures semantic similarity between words, unlike alternatives such as word embeddings. Both are also sparse: the vector length grows with the size of the vocabulary.

**3. Examples of word embedding techniques, such as Word2Vec and GloVe, are considered *dense* representations. How do dense word embeddings relate to the *distributional hypothesis*?**

Dense representations such as those produced by skip-gram or CBoW are learned by predicting a word from its context (CBoW) or the context from a word (skip-gram). This directly relates to the distributional hypothesis, which states that words that occur in similar contexts tend to be related in meaning. If a model is able to infer the correct word given its context (or vice versa), it has learned something about the distributional semantics of the involved words.
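
To make the role of context concrete, the sketch below generates the (center word, context word) pairs that a skip-gram style model would be trained on. It shows only the pair extraction, not the training itself, and the function name, window size and toy sentence are assumptions for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


toy_tokens = "the cat sleeps . the dog sleeps .".split()
toy_pairs = skipgram_pairs(toy_tokens, window=1)

cat_contexts = {context for center, context in toy_pairs if center == "cat"}
dog_contexts = {context for center, context in toy_pairs if center == "dog"}
print(cat_contexts == dog_contexts)  # True
```

Here "cat" and "dog" are paired with exactly the same context words, so a model trained on such pairs is pushed towards giving them similar vectors.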