# Week 2 ― main topics

+ representing the duality between words and meanings
+ language modeling

# Agenda

# Models of natural language (NL)

**What** is an NL model?

**How** do we build an NL model?

**Why** should we care about NL models?

... let's focus on the **why** aspect first. 

## Why do we care about NL models?

Let's consider **tokenization**, a core task in any natural language processing
(NLP) analysis.

Now, let's apply different tokenizers to the passage below.

In [1]:
s = """Back in the golden age of hip-hop
       (the late '80s, youngsters), Rakim took
       lyricism to unfathomable heights,
       helping to usher in the wave of lethal
       MCs like Big Daddy Kane and Kool G Rap,
       who would go on to become icons. Two
       decades later, some of Ra's rhymes from
       '86 are still over people's heads: His
       wordplay remains a hip-hop measuring
       stick."""

### Different tokenizers in action: Naive solution

Both Python's built-in string methods and the `re` module can be used to
implement a naive tokenizer.

In [2]:
# pure Python
tokens_s = s.split()

# regex
import re
pattern = r'[-\s.,;!?]+'
tokens_re = re.split(pattern, s)

# print results
print("""
Original string:
================
{}

Pure Python:
============
{}

With regex:
===========
{}
""".format(s, tokens_s, tokens_re), flush=True)


Original string:
Back in the golden age of hip-hop
       (the late '80s, youngsters), Rakim took
       lyricism to unfathomable heights,
       helping to usher in the wave of lethal
       MCs like Big Daddy Kane and Kool G Rap,
       who would go on to become icons. Two
       decades later, some of Ra's rhymes from
       '86 are still over people's heads: His
       wordplay remains a hip-hop measuring
       stick.

Pure Python:
['Back', 'in', 'the', 'golden', 'age', 'of', 'hip-hop', '(the', 'late', "'80s,", 'youngsters),', 'Rakim', 'took', 'lyricism', 'to', 'unfathomable', 'heights,', 'helping', 'to', 'usher', 'in', 'the', 'wave', 'of', 'lethal', 'MCs', 'like', 'Big', 'Daddy', 'Kane', 'and', 'Kool', 'G', 'Rap,', 'who', 'would', 'go', 'on', 'to', 'become', 'icons.', 'Two', 'decades', 'later,', 'some', 'of', "Ra's", 'rhymes', 'from', "'86", 'are', 'still', 'over', "people's", 'heads:', 'His', 'wordplay', 'remains', 'a', 'hip-hop', 'measuring', 'stick.']

With regex:
['Back', 'i
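
One quirk of the regex variant is worth checking: when the text ends with a delimiter (here the final period), `re.split` returns an empty string as the last element. The sketch below uses a shortened sample rather than the full string `s` (an assumption made to keep the example small) and filters the empty strings out.

```python
import re

# shortened sample (an assumption for illustration, not the full string `s`)
sample = "His wordplay remains a hip-hop measuring stick."
pattern = r'[-\s.,;!?]+'

# splitting leaves an empty string after the final period
raw_tokens = re.split(pattern, sample)

# drop empty strings produced by leading or trailing delimiters
clean_tokens = [tok for tok in raw_tokens if tok]

print(raw_tokens)
print(clean_tokens)
```

Note that this pattern also splits `hip-hop` in two, which may or may not be what a given analysis needs.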

### Different tokenizers in action: NLTK's tokenizers

NLTK ships several tokenizers (which, by 2020 standards, look rather crude).

In [3]:
# import NLTK's tokenizer 'Regexp'
from nltk.tokenize import RegexpTokenizer

# the tokenizer
# note: \$ matches a literal dollar sign; a bare $ would be an end-of-string anchor
pattern = r'\w+|\$[0-9.]+|\S+'
tokenizer = RegexpTokenizer(pattern)

# tokenize
tokens_nltk_r = tokenizer.tokenize(s)

# print results
print("""
Original string:
================
{}

Pure Python:
============
{}

With regex:
===========
{}

With NLTK's Regex tokenizer:
============================
{}
""".format(s, tokens_s, tokens_re, tokens_nltk_r), flush=True)


Original string:
Back in the golden age of hip-hop
       (the late '80s, youngsters), Rakim took
       lyricism to unfathomable heights,
       helping to usher in the wave of lethal
       MCs like Big Daddy Kane and Kool G Rap,
       who would go on to become icons. Two
       decades later, some of Ra's rhymes from
       '86 are still over people's heads: His
       wordplay remains a hip-hop measuring
       stick.

Pure Python:
['Back', 'in', 'the', 'golden', 'age', 'of', 'hip-hop', '(the', 'late', "'80s,", 'youngsters),', 'Rakim', 'took', 'lyricism', 'to', 'unfathomable', 'heights,', 'helping', 'to', 'usher', 'in', 'the', 'wave', 'of', 'lethal', 'MCs', 'like', 'Big', 'Daddy', 'Kane', 'and', 'Kool', 'G', 'Rap,', 'who', 'would', 'go', 'on', 'to', 'become', 'icons.', 'Two', 'decades', 'later,', 'some', 'of', "Ra's", 'rhymes', 'from', "'86", 'are', 'still', 'over', "people's", 'heads:', 'His', 'wordplay', 'remains', 'a', 'hip-hop', 'measuring', 'stick.']

With regex:
['Back', 'i
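
A detail that matters for patterns of this shape: in a regular expression a bare `$` is an end-of-string anchor, so a literal dollar sign has to be written as `\$`. The sketch below uses plain `re.findall` as a stand-in for the tokenizer (an assumption made to keep the example dependency-free) and a made-up sentence containing a price.

```python
import re

# made-up sentence (hypothetical, not from the lecture text)
sample = "The record cost $3.88, back then."

escaped = r'\w+|\$[0-9.]+|\S+'    # \$ matches a literal dollar sign
unescaped = r'\w+|$[0-9.]+|\S+'   # $ is an anchor, so the middle branch never matches

tokens_escaped = re.findall(escaped, sample)
tokens_unescaped = re.findall(unescaped, sample)

print(tokens_escaped)    # the price and the comma are separate tokens
print(tokens_unescaped)  # the comma stays glued to the price
```

With the unescaped pattern the price is only picked up by the catch-all `\S+` branch, which also swallows the trailing comma.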

In [4]:
# import NLTK's Treebank tokenizer
from nltk.tokenize import TreebankWordTokenizer

# the tokenizer
tokenizer = TreebankWordTokenizer()

# tokenize
tokens_nltk_t = tokenizer.tokenize(s)

# print results
print("""
Original string:
================
{}

With NLTK's Regex tokenizer:
============================
{}

With NLTK's Treebank tokenizer:
===============================
{}
""".format(s, tokens_nltk_r, tokens_nltk_t), flush=True)


Original string:
Back in the golden age of hip-hop
       (the late '80s, youngsters), Rakim took
       lyricism to unfathomable heights,
       helping to usher in the wave of lethal
       MCs like Big Daddy Kane and Kool G Rap,
       who would go on to become icons. Two
       decades later, some of Ra's rhymes from
       '86 are still over people's heads: His
       wordplay remains a hip-hop measuring
       stick.

With NLTK's Regex tokenizer:
['Back', 'in', 'the', 'golden', 'age', 'of', 'hip', '-hop', '(the', 'late', "'80s,", 'youngsters', '),', 'Rakim', 'took', 'lyricism', 'to', 'unfathomable', 'heights', ',', 'helping', 'to', 'usher', 'in', 'the', 'wave', 'of', 'lethal', 'MCs', 'like', 'Big', 'Daddy', 'Kane', 'and', 'Kool', 'G', 'Rap', ',', 'who', 'would', 'go', 'on', 'to', 'become', 'icons', '.', 'Two', 'decades', 'later', ',', 'some', 'of', 'Ra', "'s", 'rhymes', 'from', "'86", 'are', 'still', 'over', 'people', "'s", 'heads', ':', 'His', 'wordplay', 'remains', 'a', 'hip',

### Different tokenizers in action: spaCy's tokenizer

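spaCy's tokenizer itself is rule-based: it is driven by regular expressions (prefix, suffix, and infix patterns) and language-specific exceptions. The later components of the pipeline are where **DL** comes in.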

In [5]:
# import spaCy
import spacy

# load a model of natural language (https://spacy.io/models)
'''
to install one of spaCy's models:

conda install -c conda-forge spacy-model-en_core_web_sm
'''
import en_core_web_sm

In [38]:
# load the nlp pipeline
nlp = en_core_web_sm.load()
'''
Mac users may want to go for:

spacy.load("the_model")
'''

# pass the sentence through the pipeline
doc = nlp(s)

# store tokens in a list
tokens_spacy = [token.text for token in doc if '\n' not in token.text]

# print
print("""
With NLTK's Regex tokenizer:
============================
{}

With NLTK's Treebank tokenizer:
===============================
{}

With spaCy:
==========
{}

""".format(tokens_nltk_r, tokens_nltk_t, ' ^^ '.join(tokens_spacy)))


With NLTK's Regex tokenizer:
['Back', 'in', 'the', 'golden', 'age', 'of', 'hip', '-hop', '(the', 'late', "'80s,", 'youngsters', '),', 'Rakim', 'took', 'lyricism', 'to', 'unfathomable', 'heights', ',', 'helping', 'to', 'usher', 'in', 'the', 'wave', 'of', 'lethal', 'MCs', 'like', 'Big', 'Daddy', 'Kane', 'and', 'Kool', 'G', 'Rap', ',', 'who', 'would', 'go', 'on', 'to', 'become', 'icons', '.', 'Two', 'decades', 'later', ',', 'some', 'of', 'Ra', "'s", 'rhymes', 'from', "'86", 'are', 'still', 'over', 'people', "'s", 'heads', ':', 'His', 'wordplay', 'remains', 'a', 'hip', '-hop', 'measuring', 'stick', '.']

With NLTK's Treebank tokenizer:
['Back', 'in', 'the', 'golden', 'age', 'of', 'hip-hop', '(', 'the', 'late', "'80s", ',', 'youngsters', ')', ',', 'Rakim', 'took', 'lyricism', 'to', 'unfathomable', 'heights', ',', 'helping', 'to', 'usher', 'in', 'the', 'wave', 'of', 'lethal', 'MCs', 'like', 'Big', 'Daddy', 'Kane', 'and', 'Kool', 'G', 'Rap', ',', 'who', 'would', 'go', 'on', 'to', 'become', '

**Pseudocode behind spaCy's tokenizer**

![](images/_16.png)
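
The general idea can be approximated in plain Python: split on whitespace, then repeatedly peel punctuation off each chunk while protecting special cases. What follows is a simplified sketch, not spaCy's implementation. The function name `sketch_tokenize`, the three regular expressions, and the `SPECIAL_CASES` set are all assumptions chosen for this example; spaCy's real rule sets are much larger.

```python
import re

# Hypothetical, heavily reduced rule set (assumptions for illustration only)
SPECIAL_CASES = {"'80s", "'86", "U.K."}              # chunks that must stay whole
PREFIX_RE = re.compile(r"^[(\[{']")                  # opening punctuation
SUFFIX_RE = re.compile(r"(?:'s|[)\]}',.:;!?])$")     # clitic 's or closing punctuation
INFIX_RE = re.compile(r"(-)")                        # hyphen, kept as its own token


def sketch_tokenize(text):
    tokens = []
    for chunk in text.split():
        suffixes = []  # peeled right-to-left, emitted left-to-right
        while chunk:
            if chunk in SPECIAL_CASES:
                tokens.append(chunk)
                chunk = ""
                continue
            suffix = SUFFIX_RE.search(chunk)
            if suffix:
                suffixes.insert(0, suffix.group())
                chunk = chunk[:suffix.start()]
                continue
            prefix = PREFIX_RE.search(chunk)
            if prefix:
                tokens.append(prefix.group())
                chunk = chunk[prefix.end():]
                continue
            # nothing left to peel: split what remains on infixes
            tokens.extend(part for part in INFIX_RE.split(chunk) if part)
            chunk = ""
        tokens.extend(suffixes)
    return tokens


sample = "hip-hop (the late '80s, youngsters), Ra's rhymes from '86."
print(sketch_tokenize(sample))
```

In this sketch suffixes are tried before prefixes so that `'80s,` first loses its comma and is then recognized as a special case; that ordering is a design choice of the example.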

**spaCy's tokenizer in context**

![](images/_17.svg)

In [44]:
# store lemmas in a list
lemmas_spacy = [token.lemma_ for token in doc if '\n' not in token.text]

# print
print("""
With NLTK's Regex tokenizer:
============================
{}

With NLTK's Treebank tokenizer:
===============================
{}

With spaCy -- tokens as text:
=============================
{}

With spaCy -- tokens as lemmas:
===============================
{}

""".format(tokens_nltk_r, tokens_nltk_t, ' ^^ '.join(tokens_spacy),
           lemmas_spacy))


With NLTK's Regex tokenizer:
['Back', 'in', 'the', 'golden', 'age', 'of', 'hip', '-hop', '(the', 'late', "'80s,", 'youngsters', '),', 'Rakim', 'took', 'lyricism', 'to', 'unfathomable', 'heights', ',', 'helping', 'to', 'usher', 'in', 'the', 'wave', 'of', 'lethal', 'MCs', 'like', 'Big', 'Daddy', 'Kane', 'and', 'Kool', 'G', 'Rap', ',', 'who', 'would', 'go', 'on', 'to', 'become', 'icons', '.', 'Two', 'decades', 'later', ',', 'some', 'of', 'Ra', "'s", 'rhymes', 'from', "'86", 'are', 'still', 'over', 'people', "'s", 'heads', ':', 'His', 'wordplay', 'remains', 'a', 'hip', '-hop', 'measuring', 'stick', '.']

With NLTK's Treebank tokenizer:
['Back', 'in', 'the', 'golden', 'age', 'of', 'hip-hop', '(', 'the', 'late', "'80s", ',', 'youngsters', ')', ',', 'Rakim', 'took', 'lyricism', 'to', 'unfathomable', 'heights', ',', 'helping', 'to', 'usher', 'in', 'the', 'wave', 'of', 'lethal', 'MCs', 'like', 'Big', 'Daddy', 'Kane', 'and', 'Kool', 'G', 'Rap', ',', 'who', 'would', 'go', 'on', 'to', 'become', '

## What is a model of NL?

In [45]:
# example from spaCy's website
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False
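
Unlike the part-of-speech and dependency columns, `token.shape_` depends only on the characters of the token. A minimal sketch of how such a shape string can be derived, assuming the mapping suggested by the output above (uppercase to `X`, lowercase to `x`, digits to `d`, runs of the same symbol capped at four). `word_shape_sketch` is a hypothetical helper, not spaCy's own function.

```python
def word_shape_sketch(text):
    """Map a token to an orthographic shape string such as 'Xxxxx' or 'X.X.'."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and symbols are kept as they are
        # count how long the current run of identical symbols is
        run = run + 1 if symbol == last else 1
        last = symbol
        if run <= 4:  # cap runs at four symbols
            shape.append(symbol)
    return "".join(shape)


for word in ["Apple", "looking", "U.K.", "$", "1"]:
    print(word, word_shape_sketch(word))
```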



## How do we build a model of NL?

**Bibliography**

+ Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. Eliyahu Kiperwasser, Yoav Goldberg. (2016)

+ A Dynamic Oracle for Arc-Eager Dependency Parsing. Yoav Goldberg, Joakim Nivre (2012)

+ Parsing English in 500 Lines of Python. Matthew Honnibal (2013)

+ Stack-propagation: Improved Representation Learning for Syntax. Yuan Zhang, David Weiss (2016)

+ Deep multi-task learning with low level tasks supervised at lower layers. Anders Søgaard, Yoav Goldberg (2016)

+ An Improved Non-monotonic Transition System for Dependency Parsing. Matthew Honnibal, Mark Johnson (2015)

+ A Fast and Accurate Dependency Parser using Neural Networks. Danqi Chen, Christopher D. Manning (2014)

+ Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques. Stefan Riezler et al. (2002)

**Tools**

+ Old-school but still good:
  + $\texttt{word2vec}$
+ Extensions of $\texttt{word2vec}$:
  + GloVe
  + fastText
+ Context-aware models:
  + BERT
  + ELMo