Week 2
======

― Representing words and meanings
------------------------------------------------------

― Language modeling
--------------------------------

<img src="images/_99.jpg" width="30%">

Meanings are central to natural language 
=================================

Key **emerging properties** of natural language:

+ language is socially-oriented
+ words reflect symbols and social categories (that is, culture)
+ language conveys (ambiguous) meanings

The **concept of meaning** is the place to start for any natural language processing analysis.

Let's have a closer look at:

+ how 'meanings' are represented in computational linguistics
+ how machines look at 'meanings'.

<img src="images/_0.jpg" width="75%">

How to represent the meaning of a word?
=====================================

A (computational) linguist's perspective
----------------------------------

Two pillars that reflect how linguists think about meanings:

+ denotational semantics
+ distributional hypothesis

The intuition behind denotational semantics
---------------------------------------------------------------

Semantics, as the study of meanings, concerns the relationship between signifiers ― like words, phrases, signs, and symbols ― and what they stand for in reality, their denotation.

Denotations comprise both the salient features associated with an entity (be it a concrete instance or a category) and the cognitive and behavioral effects of using a signifier that invokes an entity.

Example: the lexeme 'hip-hop' conveys meanings about what constitutes a 'hip-hop' song as well as the values, norms, and beliefs that orient the behavior of 'hip-hop people.'

<img src="images/_2.jpg" width="100%">

The distributional hypothesis (DH)
============================

*''Difference of meaning correlates with difference of distribution''*

―Harris, 1954

*''Semantic similarity is a function of the contexts in which words are used.''*

―Miller & Charles, 1991

*''DS is not only a method for lexical analysis but also a theoretical framework to build computational models of semantic memory''*

―Lenci, 2018

DH lies at the heart of vector space models.

<img src="images/_1.png" width="100%">

Fig. 1 ― Distributional vectors of the lexemes car, cat, dog, and van. Notes: source is 'Lenci 2018 ― ARL'

Distributional representations
========================

The distributional representation of a lexical item is typically a distributional vector representing its co-occurrences with linguistic contexts ― hence the name vector space semantics.

The kind of co-occurrence relation between target and context lexemes determines different
types of collocates and distributional representations.

Context types (Firth, 1957: *''You shall know a word by the company it keeps!''*)

| Context types                                | Co-occurrences             |
| -------------------------------------------- | -------------------------- |
| Undirected window-based collocate            | $word$                     |
| Directed window-based collocate              | $\langle R, word \rangle$  |
| [Dependency-filtered syntactic collocate][1] | word                       |
| Dependency-typed syntactic collocate         | $\langle obj, word \rangle$|
| Text region                                  | Firth (1957)               |

Notes: source is 'Lenci 2018 ― ARL'

[1]: https://spacy.io/displacy-3504502e1d5463ede765f0a789717424.svg
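The first two context types in the table can be sketched in a few lines. This is a minimal illustration, assuming a whitespace-tokenized sentence and a symmetric window of two tokens; the function name `window_collocates` and the toy sentence are hypothetical and do not come from Lenci (2018).

```python
def window_collocates(tokens, target, window=2, directed=False):
    """Collect the collocates of `target` within +/- `window` tokens.

    Undirected collocates are plain words; directed collocates are
    (direction, word) pairs, with 'L' for left and 'R' for right.
    """
    collocates = []
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j == i:
                continue  # skip the target itself
            if directed:
                collocates.append(("L" if j < i else "R", tokens[j]))
            else:
                collocates.append(tokens[j])
    return collocates


# Toy sentence (invented for illustration)
tokens = "the dog chased the cat across the yard".split()
undirected = window_collocates(tokens, "cat")
directed = window_collocates(tokens, "cat", directed=True)
print(undirected)
print(directed)
```

Syntactic collocates would require a dependency parser instead of a fixed window, which this sketch does not attempt.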

Building distributional representations (1/3)
====================================

The basic method of building distributional vectors consists of the following procedure:

+ co-occurrences between lexical items and linguistic contexts are extracted from a corpus and counted
+ the distribution of lexical items is represented with a co-occurrence matrix, whose rows correspond to target lexical items, columns to contexts, and the entries to their co-occurrence frequency
+ raw frequencies are then usually transformed into significance weights to reflect the importance of the contexts
+ the semantic similarity between lexemes is measured with the similarity between their row vectors in the co-occurrence matrix

Suppose we have extracted and counted the co-occurrences of the targets $T = \{bike, car, dog, lion\}$ with the context lexemes $C = \{bite, buy, drive, eat, get, live, park, ride, tell\}$ in a corpus. Their distribution is represented with the following co-occurrence matrix $M_{T \times C}$, in which $m_{t,c}$ is the co-occurrence frequency of $t$ with $c$:

<img src="images/_4.png" width="70%">

Notes: source is 'Lenci 2018 ― ARL'
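The counting step can be sketched as follows. The four-sentence corpus is invented for illustration (it is not the corpus behind the matrix above), only a subset of the context lexemes is used, and "co-occurrence" is assumed to mean "appearing in the same sentence", which is just one of the possible context types.

```python
from collections import Counter

# Toy corpus (invented; not the corpus behind the matrix above)
corpus = [
    "i ride my bike and park the bike",
    "we buy a car and drive the car",
    "the dog will bite and eat",
    "the lion can bite and eat and live",
]
targets = ["bike", "car", "dog", "lion"]
contexts = ["bite", "buy", "drive", "eat", "live", "park", "ride"]

# Count how often each target shares a sentence with each context lexeme
counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for t in targets:
        for c in contexts:
            counts[(t, c)] += tokens.count(t) * tokens.count(c)

# Rows are targets, columns are contexts, entries are co-occurrence frequencies
matrix = [[counts[(t, c)] for c in contexts] for t in targets]

for t, row in zip(targets, matrix):
    print(f"{t:>5}", row)
```

With a real corpus the matrix is large and mostly zeros, so a sparse representation is usually preferable to nested lists.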

Building distributional representations (2/3)
====================================

The most common weighting function in DS is positive pointwise mutual information (PPMI) (Bullinaria & Levy 2007).

PPMI measures how much the probability of a target–context pair estimated in the training corpus is higher than the probability we should expect if the target and the context occurred independently of one another.

Matrix 3 contains the PPMI weights computed from the raw co-occurrence frequencies in matrix 1.

\begin{equation}
PPMI(t,c) = \max \bigg( 0, \log_{2} \frac{p(t,c)}{p(t)p(c)} \bigg)
\end{equation}

<img src="images/_5.png" width="70%">

Notes: source is 'Lenci 2018 ― ARL'
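The PPMI formula translates almost directly into code. The sketch below assumes that probabilities are estimated as relative frequencies from the count matrix; the function name `ppmi` and the 2 x 2 counts are hypothetical and unrelated to the matrices shown above.

```python
import math


def ppmi(matrix):
    """Return the PPMI-weighted version of a co-occurrence count matrix."""
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    weighted = []
    for i, row in enumerate(matrix):
        new_row = []
        for j, count in enumerate(row):
            if count == 0:
                # log2(0) is undefined; PPMI sets these cells to 0
                new_row.append(0.0)
                continue
            p_tc = count / total       # joint probability p(t, c)
            p_t = row_sums[i] / total  # marginal probability p(t)
            p_c = col_sums[j] / total  # marginal probability p(c)
            new_row.append(max(0.0, math.log2(p_tc / (p_t * p_c))))
        weighted.append(new_row)
    return weighted


# Hypothetical counts: rows = targets (dog, car), columns = contexts (bite, drive)
counts = [[4, 0],
          [1, 3]]
weights = ppmi(counts)
for row in weights:
    print([round(w, 3) for w in row])
```

Pairs that co-occur less often than chance would predict, like the second row's first cell here, get a negative PMI and are therefore clipped to zero.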

Building distributional representations (3/3)
====================================

The distributional similarity between two lexemes $u$ and $v$ is measured with the similarity
between their distributional vectors.

Once we have computed the pairwise distributional similarity between the targets, we can identify the k nearest neighbors of each target t, that is, the k lexical items with the highest similarity score with t. The cosine is the most popular measure of vector similarity in DS:

\begin{equation}
\cos(u,v) = \frac{u \cdot v}{\Vert u \Vert \Vert v \Vert}
\end{equation}

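The cosine ranges from 1 for vectors that point in the same direction to −1 for vectors that point in opposite directions (the lower bound is 0 if the vectors do not contain negative values, as is the case with PPMI weights). The matrix below reports the cosines between the row vectors in matrix 3: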

<img src="images/_6.png" width="35%">

Notes: source is 'Lenci 2018 ― ARL'
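Cosine similarity and the nearest-neighbor lookup can be sketched with the standard library alone. The function names and the three-dimensional vectors are hypothetical (they are not the values from the matrices above); at scale one would typically use NumPy instead of plain lists.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(target, vectors, k=1):
    """Return the k items most similar to `target`, best first."""
    scores = [(cosine(vectors[target], vec), word)
              for word, vec in vectors.items() if word != target]
    return [word for _, word in sorted(scores, reverse=True)[:k]]


# Hypothetical weighted vectors (not the values from the matrices above)
vectors = {
    "bike": [0.0, 1.2, 0.9],
    "car":  [0.0, 1.0, 1.1],
    "dog":  [1.5, 0.0, 0.1],
    "lion": [1.3, 0.0, 0.0],
}
print(nearest_neighbors("car", vectors, k=1))
print(nearest_neighbors("dog", vectors, k=2))
```

The sketch assumes no all-zero vectors, for which the cosine is undefined.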

Distributional semantics and NLP frameworks/tools
==========================================

<img src="images/_7.png" width="90%">

Notes: source is 'Lenci 2018 ― ARL'

How to represent the meaning of a word?
=====================================

A machine's perspective
----------------------------------

Human beings are entrenched in the symbols and social categories of natural language, while machines are not. Hence, machines are not able to associate meanings with lexemes.

This explains why analyzing massive datasets of natural language has traditionally been taxing, if not impossible. Human beings are very good at making sense of language but bad at computing, so hand-curated workflows do not scale. Machines, by contrast, are very good at computing but have no grasp of meaning, so their computational capacity needs a workflow it can scale up.

Mainly, there are two strategies through which machines can handle meanings:

+ human beings can provide machines with 'pattern-matching' rules that induce meaningful responses vis-à-vis natural language inputs
+ with the aid of statistical frameworks (e.g., distributional representations), machines can discover/learn the meanings of lexical items from text

Pattern-matching route
===================

Two prominent natural language tools that draw on pattern matching:

+ regular expressions
+ WordNet (an example of an annotated dataset)

Regular expressions
=================

Regular expressions use a special kind (class) of formal language grammar called a regular grammar.

Regular grammars have predictable, provable behavior, and yet are flexible enough to power some of the most sophisticated dialog engines and chatbots on the market. Amazon Alexa and Google Now are mostly pattern-based engines that rely on regular grammars.

Deep, complex regular grammar rules can often be expressed in a single line of code called a regular expression. There are successful chatbot frameworks in Python, like Will, that rely exclusively on this kind of language to produce some useful and interesting behavior.

<img src="images/_9.png" width="100%">

Examples of home assistant products.

Regular expressions: A minimal chatbot (1/3)
=================================

In [32]:
'''
Credits to Lane, Howard & Hapke (2019)
'''

# load re module
import re

# greeting matcher
r = "(hi|hello|hey)[ ]*([a-z]*)"

# matcher in action
m0 = re.match(r, 'Hello Rosa', flags=re.IGNORECASE)
m1 = re.match(r, "hi ho, hi ho, it's off to work ...", flags=re.IGNORECASE)
m2 = re.match(r, "hey, what's up", flags=re.IGNORECASE)

print("""
m0 : {}

m1 : {}

m2 : {}
""".format(m0, m1, m2))


m0 : <re.Match object; span=(0, 10), match='Hello Rosa'>

m1 : <re.Match object; span=(0, 5), match='hi ho'>

m2 : <re.Match object; span=(0, 3), match='hey'>



Regular expressions: A minimal chatbot (2/3)
=================================

In [33]:
# let's expand the greeting matcher
r = r"[^a-z]*([y]o|[h']?ello|ok|hey|(good[ ])?(morn[gin']{0,3}|"\
    r"afternoon|even[gin']{0,3}))[\s,;:]{1,3}([a-z]{1,20})"

# ... and ignore the case of text
re_greeting = re.compile(r, flags=re.IGNORECASE)

# matcher in action (uncomment the below to run)
# re_greeting.match('Hello Rosa')
# re_greeting.match('Hello Rosa').groups()
# re_greeting.match("Good morning Rosa")
# re_greeting.match("Good Manning Rosa")
# re_greeting.match('Good evening Rosa Parks').groups() 
# re_greeting.match("Good Morn'n Rosa")
# re_greeting.match("yo Rosa")

Regular expressions: A minimal chatbot (3/3)
=================================

In [None]:
# set of names for the bot
my_names = {'rosa', 'rose', 'chatty', 'chatbot', 'bot', 'chatterbot'}

# possible curt names to use in the conversation
curt_names = {'hal', 'you', 'u'}

# name of the conversant (we pretend to know her/him)
greeter_name = 'Simone'

# let's recycle the matcher (re_greeting is compiled in the previous cell)
match = re_greeting.match(input())

# conditional statement that initiates the conversation (run and populate)
if match:
    # the last captured group is the name being addressed; lower-case it
    # so that both lookups below are case-insensitive
    at_name = match.groups()[-1].lower()
    if at_name in curt_names:
        print("Good one.")
    elif at_name in my_names:
        print("Hi {}, How are you?".format(greeter_name))

WordNet
=======

[WordNet®](https://wordnet.princeton.edu/) is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, WordNet presents some key nuances (see the graph below).

<img src="images/_11.png" width="100%">

[Fragment of WordNet Concept Hierarchy](http://nltk.sourceforge.net/doc/en/ch01.html)

The anatomy of WordNet
=====================

**Structure**

The main relation among words in WordNet is synonymy, as between the words shut and close or car and automobile. Synonyms ― words that denote the same concept and are interchangeable in many contexts ― are grouped into unordered sets (synsets).

Each of WordNet’s 117 000 synsets is linked to other synsets by means of a small number of 'conceptual relations.' 

Additionally, a synset contains a brief definition ('gloss') and, in most cases, one or more short sentences illustrating the use of the synset members.

Word forms with several distinct meanings are represented in as many distinct synsets. Thus, each form-meaning pair in WordNet is unique.

**Relations**

The most frequently encoded relation among synsets is the super-subordinate relation (also called hyperonymy, hyponymy or ISA relation).

It links more general synsets like $\texttt{furniture}$ or $\texttt{piece_of_furniture}$ to increasingly specific ones like $\texttt{bed}$ and $\texttt{bunkbed}$.

Thus, WordNet states that the category furniture includes bed, which in turn includes bunkbed; conversely, concepts like bed and bunkbed make up the category furniture.

WordNet in action: synonyms
===============

In [None]:
# let's import wordnet from nltk.corpus
#
# this assumes that you've downloaded the 'wordnet' corpus before;
# if not, you can do that with:
#
#   import nltk
#   nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# the part-of-speech tags adopted by nltk are not very informative:
# let's make them self-explanatory ('s' marks satellite adjectives)
poses = {'n': 'noun', 'v': 'verb', 's': 'adj (s)', 'a': 'adj', 'r': 'adv'}

# print the part of speech and the lemmas of every synset containing 'fine'
for synset in wn.synsets("fine"):
    print("{}: {}".format(poses[synset.pos()],
                          ", ".join(lemma.name() for lemma in synset.lemmas())))

WordNet in action: hypernymy relation
===============

In [None]:
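from nltk.corpus import wordnet as wn

# first noun sense of 'python'
python_synset = wn.synset("python.n.1")

# function that returns the direct hypernyms of a synset
hyper = lambda s: s.hypernyms()

# transitive closure: all the ancestors of the synset in the hierarchy
list(python_synset.closure(hyper))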

So, is WordNet enough to do NLP?
============================

**PROS**

+ it brings usable meanings to machines (e.g., it can inform our chatbot)
+ great as a resource for research & teaching 

**CONS**

+ bottleneck: WordNet is an annotated dataset
+ it requires human labor to adapt
  - perhaps, it's impossible to keep up to date
  - at least, it misses new meanings of words
+ it is subjective
+ missing nuance (Manning, 2019): e.g. 'proficient' is listed as a synonym for 'good'.
+ it doesn't offer a continuous measure of word similarity

### Is there a statistical or machine learning approach that might work in place of the pattern-matching approach?

### Provided we have access to a reasonably large (and diverse) corpus of text, how can we represent the duality between words and meanings?

Statistical framework route
======================

+ Traditional NLP
    - Bag-of-Words
    - One-hot encodings

+ Modern NLP
    - Embeddings (e.g., a vector space generated via $\texttt{word2vec}$) 

<img src="images/_8.jpg" width="100%">

Bag-of-Words (BoW)
===========

Given a text corpus $D$, a BoW workflow involves the following steps:

1. $\forall d \in D$ (e.g., documents, statements, sentences, and even single words), get the tokens $\Phi(d)$
2. $\forall s \in S$ (i.e., unique lexical items) and $\forall d \in D$, get the cardinality $\vert s \vert$, that is, the number of times $s$ occurs in $\Phi(d)$

We call the possible vectors a machine might create this way a vector space.

Such a vector space allows us to use linear algebra (and libraries such as NumPy, SciPy, or Numba) to manipulate lexical items and compute things like distances and statistics involving natural language data.

<img src="images/_12.png" width="100%">

Token sorting tray (source is Lane, Howard & Hapke 2019)
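The two steps of the BoW workflow can be sketched as follows, assuming whitespace tokenization and two invented documents; every document is mapped to a count vector over the same sorted vocabulary, so the vectors are directly comparable.

```python
from collections import Counter

# Two toy documents (invented for illustration)
docs = ["the cat sat on the mat", "the dog sat"]

# Step 1: tokenize each document
tokenized = [doc.split() for doc in docs]

# Step 2: build the vocabulary and count each lexical item per document
vocab = sorted({token for tokens in tokenized for token in tokens})
bow_vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_vectors.append([counts[word] for word in vocab])

print(vocab)
for vector in bow_vectors:
    print(vector)
```

Each position in a vector corresponds to one vocabulary item, so word order within a document is discarded, which is where the "bag" in bag-of-words comes from.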

What can we do with BoW data?
==========================

For example, we can address search queries such as: *''What is the combination of words most likely to follow a particular bag of words?''* Or, if a user enters a sequence of words, *''What is the closest bag of words in our database to a bag-of-words vector provided by the user?''*

General take-home on BoW:

+ a BoW approach can generate meaningful responses to queries
+ in a BoW approach, humans do not pass any rules to machines (remember pattern matching?)
+ BoW leverages distributional data to assess the semantic similarity of lexical items

Warning: a BoW approach doesn't say anything about the specific meanings of lexical items.
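The second query above ("what is the closest bag of words in our database?") can be sketched as a nearest-neighbor search. The three-document "database", the query, and the helper names are hypothetical; cosine similarity is assumed as the closeness measure, although other measures would work too.

```python
import math
from collections import Counter


def bow(text):
    """Lower-case, whitespace-tokenize, and count the tokens of a text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bags of words stored as Counters."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# A hypothetical three-document "database"
database = [
    "how to park a car in the city",
    "what do lions eat in the wild",
    "how to ride a bike in the rain",
]
query = "where can i park my car"

# Pick the document whose bag of words is closest to the query's
query_bow = bow(query)
best = max(database, key=lambda doc: cosine(query_bow, bow(doc)))
print(best)
```

The match rests purely on shared tokens ('park' and 'car'): the machine retrieves a plausible answer without knowing what either word means.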

BoW in action
============

In [None]:
# let's import Counter, a special kind of
# dictionary that computes the cardinality of
# elements
from collections import Counter

# sample sentence
sentence = """
Success is not final; failure is not fatal: It is the courage to continue that counts. -Winston S. Churchill
"""

# tokenization
tokens = sentence.split()

# BoW
bow = Counter(tokens)

# print
from pprint import pprint
pprint(bow)

One-hot vectors 
=============

One of the main limitations of the BoW approach is the proliferation of unique vectors to compare and contrast.

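One-hot vectors (a form of discrete representation of lexical items) simplify the representation by recording only whether a word is or is not present at a given position in a piece of text. Note that they do not shrink the vector space: each vector is still as long as the vocabulary.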

In [None]:
# let's import numpy to manipulate the text
import numpy as np
# pprint is used below to print the array
from pprint import pprint

# sample sentence
sentence = """
Success is not final; failure is not fatal: It is the courage to continue that counts. -Winston S. Churchill
"""

# tokenization
tokens = sentence.split()

# vocabulary (unique words)
vocab = sorted(set(tokens))
print(', '.join(vocab))

# count of tokens
num_tokens = len(tokens)

# size of vocabulary
vocab_size = len(vocab)

# one-hot vector representation
# -- empty np array
onehot_vectors = np.zeros((num_tokens, vocab_size), int)
# -- fill-in values
for i, word in enumerate(tokens):
    onehot_vectors[i, vocab.index(word)] = 1
    
# print np array
pprint(onehot_vectors)

In [None]:
# some embellishments
import pandas as pd
df = pd.DataFrame(onehot_vectors, columns=vocab)
# blank out the zeros so that the ones stand out
df = df.replace(0, '')
print(df)

Limitations of one-hot encodings
===========================

There is no natural notion of similarity for one-hot vectors! (Manning, 2019)

**Example 1:** the vectors associated with 'good' and 'fine' are orthogonal:

```
good = [0, 0, 1, 0, 0, 0]

fine = [0, 0, 0, 0, 1, 0]
```

**Example 2:** 'greasy spoon' and 'British cafe' express the same category of eatery but the intersection of their one-hot vectors is empty.

```
greasy spoon = [[0, 1, 0, 0], [0, 0, 1, 0]]

British cafe = [[1, 0, 0, 0], [0, 0, 0, 1]]
```

Shall we try to use WordNet’s list of synonyms to get similarity? Probably a bad idea: WordNet has severe limitations.
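The orthogonality in Example 1 can be checked with a dot product. This is a minimal sketch that reuses the 'good' and 'fine' vectors from above, with a hypothetical helper `dot`.

```python
# One-hot vectors from Example 1 (six-word vocabulary)
good = [0, 0, 1, 0, 0, 0]
fine = [0, 0, 0, 0, 1, 0]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Two distinct one-hot vectors never share a non-zero position ...
print(dot(good, fine))
# ... while a one-hot vector always has a dot product of 1 with itself
print(dot(good, good))
```

Because the dot product of any two distinct one-hot vectors is zero, their cosine is zero as well: every pair of different words looks equally unrelated, however close their meanings are.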

Modern NLP: Distributional Hypothesis + DL
====================================

From DH to word vectors
=====================

According to the Distributional Hypothesis, the meaning of a focal word $\omega$ is a function of its linguistic context, i.e., the lexical items in the neighborhood of the focal word.

Then, considering (all) the many contexts of $\omega$ (e.g., regulation) helps to create an accurate vector representation of $\omega$.

Sample sentences containing the word 'regulation':

```
... to encourage and implement the adoption of common REGULATIONs for all forms of motor sports and series across the ...

countries should adhere to the cost-benefit paradigm of REGULATION, forcing bureaucrats to outline all the benefits of ...

Agencies create REGULATIONs (also known as "rules") under the authority of Congress to help ...

```

Word vectors as dense, real valued vectors
===================================

Ultimately, by observing and analyzing the same word in multiple contexts, we aim to build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts.

Below is a portion of the [vector](https://spacy.io/usage/vectors-similarity) associated with the word 'banana'.

```
array([2.02280000e-01,  -7.66180009e-02,   3.70319992e-01,
       3.28450017e-02,  -4.19569999e-01,   7.20689967e-02,
      -3.74760002e-01,   5.74599989e-02,  -1.24009997e-02,
       5.29489994e-01,  -5.23800015e-01,  -1.97710007e-01,
      -3.41470003e-01,   5.33169985e-01,  -2.53309999e-02,
       1.73800007e-01,   1.67720005e-01,   8.39839995e-01,
       5.51070012e-02,   1.05470002e-01,   3.78719985e-01,
       2.42750004e-01,   1.47449998e-02,   5.59509993e-01,
       1.25210002e-01,  -6.75960004e-01,   3.58420014e-01,
       # ... and so on ...
       3.66849989e-01,   2.52470002e-03,  -6.40089989e-01,
      -2.97650009e-01,   7.89430022e-01,   3.31680000e-01,
      -1.19659996e+00,  -4.71559986e-02,   5.31750023e-01], dtype=float32)
```

Overview of the $\texttt{word2vec}$ algorithm
================================

$\texttt{word2vec}$ (Mikolov et al. 2013) is a framework for learning word vectors.

Idea ― given a corpus of text $D$:

+ each word $d$ is associated with a vector 
+ go through each position $k$ in the text, which has a center word $\omega$ and context words $\eta$
+ use the similarity of the word vectors for $\omega$ and $\eta$ to calculate the probability of $\eta$ given $\omega$ (or vice versa)
+ keep adjusting the word vectors to maximize this probability

Source is Manning 2019.
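The third step of the idea, turning vector similarity into a probability, can be sketched with a softmax over dot products, which is one common formulation (the skip-gram variant). The two-dimensional vectors, the tiny vocabulary, and the use of separate center and context vectors are assumptions made for illustration; in practice the vectors are learned by adjusting them to raise the probability of observed contexts.

```python
import math

# Hypothetical 2-d vectors: one set for words as centers, one for words as contexts
center_vecs = {"bank": [1.0, 0.5]}
context_vecs = {
    "money": [0.9, 0.4],
    "river": [0.2, 0.8],
    "banana": [-0.7, -0.3],
}


def dot(u, v):
    """Dot product, used here as the similarity between two word vectors."""
    return sum(a * b for a, b in zip(u, v))


def context_probabilities(center):
    """Softmax over dot products: P(context | center) for every context word."""
    scores = {word: math.exp(dot(vec, center_vecs[center]))
              for word, vec in context_vecs.items()}
    total = sum(scores.values())
    return {word: score / total for word, score in scores.items()}


probs = context_probabilities("bank")
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"P({word} | bank) = {p:.3f}")
```

Context words whose vectors point in a similar direction to the center word's vector receive a higher probability, and the probabilities sum to one over the vocabulary.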

# Next week, we'll focus on $\texttt{word2vec}$ and word vectors only. 