Week 3
======

## ― Vector semantics and embeddings

<img src="images/_0.jpg" width="50%">

Week 2: Key tenets
================

+ modern NLP ~ Distributional Hypothesis + DL
+ words can be represented as vectors (Lenci 2018)
+ desiderata: by observing and analyzing the same word in multiple contexts, we aim at:
  - building a *dense, real-valued* vector for each word
  - ... chosen so that it is similar to vectors of words that appear in similar contexts

Word vectors as dense, real valued vectors
===================================

Ultimately, by observing and analyzing the same word in multiple contexts, we aim at building a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts.

Below is a portion of the [vector](https://spacy.io/usage/vectors-similarity) associated with the word 'banana'.

```
array([2.02280000e-01,  -7.66180009e-02,   3.70319992e-01,
       3.28450017e-02,  -4.19569999e-01,   7.20689967e-02,
      -3.74760002e-01,   5.74599989e-02,  -1.24009997e-02,
       5.29489994e-01,  -5.23800015e-01,  -1.97710007e-01,
      -3.41470003e-01,   5.33169985e-01,  -2.53309999e-02,
       1.73800007e-01,   1.67720005e-01,   8.39839995e-01,
       5.51070012e-02,   1.05470002e-01,   3.78719985e-01,
       2.42750004e-01,   1.47449998e-02,   5.59509993e-01,
       1.25210002e-01,  -6.75960004e-01,   3.58420014e-01,
       # ... and so on ...
       3.66849989e-01,   2.52470002e-03,  -6.40089989e-01,
      -2.97650009e-01,   7.89430022e-01,   3.31680000e-01,
      -1.19659996e+00,  -4.71559986e-02,   5.31750023e-01], dtype=float32)
```

Why dense, real-valued vectors rather than sparse vectors?
=================================================

+ convenience: short dense vectors are easier to use as ML features (fewer weights to tune) than sparse vectors with thousands of dimensions
+ they may do better at capturing synonymy/semantic similarity:
  - in a sparse representation, 'car' and 'automobile' are synonyms but occupy distinct dimensions
  - a word with 'car' as a neighbor and a word with 'automobile' as a neighbor should be similar, but a sparse model treats them as unrelated
+ in practice, they work better
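
The synonymy point can be illustrated with a small sketch. The one-hot vectors stand in for a sparse representation; the three-dimensional dense vectors are hypothetical toy values chosen for illustration, not real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Sparse (one-hot) view: every word is its own dimension.
one_hot = {
    "car":        [1.0, 0.0, 0.0],
    "automobile": [0.0, 1.0, 0.0],
    "banana":     [0.0, 0.0, 1.0],
}

# Dense view: hypothetical toy values, hand-picked for this example.
dense = {
    "car":        [0.9, 0.1, 0.3],
    "automobile": [0.8, 0.2, 0.35],
    "banana":     [-0.2, 0.9, -0.5],
}

sparse_sim = cosine(one_hot["car"], one_hot["automobile"])
dense_sim = cosine(dense["car"], dense["automobile"])
print(f"sparse: {sparse_sim:.2f}, dense: {dense_sim:.2f}")
```

With one-hot vectors the two synonyms are orthogonal, so their similarity is zero by construction; a dense representation at least has room to place them close together.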

Notes: the materials included in the following slides are based on chapter 6 of Jurafsky and Martin's book 'Speech and Language Processing'.

Examples of word-vectors and embeddings
====================================

[$\texttt{word2vec}$](https://code.google.com/archive/p/word2vec/) ― the pioneer

<img src="images/_1.png" width="15%">

[**Fasttext**](http://www.fasttext.cc/) ― it takes word parts into account

<img src="images/_2.png" width="6%">

[**GloVe**](http://nlp.stanford.edu/projects/glove/) ― it emphasizes co-occurrences of words in the whole text corpus

<img src="images/_3.png" width="22%">

Let's focus on $\texttt{word2vec}$: Background
==================================

<img src="images/_4.png" width="50%">

Let's focus on $\texttt{word2vec}$: Background
==================================

<img src="images/_5.png" width="50%">

$\texttt{word2vec}$: the basics
===================


**Key features**

+ in 2013, it drew a line between old-school and modern NLP
+ it doesn't require hand-labeled supervision
+ easy and quite fast to train (you can do that with Gensim)
+ it's open-source software (OSS)

**Philosophy**

+ it trains a classifier on a binary prediction task:
  - "is word $\omega$ likely to show up near word $\eta$?"
+ the classification task is 'instrumental' in nature:
  - the point is not predicting the 'next' word
  - the goal is adjusting word vectors 

$\texttt{word2vec}$ algorithm: Skip-Gram flavor
==================================

**!!!Boundary condition¡¡¡**

+ there are various flavors of $\texttt{word2vec}$
+ here, we focus on the Skip-Gram (SG) flavor

Skip-Gram algorithm
=================

1. treat the target word and a neighboring context word as positive examples
2. randomly sample other words in the lexicon to get negative samples
3. use logistic regression to train a classifier to distinguish those two cases
4. use the weights as the embeddings

SG training data
==============

Given the sentence:

| not in context | c1         | c2 | t       | c3  | c4 | not in context | 
| -------------- |------------|----|---------|-----|----|----------------|
| ... lemon, a   | tablespoon | of | apricot | jam | a  | pinch of ...   |

Let's assume the context words are those within a ±2-word window of the target $t$.

SG goal
=======
 
Given a tuple $(t,c) = (\text{target}, \text{context})$:

+ $\texttt{(apricot, jam)}$
+ $\texttt{(apricot, aardvark)}$

Return probability that $c$ is a real context word:

$P(+|t,c)$

$P(-|t,c)$

How to compute P(+|t,c)?
=====================

Intuition:

+ words are likely to appear near similar words
+ model similarity with dot-product!
+ similarity(t,c) $∝ t ∙ c$

Problem:

+ the dot product is not a probability: it ranges from $-\infty$ to $+\infty$
+ (neither is cosine, which ranges from $-1$ to $1$)

Turning dot product into a probability
===============================

\begin{equation}
P(+|t,c) = \frac{1}{1+e^{-t ∙ c}}
\end{equation}

\begin{equation}
P(-|t,c) = 1 - P(+|t,c)
= \frac{e^{-t ∙ c}}{1+e^{-t ∙ c}}
\end{equation}
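
The two equations above can be sketched in a few lines. The function names `p_positive` and `p_negative` and the toy vectors are assumptions made for illustration only.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def p_positive(t, c):
    """P(+|t,c): probability that c is a real context word of t."""
    return sigmoid(dot(t, c))

def p_negative(t, c):
    """P(-|t,c) = 1 - P(+|t,c)."""
    return 1.0 - p_positive(t, c)

# Hypothetical toy vectors (real embeddings have hundreds of dimensions).
t = [0.5, -0.2, 0.1]
c = [0.4, 0.1, 0.3]
p_pos = p_positive(t, c)
p_neg = p_negative(t, c)
print(p_pos, p_neg)
```

Because the logistic function is monotonic, a larger dot product always yields a larger $P(+|t,c)$, so the ranking given by similarity is preserved.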

For all context words
=================

\begin{equation}
P(+|t,c_{1:k}) = \prod_{i=1}^{k} \frac{1}{1+e^{-t ∙ c_{i}}}
\end{equation}
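
The product treats the $k$ context words as independent. In practice such a product is usually computed as a sum of logs, which avoids numerical underflow when $k$ is large. A minimal sketch, assuming toy vectors with hypothetical values and an illustrative helper name `log_p_positive_window`:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_p_positive_window(t, contexts):
    """log P(+|t, c_1..c_k) = sum_i log sigma(t . c_i).

    Uses log sigma(x) = -log(1 + exp(-x)).
    """
    return sum(-math.log1p(math.exp(-dot(t, c))) for c in contexts)

# Hypothetical target vector and k = 4 context vectors.
t = [0.5, -0.2, 0.1]
contexts = [
    [0.4, 0.1, 0.3],
    [-0.3, 0.2, 0.0],
    [0.1, 0.1, 0.1],
    [0.6, -0.5, 0.2],
]
log_p = log_p_positive_window(t, contexts)
p = math.exp(log_p)  # back to the probability in the equation above
print(log_p, p)
```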


SG training (1/2)
============== 

Given the sentence:

| not in context | c1         | c2 | t       | c3  | c4 | not in context | 
| -------------- |------------|----|---------|-----|----|----------------|
| ... lemon, a   | tablespoon | of | apricot | jam | a  | pinch of ...   |

we look for positive examples $+$:

| t       | c          |
|---------| -----------|
| apricot | tablespoon |
| apricot | of         |
| apricot | jam        |
| apricot | a          |
| apricot | ...        |
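
Collecting positive examples amounts to sliding a window over the token sequence. The sketch below assumes a ±2 window and a pre-tokenized version of the example sentence; the function name `positive_pairs` is hypothetical.

```python
def positive_pairs(tokens, window=2):
    """Return all (target, context) pairs within +/- `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Tokenized fragment of the example sentence (punctuation dropped).
tokens = ["lemon", "a", "tablespoon", "of", "apricot", "jam", "a", "pinch", "of"]

apricot_pairs = [p for p in positive_pairs(tokens) if p[0] == "apricot"]
for t, c in apricot_pairs:
    print(t, c)
```

Every token takes a turn as the target, so a full corpus pass produces positive pairs for the whole vocabulary, not only for 'apricot'.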

SG training (2/2)
============== 

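Given the list of positive context words $c$:

+ each positive $c$ is matched with a negative $c$ (more generally, with $k$ negatives per positive)
+ negatives are 'noise words': words sampled at random from the lexicon rather than observed in the context of $t$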
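
A possible way to draw noise words is sketched below. Two things are assumptions not stated on the slide: sampling is weighted by word frequency raised to the power 0.75 (a common choice for word2vec), and the word counts are hypothetical toy values. The helper name `sample_negatives` is illustrative.

```python
import random
from collections import Counter

def sample_negatives(target, context, counts, k=2, alpha=0.75, seed=0):
    """Draw k noise words for one positive (target, context) pair.

    Words are sampled with probability proportional to count ** alpha
    (assumed weighting), excluding the target and the true context word.
    """
    rng = random.Random(seed)
    candidates = [w for w in counts if w not in (target, context)]
    weights = [counts[w] ** alpha for w in candidates]
    return rng.choices(candidates, weights=weights, k=k)

# Hypothetical toy word counts.
counts = Counter({
    "apricot": 3, "jam": 5, "aardvark": 1,
    "my": 40, "where": 25, "coaxial": 2,
})

negatives = sample_negatives("apricot", "jam", counts, k=2)
print([("apricot", n) for n in negatives])
```

Setting `k=1` reproduces one-to-one matching of positives and negatives; raising the counts to a power below 1 gives rare words a slightly higher chance of being picked than their raw frequency would.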

Setup
=====

Let's represent words as vectors of some length (say 300), randomly initialized. 

+ we start with 300 * V random parameters
+ over the entire training set, we’d like to adjust those word vectors such that we:
  - maximize the similarity of the (target word, context word) pairs $(t,c)$ drawn from the positive data
  - minimize the similarity of the $(t,c)$ pairs drawn from the negative data
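
One way to read this setup is as stochastic gradient descent on a logistic loss: for one positive pair and its negatives, the loss is $-\log\sigma(c_{pos} ∙ t) - \sum_{i}\log\sigma(-c_{neg_i} ∙ t)$. The sketch below is a toy version under several assumptions: a vocabulary of 5 words, 4 dimensions instead of 300, separate target (`W`) and context (`C`) tables, and hypothetical word indices.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loss(W, C, t, pos, negs):
    """Negative log-likelihood of one positive pair and its negatives."""
    total = -math.log(sigmoid(dot(C[pos], W[t])))
    for n in negs:
        total -= math.log(sigmoid(-dot(C[n], W[t])))
    return total

def sgd_step(W, C, t, pos, negs, lr=0.1):
    """One gradient step; assumes the indices in `negs` are distinct."""
    w_t = list(W[t])  # snapshot of the target vector before updating
    grad_t = [0.0] * len(w_t)
    # Label 1 for the positive context word, 0 for each noise word.
    for idx, label in [(pos, 1.0)] + [(n, 0.0) for n in negs]:
        g = sigmoid(dot(C[idx], w_t)) - label
        for j in range(len(w_t)):
            grad_t[j] += g * C[idx][j]    # uses the old context value
            C[idx][j] -= lr * g * w_t[j]  # then updates the context vector
    for j in range(len(w_t)):
        W[t][j] -= lr * grad_t[j]

rng = random.Random(0)
V, d = 5, 4  # toy sizes
W = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(V)]
C = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(V)]

t, pos, negs = 0, 1, [2, 3]  # hypothetical word indices
before = loss(W, C, t, pos, negs)
for _ in range(100):
    sgd_step(W, C, t, pos, negs)
after = loss(W, C, t, pos, negs)
print(before, after)
```

Each step pulls the target vector toward its observed context vector and pushes it away from the noise vectors, which is the maximize/minimize behavior described in the list above.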

Learning $\texttt{word2vec}$ embeddings
=============================

<img src="images/_6.png" width="80%">

Source is Jurafsky and Martin (2019).

Interpreting embeddings: Caveat
===========================

Semantic similarities depend on window size $C$:

$C = ±2 \rightarrow$ the nearest words to Hogwarts are names of other fictional schools:
+ Sunnydale
+ Evernight

$C = ±5 \rightarrow$ the nearest words to Hogwarts are topically related to the Harry Potter series:
+ Dumbledore
+ Malfoy
+ halfblood

Embeddings capture relational meanings
=================================

```
vector('king') - vector('man') + vector('woman') ≈ vector('queen')
```

```
vector('Paris') - vector('France') + vector('Italy') ≈ vector('Rome')
```

<img src="images/_7.png" width="80%">
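
The analogy arithmetic can be sketched as a nearest-neighbor search. The two-dimensional vectors below are hypothetical, hand-made so that the relation holds exactly; with real embeddings it holds only approximately, and the input words are usually excluded from the candidates, as done here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors: dimension 0 ~ royalty, dimension 1 ~ gender.
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy("king", "man", "woman", vectors)
print(result)
```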

Visual inspection of embeddings (1/2)
===============================

<img src="images/_8.png" width="60%">

Visual inspection of embeddings (2/2)
===============================

<img src="images/_9.png" width="60%">

Embeddings: Fields of application
============================

+ business and economic analysis (examples are scant)
+ cultural analysis

Studying changes in meanings with Google Books data
==============================================

<img src="images/_10.png" width="60%">

Source: see various works by Dr. W. Hamilton (McGill U.)

<img src="images/_11.png" width="60%">

Using embeddings as a historical tool to study bias
==========================================

<img src="images/_12.png" width="60%">

Using embeddings as a historical tool to study bias
==========================================

The paper in a nutshell:

+ embeddings for competence adjectives are biased toward men
  - smart, wise, brilliant, intelligent, resourceful, thoughtful, logical, etc.
+ this bias is slowly decreasing over time