# Part 1: Unsupervised learning

© Anatolii Stehnii, 2018

In [63]:
%env LC_ALL=en_US.UTF-8
%env LANG=en_US.UTF-8

import numpy as np
from scipy import spatial
import nltk
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/anatolii.stehnii/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Numeric representation for words

Numerical data is naturally meaningful to a computer: numbers can be **compared**, and the **distance** between vectors can be measured.

In [64]:
a = [1, 5, 6]
b = [-2, -6, 1]
distance = spatial.distance.cosine(a, b)
distance

1.5156862774427124

But this cannot be done as easily for **words**:

In [65]:
a = ['weather', 'is', 'good']
b = ['sun', 'is', 'shining']
distance = None # ?

Words are **discrete symbols** and can be encoded as **one-hot vectors**; a sentence can then be encoded as a **bag of words**, the sum of its words' one-hot vectors:

In [66]:
vectors = {
    'weather': [1, 0, 0, 0, 0],
    'is': [0, 1, 0, 0, 0],
    'good': [0, 0, 1, 0, 0],
    'sun': [0, 0, 0, 1, 0],
    'shining': [0, 0, 0, 0, 1],
}
a = np.sum([vectors[w] for w in ['weather', 'is', 'good']], axis=0)
print('BoW vector for `weather is good`: {}'.format(a))
b = np.sum([vectors[w] for w in ['sun', 'is', 'shining']], axis=0)
print('BoW vector for `sun is shining`: {}'.format(b))

distance = spatial.distance.cosine(a, b)
distance

BoW vector for `weather is good`: [1 1 1 0 0]
BoW vector for `sun is shining`: [0 1 0 1 1]


0.66666666666666663

But for a realistic vocabulary of, say, 500,000 words, these vectors become 500,000-dimensional and extremely sparse. This approach also completely ignores word **meanings, relations, and similarity**.
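To make the last point concrete, here is a minimal sketch. The three-word vocabulary is hypothetical and chosen only for illustration, and `cosine_distance` is a small helper that computes the same quantity as `spatial.distance.cosine`. Every pair of distinct one-hot vectors is orthogonal, so `good` ends up exactly as far from `great` as from `shining`:

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cos(angle between u and v), the same quantity as spatial.distance.cosine
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ['good', 'great', 'shining']
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Cosine distance between every pair of distinct words.
pair_distances = {
    (u, v): cosine_distance(one_hot[u], one_hot[v])
    for i, u in enumerate(vocab)
    for v in vocab[i + 1:]
}
for pair, d in pair_distances.items():
    print(pair, d)
```

All three distances are identical, so the encoding carries no information about which words are related.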

## Meaning
Definition of **meaning**:
 1. the logical connotation of a word or phrase;
 2. what is intended to be, or actually is, expressed or indicated; signification;
 3. the thing that is conveyed especially by language.
 
The meaning of **natural language** is a complex problem. The same word can have different meanings in different contexts (**polysemy**). Different words can also have similar meanings (**synonymy**). Words with a broader meaning can cover more specific categories (**hypernymy and hyponymy**).

A simple solution for representing meaning is to manually build a graph of relations between words, as **WordNet** does.

In [67]:
from nltk.corpus import wordnet as wn
def synset_to_str(synset):
    return '({}) {}'.format(synset.pos(), ', '.join(map(str, synset.lemma_names())))

In [68]:
# synonyms
for synset in wn.synsets('evil'):
    print(synset_to_str(synset))

(n) evil, immorality, wickedness, iniquity
(n) evil
(n) evil, evilness
(a) evil
(s) evil, vicious
(s) malefic, malevolent, malign, evil


In [69]:
# hypernyms
hypernyms = lambda s: s.hypernyms()
cat = wn.synset('cat.n.01')
for synset in list(cat.closure(hypernyms)):
    print(synset_to_str(synset))

(n) feline, felid
(n) carnivore
(n) placental, placental_mammal, eutherian, eutherian_mammal
(n) mammal, mammalian
(n) vertebrate, craniate
(n) chordate
(n) animal, animate_being, beast, brute, creature, fauna
(n) organism, being
(n) living_thing, animate_thing
(n) whole, unit
(n) object, physical_object
(n) physical_entity
(n) entity


In [70]:
# hyponyms
hyponyms = lambda s: s.hyponyms()
for synset in list(cat.closure(hyponyms)):
    print(synset_to_str(synset))

(n) domestic_cat, house_cat, Felis_domesticus, Felis_catus
(n) wildcat
(n) Abyssinian, Abyssinian_cat
(n) alley_cat
(n) Angora, Angora_cat
(n) Burmese_cat
(n) Egyptian_cat
(n) kitty, kitty-cat, puss, pussy, pussycat
(n) Maltese, Maltese_cat
(n) Manx, Manx_cat
(n) mouser
(n) Persian_cat
(n) Siamese_cat, Siamese
(n) tabby, tabby_cat
(n) tabby, queen
(n) tiger_cat
(n) tom, tomcat
(n) tortoiseshell, tortoiseshell-cat, calico_cat
(n) cougar, puma, catamount, mountain_lion, painter, panther, Felis_concolor
(n) European_wildcat, catamountain, Felis_silvestris
(n) jaguarundi, jaguarundi_cat, jaguarondi, eyra, Felis_yagouaroundi
(n) jungle_cat, Felis_chaus
(n) kaffir_cat, caffer_cat, Felis_ocreata
(n) leopard_cat, Felis_bengalensis
(n) lynx, catamount
(n) manul, Pallas's_cat, Felis_manul
(n) margay, margay_cat, Felis_wiedi
(n) ocelot, panther_cat, Felis_pardalis
(n) sand_cat
(n) serval, Felis_serval
(n) tiger_cat, Felis_tigrina
(n) blue_point_Siamese
(n) gib
(n) bobcat, bay_lynx, Lynx_rufus
(n)

In [71]:
# distance - shortest path in the hypernym/hyponym graph between two words
dog = wn.synset('dog.n.01')
dog.shortest_path_distance(cat)

4

Other WordNet features: http://www.nltk.org/howto/wordnet.html

WordNet is interesting as a word catalogue, but it has some **major issues**:
1. the full **complexity of language** cannot be inferred from synonymy, hyponymy, and hypernymy relations alone;
2. WordNet misses new words, requires **manual labor** to update, and is prone to **human error**;
3. word similarity derived from path length is **not accurate**.

### Solution 

Map words into an n-dimensional space in which relations between words are encoded in the words' **spatial positions**.

### Core idea

A word’s meaning is given by the words that frequently appear close by.

*"You shall know a word by the company it keeps" (John Rupert Firth, 1957:11)*

**Distributional hypothesis**: linguistic items with similar distributions have similar meanings. https://en.wikipedia.org/wiki/Distributional_semantics

To find a representation of a word $w$, we need to use its **context**: the set of words that occur near it within some window (for example, a paragraph).
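As a minimal sketch of what "context" means in practice, the snippet below collects, for each word, the words that appear within a fixed number of positions around it. The helper name `context_windows`, the window size of 2, and the example sentence are assumptions made for illustration:

```python
from collections import defaultdict

def context_windows(tokens, window=2):
    """Map each word to the list of words seen within `window` positions of it."""
    contexts = defaultdict(list)
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        contexts[word].extend(left + right)
    return dict(contexts)

# Hypothetical example sentence.
tokens = 'the weather is good and the sun is shining'.split()
contexts = context_windows(tokens, window=2)
print(contexts['sun'])
```

A word that occurs several times (here `is`) accumulates context words from all of its occurrences, and that accumulated list is what a distributional representation is built from.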

## Latent semantic analysis

You can form a term-document matrix, with raw counts (BoW) or TF-IDF weights as entries, to describe a word's context; it is a sparse matrix $\textbf{X}$ in which terms are rows and documents are columns:

$$
\begin{matrix} 
 & \textbf{d}_j \\
 & \downarrow \\
\textbf{t}_i^T \rightarrow &
\begin{bmatrix} 
x_{1,1} & \dots & x_{1,j} & \dots & x_{1,n} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{i,1} & \dots & x_{i,j} &  \dots & x_{i,n} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{m,1} & \dots & x_{m,j} & \dots & x_{m,n} \\
\end{bmatrix}
\end{matrix}
$$
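A minimal sketch of building such a matrix with raw counts, assuming whitespace tokenization and a hypothetical three-document corpus (a real pipeline would typically use a sparse matrix and TF-IDF weighting):

```python
import numpy as np

# Hypothetical toy corpus; each string is one document.
docs = [
    'weather is good',
    'sun is shining',
    'good weather sun is shining',
]
tokenized = [d.split() for d in docs]
terms = sorted({w for doc in tokenized for w in doc})
term_index = {t: i for i, t in enumerate(terms)}

# X[i, j] = number of times term i occurs in document j.
X = np.zeros((len(terms), len(docs)), dtype=int)
for j, doc in enumerate(tokenized):
    for w in doc:
        X[term_index[w], j] += 1

print(terms)
print(X)
```

Each row of `X` is the (still sparse, high-dimensional) description of one term in terms of the documents it occurs in.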

After the matrix is formed, any **rank-lowering method** (PCA, SVD) can be used to reduce dimensionality and extract the main components for terms (or for documents, if your task is to project documents into a linear space):

$$
\textbf{X} = \textbf{U} \textbf{S} \textbf{V}^T
$$

$$
\begin{matrix} 
 & \textbf{X} & & & \textbf{U} & & \textbf{S} & & \textbf{V}^T \\
 & (\textbf{d}_j) & & & & & & & (\hat{\textbf{d}}_j) \\
 & \downarrow & & & & & & & \downarrow \\
(\textbf{t}_i^T) \rightarrow 
&
\begin{bmatrix}
x_{1,1} & \dots & x_{1,j} & \dots & x_{1,n} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{i,1} & \dots & x_{i,j} &  \dots & x_{i,n} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{m,1} & \dots & x_{m,j} & \dots & x_{m,n} \\
\end{bmatrix}
&
=
&
(\hat{\textbf{t}}_i^T) \rightarrow
&
\begin{bmatrix} 
\begin{bmatrix} \, \\ \, \\ \textbf{u}_1 \\ \, \\ \,\end{bmatrix} 
\dots
\begin{bmatrix} \, \\ \, \\ \textbf{u}_l \\ \, \\ \, \end{bmatrix}
\end{bmatrix}
&
\cdot
&
\begin{bmatrix} 
s_1 & \dots & 0 \\
\vdots & \ddots & \vdots \\
0 & \dots & s_l \\
\end{bmatrix}
&
\cdot
&
\begin{bmatrix} 
\begin{bmatrix} & & \textbf{v}_1 & & \end{bmatrix} \\
\vdots \\
\begin{bmatrix} & & \textbf{v}_l & & \end{bmatrix}
\end{bmatrix}
\end{matrix}
$$

The components of this decomposition are sorted by the amount of data variance they explain; by keeping the first $l$ components, we obtain a dense representation of each word vector. The share of variance explained is:

$$
R^2 = \frac{\sum_{i=1}^{l} s_i^2}{\sum_{j=1}^{\min(m,n)} s_j^2}
$$
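The decomposition and the ratio above can be sketched with `numpy.linalg.svd`. The small count matrix and the choice $l = 2$ are assumptions made for illustration. Note that the ratio is a share of variance in the strict sense only if the data is mean-centered; for a raw count matrix it is the share of the squared Frobenius norm retained:

```python
import numpy as np

# Hypothetical small term-document count matrix (5 terms x 3 documents).
X = np.array([
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
    [0, 1, 1],
    [1, 0, 1],
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

l = 2  # number of components to keep
term_vectors = U[:, :l] * s[:l]     # one dense l-dimensional row per term
doc_vectors = Vt[:l, :].T * s[:l]   # one dense l-dimensional row per document

# Share of the total sum of squared singular values kept by the first l components.
explained = np.sum(s[:l] ** 2) / np.sum(s ** 2)
print(term_vectors.shape, doc_vectors.shape, explained)
```

Terms with identical rows in `X` receive identical dense vectors, and terms that co-occur in the same documents end up close to each other in the reduced space.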

### Problems

1. SVD is a linear transformation; therefore, it cannot capture non-linear relations between words.
2. It is not as efficient as state-of-the-art methods (neural networks, for example).