# Lexical Semantics

Representing the __meaning of a word__

* Ex: "cat" is similar to "dog"

## Words and Vectors
### Word and Vector Similarity 
Most words don't have exact synonyms, but they do have many similar or related words
* Glad, happy, joyous, ecstatic

Idea: Project word vector onto some space where similar words have a smaller distance between them than unrelated words

Dot product $u \cdot v = \|u\| \, \|v\| \cos\theta$:
* > 0: Angle is smaller than 90 degrees. For vectors of equal length, the higher the dot product, the smaller the angle between them and the more similar they are.
* = 0: Orthogonal vectors
* < 0: Angle is larger than 90 degrees, words are likely very dissimilar.

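Ex: cosine similarity, the dot product normalized by the vector lengths, $\cos\theta = \frac{u \cdot v}{\|u\| \, \|v\|}$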
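A minimal sketch of the dot product and cosine similarity in plain Python. The vectors are made-up toy values, not real embeddings, and the function names are only illustrative.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot(u, v) / (norm_u * norm_v)

# Toy vectors (assumed values, for illustration only)
cat = [2.0, 1.0, 0.0]
dog = [4.0, 2.0, 0.0]   # same direction as cat, twice the length
car = [0.0, 0.0, 3.0]   # orthogonal to cat

print(dot(cat, dog))                # depends on vector length
print(cosine_similarity(cat, dog))  # close to 1: same direction
print(cosine_similarity(cat, car))  # 0: orthogonal
```

Unlike the raw dot product, the cosine does not change when a vector is scaled, which is why it is usually preferred for comparing word vectors of different lengths.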

#### Term-Document Matrix
* Columns: Documents, ex books, articles
* Rows: Words
* $M_{ij}$: Word count for $w_i$ in document $d_j$

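This allows us to represent each document as a vector of word counts (a column of $M$). The similarity between two documents is then computed from their count vectors, ex with cosine similarity. Likewise, each row represents a word by the documents it appears in.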
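A small sketch of building a term-document matrix and comparing documents by their columns. The three documents are invented for illustration, and whitespace tokenization is a simplifying assumption.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration)
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "stocks fell on monday",
}

doc_ids = list(docs)
counts = {d: Counter(text.split()) for d, text in docs.items()}
vocab = sorted({w for c in counts.values() for w in c})

# M[i][j] = count of word vocab[i] in document doc_ids[j]
M = [[counts[d][w] for d in doc_ids] for w in vocab]

def column(j):
    """Document vector: column j of the term-document matrix."""
    return [row[j] for row in M]

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

sim_d1_d2 = cosine(column(0), column(1))
sim_d1_d3 = cosine(column(0), column(2))
print(sim_d1_d2, sim_d1_d3)  # d1 is closer to d2 than to d3
```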

#### Co-occurrence
"You shall know a word by the company it keeps" (J. R. Firth)

Look at the co-occurrence of words in shared contexts

Often you can tell what an unknown word means, just by looking at the contexts it appears in.

```
A bottle of X is on the table
Everybody likes X
X makes you drunk
We make X out of corn
```

We can infer that X is some kind of alcoholic drink, similar to beer.


#### Association by Co-occurrence
Syntagmatic association: Words which occur in the same context
* Wrote ~ {author, book, down, pen...}
* First order co-occurrence, since the words actually appear together

Paradigmatic association: Words which can be substituted for each other
* Monday ~ {Tuesday, Wednesday...}
* Man ~ {Guy, Dude, Boy}
* Second order co-occurrence: the words mean almost the same thing and therefore rarely appear together in one context, but the contexts they occur in are similar!

#### Co-occurrence Matrix
$N \in \mathbb{N}_0^{m\times m}$, where $m$ is the vocabulary size
* Counts, many entries 0 (sparse)
* $n_{ij}$ is the number of contexts of $w_i$ in which $w_j$ occurs.
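A sketch of how such counts could be collected. It assumes that the context of a word is a symmetric window of `window` tokens on each side within a sentence. That is one common choice, not the only one, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=2):
    """counts[w_i][w_j] = how often w_j appears within `window` tokens of w_i."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

# Toy sentences (made up for illustration)
sentences = [
    "a bottle of beer is on the table",
    "a bottle of wine is on the table",
]
counts = cooccurrence_counts(sentences)

print(counts["bottle"]["of"])   # first-order: the words appear together
print(counts["beer"]["wine"])   # never appear together
print(counts["beer"] == counts["wine"])  # but their context counts match
```

Only the nonzero entries are stored, which suits the sparsity of the matrix. In this toy corpus "beer" and "wine" never co-occur, yet their rows are identical, which is the second-order (paradigmatic) pattern.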

Raw counts are not a good measure of association
* Stop words co-occur with all other words

Alternative: Pointwise Mutual Information (PMI)

_How often two events $x$ and $y$ occur compared to what we expect if they were independent._

$$\text{PMI}(x,y) = \log_2 \frac{P(x,y)}{P(x)P(y)}$$
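A small worked sketch of the PMI formula. It assumes the probabilities are estimated as relative frequencies from counts; the function name and the numbers are invented for illustration.

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """PMI in bits, with probabilities estimated as count / n_total."""
    if n_xy == 0:
        return float("-inf")  # log of zero: the pair was never observed
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))

# x and y each occur in 100 of 1000 contexts.
independent = pmi(n_xy=10, n_x=100, n_y=100, n_total=1000)  # as often as chance predicts
associated = pmi(n_xy=50, n_x=100, n_y=100, n_total=1000)   # 5 times more often than chance
print(independent, associated)
```

A PMI near 0 means the pair co-occurs about as often as independence predicts, while positive values indicate association. A stop word that co-occurs with everything has a large $P(y)$ in the denominator, so its PMI stays low even when the raw count is high.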

### Word2Vec
Represent each word with a fixed-length, dense (non-sparse) vector that condenses the co-occurrence information


#### Skip-Gram Model with Negative Sampling
How do we come up with a representation? 

Instead of counting how often each word $w$ occurs near $w'$, train a classifier on the following binary prediction task:
* Is word $w$ likely to show up near $w'$? 

Concretely:
* Treat target word and neighbouring context as positive examples (y=1)
* Randomly sample other words in vocab to get negative samples (y=0)
* Use logistic regression to train classifier - learn to distinguish between positive and negative
* Use the learned weights as word embeddings

So, rather than using the logistic regression model itself, we use the learned weights of the classifier - these are the word embeddings.
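The steps above can be sketched as a tiny logistic regression trained with stochastic gradient descent. This is a simplified illustration, not a reference implementation: the labeled examples are hand-written toy data, and the names `train_sgns`, `W` and `C` are hypothetical. It assumes the score of a pair is the dot product of a target vector and a context vector, with separate vectors for each role.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_sgns(examples, vocab, dim=8, lr=0.1, epochs=300, seed=0):
    """examples: (target, context, y) triples with y=1 positive, y=0 negative."""
    rng = random.Random(seed)
    W = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}  # target vectors
    C = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}  # context vectors
    for _ in range(epochs):
        for target, context, y in examples:
            w, c = W[target], C[context]
            p = sigmoid(sum(a * b for a, b in zip(w, c)))
            g = p - y  # gradient of the log loss with respect to the score
            for k in range(dim):
                w_k = w[k]  # keep the old value for the context update
                w[k] -= lr * g * c[k]
                c[k] -= lr * g * w_k
    return W, C

# Hand-written toy examples (assumed, for illustration only)
examples = [
    ("cat", "purrs", 1), ("cat", "barks", 0),
    ("dog", "barks", 1), ("dog", "purrs", 0),
]
vocab = ["cat", "dog", "purrs", "barks"]
W, C = train_sgns(examples, vocab)

def prob(target, context):
    """Classifier output: probability that context appears near target."""
    return sigmoid(sum(a * b for a, b in zip(W[target], C[context])))

print(prob("cat", "purrs"), prob("cat", "barks"))
embedding_cat = W["cat"]  # the learned weights are the word embedding
```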

Context window example: the two previous words and the next two words, i.e.
* $w_1 \ w_2 \ \mathbf{w_3} \ w_4 \ w_5$
* Context is $w_1 \ w_2$ and $w_4 \ w_5$
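A sketch of turning a token sequence into labeled examples with this window. It assumes negatives are drawn uniformly from the vocabulary, which is a simplification, and it does not filter out negatives that happen to be true context words. The function name and the parameter `k` (negatives per positive) are hypothetical.

```python
import random

def skipgram_examples(tokens, vocab, window=2, k=2, seed=0):
    """Return (target, context, y) triples: y=1 for window pairs, y=0 for sampled words."""
    rng = random.Random(seed)
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            examples.append((target, tokens[j], 1))  # positive example
            for _ in range(k):
                # negative example: a random vocabulary word (uniform, an assumption)
                examples.append((target, rng.choice(vocab), 0))
    return examples

tokens = ["w1", "w2", "w3", "w4", "w5"]
examples = skipgram_examples(tokens, vocab=tokens)

# Positive contexts of the middle word w3
print([c for t, c, y in examples if t == "w3" and y == 1])
```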


### GloVe
Alternative to negative sampling

Idea: Weighted least squares fit for log-counts

Introduce dampening function $f$:
$$f(n) = \min \left(1, \left(\frac{n}{n_{max}}\right)^\alpha\right), \quad \alpha \in [0,1]$$

Often $\alpha = 3/4$

This function limits the influence of very frequent co-occurrences: the weight grows with the count $n$ and is clipped at 1 once $n \geq n_{max}$. Rare co-occurrences get a weight close to 0.
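A direct sketch of the dampening function. The default `n_max=100` is only an illustrative choice, not something fixed by the definition above.

```python
def dampening(n, n_max=100, alpha=0.75):
    """Weight f(n): grows as (n / n_max) ** alpha and is clipped at 1."""
    return min(1.0, (n / n_max) ** alpha)

for n in [0, 1, 10, 100, 1000]:
    print(n, dampening(n))
```

Counts at or above `n_max` all receive the same weight of 1, so a pair seen 1000 times counts no more than one seen 100 times.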



$$H = \sum_{i,j} f(n_{ij}) \left(\log n_{ij} - \log \tilde{p}_\theta (w_j | w_i)\right)^2$$

* Target: $\log n_{ij}$
* Model: $\log \tilde{p}_\theta (w_j | w_i)$
* Weighting function: $f(n_{ij})$

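This is equivalent to a (weighted) matrix decomposition problem: the matrix of log-counts is approximated by a low-rank product of embedding matrices.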
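The objective $H$ can be sketched as follows. This assumes the model term is parameterized as a dot product of a word vector and a context vector, $\log \tilde{p}_\theta(w_j | w_i) = x_i \cdot y_j$, without bias terms. That parameterization, the names `glove_loss`, `X` and `Y`, and the toy counts are all assumptions for illustration.

```python
import math

def glove_loss(counts, X, Y, n_max=100, alpha=0.75):
    """Weighted least squares loss over observed co-occurrence counts.

    counts: {(i, j): n_ij}; X[i] is the vector of word i, Y[j] of context j.
    """
    total = 0.0
    for (i, j), n_ij in counts.items():
        if n_ij <= 0:
            continue  # f(0) = 0, so unobserved pairs contribute nothing
        weight = min(1.0, (n_ij / n_max) ** alpha)            # f(n_ij)
        log_model = sum(a * b for a, b in zip(X[i], Y[j]))    # assumed model: x_i . y_j
        total += weight * (math.log(n_ij) - log_model) ** 2
    return total

# Toy example: a single observed pair with count 100
counts = {(0, 1): 100}
X = {0: [1.0, 0.0]}
Y_perfect = {1: [math.log(100), 0.0]}  # x_0 . y_1 equals log n_01
Y_zero = {1: [0.0, 0.0]}               # model predicts log-count 0

print(glove_loss(counts, X, Y_perfect))  # perfect fit
print(glove_loss(counts, X, Y_zero))     # squared error of log(100), weight 1
```

Because $f(0) = 0$, only the nonzero entries of the count matrix need to be visited, which keeps the sum tractable for a sparse matrix.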