# Background

## Text Semantics

### Terminology

**Semantics**: concerned with the meaning of texts
1. **Lexical Semantics**: focuses on the meaning of individual words
2. **Compositional Semantics**: focuses on how meaning depends on the words and on how they are combined

**Syntax**: 

**Paradigmatic Similarity**:

### Approaches
**Propositional or formal semantics**: Block of text is converted into a formula in logical language
- "dog bites man" $\rightarrow$ bites(dog, man)
- bites($\cdot$, $\cdot$) is a binary relation
- A probability can be attached to a relation

#### Pros:
- Allows for logical inferences
- Good for structured language domains (where causal structure is prevalent)

**Vector Representation**: Texts are embedded into a high-dimensional space
- vec("dog bites man") = $(0.2, -0.3, 1.5, \cdots ) \in \mathbb{R}^n$

#### Pros:

1. Allows measuring word similarity

#### Cons:
1. Hard to compose semantic logic.
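
As a rough illustration of measuring similarity in an embedding space, here is a minimal sketch using cosine similarity. The three-dimensional vectors are hypothetical values chosen by hand, not learned embeddings, and cosine similarity is only one common choice of measure.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings, picked by hand for illustration only.
vec = {
    "dog": np.array([0.9, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.1, 0.9]),
}

sim_dog_puppy = cosine_similarity(vec["dog"], vec["puppy"])
sim_dog_car = cosine_similarity(vec["dog"], vec["car"])

# With these toy vectors, "dog" is closer to "puppy" than to "car".
print(sim_dog_puppy > sim_dog_car)
```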


## Vector Embedding of Words

### Latent Semantic Analysis

Given a set of documents, referred to as a **corpus**, we construct a matrix $T \in \mathbb{Z}^{N \times M}$ where

$T_{ij}$ is the count of word $j$ in document $i.$
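
A minimal sketch of building $T$ for a tiny corpus. The corpus, the whitespace tokenization, and the alphabetical vocabulary ordering are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy corpus; splitting on whitespace is an assumption.
corpus = [
    "dog bites man",
    "man bites dog",
    "dog eats food",
]

# Vocabulary: one column per distinct word, in alphabetical order.
vocab = sorted({word for doc in corpus for word in doc.split()})
word_index = {word: j for j, word in enumerate(vocab)}

N, M = len(corpus), len(vocab)
T = np.zeros((N, M), dtype=int)
for i, doc in enumerate(corpus):
    for word in doc.split():
        T[i, word_index[word]] += 1  # T[i, j] = count of word j in document i

print(vocab)
print(T)
```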

1. We first compute an L2 (least-squares) approximation of $T$.
- $T \approx U^T V$ 

![](lsa1.png)
or
- $T \approx U^T S V$
![](lsa2.png)

#### Questions:
1. How does one interpret $V$, $S$, $U$?
2. What is the difference between $U^T V$ and $U^T S V$?

$v = Vt$ is an embedding of the document in the latent space, where $t$ is the document's word-count vector (a row of $T$).

$t' = V^Tv = V^TVt$ is the decoding of the document from its embedding.

3. Why is $V^T$ used to reconstruct from the embedded space?

An SVD (Singular Value Decomposition) factorization gives the best possible reconstructions of the documents $t'$ from their embeddings.
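
The encode/decode view can be checked numerically. This is a sketch under assumptions: the matrix is random toy data, the latent dimension `K = 2` is arbitrary, and the mapping from NumPy's SVD factors to the notes' $U$, $S$, $V$ (with $V$ of shape $K \times M$ so that $v = Vt$ is well defined) is my reading of the notation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical count matrix: N = 6 documents, M = 8 words.
T = rng.integers(0, 5, size=(6, 8)).astype(float)
K = 2  # latent dimension (arbitrary choice)

# NumPy returns T = A @ diag(s) @ Bt with singular values sorted descending.
A, s, Bt = np.linalg.svd(T, full_matrices=False)
U = A[:, :K].T      # K x N, so U.T is N x K
S = np.diag(s[:K])  # K x K
V = Bt[:K, :]       # K x M, rows are orthonormal

T_approx = U.T @ S @ V  # rank-K approximation of T

t = T[0]             # word-count vector of the first document
v = V @ t            # embedding in the latent space
t_prime = V.T @ v    # decoding back to word space

# Because the rows of V are orthonormal, V.T acts as the decoder, and
# decoding every document reproduces the rank-K approximation.
print(np.allclose(V @ V.T, np.eye(K)))
print(np.allclose(t_prime, T_approx[0]))
```

The squared Frobenius error of `T_approx` equals the sum of the squared discarded singular values, which is the sense in which the truncated SVD is the best rank-$K$ reconstruction.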

### Autoencoder in LSA Perspective

![autoencode](autoencode.png)

![](lsadeepnetwork.png)

### t-SNE

## Word2Vec


### Skipgrams
![](skip-gram.png)

### Continuous Bag of Words
![](cbow.png)


### Example

![](w2v-ex.png)

**In this case**: context window $= 1$

### CBOW Input - $(\cdots \text{("the", "quick"), ("brown", "quick")} \cdots)$
### Skipgram Input - $(\cdots \text{("quick", "the"), ("quick", "brown")} \cdots)$
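
Both pair formats (CBOW and Skipgram) can be generated with a sliding window. The sketch below assumes the example sentence begins "the quick brown fox" and follows the pair ordering used in these notes: (context, center) for CBOW and (center, context) for Skipgram. The function name `training_pairs` is hypothetical.

```python
def training_pairs(tokens, window=1):
    """Return (skipgram_pairs, cbow_pairs) for a list of tokens.

    Skipgram pairs are (center, context); CBOW pairs are (context, center),
    matching the ordering used in these notes.
    """
    skipgram, cbow = [], []
    for pos, center in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos == pos:
                continue  # a word is not its own context
            skipgram.append((center, tokens[ctx_pos]))
            cbow.append((tokens[ctx_pos], center))
    return skipgram, cbow

tokens = "the quick brown fox".split()  # assumed example sentence
skipgram_pairs, cbow_pairs = training_pairs(tokens, window=1)
print(skipgram_pairs)
print(cbow_pairs)
```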

### Apply Word2Vec in a Network

![](w2v-network.png)

To apply **CBOW** and **Skipgram** in a neural network, we can represent $x$ and $y$ as one-hot vectors. If $x$ and $y$ are one-hot vectors, a softmax output layer makes sense because it outputs the predicted probability that each candidate word is $y$.

In other cases, we may need to create embeddings for $x$ and $y$. In that case, there are two embedding matrices: $V$ and $U$.

- $u$ is the output embedding vector
- $v$ is the input embedding vector

The softmax function for this network (in the skipgram case) is defined as:

$$p(j \ | \ i) = \frac{\exp(u_j^Tv_i)}{\sum_{k=1}^V \exp(u_k^Tv_i)}$$

- $j$ is the output word
- $i$ is the input word
- $k$ ranges over all words in the vocabulary (the upper limit $V$ of the sum is the vocabulary size)
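
A minimal sketch of evaluating $p(j \mid i)$ for every output word $j$ at once. The embedding matrices are random placeholders with hypothetical sizes, and the names `V_in`, `U_out`, and `skipgram_softmax` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3  # hypothetical sizes

V_in = rng.normal(size=(vocab_size, dim))   # row i is the input embedding v_i
U_out = rng.normal(size=(vocab_size, dim))  # row j is the output embedding u_j

def skipgram_softmax(i):
    """Return the vector of probabilities p(j | i) over all output words j."""
    scores = U_out @ V_in[i]      # u_k^T v_i for every k in the vocabulary
    scores = scores - scores.max()  # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

p = skipgram_softmax(2)
print(p, p.sum())
```

The denominator touches every word in the vocabulary, which is why normalizing the softmax becomes expensive for large vocabularies.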

What would the softmax function for CBOW be?


### Composition

“Linguistic Regularities in Continuous Space Word Representations”

![](composition.png)
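
The vector-offset idea behind such regularities can be sketched with toy data. The embeddings below are hypothetical and constructed by hand so that the offset holds exactly; real learned embeddings satisfy it only approximately. The function name `analogy` is illustrative.

```python
import numpy as np

# Hypothetical hand-built embeddings; not learned from data.
emb = {
    "man": np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king": np.array([3.0, 0.0, 0.4]),
    "queen": np.array([3.0, 1.0, 0.4]),
    "apple": np.array([-1.0, 0.1, 2.0]),
}

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue  # exclude the query words themselves
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

answer = analogy("man", "woman", "king")
print(answer)
```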

### Pros of Word2Vec

Local context allows us to capture more information about relations.

![](relation.png)

### Cons of Word2Vec
- Cross-entropy loss may put emphasis on a small number of word/context combinations (Why?)
- Very expensive to normalize the softmax over all words
- Uses heuristic down-weighting of frequent words

## GloVe

### Co-occurrence Matrices

To combine global context and local context information, **GloVe** uses **Co-occurrence Matrices**.

$C_{ij}$ counts the number of times words $i$ and $j$ occur together within a context window of a given **window size**.
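
A minimal sketch of window-based co-occurrence counting. It assumes a symmetric window, whitespace tokenization, and unweighted counts; the toy corpus is hypothetical.

```python
import numpy as np

corpus = ["dog bites man", "man bites dog"]  # hypothetical toy corpus
window = 1                                   # assumed symmetric window size

vocab = sorted({w for doc in corpus for w in doc.split()})
index = {w: k for k, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)), dtype=int)

for doc in corpus:
    tokens = doc.split()
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                # Count each (word, neighbour) pair inside the window.
                C[index[word], index[tokens[ctx_pos]]] += 1

print(vocab)
print(C)
```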

GloVe minimizes:

$$J(\theta) = \sum_{i,j=1}^V f(C_{ij})(u_i^Tv_j + b_i + \tilde{b_j} - \log C_{ij})^2$$

$f$ is a function that satisfies:
1. $f(0) = 0$
2. $f(x)$ is non-decreasing
3. $f(x)$ "saturates" -- not too large for large $x$

**Example** of $f$ is:

$$f(x) = \begin{cases}
          (x/x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
          1                     & \text{otherwise}
         \end{cases}$$
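
The weighting function and the objective can be written out directly. This is a sketch, not a training loop: the defaults `x_max=100` and `alpha=0.75` are assumptions (values commonly quoted for GloVe), and the function names `weight` and `glove_loss` are illustrative.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): zero at 0, non-decreasing, and capped at 1."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(C, U, V, b, b_tilde):
    """Weighted least-squares objective J for co-occurrence matrix C.

    Row i of U is u_i and row j of V is v_j. Entries with C_ij = 0 get
    weight f(0) = 0, so log(0) is never evaluated for them.
    """
    mask = C > 0
    log_C = np.zeros_like(C, dtype=float)
    log_C[mask] = np.log(C[mask])
    residual = U @ V.T + b[:, None] + b_tilde[None, :] - log_C
    return float(np.sum(weight(C) * residual ** 2))

rng = np.random.default_rng(0)
C = np.array([[0, 2], [2, 5]])           # hypothetical co-occurrence counts
U = rng.normal(size=(2, 3))
V = rng.normal(size=(2, 3))
b, b_tilde = np.zeros(2), np.zeros(2)
loss = glove_loss(C, U, V, b, b_tilde)
print(loss)
```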

## Skip-Thought Vector

**Skip-thought** vectors use RNNs to predict the next and previous sentences.

An **RNN** is applied to each word in turn, and the output state vector at the boundary (usually the end of a phrase that is treated as a unit) is used as the embedding.

Once the embedding has been trained, the recurrent network can be used as an encoder.
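
To make the "final state as embedding" idea concrete, here is a minimal sketch of a recurrent encoder. It uses a plain tanh RNN with random, untrained weights as a simplifying assumption; the sizes, the weight names, and the `encode` function are hypothetical, and the decoders that predict neighbouring sentences are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 3  # hypothetical sizes

# Untrained placeholder weights for a plain tanh RNN (an assumption).
W_x = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
bias = np.zeros(hidden_dim)

def encode(word_vectors):
    """Run the RNN over the words and return the state at the boundary."""
    h = np.zeros(hidden_dim)
    for x in word_vectors:
        h = np.tanh(W_x @ x + W_h @ h + bias)
    return h  # final state, used as the embedding of the whole unit

sentence = rng.normal(size=(5, embed_dim))  # 5 hypothetical word vectors
embedding = encode(sentence)
print(embedding)
```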

**How do we encode larger units of text?**

**Why don't we need backpropagation to train the RNN?**

**Why can we represent the encoder as a (truly) recurrent network?**

## TODO: READ Skip Thought paper

## Evaluation

1. SICK semantic relatedness scores

## Siamese Network for Semantic Relatedness


![](siamese.png)

Given a pair of sentences $a$ and $b$ with similarity label $y$, and parameters shared between the two networks, we train the network while augmenting the data by substituting random words with replacements drawn from WordNet.
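
The weight sharing can be sketched as one encoder applied to both sentences. Everything specific here is an assumption for illustration: the encoder (mean word vector followed by a tanh layer), the similarity $\exp(-\lVert h_a - h_b \rVert_1)$, the squared-error loss, and the random untrained weights. The WordNet augmentation step is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 3  # hypothetical sizes

# One weight matrix shared by both branches of the network.
W = rng.normal(scale=0.5, size=(hidden_dim, embed_dim))

def encode(word_vectors):
    """Shared encoder: mean word vector, then a tanh layer (an assumption)."""
    return np.tanh(W @ word_vectors.mean(axis=0))

def relatedness(a, b):
    """Similarity in (0, 1]: exp of the negative L1 distance (an assumption)."""
    return float(np.exp(-np.abs(encode(a) - encode(b)).sum()))

def squared_error(a, b, y):
    """Loss for one labelled pair, with y scaled to [0, 1]."""
    return (relatedness(a, b) - y) ** 2

a = rng.normal(size=(5, embed_dim))  # hypothetical sentence of 5 word vectors
b = rng.normal(size=(7, embed_dim))  # hypothetical sentence of 7 word vectors
print(relatedness(a, b), squared_error(a, b, y=0.8))
```

Because both sentences pass through the same `encode`, identical inputs always receive a relatedness of 1.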