## Motivation

* Goal: Understanding the mappings between constituent and compound embeddings
* Why does a linear projection work so well?
  * Linear projection: 80K parameters, Test acc: .39
  * Best DL model: 7.2M parameters, Test acc: .51
* What has the linear projection captured in this task?
* How could we model the different meanings of the constituents? (e.g. 土 tǔ "land, clay" in 土石 tǔ shí "earth and stones" vs. 土狗 tǔ gǒu "native dogs")

### 2-Char (conv-based/linear/fully-connected layers)    

|             | Train acc | Test acc |
|------------|--------:|--------:|
| conv-based | .78     |   .46   |
| linear     | .40     |   .39   |
| emb-tuning (init with tencent)  | .83     |   .51   |



### Underlying function
The x- and y-axes can each be considered a one-dimensional constituent embedding. For example, "土", "著", "石" would be at three different locations along the x-axis. The warped grid plays the role of the word embeddings. In this example, the relationship between the compound embedding ($x$, $y$ space) and the constituent embeddings ($u$ and $v$ coordinates) is:

$$
\begin{aligned}
x &= u + 0.2 \sin(\pi v) \\
y &= v + 0.2 \cos(\pi u)
\end{aligned}
$$

![non-linear](img/non-linear.png)
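
The toy function above can be sketched in a few lines of Python. This is only an illustration of the two equations: the name `warp`, the `amp` parameter, and the character positions along $u$ are hypothetical choices, not values from the study.

```python
import math

def warp(u, v, amp=0.2):
    """Map constituent coordinates (u, v) to compound coordinates (x, y)."""
    x = u + amp * math.sin(math.pi * v)
    y = v + amp * math.cos(math.pi * u)
    return x, y

# Hypothetical positions of three characters along the u axis (assumption).
u_positions = {"土": -0.5, "著": 0.0, "石": 0.5}

# The same character (fixed u) lands at different x depending on its partner v.
for char, u in u_positions.items():
    points = [warp(u, v) for v in (0.0, 0.5)]
    print(char, [(round(x, 3), round(y, 3)) for x, y in points])
```

Holding $u$ fixed and varying $v$ moves the point along one warped grid line, which is the toy analogue of one character appearing in different compounds.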

## A Linear projection

The linear projection model estimates a transformation matrix, which, in this case, is close to a shear matrix.

The intuition is that the same character (at the same $u$ location) has a different "effect" on the word embedding in different contexts (indicated by the black arrows), but under the linear projection these effects are all the same. In other words, the local "effect" of the character is identical to the global linear approximation.

If we can model the underlying distribution, and extract the "effect" from the model, would that "effect" somehow _reflect_ the constituents' meaning?

$$
M = (U^\top U)^{-1}U^{\top} X \\
\begin{bmatrix}
u & v
\end{bmatrix} \begin{bmatrix}
1.00 & 0.01 \\ 0.19 & 1.00
\end{bmatrix} = \begin{bmatrix}
u+0.19v & 0.01u+v
\end{bmatrix}
$$

![linear](img/linear.png)
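
A minimal sketch of the least-squares estimate $M = (U^\top U)^{-1}U^\top X$ on the toy function follows. The sampling grid (41 × 41 points on $[-1, 1]^2$) and the helper names `warp` and `fit_linear_projection` are assumptions made for illustration; the notes do not state how the points were sampled, so the fitted entries need not match the matrix above exactly.

```python
import math

def warp(u, v, amp=0.2):
    """Toy underlying function from constituent (u, v) to compound (x, y)."""
    return (u + amp * math.sin(math.pi * v),
            v + amp * math.cos(math.pi * u))

def fit_linear_projection(U, X):
    """Solve M = (U^T U)^{-1} U^T X for lists of (u, v) and (x, y) pairs."""
    # Entries of the symmetric 2x2 Gram matrix U^T U = [[a, b], [b, d]].
    a = sum(u * u for u, _ in U)
    b = sum(u * v for u, v in U)
    d = sum(v * v for _, v in U)
    det = a * d - b * b
    if det == 0:
        raise ValueError("U^T U is singular")
    inv = [[d / det, -b / det], [-b / det, a / det]]
    # Cross-product matrix U^T X (rows: u, v; columns: x, y).
    c = [[sum(p[i] * q[j] for p, q in zip(U, X)) for j in range(2)]
         for i in range(2)]
    return [[sum(inv[i][k] * c[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Assumed sampling grid: n x n points on [-1, 1]^2.
n = 41
axis = [-1 + 2 * i / (n - 1) for i in range(n)]
U = [(u, v) for u in axis for v in axis]
X = [warp(u, v) for u, v in U]

M = fit_linear_projection(U, X)  # row-vector convention: [u v] M = [x y]
for row in M:
    print([round(value, 2) for value in row])
```

With the row-vector convention used above, the first column of `M` gives the coefficients for $x$ and the second column those for $y$.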

## Approximating with the deep learning model

A deep learning model can better approximate the underlying non-linear relations. Furthermore, we could similarly derive the "effect" of the constituent in that specific context with the Jacobian matrix of the DL model.

As long as the model approximates the underlying function well, the "effect" of the constituent derived from the model is similar to that of the underlying function.

![dl.png](img/dl.png)
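
The contrast between a local Jacobian and the global linear map can be sketched on the toy function. Here the toy function `warp` stands in for a trained model, and central finite differences stand in for automatic differentiation; both are assumptions made for illustration. `linear` uses the matrix reported in the linear projection section.

```python
import math

def warp(u, v, amp=0.2):
    """Toy underlying function from constituent (u, v) to compound (x, y)."""
    return (u + amp * math.sin(math.pi * v),
            v + amp * math.cos(math.pi * u))

def linear(u, v):
    """Linear projection using the matrix reported in the section above."""
    return (1.00 * u + 0.19 * v, 0.01 * u + 1.00 * v)

def jacobian(f, u, v, eps=1e-6):
    """Central-difference Jacobian of f at (u, v); J[i][j] = d(out_i)/d(in_j)."""
    x_up, y_up = f(u + eps, v)
    x_um, y_um = f(u - eps, v)
    x_vp, y_vp = f(u, v + eps)
    x_vm, y_vm = f(u, v - eps)
    h = 2 * eps
    return [[(x_up - x_um) / h, (x_vp - x_vm) / h],
            [(y_up - y_um) / h, (y_vp - y_vm) / h]]

# The Jacobian of the non-linear function changes from location to location.
J_a = jacobian(warp, 0.5, 0.0)
J_b = jacobian(warp, 0.0, 0.5)

# The Jacobian of the linear projection is the same everywhere.
L_a = jacobian(linear, 0.5, 0.0)
L_b = jacobian(linear, 0.0, 0.5)

print("warp   at (0.5, 0.0):", J_a)
print("warp   at (0.0, 0.5):", J_b)
print("linear at (0.5, 0.0):", L_a)
print("linear at (0.0, 0.5):", L_b)
```

For a real model, the same quantity would normally come from the framework's automatic differentiation rather than finite differences.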

## Empirical study

* Show that the model predicts real words well
* Show that the model predicts pseudowords well: this helps account for the non-word behavioral data (MELD-SCH)
* Show that the character Jacobian (the Jacobian matrix with respect to that constituent) reflects the character's meanings in the words.

### Real word embedding predictions
<img src="img/morphert-acc.png" alt="drawing" style="width:400px;"/>

### Pseudowords: MELD-SCH data

<img src="80.11-nw-paper-figure.png" alt="80.11-nw-paper-figure" style="width:800px;"/>

### Character meaning clustering based on Character Jacobian

<img src="30.22-affix-series.png" alt="30.22-affix-series" style="width:800px;"/>

<img src="30.22-affix-pval-distr.png" alt="30.22-affix-pval-distr" style="width:800px;"/>