# imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid", font_scale=1.5, rc={'figure.figsize':(12, 6)})

# for custom notebook formatting.
from IPython.core.display import HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
HTML(open('../custom.css').read())


<br><br><br>

## Natural Language Processing
### :::: Machine Translation ::::


<br><br><br><br><br><br><br><br><br><br><br><br>


<br><br>

## What makes translation difficult?


<br><br><br><br><br>


Different languages behave differently

- Different syntax:
  - SVO (Subject-Verb-Object) languages: English, French, Mandarin
    - Verbs come in the middle of the sentence
  - SOV (Subject-Object-Verb) languages: Hindi, Japanese
    - Verbs come at the end of the sentence
  - VSO (Verb-Subject-Object) languages: Biblical Hebrew, Classical Arabic
    - Verbs come at the beginning of the sentence
    
<br><br>

- Different semantics:
  - E.g., in English, we would say "The bottle floated <u>out</u>"
  - But in Spanish, we say "La botella <u>salió</u> flotando" (The bottle exited floating)
    - The verb indicates the direction.
    
<br><br>

- Lexical differences:
  - A word in one language may not map directly to a word in another language
    - E.g., "aput" in Eskimo means "snow on the ground"
    - "qana" in Eskimo means "falling snow"
  - Spoken Mandarin does not distinguish gender in third-person singular pronouns ("he" and "she" are both pronounced "tā")
  - English does not distinguish between gender in third-person plural pronouns ("they"), whereas French does ("ils" vs "elles")

<br><br>  

- Cultural differences:
  - Understanding a metaphor may require knowing the history of a culture

<br><br>

<u>Sapir-Whorf hypothesis</u>:
- The language you use constrains the thoughts you can have
  - E.g., do Eskimos have more complex thoughts about snow than English-speakers?
- Controversial topic in linguistics



<br><br>

## Four General Approaches to Machine Translation

1. Transfer
2. Interlingua
3. Direct Translation
4. Machine Learning

## 1. Transfer

![transfer](figs/transfer.png)

> Use prior knowledge about source and target languages to directly transform the syntactic and lexical patterns of the source into an appropriate sentence in the target.

<br><br>

### Syntactic transformations

A function that maps a parse tree in the source language to a parse tree in the target language. 

E.g., in English, adjectives come before nouns, but in French most adjectives come after the noun.

![adj](figs/adj.png)

Since we have many transformations, we require a list of rules and the order in which they should be applied.

![rewrite](figs/rewrite.png)
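
An ordered rule list like this can be sketched in a few lines. The sketch below is only an illustration: it works on a flat list of part-of-speech-tagged words rather than a full parse tree, and the tag names, the single rule, and the helper names `swap_adj_noun` and `transfer` are assumptions made for the example.

```python
# Toy syntactic transfer on (word, tag) pairs instead of a real parse tree.
# The tag set and the rule list are illustrative assumptions.

def swap_adj_noun(tagged):
    """Rewrite ADJ NOUN -> NOUN ADJ over a list of (word, tag) pairs."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair we just swapped
        else:
            i += 1
    return out

# Rules are applied in list order, so the order is part of the grammar.
RULES = [swap_adj_noun]

def transfer(tagged, rules=RULES):
    for rule in rules:
        tagged = rule(tagged)
    return tagged

english = [("the", "DET"), ("green", "ADJ"), ("witch", "NOUN")]
reordered = transfer(english)
print([word for word, _ in reordered])
```

The words are still English after this step; the lexical transformations handle the word-level mapping.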

<br><br>

### Lexical transformations

- E.g., to convert "aput" into "snow on the ground"
- Need a cross-language dictionary
- To resolve ambiguity, use the parse tree

<br>

E.g., if unsure whether it is "book" the noun or verb, use the part-of-speech.
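
As a rough sketch of that idea, a cross-language dictionary can be keyed by both the word and its part of speech. The `LEXICON` entries and the `translate_word` helper below are hypothetical and only cover the "book" example.

```python
# Hypothetical English -> French lexicon keyed by (word, part of speech).
LEXICON = {
    ("book", "NOUN"): "livre",
    ("book", "VERB"): "réserver",
}

def translate_word(word, pos, lexicon=LEXICON):
    # Fall back to the source word when the lexicon has no entry.
    return lexicon.get((word, pos), word)

print(translate_word("book", "NOUN"))
print(translate_word("book", "VERB"))
```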

<br>

Can sometimes require complex discourse parsing.

- E.g., translating "they" into French requires figuring out the gender of "they"

<br><br><br><br>

<br><br>

## 2. Interlingua

The transfer approach doesn't scale well for many languages
- Need a different set of transfer rules for each <u>pair</u> of languages

<br><br>
Interlingua approach: 
> Represent the meaning of a sentence in a language-independent way.

![inter](figs/inter.png)

<u>Example interlingua</u>:

Unification style representation of "There was an old man gardening."

![unif](figs/unif.png)

Tradeoffs of Interlingua approach:
- Avoids the need to design pairwise transfer rules
- But, it is difficult to design a truly language-independent representation
  - And to accurately parse a sentence into that representation!
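
To make the scaling argument concrete, here is a small count under two assumptions: transfer needs one rule set per ordered (source, target) pair, and an interlingua system needs one analyzer and one generator per language. The function names are made up for this example.

```python
def transfer_rule_sets(n_languages):
    # One rule set per ordered (source, target) pair.
    return n_languages * (n_languages - 1)

def interlingua_components(n_languages):
    # One analyzer (language -> interlingua) plus one generator
    # (interlingua -> language) per language.
    return 2 * n_languages

for n in (2, 5, 20):
    print(n, transfer_rule_sets(n), interlingua_components(n))
```

Under these assumptions the pairwise count grows quadratically while the interlingua count grows linearly.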

<br><br><br>


<br><br>

## 3. Direct Translation

Avoids complex intermediate representations (parse trees, semantics).

Instead, just run a set of rules in parallel for each word token.

![direct](figs/direct.png)

![direct2](figs/direct2.png)

Iteratively refine the translation based on a set of rules.

Tradeoffs:
- Less complex to implement
- Will produce some output for every input
- More likely to have syntactic errors
<br><br><br>



## Vauquois Triangle


![vauquois](figs/vauquois.png)

<br><br><br>

## 4. Machine Learning

Consider translation as an instance of a <u>noisy channel model</u>:

Source $\rightarrow$ Noise $\rightarrow$ Receiver

For a source sentence $S$, find the best target sentence $T$ by:

$$T^* \leftarrow \mathrm{argmax}_{\: T} \: P(T) P(S \mid T)$$

- $P(T)$: "fluency". How likely is this sentence $T$ overall?
- $P(S \mid T):$  "faithfulness". How closely does the meaning of $T$ match that of $S$?
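
A tiny worked example of the argmax, using made-up probabilities for three candidate translations of a single source sentence (the numbers and candidate strings are invented for illustration):

```python
# Made-up probabilities for three candidate translations T of one source sentence S.
candidates = {
    "the green witch": {"p_t": 0.020, "p_s_given_t": 0.30},
    "the witch green": {"p_t": 0.001, "p_s_given_t": 0.35},
    "a green witch":   {"p_t": 0.015, "p_s_given_t": 0.10},
}

def noisy_channel_score(entry):
    # P(T) * P(S | T): fluency times faithfulness.
    return entry["p_t"] * entry["p_s_given_t"]

best = max(candidates, key=lambda t: noisy_channel_score(candidates[t]))
print(best)
```

Here the second candidate is slightly more faithful, but its low fluency score keeps it from winning.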


<br>

#### Fluency

We've already seen many models of fluency:
- e.g. n-gram language models

$p(w_1 \ldots w_m) \approx p(w_n \mid w_{n-1} \ldots w_1) * p(w_{n+1} \mid w_{n} \ldots w_2) * \ldots * p(w_m \mid w_{m-1}\ldots w_{m-n+1})$

In future lectures, we'll consider parse trees, which provide another way to estimate fluency.
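
As a reminder of how an n-gram model scores fluency, here is a minimal bigram ($n=2$) sketch. The three-sentence corpus is invented, and the model is unsmoothed with no end-of-sentence token, so it is only meant to show the counting.

```python
from collections import Counter

# Invented toy corpus.
corpus = [
    "the witch is green",
    "the witch is old",
    "the man is old",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split()
    unigrams.update(tokens[:-1])                  # counts of each context word
    bigrams.update(zip(tokens[:-1], tokens[1:]))  # counts of (previous, current)

def bigram_prob(sentence):
    """Unsmoothed maximum-likelihood bigram probability of a sentence."""
    tokens = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(bigram_prob("the man is green"))
print(bigram_prob("green is man the"))
```

The first sentence never appears in the corpus but is built from seen bigrams, so it gets nonzero probability; the scrambled version gets zero.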

<br>

#### Faithfulness

Estimate probabilities of translations from <u>parallel texts</u>.
- E.g., the same news article in English and French

Given training data $D = \{(e_1, f_1) \ldots (e_N, f_N)\}$ of $N$ parallel sentences, how can we compute:

$$
p(f_i \mid e_i) = \prod_{j=1}^n p(f_{ij} \mid e_{i1} \ldots e_{im})
$$
- English sentence has tokens $e_{i1} \ldots e_{im}$
- French sentence has tokens $f_{i1} \ldots f_{in}$

<br>
We'd like to just count the occurrences in the training data, but we need to align sentences.

An alignment vector $a$ maps one or more words in English to one or more words in French. 

![figs/alignment.png](figs/alignment.png)

Then:

$$p(f_i \mid e_i) = \sum_{a} p(f_i, a \mid e_i) $$
- alignment is an unobserved variable
- "sum over" it to compute the probability of a translation
- Use "Expectation Maximization" algorithm to fit this model on parallel text
  - Recall EM for HMMs
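
One simple concrete instance is IBM Model 1, sketched below. This is a specific choice the notes above do not name: it assumes each French-side word aligns to exactly one English word, ignores word order, and omits the usual NULL word. The two-sentence English-Spanish corpus is made up.

```python
from collections import defaultdict

# Tiny made-up parallel corpus: (English tokens, foreign tokens).
pairs = [
    (["green", "house"], ["casa", "verde"]),
    (["the", "house"], ["la", "casa"]),
]

e_vocab = {e for es, _ in pairs for e in es}
f_vocab = {f for _, fs in pairs for f in fs}

# t[(f, e)] = p(f | e), initialised uniformly.
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for _ in range(20):
    count = defaultdict(float)  # expected number of times f aligns to e
    total = defaultdict(float)  # expected number of alignments involving e
    # E-step: posterior over which English word each foreign word aligns to.
    for es, fs in pairs:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                posterior = t[(f, e)] / norm
                count[(f, e)] += posterior
                total[e] += posterior
    # M-step: re-estimate p(f | e) from the expected counts.
    for (f, e) in t:
        t[(f, e)] = count[(f, e)] / total[e]

print(round(t[("casa", "house")], 3), round(t[("verde", "house")], 3))
```

Because "house" co-occurs with "casa" in both sentence pairs, the expected counts shift probability mass toward $p(\text{casa} \mid \text{house})$ with each iteration, even though no alignment was ever observed.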



<br><br>

## sequence-to-sequence networks for translation

Recall the different types of recurrent neural networks:

![rnn_styles](../sequence/figs/rnn_styles.png)

Which one is most appropriate for translation?

<br><br><br>

many-to-many, unsynchronized

Called **sequence-to-sequence** or **encoder-decoder** networks.

![encoder](figs/encoder.png)

<br><br><br>

Where are we on the Vauquois Triangle?


![vauquois](figs/vauquois.png)


<br><br><br>

in more detail...

![encoder](figs/encoder2.png)

1. **encoder**: converts input tokens to contextualized representations
  - $\{x_1 \ldots x_n\} \mapsto \{h^e_1 \ldots h^e_n\}~~~~$, $~~~~h^e_i \in \mathbb{R}^{d}$
  - LSTMs, GRUs, Transformers, etc.
  
<br>

2. **context vector** $c$: Summarizes $\{h^e_1 \ldots h^e_n\}$
  - used by the decoder to produce the output
  

<br>

3. **decoder**: converts $c$ into contextualized output representations
  - $c \mapsto (\{h^d_1 \ldots h^d_m\}, \{y_1 \ldots y_m\})$


<br><br>

A simple encoder just uses an RNN

> $h^e_t = g(h^e_{t-1}, x_t)$

but $g$ could also be an LSTM or GRU cell (a Transformer encoder replaces the recurrence with self-attention)

<br><br>

We can then let the context be the final hidden state of the encoder.

> $h^d_1 = c =  h^e_n  $

<br> 

Then, we can generate output using 

> $$\begin{align}h^d_t &= g(h^d_{t-1})\\ z_t &= f(h^d_t)\\ y_t &= \mathrm{softmax}(z_t)\end{align}$$

<br><br>

But, there are a few problems with relying on this simple recurrence $h^d_t = g(h^d_{t-1})$ ...

<br>
Recall what our first RNN looked like:

![char](../sequence/figs/char.png)

<br><br><br><br><br>

- **problem 1**: Need to account for the previous words we've generated $\{y_1 \ldots y_{t-1}\}$ (like in HMMs)

<br>

Let our prediction at time $t$ be

$$\hat{y}_t = \mathrm{argmax}_{w \in V} y_t(w)$$

where $y_t(w)$ is the value of $\mathrm{softmax}(z_t)$ for word $w$.

Then we can modify the update equation to be:

$h^d_t = g(h^d_{t-1}, \hat{y}_{t-1})$


<br><br><br>

- **problem 2**: For long sentences, the network will "forget" information from the context vector related to the beginning of the sentence
  - similar to the vanishing gradient problem
  
<br>

We can just make the context vector $c$ accessible at each decoding time step, since it isn't changing anymore.

$h^d_t = g(h^d_{t-1}, \hat{y}_{t-1}, c)$
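
A minimal NumPy sketch of this decoder update with greedy output selection. Everything concrete here is an assumption for illustration: the toy sizes, the random (untrained) weights, the `tanh` form of $g$, the linear form of $f$, and the use of token 0 as a start symbol.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 5, 4                       # toy sizes
E = rng.normal(size=(vocab_size, d))       # embeddings for previously generated tokens
W_h = rng.normal(size=(d, d))
W_y = rng.normal(size=(d, d))
W_c = rng.normal(size=(d, d))
W_out = rng.normal(size=(d, vocab_size))   # plays the role of f

def softmax(z):
    z = z - z.max()                        # numerical stability
    e = np.exp(z)
    return e / e.sum()

def g(h_prev, y_prev_embedding, c):
    # One possible recurrence combining previous state, previous token, and context.
    return np.tanh(W_h @ h_prev + W_y @ y_prev_embedding + W_c @ c)

def greedy_decode(c, start_token=0, max_len=6):
    h = c                                  # initial decoder state set to the context
    y_prev = start_token
    outputs = []
    for _ in range(max_len):
        h = g(h, E[y_prev], c)             # c is available at every step
        y = softmax(h @ W_out)             # distribution over the vocabulary
        y_prev = int(np.argmax(y))         # greedy choice of the next token
        outputs.append(y_prev)
    return outputs

c = rng.normal(size=d)                     # stand-in for the encoder's final hidden state
print(greedy_decode(c))
```

With untrained weights the output tokens are meaningless; the point is only the data flow through $g$.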


## Training sequence-to-sequence models

Assume we have "parallel texts" of source-target languages.

We can concatenate sentence pairs, separated by a special `<s>` token, and train to maximize per-word probability.

![figs/encoder3.png](figs/encoder3.png)

Similar to how we trained a vanilla RNN:

```python
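# Assumes `rnn`, `criterion`, `optimizer`, `data`, and `epochs` are already
# defined, as in the earlier vanilla RNN example.
loss_val = []
for epoch in range(epochs):
    epoch_loss = 0.0
    for inputs, outputs in data:
        # inputs: source language tokens
        # outputs: target language tokens
        optimizer.zero_grad()    # clear gradients from the previous step
        predicted_outputs, hiddens = rnn.forward_unrolled(inputs)
        loss = criterion(predicted_outputs, outputs)
        loss.backward()          # computes all the gradients
        optimizer.step()         # update parameters
        epoch_loss += loss.item()
    loss_val.append(epoch_loss)  # total loss for this epoch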
```

<br>

One problem: especially at the start of training, the model's predictions will be poor.

Yet above, we generate each token conditioned on our previous (noisy) prediction, so early mistakes compound:

> $$h^d_t = g(h^d_{t-1}, \hat{y}_{t-1}, c)$$


<br>

One trick to speed up training is called **teacher forcing**. 

During training, feed the decoder the true previous output $y_{t-1}$, rather than the prediction $\hat{y}_{t-1}$, when predicting $y_t$

> $$h^d_t = g(h^d_{t-1}, y_{t-1}, c)$$
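
In practice this amounts to feeding the decoder the target sentence shifted right by one position. The sketch below shows only that input construction; the helper name `decoder_inputs`, the example sentences, and the simplification of treating the model's predictions as a fixed list are all assumptions.

```python
def decoder_inputs(target_tokens, predicted_tokens, teacher_forcing, start="<s>"):
    """Return the token fed to the decoder at each step (the y_{t-1} slot)."""
    previous = target_tokens if teacher_forcing else predicted_tokens
    return [start] + list(previous[:-1])

target = ["la", "bruja", "verde"]
predicted = ["el", "bruja", "rojo"]  # made-up noisy model output

print(decoder_inputs(target, predicted, teacher_forcing=True))
print(decoder_inputs(target, predicted, teacher_forcing=False))
```

With teacher forcing, the early mistake ("el") never reaches the decoder's input, so later steps are trained on correct history.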






<br><br><br>

## Beam Search

To do prediction, we are currently performing a **greedy search** to generate each $\hat{y}_t$ conditioned on the most probable prior prediction $\hat{y}_{t-1}$.


In HMMs, we used Viterbi. Why can't we use it here?

<br><br><br><br><br><br>

**long-distance dependencies**

- Since $h^d_t$ can depend on all the prior time steps $t-k$, we can't formulate an efficient dynamic program. 
- Instead, approximate with beam search

![figs/beam.png](figs/beam.png)

- at each time step, extend each of the current $k$ hypotheses with every possible next word
- prune to the top $k$ hypotheses by score

$$ \mathrm{score}(y) = \sum_{i=1}^t \log p(y_i \mid y_1 \ldots y_{i-1}, x) $$
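
A small sketch of beam search over a hard-coded toy model. It scores each hypothesis by its summed log-probability (a common choice) and uses no length normalization. The `next_token_probs` table is hypothetical and stands in for the decoder's softmax output.

```python
import math

def next_token_probs(prefix):
    # Hypothetical p(next token | prefix); any other prefix ends the sentence.
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"a": 0.4, "b": 0.3, "</s>": 0.3},
        ("b",): {"a": 0.05, "b": 0.05, "</s>": 0.9},
    }
    return table.get(tuple(prefix), {"</s>": 1.0})

def beam_search(k, max_len=5):
    beams = [((), 0.0)]  # (prefix, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "</s>":
                candidates.append((prefix, score))  # finished: carry forward
                continue
            for token, p in next_token_probs(prefix).items():
                candidates.append((prefix + (token,), score + math.log(p)))
        # Prune to the k highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(prefix[-1] == "</s>" for prefix, _ in beams):
            break
    return beams[0]

print(beam_search(k=1)[0])  # k=1 is greedy search
print(beam_search(k=2)[0])
```

In this toy model the greedy search commits to "a" first and ends with total probability 0.24, while a beam of width 2 keeps "b" alive and finds the sequence with probability 0.36.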

<br><br><br>

![figs/beam2.png](figs/beam2.png)

<br><br>

## Attention models for translation

Recall our trick for "remembering" the context longer:

$h^d_t = g(h^d_{t-1}, \hat{y}_{t-1}, c)$

![encoder](figs/encoder2.png)

But, some parts of the input summarized by $c$ are more relevant to output $t$ than to output $t+1$.

e.g., the system should care a lot more about the word "green" when producing "verde" than it cares about the word "witch"

We've seen this problem before:

![../sequence/figs/lstmclf4.png](../sequence/figs/lstmclf4.png)

<br><br>

We need a different context vector for each output element:

$h^d_t = g(h^d_{t-1}, \hat{y}_{t-1}, c) \rightarrow h^d_t = g(h^d_{t-1}, \hat{y}_{t-1}, c_t)$

![figs/att1.png](figs/att1.png)


<br>

**How to compute** $c_i$?

Simple(ish) approach: use dot product attention

$$c_i = \sum_j \alpha_{ij} h_j^e$$
- context vector for output $i$ is weighted average of all hidden states of the encoder
- $\alpha_{ij}$ proportional to similarity of encoder state $h_j^e$ and decoder state $h_{i-1}^d$

$$
\alpha_{ij} = \frac{\mathrm{exp}(h_{i-1}^d \cdot h_j^e)}{\sum_k\mathrm{exp}(h_{i-1}^d \cdot h_k^e)}
$$


![figs/att2.png](figs/att2.png)

Interestingly, the above approach doesn't actually require any new parameters.
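
The two equations translate almost directly into NumPy. In this sketch the encoder states and the decoder state are small hand-picked vectors, not outputs of a trained model, and the function name is an assumption.

```python
import numpy as np

def dot_product_attention(h_dec_prev, H_enc):
    """h_dec_prev: shape (d,). H_enc: shape (n, d), one encoder state per row."""
    scores = H_enc @ h_dec_prev       # dot product with every encoder state
    scores = scores - scores.max()    # numerical stability; softmax is unchanged
    weights = np.exp(scores)
    alpha = weights / weights.sum()   # attention weights, sum to 1
    context = alpha @ H_enc           # weighted average of encoder states
    return context, alpha

H_enc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
h_dec_prev = np.array([2.0, 0.0])
context, alpha = dot_product_attention(h_dec_prev, H_enc)
print(alpha.round(3), context.round(3))
```

The first and third encoder states have the same dot product with the decoder state, so they receive equal weight, and both outweigh the second.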

Instead of the dot product, we can use a bilinear layer

$$ \mathrm{score}(h_{i-1}^d, h_j^e) = h_{i-1}^d W_s h_j^e$$
- learn **which** aspects of similarity are important 
- greatly helps for long sentences
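
A sketch of the bilinear variant, with a random matrix standing in for the learned $W_s$ and toy dimensions chosen for illustration. One side effect of the extra matrix is that the encoder and decoder states no longer need to have the same size.

```python
import numpy as np

rng = np.random.default_rng(0)
d_dec, d_enc, n = 3, 4, 5               # toy sizes; the two state sizes may differ
W_s = rng.normal(size=(d_dec, d_enc))   # learned in practice; random here
H_enc = rng.normal(size=(n, d_enc))     # one encoder state per row
h_dec_prev = rng.normal(size=d_dec)

scores = h_dec_prev @ W_s @ H_enc.T     # bilinear score for every encoder state
weights = np.exp(scores - scores.max())
alpha = weights / weights.sum()         # same softmax normalization as before
context = alpha @ H_enc
print(alpha.shape, context.shape)
```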


![figs/att4.png](figs/att4.png)

<br><br>

Attention allows for some interpretability
- What was the model "attending to" when it produced word $i$?

![figs/att3.png](figs/att3.png)

Can extract the alignment between the sentences!

<br><br><br>

## neural machine translation revolution 

![figs/mtquality.png](figs/mtquality.png)

- [source](https://syncedreview.com/2020/05/20/neural-network-ai-is-the-future-of-the-translation-industry/)
- [NYT article](https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html) on Google's translation efforts

<br><br><br>

## is translation solved?


![figs/maori.png](figs/maori.png)

- [source](https://www.skynettoday.com/briefs/google-nmt-prophecies)
- More at <https://www.reddit.com/r/TranslateGate/>


<br>

Less obvious errors arise due to:

- low-resource language pairs
- out-of-vocabulary words
- common sense
  - [paper jam](https://translate.google.com/?sl=en&tl=es&text=%20paper%20jam%0A&op=translate)? 
- training set biases
  - [they are doctors](https://translate.google.com/?sl=en&tl=es&text=They%20are%20doctors%0A.&op=translate) / [they are nurses](https://translate.google.com/?sl=en&tl=es&text=They%20are%20nurses%0A.&op=translate)


<br><br>

## many other uses of sequence-to-sequence models

- summarization
  - input: long text
  - output: short text
  
  
- dialogue
  - input: human input
  - output: machine output
  
  
- code generation
  - input: human specification of program
  - output: Python (!)
  
  
- music recommendation
  - input: first half of a playlist
  - output: second half

<br><br>

####  image sources

- https://www.cs.colorado.edu/~martin/SLP/
- https://people.cs.umass.edu/~mccallum/courses/inlp2007/lect17-mt.key.pdf
- https://web.stanford.edu/class/cs224n/slides/cs224n-2021-lecture07-nmt.pdf