# Model Notes

__Contents__ :
1. Architecture overview
    - 1.1 [RNN](#RNN)
    - 1.2 [LSTM](#LSTM)
    - 1.3 [Attention](#Attention)
2. General models
    - 2.1 [VAE](#VAE)
    - 2.2 [Seq2Seq](#Seq2Seq)
3. Paper implementations (non-parallel text)
    - 3.1 [Style Transfer from Non-Parallel Text by Cross-Alignment](#Paper_3.1) - one of the first big papers on text style transfer (TST); by MIT CSAIL group
    - 3.2 [Neural Style Transfer for Non-Parallel Text](#Paper_3.2) - summary of CSAIL paper
    - 3.3 [Disentangled Representation Learning for Non-Parallel Text Style Transfer](#Paper_3.3) - the most recent/comprehensive source on TST
    - 3.4 [Style Transfer in Text: Exploration and Evaluation](#Paper_3.4) - autoencoder & style embedding versions of seq2seq
4. Evaluation
    - 4.1 [Massive Exploration of Neural Machine Translation Architectures](#Eval_4.1) - methodology and parameter suggestions

## 1. Architecture Overview (Coursera & NLP Notes)

***

<a name="RNN"></a>
### 1.1 RNN: Recurrent Neural Network
- _Sources_ : 
    - $[1]$ [RNN Wiki](https://en.wikipedia.org/wiki/Recurrent_neural_network)
    - $[2]$ [Coursera Slides](https://cs230.stanford.edu/files/C5M1.pdf)
    - $[3]$ [Massive Exploration of Neural Machine Translation Architectures](https://arxiv.org/pdf/1703.03906.pdf) (Google Brain)

#### 1.1.1 Overview $[1]$
- _Problems with standard NN_ :
    - Inputs, outputs can be different lengths
    - Doesn't share features learned across different positions of text
- Connections between nodes form directed graph along temporal sequence; allows temporal dynamic behavior
- Two classes of RNNs:
    1. _Finite impulse_ : response to input is of finite duration (settles to zero in finite time)
        - Is DAG, can be "unrolled" and replaced with strictly feedforward NN
    2. _Infinite impulse_ : may have internal feedback, and may continue to respond indefinitely (although the response usually decays)
        - Directed cyclic graph, cannot be unrolled
- Can have additional stored states, potentially controlled by NN $\to$ incorporating time delays/feedback loops leads to LSTM gated memory principles
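
A minimal NumPy sketch of the recurrence described above, assuming a vanilla (Elman-style) RNN with a `tanh` activation. The names `rnn_forward`, `W_xh`, `W_hh`, `b_h` and all dimensions are illustrative, not taken from the sources. It shows the two points from the overview: the same weights are shared across positions, and the input length is not fixed.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:
        # The same weights are reused at every time step (feature sharing across positions).
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # illustrative sizes
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

# Sequences of different lengths go through the same network unchanged.
short_states = rnn_forward(rng.normal(size=(2, input_dim)), W_xh, W_hh, b_h)
long_states = rnn_forward(rng.normal(size=(7, input_dim)), W_xh, W_hh, b_h)
```

Unrolling the loop for a fixed number of steps gives the equivalent feedforward view mentioned for the finite-impulse case.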

#### 1.1.2 RNN types $[2]$

<img src = "img/coursera_rnn_types.png" style = "width: 500px"/>

#### 1.1.3 Issue of vanishing gradients, illustrated $[2]$

<img src = "img/coursera_vanishing_grads.png" style = "width: 500px"/>
    

#### 1.1.4 GRU: gated recurrent unit
- Gating mechanism in RNN: like LSTM with forget gate, but has fewer parameters (lacks output gate)
- Also addresses the vanishing gradients problem
- Can have better performance than LSTM on small datasets; but LSTM consistently outperformed GRU in the NMT experiments of $[3]$
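
A hedged sketch of one GRU step, assuming the common update-gate/reset-gate formulation; conventions differ on whether $z$ or $1 - z$ multiplies the old state. The parameter layout (`W_*`, `U_*`, `b_*`) and the helper names are made up for illustration. The parameter count at the end compares three weight sets (GRU) with the four an LSTM layer of the same size would need.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, p):
    """One GRU step: gates decide how much of the old state to keep."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h + p["b_z"])  # update gate
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h + p["b_r"])  # reset gate
    h_tilde = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h) + p["b_h"])  # candidate state
    return (1.0 - z) * h + z * h_tilde

def init_params(input_dim, hidden_dim, names, rng):
    """One (W, U, b) triple per gate/candidate name."""
    params = {}
    for g in names:
        params[f"W_{g}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        params[f"U_{g}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        params[f"b_{g}"] = np.zeros(hidden_dim)
    return params

def count_params(params):
    return sum(v.size for v in params.values())

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # illustrative sizes
gru_params = init_params(input_dim, hidden_dim, ["z", "r", "h"], rng)
h_next = gru_step(rng.normal(size=input_dim), np.zeros(hidden_dim), gru_params)

n_gru = count_params(gru_params)
# An LSTM layer of the same size has four weight sets (input, forget, output, candidate).
n_lstm = count_params(init_params(input_dim, hidden_dim, ["i", "f", "o", "c"], rng))
```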

<a name="LSTM"></a>
### 1.2 LSTM: Long Short-Term Memory

- _Sources_ : 
    - $[1]$ [Long Short-Term Memory](https://www.bioinf.jku.at/publications/older/2604.pdf) 
        - Original paper on LSTM, cited in Coursera
    - $[2]$ [LSTM Wiki](https://en.wikipedia.org/wiki/Long_short-term_memory)
    - $[3]$ [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/pdf/1409.3215.pdf)

#### 1.2.1 Motivation $[1]$
- DNNs can only be applied to problems where inputs, targets encoded with vectors of fixed dimensionality; not good for speech
- Conventional DNNs have issue of compounding error in backprop
- _LSTM soln_ : when error values back-propagated, error remains in LSTM cell $\implies$ "error carousel" feeds error back into each of LSTM gates, until they learn to cut off value


#### 1.2.2 What is an LSTM? $[1, \, 2]$
- Has feedback connections, unlike standard _feedforward_ NNs; in feedforward, node connections don't form a cycle
- _Idea_ :
    - Cell remembers values, and three gates regulate flow of information into/out of cell
    - Useful for data with lags
    - Each gate has activation: marked increase in number of parameters
    - Robustness to range of parameters due to error control
    
#### 1.2.3 LSTM units $[1]$
##### 1.2.3 (a) -  LSTM unit
- __Cell__ : keeps track of dependencies between elements in input sequence
- __Input gate__ : controls net flow into cell
- __Forget gate__ : controls how much value remains in cell
- __Output gate__ : controls extent to which cell value used to compute activation
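
The four components above can be sketched as a single step function. This assumes the now-standard LSTM formulation with a forget gate, which may differ in details from $[1]$; the dictionary-based parameter layout and all sizes are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W, U, b map a gate name to its parameters."""
    i = sigmoid(W["i"] @ x + U["i"] @ h + b["i"])  # input gate: flow into the cell
    f = sigmoid(W["f"] @ x + U["f"] @ h + b["f"])  # forget gate: how much value remains
    o = sigmoid(W["o"] @ x + U["o"] @ h + b["o"])  # output gate: how much cell value is exposed
    g = np.tanh(W["g"] @ x + U["g"] @ h + b["g"])  # candidate cell value
    c_next = f * c + i * g
    h_next = o * np.tanh(c_next)
    return h_next, c_next

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # illustrative sizes
gates = ["i", "f", "o", "g"]
W = {k: rng.normal(scale=0.1, size=(hidden_dim, input_dim)) for k in gates}
U = {k: rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)) for k in gates}
b = {k: np.zeros(hidden_dim) for k in gates}

x = rng.normal(size=input_dim)
h0 = np.zeros(hidden_dim)
c0 = np.array([1.0, -2.0, 0.5])
h1, c1 = lstm_step(x, h0, c0, W, U, b)

# Saturated gates: forget gate near 1 and input gate near 0 carry the cell value through unchanged.
b_keep = dict(b, f=np.full(hidden_dim, 50.0), i=np.full(hidden_dim, -50.0))
_, c_keep = lstm_step(x, h0, c0, W, U, b_keep)
```

The last two lines illustrate the role of the cell as memory: when the gates saturate, `c_keep` matches `c0`.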
    
##### 1.2.3 (b) - Why gate units?
- Avoids weight conflicts
- Input gate learns when to release errors by appropriate scaling
    
<img src = "img/coursera_lstm.png" style = "width: 500px"/>    
    
#### 1.2.4 Application to NLP models $[3]$
- _Property of LSTM_ : learns to map input sentence of variable length onto fixed-dimensional vector representation
    - Translation objective encourages LSTM to find sentence representations that capture meaning
    - Paper claim: model aware of word order and fairly invariant to active/passive voice
    
<a name="Attention"></a>
### 1.3 Model with Attention
- _Sources_ :
    - $[1]$ [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf)

***
***

## 2. Generalized Models

***

<a name="VAE"></a>
### 2.1 Variational Autoencoder (VAE)

- _Sources_ : 
    - $[1]$ [Google's Neural Machine Translation System](https://arxiv.org/pdf/1609.08144.pdf)
    - $[2]$ [VAE tutorial](https://jaan.io/what-is-variational-autoencoder-vae-tutorial/)
    - $[3]$ [Tutorial on Variational Autoencoders](https://arxiv.org/pdf/1606.05908.pdf)
    - $[4]$ [Auto-Encoding Variational Bayes](https://arxiv.org/pdf/1312.6114.pdf) - one of the original papers
    - $[5]$ [Stochastic Backprop in DGMs](https://arxiv.org/pdf/1401.4082.pdf) - the other original paper
    - $[6]$ [Kullback-Leibler divergence Wiki](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)
    - $[7]$ [Intuitively understanding VAEs](https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf)

#### 2.1.1 Encoder-decoder model $[3]$
##### 2.1.1 (a) - Encoder
- Reads in source data $x$, produces feature representation in latent space $z$
    * RNN; input sequence of words $\to$ fixed vector that is $\approx$ meaning
    * CNN; input img $\to$ volume that contains higher-level features of img
- Aka the __inference network__ : parameterizes appx posterior of latent variables $z$ & outputs parameters to dist $q(z \mid x)$
- Bottleneck: encoder must learn efficient compression of data into hidden representation, $z$, for input $x$
- _Note_ : encoder $q_{\theta} (z \mid x)$ for parameters $\theta$ is stochastic; usually $\sim$ Gaussian

<img src = "img/vae_encoding_visual.png" style = "width: 500px"/>
    
##### 2.1.1 (b) -  Decoder
- Input: representation $z$, output: reconstruction of $x$; aka __generative network__
- Notation: $p_{\phi}(x \mid z)$ for parameters $\phi$ (aka __likelihood distribution__)
- Decoder gets latent representation of input as $z$, so necessarily loses some information
    - $\implies$ lost information represented by reconstruction log-likelihood $\log p_{\phi}(x \mid z)$
    - Measures how effectively decoder can reconstruct input $x$ given latent $z$
    
##### 2.1.1 (c) Loss function
- Log-likelihood loss w/regularization; since each $z_i$ corresponds to an $x_i$, no global latent vars $\implies$ can decompose loss fn into only terms that depend on single datapt $\ell_i$:
$$\ell_i(\theta, \phi) = -\mathbb{E}_{z \sim q_{\theta}(z \mid x_i)} \log p_{\phi}(x_i \mid z) + R(x_i)$$
- _Notes on loss_ :
    - Expectation taken wrt encoder distribution over representations $\implies$ poor reconstruction yields large loss
    - Regularizer $R(x)$: use __Kullback-Leibler divergence__ $[6]$ between distributions
        - KL definition for $p, q$ on same probability space $\mathcal{X}$:
        $$ \mathbb{KL}(q \mid\mid p) = \sum_{x \in \mathcal{X}} q(x) \log \Big(\frac{q(x)}{p(x)} \Big)$$
        - _Idea_ : expectation (under $q$) of the log-difference $\log q(x) - \log p(x)$
        - _Interpretation_ : how much information is lost when $p$ is used to approximate $q$
        - Defined iff $\forall \, x$, $p(x) = 0 \implies q(x) = 0$ (absolute continuity)
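
A small sketch of the discrete KL definition, using only the standard library. The function name and the example distributions are illustrative; returning infinity when absolute continuity fails is a common convention rather than something stated in $[6]$'s summary above.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions given as equal-length lists."""
    total = 0.0
    for q_x, p_x in zip(q, p):
        if q_x == 0.0:
            continue  # the term 0 * log(0 / p) is taken to be 0
        if p_x == 0.0:
            return math.inf  # q puts mass where p has none: absolute continuity fails
        total += q_x * math.log(q_x / p_x)
    return total

q = [0.5, 0.5, 0.0]
p = [0.25, 0.25, 0.5]
kl_qp = kl_divergence(q, p)  # finite: p covers the support of q
kl_pq = kl_divergence(p, q)  # infinite: q is zero where p is not
```

The two calls also show that the divergence is not symmetric in its arguments.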
    
##### 2.1.1 (d) Example encoder-decoder model (with attention) $[2]$
    
<img src = "img/google_brain_encoder_decoder.png" style = "width: 500px"/>

#### 2.1.2 Variational autoencoder $[3]$
##### 2.1.2 (a) - VAE vs. traditional autoencoder $[7]$

<img src = "img/vae_vs_standard.png" style = "width: 500px"/>

##### 2.1.2 (b) - Definition
- _Formal def_ : approximate inference in latent Gaussian model where approximate posterior and model likelihood are parameterized by NNs (the inference and generative networks)
- Assumes a prior $p(z)$ on the latent variable: generally standard normal, $p(z) = \mathcal{N}(0, I)$; allows direct sampling
    - $\implies$ if encoder outputs representations $z$ different from those normally distributed, then will receive penalty in loss
    - _Note_ : could model word distribution for better encoding as in 161/200?
- Train VAE using gradient descent, or other optimizer
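
A toy sketch of the per-datapoint loss (reconstruction term plus KL penalty against a standard normal prior). Everything here is an assumption made for illustration: the encoder and decoder are single linear maps (`W_mu`, `W_logvar`, `W_dec`), the likelihood is a unit-variance Gaussian, and sampling uses the reparameterization $z = \mu + \sigma \epsilon$ from $[4]$. A real VAE would use deeper networks and an autodiff framework.

```python
import numpy as np

def vae_loss(x, W_mu, W_logvar, W_dec, rng, n_samples=8):
    """Monte Carlo estimate of the per-datapoint loss: reconstruction + KL."""
    mu = W_mu @ x               # encoder mean
    log_var = W_logvar @ x      # encoder log-variance (diagonal Gaussian)
    sigma = np.exp(0.5 * log_var)

    recon = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + sigma * eps    # sample z ~ q(z | x)
        x_hat = W_dec @ z       # decoder mean
        # -log p(x | z) for a unit-variance Gaussian, dropping the additive constant
        recon += 0.5 * np.sum((x - x_hat) ** 2)
    recon /= n_samples

    # KL(q(z | x) || N(0, I)) in closed form for a diagonal Gaussian
    kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - log_var - 1.0)
    return recon + kl, recon, kl

rng = np.random.default_rng(0)
x_dim, z_dim = 5, 2  # illustrative sizes
x = rng.normal(size=x_dim)
W_mu = rng.normal(scale=0.3, size=(z_dim, x_dim))
W_logvar = rng.normal(scale=0.3, size=(z_dim, x_dim))
W_dec = rng.normal(scale=0.3, size=(x_dim, z_dim))

loss, recon, kl = vae_loss(x, W_mu, W_logvar, W_dec, rng)

# An encoder that always outputs the prior (mu = 0, sigma = 1) pays no KL penalty.
zeros = np.zeros((z_dim, x_dim))
_, _, kl_zero = vae_loss(x, zeros, zeros, W_dec, rng)
```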

##### 2.1.2 (c) - VAE backprop
- Want small $\sigma$ in $p$ to ensure that model generates something like $x$ input

##### 2.1.2 (d) - KL divergence for two gaussians $[7]$
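$$ \mathbb{KL}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \mid\mid \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_i \big( \sigma_i^2 + \mu_i^2 - \log(\sigma_i^2) - 1 \big) $$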
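
As a sanity check, the standard closed form $\frac{1}{2} \sum_i (\sigma_i^2 + \mu_i^2 - \log \sigma_i^2 - 1)$ for a diagonal Gaussian against $\mathcal{N}(0, I)$ can be compared with a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$. The function names, the example values of $\mu, \sigma$, and the sample count are arbitrary choices for this sketch.

```python
import numpy as np

def gaussian_kl(mu, sigma):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I))."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return 0.5 * np.sum(sigma ** 2 + mu ** 2 - np.log(sigma ** 2) - 1.0)

def gaussian_kl_monte_carlo(mu, sigma, n_samples, rng):
    """Estimate the same KL by averaging log q(z) - log p(z) over samples z ~ q."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z = mu + sigma * rng.standard_normal((n_samples, mu.size))
    # The -0.5 * log(2 * pi) constants cancel between log q and log p.
    log_q = -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma)
    log_p = -0.5 * z ** 2
    return np.mean(np.sum(log_q - log_p, axis=1))

mu, sigma = [0.5, -1.0], [0.8, 1.5]
closed = gaussian_kl(mu, sigma)
estimate = gaussian_kl_monte_carlo(mu, sigma, 200_000, np.random.default_rng(0))
```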

***

<a name="Seq2Seq"></a>
### 2.2 Seq2Seq

- _Sources_ : 
    - $[1]$ [Seq2Seq documentation](https://google.github.io/seq2seq/)
    - $[2]$ [Seq2Seq Wiki](https://en.wikipedia.org/wiki/Seq2seq)
    - $[3]$ [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/pdf/1409.3215.pdf); _authors_ : Sutskever et al., 2014
    - $[4]$ [Style Transfer in Text: Exploration and Evaluation](https://arxiv.org/pdf/1711.06861.pdf)

    
- _Code_ :
    - _List of encoders_ : [seq2seq encoders](https://google.github.io/seq2seq/encoders/)
    - _List of decoders_ : [seq2seq decoders](https://google.github.io/seq2seq/decoders/)
    - _List of models_ : [seq2seq models](https://google.github.io/seq2seq/models/)

***
***

## 3. Paper Implementations (Non-Parallel Text)

***

<a name="Paper_3.1"></a>
### 3.1 [Style Transfer from Non-Parallel Text by Cross-Alignment](https://arxiv.org/pdf/1705.09655.pdf)
- _Authors_ : Shen et al., 2017

- _Relevant outside sources_ :
    - $[1]$ [Toward Controlled Generation of Text](https://arxiv.org/pdf/1703.00955.pdf) - latent spaces with generated text; _authors_ : Hu et al., 2017
    - $[2]$ [InfoGAN: Representation Learning](https://arxiv.org/pdf/1606.03657.pdf) - more on latent spaces
    - $[3]$ [Auto-Encoding Variational Bayes](https://arxiv.org/pdf/1312.6114.pdf) - original VAE paper
    - $[4]$ [CNN for Sentence Classification](https://arxiv.org/pdf/1408.5882.pdf) - model-based evaluation metric
    - $[5]$ [word2vec](https://code.google.com/archive/p/word2vec/) - more on model-based evaluation
    - $[6]$ [Professor Forcing](https://arxiv.org/pdf/1610.09038.pdf) - RNN training algorithm
- _Code implementation_ : [authors' repo](https://github.com/shentianxiao/language-style-transfer)

#### 3.1.1 Introduction
- Learn encoder $\to$ take style as input $\to$ map to style-independent content
- Do not use VAE ( _need to research_ ) - need to preserve latent content

<img src = "img/non_parallel_model.png" style = "width: 500px"/>  

- Task evaluation:
    1. Sentiment modification
    2. Decipherment of word substitution ciphers
    3. Recovery of word order
    
#### 3.1.2 Related work
- _Vision_ : cannot employ many similar vision methods due to discreteness of NLP
    - Generating sentences with controllable attributes by learning disentangled latent representations $[1]$
    - More research available on latent spaces $[2]$
    
#### 3.1.3 Formulation
- _Motivation_ : want to recover joint distributions of style in order to facilitate style transfer
    - __Claim__ : datasets from different styles need to be distinct enough for this paper's methods to work
- _Conclusions_ : latent content should have most complexity from input, and latent style variable should have simple effects

##### 3.1.3 (a) - Data Generation
1. Latent style variable $y$ generated from $p(y)$
2. Latent content variable generated from $p(z)$
3. Datapoint $x$ generated from conditional dist $p(x \mid y, z)$
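
The three generation steps can be mimicked with a deliberately tiny toy. Everything here is hypothetical: the paper's content variable is a continuous latent, while this sketch uses two discrete content values and a lookup table in place of $p(x \mid y, z)$, purely to show how non-parallel corpora arise.

```python
import random

STYLES = ["formal", "casual"]        # hypothetical style values
CONTENTS = ["greeting", "farewell"]  # hypothetical content values
LEXICON = {                          # stand-in for p(x | y, z), deterministic here
    ("formal", "greeting"): "good evening",
    ("casual", "greeting"): "hey there",
    ("formal", "farewell"): "goodbye",
    ("casual", "farewell"): "see ya",
}

def sample_datapoint(rng):
    y = rng.choice(STYLES)     # 1. style y ~ p(y)
    z = rng.choice(CONTENTS)   # 2. content z ~ p(z), independent of y
    x = LEXICON[(y, z)]        # 3. datapoint x ~ p(x | y, z)
    return x, y, z

rng = random.Random(0)
corpus = [sample_datapoint(rng) for _ in range(1000)]

# Non-parallel corpora: sentences are grouped by style only; the content z is
# never observed, so no sentence in X1 is paired with its counterpart in X2.
X1 = [x for x, y, _ in corpus if y == "formal"]
X2 = [x for x, y, _ in corpus if y == "casual"]
```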

##### 3.1.3 (b) - Setup
- Two data sets with same content distribution but different styles $y_1, y_2$ (both unknown): $X_1$ drawn from $p(x_1 \mid y_1)$, $X_2$ drawn from $p(x_2 \mid y_2)$
- Want to estimate style transfer functions: $p(x_1 \mid x_2 ; y_1, y_2)$ and $p(x_2 \mid x_1; y_1, y_2)$

#### 3.1.4 Model
##### 3.1.4 (a) - Overview
- Use of auto-encoder model:
    1. Encoding step to infer content of target
    2. Decoding step to generate transferred counterpart
- Need content space of source and target to coincide; could employ VAE $[3]$
    
##### 3.1.4 (b) - Model specifics and loss
- Encoder E: infers content $z$ from sentence $x$ and style $y$
- Generator G: generates sentence $x$ from style $y$ and content $z$
- E, G form autoencoder when applied to same style

<img src = "img/csail_autoencoder_loss.png"/>  

- Align posteriors to have same distribution; use adversarial discriminator to distinguish between the dists
- _Goal_ : minimize encoder, generator loss; maximize discriminator distinction


<img src = "img/csail_loss_function.png"/>  
    
##### 3.1.4 (c) -  Proposed styles

1. _Aligned auto-encoder_ :

    - Implement encoder & decoder using single-layer RNNs with GRU
    - Need to align distributions: use discriminator D as feed-forward NN with single hidden layer, sigmoid output
    - Complicated loss function (see above)
    
2. _Cross-aligned auto-encoder_ :

    - Use two discriminators: $D_1$, $D_2$
        - $D_1$ : distinguish between real $x_1$ and transferred $x_2$
        - $D_2$ : distinguish between real $x_2$ and transferred $x_1$
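
A rough sketch of a discriminator of the kind described above: a feed-forward network with one hidden layer and a sigmoid output, plus a standard binary cross-entropy style adversarial loss. The ReLU hidden activation, the weight names, and the random inputs are assumptions for illustration; the authors' exact loss is the one in the figure above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def discriminator(h, W1, b1, w2, b2):
    """Single hidden layer, sigmoid output: probability that h came from the 'real' side."""
    hidden = np.maximum(0.0, h @ W1 + b1)  # ReLU is an assumption, not from the paper
    return sigmoid(hidden @ w2 + b2)

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Discriminator loss: low when real inputs score near 1 and transferred ones near 0."""
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

rng = np.random.default_rng(0)
feat_dim, hidden_dim = 6, 8  # illustrative sizes
W1 = rng.normal(scale=0.3, size=(feat_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.3, size=hidden_dim)
b2 = 0.0

real_batch = rng.normal(size=(5, feat_dim))         # e.g. representations from real x_1
fake_batch = rng.normal(size=(5, feat_dim)) + 1.0   # e.g. representations from transferred x_2
d_real = discriminator(real_batch, W1, b1, w2, b2)
d_fake = discriminator(fake_batch, W1, b1, w2, b2)
loss = adversarial_loss(d_real, d_fake)

chance_loss = adversarial_loss(np.full(4, 0.5), np.full(4, 0.5))  # discriminator at chance
perfect_loss = adversarial_loss(np.ones(4), np.zeros(4))          # discriminator is never fooled
```

The encoder and generator are trained to push the loss toward the chance value, while the discriminator is trained to push it toward zero.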
        
##### 3.1.4 (d) - Implementation

1. Use softmax distribution over words
    - During generating process of transferred $x_2$ from $G(y_1, z_2)$, feed peaked $\text{softmax}(a^{[\ell]} / \gamma)$ into next input
2. Use _Professor-Forcing_ $[6]$ to "match sequence of hidden states instead of output words"
    - Hidden states: information about outputs, and smoothly distributed
    - Input to $D_1$ : sequence of hidden states, either from:
        1. Teacher forcing by real example $x_1$; $G(y_1, z_1)$
        2. Self-fed by previous softmaxes; $G(y_1, z_2)$
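
The "peaked" softmax in step 1 is a softmax with temperature $\gamma$. A small sketch of the effect, with made-up logits and temperatures:

```python
import math

def softmax_with_temperature(logits, gamma):
    """softmax(logits / gamma); small gamma pushes the output toward one-hot."""
    scaled = [a / gamma for a in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # illustrative word scores
soft = softmax_with_temperature(logits, 1.0)    # ordinary softmax
peaked = softmax_with_temperature(logits, 0.1)  # nearly one-hot on the top word
```

Feeding the peaked distribution forward, instead of a sampled word, keeps the generation step differentiable while staying close to a discrete choice.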
    
        
#### 3.1.5 Evaluation
- Used model-based evaluation metric for sentiment modification: measure how often a transferred sentence has correct sentiment according to pre-trained sentiment classifier (see $[4, 5]$)
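
The metric itself is just classifier accuracy against the target sentiment. In the sketch below, a keyword rule stands in for the pre-trained sentiment classifier; the word list, sentences, and function names are hypothetical.

```python
def transfer_accuracy(transferred, target_labels, classify):
    """Fraction of transferred sentences whose predicted label matches the target label."""
    correct = sum(classify(s) == t for s, t in zip(transferred, target_labels))
    return correct / len(transferred)

POSITIVE_WORDS = {"great", "good", "love"}  # hypothetical stand-in for a trained classifier

def toy_classifier(sentence):
    tokens = set(sentence.lower().split())
    return "positive" if POSITIVE_WORDS & tokens else "negative"

outputs = [
    "the food was great",
    "i love this place",
    "service was terrible",  # transfer failed: the target was positive
    "the room was dirty",
]
targets = ["positive", "positive", "positive", "negative"]
acc = transfer_accuracy(outputs, targets, toy_classifier)
```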

<img src = "img/csail_professor_forcing.png"/> 

#### 3.1.6 Code re-implementations
- _Vinitra CSAIL summary paper version_ : [repo](https://github.com/vinitra/neural-text-style-transfer) - (see below, $3.2$)
- _J. Park Shakespeare non-parallel transfer_ : [repo](https://github.com/jpark96/language-style-transfer)
    
***

<a name="Paper_3.2"></a>
### 3.2 [Neural Style Transfer for Non-Parallel Text](https://github.com/vinitra/neural-text-style-transfer/blob/master/reports/Neural_Style_Transfer_for_Non_Parallel_Text.pdf)

- _Note_ : essentially a plain-English summary of the CSAIL paper $(3.1)$
- [_Full repo_](https://github.com/vinitra/neural-text-style-transfer)

#### 3.2.1  Model
1. One-layer RNN GRU to find latent representation of content variables ($z$)
2. Simple one-layer feedforward NN as discriminator; implicit distribution on $z$
3. Feed latent $z$ into decoder RNN (same architecture as encoder), joint optimization
4. Optimize decoder after latent representation learned
    - Use latent code $z$ and one-hot-encoded style representation
    - Train decoder RNN to reconstruct original sentence
    
<img src = "img/csail_summary_paper_model.png" style = "width: 500px"/> 

#### 3.2.2 Evaluation
    
##### 3.2.2 (a) - Content preservation
- Use model-based evaluation like in zebra/horse paper from class; transfer output back to original, then compare via BLEU score
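
A sketch of the comparison step, assuming the sentence has already been transferred to the other style and back. The score below is a simplified sentence-level BLEU (modified n-gram precision up to bigrams, geometric mean, brevity penalty, no smoothing), not the full corpus-level metric; the example sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(reference, candidate, max_n=2):
    """Simplified sentence-level BLEU against a single reference."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)

original = "the cat sat on the mat"
round_trip = "the cat sat on a mat"  # hypothetical output after transferring there and back
score = simple_bleu(original, round_trip)
```

A score near 1 suggests the content survived the round trip; a library implementation would be preferable for reported results.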

<img src = "img/csail_summary_paper_evaluation.png"/> 

##### 3.2.2 (b) - Style evaluation
- CSAIL paper equates _style_ and _sentiment_ : problem this paper talks about that we also need to consider
- Used authorship attribution as proxy for style: author classification as loss metric (could be adapted to our work)
- Could employ technique from $3.2.2 (a)$, again citing the horse paper; could try style transfer in reverse, then evaluate how well the styles match

***

<a name="Paper_3.3"></a>
### 3.3 [Disentangled Representation Learning for Non-Parallel Text Style Transfer](https://arxiv.org/pdf/1808.04339.pdf)

- _Relevant outside sources_ :
    - $[1]$ [Toward Controlled Generation of Text](https://arxiv.org/pdf/1703.00955.pdf) - latent spaces with generated text; _authors_ : Hu et al., 2017
    - $[2]$ [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/pdf/1409.3215.pdf); _authors_ : Sutskever et al., 2014
    - $[3.1]$ [Style Transfer from Non-Parallel Text by Cross-Alignment](https://arxiv.org/pdf/1705.09655.pdf); _authors_ : Shen et al., 2017
- _Code_ :
    - _Tensorflow_ : [authors' repo](https://github.com/vineetjohn/linguistic-style-transfer) - updated as of 2019
    - _Pytorch_ : [repo](https://github.com/h3lio5/linguistic-style-transfer-pytorch), also has [good paper & model summary](https://github.com/h3lio5/linguistic-style-transfer-pytorch/blob/master/summary.md)
    
#### 3.3.1 Introduction
- _Latent feature_ : no explicit meaning/interpretation to intermediate non-linear transformations ("black box" criticism of NNs)
- Autoencoder paradigm: sentence latent space as vector representation $\to$ disentangle to _style_ , _content_

##### 3.3.1 (a) - General approach
- Conflating sentiment with style -- need to address in our implementation
- Multi-task loss operates on latent space
- Adversarial loss "minimizes predictability of information that shouldn't be contained in that space"

##### 3.3.1 (b) - Content approximation
- Use __bag of words__ (BoW) features: style-neutral, non-stopwords $\implies$ disentangling the spaces
- Then, employ the models from the CSAIL paper $[3.1]$ and Hu et al., 2017 $[1]$
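
A small sketch of a style-neutral BoW target, assuming stopwords and style-bearing words are simply filtered out before counting. The word lists, vocabulary, and normalization are hypothetical choices for illustration.

```python
from collections import Counter

STOPWORDS = {"the", "a", "was", "is", "and"}         # hypothetical stopword list
STYLE_WORDS = {"great", "terrible", "love", "hate"}  # hypothetical sentiment lexicon

def content_bow(sentence, vocab):
    """Normalized bag-of-words over content words only."""
    tokens = [
        t for t in sentence.lower().split()
        if t not in STOPWORDS and t not in STYLE_WORDS
    ]
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocab]

vocab = ["food", "service", "room", "pizza"]  # hypothetical content vocabulary
bow_pos = content_bow("the pizza was great", vocab)
bow_neg = content_bow("the pizza was terrible", vocab)
```

Two sentences that differ only in sentiment words map to the same BoW vector, which is what makes it usable as a content-only training signal.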

##### 3.3.1 (c) - Non-parallel text style transfer
- Train model on non-parallel but _style-labeled_ corpora
    1. Train autoencoder with disentangled latent spaces
    2. For inference: autoencoder encodes content, but ignores encoded style
    3. Empirically infer style from training data
    4. Concatenate content from (2) and style from (3), feed into decoder
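
Steps 2 to 4 can be sketched with plain vectors. Here the empirical style is assumed to be the mean of the style vectors encoded from target-style training sentences; the dimensions, the concatenation order, and the random stand-ins for encoder outputs are all illustrative.

```python
import numpy as np

def empirical_style(style_vectors):
    """Step 3: average the style vectors encoded from target-style training sentences."""
    return np.mean(style_vectors, axis=0)

def build_decoder_input(content_vec, style_vec):
    """Step 4: concatenate the target style with the encoded content."""
    return np.concatenate([style_vec, content_vec])

rng = np.random.default_rng(0)
style_dim, content_dim = 2, 4  # illustrative sizes

# Stand-ins for encoder outputs; a trained autoencoder would produce these.
train_styles_target = rng.normal(loc=1.0, size=(100, style_dim))
encoded_style = rng.normal(size=style_dim)      # step 2: computed but ignored
encoded_content = rng.normal(size=content_dim)  # step 2: kept

target_style = empirical_style(train_styles_target)
decoder_input = build_decoder_input(encoded_content, target_style)
```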
    
#### 3.3.2 Importance of related works
- $[1]$ _Hu et al., 2017_ :
    - Control sentiment (style proxy) by "using discriminators to reconstruct sentiment, content from generated sentences"
    - Criticism: no evidence that this disentangles the latent space
- $[3.1]$ _Shen et al., 2017 (CSAIL)_ :
    - Pair of adversarial discriminators to align recurrent hidden decoder states of original & style-transferred sentences
- _Summary_ : most authors up to 2019 basically ignored the content space
    - This paper encodes input from the start, which allows for easier analysis
    
#### 3.3.3 Model overview
- Builds heavily on Sutskever et al., 2014, Seq2Seq model $[2]$

***

<a name="Paper_3.4"></a>
### 3.4 [Style Transfer in Text: Exploration and Evaluation](https://arxiv.org/pdf/1711.06861.pdf)

#### 3.4.1 Models
##### 3.4.1 (a) - Summary
<img src = "img/style_transfer_in_text_models.png" style = "width: 500px"/> 



***
***

## 4. Evaluation

<a name="Eval_4.1"></a>
### 4.1 [Massive Exploration of Neural Machine Translation Architectures](https://arxiv.org/pdf/1703.03906.pdf)

- _Source_ : results from Google Brain experiments in 2017 to compare NMT models

#### 4.1.1 Results
- Large embeddings (they tried $2048$ dimensions) got best results, but only by small margin
    - Small embeddings can generally get most semantic information ($\geq 128$)
- LSTMs consistently outperformed GRU in BLEU scores
- Need residual connections for deep-layered decoders
- Overall best results when using parameterized additive attention
- Importance of well-tuned beam search: recommend beam width $\in [5, 10]$ and length penalty of $\approx 1.0$
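
To make the length penalty concrete, the sketch below rescores finished beam hypotheses with the GNMT-style normalization $lp(Y) = \big((5 + |Y|) / 6\big)^{\alpha}$. That this is the exact form behind the recommendation above is an assumption; the hypotheses and log-probabilities are made up.

```python
def length_penalty(length, alpha):
    """GNMT-style length normalization; alpha = 0 disables it."""
    return ((5.0 + length) / 6.0) ** alpha

def rescore(hypotheses, alpha):
    """Sort (tokens, log_prob) pairs by length-normalized score, best first."""
    return sorted(
        hypotheses,
        key=lambda h: h[1] / length_penalty(len(h[0]), alpha),
        reverse=True,
    )

beam = [
    (["ok"], -1.2),                               # short, higher raw log-probability
    (["that", "is", "fine", "with", "me"], -1.8),  # longer, lower raw log-probability
]
best_raw = rescore(beam, alpha=0.0)[0][0]
best_norm = rescore(beam, alpha=1.0)[0][0]
```

Without normalization the short hypothesis wins because every extra token adds a negative log-probability; with $\alpha = 1.0$ the longer one is preferred.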

***

### 4.2