# Model Notes

## 1. Architecture Overview

***

### 1.1 Encoder-Decoder Model

- _Source_ : [Seq2Seq documentation](https://google.github.io/seq2seq/)

#### 1.1.1 Encoder-decoder overview

- __Encoder__ :
    * Reads in source data, produces feature representation in continuous space
        * RNN; input sequence of words $\to$ fixed vector that is $\approx$ meaning
        * CNN; input img $\to$ volume that contains higher-level features of img
    * _Idea_ : representation from encoder used by decoder to generate new data
    * [_List of encoders_](https://google.github.io/seq2seq/encoders/)
    
- __Decoder__ :
    * Generative model, conditioned on representation from encoder
        * RNN decoder: learns to generate translation for encoded sequence in another language
    * [_List of decoders_](https://google.github.io/seq2seq/decoders/)
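
A minimal sketch of the encoder/decoder split described above, in plain Python. The names `EMBED`, `encode`, and `decode` and the toy 2-d embeddings are hypothetical and are not taken from the seq2seq library. The encoder here simply sums embeddings, so unlike an RNN encoder it ignores word order; the point is only that inputs of any length map to one fixed-size vector that the decoder is conditioned on.

```python
# Toy 2-d embeddings; the vocabulary and values are made up for illustration.
EMBED = {"hello": [1.0, 0.0], "world": [0.0, 1.0]}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def encode(tokens):
    """Encoder: map a variable-length token list to one fixed-size vector."""
    vec = [0.0, 0.0]
    for tok in tokens:
        vec = [v + e for v, e in zip(vec, EMBED[tok])]
    return vec

def decode(context, steps):
    """Decoder: generate `steps` tokens conditioned only on `context`."""
    state = list(context)
    output = []
    for _ in range(steps):
        # Pick the word whose embedding best matches what is left in the state.
        word = max(EMBED, key=lambda w: dot(EMBED[w], state))
        output.append(word)
        # Remove the emitted word's contribution before the next step.
        state = [s - e for s, e in zip(state, EMBED[word])]
    return output

short = encode(["hello"])
longer = encode(["hello", "world", "hello"])
generated = decode(encode(["world", "hello"]), steps=2)
```

Both `short` and `longer` have the same dimensionality even though the inputs differ in length, which is the property the decoder relies on.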
    
#### 1.1.2 Encoder-decoder model
    
- __The model__ :
    * Defines how encoder/decoder put together & how to calculate/minimize loss fn
    * [_List of models_](https://google.github.io/seq2seq/models/)
    
    <img src = "img/google_brain_encoder_decoder.png" style = "width: 500px"/>

### 1.2 RNN: Recurrent Neural Network
- _Sources_ : 
    - $[1]$ [RNN Wiki](https://en.wikipedia.org/wiki/Recurrent_neural_network)
    - $[2]$ [Coursera Slides](https://cs230.stanford.edu/files/C5M1.pdf)
    - $[3]$ [Massive Exploration of Neural Machine Translation Architectures](https://arxiv.org/pdf/1703.03906.pdf) (Google Brain)

#### 1.2.1 Overview $[1]$
- _Problems with standard NN_ :
    - Inputs, outputs can be different lengths
    - Doesn't share features learned across different positions of text
- Connections between nodes form directed graph along temporal sequence; allows temporal dynamic behavior
- Two classes of RNNs:
    1. _Finite impulse_ : response to input is of finite duration (settles to zero in finite time)
        - Is DAG, can be "unrolled" and replaced with strictly feedforward NN
    2. _Infinite impulse_ : may have internal feedback, and may continue to respond indefinitely (although usually decay)
        - Directed cyclic graph, cannot be unrolled
- Can have additional stored states, potentially controlled by NN $\to$ incorporating time delays/feedback loops leads to LSTM gated memory principles
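
The two problems with standard NNs listed above (variable lengths, no feature sharing across positions) can be illustrated with a scalar vanilla RNN. This is a sketch under simplifying assumptions: one hidden unit, a tanh activation, and arbitrary parameter values chosen for illustration. The function name `rnn_forward` is hypothetical.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar vanilla RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b).

    The same three parameters are reused at every time step, so the
    sequence can have any length.
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

params = dict(w_x=0.5, w_h=0.8, b=0.0)  # arbitrary illustrative values
states_short = rnn_forward([1.0, 0.0], **params)
states_long = rnn_forward([1.0, 0.0, 0.0, 0.0, 0.0], **params)
```

The loop is the "unrolled" view of the recurrence: each iteration is one copy of the same layer applied at the next position.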

#### 1.2.2 RNN types $[2]$

<img src = "img/coursera_rnn_types.png" style = "width: 500px"/>

#### 1.2.3 Issue of vanishing gradients, illustrated $[2]$

<img src = "img/coursera_vanishing_grads.png" style = "width: 500px"/>
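
A small numeric sketch of why gradients vanish, assuming a scalar recurrence $h_t = \tanh(w_h h_{t-1})$ with no inputs. The function name `gradient_through_time` and the value $w_h = 0.9$ are assumptions made for illustration, not taken from the slides.

```python
import math

def gradient_through_time(w_h, steps, h0=0.5):
    """Return |d h_T / d h_0| for h_t = tanh(w_h * h_{t-1}).

    By the chain rule this is a product of per-step factors
    w_h * (1 - h_t ** 2); when each factor is below 1 in magnitude
    the product shrinks exponentially with the number of steps.
    """
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w_h * h)
        grad *= w_h * (1.0 - h * h)
    return abs(grad)

for steps in (1, 10, 50):
    print(steps, gradient_through_time(w_h=0.9, steps=steps))
```

Since every factor is at most $0.9$ here, the gradient after $T$ steps is bounded by $0.9^T$, so early time steps receive almost no learning signal from late errors.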
    

#### 1.2.4 GRU: gated recurrent unit
- Gating mechanism in RNN: like LSTM with forget gate, but has fewer parameters (lacks output gate)
- Also addresses the vanishing gradients problem
- Can have better performance than LSTM on small datasets; but LSTM cells consistently outperformed GRU cells in the NMT experiments of $[3]$
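
To make the "fewer parameters" point concrete, here is a rough parameter count under the textbook formulation, where every gate (and the candidate update) has input weights, recurrent weights, and one bias vector. The function names are hypothetical, and specific libraries may count differently (for example by keeping two bias vectors per gate).

```python
def gate_block_params(input_size, hidden_size):
    """Parameters in one gate: input weights, recurrent weights, and bias."""
    return hidden_size * (input_size + hidden_size) + hidden_size

def lstm_params(input_size, hidden_size):
    # Input, forget, and output gates plus the candidate cell update.
    return 4 * gate_block_params(input_size, hidden_size)

def gru_params(input_size, hidden_size):
    # Update and reset gates plus the candidate hidden state.
    return 3 * gate_block_params(input_size, hidden_size)

print(lstm_params(10, 20), gru_params(10, 20))
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.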

### 1.3 LSTM: Long Short-Term Memory

- _Sources_ : 
    - $[1]$ [Long Short-Term Memory](https://www.bioinf.jku.at/publications/older/2604.pdf) 
        - Original paper on LSTM, cited in Coursera
    - $[2]$ [LSTM Wiki](https://en.wikipedia.org/wiki/Long_short-term_memory)
    - $[3]$ [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/pdf/1409.3215.pdf)

#### 1.3.1 Motivation $[1]$
- DNNs can only be applied to problems where inputs and targets are encoded as vectors of fixed dimensionality; not good for speech
- Conventional DNNs have issue of compounding error in backprop
- _LSTM soln_ : when error values back-propagated, error remains in LSTM cell $\implies$ "error carousel" feeds error back into each of LSTM gates, until they learn to cut off value


#### 1.3.2 What is an LSTM? $[1, \, 2]$
- Has feedback connections, unlike standard _feedforward_ NNs; in feedforward, node connections don't form a cycle
- _Idea_ :
    - Cell remembers values, and three gates regulate flow of information into/out of cell
    - Useful for data with lags
    - Each gate has activation: marked increase in number of parameters
    - Robustness to range of parameters due to error control
    
#### 1.3.3 LSTM units $[1, \, 2]$
- _LSTM unit_ :  
    - __Cell__ : keeps track of dependencies between elements in input sequence
    - __Input gate__ : controls net flow into cell
    - __Forget gate__ : controls how much value remains in cell
    - __Output gate__ : controls extent to which cell value used to compute activation
- _Why gate units?_ :
    - Avoids weight conflicts
    - Input gate learns when to release errors by appropriate scaling
    
<img src = "img/coursera_lstm.png" style = "width: 500px"/>    
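
The roles of the cell and the three gates can be sketched as a single scalar LSTM step. This follows the common formulation with a forget gate; the function name `lstm_step`, the parameter layout, and the numeric values are assumptions for illustration only.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell with input, forget, and output gates.

    `p` maps each of "i", "f", "o", "g" to a (w_x, w_h, b) tuple.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))      # input gate: how much new value enters the cell
    f = sigmoid(pre("f"))      # forget gate: how much old cell value remains
    o = sigmoid(pre("o"))      # output gate: how much of the cell is exposed
    g = math.tanh(pre("g"))    # candidate value to write into the cell
    c = f * c_prev + i * g     # updated cell state
    h = o * math.tanh(c)       # activation computed from the gated cell
    return h, c

# Biases chosen so the input gate is closed and the forget gate is open.
hold = {"i": (0.0, 0.0, -20.0), "f": (0.0, 0.0, 20.0),
        "o": (0.0, 0.0, 20.0), "g": (1.0, 0.0, 0.0)}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=hold)
```

With the input gate closed and the forget gate open, the cell value passes through the step almost unchanged, which is how the cell keeps track of long-range dependencies.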
    
#### 1.3.4 Application to NLP models $[3]$
- _Property of LSTM_ : learns to map input sentence of variable length onto fixed-dimensional vector representation
    - Translation objective encourages LSTM to find sentence representations that capture meaning
    - Paper claim: model aware of word order and fairly invariant to active/passive voice

***
***

## 2. Generalized Models

***

### 2.1 Seq2Seq

- _Sources_ : 
    - [Seq2Seq documentation](https://google.github.io/seq2seq/)
    - [Wiki](https://en.wikipedia.org/wiki/Seq2seq)

#### 2.1.1 Model Overview
- Turns one sequence into another; uses RNN, LSTM, or GRU

***
***

## 3. Paper Implementations: Non-Parallel Corpora

***

### 3.1 [Style Transfer from Non-Parallel Text by Cross-Alignment](https://arxiv.org/pdf/1705.09655.pdf)

#### 3.1.1 Introduction
- Learn encoder $\to$ take sentence and its style as input $\to$ map to style-independent content
- Do not use VAE ( _need to research_ ) - need to preserve latent content

<img src = "img/non_parallel_model.png" style = "width: 500px"/>  

- Task evaluation:
    1. Sentiment modification
    2. Decipherment of word substitution ciphers
    3. Recovery of word order
    
#### 3.1.2 Related work
- _Vision_ : many style-transfer methods from vision cannot be applied directly because text is discrete
- [Toward Controlled Generation of Text](https://arxiv.org/pdf/1703.00955.pdf):
    - Generating sentences with controllable attributes by learning disentangled latent representations
- [InfoGAN: Representation Learning](https://arxiv.org/pdf/1606.03657.pdf)
    - More on latent spaces
    
#### 3.1.3 Formulation
- _Motivation_ : want to recover joint distributions of style in order to facilitate style transfer
    - __Claim__ : datasets from different styles need to be distinct enough for this paper's methods to work
- _Conclusions_ : latent content should have most complexity from input, and latent style variable should have simple effects

#### 3.1.4 Model
- Use of auto-encoder model:
    1. Encoding step to infer content of target
    2. Decoding step to generate transferred counterpart
- Need content space of source and target to coincide; could employ VAE
    - Probabilistic VAE: [Auto-Encoding Variational Bayes](https://arxiv.org/pdf/1312.6114.pdf)
- _Proposed models_ :
    1. __Aligned auto-encoder__ :
        - Implement encoder & decoder using single-layer RNNs with GRU
        - E: takes input x, outputs content z
        - G: generates sentence x conditioned on latent state
        - Adversarial discriminator D
        - Need to align distributions: use discriminator D as feed-forward NN with single hidden layer, sigmoid output
        - Complicated loss function!
    2. __Cross-aligned auto-encoder__ :
        - Aligns transferred samples from one style with true samples from other
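
A rough sketch of the discriminator described for the aligned auto-encoder: a feed-forward net with one hidden layer and a sigmoid output, scored with binary cross-entropy on content vectors from the two styles. The tanh hidden activation, the function names, and all weights are assumptions of this sketch; the paper's full objective also includes a reconstruction term that is not shown here.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def discriminator(z, w1, b1, w2, b2):
    """Feed-forward net with one hidden layer and a sigmoid output.

    Returns the estimated probability that content vector `z` came from
    style 1. The tanh hidden activation is an assumption of this sketch.
    """
    hidden = [math.tanh(sum(w * x for w, x in zip(row, z)) + b)
              for row, b in zip(w1, b1)]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)

def adversarial_loss(z_style1, z_style2, params):
    """Binary cross-entropy of D on content vectors from the two styles."""
    loss1 = sum(-math.log(discriminator(z, *params)) for z in z_style1)
    loss2 = sum(-math.log(1.0 - discriminator(z, *params)) for z in z_style2)
    return loss1 / len(z_style1) + loss2 / len(z_style2)

# Hypothetical weights: 2-d content vectors, 2 hidden units, zero output layer.
params = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [0.0, 0.0], 0.0)
loss = adversarial_loss([[0.3, -0.2]], [[0.1, 0.4]], params)
```

With a zero output layer the discriminator outputs $0.5$ for every input, so the loss equals $2 \log 2$. That is the value an ideal discriminator is pushed toward when the encoder has aligned the two content distributions and they can no longer be told apart.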
        
#### 3.1.5 Implementation
- Used model-based evaluation metric for sentiment modification: measure how often a transferred sentence has correct sentiment according to pre-trained sentiment classifier
    - See [CNN for Sentence Classification](https://arxiv.org/pdf/1408.5882.pdf), [word2vec](https://code.google.com/archive/p/word2vec/)
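
The model-based metric above amounts to an accuracy computation over transferred sentences. A minimal sketch, assuming a `classify` callable that stands in for the pre-trained sentiment classifier; `transfer_accuracy`, `toy_classifier`, and the example sentences are all hypothetical.

```python
def transfer_accuracy(transferred, target_labels, classify):
    """Fraction of transferred sentences whose predicted sentiment matches the target."""
    correct = sum(1 for s, y in zip(transferred, target_labels) if classify(s) == y)
    return correct / len(transferred)

def toy_classifier(sentence):
    # Stand-in for a pre-trained classifier: keyword lookup, illustration only.
    return "positive" if "good" in sentence.split() else "negative"

accuracy = transfer_accuracy(
    ["the food was good", "the service was bad", "good place"],
    ["positive", "positive", "positive"],
    toy_classifier,
)
```

Note that this metric only checks the style of the output; it says nothing about whether the content of the original sentence was preserved.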
    
***

### 3.2 [Disentangled Representation Learning for Non-Parallel Text Style Transfer](https://arxiv.org/pdf/1808.04339.pdf)

***

### 3.3 [Neural Style Transfer for Non-Parallel Text](https://github.com/vinitra/neural-text-style-transfer/blob/master/reports/Neural_Style_Transfer_for_Non_Parallel_Text.pdf)