# Data Science on Textual Information
> Papers on NLP, transformers, NER, etc.

## Named Entity Recognition (NER)

#### [Cross-type Biomedical Named Entity Recognition with Deep Multi-Task Learning](https://arxiv.org/abs/1801.09851)
> Wang et al. 2018-10-09

> A paper focused on BioNER that beats the previous **SOTA on 14** out of 15 NER tasks, official [source code here](https://github.com/yuzhimanhua/lm-lstm-crf)

> #### Techniques:
* **Character level** tokenization + char-level embedding + char-level LSTM (1st LSTM layer)
* **Word level** tokenization + word-level embedding + word-level LSTM (2nd LSTM layer)
* Missing word tokens (**OOV**: out of vocabulary) are handled by taking **the hidden state of the char-level LSTM at the word boundary positions** and concatenating it with the word embedding as the input to the word-level LSTM. So when a word maps to ```[UNK]``` in the word-level vocab, the model can still infer from the hidden state of the char-level model
* A CRF (**Conditional Random Field**) is used for classification, see the official [pytorch implementation](https://github.com/yuzhimanhua/LM-LSTM-CRF/blob/master/model/lstm_crf.py) and ```Lample et al.```
* The multi-task problem:
    * each task has too little data
    * each task has different class labels, so a simple combination of the datasets leads to many false negative predictions
    * This paper tries tackling the above by **sharing weights**:
        * The LSTM model weights are divided to $\theta_{w},\theta_{c},\theta_{o}$
        * as weights from **w**ord level, **c**har level, **o**utput layer (CRF)
        * Experiments went through: 
            * only share $\theta_{w}$, model MTM-W
            * only share $\theta_{c}$, model MTM-C
            * share both $\theta_{c}$ and $\theta_{w}$, model MTM-CW (awesome one)
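
The three sharing variants above can be sketched as follows. This is a minimal illustration, not the paper's implementation: plain Python lists stand in for the LSTM and CRF weights, and `make_params`, `build_multitask_models` and the task names are hypothetical names introduced here.

```python
import random

def make_params(name, size, rng):
    """Create a stand-in parameter group (a named list of floats)."""
    return {"name": name, "weights": [rng.uniform(-0.1, 0.1) for _ in range(size)]}

def build_multitask_models(tasks, share_char, share_word, seed=0):
    """Return one {theta_c, theta_w, theta_o} dict per task.

    A shared group is the *same* Python object in every task, so an update
    made while training on one dataset is seen by all the other tasks.
    theta_o (the CRF output layer) is always task-specific, because each
    dataset comes with its own label set.
    """
    rng = random.Random(seed)
    shared_c = make_params("theta_c(shared)", 4, rng) if share_char else None
    shared_w = make_params("theta_w(shared)", 4, rng) if share_word else None
    models = {}
    for task in tasks:
        models[task] = {
            "theta_c": shared_c if shared_c is not None else make_params(f"theta_c({task})", 4, rng),
            "theta_w": shared_w if shared_w is not None else make_params(f"theta_w({task})", 4, rng),
            "theta_o": make_params(f"theta_o({task})", 4, rng),
        }
    return models

tasks = ["task_a", "task_b"]  # hypothetical task ids, one per BioNER dataset
mtm_w = build_multitask_models(tasks, share_char=False, share_word=True)
mtm_c = build_multitask_models(tasks, share_char=True, share_word=False)
mtm_cw = build_multitask_models(tasks, share_char=True, share_word=True)
```

In `mtm_cw` both tasks point at the same `theta_c` and `theta_w` objects, while each task keeps its own `theta_o`.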
            
> #### Other mention
* Explained the BioNER datasets from [MTL datasets](https://github.com/cambridgeltl/MTL-Bioinformatics-2016)
* They all follow the IOBES scheme:
    * I: token **i**nside an entity
    * O: n**o**t an entity
    * B: token at the **b**eginning of an entity
    * E: token at the **e**nd of an entity
    * S: **s**ingle token entity
* Other NER (solid old school) systems
    * CHEMDNER (Lu et al., 2015), CRF + Brown clustering of words
    * TaggerOne (Leaman and Lu, 2016), semi-Markov model for joint entity recognition and normalization
    * JNLPBA (Zhou and Su, 2004), using HMM
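
To make the IOBES scheme concrete, here is a small sketch that turns entity spans into tags. The function name `spans_to_iobes`, the toy sentence and its labels are made up for illustration and are not taken from any of the datasets.

```python
def spans_to_iobes(num_tokens, spans):
    """Convert entity spans to IOBES tags.

    spans: list of (start, end, label) with `end` inclusive; spans are
    assumed not to overlap.
    """
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if start == end:
            # A one-token entity gets the S tag.
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"
            tags[end] = f"E-{label}"
    return tags

tokens = ["acute", "myeloid", "leukemia", "and", "aspirin"]
spans = [(0, 2, "Disease"), (4, 4, "Chemical")]
tags = spans_to_iobes(len(tokens), spans)
# tags == ["B-Disease", "I-Disease", "E-Disease", "O", "S-Chemical"]
```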

## Pretrain & Transfer Learning

#### [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146)
> This paper from Jeremy Howard et al. compares and discusses pretrained models at length.
The paper performs pretraining and fine-tuning in the following 3 steps
* **step 1** Train the LM on general textual material, or on text from a related domain
* **step 2** Fine-tune the LM on the task text
* **step 3** Fine-tune the classifier on task text vs task labels

> The paper experiments with many useful techniques, such as
* Discriminative fine-tuning (step 2): a different LR for each layer; after some experiments, $\eta^{l-1} = \eta^{l} / 2.6$ is usually good
* Slanted triangular learning rates (step 2): first **linearly increase** the learning rate and then **linearly decay** it.
* Concat pooling (step 3): for hidden states $H = \{h_{1},h_{2},...,h_{T}\}$:
$h_{c} = [h_{T},maxpool(H),meanpool(H)]$, where $[]$ is the concatenation
* Gradual unfreezing (step 3): starting from the top layer, **unfreeze 1 layer per epoch** and train all the unfrozen layers in that epoch (use this together with discriminative fine-tuning)
* **Back propagation through time** for text classification (BPT3C) (step 3): the document is split into fixed-length batches, and the final state of the previous batch is used as the initial state of the next batch
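
The two learning-rate tricks above can be written down in a few lines. This is a sketch under assumptions: the defaults `lr_max=0.01`, `cut_frac=0.1` and `ratio=32` are the values reported in the ULMFiT paper rather than in these notes, the function names are made up here, and the `max(1, ...)` guard is an addition to avoid a division by zero on very short schedules.

```python
import math

def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t: short linear warm-up, long linear decay."""
    cut = max(1, math.floor(total_steps * cut_frac))  # step where the LR peaks
    if t < cut:
        p = t / cut                                   # increasing part
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decaying part
    # The LR moves between lr_max / ratio (p = 0) and lr_max (p = 1).
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(lr_top, num_layers, factor=2.6):
    """Per-layer LRs, index 0 = lowest layer, last index = top layer.

    Each layer gets the LR of the layer above it divided by `factor`.
    """
    return [lr_top / factor ** (num_layers - 1 - layer) for layer in range(num_layers)]

schedule = [slanted_triangular_lr(t, total_steps=1000) for t in range(1000)]
layer_lrs = discriminative_lrs(lr_top=0.01, num_layers=3)
```

With 1000 steps the schedule peaks at step 100 and then decays back towards `lr_max / ratio`.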

## Transformers

### Basics

#### Transformer paper: [Attention is all you need](https://arxiv.org/abs/1706.03762)

> Background: do transduction tasks entirely with self-attention, without using RNN
* This paper experiments on a transduction task: machine translation
* Learn the vector representation $z = (z_{1},...,z_{n})$ from the symbol representation $x = (x_{1},...,x_{n})$, to generate the output sequence $(y_{1},...,y_{m})$ with better **parallelized computation** than CNN and RNN; a self-attention layer is also cheaper than a recurrent layer when the sequence length $n$ is smaller than the representation dimension $d$
* The paper spends many paragraphs discussing the model structure
    * **Encoder, Decoder** stacks
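        * Encoder & Decoder each consist of 6 identical layers:
        * an encoder layer consists of 2 **sublayers** (a decoder layer adds a third one, the encoder-decoder attention); each sublayer is wrapped as $LayerNorm(x+Sublayer(x))$, i.e. a residual connection followed by layer normalization
        * 1 sublayer is **multi-head self-attention**, the other is the **position-wise feed forward** network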
    * About the multi-head self-**Attention**:
        * query, key & value structure:
            * Encoder-decoder attention: Q from the decoder, K,V from the encoder, mimics seq2seq attention
            * encoder, self-attention with Q,K,V
            * decoder, self-attention with Q,K,V, masking out illegal connections (to future positions)
        * $Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_{k}}})V$. After the dot product, the values grow large in magnitude when $d_{k}$ is large, which pushes the softmax into regions with extremely small gradients, hence the division by $\sqrt{d_{k}}$
        * Why & how we practice multi-heading, e.g. $d_{model} = 512$, if $h=8$, $d_{k}=d_{v}=d_{model}/h=64$
    * Position-wise **feed forward**
    * Positional encoding: with no RNN and no CNN, the model cannot tell the absolute or relative position of tokens
        * **fixed** (sinusoidal) & **learned** positional encodings are added to the embedding output; the learned version **works** about as well as the fixed one, but the fixed one may extrapolate better to longer sequences
* Self-attention generated from this model, when visualized, can show text syntax structure
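
The attention formula and the fixed positional encoding above can be sketched in plain Python. This is a minimal, unbatched, single-head illustration with lists instead of tensors; the function names are made up here, and the `causal` flag corresponds to the decoder mask on illegal connections.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q: n x d_k, K: m x d_k, V: m x d_v (lists of lists). Returns n x d_v."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Decoder self-attention: position i must not look at j > i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

def sinusoidal_positional_encoding(num_positions, d_model):
    """Fixed encoding: sin on even dimensions, cos on odd dimensions."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

d_model, h = 512, 8
d_k = d_model // h  # 64 dimensions per head

Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
masked_out = scaled_dot_product_attention(Q, K, V, causal=True)
```

With the causal mask, the first position can only attend to itself, so the first output row is just the first row of `V`.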

### GPT based

#### GPT paper [Improving Language Understanding by Generative Pre-Training, Radford et al.](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)

> Uses a Transformer-like (decoder only) model structure with **2 stages**
* **Unsupervised pretraining** Text generation as pretraining
* **Supervised finetuning** Discriminative finetuning on a specific task (with an auxiliary objective)

> Framework
* Unsupervised training: language modeling, guessing the next word, trying to maximize the following
    * $L_{1}(U) = \sum_{i} \log P(u_{i}|u_{i-k},...,u_{i-1};\Theta)$
* Supervised: an extra parameter $W_{y}$ predicts $y$ via $P(y|x^{1},...,x^{m}) = softmax(h_{l}^{m}W_{y})$, where $h_{l}^{m}$ is the last layer's output vector for the last token; trying to maximize:
    * $L_{2}(C) = \sum_{(x,y)} \log P(y|x^{1},...,x^{m})$
    * Train with auxiliary objective: $L_{3}(C) = L_{2}(C)+\lambda*L_{1}(C)$
    * at this stage, the only extra parameters introduced are $W_{y}$ and the embeddings for the delimiter tokens
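
The fine-tuning objective above can be illustrated with made-up numbers. In this sketch the function names, the vector `h_last`, the matrix `W_y` and the token probabilities are all hypothetical; only the formulas follow the notes.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def label_distribution(h_last, W_y):
    """softmax(h W_y): h_last has d entries, W_y is d x num_classes."""
    num_classes = len(W_y[0])
    logits = [sum(h_last[r] * W_y[r][c] for r in range(len(h_last)))
              for c in range(num_classes)]
    return softmax(logits)

def lm_log_likelihood(token_probs):
    """L1: sum of log P(u_i | previous k tokens) over the tokens."""
    return sum(math.log(p) for p in token_probs)

def finetune_objective(label_prob, token_probs, lam=0.5):
    """L3 = L2 + lambda * L1 for one labelled example (to be maximized)."""
    l2 = math.log(label_prob)
    return l2 + lam * lm_log_likelihood(token_probs)

h_last = [1.0, 0.0]                # made-up final-layer vector of the last token
W_y = [[0.0, 0.0], [5.0, 5.0]]     # made-up 2 x 2 classification weights
probs = label_distribution(h_last, W_y)   # both logits are 0, so [0.5, 0.5]
objective = finetune_objective(probs[0], token_probs=[0.5, 0.25])
```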
    
> Data input
* When fine-tuning on QA and other structured inputs, convert the structured data into one long token sequence separated by special tokens
* The paper pretrained on BooksCorpus dataset
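
A rough sketch of that input transformation, assuming already tokenized text. The token strings `<s>`, `$` and `<e>` and the function names are placeholders chosen here; the layout (start token, delimiter between the parts, extract token at the end) follows the paper's description.

```python
START, DELIM, EXTRACT = "<s>", "$", "<e>"  # hypothetical special token strings

def entailment_input(premise, hypothesis):
    """One sequence: start, premise, delimiter, hypothesis, extract."""
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def multiple_choice_inputs(context, question, answers):
    """One sequence per candidate answer; each one is scored separately."""
    return [
        [START] + context + question + [DELIM] + answer + [EXTRACT]
        for answer in answers
    ]

pair = entailment_input(["a", "man", "sleeps"], ["a", "person", "rests"])
choices = multiple_choice_inputs(["it", "rained"], ["why", "wet", "?"],
                                 [["rain"], ["sun"]])
```

The hidden state at the final (extract) position is what the classification head reads.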

> Model
* **Decoder** only: each decoder block is a transformer layer with **masked** self-attention heads, 12 layers in total
* Positional embeddings are **learnt**, not fixed
* BPE tokenized
* GELU instead of ReLU
* The python library ftfy ([fix text for you](https://ftfy.readthedocs.io/en/latest/)) cleans the text
* spaCy tokenizes the text
* $\lambda = 0.5$ 


### BERT based

#### BERT Paper, [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al.](https://arxiv.org/abs/1810.04805)

> Since GPT is unidirectional, the "most important new contribution" of BERT is that it makes the pretraining task **bidirectional**. For the purpose of comparison with GPT, BERT shares many hyper-parameters with GPT (to be specific, $BERT_{base}$). Comparing the structures, GPT masks out the right-context input at each layer.

> Use **encoder** instead of decoder, the bidirectional Transformer is often referred to as a "Transformer encoder", while the left-context-only version is referred to as "Transformer decoder"
* only encoder
* also uses learned positional embeddings, plus additional segment embeddings (one emb for A of the pair, one for B of the pair)
* ```[CLS]``` is the first token of every input; its corresponding vector in the last transformer output is used for fine-tuning classification
* Pretraining converges slightly slower, but the model performs much better on the GLUE benchmark
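
A small sketch of how a sentence pair could be packed into one input, assuming already tokenized sentences. The `[SEP]` separator comes from the BERT paper rather than from these notes, and `pack_pair` is a name made up here.

```python
def pack_pair(tokens_a, tokens_b):
    """Build tokens, segment ids and position ids for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Learned positional embeddings are looked up by these indices.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = pack_pair(["my", "dog"], ["it", "barks"])
# tokens       == ["[CLS]", "my", "dog", "[SEP]", "it", "barks", "[SEP]"]
# segment_ids  == [0, 0, 0, 0, 1, 1, 1]
```

The final input embedding of each token would be the sum of its token, segment and position embeddings.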

> Model pretrained on 2 tasks
* Masked Language Modeling (**MLM**): 15% of the word-piece tokens are randomly picked and masked
* Next Sentence Prediction (**NSP**): is / is not the next sentence, 50% random
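
Both pretraining tasks can be sketched as data-preparation functions. Assumptions to note: the 80% / 10% / 10% split between `[MASK]`, a random token and the unchanged token comes from the BERT paper, not from these notes; the negative NSP sentence is drawn from the same small list instead of a whole corpus; the function names and toy inputs are made up.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Pick ~15% of positions; return (corrupted tokens, prediction targets)."""
    n_pick = max(1, round(len(tokens) * mask_prob))
    picked = rng.sample(range(len(tokens)), n_pick)
    masked = list(tokens)
    labels = [None] * len(tokens)      # None = position is not predicted
    for i in picked:
        labels[i] = tokens[i]
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

def make_nsp_example(sentences, idx, rng):
    """Return (A, B, is_next); B is the true next sentence half of the time."""
    a = sentences[idx]
    if rng.random() < 0.5:
        return a, sentences[idx + 1], True
    candidates = [s for j, s in enumerate(sentences) if j not in (idx, idx + 1)]
    return a, rng.choice(candidates), False

rng = random.Random(0)
words = [f"w{i}" for i in range(20)]
masked, labels = mask_tokens(words, vocab=["x", "y"], rng=rng)
a, b, is_next = make_nsp_example(["s0", "s1", "s2", "s3"], 1, rng)
```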

#### ALBERT paper, [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, Lan et al.](https://arxiv.org/abs/1909.11942v6)

> Is having better NLP models as easy as having larger models?
A configuration similar to BERT-large can have 18x fewer parameters and train 1.7x faster

> Model Structure:
* Separate the **hidden layer size** from the **embedding size** (factorized embedding), so the hidden size can grow without growing the vocabulary embedding
* Cross-layer parameter **sharing**: more layers without much growth in parameter count
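
A back-of-the-envelope count shows why both tricks save parameters. The sizes below (vocabulary of 30000, hidden size 1024, embedding size 128, 24 layers) are illustrative assumptions, and the per-layer estimate of roughly $12H^2$ weights ignores biases and layer norm.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Embedding weights: V*H when tied to the hidden size, V*E + E*H when factorized."""
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size

def encoder_params(num_layers, hidden_size, share_layers):
    """Rough encoder size: 4*H^2 for attention + 8*H^2 for the feed forward net."""
    per_layer = 12 * hidden_size * hidden_size
    return per_layer if share_layers else num_layers * per_layer

V, H, E, LAYERS = 30000, 1024, 128, 24   # illustrative sizes, not quoted from the notes

tied = embedding_params(V, H)            # 30720000
factorized = embedding_params(V, H, E)   # 3971072
unshared = encoder_params(LAYERS, H, share_layers=False)
shared = encoder_params(LAYERS, H, share_layers=True)
```

With sharing, the encoder cost stays at one layer's worth of weights no matter how many layers are stacked.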

> Method
* SOP (sentence-order prediction) instead of NSP (next sentence prediction)
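
As described in the ALBERT paper, an SOP example uses two consecutive segments of the same document: the positive keeps their order and the negative swaps them. A minimal sketch, with `make_sop_example` and the toy segments made up for illustration:

```python
import random

def make_sop_example(segment_a, segment_b, rng):
    """segment_a is followed by segment_b in the source document.

    Returns (first, second, label): label 1 = original order, 0 = swapped.
    """
    if rng.random() < 0.5:
        return segment_a, segment_b, 1
    return segment_b, segment_a, 0

first, second, label = make_sop_example(["he", "left"], ["then", "she", "called"],
                                        random.Random(0))
```

Unlike NSP, the negative never comes from another document, so the model cannot solve the task through topic cues alone.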

#### ELECTRA paper, [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, Clark et al.](https://arxiv.org/abs/2003.10555)

> ELECTRA is short for **Efficiently Learning an Encoder that Classifies Token Replacements Accurately**. The paper proposes a training scheme that outperforms BERT given the same model size, data and compute; a small model trained on one GPU for 4 days outperforms GPT, and at scale it is comparable to RoBERTa with less than 1/4 of the compute. It is a bit like a GAN but without any adversarial learning
* The reason this works better than BERT's training method: MLM only lets the model learn from the 15% of positions that are masked in each sentence, which is not an efficient way of learning

> Method:
* Use 2 transformer encoders:
    * Generator, trained on the MLM (masked language modeling) task
    * Discriminator, trained on the text with the Generator's guessed tokens filled in, classifying **every token** as original or replaced (that's how ELECTRA learns faster than usual MLM)
* If the Generator guesses the right token, it is treated as original (different from GAN)
* We try to minimize the combined loss $\mathcal{L}_{MLM}(x,\theta_{G}) +\lambda \mathcal{L}_{Disc}(x,\theta_{D})$
* Don't backpropagate the discriminator loss through the generator
* Train the 2 models jointly from the start rather than training only G first: a fully trained G would be too difficult for D. After pretraining, only D is kept for fine-tuning
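
The replaced-token-detection targets described above can be sketched as follows. The function names and the toy sentence are for illustration only, and the value of `lam` in the usage line is arbitrary rather than the paper's setting.

```python
def build_discriminator_example(original, masked_positions, generator_samples):
    """Fill the masked positions with generator samples and label every token.

    Label 1 = replaced, 0 = original. A generator sample that happens to
    equal the original token is labelled 0, not 1.
    """
    corrupted = list(original)
    for pos, sample in zip(masked_positions, generator_samples):
        corrupted[pos] = sample
    labels = [int(c != o) for c, o in zip(corrupted, original)]
    return corrupted, labels

def combined_loss(mlm_loss, disc_loss, lam):
    """L_MLM + lambda * L_Disc, the quantity that is minimized."""
    return mlm_loss + lam * disc_loss

original = ["the", "chef", "cooked", "the", "meal"]
# Positions 0 and 2 were masked; the generator guessed "the" (right) and "ate" (wrong).
corrupted, labels = build_discriminator_example(original, [0, 2], ["the", "ate"])
# corrupted == ["the", "chef", "ate", "the", "meal"], labels == [0, 0, 1, 0, 0]
loss = combined_loss(mlm_loss=2.0, disc_loss=0.5, lam=1.0)  # lam chosen arbitrarily
```

Every position contributes a label to the discriminator, not just the masked ones, which is where the sample efficiency comes from.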

> Model structure
* same as BERT-Base
* Generators are **smaller** (reducing the layer size; 1/4 to 1/2 of the discriminator's size works best), so only the embeddings can be shared; if the structures were the same, all the weights could be tied
* **Small**, designed to be trained on a single GPU
    * Sequence length:512 to 128
    * token embedding: 768 to 128
    * hidden dimension size 768 to 256
    * batch size: 256 to 128
* **Large**, same as BERT-Large