# Data Science on Textual Information
> Papers on NLP

## Named Entity Recognition (NER)
#### [Cross-type Biomedical Named Entity Recognition with Deep Multi-Task Learning](https://arxiv.org/abs/1801.09851)
> Wang et al. 2018-10-09

> A paper focused on BioNER that sets a new **SOTA on 14** of 15 NER tasks; official [source code here](https://github.com/yuzhimanhua/lm-lstm-crf)

> #### Techniques:
* **Character level** tokenization + char-level embedding + char-level LSTM (1st LSTM layer)
* **Word level** tokenization + word-level embedding +  word-level LSTM (2nd LSTM layer)
* Missing word tokens (**OOV**: out of vocabulary) are handled as follows: **the hidden state of the char-level LSTM at each word boundary** is concatenated with the word embedding and then fed into the word-level LSTM. So when a word maps to ```[UNK]``` in the word-level vocab, the model can still infer from the char-level hidden state
* A CRF (**Conditional Random Field**) layer is used for classification, see the official [pytorch implementation](https://github.com/yuzhimanhua/LM-LSTM-CRF/blob/master/model/lstm_crf.py) and ```Lample et al.```
* The multi-task problem:
    * each task has too little data
    * each task has different class labels, so simply combining the datasets leads to many false negative predictions (an entity type annotated in one dataset is left unlabeled in another)
    * This paper tries to tackle the above by **sharing weights**:
        * The LSTM model weights are divided into $\theta_{w},\theta_{c},\theta_{o}$
        * i.e. the weights of the **w**ord level, the **c**har level, and the **o**utput layer (CRF)
        * The experiments covered:
            * only share $\theta_{w}$, model MTM-W
            * only share $\theta_{c}$, model MTM-C
            * share both $\theta_{c}$ and $\theta_{w}$, model MTM-CW (the best-performing one)
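
A minimal sketch of the OOV trick above, assuming the "word boundary" is the last character of each word in the space-joined character sequence (the official code may place the boundary differently). The helper names `word_boundary_positions` and `gather_boundary_states` are hypothetical, and the strings in `char_states` are stand-ins for real char-level LSTM hidden states:

```python
def word_boundary_positions(words):
    """Index of each word's last character in the space-joined char sequence."""
    positions = []
    offset = 0
    for word in words:
        offset += len(word)
        positions.append(offset - 1)  # last character of this word
        offset += 1                   # skip the separating space
    return positions


def gather_boundary_states(char_states, positions):
    """Pick one char-level state per word, at that word's boundary."""
    return [char_states[p] for p in positions]


words = ["BRCA1", "mutations"]
chars = list(" ".join(words))

# Stand-in for the char-level LSTM output: one "hidden state" per character.
char_states = [f"h_{i}({c})" for i, c in enumerate(chars)]

positions = word_boundary_positions(words)
boundary_states = gather_boundary_states(char_states, positions)

for word, state in zip(words, boundary_states):
    # In a real model, `state` would be concatenated with the word embedding
    # (possibly the [UNK] embedding) before the word-level LSTM.
    print(word, "->", state)
```

Even if both words were mapped to ```[UNK]```, each would still get a distinct char-level state, which is what lets the word-level LSTM tell them apart.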
            
> #### Other mention
* The paper explains the BioNER datasets, taken from [MTL datasets](https://github.com/cambridgeltl/MTL-Bioinformatics-2016)
* They all follow the IOBES scheme:
    * I: **i**nside (in the middle of) an entity
    * O: n**o**t an entity
    * B: **b**eginning of an entity
    * E: **e**nd of an entity
    * S: **s**ingle-token entity
* Other NER (solid old school) systems
    * CHEMDNER (Lu et al., 2015), CRF + Brown clustering of words
    * TaggerOne (Leaman and Lu, 2016), semi-Markov model for joint entity recognition and normalization
    * JNLPBA (Zhou and Su, 2004), using HMM
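
To make the IOBES scheme listed above concrete, here is a small sketch that turns entity spans into IOBES tags. The function name `spans_to_iobes`, the example sentence, and the `Gene` label are illustrative assumptions, not taken from the paper or its datasets:

```python
def spans_to_iobes(num_tokens, spans):
    """Convert (start, end, label) spans, with inclusive end, to IOBES tags."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if start == end:
            tags[start] = f"S-{label}"      # single-token entity
        else:
            tags[start] = f"B-{label}"      # beginning of the entity
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"      # inside the entity
            tags[end] = f"E-{label}"        # end of the entity
    return tags


tokens = ["The", "epidermal", "growth", "factor", "receptor", "binds", "EGF"]
spans = [(1, 4, "Gene"), (6, 6, "Gene")]  # hypothetical annotations

tags = spans_to_iobes(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Here the four-token mention gets `B`, `I`, `I`, `E`, the one-token mention gets `S`, and everything else stays `O`.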

## Pretrain & Transfer Learning
#### [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146)
> This paper from Jeremy Howard et al. compares and discusses pretraining approaches at length.
The paper carries out pretraining and fine-tuning in the following 3 steps:
* **step 1** Train an LM on general text, or on text from a related domain
* **step 2** Fine-tune the LM on the task text
* **step 3** Fine-tune the classifier on task text vs task labels

> The paper experiments with many useful techniques, such as:
* Discriminative fine-tuning (step 2): a different LR for each layer; after some experiments, $\eta^{l-1} = \eta^{l} / 2.6$ is usually good, where $\eta^{l}$ is the LR of layer $l$
* Slanted triangular learning rates (step 2): first **linearly increase** the learning rate, then **linearly decay** it.
* Concat pooling (step 3): for hidden states $H = \{h_{1},h_{2},...,h_{T}\}$:
$h_{c} = [h_{T},maxpool(H),meanpool(H)]$, where $[]$ is the concatenation
* Gradual unfreezing (step 3): starting from the top (last) layer, **unfreeze 1 layer per epoch** and train all the unfrozen layers in that epoch (use this together with discriminative fine-tuning)
* **Backpropagation through time** for text classification (BPT3C) (step 3): a long document is processed in consecutive fixed-length chunks, and the final state at the end of the previous chunk is used as the initial state of the next chunk
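
The schedules and the pooling step above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the defaults `lr_max=0.01`, `cut_frac=0.1`, `ratio=32` follow the values reported in the paper, and hidden states are plain lists of floats instead of tensors:

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """LR at step t: short linear warm-up, then long linear decay."""
    cut = max(1, math.floor(total_steps * cut_frac))  # guard against cut == 0
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio


def discriminative_lrs(top_lr, n_layers, factor=2.6):
    """Per-layer LRs, bottom layer first: eta^(l-1) = eta^l / factor."""
    return [top_lr / factor ** (n_layers - 1 - l) for l in range(n_layers)]


def trainable_layers(epoch, n_layers):
    """Gradual unfreezing: at epoch e (0-based) the top e+1 layers train."""
    first = max(0, n_layers - (epoch + 1))
    return list(range(first, n_layers))


def concat_pool(H):
    """h_c = [h_T, maxpool(H), meanpool(H)] over the time dimension."""
    h_last = list(H[-1])
    max_pool = [max(col) for col in zip(*H)]
    mean_pool = [sum(col) / len(col) for col in zip(*H)]
    return h_last + max_pool + mean_pool


schedule = [slanted_triangular_lr(t, 100) for t in range(101)]
layer_lrs = discriminative_lrs(top_lr=0.01, n_layers=4)
pooled = concat_pool([[1.0, 4.0], [3.0, 2.0]])

print("peak step:", schedule.index(max(schedule)))
print("layer LRs (bottom to top):", layer_lrs)
print("trainable at epoch 1 of a 4-layer model:", trainable_layers(1, 4))
print("concat-pooled vector:", pooled)
```

With 100 steps the LR peaks after the first 10% of training and decays for the rest, and the pooled vector is three times the hidden size because it stacks the last state, the max, and the mean.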

## Transformers