# Homework: Multi-task Learning & Domain Adaptation
##  Named Entity Recognition

Today we're gonna solve the problem of named entity recognition. Here's what it does in one picture:
![img](https://commons.bmstu.wiki/images/0/00/NER1.png)
[image source](https://bit.ly/2Pmg7L2)


For each word in a sentence, your model should predict a named entity class.


In [None]:
# in colab, uncomment this:
# !pip install tensorflow==2.0.0

import numpy as np
import tensorflow as tf
keras, L = tf.keras, tf.keras.layers

## Data

### Train set

Our model will train on a [Groningen Meaning Bank corpus](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus).

Each word of every sentence is labelled with a named entity class and a part-of-speech tag.

### Source domain testset

Our train set consists of texts from different news sources. Therefore, as the source-domain test set we will use data from the [CoNLL-2003 Shared Task](https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/data/conll2003/en). More information about the task can be found [here](https://www.clips.uantwerpen.be/conll2003/ner/).

### Target domain (in-domain) data

As target-domain data we will use data from the [WNUT17 Emerging and Rare entity recognition task](http://noisy-text.github.io/2017/emerging-rare-entities.html). This shared task focuses on identifying unusual, previously unseen entities in the context of emerging discussions. The data were mined from Twitter, Reddit, YouTube and StackExchange. The results of the task's participants were published [here](https://noisy-text.github.io/2017/pdf/WNUT18.pdf).

### Named entity classes

* PER - _person_: names of people (e.g. Alexander S. Pushkin)
* ORG - _organization_: names of corporations (e.g. Yandex, Google), names of non-profit organizations (e.g. UNICEF)
* LOC - _location_ : e.g. Russia
* MISC - _miscellaneous_ : other named entities including names of products (e.g. iPhone) and creative works (e.g. Bohemian Rhapsody)

### Evaluation metrics

As the evaluation metric we will use the F1 measure on exactly matched NEs. This means that partially overlapping entities of the same class are counted as a mismatch.
For example, the LOC entities below overlap only partially, so this is a mismatch:

__O, B-LOC, I-LOC, O__

__O, B-LOC, I-LOC, I-LOC__

Details can be found in the code of _conlleval.py_
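
To make the exact-match rule concrete, here is a minimal pure-Python sketch. The helper names `extract_entities` and `exact_match_f1` are hypothetical and are not part of _conlleval.py_, which may differ in details (for instance, it aggregates counts over the whole corpus, while this sketch scores a single sentence).

```python
def extract_entities(tags):
    """Return a set of (start, end, type) spans from IOB tags; `end` is exclusive."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):  # the extra 'O' closes a trailing span
        prefix, _, cur_type = tag.partition('-')
        # an open span ends on O, on B-, or when the entity type changes
        if start is not None and (prefix != 'I' or cur_type != ent_type):
            spans.add((start, i, ent_type))
            start, ent_type = None, None
        # a span starts on B-, or on an I- that does not continue an open span
        if prefix in ('B', 'I') and start is None:
            start, ent_type = i, cur_type
    return spans


def exact_match_f1(true_tags, pred_tags):
    """Precision, recall and F1 where only identical spans count as correct."""
    true_spans = extract_entities(true_tags)
    pred_spans = extract_entities(pred_tags)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ['O', 'B-LOC', 'I-LOC', 'O']
pred = ['O', 'B-LOC', 'I-LOC', 'I-LOC']
print(extract_entities(gold))      # {(1, 3, 'LOC')}
print(extract_entities(pred))      # {(1, 4, 'LOC')}
print(exact_match_f1(gold, pred))  # (0.0, 0.0, 0.0): the spans differ, so no credit
```

The two spans share two tokens, but because their boundaries differ the prediction gets no credit at all.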

### Data format

All datasets follow the popular [IOB format](https://en.wikipedia.org/wiki/Inside–outside–beginning_(tagging)). The B- prefix before a tag indicates that the tag is the beginning of a chunk, and an I- prefix before a tag indicates that the tag is inside a chunk. In the original IOB scheme the B- tag is used only when a chunk directly follows another chunk of the same type, with no O tokens between them; in the labels listed below, B- marks the first token of every entity. An O tag indicates that a token belongs to no chunk.

The named entity labels include:
* __B-LOC__ - location - first token
* __I-LOC__ - location - subsequent tokens
* __B-ORG__ - organization - first token
* __O__ - not a named entity
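
As a small illustration of the labels above, the sketch below turns entity spans into IOB tags. It assumes that B- marks the first token of every entity, as in the label list; `spans_to_iob` is a hypothetical helper, not something provided by `utils`.

```python
def spans_to_iob(num_tokens, spans):
    """Build IOB tags from (start, end, type) spans; `end` is exclusive."""
    tags = ['O'] * num_tokens
    for start, end, ent_type in spans:
        tags[start] = 'B-' + ent_type      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = 'I-' + ent_type      # subsequent tokens of the entity
    return tags


tokens = ['she', 'moved', 'to', 'new', 'york', 'city']
tags = spans_to_iob(len(tokens), [(3, 6, 'LOC')])
for token, tag in zip(tokens, tags):
    print(token, tag)
# she O
# moved O
# to O
# new B-LOC
# york I-LOC
# city I-LOC
```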

Take a look for yourself:

In [None]:
# Train:
!wget https://www.dropbox.com/s/xobyz6jgovvz3dm/kaggle-train.conll?dl=1 -O kaggle-train.conll

# Source domain testset:
!wget https://www.dropbox.com/s/1l8b9iy78cglrw3/source-domain-test.conll?dl=1 -O source-domain-test.conll

# Target domain testset:
!wget https://www.dropbox.com/s/oxfkdy23ux5hfz5/target-domain-test.conll?dl=1 -O target-domain-test.conll
    
# Target domain monolingual data:
!wget https://www.dropbox.com/s/ysdrotjdfljydbr/target-domain-monolingual.conll?dl=1 -O target-domain-monolingual.conll


In [None]:
from conlleval import evaluate
from utils import read_conll
data = read_conll('./kaggle-train.conll', lower_words=True)

data[333]

In [None]:
test_outdomain_data = read_conll('./source-domain-test.conll', lower_words=True)
test_indomain_data = read_conll('./target-domain-test.conll', column_names=['word', 'ne'], lower_words=True)
monolingual_indomain_data = read_conll('./target-domain-monolingual.conll', column_names=['word'], lower_words=True)

In [None]:
from sklearn.model_selection import train_test_split
train_data, dev_data = train_test_split(data, test_size=0.25, random_state=42)
print("train: {}, dev: {}".format(len(train_data), len(dev_data)))

In [None]:
from utils import Vocab
vocabs = {
    key: Vocab.from_lines([row[key] for row in train_data])
    for key in ['word', 'pos', 'ne']
}

def prepare_batch(data):
    keys = data[0].keys()
    return {
        key: vocabs[key].to_matrix(row[key] for row in data)
        for key in keys
    }

In [None]:
dummy_rows = sorted(data, key=lambda row: len(row['word']))[100:102]
print(dummy_rows[0])
print(dummy_rows[1])
prepare_batch(dummy_rows)

## Baseline: single-task model (1 point)

![img](https://github.com/yandexdataschool/nlp_course/raw/master/resources/gorynich_ne.png)

Let's start with a straightforward model that does named entity recognition.


The image will make sense later :)

In [None]:

class SimpleModel(L.Layer):
    def __init__(self, emb_size=128, hid_size=128):
        """ 
        A model that predicts named entity class for each word
        We recommend the following model:
        * Embedding
        * Bi-directional LSTM
        * Linear layer to predict logits
        """
        super().__init__() # initialize base class to track sub-layers, trainable variables, etc.
        
        # define layers
        self.emb = L.Embedding(len(vocabs['word']), emb_size)
        
        <YOUR CODE HERE>
    
    def __call__(self, input_ix):
        """
        Compute logits for named entity recognition
        :param input_ix: a matrix of token indices, int32[batch_size, seq_length]
        """
        <YOUR CODE>
        return {'ne': ner_logits}


In [None]:
model = SimpleModel()

dummy_ix = tf.convert_to_tensor(prepare_batch(train_data[:3])['word'])
dummy_logits = model(dummy_ix)['ne'].numpy()


assert dummy_logits.shape == (3, dummy_ix.shape[1], len(vocabs['ne']))
assert dummy_logits.min() < 0 and dummy_logits.max() > 0, "you ~may~ have added nonlinearity after logits."\
                                                          "Make sure they're just a linear layer"

In [None]:
from utils import infer_mask
optimizer = keras.optimizers.Adam()

def train_step(model, **batch):
    """ A bunch of tensorflow operations used for model training """
    with tf.GradientTape() as tape:
        outputs = model(batch['word'])
        mask = infer_mask(batch['word'])

        loss = -tf.nn.log_softmax(outputs['ne'], -1) * tf.one_hot(batch['ne'], len(vocabs['ne']))
        loss = tf.reduce_sum(loss * mask[:, :, None]) / tf.reduce_sum(mask)
    
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

### Training loop

Nothin' special: sample random batches and perform SGD steps

In [None]:
def iterate_minibatches(data, batch_size=128, shuffle=True, cycle=False, max_batches=None):
    indices = np.arange(len(data))
    total_batches = 0
    while True:
        if shuffle: indices = np.random.permutation(indices)
        for start_i in range(0, len(data), batch_size):
            batch_ix = indices[start_i: start_i + batch_size]
            yield prepare_batch(data[batch_ix])
            total_batches += 1
            if max_batches and total_batches >= max_batches:
                return
        if not cycle: break
            

def compute_error_rate(model, data, batch_size=128, key='ne'):
    numerator = denominator = 0.0
    for batch in iterate_minibatches(data, batch_size, shuffle=False, cycle=False):
        batch_ne_logits = model(batch['word'])[key].numpy()
        batch_mask = infer_mask(batch['word']).numpy()
        
        numerator += np.sum((batch[key] == batch_ne_logits.argmax(-1)) * batch_mask)
        denominator += batch_mask.sum()
    return (1.0 - numerator / denominator) * 100

def decode_greedy(model, data, vocabs, batch_size=128, key='ne'):
    result = []
    for batch in iterate_minibatches(data, batch_size, shuffle=False, cycle=False):
        batch_logits = model(batch['word'])[key].numpy()
        result.extend(vocabs[key].to_lines(batch_logits.argmax(-1)))
    return result

def compute_stats(model, data, vocabs, batch_size=128, key='ne', verbose=False):
    pred_seqs = decode_greedy(model, data, vocabs, batch_size, key)
    true_seqs = [r[key] for r in data]
    precision, recall, f1 = evaluate(true_seqs, pred_seqs, verbose)
    return precision, recall, f1
    

In [None]:
class StatsHistory:
    def __init__(self):
        self.precision = []
        self.recall = []
        self.f1 = []

In [None]:
from tqdm import tqdm
from IPython.display import clear_output
import matplotlib.pyplot as plt
%matplotlib inline
eval_every = 100

loss_history = []
dev_stats_history = StatsHistory()
indomain_stats_history = StatsHistory()
outdomain_stats_history = StatsHistory()

In [None]:
for batch in tqdm(iterate_minibatches(train_data, cycle=True, max_batches=2500), total=2500):
    loss_t = train_step(model, **batch)
    loss_history.append(loss_t)
    
    if len(loss_history) % eval_every == 0:
        clear_output(True)
        precision, recall, f1 = compute_stats(model, dev_data, vocabs, verbose=True)
        dev_stats_history.precision.append(precision)
        dev_stats_history.recall.append(recall)
        dev_stats_history.f1.append(f1)
        
        _, _, f1 = compute_stats(model, test_outdomain_data, vocabs, verbose=True)
        outdomain_stats_history.f1.append(f1)
        
        _, _, f1 = compute_stats(model, test_indomain_data, vocabs, verbose=True)
        indomain_stats_history.f1.append(f1)

        plt.figure(figsize=[12, 6])
        plt.subplot(1, 2, 1)
        plt.plot(loss_history)
        plt.title('train loss'), plt.grid()
        plt.subplot(1, 2, 2)
        
        plt.plot(np.arange(1, len(dev_stats_history.f1) + 1) * eval_every, dev_stats_history.f1, label="dev f1")
        plt.plot(np.arange(1, len(outdomain_stats_history.f1) + 1) * eval_every, outdomain_stats_history.f1, label="outdomain f1")
        plt.plot(np.arange(1, len(indomain_stats_history.f1) + 1) * eval_every, indomain_stats_history.f1, label="indomain f1")

        plt.legend()
        plt.title('dev stats %'), plt.grid()
        plt.show()

In [None]:
print("Best dev f1 = %.3f%%" % max(dev_stats_history.f1),
      "\nBest in-domain f1 = %.3f%%" % max(indomain_stats_history.f1),
      "\nBest out-of-domain f1 = %.3f%%" % max(outdomain_stats_history.f1))
assert max(dev_stats_history.f1) > 75, "you can do better"

# Multitask model: NER + POS (2 points)

Our data contains not only named entity labels, but also part-of-speech tags. The two problems are similar in nature, which makes them good candidates for multi-tasking. With any luck, our model will become better at named entity recognition by also learning POS tagging.
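
One common way to combine the tasks is to sum the per-task losses computed from the same shared body. The sketch below shows that idea in plain Python rather than TensorFlow; the function names and the optional per-task `weights` are assumptions made for illustration, not part of the assignment's API.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_norm - logits[target]


def multitask_loss(logits_by_task, targets_by_task, mask, weights=None):
    """Weighted sum over tasks of the masked mean token-level cross-entropy."""
    weights = weights or {task: 1.0 for task in logits_by_task}
    num_real_tokens = sum(mask)  # padding positions have mask 0
    total = 0.0
    for task, token_logits in logits_by_task.items():
        task_loss = sum(
            m * cross_entropy(logits, target)
            for logits, target, m in zip(token_logits, targets_by_task[task], mask)
        ) / num_real_tokens
        total += weights[task] * task_loss
    return total


# one sentence with two real tokens and one padding position
ne_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]    # 2 hypothetical NE classes
pos_logits = [[0.0] * 4, [0.0] * 4, [0.0] * 4]      # 4 hypothetical POS classes
loss = multitask_loss(
    {'ne': ne_logits, 'pos': pos_logits},
    {'ne': [0, 1, 0], 'pos': [2, 3, 0]},
    mask=[1, 1, 0],
)
print(loss)  # log(2) + log(4) = log(8), about 2.079
```

With equal weights both tasks pull on the shared layers equally; if POS starts to dominate, lowering its weight is one knob worth trying.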

![model2](https://github.com/yandexdataschool/nlp_course/raw/master/resources/gorynich_2.png)

In [None]:
class TwoTaskModel(L.Layer):
    def __init__(self, emb_size=128, hid_size=128):
        """ 
        Equivalent to the SimpleModel above, but with two 
        linear "heads": one for "ne" logits and another for "pos".
        Both heads should grow from the same intermediate "body" layer
        """
        super().__init__()
        <YOUR CODE>

    
    def __call__(self, input_ix):
        """
        Compute logits for named entity recognition and part-of-speech tagging
        """
        <YOUR CODE>
        ner_logits = <YOUR_CODE>
        pos_logits = <YOUR_CODE>
        return {'ne': ner_logits, 'pos': pos_logits}

In [None]:
model = TwoTaskModel()

dummy_ix = tf.convert_to_tensor(prepare_batch(train_data[:3])['word'])
dummy_out = model(dummy_ix)
assert 'ne' in dummy_out and 'pos' in dummy_out

In [None]:
from utils import infer_mask

def train_step_two_task(model, tasks=('ne', 'pos'), **batch):
    """ Naive train step for two-task loss: ne and pos """
    with tf.GradientTape() as tape:
        # model predictions
        <YOUR CODE>

        # losses for each task
        <YOUR CODE>
    
    # compute and apply gradients
    <YOUR CODE>
    
    return loss
    

In [None]:
from tqdm import tqdm
from IPython.display import clear_output
import matplotlib.pyplot as plt
%matplotlib inline

loss_history = []
dev_stats_history = StatsHistory()
indomain_stats_history = StatsHistory()
outdomain_stats_history = StatsHistory()

In [None]:
for batch in tqdm(iterate_minibatches(train_data, cycle=True, max_batches=2500), total=2500):
    loss_t = train_step_two_task(model, **batch)
    loss_history.append(loss_t)
        
    if len(loss_history) % 100 == 0:
        precision, recall, f1 = compute_stats(model, dev_data, vocabs, verbose=True)
        dev_stats_history.precision.append(precision)
        dev_stats_history.recall.append(recall)
        dev_stats_history.f1.append(f1)
        
        clear_output(True)
        
        _, _, f1 = compute_stats(model, test_outdomain_data, vocabs, verbose=True)
        outdomain_stats_history.f1.append(f1)
        
        _, _, f1 = compute_stats(model, test_indomain_data, vocabs, verbose=True)
        indomain_stats_history.f1.append(f1)
        
        plt.figure(figsize=[12, 6])
        plt.subplot(1, 2, 1)
        plt.plot(loss_history)
        plt.title('train loss'), plt.grid()
        plt.subplot(1, 2, 2)
        
        plt.plot(np.arange(1, len(dev_stats_history.f1) + 1) * eval_every, dev_stats_history.f1, label="dev f1")
        plt.plot(np.arange(1, len(outdomain_stats_history.f1) + 1) * eval_every, outdomain_stats_history.f1, label="outdomain f1")
        plt.plot(np.arange(1, len(indomain_stats_history.f1) + 1) * eval_every, indomain_stats_history.f1, label="indomain f1")
        
        plt.legend()
        plt.title('dev stats %'), plt.grid()
        plt.show()

In [None]:
print("Best dev f1 = %.3f%%" % max(dev_stats_history.f1),
      "\nBest indomain f1 = %.3f%%" % max(indomain_stats_history.f1),
      "\nBest outdomain f1 = %.3f%%" % max(outdomain_stats_history.f1))
assert max(dev_stats_history.f1) > 75, "you can do better"

# Multitask model: NER + POS + LM (3 points)

Two heads are great, but three's even better! Let's add language modeling to the task.

With language models, however, there are a few complications:
* Our data is too small for LM training. Let's use [1 billion word benchmark](http://www.statmt.org/lm-benchmark/) instead. It *may* even be a good idea to preserve cases.
* Language models have some issues with being bidirectional: a bidirectional encoder already sees the token it is asked to predict. We recommend training forward and backward models separately and fusing them together. Or use the same approach as [ELMo](https://tfhub.dev/google/elmo/2).
* The simplest scheme is to pre-train as a language model and fine-tune for NER and POS. We recommend starting from that.

__IMPORTANT!__ NER/POS dataset comes pre-tokenized.  Make sure you apply {almost} the same tokenization when training language model. Alternatively, you can re-tokenize ner/pos data.
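
To see what "forward" and "backward" language models are trained on, here is a minimal sketch that builds (input, target) pairs for both directions. The `<bos>` and `<eos>` markers and the `lm_pairs` helper are assumptions for illustration; your `Vocab` may use different special tokens.

```python
def lm_pairs(tokens, bos='<bos>', eos='<eos>'):
    """Return (input, target) token pairs for a forward and a backward LM."""
    padded = [bos] + list(tokens) + [eos]
    # forward LM: read left to right, predict the next token
    forward = list(zip(padded[:-1], padded[1:]))
    # backward LM: read right to left, predict the previous token
    backward = list(zip(padded[1:], padded[:-1]))[::-1]
    return forward, backward


forward, backward = lm_pairs(['the', 'cat', 'sat'])
print(forward)
# [('<bos>', 'the'), ('the', 'cat'), ('cat', 'sat'), ('sat', '<eos>')]
print(backward)
# [('<eos>', 'sat'), ('sat', 'cat'), ('cat', 'the'), ('the', '<bos>')]
```

In each direction the model only sees tokens on one side of the target, which is why the two directions are trained separately and fused afterwards.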


![model3](https://github.com/yandexdataschool/nlp_course/raw/master/resources/gorynich_small.png)

In [None]:
class MultitaskModel(L.Layer):
    def __init__(self, name, tasks=('ne', 'pos', 'lm'), emb_size=128, hid_size=128):
        """ 
        Equivalent to the SimpleModel above, but with three
        linear "heads": one for "ne" logits, second for "pos" and last for language modelling ('lm').
        All heads should grow from the same intermediate "body" layer
        """
        super().__init__(name=name)  # pass the layer name on to the Keras base class
        self.tasks = tasks
        # define layers:
        <YOUR CODE>

    def __call__(self, input_ix):
        """
        Compute logits for all tasks
        """
        <YOUR CODE>

        return <YOUR CODE>

In [None]:
model = MultitaskModel('mod1')

dummy_ix = tf.convert_to_tensor(prepare_batch(train_data[:3])['word'])
dummy_out = model(dummy_ix)
assert 'ne' in dummy_out and 'pos' in dummy_out and 'lm' in dummy_out

There are different schemes for multi-task learning with a language model component.

You can try at least two of them:
* Pre-train the network as a language model on monolingual data, then train it as a NER and POS tagger.
* Train the network alternately: one step on the NER and POS tasks, one step on the LM task.
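
The alternating scheme can be sketched as a generator that interleaves two batch streams. This is only an illustration: `alternate_tasks` is a hypothetical helper, and the strings below stand in for real batches.

```python
from itertools import cycle, islice


def alternate_tasks(tagging_batches, lm_batches):
    """Yield (task_group, batch) pairs, switching between tagging and LM batches."""
    tagging_iter, lm_iter = cycle(tagging_batches), cycle(lm_batches)
    while True:
        yield 'tagging', next(tagging_iter)  # one step on NER + POS
        yield 'lm', next(lm_iter)            # one step on language modelling


schedule = list(islice(alternate_tasks(['t0', 't1'], ['m0', 'm1', 'm2']), 6))
print(schedule)
# [('tagging', 't0'), ('lm', 'm0'), ('tagging', 't1'),
#  ('lm', 'm1'), ('tagging', 't0'), ('lm', 'm2')]
```

Note that `itertools.cycle` keeps a copy of every item it has seen, so for a large LM corpus you would probably prefer a generator that restarts itself, such as `iterate_minibatches(..., cycle=True)`.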


In [None]:
def train_step_multitask(model, tasks=('ne', 'pos', 'lm'), **batch):
    """ Naive train step for multi-task loss: ne, pos and lm """
    with tf.GradientTape() as tape:
        # model predictions
        <YOUR CODE>

        # losses for each task
        <YOUR CODE>
    
    # compute and apply gradients
    <YOUR CODE>
    
    return loss

In [None]:
from tqdm import tqdm
from IPython.display import clear_output
import matplotlib.pyplot as plt
%matplotlib inline

loss_history = []
dev_stats_history = StatsHistory()
indomain_stats_history = StatsHistory()
outdomain_stats_history = StatsHistory()

In [None]:
for batch in tqdm(iterate_minibatches(train_data, cycle=True, max_batches=2500), total=2500):
    # YOUR CODE HERE
    loss_history.append(loss_t)
    
    if len(loss_history) % 100 == 0:
        # YOUR CODE HERE
        
        clear_output(True)
        plt.figure(figsize=[12, 6])
        plt.subplot(1, 2, 1)
        plt.plot(loss_history)
        plt.title('train loss'), plt.grid()
        plt.subplot(1, 2, 2)
        
        plt.plot(np.arange(1, len(dev_stats_history.f1) + 1) * eval_every, dev_stats_history.f1, label="dev f1")
        plt.plot(np.arange(1, len(outdomain_stats_history.f1) + 1) * eval_every, outdomain_stats_history.f1, label="outdomain f1")
        plt.plot(np.arange(1, len(indomain_stats_history.f1) + 1) * eval_every, indomain_stats_history.f1, label="indomain f1")
        
        plt.legend()
        plt.title('dev stats %'), plt.grid()
        plt.show()

# Final task: transfer learning from BERT (4++ points)

By now you have mined pretty much everything from the CONLL dataset, so now it's time to go beyond.

You can __fine-tune a pre-trained BERT__ to solve all three CONLL-based tasks.

In case you forgot, BERT is a huge transformer-based model that learns to solve several tasks (e.g. missing word imputation) on a huge dataset.
* How BERT works: [arxiv](https://arxiv.org/abs/1810.04805)
* BERT in TF2.0: [colab notebook](https://colab.research.google.com/drive/1EJuMPW7TDVDGB1wDCIayx22jutcwLQlE)

You can also try BERT's friends for fun, swag and bonus points: [RoBERTa](https://arxiv.org/abs/1907.11692), [ALBERT](https://arxiv.org/abs/1909.11942), [XLNet](https://arxiv.org/abs/1906.08237), and even [T5](https://arxiv.org/abs/1910.10683).

In [None]:
# A whole lot of your code here

# (Bonus) Structured prediction for NEs (3 points)

![lstm_crf_ner](https://github.com/yandexdataschool/nlp_course/raw/master/resources/lstm_crf_ner.png)

[_Picture from  Lample et al._](https://arxiv.org/abs/1603.01360)

A seq2seq setup with cross-entropy loss is not ideal for tagging, because tagging has an important constraint: the input and output sequences have the same length, and the number of output token types is much smaller than the number of input token types.

To use this constraint effectively, it is a good idea to try a structured [CRF loss](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/crf) for NER (and POS). Note that `tf.contrib` exists only in TensorFlow 1.x; with TensorFlow 2 you will need another implementation, e.g. the CRF utilities in TensorFlow Addons (`tfa.text`).


A good example of a TensorFlow implementation of a NER network with CRF loss can be found in this [blog post](https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html).
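
To give a feel for what a CRF layer adds, here is a minimal pure-Python sketch of Viterbi decoding, the inference step of a linear-chain CRF. It is not the training loss (that also needs the log-partition function), and the scores and tag indices below are made up for illustration.

```python
def viterbi_decode(emissions, transitions):
    """
    Find the highest-scoring tag sequence.
    :param emissions: emissions[t][tag], per-token tag scores (e.g. model logits)
    :param transitions: transitions[prev_tag][tag], score of moving between tags
    :returns: (best tag path, its total score)
    """
    num_tags = len(emissions[0])
    scores = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = [], []
        for tag in range(num_tags):
            best_prev = max(range(num_tags),
                            key=lambda prev: scores[prev] + transitions[prev][tag])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][tag] + emission[tag])
        scores = new_scores
        backpointers.append(pointers)
    # follow the back-pointers from the best final tag
    best_last = max(range(num_tags), key=lambda tag: scores[tag])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1], scores[best_last]


# hypothetical tag indices: 0 = O, 1 = B-LOC, 2 = I-LOC
emissions = [[2.0, 0.0, 0.0],
             [0.0, 1.0, 1.5],
             [0.0, 0.0, 2.0]]
transitions = [[0.0, 0.0, -100.0],  # O -> I-LOC is heavily penalised
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
greedy = [max(range(3), key=lambda tag: row[tag]) for row in emissions]
path, score = viterbi_decode(emissions, transitions)
print(greedy)       # [0, 2, 2]: O, I-LOC, I-LOC, an invalid sequence
print(path, score)  # [0, 1, 2] 5.0: O, B-LOC, I-LOC
```

Greedy per-token argmax produces an I-LOC right after O, while the transition scores let the decoder recover a well-formed entity.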

## (Bonus) Dealing with letter case and rare entities (2 points)

![lstm_crf_ner](https://github.com/yandexdataschool/nlp_course/raw/master/resources/word_and_char_embedding_concat.png)

[_Picture from  Lample et al._](https://arxiv.org/abs/1603.01360)

First, in European languages (both English and Russian) personal names, company names and geographical names are traditionally capitalized. Letter case therefore carries a powerful signal for named entity recognition, so it is worth using.

Second, most named entities are rare words. In the test sets (both from the source and the target distributions) they are replaced by _UNK_. To deal with OOV words you can try different approaches (feel free!). For example:

* You can use additional character-level recurrent layers to obtain character-aware word embeddings (see scheme on the picture above)
* You can use pretrained embeddings with character ngram information ([FastText](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md))
* You can split words by subword units using [BPE](https://arxiv.org/abs/1508.07909)
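
As a starting point for the ideas above, the sketch below extracts two cheap word-level signals: a coarse capitalization "shape" and FastText-style character n-grams with boundary markers. Both helpers and the shape labels are hypothetical; the shape has to be computed on the original text, i.e. before lowercasing (the data above is read with `lower_words=True`).

```python
def word_shape(word):
    """Coarse capitalization feature for a single token."""
    if word.isupper():
        return 'ALL_CAPS'
    if word[:1].isupper():
        return 'Capitalized'
    if word.islower():
        return 'lower'
    return 'other'  # mixed case, digits, punctuation


def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of `word`, with '<' and '>' marking word boundaries."""
    marked = '<' + word + '>'
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]


print([word_shape(w) for w in ['UNICEF', 'Yandex', 'phone', 'iPhone']])
# ['ALL_CAPS', 'Capitalized', 'lower', 'other']
print(char_ngrams('york', 3, 3))
# ['<yo', 'yor', 'ork', 'rk>']
```

An unseen word still shares many n-grams with known words, so an embedding built from n-grams degrades more gracefully than a single _UNK_ vector.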

## (Bonus) Domain adaptation via proxy-labels (3 points)
As you can see above, the quality of NER on the target domain (internet comments) is much worse than on the source domain (news). This is not surprising.

To overcome the problem, we suggest implementing any kind of proxy-label method. A good overview of such methods can be found [here](https://arxiv.org/abs/1804.09530).

__ATTENTION!!!__ For proxy-labeling, use the monolingual target-domain dataset (not the test set!).
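
One simple proxy-label method is self-training: tag the unlabeled target-domain sentences with the current model, keep only the confident predictions, and add them to the training data. The sketch below shows just the selection step; the helper name, the confidence threshold and the toy predictions are assumptions for illustration.

```python
def select_proxy_labels(predictions, threshold=0.9):
    """
    Keep sentences whose least confident token is still above `threshold`.
    :param predictions: list of (tokens, predicted_tags, token_confidences)
    :returns: rows in the same {'word': ..., 'ne': ...} format as the labelled data
    """
    selected = []
    for tokens, tags, confidences in predictions:
        if confidences and min(confidences) >= threshold:
            selected.append({'word': list(tokens), 'ne': list(tags)})
    return selected


# toy model outputs on two unlabeled sentences (made-up confidences)
predictions = [
    (['omg', 'paris', 'rocks'], ['O', 'B-LOC', 'O'], [0.99, 0.95, 0.97]),
    (['lol', 'fb', 'down'], ['O', 'B-ORG', 'O'], [0.98, 0.55, 0.91]),
]
proxy_rows = select_proxy_labels(predictions, threshold=0.9)
print(proxy_rows)
# [{'word': ['omg', 'paris', 'rocks'], 'ne': ['O', 'B-LOC', 'O']}]
```

A stricter threshold gives cleaner but fewer proxy labels; the overview linked above discusses alternatives such as tri-training.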

Good luck!