In [None]:
!pip3 -qq install torch==0.4.1
!pip -qq install torchtext==0.3.1
!git clone https://github.com/MiuLab/SlotGated-SLU.git
!wget -qq https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/week08_multitask/conlleval.py

In [None]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


if torch.cuda.is_available():
    from torch.cuda import FloatTensor, LongTensor
    DEVICE = torch.device('cuda')
else:
    from torch import FloatTensor, LongTensor
    DEVICE = torch.device('cpu')

np.random.seed(42)

# Dialogue systems


Dialogue systems fall into two types: *goal-oriented* and *general conversation*.

**General conversation** is chit-chat on a free topic:
<center>
<img src="https://i.ibb.co/bFwwGpc/alice.jpg" width="10%">
</center>
    
Today we will not talk about those, but about **goal-oriented** systems:
<center>
<img src="https://hsto.org/webt/gj/3y/xl/gj3yxlqbr7ujuqr9r2akacxmkee.jpeg" width="20%">  
</center>
*From [Как устроена Алиса](https://habr.com/company/yandex/blog/349372/) (in Russian: "How Alice works")*

The user says something, and the system recognizes the speech. From the recognized text it determines what the user wanted, where, and when. Then the dialogue engine decides whether the user actually knows what they wanted to ask. Next comes a trip to the data sources to fetch the information that the user (apparently) requested. Based on all of this, a response is generated:
<center>
<img src="https://i.ibb.co/8XcdpJ7/goal-orientied.png" width="20%">
</center>
<center>
<img src="https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/resources/task_oriented_dialog_systems.gif" width="20%">
</center>
We will study the part in the middle: the classifier and the tagger. Everything else is usually heuristics and hard-coded answers.

## Data loading

The conventional standard dataset is ATIS, which is, frankly, indecently small.

In addition, there is the SNIPS dataset, which is bigger and more diverse.

We will take both datasets from the repository of the paper [Slot-Gated Modeling for Joint Slot Filling and Intent Prediction](http://aclweb.org/anthology/N18-2118).

Let's start with ATIS.
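
To make the data layout concrete, here is a minimal sketch of the three parallel files that the reader in the next cell expects (`seq.in`, `seq.out`, `label`). The utterance, its tags, and the intent below are made up for illustration; only the file names come from the loading code.

```python
import os
import tempfile

# Hypothetical one-utterance dataset in the three-file layout;
# the sentence and its labels are invented for this example.
sample = {
    'seq.in': 'flights from boston to denver\n',
    'seq.out': 'O O B-fromloc.city_name O B-toloc.city_name\n',
    'label': 'atis_flight\n',
}

with tempfile.TemporaryDirectory() as path:
    for name, content in sample.items():
        with open(os.path.join(path, name), 'w') as f:
            f.write(content)

    # Read the three files in parallel, one utterance per line.
    with open(os.path.join(path, 'seq.in')) as f_words, \
            open(os.path.join(path, 'seq.out')) as f_tags, \
            open(os.path.join(path, 'label')) as f_intents:
        examples = [
            (words.split(), tags.split(), intent.strip())
            for words, tags, intent in zip(f_words, f_tags, f_intents)
        ]

words, tags, intent = examples[0]
assert len(words) == len(tags)  # exactly one tag per token
print(list(zip(words, tags)), intent)
```

Each example is a triple: a list of tokens, a list of BIO slot tags of the same length, and a single intent label for the whole utterance.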

In [None]:
import os 

def read_dataset(path):
    with open(os.path.join(path, 'seq.in')) as f_words, \
            open(os.path.join(path, 'seq.out')) as f_tags, \
            open(os.path.join(path, 'label')) as f_intents:
        
        return [
            (words.strip().split(), tags.strip().split(), intent.strip()) 
            for words, tags, intent in zip(f_words, f_tags, f_intents)
        ]

In [None]:
train_data = read_dataset('SlotGated-SLU/data/atis/train/')
val_data = read_dataset('SlotGated-SLU/data/atis/valid/')
test_data = read_dataset('SlotGated-SLU/data/atis/test/')

In [None]:
intent_to_example = {example[2]: example for example in train_data}
for example in intent_to_example.values():
    print('Intent:\t', example[2])
    print('Text:\t', '\t'.join(example[0]))
    print('Tags:\t', '\t'.join(example[1]))
    print()

In [None]:
from torchtext.data import Field, LabelField, Example, Dataset, BucketIterator

tokens_field = Field()
tags_field = Field(unk_token=None)
intent_field = LabelField()

fields = [('tokens', tokens_field), ('tags', tags_field), ('intent', intent_field)]

train_dataset = Dataset([Example.fromlist(example, fields) for example in train_data], fields)
val_dataset = Dataset([Example.fromlist(example, fields) for example in val_data], fields)
test_dataset = Dataset([Example.fromlist(example, fields) for example in test_data], fields)

tokens_field.build_vocab(train_dataset)
tags_field.build_vocab(train_dataset)
intent_field.build_vocab(train_dataset)

print('Vocab size =', len(tokens_field.vocab))
print('Tags count =', len(tags_field.vocab))
print('Intents count =', len(intent_field.vocab))

train_iter, val_iter, test_iter = BucketIterator.splits(
    datasets=(train_dataset, val_dataset, test_dataset), batch_sizes=(32, 128, 128), 
    shuffle=True, device=DEVICE, sort=False
)
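
What the fields and iterators do for us can be pictured with a hand-rolled sketch. `make_vocab` and `pad_and_index` are hypothetical helpers, not torchtext functions; they only mimic the general idea (special tokens first, then tokens ordered by frequency, then padding every sequence to the longest one in the batch).

```python
from collections import Counter


def make_vocab(sequences, specials=('<unk>', '<pad>')):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    counts = Counter(token for seq in sequences for token in seq)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [token for token, _ in ordered]
    return {token: index for index, token in enumerate(itos)}


def pad_and_index(batch, stoi, pad='<pad>', unk='<unk>'):
    """Replace tokens with indices and pad each sequence to the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [
        [stoi.get(token, stoi.get(unk)) for token in seq] + [stoi[pad]] * (max_len - len(seq))
        for seq in batch
    ]


stoi = make_vocab([['a', 'b', 'a'], ['c']])
rows = pad_and_index([['a', 'z'], ['b', 'c', 'a']], stoi)
print(stoi)
print(rows)
```

One difference to keep in mind: this sketch produces rows of shape (batch, seq_len), while the iterators above yield tensors with the sequence dimension first by default. For the tag field, created with `unk_token=None`, the analogous call would be `make_vocab(tag_sequences, specials=('<pad>',))`.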

## Intent classifier

Let's start with a classifier that predicts which intent a request belongs to.

**Task** Nothing clever: take an RNN and train it to predict intent labels.

In [None]:
class IntentClassifierModel(nn.Module):
    def __init__(self, vocab_size, intents_count, emb_dim=64, lstm_hidden_dim=128, num_layers=1):
        super().__init__()

        <init layers>

    def forward(self, inputs):
        <apply layers>
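
One possible way to fill in the stub above, as a rough sketch rather than the reference solution. The class name `IntentClassifierSketch` and the attribute names are assumptions. The sketch uses a bidirectional LSTM and does not treat padding specially, so the forward direction also reads the trailing pad tokens.

```python
import torch
import torch.nn as nn


class IntentClassifierSketch(nn.Module):
    def __init__(self, vocab_size, intents_count, emb_dim=64, lstm_hidden_dim=128, num_layers=1):
        super().__init__()
        self._embs = nn.Embedding(vocab_size, emb_dim)
        self._rnn = nn.LSTM(emb_dim, lstm_hidden_dim, num_layers=num_layers, bidirectional=True)
        self._out = nn.Linear(2 * lstm_hidden_dim, intents_count)

    def forward(self, inputs):
        # inputs: (seq_len, batch) tensor of token indices
        embs = self._embs(inputs)
        _, (hidden, _) = self._rnn(embs)
        # hidden: (num_layers * 2, batch, hidden); the last two entries are
        # the final forward and backward states of the top layer
        sentence = torch.cat([hidden[-2], hidden[-1]], dim=-1)
        return self._out(sentence)  # (batch, intents_count)


model = IntentClassifierSketch(vocab_size=20, intents_count=3)
logits = model(torch.randint(0, 20, (7, 4), dtype=torch.long))
print(logits.shape)  # torch.Size([4, 3])
```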

**Task** Implement `ModelTrainer` to compute the loss and accuracy.

In [None]:
class ModelTrainer():
    def __init__(self, model, criterion, optimizer):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer
        
    def on_epoch_begin(self, is_train, name, batches_count):
        """
        Initializes metrics
        """
        self.epoch_loss = 0
        self.correct_count, self.total_count = 0, 0
        self.is_train = is_train
        self.name = name
        self.batches_count = batches_count
        
        self.model.train(is_train)
        
    def on_epoch_end(self):
        """
        Outputs final metrics
        """
        return '{:>5s} Loss = {:.5f}, Accuracy = {:.2%}'.format(
            self.name, self.epoch_loss / self.batches_count, self.correct_count / self.total_count
        )
        
    def on_batch(self, batch):
        """
        Performs forward and (if is_train) backward pass with optimization, updates metrics
        """
        <As usual: perform the forward pass, then call backward and apply optimizer>
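
The core of `on_batch` can be sketched as a standalone function. `run_batch` is a hypothetical helper, not part of the notebook; the real method would additionally accumulate the metrics on `self` and return a progress string. The demo uses a tiny linear model on random data only to show the call.

```python
import torch
import torch.nn as nn
import torch.optim as optim


def run_batch(model, criterion, optimizer, inputs, targets, is_train):
    """Forward pass, optional backward pass and update, plus batch metrics."""
    logits = model(inputs)
    loss = criterion(logits, targets)

    if is_train:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    correct = (logits.argmax(dim=-1) == targets).sum().item()
    return loss.item(), correct, targets.numel()


torch.manual_seed(0)
model = nn.Linear(5, 3)  # stand-in for the classifier
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

inputs = torch.randn(8, 5)
targets = torch.randint(0, 3, (8,), dtype=torch.long)
loss, correct, total = run_batch(model, criterion, optimizer, inputs, targets, is_train=True)
print('Loss = {:.5f}, Accuracy = {:.2%}'.format(loss, correct / total))
```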

In [None]:
import math
from tqdm import tqdm
tqdm.get_lock().locks = []


def do_epoch(trainer, data_iter, is_train, name=None):
    trainer.on_epoch_begin(is_train, name, batches_count=len(data_iter))
    
    with torch.autograd.set_grad_enabled(is_train):
        with tqdm(total=trainer.batches_count) as progress_bar:
            for i, batch in enumerate(data_iter):
                batch_progress = trainer.on_batch(batch)

                progress_bar.update()
                progress_bar.set_description(batch_progress)
                
            epoch_progress = trainer.on_epoch_end()
            progress_bar.set_description(epoch_progress)
            progress_bar.refresh()

            
def fit(trainer, train_iter, epochs_count=1, val_iter=None):
    best_val_loss = None
    for epoch in range(epochs_count):
        name_prefix = '[{} / {}] '.format(epoch + 1, epochs_count)
        do_epoch(trainer, train_iter, is_train=True, name=name_prefix + 'Train:')
        
        if val_iter is not None:
            do_epoch(trainer, val_iter, is_train=False, name=name_prefix + '  Val:')

In [None]:
model = IntentClassifierModel(vocab_size=len(tokens_field.vocab), intents_count=len(intent_field.vocab)).to(DEVICE)

criterion = nn.CrossEntropyLoss().to(DEVICE)
optimizer = optim.Adam(model.parameters())

trainer = ModelTrainer(model, criterion, optimizer)

fit(trainer, train_iter, epochs_count=30, val_iter=val_iter)

**Task** Compute the final quality on the test set.

## Tagger

<center>
<img src="https://commons.bmstu.wiki/images/0/00/NER1.png" width="20%">
</center>
    
*From [NER](https://ru.bmstu.wiki/NER_(Named-Entity_Recognition))*

**Task** Still nothing clever: a simple tagger, like the one for POS, only for NER.

In [None]:
class TokenTaggerModel(nn.Module):
    def __init__(self, vocab_size, tags_count, emb_dim=64, lstm_hidden_dim=128, num_layers=1):
        super().__init__()

        <init layers again>

    def forward(self, inputs):
        <apply 'em>
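
A possible shape for the tagger, again only as a sketch with assumed names (`TokenTaggerSketch`, `_embs`, `_rnn`, `_out`). The only real change from a sentence classifier is that the output layer is applied to every time step instead of to a single sentence vector.

```python
import torch
import torch.nn as nn


class TokenTaggerSketch(nn.Module):
    def __init__(self, vocab_size, tags_count, emb_dim=64, lstm_hidden_dim=128, num_layers=1):
        super().__init__()
        self._embs = nn.Embedding(vocab_size, emb_dim)
        self._rnn = nn.LSTM(emb_dim, lstm_hidden_dim, num_layers=num_layers, bidirectional=True)
        self._out = nn.Linear(2 * lstm_hidden_dim, tags_count)

    def forward(self, inputs):
        # inputs: (seq_len, batch) tensor of token indices
        outputs, _ = self._rnn(self._embs(inputs))
        # outputs: (seq_len, batch, 2 * hidden), one vector per token
        return self._out(outputs)  # (seq_len, batch, tags_count)


tagger = TokenTaggerSketch(vocab_size=20, tags_count=5)
tag_logits = tagger(torch.randint(0, 20, (7, 4), dtype=torch.long))
print(tag_logits.shape)  # torch.Size([7, 4, 5])
```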

**Task** Update `ModelTrainer`: it should compute the same loss and accuracy, only now over individual tokens, so padding positions should not be counted.
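
A sketch of how token-level loss and accuracy might skip padding. `masked_tag_metrics` is a hypothetical helper, and `pad_idx` is assumed to be the index of the padding token in the tag vocabulary (something like `tags_field.vocab.stoi[tags_field.pad_token]`).

```python
import torch
import torch.nn as nn


def masked_tag_metrics(logits, tags, pad_idx):
    """Loss and accuracy over real tokens only.

    logits: (seq_len, batch, tags_count), tags: (seq_len, batch).
    """
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
    loss = criterion(logits.reshape(-1, logits.shape[-1]), tags.reshape(-1))

    mask = tags != pad_idx
    correct = ((logits.argmax(dim=-1) == tags) & mask).sum().item()
    total = mask.sum().item()
    return loss, correct, total


# Toy batch: index 0 plays the role of padding.
tags = torch.tensor([[1, 2],
                     [1, 0]])
logits = torch.tensor([[[0., 5., 0.], [0., 5., 0.]],
                       [[0., 5., 0.], [5., 0., 0.]]])
loss, correct, total = masked_tag_metrics(logits, tags, pad_idx=0)
print(correct, total)  # 2 3
```

The padded position is predicted "correctly" here (class 0), but it is excluded from both the loss and the accuracy.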

In [None]:
<fit the model>

NER models are usually evaluated by the F1 score over predicted slots. For this, everyone borrows the conlleval script from one another :)

**Task** Write a function to evaluate the tagger.

In [None]:
from conlleval import evaluate

def eval_tagger(model, test_iter):
    true_seqs, pred_seqs = [], []

    model.eval()
    with torch.no_grad():
        for batch in test_iter:
            <calc true_seqs and pred_seqs for the batch>
    print('Precision = {:.2f}%, Recall = {:.2f}%, F1 = {:.2f}%'.format(*evaluate(true_seqs, pred_seqs, verbose=False)))
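
To see what slot-level F1 measures, here is a simplified pure-Python stand-in. It is not the conlleval script and makes no claim to match it in every corner case; `extract_spans` and `span_f1` are hypothetical helpers. A predicted slot counts as correct only if its type and both boundaries match exactly.

```python
def extract_spans(tags):
    """Return the set of (slot_type, start, end) spans in a BIO tag sequence."""
    spans, start, slot = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes the last span
        continues = tag.startswith('I-') and slot == tag[2:]
        if continues:
            continue
        if slot is not None:
            spans.add((slot, start, i))
            start, slot = None, None
        if tag.startswith('B-') or tag.startswith('I-'):
            start, slot = i, tag[2:]
    return spans


def span_f1(true_seqs, pred_seqs):
    """Precision, recall, and F1 over exactly matching spans."""
    correct = true_total = pred_total = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_spans, pred_spans = extract_spans(true_tags), extract_spans(pred_tags)
        correct += len(true_spans & pred_spans)
        true_total += len(true_spans)
        pred_total += len(pred_spans)

    precision = correct / pred_total if pred_total else 0.0
    recall = correct / true_total if true_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


true_seqs = [['B-a', 'I-a', 'O', 'B-b']]
pred_seqs = [['B-a', 'I-a', 'O', 'O']]
print(span_f1(true_seqs, pred_seqs))
```

In this toy case one of the two true slots is found and nothing spurious is predicted, so precision is 1.0 and recall is 0.5.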

## Multi-task learning

We have already discussed that multi-task learning is cool, fashionable, and youthful. Let's implement a model that predicts tags and intents at the same time. The idea is that both tasks share information that should help each of them: knowing the intent, you can tell which slots are possible, and knowing the slots, you can guess the intent.

**Task** Implement the joint model.

In [None]:
class SharedModel(nn.Module):
    def __init__(self, vocab_size, intents_count, tags_count, emb_dim=64, lstm_hidden_dim=128, num_layers=1):
        super().__init__()

        <init layers>

    def forward(self, inputs):
        <apply layers>
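
A minimal sketch of a joint model, assuming the simplest possible sharing: one embedding layer and one bidirectional LSTM feed two separate output layers. The name `SharedModelSketch`, the attribute names, and the unweighted sum of the two losses are all assumptions, not the reference solution; padding is ignored to keep the example short.

```python
import torch
import torch.nn as nn


class SharedModelSketch(nn.Module):
    def __init__(self, vocab_size, intents_count, tags_count, emb_dim=64, lstm_hidden_dim=128, num_layers=1):
        super().__init__()
        self._embs = nn.Embedding(vocab_size, emb_dim)
        self._rnn = nn.LSTM(emb_dim, lstm_hidden_dim, num_layers=num_layers, bidirectional=True)
        self._intent_out = nn.Linear(2 * lstm_hidden_dim, intents_count)
        self._tags_out = nn.Linear(2 * lstm_hidden_dim, tags_count)

    def forward(self, inputs):
        outputs, (hidden, _) = self._rnn(self._embs(inputs))
        sentence = torch.cat([hidden[-2], hidden[-1]], dim=-1)
        # One intent prediction per sentence, one tag prediction per token.
        return self._intent_out(sentence), self._tags_out(outputs)


shared = SharedModelSketch(vocab_size=20, intents_count=3, tags_count=5)
tokens = torch.randint(0, 20, (7, 4), dtype=torch.long)
intents = torch.randint(0, 3, (4,), dtype=torch.long)
tags = torch.randint(0, 5, (7, 4), dtype=torch.long)

intent_logits, tag_logits = shared(tokens)
criterion = nn.CrossEntropyLoss()
# Joint objective: a plain sum of the two losses.
joint_loss = criterion(intent_logits, intents) \
    + criterion(tag_logits.reshape(-1, tag_logits.shape[-1]), tags.reshape(-1))
joint_loss.backward()
print(intent_logits.shape, tag_logits.shape)
```

Because both losses flow back through the same encoder, a single `backward()` call updates the shared layers with signal from both tasks.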

In [None]:
<update ModelTrainer>

In [None]:
<fit the model>

In [None]:
<calc intent accuracy>

In [None]:
<calc tags F1-score>

## Asynchronous learning

In fact, all of this was started precisely for this: asynchronous training of a multi-task model.

The idea is described in [A Bi-model based RNN Semantic Frame Parsing Model for Intent Detection and Slot Filling](http://aclweb.org/anthology/N18-2050).

Let's start with this model:
<center>
<img src="https://i.ibb.co/N2T1X2f/2018-11-27-2-11-01.png" width="20%">
</center>

The main difference from what we have already implemented is the order in which everything is optimized. Instead of training all layers jointly, the networks for the tagger and for the classifier are trained separately.

At each training step, two sequences of hidden states, $h^1$ and $h^2$, are computed: one for the classifier and one for the tagger.

Then the intent prediction loss is computed and an optimizer step is taken, and after that the tag prediction loss is computed and another optimizer step is taken.

**Task** Implement it.

In [None]:
class AsyncSharedModel(nn.Module):
    def __init__(self, vocab_size, intents_count, tags_count, emb_dim=64, lstm_hidden_dim=128, num_layers=1):
        super().__init__()

        <init layers>
        
    <do smth>

In [None]:
<update ModelTrainer somehow>

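You need to create a separate optimizer for each part of the model.

The parameters of each part can be selected as follows (layers whose names start with neither `_intent` nor `_tags` are shared and end up in both lists):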

In [None]:
model = AsyncSharedModel(
    vocab_size=len(tokens_field.vocab),
    intents_count=len(intent_field.vocab),
    tags_count=len(tags_field.vocab)
).to(DEVICE)

tags_parameters = [param for name, param in model.named_parameters() if not name.startswith('_intent')]
intent_parameters = [param for name, param in model.named_parameters() if not name.startswith('_tags')]

Then pass them to separate optimizers and train each part separately.

* Also, the `retain_graph` parameter of the `backward()` method may come in handy.
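
A rough sketch of one asynchronous training step with two optimizers. `TinyAsyncModel` and `async_step` are hypothetical; the toy model only follows the `_intent...` / `_tags...` naming convention used for parameter selection and, unlike the model in the paper, its two networks do not read each other's hidden states. The sketch re-runs the forward pass before the second update instead of reusing one graph with `retain_graph=True`. This is a deliberate simplification: stepping an optimizer between two backward passes over the same graph modifies parameters in place, which newer PyTorch versions may reject.

```python
import torch
import torch.nn as nn
import torch.optim as optim


class TinyAsyncModel(nn.Module):
    """Toy stand-in: shared embeddings plus separate intent and tag networks."""

    def __init__(self, vocab_size=20, intents_count=3, tags_count=5, emb_dim=8, hidden_dim=16):
        super().__init__()
        self._embs = nn.Embedding(vocab_size, emb_dim)
        self._intent_rnn = nn.LSTM(emb_dim, hidden_dim)
        self._intent_out = nn.Linear(hidden_dim, intents_count)
        self._tags_rnn = nn.LSTM(emb_dim, hidden_dim)
        self._tags_out = nn.Linear(hidden_dim, tags_count)

    def forward(self, inputs):
        embs = self._embs(inputs)
        _, (intent_hidden, _) = self._intent_rnn(embs)
        tags_outputs, _ = self._tags_rnn(embs)
        return self._intent_out(intent_hidden[-1]), self._tags_out(tags_outputs)


def async_step(model, intent_optimizer, tags_optimizer, tokens, intents, tags):
    criterion = nn.CrossEntropyLoss()

    # Step 1: intent loss only, then update the intent side (and shared layers).
    model.zero_grad()
    intent_logits, _ = model(tokens)
    intent_loss = criterion(intent_logits, intents)
    intent_loss.backward()
    intent_optimizer.step()

    # Step 2: fresh forward pass, tag loss only, then update the tagger side.
    model.zero_grad()
    _, tag_logits = model(tokens)
    tags_loss = criterion(tag_logits.reshape(-1, tag_logits.shape[-1]), tags.reshape(-1))
    tags_loss.backward()
    tags_optimizer.step()

    return intent_loss.item(), tags_loss.item()


torch.manual_seed(0)
toy_model = TinyAsyncModel()
toy_tags_params = [p for n, p in toy_model.named_parameters() if not n.startswith('_intent')]
toy_intent_params = [p for n, p in toy_model.named_parameters() if not n.startswith('_tags')]
toy_intent_optimizer = optim.Adam(toy_intent_params)
toy_tags_optimizer = optim.Adam(toy_tags_params)

tokens = torch.randint(0, 20, (6, 4), dtype=torch.long)
intents = torch.randint(0, 3, (4,), dtype=torch.long)
tags = torch.randint(0, 5, (6, 4), dtype=torch.long)
print(async_step(toy_model, toy_intent_optimizer, toy_tags_optimizer, tokens, intents, tags))
```

The cost of this variant is a second forward pass per batch; whether that trade-off is acceptable depends on the model size.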

In [None]:
<fit the model>

In [None]:
<calc intent accuracy and tags F1-score>

## Improvements

**Task** Look at the hyperparameters in the paper and try to reach similar quality.

**Task** Try replacing the corpus you are working with (for example, with snips).

### Encoder-decoder

A good idea is not to predict the tags independently, but to put a decoder on top of them:

<center>
<img src = "https://i.ibb.co/qrgVSqF/2018-11-27-2-11-17.png" width = "20%">
</center>

In essence, just one more RNN layer is added here, this time a unidirectional one. For tag prediction, its input is the previous tag, the previous hidden state, and the hidden states from the tag and intent encoders. For intents it is a plain RNN.

**Task** Implement such a model.

# Async Multi-task Learning for POS Tagging

Those were toy datasets and not very good papers (albeit from NAACL 2018).

I like this one much more: [Morphosyntactic Tagging with a Meta-BiLSTM Model over Context Sensitive Token Encodings](https://arxiv.org/pdf/1805.08237.pdf).

The architecture looks like this:

<center>
<img src="https://i.ibb.co/0nSX6CC/2018-11-27-9-26-15.png" width="20%">
</center>

The multi-task part consists in training the separate lower-level classifiers (over characters and over words) to predict tags, each with its own optimizer.

**Task** Try to implement what is described in the paper.

# References
A Bi-model based RNN Semantic Frame Parsing Model for Intent Detection and Slot Filling, 2018 [[pdf]](http://aclweb.org/anthology/N18-2050)  
Slot-Gated Modeling for Joint Slot Filling and Intent Prediction, 2018 [[pdf]](http://aclweb.org/anthology/N18-2118)  
Morphosyntactic Tagging with a Meta-BiLSTM Model over Context Sensitive Token Encodings, 2018 [[arxiv]](https://arxiv.org/pdf/1805.08237.pdf)

[Как устроена Алиса](https://habr.com/company/yandex/blog/349372/) (in Russian: "How Alice works")  