<a href="https://colab.research.google.com/github/unpackAI/unpackai/blob/main/examples/nlp_seq2seq_en_to_zh_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence to sequence model

## Installation

In [None]:
# !pip install -q unpackai==0.1.8.9
# !pip install -q transformers
# !pip install -Uqq fastai
# !pip install -q datasets

## Imports

In [None]:
from datasets import load_dataset
from fastai.text.all import *
from unpackai.nlp import *

## Data

In [None]:
# also en-ru, en-fr, en-it, etc
dataset = load_dataset("ted_iwlst2013", 'en-zh') 

Reusing dataset ted_iwlst2013 (/root/.cache/huggingface/datasets/ted_iwlst2013/en-zh/1.1.0/769086006155211ed7233545de12bce6fe41e1c71f509a3f062e294cb3c00e99)



In [None]:
dataset
entire = dataset['train']

### Split data to train/valid

The dataset we downloaded has no validation split, so let's hold out 10% of the training data to make one.

In [None]:
splited = entire.train_test_split(test_size = .1,)
splited

Loading cached split indices for dataset at /root/.cache/huggingface/datasets/ted_iwlst2013/en-zh/1.1.0/769086006155211ed7233545de12bce6fe41e1c71f509a3f062e294cb3c00e99/cache-64b6ac3983d87e67.arrow and /root/.cache/huggingface/datasets/ted_iwlst2013/en-zh/1.1.0/769086006155211ed7233545de12bce6fe41e1c71f509a3f062e294cb3c00e99/cache-4167f5cdef840354.arrow


DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 139121
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 15458
    })
})

In [None]:
train = splited['train']
valid = splited['test']
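To make the 90/10 arithmetic concrete, here is a minimal pure-Python sketch of a seeded index split. It is not how `train_test_split` is implemented; the rounding rule (test count rounded up) and the helper name `split_indices` are assumptions chosen so that the sizes agree with the output above. The real method also accepts a `seed` argument if you want the same split on every run.

```python
import math
import random

def split_indices(n_rows, test_size=0.1, seed=42):
    """Shuffle row indices and cut off a held-out fraction (hypothetical helper)."""
    n_test = math.ceil(n_rows * test_size)   # assumed rounding: held-out count rounded up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # seeded, so the split is repeatable
    return indices[n_test:], indices[:n_test]

# 154579 is the sum of the two split sizes printed above (139121 + 15458).
train_idx, valid_idx = split_indices(154579, test_size=0.1)
print(len(train_idx), len(valid_idx))  # 139121 15458
```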

In [None]:
train[30]

{'translation': {'en': 'And now my mission to control and predict had turned up the answer that the way to live is with vulnerability and to stop controlling and predicting.',
  'zh': '而我现在的使命 即控制并预测 却给出了这样一个结果：要想与脆弱共存 就得停止控制，停止预测'},
 'id': '86514'}

## Tokenizer and pretrained model
> The tokenizer will be part of the data pipeline, so let's download the pretrained tokenizers and the pretrained model before we create a data block.

In [None]:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoModel,
    EncoderDecoderModel
    )

In [None]:
# For the encoder we pick a model pretrained on English, since a pretrained BERT is good at understanding English.
# BERT is short for Bidirectional **Encoder** Representations from Transformers; it consists entirely of encoder blocks.
ENCODER_PRETRAINED = "bert-base-uncased"
# For the decoder we pick a model pretrained to write Chinese, since the decoder is the part of the model that generates text.
DECODER_PRETRAINED = "uer/gpt2-chinese-poem"

### Load pretrained models

In [None]:
# encoder = AutoModel.from_pretrained(ENCODER_PRETRAINED, proxies={"http":"bifrost:3128"})
# decoder = AutoModelForCausalLM.from_pretrained(DECODER_PRETRAINED, add_cross_attention=True,
#                                                proxies={"http":"bifrost:3128"})

### Load pretrained tokenizers

In [None]:
encoder_tokenizer = AutoTokenizer.from_pretrained(ENCODER_PRETRAINED)
decoder_tokenizer = AutoTokenizer.from_pretrained(DECODER_PRETRAINED)

In [None]:
# ENCODER_MAX_LEN = encoder.config.max_position_embeddings
# DECODER_MAX_LEN = decoder.config.max_position_embeddings

# ENCODER_MAX_LEN, DECODER_MAX_LEN

In [None]:
tokenizer_configuration = dict(
    return_tensors='pt',
    max_length=128,
    padding="max_length",
    truncation=True,
)
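As a rough illustration of what `padding="max_length"` and `truncation=True` do to a single sequence, the sketch below pads or cuts a list of token ids to a fixed length and builds the matching attention mask. The helper `pad_or_truncate`, the pad id of 0, and the sample ids are assumptions for illustration; a real tokenizer also keeps its closing special token when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (ids, mask), both exactly max_length long (hypothetical helper)."""
    ids = token_ids[:max_length]          # mimics truncation=True
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)
    # mimics padding="max_length": fill the tail with pad ids, masked out with 0
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

ids, mask = pad_or_truncate([101, 2023, 2003, 102], max_length=6)
print(ids)   # [101, 2023, 2003, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Because every sequence comes out at the same length, a batch can be stacked into one rectangular tensor, which is what `return_tensors='pt'` relies on.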

### Datablock

Try swapping `get_x` and `get_y` to fetch the opposite languages; that gives you a model that trains in the other direction.

In [None]:
dblock = DataBlock(
    get_x = lambda x:x['translation']['en'],
    get_y = lambda x:x['translation']['zh'],
                   )
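A small sketch of the direction swap, using named getter functions on a plain dictionary row. The function names and the `id` value are made up for illustration; the sentence pair is the one previewed in this notebook. Note that reversing the direction would presumably also call for an encoder that reads Chinese and a decoder that writes English, so the pretrained checkpoints and tokenizers would need to change too.

```python
def get_en(row):
    """Fetch the English side of a translation row."""
    return row['translation']['en']

def get_zh(row):
    """Fetch the Chinese side of a translation row."""
    return row['translation']['zh']

# Illustrative row in the same shape as the dataset rows (the id is hypothetical).
row = {
    'id': '0',
    'translation': {'en': 'But the information was closer to me.',
                    'zh': '但这些知识却离我更近了。'},
}

en_to_zh = (get_en(row), get_zh(row))   # x = English, y = Chinese, as in the DataBlock
zh_to_en = (get_zh(row), get_en(row))   # swap the getters to train the other way
print(en_to_zh)
print(zh_to_en)
```

Named functions can be passed to `get_x` and `get_y` in place of the lambdas; unlike lambdas, they can be pickled, which matters if you later want to save the whole pipeline.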

### Datasets

In [None]:
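# Note: the second positional argument of DataBlock.datasets is `verbose`, so `valid`
# only switches on the log below; the default RandomSplitter then holds out 20% of `train`.
dsets = dblock.datasets(train, valid)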

Collecting items from Dataset({
    features: ['id', 'translation'],
    num_rows: 139121
})
Found 139121 items
2 datasets of sizes 111297,27824
Setting up Pipeline: <lambda>
Setting up Pipeline: <lambda>


Preview a row of the dataset: each row is an (English sentence, Chinese sentence) pair.

In [None]:
dsets.train[6]

('But the information was closer to me.', '但这些知识却离我更近了。')

### Dataloaders

* A Dataset handles data at the **row** level, e.g. a pair of sentences
* A Dataloader handles data at the **batch** level, e.g. a batch of tensors made of $n$ rows of data, where $n$ is the batch size

The step that turns rows of raw data into PyTorch tensors is usually called collation (collate).

Here we build a collate function that transforms rows of sentence pairs into tokenized, numericalized tensors using the given tokenizers.

In [None]:
def batch_tokenize_collate(data):
    input_seq, target_seq = list(zip(*data))
    # tokenizing for encoder
    x = encoder_tokenizer(list(input_seq),**tokenizer_configuration)
    input_ids = x.input_ids
    attention_mask = x.attention_mask

    # tokenizing for decoder
    y = decoder_tokenizer(list(target_seq),**tokenizer_configuration)
    decoder_input_ids = y.input_ids
    decoder_attention_mask = y.attention_mask
    
    # Return the batch as (x, y): x holds every tensor the model needs, including the targets,
    # because the forward pass needs both inputs and targets during training and returns the loss directly;
    # y repeats the targets only because the fastai Learner expects batches in (x, y) form.
    return (input_ids, attention_mask, decoder_input_ids, decoder_attention_mask),\
        ( decoder_input_ids, decoder_attention_mask)
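The shape of that return value is easier to see without real tokenizers. The sketch below is a toy version under stated assumptions: `toy_tokenize` is a hypothetical stand-in that maps each character to an integer, and the two rows are illustrative sentence pairs. It shows the `zip(*data)` transposition from rows to columns and the nested `(x, y)` structure.

```python
rows = [
    ("well, there are lots of reasons.", "这有很多原因"),
    ("but it has to be done together.", "这必须由你们共同参与完成。"),
]

def toy_tokenize(texts):
    """Hypothetical stand-in for a tokenizer: one integer id per character."""
    return [[ord(ch) for ch in text] for text in texts]

def toy_collate(data):
    sources, targets = zip(*data)        # rows -> (all sources, all targets)
    src_ids = toy_tokenize(sources)
    tgt_ids = toy_tokenize(targets)
    # x carries the targets too, because the model computes its own loss;
    # y repeats them only to satisfy the (x, y) batch convention.
    return (src_ids, tgt_ids), (tgt_ids,)

xb, yb = toy_collate(rows)
print(len(xb[0]), len(xb[1]))  # 2 2
```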

In [None]:
dls = dsets.dataloaders(bs=64,
                        create_batch=batch_tokenize_collate)

In [None]:
(input_ids, attention_mask, decoder_input_ids, decoder_attention_mask),(
    decoder_input_ids, decoder_attention_mask)=dls.one_batch()

### Reconstruct a batch of data

In [None]:
for e,c in zip(encoder_tokenizer.batch_decode(input_ids, skip_special_tokens=True),
decoder_tokenizer.batch_decode(decoder_input_ids, skip_special_tokens=True)):
    print(e)
    print(c)
    print("="*10)

well, there are lots of reasons.
这 有 很 多 原 因
even if you're logged out, one engineer told me, there are 57 signals that google looks at - - everything from what kind of computer you're on to what kind of browser you're using to where you're located - - that it uses to personally tailor your query results.
一 位 工 程 师 告 诉 我 ， 即 使 你 退 出 帐 号 ， 还 会 有 57 种 信 号 可 供 谷 歌 参 考 - - 几 乎 所 有 的 信 息 ： 从 你 使 用 的 电 脑 型 号 到 你 用 的 浏 览 器 到 你 所 在 的 位 置 - - 谷 歌 利 用 这 些 为 你 定 制 出 个 性 化 的 查 询 结 果 。
but it has to be done together.
这 必 须 由 你 们 共 同 参 与 完 成 。
the whole business is run on sustainable energy.
整 间 餐 厅 都 是 使 用 可 再 生 能 源 ，
he said, " you know what, one of the items on the checklist is lack of remorse, but another item on the checklist is cunning, manipulative.
他 说 ， 你 知 道 吗 ， 检 核 表 上 有 一 项 是 缺 乏 懊 悔 但 另 一 项 却 是 狡 猾 ， 且 控 制 欲 强
and one of the most common faces on something faced with beauty, something stupefyingly delicious, is what i call the omg.
面 对 美 的 最 常 见 的 表 情 之 一 那 种 面 对 难 以 置 信 的 美 味 时 的 表 情 就 

## Model

We create a seq2seq model by combining a pretrained encoder with a pretrained decoder.

In [None]:
# loading pretrained model
encoder_decoder = EncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_pretrained_model_name_or_path=ENCODER_PRETRAINED,
    decoder_pretrained_model_name_or_path=DECODER_PRETRAINED,
)

class Seq2SeqTrain(nn.Module):
    def __init__(self, encoder_decoder):
        super().__init__()
        self.encoder_decoder = encoder_decoder
        
    def forward(self, batch):
        input_ids, attention_mask, decoder_input_ids, decoder_attention_mask = batch

        return self.encoder_decoder(
                input_ids = input_ids,
                attention_mask = attention_mask,
                labels = decoder_input_ids,
                decoder_input_ids=decoder_input_ids,
            )

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at uer/gpt2-chinese-poem and are newly initialized: ['transformer.h.0.crossa

In [None]:
model = Seq2SeqTrain(encoder_decoder)
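The forward pass above passes the padded `decoder_input_ids` as `labels`, so as far as the code shows, positions filled with padding are scored like any other token. A common refinement, which this notebook does not apply, is to replace the padded label positions with -100, the default `ignore_index` of PyTorch's cross-entropy loss, so that they drop out of the loss. The sketch below shows the idea on plain lists; `mask_padding` and the sample ids are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(label_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Replace labels at padded positions (mask == 0) with ignore_index."""
    return [tok if keep == 1 else ignore_index
            for tok, keep in zip(label_ids, attention_mask)]

labels = mask_padding([101, 2769, 102, 0, 0], [1, 1, 1, 0, 0])
print(labels)  # [101, 2769, 102, -100, -100]
```

With tensors, the same effect would come from something like `labels.masked_fill(decoder_attention_mask == 0, -100)` applied before the forward pass.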

## Training

In [None]:
learn = Learner(dls, model, loss_func=lambda output, target:output.loss,)
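The `loss_func` passed to the Learner does no computation of its own: the model output already carries a `loss` attribute, and the function simply hands it back while ignoring the target. A minimal sketch of that contract, with `SimpleNamespace` standing in for the real model output and a made-up loss value:

```python
from types import SimpleNamespace

def passthrough_loss(output, target):
    """Ignore the target and return the loss the model already computed."""
    return output.loss

# Stand-in for the model output; 0.93 is an arbitrary illustrative value.
fake_output = SimpleNamespace(loss=0.93, logits=None)
print(passthrough_loss(fake_output, target=None))  # 0.93
```

A named function like this can replace the lambda in the `Learner` call if you prefer something that is easier to read or to pickle.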

In [None]:
learn.fit(10, lr=5e-5)

epoch,train_loss,valid_loss,time
0,0.961011,0.935475,38:16
1,0.847539,0.836173,38:16
2,0.815385,0.786844,38:17
3,0.766871,0.755473,38:18
4,0.727195,0.73294,38:18
5,0.701056,0.716719,38:19
6,0.669514,0.70428,38:19
7,0.651389,0.694713,38:19
8,0.625053,0.6909,38:19
9,0.609521,0.687882,38:18


## Inference

In [None]:
model = model.cpu()
model = model.eval()

In [None]:
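def inference(text, starter=''):
    # `starter` is currently unused
    tk_kwargs = dict(truncation=True, max_length=128, padding="max_length",
                     return_tensors='pt')
    inputs = encoder_tokenizer([text,], **tk_kwargs)
    with torch.no_grad():
        # beam search with 3 beams; generation starts from token id 101,
        # which decodes to [CLS] in the outputs below
        generated = model.encoder_decoder.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            num_beams=3,
            bos_token_id=101,
        )
    # special tokens are kept so that [CLS], [SEP] and [PAD] stay visible
    return decoder_tokenizer.batch_decode(generated, skip_special_tokens=False)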

In [None]:
inference(
    'And now my mission to control and predict had turned up the answer that the way to live is with vulnerability and to stop controlling and predicting.')

['[CLS] 我 的 目 的 就 是 预 测 来 控 制 这 个 预 测 ， 并 且 预']

In [None]:
inference("I'm going to enjoy this movie")

['[CLS] 我 很 喜 欢 这 部 电 影 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']

In [None]:
inference("Why does this matter")

['[CLS] 为 什 么 这 很 重 要 ？ [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']
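The raw outputs above still contain `[CLS]`, `[SEP]`, `[PAD]` and a space between every character. One way to tidy them, sketched here as a hypothetical helper rather than anything the notebook defines, is to keep only the text before the first `[SEP]`, drop the special tokens, and remove the spaces. Removing every space is an assumption that suits pure Chinese output; it would also glue together any Latin words in the translation.

```python
SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[PAD]")

def clean_generated(text):
    """Keep what comes before the first [SEP], then drop special tokens and spaces."""
    text = text.split("[SEP]")[0]        # treat the first [SEP] as end of sentence
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    return text.replace(" ", "")         # decoded tokens are space-separated

raw = '[CLS] 我 很 喜 欢 这 部 电 影 [SEP] [PAD] [PAD] [PAD]'
print(clean_generated(raw))  # 我很喜欢这部电影
```

Passing `skip_special_tokens=True` to `batch_decode`, as in the batch reconstruction earlier, removes the special tokens but leaves the spaces in place.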