<a href="https://colab.research.google.com/github/unpackAI/unpackai/blob/main/examples/nlp_seq2seq_en_to_zh_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence to sequence model

## Installation

In [1]:
!pip install -q unpackai==0.1.8.9
!pip install -q transformers
!pip install -Uqq fastai

In [2]:
!pip install -q datasets

## Imports

In [3]:
from datasets import load_dataset
from fastai.text.all import *
from unpackai.nlp import *

## Data

In [5]:
# other language pairs are also available: en-ru, en-fr, en-it, etc.
dataset = load_dataset("ted_iwlst2013", "en-zh")

Reusing dataset ted_iwlst2013 (/root/.cache/huggingface/datasets/ted_iwlst2013/en-zh/1.1.0/769086006155211ed7233545de12bce6fe41e1c71f509a3f062e294cb3c00e99)



In [6]:
entire = dataset['train']
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 154579
    })
})

### Split data to train/valid

The dataset we downloaded has no validation split, so we hold out 10% of the training rows to make one.

In [8]:
splited = entire.train_test_split(test_size = .1,)
splited

Loading cached split indices for dataset at /root/.cache/huggingface/datasets/ted_iwlst2013/en-zh/1.1.0/769086006155211ed7233545de12bce6fe41e1c71f509a3f062e294cb3c00e99/cache-64b6ac3983d87e67.arrow and /root/.cache/huggingface/datasets/ted_iwlst2013/en-zh/1.1.0/769086006155211ed7233545de12bce6fe41e1c71f509a3f062e294cb3c00e99/cache-4167f5cdef840354.arrow


DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 139121
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 15458
    })
})
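The row counts above (139121 and 15458 out of 154579) are consistent with shuffling the rows and rounding the held-out 10% up. Here is a minimal sketch of that idea in plain Python. `split_indices` is a hypothetical helper written for illustration, not the `datasets` implementation, and the `seed` value is an assumption.

```python
import math
import random

def split_indices(n_rows, test_size=0.1, seed=0):
    """Shuffle row indices and hold out a fraction of them for validation."""
    n_test = math.ceil(n_rows * test_size)   # round the held-out share up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # seeded so the split is repeatable
    return indices[n_test:], indices[:n_test]

train_idx, valid_idx = split_indices(154579, test_size=0.1)
print(len(train_idx), len(valid_idx))  # 139121 15458
```

Because the rows are shuffled, every run of the notebook can produce a different split unless a fixed seed is used; `train_test_split` accepts a `seed` argument for that purpose.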

In [9]:
train = splited['train']
valid = splited['test']

In [10]:
train[30]

{'id': '86514',
 'translation': {'en': 'And now my mission to control and predict had turned up the answer that the way to live is with vulnerability and to stop controlling and predicting.',
  'zh': '而我现在的使命 即控制并预测 却给出了这样一个结果：要想与脆弱共存 就得停止控制，停止预测'}}

## Tokenizer and pretrained model
> The tokenizer will be part of the data pipeline, so let's download the pretrained tokenizers and pretrained models before we create a data block

In [11]:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoModel,
    EncoderDecoderModel
    )

In [12]:
# For the encoder we pick a pretrained BERT, which is good at understanding English.
# BERT is short for Bidirectional **Encoder** Representations from Transformers; it consists entirely of encoder blocks.
ENCODER_PRETRAINED = "bert-base-uncased"
# For the decoder we pick a model that writes Chinese, since the decoder is the part of the model that generates text.
DECODER_PRETRAINED = "uer/gpt2-chinese-poem"

### Load pretrained models

In [13]:
encoder = AutoModel.from_pretrained(ENCODER_PRETRAINED)
decoder = AutoModelForCausalLM.from_pretrained(DECODER_PRETRAINED, add_cross_attention=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at uer/gpt2-chinese-poem and are newly initialized: ['transformer.h.2.ln_cro

### Load pretrained tokenizers

In [14]:
encoder_tokenizer = AutoTokenizer.from_pretrained(ENCODER_PRETRAINED)
decoder_tokenizer = AutoTokenizer.from_pretrained(DECODER_PRETRAINED)

In [15]:
ENCODER_MAX_LEN = encoder.config.max_position_embeddings
DECODER_MAX_LEN = decoder.config.max_position_embeddings

ENCODER_MAX_LEN, DECODER_MAX_LEN

(512, 1024)

In [16]:
tokenizer_configuration = dict(
    return_tensors='pt',
    max_length=128,
    padding="max_length",
    truncation=True,
)
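To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a rough plain-Python sketch of what happens to one sequence of token ids. `pad_or_truncate` is a hypothetical helper, and the pad id of 0 is an assumption taken from the zeros in the batch output further down. Real tokenizers also keep the closing special token when they truncate, which this sketch leaves out.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: cut long inputs
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill short inputs
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 2057, 2202, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)
print(short_ids, short_mask)  # [101, 2057, 2202, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5, 6] [1, 1, 1, 1, 1, 1]
```

Fixing every sequence to the same length (128 here) lets a batch be stacked into one rectangular tensor, at the cost of cutting off longer sentences.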

### Datablock

Try swapping get_x and get_y to fetch the opposite languages if you want a model that trains in the other direction

In [17]:
dblock = DataBlock(
    get_x = lambda x:x['translation']['en'],
    get_y = lambda x:x['translation']['zh'],
                   )
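As a small illustration of reversing the direction, the sketch below applies the two getters to a single row shaped like the dataset rows shown earlier. `get_en` and `get_zh` are hypothetical named versions of the lambdas, and the row `id` is made up.

```python
def get_en(row):
    """Pick the English side of a translation row."""
    return row['translation']['en']

def get_zh(row):
    """Pick the Chinese side of a translation row."""
    return row['translation']['zh']

# A toy row in the same shape as the dataset rows (the id is invented)
row = {'id': '0',
       'translation': {'en': 'Now you prepare for the inevitable.',
                       'zh': '现在 你要为不可避免的事情做准备'}}

en_to_zh = (get_en(row), get_zh(row))  # x = English, y = Chinese (as in this notebook)
zh_to_en = (get_zh(row), get_en(row))  # x = Chinese, y = English (reversed direction)
print(en_to_zh)
print(zh_to_en)
```

Reversing the direction would presumably also call for swapping the pretrained checkpoints, so that the encoder reads Chinese and the decoder writes English.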

### Datasets

In [18]:
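# Note: the second positional argument of DataBlock.datasets is `verbose`, so `valid` is not
# used as the validation set here. With no splitter given, fastai randomly holds out 20% of
# `train` instead, which is why the log below reports sizes 111297 and 27824.
dsets = dblock.datasets(train, valid)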

Collecting items from Dataset({
    features: ['id', 'translation'],
    num_rows: 139121
})
Found 139121 items
2 datasets of sizes 111297,27824
Setting up Pipeline: <lambda>
Setting up Pipeline: <lambda>


Preview one row of the dataset: a pair made of an English sentence and its Chinese translation

In [19]:
dsets.train[6]

('Now you prepare for the inevitable.', '现在 你要为不可避免的事情做准备')

### Dataloaders

* A Dataset handles data at the **row** level, e.g. one pair of sentences
* A Dataloader handles data at the **batch** level, e.g. a batch of tensors built from $n$ rows of data, where $n$ is the batch size

Turning rows of raw data into PyTorch tensors is usually called collation.

Here we build a collate function that turns rows of sentence pairs into tokenized, numericalized tensors using the given tokenizers

In [20]:
def batch_tokenize_collate(data):
    input_seq, target_seq = list(zip(*data))
    # tokenizing for encoder
    x = encoder_tokenizer(list(input_seq),**tokenizer_configuration)
    input_ids = x.input_ids
    attention_mask = x.attention_mask

    # tokenizing for decoder
    y = decoder_tokenizer(list(target_seq),**tokenizer_configuration)
    decoder_input_ids = y.input_ids
    decoder_attention_mask = y.attention_mask
    
    # Return the batch as (x, y). x carries everything the forward pass needs,
    # including the decoder inputs, because the model computes the loss itself.
    # fastai's Learner still expects an (x, y) pair, so y repeats the decoder tensors.
    return (input_ids, attention_mask, decoder_input_ids, decoder_attention_mask),\
        ( decoder_input_ids, decoder_attention_mask)
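The core move in the collate function is `zip(*data)`, which turns a list of (input, target) rows into one tuple of inputs and one tuple of targets. The sketch below shows that step on toy strings, without any tokenizer. `chunk` and `toy_collate` are hypothetical helpers for illustration only.

```python
def chunk(rows, bs):
    """Group rows into consecutive batches of at most bs rows."""
    return [rows[i:i + bs] for i in range(0, len(rows), bs)]

def toy_collate(batch):
    """Transpose a list of (input, target) rows into (inputs, targets)."""
    inputs, targets = zip(*batch)
    return list(inputs), list(targets)

rows = [("a", "甲"), ("b", "乙"), ("c", "丙")]
batches = [toy_collate(b) for b in chunk(rows, bs=2)]
print(batches)  # [(['a', 'b'], ['甲', '乙']), (['c'], ['丙'])]
```

In the real collate function each of the two lists is then passed to its own tokenizer, which returns padded tensors of shape (batch size, max_length).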

In [21]:
dls = dsets.dataloaders(bs=8,
                        create_batch=batch_tokenize_collate)

In [22]:
dls.one_batch()

((tensor([[ 101, 2057, 2202,  ...,    0,    0,    0],
          [ 101, 2074, 2061,  ...,    0,    0,    0],
          [ 101, 4067, 2017,  ...,    0,    0,    0],
          ...,
          [ 101, 2061, 2057,  ...,    0,    0,    0],
          [ 101, 4312, 1010,  ...,    0,    0,    0],
          [ 101, 5674, 2065,  ...,    0,    0,    0]], device='cuda:0'),
  tensor([[1, 1, 1,  ..., 0, 0, 0],
          [1, 1, 1,  ..., 0, 0, 0],
          [1, 1, 1,  ..., 0, 0, 0],
          ...,
          [1, 1, 1,  ..., 0, 0, 0],
          [1, 1, 1,  ..., 0, 0, 0],
          [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0'),
  tensor([[ 101, 2769,  812,  ...,    0,    0,    0],
          [ 101, 2769, 2682,  ...,    0,    0,    0],
          [ 101, 6468, 6468,  ...,    0,    0,    0],
          ...,
          [ 101, 8020,  830,  ...,    0,    0,    0],
          [ 101, 8020, 5010,  ...,    0,    0,    0],
          [ 101, 2682, 6496,  ...,    0,    0,    0]], device='cuda:0'),
  tensor([[1, 1, 1,  ..., 0, 0, 0]

## Model

We create a seq2seq model by combining the pretrained encoder with the pretrained decoder

In [25]:
encoder_decoder = EncoderDecoderModel(encoder=encoder, decoder=decoder)

class Seq2SeqTrain(nn.Module):
    def __init__(self, encoder_decoder):
        super().__init__()
        self.encoder_decoder = encoder_decoder
        
    def forward(self, batch):
        input_ids, attention_mask, decoder_input_ids, decoder_attention_mask = batch
        return self.encoder_decoder(
            input_ids = input_ids,
            attention_mask = attention_mask,
            decoder_input_ids = decoder_input_ids,
            decoder_attention_mask = decoder_attention_mask,
            labels = decoder_input_ids,
        )
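The forward pass above reuses decoder_input_ids as labels, so the padded positions (token id 0) are scored by the loss along with the real tokens. One possible refinement, which is not part of the original notebook, is to replace the labels at padded positions with -100, the value that PyTorch's cross-entropy loss and the Hugging Face models skip by default. Here is a minimal sketch on plain lists; `mask_padding` is a hypothetical helper and the token ids are toy values.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss by default

def mask_padding(label_rows, mask_rows, ignore_index=IGNORE_INDEX):
    """Replace labels at padded positions (mask == 0) with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(labels, mask)]
        for labels, mask in zip(label_rows, mask_rows)
    ]

# One toy target sequence padded to length 6
decoder_input_ids = [[101, 2769, 812, 102, 0, 0]]
decoder_attention_mask = [[1, 1, 1, 1, 0, 0]]

labels = mask_padding(decoder_input_ids, decoder_attention_mask)
print(labels)  # [[101, 2769, 812, 102, -100, -100]]
```

With tensors, something like `decoder_input_ids.masked_fill(decoder_attention_mask == 0, -100)` would express the same idea inside `forward`.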

In [26]:
model = Seq2SeqTrain(encoder_decoder)

## Training

In [27]:
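# The model already computes its own loss in forward(), so loss_func just reads it
# from the model output and ignores the target.
learn = Learner(dls, model, loss_func=lambda output, target: output.loss)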

In [None]:
learn.fit(1)

epoch,train_loss,valid_loss,time
