<a href="https://colab.research.google.com/github/whitebearOuO/Bootstrap-starburst/blob/main/%E3%80%8CHW2_demo_ipynb%E3%80%8D%E7%9A%84%E5%89%AF%E6%9C%AC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DEMO2: Chatbot

Dataset: [Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization](https://arxiv.org/abs/1808.08745)

Code adapted from: [huggingface](https://huggingface.co/)

> **Dataset description**

We introduce extreme summarization, a new single-document summarization task which does not favor extractive strategies and calls for an abstractive modeling approach. The idea is to create a short, one-sentence news summary answering the question "What is the article about?". We collect a real-world, large-scale dataset for this task by harvesting online articles from the British Broadcasting Corporation (BBC).

**Train an Encoder-Decoder model that takes a sentence as input and produces a sentence as output.**

Code adapted from:
1. [Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models](https://arxiv.org/pdf/1907.12461.pdf)
2. [huggingface](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)

In [None]:
!pip install datasets transformers torchmetrics



In [None]:
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, EncoderDecoderModel
import torch
import transformers
import matplotlib.pyplot as plt
import pandas as pd
import time
import warnings
warnings.filterwarnings('ignore') # setting ignore as a parameter

## Helper functions used by the model

In [None]:
# transform logits to token ids
def get_pred(logits):
    '''
    Parameters
    ----------
    logits: torch.Tensor, model outputs of shape (batch_size, max_length, vocab_size)

    Returns
    -------
    torch.Tensor of predicted token ids, shape (batch_size, max_length)
    '''
    return logits.argmax(dim=-1)

# transform token ids back to text
def transform(ids):
    # note: only the first sequence in the batch is decoded and returned
    tokenizer = AutoTokenizer.from_pretrained(parameters['tokenizer'])
    return tokenizer.batch_decode(ids, skip_special_tokens=True)[0]
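
To make the `get_pred` step concrete, here is a minimal pure-Python sketch of what taking the argmax over the last dimension does to logits of shape (batch_size, max_length, vocab_size). The nested lists and the helper name `argmax_last_dim` are illustrative assumptions, not part of the notebook.

```python
def argmax_last_dim(logits):
    """For each position, return the index of the largest score (the predicted token id)."""
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in sequence]
        for sequence in logits
    ]

# One sequence (batch_size=1) with two positions over a toy vocabulary of four tokens.
toy_logits = [[[0.1, 2.5, -1.0, 0.3],
               [1.2, 0.0, 0.4, 3.1]]]
token_ids = argmax_last_dim(toy_logits)
print(token_ids)
```

Each inner list of scores collapses to a single id, so the vocabulary dimension disappears and one token id per position remains.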

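ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for automatically evaluating machine-generated summaries and machine translation.

ROUGE compares a system-generated summary against a human-written reference summary and scores summary quality by counting the n-grams the two share; it is an evaluation method based on n-gram recall.

Code reference: [ROUGE score](https://torchmetrics.readthedocs.io/en/stable/text/rouge_score.html)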
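
The n-gram overlap idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions (lowercasing and whitespace tokenization, a hypothetical helper named `rouge_n`); the torchmetrics implementation applies its own text normalization, so its numbers may differ.

```python
from collections import Counter

def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, f1) based on clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred_counts, ref_counts = ngrams(prediction), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred_counts & ref_counts).values())
    pred_total = sum(pred_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

precision, recall, f1 = rouge_n("the cat sat on the mat", "the cat lay on the mat")
print(precision, recall, f1)
```

Recall divides the overlap by the number of n-grams in the reference, precision divides it by the number in the prediction, and the F-measure is their harmonic mean. These correspond to the `_recall`, `_precision` and `_fmeasure` entries read in `cal_metrics`.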

In [None]:
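# calculate ROUGE metrics (F-measure, recall, precision)
from torchmetrics.functional.text.rouge import rouge_score

def cal_metrics(pred, ans, method):
    '''
    Parameters
    ----------
    pred: torch.Tensor, predicted token ids (batch_size, max_length)
    ans: torch.Tensor, reference token ids (batch_size, max_length)
    method: one of 'rouge1', 'rouge2', 'rougeL', 'rougeLsum'

    Returns
    -------
    (f1, recall, precision) for the chosen ROUGE variant
    '''
    # decode the ids to text, then score the prediction against the reference
    score = rouge_score(transform(pred), transform(ans))
    f1 = score[method+'_fmeasure']
    rec = score[method+'_recall']
    prec = score[method+'_precision']

    return f1, rec, prec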

In [None]:
# save model to path
def save_checkpoint(save_path, model):
    if save_path is None:
        return
    torch.save(model.state_dict(), save_path)
    print(f'Model saved to ==> {save_path}')

# load model from path
def load_checkpoint(load_path, model, device):
    if load_path is None:
        return
    state_dict = torch.load(load_path, map_location=device)
    print(f'Model loaded from <== {load_path}')

    model.load_state_dict(state_dict)
    return model

## Loading the data

### Downloading the dataset

- Dataset fields:
  - document: Input news article.
  - summary: One sentence summary of the article.
  - id: BBC ID of the article.

See https://huggingface.co/datasets/xsum for more information.

In [None]:
from datasets import load_dataset

dataset = load_dataset("xsum")

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

Take a look at the data format.

In [None]:
dataset['train'][0]

 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.',
 'id': '35232142'}

In [None]:
import pandas as pd
train_df = pd.DataFrame(dataset['train'])
val_df = pd.DataFrame(dataset['validation'])
test_df = pd.DataFrame(dataset['test'])

print(type(train_df['document'][0]))

print('# of train_df:', len(train_df))
print('# of dev_df:', len(val_df))
print('# of test_df data:', len(test_df))

# save data
train_df.to_csv('./train.tsv', sep='\t', index=False)
val_df.to_csv('./val.tsv', sep='\t', index=False)
test_df.to_csv('./test.tsv', sep='\t', index=False)

<class 'str'>
# of train_df: 204045
# of dev_df: 11332
# of test_df data: 11334


### Custom Dataset with the tokenization step built in

We wrap the data with Dataset + DataLoader.
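
Before the actual class, here is a minimal sketch of what `padding="max_length"` together with `truncation=True` produces for one example: a fixed-length id list plus an attention mask that marks the real tokens. The helper name `pad_or_truncate` and the token ids are illustrative assumptions, and the sketch simply cuts off the tail, whereas the real BERT tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_token_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]          # truncate sequences that are too long
    n_pad = max_length - len(kept)         # pad sequences that are too short
    input_ids = kept + [pad_token_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the default DataLoader collation can stack them into one tensor per batch.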

In [None]:
from torch.utils.data import Dataset
import torch
from transformers import AutoTokenizer

class CustomDataset(Dataset):
    def __init__(self, df, args):
        '''
        Parameters
        ----------
        df: (DataFrame) input data with 'document' and 'summary' columns
        args: (dict) parameters; uses 'encoder_max_length', 'decoder_max_length' and 'tokenizer'
        '''
        self.df = df
        self.encoder_max_length = args['encoder_max_length']
        self.decoder_max_length = args['decoder_max_length']
        self.tokenizer = AutoTokenizer.from_pretrained(args['tokenizer'])

    def __len__(self):
        return len(self.df)

    def tokenize(self, row):
        # tokenize the inputs and labels
        data = {}
        inputs = self.tokenizer.encode_plus(
                row["document"],
                None,
                add_special_tokens=True,
                max_length=self.encoder_max_length,
                padding="max_length",
                truncation=True,
                return_token_type_ids=True
              )
        outputs = self.tokenizer.encode_plus(
                row["summary"],
                None,
                add_special_tokens=True,
                max_length=self.decoder_max_length,
                padding="max_length",
                truncation=True,
                return_token_type_ids=True
              )


        data["document"] = row["document"]
        data["summary"] = row["summary"]
        data["input_ids"] = torch.tensor(inputs.input_ids, dtype=torch.long)
        data["attention_mask"] = torch.tensor(inputs.attention_mask, dtype=torch.long)
        data["decoder_attention_mask"] = torch.tensor(outputs.attention_mask, dtype=torch.long)
        data["labels"] = torch.tensor(outputs.input_ids, dtype=torch.long)

        return data

    def __getitem__(self, index):
        # positional lookup, so it also works if the DataFrame index is not 0..n-1
        vectors = self.tokenize(self.df.iloc[index])

        return vectors
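
One thing to be aware of: the `labels` built above keep the padding ids, so as far as I can tell the padded positions also count toward the loss. A common remedy, shown here only as a hedged sketch, is to replace padding ids in the labels with -100, the default `ignore_index` of PyTorch's cross-entropy loss. The helper name `mask_padding` is hypothetical, and the pad id 0 is an assumption that matches `padding_idx=0` for `bert-base-uncased`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(label_ids, pad_token_id=0):
    """Replace padding ids so that those positions are skipped by the loss."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in label_ids]

masked = mask_padding([101, 2023, 2003, 102, 0, 0, 0])
print(masked)
```

If this were adopted in the notebook, the masked positions would have to be mapped back to the pad id before the labels are decoded for ROUGE, since -100 is not a valid token id.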

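## Define your hyperparameters

* If your machine runs out of memory, try reducing batch_size.
* Because we fine-tune an existing pre-trained model, you usually do not need many epochs.
* config is the pre-trained model we build on; you can swap in another one that suits your task.
* If your model overfits, try raising dropout.
* Try raising or lowering learning_rate; it affects how fast the model learns (the size of each update step).
* Inspect your data before deciding on max_len (encoder_max_length and decoder_max_length).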
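
For the last point, one way to inspect the data is to look at length percentiles before fixing the maximum lengths. The sketch below is an assumption-laden stand-in: it counts whitespace-separated words (which underestimates WordPiece token counts), uses a hypothetical helper `length_percentile`, and runs on a tiny made-up sample. In the notebook you would apply the real tokenizer to `train_df['document']` and `train_df['summary']` instead.

```python
import math

def length_percentile(texts, q):
    """q-th percentile (0-100) of whitespace-token counts, nearest-rank method."""
    lengths = sorted(len(text.split()) for text in texts)
    rank = max(1, math.ceil(q / 100 * len(lengths)))
    return lengths[rank - 1]

# Hypothetical sample; replace with real documents or summaries.
sample_documents = [
    "short text",
    "a somewhat longer piece of text",
    "this one is the longest example in the tiny hypothetical sample",
]
print(length_percentile(sample_documents, 50), length_percentile(sample_documents, 100))
```

Choosing a maximum length near a high percentile rather than the absolute maximum keeps most examples intact while limiting the padding that every batch has to carry.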

In [None]:
# Hyperparameters
parameters = {
    "tokenizer": 'bert-base-uncased', # these three do not have to be the same
    "config1": 'bert-base-uncased', # encoder
    "config2": 'bert-base-uncased', # decoder
    "decoder_max_length": 512,
    "encoder_max_length": 512,
    "learning_rate": 1e-2,
    "epochs": 3,
    "batch_size": 2, # lower this if you run out of memory
    "dropout": 0.1, # raise this if the model overfits
}

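## Start training

* To keep the demo quick, we do not use the full dataset; we sample part of it to cut the training time.
* For better performance, increase the amount of data and the number of epochs.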

In [None]:
import transformers
import pandas as pd

# load training data
train_df = pd.read_csv('./train.tsv', sep = '\t').dropna().sample(1000).reset_index(drop=True) # the sample size is adjustable
train_dataset = CustomDataset(train_df, parameters)
train_loader = DataLoader(train_dataset, batch_size=parameters['batch_size'], shuffle=True)

# load validation data
val_df = pd.read_csv('./val.tsv', sep = '\t').dropna().sample(250).reset_index(drop=True)
val_dataset = CustomDataset(val_df, parameters)
val_loader = DataLoader(val_dataset, batch_size=parameters['batch_size'], shuffle=True)

print('load training data : %d'%(len(train_dataset)))
print('load validation data : %d'%(len(val_dataset)))

load training data : 1000
load validation data : 250


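*   Load the model (we start from a pre-trained model and fine-tune it on our dataset)
*   Define the optimizer
  *   Adam is usually enough; you can also try alternatives such as SGD
  *   Decide for yourself whether to add a scheduler (you can write your own function or use an existing one)

  [Remember that PyTorch schedulers count in steps; if you want to define the schedule in epochs, you have to convert it yourself.]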
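
The epoch-to-step conversion mentioned in the note can be sketched as follows. This is one possible schedule (linear warmup followed by linear decay), not the one used in this notebook; the function names and the 10% warmup fraction are assumptions made for illustration.

```python
import math

def total_training_steps(num_examples, batch_size, epochs):
    """Convert an epoch budget into optimizer steps (the last partial batch counts)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: rises linearly from 0 to 1, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Values taken from this notebook: 1000 sampled rows, batch_size 2, 3 epochs.
total_steps = total_training_steps(num_examples=1000, batch_size=2, epochs=3)
warmup_steps = total_steps // 10
print(total_steps, warmup_steps, linear_warmup_decay(75, warmup_steps, total_steps))
```

A function like this could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, with `scheduler.step()` called once after every `optimizer.step()` so that the counter advances per batch rather than per epoch.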

In [None]:
from transformers import AutoTokenizer, EncoderDecoderModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

transformers.logging.set_verbosity_error() # close the warning message
tokenizer = AutoTokenizer.from_pretrained(parameters['tokenizer'])

model = EncoderDecoderModel.from_encoder_decoder_pretrained(parameters['config1'], parameters['config2']).to(device)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.max_length = parameters['encoder_max_length']
model.config.vocab_size = model.config.decoder.vocab_size

In [None]:
model

EncoderDecoderModel(
  (encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, el

In [None]:
## You can custom your optimizer ##
# we use Adam here
optimizer = torch.optim.Adam(model.parameters(), lr=parameters['learning_rate'], betas=(0.9, 0.98), eps=1e-9)

In [None]:
# evaluate dataloader
def evaluate(model, data_loader, args, device):
    val_loss, val_f1, val_rec, val_prec = 0.0, 0.0, 0.0, 0.0
    step_count = 0
    model.eval()
    with torch.no_grad():
        for data in data_loader:
            input_ids = data["input_ids"].to(device)
            attention_mask = data["attention_mask"].to(device)
            decoder_attention_mask = data["decoder_attention_mask"].to(device)
            labels = data["labels"].to(device)

            outputs = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    decoder_attention_mask=decoder_attention_mask,
                    labels=labels,
                    output_attentions=True
                    )

            loss, logits = outputs.loss, outputs.logits
            f1, rec, prec = cal_metrics(get_pred(logits), labels, 'rouge1')

            val_loss += loss.item()
            val_f1 += f1
            val_rec += rec
            val_prec += prec
            step_count += 1

        val_loss = val_loss / step_count
        val_f1 = val_f1 / step_count
        val_rec = val_rec / step_count
        val_prec = val_prec / step_count

    return val_loss, val_f1, val_rec, val_prec

In [None]:
print(device)

cpu


In [None]:
# Start training
import time
metrics = ['loss', 'f1', 'rec', 'prec']  # only metrics that are actually recorded below
mode = ['train_', 'val_']
record = {s+m :[] for s in mode for m in metrics}

for epoch in range(parameters["epochs"]):

    st_time = time.time()
    train_loss, train_f1, train_rec, train_prec = 0.0, 0.0, 0.0, 0.0
    step_count = 0

    model.train()
    for data in train_loader:

        input_ids = data["input_ids"].to(device)
        attention_mask = data["attention_mask"].to(device)
        decoder_attention_mask = data["decoder_attention_mask"].to(device)
        labels = data["labels"].to(device)

        optimizer.zero_grad()

        outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                decoder_attention_mask=decoder_attention_mask,
                labels=labels,
                output_attentions=True
                )
        loss, logits = outputs.loss, outputs.logits
        f1, rec, prec = cal_metrics(get_pred(logits), labels, 'rouge1')

        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        train_f1 += f1
        train_rec += rec
        train_prec += prec
        step_count += 1

    val_loss, val_f1, val_rec, val_prec = evaluate(model, val_loader, parameters, device)

    train_loss = train_loss / step_count
    train_f1 = train_f1 / step_count
    train_rec = train_rec / step_count
    train_prec = train_prec / step_count

    print('[epoch %d] cost time: %.4f s'%(epoch + 1, time.time() - st_time))
    print('         loss     f1      rec    prec')
    print('train | %.4f, %.4f, %.4f, %.4f'%(train_loss, train_f1, train_rec, train_prec))
    print('val   | %.4f, %.4f, %.4f, %.4f\n'%(val_loss, val_f1, val_rec, val_prec))

    # record training metrics of each training epoch
    record['train_loss'].append(train_loss)
    record['train_f1'].append(train_f1)
    record['train_rec'].append(train_rec)
    record['train_prec'].append(train_prec)

    record['val_loss'].append(val_loss)
    record['val_f1'].append(val_f1)
    record['val_rec'].append(val_rec)
    record['val_prec'].append(val_prec)

In [None]:
# save model
save_checkpoint('./bert.pt', model)

In [None]:
# draw learning curve
import matplotlib.pyplot as plt
def draw_pics(record, name, img_save=False, show=False):
    x_ticks = range(1, parameters['epochs']+1)

    plt.figure(figsize=(6, 3))

    plt.plot(x_ticks, record['train_'+name], '-o', color='lightskyblue',
             markeredgecolor="teal", markersize=3, markeredgewidth=1, label = 'Train')
    plt.plot(x_ticks, record['val_'+name], '-o', color='pink',
             markeredgecolor="salmon", markersize=3, markeredgewidth=1, label = 'Val')
    plt.grid(color='lightgray', linestyle='--', linewidth=1)

    plt.title('Model', fontsize=14)
    plt.ylabel(name, fontsize=12)
    plt.xlabel('Epoch', fontsize=12)
    plt.xticks(x_ticks, fontsize=12)
    plt.yticks(fontsize=12)
    plt.legend(loc='lower right' if not name.lower().endswith('loss') else 'upper right')

    if img_save:
        plt.savefig(name+'.png', transparent=False, dpi=300)
    if show:
        plt.show()

    plt.close()

In [None]:
draw_pics(record, 'loss', img_save=False, show=True)
draw_pics(record, 'f1', img_save=False, show=True)

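## Prediction results

### generate function

Every Hugging Face model that inherits from `PreTrainedModel` has a `generate` function, a ready-made method that can be used directly.

Let's see what it predicts when used as-is.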
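
As a rough mental model of greedy decoding, the default that the cell below relies on, here is a toy sketch in plain Python. The function `greedy_decode` and the stand-in scorer `toy_scores` are hypothetical; a real model would return vocabulary-sized logits conditioned on the encoder output.

```python
def greedy_decode(next_token_scores, start_id, eos_id, max_length):
    """Generate ids one at a time, always taking the highest-scoring next token."""
    generated = [start_id]
    while len(generated) < max_length:
        scores = next_token_scores(generated)   # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        generated.append(next_id)
        if next_id == eos_id:                   # stop once end-of-sequence appears
            break
    return generated

# Hypothetical stand-in for the decoder: it favors token (last_id + 1) in a vocabulary of 5 ids.
def toy_scores(prefix):
    target = (prefix[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]

print(greedy_decode(toy_scores, start_id=0, eos_id=3, max_length=10))
```

Unlike the training forward pass, where the reference summary is fed to the decoder as `labels`, generation feeds the model's own previous outputs back in, which is why `decoder_start_token_id`, `eos_token_id` and a maximum length have to be configured.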

In [None]:
import torch  # required here; without it this cell fails with a NameError in a fresh runtime
from transformers import AutoTokenizer, EncoderDecoderModel

# assumes the hyperparameters cell above has been run, so `parameters` is defined
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(parameters['tokenizer'])
# note: despite its name, this model is built from the pre-trained checkpoints only;
# the weights trained above are loaded in a later cell
finetune_model = EncoderDecoderModel.from_encoder_decoder_pretrained(parameters['config1'], parameters['config2']).to(device)
finetune_model.config.decoder_start_token_id = tokenizer.cls_token_id
finetune_model.config.eos_token_id = tokenizer.sep_token_id
finetune_model.config.pad_token_id = tokenizer.pad_token_id
finetune_model.config.max_length = parameters['encoder_max_length']
finetune_model.config.vocab_size = finetune_model.config.decoder.vocab_size

# let's perform inference on a piece of text
inputs = (
    "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
)
input_ids = tokenizer(inputs, return_tensors="pt").input_ids.to(device)

# autoregressively generate summary (uses greedy decoding by default)
generated_ids = finetune_model.generate(input_ids)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)

### Predicting a single sentence
* Because training was short and the data sample small, the results are not very good.

In [None]:
# predict single sentence
def predict_one(query, model, tokenizer, device):
    model.eval()
    with torch.no_grad():
        inputs = tokenizer.encode_plus(
                query,
                None,
                add_special_tokens=True,
                max_length = parameters["encoder_max_length"],
                truncation = True,
                padding = 'max_length',
                return_token_type_ids=True
            )
        query_ids = torch.tensor([inputs.input_ids], dtype=torch.long).to(device)
        attention_mask = torch.tensor([inputs.attention_mask], dtype=torch.long).to(device)
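        # single teacher-forced forward pass: the query itself is passed as labels,
        # so this is not autoregressive generation as done by model.generate
        outputs = model(input_ids=query_ids,
                    attention_mask=attention_mask,
                    decoder_attention_mask=attention_mask,
                    labels=query_ids
                    )
        pred_ids = get_pred(outputs.logits)
        pred_sentence = transform(pred_ids)
        return pred_sentence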

In [None]:
inputs = "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
tokenizer = AutoTokenizer.from_pretrained(parameters['tokenizer'])
pred = predict_one(inputs, model, tokenizer, device)
print(pred)

* First initialize a model with the same architecture, then load the trained parameters into it.

In [None]:
# load model from training result
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

init_model = EncoderDecoderModel.from_encoder_decoder_pretrained(parameters['config1'], parameters['config2'])
init_model.config.decoder_start_token_id = tokenizer.cls_token_id
init_model.config.eos_token_id = tokenizer.sep_token_id
init_model.config.pad_token_id = tokenizer.pad_token_id
init_model.config.max_length = parameters['encoder_max_length']
init_model.config.vocab_size = init_model.config.decoder.vocab_size

final_model = load_checkpoint('./bert.pt', init_model, device).to(device)

In [None]:
pred = predict_one(inputs, final_model, tokenizer, device)
print(pred)

In [None]:
# evaluate testing data
test_df = pd.read_csv('./test.tsv', sep = '\t').dropna().sample(500).reset_index(drop=True)
test_dataset = CustomDataset(test_df, parameters)
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)

test_loss, test_f1, test_rec, test_prec = evaluate(final_model, test_loader, parameters, device)
print('test_f1: %.4f'%(test_f1))