# Build Your Own GPT-2

## 2. Fine Tuning the Language Model

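In the [previous post]() we used the pretrained tokenizer and language model from the HuggingFace `transformers` package to generate new text automatically. Next, we go one step further and fine-tune the pretrained language model on a corpus we prepare ourselves.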

### References:
- [Natural Language Generation Part 2: GPT2 and Huggingface](https://towardsdatascience.com/natural-language-generation-part-2-gpt-2-and-huggingface-f3acb35bc86a)

- [(transformers-document) Training and fine-tuning](https://huggingface.co/transformers/training.html)

In [1]:
# setup imports to use the model
from transformers import TFGPT2LMHeadModel
from transformers import GPT2Tokenizer, BertTokenizer

model = TFGPT2LMHeadModel.from_pretrained('ckiplab/gpt2-base-chinese', from_pt=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

input_ids = tokenizer.encode("今天天氣很好", return_tensors='tf')
print(input_ids)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFGPT2LMHeadModel: ['transformer.h.6.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.11.attn.masked_bias', 'transformer.h.0.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'lm_head.weight']
- This IS expected if you are initializing TFGPT2LMHeadModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFGPT2LMHeadModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassifica

tf.Tensor([[ 101  791 1921 1921 3706 2523 1962  102]], shape=(1, 8), dtype=int32)


We can save the pretrained model first, so that it serves as the starting point for the fine-tuning that follows.

> model.save_pretrained('../data/mygpt2/')


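## Converting the corpus into a training format

Next, we convert the training corpus into the format that `TFGPT2LMHeadModel()` consumes.

### 1. Building the list of corpus text files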

In [2]:
# Get all files in the corpus directory
import os

def list_corpus_files(corpus_path, prefix='wiki_'):
    """Walk corpus_path recursively and return the paths of files whose names start with prefix."""
    flist = []
    for dirPath, dirNames, fileNames in os.walk(corpus_path):
        for f in fileNames:
            if f.startswith(prefix):
                flist.append(os.path.join(dirPath, f))
    return flist

# Test function with the wiki_zh corpus (the raw string keeps the Windows backslashes literal)
wikifiles = list_corpus_files(r'D:\data\corpus\wiki_zh')

print(len(wikifiles))
print(wikifiles[:5])

1274
['D:\\data\\corpus\\wiki_zh\\AA\\wiki_00', 'D:\\data\\corpus\\wiki_zh\\AA\\wiki_01', 'D:\\data\\corpus\\wiki_zh\\AA\\wiki_02', 'D:\\data\\corpus\\wiki_zh\\AA\\wiki_03', 'D:\\data\\corpus\\wiki_zh\\AA\\wiki_04']


### 2. A helper for reading a single corpus file

In [3]:
# Read a corpus file that stores one JSON object per line
def read_json_corpus(corpus_file, to_zhtw=True):
    import json
    import opencc
    # Simplified Chinese -> Traditional Chinese (Taiwan standard), only when requested
    converter = opencc.OpenCC('s2tw.json') if to_zhtw else None
    data = []
    with open(corpus_file, 'r', encoding='utf8') as f:
        for line in f:
            if converter is not None:
                line = converter.convert(line)
            data.append(json.loads(line))
    return data

# Test 
data = read_json_corpus(wikifiles[1])
print(len(data))
print(data[5]['text'][:100])

59
年表列表

這是年表列表，年表或歷史年表（Timeline），又稱時間表、時間軸，是將有關歷史的資料依時間或年份先後排列而成的列表，即將一件發生在某年某月某日的歷史事件，以最簡單的形式，重點地記錄下來
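As a rough illustration of the file layout that `read_json_corpus` expects, the sketch below parses an in-memory sample with one JSON object per line. Only the `text` field is confirmed by the notebook; the `id` and `title` fields, the sample contents, and the helper name `read_json_lines` are assumptions made for this example.

```python
import io
import json

# Two hypothetical articles, one JSON object per line (JSON Lines layout).
# Only the 'text' key is taken from the notebook; 'id' and 'title' are assumed.
sample_file = io.StringIO(
    '{"id": "1", "title": "A", "text": "A\\n\\nfirst article"}\n'
    '{"id": "2", "title": "B", "text": "B\\n\\nsecond article"}\n'
)

def read_json_lines(handle):
    """Parse every non-empty line of an open text handle as one JSON object."""
    return [json.loads(line) for line in handle if line.strip()]

articles = read_json_lines(sample_file)
print(len(articles))              # 2
print(repr(articles[0]['text']))  # 'A\n\nfirst article'
```

Each parsed article is a plain dictionary, which is why the test above can index `data[5]['text']` directly.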


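### 3. Converting the corpus into tokenized data

We now have helper functions that can read a large amount of text. The [`train.py`](https://github.com/Morizeyao/GPT2-Chinese/blob/master/train.py) script of [`gpt2-chinese`](https://github.com/Morizeyao/GPT2-Chinese/), which we use as a reference, contains a function that converts a large body of text into tokenized data. It works as follows:

1. Replace each newline character (`\n`) with the BertTokenizer separator token (`[SEP]`), which then marks the end of a paragraph.
2. Regroup the document-based corpus:
    1. Given the number of output files `num_pieces`, distribute the documents evenly across the pieces.
    2. Convert every document longer than `min_length` into token ids with the specified tokenizer (here, BertTokenizer), and add a `[MASK]` token before and a `[CLS]` token after each document.
    3. Save the result piece by piece.

In essence this function is a tool for converting a document corpus into token vectors; it does not have to be used in exactly this way.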
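The steps above can be sketched on a toy example without loading a real tokenizer. Everything here is an assumption made for illustration: the tiny `vocab` with made-up ids, the character-level `toy_tokenize`, and the helper name `flatten_documents`. The real function uses `BertTokenizer` and writes the ids to files instead of returning them.

```python
# Hypothetical toy vocabulary; real BertTokenizer ids are different.
vocab = {'[UNK]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3, 'a': 4, 'b': 5, 'c': 6}

def toy_tokenize(text, vocab):
    """Character-level stand-in for a tokenizer; special tokens stay whole."""
    ids = []
    for word in text.split():
        if word in vocab:  # special tokens such as [SEP]
            ids.append(vocab[word])
        else:
            ids.extend(vocab.get(ch, vocab['[UNK]']) for ch in word)
    return ids

def flatten_documents(docs, vocab, min_length):
    """Build one id stream of the form [MASK] doc [CLS] [MASK] doc [CLS] ..."""
    stream = []
    for doc in docs:
        doc = doc.replace('\n', ' [SEP] ')  # newline -> paragraph separator
        if len(doc) <= min_length:          # skip documents that are too short
            continue
        stream.append(vocab['[MASK]'])      # document start marker
        stream.extend(toy_tokenize(doc, vocab))
        stream.append(vocab['[CLS]'])       # document end marker
    return stream

stream = flatten_documents(['ab\ncab', 'c'], vocab, min_length=2)
print(stream)  # [3, 4, 5, 2, 6, 4, 5, 1]
```

The second document is dropped because it is not longer than `min_length`, so the stream contains a single `[MASK] ... [CLS]` span.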


In [4]:
def build_files(lines, tokenized_data_path, num_pieces, full_tokenizer, min_length):
    import os, tqdm
    # Process raw strings
    print('reading lines')
    # Use [SEP] for line breaks, so that it marks the end of each paragraph
    lines = [line.replace('\n', ' [SEP] ') for line in lines]
    all_len = len(lines)
    # Prepare output
    if not os.path.exists(tokenized_data_path):
        print('creating path for tokenized data: '+tokenized_data_path)
        os.makedirs(tokenized_data_path)
    for i in tqdm.tqdm(range(num_pieces)):
        sublines = lines[(all_len // num_pieces * i):(all_len // num_pieces * (i + 1))]
        if i == num_pieces - 1:
            sublines.extend(lines[all_len // num_pieces * (i + 1):])  # append the leftover documents to the last piece
        sublines = [full_tokenizer.tokenize(line) for line in sublines if
                    len(line) > min_length]  # keep only documents longer than min_length characters
        sublines = [full_tokenizer.convert_tokens_to_ids(line) for line in sublines]
        full_line = []
        for subline in sublines:
            full_line.append(full_tokenizer.convert_tokens_to_ids('[MASK]'))  # [MASK] marks the start of a document
            full_line.extend(subline)
            full_line.append(full_tokenizer.convert_tokens_to_ids('[CLS]'))  # [CLS] marks the end of a document
        with open(os.path.join(tokenized_data_path, 'tokenized_train_{}.txt'.format(i)), 'w') as f:
            for token_id in full_line:
                f.write(str(token_id) + ' ')
    print('finish')

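**Building a document-based dataset**

We use the corpus-reading helper above to build a dataset whose unit is the document. We read 10 files first as a test. The other fields in the corpus are not needed; only `'text'` is.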

In [5]:
## Fetch content from all wikifiles
import tqdm
data = []
for i in tqdm.tqdm(range(len(wikifiles[:10]))):
    data+=read_json_corpus(wikifiles[i])
    
print(len(data))
corpus_text = []
for article in data:
    corpus_text.append(article['text'])
    
print(len(corpus_text))

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:05<00:00,  1.94it/s]

885
885





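**Testing the function step by step**

The 10 files we read contain 885 documents in total. Because the function above is fairly long, we test it section by section to understand what it does.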

In [6]:
print(corpus_text[10][:100])
lines = [line.replace('\n', ' [SEP] ') for line in corpus_text]  # use [SEP] for line breaks, so it marks the end of each paragraph
print(lines[0][:100])
all_len = len(lines)
print(all_len)

政治學

政治學是一門以研究政治行為、政治體制以及政治相關領域為主的社會科學學科。在西方，政治學在學術領域裡的研究也被稱為政治研究、政治科學、或只有政治兩字。政治學意味著在學術上的研究領域，政治研究則
數學 [SEP]  [SEP] 數學是利用符號語言研究數量、結構、變化以及空間等概念的一門學科，從某種角度看屬於形式科學的一種。數學透過抽象化和邏輯推理的使用，由計數、計算、量度和對物體形狀及運動的觀
885


Next comes the main loop, which essentially regroups the article-based units into piece-based units.

In [7]:
num_pieces = 10
min_length = 30
for i in range(num_pieces):
    idx1 = all_len // num_pieces * i
    idx2 = all_len // num_pieces * (i + 1)
    print(idx1, idx2)
    sublines = lines[idx1:idx2]
    if i == num_pieces - 1:
        sublines.extend(lines[idx2:])  # append the leftover documents to the last piece
    sublines = [tokenizer.tokenize(line) for line in sublines if len(line) > min_length]  # keep only documents longer than min_length characters
    sublines = [tokenizer.convert_tokens_to_ids(line) for line in sublines]
    print(len(sublines))

0 88
88
88 176
87
176 264
87
264 352
87
352 440
88
440 528
86
528 616
86
616 704
88
704 792
32
792 880
92
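The slicing arithmetic of the loop above can be isolated in a few lines. This is a minimal sketch; the helper name `piece_bounds` is made up for illustration and does not appear in `train.py`.

```python
def piece_bounds(n_docs, num_pieces):
    """Return (start, end) document indices for each piece.

    Every piece gets n_docs // num_pieces documents; the last piece
    additionally absorbs whatever is left over.
    """
    size = n_docs // num_pieces
    bounds = []
    for i in range(num_pieces):
        start = size * i
        end = n_docs if i == num_pieces - 1 else size * (i + 1)
        bounds.append((start, end))
    return bounds

bounds = piece_bounds(885, 10)
print(bounds[0], bounds[-1])  # (0, 88) (792, 885)
```

One consequence of this scheme is that the last piece can be much larger than the others: `piece_bounds(885, 100)` gives 8 documents to each of the first 99 pieces and 93 documents to the last one.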


In [8]:
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

build_files(corpus_text, tokenized_data_path='../data/tokenized_data/', num_pieces=100,
                    full_tokenizer=tokenizer, min_length=32)

  0%|                                                                                          | 0/100 [00:00<?, ?it/s]

reading lines


100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [00:19<00:00,  5.24it/s]

finish





### 4. Building our tokenized training corpus

In [9]:
def build_tokenized_corpus(corpus_path,
                           tokenizer,
                           output_path='../data/tokenized_data/',
                           output_pieces=100,
                           min_length=10):
    import tqdm
    # List all corpus files
    corpusfiles = list_corpus_files(corpus_path)
    print('Number of files: '+str(len(corpusfiles)))
    # Read and combine corpus
    print('Reading files... ')
    data = []
    for i in tqdm.tqdm(range(len(corpusfiles))):
        data+=read_json_corpus(corpusfiles[i])
    # Convert file-based corpus to article-based
    corpus_text = []
    for article in data:
        corpus_text.append(article['text'])
    # Tokenize the articles and write the pieces to output_path
    build_files(corpus_text, tokenized_data_path=output_path, num_pieces=output_pieces,
                    full_tokenizer=tokenizer, min_length=min_length)
    return 0
        

In [10]:
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
build_tokenized_corpus(corpus_path=r'D:\data\corpus\wiki_zh', tokenizer=tokenizer, output_pieces=256)

  0%|                                                                                         | 0/1274 [00:00<?, ?it/s]

Number of files: 1274
Reading files... 


100%|██████████████████████████████████████████████████████████████████████████████| 1274/1274 [12:22<00:00,  1.71it/s]


reading lines


  0%|                                                                                          | 0/256 [00:00<?, ?it/s]

creating path for tokenized data: ../data/tokenized_data/


  1%|▉                                                                               | 3/256 [02:37<3:41:56, 52.64s/it]


KeyboardInterrupt: 

## Training directly with the Model class

We first load the pretrained model and inspect its configuration.

In [1]:
from datetime import datetime
import random
import numpy as np
import transformers
import tensorflow as tf

from transformers import TFGPT2LMHeadModel
from transformers import GPT2Tokenizer, BertTokenizer

model = TFGPT2LMHeadModel.from_pretrained('../data/mygpt2')
tokenizer = BertTokenizer.from_pretrained('../data/tokenizer_bert_base_chinese')

num_pieces = 100
tokenized_data_path = '../data/tokenized_data/'
full_tokenizer=tokenizer
output_dir = '../data/gpt2_ft/'

model_config = transformers.GPT2Config.from_json_file('../data/mygpt2/config.json')
print('config:\n' + model_config.to_json_string())

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at ../data/mygpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


config:
{
  "_name_or_path": "ckiplab/gpt2-base-chinese",
  "activation_function": "gelu_new",
  "add_cross_attention": false,
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bad_words_ids": null,
  "bos_token_id": 101,
  "chunk_size_feed_forward": 0,
  "decoder_start_token_id": null,
  "diversity_penalty": 0.0,
  "do_sample": false,
  "early_stopping": false,
  "embd_pdrop": 0.1,
  "encoder_no_repeat_ngram_size": 0,
  "eos_token_id": 102,
  "finetuning_task": null,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "is_decoder": false,
  "is_encoder_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_epsilon": 1e-05,
  "length_penalty": 1.0,
  "max_length": 20,
  "min_length": 0,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "no_repeat_ngram_size": 0,
  "num_beam_gr

In [68]:
# Load data
import numpy as np
with open('../data/tokenized_data/tokenized_train_237.txt', 'r') as f:
    tokenized_ids = f.readline().strip().split(' ')

print(len(tokenized_ids))

data = [int(id) for id in tokenized_ids]
#print(data[:20])

dataset = tf.data.Dataset.from_tensor_slices(data).window(model_config.n_ctx, drop_remainder=True)
print(dataset.element_spec)

train_data = []
for window in dataset:
    train_data.append(np.array([elem.numpy() for elem in window]).astype('int32'))

#print(len(train_data))
#print(train_data[1])


882997
DatasetSpec(TensorSpec(shape=(), dtype=tf.int32, name=None), TensorShape([]))
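The cell above cuts the token stream into windows of `n_ctx` tokens and drops the incomplete tail. A rough plain-Python sketch of that chunking is shown below, alongside the overlapping variant with a `stride` that the training loop further down uses. The function names `fixed_windows` and `strided_windows` are made up for this illustration.

```python
def fixed_windows(tokens, n_ctx):
    """Non-overlapping windows of n_ctx tokens; the incomplete tail is dropped."""
    n_full = len(tokens) // n_ctx
    return [tokens[k * n_ctx:(k + 1) * n_ctx] for k in range(n_full)]

def strided_windows(tokens, n_ctx, stride):
    """Windows advanced by `stride`, plus one final window aligned to the end."""
    samples = []
    start = 0
    while start < len(tokens) - n_ctx:
        samples.append(tokens[start:start + n_ctx])
        start += stride
    if start < len(tokens):
        samples.append(tokens[-n_ctx:])  # last n_ctx tokens, so nothing is lost
    return samples

tokens = list(range(10))
print(fixed_windows(tokens, 4))       # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(strided_windows(tokens, 4, 3))  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With the 882,997 tokens read above and `n_ctx = 1024`, the non-overlapping split yields 882997 // 1024 = 862 windows, which is the first dimension of the array shape printed in the next cell.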


In [76]:
#train_data = np.array(train_data)
print(np.array(train_data).shape)
#print(train_data[1])
print(max(data))
print(min(data))

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.summary()
model.fit(x=train_data, epochs=3, batch_size=16)

(862, 1024)
13722
100
Model: "tfgp_t2lm_head_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
transformer (TFGPT2MainLayer multiple                  102068736 
Total params: 102,068,736
Trainable params: 102,068,736
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3


StagingError: in user code:

    C:\Users\tsyo\AppData\Local\Continuum\anaconda3\lib\site-packages\tensorflow\python\keras\engine\training.py:806 train_function  *
        return step_function(self, iterator)
    C:\Users\tsyo\AppData\Local\Continuum\anaconda3\lib\site-packages\transformers\models\gpt2\modeling_tf_gpt2.py:700 call  *
        inputs = input_processing(
    C:\Users\tsyo\AppData\Local\Continuum\anaconda3\lib\site-packages\transformers\modeling_tf_utils.py:374 input_processing  *
        output[parameter_names[i]] = input

    IndexError: list index out of range

The `IndexError` most likely arises because `train_data` is a Python list of 862 separate arrays. Keras treats a list passed as `x` as a list of distinct model inputs, and the `transformers` input processing then tries to map each of the 862 entries to a model parameter name, which runs past the end of the parameter list. Passing a single 2-D array of shape `(862, 1024)` (the commented-out `np.array(train_data)` conversion) or a `tf.data.Dataset` should avoid this particular error.


In [None]:
# Custom training loop, adapted to TensorFlow from the loop in GPT2-Chinese's train.py
import os

epochs = 3            # assumed value, matching the model.fit attempt above
stride = 768
batch_size = 8
log_step = 10         # assumed logging interval
max_grad_norm = 1.0   # assumed gradient-clipping threshold
n_ctx = model_config.n_ctx
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)

print('starting training')
now = datetime.now()
print('time: {}'.format(now))

overall_step = 0
running_loss = 0.0
for epoch in range(epochs):
    pieces = list(range(num_pieces))
    random.shuffle(pieces)
    for piece_num, i in enumerate(pieces):
        with open(tokenized_data_path + 'tokenized_train_{}.txt'.format(i), 'r') as f:
            tokens = [int(token) for token in f.read().split()]
        # Cut the token stream into overlapping windows of n_ctx tokens
        start_point = 0
        samples = []
        while start_point < len(tokens) - n_ctx:
            samples.append(tokens[start_point: start_point + n_ctx])
            start_point += stride
        if start_point < len(tokens):
            samples.append(tokens[len(tokens)-n_ctx:])
        random.shuffle(samples)
        for step in range(len(samples) // batch_size):  # drop last

            #  prepare data
            batch = samples[step * batch_size: (step + 1) * batch_size]
            batch_inputs = tf.constant(batch, dtype=tf.int32)

            #  forward pass; passing labels makes the model return the LM loss first
            with tf.GradientTape() as tape:
                outputs = model(batch_inputs, labels=batch_inputs, training=True)
                loss = tf.reduce_mean(outputs[0])

            #  backward pass with gradient clipping, then optimizer step
            grads = tape.gradient(loss, model.trainable_variables)
            grads, _ = tf.clip_by_global_norm(grads, max_grad_norm)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))

            running_loss += float(loss)
            if (overall_step + 1) % log_step == 0:
                print('now time: {}:{}. Step {} of piece {} of epoch {}, loss {}'.format(
                    datetime.now().hour,
                    datetime.now().minute,
                    step + 1,
                    piece_num,
                    epoch + 1,
                    running_loss / log_step))
                running_loss = 0.0
            overall_step += 1

    print('saving model for epoch {}'.format(epoch + 1))
    epoch_dir = output_dir + 'model_epoch{}'.format(epoch + 1)
    os.makedirs(epoch_dir, exist_ok=True)
    model.save_pretrained(epoch_dir)
    print('epoch {} finished'.format(epoch + 1))


print('training finished')
then = datetime.now()
print('time: {}'.format(then))
print('time for training: {}'.format(then - now))


os.makedirs(output_dir + 'final_model', exist_ok=True)
model.save_pretrained(output_dir + 'final_model')

## Fine-tuning the model with the Trainer class

`HuggingFace` provides the [`Trainer` class](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer) as a tool for fine-tuning model parameters. Before using `Trainer`, we first need to decide on the model architecture (usually via `model.from_pretrained`) and then use [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments) to set the training hyper-parameters accordingly, such as `learning_rate`, `num_train_epochs`, and `per_device_train_batch_size`.


In [1]:
from transformers import BertTokenizer, glue_convert_examples_to_features
import tensorflow as tf
import tensorflow_datasets as tfds

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
data = tfds.load('glue/mrpc')
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

INFO:absl:Generating dataset glue (C:\Users\tsyo\tensorflow_datasets\glue\mrpc\1.0.0)


Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\tsyo\tensorflow_datasets\glue\mrpc\1.0.0...


Dl Completed...: |          | 0/0 [00:00<?, ? url/s]

Dl Size...: |          | 0/0 [00:00<?, ? MiB/s]

INFO:absl:Downloading https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2Fmrpc_dev_ids.tsv?alt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc into C:\Users\tsyo\tensorflow_datasets\downloads\fire.goog.com_v0_b_mtl-sent-repr.apps.com_o_2FjSIMlCiqs1QSmIykr4IRPnEHjPuGwAz5i40v8K9U0Z8.tsvalt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc.tmp.9bb6015152474f53a4c8f753c28f0d16...
INFO:absl:Downloading https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt into C:\Users\tsyo\tensorflow_datasets\downloads\dl.fbaip.com_sente_sente_msr_parap_test0PdekMcyqYR-w4Rx_d7OTryq0J3RlYRn4rAMajy9Mak.txt.tmp.4f16cd9473ce4561af82f9b1529a6243...
INFO:absl:Downloading https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt into C:\Users\tsyo\tensorflow_datasets\downloads\dl.fbaip.com_sente_sente_msr_parap_trainfGxPZuQWGBti4Tbd1YNOwQr-OqxPejJ7gcp0Al6mlSk.txt.tmp.5c005523e04f49439765d9ddf80bb88c...






Generating splits...:   0%|          | 0/3 [00:00<?, ? splits/s]

Generating train examples...: |          | 0/? [00:00<?, ? examples/s]

Shuffling glue-train.tfrecord...:   0%|          | 0/3668 [00:00<?, ? examples/s]

INFO:absl:Done writing glue-train.tfrecord. Number of examples: 3668 (shards: [3668])


Generating validation examples...: |          | 0/? [00:00<?, ? examples/s]

Shuffling glue-validation.tfrecord...:   0%|          | 0/408 [00:00<?, ? examples/s]

INFO:absl:Done writing glue-validation.tfrecord. Number of examples: 408 (shards: [408])


Generating test examples...: |          | 0/? [00:00<?, ? examples/s]

Shuffling glue-test.tfrecord...:   0%|          | 0/1725 [00:00<?, ? examples/s]

INFO:absl:Done writing glue-test.tfrecord. Number of examples: 1725 (shards: [1725])
INFO:absl:Constructing tf.data.Dataset for split None, from C:\Users\tsyo\tensorflow_datasets\glue\mrpc\1.0.0


Dataset glue downloaded and prepared to C:\Users\tsyo\tensorflow_datasets\glue\mrpc\1.0.0. Subsequent calls will reuse this data.


