<a href="https://colab.research.google.com/github/sccmst/NLUModelOnColab/blob/GPT2-ContentExtension/GPT-ContentExtension/Fine_tune_GPT_2_Model_with_Huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install kaggle
!mkdir ~/.kaggle
!touch ~/.kaggle/kaggle.json



In [None]:
with open("/root/.kaggle/kaggle.json", "w") as f:
  f.write('{"username":"","key":""}')

!chmod 600 /root/.kaggle/kaggle.json
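
As a variation on the two cells above, the credentials file can be written with `json.dump` and `os.chmod` instead of a hand-written string and a shell command. This is a minimal sketch: the `write_kaggle_credentials` helper and its `config_dir` argument are hypothetical names, and the empty username and key are placeholders, as in the cell above.

```python
import json
import os
import stat
import tempfile


def write_kaggle_credentials(config_dir, username, key):
    """Write kaggle.json into config_dir, readable and writable by the owner only."""
    os.makedirs(config_dir, exist_ok=True)
    path = os.path.join(config_dir, "kaggle.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"username": username, "key": key}, f)
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return path


# Demonstration in a temporary directory; in Colab, config_dir would be ~/.kaggle.
demo_dir = tempfile.mkdtemp()
credentials_path = write_kaggle_credentials(demo_dir, username="", key="")
```

Using `json.dump` avoids quoting mistakes when the key contains characters that need escaping.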

## **What are we going to do:**

- load the dataset from Kaggle
- prepare the dataset and build a ``TextDataset``
- load the pre-trained GPT-2 model and tokenizer
- initialize ``Trainer`` with ``TrainingArguments``
- train and save the model
- test the model

In [1]:
!pip install "transformers>=4.2.2"

In [None]:
from transformers import set_seed
# Set seed for reproducibility.
set_seed(123)


In [None]:
!nvidia-smi

Tue Dec 13 14:05:21 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P0    30W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Prepare the dataset and build a ``TextDataset``

The next step is to extract the fields we need from every record and build a `TextDataset`. The `TextDataset` used here is a custom implementation of the [PyTorch `Dataset` class](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class), adapted from the one in the transformers library. If you want to know more about `Dataset` in PyTorch, you can check out this [YouTube video](https://www.youtube.com/watch?v=PXOzkkB5eH0&ab_channel=PythonEngineer).

First, we split the records of the XLSum `.jsonl` file into a `train` and a `test` set. For each record we take the `title`, `summary`, and `text` fields and write them to `train_dataset.txt` and `test_dataset.txt`.

In [None]:
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()
#api.dataset_download_files("sterby/german-recipes-dataset",path="./", unzip=True)
api.dataset_download_files("terrychanorg/chinese-simplified-xlsum-v2", path="./", unzip=True)

In [None]:
import json
from sklearn.model_selection import train_test_split


# Read the JSON Lines file: one JSON record per line.
with open('./chinese_traditional_XLSum_v2.0/chinese_traditional_val.jsonl', encoding='utf-8') as f:
    data = []
    for line in f:
        line = line.strip()
        if line:  # skip blank lines
            data.append(json.loads(line))

def build_text_files(data_json, dest_path):
    """Write two lines per record: "<summary>BEG;END<text>" and "<title>BEG;END<text>"."""
    with open(dest_path, 'w', encoding='utf-8') as f:
      data = []
      for texts in data_json:
          title = str(texts['title']).strip()
          text = str(texts['text']).strip()
          summary = str(texts['summary']).strip()
          # BEG;END separates the prompt (summary or title) from the article text.
          data.append(f"{summary}BEG;END{text}")
          data.append(f"{title}BEG;END{text}")
      f.write("\n".join(data))
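
Each line written by `build_text_files` has the shape `<prompt>BEG;END<article text>`, and the dataset class later reads the file one line at a time. The sketch below illustrates that format on a made-up record. The `make_example` helper is hypothetical, and collapsing internal whitespace is an added assumption (the cell above does not do it) so that an article containing line breaks still occupies a single line.

```python
SEPARATOR = "BEG;END"


def make_example(prompt, text):
    """Join prompt and article text on one line, collapsing internal whitespace."""
    one_line_prompt = " ".join(str(prompt).split())
    one_line_text = " ".join(str(text).split())
    return f"{one_line_prompt}{SEPARATOR}{one_line_text}"


# Toy record for illustration only; it is not taken from the dataset.
record = {
    "title": "Budget rises",
    "summary": "Defence budget up.",
    "text": "First paragraph.\nSecond paragraph.",
}
lines = [
    make_example(record["summary"], record["text"]),
    make_example(record["title"], record["text"]),
]
```

Without the whitespace step, the line break inside `text` would split one record across two lines of the output file.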



In [None]:
train, test = train_test_split(data,test_size=0.15)
build_text_files(train,'train_dataset.txt')
build_text_files(test,'test_dataset.txt')
print("Train dataset length: "+str(len(train)))
print("Test dataset length: "+ str(len(test)))


Train dataset length: 3969
Test dataset length: 701
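
The two lengths printed above follow from `test_size=0.15`. As a quick check, assuming `train_test_split` rounds the test share up and gives the remainder to the training set (which is how scikit-learn treats a float `test_size` when `train_size` is not given); `split_sizes` is a hypothetical helper:

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# 3969 + 701 = 4670 records were read from the .jsonl file.
n_train, n_test = split_sizes(3969 + 701, 0.15)
print(n_train, n_test)
```

Because `build_text_files` writes two lines per record, the text files contain twice as many lines as these record counts.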


The next step is to download the tokenizer. We use the tokenizer of the `gpt2` model on [huggingface](https://huggingface.co/gpt2).

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

train_path = 'train_dataset.txt'
test_path = 'test_dataset.txt'


In [None]:
from torch.utils.data import Dataset
from typing import Any, Optional
import torch
import os
import logging
from filelock import FileLock
import time
import pickle
logger = logging.getLogger()
class TextDataset(Dataset):
    """
    Line-by-line variant of the transformers `TextDataset`: every line of the
    input file is tokenized and cut into blocks of `block_size` token ids.
    """

    def __init__(
        self,
        tokenizer: Any,
        file_path: str,
        block_size: int,
        overwrite_cache=False,
        cache_dir: Optional[str] = None,
    ):

        if not os.path.isfile(file_path):
            raise ValueError(f"Input file path {file_path} not found")

        block_size = block_size - tokenizer.num_special_tokens_to_add(pair=False)

        directory, filename = os.path.split(file_path)
        cached_features_file = os.path.join(
            cache_dir if cache_dir is not None else directory,
            f"cached_lm_{tokenizer.__class__.__name__}_{block_size}_{filename}",
        )

        # Hold a file lock while the features are built and written to the cache file.
        # Note that this version always rebuilds the features and never reads the cache back.
        lock_path = cached_features_file + ".lock"
        with FileLock(lock_path):

                logger.info(f"Creating features from dataset file at {directory}")

                self.examples = []
                with open(file_path, encoding="utf-8") as f:
                    # text = f.read()
                    texts = f.readlines()

                # tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
                # for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Truncate in block of block_size
                    # self.examples.append(tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size])
                # )

                for text in texts:
                  tokenedtext = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
                  for i in range(0, len(tokenedtext) - block_size + 1, block_size):
                    self.examples.append(tokenizer.build_inputs_with_special_tokens(tokenedtext[i : i + block_size]))
                
                
                # Note that we are losing the last truncated example here for the sake of simplicity (no padding)
                # If your dataset is small, first you should look for a bigger one :-) and second you
                # can change this behavior by adding (model specific) padding.

                start = time.time()
                with open(cached_features_file, "wb") as handle:
                    pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
                logger.info(
                    f"Saving features into cached file {cached_features_file} [took {time.time() - start:.3f} s]"
                )

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i) -> torch.Tensor:
        return torch.tensor(self.examples[i], dtype=torch.long)
        # return self.examples[i]
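
The chunking loop in `__init__` is easy to misread, so here is the same slicing logic on plain lists. This is a sketch for illustration; `chunk_token_ids` is a hypothetical helper, and the integer ids stand in for real token ids.

```python
def chunk_token_ids(token_ids, block_size):
    """Split token ids into full blocks of block_size, dropping any shorter tail."""
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


# Ten ids with block_size 4: two full blocks, ids 8 and 9 are dropped.
blocks = chunk_token_ids(list(range(10)), block_size=4)

# A line shorter than one block contributes no examples at all.
short = chunk_token_ids(list(range(3)), block_size=4)
```

Two consequences for this dataset: every line loses its final partial block, and the `BEG;END` separator only appears in the first block of each line.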


In [None]:
from transformers import DataCollatorForLanguageModeling
# from transformers import LineByLineTextDataset, DataCollatorForLanguageModeling
# https://github.com/huggingface/transformers/blob/main/src/transformers/data/datasets/language_modeling.py
def load_dataset(train_path,test_path,tokenizer):
    train_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=train_path,
          block_size=128)
     
    test_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=test_path,
          block_size=128)   
    
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset,test_dataset,data_collator

train_dataset,test_dataset,data_collator = load_dataset(train_path,test_path,tokenizer)

Token indices sequence length is longer than the specified maximum sequence length for this model (1639 > 1024). Running this sequence through the model will result in indexing errors
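
With `mlm=False` the collator prepares batches for causal language modeling: the labels are a copy of the input ids, and the shift by one position happens inside the model when it computes the loss. The sketch below shows the resulting (input token, target token) pairs in plain Python. `next_token_pairs` is a hypothetical helper for illustration and not part of transformers.

```python
def next_token_pairs(input_ids):
    """Pair each token id with the id the model is trained to predict next."""
    labels = list(input_ids)  # labels start as a copy of the inputs
    # Position t of the input is scored against position t + 1 of the labels.
    return list(zip(input_ids[:-1], labels[1:]))


pairs = next_token_pairs([37772, 101, 33768, 98])
```

A block of 128 token ids therefore yields 127 prediction targets.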


In [None]:
train_dataset[:3]

tensor([[37772,   101, 33768,    98,   171,   120,   234, 17739,   255,   163,
           247,   122,   165,    97,   246, 28938,   235, 46763,   246, 26344,
           102, 12859,   252, 21689, 36181,   252,   164,    95,   104, 28839,
           235, 32368,   108, 21410,   165,   250,   235, 34650,   228, 23877,
           107, 32003,   223,   161,   253,   236, 40792,   164,   100,    96,
         46763,   239, 49035,   118,   160,   122,   228, 16764,    33,  7156,
            26, 10619,   165,   250,   235, 34650,   228, 23877,   107, 21689,
           164,   223, 21253,   249,   228, 28839,   101, 31660,   164,   113,
           115,   163,   255,   231, 36181,   227, 46763,   239,   162,   237,
           112, 31965,   102,   164,   111,   229, 10545,   243,   246, 26344,
           102, 12859,   252,   163,   112,   227, 23877,   108, 17312,   230,
         17312,   225,   164,   110,   254,   164,   110,   105, 21689, 37772,
           232,   164,   101,   112, 33833,   171,  

# Log in to Hugging Face

In [None]:
!pip install huggingface_hub



In [None]:
from huggingface_hub import notebook_login
notebook_login()



# Initialize `Trainer` with `TrainingArguments` and GPT-2 model

The [Trainer](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer) class provides an API for feature-complete training. It is used in most of the [example scripts](https://huggingface.co/transformers/examples.html) from Hugging Face. Before we can instantiate our `Trainer`, we need to download our GPT-2 model and create a [TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments) object, which exposes all the points of customization during training. In the `TrainingArguments`, we define the hyperparameters for the training process, such as `learning_rate`, `num_train_epochs`, or `per_device_train_batch_size`. You can find a complete list [here](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments).

In [None]:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

# Load GPT-2 with a causal language modeling head.
model = AutoModelForCausalLM.from_pretrained("gpt2")


training_args = TrainingArguments(
    output_dir="./gpt2-reporter",  # output directory
    overwrite_output_dir=True,  # overwrite the content of the output directory
    num_train_epochs=3,  # number of training epochs
    per_device_train_batch_size=32,  # batch size for training
    per_device_eval_batch_size=32,  # batch size for evaluation
    eval_steps=400,  # update steps between two evaluations (only used when evaluation is scheduled by steps)
    save_steps=800,  # save a checkpoint every 800 update steps
    warmup_steps=500,  # number of warmup steps for the learning rate scheduler
    prediction_loss_only=True,  # during evaluation, return only the loss
    push_to_hub=True,  # push the model to the Hugging Face Hub when it is saved
    hub_model_id="theta/gpt2-reporter",  # target repository on the Hub
)


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
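
`warmup_steps=500` only has meaning together with a learning-rate schedule. The sketch below assumes the `Trainer` defaults, namely a peak `learning_rate` of 5e-5 and a linear schedule that decays to zero, and it assumes 12288 total optimization steps for this run. `linear_warmup_lr` is a hypothetical helper, not a transformers function.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=12288):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 6394, 12288):
    print(step, linear_warmup_lr(step))
```

The rate climbs from zero to its peak over the first 500 steps and then falls steadily until the last step.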



Downloading:   0%|          | 0.00/548M [00:00<?, ?B/s]

Cloning https://huggingface.co/theta/gpt2-reporter into local empty directory.



# Train and save the model

To train the model, we simply call `trainer.train()`.

In [None]:
trainer.train()

***** Running training *****
  Num examples = 131053
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 12288
  Number of trainable parameters = 124439808
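
The step count in this log can be reproduced from the other numbers it reports. A short check, assuming a single device, no gradient accumulation, and that the last partial batch of each epoch is kept:

```python
import math

num_examples = 131053
batch_size = 32
num_epochs = 3
save_steps = 800

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
num_checkpoints = total_steps // save_steps

print(steps_per_epoch, total_steps, num_checkpoints)
```

With `warmup_steps=500`, warmup covers roughly the first 4% of these steps.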


Step,Training Loss
500,3.0659
1000,2.5997
1500,2.421
2000,2.3264
2500,2.2643
3000,2.2127
3500,2.1787
4000,2.1531
4500,2.1201
5000,2.0937


Saving model checkpoint to ./gpt2-reporter/checkpoint-800
Configuration saved in ./gpt2-reporter/checkpoint-800/config.json
Model weights saved in ./gpt2-reporter/checkpoint-800/pytorch_model.bin
Saving model checkpoint to ./gpt2-reporter/checkpoint-1600
Configuration saved in ./gpt2-reporter/checkpoint-1600/config.json
Model weights saved in ./gpt2-reporter/checkpoint-1600/pytorch_model.bin
Saving model checkpoint to ./gpt2-reporter/checkpoint-2400
Configuration saved in ./gpt2-reporter/checkpoint-2400/config.json
Model weights saved in ./gpt2-reporter/checkpoint-2400/pytorch_model.bin
Saving model checkpoint to ./gpt2-reporter/checkpoint-3200
Configuration saved in ./gpt2-reporter/checkpoint-3200/config.json
Model weights saved in ./gpt2-reporter/checkpoint-3200/pytorch_model.bin
Saving model checkpoint to ./gpt2-reporter/checkpoint-4000
Configuration saved in ./gpt2-reporter/checkpoint-4000/config.json
Model weights saved in ./gpt2-reporter/checkpoint-4000/pytorch_model.bin
Saving m

TrainOutput(global_step=12288, training_loss=2.1476624608039856, metrics={'train_runtime': 14030.1759, 'train_samples_per_second': 28.022, 'train_steps_per_second': 0.876, 'total_flos': 2.5682328502272e+16, 'train_loss': 2.1476624608039856, 'epoch': 3.0})
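
The throughput figures in `TrainOutput` can be cross-checked against the runtime, and the average training loss can be read as a perplexity. This is a sketch using only the numbers printed above. Note that it is the loss averaged over the whole run on the training data, so the perplexity says nothing about held-out text.

```python
import math

train_runtime = 14030.1759  # seconds
global_step = 12288
num_examples = 131053
num_epochs = 3
training_loss = 2.1476624608039856

steps_per_second = global_step / train_runtime
samples_per_second = num_examples * num_epochs / train_runtime
perplexity = math.exp(training_loss)  # per-token perplexity on the training data

print(round(steps_per_second, 3), round(samples_per_second, 3), round(perplexity, 2))
```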

After training is done, you can save the model by calling `save_model()`. This saves the trained model to the `output_dir` set in our `TrainingArguments`. Because `push_to_hub=True`, the call also uploads the model to the Hub.

In [None]:
trainer.save_model()

Saving model checkpoint to ./gpt2-reporter
Configuration saved in ./gpt2-reporter/config.json
Model weights saved in ./gpt2-reporter/pytorch_model.bin
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file pytorch_model.bin:   0%|          | 3.30k/487M [00:00<?, ?B/s]

Upload file runs/Dec13_14-06-43_92d3fa1c5285/events.out.tfevents.1670940576.92d3fa1c5285.74.0:  42%|####1     …

remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/theta/gpt2-reporter
   845d542..21ad3c8  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
To https://huggingface.co/theta/gpt2-reporter
   21ad3c8..3c6b01c  main -> main




In [None]:
trainer.push_to_hub()

Saving model checkpoint to ./gpt2-reporter
Configuration saved in ./gpt2-reporter/config.json
Model weights saved in ./gpt2-reporter/pytorch_model.bin
Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}


# Test the model

To test the model, we are going to use another [highlight of the transformers library](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=pipelines) called `pipeline`. [Pipelines](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=pipelines) are objects that offer a simple API dedicated to several tasks, including `text-generation`.

In [8]:
from transformers import pipeline, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('theta/gpt2-reporter')
# Generation settings such as max_new_tokens are passed when calling the pipeline,
# e.g. reporter(prompt, max_new_tokens=100).
reporter = pipeline('text-generation', model=model, tokenizer='gpt2')



In [9]:
reporter('總統宣布國防預算大漲BEG;END')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': '總統宣布國防預算大漲BEG;END羅魯尼利爾別的馬�'}]
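
The pipeline returns the prompt and the continuation as one string. A minimal sketch for separating them at the `BEG;END` marker follows; `split_generation` is a hypothetical helper, and the sample string is the output shown above.

```python
SEPARATOR = "BEG;END"


def split_generation(generated_text):
    """Return (prompt, continuation); continuation is empty if the marker is missing."""
    prompt, found, continuation = generated_text.partition(SEPARATOR)
    if not found:
        return generated_text, ""
    return prompt, continuation


outputs = [{'generated_text': '總統宣布國防預算大漲BEG;END羅魯尼利爾別的馬�'}]
prompt, continuation = split_generation(outputs[0]["generated_text"])
```

The text-generation pipeline also accepts `return_full_text=False` when called, which returns only the newly generated part.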

In [11]:
reporter('總統宣布國防預算大漲')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': '總統宣布國防預算大漲：陳水史達四處卡疫情�'}]
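
Both outputs end in `�`. A likely explanation, offered as an assumption rather than a diagnosis of this run: the GPT-2 tokenizer works on UTF-8 bytes, these Chinese characters occupy three bytes each, and generation stops after a fixed number of tokens, so the final character can be cut off partway. The sketch reproduces the effect with plain byte strings.

```python
text = "預算"
encoded = text.encode("utf-8")  # three bytes per character here
truncated = encoded[:-1]  # drop the last byte, as if generation stopped early

# An incomplete byte sequence decodes to the replacement character U+FFFD.
decoded = truncated.decode("utf-8", errors="replace")
print(len(encoded), decoded)
```

Allowing a longer generation, or trimming a trailing `\ufffd` from the result, are two simple ways to deal with it.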