## Preparation

### Install packages

Make sure you have installed the latest versions of the following Python packages:

| package |
| -- |
| towhee |
| transformers |
| datasets |
| evaluate |
| scikit-learn |
| torch |


In [1]:
# ! python -m pip install torch towhee transformers datasets evaluate scikit-learn
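
To see which of the packages above are already installed, and at which version, a small check like the following may help. It is a sketch that relies only on the standard library's `importlib.metadata`; the helper name `installed_versions` is made up for this example. Note that the names are distribution names as used by pip (for instance `scikit-learn`, which is imported as `sklearn`).

```python
from importlib import metadata

# Distribution names as listed in the table above.
REQUIRED = ["towhee", "transformers", "datasets", "evaluate", "scikit-learn", "torch"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:<14}{version or 'NOT INSTALLED'}")
```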

Run the following cell to check that your environment is set up correctly.

In [2]:
# try this line
from torch.distributed import ProcessGroup

If you get an error like `cannot import name 'ProcessGroup' from 'torch.distributed'`, please refer to [this issue](https://github.com/pytorch/pytorch/issues/68385#issuecomment-1332607943) and install the latest version of PyTorch.
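
If you prefer a check that reports the problem instead of raising, the import can be wrapped in a `try`/`except`. This is a minimal sketch; the `status` variable exists only for illustration.

```python
# Guarded version of the import check: report the problem instead of raising.
try:
    import torch
    from torch.distributed import ProcessGroup  # noqa: F401

    status = f"ok (torch {torch.__version__})"
except ImportError as err:
    # Covers both a missing torch installation and the ProcessGroup import error.
    status = f"import failed: {err}"

print(status)
```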

## Fine-tune GPT2 on Causal Language Modeling task

[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) is a large transformer-based language model with 1.5 billion parameters in its largest version, 10x more than the original GPT. It was trained on a dataset that emphasizes diversity of content, built by scraping text from the Internet. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. It can then perform inference in a zero-shot transfer setting, without any task-specific fine-tuning.
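
Written out, the next-word objective is the standard causal language modeling loss. The notation here is introduced for illustration and does not come from the original text: $x_1, \dots, x_T$ is a token sequence and $\theta$ denotes the model parameters.

```latex
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Minimizing this average negative log-likelihood is what `task='clm'` refers to in the training step of this tutorial.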

### Instantiate operator
We can instantiate a Towhee operator that wraps the smallest version of the [`gpt2`](https://huggingface.co/gpt2) model.

**Note**: By default, we use the `text_embedding.transformers` operator, which returns an embedding for every token at inference time. If you want a single sentence embedding instead, use the `sentence_embedding.transformers` operator.

In [3]:
import towhee

gpt2_op = towhee.ops.text_embedding.transformers(model_name='gpt2').get_op()
# or, to get a sentence embedding rather than one embedding per token, use the sentence_embedding operator:
# gpt2_op = towhee.ops.sentence_embedding.transformers(model_name='gpt2').get_op()

This operator embeds a sentence using the output of the pretrained model's last layer before the model's head. The embedding shape is `([input token num], [model dim])`.

In [4]:
embedding = gpt2_op('Hello world, hello every one.')
embedding, embedding.shape

(array([[-4.8677885e-04, -1.3963915e-01, -2.0877950e-01, ...,
         -1.5345111e-01, -6.7771867e-02, -1.9587985e-01],
        [-1.6629304e-01,  2.1927923e-01,  4.4871259e-02, ...,
         -1.7656286e-01, -1.6588898e-01,  4.3323985e-01],
        [ 2.6872021e-01,  2.9144433e-01,  2.1966067e-01, ...,
         -9.3992241e-02,  1.2293747e-01,  8.8392869e-02],
        ...,
        [ 4.3278667e-01,  6.3713318e-01, -5.7010895e-01, ...,
         -6.7748949e-02,  1.7759663e-01,  3.5576081e-01],
        [-6.8375832e-01,  2.2053759e-01, -1.0839784e+00, ...,
         -2.2822498e-01, -8.7204657e-02,  2.3987369e-01],
        [ 2.2707476e-01, -1.7549282e-01, -3.0932507e-01, ...,
         -3.9989728e-02, -2.0581612e-02,  3.1236490e-02]], dtype=float32),
 (7, 768))

### Start training
We only need to specify two argument dicts and call the `train()` method with `task='clm'` to start training.

In [5]:
data_args = {
    'dataset_name': 'wikitext',
    'dataset_config_name': 'wikitext-2-raw-v1',
}
training_args = {
    'num_train_epochs': 3,  # increase the number of epochs to get better metrics
    'per_device_train_batch_size': 8,
    'per_device_eval_batch_size': 8,
    'do_train': True,
    'do_eval': True,
    'output_dir': './tmp/test-clm',
    'overwrite_output_dir': True
}

In [6]:
gpt2_op.train(task='clm', data_args=data_args, training_args=training_args)

2022-12-20 08:18:49,249 - 140414087504704 - train_clm_with_hf_trainer.py-train_clm_with_hf_trainer:152 - INFO: Training/evaluation parameters TrainingArguments(
_n_gpu=8,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=Fal

train clm with hugging face transformers trainer
**** DataTrainingArguments ****
- dataset_name 
  default: None 
  metadata_dict: {'help': 'The name of the dataset to use (via the datasets library).'} 

- dataset_config_name 
  default: None 
  metadata_dict: {'help': 'The configuration name of the dataset to use (via the datasets library).'} 

- train_file 
  default: None 
  metadata_dict: {'help': 'The input training data file (a text file).'} 

- validation_file 
  default: None 
  metadata_dict: {'help': 'An optional input evaluation data file to evaluate the perplexity on (a text file).'} 

- max_train_samples 
  default: None 
  metadata_dict: {'help': 'For debugging purposes or quicker training, truncate the number of training examples to this value if set.'} 

- max_eval_samples 
  default: None 
  metadata_dict: {'help': 'For debugging purposes or quicker training, truncate the number of evaluation examples to this value if set.'} 

- block_size 
  default: None 
  metadata_

2022-12-20 08:18:56,055 - 140414087504704 - info.py-info:365 - INFO: Loading Dataset Infos from /home/zhangchen/.cache/huggingface/modules/datasets_modules/datasets/wikitext/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126
2022-12-20 08:18:56,080 - 140414087504704 - builder.py-builder:354 - INFO: Overwrite dataset info from restored data version.
2022-12-20 08:18:56,081 - 140414087504704 - info.py-info:285 - INFO: Loading Dataset info from /home/zhangchen/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126
2022-12-20 08:18:56,100 - 140414087504704 - info.py-info:285 - INFO: Loading Dataset info from /home/zhangchen/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126


  0%|          | 0/3 [00:00<?, ?it/s]

[INFO|trainer.py:1634] 2022-12-20 08:18:59,146 >> ***** Running training *****
[INFO|trainer.py:1635] 2022-12-20 08:18:59,147 >>   Num examples = 2318
[INFO|trainer.py:1636] 2022-12-20 08:18:59,147 >>   Num Epochs = 3
[INFO|trainer.py:1637] 2022-12-20 08:18:59,148 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1638] 2022-12-20 08:18:59,149 >>   Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1639] 2022-12-20 08:18:59,149 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1640] 2022-12-20 08:18:59,150 >>   Total optimization steps = 111
[INFO|trainer.py:1641] 2022-12-20 08:18:59,152 >>   Number of trainable parameters = 124439808


[INFO|trainer.py:1885] 2022-12-20 08:20:35,102 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


[INFO|trainer.py:2693] 2022-12-20 08:20:35,105 >> Saving model checkpoint to ./tmp/test-clm
[INFO|configuration_utils.py:447] 2022-12-20 08:20:35,107 >> Configuration saved in ./tmp/test-clm/config.json
[INFO|modeling_utils.py:1637] 2022-12-20 08:20:35,840 >> Model weights saved in ./tmp/test-clm/pytorch_model.bin
[INFO|tokenization_utils_base.py:2157] 2022-12-20 08:20:35,842 >> tokenizer config file saved in ./tmp/test-clm/tokenizer_config.json
[INFO|tokenization_utils_base.py:2164] 2022-12-20 08:20:35,843 >> Special tokens file saved in ./tmp/test-clm/special_tokens_map.json
2022-12-20 08:20:35,932 - 140414087504704 - train_clm_with_hf_trainer.py-train_clm_with_hf_trainer:405 - INFO: *** Evaluate ***
[INFO|trainer.py:2944] 2022-12-20 08:20:35,935 >> ***** Running Evaluation *****
[INFO|trainer.py:2946] 2022-12-20 08:20:35,936 >>   Num examples = 240


***** train metrics *****
  epoch                    =        3.0
  total_flos               =  3384472GF
  train_loss               =      3.253
  train_runtime            = 0:01:35.95
  train_samples            =       2318
  train_samples_per_second =     72.475
  train_steps_per_second   =      1.157


***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.4198
  eval_loss               =     3.0879
  eval_runtime            = 0:00:02.16
  eval_samples            =        240
  eval_samples_per_second =    110.693
  eval_steps_per_second   =      1.845
  perplexity              =    21.9304
done clm.


`data_args` specifies the training dataset. You can give a dataset name directly, and it will be downloaded through [datasets](https://huggingface.co/docs/datasets/index). For more information on `data_args`, see the `**** DataTrainingArguments ****` section in the output of this cell.
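
As a sketch of the other fields listed under `DataTrainingArguments` in the output, the dictionary below points training at local text files instead of a named dataset. The file paths are placeholders, and whether this exact combination of fields is accepted depends on your installed towhee version, so treat it as an assumption to verify.

```python
# Hypothetical local-file setup; the paths are placeholders, not real files.
local_data_args = {
    'train_file': './data/train.txt',            # input training data file (a text file)
    'validation_file': './data/validation.txt',  # optional file to evaluate the perplexity on
    'max_train_samples': 1000,                   # truncate the training set for a quicker run
    'max_eval_samples': 100,                     # truncate the evaluation set likewise
}
```

It would be passed in the same way as before, for example `gpt2_op.train(task='clm', data_args=local_data_args, training_args=training_args)`.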

`training_args` specifies the training configuration using [transformers TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). For more information on `training_args`, see the `**** TrainingArguments ****` section in the output of this cell.
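
Any field shown in the logged `TrainingArguments` can be set in the same dictionary. The sketch below layers a few overrides on top of a base configuration; the values are illustrative rather than recommendations, and the names `base_training_args` and `extra_training_args` are made up for this example.

```python
# Base configuration, mirroring part of the training_args used in this tutorial.
base_training_args = {
    'num_train_epochs': 3,
    'per_device_train_batch_size': 8,
    'output_dir': './tmp/test-clm',
}

# Illustrative overrides; each key appears in the logged TrainingArguments output.
extra_training_args = {
    'gradient_accumulation_steps': 4,  # accumulate gradients over 4 batches per update
    'fp16': True,                      # mixed-precision training; needs a supporting GPU
    'dataloader_num_workers': 2,       # subprocesses used for data loading
}

# Later keys win, so the overrides replace any base values with the same name.
merged_training_args = {**base_training_args, **extra_training_args}
print(merged_training_args)
```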

If training runs very slowly, make sure your machine has a capable GPU, or reduce the number of epochs if you only want to run through the process. By default, training runs in parallel on all available GPUs.
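
The effect of multi-GPU training is visible in the log. With `_n_gpu=8` and a per-device batch size of 8, the total train batch size is 64, and 2318 examples over 3 epochs give 111 optimization steps. A quick sketch of that arithmetic, valid for the `gradient_accumulation_steps=1` case shown in the log:

```python
import math

num_examples = 2318        # "Num examples" in the training log
per_device_batch_size = 8  # per_device_train_batch_size
num_gpus = 8               # _n_gpu in the logged TrainingArguments
num_epochs = 3             # num_train_epochs

total_batch_size = per_device_batch_size * num_gpus
steps_per_epoch = math.ceil(num_examples / total_batch_size)  # last batch may be partial
total_steps = steps_per_epoch * num_epochs

print(total_batch_size, steps_per_epoch, total_steps)  # 64 37 111
```

On a machine with fewer GPUs the total batch size shrinks and the number of steps grows accordingly.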

If the final evaluation result shows an `eval_accuracy` of around 0.42, as in the output above, you have successfully trained the operator. The resulting model weights are saved in your `output_dir`.
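
The `perplexity` value in the evaluation metrics is consistent with being the exponential of `eval_loss`, which is how the Hugging Face causal language modeling example script computes it. A short check against the numbers above:

```python
import math

eval_loss = 3.0879  # eval_loss from the metrics above (rounded in the log)
perplexity = math.exp(eval_loss)

print(f"{perplexity:.2f}")  # 21.93
```

The small difference from the logged 21.9304 comes from `eval_loss` being rounded to four decimals in the log.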

### Use your fine-tuned weights
Please note that the model trained in this way carries a task-specific head, whereas the model in the original operator does not contain a head. To use your trained weights for extracting embeddings, you need to convert the weights to fit the headless model and then load them. Here is an example.

In [7]:
import torch
from collections import OrderedDict
from transformers import GPT2Model
from transformers.utils import logging

logging.set_verbosity_error()

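def convert_gpt2_weights(trained_weights_path, new_weight_path):
    """Convert a fine-tuned checkpoint with an LM head into headless GPT2Model weights."""
    prefix = 'transformer.'
    state_dict = torch.load(trained_weights_path, map_location='cpu')
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        # keep only the backbone weights and drop the 'transformer.' prefix;
        # keys without the prefix (the head) are skipped
        if k.startswith(prefix):
            new_state_dict[k[len(prefix):]] = v
    gpt2_model = GPT2Model.from_pretrained("gpt2")
    # strict=False tolerates keys that do not match exactly between the two state dicts
    gpt2_model.load_state_dict(new_state_dict, strict=False)
    torch.save(gpt2_model.state_dict(), new_weight_path)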

convert_gpt2_weights('./tmp/test-clm/pytorch_model.bin', './tmp/test-clm/gpt2_without_head_weight.bin')
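
The key renaming performed by the conversion can be seen on a toy dictionary without loading any real weights. In this sketch, strings stand in for the tensors, the helper `strip_prefix` is made up for illustration, and the three key names are examples of how a GPT-2 checkpoint with a language modeling head is typically laid out.

```python
from collections import OrderedDict


def strip_prefix(state_dict, prefix='transformer.'):
    """Keep only the keys under `prefix` and drop the prefix from their names."""
    return OrderedDict(
        (key[len(prefix):], value)
        for key, value in state_dict.items()
        if key.startswith(prefix)
    )


# Toy stand-in for a checkpoint: strings replace the real weight tensors.
toy_state_dict = OrderedDict([
    ('transformer.wte.weight', 'token embeddings'),
    ('transformer.h.0.attn.c_attn.weight', 'attention weights'),
    ('lm_head.weight', 'language modeling head'),
])

print(list(strip_prefix(toy_state_dict)))
# ['wte.weight', 'h.0.attn.c_attn.weight']
```

The head entry is dropped because its key does not start with the prefix, which is what makes the result loadable into a model without a head.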

In [8]:
new_gpt2_op = towhee.ops.text_embedding.transformers(
    model_name='gpt2', 
    checkpoint_path='./tmp/test-clm/gpt2_without_head_weight.bin'
).get_op()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [9]:
embedding = new_gpt2_op('Hello world, hello every one.')
embedding, embedding.shape

(array([[ 0.00633   , -0.03620598, -0.06251045, ..., -0.06700664,
         -0.04973439, -0.11961488],
        [-0.11371685,  0.45823196,  0.07787877, ...,  0.09362672,
         -0.18830143,  0.30425274],
        [ 0.05938547,  0.24975586, -0.21508646, ...,  0.13408856,
          0.11817759,  0.23235339],
        ...,
        [ 0.46084803,  0.66350824, -0.81096333, ...,  0.21739565,
          0.07215898,  0.3428259 ],
        [-0.779129  ,  0.3405829 , -1.2257875 , ...,  0.0303742 ,
          0.01939184,  0.3015506 ],
        [ 0.2265141 , -0.1338258 , -0.34066635, ...,  0.01380733,
         -0.03249043,  0.16766332]], dtype=float32),
 (7, 768))
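
To quantify how much fine-tuning changed the embeddings, one option is the cosine similarity between the old and new vector of each token. The sketch below uses only the standard library and tiny made-up vectors; the helper name `row_cosine_similarities` is hypothetical. It only iterates over rows, so it should also accept the two `(7, 768)` arrays shown in this tutorial.

```python
import math


def row_cosine_similarities(before, after):
    """Cosine similarity between corresponding rows of two equally shaped 2-D inputs."""
    similarities = []
    for u, v in zip(before, after):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        similarities.append(dot / (norm_u * norm_v))
    return similarities


# Made-up 2-token, 2-dimensional "embeddings" for illustration only.
toy_before = [[1.0, 0.0], [0.0, 1.0]]
toy_after = [[1.0, 0.0], [1.0, 1.0]]

print([round(s, 4) for s in row_cosine_similarities(toy_before, toy_after)])
# [1.0, 0.7071]
```

Values close to 1 mean a token's embedding direction barely moved, while lower values indicate a larger change.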