## Preparation

### Install packages

Make sure you have installed the latest versions of the following Python packages:

| package |
| -- |
| towhee |
| transformers |
| datasets |
| evaluate |
| scikit-learn |
| torch |


In [1]:
# ! python -m pip install torch towhee transformers datasets evaluate scikit-learn

You can run the following code block to check that your environment is valid.

In [2]:
# try this line
from torch.distributed import ProcessGroup

If you get an error such as `cannot import name 'ProcessGroup' from 'torch.distributed'`, please refer to [this issue](https://github.com/pytorch/pytorch/issues/68385#issuecomment-1332607943) and install the latest PyTorch version.
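To see at a glance which of the packages listed above are installed, and in which versions, a small check like the following can help. It is a minimal sketch that relies only on the standard library; the helper name `installed_versions` is made up for this example and is not part of any of the packages.

```python
from importlib import metadata

# Package names taken from the table above.
REQUIRED = ['towhee', 'transformers', 'datasets', 'evaluate', 'scikit-learn', 'torch']

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f'{name}: {version or "NOT INSTALLED"}')
```

Any package reported as `NOT INSTALLED` can be added with the `pip install` command shown earlier.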

## Fine-tune BERT on Masked Language Modeling task

[Bidirectional Encoder Representations from Transformers (BERT)](https://arxiv.org/abs/1810.04805) is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google. BERT was pretrained on two tasks: masked language modeling and next sentence prediction. As a result of the training process, BERT learns contextual embeddings for words. After pretraining, which is computationally expensive, BERT can be fine-tuned with fewer resources on smaller datasets to optimize its performance on specific tasks.

### Instantiate operator
We can instantiate a towhee operator containing the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) model.  

**Note**: By default, we initialize `text_embedding.transformers`, which returns an embedding for every token during inference. If you want a single sentence embedding rather than one embedding per token, please use the `sentence_embedding.transformers` operator.

In [3]:
import towhee

bert_op = towhee.ops.text_embedding.transformers(model_name='bert-base-uncased').get_op()
# or, to get a single sentence embedding rather than one embedding per token, use the sentence_embedding operator:
# bert_op = towhee.ops.sentence_embedding.transformers(model_name='bert-base-uncased').get_op()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


This operator embeds a sentence using the pretrained model's last-layer output, taken before the model's head. The embedding shape is `([input token num], [model dim])`.

In [4]:
embedding = bert_op('Hello world, hello every one.')
embedding, embedding.shape

(array([[-0.29221347,  0.16306752,  0.20236042, ..., -0.21968302,
         -0.20541406,  0.7011004 ],
        [-0.61254054,  0.55492014,  1.4136117 , ...,  0.20618206,
          0.3001338 ,  1.0520668 ],
        [ 0.62186277,  0.40705413,  1.0701158 , ..., -0.04220748,
          0.25006238,  0.68540764],
        ...,
        [-0.6860294 ,  0.04325684,  1.0766182 , ...,  0.25660843,
         -0.36851028,  0.43962553],
        [-0.5295304 , -0.33464265,  0.10741595, ...,  0.86484456,
          0.02151933, -0.08432791],
        [ 0.79523426,  0.26699936,  0.03003302, ...,  0.24721493,
         -0.55750215, -0.16443911]], dtype=float32),
 (9, 768))
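For intuition on how a per-token array like the one above relates to a single sentence vector, one common approach is to average over the token axis (mean pooling). The sketch below uses a random stand-in array of the same shape. It illustrates generic mean pooling only and is not necessarily how the `sentence_embedding.transformers` operator computes its output; the helper name `mean_pool` is made up for this example.

```python
import numpy as np

def mean_pool(token_embeddings):
    """Average a (num_tokens, dim) array over tokens to get a (dim,) vector."""
    return np.asarray(token_embeddings).mean(axis=0)

# Random stand-in for the (9, 768) array returned by bert_op above.
token_embeddings = np.random.default_rng(0).normal(size=(9, 768)).astype(np.float32)

sentence_vector = mean_pool(token_embeddings)
print(sentence_vector.shape)  # (768,)
```

Note that this simple average also includes the special tokens; masking them out is a frequent refinement.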

### Start training
We only need to specify two argument dicts and call the `train()` method with `task='mlm'` to start training.

In [5]:
data_args = {
    'dataset_name': 'wikitext',
    'dataset_config_name': 'wikitext-2-raw-v1',
}
training_args = {
    'num_train_epochs': 3,  # increase the number of epochs to get better metrics
    'per_device_train_batch_size': 8,
    'per_device_eval_batch_size': 8,
    'do_train': True,
    'do_eval': True,
    'output_dir': './tmp/test-mlm',
    'overwrite_output_dir': True
}

In [6]:
bert_op.train(task='mlm', data_args=data_args, training_args=training_args)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2022-12-20 09:05:26,098 - 140132563211072 - train_mlm_with_hf_trainer.py-train_mlm_with_hf_trainer:164 - INFO: Training/evaluation parameters TrainingArguments(
_n_gpu=8,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_work

train mlm with hugging face transformers trainer
**** DataTrainingArguments ****
- dataset_name 
  default: None 
  metadata_dict: {'help': 'The name of the dataset to use (via the datasets library).'} 

- dataset_config_name 
  default: None 
  metadata_dict: {'help': 'The configuration name of the dataset to use (via the datasets library).'} 

- train_file 
  default: None 
  metadata_dict: {'help': 'The input training data file (a text file).'} 

- validation_file 
  default: None 
  metadata_dict: {'help': 'An optional input evaluation data file to evaluate the perplexity on (a text file).'} 

- overwrite_cache 
  default: False 
  metadata_dict: {'help': 'Overwrite the cached training and evaluation sets'} 

- validation_split_percentage 
  default: 5 
  metadata_dict: {'help': "The percentage of the train set used as validation set in case there's no validation split"} 

- max_seq_length 
  default: None 
  metadata_dict: {'help': 'The maximum total input sequence length after to

2022-12-20 09:05:32,902 - 140132563211072 - info.py-info:365 - INFO: Loading Dataset Infos from /home/zhangchen/.cache/huggingface/modules/datasets_modules/datasets/wikitext/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126
2022-12-20 09:05:32,916 - 140132563211072 - builder.py-builder:354 - INFO: Overwrite dataset info from restored data version.
2022-12-20 09:05:32,917 - 140132563211072 - info.py-info:285 - INFO: Loading Dataset info from /home/zhangchen/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126
2022-12-20 09:05:32,933 - 140132563211072 - info.py-info:285 - INFO: Loading Dataset info from /home/zhangchen/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126


  0%|          | 0/3 [00:00<?, ?it/s]

[INFO|trainer.py:703] 2022-12-20 09:05:35,787 >> The following columns in the training set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
[INFO|trainer.py:1634] 2022-12-20 09:05:35,798 >> ***** Running training *****
[INFO|trainer.py:1635] 2022-12-20 09:05:35,799 >>   Num examples = 4627
[INFO|trainer.py:1636] 2022-12-20 09:05:35,800 >>   Num Epochs = 3
[INFO|trainer.py:1637] 2022-12-20 09:05:35,800 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1638] 2022-12-20 09:05:35,801 >>   Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1639] 2022-12-20 09:05:35,802 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1640] 2022-12-20 09:05:35,802 >>   Total optimization steps = 219
[INFO|trainer.py:1641] 2022-12-20 09:05:35,803 >>   Number of trainable parameters = 109514298

Step,Training Loss


[INFO|trainer.py:1885] 2022-12-20 09:07:25,052 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


[INFO|trainer.py:2693] 2022-12-20 09:07:25,055 >> Saving model checkpoint to ./tmp/test-mlm
[INFO|configuration_utils.py:447] 2022-12-20 09:07:25,057 >> Configuration saved in ./tmp/test-mlm/config.json
[INFO|modeling_utils.py:1637] 2022-12-20 09:07:25,790 >> Model weights saved in ./tmp/test-mlm/pytorch_model.bin
[INFO|tokenization_utils_base.py:2157] 2022-12-20 09:07:25,792 >> tokenizer config file saved in ./tmp/test-mlm/tokenizer_config.json
[INFO|tokenization_utils_base.py:2164] 2022-12-20 09:07:25,794 >> Special tokens file saved in ./tmp/test-mlm/special_tokens_map.json
2022-12-20 09:07:25,833 - 140132563211072 - train_mlm_with_hf_trainer.py-train_mlm_with_hf_trainer:443 - INFO: *** Evaluate ***
[INFO|trainer.py:703] 2022-12-20 09:07:25,834 >> The following columns in the evaluation set don't have a corresponding argument in `BertForMaskedLM.for

***** train metrics *****
  epoch                    =        3.0
  total_flos               =  3402629GF
  train_loss               =     1.8067
  train_runtime            = 0:01:49.24
  train_samples            =       4627
  train_samples_per_second =    127.058
  train_steps_per_second   =      2.005


***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.6789
  eval_loss               =      1.575
  eval_runtime            = 0:00:01.89
  eval_samples            =        479
  eval_samples_per_second =    252.993
  eval_steps_per_second   =      4.225
  perplexity              =     4.8309
done mlm.


`data_args` specifies the training dataset. You can pass a dataset name directly, and it will be downloaded through [datasets](https://huggingface.co/docs/datasets/index). For more information on `data_args`, refer to the `**** DataTrainingArguments ****` section of this block's output.
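The `DataTrainingArguments` listing in the output also shows `train_file` and `validation_file` options for plain-text input. A configuration along the following lines may let you train on your own text files instead of a hub dataset. This is a sketch only: the file paths are hypothetical placeholders, and it assumes these options behave as their help strings describe.

```python
# Hypothetical local text files; replace them with your own paths.
local_data_args = {
    'train_file': './data/train.txt',
    'validation_file': './data/valid.txt',
}

# Then start training as before (uncomment once the files exist):
# bert_op.train(task='mlm', data_args=local_data_args, training_args=training_args)
```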

`training_args` specifies the training configuration using [transformers TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). For more information on `training_args`, refer to the `**** TrainingArguments ****` section of this block's output.

The first time you run this script, it will automatically download the corresponding model and dataset, which may require at least 1.5 GB of disk space.

If training runs very slowly, make sure your machine has a capable GPU, or reduce the number of epochs just to run through the whole process. By default, training is performed in parallel on all available GPUs.
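The training log above reports a total train batch size of 64 and 219 optimization steps. The following sketch shows how those figures follow from the number of examples, the per-device batch size, and the 8 GPUs listed as `_n_gpu` in the logged `TrainingArguments`. The variable names are chosen for this example only.

```python
import math

num_examples = 4627          # "Num examples" in the log
per_device_batch_size = 8    # per_device_train_batch_size
num_devices = 8              # _n_gpu in the logged TrainingArguments
grad_accum_steps = 1         # "Gradient Accumulation steps" in the log
num_epochs = 3               # num_train_epochs

total_batch_size = per_device_batch_size * num_devices * grad_accum_steps
# The last, partially filled batch still counts as one batch.
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_steps = steps_per_epoch * num_epochs

print(total_batch_size, steps_per_epoch, total_steps)  # 64 73 219
```

On a machine with fewer GPUs, the total batch size shrinks and the number of optimization steps grows accordingly, so your log will show different numbers.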

If the final evaluation result shows an `eval_accuracy` of about 0.65, you have successfully trained the operator. The resulting model weights are saved in your `output_dir`.
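The `perplexity` value in the eval metrics is closely tied to `eval_loss`. Assuming the training script follows the usual convention of the Hugging Face language modeling examples, perplexity is the exponential of the evaluation loss, which can be checked against the numbers printed above:

```python
import math

eval_loss = 1.575  # rounded value from the eval metrics above
perplexity = math.exp(eval_loss)

print(round(perplexity, 2))  # 4.83
```

The small difference from the reported 4.8309 comes from the log rounding `eval_loss` to three decimals.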

### Use your fine-tuned weights
Please note that the model trained this way includes a task-specific head, whereas the original operator does not. To extract embeddings with your trained weights, you need to convert the weights to fit the model without a head and then load them. Here is an example.

In [7]:
import torch
from collections import OrderedDict
from transformers import BertModel
from transformers.utils import logging

logging.set_verbosity_error()

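def convert_bert_weights(trained_weights_path, new_weight_path):
    """Convert fine-tuned BertForMaskedLM weights into headless BertModel weights."""
    state_dict = torch.load(trained_weights_path, map_location='cpu')
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        # Keep only the encoder weights and strip the 'bert.' prefix;
        # the MLM head weights ('cls.*') are dropped.
        if k.startswith('bert.'):
            new_k = k[len('bert.'):]
            new_state_dict[new_k] = v
    bert_model = BertModel.from_pretrained("bert-base-uncased")
    # strict=False: keys missing from the fine-tuned checkpoint (such as the pooler)
    # keep their pretrained values.
    bert_model.load_state_dict(new_state_dict, strict=False)
    torch.save(bert_model.state_dict(), new_weight_path)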

convert_bert_weights('./tmp/test-mlm/pytorch_model.bin', './tmp/test-mlm/bert_without_head_weight.bin')
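To make the key renaming in the conversion above easier to follow, here is the same filtering logic applied to a toy dictionary. This is a sketch for illustration only: the values are placeholder strings rather than tensors, and `strip_prefix` is a name made up for this example.

```python
from collections import OrderedDict

def strip_prefix(state_dict, prefix='bert.'):
    """Keep only the keys under `prefix` and drop the prefix from them."""
    stripped = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(prefix):
            stripped[key[len(prefix):]] = value
    return stripped

# Toy stand-in for a fine-tuned checkpoint: two encoder keys and one head key.
toy_state_dict = OrderedDict([
    ('bert.embeddings.word_embeddings.weight', 'tensor-a'),
    ('bert.encoder.layer.0.attention.self.query.weight', 'tensor-b'),
    ('cls.predictions.bias', 'tensor-c'),
])

converted = strip_prefix(toy_state_dict)
for key in converted:
    print(key)
```

Only the two encoder keys survive, now without the `bert.` prefix, which is the naming that a model without a head expects.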

In [8]:
new_bert_op = towhee.ops.text_embedding.transformers(
    model_name='bert-base-uncased', 
    checkpoint_path='./tmp/test-mlm/bert_without_head_weight.bin'
).get_op()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
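The warning above is harmless, but it can be silenced by setting the environment variable it mentions. A minimal sketch, assuming you run it at the top of the notebook before the tokenizer is first used:

```python
import os

# Set this before towhee/transformers create or use a tokenizer.
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
```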


In [9]:
embedding = new_bert_op('Hello world, hello every one.')
embedding, embedding.shape

(array([[-0.5528104 ,  0.15354109,  0.27865022, ..., -0.42180687,
         -0.17302679,  0.84261256],
        [-1.0168282 ,  0.6152043 ,  1.5796373 , ..., -0.066678  ,
          0.01646271,  1.0581956 ],
        [ 0.47564444,  0.3604667 ,  0.87738824, ..., -0.72136706,
          0.0545284 ,  1.0028253 ],
        ...,
        [-0.6695933 ,  0.1092175 ,  1.1711102 , ...,  0.12439835,
         -0.6637869 ,  1.0748045 ],
        [-0.7066035 , -0.12123797,  0.14397985, ...,  0.7948587 ,
          0.06388341,  0.16218108],
        [ 0.88073194,  0.20878866, -0.11732787, ...,  0.22331028,
         -0.5018618 , -0.35616544]], dtype=float32),
 (9, 768))