**Table of contents**<a id='toc0_'></a>    
- 1. [Scientific HuggingFace](#toc1_)    
- 2. [datas](#toc2_)    
- 3. [Tokenizer](#toc3_)    
  - 3.1. [Training a new tokenizer](#toc3_1_)    
  - 3.2. [Using a pre-trained tokenizer](#toc3_2_)    
    - 3.2.1. [Loading directly](#toc3_2_1_)    
    - 3.2.2. [Using it in Transformers](#toc3_2_2_)    
      - 3.2.2.1. [Wrapping](#toc3_2_2_1_)    
      - 3.2.2.2. [Loading](#toc3_2_2_2_)    
- 4. [BERT](#toc4_)    
  - 4.1. [datas](#toc4_1_)    
  - 4.2. [tokenizer](#toc4_2_)    
    - 4.2.1. [train a new tokenizer](#toc4_2_1_)    
    - 4.2.2. [make tokenizer to be used in transformers with AutoTokenizer](#toc4_2_2_)    
  - 4.3. [trainer](#toc4_3_)    


# 1. <a id='toc1_'></a>[Scientific HuggingFace](#toc0_)

In [1]:
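import os


# Route Hugging Face Hub requests through a mirror.
# Set this before importing datasets or transformers so the endpoint is picked up.
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'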

# 2. <a id='toc2_'></a>[datas](#toc0_)

In [3]:
import datasets 


datas = datasets.load_dataset('dnagpt/dna_promoters')

README.md:   0%|          | 0.00/360 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/11.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/11840 [00:00<?, ? examples/s]

# 3. <a id='toc3_'></a>[Tokenizer](#toc0_)

## 3.1. <a id='toc3_1_'></a>[Training a new tokenizer](#toc0_)

In [5]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

from transformers import AutoTokenizer


# Initialize a BPE model
tokenizer = Tokenizer(models.BPE())
# Set the pre-tokenizer
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False)  # use_regex=False: spaces are treated as ordinary characters
# Set up the trainer
trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=["<|endoftext|>"])  # vocabulary of 30,000 tokens

In [6]:
# Train
tokenizer.train(["data/huggingface/dna_1g.txt"], trainer=trainer)  # takes a list of files; about 10-20 min






In [8]:
# Encode
encoding = tokenizer.encode("TGGCGTGAACCCGGGATCGGG")
print(encoding.tokens)

['TG', 'GCGTGAA', 'CCCGG', 'GATCGG', 'G']


In [9]:
# Decode
decoding = tokenizer.decode(encoding.ids)
print(decoding)

TG GCGTGAA CCCGG GATCGG G
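
The spaces in the decoded output appear because no decoder was set on the tokenizer, so `decode` falls back to joining the tokens with spaces. The following is a minimal sketch of how the original sequence could be recovered. It assumes a toy in-memory corpus (hypothetical, standing in for `dna_1g.txt`) and uses `decoders.ByteLevel()` as the counterpart of the byte-level pre-tokenizer; the names `toy_corpus`, `toy_tokenizer`, and `toy_trainer` are illustrative only.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Toy corpus (hypothetical) used only to keep the example self-contained.
toy_corpus = [
    "TGGCGTGAACCCGGGATCGGG",
    "ATCGGATCGATCGGGCCCTTA",
    "GGGATCGGGTGAACCCGGTGG",
] * 20

toy_tokenizer = Tokenizer(models.BPE())
toy_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False)
toy_trainer = trainers.BpeTrainer(
    vocab_size=12,  # tiny vocabulary, just enough for a few merges
    special_tokens=["<|endoftext|>"],
    show_progress=False,
)
toy_tokenizer.train_from_iterator(toy_corpus, trainer=toy_trainer)

sequence = "TGGCGTGAACCCGGGATCGGG"
ids = toy_tokenizer.encode(sequence).ids

# Without a decoder, the tokens are joined with spaces.
decoded_without_decoder = toy_tokenizer.decode(ids)

# With the byte-level decoder, the tokens are concatenated back into the input.
toy_tokenizer.decoder = decoders.ByteLevel()
decoded_with_decoder = toy_tokenizer.decode(ids)

print(decoded_without_decoder)
print(decoded_with_decoder)
```

Whether to set the decoder before saving depends on how the tokenizer is used downstream; for DNA sequences, where spaces carry no meaning, a round-trip-safe decoder is probably the less surprising choice.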


In [11]:
# Save
tokenizer.save("data/huggingface/tokenizer.json")

## 3.2. <a id='toc3_2_'></a>[Using a pre-trained tokenizer](#toc0_)

### 3.2.1. <a id='toc3_2_1_'></a>[Loading directly](#toc0_)

In [15]:
from tokenizers import Tokenizer


# Load the custom tokenizer
tokenizer = Tokenizer.from_file("data/huggingface/tokenizer.json")

# Encode
encoding = tokenizer.encode("TGGCGTGAACCCGGGATCGGG")
print(encoding.tokens)

# Decode
decoding = tokenizer.decode(encoding.ids)
print(decoding)

['TG', 'GCGTGAA', 'CCCGG', 'GATCGG', 'G']
TG GCGTGAA CCCGG GATCGG G


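### 3.2.2. <a id='toc3_2_2_'></a>[Using it in Transformers](#toc0_)
The goal is to make the tokenizer loadable through `AutoTokenizer`.

#### 3.2.2.1. <a id='toc3_2_2_1_'></a>[Wrapping](#toc0_)
To use this tokenizer in 🤗 Transformers, we must wrap it in a `PreTrainedTokenizerFast` class. The cell below uses `GPT2TokenizerFast`, which is a subclass of `PreTrainedTokenizerFast`.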

In [18]:
from transformers import GPT2TokenizerFast


dna_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)

dna_tokenizer.save_pretrained("data/huggingface/dna_bpe_dict")

# dna_tokenizer.push_to_hub("dna_bpe_dict_1g", organization="dnagpt", use_auth_token="hf_*****") # push to huggingface

('data/huggingface/dna_bpe_dict/tokenizer_config.json',
 'data/huggingface/dna_bpe_dict/special_tokens_map.json',
 'data/huggingface/dna_bpe_dict/vocab.json',
 'data/huggingface/dna_bpe_dict/merges.txt',
 'data/huggingface/dna_bpe_dict/added_tokens.json',
 'data/huggingface/dna_bpe_dict/tokenizer.json')

#### 3.2.2.2. <a id='toc3_2_2_2_'></a>[Loading](#toc0_)

In [19]:
from transformers import AutoTokenizer 


# Loads successfully
tokenizer = AutoTokenizer.from_pretrained("data/huggingface/dna_bpe_dict")
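
To check that wrapping and reloading preserve the tokenization, one can compare the ids before and after the round trip. The sketch below assumes a toy in-memory corpus and a temporary directory in place of `data/huggingface/dna_bpe_dict`, and it wraps with the generic `PreTrainedTokenizerFast` rather than `GPT2TokenizerFast`; all `toy_*` names are hypothetical.

```python
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Toy corpus (hypothetical) so the example runs without the 1 GB text file.
toy_corpus = ["TGGCGTGAACCCGGGATCGGG", "ATCGGATCGATCGGGCCCTTA"] * 20

toy_tokenizer = Tokenizer(models.BPE())
toy_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False)
toy_trainer = trainers.BpeTrainer(
    vocab_size=12,
    special_tokens=["<|endoftext|>"],
    show_progress=False,
)
toy_tokenizer.train_from_iterator(toy_corpus, trainer=toy_trainer)

# Wrap the raw tokenizer so that Transformers can save and load it.
wrapped = PreTrainedTokenizerFast(tokenizer_object=toy_tokenizer, eos_token="<|endoftext|>")

sequence = "TGGCGTGAACCCGGGATCGGG"
with tempfile.TemporaryDirectory() as tmp_dir:
    wrapped.save_pretrained(tmp_dir)
    reloaded = AutoTokenizer.from_pretrained(tmp_dir)
    ids_wrapped = wrapped(sequence)["input_ids"]
    ids_reloaded = reloaded(sequence)["input_ids"]

print(ids_wrapped == ids_reloaded)
```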


# 4. <a id='toc4_'></a>[BERT](#toc0_)

## 4.2. <a id='toc4_2_'></a>[tokenizer](#toc0_)

### 4.2.1. <a id='toc4_2_1_'></a>[train a new tokenizer](#toc0_)

In [4]:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast, AutoModelForMaskedLM


# Initialize an empty WordPiece model
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Set the training parameters
trainer = WordPieceTrainer(
    vocab_size=30000,        # vocabulary size
    min_frequency=2,         # minimum token frequency
    show_progress=True,      # show a progress bar
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)

In [8]:
help(tokenizer.train)

Help on built-in function train:

train(self, files, trainer=None) method of tokenizers.Tokenizer instance
    Train the Tokenizer using the given files.

    Reads the files line by line, while keeping all the whitespace, even new lines.
    If you want to train from data store in-memory, you can check
    :meth:`~tokenizers.Tokenizer.train_from_iterator`

    Args:
        files (:obj:`List[str]`):
            A list of path to the files that we should use for training

        trainer (:obj:`~tokenizers.trainers.Trainer`, `optional`):
            An optional trainer that should be used to train our Model
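
The help text points to `train_from_iterator` for data held in memory. A minimal sketch of how a WordPiece tokenizer could be trained that way, without writing a text file first, is shown below. The `sequences` list and the `batch_iterator` helper are hypothetical and exist only for illustration.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

# Hypothetical in-memory data standing in for a real corpus.
sequences = ["ATCGGATCG", "GGATCCGATT", "TTAGGCATCG", "CGATCGGATA"] * 25


def batch_iterator(data, batch_size=10):
    """Yield the data in lists of `batch_size` strings."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


toy_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
toy_trainer = WordPieceTrainer(
    vocab_size=40,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=False,
)

# `length` is optional and only helps progress reporting.
toy_tokenizer.train_from_iterator(
    batch_iterator(sequences), trainer=toy_trainer, length=len(sequences)
)

vocab = toy_tokenizer.get_vocab()
print(len(vocab))
```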



In [9]:
# Train
tokenizer.train(files=["data/huggingface/dna_1g.txt"], trainer=trainer)

# Save
tokenizer.save("data/huggingface/dna_wordpiece_dict.json")






### 4.2.2. <a id='toc4_2_2_'></a>[make tokenizer to be used in transformers with AutoTokenizer](#toc0_)

In [10]:
new_tokenizer = Tokenizer.from_file("data/huggingface/dna_wordpiece_dict.json")

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=new_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Save
wrapped_tokenizer.save_pretrained("data/huggingface/dna_wordpiece_dict")

('data/huggingface/dna_wordpiece_dict/tokenizer_config.json',
 'data/huggingface/dna_wordpiece_dict/special_tokens_map.json',
 'data/huggingface/dna_wordpiece_dict/tokenizer.json')

In [12]:
from transformers import AutoTokenizer


# Load
tokenizer = AutoTokenizer.from_pretrained("data/huggingface/dna_wordpiece_dict")
#tokenizer.pad_token = tokenizer.eos_token

In [13]:
# Encode
tokenizer("ATCGGATCG")

{'input_ids': [6, 766, 22, 10], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
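
The `input_ids` above contain neither the `[CLS]` id (2) nor the `[SEP]` id (3), because no post-processor was configured on the tokenizer. If BERT-style framing is wanted, one option is a `TemplateProcessing` post-processor. The sketch below shows this on a toy WordPiece tokenizer trained in memory; the corpus and all `toy_*` names are hypothetical, and this is not part of the original pipeline.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordPieceTrainer

# Hypothetical toy corpus so the example is self-contained.
toy_corpus = ["ATCGGATCG", "GGATCCGATT", "TTAGGCATCG", "CGATCGGATA"] * 25

toy_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
toy_trainer = WordPieceTrainer(
    vocab_size=40,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=False,
)
toy_tokenizer.train_from_iterator(toy_corpus, trainer=toy_trainer)

# Wrap every encoded sequence as "[CLS] sequence [SEP]".
toy_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", toy_tokenizer.token_to_id("[CLS]")),
        ("[SEP]", toy_tokenizer.token_to_id("[SEP]")),
    ],
)

encoding = toy_tokenizer.encode("ATCGGATCG")
print(encoding.tokens)
```

With such a post-processor in place before wrapping and saving, the special-token mask that `DataCollatorForLanguageModeling` relies on would also mark these positions, so they are not selected for masking.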

In [14]:
tokenizer

PreTrainedTokenizerFast(name_or_path='data/huggingface/dna_wordpiece_dict', vocab_size=30000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

## model

In [15]:
from transformers import BertConfig, BertForMaskedLM 


# Configuration
max_len = 1024

config = BertConfig(
    vocab_size=len(tokenizer),
    max_position_embeddings=max_len,
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

# Model
model = BertForMaskedLM(config=config)

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.


## datas

In [17]:
from datasets import load_dataset 


raw_dataset = load_dataset('text', data_files='data/huggingface/dna_1g.txt')
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1079595
    })
})

In [18]:
dataset = raw_dataset["train"].train_test_split(test_size=0.1, shuffle=True)
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 971635
    })
    test: Dataset({
        features: ['text'],
        num_rows: 107960
    })
})

In [19]:
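# Raise the WordPiece per-word character limit so long DNA sequences are not mapped to [UNK]
tokenizer._tokenizer.model.max_input_chars_per_word = 10000


def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=max_len)


# Apply the tokenization function to the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=False, remove_columns=['text'], num_proc=50)  # set num_proc to your CPU core count, or adjust as needed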


Map (num_proc=50):   0%|          | 0/971635 [00:00<?, ? examples/s]

Map (num_proc=50):   0%|          | 0/107960 [00:00<?, ? examples/s]
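
The `max_input_chars_per_word` setting matters here because no pre-tokenizer splits the DNA text, so each line reaches WordPiece as a single very long "word". The sketch below illustrates the effect with a small hand-written vocabulary (hypothetical, not the trained one) and sets the limit through the `WordPiece` constructor rather than on a loaded model.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# Hand-written vocabulary (hypothetical) covering single nucleotides only.
vocab = {
    "[UNK]": 0,
    "A": 1, "C": 2, "G": 3, "T": 4,
    "##A": 5, "##C": 6, "##G": 7, "##T": 8,
}

# 200 characters with no whitespace: one "word" from WordPiece's point of view.
long_sequence = "ACGT" * 50

low_limit = Tokenizer(
    WordPiece(vocab=vocab, unk_token="[UNK]", max_input_chars_per_word=100)
)
high_limit = Tokenizer(
    WordPiece(vocab=vocab, unk_token="[UNK]", max_input_chars_per_word=10000)
)

# Over the limit, the whole word collapses to a single unknown token.
tokens_low = low_limit.encode(long_sequence).tokens
# Under the limit, the word is split into sub-word pieces as usual.
tokens_high = high_limit.encode(long_sequence).tokens

print(tokens_low)
print(len(tokens_high))
```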

In [20]:
from transformers import DataCollatorForLanguageModeling


# Create a data collator for dynamic padding and masking; note mlm=True
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
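
For intuition, the masking rule used for BERT-style masked language modelling can be sketched in NumPy: about 15% of the non-special positions are selected, and of those roughly 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged, while every unselected position gets the label `-100` so that it is ignored by the loss. The `mask_tokens` function below is a hypothetical illustration of that rule, not the collator's actual implementation.

```python
import numpy as np


def mask_tokens(input_ids, special_mask, mask_id, vocab_size,
                mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) following the 80/10/10 masking rule."""
    rng = np.random.default_rng(seed)
    masked_ids = np.array(input_ids, dtype=np.int64)  # copy of the input
    special_mask = np.asarray(special_mask, dtype=bool)
    labels = masked_ids.copy()

    # Select positions to predict, never touching special tokens.
    selected = (rng.random(masked_ids.shape) < mlm_probability) & ~special_mask
    labels[~selected] = -100  # ignored by the loss

    # One random draw per position decides which replacement applies.
    roll = rng.random(masked_ids.shape)
    masked_ids[selected & (roll < 0.8)] = mask_id  # 80%: [MASK]
    random_positions = selected & (roll >= 0.8) & (roll < 0.9)
    masked_ids[random_positions] = rng.integers(
        0, vocab_size, size=int(random_positions.sum())
    )  # 10%: random token; the remaining 10% stay unchanged
    return masked_ids, labels


original_ids = np.arange(5, 105)
special = np.zeros(100, dtype=bool)
masked_ids, labels = mask_tokens(original_ids, special, mask_id=4, vocab_size=30000)
print(int((labels != -100).sum()), "positions selected for prediction")
```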

In [21]:
dataset["train"][0]

{'text': 'GAATATTTGTCTATTCTTCTTAACTTTCTCCACTGTAAATTAAATTGCTCCTCAGGGTGCTATATGGCATCCCTTGCTATTTTTGGAGCAAATCTTAAATTCTTCAACAATTTTATCAAGACAAACACAACTTTCAGTAAATTCATTGTTTAAATTTGGTGAAAAGTCAGATTTCTTTACACATAGTAAAGCAAATGTAAAATAATATATCAATGTGATTCTTTTAATAAAATACCATTATTGCCAATGGTTTTTAATAGTTCACTGTTTGAAAGAGACCACAAAATTCATGTGCAAAAATCACAAGCATTCTTATACAACAGTGACAGACAAACAGAGAGCCAAATCAGGAATGAACTTCCATTCACAATTGCTTCAAAGAGAATCAAATACCTAGGAATCCAACTTACAAGGGATGTAAAGGACCTCTTCAAGGAGAACTACAAACCACTGCTCAGTGAAATAAAAGAGGACACAAACAAATGGAAGAACATACCATGCTCATGGATAGGAAGAATCAATATCGTGAAAATGGCCATACTGCCCAAGGTAATTTATAGATTCAATGCCATCCCCATCAAGCTACCAATGAGTTTCTTCACAGAATTGGAAAAAACTGTTTTAAAGTTCATATGGAACCAAAAAAGAACCCACATTGCCAAGACAATCCTAAGTCAAATGAACAAAGCTGGAGGGATCATGCTACCTGACTTCAAACTATACTACAAGGCTACAGTAACCAAAATAGCATGGTACTGGTACCAAAACAGAAATATAGACCAATGGAACAGCATAGAGTCCTCAGAAATAATACCACACATCTACATCTTTGATAAATCTGACAAAAACAAGAAATGGGGAAAGGATTCTCTATATAATAAATGGTGCTGGGAAAATTGGCTAGCCATAAGTAGAAAGCTGAAACTGGATCCTTTCCTTACTCTTTATACGAAAATTAATTCAAGATGGAGTAGAGACTTAAATGTTAGA

In [22]:
tokenizer.tokenize(dataset["train"][0]["text"][:100])

['GAA',
 '##TATTTG',
 '##TCTATT',
 '##CTTCTTAA',
 '##CTTTCTCC',
 '##A',
 '##CTGTAAATT',
 '##AAATT',
 '##GCTCC',
 '##TCAGG',
 '##GTGCTA',
 '##TATGGCA',
 '##TCCCTT',
 '##GCTATTTT',
 '##TGGAGCAA',
 '##A',
 '##TCTTAAA',
 '##T']

## 4.3. <a id='toc4_3_'></a>[trainer](#toc0_)

In [29]:
from transformers import TrainingArguments, Trainer


run_path = "cache/bert_run"
train_epochs = 5
batch_size = 2


training_args = TrainingArguments(
    output_dir=run_path,
    overwrite_output_dir=True,
    num_train_epochs=train_epochs,
    per_device_train_batch_size=batch_size,
    save_steps=2000,
    save_total_limit=2,
    prediction_loss_only=True,
    fp16=True,  # original note: not usable on V100
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [None]:
trainer.train()
trainer.save_model("cache/dna_bert_v0")

# LLaMa

## Tokenizer

In [32]:
import sentencepiece as spm


spm.SentencePieceTrainer.train(
    input="data/huggingface/dna_1g.txt,data/huggingface/protein_1g.txt", 
    model_prefix="dna_llama", 
    vocab_size=60000, 
    model_type="bpe", 
    # max_sentence_length=1000000,
    num_threads=50, 
)

In [None]:
tokenizer = spm.SentencePieceProcessor(model_file="dna_llama.model")

tokenizer.encode("ATCGGATCG")
