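### Masked Language Model
- Start with a small BERT to play with (a distilled one)
- Today's target task differs considerably from what the pretrained model was trained on
- So we build on the pretrained model and continue training it on our own task
- Then we check how much the results actually differ between the pretrained and the fine-tuned model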

In [2]:
import warnings
warnings.filterwarnings("ignore")
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

config.json: 100%|██████████| 483/483 [00:00<?, ?B/s] 
model.safetensors: 100%|██████████| 268M/268M [00:16<00:00, 16.6MB/s] 


In [3]:
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

'>>> DistilBERT number of parameters: 67M'
'>>> BERT number of parameters: 110M'
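The 67M figure can be reproduced by hand from the layer sizes. The sketch below is a back-of-the-envelope count, not the library's own method: the vocabulary size, hidden size, position count and number of blocks come from the model printout in this notebook, while the FFN inner size of 3072 and the shared (tied) output projection are assumptions that the printout does not show.

```python
# Back-of-the-envelope parameter count for DistilBertForMaskedLM.
# vocab, hidden, positions and blocks are taken from the model printout;
# ffn_dim = 3072 and the tied output projection are assumptions.
vocab, hidden, positions, blocks, ffn_dim = 30522, 768, 512, 6, 3072


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # one scale vector and one shift vector

embeddings = vocab * hidden + positions * hidden + layer_norm
attention = 4 * linear(hidden, hidden)  # q_lin, k_lin, v_lin, out_lin
ffn = linear(hidden, ffn_dim) + linear(ffn_dim, hidden)
block = attention + layer_norm + ffn + layer_norm

# MLM head: one dense layer, one LayerNorm and a per-token output bias.
# The output weight matrix is assumed to be shared with word_embeddings.
mlm_head = linear(hidden, hidden) + layer_norm + vocab

total = embeddings + blocks * block + mlm_head
print(f"{total:,} parameters, about {round(total / 1_000_000)}M")
```

Under these assumptions the embeddings account for roughly a third of the total, which is why halving the number of blocks does not halve the model size.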


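Performance is actually about the same, but the distilled model is much smaller.

Our task is to predict what the [MASK] token actually stands for.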

In [4]:
text = "This is a great [MASK]."

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<?, ?B/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 515kB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 2.13MB/s]


In [6]:
inputs = tokenizer(text, return_tensors="pt")
inputs

{'input_ids': tensor([[ 101, 2023, 2003, 1037, 2307,  103, 1012,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [7]:
tokenizer.mask_token_id

103
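The ids printed above already show where the blank is. A minimal sketch in plain Python, using only the numbers from the tokenizer output (no model or tokenizer is loaded here):

```python
# Token ids copied from the tokenizer output above.
input_ids = [101, 2023, 2003, 1037, 2307, 103, 1012, 102]
mask_token_id = 103  # value of tokenizer.mask_token_id shown above

# Position of every [MASK] in the sequence.
mask_positions = [i for i, tok in enumerate(input_ids) if tok == mask_token_id]
print(mask_positions)

# The first and last ids are the special tokens that the tokenizer adds
# around the sentence ([CLS] and [SEP] for BERT-style models).
print(input_ids[0], input_ids[-1])
```

This is the same lookup that `torch.where(inputs["input_ids"] == tokenizer.mask_token_id)` performs on the tensor version of the ids.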

In [8]:
model

DistilBertForMaskedLM(
  (activation): GELUActivation()
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.

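### First, see what the pretrained model predicts
- A model's predictions are strongly tied to the data it was trained on
- The original training data: https://huggingface.co/datasets/wikipedia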

In [9]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
print(token_logits.shape)

torch.Size([1, 8, 30522])


In [10]:
# Locate the [MASK] position and take the logits predicted there
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the top-5 predicted tokens at the [MASK] position
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'
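The ranking above uses raw logits. To see how confident each candidate is, the logits can be passed through a softmax first. The following is a sketch in plain Python on a made-up six-word vocabulary; `toy_vocab` and `toy_logits` are hypothetical values, not model output.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k):
    """Return (index, probability) pairs for the k highest-scoring entries."""
    probs = softmax(logits)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Hypothetical logits over a 6-word toy vocabulary (made-up numbers).
toy_vocab = ["deal", "success", "idea", "film", "cat", "the"]
toy_logits = [4.0, 3.5, 3.0, 1.0, -2.0, 0.5]

for index, prob in top_k(toy_logits, 3):
    print(f"{toy_vocab[index]}: {prob:.3f}")
```

Softmax preserves the order of the logits, so the top-k indices are the same either way; only the scores become comparable as probabilities.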


### Move the pretrained model to a new domain: movie reviews
- Fine-tune the model, i.e. take the pretrained model and train it on our own data

In [11]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

Downloading builder script: 100%|██████████| 4.31k/4.31k [00:00<00:00, 8.56MB/s]
Downloading metadata: 100%|██████████| 2.17k/2.17k [00:00<?, ?B/s]
Downloading readme: 100%|██████████| 7.59k/7.59k [00:00<00:00, 7.59MB/s]
Downloading data: 100%|██████████| 84.1M/84.1M [00:06<00:00, 12.4MB/s]
Generating train split: 100%|██████████| 25000/25000 [00:06<00:00, 3878.42 examples/s] 
Generating test split: 100%|██████████| 25000/25000 [00:06<00:00, 4084.92 examples/s] 
Generating unsupervised split: 100%|██████████| 50000/50000 [00:07<00:00, 6959.07 examples/s] 


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

This is basically a sentiment classification dataset:
- 0 means negative
- 1 means positive

In [12]:
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...'
'>>> Label: 1'

'>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stu

### Dataset preprocessing
- Compute the length of each text (word_ids)
- Choose a chunk_size, then split all the data into chunks of that size

In [15]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    # if tokenizer.is_fast:
    #     result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

In [16]:
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]  # this is a cloze task, so the sentiment labels are not needed
)
tokenized_datasets

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors
Map: 100%|██████████| 25000/25000 [00:04<00:00, 5460.98 examples/s]
Map: 100%|██████████| 25000/25000 [00:04<00:00, 5229.77 examples/s]
Map: 100%|██████████| 50000/50000 [00:10<00:00, 4933.51 examples/s]


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 50000
    })
})

This is the maximum sequence length the pretrained model supports.

In [17]:
tokenizer.model_max_length

512

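In a cloze task we only need to fill in the blanks, so sentence boundaries matter little. To avoid truncating data or adding a pile of useless padding, we simply concatenate all the texts end to end and then cut the result into pieces of a fixed length.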

In [18]:
chunk_size = 128

Pick a few reviews and see how long they are.

In [19]:
# Check the length of each one
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

'>>> Review 0 length: 363'
'>>> Review 1 length: 304'
'>>> Review 2 length: 133'


Concatenate them and compute the total length.

In [20]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()  # join the per-review lists into one long list
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

'>>> Concatenated reviews length: 800'


Split into chunks of the 128 tokens we set above.

In [21]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]  # slice using the chunk_size chosen above
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 32'
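The numbers above make the trade-off concrete. The arithmetic below compares padding each of the three reviews to the model maximum with concatenating and chunking them; it uses only the lengths printed in this notebook and is meant as an illustration, not a benchmark.

```python
lengths = [363, 304, 133]  # review lengths printed above
chunk_size = 128
max_length = 512  # tokenizer.model_max_length

# Option A: pad every review up to max_length.
padding_tokens = sum(max_length - n for n in lengths)

# Option B: concatenate everything, then cut into fixed-size chunks.
total_length = sum(lengths)
full_chunks, remainder = divmod(total_length, chunk_size)

print(f"padding tokens if padded to {max_length}: {padding_tokens}")
print(f"full chunks: {full_chunks}, leftover tokens: {remainder}")
```

Padding would add almost as many filler tokens as there are real ones, whereas chunking leaves only the final short piece to deal with.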


Let's wrap these steps in one function so that we can map it over the dataset directly.

In [22]:
def group_texts(examples):
    # Concatenate all texts in the batch
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute the total length
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Floor division drops the leftover tokens at the end
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # The cloze task needs labels, i.e. the original tokens
    result["labels"] = result["input_ids"].copy()
    return result
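To check the grouping logic on something small enough to read, here is a standalone variant run on a made-up batch. `group_texts_toy` and the sample ids are hypothetical; the function takes `chunk_size` as an argument instead of reading the global, and it flattens with `itertools.chain`, which is a swapped-in alternative to `sum(..., [])`.

```python
from itertools import chain


def group_texts_toy(examples, chunk_size):
    """Concatenate every column, drop the remainder, cut into equal chunks."""
    # chain.from_iterable flattens in linear time; sum(lists, []) is quadratic.
    flat = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(flat["input_ids"])
    usable = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, usable, chunk_size)]
        for k, t in flat.items()
    }
    # Copy each chunk so that labels and input_ids do not share list objects.
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


# Hypothetical batch: three "reviews" of 3, 4 and 2 tokens.
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1]],
}
grouped = group_texts_toy(batch, chunk_size=4)
print(grouped["input_ids"])
print(grouped["labels"])
```

Nine tokens with a chunk size of 4 give two full chunks and one dropped token. Because the remainder is dropped once per mapped batch, a larger batch size in `map` loses fewer tokens overall.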

In [23]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

Map: 100%|██████████| 25000/25000 [00:40<00:00, 617.85 examples/s]
Map: 100%|██████████| 25000/25000 [00:38<00:00, 653.08 examples/s]
Map: 100%|██████████| 50000/50000 [01:19<00:00, 629.34 examples/s]


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 122957
    })
})

Now we have more examples than before, because each review has been split into several 128-token chunks.

In [24]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

### Cloze (fill-in-the-blank) training
- Next we need to randomly mask some positions and then predict them; Hugging Face already provides a collator for this

https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling

In [25]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)  # 0.15 is the value used by BERT, so we leave it alone
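What the collator does can be sketched in plain Python. This is a simplified illustration of the BERT-style scheme (select about 15% of positions; of those, 80% become [MASK], 10% become a random token, 10% stay as they are), not the library's implementation. `mask_tokens`, its arguments and the fixed seed are hypothetical.

```python
import random

IGNORE_INDEX = -100  # label value that the loss function skips


def mask_tokens(input_ids, mask_id, vocab_size, special_ids, prob=0.15, seed=0):
    """Toy BERT-style masking: 80% [MASK], 10% random token, 10% unchanged."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if tok in special_ids or rng.random() >= prob:
            continue  # position not selected: no loss is computed here
        labels[i] = tok  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)
        # otherwise the token is left unchanged on purpose
    return masked, labels


# A few ids from the first training chunk, wrapped in [CLS] (101) and [SEP] (102).
ids = [101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 102]
masked, labels = mask_tokens(ids, mask_id=103, vocab_size=30522, special_ids={101, 102})
print(masked)
print(labels)
```

Only positions whose label is not -100 contribute to the loss, so the model is trained on the selected tokens alone.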

See what the data looks like after masking.

In [28]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    print(sample)
for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")
    print(len(chunk))

{'input_ids': [101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

Downsample the dataset, otherwise training is too slow.

In [29]:
train_size = 10000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})

### The training procedure is nothing special, same as before

In [30]:
from transformers import TrainingArguments

batch_size = 64
# Log results about once per epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",  # choose your own name
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_steps=logging_steps,
    num_train_epochs=1,
    save_strategy='epoch',
)

In [31]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
)

In [32]:
import math
eval_results = trainer.evaluate()
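print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")  # perplexity is the exponential of the cross-entropy loss
# This is a bit hard to grasp. In my own words: the model has to pick a suitable word at each mask, and perplexity is roughly how many candidates it is choosing among, on average, before it gets the right one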

100%|██████████| 16/16 [00:03<00:00,  4.46it/s]

>>> Perplexity: 21.94





In [33]:
trainer.train()

100%|██████████| 157/157 [01:45<00:00,  1.86it/s]

{'loss': 2.6986, 'learning_rate': 1.2738853503184715e-07, 'epoch': 0.99}


                                                 
100%|██████████| 157/157 [01:49<00:00,  1.86it/s]

{'eval_loss': 2.522949695587158, 'eval_runtime': 3.8603, 'eval_samples_per_second': 259.051, 'eval_steps_per_second': 4.145, 'epoch': 1.0}


100%|██████████| 157/157 [01:49<00:00,  1.43it/s]

{'train_runtime': 109.7801, 'train_samples_per_second': 91.091, 'train_steps_per_second': 1.43, 'train_loss': 2.69988226738705, 'epoch': 1.0}





TrainOutput(global_step=157, training_loss=2.69988226738705, metrics={'train_runtime': 109.7801, 'train_samples_per_second': 91.091, 'train_steps_per_second': 1.43, 'train_loss': 2.69988226738705, 'epoch': 1.0})
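The step counts in the log follow from the dataset and batch sizes. The arithmetic below assumes a single device, no gradient accumulation, and a linear learning-rate decay to zero without warmup; none of these settings are shown explicitly in the notebook.

```python
import math

train_examples, eval_examples, batch_size = 10_000, 1_000, 64
base_lr = 2e-5

steps_per_epoch = math.ceil(train_examples / batch_size)  # last batch is partial
eval_batches = math.ceil(eval_examples / batch_size)
logging_steps = train_examples // batch_size  # as computed in the notebook

# Assumed linear decay: the rate falls from base_lr to 0 over all steps.
lr_at_log = base_lr * (steps_per_epoch - logging_steps) / steps_per_epoch

print(steps_per_epoch, eval_batches, logging_steps)
print(f"epoch fraction at the logged step: {logging_steps / steps_per_epoch:.2f}")
print(f"learning rate at the logged step: {lr_at_log:.4e}")
```

Because floor division gives one step fewer than a full epoch, the training loss is logged just before the epoch ends, which matches the `'epoch': 0.99` entry and its near-zero learning rate.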

In [34]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

100%|██████████| 16/16 [00:03<00:00,  4.38it/s]

>>> Perplexity: 12.77
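To put the perplexity values in context, the sketch below applies the same formula to two reference losses: a model that guesses uniformly over the vocabulary, and the `eval_loss` reported at the end of the training epoch. The `perplexity` helper is a hypothetical name; the loss value is copied from the log above.

```python
import math


def perplexity(cross_entropy_loss):
    """Perplexity is exp of the mean cross-entropy (natural log) per predicted token."""
    return math.exp(cross_entropy_loss)


# A model that guesses uniformly over V words has loss ln(V) and perplexity V.
vocab_size = 30522
uniform_loss = math.log(vocab_size)
print(f"uniform guess: {perplexity(uniform_loss):.0f}")

# eval_loss reported at the end of the training epoch above.
print(f"end of epoch: {perplexity(2.522949695587158):.2f}")
```

The end-of-epoch loss corresponds to a perplexity of roughly 12.5, slightly lower than the 12.77 from the separate `trainer.evaluate()` call. A likely reason is that the collator draws new random masks on every pass over the evaluation set, so the metric varies a little from run to run.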





Let's see how the newly fine-tuned model does.

In [35]:
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained("./distilbert-base-uncased-finetuned-imdb/checkpoint-157")

In [36]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [37]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great idea.'
'>>> This is a great deal.'
'>>> This is a great adventure.'
'>>> This is a great film.'
'>>> This is a great one.'
