# 1. Install the Transformers library

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.1-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.1


# 2. Using BERT

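- Feel free to replace the sentence below with your own text, but make sure one word is swapped out for [MASK]; BERT needs that placeholder in order to predict the missing word.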
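The requirement above can be checked before the model is called. The following is a minimal sketch in plain Python: `has_single_mask` is a hypothetical helper name (not part of transformers), and it assumes the mask token is the literal string `[MASK]` used throughout this notebook.

```python
def has_single_mask(text: str, mask_token: str = "[MASK]") -> bool:
    """Return True if `text` contains the mask token exactly once."""
    return text.count(mask_token) == 1


# The first sentence has one placeholder; the second has none.
print(has_single_mask("Artificial Intelligence [MASK] take over the world."))
print(has_single_mask("Artificial Intelligence can take over the world."))
```

Running the check first gives a clearer message than discovering the missing placeholder only when the pipeline is called.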

In [2]:
# Import the pipeline function from the transformers library, a widely used deep learning library for natural language processing tasks.
from transformers import pipeline

# Create a pipeline that uses the pretrained BERT model ('bert-base-uncased') for the 'fill-mask' task.
# The 'fill-mask' task finds the position marked by '[MASK]' in a sentence and predicts the token that belongs there.
unmasker = pipeline('fill-mask', model='bert-base-uncased')

# Use unmasker to predict the '[MASK]' in the sentence "Artificial Intelligence [MASK] take over the world."
# For example, '[MASK]' may be filled with 'can', giving "Artificial Intelligence can take over the world."
unmasker("Artificial Intelligence [MASK] take over the world.")



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



[{'score': 0.3182406425476074,
  'token': 2064,
  'token_str': 'can',
  'sequence': 'artificial intelligence can take over the world.'},
 {'score': 0.18299666047096252,
  'token': 2097,
  'token_str': 'will',
  'sequence': 'artificial intelligence will take over the world.'},
 {'score': 0.05600154027342796,
  'token': 2000,
  'token_str': 'to',
  'sequence': 'artificial intelligence to take over the world.'},
 {'score': 0.04519500583410263,
  'token': 2015,
  'token_str': '##s',
  'sequence': 'artificial intelligences take over the world.'},
 {'score': 0.0451531708240509,
  'token': 2052,
  'token_str': 'would',
  'sequence': 'artificial intelligence would take over the world.'}]
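To make the output above easier to read, here is a small sketch that works on the five `token_str`/`score` pairs copied from it (scores rounded to four decimals). It sums the scores, which the fill-mask pipeline reports as probabilities over the vocabulary, and flags tokens that start with `##`, the WordPiece prefix for a piece that attaches to the preceding word. `summarize` is a hypothetical helper, not a transformers API.

```python
# Top-5 predictions copied (rounded) from the cell output above.
predictions = [
    {"token_str": "can", "score": 0.3182},
    {"token_str": "will", "score": 0.1830},
    {"token_str": "to", "score": 0.0560},
    {"token_str": "##s", "score": 0.0452},
    {"token_str": "would", "score": 0.0452},
]


def summarize(preds):
    """Return the summed score and the list of subword-continuation tokens."""
    total = sum(p["score"] for p in preds)
    subwords = [p["token_str"] for p in preds if p["token_str"].startswith("##")]
    return total, subwords


total, subwords = summarize(predictions)
print(f"probability mass covered by the top 5: {total:.3f}")
print("subword continuations:", subwords)
```

The `##s` entry explains the fourth sequence, "artificial intelligences take over the world.": the model attached a plural suffix to the previous word instead of inserting a new one.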

# 3. Bias built into the BERT model



In [3]:
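# Call the unmasker pipeline created earlier to predict the '[MASK]' in the sentence "The man worked as a [MASK]."
# BERT predicts the word most likely to fill '[MASK]' based on what it has learned about language. It might be 'doctor', 'teacher', 'lawyer', and so on; the result depends entirely on the model's prediction.
unmasker("The man worked as a [MASK].")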

[{'score': 0.09747558832168579,
  'token': 10533,
  'token_str': 'carpenter',
  'sequence': 'the man worked as a carpenter.'},
 {'score': 0.05238332226872444,
  'token': 15610,
  'token_str': 'waiter',
  'sequence': 'the man worked as a waiter.'},
 {'score': 0.049626998603343964,
  'token': 13362,
  'token_str': 'barber',
  'sequence': 'the man worked as a barber.'},
 {'score': 0.037886131554841995,
  'token': 15893,
  'token_str': 'mechanic',
  'sequence': 'the man worked as a mechanic.'},
 {'score': 0.037680815905332565,
  'token': 18968,
  'token_str': 'salesman',
  'sequence': 'the man worked as a salesman.'}]

## The pretraining text gives the BERT model a built-in bias

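- Both men and women can be doctors, lawyers, and so on, but the text data used for pretraining contains more samples in which men co-occur with these occupations, and that produces biased predictions.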

In [4]:
unmasker("The woman worked as a [MASK].")

[{'score': 0.21981509029865265,
  'token': 6821,
  'token_str': 'nurse',
  'sequence': 'the woman worked as a nurse.'},
 {'score': 0.15974131226539612,
  'token': 13877,
  'token_str': 'waitress',
  'sequence': 'the woman worked as a waitress.'},
 {'score': 0.1154731959104538,
  'token': 10850,
  'token_str': 'maid',
  'sequence': 'the woman worked as a maid.'},
 {'score': 0.03796877712011337,
  'token': 19215,
  'token_str': 'prostitute',
  'sequence': 'the woman worked as a prostitute.'},
 {'score': 0.030423874035477638,
  'token': 5660,
  'token_str': 'cook',
  'sequence': 'the woman worked as a cook.'}]
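The contrast between the two outputs can be made explicit with a short sketch. The two lists below are the `token_str` values copied from the cells above; the variable names are illustrative only.

```python
# Top-5 occupations predicted for each prompt, copied from the outputs above.
man_top5 = ["carpenter", "waiter", "barber", "mechanic", "salesman"]
woman_top5 = ["nurse", "waitress", "maid", "prostitute", "cook"]

# Occupations the model proposes for both subjects.
shared = set(man_top5) & set(woman_top5)

print("shared occupations:", sorted(shared))
print("only for 'man':", sorted(set(man_top5) - shared))
print("only for 'woman':", sorted(set(woman_top5) - shared))
```

With these two outputs the intersection is empty: changing a single word in the prompt replaces the entire top-5 list.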

## Bias from pretraining also carries over to fine-tuned tasks

- This is a limitation to keep in mind.
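One way to keep an eye on this before fine-tuning is to score the same occupation words for several subjects and compare the results. The sketch below is an illustration under stated assumptions, not a standard bias metric: `score_targets` and `compare_subjects` are hypothetical helpers, and they rely on the fill-mask pipeline accepting a `targets` argument that restricts scoring to the given words. Any callable with the same interface can be passed in as `unmasker`.

```python
def score_targets(unmasker, prompt, targets):
    """Return {token_str: score} for each target word in the masked slot."""
    results = unmasker(prompt, targets=targets)
    return {r["token_str"]: r["score"] for r in results}


def compare_subjects(unmasker, template, subjects, targets):
    """Score the same target words for every subject filled into the template."""
    return {
        subject: score_targets(unmasker, template.format(subject=subject), targets)
        for subject in subjects
    }


# Usage with the pipeline from section 2 (downloads the model on first run):
# from transformers import pipeline
# unmasker = pipeline('fill-mask', model='bert-base-uncased')
# compare_subjects(unmasker, "The {subject} worked as a [MASK].",
#                  ["man", "woman"], ["doctor", "nurse"])
```

Large gaps between the scores of the same word across subjects point to associations that a fine-tuned model may inherit.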