In [1]:
import os
import json
import sys
import time
import timeit

sys.path.append("../")
import textattack

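# Text augmentation usage examples

This notebook focuses on usage examples and generation speed. All examples use Chinese text, **because the augmenters may well have no effect on Chinese, and that needs checking.**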
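
Generation speed can be measured with the `timeit` module imported above. The following is a minimal sketch, not part of TextAttack: `time_augment` and `identity_augment` are hypothetical helper names, and `identity_augment` is only a stand-in for a real `augmenter.augment` method.

```python
import timeit


def time_augment(augment_fn, text, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` single calls."""
    timer = timeit.Timer(lambda: augment_fn(text))
    return min(timer.repeat(repeat=repeat, number=1))


def identity_augment(text):
    """Stand-in for augmenter.augment; swap in a real augmenter's method."""
    return [text]


elapsed = time_augment(identity_augment, "我不能创造我不理解的事物")
print(f"{elapsed:.6f} s per call")
```

Taking the minimum over several runs reduces the influence of one-off costs such as model loading or cache warm-up on the first call.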


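TextAttack's components include many easy-to-use data augmentation tools. The textattack.Augmenter class augments data using a transformation together with a set of constraints. Six built-in augmentation strategies are provided:

- wordnet: augments text by replacing words with WordNet synonyms
- embedding: augments text by replacing words with their neighbors in the counter-fitted embedding space, constrained so that the cosine similarity between the original and the replacement is at least 0.8
- charswap: augments text by inserting, deleting, and substituting characters, and by swapping adjacent characters
- eda: augments text by inserting, deleting, and replacing words
- checklist: augments text by contraction, extension, and substitution of names, locations, and numbers
- clare: augments text by inserting, deleting, and replacing words using a pre-trained masked language model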
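
The cosine-similarity constraint mentioned for the embedding strategy can be sketched in plain Python. This is an illustration only: the vectors in `toy_candidates` are made up, and `filter_neighbors` is a hypothetical helper, not TextAttack's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def filter_neighbors(word_vec, candidates, min_cos_sim=0.8):
    """Keep only candidate words whose vector is close enough to word_vec."""
    return [
        word
        for word, vec in candidates.items()
        if cosine_similarity(word_vec, vec) >= min_cos_sim
    ]


# Made-up 2-d vectors, for illustration only.
toy_candidates = {"close": [1.0, 0.1], "far": [0.0, 1.0]}
kept = filter_neighbors([1.0, 0.0], toy_candidates)
print(kept)
```

Only the candidate pointing in nearly the same direction as the original word survives the 0.8 threshold.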

In [14]:
# There are two main ways to use it
# The first is to use an existing augmenter class
from textattack.augmentation import EmbeddingAugmenter
augmenter = EmbeddingAugmenter()
# s = 'What I cannot create, I do not understand.'
s = "我不能创造我不理解的事物"
for item in augmenter.augment(s):
    print(item)
    print(item == s)

我不能创造我不理解的事物
True


In [22]:
# The second is to combine custom transformations and constraints into a new augmenter
from textattack.transformations import WordSwapRandomCharacterDeletion
from textattack.transformations import CompositeTransformation
from textattack.augmentation import Augmenter
transformation = CompositeTransformation([WordSwapRandomCharacterDeletion()])
augmenter = Augmenter(transformation=transformation, transformations_per_example=5)
# s = 'What I cannot create, I do not understand.'
s = "我不能创造我不理解的事物"
for item in augmenter.augment(s):
    print(item)
    print(item == s)

我不能创造我不理解的事1物
True


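All of these classes inherit from Augmenter. Augmenter has several useful initialization parameters:

1. pct_words_to_swap: the fraction of words in each sentence to augment
2. transformations_per_example: the number of augmented samples generated per sentence
3. advanced_metrics: return advanced metrics, including perplexity and USE score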
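
To get a feel for `pct_words_to_swap`, the sketch below estimates how many words would be changed in the example sentence. It assumes the count is the fraction times the word count, rounded down, with a minimum of one; that is my reading of Augmenter and may differ between versions. `num_words_to_swap` is a hypothetical helper name, and the sentence is split per character because that is how Chinese is tokenized in this notebook.

```python
def num_words_to_swap(num_words, pct_words_to_swap):
    """Assumed rule: floor(pct * num_words), but never fewer than one word."""
    return max(int(pct_words_to_swap * num_words), 1)


# One "word" per Chinese character, matching the tokenization used here.
words = list("我不能创造我不理解的事物")

for pct in (0.1, 0.2, 0.5):
    print(pct, num_words_to_swap(len(words), pct))
```

Under this assumption, short sentences always get at least one modified word, however small the fraction is.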

# Let's go, I'll sweep away all the darkness for you, by 妖刀姬

## Back-translation

How it works: translate the text into another language, then translate it back, producing a new text.
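
The round trip can be sketched independently of any translation model. In this sketch `back_translate` is a hypothetical helper, and the two lookup tables are hard-coded toy stand-ins for real translators, not model output.

```python
def back_translate(text, to_pivot, from_pivot):
    """Translate text into a pivot language, then back into the source language."""
    pivot_text = to_pivot(text)
    return from_pivot(pivot_text)


# Hard-coded toy tables standing in for zh->en and en->zh translation models.
TOY_ZH_TO_EN = {"我来自美丽中国": "I come from China"}
TOY_EN_TO_ZH = {"I come from China": "我来自中国"}

source = "我来自美丽中国"
result = back_translate(source, TOY_ZH_TO_EN.get, TOY_EN_TO_ZH.get)
print(result)
print(result == source)
```

Any information the pivot language drops or rephrases does not come back, which is what makes the round trip produce a paraphrase rather than the original sentence.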

In [None]:
from textattack.augmentation import BackTranslationAugmenter

# This prebuilt class targets English, so for Chinese the augmenter has to be rebuilt by hand
augmenter = BackTranslationAugmenter()
s = "What I cannot create, I do not understand."
# s = "我不能创造我不理解的事物"
for item in augmenter.augment(s):
    print(item)
    print(item == s)

In [10]:
from textattack.transformations.sentence_transformations import BackTranslation
from textattack.constraints.pre_transformation import RepeatModification, StopwordModification
from textattack.augmentation import Augmenter

transformation = BackTranslation(
    src_lang="zh",
    target_lang="en",
    # src_model translates back into src_lang (en -> zh here)
    src_model="Helsinki-NLP/opus-mt-en-zh",
    # target_model translates into target_lang (zh -> en here)
    target_model="Helsinki-NLP/opus-mt-zh-en",
)
constraints = [RepeatModification(), StopwordModification()]
augmenter = Augmenter(transformation=transformation, constraints=constraints, transformations_per_example=1)
s = "我不能创造我不理解的事物"

for item in augmenter.augment(s):
    print(item)
    print(item == s)

我不能创造我不明白的东西
False


In [2]:
# Try calling BackTranslation in batch mode
from textattack.transformations.sentence_transformations import BackTranslation
from textattack.shared import AttackedText

transformation = BackTranslation(
    src_lang="zh",
    target_lang="en",
    src_model="Helsinki-NLP/opus-mt-en-zh",
    target_model="Helsinki-NLP/opus-mt-zh-en",
)

text_list = ["我不能创造我不理解的事物", "我来自美丽中国"]
attacked_text_list = [AttackedText(text) for text in text_list]
transformation.batch_call(attacked_text_list)

[[<AttackedText "我不能创造我不明白的东西">], [<AttackedText "我来自中国">]]

## CLAREAugmenter

How it works: uses a pre-trained masked language model to augment text by inserting, deleting, and replacing words.

In [2]:
from textattack.augmentation import CLAREAugmenter

augmenter = CLAREAugmenter(
    model="bert-base-chinese",
    tokenizer="bert-base-chinese",
    pct_words_to_swap=0.2,
    transformations_per_example=2,
    max_length=64,
)

s = "我不能创造我不理解的事物"

for item in augmenter.augment(s):
    print(item)
    print(item == s)

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


2023-04-08 13:08:02,980 SequenceTagger predicts: Dictionary with 19 tags: <unk>, NOUN, VERB, PUNCT, ADP, DET, PROPN, PRON, ADJ, ADV, CCONJ, PART, NUM, AUX, INTJ, SYM, X, <START>, <STOP>
我不能创造出不理智的事物
False
我隻能編造我不理解的事物
False


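## Character swap

This has no effect on Chinese, because it operates on the characters inside a word, and the tokenization I set up earlier treats each Chinese character as a word.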

- (1) Swap: Swap two adjacent letters in the word.
    - WordSwapNeighboringCharacterSwap(),
- (2) Substitution: Substitute a letter in the word with a random letter.
    - WordSwapRandomCharacterSubstitution(),
- (3) Deletion: Delete a random letter from the word.
    - WordSwapRandomCharacterDeletion(),
- (4) Insertion: Insert a random letter in the word.
    - WordSwapRandomCharacterInsertion(),
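
The reason a neighboring-character swap leaves Chinese untouched can be shown with a small sketch. `swap_neighboring_chars` is a hypothetical helper written for this illustration, not the TextAttack class; it assumes, as described above, that every Chinese character is its own word.

```python
import random


def swap_neighboring_chars(word, rng):
    """Swap two adjacent characters; a word of length < 2 has nothing to swap."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


rng = random.Random(0)

# One word per character, as in this notebook's Chinese tokenization.
chinese_words = list("我不能创造")
swapped_cn = [swap_neighboring_chars(w, rng) for w in chinese_words]
print(swapped_cn == chinese_words)

# A multi-letter English word does change.
swapped_en = swap_neighboring_chars("create", rng)
print(swapped_en)
```

Each single-character word is returned unchanged, so the sentence as a whole is unchanged; only words with at least two characters can be perturbed this way.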

In [10]:
from textattack.augmentation import CharSwapAugmenter
augmenter = CharSwapAugmenter()

s = "我不能创造我不理解的事物"
# s = "What I cannot create, I do not understand."
for item in augmenter.augment(s):
    print(item)
    print(item == s)

我不能创造我不理解的事物
True


## CheckListAugmenter

Augments words by using the transformation methods provided by CheckList INV testing, which combines:

- Name Replacement
- Location Replacement
- Number Alteration
- Contraction/Extension

In [11]:
from textattack.augmentation import CheckListAugmenter

augmenter = CheckListAugmenter()
s = "我不能创造我不理解的事物"
for item in augmenter.augment(s):
    print(item)
    print(item == s)

我不能创造我不理解的事物
True


## Random deletion

In [12]:
from textattack.augmentation import DeletionAugmenter

augmenter = DeletionAugmenter()
s = "我不能创造我不理解的事物"
for item in augmenter.augment(s):
    print(item)
    print(item == s)

我不能创造我不理解事物
False


## EasyDataAugmenter

Augmenters like this are usually combinations of operations taken from a paper. This one includes:

- WordNet synonym replacement
    - Randomly replace words with their synonyms.
- Word deletion
    - Randomly remove words from the sentence.
- Word order swaps
    - Randomly swap the position of words in the sentence.
- Random synonym insertion
    - Insert a random synonym of a random word at a random location.
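
The four operations listed above can be sketched on a plain token list. This is a simplified illustration, not the EasyDataAugmenter implementation: the word-level segmentation and the `TOY_SYNONYMS` table are hypothetical stand-ins for a real tokenizer and for WordNet.

```python
import random


def synonym_replace(tokens, synonyms, rng):
    """Replace one random token that has a synonym; no-op if none has one."""
    candidates = [i for i, t in enumerate(tokens) if t in synonyms]
    if not candidates:
        return list(tokens)
    out = list(tokens)
    i = rng.choice(candidates)
    out[i] = rng.choice(synonyms[out[i]])
    return out


def random_delete(tokens, rng):
    """Remove one random token, keeping at least one token."""
    if len(tokens) <= 1:
        return list(tokens)
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]


def random_swap(tokens, rng):
    """Swap the positions of two distinct random indices."""
    if len(tokens) < 2:
        return list(tokens)
    i, j = rng.sample(range(len(tokens)), 2)
    out = list(tokens)
    out[i], out[j] = out[j], out[i]
    return out


def synonym_insert(tokens, synonyms, rng):
    """Insert a synonym of a random token at a random position."""
    candidates = [t for t in tokens if t in synonyms]
    if not candidates:
        return list(tokens)
    new_word = rng.choice(synonyms[rng.choice(candidates)])
    pos = rng.randrange(len(tokens) + 1)
    return tokens[:pos] + [new_word] + tokens[pos:]


# Hypothetical word-level segmentation and toy synonym table.
tokens = ["我", "不能", "创造", "我", "不", "理解", "的", "事物"]
TOY_SYNONYMS = {"理解": ["明白"], "事物": ["东西"]}

rng = random.Random(0)
replaced = synonym_replace(tokens, TOY_SYNONYMS, rng)
deleted = random_delete(tokens, rng)
swapped = random_swap(tokens, rng)
inserted = synonym_insert(tokens, TOY_SYNONYMS, rng)
for variant in (replaced, deleted, swapped, inserted):
    print("".join(variant))
```

With an empty synonym table the two synonym operations return the input unchanged, which is the same situation as a synonym source that has no entries for the language being augmented.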

In [14]:
from textattack.augmentation import EasyDataAugmenter

augmenter = EasyDataAugmenter()
s = "我不能创造我不理解的事物"
for item in augmenter.augment(s):
    print(item)
    print(item == s)

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\zhenh\AppData\Roaming\nltk_data...


我不能创造我解理不的事物
False
我不能创造我不理解的事物
True
我不能创造我不理解的物
False


## Embedding

In [15]:
from textattack.augmentation import EmbeddingAugmenter

# The default embedding here is presumably for English
augmenter = EmbeddingAugmenter()
s = "我不能创造我不理解的事物"
for item in augmenter.augment(s):
    print(item)
    print(item == s)

我不能创造我不理解的事物
True


In [2]:
import gensim
from gensim.models import KeyedVectors

embedding_path = r"G:\dataset\词向量\merge_sgns_bigram_char300.txt.bz2"
model = KeyedVectors.load_word2vec_format(embedding_path, binary=False, encoding="utf-8", unicode_errors="ignore")

duplicate word '' in word2vec file, ignoring all but first
duplicate word '' in word2vec file, ignoring all but first
duplicate word '' in word2vec file, ignoring all but first


In [4]:
# Loading the uncompressed file doesn't seem to make any difference; loading is far too slow
embedding_path = r"G:\dataset\词向量\merge_sgns_bigram_char300.txt"
model = KeyedVectors.load_word2vec_format(embedding_path, binary=False, encoding="utf-8", unicode_errors="ignore")

duplicate word '' in word2vec file, ignoring all but first
duplicate word '' in word2vec file, ignoring all but first
duplicate word '' in word2vec file, ignoring all but first


In [8]:
model.most_similar("甘雨")

[('祈晴祷雨', 0.5951941013336182),
 ('甘霖', 0.5938517451286316),
 ('雨泽', 0.5817687511444092),
 ('逢甘露', 0.56331467628479),
 ('霢', 0.5586028695106506),
 ('承雨露', 0.558268666267395),
 ('逢天', 0.5570719838142395),
 ('晴照', 0.5541519522666931),
 ('农禾', 0.5535107254981995),
 ('花满树', 0.5527628064155579)]

In [11]:
from textattack.shared import GensimWordEmbedding
from textattack.transformations import WordSwapEmbedding
from textattack.constraints.semantics import WordEmbeddingDistance
from textattack.constraints.pre_transformation import RepeatModification, StopwordModification
from textattack.augmentation import Augmenter

# Wrap the gensim KeyedVectors loaded above so TextAttack can query it
embedding = GensimWordEmbedding(model)
transformation = WordSwapEmbedding(max_candidates=50, embedding=embedding)
constraints = [RepeatModification(), StopwordModification(), WordEmbeddingDistance(min_cos_sim=0.8)]
augmenter = Augmenter(transformation=transformation, constraints=constraints)
s = "我不能创造我不理解的事物"

for item in augmenter.augment(s):
    print(item)
    print(item == s)

爸妈不能创造我不理解的事物
False


## SwapAugmenter

How it works: randomly swaps the order of two words.

In [13]:
from textattack.augmentation.recipes import SwapAugmenter

augmenter = SwapAugmenter()
s = "我不能创造我不理解的事物"
for item in augmenter.augment(s):
    print(item)
    print(item == s)

我不创能造我不理解的事物
False


## SynonymInsertionAugmenter

How it works: internally this uses `wordnet.synsets`, so it does not work for Chinese.

In [14]:
from textattack.augmentation.recipes import SynonymInsertionAugmenter

augmenter = SynonymInsertionAugmenter()
s = "我不能创造我不理解的事物"
for item in augmenter.augment(s):
    print(item)
    print(item == s)

我不能创造我不理解的事物
True


## WordNetAugmenter

In [18]:
from textattack.augmentation.recipes import WordNetAugmenter

augmenter = WordNetAugmenter()
s = "我不能创造我不理解的事物"
for item in augmenter.augment(s):
    print(item)
    print(item == s)

我不能创造我不理解的事物
True


[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\zhenh\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
