<a href ="https://colab.research.google.com/github/GEM-benchmark/NL-Augmenter/blob/main/notebooks/Write_a_sample_transformation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.


See the License for the specific language governing permissions and
limitations under the License.

# NL-Augmenter Colab example 

  * Play with an existing **transformation** 
    * Write your own **transformation** 
  * Play with an existing **filter**  
    * Write your own **filter**         

Total running time: ~10 min

## Install NL-Augmenter from GitHub



In [1]:
!git clone https://www.github.com/GEM-benchmark/NL-Augmenter

Cloning into 'NL-Augmenter'...
remote: Enumerating objects: 1726, done.
remote: Counting objects: 100% (680/680), done.
remote: Compressing objects: 100% (401/401), done.
remote: Total 1726 (delta 381), reused 459 (delta 263), pack-reused 1046
Receiving objects: 100% (1726/1726), 363.57 KiB | 15.81 MiB/s, done.
Resolving deltas: 100% (1016/1016), done.


In [10]:
cd NL-Augmenter

[Errno 2] No such file or directory: 'NL-Augmenter'
/Users/admintmun/dev/NL-Augmenter/notebooks


In [11]:
cd ..

/Users/admintmun/dev/NL-Augmenter


In [5]:
!pip install -r requirements.txt --quiet

## Load modules

In [2]:
from transformations.butter_fingers_perturbation.transformation import ButterFingersPerturbation
from transformations.change_person_named_entities.transformation import ChangePersonNamedEntities
from transformations.replace_numerical_values.transformation import ReplaceNumericalValues
from interfaces.SentenceOperation import SentenceOperation
from interfaces.QuestionAnswerOperation import QuestionAnswerOperation
from evaluation.evaluation_engine import evaluate, execute_model
from tasks.TaskTypes import TaskType

## Play with some existing transformations

In [4]:
t1 = ButterFingersPerturbation(max_outputs=3)
t1.generate("Jason wants to move back to India by the end of next year.")

['Jasln wants to move back to India by the end od next year.',
 'Jason wants to move back to India by thx end of ntxt year.',
 'Jason wants to move back no India by tht end od next yeac.']

In [38]:
t2 = ChangePersonNamedEntities(max_outputs=2)
t2.generate("Jason wants to move back to India by the end of next year.")

['Austin wants to move back to India by the end of next year.']

In [40]:
t3 = ReplaceNumericalValues(max_outputs=1)
t3.generate("Jason's 3 sisters want to move back to India")

["Jason's 8 sisters want to move back to India"]

## Testing new transformations

In [1]:
from transformations.chinese_butter_fingers_perturbation import ChineseButterFingersPerturbation
import os

In [3]:
os.getcwd()

'/Users/admintmun/dev/NL-Augmenter/notebooks'

In [None]:
cd ..

In [3]:
chinese_perturbation = ChineseButterFingersPerturbation(prob=0.2, max_outputs=1)

In [15]:
chinese_perturbation.generate("谢尔曼将访问中国天津，中国外交部主管中美关系的副部长谢峰将和她会谈.")

From Common Word List
协邪胁斜谐携鞋泄泻屑械卸谢懈蟹写些楔歇蝎
From Full Character List
协邪胁斜谐携鞋泄泻屑械卸谢懈蟹写些楔歇蝎
From Common Word List
儿而贰二尔耳饵
From Full Character List
儿而贰二尔耳饵
From Common Word List
蛮馒瞒曼幔慢漫蔓满
From Full Character List
蛮馒瞒曼幔慢漫蔓满
From Common Word List
匠降酱讲奖桨蒋江姜将浆僵缰疆
From Full Character List
匠降酱讲奖桨蒋江姜将浆僵缰疆
From Common Word List
防妨房肪放仿访纺方坊芳
From Full Character List
防妨房肪放仿访纺方坊芳
From Common Word List
文纹闻蚊问吻紊稳瘟温
From Full Character List
文纹闻蚊问吻紊稳瘟温
From Common Word List
仲众重肿种终盅钟衷中忠
From Full Character List
仲众重肿种终盅钟衷中忠
From Common Word List
国过裹果郭锅
From Full Character List
国过裹果郭锅
From Common Word List
甜填田恬舔天添
From Full Character List
甜填田恬舔天添
From Common Word List
晋浸禁尽劲近进仅紧谨锦筋襟巾今斤金津
From Full Character List
晋浸禁尽劲近进仅紧谨锦筋襟巾今斤金津
From Common Word List

From Full Character List

From Common Word List
仲众重肿种终盅钟衷中忠
From Full Character List
仲众重肿种终盅钟衷中忠
From Common Word List
国过裹果郭锅
From Full Character List
国过裹果郭锅
From Common Word List
外歪
From Full Character List
外歪
From Common Word List
嚼叫轿较教窖酵矫脚搅剿缴角狡绞饺蕉礁浇骄胶椒焦交郊娇
F

['懈尔幔桨肪问众裹舔劲，忠郭歪胶哺主关众眉官牺得复布场谐蜂桨盒他汇瘫.']

In [3]:
chinese_perturbation = ChineseButterFingersPerturbation()

In [4]:
chinese_perturbation.generate("谢尔曼将访问中国天津，中国外交部主管中美关系的副部长谢峰将和她会谈.")
# chinese_perturbation.generate("燮")

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/vd/3vrnjbd50935tzjj4872zb480000gq/T/jieba.cache
Loading model cost 0.594 seconds.
Prefix dict has been built successfully.


[{'perturb_word': '谢尔曼', 'similar_pinyin_words': []}, {'perturb_word': '将', 'similar_pinyin_words': ['匠', '降', '酱', '讲', '奖', '桨', '蒋', '江', '姜', '浆', '僵', '缰', '疆']}, {'perturb_word': '访问', 'similar_pinyin_words': ['方闻', '妨紊', '访闻']}, {'perturb_word': '中国', 'similar_pinyin_words': []}, {'perturb_word': '天津', 'similar_pinyin_words': []}, {'perturb_word': '，', 'similar_pinyin_words': []}, {'perturb_word': '中国外交部', 'similar_pinyin_words': []}, {'perturb_word': '主管', 'similar_pinyin_words': ['朱观']}, {'perturb_word': '中美关系', 'similar_pinyin_words': []}, {'perturb_word': '的', 'similar_pinyin_words': ['得', '德']}, {'perturb_word': '副', 'similar_pinyin_words': ['扶', '芙', '拂', '服', '俘', '浮', '符', '袱', '幅', '伏', '凫', '福', '辐', '蝠', '父', '付', '妇', '负', '附', '咐', '复', '傅', '赴', '富', '覆', '赋', '缚', '腹', '腐', '辅', '抚', '甫', '府', '斧', '俯', '肤', '麸', '孵', '敷', '夫']}, {'perturb_word': '部长', 'similar_pinyin_words': ['不彰', '布帐', '步鄣', '步帐', '步障', '部帐', '簿帐']}, {'perturb_word': '谢峰', 'similar_pinyin_words

['谢尔曼将访问中国天津，中国外交部朱观中美关系的副部长谢峰蒋和她回滩.']

In [5]:
chinese_perturbation.generate("本意是指词的起源义，即词的最初意义。引申义是由词的本意引申出来的并经过推演发展而产生的意义")

[{'perturb_word': '本意', 'similar_pinyin_words': ['奔轶', '奔逸', '犇佚', '犇逸', '本义', '本议', '本谊', '坌溢']}, {'perturb_word': '是', 'similar_pinyin_words': ['匙', '时', '十', '什', '石', '识', '实', '拾', '蚀', '食', '誓', '释', '嗜', '柿', '逝', '适', '式', '士', '氏', '世', '市', '示', '事', '侍', '势', '视', '试', '饰', '室', '恃', '拭', '史', '矢', '使', '始', '驶', '尸', '失', '师', '虱', '诗', '施', '狮', '湿']}, {'perturb_word': '指词', 'similar_pinyin_words': []}, {'perturb_word': '的', 'similar_pinyin_words': ['得', '德']}, {'perturb_word': '起源', 'similar_pinyin_words': ['栖园', '奇冤', '奇缘', '期愿']}, {'perturb_word': '义', 'similar_pinyin_words': ['宜', '仪', '夷', '姨', '胰', '移', '遗', '疑', '疫', '绎', '奕', '译', '邑', '易', '意', '溢', '益', '谊', '逸', '翼', '肄', '毅', '亿', '忆', '艺', '议', '亦', '屹', '异', '役', '抑', '蚁', '倚', '椅', '乙', '已', '以', '伊', '衣', '医', '依', '壹', '揖', '一']}, {'perturb_word': '，', 'similar_pinyin_words': []}, {'perturb_word': '即词', 'similar_pinyin_words': ['赍刺', '即此', '季次', '寄词', '寄辞', '激辞', '急辞', '稷祠', '积次', '祭祠', '击刺', '棘茨', '藉词', '

['本意始指词的起源义，即词的最初一医。引申义驶由词得本谊引申出来的并经过推演发展而缠声得意义']

In [6]:

chinese_perturbation.generate("汉字是语素文字，总数非常庞大。汉字总共有多少字？到目前为止，恐怕没人能够答得上来精确的数字。")

[{'perturb_word': '汉字', 'similar_pinyin_words': ['蚶子', '酣紫', '酣恣', '憨子', '含姿', '含渍', '涵渍', '寒姿', '韩子', '汉子', '汗渍']}, {'perturb_word': '是', 'similar_pinyin_words': ['匙', '时', '十', '什', '石', '识', '实', '拾', '蚀', '食', '誓', '释', '嗜', '柿', '逝', '适', '式', '士', '氏', '世', '市', '示', '事', '侍', '势', '视', '试', '饰', '室', '恃', '拭', '史', '矢', '使', '始', '驶', '尸', '失', '师', '虱', '诗', '施', '狮', '湿']}, {'perturb_word': '语素', 'similar_pinyin_words': []}, {'perturb_word': '文字', 'similar_pinyin_words': []}, {'perturb_word': '，', 'similar_pinyin_words': []}, {'perturb_word': '总数', 'similar_pinyin_words': ['综述']}, {'perturb_word': '非常', 'similar_pinyin_words': ['肥肠', '腓肠', '肥肠', '腓肠', '棐常', '肺肠']}, {'perturb_word': '庞大', 'similar_pinyin_words': []}, {'perturb_word': '。', 'similar_pinyin_words': []}, {'perturb_word': '汉字', 'similar_pinyin_words': ['蚶子', '酣紫', '酣恣', '憨子', '含姿', '含渍', '涵渍', '寒姿', '韩子', '汉子', '汗渍']}, {'perturb_word': '总共', 'similar_pinyin_words': []}, {'perturb_word': '有', 'similar_pinyin_words': 

['酣紫是语素文字，总数非常庞大。汉字总共有多少字？到目前为止，恐怕美人能够大德上来精确的数字。']

In [7]:
chinese_perturbation.generate("恰当的运用反义词，可以形成鲜明的对比，把事物的特点表达得更充分，给人留下深刻难忘的印象。")

[{'perturb_word': '恰当', 'similar_pinyin_words': []}, {'perturb_word': '的', 'similar_pinyin_words': ['得', '德']}, {'perturb_word': '运用', 'similar_pinyin_words': []}, {'perturb_word': '反义词', 'similar_pinyin_words': []}, {'perturb_word': '，', 'similar_pinyin_words': []}, {'perturb_word': '可以', 'similar_pinyin_words': ['柯怡']}, {'perturb_word': '形成', 'similar_pinyin_words': ['星城', '行成', '行城', '行程', '行塍', '行秤']}, {'perturb_word': '鲜明', 'similar_pinyin_words': []}, {'perturb_word': '的', 'similar_pinyin_words': ['得', '德']}, {'perturb_word': '对比', 'similar_pinyin_words': ['对笔', '怼笔']}, {'perturb_word': '，', 'similar_pinyin_words': []}, {'perturb_word': '把', 'similar_pinyin_words': ['跋', '拔', '坝', '爸', '罢', '霸', '靶', '八', '巴', '叭', '扒', '吧', '芭', '疤', '捌', '笆']}, {'perturb_word': '事物', 'similar_pinyin_words': ['石屋']}, {'perturb_word': '的', 'similar_pinyin_words': ['得', '德']}, {'perturb_word': '特点', 'similar_pinyin_words': []}, {'perturb_word': '表达', 'similar_pinyin_words': []}, {'perturb_word': '

['恰当的运用反义词，可以形成鲜明的对比，把石屋的特点表达得更充分，给人溜下深刻难忘的印象。']

In [8]:
chinese_perturbation.generate("随着两个遗迹文明的发展，他们终于开始了争斗。遗迹之间的能量冲突是战争的导火索，因为一方出现，另一方的遗迹能量就会相应的颓落。")

[{'perturb_word': '随着', 'similar_pinyin_words': []}, {'perturb_word': '两个', 'similar_pinyin_words': []}, {'perturb_word': '遗迹', 'similar_pinyin_words': ['易基', '义集', '伊籍']}, {'perturb_word': '文明', 'similar_pinyin_words': []}, {'perturb_word': '的', 'similar_pinyin_words': ['得', '德']}, {'perturb_word': '发展', 'similar_pinyin_words': ['发战']}, {'perturb_word': '，', 'similar_pinyin_words': []}, {'perturb_word': '他们', 'similar_pinyin_words': []}, {'perturb_word': '终于', 'similar_pinyin_words': ['钟毓']}, {'perturb_word': '开始', 'similar_pinyin_words': []}, {'perturb_word': '了', 'similar_pinyin_words': ['乐', '勒']}, {'perturb_word': '争斗', 'similar_pinyin_words': ['正斗']}, {'perturb_word': '。', 'similar_pinyin_words': []}, {'perturb_word': '遗迹', 'similar_pinyin_words': ['易基', '义集', '伊籍']}, {'perturb_word': '之间', 'similar_pinyin_words': []}, {'perturb_word': '的', 'similar_pinyin_words': ['得', '德']}, {'perturb_word': '能量', 'similar_pinyin_words': []}, {'perturb_word': '冲突', 'similar_pinyin_words': ['冲途'

['随着两个遗迹文明的发展，他们终于开始了正斗。遗迹之间的能量冲突驶战争德导火索，因为一方出现，另一方的伊籍能量就挥相应德颓落。']

In [5]:
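import jieba
import pypinyin
import unidecode as u  # assumed alias, matching the `u.unidecode` calls below

text = "阿鼻"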

perturb = ChineseButterFingersPerturbation()
newtext = perturb.generate(text)
chinese_words = perturb.chinese_words_database
print(perturb.chinese_words_database)
output = jieba.lcut(text)
# print(output)

similar_word_pinyin_list = []
for perturb_word in output:
    perturb_word_len = len(perturb_word)
    for i in range(len(chinese_words)):
        word_dict = chinese_words[i]
        word = word_dict['word']
        if(len(word) == perturb_word_len):
            perturb_word_pinyins = pypinyin.pinyin(perturb_word)
            perturb_word_pinyins_flatten = [item for pinyin in perturb_word_pinyins for item in pinyin]
            perturb_word_pinyins_string = ''.join(perturb_word_pinyins_flatten)
            word_pinyins = pypinyin.pinyin(word)
            word_pinyins_flatten = [item for pinyin in word_pinyins for item in pinyin]
            word_pinyins_string = ''.join(word_pinyins_flatten)
            same_pinyin = [perturb_word_pinyin for perturb_word_pinyin, word_pinyin  in zip(perturb_word_pinyins_flatten, word_pinyins_flatten) if perturb_word_pinyin == word_pinyin]

            perturb_word_pinyins_string_no_tone = u.unidecode(perturb_word_pinyins_string)
            word_pinyins_string_no_tone = u.unidecode(word_pinyins_string)

            print(perturb_word_pinyins_string_no_tone)
            print(word_pinyins_string_no_tone)
            if (perturb_word_pinyins_string_no_tone== word_pinyins_string_no_tone):
                similar_word_pinyin_list.append(word)

        print(similar_word_pinyin_list)


FileNotFoundError: [Errno 2] No such file or directory: '/Users/admintmun/dev/NL-Augmenter/transformations/chinese_butter_fingers_perturbation/chinese_word.json'

In [13]:
sentence_add = SentenceAdditions(max_outputs=3, max_length=70)

In [14]:
sentence_add.generate("Historically, the uncertainty principle has been confused with a related effect in physics, called the observer effect, which notes that measurements of certain systems cannot be made without affecting the system, that is, without changing something in a system.")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Historically, the uncertainty principle has been confused with a related effect in physics, called the observer effect, which notes that measurements of certain systems cannot be made without affecting the system, that is, without changing something in a system. To the extent that observation affects the system, measurement is not an independent action to be made. But it is a possible',
 'Historically, the uncertainty principle has been confused with a related effect in physics, called the observer effect, which notes that measurements of certain systems cannot be made without affecting the system, that is, without changing something in a system. The difference, in contrast, is that the uncertainty principle says that no amount of uncertainty in the measurement will affect the meaning',
 'Historically, the uncertainty principle has been confused with a related effect in physics, called the observer effect, which notes that measurements of certain systems cannot be made without affect

In [2]:
sentence_add = SentenceAdditions(max_outputs=1, max_length=30)

In [3]:
sentence_add.generate("John was not the person that I had imagined.")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['John was not the person that I had imagined. I had thought that he would have strong qualities. I think I knew that I wanted to do something']

In [16]:
sentence_add.generate('Trinity Medical Imaging is one of the foremost providers of private nuclear medicine imaging in London and Surrey. We work with someof the finest nuclear medicine consultants from a wide variety of specialist fields, attracted from London\'s major teaching hospitals.')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


["Trinity Medical Imaging is one of the foremost providers of private nuclear medicine imaging in London and Surrey. We work with someof the finest nuclear medicine consultants from a wide variety of specialist fields, attracted from London's major teaching hospitals. Trinity Medical Imaging has won awards for excellence from the Royal College of Radiologists, The Royal College of Surgeons,",
 "Trinity Medical Imaging is one of the foremost providers of private nuclear medicine imaging in London and Surrey. We work with someof the finest nuclear medicine consultants from a wide variety of specialist fields, attracted from London's major teaching hospitals. We are also a leading provider for a wide range of diagnostic and clinical imaging tests and procedures, including brain tumour scans",
 "Trinity Medical Imaging is one of the foremost providers of private nuclear medicine imaging in London and Surrey. We work with someof the finest nuclear medicine consultants from a wide variety of

In [None]:
from transformations import 

In [5]:
import torch

In [6]:
device = torch.device("cpu")
execute_model(SentenceAdditions, "TEXT_CLASSIFICATION", percentage_of_examples=1)

Loading <imdb> dataset to evaluate <aychang/roberta-base-imdb> model.


AssertionError: Torch not compiled with CUDA enabled

## Define a simple transformation
Let's define a very basic transformation which just uppercases the sentence. 

This transformation could be used for many [tasks](https://github.com/GEM-benchmark/NL-Augmenter/blob/add_filters_for_contrast_sets/tasks/TaskTypes.py), including text classification and generation, so we set the `tasks` variable to `[TaskType.TEXT_CLASSIFICATION, TaskType.TEXT_TO_TEXT_GENERATION]`. That's it!

In [None]:
class MySimpleTransformation(SentenceOperation):
  tasks = [TaskType.TEXT_CLASSIFICATION, TaskType.TEXT_TO_TEXT_GENERATION]
  languages = ["en"]
  
  def generate(self, sentence):
    return [sentence.upper()]

In [None]:
my_transformation = MySimpleTransformation() 

In [None]:
my_transformation.generate("John was n't the person I had n't imagined.")

["JOHN WAS N'T THE PERSON I HAD N'T IMAGINED."]


Admittedly, this can barely be called a transformation. What could it really achieve?
So, let's quickly compare the performance of a trained text classifier on a common test set and on the same test set with MySimpleTransformation applied (also called a perturbed set), using this one line of code. You will need to hold your breath for around 5 minutes!

In [None]:
execute_model(MySimpleTransformation, "TEXT_CLASSIFICATION", percentage_of_examples=1)

Loading <imdb> dataset to evaluate <aychang/roberta-base-imdb> model.


Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/4ea52f2e58a08dbc12c2bd52d0d92b30b88c00230b4522801b3636782f625c5b)


Here is the performance of the model aychang/roberta-base-imdb on the test[:1%] split of the imdb dataset


100%|██████████| 250/250 [00:00<00:00, 226670.13it/s]

The accuracy on this subset = 95.6
Applying transformation:
Here is the performance of the model on the transformed set





The accuracy on this subset = 89.2


{'accuracy': 95.6,
 'dataset_name': 'imdb',
 'model_name': 'aychang/roberta-base-imdb',
 'no_of_examples': 250,
 'pt_accuracy': 89.2,
 'split': 'test[:1%]'}

### 🕺 Voila! The accuracy on the perturbed set has fallen by 6.4 percentage points (95.6 → 89.2) with this simple transformation!

So what happened internally? Based on the transformation type ([SentenceOperation](https://github.com/GEM-benchmark/NL-Augmenter/blob/main/interfaces/SentenceOperation.py)) and the task you provided (TEXT_CLASSIFICATION), `execute_model` evaluated a pre-trained HuggingFace model. In this case, the sentiment analysis model [aychang/roberta-base-imdb](https://huggingface.co/aychang/roberta-base-imdb) was chosen and evaluated on 1% of the [IMDB dataset](https://huggingface.co/datasets/imdb) (250 examples), with and without the transformation, to check whether the sentiment is still predicted correctly.
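
The comparison that `execute_model` reports can be sketched without downloading any model. The snippet below is a toy illustration, not the library's implementation: `toy_classifier`, `accuracy`, and the four example reviews are hypothetical stand-ins, chosen only to show accuracy being computed on an original and a perturbed copy of the same test set.

```python
# Toy sketch: compare accuracy on an original and a perturbed test set.
# The classifier, the data and the helper names are all hypothetical.

def toy_classifier(text):
    """A case-sensitive keyword classifier: 1 = positive, 0 = negative."""
    return 1 if ("great" in text or "loved" in text) else 0

def uppercase(text):
    """The same perturbation as MySimpleTransformation, as a plain function."""
    return text.upper()

def accuracy(classifier, examples):
    """Percentage of (text, label) pairs that the classifier labels correctly."""
    correct = sum(1 for text, label in examples if classifier(text) == label)
    return 100.0 * correct / len(examples)

test_set = [
    ("A great film, I loved it.", 1),
    ("Dull and far too long.", 0),
    ("The cast was great.", 1),
    ("I would not watch it again.", 0),
]
# Apply the perturbation to every input while keeping the labels unchanged
perturbed_set = [(uppercase(text), label) for text, label in test_set]

report = {
    "accuracy": accuracy(toy_classifier, test_set),
    "pt_accuracy": accuracy(toy_classifier, perturbed_set),
}
print(report)  # {'accuracy': 100.0, 'pt_accuracy': 50.0}
```

The toy classifier is deliberately brittle to casing, so the drop is far larger than the one measured above for the real model.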

If you want to evaluate this on your own model and dataset, you can pass the parameters to `execute_model` as shown below. Note that we can't support every model type and dataset type, so some models and datasets might require refactoring of the `evaluation_engine` module on your side; we are happy to help. 😊

In [None]:
# Here are the different parameters which are used as defaults!
# execute_model(MySimpleTransformation, "TEXT_CLASSIFICATION", "en", model_name = "aychang/roberta-base-imdb", dataset="imdb", percentage_of_examples=1)

## A Model Based Transformation
We don't want to restrict ourselves to string-level changes! We want to do more, don't we? So, let's use a pre-trained paraphrase generator to transform question answering examples. There is an existing interface, [QuestionAnswerOperation](https://github.com/GEM-benchmark/NL-Augmenter/blob/main/interfaces/QuestionAnswerOperation.py), which takes the context, the question and the answers as inputs. Let's use that to augment our training data for question answering!

In [33]:
import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer

class MySecondTransformation(QuestionAnswerOperation):
  tasks = [TaskType.QUESTION_ANSWERING, TaskType.QUESTION_GENERATION]
  languages = ["en"]

  def __init__(self, max_outputs=5):
    super().__init__()
    model_name="prithivida/parrot_paraphraser_on_T5"
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)  
    self.model = T5ForConditionalGeneration.from_pretrained(model_name)
    self.max_outputs = max_outputs

  def generate(self, context, question, answers): # Note that the choice of inputs for 'generate' is consistent with those in QuestionAnswerOperation
    
    # Let's call the HF model to generate a paraphrase for the question
    paraphrase_input = question
    batch = self.tokenizer([paraphrase_input],truncation=True,padding='longest',max_length=60, return_tensors="pt")
    translated = self.model.generate(**batch,max_length=60,num_beams=10, num_return_sequences=self.max_outputs, temperature=1.5)
    paraphrased_questions = self.tokenizer.batch_decode(translated, skip_special_tokens=True) 

    # context = "Apply your own logic here"
    # answers = "And here too :)"

    # return the list of new question-answering examples
    return [(context, paraphrase, answers) for paraphrase in paraphrased_questions]

In [None]:
t4 = MySecondTransformation()


In [None]:
t4.generate(context="Mumbai, Bengaluru, New Delhi are among the many famous places in India.", 
            question="What are the famous places we should not miss in India?", 
            answers=["Mumbai", "Bengaluru", "Delhi", "New Delhi"])

[('Mumbai, Bengaluru, New Delhi are among the many famous places in India.',
  'recommend some of the best places to visit in India?',
  ['Mumbai', 'Bengaluru', 'Delhi', 'New Delhi']),
 ('Mumbai, Bengaluru, New Delhi are among the many famous places in India.',
  'can you list the best places to visit in India?',
  ['Mumbai', 'Bengaluru', 'Delhi', 'New Delhi']),
 ('Mumbai, Bengaluru, New Delhi are among the many famous places in India.',
  'can you list the top 10 places to visit in India?',
  ['Mumbai', 'Bengaluru', 'Delhi', 'New Delhi'])]

Voila! You have now created new training examples for question answering and question generation! 🎉 🎊 🎉
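
The shape of the returned examples is easier to see without the model. In the sketch below, `toy_paraphrases` is a hypothetical template-based stand-in for the T5 paraphraser, and `augment_qa` is an illustrative helper rather than part of NL-Augmenter; only the `(context, question, answers)` tuple layout is taken from the cell above.

```python
# Hypothetical stand-in for the paraphrase model: wraps the question in fixed templates.
def toy_paraphrases(question, max_outputs=2):
    templates = ["Could you tell me: {q}", "I would like to know: {q}"]
    q = question[0].lower() + question[1:]  # lower-case the first letter
    return [template.format(q=q) for template in templates[:max_outputs]]

def augment_qa(context, question, answers, max_outputs=2):
    """Pair every paraphrased question with the unchanged context and answers."""
    return [(context, paraphrase, answers)
            for paraphrase in toy_paraphrases(question, max_outputs)]

examples = augment_qa(
    context="Mumbai, Bengaluru, New Delhi are among the many famous places in India.",
    question="What are the famous places we should not miss in India?",
    answers=["Mumbai", "Bengaluru", "Delhi", "New Delhi"],
)
for context, question, answers in examples:
    print(question)
```

Because only the question changes, the answers stay valid for every generated example, which is what makes paraphrasing a safe augmentation for this task.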

# Now you are all ready to contribute a transformation to [NL-Augmenter 🦎 → 🐍](https://github.com/GEM-benchmark/NL-Augmenter)!

## What is this deal with filters?
So, just the way transformations can transform examples of text, filters can identify whether an example follows some pattern of text! The only difference is that while transformations return another example of the same input format, filters return True or False!

sentence --> SentenceOperation.**generate**(sentence) --> List of perturbed sentences

sentence --> SentenceOperation.**filter**(sentence)  --> True/False
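
The two signatures can be put side by side in a few lines. `ToyShoutOperation` below is a hypothetical, dependency-free class written only for illustration; the real `SentenceOperation` interface carries more machinery than this.

```python
# Hypothetical stand-in class, not NL-Augmenter's SentenceOperation.
class ToyShoutOperation:
    def generate(self, sentence):
        """Transformation: return a list of new sentences."""
        return [sentence.upper()]

    def filter(self, sentence):
        """Filter: return True if the sentence is already all upper case."""
        return sentence.isupper()

op = ToyShoutOperation()
print(op.generate("hello there"))  # ['HELLO THERE']
print(op.filter("hello there"))    # False
print(op.filter("HELLO THERE"))    # True
```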

# So, let's play with some existing filters!


In [None]:
from filters.keywords import TextContainsKeywordsFilter
from filters.length import TextLengthFilter, SentenceAndTargetLengthFilter

The `TextLengthFilter` accepts an input sentence if the length of the input sentence is within the initialised range. Let's initialise this filter to accept all sentences with length greater than 10 tokens!

In [None]:
f1 = TextLengthFilter(">", 10)

In [None]:
f1.filter("This sentence is long enough to pass while you think of implementing your own filter!")

True

In [None]:
f1.filter("This one's too short!")

False
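
For intuition, a length filter of this kind can be sketched in a few lines. This is an illustration under stated assumptions rather than the library's code: `ToyLengthFilter` and the `OPERATORS` table are hypothetical, and the sketch counts whitespace-separated tokens, which may differ from how `TextLengthFilter` tokenizes.

```python
import operator

# Map comparison strings to functions; the supported set is an assumption of this sketch.
OPERATORS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
             "<=": operator.le, "==": operator.eq}

class ToyLengthFilter:
    def __init__(self, op, threshold):
        self.compare = OPERATORS[op]
        self.threshold = threshold

    def filter(self, sentence):
        # Whitespace tokenization is a simplification
        return self.compare(len(sentence.split()), self.threshold)

toy_f1 = ToyLengthFilter(">", 10)
long_result = toy_f1.filter("This sentence is long enough to pass while you think of implementing your own filter!")
short_result = toy_f1.filter("This one's too short!")
print(long_result, short_result)  # True False
```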

Let's say you have a lot of paraphrasing data and you intend to train a paraphrase generator to convert longer sentences to shorter ones! Check how the `SentenceAndTargetLengthFilter` can be used for this!


In [None]:
f2 = SentenceAndTargetLengthFilter([">", "<"], [10,8])

In [None]:
f2.filter("That show is going to take place in front of immensely massive crowds.", 
          "Large crowds would attend the show.")

True

In [None]:
f2.filter("The film was nominated for the Academy Award for Best Art Direction.", 
          "The movie was a nominee for the Academy Award for Best Art Direction.")

False
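
Applied to a whole dataset, such a pair filter becomes a one-line selection step. The sketch below assumes whitespace tokenization and uses a hypothetical `keep_pair` helper in place of `SentenceAndTargetLengthFilter`; the two sentence pairs are the ones from the cells above.

```python
import operator

# Supported comparison strings (an assumption made for this sketch).
OPS = {">": operator.gt, "<": operator.lt}

def keep_pair(source, target, ops=(">", "<"), thresholds=(10, 8)):
    """True when the source and target lengths each satisfy their own condition."""
    lengths = (len(source.split()), len(target.split()))  # whitespace tokens
    return all(OPS[op](n, t) for op, n, t in zip(ops, lengths, thresholds))

pairs = [
    ("That show is going to take place in front of immensely massive crowds.",
     "Large crowds would attend the show."),
    ("The film was nominated for the Academy Award for Best Art Direction.",
     "The movie was a nominee for the Academy Award for Best Art Direction."),
]

# Keep only the pairs suitable for training a long-to-short paraphraser
compression_pairs = [pair for pair in pairs if keep_pair(*pair)]
print(len(compression_pairs))  # 1
```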

Okay, now that you've said to yourself that these filters are too basic, let's try to make a simple and interesting one! 

Let's define a filter which selects question-answer pairs which share a low lexical overlap between the question and the context!

In [None]:
import spacy

class LowLexicalOverlapFilter(QuestionAnswerOperation):
  tasks = [TaskType.QUESTION_ANSWERING, TaskType.QUESTION_GENERATION]
  languages = ["en"]
  
  def __init__(self, threshold=3):
    super().__init__()
    self.nlp = spacy.load("en_core_web_sm")
    self.threshold = threshold

  def filter(self, context, question, answers): 
    # Note that the only difference between a filter and a transformation is this method! 
    # The inputs remain the same!
    
    question_tokenized = self.nlp(question, disable=["parser", "tagger", "ner"])
    context_tokenized = self.nlp(context, disable=["parser", "tagger", "ner"])
    
    q_tokens = set([t.text for t in question_tokenized])
    c_tokens = set([t.text for t in context_tokenized])
    
    # Accept the example only when the question and the context share few tokens
    low_lexical_overlap = len(q_tokens.intersection(c_tokens)) <= self.threshold
    return low_lexical_overlap

In [None]:
f3 = LowLexicalOverlapFilter()

In [None]:
f3.filter("New York, is the most populous city in the United States.",
          "Which is the most populous city of the United States?",
          ["New York"])

False

In [None]:
f3.filter("New York, is the most populous city in the United States.",
          "Which city has the largest population in the US?",
          ["New York"])

True

That's it! You have created a new filter which can separate the hard examples from the easy ones! 🎉 🎊 🎉
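
To see the quantity this filter is built on, the overlap itself can be counted without spaCy. The sketch below is a rough approximation: `tokenize` is a hypothetical regex-based stand-in for the spaCy tokenizer, so counts on other inputs may differ from what the filter sees.

```python
import re

def tokenize(text):
    """Split into word and punctuation tokens (a rough stand-in for spaCy)."""
    return re.findall(r"\w+|[^\w\s]", text)

def shared_tokens(context, question):
    """Set of distinct tokens that appear in both the context and the question."""
    return set(tokenize(context)) & set(tokenize(question))

context = "New York, is the most populous city in the United States."
easy_question = "Which is the most populous city of the United States?"
hard_question = "Which city has the largest population in the US?"

print(len(shared_tokens(context, easy_question)))  # 7
print(len(shared_tokens(context, hard_question)))  # 3
```

The first question copies most of its wording from the context, while the second shares only `city`, `the` and `in` with it, which is why the second is the harder example.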

# Now go ahead and contribute a nice filter to [NL-Augmenter 🦎 → 🐍](https://github.com/GEM-benchmark/NL-Augmenter)!