**English Dataset Augmentation**

In this notebook, we increase the number of entries in the English dataset used by CADOCS, using augmentation functions provided by the NLPAug and TextAttack libraries.

First, we install the required libraries.

In [None]:
!pip install -q transformers datasets
!pip install nlpaug
!pip install sacremoses
!pip install textattack


Then, we import the dependencies.

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.model_selection import train_test_split
from textattack.transformations.sentence_transformations import BackTranslation
from textattack.transformations.word_insertions.word_insertion_masked_lm import WordInsertionMaskedLM
from textattack.transformations.word_insertions.word_insertion_random_synonym import WordInsertionRandomSynonym
from textattack.transformations.word_swaps.word_swap_masked_lm import WordSwapMaskedLM
from textattack.augmentation.recipes import EasyDataAugmenter, BackTranslationAugmenter
from textattack.transformations import WordSwapRandomCharacterDeletion, WordSwapQWERTY, CompositeTransformation
from textattack.constraints.pre_transformation import RepeatModification, StopwordModification
from textattack.augmentation import Augmenter
import pandas as pd
import numpy as np
from sacremoses import MosesTokenizer, MosesDetokenizer

import urllib.request
from tabulate import tabulate
from tqdm import trange
import random
import re
import csv
import io

import warnings
warnings.filterwarnings(action='once')

import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from google.colab import drive

from six.moves import urllib

We import the dataset and print it.

In [None]:
root="https://raw.githubusercontent.com/alfcan/CADOCS_NLU_Model/main/dataset.csv"

df = pd.read_csv(root, sep = ';', names=["request", "intent"])
label_mapping = {'get_smells': 0, 'get_smells_date': 1, 'report': 2, 'info': 3}
df['intent'] = df['intent'].map(label_mapping)

df.head()



                                             request intent
0  Hello CADOCS, can you tell me which community ...      0
1  Hey CADOCS, tell which community smells are pr...      0
2   CADOCS can you tell which community smells ar...      0
3  I would like to know what are the community sm...      0
4  What are the community smells in the LINK proj...      0
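
The label mapping used above can be sanity-checked on its own. The following is a minimal sketch in plain Python (no pandas), where `encode_intents` and `ID_TO_LABEL` are hypothetical names introduced only for illustration. It raises on an unknown label, whereas `Series.map` with a dictionary would turn such a label into NaN.

```python
# Label mapping copied from the cell above.
LABEL_TO_ID = {"get_smells": 0, "get_smells_date": 1, "report": 2, "info": 3}
# Inverse mapping (hypothetical name), useful to turn ids back into labels.
ID_TO_LABEL = {idx: label for label, idx in LABEL_TO_ID.items()}


def encode_intents(labels):
    """Map intent names to ids, raising on any label missing from the mapping."""
    unknown = sorted(set(labels) - set(LABEL_TO_ID))
    if unknown:
        raise ValueError(f"Unknown intent labels: {unknown}")
    return [LABEL_TO_ID[label] for label in labels]


encoded = encode_intents(["get_smells", "info", "report"])
print(encoded)
print([ID_TO_LABEL[i] for i in encoded])
```

The inverse mapping is the same one that is needed later, when the numeric intents are converted back to their names before the final export.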


We create lists of words that the augmentation functions must not modify.

In [None]:
request = df.request.values
intent = df.intent.values

rows = []
stopwords_list=["community", "smells", "smell", "CADOCS","CAD", "LINK", "community smells", "repository"]
stopwordsContextual_list=["community", "smells", "smell", "community smells", "repository"]
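
Whether an augmented request still contains the protected domain terms can also be verified after the fact. The sketch below is an independent check written in plain Python; it does not reproduce how TextAttack's `StopwordModification` works internally. The helper names and the two augmented sentences (`good`, `bad`) are made up for illustration.

```python
import re

# Domain terms that augmentation must leave untouched (from the cell above).
STOPWORDS = ["community", "smells", "smell", "CADOCS", "CAD", "LINK",
             "community smells", "repository"]


def protected_terms(text, terms=STOPWORDS):
    """Return the protected terms that occur in `text` as whole words."""
    found = set()
    for term in terms:
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.add(term)
    return found


def preserves_protected_terms(original, augmented):
    """True if every protected term of the original survives in the augmented text."""
    return protected_terms(original) <= protected_terms(augmented)


original = "Hey CADOCS, tell which community smells are present in the repository LINK ?"
good = "Hey CADOCS, show which community smells are present in the repository LINK ?"  # illustrative
bad = "Hey CADOCS, tell which community issues are present in the repository LINK ?"  # illustrative

print(preserves_protected_terms(original, good))
print(preserves_protected_terms(original, bad))
```

The comparison is case-sensitive on purpose, so a lowercased "cadocs" or "link" would be reported as a lost term.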



We create the dataframe that will hold the augmented dataset and insert all the original requests into it.

In [None]:
dfAug = pd.DataFrame(columns=["request", "intent"])
dfAug = pd.concat([dfAug, df], ignore_index=True)
print(dfAug)

                                               request intent
0    Hello CADOCS, can you tell me which community ...      0
1    Hey CADOCS, tell which community smells are pr...      0
2     CADOCS can you tell which community smells ar...      0
3    I would like to know what are the community sm...      0
4    What are the community smells in the LINK proj...      0
..                                                 ...    ...
139  Hey CADOCS, what kind of community smells can ...      3
140  I wanna know more about community smells, it s...      3
141  hey CADOCS, can you show me some informations ...      3
142  Could you tell me more about community smells ...      3
143   CADOCS, What community smells have you detected?      3

[144 rows x 2 columns]


Several augmentation functions were evaluated, but most of them did not produce acceptable results. In particular, we tried various word-level augmenters from the NLPAug library (nlpaug.augmenter.word.context_word_embs, nlpaug.augmenter.word.random and nlpaug.augmenter.word.synonym) as well as WordInsertionRandomSynonym from TextAttack. Most of the sentences they produced were not acceptable and would have been discarded during request selection.

The functions that generated the best requests are the following.

WordSwapMaskedLM is a TextAttack transformation that replaces words in a sentence with candidates proposed by a masked language model, in this case 'bert-base-uncased' with the 'bae' method.
After each augmentation function we remove the duplicates to perform an initial cleaning of the dataset.

We set the "min_confidence" parameter to 0.8 in order to keep only the most realistic requests. As a consequence, the function often cannot produce a request that differs from the original one, so new requests are obtained only for some of the sentences in the original dataset. The other parameters were chosen to balance the quality and quantity of the results against execution time.

In [None]:
transformationWordSwap = WordSwapMaskedLM(method='bae', masked_language_model='bert-base-uncased', tokenizer=None, max_length=512, max_candidates=3, min_confidence=0.8, batch_size=3)
constraintsWordSwap = [RepeatModification(), StopwordModification(stopwords=stopwords_list)]
wordSwapAugmenter = Augmenter(
    transformation=transformationWordSwap,
    constraints=constraintsWordSwap,
    transformations_per_example=1
)
wordSwapAugmenter.fast_augment = True
wordSwapAugmenter.high_yield = True

for index, row in df.iterrows():
  print("Original: ", row.request)
  augmentedRequests=wordSwapAugmenter.augment(row.request)
  print("Augmented: ",augmentedRequests)
  for i in augmentedRequests:
    if row.request.lower() != i.lower():
      rows.append({"request":i, "intent":row.intent})

dfAug = pd.concat([dfAug, pd.DataFrame(rows)], ignore_index=True)
dfAug = dfAug.loc[dfAug.astype(str).drop_duplicates().index]
print(dfAug)
rows = []

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Original:  Hello CADOCS, can you tell me which community smells are present in the repository LINK please? 
Augmented:  ['Hello CADOCS, can you tell me what community smells are present in the repository LINK please? ', 'Hello CADOCS, can you tell me which community smells are present in the repository LINK please? ']
Original:  Hey CADOCS, tell which community smells are present in the repository LINK ?
Augmented:  ['Hey CADOCS, tell which community smells are present in the repository LINK ?']
Original:   CADOCS can you tell which community smells are present in the LINK ?
Augmented:  [' CADOCS can you tell which community smells are present in the LINK ?']
Original:  I would like to know what are the community smells in this LINK
Augmented:  ['I would like to know what are the community smells in this LINK', 'i would like to know what are the community smells in this LINK']
Original:  What are the community smells in the LINK project?"
Augmented:  ['What are the community smells in 
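
The filtering step inside the loop can be isolated as a small function. This is a minimal sketch in plain Python that mirrors the `row.request.lower() != i.lower()` test; the name `new_variants` is hypothetical. The sample strings are taken from the output above.

```python
def new_variants(original, candidates):
    """Keep only the candidates that differ from the original beyond letter case."""
    return [c for c in candidates if c.lower() != original.lower()]


# First request: one candidate swaps "which" for "what", the other is unchanged.
original_1 = "Hello CADOCS, can you tell me which community smells are present in the repository LINK please? "
candidates_1 = [
    "Hello CADOCS, can you tell me what community smells are present in the repository LINK please? ",
    "Hello CADOCS, can you tell me which community smells are present in the repository LINK please? ",
]

# Fourth request: the only "new" candidate differs just in the case of the first letter.
original_2 = "I would like to know what are the community smells in this LINK"
candidates_2 = [
    "I would like to know what are the community smells in this LINK",
    "i would like to know what are the community smells in this LINK",
]

print(new_variants(original_1, candidates_1))
print(new_variants(original_2, candidates_2))
```

Comparing lowercased strings matters here because the augmenter can return a copy of the request that differs only in letter case, as the second example shows; such a copy adds nothing to the dataset.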

The WordInsertionMaskedLM class of the TextAttack library generates new phrases by inserting words suggested by a masked language model, which is again "bert-base-uncased". We apply the transformation only to the original requests, so that the new requests remain meaningful and realistic. For the same reason, the "min_confidence" parameter is set to 0.8 here as well.

Like WordSwapMaskedLM, with this parameter configuration the function produces new phrases only for a subset of the requests in the original dataset.

In [None]:
transformationWordInsertion = WordInsertionMaskedLM(masked_language_model='bert-base-uncased', tokenizer=None, max_length=512, max_candidates=3, min_confidence=0.8, batch_size=3)
constraintsWordInsertion = [RepeatModification(), StopwordModification(stopwords=stopwords_list)]
wordInsertionAugmenter = Augmenter(
    transformation=transformationWordInsertion,
    constraints=constraintsWordInsertion,
    transformations_per_example=1
)
wordInsertionAugmenter.fast_augment = True
wordInsertionAugmenter.high_yield = True

for index, row in df.iterrows():
  print("Original: ", row.request)
  augmentedRequests=wordInsertionAugmenter.augment(row.request)
  print("Augmented: ",augmentedRequests)
  for i in augmentedRequests:
    if row.request.lower() != i.lower():
      rows.append({"request":i, "intent":row.intent})

dfAug = pd.concat([dfAug, pd.DataFrame(rows)], ignore_index=True)
dfAug = dfAug.loc[dfAug.astype(str).drop_duplicates().index]
print(dfAug)
rows = []



Original:  Hello CADOCS, can you tell me which community smells are present in the repository LINK please? 
Augmented:  ['Hello CADOCS, can you please tell me which community smells are present in the repository LINK please? ', 'Hello CADOCS, can you tell me which community smells are present in the repository LINK please? ', 'Hello CADOCS, can you tell me which community smells that are present in the repository LINK please? ']
Original:  Hey CADOCS, tell which community smells are present in the repository LINK ?
Augmented:  ['Hey CADOCS, tell which community smells are present in the repository LINK ?']
Original:   CADOCS can you tell which community smells are present in the LINK ?
Augmented:  [' CADOCS can help you tell which community smells are present in the LINK ?', ' CADOCS can you tell which community smells are present in the LINK ?', ' CADOCS can you tell which community smells that are present in the LINK ?']
Original:  I would like to know what are the community smells i
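
To quantify how many original requests actually gain a new variant, the augmenter results can be summarised with a simple ratio. The sketch below is plain Python; `augmentation_yield` is a hypothetical helper, and the three entries are the first three requests and candidate lists shown in the output above.

```python
def augmentation_yield(results):
    """Fraction of original requests that received at least one new variant.

    `results` maps each original request to the list returned by the augmenter.
    """
    if not results:
        return 0.0
    productive = sum(
        1
        for original, candidates in results.items()
        if any(c.lower() != original.lower() for c in candidates)
    )
    return productive / len(results)


results = {
    "Hello CADOCS, can you tell me which community smells are present in the repository LINK please? ": [
        "Hello CADOCS, can you please tell me which community smells are present in the repository LINK please? ",
        "Hello CADOCS, can you tell me which community smells are present in the repository LINK please? ",
        "Hello CADOCS, can you tell me which community smells that are present in the repository LINK please? ",
    ],
    "Hey CADOCS, tell which community smells are present in the repository LINK ?": [
        "Hey CADOCS, tell which community smells are present in the repository LINK ?",
    ],
    " CADOCS can you tell which community smells are present in the LINK ?": [
        " CADOCS can help you tell which community smells are present in the LINK ?",
        " CADOCS can you tell which community smells are present in the LINK ?",
        " CADOCS can you tell which community smells that are present in the LINK ?",
    ],
}

print(f"{augmentation_yield(results):.2f}")
```

On these three requests two out of three receive new variants; the figure for the full dataset is not reported here and would have to be computed from the complete output.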


SpellingAug, from the NLPAug library, introduces realistic spelling errors into the requests in order to make the model to be built more robust. We apply it not only to the original requests but also to the new requests created by the previous augmentation functions, in order to obtain more results. We set both the "aug_min" and "aug_max" parameters to 1, limiting the modification to a single word per request so that the requests stay realistic.

In [None]:
spellingAugmenter = naw.SpellingAug(dict_path=None, name='Spelling_Aug', aug_min=1, aug_max=1, stopwords=["LINK", "MM/DD/YYYY"], tokenizer=None, reverse_tokenizer=None, include_reverse=True, stopwords_regex=None, verbose=0)

for index, row in dfAug.iterrows():
  print("Original: ", row.request)
  augmentedRequests=spellingAugmenter.augment(row.request)
  print("Augmented: ",augmentedRequests[0])
  rows.append({"request":augmentedRequests[0], "intent":row.intent})

dfAug = pd.concat([dfAug, pd.DataFrame(rows)], ignore_index=True)
dfAug = dfAug.loc[dfAug.astype(str).drop_duplicates().index]



Original:  Hello CADOCS, can you tell me which community smells are present in the repository LINK please? 
Augmented:  Hello CADOCS, can you tell me which community smells are present ne the repository LINK please?
Original:  Hey CADOCS, tell which community smells are present in the repository LINK ?
Augmented:  Hey CADOCS, tell which community smells are present in thel repository LINK?
Original:   CADOCS can you tell which community smells are present in the LINK ?
Augmented:  CADOCS can you tell which community smells are présent in the LINK?
Original:  I would like to know what are the community smells in this LINK
Augmented:  I would like to kmow what are the community smells in this LINK
Original:  What are the community smells in the LINK project?"
Augmented:  What is the community smells in the LINK project? "
Original:  Are there any community smell here, LINK?
Augmented:  Are there any community smell [[Hier, LINK?
Original:  Search community smells in the repository LINK
A
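
The effect of limiting the change to one word per request can be illustrated with a toy perturbation. The sketch below is not NLPAug's algorithm: it swaps two adjacent characters in one randomly chosen word, a deliberately simpler technique used only to show the "exactly one word changes, protected words never do" behaviour. The function name is hypothetical.

```python
import random

PROTECTED = {"LINK", "MM/DD/YYYY"}  # same stopwords passed to SpellingAug above


def perturb_one_word(sentence, rng):
    """Swap two adjacent characters in exactly one eligible word.

    Toy stand-in for a spelling augmenter; it is NOT NLPAug's algorithm.
    """
    words = sentence.split()
    # Eligible words: not protected, purely alphabetic, and containing two
    # differing neighbouring characters so that the swap changes the word.
    eligible = [
        i for i, w in enumerate(words)
        if w not in PROTECTED and w.isalpha()
        and any(a != b for a, b in zip(w, w[1:]))
    ]
    if not eligible:
        return sentence
    i = rng.choice(eligible)
    word = words[i]
    positions = [k for k in range(len(word) - 1) if word[k] != word[k + 1]]
    k = rng.choice(positions)
    words[i] = word[:k] + word[k + 1] + word[k] + word[k + 2:]
    return " ".join(words)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = "Search community smells in the repository LINK"
augmented = perturb_one_word(original, rng)
print(augmented)
```

Unlike this toy, the outputs above show that SpellingAug can also replace a word with a different one (for example "are" becoming "is"), which is one reason the results still need a manual review.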

Finally, we print the new dataset. These results still need to be reviewed so that the less realistic, and thus less meaningful, requests can be removed.

In [None]:
print(dfAug)

                                               request intent
0    Hello CADOCS, can you tell me which community ...      0
1    Hey CADOCS, tell which community smells are pr...      0
2     CADOCS can you tell which community smells ar...      0
3    I would like to know what are the community sm...      0
4    What are the community smells in the LINK proj...      0
..                                                 ...    ...
575  hey CADOCS, can you please show me somer infor...      3
576  hey CADOCS, can you show me some informations ...      3
577  hey CADOCS, can you show me some informations ...      3
578  Could en please tell me more about community s...      3
579  Could you tell me some more about comunity sme...      3

[580 rows x 2 columns]




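We remove any remaining duplicates from the augmented dataset. Since duplicates were already dropped after each augmentation step, the row count stays at 580.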

In [None]:
dfAug = dfAug.loc[dfAug.astype(str).drop_duplicates().index]
print(dfAug)

                                               request intent
0    Hello CADOCS, can you tell me which community ...      0
1    Hey CADOCS, tell which community smells are pr...      0
2     CADOCS can you tell which community smells ar...      0
3    I would like to know what are the community sm...      0
4    What are the community smells in the LINK proj...      0
..                                                 ...    ...
575  hey CADOCS, can you please show me somer infor...      3
576  hey CADOCS, can you show me some informations ...      3
577  hey CADOCS, can you show me some informations ...      3
578  Could en please tell me more about community s...      3
579  Could you tell me some more about comunity sme...      3

[580 rows x 2 columns]
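
The deduplication rule applied by `dfAug.astype(str).drop_duplicates()` can be restated without pandas. The sketch below is a plain-Python approximation with a hypothetical function name: rows are compared as strings, the first occurrence is kept, and the comparison is case-sensitive. The sample rows are illustrative.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each (request, intent) pair.

    Values are compared as strings, so intent 0 and intent "0" count as equal.
    """
    seen = set()
    unique = []
    for request, intent in rows:
        key = (str(request), str(intent))
        if key not in seen:
            seen.add(key)
            unique.append((request, intent))
    return unique


rows = [
    ("What are the community smells in the LINK project?", 0),
    ("what are the community smells in the LINK project?", 0),    # differs in case: kept
    ("What are the community smells in the LINK project?", "0"),  # same as first row once stringified
]

for request, intent in drop_duplicate_rows(rows):
    print(repr(request), intent)
```

Converting to strings first makes rows whose intent is stored as an integer in one place and as a string in another compare equal, which can happen when dataframes built in different ways are concatenated.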




We export the dataset in CSV format.

In [None]:
dfAug.to_csv("augmented_dataset.csv", index=False)



Finally, some requests were removed from the new dataset because they were not considered realistic; the requests from the original dataset were left unchanged.

After these steps, the augmented dataset had to be re-imported in order to change its separator without altering the requests, and to perform a data-cleaning step that removes or replaces some characters (", ’, é, í).

In [15]:
rootAug="https://raw.githubusercontent.com/alfcan/CADOCS_NLU_Model/dev/augmented_dataset.csv"

urlDataset = urllib.request.urlopen(rootAug)
datasetAugmented = urlDataset.read().decode("utf-8")
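# Pattern matching every double quote character.
regex1="\""
# Pattern matching a comma followed by exactly one character and a newline,
# i.e. the comma that precedes the single-digit intent at the end of a line.
regex2="((,)((.)(\\n)))"

# Drop the quotes, then turn only the field-separating comma into a semicolon.
datasetAugSemiColon=re.sub(regex1, '', datasetAugmented)
datasetAugSemiColon=re.sub(regex2, r';\3', datasetAugSemiColon)
# Replace the typographic apostrophe and remove the accented characters.
datasetAugSemiColon=re.sub(r'’', r"'", datasetAugSemiColon)
datasetAugSemiColon=re.sub(r'é', r"", datasetAugSemiColon)
datasetAugSemiColon=re.sub(r'í', r"", datasetAugSemiColon)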

buffer=io.StringIO(datasetAugSemiColon)
dfAugSemiColon = pd.read_csv(filepath_or_buffer=buffer, sep=";", dtype=str, names=["request", "intent"]).iloc[1:]
dfAugSemiColon["intent"] = dfAugSemiColon["intent"].map({"0":"get_smells", "1":"get_smells_date", "2":"report", "3":"info"})
print(dfAugSemiColon)
dfAugSemiColon.to_csv("augmented_dataset.csv", sep=";", index=False)

                                               request      intent
1    Hello CADOCS, can you tell me which community ...  get_smells
2    Hey CADOCS, tell which community smells are pr...  get_smells
3     CADOCS can you tell which community smells ar...  get_smells
4    I would like to know what are the community sm...  get_smells
5    What are the community smells in the LINK proj...  get_smells
..                                                 ...         ...
570  hey CADOCS, can you please show me somer infor...        info
571  hey CADOCS, can you show me some informations ...        info
572  hey CADOCS, can you show me some informations ...        info
573  Could en please tell me more about community s...        info
574  Could you tell me some more about comunity sme...        info

[574 rows x 2 columns]
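
The separator change relies on two regular expressions, and their effect is easiest to see on a tiny input. The sketch below applies the same two patterns to a made-up three-line sample (the sample rows are illustrative, not taken from the dataset).

```python
import re

# Tiny stand-in for the exported file: comma-separated, with a quoted request
# that itself contains a comma. These rows are made up for illustration.
sample = (
    'request,intent\n'
    '"Hello CADOCS, which community smells are in LINK?",0\n'
    'Show me a report for LINK,2\n'
)

# Step 1: remove every double quote.
no_quotes = re.sub('"', '', sample)

# Step 2: replace a comma that is followed by exactly one character and a
# newline, i.e. the comma right before the single-digit intent.
converted = re.sub(r'((,)((.)(\n)))', r';\3', no_quotes)

print(converted)
```

Commas inside a request are left alone because they are not followed by a single character and a newline. The header line is not converted for the same reason, which is consistent with the cell above passing `names=[...]` and dropping the first row with `.iloc[1:]`. A final line without a trailing newline would not be converted by this pattern either.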
