First, the required libraries are installed.

In [104]:
!pip install nlpaug



Then all the needed dependencies are imported.

In [105]:
import nlpaug.augmenter.word as naw
import pandas as pd
import random
from random import sample
import torch
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import warnings
warnings.filterwarnings("ignore")
import logging
logging.getLogger("transformers").setLevel(logging.ERROR)

We import our Italian dataset and extract a random 10% sample, which is used to choose the best hyper-parameter configuration for the Italian model.

In [106]:
root = "https://raw.githubusercontent.com/alfcan/CADOCS_NLU_Model/dev/dataset/italian_dataset.csv"

df = pd.read_csv(root, sep = ';')
label_mapping = {'get_smells': 0, 'get_smells_date': 1, 'report': 2, 'info': 3}
df['intent'] = df['intent'].map(label_mapping)

random.seed(28)
torch.manual_seed(28)

n_samples = int(len(df) * 0.1)
selected_rows = sample(range(len(df)), n_samples)
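
As a minimal sketch of why the seed is fixed before sampling: reseeding Python's `random` module with the same value reproduces the same row positions. The population size of 200 is a made-up stand-in for `len(df)`, and `pick_rows` is a hypothetical helper that is not part of the notebook.

```python
import random


def pick_rows(n_total, fraction, seed):
    """Return a reproducible sample of row positions (hypothetical helper)."""
    random.seed(seed)                      # same seed -> same selection
    n_samples = int(n_total * fraction)    # truncates, like int(len(df) * 0.1)
    return random.sample(range(n_total), n_samples)


first = pick_rows(200, 0.1, seed=28)
second = pick_rows(200, 0.1, seed=28)

print(len(first))          # 20 distinct positions
print(first == second)     # True: the selection is reproducible
```

Note that `random.sample` draws without replacement, so no row is selected twice.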

We use the NLPAug library (https://github.com/makcedward/nlpaug - https://nlpaug.readthedocs.io/en/latest/ - Edward Ma - makcedward - MIT License) to inject spelling errors into randomly selected requests, simulating what real users might type, and concatenate the results with the other dataframe to create two final datasets.

In [107]:
spellingAugmenter = naw.SpellingAug(
    dict_path=None,
    name='Spelling_Aug',
    aug_min=1,
    aug_max=3,
    stopwords=["LINK", "MM/DD/YYYY"],  # tokens the augmenter should skip
    tokenizer=None,
    reverse_tokenizer=None,
    include_reverse=True,
    stopwords_regex=None,
    verbose=0,
)

We iterate over the sample to create paraphrases of the requests.

In [108]:
rows = []
for index, row in df.iloc[selected_rows].iterrows():
    print("Original: ", row.request)
    # augment() returns a list of augmented strings here
    paraphrasedRequests = spellingAugmenter.augment(row.request)
    for replaced_request in paraphrasedRequests:
        print("Paraphrased: ", replaced_request)
        # Keep only variants that differ from the original (ignoring case and surrounding whitespace)
        if row.request.strip().lower() != replaced_request.strip().lower():
            rows.append({"original_request": row.request, "paraphrased_request": replaced_request, "intent": row.intent})

Original:  dimmi quali Community smells sono presenti in questa repository Link dal 10/05/2022
Paraphrased:  dimmi quali Community smels sono presenti tn questa repository Link dal 1 / 05 / 2022
Original:  Che tipo di community smells riesci a rilevare?
Paraphrased:  Che tipo do community smalls riesci g rilevare?
Original:  Da dopo il 10/05/2022 quali community smells sono presenti nella repository LINK?
Paraphrased:  Da dopo in 10pm / 05 / 2022 quali community smells sono presenti nella repository LINK?
Original:  Ciao, quali sono i community smells in data 12/12/20 del progetto LINK? Grazie in anticipo.
Paraphrased:  Ciao, quali sono i comunity smells in date 12th / 12 / 20 del progetto LINK? Grazie in anticipo.
Original:  Mostrami l'ultima esecuzione.
Paraphrased:  Mostrami L ' ultima esecuzione.
Original:  Ciao CADOCS, potresti dirmi quali community smells sono presenti nel repository LINK a partire dal 21/03/2019?
Paraphrased:  Ciao CADOCS, potresti dirmi quali communit smalls so
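
The filter applied inside the loop can be sketched in isolation as follows. The candidate strings are hand-written stand-ins rather than real NLPAug output, and `keep_changed` is a hypothetical helper used only for illustration.

```python
def keep_changed(original, candidates):
    """Keep only candidates that differ from the original after normalisation."""
    baseline = original.strip().lower()
    return [c for c in candidates if c.strip().lower() != baseline]


original = "Mostrami l'ultimo report"
# Hand-written stand-ins for augmenter output, not real nlpaug results.
candidates = [
    "Mostrami l'ultimo rapor",       # a real change: kept
    "mostrami l'ultimo report ",     # only case/whitespace differs: dropped
]

print(keep_changed(original, candidates))
```

Normalising both sides before comparing avoids storing a "paraphrase" that differs from the original only in capitalisation or trailing spaces.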

We create a dataframe from all the created requests and export it.

In [109]:
# Build the dataframe directly from the collected rows, with a fixed column order
dfParaphrased = pd.DataFrame(rows, columns=["original_request", "paraphrased_request", "intent"])

dfParaphrased["intent"] = dfParaphrased["intent"].map({0:"get_smells", 1:"get_smells_date", 2:"report", 3:"info"})
dfParaphrased.to_csv("paraphrased_italian_dataset.csv", index=False, sep=";")
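
The numeric-to-name dictionary used before the export mirrors `label_mapping` from the loading cell. One possible way to keep the two in sync, shown here as a sketch rather than as the notebook's own approach, is to derive the inverse instead of typing it out; `inverse_mapping` is a name introduced only for this example.

```python
label_mapping = {'get_smells': 0, 'get_smells_date': 1, 'report': 2, 'info': 3}

# Invert name -> id into id -> name so the two dictionaries cannot drift apart.
inverse_mapping = {idx: name for name, idx in label_mapping.items()}

print(inverse_mapping[2])
```

This works because every intent maps to a distinct integer, so the inversion loses no entries.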

The entire process is repeated using a different random seed to create a different set of requests, which will then be used for input testing.

In [137]:
random.seed(121)
torch.manual_seed(121)

n_samples = int(len(df) * 0.1)
selected_rows = sample(range(len(df)), n_samples)

In [138]:
rows = []
for index, row in df.iloc[selected_rows].iterrows():
    print("Original: ", row.request)
    # augment() returns a list of augmented strings here
    paraphrasedRequests = spellingAugmenter.augment(row.request)
    for replaced_request in paraphrasedRequests:
        print("Paraphrased: ", replaced_request)
        # Same filter as in the first run: ignore case and surrounding whitespace
        if row.request.strip().lower() != replaced_request.strip().lower():
            rows.append({"original_request": row.request, "paraphrased_request": replaced_request, "intent": row.intent})

Original:  Quali community smells ci sono in questa repository?
Paraphrased:  Quali comunity smels ci sono ir questa repository?
Original:  Mostrami l'ultimo report
Paraphrased:  Mostrami La ' ultimo rapor
Original:  CADOCS, mostrami i risultati della tua ultima esecuzione.
Paraphrased:  CADOCS, mostrami I’ve risultati della tua ultima esecuzione.
Original:  Salve, richiedo un report relativo al vostro ultimo compito in esecuzione. Grazie in anticipo.
Paraphrased:  Salve, richiedo an report relativo AL vostro ultimo compito in esecuzione. Grazie i anticipo.
Original:  Quali community smells sei in grado di indentificare?
Paraphrased:  Quali communit smell's sei it grado di indentificare?
Original:  Hey CADOCS, quali tipi di community smells riesci a rilevare?
Paraphrased:  Hi CADOCS, quali tipi would community smell's riesci a rilevare?
Original:  Quali tipi di community smells puoi rilevare?
Paraphrased:  Quali tipi did communit smell's puoi rilevare?
Original:  Mostrami i community s

In [139]:
# Build the dataframe directly from the collected rows, with a fixed column order
dfParaphrased_input_testing = pd.DataFrame(rows, columns=["original_request", "paraphrased_request", "intent"])

dfParaphrased_input_testing["intent"] = dfParaphrased_input_testing["intent"].map({0:"get_smells", 1:"get_smells_date", 2:"report", 3:"info"})
dfParaphrased_input_testing.to_csv("paraphrased_italian_dataset_input_testing.csv", index=False, sep=";")

At the end of these processes, some rows were removed from the datasets because the paraphrases were too similar to the original phrases or were unrealistic.
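
The notebook does not say how the overly similar rows were identified. As one possible aid, and purely as an assumption, a character-level similarity score from the standard library's `difflib` could surface candidates for manual review. The `THRESHOLD` value and the `similarity` helper are hypothetical; the example pairs are taken from the printed output above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity ratio in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


THRESHOLD = 0.98  # hypothetical cut-off, not a value from the notebook

pairs = [
    ("Mostrami l'ultima esecuzione.", "Mostrami L ' ultima esecuzione."),
    ("Quali community smells ci sono in questa repository?",
     "Quali comunity smels ci sono ir questa repository?"),
]

for original, paraphrased in pairs:
    score = similarity(original, paraphrased)
    flag = "review" if score >= THRESHOLD else "keep"
    print(f"{score:.2f} {flag}: {paraphrased}")
```

A score like this only measures surface similarity; judging whether a paraphrase is unrealistic would still need a human reader.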