# Data Augmentation

We will try several augmentation methods to artificially increase the amount of training data.

Data augmentation can be used, for instance, to increase the number of texts containing counter-claims and rebuttals, since they are not well represented in the original dataset.

This notebook also creates a new CSV training file and raw text files for the augmented data, so that they can be used directly in the subsequent training process.

We will use the nlpaug library to perform data augmentation.

In [None]:
!pip install nlpaug

In [None]:
import os
import pandas as pd
import numpy as np
from transformers import *
from tqdm.auto import tqdm
tqdm.pandas()
import nlpaug.augmenter.word as naw

# Load Data

In [None]:
data_df = pd.read_csv('../input/feedback-prize-2021/train.csv')
data_df.head()

In [None]:
# List of all essay ids
all_id = data_df.id.unique()

# Augmentation using Nlpaug

## Synonym Augmentation

This method simply replaces some of the words in the original text with their synonyms.

Here, the proportion of words to augment is set to 10%, with at most 15 words augmented per text.

In [None]:
syn_aug = naw.SynonymAug(aug_src = 'wordnet', aug_max = 15, aug_p = 0.1)
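
To make the interaction between `aug_p` and `aug_max` concrete, here is a minimal sketch of how the number of replaced words could be derived. It assumes the count is the ceiling of `aug_p` times the number of tokens, capped at `aug_max`. `estimate_aug_count` is a hypothetical helper, not part of nlpaug, and the library's exact rule may differ between versions.

```python
import math

def estimate_aug_count(n_tokens, aug_p=0.1, aug_max=15):
    """Hypothetical helper: rough number of words that get replaced.

    Assumption: count = ceil(aug_p * n_tokens), capped at aug_max.
    """
    count = math.ceil(aug_p * n_tokens)
    return min(count, aug_max)

# Short texts are limited by aug_p, long texts by aug_max
for n_tokens in (5, 20, 400):
    print(n_tokens, estimate_aug_count(n_tokens))
```

Under this assumption, the 15-word cap only starts to matter for discourses longer than about 150 words.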

## Contextual Embedding

Contextual embedding augmentation uses NLP models (here, transformers) to understand the context of the input text and to replace or add words while keeping that context. As a result, the new text may contain additional words or have a slightly different meaning.

For the embedding model, we use a pretrained RoBERTa model. The proportion of words to augment is 20%, and at most 10 words are replaced per text.

In [None]:
context_aug = naw.ContextualWordEmbsAug(model_path = "roberta-base", action = "substitute",
                                       aug_max = 10, device = "cuda", aug_p = 0.2)

## Back Translation

In this method, we translate the text into another language and then translate it back into the original language. This can generate text that uses different words while preserving the meaning. By default, nlpaug translates English -> German -> English.



In [None]:
back_trans_aug = naw.BackTranslationAug(max_length = 1024, device = 'cuda')

# Creating Augmented dataset

## Create data frame for augmentation

In [None]:
# Create training and validation index sets (40% and 10% of the essays)
np.random.seed(6)
train_idx = np.random.choice(np.arange(len(all_id)),int(0.4*len(all_id)),replace = False)
left_set = np.setdiff1d(np.arange(len(all_id)),train_idx)
valid_idx = np.random.choice(left_set, int(0.1*len(all_id)), replace = False)
np.random.seed(None)

In [None]:
# Training data frame
train_selected = data_df[data_df["id"].isin(all_id[train_idx])].copy()

In [None]:
# Select essays that contain 'Rebuttal' or 'Counterclaim' discourses for augmentation
augmented_id_list = train_selected[train_selected['discourse_type'].isin(['Rebuttal', 'Counterclaim'])].id.unique()

In [None]:
print('Number of essays in the training set:', train_selected.id.nunique())
print('Number of essays to be augmented:', augmented_id_list.shape[0])
print('Percent of essays to be augmented:', augmented_id_list.shape[0]*100/train_selected.id.nunique())

In [None]:
# Data frame containing the texts to be augmented
to_aug_df = train_selected[train_selected.id.isin(augmented_id_list)].copy()

## Apply data augmentation

In [None]:
# Select the augmentation method from: syn_aug, context_aug, back_trans_aug
augmenter = syn_aug

# Disable tokenizer parallelism to avoid a warning message
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Applying augmentation to all the selected texts
to_aug_df["augmented_text"] = to_aug_df.progress_apply(lambda row: augmenter.augment(row["discourse_text"]), axis = 1)
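
Depending on the installed nlpaug version, `augment` may return a list of strings rather than a single string, in which case the `augmented_text` column would hold lists instead of text. A minimal sketch of a guard follows; `as_text` is a hypothetical helper that is not part of nlpaug.

```python
def as_text(result):
    """Hypothetical helper: normalize an augmenter result to one string."""
    if isinstance(result, (list, tuple)):
        # Assumption: one input string yields a single-element list
        return result[0] if result else ""
    return result

print(as_text(["some augmented text"]))
print(as_text("some augmented text"))
```

It could wrap the call in the cell above, for example `as_text(augmenter.augment(row["discourse_text"]))`.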

In [None]:
to_aug_df.head()

In [None]:
# Fix apostrophe spacing in the text, especially when using Contextual Embedding
to_aug_df['augmented_text'] = to_aug_df['augmented_text'].str.replace(" ' ", "'")

## Creating new data frame and text files


In [None]:
def augment_text_file(text_id, data_df, save_folder_name = 'data_augmented'):
    
    """
    Create information for the augmented text and save it to text files
    
    Arguments:
    text_id -- array or list of essay ids
    data_df -- data frame of the data set, including the "augmented_text" column
    save_folder_name -- folder in which to save the augmented files
    
    Returns:
    augmented_df -- augmented data frame
    """
    # Create the output folder; do not fail if it already exists
    os.makedirs(save_folder_name, exist_ok=True)

    agmented_list = []
    for txt_id in text_id:
        # Get original text
        file_path = f'../input/feedback-prize-2021/train/{txt_id}.txt'
        with open(file_path, 'r') as fr:
            original_text = fr.read()

        # Get corresponding data in the data frame
        text_df = data_df[data_df["id"] == txt_id].copy()
        
        # Init variables for augmented text
        char_pos_original = 0  # trace the character position in the original text
        new_text = ""
        discourse_start_list = []
        discourse_end_list = []
        prediction_string_list = []

        # Loop on the training data discourses
        for row in text_df[["discourse_start", "discourse_end"]].itertuples():
            discourse_start, discourse_end = int(row[1]), int(row[2])

            # Copy the non-discourse text from the original
            if char_pos_original < discourse_start:
                new_text += original_text[char_pos_original:discourse_start] 
            else:
                new_text += ' '

            # Evaluate the new discourse starting position/string
            discourse_start_new = len(new_text)  #character position
            discourse_start_list.append(discourse_start_new)
            word_start = len(new_text.split()) #prediction string position

            # Copy the augmented discourse text
            new_text += text_df[text_df["discourse_start"] == discourse_start]["augmented_text"].iloc[0]


            # Evaluate the new discourse end position/string
            discourse_end_list.append(len(new_text))
            word_end = word_start + len(new_text[discourse_start_new:].split()) #prediction string position   
            prediction_string_list.append(" ".join([str(x) for x in range(word_start, word_end)])) # prediction string for that discourse

            char_pos_original = discourse_end
        
        # Write new info to the dataframe
        text_df["discourse_start_augmented"] = discourse_start_list
        text_df["discourse_end_augmented"] = discourse_end_list
        text_df["predictionstring_augmented"] = prediction_string_list

        # Copy the remainder of the original text, if any
        if char_pos_original < len(original_text):
            new_text += original_text[char_pos_original:]

        # Save to new text file
        with open(f"./{save_folder_name}/{txt_id}_aug.txt", "w") as file:
            file.write(new_text)
        
        # Save all the augmented dataframe to a list
        agmented_list.append(text_df)
        
    augmented_df = pd.concat(agmented_list)
    
    return augmented_df
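
The offset bookkeeping in this function can be easier to follow on a toy example. The sketch below mirrors the same loop on in-memory strings, without files or data frames. `rebuild_offsets`, the sample essay, and the sample spans are all made up for illustration.

```python
def rebuild_offsets(original_text, spans):
    """Toy, in-memory version of the loop above (hypothetical helper).

    spans: list of (discourse_start, discourse_end, augmented_text),
    sorted by discourse_start.
    """
    new_text, pos, rows = "", 0, []
    for start, end, augmented in spans:
        # Copy untouched text between discourses, or insert a separator
        new_text += original_text[pos:start] if pos < start else " "
        new_start = len(new_text)            # character position
        word_start = len(new_text.split())   # word position
        new_text += augmented
        word_end = word_start + len(augmented.split())
        rows.append((new_start, len(new_text),
                     " ".join(str(i) for i in range(word_start, word_end))))
        pos = end
    new_text += original_text[pos:]  # trailing non-discourse text
    return new_text, rows

original = "Intro. Cars are bad. Some say otherwise. End."
spans = [(7, 20, "Autos are harmful."), (21, 40, "Others disagree.")]
new_text, rows = rebuild_offsets(original, spans)
for new_start, new_end, prediction_string in rows:
    print(repr(new_text[new_start:new_end]), prediction_string)
```

Each discourse keeps its position relative to the surrounding text, while its character offsets and word indices shift according to the length of the augmented text that precedes it.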

In [None]:
# Create augmented text
augmented_df = augment_text_file(text_id = augmented_id_list,
                                 save_folder_name = 'data_augmented',
                                 data_df = to_aug_df)

## Data Checking

The new dataframe now contains:
- `discourse_start_augmented`
- `discourse_end_augmented`
- `augmented_text`
- `predictionstring_augmented`

Let us see what they look like.

In [None]:
# Sanity check
augmented_df.isnull().sum()

In [None]:
augmented_df.head()

In [None]:
# Specify the index of the dataframe one wants to check
check_idx = 3

# Loading texts
check_id = augmented_df.iloc[check_idx]["id"]
with open(f'../input/feedback-prize-2021/train/{check_id}.txt', "r") as f:
    original_text = f.read()
with open(f"./data_augmented/{check_id}_aug.txt") as f:
    new_text = f.read()

# Checking the original discourse
print("Original")
print(f"----- Discourse text in the dataframe: \n {augmented_df.iloc[check_idx]['discourse_text']}")
print(f"----- Discourse text in the text file: \n {original_text[int(augmented_df.iloc[check_idx]['discourse_start']):int(augmented_df.iloc[check_idx]['discourse_end'])]}")

# Checking the new discourse
print("\n New")
print(f"----- Discourse text in the dataframe: \n {augmented_df.iloc[check_idx]['augmented_text']}")
print(f"----- Discourse text in the text file: \n {new_text[int(augmented_df.iloc[check_idx]['discourse_start_augmented']):int(augmented_df.iloc[check_idx]['discourse_end_augmented'])]}")
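
The check above compares character offsets only. The word indices stored in `predictionstring_augmented` can be verified in a similar way by indexing into the whitespace-split text. This is a minimal sketch with a hypothetical helper `words_at` and a made-up sample text.

```python
def words_at(text, predictionstring):
    """Hypothetical helper: words of `text` selected by a prediction string."""
    words = text.split()
    return [words[int(i)] for i in predictionstring.split()]

sample_text = "Intro. Autos are harmful. Others disagree. End."
print(words_at(sample_text, "1 2 3"))
```

For a row of `augmented_df`, `words_at(new_text, row['predictionstring_augmented'])` should match `row['augmented_text'].split()`.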

In [None]:
# Zip the augmented folder for download
import shutil
shutil.make_archive('augmented_syn', 'zip', './data_augmented')

# Save to CSV file


In [None]:
# Choose specific columns
aug_df = augmented_df[['id', 'discourse_id', 'discourse_start_augmented', 'discourse_end_augmented',
                    'augmented_text', 'discourse_type', 'discourse_type_num', 'predictionstring_augmented']].copy()

In [None]:
# Change the ids so that they differ from the original ones
aug_df['discourse_id'] = aug_df['discourse_id']*10
aug_df['id'] = aug_df['id'] + '_aug'

In [None]:
# Rename to match the original data frame
aug_df.rename(columns = {'augmented_text': 'discourse_text',
                         'discourse_start_augmented': 'discourse_start',
                         'discourse_end_augmented': 'discourse_end', 
                        'predictionstring_augmented': 'predictionstring'},
              inplace = True)

In [None]:
aug_df.head()

In [None]:
# Save the data frame to csv file
aug_df.to_csv('train_synonym_augmented.csv', index = False)