### Data Augmentation and Oversampling

The idea is to leverage the metadata and to follow the algorithm proposed by Hashemi et al. (2023) to get pairs of non-consecutive paragraphs (with and
without style changes). Then, classes will be oversampled to obtain a balanced data set.

Description of the algorithm: " incorporate additional non-consecutive pairs of paragraphs
into our sample set and assign them labels based on the inferred relationships. For example, if
there are three consecutive paragraphs without a style change, we can infer that the first and
third paragraphs are written by the same author. Similarly, if there are style changes between
the first and second paragraphs and between the second and third paragraphs, we can deduce
that the authors of the first and third paragraphs are different, given that the number of authors
in the document exceeds the number of style changes by one." (Hashemi et al. 2023: 4).

In [1]:
import os, json
import pandas as pd

In [2]:
BASE_DIR = '../data_pipeline/'

# get data sets 
df_train = pd.read_csv(os.path.join(BASE_DIR, "df_train.csv"))
df_val = pd.read_csv(os.path.join(BASE_DIR, "df_validation.csv"))

# check distribution of labels
changes_train = len(df_train[df_train['label_author'] == 1])
no_changes_train = len(df_train[df_train['label_author'] == 0])

changes_val = len(df_val[df_val['label_author'] == 1])
no_changes_val = len(df_val[df_val['label_author'] == 0])

print(f"Number of rows where label_author == 0 (training data): {no_changes_train}")
print(f"Number of rows where label_author == 1 (training data): {changes_train}")

print(f"Number of rows where label_author == 0 (validation data): {no_changes_val}")
print(f"Number of rows where label_author == 1 (validation data): {changes_val}")

Number of rows where label_author == 0 (training data): 20485
Number of rows where label_author == 1 (training data): 31508
Number of rows where label_author == 0 (validation data): 4489
Number of rows where label_author == 1 (validation data): 6709


In [12]:
def data_augmentation(data):
    augmented_rows = []
    # problem is that fileindexes start from 1 for every label_dataset --> iterate over label_dataset first
    unique_datasets = data['label_dataset'].unique()
    for dataset in unique_datasets:
        dataset_data = data[data['label_dataset'] == dataset] # get subset of data for easy, medium, and hard
        unique_fileindexes = data['fileindex'].unique() # get unique fileindexes
        for file_index in unique_fileindexes:
            file_data = dataset_data[dataset_data['fileindex'] == file_index] # get DataFrame for file
            
            if (file_data['label_author'] == 1).sum() == (file_data["n_authors"].iloc[0] - 1):
                for i in range(len(file_data)-2):
                    row = file_data.iloc[i]
                    j = i + 1 # set next paragraph index

                    while (j < len(file_data)) and (file_data["label_author"].iloc[j-1] == 0):# while same author
                        if j > i + 1:
                            augmented_rows.append({
                        'paragraph1': row['paragraph1'],
                        'paragraph2': file_data['paragraph2'].iloc[j],
                        'label_author': 0, # same author
                        'label_dataset': row['label_dataset'],
                        'n_authors': row['n_authors'],
                        'fileindex': row['fileindex']
                    })   
                        j +=1 # move to next paragraph
                    while j < len(file_data):
                        if j > i + 1:
                            #print("-----",i,j, len(file_data))
                            augmented_rows.append({
                        'paragraph1': row['paragraph1'],
                        'paragraph2': file_data['paragraph2'].iloc[j],
                        'label_author': 1, # style change
                        'label_dataset': row['label_dataset'],
                        'n_authors': row['n_authors'],
                        'fileindex': row['fileindex']
                    })
                        j += 1 # move to next paragraph
            
    # Create a new DataFrame with augmented rows
    augmented_df = pd.DataFrame(augmented_rows)
    #augmented_df = pd.concat([data, augmented_df], ignore_index=True)
    return augmented_df

augmented_df = data_augmentation(df_train)
augmented_df

Unnamed: 0,paragraph1,paragraph2,label_author,label_dataset,n_authors,fileindex
0,"Look, you can believe that there was some broa...",">His rambling opening statement, which lasted ...",1,0,4,16
1,The 1028 120 mm canister may be a round the US...,Modern high explosive fragmentation weapons ex...,1,0,4,19
2,The 1028 120 mm canister may be a round the US...,Shrapnel shells stopped being used because the...,1,0,4,19
3,The 1028 120 mm canister may be a round the US...,Old timey Shrapnel Shells TM worked as basical...,1,0,4,19
4,I just saw some videos from this attack. The m...,Shrapnel shells stopped being used because the...,1,0,4,19
...,...,...,...,...,...,...
27087,I wonder what happens. Quatar has done so many...,In Brazil Fifa could probably sue the govt for...,1,2,4,4200
27088,I wonder what happens. Quatar has done so many...,I didn't say anything about Fifa not being shi...,1,2,4,4200
27089,"This is FIFA. They buy/sell/own the World Cup,...",In Brazil Fifa could probably sue the govt for...,1,2,4,4200
27090,"This is FIFA. They buy/sell/own the World Cup,...",I didn't say anything about Fifa not being shi...,1,2,4,4200


In [None]:
def swap_order(data):
    '''double the training data by swapping paragraph1 and paragraph2'''
    pass