### Data Augmentation and Oversampling

The idea is to leverage the metadata and follow the algorithm proposed by Hashemi et al. (2023) to derive pairs of non-consecutive paragraphs (with and
without style changes). The classes are then oversampled to obtain a balanced data set.

Description of the algorithm: "incorporate additional non-consecutive pairs of paragraphs
into our sample set and assign them labels based on the inferred relationships. For example, if
there are three consecutive paragraphs without a style change, we can infer that the first and
third paragraphs are written by the same author. Similarly, if there are style changes between
the first and second paragraphs and between the second and third paragraphs, we can deduce
that the authors of the first and third paragraphs are different, given that the number of authors
in the document exceeds the number of style changes by one." (Hashemi et al. 2023: 4).
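
The inference rule in the quote can be sketched on a toy label sequence. This is a minimal illustration, not the implementation of Hashemi et al.; `infer_pairs` and its arguments are hypothetical names, and it assumes that `labels[k]` is 1 when there is a style change between paragraphs `k` and `k + 1`.

```python
def infer_pairs(labels, n_authors):
    """Return {(i, j): label} for non-consecutive paragraph pairs.

    labels[k] == 1 means a style change between paragraphs k and k + 1.
    (Hypothetical helper, for illustration only.)
    """
    n_paragraphs = len(labels) + 1
    # Different-author inference is only safe when each change introduces a new author.
    every_change_is_new_author = sum(labels) == n_authors - 1
    pairs = {}
    for i in range(n_paragraphs - 2):
        for j in range(i + 2, n_paragraphs):
            if not any(labels[i:j]):
                pairs[(i, j)] = 0  # no change in between: same author
            elif every_change_is_new_author:
                pairs[(i, j)] = 1  # at least one change: different authors
            # otherwise the relationship cannot be inferred, so the pair is skipped
    return pairs


# Five paragraphs: no change after the 1st and 2nd, changes after the 3rd and 4th,
# three authors in total.
print(infer_pairs([0, 0, 1, 1], n_authors=3))
```

When the number of authors does not exceed the number of style changes by exactly one, an author may reappear later in the document, so only the same-author pairs are kept in this sketch.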

In [1]:
import os, json
import pandas as pd

In [2]:
BASE_DIR = '../data_pipeline/'

# get data sets 
df_train = pd.read_csv(os.path.join(BASE_DIR, "df_train.csv"))
df_val = pd.read_csv(os.path.join(BASE_DIR, "df_validation.csv"))

# check distribution of labels
changes_train = len(df_train[df_train['label_author'] == 1])
no_changes_train = len(df_train[df_train['label_author'] == 0])

changes_val = len(df_val[df_val['label_author'] == 1])
no_changes_val = len(df_val[df_val['label_author'] == 0])

print(f"Number of rows where label_author == 0 (training data): {no_changes_train}")
print(f"Number of rows where label_author == 1 (training data): {changes_train}")

print(f"Number of rows where label_author == 0 (validation data): {no_changes_val}")
print(f"Number of rows where label_author == 1 (validation data): {changes_val}")

Number of rows where label_author == 0 (training data): 20485
Number of rows where label_author == 1 (training data): 31508
Number of rows where label_author == 0 (validation data): 4489
Number of rows where label_author == 1 (validation data): 6709
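
The counts above show that the "no change" class (label 0) is the minority in both splits. One possible way to balance the training set is random oversampling with replacement. The sketch below uses plain pandas; `oversample_minority`, its `seed` parameter, and the toy `demo` frame standing in for `df_train` are assumptions made for illustration.

```python
import pandas as pd


def oversample_minority(df, label_col="label_author", seed=42):
    """Randomly duplicate rows of smaller classes until every class matches the largest one."""
    counts = df[label_col].value_counts()
    target = counts.max()
    parts = []
    for label, count in counts.items():
        subset = df[df[label_col] == label]
        parts.append(subset)
        if count < target:
            # sample with replacement to fill the gap to the majority class size
            parts.append(subset.sample(n=target - count, replace=True, random_state=seed))
    combined = pd.concat(parts, ignore_index=True)
    # shuffle so that the duplicated rows are not clustered at the end
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy stand-in for df_train: two "no change" rows and four "change" rows.
demo = pd.DataFrame({"paragraph1": list("abcdef"), "label_author": [0, 0, 1, 1, 1, 1]})
balanced = oversample_minority(demo)
print(balanced["label_author"].value_counts().to_dict())
```

Sampling with replacement only repeats existing rows, so it is usually applied after any augmentation step and only to the training split.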


In [26]:
def data_augmentation(data):
    '''build non-consecutive paragraph pairs with inferred labels (Hashemi et al. 2023)'''
    augmented_rows = []

    # assumes the rows of each file are in document order and that row k
    # holds paragraph k ('paragraph1') and paragraph k + 1 ('paragraph2')
    for file_index, file_data in data.groupby('fileindex', sort=False):
        file_data = file_data.reset_index(drop=True) # positional index within the file
        labels = file_data['label_author'].to_numpy()

        # only augment files in which every style change introduces a new author
        # (n_authors == number of changes + 1); otherwise an author may reappear
        if labels.sum() != file_data['n_authors'].iloc[0] - 1:
            continue

        n_paragraphs = len(file_data) + 1 # each row is one consecutive pair
        for i in range(n_paragraphs - 2):
            row = file_data.iloc[i]
            for j in range(i + 2, n_paragraphs): # skip the consecutive pair (i, i + 1)
                augmented_rows.append({
                    'paragraph1': row['paragraph1'],
                    # paragraph j is the second paragraph of row j - 1
                    'paragraph2': file_data.at[j - 1, 'paragraph2'],
                    # any change between i and j means different authors
                    'label_author': int(labels[i:j].any()),
                    'label_dataset': row['label_dataset'],
                    'n_authors': row['n_authors'],
                    'fileindex': row['fileindex']
                })

    # Create a new DataFrame that holds only the augmented rows
    augmented_df = pd.DataFrame(augmented_rows)
    #augmented_df = pd.concat([data, augmented_df], ignore_index=True)

    return augmented_df

# Example usage:
augmented_df = data_augmentation(df_train)
augmented_df

In [4]:
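def swap_order(data):
    '''double the training data by swapping paragraph1 and paragraph2'''
    swapped = data.rename(columns={'paragraph1': 'paragraph2', 'paragraph2': 'paragraph1'})
    # restore the original column order before appending the swapped copy
    return pd.concat([data, swapped[data.columns]], ignore_index=True)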