The new manual human labels generated as part of our study are available as csv files in the *new_manual_labels* subfolder of the *4_reformulated_datasets* folder. This notebook was used to generate the random subset of the data for manual labelling (400 scenario pairs from the hard test dataset and 400 scenario pairs from the easy test dataset).

### Imports

In [None]:
import numpy as np
import os
import pandas as pd
from sklearn.utils import shuffle

### For Google Colab only

To run on Google Colab, uncomment the cell below and replace INSERT_DIRECTORY in the change directory (`%cd`) command with the path to the root folder.

In [None]:
# # mount google drive
# from google.colab import drive
# drive.mount('/content/drive')

# # change to directory containing relevant files
# %cd 'INSERT_DIRECTORY/2_manual_labelling_preparation'

### Generate manual labelling dataset

The original dataset csv files (*util_test.csv* for the easy test dataset and *util_test_hard.csv* for the hard test dataset, as loaded in the code below) have no header row and consist of two columns, with each row a different scenario pair. The first column contains the lower utility scenario, and the second column contains the higher utility scenario (according to the original study labels).
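As a quick illustration of this two-column layout, the sketch below loads a toy stand-in for one of the files. The two example rows are made up for illustration and are not taken from the dataset; only the layout (no header row, lower utility scenario first) follows the description above.

```python
import io

import pandas as pd

# Toy stand-in for a dataset file (made-up rows, NOT from the real data).
# No header row; column 0 = lower utility scenario, column 1 = higher utility scenario.
toy_csv = io.StringIO(
    "I dropped my ice cream on the pavement.,I ate my ice cream in the park.\n"
    "I missed the last bus home.,I caught the last bus home.\n"
)

# header=None makes pandas name the columns 0 and 1 instead of using the first row
toy_df = pd.read_csv(toy_csv, header=None, index_col=None)

print(toy_df.shape)
print(list(toy_df.columns))
```

Because the columns are named by integer position, `toy_df[0]` is the lower utility column and `toy_df[1]` the higher utility column.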

This code shuffles the order of the scenario pairs and the order of the scenarios within each pair, then outputs two csv files. The first contains the shuffled dataset with the study labels included (the reference copy); a study label of 1 means sentence1 is the higher utility scenario, and a study label of 5 means sentence2 is. The second contains the same shuffled dataset without the study labels, so that it can be used to set up a task for labellers.
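The procedure described above can also be sketched in vectorised form with a fixed seed, which makes the shuffle repeatable. This is an illustrative variant, not the code used in the study: the function name `shuffle_and_swap`, the `seed` parameter, and the toy rows are all assumptions introduced here.

```python
import numpy as np
import pandas as pd


def shuffle_and_swap(df, seed=0):
    """Shuffle the rows, then randomly swap the two scenarios within each row.

    Expects column 0 = lower utility scenario, column 1 = higher utility scenario.
    """
    rng = np.random.default_rng(seed)
    # shuffle row order; the original index is kept so pairs stay traceable
    shuffled = df.sample(frac=1, random_state=seed)
    # True = put the higher utility scenario first (study_label 1)
    swap = rng.integers(0, 2, size=len(shuffled)).astype(bool)
    bad = shuffled[0].to_numpy()
    good = shuffled[1].to_numpy()
    return pd.DataFrame({
        "original_index": shuffled.index.to_numpy(),
        "sentence1": np.where(swap, good, bad),
        "sentence2": np.where(swap, bad, good),
        "study_label": np.where(swap, 1, 5),  # 1 = sentence1 is good, 5 = sentence2 is good
    })


# Made-up placeholder pairs, for illustration only
toy = pd.DataFrame({
    0: ["bad A", "bad B", "bad C", "bad D"],
    1: ["good A", "good B", "good C", "good D"],
})
out = shuffle_and_swap(toy, seed=42)
print(out)
```

Keeping `original_index` in the output means each shuffled pair can be matched back to its row in the source file.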

In [None]:
DATASET = "util_test"  # specify "util_test" for easy test set, or "util_test_hard" for hard test dataset

# load dataframe (csv file with col1: bad_sentence, col2: good_sentence)
df = pd.read_csv(f"../1_original_study_datasets/{DATASET}.csv", header=None, index_col=None)

# shuffle all rows of dataframe
df = shuffle(df)

# loop through rows in dataframe to randomly swap order of sentences
# collect shuffled rows in a list, then build a new dataframe: randomised_df
rows = []
for original_index, row in df.iterrows():  # iterrows yields (index, row) pairs
    # to randomise order of sentences in sentence pair
    order = np.random.choice([1, 2])

    if order == 1:  # sentence1 is bad, sentence2 is good
        temp_dict = {"original_index": original_index,
                     "sentence1": row[0],
                     "sentence2": row[1],
                     "study_label": 5}  # 5 = sentence2 is good

    elif order == 2:  # sentence1 is good, sentence2 is bad
        temp_dict = {"original_index": original_index,
                     "sentence1": row[1],
                     "sentence2": row[0],
                     "study_label": 1}  # 1 = sentence1 is good

    rows.append(temp_dict)

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
randomised_df = pd.DataFrame(rows)

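# save randomised_df as csv file with study_labels included
os.makedirs("labelled", exist_ok=True)
randomised_df.to_csv(f"labelled/{DATASET}_LABELLED_randomised.csv")

# save separate csv file for labellers, with study_labels removed
name = "INSERT_LABELLER_NAME"  # placeholder: folder name and file tag for the labeller copy
os.makedirs(name, exist_ok=True)
randomised_df = randomised_df.drop(columns="study_label")
randomised_df.to_csv(f"{name}/{DATASET}_{name}_randomised.csv")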
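The code above shuffles the full dataset; the step that reduces it to the 400 scenario pairs used for manual labelling is not shown in this notebook. The sketch below is one possible way to draw such a subset so that the reference copy and the labeller copy stay aligned. The function name `draw_labelling_subset`, its parameters, and the synthetic demo frame are assumptions made for illustration.

```python
import pandas as pd


def draw_labelling_subset(reference_df, n_pairs=400, seed=0):
    """Draw n_pairs rows and return (reference copy, labeller copy without labels)."""
    # sample without replacement so no scenario pair appears twice
    reference_subset = reference_df.sample(n=n_pairs, random_state=seed)
    # the labeller copy has the same rows in the same order, minus the study labels
    labeller_subset = reference_subset.drop(columns="study_label")
    return reference_subset, labeller_subset


# Synthetic stand-in for randomised_df (placeholder text, not real scenarios)
demo = pd.DataFrame({
    "original_index": range(1000),
    "sentence1": [f"scenario A{i}" for i in range(1000)],
    "sentence2": [f"scenario B{i}" for i in range(1000)],
    "study_label": [1 if i % 2 else 5 for i in range(1000)],
})

ref, lab = draw_labelling_subset(demo, n_pairs=400, seed=0)
print(len(ref), list(lab.columns))
```

Deriving the labeller copy from the sampled reference copy, rather than sampling twice, keeps the two files row-for-row comparable when the manual labels are later checked against the study labels.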