# Preprocessing
This file modifies the input data: it cleans the dataset and generates multiple-choice options for each clue.

## Load dataset
Loading the dataset directly from Hugging Face doesn't work. The GitHub repository for the dataset (and the paper) is here:

[https://github.com/aviaefrat/cryptonite](https://github.com/aviaefrat/cryptonite)

So we will use Hugging Face's load_dataset to load the downloaded jsonl files instead.  
*2024.06.06*  
I encountered a problem: in the "number" field of the data, a few rows are in string format. So the first step was to convert those strings back to int, so that Hugging Face's load_dataset can run.  
*2024.06.09*  
I realized there is a problem with that approach: every input should be in string format, so we should actually do the opposite and cast every "number" value to a string.

In [5]:
import json


def reformat_data(file_path):
    modified_lines = []
    # Read the JSONL file
    with open(file_path, 'r') as file:
        for line in file:
            # Parse the JSON object from the line
            data = json.loads(line)
    
            if isinstance(data['number'], str):
                # Keep only the digits and round-trip through int to normalize
                # the format; strings without any digit are left untouched
                numerical_part = ''.join(filter(str.isdigit, data['number']))
                if numerical_part:
                    data['number'] = str(int(numerical_part))
            else:
                # Already numeric: cast directly to string
                data['number'] = str(data['number'])
            # Append modified line to the list
            modified_lines.append(json.dumps(data) + '\n')
    # Write modified lines back to the original file
    with open(file_path, 'w') as file:
        file.writelines(modified_lines)


train_fp = '../datasets/cryptonite-official-split/cryptonite-train.jsonl'
val_fp = "../datasets/cryptonite-official-split/cryptonite-val.jsonl"
test_fp = '../datasets/cryptonite-official-split/cryptonite-test.jsonl'
for file_path in [train_fp, val_fp, test_fp]:
    reformat_data(file_path)
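
To confirm which types actually occur in the "number" field, before or after the rewrite, a small check along the following lines can help. This is a minimal sketch and not part of the original pipeline: the helper name `count_field_types` is hypothetical, and it assumes every non-blank line of the file is a JSON object.

```python
import json
from collections import Counter


def count_field_types(file_path, field="number"):
    """Count the Python type names seen for `field` across a JSONL file.

    Hypothetical helper for illustration; rows lacking the field are
    counted under the key 'missing'.
    """
    counts = Counter()
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            if not line.strip():
                continue  # skip blank lines
            data = json.loads(line)
            if field in data:
                counts[type(data[field]).__name__] += 1
            else:
                counts["missing"] += 1
    return counts
```

Once reformat_data has run on a file, `count_field_types(train_fp)` should report only `str` for this field.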


## Add multiple choice
Besides the answer, what can the other choices be? We know that each clue ends with an enumeration giving the word lengths of the answer, for example:  
*the banners supporting strike were featured prominently (3,3,9)*
indicates that the answer has three words: the first has 3 letters, the second has 3 letters, and the third has 9. So the least we can do is mimic this shape: for each extra choice we pick three random words with 3, 3, and 9 letters.
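
The shape can be read off the enumeration string mechanically. The sketch below is one possible way to do it, under the assumption that an enumeration is a parenthesized list of integers separated by commas or hyphens, such as "(3,3,9)" or "(4-4)". The function name `parse_enumeration` is hypothetical, and the hyphen case is an assumption about the data rather than something this notebook confirms. Note that `ast.literal_eval`, used later in this notebook, would evaluate "(4-4)" as the subtraction 4 - 4 = 0, whereas this sketch treats each number as a separate word length.

```python
import re


def parse_enumeration(enumeration):
    """Turn an enumeration string such as '(3,3,9)' into a tuple of word lengths.

    Assumption: lengths are separated by commas or hyphens, and both are
    treated as word boundaries here.
    """
    lengths = tuple(int(num) for num in re.findall(r"\d+", enumeration))
    if not lengths:
        raise ValueError(f"no word lengths found in {enumeration!r}")
    return lengths


print(parse_enumeration("(3,3,9)"))  # (3, 3, 9)
print(parse_enumeration("(5)"))      # (5,)
print(parse_enumeration("(4-4)"))    # (4, 4)
```

This sketch does not record whether two parts were joined by a comma or a hyphen, so a phrase built from the result would always be space-separated.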

First we build a function that randomly selects a word of a given length.

In [6]:
# !pip install nltk

In [7]:
import nltk
import random

from nltk.corpus import words

# Fetch the word corpus if it is not already available locally
nltk.download('words', quiet=True)

# Load the word list
word_list = words.words()

# Map each length i (1..max_word_len) to the list of all words with i letters;
# lists rather than sets, because we want to select randomly later
max_word_len = 35
dic_word_by_length = {length: [] for length in range(1, max_word_len + 1)}
for word in word_list:
    if len(word) in dic_word_by_length:
        dic_word_by_length[len(word)].append(word)
# Edge case: a length of 0 apparently occurs, so map it to the empty string
dic_word_by_length[0] = ['']
    
# Function to get a random word of specific length
def get_random_word_of_length(length):
    # .get avoids a KeyError for lengths outside the precomputed range
    filtered_words = dic_word_by_length.get(length)
    if not filtered_words:
        return None
    return random.choice(filtered_words)

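def get_random_phrase_of_shape(choice_shape):
    # A single-word shape may arrive as a bare int; wrap it in a tuple
    if isinstance(choice_shape, int):
        choice_shape = (choice_shape,)
    # From here on, choice_shape is assumed to be a tuple of word lengths
    phrase_words = []  # named so it does not shadow the imported nltk `words`
    for length in choice_shape:
        assert length >= 0, f"negative word length: {length}"
        word = get_random_word_of_length(length)
        assert word is not None, f"no word of length {length} available"
        phrase_words.append(word)
    phrase = " ".join(phrase_words)
    return phrase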
    
# Example: get a random phrase of shape (3,3,9)
random_phrase = get_random_phrase_of_shape((3,3,9))
print("Random word of shape (3,3,9):", random_phrase)

Random word of shape (3,3,9): aho nix unplainly


Then we iterate through the data and give each clue three random choices that fit the shape given in its enumeration, in addition to the correct answer.

In [8]:
import ast
import json

def add_n_choices(original_file_path, modified_file_path, n=3):
    modified_lines = []
    # Read the JSONL file
    with open(original_file_path, 'r') as file:
        for line in file:
            # Parse the JSON object from the line
            data = json.loads(line)
            choice_shape = ast.literal_eval(data['enumeration'])
            for i in range(1, n + 1):
                data[f"choice{i}"] = get_random_phrase_of_shape(choice_shape)
            # Append modified line to the list
            modified_lines.append(json.dumps(data) + '\n')
    # Write modified lines to the new file (the original file is left unchanged)
    with open(modified_file_path, 'w') as file:
        file.writelines(modified_lines)


train_fp = '../datasets/cryptonite-official-split/cryptonite-train.jsonl'
val_fp = "../datasets/cryptonite-official-split/cryptonite-val.jsonl"
test_fp = '../datasets/cryptonite-official-split/cryptonite-test.jsonl'

modified_train_fp = 'processed_dataset/cryptonite-train-choice.jsonl'
modified_val_fp = "processed_dataset/cryptonite-val-choice.jsonl"
modified_test_fp = 'processed_dataset/cryptonite-test-choice.jsonl'
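import os

# Make sure the output directory exists before writing
os.makedirs('processed_dataset', exist_ok=True)
for original_file_path, modified_file_path in [
    (train_fp, modified_train_fp),
    (val_fp, modified_val_fp),
    (test_fp, modified_test_fp),
]:
    add_n_choices(original_file_path, modified_file_path)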
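
One caveat of purely random sampling is that a random choice can coincide with the correct answer or with another random choice. The sketch below shows one way to guard against that, assuming a word pool indexed by length like dic_word_by_length above. The names `sample_distractors`, `words_by_length`, and `max_tries` are hypothetical, and the toy pool is made up for illustration only.

```python
import random


def sample_distractors(answer, shape, words_by_length, n=3, max_tries=1000, rng=random):
    """Sample n distinct random phrases of the given shape, none equal to the answer.

    Hypothetical helper: `words_by_length` maps a word length to a list of
    candidate words, and `shape` is a tuple of word lengths.
    """
    distractors = set()
    answer_key = answer.lower()
    for _ in range(max_tries):
        if len(distractors) == n:
            break
        phrase = " ".join(rng.choice(words_by_length[length]) for length in shape)
        # Reject phrases that match the answer; the set drops duplicates
        if phrase.lower() != answer_key:
            distractors.add(phrase)
    if len(distractors) < n:
        raise ValueError(f"could not find {n} distinct distractors for shape {shape}")
    return sorted(distractors)


# Toy pool, made up for illustration only
toy_pool = {3: ["cat", "dog", "owl", "fox"], 4: ["bird", "fish", "frog"]}
choices = sample_distractors("cat fish", (3, 4), toy_pool, n=3, rng=random.Random(0))
print(choices)
```

The retry limit keeps the loop from running forever when the pool is too small to yield enough distinct phrases; in that case the sketch raises an error instead of returning fewer choices.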