# Dataset EDA

In [17]:
import json

with open('../data/paper_review_new_format_train.jsonl') as f:
    dataset = [json.loads(line) for line in f]

In [3]:
print("Total number of papers in the dataset: ", len(dataset))

Total number of papers in the dataset:  1557


Each record in the dataset has three fields: `instruction`, `input`, and `output`. The `instruction` field holds the system prompt for the model, `input` holds the paper content, and `output` holds the summarized reviews of the paper.
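
Before computing statistics, it can help to confirm that every record really has these three fields. The snippet below is a minimal sketch of such a check; `REQUIRED_KEYS`, `check_record`, and the two toy records are illustrative names and values, not part of the original notebook or the real data.

```python
import json

# Field names taken from the description above.
REQUIRED_KEYS = {"instruction", "input", "output"}

def check_record(record):
    """Return a list of problems found in one record (empty if it looks fine)."""
    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    # Sort so the report order is deterministic.
    for key in sorted(REQUIRED_KEYS & record.keys()):
        value = record[key]
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key!r} is empty or not a string")
    return problems

# Toy records standing in for lines of the JSONL file (not real data).
good = json.loads('{"instruction": "Review.", "input": "[TITLE] x", "output": "ok"}')
bad = {"instruction": "Review.", "input": ""}

print(check_record(good))
print(check_record(bad))
```

On the real data, the same function could be applied to each element of `dataset` and any non-empty result inspected by hand.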

In [4]:
dataset[0]

{'instruction': 'You are a professional machine learning conference reviewer who reviews a given paper and considers 4 criteria: **importance and novelty**, **potential reasons for acceptance**, **potential reasons for rejection**, and **suggestions for improvement**. The given paper is as follows.',
 'input': '[TITLE]\nImage Quality Assessment Techniques Improve Training and Evaluation of Energy-Based Generative Adversarial Networks\n\n[ABSTRACT]\nWe propose a new, multi-component energy function for energy-based Generative Adversarial Networks (GANs) based on methods from the image quality assessment literature. Our approach expands on the Boundary Equilibrium Generative Adversarial Network (BEGAN) by outlining some of the short-comings of the original energy and loss functions. We address these short-comings by incorporating an l1 score, the Gradient Magnitude Similarity score, and a chrominance score into the new energy function. We then provide a set of systematic experiments that

## Average Number of Words

In [5]:
lengths = []
for data in dataset:
    # `paper_text` avoids shadowing the built-in `input`.
    instruction = data['instruction']
    paper_text = data['input']
    output = data['output']

    # Count whitespace-separated words in each field.
    len_instruction = len(instruction.split())
    len_input = len(paper_text.split())
    len_output = len(output.split())

    # Total words per example across all three fields.
    sum_len = len_instruction + len_input + len_output
    lengths.append(sum_len)

print("Average number of words in the dataset: ", sum(lengths)/len(lengths))

Average number of words in the dataset:  5619.705844572896


A common rule of thumb is that one token corresponds to roughly 4 characters of common English text, or about ¾ of a word (so 100 tokens ≈ 75 words). We can use this ratio to estimate the average number of tokens from the word counts.

In [6]:
token2word_ratio = 4/3
print("Average number of tokens in the dataset: ", sum(lengths)/len(lengths) * token2word_ratio)
print("Max number of tokens in the dataset: ", max(lengths) * token2word_ratio)

Average number of tokens in the dataset:  7492.941126097195
Max number of tokens in the dataset:  9821.333333333332
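
The rule of thumb has two forms, one based on characters and one based on words, and they do not always agree on a given text. The sketch below computes both for a single string so the estimates can be compared; `estimate_tokens` and its parameter names are hypothetical helpers, and neither number replaces a count from the actual tokenizer of the target model.

```python
def estimate_tokens(text, chars_per_token=4.0, words_per_token=0.75):
    """Return (estimate from characters, estimate from words) for one string."""
    by_chars = len(text) / chars_per_token
    by_words = len(text.split()) / words_per_token
    return by_chars, by_words

# Illustrative sentence, not taken verbatim from the dataset.
sample = "We propose a new energy function for energy-based generative models."
by_chars, by_words = estimate_tokens(sample)
print(f"by characters: {by_chars:.1f}, by words: {by_words:.1f}")
```

If the two estimates diverge a lot on real examples, that is a hint that the text is not "common English" (for instance, heavy in symbols or citations) and that a tokenizer-based count would be worth the extra step.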


In [None]:
# Draw a histogram of the total word count per example
# (instruction + input + output), as collected in `lengths`.
import matplotlib.pyplot as plt
plt.hist(lengths, bins=20)
plt.xlabel('Total words per example')
plt.ylabel('Number of papers')
plt.title('Histogram of total word counts per example')
plt.show()
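
A histogram shows the shape of the distribution, but a few numbers are often easier to act on, such as how many examples would exceed a given context length. The following is a minimal sketch using the same 4/3 tokens-per-word estimate; `summarize_lengths`, the `token_budget` value of 8192, and `toy_lengths` are assumptions chosen for illustration, not values from the notebook.

```python
import statistics

def summarize_lengths(word_counts, tokens_per_word=4 / 3, token_budget=8192):
    """Summarize word counts and the share of examples over a token budget."""
    token_estimates = [n * tokens_per_word for n in word_counts]
    over_budget = sum(t > token_budget for t in token_estimates)
    return {
        "median_words": statistics.median(word_counts),
        "max_words": max(word_counts),
        "share_over_budget": over_budget / len(word_counts),
    }

# Toy word counts; in the notebook this would be the `lengths` list.
toy_lengths = [3000, 4500, 6000, 6300, 7000]
print(summarize_lengths(toy_lengths))
```

With the real `lengths` list, `share_over_budget` gives a rough idea of how many examples would be truncated at the chosen budget.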

# Preprocess Dataset

In [7]:
import json
import numpy as np
from datasets import Dataset

random_seed = 42

In [18]:
dataset = Dataset.from_list(dataset)

In [19]:
def create_conversation(sample):
    return {
        "messages": [
            {"role": "system", "content": sample['instruction']},
            {"role": "user", "content": sample["input"]},
            {"role": "assistant", "content": sample["output"]}
        ]
    }
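
To see what the mapping produces without loading the full dataset, the function can be called on a single hand-written record. This sketch repeats the function so it runs on its own; the `toy_sample` values are placeholders, not real data.

```python
def create_conversation(sample):
    # Same mapping as above: one system, one user, one assistant message.
    return {
        "messages": [
            {"role": "system", "content": sample["instruction"]},
            {"role": "user", "content": sample["input"]},
            {"role": "assistant", "content": sample["output"]},
        ]
    }

# Placeholder record with the three fields used by the dataset.
toy_sample = {
    "instruction": "You are a reviewer.",
    "input": "[TITLE]\nA toy paper",
    "output": "A toy review.",
}

conversation = create_conversation(toy_sample)
for message in conversation["messages"]:
    print(message["role"], "->", message["content"][:30])
```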


In [20]:
dataset = dataset.map(create_conversation, remove_columns=dataset.column_names, batched=False)

In [21]:
print(dataset[0]["messages"])

[{'content': 'You are a professional machine learning conference reviewer who reviews a given paper and considers 4 criteria: **importance and novelty**, **potential reasons for acceptance**, **potential reasons for rejection**, and **suggestions for improvement**. The given paper is as follows.', 'role': 'system'}, {'content': '[TITLE]\nImage Quality Assessment Techniques Improve Training and Evaluation of Energy-Based Generative Adversarial Networks\n\n[ABSTRACT]\nWe propose a new, multi-component energy function for energy-based Generative Adversarial Networks (GANs) based on methods from the image quality assessment literature. Our approach expands on the Boundary Equilibrium Generative Adversarial Network (BEGAN) by outlining some of the short-comings of the original energy and loss functions. We address these short-comings by incorporating an l1 score, the Gradient Magnitude Similarity score, and a chrominance score into the new energy function. We then provide a set of systemati

## Split into Train and Test

In [22]:
dataset = dataset.train_test_split(test_size=0.2, seed=random_seed)

In [24]:
dataset

DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 1245
    })
    test: Dataset({
        features: ['messages'],
        num_rows: 312
    })
})
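
The row counts can be checked by hand. The sketch below assumes the test split size is rounded up and the remainder goes to the training split, which is consistent with the counts shown above; `split_sizes` is a hypothetical helper, not a function from the `datasets` library.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), assuming the test size is rounded up."""
    n_test = math.ceil(n_rows * test_size)
    n_train = n_rows - n_test
    return n_train, n_test

n_train, n_test = split_sizes(1557, 0.2)
print(n_train, n_test)  # 1245 312
```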

In [25]:
# save datasets to disk
dataset["train"].to_json("../data/train_dataset.json", orient="records")
dataset["test"].to_json("../data/test_dataset.json", orient="records")

11129949
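
A quick way to confirm the saved files are usable is to read one back. The sketch below assumes the output is JSON Lines (one JSON object per line), which is the default for `Dataset.to_json` with `orient="records"` as far as I know. To stay self-contained it writes and reads a small temporary file; `read_jsonl` and the toy row are illustrative and do not touch `../data/`.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Toy row in the same "messages" layout as the converted dataset.
rows = [
    {
        "messages": [
            {"role": "system", "content": "You are a reviewer."},
            {"role": "user", "content": "[TITLE]\nA toy paper"},
            {"role": "assistant", "content": "A toy review."},
        ]
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataset.json")
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    loaded = read_jsonl(path)

print(len(loaded), [m["role"] for m in loaded[0]["messages"]])
```

Pointing `read_jsonl` at the saved train or test file and comparing `len(loaded)` with the row counts from the split would be a reasonable sanity check.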