# Tutorial on Preparing Data for CycleLighting

In [46]:
import json
from datasets import Dataset, DatasetDict
import pandas as pd
from nltk.tokenize import sent_tokenize

## Data Format

Training data can be either unpaired or paired. Unpaired datasets can differ in size and domain. Validation and test data must be labelled examples so that an error metric can be calculated.
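
As a rough illustration of the two record shapes described above, here is a minimal sketch. The sentences are placeholders, and the `text` / `label` keys are the ones this tutorial uses further down.

```python
# Unpaired training records: each example only carries its own text.
unpaired_a = [{"text": "A sentence from domain A."}]
unpaired_b = [
    {"text": "A sentence from domain B."},
    {"text": "Another, unrelated sentence from domain B."},
]

# Paired (labelled) records: the label is the expected output for the text.
paired = [
    {"text": "Source sentence from domain A.", "label": "Its counterpart in domain B."}
]

# Unpaired sets may differ in size; paired records need both keys.
sizes = (len(unpaired_a), len(unpaired_b))
all_labelled = all({"text", "label"} <= set(record) for record in paired)
print(sizes, all_labelled)
```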

## Importing Data & Preparation

Import your training and validation data from whatever source you have. In this case, I read raw text for the unpaired examples and a CSV for the paired examples, which would have been manually annotated or taken from a ground-truth data source. Split the unpaired content into units appropriate for your task, e.g. into sentences.

In [47]:
# import text from file
with open('a_text.txt', 'r') as f:
    a_text = f.read()

with open('b_text.txt', 'r') as f:
    b_text = f.read()

# paired validation examples, one dict per CSV row (columns 'A' and 'B' are used below)
val_data = pd.read_csv('val_data.csv').to_dict('records')

# Extract each sentence
a_sentences = sent_tokenize(a_text)
b_sentences = sent_tokenize(b_text)

print(f'{len(a_sentences)} sentences for dataset A')
print(f'{len(b_sentences)} sentences for dataset B')
print(f'{len(val_data)} sentences for validation')

60 sentences for dataset A
120 sentences for dataset B
4 sentences for validation


Create a list of dictionaries, one per example, with `text` as the key for the unpaired data and `text` and `label` as the keys for the paired data. Each value should be a string.

Validation samples can be reused for each dataset by inverting which half of the pair is text and which half is label.

In [48]:
a_dict = [{'text': a.strip()} for a in a_sentences]
b_dict = [{'text': b.strip()} for b in b_sentences]
a_val = [{'text': example['A'], 'label': example['B']} for example in val_data]
b_val = [{'text': example['B'], 'label': example['A']} for example in val_data]

print(a_dict[0])

{'text': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'}
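
The cell above reads the paired examples through the column names `A` and `B`, so the CSV is expected to have one row per pair with those two headers. The sketch below shows that layout and the inversion using only the standard library. The CSV contents are hypothetical placeholders, not the real `val_data.csv`.

```python
import csv
import io

# Hypothetical contents of val_data.csv: one row per pair, columns A and B.
csv_text = (
    "A,B\n"
    "first sentence in A,first sentence in B\n"
    "second sentence in A,second sentence in B\n"
)

val_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Same inversion as in the cell above: A -> B for dataset A, B -> A for dataset B.
a_val_rows = [{"text": row["A"], "label": row["B"]} for row in val_rows]
b_val_rows = [{"text": row["B"], "label": row["A"]} for row in val_rows]

print(a_val_rows[0])
print(b_val_rows[0])
```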


## Using JSON and JSONL

JSON and JSONL files cannot currently be used with a validation set. Test samples can still be generated for use with `generate.py`, so we will keep the full validation set for testing. The A and B datasets do not need to use the same format; as an example, we will save one as JSON and the other as JSONL.

In [49]:
# JSON: the whole list is written as a single document
with open('a.json', 'w') as f:
    json.dump(a_dict, f, indent=4)

# JSONL: one JSON object per line
with open('b.jsonl', 'w') as f:
    for b in b_dict:
        f.write(json.dumps(b))
        f.write('\n')
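
To check that both files hold the same kind of records, they can be read back with the standard library. This is a minimal sketch that writes two placeholder records to a temporary directory rather than touching `a.json` and `b.jsonl`. The record contents are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

records = [{"text": "First sentence."}, {"text": "Second sentence."}]

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "a.json"
    jsonl_path = Path(tmp) / "b.jsonl"

    # JSON: the whole list is one document.
    json_path.write_text(json.dumps(records, indent=4), encoding="utf-8")
    # JSONL: one JSON object per line.
    jsonl_path.write_text(
        "".join(json.dumps(record) + "\n" for record in records), encoding="utf-8"
    )

    # Read both files back into lists of dictionaries.
    from_json = json.loads(json_path.read_text(encoding="utf-8"))
    from_jsonl = [
        json.loads(line)
        for line in jsonl_path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]

print(from_json == records, from_jsonl == records)
```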

## DatasetDict

The `DatasetDict` object is a dictionary that holds multiple datasets. Here it holds the training, validation, and test datasets. Once again, the two datasets can use different formats, e.g. one could be JSON and the other a `DatasetDict`. We will create both as `DatasetDict` objects, as this is the recommended format for the project in general.

In [50]:
a_dataset = Dataset.from_list(a_dict)
b_dataset = Dataset.from_list(b_dict)

a_val_dataset = Dataset.from_list(a_val)
b_val_dataset = Dataset.from_list(b_val)

# Split out validation and test data with 50% for val and 50% for test
a_val_dataset, a_test_dataset = a_val_dataset.train_test_split(test_size=0.5).values()
b_val_dataset, b_test_dataset = b_val_dataset.train_test_split(test_size=0.5).values()

a_dataset_dict = DatasetDict({
    'train': a_dataset,
    'validation': a_val_dataset,
    'test': a_test_dataset
})

b_dataset_dict = DatasetDict({
    'train': b_dataset,
    'validation': b_val_dataset,
    'test': b_test_dataset
})

a_dataset_dict

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 60
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2
    })
})

Save the datasets to disk:

In [51]:
a_dataset_dict.save_to_disk('tutorial/A')
b_dataset_dict.save_to_disk('tutorial/B')
