# Preprocessing Folk- and Mythology Tales
This notebook assumes you've downloaded and extracted the [cleaned version](https://www.kaggle.com/datasets/cuddlefish/fairy-tales) of [Folk- and Mythology Tales](https://huggingface.co/datasets/merve/folk-mythology-tales) to `./data/merged_clean.txt`.

This is for generating a simple local dataset. For a full tutorial on generating a Hugging Face dataset see [the documentation](https://huggingface.co/docs/datasets/v2.12.0/dataset_script).

There are better ways to do this, but since this dataset is so small, I took a quick and dirty approach.

## Explore Text
In order to preprocess this text file into multiple documents, we need to see what we're dealing with. So let's print some lines and look at how the tales are split.

In [1]:
skip = False
with open("data/merged_clean.txt", "r") as f:
    for i in range(200):
        line = f.readline()
        if 0 <= i <= 20 or 150 <= i <= 165:
            print(line, end='')
        elif not skip:
            skip = True
            print('\n...\n')


Lovely Ilonka

There was once a king's son who told his father that he wished to marry.

'No, no!' said the king; 'you must not be in such a hurry. Wait till you
have done some great deed. My father did not let me marry till I had won
the golden sword you see me wear.'

The prince was much disappointed, but he never dreamed of disobeying his
father, and he began to think with all his might what he could do. It
was no use staying at home, so one day he wandered out into the world to
try his luck, and as he walked along he came to a little hut in which he
found an old woman crouching over the fire.

'Good evening, mother. I see you have lived long in this world; do you
know anything about the three bulrushes?'

'Yes, indeed, I've lived long and been much about in the world, but I
have never seen or heard anything of what you ask. Still, if you will
wait till to-morrow I may be able to tell you something.'

...


The next day the king was married, with great rejoicings, to the fair
Ilonk

The text appears to be hard-wrapped: there is a newline in the middle of sentences and a blank line between paragraphs. Four or more newlines in a row mark the split between tales (documents).

## Clean Text
First, we want to replace the mid-sentence newlines with spaces and collapse the double newline between paragraphs into a single newline, which is easy to do with a couple of regular expressions.

In [2]:
import re

with open("data/merged_clean.txt", "r") as f:
    document = f.read()

# Replace all matches of one single newline but not two or more in a row with a space
document = re.sub(r"(?<!\n)\n(?!\n)", " ", document)
# Replace all matches of two newlines in a row but not three or more in a row with a single newline
document = re.sub(r"(?<!\n)\n\n(?!\n)", "\n", document)

So now we have paragraphs split by (mostly) single newlines and tales (or documents) split by multiple newlines.

In [3]:
lines = document.splitlines()
skip = False
for i, line in enumerate(lines[:40]):
    if 0 <= i <= 10 or 30 <= i:
        print(line)
    elif not skip:
        skip = True
        print('\n...\n')

 Lovely Ilonka
There was once a king's son who told his father that he wished to marry.
'No, no!' said the king; 'you must not be in such a hurry. Wait till you have done some great deed. My father did not let me marry till I had won the golden sword you see me wear.'
The prince was much disappointed, but he never dreamed of disobeying his father, and he began to think with all his might what he could do. It was no use staying at home, so one day he wandered out into the world to try his luck, and as he walked along he came to a little hut in which he found an old woman crouching over the fire.
'Good evening, mother. I see you have lived long in this world; do you know anything about the three bulrushes?'
'Yes, indeed, I've lived long and been much about in the world, but I have never seen or heard anything of what you ask. Still, if you will wait till to-morrow I may be able to tell you something.'
Well, he waited till the morning, and quite early the old woman appeared and took out a
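
To make the two substitutions easier to follow, here is a minimal sketch on a made-up string (the `sample` text is hypothetical, not taken from the dataset). It applies the same two patterns as above and shows that hard wraps become spaces, paragraph breaks become single newlines, and the run of four newlines is left alone.

```python
import re

# Toy input (made up for illustration): hard-wrapped paragraphs,
# then a run of four newlines, then the next title.
sample = (
    "Title\n\nFirst line of a\nwrapped paragraph.\n\n"
    "Second\nparagraph.\n\n\n\nNext Title\n\nBody."
)

# Step 1: a lone newline (a hard wrap) becomes a space.
unwrapped = re.sub(r"(?<!\n)\n(?!\n)", " ", sample)
# Step 2: exactly two newlines (a paragraph break) become one.
cleaned = re.sub(r"(?<!\n)\n\n(?!\n)", "\n", unwrapped)

print(repr(cleaned))
```

The order matters: if the double newlines were collapsed first, the resulting single newlines would then be turned into spaces by the other pattern and the paragraph breaks would be lost.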

Next, we want to split each tale from the rest. Since we know that each tale is separated by multiple newlines, we can use regex to split into multiple individual tales.

In [4]:
# Split the document wherever there are four or more newlines in a row
tales = re.split(r"\n{4,}", document)
for tale in tales[:2]:
    print(tale[:750], end='\n\n')

 Lovely Ilonka
There was once a king's son who told his father that he wished to marry.
'No, no!' said the king; 'you must not be in such a hurry. Wait till you have done some great deed. My father did not let me marry till I had won the golden sword you see me wear.'
The prince was much disappointed, but he never dreamed of disobeying his father, and he began to think with all his might what he could do. It was no use staying at home, so one day he wandered out into the world to try his luck, and as he walked along he came to a little hut in which he found an old woman crouching over the fire.
'Good evening, mother. I see you have lived long in this world; do you know anything about the three bulrushes?'
'Yes, indeed, I've lived long and b

Lucky Luck
Once upon a time there was a king who had an only son. When the lad was about eighteen years old his father had to go to fight in a war against a neighbouring country, and the king led his troops in person. He bade his son act as Regent 

In [5]:
tales[1][:250]

'Lucky Luck\nOnce upon a time there was a king who had an only son. When the lad was about eighteen years old his father had to go to fight in a war against a neighbouring country, and the king led his troops in person. He bade his son act as Regent in'

We are left with 1211 tales.

In [6]:
len(tales)

1211
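
It is worth checking the split for empty pieces before trusting that count. As a small sketch on a made-up string (`toy` is hypothetical, and whether the real file ends this way is an assumption): when the text ends with a run of newlines, `re.split` returns an empty string as the final element.

```python
import re

# Toy document (hypothetical): two tales separated by four newlines,
# with a trailing run of newlines at the end.
toy = "Tale A\nbody a\n\n\n\nTale B\nbody b\n\n\n\n"

pieces = re.split(r"\n{4,}", toy)
print(pieces)  # ['Tale A\nbody a', 'Tale B\nbody b', '']

# Keep only the pieces that contain something other than whitespace.
non_empty = [p for p in pieces if p.strip()]
print(len(pieces), len(non_empty))  # 3 2
```

Running the same `p.strip()` filter over `tales` would be one way to count how many pieces actually contain text.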

## Final Processing

This method of chunking into sequences is only going to work because the dataset is so small: ~12MB. I'm going to append an end-of-sequence (EOS) token to each tale, combine all the tales into one sample, and then chunk that sample into sequences after tokenization. This way the model still trains on the endings of the fairy and mythology tales, and they won't get cut off by chunking.

In [7]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m-deduped")
tokenizer.special_tokens_map

{'bos_token': '<|endoftext|>',
 'eos_token': '<|endoftext|>',
 'unk_token': '<|endoftext|>'}

It's important to spot check your dataset, which is how I noticed that the last tale was an empty string, so I drop it. I also know that some tales have multiple newlines between paragraphs, so I collapse them here. Each tale also has a leading space stripped, gets a `# ` prefix, and has the EOS token appended.

In [8]:
processed_tales = []
for tale in tales[:-1]:
    if tale.startswith(' '):
        tale = tale[1:]
    tale = re.sub(r"\n{2,}", "\n", tale)
    tale = '# ' + tale + tokenizer.special_tokens_map['eos_token']
    processed_tales.append(tale)
for tale in processed_tales[:2]:
    print(tale[:300], end='\n\n')
    print('...', end='\n\n')
    print(tale[-300:], end='\n\n')

# Lovely Ilonka
There was once a king's son who told his father that he wished to marry.
'No, no!' said the king; 'you must not be in such a hurry. Wait till you have done some great deed. My father did not let me marry till I had won the golden sword you see me wear.'
The prince was much disappoint

...

e had been deceived, he vowed he would be revenged; so he gave orders that the swineherd, his wife and daughter should all be hanged; and so they were.
The next day the king was married, with great rejoicings, to the fair Ilonka; and if they are not yet dead--why, they are still living.<|endoftext|>

# Lucky Luck
Once upon a time there was a king who had an only son. When the lad was about eighteen years old his father had to go to fight in a war against a neighbouring country, and the king led his troops in person. He bade his son act as Regent in his absence, but ordered him on no account to m

...

ul servant alive and well.
When the old king saw this he foamed with rage, stared wi
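
The per-tale cleanup can also be checked on a tiny input. The sketch below wraps the same steps in a hypothetical helper, `process_tale` (the name and the toy string are mine, not part of the notebook), and hard-codes the EOS string reported by the tokenizer above so it runs without loading a model.

```python
import re

EOS = "<|endoftext|>"  # the eos_token reported by the tokenizer above

def process_tale(tale, eos=EOS):
    """Hypothetical helper mirroring the per-tale cleanup in the loop above."""
    if tale.startswith(" "):
        tale = tale[1:]                   # drop a single leading space
    tale = re.sub(r"\n{2,}", "\n", tale)  # collapse runs of newlines
    return "# " + tale + eos              # heading marker + EOS token

toy = " A Toy Title\nFirst paragraph.\n\n\nSecond paragraph."
print(process_tale(toy))
```

Keeping the cleanup in one function makes it easy to assert on small cases before running it over all the tales.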

Finally, I join all the tales into one string since the tokenizer is going to chunk the dataset for us.

In [9]:
processed_tales = ''.join(processed_tales)

## Convert to Hugging Face Dataset

We need to pass `processed_tales` in wrapped in a list, or `datasets` will make each character of the string its own sample.

In [10]:
from datasets import Dataset

tale_dict = {"tales": [processed_tales]}
dataset = Dataset.from_dict(tale_dict)
dataset

Dataset({
    features: ['tales'],
    num_rows: 1
})

## Tokenizing Dataset

This tokenization method is taken from the [Training a causal language model from scratch](https://huggingface.co/learn/nlp-course/chapter7/6?fw=pt#preparing-the-dataset) section of the Hugging Face course.

It will take the single dataset example, tokenize it, and chunk it into sequences of `max_sequence_length` tokens, throwing away only the leftover overflow at the very end (instead of the leftover of every tale, as would happen if we processed each tale individually).

In [11]:
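from transformers import AutoTokenizer

def tokenize(max_sequence_length, tokenizer="EleutherAI/pythia-160m-deduped", name='tales_pythia'):
    # `tokenizer` is a model name; load it under a separate name so the argument isn't shadowed
    tok = AutoTokenizer.from_pretrained(tokenizer)

    def tokenize_ds(text):
        # Split the text into consecutive chunks of at most max_sequence_length tokens
        outputs = tok(
            text["tales"],
            truncation=True,
            max_length=max_sequence_length,
            return_overflowing_tokens=True,
            return_length=True,
        )
        # Keep only the full-length chunks; a shorter final chunk is dropped
        input_batch = []
        for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
            if length == max_sequence_length:
                input_batch.append(input_ids)
        return {"input_ids": input_batch}

    # Uses the global `dataset` created in the previous section
    tokenized_dataset = dataset.map(
        tokenize_ds, batched=True, remove_columns=dataset.column_names
    )

    # One saved dataset per sequence length, e.g. data/tales_pythia_128.hf
    tokenized_dataset.save_to_disk(f"data/{name}_{max_sequence_length}.hf")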

In [None]:
for i in [128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280]:
    tokenize(max_sequence_length=i)
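
To illustrate what the length filter does, here is a plain-Python sketch of the same idea on a toy list of integers standing in for token ids. `chunk_full_blocks` is a hypothetical helper, not the tokenizer's implementation, and it assumes no overlap between chunks and no special tokens added by the tokenizer.

```python
def chunk_full_blocks(token_ids, block_size):
    """Split a flat list of token ids into consecutive blocks of exactly
    block_size items, dropping a shorter final remainder."""
    blocks = []
    for start in range(0, len(token_ids), block_size):
        block = token_ids[start:start + block_size]
        if len(block) == block_size:  # same check as `length == max_sequence_length`
            blocks.append(block)
    return blocks

toy_ids = list(range(10))  # stand-in for the token ids of the joined tales
print(chunk_full_blocks(toy_ids, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because all the tales are joined into a single sample first, only one remainder is discarded per sequence length, rather than one per tale.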

There are undoubtedly better ways to do this, but since this dataset is so small, quick and dirty is fine.