# Organ Tablature OCR Dataset Creator
---

The complete organ tablature OCR dataset requires almost 300 GB of disk space, so the generator is distributed instead of the dataset itself.

Running the following code blocks generates the complete organ tablature OCR dataset.
All source images needed to generate the dataset are provided in the `data` folder.
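
Because the generated dataset needs almost 300 GB, it can help to check the free disk space before starting. The following is a minimal sketch using the standard library; the helper name `has_enough_space` and the threshold of 300 * 10**9 bytes are assumptions made for illustration and are not part of the generator.

```python
import shutil

# Approximate size quoted above, taken here as 300 * 10**9 bytes (an assumption).
REQUIRED_BYTES = 300 * 10**9


def has_enough_space(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


# Warn early instead of failing halfway through the generation.
if not has_enough_space("."):
    print("Warning: less than roughly 300 GB free on this drive.")
```

The check only looks at the drive that holds the given path, so pass the location where the `data` folder actually lives.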

The following packages need to be installed for the generator to work:
* **Pillow**: `pip install Pillow`
* **Augmentor**: `pip install Augmentor`
* **NumPy**: `pip install numpy`
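
Whether the three packages are available can be checked up front. This is a small sketch, not part of the generator; the helper name `missing_packages` is hypothetical. Note that the import name differs from the pip name for Pillow, which is imported as `PIL`.

```python
import importlib.util

# Maps the pip package name to the module name used in `import` statements.
REQUIRED_MODULES = {"Pillow": "PIL", "Augmentor": "Augmentor", "numpy": "numpy"}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
```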


---
## Folder Structure


---

The `src` folder contains all the Python code of the data generator and data augmentor program.

The following code sets it as the working directory for the program.

In [None]:
import os

if os.path.basename(os.getcwd()) != 'src':
    os.chdir('src')

---

The `data` directory contains all tablature images and is structured as follows:
* `generatorSources`: Contains source files for the tablature generator
    * `backgrounds`: source images for backgrounds and image borders
    * `duration`: source images for duration tablature characters
    * `note`: source images for note pitch tablature characters
    * `rest`: source images for rest tablature characters
    * `special`: source images for special tablature characters (measure lines, repetition signs, text blocks, ...)
* `realdataSources`: Contains annotated real organ tablature staves from two tablature books by Ammerbach ("Orgel oder Instrument Tabulaturbuch" and "Ein New Künstlich Tabulaturbuch"). These images are downloaded by the script further below.
    * `trainSet`: 1000 tablature staves (500 from each book)
    * `validationSet`: 400 tablature staves (200 from each book)
    * `testSet`: 1000 tablature staves (500 from each book)
* `generatorOutput`: Output directory for the data generator
    * `trainSetA`: the images generated for the train set (before augmentation)
    * `validationSetA`: the images generated for the validation set (before augmentation)
* `datasetOutput`: Output directory for the final datasets
    * `trainSetA`: the images of the final train set
    * `validationSetA`: the images of the final validation set
    * `testSetA`: the images of the final test set
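
The generator source folders listed above ship with the repository, so their presence can be verified before generating anything. The sketch below assumes the working directory is `src` (as set earlier) and only checks the `generatorSources` subfolders; `missing_dirs` is a hypothetical helper, and whether the output folders must exist beforehand is not covered here.

```python
from pathlib import Path

# Subfolders named in the list above, relative to the `data` directory.
EXPECTED_SOURCE_DIRS = [
    "generatorSources/backgrounds",
    "generatorSources/duration",
    "generatorSources/note",
    "generatorSources/rest",
    "generatorSources/special",
]


def missing_dirs(data_root, expected=EXPECTED_SOURCE_DIRS):
    """Return the expected subfolders that do not exist below `data_root`."""
    return [rel for rel in expected if not (Path(data_root) / rel).is_dir()]


# An empty list means all generator source folders were found.
print(missing_dirs("../data"))
```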

---

The following script downloads the annotated real-data source images (over 700 MB) and extracts them to the `realdataSources` subfolder.

In [None]:
from datasetGenerator.datasetGenerator.generatorUtility import download_source_images

data_url = "https://box.uni-marburg.de/index.php/s/MENZtcfuWDeDHi8/download"
data_size = 770034652
data_output_path = "../data"
data_zip_path = "../data/realdataSources/realdataSources.zip"

download_source_images(data_url, data_size, data_output_path, data_zip_path)


---
## Dataset Generator

The generator is used to generate artificial tablature staves for the train and validation sets.

---

In [None]:
from datasetGenerator.datasetGeneratorMain import generate_dataset

In [None]:
generator_source_folder = "../data/generatorSources/"
generator_final_augment = False  # the final augmentation step is omitted because augmentation occurs separately later

### TrainSet (140,000 generated images)

In [None]:
generator_output_folder = "../data/generatorOutput/trainSetA/"
generator_output_index_start = 0
generator_num_of_samples = 140000

generate_dataset(input_folder=generator_source_folder,
                 output_folder=generator_output_folder,
                 generate_num=generator_num_of_samples, 
                 output_index_start=generator_output_index_start,
                 final_augment=generator_final_augment)

### ValidationSet (8,000 generated images)

In [None]:
generator_output_folder = "../data/generatorOutput/validationSetA/"
generator_output_index_start = 0
generator_num_of_samples = 8000

generate_dataset(input_folder=generator_source_folder,
                 output_folder=generator_output_folder,
                 generate_num=generator_num_of_samples, 
                 output_index_start=generator_output_index_start,
                 final_augment=generator_final_augment)

### TestSet (0 generated images)
The TestSet consists only of real images.

---
## Dataset Augmentor
The generated dataset is enlarged by using data augmentation.
The generated tablature staves are combined with real staves.
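
Real and generated staves are written to the same output folder, so the generated images need a start index that does not collide with the real ones. The values used in this notebook (100000 for the train set, 10000 for the validation set) match the number of real images times their augmentations. The sketch below reproduces that arithmetic under the assumption that each input image yields exactly `augment_num` consecutively numbered outputs; this behaviour of `augment_dataset` is inferred from the offsets, not confirmed. Both function names are hypothetical.

```python
def generated_start_index(num_real_images, real_augmentations):
    """First free output index after the augmented real images, assuming
    each real image yields exactly `real_augmentations` outputs."""
    return num_real_images * real_augmentations


def total_images(num_real, real_aug, num_generated, generated_aug):
    """Expected size of a final set under the same assumption."""
    return num_real * real_aug + num_generated * generated_aug


print(generated_start_index(1000, 100))     # train set offset: 100000
print(generated_start_index(400, 25))       # validation set offset: 10000
print(total_images(1000, 100, 140000, 5))   # train set size: 800000
print(total_images(400, 25, 8000, 5))       # validation set size: 50000
```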

---

In [None]:
from datasetGenerator.datasetAugmentorMain import augment_dataset

### TrainSet (realData: 1,000 images, 100 augmentations per image)

In [None]:
augmentor_input_folder = "../data/realdataSources/trainSet/"
augmentor_input_indices = (0, 1000)

augmentor_output_folder = "../data/datasetOutput/trainSetA/"
augmentor_num_augmentations = 100
augmentor_output_index_start = 0

augment_dataset(input_folder=augmentor_input_folder,
                input_indices=augmentor_input_indices,
                output_folder=augmentor_output_folder, 
                augment_num=augmentor_num_augmentations,
                output_index_start=augmentor_output_index_start)

### TrainSet (generatedData: 140,000 images, 5 augmentations per image)

In [None]:
augmentor_input_folder = "../data/generatorOutput/trainSetA/"
augmentor_input_indices = (0, 140000)

augmentor_output_folder = "../data/datasetOutput/trainSetA/"
augmentor_num_augmentations = 5
augmentor_output_index_start = 100000

augment_dataset(input_folder=augmentor_input_folder,
                input_indices=augmentor_input_indices,
                output_folder=augmentor_output_folder, 
                augment_num=augmentor_num_augmentations,
                output_index_start=augmentor_output_index_start)

### ValidationSet (realData: 400 images, 25 augmentations per image)

In [None]:
augmentor_input_folder = "../data/realdataSources/validationSet/"
augmentor_input_indices = (0, 400)

augmentor_output_folder = "../data/datasetOutput/validationSetA/"
augmentor_num_augmentations = 25
augmentor_output_index_start = 0

augment_dataset(input_folder=augmentor_input_folder,
                input_indices=augmentor_input_indices,
                output_folder=augmentor_output_folder, 
                augment_num=augmentor_num_augmentations,
                output_index_start=augmentor_output_index_start)

### ValidationSet (generatedData: 8,000 images, 5 augmentations per image)

In [None]:
augmentor_input_folder = "../data/generatorOutput/validationSetA/"
augmentor_input_indices = (0, 8000)

augmentor_output_folder = "../data/datasetOutput/validationSetA/"
augmentor_num_augmentations = 5
augmentor_output_index_start = 10000

augment_dataset(input_folder=augmentor_input_folder,
                input_indices=augmentor_input_indices,
                output_folder=augmentor_output_folder, 
                augment_num=augmentor_num_augmentations,
                output_index_start=augmentor_output_index_start)

### TestSet (realData: 1,000 images, no augmentations)

In [None]:
augmentor_input_folder = "../data/realdataSources/testSet/"
augmentor_input_indices = (0, 1000)

augmentor_output_folder = "../data/datasetOutput/testSetA/"
augmentor_num_augmentations = 0
augmentor_output_index_start = 0

augment_dataset(input_folder=augmentor_input_folder,
                input_indices=augmentor_input_indices,
                output_folder=augmentor_output_folder, 
                augment_num=augmentor_num_augmentations,
                output_index_start=augmentor_output_index_start)

---

The final datasets are found inside the `data/datasetOutput` directory.
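
As a quick sanity check, the files in each output subfolder can be counted. This is a minimal sketch, assuming the working directory is still `src`; `count_files` is a hypothetical helper, and the counts include every file it finds, so any annotation or label files stored next to the images are counted as well.

```python
from pathlib import Path


def count_files(data_output_dir):
    """Map each immediate subfolder name to the number of files below it."""
    root = Path(data_output_dir)
    if not root.is_dir():
        return {}
    return {sub.name: sum(1 for p in sub.rglob("*") if p.is_file())
            for sub in sorted(root.iterdir()) if sub.is_dir()}


for name, count in count_files("../data/datasetOutput").items():
    print(f"{name}: {count} files")
```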




