# Organ Tablature OCR Data Set Creator
---

The complete organ tablature OCR data set requires almost 90GB of disk space, so we distribute the generator instead of the data set itself.
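
Because the full build needs this much space, it can be worth checking the free disk space before starting. The following is a minimal sketch using only the standard library; the helper name `has_enough_space` and the 90 GB threshold (taken from the figure above) are illustrative assumptions and not part of the provided code.

```python
import shutil

REQUIRED_GB = 90  # approximate size of the full data set, as stated above


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if the file system containing `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if not has_enough_space("."):
    print(f"Warning: less than {REQUIRED_GB} GB free; the full build may not fit.")
```

If the check fails, the parameters in this notebook can be reduced to build a smaller data set.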

The following packages need to be installed to run the provided code:
* **Numpy**: `pip install numpy`
* **Pillow**: `pip install Pillow`
* **Augmentor**: `pip install Augmentor`
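
A quick way to confirm that the three packages are available is to look up their import names without actually importing them. This is a small sketch, not part of the provided code; `missing_packages` is a hypothetical helper. Note that Pillow is imported under the name `PIL`.

```python
import importlib.util

# Maps import name -> pip package name (they differ only for Pillow).
REQUIRED_MODULES = {"numpy": "numpy", "PIL": "Pillow", "Augmentor": "Augmentor"}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the pip names of all modules that cannot be found."""
    return [pip_name for module, pip_name in modules.items()
            if importlib.util.find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
```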

## Building the data set with this notebook

Running the following code blocks builds the complete organ tablature OCR data set.
This includes downloading the real tablature staves, running the data generator, and running the data augmentor.
The parameters (number of images, number of augmentations, ...) set in this notebook are the same as those used in the experiments for the paper.
They can be changed to build a data set of a different size.

## Building the data set from the command line

The downloader, data generator and data augmentor can also be run from the command line by calling the `Main`-modules from the `src` directory.
* Downloader: `python datasetDownloaderMain.py <args>`
* Generator: `python datasetGeneratorMain.py <args>`
* Augmentor: `python datasetAugmentorMain.py <args>`

The required arguments are specified in the documentation of the respective Python modules.
To further customize the generator and augmentor, the variables specified inside the `Main`-modules can be changed.


---
## Program Setup

---

The following code sets the `src` folder as the working directory for the program and imports all required functions.

In [None]:
import os

if os.path.basename(os.getcwd()) != 'src':
    os.chdir('src')

from datasetDownloaderMain import download_dataset
from datasetGeneratorMain import generate_dataset
from datasetAugmentorMain import augment_dataset

---
## RealData Download

---

The annotated realData source images (over 700MB) are downloaded and extracted to `data/realdataSources`.

In [None]:
download_zip_path = "../data/realdataSources/realdataSources.zip"
download_output_path = "../data"
download_delete_zip_file = True

download_dataset(download_zip_path, download_output_path, download_delete_zip_file)

---
## Data Set Generator

The data set generator is used to generate artificial tablature staves for the train and validation sets.

---

In [None]:
generator_source_folder = "../data/generatorSources/"
generator_final_augment = False  # the final augmentation step is omitted because augmentation occurs separately later

### TrainSet (20,000 generated images)

In [None]:
generator_output_folder = "../data/generatorOutput/trainSetA/"
generator_output_index_start = 0
generator_num_of_samples = 20000

generate_dataset(input_folder=generator_source_folder,
                 output_folder=generator_output_folder,
                 generate_num=generator_num_of_samples, 
                 output_index_start=generator_output_index_start,
                 final_augment=generator_final_augment)

### ValidationSet (8,000 generated images)

In [None]:
generator_output_folder = "../data/generatorOutput/validationSetA/"
generator_output_index_start = 0
generator_num_of_samples = 8000

generate_dataset(input_folder=generator_source_folder,
                 output_folder=generator_output_folder,
                 generate_num=generator_num_of_samples, 
                 output_index_start=generator_output_index_start,
                 final_augment=generator_final_augment)

### TestSet (0 generated images)
The TestSet consists only of real images.

---
## Data Set Augmentor
The generated data set is enlarged through data augmentation.
The augmented generated tablature staves are combined with augmented real staves in the same output folders.

---

### TrainSet (realData: 1,000 images, 100 augmentations per image)

In [None]:
augmentor_input_folder = "../data/realdataSources/trainSet/"
augmentor_input_indices = (0, 1000)

augmentor_output_folder = "../data/datasetOutput/trainSetA/"
augmentor_num_augmentations = 100
augmentor_output_index_start = 0

augment_dataset(input_folder=augmentor_input_folder,
                input_indices=augmentor_input_indices,
                output_folder=augmentor_output_folder, 
                augment_num=augmentor_num_augmentations,
                output_index_start=augmentor_output_index_start)

### TrainSet (generatedData: 20,000 images, 5 augmentations per image)

In [None]:
augmentor_input_folder = "../data/generatorOutput/trainSetA/"
augmentor_input_indices = (0, 20000)

augmentor_output_folder = "../data/datasetOutput/trainSetA/"
augmentor_num_augmentations = 5
augmentor_output_index_start = 100000

augment_dataset(input_folder=augmentor_input_folder,
                input_indices=augmentor_input_indices,
                output_folder=augmentor_output_folder, 
                augment_num=augmentor_num_augmentations,
                output_index_start=augmentor_output_index_start)

### ValidationSet (realData: 400 images, 25 augmentations per image)

In [None]:
augmentor_input_folder = "../data/realdataSources/validationSet/"
augmentor_input_indices = (0, 400)

augmentor_output_folder = "../data/datasetOutput/validationSetA/"
augmentor_num_augmentations = 25
augmentor_output_index_start = 0

augment_dataset(input_folder=augmentor_input_folder,
                input_indices=augmentor_input_indices,
                output_folder=augmentor_output_folder, 
                augment_num=augmentor_num_augmentations,
                output_index_start=augmentor_output_index_start)

### ValidationSet (generatedData: 8,000 images, 5 augmentations per image)

In [None]:
augmentor_input_folder = "../data/generatorOutput/validationSetA/"
augmentor_input_indices = (0, 8000)

augmentor_output_folder = "../data/datasetOutput/validationSetA/"
augmentor_num_augmentations = 5
augmentor_output_index_start = 10000

augment_dataset(input_folder=augmentor_input_folder,
                input_indices=augmentor_input_indices,
                output_folder=augmentor_output_folder, 
                augment_num=augmentor_num_augmentations,
                output_index_start=augmentor_output_index_start)
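
The `output_index_start` values of 100000 and 10000 used for the generated data match the number of images produced from the real data (1,000 × 100 and 400 × 25), so real and generated images can share one output folder without index collisions. The sketch below reproduces this arithmetic. It assumes that every augmentation writes exactly one output image and that output indices are consecutive; this is inferred from the parameter values and is not confirmed by the module documentation. `next_index_start` is a hypothetical helper.

```python
def next_index_start(index_start, num_images, augment_num):
    """First free output index after augmenting `num_images` images
    `augment_num` times each (assumes one output file per augmentation)."""
    return index_start + num_images * augment_num


# Offsets for the generated data, placed after the augmented real data.
train_generated_start = next_index_start(0, 1000, 100)
validation_generated_start = next_index_start(0, 400, 25)
print(train_generated_start, validation_generated_start)  # 100000 10000

# Resulting set sizes under the same assumption.
train_total = next_index_start(train_generated_start, 20000, 5)
validation_total = next_index_start(validation_generated_start, 8000, 5)
print(train_total, validation_total)  # 200000 50000
```

When the number of images or augmentations is changed, the start index for the generated data should be recomputed in the same way.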

### TestSet (realData: 1,000 images, no augmentations)

In [None]:
augmentor_input_folder = "../data/realdataSources/testSet/"
augmentor_input_indices = (0, 1000)

augmentor_output_folder = "../data/datasetOutput/testSetA/"
augmentor_num_augmentations = 0
augmentor_output_index_start = 0

augment_dataset(input_folder=augmentor_input_folder,
                input_indices=augmentor_input_indices,
                output_folder=augmentor_output_folder, 
                augment_num=augmentor_num_augmentations,
                output_index_start=augmentor_output_index_start)

---

The `data` directory now contains all tablature images and is structured as follows:
* `generatorSources`: Contains source files for the tablature generator
    * `backgrounds`: source images for backgrounds and image borders
    * `duration`: source images for duration tablature characters
    * `note`: source images for note pitch tablature characters
    * `rest`: source images for rest tablature characters
    * `special`: source images for special tablature characters (measure lines, repetition signs, text blocks, ...)
* `realdataSources`: Contains annotated real organ tablature staves from two tablature books by Ammerbach ("Orgel oder Instrument Tabulaturbuch" and "Ein New Künstlich Tabulaturbuch"). These images are downloaded in the RealData Download step.
    * `trainSet`: 1000 tablature staves (500 from each book)
    * `validationSet`: 400 tablature staves (200 from each book)
    * `testSet`: 1000 tablature staves (500 from each book)
* `generatorOutput`: Output directory for the data generator
    * `trainSetA`: the images generated for the train set (before augmentation)
    * `validationSetA`: the images generated for the validation set (before augmentation)
* `datasetOutput`: Output directory for the final data sets
    * `trainSetA`: the images of the final train set
    * `validationSetA`: the images of the final validation set
    * `testSetA`: the images of the final test set
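
To check the result of the build, the files in each output folder can be counted. The following is a minimal sketch, assuming the notebook is still running with `src` as the working directory; `count_files` is a hypothetical helper. It counts all regular files, so if the output folders hold files other than images, the numbers will be higher than the image counts.

```python
from pathlib import Path


def count_files(root):
    """Count the regular files (recursively) in each direct subfolder of `root`."""
    root = Path(root)
    return {sub.name: sum(1 for p in sub.rglob("*") if p.is_file())
            for sub in sorted(root.iterdir()) if sub.is_dir()}


output_root = Path("../data/datasetOutput")
if output_root.is_dir():
    for name, count in count_files(output_root).items():
        print(f"{name}: {count} files")
```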



