# Image generation for text recognition
Author: [Ryan Parker](https://github.com/rparkr)

Updated on: 31-Mar-2022

## Purpose
Create full-page images of text to use as training data for a full-page handwriting recognition model.

I use `trdg` (TextRecognitionDataGenerator) to create lines of text that resemble the text in my handwritten journals.

## Resources
* [`TextRecognitionDataGenerator` GitHub repo](https://github.com/Belval/TextRecognitionDataGenerator)
* [`trdg` docs](https://textrecognitiondatagenerator.readthedocs.io/en/latest/module.html)
* [`trdg` on PyPI](https://pypi.org/project/trdg/)



## `trdg` install and setup

In [3]:
# See: https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/#How-to-use-Pip-from-the-Jupyter-Notebook
# and: https://github.com/microsoft/vscode-jupyter/wiki/Installing-Python-packages-in-Jupyter-Notebooks
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install trdg

# Alternative method, using the %pip magic command
# %pip install trdg

Collecting trdg
  Using cached trdg-1.7.0-py3-none-any.whl (91.2 MB)
Collecting beautifulsoup4>=4.6.0
  Using cached beautifulsoup4-4.10.0-py3-none-any.whl (97 kB)
Collecting diffimg==0.2.3
  Using cached diffimg-0.2.3.tar.gz (4.1 kB)
Collecting opencv-python>=4.2.0.32
  Using cached opencv_python-4.5.5.64-cp36-abi3-win_amd64.whl (35.4 MB)
Collecting pillow>=7.0.0
  Downloading Pillow-9.0.1-cp37-cp37m-win_amd64.whl (3.2 MB)
Collecting tqdm>=4.23.0
  Using cached tqdm-4.63.1-py2.py3-none-any.whl (76 kB)
Collecting numpy<1.17,>=1.16.4
  Downloading numpy-1.16.6-cp37-cp37m-win_amd64.whl (11.9 MB)
Collecting requests>=2.20.0
  Using cached requests-2.27.1-py2.py3-none-any.whl (63 kB)
Collecting soupsieve>1.2
  Using cached soupsieve-2.3.1-py3-none-any.whl (37 kB)
Collecting urllib3<1.27,>=1.21.1
  Using cached urllib3-1.26.9-py2.py3-none-any.whl (138 kB)
Collecting charset-normalizer~=2.0.0
  Using cached charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
Collecting idna<4,>=2.5
  Using ca

In [4]:
%pip install requests
%pip install tqdm

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [1]:
import requests                     # fetch web content
from PIL import Image, ImageDraw    # manipulate images
import os                           # work with files
import numpy as np                  # array computations
import random                       # sample from distributions
from tqdm.notebook import tqdm      # progress bars

# generate images of text
from trdg.generators import (
    GeneratorFromDict,
    GeneratorFromRandom,
    GeneratorFromStrings,
    GeneratorFromWikipedia,
)

Missing modules for handwritten text generation.


In [2]:
help(GeneratorFromDict)

Help on class GeneratorFromDict in module trdg.generators.from_dict:

class GeneratorFromDict(builtins.object)
 |  GeneratorFromDict(count=-1, length=1, allow_variable=False, fonts=[], language='en', size=32, skewing_angle=0, random_skew=False, blur=0, random_blur=False, background_type=0, distorsion_type=0, distorsion_orientation=0, is_handwritten=False, width=-1, alignment=1, text_color='#282828', orientation=0, space_width=1.0, character_spacing=0, margins=(5, 5, 5, 5), fit=False, output_mask=False, word_split=False, image_dir='C:\\Users\\Ryan\\.conda\\envs\\trdg\\lib\\site-packages\\trdg\\generators\\images', stroke_width=0, stroke_fill='#282828', image_mode='RGB')
 |  
 |  Generator that uses words taken from pre-packaged dictionaries
 |  
 |  Methods defined here:
 |  
 |  __init__(self, count=-1, length=1, allow_variable=False, fonts=[], language='en', size=32, skewing_angle=0, random_skew=False, blur=0, random_blur=False, background_type=0, distorsion_type=0, distorsion_orienta

## Fonts
I searched [Google Fonts](https://fonts.google.com) for handwriting-style fonts that looked similar to my handwriting. I found six fonts which I will use to generate training data:
* [Reenie Beanie](https://fonts.google.com/specimen/Reenie+Beanie), by James Grieshaber
* [Nanum Brush Script](https://fonts.google.com/specimen/Nanum+Brush+Script), by Sandoll
* [Waiting for the Sunrise](https://fonts.google.com/specimen/Waiting+for+the+Sunrise), by Kimberly Geswein
* [Shadows Into Light](https://fonts.google.com/specimen/Shadows+Into+Light), by Kimberly Geswein
* [Architects Daughter](https://fonts.google.com/specimen/Architects+Daughter), by Kimberly Geswein
* [Swanky and Moo Moo](https://fonts.google.com/specimen/Swanky+and+Moo+Moo#standard-styles), by Kimberly Geswein

Each font's `.ttf` file can be downloaded from the Google Fonts links above, or from http://bootes.ethz.ch/fonts/, which is linked from the [Google Fonts GitHub repo](https://github.com/google/fonts).

## Resources
* [Pillow's `ImageFont` module](https://pillow.readthedocs.io/en/stable/reference/ImageFont.html), which is used by `trdg` to generate images of text
* [StackOverflow question](https://stackoverflow.com/questions/24085996/how-i-can-load-a-font-file-with-pil-imagefont-truetype-without-specifying-the-ab) showing that Pillow's `ImageFont` uses paths to `.ttf` TrueType font files to place text on images.

In [3]:
fonts = {
    'Reenie Beanie': {'URL': 'https://fonts.google.com/specimen/Reenie+Beanie',
                      'Author': 'James Grieshaber',
                      'Download': 'http://bootes.ethz.ch/fonts/reeniebeanie/ReenieBeanie.ttf',
                      'Other downloads': 'http://bootes.ethz.ch/fonts/?reeniebeanie'},
    'Nanum Brush Script': {'URL': 'https://fonts.google.com/specimen/Nanum+Brush+Script',
                           'Author': 'Sandoll',
                           'Download': 'http://bootes.ethz.ch/fonts/nanumbrushscript/NanumBrushScript-Regular.ttf',
                           'Other downloads': 'http://bootes.ethz.ch/fonts/?nanumbrushscript'},
    'Waiting for the Sunrise': {'URL': 'https://fonts.google.com/specimen/Waiting+for+the+Sunrise',
                                'Author': 'Kimberly Geswein',
                                'Download': 'http://bootes.ethz.ch/fonts/waitingforthesunrise/WaitingfortheSunrise.ttf',
                                'Other downloads': 'http://bootes.ethz.ch/fonts/?waitingforthesunrise'},
    'Shadows Into Light': {'URL': 'https://fonts.google.com/specimen/Shadows+Into+Light',
                           'Author': 'Kimberly Geswein',
                           'Download': 'http://bootes.ethz.ch/fonts/shadowsintolight/ShadowsIntoLight.ttf',
                           'Other downloads': 'http://bootes.ethz.ch/fonts/?shadowsintolight'},
    'Architects Daughter': {'URL': 'https://fonts.google.com/specimen/Architects+Daughter',
                            'Author': 'Kimberly Geswein',
                            'Download': 'http://bootes.ethz.ch/fonts/architectsdaughter/ArchitectsDaughter-Regular.ttf',
                            'Other downloads': 'http://bootes.ethz.ch/fonts/?architectsdaughter'},
    'Swanky and Moo Moo': {'URL': 'https://fonts.google.com/specimen/Swanky+and+Moo+Moo#standard-styles',
                           'Author': 'Kimberly Geswein',
                           'Download': 'http://bootes.ethz.ch/fonts/swankyandmoomoo/SwankyandMooMoo.ttf',
                           'Other downloads': 'http://bootes.ethz.ch/fonts/?swankyandmoomoo'}
}

Download fonts

In [5]:
# Create a folder to store the fonts in
if not os.path.exists('./fonts'):
    os.mkdir('./fonts')

# Loop through fonts and download to folder
for font in fonts:
    fontname = font.replace(' ', '_')
    response = requests.get(fonts[font]['Download'])
    response.raise_for_status()  # stop here if a download fails
    with open(f"./fonts/{fontname}.ttf", mode='wb') as savefile:
        savefile.write(response.content)

In [6]:
fontfiles = []

# Loop through fonts and build the list of saved font file paths
for font in fonts:
    fontname = font.replace(' ', '_')
    fontfiles.append(f"./fonts/{fontname}.ttf")
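Because the fonts come from a web download, a failed request can leave an HTML error page saved under a font filename. The sketch below is one way to sanity-check the saved files before handing them to `trdg`. The helper names `looks_like_font` and `find_bad_fonts` are my own (hypothetical), and the check only inspects the first four bytes of each file, so it does not prove the font is fully valid.

```python
from pathlib import Path

# First four bytes of common font containers (sfnt version tags)
FONT_SIGNATURES = (
    b"\x00\x01\x00\x00",  # TrueType outlines
    b"OTTO",              # OpenType with CFF outlines
    b"true",              # Apple TrueType
    b"ttcf",              # TrueType collection
)


def looks_like_font(data: bytes) -> bool:
    """Return True if `data` starts with a known font signature."""
    return data[:4] in FONT_SIGNATURES


def find_bad_fonts(paths):
    """Return the paths that are missing or do not start with a font signature."""
    bad = []
    for p in paths:
        path = Path(p)
        if not path.is_file():
            bad.append(p)
            continue
        with open(path, "rb") as fontfile:
            header = fontfile.read(4)  # only the signature is needed
        if not looks_like_font(header):
            bad.append(p)
    return bad
```

Calling `find_bad_fonts(fontfiles)` should return an empty list when every download succeeded.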

## Generate lines of text

`trdg` uses four classes of generators to produce images of lines of text: `GeneratorFromDict`, `GeneratorFromRandom`, `GeneratorFromStrings`, and `GeneratorFromWikipedia`. Here's how they work:

All generators:
* Output: a 2-tuple of the form (`img`, `label`), where `img` is a Pillow image of the line of text and `label` is a string of the characters in `img`
* `count`: sets the maximum number of images the generator will create. Default is `-1`, which means "create an image each time you are called until the iteration is exited"
* `length`: sets the number of words in each line (image)
* `size`: sets the height (in pixels) of the generated image.
* `width`: sets the width of the image (in pixels). The default is `-1`, which means "as wide as it needs to be to fit all the words and the margins". Any other setting will truncate the image after that many pixels, even if the `length` number of words hasn't been displayed.
* `fonts`: a list of filepaths to `.ttf` TrueType font files. The generator uses a single font per line. Each time it generates a line of text, it randomly selects a font from the list provided in `fonts` (or from the default list, if `fonts` is not specified).
* `language`: 2-letter language code that sets the language (character set, dictionary, or both) to be used for generated text. Default is 'en' for English.
* `margins`: a 4-tuple of the form: `(top, left, bottom, right)`, indicating the number of pixels of margin on each of those sides. This respects the `size` parameter, so if `size` is set to `50` and `margins` is set to `(10, 0, 0, 0)`, the text can take up only 40 pixels of space to allow 10 pixels for the top margin.
* `fit`: Boolean. If `True`, forces image to fit tightly around text, with no margins. However, margins are generated first, so to make the image bound the text exactly, one must set all margins to 0 and then use `fit=True`. Default is `False`.
* `background_type`: sets the background to be used. The default, `0`, uses Gaussian noise. `1` uses a plain white background. `2` uses a diamond-pattern background. `3` uses an image selected from the folder specified in the parameter `image_dir`.
* `skewing_angle`: the ± angle (in degrees) to skew the text. Has a more drastic effect the larger `length` is because the line becomes longer and the skewing more pronounced. For example, at a `length` of around 15, `skewing_angle` of 2 is significant.
* `random_skew`: if `skewing_angle` is specified, this determines if a skew will be applied or not (at random).
* `output_mask`: if `True`, the output `img` is a 2-tuple of (img, mask), where mask is a black rectangle of the same size as img. Default is `False`.
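The interaction between `size` and `margins` described above can be sketched as plain arithmetic. This is only an illustration of that description for horizontal text; `text_height` is a hypothetical helper of mine, not a `trdg` function.

```python
def text_height(size, margins):
    """Pixels left for the text itself once vertical margins are removed.

    `margins` follows the (top, left, bottom, right) order described above.
    """
    top, _left, bottom, _right = margins
    height = size - top - bottom
    if height <= 0:
        raise ValueError("Vertical margins leave no room for text")
    return height


# The example from the list above: a 50 px line with a 10 px top margin
print(text_height(50, (10, 0, 0, 0)))

# The settings used for standard lines later in this notebook
print(text_height(62, (20, 10, 0, 10)))
```

With the page settings used later (`size = 62`, top margin of 20), the text occupies 42 of the 62 pixels in each line image.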



`GeneratorFromStrings`
* Generates lines of text based on the strings provided to it (as a list in the first argument, `strings`)
* Each list item (a string) is one line of text
* The generator will create images of lines of text in the order in which they are found in the `strings` list

`GeneratorFromRandom`
* Generates lines of text using "words" of characters separated by spaces
* Has settings to enable only certain characters (e.g., letters, numerals, or other characters)
* Uses `language` to determine which character set to use from the provided fonts

`GeneratorFromDict`
* Generates lines of text using words randomly selected from a dictionary in the provided `language`
* Words include compound words like "light-year" and contractions like "didn't", as well as proper nouns

`GeneratorFromWikipedia`
* Generates lines of text scraped from random Wikipedia articles
* Takes about 80 seconds to instantiate an object (perhaps because it is collecting many articles?)
* `minimum_length` sets the minimum number of words per line

In [46]:
# Test the different text generators

generator_dict = GeneratorFromDict(
    count = -1,
    length = 18,
    fonts = fontfiles,
    size = 62,
    background_type = 1,
    margins = (10,10,0,0)
    )

generator_rnd = GeneratorFromRandom(
    count = -1,
    length = 18, 
    fonts = fontfiles,
    size = 62, 
    background_type = 1,
    margins = (10, 10, 0, 0),
    skewing_angle = 2,
    random_skew = True,
    output_mask = True
    )

generator_text = GeneratorFromStrings(
    strings = ['Line1', 'Line2 Line2', 'Line3 Line3 Line3'],
    fonts = fontfiles,
    size = 62,
    background_type = 1,
    margins = (10, 10, 0, 0))

generator_wiki = GeneratorFromWikipedia(
    count = -1,
    minimum_length = 13,
    fonts = fontfiles,
    language = "es",
    size = 62,
    margins = (10, 10, 0, 0)
)

In [18]:
generator_dict = GeneratorFromDict(
            length = 18,
            fonts = fontfiles,
            size = 62,
            background_type = 1,
            margins = (20,70,0,10)
            )

In [17]:
# Create a folder to store the images in
if not os.path.exists('./temp'):
    os.mkdir('./temp')
# Create a folder to store the image labels in
if not os.path.exists('./labels'):
    os.mkdir('./labels')


# Test image creation
for n, (img, label) in enumerate(generator_dict):
    if n > 4: break
    img.save(f"./temp/img{str(n).zfill(5)}.jpg")
    # print(img.size)
    with open(f"./labels/label{str(n).zfill(5)}.txt", mode='w') as txtlabel:
        txtlabel.write(label)

## Functions for image generation

### Probabilistic number of words per line
Create a function to return the number of words per line, to model the distribution of words per line in my journal.

In [7]:
# First, create a function to return the number of words for each generated line.
# Since the number of words per line in my journal varies, I will sample from
# a probability distribution over the possible number of words per line.

# I generated the below words per line and cumulative probabilities based on
# 10 pages of transcribed text.

words_per_line = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
cumulative_probabilities = [0.04264,
    0.05814,
    0.06589,
    0.0969,
    0.10465,
    0.13566,
    0.1938,
    0.26744,
    0.36822,
    0.51163,
    0.67442,
    0.83333,
    0.92636,
    0.96899,
    0.98837,
    1.0]

# Find the closest index to the one chosen by a random float.
# See: https://www.adamsmith.haus/python/answers/how-to-find-the-numpy-array-element-closest-to-a-given-value-in-python
# and: https://stackoverflow.com/questions/2566412/find-nearest-value-in-numpy-array

def word_count(word_counts, cumulative_probabilities):
    '''
    Uses a randomly-chosen float (0-1) and finds the absolute
    difference between that float and each item in the 
    `cumulative_probabilities` list. Then, finds the minimum 
    difference using `np.argmin()` as an index into the 
    `word_counts` list and returns the item in `word_counts` 
    at the minimum index.

    Note: `word_counts` and `cumulative_probabilities` must 
    have the same number of items.

    References:
    * https://www.adamsmith.haus/python/answers/how-to-find-the-numpy-array-element-closest-to-a-given-value-in-python
    * https://stackoverflow.com/questions/2566412/find-nearest-value-in-numpy-array
    '''
    words = np.array(word_counts)
    probs = np.array(cumulative_probabilities)
    rand_num = random.random()
    # Find the difference between the probability distribution
    # and the random float
    diff = np.abs(probs - rand_num)
    # Find the index of the smallest difference
    idx = diff.argmin()
    # Return the word count of the smallest difference
    return words[idx]

In [9]:
word_count(words_per_line, cumulative_probabilities)

17
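One caveat about the nearest-value lookup: it picks the cumulative probability *closest* to the random float, so each word count ends up with a probability slightly different from the one in the table. For example, 10 words is chosen whenever the float is below the midpoint of 0.04264 and 0.05814 (about 0.050), rather than below 0.04264. A possible alternative, sketched below with only the standard library, is inverse-CDF sampling: choose the first entry whose cumulative probability is at least the random float. The function name `sample_word_count` and the `rng` parameter are my own additions for illustration.

```python
import bisect
import random

words_per_line = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
cumulative_probabilities = [0.04264, 0.05814, 0.06589, 0.0969, 0.10465,
                            0.13566, 0.1938, 0.26744, 0.36822, 0.51163,
                            0.67442, 0.83333, 0.92636, 0.96899, 0.98837, 1.0]


def sample_word_count(word_counts, cum_probs, rng=random):
    """Sample a word count by inverse-CDF lookup.

    `rng` is any object with a `random()` method that returns a float
    in [0, 1), such as the `random` module or a `random.Random` instance.
    """
    u = rng.random()
    # Index of the first cumulative probability that is >= u
    idx = bisect.bisect_left(cum_probs, u)
    # Guard against rounding in a table whose last entry is below 1.0
    idx = min(idx, len(word_counts) - 1)
    return word_counts[idx]


print(sample_word_count(words_per_line, cumulative_probabilities))
```

The standard library offers the same behavior in one call: `random.choices(words_per_line, cum_weights=cumulative_probabilities)[0]`.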

### Generate lines of text-images
Next, create a function to generate lines of text using `trdg` that can later be concatenated into full-page images of text.

In [10]:
def create_lines(standard_generator,
                 indent_generator,
                 blank_generator,
                 number_of_lines=30,
                 indented_probability=0.183,
                 blank_line_probability=0.027):
    '''
    Creates images of lines of text for creating full-page images of text.

    Output: (images, labels), where `images` is a list of Pillow Image objects
    and `labels` is a list of strings that represent the text in each image.

    Parameters
    ---
    standard_generator: a `trdg` Generator object with settings (besides line
    length) already set.

    indent_generator: a `trdg` Generator object with settings (besides line
    length) already set. This generator has a large left margin to simulate
    indentation.

    blank_generator: a `trdg` Generator object with settings (besides line
    length) already set. This generator returns a blank line-image to simulate
    blank lines.

    number_of_lines: sets the number of images (lines of text) to be returned.
    Default is 30.

    indented_probability: the likelihood that any given line is indented.
    Default is 0.183, which reflects the fact that, on average, 5.5 lines out
    of every 30 are indented.

    blank_line_probability: the likelihood that any given line is blank.
    Default is 0.027, which reflects the fact that, on average, 0.8 lines out
    of every 30 are blank.
    '''
    # Create lists to store the returned objects: images and labels
    images = []
    labels = []

    # Loop to create images and labels
    for n in range(number_of_lines):
        # Determine if the line will be indented or blank
        # If it is indented, it cannot also be blank
        is_indented = (True if random.random() < indented_probability else False)
        is_blank = False
        if not is_indented:
            is_blank = (True if random.random() < blank_line_probability else False)
        
        if is_blank:
            img, label = blank_generator.next()
            images.append(img)
            labels.append('')
            # Move on to the next line-image
            continue
        
        line_length = word_count(words_per_line, cumulative_probabilities)
        # Average word length (characters) is much longer for trdg than for
        # my journals, so I'm adjusting the number of words per line by -7
        # so the range is now 3-18 words (rather than 10-25)
        line_length -= 7
        label_prefix = ('    ' if is_indented else '')

        # Create the text generator
        generator = indent_generator if is_indented else standard_generator
        generator.length = line_length

        # Create an image and a label
        img, label = generator.next()
        images.append(img)
        labels.append(label_prefix + label)
        
    return (images, labels)
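Note that `create_lines()` only tests for a blank line when the line is not indented, so the overall share of blank lines is a little lower than `blank_line_probability`. The sketch below isolates that decision and works out the effective rates. The names `choose_line_kind` and `effective_probabilities` are hypothetical helpers, not part of the notebook's functions.

```python
import random


def choose_line_kind(rng=random, p_indent=0.183, p_blank=0.027):
    """Mirror the decision order above: indentation first, then blank."""
    if rng.random() < p_indent:
        return "indented"
    if rng.random() < p_blank:
        return "blank"
    return "standard"


def effective_probabilities(p_indent=0.183, p_blank=0.027):
    """Overall probability of each line kind under that decision order."""
    blank = (1 - p_indent) * p_blank  # blank is only possible if not indented
    return {
        "indented": p_indent,
        "blank": blank,
        "standard": 1 - p_indent - blank,
    }


probs = effective_probabilities()
for kind, p in probs.items():
    # Expected number of lines of each kind on a 30-line page
    print(f"{kind}: {p:.4f} ({p * 30:.2f} lines per 30)")
```

With the defaults, the effective blank rate is 0.817 × 0.027 ≈ 0.0221, or about 0.66 blank lines per 30 rather than 0.8.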

In [35]:
# # Choose the font to use in all lines for a page
# # Note that it is a single-item list for using with trdg
# font = [random.choice(fontfiles)]

# # Create the generator for blank lines
# blank_generator = GeneratorFromStrings(
#     strings = [' '],
#     fonts = font,
#     margins = (20,10,0,10),
#     size = 62,
#     width = 500)

# # margins = ((20,70,0,10) if is_indented else (20,10,0,10))

# # Create generator for indented lines
# indent_generator = GeneratorFromDict(
#     length = 18,
#     fonts = font,
#     size = 62,
#     background_type = 0,
#     margins = (20,70,0,10))

# # Create generator for standard lines
# standard_generator = GeneratorFromDict(
#     length = 18,
#     fonts = font,
#     size = 62,
#     background_type = 0,
#     margins = (20,10,0,10))

In [59]:
# %%timeit
# images, labels = create_lines(standard_generator, indent_generator, blank_generator)

1 loop, best of 5: 2.96 s per loop


### Generate full-page text images
Next, a function to combine the images and labels into a full-page image.

### References
* nkmk blog article, [Concatenate images with Python Pillow](https://note.nkmk.me/en/python-pillow-concat-images/)
* nkmk blog article, [Draw lines and shapes with Python Pillow](https://note.nkmk.me/en/python-pillow-imagedraw/)

In [11]:
def create_pages(num_pages=10,
                 font_files=fontfiles,
                 image_save_location='./images',
                 label_save_location='./labels'):
    '''
    Concatenates line-images from the `create_lines()` function
    to create full-page images for text recognition.

    Also adds lines to simulate the ruled paper I used for my
    journals.

    Parameters
    ---
    font_files: a list of file paths to .ttf TrueType font files.
    All lines for each page will use a single font.
    '''
    progress_bar = tqdm(total=num_pages)
    # Loop to create one page at a time
    for n in range(num_pages):
        # Choose the font to use in all lines for a page
        # Note that it is a single-item list for using with trdg
        font = [random.choice(font_files)]

        line_len_words = []
        line_len_chars = []

        # margins = ((20,70,0,10) if is_indented else (20,10,0,10))

        # Create the generator for blank lines
        blank_generator = GeneratorFromStrings(
            strings = [' '],
            fonts = font,
            size = 62,
            background_type = 0,
            margins = (20,10,0,10),
            width = 1000)
        
        # Create generator for indented lines
        indent_generator = GeneratorFromDict(
            length = 9,
            fonts = font,
            size = 62,
            background_type = 0,
            margins = (20,70,0,10))
        
        # Create generator for standard lines
        standard_generator = GeneratorFromDict(
            length = 9,
            fonts = font,
            size = 62,
            background_type = 0,
            margins = (20,10,0,10))
        
        # Create images of lines of text, plus labels
        images, labels = create_lines(standard_generator, indent_generator,
                                    blank_generator, number_of_lines=30)

        # Combine images into a single image
        # See: https://note.nkmk.me/en/python-pillow-concat-images/
        max_width = 0
        max_height = 0
        for img in images:
            width, height = img.size
            if width > max_width:
                max_width = width
            if height > max_height:
                max_height = height

        # Create a blank canvas to paste the images on
        full_page = Image.new(
            mode='RGB',
            size=(max_width, max_height * len(images)),
            color=(230,230,230))

        # Paste the images on the blank canvas
        total_height = 0
        for img in images:
            full_page.paste(img, (0, total_height))
            total_height += img.height
        
        # Add lines to the canvas (29 lines per page)
        # See: https://note.nkmk.me/en/python-pillow-imagedraw/
        # Color for lines: gray: (70, 70, 70)
        # Start at 52 pixels, to go just under the previous line of text, which
        # simulates how the letters in my journal drop below the ruling lines
        y_height = 52
        x_1 = 10
        x_2 = full_page.width
        draw = ImageDraw.Draw(full_page)
        for line_num in range(29):
            draw.line(xy=(x_1, y_height, x_2, y_height), fill=(70,70,70), width=1)
            # Add 62 to the y_height to match the line height
            y_height += 62

        # Save the full-page image
        # Default quality is 75 (max is 95).
        # See: https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#jpeg
        # and: https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.save
        full_page.save(f"{image_save_location}/{str(n).zfill(5)}.jpg", quality=50)
        
        # Combine all labels into one string
        full_page_label = '<START>'
        for lbl in labels:
            full_page_label += (lbl + '\n')
            line_len_words.append(len(lbl.split()))
            line_len_chars.append(len(lbl))
        # Add end keyword, after removing the final '\n'
        full_page_label = full_page_label[:-1] + '<END>'

        # Save the label
        label_filepath = f"{label_save_location}/{str(n).zfill(5)}.txt"
        with open(label_filepath, mode='wt', encoding='utf-8') as labelfile:
            labelfile.write(full_page_label)
        
        # Update progress bar
        progress_bar.set_description(f"Words/page: {len(full_page_label.split()):,d}"
        + f", line length: {sum(line_len_words)/len(line_len_words):,.1f} words"
        + f", {sum(line_len_chars)/len(line_len_chars):,.1f} chars")
        progress_bar.update(n=1)

In [13]:
# Create a folder to store the images in
if not os.path.exists('./images'):
    os.mkdir('./images')
# Create a folder to store the image labels in
if not os.path.exists('./labels'):
    os.mkdir('./labels')

create_pages(num_pages=10000, font_files=fontfiles, image_save_location='./images', label_save_location='./labels')

  0%|          | 0/10000 [00:00<?, ?it/s]
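The page geometry in `create_pages()` can be checked with a little arithmetic. The sketch below assumes every line image is exactly 62 pixels tall (the `size` passed to the generators), which is an assumption about the generated images rather than something this notebook verifies. The helper names are hypothetical.

```python
def ruling_line_positions(first_y=52, line_height=62, count=29):
    """y-coordinates of the ruled lines drawn on the page."""
    return [first_y + k * line_height for k in range(count)]


def page_height(line_height=62, number_of_lines=30):
    """Canvas height when every line image has the same height."""
    return line_height * number_of_lines


positions = ruling_line_positions()
print(positions[0], positions[-1])   # first and last ruled line
print(page_height())                 # total canvas height in pixels
```

Under that assumption, the 29 ruled lines run from y = 52 to y = 1788 on a canvas 1860 pixels tall, so the 30th (last) line of text has no rule drawn beneath it.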

## Different implementation: varying line lengths
To support varying line lengths, I need to recreate the `trdg` `Generator` object for each line. It's significantly more time consuming (avg of 5.20s vs. 1.75s per 30 lines with an average of 9 words/line), but it enables a greater variety in the generated pages.

In [133]:
def create_lines_v2(font_files=fontfiles,
                 number_of_lines=30,
                 indented_probability=0.183,
                 blank_line_probability=0.027):
    '''
    Creates images of lines of text for creating full-page images of text.

    Output: (images, labels), where `images` is a list of Pillow Image objects
    and `labels` is a list of strings that represent the text in each image.

    Parameters
    ---
    font_files: a list of strings with filepaths to .ttf TrueType font files

    number_of_lines: sets the number of images (lines of text) to be returned.
    Default is 30.

    indented_probability: the likelihood that any given line is indented.
    Default is 0.183, which reflects the fact that, on average, 5.5 lines out
    of every 30 are indented.

    blank_line_probability: the likelihood that any given line is blank.
    Default is 0.027, which reflects the fact that, on average, 0.8 lines out
    of every 30 are blank.
    '''
    # Create lists to store the returned objects: images and labels
    images = []
    labels = []

    # Choose the font to use in all lines for a page
    # Note that it is a single-item list for using with trdg
    font = [random.choice(font_files)]

    # Create the generator for blank lines
    blank_generator = GeneratorFromStrings(
        strings = [' '],
        fonts = font,
        size = 62,
        background_type = 0,
        margins = (20,10,0,10),
        width = 1000)

    # Loop to create images and labels
    for n in range(number_of_lines):
        # Determine if the line will be indented or blank
        # If it is indented, it cannot also be blank
        is_indented = (True if random.random() < indented_probability else False)
        is_blank = False
        if not is_indented:
            is_blank = (True if random.random() < blank_line_probability else False)
        
        if is_blank:
            img, label = blank_generator.next()
            images.append(img)
            labels.append('')
            # Move on to the next line-image
            continue
        
        line_length = word_count(words_per_line, cumulative_probabilities)
        # Average word length (characters) is much longer for trdg than for
        # my journals, so I'm adjusting the number of words per line by -7
        # so the range is now 3-18 words (rather than 10-25)
        line_length -= 7
        # Clip the max line length to 13 words
        line_length = min(line_length, 13)
        label_prefix = ('    ' if is_indented else '')

        # Create the text generator
        margins = ((20,70,0,10) if is_indented else (20,10,0,10))
        generator = GeneratorFromDict(
            length = line_length,
            fonts = font,
            size = 62,
            background_type = 0,
            margins = margins)

        # Create an image and a label
        img, label = generator.next()
        images.append(img)
        labels.append(label_prefix + label)
        
    return (images, labels)

In [107]:
%%timeit
images, labels = create_lines_v2()

1 loop, best of 5: 5.24 s per loop


In [146]:
def create_pages_v2(num_pages=10,
                 image_save_location='./images',
                 label_save_location='./labels'):
    '''
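    Concatenates line-images from the `create_lines_v2()` function
    to create full-page images for text recognition.

    Also adds lines to simulate the ruled paper I used for my
    journals.

    Parameters
    ---
    num_pages: the number of full-page images to create. Default is 10.

    image_save_location: folder where the page images are saved.

    label_save_location: folder where the page labels are saved.

    All lines for each page use a single font, which `create_lines_v2()`
    chooses at random from its default `font_files` list.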
    '''
    progress_bar = tqdm(total=num_pages)

    # Loop to create one page at a time
    for n in range(num_pages):
        # Create images of lines of text, plus labels
        images, labels = create_lines_v2()

        line_len_words = []
        line_len_chars = []

        # Combine images into a single image
        # See: https://note.nkmk.me/en/python-pillow-concat-images/
        max_width = 0
        max_height = 0
        for img in images:
            width, height = img.size
            if width > max_width:
                max_width = width
            if height > max_height:
                max_height = height

        # Create a blank canvas to paste the images on
        full_page = Image.new(
            mode='RGB',
            size=(max_width, max_height * len(images)),
            color=(230,230,230))

        # Paste the images on the blank canvas
        total_height = 0
        for img in images:
            full_page.paste(img, (0, total_height))
            total_height += img.height
        
        # Add lines to the canvas (29 lines per page)
        # See: https://note.nkmk.me/en/python-pillow-imagedraw/
        # Color for lines: gray: (70, 70, 70)
        # Start at 52 pixels, to go just under the previous line of text, which
        # simulates how the letters in my journal drop below the ruling lines
        y_height = 52
        x_1 = 10
        x_2 = full_page.width
        draw = ImageDraw.Draw(full_page)
        for line_num in range(29):
            draw.line(xy=(x_1, y_height, x_2, y_height), fill=(70,70,70), width=1)
            # Add 62 to the y_height to match the line height
            y_height += 62

        # Save the full-page image
        # Default quality is 75 (max is 95).
        # See: https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#jpeg
        # and: https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.save
        full_page.save(f"{image_save_location}/{str(n).zfill(5)}.jpg", quality=75)
        
        # Combine all labels into one string
        full_page_label = '<START>'
        for lbl in labels:
            full_page_label += (lbl + '\n')
            line_len_words.append(len(lbl.split()))
            line_len_chars.append(len(lbl))
        # Add end keyword, after removing the final '\n'
        full_page_label = full_page_label[:-1] + '<END>'

        # Save the label
        label_filepath = f"{label_save_location}/{str(n).zfill(5)}.txt"
        with open(label_filepath, mode='wt', encoding='utf-8') as labelfile:
            labelfile.write(full_page_label)
        
        # Update progress bar
        progress_bar.set_description(f"Words/page: {len(full_page_label.split()):,d}"
        + f", line length: {sum(line_len_words)/len(line_len_words):,.1f} words"
        + f", {sum(line_len_chars)/len(line_len_chars):,.1f} chars")
        progress_bar.update(n=1)

In [147]:
# Create a folder to store the images in
if not os.path.exists('./images'):
    os.mkdir('./images')
# Create a folder to store the image labels in
if not os.path.exists('./labels'):
    os.mkdir('./labels')

create_pages_v2(num_pages=10, image_save_location='./images', label_save_location='./labels')

  0%|          | 0/10 [00:00<?, ?it/s]
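The label assembly used in both page functions (append each line plus `'\n'`, then strip the final newline) can also be written with `str.join`. The sketch below is an equivalent formulation for non-empty label lists, and `build_page_label` is a hypothetical name. One difference: with an empty list, the loop version would slice a character off `'<START>'`, while `join` simply produces `'<START><END>'`.

```python
def build_page_label(labels, start="<START>", end="<END>"):
    """Join per-line labels into one page label with start/end markers."""
    return start + "\n".join(labels) + end


# A standard line, a blank line, and an indented line
example = build_page_label(["first line", "", "    indented line"])
print(repr(example))
```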

## Potential improvements: augmentation
Here are two example packages I could use to augment the images by adjusting brightness, contrast, tilt, noise, etc.:
* [`Augmentor`](https://github.com/mdbloice/Augmentor)
* [`imgaug`](https://github.com/aleju/imgaug)