# Generate Label Text by Sampling iDigBio Data

Images (both clean and dirty) can be generated on the fly, but the text data must be persisted so that it can be used in several steps further down the pipeline. This notebook creates persistent label text data from the iDigBio database. I'm going to create a separate DB so that other team members can easily use it.

The data will be in parts so that we can treat the different parts separately when we generate the labels. For instance, some parts may use different fonts and others may have the text underlined, etc. Also note that some of the parts may be empty.
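As a rough sketch of what "data in parts" could look like, the snippet below groups a record's fields into named parts and leaves a part empty when none of its fields have a value. The part names and the `PARTS` grouping are illustrative assumptions, not the layout the project actually uses; only the field names come from the column list in this notebook.

```python
# Hypothetical grouping of record fields into label parts. The part names
# and the field choices are illustrative assumptions only.
PARTS = {
    'title': ['dataset_name', 'owner_institution_code'],
    'taxon': ['scientific_name', 'scientific_name_authorship'],
    'locality': ['country', 'state_province', 'locality'],
    'collector': ['recorded_by', 'record_number', 'event_date'],
}


def split_into_parts(record):
    """Join the non-empty fields of each part; a part with no data becomes ''."""
    parts = {}
    for name, fields in PARTS.items():
        values = [str(record.get(field) or '').strip() for field in fields]
        parts[name] = ' '.join(v for v in values if v)
    return parts


record = {
    'scientific_name': 'Quercus alba',
    'scientific_name_authorship': 'L.',
    'country': 'United States',
    'state_province': 'Ohio',
    'locality': '',
    'recorded_by': None,
}
label_parts = split_into_parts(record)
```

Keeping each part as its own string means a later step can pick a font or underline style per part and simply skip the parts that came back empty.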

Text augmentation is performed here because the OCR step, later on, needs to replicate the text. Text augmentation steps:
- [ ] Underline some text: Use solid, dotted & dashed underlines, and also strike-through or other lines added to the text
- [ ] Bold some text
- [ ] Use different base fonts and within the label change some fonts and sizes
- [ ] Use symbols like: ♀ or ♂ and other ones that may appear on labels
- [ ] Augment taxon names with data from ITIS
- [ ] Augment location data from gazetteer data
- [ ] Generate names, dates, and numbers
- [ ] Replace some words with abbreviations
- [ ] Add spaces? Let's see if this will work
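One of the steps above, generating dates, could be sketched roughly as follows. The format list, the date range, and the helper name `random_label_date` are assumptions made for illustration; real labels use many more date conventions than these.

```python
import random
from datetime import date, timedelta

# Illustrative formats only (an assumption, not a survey of real labels).
DATE_FORMATS = ['%Y-%m-%d', '%d %B %Y', '%b %d, %Y', '%m/%d/%Y']


def random_label_date(rng, start=date(1850, 1, 1), end=date(2020, 12, 31)):
    """Pick a random day in [start, end] and format it in a random style."""
    span = (end - start).days
    day = start + timedelta(days=rng.randint(0, span))
    return day.strftime(rng.choice(DATE_FORMATS))


rng = random.Random(42)  # seeded so that the samples are reproducible
samples = [random_label_date(rng) for _ in range(5)]
```

Passing in a seeded `random.Random` instance keeps the generated text reproducible, which matters when the same text must be available to later pipeline steps.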

Note that we are generating *"plausible"* labels, not necessarily *realistic* labels. Also note that there are different kinds of labels:
- The main label that describes the sample
- Labels for species determination
- Barcode and QR-Code labels

In [1]:
import sys

sys.path.append('..')

In [2]:
import os
import sqlite3
from pprint import pp
from pathlib import Path

import pandas as pd

from digi_leap.pylib.const import LABEL_DB, RAW_DB

ImportError: cannot import name 'LABEL_DB' from 'digi_leap.pylib.const' (../digi_leap/pylib/const.py)

Constants for building the label table

In [None]:
LABEL_DB = Path('..') / 

LIMIT = 100_000  # This should be enough for training
TABLE = 'occurrence_raw'

# These columns look good for label generation
COLUMNS = """
    scientific_name
    phylum
    class
    order
    family
    genus
    subgenus
    verbatim_scientific_name
    accepted_name_usage
    vernacular_name
    taxon_rank
    verbatim_taxon_rank

    scientific_name_authorship
    name_according_to
    name_published_in
    name_published_in_id
    name_published_in_year
    date_identified
    identified_by
    identification_id
    identification_remarks
    original_name_usage
    previous_identifications

    locality
    location_remarks
    country
    state_province
    municipality
    water_body
    georeference_remarks
    georeferenced_by

    verbatim_coordinate_system
    verbatim_coordinates
    verbatim_depth
    verbatim_elevation
    verbatim_latitude
    verbatim_longitude
    verbatim_srs

    event_date
    event_id
    event_remarks
    dwc_verbatim_event_date

    owner_institution_code
    catalog_number
    collection_code
    dataset_name

    field_notes
    field_number

    habitat
    life_stage
    occurrence_remarks
    organism_remarks
    preparations
    reproductive_condition
    sex
    sampling_protocol
    type_status
    
    record_entered_by
    record_number
    recorded_by
""".split()

In [None]:
# Use a list (not a set) so the column order in the new table is deterministic
columns = [f'`{c}`' for c in COLUMNS]
columns = ', '.join(columns)

sql = f"""
    CREATE TABLE IF NOT EXISTS aux.data AS
        SELECT rowid AS id, {columns}
          FROM {TABLE}
         WHERE rowid IN (
             SELECT rowid
               FROM {TABLE}
              WHERE kingdom like 'plant%'
                AND scientific_name <> ''
           ORDER BY RANDOM()
              LIMIT {LIMIT})
"""

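# SQLITE_DB is never defined in this notebook; RAW_DB (imported above) is
# assumed to be the raw iDigBio database that it referred to.
with sqlite3.connect(RAW_DB) as cxn: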
    cxn.execute(f"ATTACH DATABASE '{LABEL_DB}' AS aux")
    cxn.execute(sql)
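The sampling pattern above (attach a second database, then copy a random subset of filtered rows into it) can be tried out on a toy database. This is a minimal sketch under stated assumptions: the temporary file names and the two-column stand-in for `occurrence_raw` are made up for the demonstration.

```python
import sqlite3
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    raw_db = Path(tmp) / 'raw.sqlite'      # stand-in for the raw iDigBio DB
    label_db = Path(tmp) / 'label.sqlite'  # stand-in for the label DB

    cxn = sqlite3.connect(raw_db)
    cxn.execute(
        'CREATE TABLE occurrence_raw (kingdom TEXT, scientific_name TEXT)')
    rows = [('Plantae', f'Species {i}') for i in range(50)]
    rows += [('Animalia', 'Not a plant'), ('Plantae', '')]  # must be filtered
    cxn.executemany('INSERT INTO occurrence_raw VALUES (?, ?)', rows)
    cxn.commit()

    # Attach the target DB, then copy a random sample of plant rows into it
    cxn.execute('ATTACH DATABASE ? AS aux', (str(label_db),))
    cxn.execute("""
        CREATE TABLE IF NOT EXISTS aux.data AS
            SELECT rowid AS id, scientific_name
              FROM occurrence_raw
             WHERE rowid IN (
                 SELECT rowid
                   FROM occurrence_raw
                  WHERE kingdom LIKE 'plant%'
                    AND scientific_name <> ''
               ORDER BY RANDOM()
                  LIMIT 10)
    """)
    cxn.commit()
    sampled = cxn.execute(
        'SELECT id, scientific_name FROM aux.data').fetchall()
    cxn.close()
```

SQLite's `LIKE` is case-insensitive for ASCII by default, so `'plant%'` matches `Plantae`. Note that `ORDER BY RANDOM()` sorts every matching row, which is simple but can be slow on a very large table.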

We will need this index

In [None]:
with sqlite3.connect(LABEL_DB) as cxn:
    cxn.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS
            data_id ON data (id)""")

In [None]:
with sqlite3.connect(LABEL_DB) as cxn:
    df = pd.read_sql('select * from data', cxn)

Augment the sex field

In [None]:
UPDATABLE = ('', 'NO DISPONIBLE', 'No Aplica', 'Null', 'U', 'Unspecified', '5')
SEXES = {
    'Male': 10_000,
    'male': 10_000,
    'Female': 10_000,
    'female': 10_000,
    'hermaphrodite': 100,
    'bisexual': 100,
    '♀': 5_000,
    '♂': 5_000,
    '⚥': 10,
}

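# Only overwrite rows whose sex value is empty or a placeholder (UPDATABLE),
# and never ask for more rows than are available.
candidates = df.loc[df['sex'].isin(UPDATABLE)]
for value, count in SEXES.items():
    rows = candidates.sample(n=min(count, len(candidates)))
    df.loc[rows.index, 'sex'] = value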

Insert abbreviations into localities
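The notebook has no code for this step yet. One possible approach, shown here only as a sketch, is to swap whole words for abbreviations with some probability. The `ABBREVIATIONS` table, the `abbreviate` helper, and the `rate` parameter are all hypothetical; a real table would be much larger.

```python
import random
import re

# Hypothetical abbreviation table (an assumption for illustration only).
ABBREVIATIONS = {
    'County': 'Co.',
    'Mountain': 'Mt.',
    'Road': 'Rd.',
    'north': 'N',
    'miles': 'mi.',
}
# Match any of the known words, but only as whole words
PATTERN = re.compile(
    r'\b(' + '|'.join(map(re.escape, ABBREVIATIONS)) + r')\b')


def abbreviate(text, rng, rate=0.5):
    """Replace each known word with its abbreviation with probability `rate`."""
    def swap(match):
        word = match.group(1)
        return ABBREVIATIONS[word] if rng.random() < rate else word
    return PATTERN.sub(swap, text)


rng = random.Random(0)
locality = '5 miles north of Boulder, Boulder County'
always = abbreviate(locality, rng, rate=1.0)  # '5 mi. N of Boulder, Boulder Co.'
never = abbreviate(locality, rng, rate=0.0)   # unchanged
```

Applied to the `locality` column of `df`, a function like this could be mapped over the non-empty values so that only some labels end up abbreviated.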

Write the output

In [None]:
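with sqlite3.connect(LABEL_DB) as cxn:
    # if_exists='replace' drops and recreates the table, which also drops
    # the data_id index, so the index has to be recreated afterwards.
    df.to_sql('data', cxn, if_exists='replace', index=False)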