
# Setup

**Warning: This notebook can take hours to complete.**

In [1]:
from glob import glob
from pathlib import Path
from collections import namedtuple

import pandas as pd
from PIL import Image, ImageFilter
import exifread
import zbarlight
from tqdm import tqdm

import lib.db as db
import lib.util as util
import lib.google as google
from lib.dict_attr import DictAttrs

In [2]:
Dimensions = namedtuple('Dimensions', 'width height')

CXN = db.connect()
RAW_DATA = Path('..') / 'data' / 'raw'
INTERIM_DATA = Path('..') / 'data' / 'interim'
PROCESSED_DATA = Path('..') / 'data' / 'processed'

# Create images and errors tables

In [3]:
def create_images_table():
    """Create the table that maps each sample ID to its image file."""
    CXN.execute('DROP TABLE IF EXISTS raw_images')
    CXN.execute("""
        CREATE TABLE raw_images (
            sample_id  TEXT PRIMARY KEY NOT NULL,
            image_file TEXT NOT NULL UNIQUE
        )""")
    CXN.execute("""CREATE INDEX image_idx ON raw_images (sample_id)""")
    CXN.execute("""CREATE INDEX image_file ON raw_images (image_file)""")

In [4]:
def create_errors_table():
    """Create errors table for persisting errors."""
    CXN.execute('DROP TABLE IF EXISTS image_errors')
    # The table name must match the DROP above and the index below.
    CXN.execute("""
        CREATE TABLE image_errors (
            image_file TEXT NOT NULL,
            msg        TEXT,
            ok         INTEGER,
            resolution TEXT
        )""")
    CXN.execute("""CREATE INDEX error_idx ON image_errors (image_file)""")
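
As a quick check of what these schemas enforce, the sketch below recreates `raw_images` and `image_errors` in an in-memory SQLite database. It assumes `db.connect()` returns a standard `sqlite3` connection, which the notebook does not show, and the sample rows are made up.

```python
import sqlite3

# Assumption: lib.db.connect() returns a sqlite3 connection.
# An in-memory database stands in for it here.
cxn = sqlite3.connect(':memory:')

cxn.execute("""
    CREATE TABLE raw_images (
        sample_id  TEXT PRIMARY KEY NOT NULL,
        image_file TEXT NOT NULL UNIQUE
    )""")
cxn.execute("""
    CREATE TABLE image_errors (
        image_file TEXT NOT NULL,
        msg        TEXT,
        ok         INTEGER,
        resolution TEXT
    )""")

sql = 'INSERT INTO raw_images (sample_id, image_file) VALUES (?, ?)'
cxn.execute(sql, ('id-1', 'a.JPG'))  # hypothetical row

# A second row with the same sample_id violates the primary key.
try:
    cxn.execute(sql, ('id-1', 'b.JPG'))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# image_errors has no uniqueness constraint, so a file can be logged twice.
err_sql = 'INSERT INTO image_errors (image_file, msg) VALUES (?, ?)'
cxn.executemany(err_sql, [('b.JPG', 'first'), ('b.JPG', 'second')])
error_count = cxn.execute('SELECT COUNT(*) FROM image_errors').fetchone()[0]

print(duplicate_rejected, error_count)
```

Under these assumptions, a duplicate `sample_id` can never reach `raw_images` silently, which is why duplicates are routed to the errors table instead.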

In [5]:
# create_images_table()
# create_errors_table()

# Get Files Already Read

In [6]:
images = pd.read_sql('SELECT * FROM raw_images', CXN)
errors = pd.read_sql('SELECT * FROM image_errors', CXN)
skip_images = set(images.image_file) | set(errors.image_file)

# Functions

## Search the image for the sample ID (UUID)

### Create Window Slider

The slider yields overlapping crop boxes so that the QR code scanner can search one small region of the image at a time instead of the whole photo.

In [7]:
def window_slider(image_size, window=None, stride=None):
    window = window if window else Dimensions(400, 400)
    stride = stride if stride else Dimensions(200, 200)

    for top in range(0, image_size.height, stride.height):
        bottom = top + window.height
        bottom = image_size.height if bottom > image_size.height else bottom

        for left in range(0, image_size.width, stride.width):
            right = left + window.width
            right = image_size.width if right > image_size.width else right

            box = (left, top, right, bottom)

            yield box
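
To make the box layout concrete, here is a self-contained sketch that repeats the slider in condensed form and runs it on a hypothetical 1000 x 600 pixel image. The image size is an assumption chosen for illustration; the default window and stride are the ones used above.

```python
from collections import namedtuple

Dimensions = namedtuple('Dimensions', 'width height')


def window_slider(image_size, window=None, stride=None):
    """Yield (left, top, right, bottom) boxes, clipped to the image."""
    window = window or Dimensions(400, 400)
    stride = stride or Dimensions(200, 200)
    for top in range(0, image_size.height, stride.height):
        bottom = min(top + window.height, image_size.height)
        for left in range(0, image_size.width, stride.width):
            right = min(left + window.width, image_size.width)
            yield (left, top, right, bottom)


# Hypothetical 1000 x 600 pixel image.
boxes = list(window_slider(Dimensions(width=1000, height=600)))

print(len(boxes))  # 15 boxes: 3 rows x 5 columns
print(boxes[0])    # (0, 0, 400, 400)
print(boxes[-1])   # (800, 400, 1000, 600), clipped at the right and bottom edges
```

Because the stride is half the window size, neighboring boxes overlap by 200 pixels, so a QR code that straddles one box boundary can still fall entirely inside an adjacent box.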

### Extract QR Code from Image

In [8]:
def get_qr_code(image):
    # Try a direct extraction
    qr_code = zbarlight.scan_codes('qrcode', image)
    if qr_code:
        return qr_code[0].decode('utf-8')

    # Try a slider
    for box in window_slider(image):
        cropped = image.crop(box)
        qr_code = zbarlight.scan_codes('qrcode', cropped)
        if qr_code:
            return qr_code[0].decode('utf-8')

    # Try rotating the image *sigh*
    for degrees in range(5, 85, 5):
        rotated = image.rotate(degrees)
        qr_code = zbarlight.scan_codes('qrcode', rotated)
        if qr_code:
            return qr_code[0].decode('utf-8')

    # Try to sharpen the image
    sharpened = image.filter(ImageFilter.SHARPEN)
    qr_code = zbarlight.scan_codes('qrcode', sharpened)
    if qr_code:
        return qr_code[0].decode('utf-8')

    return None
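
The function above is a chain of fallbacks: try the cheapest scan first and stop at the first hit. The sketch below shows that pattern in isolation. It is one possible restructuring, not the notebook's code: `first_decoded`, `candidates`, and `fake_decode` are hypothetical names, strings stand in for images, and the window-slider step is left out for brevity.

```python
def first_decoded(candidates, decode):
    """Return the first truthy decode result, or None if every candidate fails."""
    for candidate in candidates:
        result = decode(candidate)
        if result:
            return result
    return None


def candidates(image):
    """Lazily yield views of the image, cheapest first (stand-ins, not real images)."""
    yield image                          # direct scan
    for degrees in range(5, 85, 5):      # rotations, same range as get_qr_code
        yield f'{image} rotated {degrees}'
    yield f'{image} sharpened'           # sharpening is tried last


tried = []


def fake_decode(view):
    """Hypothetical decoder that only succeeds on the 10-degree rotation."""
    tried.append(view)
    return 'some-uuid' if view.endswith('rotated 10') else None


result = first_decoded(candidates('photo'), fake_decode)
print(result)      # some-uuid
print(len(tried))  # 3: direct, rotated 5, rotated 10
```

Because `candidates` is a generator, the later and more expensive views are never built once an earlier one decodes.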

### Read and Process Image

In [9]:
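def get_image_data(image_file):
    """Open an image file and return the sample ID from its QR code, or None."""
    # Use a separate name for the file handle so the path is not shadowed.
    with open(image_file, 'rb') as in_file:
        # exif = exifread.process_file(in_file)
        image = Image.open(in_file)
        image.load()

    qr_code = get_qr_code(image)

    return qr_code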

## Ingest Image Batch

In [10]:
def ingest_images(dir_name, skip_images):
    pattern = str(dir_name / '*.JPG')

    sample_ids = {}  # Keep track of already used sample_ids

    images = []  # A batch of images to insert
    errors = []  # A batch of errors to insert

    files = sorted(glob(pattern))

    for image_file in tqdm(files):
        if image_file in skip_images:
            continue

        sample_id = get_image_data(image_file)

        # Handle a missing sample ID
        if not sample_id:
            msg = 'MISSING: QR code missing in {}'.format(image_file)
            errors.append((image_file, msg))

        # Handle a duplicate sample ID
        elif sample_ids.get(sample_id):
            msg = ('DUPLICATES: Files {} and {} have the same '
                   'QR code').format(sample_ids[sample_id], image_file)
            errors.append((image_file, msg))

        # The image seems OK
        else:
            sample_ids[sample_id] = image_file
            images.append((sample_id, image_file))

    # Insert the image and error batches
    if images:
        sql = 'INSERT INTO raw_images (sample_id, image_file) VALUES (?, ?)'
        CXN.executemany(sql, images)
        CXN.commit()

    if errors:
        sql = 'INSERT INTO image_errors (image_file, msg) VALUES (?, ?)'
        CXN.executemany(sql, errors)
        CXN.commit()
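
The sorting of files into image rows and error rows can be exercised without any images or database. The sketch below isolates that logic in a hypothetical `classify` helper and feeds it made-up scan results, where `None` stands for an unreadable QR code.

```python
def classify(scanned):
    """Split (image_file, sample_id) pairs into image rows and error rows."""
    seen = {}    # sample_id -> first file that carried it
    images = []
    errors = []
    for image_file, sample_id in scanned:
        if not sample_id:
            msg = f'MISSING: QR code missing in {image_file}'
            errors.append((image_file, msg))
        elif sample_id in seen:
            msg = (f'DUPLICATES: Files {seen[sample_id]} and {image_file} '
                   'have the same QR code')
            errors.append((image_file, msg))
        else:
            seen[sample_id] = image_file
            images.append((sample_id, image_file))
    return images, errors


# Hypothetical scan results for four files.
scanned = [('a.JPG', 'id-1'), ('b.JPG', None), ('c.JPG', 'id-1'), ('d.JPG', 'id-2')]
images, errors = classify(scanned)

print(images)                       # [('id-1', 'a.JPG'), ('id-2', 'd.JPG')]
print([name for name, _ in errors])  # ['b.JPG', 'c.JPG']
```

Note that duplicates are only detected within one batch, because the lookup dictionary starts empty on every call. The same holds for `ingest_images` above.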

## Resolve an error

In [11]:
def resolve_error(dir_name, image_file, ok, resolution):
    image_file = str(dir_name / f'{image_file}.JPG')
    sql = """
        UPDATE image_errors
           SET ok = ?, resolution = ?
         WHERE image_file = ?"""
    CXN.execute(sql, (ok, resolution, image_file))
    CXN.commit()
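
An `UPDATE` that matches no row succeeds silently, so a mistyped file name would go unnoticed. The sketch below shows one way to detect that, assuming `CXN` behaves like a `sqlite3` connection. `resolve_error_checked` is a hypothetical variant, and the rows are made up.

```python
import sqlite3

# Assumption: CXN is a sqlite3 connection; an in-memory one stands in here.
cxn = sqlite3.connect(':memory:')
cxn.execute("""
    CREATE TABLE image_errors (
        image_file TEXT NOT NULL, msg TEXT, ok INTEGER, resolution TEXT
    )""")
cxn.execute(
    "INSERT INTO image_errors (image_file, msg) VALUES ('dir/R1.JPG', 'MISSING')")


def resolve_error_checked(cxn, image_file, ok, resolution):
    """Hypothetical variant of resolve_error that reports how many rows changed."""
    sql = 'UPDATE image_errors SET ok = ?, resolution = ? WHERE image_file = ?'
    cursor = cxn.execute(sql, (ok, resolution, image_file))
    cxn.commit()
    return cursor.rowcount


hit = resolve_error_checked(cxn, 'dir/R1.JPG', 1, 'OK: Genuine duplicate')
miss = resolve_error_checked(cxn, 'dir/R9.JPG', 1, 'OK: Genuine duplicate')
print(hit, miss)  # 1 0
```

A return value of 0 signals that the file name did not match any logged error.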

## Manually set an image record

In [12]:
def manual_insert(dir_name, image_file, sample_id, skip_images):
    image_file = str(dir_name / f'{image_file}.JPG')
    if image_file in skip_images:
        return
    sql = 'INSERT INTO raw_images (sample_id, image_file) VALUES (?, ?)'
    CXN.execute(sql, (sample_id, image_file))
    CXN.commit()

# Read image Files

## Read New York Botanical Garden (1st trip)

In [13]:
path = RAW_DATA / 'DOE-nitfix_specimen_photos'

ingest_images(path, skip_images)

resolve_error(path, 'R0000149', 1, 'OK: Genuine duplicate')
resolve_error(path, 'R0000151', 1, 'OK: Genuine duplicate')
resolve_error(path, 'R0000158', 1, 'OK: Genuine duplicate')
resolve_error(path, 'R0000165', 1, 'OK: Genuine duplicate')
resolve_error(path, 'R0000674', 1, 'OK: Is a duplicate of R0000473')
resolve_error(path, 'R0000835', 1, 'OK: Is a duplicate of R0000836')
resolve_error(path, 'R0000895', 1, 'OK: Genuine duplicate')
resolve_error(path, 'R0000937', 1, 'OK: Genuine duplicate')
resolve_error(path, 'R0001055', 1, 'OK: Genuine duplicate')

100%|██████████| 1236/1236 [00:00<00:00, 566141.72it/s]


## Read Harvard Herbaria

In [14]:
path = RAW_DATA / 'HUH_DOE-nitfix_specimen_photos'

ingest_images(path, skip_images)

resolve_error(path, 'R0001262', 1, 'OK: Is a duplicate of R0001263')
resolve_error(path, 'R0001729', 1, 'OK: Is a duplicate of R0001728')

100%|██████████| 483/483 [00:00<00:00, 502941.62it/s]


## Read Ohio State University Herbarium

In [15]:
path = RAW_DATA / 'OS_DOE-nitfix_specimen_photos'

ingest_images(path, skip_images)

resolve_error(path, 'R0000229', 1, 'OK: Genuine duplicate')
resolve_error(path, 'R0001835', 1, 'OK: Genuine duplicate')
resolve_error(path, 'R0001898', 1, 'OK: Genuine duplicate')

100%|██████████| 688/688 [00:00<00:00, 571988.34it/s]


## Read California Academy of Sciences Herbarium

In [16]:
path = RAW_DATA / 'CAS-DOE-nitfix_specimen_photos'

ingest_images(path, skip_images)

resolve_error(path, 'R0001361', 1, 'OK: Genuine duplicate')
resolve_error(path, 'R0002349', 1, 'OK: Genuine duplicate')

100%|██████████| 2596/2596 [00:00<00:00, 599978.69it/s]


## Read Missouri Botanical Garden

In [17]:
path = RAW_DATA / 'MO-DOE-nitfix_specimen_photos'

ingest_images(path, skip_images)

resolve_error(path, 'R0002933', 1, 'OK: Genuine duplicate')
resolve_error(path, 'R0003226', 1, 'OK: Genuine duplicate')
resolve_error(path, 'R0003663', 1, 'OK: Manually fixed')
resolve_error(path, 'R0003509', 0, 'ERROR: Blurry image')

manual_insert(
    path, 'R0003663', '2eea159f-3c25-42ef-837d-27ad545a6779', skip_images)

100%|██████████| 1027/1027 [00:00<00:00, 421070.40it/s]


## Read New York Botanical Garden (2nd trip)

In [18]:
path = RAW_DATA / 'NY_visit_2'

ingest_images(path, skip_images)

100%|██████████| 1307/1307 [00:00<00:00, 530837.16it/s]


## Read New York Botanical Garden (3rd trip)

In [19]:
path = RAW_DATA / 'NY_DOE-nitfix_visit3'

ingest_images(path, skip_images)

100%|██████████| 1112/1112 [00:00<00:00, 580396.47it/s]


# Read Pilot Data

In [20]:
csv_name = 'pilot.csv'
csv_path = INTERIM_DATA / csv_name

google.sheet_to_csv('UFBI_identifiers_photos', csv_path)

pilot = pd.read_csv(csv_path)

pilot['image_file'] = pilot['File'].apply(
    lambda x: f'../data/raw/UFBI_sample_photos/{x}')

pilot.drop(['File'], axis=1, inplace=True)
pilot.rename(columns={'Identifier': 'pilot_id'}, inplace=True)
pilot.pilot_id = pilot.pilot_id.str.lower().str.split().str.join(' ')

print(len(pilot))
pilot.head()

456


| | pilot_id | sample_id | image_file |
|---|---|---|---|
| 0 | ny: cronquist 11617 | 2179dce7-dac2-4fc1-84a3-8725acefa8cc | ../data/raw/UFBI_sample_photos/20170523_154701... |
| 1 | ny: nee 38556 | 00420ba6-4228-49e8-845c-30a967de4b51 | ../data/raw/UFBI_sample_photos/20170523_154645... |
| 2 | ny: jorengensen 65676 | 72b64a1e-0dd9-4f44-9f82-afaee163d57b | ../data/raw/UFBI_sample_photos/20170523_154638... |
| 3 | ny: jaramillo 10160 | 3364f3bb-c0a1-4af4-8b3b-a780de9f1594 | ../data/raw/UFBI_sample_photos/20170523_154629... |
| 4 | ny: jorgensen 61589 | 6e76a0be-4b0f-4e01-a6e6-1cd1395d4458 | ../data/raw/UFBI_sample_photos/20170523_154621... |
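
The `pilot_id` clean-up in the cell above lowercases each identifier and collapses runs of whitespace into single spaces. A plain-Python equivalent for a single string looks like the sketch below; `normalize_id` is a hypothetical helper and the input is a made-up, messier spelling of the first row.

```python
def normalize_id(raw):
    """Lowercase, trim, and collapse internal whitespace to single spaces."""
    return ' '.join(raw.lower().split())


print(normalize_id('  NY:  Cronquist   11617 '))  # ny: cronquist 11617
```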


In [21]:
name = 'raw_pilot'
pilot.to_csv(PROCESSED_DATA / f'{name}.csv', index=False)
pilot.to_sql(name, CXN, if_exists='replace', index=False)

already_in = pilot.sample_id.isin(images.sample_id)
# Filter and drop in one step so pandas works on a new frame, not a view
pilot = pilot[~already_in].drop('pilot_id', axis=1)
pilot.to_sql('raw_images', CXN, if_exists='append', index=False)
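
The filter above compares against the `images` frame that was loaded near the top of the notebook. As an alternative sketch, the existing sample IDs can be read from the database immediately before the insert. This assumes a `sqlite3` connection and uses made-up rows; it is not the notebook's method.

```python
import sqlite3

# Assumption: an in-memory sqlite3 database stands in for CXN.
cxn = sqlite3.connect(':memory:')
cxn.execute("""
    CREATE TABLE raw_images (
        sample_id  TEXT PRIMARY KEY NOT NULL,
        image_file TEXT NOT NULL UNIQUE
    )""")
cxn.execute("INSERT INTO raw_images VALUES ('id-1', 'a.JPG')")

# Hypothetical pilot rows; the first sample_id is already in raw_images.
pilot_rows = [('id-1', 'pilot_a.jpg'), ('id-2', 'pilot_b.jpg')]

# Read the current IDs from the table, then keep only unseen rows.
existing = {row[0] for row in cxn.execute('SELECT sample_id FROM raw_images')}
new_rows = [row for row in pilot_rows if row[0] not in existing]

sql = 'INSERT INTO raw_images (sample_id, image_file) VALUES (?, ?)'
cxn.executemany(sql, new_rows)
cxn.commit()

total = cxn.execute('SELECT COUNT(*) FROM raw_images').fetchone()[0]
print(new_rows)  # [('id-2', 'pilot_b.jpg')]
print(total)     # 2
```

Reading the IDs at insert time avoids a primary-key error if rows were added to `raw_images` after the earlier snapshot was taken.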

# Read Corrales Data

In [22]:
csv_name = 'corrales.csv'  # was 'pilot.csv', which overwrote the pilot download
csv_path = INTERIM_DATA / csv_name

google.sheet_to_csv('corrales_data', csv_path)

corrales = pd.read_csv(csv_path)
corrales.corrales_id = corrales.corrales_id.str.lower()
corrales.head()

| | corrales_id | sample_id | image_file |
|---|---|---|---|
| 0 | corrales: corrales 770 | eb9dc632-82a9-479f-88d5-f172ee6cc2d7 | ../data/raw/missing_photos/Corrales_770.jpg |
| 1 | corrales: corrales 830 | 4a31b17c-08f2-4236-a7b1-2261554ea658 | ../data/raw/missing_photos/Corrales_830.jpg |
| 2 | corrales: corrales 792 | 3ffa0f4c-1180-4268-8ba5-c9cc2e251350 | ../data/raw/missing_photos/Corrales_792.jpg |
| 3 | corrales: corrales 704 | b6b6b66e-3b3b-4d5c-8197-a959a8fe715e | ../data/raw/missing_photos/Corrales_704.jpg |
| 4 | corrales: corrales 754 | 56013e5d-6b1b-4c1c-8f01-3bb7e57c3a83 | ../data/raw/missing_photos/Corrales_754.jpg |


In [23]:
name = 'raw_corrales'
corrales.to_csv(PROCESSED_DATA / f'{name}.csv', index=False)
corrales.to_sql(name, CXN, if_exists='replace', index=False)

already_in = corrales.sample_id.isin(images.sample_id)
# Filter and drop in one step so pandas works on a new frame, not a view
corrales = corrales[~already_in].drop('corrales_id', axis=1)
corrales.to_sql('raw_images', CXN, if_exists='append', index=False)

# Write Image Table to CSV File

In [24]:
csv_name = 'images.csv'

df = pd.read_sql('SELECT * FROM raw_images', CXN)

csv_path = PROCESSED_DATA / csv_name
df.to_csv(csv_path, index=False)

df.head()

| | sample_id | image_file |
|---|---|---|
| 0 | 6fcdf583-e9bb-4764-84de-f277cc6ec6b7 | ../data/raw/DOE-nitfix_specimen_photos/R000000... |
| 1 | 6fa18219-4958-4d75-8bf3-032fa909315c | ../data/raw/DOE-nitfix_specimen_photos/R000000... |
| 2 | 6f93bea8-43f4-45ad-95f5-ecad63f13037 | ../data/raw/DOE-nitfix_specimen_photos/R000000... |
| 3 | 6f66cc88-3583-4e9b-97ea-03b1d681def8 | ../data/raw/DOE-nitfix_specimen_photos/R000000... |
| 4 | 6f5bc099-ff55-4740-8a2f-e63466b47892 | ../data/raw/DOE-nitfix_specimen_photos/R000000... |


In [25]:
csv_name = 'errors.csv'

df = pd.read_sql('SELECT * FROM image_errors', CXN)

csv_path = PROCESSED_DATA / csv_name
df.to_csv(csv_path, index=False)

df.head()

| | image_file | msg | ok | resolution |
|---|---|---|---|---|
| 0 | ../data/raw/DOE-nitfix_specimen_photos/R000014... | DUPLICATES: Files ../data/raw/DOE-nitfix_speci... | 1.0 | OK: Genuine duplicate |
| 1 | ../data/raw/DOE-nitfix_specimen_photos/R000015... | DUPLICATES: Files ../data/raw/DOE-nitfix_speci... | 1.0 | OK: Genuine duplicate |
| 2 | ../data/raw/DOE-nitfix_specimen_photos/R000015... | DUPLICATES: Files ../data/raw/DOE-nitfix_speci... | 1.0 | OK: Genuine duplicate |
| 3 | ../data/raw/DOE-nitfix_specimen_photos/R000016... | DUPLICATES: Files ../data/raw/DOE-nitfix_speci... | 1.0 | OK: Genuine duplicate |
| 4 | ../data/raw/DOE-nitfix_specimen_photos/R000067... | DUPLICATES: Files ../data/raw/DOE-nitfix_speci... | 1.0 | OK: Is a duplicate of R0000473 |
