# Experiments for extracting images of Boyd's Bird Journal into computer-readable form

(See image below)

The journals are PDFs containing a series of scanned images of bird observations. The observations are handwritten notes on graph paper. Bird species labels run down the left side of the page and date information runs across the top. The charts are organized by month, with the days of the month as column headings. Each image holds between two and three months of information.

Each cell has a mark indicating the presence or absence of a bird species on a given day, so there is potentially one mark per bird species per day. The mark on the page is typically a forward slash "/", but it can also be an "x" or a colored-in block. Note that the graphs are not clean and contain other notes and stray marks, and that some of the rules are incomplete or missing.

<img  src="Boyd_M_Bird_journal_section1-024.png"/>

In [1]:
%load_ext watermark
%watermark -a 'Raphael LaFrance' -i -u -v -r -g -p numpy,matplotlib,skimage

Raphael LaFrance 
last updated: 2017-10-20T18:20:50-04:00

CPython 3.6.1
IPython 6.2.1

numpy 1.13.3
matplotlib 2.1.0
skimage 0.13.1
Git hash: fd6b74252d6c58ebfc30457b4895e398fbfc6410
Git repo: https://github.com/rafelafrance/boyd-bird-journal.git


## Extract images from PDF files

First we need to extract the individual images from the PDFs. On Linux this is easily done with the `pdfimages` command, which is part of either the poppler or xpdf packages. We use `bash` to make a directory to hold the images and then extract the PDF images into that directory.

In [2]:
%%bash

RAW_DATA='raw_data'
DIRECTORY='images'

PDF1="$RAW_DATA/Boyd_M_Bird_journal_section1.pdf"
PDF2="$RAW_DATA/Boyd_M_Bird_journal_section2.pdf"

PREFIX1="$DIRECTORY/Boyd_M_Bird_journal_section1"
PREFIX2="$DIRECTORY/Boyd_M_Bird_journal_section2"

# Only extract once: skip everything if the images directory already exists
if [ ! -d "$DIRECTORY" ]; then
    mkdir -p "$DIRECTORY"
    pdfimages -png "$PDF1" "$PREFIX1"
    pdfimages -png "$PDF2" "$PREFIX2"
fi

## Setup

We are using a fairly standard scipy stack: `numpy` & `matplotlib`. The only addition is the use of `scikit-image`.

In [3]:
%matplotlib notebook
# %matplotlib inline

import os
from itertools import product

import numpy as np

import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib import cm

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

# import cv2

from skimage import io
from skimage import util
from skimage.filters import sobel
from skimage.transform import hough_line, hough_line_peaks
from skimage.transform import probabilistic_hough_line, rotate

## Brief description of the Hough transform

We're using the Hough transform to find lines in the image. It is a long-established algorithm for finding shapes such as lines in an image, and it is efficient because it only scans the image once.

The basic idea of the algorithm is:

1. Set up a table of every possible line in the image. The lines are in polar form (rho, theta).
    1. Lines are limited to a given set of angles.
    1. This table will hold a count of all of the "on" pixels for the line.
1. Scan the image for "on" pixels.
1. When a pixel is "on", add one to the count of every possible line that goes through that pixel.
1. After every pixel has been recorded choose all lines with a count that is greater than a given threshold.

See the [Wikipedia Page](https://en.wikipedia.org/wiki/Hough_transform) for a more detailed description.
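
The steps above can be sketched in a few lines of `numpy`. This is only a toy illustration of the voting idea, not the `scikit-image` implementation used in the rest of this notebook; the function name `hough_accumulate`, the three candidate angles, and the threshold of 20 are all assumptions made for the example.

```python
import numpy as np


def hough_accumulate(binary, thetas):
    """Vote for lines rho = x*cos(theta) + y*sin(theta) through every "on" pixel."""
    height, width = binary.shape
    max_rho = int(np.ceil(np.hypot(height, width)))
    # Step 1: rows index rho (offset by max_rho so negative values fit),
    # columns index the limited set of candidate angles.
    accumulator = np.zeros((2 * max_rho + 1, len(thetas)), dtype=np.int64)

    ys, xs = np.nonzero(binary)           # Step 2: scan for "on" pixels
    cols = np.arange(len(thetas))
    for x, y in zip(xs, ys):              # Step 3: one vote per candidate angle
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        accumulator[rhos + max_rho, cols] += 1
    return accumulator, max_rho


# Toy image: a single horizontal line at row 5.
image = np.zeros((20, 30), dtype=bool)
image[5, :] = True

thetas = np.deg2rad(np.array([0.0, 45.0, 90.0]))
accumulator, max_rho = hough_accumulate(image, thetas)

# Step 4: keep the lines whose vote count exceeds a threshold.
threshold = 20
rho_idx, theta_idx = np.nonzero(accumulator > threshold)
lines = [(int(r) - max_rho, float(np.rad2deg(thetas[t])))
         for r, t in zip(rho_idx, theta_idx)]
print(lines)
```

Only the (rho, theta) pair that collects a vote from every pixel of the drawn line survives the threshold; the other candidate angles spread their votes thinly over many rho values.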

## Define some convenience objects

#### This is the main grid class. It holds the other classes.

In [4]:
class Grid:

    def __init__(self, *, file_name=None, image=None):
        self.image = io.imread(file_name) if file_name else image

        self.edges = util.invert(self.image)

        self.horiz = Horizontal(self.edges)
        self.vert = Vertical(self.edges)

        self.cells = []

    @property
    def shape(self):
        return self.edges.shape

    @property
    def width(self):
        return self.horiz.size

    @property
    def height(self):
        return self.vert.size

    def get_cells(self):
        self.cells = []
        for row, (n, s) in enumerate(zip(self.horiz.lines[:-1], self.horiz.lines[1:])):
            self.cells.append([])
            for col, (e, w) in enumerate(zip(self.vert.lines[:-1], self.vert.lines[1:])):
                self.cells[row].append(Cell(self.edges, n, s, e, w))

#### This is the base object for working with grid lines. We use it for both vertical and horizontal grid lines.

In [5]:
class GridLines:
    def __init__(self, image):
        self.image = image
        self.thetas = None
        self.angles = []
        self.dists = []
        self.lines = []
        self.threshold = 500
        self.min_distance = 40

    def find_lines(self):
        h_matrix, h_angles, h_dist = hough_line(self.image, self.thetas)

        _, self.angles, self.dists = hough_line_peaks(
            h_matrix,
            h_angles,
            h_dist,
            threshold=self.threshold,
            min_distance=self.min_distance)

    def polar2endpoints(self, angle, dist):
        if np.abs(angle) > np.pi / 4:
            x0 = 0
            x1 = self.image.shape[1]
            y0 = int(np.round(dist / np.sin(angle)))
            y1 = int(np.round((dist - x1 * np.cos(angle)) / np.sin(angle)))
        else:
            y0 = 0
            y1 = self.image.shape[0]
            x0 = int(np.round(dist / np.cos(angle)))
            x1 = int(np.round((dist - y1 * np.sin(angle)) / np.cos(angle)))

        return [x0, y0], [x1, y1]
    
    def add_line(self, point1, point2):
        self.lines.append((point1, point2))
        self.sort_lines()
    
    def sort_lines(self):
        self.lines = sorted(self.lines, key=self.sort_key)

    def find_grid_lines(self):
        self.find_lines()

        self.lines = [self.polar2endpoints(t, r)
                 for (t, r) in zip(self.angles, self.dists)]
        
        self.sort_lines()
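
To make the polar-to-endpoint conversion concrete, here is a standalone sketch of the same arithmetic as `polar2endpoints`, written as a plain function. The name `polar_to_endpoints` and the input values (a rule tilted by 0.3 degrees at a distance of 493 px, in a 5100 x 1650 image) are assumptions chosen for illustration; the distance in particular is not taken from the notebook's output.

```python
import numpy as np


def polar_to_endpoints(angle, dist, shape):
    """Convert a (theta, rho) line into endpoints on the borders of an image."""
    height, width = shape
    if np.abs(angle) > np.pi / 4:
        # Near-horizontal line: solve for y at the left and right borders.
        x0, x1 = 0, width
        y0 = int(np.round(dist / np.sin(angle)))
        y1 = int(np.round((dist - x1 * np.cos(angle)) / np.sin(angle)))
    else:
        # Near-vertical line: solve for x at the top and bottom borders.
        y0, y1 = 0, height
        x0 = int(np.round(dist / np.cos(angle)))
        x1 = int(np.round((dist - y1 * np.sin(angle)) / np.cos(angle)))
    return [x0, y0], [x1, y1]


# Illustrative values only: a vertical rule tilted by 0.3 degrees.
start, end = polar_to_endpoints(np.deg2rad(0.3), 493.0, (5100, 1650))
print(start, end)
```

Even a tilt of 0.3 degrees moves the line by roughly 27 px over the 5100 px height of the page, which is why the grid lines are stored as endpoint pairs rather than as a single row or column index.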

#### Given a set of near horizontal angles we can find horizontal grid lines

In [6]:
class Horizontal(GridLines):
    def __init__(self, image):
        super().__init__(image)
        self.size = image.shape[1]

    def find_grid_lines(self):
        super().find_grid_lines()

        # Add image edges as lines
        self.add_line([0, 0], [self.image.shape[1], 0])
        self.add_line([0, self.image.shape[0]], [self.image.shape[1], self.image.shape[0]])

    @staticmethod
    def sort_key(x):
        return x[0][1]

#### Given a set of near vertical angles we can find vertical grid lines

In [7]:
class Vertical(GridLines):
    def __init__(self, image):
        super().__init__(image)
        self.size = image.shape[0]

    def find_grid_lines(self):
        super().find_grid_lines()

        # Add image edges as lines
        self.add_line([0, 0], [0, self.image.shape[0]])
        self.add_line([self.image.shape[1], 0], [self.image.shape[1], self.image.shape[0]])
    
    @staticmethod
    def sort_key(x):
        return x[0][0]

#### Given horizontal and vertical grid lines we can define a grid cell

In [8]:
class Cell:

    def __init__(self, image, north=None, south=None, east=None, west=None):
        self.image = image
        self.east = east
        self.west = west
        self.north = north
        self.south = south
        self.row_label_threshold = 20

    def interior(self):

        # East vertical line
        (ex0, ey0), (ex1, ey1) = self.east

        # West vertical line
        (wx0, wy0), (wx1, wy1) = self.west

        # North horizontal line
        (nx0, ny0), (nx1, ny1) = self.north

        # South horizontal line
        (sx0, sy0), (sx1, sy1) = self.south

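        # Get the interior of the cell.
        # util.crop() takes the number of pixels to remove from each edge,
        # so the far edges are measured from the opposite image border.
        # Note: get_cells() passes the lower-x line as `east`, so `east`
        # is the cell's left edge here and `west` is its right edge.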
        north = max(ny0, ny1)
        south = self.image.shape[0] - min(sy0, sy1)
        east = max(ex0, ex1)
        west = self.image.shape[1] - min(wx0, wx1)
 
        return util.crop(self.image, ((north, south), (east, west)))

    def is_row_label(self):
        return np.mean(self.interior()) > self.row_label_threshold
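
The arithmetic in `interior()` is easier to follow once you recall that `util.crop` expects, for each axis, the number of pixels to remove before and after the region, not start and stop coordinates. The sketch below mimics that convention with plain `numpy` slicing; `crop_by_widths` and the toy coordinates are hypothetical names and values used only for this illustration.

```python
import numpy as np


def crop_by_widths(image, crop_widths):
    """Mimic the ((before, after), (before, after)) convention with slicing."""
    (top, bottom), (left, right) = crop_widths
    return image[top:image.shape[0] - bottom, left:image.shape[1] - right]


image = np.arange(10 * 8).reshape(10, 8)

# A cell bounded by rows 2..6 and columns 3..7, given as absolute coordinates.
row_start, row_stop, col_start, col_stop = 2, 6, 3, 7

# Convert the far edges into "pixels to remove", as Cell.interior() does
# with `self.image.shape[0] - min(sy0, sy1)` and its column counterpart.
widths = ((row_start, image.shape[0] - row_stop),
          (col_start, image.shape[1] - col_stop))
interior = crop_by_widths(image, widths)

print(interior.shape)
```

Taking the `max` of the near line's endpoints and the `min` of the far line's endpoints, as the class does, keeps a slightly tilted rule out of the cropped region.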

In [9]:
full_image = Grid(file_name='images/Boyd_M_Bird_journal_section1-024.png')

## Split the image into left-hand and right-hand sides

In [10]:
split = int(full_image.width / 2)

left_side = Grid(image=util.crop(full_image.image, ((0, 0), (0, split)), copy=True))
right_side = Grid(image=util.crop(full_image.image, ((0, 0), (split, 0)), copy=True))

print(full_image.shape)
print(left_side.shape)
print(right_side.shape)

(5100, 3300)
(5100, 1650)
(5100, 1650)


## Get the horizontal and vertical grid lines

As described above, a line is detected when its vote count exceeds a threshold. There is a wrinkle, however: the images are not square, with the width being the shorter dimension (3300 px wide by 5100 px high), so a single threshold does not suit both directions. To accommodate this we make two passes over the image: one for the horizontal lines and one for the vertical lines.
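
One thing to keep in mind when choosing angles: in the normal form of a line, `x * cos(theta) + y * sin(theta) = rho`, `theta` is the angle of the line's normal rather than of the line itself. To my understanding this is the convention `scikit-image`'s `hough_line` follows, so horizontal rules show up with `theta` near 90 degrees and vertical rules near 0 degrees. The small check below uses only `numpy`; the helper `normal_form_residual` and the sample points are assumptions for illustration.

```python
import numpy as np


def normal_form_residual(points, theta_deg, rho):
    """Return x*cos(theta) + y*sin(theta) - rho for each (x, y) point."""
    theta = np.deg2rad(theta_deg)
    xs, ys = points[:, 0], points[:, 1]
    return xs * np.cos(theta) + ys * np.sin(theta) - rho


# Points on a horizontal rule at y = 100 and on a vertical rule at x = 40.
horizontal_rule = np.array([[0, 100], [500, 100], [1649, 100]], dtype=float)
vertical_rule = np.array([[40, 0], [40, 2500], [40, 5099]], dtype=float)

# A horizontal line fits theta = 90 degrees; a vertical line fits theta = 0.
horizontal_fit = np.allclose(
    normal_form_residual(horizontal_rule, 90.0, 100.0), 0.0)
vertical_fit = np.allclose(
    normal_form_residual(vertical_rule, 0.0, 40.0), 0.0)

# Using theta = 0 for the horizontal rule does not satisfy the equation.
horizontal_misfit = np.allclose(
    normal_form_residual(horizontal_rule, 0.0, 100.0), 0.0)

print(horizontal_fit, vertical_fit, horizontal_misfit)
```

This is consistent with the angles printed further down, where the horizontal grid lines come back at roughly 90.8 degrees and the vertical line at 0.3 degrees.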

In [11]:
near_horiz_deg = np.linspace(-2.0, 2.0, num=41)
near_vert_deg = np.linspace(88.0, 92.0, num=41)

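# The Hough transform measures theta as the angle of the line's normal,
# so horizontal lines have theta near 90 degrees and vertical lines near 0.
# Swap the two ranges to match that convention.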
near_horiz_deg, near_vert_deg = near_vert_deg, near_horiz_deg

left_side.horiz.thetas = np.deg2rad(near_horiz_deg)
left_side.vert.thetas = np.deg2rad(near_vert_deg)

print(np.rad2deg(left_side.horiz.thetas))
print(np.rad2deg(left_side.vert.thetas))

[ 88.   88.1  88.2  88.3  88.4  88.5  88.6  88.7  88.8  88.9  89.   89.1
  89.2  89.3  89.4  89.5  89.6  89.7  89.8  89.9  90.   90.1  90.2  90.3
  90.4  90.5  90.6  90.7  90.8  90.9  91.   91.1  91.2  91.3  91.4  91.5
  91.6  91.7  91.8  91.9  92. ]
[-2.  -1.9 -1.8 -1.7 -1.6 -1.5 -1.4 -1.3 -1.2 -1.1 -1.  -0.9 -0.8 -0.7 -0.6
 -0.5 -0.4 -0.3 -0.2 -0.1  0.   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
  1.   1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2. ]


#### Find the horizontal grid lines for the left half of the image

In [12]:
left_side.horiz.threshold = left_side.width * 0.4

left_side.horiz.find_grid_lines()

print(len(left_side.horiz.angles))
print(np.rad2deg(left_side.horiz.angles))

99
[ 90.9  90.9  90.9  90.9  90.9  90.9  90.9  90.9  90.9  90.8  90.9  90.9
  90.7  90.8  90.8  90.9  90.8  90.8  90.8  90.9  90.8  90.9  90.9  90.8
  90.8  90.9  90.8  90.8  90.8  90.9  90.9  90.8  90.9  90.8  90.8  90.8
  90.8  90.8  90.9  90.8  90.9  90.8  90.9  90.9  90.8  90.9  90.8  90.9
  90.9  90.8  90.8  90.8  90.8  90.9  90.8  90.8  90.8  90.9  90.9  90.8
  90.8  90.9  90.9  90.8  90.8  90.8  90.9  90.9  90.8  90.8  90.8  90.9
  90.9  90.8  90.9  90.8  90.8  90.8  90.8  90.8  90.9  90.9  90.9  90.9
  90.7  91.   90.9  90.8  90.9  90.8  90.9  90.8  90.8  90.7  90.6  90.8
  90.8  90.8  90.1]


#### Find the vertical grid lines for the left half of the image

In [13]:
left_side.vert.threshold = left_side.height * 0.4

left_side.vert.find_grid_lines()

print(len(left_side.vert.angles))
print(np.rad2deg(left_side.vert.angles))

1
[ 0.3]


#### Add a vertical grid line after the first one.

We're expecting two columns of cells on the left side of the image. The cells are rather long and typically have lots of whitespace toward the right end. We expect the 1st cell to have a row number and the 2nd cell to have the bird's species identification. We are going to look at the 2nd cell to see if there is any writing in it. To help boost the signal we are going to chop the 2nd cell at a fixed width and look at that part for writing.

In [14]:
east = left_side.vert.lines[1][0][0] + 200
point1 = [east, 0]
point2 = [east, left_side.height]
left_side.vert.add_line(point1, point2)
for line in left_side.vert.lines:
    print(line)

([0, 0], [0, 5100])
([493, 0], [466, 5100])
([693, 0], [693, 5100])
([1650, 0], [1650, 5100])


#### Look at the results

In [15]:
fig, ax = plt.subplots(figsize=(2, 4))
ax.imshow(left_side.image, cmap=plt.cm.gray)

for ((x0, y0), (x1, y1)) in left_side.horiz.lines:
    ax.plot((x0, x1), (y0, y1), '-y', linewidth=1)

for ((x0, y0), (x1, y1)) in left_side.vert.lines:
    ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)

plt.tight_layout()
plt.show()


## Get grid cells in the left side of the image

We want the interior area of each cell.

In [16]:
left_side.get_cells()
print(len(left_side.cells))
print(len(left_side.cells[0]))

100
3


## Look for writing in the second cell of each row

In [17]:
has_row_label = [row[1].is_row_label() for row in left_side.cells]

[(0, False), (1, False), (2, False), (3, True), (4, True), (5, True), (6, True), (7, True), (8, True), (9, True), (10, True), (11, True), (12, True), (13, True), (14, True), (15, True), (16, True), (17, True), (18, True), (19, True), (20, True), (21, True), (22, True), (23, True), (24, True), (25, True), (26, True), (27, True), (28, True), (29, True), (30, True), (31, True), (32, True), (33, False), (34, False), (35, False), (36, False), (37, False), (38, False), (39, False), (40, False), (41, False), (42, False), (43, False), (44, False), (45, False), (46, False), (47, False), (48, False), (49, False), (50, False), (51, False), (52, False), (53, False), (54, False), (55, True), (56, True), (57, True), (58, True), (59, True), (60, True), (61, True), (62, True), (63, True), (64, True), (65, True), (66, True), (67, True), (68, True), (69, True), (70, True), (71, True), (72, True), (73, True), (74, True), (75, True), (76, True), (77, True), (78, True), (79, True), (80, True), (81, True), 
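
The test behind `is_row_label()` is just a mean-intensity cutoff on the inverted image, where ink is bright and paper is dark. Here is a minimal sketch on synthetic cells; the cell sizes and pixel patterns are made up for illustration, and only the cutoff of 20 comes from the `Cell` class.

```python
import numpy as np

ROW_LABEL_THRESHOLD = 20  # same cutoff value as Cell.row_label_threshold


def has_writing(cell_pixels, threshold=ROW_LABEL_THRESHOLD):
    """Return True when the mean of an inverted (ink = bright) cell exceeds the cutoff."""
    return bool(np.mean(cell_pixels) > threshold)


# Synthetic inverted cells: background is 0, ink is 255.
blank_cell = np.zeros((40, 200), dtype=np.uint8)
blank_cell[0, :] = 255             # a stray grid rule along the top edge

written_cell = np.zeros((40, 200), dtype=np.uint8)
written_cell[10:30, 20:120] = 255  # a block standing in for handwriting

print(has_writing(blank_cell), has_writing(written_cell))
```

A cutoff well above zero lets a cell tolerate a leftover rule or a few stray marks without being counted as labeled, which is also why chopping the cell to a fixed width helps: it keeps the empty right-hand portion from diluting the mean.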

In [24]:
@interact(row=(0, len(left_side.cells) - 1), col=(0, len(left_side.cells[0]) - 1))
def draw_cell_interior(row, col):
    print('north', left_side.cells[row][col].north)
    print('south', left_side.cells[row][col].south)
    print('east', left_side.cells[row][col].east)
    print('west', left_side.cells[row][col].west)
    print('has label' if has_row_label[row] else 'no label')
    fig, ax = plt.subplots(figsize=(6, 2))
    ax.imshow(left_side.cells[row][col].interior(), cmap=plt.cm.gray)

### Now split the right_side into separate graphs

In [18]:
months = []
for r, row in enumerate(has_row_label[1:], 1):

    if not has_row_label[r - 1] and row:
        north = left_side.cells[r - 1][1].north[1][1]

    if has_row_label[r - 1] and not row:
        south = left_side.cells[r][1].south[1][1]
        month = util.crop(
            right_side.image,
            ((north, right_side.height - south), (0, 0)), copy=True)
        months.append(Grid(image=month))

for month in months:
    print(month.shape)

(1502, 1650)
(1365, 1650)
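
The loop above is, in effect, looking for contiguous runs of labeled rows. As a hedged alternative sketch, the same idea can be written as a small run-finding helper. `labeled_runs` and the short list of flags are hypothetical and not part of the notebook; unlike the loop above, the helper does not pad each run with the neighboring unlabeled row, and it also handles runs that touch the first or last row.

```python
def labeled_runs(flags):
    """Return (start, stop) index pairs for each contiguous run of True values."""
    runs = []
    start = None
    for index, flag in enumerate(flags):
        if flag and start is None:
            start = index                 # a run begins
        elif not flag and start is not None:
            runs.append((start, index))   # a run ends just before `index`
            start = None
    if start is not None:                 # a run that reaches the end of the list
        runs.append((start, len(flags)))
    return runs


# Made-up flags: two blocks of labeled rows separated by blank rows.
flags = [False, False, True, True, True, False, False, True, True, False]
print(labeled_runs(flags))
```

Each `(start, stop)` pair could then be mapped to the north line of row `start` and the south line of row `stop - 1` to crop one month's grid out of the right-hand side.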


In [19]:
for month in months:
    month.horiz.threshold = month.width * 0.4
    month.horiz.thetas = np.deg2rad(near_horiz_deg)

    month.horiz.find_grid_lines()
    
    month.vert.threshold = month.height * 0.4
    month.vert.thetas = np.deg2rad(near_vert_deg)

    month.vert.find_grid_lines()


#### Look at what we got

In [20]:
fig, ax = plt.subplots(figsize=(6, 6))

ax.imshow(months[0].image, cmap=plt.cm.gray)
ax.set_title('Top grid')

for ((x0, y0), (x1, y1)) in months[0].horiz.lines:
    ax.plot((x0, x1), (y0, y1), '-y', linewidth=1)

for ((x0, y0), (x1, y1)) in months[0].vert.lines:
    ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)

plt.tight_layout()
plt.show()


In [46]:
fig, ax = plt.subplots(figsize=(6, 6))

ax.imshow(months[1].image, cmap=plt.cm.gray)
ax.set_title('Bottom grid')

for ((x0, y0), (x1, y1)) in months[1].horiz.lines:
    ax.plot((x0, x1), (y0, y1), '-y', linewidth=1)

for ((x0, y0), (x1, y1)) in months[1].vert.lines:
    ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)

plt.tight_layout()
plt.show()


### Find column labels

### Look for tick marks in grid cells

### Stitch image parts back together to report output

# Failed experiments

- Try merging endpoints: Lines are pretty skewed across the entire image. I tried to use interior points to make grid lines, but this didn't really help.

- Probabilistic Hough line: This may work for other parts of the image, like tick marks, but it didn't help with either grid lines or row labels. It proved to be much slower and harder to tune for finding grid lines that span the entire image.

- OpenCV: This works, but it is less flexible for searching on a limited set of angles and is difficult to install. The ability to pull out the horizontal, vertical, and diagonal lines separately is useful in this application.