## To evaluate the model's performance, we need a well-defined way to measure accuracy.

One possible evaluation metric is "intersection over union" (IoU). It divides the area where two bounding boxes overlap by the total area covered by either box (their union). The resulting score measures how closely the two boxes match: 1.0 is a perfect match, while scores close to 0 indicate little or no overlap and are likely incorrect matches.
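As a small worked illustration of the metric, here is a minimal sketch that assumes boxes are given as `[x, y, width, height]`. The name `iou_xywh` is hypothetical and used only for this example. The overlap is clamped to zero so that boxes which do not touch score 0.

```python
def iou_xywh(box_a, box_b):
    """IoU for two boxes given as [x, y, width, height]."""
    # Edges of the overlapping rectangle, if there is one.
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    bottom = min(box_a[1] + box_a[3], box_b[1] + box_b[3])

    # Clamp each side to zero so disjoint boxes have no intersection.
    intersection = max(0, right - left) * max(0, bottom - top)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0

# A guess shifted one pixel from the true box: overlap 300, union 340.
shifted = iou_xywh([328, 295, 16, 20], [327, 295, 16, 20])
# Identical boxes.
identical = iou_xywh([361, 98, 263, 339], [361, 98, 263, 339])
# Boxes that do not overlap at all.
disjoint = iou_xywh([0, 0, 10, 10], [50, 50, 10, 10])
print(shifted, identical, disjoint)
```

The one-pixel shift already lowers the score to 300 / 340, which shows how sensitive the metric is for small boxes.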

However, the number of guessed bounding boxes may also differ from the number of real ones. This can happen in two ways. If there are fewer guesses than real boxes, each guess is matched to its closest real box and scored, while the real boxes left without a partner are counted as errors (false negatives). If there are more guesses than real boxes, each real box is matched to its closest guess that is not closer to another real box, and the leftover guesses are counted as false positives.

This is not as complex as it sounds. To implement it, the distance between the starting corners (the x, y coordinates) of every box in the true set and every box in the test set is computed. The distances are sorted in ascending order, and each pair is accepted if neither of its boxes has been matched yet. Any unmatched true boxes are faces the model missed. Any unmatched test boxes mean the model found too many faces.
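The matching procedure described above can be sketched on its own, separate from the scoring. This is a minimal illustration, not the full evaluation: `greedy_match` is a hypothetical name, and boxes are assumed to be `[x, y, width, height]` with the distance measured between their (x, y) corners.

```python
import math

def greedy_match(true_boxes, guessed_boxes):
    """Pair boxes greedily, closest (x, y) corners first."""
    # Every (distance, true index, guess index) combination, closest first.
    candidates = sorted(
        (math.hypot(t[0] - g[0], t[1] - g[1]), ti, gi)
        for ti, t in enumerate(true_boxes)
        for gi, g in enumerate(guessed_boxes)
    )
    used_true, used_guess, pairs = set(), set(), []
    for _, ti, gi in candidates:
        # Accept a pair only if both boxes are still unmatched.
        if ti not in used_true and gi not in used_guess:
            used_true.add(ti)
            used_guess.add(gi)
            pairs.append((ti, gi))
    false_negatives = len(true_boxes) - len(used_true)
    false_positives = len(guessed_boxes) - len(used_guess)
    return pairs, false_negatives, false_positives

# Two real faces, one guess: the guess pairs with the nearer face.
pairs_a, fn_a, fp_a = greedy_match(
    [[304, 265, 16, 17], [328, 295, 16, 20]], [[327, 295, 16, 20]])

# One real face, two guesses: the exact guess wins, the other is left over.
pairs_b, fn_b, fp_b = greedy_match(
    [[449, 330, 122, 149]], [[448, 330, 122, 149], [449, 330, 122, 149]])
print(pairs_a, fn_a, fp_a)
print(pairs_b, fn_b, fp_b)
```

Greedy matching is simple and fast, but it is not guaranteed to find the pairing with the smallest total distance. It is used here because it mirrors the description above.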

For the matched pairs, the intersection over union is calculated as an accuracy metric. The model can then be tuned to raise these scores and to lower the overall false positive and false negative counts.



In [1]:
import math
import sqlite3
import pandas as pd

# Consider box1, box2 as [x, y, width, height]
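def iou(box1, box2):
    # Edges of the overlapping rectangle
    xa = max(box1[0], box2[0])
    ya = max(box1[1], box2[1])
    xb = min(box1[0] + box1[2], box2[0] + box2[2])
    yb = min(box1[1] + box1[3], box2[1] + box2[3])

    # Clamp to zero: for disjoint boxes both differences can be negative,
    # and their product would otherwise look like a positive overlap
    i_area = max(0, xb - xa) * max(0, yb - ya)

    a_area = box1[2] * box1[3]
    b_area = box2[2] * box2[3]
    u_area = a_area + b_area - i_area

    # Guard against zero-area boxes
    return i_area / float(u_area) if u_area > 0 else 0.0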

def box_distance(box1, box2):
    return math.sqrt( (box1[0] - box2[0]) ** 2 + (box1[1] - box2[1]) ** 2 )


In [26]:
# [ [image_path, image_hash, [bbox1, bbox2, ...]], ... ]
def evaluate_performance(ytrue, yhat):
    
    results = {
        'true_boxes': 0,
        'guessed_boxes': 0,
        'false_positives': 0,
        'false_negatives': 0,
        'bad_guesses': 0,        # < .05
        'unlikely_guesses': 0,   # < .3
        'okay_guesses': 0,       # < .5
        'good_guesses': 0,       # < .75
        'great_guesses': 0,      # else
        'details': []
    }
    
    for (real, guessed) in zip(ytrue, yhat):
        real_boxes = real[2]
        guessed_boxes = guessed[2]
        count_real = len(real_boxes)
        count_guess = len(guessed_boxes)
        results['true_boxes'] += count_real
        results['guessed_boxes'] += count_guess
        # Distance between every (real, guessed) pair, closest first
        distances = [
            (i, j, box_distance(b1, b2))
            for i, b1 in enumerate(real_boxes)
            for j, b2 in enumerate(guessed_boxes)
        ]
        distances.sort(key = lambda x: x[2])
        
        assigned_r = [ False for _ in range(count_real)]
        assigned_g = [ False for _ in range(count_guess)]
        assignments = []
        
        # Greedily accept the closest pairs whose boxes are both still free
        for d in distances:
            if not (assigned_r[d[0]] or assigned_g[d[1]]):
                assigned_r[d[0]] = True
                assigned_g[d[1]] = True
                assignments.append(d)
        
        fns = count_real - sum(assigned_r)
        fps = count_guess - sum(assigned_g)
        scores = [ iou(real_boxes[a[0]], guessed_boxes[a[1]]) for a in assignments ]
        results['details'].append((fps, fns, scores))
        results['false_positives'] += fps
        results['false_negatives'] += fns
        for s in scores:
            if s < .05:
                results['bad_guesses'] += 1
            elif s < .3:
                results['unlikely_guesses'] +=1
            elif s < .5:
                results['okay_guesses'] += 1
            elif s < .75:
                results['good_guesses'] += 1
            else:
                results['great_guesses'] += 1
        
    return results

### Here are some dummy data structures for testing and illustration purposes.

In [13]:
fake_ytrue = [
    ['.' , 'a', [[449, 330, 122, 149]]],
    ['.' , 'b', [[361, 98, 263, 339]]],
    ['.' , 'c', [[304, 265, 16, 17],[328, 295, 16, 20]]]
]

fake_yhat = [
    ['.' , 'a', [[448, 330, 122, 149],[449, 330, 122, 149]]],
    ['.' , 'b', [[361, 98, 263, 339]]],
    ['.' , 'c', [[327, 295, 16, 20]]]
]

In [22]:
evaluate_performance(fake_ytrue, fake_yhat)

{'true_boxes': 4,
 'guessed_boxes': 4,
 'false_positives': 1,
 'false_negatives': 1,
 'bad_guesses': 0,
 'unlikely_guesses': 0,
 'okay_guesses': 0,
 'good_guesses': 1,
 'great_guesses': 2,
 'details': [(1, 0, [1.0]), (0, 0, [1.0]), (0, 1, [0.8823529411764706])]}

In [67]:
def transform_results(info, boxes):
    results = []
    for f in info.itertuples():
        results.append([ f.image_path, 'PLACEHOLDER', [ [r.x, r.y, r.width, r.height] for r in 
            boxes[boxes['counts_id'] == f.id].itertuples() ] ])
    return results

def get_trial_notes(trial):
    con = sqlite3.connect('results.db')
    res = pd.read_sql_query('select notes, model_name from trials where id = ?', con, params = (trial,))
    con.close()
    return (res.iloc[0].model_name, res.iloc[0].notes)

def fetch_results(trial):
    con = sqlite3.connect('results.db')
    counts = pd.read_sql_query('select * from trials_counts where trial_id = ?', con, params = (trial,))
    bbox = pd.read_sql_query('select * from trials_bbx where trial_id = ?', con, params = (trial,))
    con.close()
    return transform_results(counts, bbox)
    
def fetch_true(use_val = False):
    con = sqlite3.connect('widerface.db')
    # Table suffix: validation split if requested, otherwise training split
    split = 'val' if use_val else 'train'
    counts = pd.read_sql_query(f'select * from counts_{split}', con)
    bbox = pd.read_sql_query(f'select * from bbx_{split}', con)
    con.close()
    return transform_results(counts, bbox)
    


### Get real bounding box data in the correct format. This may take a bit.

In [6]:
ytrue = fetch_true() # Training boxes
# ytrue = fetch_true(True) # Validation boxes

### Get data for a given trial. Again, this may take a bit.

In [7]:
yhat = fetch_results(1)

In [27]:
res = evaluate_performance(ytrue, yhat)

In [18]:
ytrue[0:3]

[['0--Parade/0_Parade_marchingband_1_849.jpg',
  'PLACEHOLDER',
  [[449, 330, 122, 149]]],
 ['0--Parade/0_Parade_Parade_0_904.jpg', 'PLACEHOLDER', [[361, 98, 263, 339]]],
 ['0--Parade/0_Parade_marchingband_1_799.jpg',
  'PLACEHOLDER',
  [[78, 221, 7, 8],
   [78, 238, 14, 17],
   [113, 212, 11, 15],
   [134, 260, 15, 15],
   [163, 250, 14, 17],
   [201, 218, 10, 12],
   [182, 266, 15, 17],
   [245, 279, 18, 15],
   [304, 265, 16, 17],
   [328, 295, 16, 20],
   [389, 281, 17, 19],
   [406, 293, 21, 21],
   [436, 290, 22, 17],
   [522, 328, 21, 18],
   [643, 320, 23, 22],
   [653, 224, 17, 25],
   [793, 337, 23, 30],
   [535, 311, 16, 17],
   [29, 220, 11, 15],
   [3, 232, 11, 15],
   [20, 215, 12, 16]]]]

In [19]:
yhat[0:3]

[['./data/train/images/0--Parade/0_Parade_marchingband_1_849.jpg',
  'PLACEHOLDER',
  [[484, 257, 54, 54], [457, 330, 120, 120], [719, 710, 67, 67]]],
 ['./data/train/images/0--Parade/0_Parade_Parade_0_904.jpg',
  'PLACEHOLDER',
  [[634, 1366, 55, 55],
   [354, 128, 298, 298],
   [486, 1056, 62, 62],
   [539, 1083, 130, 130],
   [40, 1155, 79, 79]]],
 ['./data/train/images/0--Parade/0_Parade_marchingband_1_799.jpg',
  'PLACEHOLDER',
  [[795, 339, 26, 26]]]]
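`evaluate_performance` pairs the two lists by position with `zip`, which silently stops at the shorter list, and the samples above show that the true and guessed paths carry different prefixes. One hedged way to make the pairing explicit is to key both lists on the image path before evaluating. The sketch below is illustrative only: `image_key` and `align_by_path` are hypothetical helpers, and it assumes that the last two path components (category folder and file name) identify an image uniquely.

```python
def image_key(path):
    """Keep only the category folder and file name of an image path."""
    return '/'.join(path.replace('\\', '/').split('/')[-2:])

def align_by_path(ytrue, yhat):
    """Reorder yhat to follow ytrue; images with no guess get an empty box list."""
    guesses = {image_key(entry[0]): entry for entry in yhat}
    aligned = []
    for real in ytrue:
        # Fall back to an entry with no guessed boxes when the image is absent.
        aligned.append(guesses.get(image_key(real[0]), [real[0], real[1], []]))
    return aligned

# Toy data in the same [image_path, image_hash, [boxes]] layout.
sample_true = [
    ['0--Parade/a.jpg', 'PLACEHOLDER', [[1, 2, 3, 4]]],
    ['0--Parade/b.jpg', 'PLACEHOLDER', [[5, 6, 7, 8]]],
]
sample_hat = [
    ['./data/train/images/0--Parade/b.jpg', 'PLACEHOLDER', [[5, 6, 7, 8]]],
]
aligned = align_by_path(sample_true, sample_hat)
print(aligned)
```

With the lists aligned this way, an image the model never processed contributes only false negatives instead of shifting every later pairing.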

### Note that the number of guessed boxes minus the false positives should equal the sum of all guess ratings. Adding the false negatives (missed boxes) to that figure should give the total number of true boxes.

In [72]:
def basic_results_analysis(res):
    predicted = res['guessed_boxes'] - res['false_positives']
    guess_sum = res['bad_guesses'] + res['unlikely_guesses'] + res['okay_guesses'] + res['good_guesses'] + res['great_guesses']
    good_guesses = res['okay_guesses'] + res['good_guesses'] + res['great_guesses']
    assert(predicted == guess_sum)
    assert(predicted + res['false_negatives'] == res['true_boxes'])
    print(f'This model identified {res["guessed_boxes"]} total bounding boxes, though {res["false_positives"]} of them did not correspond to a real bounding box.\n'
    + f'{res["false_negatives"]} known boxes failed to be identified ({round(res["false_negatives"] / res["true_boxes"] * 100, 2)}%).\n'
    + f'Of the guessed boxes, {res["bad_guesses"]} were very unlikely to be a match and {res["unlikely_guesses"]} are probably not accurate.\n'
    + f'{good_guesses} ({round(good_guesses / res["true_boxes"] * 100, 2)}% of total) were identified with reasonably high confidence.')

In [73]:
basic_results_analysis(res)

This model identified 51762 total bounding boxes, though 10203 of them did not correspond to a real bounding box.
117865 known boxes failed to be identified (73.93%).
Of the guessed boxes, 9170 were very unlikely to be a match and 1764 are probably not accurate.
30625 (19.21% of total) were identified with reasonably high confidence.


The assertions pass, so the counts are at least internally consistent and the evaluation is likely working as intended. However, the results are far from ideal: the main problem is that most faces are missed (73.93% of the known boxes).

## Compare to grayscale results

In [69]:
# Grayscale version
yhat2 = fetch_results(2)

In [70]:
res2 = evaluate_performance(ytrue, yhat2)

In [74]:
basic_results_analysis(res2)

This model identified 57903 total bounding boxes, though 16220 of them did not correspond to a real bounding box.
117741 known boxes failed to be identified (73.85%).
Of the guessed boxes, 12326 were very unlikely to be a match and 2856 are probably not accurate.
26501 (16.62% of total) were identified with reasonably high confidence.


### Using grayscale costs noticeable accuracy (16.62% of faces identified with reasonably high confidence versus 19.21%, and 16220 false positives versus 10203) in exchange for a large performance gain. It could be useful to use grayscale for areas or types of pictures where confidence is higher. Of course, gaining this insight requires more detailed result handling.
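To compare runs side by side rather than reading the printed summaries, the counts can be reduced to a few rates. The sketch below is a hedged illustration: `summarize` is a hypothetical helper, the dictionaries hold only the counts printed above for trials 1 and 2, and the number of true boxes is derived from the identity checked in `basic_results_analysis` (matched guesses plus false negatives).

```python
def summarize(res):
    """Reduce a results dictionary to two percentages."""
    matched = res['guessed_boxes'] - res['false_positives']
    # Same identity that basic_results_analysis asserts.
    true_boxes = matched + res['false_negatives']
    return {
        # Share of guesses that matched no real box.
        'false_positive_share': round(res['false_positives'] / res['guessed_boxes'] * 100, 2),
        # Share of real boxes that received no guess.
        'missed_share': round(res['false_negatives'] / true_boxes * 100, 2),
    }

# Counts copied from the printed summaries of trial 1 and trial 2 (grayscale).
trial_1 = {'guessed_boxes': 51762, 'false_positives': 10203, 'false_negatives': 117865}
trial_2 = {'guessed_boxes': 57903, 'false_positives': 16220, 'false_negatives': 117741}

summary_1 = summarize(trial_1)
summary_2 = summarize(trial_2)
print(summary_1)
print(summary_2)
```

On these counts the share of missed faces barely changes between the two runs, while a clearly larger share of the grayscale guesses are false positives.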

In [85]:
yhat3 = fetch_results(4)

In [86]:
res3 = evaluate_performance(ytrue, yhat3)

In [87]:
basic_results_analysis(res3)

This model identified 33534 total bounding boxes, though 3642 of them did not correspond to a real bounding box.
129532 known boxes failed to be identified (81.25%).
Of the guessed boxes, 4139 were very unlikely to be a match and 1261 are probably not accurate.
24492 (15.36% of total) were identified with reasonably high confidence.
