# Caltech Pedestrian Annotation

The Caltech Pedestrians dataset (http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/) is separated into images and annotations.

Using a third-party converter (https://github.com/mitmul/caltech-pedestrian-dataset-converter), I have the extracted the annotations into a massive JSON file. I will be exploring the annotations in order to get a better feel for the data, and to pick a subset of images I would like to train on.

In particular, the annotations are stored in `.vbb` format, which stands for "video bounding box." Someone on the internet gives the following explanation of the format:

```
A video bounding box (vbb) annotation stores bounding boxes (bbs) for
objects of interest. The primary difference from a static annotation is
that each object can exist for multiple frames, ie, a vbb annotation not
only provides the locations of objects but also tracking information. A
vbb annotation A is simply a Matlab struct. It contains data per object
(such as a string label) and data per object per frame (such as a bb).
Each object is identified with a unique integer id.

Data per object (indexed by integer id) includes the following fields:
 init - 0/1 value indicating whether object w given id exists
 lbl  - a string label describing object type (eg: 'pedestrian')
 str  - the first frame in which object appears (1 indexed)
 end  - the last frame in which object appears (1 indexed)
 hide - 0/1 value indicating object is 'hidden' (used during labeling)

Data per object per frame (indexed by frame and id) includes:
 pos  - [l t w h]: bb indicating predicted object extent
 posv - [l t w h]: bb indicating visible region (may be [0 0 0 0])
 occl - 0/1 value indicating if bb is occluded
 lock - 0/1 value indicating bb is 'locked' (used during labeling)
```

In [1]:
import json

with open('../CaltechPedestrians/data/annotations.json') as f:
    raw = json.load(f)

Let's work our way through the data and count the number of detection labels first.

In [2]:
def generate_detections(max_count=10):
    count = 0
    for s, ss in raw.items():
        for v, vv in ss.items():
            for f, ff in vv['frames'].items():
                for detection in ff:
                    yield detection['lbl']
                    count += 1
                    if count >= max_count:
                        return

generator = generate_detections(max_count=1000000000)

from collections import Counter
c = Counter(generator)

from pprint import pprint
pprint(dict(c))

{'people': 34950, 'person': 153234, 'person-fa': 1153, 'person?': 2848}


There are four classes of labeled objects: people, person, person-fa, and person?. After looking at examples of each, it's probably fine if we use images that just contain people, hopefully, since it's fine to try to detect that.

Now let's make a function to generate the image filename, which incidentally we'll also use to name the xml annotation files in VOC format.

In [68]:
def gen_image_filename(s, vid, frame, extension, misc=''):
    return '{:s}_{:s}_{:s}{:s}.{:s}'.format(s, vid, frame, misc, extension)

Now we create a class for the annotation files, that operates as a context manager so when it's closed, it writes the xml file and we don't have to deal with it directly. The xml file is modeled after https://github.com/experiencor/raccoon_dataset/blob/master/annotations/raccoon-1.xml, which is in the VOC format afaik.

In [82]:
import xml.etree.cElementTree as et
import os

class AnnotationFile:
    
    def __init__(self, path, out_path):
        self.annotation_count = 0
        self.path = path
        self.out_path = out_path
    
    def __enter__(self):
        root = et.Element('annotation', verified='yes')
        self.root = root
        folder = et.SubElement(root, 'folder')
        folder.text = 'images'
        filename = et.SubElement(root, 'filename')
        filename.text = os.path.basename(self.path)
        path = et.SubElement(root, 'path')
        path.text = self.path
        source = et.SubElement(root, 'source')
        database = et.SubElement(source, 'database')
        database.text = 'unknown'
        size = et.SubElement(root, 'size')
        width = et.SubElement(size, 'width')
        width.text = '640'
        height = et.SubElement(size, 'height')
        height.text = '480'
        depth = et.SubElement(size, 'depth')
        depth.text = '3'
        segmented = et.SubElement(root, 'segmented')
        segmented.text = '0'
        return self
    
    def add_annotation(self, label, xmin, xmax, ymin, ymax):
        obj = et.SubElement(self.root, 'object')
        name = et.SubElement(obj, 'name')
        name.text = label
        pose = et.SubElement(obj, 'pose')
        pose.text = 'Unspecified'
        truncated = et.SubElement(obj, 'truncated')
        truncated.text = '0'
        difficult = et.SubElement(obj, 'difficult')
        difficult.text = '0'
        box = et.SubElement(obj, 'bndbox')
        x1 = et.SubElement(box, 'xmin')
        x1.text = str(xmin)
        x2 = et.SubElement(box, 'ymin')
        x2.text = str(ymin)
        x3 = et.SubElement(box, 'xmax')
        x3.text = str(xmax)
        x4 = et.SubElement(box, 'ymax')
        x4.text = str(ymax)
        self.annotation_count += 1

    def __exit__(self, exec_type, exec_val, exec_traceback):
        if self.annotation_count > 0:
            tree = et.ElementTree(self.root)
            tree.write(self.out_path, xml_declaration=False)

Now we loop through all of the images, doing two things to images that meet our criteria:
- annotating the images and saving them elsewhere with bounding boxes in green
- creating the annotation files for each image

In [84]:
import cv2
import matplotlib.pyplot as plt
from random import shuffle
import numpy as np

def annotate_samples():
    count = 0
    max_count = 67083
    skip_size = 1
    num_written = 0
    for s, ss in raw.items():
        for v, vv in ss.items():
            for f, ff in vv['frames'].items():
                if np.mod(count, skip_size) == 0:
                    imname = gen_image_filename(s, v, f, 'png')
                    imxmlname = gen_image_filename(s, v, f, 'xml')
                    impath = '../CaltechPedestrians/data/images/{:s}'.format(imname)
                    xmlpath = '../CaltechPedestrians/data/annotations_xml/{:s}'.format(imxmlname)
                    im = cv2.imread(impath)
                    with AnnotationFile(impath, xmlpath) as af:
                        drew_boxes = False
                        for ii, detection in enumerate(ff):
                            # skip if object is occuluded
                            if detection['occl']:
                                break
                            # get detection bounding box
                            bbl, bbt, bbw, bbh = np.array(detection['pos']).astype(int)
                            # skip if object is too small
                            if bbh < 30 or bbh > 80:
                                break
                            if bbw < 10:
                                break
                            # draw bounding box
                            cv2.rectangle(im, (bbl, bbt), (bbl + bbw, bbt + bbh), \
                                          (0, 255, 0), 2)
                            af.add_annotation(detection['lbl'], bbl, bbl + bbw, bbt, bbt + bbh)
                            drew_boxes = True
                        if drew_boxes:
                            write_name = gen_image_filename(s, v, f, '_annotated')
                            write_path = '../CaltechPedestrians/data/' \
                                         'images_annotated/{:s}'.format(imname)
                            cv2.imwrite(write_path, im)
                            num_written += 1
                            if np.mod(num_written, 80):
                                ending = ''
                            else:
                                ending = ' {:.2f}%\n'.format(100 * count / (max_count/skip_size))
                            print('.', end=ending)
                count += 1
                if count >= max_count:
                    return
    print()
    print('Total frames with labels:', count)
    print('Images written:', num_written)

annotate_samples()

................................................................................ 0.31%
................................................................................ 0.52%
................................................................................ 0.65%
................................................................................ 1.16%
................................................................................ 1.45%
................................................................................ 1.57%
................................................................................ 1.70%
................................................................................ 1.83%
................................................................................ 1.98%
................................................................................ 2.49%
................................................................................ 2.68%
...........................................

................................................................................ 38.98%
................................................................................ 39.27%
................................................................................ 39.67%
................................................................................ 39.83%
................................................................................ 40.00%
................................................................................ 40.14%
................................................................................ 40.29%
................................................................................ 40.69%
................................................................................ 41.21%
................................................................................ 41.32%
................................................................................ 41.44%
................................

................................................................................ 97.27%
................................................................................ 97.76%
................................................................................ 98.86%
................................................................................ 99.04%
................................................................................ 99.16%
................................................................................ 99.49%
.....

Now that the image and annotation data is in the desired format, we can proceed with testing Keras-YOLO2 with different backends (https://github.com/experiencor/keras-yolo2)