# 13. Copy Dashcam Images for Labelling

In notebook 12, we split dashcam footage into images and metadata, ready to be processed by a detection model. We could jump straight ahead to notebook 15 and apply the detection model that was previously trained on GSV images to the dashcam images. And we did.

Naively running the old GSV model against the dashcam images produces many "hits". Most are true positives, but some are false positives caused by, for example:

* White markings on the road (give way stripes, traffic islands, turning arrows)
* Reflections off the bonnet of the camera vehicle
* White clouds or objects way off to the side of the road
* etc.

It also didn't perform as well as it could have: it only detected bicycle lane markings when they were very close to the camera, and missed markings that were clearly visible but further into the distance.

So we now want to use the "hits" from that initial pass to enhance the training and validation dataset with actual dashcam images, covering both the true positives and the false positives.

To steer the model away from some of the false positives, we create new classes such as "GiveWayMarker" or "ArrowMarker" to explicitly label what they ARE. This will hopefully give the object detection model a way to more confidently label them as something other than a bicycle lane marking!

So in this notebook, we do three things:

* Copy the "hits" from the first pass into the "dataset" directory
* Stop and label them with labelImg
* Split them into the "train" and "test" directories to create a larger dataset

## Configuration

Any configuration that is required to run this notebook can be customized in the next cell.

In [4]:
# Name of the subdirectory containing dashcam footage for an area, split into frame images in a
# "split" subdirectory, with a detection log in "detections/detection_log.csv"
# This subdirectory is assumed to be in the 'data_sources' directory
import_directory = 'dashcam_tour_mount_eliza'
#import_directory = 'dashcam_tour_frankston'

# Version suffix for the previous dataset.  We will copy existing images from here, and add to them.
input_version = 'V1'

# Version suffix for the new dataset being created, with additional images.
output_version = 'V2'

# Test split percentage
# What percentage of labelled images is held aside and moved into "test_XXX", while the rest are moved to "train_XXX"
# (where "XXX" is the output_version string, above)
# The actual number of images placed in "test_XXX" will be rounded DOWN
test_split_percentage = 20

## Code

In [2]:
# General imports
import os
import sys
import shutil

import math
import random

import pandas as pd

from pathlib import Path

from tqdm.notebook import tqdm, trange


# Make sure local modules can be imported
module_path_root = os.path.abspath(os.pardir)
if module_path_root not in sys.path:
    sys.path.append(module_path_root)
    
# Get root install path, a level above the minor_thesis folder from GitHub
install_path_root = Path(module_path_root).parent.absolute()

In [5]:
# Derived path for input detection log with original image paths
detection_log_path = os.path.join(module_path_root, 'data_sources', import_directory, 'detections', 'detection_log.csv')

# Derived path for main dataset images directory, train images directory, and test images directory
prev_image_train_dir   = os.path.join(install_path_root, 'TensorFlow', 'workspace', 'images', 'train_{0:s}'.format(input_version))
prev_image_test_dir    = os.path.join(install_path_root, 'TensorFlow', 'workspace', 'images', 'test_{0:s}'.format(input_version))
next_image_dataset_dir = os.path.join(install_path_root, 'TensorFlow', 'workspace', 'images', 'dataset_{0:s}'.format(output_version))
next_image_train_dir   = os.path.join(install_path_root, 'TensorFlow', 'workspace', 'images', 'train_{0:s}'.format(output_version))
next_image_test_dir    = os.path.join(install_path_root, 'TensorFlow', 'workspace', 'images', 'test_{0:s}'.format(output_version))

# Create the output directories, if they do not already exist
Path(next_image_dataset_dir).mkdir(parents=True, exist_ok=True)
Path(next_image_train_dir).mkdir(parents=True, exist_ok=True)
Path(next_image_test_dir).mkdir(parents=True, exist_ok=True)

In [6]:
# Read detection log
df = pd.read_csv(detection_log_path)
df.head(3)

Unnamed: 0,lat,lon,bearing,heading,way_id_start,way_id,node_id,offset_id,score,bbox_0,bbox_1,bbox_2,bbox_3,orig_filename
0,-38.208327,145.109764,346,346,0,0,0,2148.0,0.732124,0.406723,0.374172,0.438563,0.415775,E:\Release\minor_thesis\data_sources\dashcam_t...
1,-38.2047,145.113683,65,65,0,0,0,1332.0,0.6145,0.503657,0.293135,0.541611,0.35472,E:\Release\minor_thesis\data_sources\dashcam_t...
2,-38.204159,145.114524,38,38,0,0,0,1776.0,0.652085,0.464327,0.315719,0.492939,0.359375,E:\Release\minor_thesis\data_sources\dashcam_t...


In [8]:
# Find each image file in the "detection log" CSV and copy them all to the "dataset" folder for the new version
for index in trange(len(df)):
    row = df.iloc[[index]]
    
    orig_filename = row['orig_filename'].item()
        
    output_filename = os.path.basename(orig_filename)
    output_path     = os.path.join(next_image_dataset_dir, output_filename)
    
    shutil.copyfile(orig_filename, output_path)

  0%|          | 0/232 [00:00<?, ?it/s]

## Labelling

All the images for the dataset have now been copied to the "dataset_XXX" directory in TensorFlow/workspace/images, where "XXX" is the output version suffix defined above in the configuration.

The next step is to run "labelImg" from:

https://github.com/tzutalin/labelImg

Follow the instructions on that webpage to install and run it.

Browse to the "dataset_XXX" directory, and label the images with the following class names:

* BikeLaneMarker
* GiveWayMarker
* IslandMarker
* ArrowMarker
* RoadDefect
* RoadWriting

Once all the labelling is done, you should have an XML file in that directory for every image. Come back here and run the final phase of this notebook.
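Before moving on, it can be worth checking that every image really does have a label file. The following is a minimal sketch, not part of the original notebook: the helper name `find_unlabelled_images` and the list of image extensions are assumptions, and should be adjusted to match the frame files produced in notebook 12.

```python
import os

# Assumed image extensions; adjust to match the frames written by notebook 12
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')


def find_unlabelled_images(dataset_dir):
    """Return image filenames in dataset_dir that have no matching .xml file."""
    filenames = os.listdir(dataset_dir)

    # Base names (without extension) of every label file written by labelImg
    labelled_stems = {
        os.path.splitext(f)[0] for f in filenames if f.lower().endswith('.xml')
    }

    # Any image whose base name is not in that set still needs labelling
    return sorted(
        f for f in filenames
        if f.lower().endswith(IMAGE_EXTENSIONS)
        and os.path.splitext(f)[0] not in labelled_stems
    )
```

Calling `find_unlabelled_images(next_image_dataset_dir)` should return an empty list once labelling is complete. Any image left unlabelled would otherwise be moved into the training directory without an XML file by the split below.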

## Training/Test split

We now want to split the "dataset_XXX" directory into "train_XXX" and "test_XXX" directories according to the percentage split from the configuration.
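To make the rounding behaviour concrete, here is a small sketch of the sampling step on its own. It assumes that all 232 images copied earlier received a label file, uses hypothetical filenames, and adds an optional `seed` parameter so the split can be repeated; the cell below does not seed the random generator.

```python
import math
import random


def choose_test_files(label_files, test_percentage, seed=None):
    """Pick floor(len(label_files) * test_percentage / 100) files for the test set."""
    sample_size = math.floor(len(label_files) * test_percentage / 100)

    # A dedicated generator keeps the choice repeatable when a seed is given
    rng = random.Random(seed)

    # Sort first so the result does not depend on directory listing order
    return rng.sample(sorted(label_files), sample_size)


# Hypothetical label filenames, one per copied detection image
labels = ['frame_{0:03d}.xml'.format(i) for i in range(232)]
test_labels = choose_test_files(labels, 20, seed=42)

# 232 * 20 / 100 = 46.4, which rounds DOWN to 46 test images
print(len(test_labels), len(labels) - len(test_labels))  # 46 186
```

With a 20% split, 46 labelled images would go to the test directory and the remaining 186 to the training directory.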

In [None]:
# Get a list of all label files in the dataset
xml_file_list = [f for f in os.listdir(next_image_dataset_dir) if f.endswith('.xml')]

# Determine how many images to sample for the "test" directory
sample_size = math.floor(len(xml_file_list) * test_split_percentage / 100)

# Randomly select from the list
test_files = random.sample(xml_file_list, sample_size)

# Move the sampled XML files, along with the image file that shares each base name but has a different extension
for test_label_file in test_files:
    test_label_base = os.path.splitext(test_label_file)[0]
    
    associated_files = [f for f in os.listdir(next_image_dataset_dir) if f.startswith(test_label_base + '.')]
    for sample_file in associated_files:
        print('Test  file: {0:s}'.format(sample_file))
        input_path  = os.path.join(next_image_dataset_dir, sample_file)
        output_path = os.path.join(next_image_test_dir,    sample_file)
        
        shutil.move(input_path, output_path)

# Move any remaining files to the training directory
remaining_file_list = os.listdir(next_image_dataset_dir)

for training_file in remaining_file_list:
    print('Train file: {0:s}'.format(training_file))
    input_path  = os.path.join(next_image_dataset_dir, training_file)
    output_path = os.path.join(next_image_train_dir,   training_file)
    
    shutil.move(input_path, output_path)
    
# Copy training images and test images from the previous version according to the previous split
prev_train_file_list = os.listdir(prev_image_train_dir)

for training_file in prev_train_file_list:
    print('Prev Train file: {0:s}'.format(training_file))
    
    input_path  = os.path.join(prev_image_train_dir, training_file)
    output_path = os.path.join(next_image_train_dir,   training_file)
    
    shutil.copyfile(input_path, output_path)

prev_test_file_list = os.listdir(prev_image_test_dir)

for test_file in prev_test_file_list:
    print('Prev Test file: {0:s}'.format(test_file))
    
    input_path  = os.path.join(prev_image_test_dir, test_file)
    output_path = os.path.join(next_image_test_dir, test_file)
    
    shutil.copyfile(input_path, output_path)
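
Because the new train and test directories now combine files from two sources, a quick sanity check is to confirm that no filename ended up in both. This is a minimal sketch that is not part of the original notebook; the helper name `find_split_overlap` is an assumption.

```python
import os


def find_split_overlap(train_dir, test_dir):
    """Return filenames that are present in both the train and test directories."""
    train_files = set(os.listdir(train_dir))
    test_files = set(os.listdir(test_dir))

    # Any shared filename means the same image would be used for training and testing
    return sorted(train_files & test_files)
```

Calling `find_split_overlap(next_image_train_dir, next_image_test_dir)` should return an empty list. A non-empty result would suggest that a dashcam image shares a filename with an image from the previous dataset version.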