# 05. Copy GSV Images for Labelling

In a previous step, Google Street View images were sampled from a list of intersections. For each sample, a set of images was downloaded and cached, then displayed on screen. The operator used buttons to record which images contained the bicycle lane marker we are looking for. These "hits" were recorded in a CSV file.

In the first half of this Notebook, we take every location listed in the CSV and copy the images to a folder for labelling with labelImg.

Then we pause to run labelImg.

Then in the second half of the Notebook, we randomly allocate labelled images to either the training or testing dataset folders, based on a percentage split.

## Configuration

Any configuration required to run this notebook can be customized in the next cell.

In [1]:
# Input CSV file of "hits", i.e. images where a clear bicycle lane marker was observed,
# for inclusion in the output dataset via the labelling stage.
# Will be read from the 'data_sources' directory
input_hits_filename = 'hits.csv'

# Suffix that will be added to the name of the dataset folder and the test/train folders,
# for this "version" of the dataset.  Later, the same suffix will be used for the tfrecord files
# we create from the images, to feed to TensorFlow.
dataset_version = 'V1'

# Test split percentage
# What percentage is held aside and moved into "test_XXX", while the rest are moved to "train_XXX"
# (where "XXX" is the dataset_version string, above)
# The actual number of images placed in "test_XXX" will be rounded DOWN
test_split_percentage = 20
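
As a quick illustration of the rounding rule described in the comments above, the sketch below computes the test/train counts for a couple of dataset sizes. The helper name `split_counts` and the example counts are assumptions for illustration only; they are not part of the notebook.

```python
import math

def split_counts(num_labelled, test_percentage):
    """Return (test_count, train_count), rounding the test count DOWN."""
    test_count = math.floor(num_labelled * test_percentage / 100)
    return test_count, num_labelled - test_count

# Hypothetical dataset of 23 labelled images with a 20% test split:
# 23 * 20 / 100 = 4.6, which rounds down to 4
print(split_counts(23, 20))  # (4, 19)

# With very small datasets the test folder can end up empty:
# 4 * 20 / 100 = 0.8, which rounds down to 0
print(split_counts(4, 20))   # (0, 4)
```

Because of the rounding, a dataset needs at least five labelled images before a 20% split places anything in the test folder.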

## Code

In [2]:
# General imports
import os
import sys
import shutil

import math
import random

import pandas as pd

from pathlib import Path

from tqdm.notebook import tqdm, trange


# Make sure local modules can be imported
module_path_root = os.path.abspath(os.pardir)
if module_path_root not in sys.path:
    sys.path.append(module_path_root)
    
# Get root install path, a level above the minor_thesis folder from GitHub
install_path_root = Path(module_path_root).parent.absolute()

In [3]:
# Derived path for input "hits" file
input_hits_path = os.path.join(os.path.abspath(os.pardir), 'data_sources', input_hits_filename)

# Derived path for main dataset images directory, train images directory, and test images directory
image_dataset_dir = os.path.join(install_path_root, 'TensorFlow', 'workspace', 'images', 'dataset_{0:s}'.format(dataset_version))
image_train_dir   = os.path.join(install_path_root, 'TensorFlow', 'workspace', 'images', 'train_{0:s}'.format(dataset_version))
image_test_dir    = os.path.join(install_path_root, 'TensorFlow', 'workspace', 'images', 'test_{0:s}'.format(dataset_version))

# Derived GSV download/cache directory
gsv_download_dir = os.path.join(os.path.abspath(os.pardir), 'data_sources', 'gsv')

# Create the output directories, if they do not already exist
Path(image_dataset_dir).mkdir(parents=True, exist_ok=True)
Path(image_train_dir).mkdir(parents=True, exist_ok=True)
Path(image_test_dir).mkdir(parents=True, exist_ok=True)

In [4]:
# Read CSV file of hits
df = pd.read_csv(input_hits_path)
df.head(3)

Unnamed: 0,id,offset,image_num
0,387454,0,1


In [5]:
# Image files in the GSV cache are split into multiple directories.
# Find each image file in the "hit" CSV and copy them all to the "dataset" folder for labelling
for index in trange(len(df)):
    row = df.iloc[[index]]
    
    id        = row['id'].item()
    offset    = row['offset'].item()
    image_num = row['image_num'].item()
    
    heading = int(image_num) * 90
    
    input_path      = os.path.join(gsv_download_dir, str(id), str(offset), str(heading), 'gsv_0.jpg')
    output_filename = '{0:s}_{1:s}_{2:d}_gsv_0.jpg'.format(str(id), str(offset), heading)
    output_path     = os.path.join(image_dataset_dir, output_filename)
    
    shutil.copyfile(input_path, output_path)

  0%|          | 0/1 [00:00<?, ?it/s]
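
The loop above maps each CSV row to a cache path of the form `<id>/<offset>/<heading>/gsv_0.jpg`, with `heading = image_num * 90`, and flattens it into a single output filename. A minimal sketch of that mapping as a standalone function follows, which also makes it easy to check for a missing cache file before copying. The function name `hit_paths`, the example directory names, and the example row values are hypothetical; the path layout is taken from the loop above.

```python
import os

def hit_paths(gsv_dir, dataset_dir, location_id, offset, image_num):
    """Return (input_path, output_path) for one row of the hits CSV."""
    # Assumption: image_num runs 0..3, giving headings 0, 90, 180 and 270
    heading = int(image_num) * 90
    input_path = os.path.join(gsv_dir, str(location_id), str(offset), str(heading), 'gsv_0.jpg')
    output_name = '{0}_{1}_{2:d}_gsv_0.jpg'.format(location_id, offset, heading)
    return input_path, os.path.join(dataset_dir, output_name)

# Hypothetical row: id=387454, offset=0, image_num=1
src, dst = hit_paths('gsv', 'dataset_V1', 387454, 0, 1)
print(os.path.basename(dst))  # 387454_0_90_gsv_0.jpg

# Checking first avoids a FileNotFoundError part-way through a long copy loop
if not os.path.isfile(src):
    print('Missing cache file:', src)
```

Negative offsets are carried through unchanged, so a row with offset `-10` produces a filename such as `41137_-10_180_gsv_0.jpg`.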

## Labelling

Now, all the images for the dataset have been copied to the "dataset_XXX" directory in TensorFlow/workspace/images, where "XXX" is the suffix defined above in the configuration.

The next step is to run "labelImg" from:

https://github.com/tzutalin/labelImg

Follow the instructions on that webpage to install and run it.

In labelImg, browse to the "dataset_XXX" directory and label bicycle lane markings with the class name:

* BikeLaneMarker

Once all the labelling is done, you should have an XML file in that directory for every image.  Come back here and run the final phase of this notebook.
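
Before moving on, it may be worth confirming that every image really does have a matching XML file. The split step below samples only from the `.xml` files and then moves everything left over into the training folder, so an unlabelled image would end up there without a label. A minimal sketch of such a check, assuming images use a `.jpg` extension as in the copy step; the helper name `find_unlabelled_images` is hypothetical.

```python
import os
import tempfile

def find_unlabelled_images(directory):
    """Return sorted .jpg filenames in `directory` that have no matching .xml file."""
    names = os.listdir(directory)
    labelled = {os.path.splitext(n)[0] for n in names if n.lower().endswith('.xml')}
    return sorted(
        n for n in names
        if n.lower().endswith('.jpg') and os.path.splitext(n)[0] not in labelled
    )

# Small demonstration in a temporary directory with made-up filenames
with tempfile.TemporaryDirectory() as tmp:
    for name in ('a.jpg', 'a.xml', 'b.jpg'):
        open(os.path.join(tmp, name), 'w').close()
    print(find_unlabelled_images(tmp))  # ['b.jpg']
```

In the notebook itself, the function would be called with `image_dataset_dir`, and an empty list would indicate that labelling is complete.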

## Training/Test split

We now want to split the "dataset_XXX" directory into "train_XXX" and "test_XXX" directories according to the percentage split set in the configuration.
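
The selection part of this split can also be written as a small pure function, which keeps the choice of test files separate from the file moves and easier to test. This is a sketch rather than the notebook's own code: the function name `choose_test_labels` and the optional `seed` parameter are assumptions, added so that the same split can be reproduced between runs.

```python
import math
import random

def choose_test_labels(xml_files, test_percentage, seed=None):
    """Pick label files for the test set; the count is rounded down."""
    sample_size = math.floor(len(xml_files) * test_percentage / 100)
    # Private generator, so the global random state is left untouched
    rng = random.Random(seed)
    # Sort first so the result depends only on the seed, not on directory listing order
    return sorted(rng.sample(sorted(xml_files), sample_size))

# Hypothetical label filenames
labels = ['img_{0:d}.xml'.format(i) for i in range(10)]
test_labels = choose_test_labels(labels, 20, seed=42)
train_labels = [f for f in labels if f not in test_labels]
print(len(test_labels), len(train_labels))  # 2 8
```

With `seed=None` the selection differs on every run, which matches calling `random.sample` without seeding.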

In [6]:
# Get a list of all label files in the dataset
xml_file_list = [f for f in os.listdir(image_dataset_dir) if f.endswith('.xml')]

# Determine how many images to sample for the "test" directory
sample_size = math.floor(len(xml_file_list) * test_split_percentage / 100)

# Randomly select from the list
test_files = random.sample(xml_file_list, sample_size)

# Move the sampled XML files and their corresponding image files (same base name, different extension)
for test_label_file in test_files:
    test_label_base = os.path.splitext(test_label_file)[0]
    
    associated_files = [f for f in os.listdir(image_dataset_dir) if f.startswith(test_label_base + '.')]
    for sample_file in associated_files:
        print('Test  file: {0:s}'.format(sample_file))
        input_path  = os.path.join(image_dataset_dir, sample_file)
        output_path = os.path.join(image_test_dir,    sample_file)
        
        shutil.move(input_path, output_path)

# Move any remaining files to the training directory
remaining_file_list = os.listdir(image_dataset_dir)

for training_file in remaining_file_list:
    print('Train file: {0:s}'.format(training_file))
    input_path  = os.path.join(image_dataset_dir, training_file)
    output_path = os.path.join(image_train_dir,   training_file)
    
    shutil.move(input_path, output_path)

Test  file: 353532_10_0_gsv_0.jpg
Test  file: 353532_10_0_gsv_0.xml
Test  file: 340055_10_0_gsv_0.jpg
Test  file: 340055_10_0_gsv_0.xml
Test  file: 41137_-10_180_gsv_0.jpg
Test  file: 41137_-10_180_gsv_0.xml
Test  file: 254704_-10_90_gsv_0.jpg
Test  file: 254704_-10_90_gsv_0.xml
Test  file: 249290_20_180_gsv_0.jpg
Test  file: 249290_20_180_gsv_0.xml
Test  file: 24417_-20_270_gsv_0.jpg
Test  file: 24417_-20_270_gsv_0.xml
Test  file: 71537_30_0_gsv_0.jpg
Test  file: 71537_30_0_gsv_0.xml
Test  file: 11465_30_270_gsv_0.jpg
Test  file: 11465_30_270_gsv_0.xml
Test  file: 223788_-20_90_gsv_0.jpg
Test  file: 223788_-20_90_gsv_0.xml
Test  file: 58871_0_180_gsv_0.jpg
Test  file: 58871_0_180_gsv_0.xml
Test  file: 138930_20_180_gsv_0.jpg
Test  file: 138930_20_180_gsv_0.xml
Test  file: 362571_-100_0_gsv_0.jpg
Test  file: 362571_-100_0_gsv_0.xml
Test  file: 275050_10_0_gsv_0.jpg
Test  file: 275050_10_0_gsv_0.xml
Test  file: 189252_10_90_gsv_0.jpg
Test  file: 189252_10_90_gsv_0.xml
Test  file: 23093_