# xView Vehicle Object Detection Data Prep

This notebook prepares data for training an object detection model on the xView dataset.


* Download the training images and labels from the xView competition site, unzip them, and place the contents of each zip file in a local or S3 directory.
* Set `raw_uri` to that directory.
* Set `processed_uri` to a local or S3 directory that you can write to. The processed data generated by this notebook is stored there.

No other setup is needed to run this notebook.

In [None]:
raw_uri = 's3://raster-vision-xview-example/raw-data'
processed_uri = '/opt/data/examples/xview/processed-data'
# processed_uri = 's3://raster-vision-xview-example/processed-data'

The steps we'll take to prepare the data are as follows:

- Filter out all of the non-vehicle bounding boxes from the labels, and combine all vehicle types into one class.
- Subset the entire xView dataset to only include the images that are most densely populated with vehicles.
- Split the selected images randomly into 80%/20% training and validation sets.
- Split the vehicle labels by image, and save one label GeoJSON file per image.


This process saves the per-image labels, along with the `train-scenes.csv` and `val-scenes.csv` files used by the experiment at `examples/object_detection/xview.py`, to `processed_uri`.

In [None]:
import os
from os.path import join
import json
import random
from collections import defaultdict

from rastervision.pipeline.file_system import (
    download_if_needed, list_paths, file_to_json, json_to_file, 
    get_local_path, make_dir, sync_to_dir, str_to_file)

random.seed(12345)

### Filter out non-vehicle labels

The xView dataset includes labels for a number of different types of objects. We are only interested in building a detector for objects that can be categorized as vehicles (e.g. 'small car', 'passenger vehicle', 'bus'). We have pre-determined the type IDs that map to vehicle labels and use them to extract all the vehicles from the full xView label set. In this section we also assign the class name 'vehicle' to all of the resulting labels.

In [None]:
label_uri = join(raw_uri, 'xView_train.geojson')
label_js = file_to_json(label_uri)

In [None]:
vehicle_type_ids = [17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 32, 
                    53, 54, 55, 56, 57, 59, 60, 61, 62, 63, 64, 65, 66]

In [None]:
vehicle_features = []
for f in label_js['features']:
    if f['properties']['type_id'] in vehicle_type_ids:
        f['properties']['class_name'] = 'vehicle'
        vehicle_features.append(f)
label_js['features'] = vehicle_features
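
The filtering step can be tried on a small in-memory example without downloading anything. The sketch below is only illustrative: the `toy_labels` dict, the `toy_vehicle_ids` set, and the `keep_vehicles` helper are hypothetical names that are not part of the notebook, and the type IDs and image names are made up. Unlike the cell above, it returns new feature dicts rather than modifying the input in place.

```python
# Toy GeoJSON-like dict; the type_id values and image names are made up.
toy_labels = {
    'type': 'FeatureCollection',
    'features': [
        {'properties': {'type_id': 18, 'image_id': 'a.tif'}},
        {'properties': {'type_id': 73, 'image_id': 'a.tif'}},
        {'properties': {'type_id': 60, 'image_id': 'b.tif'}},
    ],
}
# Hypothetical subset of vehicle IDs; a set gives constant-time membership tests.
toy_vehicle_ids = {18, 60}


def keep_vehicles(label_js, vehicle_ids):
    """Return new feature dicts for vehicles only, each tagged class_name='vehicle'."""
    kept = []
    for feature in label_js['features']:
        if feature['properties']['type_id'] in vehicle_ids:
            # Copy the properties so the input dict is left unmodified.
            props = dict(feature['properties'], class_name='vehicle')
            kept.append({**feature, 'properties': props})
    return kept


toy_vehicles = keep_vehicles(toy_labels, toy_vehicle_ids)
print([f['properties']['type_id'] for f in toy_vehicles])  # prints [18, 60]
```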

### Subset images with the most vehicles

In this section we determine which images contain the most vehicles and are therefore the best candidates for this experiment.

In [None]:
image_to_vehicle_counts = defaultdict(int)
for f in label_js['features']:
    image_id = f['properties']['image_id']
    image_to_vehicle_counts[image_id] += 1

In [None]:
# Use the top 10% of images by vehicle count.
experiment_image_count = round(len(image_to_vehicle_counts) * 0.1)
# Sort ascending by count, then take the last `experiment_image_count` entries.
sorted_images_and_counts = sorted(image_to_vehicle_counts.items(), key=lambda x: x[1])
selected_images_and_counts = sorted_images_and_counts[-experiment_image_count:]
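
As a rough illustration of the "top fraction by count" selection, the same idea can be expressed with `collections.Counter`. This is a sketch under assumptions, not the notebook's method: `toy_image_ids` and `top_fraction` are hypothetical names, the data is made up, and the `max(1, ...)` guard is an addition (without it a count of zero would make a slice such as `[-0:]` return the whole list). Note that `most_common` returns results in descending order, whereas the cell above sorts ascending and takes the tail.

```python
from collections import Counter

# Made-up image IDs: each entry stands for one vehicle feature in that image.
toy_image_ids = ['a.tif'] * 5 + ['b.tif'] * 2 + ['c.tif'] * 9 + ['d.tif'] * 1
toy_counts = Counter(toy_image_ids)


def top_fraction(counts, fraction):
    """Return (image_id, count) pairs for the top `fraction` of images by count."""
    # Always keep at least one image (an assumption added for this sketch).
    k = max(1, round(len(counts) * fraction))
    return counts.most_common(k)


toy_top = top_fraction(toy_counts, 0.5)
print(toy_top)  # prints [('c.tif', 9), ('a.tif', 5)]
```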

### Split into train and validation

Split up training and validation data. Use 80% of images in the training set and 20% in the validation set.

In [None]:
ratio = 0.8
training_sample_size = round(ratio * experiment_image_count)
train_sample = random.sample(range(experiment_image_count), training_sample_size)

train_images = []
val_images = []

In [None]:
# Assign every selected image to either the training or the validation set.
for i in range(experiment_image_count):
    img = selected_images_and_counts[i][0]
    img_path = join('train_images', img)
    if i in train_sample:
        train_images.append(img_path)
    else:
        val_images.append(img_path)
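
The 80%/20% split described above can also be sketched as a shuffle followed by a slice, which guarantees that every item ends up in exactly one of the two sets. The `split_train_val` helper and the `toy_images` list are hypothetical and exist only for this illustration. The helper uses its own `random.Random` instance, so it does not disturb the global seed set earlier.

```python
import random


def split_train_val(items, train_ratio=0.8, seed=12345):
    """Shuffle a copy of `items` and cut it into (train, val) lists."""
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(train_ratio * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Made-up image names, used only to exercise the helper.
toy_images = ['img_{}.tif'.format(i) for i in range(10)]
toy_train, toy_val = split_train_val(toy_images)
print(len(toy_train), len(toy_val))  # prints 8 2
```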

### Divide labels up by image

Using a single vehicle-label GeoJSON file for all of the training and validation images can become unwieldy. Instead, we divide the labels up so that each image has its own GeoJSON file, and save each of these files to a `labels` directory under `processed_uri`.

Then, we create the CSV files that our experiments use to load the training and validation data.

In [None]:
def subset_labels(images):
    """Write one label GeoJSON file per image to the labels directory under processed_uri."""
    for image_path in images:
        img_fn = os.path.basename(image_path)
        img_id = os.path.splitext(img_fn)[0]
        # Collect the vehicle features that belong to this image.
        tiff_features = [
            f for f in label_js['features']
            if f['properties']['image_id'] == img_fn
        ]

        # Copy every top-level key except 'features', then attach this image's features.
        tiff_geojson = {k: v for k, v in label_js.items() if k != 'features'}
        tiff_geojson['features'] = tiff_features

        json_to_file(tiff_geojson, join(processed_uri, 'labels', '{}.geojson'.format(img_id)))

In [None]:
subset_labels(train_images)
subset_labels(val_images)
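
Scanning every feature once per image works, but it can be slow when there are many images and labels. One possible alternative, shown here only as a sketch, is to group the features by `image_id` in a single pass and then build one GeoJSON-like dict per image. The `group_features_by_image` helper and the `toy_label_js` data are hypothetical and not part of the notebook; writing the results with `json_to_file` is left out so the example runs without Raster Vision.

```python
from collections import defaultdict


def group_features_by_image(label_js):
    """Map each image_id to a GeoJSON-like dict holding only that image's features."""
    features_by_image = defaultdict(list)
    for feature in label_js['features']:
        features_by_image[feature['properties']['image_id']].append(feature)

    # Carry over every top-level key except 'features' (for example 'type').
    shared = {k: v for k, v in label_js.items() if k != 'features'}
    return {
        image_id: dict(shared, features=features)
        for image_id, features in features_by_image.items()
    }


# Made-up labels: two features for a.tif and one for b.tif.
toy_label_js = {
    'type': 'FeatureCollection',
    'features': [
        {'properties': {'image_id': 'a.tif'}},
        {'properties': {'image_id': 'b.tif'}},
        {'properties': {'image_id': 'a.tif'}},
    ],
}
per_image = group_features_by_image(toy_label_js)
print({image_id: len(gj['features']) for image_id, gj in per_image.items()})
```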

In [None]:
def create_csv(images, path):
    """Write a header-less CSV with one quoted (image path, labels path) row per image."""
    csv_rows = []
    for img in images:
        img_id = os.path.splitext(os.path.basename(img))[0]
        img_path = join('train_images', '{}.tif'.format(img_id))
        labels_path = join('labels', '{}.geojson'.format(img_id))
        csv_rows.append('"{}","{}"'.format(img_path, labels_path))
    str_to_file('\n'.join(csv_rows), path)

In [None]:
create_csv(train_images, join(processed_uri, 'train-scenes.csv'))
create_csv(val_images, join(processed_uri, 'val-scenes.csv'))
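
To sanity-check the output format, the rows can be parsed back with Python's standard `csv` module. The sketch below works on an in-memory string in the same quoted, header-less layout that `create_csv` produces. The `read_scene_rows` helper and the `toy_csv` content are hypothetical, and how the experiment itself reads these files is not shown here.

```python
import csv
import io

# Two made-up rows in the quoted, header-less format written by create_csv.
toy_csv = (
    '"train_images/1.tif","labels/1.geojson"\n'
    '"train_images/2.tif","labels/2.geojson"'
)


def read_scene_rows(text):
    """Parse header-less CSV text into (image_path, labels_path) tuples."""
    return [tuple(row) for row in csv.reader(io.StringIO(text)) if row]


toy_rows = read_scene_rows(toy_csv)
for image_path, labels_path in toy_rows:
    print(image_path, labels_path)
```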