# Global Wheat Detection Annotations

Creates Pascal VOC (XML) and COCO (JSON) annotations for the [Global Wheat Detection](https://www.kaggle.com/c/global-wheat-detection) challenge.


## Required Libraries

In [1]:
import sys, os, re, time, glob, random, json
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

!pip install -q pascal-voc-writer
from pascal_voc_writer import Writer

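# voc2coco.py supplies ET, get, get_and_check and START_BOUNDING_BOX_ID, which convert() uses later
!wget -q https://raw.githubusercontent.com/Tony607/voc2coco/master/voc2coco.py
from voc2coco import *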

## Import and Preprocess Data

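This notebook runs with or without `global-wheat-detection.zip`, the archive containing all the competition data. To use the archive, download it from the [competition data page](https://www.kaggle.com/c/global-wheat-detection/data) by selecting Download All and place it in the working directory. Otherwise, the notebook downloads the list of image ids and `train.csv` from GitHub.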



In [2]:
#@markdown Select only if `global-wheat-detection.zip` is in the working directory
use_jpg_dataset = False #@param {type: 'boolean'}

In [3]:
if not os.path.isdir('data'):
    !mkdir data
if use_jpg_dataset and os.path.isfile('global-wheat-detection.zip'):
    !unzip -q global-wheat-detection.zip -d data
    !mv data/train data/images
    fnames = glob.glob('data/images/*.jpg')
    ids_from_files = [fn.split('/')[-1].split('.')[0] for fn in fnames] # ids of all images in the orig train set
    df = pd.read_csv('data/train.csv')  # relative path, so it also works outside Colab's /content
else:
    !wget -q https://raw.githubusercontent.com/reyvaz/Global-Wheat-XML-and-COCO-Annotations/master/data/ids_from_files.txt
    !wget -q https://raw.githubusercontent.com/reyvaz/Global-Wheat-XML-and-COCO-Annotations/master/data/train.csv
    with open('ids_from_files.txt') as f:
        ids_from_files = [line.rstrip('\n') for line in f]
    df = pd.read_csv('train.csv')

df.head(3)

| | image_id | width | height | bbox | source |
|---|---|---|---|---|---|
| 0 | b6ab77fd7 | 1024 | 1024 | [834.0, 222.0, 56.0, 36.0] | usask_1 |
| 1 | b6ab77fd7 | 1024 | 1024 | [226.0, 548.0, 130.0, 58.0] | usask_1 |
| 2 | b6ab77fd7 | 1024 | 1024 | [377.0, 504.0, 74.0, 160.0] | usask_1 |


The `bbox` column stores each box as `[x_min, y_min, x_dist, y_dist]`, where `x_dist` and `y_dist` are the box width and height. The cell below converts these to corner coordinates `x_min, y_min, x_max, y_max`.


In [4]:
bbox_to_lists = lambda x: x.strip('[]').split(', ')

df['bbox_lists'] = [bbox_to_lists(row) for row in df.bbox]
df['x_min'] = [int(float(row[0])) for row in df.bbox_lists]
df['y_min'] = [int(float(row[1])) for row in df.bbox_lists]
df['x_dist'] = [int(float(row[2])) for row in df.bbox_lists]
df['y_dist'] = [int(float(row[3])) for row in df.bbox_lists]
df['x_max'] = df.x_min + df.x_dist
df['y_max'] = df.y_min + df.y_dist
# keep only the image id and the corner coordinates
df = df[['image_id', 'x_min', 'y_min', 'x_max', 'y_max']]

df.head(3)

| | image_id | x_min | y_min | x_max | y_max |
|---|---|---|---|---|---|
| 0 | b6ab77fd7 | 834 | 222 | 890 | 258 |
| 1 | b6ab77fd7 | 226 | 548 | 356 | 606 |
| 2 | b6ab77fd7 | 377 | 504 | 451 | 664 |
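
The conversion can be checked on a single row. The sketch below is not part of the original notebook: `bbox_to_corners` is a hypothetical helper that parses the `bbox` string with `ast.literal_eval` instead of splitting it by hand, and it is applied to the first row of `train.csv` shown earlier.

```python
import ast

def bbox_to_corners(bbox_str):
    """Convert a '[x_min, y_min, x_dist, y_dist]' string to integer corner coordinates."""
    x_min, y_min, x_dist, y_dist = (int(v) for v in ast.literal_eval(bbox_str))
    return x_min, y_min, x_min + x_dist, y_min + y_dist

# First row of train.csv
print(bbox_to_corners("[834.0, 222.0, 56.0, 36.0]"))  # (834, 222, 890, 258)
```

The result matches the first row of the converted dataframe.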


The image dataset (or `ids_from_files.txt`) contains more images than the annotations dataframe (`train.csv`) lists. The extra images have no annotations (i.e. they do not contain the object), so they have no rows in `train.csv`.

I will include these images, but process them separately so they can be excluded from training if needed.

In [5]:
n_images = len(ids_from_files)
unique_ids = np.unique(df.image_id) # ids of all images in the dataframe
# images with no bbox (i.e. no object); sorted because set order can change between sessions
images_no_bbox = sorted(set(ids_from_files) - set(unique_ids))
len(unique_ids), n_images, len(images_no_bbox)

(3373, 3422, 49)
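
To illustrate the set difference used above, here is a small sketch with made-up ids (none of them are real image ids). It also sorts the result, since the iteration order of a Python set of strings is not guaranteed to be the same from one session to the next.

```python
# Hypothetical ids, for illustration only
ids_on_disk = ["img_a", "img_b", "img_c", "img_d"]
ids_in_csv = ["img_a", "img_a", "img_b", "img_c"]  # one row per box, so ids repeat

annotated = set(ids_in_csv)
# Ids present on disk but absent from the annotations have no boxes
without_boxes = sorted(set(ids_on_disk) - annotated)

print(len(annotated), len(ids_on_disk), without_boxes)  # 3 4 ['img_d']
```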

In [6]:
def list_to_text_file(id_list, dest_path):
    with open(dest_path, 'w') as f: 
        for item in id_list:
            f.write("%s\n" % item) 
    return None

## Creating XML (Pascal VOC) Annotations

In [7]:
LABEL = 'wheat'
width, height = 1024, 1024

All XML files are written to a single directory. Train/validation membership is recorded in separate text files of image ids, like the ones created below. For cross-validation, only additional id files with the alternative splits (i.e. K-folds) are needed, and these can be produced at training time.

In [8]:
train_ids, valid_ids = train_test_split(ids_from_files, train_size = 0.8, random_state = 8)
list_to_text_file(train_ids, 'data/train.txt')
list_to_text_file(valid_ids, 'data/val.txt')
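
As a rough sketch of how the K-fold id files mentioned above could be produced, the snippet below builds the folds with the standard library only. The helper name `kfold_id_splits`, the fold count, and the ids are assumptions for illustration; scikit-learn's `KFold` would serve the same purpose.

```python
import random

def kfold_id_splits(ids, k=5, seed=8):
    """Yield (train_ids, valid_ids) pairs for k folds over a shuffled copy of ids."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i, valid_ids in enumerate(folds):
        train_ids = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train_ids, valid_ids

# Hypothetical ids; in the notebook these would come from ids_from_files
ids = ["img_%02d" % n for n in range(10)]
for fold, (train_ids, valid_ids) in enumerate(kfold_id_splits(ids, k=5)):
    # Each pair could then be saved with list_to_text_file, one file per fold
    print(fold, len(train_ids), len(valid_ids))
```

Every id lands in exactly one validation fold, so the XML files never need to be regenerated when the split changes.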

In [9]:
jpegs_path = 'data/images'
xml_annotations_path = 'data/xml_annotations'
# create dir to place xml (pascal voc annotations); no error if it already exists
os.makedirs(xml_annotations_path, exist_ok=True)

In [10]:
def xml_from_id(id):
    # creates an xml (Pascal Voc format) file for an image file
    # id should be extracted from the original image file name
    im_df=df[df.image_id == id].reset_index()
    path_img = '{}/{}.jpg'.format('data/VOCdevkit/VOC2007/JPEGImages', id)
    path_xml = '{}/{}.xml'.format(xml_annotations_path, id)

    writer = Writer(path_img, width, height)
    if len(im_df) > 0:
        for idx in im_df.index:
            anot = [LABEL] + im_df[['x_min', 
                            'y_min', 'x_max', 'y_max']].iloc[idx].tolist()
            writer.addObject(*anot)
    writer.save(path_xml)
    return None

In [11]:
create_xml_files =  True #@param {type:"boolean"}
if create_xml_files:
    # Create the XML files for training images
    for image_id in ids_from_files:
        xml_from_id(image_id)

    !ls {xml_annotations_path} -1 | wc -l
    !ls {xml_annotations_path}/*.xml | head -5
    !head -7 {xml_annotations_path}/366187e59.xml

3422
data/xml_annotations/00333207f.xml
data/xml_annotations/005b0d8bb.xml
data/xml_annotations/006a994f7.xml
data/xml_annotations/00764ad5d.xml
data/xml_annotations/00b5c6764.xml
<annotation>
    <folder>JPEGImages</folder>
    <filename>366187e59.jpg</filename>
    <path>/content/data/VOCdevkit/VOC2007/JPEGImages/366187e59.jpg</path>
    <source>
        <database>Unknown</database>
    </source>
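
To confirm that the boxes survive the round trip, an XML file can be read back with the standard library. The sketch below is an assumption-laden illustration: `read_voc_boxes` is a hypothetical helper, and the sample XML is hand-written in the Pascal VOC layout (using the first `train.csv` box) rather than copied from a generated file.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the Pascal VOC layout, not a generated file
sample_xml = """
<annotation>
    <filename>b6ab77fd7.jpg</filename>
    <size><width>1024</width><height>1024</height><depth>3</depth></size>
    <object>
        <name>wheat</name>
        <bndbox><xmin>834</xmin><ymin>222</ymin><xmax>890</xmax><ymax>258</ymax></bndbox>
    </object>
</annotation>
"""

def read_voc_boxes(xml_text):
    """Return (filename, [(label, xmin, ymin, xmax, ymax), ...]) from VOC XML text."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        bnd = obj.find("bndbox")
        coords = [int(bnd.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), *coords))
    return root.findtext("filename"), boxes

print(read_voc_boxes(sample_xml))
# ('b6ab77fd7.jpg', [('wheat', 834, 222, 890, 258)])
```

For a file on disk, `ET.parse(path).getroot()` would replace `ET.fromstring(xml_text)`. An image without annotations simply yields an empty list of boxes.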


## Creating COCO (JSON) Annotations

This section creates the COCO datasets from the XML files created above.

The conversion function below is adapted from the `voc2coco.py` script, with these changes:
- It includes a generated integer ID for each image in the .json files.
    - Useful when using pycocotools.
- It generates a segmentation mask for each object. Segmentation info is inferred from the bboxes.
    - This is required by some detectors (e.g. HTC DetectoRS).
- It does not subtract 1 from the min values of x and y.
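
The last change in the list above can be illustrated with a small sketch. It assumes the coordinates in `train.csv` are already zero-based, in which case subtracting 1 from the minima would shift the box and enlarge it by one pixel in each direction. `coco_bbox` and its `shift_min` parameter are hypothetical names used only for this illustration.

```python
def coco_bbox(xmin, ymin, xmax, ymax, shift_min=0):
    """Return COCO-style [x, y, width, height]; shift_min=1 mimics subtracting 1 from the minima."""
    xmin, ymin = xmin - shift_min, ymin - shift_min
    return [xmin, ymin, xmax - xmin, ymax - ymin]

# Corners of the first train.csv row, whose original box was [834, 222, 56, 36]
corners = (834, 222, 890, 258)
print(coco_bbox(*corners))               # [834, 222, 56, 36]
print(coco_bbox(*corners, shift_min=1))  # [833, 221, 57, 37]
```

Without the shift, the COCO box reproduces the original `bbox` values exactly.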

In [12]:
# same ids as the unique_ids array; sorted so the seeded splits below are reproducible across sessions
images_with_bbox = sorted(set(ids_from_files) - set(images_no_bbox))
len(ids_from_files), len(images_with_bbox), len(images_no_bbox)

(3422, 3373, 49)

**COCO data splits**

- Creates 80/20 and 90/10 train/val splits of `images_with_bbox`.
- All `images_no_bbox` are kept in a separate list and added to the train set when needed (as in `train_ids_90_plus`).

In [13]:
train_ids_80, valid_ids_20 = train_test_split(images_with_bbox, train_size = 0.8, random_state = 8)
train_ids_10, valid_ids_10 = train_test_split(valid_ids_20, train_size = 0.5, random_state = 8)
train_ids_90 = train_ids_80 + train_ids_10
train_ids_90_plus = train_ids_90 + images_no_bbox
random.shuffle(train_ids_90_plus)

In [14]:
len(train_ids_80), len(valid_ids_20), len(valid_ids_10), len(train_ids_90), len(train_ids_90_plus)

(2698, 675, 338, 3035, 3084)
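
The sizes printed above follow from simple arithmetic. The sketch below reproduces them under the assumption that, for a fractional `train_size`, the train count is rounded down and the remainder goes to validation; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, train_fraction):
    """Floor the train count and give the remainder to validation (assumed rounding rule)."""
    n_train = math.floor(train_fraction * n_samples)
    return n_train, n_samples - n_train

n_with_bbox, n_no_bbox = 3373, 49
n_train_80, n_valid_20 = split_sizes(n_with_bbox, 0.8)
n_train_10, n_valid_10 = split_sizes(n_valid_20, 0.5)  # half of the 20% moves to training
n_train_90 = n_train_80 + n_train_10

print(n_train_80, n_valid_20, n_valid_10, n_train_90, n_train_90 + n_no_bbox)
# 2698 675 338 3035 3084
```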

In [15]:
#@markdown saving id lists to text files
list_to_text_file(train_ids_80, 'data/train_ids_80.txt')
list_to_text_file(valid_ids_20, 'data/valid_ids_20.txt')
list_to_text_file(train_ids_90, 'data/train_ids_90.txt')
list_to_text_file(valid_ids_10, 'data/valid_ids_10.txt')
list_to_text_file(train_ids_90_plus, 'data/train_ids_90_plus.txt')
list_to_text_file(images_no_bbox, 'data/images_no_bbox.txt')

In [16]:
#@markdown Convert XML to COCO Function. Edited from Tony's (Chenwei) Version.
def convert(xml_files, json_file, start_index):
    json_dict = {"images": [], "type": "instances", "annotations": [], "categories": []}
    categories = {"wheat": 1}
    bnd_id = START_BOUNDING_BOX_ID

    for indx, xml_file in enumerate(xml_files):
        image_id = indx + start_index
        tree = ET.parse(xml_file)
        root = tree.getroot()
        path = get(root, "path")
        if len(path) == 1:
            filename = os.path.basename(path[0].text)
        elif len(path) == 0:
            filename = get_and_check(root, "filename", 1).text
        else:
            raise ValueError("%d paths found in %s" % (len(path), xml_file))
        size = get_and_check(root, "size", 1)
        width = int(get_and_check(size, "width", 1).text)
        height = int(get_and_check(size, "height", 1).text)
        image = {
            "file_name": filename,
            "height": height,
            "width": width,
            "id": image_id,
        }
        json_dict["images"].append(image)

        for obj in get(root, "object"):
            category = get_and_check(obj, "name", 1).text
            if category not in categories:
                # next unused id; len(categories) would collide with the preset "wheat": 1
                new_id = max(categories.values()) + 1
                categories[category] = new_id
            category_id = categories[category]
            bndbox = get_and_check(obj, "bndbox", 1)
            xmin = int(get_and_check(bndbox, "xmin", 1).text)
            ymin = int(get_and_check(bndbox, "ymin", 1).text)
            xmax = int(get_and_check(bndbox, "xmax", 1).text)
            ymax = int(get_and_check(bndbox, "ymax", 1).text)
            assert xmax > xmin
            assert ymax > ymin
            o_width = abs(xmax - xmin)
            o_height = abs(ymax - ymin)
            ann = {
                "area": o_width * o_height,
                "iscrowd": 0,
                "image_id": image_id,
                "bbox": [xmin, ymin, o_width, o_height],
                "category_id": category_id,
                "id": bnd_id,
                "ignore": 0,
                "segmentation": [[xmin,ymin, xmin,ymax, xmax,ymax, xmax,ymin]],
            }
            json_dict["annotations"].append(ann)
            bnd_id = bnd_id + 1

    for cate, cid in categories.items():
        cat = {"supercategory": "none", "id": cid, "name": cate}
        json_dict["categories"].append(cat)

    os.makedirs(os.path.dirname(json_file), exist_ok=True)
    # the context manager closes the file even if serialization fails
    with open(json_file, "w") as json_fp:
        json.dump(json_dict, json_fp)


# Breakdown of the segmentation polygon (a list holding one flattened polygon):
# "segmentation" = [[x_min, y_min,   # top-left
#                    x_min, y_max,   # bottom-left
#                    x_max, y_max,   # bottom-right
#                    x_max, y_min]]  # top-right
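
As a standalone illustration of the segmentation breakdown, the sketch below builds one annotation from corner coordinates and checks that the polygon encloses the same area as the box. `coco_annotation` and `polygon_area` are hypothetical helpers written for this example, and the id values are arbitrary.

```python
def coco_annotation(xmin, ymin, xmax, ymax, image_id, ann_id, category_id=1):
    """Build one COCO annotation dict whose segmentation is the box outline."""
    width, height = xmax - xmin, ymax - ymin
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "iscrowd": 0,
        "bbox": [xmin, ymin, width, height],
        "area": width * height,
        # One polygon, flattened as x1, y1, x2, y2, ...:
        # top-left, bottom-left, bottom-right, top-right
        "segmentation": [[xmin, ymin, xmin, ymax, xmax, ymax, xmax, ymin]],
    }

def polygon_area(flat):
    """Shoelace formula over a flattened [x1, y1, x2, y2, ...] polygon."""
    pts = list(zip(flat[0::2], flat[1::2]))
    twice = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]))
    return abs(twice) / 2

ann = coco_annotation(834, 222, 890, 258, image_id=0, ann_id=1)
print(ann["bbox"], ann["area"], polygon_area(ann["segmentation"][0]))
# [834, 222, 56, 36] 2016 2016.0
```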

In [17]:
xml_files_train_80 = [xml_annotations_path + '/' + id + '.xml' for id in train_ids_80]
xml_files_train_90 = [xml_annotations_path + '/' + id + '.xml' for id in train_ids_90]
xml_files_train_90_plus = [xml_annotations_path + '/' + id + '.xml' for id in train_ids_90_plus]

xml_files_val_20 = [xml_annotations_path +'/' + id + '.xml' for id in valid_ids_20]
xml_files_val_10 = [xml_annotations_path +'/' + id + '.xml' for id in valid_ids_10]
xml_files_no_bb = [xml_annotations_path +'/' + id + '.xml' for id in images_no_bbox]

In [18]:
!mkdir data/coco
json_file_train_80 = './data/coco/train_80.json'
json_file_val_20 = './data/coco/val_20.json'
json_file_train_90 = './data/coco/train_90.json'
json_file_val_10 = './data/coco/val_10.json'
json_file_train_90_plus = './data/coco/train_90_plus.json'
json_file_no_bb = './data/coco/no_bb.json'

In [19]:
_ = convert(xml_files_train_80, json_file_train_80, start_index = 0)
start_index_val = len(xml_files_train_80)
_ = convert(xml_files_val_20, json_file_val_20, start_index = start_index_val)
_ = convert(xml_files_train_90, json_file_train_90, start_index = 0)
_ = convert(xml_files_train_90_plus, json_file_train_90_plus, start_index = 0)

start_index_bb = len(xml_files_train_90)
_ = convert(xml_files_no_bb, json_file_no_bb, start_index = start_index_bb)

start_index_val = len(xml_files_train_90_plus)
_ = convert(xml_files_val_10, json_file_val_10, start_index = start_index_val)

In [20]:
with open('data/coco/val_20.json') as f:  # relative path, so it also works outside Colab's /content
    val_json = json.load(f)
#val_json # uncomment to check the annotations file
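
Rather than reading the whole file by eye, a few structural checks can be automated. The sketch below is one possible set of checks and not a full COCO validator: `check_coco` is a hypothetical helper, and the toy dictionary (including its image id) is hand-made in the same layout as the generated files.

```python
def check_coco(coco):
    """Return a list of problems found in a COCO-style dict (empty if none)."""
    problems = []
    image_ids = [img["id"] for img in coco["images"]]
    if len(image_ids) != len(set(image_ids)):
        problems.append("duplicate image ids")
    known_images = set(image_ids)
    known_categories = {cat["id"] for cat in coco["categories"]}
    for ann in coco["annotations"]:
        if ann["image_id"] not in known_images:
            problems.append("annotation %s refers to an unknown image" % ann["id"])
        if ann["category_id"] not in known_categories:
            problems.append("annotation %s has an unknown category" % ann["id"])
        _, _, w, h = ann["bbox"]
        if w <= 0 or h <= 0:
            problems.append("annotation %s has an empty box" % ann["id"])
    return problems

# Tiny hand-made example; the image id is arbitrary
toy = {
    "images": [{"id": 2698, "file_name": "b6ab77fd7.jpg", "width": 1024, "height": 1024}],
    "categories": [{"id": 1, "name": "wheat", "supercategory": "none"}],
    "annotations": [{"id": 1, "image_id": 2698, "category_id": 1, "bbox": [834, 222, 56, 36]}],
}
print(check_coco(toy))  # []
```

In the notebook, `check_coco(val_json)` would run the same checks on the loaded validation file.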

## Compressing

In [21]:
!mkdir annotations
!mv data/xml_annotations annotations
!mv data/coco annotations
!mv data/*txt annotations
!zip -rq annotations.zip annotations

In [22]:
!ls annotations*

annotations.zip

annotations:
coco		    train_ids_90_plus.txt  valid_ids_10.txt  xml_annotations
images_no_bbox.txt  train_ids_90.txt	   valid_ids_20.txt
train_ids_80.txt    train.txt		   val.txt
