# Pre-processing steps for a COCO JSON annotation file

Given a single COCO-annotated JSON file, the goal is to pre-process it: remove noise and reshape it into a form suitable for training an ML model. This notebook also checks whether the annotated images are broken or missing.

The COCO annotation file includes the following:

1. Name of the images.

2. Dimensions of the images.

3. Classes in the image category.

4. Name of the super categories of the classes.

5. Area covered by the segmented pixels in an image.

6. Bounding box coordinates.

7. Annotated segmentation coordinates.
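
To make the list above concrete, here is a minimal sketch of how these fields are typically laid out in a COCO-style dictionary. All values are hand-written placeholders (the IDs, sizes, and coordinates are assumptions for illustration), and a real file holds many entries under each key.

```python
import json

# Hypothetical, hand-written values; not taken from the sample dataset.
coco = {
    "images": [
        # name and dimensions of each image
        {"id": 1, "file_name": "image_2.png", "width": 640, "height": 480},
    ],
    "categories": [
        # class name and its super category
        {"id": 0, "name": "Plastics_HDPE_Rigid_Blue_Lid_Bottle-Cap",
         "supercategory": "Plastics"},
    ],
    "annotations": [
        {
            "id": 10,
            "image_id": 1,      # refers to images[].id
            "category_id": 0,   # refers to categories[].id
            "bbox": [100.0, 120.0, 50.0, 40.0],  # [x, y, width, height]
            "area": 2000.0,     # area covered by the segmented pixels
            # polygon given as a flat list of x, y pairs
            "segmentation": [[100.0, 120.0, 150.0, 120.0,
                              150.0, 160.0, 100.0, 160.0]],
            "iscrowd": 0,
        },
    ],
}

print(json.dumps(coco["annotations"][0], indent=2))
```

The two ID fields inside each annotation are what tie the three top-level keys together, which is why most of the cleaning steps in this notebook revolve around them.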

Real-world annotation files contain a lot of noise. Image names may be wrong. Images listed in the annotation file may be missing from the image folder, which disrupts model training. Entries within the annotation file may not match each other. Files in the image folder may even be broken or truncated, which causes errors when they are read. This notebook addresses each of these problems.

The aim is for every cross-reference between the keys of the file to resolve correctly: each annotation should point to an image and a category that actually exist.
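
As a rough illustration of what "corresponding correctly" means, the sketch below lists the annotations whose `image_id` or `category_id` has no matching entry. The function name `find_dangling_references` and the sample data are hypothetical and are not part of the notebook's own utilities.

```python
def find_dangling_references(coco):
    """Return annotation IDs that point to a missing image or category."""
    image_ids = {img["id"] for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}
    missing_image = [a["id"] for a in coco["annotations"]
                     if a["image_id"] not in image_ids]
    missing_category = [a["id"] for a in coco["annotations"]
                        if a["category_id"] not in category_ids]
    return missing_image, missing_category


# Hypothetical noisy input: annotation 11 refers to an image that is not
# listed, and annotation 12 refers to a category that is not listed.
sample = {
    "images": [{"id": 1, "file_name": "image_2.png"}],
    "categories": [{"id": 0, "name": "plastics"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 0},
        {"id": 11, "image_id": 2, "category_id": 0},
        {"id": 12, "image_id": 1, "category_id": 7},
    ],
}

missing_image, missing_category = find_dangling_references(sample)
print("annotations with a missing image:", missing_image)
print("annotations with a missing category:", missing_category)
```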

## Import labels and sample JSON file
To get the full class lists for material, material_form and plastic_type, we download the label files from the waste_identification_ml project in the TensorFlow Model Garden.
We also download a noisy sample JSON file and one sample image to use as an example.

In [None]:
%%bash
curl -O https://raw.githubusercontent.com/tensorflow/models/master/official/projects/waste_identification_ml/pre_processing/config/categories_list_of_dictionaries.py

curl -O https://raw.githubusercontent.com/tensorflow/models/master/official/projects/waste_identification_ml/pre_processing/config/sample_json/dataset.json

mkdir image_folder

curl -o image_folder/image_2.png https://raw.githubusercontent.com/tensorflow/models/master/official/projects/waste_identification_ml/pre_processing/config/sample_images/image_2.png



## Import the required libraries

In [None]:
import glob
import tqdm
import json
from PIL import Image
import subprocess
import copy
import os
from google.colab import files
from categories_list_of_dictionaries import *

In [None]:
# reading labels 

images_folder_path = 'image_folder/' #@param {type:"string"}
list_of_material = build_material(MATERIAL_LIST,'material-types')
list_of_material_form = build_material(MATERIAL_FORM_LIST,'material-form-types')
list_of_plastic_type = build_material(PLASTICS_SUBCATEGORY_LIST,'plastic-types')

In [None]:
# common labeling typo errors
_KNOWN_TYPOS = {
  'and': '&',
  'Cassete': 'Cassette',
  'Toy':'Toys',
  'Mug-&-Tub':'tub',
  'Toyss':'toys'
}
_KNOWN_TYPOS

{'and': '&',
 'Cassete': 'Cassette',
 'Toy': 'Toys',
 'Mug-&-Tub': 'tub',
 'Toyss': 'toys'}
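
A small sketch of how a typo map like this can be applied to one label name. The map `typos`, the helper `apply_typos`, and the label are made up for the example. Two details matter here: `str.replace` returns a new string rather than modifying the original, and it matches substrings, so short keys can also hit the inside of longer words.

```python
# Hypothetical typo map and label, for illustration only.
typos = {"Cassete": "Cassette", "Toyss": "Toys"}


def apply_typos(name, typo_map):
    """Return a corrected copy of `name`; the input string is not modified."""
    for old, new in typo_map.items():
        # str.replace returns a new string, so the result is assigned back
        name = name.replace(old, new)
    return name


label = "Plastics_Na_Rigid_Black_Cassete_Na"
fixed = apply_typos(label, typos)
print(label)
print(fixed)

# Substring matching also affects longer words that contain the key.
side_effect = "Brand".replace("and", "&")
print(side_effect)
```

Because of the substring behavior, a key such as `'and'` is best reviewed against the real label vocabulary before it is used.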

## Utility functions

In [None]:
def read_json(file):
  """Read any JSON file.

  Args:
    file: path to the file
  """
  with open(file) as json_file:
    data = json.load(json_file)
  return data


def search_dict_value(dic, id):
  """Returns the key of the dictionary from its value.

  Args:
    dic: Mapping to search by value.
    id: Value to search.
  """
  key_list = list(dic.keys())
  val_list = list(dic.values())
  position = val_list.index(id)
  return key_list[position]


def delete_truncated_images(folder_path: str) -> None:
  """Find and delete truncated images.

  Args:
    folder_path: path to the folder where images are saved.
  """
  # path to the images folder to read its content
  files = glob.glob(folder_path + '/*')
  print('Total number of files in the folder:', len(files))

  num = 0

  # read all image files and remove them from the directory in case they are broken
  for file in tqdm.tqdm(files):
    if file.endswith(('.png','.jpg')):
      try:
        # verify() checks the file for corruption without decoding the image data
        with Image.open(file) as img:
          img.verify()
      except Exception:
        num = num + 1
        # os.remove works on every platform, unlike calling 'rm'
        os.remove(file)
        print('Broken file name:  ' + file)
  if num == 0:
    print('\nNo broken images found')
  else:
    print('Total number of broken images found:', num)


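def spelling_correction(dic):
  """Correct some common spelling mistakes in the category names, in place."""
  for i in dic['categories']:
    for old, new in _KNOWN_TYPOS.items():
      # str.replace returns a new string, so assign the result back
      i['name'] = i['name'].replace(old, new)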


def labeling_correction(dic, num, labels_dict):
  """Matching annotated labels with the correct labels and correcting the mistakes.

  Mapping the modified labeling ID with the corresponding original ID for alignment
  of categories.

  Args:
    dic: JSON file read as a dictionary
    num: keyword position inside the label
    labels_dict: dictionary showing the labels ID of the original categories 
  """
  incorrect_labels = []
  mapping_list = []
  for i in dic['categories']:
    if i['name'].split('_')[num].lower() in labels_dict.values():
      id = i['id']
      name = i['name'].split('_')[num]
      id_match = search_dict_value(labels_dict, i['name'].split('_')[num].lower())
      mapping_list.append((id, name, id_match))
    else:
      id = i['id']
      incorrect_labels.append(id)
  return mapping_list, incorrect_labels


def images_key(dic):
  """Align the data within the dictionary in the 'images' key.

  The 'image_id' field in the 'annotations' key refers to 'id' in the 'images'
  key of the dictionary. This function drops every entry of the 'images' key
  whose 'id' does not match any 'image_id' in the 'annotations' key.

  Args:
    dic: dictionary that the JSON file was read into
  """
  image_ids = set(i['image_id'] for i in dic['annotations'])
  new_images = [i for i in dic['images'] if i['id'] in image_ids]
  return new_images


def annotations_key(dic, incorrect_labels, mapping_dict):
  """Align the data within the dictionary in the 'annotations' key.

  Notice that the 'category_id' in the 'annotations' key is the same as 'id'
  in the 'categories' key of the dictionary.

  Args:
    dic: dictionary that the JSON file was read into
    incorrect_labels: category IDs that could not be matched to a known label
    mapping_dict: list of (old ID, name, new ID) tuples from labeling_correction
  """
  new_annotation = []

  for i in dic['annotations']:
    id = i['category_id']
    if id not in incorrect_labels:
      # look up the new ID that this old category ID maps to
      new_id = [m[2] for m in mapping_dict if m[0] == id][0]
      i['category_id'] = new_id
      new_annotation.append(i)
  return new_annotation


def annotated_images(folder_path, dic):
  """Keep only the annotated images that are actually present in the image folder.

  Args:
    folder_path: path of an image folder.
    dic: annotation file read as a dictionary
  """
  # read the file names from the directory 
  files = glob.glob(folder_path + '/*')
  files = set(map(os.path.basename, files))

  # list of images in an annotation file
  dic['images'] = [i for i in dic['images'] if i['file_name'] in files]
  return dic


def image_annotation_key(dic):
  """Keep only the images present in both the "images" and "annotations" keys.

  Image IDs that appear in only one of the two keys are removed, together
  with the entries that refer to them.

  Args:
    dic: annotation file read as a dictionary
  """
  images_id = [i['id'] for i in dic['images']]
  annotation_id = [i['image_id'] for i in dic['annotations']]
  common_list = set(images_id).intersection(annotation_id)
  dic['images'] = [i for i in dic['images'] if i['id'] in common_list]
  dic['annotations'] = [i for i in dic['annotations'] if i['image_id'] in common_list]
  return dic
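
The label-matching idea used by `labeling_correction` can be sketched on a toy input as follows. The label table, the category names, and the misspelled `Plastik` entry are invented for the example. The sketch inverts the label table once so each lookup is a plain dictionary access; this assumes label names are unique.

```python
# Hypothetical label table: target ID -> lower-case label name.
labels = {1: "plastics", 2: "metals"}

# Invert once so a keyword can be looked up directly (assumes unique names).
name_to_id = {name: label_id for label_id, name in labels.items()}

# Hypothetical annotated categories; the second one has a misspelled keyword.
categories = [
    {"id": 0, "name": "Plastics_HDPE_Rigid_Blue_Lid_Bottle-Cap"},
    {"id": 1, "name": "Plastik_HDPE_Rigid_Na_Cosmetic_Comb"},
]

position = 0  # which '_'-separated keyword to match against the label table
mapping, rejected = [], []
for cat in categories:
    keyword = cat["name"].split("_")[position].lower()
    if keyword in name_to_id:
        # (old category ID, matched keyword, new category ID)
        mapping.append((cat["id"], keyword, name_to_id[keyword]))
    else:
        rejected.append(cat["id"])

print("mapping:", mapping)
print("rejected category IDs:", rejected)
```

Annotations whose category lands in the rejected list are the ones that get dropped later, so the size of that list is a quick indicator of how noisy the labels are.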

## Find and delete truncated images from the image folder.

In [None]:
delete_truncated_images(images_folder_path)

Total number of files in the folder: 1


100%|██████████| 1/1 [00:00<00:00, 30.21it/s]


No broken images found
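
The check above relies on Pillow. As a lightweight complement, and not a replacement, the sketch below uses only the standard library to test whether a file starts with the PNG signature and ends with the empty `IEND` chunk that closes a PNG stream. It is a different and much weaker technique than `Image.verify()`: it only catches files that were cut off, and it assumes nothing was appended after `IEND`. The helper name and the generated files are hypothetical.

```python
import os
import tempfile

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"             # first 8 bytes of every PNG
PNG_TRAILER = b"\x00\x00\x00\x00IEND\xaeB`\x82"  # empty IEND chunk with its CRC


def looks_like_complete_png(path):
    """Return True if the file has the PNG signature and ends with IEND."""
    if os.path.getsize(path) < len(PNG_SIGNATURE) + len(PNG_TRAILER):
        return False
    with open(path, "rb") as f:
        head = f.read(len(PNG_SIGNATURE))
        f.seek(-len(PNG_TRAILER), os.SEEK_END)
        tail = f.read()
    return head == PNG_SIGNATURE and tail == PNG_TRAILER


# Demo on generated files. The payload is not a decodable image; it only
# carries the correct first and last bytes.
with tempfile.TemporaryDirectory() as tmp:
    whole = os.path.join(tmp, "whole.png")
    cut = os.path.join(tmp, "cut.png")
    payload = PNG_SIGNATURE + b"placeholder chunk data" + PNG_TRAILER
    with open(whole, "wb") as f:
        f.write(payload)
    with open(cut, "wb") as f:
        f.write(payload[:-5])  # simulate an interrupted copy or download
    whole_ok = looks_like_complete_png(whole)
    cut_ok = looks_like_complete_png(cut)

print("whole file passes:", whole_ok)
print("cut file passes:", cut_ok)
```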





## Perform operations on the file


In [None]:
# read the JSON file; it should contain at least the three keys shown below
path_to_json = 'dataset.json' #@param {type:"string"}
data = read_json(path_to_json)
print(data.keys())

# create a copy to compare the results in the end
data_preprocessing = copy.deepcopy(data)

dict_keys(['images', 'annotations', 'categories'])


In [None]:
# check for labeling mistakes: every annotated label should have exactly 6 keywords joined by '_'
num = 0
for i in tqdm.tqdm(data['categories']):
  if len(i['name'].split('_')) != 6:
    num += 1
print('\nTotal number of wrong annotated labels are', num)

100%|██████████| 6/6 [00:00<00:00, 51463.85it/s]


Total number of wrong annotated labels are 5
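
A label can fail this check by having too few keywords or too many, and the two cases are handled differently in the following cells. A short sketch with `collections.Counter` shows the distribution of keyword counts. The first two names come from the sample categories; the third is a hypothetical short label added for contrast.

```python
from collections import Counter

names = [
    "plastics_HDPE_flexible_color_SAchets-&-pouch_pouch",  # from the sample
    "Plastics_HDPE_Rigid_Blue_Lid_Bottle-Cap_Na_Na",       # from the sample
    "Plastics_HDPE_Rigid_Na_Cosmetic",                     # hypothetical
]

# number of '_'-separated keywords -> how many labels have that count
keyword_counts = Counter(len(name.split("_")) for name in names)

too_few = [n for n in names if len(n.split("_")) < 6]
too_many = [n for n in names if len(n.split("_")) > 6]

print(dict(keyword_counts))
print("too few keywords:", too_few)
print("too many keywords:", too_many)
```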





In [None]:
# remove category labels that have fewer than 6 keywords
categories = []
num = 0
for i in data['categories']:
  if len(i['name'].split('_')) >= 6:
    categories.append(i)
  else:
    num += 1
print('\nTotal number of labels which has less than 6 keywords are', num)
data['categories'] = categories

# display categories after removing the labels
data['categories']


Total number of labels which has less than 6 keywords are 0


[{'id': 0,
  'name': 'plastics_HDPE_flexible_color_SAchets-&-pouch_pouch',
  'supercategory': 'plastics_HDPE_flexible_color_SAchets-&-pouch_pouch'},
 {'id': 1,
  'name': 'Plastics_HDPE_Rigid_Blue_Lid_Bottle-Cap_Na_Na',
  'supercategory': 'Plastics_HDPE_Rigid_Blue_Lid_Bottle-Cap_Na_Na'},
 {'id': 2,
  'name': 'Plastics_peTE_Na_Clear_Bottle_Shampoo-Bottle_250Ml_Vlcc',
  'supercategory': 'Plastics_peTE_Na_Clear_Bottle_Shampoo-Bottle_250Ml_Vlcc'},
 {'id': 3,
  'name': 'Plastics_na_Rigid_Blue_Bottle_Hair-Oil-Bottle-500Ml_Parachute',
  'supercategory': 'Plastics_na_Rigid_Blue_Bottle_Hair-Oil-Bottle-500Ml_Parachute'},
 {'id': 4,
  'name': 'Plastics_HDPE_Rigid_Na_Cosmetic_Comb_Na_Na',
  'supercategory': 'Plastics_HDPE_Rigid_Na_Cosmetic_Comb_Na_Na'},
 {'id': 5,
  'name': 'Plastics_PETE_Na_Clear_Bottle_Energy-Drink-Bottle_250Ml_Sting-Energy',
  'supercategory': 'Plastics_PETE_Na_Clear_Bottle_Energy-Drink-Bottle_250Ml_Sting-Energy'}]

In [None]:
# In the collected data, most issues come from the 6th keyword, the
# sub-category of the material form, so every keyword from the 6th onward
# is merged into a single keyword joined by '-'.

for i in tqdm.tqdm(data['categories']):
  l1 = i['name'].split('_')[:5]
  l2 = i['name'].split('_')[5:]
  l1.append('-'.join(l2))
  i['name'] = '_'.join(l1)

# display categories after making corrections
data['categories']

100%|██████████| 6/6 [00:00<00:00, 48026.38it/s]


[{'id': 0,
  'name': 'plastics_HDPE_flexible_color_SAchets-&-pouch_pouch',
  'supercategory': 'plastics_HDPE_flexible_color_SAchets-&-pouch_pouch'},
 {'id': 1,
  'name': 'Plastics_HDPE_Rigid_Blue_Lid_Bottle-Cap-Na-Na',
  'supercategory': 'Plastics_HDPE_Rigid_Blue_Lid_Bottle-Cap_Na_Na'},
 {'id': 2,
  'name': 'Plastics_peTE_Na_Clear_Bottle_Shampoo-Bottle-250Ml-Vlcc',
  'supercategory': 'Plastics_peTE_Na_Clear_Bottle_Shampoo-Bottle_250Ml_Vlcc'},
 {'id': 3,
  'name': 'Plastics_na_Rigid_Blue_Bottle_Hair-Oil-Bottle-500Ml-Parachute',
  'supercategory': 'Plastics_na_Rigid_Blue_Bottle_Hair-Oil-Bottle-500Ml_Parachute'},
 {'id': 4,
  'name': 'Plastics_HDPE_Rigid_Na_Cosmetic_Comb-Na-Na',
  'supercategory': 'Plastics_HDPE_Rigid_Na_Cosmetic_Comb_Na_Na'},
 {'id': 5,
  'name': 'Plastics_PETE_Na_Clear_Bottle_Energy-Drink-Bottle-250Ml-Sting-Energy',
  'supercategory': 'Plastics_PETE_Na_Clear_Bottle_Energy-Drink-Bottle_250Ml_Sting-Energy'}]
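
The same merge can be written more compactly with the `maxsplit` argument of `str.split`. This is an equivalent sketch for names with at least six keywords, not the notebook's own code, and `merge_trailing_keywords` is a hypothetical helper name. Unlike the loop above, it returns shorter names unchanged.

```python
def merge_trailing_keywords(name, keep=5):
    """Join every keyword after the first `keep` into one '-'-separated keyword."""
    # Split at most `keep` times; any remaining '_' stay inside the last part.
    parts = name.split("_", keep)
    if len(parts) > keep:
        parts[keep] = parts[keep].replace("_", "-")
    return "_".join(parts)


# Names taken from the sample categories shown above.
long_name = "Plastics_HDPE_Rigid_Blue_Lid_Bottle-Cap_Na_Na"
exact_name = "plastics_HDPE_flexible_color_SAchets-&-pouch_pouch"

merged_long = merge_trailing_keywords(long_name)
merged_exact = merge_trailing_keywords(exact_name)
print(merged_long)
print(merged_exact)
```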

In [None]:
print('Dictionary characteristics before processing :')
print('images:',len(data_preprocessing['images']),'categories:', len(data_preprocessing['categories']),'annotations:',len(data_preprocessing['annotations']))

list_of_categories = [(list_of_material,0,'material_type_annotation.json'),\
                      (list_of_material_form,4,'material_form_type_annotation.json'),\
                      (list_of_plastic_type,1,'plastic_type_annotation.json')]

for m in list_of_categories:

  data_processing = copy.deepcopy(data)

  # create a dict showing IDs corresponding to the labels & convert all words
  # to lower case in order to eliminate case sensitive issues
  labels_dict = dict([(i['id'], i['name'].lower()) for i in m[0]])

  # correcting spelling errors
  spelling_correction(data_processing)

  # create a mapping table to map each label to the right label structure.
  # find the incorrect labels.
  mapping_dict, incorrect_labels = labeling_correction(data_processing, m[1], labels_dict) 

  # change the 'categories' key
  data_processing['categories'] = m[0]

  # change the 'annotation' key
  data_processing['annotations'] = annotations_key(data_processing,  incorrect_labels, mapping_dict)

  # change the 'images' key
  data_processing['images'] = images_key(data_processing)

  # remove data from the 'images' key not present in the image folder
  data_processing = annotated_images(images_folder_path, data_processing)

  # align 'images' and 'annotations' key
  data_processing = image_annotation_key(data_processing)

  # write to a new JSON file
  with open(m[2], 'w') as opened_file:
    opened_file.write(json.dumps(data_processing, indent=4))

  print('\nDictionary characteristics after processing of', m[2].replace('.json','') ,':')
  print('images:',len(data_processing['images']),'categories:', len(data_processing['categories']),'annotations:',len(data_processing['annotations']))  

Dictionary characteristics before processing :
images: 2 categories: 6 annotations: 6

Dictionary characteristics after processing of material_type_annotation :
images: 1 categories: 10 annotations: 5

Dictionary characteristics after processing of material_form_type_annotation :
images: 1 categories: 34 annotations: 5

Dictionary characteristics after processing of plastic_type_annotation :
images: 1 categories: 9 annotations: 4
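
Beyond the totals printed above, it can help to see how the remaining annotations are spread over the categories. The sketch below counts annotations per category name for any COCO-style dictionary. The helper `annotations_per_category` and the sample data are hypothetical; in practice the dictionary would be read from one of the JSON files written above.

```python
from collections import Counter


def annotations_per_category(coco):
    """Return a dict mapping category name to its number of annotations."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    counts = Counter(a["category_id"] for a in coco["annotations"])
    # category IDs without a matching name are reported explicitly
    return {names.get(cid, f"<unknown id {cid}>"): n
            for cid, n in counts.items()}


# Hypothetical processed output with two categories and three annotations.
processed = {
    "categories": [{"id": 1, "name": "plastics"}, {"id": 2, "name": "metals"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1},
        {"id": 11, "image_id": 1, "category_id": 1},
        {"id": 12, "image_id": 1, "category_id": 2},
    ],
}

summary = annotations_per_category(processed)
print(summary)
```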


In [13]:
# View the final JSON file
try:
  files.view(m[2]) # use files.download to download the file
except ImportError:
  pass
