# Extract annotations from COCO Dataset annotation file
This notebook was created to answer a question from Stack Overflow: [https://stackoverflow.com/questions/69722538/extract-annotations-from-coco-dataset-annotation-file](https://stackoverflow.com/questions/69722538/extract-annotations-from-coco-dataset-annotation-file)

> I want to train on a subset of COCO dataset. For the images, I have created a folder of first 30k images of train2017 folder. Now I need annotations of those 30k images (extracted from instances_train2017.json) in a separate json file so that I can train it. How can I do it?

The reason for the question is that COCO stores all of the annotations in one large JSON file, so there is no simple way to extract only the ones that you need. PyLabel can help with this task by importing the dataset, filtering the annotations to the images you care about, and then exporting the result back to a COCO JSON file.
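To make the idea concrete, here is a minimal sketch of the same filtering done by hand on a COCO-style dictionary, without PyLabel. It is not how PyLabel works internally. The helper name `filter_coco` and the toy data are made up for illustration, and the sketch assumes the usual COCO layout with top-level `images` and `annotations` lists linked by `image_id`.

```python
def filter_coco(coco, keep_filenames):
    """Return a new COCO-style dict restricted to the given image file names.

    Hypothetical helper: assumes each image has "id" and "file_name",
    and each annotation has "image_id".
    """
    keep = set(keep_filenames)
    images = [img for img in coco.get("images", []) if img["file_name"] in keep]
    image_ids = {img["id"] for img in images}
    annotations = [
        ann for ann in coco.get("annotations", []) if ann["image_id"] in image_ids
    ]
    # Copy every other top-level key (categories, info, ...) unchanged.
    subset = {k: v for k, v in coco.items() if k not in ("images", "annotations")}
    subset["images"] = images
    subset["annotations"] = annotations
    return subset


# Tiny hand-made dataset, for illustration only.
toy = {
    "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [0, 0, 5, 5]},
        {"id": 11, "image_id": 2, "category_id": 1, "bbox": [1, 1, 4, 4]},
        {"id": 12, "image_id": 1, "category_id": 1, "bbox": [2, 2, 3, 3]},
    ],
    "categories": [{"id": 1, "name": "cell"}],
}

subset = filter_coco(toy, ["a.jpg"])
print(len(subset["images"]), len(subset["annotations"]))  # prints: 1 2
```

The rest of this notebook does the equivalent steps with PyLabel, which also gives you a dataframe to inspect along the way.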


In [5]:
import logging
logging.getLogger().setLevel(logging.CRITICAL)
!pip install pylabel > /dev/null

In [6]:
from pylabel import importer

# Download sample dataset 
For this example we can use a sample dataset stored in COCO format. The same approach can later be applied to the full COCO dataset.

In [7]:
import os 
import zipfile

#Download and import sample coco dataset 
os.makedirs("data", exist_ok=True)
!wget "https://github.com/pylabelalpha/notebook/blob/main/BCCD_coco.zip?raw=true" -O data/BCCD_coco.zip
with zipfile.ZipFile("data/BCCD_coco.zip", 'r') as zip_ref:
   zip_ref.extractall("data")

#Specify path to the coco.json file
path_to_annotations = "data/BCCD_Dataset.json"
#Specify the path to the images (if they are in a different folder than the annotations)
path_to_images = ""

#Import the dataset into the PyLabel schema
dataset = importer.ImportCoco(path_to_annotations, path_to_images=path_to_images, name="BCCD_coco")
dataset.df.head(5)

--2021-11-01 07:52:48--  https://github.com/pylabelalpha/notebook/blob/main/BCCD_coco.zip?raw=true
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/pylabelalpha/notebook/raw/main/BCCD_coco.zip [following]
--2021-11-01 07:52:48--  https://github.com/pylabelalpha/notebook/raw/main/BCCD_coco.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/pylabelalpha/notebook/main/BCCD_coco.zip [following]
--2021-11-01 07:52:48--  https://raw.githubusercontent.com/pylabelalpha/notebook/main/BCCD_coco.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting respo

| id | img_folder | img_filename | img_path | img_id | img_width | img_height | img_depth | ann_segmented | ann_bbox_xmin | ann_bbox_ymin | ... | ann_area | ann_segmentation | ann_iscrowd | ann_pose | ann_truncated | ann_difficult | cat_id | cat_name | cat_supercategory | split |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | BloodImage_00315.jpg | | 0 | 640 | 480 | 3 | 0 | 164.0 | 261.0 | ... | 13699.0 | | | Unspecified | 0 | 0 | 0 | RBC | | |
| 1 | | BloodImage_00315.jpg | | 0 | 640 | 480 | 3 | 0 | 15.0 | 66.0 | ... | 13699.0 | | | Unspecified | 0 | 0 | 0 | RBC | | |
| 2 | | BloodImage_00315.jpg | | 0 | 640 | 480 | 3 | 0 | 13.0 | 234.0 | ... | 11781.0 | | | Unspecified | 0 | 0 | 0 | RBC | | |
| 3 | | BloodImage_00315.jpg | | 0 | 640 | 480 | 3 | 0 | 239.0 | 3.0 | ... | 11960.0 | | | Unspecified | 0 | 0 | 0 | RBC | | |
| 4 | | BloodImage_00315.jpg | | 0 | 640 | 480 | 3 | 0 | 542.0 | 109.0 | ... | 10290.0 | | | Unspecified | 1 | 0 | 0 | RBC | | |


PyLabel imports the annotations into a pandas dataframe. Now you can filter this dataframe to the rows related to the images that you care about. There are 364 images in this dataset.

In [8]:
print(f"Number of images: {dataset.analyze.num_images}")
print(f"Class counts:\n{dataset.analyze.class_counts}")

Number of images: 364
Class counts:
RBC          4155
WBC           372
Platelets     361
Name: cat_name, dtype: int64


## Extract images
Let's copy some images to another directory to represent the images that we care about.

In [9]:
#Copy the first 100 .jpg images from data/ to data/100Images/
#(mkdir -p does not fail if the directory already exists)
!mkdir -p data/100Images/
!ls data/*.jpg | head -100 | xargs -I{} cp {} data/100Images/


Create a list with all of the files in this directory. 

In [10]:
#Store a list of all of the files in the directory 
files = sorted(os.listdir('data/100Images/'))
print(f"{len(files)} files including {files[0]}")

100 files including BloodImage_00000.jpg
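Note that `os.listdir` returns every entry in the directory, not only images. Because the export step at the end of this notebook writes its JSON file into the same folder, re-running the cell above afterwards would count that file too. A small sketch of a stricter listing follows. The helper name `list_image_files` and the extension list are assumptions for illustration, not part of PyLabel.

```python
import os
import tempfile

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def list_image_files(folder, extensions=IMAGE_EXTENSIONS):
    """Return the sorted names in `folder` that end with an image extension."""
    return sorted(
        name for name in os.listdir(folder) if name.lower().endswith(extensions)
    )


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("BloodImage_00001.jpg", "BloodImage_00000.jpg", "100Images_coco.json"):
        open(os.path.join(tmp, name), "w").close()
    image_files = list_image_files(tmp)

print(image_files)  # ['BloodImage_00000.jpg', 'BloodImage_00001.jpg']
```

In this notebook, the equivalent call would be `list_image_files('data/100Images/')`.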


Now filter the dataframe to only images in the list of files.

In [11]:
dataset.df = dataset.df[dataset.df.img_filename.isin(files)].reset_index()
print(f"Number of images {dataset.df.img_filename.nunique()}")


Number of images 100
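Here the count matches the 100 files in the folder. If it were lower, some files in the list would have no annotation rows, and `isin` would drop them without warning. A quick way to find them is a set difference, sketched below with toy stand-ins. The helper name `files_without_annotations` is hypothetical.

```python
def files_without_annotations(files, annotated_filenames):
    """Return the sorted file names that have no matching annotation row."""
    return sorted(set(files) - set(annotated_filenames))


# Toy stand-ins; in the notebook these would be `files` and
# `dataset.df.img_filename`.
files_demo = ["BloodImage_00000.jpg", "BloodImage_00001.jpg", "notes.txt"]
annotated_demo = [
    "BloodImage_00000.jpg",
    "BloodImage_00000.jpg",  # one row per annotation, so names repeat
    "BloodImage_00001.jpg",
]

missing = files_without_annotations(files_demo, annotated_demo)
print(missing)  # ['notes.txt']
```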


# Export annotations back as a COCO JSON file

In [12]:
dataset.path_to_annotations = 'data/100Images/'
dataset.name = '100Images_coco'

dataset.export.ExportToCoco()

['data/100Images/100Images_coco.json']
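As a final check, you may want to open the exported file and count what it contains. The sketch below uses only the standard `json` module. The helper name `summarize_coco` is made up for illustration, and it assumes the export follows the usual COCO layout with top-level `images`, `annotations`, and `categories` lists. The demo writes a tiny file to a temporary directory so it runs on its own.

```python
import json
import os
import tempfile


def summarize_coco(path):
    """Count images, annotations, and categories in a COCO-style JSON file."""
    with open(path) as f:
        coco = json.load(f)
    return {
        key: len(coco.get(key, []))
        for key in ("images", "annotations", "categories")
    }


# Tiny hand-made file, for illustration only.
demo = {
    "images": [{"id": 0, "file_name": "BloodImage_00000.jpg"}],
    "annotations": [
        {"id": 0, "image_id": 0, "category_id": 0, "bbox": [1, 2, 3, 4]},
        {"id": 1, "image_id": 0, "category_id": 0, "bbox": [5, 6, 7, 8]},
    ],
    "categories": [{"id": 0, "name": "RBC"}],
}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_coco.json")
    with open(demo_path, "w") as f:
        json.dump(demo, f)
    summary = summarize_coco(demo_path)

print(summary)  # {'images': 1, 'annotations': 2, 'categories': 1}
```

For the file exported above, the call would be `summarize_coco('data/100Images/100Images_coco.json')`, and the image count should be 100.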