Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Convert COCO annotations to JSON Lines
To train computer vision models, you need to provide labeled image data as input in the form of an AzureML Labeled Dataset. You can either use a Labeled Dataset exported from a Data Labeling project, or create a new Labeled Dataset from your own labeled training data.

This notebook shows how to convert a COCO annotation file to a JSON Lines file, which can then be used to create a Labeled Dataset.
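
To make the conversion concrete, here is a minimal sketch of what turning COCO annotations into JSON Lines records can look like for object detection. It is not the coco2jsonl converter used later in this notebook. The output field names (`image_url`, `image_details`, `label`, `topX`, `topY`, `bottomX`, `bottomY`, `isCrowd`) and the normalization of box coordinates to the 0 to 1 range are assumptions about the target schema, and the function name `coco_to_jsonl_records` is hypothetical.

```python
import json


def coco_to_jsonl_records(coco, base_url):
    """Convert a COCO-style dict into one record per image (assumed schema)."""
    categories = {c["id"]: c["name"] for c in coco["categories"]}
    images = {img["id"]: img for img in coco["images"]}

    # Start with one record per image and an empty label list.
    records = {
        img_id: {
            "image_url": base_url.rstrip("/") + "/" + img["file_name"],
            "image_details": {
                "format": img["file_name"].rsplit(".", 1)[-1],
                "width": f"{img['width']}px",
                "height": f"{img['height']}px",
            },
            "label": [],
        }
        for img_id, img in images.items()
    }

    for ann in coco["annotations"]:
        img = images[ann["image_id"]]
        # COCO boxes are [x_min, y_min, width, height] in pixels.
        x, y, w, h = ann["bbox"]
        records[ann["image_id"]]["label"].append({
            "label": categories[ann["category_id"]],
            "topX": x / img["width"],
            "topY": y / img["height"],
            "bottomX": (x + w) / img["width"],
            "bottomY": (y + h) / img["height"],
            "isCrowd": ann.get("iscrowd", 0),
        })
    return list(records.values())


# Tiny made-up example: one image with one box.
sample_coco = {
    "images": [{"id": 1, "file_name": "a.jpg", "width": 100, "height": 200}],
    "annotations": [{"image_id": 1, "category_id": 7, "bbox": [10, 20, 30, 40]}],
    "categories": [{"id": 7, "name": "cat"}],
}
for record in coco_to_jsonl_records(sample_coco, "AmlDatastore://workspaceblobstore/demo"):
    print(json.dumps(record))  # one JSON object per line
```

Each image becomes one line in the output file, with all of its boxes collected under `label`.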

### Licensing Information
This preview software is made available to you on the condition that you agree to
[your agreement][1] governing your use of Azure, and to the [Supplemental Terms of Use for Microsoft Azure Previews][2], which supplement your agreement governing your use of Azure.
If you do not have an existing agreement governing your use of Azure, you agree that 
your agreement governing use of Azure is the [Microsoft Online Subscription Agreement][3]
(which incorporates the [Online Services Terms][4]).
By using the software you agree to these terms. This software may collect data
that is transmitted to Microsoft. Please see the [Microsoft Privacy Statement][5]
to learn more about how Microsoft processes personal data.

[1]: https://azure.microsoft.com/en-us/support/legal/
[2]: https://azure.microsoft.com/en-us/support/legal/preview-supplemental-terms/
[3]: https://azure.microsoft.com/en-us/support/legal/subscription-agreement/
[4]: http://www.microsoftvolumelicensing.com/DocumentSearch.aspx?Mode=3&DocumentTypeId=46
[5]: http://go.microsoft.com/fwlink/?LinkId=248681 


## Download training images and validation images
In this notebook, we use data for [COCO 2017 Object Detection Task](https://cocodataset.org/#detection-2017) as an example. 

**NOTE**: These datasets are large, and downloading and converting them takes a long time. They serve only as examples. If your raw data is already in Azure storage, you can skip the download step and go directly to the annotation conversion section to convert your dataset into JSON Lines.

- 2017 Train images [**118K/18GB**]

- 2017 Val images [**5K/1GB**]
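
Because the archives are large, it can help to see download progress. The sketch below uses the `reporthook` parameter of `urllib.request.urlretrieve`. The helper names `percent_done` and `report_progress`, and the choice to print every 10,000 blocks, are illustrative and not part of this notebook's original code.

```python
import urllib.request


def percent_done(block_num, block_size, total_size):
    """Return progress as a percentage, or None when the total size is unknown."""
    if total_size <= 0:
        return None
    return min(100.0, block_num * block_size * 100.0 / total_size)


def report_progress(block_num, block_size, total_size):
    # urlretrieve calls this hook with the number of blocks transferred so far,
    # the block size in bytes, and the total file size (which may be -1).
    pct = percent_done(block_num, block_size, total_size)
    if pct is not None and block_num % 10000 == 0:
        print(f"{pct:.1f}% downloaded")


# Example usage (commented out because it starts a large download):
# urllib.request.urlretrieve(
#     "http://images.cocodataset.org/zips/val2017.zip",
#     filename="val2017.zip",
#     reporthook=report_progress,
# )
```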

In [None]:
import os
import urllib.request
from zipfile import ZipFile

extract_to_dir = 'coco_2017'

def downloadAndExtract(url, extract_to_dir):
    # download data
    data_file = url[url.rfind("/")+1:]
    print(f'Downloading file {data_file}.')
    urllib.request.urlretrieve(url, filename=data_file)
    print(f'Downloaded {data_file}.')

    # extract files
    with ZipFile(data_file, 'r') as zip_file:
        print(f'Extracting file {data_file}.')
        zip_file.extractall(path=extract_to_dir)
        print(f'Extracted file {data_file}.')

    # delete zip file
    os.remove(data_file)


In [None]:
# 2017 Train images [118K/18GB]
train_url = "http://images.cocodataset.org/zips/train2017.zip"
# 2017 Val images [5K/1GB] 
val_url = "http://images.cocodataset.org/zips/val2017.zip"
    
# Download training data and validation data to data folder.
downloadAndExtract(train_url, extract_to_dir)
downloadAndExtract(val_url, extract_to_dir)

## Download COCO annotations

- 2017 Train/Val annotations [**241MB**]
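
Before converting, it can be useful to check what an annotation file contains. The sketch below summarizes a COCO-style dictionary using its `images`, `annotations`, and `categories` sections. The function name `summarize_coco` is hypothetical, and the file path in the commented usage assumes the archive was extracted under `coco_2017`.

```python
import json
from collections import Counter


def summarize_coco(coco):
    """Return basic counts for a COCO-style annotation dict."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    per_category = Counter(names[a["category_id"]] for a in coco["annotations"])
    return {
        "images": len(coco["images"]),
        "annotations": len(coco["annotations"]),
        "per_category": dict(per_category),
    }


# Tiny made-up example with two images and three boxes.
tiny_coco = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [
        {"image_id": 1, "category_id": 1},
        {"image_id": 1, "category_id": 2},
        {"image_id": 2, "category_id": 1},
    ],
    "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "dog"}],
}
print(summarize_coco(tiny_coco))

# Example usage on a downloaded file (path is an assumption):
# with open("coco_2017/annotations/instances_val2017.json") as f:
#     print(summarize_coco(json.load(f)))
```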

In [None]:
# 2017 Train/Val annotations [241MB]
annotations_url = "http://images.cocodataset.org/annotations/annotations_trainval2017.zip"
# Download annotation data to data folder.
downloadAndExtract(annotations_url, extract_to_dir)

## Download coco2jsonl converter

We created the coco2jsonl converter to help you convert COCO annotation files to JSON Lines files. It is available in [the GitHub repo automlForImages](https://github.com/swatig007/automlForImages). If you have already cloned the repo, you can use the copy in your local clone and do not need to download it again.

In [None]:
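# Download coco2jsonl converter.
# Use the raw file URL: the github.com "tree" page URL would return an HTML page, not the script.
import urllib.request

coco2jsonl_url = 'https://raw.githubusercontent.com/swatig007/automlForImages/main/MultiClass/utils/coco2jsonl.py'
urllib.request.urlretrieve(coco2jsonl_url, filename="coco2jsonl.py")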

## Convert COCO annotation files to JSON Lines files
**NOTE**: The example datasets are large, and converting them takes a long time. They serve only as examples. Instead of running the conversion on the example data, you may prefer to adapt the scripts in this notebook to your own data.
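
Since the full conversion is slow, one option is to try the workflow first on a small subset of an annotation file. The sketch below keeps the first few images and only the annotations that refer to them. The function name `subset_coco` and the file paths in the commented usage are assumptions for illustration.

```python
import json


def subset_coco(coco, max_images):
    """Return a copy of a COCO-style dict limited to the first max_images images."""
    keep = coco["images"][:max_images]
    keep_ids = {img["id"] for img in keep}
    # Copy every other top-level section (for example categories) unchanged.
    subset = {k: v for k, v in coco.items() if k not in ("images", "annotations")}
    subset["images"] = keep
    subset["annotations"] = [
        a for a in coco["annotations"] if a["image_id"] in keep_ids
    ]
    return subset


# Example usage (paths are assumptions):
# with open("coco_2017/annotations/instances_val2017.json") as f:
#     small = subset_coco(json.load(f), max_images=100)
# with open("coco_2017/annotations/instances_val2017_small.json", "w") as f:
#     json.dump(small, f)
```

The resulting smaller file can be passed to the converter in place of the full annotation file to check the output format quickly.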

In [None]:
# The annotations were extracted under ./coco_2017 (see extract_to_dir above).
# Generate jsonl file for training data.
!python coco2jsonl.py \
--input_coco_file_path "./coco_2017/annotations/instances_train2017.json" \
--output_dir "./coco" --output_file_name "instances_train2017.jsonl" \
--task_type "ObjectDetection" \
--base_url "AmlDatastore://workspaceblobstore/coco_2017/train2017/"

# Generate jsonl file for validation data.
!python coco2jsonl.py \
--input_coco_file_path "./coco_2017/annotations/instances_val2017.json" \
--output_dir "./coco" --output_file_name "instances_val2017.jsonl" \
--task_type "ObjectDetection" \
--base_url "AmlDatastore://workspaceblobstore/coco_2017/val2017/"