add jsonl conversion helpers for automl-image notebooks (Azure#2202)
* prototype notebooks for jsonl conversion: multiclass, multilabel, object detection coco, object detection voc, instance segmentation coco, instance segmentation voc

* implement jsonl conversion for multiclass and multilabel classification, demonstrate in jsonl-conversion/notebooks and modify automl-image-classification-multiclass-task-fridge-items notebook to use the new implementation
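
The classification schema these converters emit is simple enough to sketch standalone: one JSON object per line pairing an `image_url` with a `label` (per the AutoML images data schema docs). The helper and paths below are hypothetical illustrations, not the converters added by this commit:

```python
import json

def classification_jsonl_lines(base_url, image_labels):
    """Yield one JSON line per image in the multiclass schema:
    {"image_url": <base_url + relative path>, "label": <class name>}.
    `image_labels` maps a relative image path to its class name."""
    for rel_path, label in sorted(image_labels.items()):
        yield json.dumps({"image_url": base_url + rel_path, "label": label})

# Hypothetical folder-per-class layout, as in the fridge-items dataset:
lines = list(classification_jsonl_lines(
    "azureml://datastores/workspaceblobstore/paths/fridgeObjects/",
    {"carton/1.jpg": "carton", "water_bottle/7.jpg": "water_bottle"},
))
```

Writing these lines to a file, one per image, yields the `annotations.jsonl` the notebooks then split into train and validation sets.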

* change multilabel notebook to use new jsonl conversion

* implement coco jsonl converter for object detection and change notebook to use new implementation, verified that new implementation produces the same jsonl file as the old implementation

* implement voc jsonl converter for object detection and demonstrate in object detection notebook, verified that new implementation produces the same train and validation json files as the original
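
For object detection, the AutoML JSONL schema stores each box as fractions of the image size (`topX`/`topY`/`bottomX`/`bottomY`) rather than VOC's pixel coordinates, so the core of a VOC converter is a normalization step like this (function name hypothetical):

```python
def voc_box_to_jsonl_label(xmin, ymin, xmax, ymax, width, height, class_name):
    """Convert a VOC pixel-space box (xmin, ymin, xmax, ymax) into the
    normalized label entry used by the AutoML object-detection JSONL schema."""
    return {
        "label": class_name,
        "topX": xmin / width,
        "topY": ymin / height,
        "bottomX": xmax / width,
        "bottomY": ymax / height,
    }

# A 100x200-pixel box at (10, 20) in a 200x400 image:
label = voc_box_to_jsonl_label(10, 20, 110, 220, 200, 400, "carton")
```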

* clear outputs in changed notebooks

* add instance segmentation to voc jsonl converter, verify that it generates the same train and val annotation files as original, verify that it generates the same annotation file as the coco jsonl converter for object detection

* implement coco to jsonl conversion for instance segmentation for iscrowd==0, tested with example notebook from Azure/medical-imaging which ignores crowd annotations, also verified that this does not break the coco to jsonl code for object detection

* implement coco to jsonl for instance segmentation for iscrowd==1, generate coco data for instance segmentation notebook, add coco usage to instance segmentation notebook, verified that the jsonl files generated for both voc to jsonl and coco to jsonl (using the newly generated data) are equivalent

* refactor mask to polygon using automl.dnn.vision helpers

* handling for compressed and uncompressed rle in coco 2 jsonl converter
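
COCO's uncompressed RLE stores alternating run lengths of background and foreground pixels in column-major order; the compressed form is typically decoded with `pycocotools.mask.decode`, but the uncompressed form can be sketched in plain Python (a simplified illustration, not the converter's actual code):

```python
def decode_uncompressed_rle(rle):
    """Decode COCO uncompressed RLE: `counts` alternates run lengths of
    background (0) and foreground (1) pixels, in column-major order."""
    height, width = rle["size"]
    flat, value = [], 0
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    assert len(flat) == height * width, "runs must cover the whole mask"
    # Undo column-major order: pixel (row r, col c) lives at flat[c * height + r].
    return [[flat[c * height + r] for c in range(width)] for r in range(height)]

# 2x2 mask: one background pixel, two foreground, one background.
mask = decode_uncompressed_rle({"size": [2, 2], "counts": [1, 2, 1]})
```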

* generate odFridgeObjects data in coco format using rle instead of polygons

* demonstrate coco to jsonl for rle data

* add docstrings to jsonl conversion code, test with notebooks again, clean up extraneous files

* remove extraneous imports

* add azureml-automl-dnn-vision pip install for voc to jsonl conversion

* reformat with black

* add od batch scoring notebook

* respond to pr comments: remove unnecessary pip installs, revert notebook metadata, revert modified experiment names, remove az login calls

* restore pip install for azureml-automl-dnn-vision, needed to pass gate

* revert notebook metadata

* copy masktools helpers from azureml-automl-dnn-vision directly into source code

* remove unnecessary pip installs, reformat with black, restore metadata

* include imports for pycocotools and simplification, necessary for jsonl conversion

* add skimage pip install

* fix skimage -> scikit-image pip install

* clarify markdown for pip install prompts

Co-authored-by: sharma-riti <52715641+sharma-riti@users.noreply.github.com>

---------

Co-authored-by: Rehaan Bhimani <rbhimani@microsoft.com>
Co-authored-by: sharma-riti <52715641+sharma-riti@users.noreply.github.com>
3 people committed Apr 27, 2023
1 parent d3bf478 commit fdadd9d
Showing 16 changed files with 41,639 additions and 672 deletions.
@@ -148,8 +148,15 @@
"# Extract current dataset name from dataset url\n",
"dataset_name = os.path.split(download_url)[-1].split(\".\")[0]\n",
"# Get dataset path for later use\n",
"dataset_dir = os.path.join(dataset_parent_dir, dataset_name)\n",
"\n",
"dataset_dir = os.path.join(dataset_parent_dir, dataset_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get the data zip file path\n",
"data_file = os.path.join(dataset_parent_dir, f\"{dataset_name}.zip\")\n",
"\n",
@@ -244,6 +251,41 @@
"In order to use this data to create an AzureML MLTable, we first need to convert it to the required JSONL format. The following script creates two `.jsonl` files (one for training and one for validation) in the corresponding MLTable folders, with 20% of the data going into the validation file. For further details on the JSONL format used for the image classification task in automated ML, please refer to the [data schema documentation for the multi-class image classification task](https://learn.microsoft.com/en-us/azure/machine-learning/reference-automl-images-schema#image-classification-binarymulti-class)."
]
},
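
The 20% figure above comes from a modulus split: with a ratio of 5, every fifth JSON line lands in the validation file. A standalone sketch of that logic (function name hypothetical):

```python
def split_jsonl(lines, train_validation_ratio=5):
    """Route every Nth line to validation (1/N of the data), rest to train."""
    train, validation = [], []
    for index, line in enumerate(lines):
        if index % train_validation_ratio == 0:
            validation.append(line)
        else:
            train.append(line)
    return train, validation

# Ten annotation lines -> 8 train, 2 validation.
train, validation = split_jsonl([f'{{"i": {i}}}\n' for i in range(10)])
```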
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## First generate the jsonl file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"\n",
"sys.path.insert(0, \"../jsonl-conversion/\")\n",
"from base_jsonl_converter import write_json_lines\n",
"from classification_jsonl_converter import ClassificationJSONLConverter\n",
"\n",
"converter = ClassificationJSONLConverter(\n",
" uri_folder_data_asset.path, data_dir=dataset_dir\n",
")\n",
"jsonl_annotations = os.path.join(dataset_dir, \"annotations.jsonl\")\n",
"write_json_lines(converter, jsonl_annotations)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Now split the annotations into train and validation"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -270,35 +312,21 @@
" validation_mltable_path, \"validation_annotations.jsonl\"\n",
")\n",
"\n",
"# Baseline of json line dictionary\n",
"json_line_sample = {\n",
" \"image_url\": uri_folder_data_asset.path,\n",
" \"label\": \"\",\n",
"}\n",
"\n",
"with open(jsonl_annotations, \"r\") as annot_f:\n",
" json_lines = annot_f.readlines()\n",
"\n",
"index = 0\n",
"# Scan each sub directory and generate a jsonl line per image, distributed across train and validation JSONL files\n",
"with open(train_annotations_file, \"w\") as train_f:\n",
" with open(validation_annotations_file, \"w\") as validation_f:\n",
" for class_name in os.listdir(dataset_dir):\n",
" sub_dir = os.path.join(dataset_dir, class_name)\n",
" if not os.path.isdir(sub_dir):\n",
" continue\n",
"\n",
"            # Scan each sub directory\n",
" print(f\"Parsing {sub_dir}\")\n",
" for image in os.listdir(sub_dir):\n",
" json_line = dict(json_line_sample)\n",
" json_line[\"image_url\"] += f\"{class_name}/{image}\"\n",
" json_line[\"label\"] = class_name\n",
"\n",
" if index % train_validation_ratio == 0:\n",
" # validation annotation\n",
" validation_f.write(json.dumps(json_line) + \"\\n\")\n",
" else:\n",
" # train annotation\n",
" train_f.write(json.dumps(json_line) + \"\\n\")\n",
" index += 1"
" for json_line in json_lines:\n",
" if index % train_validation_ratio == 0:\n",
" # validation annotation\n",
" validation_f.write(json_line)\n",
" else:\n",
" # train annotation\n",
" train_f.write(json_line)\n",
" index += 1"
]
},
{
@@ -1399,9 +1427,6 @@
}
],
"metadata": {
"interpreter": {
"hash": "da404a94b19d2e6a57be28cbcf6e71fbd41612916c3423bc5e257c11b3d83fa0"
},
"kernel_info": {
"name": "python3-azureml"
},
@@ -234,6 +234,41 @@
"In order to use this data to create an AzureML MLTable, we first need to convert it to the required JSONL format. The following script creates two `.jsonl` files (one for training and one for validation) in the corresponding MLTable folders, with 20% of the data going into the validation file. For further details on the JSONL format used for the image classification task in automated ML, please refer to the [data schema documentation for the multi-label image classification task](https://learn.microsoft.com/en-us/azure/machine-learning/reference-automl-images-schema#image-classification-multi-label)."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## First generate the jsonl file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"\n",
"sys.path.insert(0, \"../jsonl-conversion/\")\n",
"from base_jsonl_converter import write_json_lines\n",
"from classification_jsonl_converter import ClassificationJSONLConverter\n",
"\n",
"converter = ClassificationJSONLConverter(\n",
" uri_folder_data_asset.path, label_file=os.path.join(dataset_dir, \"labels.csv\")\n",
")\n",
"jsonl_annotations = os.path.join(dataset_dir, \"annotations.jsonl\")\n",
"write_json_lines(converter, jsonl_annotations)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Then split the data into train and validation"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -259,37 +294,20 @@
" validation_mltable_path, \"validation_annotations.jsonl\"\n",
")\n",
"\n",
"# Baseline of json line dictionary\n",
"json_line_sample = {\n",
" \"image_url\": uri_folder_data_asset.path,\n",
" \"label\": [],\n",
"}\n",
"\n",
"# Path to the labels file.\n",
"labelFile = os.path.join(dataset_dir, \"labels.csv\")\n",
"with open(jsonl_annotations, \"r\") as annot_f:\n",
" json_lines = annot_f.readlines()\n",
"\n",
"# Read each annotation and convert it to jsonl line\n",
"index = 0\n",
"with open(train_annotations_file, \"w\") as train_f:\n",
" with open(validation_annotations_file, \"w\") as validation_f:\n",
" with open(labelFile, \"r\") as labels:\n",
" for i, line in enumerate(labels):\n",
" # Skipping the title line and any empty lines.\n",
" if i == 0 or len(line.strip()) == 0:\n",
" continue\n",
" line_split = line.strip().split(\",\")\n",
" if len(line_split) != 2:\n",
" print(f\"Skipping the invalid line: {line}\")\n",
" continue\n",
" json_line = dict(json_line_sample)\n",
" json_line[\"image_url\"] += f\"images/{line_split[0]}\"\n",
" json_line[\"label\"] = line_split[1].strip().split(\" \")\n",
"\n",
" if i % train_validation_ratio == 0:\n",
" # validation annotation\n",
" validation_f.write(json.dumps(json_line) + \"\\n\")\n",
" else:\n",
" # train annotation\n",
" train_f.write(json.dumps(json_line) + \"\\n\")"
" for json_line in json_lines:\n",
" if index % train_validation_ratio == 0:\n",
" # validation annotation\n",
" validation_f.write(json_line)\n",
" else:\n",
" # train annotation\n",
" train_f.write(json_line)\n",
" index += 1"
]
},
{
@@ -231,15 +231,61 @@
"In order to use this data to create an AzureML MLTable, we first need to convert it to the required JSONL format. The following script creates two `.jsonl` files (one for training and one for validation) in the corresponding MLTable folders, with 20% of the data going into the validation file. For further details on the JSONL format used for the image instance segmentation task in automated ML, please refer to the [data schema documentation for image instance segmentation task](https://learn.microsoft.com/en-us/azure/machine-learning/reference-automl-images-schema#instance-segmentation)."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## First generate JSONL files"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The JSONL conversion helpers require the pycocotools, simplification, and scikit-image packages. If you don't have them installed, run this cell before converting the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install pycocotools\n",
"!pip install simplification\n",
"!pip install scikit-image"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The jsonl_converter below relies on scikit-image and simplification.\n",
"# If you don't have them installed, install them before converting data by running this cell.\n",
"%pip install \"scikit-image\" \"simplification\""
"import sys\n",
"\n",
"sys.path.insert(0, \"../jsonl-conversion/\")\n",
"from base_jsonl_converter import write_json_lines\n",
"from voc_jsonl_converter import VOCJSONLConverter\n",
"\n",
"base_url = os.path.join(uri_folder_data_asset.path, \"images/\")\n",
"converter = VOCJSONLConverter(\n",
" base_url,\n",
" os.path.join(dataset_dir, \"annotations\"),\n",
" mask_dir=os.path.join(dataset_dir, \"segmentation-masks\"),\n",
")\n",
"jsonl_annotations = os.path.join(dataset_dir, \"annotations_voc.jsonl\")\n",
"write_json_lines(converter, jsonl_annotations)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Then split into train and validation"
]
},
{
@@ -248,16 +294,102 @@
"metadata": {},
"outputs": [],
"source": [
"from jsonl_converter import convert_mask_in_VOC_to_jsonl\n",
"import os\n",
"\n",
"convert_mask_in_VOC_to_jsonl(dataset_dir, uri_folder_data_asset.path)"
"# We'll copy each JSONL file within its related MLTable folder\n",
"training_mltable_path = os.path.join(dataset_parent_dir, \"training-mltable-folder\")\n",
"validation_mltable_path = os.path.join(dataset_parent_dir, \"validation-mltable-folder\")\n",
"\n",
"# First, let's create the folders if they don't exist\n",
"os.makedirs(training_mltable_path, exist_ok=True)\n",
"os.makedirs(validation_mltable_path, exist_ok=True)\n",
"\n",
"train_validation_ratio = 5\n",
"\n",
"# Path to the training and validation files\n",
"train_annotations_file = os.path.join(training_mltable_path, \"train_annotations.jsonl\")\n",
"validation_annotations_file = os.path.join(\n",
" validation_mltable_path, \"validation_annotations.jsonl\"\n",
")\n",
"\n",
"with open(jsonl_annotations, \"r\") as annot_f:\n",
" json_lines = annot_f.readlines()\n",
"\n",
"index = 0\n",
"with open(train_annotations_file, \"w\") as train_f:\n",
" with open(validation_annotations_file, \"w\") as validation_f:\n",
" for json_line in json_lines:\n",
" if index % train_validation_ratio == 0:\n",
" # validation annotation\n",
" validation_f.write(json_line)\n",
" else:\n",
" # train annotation\n",
" train_f.write(json_line)\n",
" index += 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.4. Convert annotation file from COCO to JSONL\n",
"If you want to try with a dataset in COCO format, the scripts below show how to convert it to `jsonl` format. The file \"odFridgeObjects_coco.json\" contains the annotation information for the `odFridgeObjects` dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"\n",
"sys.path.insert(0, \"../jsonl-conversion/\")\n",
"from base_jsonl_converter import write_json_lines\n",
"from coco_jsonl_converter import COCOJSONLConverter\n",
"\n",
"base_url = uri_folder_data_asset.path + \"images/\"\n",
"print(base_url)\n",
"converter = COCOJSONLConverter(base_url, \"./odFridgeObjectsMask_coco.json\")\n",
"jsonl_annotations = os.path.join(dataset_dir, \"annotations_coco.jsonl\")\n",
"write_json_lines(converter, jsonl_annotations)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"If your COCO segmentation data is encoded in RLE format, it can be converted as follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"\n",
"sys.path.insert(0, \"../jsonl-conversion/\")\n",
"from base_jsonl_converter import write_json_lines\n",
"from coco_jsonl_converter import COCOJSONLConverter\n",
"\n",
"base_url = uri_folder_data_asset.path + \"images/\"\n",
"print(base_url)\n",
"converter = COCOJSONLConverter(\n",
" base_url, \"./odFridgeObjectsMask_coco_rle.json\", compressed_rle=True\n",
")\n",
"jsonl_annotations = os.path.join(dataset_dir, \"annotations_coco.jsonl\")\n",
"write_json_lines(converter, jsonl_annotations)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.4. Create MLTable data input\n",
"## 2.5. Create MLTable data input\n",
"\n",
"Create MLTable data input using the jsonl files created above.\n",
"\n",