<a href="https://colab.research.google.com/github/yasserius/satellite_image_tinhouse_detector/blob/main/training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Welcome to a tutorial on how to train an object detection model on your custom dataset with the TF2 Object Detection API.

If you are looking for a model which draws bounding boxes around objects, then you have come to the right place.

This tutorial is completely contained in a single Colab notebook, so there is no need to run any code on your computer.

The main steps of the tutorial:
1. Install the Object Detection API and other dependencies
2. Load your data from Google Drive
3. Convert the data into TFRecord files
4. Train the model

# References

This tutorial borrows heavily from the following awesome guides:
- [Tensorflow Object Detection API Tutorial](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/)
- [Official Tensorflow Object Detection API Guides](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2.md)
- [TF1 Object Detection Training on GCP](https://colab.research.google.com/github/cloud-annotations/google-colab-training/blob/master/object_detection.ipynb)

# Install the environment

This notebook was run with Python 3.6 and TensorFlow 2.3, so those versions should work. Check what your runtime has:

In [None]:
!python --version

Python 3.6.9


In [None]:
import tensorflow as tf
print(tf.__version__)

2.3.0
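
If you would rather have the notebook flag a version mismatch than compare the printed numbers by eye, a small helper can do it. This is only a sketch: `major_minor` and `matches_tested` are hypothetical names, and comparing only the major and minor parts is an assumption about how strict the check needs to be.

```python
import re

def major_minor(version):
    """Return (major, minor) as ints from a version string such as '2.3.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))

def matches_tested(version, tested):
    """True when `version` has the same major.minor as the tested version."""
    return major_minor(version) == major_minor(tested)

# The two version strings printed by the cells above.
print(matches_tested("3.6.9", "3.6"))
print(matches_tested("2.3.0", "2.3"))
```

In the notebook you could pass `platform.python_version()` and `tf.__version__` instead of the literal strings.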


## Setting directory locations

In [None]:
MODELS_DIR = "/content/models"
OBJ_DET_DIR = "/content/models/research/object_detection"

## Cloning from github

The installation follows the [official guide](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2.md).

In [None]:
!git clone https://github.com/tensorflow/models.git

Cloning into 'models'...
remote: Enumerating objects: 67, done.
remote: Counting objects: 100% (67/67), done.
remote: Compressing objects: 100% (65/65), done.
remote: Total 46144 (delta 26), reused 43 (delta 2), pack-reused 46077
Receiving objects: 100% (46144/46144), 551.17 MiB | 30.04 MiB/s, done.
Resolving deltas: 100% (31629/31629), done.


In [None]:
%%bash
cd models/research
# Compile protos.
protoc object_detection/protos/*.proto --python_out=.
# Install TensorFlow Object Detection API.
cp object_detection/packages/tf2/setup.py .
python -m pip install .

## Adding paths to environment variables

In [None]:
import os

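main_dir = "/content"
research_dir = os.path.join(main_dir, "models", "research")

# Append each path once; .get() avoids a KeyError when PYTHONPATH is unset.
# The slim package lives under models/research, not directly under /content.
paths = [os.environ.get('PYTHONPATH', ''), main_dir, f'{main_dir}/models',
         research_dir, f'{research_dir}/slim']
os.environ['PYTHONPATH'] = ':'.join(p for p in paths if p)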

## Testing the installation

In [None]:
!python "$OBJ_DET_DIR/builders/model_builder_tf2_test.py"

There will be a really long output.

But if the installation succeeded, you should see something like this at the bottom:

```
----------------------------------------------------------------------
Ran 20 tests in 45.495s

OK (skipped=1)
```

# Data

## How to label your data

Your data must be either JPG or PNG format images, and the annotations must be in [PASCAL VOC XML format](https://gist.github.com/Prasad9/30900b0ef1375cc7385f4d85135fdb44).
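
To give a feel for what a PASCAL VOC annotation holds, here is a minimal sketch that reads the bounding boxes out of one with the standard library. The sample XML is made up for illustration (file name, image size, and box coordinates are all assumptions), and `read_boxes` is a hypothetical helper, not part of the Object Detection API.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for illustration only.
SAMPLE_XML = """<annotation>
  <filename>1.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>house</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
</annotation>"""

def read_boxes(xml_text):
    """Return a list of (class_name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = [int(float(box.findtext(tag)))
                  for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), *coords))
    return boxes

print(read_boxes(SAMPLE_XML))
```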

The best way to annotate for this tutorial is to follow [this tutorial](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#annotate-the-dataset), which uses [LabelImg](https://github.com/tzutalin/labelImg) to annotate the images.


<img src="https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/_images/labelImg.JPG" height=300px>

<small>LabelImg</small>

Once you have the data, upload it to Google Drive and paste the folder's path into `G_DRIVE_PATH` below.

The data folder must contain all the images and XML files in it, like this:
```
1.jpg
1.xml
2.jpg
2.xml
...
```
Make sure each image and its XML file share the same base name. The names don't have to be numbers; they can be anything.
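
A quick way to catch naming mismatches before training is to compare the base names of the images and XML files. The following is a sketch under the assumption that images end in `.jpg`, `.jpeg`, or `.png`; `find_unpaired` is a hypothetical helper, and the demo runs on a throwaway directory rather than your Drive folder.

```python
import os
import tempfile

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")

def find_unpaired(data_dir):
    """Return (image stems without XML, XML stems without image), both sorted."""
    images, xmls = set(), set()
    for name in os.listdir(data_dir):
        stem, ext = os.path.splitext(name)
        if ext.lower() in IMAGE_EXTENSIONS:
            images.add(stem)
        elif ext.lower() == ".xml":
            xmls.add(stem)
    return sorted(images - xmls), sorted(xmls - images)

# Demo: one complete pair and one image with no annotation.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1.jpg", "1.xml", "2.jpg"):
        open(os.path.join(tmp, name), "w").close()
    print(find_unpaired(tmp))
```

Both lists should be empty for a clean dataset; in the notebook you would call `find_unpaired(G_DRIVE_PATH)` after mounting Drive.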

## Copying your data to colab

In [None]:
# Paste your path here.
G_DRIVE_PATH = '/content/drive/My Drive/data/satellite_images_dhaka/'

Mount Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Your images and XML files will be copied to `DATA_DIR`.

In [None]:
DATA_DIR = "/content/data"

!mkdir "$DATA_DIR"

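Check to see if Google Drive contains your data. The cells in this section assume PNG images; if yours are JPG, change the `.png` patterns accordingly.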

In [None]:
import glob
from pprint import pprint

images_paths = glob.glob(G_DRIVE_PATH + "*.png")
images_paths = sorted(images_paths)

pprint(images_paths[:10])

['/content/drive/My Drive/data/satellite_images_dhaka/0_1.png',
 '/content/drive/My Drive/data/satellite_images_dhaka/0_10.png',
 '/content/drive/My Drive/data/satellite_images_dhaka/0_11.png',
 '/content/drive/My Drive/data/satellite_images_dhaka/0_12.png',
 '/content/drive/My Drive/data/satellite_images_dhaka/0_13.png',
 '/content/drive/My Drive/data/satellite_images_dhaka/0_14.png',
 '/content/drive/My Drive/data/satellite_images_dhaka/0_15.png',
 '/content/drive/My Drive/data/satellite_images_dhaka/0_16.png',
 '/content/drive/My Drive/data/satellite_images_dhaka/0_5.png',
 '/content/drive/My Drive/data/satellite_images_dhaka/0_6.png']


In [None]:
print(len(images_paths))

1260


## Train-Validation Split

By default, 90% of the images go to the training set and the rest to validation. You can change `TRAIN_SPLIT` to another percentage, such as 80 or 95.

In [None]:
from random import shuffle

TRAIN_SPLIT = 90

limit = int(len(images_paths) * TRAIN_SPLIT / 100)

shuffle(images_paths)
train_images = images_paths[:limit]
val_images = images_paths[limit:]

print("Number of train images:", len(train_images))
print("Number of validation images:", len(val_images))
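
The split above changes every time the cell runs, because `shuffle` is unseeded. If you want the same split across sessions, one option is to shuffle with a seeded generator. This is a sketch, not the notebook's method: `split_paths` and the seed value 42 are assumptions.

```python
import random

def split_paths(paths, train_percent=90, seed=42):
    """Shuffle a sorted copy of `paths` with a fixed seed and return (train, val)."""
    shuffled = sorted(paths)                    # sort first so input order doesn't matter
    random.Random(seed).shuffle(shuffled)       # private generator; global state untouched
    limit = int(len(shuffled) * train_percent / 100)
    return shuffled[:limit], shuffled[limit:]

# Same number of images as the dataset in this notebook.
paths = [f"{i}.png" for i in range(1260)]
train, val = split_paths(paths)
print(len(train), len(val))
```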

The train images and annotations will be copied to `/content/data/train` and the validation ones to `/content/data/val`.

In [None]:
from IPython.display import display, clear_output
from shutil import copyfile
import os

TRAIN_DATA_DIR = os.path.join(DATA_DIR, "train")
VAL_DATA_DIR = os.path.join(DATA_DIR, "val")

os.makedirs(TRAIN_DATA_DIR, exist_ok=True)
os.makedirs(VAL_DATA_DIR, exist_ok=True)

def copy_split(image_paths, destination, split_name):
  """Copy each image and its same-named XML annotation into `destination`."""
  for i, img_path in enumerate(image_paths):
    # Swap only the file extension, so a ".png" elsewhere in the path is untouched.
    xml_path = os.path.splitext(img_path)[0] + ".xml"

    if i % 100 == 0:
      clear_output(wait=True)
      display("Copying {} out of {} files.".format(i, len(image_paths)))

    for path in (img_path, xml_path):
      copyfile(path, os.path.join(destination, os.path.basename(path)))

  print("Successfully copied {} data to {}".format(split_name, destination))

copy_split(train_images, TRAIN_DATA_DIR, "train")
copy_split(val_images, VAL_DATA_DIR, "validation")

Test to see if the files were copied properly:

In [None]:
print("# images in train dir:", len(glob.glob(TRAIN_DATA_DIR + "/*.png")))
print("# XML in train dir:", len(glob.glob(TRAIN_DATA_DIR + "/*.xml")))
print("# images in validation dir:", len(glob.glob(VAL_DATA_DIR + "/*.png")))
print("# XML in validation dir:", len(glob.glob(VAL_DATA_DIR + "/*.xml")))

1134
1134
126
126


# Generate TFRecords

Now that all the data has been copied to Colab, it must be converted to the TFRecord format, which is the input format the Object Detection API reads. The TensorFlow documentation on TFRecord describes the format in detail.

I adapted this [official script](https://github.com/tensorflow/models/blob/master/research/object_detection/dataset_tools/create_pascal_tf_record.py) and created a github [gist](https://gist.github.com/yasserius/ef9eb79c3f2f516ed1e4f793150d6f76), which is what is being used in the following steps.

The `train.record` and `val.record` files will be stored in the `/content/data` directory.

In [None]:
TRAIN_TFRECORD_PATH = os.path.join(DATA_DIR, "train.record")
VAL_TFRECORD_PATH = os.path.join(DATA_DIR, "val.record")

Downloading the gist as a file named `tfrecord_generator.py`.

In [None]:
!wget https://gist.githubusercontent.com/yasserius/ef9eb79c3f2f516ed1e4f793150d6f76/raw/20cee1a342d79e24b649a7b3dbc9500be832ce6f/tfrecord_generator.py

--2020-10-22 06:00:52--  https://gist.githubusercontent.com/yasserius/ef9eb79c3f2f516ed1e4f793150d6f76/raw/20cee1a342d79e24b649a7b3dbc9500be832ce6f/tfrecord_generator.py
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6460 (6.3K) [text/plain]
Saving to: ‘tfrecord_generator.py’


2020-10-22 06:00:52 (72.2 MB/s) - ‘tfrecord_generator.py’ saved [6460/6460]



The script does two things:
1. It generates a `label_map.pbtxt` file, which contains all the class names and their indices.
2. It generates `train.record` and `val.record` using the image and XML files.
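
To make the first step concrete, here is a rough sketch of what a label map file contains. It is not the code from the gist: `label_map_text` is a hypothetical function, and assigning ids in alphabetical order is an assumption. What it does follow is the Object Detection API convention that ids start at 1, with 0 reserved for the background.

```python
def label_map_text(class_names):
    """Build label-map text with one `item` block per class, ids starting at 1."""
    items = []
    for index, name in enumerate(sorted(set(class_names)), start=1):
        items.append(f"item {{\n  id: {index}\n  name: '{name}'\n}}\n")
    return "\n".join(items)

# A single-class dataset, like the one in this notebook.
print(label_map_text(["house"]))
```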

## Generating label map file

In [None]:
from tfrecord_generator import create_tfrecords, generate_label_map

LABEL_MAP_PATH = os.path.join(DATA_DIR, "label_map.pbtxt")

label_map_dict = generate_label_map(TRAIN_DATA_DIR, output_path=LABEL_MAP_PATH)

print(label_map_dict)

Successfully created /content/data/label_map.pbtxt
{'house': 1}


## Generating TFRecord files

In [None]:
create_tfrecords(TRAIN_DATA_DIR,
                 output_path=TRAIN_TFRECORD_PATH,
                 label_map_dict=label_map_dict)

create_tfrecords(VAL_DATA_DIR,
                 output_path=VAL_TFRECORD_PATH,
                 label_map_dict=label_map_dict)


Check the sizes of the record files as a sanity check. The sizes are printed in bytes and should be close to the total size of your raw data.

In [None]:
!stat -c%s "$TRAIN_TFRECORD_PATH"

338842333


In [None]:
!stat -c%s "$VAL_TFRECORD_PATH"

38516541
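
The same comparison can be done in Python instead of by eye. This sketch computes the ratio of a record file's size to the combined size of its source files; `total_size` and `size_ratio` are hypothetical helpers, and the demo uses small dummy files rather than real data.

```python
import os
import tempfile

def total_size(paths):
    """Sum the on-disk size in bytes of every path in `paths`."""
    return sum(os.path.getsize(path) for path in paths)

def size_ratio(record_path, source_paths):
    """Ratio of the record file's size to the combined size of the source files."""
    return os.path.getsize(record_path) / total_size(source_paths)

# Demo: 150 bytes of "raw data" and a 150-byte "record" give a ratio of 1.0.
with tempfile.TemporaryDirectory() as tmp:
    sources = []
    for name, size in (("a.png", 100), ("a.xml", 50)):
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(b"\0" * size)
        sources.append(path)
    record = os.path.join(tmp, "demo.record")
    with open(record, "wb") as f:
        f.write(b"\0" * 150)
    print(size_ratio(record, sources))
```

In the notebook, the sources would be the files under `TRAIN_DATA_DIR` and the record would be `TRAIN_TFRECORD_PATH`. A ratio far from 1 suggests that files were skipped or duplicated.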


## (Optional) copy record files to google drive

Since Colab loses the record files when the session ends, it is best to copy them back to Drive.

In [None]:
!cp "$TRAIN_TFRECORD_PATH" "$G_DRIVE_PATH"
!cp "$VAL_TFRECORD_PATH" "$G_DRIVE_PATH"

# Download pretrained model

## Paste the link to pretrained model

You must choose a pretrained model from [TF2 Model Zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md).

Then paste the link to the tar file below:

In [None]:
MODEL_URL = 'http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz'

## Downloading to checkpoint directory

The downloaded model will be in `/content/pretrained`.

In [None]:
CHECKPOINT_DIR = "/content/pretrained"

filename = MODEL_URL.split('/')[-1]

!mkdir "$CHECKPOINT_DIR"
!wget $MODEL_URL
!tar -xzvf "$filename" -C "$CHECKPOINT_DIR"

--2020-10-22 13:44:40--  http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 74.125.26.128, 2607:f8b0:400c:c04::80
Connecting to download.tensorflow.org (download.tensorflow.org)|74.125.26.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 244817203 (233M) [application/x-tar]
Saving to: ‘ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz’


2020-10-22 13:44:43 (74.6 MB/s) - ‘ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz’ saved [244817203/244817203]

ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/
ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/
ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0.data-00000-of-00001
ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/checkpoint
ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0.index
ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/pipeline.config
ssd_resnet50_v1_fpn_640x640_coco17_
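
As the listing shows, the archive extracts into a folder named after the archive itself. If you need that folder name in code, it can be derived from the URL. This is a small sketch; `archive_and_model_name` is a hypothetical helper, and it assumes the archive keeps the `.tar.gz` suffix used by this model.

```python
import posixpath
from urllib.parse import urlparse

MODEL_URL = 'http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz'

def archive_and_model_name(url):
    """Return (archive file name, name without the .tar.gz suffix)."""
    filename = posixpath.basename(urlparse(url).path)
    suffix = ".tar.gz"
    model_name = filename[:-len(suffix)] if filename.endswith(suffix) else filename
    return filename, model_name

print(archive_and_model_name(MODEL_URL))
```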

The following files are needed for our training:
1. `ckpt-0`
2. `pipeline.config`

The following cell locates them and stores their paths in `PIPELINE_TEMPLATE_PATH` and `CKPT_PATH`.

In [None]:
import os

PIPELINE_TEMPLATE_PATH = None
CKPT_PATH = None

for root, subdirs, files in os.walk(CHECKPOINT_DIR):
  if "pipeline.config" in files:
    PIPELINE_TEMPLATE_PATH = os.path.join(root, "pipeline.config")
    print(PIPELINE_TEMPLATE_PATH)
  if "ckpt-0.index" in files:
    # The API expects the checkpoint prefix ("ckpt-0"), not the .data or .index file.
    CKPT_PATH = os.path.join(root, "ckpt-0")
    print(CKPT_PATH)

/content/pretrained/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/pipeline.config
/content/pretrained/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0
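
A TF2 checkpoint is spread over several files (`ckpt-0.index`, `ckpt-0.data-00000-of-00001`), and the path handed to the training config is their shared prefix, `ckpt-0`. If a model ships a checkpoint with a different number, a more general search can help. The sketch below is an assumption-laden illustration: `find_checkpoint_prefixes` is a hypothetical helper, and the demo recreates the folder layout from the listing above with empty files.

```python
import os
import tempfile

def find_checkpoint_prefixes(root):
    """Return sorted checkpoint prefixes (paths minus the '.index' suffix) under `root`."""
    prefixes = []
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            if name.startswith("ckpt-") and name.endswith(".index"):
                prefixes.append(os.path.join(folder, name[:-len(".index")]))
    return sorted(prefixes)

# Demo: mimic the extracted archive layout with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = os.path.join(tmp, "checkpoint")
    os.makedirs(ckpt_dir)
    for name in ("checkpoint", "ckpt-0.index", "ckpt-0.data-00000-of-00001"):
        open(os.path.join(ckpt_dir, name), "w").close()
    found = find_checkpoint_prefixes(tmp)
    print([os.path.relpath(path, tmp) for path in found])
```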


# Editing configuration file

The `pipeline.config` file is very important, since it contains all the training parameters and the paths to the record files.

The following cells edit it.

You can increase or decrease the batch size. For small images like MNIST, 32 is a reasonable value. For large images, such as 1000x1000, use 2 or 4 so that a batch still fits in GPU memory.

In [None]:
BATCH_SIZE = 8
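
The batch size also determines how many training steps make up one pass over the data, which is useful when reading the step counts in the training log later. A minimal sketch, using the 1134 training images from the split above (`steps_per_epoch` is a hypothetical helper):

```python
import math

def steps_per_epoch(num_train_images, batch_size):
    """Number of batches needed to see every training image once."""
    return math.ceil(num_train_images / batch_size)

print(steps_per_epoch(1134, 8))
```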

In [None]:
import re

from google.protobuf import text_format

from object_detection.utils import config_util
from object_detection.utils import label_map_util

configs = config_util.get_configs_from_pipeline_file(PIPELINE_TEMPLATE_PATH)

label_map = label_map_util.get_label_map_dict(LABEL_MAP_PATH)
num_classes = len(label_map.keys())
meta_arch = configs["model"].WhichOneof("model")

override_dict = {
  'model.{}.num_classes'.format(meta_arch): num_classes,
  'train_config.batch_size': BATCH_SIZE,
  'train_input_path': TRAIN_TFRECORD_PATH,
  'eval_input_path': VAL_TFRECORD_PATH,
  'train_config.fine_tune_checkpoint': CKPT_PATH,
  'train_config.fine_tune_checkpoint_type': "detection",
  'label_map_path': LABEL_MAP_PATH
}

configs = config_util.merge_external_params_with_configs(configs, kwargs_dict=override_dict)
pipeline_config = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_config, DATA_DIR)

print("Successfully created configuration file.")

INFO:tensorflow:Maybe overwriting model.ssd.num_classes: 1
INFO:tensorflow:Maybe overwriting train_config.batch_size: 8
INFO:tensorflow:Maybe overwriting train_input_path: /content/data/train.record
INFO:tensorflow:Maybe overwriting eval_input_path: /content/data/val.record
INFO:tensorflow:Maybe overwriting train_config.fine_tune_checkpoint: /content/pretrained/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0
INFO:tensorflow:Maybe overwriting train_config.fine_tune_checkpoint_type: detection
INFO:tensorflow:Maybe overwriting label_map_path: /content/data/label_map.pbtxt
INFO:tensorflow:Writing pipeline config file to /content/data/pipeline.config
Successfully created configuration file.


You can view the final file from the left panel at `/content/data/pipeline.config`. 
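
If you prefer to check the saved file in code rather than in the file browser, a simple text search is enough for scalar fields. The sketch below is not part of the Object Detection API: `config_values` is a hypothetical helper, and `SAMPLE_CONFIG` is a hand-written fragment that only imitates the layout of `pipeline.config`.

```python
import re

# Hand-written fragment for illustration; a real pipeline.config is much longer.
SAMPLE_CONFIG = '''model {
  ssd {
    num_classes: 1
  }
}
train_config {
  batch_size: 8
  fine_tune_checkpoint_type: "detection"
}
train_input_reader {
  label_map_path: "/content/data/label_map.pbtxt"
}
'''

def config_values(config_text, field):
    """Return every value assigned to `field`, with surrounding quotes removed."""
    pattern = rf'^\s*{re.escape(field)}:\s*"?([^"\n]*)"?\s*$'
    matches = re.findall(pattern, config_text, flags=re.MULTILINE)
    return [value.strip() for value in matches]

print(config_values(SAMPLE_CONFIG, "batch_size"))
print(config_values(SAMPLE_CONFIG, "label_map_path"))
```

In the notebook you would read `/content/data/pipeline.config` into a string and pass that instead of `SAMPLE_CONFIG`.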

To learn more about the configuration options, see [the official guide](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/configuring_jobs.md) and [this tutorial section](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#configure-the-training-pipeline).


# Training

Finally, your model is ready to train.

The checkpoints (`ckpt` weight files) at different stages of the training will be stored in `OUTPUT_PATH`.

Colab often disconnects, and checkpoints stored in the session are lost when it does. So, it is best to store the checkpoints in Drive.

In [None]:
OUTPUT_PATH = os.path.join(G_DRIVE_PATH, "training_oct_10")

# uncomment this if you want to store the checkpoints in colab
# OUTPUT_PATH = "/content/training"

If you wish to monitor the training via colab, run the following cell.

In [None]:
%load_ext tensorboard
%tensorboard --logdir "$OUTPUT_PATH"

Now begins the long process of training.

* You can tweak `--checkpoint_every_n=100` to some other value. Use 200 or 300 if you want to store fewer checkpoints, and 10 or 50 if you want checkpoints more often.

* Keep in mind, colab will allow you a maximum of 12 hours of training.

* Also, paste this [javascript code](https://www.rockyourcode.com/script-to-stop-google-colab-from-disconnecting/) into the browser console to prevent colab from disconnecting.

* Finally, after you run the script, **it takes time to show the training process. BE PATIENT!**

* There will be a lot of warnings; this is expected. Every 100 steps (which might take 1 or 2 hours), it will print the loss like this:
```
I1021 14:56:54.731037 140444534626176 model_lib_v2.py:652] Step 100 per-step time 64.361s loss=6.536
```

* To see detailed output of what is expected, see [this](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#training-the-model).
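
To put numbers on the points above, you can pull the step, per-step time, and loss out of a log line and estimate how many steps fit into a 12-hour session. This is a sketch that assumes the log format of the sample line shown above; other API versions may print it differently. `parse_step_line` and `steps_in_budget` are hypothetical helpers.

```python
import re

STEP_PATTERN = re.compile(r"Step (\d+) per-step time ([\d.]+)s loss=([\d.]+)")

def parse_step_line(line):
    """Return (step, seconds_per_step, loss), or None if the line doesn't match."""
    match = STEP_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2)), float(match.group(3))

def steps_in_budget(seconds_per_step, budget_hours=12):
    """Whole number of steps that fit into the time budget."""
    return int(budget_hours * 3600 / seconds_per_step)

line = ("I1021 14:56:54.731037 140444534626176 model_lib_v2.py:652] "
        "Step 100 per-step time 64.361s loss=6.536")
step, seconds, loss = parse_step_line(line)
print(step, seconds, loss)
print(steps_in_budget(seconds))
```

At the per-step time in the sample line, a 12-hour session covers only a few hundred steps, which is one more reason to keep checkpoints in Drive and resume later.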

Run the script!

In [None]:
PIPELINE_CONFIG_PATH = "/content/data/pipeline.config"
TRAIN_FILE = "/content/models/research/object_detection/model_main_tf2.py"

!mkdir "$OUTPUT_PATH"

!python "$TRAIN_FILE" \
    --alsologtostderr \
    --pipeline_config_path="$PIPELINE_CONFIG_PATH" \
    --model_dir="$OUTPUT_PATH" \
    --checkpoint_every_n=100

# Export model

Once the loss decreases to about 0.1, you can stop training and export the trained model.
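
Individual loss readings are noisy, so a single value near 0.1 can be misleading. One possible rule, sketched below, is to wait until the average of the last few logged losses is at or below the threshold. `should_stop` is a hypothetical helper, and the window of 5 readings is an arbitrary assumption.

```python
def should_stop(losses, threshold=0.1, window=5):
    """True once the mean of the last `window` losses is at or below `threshold`."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return sum(recent) / window <= threshold

# Hypothetical loss readings, one per logged step interval.
print(should_stop([6.536, 0.09]))
print(should_stop([0.5, 0.12, 0.1, 0.09, 0.08, 0.07]))
```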

In [None]:
EXPORT_DIR = os.path.join(G_DRIVE_PATH, "export")

!mkdir "$EXPORT_DIR"

!python "$OBJ_DET_DIR/exporter_main_v2.py" \
  --input_type='image_tensor' \
  --pipeline_config_path="$PIPELINE_CONFIG_PATH"  \
  --trained_checkpoint_dir="$OUTPUT_PATH" \
  --output_directory="$EXPORT_DIR"