Training Custom Object Detector

So, up to now you should have done the following:

  • Installed TensorFlow (See tf_install)
  • Installed TensorFlow Object Detection API (See tf_models_install)

Now that we have done all the above, we can start doing some cool stuff. Here we will see how you can train your own object detector, and since it is not as simple as it sounds, we will have a look at:

  1. How to organise your workspace/training files
  2. How to prepare/annotate image datasets
  3. How to generate tf records from such datasets
  4. How to configure a simple training pipeline
  5. How to train a model and monitor its progress
  6. How to export the resulting model and use it to detect objects.

Preparing the Workspace

  1. If you have followed the tutorial, you should by now have a folder TensorFlow, placed under <PATH_TO_TF> (e.g. C:/Users/sglvladi/Documents), with the following directory tree:
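A sketch of that tree, assuming the models repo was cloned as described in the installation steps (sub-folder names may differ slightly between repo versions):

    TensorFlow/
    └─ models/
       ├─ community/
       ├─ official/
       ├─ orbit/
       ├─ research/
       └─ ...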

  2. Now create a new folder under TensorFlow and call it workspace. It is within the workspace that we will store all our training set-ups. Now let's go under workspace and create another folder named training_demo. Our directory structure should now look as follows:
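A sketch of the expected tree at this point:

    TensorFlow/
    ├─ models/
    │  ├─ community/
    │  ├─ official/
    │  ├─ orbit/
    │  ├─ research/
    │  └─ ...
    └─ workspace/
       └─ training_demo/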

  3. The training_demo folder shall be our training folder, which will contain all files related to our model training. It is advisable to create a separate training folder each time we wish to train on a different dataset. The typical structure for training folders is shown below.
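A sketch of that structure, matching the folder descriptions that follow:

    training_demo/
    ├─ annotations/
    ├─ exported-models/
    ├─ images/
    │  ├─ test/
    │  └─ train/
    ├─ models/
    ├─ pre-trained-models/
    └─ README.md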

Here's an explanation for each of the folders/files shown in the above tree:

  • annotations: This folder will be used to store all *.csv files and the respective TensorFlow *.record files, which contain the list of annotations for our dataset images.
  • exported-models: This folder will be used to store exported versions of our trained model(s).
  • images: This folder contains a copy of all the images in our dataset, as well as the respective *.xml files produced for each one, once labelImg is used to annotate objects.

    • images/train: This folder contains a copy of all images, and the respective *.xml files, which will be used to train our model.
    • images/test: This folder contains a copy of all images, and the respective *.xml files, which will be used to test our model.
  • models: This folder will contain a sub-folder for each training job. Each sub-folder will contain the training pipeline configuration file *.config, as well as all files generated during the training and evaluation of our model.
  • pre-trained-models: This folder will contain the downloaded pre-trained models, which shall be used as a starting checkpoint for our training jobs.
  • README.md: This is an optional file which provides some general information regarding the training conditions of our model. It is not used by TensorFlow in any way, but it generally helps when you have a few training folders and/or you are revisiting a trained model after some time.

If you do not understand most of the things mentioned above, no need to worry, as we'll see how all the files are generated further down.

Preparing the Dataset

Annotate the Dataset

Install LabelImg

There exist several ways to install labelImg. Below are 3 of the most common.

Using PIP (Recommended)
  1. Open a new Terminal window and activate the tensorflow_gpu environment (if you have not done so already)
  2. Run the pip install command shown in the block below to install labelImg.
  3. labelImg can then be run as shown in the same block.
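A sketch of both commands, assuming the package is the labelImg distribution on PyPI and that your environment is active ([IMAGE_PATH] and [PRE-DEFINED CLASS FILE] are optional placeholders):

    # Install:
    pip install labelImg

    # Run:
    labelImg
    # or
    labelImg [IMAGE_PATH] [PRE-DEFINED CLASS FILE]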
Use precompiled binaries (Easy)

Precompiled binaries for both Windows and Linux can be found here.

Installation is then done in three simple steps:

  1. Inside your TensorFlow folder, create a new directory, name it addons and then cd into it.
  2. Download the latest binary for your OS from here and extract its contents under TensorFlow/addons/labelImg.
  3. You should now have a single folder named addons/labelImg under your TensorFlow folder, which contains another 4 folders.
  4. labelImg can then be run as shown below.
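As a sketch, assuming the Windows binary was extracted as described above (on Linux, run the extracted executable instead):

    cd TensorFlow/addons/labelImg
    labelImg.exe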
Build from source (Hard)

The steps for installing from source follow below.

1. Download labelImg

  • Inside your TensorFlow folder, create a new directory, name it addons and then cd into it.
  • To download the package you can either use Git to clone the labelImg repo inside the TensorFlow\addons folder, or you can simply download it as a ZIP and extract its contents inside the TensorFlow\addons folder. To keep things consistent, in the latter case you will have to rename the extracted folder labelImg-master to labelImg.[1]
  • You should now have a single folder named addons\labelImg under your TensorFlow folder, which contains another 4 folders.

2. Install dependencies and compile the package
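A sketch of the build commands, assuming an Anaconda environment with Python 3 and Qt5, run from within TensorFlow/addons/labelImg:

    conda install pyqt=5
    pyrcc5 -o libs/resources.py resources.qrc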

3. Test your installation
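For example, from within TensorFlow/addons/labelImg ([IMAGE_PATH] and [PRE-DEFINED CLASS FILE] are optional placeholders):

    python labelImg.py
    # or
    python labelImg.py [IMAGE_PATH] [PRE-DEFINED CLASS FILE]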

Annotate Images

Once labelImg is open, you should see a window similar to the one below:

[Screenshot: the labelImg main window]

I won't be providing a full tutorial on how to use labelImg, but you can have a look at labelImg's repo for more details. A nice YouTube video demonstrating how to use labelImg is also available here. What is important is that once you annotate all your images, a set of new *.xml files, one for each image, should be generated inside your training_demo/images folder.

Partition the Dataset

Once you have finished annotating your image dataset, it is a general convention to use only part of it for training, with the rest reserved for evaluation purposes (e.g. as discussed in evaluation_sec).

Typically, the ratio is 9:1, i.e. 90% of the images are used for training and the remaining 10% is held out for testing, but you can choose whatever ratio suits your needs.

Once you have decided how you will be splitting your dataset, copy all training images, together with their corresponding *.xml files, and place them inside the training_demo/images/train folder. Similarly, copy all testing images, with their *.xml files, and paste them inside training_demo/images/test.

For lazy people like myself, who cannot be bothered to do the above, I have put together a simple script that automates the above process:

scripts/partition_dataset.py
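Assuming the script keeps the interface from the tutorial's repo, where -i points at the folder containing the images, -r sets the fraction held out for testing and -x tells the script to also copy the corresponding *.xml files, usage would look like this:

    python partition_dataset.py -x -i [PATH_TO_IMAGES_FOLDER] -r 0.1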

Once the script has finished, two new folders should have been created under training_demo/images, namely training_demo/images/train and training_demo/images/test, containing 90% and 10% of the images (and *.xml files), respectively. To avoid loss of any files, the script will not delete the images under training_demo/images. Once you have checked that your images have been safely copied over, you can delete the images under training_demo/images manually.

Create Label Map

TensorFlow requires a label map, which maps each of the labels used to an integer value. This label map is used by both the training and detection processes.

Below we show an example label map (e.g. label_map.pbtxt), assuming that our dataset contains 2 labels, dogs and cats:
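A minimal sketch in the protobuf text format that label maps use (ids must start from 1, as 0 is reserved):

    item {
        id: 1
        name: 'cat'
    }

    item {
        id: 2
        name: 'dog'
    }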

Label map files have the extension .pbtxt and should be placed inside the training_demo/annotations folder.

Create TensorFlow Records

Now that we have generated our annotations and split our dataset into the desired training and testing subsets, it is time to convert our annotations into the so-called TFRecord format.

Convert *.xml to *.record

To do this we can write a simple script that iterates through all *.xml files in the training_demo/images/train and training_demo/images/test folders, and generates a *.record file for each of the two. Here is an example script that allows us to do just that:

scripts/generate_tfrecord.py
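Assuming the script exposes the flags from the tutorial's repo, where -x points at the folder containing the images and *.xml files, -l at the label map and -o at the output *.record file, it would be run once per split:

    # Create the train data:
    python generate_tfrecord.py -x [PATH_TO_IMAGES_FOLDER]/train -l [PATH_TO_ANNOTATIONS_FOLDER]/label_map.pbtxt -o [PATH_TO_ANNOTATIONS_FOLDER]/train.record

    # Create the test data:
    python generate_tfrecord.py -x [PATH_TO_IMAGES_FOLDER]/test -l [PATH_TO_ANNOTATIONS_FOLDER]/label_map.pbtxt -o [PATH_TO_ANNOTATIONS_FOLDER]/test.record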

Once the above is done, there should be 2 new files under the training_demo/annotations folder, named test.record and train.record, respectively.

Configuring a Training Job

For the purposes of this tutorial we will not be creating a training job from scratch, but rather we will reuse one of the pre-trained models provided by TensorFlow. If you would like to train an entirely new model, you can have a look at TensorFlow's tutorial.

The model we shall be using in our examples is the SSD ResNet50 V1 FPN 640x640 model, since it provides a relatively good trade-off between performance and speed. However, there exist a number of other models you can use, all of which are listed in TensorFlow 2 Detection Model Zoo.

Download Pre-Trained Model

To begin with, we need to download the latest pre-trained network for the model we wish to use. This can be done by simply clicking on the name of the desired model in the table found in TensorFlow 2 Detection Model Zoo. Clicking on the name of your model should initiate a download for a *.tar.gz file.

Once the *.tar.gz file has been downloaded, open it using a decompression program of your choice (e.g. 7zip, WinZIP, etc.). Next, open the *.tar folder that you see when the compressed folder is opened, and extract its contents inside the folder training_demo/pre-trained-models. Since we downloaded the SSD ResNet50 V1 FPN 640x640 model, our training_demo directory should now look as follows:
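A sketch of the resulting layout (the sub-folder names come straight from the downloaded archive):

    training_demo/
    ├─ ...
    ├─ pre-trained-models/
    │  └─ ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/
    │     ├─ checkpoint/
    │     ├─ saved_model/
    │     └─ pipeline.config
    └─ ...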

Note that the above process can be repeated for all other pre-trained models you wish to experiment with. For example, if you wanted to also configure a training job for the EfficientDet D1 640x640 model, you can download it and extract its contents under training_demo/pre-trained-models in exactly the same way.

Configure the Training Pipeline

Now that we have downloaded and extracted our pre-trained model, let's create a directory for our training job. Under the training_demo/models create a new directory named my_ssd_resnet50_v1_fpn and copy the training_demo/pre-trained-models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/pipeline.config file inside the newly created directory. Our training_demo/models directory should now look like this:
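That is, a single sub-folder holding, for now, only the copied config file:

    training_demo/
    └─ models/
       └─ my_ssd_resnet50_v1_fpn/
          └─ pipeline.config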

Now, let's have a look at the changes that we shall need to apply to the pipeline.config file:

scripts/pipeline.config
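Since the full file is long, here is a condensed sketch of the typical edits; the field names are standard pipeline.config fields, but the values shown (e.g. num_classes, batch_size) are assumptions you should adapt to your own dataset and hardware, and the paths assume the directory layout used throughout this tutorial:

    num_classes: 1                 # set to the number of classes in your label map
    batch_size: 8                  # increase/decrease to fit your available memory
    fine_tune_checkpoint: "pre-trained-models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0"
    fine_tune_checkpoint_type: "detection"  # we want to fine-tune a detection model
    use_bfloat16: false            # set to true only when training on a TPU
    # In train_input_reader and eval_input_reader, respectively:
    label_map_path: "annotations/label_map.pbtxt"
    input_path: "annotations/train.record"  # "annotations/test.record" for evaluation
    # Optional (lines 178-179), only if the COCO evaluation tools are installed:
    metrics_set: "coco_detection_metrics"
    use_moving_averages: false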

It is worth noting here that the changes to lines 178 to 179 above are optional. These should only be used if you installed the COCO evaluation tools, as outlined in the tf_models_install_coco section, and you intend to run evaluation (see evaluation_sec).

Once the above changes have been applied to our config file, go ahead and save it.

Training the Model

Before we begin training our model, let's go and copy the TensorFlow/models/research/object_detection/model_main_tf2.py script and paste it straight into our training_demo folder. We will need this script in order to train our model.

Now, to initiate a new training job, open a new Terminal, cd inside the training_demo folder and run the following command:
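Assuming the directory names used throughout this tutorial, the command should look along these lines:

    python model_main_tf2.py --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config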

Once the training process has been initiated, you should see a series of print outs similar to the one below (plus/minus some warnings):

...
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.gamma
W0716 05:24:19.105542  1364 util.py:143] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.gamma
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.beta
W0716 05:24:19.106541  1364 util.py:143] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.beta
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_mean
W0716 05:24:19.107540  1364 util.py:143] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_mean
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_variance
W0716 05:24:19.108539  1364 util.py:143] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_variance
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0716 05:24:19.108539  1364 util.py:151] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
INFO:tensorflow:Step 100 per-step time 1.153s loss=0.761
I0716 05:26:55.879558  1364 model_lib_v2.py:632] Step 100 per-step time 1.153s loss=0.761
...

Important

The output will normally look as if it has "frozen", but DO NOT rush to cancel the process. The training outputs logs only every 100 steps by default, therefore if you wait for a while, you should see a log for the loss at step 100.

The time you should wait can vary greatly, depending on whether you are using a GPU and the chosen value for batch_size in the config file, so be patient.

If you ARE observing a similar output to the above, then CONGRATULATIONS, you have successfully started your first training job. Now you may very well treat yourself to a cold beer, as waiting on the training to finish is likely to take a while. Following what people have said online, it seems that it is advisable to allow your model to reach a TotalLoss of at least 2 (ideally 1 or lower) if you want to achieve "fair" detection results. Obviously, lower TotalLoss is better, however very low TotalLoss should be avoided, as the model may end up overfitting the dataset, meaning that it will perform poorly when applied to images outside the dataset. To monitor TotalLoss, as well as a number of other metrics, while your model is training, have a look at tensorboard_sec.

If you ARE NOT seeing a print-out similar to that shown above, and/or the training job crashes after a few seconds, then have a look at the issues and proposed solutions, under the issues section, to see if you can find a solution. Alternatively, you can try the issues section of the official TensorFlow Models repo.

Note

Training times can be affected by a number of factors such as:

  • The computational power of your hardware (either CPU or GPU): Obviously, the more powerful your PC is, the faster the training process.
  • Whether you are using the TensorFlow CPU or GPU variant: In general, even when compared to the best CPUs, almost any GPU graphics card will yield much faster training and detection speeds. As a matter of fact, when I first started I was running TensorFlow on my Intel i7-5930k (6/12 cores @ 4GHz, 32GB RAM) and was getting step times of around 12 sec/step. After I installed TensorFlow GPU and trained the very same model -using the same dataset and config files- on an EVGA GTX-770 (1536 CUDA-cores @ 1GHz, 2GB VRAM), I was down to 0.9 sec/step!!! A 12-fold increase in speed, using a "low/mid-end" graphics card, when compared to a "mid/high-end" CPU.
  • The complexity of the objects you are trying to detect: Obviously, if your objective is to track a black ball over a white background, the model will converge to satisfactory levels of detection pretty quickly. If on the other hand, for example, you wish to detect ships in ports, using Pan-Tilt-Zoom cameras, then training will be a much more challenging and time-consuming process, due to the high variability of the shape and size of ships, combined with a highly dynamic background.
  • And many, many, many, more....

Evaluating the Model (Optional)

By default, the training process logs some basic measures of training performance. These seem to change depending on the installed version of TensorFlow.

As you will have seen in various parts of this tutorial, we have mentioned a few times the optional utilisation of the COCO evaluation metrics. Also, under section image_partitioning_sec we partitioned our dataset into two parts, where one was to be used for training and the other for evaluation. In this section we will look at how we can use these metrics, along with the test images, to get a sense of the performance achieved by our model as it is being trained.

Firstly, let's start with a brief explanation of what the evaluation process does. While the training process runs, it will occasionally create checkpoint files inside the training_demo/models/my_ssd_resnet50_v1_fpn folder, which correspond to snapshots of the model at given steps. When a set of such new checkpoint files is generated, the evaluation process uses these files and evaluates how well the model performs in detecting objects in the test dataset. The results of this evaluation are summarised in the form of some metrics, which can be examined over time.

The steps to run the evaluation are outlined below:

  1. Firstly we need to download and install the metrics we want to use.
    • For a description of the supported object detection evaluation metrics, see here.
    • The process of installing the COCO evaluation metrics is described in tf_models_install_coco.
  2. Secondly, we must modify the configuration pipeline (*.config script).
    • See lines 178-179 of the script in training_pipeline_conf.
  3. The third step is to actually run the evaluation. To do so, open a new Terminal, cd inside the training_demo folder and run the following command:
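    Assuming the same directory names as before, the command adds a --checkpoint_dir flag, which is what switches model_main_tf2.py into evaluation mode:

    python model_main_tf2.py --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config --checkpoint_dir=models/my_ssd_resnet50_v1_fpn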

    Once the above is run, you should see a print-out similar to the one below (plus/minus some warnings):

    ...
    WARNING:tensorflow:From C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\inputs.py:79: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
    W0716 05:44:10.059399 17144 deprecation.py:317] From C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\inputs.py:79: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
    WARNING:tensorflow:From C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\inputs.py:259: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use `tf.cast` instead.
    W0716 05:44:12.383937 17144 deprecation.py:317] From C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\inputs.py:259: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use `tf.cast` instead.
    INFO:tensorflow:Waiting for new checkpoint at models/my_ssd_resnet50_v1_fpn
    I0716 05:44:22.779590 17144 checkpoint_utils.py:125] Waiting for new checkpoint at models/my_ssd_resnet50_v1_fpn
    INFO:tensorflow:Found new checkpoint at models/my_ssd_resnet50_v1_fpn\ckpt-2
    I0716 05:44:22.882485 17144 checkpoint_utils.py:134] Found new checkpoint at models/my_ssd_resnet50_v1_fpn\ckpt-2

While the evaluation process is running, it will periodically check (every 300 sec by default) and use the latest models/my_ssd_resnet50_v1_fpn/ckpt-* checkpoint files to evaluate the performance of the model. The results are stored in the form of tf event files (events.out.tfevents.*) inside models/my_ssd_resnet50_v1_fpn/eval_0. These files can then be used to monitor the computed metrics, using the process described in the next section.

Monitor Training Job Progress using TensorBoard

A very nice feature of TensorFlow is that it allows you to continuously monitor and visualise a number of different training/evaluation metrics, while your model is being trained. The specific tool that allows us to do all that is TensorBoard.

To start a new TensorBoard server, open a new Terminal, activate your TensorFlow environment, cd into the training_demo folder and run the command shown below.
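A sketch, assuming TensorBoard was installed alongside TensorFlow and that the training job lives under the folder names used above:

    tensorboard --logdir=models/my_ssd_resnet50_v1_fpn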

The above command will start a new TensorBoard server, which (by default) listens to port 6006 of your machine. Assuming that everything went well, you should see a print-out (plus/minus some warnings) reporting the address at which the dashboard is served, by default http://localhost:6006/.

Once this is done, go to your browser and type http://localhost:6006/ in your address bar, following which you should be presented with a dashboard similar to the one shown below (maybe less populated if your model has just started training):

[Screenshot: the TensorBoard dashboard]

Exporting a Trained Model

Once your training job is complete, you need to extract the newly trained inference graph, which will be later used to perform the object detection. This can be done as follows:

  • Copy the TensorFlow/models/research/object_detection/exporter_main_v2.py script and paste it straight into your training_demo folder.
  • Now, open a Terminal, cd inside your training_demo folder, and run the following command:
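Assuming the folder names used throughout this tutorial, the command should look along these lines (the flags shown are those of exporter_main_v2.py):

    python exporter_main_v2.py --input_type image_tensor --pipeline_config_path models/my_ssd_resnet50_v1_fpn/pipeline.config --trained_checkpoint_dir models/my_ssd_resnet50_v1_fpn --output_directory exported-models/my_model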

After the above process has completed, you should find a new folder my_model under training_demo/exported-models, which has the following structure:
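A sketch of that structure (the exporter writes out a checkpoint, a SavedModel and a copy of the pipeline config):

    training_demo/
    └─ exported-models/
       └─ my_model/
          ├─ checkpoint/
          ├─ saved_model/
          └─ pipeline.config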

This model can then be used to perform inference.

Note

You may get an error when trying to export your model.

If this happens, have a look at the export_error issue section for a potential solution.


[1] The latest repo commit when writing this tutorial is 8d1bd68.