# Train Model
The folder structure was inspired by this [blog post](https://neptune.ai/blog/how-to-train-your-own-object-detector-using-tensorflow-object-detection-api).

### Load Packages
When running the script, sometimes the following error comes up: \
`undefined symbol: _ZN10tensorflow8OpKernel11TraceStringEPNS_15OpKernelContextEb`

To remedy this issue, we have to reinstall both TensorFlow and Keras.
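
Before reinstalling, it can help to confirm which versions are actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# This notebook pins both packages to 2.6.0
print(installed_versions(["tensorflow", "keras"]))
```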

In [5]:
!pip uninstall -y tensorflow keras

Found existing installation: tensorflow 2.6.0
Uninstalling tensorflow-2.6.0:
  Successfully uninstalled tensorflow-2.6.0
Found existing installation: keras 2.6.0
Uninstalling keras-2.6.0:
  Successfully uninstalled keras-2.6.0


In [6]:
!pip install tensorflow==2.6.0 keras==2.6.0

Collecting tensorflow==2.6.0
  Using cached tensorflow-2.6.0-cp39-cp39-manylinux2010_x86_64.whl (458.4 MB)
Collecting keras==2.6.0
  Using cached keras-2.6.0-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: keras, tensorflow
Successfully installed keras-2.6.0 tensorflow-2.6.0

In [7]:
import os
import time

### Setup
- Go to the [TensorFlow 2 Detection Model Zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md) and choose the model we want to fine-tune
- Copy the archive name from the end of the model's download URL, e.g. *efficientdet_d1_coco17_tpu-32.tar.gz*, and use it without the *.tar.gz* extension as **MODEL_NAME**

In [96]:
SETUP_TENSORFLOW = True

WORKSPACE_PATH = "workspace"
DATA_PATH = os.path.join(WORKSPACE_PATH, "data")
PRE_TRAINED_MODELS_PATH = os.path.join(WORKSPACE_PATH, "pre_trained_models")

MODELS_PATH = os.path.join(WORKSPACE_PATH, "models")
BASE_MODEL_URL = "http://download.tensorflow.org/models/object_detection/tf2/20200711"
MODEL_NAME = "efficientdet_d1_coco17_tpu-32"

PIPELINE_CONFIG_PATH = os.path.join(PRE_TRAINED_MODELS_PATH, MODEL_NAME, "pipeline.config")

NUM_CLASSES = 10

EPOCHS = 50
BATCH_SIZE = 4

OPTIMIZER = "adam"  # "momentum" or "adam"

# The learning rate depends on the optimizer: 1e-2 for momentum, 1e-3 for adam
LEARNING_RATE = 1e-3 if OPTIMIZER == "adam" else 1e-2

USE_AUGMENTATION = "None"  # One of "None", "default", or "aligned"

COMBINATION_NAME = f"epochs_{EPOCHS}-batch_size_{BATCH_SIZE}-optimizer_{OPTIMIZER}-learning_rate_{LEARNING_RATE}-aug_{USE_AUGMENTATION}"

In [97]:
# Create the workspace folder structure; makedirs also creates missing parent folders
for path in (
    DATA_PATH,
    PRE_TRAINED_MODELS_PATH,
    os.path.join(MODELS_PATH, MODEL_NAME, COMBINATION_NAME),
):
    os.makedirs(path, exist_ok=True)

### Download TensorFlow Object Detection API (only needed the first time)
1. Download all files required for the TensorFlow Object Detection API
2. Download *builder.py*, which the API requires, and move it into the installed protobuf package (fix for Schlaubox)
3. Compile the protos
4. Test whether the TFOD API was installed successfully
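
As a quick check before running the full test suite from step 4, one could first confirm that the package can be found at all. This is a minimal sketch; `module_available` is a hypothetical helper, and it only checks that the package is on the import path, not that it works.

```python
import importlib.util


def module_available(name):
    """True if a top-level module with this name can be found on the import path."""
    return importlib.util.find_spec(name) is not None


# `object_detection` is the package installed from models/research
print("object_detection importable:", module_available("object_detection"))
```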

In [10]:
if SETUP_TENSORFLOW:
    !rm -rf models
    !git clone https://github.com/markusbink/models

Cloning into 'models'...
remote: Enumerating objects: 74520, done.
remote: Total 74520 (delta 0), reused 0 (delta 0), pack-reused 74520
Receiving objects: 100% (74520/74520), 595.17 MiB | 10.33 MiB/s, done.
Resolving deltas: 100% (53034/53034), done.
Updating files: 100% (3352/3352), done.


In [22]:
%cd models/research/
!pip install -q --user .
%cd ../..

/home/jovyan/top-down-object-detection/models/research
/home/jovyan/top-down-object-detection


In [20]:
if SETUP_TENSORFLOW:
    !wget https://raw.githubusercontent.com/protocolbuffers/protobuf/main/python/google/protobuf/internal/builder.py
    !mv builder.py /home/jovyan/.local/lib/python3.9/site-packages/google/protobuf/internal/

--2023-02-25 09:55:35--  https://raw.githubusercontent.com/protocolbuffers/protobuf/main/python/google/protobuf/internal/builder.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5188 (5.1K) [text/plain]
Saving to: ‘builder.py’


2023-02-25 09:55:35 (71.1 MB/s) - ‘builder.py’ saved [5188/5188]



In [23]:
# Compile protos.
%cd models/research/
!protoc object_detection/protos/*.proto --python_out=.
%cd ../..

# Remove this file since it prevents the API from working in the Schlaubox
#!rm models/research/opt/conda/lib/python3.9/site-packages/tensorflow/core/kernels/libtfkernel_sobol_op.so

if SETUP_TENSORFLOW:
    # Test if the Object Detection API is working correctly
    !python3 models/research/object_detection/builders/model_builder_tf2_test.py

/home/jovyan/top-down-object-detection/models/research
/home/jovyan/top-down-object-detection
caused by: ['/home/jovyan/.local/lib/python3.9/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl5mutexC1Ev']
caused by: ['/home/jovyan/.local/lib/python3.9/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZNK10tensorflow4data11DatasetBase3GetEPNS_15OpKernelContextElPSt6vectorINS_6TensorESaIS5_EE']
Running tests under Python 3.9.13: /opt/conda/bin/python3
[ RUN      ] ModelBuilderTF2Test.test_create_center_net_deepmac
2023-02-25 09:56:02.291611: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-25 09:56:02.764537: W tensorflow/core/common_runtime/gpu/gpu_

### Create TFRecords
1. Upload a zip of the folders containing the train, validation and test files (images and annotations)
2. Extract the zip
3. Create the TFRecords using ***generate_tfrecord.py***

----

*Note: The TFRecords have to be created here, since creating them locally and uploading them afterwards seems to corrupt the files.*
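
Given the corruption issue mentioned in the note, a cheap sanity check is to walk the record framing of a TFRecord file and count its entries. The sketch below uses only the standard library and relies on the TFRecord on-disk layout (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC). It detects truncated files but does not verify the CRCs. `count_tfrecord_entries` is a hypothetical helper, not part of this notebook or of TensorFlow.

```python
import struct


def count_tfrecord_entries(path):
    """Count records by walking the TFRecord framing; raise ValueError on truncation."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # 8-byte record length + 4-byte CRC of the length
            if not header:
                return count  # clean end of file
            if len(header) < 12:
                raise ValueError(f"truncated record header after {count} records")
            (length,) = struct.unpack("<Q", header[:8])
            payload = f.read(length + 4)  # record data + 4-byte CRC of the data
            if len(payload) < length + 4:
                raise ValueError(f"truncated record payload after {count} records")
            count += 1


# Example call, using the path layout of this notebook:
# count_tfrecord_entries("workspace/data/train.tfrecord")
```

If each image produces exactly one record, the count for the training file should match the number of annotated training images.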

In [14]:
#!rm -rf {DATA_PATH}/train {DATA_PATH}/test {DATA_PATH}/val

In [15]:
%%capture
#!unzip {DATA_PATH}/splits.zip -d {DATA_PATH}

In [16]:
#!python {WORKSPACE_PATH}'/scripts/generate_tfrecord.py' -x {DATA_PATH}'/train' -l {DATA_PATH}'/label_map.pbtxt' -o {DATA_PATH}'/train.tfrecord'
#!python {WORKSPACE_PATH}'/scripts/generate_tfrecord.py' -x {DATA_PATH}'/val' -l {DATA_PATH}'/label_map.pbtxt' -o {DATA_PATH}'/val.tfrecord'
#!python {WORKSPACE_PATH}'/scripts/generate_tfrecord.py' -x {DATA_PATH}'/test' -l {DATA_PATH}'/label_map.pbtxt' -o {DATA_PATH}'/test.tfrecord'

In [17]:
!ls {DATA_PATH} -l --block-size=M

total 2781M
-rw-r--r-- 1 jovyan users    1M Feb 12 23:01 label_map.pbtxt
-rw-r--r-- 1 jovyan users 1233M Feb 15 20:39 splits.zip
drwxr-xr-x 2 jovyan users    1M Feb 15 18:10 test
-rw-r--r-- 1 jovyan users    1M Feb 16 19:23 test.csv
-rw-r--r-- 1 jovyan users  161M Feb 16 19:23 test.tfrecord
drwxr-xr-x 2 jovyan users    1M Feb 15 18:10 train
-rw-r--r-- 1 jovyan users    1M Feb 16 19:22 train.csv
-rw-r--r-- 1 jovyan users 1081M Feb 16 19:22 train.tfrecord
drwxr-xr-x 2 jovyan users    1M Feb 15 18:10 val
-rw-r--r-- 1 jovyan users    1M Feb 16 19:23 val.csv
-rw-r--r-- 1 jovyan users  309M Feb 16 19:23 val.tfrecord


### Download pre-trained model
1. Download the model specified in **MODEL_NAME**
2. Extract the downloaded archive
3. Remove the archive, which is no longer needed
4. Copy pipeline.config to our models/MODEL_NAME/COMBINATION_NAME folder so we can configure it to our own liking
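
Steps 1 to 3 can also be done without shell commands. The following is a sketch using only the standard library, assuming the archive comes from a trusted source; `model_archive_url` and `download_and_extract` are hypothetical helper names.

```python
import os
import tarfile
import urllib.request


def model_archive_url(base_url, model_name):
    """Build the archive URL in the form <base_url>/<model_name>.tar.gz."""
    return f"{base_url}/{model_name}.tar.gz"


def download_and_extract(url, archive_path, dest_dir):
    """Download a .tar.gz archive, unpack it into dest_dir, then delete the archive."""
    urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)  # only do this for archives you trust
    os.remove(archive_path)


# Example call with the variables defined in the Setup section:
# download_and_extract(
#     model_archive_url(BASE_MODEL_URL, MODEL_NAME),
#     os.path.join(PRE_TRAINED_MODELS_PATH, MODEL_NAME + ".tar.gz"),
#     PRE_TRAINED_MODELS_PATH,
# )
```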

In [18]:
#!wget {BASE_MODEL_URL}/{MODEL_NAME}.tar.gz -P {PRE_TRAINED_MODELS_PATH} 
#!tar -xvzf {PRE_TRAINED_MODELS_PATH}/{MODEL_NAME}.tar.gz -C {PRE_TRAINED_MODELS_PATH}
#!rm -rf {PRE_TRAINED_MODELS_PATH}/{MODEL_NAME}.tar.gz

### Update parameters in the pipeline.config
- **Batch Size:** 4
- **Learning Rate:** 1e-2 (Momentum), 1e-3 (Adam)
- **Optimizer:** Momentum, Adam
- **Epochs:** 50

-------

*Note: The pipeline.config does not have an epochs parameter. Instead, **num_steps** has to be calculated from the number of epochs we want to train for:*

- ***num_steps = epochs * (num_samples / batch_size)***
- ***epochs = num_steps / (num_samples / batch_size)***
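
As a worked example of the two formulas, here is a small sketch. The dataset size of 1000 samples is a made-up value for illustration, not the size of the dataset used in this notebook, and both function names are hypothetical.

```python
def steps_for_epochs(epochs, num_samples, batch_size):
    """num_steps = epochs * (num_samples / batch_size), truncated to an integer."""
    return int(epochs * (num_samples / batch_size))


def epochs_for_steps(num_steps, num_samples, batch_size):
    """epochs = num_steps / (num_samples / batch_size)."""
    return num_steps / (num_samples / batch_size)


# Hypothetical dataset of 1000 training samples
print(steps_for_epochs(epochs=50, num_samples=1000, batch_size=4))        # 12500
print(epochs_for_steps(num_steps=12500, num_samples=1000, batch_size=4))  # 50.0
```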


In [90]:
# Manipulate pipeline.config
from object_detection.utils import config_util
from object_detection.protos import preprocessor_pb2

In [91]:
NUM_SAMPLES = len([x for x in os.listdir(os.path.join(DATA_PATH, 'train')) if x.endswith('.xml')])
NUM_STEPS = int(EPOCHS * (NUM_SAMPLES / BATCH_SIZE))

In [98]:
# Read config
!cp {PRE_TRAINED_MODELS_PATH}/{MODEL_NAME}/pipeline.config {MODELS_PATH}/{MODEL_NAME}/{COMBINATION_NAME}
config = config_util.get_configs_from_pipeline_file(os.path.join(MODELS_PATH, MODEL_NAME, COMBINATION_NAME, "pipeline.config"))

In [99]:
# Manipulate config
config['model'].ssd.num_classes = NUM_CLASSES
config['train_config'].fine_tune_checkpoint_type = "detection"
config['train_config'].batch_size = BATCH_SIZE
config['train_config'].num_steps = NUM_STEPS
config['train_config'].fine_tune_checkpoint = os.path.join(PRE_TRAINED_MODELS_PATH, MODEL_NAME, "checkpoint", "ckpt-0")

if OPTIMIZER == "momentum":
    config['train_config'].optimizer.ClearField('adam_optimizer')
    config['train_config'].optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate.learning_rate_base = LEARNING_RATE
    config['train_config'].optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate.warmup_learning_rate = LEARNING_RATE / 5
    config['train_config'].optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate.warmup_steps = int(NUM_STEPS * 0.01)
    config['train_config'].optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate.total_steps = NUM_STEPS
    
if OPTIMIZER == "adam":
    config['train_config'].optimizer.ClearField('momentum_optimizer')
    config['train_config'].optimizer.adam_optimizer.learning_rate.cosine_decay_learning_rate.learning_rate_base = LEARNING_RATE
    config['train_config'].optimizer.adam_optimizer.learning_rate.cosine_decay_learning_rate.warmup_learning_rate = LEARNING_RATE / 5
    config['train_config'].optimizer.adam_optimizer.learning_rate.cosine_decay_learning_rate.warmup_steps = int(NUM_STEPS * 0.01)
    config['train_config'].optimizer.adam_optimizer.learning_rate.cosine_decay_learning_rate.total_steps = NUM_STEPS
    
    
config['train_input_config'].label_map_path = os.path.join(DATA_PATH, 'label_map.pbtxt')
config['train_input_config'].tf_record_input_reader.input_path[0] = os.path.join(DATA_PATH, 'train.tfrecord')
config['eval_config'].batch_size = 1
config['eval_config'].metrics_set[0] = 'coco_detection_metrics' # default value
#config['eval_config'].metrics_set[0] = 'pascal_voc_detection_metrics'
#config['eval_config'].include_metrics_per_category = True # include per category metrics
config['eval_config'].all_metrics_per_category = True # include detailed per category metrics

config['eval_input_config'].label_map_path = os.path.join(DATA_PATH, 'label_map.pbtxt')
config['eval_input_config'].tf_record_input_reader.input_path[0] = os.path.join(DATA_PATH, 'val.tfrecord')

# Update Augmentation settings
if USE_AUGMENTATION == "None":
    config['train_config'].ClearField('data_augmentation_options')
    
if USE_AUGMENTATION == "aligned":
    config['train_config'].ClearField('data_augmentation_options')

    # Default-initialized option messages, keyed by their PreprocessingStep field name
    augmentation_options = {
        'random_adjust_hue': preprocessor_pb2.RandomAdjustHue(),
        'random_adjust_saturation': preprocessor_pb2.RandomAdjustSaturation(),
        'random_adjust_brightness': preprocessor_pb2.RandomAdjustBrightness(),
        'random_image_scale': preprocessor_pb2.RandomImageScale(),
        'random_horizontal_flip': preprocessor_pb2.RandomHorizontalFlip(),
    }

    # Create a new PreprocessingStep instance for each data augmentation option
    for field_name, option in augmentation_options.items():
        step = preprocessor_pb2.PreprocessingStep()
        getattr(step, field_name).CopyFrom(option)
        config['train_config'].data_augmentation_options.extend([step])

In [100]:
# Save updated config
config = config_util.create_pipeline_proto_from_configs(config)
config_util.save_pipeline_config(config, os.path.join(MODELS_PATH, MODEL_NAME, COMBINATION_NAME))

INFO:tensorflow:Writing pipeline config file to workspace/models/efficientdet_d1_coco17_tpu-32/epochs_50-batch_size_4-optimizer_adam-learning_rate_0.001-aug_None/pipeline.config


### Start training
1. Copy the model_main_tf2.py training script to our workspace folder
2. Start the training process
3. Directly after starting the training, run **eval.ipynb** for evaluation
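
Step 2 can also be launched from plain Python instead of a `!` shell command. The sketch below only assembles the argument list; the flags are the ones this notebook passes to the script, while `build_training_command` itself is a hypothetical helper.

```python
import os


def build_training_command(workspace_path, model_name, combination_name,
                           checkpoint_every_n=100):
    """Assemble the model_main_tf2.py call as an argument list for subprocess."""
    model_dir = os.path.join(workspace_path, "models", model_name, combination_name)
    return [
        "python3",
        os.path.join(workspace_path, "model_main_tf2.py"),
        f"--pipeline_config_path={os.path.join(model_dir, 'pipeline.config')}",
        f"--model_dir={model_dir}",
        f"--checkpoint_every_n={checkpoint_every_n}",
        "--alsologtostderr",
    ]


# Example call with the variables defined in the Setup section:
# import subprocess
# subprocess.run(
#     build_training_command(WORKSPACE_PATH, MODEL_NAME, COMBINATION_NAME),
#     check=True,
# )
```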

In [None]:
!cp models/research/object_detection/model_main_tf2.py {WORKSPACE_PATH}

In [101]:
start_time = time.time()

!python3 {WORKSPACE_PATH}/model_main_tf2.py \
    --pipeline_config_path={MODELS_PATH}/{MODEL_NAME}/{COMBINATION_NAME}/pipeline.config \
    --model_dir={MODELS_PATH}/{MODEL_NAME}/{COMBINATION_NAME} \
    --checkpoint_every_n=100 \
    --alsologtostderr

end_time = time.time()
elapsed_time = (end_time - start_time) / 60

print(f"Elapsed time: {round(elapsed_time, 2)} minutes")

caused by: ['/home/jovyan/.local/lib/python3.9/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl5mutexC1Ev']
caused by: ['/home/jovyan/.local/lib/python3.9/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZNK10tensorflow4data11DatasetBase3GetEPNS_15OpKernelContextElPSt6vectorINS_6TensorESaIS5_EE']
 The versions of TensorFlow you are currently using is 2.6.0 and is not supported. 
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
2023-02-25 16:46:53.014819: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following 