[Object Detection] unable to use custom dataset #5940

Closed · Aspirinkb opened this issue Dec 20, 2018 · 7 comments

@Aspirinkb

System information

  • What is the top-level directory of the model you are using: models/research/object_detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes, I have written code to convert the BDD100K image dataset to TFRecord files (a training set and a validation set).
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.11.0
  • Bazel version (if compiling from source): -
  • CUDA/cuDNN version: 10.0
  • GPU model and memory: GeForce 1080, 8GB
  • Exact command to reproduce:
python3 ./object_detection/model_main.py \
--pipeline_config_path=/home/yann/tensorflow/OD/models/ssd_inception_v2_bdd100k/ssd_inception_v2_coco.config \
--model_dir=/home/yann/tensorflow/OD/models/ssd_inception_v2_bdd100k/training/ \
--num_train_steps=30 \
--sample_1_of_n_eval_examples=1 \
--alsologtostderr

Describe the problem

I want to use the Object Detection API to train on my own dataset (the BDD100K image dataset), so I used create_kitti_tf_record.py and create_pascal_tf_record.py as references to write a script that converts BDD100K into two TFRecord files, bdd100k_train.record and bdd100k_val.record. But when I start training the model with the command above, errors come out. The error message below is confusing to me:

...
Instructions for updating:
Use `tf.data.Dataset.batch(..., drop_remainder=True)`.
2018-12-20 10:00:25.883849: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-12-20 10:00:26.237011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.8095
pciBusID: 0000:02:00.0
totalMemory: 7.93GiB freeMemory: 7.40GiB
2018-12-20 10:00:26.237142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-12-20 10:00:30.303272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-20 10:00:30.303432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2018-12-20 10:00:30.303475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2018-12-20 10:00:30.304127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7133 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:02:00.0, compute capability: 6.1)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1292, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0] = 0 is not in [0, 0)
         [[{{node GatherV2_2}} = GatherV2[Taxis=DT_INT32, Tindices=DT_INT64, Tparams=DT_INT64, _device="/device:CPU:0"](cond_1/Merge, Reshape_8, GatherV2_1/axis)]]
         [[{{node IteratorGetNext}} = IteratorGetNext[output_shapes=[[8], [8,300,300,3], [8,2], [8,3], [8,100], [8,100,4], [8,100,7], [8,100,7], [8,100], [8,100], [8,100], [8]], output_types=[DT_INT32, DT_FLOAT, DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_BOOL, DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](IteratorV2)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./object_detection/model_main.py", line 109, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./object_detection/model_main.py", line 105, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1409, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1148, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1239, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1224, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1296, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 887, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1110, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1286, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1308, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0] = 0 is not in [0, 0)
         [[{{node GatherV2_2}} = GatherV2[Taxis=DT_INT32, Tindices=DT_INT64, Tparams=DT_INT64, _device="/device:CPU:0"](cond_1/Merge, Reshape_8, GatherV2_1/axis)]]
         [[{{node IteratorGetNext}} = IteratorGetNext[output_shapes=[[8], [8,300,300,3], [8,2], [8,3], [8,100], [8,100,4], [8,100,7], [8,100,7], [8,100], [8,100], [8,100], [8]], output_types=[DT_INT32, DT_FLOAT, DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_BOOL, DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](IteratorV2)]]

Everything works when I use the PASCAL dataset step by step, following "Generating the PASCAL VOC TFRecord files", so I think something is wrong with my conversion code, but I can't find the mistake:

import hashlib
import os

import ijson
import tensorflow as tf

# Feature helpers from the Object Detection API, as used by
# create_pascal_tf_record.py.
from object_detection.utils.dataset_util import (
    bytes_feature, bytes_list_feature, float_list_feature, int64_feature,
    int64_list_feature)

# output_path, json_path, img_folder, categories, label_id and nums are
# defined elsewhere in my script.
writer = tf.python_io.TFRecordWriter(output_path)
img_w, img_h = 1280, 720  # all BDD100K images are 1280x720
with open(json_path, 'rb') as json_f:
    items = ijson.items(json_f, 'item')
    img_counter = 0
    skipped_img_counter = 0
    for item in items:  # item is a dict holding one jpg image and its labels
        img_counter += 1
        img_name = item['name']
        xmins = []
        ymins = []
        xmaxs = []
        ymaxs = []
        classes = []
        labels = []
        occluded = []
        truncated = []

        for label in item['labels']:
            category = label['category']
            if category in categories:
                nums[category] += 1
                labels.append(label_id[category])
                classes.append(category.encode('utf8'))
                att_ = label['attributes']
                occluded.append(int(att_['occluded'] == 'true'))
                truncated.append(int(att_['truncated'] == 'true'))
                box2d = label['box2d']
                # Normalize box coordinates to [0, 1].
                xmins.append(float(box2d['x1']) / img_w)
                ymins.append(float(box2d['y1']) / img_h)
                xmaxs.append(float(box2d['x2']) / img_w)
                ymaxs.append(float(box2d['y2']) / img_h)
        difficult_obj = [0] * len(xmins)
        if len(xmins) == 0:
            skipped_img_counter += 1
            print("{0} has no object, skip it and continue.".format(img_name))
            continue
        assert len(xmins) == len(labels) == len(classes) == len(difficult_obj) == len(occluded) == len(truncated), 'not same list length'
        img_path = os.path.join(img_folder, img_name)
        with tf.gfile.GFile(img_path, 'rb') as fid:
            encoded_jpg = fid.read()
        key = hashlib.sha256(encoded_jpg).hexdigest()
        # att = item['attributes']
        # weather, scene, timeofday = att['weather'], att['scene'], att['timeofday']
        tf_example = tf.train.Example(features=tf.train.Features(feature={
            'image/height': int64_feature(img_h),
            'image/width': int64_feature(img_w),
            'image/filename': bytes_feature(img_name.encode('utf8')),
            'image/source_id': bytes_feature(img_name.encode('utf8')),
            'image/key/sha256': bytes_feature(key.encode('utf8')),
            'image/encoded': bytes_feature(encoded_jpg),
            'image/format': bytes_feature('jpg'.encode('utf8')),
            'image/object/bbox/xmin': float_list_feature(xmins),
            'image/object/bbox/xmax': float_list_feature(xmaxs),
            'image/object/bbox/ymin': float_list_feature(ymins),
            'image/object/bbox/ymax': float_list_feature(ymaxs),
            'image/object/bbox/text': bytes_list_feature(classes),
            'image/object/bbox/label': int64_list_feature(labels),
            'image/object/bbox/difficult': int64_list_feature(difficult_obj),
            'image/object/bbox/occluded': int64_list_feature(occluded),
            'image/object/bbox/truncated': int64_list_feature(truncated),
        }))
        writer.write(tf_example.SerializeToString())
        print(img_name, 'processed.')
    print('{0} images were processed and {1} were skipped.'.format(img_counter, skipped_img_counter))
    print(nums)
writer.close()
@Aspirinkb
Author

These feature keys in my tf.train.Example:

'image/object/bbox/text': bytes_list_feature(classes),
'image/object/bbox/label': int64_list_feature(labels),
'image/object/bbox/difficult': int64_list_feature(difficult_obj),
'image/object/bbox/occluded': int64_list_feature(occluded),
'image/object/bbox/truncated': int64_list_feature(truncated),

should be:

'image/object/class/text': bytes_list_feature(classes),
'image/object/class/label': int64_list_feature(labels),
'image/object/difficult': int64_list_feature(difficult_obj),
'image/object/occluded': int64_list_feature(occluded),
'image/object/truncated': int64_list_feature(truncated),   

This was an error caused by my carelessness: all the keys in the example features should be in core.standard_fields.TfExampleFields.
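
To catch this earlier, it helps to read one record back and print its feature keys before training, then compare them with the names in core.standard_fields.TfExampleFields. A minimal sketch (TF 1.x; the record path is assumed from the conversion step above):

import tensorflow as tf

# Print the feature keys of the first record as a sanity check.
for serialized in tf.python_io.tf_record_iterator('bdd100k_train.record'):
    example = tf.train.Example.FromString(serialized)
    for feature_key in sorted(example.features.feature):
        print(feature_key)
    break  # one record is enough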

@CoderGenJ

Hi,
I recently used BDD100K to train an SSD-MobileNet model, but with the official config, SSD_MobileNet_V2.config, the loss stayed at about 7.0. Could you tell me whether your model converged? If so, how did you change the config?

I have tried freezing the MobileNet layer weights in order to reuse the good features trained on the COCO dataset. Because of the class imbalance, I also cut the BDD100K dataset down to four classes: car, truck, person, and traffic light. That did reduce the loss to about 3, but the mAP is only 0.2.
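
Roughly, what I mean by freezing is something like the following sketch using the API's config_util (the config path and the variable-name prefix are assumptions; verify the prefix against the variable names in your checkpoint):

from google.protobuf import text_format
from object_detection.utils import config_util

# Load the pipeline config ('ssd_mobilenet_v2_coco.config' is a placeholder path).
configs = config_util.get_configs_from_pipeline_file('ssd_mobilenet_v2_coco.config')
# train_config.freeze_variables lists regexes of variable names to keep fixed
# during training; 'FeatureExtractor/MobilenetV2' is the usual prefix for the
# MobileNet backbone in this API.
configs['train_config'].freeze_variables.append('FeatureExtractor/MobilenetV2')
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
with open('ssd_frozen_fe.config', 'w') as f:
    f.write(text_format.MessageToString(pipeline_proto))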

@CoderGenJ

Could you please share some advice, or what you have tried so far?

@CoderGenJ

Could you please tell me the loss value and mAP of your best run?

@Aspirinkb
Author

@SLAMgreen Hi, I got an even lower mAP (0.1), since the feature extractor of my SSD model is a simple, small CNN and I trained a quantized SSD model from scratch. The total loss was 3.4.
SSD is not good at detecting small objects in the image, and I think traffic lights are too small to detect. What would you say?
Maybe you can increase the size of the input image, use image augmentation methods like mixup, or use an FPN model; a sketch of the input-size change is below.
You could also analyze the mAP and recall for small, medium, and large objects separately (the COCO evaluation metrics report these per object size).
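
Increasing the input resolution is a small edit to the pipeline config; here is a rough sketch using the API's config_util (the config path is a placeholder, and 512x512 is only an example; larger inputs slow training and may need retuned anchors):

from google.protobuf import text_format
from object_detection.utils import config_util

# Load the pipeline config ('ssd_mobilenet_v2_coco.config' is a placeholder path).
configs = config_util.get_configs_from_pipeline_file('ssd_mobilenet_v2_coco.config')
# Stock SSD configs use a 300x300 fixed_shape_resizer; raise it to 512x512.
resizer = configs['model'].ssd.image_resizer.fixed_shape_resizer
resizer.height = 512
resizer.width = 512
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
with open('ssd_512.config', 'w') as f:
    f.write(text_format.MessageToString(pipeline_proto))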

@CoderGenJ

Thank you for your advice; I will try it further.
From the detection results, I can see that SSD-MobileNet performs poorly on traffic lights: it detects only some of them, and the localization has a big bias from the labels.

@sainisanjay

Hi @SLAMgreen, regarding the comment above: were you able to get a good mAP with SSD-MobileNet? I am facing the same issue; you can check over here.
