
SSD Object Detection Mappings? #107

Closed · madhavajay opened this issue Jan 15, 2018 · 42 comments

@madhavajay

Can anyone explain the mappings of the two tensors produced by the ssd_mobilenet_v1_android_export model, which the example converts to a .mlmodel file?

When I look at this example:
https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb

Which uses:
ssd_mobilenet_v1_coco_2017_11_17

You can see the code does this:

(boxes, scores, classes, num) = sess.run(
    [detection_boxes, detection_scores, detection_classes, num_detections],
    feed_dict={image_tensor: image_np_expanded}
)

print(boxes.shape)
print(boxes)

(1, 100, 4)
[  3.90840471e-02   1.92150325e-02   8.72103453e-01   3.15773487e-01]
[  1.09515011e-01   4.02835608e-01   9.24646080e-01   9.73047853e-01]
[  5.07123828e-01   3.85651529e-01   8.76479626e-01   7.03940928e-01]

Looks good!

So I assume what we have here is the first 100 boxes with 4 dimensions each.

I traced the values and code and did this:

box_coords = ymin, xmin, ymax, xmax
(left, right, top, bottom) = (xmin, xmax, ymin, ymax)
e.g.
(0.10951501131057739, 0.4028356075286865, 0.9246460795402527, 0.9730478525161743)
left: 0.4028356075286865
right: 0.9730478525161743
top: 0.10951501131057739
bottom: 0.9246460795402527

These are fractions of the entire image, so multiplying by the input size of 300 should give you the original pixel locations.
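For example, a quick sketch of that conversion in Python (box_to_pixels is just an illustrative helper, assuming the 300x300 input and the [ymin, xmin, ymax, xmax] ordering above):

def box_to_pixels(box, img_width=300, img_height=300):
    # box is (ymin, xmin, ymax, xmax) normalized to [0, 1]
    ymin, xmin, ymax, xmax = box
    left, right = xmin * img_width, xmax * img_width
    top, bottom = ymin * img_height, ymax * img_height
    return left, top, right, bottom

# e.g. the second box printed above
print(box_to_pixels((0.10951501, 0.40283561, 0.92464608, 0.97304785)))
# -> roughly (120.9, 32.9, 291.9, 277.4)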

This all makes sense.

However, I need the model in CoreML so I followed this guide:
https://github.com/tf-coreml/tf-coreml/blob/master/examples/ssd_example.ipynb

Which uses: ssd_mobilenet_v1_android_export
I assume from the README that it's the same model:
https://github.com/tensorflow/models/tree/master/research/object_detection

August 11, 2017
We have released an update to the Android Detect demo which will now run models trained using the Tensorflow Object Detection API on an Android device. By default, it currently runs a frozen SSD w/Mobilenet detector trained on COCO, but we encourage you to try out other detection models!

Obviously it's slightly different, but to what degree I don't know. Can someone clarify?

Now, after the export process I load it into Xcode and when I run the model, I get this kind of output from the tensor.

boxes in concat__0 e.g. concat:0

concat:0 are the bounding-box encodings of the 1917 anchor boxes

Why are there negative values in the bounding box coords?
[ 0.35306236, -0.48976013, -2.5883727 , -4.0799093 ]
[ 0.8760979 , 1.1190459 , -2.6803727 , -1.5514386 ]
[ 1.3935553 , 0.85614955, -0.92042184, -2.7950268 ]

Also, can anyone explain what the 1917 is? There are 91 categories in COCO, but why 1917 anchor boxes?

I even looked at the Android example and it's not much easier to understand.

Even better would be an explanation of the CoreML file's output tensors. It would be great to have the end of the example map the output in Python to show what each tensor is.

Perhaps draw the box and show the category label just like object_detection_tutorial.ipynb does; that would be great! :)

I am completely lost with this so any help would be greatly appreciated!

@vonholst

The 1917 are the predictions from the box priors of the different layers in the model. The CoreML example strips the preprocessing, box generation and postprocessing (like non-max suppression) from the original model.

However, the preprocessing can be achieved by creating the model with:

tfcoreml.convert(
    ...
    image_scale=2./255.,
    red_bias=-1.0,
    green_bias=-1.0,
    blue_bias=-1.0)
But the box generation and post-processing are up to you to implement in Xcode.

What I need, though, is the specification of the anchor-box layout used in creating the TensorFlow model, so I can correctly map coordinates during box generation. Is there any information on that?

@madhavajay commented Jan 18, 2018

Thanks for the tip. I tried adding the image_scale and biases to the convert method but I get the same output still.

I am using this example image from the SSD conversion notebook:
https://upload.wikimedia.org/wikipedia/commons/9/93/Golden_Retriever_Carlos_%2810581910556%29.jpg

coreml_box_encodings, coreml_scores = coreml_outputs

# Box Locations
coreml_box_encodings.squeeze()
# these are the same as above, not transposed???
array([[ 0.35299924,  0.87612647,  1.39346564, ...,  0.09383671,
        -0.09675099,  0.04786451],
       [-0.48974958,  1.11910188,  0.85613155, ...,  1.02312708,
         0.14625645,  0.95588237],
       [-2.58839083, -2.68030357, -0.92033517, ..., -2.25849342,
         0.7405231 , -3.04520822],
       [-4.07984161, -1.5513767 , -2.79496622, ...,  0.21464264,
        -2.82254958,  1.20153475]])

Also, why does the Object Detection API example output numbers in the 0 -> 1 range, as fractions of the input dimensions? That is much more sane, so how are they doing that? It's the same model, no?
https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb

      (boxes, scores, classes, num) = sess.run(
          [detection_boxes, detection_scores, detection_classes, num_detections],
          feed_dict={image_tensor: image_np_expanded})

      print('boxes')
      print(boxes.shape)
      print(boxes)

The same happens in this iOS example which uses TensorFlow mobile:
https://github.com/JieHe96/iOS_Tensorflow_ObjectDetection_Example

To me it looks like the boxRect vector contains non-negative, sane locations which can be used directly for drawRect:

        float l = boxRect.at(i*4+0);
        float t = boxRect.at(i*4+1);
        float r = boxRect.at(i*4+2);
        float b = boxRect.at(i*4+3);
        
        CGRect textRect = CGRectMake(l, t-20>0 ? t-20 : 10, 100, 30);
        CGRect drawRect = CGRectMake(l, t, r-l, b-t);

Maybe this repo will shed some light on how the boxes are created:
https://github.com/balancap/SDC-Vehicle-Detection

But why is the conversion destroying the useful output from the TensorFlow ProtoBuf version?

@vonholst are you able to make sense out of the following np arrays:

coreml_box_encodings, coreml_scores = coreml_outputs

With respect to the above DOG image which is used in the Object Detection API example?
Or another reference image you wish to share.

Scoring

coreml_scores.squeeze()
array([[  2.60945582,   3.01424789,   2.65478683, ...,  -2.66811037,
          5.15048122,   3.56572962],
       [ -3.20076275,  -3.7116468 ,  -3.25320625, ...,  -7.5839529 ,
         -8.10477352,  -8.72828484],
       [ -7.36386967,  -7.54023457,  -8.0100174 , ...,  -8.06105518,
         -7.97770119,  -7.68251896],
       ...,
       [ -6.8966341 ,  -6.29983568,  -6.44188881, ...,  -7.39128351,
         -8.11299706,  -8.18768692],
       [ -9.71340847,  -9.01159286,  -9.71074677, ..., -11.07734108,
         -8.64872265,  -9.34194088],
       [ -8.51055622,  -8.49759293,  -8.95537186, ..., -11.96751595,
         -8.73058605,  -9.57052708]])

The same goes for the scores; shouldn't they be non-negative?

When I look at the Object Detection API example, I see this:

      (boxes, scores, classes, num) = sess.run(
          [detection_boxes, detection_scores, detection_classes, num_detections],
          feed_dict={image_tensor: image_np_expanded})

      print(scores.shape)
      print(scores)
      print(classes.shape)
      print(classes)

(1, 100)
[[0.94069076 0.93450266 0.23088202 0.22518906 0.1724992  0.13961881
  0.13212039 0.09975734 0.08992747 0.08888832 0.08091756 0.07895215
  0.07818329 0.07565769 0.07331282 0.07074907 0.06597457 0.06404536
  0.06387985 0.06381162 0.06380178 0.06050147 0.05898505 0.05898324
  0.05761419 0.05641391 0.05638798 0.05500007 0.05495741 0.05426413
  0.05425422 0.05409579 0.05335903 0.05279237 0.05228516 0.05219484
  0.05164067 0.05066017 0.05030207 0.04837736 0.04822728 0.04720661
  0.04673669 0.04671789 0.0464424  0.04614932 0.04420191 0.04286543
  0.04251252 0.04174349 0.04166571 0.04137345 0.04096129 0.04053527
  0.03990961 0.03988715 0.03984378 0.03930985 0.03923889 0.03877977
  0.03877515 0.03853    0.03816225 0.03815909 0.03742459 0.03689881
  0.03677597 0.03671168 0.03654755 0.0363533  0.0361237  0.03610827
  0.03601931 0.03586451 0.03582645 0.03572749 0.03565424 0.03555561
  0.03504343 0.03495524 0.03489256 0.03459927 0.03383927 0.03381943
  0.03363901 0.03349813 0.03300397 0.03294906 0.03281055 0.03275418
  0.0325323  0.03200774 0.03199514 0.0317288  0.03149139 0.03133158
  0.03106265 0.0305737  0.03044932 0.03041713]]

(1, 100)
[[18. 18. 18. 18. 18. 33.  1. 63. 18. 21.  3. 18.  1. 18. 21. 18. 62. 18.
  18. 62.  1. 62. 18.  3.  1. 21.  1.  1.  1.  1.  1. 18.  1.  1.  1.  1.
  62.  1.  1.  1.  1.  1. 62. 18.  1. 63.  1.  1. 84.  1.  1. 18.  1.  1.
  62. 21. 57. 18. 18. 62. 18.  1.  1.  1.  1.  1.  1.  1. 18. 62.  1. 88.
  47.  1. 47. 62. 84.  1. 18.  1.  1.  1.  1. 84. 18. 62. 18. 21.  1. 62.
  62.  1. 18. 21.  1. 18. 47.  1.  1. 47.]]

This makes sense: these are the 100 top probabilities and their classes. The classes are rounded floats, so there's no way to mistake them. I don't see any rounded floats in the output of the CoreML model, nor can I see any probability-looking values in the scores.

This looks like garbage:

coreml_scores.squeeze()

array([[  2.60945582,   3.01424789,   2.65478683, ...,  -2.66811037,
          5.15048122,   3.56572962],
       [ -3.20076275,  -3.7116468 ,  -3.25320625, ...,  -7.5839529 ,
         -8.10477352,  -8.72828484],
       [ -7.36386967,  -7.54023457,  -8.0100174 , ...,  -8.06105518,
         -7.97770119,  -7.68251896],
       ...,
       [ -6.8966341 ,  -6.29983568,  -6.44188881, ...,  -7.39128351,
         -8.11299706,  -8.18768692],
       [ -9.71340847,  -9.01159286,  -9.71074677, ..., -11.07734108,
         -8.64872265,  -9.34194088],
       [ -8.51055622,  -8.49759293,  -8.95537186, ..., -11.96751595,
         -8.73058605,  -9.57052708]])

@vonholst Do you understand how to interpret that numpy array?
Where are the classes and the scores? There are only 2 output nodes in the CoreML graph, but all the other implementations have 4 output nodes:

detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
num_detections = detection_graph.get_tensor_by_name('num_detections:0')

Any help would be greatly appreciated.

@vonholst

@madhavajay I believe this "garbage" is what you get before applying some sort of activation function. The example has extracted only the feature generator. The complete TensorFlow graph has the post-processing required to make sense of these numbers.

coreml_box_encodings:
These are a little tricky when you don't know the anchor-box specifications/layout. I looked at the graph in the .pb file and found the following: the box encodings are concatenated from 6 tensors. These are my guesses on grids per tensor:
"0":
1083 boxes
3 priors
19x19 grid

"1":
600 boxes
6 priors
10x10 grid

"2":
150 boxes
6 priors
5x5 grid

"3":
54 boxes
6 priors
3x3 grid

"4":
24 boxes
6 priors
2x2 grid

"5":
6 boxes
6 priors
1x1 grid

I don't know if this is correct, but the numbers add up. I don't know the shapes and sizes of these though. And also I don't know the activation function to make sense of the numbers.
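A quick sanity check of those numbers in Python (just grid cells times priors per layer; the per-layer layout itself is still a guess):

layers = [(19, 19, 3), (10, 10, 6), (5, 5, 6), (3, 3, 6), (2, 2, 6), (1, 1, 6)]
counts = [h * w * priors for (h, w, priors) in layers]
print(counts)       # [1083, 600, 150, 54, 24, 6]
print(sum(counts))  # 1917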

As for the coreml_scores, I would assume maybe a softmax to make sense of them.

@madhavajay

Ah yes, that is making a lot more sense now. It might help to bold / emphasize the important line in the README.

The full MobileNet-SSD TF model contains 4 subgraphs: Preprocessor, FeatureExtractor, MultipleGridAnchorGenerator, and Postprocessor. Here we will extract the FeatureExtractor from the model and strip off the other subgraphs, as these subgraphs contain structures not currently supported in CoreML. The tasks in Preprocessor, MultipleGridAnchorGenerator and Postprocessor subgraphs can be achieved by other means, although they are non-trivial.

@vonholst If this is something you are also trying to achieve in Swift perhaps we can work together to figure it out?

I guess the first thing would be to create a working interpretation in python, which can then be ported to any language.

So we need to create a MultipleGridAnchorGenerator and a Postprocessor?
I will read the SSD paper to get some insight but perhaps it might help to run the working example and debug in Tensorboard.

Do you have any idea how we will be able to get the Scores AND the Class numbers from the score output?

My thoughts were to try and compare the same input image to both the original ssd_mobilenet_v1_android_export ProtoBuf model and the feature extracted CoreML model so that as we apply mappings and transformations we can compare it to something meaningful.

Also if anyone in the Google team would be able to shed any light that would also be appreciated! 👍

@vonholst commented Jan 18, 2018

@madhavajay Yes, I am trying to get this working in CoreML as well. I have run a similar YOLO version on MobileNet in CoreML, and I would like to see how this compares to that implementation. I think much of the post-processing is equivalent, but I need to decode the boxes.

As for the classes, I expect that they follow the order of the txt file. And I think you can get the scores by applying a simple softmax.
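A minimal sketch of what I mean, assuming the squeezed (91, 1917) layout of coreml_scores shown above:

import numpy as np

def softmax(logits, axis=0):
    z = logits - logits.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

probs = softmax(coreml_scores.squeeze(), axis=0)  # each anchor's column now sums to 1
best_class = probs.argmax(axis=0)                 # best class index per anchor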

I think we could make a guess at the default boxes generated by the ssd method in the following:
https://github.com/tensorflow/models/blob/master/research/object_detection/anchor_generators/multiple_grid_anchor_generator.py

@vincentchu

I did a lot of work on this over the weekend and I have a working understanding of the outputs produced by the converted CoreML model.

I'll try to walk through what I found; please let me know if anything is unclear. Anyway, the CoreML model outputs two MLMultiArrays:

  1. Scores for each class (concat_1__0, a 1 x 1 x 91 x 1 x 1917 MLMultiArray)
  2. Anchor-encoded Boxes (concat__0, a 1 x 1 x 4 x 1 x 1917 MLMultiArray)

Here, 91 refers to the index of the class labels (0 = background, 18 = dog). There are a total of 1917 anchor boxes as well.

Postprocessing

The postprocessing goes like this:

  1. Prune out all boxes that are below a threshold of 0.01. For our golden retriever sample, the only indices that work are in scores[18][...] with [...] = [1756, 1857, 1858, 1860, 1911, 1912, 1914]
  2. Take this set of indices and compute the corresponding bounding boxes for each prediction
  3. Apply non-maximum suppression to this set of scores / boxes

Now, the non-maximum suppression part is pretty easy to understand. You can read about it here but the basic gist is that you sort each box by its score in descending order. You weed out any box that overlaps >50% with any other box that is scored more highly.

The trickiest part here is computing the bounding boxes. To do that, you need to take the output of the CoreML model and adjust a base set of anchor boxes. This set of 1917 anchor boxes tiles the 300x300 input image.

The output of the k-th CoreML box is:

ty, tx, th, tw = boxes[0, 0, :, 0, k]

You take these and combine them with the anchor boxes using the same routine as this python code. Note: You'll need to use the scale_factors of 10.0 and 5.0 here.

Now, the anchor boxes themselves are generated using this logic. I followed the logic to the bitter end, but instead of trying to reimplement the logic in swift, I just exported them out of the Tensorflow Graph from the import/MultipleGridAnchorGenerator/Identity tensor. You can see those anchors here

The logic for combining the box prediction and the anchor boxes is written up here.
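A rough Python sketch of that decode step (assuming squeezed (4, 1917) box encodings ordered ty, tx, th, tw, anchors as a (1917, 4) array of [ymin, xmin, ymax, xmax], and the 10.0 / 5.0 scale factors mentioned above):

import numpy as np

def decode_boxes(box_encodings, anchors, scale=(10.0, 10.0, 5.0, 5.0)):
    ty, tx, th, tw = box_encodings / np.array(scale)[:, None]
    ymin_a, xmin_a, ymax_a, xmax_a = anchors.T
    ha, wa = ymax_a - ymin_a, xmax_a - xmin_a
    ycenter_a, xcenter_a = ymin_a + 0.5 * ha, xmin_a + 0.5 * wa

    h, w = np.exp(th) * ha, np.exp(tw) * wa
    ycenter, xcenter = ty * ha + ycenter_a, tx * wa + xcenter_a

    # back to [ymin, xmin, ymax, xmax], still normalized to the 300x300 input
    return np.stack([ycenter - 0.5 * h, xcenter - 0.5 * w,
                     ycenter + 0.5 * h, xcenter + 0.5 * w], axis=-1)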

Hope this helps! Again, this was a lot of blood, sweat, and tears and reading a ton of Tensorflow code and going through all of the logic. Honestly thought I would stab my eyeballs out. 😭 At the end, I was able to reproduce the bounding box for the golden retriever:

[screenshot: the reproduced bounding box around the golden retriever]

@vonholst

@vincentchu: Big thanks. Had the same strategy minus the time, so I am glad you shared your solution. Saved me a weekend. If I can find the time I will implement the generator in swift, so it will be more convenient to implement other configurations. Thanks again.

@vonholst

@vincentchu Out of curiosity, how did you find the scale_factors = [10.0, 10.0, 5.0, 5.0] needed by KeypointBoxCoder?

@madhavajay

@vincentchu you are a legend!!! 👍 I read the SSD paper, and my next task was to compare the outputs side by side and see if I could find the implementation of the boxes, but exporting the values from the graph makes a lot more sense. Are you able to explain how you did that? I tried loading the protobuf up in TensorBoard but couldn't see how to actually view the post-processing step that modifies the values. I think the process of reverse-engineering the inputs and outputs of exported graphs is a really important one, particularly when converting models between formats. Thanks so much for your hard work, I am going to try your Swift code and see what I can get working. 🎉

@vincentchu

@vonholst I was spelunking around in the code and saw this line in the box coder. I couldn't get any of my encodings to work, but then made a guess that maybe the scale_factor was an issue.

I then went back to the original config and found the appropriate scale factors were set.

@vincentchu

@madhavajay @vonholst I spent most of my time inside an IPython notebook, just dumping the contents of tensors and looking at TensorBoard. I cleaned it up and added some comments; you might be interested in checking out this notebook. In it, I:

  • Load ssd_mobilenet_v1_android_export.pb
  • Fetch our favorite golden retriever JPG
  • List all named tensors in the graph
  • Write a copy of the graph into a format that tensorboard can read
  • Demonstrate a python version of box decoding that gets the appropriate answer

In terms of exporting values out of the graph, it's not super sophisticated. I just evaluate the tensor I want to dump, then I literally code-generated the Swift code and printed it to STDOUT. Very hacky, but my script is here.
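For anyone wanting to reproduce that dump, here is roughly what such an evaluation can look like with the TF 1.x API (the input feed name is an assumption; the anchor tensor name is the one mentioned above):

import numpy as np
import tensorflow as tf  # TF 1.x API

graph_def = tf.GraphDef()
with tf.gfile.GFile('ssd_mobilenet_v1_android_export.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='import')
    anchors_t = graph.get_tensor_by_name('import/MultipleGridAnchorGenerator/Identity:0')
    # Assumption: feeding the preprocessed-image tensor satisfies the anchor
    # generator's shape dependencies; if not, feed the graph's raw image input instead.
    input_t = graph.get_tensor_by_name('import/Preprocessor/sub:0')

with tf.Session(graph=graph) as sess:
    anchors = sess.run(anchors_t,
                       feed_dict={input_t: np.zeros((1, 300, 300, 3), np.float32)})

print(anchors.shape)  # expecting (1917, 4): ymin, xmin, ymax, xmax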

@madhavajay

That's awesome, this is what I was going to do, but you have done the work, so now I can play with the notebook and see how it's done! 👍 If you have a blog this would make a really, really super awesome blog post. Thanks so much for sharing your hard work!

@madhavajay

@vincentchu Just tried the code, first issue was a possible range issue and it crashed:

    // I assume we skip 0 because its ??? but I entered 91 for classes so this crashed
    for klass in 1...numClasses {
      for box in 0...(numAnchors-1) {

Maybe best to use the ..< range operator

for klass in 1..<numClasses {
      for box in 0..<numAnchors {

Or alternatively default it to 90 inside the class?
Since it's 91 in the array and in the paper, it might be best to stick with 91 as the number of classes.

Thoughts?

@madhavajay

@vincentchu I am seeing some weird scoring, do we need to apply a softmax like @vonholst suggested?

For example score: 1.3203125?

Also, it seems to be having trouble recognizing things; stop signs come through okay, but many other objects are not recognized.

Prediction(klass: 65, index: 1911, score: 0.5908203125, anchor: OpenCV91.BoundingBox(yMin: 0.025000009685754776, xMin: 0.025000009685754776, yMax: 0.97500002384185791, xMax: 0.97500002384185791, imgHeight: 300.0, imgWidth: 300.0), anchorEncoding: OpenCV91.AnchorEncoding(ty: 0.07094370573759079, tx: -0.027406346052885056, th: 0.076941326260566711, tw: 0.13402754068374634))

Prediction(klass: 65, index: 1914, score: 1.3203125, anchor: OpenCV91.BoundingBox(yMin: -0.17175143957138062, xMin: 0.16412428021430969, yMax: 1.1717514991760254, xMax: 0.8358757495880127, imgHeight: 300.0, imgWidth: 300.0), anchorEncoding: OpenCV91.AnchorEncoding(ty: 0.019769947975873947, tx: 0.005826219916343689, th: -1.5519936084747314, tw: 2.0325460433959961))

Also, I couldn't see any way you were returning the object class information, so I added the klass property to the Prediction struct and passed it through in the initializer like so:

let prediction = Prediction(klass: klass, index: box, score: score, anchor: anchor, anchorEncoding: anchorEncoding)

@vincentchu

@madhavajay It should be n = 90: the 0-th class is the background, so there should really only be 90 detected classes. That's why klass ranges from 1...90 (inclusive). I don't want to detect anything for the background case.

The array, however, has 91 slots because I want the indexing to match the labels instead of having to remember to add an offset everywhere.

@madhavajay

@vincentchu okay well 1...90 and 1..<91 are the same thing so either way. 👍

@vincentchu

@madhavajay In terms of scoring, I don't think you necessarily have to transform the scores. You can just treat them as "higher = better". So you start with anything >0.01 as the threshold, then you could apply non-maximum suppression to that pruned set (on a per-class basis).

Then I think you could just take, say, the top 20 of those candidates by score...

@madhavajay

Okay but what if I want to convert them to the scale of 0 -> 100%?

@vincentchu

@madhavajay yes, you will have to process them further. I didn't dig too much into it, but I think the loss function is here.

My impression is that the logits are per anchor box, so you'd compute the logit across classes with a fixed box index?

@vonholst

@madhavajay Then you might as well apply a sigmoid. In the config there are some parameters related to score thresholds. They mention sigmoid as the score converter. I'm not sure exactly how this config is used in model generation though.
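For example, a tiny sketch of that sigmoid conversion (same assumed squeezed (91, 1917) score layout as before):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

probs = sigmoid(coreml_scores.squeeze())  # every entry now lies in (0, 1)
# e.g. the raw score 5.15 from the dump above maps to roughly 0.994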

@madhavajay

@vincentchu @vonholst Great, I will look into this soon and if I can figure it out I'll post here! :)

I am still seeing some strange output with the bounding boxes though:

Prediction(klass: 13, index: 1914, score: 1.28515625, anchor: OpenCV91.BoundingBox(yMin: -0.17175143957138062, xMin: 0.16412428021430969, yMax: 1.1717514991760254, xMax: 0.8358757495880127, imgHeight: 300.0, imgWidth: 300.0), anchorEncoding: OpenCV91.AnchorEncoding(ty: -0.01125398650765419, tx: 0.0046060904860496521, th: -1.5216562747955322, tw: 2.0175662040710449))

ACCEPTED klass: 13 label: stop sign

The CGRect I get is:
X:49.2372840642929,Y:-51.5254318714142,W:201.525440812111,H:403.050881624222,P:1.28515625,L:13,LS:stop sign

Negative values again? Also if the input image is 300x300 how am I getting a height over 300?

@madhavajay

@vincentchu also any chance you could share your example Xcode project so I can debug mine against your working version? 😊

@vonholst commented Jan 24, 2018

@madhavajay @vincentchu I used the code provided by vincentchu. I applied a sigmoid to the scores to bound them in (0, 1). I think there are some issues with the non-max-suppression algorithm (not being implemented is the main issue, I guess 😊). I used hollance's implementation.

This is a screenshot btw:
[screenshot attached]

If we are to discuss this further, I think we should continue the discussion in another forum, since these issues should mainly focus on functional problems in tf-coreml.

Another tip: hollance has some other useful code for handling MLMultiArrays and object-detection that might speed up your code as well. I used those for other implementations.

@vincentchu

@vonholst Amazing! The NMS should be pretty easy, we just have to write an implementation to compute the IOU (Jaccard index) of two boxes, then weed any boxes out that have >50% overlap with another, higher-scoring box.
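A short Python sketch of that IOU + NMS idea (boxes as [ymin, xmin, ymax, xmax], consistent with the decode above):

import numpy as np

def iou(a, b):
    # a, b: [ymin, xmin, ymax, xmax]
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(boxes, scores, iou_threshold=0.5):
    # keep the highest-scoring box, drop anything overlapping it too much, repeat
    keep = []
    for i in np.argsort(scores)[::-1]:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep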

Could I ask a favor? This is actually my first iOS learning project, so I think I would benefit quite a bit from seeing your code. I was trying to get this to work in real time, vs. just on the golden retriever picture, and having some struggles. It would help me a ton if I could look at how you did the real-time inference. Maybe you could just create a new repo with the code, and we could submit PRs to improve/implement the NMS and do our discussion there?

@madhavajay

@vonholst @vincentchu Yes, I am also trying to implement this as a working camera demo and having some similar issues; it would be great if we worked on a shared git repo to provide a working implementation for iOS and CoreML. @vonholst, do you mind committing your code to a repo and putting a cut-down demo app in there that we can all iterate on?

@vonholst

Absolutely. I will need to clear some stuff from the code, since some parts were developed for a client. I will try to make a clean project that I can safely share.

You have most of the code though; I just filled in the blanks in vincentchu's code. hollance also has some complete Xcode projects with NMS implemented if you are in a hurry. Otherwise I will fix this as soon as I can.

@vonholst

@vincentchu @madhavajay I created a clean (still kinda messy) project that I can share. We can continue the discussion on that forum if there is any trouble. SSDMobileNetCoreML

@madhavajay

Awesome! 🎉

@shishaozheng

@vonholst @madhavajay @vincentchu Thank you so much! This is the only detailed solution I have found on the internet for using a TensorFlow SSD model with CoreML, and it has indeed been a great help!

@madhavajay

@vonholst and @vincentchu did all the work, I just started the issue. 🤓

Also, for anyone else: Andrew Ng's DeepLearning CNN series covers the YOLO algorithm, sliding windows and non-max suppression really well:
https://www.youtube.com/watch?v=ArPaAX_PhIs&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF

Anyone attempting to understand the output of the SSD should watch those videos to understand what the output vectors really contain.

@madhavajay

@vonholst @vincentchu Hey guys, there's no issues section on the example repo:
https://github.com/vonholst/SSDMobileNet_CoreML

I was tracking the TF Lite progress here:
tensorflow/tensorflow#15633

From the developer at Google, andrewharp:

I have a commit porting the Android TF demo to tflite currently under review, should show up on github this week hopefully.

It's Android only, but you should be able to adapt it for iOS. The only thing is that some of the pre-processing (image resizing/normalization) and post-processing (non-max suppression and adjustment by box priors) is done in Java as tflite doesn't fully support all the operators used by MobileNet SSD.

Seems like there might be a chance to re-use the post-processing from the SSDMobileNet_CoreML demo and add some pre-processing to get the TF Lite iOS demo working.

@vonholst commented Apr 5, 2018

@madhavajay Interesting! I don’t have a device at hand to try the android implementation. But it seems much of the iOS code should be similar. Swamped right now but will try as soon as I get a chance.

@madhavajay commented Apr 5, 2018

@vonholst Yeah, no worries. I have an Android device so I could provide some debugging. What's the main thing you want to know? The inputs / outputs of the Android TF Lite model in Android Studio?

Also, I might try to get the model loaded up in TF and poke around to understand it a bit better. I'm also super busy, but it seems like this could be a good code contribution to the TensorFlow repo. :)

@skercher

@vonholst SSDMobileNet_CoreML is awesome! Thanks for sharing.

I am trying to replicate your conversion of the pb file to mlmodel using:
https://github.com/tf-coreml/tf-coreml/blob/master/examples/ssd_example.ipynb

I updated the conversion:

coreml_model = tfcoreml.convert(
      tf_model_path=frozen_model_file,
      mlmodel_path=coreml_model_file,
      input_name_shape_dict=input_tensor_shapes,
      image_input_names="Preprocessor/sub:0",
      output_feature_names=output_tensor_names,
      image_scale=2./255.,
      red_bias=-1.0,
      green_bias=-1.0,
      blue_bias=-1.0
)

But I am still seeing issues when running with the new mlmodel.
On line 172 of SSDPostprocessor.swift,
let score = classPredictions[offset(klass, box)].doubleValue crashes with "EXC_BAD_ACCESS".
It also appears that the output ordering from my conversion is different from what is checked into the project.
Checked-in MLModel output: concat_1__0 concat__0
My generated MLModel output: concat__0 concat_1__0

Do you have any idea if I am missing anything? Thanks in advance!

@tanndx17

@vonholst Hi, I see the image_scale=2./255 part in the model inference. I don't see any image-scaling preprocessing step in the TensorFlow Object Detection API demo code: https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb
Is that a prerequisite or optional, or is it already included somewhere?

@vonholst

It's required preprocessing if you use a pretrained MobileNet. If it's not included in the model, then you need to do it manually at inference time.

@mrgloom commented Mar 11, 2019

Here is some description on the topic: https://github.com/tf-coreml/tf-coreml/blob/master/examples/ssd_example.ipynb

By inspecting TensorFlow GraphDef, it can be found that: (1) the input tensor of MobileNet-SSD Feature Extractor is Preprocessor/sub:0 of shape (1,300,300,3), which contains the preprocessed image. (2) The output tensors are: concat:0 of shape (1,1917,4), the box coordinate encoding for each of the 1917 anchor boxes; and concat_1:0 of shape (1,1917,91), the confidence scores (logits) for each of the 91 object classes (including 1 class for background), for each of the 1917 anchor boxes.

I wonder if there is any example of the post-processing and preprocessing in Python/C++, i.e. for the parts that are removed from the graph?

@vonholst

@mrgloom, from your link:

Preprocess the image - normalize to [-1,1]

img = img.resize([300,300], PIL.Image.ANTIALIAS)
img_array = np.array(img).astype(np.float32) * 2.0 / 255 - 1
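Putting that together with the parts stripped from the graph, a rough end-to-end Python sketch (TF 1.x session API; the image path is an example, and the tensor names are the ones quoted above):

import numpy as np
import PIL.Image
import tensorflow as tf  # TF 1.x API

img = PIL.Image.open('golden_retriever.jpg').resize([300, 300], PIL.Image.ANTIALIAS)
img_array = np.array(img).astype(np.float32) * 2.0 / 255 - 1

graph_def = tf.GraphDef()
with tf.gfile.GFile('ssd_mobilenet_v1_android_export.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')

with tf.Session(graph=graph) as sess:
    box_encodings, scores = sess.run(
        ['concat:0', 'concat_1:0'],
        feed_dict={'Preprocessor/sub:0': img_array[None, ...]})

print(box_encodings.shape, scores.shape)  # (1, 1917, 4) and (1, 1917, 91)

From there, the box decoding, score conversion and NMS discussed earlier in this thread apply.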

@kiad4631 commented Aug 3, 2019

Hi.
I have a tflite model of SSDLite MobileNetV2 trained on the COCO 2017 dataset, and I want to test its accuracy on the COCO 2017 validation set.
I know I should use the interpreter, and its output is the prediction for the input image, but the output shape of my model is [1, 1917, 1, 4].
I don't know how to map the interpreter's output to the val2017 annotations.
Can anyone help me?

@madhavajay

@Davari393 Check out this source code which does the mapping from 1917 to anchor boxes and scores: https://github.com/vonholst/SSDMobileNet_CoreML

@kiad4631

(quoting @vincentchu's postprocessing walkthrough from earlier in this thread)

Hi pal.
I am using the tflite model of SSDLite MobileNet V2 and I want to evaluate it on my system, but I don't know how I should fill these variables in my code:

  let yACtr = (anchor.yMin + anchor.yMax) / 2.0
  let xACtr = (anchor.xMin + anchor.xMax) / 2.0
  let ha = (anchor.yMax - anchor.yMin)
  let wa = (anchor.xMax - anchor.xMin)
  
  let ty = anchorEncoding.ty / 10.0
  let tx = anchorEncoding.tx / 10.0
  let th = anchorEncoding.th / 5.0
  let tw = anchorEncoding.tw / 5.0
  
  let w = exp(tw) * wa
  let h = exp(th) * ha
  
  let yCtr = ty * ha + yACtr
  let xCtr = tx * wa + xACtr
  
  let yMin = yCtr - h / 2.0
  let xMin = xCtr - w / 2.0
  let yMax = yCtr + h / 2.0
  let xMax = xCtr + w / 2.0

Are ty, tx, th, tw equal to prediction[-1][:4] of the model, respectively?
Please help me.

@kiad4631

(quoting @vincentchu's postprocessing walkthrough from earlier in this thread)

Hi.
If I want to compute IoU, is it correct to do it like this?
boxes = real bbox coordinates from the COCO annotations
p = prediction of the tflite SSDLite MobileNetV2 model

yACtr = (boxes[i][0] + boxes[i][2]) / 2.0
xACtr = (boxes[i][1] + boxes[i][3]) / 2.0
ha = (boxes[i][2] - boxes[i][0])
wa = (boxes[i][3] - boxes[i][1])
ty , tx , th , tw = p[0:4]
w = np.exp(tw / 5.0) * wa
h = np.exp(th / 5.0) * ha
xCtr = (tx / 10.0) * wa + xACtr
yCtr = (ty / 10.0) * ha + yACtr
ymin = yCtr - h / 2.0
xmin = xCtr - w / 2.0
ymax = yCtr + h / 2.0
xmax = xCtr + w / 2.0
pred = [ymin , xmin , ymax ,xmax]
