# [question] What is the purpose of the variances? #53

Closed
opened this issue Mar 10, 2017 · 11 comments

Contributor

### oarriaga commented Mar 10, 2017 • edited

Hello @rykov8, sorry for bothering you again. Looking at your ssd_training.py file, it seems that the variances do not take part in the training loss, or in any other part of the training pipeline. However, in your ssd_utils.py the method detection_out does change the predictions by multiplying them by their respective variances:

```python
decode_bbox_center_x = mbox_loc[:, 0] * prior_width * variances[:, 0]
decode_bbox_center_x += prior_center_x
decode_bbox_center_y = mbox_loc[:, 1] * prior_width * variances[:, 1]
decode_bbox_center_y += prior_center_y
decode_bbox_width = np.exp(mbox_loc[:, 2] * variances[:, 2])
decode_bbox_width *= prior_width
decode_bbox_height = np.exp(mbox_loc[:, 3] * variances[:, 3])
```

I do understand that one has to decode the boxes, since they were encoded using the transformation described in equation 2 of SSD (and Faster R-CNN). My main concern is that the variances explicitly change the values already output by the CNN without being considered directly in the training procedure; furthermore, I cannot find any mention of these variances in either the SSD or the Faster R-CNN paper. Maybe I am missing something in the papers or in the implementation; in that case, I would be very grateful if you could tell me whether I am making a mistake, or elaborate on the use of these variances. Thank you very much.

### tachim commented Mar 10, 2017 • edited

They are taken into account in encode_box, which is called by assign_boxes, which is used to create the labeled data for training.

Edit: pretty sure they're used for numerical conditioning of the problem; otherwise the detection offsets would be on a different scale than the classifications and would slow down the optimization.
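The conditioning idea can be sketched in a few lines. This is a hypothetical, self-contained illustration (not the repo's actual `encode_box`), assuming the common SSD defaults `[0.1, 0.1, 0.2, 0.2]` for the variances and boxes in `(cx, cy, w, h)` form:

```python
import numpy as np

# Default SSD scale factors ("variances") for (cx, cy, w, h).
variances = np.array([0.1, 0.1, 0.2, 0.2])

def encode(box, prior):
    """Encode a ground-truth box against a prior box; both are (cx, cy, w, h)."""
    g = np.empty(4)
    g[0] = (box[0] - prior[0]) / prior[2]   # center-x offset, relative to prior width
    g[1] = (box[1] - prior[1]) / prior[3]   # center-y offset, relative to prior height
    g[2] = np.log(box[2] / prior[2])        # log width ratio
    g[3] = np.log(box[3] / prior[3])        # log height ratio
    return g / variances                    # rescale the small raw offsets upward

prior = np.array([0.5, 0.5, 0.2, 0.2])
box = np.array([0.52, 0.48, 0.25, 0.18])
print(encode(box, prior))
```

Dividing by 0.1 and 0.2 multiplies the raw offsets by 10x and 5x, so the regression targets end up roughly on the same scale as the classification outputs, which is the conditioning effect described above.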
Contributor Author

### oarriaga commented Mar 11, 2017 • edited

@tachim thank you very much! I did not see that they were being used in the encode_box method. Then everything makes sense and, as you said, they are probably used for numerical conditioning.

### oarriaga closed this Mar 11, 2017

Contributor Author

### oarriaga commented May 23, 2017 • edited

Hello @villanuevab. As far as I can tell, they scale the transformed box coordinates to make the learning task easier. They are not mentioned in the paper.

### villanuevab commented May 23, 2017

Please see the inline comments:

```python
# the following 2 lines => g_hat for cx, cy
encoded_box[:, :2][assign_mask] = box_center - assigned_priors_center
encoded_box[:, :2][assign_mask] /= assigned_priors_wh
# here we divide by the "variance" of cx, cy of the prior boxes
encoded_box[:, :2][assign_mask] /= assigned_priors[:, -4:-2]
# the following line => g_hat for wh
encoded_box[:, 2:4][assign_mask] = np.log(box_wh / assigned_priors_wh)
# here we divide by the "variance" of wh of the prior boxes
encoded_box[:, 2:4][assign_mask] /= assigned_priors[:, -2:]
```

May I ask: what are these "variances" on a conceptual level, and why are they used? Why not just use the \hat{g} definitions? Thanks to this thread, I understand that they are incorporated into the training pipeline, but on a conceptual level I cannot match these variances with any concept outlined in the paper, or with any mathematical definition of variance that I know of. Also, how are these variances initialized? I.e., why are they set to [0.1] as a default in PriorBox? I'd really appreciate some clarification here! Thank you.
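For reference, the quoted lines can be turned into a self-contained sketch. The names and data here are hypothetical (one ground-truth box matched to two priors, with the last four columns of each prior row holding the variances, as PriorBox concatenates them):

```python
import numpy as np

# Hypothetical data: one matched ground-truth box, two assigned priors.
box_center = np.array([0.52, 0.48])
box_wh = np.array([0.25, 0.18])

assigned_priors = np.array([
    # cx,   cy,   w,    h,    var_cx, var_cy, var_w, var_h
    [0.50, 0.50, 0.20, 0.20, 0.1,    0.1,    0.2,   0.2],
    [0.55, 0.45, 0.30, 0.30, 0.1,    0.1,    0.2,   0.2],
])
assigned_priors_center = assigned_priors[:, :2]
assigned_priors_wh = assigned_priors[:, 2:4]

encoded = np.empty((2, 4))
# g_hat for cx, cy, then divided by the "variance" of cx, cy
encoded[:, :2] = (box_center - assigned_priors_center) / assigned_priors_wh
encoded[:, :2] /= assigned_priors[:, -4:-2]
# g_hat for w, h, then divided by the "variance" of w, h
encoded[:, 2:4] = np.log(box_wh / assigned_priors_wh)
encoded[:, 2:4] /= assigned_priors[:, -2:]
print(encoded)
```

Note that dividing \hat{g} by a fixed constant is just a linear reparameterization of the same targets: the network can represent either form, and only the scale of the loss gradients changes.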

### villanuevab commented May 23, 2017 • edited

 @oarriaga I edited my question to be more precise; I think you addressed my confusion on why they are included in the code, even if they are not in the paper. Thank you! Can you elaborate on why they are called "variances" and how they are initialized?
Contributor Author

### oarriaga commented May 23, 2017

I believe the term "variances" is misleading, and they should be called something like "box_scale_factors"; at least, that's what I call them in my SSD implementation.
Owner

### rykov8 commented May 23, 2017 • edited

@villanuevab here the author of the original paper comments on the variances. Probably the naming comes from the idea that the ground-truth bounding boxes are not always precise; in other words, they vary from image to image, probably even for the same object in the same position, simply because human labellers cannot repeat themselves perfectly. Thus the encoded values are essentially random values, and we want them to have unit variance, which is why we divide by some value. Why they are initialized to the values used in the code, I have no idea; probably some empirical estimation by the authors.
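The "unit variance" reading above can be illustrated with a quick sketch. This is an assumption-laden toy example (simulated offsets, not real SSD data): if the raw encoded offsets cluster tightly around zero, dividing by their empirical standard deviation whitens them to unit variance, and the fixed 0.1 / 0.2 constants can be read as a hard-coded estimate of that spread:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated raw encoded center offsets: small, tightly clustered values,
# as they would be after matching ground truth to nearby priors.
raw_offsets = rng.normal(loc=0.0, scale=0.1, size=10_000)

std = raw_offsets.std()
whitened = raw_offsets / std   # divide by the empirical spread ("variance")

print(std)            # should be close to 0.1
print(whitened.std()) # exactly 1.0 by construction -> unit variance
```

Strictly speaking, the divisor is a standard deviation rather than a variance, which may be part of why the naming feels off.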

### MicBA commented Jun 8, 2017

Hi @rykov8, when I evaluate the model and look at the numbers the network returns, I can see that the variance is always the same. Is that OK? If it is supposed to be the same, why not store it as a default parameter instead of a network output?

### stavBodik commented Nov 29, 2017 • edited

When I test the assign_boxes function, I get negative coordinates for the assigned bounding boxes, which is caused by the encoding after the IOU step inside the encode_box function. The variances are coefficients for encoding/decoding the locations of the bounding boxes: the first value is used to encode/decode the coordinates of the centers, and the second value is used to encode/decode the sizes of the bounding boxes. Why is the encoding/decoding needed? Is it for using fewer parameters while training (2 instead of 4)? For faster optimization? Fewer calculations?
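A round-trip sketch helps answer both points. These are hypothetical stand-ins for the repo's encode_box / decode_boxes logic, assuming `(cx, cy, w, h)` boxes and the default variances: the encoding keeps all 4 parameters (it saves nothing), negative encoded values are expected because center offsets are signed, and the benefit is that targets are normalized relative to each prior (with `exp` guaranteeing positive decoded sizes):

```python
import numpy as np

variances = np.array([0.1, 0.1, 0.2, 0.2])  # default SSD scale factors

def encode(box, prior):
    # box and prior are (cx, cy, w, h); result can be negative (signed offsets)
    return np.array([
        (box[0] - prior[0]) / prior[2],
        (box[1] - prior[1]) / prior[3],
        np.log(box[2] / prior[2]),
        np.log(box[3] / prior[3]),
    ]) / variances

def decode(loc, prior):
    # exact inverse of encode: exp() makes the decoded sizes always positive
    return np.array([
        loc[0] * variances[0] * prior[2] + prior[0],
        loc[1] * variances[1] * prior[3] + prior[1],
        prior[2] * np.exp(loc[2] * variances[2]),
        prior[3] * np.exp(loc[3] * variances[3]),
    ])

prior = np.array([0.5, 0.5, 0.2, 0.2])
box = np.array([0.52, 0.48, 0.25, 0.18])

encoded = encode(box, prior)
assert encoded[1] < 0                                  # signed offset: box is above prior center
assert np.allclose(decode(encoded, prior), box)        # lossless round trip
```

So the encoding is not about using fewer parameters; it is a normalization that makes the regression targets well scaled regardless of prior size and position.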

### leimao commented Apr 9, 2019

In my opinion, most implementations of bounding box encoding/decoding with variances are conceptually incorrect. Please check my post at https://leimao.github.io/blog/Bounding-Box-Encoding-Decoding/ and provide some feedback on what you think. Thank you.

### mattroos commented May 23, 2019

 Thank you, @leimao. Very helpful blog post!