update and fix bugs #51

Merged: 1 commit, May 15, 2018

Conversation

zhanghang1989
Owner

This PR should have addressed most of the issues:
fixes #48
fixes #47
fixes #46
fixes #45
fixes #35

d-li14 commented May 15, 2018

Sorry, in my case the loss explosion issue during evaluation reported in #35 and #45 still exists. I am using the PyTorch 0.4.0 compatible code, from encoding.nn import BatchNorm2d. Is there anything wrong? @zhanghang1989

@zhanghang1989
Owner Author

Hi @d-li14, the code is working okay with the PyTorch master branch.
Can you provide a minimal piece of code that reproduces the error? I will take a look into it.

d-li14 commented May 15, 2018

Thanks for your timely reply. I have tried the module in some semantic segmentation code. Taking drn training on the Cityscapes dataset as a simple example, just replacing this line of the original syncbn with encoding.nn.BatchNorm2d makes the losses during evaluation look weird, like this (a sketch of the swap follows the log):

[2018-05-16 03:24:16,377 segment.py:232 validate] Test: [0/31]  Time 16.048 (16.048)    Loss 26639412.0000 (26639412.0000)    Score 0.593 (0.593)
[2018-05-16 03:24:20,453 segment.py:232 validate] Test: [10/31] Time 0.405 (1.829)      Loss 33462836.0000 (29009771.8182)    Score 0.880 (0.561)
[2018-05-16 03:24:28,128 segment.py:232 validate] Test: [20/31] Time 0.412 (1.324)      Loss 20237872.0000 (27516075.8095)    Score 2.562 (0.626)
[2018-05-16 03:24:32,216 segment.py:232 validate] Test: [30/31] Time 0.423 (1.029)      Loss 15786181.0000 (22639639.2903)    Score 2.251 (0.671)
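
For reference, a minimal sketch of the kind of drop-in swap described above; the model below is an illustrative stand-in, not the actual drn network, and only the from encoding.nn import BatchNorm2d import comes from this repo.

import torch.nn as nn
from encoding.nn import BatchNorm2d   # synchronized BN shipped with this repo

def build_toy_segmenter(norm_layer=nn.BatchNorm2d):
    # Stand-in for the real segmentation network: the only thing that changes
    # when swapping in sync BN is which class builds the normalization layers.
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1),
        norm_layer(16),
        nn.ReLU(inplace=True),
        nn.Conv2d(16, 19, 1),   # 19 Cityscapes classes
    )

model = build_toy_segmenter(norm_layer=BatchNorm2d)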

@zhanghang1989
Owner Author

Typically, it is not necessary to calculate the loss in evaluation mode. You can try adding the following code at https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/encoding/nn/syncbn.py#L41

def forward(self, input):
    if not self.training:
        # In eval mode, fall back to plain batch normalization
        # (torch.nn.functional.batch_norm) with the accumulated running stats.
        return batch_norm(
            input, self.running_mean, self.running_var, self.weight, self.bias,
            self.training, self.momentum, self.eps)
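
As a quick self-contained check of what that early return does (using the stock torch.nn.BatchNorm2d here rather than the sync version), the eval-mode path is just the functional batch_norm applied with the running statistics:

import torch
import torch.nn as nn
import torch.nn.functional as F

bn = nn.BatchNorm2d(3).eval()
x = torch.randn(2, 3, 4, 4)

with torch.no_grad():
    y_module = bn(x)
    y_functional = F.batch_norm(
        x, bn.running_mean, bn.running_var, bn.weight, bn.bias,
        training=False, momentum=bn.momentum, eps=bn.eps)

print(torch.allclose(y_module, y_functional))   # True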

d-li14 commented May 15, 2018

Well, exactly, there is no need to calculate the loss in evaluation mode. But as the log above shows, the score is also abnormally low compared to training mode, which suggests the model makes completely unreasonable predictions during evaluation.

@penguinshin

I am not a user of this repo, but I am experiencing a similar issue with my own train/eval code. Basically, if I train a one-layer neural net in train mode on a single batch, then switch to eval mode and evaluate on the same batch I trained on, the loss is much worse. I'm not sure whether I'm doing something silly or whether there is a bug in PyTorch.

@zhanghang1989
Owner Author

That is because of the different behavior of the BatchNorm layer in training/eval mode.

@penguinshin

Yes, but this problem is still happening with a single batch only. Surely after training 100 steps on a single batch in training mode, the running mean/std parameters wouldn't change, right?

@zhanghang1989
Owner Author

train mode: using mean(x) and var(x)
eval mode: using accumulated_mean and accumulated_var, which are averaged over the dataset
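
A minimal sketch of that difference with the stock torch.nn.BatchNorm2d (momentum defaults to 0.1, so after a single training step the running statistics are still close to their initial values of 0 and 1):

import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(1)
x = 5.0 + 3.0 * torch.randn(8, 1, 4, 4)    # batch far from N(0, 1)

bn.train()
y_train = bn(x)             # normalized with mean(x) and var(x) of this batch

bn.eval()
with torch.no_grad():
    y_eval = bn(x)          # normalized with running_mean / running_var

print(y_train.mean().item())    # roughly 0: batch statistics remove the offset
print(y_eval.mean().item())     # far from 0: running stats are still near init

After many updates on the same batch the running estimates do converge to the batch statistics, which is why the train/eval gap usually closes once enough training steps have accumulated.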

penguinshin commented Oct 17, 2018 via email
