Loss & accuracy #21

Open
szerintedmi opened this issue May 17, 2020 · 8 comments

@szerintedmi
Contributor

szerintedmi commented May 17, 2020

I'm trying to retrain your model for our specific use case. I'm training with images augmented from a 30k set. I also added accuracy calculations and validation.

The loss and accuracy seem to stall no matter how I change the learning rate.
What would you recommend? Should I just train longer? Should I try to "freeze" some layers or lower the LR on part of them (which layers? all encoders?)? Or is this as far as it can get?
Have you experimented with different LR schedules (cyclic, etc.)?

I ran these with 120k training images (50 epochs, 200 iterations each, batch size 12). Validation set: 600 images, evaluated after each epoch.

Training from scratch

[loss & accuracy charts]

Training from your pre-trained model (173.6 MB), LR=0.001 (same as yours)

[loss & accuracy charts]

LR reduced to 0.0001 (on the pre-trained model)

[loss & accuracy charts]

@xuebinqin
Owner

xuebinqin commented May 18, 2020 via email

@szerintedmi
Contributor Author

@Nathanua, thank you for taking the time to answer. Very useful tips, much appreciated.

> You can think about increasing the filter numbers of the network.

Do you mean increasing "M" (the number of channels in the internal layers of the RSUs), or increasing the number of layers ("L")?

> In our current version of the model, we use 6 side outputs to reduce overfitting. You can also try to disable some or all of the side-output supervision to increase the model capacity.

Just to double-check my understanding: you suggest using only the BCE of the final fused output as the training loss (in your train code it's called loss2 or tar)?
Is avoiding overfitting the only reason you use this fused multi-output BCE loss instead of the last-output loss alone? In that case it makes total sense to me to use only the d0 loss for training as long as we don't overfit.
It might be trivial, but I can't get my head around why that would increase model capacity. By the way, what do you mean by "model capacity"? :-)
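For reference, a minimal sketch of the two loss variants being discussed (assuming a PyTorch model whose forward pass returns the fused map d0 plus side outputs d1–d6 with sigmoid activations, as in the U-2-Net training script; the function name below is mine, not the one in the repo):

```python
import torch.nn as nn

bce_loss = nn.BCELoss(reduction='mean')

def multi_output_bce(d0, d1, d2, d3, d4, d5, d6, labels):
    """BCE on the fused output d0 and on each side output d1..d6."""
    losses = [bce_loss(d, labels) for d in (d0, d1, d2, d3, d4, d5, d6)]
    fused_loss = losses[0]    # loss on d0 only (the "loss2"/"tar" mentioned above)
    total_loss = sum(losses)  # deep supervision: fused output + all side outputs
    return fused_loss, total_loss

# Full deep supervision (the default discussed above):
#   fused, total = multi_output_bce(*net(inputs), labels); total.backward()
# Fused output only (side-output supervision disabled):
#   fused, total = multi_output_bce(*net(inputs), labels); fused.backward()
```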

> BTW, I am not sure which exact accuracy measure you are using, but I suggest using IoU or F-measure to evaluate the segmentation performance.

We used MAE, but good call; we will try IoU and F-measure.
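A minimal NumPy sketch of the two suggested metrics (the 0.5 threshold and beta^2 = 0.3 for the F-measure are my assumptions, not values from this thread):

```python
import numpy as np

def iou_and_f_measure(pred, gt, thresh=0.5, beta2=0.3, eps=1e-8):
    """IoU and F-measure for one predicted saliency map vs. a binary ground truth.

    pred, gt: float arrays in [0, 1] of the same shape; thresh binarizes the prediction.
    beta2 is the beta^2 weighting commonly used for the F-measure in saliency papers.
    """
    p = (pred >= thresh).astype(np.float64)
    g = (gt >= 0.5).astype(np.float64)

    inter = (p * g).sum()
    union = p.sum() + g.sum() - inter
    iou = inter / (union + eps)

    precision = inter / (p.sum() + eps)
    recall = inter / (g.sum() + eps)
    f_measure = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    return iou, f_measure
```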

> (I suggest training your model from scratch, not starting from our pre-trained weights.)

So far our results improve much more quickly when we start from your pre-trained weights (as you can see from the charts above). Do you think this trend would change if we trained from scratch for longer?

> In addition, your validation set is too small compared with your training set. I think that's why your validation losses are even smaller than your training losses.

Indeed; we already increased it to 2,000 validation samples and will experiment to see whether we need more.

> RES: We didn't test that much on LR. We use the Adam optimizer with default settings.

I just tried lowering the LR because the loss seemed to be oscillating, but the training might not have been long enough to see the trend. Happy to share our findings if you are interested.
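For anyone following along, a sketch of the setup being described: Adam with default betas at LR 1e-3, plus an optional scheduler that lowers the LR when the validation loss plateaus (the scheduler is my addition, not part of the original training script):

```python
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

def build_optimizer(net):
    # Adam with default betas/eps at LR 1e-3, as mentioned above.
    optimizer = optim.Adam(net.parameters(), lr=1e-3, betas=(0.9, 0.999),
                           eps=1e-8, weight_decay=0)
    # Optional (my addition): drop the LR automatically when the validation
    # loss stops improving, instead of lowering it by hand.
    scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)
    return optimizer, scheduler

# After each validation pass: scheduler.step(val_loss)
```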

@szerintedmi
Contributor Author

BTW, here are the results from 100 iterations (~240k samples) with 2,000 validation samples per iteration.
[loss & accuracy charts]

@xuebinqin
Owner

xuebinqin commented May 21, 2020 via email

@szerintedmi
Contributor Author

szerintedmi commented May 23, 2020

Thanks, @Nathanua!

We ran some trainings optimizing only the d0 loss (the fusion loss without the side-output losses). It didn't bring any significant improvement; qualitatively the results look the same as with your original loss.
UPDATE: I just ran a small test on 300 samples comparing loss vs. loss2, and the correlation seems pretty strong, so it's no surprise that changing the loss function didn't make a noticeable difference.
[loss vs. loss2 chart]
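A quick way to check that kind of correlation once per-sample values of both losses have been collected (the helper name and inputs are illustrative, not from the training script):

```python
import numpy as np

def loss_correlation(per_sample_loss, per_sample_loss2):
    """Pearson correlation between the summed multi-output loss and the
    d0-only loss2, given one value of each per sample."""
    return np.corrcoef(np.asarray(per_sample_loss),
                       np.asarray(per_sample_loss2))[0, 1]
```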

You mentioned you trained for 120 hours with 10k images augmented to 20k. We trained with 30k images augmented to 240k, and it finishes in 9-10 hours on a single-GPU Colab instance. Why is it so much faster for us? Did you feed the same images multiple times during training? If so, why?

@ohheysherry66

> I'm trying to retrain your model for our specific use case. […] I also added accuracy calculations and validation.

Hi, could you please tell me where to add the validation process? It seems to change a lot. Thank you.

@xuebinqin
Owner

xuebinqin commented Jul 17, 2020 via email

@ohheysherry66

> You can first define a testing dataloader just after the training dataloader, and then feed the testing data inside the `if ite_num % save_frq == 0:` block with a for loop, just like `for i, data in enumerate(salobj_dataloader):`. Before the testing for loop you need to switch the net to evaluation mode with `net.eval()`, and switch it back to training mode with `net.train()` after the validation.
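In code, that description maps to something like the sketch below. Variable names (`net`, `ite_num`, `save_frq`, `muti_bce_loss_fusion`, the `'image'`/`'label'` keys) follow the patterns of u2net_train.py, while `val_salobj_dataset` and `batch_size_val` are hypothetical placeholders for your own validation set:

```python
import torch
from torch.utils.data import DataLoader

# A second dataloader over the validation set, defined next to the training one.
val_salobj_dataloader = DataLoader(val_salobj_dataset, batch_size=batch_size_val,
                                   shuffle=False, num_workers=1)

# ... later, inside the training loop ...
if ite_num % save_frq == 0:
    net.eval()                          # switch to evaluation mode
    val_loss = 0.0
    with torch.no_grad():               # no gradients needed for validation
        for i, data in enumerate(val_salobj_dataloader):
            inputs = data['image'].type(torch.FloatTensor)
            labels = data['label'].type(torch.FloatTensor)
            if torch.cuda.is_available():
                inputs, labels = inputs.cuda(), labels.cuda()
            d0, d1, d2, d3, d4, d5, d6 = net(inputs)
            loss2, loss = muti_bce_loss_fusion(d0, d1, d2, d3, d4, d5, d6, labels)
            val_loss += loss.item()
    print("validation loss: %.6f" % (val_loss / len(val_salobj_dataloader)))
    net.train()                         # back to training mode
```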


Thank you, big help!
