Deeplab V3+ and Xception #78

Closed · lromor opened this issue Jul 26, 2020 · 11 comments

lromor commented Jul 26, 2020

Hi!
Great repo.
Could you recommend a configuration file for running an experiment with Deeplab V3+ and an Xception backbone
that achieves results reasonably close to those reported in the paper https://arxiv.org/pdf/1802.02611.pdf?

I'm consistently getting very low mIoU values with the following config:

{
    "name": "DeepLab",
    "n_gpu": 1,
    "use_synch_bn": true,

    "arch": {
        "type": "DeepLab",
        "args": {
            "backbone": "xception",
            "freeze_bn": false,
            "freeze_backbone": false
        }
    },

    "train_loader": {
        "type": "VOC",
        "args":{
            "data_dir": "data/VOCtrainval_11-May-2012",
            "batch_size": 8,
            "crop_size": 513,
            "augment": true,
            "shuffle": true,
            "scale": true,
            "flip": true,
            "rotate": false,
            "blur": false,
            "split": "train_aug",
            "num_workers": 4
        }
    },

    "val_loader": {
        "type": "VOC",
        "args":{
            "data_dir": "data/VOCtrainval_11-May-2012",
            "batch_size": 8,
            "crop_size": 513,
            "val": true,
            "split": "val",
            "num_workers": 4
        }
    },

    "optimizer": {
        "type": "SGD",
        "differential_lr": true,
        "args":{
            "lr": 0.01,
            "weight_decay": 1e-4,
            "momentum": 0.9
        }
    },

    "loss": "CrossEntropyLoss2d",
    "ignore_index": 255,
    "lr_scheduler": {
        "type": "Poly",
        "args": {}
    },

    "trainer": {
        "epochs": 80,
        "save_dir": "saved/",
        "save_period": 10,
  
        "monitor": "max Mean_IoU",
        "early_stop": 10,
        
        "tensorboard": true,
        "log_dir": "saved/runs",
        "log_per_iter": 20,

        "val": true,
        "val_per_epochs": 5
    }
}

PSPNet starts with an mIoU that quickly ramps up. Here, I observe only a very small increase (around 0.06 after one epoch).
Any ideas/suggestions? It seems to be a problem with the xception model; with resnet I don't see the issue.

yassouali (Owner) commented Jul 28, 2020

Hi @lromor,

For deeplab v3+ with the xception backbone, the backbone used is not exactly the same: if you go through the code, you'll see that the checkpoint we load from pretrained-models.pytorch is a smaller version of Xception than the one deeplab v3+ uses, and the layers not present in the checkpoint are initialized from the last layer in the checkpoint. I think this is why you're having problems during training.

I suggest not using the differential learning rate (where the backbone is trained with 0.1 × lr) and instead using the same learning rate across the whole model, so the backbone is trained at the full rate too.

If you like, I'm sure you can find some ported deeplab xception checkpoints; in that case, you can load the correct weights and train normally (I might do this if I have some time).
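
For illustration, here is a minimal sketch of the two options in plain PyTorch, assuming the model exposes helpers along the lines of get_backbone_params / get_decoder_params (hypothetical names used for this example, not necessarily the repo's actual API):

import torch

def build_optimizer(model, lr=0.01, momentum=0.9, weight_decay=1e-4, differential_lr=True):
    if differential_lr:
        # Differential LR: pretrained backbone at lr/10, decoder/ASPP at the full lr.
        params = [
            {"params": model.get_backbone_params(), "lr": lr * 0.1},  # hypothetical helper
            {"params": model.get_decoder_params(), "lr": lr},         # hypothetical helper
        ]
    else:
        # Single group: the whole model, backbone included, trains at the same lr.
        params = model.parameters()
    return torch.optim.SGD(params, lr=lr, momentum=momentum, weight_decay=weight_decay)

With the suggestion above, you would effectively take the else branch (or set "differential_lr": false in the config).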

lromor (Author) commented Jul 28, 2020

I see, thank you for your answer. I'm also noticing another strange behavior with resnet: the validation loss tends to go up after 600 iterations, but the mIoU keeps rising (still on the validation set). Have you also observed this?

yassouali (Owner):
Not really. The validation loss might go up from time to time, but as a general trend, I'd expect it to go down. Maybe try a lower LR.

lromor (Author) commented Aug 3, 2020

I'm seeing the very same diverging validation loss as in #29. I don't know what the problem is. Did you get the same results with a single GPU as well? I'm starting to think that this parallel batch norm might be faulty...

lromor (Author) commented Aug 4, 2020

@yassouali it seems to me that by default, the validation set performs scaling augmentation.
So with the configuration file you posted:


    "val_loader": {
        "type": "VOC",
        "args":{
            "data_dir": "data/VOCtrainval_11-May-2012",
            "batch_size": 8,
            "crop_size": 513,
            "val": true,
            "split": "val",
            "num_workers": 4
        }
    },

scaling might actually end up enabled (scale=true) during validation.

yassouali (Owner):
@lromor the current config runs batched validation, and in base_dataset the augmentations used for validation are different: we only apply a center crop, where the smaller side is automatically rescaled to the crop size. If you want more precise results, you can remove crop_size and use a batch size of 1. That way you'll be sure to evaluate on the original images, but it will take a bit longer.
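
For reference, a rough sketch of the validation-time rescaling plus center crop described above; the function below is an illustrative assumption, not the actual base_dataset code:

import numpy as np
from PIL import Image

def val_transform(image: Image.Image, label: Image.Image, crop_size: int = 513):
    # Rescale so the shorter side equals crop_size, keeping the aspect ratio.
    w, h = image.size
    scale = crop_size / min(w, h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    image = image.resize((new_w, new_h), Image.BILINEAR)
    label = label.resize((new_w, new_h), Image.NEAREST)  # nearest keeps class ids intact

    # Center crop to crop_size x crop_size.
    left = (new_w - crop_size) // 2
    top = (new_h - crop_size) // 2
    image = image.crop((left, top, left + crop_size, top + crop_size))
    label = label.crop((left, top, left + crop_size, top + crop_size))
    return np.asarray(image), np.asarray(label)

Dropping crop_size and evaluating with a batch size of 1, as suggested, would skip this step and feed the original images one by one.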

lromor (Author) commented Aug 4, 2020

You are right, I just noticed that validation does skip those augmentations.

I'm searching for possible causes. I did update the resnet101 backbone and got better results, but only by 1-2%, reaching ~75% on the validation set.
One thing I want to ask you: it seems that no matter what, after 5 epochs the validation loss starts (slowly) increasing.
Do you think this is expected? The mIoU and the other metrics do improve over time on the validation set, but it's somewhat counterintuitive for the validation loss to go (slightly but steadily) upwards.

It's similar to what's happening in the plot at the bottom of this page:
https://www.tensorflow.org/tutorials/images/segmentation

One last thing: the original paper uses multi-grid. I'm not sure what that means exactly, but maybe this implementation lacks that method, leading to lower accuracies.
What do you think? Otherwise, I can't think of other issues.
Maybe it's the fact that I'm fine-tuning all the backbone parameters (and not just the batch norm)?

yassouali (Owner):
I think the multi-grid refers to using different assp_branch modules on top of the backbone with different dilation rates. As for the mIoU difference (the expected result with a resnet101 is 78%), as you said, I certainly think the batch norm plays some role in this. Fine-tuning batch norm is always tricky; maybe try different learning rates for the batch norm layers. Currently the backbone is fine-tuned with LR/10, so you could try the full LR for only the batch norm layers, or, as you said, fine-tune only the batch norm layers and freeze the rest of the backbone. I think torchvision has a reference implementation of deeplab v3 if you want to take a look.
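
A hedged sketch of the batch-norm learning-rate idea using standard PyTorch parameter groups (illustrative only, not the repository's optimizer code):

import torch.nn as nn

def backbone_param_groups(backbone: nn.Module, lr: float):
    # Give the backbone's batch-norm parameters the full lr while the rest of the
    # backbone keeps lr/10; pass the returned groups to the optimizer.
    bn_params, other_params = [], []
    for module in backbone.modules():
        if isinstance(module, (nn.BatchNorm2d, nn.SyncBatchNorm)):
            bn_params.extend(module.parameters(recurse=False))
        else:
            other_params.extend(module.parameters(recurse=False))
    return [
        {"params": bn_params, "lr": lr},           # full LR for batch norm only
        {"params": other_params, "lr": lr * 0.1},  # rest of the backbone at lr/10
    ]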

And of course, please open a pull request with your findings and corrections to the code if you find any improvements.

lromor (Author) commented Aug 4, 2020

@yassouali, this is a run using the torchvision backbone.
I still used your ASPP implementation and decoder.
This is the problem I was talking about (ignore all the colored dots except the orange):
[TensorBoard screenshot: training loss decreasing while validation loss increases]

The training loss goes down; the validation loss goes up.
Note that even though the validation loss trends upwards from the start, the initial part (the first 5 epochs) is missing from the plot, and obviously when the model is freshly initialized the validation loss is much, much higher.

The upward trend is still relatively small, but the interesting thing is that the mIoU keeps improving, as does the class accuracy.

I wonder how that's possible.

yassouali (Owner):
@lromor

Sorry for the delay.

Yeah, the behavior is certainly interesting, but maybe since the backbone is pretrained, we are already close to a local minimum, and the model only needs to search in the vicinity of that minimum for the optimal performance. That might explain why the loss remains relatively small, suggesting that no real overfitting is taking place.

lromor (Author) commented Aug 18, 2020

I see, thanks for your help. In my fork I have an updated version with newer backbones, but as you can see, they don't really change the results much.

Regarding the increasing loss: that makes sense. I thought it was strange for the validation mIoU and loss to have diverging trends.
