Deeplab V3+ and Xception #78

Closed · lromor opened this issue Jul 26, 2020 · 11 comments

lromor commented Jul 26, 2020

Hi!
Great repo.
Could you recommend a configuration file for running an experiment with Deeplab V3+ and an Xception backbone
that achieves results reasonably close to those reported in the paper https://arxiv.org/pdf/1802.02611.pdf?

I'm consistently getting very low mIoU values with the following config:

{
    "name": "DeepLab",
    "n_gpu": 1,
    "use_synch_bn": true,

    "arch": {
        "type": "DeepLab",
        "args": {
            "backbone": "xception",
            "freeze_bn": false,
            "freeze_backbone": false
        }
    },

    "train_loader": {
        "type": "VOC",
        "args":{
            "data_dir": "data/VOCtrainval_11-May-2012",
            "batch_size": 8,
            "crop_size": 513,
            "augment": true,
            "shuffle": true,
            "scale": true,
            "flip": true,
            "rotate": false,
            "blur": false,
            "split": "train_aug",
            "num_workers": 4
        }
    },

    "val_loader": {
        "type": "VOC",
        "args":{
            "data_dir": "data/VOCtrainval_11-May-2012",
            "batch_size": 8,
            "crop_size": 513,
            "val": true,
            "split": "val",
            "num_workers": 4
        }
    },

    "optimizer": {
        "type": "SGD",
        "differential_lr": true,
        "args":{
            "lr": 0.01,
            "weight_decay": 1e-4,
            "momentum": 0.9
        }
    },

    "loss": "CrossEntropyLoss2d",
    "ignore_index": 255,
    "lr_scheduler": {
        "type": "Poly",
        "args": {}
    },

    "trainer": {
        "epochs": 80,
        "save_dir": "saved/",
        "save_period": 10,
  
        "monitor": "max Mean_IoU",
        "early_stop": 10,
        
        "tensorboard": true,
        "log_dir": "saved/runs",
        "log_per_iter": 20,

        "val": true,
        "val_per_epochs": 5
    }
}

PSPNet starts with an mIoU that quickly ramps up. Here, I observe only a very small increase (around 0.06 after one epoch).
Any ideas/suggestions? It seems to be a problem with the xception model; with resnet I don't see the issue.

yassouali (Owner) commented Jul 28, 2020

Hi @lromor,

For deeplab v3+ with the xception backbone, the backbone used is not exactly the same: if you go through the code, you'll see that the checkpoint we load from pretrained-models.pytorch is a smaller version of Xception than the one deeplab v3+ uses, and the layers not present in the checkpoint are initialized from the last layer in the checkpoint. I think this is why you're having problems during training.

I suggest not using the differential learning rate (where the backbone is trained with 0.1 × lr) and instead using the same learning rate across the whole model, so the backbone is trained at the full rate too.

If you like, I'm sure you can find some ported deeplab xception checkpoints; in that case, you can load the correct weights and train normally (I might do this if I have some time).
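
For illustration, here is a minimal sketch of the two options in plain PyTorch, assuming the model exposes helpers along the lines of get_backbone_params / get_decoder_params (hypothetical names used for this example, not necessarily the repo's actual API):

import torch

def build_optimizer(model, lr=0.01, momentum=0.9, weight_decay=1e-4, differential_lr=True):
    if differential_lr:
        # Differential LR: pretrained backbone at lr/10, decoder/ASPP at the full lr.
        params = [
            {"params": model.get_backbone_params(), "lr": lr * 0.1},  # hypothetical helper
            {"params": model.get_decoder_params(), "lr": lr},         # hypothetical helper
        ]
    else:
        # Single group: the whole model, backbone included, trains at the same lr.
        params = model.parameters()
    return torch.optim.SGD(params, lr=lr, momentum=momentum, weight_decay=weight_decay)

With the suggestion above, you would effectively take the else branch (or set "differential_lr": false in the config).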

lromor (Author) commented Jul 28, 2020

I see, thank you for your answer. I'm also noticing another strange behavior with resnet: the validation loss tends to go up after 600 iterations, but the mIoU keeps rising (still on the validation set). Have you also observed this?

yassouali (Owner):
Not really. The validation loss might go up from time to time, but as a general trend, I'd expect it to go down. Maybe try a lower LR.

lromor (Author) commented Aug 3, 2020

I'm seeing the very same diverging validation loss as in #29. I don't know what the problem is. Did you get the same results with a single GPU as well? I'm starting to think that this parallel batch norm might be faulty...

lromor (Author) commented Aug 4, 2020

@yassouali it seems to me that by default, the validation set performs scaling augmentation.
So with the configuration file you posted:


    "val_loader": {
        "type": "VOC",
        "args":{
            "data_dir": "data/VOCtrainval_11-May-2012",
            "batch_size": 8,
            "crop_size": 513,
            "val": true,
            "split": "val",
            "num_workers": 4
        }
    },

scaling might actually end up enabled (scale=true) during validation.

yassouali (Owner):
@lromor the current config runs batched validation, and in base_dataset the augmentations used for validation are different: we only apply a center crop, where the smaller side is automatically rescaled to the crop size. If you want more precise results, you can remove crop_size and use a batch size of 1. That way you'll be sure to evaluate on the original images, but it will take a bit longer.
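
For reference, a rough sketch of the validation-time rescaling plus center crop described above; the function below is an illustrative assumption, not the actual base_dataset code:

import numpy as np
from PIL import Image

def val_transform(image: Image.Image, label: Image.Image, crop_size: int = 513):
    # Rescale so the shorter side equals crop_size, keeping the aspect ratio.
    w, h = image.size
    scale = crop_size / min(w, h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    image = image.resize((new_w, new_h), Image.BILINEAR)
    label = label.resize((new_w, new_h), Image.NEAREST)  # nearest keeps class ids intact

    # Center crop to crop_size x crop_size.
    left = (new_w - crop_size) // 2
    top = (new_h - crop_size) // 2
    image = image.crop((left, top, left + crop_size, top + crop_size))
    label = label.crop((left, top, left + crop_size, top + crop_size))
    return np.asarray(image), np.asarray(label)

Dropping crop_size and evaluating with a batch size of 1, as suggested, would skip this step and feed the original images one by one.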

lromor (Author) commented Aug 4, 2020

You are right, I just noticed that validation does skip those augmentations.

I'm searching for possible causes. I did update the resnet101 backbone and got better results, but only by 1-2%, reaching ~75% on the validation set.
One thing I want to ask you: it seems that no matter what, after 5 epochs the validation loss starts (slowly) increasing.
Do you think this is expected? The mIoU and the other metrics do improve over time on the validation set, but it's somewhat counterintuitive for the validation loss to go (slightly but steadily) upwards.

It's similar to what's happening in the plot at the bottom of this page:
https://www.tensorflow.org/tutorials/images/segmentation

One last thing: the original paper uses multi-grid. I'm not sure what that means exactly, but maybe this implementation lacks that method, leading to lower accuracies.
What do you think? Otherwise, I can't think of other issues.
Maybe it's the fact that I'm fine-tuning all the backbone parameters (and not just the batch norm)?

yassouali (Owner):
I think the multi-grid refers to using different assp_branch modules on top of the backbone with different dilation rates. As for the mIoU difference (the expected result with a resnet101 is 78%), as you said, I certainly think the batch norm plays some role in this. Fine-tuning batch norm is always tricky; maybe try different learning rates for the batch norm layers. Currently the backbone is fine-tuned with LR/10, so you could try the full LR for only the batch norm layers, or, as you said, fine-tune only the batch norm layers and freeze the rest of the backbone. I think torchvision has a reference implementation of deeplab v3 if you want to take a look.
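
A hedged sketch of the batch-norm learning-rate idea using standard PyTorch parameter groups (illustrative only, not the repository's optimizer code):

import torch.nn as nn

def backbone_param_groups(backbone: nn.Module, lr: float):
    # Give the backbone's batch-norm parameters the full lr while the rest of the
    # backbone keeps lr/10; pass the returned groups to the optimizer.
    bn_params, other_params = [], []
    for module in backbone.modules():
        if isinstance(module, (nn.BatchNorm2d, nn.SyncBatchNorm)):
            bn_params.extend(module.parameters(recurse=False))
        else:
            other_params.extend(module.parameters(recurse=False))
    return [
        {"params": bn_params, "lr": lr},           # full LR for batch norm only
        {"params": other_params, "lr": lr * 0.1},  # rest of the backbone at lr/10
    ]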

And of course, please open a pull request with your findings and corrections to the code if you find any improvements.

lromor (Author) commented Aug 4, 2020

@yassouali, this is a run using the torchvision backbone.
I still used your ASPP implementation and decoder.
This is the problem I was talking about (ignore all the colored dots except the orange):
[TensorBoard screenshot: training loss decreasing while validation loss increases]

The training loss goes down; the validation loss goes up.
Note that even though the validation loss trends upwards from the start, the initial part (the first 5 epochs) is missing from the plot, and obviously when the model is freshly initialized the validation loss is much, much higher.

The upward trend is still relatively small, but the interesting thing is that the mIoU keeps improving, as does the class accuracy.

I wonder how that's possible.

yassouali (Owner):
@lromor

Sorry for the delay.

Yeah, the behavior is certainly interesting, but maybe since the backbone is pretrained, we are already close to a local minimum, and the model only needs to search in the vicinity of that minimum for the optimal performance. That might explain why the loss remains relatively small, suggesting that no real overfitting is taking place.

lromor (Author) commented Aug 18, 2020

I see, thanks for your help. In my fork I have an updated version with newer backbones, but as you can see, they don't really change the results much.

Regarding the increasing loss: that makes sense. I thought it was strange for the validation mIoU and loss to have diverging trends.
