
ECA-Net Efficient Channel Attention #82

Merged
merged 11 commits into huggingface:attention on Feb 9, 2020

Conversation

chris-ha458
Contributor

This is my initial implementation of ECA-Net's ECA module.
The ECA module is a highly efficient way (in terms of FLOPs, throughput, and parameter count) to implement channel attention.
The original paper (https://arxiv.org/abs/1910.03151) shows it to be competitive with, or to outperform, other well-known methods such as Squeeze-and-Excitation.

This initial pull request implements the eca_layer and applies it to the base ResNet that pytorch-image-models uses.
I also added a test network, "ecaresnext26tn_32x4d", which differs from "seresnext26tn_32x4d" only in that it replaces each SE layer with an eca_layer (it also does not have pretrained weights).

So far I have tested this code against "seresnext26tn_32x4d" on MNIST and CIFAR, and it looks okay.

This pull also includes my initial implementation of circular ECA.
The original ECA, for some reason, implemented channel-wise attention with zero (default) padding. However, since channels have no inherent ordering, there is no reason the channels at the 'edges' should be left without neighbors. To give every channel an equal number of neighbors to convolve with, I manually implemented circular padding.
Although my own testing showed it to be competitive with or better than the original ECA, I wanted further review before integrating it more deeply into the codebase.
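For context, the core of the layer is roughly the following (a sketch following the original eca_module.py, not necessarily line-for-line what I pushed):

    import torch.nn as nn

    class eca_layer(nn.Module):
        """Global average pool, 1D conv across the channel dimension, sigmoid gate."""
        def __init__(self, channel, k_size=3):
            super().__init__()
            self.avg_pool = nn.AdaptiveAvgPool2d(1)
            self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=(k_size - 1) // 2, bias=False)
            self.sigmoid = nn.Sigmoid()

        def forward(self, x):
            # (N, C, H, W) -> (N, C, 1, 1) channel descriptors
            y = self.avg_pool(x)
            # (N, C, 1, 1) -> (N, 1, C): convolve across neighbouring channels
            y = self.conv(y.squeeze(-1).transpose(-1, -2))
            # back to (N, C, 1, 1) gating weights
            y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))
            return x * y.expand_as(x)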

@rwightman what do you think?
Please feel free to point out errors, necessary style adjustments, etc. that need to be dealt with before this code can be pulled.
If further testing is required, please let me know the scope and extent of the testing you'd like to see.
Unfortunately, I can only manage ImageNet training/testing through a Google Colab account with its 30 hours of free GPU time, so please keep that in mind.

Thanks.

implement ECA module by
1. adopting original eca_module.py into models folder
2. adding use_eca layer besides every instance of SE layer
@rwightman
Collaborator

Thanks for the PR, I actually have an initial implementation locally that I'm running an experiment with, but the circular padding is a nice idea, so I might pull this one in the end. Have there been any result-wise comparisons with the circular padding?

Also, for mine, I find it much easier to quickly make sense of the shape intent using view vs combining transpose and squeeze/unsqueeze like the original authors did. So my forward is like:

    def forward(self, x):
        y = self.avg_pool(x)
        y = self.conv(y.view(x.shape[0], 1, -1))
        y = self.sigmoid(y.view(x.shape[0], -1, 1, 1))
        return x * y.expand_as(x)

The other design decision I was debating was whether to push a separate use_eca flag through the args and add an .eca attribute to the module, vs making .se changeable with a se_layer arg, in the same way I've gone about allowing the activations and normalization to be easily switched. Hmm

@chris-ha458
Contributor Author

1. The circular padding 'seems' to have competitive or better results compared to the original code.
I am not sure how memory-efficient my code is, so I thought I would wait for further review.
The original issue raised on the PyTorch GitHub suggests other padding methods using F.pad, which I feel is superior.
If you want more tangible benchmarks, let me know the metrics and I'll try to muster what I can.
2. Your code definitely looks better, and I feel the performance difference would be negligible.
I +1 your version of the code.

I have not seen code utilizing multiple attention modules (aside from SK+SE, obviously, but SK isn't an attention module in the traditional sense).
Maybe we could separate the SE layer and make it interchangeable.

Sample toy code:

    from attention_modules import *

    se_layer = cbam, se, eca, etc...

I think it could be a good opportunity to add other easy-to-implement attention modules too (such as CBAM).

@chris-ha458
Contributor Author

  1. In short, I support the 'same way as I've gone to allow the activations and normalization to be easily switched' approach.
    That's one of the reasons I really like your codebase and decided to contribute in the first place.

functionally similar
adjusted rwightman's version of reshaping and viewing.
Use F.pad for circular eca version for cleaner code
@chris-ha458
Contributor Author

I rebased my code on the most recent master, cleaned up the code to express shape intent using view, and, in a similar vein, changed my hacky padding into an F.pad implementation. The 'circular' mode of F.pad is not properly documented, but at least it isn't bugged.
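For reference, the circular version now looks roughly like this (a sketch of the approach; ceca_layer is just a placeholder name, not necessarily the exact code in the branch):

    import torch.nn as nn
    import torch.nn.functional as F

    class ceca_layer(nn.Module):
        """ECA with circular padding, so every channel sees the same number of
        neighbours in the 1D channel convolution."""
        def __init__(self, channel=None, k_size=3):
            super().__init__()
            self.avg_pool = nn.AdaptiveAvgPool2d(1)
            # pad manually with F.pad, so the conv itself uses padding=0
            self.padding = (k_size - 1) // 2
            self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=0, bias=False)
            self.sigmoid = nn.Sigmoid()

        def forward(self, x):
            # (N, C, H, W) -> (N, C, 1, 1) -> (N, 1, C)
            y = self.avg_pool(x).view(x.shape[0], 1, -1)
            # wrap the channel sequence so the 'edge' channels get neighbours on both sides
            y = F.pad(y, (self.padding, self.padding), mode='circular')
            y = self.conv(y)
            y = self.sigmoid(y.view(x.shape[0], -1, 1, 1))
            return x * y.expand_as(x)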

@rwightman
Collaborator

rwightman commented Feb 6, 2020

@vrandme great, thanks

I have another branch, select_kernel, with the SelectiveKernelNet impl on it. I'm debating whether I want to get that merged in before this or the other way around, and whether to do some other refactoring first or after. So it will likely be next week before I've made a final decision and am ready to merge.

A few minor things to clean up in the meantime:

  • I like to keep class names CamelCase, and was thinking EcaModule vs EcaLayer (mostly since that was the suffix used for SE)
  • A few commas that should have spaces after them

The one other thing is the decision to have both use_se and use_eca, or an attn_layer=EcaModule/CecaModule/SEModule ... I realize they could be used together as currently implemented, but not sure that would actually be worthwhile? Have you tried the two together and compared the result vs each individually? I'm currently leaning towards self.se = attn_layer(outplanes)... I was thinking, if at some point the desire to combine multiple attn modules at that location exists, could just pass a list and do roughly:

    if isinstance(attn_layer, (list, tuple)):
        self.se = nn.Sequential(*[a(outplanes) for a in attn_layer])
    else:
        self.se = attn_layer(outplanes)

CamelCased EcaModule.
Renamed all instances of ecalayer to EcaModule.
eca_module.py->EcaModule.py
Make pylint happy
(commas, unused imports, missed imports)
@chris-ha458
Contributor Author

I am very interested in your select_kernel branch and have been following it closely.
To be frank, I had hoped to implement Res2Net under selective kernels
(instead of using 5x5 or strided 3x3 kernels as the second branch for SK, just split the higher-resolution paths of the Res2Net and utilize those for attention, etc.).
This turned out to be a much more invasive change than I had anticipated, and I didn't think I could accomplish it in a way that would be acceptable for your codebase, so I let it go for the time being.

After your branch is merged into master, and if there is further interest (from either of us), I might look into it again.

My work depends on yours and not the other way around, so take your time.
In the meantime, tell me which other aspects of the code you want dealt with.

As for my code, I changed all the names, but I'm not sure if I changed them the way you'd like.
Personally, between EcaModule and EcaLayer, I think internal consistency within your code is the most important aspect here, especially since AFAIK there aren't widely accepted norms for attention modules/layers. I've named them EcaModule/CecaModule after SEModule.

As for the spaces after commas, I just followed pylint's instructions, which also pointed out some import issues (I had left out importing functional as F in my pushed code).
If it still isn't according to your style, let me know.

I realize they could be used together as currently implemented, but not sure that would actually be worthwhile? Have you tried the two together and compared the result vs each individually?

I have not done testing myself, but I think it would only make sense to use such modules concurrently, not sequentially.
Sequential (channel) attention does not make sense to me, since the modules would either continuously focus on the same channels, which negates using several of them, or their competing suppression and focus would cancel each other out.
Also, a quick literature search did not show any meaningful examples of stacking several such attention modules.

As for concurrent/parallel use, there would need to be a way to combine their results, which is another design decision (addition, softmax, etc.) that I am not equipped to handle at the moment.

Also, I think the combinatorial explosion (order, selection, parallel vs sequential, etc.) makes it difficult to reason about how to design and test this multiple-attention idea.

In conclusion, I am in favor of the "self.se = attn_layer(outplanes)" idea.

When channel size is given,
calculate adaptive kernel size according to original paper.
Otherwise use the given kernel size(k_size), which defaults to 3
@chris-ha458
Contributor Author

I think I am almost done with my pull request.
I had forgotten to implement adaptive kernel sizes (although the kernel size could be manually set through the k_size argument).
Without this, it would not operate correctly under the "self.se = attn_layer(outplanes)" idea.
It would still work, but it would silently discard the outplanes argument and default to a kernel size of 3. This isn't a significant issue, since even with kernel size 3 it seems to outperform all other attention modules including SE and CBAM, but considering the minimal computation and parameter cost of implementing the full adaptive kernel size, I saw no reason not to.

The original GitHub repo's ResNet seems to implement it the way I did "originally", but the paper is somewhat inconsistent when presenting its findings under kernel size 3 vs adaptive.
I documented the behavior of my code, and it can be used either way.

To recap:

  1. When given a channel size as the first argument, as is common for attention modules, it calculates an adaptive kernel size (see the sketch below).
  2. When a channel size is not given AND k_size is given, that k_size is used for the attention convolution.
  3. When neither is given, it defaults to k_size = 3.
  4. gamma and beta are also exposed as arguments, but I don't see anyone wanting to change them, especially considering the original paper doesn't explore values differing from the defaults in any depth. I left them the way the original paper wrote them.

This was done for both ECA and CECA
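As a rough sketch of the adaptive rule (using the paper's defaults gamma=2 and beta=1; the helper name and the floor of 3 are my own shorthand, not necessarily exactly what I pushed):

    import math

    def adaptive_kernel_size(channels, gamma=2, beta=1):
        # k = |log2(C)/gamma + beta/gamma|, bumped up to the nearest odd number
        t = int(abs(math.log(channels, 2) + beta) / gamma)
        return max(t if t % 2 else t + 1, 3)

    # e.g. adaptive_kernel_size(256) -> 5, adaptive_kernel_size(2048) -> 7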

@chris-ha458
Contributor Author

Have you tried the two together and compared the result vs each individually?
I think I have found a case of this: CBAM (https://github.com/Jongchan/attention-module), which is mentioned in the assembled net paper and the ECA paper, is such an example.

[image attachment]

It is an evolution of BAM, which incorporated spatial attention; CBAM adds channel attention on top in a sequential manner.
It shows better results than SE modules but worse than ECA (according to the ECA paper).
So my initial intuition is wrong, at least for the case where sufficiently different attention mechanisms are applied (in this case, channel and spatial attention).

The spatial attention implemented in CBAM is quite simple and actually quite similar to the original ECA, differing mainly in the dimension it operates on (channel vs spatial).
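Roughly, the CBAM spatial gate looks like this (a sketch from memory, not the reference code, which also adds a BatchNorm after the conv):

    import torch
    import torch.nn as nn

    class SpatialGate(nn.Module):
        """Pool across channels, convolve the 2-channel map, sigmoid-gate the input."""
        def __init__(self, kernel_size=7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

        def forward(self, x):
            # channel-wise average and max pooling: (N, 1, H, W) each
            avg = x.mean(dim=1, keepdim=True)
            mx, _ = x.max(dim=1, keepdim=True)
            # (N, 2, H, W) -> (N, 1, H, W) spatial attention map
            y = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
            return x * y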

Maybe it could be combined with ECA sometime?
The improvement, if present, would be marginal considering how efficient and effective ECA and other SOTA attention modules are, but it might be worth looking into.
I am currently looking for infrastructure for sufficient testing (I'm thinking of getting a subscription to Gradient or the new Colab subscription service), and if I get it, I think I'd be able to explore this idea further.

@rwightman rwightman changed the base branch from master to attention February 9, 2020 20:30
@rwightman rwightman merged commit 46471df into huggingface:attention Feb 9, 2020
@rwightman
Collaborator

@vrandme I've merged this onto a new branch called attention. I'm currently in the process of merging attention with select_kernel and doing some fairly extensive refactoring that should hopefully leave things making sense at the end :)
