Add VideoModelZoo models #1130
Conversation
This is looking good, thanks a lot for the PR, Bruno!
I have done a first pass and made a few comments.
Also, lint is failing; can you have a look?
```python
if stride != 1 or self.inplanes != planes * block.expansion:
    ds_stride = stride
    # 2D convolutions should not be downsampled along temporal axis
    if conv_builder.__name__ == "video_2d_conv":
```
This is not a pattern that I generally see in the PyTorch codebase.
What about the following: instead of having `conv_builder` be a function that returns an `nn.Module` instance, make it a class that inherits from `nn.Module`?
So you could have something like

```python
class Conv3dNoTemporal(nn.Conv3d):
    ...

class Conv3dTheOtherOne(nn.Conv3d):
    ...
```

and then, in this check, you do

```python
if isinstance(conv_builder, Conv3dNoTemporal):
    ds_stride = (1, stride, stride)
```

?
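For concreteness, here is a minimal, self-contained sketch of that class-based pattern. The constructor parameters, `Conv3dSimple`, and `downsample_stride` are illustrative names, not the PR's actual code; and since the builder is passed around as a class (it is instantiated later), `issubclass` rather than `isinstance` would be the appropriate check:

```python
import torch.nn as nn


class Conv3dNoTemporal(nn.Conv3d):
    """1x3x3 convolution; never strides along the temporal axis."""

    def __init__(self, in_planes, out_planes, stride=1, padding=1):
        super(Conv3dNoTemporal, self).__init__(
            in_planes, out_planes, kernel_size=(1, 3, 3),
            stride=(1, stride, stride),
            padding=(0, padding, padding), bias=False)


class Conv3dSimple(nn.Conv3d):
    """3x3x3 convolution; strides along time and space alike."""

    def __init__(self, in_planes, out_planes, stride=1, padding=1):
        super(Conv3dSimple, self).__init__(
            in_planes, out_planes, kernel_size=(3, 3, 3),
            stride=(stride, stride, stride),
            padding=(padding, padding, padding), bias=False)


def downsample_stride(conv_builder, stride):
    # conv_builder is the class itself here, not an instance,
    # so we check the type with issubclass.
    if issubclass(conv_builder, Conv3dNoTemporal):
        return (1, stride, stride)
    return stride


print(downsample_stride(Conv3dNoTemporal, 2))  # (1, 2, 2)
print(downsample_stride(Conv3dSimple, 2))      # 2
```

The upside of this pattern is that the dispatch is done on types rather than on the `__name__` string, which survives renames and wrapping.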
```python
class VideoTrunkBuilder(nn.Module):

    def __init__(self, block, conv_makers, model_depth,
```
I wonder if we want to expose `model_depth` or just `layers`, as in the resnet equivalent? If we expose `layers`, then the `BLOCK_CONFIG` can probably go away, and we hard-code the layer configuration inside the function getters.
But that would give us quite a lot of constructors, some 8-10 for each architecture... isn't this, to an extent, cleaner?
Well, it depends. When creating a simple training script, we will customize it to take a `depth` argument to be passed to the model, but not all models support that, so this will make experimenting with other potential backbones harder.
But don't worry about this; I can make a pass afterwards and modify a few things.
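To make the trade-off concrete, here is a toy sketch contrasting the two APIs, with a stand-in `VideoResNet` class; all names and configurations are illustrative, not the PR's actual code:

```python
class VideoResNet(object):
    """Stand-in for the real model class, which builds the network."""

    def __init__(self, block, conv_makers, layers):
        self.layers = layers


# model_depth style: one generic constructor plus a lookup table.
BLOCK_CONFIG = {
    18: (2, 2, 2, 2),
    34: (3, 4, 6, 3),
}


def video_resnet(block, conv_makers, model_depth):
    return VideoResNet(block, conv_makers, BLOCK_CONFIG[model_depth])


# layers style: the table goes away and each getter hard-codes its
# configuration, mirroring the 2D resnet convention (resnet18, ...).
def r3d_18(block=None, conv_makers=None):
    return VideoResNet(block, conv_makers, layers=(2, 2, 2, 2))
```

The `layers` style costs one getter per depth (the 8-10 constructors objected to above), but matches how the other torchvision models are exposed.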
@fmassa I've taken a look and modified the diff to reflect the comments. There are two things that I'm not 100% sure about; please check the unresolved comments for potential issues.
I have a few more comments. Also, tests seem to be failing?
LGTM, thanks!
Can you also add some quick docstrings to the models, with paper references, etc.?
Thanks!
@fmassa added the docstrings; double-check and merge if everything is OK :) Thanks a lot for the help!
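For reference, the kind of docstring being asked for might look like the sketch below; the constructor name, signature, and pretrained-weights claim are illustrative, not necessarily what the PR ships:

```python
def r2plus1d_18(pretrained=False, progress=True, **kwargs):
    """Construct an 18-layer R(2+1)D network.

    Reference: D. Tran et al., "A Closer Look at Spatiotemporal
    Convolutions for Action Recognition", CVPR 2018.
    https://arxiv.org/abs/1711.11248

    Args:
        pretrained (bool): if True, load weights pre-trained on Kinetics
            (assuming such weights are published for this model).
        progress (bool): if True, display a download progress bar.
    """
    ...
```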
Codecov Report

```
@@            Coverage Diff             @@
##           master    #1130      +/-   ##
==========================================
+ Coverage   64.89%   65.58%    +0.68%
==========================================
  Files          68       79       +11
  Lines        5413     5805      +392
  Branches      835      888       +53
==========================================
+ Hits         3513     3807      +294
- Misses       1643     1731       +88
- Partials      257      267       +10
```

Continue to review full report at Codecov.
Thanks a lot!
Following up on #1077 in adding video capabilities, here is a port of the VMZ models, described in Du Tran's "A Closer Look at Spatiotemporal Convolutions for Action Recognition". All of the models have been tested on Kinetics and come within 1% of the reported results or better.
This is a very early commit, mostly looking for feedback from maintainers.
cc @fmassa @dutran