The method of DenseAVInteractions is inconsistent with that of the paper #7

@Steven-navi

Hello, authors!
I found a small inconsistency between the paper and its implementation.
The paper says that the two Updated Agg Tokens for video and audio are fused by linear aggregation, but it does not explain how the weight matrices for the two features are computed.
In the code, however, the two features are only expanded and then concatenated along the channel dimension. The code is as follows.

    xva = torch.cat((
        xv.unsqueeze(2).repeat(1, 1, na, 1),
        xa.unsqueeze(1).repeat(1, nv, 1, 1),
    ), dim=3).flatten(1, 2)
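
To make the shapes concrete, here is a minimal sketch of how I read this snippet. The sizes `b`, `nv`, `na`, `d` and the layer `fuse` are my own hypothetical choices, not taken from the repository. I do not know whether a linear layer actually follows the concatenation in the code; the sketch only shows that, if one did, concatenation followed by a linear layer would be mathematically the same as a weighted linear sum of the two features, which might be what the paper means by linear aggregation.

```python
import torch

# Hypothetical sizes, chosen only for illustration.
b, nv, na, d = 2, 3, 4, 8
xv = torch.randn(b, nv, d)  # video aggregation tokens
xa = torch.randn(b, na, d)  # audio aggregation tokens

# Snippet from the issue: pair every video token with every audio token
# and concatenate each pair along the channel dimension.
xva = torch.cat((
    xv.unsqueeze(2).repeat(1, 1, na, 1),
    xa.unsqueeze(1).repeat(1, nv, 1, 1),
), dim=3).flatten(1, 2)
print(xva.shape)  # torch.Size([2, 12, 16])

# Assumption: a linear layer is applied to the concatenated pairs.
# `fuse` is hypothetical and not a name from the repository.
fuse = torch.nn.Linear(2 * d, d)
fused = fuse(xva)

# The same result written as an explicit linear aggregation:
# W [xv; xa] + bias == W_v xv + W_a xa + bias, with W = [W_v, W_a].
w_v, w_a = fuse.weight[:, :d], fuse.weight[:, d:]
pair_v = xv.unsqueeze(2).repeat(1, 1, na, 1).flatten(1, 2)
pair_a = xa.unsqueeze(1).repeat(1, nv, 1, 1).flatten(1, 2)
linear_sum = pair_v @ w_v.T + pair_a @ w_a.T + fuse.bias

print(torch.allclose(fused, linear_sum, atol=1e-5))
```

If this reading is right, the weight matrices of the two features would simply be the two halves of the weight of whatever layer consumes `xva`, but I would like the authors to confirm.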

How should I understand this part? I look forward to your answer!
