
Question about the design of Match-Net and the features fed in. #31

Open
xwjabc opened this issue Oct 9, 2019 · 6 comments

xwjabc commented Oct 9, 2019

  1. According to the paper, the feature extractor of the match net has 4 conv layers, one pooling layer, and one FC layer. Are these layers:
    -- Conv1: 3x3 conv - 256 channels -> ReLU
    -- Conv2: 3x3 conv - 256 channels -> ReLU
    -- Conv3: 3x3 conv - 1024 channels -> ReLU
    -- Conv4: 3x3 conv - 1024 channels -> ReLU
    -- Pooling: GlobalAvgPool
    -- FC: 1024 to 256 channels (No ReLU)
    Besides, the similarity learning net has:
    -- Subtraction (output 256 channels)
    -- Element-wise square (output 256 channels)
    -- FC: 256 to 1 channel (No ReLU)
    -- Sigmoid function
    Am I correct?

  2. The mask head has the following procedure:
    backbone -> RoI Pooling -> 4x conv (feature extractor) -> 1x deconv + 1x conv (predictor)
    So in the paper, for the experiments using mask features, the RoI features fed into the match net should be the features after RoI Pooling. Am I correct? Is there an individual RoI Pooling for the match net, or do we just re-use the RoI-pooled features from the mask head?

@geyuying
Collaborator

  1. You are correct. We just re-use the RoI-pooled features from the mask head, because after the second stage the features from RoI Align already contain mask information. We tried using features from other layers, but got worse performance.
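
The feature sharing described here can be sketched as follows. This is a minimal NumPy sketch with placeholder heads and assumed shapes (8 RoIs, 256 channels, 14x14 pooled resolution); the actual implementation is a Detectron/Caffe2 model, so everything below is illustrative only:

```python
import numpy as np

# Hypothetical RoI-pooled features from the second stage:
# 8 RoIs, 256 channels, 14x14 spatial resolution (shapes are assumptions).
rng = np.random.default_rng(0)
roi_feats = rng.standard_normal((8, 256, 14, 14))

def mask_head(x):
    # Placeholder for the 4x conv -> deconv -> conv mask predictor.
    return x.mean(axis=1)                # (8, 14, 14) toy "masks"

def match_net_extractor(x):
    # Placeholder for the 4x conv -> global average pool -> FC extractor.
    return x.mean(axis=(2, 3))           # (8, 256) toy embeddings

# Both heads consume the SAME RoI-pooled features; the match net does
# not run a second RoI Pooling of its own.
masks = mask_head(roi_feats)
embeddings = match_net_extractor(roi_feats)
```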

@geyuying
Collaborator

-- Conv1: 3x3 conv - 256 channels -> ReLU
-- Conv2: 3x3 conv - 256 channels -> ReLU
-- Conv3: 3x3 conv - 256 channels -> ReLU
-- Conv4: 3x3 conv - 1024 channels -> ReLU
-- Pooling: GlobalAvgPool
-- ReLU
-- FC: 1024 to 256 channels (No ReLU) + BN
Besides, the similarity learning net has:
-- Subtraction (output 256 channels)
-- Element-wise square (output 256 channels)
-- FC: 256 to 2 channels (No ReLU). The first channel means similarity, the second means difference; a positive pair is labeled (1, 0) and a negative pair (0, 1).
-- Softmax function
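
Put together, the similarity head above can be sketched in NumPy. Only the layer shapes and the subtraction -> square -> FC -> softmax order come from the description; the weights here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 2)) * 0.01   # FC: 256 -> 2, no ReLU
b = np.zeros(2)

def similarity_head(user_feat, shop_feat):
    """user_feat, shop_feat: (N, 256) embeddings from the feature extractor."""
    d = (user_feat - shop_feat) ** 2            # subtraction + element-wise square
    logits = d @ W + b                          # FC: 256 -> 2 channels
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)     # softmax: [:, 0] similarity, [:, 1] difference

probs = similarity_head(rng.standard_normal((4, 256)),
                        rng.standard_normal((4, 256)))
```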


xwjabc commented Oct 11, 2019

Thank you for your great help! Besides, I have two more questions:

  1. In the first version of your answer about the match network, I noticed that there are several tile operations:
INFO net.py: 263: self1 : (64, 256) => self_user : (8, 8, 256) ------- (op: Reshape)
INFO net.py: 263: self_user : (8, 8, 256) => self_user_ : (8, 8, 256) ------- (op: Transpose)
INFO net.py: 263: self_user_ : (8, 8, 256) => self_user_after : (64, 256) ------- (op: Reshape)
INFO net.py: 263: self_user_after : (64, 256) => self_user_after_ : (512, 256) ------- (op: Tile)
INFO net.py: 263: self2 : (64, 256) => self_shop_before : (64, 2048) ------- (op: Tile)
INFO net.py: 263: self_shop_before : (64, 2048) => self_shop : (512, 256) ------- (op: Reshape)

Could you explain the use of the tile function a bit?
Besides, I see the final output has shape (512, 2). However, according to the discussion, we should have 4096 pairs (512 positive pairs and 3584 negative pairs), which would lead to a shape of (4096, 2). I wonder what the reason for this gap is.
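
For reference, a plausible NumPy reading of those logged shapes (an interpretation, not confirmed in the thread; the transpose/reshape reordering on self1 is omitted) is that the tiles broadcast the two feature sets into row-aligned pairs:

```python
import numpy as np

# Shapes taken from the log: 64 RoIs, 256-d features, tile factor 8.
n, d, k = 64, 256, 8
user = np.arange(n * d, dtype=np.float32).reshape(n, d)  # stand-in for self1
shop = np.arange(n * d, dtype=np.float32).reshape(n, d)  # stand-in for self2

# self_user_after : (64, 256) => (512, 256): the whole user block repeated 8x.
user_tiled = np.tile(user, (k, 1))
# self2 : (64, 256) => (64, 2048) => (512, 256): after the reshape, each
# shop row appears 8x in consecutive rows.
shop_tiled = np.tile(shop, (1, k)).reshape(n * k, d)

# Row i now pairs user[i % 64] with shop[i // 8]; each row is one
# (user, shop) pair for the subtraction + square similarity head,
# giving the 512 pairs behind the (512, 2) output.
pairs = (user_tiled - shop_tiled) ** 2
```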

  2. In the evaluation of the retrieval, does Match R-CNN compare the user instance with all shop instances, or only with the shop instances that have the same predicted class as the user instance?

@geyuying
Collaborator

  1. 4096 is proper. In our experiment, in order to reduce the number of pairs, we do not use all pairs.
  2. We compare the user instance with all shop instances.


xwjabc commented Oct 14, 2019

Thank you for your great help! In my current implementation, I use the mask features after RoIAlign in the mask branch. However, the number of instances in the mask features is limited (1~2 instances per gt garment (unique pair_id + style) in total at the beginning of the training). Thus, I wonder how you can generate 8 instances per image for the retrieval task? Thx!


joppichristian commented Jan 14, 2020

  1. 4096 is proper. In our experiment, in order to reduce the number of pairs, we do not use all pairs.
  2. We compare the user instance with all shop instances.

How did you compare all the user instances with all shop instances? That means an enormous number of comparisons. I have 4x Titan RTX and tqdm estimates 6000 hours to complete the evaluation. Have I missed something?
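
One common way to keep exhaustive user-shop matching tractable (a generic NumPy sketch, not this repository's code) is to precompute all embeddings once, then run only the cheap subtraction/square/FC scoring over batched pairs instead of a full forward pass per pair. All sizes and weights below are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
user_emb = rng.standard_normal((100, 256)).astype(np.float32)  # precomputed queries
shop_emb = rng.standard_normal((500, 256)).astype(np.float32)  # precomputed gallery
W = (rng.standard_normal(256) * 0.01).astype(np.float32)       # toy scoring weights

scores = np.empty((len(user_emb), len(shop_emb)), dtype=np.float32)
batch = 32
for i in range(0, len(user_emb), batch):
    u = user_emb[i:i + batch]
    # Broadcast (B, 1, 256) against (1, 500, 256) to score all pairs at once.
    d = (u[:, None, :] - shop_emb[None, :, :]) ** 2
    scores[i:i + batch] = d @ W                                # (B, 500)

# scores[i, j] ranks shop j for user query i without per-pair forward passes.
```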
