some code missing? #3

Closed
cswhjiang opened this issue May 12, 2017 · 37 comments

cswhjiang commented May 12, 2017

Running python scripts/prepro_labels.py --input_json .../dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk failed. Here is the error:

Traceback (most recent call last):
  File "scripts/prepro_labels.py", line 192, in <module>
    main(params)
  File "scripts/prepro_labels.py", line 138, in main
    imgs = imgs['images']
TypeError: list indices must be integers, not str

It seems that some code is missing.

ruotianluo (Owner)

Did you change --input_json .../dataset_coco.json to your own path?
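(For context, a minimal check, assuming the standard Karpathy-split dataset_coco.json: its top level is a dict with an "images" key, so the error above usually means --input_json points at a file whose top level is a list instead.)

    import json

    # The expected Karpathy-split file is a dict: {"images": [...], "dataset": "coco"}
    imgs = json.load(open('data/dataset_coco.json'))   # path is wherever you downloaded it
    print(type(imgs))        # <class 'dict'>  -> imgs['images'] works
    # If the file's top level were a list (e.g. a different annotation dump),
    # imgs['images'] would raise: TypeError: list indices must be integers, not str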

cswhjiang (Author)

Thanks. I found the reason just now.

brisker commented Sep 4, 2017

@ruotianluo
For the "show_attend_tell" model, where is the code that visualizes the visual attention on the input image?

ruotianluo (Owner)

Sadly there isn't any; I felt it would be a little bit messy to add.
But in principle, you can always save the alphas as a member variable and visualize them using the code in arctic-captions.

brisker commented Sep 5, 2017

@ruotianluo
I do not quite understand what you mean...
What is "alphas"?
What is "arctic-captions"?
Besides, could you please provide a demo of the attention visualization for the show_attend_tell model?
(if convenient :) ) Thanks a lot.

ruotianluo (Owner)

Sorry, what I mean by alpha is the attention map, which is named weight in my code.
alphas: https://github.com/ruotianluo/neuraltalk2.pytorch/blob/master/models/CaptionModel.py#L269
arctic-caption: https://github.com/kelvinxu/arctic-captions/blob/master/alpha_visualization.ipynb

You can save the weights at each timestep and visualize them using the last block of alpha_visualization.ipynb
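As a rough sketch (not the repo's actual code), saving the per-step attention could look like this, assuming a member list self.alphas added to the model and that weight is the batch x att_size attention tensor computed at each step:

    # once per image/batch, before the decoding loop:
    self.alphas = []

    # inside the decoding loop, right after weight (the attention map) is computed:
    self.alphas.append(weight.data.cpu().clone())   # keep a copy for later visualization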

brisker commented Sep 6, 2017

@ruotianluo
why use

self.rnn = getattr(nn, self.rnn_type.upper())(self.input_encoding_size + self.att_feat_size, 
                self.rnn_size, self.num_layers, bias=False, dropout=self.drop_prob_lm)

here?
Why not directly use nn.LSTM or nn.GRU?

ruotianluo (Owner)

@brisker In principle it allows you to use GRU instead of LSTM; however, I forget whether I tested it or not.
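For illustration, a minimal sketch of why that one line supports both cells (the sizes here are made up):

    import torch.nn as nn

    rnn_type = 'lstm'                      # comes from opt.rnn_type
    # getattr(nn, 'LSTM') is exactly nn.LSTM, so the same line handles both choices:
    rnn = getattr(nn, rnn_type.upper())(512, 512, 1, bias=False, dropout=0.5)
    # equivalent to nn.LSTM(512, 512, 1, bias=False, dropout=0.5) when rnn_type == 'lstm',
    # or nn.GRU(512, 512, 1, bias=False, dropout=0.5) when rnn_type == 'gru'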

brisker commented Sep 6, 2017

@ruotianluo
??? Whether it's GRU or LSTM is decided by opt.rnn_type, right?
Why bother writing something like

self.rnn = getattr(nn, self.rnn_type.upper())(self.input_encoding_size + self.att_feat_size, 
                self.rnn_size, self.num_layers, bias=False, dropout=self.drop_prob_lm)

ruotianluo (Owner)

@brisker Why not?

brisker commented Sep 6, 2017

@ruotianluo
I'm new to image captioning; two questions:

  1. what does this variable ss_prob mean?
  2. what does this variable masks mean?

ruotianluo (Owner)

ss_prob is the scheduled sampling probability.
masks indicate how long each caption is.
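For context, a rough sketch of how length masks are typically built and applied (illustrative, not copied from the repo):

    import torch

    # Hypothetical example: two captions padded to length 5, with true lengths 3 and 5.
    lengths = torch.tensor([3, 5])
    max_len = 5
    masks = (torch.arange(max_len).unsqueeze(0) < lengths.unsqueeze(1)).float()
    # masks = [[1, 1, 1, 0, 0],
    #          [1, 1, 1, 1, 1]]
    # The per-token cross-entropy loss is multiplied by masks, so padded positions
    # after each caption's end contribute nothing.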

brisker commented Sep 6, 2017

@ruotianluo
About the attention map that is named weight in your code
(alphas: https://github.com/ruotianluo/neuraltalk2.pytorch/blob/master/models/CaptionModel.py#L269):
did you mean that by just resizing the weight variable to the size of the input image, we get an attention map that is ready to be overlaid on the input image for visualization?

ruotianluo (Owner)

@brisker Yes. Note that weight is flattened; you should first reshape it to 7x7. (I forgot to mention: this show_attend_tell is not exactly the same as described in the paper; it's simplified a little bit.)
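A rough visualization sketch along the lines of the arctic-captions notebook (assuming weight is one timestep's attention as a plain tensor of shape (49,), and img is the 224x224 input image as a numpy array):

    import skimage.transform
    import matplotlib.pyplot as plt

    alpha = weight.view(7, 7).cpu().numpy()                                     # un-flatten the 7x7 map
    alpha_img = skimage.transform.pyramid_expand(alpha, upscale=32, sigma=20)   # 7 * 32 = 224
    plt.imshow(img)
    plt.imshow(alpha_img, alpha=0.6, cmap='gray')                               # overlay the attention
    plt.show()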

brisker commented Sep 6, 2017

@ruotianluo
I am new to image captioning and do not quite understand how the model works...
Could you please explain a little where the simplification is in your code for the show_attend_tell model, compared to the original paper?
Here?

att_feats = cnn_model(images).permute(0, 2, 3, 1)
fc_feats = att_feats.mean(2).mean(1)

You seem to just perform average pooling on the conv features to get fc_feats.

ruotianluo (Owner)

@brisker That part is because I'm using a resnet.
The network details are different, but the main difference is that I didn't add the doubly stochastic attention from the paper.
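For reference, the doubly stochastic attention in the paper is an extra loss term encouraging the attention over all timesteps to sum to roughly 1 at every spatial location. A hypothetical sketch, assuming alphas is a (T, batch, 49) tensor of the per-step attention weights:

    att_sum = alphas.sum(0)                                    # batch * 49: total attention per location
    doubly_stochastic_penalty = ((1.0 - att_sum) ** 2).sum(1).mean()
    loss = xe_loss + lambda_reg * doubly_stochastic_penalty    # lambda_reg is a hyperparameter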

brisker commented Sep 7, 2017

@ruotianluo
Do you mean that by setting the scheduled sampling probability to a value larger than 0.0, the model becomes the stochastic "hard" attention model described in the show_attend_tell paper?

ruotianluo (Owner)

No, scheduled sampling is a separate technique that is not mentioned in the show attend tell paper; you can google the scheduled sampling paper.
I forgot to mention: hard attention is another thing I didn't implement here.

brisker commented Sep 7, 2017

@ruotianluo

  1. Did you mean that the scheduled sampling paper corresponds only to this code?
            if self.training and i >= 1 and self.ss_prob > 0.0: # otherwise no need to sample
                sample_prob = fc_feats.data.new(batch_size).uniform_(0, 1)

                #print("sample_prob:")
                #print(sample_prob.size())
                sample_mask = sample_prob < self.ss_prob
                if sample_mask.sum() == 0:
                    it = seq[:, i].clone()
                else:
                    sample_ind = sample_mask.nonzero().view(-1)
                    it = seq[:, i].data.clone()
                    #prob_prev = torch.exp(outputs[-1].data.index_select(0, sample_ind)) # fetch prev distribution: shape Nx(M+1)
                    #it.index_copy_(0, sample_ind, torch.multinomial(prob_prev, 1).view(-1))
                    prob_prev = torch.exp(outputs[-1].data) # fetch prev distribution: shape Nx(M+1)
                    it.index_copy_(0, sample_ind, torch.multinomial(prob_prev, 1).view(-1).index_select(0, sample_ind))
                    it = Variable(it, requires_grad=False)
  2. What are the benefits of scheduled sampling?

ruotianluo (Owner) commented Sep 7, 2017

  1. Yes, it replaces the network's input with a sampled output, by chance.
  2. It's designed to address the discrepancy between training and test time. In practice its effect depends on the model: FCModel doesn't need scheduled sampling, but ShowTell performs better with it.

brisker commented Sep 28, 2017

@ruotianluo
Weird and a bit embarrassing to ask, but when I run inference with the same ShowTellModel and the same image, why are the results different from one run to the next? (I modified ShowTellModel a little: at time step 0 I feed the LSTM the fc_feats through an image embedding layer, and at time step 1 I feed the start token.)
Any idea why this happens?

ruotianluo (Owner)

Did you set the model to evaluation mode?
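(For context: dropout layers behave stochastically in training mode, so two forward passes on the same image can produce different captions. A minimal check, assuming greedy decoding rather than sampling:)

    model.eval()   # switches dropout to inference behaviour -> deterministic outputs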

brisker commented Sep 28, 2017

@ruotianluo
yes

ruotianluo (Owner)

How different are the results?

brisker commented Sep 29, 2017

@ruotianluo
Technically, in an attention model, it is not mandatory to concat the att_feats (output of the CNN) and the LSTM hidden state to form the LSTM input, right?

ruotianluo (Owner)

Yes, as long as it's mathematically equivalent.

brisker commented Sep 29, 2017

@ruotianluo
So what is the key idea of an attention model? Dynamically using the fc_feats and the LSTM hidden state to compute a weight tensor, and visualizing this weight tensor on the input image?

ruotianluo (Owner)

The idea is that you can look at a different part of the image at each time step.

brisker commented Sep 29, 2017

@ruotianluo
So you mean that at every time step we use the fc_feats and the LSTM hidden state to compute the attention weight tensor, and the fc_feats is different at every time step? But actually it is the same, right? (Because the CNN forward pass has already been completed.) I am a little puzzled here... which variable's change leads to "looking at a different part of the image at each time step"?

ruotianluo (Owner)

We don't use fc_feats to compute the weight; we use att_feats. The hidden state is what changes.

brisker commented Sep 29, 2017

Thanks for all your replies :)
@ruotianluo
So it's the changing hidden state that leads to "looking at a different part of the image at each time step"? But att_feats does not change during the unrolling of the LSTM, right? If we do not concat att_feats and the hidden state as the LSTM input, it seems that the attention is only related to the LSTM... Is it common in existing attention models not to concat? Pros and cons?

ruotianluo (Owner)

att_feats change over spatial locations, hidden states change over time. And the output of the attention module is a weighted sum of att_feats.
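A rough sketch of this soft-attention computation (the layer names att_embed, h_embed, and alpha_net are illustrative placeholders, not the repo's exact code):

    import torch
    import torch.nn.functional as F

    D, A, H = 2048, 512, 512
    att_embed = torch.nn.Linear(D, A)
    h_embed = torch.nn.Linear(H, A)
    alpha_net = torch.nn.Linear(A, 1)

    # att_feats: batch * 49 * D   (fixed spatial features, one per location)
    # h:         batch * H        (LSTM hidden state, changes every time step)
    att_proj = att_embed(att_feats)                                  # batch * 49 * A
    h_proj = h_embed(h).unsqueeze(1)                                 # batch * 1  * A
    scores = alpha_net(torch.tanh(att_proj + h_proj)).squeeze(2)     # batch * 49
    weight = F.softmax(scores, dim=1)                                # the "alpha" attention map
    att_res = torch.bmm(weight.unsqueeze(1), att_feats).squeeze(1)   # batch * D, weighted sum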

ruotianluo (Owner)

Technically I don't concat, but mathematically it's equivalent. I wrote it this way to avoid duplicate computation.

brisker commented Sep 29, 2017

@ruotianluo
Here: https://github.com/ruotianluo/neuraltalk2.pytorch/blob/master/models/OldModel.py#L224

    att_feats_ = att_feats.view(-1, att_size, self.att_feat_size) # batch * att_size * att_feat_size
    att_res = torch.bmm(weight.unsqueeze(1), att_feats_).squeeze(1) # batch * att_feat_size
    output, state = self.rnn(torch.cat([xt, att_res], 1).unsqueeze(0), state)

xt is what is fed in at each time step, and att_res is a weighted sum of att_feats, right?
But if I do not concat xt and att_res (use only xt), there is obviously no weighted sum of att_feats, right? If there is no concat, only the hidden state changes and there is no weighted sum of att_feats; is that reasonable?

ruotianluo (Owner)

Ok, it seems that I misunderstood your question. Yes, if you don't concat att_res here, it's not an attention model, and there's no visualization either, because there's no training signal to the attention module.

brisker commented Sep 29, 2017

Thanks for all your replies :)
@ruotianluo
So it is necessary to combine xt and att_feats in an attention model, right?
Besides, are there any other operations for combining these two variables, apart from concat?

ruotianluo (Owner)

There are a lot of different fusion types proposed in the VQA literature; you can check them out. The easiest alternative is an elementwise product.
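For illustration, a tiny sketch of the two fusion options (assuming xt and att_res are tensors already projected to the same size D):

    fused_concat = torch.cat([xt, att_res], 1)   # concat: batch * 2D, what the repo's code above does
    fused_prod = xt * att_res                    # elementwise product: batch * D
    # The product keeps the input dimensionality fixed, at the cost of requiring
    # xt and att_res to share the same feature size.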
