
Handling longer videos while preparing data files #4

Closed
sxs4337 opened this issue Aug 14, 2016 · 10 comments


sxs4337 commented Aug 14, 2016

Thanks for putting together this wonderful code!

While preparing the data h5 files, as mentioned: "batch['data'] stores the visual features. shape (n_step_lstm, batch_size, hidden_dim)"
How should videos longer than n_step_lstm be handled?
If a video is broken into parts and stored as separate input samples, would the model be able to figure out and learn from the parts of the same video via the batch['label'] parameter?

Any help on preparing the data h5 files would be appreciated.
Thanks.
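
(For reference, a minimal sketch of what one such batch file could look like when written with h5py, using only the two fields quoted above; the file name, the label shape, and the zero-filled contents are placeholder assumptions, not the repository's actual format.)

```python
# Hypothetical sketch of one batch h5 file with the two quoted fields.
import h5py
import numpy as np

n_step_lstm, batch_size, hidden_dim = 50, 100, 1024  # example sizes only

with h5py.File("batch_train_0.h5", "w") as f:  # placeholder file name
    # visual features, shape (n_step_lstm, batch_size, hidden_dim)
    f["data"] = np.zeros((n_step_lstm, batch_size, hidden_dim), dtype=np.float32)
    # per-step labels; the (n_step_lstm, batch_size) shape is an assumption
    f["label"] = np.full((n_step_lstm, batch_size), -1, dtype=np.int32)
```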


tsenghungchen commented Aug 14, 2016

Hi,

For videos longer than n_step_lstm, I simply discard the remaining frames and take the clip of maximum allowed length as input.
For the second question, I haven't tried splitting videos into segments. You might need to modify the code, since the labels for all of my data start from the very first index and differ only in where they end.
Since I'm a bit busy recently, I might put up the H5 data generation code later this week.
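
(As an illustration of the trimming described above, a rough sketch; the helper name and the zero-padding of shorter videos are assumptions, not the repository's exact code.)

```python
import numpy as np

def trim_features(feats, n_step_lstm):
    """Keep at most the first n_step_lstm frame features (hypothetical helper).

    feats: array of shape (n_frames, feat_dim).
    Shorter videos are zero-padded here, which is an assumption about how the
    remaining steps are filled.
    """
    feats = feats[:n_step_lstm]
    if feats.shape[0] < n_step_lstm:
        pad = np.zeros((n_step_lstm - feats.shape[0], feats.shape[1]), dtype=feats.dtype)
        feats = np.vstack([feats, pad])
    return feats
```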

@tsenghungchen (Owner)

Sorry for accidentally sending the message.
If you are in a hurry, you can look at the HDF5 documentation first for some details. If you encounter any problems, feel free to ask me. Thanks.


sxs4337 commented Aug 15, 2016

Thanks for the quick response and the clarifications.
I agree the labels always start from the first index but may end earlier than the sequence length. For videos longer than n_step_lstm, wouldn't it be better to sample n_step_lstm frames spread over the entire video? Discarding longer videos would shrink the dataset (if I interpret your reply correctly).

I read through the Att.py script and am writing a script to generate the h5 files for the MSVD dataset.
Appreciate the help.

@tsenghungchen (Owner)

Maybe I misled you. For longer videos, I trim them to fit the length n_step_lstm. Another approach is, as you said, to sample n_step_lstm frames covering the entire video. Either way, the total number of videos does not change.
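
(A rough sketch of the uniform-sampling alternative mentioned here; the helper name is hypothetical.)

```python
import numpy as np

def sample_frames(feats, n_step_lstm):
    """Uniformly sample n_step_lstm frame features spanning the whole video
    (hypothetical helper). feats: array of shape (n_frames, feat_dim)."""
    n_frames = feats.shape[0]
    if n_frames <= n_step_lstm:
        return feats  # shorter videos would still need padding elsewhere
    idx = np.linspace(0, n_frames - 1, num=n_step_lstm).astype(int)
    return feats[idx]
```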


sxs4337 commented Aug 15, 2016

Thanks for clearing that up.
So when sampling each video down to n_step_lstm frames, the 'label' parameter would always be the same vector. For example, if n_step_lstm is 5, the vector would always be [-1, -1, -1, -1, 0]. Right?
Thanks.

@tsenghungchen (Owner)

Yes, you're right.
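
(A minimal sketch of that label vector, assuming every sampled video fills all n_step_lstm steps; the helper name is hypothetical.)

```python
import numpy as np

def make_label(n_step_lstm):
    """-1 at every step except a 0 marking the final frame,
    e.g. [-1, -1, -1, -1, 0] for n_step_lstm = 5."""
    label = np.full(n_step_lstm, -1, dtype=np.int32)
    label[-1] = 0
    return label
```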


sxs4337 commented Aug 16, 2016

Thanks.
By uniformly sampling each MSVD video to 50 time steps, I was able to train a model with the following parameters:
#data downloaded from https://github.com/yaoli/arctic-capgen-vid/blob/master/README.md
dim_image = 1024
dim_hidden= 256
n_frame_step = 50
n_caption_step = 35
n_epochs = 200
batch_size = 100
learning_rate = 0.0001

I get the following results when testing the model trained for 200 epochs:

init COCO-EVAL scorer
tokenization...
PTBTokenizer tokenized 110046 tokens at 461760.74 tokens per second.
PTBTokenizer tokenized 2133 tokens at 37307.49 tokens per second.
setting up scorers...
computing Bleu score...
{'reflen': 1566, 'guess': [1564, 1279, 994, 709], 'testlen': 1564, 'correct': [1164, 514, 242, 74]}
ratio: 0.998722860791
Bleu_1: 0.743
Bleu_2: 0.546
Bleu_3: 0.417
Bleu_4: 0.295
computing METEOR score...
METEOR: 0.231
computing Rouge score...
ROUGE_L: 0.591
computing CIDEr score...
CIDEr: 0.274

The METEOR score seems too low. During training, the validation METEOR was 0.256, and the predicted sentences (the "PD:" lines) were observed to be incomplete (mostly missing the last word). The vocabulary size from preProBuildWordVocab is 1500. I am trying to replicate the results of https://github.com/yaoli/arctic-capgen-vid as closely as possible.

Any thoughts on what I may be missing?
Thanks a lot!

@tsenghungchen (Owner)

I think the incomplete sentences are causing the METEOR score to be this low. Are all of the sentences in the val set incomplete?
I didn't encounter such a problem, so I'm not sure what is going wrong. Maybe you can check the training or testing captions with pdb and see whether they are complete.
Sorry for not being able to help more.
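
(A quick way to do such a check with pdb; the function and variable names are placeholders for whatever list the data-preparation script actually builds.)

```python
import pdb

def check_captions(captions):
    # Print a few captions and make sure the last word / end token is present.
    for c in captions[:10]:
        print(repr(c))
    pdb.set_trace()  # drop into the debugger to inspect the rest interactively
```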


sxs4337 commented Aug 17, 2016

Thanks.
The test METEOR score with a hidden dimension of 512 is 0.243, so a slight improvement. I agree that the missing last word may be causing the low scores. I looked into jazzsaxmafia/video_to_sequence#5.
I see that your script already has the required correction, but I still lose the last word in the predicted sentences.

Sample of some GT and predicted sentences during validation-
GT: all the boys are playing the different games
GT: Children playing at a park in the city.
GT: Children play on a playground, and then start running.
PD: children play on a .

Maybe there is some alignment issue with the data files I prepared. I look forward to running the model on the data you generated once you post the h5 generation files.
Appreciate all the replies and help. Thanks!

@tsenghungchen (Owner)

Hi,
I have uploaded the h5 generation code; you can see the usage notes in the README.
Since it is quite involved, questions are welcome. I'm glad to help!
