
Handling longer videos while preparing data files #4

Closed
sxs4337 opened this issue Aug 14, 2016 · 10 comments


sxs4337 commented Aug 14, 2016

Thanks for putting together this wonderful code!

While preparing the data h5 files, as mentioned: "batch['data'] stores the visual features. shape (n_step_lstm, batch_size, hidden_dim)"
How should videos longer than n_step_lstm be handled?
If a video is broken into parts and stored as separate input samples, would the model be able to figure out and learn from the parts of the same video via the batch['label'] parameter?

Any help on preparing the data h5 files would be appreciated.
Thanks.
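
(For reference, a minimal sketch of what one such batch file could look like when written with h5py, using only the two fields quoted above; the file name, the label shape, and the zero-filled contents are placeholder assumptions, not the repository's actual format.)

```python
# Hypothetical sketch of one batch h5 file with the two quoted fields.
import h5py
import numpy as np

n_step_lstm, batch_size, hidden_dim = 50, 100, 1024  # example sizes only

with h5py.File("batch_train_0.h5", "w") as f:  # placeholder file name
    # visual features, shape (n_step_lstm, batch_size, hidden_dim)
    f["data"] = np.zeros((n_step_lstm, batch_size, hidden_dim), dtype=np.float32)
    # per-step labels; the (n_step_lstm, batch_size) shape is an assumption
    f["label"] = np.full((n_step_lstm, batch_size), -1, dtype=np.int32)
```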


tsenghungchen commented Aug 14, 2016

Hi,

For videos longer than n_step_lstm, I simply discard the remaining frames and take the clip of maximum allowed length as input.
For the second question, I haven't tried splitting videos into segments. You might need to modify the code, since the labels for all of my data start from the very first index and differ only in where they end.
Since I'm a bit busy recently, I might put up the H5 data generation code later this week.
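
(As an illustration of the trimming described above, a rough sketch; the helper name and the zero-padding of shorter videos are assumptions, not the repository's exact code.)

```python
import numpy as np

def trim_features(feats, n_step_lstm):
    """Keep at most the first n_step_lstm frame features (hypothetical helper).

    feats: array of shape (n_frames, feat_dim).
    Shorter videos are zero-padded here, which is an assumption about how the
    remaining steps are filled.
    """
    feats = feats[:n_step_lstm]
    if feats.shape[0] < n_step_lstm:
        pad = np.zeros((n_step_lstm - feats.shape[0], feats.shape[1]), dtype=feats.dtype)
        feats = np.vstack([feats, pad])
    return feats
```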

@tsenghungchen (Owner)

Sorry for accidentally sending the message.
If you are in a hurry, you can look at the HDF5 documentation first for some details. If you encounter any problems, feel free to ask me. Thanks.


sxs4337 commented Aug 15, 2016

Thanks for the quick response and the clarifications.
I agree the labels always start from the first index but may end earlier than the sequence length. For videos longer than n_step_lstm, wouldn't it be better to sample n_step_lstm frames spread over the entire video? Discarding longer videos would shrink the dataset (if I interpret your reply correctly).

I read through the Att.py script and am writing a script to generate the h5 files for the MSVD dataset.
Appreciate the help.

@tsenghungchen (Owner)

Maybe I misled you. For longer videos, I trim them to fit the length n_step_lstm. Another approach is, as you said, to sample n_step_lstm frames covering the entire video. Either way, the total number of videos does not change.
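
(A rough sketch of the uniform-sampling alternative mentioned here; the helper name is hypothetical.)

```python
import numpy as np

def sample_frames(feats, n_step_lstm):
    """Uniformly sample n_step_lstm frame features spanning the whole video
    (hypothetical helper). feats: array of shape (n_frames, feat_dim)."""
    n_frames = feats.shape[0]
    if n_frames <= n_step_lstm:
        return feats  # shorter videos would still need padding elsewhere
    idx = np.linspace(0, n_frames - 1, num=n_step_lstm).astype(int)
    return feats[idx]
```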


sxs4337 commented Aug 15, 2016

Thanks for clearing that up.
So when sampling each video down to n_step_lstm frames, the 'label' parameter would always be the same vector. For example, if n_step_lstm is 5, the vector would always be [-1, -1, -1, -1, 0]. Right?
Thanks.

@tsenghungchen (Owner)

Yes, you're right.
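
(A minimal sketch of that label vector, assuming every sampled video fills all n_step_lstm steps; the helper name is hypothetical.)

```python
import numpy as np

def make_label(n_step_lstm):
    """-1 at every step except a 0 marking the final frame,
    e.g. [-1, -1, -1, -1, 0] for n_step_lstm = 5."""
    label = np.full(n_step_lstm, -1, dtype=np.int32)
    label[-1] = 0
    return label
```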


sxs4337 commented Aug 16, 2016

Thanks.
By uniformly sampling each MSVD video to 50 time steps, I was able to train a model with the following parameters:
#data downloaded from https://github.com/yaoli/arctic-capgen-vid/blob/master/README.md
dim_image = 1024
dim_hidden= 256
n_frame_step = 50
n_caption_step = 35
n_epochs = 200
batch_size = 100
learning_rate = 0.0001

I get the following results when testing the model trained for 200 epochs:

init COCO-EVAL scorer
tokenization...
PTBTokenizer tokenized 110046 tokens at 461760.74 tokens per second.
PTBTokenizer tokenized 2133 tokens at 37307.49 tokens per second.
setting up scorers...
computing Bleu score...
{'reflen': 1566, 'guess': [1564, 1279, 994, 709], 'testlen': 1564, 'correct': [1164, 514, 242, 74]}
ratio: 0.998722860791
Bleu_1: 0.743
Bleu_2: 0.546
Bleu_3: 0.417
Bleu_4: 0.295
computing METEOR score...
METEOR: 0.231
computing Rouge score...
ROUGE_L: 0.591
computing CIDEr score...
CIDEr: 0.274

The METEOR score seems too low. During training, the validation METEOR was 0.256, and the predicted sentences (the "PD:" lines) were observed to be incomplete (mostly missing the last word). The vocabulary size from preProBuildWordVocab is 1500. I am trying to replicate the results of https://github.com/yaoli/arctic-capgen-vid as closely as possible.

Any thoughts on what I may be missing?
Thanks a lot!

@tsenghungchen (Owner)

I think the incomplete sentences are causing the METEOR score to be this low. Are all of the sentences in the val set incomplete?
I didn't encounter such a problem, so I'm not sure what is going wrong. Maybe you can check the training or testing captions with pdb and see whether they are complete.
Sorry for not being able to help more.
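
(A quick way to do such a check with pdb; the function and variable names are placeholders for whatever list the data-preparation script actually builds.)

```python
import pdb

def check_captions(captions):
    # Print a few captions and make sure the last word / end token is present.
    for c in captions[:10]:
        print(repr(c))
    pdb.set_trace()  # drop into the debugger to inspect the rest interactively
```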


sxs4337 commented Aug 17, 2016

Thanks.
The test METEOR score with a hidden dimension of 512 is 0.243, so a slight improvement. I agree that the missing last word may be causing the low scores. I looked into jazzsaxmafia/video_to_sequence#5.
I see that your script already has the required correction, but I still lose the last word in the predicted sentences.

Sample of some GT and predicted sentences during validation-
GT: all the boys are playing the different games
GT: Children playing at a park in the city.
GT: Children play on a playground, and then start running.
PD: children play on a .

Maybe there is some alignment issue with the data files I prepared. I look forward to running the model on the data you generated once you post the h5 generation files.
Appreciate all the replies and help. Thanks!

@tsenghungchen (Owner)

Hi,
I have uploaded the h5 generation code; you can see the usage notes in the README.
Since it is quite involved, questions are welcome. I'm glad to help!
