To train the S2VT model, you will need to compile Caffe from my `recurrent` branch:
git clone https://github.com/vsubhashini/caffe.git
cd caffe
git checkout recurrent
To compile Caffe, please refer to the Installation page.
Get the preprocessed model and sample data:
./get_s2vt.sh
Run the captioner:
python s2vt_captioner.py -m s2vt_vgg_rgb
- Pre-process videos to get frame features. The code provided here does not process videos directly; VGG fc7 features need to be extracted for each frame. Pre-processed data (VGG features) for the MSVD corpus is available for download; the following script fetches the training, validation, and test data (~1.2GB):
./download_data.sh
Extracting features for your own videos/images
If you wish to process your own videos, you first need to extract VGG features. This can be done using the code in `extract_vgg_features.py`. A minimal sketch of the idea follows.
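For reference, here is a minimal sketch of per-frame fc7 extraction with Caffe's Python interface. It assumes a standard VGG-16 deploy prototxt and caffemodel; the file and directory names below are placeholders, and `extract_vgg_features.py` is the authoritative implementation.

```python
import glob
import numpy as np
import caffe

# Placeholder paths: substitute your VGG-16 deploy prototxt,
# caffemodel, and directory of extracted video frames.
DEPLOY = 'vgg16_deploy.prototxt'
WEIGHTS = 'vgg16.caffemodel'
FRAME_DIR = 'frames/video0001'

net = caffe.Net(DEPLOY, WEIGHTS, caffe.TEST)
net.blobs['data'].reshape(1, 3, 224, 224)  # one frame at a time

# Standard VGG preprocessing: HWC -> CHW, RGB -> BGR, 0-255 scale,
# per-channel mean subtraction.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2, 1, 0))
transformer.set_mean('data', np.array([103.939, 116.779, 123.68]))

features = []
for frame_path in sorted(glob.glob(FRAME_DIR + '/*.jpg')):
    img = caffe.io.load_image(frame_path)
    net.blobs['data'].data[...] = transformer.preprocess('data', img)
    net.forward()
    # fc7 is the 4096-d per-frame feature that S2VT consumes.
    features.append(net.blobs['fc7'].data[0].copy())

features = np.array(features)  # shape: (num_frames, 4096)
```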
- Convert features to hdf5. If your features are in text format, use `framefc7_stream_text_to_hdf5_data.py` to convert them to hdf5; if they are in a mat file, you may want to use `framefc7_stream_mat_text_to_hdf5_data.py`. A sketch of the conversion idea follows the command below.
python framefc7_stream_text_to_hdf5_data.py
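As a rough illustration of what the conversion does, the sketch below reads one frame per line (an id followed by comma-separated fc7 values) and writes the features to an hdf5 file. The input file name and the `frame_fc7` dataset name are placeholders; the exact layout that `s2vt.prototxt` expects is produced by the scripts above, so use them for real data.

```python
import h5py
import numpy as np

feats = []
with open('video_frames_fc7.txt') as f:  # hypothetical input file
    for line in f:
        parts = line.strip().split(',')
        feats.append([float(v) for v in parts[1:]])  # skip the id column

feats = np.asarray(feats, dtype=np.float32)  # (num_frames, 4096)
with h5py.File('video_frames_fc7.h5', 'w') as out:
    # 'frame_fc7' is a placeholder dataset name, not necessarily what
    # the s2vt prototxt reads; the provided scripts define the layout.
    out.create_dataset('frame_fc7', data=feats)
```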
- Point to the hdf5 training data. Modify `s2vt.prototxt` to point to your hdf5 training and validation data.
- Train the model. Use `s2vt_solver.prototxt` to train your model; a minimal sketch of launching training from Python appears after this list.
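Training can be launched with the standard `caffe train -solver s2vt_solver.prototxt` command, or from Python as in the minimal sketch below (which assumes an SGD solver; the pretrained-weights path is a placeholder):

```python
import caffe

caffe.set_mode_gpu()  # or caffe.set_mode_cpu()
caffe.set_device(0)

# The solver file references the s2vt net definition, which in turn
# points at your hdf5 training/validation data.
solver = caffe.SGDSolver('s2vt_solver.prototxt')

# Optional warm start from pretrained weights (placeholder path).
# solver.net.copy_from('vgg16.caffemodel')

solver.solve()  # runs the full training schedule from the solver file
```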
Code to evaluate the predicted sentences (with example) can be found at https://github.com/vsubhashini/caption-eval.
If you find this code helpful, please consider citing:
Sequence to Sequence - Video to Text
S. Venugopalan, M. Rohrbach, J. Donahue, T. Darrell, R. Mooney, K. Saenko
The IEEE International Conference on Computer Vision (ICCV) 2015