Visual Word2Vec (vis-w2v)
Learning visually grounded word embeddings from abstract images
Satwik Kottur*, Ramakrishna Vedantam*, José Moura, Devi Parikh
Visual Word2Vec (vis-w2v): Learning Visually grounded embeddings from abstract images
[ArXiv] [Project Page]
* = equal contribution
The code is organized as follows:
visword2vec.c: Main code for training vis-w2v
refineFunctions.c: Contains functions related to refining embeddings
helperFunctions.c: Mostly contains helper functions for I/O, tokenization, etc.
visualFunction.c: Contains code to refine based on tuples and also perform the common sense (cs) task on the test and validation sets
vpFunctions.c: Contains code to refine based on sentences and also perform the visual paraphrasing (vp) task (a simplified sketch of the refinement idea follows this list)
structs.h: Contains the structures defined for the code
filepaths.h: Contains the paths to the files needed for the two tasks (cs, vp)
Makefile: Helps set up, compile, and run the programs
Other files can be ignored for now.
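For orientation, the refinement performed by these files is CBOW-like: the words in a textual context are used to predict a discrete visual context obtained by k-means clustering of abstract-scene features (this is what the -clusters argument in the run commands below controls). Below is a simplified Python/NumPy sketch of that idea; it is not the actual C implementation, and the names and the exact update rule are illustrative assumptions.

```python
import numpy as np

def refine_step(W_in, W_out, context_word_ids, visual_cluster, lr=0.01):
    """One CBOW-style refinement step (illustrative only).

    W_in:  (vocab, dim)    word embeddings being refined
    W_out: (clusters, dim) output weights, one row per visual (k-means) cluster
    context_word_ids:      indices of the words in the textual context
    visual_cluster:        cluster id of the co-occurring visual features
    """
    h = W_in[context_word_ids].mean(axis=0)       # hidden layer: average of context words
    scores = W_out @ h                            # one score per visual cluster
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                          # softmax over clusters
    grad = probs.copy()
    grad[visual_cluster] -= 1.0                   # cross-entropy gradient w.r.t. scores
    dh = W_out.T @ grad                           # back-propagate to the hidden layer
    W_out -= lr * np.outer(grad, h)               # update cluster (output) weights
    W_in[context_word_ids] -= lr * dh / len(context_word_ids)  # update context words
    return -np.log(probs[visual_cluster])         # loss for this example
```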
The program accepts the following command-line arguments:
embed-path: Initialization for the embeddings (usually pre-trained using word2vec)
Format: the header line should give the vocabulary size and the embedding dimensionality (as in the standard word2vec format)
Each following row should first have the word, then its embedding values delimited by spaces (a loading sketch follows this argument list)
output: Path where the output embeddings will be stored
size: Size of the hidden layer (should match the pre-loaded embeddings)
threads: Number of threads to use for refining, loading and other operations
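For reference, here is a minimal sketch of loading embeddings stored in the plain-text layout described above (header line, then one word plus its vector per line). This is an illustration only, not part of the repository; the .bin files used in the run commands below may instead be in word2vec's binary format.

```python
import numpy as np

def load_text_embeddings(path):
    """Load embeddings stored as: header '<vocab_size> <dim>', then '<word> v1 v2 ...' per line."""
    with open(path, encoding='utf-8') as f:
        vocab_size, dim = map(int, f.readline().split())
        words = []
        vectors = np.zeros((vocab_size, dim), dtype=np.float32)
        for i in range(vocab_size):
            parts = f.readline().rstrip().split(' ')
            words.append(parts[0])
            vectors[i] = np.asarray(parts[1:dim + 1], dtype=np.float32)
    return words, vectors
```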
Currently, only saving a single embedding is supported. To train multiple embeddings, simply set the trainMulti flag (at the top of the visword2vec.c file) to 1. Saving them, however, requires uncommenting the corresponding code.
Steps for usage
- Liblinear-2.1 must be compiled and the path must be correctly set
- yael must be set up (for k-means) and the corresponding paths configured. Link here: yael
- The cs and vp options should be correctly configured
- Make sure all the paths are accessible and correctly set
- Any change to filepaths.h should be followed by re-compiling the code
- Additionally, you also need to link the correct path to liblinear
- NLTK is used for tokenization and lemmatization (VP task)
To run either cs or vp, comment or uncomment the corresponding wrapper calls in the trainModel() function of visword2vec.c. Then run make cs or make vp to compile and run the corresponding task. make alone simply compiles, while make clean cleans up all the binaries.
NOTE: All the binaries are stored in the bin/ folder (you might have to create it if it doesn't exist beforehand).
In this paper, we deal with three tasks: Common Sense Assertion Classification, Visual Paraphrasing and Text-based Image Retrieval.
A. Common Sense Assertion Classification (Project page)
Download the dataset from their project page here. Code to process this dataset further is given in utils/cs/.
The following are the pre-processing steps:
- Extract the training (P, R, S) tuples
- Extract the visual features for clustering
- Extract the test and val (P, R, S) tuples
- Extract the word2vec embeddings to initialize from (you can alternatively use any other embeddings to begin with; we recommend pre-trained embeddings to reproduce the results from the paper)
All the above four steps can be done by simply running:

    cd utils/cs/
    python extractData.py <path to downloaded data> <path to save the data> (optional)
By default it creates a folder data/cs and saves the files in this folder. This will produce files (visual_train.txt, PRS_train.txt, PRS_test.txt, PRS_val.txt) at the destination folder, corresponding to the above steps. Once these files are produced, open filepaths.h and make sure the macros point to the right file paths.
    #define ROOT_CS "data/cs/"
    #define CS_VISUAL_FEATURE_FILE ROOT_CS "visual_train.txt"
    #define CS_PRS_TRAIN_FILE ROOT_CS "PRS_train.txt"
    #define CS_PRS_TEST_FILE ROOT_CS "PRS_test.txt"
    #define CS_PRS_VAL_FILE ROOT_CS "PRS_val.txt"
Now, to run, simply:
    make
    ./visword2vec -cs 1 -embed-path data/cs/word2vec_cs.bin -output cs_refined.bin -size 200 -clusters 25
You can also pass other parameters to suit your needs.
B. Visual Paraphrasing (Project page)
Download the VP dataset from their project page here. Also download the clipart scenes and descriptions (ASD) used to train
vis-w2v from the clipart project page here.
All the scripts needed for pre-processing are available in utils/vp/.
Follow the steps below:
Step 1: Run the fetchVPTrainData.m function to extract the relevant data for training:

    cd utils/vp
    >> fetchVPTrainData(<path to ASD dataset>, <path to VP dataset>, <path to save the data>);
    For example:
    >> fetchVPTrainData('data/vp/AbstractScenes_v1.1', 'data/vp/imagine_v1/', 'data/vp/');
It does the following (not important; if you just want the desired data, run the above command):
- Extracting the visual features abstract_features.txt from the Abstract Scene Dataset (ASD) using a MATLAB script:

      cd utils/vp
      >> extractAbstractFeatures(<path to ASD dataset>, <path to save the data>)
      For example:
      >> extractAbstractFeatures('data/vp/AbstractScenes_v1.1', 'data/vp/')
- The alignment between the ASD and VP datasets is given in two files in utils/vp/. We use them along with the train/test split of VP to select features from the training sentences only, again using MATLAB:

      cd utils/vp
      >> alignAbstractFeatures(<path to VP dataset>, <path to abstract_features.txt>, <path to save the data>)
      For example:
      >> alignAbstractFeatures('data/vp/imagine_v1/', 'data/vp/', 'data/vp/')
- Get the training sentences from the VP dataset (for learning vis-w2v). This produces a file with the training sentences:

      cd utils/vp
      >> saveVPTrainSentences(<path to VP dataset>, <path to save sentences>)
      For example:
      >> saveVPTrainSentences('data/vp/imagine_v1/', 'data/vp')
Step 2: One should use our new vis-w2v embeddings in place of word2vec in the visual paraphrasing task (imagine_v1/code/feature/compute_features_vp.m at line 32). Alternatively, we can save their other text features (co-occurrence and total frequency) and use them in our code, for speed and a smoother interface between learning embeddings and performing the task.
This can be achieved by adding the following lines to imagine_v1/code/feature/compute_features_vp.m before line 30, and running it:

    save('vp_txt_features.mat', 'feat_vp_text_tf_1', 'feat_vp_text_tf_2', 'feat_vp_text_coc_1', 'feat_vp_text_coc_2');
    % Escape from running the remainder of the code
    error('Saved features, getting out!')
Step 3: Next, we obtain all the relevant information to perform the visual paraphrasing task (using MATLAB):
- Sentence pairs
- Other textual features
- Ground truth
- Train / test split
- Train / val split

    cd utils/vp
    >> fetchVPTaskData(<path to VP dataset>, <path to vp_txt_features.mat>, <path to save the files>)
    For example:
    >> fetchVPTaskData('data/vp/imagine_v1/', 'data/vp/imagine_v1/code/', 'data/vp')
Step 4: Finally, we tokenize and lemmatize all the sentences:

    cd utils/vp
    python lemmatizeVPTrain.py <path to data> <path to save sentences>
    For example:
    python lemmatizeVPTrain.py data/vp/ data/vp/
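The repository's lemmatizeVPTrain.py handles this step. Purely as a reference for what NLTK-based tokenization and lemmatization involves, here is a minimal sketch; the function name and exact calls are illustrative assumptions, not the script's actual interface.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads (if not already present)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)

lemmatizer = WordNetLemmatizer()

def lemmatize_sentence(sentence):
    """Tokenize a sentence and lemmatize each token (noun-default lemmatization)."""
    tokens = nltk.word_tokenize(sentence.lower())
    return ' '.join(lemmatizer.lemmatize(tok) for tok in tokens)

# Example
print(lemmatize_sentence("Two dogs are chasing the cats"))  # -> "two dog are chasing the cat"
```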
Phew! That's a lot of pre-processing. Now we are all set to learn vis-w2v embeddings while performing the visual paraphrasing task.
Like before, check all the filepaths in
filepaths.h before proceeding.
Now, to run, simply:
    make
    ./visword2vec -vp 1 -embed-path data/vp/word2vec_vp.bin -output vp_refined.bin -size 200 -clusters 100
You can also pass other parameters to suit your needs. For VP, these are mode, which indicates the training context, and window-size, which indicates the size of the context window (used in the WINDOW mode below):
DESCRIPTIONS: Use all the three sentences for training
SENTENCES: Use sentences one after the other
WINDOW: Use a context window of size window-size
WORDS: Use each word separately
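To make the modes concrete, here is one possible reading of how training contexts could be formed from a description of three sentences. This is an illustrative Python sketch of the interpretation above, not the C implementation; boundary handling and other details may differ.

```python
def make_contexts(description, mode, window_size=5):
    """Return a list of word-lists, each used as one training context.

    description: list of tokenized sentences, e.g. the three sentences of a VP description.
    mode: one of 'DESCRIPTIONS', 'SENTENCES', 'WINDOW', 'WORDS'.
    """
    words = [w for sent in description for w in sent]
    if mode == 'DESCRIPTIONS':                 # all three sentences as a single context
        return [words]
    if mode == 'SENTENCES':                    # each sentence is its own context
        return list(description)
    if mode == 'WINDOW':                       # sliding window of size window_size
        return [words[i:i + window_size]
                for i in range(0, max(1, len(words) - window_size + 1))]
    if mode == 'WORDS':                        # each word on its own
        return [[w] for w in words]
    raise ValueError('unknown mode: %s' % mode)

# Example
desc = [['a', 'boy', 'plays', 'soccer'], ['the', 'sun', 'is', 'out'], ['a', 'dog', 'watches']]
print(len(make_contexts(desc, 'WINDOW', 3)))   # number of sliding-window contexts
```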
The program prints results for 100 runs, with both validation and test performance. We choose the run with the best validation performance and report the corresponding test result.
C. Text-based Image Retrieval
This task involves retrieving the appropriate image based on the associated tuple. We collect the data and make it available at
utils/text-ret/text_ret_tuples.pickle along with ground truths for each image as
utils/text-ret/text_ret_gt.txt. The goal is to retrieve the correct image from the list of ground-truth tuples, using each of the collected queries in the pickle file as a query.
The code for this task is provided in Python in
utils/text_ret/ along with the data. To run, we need to point it to the data directory along with the embedding paths. There are two modes for this task: (A) SINGLE - uses a single embedding for P, R, and S. (B) MULTI - uses three different embeddings, one for each of P, R, and S. The inputs are given accordingly.
    cd utils/text-ret/
    python performRetrieval.py <path to data> <path to embedding>
    (or)
    python performRetrieval.py <path to data> <path to P embedding> <path to R embedding> <path to S embedding>
    For example:
    python performRetrieval.py ./ cs_refined.bin
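As a rough illustration of the SINGLE mode, retrieval can be thought of as ranking images by the cosine similarity between the averaged embedding of the query tuple's words and that of each image's ground-truth tuple. The sketch below assumes embeddings already loaded as a word-to-vector dict; it is not the actual performRetrieval.py logic, and all names are illustrative.

```python
import numpy as np

def tuple_vector(tup, emb):
    """Average the embeddings of all words in a (P, R, S) tuple."""
    vecs = [emb[w] for part in tup for w in part.split() if w in emb]
    return np.mean(vecs, axis=0)

def retrieve(query_tuple, gt_tuples, emb):
    """Rank images (indexed by their ground-truth tuple) for one query tuple."""
    q = tuple_vector(query_tuple, emb)
    scores = []
    for idx, gt in enumerate(gt_tuples):
        g = tuple_vector(gt, emb)
        cos = np.dot(q, g) / (np.linalg.norm(q) * np.linalg.norm(g) + 1e-8)
        scores.append((cos, idx))
    return [idx for _, idx in sorted(scores, reverse=True)]  # best match first
```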