1. Download the trial, train and test sets from the task page. Place the data in the same directory, in subdirectories named `trial_v1`, `train_v1`, and `test_v1`, respectively. In each subdirectory, create a folder with the images for the specific subset, named `trial_images_v1`, `train_images_v1`, or `test_images_v1`.
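
   Before moving on, you can sanity-check the layout with a short script. This is a minimal sketch (not part of the repository) that only assumes the directory names listed above; pass the path to the data directory as its argument:

   ```python
   import sys
   from pathlib import Path

   SUBSETS = ("trial", "train", "test")

   def check_layout(data_dir: str) -> None:
       """Report whether each subset directory and its image folder exist."""
       root = Path(data_dir)
       for subset in SUBSETS:
           subset_dir = root / f"{subset}_v1"
           images_dir = subset_dir / f"{subset}_images_v1"
           print(f"{subset_dir}: {'ok' if subset_dir.is_dir() else 'MISSING'}")
           print(f"{images_dir}: {'ok' if images_dir.is_dir() else 'MISSING'}")

   if __name__ == "__main__":
       check_layout(sys.argv[1] if len(sys.argv) > 1 else ".")
   ```
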
2. Wikipedia retrieval is handled by the `wiki-index` application, available in a separate repository: https://github.com/sdadas/wiki-index. Clone the repository, execute `mvn package` to build the app, and `java -jar target/wiki-index.jar` to run it. In order to build a new index, you need to download the appropriate Wikipedia dump in `pages-articles` format from the Wikimedia Downloads repository (see the sketch below), then execute the `vwsd/wikipedia.py` script. Instead of building an index from scratch, you can also download our pre-built indexes for English, Italian and Persian. Unzip the archive to the directory from which you run the Java program.
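
   If you decide to build an index yourself, the dump can be fetched with a few lines of Python. A minimal sketch, assuming the standard Wikimedia Downloads URL layout for the latest English dump (swap `enwiki` for e.g. `itwiki` or `fawiki` for the other languages); note that these files are tens of gigabytes in size:

   ```python
   import urllib.request

   # Assumed URL layout of the Wikimedia Downloads site; adjust the wiki prefix as needed.
   DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

   def download_dump(url: str = DUMP_URL, out_path: str = "pages-articles.xml.bz2") -> None:
       """Download the pages-articles dump used as input for building the index."""
       print(f"Downloading {url} ...")
       urllib.request.urlretrieve(url, out_path)
       print(f"Saved to {out_path}")

   if __name__ == "__main__":
       download_dump()
   ```
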
3. Download the WIT dataset from the official repository. You should download all `*.tsv.gz` files from the training, test and validation parts of the dataset, then unpack them to the directory of your choice.
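
   A minimal sketch for the unpacking step, using only the standard library; the source and target directory names below are placeholders:

   ```python
   import gzip
   import shutil
   from pathlib import Path

   def unpack_wit(src_dir: str, dst_dir: str) -> None:
       """Decompress every WIT *.tsv.gz file from src_dir into dst_dir."""
       out = Path(dst_dir)
       out.mkdir(parents=True, exist_ok=True)
       for archive in sorted(Path(src_dir).glob("*.tsv.gz")):
           target = out / archive.name[:-3]  # drop the trailing ".gz"
           with gzip.open(archive, "rb") as fin, open(target, "wb") as fout:
               shutil.copyfileobj(fin, fout)
           print(f"Unpacked {archive.name} -> {target.name}")

   if __name__ == "__main__":
       unpack_wit("wit_downloads", "wit")  # placeholder directory names
   ```
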
4. To generate predictions for the test dataset, execute the following command:

   ```shell
   python run_model.py --wit_dir [path_to_wit_directory] --data_dir [path_to_vwsd_task_data] --data_split test --lang en
   ```
5. For languages other than English, the code uses additional models which need to be downloaded. First, download the fine-tuned CLIP text encoders for Italian and Persian, then extract them in the project directory. Next, create a new directory named `embeddings` in the project root directory. Download the FastText models for Italian and Persian, and place them in the newly created directory.
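
   A small helper for this step; a minimal sketch that creates the `embeddings` directory and reports which model files it already contains (the exact FastText file names depend on the models you downloaded):

   ```python
   from pathlib import Path

   def prepare_embeddings_dir(project_root: str = ".") -> None:
       """Create the embeddings directory and list the model files placed in it."""
       emb_dir = Path(project_root) / "embeddings"
       emb_dir.mkdir(exist_ok=True)
       models = sorted(p.name for p in emb_dir.iterdir() if p.is_file())
       if models:
           print("Found model files:", ", ".join(models))
       else:
           print(f"Place the Italian and Persian FastText models in {emb_dir.resolve()}")

   if __name__ == "__main__":
       prepare_embeddings_dir()
   ```
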
6. To run the code for a language other than English, execute the same command as in step 4, changing the `lang` parameter. For example:

   ```shell
   python run_model.py --wit_dir [path_to_wit_directory] --data_dir [path_to_vwsd_task_data] --data_split test --lang it
   ```
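
   If you want to generate predictions for several languages in one go, the command can also be launched from Python. A minimal sketch using only the flags shown above; the paths are placeholders, and `fa` as the code for Persian is an assumption, since only `en` and `it` appear in the examples:

   ```python
   import subprocess

   def run_all(wit_dir: str, data_dir: str, langs=("en", "it", "fa"), split: str = "test") -> None:
       """Run run_model.py once per language with the flags shown above."""
       for lang in langs:
           cmd = [
               "python", "run_model.py",
               "--wit_dir", wit_dir,
               "--data_dir", data_dir,
               "--data_split", split,
               "--lang", lang,
           ]
           subprocess.run(cmd, check=True)

   if __name__ == "__main__":
       run_all("/path/to/wit", "/path/to/vwsd_task_data")  # placeholder paths
   ```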