Code for SemEval 2023 Task 1: Visual Word Sense Disambiguation

sdadas/vwsd

Getting started

1. Download the VWSD task data

Download the trial, train, and test sets from the task page. Place all three in a single data directory, in subdirectories named trial_v1, train_v1, and test_v1, respectively. Inside each subdirectory, put the images for that subset in a folder named trial_images_v1, train_images_v1, or test_images_v1.
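Before running anything, it can help to confirm that the data directory matches the layout described above. The following is a minimal sketch, not part of this repository: the helper name missing_dirs is hypothetical, and it only checks that the directories exist, not what they contain.

```python
from pathlib import Path

# Subset names taken from the layout described in step 1.
SUBSETS = ("trial", "train", "test")


def missing_dirs(data_dir):
    """Return the expected directories that do not exist under data_dir.

    Hypothetical helper: for each subset it expects <subset>_v1 and,
    inside it, <subset>_images_v1.
    """
    root = Path(data_dir)
    expected = []
    for subset in SUBSETS:
        subset_dir = root / f"{subset}_v1"
        expected.append(subset_dir)
        expected.append(subset_dir / f"{subset}_images_v1")
    return [path for path in expected if not path.is_dir()]
```

Calling missing_dirs with the path you intend to pass as --data_dir returns an empty list when all six directories are in place.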

2. Prepare Wikipedia index

Wikipedia retrieval is handled by the wiki-index application, which lives in a separate repository: https://github.com/sdadas/wiki-index. Clone the repository, run mvn package to build the app, and then run java -jar target/wiki-index.jar to start it. To build a new index, download the appropriate Wikipedia dump in pages-articles format from the Wikimedia Downloads repository, then execute the vwsd/wikipedia.py script.

Instead of building an index from scratch, you can also download our pre-built indexes for English, Italian and Persian. Unzip the archive to the directory from which you run the Java program.
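The clone, build, and run steps for wiki-index can be scripted. The sketch below only assembles the commands named in this step. The function names wiki_index_steps and run_steps, and the default checkout directory, are assumptions for illustration. It also assumes git, mvn, and java are on your PATH.

```python
import subprocess

WIKI_INDEX_URL = "https://github.com/sdadas/wiki-index"


def wiki_index_steps(repo_dir="wiki-index"):
    """Return (command, working_directory) pairs for the steps in this section.

    Hypothetical helper; repo_dir is an assumed checkout location.
    """
    return [
        (["git", "clone", WIKI_INDEX_URL, repo_dir], None),
        (["mvn", "package"], repo_dir),
        (["java", "-jar", "target/wiki-index.jar"], repo_dir),
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first failure.

    The last step starts the application, so this call blocks for as long
    as the application keeps running.
    """
    for command, cwd in steps:
        subprocess.run(command, cwd=cwd, check=True)
```

If you use the pre-built indexes, unzip them into repo_dir before the final step, since that is the directory the Java program is started from in this sketch.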

3. Download WIT dataset

Download the WIT dataset from the official repository. Download all *.tsv.gz files from the training, test, and validation parts of the dataset, then unpack them into a directory of your choice.
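If you prefer not to unpack the archives by hand, the standard library can do it. This is a minimal sketch; the function name unpack_tsv_gz and both directory arguments are hypothetical. It assumes the downloaded *.tsv.gz files sit together in one source directory.

```python
import gzip
import shutil
from pathlib import Path


def unpack_tsv_gz(src_dir, dest_dir):
    """Decompress every *.tsv.gz file in src_dir into dest_dir.

    Returns the list of unpacked .tsv paths. Hypothetical helper, not part
    of this repository.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    unpacked = []
    for archive in sorted(Path(src_dir).glob("*.tsv.gz")):
        # Drop the trailing ".gz" so "name.tsv.gz" becomes "name.tsv".
        target = dest / archive.name[: -len(".gz")]
        with gzip.open(archive, "rb") as fin, open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        unpacked.append(target)
    return unpacked
```

The destination directory is then the one to pass as --wit_dir.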

4. Run the code for English

To generate predictions for the test dataset, execute the following command:

python run_model.py --wit_dir [path_to_wit_directory] --data_dir [path_to_vwsd_task_data] --data_split test --lang en

5. Download additional models for Italian and Persian

For languages other than English, the code relies on additional models that must be downloaded first. Download the fine-tuned CLIP text encoders for Italian and Persian and extract them into the project directory. Next, create a new directory named embeddings in the project root directory. Download the FastText models for Italian and Persian, and place them in the newly created directory.

6. Run the code for Italian or Persian

To run the code for a language other than English, execute the same command as in step 4, changing the lang parameter. For example:

python run_model.py --wit_dir [path_to_wit_directory] --data_dir [path_to_vwsd_task_data] --data_split test --lang it
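To generate predictions for several languages in one go, the command above can be wrapped in a small driver. This is a sketch under stated assumptions: build_command and run_languages are hypothetical names, the script is assumed to be run from the project root, and only the flags shown in steps 4 and 6 are used. The language code for Persian is not given in this README, so it is left for you to supply.

```python
import subprocess
import sys


def build_command(wit_dir, data_dir, lang, data_split="test"):
    """Build the run_model.py command line shown in steps 4 and 6."""
    return [
        sys.executable, "run_model.py",
        "--wit_dir", str(wit_dir),
        "--data_dir", str(data_dir),
        "--data_split", data_split,
        "--lang", lang,
    ]


def run_languages(wit_dir, data_dir, langs=("en", "it")):
    """Run the model once per language code, stopping at the first failure."""
    for lang in langs:
        subprocess.run(build_command(wit_dir, data_dir, lang), check=True)
```

For example, run_languages("/path/to/wit", "/path/to/vwsd_data") runs the English and Italian predictions back to back.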
