Visual word sense disambiguation - SemEval 2023

About

V-WSD is a SemEval 2023 Task.
Task description: Given a word and some limited textual context, the task is to select, from a set of candidate images, the one that corresponds to the intended meaning of the target word.
Dataset: The dataset was provided by the task organizers. In the training data, all target words and their contexts are in English. Each entry consists of a target word, its context, and references to 10 images, all separated by tab characters. Exactly one of the 10 candidate images is the gold standard.

The training set is ~18GB, with 12869 samples (12999 images in total). Each entry has the following layout:

target_word <tab> full_phrase <tab> image_1 <tab> image_2 <tab> ... <tab> image_10

"target_word": the potentially ambiguous target word.

"full_phrase": the textual context containing the target_word.

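As an illustration of the entry layout described above, here is a minimal parsing sketch. The `Sample` class, the `parse_line` helper, and the example line are hypothetical and are not taken from `dataset.py`; the only assumptions drawn from the description are the tab separator and the two text fields followed by 10 image references.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    target_word: str   # the potentially ambiguous target word
    full_phrase: str   # the textual context containing the target word
    images: List[str]  # candidate image references, in file order


def parse_line(line: str, num_images: int = 10) -> Sample:
    """Split one tab-separated entry into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2 + num_images:
        raise ValueError(f"expected {2 + num_images} fields, got {len(fields)}")
    return Sample(fields[0], fields[1], fields[2:])


# Made-up example line: the image names are placeholders, not real dataset files.
example = "\t".join(["bank", "bank erosion"] + [f"image_{i}" for i in range(1, 11)])
sample = parse_line(example)
print(sample.target_word, sample.full_phrase, len(sample.images))
```

Validating the field count up front makes malformed lines fail loudly instead of silently yielding samples with too few candidates.
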
Infrastructure

We ran our experiments on GCP Compute Engine, using a single V100 GPU (16GB VRAM). We cached the dataset on GCS to avoid repeated downloads.

Running the code

The dependencies for running the code can be installed using the env.yaml file.

conda env create -f env.yaml
conda activate cs521

To train the model, replace the placeholders in angle brackets in the following command:

python main.py --base_path "<DATASET_DIR>" --model_save_path "<SAVE_DIR>" --model_log_path "<LOG_DIR>"

To evaluate the model, replace the placeholders in angle brackets in the following command:

python main.py --execute 1 --base_path "<DATASET_DIR>" --model_save_path "<SAVE_DIR>" --model_log_path "<LOG_DIR>"

Results

Training Results

Plots (images not included here): Loss, Mean Reciprocal Rank, Hit rate @ 1.
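
For readers unfamiliar with the two ranking metrics named above, the sketch below shows one common way to compute them from per-candidate scores. It is not the code in `metrics.py`: the function names, the score lists, and the tie-breaking rule (ties count in favor of the gold image) are all assumptions made for illustration.

```python
from typing import List, Sequence


def reciprocal_rank(scores: Sequence[float], gold_index: int) -> float:
    """Return 1 / rank of the gold image, where rank 1 is the highest score."""
    gold_score = scores[gold_index]
    # Rank = 1 + number of candidates scored strictly higher than the gold image.
    rank = 1 + sum(1 for s in scores if s > gold_score)
    return 1.0 / rank


def mean_reciprocal_rank(all_scores: List[Sequence[float]], gold: List[int]) -> float:
    """Average the reciprocal rank over all samples."""
    rr = [reciprocal_rank(s, g) for s, g in zip(all_scores, gold)]
    return sum(rr) / len(rr)


def hit_rate_at_1(all_scores: List[Sequence[float]], gold: List[int]) -> float:
    """Fraction of samples whose gold image is ranked first."""
    hits = [reciprocal_rank(s, g) == 1.0 for s, g in zip(all_scores, gold)]
    return sum(hits) / len(hits)


# Made-up scores for two samples with three candidates each.
scores = [[0.1, 0.9, 0.3], [0.8, 0.2, 0.5]]
gold = [1, 2]
mrr = mean_reciprocal_rank(scores, gold)
hit1 = hit_rate_at_1(scores, gold)
print(mrr, hit1)
```

In the first sample the gold image has the top score (reciprocal rank 1); in the second it is ranked second (reciprocal rank 0.5), so only one of the two samples counts as a hit at rank 1.
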

Future work

Because we had limited access to better hardware, we ran only one experiment. In the future, we could:

Perform hyperparameter search.
Investigate data augmentation. Further negative examples could be added to each sample based on the same target word.
Apply image augmentations such as color jitter and random crop.
Use more powerful vision encoders. We used the tiny variant of ConvNeXt V2.
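
One possible reading of the negative-example idea in the list above is sketched below: for each sample, gold images from other phrases that share the same target word are appended as extra candidates. This is an untested idea, not part of the repository; the tuple layout, the `add_same_word_negatives` name, and the file names are hypothetical. It also assumes that different phrases for the same target word refer to different senses, which may not always hold.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout: (target_word, full_phrase, gold_image, candidate_images)
Sample = Tuple[str, str, str, List[str]]


def add_same_word_negatives(samples: List[Sample], extra: int, seed: int = 0) -> List[Sample]:
    """Append up to `extra` gold images of other phrases with the same target word."""
    rng = random.Random(seed)
    # Index (phrase, gold image) pairs by target word.
    gold_by_word: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for word, phrase, gold_image, _ in samples:
        gold_by_word[word].append((phrase, gold_image))

    augmented = []
    for word, phrase, gold_image, candidates in samples:
        # Exclude this sample's own phrase, its gold image, and existing candidates.
        pool = sorted({
            g for p, g in gold_by_word[word]
            if p != phrase and g != gold_image and g not in candidates
        })
        picked = rng.sample(pool, min(extra, len(pool)))
        augmented.append((word, phrase, gold_image, candidates + picked))
    return augmented


# Made-up samples; the file names are placeholders.
data = [
    ("bank", "bank erosion", "river.jpg", ["river.jpg", "a.jpg"]),
    ("bank", "bank deposit", "money.jpg", ["money.jpg", "b.jpg"]),
    ("bat", "baseball bat", "bat.jpg", ["bat.jpg", "c.jpg"]),
]
result = add_same_word_negatives(data, extra=1)
print(result)
```

Samples whose target word appears only once are returned unchanged, since there is no other sense to draw a negative from.
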

References

LiT: Zero-Shot Transfer with Locked-image text Tuning, Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, Lucas Beyer

vaibhavBh-0/VisualWSD
