CLIP Visual Spatial Reasoning

Benchmark CLIP models using Visual Spatial Reasoning.

Original Visual Spatial Reasoning repo

Note: currently this is true zero-shot (no fine-tuning). I benchmark the following CLIP models (a loading sketch follows the list):

  • OpenClip laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
  • OpenClip laion/CLIP-ViT-H-14-laion2B-s32B-b79K
  • OpenAI Clip openai/clip-vit-large-patch14-336
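
All three checkpoints are published on the Hugging Face Hub. As a minimal loading sketch (an assumption about tooling, not necessarily how the scripts in src/ load them), they can be used through transformers:

import torch
from transformers import CLIPModel, CLIPProcessor

# Any of the three hub IDs listed above works here.
model_id = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained(model_id).to(device).eval()
processor = CLIPProcessor.from_pretrained(model_id)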

Findings:

  • Using the (True) / (False) modifiers proposed in the paper gives results no better than random.
  • After experimenting with many strategies for modifying the prompts, I was able to reach about 55% (slightly better than chance).

Open questions:

  • Will fine-tuning the model show the same or better results as the model types in the VSR paper?
  • How do the different relationships score? (Does CLIP natively understand any relationships reasonably well?)

fine-tuning results

python src\train.py --base_model ViT-L/14@336px --mini_batch_size 20 --batch_size 500 --learning_rate 2e-5

test_accuracy: 65.07%; trained model: model_run-113-65-07.pt
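
src/train.py is the authoritative implementation; as a rough sketch only, one way to fine-tune CLIP on VSR's binary labels is to treat the scaled image-text cosine similarity as a logit and minimise binary cross-entropy (the loss and model API below are assumptions, not a copy of the script):

import torch
import torch.nn.functional as F

def training_step(model, images, texts, labels):
    # images and texts are already preprocessed/tokenised; labels are the 0/1 VSR labels.
    image_features = F.normalize(model.encode_image(images), dim=-1)
    text_features = F.normalize(model.encode_text(texts), dim=-1)
    # Scaled cosine similarity between each image and its own caption.
    logits = model.logit_scale.exp() * (image_features * text_features).sum(dim=-1)
    return F.binary_cross_entropy_with_logits(logits, labels.float())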

v-002 results

Uses the modified prompt pairs (a scoring sketch follows the list), e.g.:

  • The horse is left of
  • The horse is left of the person.
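
One way such a prompt pair can be scored zero-shot is to encode the image together with both texts and predict True when the full caption matches the image better than the truncated one; whether eval002.py scores exactly this way is an assumption, and the image path below is only an example.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("data/images/example.jpg")  # hypothetical file name
texts = ["The horse is left of", "The horse is left of the person."]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]  # similarity of the image to each prompt
prediction = bool(logits[1] > logits[0])          # True if the full caption scores higher
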
python src\eval002.py --model_url laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Score: 55.23%

python src\eval002.py --model_url laion/CLIP-ViT-H-14-laion2B-s32B-b79K

Score: 55.44%

python src\eval002.py --model_url openai/clip-vit-large-patch14-336

Score: 54.39%

v-001 results

python src\eval001.py --model_url laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Score: 55.23%

python src\eval001.py --model_url laion/CLIP-ViT-H-14-laion2B-s32B-b79K

Score: 53.83%

python src\eval001.py --model_url openai/clip-vit-large-patch14-336

Score: 53.86%

v-000 results

Uses the prompts from the VSR paper (but without retraining), e.g. (a scoring sketch follows the list):

  • The horse is left of the person. (False)
  • The horse is left of the person. (True)
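
For comparison, the paper-style scoring only changes the candidate texts; reusing model, processor, and image from the v-002 sketch above (again an assumption about how eval000.py works):

texts = ["The horse is left of the person. (True)",
         "The horse is left of the person. (False)"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]
prediction = bool(logits[0] > logits[1])  # True if the "(True)" variant scores higher
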
python src\eval000.py --model_url laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Score: 49.24%

python src\eval000.py --model_url laion/CLIP-ViT-H-14-laion2B-s32B-b79K

Score: 49.51%

python src\eval000.py --model_url openai/clip-vit-large-patch14-336

Score: 48.85%

install

conda env create
conda activate clip-vsr

run

python src\eval.py

Download images

See the readme in the data/ folder. Images should be saved under data/images/.
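
The annotations themselves are small jsonl files; a minimal reading sketch, assuming the field names of the original VSR release ("image", "caption", "label") and a hypothetical annotation path:

import json
from pathlib import Path

def load_vsr(annotation_path, image_dir="data/images"):
    examples = []
    with open(annotation_path) as f:
        for line in f:
            record = json.loads(line)
            examples.append({
                "image": Path(image_dir) / record["image"],
                "caption": record["caption"],
                "label": bool(record["label"]),  # 1 = caption is true of the image
            })
    return examples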

Citation

If you use the VSR dataset, please cite the original authors:

@article{Liu2022VisualSR,
  title={Visual Spatial Reasoning},
  author={Fangyu Liu and Guy Edward Toh Emerson and Nigel Collier},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.00363}
}

License

This project is licensed under the Apache-2.0 License.
