CLIP Visual Spatial Reasoning

Benchmark CLIP models using Visual Spatial Reasoning.

Original Visual Spatial Reasoning repo

Note: currently this is true zero-shot (no fine-tuning). I benchmark the following CLIP models (a loading sketch follows the list):

  • OpenClip laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
  • OpenClip laion/CLIP-ViT-H-14-laion2B-s32B-b79K
  • OpenAI Clip openai/clip-vit-large-patch14-336
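
All three checkpoints are published on the Hugging Face Hub. As a minimal loading sketch (an assumption about tooling, not necessarily how the scripts in src/ load them), they can be used through transformers:

import torch
from transformers import CLIPModel, CLIPProcessor

# Any of the three hub IDs listed above works here.
model_id = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained(model_id).to(device).eval()
processor = CLIPProcessor.from_pretrained(model_id)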

Findings:

  • Using the (True) / (False) modifiers proposed in the paper gives results no better than random.
  • After experimenting with many strategies for modifying the prompts, I was able to reach about 55% (slightly better than chance).

Open questions:

  • Will fine-tuning the model show the same or better results as the model types in the VSR paper?
  • How do the different relationships score? (Does CLIP natively understand any relationships reasonably well?)

fine-tuning results

python src\train.py --base_model ViT-L/14@336px --mini_batch_size 20 --batch_size 500 --learning_rate 2e-5

test_accuracy: 65.07%; trained model: model_run-113-65-07.pt
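
src/train.py is the authoritative implementation; as a rough sketch only, one way to fine-tune CLIP on VSR's binary labels is to treat the scaled image-text cosine similarity as a logit and minimise binary cross-entropy (the loss and model API below are assumptions, not a copy of the script):

import torch
import torch.nn.functional as F

def training_step(model, images, texts, labels):
    # images and texts are already preprocessed/tokenised; labels are the 0/1 VSR labels.
    image_features = F.normalize(model.encode_image(images), dim=-1)
    text_features = F.normalize(model.encode_text(texts), dim=-1)
    # Scaled cosine similarity between each image and its own caption.
    logits = model.logit_scale.exp() * (image_features * text_features).sum(dim=-1)
    return F.binary_cross_entropy_with_logits(logits, labels.float())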

v-002 results

Uses the modified prompt pairs (a scoring sketch follows the list), e.g.:

  • The horse is left of
  • The horse is left of the person.
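
One way such a prompt pair can be scored zero-shot is to encode the image together with both texts and predict True when the full caption matches the image better than the truncated one; whether eval002.py scores exactly this way is an assumption, and the image path below is only an example.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("data/images/example.jpg")  # hypothetical file name
texts = ["The horse is left of", "The horse is left of the person."]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]  # similarity of the image to each prompt
prediction = bool(logits[1] > logits[0])          # True if the full caption scores higher
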
python src\eval002.py --model_url laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Score: 55.23%

python src\eval002.py --model_url laion/CLIP-ViT-H-14-laion2B-s32B-b79K

Score: 55.44%

python src\eval002.py --model_url openai/clip-vit-large-patch14-336

Score: 54.39%

v-001 results

python src\eval001.py --model_url laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Score: 55.23%

python src\eval001.py --model_url laion/CLIP-ViT-H-14-laion2B-s32B-b79K

Score: 53.83%

python src\eval001.py --model_url openai/clip-vit-large-patch14-336

Score: 53.86%

v-000 results

Uses the prompts from the VSR paper (but without retraining), e.g. (a scoring sketch follows the list):

  • The horse is left of the person. (False)
  • The horse is left of the person. (True)
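
For comparison, the paper-style scoring only changes the candidate texts; reusing model, processor, and image from the v-002 sketch above (again an assumption about how eval000.py works):

texts = ["The horse is left of the person. (True)",
         "The horse is left of the person. (False)"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]
prediction = bool(logits[0] > logits[1])  # True if the "(True)" variant scores higher
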
python src\eval000.py --model_url laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Score: 49.24%

python src\eval000.py --model_url laion/CLIP-ViT-H-14-laion2B-s32B-b79K

Score: 49.51%

python src\eval000.py --model_url openai/clip-vit-large-patch14-336

Score: 48.85%

install

conda env create
conda activate clip-vsr

run

python src\eval.py

Download images

See the readme in the data/ folder. Images should be saved under data/images/.
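
The annotations themselves are small jsonl files; a minimal reading sketch, assuming the field names of the original VSR release ("image", "caption", "label") and a hypothetical annotation path:

import json
from pathlib import Path

def load_vsr(annotation_path, image_dir="data/images"):
    examples = []
    with open(annotation_path) as f:
        for line in f:
            record = json.loads(line)
            examples.append({
                "image": Path(image_dir) / record["image"],
                "caption": record["caption"],
                "label": bool(record["label"]),  # 1 = caption is true of the image
            })
    return examples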

Citation

If you use the VSR dataset, please cite the original authors:

@article{Liu2022VisualSR,
  title={Visual Spatial Reasoning},
  author={Fangyu Liu and Guy Edward Toh Emerson and Nigel Collier},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.00363}
}

License

This project is licensed under the Apache-2.0 License.
