
Visual and Vision-Language Representation Pre-Training with Contrastive Learning

This repository presents my approach to building a family of vision-language foundation systems. Such systems are among the most capable models in artificial intelligence today and are used for a variety of important tasks, including generation, retrieval, and classification. The core of the approach is a dual encoder that combines a pre-trained computer-vision encoder with a pre-trained natural-language-processing encoder.

Setup

The models were trained and tested with Python 3.7 and PyTorch 1.7.1. The code also requires the timm and transformers packages:

$ pip install timm==0.6.7
$ pip install transformers

Training the model:

python main.py

During this research, I trained the model on a small-scale dataset with several image encoders, such as deit3, efficientnet_b8, convnext, swinv2, and fbnetv3, paired with several text encoders, such as roberta, xlm-roberta, xlnet, albert, electra, and bert. Changing the image and text encoders clearly changes the performance of the resulting model: encoders such as ConvNeXt and RoBERTa produced better results than the more widely used BERT and ViT. I also found that training on a small-scale dataset still produced useful results. A sketch of how the encoders are combined is shown below.
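To illustrate how the image and text encoders can be swapped, here is a minimal sketch of a dual encoder that pairs a timm image backbone with a Hugging Face text backbone. The model names, projection dimension, and pooling choice are illustrative assumptions, not the repository's exact implementation.

```python
# Minimal dual-encoder sketch (illustrative, not the repository's exact code).
import timm
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class DualEncoder(nn.Module):
    def __init__(self, image_model="convnext_base", text_model="roberta-base", embed_dim=256):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of class logits
        self.image_encoder = timm.create_model(image_model, pretrained=True, num_classes=0)
        self.text_encoder = AutoModel.from_pretrained(text_model)
        # project both towers into a shared embedding space (embed_dim is an assumption)
        self.image_proj = nn.Linear(self.image_encoder.num_features, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)

    def forward(self, images, input_ids, attention_mask):
        img_emb = self.image_proj(self.image_encoder(images))
        # use the first token's hidden state as the sentence representation
        txt_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        txt_emb = self.text_proj(txt_out.last_hidden_state[:, 0, :])
        # L2-normalise so dot products are cosine similarities
        return F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)
```

Swapping encoders then amounts to changing the `image_model` or `text_model` name passed to the constructor.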

The table below shows the best results obtained after testing more than 80 image/text encoder pairs, each trained for 20 epochs.

| Image and text encoders | Top-1 accuracy (%) | Top-5 accuracy (%) |
|---|---|---|
| Swin + BERT | 13.99 | 31.65 |
| Swin + ALBERT | 13.28 | 29.04 |
| Swin + RoBERTa | 14.83 | 32.64 |
| ConvNeXt + BERT | 16.63 | 35.31 |
| ConvNeXt + ALBERT | 13.46 | 29.46 |
| ConvNeXt + RoBERTa | 17.31 | 35.31 |

The model composed of the ConvNeXt and RoBERTa encoders achieved satisfactory results on image-text retrieval across several datasets, including ImageNetV2 and Unsplash, using textual queries, visual queries, and combined visual + textual queries, as shown in the notebooks Search_In_Unsplash.ipynb and Image_To_Text_Search.ipynb. It also performed well on video-text retrieval for videos from YouTube (Test_video.ipynb). On image classification it outperformed many other models on datasets such as Food-101, CIFAR-10, CIFAR-100, Describable Textures (DTD), Oxford-IIIT Pets (Pets), MNIST, STL-10, the German Traffic Sign Recognition Benchmark (GTSRB), and Rendered SST2 (SST), as shown in CR_classification.ipynb. Finally, I tested the model on zero-shot image classification, where it performed well on sets such as the CIFAR-100 classes with random images, puppy vs. bagel, and chihuahua vs. muffin (Zero-Shot Image Classification.ipynb).
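As a rough illustration of how zero-shot classification works with a dual encoder, the sketch below compares a normalised image embedding against one text embedding per class and picks the most similar class. The prompt template mentioned in the comments and the random stand-in embeddings are hypothetical, not taken from the notebooks.

```python
# Zero-shot classification sketch: build one caption per class (e.g. "a photo
# of a {label}"), encode it with the text tower, and compare to the image
# embedding by cosine similarity. Assumes pre-computed, L2-normalised embeddings.
import torch

@torch.no_grad()
def zero_shot_classify(image_emb: torch.Tensor, class_text_embs: torch.Tensor):
    """image_emb: (D,) normalised; class_text_embs: (C, D) normalised, one row per class."""
    similarities = class_text_embs @ image_emb   # cosine similarity with each class caption
    probs = similarities.softmax(dim=-1)         # interpret similarities as class scores
    return probs.argmax().item(), probs

# Illustrative usage with random tensors standing in for real encoder outputs:
img = torch.nn.functional.normalize(torch.randn(256), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(100, 256), dim=-1)  # e.g. 100 class captions
predicted_class, class_probs = zero_shot_classify(img, txt)
```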

These results were obtained by training the model with a contrastive learning objective on an image-text pair dataset, then using the resulting encoders in a zero-shot fashion to retrieve images or texts. Given the limited size of the Flickr30k training set and the small number of training epochs, the results are encouraging.
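The sketch below shows a minimal CLIP-style symmetric contrastive loss of the kind described above; the temperature value is an illustrative assumption.

```python
# Symmetric contrastive (InfoNCE) loss for a batch of image-text pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(image_embs: torch.Tensor, text_embs: torch.Tensor, temperature: float = 0.07):
    """Both inputs are (batch, dim), L2-normalised; row i of each tensor is a matching pair."""
    logits = image_embs @ text_embs.t() / temperature         # pairwise similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)               # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)           # match each caption to its image
    return (loss_i2t + loss_t2i) / 2
```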
