<a href="https://colab.research.google.com/github/smbanasik/CLIP-Training-Research/blob/colab/CSCE_636_CLIP_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Contrastive Language-Image Pretraining with SogCLR

### Benchmarks

The table below reports results on the provided MSCOCO and ImageNet datasets. The first row is the model trained with the CLIP loss; the second row is the model trained with the SogCLR loss. Both models were pretrained for 30 epochs with a batch size of 128. TR@1 and IR@1 denote recall at 1 for text retrieval and image retrieval on MSCOCO, respectively, ACC@1 denotes top-1 accuracy on ImageNet, and Average is the mean of the three metrics.

| Method | MSCOCO TR@1 | MSCOCO IR@1 | ImageNet ACC@1 | Average |
|:----------:|:--------:|:--------:|:--------:|:--------:|
| CLIP | 12.0 | 9.32 | 21.35 | 14.22 |
| SogCLR | 14.38 | 10.73 | 24.54 | 16.55 |
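
As a quick sanity check, the Average column can be recomputed from the three per-metric scores. The sketch below hard-codes the values from the table above; the dictionary and variable names are illustrative only and are not part of the project code.

```python
# Scores copied from the benchmark table: (TR@1, IR@1, ACC@1)
scores = {
    "CLIP": (12.0, 9.32, 21.35),
    "SogCLR": (14.38, 10.73, 24.54),
}

# Average is the unweighted mean of the three metrics, rounded to two decimals
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```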

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [33]:
!git checkout colab

Branch 'colab' set up to track remote branch 'colab' from 'origin'.
Switched to a new branch 'colab'


In [35]:
import numpy as np
import random
import time
import datetime
import os
import torch
from transformers import DistilBertTokenizer

from our_model import create_optimizer

import pipeline as pipe
import our_model as our
from main import HyperParamsAndArgs

# Instantiate the hyperparameters first so params.seed exists before seeding
params = HyperParamsAndArgs()
torch.manual_seed(params.seed)
np.random.seed(params.seed)
random.seed(params.seed)

# train_loader, coco_loader, imagenet_loader = generate_loaders(params)

print("Creating model")
model = our.CLIP(image_encoder=params.image_encoder, text_encoder=params.text_encoder, embed_dim=params.embed_dim, init_model=True, bsz=params.batch_size,
              world_size=1, ita_type=params.loss_type, sogclr_gamma=params.sogclr_gamma, rho_I=params.rho_I, rho_T=params.rho_T, tau_init=params.tau_init,
              eta_init=params.eta_init, beta_u=params.beta_u, temp=params.temp, learnable_temp=params.learnable_temp,
              vicreg_sim_coeff=params.vicreg_sim_coeff, vicreg_std_coeff=params.vicreg_std_coeff, personalized_tau=params.personalized_tau,
              use_temp_net=params.isogclr_temp_net, alpha=params.alpha, distributed=False)
optimizer = create_optimizer(params, model)
tokenizer = DistilBertTokenizer.from_pretrained(params.text_encoder)
network = our.CLIP_Network(model, optimizer, tokenizer, params)

Creating model


RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

In [None]:
print("--Begin training--")
start_time = time.time()
# Train CLIP_Network model
for epoch in range(params.epochs):
    epoch_loss = pipe.train(network, train_loader, params, epoch)
    if epoch % params.save_interval == 0:
        checkpoint = {
            'epoch': epoch,
            'model_state_dict': network.model.state_dict(),
            'optimizer_state_dict': network.optimizer.state_dict(),
            'scheduler_state_dict': network.scheduler.state_dict(),
            'loss': epoch_loss,
        }
        torch.save(checkpoint, os.path.join(params.output_dir, f'checkpoint_{epoch+1}.pth'))
    print(f'Epoch {epoch+1}/{params.epochs} - Loss: {epoch_loss:.4f}')

total_time = time.time() - start_time
total_time_str = str(datetime.timedelta(seconds=int(total_time)))
print('Training time {}'.format(total_time_str))
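
The loop above writes a checkpoint whenever `epoch % params.save_interval == 0` and names the file after `epoch + 1`. The sketch below lists the files such a schedule would produce. `checkpoint_names` is a hypothetical helper, `epochs=30` comes from the benchmark setup, and `save_interval=10` is an assumed value, since the real one lives in `HyperParamsAndArgs` and is not shown here.

```python
def checkpoint_names(epochs, save_interval):
    """List the checkpoint filenames the training loop above would write."""
    return [
        f"checkpoint_{epoch + 1}.pth"
        for epoch in range(epochs)
        if epoch % save_interval == 0  # same condition as the training loop
    ]

# epochs=30 matches the benchmark setup; save_interval=10 is an assumed value
names = checkpoint_names(epochs=30, save_interval=10)
print(names)
```

Under these assumed values the last file corresponds to epoch 21, so the weights after the final epoch would not be saved. Saving once more after the loop, or testing `(epoch + 1) % params.save_interval == 0` instead, would be one way to change that.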

In [None]:
# Start validation
print('--Begin validation--')
start_time = time.time()

img_to_text_r1, text_to_img_r1 = pipe.evaluate_image_and_text(model, coco_loader, params.device)
zero_shot_acc = pipe.evaluate_top1_classification(model, imagenet_loader, params.device)
final_metric = (img_to_text_r1 + text_to_img_r1 + zero_shot_acc) / 3
print(f'Results: Img-Text R1: {img_to_text_r1:.4f}, Text-Img R1: {text_to_img_r1:.4f}, Zero-Shot: {zero_shot_acc:.4f}')
print(f'Final Metric: {final_metric:.4f}')
total_time = time.time() - start_time
total_time_str = str(datetime.timedelta(seconds=int(total_time)))
print('Validation time {}'.format(total_time_str))
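
Recall at 1 counts a query as correct when its highest-scoring candidate is the ground-truth match. The following is a minimal pure-Python sketch of that idea, assuming exactly one matching candidate per query, located at the same index. It is not the implementation in `pipe.evaluate_image_and_text`, whose internals are not shown here, and MSCOCO in practice has several captions per image. `recall_at_1` and the toy similarity matrix are hypothetical.

```python
def recall_at_1(similarity):
    """Percentage of queries whose top-scoring candidate is the matching one.

    similarity[i][j] is the score between query i and candidate j; the
    correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)  # index of top score
        hits += int(best == i)
    return 100.0 * hits / len(similarity)

# Toy 3x3 similarity matrix (made-up values): queries 0 and 2 rank their
# own match first, query 1 does not.
toy = [
    [0.9, 0.1, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.8],
]
score = recall_at_1(toy)
print(f"R@1: {score:.2f}")
```

Transposing the similarity matrix swaps the roles of queries and candidates, which is how the same function could serve both retrieval directions in this simplified setting.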