# CLIP

CLIP (Contrastive Language–Image Pretraining) is a model developed by OpenAI that connects vision and language. It is trained contrastively on a large collection of image-text pairs gathered from the public internet, learning to embed images and text in a shared space where matching pairs lie close together. Because any class can be described with a text prompt, the model can be applied zero-shot to classification tasks it was never explicitly trained on.
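To make the zero-shot idea concrete, here is a minimal sketch of the scoring step: an image embedding is compared with one text embedding per prompt, and a softmax over the scaled cosine similarities gives class probabilities. The 3-dimensional vectors are made up for illustration (they are not CLIP outputs), and the `logit_scale` of 100 is an assumption that mirrors the scaling commonly used in CLIP usage examples.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled image-text cosine similarities."""
    logits = [logit_scale * cosine_similarity(image_emb, t) for t in text_embs]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - max_logit) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings, invented for this sketch (real CLIP embeddings are much larger).
image_emb = [0.9, 0.1, 0.2]
text_embs = {
    "this image contains a cat.": [1.0, 0.0, 0.1],
    "this image contains a dog.": [0.1, 1.0, 0.0],
    "this image contains a car.": [0.0, 0.2, 1.0],
}

probs = zero_shot_probs(image_emb, list(text_embs.values()))
best_prompt = max(zip(text_embs, probs), key=lambda pair: pair[1])[0]
print(best_prompt)
```

The predicted class is simply the prompt with the highest probability, so no task-specific training is involved.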

### Environment: imports and device

In [1]:
# Imports
import torch
import clip
from PIL import Image
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torch.utils.data import DataLoader
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from CLIP import classify

# device
device = "cpu"                                  # "cuda" if torch.cuda.is_available() else "cpu"



### Model load
CLIP is released with several image encoders. In `ViT-B/32`, ViT stands for Vision Transformer, B for the Base model size, and 32 for the patch size in pixels. A Vision Transformer splits an image into a sequence of fixed-size patches and processes that sequence with transformer layers. The Base size balances model size and accuracy, which makes it a reasonable default for many applications. This notebook loads `ViT-L/14`, the Large variant with 14-pixel patches, which is typically more accurate but also more computationally intensive.
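The effect of the patch size can be illustrated with a little arithmetic. The sketch below counts how many tokens the image encoder processes, assuming a 224x224 input and one extra class token; the helper name `vit_sequence_length` is hypothetical and not part of the `clip` package.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens a ViT processes: one per patch plus one class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1

# Assumes the 224x224 input resolution for both checkpoints.
for name, patch in [("ViT-B/32", 32), ("ViT-L/14", 14)]:
    print(name, vit_sequence_length(224, patch))
# ViT-B/32 50
# ViT-L/14 257
```

The smaller patches give the larger model roughly five times as many tokens per image, which is one reason it costs more to run.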

In [2]:
model, preprocess = clip.load("ViT-L/14", device=device)

## Caltech-101


In [3]:
dataset_name = "caltech101"
dataset = datasets.ImageFolder(root=f'../../data/{dataset_name}', transform=preprocess)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

classes = dataset.classes
labels = [f'this image contains a {l}.' for l in classes]
text = clip.tokenize(labels).to(device)
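The prompt template above inserts the folder names verbatim, so a class such as `BACKGROUND_Google` ends up in the prompt with its underscore and capitalization. One possible refinement, shown here only as a sketch, is to normalize the name and choose the article before filling in the template. The helper `build_prompt` is hypothetical, and whether it changes accuracy on this dataset has not been measured here.

```python
def build_prompt(class_name: str) -> str:
    """Turn a folder name into a prompt that follows the notebook's template."""
    readable = class_name.replace("_", " ").lower()
    article = "an" if readable[0] in "aeiou" else "a"
    return f"this image contains {article} {readable}."

# A few folder names taken from the dataset's class list.
example_classes = ["accordion", "barrel", "BACKGROUND_Google"]
prompts = [build_prompt(c) for c in example_classes]
print(prompts)
```

The resulting list could be passed to `clip.tokenize` in place of `labels`.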


In [None]:
# Create a new dataset without transformations
dataset_no_transform = datasets.ImageFolder(
    root=f'../../data/{dataset_name}',
)


# Iterate over the first 10 images and display them
for i in range(10):
    image, label = dataset_no_transform[i]
    plt.imshow(image)
    plt.title(dataset_no_transform.classes[label])
    plt.show()


## Classification

In [8]:
## Load if already classified
true_labels = np.load(f'results/true_labels-{dataset_name}.npy')
predicted_labels = np.load(f'results/pred_labels-{dataset_name}.npy')

In [5]:
## Labels Calculation
true_labels, predicted_labels = classify(model, text, dataloader, dataset_name)

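# Save under the same paths that the loading cell reads from
np.save(f'results/true_labels-{dataset_name}.npy', true_labels)
np.save(f'results/pred_labels-{dataset_name}.npy', predicted_labels)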

True: 4 | Predicted: 4
True: 75 | Predicted: 75
True: 2 | Predicted: 2
True: 4 | Predicted: 4
True: 0 | Predicted: 16
True: 38 | Predicted: 38
True: 89 | Predicted: 89
True: 14 | Predicted: 14
True: 62 | Predicted: 62
True: 1 | Predicted: 1
True: 36 | Predicted: 36
True: 60 | Predicted: 60
True: 21 | Predicted: 97
True: 75 | Predicted: 75
True: 0 | Predicted: 23
True: 2 | Predicted: 2
True: 0 | Predicted: 16
True: 85 | Predicted: 85
True: 36 | Predicted: 36
True: 2 | Predicted: 2
True: 2 | Predicted: 2
True: 91 | Predicted: 91
True: 66 | Predicted: 66
True: 6 | Predicted: 6
True: 41 | Predicted: 41
True: 0 | Predicted: 16
True: 93 | Predicted: 93
True: 8 | Predicted: 8
True: 93 | Predicted: 93
True: 50 | Predicted: 50
True: 49 | Predicted: 49
True: 46 | Predicted: 14
True: 53 | Predicted: 53
True: 4 | Predicted: 4
True: 32 | Predicted: 32
True: 0 | Predicted: 16
True: 44 | Predicted: 44
True: 77 | Predicted: 77
True: 0 | Predicted: 16
True: 49 | Predicted: 49
True: 78 | Predicted: 78

Traceback (most recent call last):
  File "/home/andrea/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3526, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_478466/1012369574.py", line 2, in <module>
    true_labels, predicted_labels = classify(model, text, dataloader, dataset_name)
  File "/home/andrea/Desktop/MM_LLMs-vs-CV/paper_code/models_evaluation/CLIP/CLIP.py", line 17, in classify
  File "/home/andrea/.local/lib/python3.10/site-packages/clip/model.py", line 348, in encode_text
    x = self.transformer(x)
  File "/home/andrea/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1529, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/andrea/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/andrea/.local/lib/python3.10/site-packages/clip/model.py", line 203, in forward
   

In [6]:
confusion_mat = confusion_matrix(true_labels, predicted_labels)

plt.figure(figsize=(30, 30))
sns.heatmap(confusion_mat, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes)
plt.title('Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

NameError: name 'true_labels' is not defined
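With about a hundred classes, the annotated heatmap is hard to read at a glance. A complementary view, sketched below, lists the largest off-diagonal cells, i.e. the most frequent true-to-predicted mistakes. The helper `top_confusions` is hypothetical and the 3x3 matrix uses made-up counts; in the notebook, `confusion_mat.tolist()` and `classes` could be passed instead.

```python
def top_confusions(matrix, class_names, k=3):
    """Return the k largest off-diagonal cells as (count, true, predicted)."""
    pairs = []
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and count > 0:
                pairs.append((count, class_names[i], class_names[j]))
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:k]

# Toy matrix with invented counts (rows = true label, columns = predicted label).
toy_classes = ["ant", "barrel", "bass"]
toy_matrix = [
    [27, 1, 7],
    [0, 33, 5],
    [2, 1, 32],
]
for count, true_name, pred_name in top_confusions(toy_matrix, toy_classes):
    print(f"{true_name} -> {pred_name}: {count}")
```

The row and column convention matches `sklearn.metrics.confusion_matrix`, where rows are true labels and columns are predictions.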

In [12]:
print(classification_report(true_labels, predicted_labels, target_names=classes))

                   precision    recall  f1-score   support

BACKGROUND_Google       0.16      0.25      0.20       347
            Faces       0.00      0.00      0.00       321
         Leopards       0.99      0.44      0.61       150
       Motorbikes       1.00      0.00      0.01       579
        accordion       1.00      0.54      0.70        41
        airplanes       0.99      0.43      0.60       598
           anchor       0.60      0.29      0.39        31
              ant       0.59      0.77      0.67        35
           barrel       0.97      0.87      0.92        38
             bass       0.73      0.91      0.81        35
           beaver       1.00      0.58      0.73        31
        binocular       0.00      0.00      0.00        29
           bonsai       1.00      0.79      0.88        92
            brain       0.20      1.00      0.34        75
     brontosaurus       1.00      0.77      0.87        30
           buddha       1.00      1.00      1.00       
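The f1-score column is the harmonic mean of precision and recall, which is why a class with perfect precision but low recall still scores poorly. The sketch below recomputes two rows of the report from their rounded precision and recall; the helper name is hypothetical, and returning 0.0 when both inputs are zero is a convention assumed here for rows such as `Faces`.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the report above.
rows = {"accordion": (1.00, 0.54), "Leopards": (0.99, 0.44), "Faces": (0.00, 0.00)}
for name, (p, r) in rows.items():
    print(f"{name}: {f1_from_precision_recall(p, r):.2f}")
# accordion: 0.70
# Leopards: 0.61
# Faces: 0.00
```

Because the inputs are already rounded to two decimals, a recomputed value can differ from the report in the last digit for some rows.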

