# Linear probe using CLIP features

In our previous tutorial, `Interacting_with_CLIP.ipynb`, we evaluated CLIP in the zero-shot setting, where the model's prediction is the label whose text features have the highest cosine similarity with the image features.
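As a reminder of what that prediction rule looks like, here is a minimal sketch using NumPy. The function name `zero_shot_predict` and the tiny hand-written feature arrays are illustrative assumptions; real CLIP features would come from `model.encode_image` and `model.encode_text`.

```python
import numpy as np

def zero_shot_predict(image_features, text_features):
    """Return, for each image, the index of the label with the highest cosine similarity."""
    # L2-normalize both sides so that the dot product equals the cosine similarity
    image_features = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    text_features = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    similarity = image_features @ text_features.T  # (num_images, num_labels)
    return similarity.argmax(axis=1)

# Toy stand-ins for CLIP features: 3 labels and 2 images in a 4-dimensional space
text_features = np.array([[1.0, 0.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0, 0.0]])
image_features = np.array([[0.1, 2.0, 0.3, 0.0],   # points mostly along label 1
                           [5.0, 0.2, 0.1, 0.4]])  # points mostly along label 0

print(zero_shot_predict(image_features, text_features))  # [1 0]
```

No parameter is trained here: the labels enter only through their text features.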

In this tutorial, we cover another way to use a pretrained model for classification: the linear probe.
Unlike zero-shot classification, a linear probe requires training on the training dataset.
However, to keep the training cost low, we train only a linear classifier on top of the frozen pretrained model.

Side note: The linear probe is not new. We did something similar in the CNN transfer learning tutorial, where we froze the main CNN and trained only a linear classifier. The term 'linear probe' is common in the self-supervised learning literature, where it highlights that only the linear classifier is trained while the main network stays frozen. The name reflects what the method measures: the 'linear separability' of the learned features.


In [1]:
# ! pip install ftfy regex tqdm
! pip install git+https://github.com/openai/CLIP.git

Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to c:\users\hp\appdata\local\temp\pip-req-build-vgj2sb9e
  Resolved https://github.com/openai/CLIP.git to commit dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'


  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git 'C:\Users\HP\AppData\Local\Temp\pip-req-build-vgj2sb9e'


In [2]:

import numpy as np
import torch

print("Torch version:", torch.__version__)


Torch version: 2.4.1+cu121


## Load model

In [3]:
import clip

clip.available_models()

['RN50',
 'RN101',
 'RN50x4',
 'RN50x16',
 'RN50x64',
 'ViT-B/32',
 'ViT-B/16',
 'ViT-L/14',
 'ViT-L/14@336px']

In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model, preprocess = clip.load("ViT-B/16")
model = model.to(device).eval()
input_resolution = model.visual.input_resolution
context_length = model.context_length
vocab_size = model.vocab_size

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Input resolution:", input_resolution)
print("Context length:", context_length)
print("Vocab size:", vocab_size)

Model parameters: 149,620,737
Input resolution: 224
Context length: 77
Vocab size: 49408


## Setting up train and test dataset

In [6]:
# We will evaluate CLIP on a conventional image classification dataset (CIFAR10)

from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
from tqdm import tqdm

cifar10_train = CIFAR10('D:/data', transform=preprocess, download=True, train=True)
cifar10_test = CIFAR10('D:/data', transform=preprocess, download=True, train=False)

train_loader = DataLoader(cifar10_train, batch_size=100, shuffle=True, num_workers=2)
test_loader = DataLoader(cifar10_test, batch_size=100, shuffle=False, num_workers=2)


Files already downloaded and verified
Files already downloaded and verified


## Linear probe option 1: using torch

In the [CLIP paper](https://arxiv.org/pdf/2103.00020), the authors run the linear probe on the image features taken before they are projected into the shared image-text embedding space.

To do the same, we need to remove the projection layer (specifically, its weight matrix `model.visual.proj`) from the model.
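Why does setting the weight to `None` work? Our reading of `clip/model.py` is that the ViT image encoder applies the projection only when `proj` is not `None`. The sketch below mimics just that final step with a toy class; `ToyVisualEncoder` and its random weights are hypothetical stand-ins, not the real CLIP module. The ResNet variants appear to be structured differently, so treat this trick as specific to the ViT models.

```python
import numpy as np

class ToyVisualEncoder:
    """Hypothetical stand-in that mimics only the optional final projection."""

    def __init__(self, width=768, embed_dim=512, seed=0):
        rng = np.random.default_rng(seed)
        # Projection from the transformer width to the shared embedding size
        self.proj = rng.normal(size=(width, embed_dim)) * width ** -0.5

    def forward(self, x):
        # x plays the role of the final class-token feature
        if self.proj is not None:
            x = x @ self.proj
        return x

encoder = ToyVisualEncoder()
x = np.ones((1, 768))
print(encoder.forward(x).shape)  # (1, 512): projected features

saved_proj = encoder.proj        # keep a reference so the weight can be restored
encoder.proj = None
print(encoder.forward(x).shape)  # (1, 768): pre-projection features

encoder.proj = saved_proj
print(encoder.forward(x).shape)  # (1, 512) again
```

Keeping a reference to the original weight matters because zero-shot classification still needs the projected features.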

In [7]:
# See: https://github.com/openai/CLIP/blob/main/clip/model.py

sample_image = cifar10_test[0][0].unsqueeze(0).to(device)  # (1, 3, 224, 224)
print(sample_image.shape)

# before removing projection weight
with torch.no_grad():
    out_before = model.encode_image(sample_image).float()
print(out_before.shape)

# after removing projection weight
visual_proj = model.visual.proj
model.visual.proj = None

with torch.no_grad():
    out_after = model.encode_image(sample_image).float()
print(out_after.shape)

torch.Size([1, 3, 224, 224])
torch.Size([1, 512])
torch.Size([1, 768])




In [8]:
import torch.nn as nn
import torch.optim as optim

linear_classifier = nn.Linear(768, 10).to(device)
optimizer = optim.Adam(linear_classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):
    for x, y in tqdm(train_loader):
        x, y = x.to(device), y.to(device)
        with torch.no_grad():
            image_feature = model.encode_image(x).float()
        logits = linear_classifier(image_feature)
        loss = criterion(logits, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    test_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for x, y in tqdm(test_loader):
            x, y = x.to(device), y.to(device)
            image_feature = model.encode_image(x).float()
            logits = linear_classifier(image_feature)
            loss = criterion(logits, y)
            test_loss += loss.item() * len(y)
            correct += (logits.argmax(dim=1) == y).sum().item()
            total += len(y)
    test_loss /= len(cifar10_test)
    test_acc = correct / total

    print(f"[Epoch {epoch+1}] test_loss: {test_loss:.4f}, test_acc: {test_acc * 100:.2f}%")


100%|██████████| 500/500 [01:15<00:00,  6.66it/s]
100%|██████████| 100/100 [00:19<00:00,  5.13it/s]


[Epoch 1] test_loss: 0.1487, test_acc: 94.99%


100%|██████████| 500/500 [01:14<00:00,  6.67it/s]
100%|██████████| 100/100 [00:19<00:00,  5.10it/s]


[Epoch 2] test_loss: 0.1418, test_acc: 95.41%


100%|██████████| 500/500 [01:14<00:00,  6.70it/s]
100%|██████████| 100/100 [00:19<00:00,  5.12it/s]

[Epoch 3] test_loss: 0.1346, test_acc: 95.59%





## Linear probe option 2: using external library

Another way to train a linear classifier on top of the learned features is to first extract the image features for all images and then train the classifier with an external library (e.g., scikit-learn).

This makes it easy to use more sophisticated optimization algorithms implemented in scikit-learn, such as [L-BFGS](https://ko.wikipedia.org/wiki/L-BFGS), which is a [quasi-Newton method](https://en.wikipedia.org/wiki/Quasi-Newton_method).

In fact, the [CLIP paper](https://arxiv.org/pdf/2103.00020) uses this approach for its linear probe evaluation (see Appendix A.3):

"We train a logistic regression classifier using scikit-learn’s L-BFGS implementation, with maximum 1,000 iterations"

scikit-learn LogisticRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

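However, because each image's features are extracted once and then reused, this approach cannot apply fresh data augmentation during training. If you need data augmentation, the first approach is preferable.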
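The logistic regression cell in this section fixes `C = 0.1` by hand. In scikit-learn, `C` is the inverse of the L2 regularization strength, so smaller values regularize more. As far as we can tell, the CLIP paper's appendix chooses the regularization strength with a sweep on validation data; the sketch below shows one simple way to do something similar. The synthetic features, the four-value grid `candidate_C`, and the 80/20 split are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for extracted features: 3 well-separated classes in 16 dimensions
rng = np.random.default_rng(0)
class_means = rng.normal(size=(3, 16)) * 5.0
labels = rng.integers(0, 3, size=600)
features = class_means[labels] + rng.normal(size=(600, 16))

# Hold out part of the training data for validation; the test set stays untouched
fit_x, val_x, fit_y, val_y = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

candidate_C = [0.01, 0.1, 1.0, 10.0]  # hypothetical coarse grid
val_acc = {}
for C in candidate_C:
    clf = LogisticRegression(solver="lbfgs", max_iter=1000, C=C)
    clf.fit(fit_x, fit_y)
    val_acc[C] = clf.score(val_x, val_y)  # mean accuracy on the held-out split

best_C = max(val_acc, key=val_acc.get)
print(val_acc)
print("best C:", best_C)
```

With real features, you would replace `features` and `labels` with `train_features` and `train_labels`, refit on the full training set with `best_C`, and only then evaluate on the test set.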

In [9]:
import numpy as np

# TODO:
# extract train image features, convert to numpy
# store both image feature and label (y) as numpy arrays, each with name `train_features` and `train_labels`
train_features = []
train_labels = []
for x, y in tqdm(train_loader):
    x = x.to(device)
    with torch.no_grad():
        image_feature = model.encode_image(x).float()
    train_features.append(image_feature.cpu().numpy())
    train_labels.append(y.numpy())

train_features = np.concatenate(train_features, axis=0)
train_labels = np.concatenate(train_labels, axis=0)

print()
print(train_features.shape)
print(train_labels.shape)

# TODO:
# extract test image features, convert to numpy
# store both image feature and label (y) as numpy arrays, each with name `test_features` and `test_labels`

test_features = []
test_labels = []
for x, y in tqdm(test_loader):
    x = x.to(device)
    with torch.no_grad():
        image_feature = model.encode_image(x).float()
    test_features.append(image_feature.cpu().numpy())
    test_labels.append(y.numpy())

test_features = np.concatenate(test_features, axis=0)
test_labels = np.concatenate(test_labels, axis=0)

print()
print(test_features.shape)
print(test_labels.shape)

100%|██████████| 500/500 [01:20<00:00,  6.25it/s]



(50000, 768)
(50000,)


100%|██████████| 100/100 [00:19<00:00,  5.01it/s]


(10000, 768)
(10000,)





In [10]:
from sklearn.linear_model import LogisticRegression

C = 0.1
logistic_regression = LogisticRegression(solver="lbfgs", max_iter=1000, C=C)
logistic_regression.fit(train_features, train_labels)

test_pred = logistic_regression.predict(test_features)
test_acc = (test_pred == test_labels).sum() / len(test_labels)
print(f"Test acc: {test_acc * 100:.2f}%")

Test acc: 96.08%


## Exercise: Linear probe vs zero-shot classification on CIFAR100

1. Compute the zero-shot classification accuracy of CLIP on CIFAR100, as in the tutorial `7_1_Interacting_with_CLIP.ipynb`.

2. Implement the linear probe evaluation on CIFAR100 (option 2, using scikit-learn).

3. Compare the results.
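For step 3, overall accuracy is the simplest comparison, but a per-class breakdown can show where the two methods differ. The sketch below is one possible way to do this, assuming you have collected each method's predictions into NumPy arrays; the helper `per_class_accuracy` and the toy arrays are hypothetical.

```python
import numpy as np

def per_class_accuracy(pred, labels, num_classes):
    """Accuracy computed separately for each class index (NaN if a class is absent)."""
    acc = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            acc[c] = (pred[mask] == c).mean()
    return acc

# Toy labels and predictions standing in for the two methods' outputs
labels = np.array([0, 0, 1, 1, 2, 2])
zero_shot_pred = np.array([0, 1, 1, 1, 0, 2])
probe_pred = np.array([0, 0, 1, 0, 2, 2])

zs_acc = per_class_accuracy(zero_shot_pred, labels, 3)
lp_acc = per_class_accuracy(probe_pred, labels, 3)
gain = lp_acc - zs_acc  # positive where the linear probe does better

print("zero-shot:", zs_acc)    # [0.5 1.  0.5]
print("linear probe:", lp_acc) # [1.  0.5 1. ]
print("gain:", gain)           # [ 0.5 -0.5  0.5]
```

Sorting `gain` would then point to the classes that benefit most, or least, from supervised training.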

In [11]:
from torchvision.datasets import CIFAR100

cifar100_train = CIFAR100('D:/data', transform=preprocess, download=True, train=True)
cifar100_test = CIFAR100('D:/data', transform=preprocess, download=True, train=False)

train_loader = DataLoader(cifar100_train, batch_size=100, shuffle=True, num_workers=2)
test_loader = DataLoader(cifar100_test, batch_size=100, shuffle=False, num_workers=2)

Files already downloaded and verified
Files already downloaded and verified


In [14]:
# TODO 1: zero-shot classification

# restore the projection weight saved earlier; zero-shot classification needs projected features
model.visual.proj = visual_proj

templates = [
    "a photo of a {c}.",
    "a blurry photo of a {c}.",
    "a black and white photo of a {c}.",
    "a low contrast photo of a {c}.",
    "a high contrast photo of a {c}.",
    "a bad photo of a {c}.",
    "a good photo of a {c}.",
    "a photo of a small {c}.",
    "a photo of a big {c}.",
    "a photo of the {c}.",
    "a blurry photo of the {c}.",
    "a black and white photo of the {c}.",
    "a low contrast photo of the {c}.",
    "a high contrast photo of the {c}.",
    "a bad photo of the {c}.",
    "a good photo of the {c}.",
    "a photo of the small {c}.",
    "a photo of the big {c}.",
]

text_features = []
for classname in cifar100_test.classes:
    text_descriptions = [template.format(c=classname) for template in templates]
    text_tokens = clip.tokenize(text_descriptions).to(device)
    with torch.no_grad():
        class_text_features = model.encode_text(text_tokens).float()
        class_text_features /= class_text_features.norm(dim=-1, keepdim=True)

        class_text_feature = class_text_features.mean(dim=0)
        class_text_feature /= class_text_feature.norm(dim=-1, keepdim=True)
    text_features.append(class_text_feature)

text_features = torch.stack(text_features, dim=0)


correct = 0
total = 0
for images, labels in tqdm(test_loader):
    images, labels = images.to(device), labels.to(device)

    # TODO: extract image features, compute prediction, and compute accuracy
    with torch.no_grad():
        image_features = model.encode_image(images).float()
        image_features = image_features / image_features.norm(dim=1, keepdim=True)

    prediction = (image_features @ text_features.T).argmax(dim=1)
    correct += (prediction == labels).sum().item()
    total += len(labels)

accuracy = correct / total
print(f"Accuracy: {accuracy * 100:.2f}%")


100%|██████████| 100/100 [00:19<00:00,  5.23it/s]

Accuracy: 66.94%





In [15]:
# TODO 2: linear probe evaluation option 2

# make sure to remove model.visual.proj for linear probe
model.visual.proj = None

# TODO:
# extract train image features, convert to numpy
# store both image feature and label (y) as numpy arrays, each with name `train_features` and `train_labels`
train_features = []
train_labels = []
for x, y in tqdm(train_loader):
    x = x.to(device)
    with torch.no_grad():
        image_feature = model.encode_image(x).float()
    train_features.append(image_feature.cpu().numpy())
    train_labels.append(y.numpy())

train_features = np.concatenate(train_features, axis=0)
train_labels = np.concatenate(train_labels, axis=0)

print()
print(train_features.shape)
print(train_labels.shape)

# TODO:
# extract test image features, convert to numpy
# store both image feature and label (y) as numpy arrays, each with name `test_features` and `test_labels`

test_features = []
test_labels = []
for x, y in tqdm(test_loader):
    x = x.to(device)
    with torch.no_grad():
        image_feature = model.encode_image(x).float()
    test_features.append(image_feature.cpu().numpy())
    test_labels.append(y.numpy())

test_features = np.concatenate(test_features, axis=0)
test_labels = np.concatenate(test_labels, axis=0)

print()
print(test_features.shape)
print(test_labels.shape)

C = 0.1
logistic_regression = LogisticRegression(solver="lbfgs", max_iter=1000, C=C)
logistic_regression.fit(train_features, train_labels)

test_pred = logistic_regression.predict(test_features)
test_acc = (test_pred == test_labels).sum() / len(test_labels)
print(f"Test acc: {test_acc * 100:.2f}%")



100%|██████████| 500/500 [01:20<00:00,  6.22it/s]



(50000, 768)
(50000,)


100%|██████████| 100/100 [00:19<00:00,  5.11it/s]



(10000, 768)
(10000,)
Test acc: 82.64%
