In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


# Introduction

This notebook is based on the OpenAI CLIP paper. 
We try two implementations for few-shot and zero-shot learning: the [Hugging Face](https://huggingface.co/openai/clip-vit-base-patch32) one and the original [CLIP](https://github.com/openai/CLIP) repository.





## CLIP

[Blog](https://openai.com/blog/clip/)
[Paper](https://arxiv.org/pdf/2103.00020.pdf)


Idea: CLIP (Contrastive Language–Image Pre-training) can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.


Problems tackled by this paper:

1.   Typical vision datasets are labor intensive and costly to create while teaching only a narrow set of visual concepts.
2.   Standard vision models are good at one task and one task only, and require significant effort to adapt to a new task.
3.   Models that perform well on benchmarks have disappointingly poor performance on stress tests, casting doubt on the entire deep learning approach to computer vision.





## Approach 


The paper shows that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a wide variety of image classification datasets. The method uses an abundantly available source of supervision: the text paired with images found across the internet. This data is used to create the following proxy training task for CLIP: given an image, predict which of a set of 32,768 randomly sampled text snippets was actually paired with it in the dataset.
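The proxy task above can be read as a symmetric classification problem over a batch of image-text pairs. The following is a minimal NumPy sketch of such a loss, not the training code used for CLIP; the function names, the toy embeddings, and the `temperature=0.07` value are assumptions made for illustration.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a true pair.
    """
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # shape (N, N)
    diag = np.arange(len(logits))

    # Image -> text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text -> image: the same idea along the columns.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

# Toy embeddings: "images" are noisy copies of the "texts".
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
matched = clip_style_loss(text + 0.01 * rng.normal(size=(4, 8)), text)
shuffled = clip_style_loss(text[::-1] + 0.01 * rng.normal(size=(4, 8)), text)
print(f"matched pairs: {matched:.4f}, mismatched pairs: {shuffled:.4f}")
```

Correctly paired rows give a low loss, while pairing each image with the wrong text gives a much higher one, which is the signal the pre-training task relies on.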

The intuition is that, to solve this task, CLIP models need to learn to recognize a wide variety of visual concepts in images and associate them with their names. As a result, CLIP models can be applied to nearly arbitrary visual classification tasks. For instance, if the task is classifying photos of dogs vs. cats, we check for each image whether a CLIP model predicts that the text description “a photo of a dog” or “a photo of a cat” is more likely to be paired with it.
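The dog-vs-cat example can be sketched without loading a model. The snippet below uses hand-made vectors as stand-ins for the encoder outputs; `zero_shot_classify`, the toy embeddings, and `scale=100.0` are illustrative assumptions, not part of either CLIP implementation.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names, scale=100.0):
    """Pick the class whose prompt embedding is closest to the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = scale * text_embs @ image_emb  # one score per class
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return class_names[int(probs.argmax())], probs

class_names = ["dog", "cat"]
prompts = [f"a photo of a {name}" for name in class_names]

# Toy stand-ins for encoder outputs; real embeddings would come from CLIP.
text_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
image_emb = np.array([0.9, 0.1, 0.2])

label, probs = zero_shot_classify(image_emb, text_embs, class_names)
print(prompts)
print(label, probs.round(3))
```

Adding a new category only requires adding another prompt and its embedding; no classifier weights are retrained.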






## Limitations


While CLIP usually performs well on recognizing common objects, it struggles on more abstract or systematic tasks such as counting the number of objects in an image and on more complex tasks such as predicting how close the nearest car is in a photo. On these two tasks, zero-shot CLIP is only slightly better than random guessing. Zero-shot CLIP also struggles compared to task-specific models on very fine-grained classification, such as telling the difference between car models, variants of aircraft, or flower species.

CLIP also still has poor generalization to images not covered in its pre-training dataset. For instance, although CLIP learns a capable OCR system, when evaluated on handwritten digits from the MNIST dataset, zero-shot CLIP only achieves 88% accuracy, well below the 99.75% of humans on the dataset. Finally, CLIP’s zero-shot classifiers can be sensitive to wording or phrasing and sometimes require trial-and-error “prompt engineering” to perform well.
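One common way to reduce sensitivity to a single phrasing is to average the embeddings of several prompt templates per class. The sketch below assumes that approach; the template list, `build_prompts`, and `ensemble_embeddings` are hypothetical, and the embeddings are toy values rather than CLIP outputs.

```python
import numpy as np

# Hypothetical templates; the best choice depends on the dataset.
TEMPLATES = ["a photo of a {}", "a cropped photo of a {}", "an image of a {}"]

def build_prompts(class_names, templates=TEMPLATES):
    """Return {class name: [one prompt per template]}."""
    return {name: [t.format(name) for t in templates] for name in class_names}

def ensemble_embeddings(per_prompt_embs):
    """Average (num_templates, dim) embeddings of one class into a unit vector."""
    unit = per_prompt_embs / np.linalg.norm(per_prompt_embs, axis=1, keepdims=True)
    mean = unit.mean(axis=0)
    return mean / np.linalg.norm(mean)

prompts = build_prompts(["airplane", "bench"])
print(prompts["bench"])

# Toy embeddings standing in for the text encoder output of three prompts.
toy_embs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
class_emb = ensemble_embeddings(toy_embs)
print(class_emb.round(4))
```

The resulting vector can be used in place of a single prompt embedding when computing image-text similarities.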

# **Zero-shot with Hugging Face**

In [None]:
from PIL import Image
import requests

In [None]:
! pip install transformers


In [None]:
from transformers import CLIPProcessor, CLIPModel

In [None]:
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")


In [None]:
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


In [None]:
%cd  gdrive/MyDrive/coco_crops_few_shot/coco_crops_few_shot/val/airplane/

/content/gdrive/MyDrive/coco_crops_few_shot/coco_crops_few_shot/val/airplane


In [None]:
image = Image.open(r"/content/gdrive/MyDrive/coco_crops_few_shot/coco_crops_few_shot/val/airplane/000000001232_161324.jpg")  

In [None]:
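# NOTE: "motocycle" is a misspelling of "motorcycle"; the outputs recorded below were produced with this spelling.
labels = ["airplane", "bicycle", "boat", "bus", "car","motocycle","train","truck"]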

In [None]:
inputs = processor(text=[f"a photo of a {l}" for l in labels], images=image, return_tensors="pt", padding=True)

In [None]:
outputs = model(**inputs)

In [None]:
logits_per_image = outputs.logits_per_image

In [None]:
logits_per_image

tensor([[27.4976, 20.8163, 21.5692, 19.8618, 21.9612, 24.1665, 17.7010, 19.3076]],
       grad_fn=<PermuteBackward>)

In [None]:
probs = logits_per_image.softmax(dim=1)

In [None]:
for l, p in zip(labels, probs[0]):
    print(f"{l:<16} {p:.4f}")

airplane         0.9575
bicycle          0.0012
boat             0.0025
bus              0.0005
car              0.0038
motocycle        0.0342
train            0.0001
truck            0.0003
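The probabilities above follow directly from the printed logits. As a check, this small sketch recomputes the softmax by hand from the `logits_per_image` values shown earlier; the `softmax` helper is written here for illustration only.

```python
import math

# Logits copied from the logits_per_image output above.
logits = [27.4976, 20.8163, 21.5692, 19.8618, 21.9612, 24.1665, 17.7010, 19.3076]
labels = ["airplane", "bicycle", "boat", "bus", "car", "motocycle", "train", "truck"]

def softmax(values):
    # Subtract the maximum first so the exponentials cannot overflow.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
for name, p in zip(labels, probs):
    print(f"{name:<16} {p:.4f}")
```

A gap of about 3.3 between the top two logits is already enough to put roughly 96% of the probability mass on "airplane".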


# **Testing on the zero shot dataset**

In [None]:
image2 = Image.open(r"/content/gdrive/MyDrive/coco_crops_zero_shot/test/bench/000000013876_577922.jpg")  

In [None]:
labels = ["airplane", "bicycle", "boat", "bus", "car","motocycle","train","truck", "bench"]

In [None]:
inputs = processor(text=[f"a photo of a {l}" for l in labels], images=image2, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
logits_per_image

tensor([[18.3568, 24.1950, 21.2019, 22.6827, 21.6038, 18.1754, 23.0455, 20.6152,
         30.0898]], grad_fn=<PermuteBackward>)

In [None]:
probs = logits_per_image.softmax(dim=1)

In [None]:
for l, p in zip(labels, probs[0]):
    print(f"{l:<16} {p:.4f}")

airplane         0.0000
bicycle          0.0027
boat             0.0001
bus              0.0006
car              0.0002
motocycle        0.0000
train            0.0009
truck            0.0001
bench            0.9954


# Few-shot learning with the CLIP repo

In [None]:
! pip install ftfy regex tqdm
! pip install git+https://github.com/openai/CLIP.git

Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-8wkvg7lo
  Running command git clone -q https://github.com/openai/CLIP.git /tmp/pip-req-build-8wkvg7lo


In [None]:
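import numpy as np
import torch
from packaging import version

print("Torch version:", torch.__version__)

# Compare parsed versions; comparing the split strings would rank "1.10" below "1.7".
assert version.parse(torch.__version__) >= version.parse("1.7.1"), "PyTorch 1.7.1 or later is required"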

Torch version: 1.9.0+cu102
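A version check that compares the dot-separated parts as strings is lexicographic, so it can reject newer releases such as 1.10. The sketch below shows the pitfall and one numeric alternative; `version_tuple` is a hypothetical helper that only handles simple `major.minor.patch` strings.

```python
import re

def version_tuple(version):
    """Parse '1.9.0+cu102' into (1, 9, 0), ignoring any local build suffix."""
    core = version.split("+")[0]
    return tuple(int(part) for part in re.findall(r"\d+", core)[:3])

# String comparison: "10" sorts before "7", so 1.10.0 looks older than 1.7.1.
string_check = "1.10.0".split(".") >= ["1", "7", "1"]
# Numeric comparison gives the intended answer.
numeric_check = version_tuple("1.10.0") >= (1, 7, 1)

print(string_check, numeric_check)
print(version_tuple("1.9.0+cu102"))
```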


In [None]:
import clip

clip.available_models()

['RN50', 'RN101', 'RN50x4', 'RN50x16', 'ViT-B/32', 'ViT-B/16']

In [None]:
model, preprocess = clip.load("ViT-B/32")
model.cuda().eval()
input_resolution = model.visual.input_resolution
context_length = model.context_length
vocab_size = model.vocab_size

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Input resolution:", input_resolution)
print("Context length:", context_length)
print("Vocab size:", vocab_size)

Model parameters: 151,277,313
Input resolution: 224
Context length: 77
Vocab size: 49408


In [None]:
import os
import skimage
import IPython.display
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

from collections import OrderedDict
import torch

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# class names and the textual descriptions used to build their prompts
descriptions = {
    "airplane": "image of airplane",
    "bicycle": "image of bicycle",
    "boat": "image of boat",
    "bus": "image of bus",
    "car": "image of car",
    "motorcycle": "image of motorcycle",
    "train": "image of train", 
    "truck": "image of truck"
}

In [None]:
preprocess

Compose(
    Resize(size=224, interpolation=bicubic, max_size=None, antialias=None)
    CenterCrop(size=(224, 224))
    <function _transform.<locals>.<lambda> at 0x7fccc73f85f0>
    ToTensor()
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)

In [None]:
data_dir = "/content/gdrive/MyDrive/coco_crops_few_shot/coco_crops_few_shot/train/"

In [None]:
original_images = []
images = []
texts = []

In [None]:
# Walk the class folders and take each label from the folder name,
# rather than relying on the order in which os.walk visits them.
for root, subdirectories, files in os.walk(data_dir):
    label = os.path.basename(os.path.normpath(root))
    if label not in descriptions:
        continue  # skip data_dir itself and any unrelated folders
    for file in files:
        image = Image.open(os.path.join(root, file))
        original_images.append(image)
        images.append(preprocess(image))
        texts.append(descriptions[label])


In [None]:
image_input = torch.stack(images).cuda()
text_tokens = clip.tokenize(["This is " + desc for desc in texts]).cuda()

In [None]:
with torch.no_grad():
    image_features = model.encode_image(image_input).float()
    text_features = model.encode_text(text_tokens).float()

In [None]:
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T

In [None]:
print(similarity)

[[0.28839195 0.30192918 0.29780307 ... 0.1728441  0.19305587 0.20042631]
 [0.28839195 0.30192918 0.29780307 ... 0.1728441  0.19305587 0.20042631]
 [0.28839195 0.30192918 0.29780307 ... 0.1728441  0.19305587 0.20042631]
 ...
 [0.24105565 0.22355422 0.22971815 ... 0.2592097  0.26274705 0.28666192]
 [0.24105565 0.22355422 0.22971815 ... 0.2592097  0.26274705 0.28666192]
 [0.24105565 0.22355422 0.22971815 ... 0.2592097  0.26274705 0.28666192]]
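The matrix above has repeated rows because one text was tokenized per image. If each class prompt is encoded only once, the similarity matrix has shape (number of classes, number of images) and a prediction is the row with the highest score in each column. A minimal sketch under that assumption follows; the similarity values and `predict_from_similarity` are made up for illustration.

```python
import numpy as np

def predict_from_similarity(similarity, class_names):
    """similarity: (num_classes, num_images) cosine scores, one text row per class."""
    best = similarity.argmax(axis=0)  # best class index for each image column
    return [class_names[i] for i in best]

class_names = ["airplane", "bicycle", "boat"]

# Toy scores: rows are class prompts, columns are images.
similarity = np.array([
    [0.29, 0.17, 0.20],
    [0.24, 0.26, 0.21],
    [0.22, 0.19, 0.28],
])
true_labels = ["airplane", "bicycle", "boat"]

pred = predict_from_similarity(similarity, class_names)
accuracy = float(np.mean([p == t for p, t in zip(pred, true_labels)]))
print(pred, accuracy)
```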


# **Few-shot with CLIP**




In [None]:
! pip install ftfy regex tqdm
! pip install git+https://github.com/openai/CLIP.git

In [None]:
import os
import clip
import torch

import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from tqdm import tqdm

In [None]:
# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

In [None]:
data_dir = "/content/gdrive/MyDrive/coco_crops_few_shot/coco_crops_few_shot/"

In [None]:
data_transforms = {
    'train': transforms.Compose([
        transforms.Resize(224),
        transforms.RandomCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}
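The `preprocess` pipeline printed earlier normalizes with CLIP-specific mean and standard deviation, while the transforms above use the ImageNet values. Whether this affects the accuracy reported in this notebook was not measured; the sketch below only shows how far apart the two normalizations are for a single pixel. The constants are copied from the two cells, and `normalize` is a hypothetical helper.

```python
import numpy as np

# Values from the printed CLIP preprocess pipeline.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])
# Values used in data_transforms above.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(pixel, mean, std):
    # Same per-channel operation as transforms.Normalize.
    return (pixel - mean) / std

pixel = np.array([1.0, 1.0, 1.0])  # a white pixel after ToTensor()
clip_out = normalize(pixel, CLIP_MEAN, CLIP_STD)
imagenet_out = normalize(pixel, IMAGENET_MEAN, IMAGENET_STD)

print("CLIP stats:    ", clip_out.round(3))
print("ImageNet stats:", imagenet_out.round(3))
print("difference:    ", (imagenet_out - clip_out).round(3))
```

One option to try, untested here, is passing `preprocess` itself as the `transform` argument of `datasets.ImageFolder`, so that the encoder sees inputs normalized the same way as in the zero-shot cells.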

In [None]:
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['train', 'val']}

In [None]:
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=16,
                                              shuffle=True, num_workers=8)
              for x in ['train', 'val']}



In [None]:
def get_features(dataset):
    all_features = []
    all_labels = []
    
    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))

            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

In [None]:
# Calculate the image features
train_features, train_labels = get_features(image_datasets['train'])
test_features, test_labels = get_features(image_datasets['val'])

100%|██████████| 3/3 [01:40<00:00, 33.46s/it]
100%|██████████| 6/6 [03:30<00:00, 35.16s/it]


In [None]:
# Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

# Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Accuracy = 88.469


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s finished
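To see how accuracy changes with the number of labeled examples per class, the training features can be subsampled before fitting the classifier. The helper below is a hypothetical sketch of such k-shot sampling on a toy label array; with the notebook's data it could be applied to `train_labels` to index `train_features`.

```python
import numpy as np

def sample_k_shot(labels, k, seed=0):
    """Return sorted indices holding at most k examples of each distinct label."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        chosen.extend(rng.choice(idx, size=min(k, len(idx)), replace=False))
    return np.sort(np.array(chosen))

toy_labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
subset = sample_k_shot(toy_labels, k=2)
print(subset, toy_labels[subset])
```

Fitting the logistic regression on `train_features[subset]` and `train_labels[subset]` for several values of `k` would give a shots-versus-accuracy curve.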


Ideas to test later:

1.   Try a different classifier for the few-shot example.
2.   Try models other than CLIP for few-shot/zero-shot image classification and novel object detection.
3.   Check the other CLIP models ['RN50', 'RN101', 'RN50x4', 'RN50x16', 'ViT-B/32', 'ViT-B/16'] and compare the results.
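For the first idea, one simple alternative to logistic regression is a nearest-centroid classifier on normalized features. This is a sketch of that swapped-in technique on toy 2-D data, not something evaluated in this notebook; `fit_centroids` and `predict_centroids` are hypothetical names.

```python
import numpy as np

def fit_centroids(features, labels):
    """Compute one unit-length mean vector per class."""
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    return classes, centroids

def predict_centroids(features, classes, centroids):
    """Assign each feature vector to the class with the most similar centroid."""
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    return classes[(features @ centroids.T).argmax(axis=1)]

# Toy features standing in for CLIP image embeddings.
train_x = np.array([[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]])
train_y = np.array([0, 0, 1, 1])
test_x = np.array([[0.8, 0.2], [0.2, 0.7]])

classes, centroids = fit_centroids(train_x, train_y)
pred = predict_centroids(test_x, classes, centroids)
print(pred)
```

With the notebook's variables, the same functions could be called on `train_features`, `train_labels`, and `test_features` and compared against the logistic regression accuracy.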



