<a href="https://colab.research.google.com/github/yingjun-mou/CLIP/blob/master/Reproduce_CLIP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook aims to reproduce the CLIP model proposed in *Learning Transferable Visual Models From Natural Language Supervision*.

# Part 1. Use an existing open_clip model

Task: Use open_clip to perform image classification.

## Step 1. Colab preparation

In [None]:
import os
# Hide all GPUs from CUDA so that the notebook runs on the CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

In [None]:
! pip install open_clip_torch matplotlib

In [None]:
import numpy as np
import torch

## Step 2. Loading the model

In [None]:
import open_clip

# List the names of all available CLIP models.
open_clip.list_pretrained()

In [None]:
# Load one of the models.
model, _, preprocess = open_clip.create_model_and_transforms('convnext_base_w', pretrained='laion2b_s13b_b82k_augreg')

In [None]:
model.eval()
context_length = model.context_length
vocab_size = model.vocab_size

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Context length:", context_length)
print("Vocab size:", vocab_size)


## Step 3. Image Preprocessing

* normalize the pixel intensity using the dataset mean and standard deviation
* resize the input images
* center-crop them to match the image resolution that the model expects

In [None]:
preprocess
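
To make the three steps concrete, the plain-Python sketch below works through the arithmetic for one image size and one pixel. It is an illustration, not the code behind `preprocess`: the helper names are hypothetical, the 256-pixel target is an assumption (the real value is whatever the `preprocess` output reports), and the channel statistics are the ones published with the original OpenAI CLIP release, which this checkpoint may or may not use.

```python
# Hypothetical helpers illustrating resize -> center crop -> normalize.
# Assumed constants: channel statistics from the original OpenAI CLIP release.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def resize_shorter_side(width, height, target):
    """Scale (width, height) so that the shorter side equals `target`."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, size):
    """Return (left, top, right, bottom) of a centered size x size crop."""
    left = (width - size) // 2
    top = (height - size) // 2
    return left, top, left + size, top + size


def normalize_pixel(rgb, mean=CLIP_MEAN, std=CLIP_STD):
    """Map 8-bit RGB values to [0, 1], then standardize each channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))


target = 256  # assumed input resolution
resized = resize_shorter_side(640, 480, target)
box = center_crop_box(*resized, target)
pixel = normalize_pixel((128, 128, 128))
print(resized, box, pixel)
```

Resizing by the shorter side preserves the aspect ratio, so the center crop only discards the excess along the longer side.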

## Step 4. Text Preprocessing

* Use the case-insensitive tokenizer, `tokenizer.tokenize()`. It pads its output to 77 tokens, which is the length the CLIP model expects.

In [None]:
from open_clip import tokenizer

In [None]:
tokenizer.tokenize("Hello World!")
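
The padding behavior can be illustrated without loading the tokenizer. The sketch below is a simplified stand-in, not open_clip's implementation: `pad_to_context` is a hypothetical helper, the middle token ids are placeholders, and the start and end ids 49406 and 49407 are those of the original CLIP byte-pair vocabulary, assumed here to apply.

```python
SOT_ID = 49406  # assumed start-of-text id (original CLIP BPE vocabulary)
EOT_ID = 49407  # assumed end-of-text id
CONTEXT_LENGTH = 77


def pad_to_context(token_ids, context_length=CONTEXT_LENGTH):
    """Wrap ids in start/end markers, then truncate or zero-pad to a fixed length."""
    ids = [SOT_ID] + list(token_ids) + [EOT_ID]
    if len(ids) > context_length:
        # Truncate, but keep the end-of-text marker as the last token.
        ids = ids[:context_length]
        ids[-1] = EOT_ID
    return ids + [0] * (context_length - len(ids))


short = pad_to_context([101, 202, 303])  # placeholder ids, not real tokens
long = pad_to_context(range(1, 200))     # longer than the context window
print(len(short), short[:6])
print(len(long), long[-1])
```

Because every sequence ends up with the same length, a batch of captions can be stacked into a single tensor.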

## Step 5. Set up input images and texts

We are going to feed 8 example images and their textual descriptions to the model, and compare the similarity between the corresponding features.

The tokenizer is case-insensitive, and we can freely give any suitable textual descriptions.
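
Similarity between features is typically measured as cosine similarity: each feature vector is scaled to unit length, and the dot product of a text vector with an image vector gives a score between -1 and 1. A minimal plain-Python sketch, using invented three-dimensional features (real CLIP embeddings have hundreds of dimensions) and hypothetical function names:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity_matrix(text_feats, image_feats):
    """Entry [i][j] is the cosine similarity of text i and image j."""
    texts_n = [l2_normalize(t) for t in text_feats]
    images_n = [l2_normalize(v) for v in image_feats]
    return [[sum(a * b for a, b in zip(t, v)) for v in images_n] for t in texts_n]


# Toy features, invented for illustration only.
image_features = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
text_features = [[3.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

similarity = cosine_similarity_matrix(text_features, image_features)
for row in similarity:
    print([round(s, 3) for s in row])
```

When each description matches its image, the largest value in every row is expected to fall on the diagonal.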

In [None]:
import os
import skimage
import IPython.display
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

from collections import OrderedDict
import torch

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# images in skimage to use and their textual descriptions
descriptions = {
    "page": "a page of text about segmentation",
    "chelsea": "a facial photo of a tabby cat",
    "astronaut": "a portrait of an astronaut with the American flag",
    "rocket": "a rocket standing on a launchpad",
    "motorcycle_right": "a red motorcycle standing in a garage",
    "camera": "a person looking at a camera on a tripod",
    "horse": "a black-and-white silhouette of a horse",
    "coffee": "a cup of coffee on a saucer"
}


In [None]:
original_images = []
images = []
texts = []
plt.figure(figsize=(16,5))

for filename in [f for f in os.listdir(skimage.data_dir) if f.endswith((".png", ".jpg"))]:
  name = os.path.splitext(filename)[0]
  if name not in descriptions:
    continue

  image = Image.open(os.path.join(skimage.data_dir, filename)).convert("RGB")

  plt.subplot(2, 4, len(images) +1)
  plt.imshow(image)
  plt.title(f"{filename}\n{descriptions[name]}")
  plt.xticks([])
  plt.yticks([])

  original_images.append(image)
  images.append(preprocess(image))
  texts.append(descriptions[name])  # descriptions is a dict, so index it rather than call it

plt.tight_layout()
