# CLIP (Contrastive Language-Image Pre-Training)

In this notebook, we will be exploring OpenAI's [CLIP](https://openai.com/research/clip) model. CLIP is a deep learning model that learns to *associate images and text*. It is trained on a large collection of image-text pairs, and learns to predict which image goes with which text. This allows it to perform a variety of tasks, such as zero-shot image classification and image-text retrieval, and to guide text-to-image generation models (CLIP itself does not generate images).
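
To make "predict which image goes with which text" concrete, here is a minimal sketch of a symmetric contrastive loss in the spirit of the pseudocode in the CLIP paper. The function name `clip_contrastive_loss` and the toy embeddings are hypothetical, and the default `logit_scale` of `1 / 0.07` is only the initial temperature (in the real model it is learned); this is not the code OpenAI trained with.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale=1 / 0.07):
    """Symmetric cross-entropy over a batch of matching image-text pairs.

    Row i of image_emb and row i of text_emb are assumed to be a true pair.
    """
    # L2-normalize so dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [batch, batch] similarity matrix; entry (i, j) compares image i with text j
    logits = logit_scale * image_emb @ text_emb.t()

    # the correct "class" for row i is column i (the diagonal)
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_images = F.cross_entropy(logits, targets)      # pick the text for each image
    loss_texts = F.cross_entropy(logits.t(), targets)   # pick the image for each text
    return (loss_images + loss_texts) / 2

# toy check: perfectly aligned pairs give a loss near zero, shuffled pairs a large one
toy_emb = torch.eye(4)
aligned_loss = clip_contrastive_loss(toy_emb, toy_emb)
shuffled_loss = clip_contrastive_loss(toy_emb, toy_emb.roll(1, dims=0))
print(aligned_loss.item(), shuffled_loss.item())
```

Minimizing this loss pulls matching image and text embeddings together and pushes mismatched ones apart, which is what later makes zero-shot classification possible.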

In [None]:
import torch
import tqdm.auto as tqdm

# this automatically reloads the libraries so you can update them dynamically
%load_ext autoreload
%autoreload 2

## CLIP in Action

Install the required packages and the CLIP model.
*For people running this notebook locally*: you can use a fresh conda environment.

In [2]:
# uncomment the line below if you are not in colab
#%conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
%pip install ftfy regex tqdm
%pip install git+https://github.com/openai/CLIP.git

Collecting ftfy
  Obtaining dependency information for ftfy from https://files.pythonhosted.org/packages/f4/f0/21efef51304172736b823689aaf82f33dbc64f54e9b046b75f5212d5cee7/ftfy-6.2.0-py3-none-any.whl.metadata
  Downloading ftfy-6.2.0-py3-none-any.whl.metadata (7.3 kB)
Collecting regex
  Obtaining dependency information for regex from https://files.pythonhosted.org/packages/a8/01/18232f93672c1d530834e2e0568a80eaab1df12d67ae499b1762ab462b5c/regex-2023.12.25-cp311-cp311-win_amd64.whl.metadata
  Downloading regex-2023.12.25-cp311-cp311-win_amd64.whl.metadata (41 kB)

  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git 'C:\Users\johnj\AppData\Local\Temp\pip-req-build-qc5zgx7d'


We will be working with the following diagram of the CLIP architecture:
<img src='https://github.com/openai/CLIP/blob/main/CLIP.png?raw=true'>

In [1]:
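# download the CLIP diagram (uncomment where wget is available, e.g. Colab or Linux;
# it is usually missing on Windows). The URL is quoted because it contains a '?'.
#!wget "https://github.com/openai/CLIP/blob/main/CLIP.png?raw=true" -O CLIP.png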

'wget' is not recognized as an internal or external command,
operable program or batch file.
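
Since `wget` is not available on every platform (as the error above shows on Windows), a platform-independent alternative is to download the diagram with Python's standard library. This is a minimal sketch; the helper name `download_if_missing` is hypothetical and not part of CLIP.

```python
import urllib.request
from pathlib import Path

CLIP_DIAGRAM_URL = "https://github.com/openai/CLIP/blob/main/CLIP.png?raw=true"

def download_if_missing(url, dest):
    """Download url to dest unless the file already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        # urlretrieve follows redirects and writes the response body to dest
        urllib.request.urlretrieve(url, dest)
    return dest

# Uncomment to fetch the diagram used in the cells below (needs network access):
# download_if_missing(CLIP_DIAGRAM_URL, "CLIP.png")
```

Skipping the download when the file already exists keeps the cell cheap to re-run.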


Now, let's load a pretrained CLIP model whose image encoder is a base-size (B) Vision Transformer (ViT) that splits each image into 32×32-pixel patches.
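
The "32" in `ViT-B/32` is the side length of each patch in pixels, not the number of patches. As a quick sanity check, the arithmetic below works out how many patches the Vision Transformer sees, assuming the 224×224 input resolution that CLIP's preprocessing produces for this model (the variable names are ours, for illustration).

```python
image_size = 224   # assumed input resolution (pixels per side) for ViT-B/32
patch_size = 32    # the "/32" in the model name: pixels per patch side

patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2
num_tokens = num_patches + 1  # plus one class token prepended by the ViT

print(f"{patches_per_side}x{patches_per_side} = {num_patches} patches, {num_tokens} tokens")
# prints: 7x7 = 49 patches, 50 tokens
```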

In [3]:
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

100%|███████████████████████████████████████| 338M/338M [05:43<00:00, 1.03MiB/s]


In [6]:
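# TODO: try your own images and text labels
im = "CLIP.png"  # must exist locally; download it first if needed
# preprocess resizes, crops, and normalizes; unsqueeze(0) adds a batch dimension
image = preprocess(Image.open(im)).unsqueeze(0).to(device)
# tokenize each candidate label into a fixed-length tensor of token ids
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    # embeddings from each encoder (not used below, shown for reference)
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # scaled similarity between the image and every label, in both directions
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # e.g. [[0.9927 0.0042 0.0030]]; exact values vary with device and precision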

Label probs: [[0.9927   0.004185 0.002968]]
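
To see where these probabilities come from, here is a minimal sketch of what we understand `model(image, text)` to compute from the encoder outputs: normalize both embeddings, take scaled cosine similarities, and apply a softmax over the labels. The function name `clip_style_probs` and the toy tensors are hypothetical, and the default `logit_scale` of 100 is an assumption that roughly matches the released checkpoints; the real value can be read from `model.logit_scale.exp()`.

```python
import torch

def clip_style_probs(image_features, text_features, logit_scale=100.0):
    """Turn image/text embeddings into per-image label probabilities."""
    # L2-normalize so the dot product below is a cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # scaled cosine similarity: one row per image, one column per label
    logits_per_image = logit_scale * image_features @ text_features.t()
    return logits_per_image.softmax(dim=-1)

# toy stand-ins for encoder outputs: 1 image, 3 labels, 4-dim embeddings
toy_image = torch.tensor([[1.0, 0.0, 0.0, 0.0]])
toy_text = torch.tensor([
    [0.9, 0.1, 0.0, 0.0],   # nearly parallel to the image embedding
    [0.0, 1.0, 0.0, 0.0],   # orthogonal
    [0.0, 0.0, 1.0, 0.0],   # orthogonal
])
toy_probs = clip_style_probs(toy_image, toy_text)
print(toy_probs)
```

With the notebook's own variables, calling `clip_style_probs(image_features.float(), text_features.float(), model.logit_scale.exp().float())` should give values close to `probs` above, up to floating-point precision. The large scale factor is why a modest difference in cosine similarity turns into a probability near 1 for the best-matching label.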
