<a href="https://colab.research.google.com/github/tsakailab/alpp/blob/main/colab/demo_CLIP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CLIP (Contrastive Language-Image Pre-training)
<img src="https://images.openai.com/blob/fbc4f633-9ad4-4dc2-bd94-0b6f1feee22f/overview-a.svg?width=10&height=10&quality=50" width=480 align="top" /> &emsp;&emsp;
<img src="https://images.openai.com/blob/d9d46e4b-6d6a-4f9e-9345-5c6538b1b8c3/overview-b.svg?width=10&height=10&quality=50" width=480 align="top" style="float:left"/>

## Let's get a taste of CLIP.

----
### Install CLIP and obtain the pretrained model.
The download takes about one minute.

In [None]:
!pip install -q open_clip_torch

In [None]:
import open_clip
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model.eval();

### Fetch an image from the web and display it.

In [None]:
uri = "http://static.independent.co.uk/s3fs-public/styles/article_large/public/thumbnails/image/2016/02/25/13/cat-getty_0.jpg"

import imageio.v3 as imageio
cimg = imageio.imread(uri)

import matplotlib.pyplot as plt
plt.imshow(cimg); plt.axis('off');

### Describe the image in English.
- Use as many sentences as you like, of any length (the example below has six).

In [None]:
texts = [
    "A photo of a cute cat.",
    "This small orange kitten melts hearts with its adorable expression as it gazes into the camera's lens.",
    "This small orange tiger melts hearts with its adorable expression as it gazes into the camera's lens.",
    "A tiny orange kitten has perfected the art of capturing attention with its captivating stare. ",
    "Twin kittens",
    "Captivating cityscape, where modern skyscrapers and bustling streets blend in perfect harmony.",
    ]

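### Evaluate the similarity between the image and each sentence, and display the following values.
- Cosine similarity
- Probability obtained by applying the softmax function to the scaled similarities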
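
The two quantities above can be sketched in plain Python before running the real model. This is a minimal illustration, not the notebook's implementation: the three-dimensional vectors are hypothetical stand-ins for CLIP embeddings (which have hundreds of dimensions), and the helper names `cosine_similarity` and `softmax` are introduced here only for the example. The scale factor of 100 matches the one used in the notebook code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embeddings: one image vector and two text vectors.
image_vec = [0.9, 0.1, 0.2]
text_vecs = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.3]]

sims = [cosine_similarity(image_vec, t) for t in text_vecs]
probs = softmax([100.0 * s for s in sims])  # same scale factor as in the notebook
for s, p in zip(sims, probs):
    print(f"{s:+.3f} ({p:6.1%})")
```

Because the similarities are multiplied by 100 before the softmax, even a modest gap in cosine similarity turns into a large gap in probability. The probabilities are also relative to the set of sentences supplied, so adding or removing a sentence changes all of them.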

In [None]:
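import torch
from PIL import Image

with torch.no_grad():  # inference only, so gradients are not needed
    # Preprocess the image and add a batch dimension.
    image = preprocess(Image.fromarray(cimg)).unsqueeze(0)
    image_features = model.encode_image(image)

    # Tokenize the sentences and encode them.
    text = tokenizer(texts)
    text_features = model.encode_text(text)

# Normalize to unit length so that the dot product equals the cosine similarity.
unit_image_features = image_features / image_features.norm(dim=-1, keepdim=True)
unit_text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarities = unit_image_features.matmul(unit_text_features.T)
# Scale the similarities by 100 and apply softmax across the sentences.
text_probs = (100.0 * similarities).softmax(dim=-1)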

print(*["{:>+.3f} ({:6.1%} )".format(cossim, prob.item()) + ": \""
        + (txt[:80]+"...\"" if len(txt)>80 else txt+"\""+" "*(83-len(txt))) + " ({:d} tokens)".format(ntok)
        for cossim, prob, txt, ntok in zip(*similarities, *text_probs.data, texts, text.count_nonzero(dim=1))], sep='\n')

## Now it's your turn.
### Fetch an image and display it.

In [None]:
uri = "https://cdn.cms-twdigitalassets.com/content/dam/help-twitter/ja/using-twitter/accessibility/write-image-desc/1_Succinct_clear_detailed_example.jpeg.twimg.1920.jpeg"
#uri = "https://github.com/pytorch/hub/raw/master/images/dog.jpg"
#uri = "https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F166345%2F7358a513-a377-c29f-2a3d-4e2058990576.jpeg?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&w=1400&fit=max&s=6946b0bc6140a739bc60ddaa3a0aab8c"
#uri = "http://images.cocodataset.org/test-stuff2017/000000006149.jpg"
#uri = "http://images.cocodataset.org/test-stuff2017/000000024309.jpg"
#uri = "http://images.cocodataset.org/test-stuff2017/000000004954.jpg"
#uri = "https://otamatone.jp/cms/wp-content/uploads/2019/09/190421_otamatone71638-300x300.jpg"
#uri = "https://eeo.today/media/wp-content/uploads/2024/01/17171601/15.png"

cimg = imageio.imread(uri, pilmode='RGBA')[:,:,:3]
plt.imshow(cimg); plt.axis('off');

### Describe the image in English.
- Use as many sentences as you like, of any length.
- You may use a translation tool to write your sentences → [DeepL](https://www.deepl.com/ja/translator)

In [None]:
texts = [
    "A photo of X",
    "Verbalize the picture yourself.",
    "",
    ]

In [None]:
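import torch

with torch.no_grad():  # inference only, so gradients are not needed
    # Preprocess the image and add a batch dimension.
    image = preprocess(Image.fromarray(cimg)).unsqueeze(0)
    image_features = model.encode_image(image)

    # Tokenize the sentences and encode them.
    text = tokenizer(texts)
    text_features = model.encode_text(text)

# Normalize to unit length so that the dot product equals the cosine similarity.
unit_image_features = image_features / image_features.norm(dim=-1, keepdim=True)
unit_text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarities = unit_image_features.matmul(unit_text_features.T)
# Scale the similarities by 100 and apply softmax across the sentences.
text_probs = (100.0 * similarities).softmax(dim=-1)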

print(*["{:>+.3f} ({:6.1%} )".format(cossim, prob.item()) + ": \""
        + (txt[:80]+"...\"" if len(txt)>80 else txt+"\""+" "*(83-len(txt))) + " ({:2d} tokens)".format(ntok)
        for cossim, prob, txt, ntok in zip(*similarities, *text_probs.data, texts, text.count_nonzero(dim=1))], sep='\n')

# Reference: techniques for converting between images and text
The technique of automatically generating captions for images and figures is called [image captioning](https://paperswithcode.com/task/image-captioning).
- [huggingface.co](https://huggingface.co/docs/transformers/main/en/tasks/image_captioning) [paperswithcode.com](https://paperswithcode.com/task/image-captioning)
- [Try it out 1](https://huggingface.co/spaces/SRDdev/Image-Caption), [Try it out 2](https://imagecaptiongenerator.com/)

Conversely, the technique of generating images from text is called [text-to-image generation](https://en.wikipedia.org/wiki/Text-to-image_model); [Stable Diffusion](https://en.wikipedia.org/wiki/Stable_Diffusion) is one well-known example.

<br>

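# Reference: materials for improving your own verbalization skills
Developing the ability to put visual information into words is a blind spot in information literacy education.
- [How to write great image descriptions](https://help.twitter.com/ja/using-x/write-image-descriptions)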
- [Image Description Guidelines](http://diagramcenter.org/table-of-contents-2.html)
- [Effective Practices for Description of Science Content within Digital Talking Books](https://www.wgbh.org/foundation/services/ncam/effective-practices-for-description-of-science-content-within-digital-talking-books)
- [Describing Figures](https://www.sigaccess.org/welcome-to-sigaccess/resources/describing-figures/)

----
# Extras below ([reference](https://github.com/mlfoundations/open_clip/blob/main/docs/Interacting_with_open_clip.ipynb))

In [None]:
context_length = model.context_length
vocab_size = model.vocab_size

import numpy as np
print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Context length:", context_length)
print("Vocab size:", vocab_size)

In [None]:
vocabs = tokenizer.encoder
print(len(vocabs))
import random
print(*random.sample(list(vocabs.items()), 5), sep='  ')

In [None]:
print(text.count_nonzero(dim=1))
print(*texts, sep='\n')
print(text)
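
The token counts printed above come from counting the nonzero entries in each row of `text`. The following is a rough sketch of why that works, under stated assumptions: each sentence is wrapped in start-of-text and end-of-text ids and then right-padded with zeros up to the context length. The ids, the toy context length of 8, and the helper names `encode` and `count_nonzero` are all hypothetical and chosen only for illustration; the real values come from the tokenizer and from `model.context_length`.

```python
CONTEXT_LENGTH = 8          # toy value; the notebook reads the real one from model.context_length
PAD_ID = 0                  # assumption: padding positions hold id 0
SOT_ID, EOT_ID = 101, 102   # hypothetical start-of-text and end-of-text ids

def encode(word_ids):
    """Wrap the ids in start/end markers and right-pad the row with PAD_ID."""
    row = [SOT_ID] + word_ids + [EOT_ID]
    assert len(row) <= CONTEXT_LENGTH, "this sketch does not handle truncation"
    return row + [PAD_ID] * (CONTEXT_LENGTH - len(row))

def count_nonzero(row):
    """Count the entries that are not padding."""
    return sum(1 for t in row if t != PAD_ID)

# Three toy "sentences": three words, one word, and an empty string.
batch = [encode([7, 8, 9]), encode([5]), encode([])]
token_counts = [count_nonzero(row) for row in batch]
for row, n in zip(batch, token_counts):
    print(row, "->", n, "tokens")
```

Under these assumptions the count includes the two marker tokens, so even an empty sentence is reported as having two tokens.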

In [None]:
import open_clip
open_clip.list_pretrained()

In [None]:
preprocess