# Searching Irasutoya images with CLIP

* What is CLIP: https://qiita.com/sonoisa/items/00e8e2861147842f0237
* Live demo: https://huggingface.co/spaces/sonoisa/Irasuto_search_CLIP_zero-shot

Note: The Irasutoya data used as the example below cannot be redistributed, so the code will not run as-is. To try it, prepare your own dataset of text-image pairs and adapt the code accordingly.

In [None]:
!nvidia-smi

Tue Apr 12 11:33:23 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    24W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

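# Preparing the data

### Irasutoya metadata

Prepare a JSON file (file name: irasuto_items_20210224.json) containing a list of objects with the following attributes.  
Unfortunately, permission to distribute the data has not been obtained, so it cannot be provided here.

* "page": URL of the illustration's page
* "title": title of the illustration
* "desc": description of the illustration
* "imgs": list of image URLs for the illustration (an entry may have more than one image)


Example: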

    [{
        "page": "https://www.irasutoya.com/2014/09/blog-post_804.html", 
        "title": "お辞儀をしている男性会社員のイラスト", 
        "desc": "お礼や謝罪のお辞儀をしている、男性会社員（サラリーマン）のイラストです。", 
        "imgs": ["https://3.bp.blogspot.com/-zaYysdZCU4k/U-8HB8LJ2LI/AAAAAAAAk-A/RwTwUJI01hs/s450/business_ojigi_man.png"]
    }, 
    {
        "page": "https://www.irasutoya.com/2017/03/blog-post_806.html", 
        "title": "笑い袋のイラスト", 
        "desc": "押すと楽しい笑い声が聞こえる玩具、笑い袋のイラストです。", 
        "imgs": ["https://4.bp.blogspot.com/-voAAYQAh5tk/WKbK-SWJTTI/AAAAAAABB4g/5cB7IA-BisQPzpeqMsPAGjjKE3rNgdoRgCLcB/s400/toy_waraibukuro.png"]
    }, ...
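
Before going further, it can help to check that a self-prepared file has the shape described above. The following is a minimal sketch, not part of the original notebook; the helper name `validate_items` and the inline sample data are hypothetical.

```python
import json

# Keys every metadata entry is expected to carry (taken from the list above).
REQUIRED_KEYS = ("page", "title", "desc", "imgs")


def validate_items(items):
    """Return a list of (index, problem) tuples for malformed entries."""
    problems = []
    for index, item in enumerate(items):
        for key in REQUIRED_KEYS:
            if key not in item:
                problems.append((index, f"missing key: {key}"))
        if "imgs" in item and not isinstance(item["imgs"], list):
            problems.append((index, "imgs is not a list"))
    return problems


# Hypothetical sample: the second entry lacks "desc" and "imgs".
sample = json.loads(
    '[{"page": "https://example.com/a", "title": "t", "desc": "d",'
    ' "imgs": ["https://example.com/a.png"]},'
    ' {"page": "https://example.com/b", "title": "t"}]'
)
print(validate_items(sample))
```

An empty result means every entry has the four attributes and a list-valued "imgs".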

In [None]:
# !cp /content/drive/MyDrive/irasuto_items_20210224.json /content/

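### Irasutoya image data

Download the images referenced by the Irasutoya metadata and arrange them in the following directory structure.

    /content/images/{index of the metadata entry}/{image file name}

For the metadata above, the image files are expected to be stored as follows.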

    /content/images/0/business_ojigi_man.png
    /content/images/1/toy_waraibukuro.png
    ...
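
The mapping from a metadata entry to its expected local path can be sketched as follows. This mirrors the path logic used later in this notebook (the last path segment of the URL becomes the file name); the helper name `local_image_path` and the `root` parameter are assumptions introduced here for illustration.

```python
import urllib.parse


def local_image_path(item_index, image_url, root="/content/images"):
    """Build the expected local path for one image URL of one metadata entry."""
    # Take the last segment of the URL path as the file name.
    filename = urllib.parse.urlparse(image_url).path.rsplit("/", 1)[-1]
    return f"{root}/{item_index}/{filename}"


url = ("https://3.bp.blogspot.com/-zaYysdZCU4k/U-8HB8LJ2LI/AAAAAAAAk-A/"
       "RwTwUJI01hs/s450/business_ojigi_man.png")
print(local_image_path(0, url))
```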


In [None]:
# %%capture
# !cp -r /content/drive/MyDrive/images.zip /content/
# !unzip images.zip

# Installing dependencies

In [None]:
%%capture
!pip install transformers==4.14.0 pytorch_lightning==1.5.7 fugashi ipadic ja-sentence-segmenter pyarrow

In [None]:
!git clone https://github.com/sonoisa/clip-japanese

Cloning into 'clip-japanese'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 36 (delta 7), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (36/36), done.


# Preprocessing the Irasutoya images

Load the metadata.

In [None]:
import json

with open("/content/irasuto_items_20210224.json", "r", encoding="utf-8") as f_in:
    items = json.load(f_in)

## Resizing images to 224 x 224

Scale each image so that its width and height are at least 224, then crop the central 224 x 224 region.
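
The geometry of this step can be sketched in plain Python. This is only an illustration of the operation as described, assuming the shorter side is scaled to 224 with the aspect ratio preserved; the exact resizing behaviour of `CLIPFeatureExtractor` may differ between transformers versions, and the function name `resize_and_crop_box` is hypothetical.

```python
def resize_and_crop_box(width, height, size=224):
    """Return the resized (width, height) and the centered crop box."""
    short_side = min(width, height)
    # Scale so that the shorter side becomes exactly `size`.
    new_width = max(size, round(width * size / short_side))
    new_height = max(size, round(height * size / short_side))
    # Center the size x size window inside the resized image.
    left = (new_width - size) // 2
    top = (new_height - size) // 2
    return (new_width, new_height), (left, top, left + size, top + size)


# A 450 x 300 image is scaled to 336 x 224, then cropped horizontally.
print(resize_and_crop_box(450, 300))
```

With Pillow, the returned size and box could be passed to `Image.resize` and `Image.crop` respectively.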

In [None]:
import os
import urllib.parse
from transformers import CLIPFeatureExtractor
from PIL import Image
from tqdm import tqdm


# Create the directories that will hold the preprocessed image files
for item_index in range(len(items)):
    os.makedirs(f"/content/images_224x224/{item_index}", exist_ok=True)


# Preprocessor: resize and center-crop each image so that it becomes 224x224.
preprocessor = CLIPFeatureExtractor(
    do_resize=True, size=224,
    do_center_crop=True, crop_size=224,
    do_normalize=False)

for item_index, item in tqdm(enumerate(items), total=len(items)):
    for image_url in item["imgs"]:
        try:
            image_filename = urllib.parse.urlparse(image_url).path.rsplit("/", 1)[-1]
            image_filepath = f"/content/images/{item_index}/{image_filename}"
            image = Image.open(image_filepath)
            resized_and_cropped_image = preprocessor([image]).pixel_values[0]
            resized_and_cropped_image.save(f"/content/images_224x224/{item_index}/{image_filename}")
        except Exception as e:
            # Report the failure and continue with the next image.
            print(e, item_index, image_url)

  7%|▋         | 1656/25007 [00:52<11:30, 33.83it/s]

unknown file extension:  1651 https://lh4.googleusercontent.com/proxy/qg-k0HMNY6DaNBaSEIJdokyIloovn4dbdjDJtk8XzEQrX-IspOw0lzKfuRVY_bacN4abLXtp4XoPYJt89XOnh-K_uTIkdxLitIiwXOye=s0-d


 23%|██▎       | 5739/25007 [02:50<09:43, 33.00it/s]

unknown file extension:  5733 https://2.bp.blogspot.com/-7YovtPx3SoY/WHG2MK42yFI/AAAAAAABA_s/O9io3JgZJF87EEOT6udwMABKyO_9zpLQACLcB/s400/paralympic_#archery.png


 29%|██▉       | 7306/25007 [03:33<08:03, 36.59it/s]

unknown file extension:  7300 https://lh4.googleusercontent.com/proxy/qg-k0HMNY6DaNBaSEIJdokyIloovn4dbdjDJtk8XzEQrX-IspOw0lzKfuRVY_bacN4abLXtp4XoPYJt89XOnh-K_uTIkdxLitIiwXOye=s0-d


 51%|█████     | 12630/25007 [06:08<06:21, 32.48it/s]

[Errno 2] No such file or directory: '/content/images/12625/tamanegi_onion+(1).png' 12625 https://3.bp.blogspot.com/-SDkR2b5YQec/UkJNENH-daI/AAAAAAAAYYE/fZCzG5KG9I4/s400/tamanegi_onion+(1).png


 62%|██████▏   | 15397/25007 [07:34<04:34, 34.96it/s]

unknown file extension:  15390 https://lh4.googleusercontent.com/proxy/qg-k0HMNY6DaNBaSEIJdokyIloovn4dbdjDJtk8XzEQrX-IspOw0lzKfuRVY_bacN4abLXtp4XoPYJt89XOnh-K_uTIkdxLitIiwXOye=s0-d


 69%|██████▊   | 17142/25007 [08:33<04:17, 30.50it/s]

[Errno 2] No such file or directory: '/content/images/17135/kaden_stand+(1).png' 17135 https://3.bp.blogspot.com/-UV6E0Ecju68/UnXnT0rAYAI/AAAAAAAAaNE/83TmKz0yMus/s400/kaden_stand+(1).png


 86%|████████▌ | 21469/25007 [10:47<01:46, 33.18it/s]

[Errno 2] No such file or directory: '/content/images/21470/Cocos-(Keeling)-Islands-The.png' 21470 https://2.bp.blogspot.com/-h_yq0pXhHKg/U2tHdKJj3fI/AAAAAAAAgM8/zHI3vvYVM0U/s120/Cocos-(Keeling)-Islands-The.png


 95%|█████████▍| 23685/25007 [11:52<00:30, 42.97it/s]

unknown file extension:  23675 https://goo.gl/kte14


100%|██████████| 25007/25007 [12:30<00:00, 33.32it/s]


# Defining the CLIP classes and loading the model

In [None]:
import os
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download


class ClipTextModel(nn.Module):
    def __init__(self, model_name_or_path, device=None):
        super(ClipTextModel, self).__init__()

        if os.path.exists(model_name_or_path):
            # load from file system
            output_linear_state_dict = torch.load(os.path.join(model_name_or_path, "output_linear.bin"))
        else:
            # download from the Hugging Face model hub
            filename = hf_hub_download(repo_id=model_name_or_path, filename="output_linear.bin")
            output_linear_state_dict = torch.load(filename)

        self.model = AutoModel.from_pretrained(model_name_or_path)
        config = self.model.config

        self.max_cls_depth = 6

        sentence_vector_size = output_linear_state_dict["bias"].shape[0]
        self.sentence_vector_size = sentence_vector_size
        self.output_linear = nn.Linear(self.max_cls_depth * config.hidden_size, sentence_vector_size)
        self.output_linear.load_state_dict(output_linear_state_dict)

        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, 
                                                       is_fast=True, do_lower_case=True)

        self.eval()

        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"
        self.device = torch.device(device)
        self.to(self.device)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
    ):
        output_states = self.model(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            output_attentions=None,
            output_hidden_states=True,
            return_dict=True,
        )
        # Hidden states of every layer (index 0 is the embedding output).
        hidden_states = output_states["hidden_states"]

        # Collect the [CLS] vector from each of the last `max_cls_depth` layers.
        output_vectors = []

        for i in range(1, self.max_cls_depth + 1):
            cls_token = hidden_states[-1 * i][:, 0]
            output_vectors.append(cls_token)

        # Concatenate them and project into the shared text-image embedding space.
        output_vector = torch.cat(output_vectors, dim=1)
        logits = self.output_linear(output_vector)

        output = (logits,) + output_states[2:]
        return output

    @torch.no_grad()
    def encode_text(self, texts, batch_size=8, max_length=64):
        self.eval()
        all_embeddings = []
        iterator = range(0, len(texts), batch_size)
        for batch_idx in iterator:
            batch = texts[batch_idx:batch_idx + batch_size]

            encoded_input = self.tokenizer.batch_encode_plus(
                batch, max_length=max_length, padding="longest", 
                truncation=True, return_tensors="pt").to(self.device)
            model_output = self(**encoded_input)
            text_embeddings = model_output[0].cpu()

            all_embeddings.extend(text_embeddings)

        # return torch.stack(all_embeddings).numpy()
        return torch.stack(all_embeddings)        

    def save(self, output_dir):
        self.model.save_pretrained(output_dir)
        self.tokenizer.save_pretrained(output_dir)
        torch.save(self.output_linear.state_dict(), os.path.join(output_dir, "output_linear.bin"))


In [None]:
import os
import torch
from torch import nn
import transformers
from huggingface_hub import hf_hub_download


class ClipVisionModel(nn.Module):
    def __init__(self, model_name_or_path, device=None):
        super(ClipVisionModel, self).__init__()

        if os.path.exists(model_name_or_path):
            # load from file system
            visual_projection_state_dict = torch.load(os.path.join(model_name_or_path, "visual_projection.bin"))
        else:
            # download from the Hugging Face model hub
            filename = hf_hub_download(repo_id=model_name_or_path, filename="visual_projection.bin")
            visual_projection_state_dict = torch.load(filename)

        self.model = transformers.CLIPVisionModel.from_pretrained(model_name_or_path)
        config = self.model.config

        self.feature_extractor = transformers.CLIPFeatureExtractor.from_pretrained(model_name_or_path)

        vision_embed_dim = config.hidden_size
        projection_dim = 512

        self.visual_projection = nn.Linear(vision_embed_dim, projection_dim, bias=False)
        self.visual_projection.load_state_dict(visual_projection_state_dict)

        self.eval()

        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"
        self.device = torch.device(device)
        self.to(self.device)

    def forward(
        self,
        pixel_values=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        output_states = self.model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        image_embeds = self.visual_projection(output_states[1])

        return image_embeds

    @torch.no_grad()
    def encode_image(self, images, batch_size=8):
        self.eval()
        all_embeddings = []
        iterator = range(0, len(images), batch_size)
        for batch_idx in iterator:
            batch = images[batch_idx:batch_idx + batch_size]

            encoded_input = self.feature_extractor(batch, return_tensors="pt").to(self.device)
            model_output = self(**encoded_input)
            image_embeddings = model_output.cpu()

            all_embeddings.extend(image_embeddings)

        # return torch.stack(all_embeddings).numpy()
        return torch.stack(all_embeddings)        

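    @staticmethod
    def remove_alpha_channel(image):
        # Composite the image onto a white background, then drop the alpha channel.
        # Relies on `from PIL import Image` from an earlier cell.
        rgba_image = image.convert("RGBA")
        alpha = rgba_image.split()[-1]
        background = Image.new("RGBA", image.size, (255, 255, 255))
        background.paste(rgba_image, mask=alpha)
        image = background.convert("RGB")
        return image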

    def save(self, output_dir):
        self.model.save_pretrained(output_dir)
        self.feature_extractor.save_pretrained(output_dir)
        torch.save(self.visual_projection.state_dict(), os.path.join(output_dir, "visual_projection.bin"))


In [None]:
import os
import torch
from torch import nn
from huggingface_hub import snapshot_download


class ClipModel(nn.Module):
    def __init__(self, model_name_or_path, device=None):
        super(ClipModel, self).__init__()

        if os.path.exists(model_name_or_path):
            # load from file system
            repo_dir = model_name_or_path
        else:
            # download from the Hugging Face model hub
            repo_dir = snapshot_download(model_name_or_path)

        self.text_model = ClipTextModel(repo_dir, device=device)
        self.vision_model = ClipVisionModel(os.path.join(repo_dir, "vision_model"), device=device)

        with torch.no_grad():
            logit_scale = nn.Parameter(torch.ones([]) * 2.6592)
            logit_scale.set_(torch.load(os.path.join(repo_dir, "logit_scale.bin")).clone().cpu())
            self.logit_scale = logit_scale

        self.eval()

        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"
        self.device = torch.device(device)
        self.to(self.device)

    def forward(self, pixel_values, input_ids, attention_mask, token_type_ids):
        image_features = self.vision_model(pixel_values=pixel_values)
        text_features = self.text_model(input_ids=input_ids, 
                                        attention_mask=attention_mask, 
                                        token_type_ids=token_type_ids)[0]

        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        logit_scale = self.logit_scale.exp()
        logits_per_image = logit_scale * image_features @ text_features.t()
        logits_per_text = logits_per_image.t()

        return logits_per_image, logits_per_text

    def save(self, output_dir):
        torch.save(self.logit_scale, os.path.join(output_dir, "logit_scale.bin"))
        self.text_model.save(output_dir)
        self.vision_model.save(os.path.join(output_dir, "vision_model"))

# Loading the Japanese CLIP model

In [None]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ClipModel("sonoisa/clip-vit-b-32-japanese-v1", device=device)


# Building the dataset for zero-shot search

Define the sentence normalization.

In [None]:
# Adapted, with some modifications, from https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
from __future__ import unicode_literals
import re
import unicodedata

def unicode_normalize(cls, s):
    pt = re.compile('([{}]+)'.format(cls))

    def norm(c):
        return unicodedata.normalize('NFKC', c) if pt.match(c) else c

    s = ''.join(norm(x) for x in re.split(pt, s))
    s = re.sub('－', '-', s)
    return s

def remove_extra_spaces(s):
    s = re.sub('[ 　]+', ' ', s)
    blocks = ''.join(('\u4E00-\u9FFF',  # CJK UNIFIED IDEOGRAPHS
                      '\u3040-\u309F',  # HIRAGANA
                      '\u30A0-\u30FF',  # KATAKANA
                      '\u3000-\u303F',  # CJK SYMBOLS AND PUNCTUATION
                      '\uFF00-\uFFEF'   # HALFWIDTH AND FULLWIDTH FORMS
                      ))
    basic_latin = '\u0000-\u007F'

    def remove_space_between(cls1, cls2, s):
        p = re.compile('([{}]) ([{}])'.format(cls1, cls2))
        while p.search(s):
            s = p.sub(r'\1\2', s)
        return s

    s = remove_space_between(blocks, blocks, s)
    s = remove_space_between(blocks, basic_latin, s)
    s = remove_space_between(basic_latin, blocks, s)
    return s

def normalize_neologd(s):
    s = s.strip()
    s = unicode_normalize('０-９Ａ-Ｚａ-ｚ｡-ﾟ', s)

    def maketrans(f, t):
        return {ord(x): ord(y) for x, y in zip(f, t)}

    s = re.sub('[˗֊‐‑‒–⁃⁻₋−]+', '-', s)  # normalize hyphens
    s = re.sub('[﹣－ｰ—―─━ー]+', 'ー', s)  # normalize choonpus
    s = re.sub('[~∼∾〜〰～]+', '〜', s)  # normalize tildes (modified by Isao Sonobe)
    s = s.translate(
        maketrans('!"#$%&\'()*+,-./:;<=>?@[¥]^_`{|}~｡､･｢｣',
              '！”＃＄％＆’（）＊＋，－．／：；＜＝＞？＠［￥］＾＿｀｛｜｝〜。、・「」'))

    s = remove_extra_spaces(s)
    s = unicode_normalize('！”＃＄％＆’（）＊＋，－．／：；＜＞？＠［￥］＾＿｀｛｜｝〜', s)  # keep ＝,・,「,」
    s = re.sub('[’]', '\'', s)
    s = re.sub('[”]', '"', s)
    s = s.lower()
    return s

def normalize_text(text):
    return normalize_neologd(text)
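
The core idea of the normalization above is to apply Unicode NFKC folding only to selected character classes, leaving everything else untouched. A minimal standard-library sketch of that idea follows; the helper name `nfkc_only` is hypothetical and the example string is made up for illustration.

```python
import re
import unicodedata


def nfkc_only(char_class, text):
    """Apply NFKC normalization only to runs of characters in `char_class`."""
    pattern = re.compile("([{}]+)".format(char_class))
    # re.split with a capturing group keeps the matched runs in the result.
    parts = re.split(pattern, text)
    return "".join(
        unicodedata.normalize("NFKC", part) if pattern.match(part) else part
        for part in parts
    )


# Full-width alphanumerics and half-width katakana are folded;
# the full-width exclamation mark is outside the class and is kept.
print(nfkc_only("０-９Ａ-Ｚａ-ｚ｡-ﾟ", "ＡＩのｲﾗｽﾄ！"))  # AIのイラスト！
```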

Define the sentence segmentation.

In [None]:
import functools
from ja_sentence_segmenter.common.pipeline import make_pipeline
from ja_sentence_segmenter.concatenate.simple_concatenator import concatenate_matching
from ja_sentence_segmenter.split.simple_splitter import split_newline, split_punctuation

split_punc2 = functools.partial(split_punctuation, punctuations=r"。!?")
concat_tail_no = functools.partial(concatenate_matching, former_matching_rule=r"^(?P<result>.+)(の)$", 
                                   remove_former_matched=False)
segmenter = make_pipeline(split_newline, concat_tail_no, split_punc2)
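
Later in this notebook only the first segmented sentence is kept as the description. As a rough stand-in that uses only `re`, the idea can be sketched as below. It does not reproduce the rules of ja_sentence_segmenter (for example, joining lines that end in の), and the function name `first_sentence` is hypothetical.

```python
import re


def first_sentence(text):
    """Return the first line of `text` up to and including its first 。, ! or ?."""
    first_line = text.split("\n", 1)[0]
    match = re.match(r".*?[。!?]", first_line)
    # Fall back to the whole first line when no sentence-ending mark is found.
    return match.group(0) if match else first_line


print(first_sentence("笑い袋のイラストです。押すと笑い声が聞こえます。"))
```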

Define the mapping from image URLs to the paths of the corresponding local image files.

In [None]:
import os
import urllib.parse

def get_image_files(item_index, image_urls):
    image_files = []
    new_image_urls = []
    for image_url in image_urls:
        image_filename = urllib.parse.urlparse(image_url).path.rsplit("/", 1)[-1]
        image_filepath = f"/content/images_224x224/{item_index}/{image_filename}"
        if os.path.isfile(image_filepath):
            image_files.append(image_filepath)
            new_image_urls.append(image_url)
    return image_files, new_image_urls

Load the Irasutoya data.

In [None]:
import json

with open("/content/irasuto_items_20210224.json", "r", encoding="utf-8") as f_in:
    items = json.load(f_in)

Normalize the titles and descriptions.

In [None]:
for item in items:
    if "LINEスタンプ" in item["title"] or "LINEのスタンプ" in item["title"]:
        continue

    try:
        title = item["title"]
        normalized_title = normalize_text(title)
        item["normalized_title"] = normalized_title

        desc = item["desc"]
        if desc.strip() == "":
            # If there is no description, use the title as the description.
            item["normalized_desc"] = normalized_title
            item["desc"] = title
        else:
            normalized_desc = normalize_text(desc)
            normalized_desc = list(segmenter(normalized_desc))[0]  # Use only the first sentence as the description.
            item["normalized_desc"] = normalized_desc
    except Exception as e:
        print(e, item)
        continue

Compute the description and image embeddings for every Irasutoya entry.

In [None]:
from tqdm import tqdm

import pandas as pd
from PIL import Image

data = []
for item_index, item in tqdm(enumerate(items), total=len(items)):
    if "normalized_title" not in item or "normalized_desc" not in item:
        continue
    
    page = item["page"]
    title = item["title"]
    desc = item["desc"]
    normalized_title = item["normalized_title"]
    normalized_desc = item["normalized_desc"]
    image_urls = item["imgs"]
    
    image_files, image_urls = get_image_files(item_index, image_urls)
    if len(image_files) == 0:
        continue

    try:
        sentence_vector = model.text_model.encode_text([normalized_desc])[0].numpy()
    except Exception as e:
        print(e, item_index)
        continue

    try:
        # Images with an alpha channel cannot be fed to the model, so remove it (composite onto a white background).
        images = [ClipVisionModel.remove_alpha_channel(Image.open(image_file)) for image_file in image_files]
        image_vectors = model.vision_model.encode_image(images)

        for image_vector, image_url in zip(image_vectors, image_urls):
            image_vector = image_vector.numpy()
            data.append([item_index, page, title, normalized_title, desc, normalized_desc, 
                        image_urls, image_url, sentence_vector, image_vector])
    except Exception as e:
        print(e, item_index)
        raise e

columns = ["item_index", "page", "title", "normalized_title", "description", "normalized_description", 
           "images", "image_url", "sentence_vector", "image_vector"]
df = pd.DataFrame(data, columns=columns)

100%|██████████| 25007/25007 [12:26<00:00, 33.52it/s]


# Search experiments

In [None]:
import numpy as np

# Sentence embedding vectors of all illustrations to be searched
target_sentence_vectors = np.stack(df["sentence_vector"])

# Image embedding vectors of all illustrations to be searched
target_image_vectors = np.stack(df["image_vector"])

In [None]:
from scipy.spatial.distance import cdist
from html import escape
from PIL import Image
from IPython.display import display, HTML, clear_output

# Embed a text query
def encode_text(text, model):
    text = normalize_text(text)
    text_embedding = model.text_model.encode_text([text]).numpy()
    return text_embedding

# Embed an image query
def encode_image(image_filename, model):
    image = Image.open(image_filename)
    image_embedding = model.vision_model.encode_image([image]).numpy()
    return image_embedding

# Run the search and display the closest results
def search_irasuto(query_embedding, target_vectors, closest_n=10):
    distances = cdist(
        query_embedding, target_vectors, metric="cosine"
    )[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    for i, (idx, distance) in enumerate(results[0:closest_n]):
        page_url = df.iloc[idx]["page"]
        desc = df.iloc[idx]["description"]
        img_url = df.iloc[idx]["image_url"]
        # Cosine distance lies in [0, 2]; halve it to show a score in [0, 1].
        display(HTML(f"<div><a href='{escape(page_url)}' target='_blank' rel='noopener noreferrer'>"
                     f"<img src='{escape(img_url)}' width='100'>{distance / 2:.4f}: {escape(desc)}</a></div>"))
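
To make explicit what the cosine-distance ranking in `search_irasuto` computes, here is a small sketch without SciPy or NumPy. It is an illustration on made-up 2-D vectors, not the notebook's implementation; the helper names `cosine_distance` and `top_n` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 - cos(angle between u and v), in the range [0, 2]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def top_n(query, targets, closest_n=10):
    """Return (index, distance) pairs of the closest targets, nearest first."""
    distances = [(index, cosine_distance(query, target))
                 for index, target in enumerate(targets)]
    return sorted(distances, key=lambda pair: pair[1])[:closest_n]


# Toy targets: same direction, orthogonal, opposite, and 45 degrees away.
targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
ranking = top_n([1.0, 0.0], targets, closest_n=3)
print([index for index, _ in ranking])  # [0, 3, 1]
```

Because the distance ranges from 0 (same direction) to 2 (opposite direction), dividing it by 2 gives a score between 0 and 1.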
    

In [None]:
query_input = "暴走したAIのイラスト"  # "Illustration of an AI gone out of control"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)

In [None]:
query_input = "暴走した人工知能のイラスト"  # "Illustration of an artificial intelligence gone out of control"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_sentence_vectors, closest_n=5)

In [None]:
query_input = "暴走した人工知能のイラスト"  # "Illustration of an artificial intelligence gone out of control"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)

In [None]:
query_input = "想像以上に大きかったイラスト"  # "Illustration of something bigger than imagined"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_sentence_vectors, closest_n=5)

In [None]:
query_input = "想像以上に大きかったという気持ちのイラスト"  # "Illustration of the feeling that it was bigger than imagined"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)

In [None]:
query_input = "なんて素敵なんだと感動しているイラスト"  # "Illustration of being moved by how wonderful something is"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)

In [None]:
query_input = "これが欲しかったというイラスト"  # "Illustration of 'this is what I wanted'"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)

In [None]:
query_input = "的確な言葉を選ぶのが大変だというイラスト"  # "Illustration of how hard it is to choose the right words"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_sentence_vectors, closest_n=5)

In [None]:
query_input = "的確な言葉を選ぶのが大変だというイラスト"  # "Illustration of how hard it is to choose the right words"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)

In [None]:
query_input = "拍手をしている人のイラスト"  # "Illustration of a person clapping"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_sentence_vectors, closest_n=5)

In [None]:
query_input = "拍手をしている人のイラスト"  # "Illustration of a person clapping"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)

In [None]:
query_input = "絵を描いている人工知能のイラスト"  # "Illustration of an artificial intelligence drawing a picture"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_sentence_vectors, closest_n=5)

In [None]:
query_input = "絵を描いている人工知能のイラスト"  # "Illustration of an artificial intelligence drawing a picture"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)

In [None]:
query_input = "コロナウイルスと戦う細胞のイラスト"  # "Illustration of cells fighting the coronavirus"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_sentence_vectors, closest_n=5)

In [None]:
query_input = "コロナウイルスと戦う細胞のイラスト"  # "Illustration of cells fighting the coronavirus"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)

In [None]:
query_input = "感染検査をしているイラスト"  # "Illustration of infection testing"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_sentence_vectors, closest_n=5)

In [None]:
query_input = "感染検査をしているイラスト"  # "Illustration of infection testing"

query_embedding = encode_text(query_input, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)

In [None]:
image_file = "/content/clip-japanese/sample_images/1.jpeg"

query_embedding = encode_image(image_file, model)
search_irasuto(query_embedding, target_sentence_vectors, closest_n=5)

In [None]:
image_file = "/content/clip-japanese/sample_images/1.jpeg"

query_embedding = encode_image(image_file, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)

In [None]:
image_file = "/content/clip-japanese/sample_images/3.jpeg"

query_embedding = encode_image(image_file, model)
search_irasuto(query_embedding, target_sentence_vectors, closest_n=5)

In [None]:
image_file = "/content/clip-japanese/sample_images/3.jpeg"

query_embedding = encode_image(image_file, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)

In [None]:
image_file = "/content/clip-japanese/sample_images/4.jpeg"

query_embedding = encode_image(image_file, model)
search_irasuto(query_embedding, target_sentence_vectors, closest_n=5)

In [None]:
image_file = "/content/clip-japanese/sample_images/4.jpeg"

query_embedding = encode_image(image_file, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)

In [None]:
image_file = "/content/clip-japanese/sample_images/10.jpeg"

query_embedding = encode_image(image_file, model)
search_irasuto(query_embedding, target_sentence_vectors, closest_n=5)

In [None]:
image_file = "/content/clip-japanese/sample_images/10.jpeg"

query_embedding = encode_image(image_file, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)

In [None]:
image_file = "/content/clip-japanese/sample_images/14.jpeg"

query_embedding = encode_image(image_file, model)
search_irasuto(query_embedding, target_sentence_vectors, closest_n=5)

In [None]:
image_file = "/content/clip-japanese/sample_images/14.jpeg"

query_embedding = encode_image(image_file, model)
search_irasuto(query_embedding, target_image_vectors, closest_n=5)