# Deep Dive into Text-Image Search Engine with Towhee

In the [previous tutorial](./1_build_text_image_search_engine.ipynb), we built and prototyped a proof-of-concept image search engine. Now, let's feed it large-scale image datasets and deploy it as an accelerated service.

## Preparation

### Install Dependencies

First, we need to install dependencies such as pymilvus, towhee, opencv-python, and torchvision.

In [1]:
! python -m pip -q install pymilvus towhee opencv-python torchvision

### Prepare the data

For text-image search, we use the CIFAR-10 dataset as an example to show how to finetune the CLIP model on a user's custom dataset. CIFAR-10 contains 60,000 32x32 color images in 10 classes and is widely used as an image recognition benchmark for computer vision models. Since CIFAR-10 ships without captions, we create one for each image by inserting its class label into a fixed sentence template.


In [28]:
import torchvision
import os
import json


root_dir = '/tmp/'
train_dataset = torchvision.datasets.CIFAR10(root=root_dir, train=True, download=True)
eval_dataset = torchvision.datasets.CIFAR10(root=root_dir, train=False, download=True)


idx = 0  # global counter so that IDs stay unique across the train and eval splits
def build_image_text_dataset(root, folder, dataset):
    results = []
    global idx
    labels = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
    # create the output folder if it does not exist yet
    os.makedirs(os.path.join(root, folder), exist_ok=True)
    for img, label_idx in dataset:
        item = {}
        imgname = "IMG{:06d}.png".format(idx)
        filename = os.path.join(root, folder, imgname)
        caption = 'this is a picture of {}.'.format(labels[label_idx])
        img.save(filename)
        item['caption_id'] = idx
        item['image_id'] = idx
        item['caption'] = caption
        item['image_path'] = filename
        results.append(item)
        # increment only after the IDs are assigned so they match the file name (IMG000000 -> ID 0)
        idx = idx + 1
    return results

def gen_caption_meta(root, name, meta):
    save_path = os.path.join(root, name+'.json')
    with open(save_path, 'w') as fw:
        fw.write(json.dumps(meta, indent=4))

train_results = build_image_text_dataset(root_dir, 'train', train_dataset)
gen_caption_meta(root_dir, 'train', train_results)

eval_results = build_image_text_dataset(root_dir, 'eval', eval_dataset)
gen_caption_meta(root_dir, 'eval', eval_results)


Files already downloaded and verified
Files already downloaded and verified


Now we have image-text annotations for CIFAR-10:

|caption ID|image ID | caption   |  image  | image path|
|:--------|:-------- |:----------|:--------|:----------|
| 0 | 0 | this is a picture of frog.|  <img src="train/IMG000000.png" max-width="50" width="50" height="50">| train/IMG000000.png |
| 1 | 1 | this is a picture of truck. |  <img src="train/IMG000001.png" max-width="50" width="50" height="50">| train/IMG000001.png |
| 2 | 2 | this is a picture of truck. |  <img src="train/IMG000002.png" max-width="50" width="50" height="50">| train/IMG000002.png  |
| 3 | 3 | this is a picture of deer.|  <img src="train/IMG000003.png" max-width="50" width="50" height="50">| train/IMG000003.png  |
| 4 | 4 | this is a picture of automobile.|  <img src="train/IMG000004.png" max-width="50" width="50" height="50">| train/IMG000004.png  |
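
Before training, it can be worth checking that the generated metadata has the shape we expect. The sketch below is not part of Towhee: `validate_caption_meta` is a hypothetical helper that assumes the record layout produced by `build_image_text_dataset` above (keys `caption_id`, `image_id`, `caption`, `image_path`) and reports records with missing keys or duplicate image IDs.

```python
REQUIRED_KEYS = {'caption_id', 'image_id', 'caption', 'image_path'}

def validate_caption_meta(records):
    """Return a list of human-readable problems found in the records."""
    problems = []
    seen_ids = set()
    for pos, item in enumerate(records):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append('record {}: missing keys {}'.format(pos, sorted(missing)))
            continue
        if item['image_id'] in seen_ids:
            problems.append('record {}: duplicate image_id {}'.format(pos, item['image_id']))
        seen_ids.add(item['image_id'])
    return problems

# Tiny hand-written sample in the same layout as train.json (illustrative values only).
sample = [
    {'caption_id': 0, 'image_id': 0, 'caption': 'this is a picture of frog.',
     'image_path': '/tmp/train/IMG000000.png'},
    {'caption_id': 1, 'image_id': 1, 'caption': 'this is a picture of truck.',
     'image_path': '/tmp/train/IMG000001.png'},
]
print(validate_caption_meta(sample))  # an empty list means no problems were found
```

To check the real file, load `/tmp/train.json` with `json.load` and pass the resulting list to the function.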

### Create a Milvus Collection

Before getting started, please make sure you have built the `text_image_search` collection, which uses the [L2 distance metric](https://milvus.io/docs/v2.0.x/metric.md#Euclidean-distance-L2) and an [IVF_FLAT index](https://milvus.io/docs/v2.0.x/index.md#IVF_FLAT), as in the [previous tutorial](./1_build_text_image_search_engine.ipynb).

With the collection in place, we can finetune the CLIP operator on the captioned CIFAR-10 data prepared above.

In [None]:
import towhee
from towhee.dc2 import ops
#step1
#get the operator; modality has no effect on training, it only selects the inference branch.
clip_op = ops.image_text_embedding.clip(model_name='clip_vit_base_patch16', modality='image').get_op()


#step2
#trainer configuration; these parameters follow the standard Hugging Face training configuration.
data_args = {
    'dataset_name': None,
    'dataset_config_name': None,
    'train_file': '/tmp/train.json',
    'validation_file': '/tmp/eval.json',
    'cache_dir': './cache',
    'max_seq_length': 77,
    'data_dir': 'path_to_your_data',
    'image_mean': [0.48145466, 0.4578275, 0.40821073],
    'image_std': [0.26862954, 0.26130258, 0.27577711]
}

training_args = {
    'num_train_epochs': 32, # increase the number of epochs for potentially better metrics.
    'per_device_train_batch_size': 64,
    'per_device_eval_batch_size': 64,
    'do_train': True,
    'do_eval': True,
    'eval_steps': 1,
    'remove_unused_columns': False,
    'dataloader_drop_last': True,
    'output_dir': './output/train_clip_exp',
    'overwrite_output_dir': True,
}

#step3
#train your model
clip_op.train(data_args=data_args, training_args=training_args)
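
To get a feel for how long this run is, the number of optimizer steps follows from the values above. The sketch below assumes a single device, no gradient accumulation, and the 50,000-image CIFAR-10 training split; `steps_per_epoch` is a hypothetical helper, not part of Towhee or transformers.

```python
def steps_per_epoch(num_samples, batch_size, drop_last=True):
    """Number of batches a dataloader yields per epoch."""
    if drop_last:
        return num_samples // batch_size  # the incomplete final batch is discarded
    return -(-num_samples // batch_size)  # ceiling division keeps the final partial batch

num_train_images = 50000  # size of the CIFAR-10 training split
per_epoch = steps_per_epoch(num_train_images, batch_size=64, drop_last=True)
total = per_epoch * 32    # num_train_epochs from training_args

print(per_epoch, total)   # prints: 781 24992
```

With `dataloader_drop_last` set to `True`, the last 16 images of each epoch (50,000 minus 781 × 64) are skipped under these assumptions.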


The CLIP operator uses the standard Hugging Face training procedure to finetune the model. Details of the training configuration can be found in the [transformers documentation](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments).
When training has finished, we can load the trained weights into the operator.

In [None]:
ops.image_text_embedding.clip(model_name='clip_vit_base_patch16', modality='image', checkpoint_path='path_your_weights/pytorch.bin')

## Making Our Text-Image Search Pipeline Production Ready

The text-image pipeline can now be finetuned on a custom dataset to improve results on that data. To put the text-image search engine into production, we also need to run the whole pipeline more efficiently than plain PyTorch execution allows.

Towhee supports NVIDIA Triton Inference Server to improve inference performance for production-ready services. A supported model can be transferred to a Triton service in just a few lines.

Operators can be packed into a Triton service for better inference performance. Some operator models can also be exported to ONNX for further acceleration (the default format is TorchScript).

In [3]:
from towhee.dc2 import ops
import numpy as np

op = ops.image_text_embedding.clip(model_name='clip_vit_base_patch16', modality='image').get_op()
full_list = op.supported_model_names()
onnx_list = op.supported_model_names(format='onnx')

print('full model list:', full_list)
print('onnx model list:', onnx_list)

full model list: ['clip_vit_base_patch16', 'clip_vit_base_patch32', 'clip_vit_large_patch14', 'clip_vit_large_patch14_336']
onnx model list: ['clip_vit_base_patch16', 'clip_vit_base_patch32', 'clip_vit_large_patch14', 'clip_vit_large_patch14_336']


All candidate CLIP models can be converted to ONNX for acceleration in the Triton pipeline.

In [None]:
from towhee.dc2 import pipe  # pipe is used below; assumed to live in towhee.dc2 alongside ops

op = ops.image_text_embedding.clip(model_name='clip_vit_base_patch16', modality='text').get_op()

#your host machine IP address, e.g. 192.158.1.38
ip_addr = '192.158.1.38'

#make sure you have built Milvus collection successfully.
p_search = (
    pipe.input('text')
        .map('text', 'vec', ops.image_text_embedding.clip(model_name='clip_vit_base_patch16', modality='text'), config={'device': 0})
        .map('vec', 'vec', lambda x: x / np.linalg.norm(x))
        .map('vec', ('search_res'), ops.ann_search.milvus_client(host=ip_addr, port='19530', limit=5, collection_name="text_image_search", output_fields=['url']))
        .output('text','search_res')
)

towhee.build_docker_image_v2(
    dc_pipeline=p_search,
    image_name='text_image_search:v1',
    cuda_version='11.7', # '117dev' for developer
    format_priority=['onnx'],
    inference_server='triton'
)
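
The pipeline normalizes each text embedding to unit length before searching. One reason this pairs well with an L2-indexed collection, assuming the stored image embeddings were normalized the same way: for unit vectors, the squared L2 distance equals 2 - 2·cos, so ranking by L2 distance matches ranking by cosine similarity. The sketch below checks this with made-up 3-dimensional vectors, using only the standard library in place of `np.linalg.norm`.

```python
import math

def l2_normalize(vec):
    """Scale vec to unit length (same effect as x / np.linalg.norm(x))."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for a text embedding and two image embeddings.
query = l2_normalize([3.0, 4.0, 0.0])
images = {
    'img_a': l2_normalize([3.0, 4.0, 1.0]),
    'img_b': l2_normalize([0.0, 1.0, 5.0]),
}

for name, vec in images.items():
    d2 = squared_l2(query, vec)
    cos = cosine(query, vec)
    # For unit vectors the two printed values agree: d2 == 2 - 2 * cos.
    print(name, round(d2, 4), round(2 - 2 * cos, 4))
```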


After the Docker image is built, the inference service and its associated model reside inside it. Start the service by running a Docker container.

```console
docker run -td --gpus=all --shm-size=1g \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    text_image_search:v1 \
    tritonserver --model-repository=/workspace/models
```
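
The container can take a while to load the models, so it may help to wait until the server reports ready before sending queries. The sketch below polls `/v2/health/ready`, the readiness route of the HTTP inference protocol that Triton implements, on the HTTP port 8000 mapped above. `wait_until_ready` is a hypothetical helper, and the timeout values are arbitrary.

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url, timeout_s=120.0, interval_s=2.0):
    """Poll url until it answers HTTP 200 or timeout_s elapses; return True if ready."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable or not ready yet; retry after a short pause
        time.sleep(interval_s)
    return False
```

Calling `wait_until_ready('http://localhost:8000/v2/health/ready')` returns `True` once the server answers as ready, or `False` if the timeout expires first.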

Now we can use a client to query the accelerated service.

In [26]:
from towhee import triton_client

client = triton_client.Client(url='localhost:8000')

data = "a black dog."
res = client(data)

for idx, dis_score, path in res[0][1]:
    print('idx: {}, distance_score:{:.2f} , path: {}'.format(idx, dis_score, path))
client.close()

idx: 96, distance_score:1.35 , path: ./train/Bouvier_des_Flandres/n02106382_8906.JPEG
idx: 506, distance_score:1.38 , path: ./train/Doberman/n02107142_4753.JPEG
idx: 835, distance_score:1.38 , path: ./train/Afghan_hound/n02088094_3882.JPEG
idx: 507, distance_score:1.39 , path: ./train/Doberman/n02107142_32921.JPEG
idx: 832, distance_score:1.39 , path: ./train/Afghan_hound/n02088094_6565.JPEG
