# Deep Dive into Text-Image Search Engine with Towhee

In the [previous tutorial](./1_build_text_image_search_engine.ipynb), we built and prototyped a proof-of-concept image search engine. Now let's feed it large-scale image datasets and deploy it as a microservice with Towhee.

## Preparation

### Install Dependencies

First, install the dependencies: pymilvus, towhee, fastapi, and opencv-python, plus pandas, uvicorn, and nest_asyncio, which later cells import.

In [1]:
! python -m pip -q install pymilvus towhee fastapi opencv-python pandas uvicorn nest_asyncio

### Prepare the data

This demo uses a subset of the ImageNet dataset (100 classes, 10 images per class), which is available on [GitHub](https://github.com/towhee-io/examples/releases/download/data/reverse_image_search.zip). It is the same dataset used in our previous tutorial, "[Build a Milvus powered Text-Image Search Engine in Minutes](./1_build_text_image_search_engine.ipynb)". To make things easy, we repeat the important code blocks below; if you have already downloaded the data, move on to the next section.

The dataset is organized as follows:
- **train**: directory of candidate images;
- **test**: directory of test images;
- **reverse_image_search.csv**: a CSV file containing an ***id***, ***path***, and ***label*** for each image.

In [2]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/reverse_image_search.zip -O
! unzip -q -o reverse_image_search.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0
100  119M  100  119M    0     0  9808k      0  0:00:12  0:00:12 --:--:-- 15.4M


To use the dataset for image search, let's first define a helper function:

- **read_images(results)**: load the image for each search result, looked up by the result's image ID.

In [11]:
import cv2
import pandas as pd
import towhee
from towhee._types.image import Image

df = pd.read_csv('reverse_image_search.csv')
df.head()

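# Map each image ID to its file path.
id_img = df.set_index('id')['path'].to_dict()

def read_images(results):
    """Load the image for each search result, looked up by result ID."""
    imgs = []
    for result in results:  # avoid the name `re`, which shadows the stdlib module
        path = id_img[result.id]
        imgs.append(Image(cv2.imread(path), 'BGR'))
    return imgs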

### Create a Milvus Collection

Before getting started, please make sure you have [installed Milvus](https://milvus.io/docs/v2.0.x/install_standalone-docker.md) and that the server is reachable at `127.0.0.1:19530`, the address used below. Let's first create a `text_image_search` collection that uses the [L2 distance metric](https://milvus.io/docs/v2.0.x/metric.md#Euclidean-distance-L2) and an [IVF_FLAT index](https://milvus.io/docs/v2.0.x/index.md#IVF_FLAT).

In [5]:
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

def create_milvus_collection(collection_name, dim):
    connections.connect(host='127.0.0.1', port='19530')
    
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)
    
    fields = [
        FieldSchema(name='id', dtype=DataType.INT64, description='ids', is_primary=True, auto_id=False),
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='embedding vectors', dim=dim)
    ]
    schema = CollectionSchema(fields=fields, description='text image search')
    collection = Collection(name=collection_name, schema=schema)

    index_params = {
        'metric_type':'L2',
        'index_type':"IVF_FLAT",
        'params':{"nlist":512}
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    return collection
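
The pipelines later in this tutorial pass every embedding through `tensor_normalize` before inserting it into Milvus. Assuming that operator scales each vector to unit L2 norm (an assumption about its behavior, not something stated here), the L2 metric chosen above ranks results in the same order as cosine similarity, because for unit vectors `||a - b||^2 = 2 - 2 * cos(a, b)`. A minimal, dependency-free sketch with made-up vectors:

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-d vectors for illustration; the collections in this tutorial use dim=512.
query = normalize([1.0, 2.0, 3.0])
candidates = {
    'near': normalize([1.0, 2.0, 2.5]),
    'far': normalize([-3.0, 0.5, 1.0]),
}

for name, vec in candidates.items():
    # For unit vectors the two quantities agree up to floating-point error.
    print(name, squared_l2(query, vec), 2 - 2 * cosine(query, vec))

# Smaller L2 distance corresponds to larger cosine similarity,
# so both orderings list the candidates in the same sequence.
by_l2 = sorted(candidates, key=lambda n: squared_l2(query, candidates[n]))
by_cosine = sorted(candidates, key=lambda n: -cosine(query, candidates[n]))
print(by_l2, by_cosine)
```

If the embeddings were not normalized, the identity would not hold and the two metrics could rank results differently.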

## Making Our Text-Image Search Engine Production Ready

To put the text-image search engine into production, we need to feed it with a large-scale dataset and deploy a microservice to accept incoming queries.

### Optimize for large-scale dataset

When the dataset grows to tens of millions of images, we face two significant problems:

1. Embedding extraction and Milvus data loading need to be fast so that the search index can be built in time.
2. The dataset will contain corrupted images or images in the wrong format. Cleaning up every such case is impractical at this scale, so the data pipeline must be robust to these exceptions.

Towhee supports parallel execution to improve performance on large-scale datasets, and it also provides an `exception_safe` execution mode to keep the system stable.
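
To make the two ideas concrete, here is a minimal sketch in plain Python. It is not how Towhee implements `set_parallel` or `exception_safe`; the `decode` function, the `safe` wrapper, and the sample inputs are hypothetical stand-ins. It shows the general pattern: run the per-item work on a pool of workers, and turn a failure on one item into a skipped record instead of a crash of the whole job.

```python
from concurrent.futures import ThreadPoolExecutor

def decode(path):
    """Hypothetical stand-in for image decoding: fails on 'corrupted' inputs."""
    if path.endswith('.broken'):
        raise ValueError(f'cannot decode {path}')
    return f'pixels of {path}'

def safe(func):
    """Wrap func so an exception yields (item, None, error) instead of propagating."""
    def wrapper(item):
        try:
            return item, func(item), None
        except Exception as err:  # deliberately broad: one bad item must not stop the batch
            return item, None, err
    return wrapper

paths = ['a.jpg', 'b.broken', 'c.jpg', 'd.jpg']

# Four workers process items concurrently; map() still returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(safe(decode), paths))

decoded = [value for _, value, err in results if err is None]
failed = [item for item, _, err in results if err is not None]
print(f'decoded {len(decoded)} items, skipped {len(failed)}: {failed}')
```

Keeping the failed items in a separate list, rather than silently dropping them, makes it possible to inspect or retry them after the bulk load finishes.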

### Improve Performance with Parallel Execution

We can enable parallel execution by calling `set_parallel` within the pipeline, which tells Towhee to process the data in parallel. The example below enables parallel execution on a pipeline that uses the CLIP model. In the run recorded here, the parallel pipeline finished about 3.6 times faster than the sequential one (81.60s versus 295.80s). Note that you should clear the GPU cache before running in parallel.

We'll use a helper class to compute runtime:

In [9]:
import time

class Timer:
    def __init__(self, name):
        self._name = name

    def __enter__(self):
        self._start = time.time()
        return self

    def __exit__(self, *args):
        self._interval = time.time() - self._start
        print('%s: %.2fs'%(self._name, self._interval))
        
with Timer('test timer'): # a small test case for the timer
    time.sleep(2.4)

test timer: 2.40s


In [13]:
collection = create_milvus_collection('test_clip_vit_b32', 512)
with Timer('clip_vit_b32 load'):
    ( 
        towhee.read_csv('reverse_image_search.csv')
            .runas_op['id', 'id'](func=lambda x: int(x))
            .image_decode['path', 'img']()
            .image_text_embedding.clip['img', 'vec'](model_name='clip_vit_b32', modality='image')
            .tensor_normalize['vec', 'vec']()
            .to_milvus['id', 'vec'](collection=collection, batch=100)
    )
    
collection_parallel = create_milvus_collection('test_clip_vit_b32_parallel', 512)
with Timer('clip_vit_b32+parallel load'):
    ( 
        towhee.read_csv('reverse_image_search.csv')
            .runas_op['id', 'id'](func=lambda x: int(x))
            .set_parallel(4)
            .image_decode['path', 'img']()
            .image_text_embedding.clip['img', 'vec'](model_name='clip_vit_b32', modality='image')
            .tensor_normalize['vec', 'vec']()
            .to_milvus['id', 'vec'](collection=collection_parallel, batch=100)
    )

clip_vit_b32 load: 295.80s
clip_vit_b32+parallel load: 81.60s
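
The speedup can be read directly off the two timings above. The short calculation below uses only the numbers printed by this run; "parallel efficiency" here means speedup divided by the number of workers passed to `set_parallel`, which is a common convention rather than anything Towhee reports.

```python
sequential_s = 295.80  # 'clip_vit_b32 load' from the output above
parallel_s = 81.60     # 'clip_vit_b32+parallel load' from the output above
workers = 4            # argument passed to set_parallel

speedup = sequential_s / parallel_s
efficiency = speedup / workers

print(f'speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}')
```

The result is a speedup of about 3.6x, or roughly 90% of the ideal 4x for four workers. Timings from a single run on one machine will vary, so treat the figure as indicative.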


## Deploy as a Microservice

The data pipeline used in our experiments can be converted to a function with `towhee.api` and `as_function()`, as shown in the [previous tutorial](./1_build_text_image_search_engine.ipynb). We can also convert the data pipeline into a RESTful API with `serve()`, which generates FastAPI services from Towhee pipelines.

### Insert Image Data

In [14]:
import time
import towhee
from fastapi import FastAPI
from pymilvus import connections, Collection

app = FastAPI()
connections.connect(host='127.0.0.1', port='19530')
milvus_collection = Collection('test_clip_vit_b32')

@towhee.register(name='get_path_id')
def get_path_id(path):
    timestamp = int(time.time()*10000)
    id_img[timestamp] = path
    return timestamp

@towhee.register(name='milvus_insert')
class MilvusInsert:
    def __init__(self, collection):
        self.collection = collection

    def __call__(self, *args, **kwargs):
        data = []
        for iterable in args:
            data.append([iterable])
        mr = self.collection.insert(data)
        self.collection.load()
        return str(mr)

with towhee.api['file']() as api:
    app_insert = (
        api.image_load['file', 'img']()
        .save_image['img', 'path'](dir='tmp/images')
        .get_path_id['path', 'id']()
        .image_text_embedding.clip['img', 'vec'](model_name='clip_vit_b32',modality='image')
        .tensor_normalize['vec', 'vec']()
        .milvus_insert[('id', 'vec'), 'res'](collection=milvus_collection)
        .select['id', 'path']()
        .serve('/insert', app)
    )
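
The `milvus_insert` operator above wraps each incoming value in its own list before calling `insert`. That matches the column-based layout that `Collection.insert` accepts in pymilvus 2.0.x: one list per field, in schema order. The sketch below reproduces only that reshaping step, with no Milvus connection; `to_columns` and `rows_to_columns` are hypothetical helper names, and the batch variant is an illustration rather than part of the pipeline above.

```python
def to_columns(*fields):
    """One row -> column layout, mirroring the loop in MilvusInsert.__call__."""
    return [[value] for value in fields]

def rows_to_columns(rows):
    """Several rows -> column layout (hypothetical batch variant)."""
    return [list(column) for column in zip(*rows)]

# A made-up ID and a shortened made-up embedding (the real collection uses dim=512).
single = to_columns(16540663372730, [0.1, 0.2, 0.3])
print(single)

# Two made-up rows of (id, embedding) regrouped into an ID column and an embedding column.
batch = rows_to_columns([
    (1, [0.1, 0.2]),
    (2, [0.3, 0.4]),
])
print(batch)
```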

### Search Matched Image

In [15]:
with towhee.api['text']() as api:
    app_search = (
        api.image_text_embedding.clip['text', 'vec'](model_name='clip_vit_b32',modality='text')
        .tensor_normalize['vec','vec']()
        .milvus_search['vec', 'result'](collection=milvus_collection, limit=5)
        .runas_op['result', 'res_file'](func=lambda res: str([id_img[x.id] for x in res]))
        .select['res_file']()
        .serve('/search', app)
    )

### Count Numbers

In [16]:
with towhee.api() as api:
    app_count = (
        api.map(lambda _: milvus_collection.num_entities)
        .serve('/count', app)
        )

### Start Server

Finally, start the FastAPI server. It exposes three services: `/insert`, `/search`, and `/count`. Once the server is running, you can test them with the following commands:

```bash
# upload text and search
$ curl -X POST "http://0.0.0.0:8000/search"  --data "a white dog"
# upload an image and insert
$ curl -X POST "http://0.0.0.0:8000/insert"  --data-binary @test/banana/n07753592_323.JPEG -H 'Content-Type: image/jpeg'
# count the collection
$ curl -X POST "http://0.0.0.0:8000/count"
```
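
The same requests can be issued from Python. The sketch below uses only the standard library and mirrors the `curl` commands above: the request bodies and the `Content-Type` header are copied from them, while the helper names, `BASE_URL`, and the use of `127.0.0.1` instead of `0.0.0.0` as the client-side address are assumptions. The response format is not specified in this tutorial, so `send` returns the raw body as text.

```python
import urllib.request

BASE_URL = 'http://127.0.0.1:8000'  # assumption: server running locally on port 8000

def build_search_request(text):
    """POST the query text as the raw request body, like `curl --data`."""
    return urllib.request.Request(
        f'{BASE_URL}/search', data=text.encode('utf-8'), method='POST'
    )

def build_insert_request(image_bytes):
    """POST raw JPEG bytes, like `curl --data-binary @file`."""
    return urllib.request.Request(
        f'{BASE_URL}/insert',
        data=image_bytes,
        headers={'Content-Type': 'image/jpeg'},
        method='POST',
    )

def build_count_request():
    """POST with an empty body to read the number of entities."""
    return urllib.request.Request(f'{BASE_URL}/count', data=b'', method='POST')

def send(request, timeout=30):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode('utf-8')
```

With the server running, `print(send(build_search_request('a white dog')))` performs the same search as the first `curl` command. Because `uvicorn.run` blocks the notebook cell that starts it, run the client from a separate process or terminal.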

In [17]:
import uvicorn
import nest_asyncio

nest_asyncio.apply()
uvicorn.run(app=app, host='0.0.0.0', port=8000)

INFO:     Started server process [21465]
2022-06-01 14:52:17,273 - 8605226496 - server.py-server:75 - INFO: Started server process [21465]
INFO:     Waiting for application startup.
2022-06-01 14:52:17,274 - 8605226496 - on.py-on:45 - INFO: Waiting for application startup.
INFO:     Application startup complete.
2022-06-01 14:52:17,275 - 8605226496 - on.py-on:59 - INFO: Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
2022-06-01 14:52:17,276 - 8605226496 - server.py-server:206 - INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)


INFO:     127.0.0.1:64892 - "POST /search HTTP/1.1" 200 OK
INFO:     127.0.0.1:64905 - "POST /insert HTTP/1.1" 200 OK
INFO:     127.0.0.1:64907 - "POST /insert HTTP/1.1" 200 OK
INFO:     127.0.0.1:64909 - "POST /count HTTP/1.1" 200 OK


INFO:     Shutting down
2022-06-01 14:57:04,178 - 8605226496 - server.py-server:252 - INFO: Shutting down
INFO:     Waiting for application shutdown.
2022-06-01 14:57:04,283 - 8605226496 - on.py-on:64 - INFO: Waiting for application shutdown.
INFO:     Application shutdown complete.
2022-06-01 14:57:04,286 - 8605226496 - on.py-on:75 - INFO: Application shutdown complete.
INFO:     Finished server process [21465]
2022-06-01 14:57:04,287 - 8605226496 - server.py-server:85 - INFO: Finished server process [21465]
