# __Search image by image__

## Preface

- Tutorial Difficulty : ★☆☆☆☆
- 7 min read
- Languages : [SQL](https://en.wikipedia.org/wiki/SQL) (100%)
- File location : tutorial_en/thanosql_search/search_image_by_image.ipynb
- References : [MNIST DataSet](http://yann.lecun.com/exdb/mnist/), [A Simple Framework for Contrastive Learning of Visual Representations](https://arxiv.org/abs/2002.05709)

## Introduction to the tutorial

<div class="admonition note">
    <h4 class="admonition-title">Understanding image digitization techniques</h4>
    <p>Images are high-dimensional data (height x width x channels [RGB], each holding a color intensity), but an image whose pixel values are generated at random carries no meaning. In other words, an image is only recognizable as an image when each pixel follows a pattern related to its surrounding pixels. It follows that an image can be represented by a feature vector with far fewer dimensions than the raw pixels it is made of. Recent studies use artificial intelligence to convert each image into numbers in a low-dimensional space, so that similar images end up close together; this is referred to as image digitization, vectorization, or embedding.</p>
</div>

There are several ways to define the similarity of images. The colors may be similar, the objects in the images may be similar, or the context may be similar, such as the value of a handwritten digit. Although it is difficult to give an exact definition of a similar image, artificial intelligence learns and quantifies these general characteristics.

ThanoSQL uses a [Self-Supervised Learning Model](https://en.wikipedia.org/wiki/Self-supervised_learning) to store images in the database (DB) and retrieve similar images from it. When you upload your images to ThanoSQL's DB, an artificial intelligence algorithm places similar images closer together and dissimilar images further apart. A self-supervised model can learn a general representation of images from an unlabeled dataset, then be fine-tuned with a small number of labeled images and used for classification or regression tasks.

In addition, ThanoSQL uses artificial intelligence algorithms to quantify datasets. The vectorized data is stored as a DB column in the image table and is used to calculate the similarity (distance) between images when searching for similar images.
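
As a rough illustration of how a similarity between two stored vectors can be computed, the sketch below uses cosine similarity, which is the measure used in the SimCLR paper. The tutorial does not state which measure ThanoSQL applies internally, so treat the choice of cosine similarity and the toy vectors as assumptions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional embeddings (made-up values, not real model output).
query = [1.0, 0.0, 1.0]
similar = [2.0, 0.0, 2.0]    # points in the same direction as the query
different = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, similar))    # close to 1: very similar
print(cosine_similarity(query, different))  # 0: unrelated
```

A value near 1 means the two vectors point in almost the same direction, while a value near 0 means they are unrelated.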

__The following are use case examples of ThanoSQL's similar image search algorithm.__

- Inputting your favorite image and getting recommendations for similar artworks stored in the DB.
- Finding similar images within an album containing thousands of photos.
- Creating your own search engine or artificial intelligence model by storing the numerical value of the image in ThanoSQL's DB, and using the ThanoSQL Auto-ML regression/classification prediction model.
 
<div class="admonition note">
    <h4 class="admonition-title">In this tutorial</h4>
    <p>👉 This tutorial will use the <mark style="background-color:#FFD79C">MNIST handwriting dataset</mark>. Each image has a fixed size (28x28 = 784 pixels), and each pixel holds a decimal value between 0 and 1. The images show digits from 0 to 9 written by different people, and each one is correctly labeled. The version of the dataset used in this tutorial consists of 1,000 training images and 200 test images.</p>
</div>
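
To make the "28x28 = 784 pixels with values between 0 and 1" description concrete, here is a minimal sketch that scales a grayscale image to that range and flattens it into a single vector. It assumes raw pixels are stored as integers from 0 to 255, as in the original MNIST files; the image itself is toy data.

```python
def normalize_and_flatten(image, max_value=255):
    """Scale raw pixel intensities to [0, 1] and flatten the rows into one list."""
    return [pixel / max_value for row in image for pixel in row]


# A blank 28x28 grayscale image with a single bright pixel (toy data).
height, width = 28, 28
image = [[0 for _ in range(width)] for _ in range(height)]
image[14][14] = 255

vector = normalize_and_flatten(image)
print(len(vector))  # 784
print(min(vector), max(vector))
```

This 784-dimensional vector is the raw representation; the model built later in the tutorial maps each image to a much more compact feature vector.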
    
Use ThanoSQL to create a model that takes a handwritten image as input and retrieves similar images from the DB.

[![IMAGE](https://docs.thanosql.ai/img/thanosql_search/search_image_by_image/simclr_img7.png "MNIST data") ](https://docs.thanosql.ai/img/thanosql_search/search_image_by_image/simclr_img7.png)

## __0. Prepare Dataset__

To use the query syntax of ThanoSQL, you must create an API token and run the query below, as mentioned in the [ThanoSQL Workspace](https://docs.thanosql.ai/en/getting_started/how_to_use_ThanoSQL/#5-thanosql-workspace).

In [None]:
%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>

### __Prepare Dataset__

In [None]:
%%thanosql
GET THANOSQL DATASET mnist_data
OPTIONS (overwrite=True)

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>"<strong>GET THANOSQL DATASET</strong>" Use this query to save the desired dataset to your workspace environment. </li>
        <li>"<strong>OPTIONS</strong>" Use this statement to specify the options to use for the <strong>GET THANOSQL DATASET</strong> query.
        <ul>
            <li>"overwrite" : Overwrite if a dataset with the same name exists. If set as True, the existing dataset is replaced with the new dataset (True|False, DEFAULT : False) </li>
        </ul>
        </li>
    </ul>
</div>

In [None]:
%%thanosql
COPY mnist_train 
OPTIONS (overwrite=True)
FROM "thanosql-dataset/mnist_data/mnist_train.csv"

In [None]:
%%thanosql
COPY mnist_test 
OPTIONS (overwrite=True)
FROM "thanosql-dataset/mnist_data/mnist_test.csv"

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>Use the "<strong>COPY</strong>" query statement to specify the dataset to be copied into the DB. </li>
        <li>"<strong>OPTIONS</strong>" specifies the options to use for the <strong>COPY</strong> query statement.
        <ul>
            <li>"overwrite" : Overwrite if a dataset with the same name exists in the DB. If True, the existing dataset is overwritten with the new dataset (True|False, DEFAULT: False) </li>
        </ul>
        </li>
    </ul>
</div>

## __1. Checking the Dataset__

To create the image search model, use the <mark style="background-color:#FFEC92">mnist_train</mark> table stored in the ThanoSQL [DB](https://en.wikipedia.org/wiki/Database). The <mark style="background-color:#FFEC92">mnist_train</mark> table contains the file name, label, and path of each <mark style="background-color:#FFD79C">MNIST</mark> image file. Run the query below and check the contents of the table.

In [None]:
%%thanosql
SELECT * 
FROM mnist_train 
LIMIT 5

<div class="admonition note">
    <h4 class="admonition-title">Understanding the Data Table</h4>
    <p>The <mark style="background-color:#FFEC92">mnist_train</mark> table contains the following information: The "6782.jpg" image file is a handwritten image with the number 5.</p>
    <ul>
        <li><mark style="background-color:#D7D0FF">image_path</mark>: image path</li>
        <li><mark style="background-color:#D7D0FF">filename</mark>: file name</li>
        <li><mark style="background-color:#D7D0FF">label</mark>: image label</li>
    </ul>
</div>

## __2. Creating an Image Quantification Model__

Create an image quantification model using the <mark style="background-color:#FFEC92">mnist_train</mark> table referenced in the previous step. Execute the query below to create a model named <mark style="background-color:#E9D7FD">my_image_search_model</mark>.  
(Estimated time required for query execution: 1 min)

In [None]:
%%thanosql
BUILD MODEL my_image_search_model
USING SimCLR
OPTIONS (
    image_col="image_path",
    max_epochs=1,
    overwrite=True
    )
AS 
SELECT * 
FROM mnist_train

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>Create and train a model called <mark style="background-color:#E9D7FD">my_image_search_model</mark> using the "<strong>BUILD MODEL</strong>" query statement.</li>
        <li>The "<strong>USING</strong>" query statement specifies that the <mark style="background-color:#E9D7FD">SimCLR</mark> model should be used as the base model.</li>
        <li>"<strong>OPTIONS</strong>" specifies the options for the query used to create a model.
        <ul>
            <li>"image_col" : Column containing the path of the image in the data table (Default: "<mark style="background-color:#D7D0FF">image_path</mark>")</li>
            <li>"max_epochs" : Number of times the training dataset is passed through during training of the image quantification model</li>
            <li>"overwrite" : Overwrite if a model with the same name exists. If true, the existing model is replaced with the new model (True|False, DEFAULT: False) </li>
        </ul>
        </li>
    </ul>
</div>
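
For intuition about what SimCLR training optimizes, the sketch below implements the NT-Xent contrastive loss described in the referenced SimCLR paper on a toy batch. It is a simplified illustration in plain Python, not ThanoSQL's implementation; the pairing convention (two augmented views of the same image sit at neighboring indices) and the temperature value are assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nt_xent_loss(embeddings, temperature=0.5):
    """Mean NT-Xent loss over a batch.

    embeddings[2k] and embeddings[2k + 1] are assumed to be two augmented
    views of the same image k (a positive pair).
    """
    n = len(embeddings)
    if n < 2 or n % 2 != 0:
        raise ValueError("expected an even number of embeddings (pairs of views)")
    total = 0.0
    for i in range(n):
        j = i + 1 if i % 2 == 0 else i - 1  # index of the positive partner
        positive = math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
        # The denominator sums over every other embedding in the batch.
        denominator = sum(
            math.exp(cosine(embeddings[i], embeddings[k]) / temperature)
            for k in range(n)
            if k != i
        )
        total += -math.log(positive / denominator)
    return total / n


# Positive pairs are close together and far from the other pair: low loss.
well_separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Positive pairs point in different directions: high loss.
mixed_up = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

print(nt_xent_loss(well_separated))
print(nt_xent_loss(mixed_up))
```

The loss is lower when two views of the same image have similar vectors and views of different images do not, which is how the model learns to place similar images close together without labels.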

Run the following "__CONVERT USING__" query statement to digitize the `mnist_test` images. The quantized results are stored in a column named <mark style="background-color:#D7D0FF">my_image_search_model_simclr</mark> in the table, `mnist_test`.

In [None]:
%%thanosql
CONVERT USING my_image_search_model
OPTIONS (
    table_name= "mnist_test",
    image_col="image_path"
    )
AS 
SELECT * 
FROM mnist_test

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>The "<strong>CONVERT USING</strong>" query statement uses <code>my_image_search_model</code> as an algorithm for image quantification.   </li>
        <li>The "<strong>OPTIONS</strong>" query statement defines the options required for image quantification.
        <ul>
            <li>"table_name" : Defines the name of the table in the ThanoSQL DB where the results are stored. </li>
            <li>"image_col" : Defines the column containing the path of the image in the data table (default: "image_path")</li>
        </ul>
        </li>
    </ul>
</div>

## __3. Search for Similar Images Using Image Quantization Models__

This step uses the <mark style="background-color:#E9D7FD">my_image_search_model</mark> image quantization model and the test table to search for images similar to the "923.jpg" image file (handwritten 8).

<a href="https://docs.thanosql.ai/img/thanosql_search/search_image_by_image/simclr_img8.png">
    <img alt="IMAGE" src="https://docs.thanosql.ai/img/thanosql_search/search_image_by_image/simclr_img8.png" style="width:100px">
</a>

<p style="text-align:center">923.jpg Image File </p>

In [None]:
%%thanosql
SEARCH IMAGE images='thanosql-dataset/mnist_data/test/923.jpg' 
USING my_image_search_model 
AS
SELECT * 
FROM mnist_test

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>"<strong>SEARCH IMAGE [images|audio|videos]</strong>" query statement defines the image|audio|video file you want to use for your search.  <br></li>
        <li>"<strong>USING</strong>" defines the model used for image quantification.<br></li>
        <li>The "<strong>AS</strong>" query statement defines the embedding table to use for searches. The <code>mnist_test</code> table is used here. </li>
    </ul>
</div>
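
Conceptually, a similarity search scores the stored vector of every row against the vector of the query image and then ranks the rows by that score. The sketch below mimics this with plain Python dictionaries. The column names follow the ones used in this tutorial, but the file paths, the vectors, the `search_similar` helper, and the use of cosine similarity are hypothetical and only serve as illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search_similar(rows, query_embedding, embedding_col, similarity_col, top_k=4):
    """Add a similarity score to every row and return the top_k best matches."""
    scored = [
        {**row, similarity_col: cosine_similarity(query_embedding, row[embedding_col])}
        for row in rows
    ]
    # Highest similarity first, then keep only the first top_k rows.
    return sorted(scored, key=lambda r: r[similarity_col], reverse=True)[:top_k]


EMB = "my_image_search_model_simclr"
SIM = "my_image_search_model_simclr_similarity1"

# Hypothetical table rows with made-up 2-dimensional embeddings.
rows = [
    {"image_path": "test/1.jpg", EMB: [0.9, 0.1]},
    {"image_path": "test/2.jpg", EMB: [0.0, 1.0]},
    {"image_path": "test/3.jpg", EMB: [1.0, 0.0]},
    {"image_path": "test/4.jpg", EMB: [0.5, 0.5]},
    {"image_path": "test/5.jpg", EMB: [-1.0, 0.0]},
]

results = search_similar(rows, [1.0, 0.0], EMB, SIM, top_k=4)
for row in results:
    print(row["image_path"], round(row[SIM], 3))
```

Sorting in descending order and keeping the first few rows corresponds to an `ORDER BY ... DESC` followed by `LIMIT` on the similarity column.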

Run the following query to display the "__SEARCH__" results with ThanoSQL's "__PRINT__" query statement, showing the four most similar images. The model has only been trained minimally, but you can see that the output images resemble an 8.

In [None]:
%%thanosql
PRINT IMAGE 
AS (
    SELECT image_path, my_image_search_model_simclr_similarity1 
    FROM (
        SEARCH IMAGE images='thanosql-dataset/mnist_data/test/923.jpg' 
        USING my_image_search_model 
        AS 
        SELECT * 
        FROM mnist_test
        )
    ORDER BY my_image_search_model_simclr_similarity1 DESC 
    LIMIT 4
    )

<div class="admonition danger">
    <h4 class="admonition-title">Note</h4>
    <p>With its default training options, the image similarity search algorithm learns to treat an image as the same regardless of left-right flips, color differences, and similar changes. This is because a picture of a dog should be recognized as a dog even if it is flipped or its colors change. If color is important, as with clothing images, or if orientation is important, as with digits, the options should be changed during training.</p>
</div>
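
The toy example below shows why flip invariance can be a problem for digits: mirroring a small bitmap of a "3" produces a shape that no longer reads as a 3. The bitmap and the `flip_horizontal` helper are made up for illustration. The ThanoSQL options that control such augmentations are not shown in this tutorial, so they are not guessed at here.

```python
def flip_horizontal(image):
    """Mirror a 2D image (a list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def render(image):
    """Draw a bitmap as text, using '#' for ink and '.' for background."""
    return "\n".join("".join("#" if pixel else "." for pixel in row) for row in image)


# A tiny 5x3 bitmap of the digit "3" (toy data, 1 = ink).
three = [
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
]

flipped = flip_horizontal(three)
print(render(three))
print()
print(render(flipped))  # now resembles the letter "E" rather than a 3
```

A model trained to treat both bitmaps as the same image would have a harder time telling mirrored shapes apart, which matters for digits but not for photos of dogs.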

## __4. In Conclusion__

In this tutorial, we used the `MNIST` handwriting dataset to perform image quantification and searched for similar images based on the quantification results. We aimed to explain the workflow rather than to maximize the accuracy of the similarity search. The accuracy of an image quantification model can be improved by fine-tuning and by adding a small number of labels to each image dataset during training. You can create your own image quantification model to add search capabilities to various types of unstructured datasets and deploy your own model using Auto-ML techniques.
<br>
The next step is to explore the various "__OPTIONS__" and training methods of image quantification models. If you want to learn more about how to build your own accurate image quantification model, continue with the following tutorials.

- [How to Upload to ThanoSQL DB](https://docs.thanosql.ai/en/how-to_guides/ThanoSQL_connecting/data_upload/)
- [Creating an Intermediate Similar Image Search Model]

<div class="admonition tip">
    <h4 class="admonition-title">Inquiries about deploying a model for your own service</h4>
    <p>If you have any difficulties creating your own model using ThanoSQL or applying it to your services, please feel free to contact us below😊</p>
    <p>For inquiries about building an image similarity search model: contact@smartmind.team</p>
</div>