# **GIF search with MiniLM-L6 and CLIP embeddings**

<div class="align-center">
  <a href="https://getindexify.ai/"><img src="https://getindexify.ai/Indexify_Logo_Wordmark.svg" width="145"></a>
  <a href="https://discord.com/invite/kF8UZACA7r"><img src="https://raw.githubusercontent.com/rishiraj/random/main/Discord%20button.png" width="145"></a><br>
  Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/tensorlakeai/indexify">GitHub</a></i> ⭐
</div>

In this notebook, we'll build semantic GIF search with Indexify and the Tumblr GIF (TGIF) dataset (https://github.com/raingo/TGIF-Release). We'll use the Indexify CLIP and MiniLM-L6 extractors to create embeddings: CLIP embeds the GIFs themselves, and MiniLM-L6 embeds their text descriptions. At query time, the search text is embedded as well, and each index returns the GIFs whose embeddings are most similar to it.
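To make "most similar" concrete, here is a minimal sketch of ranking by cosine similarity, a common measure for comparing embeddings. It is an illustration only: the helpers `cosine_similarity` and `top_k`, the file names, and the 3-dimensional vectors are made up, and Indexify's indexes may use a different metric or an approximate search internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=3):
    """Return the k (name, score) pairs most similar to query_vec."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings" (hypothetical); real CLIP and MiniLM-L6
# vectors have hundreds of dimensions.
toy_index = {
    "cat.gif": [0.9, 0.1, 0.0],
    "dog.gif": [0.1, 0.9, 0.0],
    "car.gif": [0.0, 0.1, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

for name, score in top_k(toy_query, toy_index, k=2):
    print(f"{name}: {score:.3f}")
```

The query vector points mostly along the first axis, so `cat.gif` ranks first. The same idea carries over to real embeddings, where the axes have no individual meaning.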

## **Setup**

In [None]:
%pip install requests indexify indexify-extractor-sdk 

# Download Indexify Server
!curl https://getindexify.ai | sh

# Download Extractors
!indexify-extractor download hub://embedding/clip_embedding
!indexify-extractor download hub://embedding/minilm-l6

After installing the libraries and downloading the server and the extractors, restart the runtime. Then start the Indexify server and connect the extractors to it.

Open two terminals and run the following commands:

```bash
# Terminal 1
./indexify server -d

# Terminal 2
indexify-extractor join-server
```

## **Create Extraction Graph**

In [None]:
# ExtractionGraph is needed below to build the graph from YAML
from indexify import IndexifyClient, Document, ExtractionGraph
client = IndexifyClient()
client.extractors()

In [None]:
extraction_graph_spec = """
name: "gif-search"
extraction_policies:
  - extractor: "tensorlake/clip-extractor"
    name: "clip"
    labels_eq: "content:image"

  - extractor: "tensorlake/minilm-l6"
    name: "minilm"
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)

In [None]:
client.indexes()

## **Ingesting Data**

In [None]:
import requests

# Each row of the TGIF index is "<gif url>\t<description>"
res = requests.get("https://raw.githubusercontent.com/raingo/TGIF-Release/master/data/tgif-v1.0.tsv")
res.raise_for_status()
items = res.text.splitlines()
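Rows in a downloaded TSV are not guaranteed to be well formed, and unpacking a row that does not contain exactly one tab raises `ValueError`. The sketch below shows one way to filter rows before ingesting them. The helper name `parse_tgif_lines` and the example URLs are hypothetical and are not part of Indexify or the dataset.

```python
def parse_tgif_lines(lines):
    """Yield (url, description) pairs, skipping rows that are not 'url<TAB>text'."""
    for line in lines:
        parts = line.strip().split("\t")
        # Keep only rows with exactly two non-empty fields
        if len(parts) != 2 or not parts[0] or not parts[1]:
            continue
        yield parts[0], parts[1]

# Placeholder rows for illustration, including a blank line and a row without a tab
sample = [
    "https://example.com/a.gif\ta cat looks around",
    "",
    "no tab here",
    "https://example.com/b.gif\ta man is dancing\r",
]
pairs = list(parse_tgif_lines(sample))
print(pairs)
```

Only the first and last sample rows survive. The trailing carriage return on the last row is stripped before splitting.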

In [None]:
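for item in items[0:1000]:
    # Skip blank or malformed rows instead of failing on unpacking
    parts = item.split("\t")
    if len(parts) != 2:
        continue
    url, text = parts

    # Validate that the URL still serves a GIF
    r = requests.get(url)
    if r.headers.get("Content-Type") != "image/gif":
        print("image removed", url)
        continue

    # Ingest the GIF for CLIP and its description for MiniLM-L6, both labeled with the URL
    client.ingest_remote_file("gif-search", url, "image/gif", {"url": url, "content": "image"})
    content_id = client.add_documents("gif-search", Document(text=text, labels={"url": url}))
    client.wait_for_extraction(content_id)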

## **Search Data**

In [None]:
query = "cats being curious"
max_results = 10

In [None]:
# Search both indexes with the query defined above.
# The index names must match those listed by client.indexes().
minilm_results = client.search_index("minilm-description.embedding", query, max_results)
clip_results = client.search_index("clip-gif.embedding", query, max_results)

### **Merge results**

In [None]:
results = []  # merged URLs, interleaved by rank, duplicates skipped
for i in range(max(len(minilm_results), len(clip_results))):
    for hits in (minilm_results, clip_results):
        if i >= len(hits):
            continue  # this index returned fewer results than the other
        url = hits[i].get("labels", {}).get("url")
        if url and url not in results:
            results.append(url)

In [None]:
list(results)
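The merge step can also be written as a small reusable function. The sketch below uses a hypothetical helper `interleave_unique` and made-up file names. It alternates between ranked lists and keeps only the first occurrence of each item, so the merged output preserves rank order and tolerates lists of different lengths.

```python
def interleave_unique(*ranked_lists):
    """Merge ranked lists round-robin, keeping the first occurrence of each item."""
    merged, seen = [], set()
    longest = max((len(ranked) for ranked in ranked_lists), default=0)
    for rank in range(longest):
        for ranked in ranked_lists:
            # Shorter lists simply run out and are skipped
            if rank < len(ranked) and ranked[rank] not in seen:
                seen.add(ranked[rank])
                merged.append(ranked[rank])
    return merged

# Placeholder URLs standing in for the "url" labels of two result lists
text_hits = ["a.gif", "b.gif", "c.gif"]
image_hits = ["b.gif", "d.gif"]
merged = interleave_unique(text_hits, image_hits)
print(merged)
```

Simple interleaving treats both indexes as equally trustworthy. If one model turns out to be more reliable for this dataset, a weighted scheme would be a natural next step.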