<a href="https://colab.research.google.com/github/runjhunratawal/Meme-Search-Engine-using-Jina/blob/main/Copy_of_Meme_search_with_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Jina Meme Search Workshop

![](http://examples.jina.ai/images/meme_search.gif)

In this workshop we're going to build a meme search engine using [Jina](https://github.com/jina-ai/jina/). It will search a dataset of memes and return URLs to the images as well as metadata.

For this workshop we'll just focus on searching **text**. For a more complete repo (including text/image search and frontend) you can check [this link](https://github.com/alexcg1/jina-meme-search).

You can play with a live example [here](https://examples.jina.ai).

# Prerequisites

- Check out [Jina's repo](https://github.com/jina-ai/jina/) to understand what Jina does
- Watch [Neural search using cute fuzzy animals](https://www.youtube.com/watch?v=3FyddFAFNPQ) to understand how neural search works
- Watch [Jina basics in under two minutes](https://www.youtube.com/watch?v=mnnC37ewQI8) to learn the fundamental components of Jina

# Terminology

Let's go through the words we'll be using in this workshop.

In our example, we'll search through all the <code>text</code> (which comes from a JSON file) and then display the image <code>uri</code> of each match.

<table>
    <tr>
        <td>
            <img src="https://raw.githubusercontent.com/jina-ai/workshops/main/memes/koala.png" width=300 align="left">
        </td>
        <td>
            <table>
                <tr>
                    <td>Template</td>
                    <td>Surprised Koala</td>
                </tr>
                <tr>
                    <td>Caption</td>
                    <td>This is poisonous. What</td>
                </tr>
                <tr>
                    <td><code>uri</code></td>
                    <td>https://i.imgflip.com/foo_bar.jpg</td>
                </tr>
                <tr>
                    <td><code>text</code></td>
                    <td>Surprised Koala | This is poisonous. What</td>
                </tr>
            </table>
        </td>
    </tr>
</table>
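
As a rough sketch of how those two fields relate to the raw metadata, the snippet below builds `text` and `uri` from a template name and a caption. The `meme` dict, its key names, and the `build_fields` helper are hypothetical and only mirror the table above; the real dataset fields are covered later.

```python
# Hypothetical raw metadata for one meme (values taken from the table above)
meme = {
    "template": "Surprised Koala",
    "caption": "This is poisonous. What",
    "image_url": "//i.imgflip.com/foo_bar.jpg",  # protocol-relative URL
}


def build_fields(meme, scheme="https"):
    """Return the searchable text and the image URI for one meme."""
    text = f"{meme['template']} | {meme['caption']}"
    uri = f"{scheme}:{meme['image_url']}"
    return text, uri


text, uri = build_fields(meme)
print(text)
print(uri)
```

The search only ever looks at `text`; `uri` is carried along so we can show the image for each match.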

# Set up basics

## Housekeeping

### Clean up from last time

In [1]:
# Each `!` command runs in its own subshell, so a separate `!cd` would not persist
!rm -rf workspace images sample_data

### Set basic options

In [2]:
# Enable Jupyter widgets so we can see images
!jupyter nbextension enable --py widgetsnbextension

# Disable warnings
import warnings
warnings.filterwarnings('ignore')

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


## Set maximum images to index

Since this is just a workshop and not a real-world application we'll just search through 50 memes. This will save us time in processing.

In [3]:
max_docs = 50

## Install Prerequisites

In [4]:
!pip install jina==2.6.4
!pip install ipywidgets==7.6.5 # Get nice widgets in the notebook



# Process data

We use an open-source dataset from imgflip, originally downloaded from [Kaggle](https://www.kaggle.com/abhishtagatya/imgflipscraped-memes-caption-dataset).

## Why this dataset?

We chose this dataset because

- It has rich metadata (caption, template name)
- It has recognizable memes (many datasets were just random pics with overlaid Impact font)
- It doesn't *seem* to have too many racist/sexist/\*phobic memes 🤞

## Why does this dataset kinda suck?

It only contains so many memes, and new memes come out all the time. So it won't have the latest, dankest stuff.

## Download data

In [5]:
!mkdir -p data
!wget -O data/memes.json -nc https://jina-examples-datasets.s3.amazonaws.com/memes/memes.json -q

## Load data

I've written a function to help load the data from the JSON file we downloaded earlier.

In this function we:

- Create a `DocumentArray` to hold `Documents` (using [docarray package](https://docarray.jina.ai))
- Optionally shuffle the memes
- Create a `Document` for each meme
- Set `Document.text` to the template name (e.g. `"Surprised Koala"`) + meme caption (e.g. `"This is poisonous. What"`)
- Populate some `tags` for the `Document` (e.g. absolute URL for image)

In [6]:
from docarray import Document, DocumentArray

In [7]:
import json
import random


def prep_docs(input_file, num_docs=None, shuffle=True):
    """Load memes from a JSON file and return them as a DocumentArray."""
    docs = DocumentArray()
    memes = []
    print(f"Processing {input_file}")
    with open(input_file, "r") as file:
        raw_json = json.load(file)

    # Flatten the per-template lists into one list, tagging each meme with its template name
    for template in raw_json:
        for meme in template["generated_memes"]:
            meme["template"] = template["name"]
        memes.extend(template["generated_memes"])

    if shuffle:
        # Fixed seed so every run shuffles the memes into the same order
        random.seed(1337)
        random.shuffle(memes)

    # num_docs=None slices as memes[:None], i.e. all memes
    for meme in memes[:num_docs]:
        doctext = f"{meme['template']} - {meme['caption_text']}"
        doc = Document(text=doctext)
        doc.tags = meme
        # image_url is protocol-relative (starts with "//"), so prepend a scheme
        doc.tags["uri_absolute"] = "http:" + doc.tags["image_url"]
        docs.append(doc)

    return docs

In [8]:
docs = prep_docs(
    input_file="data/memes.json", 
    num_docs=max_docs, 
    shuffle=False
)

Processing data/memes.json


In [9]:
!wget -nc https://raw.githubusercontent.com/jina-ai/workshops/main/memes/helper.py
from helper import show_images

File ‘helper.py’ already there; not retrieving.



In [10]:
show_images(docs)


### What does a Document look like?

Let's take a look at a Document to get an idea of what's going on.

In [11]:
print(docs[0])

<docarray.document.Document ('id', 'mime_type', 'tags', 'text') at 140045744303696>


We can see several fields. Let's look at each of them:

In [12]:
print(docs[0].id)                  # Unique ID for index
print(docs[0].text)                # Content we're searching through
print(docs[0].mime_type)           # MIME type of that content

for tag in docs[0].tags.items():   # Metadata
    print(tag)

64b6fc2a81d011ec88430242ac1c0002
Drake Hotline Bling -  my mom when i play with my phone for 30 minutes my mom when she looks at instagram phone for two hours 
text/plain
('image_url', '//i.imgflip.com/3xd5o0.jpg')
('id', 1.0)
('hash_id', '73c18611930ad390a4a6a7d1cd9214daba85b9f3408997e59c1e0ab63362e7db')
('alt_text', 'Drake Hotline Bling | my mom when i play with my phone for 30 minutes my mom when she looks at instagram phone for two hours | image tagged in memes,drake hotline bling | made w/ Imgflip meme maker')
('template', 'Drake Hotline Bling')
('uri_absolute', 'http://i.imgflip.com/3xd5o0.jpg')
('caption_text', ' my mom when i play with my phone for 30 minutes my mom when she looks at instagram phone for two hours ')


### Why shuffle?

The memes are arranged alphabetically by template. So if we don't shuffle, the first `max_docs` memes all come from the same few templates and look very similar, which makes it harder to search for something interesting.
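
To see the effect on a small scale, here is a sketch with made-up template names (not the real dataset). Taking the first few items of a grouped list yields a single template, while shuffling with a fixed seed reorders the list the same way on every run. The `first_n` helper is hypothetical and uses its own `random.Random` instance rather than the global seed.

```python
import random

# Toy stand-in for the dataset: memes grouped by template
memes = [
    f"{template} #{i}"
    for template in ("Drake", "Koala", "Pikachu")
    for i in range(4)
]


def first_n(items, n, shuffle, seed=1337):
    """Return the first n items, optionally after a seeded shuffle."""
    items = list(items)  # copy so the caller's list is left untouched
    if shuffle:
        random.Random(seed).shuffle(items)
    return items[:n]


unshuffled = first_n(memes, 4, shuffle=False)
shuffled = first_n(memes, 4, shuffle=True)

print(unshuffled)  # all four come from the first template
print(shuffled)    # same selection on every run because the seed is fixed
```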

If we shuffle we get a more interesting mix:

In [28]:
docs = prep_docs(
    input_file="data/memes.json", 
    num_docs=max_docs, 
    shuffle=True
)

Processing data/memes.json


In [14]:
show_images(docs)


# Index and search data with Flow

## Set up Flow

Before we index or search, we need to create our Flow. Only then can we open it as a context manager and do stuff with it.

In [15]:
from jina import Flow

In [16]:
flow = (
    Flow()
    .add(
        name="meme_text_encoder",
        uses="jinahub://SpacyTextEncoder/v0.3",                 # Using Executors from Jina Hub means we don't need to write our own!
        uses_with={"model_name": "en_core_web_md"},
        install_requirements=True
    )
    .add(
        name="meme_text_indexer",
        uses="jinahub://SimpleIndexer/v0.11",
        install_requirements=True
    )
)

## Index data with Flow

We use our Flow to build an index of where all the meme text embeddings lie in an n-dimensional vector space.

**Note:** If the notebook crashes here, restart (don't factory-reset) it and run all cells again. This seems to be a bug in Google Colab.

In [17]:
with flow:
    indexed_docs = flow.index(
        inputs=docs,
        request_size=64,
        show_progress=True,
        return_results=True
    )
print("DONE!")

[32m⠏[0m 2/3 waiting [33mmeme_text_encoder[0m to be ready...



[32m⠙[0m 2/3 waiting [33mmeme_text_encoder[0m to be ready...



           Flow@345[I]:🎉 Flow is ready to use!
	🔗 Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:54407
	🔒 Private network:	172.28.0.2:54407
	🌐 Public address:	34.90.178.248:54407

            The syntax of traversal_path is changed to comma-separated string, '
            that means your need to change ('r',) into `r`. '
            The old list of string syntax will be deprecated soon
            (raised from /usr/local/lib/python3.7/dist-packages/docarray/array/mixins/traverse.py:28)

DONE 0:00:00 100% ETA: 0 seconds 80 steps done in 0 seconds
DONE!


## What did we just do?

Indexing our Documents:

- Generated a *vector embedding* for each Document using a neural network (here, the spaCy `en_core_web_md` model). An embedding represents the meaning of a piece of text, rather than treating it as just a string of characters
- Stored the embeddings and metadata of each Document in an on-disk index (`./workspace`) for fast lookup later
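
To make "representation of meaning" a little more concrete: embeddings are compared by how closely their vectors point in the same direction. The sketch below uses hand-made three-dimensional vectors (real embeddings have far more dimensions, and these numbers are invented for illustration) together with cosine similarity, a common choice of measure. It is not necessarily the exact metric the indexer uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy embeddings; real ones come from the encoder
child = [0.9, 0.1, 0.0]
kid = [0.8, 0.2, 0.1]
koala = [0.1, 0.9, 0.3]

print(cosine_similarity(child, kid))    # close to 1: similar direction
print(cosine_similarity(child, koala))  # much lower: different direction
```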

Let's look at a Document after it's been through the indexing Flow. We can see its vector embedding in its `.embedding` attribute. Accessed through the raw response like this, it prints as a serialized byte buffer rather than a readable NumPy array:

In [23]:
print(indexed_docs[0].data.docs[0].embedding) # We dig deep because there's a lot of additional metadata we don't care about right now

dense {
  buffer: "5\301!=Q0\347=U\337\275\2750H\363\274\306\366\033>\014\353\301<\233\276\316\274p+\275=&\221\265<\375\376\t@\250rV\276\0069\013\275\225\003\207=\234l\013\276\216Xk\274\355\314\354\274\323\201\335\275[\320t?\212B1\275\321\265\355\275P\263\331\273\2069)\275\350\350?;\035\"\224\275\216=\265=\216\325\346\274\205\t7=(\201\352:Z\373\017=\005\253C=*\270\252\275\301\247\004>`\007\334\275\332_z=vU4>\017\356_\275&\'\222=\325\325(=\272\'r\275m\315\002=\232\320\214\274\220\366\341=9\266\241\275\330\251b\275\273*\200=\313\200\253=f\021I\275\310\377H\274Eb\222\274F\2377\275\223W\211\275\200\177\032=\007A\313\2752\277\260<\305\220t=\307\320\r\274\371(\350<4\t\036\276\031\300\327=\035\251R\275/,L=\251n2\275\335m\364\275\273R_>\331\343:=\264\002\347\275\013\226\270\2752Dn<\026h\227\274\351\005\000>\370h\361<\315\247\204\274\014\030+>#\341\201<\303\344~=:c\306=\315\211\201<\241O\250\275\350k\316\275:\306A>\320=\263={\341<>\2548\024\276l\025m\274j\205\000\275N)8\276\374\262\232\276OL\23

## Search data with Flow

### Create query Document

A Document is the fundamental thing that Jina works with. So anything we pass in or get out needs to be a Document.

In [24]:
search_term = "child"
query_doc = Document(text=search_term)

### Send query Document to Flow

This Flow will:

- Generate a vector embedding for our query Document based on its text
- Compare that embedding with the 50 embeddings (one for each meme) we stored in our index earlier
- Return the closest matches based on Document (i.e. text) similarity
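
Conceptually, the comparison step is a nearest-neighbour lookup. The following is a minimal brute-force sketch of that idea, not SimpleIndexer's actual implementation: the `index` dict, its vectors, and the `top_k` helper are all hypothetical, and the score is a plain dot product over vectors assumed to be normalized to unit length.

```python
import heapq


def top_k(query, index, k=2):
    """Return the k (doc_id, score) pairs whose vectors best match the query."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scored = ((doc_id, dot(query, vector)) for doc_id, vector in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical unit-length embeddings for three indexed memes
index = {
    "meme_a": [1.0, 0.0],
    "meme_b": [0.6, 0.8],
    "meme_c": [0.0, 1.0],
}

query = [0.8, 0.6]  # stands in for the embedding of the search term
for doc_id, score in top_k(query, index):
    print(doc_id, round(score, 2))
```

Scanning every stored vector like this is fine for 50 memes; larger collections usually need an approximate index instead.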

In [25]:
with flow:
    response = flow.search(inputs=query_doc, return_results=True)
print("DONE!")

[32m⠴[0m 2/3 waiting [33mmeme_text_encoder[0m to be ready...



           Flow@345[I]:🎉 Flow is ready to use!
	🔗 Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:50147
	🔒 Private network:	172.28.0.2:50147
	🌐 Public address:	34.90.178.248:50147


            The syntax of traversal_path is changed to comma-separated string, '
            that means your need to change ('r',) into `r`. '
            The old list of string syntax will be deprecated soon
            (raised from /usr/local/lib/python3.7/dist-packages/docarray/array/mixins/traverse.py:28)


DONE!


### Extract matches

A Jina response contains a lot of extra data. We just want the DocumentArray with matching Documents.

In [26]:
matches = response[0].docs[0].matches
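
Besides rendering the images, you may want the raw URLs and text of each match. The helper below is a sketch assuming each match still carries the `text` and the `uri_absolute` tag that `prep_docs` set before indexing. `summarize_matches` is a hypothetical helper, not part of Jina, and it is demonstrated with stand-in objects so it runs without a Flow.

```python
from types import SimpleNamespace


def summarize_matches(matches):
    """Return (text, image URL) pairs for anything with .text and .tags."""
    return [(m.text, m.tags.get("uri_absolute")) for m in matches]


# Stand-in objects mimicking the two attributes we rely on
fake_matches = [
    SimpleNamespace(
        text="Surprised Koala - This is poisonous. What",
        tags={"uri_absolute": "http://i.imgflip.com/foo_bar.jpg"},
    ),
]

for text, url in summarize_matches(fake_matches):
    print(url, "|", text)
```

With the real results you would call `summarize_matches(matches)` instead.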

In [27]:
show_images(matches)


## Using this IRL

How can we take this out of a notebook and build a real-world application with it?

### Not use a notebook

Notebooks are great for demos or scientific research. Not so great for real-world applications.

### Use a better indexer

Jina's **[PQLiteIndexer](https://hub.jina.ai/executor/pn1qofsj)** offers powerful features like pre-filtering based on metadata. In our notebook we just use [SimpleIndexer](https://hub.jina.ai/executor/zb38xlt4) which is nice for demonstrations but lacks PQLite's power.

### Use a RESTful API

#### In notebook: gRPC

```python
with flow:
    flow.search(Document(text="foo"))
```

#### In Python: RESTful or gRPC

```python
with flow:
    flow.protocol = "http"
    flow.port_expose = 12345
    flow.block()
```
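
Once the Flow is serving over HTTP, any client can query it. Below is a sketch using only the standard library. It assumes the gateway listens on `localhost:12345` and accepts a POST to `/search` with a JSON body of the form `{"data": [{"text": ...}]}`. The endpoint and payload shape are assumptions here, so check the docs for your Jina version; `build_search_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_search_request(term, host="localhost", port=12345):
    """Build (but do not send) a POST request for the Flow's search endpoint."""
    payload = json.dumps({"data": [{"text": term}]}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/search",  # assumed endpoint path
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("child")
print(request.full_url)

# With the Flow running, send the request like this:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```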

### Use Dockerized Executors...

This means having to install fewer requirements locally.


#### In notebook: `jinahub://foo`

```python
flow = (
    Flow()
    .add(...)
    .add(
        uses="jinahub://SimpleIndexer",
    )
)
```

#### In Python: `jinahub+docker://foo`

```python
flow = (
    Flow()
    .add(...)
    .add(
        uses="jinahub+docker://SimpleIndexer",
    )
)
```

### ...or wrap everything in Docker

See an example [docker-compose.yml](https://github.com/alexcg1/jina-meme-search/blob/main/docker-compose.yml)