# Finding Similar Images using Imagededup!

I found a nice framework called [`imagededup`](https://github.com/idealo/imagededup), a Python package that simplifies finding exact and near duplicates in an image collection. That is more or less what we need to do in this competition, so I decided to give it a try!

* If you want, you can modify this a little and use it for submission.

**Feel free to leave any suggestions in the comments below and *upvote* the notebook if you found it helpful!**

<br>
<p style="color:green;">Edit: This Notebook now uses a CNN instead of a Perceptual Hasher to do similarity matching.</p>

First, we install the framework using the cell below:

In [None]:
! pip install -q --no-deps imagededup

Then we import the package along with some other libraries (I make use of them in future commits).

In [None]:
import imagededup
from imagededup.methods import PHash, CNN, DHash, WHash, AHash
from imagededup.utils import plot_duplicates

import warnings
import pandas as pd
import numpy as np
import cv2
import os
import shutil
from tqdm import tqdm
warnings.simplefilter("ignore")

In [None]:
test = pd.read_csv("../input/shopee-product-matching/test.csv")
sub = pd.read_csv("../input/shopee-product-matching/sample_submission.csv")
train = pd.read_csv("../input/shopee-product-matching/train.csv")

train.head()

Then we run the package. It's really this simple!

In [None]:
%%time
cnn = CNN()
encodings = cnn.encode_images(image_dir="../input/shopee-product-matching/train_images")
duplicates = cnn.find_duplicates(encoding_map=encodings)

Here's a brief summary of what we are doing in the cell above:

* We initialize a `CNN` object.
* Generate the encodings by specifying the image directory.
* Find the duplicates using the above encodings.

I may be wrong, but as I understand it, the code above generates an encoded representation of each image by passing it through a pre-trained CNN (possibly MobileNet) and then uses a similarity function (such as cosine similarity) to find duplicates among the images.
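To make the similarity step concrete, here is a minimal sketch of thresholded cosine similarity over a map of encodings. It is not imagededup's implementation: the helper names `cosine_similarity` and `find_similar`, the toy three-dimensional vectors, and the threshold of 0.9 are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_similar(
    encodings: Dict[str, Sequence[float]],
    threshold: float = 0.9,  # illustrative value, an assumption
) -> Dict[str, List[str]]:
    """Map every name to the names whose encoding is at least `threshold` similar."""
    result: Dict[str, List[str]] = {name: [] for name in encodings}
    names = list(encodings)
    for i, a in enumerate(names):
        # Compare each unordered pair once and record the match in both directions.
        for b in names[i + 1:]:
            if cosine_similarity(encodings[a], encodings[b]) >= threshold:
                result[a].append(b)
                result[b].append(a)
    return result


# Toy encodings standing in for real CNN feature vectors.
toy_encodings = {
    "x.jpg": [1.0, 0.0, 0.0],
    "y.jpg": [0.9, 0.1, 0.0],
    "z.jpg": [0.0, 1.0, 0.0],
}
toy_result = find_similar(toy_encodings)
```

In this toy example `x.jpg` and `y.jpg` point in almost the same direction and are reported as duplicates of each other, while `z.jpg` has no matches. The pairwise loop is quadratic in the number of images, which hints at why the real computation on the full training set is slow.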

This computation takes more than 20 minutes because there are over 34,000 training images.

We can also use one of the hashing methods instead: **Perceptual Hashing (PHash), Difference Hashing (DHash), Wavelet Hashing (WHash), or Average Hashing (AHash)**.

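The workflow is the same for each of these methods (the snippet below is shown as markdown and is not executed in this notebook):

```python
from imagededup.methods import PHash

# Pick one hashing method; DHash(), WHash(), and AHash() are used the same way.
hashing_method = PHash()

# Compute one hash per image in the directory.
encodings = hashing_method.encode_images(image_dir="../input/shopee-product-matching/train_images")

# Compare the hashes and collect the duplicates of each image.
duplicates = hashing_method.find_duplicates(encoding_map=encodings)
```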
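For intuition about what a hashing method does, here is a rough sketch of average hashing followed by a Hamming-distance comparison. It is a simplification and not imagededup's code: it assumes the image has already been converted to a tiny grayscale grid (real implementations resize the image first), and the helper names `average_hash` and `hamming_distance` are made up for this example.

```python
from typing import List


def average_hash(pixels: List[List[float]]) -> str:
    """Return a bit string with 1 wherever a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if value > mean else "0" for value in flat)


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


# Toy 2x2 grayscale "images": the second is a uniformly brighter copy of the
# first, the third has its bright and dark regions swapped.
original = [[10, 200], [220, 30]]
brighter = [[30, 220], [240, 50]]
different = [[200, 10], [30, 220]]

hash_original = average_hash(original)
hash_brighter = average_hash(brighter)
hash_different = average_hash(different)
```

Here the brighter copy produces the same hash as the original (distance 0), while the swapped image differs in every bit (distance 4). A small Hamming distance is what marks two images as near duplicates under this kind of scheme.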

After generating the encodings and finding the duplicates in the dataset, we can plot all duplicates of any given image using the function below:

In [None]:
plot_duplicates(image_dir='../input/shopee-product-matching/train_images',
                duplicate_map=duplicates,
                filename='0cca4afba97e106abd0843ce72881ca4.jpg')
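Since this could be adapted for a submission, here is a hedged sketch of turning a duplicates map into per-posting match strings. Everything here is an assumption for illustration: that `duplicates` maps each filename to a list of duplicate filenames, that each image file can belong to one or more posting IDs (the hypothetical `ids_by_image` dictionary, which would have to be built from the CSV), and that matches are written as a space-separated string of IDs.

```python
from typing import Dict, List


def duplicates_to_matches(
    duplicates: Dict[str, List[str]],
    ids_by_image: Dict[str, List[str]],
) -> Dict[str, str]:
    """Map each posting ID to a space-separated string of matching posting IDs."""
    matches: Dict[str, str] = {}
    for image, dup_images in duplicates.items():
        # Start with the postings that use this exact image file, then add
        # the postings behind every reported duplicate, skipping repeats.
        group: List[str] = []
        for name in [image, *dup_images]:
            for posting_id in ids_by_image.get(name, []):
                if posting_id not in group:
                    group.append(posting_id)
        # Every posting that uses this image gets the same group of matches.
        for posting_id in ids_by_image.get(image, []):
            matches[posting_id] = " ".join(group)
    return matches


# Toy data standing in for the real duplicates map and CSV columns.
toy_duplicates = {"a.jpg": ["b.jpg"], "b.jpg": ["a.jpg"], "c.jpg": []}
toy_ids = {"a.jpg": ["p1", "p4"], "b.jpg": ["p2"], "c.jpg": ["p3"]}
toy_matches = duplicates_to_matches(toy_duplicates, toy_ids)
```

In the toy data, postings `p1` and `p4` share `a.jpg`, so both end up matched with each other and with `p2`, while `p3` only matches itself.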

That's it for now!
All credit goes to [Imagededup's GitHub repo](https://github.com/idealo/imagededup), which is where I learned how to do all this.

**Also, I will be adding more stuff to this notebook in the future!**