# PetFinder | Identify Duplicates and Share Findings

![](https://i.postimg.cc/W1TZZrhN/download-5.png)

As mentioned by the admin in the [discussion](https://www.kaggle.com/c/petfinder-pawpularity-score/discussion/278309), there are duplicate images in the dataset. In this notebook, I will try to:
* identify those duplicates
* share the findings
* create a dataset without the duplicates

The findings are also summarized in [Discussion](https://www.kaggle.com/c/petfinder-pawpularity-score/discussion/278497).

The method for identifying duplicates is based on the [notebook "Let's find out duplicate images with imagehash"](https://www.kaggle.com/appian/let-s-find-out-duplicate-images-with-imagehash) shared by [Appian](https://www.kaggle.com/appian) for the previous PetFinder competition. The method compares image hashes and is simple yet powerful.

# Table of Contents
* [Identify duplicates](#1)
* [Share the findings](#2)
* [Create new dataset](#3)

<a id="1"></a>
# Identify duplicates
Again, this method is based on [Let's find out duplicate images with imagehash](https://www.kaggle.com/appian/let-s-find-out-duplicate-images-with-imagehash).

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
import glob
import itertools
import collections

from PIL import Image
import cv2
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
import pandas as pd
import numpy as np
import torch
import imagehash

import matplotlib.pyplot as plt


train = pd.read_csv('../input/petfinder-pawpularity-score/train.csv')

Calculate hash values of every train image. This takes around 10 minutes in a Kaggle Notebook.

In [None]:
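def run():
    # Four hash functions from imagehash; each yields an 8x8 boolean hash by default.
    funcs = [
        imagehash.average_hash,
        imagehash.phash,
        imagehash.dhash,
        imagehash.whash,
    ]

    petids = []
    hashes = []
    for path in tqdm(glob.glob('../input/petfinder-pawpularity-score/train/*.jpg')):
        # The file name without its extension is the image ID used in train.csv.
        imageid = os.path.splitext(os.path.basename(path))[0]

        # Close each file after hashing instead of leaving it open.
        with Image.open(path) as image:
            # Concatenate the four 64-bit hashes into one 256-bit vector.
            hashes.append(np.array([f(image).hash for f in funcs]).reshape(256))
        petids.append(imageid)

    return petids, np.array(hashes)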

%time petids, hashes_all = run()
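
To give a feel for what one of these hashes captures, here is a minimal sketch of an average-style hash written with NumPy only. It is a simplified illustration, not the actual `imagehash` implementation: `toy_average_hash` is a hypothetical helper, the input is a synthetic array rather than a pet photo, and the block averaging stands in for proper image resizing.

```python
import numpy as np

def toy_average_hash(gray, hash_size=8):
    """Simplified average hash: block-average the image down to
    hash_size x hash_size, then flag cells brighter than the overall mean."""
    h, w = gray.shape
    # Assumes h and w are multiples of hash_size (true for the toy input below).
    blocks = gray.reshape(hash_size, h // hash_size, hash_size, w // hash_size)
    small = blocks.mean(axis=(1, 3))
    return small > small.mean()

rng = np.random.default_rng(0)
image = rng.integers(0, 128, size=(64, 64)).astype(float)  # synthetic grayscale image
scaled = image * 2.0                                       # uniform intensity scaling

hash_a = toy_average_hash(image)
hash_b = toy_average_hash(scaled)
print(hash_a.shape, (hash_a == hash_b).mean())
```

Because each bit only records whether a region is brighter than the image's own mean, a uniform intensity change leaves this toy hash unchanged. This suggests why lightly edited copies of a photo can still score a high similarity.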

In [None]:
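# Use the GPU when one is available; fall back to the CPU otherwise.
hashes_all = torch.Tensor(hashes_all.astype(int)).to('cuda' if torch.cuda.is_available() else 'cpu')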

Calculate the similarity between all image pairs. The similarity of two images is the fraction of their 256 hash bits that match, so it ranges from 0 to 1.

In [None]:
%time sims = np.array([(hashes_all[i] == hashes_all).sum(dim=1).cpu().numpy()/256 for i in range(hashes_all.shape[0])])
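
The similarity computed above can be illustrated on a tiny example. The sketch below uses three made-up 256-bit vectors (`base`, `near_copy` and `unrelated` are illustrative names, not data from the competition) and NumPy broadcasting instead of the per-row loop over a torch tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits = 256
base = rng.integers(0, 2, size=n_bits)

near_copy = base.copy()
near_copy[:8] ^= 1        # flip 8 of the 256 bits
unrelated = 1 - base      # every bit differs from base

toy_hashes = np.stack([base, near_copy, unrelated])

# Pairwise fraction of matching bits: compare (n, 1, bits) against (1, n, bits).
toy_sims = (toy_hashes[:, None, :] == toy_hashes[None, :, :]).mean(axis=2)

# Equivalent view: similarity = 1 - Hamming distance / number of bits.
hamming = (toy_hashes[:, None, :] != toy_hashes[None, :, :]).sum(axis=2)

print(toy_sims)
```

Here the near copy scores 248/256 against the base vector, while the inverted vector scores 0. Broadcasting all pairs at once is fine for three vectors but would need far more memory for the full training set, which is one reason to loop over rows as the cell above does.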

Define a function that allows us to display and retrieve the pairs whose similarity falls within a given range (above `lower_sim`, up to and including `upper_sim`).

In [None]:
def show_pairs(lower_sim=0.0, upper_sim=1.0, max_shown=100):
    indices1 = np.where((sims > lower_sim) & (sims <= upper_sim))
    indices2 = np.where(indices1[0] != indices1[1])
    dups = {tuple(sorted([petids[index1], petids[index2]])): sims[index1, index2] 
                for index1, index2 in zip(indices1[0][indices2], indices1[1][indices2])}
    print('Found %d pairs' % len(dups))
    
    cnt = 1
    for (id1, id2), sim in dups.items():
        path1 = f'../input/petfinder-pawpularity-score/train/{id1}.jpg'
        path2 = f'../input/petfinder-pawpularity-score/train/{id2}.jpg'
        pawp1 = train[train['Id'] == id1]['Pawpularity'].iloc[-1]
        pawp2 = train[train['Id'] == id2]['Pawpularity'].iloc[-1]

        image1 = cv2.imread(path1)
        image2 = cv2.imread(path2)
        image1 = cv2.cvtColor(image1, cv2.COLOR_BGR2RGB)
        image2 = cv2.cvtColor(image2, cv2.COLOR_BGR2RGB)

        fig, axes = plt.subplots(nrows=1, ncols=2)
        fig.set_size_inches(12, 6)
        axes[0].title.set_text(f'Pawpularity: {pawp1} \n ID: {id1}')
        axes[0].imshow(image1)
        axes[1].title.set_text(f'Pawpularity: {pawp2} \n ID: {id2}')
        axes[1].imshow(image2)
        fig.suptitle(f'Similarity: {sim}')
        plt.show()
        
        if cnt >= max_shown:
            break
        
        cnt += 1
    
    return dups
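
If only the pairs are needed and not the plots, the selection step of `show_pairs` can be isolated. The following is a minimal sketch under that assumption: `pairs_in_range` is a hypothetical helper, and the IDs and similarity values are invented for illustration.

```python
import numpy as np

def pairs_in_range(sim_matrix, ids, lower_sim=0.0, upper_sim=1.0):
    """Return {(id_a, id_b): similarity} for lower_sim < similarity <= upper_sim."""
    rows, cols = np.where((sim_matrix > lower_sim) & (sim_matrix <= upper_sim))
    # rows < cols keeps each unordered pair once and drops the diagonal.
    upper = rows < cols
    return {
        tuple(sorted((ids[r], ids[c]))): float(sim_matrix[r, c])
        for r, c in zip(rows[upper], cols[upper])
    }

toy_ids = ['a', 'b', 'c']
toy_matrix = np.array([
    [1.00, 0.95, 0.40],
    [0.95, 1.00, 0.87],
    [0.40, 0.87, 1.00],
])

found_high = pairs_in_range(toy_matrix, toy_ids, 0.9, 1.0)
found_mid = pairs_in_range(toy_matrix, toy_ids, 0.85, 0.9)
print(found_high, found_mid)
```

Filtering on `rows < cols` avoids visiting both `(i, j)` and `(j, i)`, which `show_pairs` handles instead by sorting the IDs into a dictionary key.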

<a id="2"></a>
# Share the findings

First, let's look at the distribution of similarity using a box plot. 

Note that the similarities form a square matrix. We are only interested in the off-diagonal entries, because each diagonal entry is the similarity of an image to itself and is therefore always 1.

In [None]:
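def offdiagonal(X, axis1, axis2):
    # Return X with the diagonal over (axis1, axis2) removed: (n, n) -> (n, n-1).
    X = np.moveaxis(X, (axis1, axis2), (-2, -1))
    *s, n, _ = X.shape
    # Drop the last element and view the rest as (n-1, n+1), so that every
    # remaining diagonal entry lands in column 0; then drop that column.
    X = X.reshape(*s, n*n)[..., :-1].reshape(*s, n-1, n+1)[..., 1:].reshape(*s, n, n-1)
    return np.moveaxis(X, (-2, -1), (axis1, axis2))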


plt.figure(figsize=(12, 4))
plt.title('Distribution of similarity')
plt.boxplot(offdiagonal(sims, 0, 1).flatten(), vert=False);
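
For a plain 2-D matrix, the off-diagonal entries can also be selected with a boolean mask, which may be easier to read than the reshape trick. The sketch below compares both on a small example; `toy` is an illustrative matrix, not the real similarity matrix.

```python
import numpy as np

toy = np.arange(9).reshape(3, 3)

# Mask that is False on the diagonal and True everywhere else.
mask = ~np.eye(toy.shape[0], dtype=bool)
off_diag = toy[mask]

# The same selection with the reshape trick, written out for n = 3.
n = toy.shape[0]
via_reshape = toy.reshape(n * n)[:-1].reshape(n - 1, n + 1)[:, 1:].reshape(n, n - 1)

print(off_diag)  # [1 2 3 5 6 7]
```

The mask version allocates an extra `n x n` boolean array, so for the full similarity matrix the reshape approach may be lighter on memory.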

In this box plot, we can see isolated clusters around 1, which we assume to be duplicate images.

Now, let's look at the images for each threshold range of similarity.

## Similarity 0.9 - 1

* 27 pairs in this range
* Most pairs are identical, at least in appearance

In [None]:
dups_90_00 = show_pairs(0.9, 1.0, max_shown=5)

In this range, almost all pairs appear to be duplicates. Pairs whose similarity is below 1 must differ by a very slight modification to the image. The most obvious example is the pair below, where the right image has slightly higher contrast.

![](https://i.postimg.cc/KYnpctkn/pair1.png)

## Similarity 0.85 - 0.9
- 5 pairs in this range
- Interesting patterns can be seen.

In [None]:
dups_85_90 = show_pairs(0.85, 0.9, max_shown=5)

### Interesting patterns
- The same pet photographed at a slightly different time

![](https://i.postimg.cc/wvgf13Ln/pairs2.png)  
![](https://i.postimg.cc/7Lf7m4Db/pairs3.png)


- Cropped

![](https://i.postimg.cc/Pxy6YhMJ/pairs4.png)  
![](https://i.postimg.cc/zBYBKYGs/pairs5.png)

## Similarity 0.8 - 0.85
- 174 pairs in this range
- Most of the pairs are completely different, but some interesting patterns can be seen.

In [None]:
dups_80_85 = show_pairs(0.80, 0.85, max_shown=5)

### Interesting patterns
- Different pets with the same background and the same frame  
![](https://i.postimg.cc/Zn0LNNRs/download.png)

- Similar pets with similar background and the same frame  
![](https://i.postimg.cc/wxPXP73H/download-1.png)

- Cropped  
![](https://i.postimg.cc/br2t89dP/download-2.png)

- Different pets with the same template  
![](https://i.postimg.cc/P51Q1TXH/download-4.png)

<a id="3"></a>
# Create new dataset
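Here, I create a dataset that excludes only the obvious duplicates, i.e. pairs with a similarity above 0.9. For each pair, the first image is kept and the second is removed; the Pawpularity values are left unchanged.  
You can also build your own dataset with a different threshold or a different treatment of Pawpularity.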

The dataset is also uploaded as a Kaggle Dataset. Feel free to use it.  
https://www.kaggle.com/schulta/petfinder-pawpularity-score-clean

In [None]:
!mkdir ../working/petfinder-pawpularity-score-clean
!cp -r ../input/petfinder-pawpularity-score/* ../working/petfinder-pawpularity-score-clean

In [None]:
ids1 = np.array(list(dups_90_00.keys()))[:, 0]
ids2 = np.array(list(dups_90_00.keys()))[:, 1]

train_new = train[~train["Id"].isin(ids2)]
train_new = train_new.reset_index(drop=True)

train_new.to_csv('../working/petfinder-pawpularity-score-clean/train.csv', 
                 index=False)
train_new

In [None]:
# Delete each dropped image once, even if its ID appears in more than one pair.
for id2 in {pair[1] for pair in dups_90_00}:
    path2 = f'../working/petfinder-pawpularity-score-clean/train/{id2}.jpg'
    os.remove(path2)
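
As one possible alternative treatment of Pawpularity, duplicates could be merged into groups and each group given the mean score of its members. The following is a sketch only: `group_duplicates` is a hypothetical helper, the IDs and scores are invented, and averaging is just one choice among several (not what the dataset above does).

```python
from collections import defaultdict

def group_duplicates(pairs):
    """Merge duplicate pairs into groups with a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for x in parent:
        groups[find(x)].append(x)
    return [sorted(members) for members in groups.values()]

# Hypothetical IDs and scores, for illustration only.
toy_pairs = [('img1', 'img2'), ('img2', 'img3'), ('img7', 'img8')]
toy_scores = {'img1': 30, 'img2': 40, 'img3': 50, 'img7': 10, 'img8': 20}

merged = {}
for members in group_duplicates(toy_pairs):
    keep = members[0]  # keep one representative per group
    merged[keep] = sum(toy_scores[m] for m in members) / len(members)

print(merged)
```

With the real data, `toy_pairs` would be replaced by the keys of a dictionary such as `dups_90_00` and `toy_scores` by a mapping built from `train`. Grouping also handles the case where three or more images are duplicates of each other.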