### Summary

Duplicates are harmful to the training process: duplicates with different labels add noise to the dataset, while duplicates with the same label lead to data leakage when copies end up on both sides of a train/validation split.

In this short notebook I compute image hashes for the **Plant Pathology 2021** competition dataset with the `imagehash` library and find more than 50 duplicates.

### Update
Due to recent changes in the `train.csv` file mentioned in **[this discussion](https://www.kaggle.com/c/plant-pathology-2021-fgvc8/discussion/228465)**, the `cider_apple_rust` class no longer exists. This version (7) was made after the changes.

### Imports

In [None]:
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import numpy as np
import imagehash
import PIL.Image
import os

In [None]:
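class CFG:
    threshold = .9  # minimum fraction of matching hash bits to call two images duplicates
    img_size = 512  # side length (in pixels) of the downscaled copies
    seed = 42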

## 1. Saving downscaled images to boost performance
Computing hashes over the original high-resolution images would take nearly 5 hours, so we downscale them first.

In [None]:
root = '/kaggle/input/plant-pathology-2021-fgvc8/train_images'

paths = os.listdir(root)

df = pd.read_csv('/kaggle/input/plant-pathology-2021-fgvc8/train.csv', index_col='image')

# Downscale every training image and save the copy into the working directory
for path in tqdm(paths, total=len(paths)):
    image = tf.io.read_file(os.path.join(root, path))
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [CFG.img_size, CFG.img_size])
    image = tf.cast(image, tf.uint8).numpy()
    plt.imsave(path, image)  # `path` is a bare file name, so this writes to ./

## 2. Hash computation

In [None]:
hash_functions = [
    imagehash.average_hash,
    imagehash.phash,
    imagehash.dhash,
    imagehash.whash]

image_ids = []
hashes = []

paths = tf.io.gfile.glob('./*.jpg')

for path in tqdm(paths, total=len(paths)):

    image = PIL.Image.open(path)

    # Concatenate the bits of all four hashes into one flat boolean vector per image
    hashes.append(np.array([x(image).hash for x in hash_functions]).reshape(-1,))
    image_ids.append(os.path.basename(path))
    
hashes = np.array(hashes)
image_ids = np.array(image_ids)

## 3. Run search across hashed images
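We first compare each image hash with all the hashes and then keep only the unique sets of matching images. Similarity is the fraction of hash bits that agree; images whose similarity exceeds `CFG.threshold` are treated as duplicates.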

In [None]:
duplicate_ids = []

for i in tqdm(range(len(hashes)), total=len(hashes)):
    similarity = (hashes[i] == hashes).mean(axis=1)
    duplicate_ids.append(list(image_ids[similarity > CFG.threshold]))
    
duplicates = [frozenset([x] + y) for x, y in zip(image_ids, duplicate_ids)]
duplicates = set([x for x in duplicates if len(x) > 1])
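
To make the similarity measure concrete, here is a toy run of the same comparison on hand-made bit vectors. The arrays `toy_hashes` and `toy_ids` and the helper `find_groups` are hypothetical stand-ins for `hashes`, `image_ids` and the loop above; this is a sketch, not part of the pipeline.

```python
import numpy as np

# Toy stand-in for `hashes`: 4 images, 10 bits each (hypothetical values)
toy_hashes = np.array([
    [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],  # a.jpg
    [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],  # b.jpg: identical to a.jpg
    [0, 1, 0, 0, 1, 1, 0, 1, 0, 0],  # c.jpg: every bit differs from a.jpg
    [1, 0, 1, 1, 0, 0, 1, 0, 0, 0],  # d.jpg: two bits differ from a.jpg
], dtype=bool)
toy_ids = np.array(['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'])


def find_groups(hashes, ids, threshold):
    """Group ids whose fraction of matching bits exceeds `threshold`."""
    groups = set()
    for i in range(len(hashes)):
        # Broadcasting compares row i with every row; the mean is the match fraction
        similarity = (hashes[i] == hashes).mean(axis=1)
        matched = frozenset(ids[similarity > threshold].tolist())
        if len(matched) > 1:  # every image matches itself, so skip singletons
            groups.add(matched)
    return groups


toy_groups = find_groups(toy_hashes, toy_ids, threshold=0.9)
print(toy_groups)
```

Only `a.jpg` and `b.jpg` are grouped: `d.jpg` agrees with `a.jpg` on 8 of 10 bits, which is below the threshold. Assuming the default 8x8 hash size of `imagehash`, the four hashes give 256 bits per image, so a threshold of 0.9 tolerates at most 25 differing bits.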

Here we add some of the duplicates spotted by @kingofarmy in the corresponding **[discussion](https://www.kaggle.com/c/plant-pathology-2021-fgvc8/discussion/229851)**:

In [None]:
duplicates_by_kingofarmy = {
    frozenset(('8dbeda49894d522e.jpg', 'afbe5641896d522a.jpg')),
    frozenset(('af6292db1b611d98.jpg', 'a56292dadb618d95.jpg')),
    frozenset(('abf0b5a0df028b17.jpg', 'abf0b5819f028f0f.jpg')),
    frozenset(('e385830ecacd2d9e.jpg', 'c335971e8acd609e.jpg')),
    frozenset(('cebdc20f67838631.jpg', 'dfbdc047068b063d.jpg')),
    frozenset(('f392f11919991cea.jpg', 'f196f11a99d91ce0.jpg'))}

duplicates |= duplicates_by_kingofarmy
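
A thresholded similarity is not transitive: if A matches B and B matches C, but A does not match C, the set can hold several overlapping groups such as {A, B}, {A, B, C} and {B, C}. If one group per cluster of duplicates is preferred, the groups can be merged with a small union-find. This is a minimal sketch; `merge_overlapping` is a hypothetical helper, not something the notebook uses.

```python
def merge_overlapping(groups):
    """Merge groups that share at least one member (connected components)."""
    parent = {}

    def find(item):
        # Follow parent links to the representative, shortening the path on the way
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for group in groups:
        members = list(group)
        root = find(members[0])
        for other in members[1:]:
            parent[find(other)] = root  # attach the other component to this one

    merged = {}
    for item in list(parent):
        merged.setdefault(find(item), set()).add(item)
    return {frozenset(members) for members in merged.values()}


# Toy groups with hypothetical ids; in the notebook one could call
# merge_overlapping(duplicates) instead
toy_groups = {
    frozenset({'a.jpg', 'b.jpg'}),
    frozenset({'b.jpg', 'c.jpg'}),
    frozenset({'x.jpg', 'y.jpg'}),
}
merged_groups = merge_overlapping(toy_groups)
print(merged_groups)
```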

## 4. Let's see what was found

In [None]:
print(f'Found {len(duplicates)} duplicate groups:')
for row in duplicates:
    print(', '.join(row))

In [None]:
print('Writing duplicates to "duplicates.csv".')
with open('duplicates.csv', 'w') as file:
    for row in duplicates:
        file.write(','.join(row) + '\n')
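
One way to use the saved file later is sketched below: groups whose members share a label keep a single copy, while groups with conflicting labels are set aside for manual review. This policy is my assumption, not something the notebook prescribes, and `parse_duplicate_rows`, `split_groups` and the toy ids and labels are hypothetical.

```python
import csv


def parse_duplicate_rows(lines):
    """Parse rows written as 'id1,id2,...' into lists of image ids."""
    return [row for row in csv.reader(lines) if row]


def split_groups(groups, labels):
    """Return (ids to drop, groups with conflicting labels).

    `labels` maps image id -> label string; with the notebook's data it could
    be built as df['labels'].to_dict() (an assumption, not done in the notebook).
    """
    to_drop, conflicting = [], []
    for group in groups:
        members = sorted(group)
        if len({labels[m] for m in members}) > 1:
            conflicting.append(members)  # noisy labels: needs a manual decision
        else:
            to_drop.extend(members[1:])  # same label: keep one copy, drop the rest
    return to_drop, conflicting


# Toy rows in the same format as duplicates.csv (hypothetical ids and labels).
# For the real file: with open('duplicates.csv', newline='') as file:
#                        groups = parse_duplicate_rows(file)
toy_lines = ['a.jpg,b.jpg\n', 'c.jpg,d.jpg,e.jpg\n']
toy_labels = {'a.jpg': 'scab', 'b.jpg': 'healthy',
              'c.jpg': 'rust', 'd.jpg': 'rust', 'e.jpg': 'rust'}

toy_groups = parse_duplicate_rows(toy_lines)
to_drop, conflicting = split_groups(toy_groups, toy_labels)
print(to_drop, conflicting)
```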

In [None]:
for row in duplicates:
    
    figure, axes = plt.subplots(1, len(row), figsize=[5 * len(row), 5])

    for i, image_id in enumerate(row):
        image = plt.imread(os.path.join(root, image_id))  # show the original, full-size image
        axes[i].imshow(image)

        axes[i].set_title(f'{image_id} - {df.loc[image_id, "labels"]}')
        axes[i].axis('off')

    plt.show()

### Clear working folder to avoid output pollution

In [None]:
for file in tf.io.gfile.glob('./*.jpg'):
    os.remove(file)

### Acknowledgements

* This work is a Copy & Edit of @appian's **[notebook](https://www.kaggle.com/appian/let-s-find-out-duplicate-images-with-imagehash)** with many changes, but it is still heavily inspired by the original. If you find this notebook useful, please upvote his work too.
* Thanks to @kingofarmy for spotting more duplicates in **[his thread](https://www.kaggle.com/c/plant-pathology-2021-fgvc8/discussion/229851)**.