## A brief outline

In this notebook, we show how to search for duplicated images using image embeddings and the DBSCAN clustering algorithm. The embeddings are extracted with a pre-trained EfficientNet model. We find 5 pairs of potential duplicates in the Cassava training set. Visual inspection confirms that every pair is a true duplicate, so the precision of our method on the training dataset is 100% (the recall is unknown). One pair of duplicates contains images with different labels, which clearly demonstrates the presence of noisy labels in the competition dataset.

This method can also be applied to search for the overlap between the current competition dataset and the [2019 data](https://www.kaggle.com/c/cassava-disease/data) (we leave this as an exercise for the interested reader). It is very important to keep all duplicates under control when doing cross-validation, because we do not want to validate our model on the same data that it saw during the training phase. Identifying duplicated images helps us ensure that no image appears in both the training and the validation sets.
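
One way to keep duplicates under control is to treat every group of duplicates as a single unit when assigning folds. The sketch below is a minimal illustration and not part of the original pipeline: the helper `assign_folds` and the toy file names are hypothetical, and it assumes the duplicate groups are already known.

```python
import random


def assign_folds(image_ids, duplicate_groups, n_folds=5, seed=0):
    """Assign a fold to every image so that known duplicates share a fold."""
    # Map each image to a group key; duplicates share the key of their group.
    group_of = {image_id: image_id for image_id in image_ids}
    for group in duplicate_groups:
        for image_id in group:
            group_of[image_id] = group[0]

    # Shuffle the unique groups, then deal them out to the folds in turn.
    groups = sorted(set(group_of.values()))
    random.Random(seed).shuffle(groups)
    fold_of_group = {g: i % n_folds for i, g in enumerate(groups)}

    return {image_id: fold_of_group[group_of[image_id]] for image_id in image_ids}


# Toy example with made-up file names.
ids = [f"img_{i}.jpg" for i in range(10)]
dups = [["img_0.jpg", "img_7.jpg"], ["img_3.jpg", "img_4.jpg"]]
folds = assign_folds(ids, dups, n_folds=3)
```

Because a whole group always lands in one fold, a duplicate can never end up on both sides of a train/validation split. scikit-learn's `GroupKFold` provides the same guarantee if you prefer a library implementation.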

The discussion topic is here: [Searching for duplicated images with DBSCAN](https://www.kaggle.com/c/cassava-leaf-disease-classification/discussion/210723).

## Loading libraries

In [None]:
!pip install -q efficientnet

In [None]:
import re
import gc  
import os
import math
import json
import random
import warnings
import numpy as np
import pandas as pd
from PIL import Image
import tensorflow as tf
from pathlib import Path
from sklearn.cluster import DBSCAN
import efficientnet.tfkeras as efn
from collections import defaultdict
from matplotlib import pyplot as plt
from kaggle_datasets import KaggleDatasets

print("Tensorflow version " + tf.__version__)
AUTO = tf.data.experimental.AUTOTUNE

## Loading data

In [None]:
def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    tf.random.set_seed(seed)

In [None]:
MODEL='EffNB4'

HEIGHT=512
WIDTH=512
IMAGE_SIZE = [HEIGHT, WIDTH] # At this size, a GPU will run out of memory. Use the TPU.
                             # For GPU training, please select 224 x 224 px image size.    
DESCRIPTION='embeddings'
NAME=f'{MODEL}_{HEIGHT}x{WIDTH}_{DESCRIPTION}'
SEED=311
BATCH_SIZE_FACTOR = 32

model_selector={'EffNB0': efn.EfficientNetB0,
                'EffNB1': efn.EfficientNetB1,
                'EffNB2': efn.EfficientNetB2,
                'EffNB3': efn.EfficientNetB3,
                'EffNB4': efn.EfficientNetB4,
                'EffNB5': efn.EfficientNetB5,
                'EffNB6': efn.EfficientNetB6,
                'EffNB7': efn.EfficientNetB7,
               }

PATH=Path('/kaggle/input/cassava-leaf-disease-classification/')
train=pd.read_csv(PATH/'train.csv')
              
seed_everything(SEED)
warnings.filterwarnings('ignore')
print(f"Model name: {NAME}.")

In [None]:
print(f"The shape of the training set is {train.shape}.")
print(f"The columns in `train`:\n {list(train.columns)}.\n")

In [None]:
train.head()

In [None]:
labels=np.sort(train['label'].unique())
labels

Now, let's load the mapping between the label numbers and the disease names.

In [None]:
with open(os.path.join(PATH, "label_num_to_disease_map.json")) as file:
    label_mapping = json.loads(file.read())
    
label_mapping

Let's remove the abbreviations at the end of each disease name and turn the keys of the dictionary into integers.

In [None]:
label_mapping={0: 'Cassava Bacterial Blight',
               1: 'Cassava Brown Streak Disease',
               2: 'Cassava Green Mottle',
               3: 'Cassava Mosaic Disease',
               4: 'Healthy',
              }
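
The same cleanup can also be done programmatically rather than by hand. This is a small sketch under one assumption: that the raw JSON mapping uses string keys and appends each abbreviation in parentheses, as in the illustrative `raw_mapping` below.

```python
import re

# Assumed shape of the raw mapping: string keys, and names with an
# optional abbreviation in parentheses at the end.
raw_mapping = {
    "0": "Cassava Bacterial Blight (CBB)",
    "1": "Cassava Brown Streak Disease (CBSD)",
    "2": "Cassava Green Mottle (CGM)",
    "3": "Cassava Mosaic Disease (CMD)",
    "4": "Healthy",
}

# Strip a trailing " (ABBR)" and convert the keys to integers.
clean_mapping = {
    int(key): re.sub(r"\s*\([^)]*\)\s*$", "", name)
    for key, name in raw_mapping.items()
}
```

Names without a parenthesized suffix, such as `Healthy`, pass through unchanged.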

Save the abbreviated class names as follows:

In [None]:
SHORT_CLASSES=['CBB', 'CBSD', 'CGM', 'CMD', 'Healthy']

In [None]:
train['disease'] = train['label'].map(label_mapping)
train['disease'].value_counts()

In [None]:
%%time

CLASSES = [label_mapping[i] for i in range(len(label_mapping))]
CLASSES

## TPU or GPU detection

In [None]:
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment 
    # variable is set. On Kaggle this is always the case.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()
    
BATCH_SIZE = BATCH_SIZE_FACTOR * strategy.num_replicas_in_sync

print("REPLICAS: ", strategy.num_replicas_in_sync)
print("BATCH SIZE: ", BATCH_SIZE)

## Data access

In [None]:
GCS_PATH = KaggleDatasets().get_gcs_path('cassava-leaf-disease-classification')
ALL_TFRECS=np.array(tf.io.gfile.glob(GCS_PATH + '/train_tfrecords/*.tfrec'))

print(GCS_PATH)

In [None]:
ALL_TFRECS

## Configuration

In [None]:
def count_data_items(filenames):
    # the number of data items is written in the name of the .tfrec files, 
    # i.e. test10-687.tfrec = 687 data items
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in filenames]
    
    return np.sum(n)

In [None]:
%%time

print(f"The total number of images is {count_data_items(ALL_TFRECS)}.")
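
To see what the regular expression in `count_data_items` extracts, here is a small standalone sketch. The file names are hypothetical and only follow the `<name>-<count>.tfrec` convention described in the comment above; the pattern uses `+` instead of `*` so that an empty match cannot reach `int()`.

```python
import re

# Hypothetical file names that follow the "<name>-<count>.tfrec" convention.
example_files = ["part00-100.tfrec", "part01-250.tfrec", "part02-37.tfrec"]

# Capture the digits between the last "-" and the file extension.
pattern = re.compile(r"-([0-9]+)\.")
counts = [int(pattern.search(name).group(1)) for name in example_files]
total = sum(counts)
```

The total is simply the sum of the per-file counts, so no record has to be read to learn the dataset size.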

## Datasets utility functions

Below are the standard functions that we will be using to read and process the data from the `.tfrec` files. 

In [None]:
def decode_image(image_data):
    """
        1. Decode a JPEG-encoded image to a uint8 tensor.
        2. Cast the tensor to float, scale it to [0, 1], and standardize it.
        3. Reshape the image to the expected shape.
    """
    image = tf.image.decode_jpeg(image_data, channels=3)
    # We standardize the inputs by subtracting the channel-averaged ImageNet
    # mean (0.449) and dividing by the channel-averaged ImageNet standard
    # deviation (0.226). We have to do it because we won't be fine-tuning our model.
    image = ((tf.cast(image, tf.float32) / 255.0) - 0.449) / 0.226
    image = tf.reshape(image, [HEIGHT, WIDTH, 3])
    return image
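
The arithmetic inside `decode_image` can be checked on a few pixel values with plain NumPy. This is only an illustration of the formula; the names `MEAN`, `STD`, and `pixels` are introduced here and do not appear in the notebook.

```python
import numpy as np

MEAN, STD = 0.449, 0.226  # the constants used in `decode_image`

# Darkest, mid-range, and brightest 8-bit pixel values.
pixels = np.array([0, 128, 255], dtype=np.uint8)

# Scale to [0, 1], then subtract the mean and divide by the standard deviation.
normalized = (pixels.astype(np.float32) / 255.0 - MEAN) / STD
```

With these constants, the full 8-bit range maps to roughly -1.99 through 2.44, centered close to zero.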

In [None]:
def read_tfrecord(example, labeled=True):
    """
        1. Parse data based on the 'TFREC_FORMAT' map.
        2. Decode image.
        3. Return (image, label) if 'labeled' is True, otherwise (image, name).
    """
    if labeled:
        TFREC_FORMAT = {
            'image': tf.io.FixedLenFeature([], tf.string), 
            'target': tf.io.FixedLenFeature([], tf.int64), 
        }
    else:
        TFREC_FORMAT = {
            'image': tf.io.FixedLenFeature([], tf.string), 
            'image_name': tf.io.FixedLenFeature([], tf.string), 
        }
    example = tf.io.parse_single_example(example, TFREC_FORMAT)
    image = decode_image(example['image'])
    if labeled:
        label_or_name = tf.cast(example['target'], tf.int32)
    else:
        label_or_name =  example['image_name']
    return image, label_or_name

In [None]:
def load_dataset(filenames, labeled=True, ordered=False):
    """
        Create a Tensorflow dataset from TFRecords.
    """
    ignore_order = tf.data.Options()
    if not ordered:
        # disable order, increase speed. Makes sense to do
        # if we are going to shuffle the data anyway
        ignore_order.experimental_deterministic = False
    
    # automatically interleaves reads from multiple files
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTO)
    # if ordered=False uses data as soon as it streams in, rather than in its original order
    dataset = dataset.with_options(ignore_order)
    # returns a dataset of (image, label) pairs if labeled=True 
    # or (image, id) pairs if labeled=False
    dataset = dataset.map(lambda x: read_tfrecord(x, labeled=labeled), num_parallel_calls=AUTO)
    return dataset

Let's take a quick look at one example of data:

In [None]:
%%time

dataset = load_dataset(ALL_TFRECS)

print("Example of the training data:")
for image, label in dataset.take(1):
    print("The image shape:", image.numpy().shape)
    print("Label:", label.numpy())

We won't be training our model, so our `get_dataset` function is very simple.

In [None]:
def get_dataset(FILENAMES):
    """
        Return a Tensorflow dataset ready for training or inference.
    """     
    dataset = load_dataset(FILENAMES, labeled=False, ordered=True)
    dataset = dataset.cache()
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.prefetch(AUTO)
    
    return dataset

## Visualization utilities

Here are the standard utilities that we will use to visualize the dataset.

In [None]:
# numpy and matplotlib defaults
np.set_printoptions(threshold=15, linewidth=80)

In [None]:
def batch_to_numpy_images_and_labels(databatch):
    images, labels = databatch
    numpy_images = images.numpy()
    numpy_labels = labels.numpy()

    if numpy_labels.dtype == object: # binary string in this case, these are image ID strings
        # If no labels, only image IDs, return None for labels (this is the case for test data)
        numpy_labels = [None for _ in enumerate(numpy_images)]

    return numpy_images, numpy_labels

In [None]:
def title_from_label_and_target(label, correct_label):
    if correct_label is None:
        return SHORT_CLASSES[label], True
    correct = (label == correct_label)
    return "{} [{}{}{}]".format(SHORT_CLASSES[label], 'OK' if correct else 'NO', u"\u2192" 
                                if not correct else '', 
                                SHORT_CLASSES[correct_label] if not correct else ''), correct

In [None]:
def display_one_image(image, title, subplot, red=False, titlesize=16):
    plt.subplot(*subplot)
    plt.axis('off')
    plt.imshow(image)
    if len(title) > 0:
        plt.title(title, fontsize=int(titlesize) if not red else int(titlesize/1.2), 
                  color='red' if red else 'black', fontdict={'verticalalignment':'center'}, 
                  pad=int(titlesize/1.5)
                 )
    return (subplot[0], subplot[1], subplot[2]+1)

In [None]:
def display_batch_of_images(databatch, show_class_names=True, predictions=None):
    """ This will work with:
        display_batch_of_images(images)
        display_batch_of_images(images, predictions)
        display_batch_of_images((images, labels))
        display_batch_of_images((images, labels), predictions)
    """
    # data
    images, labels = batch_to_numpy_images_and_labels(databatch)
    if not any(l is None for l in labels):
        labels = np.argmax(labels, axis=-1)
        
    # auto-squaring: this will drop data that does  
    # not fit into square or square-ish rectangle
    rows = int(math.sqrt(len(images)))
    cols = len(images)//rows
        
    # size and spacing
    FIGSIZE = 13.0
    SPACING = 0.1
    subplot=(rows,cols,1)
    if rows < cols:
        plt.figure(figsize=(FIGSIZE,FIGSIZE/cols*rows))
    else:
        plt.figure(figsize=(FIGSIZE/rows*cols,FIGSIZE))
    
    # display
    for i, (image, label) in enumerate(zip(images[:rows*cols], labels[:rows*cols])):
        if show_class_names:
            title = '' if label is None else SHORT_CLASSES[label]
        else:
            title = ''
        correct = True
        if predictions is not None:
            title, correct = title_from_label_and_target(predictions[i], label)
        # magic formula tested to work from 1x1 to 10x10 images
        dynamic_titlesize = FIGSIZE*SPACING/max(rows,cols)*40+3
        subplot = display_one_image(image, title, subplot, 
                                     not correct, titlesize=dynamic_titlesize)
    
    #layout
    plt.tight_layout()
    if label is None and predictions is None:
        plt.subplots_adjust(wspace=0, hspace=0)
    else:
        plt.subplots_adjust(wspace=SPACING, hspace=SPACING)
    plt.show()

## Dataset visualizations

In [None]:
# Peek at training data

dataset = get_dataset(ALL_TFRECS)
dataset = dataset.unbatch().batch(20)
img_batch = iter(dataset)

In [None]:
# run this cell again for next set of images
display_batch_of_images(next(img_batch))

In [None]:
del dataset, img_batch
gc.collect()

## Extract image features

To extract image embeddings, we follow the procedure suggested by Chris Deotte in [one of his Melanoma competition public kernels](https://www.kaggle.com/cdeotte/rapids-cuml-knn-find-duplicates). The idea is very simple: we add a `GlobalAveragePooling2D` layer to a pre-trained EfficientNet model and then run every image through the resulting network. The output of the `GlobalAveragePooling2D` layer is a long vector with more than a thousand components; the exact number depends on the EfficientNet variant (1,792 for EfficientNetB4, the model used here). We use this vector as the embedding of the corresponding image.
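
To make the pooling step concrete, here is a NumPy sketch of what global average pooling does to a batch of feature maps. The array sizes are arbitrary stand-ins chosen for illustration, not the real output shape of the backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the backbone output of one batch: (batch, height, width, channels).
feature_maps = rng.normal(size=(2, 4, 4, 8))

# Global average pooling: average over the two spatial axes,
# leaving one number per channel for every image.
embeddings = feature_maps.mean(axis=(1, 2))
```

Each image is thus reduced to a vector whose length equals the number of channels, regardless of the spatial size of the feature map.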

In [None]:
# EXTRACT LAST LAYER OF EFFICIENT NET WITH GLOBAL AVERAGE POOLING

def build_model():
    with strategy.scope():
        pretrained_model = model_selector[MODEL](input_shape=(*IMAGE_SIZE, 3),
                                                      weights='imagenet',
                                                      include_top=False
                                                      )
        inp = tf.keras.layers.Input(shape=(*IMAGE_SIZE, 3))
        x = pretrained_model(inp)
        out = tf.keras.layers.GlobalAveragePooling2D()(x)
        model = tf.keras.Model(inputs=inp, outputs=out)
        return model

In [None]:
%%time

model = build_model()

In [None]:
def retrieve_image(image, image_id_or_label):
    return image

In [None]:
def retrieve_id_or_label(image, image_id_or_label):
    return image_id_or_label

Computing the embeddings and the corresponding image IDs.

In [None]:
%%time

ds = get_dataset(ALL_TFRECS)

ds_imgs = ds.map(retrieve_image, num_parallel_calls=AUTO) 
ds_ids = ds.map(retrieve_id_or_label, num_parallel_calls=AUTO).unbatch()

embs = model.predict(ds_imgs,verbose=1)

image_ids=next(iter(ds_ids.batch(count_data_items(ALL_TFRECS)))).numpy().astype('U')

In [None]:
print(f"The shapes of the embeddings and image IDs are {embs.shape} and {image_ids.shape}, respectively.")

Saving the embeddings and the image IDs.

In [None]:
%%time

np.save(f'embeddings_train_{MODEL}_{HEIGHT}x{WIDTH}', embs)
np.save(f'image_ids_train_{MODEL}_{HEIGHT}x{WIDTH}', image_ids)
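
Note that `np.save` appends the `.npy` extension when the file name does not already have it, so the files have to be loaded with that extension. The round trip below uses small stand-in arrays and a temporary directory; the file names are hypothetical.

```python
import os
import tempfile

import numpy as np

embs_demo = np.arange(6, dtype=np.float32).reshape(2, 3)  # stand-in embeddings
ids_demo = np.array(["a.jpg", "b.jpg"])                   # stand-in image IDs

with tempfile.TemporaryDirectory() as tmp:
    # The files are written as embeddings_demo.npy and image_ids_demo.npy.
    np.save(os.path.join(tmp, "embeddings_demo"), embs_demo)
    np.save(os.path.join(tmp, "image_ids_demo"), ids_demo)

    embs_loaded = np.load(os.path.join(tmp, "embeddings_demo.npy"))
    ids_loaded = np.load(os.path.join(tmp, "image_ids_demo.npy"))
```

Since the image IDs are stored as a unicode string array rather than as Python objects, they load without `allow_pickle=True`.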

## Finding duplicates with DBSCAN

To search for duplicated images, we apply the method suggested by Alex Shonenkov in his Melanoma competition [public kernel](https://www.kaggle.com/shonenkov/dbscan-clustering-check-marking). The idea is to tune the parameters of the `DBSCAN` clustering algorithm to split the dataset into clusters of similar images. Then we visually inspect the resulting clusters to see whether or not they contain true duplicates.

If you want to learn more about DBSCAN and how it works, please refer to the following scikit-learn page: [DBSCAN](https://scikit-learn.org/stable/modules/clustering.html#dbscan).
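
With `min_samples=1`, as used in the cell below, every point counts as a core point, so DBSCAN reduces to grouping points that are connected by chains of neighbors no farther than `eps` apart. The sketch below illustrates that special case with NumPy and a small union-find. It is a toy re-implementation for intuition only, not scikit-learn's algorithm, and the function name and toy data are hypothetical.

```python
import numpy as np


def cluster_within_eps(points, eps):
    """Label points so that any two points within `eps` share a cluster."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise Euclidean distances (fine for a toy example, O(n^2) memory).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))

    # Merge every pair of points that lie within eps of each other.
    for i in range(n):
        for j in range(i + 1, n):
            if dists[i, j] <= eps:
                parent[find(i)] = find(j)

    # Relabel the roots as consecutive cluster ids.
    roots = [find(i) for i in range(n)]
    relabel = {}
    return np.array([relabel.setdefault(r, len(relabel)) for r in roots])


# Toy 2-D "embeddings": two near-identical points and two isolated ones.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [10.0, 0.0]])
labels = cluster_within_eps(toy, eps=0.5)
```

The two nearby points end up in one cluster while each isolated point forms a cluster of its own, which is why clusters with more than one member are the duplicate candidates.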

In [None]:
%%time

clusters = defaultdict(list)
for image_name, cluster_id in zip(image_ids, DBSCAN(eps=3.0, min_samples=1, n_jobs=4).fit_predict(embs)):
    clusters[cluster_id].append(image_name)

Find clusters with more than one element.

In [None]:
potential_duplicates = np.array([c for c in clusters.values() if len(c)>1])
potential_duplicates

We found 5 pairs of potential duplicates. Let's plot them for visual inspection.

In [None]:
def plot_duplicates(dups, h_factor=7, v_factor=6, font_size=14):
    n_dups_max=max([len(d) for d in dups])
    n_rows=len(dups)

    # squeeze=False keeps `ax` two-dimensional even for a single row or column
    fig, ax = plt.subplots(n_rows, n_dups_max, figsize=(n_dups_max*h_factor, n_rows*v_factor),
                           squeeze=False)

    for j, image_ids in enumerate(dups):
        for i, image_id in enumerate(image_ids):
            
            PATH_IMG = PATH/'train_images'
            label = str(train.loc[train['image_id'] == image_id, 'label'].values[0])
                
            image = Image.open(PATH_IMG/image_id)

            ax[j][i].imshow(image)
            ax[j][i].set_title("image_id: " + image_id + ";  label: " + label, 
                               fontsize=font_size)
            ax[j][i].axis('off')

In [None]:
plot_duplicates(potential_duplicates)

## Brief discussion of the result

Visual inspection reveals that all of these pairs contain duplicated images. The method yields no false positives, so its *precision* $ = TP/(TP + FP) = 1$, where $TP$ and $FP$ stand for the true and false positives, respectively. (Of course, this is the precision on the competition training set; I cannot guarantee that the method will perform equally well on arbitrary data.)
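
As a worked check of the formula, the snippet below plugs in the counts from this notebook: five flagged pairs, all confirmed as duplicates. The helper `precision` is introduced here for illustration only.

```python
def precision(true_positives, false_positives):
    """Fraction of flagged pairs that are real duplicates."""
    return true_positives / (true_positives + false_positives)


# Five flagged pairs, all confirmed by visual inspection.
p = precision(true_positives=5, false_positives=0)
```

Recall would additionally require the number of duplicates the method missed, which is unknown here.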

Also, note that the `['1562043567.jpg', '3551135685.jpg']` pair contains duplicated images carrying different labels. This is a great example of the noisy labels that we have to deal with in this competition.
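
Label conflicts like this one can also be flagged automatically instead of by eye. The sketch below uses hypothetical file names and labels purely to illustrate the check; `clusters_demo` and `label_of` are not variables from the notebook.

```python
# Hypothetical clusters and labels, used only to illustrate the check.
clusters_demo = [["a.jpg", "b.jpg"], ["c.jpg", "d.jpg"]]
label_of = {"a.jpg": 3, "b.jpg": 3, "c.jpg": 0, "d.jpg": 4}

# Keep the clusters whose members do not all carry the same label.
conflicting = [
    cluster for cluster in clusters_demo
    if len({label_of[image_id] for image_id in cluster}) > 1
]
```

In the notebook, a lookup like `label_of` could be built with `dict(zip(train['image_id'], train['label']))` and applied to `potential_duplicates`.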

## Saving the duplicate filenames

In [None]:
np.save('duplicates', potential_duplicates)

Thank you for reading! Please kindly upvote this notebook if you find it helpful!