# Detecting Duplicate Images via Image Embeddings

A Convolutional Neural Network (CNN) for classification consists of two parts, the so-called backbone and head. The backbone is made of convolutional and pooling layers; during training its job is feature learning, i.e., learning to extract relevant features from images. At inference time the backbone converts an input image into the so-called embedding, which is a vector representation of the image. The head consists of a fully connected layer which serves as a classifier, i.e., it takes the image embedding as input and outputs a probability for each class.

Another way to think about it is that the backbone converts unstructured data such as images into more structured data, namely image embeddings in a vector space, which can then be classified more easily with traditional machine learning algorithms such as Logistic Regression, SVM, and decision-tree-based models such as Random Forest. A single fully connected layer followed by a softmax is essentially Softmax Regression, which is just a generalization of Logistic Regression to multi-class problems, i.e., classification problems with more than two classes.
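
To make the link between Softmax Regression and Logistic Regression concrete, here is a minimal sketch in plain Python. The logit values are made up for illustration and do not come from any model in this notebook. It checks that with only two classes the softmax probability of the first class equals the sigmoid of the difference between the two logits.

```python
import math

def softmax(logits):
    """Turn a list of raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function used by binary Logistic Regression."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logits that a fully connected layer could output for two classes.
z_first, z_second = 1.7, -0.4

p_softmax = softmax([z_first, z_second])[0]
p_sigmoid = sigmoid(z_first - z_second)

# Both numbers agree up to floating point error.
print(p_softmax, p_sigmoid)
```

With more than two classes the same `softmax` function simply receives a longer list of logits, which is the sense in which it generalizes the binary case.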

Since embeddings are essentially vectors (one can think of the term embedding as a synonym for vector), we can apply concepts and methods of linear algebra to analyse our data once we have produced embeddings for all our images. Specifically, for the task of weeding out duplicate images, embeddings offer a simple way of measuring how similar one image is to another: it comes down to measuring how close the respective embeddings of these images are to one another. One can measure closeness or similarity between two vectors via either Euclidean distance or cosine similarity. I will use cosine similarity for my task, since I expect the embeddings coming out of the backbone to be of roughly similar length: I will be feeding normalized image tensors to the backbone, and the backbone itself also normalizes the data passing through it via batch normalization. Thus, the question of how different one embedding is from another can be better assessed via the angle between them than via Euclidean distance.
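
The difference between the two measures can be illustrated with a small sketch in plain Python. The three vectors are toy values chosen for easy arithmetic, not real embeddings, and the helper names are my own. Cosine similarity ignores vector length while Euclidean distance does not, and for unit-length vectors the two are tied together by $d^2 = 2 - 2\cos\theta$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine_sim(u, v):
    """Cosine of the angle between u and v; independent of their lengths."""
    return dot(u, v) / (norm(u) * norm(v))

def euclidean_dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def unit(u):
    """Rescale u to length 1."""
    n = norm(u)
    return [a / n for a in u]

a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]   # same direction as a, twice as long
c = [2.0, 1.0, -2.0]  # orthogonal to a

# Same direction: cosine similarity is 1 although the vectors are 3 apart.
print(cosine_sim(a, b), euclidean_dist(a, b))

# For unit-length vectors, squared distance equals 2 - 2 * cosine similarity.
lhs = euclidean_dist(unit(a), unit(c)) ** 2
rhs = 2 - 2 * cosine_sim(a, c)
print(lhs, rhs)
```

When all embeddings have about the same length, ranking pairs by cosine similarity and ranking them by Euclidean distance give nearly the same order, so the choice mostly affects how the threshold values read.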

One last aspect to mention, in case it is not clear, is that one has to use a pre-trained CNN to produce embeddings for images. In other words, one cannot use a CNN with randomly initialized weights: such a network has not learned any features, cannot extract anything useful from the images passed through it, and will produce meaningless embeddings. What data does a CNN have to be pre-trained on in order to be useful for converting images to embeddings? Well, there is no concrete answer to this question; it all depends on what kind of images one wants to convert into embeddings. The general "rule" is that the images you have need to be somewhat close or analogous to the data the CNN was pre-trained on. In our case, images of leaves, we are extra lucky: the ImageNet dataset, famous in Deep Learning (DL) and in Computer Vision (CV) specifically and used as a benchmark for pre-training almost all well-known CNN families, has lots of images which contain leaves. This means that CNNs pre-trained on ImageNet are well equipped for extracting useful features from images of leaves and converting them into embeddings. Had we had a different type of data, let us say X-ray images, CNNs pre-trained on ImageNet might not have been so useful, and we might have had to train a CNN from scratch on our own data and then use it to produce the embeddings we need.

In [None]:
%matplotlib inline

# Loading packages

In [None]:
import os
import cv2
import ast
import numpy as np
import pandas as pd
import albumentations as A
import matplotlib.pyplot as plt

from tqdm import tqdm
from albumentations.pytorch import ToTensorV2
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
import torch
import torch.nn as nn
import torchvision.models as models
from torch.utils.data import Dataset, DataLoader

# Defining environment

In [None]:
DATA_ROOT = os.path.join('..', 'input')
DATA_COMPT = os.path.join(DATA_ROOT, 'plant-pathology-2021-fgvc8')
DATA_TRAIN_IMAGES = os.path.join(DATA_COMPT, 'train_images')

### I am using resized images in this notebook, which I prepared and made publicly available [here](https://www.kaggle.com/datasciencegeek/resize-images)

In [None]:
DATA_RESIZE_IMAGES = os.path.join(DATA_ROOT, 'resize-images')

In [None]:
DATA_TRAIN_IMAGES_2672x4000 = os.path.join(DATA_RESIZE_IMAGES, 'train_images_2672x4000')
DATA_TRAIN_IMAGES_224 = os.path.join(DATA_RESIZE_IMAGES, 'train_images_224')
DATA_TRAIN_IMAGES_224x336 = os.path.join(DATA_RESIZE_IMAGES, 'train_images_224x336')
DATA_TRAIN_IMAGES_448 = os.path.join(DATA_RESIZE_IMAGES, 'train_images_448')
DATA_TRAIN_IMAGES_448x670 = os.path.join(DATA_RESIZE_IMAGES, 'train_images_448x670')

In [None]:
DATA_DUPLS_HASH = os.path.join(DATA_ROOT, 'pp2021-duplicates-revealing')

In [None]:
DATA_IMG_STATS = os.path.join(DATA_ROOT, 'plant-pathology-2021-metadata-with-image-stats')

In [None]:
DATA_OUTPUT = './'

# Loading metadata

In [None]:
df_train = pd.read_csv(os.path.join(DATA_COMPT, 'train.csv'))

In [None]:
df_train.head()

In [None]:
meta_cols = list(df_train.columns); meta_cols

## Loading and formatting metadata of the 50 duplicates previously found via the hashing method [here](https://www.kaggle.com/nickuzmenkov/pp2021-duplicates-revealing) and discussed [here](https://www.kaggle.com/c/plant-pathology-2021-fgvc8/discussion/227829)

In [None]:
df_dupls_hash = pd.read_csv(os.path.join(DATA_DUPLS_HASH, 'duplicates.csv'), header=None)

In [None]:
df_dupls_hash.head()

In [None]:
df_dupls_hash.shape

In [None]:
df_dupls_hash = pd.DataFrame(data=df_dupls_hash.apply(lambda x: [x[0],x[1]], axis=1), columns=['image'])

In [None]:
df_dupls_hash.head()

In [None]:
df_dupls_hash = pd.concat([df_dupls_hash,
          pd.DataFrame(data={'image':df_dupls_hash['image'].apply(lambda x: x[::-1])})]).reset_index(drop=True)

dupls_hash = set(df_dupls_hash['image'].apply(lambda x: tuple(x)))

In [None]:
df_dupls_hash.shape

In [None]:
len(dupls_hash)

# Images to Embeddings

## Defining all necessary functions and classes

In [None]:
def show_img(image):
    plt.figure(figsize=(10,10))
    plt.imshow(image)

In [None]:
def plot_dist_stats(df, col, **kwargs):
    mn  = round(df[col].min(), 2)
    mx  = round(df[col].max(), 2)
    avg = round(df[col].mean(), 2)
    std = round(df[col].std(), 2)

    df[col].hist(label=f'min, max = ({mn}, {mx})\navg, std = ({avg}, {std})', **kwargs)
    plt.legend()
    plt.title(col)
    plt.show()

In [None]:
def plot_duplicates(path:str, df:pd.DataFrame, df_meta:pd.DataFrame, target)->list:
    
    images_target_inconsistent = []
    for i in df.index:
        image_id1 = df.loc[i, 'image'][0]
        image_id2 = df.loc[i, 'image'][1]
        similarity = round(df.loc[i, 'similarity'], 3)
        
        image1 = cv2.imread(os.path.join(path, f'{image_id1}'))
        image1 = cv2.cvtColor(image1, cv2.COLOR_BGR2RGB)
        image2 = cv2.imread(os.path.join(path, f'{image_id2}'))
        image2 = cv2.cvtColor(image2, cv2.COLOR_BGR2RGB)

        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15,15))
        ax1.imshow(image1)
        ax2.imshow(image2)

        l1 = df_meta.loc[df_meta['image']==image_id1, target].tolist()[0].split()
        l2 = df_meta.loc[df_meta['image']==image_id2, target].tolist()[0].split()
        
        print(f"Index={i}, Similarity={similarity}")
        ax1.title.set_text(f'ImageID: {image_id1}\nLabel: {l1}')
        ax2.title.set_text(f'ImageID: {image_id2}\nLabel: {l2}')
        plt.show()

        if set(l1)!=set(l2): images_target_inconsistent.extend([image_id1,image_id2])
    
    return images_target_inconsistent

In [None]:
class Image2EmbeddingDataset(Dataset):
    
    def __init__(self, path:str, df:pd.DataFrame, filename_col, transform=None):
        self.path = path
        self.df = df
        self.filename_col = filename_col
        self.transform = transform
        
        self.filenames = self.df[self.filename_col].tolist()
        self.trfTranspose = A.Transpose(p=1)

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        image_filepath = os.path.join(self.path, self.filenames[idx])
        image = cv2.imread(image_filepath)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        
        height, width, _ = image.shape
        if height/width > 1: image = self.trfTranspose(image=image)['image']
        
        if self.transform is not None: image = self.transform(image=image)["image"]
        return image

In [None]:
def images_to_embeddings(model, path:str, df:pd.DataFrame, resize_to:dict, filename_col, **kwargs)->pd.DataFrame:
    df = df.reset_index(drop=True)
    
    backbone = nn.Sequential(*list(model.children())[:-1]).eval()
    
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    backbone = backbone.to(device=device)
    
    transform = A.Compose(
        [
            A.Resize(height=resize_to['height'], width=resize_to['width'], p=1),
            A.Normalize(),
            ToTensorV2(),
        ]
    )
    dataset = Image2EmbeddingDataset(path=path, df=df, filename_col=filename_col, transform=transform)
    dataloader = DataLoader(dataset=dataset, shuffle=False, pin_memory=True, **kwargs)
    
    embeddings = []
    stream = tqdm(dataloader)
    with torch.no_grad():
        for images in stream:
            images = images.to(device=device, non_blocking=True)
            
            embeddings_batch = backbone(images)
            embeddings_batch = embeddings_batch.view(embeddings_batch.size(0), -1)
            
            embeddings_batch = embeddings_batch.cpu().numpy()
            embeddings.append(embeddings_batch)
            
            stream.set_description(f"Using device: {device}. Converting images to embeddings")
    
    embeddings = np.concatenate(embeddings)
    embeddings = pd.DataFrame(data=embeddings)
    
    return pd.concat([df, embeddings], axis=1)

In [None]:
def get_image_similarities(df:pd.DataFrame, filename_col, embedding_cols:list)->pd.DataFrame:
    ids = df[filename_col].tolist()
    
    similarity_matrix = cosine_similarity(X=df[embedding_cols])
    similarity_matrix = np.tril(m=similarity_matrix, k=-1)
    similarity_matrix += np.triu(m=np.full(shape=similarity_matrix.shape, fill_value=-2, dtype=int), k=0)
    
    idxs = np.unravel_index(indices=np.argsort(a=similarity_matrix, axis=None), shape=similarity_matrix.shape)
    idxs = (np.flip(m=idxs[0]), np.flip(m=idxs[1]))
    
    similarity_matrix = similarity_matrix[idxs]
    
    idxs = np.array(idxs)
    image_pairs = [[ids[idxs[:,i][0]], ids[idxs[:,i][1]]] for i in range(len(idxs[0]))]
    
    df_res = pd.DataFrame(data = {filename_col:image_pairs, 'similarity':similarity_matrix})
    df_res = df_res[df_res['similarity']!=-2].reset_index(drop=True)
    
    return df_res

## Loading PyTorch native CNNs pre-trained on ImageNet

Information on PyTorch native models pre-trained on ImageNet data at 224x224 resolution can be found [here](https://pytorch.org/vision/stable/models.html).

In [None]:
arch = models.resnet18(pretrained=True)
#arch = models.resnet34(pretrained=True)
#arch = models.resnet50(pretrained=True)

## Converting images to embeddings. It is way faster to do it on GPU!

In [None]:
resize_to_448x670 = {'height':448, 'width':670}

In [None]:
# By default this loads pre-calculated embeddings which I computed on my local machine, to save time when running this notebook on Kaggle. Feel free to fork it and play with it though.
try:
    df_train = pd.read_csv(os.path.join(DATA_IMG_STATS, 'train_resnet18_448x670.csv'))
except FileNotFoundError:
    # Fall back to computing the embeddings from scratch if the pre-calculated file is missing.
    df_train = images_to_embeddings(model=arch, path=DATA_TRAIN_IMAGES, df=df_train[meta_cols], resize_to=resize_to_448x670, filename_col='image', batch_size=64)

In [None]:
df_train.head()

# Computing cosine similarities for every pair of distinct images in our dataset, i.e., for N images in the dataset there are going to be $(N^2 - N)/2$ similarities.
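
As a quick illustration of where the pair count comes from, the sketch below enumerates unordered pairs with the standard library and compares the count with the formula. The file names are hypothetical and the helper name is my own.

```python
from itertools import combinations

def count_unordered_pairs(n):
    """Number of distinct unordered pairs among n items: (n^2 - n) / 2."""
    return (n * n - n) // 2

# Hypothetical image file names, only to have something to pair up.
image_ids = ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg', 'e.jpg']

# combinations() yields each unordered pair exactly once and never pairs
# an image with itself, which is what the strictly lower triangle of the
# similarity matrix contains.
pairs = list(combinations(image_ids, 2))

print(len(pairs), count_unordered_pairs(len(image_ids)))
```

The quadratic growth is why the full table of pairs becomes very large for a dataset of many thousands of images.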

In [None]:
embedding_cols = [col for col in df_train.columns if col not in meta_cols]

In [None]:
len(embedding_cols)

In [None]:
# Here I also load pre-computed pairs with their respective similarities by default, since the full file of pairs is over 17 GB and won't fit in the Kaggle kernel's memory.
# Thus, I am loading a partial file of pairs with similarities, in which I retained all pairs necessary for this notebook to reproduce the results I got on my local machine.
try:
    df_pairs = pd.read_csv(os.path.join(DATA_IMG_STATS, 'train_pairs_partial_resnet18_448x670.csv'))
except FileNotFoundError:
    df_pairs = get_image_similarities(df=df_train, filename_col='image', embedding_cols=embedding_cols)

In [None]:
df_pairs.head()

In [None]:
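# Pairs read from CSV are strings like "['a.jpg', 'b.jpg']" and need parsing; pairs computed in memory are already lists.
df_pairs['image'] = df_pairs['image'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)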

### sanity check for number of similarities computed:

In [None]:
len(df_pairs)

In [None]:
(len(df_train)**2 - len(df_train))//2

In [None]:
plot_dist_stats(df=df_pairs, col='similarity')

In [None]:
df_pairs['similarity'].median()

# Detecting duplicates via image similarities

**Method Description:**

Since 50 duplicates have already been detected via the hashing method, I will use them to speed up my process of duplicate detection.

1) I will determine the minimum similarity threshold for selecting duplicate suspects as the value above which all 50 duplicates already detected via the hashing method are selected, since the hashing method detects the most exact duplicates.

2) I will visually inspect all new suspects, if any, selected once the minimum threshold found in step #1 is applied.

3) I will look for more duplicates among image pairs whose similarities fall below the minimum threshold found in step #1, by selecting pairs in bins of similarity values and moving down in similarity until the number of actual duplicates detected becomes too small to justify lowering the threshold further.

4) Finally, I will visually re-examine all suspects which fall above the lowest threshold chosen in step #3 and keep only the actual duplicates.
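
Step 1 of the method can be sketched in a few lines of plain Python. The similarity values and file names below are invented for illustration, and `min_threshold_covering` is a hypothetical helper, not a function used elsewhere in this notebook. The idea is that the threshold is the smallest similarity among the already known duplicate pairs, and any other pair at or above it becomes a new suspect.

```python
def min_threshold_covering(similarities, known_pairs):
    """Smallest similarity among the known duplicate pairs.

    similarities: dict mapping frozenset({id1, id2}) -> similarity
    known_pairs:  iterable of (id1, id2) tuples, in any order
    """
    return min(similarities[frozenset(pair)] for pair in known_pairs)

# Hypothetical similarities between four images.
similarities = {
    frozenset(('a.jpg', 'b.jpg')): 0.999,
    frozenset(('c.jpg', 'd.jpg')): 0.991,
    frozenset(('a.jpg', 'c.jpg')): 0.993,
    frozenset(('b.jpg', 'd.jpg')): 0.950,
}

# Pairs assumed to be known duplicates (e.g. from a hashing method).
known_duplicates = [('b.jpg', 'a.jpg'), ('c.jpg', 'd.jpg')]
known_set = {frozenset(pair) for pair in known_duplicates}

threshold = min_threshold_covering(similarities, known_duplicates)

# New suspects: pairs at or above the threshold that are not already known.
suspects = sorted(
    tuple(sorted(pair))
    for pair, sim in similarities.items()
    if sim >= threshold and pair not in known_set
)

print(threshold, suspects)
```

Using `frozenset` keys makes the lookup independent of the order of the two image IDs, which is an alternative to storing every pair in both orders.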

### 1) I will determine the minimum similarity threshold for selecting duplicate suspects as the value above which all 50 duplicates already detected via the hashing method are selected, since the hashing method detects the most exact duplicates.

In [None]:
mask_similarity = df_pairs['similarity']>0.987; mask_similarity.sum()

In [None]:
# model    | img_size  | min th | count of image pairs above min th
# resnet18 | 224       | 0.9491 | 22036
# resnet18 | 224x336   | 0.9668 | 2257
# resnet18 | 448x670   | 0.987  | 292       <-- best
# resnet18 | 448       | 0.984  | 244       <-- second best
# resnet18 | 2672x4000 | 0.972  | 19421934

# resnet34 | 224x336   | 0.968  | 42003
# resnet34 | 224       | 0.951  | 299414
# resnet34 | 448x670   | 0.9871 | 833
# resnet34 | 448       | 0.983  | 3569
# resnet34 | 2672x4000 | 0.974  | 8960566

# resnet50 | 224       | 0.958  | 68493
# resnet50 | 224x336   | 0.974  | 6240
# resnet50 | 448x670   | 0.9877 | 3145
# resnet50 | 448       | 0.984  | 6507
# resnet50 | 2672x4000 | 0.982  | 8329380

In [None]:
df_dupl = df_pairs[mask_similarity].reset_index(drop=True).copy()

In [None]:
dupls = set(df_dupl['image'].apply(lambda x: tuple(x)))

In [None]:
overlap = dupls.intersection(dupls_hash)

In [None]:
len(overlap)

### 2) I will visually inspect all new suspects, if any, selected once the minimum threshold found in step #1 is applied.

In [None]:
df_dupl_new = df_dupl[~df_dupl['image'].apply(lambda x: tuple(x)).isin(overlap)].reset_index(drop=True)

In [None]:
len(df_dupl_new)

In [None]:
df_dupl_hash = df_dupl[df_dupl['image'].apply(lambda x: tuple(x)).isin(overlap)].reset_index(drop=True)

In [None]:
len(df_dupl_hash)

In [None]:
# visualizing previously found duplicates via hashing method
_ = plot_duplicates(path=DATA_TRAIN_IMAGES_224x336, df=df_dupl_hash.head(), df_meta=df_train, target='labels')

In [None]:
# visualizing newly found duplicates
_ = plot_duplicates(path=DATA_TRAIN_IMAGES_224x336, df=df_dupl_new.head(), df_meta=df_train, target='labels')

### 3) I will look for more duplicates among image pairs whose similarities fall below the minimum threshold found in step #1, by selecting pairs in bins of similarity values and moving down in similarity until the number of actual duplicates detected becomes too small to justify lowering the threshold further.
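
The bin-by-bin search can be sketched as follows in plain Python. The toy pairs are invented and `pairs_in_bin` is a hypothetical helper. The duplicate counts per bin are the ones recorded by hand in this notebook, and the sketch only turns them into hit rates to show why it stops paying off to go lower.

```python
def pairs_in_bin(pairs, low, high):
    """Return the (image_pair, similarity) entries with low < similarity <= high."""
    return [(ids, sim) for ids, sim in pairs if low < sim <= high]

# Hypothetical pairs, only to exercise the function.
toy_pairs = [
    (('a.jpg', 'b.jpg'), 0.9868),
    (('c.jpg', 'd.jpg'), 0.9857),
    (('e.jpg', 'f.jpg'), 0.9853),
    (('g.jpg', 'h.jpg'), 0.9848),
    (('i.jpg', 'j.jpg'), 0.9801),
]

# Bin edges walked from high to low similarity.
bin_edges = [0.987, 0.986, 0.9855, 0.985, 0.9846]
bin_counts = [len(pairs_in_bin(toy_pairs, low, high))
              for high, low in zip(bin_edges, bin_edges[1:])]
print(bin_counts)

# (duplicates confirmed by eye, pairs in the bin) as recorded in this notebook.
inspected = {
    '0.986-0.9855': (27, 173),
    '0.9855-0.985': (18, 248),
    '0.985-0.9846': (13, 278),
}
hit_rates = {name: dupls / total for name, (dupls, total) in inspected.items()}
for name, rate in hit_rates.items():
    print(f'{name}: {rate:.1%} of inspected pairs were duplicates')
```

The falling hit rate is one possible stopping criterion; where exactly to stop remains a judgment call about how much manual inspection is worth the extra duplicates.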

In [None]:
mask_similarities_bin = (df_pairs['similarity']>0.9846)&(df_pairs['similarity']<=0.985); mask_similarities_bin.sum()

In [None]:
# th bin       | findings
# 0.987-0.986,  has many dupls
# 0.986-0.9855, has 27 dupls of 173 pairs
# 0.9855-0.985, has 18 dupls of 248 pairs
# 0.985-0.9846, has 13 dupls of 278 pairs although, mostly minority classes dupls!

In [None]:
_ = plot_duplicates(path=DATA_TRAIN_IMAGES_224x336, df=df_pairs[mask_similarities_bin].head(), df_meta=df_train, target='labels')

### 4) Finally, I will visually re-examine all suspects which fall above the lowest threshold chosen in step #3 and keep only the actual duplicates.

In [None]:
mask_similarity = df_pairs['similarity']>0.985; mask_similarity.sum()

In [None]:
df_dupl = df_pairs[mask_similarity].reset_index(drop=True).copy()

In [None]:
df_dupl_new = df_dupl[~df_dupl['image'].apply(lambda x: tuple(x)).isin(overlap)].reset_index(drop=True)

In [None]:
len(df_dupl_new)

In [None]:
len(df_dupl_hash)

In [None]:
chunk = 800 # from 0 to 800 with increment of 100, i.e., 0, 100, 200, ....800

In [None]:
_ = plot_duplicates(path=DATA_TRAIN_IMAGES_224x336, df=df_dupl_new.iloc[chunk:chunk+100].head(), df_meta=df_train, target='labels')

#### manually filled while visually inspecting all of the suspects for duplicates:

In [None]:
not_dupl_index = [48,63,67,77,82,88,89,90,92,93,105,112,114,126,128,129,132,139,143,146,147,149,151,153,156,157,
                 158,161,162,164,165,166,170,173,175,179,180,181,182,183,185,186,191,192,194,195,196,203,205,207,
                 208,210,211,212,214,215,216,218,220,221,222,225,226,227,228,229,230,231,232,234,235,236,237,238,
                 241,242,243,244,246,247,248,249,250,251,252,253,254,257,258,259,260,262,263,264,266,267,268,271,
                 272,274,278,279,281,282,283,284,285,287,288,289,290,291,292,293,297,298,299,300,302,303,304,305,
                 306,307,308,309,311,312,314,315,316,317,318,320,321,323,324,325,327,328,329,331,332,333,334,335,
                 337,340,341,343,345,347,348,350,351,353,354,355,357,358,359,360,361,362,363,364,365,366,367,368,
                 370,371,373,374,375,376,377,379,380,381,383,384,387,388,389,391,392,393,394,395,396,397,398,399,
                 400,401,402,403,404,405,406,408,409,412,413,415,416,418,419,420,421,422,423,424,425,426,427,428,
                 429,430,432,433,434,435,438,439,440,443,447,449,450,452,453,454,455,456,457,459,460,461,462,463,
                 464,465,466,467,468,470,472,473,475,476,477,478,479,480,481,482,485,486,487,488,489,490,491,492,
                 493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,512,513,514,516,517,518,519,
                 521,522,523,524,525,526,527,529,530,532,533,536,537,539,540,541,542,543,544,545,546,548,549,550,
                 551,552,553,554,555,556,557,558,559,560,562,563,564,567,568,569,570,571,572,573,575,577,578,579,
                 580,581,582,583,584,585,586,589,590,594,595,596,597,598,599,600,602,603,605,606,607,608,609,610,
                 611,612,613,614,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,632,633,634,635,637,
                 638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,
                 662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,
                 687,688,692,693,694,695,696,697,698,699,700,701,703,706,707,708,709,710,711,712,714,715,716,717,
                 718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,
                 742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,761,762,763,764,765,767,
                 768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,789,790,791,792,
                 793,794,795,796,797,798,799,800,801,802,803,804,805,806,807,808,809,810,811,812,813,815,816,818,
                 819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,841,842,843,
                 844,845,846,847,848,850,851,853,854,855,856,857,858,859,860,861,862,864,865,866,867,868,869,870,
                 872]

In [None]:
len(not_dupl_index)

In [None]:
len(set(not_dupl_index))

In [None]:
df_dupl_new = df_dupl_new.drop(labels=not_dupl_index).reset_index(drop=True)

In [None]:
df_dupl_new.shape

## Conclusions

I found 270 more duplicates. There are still more (at least 13), but I do not think it is worth my time to keep hunting for them. The duplicates can be split into two categories:

1) Exact duplicates.

2) Images of the same leaf taken differently: at different angles, with the leaf at a different location within the image, or under other varying conditions.

The second category is rather varied with respect to how close or how different the two images in a pair are. I have not decided yet how exactly I am going to deal with them when it comes to model development.

### putting together newly detected duplicates with 50 pairs found before:

In [None]:
df_dupl_all = pd.concat([df_dupl_hash, df_dupl_new]).sort_values(by='similarity', ascending=False).reset_index(drop=True)

In [None]:
df_dupl_all.shape

In [None]:
100*2*df_dupl_all.shape[0]/df_train.shape[0]

The duplicates account for just ~3.4% of all images. So much work for such a small result :( Well, at least now I nailed down the method! :)
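
One caveat about the percentage above: counting two images per pair is an upper bound, because an image that appears in several pairs gets counted more than once. A minimal sketch of the exact count, using made-up pairs and a hypothetical helper name, collects the distinct image IDs instead.

```python
def duplicate_share(pairs, n_images):
    """Percentage of images that take part in at least one duplicate pair."""
    involved = {image_id for pair in pairs for image_id in pair}
    return 100 * len(involved) / n_images

# Hypothetical duplicate pairs; 'b.jpg' takes part in two of them.
pairs = [('a.jpg', 'b.jpg'), ('b.jpg', 'c.jpg'), ('d.jpg', 'e.jpg')]
n_images = 100  # assumed dataset size for this toy example

upper_bound = 100 * 2 * len(pairs) / n_images  # two images per pair
exact = duplicate_share(pairs, n_images)       # distinct images only

print(upper_bound, exact)
```

Whether the two numbers differ noticeably for the real data depends on how many images appear in more than one pair, which I have not checked here.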

In [None]:
plot_dist_stats(df=df_dupl_all, col='similarity')

In [None]:
df_dupl_all.to_csv(os.path.join(DATA_OUTPUT, f'train_duplicates_{df_dupl_all.shape[0]}.csv'), index=False)

# Extra Bonus: Detecting most unique and peculiar images via similarity

**Idea and Method Description:**

We have just computed the cosine similarity of every image in the dataset against all other images. We can use this information to detect the least similar, i.e., most unique, images in our dataset. One reason to check out the most unique images is that they may be faulty in some way, and it might be useful to detect such images and remove them before training the model.

Below are the steps to detecting the most unique images using image similarities:

1) Calculate pairwise cosine similarities between all images in the dataset. This step is already completed in the previous section where we utilized image similarities to detect duplicates.

2) Calculate the average similarity for every image in the dataset, i.e., the mean of all similarity values from step #1 in which that image takes part.

3) Sort images by the average similarities calculated in step #2. Visualize a number of images in order from the lowest to the highest average similarity.
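
The steps above can be sketched in plain Python on a toy set of three images. The similarity values and the helper name are made up for illustration. Every similarity contributes to the average of both images in the pair, and sorting by the average puts the most unusual image first.

```python
from collections import defaultdict

def average_similarity_per_image(pairs):
    """Mean similarity of each image over all pairs it takes part in.

    pairs: iterable of ((id1, id2), similarity)
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (id1, id2), sim in pairs:
        for image_id in (id1, id2):
            totals[image_id] += sim
            counts[image_id] += 1
    return {image_id: totals[image_id] / counts[image_id] for image_id in totals}

# Hypothetical similarities; 'c.jpg' is unlike both other images.
toy_pairs = [
    (('a.jpg', 'b.jpg'), 0.9),
    (('a.jpg', 'c.jpg'), 0.2),
    (('b.jpg', 'c.jpg'), 0.3),
]

avg_similarity = average_similarity_per_image(toy_pairs)

# Lowest average similarity first, i.e. most unique image first.
ranked = sorted(avg_similarity, key=avg_similarity.get)
print(ranked)
```

Note that each pair is listed only once, so an image may appear as the first member in some pairs and as the second member in others; both positions have to be counted to get its true average.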

In [None]:
df_train.shape

In [None]:
df_pairs.head()

In [None]:
df_pairs.shape

In [None]:
len(set(df_pairs['image'].apply(lambda x: x[0])))

In [None]:
len(set(df_pairs['image'].apply(lambda x: x[1])))

In [None]:
df_pairs['image1'] = df_pairs['image'].apply(lambda x: x[0])

In [None]:
df_pairs['image2'] = df_pairs['image'].apply(lambda x: x[1])

In [None]:
df_pairs.head()

### 2) Calculate the average similarity for every image in the dataset, i.e., the mean of all similarity values from step #1 in which that image takes part.

In [None]:
# Stack both pair members into a single 'image' column so that every similarity an image takes part in
# is weighted equally, then average per image (averaging the two per-column means would weight them unevenly).
df_img_avg_sim = pd.concat([df_pairs[['image1','similarity']].rename(columns={'image1':'image'}),
                            df_pairs[['image2','similarity']].rename(columns={'image2':'image'})]).groupby(by='image')['similarity'].mean().reset_index().rename(columns={'similarity':'avg_similarity'})

In [None]:
df_img_avg_sim.head()

In [None]:
df_img_avg_sim = df_train[meta_cols].merge(right=df_img_avg_sim, on='image')

### 3) Sort images by the average similarities calculated in step #2. Visualize a number of images in order from the lowest to the highest average similarity.

In [None]:
df_img_avg_sim = df_img_avg_sim.sort_values(by='avg_similarity').reset_index(drop=True)

In [None]:
df_img_avg_sim.head()

In [None]:
for image_id in df_img_avg_sim.head(10)['image']:
    
    image = cv2.imread(os.path.join(DATA_TRAIN_IMAGES_224x336, f'{image_id}'))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    
    l = df_img_avg_sim.loc[df_img_avg_sim['image']==image_id, 'labels'].tolist()
    
    print(f'ImageID: {image_id}\nLabel: {l}')
    show_img(image=image)
    plt.show()

### visually selected to disregard from model development:

In [None]:
bad_images = ['cd3a1d64e6806eb5.jpg','ead085dfac287263.jpg',
              'ccec54723ff91860.jpg','da8770e819d2696d.jpg',]

In [None]:
df_img_avg_sim[df_img_avg_sim['image'].isin(bad_images)].to_csv(os.path.join(DATA_OUTPUT, 'train_bad_images.csv'), index=False)