This notebook demonstrates how to use cosine similarity to find similar images within the training set.

Several discussion threads mention that people found "potatoes" in the training set, which was expected to contain only leaves.

Here we estimate how many of these "potatoes" the training set contains by comparing the feature vector of one known example against the feature vectors of all training images using cosine similarity.
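
As a quick refresher, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite directions) to 1 (same direction). The sketch below is a minimal pure-Python illustration. The helper `cosine_similarity` and the toy vectors are made up for this example and are not part of the notebook's pipeline, which relies on `torch.nn.CosineSimilarity` instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (illustrative only)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])                # close to -1

print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction and not on vector length, it is a common choice for comparing network embeddings.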

In [None]:
import pandas as pd
from PIL import Image
import torch
import torchvision  # needed for the ToTensor fallback in CassavaDataset
import numpy as np
import cv2
from albumentations.pytorch.transforms import ToTensorV2
import albumentations as A

In [None]:
train_csv = pd.read_csv("../input/cassava-leaf-disease-classification/train.csv")

In [None]:
class CassavaDataset(torch.utils.data.Dataset):
    def __init__(self, img_path, df, transforms=None):
        self.path = img_path
        self.df = df
        self.transforms = transforms
        
    def __getitem__(self, index):
        img_name = self.df.image_id.values[index]
        img_arr = cv2.imread(self.path+img_name)
        img_arr_rgb = cv2.cvtColor(img_arr, cv2.COLOR_BGR2RGB)
        
        if self.transforms:
            sample = {'image':img_arr_rgb}
            sample = self.transforms(**sample)
            img_tens = sample['image']
        else:
            img_tens = torchvision.transforms.ToTensor()(img_arr_rgb)
            
        return img_tens
    
    def __len__(self):
        return len(self.df)

In [None]:
valid_transform = A.Compose(
    [A.Resize(256,256),
     A.Normalize(),
     ToTensorV2()])

In [None]:
!pip install timm

In [None]:
import timm

In [None]:
def create_model_ef():
    model = timm.create_model("tf_efficientnet_b1", pretrained=False)
    # five classes only
    num_classes = 5
    model.classifier = torch.nn.Linear(model.classifier.in_features, num_classes)
    return model

In [None]:
# create the model and load the fine-tuned weights (mapped to the CPU)
model = create_model_ef()
model_weights = torch.load("../input/cassava-leaf-disease-classification-training/trained_weights_1", map_location=torch.device('cpu'))
model.load_state_dict(model_weights)

In [None]:
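# "remove" the final classifier layer by replacing it with an identity, so the model returns the pooled features
model.classifier = torch.nn.Identity()
# switch to inference mode so batch norm and dropout behave deterministically
model.eval()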
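
To see what swapping the head for `torch.nn.Identity()` does, here is a minimal sketch on a toy network. `TinyNet`, its layer sizes, and the random input are hypothetical stand-ins, not the EfficientNet used in this notebook. The only point is that the output changes from class logits to the features that used to feed the classifier.

```python
import torch

class TinyNet(torch.nn.Module):
    """Hypothetical stand-in for a backbone followed by a `classifier` head."""
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(8, 16)    # pretend feature extractor
        self.classifier = torch.nn.Linear(16, 5)  # five-class head

    def forward(self, x):
        return self.classifier(self.backbone(x))

net = TinyNet().eval()
x = torch.randn(4, 8)  # batch of 4 fake inputs

with torch.no_grad():
    logits = net(x)                       # output of the classifier head
    net.classifier = torch.nn.Identity()  # head now passes its input through unchanged
    embeddings = net(x)                   # output of the backbone

print(tuple(logits.shape), tuple(embeddings.shape))
```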

In [None]:
train_ds = CassavaDataset("../input/cassava-leaf-disease-classification/train_images/", train_csv, transforms=valid_transform)
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=32, num_workers=2)

In [None]:
features = []
for images in train_dl:
    with torch.no_grad():
        feature = model(images)
        features.append(feature)

In [None]:
# feature vector of an image that was manually verified to be a "potato":
# batch 1, position 10 -> dataset index 1 * 32 + 10 = 42
potatoe = features[1][10]

# the corresponding image; every other image is compared against this reference
Image.open("../input/cassava-leaf-disease-classification/train_images/" + train_csv.iloc[42]["image_id"])
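
The link between `features[1][10]` and row 42 of the dataframe follows from the batch size of 32 and from the DataLoader not shuffling (its default). A small sketch of that index arithmetic is shown here. The helper names `to_dataset_index` and `to_batch_position` are made up for illustration.

```python
batch_size = 32  # same value as the DataLoader in this notebook

def to_dataset_index(batch_idx, pos_in_batch, batch_size):
    """Map (batch number, position inside the batch) to a row index."""
    return batch_idx * batch_size + pos_in_batch

def to_batch_position(dataset_idx, batch_size):
    """Inverse mapping: row index to (batch number, position inside the batch)."""
    return divmod(dataset_idx, batch_size)

idx = to_dataset_index(1, 10, batch_size)
batch_idx, pos = to_batch_position(42, batch_size)
print(idx, batch_idx, pos)
```

This mapping only holds while the loader iterates in dataframe order. With `shuffle=True` the batches would no longer line up with `train_csv`.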

In [None]:
# flatten the list of batches into one list of per-image feature vectors
feature_vec = []
for x in features:
    for k in x:
        feature_vec.append(k)

In [None]:
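# indices of all images whose similarity to the reference exceeds the threshold
# (this includes the reference image itself, which has a similarity of 1.0)
relevant = []
# arbitrarily chosen threshold (-> higher is more similar)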
threshold = 0.65

for i, feat in enumerate(feature_vec):
    if float(torch.nn.CosineSimilarity(dim=0)(potatoe,feat)) > threshold:
        relevant.append(i)
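
The per-image loop above could also be written as a single vectorized call. The sketch below is one possible variant on random stand-in data, not the notebook's actual features. `fake_features`, `all_feats`, `query`, and `relevant_idx` are names invented for this example, and the tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the notebook's `features` list: 3 batches of 4 random 8-dim vectors
fake_features = [torch.randn(4, 8) for _ in range(3)]

# Stack all batches into one (num_images, feature_dim) tensor
all_feats = torch.cat(fake_features, dim=0)

# Pick one row as the reference vector
query = all_feats[5]

# Broadcasting compares the reference against every row at once
sims = F.cosine_similarity(all_feats, query.unsqueeze(0), dim=1)

threshold = 0.65
relevant_idx = torch.nonzero(sims > threshold).flatten().tolist()
print(relevant_idx)
```

Keeping the full `sims` tensor around also makes it easy to sort by score or to try several thresholds without recomputing anything.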

In [None]:
# indices of similar images
# wrapped in np.array for a more compact display
np.array(relevant)

In [None]:
# how many images with a similarity score over 0.65 are in our trainset
len(relevant)

In [None]:
# example 1
# random sample from the relevant list, you can check them in your notebook 
Image.open("../input/cassava-leaf-disease-classification/train_images/" + train_csv.iloc[relevant[22]]["image_id"])

In [None]:
# example 2
# random sample from the relevant list, you can check them in your notebook 
Image.open("../input/cassava-leaf-disease-classification/train_images/" + train_csv.iloc[relevant[50]]["image_id"])

In [None]:
# example 3
# random sample from the relevant list, you can check them in your notebook 
Image.open("../input/cassava-leaf-disease-classification/train_images/" + train_csv.iloc[relevant[0]]["image_id"])

In [None]:
# example 4
# random sample from the relevant list, you can check them in your notebook 
Image.open("../input/cassava-leaf-disease-classification/train_images/" + train_csv.iloc[relevant[140]]["image_id"])

In [None]:
# example 5
# random sample from the relevant list, you can check them in your notebook 
Image.open("../input/cassava-leaf-disease-classification/train_images/" + train_csv.iloc[relevant[146]]["image_id"])

We checked the whole training set and found **147 images** with a cosine similarity above **0.65** to the reference image. That's around **0.6%** of the training set, so these images should affect training only slightly.

Keeping them in the training process should not have a big effect on the model. We also don't know whether the test set includes such images.

It may still be worth checking whether excluding these images yields a higher LB score.
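
For anyone who wants to try that experiment, the flagged rows can be dropped from the dataframe by position. The sketch below uses a tiny made-up dataframe. The file names, labels, and the `flagged` list are placeholders for `train_csv` and `relevant`, not real data.

```python
import pandas as pd

# Hypothetical miniature version of train.csv
df = pd.DataFrame({
    "image_id": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
    "label": [0, 3, 1, 3, 4],
})

# Positional indices of images to exclude (plays the role of `relevant`)
flagged = [1, 3]

# Translate positions to index labels, drop those rows, and renumber
cleaned = df.drop(df.index[flagged]).reset_index(drop=True)
print(cleaned)
```

Since the threshold was chosen arbitrarily, it is probably worth looking through the flagged images before removing them.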