In [None]:
import numpy as np
import pandas as pd
import cv2
import matplotlib.pyplot as plt

**How clean is the data?**

I ran the ImageHash library against every image in train_images to check whether any individuals had been mis-identified even though their photos were taken seconds apart. I found two such cases.

My process basically follows what @appian did here: https://www.kaggle.com/code/appian/let-s-find-out-duplicate-images-with-imagehash/notebook. I had to run the hashing locally because the Kaggle kernel kept dying on me. I then had to pay for CoLab+ to run the similarity calculations, since they needed about 20GB of RAM, which I don't have.
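For readers unfamiliar with the approach, here is a minimal sketch of how a similarity score between two perceptual hashes can be computed. It assumes the hashes are available as equal-length hex strings (which is what `str()` of an ImageHash hash gives) and defines similarity as the fraction of matching bits. The name `hash_similarity` is hypothetical, and this is not necessarily the exact calculation behind the saved files.

```python
def hash_similarity(hex_a: str, hex_b: str) -> float:
    """Fraction of bits that agree between two equal-length hex-encoded hashes."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    n_bits = 4 * len(hex_a)  # each hex digit encodes 4 bits
    # XOR leaves a 1 wherever the two hashes disagree
    differing = bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")
    return 1.0 - differing / n_bits


# Two made-up 64-bit hashes that differ in exactly one bit
print(hash_similarity("ffd8e0c0a0908880", "ffd8e0c0a0908881"))
```

Under this definition, a 0.95 cutoff on a 64-bit hash allows at most 3 differing bits.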

I have supplied the results of my runs as saved NumPy files in the "whalesimularitydata" dataset. 90.npy contains pairs above 90% similarity, 91.npy contains pairs above 91% similarity, and so on. Each file is an ndarray in which every element is a pair [(whale1ID,whale1species,whale1image),(whale2ID,whale2species,whale2image)].

I found that only pairs with similarity above 0.95 had a chance of being true positives, i.e. genuinely mis-identified whales. Pairs at 0.94 and below were likely false positives.
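One way to see how the threshold affects the number of candidates is to count, for each saved file, the pairs whose two individual IDs disagree. This is a minimal sketch, assuming each record is ordered (individual_id, species, image), which is how the loading code further down indexes it. `count_cross_id_pairs` is a hypothetical helper, and the toy records are made up for illustration.

```python
def count_cross_id_pairs(pairs):
    """Count pairs whose two records carry different individual IDs.

    Each pair is (record1, record2); each record is assumed to be
    (individual_id, species, image).
    """
    return sum(1 for rec1, rec2 in pairs if rec1[0] != rec2[0])


# Toy records for illustration only (not real competition data)
toy_pairs = [
    (("id_a", "humpback_whale", "img1.jpg"), ("id_a", "humpback_whale", "img2.jpg")),
    (("id_b", "killer_whale", "img3.jpg"), ("id_c", "killer_whale", "img4.jpg")),
]
print(count_cross_id_pairs(toy_pairs))
```

With the real files, passing the result of `np.load` for each threshold file to this function should give the per-threshold counts.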

**How does this impact the competition?**
It doesn't, as far as I can tell, because these mis-identified images don't show up in the test set. See below for more details.

In [None]:
def show(row1, row2):
    print('Image: %s / %s' % (row1[2], row2[2]))
    print('Species: %s / %s' % (row1[1], row2[1]))
    print('Individual: %s / %s' % (row1[0], row2[0]))
    
    image1 = cv2.imread('../input/happy-whale-and-dolphin/train_images/%s' % (row1[2]))
    image2 = cv2.imread('../input/happy-whale-and-dolphin/train_images/%s' % (row2[2]))
    image1 = cv2.cvtColor(image1, cv2.COLOR_BGR2RGB)
    image2 = cv2.cvtColor(image2, cv2.COLOR_BGR2RGB)
    
    fig = plt.figure(figsize=(10, 20))
    fig.add_subplot(1,2,1)
    plt.imshow(image1)
    fig.add_subplot(1,2, 2)
    plt.imshow(image2)
    plt.show()

In [None]:
# each record is stored as (individual_id, species, image)
potential_duplicates=np.load("../input/whalesimularitydata/95.npy")

for whale1, whale2 in potential_duplicates:
    if whale1[0] != whale2[0]:
        show(whale1,whale2)

In [None]:
train = pd.read_csv("../input/happy-whale-and-dolphin/train.csv")
hw1 = train[train.individual_id == "33dfa6052821"]
hw2 = train[train.individual_id == "79fe61c84d86"]
print("Number of photos of each of the humpback whales in question")
print(len(hw1), "and", len(hw2))
kw1 = train[train.individual_id == "1b589cc6179d"]
kw2 = train[train.individual_id == "fc5088954f84"]
print("Number of photos of each of the killer whales in question")
print(len(kw1), "and", len(kw2))

# More analysis

After sleeping on it, I wanted to find out whether these mislabeled photos were also in the test set, where bad data could result in lower leaderboard scores. I couldn't find the photos in the test set, so we're probably safe. I did, however, find many more "near duplicate" photos scattered throughout (81 in total). These come from running the same ImageHash process across all the images in the dataset and returning any matches above 95% similarity. Obviously some are false positives, but **I invite the competition organizers to manually review the images in question to ensure that mis-identification (as happened in the training set) isn't ALSO happening in the test dataset.**
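To make a manual review easier, the flagged pairs can be grouped by whether each image comes from the training or the test set. This is a minimal sketch, assuming each pair is two image filenames and that any filename absent from train.csv is a test image. `split_pairs_by_origin` is a hypothetical helper, and the filenames in the example are made up.

```python
from collections import defaultdict


def split_pairs_by_origin(pairs, train_images):
    """Group filename pairs into 'train-train', 'train-test' and 'test-test'."""
    train_images = set(train_images)
    groups = defaultdict(list)
    for img1, img2 in pairs:
        # Number of images in the pair that are known training images
        n_train = (img1 in train_images) + (img2 in train_images)
        key = {2: "train-train", 1: "train-test", 0: "test-test"}[n_train]
        groups[key].append((img1, img2))
    return dict(groups)


# Made-up filenames for illustration only
groups = split_pairs_by_origin(
    [("a.jpg", "b.jpg"), ("a.jpg", "x.jpg"), ("x.jpg", "y.jpg")],
    train_images=["a.jpg", "b.jpg"],
)
print({key: len(value) for key, value in groups.items()})
```

The "train-test" and "test-test" groups are the ones where labels cannot be checked from train.csv, so those are the pairs most in need of a manual look.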

In [None]:
def attempt_to_show(whale1, whale2):
    def load(name):
        # Try the training folder first; cv2.imread returns None if the file is missing
        image = cv2.imread('../input/happy-whale-and-dolphin/train_images/%s' % (name))
        if image is None:
            # Not a training image: fall back to the test folder, which has no labels
            image = cv2.imread('../input/happy-whale-and-dolphin/test_images/%s' % (name))
            return image, "TEST", "TEST"
        # Look up the labels once instead of filtering train twice
        row = train[train.image == name].iloc[0]
        return image, row.species, row.individual_id

    image1, species1, individual1 = load(whale1)
    image2, species2, individual2 = load(whale2)

    print('Image: %s / %s' % (whale1, whale2))
    print('Species: %s / %s' % (species1, species2))
    print('Individual: %s / %s' % (individual1, individual2))

    # Guard against files found in neither folder; cvtColor would fail on None
    if image1 is None or image2 is None:
        print('Could not read one of the images, skipping plot')
        return

    # OpenCV loads BGR; matplotlib expects RGB
    image1 = cv2.cvtColor(image1, cv2.COLOR_BGR2RGB)
    image2 = cv2.cvtColor(image2, cv2.COLOR_BGR2RGB)

    fig = plt.figure(figsize=(10, 20))
    fig.add_subplot(1, 2, 1)
    plt.imshow(image1)
    fig.add_subplot(1, 2, 2)
    plt.imshow(image2)
    plt.show()


all_duplicates=np.load("../input/whalesimularitydata/all95.npy")
train = pd.read_csv("../input/happy-whale-and-dolphin/train.csv")
for whale1, whale2 in all_duplicates:
    attempt_to_show(whale1,whale2)