Like me, a lot of people may not have the computational resources required to participate in this competition.


Why? Because there are a lot of images, about 1.5M, and the dataset is huge too, at roughly 100 GB.


But for each landmark, many of the images can be considered augmentations of one another. We would be generating such variations during training anyway, so it's okay if we do not include some of these images in our training set.


Here's a way to get the most 'unrelated' images so we can train our model with the most diverse images.


------
[GitHub](https://github.com/JohannesBuchner/imagehash)


Image hashes tell whether two images look nearly identical. This is different from cryptographic hashing algorithms (like MD5, SHA-1) where tiny changes in the image give completely different hashes. In image fingerprinting, we actually want our similar inputs to have similar output hashes as well.
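
To make the contrast concrete, here is a minimal pure-Python sketch of the idea behind a difference hash. It is a toy illustration, not the `imagehash` implementation: the helper names `dhash_bits` and `hamming` and the 8x9 pixel grids are assumptions made up for this example, and a real hash would first resize and grayscale the image.

```python
import hashlib

def dhash_bits(gray):
    """Toy difference hash: one bit per pair of horizontally adjacent pixels."""
    return [int(row[c] > row[c + 1]) for row in gray for c in range(len(row) - 1)]

def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def md5_of(gray):
    """Cryptographic hash of the raw pixel values, for comparison."""
    return hashlib.md5(bytes(p for row in gray for p in row)).hexdigest()

# A hypothetical 8x9 grayscale "image" that gets brighter from left to right.
base = [[r * 9 + c * 7 for c in range(9)] for r in range(8)]
# The same image with every pixel slightly brighter.
brighter = [[p + 10 for p in row] for row in base]
# The same image mirrored horizontally.
flipped = [row[::-1] for row in base]

d_brighter = hamming(dhash_bits(base), dhash_bits(brighter))
d_flipped = hamming(dhash_bits(base), dhash_bits(flipped))

print("dhash distance, base vs brighter:", d_brighter)
print("dhash distance, base vs flipped:", d_flipped)
print("md5 equal, base vs brighter:", md5_of(base) == md5_of(brighter))
```

The brightened copy keeps the same left-to-right ordering of pixel values, so its toy hash matches the original, while MD5 treats the two as unrelated. The mirrored copy reverses every comparison and ends up at the maximum distance.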

In [None]:
!pip install imagehash

In [None]:
# You can modify these to suit your needs and resources.

NUM_SAMPLES_TO_COMPARE = 1000  # randomly sample at most this many images per landmark to compare
NUM_SAMPLES_SAVE_PER_LANDMARK = 25  # finally keep at most this many images per landmark

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
import cv2
import random
import imageio
import imagehash
from PIL import Image
import seaborn as sns
from tqdm.auto import tqdm
import matplotlib.pyplot as plt

In [None]:
TRAIN_PATH = "../input/landmark-recognition-2020/train"

def id2filename(imageid):
    sid = str(imageid)
    return f"{TRAIN_PATH}/{sid[0]}/{sid[1]}/{sid[2]}/{sid}.jpg"


def plot_matches(match_dict, idx=-1):
    pair = match_dict[idx]
    id1, id2 = pair[0].split('_')
    img1 = imageio.imread(id2filename(id1))
    img2 = imageio.imread(id2filename(id2))
    
    plt.figure()
    plt.suptitle(f"#hash diff {pair[1]}")
    plt.subplot(121);plt.imshow(img1)
    plt.subplot(122);plt.imshow(img2)
    
    
def get_matches(landmark_id):
    mask = df['landmark_id'] == landmark_id
    image_names = df[mask]['id'].values.tolist()
    num_samples = min(len(image_names), NUM_SAMPLES_TO_COMPARE)
    image_names = random.sample(image_names, num_samples)
    img_hash = []
    for name in image_names:
        img_hash.append(imagehash.dhash(Image.open(id2filename(name))))
        
    match_pct = {}
    for i in range(len(img_hash)):
        base_name = image_names[i]
        for j in range(i+1, len(img_hash)):
            curr_name = image_names[j]
            match_pct[f"{base_name}_{curr_name}"] = img_hash[i] - img_hash[j]
    
    match_pct = sorted(match_pct.items(), key=lambda item: item[1])
    return match_pct

In [None]:
df = pd.read_csv('../input/landmark-recognition-2020/train.csv')
df

A visualization of how similar or dissimilar the images look according to the image hashes.

In [None]:
%%time

for landmark_id in [20409, 83144, 138982]:
    print(landmark_id)
    match_pct = get_matches(landmark_id)
    plot_matches(match_pct, idx=0)   # most similar pair (plot_matches creates its own figure)
    plot_matches(match_pct, idx=-1)  # most dissimilar pair

Let's get the sorted count of images per landmark.

In [None]:
count_df = df.groupby('landmark_id').count().reset_index().rename(columns={'id': 'count'})
count_df = count_df.sort_values(by=['count'], ascending=False)
count_df

It's best to get an idea of the distribution of the counts before we begin

In [None]:
count_df['count'].hist(bins=100, figsize=(10, 4))

I can't see anything, can you?
It seems most landmarks have fewer than 500 images.

Let's skip the first 100 (the dataframe is sorted) and see the distribution.

In [None]:
count_df['count'][100:].hist(bins=100, figsize=(10, 4))

Great! Now we can process the landmarks that have at least 25 image samples and keep the most dissimilar of them as our training set.

For landmarks that have fewer than 25, we can keep them as they are and maybe randomly sample them at training time.
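
The selection step described above can be sketched on toy data. This is only an illustration under stated assumptions: the function name `pick_most_dissimilar`, the tuple layout of the pairs, and the distances are made up for the example and are not part of the notebook's own code.

```python
def pick_most_dissimilar(sorted_pairs, cap):
    """Pick up to `cap` image ids, starting from the most dissimilar pairs.

    sorted_pairs: list of ((id_a, id_b), distance), ascending by distance.
    """
    chosen = []
    # Walk from the largest Hamming distance down to the smallest.
    for (id_a, id_b), _distance in reversed(sorted_pairs):
        for image_id in (id_a, id_b):
            # Check the cap before every single addition so it is never exceeded.
            if image_id not in chosen and len(chosen) < cap:
                chosen.append(image_id)
        if len(chosen) >= cap:
            break
    return chosen

# Hypothetical pairs, already sorted in ascending order of distance.
pairs = [(("a", "b"), 1), (("c", "d"), 4), (("a", "d"), 9), (("b", "c"), 12)]
picked = pick_most_dissimilar(pairs, cap=3)
print(picked)
```

Checking the cap per image rather than per pair matters because each pair contributes up to two new ids, which could otherwise overshoot the limit by one.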

We'll use multiprocessing to make the best use of resources given to us.

In [None]:
count_mask = count_df['count'] >= NUM_SAMPLES_SAVE_PER_LANDMARK
count_mask.sum(), (~count_mask).sum()

In [None]:
from multiprocessing import Pool

def f(landmark_id):
    # pairs are sorted in ascending order of Hamming distance
    match_pct = get_matches(landmark_id)
    
    filenames = set()
    # walk from the most dissimilar pair to the most similar one
    for pair in reversed(match_pct):
        for image_id in pair[0].split('_'):
            filenames.add((image_id, landmark_id))
            # ensure an upper cap on the number of images kept
            if len(filenames) >= NUM_SAMPLES_SAVE_PER_LANDMARK:
                return filenames
    return filenames
    

if __name__ == '__main__':
    with Pool(5) as p:
        values = count_df[count_mask]['landmark_id']
        results = list(tqdm(p.imap(f, values), total=len(values)))
    
# flatten the per-landmark sets into one list of (id, landmark_id) rows
samples = []
for filenames in results:
    samples += list(filenames)
samples_df = pd.DataFrame(samples, columns=['id', 'landmark_id'])

Get the landmark_ids that we didn't process, so we can add them as-is to the final dataframe.

In [None]:
unsampled_landmarks = count_df[~count_mask]['landmark_id']
unsampled_mask = df['landmark_id'].isin(unsampled_landmarks)
unsampled_landmarks_df = df[unsampled_mask]
unsampled_landmarks_df

Combine them!

In [None]:
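# reset the index so rows from the two frames do not share index values
final_df = pd.concat([samples_df, unsampled_landmarks_df], ignore_index=True)
# index=False keeps the CSV in the same two-column layout as train.csv
final_df.to_csv(f'train_reduced_to_{NUM_SAMPLES_SAVE_PER_LANDMARK}_samples.csv', index=False)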

Image count distribution of the final dataframe

In [None]:
final_count_df = final_df.groupby('landmark_id').count().reset_index().rename(columns={'id': 'count'})
final_count_df = final_count_df.sort_values(by=['count'], ascending=False)
final_count_df['count'].hist();

In [None]:
final_df['landmark_id'].unique().shape == df['landmark_id'].unique().shape

In [None]:
final_df.shape, df.shape