Hello Fellow Kagglers,

This notebook demonstrates how to obtain extra training images for rarely occurring classes. The training data is highly imbalanced: some classes have thousands of samples, others just a handful. Every class with fewer than 20 samples is filled up to at most 20 samples, which greatly increases the training data for rare classes. This should reduce the class imbalance, lower the bias towards the majority classes, and improve recognition of rare classes.

All data is crawled from [this](https://github.com/cvdfoundation/google-landmark) GitHub repository. Over 400,000 new images are added. The provided training set in this competition contains 1.5M images, whereas the complete dataset contains over 4M images!

All 4M images are downloaded. An image is kept if the Kaggle training set does not already contain it and it belongs to a class with fewer than 20 samples; it's as simple as that. The complete dataset also contains over 200,000 classes, most of which are not present in the Kaggle dataset. Only images of classes present in the Kaggle dataset are added.
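
The keep rule described above can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the notebook's actual implementation: `should_keep` and its parameter names are hypothetical, and the toy IDs are made up.

```python
def should_keep(image_id, landmark_id, kaggle_image_ids, class_counts, threshold=20):
    """Return True if a downloaded image should be kept as extra training data.

    class_counts maps landmark_id -> current number of samples and is
    updated in place, so each class is filled up to `threshold` at most.
    """
    # Skip classes that do not occur in the Kaggle training set
    if landmark_id not in class_counts:
        return False
    # Skip images already present in the Kaggle training set
    if image_id in kaggle_image_ids:
        return False
    # Skip classes that already have enough samples
    if class_counts[landmark_id] >= threshold:
        return False
    class_counts[landmark_id] += 1
    return True


# Toy example with a threshold of 3 (hypothetical IDs)
counts = {1: 2, 2: 5}
kaggle_ids = {'a', 'b'}
decisions = [
    should_keep('c', 1, kaggle_ids, counts, threshold=3),  # class 1 has room
    should_keep('d', 1, kaggle_ids, counts, threshold=3),  # class 1 is now full
    should_keep('a', 1, kaggle_ids, counts, threshold=3),  # already in Kaggle set
    should_keep('e', 2, kaggle_ids, counts, threshold=3),  # class 2 above threshold
    should_keep('f', 9, kaggle_ids, counts, threshold=3),  # class not in Kaggle set
]
print(decisions)
```

Only the first image is kept in this toy run; the class count is updated as images are accepted, so a class stops accepting images once it reaches the threshold.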

The resulting dataset can be found [here](https://www.kaggle.com/markwijkhuizen/google-landmark-recognition-extra-train-data-pub). [This](https://www.kaggle.com/markwijkhuizen/google-landmark-recognition-extra-data-tfrec-pub) notebook shows how to convert the images to TFRecords, resulting in [this](https://www.kaggle.com/markwijkhuizen/google-landmark-recognition-extra-train-tfrecs-pub) TFRecords dataset.

In [None]:
# Silence All Tensorflow Warnings
!pip install -q silence_tensorflow

In [None]:
# Silence Tensorflow
import silence_tensorflow.auto

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf

import matplotlib.pyplot as plt

from tqdm.notebook import tqdm
from multiprocessing import cpu_count

import joblib
import imageio
import cv2
import os
import glob
import multiprocessing

tqdm.pandas()

In [None]:
# Run OpenCV single-threaded to avoid oversubscription when images are resized in parallel
cv2.setNumThreads(1)

# Train Original

In [None]:
# Original Train
train_original = pd.read_csv('/kaggle/input/landmark-recognition-2021/train.csv')

In [None]:
display(train_original.head())

In [None]:
display(train_original.info())

In [None]:
# Print class value counts, many classes have just 2 samples!
train_original['landmark_id'].value_counts()

In [None]:
train_original_landmark_ids = set(train_original['landmark_id'].unique())
print(f'There are {len(train_original_landmark_ids)} unique landmarks')

In [None]:
# Landmark ID occurrences, we fill them up to 20 images
original_landmark_id2count = train_original['landmark_id'].value_counts().to_dict()

In [None]:
# Original image ids to check for duplicates
original_ids = set(train_original['id'])

# Train GitHub

In [None]:
# GitHub Train of complete dataset
!wget -cq "https://s3.amazonaws.com/google-landmark/metadata/train.csv"
train_github = pd.read_csv('./train.csv')

In [None]:
display(train_github.head())

In [None]:
# The complete dataset contains 203094 classes, many more than the Kaggle dataset
train_github['landmark_id'].value_counts()

In [None]:
display(train_github.info())

In [None]:
print(f'There are {train_github["landmark_id"].nunique()} unique landmarks')

In [None]:
github_id2landmark_id = train_github[['id', 'landmark_id']].set_index('id').squeeze().to_dict()

# Extra Image Potential

This section computes the maximum number of additional images for a given fill value, assuming every class is filled up to exactly that value. With a fill value of 20, for example, every class with fewer than 20 samples is assumed to be topped up to 20. As can be seen, with a fill value of 100 the additional image potential is over 6 million!
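
Put differently, the potential for a fill value `n` is the sum of `max(0, n - count)` over all classes. A minimal vectorized sketch with NumPy, using made-up class counts rather than the real Kaggle counts; `extra_image_potential` is a hypothetical helper name:

```python
import numpy as np


def extra_image_potential(class_counts, fill_value):
    """Number of images needed to fill every class up to fill_value."""
    counts = np.asarray(class_counts)
    # Classes at or above the fill value contribute nothing
    return int(np.clip(fill_value - counts, 0, None).sum())


toy_counts = [2, 5, 20, 300]
print(extra_image_potential(toy_counts, 20))  # 18 + 15 + 0 + 0 = 33
```

This is an upper bound: it is only reached if the complete dataset actually holds enough unseen images for every class.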

In [None]:
res = []
for n in tqdm(range(101)):
    potential = 0
    for k, count in original_landmark_id2count.items():
        potential += max(0, n - count)
    res.append(potential)

In [None]:
plt.figure(figsize=(12, 6))
pd.Series(res).plot()
plt.grid()
plt.title('Potential Number of Extra Train Images per Threshold', size=18)
plt.xlabel('Threshold', size=16)
plt.ylabel('Potential Number of Extra Train Images', size=16)
plt.show()

In [None]:
pd.DataFrame({ 'Potential Number of Extra Train Images': res[:26] })

# Process Download

In [None]:
# Process extracted images, the beating heart of this notebook
def process_download(idx):
    # Get all paths to the newly downloaded images
    file_paths = glob.glob('/kaggle/working/temp/*/*/*/*.jpg')
    new_train_data = 0
    for file_path in file_paths:
        # Get Image ID to check for duplicates
        image_id = file_path.split('/')[-1].split('.')[0]
        landmark_id = github_id2landmark_id[image_id]
        # Keep only classes present in the Kaggle dataset and skip images already in it
        if landmark_id in train_original_landmark_ids and image_id not in original_ids:
            # Only add image if class count is below threshold
            count = original_landmark_id2count[landmark_id]
            if count < THRESHOLD:
                # Increase class count
                original_landmark_id2count[landmark_id] += 1
                # Increase newly found images count
                new_train_data += 1
                # Continue, do not remove this image
                continue
        # Remove image
        os.remove(file_path)

    # Percentage of images kept, guarding against an archive without images
    keep_ratio = new_train_data / max(1, len(file_paths)) * 100
    # Count total new training data
    total_new_files = new_train_data + len(glob.glob('/kaggle/working/train/*/*/*/*.jpg'))
    # Print info
    if idx % 10 == 0:
        print(
            f'{idx:03d} | ' +
            f'{str(new_train_data).rjust(4)}/{len(file_paths)} ' +
            f'({keep_ratio:05.2f}%) images kept' +
            f', total new files: {total_new_files}'
        )

# Downsize Images

The notebook disk size limit is just 20GB. Images are therefore downsized so that their shorter side is at most 384 pixels, which leaves room for more new training data!
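
The target dimensions can be worked out without touching any pixels. The sketch below is illustrative only: `downsized_shape` is a hypothetical helper, and rounding to the nearest pixel is a choice made for this example.

```python
def downsized_shape(height, width, target=384):
    """Return (height, width) after scaling so the shorter side equals target.

    Images whose shorter side is already at most target are left unchanged.
    """
    shorter = min(height, width)
    if shorter <= target:
        return height, width
    ratio = target / shorter
    return round(height * ratio), round(width * ratio)


print(downsized_shape(768, 1024))  # (384, 512)
print(downsized_shape(300, 800))   # (300, 800), already small enough
```

The aspect ratio is preserved, so only the shorter side is pinned to 384 pixels and the longer side scales with it.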

In [None]:
def downsize_single_image(fp):
    img = imageio.imread(fp)
    # Use only the first two dimensions, grayscale images have no channel axis
    h, w = img.shape[:2]

    # Check whether the shorter side is bigger than IMG_SIZE
    if min(h, w) > IMG_SIZE:
        r = IMG_SIZE / min(w, h)
        w_resize = round(w * r)
        h_resize = round(h * r)
        # Resize using high quality LANCZOS algorithm
        img = cv2.resize(img, (w_resize, h_resize), interpolation=cv2.INTER_LANCZOS4)
        # encode_jpeg expects a channel axis, add one for grayscale images
        if img.ndim == 2:
            img = img[..., np.newaxis]
        # Save as JPEG with quality set to 70, just as original images
        img_jpeg = tf.io.encode_jpeg(img, quality=70, optimize_size=True).numpy()
        # Overwrite image with lower res version
        with open(fp, 'wb') as f:
            f.write(img_jpeg)

# Downsize images in parallel, speeds up the whole process
def downsize_images_parallel():
    jobs = [joblib.delayed(downsize_single_image)(fp) for fp in glob.glob('/kaggle/working/temp/*/*/*/*.jpg')]
    joblib.Parallel(
        n_jobs=cpu_count(),
        verbose=0,
        require='sharedmem'
    )(jobs)

# Add Extra Training Data

In [None]:
# Clear the working directory; train.csv is already loaded in memory
!rm -rf *

In [None]:
# Fill Value
THRESHOLD = 20
# Downsize Image Resolution
IMG_SIZE = 384
# Number of cores
N_CORES = cpu_count()

In [None]:
!mkdir train temp

In [None]:
# Install axel for multi-connection downloads
!apt-get -qq install axel

The images are split across 500 TAR files, all of which are downloaded and processed. Yes, that's processing half a terabyte, over 4 million images, in about 6 hours.
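
The archive names follow a zero-padded pattern, `images_000.tar` through `images_499.tar`. A minimal sketch of building the full URL list up front, using the same base URL as the download loop in this notebook; `BASE_URL` and `tar_urls` are hypothetical names introduced for illustration:

```python
BASE_URL = 'https://s3.amazonaws.com/google-landmark/train'


def tar_urls(n_files=500):
    """Build the URLs of the zero-padded TAR archives images_000.tar ... images_499.tar."""
    return [f'{BASE_URL}/images_{i:03d}.tar' for i in range(n_files)]


urls = tar_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

At roughly 6 hours for 500 archives, the budget is about 43 seconds per archive on average for download, extraction, filtering and downsizing.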

In [None]:
# Process all TAR files
for i in tqdm(range(0, 500)):
    idx = str(i).rjust(3, '0')
    file = f'images_{idx}.tar'

    # Get tar file, downloaded in parallel for speedup
    !axel -q -n "$N_CORES" "https://s3.amazonaws.com/google-landmark/train/$file" -o "temp"

    # Extract tar file
    !tar -xf "/kaggle/working/temp/$file" -C "/kaggle/working/temp"

    # Process Download
    process_download(i)
    
    # Downsize Images in parallel
    downsize_images_parallel()
    
    # Remove tar file
    !rm -rf "/kaggle/working/temp/$file"

    # Move all accepted images
    for source in glob.glob('/kaggle/working/temp/*'):
        !cp -r "$source" "/kaggle/working/train"
        !rm -rf "$source"

# Mean Image Size

In [None]:
# Computes the mean image size in bytes, used for debugging purposes
file_paths = glob.glob('/kaggle/working/train/*/*/*/*.jpg')
# Use file metadata instead of reading every image into memory
total_size = sum(os.path.getsize(fp) for fp in tqdm(file_paths))
mean_img_size = total_size / len(file_paths)

print(f'Mean image size: {mean_img_size / 2**10:.2f}KB')
print(f'Maximum number of images in 20GB dataset: {20 * 2**30 / mean_img_size / 1000:.1f}K')

# Create Train Extra DataFrame

In [None]:
train_extra_list = []

for file_path in glob.glob('/kaggle/working/train/*/*/*/*.jpg'):
    image_id = file_path.split('/')[-1].split('.')[0]
    landmark_id = github_id2landmark_id[image_id]
    
    train_extra_list.append({ 'id': image_id, 'landmark_id': landmark_id })

In [None]:
train_extra = pd.DataFrame.from_dict(train_extra_list)

In [None]:
display(train_extra.head())

In [None]:
display(train_extra.info())

In [None]:
# Save Train Extra DataFrame with ID and Landmark ID
train_extra.to_pickle('train_extra.pkl.xz')

In [None]:
# Sanity check, there should be no duplicate images present in both train and train_extra
duplicate_image_ids = len(set(train_extra['id']).intersection(set(train_original['id'])))
print(f'Found {duplicate_image_ids} image IDs occurring in both the original and extra dataset')

# Zip Dataset

This step is extremely important: all images must be zipped. Otherwise the notebook fails, because every image is added as output when the notebook result is converted to HTML, which means over 400K images in a single HTML file. Zipping the images prevents this. When the dataset is created, Kaggle automatically unzips the files.
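
For readers who prefer to stay in Python rather than shell out to `zip`, the same idea can be sketched with the standard library. This is an alternative under stated assumptions, not what this notebook does: `zip_subfolders` is a hypothetical helper, and the example runs on a temporary toy directory instead of `/kaggle/working/train`.

```python
import os
import shutil
import tempfile


def zip_subfolders(root):
    """Zip every direct subfolder of root into <name>.zip and delete the folder."""
    archives = []
    for name in sorted(os.listdir(root)):
        source = os.path.join(root, name)
        # Ignore regular files, only folders are archived
        if not os.path.isdir(source):
            continue
        # Creates <root>/<name>.zip with <name>/... as its top-level folder
        archive = shutil.make_archive(source, 'zip', root_dir=root, base_dir=name)
        shutil.rmtree(source)
        archives.append(archive)
    return archives


# Toy example in a temporary directory
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, '0', 'a'))
with open(os.path.join(root, '0', 'a', 'img.jpg'), 'wb') as f:
    f.write(b'fake image bytes')
archives = zip_subfolders(root)
print([os.path.basename(a) for a in archives])
```

Deleting each folder right after it is archived keeps peak disk usage low, which matters under the 20GB limit.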

In [None]:
for source in tqdm(glob.glob('/kaggle/working/train/*')):
    # Ignore files, only zip folders
    if os.path.isdir(source):
        print(f'Zipping folder {source}')
        folder = source.split('/')[-1]
        target = f'{folder}.zip'
        # Zip
        !cd "/kaggle/working/train" ; zip -qr "$target" "$folder"
        # Remove original folder
        !rm -rf "$source"