# Introduction

Welcome to the fourth Landmark Retrieval competition! This year, we introduce a lot more diversity in the challenge’s test images in order to measure global landmark retrieval performance in a fairer manner. And following last year’s success, we set this up as a code competition.

Image retrieval is a central problem in computer vision, relevant to many applications. The problem is usually posed as follows: given a query image, can you find similar images in a large database? This is especially important for query images containing landmarks, which account for a large portion of what people like to photograph.
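
To make the problem statement concrete, here is a minimal sketch of retrieval by cosine similarity. It assumes every image has already been turned into a fixed-length embedding vector; the random vectors, the function name `retrieve_top_k`, and the embedding size of 16 are illustrative placeholders, not part of the competition data or any particular model.

```python
import numpy as np


def retrieve_top_k(query_vecs, index_vecs, k=5):
    """Return, for each query row, the positions of the k most similar index rows."""
    # L2-normalise both sets so the dot product equals cosine similarity
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = q @ d.T  # shape: (n_queries, n_index)
    # Sort each row by descending similarity and keep the first k columns
    return np.argsort(-sims, axis=1)[:, :k]


# Toy data: random vectors stand in for learned image embeddings (assumption)
rng = np.random.default_rng(0)
index_vecs = rng.normal(size=(100, 16))
# Each query is a slightly perturbed copy of a known index vector (rows 3 and 42)
query_vecs = index_vecs[[3, 42]] + 0.01 * rng.normal(size=(2, 16))

top_k = retrieve_top_k(query_vecs, index_vecs, k=5)
print(top_k[:, 0])  # best match per query
```

A brute-force matrix product like this is only practical for small databases; at the scale of this competition an approximate nearest-neighbour index would probably be needed.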

> In this competition, you are asked to develop models that can efficiently retrieve landmark images from a large database. The training set is available in the train/ folder, with corresponding landmark labels in train.csv. The query images are listed in the test/ folder, while the "index" images from which you are retrieving are listed in index/. Each image has a unique id. Because there is a large number of images, each image is placed within three subfolders according to the first three characters of the image id (e.g. image abcdef.jpg is placed in a/b/c/abcdef.jpg).
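
The folder layout described above can be sketched as a small helper. The function name `image_path` and the default `.jpg` extension argument are illustrative choices, not part of the competition tooling.

```python
import os


def image_path(root_dir, image_id, ext=".jpg"):
    """Build <root_dir>/<c0>/<c1>/<c2>/<image_id><ext> from the first three id characters."""
    if len(image_id) < 3:
        raise ValueError("image_id must have at least three characters")
    return os.path.join(root_dir, image_id[0], image_id[1], image_id[2], image_id + ext)


# The example id from the description, relative to the train/ folder
print(image_path("train", "abcdef"))
```

The same helper works for the test/ and index/ folders by changing `root_dir`.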

# Info


Submissions are given 12 hours to run, compared with the site-wide session limit of 9 hours. Your commit must still finish within the 9-hour limit to be eligible for submission, but the rerun may take the full 12 hours.

* train.csv: This file contains image ids and their target landmark ids.
  - id: image id
  - landmark_id: target landmark id
 

In [None]:
# Standard library
import glob
import os
import random
import warnings
from math import pi
from tempfile import mktemp

# General packages
import cv2
import h5py
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import PIL
import plotly.express as px
import plotly.graph_objs as go
import seaborn as sns
from bokeh.plotting import figure, output_notebook, show
from PIL import Image  # PIL.Image is used later to open image files

# Notebook display helpers
import IPython.display as ipd
# Only `display` is imported here: importing IPython's Image would shadow PIL.Image
from IPython.display import display

output_notebook()  # render Bokeh plots inline

warnings.filterwarnings("ignore")

In [None]:
os.listdir('../input/landmark-retrieval-2021/')

In [None]:
DATASET_DIR = '../input/landmark-retrieval-2021'  # no trailing slash, so the paths below contain a single separator

TRAIN_IMAGE_DIR = f'{DATASET_DIR}/train'
TEST_IMAGE_DIR = f'{DATASET_DIR}/test'
train = pd.read_csv(f'{DATASET_DIR}/train.csv')
SUB = pd.read_csv(f'{DATASET_DIR}/sample_submission.csv')

In [None]:
display(train.head())
print("Shape of train_data :", train.shape)

In [None]:
# Count images per landmark_id (sorted in descending order) and keep the 30 most frequent
landmark = train.landmark_id.value_counts()
landmark_df = pd.DataFrame({'landmark_id': landmark.index, 'frequency': landmark.values}).head(30)

# Prefix the ids with a string so Plotly treats them as categorical labels
landmark_df['landmark_id'] = landmark_df.landmark_id.apply(lambda x: f'landmark_id_{x}')

fig = px.bar(landmark_df, x="frequency", y="landmark_id", color='landmark_id',
             hover_data=["landmark_id", "frequency"],
             height=1000,
             title='Number of images per landmark_id (Top 30 landmark_ids)',
             color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()

In [None]:
landmark.hist()

In [None]:
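# Landmark ID distribution
plt.figure(figsize=(10, 8))
plt.title('Landmark ID Distribution')
# sns.distplot is deprecated; histplot with a KDE overlay gives a comparable plot
sns.histplot(train['landmark_id'], kde=True, stat='density')

plt.show()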

In [None]:
sns.set()
plt.title('Training set: number of images per class (line plot)')
sns.set_color_codes("pastel")
landmarks_fold = pd.DataFrame(train['landmark_id'].value_counts())
landmarks_fold.reset_index(inplace=True)
landmarks_fold.columns = ['landmark_id','count']
ax = landmarks_fold['count'].plot(logy=True, grid=True)
locs, labels = plt.xticks()
plt.setp(labels, rotation=30)
ax.set(xlabel="Landmarks", ylabel="Number of images")
plt.show()

In [None]:
# Visualize outliers, min/max or quantiles of the landmarks count
sns.set()
ax = landmarks_fold.boxplot(column='count')
ax.set_yscale('log')

In [None]:
train.landmark_id.nunique()

- There are 81313 unique landmark_ids

In [None]:
landmark[:5]

- Only one landmark has more than 2300 images (landmark_id: 138982).

In [None]:
landmark.describe()

- The number of images per landmark_id ranges from 2 to 6272.
- The median is 9 and the mean is 19.


In [None]:
landmark[landmark < 100].shape

In [None]:
landmark.shape


- Out of 81313 landmark_ids, 79298 (97.5%) have fewer than 100 images.
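
The count and percentage above can be derived in one step. The sketch below uses a tiny made-up label column so that it runs on its own; the helper name `share_below` and the toy values are illustrative, not taken from the dataset.

```python
import pandas as pd


def share_below(labels, threshold):
    """Return (classes below threshold, total classes, percentage) for a label column."""
    counts = labels.value_counts()  # images per class
    n_below = int((counts < threshold).sum())
    n_total = len(counts)
    return n_below, n_total, 100.0 * n_below / n_total


# Toy labels (assumption): class 1 has 3 images, class 2 has 2, class 3 has 1
toy = pd.DataFrame({"landmark_id": [1, 1, 1, 2, 2, 3]})
n_below, n_total, pct = share_below(toy["landmark_id"], threshold=3)
print(f"{n_below} of {n_total} landmark_ids ({pct:.1f}%) have fewer than 3 images")
```

On the full training table, `share_below(train['landmark_id'], 100)` should reproduce the figures reported above.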

In [None]:
from PIL import Image


def display_images(images, title=None):
    """
    Display up to 25 training images in a 5x5 grid, each titled with its id and landmark_id.
    Thank you @rohitsingh9990 for this function.
    """
    f, ax = plt.subplots(5, 5, figsize=(18, 22))
    if title:
        f.suptitle(title, fontsize=30)

    # Hide every axis up front so unused grid cells stay blank
    for a in ax.ravel():
        a.axis('off')

    for i, image_id in enumerate(images):
        # Images are nested by the first three characters of the id: a/b/c/abcdef.jpg
        image_path = os.path.join(TRAIN_IMAGE_DIR, image_id[0], image_id[1], image_id[2], f'{image_id}.jpg')
        # The context manager closes the file handle once the image has been drawn
        with Image.open(image_path) as image:
            ax[i // 5, i % 5].imshow(image)

        landmark_id = train[train.id == image_id].landmark_id.values[0]
        ax[i // 5, i % 5].set_title(f"ID: {image_id}\nLandmark_id: {landmark_id}", fontsize=12)

    plt.show()

# Visualizing

In [None]:
samples = train.sample(25).id.values
display_images(samples, 'Random')

In [None]:
samples = train[train.landmark_id == 138982].sample(25).id.values
display_images(samples, 'Top 1')

In [None]:
samples = train[train.landmark_id == 126637].sample(25).id.values
display_images(samples, 'Top 2')

In [None]:
samples = train[train.landmark_id == 20409].sample(25).id.values
display_images(samples, 'Top 3')