# Introduction

Welcome to the fourth Landmark Recognition competition! This year, we introduce a lot more diversity in the challenge’s test images in order to measure global landmark recognition performance in a fairer manner. And following last year’s success, we set this up as a code competition.

Have you ever gone through your vacation photos and asked yourself: What is the name of this temple I visited in China? Who created this monument I saw in France? Landmark recognition can help! This technology can predict landmark labels directly from image pixels, to help people better understand and organize their photo collections. This competition challenges Kagglers to build models that recognize the correct landmark (if any) in a dataset of challenging test images.

> Google Landmark Recognition is an image classification challenge like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which aims to recognize 1K general object categories. Landmark recognition differs in two ways: it has a much larger number of classes (more than 81K in this challenge), and the number of training examples per class may be small. This makes landmark recognition challenging in its own way.

# Info

In this competition, you are asked to take test images and recognize which landmarks (if any) are depicted in them. The training set is available in the train/ folder, with corresponding landmark labels in train.csv. The test set images are in the test/ folder. Each image has a unique id. Since there are a large number of images, each image is placed within three nested subfolders named after the first three characters of the image id (for example, image abcdef.jpg is placed in a/b/c/abcdef.jpg).
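
The folder layout described above can be sketched as a small path helper. This is a minimal illustration, not part of the competition tooling: the function name `image_path` and its arguments are hypothetical.

```python
import os


def image_path(image_dir: str, image_id: str) -> str:
    """Build the nested path for an image id (first three characters become subfolders)."""
    a, b, c = image_id[:3]
    return os.path.join(image_dir, a, b, c, f"{image_id}.jpg")


# Example: id "abcdef" under a directory called "train"
example_path = image_path("train", "abcdef")
print(example_path)
```

The same pattern appears later in this notebook inside `display_images` and `DataGenerator`.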

This is a synchronous rerun code competition. The provided test set is a representative set of files that demonstrates the format of the private test set. When you submit your notebook, Kaggle will rerun your code on the private dataset. This competition also has two unique characteristics:

* To facilitate recognition-by-retrieval approaches, the private training set contains only a 100k subset of the total public training set. This 100k subset contains all of the training set images associated with the landmarks in the private test set. You may still attach the full training set as an external dataset if you wish.

* Submissions are given 12 hours to run, as compared to the site-wide session limit of 9 hours. While your commit must still finish within the 9-hour limit in order to be eligible to submit, the rerun may take the full 12 hours.
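
The recognition-by-retrieval idea mentioned above can be illustrated roughly as follows: embed every training image, then label a test image with the landmark of its most similar training embedding. This is only a sketch under stated assumptions. The function names are hypothetical, the 3-dimensional embeddings are made up, and real embeddings would come from a trained model. It also assumes no embedding is the zero vector.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recognize_by_retrieval(query, index):
    """Return (landmark_id, similarity) of the indexed embedding closest to `query`.

    `index` is a list of (landmark_id, embedding) pairs.
    """
    best_id, best_sim = None, -1.0
    for landmark_id, embedding in index:
        sim = cosine_similarity(query, embedding)
        if sim > best_sim:
            best_id, best_sim = landmark_id, sim
    return best_id, best_sim


# Toy index: one made-up embedding per landmark id
index = [
    (138982, [1.0, 0.0, 0.0]),
    (126637, [0.0, 1.0, 0.0]),
    (20409, [0.0, 0.0, 1.0]),
]
landmark_id, sim = recognize_by_retrieval([0.9, 0.1, 0.0], index)
print(landmark_id, round(sim, 3))
```

A linear scan like this only works for toy sizes. With 100k index images, an approximate nearest-neighbour library would normally take its place.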

* train.csv: This file contains the image ids and their target labels.
  - id: image id
  - landmark_id: target landmark id
 

In [None]:
import os


import random
import seaborn as sns
import cv2

# General packages
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import PIL
import IPython.display as ipd
import glob
import h5py
import plotly.graph_objs as go
import plotly.express as px
from PIL import Image
from tempfile import mktemp


from bokeh.plotting import figure, output_notebook, show
from math import pi

output_notebook()


from IPython.display import display  # Image is not imported here so it does not shadow PIL.Image
import warnings
warnings.filterwarnings("ignore")

In [None]:
os.listdir('../input/landmark-recognition-2021/')

In [None]:
DATASET_DIR = '../input/landmark-recognition-2021'

TRAIN_IMAGE_DIR = f'{DATASET_DIR}/train'
TEST_IMAGE_DIR = f'{DATASET_DIR}/test'
train = pd.read_csv(f'{DATASET_DIR}/train.csv')
SUB = pd.read_csv(f'{DATASET_DIR}/sample_submission.csv')

In [None]:
display(train.head())
print("Shape of train_data :", train.shape)

In [None]:
landmark = train.landmark_id.value_counts()
landmark_df = pd.DataFrame({'landmark_id':landmark.index, 'frequency':landmark.values}).head(30)

landmark_df['landmark_id'] =   landmark_df.landmark_id.apply(lambda x: f'landmark_id_{x}')

fig = px.bar(landmark_df, x="frequency", y="landmark_id",color='landmark_id',
             hover_data=["landmark_id", "frequency"],
             height=1000,
             title='Number of images per landmark_id (Top 30 landmark_ids)',
             color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()

In [None]:
landmark.hist()

In [None]:
#Landmark ID distribution
plt.figure(figsize = (10, 8))
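plt.title('Landmark ID Distribution')
# distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
sns.histplot(train['landmark_id'], kde=True)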

plt.show()

In [None]:
sns.set()
plt.title('Training set: number of images per class (line plot)')
sns.set_color_codes("pastel")
landmarks_fold = pd.DataFrame(train['landmark_id'].value_counts())
landmarks_fold.reset_index(inplace=True)
landmarks_fold.columns = ['landmark_id','count']
ax = landmarks_fold['count'].plot(logy=True, grid=True)
locs, labels = plt.xticks()
plt.setp(labels, rotation=30)
ax.set(xlabel="Landmarks", ylabel="Number of images")
plt.show()

In [None]:
# Visualize outliers, min/max or quantiles of the landmarks count
sns.set()
ax = landmarks_fold.boxplot(column='count')
ax.set_yscale('log')

In [None]:
train.landmark_id.nunique()

- There are 81313 unique landmark_ids

In [None]:
landmark[:5]

- Only one landmark has more than 2300 images (landmark_id: 138982).

In [None]:
landmark.describe()

- The number of images per landmark_id ranges from 2 to 6272.
- The median is 9 and the mean is 19.


In [None]:
landmark[landmark < 100].shape

In [None]:
landmark.shape



- Out of 81313 landmark_ids, 79298 (97.5%) have fewer than 100 images.
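
The arithmetic behind these class-size statistics can be reproduced on a toy label column. The sketch below uses only the standard library, and the labels and threshold are made up for illustration.

```python
from collections import Counter
from statistics import mean, median

# Toy stand-in for the landmark_id column of train.csv
labels = [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4]

counts = Counter(labels)       # landmark_id -> number of images
sizes = list(counts.values())  # images per landmark

threshold = 3                  # plays the role of the 100-image cutoff
n_small = sum(1 for s in sizes if s < threshold)
share_small = n_small / len(sizes)

print(min(sizes), max(sizes), median(sizes), mean(sizes))
print(f"{n_small} of {len(sizes)} classes ({share_small:.1%}) have fewer than {threshold} images")
```

On the real data, the equivalent share can be read directly from the `landmark` series computed earlier, for example with `(landmark < 100).mean()`.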

In [None]:
import PIL
from PIL import Image, ImageDraw


def display_images(images, title=None):
    """
    Display up to 25 training images in a 5x5 grid, each titled with its id and landmark_id.
    Thank you @rohitsingh9990 for this function.
    """
    f, ax = plt.subplots(5,5, figsize=(18,22))
    if title:
        f.suptitle(title, fontsize = 30)

    for i, image_id in enumerate(images):
        image_path = os.path.join(TRAIN_IMAGE_DIR, f'{image_id[0]}/{image_id[1]}/{image_id[2]}/{image_id}.jpg')
        image = Image.open(image_path)
        
        ax[i//5, i%5].imshow(image) 
        image.close()       
        ax[i//5, i%5].axis('off')

        landmark_id = train[train.id==image_id.split('.')[0]].landmark_id.values[0]
        ax[i//5, i%5].set_title(f"ID: {image_id.split('.')[0]}\nLandmark_id: {landmark_id}", fontsize="12")

    plt.show() 

# Visualizing

In [None]:
samples = train.sample(25).id.values
display_images(samples, 'Random')

In [None]:
samples = train[train.landmark_id == 138982].sample(25).id.values
display_images(samples, 'Top 1')

In [None]:
samples = train[train.landmark_id == 126637].sample(25).id.values
display_images(samples, 'Top 2')

In [None]:
samples = train[train.landmark_id == 20409].sample(25).id.values
display_images(samples, 'Top 3')

# EfficientNetB0 inference

In [None]:
!pip install ../input/keras-efficientnet-whl/Keras_Applications-1.0.8-py3-none-any.whl
!pip install ../input/keras-efficientnet-whl/efficientnet-1.1.1-py3-none-any.whl

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import efficientnet.keras as efn
import tensorflow.keras.layers as L
from tensorflow.keras.utils import Sequence
from tensorflow.keras.preprocessing import image
from random import shuffle
from sklearn.model_selection import train_test_split
import cv2
import math

In [None]:
class DataGenerator(Sequence):
    def __init__(self, path, list_IDs, data, img_size, img_channel, batch_size):
        self.path = path
        self.list_IDs = list_IDs
        self.data = data
        self.img_size = img_size
        self.img_channel = img_channel
        self.batch_size = batch_size
        self.indexes = np.arange(len(self.list_IDs))
        
    def __len__(self):
        len_ = int(len(self.list_IDs)/self.batch_size)
        if len_*self.batch_size < len(self.list_IDs):
            len_ += 1
        return len_
    
    def __getitem__(self, index):
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        X, y = self.__data_generation(list_IDs_temp)
        return X, y
            
    
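    def __data_generation(self, list_IDs_temp):
        # Size the arrays to the actual batch so a short final batch is not zero-padded
        n = len(list_IDs_temp)
        X = np.zeros((n, self.img_size, self.img_size, self.img_channel))
        y = np.zeros((n, 1), dtype=int)
        is_train = 'train' in self.path
        for i, ID in enumerate(list_IDs_temp):
            image_id = self.data.loc[ID, 'id']
            # Images are nested by the first three characters of the id: a/b/c/abcdef.jpg
            subpath = '/'.join(image_id[:3])
            file_path = f'{self.path}{subpath}/{image_id}.jpg'
            img = cv2.imread(file_path)  # BGR channel order, uint8
            if img is None:
                raise FileNotFoundError(f'Could not read image: {file_path}')
            # Scale pixel values to [0, 1], then resize to the model input size
            img = cv2.resize(img / 255, (self.img_size, self.img_size))
            X[i] = img
            # Test images have no labels, so a placeholder of 0 is used
            y[i] = self.data.loc[ID, 'landmark_id'] if is_train else 0
        return X, y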
    
img_size = 256
img_channel = 3

batch_size = 1
sub = pd.read_csv('../input/landmark-recognition-2021/sample_submission.csv')
list_IDs_test = list(sub.index)

test_generator = DataGenerator('../input/landmark-recognition-2021/'+'test/', list_IDs_test, sub, img_size, img_channel, batch_size)
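
The batching arithmetic used by `DataGenerator` (`__len__` rounds the batch count up, and `__getitem__` slices the index array) can be checked in isolation. The helper below is a hypothetical stand-alone sketch, not part of the notebook's pipeline.

```python
import math


def batch_slices(n_items: int, batch_size: int):
    """Return (start, stop) index pairs covering n_items in batches of batch_size."""
    n_batches = math.ceil(n_items / batch_size)  # same count as DataGenerator.__len__
    return [
        (i * batch_size, min((i + 1) * batch_size, n_items))
        for i in range(n_batches)
    ]


slices = batch_slices(10, 4)
print(slices)  # the last batch is shorter than batch_size
```

With `batch_size = 1`, as used here for inference, every batch holds exactly one image, so a short final batch never occurs.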

In [None]:
model = tf.keras.models.load_model('../input/effnetb0trainedmodel/effnetB0.h5')

In [None]:
# predict_generator is deprecated in TF 2.x; predict accepts a Sequence directly
preds = model.predict(test_generator)
preds

In [None]:
sample_submission = pd.read_csv('../input/landmark-recognition-2021/sample_submission.csv')
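for i in range(len(sample_submission.index)):
    # Index of the highest-probability output; this assumes the model's class indices are landmark ids
    category = np.argmax(preds[i])
    score = preds[i][category].round(2)
    # Each row of the 'landmarks' column holds "<landmark_id> <confidence>"
    sample_submission.loc[i, 'landmarks'] = str(category)+' '+str(score)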
    
sample_submission.to_csv('submission.csv',index=False)

In [None]:
sample_submission['landmarks']
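
The submission loop above boils down to an argmax followed by string formatting. Here is a minimal stand-alone sketch with made-up probabilities. The helper name `format_prediction` is hypothetical, and, as in the loop above, it assumes the class index can be used directly as the landmark id.

```python
def format_prediction(probs):
    """Return '<class_index> <score>' for the highest-probability class."""
    category = max(range(len(probs)), key=probs.__getitem__)  # argmax without NumPy
    return f"{category} {probs[category]:.2f}"


# Two made-up probability vectors standing in for rows of `preds`
toy_preds = [[0.1, 0.7, 0.2], [0.05, 0.05, 0.9]]
rows = [format_prediction(p) for p in toy_preds]
print(rows)
```

Unlike `round(2)` followed by `str`, the `:.2f` format always prints two decimals (for example `0.70` rather than `0.7`).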