# 1. Introduction

Google Landmark Recognition is an image classification challenge, similar to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which aims to recognize `1K` general object categories. Landmark recognition differs in two ways: it has a much larger number of classes (more than `81K` in this challenge), and the number of training examples per class may be small. This makes landmark recognition challenging in its own way.


![](https://miro.medium.com/max/1280/1*OVP48VCImepxkHl7AVzkug.png)



## 1.1 What's new in this year's competition

In the previous editions of this challenge (2018 and 2019), submissions were handled by uploading prediction files to the system. This year's competition is structured in a synchronous rerun format, where participants need to submit their Kaggle notebooks for scoring.


# 2. Preliminaries

**Let's begin by importing the required libraries**


In [None]:
import os


import random
import seaborn as sns
import cv2

# General packages
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import PIL
import IPython.display as ipd
import glob
import h5py
import plotly.graph_objs as go
import plotly.express as px
from PIL import Image
from tempfile import mktemp

from bokeh.layouts import column, row
from bokeh.models import ColumnDataSource, LinearAxis, Range1d
from bokeh.models.tools import HoverTool
from bokeh.palettes import BuGn4
from bokeh.plotting import figure, output_notebook, show
from bokeh.transform import cumsum
from math import pi

output_notebook()


from IPython.display import display  # Image is not imported here, so PIL's Image is not shadowed
import warnings
warnings.filterwarnings("ignore")

In [None]:
os.listdir('../input/landmark-recognition-2020/')


## 2.1 Loading Data

In [None]:
BASE_PATH = '../input/landmark-recognition-2020'

TRAIN_DIR = f'{BASE_PATH}/train'
TEST_DIR = f'{BASE_PATH}/test'

print('Reading data...')
train = pd.read_csv(f'{BASE_PATH}/train.csv')
submission = pd.read_csv(f'{BASE_PATH}/sample_submission.csv')
print('Reading data completed')

### The dataset comprises the following important files:

* **train.csv**: This file contains the image ids and their targets
    * `id`: image id
    * `landmark_id`: target landmark id
    

* The training set is available in the `train/` folder, with corresponding landmark labels in `train.csv`. 
* The test set images are listed in the `test/` folder. Each image has a unique id. 

> Note: Since there are a large number of images, each image is placed within three subfolders according to the first three characters of the image id (i.e. image abcdef.jpg is placed in a/b/c/abcdef.jpg).
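
The folder layout described in the note can be turned into a small path helper. This is a minimal sketch: the function name `image_path` and the default `root` value are assumptions made for illustration, based on the `../input/landmark-recognition-2020` directory used in this notebook.

```python
import os


def image_path(image_id: str, root: str = "../input/landmark-recognition-2020/train") -> str:
    """Build the path of an image from its id.

    Hypothetical helper: the first three characters of the id
    name the three nested subfolders, e.g. abcdef -> a/b/c/abcdef.jpg.
    """
    return os.path.join(root, image_id[0], image_id[1], image_id[2], f"{image_id}.jpg")


# Example with a shortened root so the result is easy to read
example = image_path("abcdef", root="train")
print(example)
```

The same helper should work for the test set by passing the `test` folder as `root`.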


In [None]:
display(train.head())
print("Shape of train_data :", train.shape)

In [None]:
display(submission.head())
print("Shape of submission :", submission.shape)

# 3. Let's perform some EDA


### 3.1 Target Distribution (Number of images per landmark_id)


In [None]:
# Count images per landmark_id and keep only the 30 most frequent landmarks
landmark = train.landmark_id.value_counts()
landmark_df = pd.DataFrame({'landmark_id':landmark.index, 'frequency':landmark.values}).head(30)

# Prefix the ids so the y-axis is treated as categorical labels
landmark_df['landmark_id'] = landmark_df.landmark_id.apply(lambda x: f'landmark_id_{x}')

fig = px.bar(landmark_df, x="frequency", y="landmark_id",color='landmark_id', orientation='h',
             hover_data=["landmark_id", "frequency"],
             height=1000,
             title='Number of images per landmark_id (Top 30 landmark_ids)')
fig.show()

**Inference**

* There are 81313 unique landmark_ids
* Only one landmark has more than 2300 images (landmark_id: 138982)
* The number of images per landmark_id ranges from 2 to 6272.
* Out of 81313 landmark_ids, 79298 (97.5%) have fewer than 100 images.
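
Figures like the ones above can be recomputed directly from the `landmark_id` column. The sketch below is one possible way to do it. The function name `summarize_class_counts` and the toy series are assumptions made for illustration. On the real data you would pass `train.landmark_id`.

```python
import pandas as pd


def summarize_class_counts(landmark_ids: pd.Series, threshold: int = 100) -> dict:
    """Summarize how many images each class has (hypothetical helper)."""
    counts = landmark_ids.value_counts()          # images per landmark_id
    n_classes = len(counts)
    n_small = int((counts < threshold).sum())     # classes with fewer than `threshold` images
    return {
        "n_classes": n_classes,
        "min_images": int(counts.min()),
        "max_images": int(counts.max()),
        "n_below_threshold": n_small,
        "pct_below_threshold": round(100 * n_small / n_classes, 1),
    }


# Toy data: class 1 has 3 images, class 2 has 2, class 3 has 5
toy = pd.Series([1, 1, 1, 2, 2, 3, 3, 3, 3, 3])
summary = summarize_class_counts(toy, threshold=3)
print(summary)
```

Calling `summarize_class_counts(train.landmark_id)` with the default threshold of 100 should reproduce the counts listed in the inference.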

# 4. Let's visualize a few images

### 4.1 Visualizing random images

In [None]:
import PIL
from PIL import Image, ImageDraw


def display_images(images, title=None):
    """Show up to 25 training images in a 5x5 grid, titled with id and landmark_id."""
    f, ax = plt.subplots(5, 5, figsize=(18, 22))
    if title:
        f.suptitle(title, fontsize=30)

    for i, image_id in enumerate(images):
        # Images are nested in subfolders named after the first three characters of the id
        image_path = os.path.join(TRAIN_DIR, image_id[0], image_id[1], image_id[2], f'{image_id}.jpg')
        with Image.open(image_path) as image:
            ax[i//5, i%5].imshow(image)
        ax[i//5, i%5].axis('off')

        # Look up the label of this image in train.csv
        landmark_id = train[train.id == image_id].landmark_id.values[0]
        ax[i//5, i%5].set_title(f"ID: {image_id}\nLandmark_id: {landmark_id}", fontsize=12)

    plt.show()

In [None]:
samples = train.sample(25).id.values
display_images(samples)

### 4.2 Visualizing the landmark with the most images (landmark_id: 138982)

In [None]:
samples = train[train.landmark_id == 138982].sample(25).id.values

display_images(samples)

### 4.3 Visualizing the landmark with the 2nd most images (landmark_id: 126637)

In [None]:
samples = train[train.landmark_id == 126637].sample(25).id.values

display_images(samples)

### 4.4 Visualizing the landmark with the 3rd most images (landmark_id: 20409)

In [None]:
samples = train[train.landmark_id == 20409].sample(25).id.values

display_images(samples)

### 4.5 Visualizing the landmark with the 4th most images (landmark_id: 83144)

In [None]:
samples = train[train.landmark_id == 83144].sample(25).id.values

display_images(samples)
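
The cells above draw 25 images per landmark, which works here because these landmarks are the most frequent ones in the training set. For a rare landmark (some have as few as 2 images), `DataFrame.sample(25)` raises a `ValueError` because it cannot draw more rows than exist without replacement. A minimal sketch of a safer sampler follows. The name `sample_ids` and the toy frame are assumptions made for illustration.

```python
import pandas as pd


def sample_ids(train_df: pd.DataFrame, landmark_id: int, n: int = 25, seed=None):
    """Return up to `n` image ids for one landmark (hypothetical helper)."""
    subset = train_df[train_df.landmark_id == landmark_id]
    # Cap the sample size at the number of available images
    k = min(n, len(subset))
    return subset.sample(k, random_state=seed).id.values


# Toy frame with the same columns as train.csv
toy_train = pd.DataFrame({
    "id": ["a1", "a2", "a3", "b1", "b2"],
    "landmark_id": [7, 7, 7, 9, 9],
})
few = sample_ids(toy_train, landmark_id=7, n=25, seed=0)
print(len(few))
```

With fewer than 25 ids, `display_images` leaves the unused grid cells empty, so it may be worth turning their axes off as well.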

#### <p><span style="color:green">This kernel is a work in progress and will be updated as the competition progresses :)</span></p>
