<div align = "center">
    <h1>Google Landmark Recognition Challenge</h1>
    <img src = "https://miro.medium.com/max/1280/1*OVP48VCImepxkHl7AVzkug.png">
</div>

<div align = "center">
    <h3>What is this challenge all about?</h3>
    <br>
</div>
    
<div align = "center">Have you ever looked at a photo of a place you visited and realized you forgot its name or location? Landmark recognition can help! In this challenge, <b>Google</b> asks participants to predict landmark labels directly from image pixels, to help people better understand and organize their photo collections.</div>
  


Let's start by importing the required libraries:

In [None]:
import pandas as pd
import numpy as np
from collections import Counter
import plotly.express as px
import matplotlib.pyplot as plt
import cv2

## About the dataset:

The `train.csv` contains two columns, `id` and `landmark_id`:

In [None]:
df = pd.read_csv('/kaggle/input/landmark-recognition-2020/train.csv')
df.head()

Woah! The training dataset has ~1.5 million images!

In [None]:
print("Number of training images:", len(df))

But there are only ~81k unique landmarks:

In [None]:
print("Number of landmarks:", df['landmark_id'].nunique())

## Most occurring landmarks:

Let's see which landmarks occur most often:

In [None]:
# Count how many training images each landmark has
landmark_counts = dict(Counter(df['landmark_id']))
landmark_dict = {'landmark_id': list(landmark_counts.keys()), 'count': list(landmark_counts.values())}

# Build a DataFrame of the counts and sort it from most to least frequent
landmark_count_df = pd.DataFrame.from_dict(landmark_dict)
landmark_count_sorted = landmark_count_df.sort_values('count', ascending = False)
landmark_count_sorted.head(30)
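
As a side note, the counting and sorting above can also be done in one step with `Counter.most_common`. The snippet below is a minimal sketch on a made-up list of labels (the values in `landmark_ids` are hypothetical and stand in for `df['landmark_id']`):

```python
from collections import Counter

# Toy landmark labels standing in for df['landmark_id'] (hypothetical values)
landmark_ids = [7, 3, 7, 1, 3, 7, 9]

# most_common(n) returns (value, count) pairs, highest count first
top_two = Counter(landmark_ids).most_common(2)
print(top_two)
```

On the real data, `Counter(df['landmark_id']).most_common(30)` should give the same landmarks as the sorted DataFrame, just as a list of tuples instead of a table.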

## Distribution of Landmarks with their counts:

In [None]:
fig_count = px.histogram(landmark_count_df, x = 'landmark_id', y = 'count')
fig_count.update_layout(
    title_text='Distribution of Landmarks',
    xaxis_title_text='Landmark ID',
    yaxis_title_text='Count'
)

fig_count.show()
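
Another way to look at the same distribution is to ask how many landmarks have exactly one image, two images, and so on. The following is a small sketch of that idea on toy labels; the helper names `images_per_landmark_histogram` and `share_with_at_most` are made up for this illustration and are not part of the notebook:

```python
from collections import Counter

def images_per_landmark_histogram(landmark_ids):
    """Map 'number of images' to 'number of landmarks with that many images'."""
    per_landmark = Counter(landmark_ids)
    return Counter(per_landmark.values())

def share_with_at_most(landmark_ids, max_images):
    """Fraction of landmarks that have at most `max_images` images."""
    per_landmark = Counter(landmark_ids)
    small = sum(1 for count in per_landmark.values() if count <= max_images)
    return small / len(per_landmark)

# Toy labels (hypothetical); with the real data, pass df['landmark_id'] instead
toy_ids = [7, 3, 7, 1, 3, 7, 9]
histogram = images_per_landmark_histogram(toy_ids)
rare_share = share_with_at_most(toy_ids, max_images=1)
print(dict(histogram), rare_share)
```

A summary like this makes it easier to see how many classes have only a handful of training images, which is hard to read off a plot with tens of thousands of landmark IDs on the x-axis.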

## Common Image Sizes:

Let's see which image sizes are common in the dataset:

> NOTE: I'm using only the first 1000 images, since each image has to be read from disk to get its size, which is slow for the full dataset.

In [None]:
import os

BASE_DIR = '../input/landmark-recognition-2020'
TRAIN_DIR = os.path.join(BASE_DIR, 'train')

# Collect the path of every file under the training directory
filelist = []
for root, dirs, files in os.walk(TRAIN_DIR):
    for file in files:
        filelist.append(os.path.join(root, file))
len(filelist)

In [None]:
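img_sizes = []

for img_path in filelist[:1000]:
    img = cv2.imread(img_path)
    if img is None:  # cv2.imread returns None for files it cannot read
        continue
    # img.shape is (height, width, channels), so sizes are recorded as height x width
    img_sizes.append("{}x{}".format(img.shape[0], img.shape[1]))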

In [None]:
size_counts = dict(Counter(img_sizes))
size_dict = {'size': list(size_counts.keys()), 'count': list(size_counts.values())}

size_df = pd.DataFrame.from_dict(size_dict)
size_sorted = size_df.sort_values('count', ascending = False)
size_sorted = size_sorted[:10]

fig_image_sizes = px.bar(size_sorted, x = 'size', y = 'count')
fig_image_sizes.update_layout(title = 'Most Common Image Sizes (height x width)')
fig_image_sizes.show()
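
One caveat with taking the first 1000 entries of `filelist`: they follow the order in which `os.walk` visits the folders, so they may all come from a few subdirectories. A possible alternative, sketched below, is to draw a random sample of the same size. The helper `sample_paths` and the fake file names are assumptions made for this example only:

```python
import random

def sample_paths(paths, k=1000, seed=0):
    """Return up to k paths drawn uniformly at random, without replacement."""
    rng = random.Random(seed)   # fixed seed so the sample is reproducible
    k = min(k, len(paths))      # avoid asking for more items than exist
    return rng.sample(paths, k)

# Fake paths for illustration; in the notebook this would be `filelist`
fake_paths = ["img_{}.jpg".format(i) for i in range(50)]
subset = sample_paths(fake_paths, k=10)
print(len(subset))
```

In the notebook, `sample_paths(filelist)` could replace `filelist[:1000]` in the loop above at the same reading cost.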

## Let's see the top 10 landmarks!

In [None]:
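def retrieve_image(image_id):
    # Images are stored as train/<1st char>/<2nd char>/<3rd char>/<image_id>.jpg
    img_path = os.path.join(TRAIN_DIR, image_id[0], image_id[1], image_id[2], image_id + '.jpg')
    img = cv2.imread(img_path)
    # OpenCV loads images as BGR; convert to RGB for matplotlib
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return img

def get_image_id(landmark_id):
    # Return the ID of the first training image labelled with this landmark
    return df[df['landmark_id'] == landmark_id]['id'][:1].values[0]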
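
To make the folder layout used by `retrieve_image` explicit, here is a small sketch that only builds the path, without reading any image. The helper name `image_path` and the ID `'abc123'` are made up for illustration; the layout itself is taken from the `cv2.imread` call above:

```python
import os

def image_path(image_id, train_dir):
    """Build <train_dir>/<c0>/<c1>/<c2>/<image_id>.jpg from the first three ID characters."""
    return os.path.join(train_dir, image_id[0], image_id[1], image_id[2], image_id + '.jpg')

# 'abc123' is a made-up ID; real IDs come from the `id` column of train.csv
example = image_path('abc123', 'train')
print(example)
```

Separating path construction from loading also makes it easy to check `os.path.exists(path)` before calling `cv2.imread`, which returns `None` instead of raising when a file is missing.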

In [None]:
fig, ax = plt.subplots(5, 2, figsize = (30, 30), dpi = 250)
ax = ax.flatten()

top_10_landmarks = landmark_count_sorted['landmark_id'][:10].values

for i, landmark_id in enumerate(top_10_landmarks):
    image_id = get_image_id(landmark_id)  # look up once, reuse for title and image
    ax[i].set_title(image_id)
    ax[i].set_xticks([])
    ax[i].set_yticks([])
    ax[i].imshow(retrieve_image(image_id))
fig.tight_layout()
# plt.show()

## Bottom 10 landmarks:

In [None]:
fig, ax = plt.subplots(5, 2, figsize = (30, 30), dpi = 250)
ax = ax.flatten()
bottom_10_landmarks = landmark_count_sorted['landmark_id'][-10:].values

for i, landmark_id in enumerate(bottom_10_landmarks):
    image_id = get_image_id(landmark_id)  # look up once, reuse for title and image
    ax[i].set_title(image_id)
    ax[i].set_xticks([])
    ax[i].set_yticks([])
    ax[i].imshow(retrieve_image(image_id))
fig.tight_layout()
# plt.show()