# Google Landmark Retrieval 2020 - EDA
<img src="https://github.com/seriousran/img_link/blob/master/kg/glr.jpg?raw=true" alt="drawing" style="width:520px;"/>

"This year, we have worked to set this up as a code competition and we have completely refreshed the test and index image sets."

Let's dig in! :)

### Past Related Competitions

1. [Google Landmark Retrieval 2019](https://www.kaggle.com/c/landmark-retrieval-2019)
1. [Google Landmark Retrieval Challenge](https://www.kaggle.com/c/landmark-retrieval-challenge)

### Winners of the last competition
<img src="https://github.com/seriousran/img_link/blob/master/kg/lb_glr_2019.PNG?raw=true" alt="drawing"/>



Reference: https://www.kaggle.com/huangxiaoquan/google-landmarks-v2-exploratory-data-analysis-eda

## Outline
1. [File Exploration](#1)
    1. [Training data](#2)
    1. [Index data](#3)
    1. [Display examples](#4)

In [None]:
import os
import glob
import cv2
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

from scipy import stats



%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## 1.1 Training data

In this competition, you are asked to develop models that can efficiently retrieve landmark images from a large database. 
The training set is available in the train/ folder, with corresponding landmark labels in train.csv. 

In [None]:
train_df = pd.read_csv('../input/landmark-retrieval-2020/train.csv')
train_df

### Landmark_id distribution

In [None]:
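plt.title('landmark_id distribution')
# sns.distplot is deprecated in recent seaborn releases (0.11+);
# histplot with kde=True and stat='density' draws the same kind of plot.
sns.histplot(train_df['landmark_id'], kde=True, stat='density')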

### Training set: number of images per class (line plot)



In [None]:
sns.set()
plt.title('Training set: number of images per class (line plot)')
landmarks_fold = pd.DataFrame(train_df['landmark_id'].value_counts())
landmarks_fold.reset_index(inplace=True)
landmarks_fold.columns = ['landmark_id','count']
ax = landmarks_fold['count'].plot(logy=True, grid=True)
locs, labels = plt.xticks()
plt.setp(labels, rotation=30)
ax.set(xlabel="Landmarks", ylabel="Number of images")

### Training set: number of images per class (scatter plot)

In [None]:
sns.set()
landmarks_fold_sorted = pd.DataFrame(train_df['landmark_id'].value_counts())
landmarks_fold_sorted.reset_index(inplace=True)
landmarks_fold_sorted.columns = ['landmark_id','count']
landmarks_fold_sorted = landmarks_fold_sorted.sort_values('landmark_id')
ax = landmarks_fold_sorted.plot.scatter(
     x='landmark_id', y='count',
     title='Training set: number of images per class (scatter plot)')
locs, labels = plt.xticks()
plt.setp(labels, rotation=30)
ax.set(xlabel="Landmarks", ylabel="Number of images")
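
The plots above show how unevenly images are spread across classes. A few summary numbers can make that imbalance concrete. The sketch below is only an illustration: `toy_df` is a hypothetical stand-in for `train_df`, and the `max_images` threshold is an arbitrary choice, not something taken from the competition.

```python
import pandas as pd

# Hypothetical toy labels standing in for train.csv (the real file is far larger).
toy_df = pd.DataFrame({
    'id': ['a1', 'a2', 'a3', 'b1', 'b2', 'c1'],
    'landmark_id': [7, 7, 7, 3, 3, 9],
})

# Number of images per class, sorted from most to least frequent.
counts = toy_df['landmark_id'].value_counts()

# Share of classes that have at most `max_images` images
# (the threshold is an assumption chosen for illustration).
max_images = 2
rare_share = (counts <= max_images).mean()

print(counts.describe())
print(f'{rare_share:.0%} of classes have <= {max_images} images')
```

Replacing `toy_df` with `train_df` should give the same statistics for the real training labels.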

## 1.2 Test and Index data

The query images are listed in the test/ folder, while the "index" images from which you are retrieving are listed in index/. 

Each image has a unique id. Because there are so many images, each one is placed within three nested subfolders named after the first three characters of its id (e.g. image abcdef.jpg is stored at a/b/c/abcdef.jpg).

Each subfolder name is a single hexadecimal character (0-f), so the layout is a 0-f folder inside a 0-f folder inside a 0-f folder.
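
The folder layout described above can be turned into a small path helper. This is a minimal sketch: `image_path` is a hypothetical function name, and the `.jpg` extension is assumed from the example in the text.

```python
import os

def image_path(image_id, root, ext='.jpg'):
    """Build <root>/<c0>/<c1>/<c2>/<image_id><ext> from the first three id characters."""
    if len(image_id) < 3:
        raise ValueError('image id must have at least three characters')
    return os.path.join(root, image_id[0], image_id[1], image_id[2], image_id + ext)

# The id below is the illustrative one from the text, not a real image id.
print(image_path('abcdef', 'train'))
```

A helper like this makes it possible to go from an `id` in train.csv to a file on disk without globbing the whole directory tree.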

In [None]:
test_list = glob.glob('../input/landmark-retrieval-2020/test/*/*/*/*')
index_list = glob.glob('../input/landmark-retrieval-2020/index/*/*/*/*')

In [None]:
print(f'{len(test_list)} query (test) images to look up in {len(index_list)} index images')

## 1.3 Display examples

In [None]:
plt.rcParams["axes.grid"] = False
f, axarr = plt.subplots(4, 3, figsize=(24, 22))

# Fill the 4x3 grid column by column with the first 12 query images.
for i in range(12):
    example = cv2.imread(test_list[i])
    # OpenCV loads images as BGR; reverse the channel axis to get RGB for matplotlib.
    example = example[:, :, ::-1]

    row, col = i % 4, i // 4
    axarr[row, col].imshow(example)
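
The `[:, :, ::-1]` slice in the cell above reverses the channel axis, because OpenCV returns images in BGR order while matplotlib expects RGB. The effect can be checked on a tiny synthetic array. This sketch uses NumPy only, and the two-pixel "image" is made up for illustration.

```python
import numpy as np

# A 1x2 "image" in OpenCV's BGR channel order:
# first pixel is pure blue, second pixel is pure red.
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the B and R channels, giving RGB order.
rgb = bgr[:, :, ::-1]

print(rgb[0, 0], rgb[0, 1])
```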