![](https://1.bp.blogspot.com/-EQNRBJuBXJo/XMthch8bdWI/AAAAAAAAEG0/oHVw1fxfiXoMfsSn_rNQPR-fyRqM_N4CwCEwYBhgL/s1600/image1.png)
## What you need to know about this competition
In this competition, you are asked to take test images and recognize which landmarks (if any) are depicted in them. The training set is available in the `train/` folder, with corresponding landmark labels in `train.csv`. The test set images are in the `test/` folder. **Each image has a unique id**. Since there are a large number of images, each image is placed within three nested subfolders according to the first three characters of the image id (i.e. image `abcdef.jpg` is placed in `a/b/c/abcdef.jpg`).
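
The folder layout described above can be sketched as a small path helper. This is a minimal sketch, not part of the competition's tooling: the function name `image_path` and the `root` default are assumptions chosen for illustration, with the default mirroring the Kaggle input path used later in this notebook.

```python
import posixpath

def image_path(image_id, split="train",
               root="../input/landmark-recognition-2020"):
    """Build the path <root>/<split>/<c0>/<c1>/<c2>/<image_id>.jpg.

    `root` is an assumed default matching the Kaggle input directory.
    """
    if len(image_id) < 3:
        raise ValueError("image id must have at least three characters")
    # The first three characters of the id name the three nested subfolders.
    c0, c1, c2 = image_id[:3]
    return posixpath.join(root, split, c0, c1, c2, image_id + ".jpg")

print(image_path("abcdef"))
# ../input/landmark-recognition-2020/train/a/b/c/abcdef.jpg
```

Passing `split="test"` would give the corresponding path inside the test folder.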

**This is a synchronous rerun code competition.** The provided test set is a representative set of files to demonstrate the format of the private test set. When you submit your notebook, Kaggle will rerun your code on the private dataset. Additionally, this competition also has two unique characteristics:

* **To facilitate recognition-by-retrieval approaches, the private training set contains only a 100k subset of the total public training set.** This 100k subset contains all of the training set images associated with the landmarks in the private test set. You may still attach the full training set as an external data set if you wish.
* **Submissions are given 12 hours to run, as compared to the site-wide session limit of 9 hours. While your commit must still finish in the 9 hour limit in order to be eligible to submit, the rerun may take the full 12 hours.**

### Loading Libraries

In [None]:
import numpy as np
import pandas as pd
from os import listdir
from glob import glob

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

### Let's have a look at the given datasets. 

In [None]:
train_data = pd.read_csv('../input/landmark-recognition-2020/train.csv')
submission = pd.read_csv("../input/landmark-recognition-2020/sample_submission.csv")
print('Training dataframe shape: ', train_data.shape)
print('Total number of training images', train_data.shape[0])
print('Total number of test images', len(glob('../input/landmark-recognition-2020/test/*/*/*/*.jpg')))
train_data.head()

In [None]:
# Let's have a look at the sample submission file. 
submission.head()

### Let's look at the class-size distribution and check for null values in the training data.

In [None]:
train_data['landmark_id'].value_counts().hist()

In [None]:
# missing data in training data 
total = train_data.isnull().sum().sort_values(ascending = False)
print(total)
percent = (train_data.isnull().sum()/train_data.isnull().count()).sort_values(ascending = False)
missing_train_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_train_data.head()

Now let's have a look at the **most frequent landmarks** in the dataset.

In [None]:
most_frequent = 10
# Occurrence of landmark_id in decreasing order (top categories)
temp = pd.DataFrame(train_data.landmark_id.value_counts().head(most_frequent)).copy()
temp.reset_index(inplace=True)
temp.columns = ['landmark_id','count']
temp.sort_values(by='count', ascending=False, inplace = True)
temp.set_index('landmark_id', inplace = True)
temp.plot.bar()
most_ids = temp.index.copy()

The most frequent landmark is **landmark_id = 138982**, with **6272** samples.

Now let's have a look at the **least frequent landmarks** in the dataset.

In [None]:
least_frequent = 10
# Occurrence of landmark_id for the least frequent categories (tail of value_counts)
temp = pd.DataFrame(train_data.landmark_id.value_counts().tail(least_frequent)).copy()
temp.reset_index(inplace=True)
temp.columns = ['landmark_id','count']
temp.sort_values(by='count', ascending=False, inplace = True)
temp.set_index('landmark_id', inplace = True)
print(temp)
temp.plot.bar()
least_ids = temp.index.copy()

It is therefore evident that every class has at least two samples and no class is empty.
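
The "at least two samples per class" claim can be checked directly rather than read off the bar plot. The following is a minimal sketch, assuming the labels are available as any iterable (for example `train_data['landmark_id']`); the helper name `class_count_summary` and the toy labels are hypothetical and are not taken from `train.csv`.

```python
from collections import Counter

def class_count_summary(labels):
    """Return the number of classes and the smallest and largest class sizes."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    sizes = counts.values()
    return {
        "n_classes": len(counts),
        "min_count": min(sizes),
        "max_count": max(sizes),
    }

# Toy labels for illustration only; not taken from train.csv.
toy_labels = [7, 7, 7, 7, 3, 3, 9, 9, 9]
summary = class_count_summary(toy_labels)
print(summary)
# Every class has at least two samples exactly when min_count >= 2.
print(summary["min_count"] >= 2)
```

The ratio of `max_count` to `min_count` also gives a rough single-number view of how imbalanced the classes are.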

In [None]:
train_data[train_data['landmark_id'] == 110417].reset_index()['id']

## Now let's have a look at some of the training images
### Most frequent location (landmark_id = 138982)

In [None]:

from PIL import Image

def view_images(ID=110417):
    """Plot up to 16 training images (4x4 grid) for the given landmark_id."""
    # Image ids that belong to the requested landmark
    image_ids = train_data[train_data['landmark_id'] == ID].reset_index()['id']

    # Each image is stored at train/<c0>/<c1>/<c2>/<id>.jpg
    image_paths = []
    for image_id in image_ids:
        sub_folder = image_id[0] + '/' + image_id[1] + '/' + image_id[2] + '/'
        image_paths.append('../input/landmark-recognition-2020/train/' + sub_folder + image_id + '.jpg')

    grid_size = 4
    fig = plt.figure(figsize=(grid_size * 3 + 2, grid_size * 3))
    # Show at most grid_size * grid_size images
    for i, path in enumerate(image_paths[:grid_size * grid_size], start=1):
        fig.add_subplot(grid_size, grid_size, i)
        plt.imshow(Image.open(path))
        plt.title('ID = ' + str(ID), fontsize=16)
        plt.axis('off')
    plt.show()

## View the Images Interactively
### Most frequent images
* The 10 landmark_ids with the most samples are listed in the dropdown below. Select a landmark id from the dropdown menu to view up to 16 of its images (a 4x4 grid).
* You might have to wait a little while for the images to load and plot.

In [None]:
from ipywidgets import interact
interact(view_images, ID = most_ids)


### Least frequent images
* The 10 landmark_ids with the fewest samples are listed in the dropdown below. Select a landmark id from the dropdown menu to view its images.
* You might have to wait a little while for the images to load and plot.

In [None]:
interact(view_images, ID = least_ids)

### Images of class_id = 110417 (One of the least frequent)

In [None]:
view_images(least_ids[0])

### Images of class_id = 59905  (One of the least frequent)

In [None]:
view_images(least_ids[1])


### Images of class_id = 4171  (One of the least frequent)

In [None]:
view_images(least_ids[2])


### Images of `least_ids[3]`  (One of the least frequent)

In [None]:
view_images(least_ids[3])


### Images of class_id = 195143  (One of the least frequent)

In [None]:
view_images(least_ids[4])


### Images of class_id = 180503  (One of the least frequent)

In [None]:
view_images(least_ids[5])


## Now let's have a look at images from the most frequent classes.
### Images of class_id = 138982  (One of the most frequent)
* Each of these categories may contain a huge number of images, but it is quite evident that most of the images within one category are similar.
* For example, most of the images with `landmark_id=138982` show paintings on a wall, usually with the details of the painting written at the bottom.

In [None]:
view_images(most_ids[0])

### Images of class_id = 126637  (One of the most frequent)
* Visually, the images seem to come from an industrial area.

In [None]:
view_images(most_ids[1])

### Images of class_id = 20409  (One of the most frequent)
* Visually, it is evident that most of the images show ancient brickwork or some kind of terracotta.

In [None]:
view_images(most_ids[2])

### Images of class_id = 83144  (One of the most frequent)
* Visually, it is evident that these are images of a single hut or similar buildings.

In [None]:
view_images(most_ids[3])

In [None]:
view_images(most_ids[4])

# Observations:
* There is a massive class imbalance in the dataset.
* Many of the images contain noise.
* Images come in different shapes. Resizing all of them to the same size will distort them, because their aspect ratios differ.
* Some intelligent way of cropping the images is needed.
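
One common way to bring images of different shapes to a single size without distorting them is to scale both axes by the same factor and pad the remainder (often called letterboxing). The sketch below only computes the geometry. It is one possible approach, not a method used in this notebook, and the function name `letterbox_geometry` and the 224-pixel target are illustrative assumptions.

```python
def letterbox_geometry(width, height, target=224):
    """Compute an aspect-preserving resize plus centered padding.

    Returns (new_width, new_height, pad_left, pad_top) such that the
    resized image fits inside a target x target square.
    """
    if width <= 0 or height <= 0 or target <= 0:
        raise ValueError("width, height and target must be positive")
    # A single scale factor for both axes keeps the aspect ratio intact.
    scale = target / max(width, height)
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    # Split the leftover space so the image is centered in the square.
    pad_left = (target - new_width) // 2
    pad_top = (target - new_height) // 2
    return new_width, new_height, pad_left, pad_top

print(letterbox_geometry(800, 600))  # landscape image
print(letterbox_geometry(300, 900))  # portrait image
```

With Pillow, the result could be applied by calling `Image.resize((new_width, new_height))` and then pasting the resized image onto a blank square canvas at `(pad_left, pad_top)` with `Image.paste`.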