## Data Description

The dataset consists of:

* **train.csv**: CSV file that maps each image `id` to its `landmark_id`

* **sample_submission.csv**: CSV file that shows the submission format

* **train**: folder containing the training images. *Because there are so many images, each image is placed within three nested subfolders named after the first three characters of its id (e.g. image abcdef.jpg is stored at a/b/c/abcdef.jpg).*

* **test**: folder containing the test images
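The subfolder rule for the training images can be sketched as a small helper. This is a minimal illustration, not part of the competition tooling: the function name `image_path` and the default `root` value are assumptions made for this example.

```python
from pathlib import PurePosixPath


def image_path(image_id, root="train"):
    """Build the nested path of an image from the first three characters of its id.

    `image_path` and `root` are hypothetical names used only for illustration.
    """
    if len(image_id) < 3:
        raise ValueError("image id must have at least three characters")
    first, second, third = image_id[:3]
    return str(PurePosixPath(root) / first / second / third / f"{image_id}.jpg")


print(image_path("abcdef"))
```

Passing the dataset folder as `root` (for example `/kaggle/input/landmark-recognition-2020/train`) should give the full path of a training image, assuming the layout described above.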

In this competition, you are asked to take test images and recognize which landmarks (if any) are depicted in them.


In [None]:
# Folders and files in the dataset
import glob  # used later to collect sample image paths
import os
print(os.listdir("/kaggle/input/landmark-recognition-2020/"))

## About train.csv

In [None]:
import pandas as pd
import numpy as np
trainCsv = pd.read_csv("/kaggle/input/landmark-recognition-2020/train.csv")

print(f"Shape of train.csv dataframe: {trainCsv.shape}")

In [None]:
# Head sample
trainCsv.head(3)

In [None]:
# Tail sample
trainCsv.tail(3)

### Information extraction

In [None]:
# Number of images per landmark, sorted from most to least frequent
landmarkValueCount = trainCsv["landmark_id"].value_counts()
print(f"Number of n/a values:\n{trainCsv.isna().sum()}")
print("No n/a found.\n")
print("All Id: ", len(trainCsv["id"]), "\nUnique Id: ", trainCsv["id"].nunique())
print("No id is repeated.\n")
print(f"Number of unique landmark ids: {trainCsv['landmark_id'].nunique()}\n")
print(f"Average images per class: {len(trainCsv)/trainCsv['landmark_id'].nunique()}")

print(f"Minimum number of images of a landmark: {landmarkValueCount.min()}")
print(f"Maximum number of images of a landmark: {landmarkValueCount.max()}")
print(f"Average number of images of a landmark: {landmarkValueCount.mean()}\n")
print(f"Number of landmarks having 2 images: {(landmarkValueCount == 2).sum()}")
print(f"Number of landmarks having 6272 images: {(landmarkValueCount == 6272).sum()}")
print(f"Number of landmarks having 19 images: {(landmarkValueCount == 19).sum()}")

## Plotting train.csv

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

### Distribution plot: Landmark Id

In [None]:
plt.figure(figsize=(15,7))
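# sns.distplot is deprecated; histplot with a KDE overlay gives a comparable density plot
ax = sns.histplot(trainCsv["landmark_id"], bins=50, kde=True, stat="density")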
plt.xlabel('landmark_id')
plt.title("Distribution of landmark_id")
plt.show()

### Distribution plot: Landmark Id Count and log of Landmark Id Count

In [None]:
fig, ax = plt.subplots(1,2,figsize=(20,7))
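# sns.distplot is deprecated; histplot with a KDE overlay gives a comparable density plot
sns.histplot(landmarkValueCount, bins=50, kde=True, stat="density", ax=ax[0])
ax[0].set_xlabel('landmark_id count')
ax[0].set_title("Distribution of landmark_id count")

sns.histplot(np.log10(landmarkValueCount), bins=50, kde=True, stat="density", ax=ax[1])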
ax[1].set_xlabel('Log of landmark_id count')
ax[1].set_title("Distribution of log of landmark_id count")
plt.show()

### First 50 samples Bar plot: Count of images per landmark_id

In [None]:
plt.figure(figsize=(15,7))
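# 50 most frequent landmarks; columns are named explicitly so the code does not depend on the pandas version
sample = landmarkValueCount.iloc[:50].reset_index()
sample.columns = ["landmark_id", "count"]
ax = sns.barplot(x="landmark_id", y="count", data=sample, order=sample["landmark_id"], palette="Blues_d")
for item in ax.get_xticklabels(): item.set_rotation(90)
# Write the image count above each bar
for i, v in enumerate(sample["count"]):
    ax.text(i-.5, v, "{:,}".format(v), rotation=45)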
plt.xlabel('landmark_id')
plt.ylabel('count of images')
plt.title("Count of images per landmark_id")
plt.show()

### Last 50 samples Bar plot: Count of images per landmark_id

In [None]:
plt.figure(figsize=(15,7))
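# 50 least frequent landmarks; columns are named explicitly so the code does not depend on the pandas version
sample = landmarkValueCount.iloc[-50:].reset_index()
sample.columns = ["landmark_id", "count"]
ax = sns.barplot(x="landmark_id", y="count", data=sample, order=sample["landmark_id"], palette="Blues_d")
for item in ax.get_xticklabels(): item.set_rotation(90)
# Write the image count above each bar
for i, v in enumerate(sample["count"]):
    ax.text(i-.5, v, "{:,}".format(v), rotation=45)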
plt.xlabel('landmark_id')
plt.ylabel('count of images')
plt.title("Count of images per landmark_id")
plt.show()

### Overall 50 samples Bar plot: Count of images per landmark_id

In [None]:
plt.figure(figsize=(15,7))
# About 50 landmarks taken at evenly spaced ranks, from most to least frequent
sample = landmarkValueCount.iloc[::len(landmarkValueCount)//50].reset_index()
sample.columns = ["landmark_id", "count"]
ax = sns.barplot(x="landmark_id", y="count", data=sample, order=sample["landmark_id"], palette="Blues_d")
for item in ax.get_xticklabels(): item.set_rotation(90)
# Write the image count above each bar
for i, v in enumerate(sample["count"]):
    ax.text(i-.5, v, "{:,}".format(v))
plt.xlabel('landmark_id')
plt.ylabel('count of images')
plt.title("Count of images per landmark_id")
plt.show()

#### As the plot above shows, landmark_id=138982 has 6272 images, which squeezes the rest of the plot.
#### Let's remove this first landmark_id and look again.

In [None]:
plt.figure(figsize=(15,7))
# Same evenly spaced sample as before, without the most frequent landmark
sample = landmarkValueCount.iloc[::len(landmarkValueCount)//50].reset_index()
sample.columns = ["landmark_id", "count"]
sample = sample.iloc[1:]
ax = sns.barplot(x="landmark_id", y="count", data=sample, order=sample["landmark_id"], palette="Blues_d")
for item in ax.get_xticklabels(): item.set_rotation(90)
# Write the image count above each bar
for i, v in enumerate(sample["count"]):
    ax.text(i, v, "{:,}".format(v), va='bottom')
plt.xlabel('landmark_id')
plt.ylabel('count of images')
plt.title("Count of images per landmark_id after removing the most frequent landmark_id")
plt.show()

### Scatter plot: Count of images per landmark_id and log of count of images per landmark_id

In [None]:
counts = landmarkValueCount.values

# Bucket each landmark by its image count: 0 is the most frequent bucket, 4 the rarest
hue = np.zeros(len(counts), dtype=int)
hue[counts<3000] = 1
hue[counts<1000] = 2
hue[counts<300] = 3
hue[counts<3] = 4
cl = ["images>=3000", "1000<=images<3000", "300<=images<1000", "3<=images<300", "0<images<3"]
hueLabels = np.array(cl)[hue]  # readable legend label for every landmark

fig, ax = plt.subplots(1,2,figsize=(20,7))
sns.scatterplot(x=landmarkValueCount.index, y=counts, alpha=.5, hue=hueLabels, hue_order=cl, s=100, ax=ax[0], palette="bright")
ax[0].set_xlabel('landmark_id')
ax[0].set_ylabel('Count of images')
ax[0].set_title("Count of images per landmark_id")

sns.scatterplot(x=landmarkValueCount.index, y=np.log10(counts), alpha=.5, hue=hueLabels, hue_order=cl, s=100, ax=ax[1], palette="bright")
ax[1].set_xlabel('landmark_id')
ax[1].set_ylabel('Log10 of count of images')
ax[1].set_title("Log10 of count of images per landmark_id")
plt.show()


### For better understanding: number of landmarks in each image-count range

In [None]:
# Count how many landmarks fall into each bucket (0..4), in bucket order
hueSort = pd.Series(hue).value_counts().sort_index()
huedf = pd.Series(hueSort.values, index=cl).reset_index()
percent = huedf.iloc[:,1]/len(hue)*100
huedfPerc = pd.concat([huedf, percent], axis=1)
huedfPerc.columns = ["Desc", "Number of landmarks", "Percent of landmarks"]
huedfPerc["Desc"] = "Landmark Id having "+ huedfPerc["Desc"]
huedfPerc

### Sample Images

In [None]:
from PIL import Image
# Up to 40 training images found under train/0/0/
imagePath = glob.glob("/kaggle/input/landmark-recognition-2020/train/0/0/*/*")[:40]

fig, ax = plt.subplots(10, 4, figsize=(15, 40))
ax = np.ravel(ax)
# zip stops early if fewer than 40 images were found
for axis, path in zip(ax, imagePath):
    axis.imshow(Image.open(path))


### Evaluation
Submissions are evaluated using Global Average Precision (GAP) at k, where k=1. This metric is also known as micro Average Precision (μAP), as per [1,2]. It works as follows:

For each test image, you will predict one landmark label and a corresponding confidence score. The evaluation treats each prediction as an individual data point in a long list of predictions (sorted in descending order by confidence scores), and computes the Average Precision based on this list.

If a submission has N predictions (label/confidence pairs) sorted in descending order by their confidence scores, then the Global Average Precision is computed as:

$$GAP = \frac{1}{M}\sum_{i=1}^{N} P(i)\,rel(i)$$

where:

* N is the total number of predictions returned by the system, across all queries

* M is the total number of queries with at least one landmark from the training set visible in it (note that some queries may not depict landmarks)

* P(i) is the precision at rank i

* rel(i) denotes the relevance of prediction i: it's 1 if the i-th prediction is correct, and 0 otherwise
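The definition above can be sketched in a few lines of Python. This is an illustrative reading of the formula, not the official scoring code: the function name `global_average_precision` and the input layout (one `(confidence, is_correct)` pair per non-empty prediction) are assumptions made for this example.

```python
def global_average_precision(predictions, num_relevant_queries):
    """Compute GAP from (confidence, is_correct) pairs.

    predictions: one pair per non-empty prediction (N pairs in total).
    num_relevant_queries: M, the number of queries that depict a training landmark.
    Both names are hypothetical and chosen for this sketch.
    """
    # Rank all predictions by confidence, most confident first
    ranked = sorted(predictions, key=lambda pair: pair[0], reverse=True)

    correct_so_far = 0
    total = 0.0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            correct_so_far += 1
            # P(rank) * rel(rank); incorrect predictions contribute 0
            total += correct_so_far / rank

    return total / num_relevant_queries if num_relevant_queries else 0.0


# Toy example: three predictions, the second one is wrong, M = 3
toy = [(0.9, True), (0.8, False), (0.5, True)]
print(global_average_precision(toy, num_relevant_queries=3))
```

In the toy example the correct predictions sit at ranks 1 and 3, so the sum is 1/1 + 2/3 and the score is that sum divided by M = 3. A confident wrong prediction lowers the precision of every correct prediction ranked below it.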

### Submission File

In [None]:
submissionSample = pd.read_csv("/kaggle/input/landmark-recognition-2020/sample_submission.csv")
submissionSample.head(3)

In [None]:
submissionSample.info()

For each id in the test set, you can predict at most one landmark and its corresponding confidence score. **Some images contain no landmarks.** You may decide not to predict any result for a given query, by submitting an empty prediction. The submission file should contain a header and have the following format (larger scores denote more confident matches):

id,landmarks

000088da12d664db,8815 0.03

0001623c6d808702,

0001bbb682d45002,5328 0.5

etc.
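A submission in this format could be assembled roughly as follows. This is a sketch under stated assumptions: the `predictions` dictionary and the helper `format_prediction` are hypothetical, and the ids and scores are taken from the format example rather than from a real model.

```python
import pandas as pd


def format_prediction(prediction):
    """Turn a (landmark_id, score) pair into 'landmark_id score'; None means no prediction.

    Hypothetical helper written for this example.
    """
    if prediction is None:
        return ""
    landmark_id, score = prediction
    return f"{landmark_id} {score}"


# Hypothetical model output: image id -> (landmark_id, confidence) or None
predictions = {
    "000088da12d664db": (8815, 0.03),
    "0001623c6d808702": None,
    "0001bbb682d45002": (5328, 0.5),
}

submission = pd.DataFrame(
    {
        "id": list(predictions.keys()),
        "landmarks": [format_prediction(p) for p in predictions.values()],
    }
)

# index=False keeps only the two required columns
csv_text = submission.to_csv(index=False)
print(csv_text)
```

To write the file instead of printing it, a path such as `submission.csv` can be passed to `to_csv`.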