# Google Landmark Recognition-2020

**In this notebook I perform Exploratory Data Analysis (EDA) on this dataset and try to explain each step as clearly as possible. I'm a beginner myself, so if you find any mistakes, please point them out in the comment section.**

These notebooks gave me the necessary ideas for this task, and I'm really grateful to their authors:
1. https://www.kaggle.com/chirag9073/landmark-recognition-exploratory-data-analysis/notebook
2. https://www.kaggle.com/azaemon/mura-classification
3. [https://www.kaggle.com/azaemon/eda-data-augmentation-for-beginners?scriptVersionId=40504799](https://www.kaggle.com/azaemon/eda-data-augmentation-for-beginners?scriptVersionId=40504799)




**To begin with, let's import the necessary modules.**

In [None]:
import numpy as np
import pandas as pd
import cv2
import matplotlib.pyplot as plt  # for plotting graphs and images
import seaborn as sns
import plotly.offline as py
import plotly.express as px
import plotly.graph_objs as go
import plotly.tools as tls  # visualization
import plotly.figure_factory as ff  # visualization
import matplotlib.image as mpimg

# Set Color Palettes for the notebook (https://color.adobe.com/)
colors_nude = ['#FFE61A','#B2125F','#FF007B','#14B4CC','#099CB3']
sns.palplot(sns.color_palette(colors_nude))

# Set Style
sns.set_style("whitegrid")
sns.despine(left=True, bottom=True)

# Exploratory Data Analysis

**Load the training csv file**

In [None]:
train_data= pd.read_csv("../input/landmark-recognition-2020/train.csv")

**Take a look at the first 10 entries. The number inside the parentheses sets how many rows are shown; you can change it to whatever you want. The default is 5.**

In [None]:
print(train_data.head(10))
print()
print("Here, id is the image ID and landmark_id is the ID of the landmark shown in that image.")

**Now let's take a look at the summary of the loaded data**

In [None]:
train_data.describe()
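
Because `landmark_id` is a label rather than a quantity, the mean and quartiles that `describe()` reports for it say little. A minimal sketch of summarising the column as a category instead is shown here; the `toy` DataFrame and its values are made up for illustration and are not taken from train.csv:

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4", "e5", "f6"],
    "landmark_id": [1, 1, 1, 2, 2, 3],
})

# Treat the ID as a categorical label: describe() then reports
# count, unique, top (most frequent value) and freq (its count)
summary = toy["landmark_id"].astype("category").describe()
print(summary)
```

On the real data, the same call would be `train_data["landmark_id"].astype("category").describe()`.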

**Here I checked the CSV file for missing values and found none.**

In [None]:
print(train_data.isna().sum())
print()

**Now, let's perform the Exploratory Data Analysis using the *basic_image_eda* library.**

In [None]:
!pip install basic_image_eda
from basic_image_eda import BasicImageEDA

**There are a total of 1580470 images in the train folder, so performing EDA over all of them would take a huge amount of time. Here I apply it to only one of the subfolders. You can choose any subfolder by changing the path: for example, to use the subfolder "1", set data_dir to "../input/landmark-recognition-2020/train/1"; to run over all the training images, set it to "../input/landmark-recognition-2020/train".**

In [None]:
data_dir = "../input/landmark-recognition-2020/train/0"
extensions = ['jpg']
threads = 0
dimension_plot = True
channel_hist = True
nonzero = False
hw_division_factor = 1.0

BasicImageEDA.explore(data_dir, extensions, threads, dimension_plot, channel_hist, nonzero, hw_division_factor)

Now, let's analyze the number of landmark types and their distributions.

In [None]:
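# Number of images per landmark, most frequent first
landmark_counts = train_data['landmark_id'].value_counts()
print(landmark_counts)
# The original run reported 81313 landmark types; ID 138982 had the most images (6272)
print(f"Types of Landmarks: {landmark_counts.size}")
print(f"Landmark ID: {landmark_counts.idxmax()} has the highest number of images ({landmark_counts.max()})")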
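
The counts above suggest a heavily imbalanced dataset. One way to put numbers on that, sketched here on made-up toy data, is to measure the share of images held by the most frequent landmark and to count the landmarks with very few images. The `toy` DataFrame and the `rare_threshold` cut-off are hypothetical choices for illustration, not values from the competition:

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "id": list("abcdefgh"),
    "landmark_id": [7, 7, 7, 7, 7, 3, 3, 9],
})

# Images per landmark, sorted from most to least frequent
counts = toy["landmark_id"].value_counts()

# Fraction of all images that belong to the most frequent landmark
top_share = counts.iloc[0] / counts.sum()

# Landmarks with fewer images than an (assumed) cut-off
rare_threshold = 2
rare_landmarks = int((counts < rare_threshold).sum())

print(f"Most frequent landmark holds {top_share:.1%} of the images")
print(f"{rare_landmarks} landmark(s) have fewer than {rare_threshold} images")
```

Replacing `toy` with `train_data` gives the same summary for the real labels.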

**Most frequent landmark counts (Top 10)**

In [None]:
# Occurrences of landmark_id in decreasing order (top categories)
temp = pd.DataFrame(train_data.landmark_id.value_counts().head(10))
temp.reset_index(inplace=True)
temp.columns = ['Landmark ID','Number of Images']

# Plot the most frequent landmark_ids
plt.figure(figsize = (9, 10))
plt.title('Top 10 most frequent landmarks')
sns.set_color_codes("deep")
sns.barplot(x="Landmark ID", y="Number of Images", data=temp,
            label="Count")
plt.show()


**Least frequent landmark counts (Bottom 10)**

In [None]:
temp = pd.DataFrame(train_data.landmark_id.value_counts().tail(10))
temp.reset_index(inplace=True)
temp.columns = ['Landmark ID','Number of Images']
# Plot the least frequent landmark_ids
plt.figure(figsize = (9, 10))
plt.title('The 10 least frequent landmarks')
sns.set_color_codes("deep")
sns.barplot(x="Landmark ID", y="Number of Images", data=temp,
            label="Count")
plt.show()


Now let's plot a few sample images.

In [None]:
# Hand-picked sample images from the training set
image_files = [
    '../input/landmark-recognition-2020/train/2/3/6/23603d71816b6452.jpg',
    '../input/landmark-recognition-2020/train/7/0/4/7040a5cfa43e0633.jpg',
    '../input/landmark-recognition-2020/train/4/1/0/41000aafca574dfe.jpg',
    '../input/landmark-recognition-2020/train/4/3/1/43101b9ac11ed672.jpg',
    '../input/landmark-recognition-2020/train/4/3/1/43105797059abd97.jpg',
    '../input/landmark-recognition-2020/train/4/1/0/41008546ba23b770.jpg',
]

# Show the images in a 2 x 3 grid, using each file name as the title
fig = plt.figure(figsize=(20, 10))
for position, path in enumerate(image_files, start=1):
    ax = fig.add_subplot(2, 3, position)
    ax.set_title(path.split("/")[-1])
    ax.imshow(plt.imread(path))

plt.show()
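
The image paths above are written out by hand. Judging from those paths, each image seems to sit in a folder named after the first three characters of its ID, so paths could also be built from the `id` column and sampled at random. This is a rough sketch under that assumption; `image_path` and `sample_image_paths` are hypothetical helper names, not part of any library:

```python
import random

TRAIN_DIR = "../input/landmark-recognition-2020/train"

def image_path(image_id, root=TRAIN_DIR):
    """Build <root>/<c0>/<c1>/<c2>/<image_id>.jpg.

    Assumes the folder layout inferred from the hand-written paths:
    the first three characters of the ID name the three subfolders.
    """
    return f"{root}/{image_id[0]}/{image_id[1]}/{image_id[2]}/{image_id}.jpg"

def sample_image_paths(image_ids, k, seed=None):
    """Pick k distinct IDs at random and return their file paths."""
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return [image_path(i) for i in rng.sample(list(image_ids), k)]

# Example with two of the IDs used in this notebook
example_ids = ["23603d71816b6452", "7040a5cfa43e0633"]
print(sample_image_paths(example_ids, k=2, seed=0))
```

With the real data, `sample_image_paths(train_data["id"], k=6)` would return six paths that can be passed to `plt.imread` as before.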
    

**If you found this notebook helpful or just liked it, some upvotes would be very much appreciated. That will keep me motivated :)**