# Exploratory Data Analysis

![Google Landmark Dataset](https://1.bp.blogspot.com/-EQNRBJuBXJo/XMthch8bdWI/AAAAAAAAEG0/oHVw1fxfiXoMfsSn_rNQPR-fyRqM_N4CwCEwYBhgL/s1600/image1.png)


The most important step of any machine learning project is understanding the underlying patterns of the dataset. For the landmark dataset, our analysis focuses mainly on the class distribution. As stated in Google's Landmark Dataset [paper](https://arxiv.org/pdf/2004.01804.pdf), the class distribution is **"extremely long-tailed"**. To see the other distinctive features of the landmark dataset, we will also look at some of the images in the train and test sets.

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import PIL.Image as Image



In [None]:
def get_folder_names(mode: str, dataid: str) -> str:
    # Images are nested in folders named after the first three characters of the id
    return f'../input/landmark-recognition-2020/{mode}/{dataid[0]}/{dataid[1]}/{dataid[2]}/{dataid}.jpg'
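
As a quick sanity check of the path layout, the sketch below rebuilds the same pattern with a made-up image id. The id `abcdef0123456789` and the helper name `build_image_path` are hypothetical; they only illustrate that the first three characters of the id become nested folder names.

```python
def build_image_path(mode: str, image_id: str) -> str:
    """Same pattern as get_folder_names: the first three id characters are nested folders."""
    root = '../input/landmark-recognition-2020'
    return f'{root}/{mode}/{image_id[0]}/{image_id[1]}/{image_id[2]}/{image_id}.jpg'

# Hypothetical id, used only to show the resulting folder structure
example_path = build_image_path('train', 'abcdef0123456789')
print(example_path)
# ../input/landmark-recognition-2020/train/a/b/c/abcdef0123456789.jpg
```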

In [None]:
train_csv = pd.read_csv('../input/landmark-recognition-2020/train.csv')
print('Number of landmark images in train set:', len(train_csv))
train_csv.head()


In [None]:
submission_csv= pd.read_csv('../input/landmark-recognition-2020/sample_submission.csv')
submission_csv.head()

## Tidy Up 

Clean data results in happy models. Cleaning is the first step in any ML project, at least when working with raw data. For the lion's share of Kaggle datasets this step can be skipped, because the data is already cleaned and well organized.

1. Check for missing values

In [None]:
print('Any missing values?\n', train_csv.isnull().values.any()) 

2. Check for duplicate rows

In [None]:
print('Any Duplicates?', train_csv.duplicated().values.any())
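
Called without arguments, `duplicated()` only flags rows that match in every column. A slightly stricter check, sketched below on a toy frame, also looks for the same image `id` listed more than once and breaks missing values down by column. The column names follow `train.csv`; the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for train.csv; the ids and labels are hypothetical.
toy = pd.DataFrame({
    'id': ['a1', 'a2', 'a2', 'a3'],
    'landmark_id': [10, 20, 21, 10],
})

full_row_dupes = toy.duplicated().sum()      # rows identical in every column
repeated_ids = toy['id'].duplicated().sum()  # same image id listed more than once
missing_per_column = toy.isnull().sum()      # missing values broken down by column

print('Fully duplicated rows:', full_row_dupes)
print('Repeated image ids:', repeated_ids)
print(missing_per_column)
```

Here the toy frame has no fully duplicated rows, yet one image id appears under two different labels, which the whole-row check alone would miss.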

## Explore landmarks

This dataset has a large number of unique landmarks that we have to predict. However, the difficulty does not lie in the total number of labels. The real challenge is the high variance in the number of images per label.

In [None]:
landmark_count = train_csv.landmark_id.value_counts()

In [None]:
fig = plt.figure(figsize = (20, 7))
# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curve
sns.kdeplot(landmark_count);
plt.title('Class Distribution', size = 20);
plt.xlabel('number of images', size = 15);

In [None]:
limits = [None, (0,200), (0,100)]
fig = plt.figure(figsize = (20, 7))
for i, lim in enumerate(limits):
    plt.subplot(len(limits),1,i+1)
    # Pass the counts as x explicitly so the box is drawn horizontally
    sns.boxplot(x=landmark_count)
    plt.xlim(lim)
    plt.title(f'x-axis limits: {lim}')
plt.tight_layout()

print((landmark_count>200).sum())

While one class contains **6000** images, ***75%*** of the classes contain fewer than **25**. Only **483** of the classes contain more than **200** images.
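
The figures quoted above can be read off `landmark_count` with `quantile` and a boolean mask. The sketch below runs the same calls on a small synthetic label column, so its numbers are illustrative and are not the dataset's.

```python
import pandas as pd

# Synthetic labels: one dominant class and several small ones (not the real data).
labels = pd.Series([0] * 50 + [1] * 5 + [2] * 3 + [3] * 2 + [4] * 1)
counts = labels.value_counts()               # images per class, sorted descending

largest_class = counts.iloc[0]               # size of the most frequent class
q75 = counts.quantile(0.75)                  # 75% of classes have at most this many images
n_big = (counts > 4).sum()                   # number of classes above a chosen threshold
head_share = counts.iloc[0] / counts.sum()   # fraction of all images in the top class

print(largest_class, q75, n_big, round(head_share, 2))
```

On the real data, replacing `counts` with `landmark_count` and the threshold `4` with `200` gives the corresponding statistics.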

# Sightseeing

Now let's take a look at some of the landmarks. First we will look at the classes with the greatest image count, and then we will go over some of the classes at the other end of the distribution.

In [None]:
landmark_count_id = list(landmark_count.index)

In [None]:
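def get_images(data, landmarkid, num):
    # Image ids belonging to the requested landmark
    sub = list(data.id[data.landmark_id == landmarkid])
    num = min(num, len(sub))  # guard against classes with fewer images than requested
    # Square grid for more than 3 images, single row otherwise; subplot needs integers
    side = int(np.ceil(np.sqrt(num)))
    rows, cols = (side, side) if num > 3 else (1, num)
    fig = plt.figure(figsize=(10, 10))
    fig.suptitle(f"landmark ID  {landmarkid}")
    for i in range(num):
        plt.subplot(rows, cols, i + 1)
        img = Image.open(get_folder_names('train', sub[i]))
        plt.imshow(img)
        plt.axis('off')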
    

get_images(train_csv, landmark_count_id[0], 9)

In [None]:
get_images(train_csv, landmark_count_id[1], 9)

In [None]:
get_images(train_csv, landmark_count_id[8], 9)

The pictures suggest that the dataset suffers from **intra-class variation**. In other words, the pictures within each class vary widely. For instance, the images of landmark 126637, which apparently is a coast, show different parts of the landmark from different angles. This means that our predictive model needs to be highly robust.

In [None]:
get_images(train_csv, landmark_count_id[-1], 2)

In [None]:
get_images(train_csv, landmark_count_id[-2], 2)

In [None]:
get_images(train_csv, landmark_count_id[-3], 2)

## Visualize the Test Set

In [None]:
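# Collect every test image path (three levels of nested folders)
test_paths = glob.glob('../input/landmark-recognition-2020/test/*/*/*/*.jpg')
num = 10
fig = plt.figure(figsize = (20, 10))
for i in range(num):
    plt.subplot(1, num, i+1)
    plt.axis('off')
    # Pick a random test image on each iteration
    randint = np.random.randint(0, len(test_paths))
    img = Image.open(test_paths[randint])
    plt.imshow(img)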
    
    

Finally, the last distinctive feature of Google's landmark dataset lies in its test set. Most of the pictures in the test set do not belong to any landmark. Therefore, we need to leave the prediction blank for some of the test pictures.
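
One common way to decide when to leave a prediction blank is to threshold the model's confidence. The sketch below assumes the prediction column of the submission holds either a `"<landmark_id> <score>"` string or an empty string. That format, the `threshold` value, the helper name `format_prediction`, and the example scores are all assumptions for illustration, so check them against `sample_submission.csv` before relying on them.

```python
def format_prediction(landmark_id: int, score: float, threshold: float = 0.5) -> str:
    """Return 'landmark_id score', or an empty string when the score is below threshold."""
    if score < threshold:
        return ''  # treat the image as a non-landmark and leave the prediction blank
    return f'{landmark_id} {score:.4f}'

# Hypothetical model outputs: (predicted landmark id, confidence score)
predictions = [(126637, 0.91), (20409, 0.12)]
rows = [format_prediction(lid, s) for lid, s in predictions]
print(rows)
# ['126637 0.9100', '']
```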