# GLR for beginners: fetch the data, basic EDA

I am making this notebook for anyone who is just starting with this competition and has realized that it is much more complicated than the example scenarios in the mini-courses on Kaggle, Coursera and so on. This is partly an attempt to make things clear for myself, but I hope it can also help others.

I will be handling the imports as they are needed, so that if you have problems with just a subset of what I'm doing you can get to that cell and know which imports you need. Here at the top I will just import some things that are used widely throughout the notebook.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization


First thing, let's set some paths.

In [None]:
import os

# Dataset parameters:
INPUT_DIR = os.path.join('..', 'input')

DATASET_DIR = os.path.join(INPUT_DIR, 'landmark-recognition-2020')
TEST_IMAGE_DIR = os.path.join(DATASET_DIR, 'test')
TRAIN_IMAGE_DIR = os.path.join(DATASET_DIR, 'train')
TRAIN_LABELMAP_PATH = os.path.join(DATASET_DIR, 'train.csv')


Now let's fetch the data:

In [None]:
train = pd.read_csv(TRAIN_LABELMAP_PATH)  # path defined in the previous cell

And let's have a first look at it:

In [None]:
print("Shape of train_data :", train.shape)
print("Number of unique landmarks :", train["landmark_id"].nunique())

We have 1580470 entries, corresponding to 81313 unique landmarks.

Let's look at the first few rows of what we imported:

In [None]:
train.head()

If we look at the first few rows of train, we can see that the dataset consists of the ID of each image followed by the ID of the landmark it shows. We need this image ID to load the images.
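Before loading any images, a few quick sanity checks on the table can be reassuring. The sketch below runs on a tiny made-up table with the same two columns (only the second id comes from this notebook, the other ids and all the landmark labels are invented for illustration), so treat it as a pattern to adapt to `train` rather than as results from the real data.

```python
import pandas as pd

# Toy stand-in for train.csv: same two columns, mostly made-up values
toy = pd.DataFrame({
    "id": ["17660ef415d37059", "92b6290d571448f6", "cd41bf948edc0340"],
    "landmark_id": [1, 1, 7],
})

missing_per_column = toy.isna().sum()   # number of missing values in each column
ids_are_unique = toy["id"].is_unique    # True if no image id appears twice
images_per_landmark = toy.groupby("landmark_id")["id"].count()

print(missing_per_column)
print("ids unique:", ids_are_unique)
print(images_per_landmark)
```

Replacing `toy` with `train` should give the same three summaries for the full dataset.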

# Image visualization

Let's look at how the indexing of images works.

One example of an image id is:


In [None]:
idx = train.id[1]
idx

If we look at the folder structure (in the right-hand column), we can quickly notice that each image is nested three levels deep, based on the first three characters of its id.

So, in our case, we want to open:

> /input/landmark-recognition-2020/train/9/2/b/92b6290d571448f6.jpg

Given this structure, it's useful to make a quick helper function:


In [None]:
def get_image_full_path(idx):
    # Images are nested by the first three characters of their id: train/<c0>/<c1>/<c2>/<id>.jpg
    return os.path.join(TRAIN_IMAGE_DIR, idx[0], idx[1], idx[2], f'{idx}.jpg')
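To convince ourselves that the nesting rule works as described, here is a small standalone sketch that builds the path for the example id and prints it. The function name `nested_image_path` and its `root` and `extension` parameters are my own illustrative choices, not part of the competition data or of any library. Taking the root folder as an argument means the same function could be reused for another image folder, assuming that folder follows the same layout.

```python
import os

def nested_image_path(root, image_id, extension=".jpg"):
    """Build root/<c0>/<c1>/<c2>/<image_id><extension> from the first three characters of the id."""
    if len(image_id) < 3:
        raise ValueError(f"image id too short to nest: {image_id!r}")
    return os.path.join(root, image_id[0], image_id[1], image_id[2], image_id + extension)

# Example id taken from the path shown earlier in this notebook
example_path = nested_image_path("train", "92b6290d571448f6")
print(example_path)
```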

In [None]:
from PIL import Image

# imshow reads the pixel data when it is called, so the file can be closed right afterwards
with Image.open(get_image_full_path(idx)) as image:
    plt.imshow(image)
plt.axis("off")
plt.show()


We can also look at all the images with a certain landmark ID:

In [None]:
# Select every row for landmark 1 and display its images one by one
example = train[train["landmark_id"] == 1]
for image_id in example["id"]:
    with Image.open(get_image_full_path(image_id)) as image:
        plt.imshow(image)
    plt.axis("off")
    plt.show()
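Some landmarks appear in a very large number of images, so plotting every one of them can take a while. A possible workaround is to show only a random handful. The sketch below is one way to pick them; `sample_image_ids`, `max_images` and `seed` are hypothetical names I made up, and the table is toy data so that the cell runs on its own.

```python
import pandas as pd

def sample_image_ids(frame, landmark_id, max_images=4, seed=0):
    """Return up to max_images randomly chosen image ids for one landmark."""
    ids = frame.loc[frame["landmark_id"] == landmark_id, "id"]
    n = min(max_images, len(ids))  # never ask for more images than exist
    return ids.sample(n=n, random_state=seed).tolist()

# Toy data: 7 images of landmark 1 and 3 images of landmark 2 (made-up ids)
toy = pd.DataFrame({
    "id": [f"img{i:02d}" for i in range(10)],
    "landmark_id": [1] * 7 + [2] * 3,
})

picked = sample_image_ids(toy, landmark_id=1, max_images=4)
print(picked)
```

The returned ids can then be passed through the same display loop as above.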


# Image sizes

Since we will need to feed our classifier images that are all the same size, it is reasonable to check which sizes the images in the training set currently have. To do so, we take inspiration from [EDA + Data Augmentation for Beginners](https://www.kaggle.com/azaemon/eda-data-augmentation-for-beginners) and use the package basic_image_eda, applying our exploratory analysis to only one of the subfolders to keep the computation time short.

***Important note: In the end, the next two code blocks were taken as they were from [EDA + Data Augmentation for Beginners](https://www.kaggle.com/azaemon/eda-data-augmentation-for-beginners) (this might change in the future). I am still including them for completeness, since I think looking at this aspect of the data is important, but if you upvote my notebook, please consider also upvoting the one I took this code from.***

In [None]:
!pip install basic_image_eda
from basic_image_eda import BasicImageEDA

In [None]:
data_dir = "../input/landmark-recognition-2020/train/0"
extensions = ['png', 'jpg', 'jpeg']
threads = 0
dimension_plot = True
channel_hist = True
nonzero = False
hw_division_factor = 1.0

BasicImageEDA.explore(data_dir, extensions, threads, dimension_plot, channel_hist, nonzero, hw_division_factor)

Note that

> min height                               |  49
> max height                               |  800
> 
> min width                                |  120
> max width                                |  800

The function recommends:

> recommended input size(by mean)          |  [608 736] 

In the height/width scatterplot we can see that some dimensions are popular for both height and width (the horizontal and vertical "lines"), as are some aspect ratios (the tilted lines).
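If you would rather not install an extra package, a rough version of the size summary can be put together by hand with Pillow. This is only a sketch of the idea and not how basic_image_eda works internally; `count_image_sizes` is a name I made up, and the demo runs on a few blank throwaway images so that it does not depend on the competition data.

```python
import os
import tempfile
from collections import Counter
from PIL import Image

def count_image_sizes(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Walk folder recursively and count how often each (width, height) pair occurs."""
    sizes = Counter()
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            if name.lower().endswith(extensions):
                # Image.open is lazy: the size is available without decoding the pixels
                with Image.open(os.path.join(dirpath, name)) as image:
                    sizes[image.size] += 1  # PIL reports size as (width, height)
    return sizes

# Demo on blank images written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for i, size in enumerate([(800, 600), (800, 600), (640, 800)]):
        Image.new("RGB", size).save(os.path.join(tmp, f"fake_{i}.jpg"))
    size_counts = count_image_sizes(tmp)

print(size_counts.most_common())
```

Pointing the function at one of the training subfolders should give the most common sizes there.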


# Landmark distribution

Not all landmarks are created equal, and in this dataset they vary hugely in how well they are represented: some appear in thousands of images, others in as few as two. Let's first look at the distribution; we might want to exclude the landmarks that appear very rarely from our classification.

In [None]:
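import seaborn as sns

# value_counts() sorts in descending order, so the last 75000 entries are the rarest landmarks
rare_counts = train['landmark_id'].value_counts().iloc[-75000:]
# distplot is deprecated in recent seaborn versions; histplot with a KDE overlay is the closest replacement
ax = sns.histplot(rare_counts, stat='density', kde=True)
ax.set(xlabel='Landmark Counts', ylabel='Probability Density', title='Distribution of less common landmarks')
plt.show()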
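If we do decide to exclude the rarely seen landmarks, one possible way is to count the images per landmark and keep only the landmarks above some threshold. The sketch below shows the idea on toy data; `drop_rare_landmarks` and `min_images` are hypothetical names, and the right threshold for the real dataset is something I have not determined here.

```python
import pandas as pd

def drop_rare_landmarks(frame, min_images):
    """Keep only rows whose landmark appears in at least min_images rows."""
    counts = frame["landmark_id"].value_counts()
    frequent_ids = counts[counts >= min_images].index
    return frame[frame["landmark_id"].isin(frequent_ids)]

# Toy data: landmark 1 has 4 images, landmark 2 has 2, landmarks 3 and 4 have 1 each
toy = pd.DataFrame({
    "id": list("abcdefgh"),
    "landmark_id": [1, 1, 1, 1, 2, 2, 3, 4],
})

filtered = drop_rare_landmarks(toy, min_images=2)
print(filtered["landmark_id"].value_counts())
```

Calling it with `train` instead of `toy` would return the filtered training table, leaving the original DataFrame unchanged.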