# Introduction

Hello everyone! This is a small, quick EDA that gives a first look at the data. Here you will:

* see pictures of animals in the data;
* observe the difference between pictures with high and low pawpularity scores;
* gain a better understanding of what various attributes mean.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os

%matplotlib inline

DATA_PATH = '/kaggle/input/petfinder-pawpularity-score'

plt.rcParams['figure.figsize'] = [20, 15]

# Train dataframe

In [None]:
df_train = pd.read_csv(os.path.join(DATA_PATH, 'train.csv'))
df_train

Let's verify that the picture attributes are binary variables and that the pawpularity score ranges from 1 to 100:

In [None]:
df_stats = pd.DataFrame([], columns=['column name', '# unique values', 'min', 'max', 'mean'])
for col in df_train.columns.drop('Id'):
    df_stats.loc[len(df_stats)] = [col, df_train[col].nunique(), df_train[col].min(), df_train[col].max(), df_train[col].mean()]
df_stats
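The same check can also be written as explicit pass/fail tests rather than read off a table. The sketch below is a minimal example on a small made-up frame: `check_metadata` is a hypothetical helper that is not part of this notebook, and the toy values are invented purely for illustration.

```python
import pandas as pd


def check_metadata(df, target='Pawpularity', id_col='Id'):
    """Return (binary_ok, target_ok) for a metadata frame shaped like train.csv."""
    # All columns except the id and the target are treated as attributes.
    attribute_cols = df.columns.drop([id_col, target])
    # Every attribute column should contain only the values 0 and 1.
    binary_ok = bool(df[attribute_cols].isin([0, 1]).all().all())
    # The target should stay within the closed range 1..100.
    target_ok = bool(df[target].between(1, 100).all())
    return binary_ok, target_ok


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Id': ['a', 'b', 'c'],
    'Eyes': [1, 0, 1],
    'Blur': [0, 0, 1],
    'Pawpularity': [35, 100, 1],
})
result = check_metadata(toy)
print(result)
```

On the real data this would presumably be called as `check_metadata(df_train)`, assuming the column names match those in `train.csv`.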

Let's visualize the distribution of pawpularity scores:

In [None]:
plt.figure(figsize=(11, 8))
plt.hist(df_train['Pawpularity'], bins=50)
plt.title('Pawpularity distribution')
plt.xlabel('Pawpularity')
plt.ylabel('Counts')
plt.show()

We can see that the majority of pawpularity scores fall in the interval from 20 to 50, and that there is a distinctly large number of animals with a pawpularity score around 100.
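A visual impression from a histogram can be backed up with a number. The sketch below shows one way to compute the share of scores inside a given interval. `share_between` is a hypothetical helper introduced here, and the toy scores are invented, so the printed values say nothing about the real data.

```python
import pandas as pd


def share_between(scores, low, high):
    """Fraction of scores that fall in the closed interval [low, high]."""
    return float(scores.between(low, high).mean())


# Toy scores, invented for illustration only.
toy_scores = pd.Series([25, 30, 45, 60, 100, 100, 10, 38])
print(share_between(toy_scores, 20, 50))    # share in the 20..50 band
print(share_between(toy_scores, 100, 100))  # share exactly at 100
```

Applied to the notebook's data, the call would look like `share_between(df_train['Pawpularity'], 20, 50)`.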

# Animal pictures

Let's plot random animal pictures:

In [None]:
NUM_PICTURES = 8

fig, ax = plt.subplots(nrows=1, ncols=NUM_PICTURES)
pictures_selected = df_train.sample(n=NUM_PICTURES)
for i in range(NUM_PICTURES):
    picture_row = pictures_selected.iloc[i]
    ax[i].imshow(plt.imread(os.path.join(DATA_PATH, 'train/{}.jpg'.format(picture_row['Id']))))
    ax[i].set_title('Pawpularity: {}'.format(picture_row['Pawpularity']))
    ax[i].axis('off')
fig.suptitle('Random {} pictures'.format(NUM_PICTURES), fontsize=16)
plt.subplots_adjust(top=1.5)
plt.show()

Pictures differ in size, content and various attributes. Let's plot the best pictures according to the pawpularity score:

In [None]:
NUM_PICTURES = 8

fig, ax = plt.subplots(nrows=1, ncols=NUM_PICTURES)
pictures_selected = df_train.sort_values('Pawpularity', ascending=False)
for i in range(NUM_PICTURES):
    picture_row = pictures_selected.iloc[i]
    ax[i].imshow(plt.imread(os.path.join(DATA_PATH, 'train/{}.jpg'.format(picture_row['Id']))))
    ax[i].set_title('Pawpularity: {}'.format(picture_row['Pawpularity']))
    ax[i].axis('off')
fig.suptitle('Top-{} pictures by pawpularity score'.format(NUM_PICTURES), fontsize=16)
plt.subplots_adjust(top=1.5)
plt.show()

Let's plot the worst pictures according to the pawpularity score:

In [None]:
NUM_PICTURES = 8

fig, ax = plt.subplots(nrows=1, ncols=NUM_PICTURES)
pictures_selected = df_train.sort_values('Pawpularity', ascending=True)
for i in range(NUM_PICTURES):
    picture_row = pictures_selected.iloc[i]
    ax[i].imshow(plt.imread(os.path.join(DATA_PATH, 'train/{}.jpg'.format(picture_row['Id']))))
    ax[i].set_title('Pawpularity: {}'.format(picture_row['Pawpularity']))
    ax[i].axis('off')
fig.suptitle('Bottom-{} pictures by pawpularity score'.format(NUM_PICTURES), fontsize=16)
plt.subplots_adjust(top=1.5)
plt.show()

At first glance, pictures with low pawpularity scores seem much more poorly made than the ones with high pawpularity scores.

Finally, let's plot some pictures to see what different attributes mean. We will leave the conclusions for you to make!

In [None]:
for col in ['Subject Focus', 'Eyes', 'Face', 'Near', 'Action', 'Accessory',
            'Group', 'Collage', 'Human', 'Occlusion', 'Info', 'Blur']:

    NUM_PICTURES = 8

    fig, ax = plt.subplots(nrows=1, ncols=NUM_PICTURES)
    pictures_selected = df_train[df_train[col] == 1].sample(n=NUM_PICTURES)
    for i in range(NUM_PICTURES):
        picture_row = pictures_selected.iloc[i]
        ax[i].imshow(plt.imread(os.path.join(DATA_PATH, 'train/{}.jpg'.format(picture_row['Id']))))
        ax[i].set_title('Pawpularity: {}'.format(picture_row['Pawpularity']))
        ax[i].axis('off')
    fig.suptitle(
        '=======================================================\n{} pictures with {} == 1'.format(NUM_PICTURES, col),
        fontsize=16
    )
    plt.subplots_adjust(top=1.5)
    plt.show()
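As a numeric companion to looking at the pictures, one could compare the average pawpularity of rows where an attribute is set with rows where it is not. The sketch below is only an illustration under stated assumptions: `mean_score_by_flag` is a hypothetical helper, and the toy frame is invented, so no conclusion about the real attributes should be drawn from its output.

```python
import pandas as pd


def mean_score_by_flag(df, attribute_cols, target='Pawpularity'):
    """Mean target for rows where each attribute is 0 and where it is 1."""
    rows = []
    for col in attribute_cols:
        # Group by the binary flag and average the target within each group.
        means = df.groupby(col)[target].mean()
        rows.append({
            'attribute': col,
            'mean when 0': means.get(0, float('nan')),
            'mean when 1': means.get(1, float('nan')),
        })
    return pd.DataFrame(rows).set_index('attribute')


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Eyes': [1, 1, 0, 0],
    'Blur': [0, 1, 1, 1],
    'Pawpularity': [40, 60, 20, 30],
})
summary = mean_score_by_flag(toy, ['Eyes', 'Blur'])
print(summary)
```

With the real data, the attribute list from the loop above could be passed together with `df_train`. A difference in means is only a starting point and does not by itself show that an attribute drives the score.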