# Pawpularity EDA

### How Pawpularity Score Is Derived

* The Pawpularity Score is derived from each pet profile's page view statistics at the listing pages, using an algorithm that normalizes the traffic data across different pages, platforms (web & mobile) and various metrics.
* Duplicate clicks, crawler bot accesses and sponsored profiles are excluded from the analysis.


In this notebook we walk through an exploratory data analysis (EDA) of the competition dataset, which consists of pet images and their accompanying metadata.

**The exploratory data analysis comprises the following steps:**

* Import required libraries
* Access competition data
* Heatmap to visualize correlation of features
* Build a distribution of pawpularity scores across all pet images
* Visualize distribution of pawpularity for each feature
* Display example images contrasting the two values (0 and 1) of each feature
* List the most pawpular and least pawpular pet images (and sample images for each range of pawpularity scores)

### Importing required libraries

In [None]:
import numpy as np
import pandas as pd 
import os
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from glob import glob

### Initialising dataframes for train and test

In [None]:
#source path (where the Pawpularity contest data resides)
path = '../input/petfinder-pawpularity-score/'

#Get the metadata (the .csv data) and put it into DataFrames
train_df = pd.read_csv(path + 'train.csv')
test_df = pd.read_csv(path + 'test.csv')

#Get the image data (the .jpg data) and put it into lists of filenames
train_jpg = glob(path + "train/*.jpg")
test_jpg = glob(path + "test/*.jpg")

In [None]:
#show the dimensions of the train metadata.
print('train_df dimensions: ', train_df.shape)
print('train_df column names: ', train_df.columns.values.tolist())

#print a blank line (a '\n' inside a print statement would work as well)
print('')

#show the dimensions of the test metadata
print('test_df dimensions: ',test_df.shape)
print('test_df column names: ', test_df.columns.values.tolist())

In [None]:
#show the type of train_jpg and test_jpg as well as length of the list.
print('train_jpg is of type ',type(train_jpg), ' and length ', len(train_jpg))
#Also show the first 3 elements
print('train_jpg list 1st 3 elements: ', train_jpg[0:3], '\n')

print('test_jpg is of type ',type(test_jpg), ' and length ', len(test_jpg))
#Also show the first 3 elements
print('test_jpg list 1st 3 elements: ', test_jpg[0:3])

### Let's see some info about the dataset

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
train_df.describe()

# Correlation Heatmap

In [None]:
sns.set(rc={'figure.figsize':(20,10)})
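#Drop the non-numeric Id column first: pandas 2.0 and later raise an error when corr() meets a string column
sns.heatmap(train_df.drop(columns='Id').corr(), annot=True, fmt='.1g', cmap='coolwarm', square=True)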
plt.title('Correlation Matrix', fontsize=20, fontweight='bold')
plt.show()

We can see some correlations, but overall the predictor variables tell us little about the Pawpularity score itself; they mostly state the obvious. For example, Eyes and Face have a strong positive correlation, meaning that if the face is visible, the eyes usually are as well, and vice versa.
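To read the same information off as numbers rather than colours, one option is to pull the `Pawpularity` column out of the correlation matrix and rank the features by absolute correlation. The sketch below is only an illustration: `demo_df` is a small synthetic stand-in with made-up values (it is not the competition data), and the `Eyes` column is deliberately built to agree with `Face` most of the time.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_df: made-up values, NOT competition data.
rng = np.random.default_rng(0)
n = 500
face = rng.integers(0, 2, size=n)
# Eyes copies Face about 90% of the time, mimicking a strong Eyes/Face link.
eyes = np.where(rng.random(n) < 0.9, face, 1 - face)
demo_df = pd.DataFrame({
    "Id": [f"img_{i}" for i in range(n)],
    "Face": face,
    "Eyes": eyes,
    "Blur": rng.integers(0, 2, size=n),
    "Pawpularity": rng.integers(1, 101, size=n),
})

# Drop the string Id column, then correlate every remaining column.
corr = demo_df.drop(columns="Id").corr()

# Keep only each feature's correlation with the target.
target_corr = corr["Pawpularity"].drop("Pawpularity")

# Order features by absolute strength, strongest first.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)

print(ranked)
print("Eyes/Face correlation:", round(corr.loc["Eyes", "Face"], 2))
```

Swapping `demo_df` for `train_df` should give the ranking for the real features, assuming the column layout shown earlier (`Id` first, `Pawpularity` last).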

# Distribution of Pawpularity Scores

In [None]:
#Let's see the distribution of Pawpularity Scores
sns.set(rc={'figure.figsize':(20,10)})
fig = plt.figure()
sns.histplot(data=train_df, x='Pawpularity', bins=100)
plt.axvline(train_df['Pawpularity'].mean(), c='red', ls='-', lw=3, label='Mean Pawpularity')
plt.axvline(train_df['Pawpularity'].median(),c='blue',ls='-',lw=3, label='Median Pawpularity')
plt.title('Distribution of Pawpularity Scores', fontsize=20, fontweight='bold')
plt.legend()
plt.show()

The histogram shows that the distribution of Pawpularity scores is skewed. Interestingly, there is also a small bump close to zero Pawpularity, and close to 300 pet images have a Pawpularity score of 100.
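The visual impression of skew can be backed up with a few summary numbers: the gap between mean and median, the sample skewness, and the counts at the extremes. The snippet below is a minimal sketch on synthetic scores (a clipped gamma sample chosen only because it is lopsided; it is not the competition data), so the printed values say nothing about the real distribution.

```python
import numpy as np
import pandas as pd

# Synthetic, lopsided scores in 1..100, made up for illustration only.
rng = np.random.default_rng(42)
raw = rng.gamma(shape=2.0, scale=15.0, size=2000)
scores = pd.Series(np.clip(np.round(raw), 1, 100).astype(int), name="Pawpularity")

summary = {
    "mean": scores.mean(),
    "median": scores.median(),
    "skew": scores.skew(),  # positive values indicate a longer right tail
    "n_at_100": int((scores == 100).sum()),
    "n_at_or_below_5": int((scores <= 5).sum()),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

On the real data, replacing `scores` with `train_df['Pawpularity']` would give the corresponding figures, including the exact number of images scored 100.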

# Visualization of distribution of pawpularity for each feature

We show some simple box plots and histograms. For the box plots, we plot the Pawpularity scores on the y-axis and the 0s and 1s of each feature variable on the x-axis. For the histograms, we plot Pawpularity on the x-axis and count the 0s and 1s at each Pawpularity score. This helps us see whether a 0 or a 1 in each feature variable has an impact on the Pawpularity scores.

In [None]:
feature_variables = train_df.columns.values.tolist()

# For each feature variable ([1:-1] skips the Id and Pawpularity columns),
# show a boxplot and a distribution plot against Pawpularity
for variable in feature_variables[1:-1]:
    fig, ax = plt.subplots(1,2)
    sns.boxplot(data=train_df, x=variable, y='Pawpularity', ax=ax[0])
    sns.histplot(train_df, x="Pawpularity", hue=variable, kde=True, ax=ax[1])
    plt.suptitle(variable, fontsize=20, fontweight='bold')
    fig.show()
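The box plots can be complemented with a small table that compares the mean Pawpularity of images where a feature is 0 against those where it is 1. The following is a sketch under stated assumptions: `demo_df` is synthetic (made-up values, with only two of the feature names), and `mean_diff` is a hypothetical column name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_df (made-up values, NOT competition data).
rng = np.random.default_rng(1)
n = 400
demo_df = pd.DataFrame({
    "Id": [f"img_{i}" for i in range(n)],
    "Blur": rng.integers(0, 2, size=n),
    "Group": rng.integers(0, 2, size=n),
    "Pawpularity": rng.integers(1, 101, size=n),
})

# Every column except the identifier and the target is a binary feature.
feature_cols = [c for c in demo_df.columns if c not in ("Id", "Pawpularity")]

rows = []
for col in feature_cols:
    # Mean score for the images with the flag unset (0) and set (1).
    means = demo_df.groupby(col)["Pawpularity"].mean()
    rows.append({
        "feature": col,
        "mean_0": means.loc[0],
        "mean_1": means.loc[1],
        "mean_diff": means.loc[1] - means.loc[0],
    })

summary = pd.DataFrame(rows).set_index("feature")
print(summary.round(2))
```

A `mean_diff` close to zero for a feature would be consistent with box plots whose two boxes look nearly identical.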

### Next we show example images for each contrasting feature value

We visualize what the different labels in the tabular data actually mean, and how they are distributed.

In [None]:
fig, ax = plt.subplots(12, 3, figsize=(14,40))

for a in ax.ravel():
    a.set(xticks=[], yticks=[])

for r in range(12):
    label = train_df.columns[r+1] # r+1 skips the first column, which is Id
    count = train_df[label].value_counts().sort_index() # order by value (0, then 1) so the counts match the pie labels
    colors = ['red','green']
    for i in [1, 0]:
        img_id = train_df[train_df[label] == i].sample()['Id'].values[0]
        img = plt.imread(f'../input/petfinder-pawpularity-score/train/{img_id}.jpg')
        c = 0 if i == 1 else 2
        ax[r, c].imshow(img)
        ax[r, c].set_title(f'{label}={i}')
    ax[r, 1].pie(count, labels=[0, 1], autopct='%1.1f%%', colors = colors, wedgeprops = {'linewidth': 3})
    ax[r, 1].set_title(f'{label}', fontweight='bold', fontsize=20)

fig.tight_layout()
fig.show()


# Least Pawpular Images

In [None]:
bottom = train_df[train_df['Pawpularity'] == 1]['Id']

fig, ax = plt.subplots(1,3)

for i, ax in zip(bottom.sample(3), ax.ravel()):
    ax.set(xticks=[], yticks=[])
    img = plt.imread(f'../input/petfinder-pawpularity-score/train/{i}.jpg')
    ax.imshow(img)
    
fig.suptitle('Least Pawpular Images', fontsize=20, fontweight='bold')
fig.tight_layout()
fig.show()

# Most Pawpular Images

In [None]:
top = train_df[train_df['Pawpularity'] == 100]['Id']

fig, ax = plt.subplots(1,3)

for i, ax in zip(top.sample(3), ax.ravel()):
    ax.set(xticks=[], yticks=[])
    img = plt.imread(f'../input/petfinder-pawpularity-score/train/{i}.jpg')
    ax.imshow(img)
    
fig.suptitle('Most Pawpular Images', fontsize=20, fontweight='bold')
fig.tight_layout()
fig.show()

# Pet Images for each Pawpularity score range

**Pawpularity 10 or less**

In [None]:
t10 = train_df[train_df["Pawpularity"] <= 10]['Id']

fig, ax = plt.subplots(1,3)

for i, ax in zip(t10.sample(3), ax.ravel()):
    ax.set(xticks=[], yticks=[])
    img = plt.imread(f'../input/petfinder-pawpularity-score/train/{i}.jpg')
    ax.imshow(img)
    
fig.suptitle('Pawpularity 10 or less', fontsize=20, fontweight='bold')
fig.tight_layout()
fig.show()

**Pawpularity 10~20**

In [None]:
t20 = train_df[(10 < train_df["Pawpularity"]) & (train_df["Pawpularity"] <= 20)]['Id']

fig, ax = plt.subplots(1,3)

for i, ax in zip(t20.sample(3), ax.ravel()):
    ax.set(xticks=[], yticks=[])
    img = plt.imread(f'../input/petfinder-pawpularity-score/train/{i}.jpg')
    ax.imshow(img)
    
fig.suptitle('Pawpularity 10~20', fontsize=20, fontweight='bold')
fig.tight_layout()
fig.show()

**Pawpularity 20~30**

In [None]:
t30 = train_df[(20 < train_df["Pawpularity"]) & (train_df["Pawpularity"] <= 30)]['Id']

fig, ax = plt.subplots(1,3)

for i, ax in zip(t30.sample(3), ax.ravel()):
    ax.set(xticks=[], yticks=[])
    img = plt.imread(f'../input/petfinder-pawpularity-score/train/{i}.jpg')
    ax.imshow(img)
    
fig.suptitle('Pawpularity 20~30', fontsize=20, fontweight='bold')
fig.tight_layout()
fig.show()

**Pawpularity 30~40**

In [None]:
t40 = train_df[(30 < train_df["Pawpularity"]) & (train_df["Pawpularity"] <= 40)]['Id']

fig, ax = plt.subplots(1,3)

for i, ax in zip(t40.sample(3), ax.ravel()):
    ax.set(xticks=[], yticks=[])
    img = plt.imread(f'../input/petfinder-pawpularity-score/train/{i}.jpg')
    ax.imshow(img)
    
fig.suptitle('Pawpularity 30~40', fontsize=20, fontweight='bold')
fig.tight_layout()
fig.show()

**Pawpularity 40~50**

In [None]:
t50 = train_df[(40 < train_df["Pawpularity"]) & (train_df["Pawpularity"] <= 50)]['Id']

fig, ax = plt.subplots(1,3)

for i, ax in zip(t50.sample(3), ax.ravel()):
    ax.set(xticks=[], yticks=[])
    img = plt.imread(f'../input/petfinder-pawpularity-score/train/{i}.jpg')
    ax.imshow(img)
    
fig.suptitle('Pawpularity 40~50', fontsize=20, fontweight='bold')
fig.tight_layout()
fig.show()

**Pawpularity 50~60**

In [None]:
t60 = train_df[(50 < train_df["Pawpularity"]) & (train_df["Pawpularity"] <= 60)]['Id']

fig, ax = plt.subplots(1,3)

for i, ax in zip(t60.sample(3), ax.ravel()):
    ax.set(xticks=[], yticks=[])
    img = plt.imread(f'../input/petfinder-pawpularity-score/train/{i}.jpg')
    ax.imshow(img)
    
fig.suptitle('Pawpularity 50~60', fontsize=20, fontweight='bold')
fig.tight_layout()
fig.show()

**Pawpularity 60~70**

In [None]:
t70 = train_df[(60 < train_df["Pawpularity"]) & (train_df["Pawpularity"] <= 70)]['Id']

fig, ax = plt.subplots(1,3)

for i, ax in zip(t70.sample(3), ax.ravel()):
    ax.set(xticks=[], yticks=[])
    img = plt.imread(f'../input/petfinder-pawpularity-score/train/{i}.jpg')
    ax.imshow(img)
    
fig.suptitle('Pawpularity 60~70', fontsize=20, fontweight='bold')
fig.tight_layout()
fig.show()

**Pawpularity 70~80**

In [None]:
t80 = train_df[(70 < train_df["Pawpularity"]) & (train_df["Pawpularity"] <= 80)]['Id']

fig, ax = plt.subplots(1,3)

for i, ax in zip(t80.sample(3), ax.ravel()):
    ax.set(xticks=[], yticks=[])
    img = plt.imread(f'../input/petfinder-pawpularity-score/train/{i}.jpg')
    ax.imshow(img)
    
fig.suptitle('Pawpularity 70~80', fontsize=20, fontweight='bold')
fig.tight_layout()
fig.show()

**Pawpularity 80~99**

In [None]:
t90 = train_df[(80 < train_df["Pawpularity"]) & (train_df["Pawpularity"] <= 99)]['Id']

fig, ax = plt.subplots(1,3)

for i, ax in zip(t90.sample(3), ax.ravel()):
    ax.set(xticks=[], yticks=[])
    img = plt.imread(f'../input/petfinder-pawpularity-score/train/{i}.jpg')
    ax.imshow(img)
    
fig.suptitle('Pawpularity 80~99', fontsize=20, fontweight='bold')
fig.tight_layout()
fig.show()

**Pawpularity 100**

In [None]:
t100 = train_df[train_df["Pawpularity"] == 100]['Id']

fig, ax = plt.subplots(1,3)

for i, ax in zip(t100.sample(3), ax.ravel()):
    ax.set(xticks=[], yticks=[])
    img = plt.imread(f'../input/petfinder-pawpularity-score/train/{i}.jpg')
    ax.imshow(img)
    
fig.suptitle('Pawpularity 100', fontsize=20, fontweight='bold')
fig.tight_layout()
fig.show()
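The range cells above all follow the same pattern: filter by score range, sample three Ids, plot. One way to condense the filtering step is to bin the scores once with `pd.cut` and sample from each bin. This is a rough sketch rather than a drop-in replacement: `demo_df` holds a handful of made-up rows, and `sample_ids` is a hypothetical helper name. The bin edges are chosen to mirror the right-closed filters used in the cells above.

```python
import pandas as pd

# Made-up rows standing in for train_df (NOT competition data).
demo_df = pd.DataFrame({
    "Id": list("abcdefgh"),
    "Pawpularity": [1, 10, 11, 35, 80, 81, 99, 100],
})

# Right-closed bins mirroring the filters above:
# (0, 10], (10, 20], ..., (70, 80], (80, 99], (99, 100].
edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 99, 100]
demo_df["bin"] = pd.cut(demo_df["Pawpularity"], bins=edges)


def sample_ids(df, n=3, seed=0):
    """Return up to n sampled Ids per non-empty bin as {bin label: [Ids]}."""
    out = {}
    # observed=True skips bins that contain no rows.
    for interval, group in df.groupby("bin", observed=True):
        picked = group.sample(min(n, len(group)), random_state=seed)
        out[str(interval)] = picked["Id"].tolist()
    return out


samples = sample_ids(demo_df)
for label, ids in samples.items():
    print(label, ids)
```

Each list of Ids could then be passed to the same `plt.imread` / `imshow` loop used in the cells above, so the plotting code would only need to be written once.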


