![PetFinder Pawpularity Contest](https://www.petfinder.my/images/cuteness_meter.jpg)

This notebook is for the [PetFinder.my - Pawpularity Contest](https://www.kaggle.com/c/petfinder-pawpularity-score).  The [PetFinder.my](https://petfinder.my/) platform supports pet adoption.  They have calculated a **Pawpularity** score for each pet based on the pet profile's page view statistics.  The contest goal is to predict **Pawpularity** from a pet's photo.

# Libraries & Constants

In [None]:
!pip install ipyplot
!pip install imagesize

In [None]:
import imagesize
import ipyplot
import matplotlib.pyplot as plt
import pandas as pd
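import PIL.Image  # import the submodule explicitly; a bare "import PIL" does not guarantee PIL.Image is loaded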
import seaborn as sns

DIR_INPUT = "/kaggle/input/petfinder-pawpularity-score"

# Load Data

In [None]:
df_train = pd.read_csv(f"{DIR_INPUT}/train.csv", index_col=0)
df_test = pd.read_csv(f"{DIR_INPUT}/test.csv", index_col=0)

df_train["ImagePath"] = [f"{DIR_INPUT}/train/{x}.jpg" for x in df_train.index]
df_test["ImagePath"] = [f"{DIR_INPUT}/test/{x}.jpg" for x in df_test.index]

# Pawpularity Scores

There are 9,912 images in the training data, with Pawpularity scores ranging from 1 to 100.  The median and mean scores are 33 and 38, respectively.

In [None]:
df_train['Pawpularity'].describe()

Pawpularity score is right skewed (long tail to the right) and has clusters of values around 1 and 100.

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=2, gridspec_kw=dict(height_ratios=(1,3)))

sns.boxplot(x=df_train.Pawpularity, ax=ax1)
sns.histplot(x=df_train.Pawpularity, ax=ax2)
plt.show()
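
The skew described above can also be put into numbers. The following is a minimal sketch on hypothetical toy scores (not the competition data); `summarize_scores` is an illustrative helper name, and in this notebook it would presumably be called on `df_train.Pawpularity`.

```python
import pandas as pd


def summarize_scores(scores: pd.Series) -> dict:
    """Return simple shape statistics for a score column."""
    return {
        "mean": scores.mean(),
        "median": scores.median(),
        "skew": scores.skew(),  # a positive value indicates a longer right tail
        "share_at_max": (scores == scores.max()).mean(),  # fraction of rows at the top score
    }


# Hypothetical toy scores used only to illustrate the helper
toy_scores = pd.Series([1, 20, 25, 30, 30, 35, 40, 45, 100, 100])
summary = summarize_scores(toy_scores)
print(summary)
```

A mean above the median together with a positive skew value is consistent with a right-skewed distribution.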

# Metadata
PetFinder provided optional metadata that was scored manually for each photo.

## Variable Descriptions

All metadata variables are coded as 1=Yes, 0=No.  

- (Photo Key) **Id** - Unique ID for photo.  e.g. "0007de18844b0dbbb5e1f607da0606e0"
- (Metadata) **Focus** - Pet stands out against uncluttered background, not too close / far.
- (Metadata) **Eyes** - Both eyes are facing front or near-front, with at least 1 eye / pupil decently clear.
- (Metadata) **Face** - Decently clear face, facing front or near-front.
- (Metadata) **Near** - Single pet taking up significant portion of photo (roughly over 50% of photo width or height).
- (Metadata) **Action** - Pet in the middle of an action (e.g., jumping).
- (Metadata) **Accessory** - Accompanying physical or digital accessory / prop (e.g. toy, digital sticker), excluding collar and leash.
- (Metadata) **Group** - More than 1 pet in the photo.
- (Metadata) **Collage** - Digitally-retouched photo (e.g. with digital photo frame, combination of multiple photos).
- (Metadata) **Human** - Human in the photo.
- (Metadata) **Occlusion** - Specific undesirable objects blocking part of the pet (e.g. human, cage or fence). Note that not all blocking objects are considered occlusion.
- (Metadata) **Info** - Custom-added text or labels (e.g. pet name, description).
- (Metadata) **Blur** - Noticeably out of focus or noisy, especially for the pet’s eyes and face. For Blur entries, “Eyes” column is always set to 0.
- (Photo Score) **Pawpularity** - Measure of engagement with a pet's profile based on the photograph for that profile [Integer Range: 1-100]. Derived from each pet profile's page view statistics at the listing pages, using an algorithm that normalizes the traffic data across different pages, platforms (web & mobile) and various metrics. 
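
The note on **Blur** states a rule that can be checked directly: any row with `Blur` set to 1 should have `Eyes` set to 0. A minimal sketch on a hypothetical toy frame follows; `blur_rule_violations` is an illustrative name, and it assumes the columns are named `Blur` and `Eyes` as in the descriptions above.

```python
import pandas as pd


def blur_rule_violations(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows flagged Blur=1 whose Eyes value is not 0."""
    return df[(df["Blur"] == 1) & (df["Eyes"] != 0)]


# Hypothetical toy rows; row "b" breaks the rule on purpose
toy = pd.DataFrame(
    {"Blur": [1, 1, 0, 0], "Eyes": [0, 1, 1, 0]},
    index=["a", "b", "c", "d"],
)
violations = blur_rule_violations(toy)
print(violations)
```

An empty result on the real training data would confirm that the rule holds there.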

## Missing Data
There are no missing values in the metadata.

In [None]:
df_train.isnull().sum().sum()
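
A single total hides which column a problem comes from. As a sketch, the check can be broken out per column and combined with a test that every metadata column really is coded 0/1. `check_metadata` and its `exclude` parameter are hypothetical names; in this notebook the call would presumably also exclude `ImagePath`.

```python
import pandas as pd


def check_metadata(df: pd.DataFrame, exclude=("Pawpularity",)) -> pd.DataFrame:
    """Per-column count of missing values and a flag for strict 0/1 coding."""
    meta = df.drop(columns=list(exclude))
    return pd.DataFrame(
        {
            "n_missing": meta.isnull().sum(),
            "is_binary": meta.isin([0, 1]).all(),  # False if any value is not 0 or 1
        }
    )


# Hypothetical toy frame; "Blur" contains an invalid code (2) on purpose
toy = pd.DataFrame(
    {"Eyes": [1, 0, 1], "Blur": [0, 0, 2], "Pawpularity": [10, 50, 90]}
)
report = check_metadata(toy)
print(report)
```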

## Frequencies
Distribution of Yes & No values for each variable:

In [None]:
data = df_train.melt(id_vars=['ImagePath','Pawpularity']).replace({"value":{1:"Yes",0:"No"}})

sns.histplot(data=data, y='variable', hue='value', multiple='stack')
plt.show()
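
Because the variables are coded 0/1, the mean of each column is the share of Yes values, which gives a numeric companion to the chart. A minimal sketch on hypothetical toy data follows; `yes_share` is an illustrative helper, and in this notebook `exclude` would presumably be `("Pawpularity", "ImagePath")`.

```python
import pandas as pd


def yes_share(df: pd.DataFrame, exclude=("Pawpularity",)) -> pd.Series:
    """Fraction of rows coded 1 (Yes) for each remaining column."""
    flags = df.drop(columns=list(exclude))
    return flags.mean().sort_values(ascending=False)


# Hypothetical toy frame used only for illustration
toy = pd.DataFrame(
    {"Eyes": [1, 1, 1, 0], "Human": [0, 1, 0, 0], "Pawpularity": [30, 40, 50, 60]}
)
shares = yes_share(toy)
print(shares)
```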

## Correlation with Pawpularity Score

There does not appear to be any linear correlation between the metadata variables and Pawpularity score.

In [None]:
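# Drop the string column first; recent pandas versions raise an error on non-numeric columns in corr()
df_train.drop(columns="ImagePath").corr().Pawpularity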

A plot aids in visualizing this.  For each feature (x-axis), there is little visible difference in the distribution of Pawpularity score (y-axis) between Yes (orange) and No (blue) values.

In [None]:
plt.figure(figsize=(20,5))
sns.boxplot(x="variable", y="Pawpularity", hue="value", data=data)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
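
The comparison between Yes and No groups can also be expressed as a difference in mean score per feature. The sketch below uses hypothetical toy data; `mean_gap` is an illustrative helper, and it assumes every non-target column passed in is a 0/1 flag (so `ImagePath` would need to be dropped first in this notebook).

```python
import pandas as pd


def mean_gap(df: pd.DataFrame, target: str = "Pawpularity") -> pd.Series:
    """Mean target for rows coded 1 minus mean target for rows coded 0, per feature."""
    gaps = {}
    for col in df.columns:
        if col == target:
            continue
        yes_mean = df.loc[df[col] == 1, target].mean()
        no_mean = df.loc[df[col] == 0, target].mean()
        gaps[col] = yes_mean - no_mean
    return pd.Series(gaps, name="mean_gap")


# Hypothetical toy frame used only for illustration
toy = pd.DataFrame(
    {"Eyes": [1, 1, 0, 0], "Human": [1, 0, 1, 0], "Pawpularity": [40, 60, 20, 40]}
)
gaps = mean_gap(toy)
print(gaps)
```

Gaps close to zero for every feature would agree with the weak correlations reported above.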

# Photos
## Visualizing A Sample
A sample of photos at different **Pawpularity** levels:

In [None]:
scores = [1,25,50,75,100]
sample = df_train[df_train.Pawpularity.isin(scores)]

ipyplot.plot_class_tabs(
    images = [PIL.Image.open(path) for path in sample.ImagePath],
    labels = [f"Pawpularity={x}" for x in sample.Pawpularity],
    max_imgs_per_tab = 8,
    tabs_order = [f"Pawpularity={x}" for x in scores]
)
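
The cell above opens every matching image even though only 8 are shown per tab. One possible refinement, sketched here on hypothetical toy data, is to trim the sample to a fixed number of rows per score before any files are opened. `sample_per_score` is an illustrative name, not part of `ipyplot`.

```python
import pandas as pd


def sample_per_score(df: pd.DataFrame, scores, n: int = 8) -> pd.DataFrame:
    """Keep at most n rows for each requested Pawpularity score."""
    subset = df[df["Pawpularity"].isin(scores)]
    # groupby(...).head(n) keeps the first n rows of each group in original order
    return subset.groupby("Pawpularity").head(n)


# Hypothetical toy frame used only for illustration
toy = pd.DataFrame({"Pawpularity": [1, 1, 1, 50, 50, 7]})
picked = sample_per_score(toy, scores=[1, 50], n=2)
print(picked)
```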

Some thoughts about factors that might influence **Pawpularity**:

- Cat vs Dog 
- Breed
- Individual Animal Attractiveness (e.g. Facial Features, Fluffiness)
- Animal Mood (e.g. Hesitant, Relaxed)
- Animal Pose (e.g. Looking Up, Looking Forward)

## Sizes

We calculate the sizes of all images and join with Pawpularity score for review.

In [None]:
sizes = [imagesize.get(f) for f in df_train.ImagePath]
sizes_df = pd.DataFrame(sizes, columns=["width","height"], index=df_train.index.tolist())
sizes_df['total_size'] = sizes_df.width * sizes_df.height
sizes_df['Pawpularity'] = df_train.Pawpularity

There is a wide range of image sizes, with what appears to be 8 common ones.

In [None]:
sns.histplot(data=sizes_df, x="total_size")
plt.show()
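
To see which dimensions the common sizes correspond to, the (width, height) pairs can be counted directly. This is a minimal sketch on hypothetical toy dimensions; `common_sizes` is an illustrative helper that would presumably be applied to `sizes_df` in this notebook.

```python
import pandas as pd


def common_sizes(sizes: pd.DataFrame, top: int = 8) -> pd.Series:
    """Count rows per (width, height) pair and return the most frequent pairs."""
    counts = sizes.groupby(["width", "height"]).size()
    return counts.sort_values(ascending=False).head(top)


# Hypothetical toy dimensions used only for illustration
toy = pd.DataFrame(
    {
        "width": [960, 960, 960, 720, 720, 405],
        "height": [720, 720, 720, 960, 960, 720],
    }
)
top_sizes = common_sizes(toy, top=2)
print(top_sizes)
```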

There does not appear to be a linear correlation between Pawpularity score and any of the size measurements.

In [None]:
sizes_df.corr().Pawpularity

However, we will store this size information in case we choose to explore further downstream.

In [None]:
sizes_df[['width','height','total_size']].to_csv('/kaggle/working/image_sizes.csv')

## Duplicates

Duplicates have been identified in this dataset.  See: [Discussion Topic by Schulta](https://www.kaggle.com/c/petfinder-pawpularity-score/discussion/278497). These may be explored in a separate notebook.
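
As a starting point for that exploration, the sketch below groups files whose bytes are identical. This is an assumption-laden illustration and not the method used in the linked discussion: it only catches exact byte-level copies, so resized or re-encoded duplicates would need a perceptual approach instead. `find_exact_duplicates` is a hypothetical helper, and the files here are temporary stand-ins rather than competition images.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(paths):
    """Group file paths by the SHA-256 digest of their bytes; keep groups of 2+."""
    groups = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        groups[digest].append(str(path))
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical stand-in files: a.jpg and b.jpg share identical bytes
with tempfile.TemporaryDirectory() as tmp:
    files = {"a.jpg": b"same", "b.jpg": b"same", "c.jpg": b"other"}
    paths = []
    for name, payload in files.items():
        file_path = Path(tmp) / name
        file_path.write_bytes(payload)
        paths.append(file_path)
    duplicate_groups = find_exact_duplicates(paths)
    duplicate_names = [[Path(p).name for p in group] for group in duplicate_groups]

print(duplicate_names)
```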