Hello Kagglers! Welcome to another very interesting PetFinder competition. The past competitions were amazing, and I am pretty sure this one will be too.

# Task
The task is pretty simple. Given a photo of a pet along with some features (hand-labeled metadata), we are asked to predict an engagement score, called the PawPularity score.
I will use the terms popularity, engagement, and pawpularity very loosely here. Although the three can mean very different things in a proper context, I will use them interchangeably. IMO, if a photo is engaging, it will be more popular and will have a high pawpularity score.
Without any further delay, let's jump in!

In [None]:
! pip install -qq --upgrade seaborn

In [None]:
import os
import cv2
import glob
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path

import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

sns.set()
pio.templates.default = "ggplot2"

seed = 1234
np.random.seed(seed)

%config IPCompleter.use_jedi = False

# Dataset

We are provided with a `train` csv and a `test` csv. The images for the training data are stored in the `train` directory, while the images for the test data are stored in the `test` directory.

In [None]:
# Path to the data directory
data_dir = Path("../input/petfinder-pawpularity-score/")

# Paths to train and test images
train_images_dir = data_dir / "train"
test_images_dir = data_dir / "test"

In [None]:
# Read the CSVs
train_df = pd.read_csv(data_dir / "train.csv")
test_df = pd.read_csv(data_dir / "test.csv")

print("Number of training samples: ", len(train_df))
print("Number of test samples: ", len(test_df))

In [None]:
# What's in the training data?
train_df.head()

In [None]:
# What's in the test data?
test_df.head()

# Pawpularity Distribution

The first thing that we will check is the distribution of the `Pawpularity` score. We will use `sns.histplot(...)` to plot the distribution. You can read about the API [here](https://seaborn.pydata.org/generated/seaborn.histplot.html).

In [None]:
_, ax = plt.subplots(1,1, figsize=(15, 8))
sns.histplot(data=train_df, x="Pawpularity", color="blue", kde=True, ax=ax)
plt.show()

Things we can notice quickly from the distribution of the scores:

1. The majority of the photos have a pawpularity score between 20 and 50
2. There is a long tail on the right-hand side, ending in a spike of almost 300 images with a perfect score of 100
3. We surely can't ignore the images with a perfect score during the training phase
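
The observations above can also be checked numerically instead of being read off the plot. The following is only a minimal sketch: `score_summary` is a hypothetical helper that is not part of this notebook, and `demo_df` holds made-up scores so that the cell runs on its own. In the notebook you would pass `train_df` instead.

```python
import pandas as pd


def score_summary(df, target="Pawpularity", low=20, high=50, top=100):
    """Summarise a score column: share inside [low, high] and count at the top score."""
    scores = df[target]
    return {
        # Fraction of rows whose score lies in the closed interval [low, high]
        "share_in_range": float(scores.between(low, high).mean()),
        # Number of rows with exactly the top score
        "num_top_score": int((scores == top).sum()),
        "median": float(scores.median()),
    }


# Made-up scores, only to show the call; use train_df in the notebook.
demo_df = pd.DataFrame({"Pawpularity": [25, 30, 45, 60, 100, 100, 10, 35]})
summary = score_summary(demo_df)
print(summary)
```

Running `score_summary(train_df)` should give the share of photos in the 20-50 band and the exact number of perfect scores.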

# Features and PawPularity

We are provided with **twelve** features, as metadata, that we can use as additional inputs for training our models. Each of these features is binary: it is either present (1) or absent (0) in the image. These features are:

1. **Focus** - Pet stands out against the uncluttered background, not too close / far.
2. **Eyes** - Both eyes are facing front or near-front, with at least 1 eye/pupil decently clear.
3. **Face** - Decently clear face, facing front or near-front.
4. **Near** - Single pet taking up a significant portion of photo (roughly over 50% of photo width or height).
5. **Action** - Pet in the middle of an action (e.g., jumping).
6. **Accessory** - Accompanying physical or digital accessory/prop (i.e. toy, digital sticker), excluding collar and leash.
7. **Group** - More than 1 pet in the photo.
8. **Collage** - Digitally-retouched photo (i.e. with digital photo frame, a combination of multiple photos).
9. **Human** - Human in the photo.
10. **Occlusion** - Specific undesirable objects blocking part of the pet (i.e. human, cage, or fence). Note that not all blocking objects are considered occlusion.
11. **Info** - Custom-added text or labels (i.e. pet name, description).
12. **Blur** - Noticeably out of focus or noisy, especially for the pet’s eyes and face. For Blur entries, “Eyes” column is always set to 0.
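
The rule stated for **Blur** (blurred photos always have `Eyes` set to 0) is easy to verify. Below is a small sketch under the assumption that the columns are named `Blur` and `Eyes`, as in the list above. `blur_eyes_violations` is a hypothetical helper and `demo_df` is made-up data; in the notebook you would pass `train_df`.

```python
import pandas as pd


def blur_eyes_violations(df, blur_col="Blur", eyes_col="Eyes"):
    """Return the rows flagged as blurred that still have the eyes flag set."""
    return df[(df[blur_col] == 1) & (df[eyes_col] == 1)]


# Made-up rows: only "c" is both blurred and marked with visible eyes.
demo_df = pd.DataFrame({
    "Id": ["a", "b", "c", "d"],
    "Eyes": [1, 0, 1, 0],
    "Blur": [0, 1, 1, 0],
})
violations = blur_eyes_violations(demo_df)
print(len(violations), "row(s) break the Blur => Eyes == 0 rule")
```

An empty result on `train_df` would confirm that the rule holds in the training data.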

Before discussing why these features are needed, let's check how the pawpularity score is affected by the presence of a certain feature. We will use the same distribution plot, but this time with the `hue` argument set to one of the given features, so that each plot is split by whether that feature is present.

In [None]:
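# Every column between "Id" (first) and "Pawpularity" (last) is a binary feature
features = train_df.columns[1:-1].tolist()
num_cols = 2
# Round up so that an odd number of features still gets a full grid
num_rows = int(np.ceil(len(features) / num_cols))

fig, axs = plt.subplots(
    num_rows,
    num_cols,
    figsize=(20, 15),
    sharex=False,
    sharey=True,
    squeeze=False,
)

# One histogram per feature, split by whether the feature is present (1) or absent (0)
for i, feature in enumerate(features):
    sns.histplot(
        data=train_df,
        x="Pawpularity",
        kde=False,
        ax=axs[i // num_cols, i % num_cols],
        hue=feature,
    )
plt.show()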

The above plot is pretty interesting. A few things that we can notice from this plot:

1. Although one would expect **Subject Focus** to be a very important feature for making a photo popular, in this case it hardly contributes to a high score.
2. **Eyes, Face, and Near (Single Pet)** are the only three features that dominate among photos with a popularity score above 50. They are also the features that contribute most to a score of 100.
3. A group photo with other pets/humans doesn't give a good score.
4. Collage, as expected, doesn't improve the score. The distribution of scores with and without Collage is the same.
5. **Blur (out of focus or noisy)** tends to decrease the score, as expected. Most blurred photos scored between 20 and 30.
6. A pet performing an action doesn't make the photo more attractive, so the score isn't affected.
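
A rough numerical companion to these visual impressions is to compare the mean score with and without each feature. The sketch below assumes binary 0/1 feature columns and a `Pawpularity` target column. `feature_score_gap` is a hypothetical helper and `demo_df` is made-up data; in the notebook you would call it with `train_df` and `features`. A difference in means is only a coarse summary and says nothing about causation.

```python
import pandas as pd


def feature_score_gap(df, features, target="Pawpularity"):
    """For each binary feature, compare the mean score with and without it."""
    rows = []
    for feature in features:
        present = df.loc[df[feature] == 1, target]
        absent = df.loc[df[feature] == 0, target]
        rows.append({
            "feature": feature,
            "count_present": len(present),
            "mean_present": present.mean(),
            "mean_absent": absent.mean(),
            # Positive gap: photos with the feature score higher on average
            "gap": present.mean() - absent.mean(),
        })
    return pd.DataFrame(rows).set_index("feature").sort_values("gap")


# Made-up data, only to show the call.
demo_df = pd.DataFrame({
    "Blur": [1, 1, 0, 0],
    "Face": [0, 1, 1, 1],
    "Pawpularity": [20, 30, 50, 60],
})
gaps = feature_score_gap(demo_df, ["Blur", "Face"])
print(gaps)
```

Features with very few positive rows (see `count_present`) will have noisy means, so small gaps for those should be read with caution.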


# Why are the features important?

Engagement with a photo depends very much on the **aesthetics** of the photo. To give you a simple example, a photo of a not-so-good-looking pet (Sorry, every pet is cute! Here I am just talking in terms of the photo) that focuses on the pet's features would look cuter than a photo of a good-looking pet doing some weird trick far away from the camera.

And aesthetics isn't just that. The term itself is very broad, as it depends very much on an individual's perception of a photo. "How to capture the aesthetics of a photo in an ML model" is an open research area. So, predicting an engagement/pawpularity score from raw photos alone is much harder than predicting the score for the same photos with additional features that a simple model, especially a traditional ML model, doesn't capture directly.

# Feature specific photos

We will do a simple exercise here to visualize the images and their corresponding scores to see if they make sense. We will do the following:

1. Filter the training dataframe using a specific feature
2. Gather some random samples from the filtered dataframe
3. Record the pawpularity score for these samples along with the values of two other important features (`Face` and `Near`)
4. Plot the samples with the above information

In [None]:
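def plot_images(images, labels, num_images, num_cols=4, figsize=(15, 8), title="Images"):
    # Round up so that a final, partially filled row still gets its own axes
    num_rows = int(np.ceil(num_images / num_cols))

    # squeeze=False always returns a 2D array of axes, even for a single row
    _, ax = plt.subplots(num_rows, num_cols, figsize=figsize, squeeze=False)

    for i, (img, label) in enumerate(zip(images, labels)):
        ax[i // num_cols, i % num_cols].imshow(img)
        ax[i // num_cols, i % num_cols].set_title(label)

    # Hide the axis decorations everywhere, including any unused grid cells
    for a in ax.ravel():
        a.axis("off")

    plt.tight_layout()
    plt.suptitle(title, x=0.5, y=1.09, fontsize=16)
    plt.show()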
    

def filter_df(df, feature, sample_size=12):
    
    df = df[df[feature]==1].reset_index(drop=True)
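    # Sample without replacement when enough rows exist, so no photo is shown twice
    indices = np.random.choice(len(df), size=sample_size, replace=len(df) < sample_size)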

    images, labels = [], []

    for idx in indices:
        img = df.iloc[idx]["Id"]
        face = df.iloc[idx]["Face"]
        near = df.iloc[idx]["Near"]
        score = df.iloc[idx]["Pawpularity"]
        label = f"Face: {face} Near: {near} Score: {score}"

        img_path = str(train_images_dir / img)
        if os.path.exists(img_path + ".jpg"):
            img_path = img_path + ".jpg"
        else:
            img_path = img_path + ".png"

        img = cv2.imread(img_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        images.append(img)
        labels.append(label)
    
    return images, labels

In [None]:
# Eyes are visible
images, labels = filter_df(train_df, feature="Eyes")
plot_images(images, labels, num_images=len(images), title="Photos where eyes are visible clearly")

In [None]:
# Subject Focus
images, labels = filter_df(train_df, feature="Subject Focus")
plot_images(images, labels, num_images=len(images), title="Photos where focus is on the subject")