# PetFinder.my - Pawpularity Contest
### Predict the popularity of shelter pet photos

In this competition, you’ll analyze raw images and metadata to predict the “Pawpularity” of pet photos. You'll train and test your model on PetFinder.my's thousands of pet profiles. Winning versions will offer accurate recommendations that will improve animal welfare.

![https://www.kaggle.com/c/petfinder-pawpularity-score](https://blog.groomit.me/wp-content/uploads/2018/02/petfinder2.jpg)

This notebook is a quick exploration of the new [Petfinder 2021 competition](https://www.kaggle.com/c/petfinder-pawpularity-score) and yields some insight into how important the tabular features will be in this competition!

This notebook contains code from:

1. https://www.kaggle.com/currypurin/petfinder-eda-lgb-meta-features-and-img-size/notebook
2. https://www.kaggle.com/aakashnain/which-features-to-use-and-why
3. https://www.kaggle.com/nicapotato/pawpular-eda

In [None]:
!python -m pip install "../input/ipyplot-package/ipyplot-1.1.0-py3-none-any.whl" --quiet
!pip install nicaviz

In [None]:



import os
import torch
import ipyplot
import nicaviz
import numpy as np
import pandas as pd
import random as rn
from PIL import Image
from glob import glob
import tensorflow as tf
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


# Seed for reproducibility
seed = 1234
rn.seed(seed)
np.random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)

In [None]:
# Path variables
BASE_PATH = "../input/petfinder-pawpularity-score/"
TRAIN_PATH = BASE_PATH + "train.csv"
TEST_PATH = BASE_PATH + "test.csv"
TRAIN_IMAGES = glob(BASE_PATH + "train/*.jpg")
TEST_IMAGES = glob(BASE_PATH + "test/*.jpg")

# We are trying to predict this "Pawpularity" variable
TARGET = "Pawpularity"

# Preparing Train & Test data

In [None]:
#Load the CSV Files
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

train_df

In [None]:
#Add image paths to DataFrames
train_path_creator = lambda x: f'{BASE_PATH}train/{x}.jpg'
test_path_creator = lambda x: f'{BASE_PATH}test/{x}.jpg'

train_df['img_path'] = train_df['Id'].apply(train_path_creator)
test_df['img_path'] = test_df['Id'].apply(test_path_creator)



#Adding image width, height and file size
def create_shape_feature(df):
    width_height_list = []
    file_size_list = []
    for path_ in tqdm(df['img_path']):
        # Open each image in a context manager so the file handle is closed again
        with Image.open(path_) as img:
            width_height_list.append(img.size)
        file_size_list.append(os.path.getsize(path_))
    df['width_height'] = width_height_list
    df['file_size'] = file_size_list
    df['width'] = df['width_height'].apply(lambda x: x[0])
    df['height'] = df['width_height'].apply(lambda x: x[1])
    return df

train_df = create_shape_feature(train_df)
test_df = create_shape_feature(test_df)

In [None]:
train_df.head()

In [None]:
#Dataset Summary

print(f"There are {len(TRAIN_IMAGES)} train images provided.")
print(f"train.csv file has {len(train_df)} rows.")

print(f"There are {len(TEST_IMAGES)} test images provided.")
print(f"test.csv file has {len(test_df)} rows.")


In [None]:
# All relevant tabular features

non_feature_col = ['Id','width_height','width','height','file_size', 'img_path', 'Pawpularity']
FEATURES = [col for col in train_df.columns if col not in non_feature_col]
FEATURES

# Metrics and Loss (RMSE)

Scoring metric is Root Mean Squared Error (RMSE). 

Formally defined as:


$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\Big(\hat{y}_i - y_i\Big)^2}}$$

where $n$ denotes the number of samples, $y_i$ the ground truth value and $\hat{y}_i$ the prediction value.
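
As a small worked example of the metric, the sketch below squares each error, averages the squares, and takes the square root. The helper name `rmse_manual` and the toy values are made up for illustration and are not part of the competition data.

```python
import numpy as np

def rmse_manual(y_true, y_pred):
    """RMSE step by step: square the errors, average them, take the root."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    squared_errors = (y_pred - y_true) ** 2
    return float(np.sqrt(squared_errors.mean()))

# Toy values for illustration only (errors are 6, -8, 0 and 0)
y_true = [30, 40, 50, 100]
y_pred = [36, 32, 50, 100]

# Squared errors: 36 + 64 + 0 + 0 = 100, mean = 25, root = 5.0
print(rmse_manual(y_true, y_pred))
```

The mean absolute error for the same toy values would be 3.5, which shows how RMSE weights the larger errors more heavily.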

In [None]:
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Numpy RMSE"""
    return np.sqrt(mean_squared_error(y_true, y_pred))

def rmse_pytorch(outputs, labels):
    """PyTorch RMSE loss"""
    return torch.sqrt(torch.mean((outputs - labels)**2))

def rmse_tf(y_true, y_pred):
    """TensorFlow RMSE loss"""
    # In TensorFlow 2.x this op lives under tf.math
    return tf.sqrt(tf.reduce_mean(tf.math.squared_difference(y_true, y_pred)))

# EDA on Tabular data



## Pawpularity Distribution

The first thing we will check is the distribution of the `Pawpularity` score. We will plot it with `sns.histplot(...)`; the API is documented [here](https://seaborn.pydata.org/generated/seaborn.histplot.html).

In [None]:
import seaborn as sns

_, ax = plt.subplots(1,1, figsize=(15, 8))
sns.histplot(data=train_df, x="Pawpularity", color="blue", kde=True, ax=ax)
plt.show()

Things we can notice quickly from the distribution of the scores:

1. The majority of the photos have a Pawpularity score between 20 and 50
2. The distribution has a long tail on the right-hand side, with almost 300 images at a perfect score of 100
3. We surely can't ignore the images with a perfect score during the training phase
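
The points above are read off the histogram. A minimal sketch of how they could be checked numerically is shown below. The helper `score_summary` and the toy scores are hypothetical; in this notebook you would pass `train_df["Pawpularity"]` instead.

```python
import pandas as pd

def score_summary(scores: pd.Series) -> dict:
    """Share of scores in [20, 50] and the number of perfect scores."""
    return {
        "share_20_to_50": float(scores.between(20, 50).mean()),
        "n_perfect": int((scores == 100).sum()),
    }

# Toy scores for illustration only, not taken from train.csv
toy_scores = pd.Series([12, 25, 33, 41, 48, 67, 100, 100])
summary = score_summary(toy_scores)
print(summary)
```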

## Features and PawPularity

We are provided with **twelve** features, as meta-data, that we can use as additional features for training our models. Each of these features is binary, meaning they are either present or absent in the image. These features are:

1. **Focus** - Pet stands out against the uncluttered background, not too close / far.
2. **Eyes** - Both eyes are facing front or near-front, with at least 1 eye/pupil decently clear.
3. **Face** - Decently clear face, facing front or near-front.
4. **Near** - Single pet taking up a significant portion of photo (roughly over 50% of photo width or height).
5. **Action** - Pet in the middle of an action (e.g., jumping).
6. **Accessory** - Accompanying physical or digital accessory/prop (i.e. toy, digital sticker), excluding collar and leash.
7. **Group** - More than 1 pet in the photo.
8. **Collage** - Digitally-retouched photo (i.e. with digital photo frame, a combination of multiple photos).
9. **Human** - Human in the photo.
10. **Occlusion** - Specific undesirable objects blocking part of the pet (i.e. human, cage, or fence). Note that not all blocking objects are considered occlusion.
11. **Info** - Custom-added text or labels (i.e. pet name, description).
12. **Blur** - Noticeably out of focus or noisy, especially for the pet’s eyes and face. For Blur entries, “Eyes” column is always set to 0.
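
The rule stated for **Blur** (blurred entries always have `Eyes` set to 0) is easy to verify. The sketch below assumes the column names `Blur` and `Eyes` and uses a hypothetical helper and toy rows; in this notebook you would pass `train_df`.

```python
import pandas as pd

def blur_eyes_violations(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows that are marked as blurred but still have Eyes set to 1."""
    return df[(df["Blur"] == 1) & (df["Eyes"] == 1)]

# Toy rows for illustration only; row "c" deliberately breaks the rule
toy_df = pd.DataFrame({
    "Id": ["a", "b", "c"],
    "Eyes": [1, 0, 1],
    "Blur": [0, 1, 1],
})
violations = blur_eyes_violations(toy_df)
print(f"{len(violations)} row(s) break the Blur/Eyes rule")
```

An empty result on the real training data would confirm the rule.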

Before discussing why these features are needed, let's check how the Pawpularity score is affected by the presence of a certain feature. We will use the same distribution plot, this time with `hue` set to each of the given features in turn.

In [None]:
# One histogram per feature, split by whether the feature is present
num_cols = 2
num_rows = -(-len(FEATURES) // num_cols)  # ceiling division, so an odd feature count still fits


fig, axs = plt.subplots(num_rows,
                        num_cols,
                        figsize=(20, 15),
                        sharex=False,
                        sharey=True
                       )

for i, feature in enumerate(FEATURES):
    _ = sns.histplot(data=train_df,
                 x="Pawpularity",
                 kde=False,
                 ax=axs[i // num_cols, i % num_cols],
                 hue= feature,
                )
plt.show()

The above plot is pretty interesting. A few things that we can notice from this plot:

1. Although one would expect **Subject Focus** to be a very important feature for making a photo popular, in this case, it hardly contributes to a high score.
2. **Eyes, Face, and Near (Single Pet)** are the only three features that are dominant among photos with a popularity score above 50. They are also the features that contributed most to a score of 100
3. A group photo with other pets/humans doesn't give a good score
4. Collage, as expected, doesn't improve the score. The distribution of scores with/without Collage is the same
5. **Blur(Out of focus or noisy)** tends to decrease the score as expected. Most blurred photos scored between 20-30
6. Any pet doing an action in a photo doesn't make the pet more attractive, hence the score isn't affected at all
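
The observations above come from eyeballing histograms. One way to put numbers on them is to compare the mean target with and without each feature. The helper `mean_target_by_flag` and the toy frame below are hypothetical; in this notebook you would call it with `train_df`, `FEATURES` and `TARGET`.

```python
import pandas as pd

def mean_target_by_flag(df, features, target):
    """Mean target where each binary feature is 0 and where it is 1."""
    rows = []
    for feature in features:
        means = df.groupby(feature)[target].mean()
        rows.append({
            "feature": feature,
            "mean_if_0": means.get(0, float("nan")),
            "mean_if_1": means.get(1, float("nan")),
        })
    out = pd.DataFrame(rows).set_index("feature")
    out["difference"] = out["mean_if_1"] - out["mean_if_0"]
    return out

# Toy data for illustration only
toy_df = pd.DataFrame({
    "Blur": [0, 0, 1, 1],
    "Face": [1, 1, 1, 0],
    "Pawpularity": [40, 60, 20, 30],
})
flag_means = mean_target_by_flag(toy_df, ["Blur", "Face"], "Pawpularity")
print(flag_means)
```

A difference close to zero suggests that the feature barely shifts the average score, although it says nothing about the shape of the distribution.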


## Why are the features important?

Engagement with a photo depends very much on the **aesthetics** of the photo. To give a simple example, a not-so-good-looking pet (sorry, every pet is cute! I am only talking in terms of the photo) photographed with a focus on its features would look cuter than a good-looking pet doing some weird trick far away from the camera.

And aesthetics isn't just that. The term is very broad because it depends on an individual's perception of a photo, and how to capture the aesthetics of a photo in an ML model is an open research area. Predicting an engagement/Pawpularity score from raw photos alone is therefore much harder than predicting the score for the same photo with additional features that a simple model, especially a traditional ML model, does not capture directly.

In [None]:
image_paths = []
labels = []
custom_texts = []

for col in FEATURES:
    # Take up to four example images where this feature is present
    tmp_df = train_df[train_df[col] == 1]
    meta_cols = FEATURES + ['width', 'height']
    for _, row in tmp_df.head(4).iterrows():
        image_paths.append(row['img_path'])
        labels.append(col)
        # Caption: the target followed by every feature value and the image size
        meta = ''.join([f'{name}:{row[name]}, ' for name in meta_cols])
        custom_texts.append(f'target: {row[TARGET]}\n{meta}')
    
#Visualizing feature-wise image
ipyplot.plot_class_tabs(image_paths, labels, custom_texts=custom_texts, force_b64=True, img_width=450)

# Feature Analysis

In [None]:

train_df.nica.mass_plot(
    plt_set = FEATURES,
    columns = 3,
    plottype = "countplot")

In [None]:
#Plot the correlation Heatmap

data_corr = train_df[FEATURES + [TARGET]].corr()
plt.figure(figsize = (15,15))
dataplot = sns.heatmap(data_corr, annot=True)
  
plt.show()

The heatmap shows that the features have a low correlation with the target variable. Linear models trained on these features are therefore likely to perform poorly.

Of all the tabular features, **"Blur" seems to be the most predictive for the target.**
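
To read such a ranking without scanning the heatmap, the features can be sorted by the absolute value of their correlation with the target. This is a minimal sketch with a hypothetical helper and toy data; in this notebook the call would use `train_df`, `FEATURES` and `TARGET`.

```python
import pandas as pd

def rank_by_target_correlation(df, features, target):
    """Pearson correlation of each feature with the target, strongest first."""
    corr = df[features + [target]].corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data for illustration only
toy_df = pd.DataFrame({
    "Blur": [0, 0, 0, 1, 1, 1],
    "Action": [0, 1, 0, 1, 0, 1],
    "Pawpularity": [50, 60, 55, 20, 25, 30],
})
ranked = rank_by_target_correlation(toy_df, ["Blur", "Action"], "Pawpularity")
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations at the top, which matters for a feature that is expected to lower the score.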

In [None]:
train_df.nica.pivot_plots( FEATURES, 'Pawpularity', np.mean, palette=["Reds"])

# Image Analysis

In [None]:
import cv2

def plot_pictures(target_df):
    plt.figure(figsize=(20, 50))
    
    n_rows = min(60, target_df.shape[0])
    
    for i in range(n_rows):
        row = target_df.iloc[i]
        img_path = f"../input/petfinder-pawpularity-score/train/{row['Id']}.jpg"
        Pawpularity = row["Pawpularity"]
        img = cv2.cvtColor(cv2.imread(img_path), cv2.COLOR_BGR2RGB)
        plt.subplot(12, 5, i+1)
        plt.title(f"Pawpularity: {Pawpularity}")
        plt.imshow(img)
    plt.tight_layout()
    plt.show()
    plt.close()

## Least Pawpular : Pawpularity <= 3

In [None]:
target_df = train_df[train_df["Pawpularity"] <= 3]
plot_pictures(target_df.head(10))

## Average Pawpular : Pawpularity 45 ~ 55

In [None]:
target_df = train_df[(45 < train_df["Pawpularity"]) & (train_df["Pawpularity"] <= 55)]
plot_pictures(target_df.head(10))

## Most Pawpular : Pawpularity >= 98

In [None]:
target_df = train_df[train_df["Pawpularity"] >= 98]
plot_pictures(target_df.head(10))

### Least Pawpular example (1)

In [None]:
#Least Popular image
least_pawpular = train_df[train_df[TARGET] == train_df[TARGET].min()].iloc[0]
path = f"{BASE_PATH}train/{least_pawpular['Id']}.jpg"
im = plt.imread(path)
plt.figure(figsize=(15, 6))
plt.imshow(im)
plt.title(path.split("/")[-1])
plt.xticks([]), plt.yticks([])
print(f"Accompanying features:")
train_df[train_df['Id']==path.split('/')[-1].split('.')[0]]

## Test Image Analysis

In [None]:
#Random Test Image
path = np.random.choice(TEST_IMAGES)
im = plt.imread(path)
plt.figure(figsize=(15, 6))
plt.imshow(im)
plt.title(path.split("/")[-1])
plt.xticks([]), plt.yticks([])
print(f"Accompanying features:")
test_df[test_df['Id']==path.split('/')[-1].split('.')[0]]

The test images provided with the dataset are randomly generated placeholder images.

**The actual test data comprises about 6800 pet photos similar to the training set photos.**

In [None]:
ipyplot.plot_images(test_df['img_path'].values, force_b64=True, img_width=200)

# Naive Baseline (Mean of target): 

In [None]:
train_df['mean_pred'] = train_df[TARGET].mean()

In [None]:
print(f"Mean prediction is: {train_df['mean_pred'].iloc[0].round(2)}")
print(f"Using this naive baseline train RMSE is: {rmse(train_df[TARGET], train_df['mean_pred']).round(2)}")

## Baseline (Decision Tree Regressor):

Our baseline model will be a decision tree with a depth of 3. This way we can easily visualize the tree and gain insight into the most important feature rules. Unfortunately, the binary features do not seem to yield important rules for predicting Pawpularity. It seems that the image data will yield the most important features for predicting Pawpularity in this competition.

In [None]:
from sklearn.tree import DecisionTreeRegressor, export_text, plot_tree

X_train, X_test, y_train, y_test = train_test_split(train_df[FEATURES], train_df[TARGET], test_size=0.2, random_state=seed)
reg = DecisionTreeRegressor(random_state=seed, max_depth=3)
reg.fit(X_train, y_train)

In [None]:
print(f"Train RMSE: {rmse(y_train, reg.predict(X_train)).round(4)}")
print(f"Test RMSE: {rmse(y_test, reg.predict(X_test)).round(4)}")
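
A single 80/20 split can give a noisy estimate. As a complementary check, the sketch below runs 5-fold cross-validation with scikit-learn's `neg_root_mean_squared_error` scorer. The data here is synthetic (random binary features and random targets, both assumptions for illustration); in this notebook you would pass `train_df[FEATURES]` and `train_df[TARGET]`.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 12 binary features and integer targets in 1..100
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 12))
y = rng.integers(1, 101, size=200)

model = DecisionTreeRegressor(max_depth=3, random_state=0)
scores = cross_val_score(model, X, y, cv=5,
                         scoring="neg_root_mean_squared_error")

# The scorer returns negated RMSE, so flip the sign before reporting
cv_rmse = -scores
print(f"CV RMSE: {cv_rmse.mean():.4f} +/- {cv_rmse.std():.4f}")
```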

## Visualize decision tree

The tree we have trained almost always predicts values close to the mean. The exception is the branch where Blur=1, Action=1 and Face=0.

In [None]:
# Save a text representation of the tree
text_representation = export_text(reg)
with open("tree.log", "w") as f:
    f.write(text_representation)

fig = plt.figure(figsize=(25, 10))
_ = plot_tree(reg, 
              feature_names=FEATURES,
              filled=True)
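
The same tree can be read as text rules with named features, and the value predicted at each leaf can be pulled from the fitted estimator. The sketch below uses a synthetic dataset built so that one feature combination scores far below the rest; the feature names and the values 10 and 38 are assumptions for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

feature_names = ["Blur", "Action", "Face"]

# Synthetic data: every 0/1 combination of the three features, repeated 10 times
combos = np.array([[b, a, f] for b in (0, 1) for a in (0, 1) for f in (0, 1)])
X = np.tile(combos, (10, 1))
# Only Blur=1, Action=1, Face=0 gets a low target; everything else is constant
y = np.where((X[:, 0] == 1) & (X[:, 1] == 1) & (X[:, 2] == 0), 10.0, 38.0)

toy_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Text rules that use the feature names instead of feature_0, feature_1, ...
rules = export_text(toy_tree, feature_names=feature_names)
print(rules)

# Leaves are the nodes without children (children_left == -1)
is_leaf = toy_tree.tree_.children_left == -1
leaf_values = toy_tree.tree_.value[is_leaf].ravel()
print(sorted(leaf_values))
```

On the notebook's own tree, the same two calls would be made on `reg` with `feature_names=FEATURES`.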

In [None]:
# Train final model on all training data
reg.fit(train_df[FEATURES], train_df[TARGET])

## Submission

In [None]:
test_df.head(2)

In [None]:
test_df[TARGET] = reg.predict(test_df[FEATURES])
sub = test_df[['Id', TARGET]]
sub.to_csv("submission.csv", index=False)

In [None]:
sub.head(2)

Our model predicts mostly around the mean of all labels.

In [None]:
sub[TARGET].plot(kind='hist', bins=15, title='Prediction distribution');
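
Before uploading, it can help to sanity-check the submission frame. The sketch below is a hypothetical helper, not part of the competition tooling. It assumes the two columns `Id` and `Pawpularity` written above and a valid score range of 1 to 100.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, expected_ids, low=1, high=100) -> list:
    """Collect human-readable problems found in a submission frame."""
    problems = []
    if list(sub.columns) != ["Id", "Pawpularity"]:
        problems.append(f"unexpected columns: {list(sub.columns)}")
        return problems
    if set(sub["Id"]) != set(expected_ids):
        problems.append("Ids do not match the test set")
    if sub["Pawpularity"].isna().any():
        problems.append("missing predictions")
    # The 1..100 range is an assumption about the target, see the lead-in
    if not sub["Pawpularity"].dropna().between(low, high).all():
        problems.append(f"predictions outside [{low}, {high}]")
    return problems

# Toy frames for illustration only
good = pd.DataFrame({"Id": ["a", "b"], "Pawpularity": [38.0, 41.5]})
bad = pd.DataFrame({"Id": ["a", "b"], "Pawpularity": [38.0, 120.0]})
print(check_submission(good, ["a", "b"]))
print(check_submission(bad, ["a", "b"]))
```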

That's it! I hope this notebook helped you get started with the Petfinder 2021 competition!

If you have any questions or feedback, feel free to comment below. You can also contact me on Twitter [@carlolepelaars](https://twitter.com/carlolepelaars).