![https://www.kaggle.com/c/petfinder-pawpularity-score](https://blog.groomit.me/wp-content/uploads/2018/02/petfinder2.jpg)

This notebook is a quick exploration of the new [Petfinder 2021 competition](https://www.kaggle.com/c/petfinder-pawpularity-score) and offers some insight into how important the tabular features are likely to be!

In [None]:
import os
import torch
import numpy as np
import pandas as pd
import random as rn
from glob import glob
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Path variables
BASE_PATH = "../input/petfinder-pawpularity-score/"
TRAIN_PATH = BASE_PATH + "train.csv"
TEST_PATH = BASE_PATH + "test.csv"
TRAIN_IMAGES = glob(BASE_PATH + "train/*.jpg")
TEST_IMAGES = glob(BASE_PATH + "test/*.jpg")

# We are trying to predict this "Pawpularity" variable
TARGET = "Pawpularity"

# Seed for reproducibility
seed = 1234
rn.seed(seed)
np.random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)

In [None]:
df = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)

# All relevant tabular features
FEATURES = [col for col in df.columns if col not in ['Id', TARGET]]

## Metrics and loss (RMSE)

The scoring metric is Root Mean Squared Error (RMSE).

Formally defined as:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\big(\hat{y}_i - y_i\big)^2}}$$

where $n$ denotes the number of samples, $y_i$ the ground truth value and $\hat{y}_i$ the prediction value.
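
As a quick worked example of this definition (the numbers below are made up purely for illustration and are not taken from the competition data), RMSE can be computed by hand with NumPy:

```python
import numpy as np

# Hypothetical ground-truth scores and predictions (illustrative values only)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 35.0])

errors = y_pred - y_true          # [ 2., -2.,  3., -5.]
squared_errors = errors ** 2      # [ 4.,  4.,  9., 25.]
mse = squared_errors.mean()       # 42 / 4 = 10.5
rmse_value = np.sqrt(mse)         # sqrt(10.5), roughly 3.24

print(f"MSE: {mse:.2f}, RMSE: {rmse_value:.2f}")
```

Because the errors are squared before averaging, the single error of 5 contributes more to the score than the other three errors combined.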

In [None]:
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Numpy RMSE"""
    return np.sqrt(mean_squared_error(y_true, y_pred))

def rmse_pytorch(outputs, labels):
    """PyTorch RMSE loss"""
    return torch.sqrt(torch.mean((outputs - labels)**2))

def rmse_tf(y_true, y_pred):
    """TensorFlow RMSE loss"""
    # In TensorFlow 2.x, squared_difference lives under tf.math
    return tf.sqrt(tf.reduce_mean(tf.math.squared_difference(y_true, y_pred)))

## EDA

The features given in the CSV are additional binary descriptive features. 

Our target is the "Pawpularity" score, which ranges from 1 to 100.

In [None]:
print(df.shape)
df.head()

In [None]:
df[TARGET].plot(kind='hist', bins=100, figsize=(15, 6));
plt.title("Target distribution", weight='bold', fontsize=16);

Some features seem to be substantially correlated with each other, for example "Human" and "Occlusion".

In [None]:
corr_matrix = df[FEATURES + [TARGET]].corr()
corr_matrix

However, the features have a low correlation with the target variable. Linear models trained on these features are therefore likely to perform poorly.

In [None]:
target_corr = corr_matrix[TARGET][:-1]
target_corr

Of all the tabular features, "Blur" seems to be the most predictive of the target.
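
A claim like this is easier to check when the correlations are ranked by absolute value. The sketch below does that on a small synthetic frame: the `toy` data is randomly generated, and its target is made to depend on "Blur" by construction, so it only illustrates the ranking step and says nothing about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic binary features (hypothetical stand-ins for the CSV columns)
toy = pd.DataFrame({
    "Blur": rng.integers(0, 2, size=n),
    "Face": rng.integers(0, 2, size=n),
    "Human": rng.integers(0, 2, size=n),
})
# Synthetic target that depends on "Blur" by construction, plus noise
toy["Pawpularity"] = 40 - 15 * toy["Blur"] + rng.normal(0, 20, size=n)

# Correlation of every feature with the target, strongest first
target_corr = toy.corr()["Pawpularity"].drop("Pawpularity")
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

In the notebook, the same `.abs().sort_values(ascending=False)` chain could be applied directly to `target_corr`.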

## Image sample

In [None]:
path = np.random.choice(TRAIN_IMAGES)
im = plt.imread(path)
plt.figure(figsize=(15, 6))
plt.imshow(im)
plt.title(os.path.basename(path))
plt.xticks([])
plt.yticks([])
print("Accompanying features:")
# The image file name (without extension) is the Id in the CSV
df[df['Id'] == os.path.splitext(os.path.basename(path))[0]]

### Most Pawpular example (100)

In [None]:
most_pawpular = df[df[TARGET] == df[TARGET].max()].iloc[0]
path = f"{BASE_PATH}train/{most_pawpular['Id']}.jpg"
im = plt.imread(path)
plt.figure(figsize=(15, 6))
plt.imshow(im)
plt.title(os.path.basename(path))
plt.xticks([])
plt.yticks([])
print("Accompanying features:")
df[df['Id'] == most_pawpular['Id']]

### Least Pawpular example (1)

In [None]:
least_pawpular = df[df[TARGET] == df[TARGET].min()].iloc[0]
path = f"{BASE_PATH}train/{least_pawpular['Id']}.jpg"
im = plt.imread(path)
plt.figure(figsize=(15, 6))
plt.imshow(im)
plt.title(os.path.basename(path))
plt.xticks([])
plt.yticks([])
print("Accompanying features:")
df[df['Id'] == least_pawpular['Id']]

## Naive Baseline (Mean of target)

In [None]:
df['mean_pred'] = df[TARGET].mean()

In [None]:
print(f"Mean prediction is: {df['mean_pred'].iloc[0].round(2)}")
print(f"Using this naive baseline train RMSE is: {rmse(df[TARGET], df['mean_pred']).round(2)}")
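
One way to read this number: the RMSE of a constant mean prediction equals the population standard deviation of the target, so a model is only useful if it beats the target's own spread. A small check on synthetic scores (the `y` values are randomly generated, not competition data):

```python
import numpy as np

rng = np.random.default_rng(1234)
# Synthetic integer scores in [1, 100] (illustrative only)
y = rng.integers(1, 101, size=1000).astype(float)

# Predict the mean for every sample and score it with RMSE
mean_pred = np.full_like(y, y.mean())
baseline_rmse = np.sqrt(np.mean((y - mean_pred) ** 2))

# np.std uses ddof=0 by default, i.e. the population standard deviation
print(baseline_rmse, np.std(y))
```

Note that `pandas.Series.std` defaults to `ddof=1`, so `df[TARGET].std()` will differ very slightly from the baseline RMSE.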

## Baseline (Decision Tree Regressor)

Our baseline model is a decision tree with a maximum depth of 3, which keeps the tree small enough to visualize and lets us read off the most important feature rules. Unfortunately, the binary features do not seem to yield useful rules for predicting Pawpularity, so the image data will likely provide the most important signal in this competition.

In [None]:
from sklearn.tree import DecisionTreeRegressor, export_text, plot_tree

X_train, X_test, y_train, y_test = train_test_split(df[FEATURES], df[TARGET], test_size=0.2, random_state=seed)
reg = DecisionTreeRegressor(random_state=seed, max_depth=3)
reg.fit(X_train, y_train)

In [None]:
print(f"Train RMSE: {rmse(y_train, reg.predict(X_train)).round(4)}")
print(f"Test RMSE: {rmse(y_test, reg.predict(X_test)).round(4)}")
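
To put these scores in context, it can help to evaluate a mean-only model on the same split. The sketch below uses scikit-learn's `DummyRegressor` on synthetic, uninformative binary features (the `X` and `y` arrays are randomly generated assumptions, not the competition data), so the tree is not expected to clearly beat the dummy here:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1234)
X = rng.integers(0, 2, size=(2000, 12))             # random binary features
y = rng.integers(1, 101, size=2000).astype(float)   # target unrelated to X

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1234)

models = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "depth-3 tree": DecisionTreeRegressor(max_depth=3, random_state=1234),
}

# Fit each model on the training split and score it on the held-out split
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"{name}: test RMSE = {scores[name]:.4f}")
```

In the notebook, the same comparison would use `X_train`, `X_test`, `y_train` and `y_test` from the split above.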

## Visualize decision tree

The tree we have trained almost always predicts values close to the mean. The exception is the branch where Blur=1, Action=1 and Face=0.

In [None]:
# Save a text representation of the tree rules
text_representation = export_text(reg)
with open("tree.log", "w") as f:
    f.write(text_representation)

# Plot the tree; class_names is omitted because it only applies to classifiers
fig = plt.figure(figsize=(25, 10))
_ = plot_tree(reg, 
              feature_names=FEATURES,
              filled=True)

In [None]:
# Train final model on all training data
reg.fit(df[FEATURES], df[TARGET])

## Submission

In [None]:
test.head(2)

In [None]:
test[TARGET] = reg.predict(test[FEATURES])
sub = test[['Id', TARGET]]
sub.to_csv("submission.csv", index=False)

In [None]:
sub.head(2)

Our model mostly predicts values close to the mean of the training labels.

In [None]:
sub[TARGET].plot(kind='hist', bins=15, title='Prediction distribution');

That's it! I hope this notebook helped you get started with the Petfinder 2021 competition!

If you have any questions or feedback, feel free to comment below. You can also contact me on Twitter [@carlolepelaars](https://twitter.com/carlolepelaars).