# PetFinder.my - Pawpularity Contest
![](https://storage.googleapis.com/kaggle-competitions/kaggle/25383/logos/header.png?t=2021-08-31-18-49-29)

# Competition Description:

https://www.petfinder.my/cutenessmeter

A picture is worth a thousand words. But did you know a picture can save a thousand lives? Millions of stray animals suffer on the streets or are euthanized in shelters every day around the world. You might expect pets with attractive photos to generate more interest and be adopted faster. But what makes a good picture? With the help of data science, you may be able to accurately determine a pet photo’s appeal and even suggest improvements to give these rescue animals a higher chance of loving homes.

PetFinder.my is Malaysia’s leading animal welfare platform, featuring over 180,000 animals with 54,000 happily adopted. PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare.

Currently, PetFinder.my uses a basic Cuteness Meter to rank pet photos. It analyzes picture composition and other factors compared to the performance of thousands of pet profiles. While this basic tool is helpful, it's still in an experimental stage and the algorithm could be improved.

In this competition, you’ll analyze raw images and metadata to predict the “Pawpularity” of pet photos. You'll train and test your model on PetFinder.my's thousands of pet profiles. Winning versions will offer accurate recommendations that will improve animal welfare.

If successful, your solution will be adapted into AI tools that will guide shelters and rescuers around the world to improve the appeal of their pet profiles, automatically enhancing photo quality and recommending composition improvements. As a result, stray dogs and cats can find their "furever" homes much faster. With a little assistance from the Kaggle community, many precious lives could be saved and more happy families created.

Top participants may be invited to collaborate on implementing their solutions and creatively improve global animal welfare with their AI skills.

# Data Description:

In this competition, your task is to predict engagement with a pet's profile based on the photograph for that profile. You are also provided with hand-labelled metadata for each photo. The dataset for this competition therefore comprises both images and tabular data.

## How Pawpularity Score Is Derived:

The Pawpularity Score is derived from each pet profile's page view statistics at the listing pages, using an algorithm that normalizes the traffic data across different pages, platforms (web & mobile) and various metrics.
Duplicate clicks, crawler bot accesses and sponsored profiles are excluded from the analysis.

## Purpose of Photo Metadata:

We have included optional Photo Metadata, manually labeling each photo for key visual quality and composition parameters.
These labels are not used to derive the Pawpularity score, but they may help in understanding photo content and correlating it with a photo's attractiveness. Our end goal is to deploy AI solutions that can generate intelligent recommendations (e.g., show a closer frontal pet face, add accessories, increase subject focus) and automatic enhancements (e.g., brightness, contrast) on the photos, so we are hoping for predictions that are more easily interpretable.
You may use these labels as you see fit, and optionally build an intermediate / supplementary model to predict the labels from the photos. If your supplementary model is good, we may integrate it into our AI tools as well.
In our production system, new photos that are dynamically scored will not contain any photo labels. If the Pawpularity prediction model requires photo label scores, we will use an intermediary model to derive such parameters, before feeding them to the final model.

## Training Data:

train/ - Folder containing training set photos of the form {id}.jpg, where {id} is a unique Pet Profile ID.

train.csv - Metadata (described below) for each photo in the training set as well as the target, the photo's Pawpularity score. The Id column gives the photo's unique Pet Profile ID, which corresponds to the photo's file name.
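As a minimal sketch of how the `Id` column maps to an image file, the helper below builds the expected path for a Pet Profile ID. The function name `image_path_for` and the constant `TRAIN_DIR` are illustrative assumptions; the directory mirrors the Kaggle input layout used later in this notebook.

```python
from pathlib import PurePosixPath

# Assumed location of the training images (Kaggle input layout).
TRAIN_DIR = "../input/petfinder-pawpularity-score/train"


def image_path_for(pet_id: str, image_dir: str = TRAIN_DIR) -> str:
    """Return the path of the photo for a Pet Profile ID ({id}.jpg)."""
    return str(PurePosixPath(image_dir) / f"{pet_id}.jpg")


print(image_path_for("0007de18844b0dbbb5e1f607da0606e0"))
```

Passing a different `image_dir` gives the matching path in the test folder.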

## Example Test Data:

In addition to the training data, we include some randomly generated example test data to help you author submission code. When your submitted notebook is scored, this example data will be replaced by the actual test data (including the sample submission).

test/ - Folder containing randomly generated images in a format similar to the training set photos. The actual test data comprises about 6800 pet photos similar to the training set photos.

test.csv - Randomly generated metadata similar to the training set metadata.

sample_submission.csv - A sample submission file in the correct format.

## Photo Metadata:

The train.csv and test.csv files contain metadata for photos in the training set and test set, respectively. Each pet photo is labeled with the value of 1 (Yes) or 0 (No) for each of the following features:

Subject Focus - Pet stands out against uncluttered background, not too close / far.

Eyes - Both eyes are facing front or near-front, with at least 1 eye / pupil decently clear.

Face - Decently clear face, facing front or near-front.

Near - Single pet taking up significant portion of photo (roughly over 50% of photo width or height).

Action - Pet in the middle of an action (e.g., jumping).

Accessory - Accompanying physical or digital accessory / prop (e.g., toy, digital sticker), excluding collar and leash.

Group - More than 1 pet in the photo.

Collage - Digitally-retouched photo (e.g., with digital photo frame, combination of multiple photos).

Human - Human in the photo.

Occlusion - Specific undesirable objects blocking part of the pet (e.g., human, cage or fence). Note that not all blocking objects are considered occlusion.

Info - Custom-added text or labels (e.g., pet name, description).

Blur - Noticeably out of focus or noisy, especially for the pet’s eyes and face. For Blur entries, “Eyes” column is always set to 0.
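The rule that Blur entries always have "Eyes" set to 0 can be checked directly. The sketch below does so on a tiny made-up sample; the rows and the helper name `blur_eyes_violations` are assumptions for illustration, not competition data.

```python
import csv
import io

# Made-up rows in the train.csv column layout (illustrative values only).
SAMPLE_CSV = """Id,Eyes,Blur
a,1,0
b,0,1
c,1,1
"""


def blur_eyes_violations(rows):
    """Return the Ids of rows where Blur is 1 but Eyes is not 0."""
    return [r["Id"] for r in rows if r["Blur"] == "1" and r["Eyes"] != "0"]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(blur_eyes_violations(rows))
```

On the real data, the same check could be written as a pandas filter on the `Blur` and `Eyes` columns of train.csv; an empty result would confirm the rule.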

# Import packages

In [None]:
# Install packages
# !pip install efficientnet_pytorch
# !pip install boostaroota
!pip install ../input/kerasapplications/ > /dev/null
!pip install ../input/efficientnet-keras-source-code/ > /dev/null

In [None]:
import numpy as np
import pandas as pd
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import plotly.offline as pyo
import plotly.graph_objs as go
pyo.init_notebook_mode() # Set notebook mode to work in offline
import plotly.express as px
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import cv2
from tqdm import tqdm
from sklearn.metrics import mean_squared_error
import tensorflow as tf
from sklearn import model_selection as sk_model_selection
from xgboost.sklearn import XGBRegressor
from sklearn.metrics import roc_auc_score, precision_score
from sklearn import metrics
import optuna
# from boostaroota import BoostARoota
from sklearn.metrics import log_loss
from optuna.samplers import TPESampler
import functools
from functools import partial
import xgboost as xgb
import joblib
import sys
package_path = '/kaggle/input/efficientnet100minimal/'
sys.path.append(package_path)
import efficientnet.keras as efn 

SEED = 42

# Load data files

In [None]:
train_metadata = pd.read_csv('../input/petfinder-pawpularity-score/train.csv')
test_metadata = pd.read_csv('../input/petfinder-pawpularity-score/test.csv')
sample_submission = pd.read_csv('../input/petfinder-pawpularity-score/sample_submission.csv')

# EDA

In [None]:
print(train_metadata.shape)
print('\n# Ids:',train_metadata['Id'].nunique())
train_metadata.head()

In [None]:
print(test_metadata.shape)
test_metadata.head()

In [None]:
print(sample_submission.shape)
sample_submission.head()

In [None]:
train_metadata.describe()

## Distribution of pawpularity by various metadata features

In [None]:
fig = px.box(train_metadata, y="Pawpularity")
fig.show()

In [None]:
fig = px.histogram(train_metadata, x="Pawpularity", nbins=20)
fig.show()

In [None]:
def box_plot_by_metadata(metadata):
    fig = px.box(train_metadata, y="Pawpularity", x = metadata)
    fig.show()
    
metadata_list = ['Subject Focus', 'Eyes', 'Face', 'Near', 'Action', 'Accessory',
       'Group', 'Collage', 'Human', 'Occlusion', 'Info', 'Blur']

w = widgets.interactive(box_plot_by_metadata, metadata = metadata_list)
display(w)

In [None]:
def box_plot_by_multiple_metadata(subject_focus, eyes, face, near, action, accessory, group, collage, human, occlusion, info, blur):
    data = train_metadata.copy(deep = True)
    data['Filter'] = 0
    list_filters = [subject_focus, eyes, face, near, action, accessory, group, collage, human, occlusion, info, blur]
    
    for i in range(0, len(metadata_list)):
        col = metadata_list[i]
        col_value = list_filters[i]
        if col_value == 1:
            data['Filter'] = np.where(data[col]==1, 1, data['Filter'])
    fig = px.box(data, y="Pawpularity", x = 'Filter')
    print('Filters: \n',metadata_list,'\n',list_filters)
    fig.show()
    
w = widgets.interactive(box_plot_by_multiple_metadata,
                        subject_focus = [0,1],
                        eyes = [0,1],
                        face = [0,1],
                        near = [0,1],
                        action = [0,1],
                        accessory = [0,1],
                        group = [0,1],
                        collage = [0,1],
                        human = [0,1],
                        occlusion = [0,1],
                        info = [0,1],
                        blur = [0,1])
display(w)

## Look at photos

In [None]:
TRAIN_PATH = '../input/petfinder-pawpularity-score/train/'

sample_id = '0007de18844b0dbbb5e1f607da0606e0'
display(train_metadata[train_metadata['Id']==sample_id])
img=mpimg.imread(TRAIN_PATH + sample_id + '.jpg')
plt.imshow(img)
plt.show()
img.shape, img.min(), img.max(), img.mean()

# Modelling

In [None]:
# Load model
model = efn.EfficientNetB7(include_top=False, input_shape=(256,256,3), weights=None)
# Keras would normally download the pretrained weights, but this notebook runs
# offline, so they are loaded from an attached dataset instead.
# The weight file must match the model variant chosen above (B0 to B7).
model.load_weights('../input/efficientnetb0b7-keras-weights/efficientnet-b7_weights_tf_dim_ordering_tf_kernels_autoaugment_notop.h5')

In [None]:
# train_metadata = train_metadata.head(100) - take a sample to test model

list_features_from_model = ['feature_' + str(x+1) for x in range(0, 2560)]
train_data = train_metadata.copy(deep = True)
train_data[list_features_from_model] = np.nan

test_data = test_metadata.copy(deep = True)
test_data[list_features_from_model] = np.nan

print('For train data')
for i in tqdm(range(0, train_data.shape[0])):
    id_ = train_data['Id'].iloc[i]
    img=mpimg.imread(TRAIN_PATH + id_ + '.jpg')
    features = tf.keras.layers.GlobalAveragePooling2D()(model(np.expand_dims(cv2.resize(img, (256, 256)) / 255, axis = 0)))
    train_data.iloc[i] = list(train_data[[x for x in train_data.columns if 'feature' not in x]].iloc[i]) + list(np.array(features)[0])
    
print('\nFor test data')
for i in tqdm(range(0, test_data.shape[0])):
    id_ = test_data['Id'].iloc[i]
    img=mpimg.imread(TRAIN_PATH.replace('train','test') + id_ + '.jpg')
    features = tf.keras.layers.GlobalAveragePooling2D()(model(np.expand_dims(cv2.resize(img, (256, 256)) / 255, axis = 0)))
    test_data.iloc[i] = list(test_data[[x for x in test_data.columns if 'feature' not in x]].iloc[i]) + list(np.array(features)[0])
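The two loops above run the network on one image at a time. Grouping several images into one forward pass usually reduces overhead. The sketch below shows only the chunking step; the helper name `batched` is hypothetical, and image loading, resizing and the model call are left out.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder Ids standing in for the values of the Id column.
ids = [f"id_{n}" for n in range(7)]
for batch in batched(ids, 3):
    print(batch)
```

Each batch of Ids would then be loaded, resized, stacked into a single array and passed to the model once.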

In [None]:
def objective_regressor(X_train, y_train, X_val, y_val, target_value, trial):
    """Optuna objective: train an XGBoost regressor and return its validation RMSE.

        Details:
            It uses the Optuna library, which is based on Bayesian optimization,
            to tune the hyperparameters.

        Args:
            X_train: training features
            y_train: training target
            X_val: validation features
            y_val: validation target
            target_value: placeholder bound via functools.partial; not used
            trial: Optuna trial object that suggests the hyperparameters

        Returns:
            RMSE on the validation set for this trial

    """
    tree_methods = ['approx', 'hist', 'exact']
    # tree_methods = ['gpu_hist']
    boosters = ['gbtree', 'gblinear']
    objective_list_reg = ['reg:squarederror']  # 'reg:gamma', 'reg:tweedie'

    # Names match the XGBRegressor keyword arguments (note: 'booster', not
    # 'boosting'), so study.best_params can be passed to the constructor as is.
    params = {
        'booster': trial.suggest_categorical('booster', boosters),
        'tree_method': trial.suggest_categorical('tree_method', tree_methods),
        'n_estimators': trial.suggest_int('n_estimators', 20, 200, step=10),
        'max_depth': trial.suggest_int('max_depth', 1, 50),
        'reg_alpha': trial.suggest_int('reg_alpha', 5, 10),
        'reg_lambda': trial.suggest_int('reg_lambda', 5, 10),
        'min_child_weight': trial.suggest_int('min_child_weight', 2, 5),
        'gamma': trial.suggest_int('gamma', 1, 5),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
        'objective': trial.suggest_categorical('objective', objective_list_reg),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.8, 1, step=0.05),
        'colsample_bynode': trial.suggest_float('colsample_bynode', 0.8, 1, step=0.05),
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.8, 1, step=0.05),
        'subsample': trial.suggest_float('subsample', 0.7, 1, step=0.05),
    }

    xgboost_tune = xgb.XGBRegressor(
        **params,
        # scale_pos_weight=scale_pos_weight,
        eval_metric='rmse',
        n_jobs=-1,
        random_state=SEED)

    xgboost_tune.fit(X_train, y_train)
    pred_val = xgboost_tune.predict(X_val)

    # RMSE computed with np.sqrt; the `squared` argument is deprecated and
    # later removed in newer scikit-learn releases.
    return np.sqrt(mean_squared_error(y_val, pred_val))
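The objective above returns the root-mean-squared error (RMSE) on the validation set. As a minimal sketch of what that number means, the function below computes RMSE by hand; the name `rmse` and the example values are illustrative assumptions.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# One prediction is off by 6, so the mean squared error is 36 / 3 = 12.
print(rmse([10, 20, 30], [10, 20, 36]))
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.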

In [None]:
# XGBoost model
df_train, df_valid = sk_model_selection.train_test_split(
    train_data, 
    test_size=0.15, 
    random_state=42)

feature_cols = [x for x in train_data.columns if x not in ['Id','Pawpularity']]

X_train = df_train[feature_cols]
y_train = df_train[['Pawpularity']]
X_valid = df_valid[feature_cols]
y_valid = df_valid[['Pawpularity']]

# br = BoostARoota(metric='rmse', silent = True)
# br.fit(X_train,y_train)
# X_train=X_train[br.keep_vars_.tolist()]
# X_valid=X_valid[br.keep_vars_.tolist()]

study = optuna.create_study(direction='minimize', sampler=TPESampler(seed=SEED))
study.optimize(
    functools.partial(objective_regressor, X_train, y_train, X_valid, y_valid,'trial'),
            timeout=500)

model_xgb = xgb.XGBRegressor(**study.best_params, random_state=SEED)
model_xgb.fit(X_train,y_train)

# RMSE computed with np.sqrt; the `squared` argument is deprecated and
# later removed in newer scikit-learn releases.
print('\nTrain data performance:')
y_predicted = model_xgb.predict(X_train)
print('RMSE: ', np.sqrt(mean_squared_error(y_train, y_predicted)))
print('\nValidation data performance:')
y_predicted = model_xgb.predict(X_valid)
print('RMSE: ', np.sqrt(mean_squared_error(y_valid, y_predicted)))

print("Saving model .. ",end=" ")
joblib.dump(model_xgb,"XGBoost_model.pkl")

In [None]:
# Refit the tuned model on the full data (train + validation)

model_xgb = joblib.load("XGBoost_model.pkl")
X_full = pd.concat([X_train, X_valid], axis = 0).reset_index(drop = True)
y_full = pd.concat([y_train, y_valid], axis = 0).reset_index(drop = True)
model_xgb.fit(X_full, y_full)

# Both splits are now part of the fit, so the scores below are in-sample
# and no longer estimate generalization error.
print('\nTrain data performance:')
y_predicted = model_xgb.predict(X_train)
print('RMSE: ', np.sqrt(mean_squared_error(y_train, y_predicted)))
print('\nValidation data performance (in-sample after the refit):')
y_predicted = model_xgb.predict(X_valid)
print('RMSE: ', np.sqrt(mean_squared_error(y_valid, y_predicted)))

print("Saving model .. ",end=" ")
joblib.dump(model_xgb,"XGBoost_model_full_data.pkl")

# Predictions on test data

In [None]:
X_test = test_data[feature_cols]
y_predicted = model_xgb.predict(X_test)
output_df = sample_submission.copy(deep = True)
output_df['Pawpularity'] = y_predicted

In [None]:
output_df

In [None]:
output_df['Pawpularity'].describe()

In [None]:
output_df.to_csv('submission.csv', index = False)
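Before submitting, it can help to sanity-check the written file. The sketch below assumes the submission has exactly the columns `Id` and `Pawpularity`, as used in this notebook; the helper name `check_submission` and the sample strings are hypothetical.

```python
import csv
import io
import math


def check_submission(csv_text, expected_ids):
    """Return a list of problems found in a submission CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["Id", "Pawpularity"]:
        return [f"unexpected columns: {reader.fieldnames}"]
    problems = []
    rows = list(reader)
    if [r["Id"] for r in rows] != list(expected_ids):
        problems.append("Ids do not match the expected Ids or their order")
    for r in rows:
        try:
            value = float(r["Pawpularity"])
        except (TypeError, ValueError):
            problems.append(f"non-numeric prediction for Id {r['Id']}")
            continue
        if math.isnan(value):
            problems.append(f"missing prediction for Id {r['Id']}")
    return problems


print(check_submission("Id,Pawpularity\na,30.5\nb,41.0\n", ["a", "b"]))
```

In the notebook, the text of `submission.csv` and the `Id` column of `sample_submission` would be the natural inputs.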