![](https://drive.google.com/uc?id=1fZu2LfKginyy8jxhMlCRtUNbkSGKbu98)

# <h1 style='background:#AB51E9; border:0; color:black'><center> PetFinder.my - Pawpularity Contest </center></h1> 

PetFinder.my is Malaysia’s leading animal welfare platform, featuring over 180,000 animals with 54,000 happily adopted. PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare.

Currently, PetFinder.my uses a basic Cuteness Meter to rank pet photos. It analyzes picture composition and other factors compared to the performance of thousands of pet profiles. While this basic tool is helpful, it's still in an experimental stage and the algorithm could be improved.

# **<span style="color:#AB51E9;">Goal</span>**
 
The goal is to analyze raw images and metadata to predict the “Pawpularity” of pet photos.

# **<span style="color:#AB51E9;">Data</span>**

**Training Data**

> - ```train/``` - Folder containing training set photos of the form {id}.jpg, where {id} is a unique Pet Profile ID.
> - ```train.csv``` - Metadata (described below) for each photo in the training set as well as the target, the photo's Pawpularity score. The Id column gives the photo's unique Pet Profile ID corresponding to the photo's file name.

**Example Test Data**

In addition to the training data, we include some randomly generated example test data to help you author submission code. When your submitted notebook is scored, this example data will be replaced by the actual test data (including the sample submission).

> - ```test/``` - Folder containing randomly generated images in a format similar to the training set photos. The actual test data comprises about 6800 pet photos similar to the training set photos.
> - ```test.csv``` - Randomly generated metadata similar to the training set metadata.
> - ```sample_submission.csv``` - A sample submission file in the correct format.

**Photo Metadata**

The train.csv and test.csv files contain metadata for photos in the training set and test set, respectively. Each pet photo is labeled with the value of 1 (Yes) or 0 (No) for each of the following features:

> - ```Subject Focus``` - Pet stands out against uncluttered background, not too close / far.
> - ```Eyes``` - Both eyes are facing front or near-front, with at least 1 eye / pupil decently clear.
> - ```Face``` - Decently clear face, facing front or near-front.
> - ```Near``` - Single pet taking up significant portion of photo (roughly over 50% of photo width or height).
> - ```Action``` - Pet in the middle of an action (e.g., jumping).
> - ```Accessory``` - Accompanying physical or digital accessory / prop (i.e. toy, digital sticker), excluding collar and leash.
> - ```Group``` - More than 1 pet in the photo.
> - ```Collage``` - Digitally-retouched photo (i.e. with digital photo frame, combination of multiple photos).
> - ```Human``` - Human in the photo.
> - ```Occlusion``` - Specific undesirable objects blocking part of the pet (i.e. human, cage or fence). Note that not all blocking objects are considered occlusion.
> - ```Info``` - Custom-added text or labels (i.e. pet name, description).
> - ```Blur``` - Noticeably out of focus or noisy, especially for the pet’s eyes and face. For Blur entries, “Eyes” column is always set to 0.
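
The rule that Blur entries always have “Eyes” set to 0 is easy to sanity-check before modelling. Below is a minimal sketch using only the standard library. The rows are made-up stand-ins for `train.csv`, and `blur_rule_violations` is a hypothetical helper name, not part of the competition code.

```python
# Hypothetical rows mimicking a few train.csv columns; values are made up.
rows = [
    {"Id": "a1", "Eyes": 1, "Face": 1, "Blur": 0},
    {"Id": "b2", "Eyes": 0, "Face": 1, "Blur": 1},
    {"Id": "c3", "Eyes": 0, "Face": 0, "Blur": 0},
]


def blur_rule_violations(records):
    """Return the Ids of rows where Blur == 1 but Eyes is not 0."""
    return [r["Id"] for r in records if r["Blur"] == 1 and r["Eyes"] != 0]


violations = blur_rule_violations(rows)
print("Rows violating the Blur/Eyes rule:", violations)
```

With the real data loaded as a pandas DataFrame, the same check could be written as `train[(train.Blur == 1) & (train.Eyes != 0)]`.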

In [None]:
!pip install --upgrade ngboost

import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

import os
import cv2
import math

from pathlib import Path
from tqdm import tqdm

import wandb

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

from scipy.stats import skew
from ngboost import NGBRegressor

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from pprint import pprint

import tensorflow.compat.v2 as tf
tf.enable_v2_behavior()

import tensorflow_probability as tfp
tfd = tfp.distributions

sns.reset_defaults()
#sns.set_style('whitegrid')
#sns.set_context('talk')
sns.set_context(context='talk',font_scale=0.7)


# Set Style
sns.set_style("white")
mpl.rcParams['xtick.labelsize'] = 16
mpl.rcParams['ytick.labelsize'] = 16
mpl.rcParams['axes.spines.left'] = False
mpl.rcParams['axes.spines.right'] = False
mpl.rcParams['axes.spines.top'] = False
plt.rcParams.update({'font.size': 17})


In [None]:
train=pd.read_csv('../input/petfinder-pawpularity-score/train.csv')
test=pd.read_csv('../input/petfinder-pawpularity-score/test.csv')

ROOT_PATH = Path('../input/petfinder-pawpularity-score')
TRAIN_IMGS_PATH = ROOT_PATH / 'train/'
columns = train.columns.tolist()
columns.insert(1, 'image')



<img src="https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67">

> I will be integrating W&B for visualizations and logging artifacts!
> 
> [PetFinder - Popularity Score Project on W&B Dashboard](https://wandb.ai/usharengaraju/Pawpularity)
> 
> - To get the API key, create an account on the [W&B website](https://wandb.ai/site).
> - Use Kaggle secrets to store API keys more securely.

In [None]:
try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    # The secret must be stored under the label "api_key" in Add-ons -> Secrets.
    secret_value_0 = user_secrets.get_secret("api_key")
    wandb.login(key=secret_value_0)
    anony = None
except Exception:
    # Fall back to an anonymous W&B run when the secret is unavailable.
    anony = "must"
    print('If you want to use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as api_key. \nGet your W&B access token from here: https://wandb.ai/authorize')
    
CONFIG = dict(competition = 'PetFinder',_wandb_kernel = 'tensorgirl')

In [None]:
feature_columns = ['Subject Focus', 'Eyes', 'Face', 'Near', 'Action', 'Accessory',
       'Group', 'Collage', 'Human', 'Occlusion', 'Info', 'Blur']
Y = train['Pawpularity']
train_features = train[feature_columns]
test_features = test[feature_columns]

negloglik = lambda y, rv_y: -rv_y.log_prob(y)

x = train_features.to_numpy("float64")
y = Y.to_numpy("float64")
test = test_features.to_numpy("float64")


# **<span style="color:#AB51E9;">Target Variable Distribution</span>**

In [None]:
plt.figure(figsize=(15, 7))
plt.subplot(121)
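# Left panel: kernel density estimate of the target.
sns.kdeplot(x=train.Pawpularity, color="#E4916C")
plt.subplot(122)
# Right panel: box plot of the target (horizontal, so both panels share the score axis).
sns.boxplot(x=train.Pawpularity, color="#BCE6EF")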

# **<span style="color:#AB51E9;">Distribution of Meta Features vs Target </span>**

In [None]:
# code copied from https://www.kaggle.com/aakashnain/which-features-to-use-and-why
features = train.columns[1:-1].tolist()
num_cols = 2
num_rows = len(features) // num_cols


fig, axs = plt.subplots(num_rows,
                        num_cols,
                        figsize=(20, 15),
                        sharex=False,
                        sharey=True
                       )

for i, feature in enumerate(features):
    _ = sns.kdeplot(data=train,
                 x="Pawpularity",                 
                 ax=axs[i // num_cols, i % num_cols],
                 hue=feature,
                 palette =sns.color_palette(["#E4916C", "#BCE6EF"])
                )
plt.show()

# **<span style="color:#AB51E9;">Frequency Distribution of Meta Features</span>**

In [None]:
df_train = train_features.melt(value_vars=feature_columns)
plt.figure(figsize = (15, 7))
sns.countplot(data=df_train, y="variable", hue="value" , palette =sns.color_palette(["#E4916C", "#BCE6EF"]))
plt.show()

# **<span style="color:#AB51E9;">Clustering</span>**

# **<span style="color:#e76f51;">t-SNE</span>**

In [None]:
tsne = TSNE(n_components=2, random_state=4)
t_train = tsne.fit_transform(train_features)
plt.figure(figsize=(15, 7))
sns.scatterplot(
    x=t_train[:, 0], y=t_train[:, 1],
    hue=train['Pawpularity'],
    alpha=0.3
)
plt.show()

# **<span style="color:#e76f51;">K-Means Clustering</span>**

In [None]:
t_train = pd.DataFrame(t_train, index=train.Id, columns=["c"+str(c) for c in range(2)])
km = KMeans(n_clusters=2, random_state= 4).fit(t_train)
y_km = km.predict(t_train)
t_train["cluster"] = y_km
plt.figure(figsize=(15, 7))
sns.scatterplot(
    data=t_train,
    x="c0", y="c1",
    hue="cluster",
    alpha=0.3,
    palette =sns.color_palette(["#E4916C", "#BCE6EF"])
)
plt.show()

# **<span style="color:#e76f51;">PCA</span>**

In [None]:
n_comp = 2
pca = PCA(n_components=n_comp, svd_solver='full', random_state=4)
X_pca = pca.fit_transform(train_features)
plt.figure(figsize=(15, 7))
sns.scatterplot(
    x=X_pca[:, 0], y=X_pca[:, 1],
    hue=train['Pawpularity'],
    alpha=0.3
)
plt.show()

# **<span style="color:#AB51E9;">Visualize Dataset Interactively using W&B Tables</span>**

It only requires 5 lines of extra code to get the power of W&B Tables. 

1. You first need to initialize a W&B run using `wandb.init` API. This step is common for any W&B Logging.
2. Create a `wandb.Table` object. Imagine this to be an empty Pandas Dataframe. 
3. Iterate through each row of the `train.csv` file and `add_data` to the `wandb.Table` object. Imagine this to be appending new rows to your Dataframe. 
4. Log the W&B Tables using `wandb.log` API. You will use this API to log almost anything to W&B.
5. In a Jupyter-like interactive session, you need to call `wandb.finish` to close the initialized W&B run. 

Source: content copied from Ayush's notebook.


In [None]:
# Initialize a W&B run to log images
run = wandb.init(project='Pawpularity', config=CONFIG, anonymous=anony) # W&B Code 1

data_at = wandb.Table(columns=columns) # W&B Code 2

for i in tqdm(range(len(train))):
    row = train.loc[i]
    img_id = row.Id

    data_at.add_data(img_id,                                            
                     wandb.Image(f'{TRAIN_IMGS_PATH}/{img_id}.jpg'),
                     *tuple(row.values[1:])) # W&B Code 3

wandb.log({'Raw Petfinder data': data_at}) # W&B Code 4
wandb.finish() # W&B Code 5

In [None]:
# This is just to display the W&B run page in this interactive session.
#from IPython import display

# we create an IFrame and set the width and height
#iF = display.IFrame(run.url, width=1080, height=720)
#iF

### [Check out the W&B Tables $\rightarrow$](https://wandb.ai/anony-mouse-139969/Pawpularity/runs/27ip81sk)

![img](https://i.imgur.com/cV9ycET.gif)

# **<span style="color:#AB51E9;">TensorFlow Probability</span>**  

[Source :](https://blog.tensorflow.org/2019/03/regression-with-probabilistic-layers-in.html)

Regression is one of the most basic techniques that a machine learning practitioner can apply to prediction problems. However, many analyses based on regression omit a proper quantification of the uncertainty in the predictions, owing partially to the degree of complexity required. To start to quantify the uncertainty, a particularly elegant way of posing the problem is to write the regression model as P(y | x, w), the probability distribution of labels (y), given the inputs (x) and some parameters (w). We can fit this model to the data by maximizing the probability of the labels, or equivalently, minimizing the negative log-likelihood loss: -log P(y | x). In Python:



In [None]:
negloglik = lambda y, rv_y: -rv_y.log_prob(y)

We can use a variety of standard continuous and categorical loss functions with this model of regression. Mean squared error loss for continuous labels, for example, means that P(y | x, w) is a normal distribution with a fixed scale (standard deviation). Cross-entropy loss for classification means that P(y | x, w) is the categorical distribution.
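
To make the link between mean squared error and a fixed-scale normal likelihood concrete, here is a small sketch in plain Python. The targets and predictions are made up for illustration, and `normal_nll` is a hypothetical helper, not a TensorFlow Probability function. With the scale fixed at 1, the mean negative log-likelihood equals half the MSE plus a constant, so minimizing one minimizes the other.

```python
import math


def normal_nll(y, loc, scale=1.0):
    """Negative log-density of y under Normal(loc, scale)."""
    return 0.5 * math.log(2 * math.pi * scale ** 2) + (y - loc) ** 2 / (2 * scale ** 2)


# Made-up targets and point predictions, purely for illustration.
targets = [30.0, 45.0, 12.0]
preds = [28.0, 50.0, 20.0]

mean_nll = sum(normal_nll(y, p) for y, p in zip(targets, preds)) / len(targets)
mse = sum((y - p) ** 2 for y, p in zip(targets, preds)) / len(targets)
const = 0.5 * math.log(2 * math.pi)

# With scale fixed at 1: mean NLL = 0.5 * MSE + constant.
print(mean_nll, 0.5 * mse + const)
```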

In [None]:
# Build model.
model = tf.keras.Sequential([
  tf.keras.layers.Dense(1),
  tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t, scale=1)),
])

# Do inference.
model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.01), loss=negloglik)
model.fit(x,y, epochs=1000, verbose=False);

# Profit: inspect the learned weights and predict on the test features.
for w in model.weights:
    print(np.squeeze(w.numpy()))
yhat = model(test)
assert isinstance(yhat, tfd.Distribution)

# **<span style="color:#AB51E9;">NGBoost: Natural Gradient Boosting for Probabilistic Prediction</span>**
[Website](https://stanfordmlgroup.github.io/projects/ngboost/)

[Paper](https://arxiv.org/pdf/1910.03225.pdf)

NGBoost generalizes gradient boosting to probabilistic regression by treating the parameters of the conditional distribution as targets for a multiparameter boosting algorithm. NGBoost matches or exceeds the performance of existing methods for probabilistic prediction while offering additional benefits in flexibility, scalability, and usability.


# **<span style="color:#e76f51;">Predictive Uncertainty Estimation in the real world.</span>**

Estimating the uncertainty in the predictions of a machine learning model is crucial for production deployments in the real world. In addition to making accurate predictions, we also want a correct estimate of uncertainty along with each prediction. When model predictions are part of an automated decision-making workflow or production line, predictive uncertainty estimates are important for determining manual fallback alternatives or for human inspection and intervention.

Probabilistic prediction (or probabilistic forecasting), which is the approach where the model outputs a full probability distribution over the entire outcome space, is a natural way to quantify those uncertainties.
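
As a rough illustration of what a full predictive distribution offers over a single number, the sketch below uses the standard library's `statistics.NormalDist`. The mean of 38 and standard deviation of 20 are made-up values, not model output. From the distribution we can read off a prediction interval and the probability of exceeding a threshold, neither of which a point prediction provides.

```python
from statistics import NormalDist

# A point prediction is a single number.
point_prediction = 38.0

# A probabilistic prediction is a whole distribution (hypothetical parameters).
predictive = NormalDist(mu=38.0, sigma=20.0)

# Central 90% prediction interval.
lower = predictive.inv_cdf(0.05)
upper = predictive.inv_cdf(0.95)

# Probability that the score exceeds 60.
p_above_60 = 1.0 - predictive.cdf(60.0)

print(f"point: {point_prediction}")
print(f"90% interval: [{lower:.1f}, {upper:.1f}]")
print(f"P(score > 60) = {p_above_60:.3f}")
```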

Comparison of point predictions and probabilistic predictions:

![](https://drive.google.com/uc?id=1gurE8gafuf5m9xM7Rg7zhbxT0_qdTHGN)

# **<span style="color:#e76f51;">NGBoost brings predictive uncertainty estimation to Gradient Boosting.</span>**

Gradient Boosting methods have generally been among the top performers in predictive accuracy over structured or tabular input data.

NGBoost enables predictive uncertainty estimation with Gradient Boosting through probabilistic predictions (including real valued outputs). With the use of Natural Gradients, NGBoost overcomes technical challenges that make generic probabilistic prediction hard with gradient boosting.

![](https://drive.google.com/uc?id=1v7ootml1mOBt32SjQPRo2UcrMWDIn7QF)

[Github](https://github.com/stanfordmlgroup/ngboost)

# **<span style="color:#e76f51;">Simple and modular approach.</span>**

The NGBoost algorithm is simple to use. It has three abstract modular components that are chosen as configuration:


**Base Learner**

The most common choice is Decision Trees, which tend to work well on structured inputs.

**Probability Distribution**

The distribution needs to be compatible with the output type: for example, a Normal distribution for real-valued outputs or a Bernoulli distribution for binary outputs.

**Scoring rule**

Maximum Likelihood Estimation is an obvious choice. More robust rules such as Continuous Ranked Probability Score are also suitable.

The above choices can be mixed and matched to be customized for the specific prediction problem at hand.
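
To give a feel for how the two scoring rules differ, here is a minimal sketch for a Normal predictive distribution, written with the standard library only. The function names are hypothetical and this is not NGBoost's own implementation. It uses the well-known closed form of the CRPS for a Normal distribution, and the example numbers are made up.

```python
import math
from statistics import NormalDist


def log_score(y, mu, sigma):
    """Negative log-likelihood of y under Normal(mu, sigma); lower is better."""
    return -math.log(NormalDist(mu, sigma).pdf(y))


def crps_normal(y, mu, sigma):
    """Closed-form CRPS of Normal(mu, sigma) at observation y; lower is better."""
    z = (y - mu) / sigma
    std = NormalDist()  # standard normal
    return sigma * (z * (2 * std.cdf(z) - 1) + 2 * std.pdf(z) - 1 / math.sqrt(math.pi))


# Hypothetical prediction Normal(38, 5) scored against a typical and an extreme outcome.
for y in (40.0, 100.0):
    print(y, round(log_score(y, 38.0, 5.0), 3), round(crps_normal(y, 38.0, 5.0), 3))
```

For the extreme outcome the log score grows roughly quadratically with the error while the CRPS grows roughly linearly, which is one reason CRPS is regarded as the more robust rule.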

![](https://drive.google.com/uc?id=1sgb1BJKIH8PN4NhuXQ1tcxkfO7T1RMOT)

# **<span style="color:#e76f51;">The natural gradient makes learning efficient and effective.</span>**

NGBoost's key innovation is in employing the natural gradient to perform gradient boosting by casting it as a problem of determining the parameters of a probability distribution.

Ordinary gradients can be highly unsuitable for learning multi-parameter probability distributions (such as the Normal distribution). The training dynamics with the use of natural gradients tend to be much more stable and result in a better fit, as seen in the probabilistic regression example above.
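
The following sketch illustrates the idea for a single observation under a Normal distribution parameterized by (mu, log sigma). It is a hand-derived illustration with hypothetical function names and made-up numbers, not NGBoost's internal code. For this parameterization the Fisher information is diag(1/sigma^2, 2), so the natural gradient is the ordinary gradient multiplied by its inverse.

```python
def nll_gradient(y, mu, sigma):
    """Ordinary gradient of the Normal NLL with respect to (mu, log sigma)."""
    d_mu = -(y - mu) / sigma ** 2
    d_log_sigma = 1.0 - (y - mu) ** 2 / sigma ** 2
    return d_mu, d_log_sigma


def natural_gradient(y, mu, sigma):
    """Ordinary gradient rescaled by the inverse Fisher information diag(1/sigma^2, 2)."""
    d_mu, d_log_sigma = nll_gradient(y, mu, sigma)
    return d_mu * sigma ** 2, d_log_sigma / 2.0


# Same residual (y - mu = 12), very different current scale estimates.
for sigma in (0.5, 20.0):
    print(sigma, nll_gradient(50.0, 38.0, sigma), natural_gradient(50.0, 38.0, sigma))
```

The ordinary gradient for mu changes by orders of magnitude as sigma changes, whereas the natural gradient for mu is simply the negative residual regardless of sigma. This invariance is one intuition for the more stable training dynamics.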

![](https://drive.google.com/uc?id=11okOcrrcFK8ErqqGchNL1xa9GSCt95mt)

# **<span style="color:#e76f51;">Competitive performance in both uncertainty estimates and traditional metrics.</span>**

NGBoost requires far less expertise to use than competing methods, and performs as well on common benchmarks. NGBoost has particularly strong performance on smaller data sets.

![](https://drive.google.com/uc?id=1vuz4OymodyDBxbxwT5sMyjYDBs429pNN)

In [None]:

X_train, X_test, Y_train, Y_test = train_test_split(train[feature_columns], Y, test_size=0.2)

ngb = NGBRegressor().fit(X_train, Y_train)
Y_preds = ngb.predict(X_test)
Y_dists = ngb.pred_dist(X_test)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_test, Y_preds)
print('Test MSE', test_MSE)



In [None]:
# Root mean squared error on the held-out split.
rmse = math.sqrt(test_MSE)
print('Test RMSE', rmse)

# **<span style="color:#AB51E9;">References</span>**

https://arxiv.org/pdf/1910.03225.pdf

https://stanfordmlgroup.github.io/projects/ngboost/

@kooose  https://www.kaggle.com/kooose/eda-by-t-sne

@aakashnain https://www.kaggle.com/aakashnain/which-features-to-use-and-why

@ayuraj Wandb Content

# Work in progress 🚧